Hi,

On Mon Nov 11, 2024 at 13:15:52 +0000, Richard Clark wrote:
> Your explanation needs a lot more detail, as it raises many more questions
> than it answers.
> I specifically did not use irq-based messaging because it does not provide
> the handshaking that I need. Sending a signal that a message is ready,
> without the ability to receive some sort of acknowledgement event in
> return, would force the sender into a painfully slow and inefficient
> polling loop. The ipc_call function is perfect for this purpose, as it not
> only provides the acknowledgement that the receiver has processed the
> message, but can return a status as well. All event-driven, with no polling
> and no delays.
> The event-driven handshake has to exist so that the sender knows when it is
> safe to begin sending the next message... how does an irq do this? It is
> only a one-way signal. Your irq messaging example can only send one message
> and then has to poll shared memory to know when the receiver has gotten it.
> They all use the same underlying ipc functions, just with different kernel
> object types, so I don't understand why an ipc_call would be slow and an
> irq would be faster. In all cases, the return handshake is required to
> avoid polling.

I don't know the details of your mechanism; I just know that communication
can work well with shared memory plus a notification. For example, Virtio
uses exactly this: notifications (Irqs) are sent in both directions. This is
a rather asynchronous model. Other use-cases might need other ways of doing
it. Of course, polling should not be used, except when a thread sits alone
on a core that is dedicated specifically to that.
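To make that concrete, here is a minimal sketch of such a scheme (just an
illustration, not code from our tree): one single-producer/single-consumer
ring per direction in the shared memory, plus a notification whose only job
is to wake the other side. The notify_peer()/wait_for_peer() hooks are
placeholders for whatever signalling primitive you pick (e.g. triggering an
Irq and blocking on it); the sketch assumes a notification that fires while
nobody is waiting stays pending instead of being lost.

/* spsc_ring.c -- sketch of a shared-memory ring with a wake-up notification.
 * One ring per direction; the sender never waits for a per-message reply,
 * the receiver drains whole batches and only sleeps when the ring is empty.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

enum { RING_SLOTS = 64, MSG_SIZE = 64 };   /* RING_SLOTS must be a power of two */

struct ring {
    _Atomic uint32_t head;                 /* written only by the producer */
    _Atomic uint32_t tail;                 /* written only by the consumer */
    uint8_t slot[RING_SLOTS][MSG_SIZE];
};

/* Placeholders: hook these up to your notification mechanism. They must be
 * "sticky", i.e. a notification sent while nobody is waiting is delivered on
 * the next wait. */
void notify_peer(void);
void wait_for_peer(void);

/* Producer side: returns 0 on success, -1 if the ring is currently full. */
int ring_send(struct ring *r, const void *msg, size_t len)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SLOTS)
        return -1;                          /* full; caller may wait_for_peer() */

    memcpy(r->slot[head % RING_SLOTS], msg, len < MSG_SIZE ? len : MSG_SIZE);
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    notify_peer();                          /* real Virtio also suppresses
                                               notifications; that needs more care */
    return 0;
}

/* Consumer side: drain everything available, sleep only when empty. */
void ring_recv_loop(struct ring *r, void (*handle)(const void *msg))
{
    for (;;) {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

        if (tail == head) {                 /* empty: block until notified */
            wait_for_peer();
            continue;
        }
        while (tail != head)                /* batch: no handshake per message */
            handle(r->slot[tail++ % RING_SLOTS]);
        atomic_store_explicit(&r->tail, tail, memory_order_release);
    }
}

Note that the sender only blocks when the ring is full and the receiver only
when it is empty, so throughput comes from batching rather than from one
call/reply handshake per message, while a full ring still gives you the
"not safe to send yet" back-pressure you are after.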
> Your comment to not use malloc is extremely confusing. I've also seen your
> response that using a lot of small malloc/free calls will slow down the
> kernel. That just can't be correct. Malloc is one of the most used and
> abused calls in the entire C library. If it is not extremely fast and
> efficient, then something is seriously wrong with the underlying software.
> Please confirm that this is the case. Because if true, then I will have to
> allocate a few megabytes up front in a large buffer and port over my own
> malloc to point to it.
> Again, this just doesn't make sense. Can I not assign an individual heap to
> each process? The kernel should only hold a map to the large heap space,
> not each individual small buffer that gets malloc'ed. The kernel should not
> even be involved in a malloc at all.

Right, the kernel has no business with malloc and free (apart from the
low-level mechanism of providing memory pages to the process). Malloc and
free are a pure user-level implementation working on a chunk of memory. The
malloc is the one from uclibc, and it is as fast as it is.

> I do need to benchmark my message-passing exactly as is, with malloc and
> free, and signals and waits and locks and all. I am not interested in
> individual component performance, but need to know the performance when it
> is all put together in exactly the form that it will be used. If 3 or 4
> messages per millisecond is real, then something needs to get redesigned
> and fixed. I can't use it at that speed.

Sure, you need the overall performance; however, to understand what is going
on, looking into the individual phases can be a good thing. What do you do
with the signals, waits and locks? Is your communication within one process
or among multiple processes? Or a mix of it?
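When you break things up, something as small as the following is usually
enough to see where the roughly 250 µs per message go (4 messages per
millisecond is about 250 µs each). It is plain libc, nothing L4-specific;
plug your own phases in as callbacks (one ipc_call round trip against an
echo server, one condition_signal/condition_wait handover, and so on). I am
assuming clock_gettime(CLOCK_MONOTONIC) is good enough in your setup; if
not, substitute a cycle counter.

/* phase_bench.c -- time the individual phases of the message path
 * separately, then compare their sum against the end-to-end cost per
 * message.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <pthread.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Run fn() iters times and print the average cost of one call. */
static void bench(const char *name, void (*fn)(void), unsigned iters)
{
    uint64_t t0 = now_ns();
    for (unsigned i = 0; i < iters; ++i)
        fn();
    uint64_t t1 = now_ns();
    printf("%-22s %8.0f ns/op  (%u ops)\n",
           name, (double)(t1 - t0) / iters, iters);
}

/* Example phases -- replace or extend with your own. */
static void phase_malloc_free(void) { free(malloc(50)); }

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static void phase_lock_unlock(void)
{
    pthread_mutex_lock(&mtx);
    pthread_mutex_unlock(&mtx);
}

int main(void)
{
    bench("malloc/free, 50 bytes", phase_malloc_free, 100000);
    bench("mutex lock/unlock",     phase_lock_unlock, 100000);
    /* bench("ipc_call round trip", phase_ipc_call,   100000);  -- yours */
    return 0;
}

If the per-phase costs add up to far less than 250 µs, the time is most
likely going into the blocking and scheduling between your threads rather
than into the individual operations.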
> Our applications involve communications and message passing. They are
> servers that run forever, not little web applications. We need to process
> hundreds of messages per millisecond, not single digits. So this is a huge
> concern for me.

Understood.

> I'll go break things up to find the slow parts, to test them one at a time,
> but your help in identifying more possible issues would be greatly
> appreciated.

Thanks, will do my best.


Adam

> -----Original Message-----
> From: Adam Lackorzynski <[email protected]>
> Sent: Monday, November 11, 2024 5:29 AM
> To: Richard Clark <[email protected]>; [email protected]
> Subject: Re: Throughput questions....
>
> Hi Richard,
>
> for using shared memory based communication I'd like to suggest to use
> L4::Irqs instead of IPC messages, especially ipc-calls which have a back
> and forth. Please also do not use malloc within a benchmark (or benchmark
> malloc separately to get an understanding of how the share between L4 ops
> and libc is split). On QEMU it should be ok when running with KVM, less so
> without KVM.
>
> I do not have a recommendation for an AMD-based laptop.
>
> Cheers,
> Adam
>
> On Thu Nov 07, 2024 at 13:36:06 +0000, Richard Clark wrote:
> > Dear L4Re experts,
> >
> > We now have a couple of projects in which we are going to be utilizing
> > your OS, so I've been implementing and testing some of the basic
> > functionality that we will need. Namely, that would be message passing...
> > I've been using the Hello World QEMU example as my starting point and
> > have created a number of processes that communicate via a pair of
> > unidirectional channels with IPC and shared memory. One channel for
> > messages coming in, one channel for messages going out. The sender does
> > an IPC_CALL() when a message has been put into shared memory. The
> > receiver completes an IPC_RECEIVE(), fetches the message, and then
> > responds with the IPC_REPLY() to the original IPC_CALL(). It is all
> > interrupt/event driven, no sleeping, no polling. It works. I've tested it
> > for robustness and it behaves exactly as expected, with the exception of
> > throughput.
> >
> > I seem to be getting only 4000 messages per second, or roughly 4 messages
> > per millisecond. Now there are a couple of malloc() and free() and
> > condition_wait() and condition_signal() calls going on as the events and
> > messages get passed through the sender and receiver threads, but nothing
> > (IMHO) that should slow things down too much.
> > Messages are very small, like 50 bytes, as I'm really just trying to get
> > a handle on basic overhead. So pretty much, yes, I'm beating the
> > context-switching mechanisms to death...
> >
> > My questions:
> > Is this normal(ish) throughput for a single-core x86_64 QEMU system?
> > Am I getting hit by a time-sliced scheduler issue and most of my CPU is
> > being wasted?
> > How do I switch to a different non-time-sliced scheduler?
> > Thoughts on what I could try to improve throughput?
> >
> > And lastly...
> > We are going to be signing up for training soon... do you have a
> > recommendation for a big beefy AMD-based linux laptop?
> >
> >
> > Thanks!
> >
> > Richard H. Clark

Adam

-- 
Adam                 [email protected]
Lackorzynski         http://os.inf.tu-dresden.de/~adam/

_______________________________________________
l4-hackers mailing list -- [email protected]
To unsubscribe send an email to [email protected]
