Hi,

On Mon Nov 11, 2024 at 13:15:52 +0000, Richard Clark wrote:
> Your explanation needs a lot more detail as it raises many more questions 
> than it answers.
> I specifically did not use irq-based messaging because it does not provide 
> the handshaking that I need.
> Sending a signal that a message is ready, without the ability to receive some 
> sort of acknowledgement event
> in return would force the sender into a painfully slow and inefficient 
> polling loop. The ipc_call
> function is perfect for this purpose as it not only provides the 
> acknowledgement that the receiver
> has processed the message, but can return a status as well. All event-driven 
> with no polling and no delays.
> The event-driven handshake has to exist so that the sender knows when it is 
> safe to begin
> sending the next message... how does an irq do this? It is only a one-way 
> signal. Your irq messaging
> example can only send one message and then has to poll shared memory to know 
> when the receiver
> has gotten it. They all use the same underlying ipc functions, just with 
> different kernel object types, so I
> don't understand why an ipc_call would be slow and an irq would be faster. In 
> all cases, the
> return handshake is required to avoid polling.

I don't know the details of your mechanism; I just know that
shared-memory communication can work with a shared memory region plus a
notification. Virtio, for example, uses exactly this: notifications
(IRQs) are sent in both directions. This is a rather asynchronous
model; other use cases might need other approaches.
Of course, polling should not be used, except when a thread sits alone
on a core that is dedicated specifically to that purpose.
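
To make the handshake concrete: the acknowledgement does not have to be
the reply of an ipc_call, it can simply be a second IRQ going the other
way, combined with the consumer index in shared memory. Below is a
minimal sketch of such a virtio-style ring. The capability names and
the use of l4_irq_trigger()/l4_irq_receive() from <l4/sys/irq.h> are
assumptions on my side; the IRQ receive interface has changed over
time, so please check the headers of your L4Re version for the exact
call (it may be l4_irq_wait() there), and note that binding/attaching
each IRQ to its receiving thread is omitted here.

/* Sketch only: shared-memory ring with one notification (IRQ) per
 * direction. The sender blocks only when the ring is full; the
 * receiver drains a whole batch per wakeup and kicks the sender back
 * once. No polling loop, no per-message call/reply round trip. */
#include <stdatomic.h>
#include <string.h>
#include <l4/sys/irq.h>
#include <l4/sys/ipc.h>

enum { RING_SLOTS = 64, MSG_SIZE = 64 };

struct ring {
  _Atomic unsigned head;              /* written by producer only  */
  _Atomic unsigned tail;              /* written by consumer only  */
  char slot[RING_SLOTS][MSG_SIZE];    /* message payload           */
};

/* Producer: copy a message in, publish it, notify the receiver. */
static void send_msg(struct ring *r, l4_cap_idx_t peer_irq,
                     l4_cap_idx_t my_irq, const char *msg, unsigned len)
{
  unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);

  /* Event-driven wait until a slot is free: the consumer's back-IRQ
   * wakes us up; a trigger arriving early stays pending.            */
  while (head - atomic_load_explicit(&r->tail, memory_order_acquire)
         >= RING_SLOTS)
    l4_irq_receive(my_irq, L4_IPC_NEVER);

  memcpy(r->slot[head % RING_SLOTS], msg, len);
  atomic_store_explicit(&r->head, head + 1, memory_order_release);
  l4_irq_trigger(peer_irq);           /* "message ready" notification */
}

/* Consumer: wait for the notification, drain the ring, kick back. */
static void recv_loop(struct ring *r, l4_cap_idx_t my_irq,
                      l4_cap_idx_t peer_irq)
{
  for (;;) {
    l4_irq_receive(my_irq, L4_IPC_NEVER);
    while (atomic_load_explicit(&r->tail, memory_order_relaxed)
           != atomic_load_explicit(&r->head, memory_order_acquire)) {
      unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
      /* ... process r->slot[tail % RING_SLOTS] ... */
      atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    }
    l4_irq_trigger(peer_irq);         /* ack / "space available"      */
  }
}

The sender knows it is safe to send the next message from the indices
in shared memory; it only has to wait when the ring is actually full,
and the receiver sends one notification per batch rather than one reply
per message.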

> Your comment to not use malloc is extremely confusing. I've also seen your 
> response that using a lot
> of small malloc/free calls will slow down the kernel. That just can't be 
> correct. Malloc is one of the most
> used and abused calls in the entire C library. If it is not extremely fast 
> and efficient, then something
> is seriously wrong with the underlying software. Please confirm that this is 
> the case. Because if true,
> then I will have to allocate a few megabytes up front in a large buffer and 
> port over my own malloc to point to it.
> Again, this just doesn't make sense. Can I not assign an individual heap to 
> each process? The kernel should
> only hold a map to the large heap space, not each individual small buffer 
> that gets malloc'ed. The kernel
> should not even be involved in a malloc at all. 

Right, the kernel has no business with malloc and free (apart from the
low-level mechanism of providing memory pages to the process). malloc
and free are a pure user-level implementation that works on a chunk of
memory. The malloc in use is the one from uClibc, and it is only as
fast as that implementation is.
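
So if malloc/free shows up in your per-message path, one option that
does not require porting your own allocator is to take the message
buffers from a small pre-allocated pool. A minimal sketch, plain C with
nothing L4Re-specific assumed, and not thread-safe as written (add a
lock or per-thread pools if several senders share it):

#include <stddef.h>

enum { POOL_MSGS = 1024, MSG_BYTES = 64 };

struct msg_buf { struct msg_buf *next; char data[MSG_BYTES]; };

static struct msg_buf pool[POOL_MSGS];
static struct msg_buf *free_list;

static void pool_init(void)
{
  for (size_t i = 0; i < POOL_MSGS; ++i) {
    pool[i].next = free_list;
    free_list = &pool[i];
  }
}

/* O(1) allocation, no syscalls; returns NULL when the pool is empty. */
static struct msg_buf *msg_alloc(void)
{
  struct msg_buf *b = free_list;
  if (b)
    free_list = b->next;
  return b;
}

static void msg_free(struct msg_buf *b)
{
  b->next = free_list;
  free_list = b;
}

That keeps the hot path to a pointer swap and takes uClibc's allocator
out of the per-message picture entirely.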

> I do need to benchmark my message-passing exactly as is, with malloc and 
> free, and signals and waits and locks and all.
> I am not interested in individual component performance, but need to know the 
> performance when it is
> all put together in exactly the form that it will be used. If 3 or 4 messages 
> per millisecond is real, then something
> needs to get redesigned and fixed. I can't use it at that speed. 

Sure, in the end you need the overall performance; however, to
understand what's going on, looking at the individual phases can be a
good thing.
What do you do with the signals, waits, and locks?
Is your communication within one process or among multiple processes,
or a mix of both?
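
For the phase-wise numbers, wrapping each phase in a timer over many
iterations is usually enough. A sketch using plain clock_gettime(),
assuming uClibc on L4Re provides CLOCK_MONOTONIC (under QEMU the clock
can be coarse, so use a large iteration count):

#include <stdio.h>
#include <time.h>

static inline double now_us(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

/* Time one phase in isolation: malloc/free only, lock/condvar
 * signalling only, bare IPC only, etc.                            */
static void bench_phase(const char *name, void (*phase)(void), int n)
{
  double t0 = now_us();
  for (int i = 0; i < n; ++i)
    phase();
  printf("%s: %.3f us/iteration\n", name, (now_us() - t0) / n);
}

Running that separately for the allocation, the lock/condvar signalling
and the bare IPC should show quickly which phase eats the budget per
message.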

> Our applications involve communications and message passing. They are servers 
> that run forever, not little
> web applications. We need to process hundreds of messages per millisecond, 
> not single digits. So this is a
> huge concern for me.

Understood.

> I'll go break things up to find the slow parts, to test them one at a time, 
> but your help in identifying more possible 
> issues would be greatly appreciated.

Thanks, will do my best.



Adam

> -----Original Message-----
> From: Adam Lackorzynski <[email protected]> 
> Sent: Monday, November 11, 2024 5:29 AM
> To: Richard Clark <[email protected]>; 
> [email protected]
> Subject: Re: Throughput questions....
> 
> Hi Richard,
> 
> for using shared memory based communication I'd like to suggest to use 
> L4::Irqs instead of IPC messages, especially ipc-calls which have a back and 
> forth. Please also do not use malloc within a benchmark (or benchmark malloc 
> separately to get an understanding how the share between L4 ops and libc is 
> split). On QEMU it should be ok when running with KVM, less so without KVM.
> 
> I do not have a recommendation for an AMD-based laptop.
> 
> 
> Cheers,
> Adam
> 
> On Thu Nov 07, 2024 at 13:36:06 +0000, Richard Clark wrote:
> > Dear L4Re experts,
> > 
> > We now have a couple projects in which we are going to be utilizing 
> > your OS, so I've been implementing and testing some of the basic 
> > functionality that we will need. Namely that would be message passing....
> > I've been using the Hello World QEMU example as my starting point and 
> > have created a number of processes that communicate via a pair of 
> > unidirectional channels with IPC and shared memory. One channel for 
> > messages coming in, one channel for messages going out. The sender 
> > does an IPC_CALL() when a message has been put into shared memory. The 
> > receiver completes an IPC_RECEIVE(), fetches the message, and then responds 
> > with the IPC_REPLY() to the original IPC_CALL(). It is all interrupt/event 
> > driven, no sleeping, no polling.
> > It works. I've tested it for robustness and it behaves exactly as expected, 
> > with the exception of throughput.
> > 
> > I seem to be getting only 4000 messages per second. Or roughly 4 
> > messages per millisecond. Now there are a couple malloc() and free() 
> > and condition_wait() and condition_signal()s going on as the events and 
> > messages get passed through the sender and receiver threads, but nothing 
> > (IMHO) that should slow things down too much.
> > Messages are very small, like 50 bytes, as I'm really just trying to 
> > get a handle on basic overhead. So pretty much, yes, I'm beating the 
> > context-switching mechanisms to death...
> > 
> > My questions:
> > Is this normal(ish) throughput for a single-core x86_64 QEMU system?
> > Am I getting hit by a time-sliced scheduler issue and most of my CPU is 
> > being wasted?
> > How do I switch to a different non-time-sliced scheduler?
> > Thoughts on what I could try to improve throughput?
> > 
> > And lastly...
> > We are going to be signing up for training soon... do you have a 
> > recommendation for a big beefy AMD-based linux laptop?
> > 
> > 
> > Thanks!
> > 
> > Richard H. Clark

Adam
-- 
Adam                 [email protected]
  Lackorzynski         http://os.inf.tu-dresden.de/~adam/
_______________________________________________
l4-hackers mailing list -- [email protected]
To unsubscribe send an email to [email protected]
