Adam,

Your explanation needs a lot more detail as it raises many more questions than 
it answers.
I specifically did not use IRQ-based messaging because it does not provide the
handshaking that I need.
Sending a signal that a message is ready, without the ability to receive some
sort of acknowledgement event in return, would force the sender into a
painfully slow and inefficient polling loop. The ipc_call function is perfect
for this purpose: it not only provides the acknowledgement that the receiver
has processed the message, but can return a status as well. All event-driven,
with no polling and no delays.
The event-driven handshake has to exist so that the sender knows when it is
safe to begin sending the next message... how does an IRQ do this? It is only
a one-way signal. Your IRQ messaging example can only send one message and
then has to poll shared memory to know when the receiver has picked it up.
They all use the same underlying IPC functions, just with different kernel
object types, so I don't understand why an ipc_call would be slow and an IRQ
would be faster. In all cases, the return handshake is required to avoid
polling.
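
Just so we are talking about the same mechanism, my sender/receiver pair has
roughly the following shape (a minimal sketch using the l4_ipc_* C bindings;
the gate capability, the shared-memory pointers, and process_message() are
placeholders for whatever the real setup provides):

#include <string.h>
#include <l4/sys/ipc.h>

extern long process_message(void *msg);     /* hypothetical message handler */

/* Sender: put the payload into shared memory, then do a blocking call.
 * l4_ipc_call() returns only after the receiver has replied, so the sender
 * never polls and knows exactly when it may reuse the buffer. */
static long send_one(l4_cap_idx_t gate, void *shm_out,
                     const void *msg, unsigned len)
{
  memcpy(shm_out, msg, len);              /* payload goes via shared memory */
  l4_msgtag_t tag = l4_ipc_call(gate, l4_utcb(),
                                l4_msgtag(0, 0, 0, 0), L4_IPC_NEVER);
  if (l4_ipc_error(tag, l4_utcb()))
    return -1;
  return l4_msgtag_label(tag);            /* receiver's status from the reply */
}

/* Receiver: wait for a call, handle the message from shared memory, then
 * reply with a status and wait for the next call in one kernel entry. */
static void receive_loop(void *shm_in)
{
  l4_umword_t label;
  l4_ipc_wait(l4_utcb(), &label, L4_IPC_NEVER);
  for (;;)
    {
      long status = process_message(shm_in);
      l4_ipc_reply_and_wait(l4_utcb(), l4_msgtag(status, 0, 0, 0),
                            &label, L4_IPC_NEVER);
    }
}

That is the entire handshake; there is nothing left for the sender to poll on.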

Your comment not to use malloc is extremely confusing. I've also seen your
response that a lot of small malloc/free calls will slow down the kernel. That
just can't be correct. Malloc is one of the most used and abused calls in the
entire C library. If it is not extremely fast and efficient, then something is
seriously wrong with the underlying software. Please confirm whether this is
really the case, because if it is, I will have to allocate a few megabytes up
front in a large buffer and port over my own malloc to point at it.
Again, this just doesn't make sense. Can I not assign an individual heap to
each process? The kernel should only hold a mapping for the large heap space,
not for each individual small buffer that gets malloc'ed. The kernel should
not even be involved in a malloc at all.
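
If that really is the situation, the fallback I have in mind is on the order
of the following (only a sketch: a few megabytes reserved inside the process,
with a trivial bump allocator standing in for the real malloc implementation
I would port over; the size and names are made up):

#include <stddef.h>
#include <stdint.h>

/* A few megabytes reserved up front, entirely inside the process. */
#define POOL_SIZE (4u * 1024u * 1024u)
static uint8_t pool[POOL_SIZE];
static size_t  pool_off;

/* Trivial bump allocator over the pre-allocated pool; a ported malloc would
 * manage free lists here instead. The kernel only ever sees the mapping for
 * the pool itself, never the individual small allocations. */
static void *pool_alloc(size_t n)
{
  size_t aligned = (n + 15u) & ~(size_t)15u;          /* 16-byte alignment */
  if (pool_off + aligned > POOL_SIZE)
    return NULL;
  void *p = &pool[pool_off];
  pool_off += aligned;
  return p;
}

But I would much rather hear that this is unnecessary.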

I do need to benchmark my message passing exactly as is, with the malloc and
free calls, the signals, waits, and locks, and all the rest. I am not
interested in individual component performance; I need to know the performance
when everything is put together in exactly the form in which it will be used.
If 3 or 4 messages per millisecond is real, then something needs to get
redesigned and fixed. I can't use it at that speed.
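
For what it's worth, the measurement itself is nothing more elaborate than
this (a sketch; send_message() stands for the complete, unmodified send path
with malloc, locks, condition variables, and ipc_call, and I'm assuming the
libc clock_gettime() is available under L4Re):

#include <stdio.h>
#include <time.h>

extern void send_message(const void *msg, unsigned len); /* full real path */

/* Time n complete round trips through the unmodified path and report the
 * rate in messages per millisecond. */
static void bench(unsigned n)
{
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (unsigned i = 0; i < n; ++i)
    send_message("x", 1);
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double ms = (t1.tv_sec - t0.tv_sec) * 1e3
            + (t1.tv_nsec - t0.tv_nsec) / 1e6;
  printf("%u messages in %.2f ms -> %.1f msgs/ms\n", n, ms, n / ms);
}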

Our applications involve communications and message passing. They are servers 
that run forever, not little
web applications. We need to process hundreds of messages per millisecond, not 
single digits. So this is a
huge concern for me.

I'll go break things up to find the slow parts and test them one at a time,
but your help in identifying more possible issues would be greatly
appreciated.
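
Concretely, I'll put each suspect into the same small timing harness and
compare, something like this (again just a sketch, with the same
clock_gettime() assumption as above; a malloc/free pair of roughly my message
size is the first candidate, and the bare ipc_call round trip and the
condition_signal/condition_wait pair would get the same treatment):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average cost of one operation, measured by repeating it many times. */
static double usecs_per_op(void (*op)(void), unsigned iters)
{
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (unsigned i = 0; i < iters; ++i)
    op();
  clock_gettime(CLOCK_MONOTONIC, &t1);
  return ((t1.tv_sec - t0.tv_sec) * 1e6
        + (t1.tv_nsec - t0.tv_nsec) / 1e3) / iters;
}

/* First candidate: a malloc/free pair of about the message size. */
static void op_malloc_free(void) { free(malloc(50)); }

int main(void)
{
  printf("malloc/free(50): %.3f us/op\n",
         usecs_per_op(op_malloc_free, 100000));
  return 0;
}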



Thanks!

Richard


-----Original Message-----
From: Adam Lackorzynski <[email protected]> 
Sent: Monday, November 11, 2024 5:29 AM
To: Richard Clark <[email protected]>; 
[email protected]
Subject: Re: Throughput questions....

Hi Richard,

for using shared memory based communication I'd like to suggest to use L4::Irqs 
instead of IPC messages, especially ipc-calls which have a back and forth. 
Please also do not use malloc within a benchmark (or benchmark malloc 
separately to get an understanding how the share between L4 ops and libc is 
split). On QEMU it should be ok when running with KVM, less so without KVM.

I do not have a recommendation for an AMD-based laptop.


Cheers,
Adam

On Thu Nov 07, 2024 at 13:36:06 +0000, Richard Clark wrote:
> Dear L4Re experts,
> 
> We now have a couple projects in which we are going to be utilizing 
> your OS, so I've been implementing and testing some of the basic 
> functionality that we will need. Namely that would be message passing....
> I've been using the Hello World QEMU example as my starting point and 
> have created a number of processes that communicate via a pair of 
> unidirectional channels with IPC and shared memory. One channel for 
> messages coming in, one channel for messages going out. The sender 
> does an IPC_CALL() when a message has been put into shared memory. The 
> receiver completes an IPC_RECEIVE(), fetches the message, and then responds 
> with the IPC_REPLY() to the original IPC_CALL(). It is all interrupt/event 
> driven, no sleeping, no polling.
> It works. I've tested it for robustness and it behaves exactly as expected, 
> with the exception of throughput.
> 
> I seem to be getting only 4000 messages per second. Or roughly 4 
> messages per millisecond. Now there are a couple malloc() and free() 
> and condition_wait() and condition_signal()s going on as the events and 
> messages get passed through the sender and receiver threads, but nothing 
> (IMHO) that should slow things down too much.
> Messages are very small, like 50 bytes, as I'm really just trying to 
> get a handle on basic overhead. So pretty much, yes, I'm beating the 
> context-switching mechanisms to death...
> 
> My questions:
> Is this normal(ish) throughput for a single-core x86_64 QEMU system?
> Am I getting hit by a time-sliced scheduler issue and most of my CPU is being 
> wasted?
> How do I switch to a different non-time-sliced scheduler?
> Thoughts on what I could try to improve throughput?
> 
> And lastly...
> We are going to be signing up for training soon... do you have a 
> recommendation for a big beefy AMD-based linux laptop?
> 
> 
> Thanks!
> 
> Richard H. Clark
