Re: [beagleboard] How to reliably push data from ARM host to PRU (shared) memory with predictable (low) latency?

William Hermans Thu, 23 Mar 2017 12:33:03 -0700

On Thu, Mar 23, 2017 at 5:48 AM, ags <alfred.g.schm...@gmail.com> wrote:


> OK, I will use the busy-wait loop w/ usleep and test. The reason I used
> select was I thought it would allow me to do other things (I need to have
> another process, thread, or loop in this same application serving out audio
> data to another client, synchronized with this data). My understanding was
> that the process blocking on select() to return would free the CPU for
> other things, but allow a quick wake-up to refresh the buffer as needed.
>

I thought that select() and the like should work too, initially. But you
have to remember, we're talking about an OS here that has an "expected"
latency of 100ms or more, depending on conditions. This is easy to check
for yourself. One of the easiest tests would be to run a loop for 10,000
iterations, once using select() and once using a busy-wait loop, then run
the command-line tool time on each to see the difference. This of course
is not a super accurate test, but it should be good enough to show a huge
difference in executable completion time. *If* you're more of the
scientific type, then get the system time in your test app before and
after the test code, and output the difference between those two times.
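
A minimal sketch of the "scientific" version of that test, in C. The
function names, the wait length, and the choice to call select() with a
timeout and no descriptors are assumptions made for illustration; the
original test app may have waited on a real file descriptor instead.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>
#include <sys/select.h>

/* Monotonic clock reading in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Wait `iterations` times using select() as a pure timeout.
 * Returns the total elapsed time in nanoseconds. */
static long long time_select_loop(int iterations, long wait_us)
{
    assert(iterations > 0 && wait_us > 0 && wait_us < 1000000);
    long long start = now_ns();
    for (int i = 0; i < iterations; i++) {
        /* Linux may modify the timeval, so rebuild it every pass. */
        struct timeval tv = { .tv_sec = 0, .tv_usec = wait_us };
        select(0, NULL, NULL, NULL, &tv);
    }
    return now_ns() - start;
}

/* Wait `iterations` times using usleep(), as in a polling loop.
 * Returns the total elapsed time in nanoseconds. */
static long long time_usleep_loop(int iterations, long wait_us)
{
    assert(iterations > 0 && wait_us > 0 && wait_us < 1000000);
    long long start = now_ns();
    for (int i = 0; i < iterations; i++)
        usleep((useconds_t)wait_us);
    return now_ns() - start;
}
```

Calling both functions with 10,000 iterations and a 1000us wait from
main(), and printing the two totals next to the nominal 10 seconds, shows
how far each approach overshoots on a given kernel.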

Anyway, using an RT kernel or a Xenomai kernel may improve this latency
*some*, but it is said that this comes at the expense of *some* other
performance aspects of the OS. I've not actually tested that myself, only
read about it.

>
> BTW, I have only mentioned the problems - but it does *almost* work. In
> my tests, I ran 12,500 4KiB buffers from ARM to PRU and measured (on the
> PRU side, using the precise CYCLE counter) to see if the PRU ever had to
> wait for the next buffer fill. Turns out that the PRU had to wait about 180
> times, or about 1.5% of the buffer fill events. The worst-case wait (stall)
> time was ~5 milliseconds.
>

One has to be very careful about what they use in code when writing an
executable that requires some degree of determinism from userspace. I
cannot recall the articles I've read in the past that led me to understand
all this, but they're out there. Pretty much anything that is a system
call will incur a latency penalty, because one ends up switching processor
context from userspace to kernelspace and back to userspace. This in and
of itself may not be too bad, but any data the call needs ends up being
copied back and forth as well. In these cases you can incur huge latency
spikes that you may not have anticipated.
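
One rough way to see that penalty, and the spikes, is to time a trivial
system call many times and keep the minimum, maximum, and total. This is
only a sketch: the struct and function names are made up for the example,
and the clock reads add overhead of their own, so the numbers are best
treated as relative rather than absolute.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical result record for this example. */
struct call_stats {
    long long min_ns;
    long long max_ns;
    long long total_ns;
};

/* Monotonic clock reading in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Time `iterations` round trips into the kernel and back.
 * getppid does almost no work, so what is measured is mostly the
 * userspace -> kernelspace -> userspace transition plus timer overhead. */
static struct call_stats measure_syscall(int iterations)
{
    assert(iterations > 0);
    struct call_stats s = { .min_ns = LLONG_MAX, .max_ns = 0, .total_ns = 0 };
    for (int i = 0; i < iterations; i++) {
        long long t0 = now_ns();
        syscall(SYS_getppid);            /* raw call, bypasses libc caching */
        long long dt = now_ns() - t0;
        if (dt < s.min_ns) s.min_ns = dt;
        if (dt > s.max_ns) s.max_ns = dt;
        s.total_ns += dt;
    }
    return s;
}
```

On a loaded system the interesting number is usually max_ns compared with
total_ns / iterations: the average tends to look harmless while the
maximum shows the occasional stall.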

Personally, I've run into this problem a couple of times on two different
projects. So my style of coding is to just get something working first,
then refactor the code to perform to my expectations. Basically, I start
with really "simple" stuff like printf(), select(), etc., then refactor
those out when / if needed. Many times it's not needed, but when it is,
one should understand the consequences of using such function calls in an
executable. That way, one has at least a rough idea where to start with
"trimming the fat". Everyone falls into this "trap" at least once or twice
when entering the embedded arena.

My understanding of calls like select() is that when they're used, you're
yielding the processor back to the system, with the "promise" that
eventually the system will notify you when something related to that call
has changed. With a busy-wait loop, you're defining the time period for
which the processor is yielded back to the system. In the case of my
example, approximately 1ms. Just be aware that with any non-real-time OS,
intervals much shorter than 1ms will yield varying results, e.g. the
system will (may) not be able to keep up with your code. If your code is
super efficient, you can potentially get hundreds of thousands of
iterations. This is of course not guaranteed, but I've done it personally
with the ADC, so I do know it is possible. At this performance level,
you're almost certainly using mmap(). You're almost certainly using a lot
of processor time as well: 80% or more.
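
A sketch of what such an mmap() plus polling loop could look like in C.
The helper names and the flag-word convention are invented for this
example. On the BeagleBone the path would typically be /dev/mem with the
offset set to the PRU shared RAM physical address (0x4A310000 on the
AM335x as far as I know, but check that against the TRM), and that needs
root.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `len` bytes of `path` starting at page-aligned `offset`.
 * Returns NULL on failure. */
static volatile uint32_t *map_words(const char *path, off_t offset, size_t len)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    close(fd);                       /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* Poll *flag until it equals `want`, sleeping `poll_us` between checks.
 * Returns the number of sleeps that were needed, or -1 if the flag did
 * not change within `max_polls` checks. */
static int wait_for_flag(volatile uint32_t *flag, uint32_t want,
                         unsigned poll_us, int max_polls)
{
    assert(flag != NULL && max_polls > 0);
    for (int i = 0; i < max_polls; i++) {
        if (*flag == want)           /* plain memory read, no system call */
            return i;
        usleep(poll_us);             /* yield for roughly poll_us */
    }
    return -1;
}
```

With poll_us at 1000 this is the roughly 1ms loop described above. Removing
the usleep() call gives the pure spin variant that gets far more
iterations, at the cost of eating most of a core.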

Also, my code was pseudocode that I picked apart myself after I posted. On
the PRU side of things, you're probably going to want to do things a bit
differently. For starters, you're probably going to want to time your data
transfers from the PRU. That is, every 20ms, you're going to kick off a
new data set. However, this has to be done smartly, as you do not want to
override the userspace-side file lock. So perhaps a double buffer will be
needed? That will depend on the outcome of your given situation. Another
technique that could be used is data packing, as plain text data can be a
lot larger in memory than a packed data structure. But it would also
require a lot of thought as to how to do this smartly, as well as a strong
understanding of struct / union "data objects" and data alignment, for the
best results.
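
To make the double buffer and data packing ideas a bit more concrete, here
is a sketch of the host side in C. Everything in it is an assumption for
illustration: the struct layout, the owner-flag handshake, the buffer
size, and the names. The PRU side would have to follow the same layout,
and whether the barrier shown is sufficient between the ARM and the PRU
would need checking on real hardware.

```c
#include <assert.h>
#include <stdint.h>

#define SAMPLES_PER_BUF 1024u        /* hypothetical buffer depth */

enum { OWNER_ARM = 0, OWNER_PRU = 1 };

/* One half of the ping-pong pair. Fixed-width fields, ordered so the
 * compiler inserts no padding. A 16-bit sample costs 2 bytes here,
 * versus up to 6 bytes as decimal text ("65535" plus a separator). */
struct pru_buffer {
    uint32_t owner;                  /* OWNER_ARM: host may write;
                                        OWNER_PRU: PRU may read */
    uint32_t count;                  /* number of valid samples */
    uint16_t samples[SAMPLES_PER_BUF];
};

struct pru_shared {
    struct pru_buffer buf[2];
};

_Static_assert(sizeof(struct pru_buffer) == 8 + 2 * SAMPLES_PER_BUF,
               "unexpected padding in pru_buffer");

/* Fill the next buffer if the host currently owns it.
 * Returns 1 if the buffer was filled and handed to the PRU, or 0 if the
 * PRU still owns it (the caller should poll again later). */
static int host_fill(volatile struct pru_shared *shm, unsigned *next,
                     const uint16_t *src, uint32_t n)
{
    assert(*next < 2 && n <= SAMPLES_PER_BUF);
    volatile struct pru_buffer *b = &shm->buf[*next];
    if (b->owner != OWNER_ARM)
        return 0;                    /* never write under the reader */
    for (uint32_t i = 0; i < n; i++)
        b->samples[i] = src[i];
    b->count = n;
    __sync_synchronize();            /* data first, then the handover flag */
    b->owner = OWNER_PRU;
    *next ^= 1u;                     /* alternate between the two halves */
    return 1;
}
```

The point of the owner word is that each side only ever writes into a
buffer it owns, so the PRU can consume one half on its 20ms schedule while
the host refills the other.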

There could potentially be a lot more to consider down the road. Just pick
away at it one thing at a time. Eventually you'll be done with it.
