Re: [beagleboard] How to reliably push data from ARM host to PRU (shared) memory with predictable (low) latency?

William Hermans Thu, 23 Mar 2017 12:44:19 -0700

One other thing I forgot to mention: I was recently watching a YouTube
video by Jason Turner, who is known for talking about performance-related
C++ coding. I'm not exactly a huge fan of C++, but I do like to keep up
with the language. One of the things he says in this video, which I
completely agree with, is: "Simple code is 99% more likely to perform
better than complex code", or something to that effect. That may seem
obvious at first, but compare a simple two-line busy-wait loop to the
select() system call. Is select() only two lines of easy-to-read code
underneath? I cannot say with 100% certainty that it is not, but I
seriously doubt it.
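
For anyone who wants to try the comparison, here is a rough C sketch. It
is not code from this thread: the names now_ns, wait_select, wait_spin and
time_loop are made up for illustration, the busy wait here is a pure spin
on the clock (the version discussed earlier sleeps about 1ms between
polls), and select() is called with no file descriptors so that only its
timeout is exercised.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <sys/select.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Wait by blocking in select() with a timeout and no descriptors. */
void wait_select(long usec)
{
    struct timeval tv = { .tv_sec = 0, .tv_usec = usec };
    select(0, NULL, NULL, NULL, &tv);
}

/* Wait by spinning on the clock until the deadline has passed. */
void wait_spin(long usec)
{
    int64_t deadline = now_ns() + (int64_t)usec * 1000;
    while (now_ns() < deadline)
        ;
}

/* Total time for all waits, and the longest single wait seen. */
struct loop_stats {
    int64_t total_ns;
    int64_t worst_ns;
};

/* Call wait_fn(usec) iters times and time every call. */
struct loop_stats time_loop(void (*wait_fn)(long), long usec, int iters)
{
    struct loop_stats s = { 0, 0 };
    for (int i = 0; i < iters; i++) {
        int64_t t0 = now_ns();
        wait_fn(usec);
        int64_t dt = now_ns() - t0;
        s.total_ns += dt;
        if (dt > s.worst_ns)
            s.worst_ns = dt;
    }
    return s;
}
```

Calling time_loop(wait_select, 1000, 10000) and then
time_loop(wait_spin, 1000, 10000) and printing both results gives the
comparison. No numbers are shown here because they depend entirely on the
kernel, the board and the load. The worst_ns field is the interesting one
for this thread, since a single long stall is what starves the PRU.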


On Thu, Mar 23, 2017 at 12:32 PM, William Hermans <yyrk...@gmail.com> wrote:

>
>
> On Thu, Mar 23, 2017 at 5:48 AM, ags <alfred.g.schm...@gmail.com> wrote:
>
>> OK, I will use the busy-wait loop w/ usleep and test. The reason I used
>> select was I thought it would allow me to do other things (I need to have
>> another process, thread, or loop in this same application serving out audio
>> data to another client, synchronized with this data). My understanding was
>> that the process blocking on select() to return would free the CPU for
>> other things, but allow a quick wake-up to refresh the buffer as needed.
>>
>
> I initially thought that select() and friends should work too. But you
> have to remember, we're talking about an OS here that has an "expected"
> latency of 100ms or more, depending. You can easily experiment and find
> out for yourself. One of the easiest tests would be to run a loop for
> 10,000 iterations, once using select() and once using a busy-wait loop,
> then run the command-line tool time on each executable to see the
> difference. This is of course not a super accurate test, but it should be
> good enough to show a huge difference in completion time. *If* you're
> more of the scientific type, read the system time in your test app before
> and after the test code, then print the difference between the two
> times.
>
> Anyway, using an RT kernel or a Xenomai kernel may improve this latency
> *some*, but it is said that this comes at the expense of *some* other
> performance aspects of the OS. I've not tested that myself, only read
> about it.
>
>>
>> BTW, I have only mentioned the problems - but it does *almost* work. In
>> my tests, I ran 12,500 4KiB buffers from ARM to PRU and measured (on the
>> PRU side, using the precise CYCLE counter) to see if the PRU ever had to
>> wait for the next buffer fill. Turns out that the PRU had to wait about 180
>> times, or about 1.5% of the buffer fill events. The worst-case wait
>> (stall) time was ~5 milliseconds.
>>
>
> One has to be very careful about what one uses when writing an
> executable that requires some degree of determinism from userspace. I
> cannot recall the articles I read in the past that led me to understand
> all this, but they're out there. Pretty much any system call will incur a
> latency penalty, because you end up switching processor context from
> userspace to kernelspace and back to userspace. This in and of itself may
> not be too bad, but any data the call needs ends up being copied back and
> forth as well. In these cases you can incur huge latency spikes that you
> may not have anticipated.
>
> Personally, I've run into this problem a couple of times, on two
> different projects. My style of coding is to just get something working
> first, then refactor the code until it performs to my expectations.
> Basically, I start with really "simple" stuff like printf(), select(),
> etc., then refactor those out when / if needed. Many times it's not
> needed, but when it is, one should understand the consequences of using
> such function calls in an executable. That way, one has at least a rough
> idea of where to start "trimming the fat". Everyone falls into this
> "trap" at least once or twice when entering the embedded arena.
>
> My understanding of calls like select() is that when you use them, you're
> yielding the processor back to the system, with the "promise" that the
> system will eventually notify you when something related to that call
> has changed. With a busy-wait loop, you define the time period for which
> the processor is yielded back to the system; in the case of my example,
> approximately 1ms. Just be aware that with any non-real-time OS,
> intervals much shorter than 1ms will give varying results, e.g. the
> system will (may) not be able to keep up with your code. If your code is
> super efficient, you can potentially get hundreds of thousands of
> iterations. This is of course not guaranteed, but I've done it personally
> with the ADC, so I do know it is possible. At this performance level
> you're almost certainly using mmap(), and almost certainly using a lot of
> processor time as well: 80% or more.
>
> Also, my code was pseudocode that I picked apart myself after I posted.
> On the PRU side of things, you're probably going to want to do things a
> bit differently. For starters, you'll probably want to time your data
> transfers from the PRU. That is, every 20ms you kick off a new data set.
> However, this has to be done smartly, as you do not want to override the
> userspace-side file lock. So perhaps a double buffer will be needed? That
> will depend on how your particular situation turns out. Another technique
> that could be used is data packing, since plain-text data can be a lot
> larger in memory than a packed data structure. But doing this well also
> requires a lot of thought, as well as a strong understanding of struct /
> union "data objects" and data alignment.
>
> There could potentially be a lot more to consider down the road. Just pick
> away at it one thing at a time. Eventually you'll be done with it.
>
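
To make the double-buffer and data-packing ideas above a bit more
concrete, here is a minimal C sketch. Everything in it is an assumption
made for illustration: the struct layouts, the names pp_write, pp_read and
text_size, and the tiny buffer size. It runs against ordinary memory; on
the real board the struct pingpong would presumably live in mmap()'d PRU
shared RAM, and the fence shown may not be sufficient for that memory.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SAMPLES_PER_BUF 8   /* hypothetical; real buffers would be larger */

/* One sample packed into 4 bytes using fixed-width types. */
struct sample {
    uint16_t channel;
    int16_t  value;
};

/* Two buffers, one flag each: 0 = free for the writer, 1 = ready to read. */
struct pingpong {
    volatile uint32_t ready[2];
    struct sample buf[2][SAMPLES_PER_BUF];
};

/* Writer: fill buffer idx only if the reader has released it.
 * Returns 1 on success, 0 if the buffer is still in use. */
int pp_write(struct pingpong *pp, int idx, const struct sample *src)
{
    if (pp->ready[idx])
        return 0;
    memcpy(pp->buf[idx], src, sizeof pp->buf[idx]);
    atomic_thread_fence(memory_order_release);  /* data before flag */
    pp->ready[idx] = 1;
    return 1;
}

/* Reader: copy buffer idx out only if it is marked ready, then free it.
 * Returns 1 on success, 0 if nothing was ready. */
int pp_read(struct pingpong *pp, int idx, struct sample *dst)
{
    if (!pp->ready[idx])
        return 0;
    atomic_thread_fence(memory_order_acquire);  /* flag before data */
    memcpy(dst, pp->buf[idx], sizeof pp->buf[idx]);
    pp->ready[idx] = 0;
    return 1;
}

/* Bytes the same sample needs as a plain-text "channel,value\n" line. */
size_t text_size(const struct sample *s)
{
    char line[32];
    int n = snprintf(line, sizeof line, "%u,%d\n",
                     (unsigned)s->channel, (int)s->value);
    return n > 0 ? (size_t)n : 0;
}
```

The writer alternates idx between 0 and 1, so it can fill one buffer while
the other is being consumed, and neither side ever touches a buffer the
other one owns. The text_size helper is only there to compare against
sizeof(struct sample), which is how much the packed form takes.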

