Re: [beagleboard] How to reliably push data from ARM host to PRU (shared) memory with predictable (low) latency?

ags Fri, 24 Mar 2017 09:35:03 -0700

@William Hermans, I thought I'd share the results of my efforts to reliably 
stream data from the ARM host (Linux userspace) to the PRU.


I instrumented the PRU ASM code to use the CYCLE register for very precise 
measurements. I ran tests that kept track of how many times the PRU stalled 
waiting for data from the ARM host, for how long in total, and the "worst 
offender" (the longest single stall). I used this to test my current 
implementation using select(), then replaced select() with usleep() (and 
nanosleep()), and then again with a loop with no sleep, just a brute-force 
busy wait that never released the CPU. As it turns out, the results were 
surprising. Using usleep() (and related functions), the number of stalls, 
the overall stall time and the worst-case stall time were all significantly 
worse than with the implementation using select(). Even the busy-wait loop 
without sleep() was worse.

I did a bit of research, and sleep() and related functions are implemented 
using a syscall (sleep() used to use alarm() in the olden days, so I read). 
So getting through the call gate and the context switch happens with 
sleep() just like with select(). My theory is that select() is more 
efficient precisely because of this: one call to select() incurs one system 
call/context switch per interrupt. The process is put on the NotRunning 
list, and the OS continues on. When a trigger event happens, the OS returns 
the process to the Running list and then returns control to user space. 
With the sleep() method, there are many calls per "interrupt", each one 
polling some memory location looking for the signal from the PRU. So what 
is handled by one userspace->kernelspace->userspace transition with 
select() could require dozens of these transitions using sleep().
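
To make the counting argument above concrete, here is a minimal sketch. It 
is not the code used in the tests above. It is in Python for brevity 
(select.select() and time.sleep() wrap the same kind of kernel calls 
discussed here). Everything in it is a stand-in: a pipe plays the role of 
the PRU event file descriptor, a one-byte bytearray plays the role of the 
polled memory location, and the names wait_select, wait_sleep_poll and demo 
are made up for illustration. It only counts trips into the kernel while 
waiting; it says nothing about actual latency on the BeagleBone.

```python
import os
import select
import threading
import time


def wait_select(fd, timeout_s=2.0):
    """Block in a single select() call until fd becomes readable."""
    ready, _, _ = select.select([fd], [], [], timeout_s)
    if not ready:
        raise TimeoutError("no event arrived")
    os.read(fd, 1)  # consume the event byte
    return 1        # one blocking call was spent waiting


def wait_sleep_poll(flag, interval_s=0.001, timeout_s=2.0):
    """Poll a memory flag, sleeping between checks; count the sleep calls."""
    sleeps = 0
    deadline = time.monotonic() + timeout_s
    while flag[0] == 0:
        if time.monotonic() > deadline:
            raise TimeoutError("flag was never set")
        time.sleep(interval_s)  # each sleep is a trip into the kernel and back
        sleeps += 1
    return sleeps


def demo(delay_s=0.05):
    """Fire one simulated event per method; return (select_trips, sleep_trips)."""
    read_fd, write_fd = os.pipe()  # stand-in for the PRU event descriptor
    flag = bytearray(1)            # stand-in for the polled memory location

    # Simulated "interrupt" for the select() case: a byte written to the pipe.
    threading.Timer(delay_s, os.write, args=(write_fd, b"\x01")).start()
    select_trips = wait_select(read_fd)

    # Simulated "interrupt" for the polling case: the flag byte is set.
    def set_flag():
        flag[0] = 1

    threading.Timer(delay_s, set_flag).start()
    sleep_trips = wait_sleep_poll(flag)

    os.close(read_fd)
    os.close(write_fd)
    return select_trips, sleep_trips
```

Calling demo() returns a pair (select_trips, sleep_trips). The first is 
always 1 here; the second grows roughly with the event delay divided by the 
polling interval.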

I don't claim to be an expert, and if there is a flaw in this theory I'm 
open to hearing what it is. But this is my theory at the current moment.

So what I ended up doing is compressing the data so that one "frame" can 
fit in PRU memory at once. The PRU needs to send a full "frame" out with 
precise timing (to within a microsecond) for all data in that frame. 
Between frames, there is slack. By compressing the data, I can load a full 
frame into the PRU0/1 DRAM and shared RAM, and then kick off writing out 
the frame. Now everything is (or appears to be) deterministic in the timing 
of all transfers between registers, scratchpad and PRU DRAM. So I've 
sidestepped the problem of unpredictable latency waiting for data from the 
ARM host.
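
A rough sketch of the "does a frame fit" check implied above, in Python for 
brevity. The sizes are the AM335x PRU-ICSS figures (8 KiB of data RAM per 
PRU, 12 KiB of shared RAM). The split_frame name and the fill order are 
assumptions for illustration, the compression step itself is not shown, and 
any space the PRU program needs for its own stack and variables would have 
to be subtracted.

```python
# AM335x PRU-ICSS memories, filled in this (assumed) order.
PRU_REGIONS = [
    ("PRU0 DRAM", 8 * 1024),
    ("PRU1 DRAM", 8 * 1024),
    ("shared RAM", 12 * 1024),
]


def split_frame(frame, regions=PRU_REGIONS):
    """Split one (already compressed) frame into per-region chunks.

    Raises ValueError if the frame is larger than all regions combined.
    """
    capacity = sum(size for _, size in regions)
    if len(frame) > capacity:
        raise ValueError(
            f"frame is {len(frame)} bytes, only {capacity} bytes available"
        )
    chunks = []
    offset = 0
    for name, size in regions:
        chunk = frame[offset:offset + size]  # may be short or empty
        chunks.append((name, chunk))
        offset += len(chunk)
    return chunks
```

With these sizes, a 20 KiB frame fills both PRU data RAMs and 4 KiB of the 
shared RAM, while anything over 28 KiB is rejected before the PRU is 
started.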

I hope this might help someone else with similar requirements.

On Thursday, March 23, 2017 at 12:32:28 PM UTC-7, William Hermans wrote:
>
>
>
> On Thu, Mar 23, 2017 at 5:48 AM, ags <alfred.g...@gmail.com> wrote:
>
>> OK, I will use the busy-wait loop w/ usleep and test. The reason I used 
>> select was I thought it would allow me to do other things (I need to have 
>> another process, thread, or loop in this same application serving out audio 
>> data to another client, synchronized with this data). My understanding was 
>> that the process blocking on select() to return would free the CPU for 
>> other things, but allow a quick wake-up to refresh the buffer as needed.
>>
>
> I thought that select() and the like should work too, initially. But you 
> have to remember, we're talking about an OS here that has an "expected" 
> latency of 100ms or more, depending. One can easily experiment and find 
> out for oneself. One of the easiest tests would be to run a loop for 
> 10,000 iterations, once using select() and once using a busy-wait loop, 
> then run the command-line tool time on each to see the difference. This 
> of course is not a super accurate test, but it should be good enough to 
> show a huge difference in executable completion time. *If* you're more of 
> the scientific type, then get the system time in your test app before and 
> after the test code, and output the difference between those two times.
>
> Anyway, using an RT kernel or a Xenomai kernel may improve this latency 
> *some*, but it is said that this comes at the expense of *some* other 
> performance aspects of the OS. I've not actually tested that myself, but 
> only read about it.
>
>>
>> BTW, I have only mentioned the problems - but it does *almost* work. In 
>> my tests, I ran 12,500 4KiB buffers from ARM to PRU and measured (on the 
>> PRU side, using the precise CYCLE counter) to see if the PRU ever had to 
>> wait for the next buffer fill. Turns out that the PRU had to wait about 180 
>> times, or about 1.5% of the buffer fill events. The worst-case wait (stall) 
>> time was ~5 milliseconds.
>>
>
> One has to be very careful what they use in code when writing an 
> executable that requires some degree of determinism from userspace. I 
> cannot recall the articles I've read in the past that led me to understand 
> all this, but they're out there. Pretty much anything that is a system 
> call will incur a latency penalty, because one ends up switching processor 
> context from userspace to kernelspace and back to userspace. This in and 
> of itself may not be too bad, but any variables that are needed will end up 
> being copied back and forth as well. In these cases, however, you can incur 
> huge latency spikes that you may not have anticipated.
>
> Personally, I've run into this problem a couple of times during two 
> different projects. My style of coding is to just get something working 
> first, then refactor the code to perform to my expectations. Basically, I 
> start with really "simple" stuff like printf(), select(), etc., then 
> refactor those out when / if needed. Many times it's not needed, but 
> when it is, one should understand the consequences of using such function 
> calls in an executable. That way, one should have at least a rough idea 
> where to start with "trimming the fat". But everyone falls into this "trap" 
> at least once or twice when entering the embedded arena.
>
> My understanding of calls like select(), is that when they're used, you're 
> yielding the processor back to the system, with the "promise" that 
> eventually, the system will notify you when something related to that call 
> has changed. But with a busy wait loop, you're defining the time period 
> you're allowing the processor to be yielded back to the system. In the case 
> of my example, approximately 1ms. Just be aware that with any non-real-time 
> OS, intervals much shorter than 1ms will yield varying results, e.g. the 
> system will (may) not be able to keep up with your code. If your code is 
> super efficient, you can potentially get hundreds of thousands of 
> iterations. This is of course not guaranteed, but I've done it personally 
> with the ADC, so I do know it can be possible. At this performance level, 
> you're almost certainly using mmap(). You're almost certainly using a lot 
> of processor time as well: 80%+.
>
> Also, my code was pseudocode that I picked apart myself after I posted. On 
> the PRU side of things, you're probably going to want to do things a bit 
> differently. For starters, you're probably going to want to time your data 
> transfers from the PRU. That is, every 20ms, you're going to kick off a new 
> data set. However, this has to be done smartly, as you do not want to 
> override the userspace-side file lock, so perhaps a double buffer will be 
> needed? That will depend on your given situation. Another technique that 
> could be used is data packing, as plain-text data can be a lot larger in 
> memory than a packed data structure. But for the best results it would 
> also require a lot of thought as to how to do this smartly, as well as a 
> strong understanding of struct / union "data objects" and data alignment.
>
> There could potentially be a lot more to consider down the road. Just pick 
> away at it one thing at a time. Eventually you'll be done with it. 
>
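
The data-packing suggestion quoted above can be sketched as follows, using 
Python's struct module for brevity (the same layout could be declared as a 
packed C struct). The record layout, a channel number, a 16-bit value and a 
32-bit timestamp, is purely hypothetical, as are the function names.

```python
import struct

# Hypothetical record: channel (uint8), value (uint16), timestamp (uint32).
# "<" means little-endian with no padding between fields.
RECORD = struct.Struct("<BHI")


def pack_records(records):
    """Pack (channel, value, timestamp) tuples into one binary blob."""
    return b"".join(RECORD.pack(ch, val, ts) for ch, val, ts in records)


def unpack_records(blob):
    """Inverse of pack_records: recover the list of tuples."""
    return [
        RECORD.unpack_from(blob, offset)
        for offset in range(0, len(blob), RECORD.size)
    ]


def as_text(records):
    """The same records as comma-separated text, for size comparison."""
    lines = (f"{ch},{val},{ts}\n" for ch, val, ts in records)
    return "".join(lines).encode("ascii")
```

For the two records (3, 1023, 123456) and (7, 512, 123476), the packed form 
is 14 bytes and the text form is 27 bytes. Note that 
struct.calcsize("@BHI"), which uses native alignment, typically reports 8 
bytes per record rather than 7 because of padding, which is the alignment 
issue mentioned in the quoted text.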
