On February 28, 2018 8:10:09 AM PST, Andreas Bolsch <hyphen0br...@gmail.com> 
wrote:
>To be sure I just did the following tests (OpenOCD, current head,
>integrated ST-Link v2-1, 4 MHz SWD clock):
>nucleo-f767zi, 2 MByte random data: prog: 140 kBytes/s, read: 150 kBytes/s
>disco-f412g, 1 MByte random data: prog: 134 kBytes/s, read: 158 kBytes/s
>
>Then STM32CubeProgrammer (defaults, Linux host, integrated ST-Link v2-1):
>disco-f412g, 1 MByte random data: prog: 133 kBytes/s, read: 150 kBytes/s
>
>And finally OpenOCD with the algorithm disabled, everything else as before:
>disco-f412g, 1 MByte random data: prog: 1 kByte/s (yes, no kidding, ONE!)
>
>All tests above with SWD, not JTAG.

They were also all done with an HLA (high-level adapter). Not all of us can use 
such an adapter, for various reasons, and not all of us can use SWD rather than 
JTAG.

>That the direct register approach is quite slow isn't surprising.
>That's like playing ping-pong over USB for every single bit.

A word, actually, not a bit; at least for the ByteBlaster and FTDI adapters. And 
with the CR and SR accesses pulled out of the loop, it turns into one giant call 
to target_write_memory that writes the entire image, which AFAIK just shovels 
words (or halfwords, at 16× parallelism) into DRW with TAR auto-increment.
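
To make that concrete, here is a minimal sketch of the no-algorithm path once 
the CR/SR accesses are hoisted out of the loop. Sketch only: the register 
addresses and bit masks are the STM32F4/F7 values as I remember them, and 
stm32x_write_noalgo is a made-up name, not the actual driver function.

#include "imp.h"                /* OpenOCD flash-driver glue, ERROR_FLASH_* */
#include <target/target.h>      /* target_write_memory(), target_write_u32() */

/* STM32F4/F7 flash interface registers, from memory; double-check. */
#define FLASH_REGS_BASE    0x40023C00
#define FLASH_KEYR         (FLASH_REGS_BASE + 0x04)
#define FLASH_SR           (FLASH_REGS_BASE + 0x0C)
#define FLASH_CR           (FLASH_REGS_BASE + 0x10)
#define FLASH_CR_PG        (1u << 0)
#define FLASH_CR_PSIZE_X16 (1u << 8)    /* PSIZE = 01: halfword programming */
#define FLASH_SR_BSY       (1u << 16)
#define FLASH_SR_ERR_MASK  0xF2         /* PGSERR|PGPERR|PGAERR|WRPERR|OPERR */

/* Hypothetical helper: program `count` halfwords without a target algorithm. */
static int stm32x_write_noalgo(struct target *target, const uint8_t *buffer,
		uint32_t address, uint32_t count)
{
	int retval;
	uint32_t sr;

	/* Unlock the flash controller (key sequence per reference manual). */
	target_write_u32(target, FLASH_KEYR, 0x45670123);
	target_write_u32(target, FLASH_KEYR, 0xCDEF89AB);

	/* Set PG and the parallelism once, outside the loop. */
	retval = target_write_u32(target, FLASH_CR,
			FLASH_CR_PG | FLASH_CR_PSIZE_X16);
	if (retval != ERROR_OK)
		return retval;

	/* One big transfer: the adapter streams halfwords into DRW with TAR
	 * auto-increment instead of doing a CR/SR round trip per halfword. */
	retval = target_write_memory(target, address, 2, count, buffer);
	if (retval != ERROR_OK)
		return retval;

	/* Wait for the last program operation and check the error flags once. */
	do {
		retval = target_read_u32(target, FLASH_SR, &sr);
		if (retval != ERROR_OK)
			return retval;
	} while (sr & FLASH_SR_BSY);

	return (sr & FLASH_SR_ERR_MASK) ? ERROR_FLASH_OPERATION_FAILED : ERROR_OK;
}

The key point is simply that the adapter never has to turn around for a status 
read between halfwords.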

Anyway, your results with the algorithm on an HLA (roughly 135 kBytes/s, 
assuming erase time isn’t counted; did you count it?) seem to match what I get 
without the algorithm on an FTDI adapter.

>The main benefit of the algorithm approach is that data transport and
>programming ("real" programming with CPU stall) run simultaneously. Of
>course, this can only work smoothly if the programming adapter does
>support this "streaming" approach, so it won't work reasonably well with
>a low-level adapter.
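
For anyone following along: that streaming mode is what OpenOCD's 
target_run_flash_async_algorithm helper provides; the host keeps topping up a 
circular buffer in target RAM while a small loader running on the target drains 
it and does the CR/SR handshaking, so USB transfer and flash programming 
overlap. Here is a rough host-side sketch of how a driver invokes it; hedged, 
since the register convention and the loader itself are placeholders, not the 
actual stm32 driver code.

#include "imp.h"
#include <target/armv7m.h>
#include <target/algorithm.h>

/* Sketch of the driver side of the streaming ("async algorithm") approach.
 * The loader binary and its calling convention here are placeholders. */
static int write_block_sketch(struct flash_bank *bank, const uint8_t *buffer,
		uint32_t offset, uint32_t halfword_count, target_addr_t loader_addr)
{
	struct target *target = bank->target;
	struct working_area *source;      /* circular data buffer in target RAM */
	struct reg_param reg_params[4];
	struct armv7m_algorithm armv7m_info;

	if (target_alloc_working_area(target, 16384, &source) != ERROR_OK)
		return ERROR_TARGET_RESOURCE_NOT_AVAILABLE;

	armv7m_info.common_magic = ARMV7M_COMMON_MAGIC;
	armv7m_info.core_mode = ARM_MODE_THREAD;

	/* Illustrative convention: r0/r1 = buffer start/end, r2 = flash address,
	 * r3 = halfword count. The real loaders define their own. */
	init_reg_param(&reg_params[0], "r0", 32, PARAM_OUT);
	init_reg_param(&reg_params[1], "r1", 32, PARAM_OUT);
	init_reg_param(&reg_params[2], "r2", 32, PARAM_OUT);
	init_reg_param(&reg_params[3], "r3", 32, PARAM_OUT);
	buf_set_u32(reg_params[0].value, 0, 32, source->address);
	buf_set_u32(reg_params[1].value, 0, 32, source->address + source->size);
	buf_set_u32(reg_params[2].value, 0, 32, bank->base + offset);
	buf_set_u32(reg_params[3].value, 0, 32, halfword_count);

	/* The helper refills the circular buffer from `buffer` while the loader
	 * programs flash, which is where the overlap comes from. */
	int retval = target_run_flash_async_algorithm(target, buffer,
			halfword_count, 2,        /* block size: one halfword */
			0, NULL, 4, reg_params,
			source->address, source->size,
			loader_addr, 0, &armv7m_info);

	for (unsigned int i = 0; i < 4; i++)
		destroy_reg_param(&reg_params[i]);
	target_free_working_area(target, source);
	return retval;
}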

I’ll be completely honest here: the reason I tried this is not that I wanted to 
improve speed, but that the algorithm approach *broke* with the FTDI adapter. It 
kept failing with either "timeout waiting for algorithm" or complaints about the 
debug regions being unpowered. So I tried bypassing the algorithm and found it 
was really slow, *then* tried speeding it up by moving the CR and SR accesses 
out of the loop and found it became really, really fast.

So while the algorithm approach seems really nice conceptually, in practice it 
doesn’t work for me. I took the shortest path to something that *would* work, 
then discovered it could be fast anyway.

>Regarding the parallelism I'd suggest to leave the parallelism by
>default as it currently is, i.e. 16.
>Anything else would be a pitfall for the unaware user. The assumption
>that most users will use 2.4V to 3.3V supply is still valid, I guess.
>If it were configurable, 32 wouldn't give substantially higher speed
>(well, at least if a "good" programming adapter is used) anyway.

Fair enough. I never wanted to change the default anyway. I just wanted to 
provide the user with the ability to change it should they wish. Does this seem 
reasonable to you?
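
Concretely, all I have in mind is mapping a user-supplied parallelism onto the 
PSIZE field while keeping 16 as the default. A sketch of the mapping, hedged: 
the voltage limits in the comments are what I remember from the F4 reference 
manual (x8 at any VDD, x16 from 2.1 V, x32 from 2.7 V, x64 only with external 
Vpp), and psize_from_parallelism is a made-up helper name.

#include <stdint.h>
#include <helper/log.h>       /* LOG_ERROR(), ERROR_OK */
#include <helper/command.h>   /* ERROR_COMMAND_SYNTAX_ERROR */

/* Hypothetical helper: translate a user-requested programming parallelism
 * (in bits) into the FLASH_CR PSIZE encoding. The default stays at 16.
 * Voltage limits per my reading of the F4 reference manual; double-check. */
static int psize_from_parallelism(unsigned int bits, uint32_t *psize)
{
	switch (bits) {
	case 8:               /* allowed at any supply voltage           */
		*psize = 0x0;
		return ERROR_OK;
	case 16:              /* needs VDD >= 2.1 V; the current default */
		*psize = 0x1;
		return ERROR_OK;
	case 32:              /* needs VDD >= 2.7 V                      */
		*psize = 0x2;
		return ERROR_OK;
	case 64:              /* needs external Vpp (8-9 V) as well      */
		*psize = 0x3;
		return ERROR_OK;
	default:
		LOG_ERROR("flash programming parallelism must be 8, 16, 32 or 64");
		return ERROR_COMMAND_SYNTAX_ERROR;
	}
}

Whether that ends up as an extra flash bank argument or a separate command 
doesn't matter much to me, as long as leaving it alone keeps today's behaviour.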

-- 
Christopher Head
