So I've spun a patch train that uses threads: https://review.coreboot.org/c/coreboot/+/56225/1
Let me know what you think. Thanks! On Wed, Jul 7, 2021 at 7:24 PM Julius Werner <jwer...@chromium.org> wrote: > > Pulling out some of the discussions that we already had inside Google > to this wider audience, I am strongly against parallelization > mechanisms that pull this much extra complexity into existing code > (particularly the already very complicated CBFS framework). I think > https://review.coreboot.org/c/coreboot/+/56049 is a perfect example of > the maintainability drawbacks of this approach, requiring to weave > large new chunks of code throughout the whole stack downwards, from > CBFS into the rdev layer and finally into the platform SPI driver, > just to be able to asynchronously execute a single read operation. The > CBFS patch isn't even complete, there will be more code needed to > support other primitives (e.g. cbfs_map()), decompression, etc. And > it's duplicating a lot of the existing flow from the synchronous code > in _cbfs_alloc(). > > I think if we want a parallelization mechanism, we should converge > onto a single one that can solve a variety of current and future use > cases, and can be applied optionally where individual platforms need > it without having to grow tentacles into every other part of coreboot. > And I think the much better model for that would be cooperative > multithreading (e.g. setting up a second stack and execution context > and then having the CPU ping-pong back and forth between them while > the other one is idle and waiting on hardware somewhere). In fact > coreboot already has all this implemented in src/lib/threads.c > (CONFIG_COOP_MULTITASKING), although it seems that nobody has really > used it in a while. > > The clear advantage of cooperative multithreading is that almost none > of the code, especially not most of these in-between layers like CBFS > and rdev, need to be modified to support it. You can create a thread > and send it off to, say, begin loading the payload while the rest of > the boot state machine is still running platform initialization code, > and it can run through all the CBFS motions (including verification > and whatever other optional features are enabled) without any of those > having been written specifically with asynchronous operation in mind. > You still need to make sure that the two threads don't both try to > access shared resources at the same time, but you need to do that with > futures too. In the file loading case, it would probably be enough to > put a few critical sections in the SPI driver itself and there should > be no need to pull it up into all the platform-agnostic general > framework parts. > > The cost of cooperative multithreading is mostly just that you need > the space for a second stack. In ramstage that should be a given > anyway, and in romstage it can probably still be easily done on most > (particularly recent) platforms. For an optional performance > enhancement feature, I think that trade-off makes sense. > > On Wed, Jul 7, 2021 at 1:18 PM Raul Rangel <rran...@chromium.org> wrote: > > > > On Wed, Jul 7, 2021 at 12:49 PM Peter Stuge <pe...@stuge.se> wrote: > > > > > > Raul Rangel wrote: > > > > I'm currently working on improving the boot time for the AMD Cezanne > > > > platform. > > > .. > > > > Another difference between the latest AMD SoCs (Picasso, Cezanne), is > > > > that RAM is available in bootblock. > > > > > > As I have understood, the PSP has both trained RAM and copied firmware > > > from > > > SPI to RAM when x86 comes out of reset. > > > > > > Is that accurate, false, or only partially accurate? > > > > It's partially accurate. The PSP will load bootblock into RAM. So > > coreboot still needs to access the SPI flash to load everything else. > > > > > > > > > One place where we spend a decent amount of time is reading from SPI > > > > flash. We have the SPI speed/modes set to the optimal settings for > > > > our platforms, but there is still room for improvement. > > > > > > Please provide numbers? > > > > Sorry, that sentence didn't come out how I wanted. I was just saying > > that we could improve the boot time by exploring other avenues. > > > > > > > > > The question is, how do we model these asynchronous operations, how is > > > > data ownership handled, and how does the BSP know the operation is > > > > done? > > > > > > I haven't yet reveiewed your API proposal, but find it an absolutely > > > horrible idea to create a *general* API for asynchronous operations in > > > coreboot, because - as you recognize - it can easily be misused to great > > > detriment of the codebase, which already suffers chronically from such > > > trivial problems as copy-paste:itis. Don't do it. > > > > > > There is zero incentive for developers to improve the source beyond > > > making it work for their deadline; more complexity *will* create more > > > problems. (I understand that you have good intentions proposing this > > > change!) > > > > There has been a lot of work in refactoring the AMD codebases and > > fixing things throughout the stack. We have reduced a good amount of > > copy/paste when bringing up a new AMD SoC. So I don't agree that we > > are all writing abandonware. Please have a look at the proposal before > > making such hasty judgments. > > > > > > > > > I'm curious to see what the community thinks, and welcome any feedback. > > > > > > A special purpose DMA API is another matter to me, because it's very well > > > defined. It could still be useful beyond x86 and I think a blocking > > > "wait-for-DMA-to-finish" API is easy to understand, easy to implement and > > > to use, and sufficient since runtime flow is serial, especially when > > > measuring. > > > > That was my original thought, but there are plenty of other things > > that can be modeled as asynchronous operations. Stuffing it into the > > rdev API felt wrong. > > > > Thanks, > > Raul > > > > p.s., > > > > Just to add some more data. I performed an experiment where I switched > > the payload compression from LZMA to LZ4. Since we no longer pay the > > SPI read cost (thanks async!), we essentially load the payload for > > free: > > > > ## Payload LZMZ compression (current) > > ``` > > Name Offset Type Size > > fallback/payload 0x224dc0 simple elf 131417 (128 KiB) > > ``` > > ``` > > 90:starting to load payload 1,123,962 (13) > > 15:starting LZMA decompress 1,123,967 (5) > > 16:finished LZMA decompress 1,131,201 (7,234) > > ``` > > > > ## Payload LZ4 compression > > ``` > > Name Offset Type Size > > fallback/payload 0x224dc0 simple elf 173233 (169 KiB) > > ``` > > > > ``` > > 90:starting to load payload 1,130,877 (12) > > 17:starting LZ4 decompress 1,130,882 (5) > > 18:finished LZ4 decompress 1,131,091 (209) > > ``` > > > > ``` > > Payload increase: > > KiB: 169 - 128 = 41 KiB > > %: (169 - 128) / 128 = 0.3203125 > > > > Decompression decrease: > > us: 209 - 7234 = -7025 > > %: (209 - 7234) / 7234 = -0.9711086535803152 > > ``` > > _______________________________________________ > > coreboot mailing list -- coreboot@coreboot.org > > To unsubscribe send an email to coreboot-le...@coreboot.org _______________________________________________ coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-le...@coreboot.org