So I've spun up a patch train that uses threads:
https://review.coreboot.org/c/coreboot/+/56225/1
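
The rough shape is something like the sketch below (simplified, with
illustrative names; the real API and call sites are in the patches above):

```
/*
 * Simplified sketch only -- the exact function names and signatures
 * come from src/include/thread.h and the patch train above, so treat
 * these as illustrative rather than authoritative.
 */
#include <program_loading.h>
#include <thread.h>

static struct thread_handle payload_handle;

/* Runs on its own stack; all the normal CBFS/SPI code executes here
 * unmodified while the main boot flow keeps going. */
static enum cb_err payload_preload(void *arg)
{
	payload_load();
	return CB_SUCCESS;
}

void start_payload_preload(void)
{
	/* Kick off the load in the background; fall back to a synchronous
	 * load if no thread could be started (e.g. COOP_MULTITASKING off). */
	if (thread_run(&payload_handle, payload_preload, NULL) != 0)
		payload_preload(NULL);
}

void wait_for_payload_preload(void)
{
	/* Block until the background load has finished. */
	thread_join(&payload_handle);
}
```

The slow SPI reads then overlap with the rest of platform init instead
of blocking it.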

Let me know what you think.

Thanks!

On Wed, Jul 7, 2021 at 7:24 PM Julius Werner <jwer...@chromium.org> wrote:
>
> Pulling out some of the discussions that we already had inside Google
> to this wider audience, I am strongly against parallelization
> mechanisms that pull this much extra complexity into existing code
> (particularly the already very complicated CBFS framework). I think
> https://review.coreboot.org/c/coreboot/+/56049 is a perfect example of
> the maintainability drawbacks of this approach: it requires weaving
> large new chunks of code down through the whole stack, from CBFS into
> the rdev layer and finally into the platform SPI driver, just to be
> able to asynchronously execute a single read operation. The CBFS patch
> isn't even complete; there will be more code needed to
> support other primitives (e.g. cbfs_map()), decompression, etc. And
> it's duplicating a lot of the existing flow from the synchronous code
> in _cbfs_alloc().
>
> I think if we want a parallelization mechanism, we should converge
> onto a single one that can solve a variety of current and future use
> cases, and can be applied optionally where individual platforms need
> it without having to grow tentacles into every other part of coreboot.
> And I think the much better model for that would be cooperative
> multithreading (e.g. setting up a second stack and execution context
> and then having the CPU ping-pong back and forth between them whenever
> one of them is idle and waiting on hardware somewhere). In fact
> coreboot already has all this implemented in src/lib/threads.c
> (CONFIG_COOP_MULTITASKING), although it seems that nobody has really
> used it in a while.
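>
> To illustrate the model (very rough sketch, not the exact code in
> threads.c; the spi_dma_start()/spi_dma_done() helpers are hypothetical):
>
> ```
> /* A worker that waits on hardware yields the CPU instead of spinning,
>  * so the other stack/execution context keeps running in the meantime. */
> static enum cb_err spi_read_worker(void *arg)
> {
> 	spi_dma_start(arg);		/* hypothetical: kick off the transfer */
> 	while (!spi_dma_done(arg))	/* hypothetical: poll the controller */
> 		thread_yield_microseconds(10);	/* switch to the other context */
> 	return CB_SUCCESS;
> }
> ```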
>
> The clear advantage of cooperative multithreading is that almost none
> of the code, and in particular none of the in-between layers like CBFS
> and rdev, needs to be modified to support it. You can create a thread
> and send it off to, say, begin loading the payload while the rest of
> the boot state machine is still running platform initialization code,
> and it can run through all the CBFS motions (including verification
> and whatever other optional features are enabled) without any of those
> having been written specifically with asynchronous operation in mind.
> You still need to make sure that the two threads don't both try to
> access shared resources at the same time, but you need to do that with
> futures too. In the file loading case, it would probably be enough to
> put a few critical sections in the SPI driver itself and there should
> be no need to pull it up into all the platform-agnostic general
> framework parts.
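>
> Something along these lines in the platform SPI driver would be enough
> (sketch only; the mutex primitive is assumed to come with the threading
> patches, and do_spi_transfer() stands in for the real controller code):
>
> ```
> /* Serialize access to the shared SPI controller at the driver level so
>  * that none of the generic CBFS/rdev layers need to know about threads. */
> static struct thread_mutex spi_bus_mutex;
>
> static int amd_spi_xfer(const struct spi_slave *slave, const void *dout,
> 			size_t bytesout, void *din, size_t bytesin)
> {
> 	int ret;
>
> 	thread_mutex_lock(&spi_bus_mutex);
> 	ret = do_spi_transfer(slave, dout, bytesout, din, bytesin);
> 	thread_mutex_unlock(&spi_bus_mutex);
>
> 	return ret;
> }
> ```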
>
> The cost of cooperative multithreading is mostly just that you need
> the space for a second stack. In ramstage that should be a given
> anyway, and in romstage it can probably still be easily done on most
> (particularly recent) platforms. For an optional performance
> enhancement feature, I think that trade-off makes sense.
>
> On Wed, Jul 7, 2021 at 1:18 PM Raul Rangel <rran...@chromium.org> wrote:
> >
> > On Wed, Jul 7, 2021 at 12:49 PM Peter Stuge <pe...@stuge.se> wrote:
> > >
> > > Raul Rangel wrote:
> > > > I'm currently working on improving the boot time for the AMD Cezanne
> > > > platform.
> > > ..
> > > > Another difference between the latest AMD SoCs (Picasso, Cezanne), is
> > > > that RAM is available in bootblock.
> > >
> > > As I have understood, the PSP has both trained RAM and copied firmware
> > > from SPI to RAM when x86 comes out of reset.
> > >
> > > Is that accurate, false, or only partially accurate?
> >
> > It's partially accurate. The PSP will load bootblock into RAM. So
> > coreboot still needs to access the SPI flash to load everything else.
> >
> > >
> > > > One place where we spend a decent amount of time is reading from SPI
> > > > flash. We have the SPI speed/modes set to the optimal settings for
> > > > our platforms, but there is still room for improvement.
> > >
> > > Please provide numbers?
> >
> > Sorry, that sentence didn't come out how I wanted. I was just saying
> > that we could improve the boot time by exploring other avenues.
> >
> > >
> > > > The question is, how do we model these asynchronous operations, how is
> > > > data ownership handled, and how does the BSP know the operation is
> > > > done?
> > >
> > > I haven't yet reviewed your API proposal, but find it an absolutely
> > > horrible idea to create a *general* API for asynchronous operations in
> > > coreboot, because - as you recognize - it can easily be misused to great
> > > detriment of the codebase, which already suffers chronically from such
> > > trivial problems as copy-paste-itis. Don't do it.
> > >
> > > There is zero incentive for developers to improve the source beyond
> > > making it work for their deadline; more complexity *will* create more
> > > problems. (I understand that you have good intentions proposing this
> > > change!)
> >
> > There has been a lot of work in refactoring the AMD codebases and
> > fixing things throughout the stack. We have reduced a good amount of
> > copy/paste when bringing up a new AMD SoC. So I don't agree that we
> > are all writing abandonware. Please have a look at the proposal before
> > making such hasty judgments.
> >
> > >
> > > > I'm curious to see what the community thinks, and welcome any feedback.
> > >
> > > A special purpose DMA API is another matter to me, because it's very well
> > > defined. It could still be useful beyond x86 and I think a blocking
> > > "wait-for-DMA-to-finish" API is easy to understand, easy to implement and
> > > to use, and sufficient since runtime flow is serial, especially when
> > > measuring.
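> > >
> > > Something with roughly this shape would be enough (hypothetical names,
> > > just to illustrate):
> > >
> > > ```
> > > /* Hypothetical special-purpose DMA API: start a SPI-to-memory read,
> > >  * do other work, then block until the transfer has finished. */
> > > struct spi_dma_transaction;
> > >
> > > /* Kick off the DMA read and return immediately. */
> > > int spi_dma_read_start(struct spi_dma_transaction *txn, void *buffer,
> > > 		       size_t offset, size_t size);
> > >
> > > /* Poll the controller until the given transfer is complete. */
> > > int spi_dma_read_wait(struct spi_dma_transaction *txn);
> > > ```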
> >
> > That was my original thought, but there are plenty of other things
> > that can be modeled as asynchronous operations. Stuffing it into the
> > rdev API felt wrong.
> >
> > Thanks,
> > Raul
> >
> > p.s.,
> >
> > Just to add some more data. I performed an experiment where I switched
> > the payload compression from LZMA to LZ4. Since we no longer pay the
> > SPI read cost (thanks async!), we essentially load the payload for
> > free:
> >
> > ## Payload LZMA compression (current)
> > ```
> > Name                           Offset     Type           Size
> > fallback/payload               0x224dc0   simple elf     131417 (128 KiB)
> > ```
> > ```
> >   90:starting to load payload                          1,123,962 (13)
> >   15:starting LZMA decompress                          1,123,967 (5)
> >   16:finished LZMA decompress                          1,131,201 (7,234)
> > ```
> >
> > ## Payload LZ4 compression
> > ```
> > Name                           Offset     Type           Size
> > fallback/payload               0x224dc0   simple elf     173233 (169 KiB)
> > ```
> >
> > ```
> >   90:starting to load payload                          1,130,877 (12)
> >   17:starting LZ4 decompress                           1,130,882 (5)
> >   18:finished LZ4 decompress                           1,131,091 (209)
> > ```
> >
> > ```
> > Payload size increase:
> >   KiB: 169 - 128 = 41 KiB
> >   %:   (169 - 128) / 128 ≈ +32%
> >
> > Decompression time decrease:
> >   us: 209 - 7234 = -7025 us
> >   %:  (209 - 7234) / 7234 ≈ -97%
> > ```