Thanks, I think it looks pretty great! That solves exactly the
concerns I had with the other implementation.

On Mon, Jul 12, 2021 at 3:06 PM Raul Rangel <[email protected]> wrote:
>
> So I've spun a patch train that uses threads:
> https://review.coreboot.org/c/coreboot/+/56225/1
>
> Let me know what you think.
>
> Thanks!
>
> On Wed, Jul 7, 2021 at 7:24 PM Julius Werner <[email protected]> wrote:
> >
> > Pulling out some of the discussions that we already had inside Google
> > to this wider audience, I am strongly against parallelization
> > mechanisms that pull this much extra complexity into existing code
> > (particularly the already very complicated CBFS framework). I think
> > https://review.coreboot.org/c/coreboot/+/56049 is a perfect example of
> > the maintainability drawbacks of this approach: it requires weaving
> > large new chunks of code through the whole stack, downwards from
> > CBFS into the rdev layer and finally into the platform SPI driver,
> > just to be able to execute a single read operation asynchronously. The
> > CBFS patch isn't even complete, there will be more code needed to
> > support other primitives (e.g. cbfs_map()), decompression, etc. And
> > it's duplicating a lot of the existing flow from the synchronous code
> > in _cbfs_alloc().
> >
> > I think if we want a parallelization mechanism, we should converge
> > onto a single one that can solve a variety of current and future use
> > cases, and can be applied optionally where individual platforms need
> > it without having to grow tentacles into every other part of coreboot.
> > And I think the much better model for that would be cooperative
> > multithreading (e.g. setting up a second stack and execution context
> > and then having the CPU ping-pong back and forth between them while
> > the other one is idle and waiting on hardware somewhere). In fact
> > coreboot already has all this implemented in src/lib/threads.c
> > (CONFIG_COOP_MULTITASKING), although it seems that nobody has really
> > used it in a while.
> >
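> > To make that concrete, here is a rough sketch of what kicking off a
> > payload preload on a second thread could look like. The thread_run()
> > signature is assumed from src/include/thread.h; everything else
> > (payload_preload(), start_payload_preload(), the stored pointer) is
> > hypothetical, not existing code:
> >
> > ```
> > #include <cbfs.h>
> > #include <console/console.h>
> > #include <thread.h>
> >
> > static void *payload_ptr;
> >
> > /* Hypothetical worker: runs on the second stack and goes through the
> >  * normal, unmodified CBFS path (verification, decompression and all),
> >  * because none of that code needs to know it runs asynchronously. */
> > static void payload_preload(void *arg)
> > {
> >         *(void **)arg = cbfs_map("fallback/payload", NULL);
> > }
> >
> > void start_payload_preload(void)
> > {
> >         /* The CPU ping-pongs between this thread and the worker
> >          * whenever either one yields while waiting on hardware. */
> >         if (thread_run(payload_preload, &payload_ptr) < 0)
> >                 printk(BIOS_WARNING, "could not start preload thread\n");
> > }
> > ```
> >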
> > The clear advantage of cooperative multithreading is that almost none
> > of the code, especially not the in-between layers like CBFS and
> > rdev, needs to be modified to support it. You can create a thread
> > and send it off to, say, begin loading the payload while the rest of
> > the boot state machine is still running platform initialization code,
> > and it can run through all the CBFS motions (including verification
> > and whatever other optional features are enabled) without any of those
> > having been written specifically with asynchronous operation in mind.
> > You still need to make sure that the two threads don't both try to
> > access shared resources at the same time, but you need to do that with
> > futures too. In the file loading case, it would probably be enough to
> > put a few critical sections in the SPI driver itself and there should
> > be no need to pull it up into all the platform-agnostic general
> > framework parts.
> >
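> > To illustrate how little that would take: under a cooperative
> > scheduler, context switches only happen at explicit yield points, so a
> > plain flag plus a yielding wait is already race-free. spi_busy,
> > spi_claim() and spi_release() below are made-up names for the sketch;
> > thread_yield_microseconds() is from src/include/thread.h:
> >
> > ```
> > #include <thread.h>
> >
> > static int spi_busy;    /* illustrative lock for the SPI controller */
> >
> > static void spi_claim(void)
> > {
> >         /* Threads only switch on explicit yields, so testing and
> >          * setting the flag here cannot be interleaved. */
> >         while (spi_busy)
> >                 thread_yield_microseconds(10);
> >         spi_busy = 1;
> > }
> >
> > static void spi_release(void)
> > {
> >         spi_busy = 0;
> > }
> > ```
> >
> > The platform SPI driver would wrap its transfer routine in
> > spi_claim()/spi_release(), and nothing above the driver has to change.
> >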
> > The cost of cooperative multithreading is mostly just that you need
> > the space for a second stack. In ramstage that should be a given
> > anyway, and in romstage it can probably still be easily done on most
> > (particularly recent) platforms. For an optional performance
> > enhancement feature, I think that trade-off makes sense.
> >
> > On Wed, Jul 7, 2021 at 1:18 PM Raul Rangel <[email protected]> wrote:
> > >
> > > On Wed, Jul 7, 2021 at 12:49 PM Peter Stuge <[email protected]> wrote:
> > > >
> > > > Raul Rangel wrote:
> > > > > I'm currently working on improving the boot time for the AMD Cezanne
> > > > > platform.
> > > > ..
> > > > > Another difference between the latest AMD SoCs (Picasso, Cezanne), is
> > > > > that RAM is available in bootblock.
> > > >
> > > > As I have understood, the PSP has both trained RAM and copied
> > > > firmware from SPI to RAM when x86 comes out of reset.
> > > >
> > > > Is that accurate, false, or only partially accurate?
> > >
> > > It's partially accurate. The PSP will load bootblock into RAM. So
> > > coreboot still needs to access the SPI flash to load everything else.
> > >
> > > >
> > > > > One place where we spend a decent amount of time is reading from SPI
> > > > > flash. We have the SPI speed/modes set to the optimal settings for
> > > > > our platforms, but there is still room for improvement.
> > > >
> > > > Please provide numbers?
> > >
> > > Sorry, that sentence didn't come out how I wanted. I was just saying
> > > that we could improve the boot time by exploring other avenues.
> > >
> > > >
> > > > > The question is, how do we model these asynchronous operations, how is
> > > > > data ownership handled, and how does the BSP know the operation is
> > > > > done?
> > > >
> > > > I haven't yet reviewed your API proposal, but find it an absolutely
> > > > horrible idea to create a *general* API for asynchronous operations in
> > > > coreboot, because - as you recognize - it can easily be misused to great
> > > > detriment of the codebase, which already suffers chronically from such
> > > > trivial problems as copy-paste-itis. Don't do it.
> > > >
> > > > There is zero incentive for developers to improve the source beyond
> > > > making it work for their deadline; more complexity *will* create more
> > > > problems. (I understand that you have good intentions proposing this
> > > > change!)
> > >
> > > There has been a lot of work in refactoring the AMD codebases and
> > > fixing things throughout the stack. We have reduced a good amount of
> > > copy/paste when bringing up a new AMD SoC. So I don't agree that we
> > > are all writing abandonware. Please have a look at the proposal before
> > > making such hasty judgments.
> > >
> > > >
> > > > > I'm curious to see what the community thinks, and welcome any feedback.
> > > >
> > > > A special-purpose DMA API is another matter to me, because it's very
> > > > well defined. It could still be useful beyond x86, and I think a
> > > > blocking "wait-for-DMA-to-finish" API is easy to understand, easy to
> > > > implement and use, and sufficient since runtime flow is serial,
> > > > especially when measuring.
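> > > >
> > > > For illustration only, the entire surface of such an API could be a
> > > > start/wait pair along these lines; every name here is hypothetical,
> > > > just sketching the shape described above:
> > > >
> > > > ```
> > > > #include <commonlib/region.h>
> > > > #include <types.h>
> > > >
> > > > /* Begin a background DMA read from the boot device into buf. */
> > > > enum cb_err spi_dma_start(const struct region_device *rdev,
> > > >                           size_t offset, void *buf, size_t size);
> > > >
> > > > /* Block until the outstanding transfer has completed. */
> > > > enum cb_err spi_dma_wait(void);
> > > >
> > > > /* Placeholder for whatever init work overlaps the transfer. */
> > > > void run_other_init_work(void);
> > > >
> > > > /* Caller side stays strictly serial and easy to measure: */
> > > > void example(const struct region_device *rdev, void *buf, size_t size)
> > > > {
> > > >         spi_dma_start(rdev, 0, buf, size);
> > > >         run_other_init_work();  /* overlaps with the transfer */
> > > >         spi_dma_wait();         /* buf guaranteed filled after this */
> > > > }
> > > > ```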
> > >
> > > That was my original thought, but there are plenty of other things
> > > that can be modeled as asynchronous operations. Stuffing it into the
> > > rdev API felt wrong.
> > >
> > > Thanks,
> > > Raul
> > >
> > > p.s.,
> > >
> > > Just to add some more data: I performed an experiment where I switched
> > > the payload compression from LZMA to LZ4. Since we no longer pay the
> > > SPI read cost (thanks async!), we essentially load the payload for
> > > free:
> > >
> > > ## Payload LZMA compression (current)
> > > ```
> > > Name                           Offset     Type           Size
> > > fallback/payload               0x224dc0   simple elf     131417 (128 KiB)
> > > ```
> > > ```
> > >   90:starting to load payload                          1,123,962 (13)
> > >   15:starting LZMA decompress                          1,123,967 (5)
> > >   16:finished LZMA decompress                          1,131,201 (7,234)
> > > ```
> > >
> > > ## Payload LZ4 compression
> > > ```
> > > Name                           Offset     Type           Size
> > > fallback/payload               0x224dc0   simple elf     173233 (169 KiB)
> > > ```
> > >
> > > ```
> > >   90:starting to load payload                          1,130,877 (12)
> > >   17:starting LZ4 decompress                           1,130,882 (5)
> > >   18:finished LZ4 decompress                           1,131,091 (209)
> > > ```
> > >
> > > ```
> > > Payload size increase:
> > >   KiB: 169 - 128 = 41 KiB
> > >   %: (169 - 128) / 128 = 0.3203125 (~ +32%)
> > >
> > > Decompression time decrease:
> > >   us: 209 - 7234 = -7025 us
> > >   %: (209 - 7234) / 7234 = -0.9711086535803152 (~ -97%)
> > > ```