On Thu, Jan 11, 2024 at 8:46 AM Laszlo Ersek <ler...@redhat.com> wrote:
>
> On 1/10/24 22:50, Pedro Falcato wrote:
> > On Wed, Jan 10, 2024 at 1:45 PM Laszlo Ersek <ler...@redhat.com>
> > wrote:
> >>
> >> (+ Andrew)
> >>
> >> On 1/10/24 14:09, Laszlo Ersek wrote:
> >>
> >>> I think that keeping the depex section read-only is valuable, so I'd
> >>> rule out #2. I'd also not start with option #1 -- copying the depex
> >>> to heap memory, just so this optimization can succeed. I'd simply
> >>> start by removing the optimization, and measuring how much driver
> >>> dispatch slows down in practice, on various platforms.
> >>>
> >>> Can you try this? (I have only build-tested and "uncrustified" it.)
> >>>
> >>> The patch removes the EFI_DEP_REPLACE_TRUE handling altogether, plus
> >>> it CONST-ifies the Iterator pointer (which points into the DEPEX
> >>> section), so that the compiler catches any possible accesses at *build
> >>> time* that would write to the write-protected DEPEX memory area.
> >>
> >> On a tangent: the optimization in question highlights a more general
> >> problem, namely that the DXE (and possibly MM/SMM) protocol databases
> >> are considered slow, for some access patterns.
> >>
> >> Edk2 implements those protocol databases with linked lists, where
> >> lookup costs O(n) operations (average and worst cases). And protocol
> >> lookups are quite frequent (for example, in depex evaluations, they
> >> could be considered "particularly frequent").
> >>
> >> (1) The "Tasks" wiki page mentions something *similar* (but not the
> >> same); see
> >>
> >> https://github.com/tianocore/tianocore.github.io/wiki/Tasks#datahub--gcd-scalability
> >>
> >> The description is: "The DXE core's DataHub and GCD (Global Coherency
> >> Domain) layers don't scale well as the number of data items gets
> >> large, since they are based on simple linked lists. Find better data
> >> structures."
> >
> > How large do they usually get? What's the worst case?
>
> No idea.
>
> >
> >>
> >> The same might apply more or less to the protocol database
> >> implementation.
> >>
> >> (2) More to the point, Andrew Fish reported earlier that at Apple,
> >> they had rewritten the DXE protocol database, using the red-black
> >> tree OrderedCollectionLib that I had contributed previously to edk2
> >> -- and they saw significant performance improvements.
> >>
> >> So upstreaming that feature to edk2 could be very valuable.
> >> (Red-black trees have O(log(n)) time cost (worst case) for lookup,
> >> insertion and deletion, and O(n) cost for iterating through the whole
> >> data structure.)
> >
> > FWIW, can we do better than an RB tree? They're notoriously cache
> > unfriendly...
>
> Sure, if someone contributes a different data structure that is suitable
> for the task :)

Hey, I happen to have written one!
https://github.com/heatd/Onyx/blob/master/kernel/kernel/radix.cpp
It just needs some C'ifying, then some EFI'ing on top, but I'm fairly
confident in its stability.
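
For reference, once C'ified/EFI'ified, the interface could end up
looking something like this (purely a sketch; RADIX_TREE,
RadixTreeInsert and RadixTreeGet are hypothetical names, not what the
linked code exports today):

/* Hypothetical C interface: keys are UINTN indices, values are opaque
   pointers. */
typedef struct RADIX_TREE RADIX_TREE;

EFI_STATUS
RadixTreeInsert (
  IN OUT RADIX_TREE  *Tree,
  IN     UINTN       Index,
  IN     VOID        *Value
  );

/* Returns EFI_NOT_FOUND if no value is stored at Index. */
EFI_STATUS
RadixTreeGet (
  IN  RADIX_TREE  *Tree,
  IN  UINTN       Index,
  OUT VOID        **Value
  );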

>
> RB trees may be cache unfriendly, but the functionality they provide
> with O(log(n)) worst case performance is amazing. You can use them as
> read-write associative arrays, sorted lists supporting forward and
> backward traversal (in fact tools for sorting), priority queues, etc.
>
> When I contributed the code, edk2 didn't have any associative array
> type, so something generic that would address the widest range of use
> cases looked like a good idea. (And indeed the library has been well
> applied in several of those use cases since, in various parts of edk2 --
> for sorting, uniqueness-checking, async interface token tracking &
> lookups.)

Definitely, I use it in Ext4Dxe too, it's great! (Although I do have a
small nit: it allocates the nodes itself and there's no way around
that, but oh well...)

>
> This is not an argument against a more suitable data structure of
> course. Just pointing out that the RB tree has worked well thus far.
> E.g., under the BZ link below, Andrew mentions a diagnostic tool that
> creates 3000 handles. Looking up an element in a 3000-element list would
> cost 1500 iterations on average; using a balanced binary search tree it
> might cost ~11 iterations. Assuming that linked lists and linked binary
> search trees are similarly cache-unfriendly, that's an improvement by
> a factor of ~136.

I would guess that binary trees are more cache- and
speculation-unfriendly than linked lists, since at each node you need
to decide which branch you're going down. But yes, even an RB tree
would be a great improvement over the linked list.
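
Just to make the comparison concrete, here is roughly what a GUID-keyed
protocol database on top of OrderedCollectionLib could look like
(PROTOCOL_ENTRY and its fields below are made up for illustration, they
are not the real DxeCore structures):

#include <Uefi.h>
#include <Library/BaseMemoryLib.h>
#include <Library/OrderedCollectionLib.h>

typedef struct {
  EFI_GUID      ProtocolId;   /* key */
  LIST_ENTRY    Interfaces;   /* interfaces installed for this GUID */
} PROTOCOL_ENTRY;

STATIC
INTN
EFIAPI
ProtocolKeyCompare (
  IN CONST VOID  *StandaloneKey,   /* EFI_GUID* */
  IN CONST VOID  *UserStruct       /* PROTOCOL_ENTRY* */
  )
{
  /* Byte-wise GUID comparison is enough to define a total order. */
  return CompareMem (
           StandaloneKey,
           &((CONST PROTOCOL_ENTRY *)UserStruct)->ProtocolId,
           sizeof (EFI_GUID)
           );
}

STATIC
INTN
EFIAPI
ProtocolUserCompare (
  IN CONST VOID  *UserStruct1,
  IN CONST VOID  *UserStruct2
  )
{
  return ProtocolKeyCompare (
           &((CONST PROTOCOL_ENTRY *)UserStruct1)->ProtocolId,
           UserStruct2
           );
}

/* Created once, e.g. in the DXE core entry point:
     mProtocolDatabase = OrderedCollectionInit (ProtocolUserCompare,
                           ProtocolKeyCompare);  */
STATIC ORDERED_COLLECTION  *mProtocolDatabase;

/* O(log n) lookup by protocol GUID. */
STATIC
PROTOCOL_ENTRY *
FindProtocolEntry (
  IN CONST EFI_GUID  *Protocol
  )
{
  ORDERED_COLLECTION_ENTRY  *Entry;

  Entry = OrderedCollectionFind (mProtocolDatabase, Protocol);
  if (Entry == NULL) {
    return NULL;
  }

  return OrderedCollectionUserStruct (Entry);
}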

>
> >
> >>
> >> Let me see if I can find the bugzilla ticket...
> >>
> >> Ah, I got it. Apologies, I misremembered: Andrew's comment was not
> >> about the protocol database, but about the handle database. Here it
> >> is:
> >>
> >> https://bugzilla.tianocore.org/show_bug.cgi?id=988#c7
> >>
> >> (the BZ is still CONFIRMED btw...)
> >>
> >> Still, I think it must be related in a way. Namely, an EFI handle
> >> exists if and only if at least one protocol interface is installed on
> >> it. If you uninstall the last protocol interface from a handle, then
> >> the handle is destroyed -- in fact that's the *only* way to destroy a
> >> handle, to my understanding. See
> >> EFI_BOOT_SERVICES.UninstallProtocolInterface() in the UEFI spec: "If
> >> the last protocol interface is removed from a handle, the handle is
> >> freed and is no longer valid". Furthermore, calling
> >> InstallProtocolInterface() and InstallMultipleProtocolInterfaces() is
> >> how one *creates* new handles.
> >>
> >> So given how handles and protocol interfaces are conceptually
> >> interlinked, an rbtree-based protocol DB might have to maintain
> >> multiple rbtrees internally (for the ability to search the database
> >> quickly with different types of "keys"). I don't have a design ready
> >> in my mind at all (I'm not that familiar with the *current*,
> >> list-based implementation to begin with!). Upstreaming Apple's
> >> (experimental?) code could prove very helpful.
> >
> > Ok, some thoughts:
> >
> > For the handle database, if we made EFI_HANDLE an integer instead, we
> > could very easily use something similar to a radix tree (see Linux's
> > xarray). This would provide O(log(n)) lookups and insertion with a
> > much higher fan-out, without needing CoreValidateHandle's awful O(n)
> > lookup every time it gets an EFI_HANDLE. You'd just look up the handle
> > in the tree and if NULL, it's invalid. [1]
>
> EFI_HANDLE is a typedef to (VOID*) per UEFI spec -- section 2.3.1 "Data
> Types" says: "A collection of related interfaces. Type VOID *" --, so
> EFI_HANDLE can't be redefined, it would break compatibility with
> everything.
>
> Internally, it could be cast to UINTN, and then those 32- or 64-bit long
> bit strings could be considered as keys for a radix tree. But I'm not
> sure if this would still be as useful as the two-level radix tree (with
> a fanout of 64) that you describe.
>
> (If we wanted to map the EFI_HANDLEs to the index space [0..4095], then
> we'd just land back at the original problem: storing the EFI_HANDLEs in
> an associative data structure, for example in a hash map. We could find
> a good hash function, but because it would mathematically reduce the key
> space from 64-bit pointers to 12-bit indices, collisions would be
> theoretically inevitable, and the data structure couldn't avoid
> *storing* the mapping. This is not a general argument against using a
> hash table, it just means that we'd have to provide a good initial table
> size, and then deal with resizing cost whenever necessary.)

My idea was to make each handle an index - like a file descriptor -
AFAIK there's no reason why it *needs* to be an actual pointer.
We'd allocate indices when creating a handle, and return that (cast to VOID*).

Say, for a naive allocation scheme and the data structure I linked above:

/* 0 (NULL) must stay reserved, since NULL is the "bad EFI_HANDLE"
   value. We could also return HandleId + 1, and subtract 1 when
   indexing back into the radix tree. Either option works. */
STATIC UINTN       mNextHandleId = 1;
STATIC radix_tree  mHandleTree;   /* the C++ radix tree linked above */

STATIC
IHANDLE *
CreateHandle (
  OUT UINTN  *OutHandleId
  )
{
  IHANDLE  *Handle;

  Handle = AllocatePool (sizeof (IHANDLE));
  if (Handle == NULL) {
    return NULL;
  }

  *OutHandleId = mNextHandleId++;
  mHandleTree.insert (*OutHandleId, (UINTN)Handle);
  return Handle;
}

/* And then transform CoreValidateHandle into: */
STATIC
EFI_STATUS
CoreGetHandle (
  IN  EFI_HANDLE  Handle,
  OUT IHANDLE     **OutHandle
  )
{
  if (Handle == NULL) {
    return EFI_INVALID_PARAMETER;
  }

  expected<unsigned long, int> Result = mHandleTree.get ((UINTN)Handle);
  if (Result.has_error ()) {
    return EFI_INVALID_PARAMETER;   /* Handle not in the tree */
  }

  *OutHandle = (IHANDLE *)Result.value ();
  return EFI_SUCCESS;
}

--

Hopefully the snippet gets my point across. Insertion, searching and
iteration should all be super fast (O(log_FanOut(N))).
Note that one could also just use a standard resizable array (à la
std::vector) indexed by handle ID, but I don't know how well that maps
to EFI handles and memory allocation.
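
For completeness, that flat-array variant could be as simple as (again,
the names here are made up):

/* Hypothetical growable flat table; the handle value is Index + 1 so
   that NULL stays reserved as the invalid handle. */
STATIC IHANDLE  **mHandleTable;      /* reallocated as it grows */
STATIC UINTN    mHandleTableCount;   /* slots currently allocated */

STATIC
EFI_STATUS
CoreGetHandleFlat (
  IN  EFI_HANDLE  Handle,
  OUT IHANDLE     **OutHandle
  )
{
  UINTN  Index;

  if (Handle == NULL) {
    return EFI_INVALID_PARAMETER;
  }

  Index = (UINTN)Handle - 1;
  if ((Index >= mHandleTableCount) || (mHandleTable[Index] == NULL)) {
    return EFI_INVALID_PARAMETER;
  }

  *OutHandle = mHandleTable[Index];
  return EFI_SUCCESS;
}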

>
> > For the protocol database, you'd replace the linked list with a simple
> > hashtable, hashed by protocol. Something as simple as LIST_ENTRY
> > mProtocolHashtable[64]; would probably be enough to fan out most of
> > the problems (I think? How many different protocols are there?)
>
> I can answer this question reasonably well, I think. I have a script
> that collects symbolic names of GUIDs from the edk2 tree (plus hardcodes
> a number of well-known non-edk2 GUIDs), and creates a "sed" script
> out of them. Then another script uses this "sed" script for filtering
> edk2 logs -- the idea being to replace the whole bunch of logged GUIDs
> with their symbolic names. That makes logs much easier to read.
>
> The generator script is written in such a way that the generated "sed"
> script only grows over time; put differently, this "dictionary" of
> name<->GUID associations never forgets, it only picks up new GUIDs. The
> "sed" script (= the dictionary file) consists of entries like
>
> s:FFB19961-79CC-4684-84A8-C31B0A2BBE82:[EarlyFdt16550SerialPortHookLib]:ig
> s:FFB56F09-65E3-4462-A799-2F0D1930D38C:[DxeContextBufferModuleConfigLib]:ig
> s:FFE06BDD-6107-46A6-7BB2-5A9C7EC5275C:[EfiAcpiTableProtocol]:ig
> s:FFF12B8D-7696-4C8B-A985-2747075B4F50:[EfiSystemNvDataFv]:ig
>
> it's sorted uniquely by GUID.
>
> Right now, it has 3074 entries. (People like generating GUIDs! :) In
> PI/UEFI/edk2, *everything* is a GUID, not just protocols!)
>
> So using 64 buckets would hash ~50 different protocol GUIDs to any given
> bucket (although, again, not all of those three thousand GUIDs are
> *protocol* GUIDs). 1024 buckets might work better.

Oh! That's nice! Taking your protocol figure of 515 into account, 512
or even 1024 buckets should probably take care of that.
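
Something along these lines, say (the bucket count and the hash itself
are just for illustration):

#define PROTOCOL_HASH_BUCKETS  1024   /* power of two, so masking works */

STATIC LIST_ENTRY  mProtocolHashTable[PROTOCOL_HASH_BUCKETS];

STATIC
UINTN
ProtocolGuidHash (
  IN CONST EFI_GUID  *Guid
  )
{
  CONST UINT32  *Words;

  /* GUIDs are essentially random bits already, so XOR-folding the four
     32-bit words spreads them well enough for this purpose. */
  Words = (CONST UINT32 *)Guid;
  return (Words[0] ^ Words[1] ^ Words[2] ^ Words[3]) &
         (PROTOCOL_HASH_BUCKETS - 1);
}

/* A lookup then only walks the handful of protocol entries linked into
   mProtocolHashTable[ProtocolGuidHash (Protocol)], instead of the whole
   protocol list. */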

I should note that I find it super hard to get a concrete idea of EFI
firmware performance without adequate tooling - I meant to write a
standalone flamegraph tool a few weeks back (I even posted about it on
edk2-devel), but, as far as I could tell, the architectural timer
protocol couldn't get me the interrupt frame[1]. Until then, it's hard
to say whether any of this radix tree vs RB tree vs flat array stuff
really matters.

[1] x86 has 3 timers (PIT, LAPIC timer, HPET) and performance
monitoring interrupts, and I couldn't freely use any of them :^)

-- 
Pedro

