On Fri, May 27, 2005 at 01:47:37PM +0200, Danny van Dyk wrote:
> Hi Brian
> > What's the gain, aside from implication of collapsing it into a
> > single file? Honestly my only use for metadata.xml is looking up who
> > I get to poke about fixing broken ebuilds...
> The gain is:
> ... that you portage people could use it for emerge -s instead of using
> a DESCRIPTION-cache.
'you portage people' ? :)

> ... we don't need to find the metadata.xml file before parsing it.

Portage's emerge -s doesn't use metadata.xml. Guessing you meant emerge -S
(--searchDesc), but that doesn't use metadata.xml either. So, a few
implications in what you mean/are after, then.

1) This global description cache would have to be duplicated, and
recreated on cvs->rsync runs. Why? Unless you're padding extra bytes in
the description cache, updates _will_ kill performance. Personally, I'm
not much for it, because there is a minimal window for cvs->rsync
infra-side to get its thing done, and this will jack up the runtime.

2) You're still going entry by entry. Y'all are assuming that shoving this
data into one file is going to make reads quicker; in reality, you're
still reading 19000+ records, your solution just reads them out of a
single file. This may be quicker due to reduced syscall overhead, but I
posit the drawbacks aren't worth it.

3) This complicates the hell out of cache updates, and still suffers the
same issues eix/esearch suffer- namely, it's not sensitive to cache
updates. If we make it sensitive to cache updates, you're looking at
regen runtimes going through the roof (see the comment on updates in #1).
That holds regardless of whether it's a duplication approach or the
description is stored in its own db outside of the normal flat_list cache
files.

4) This proposal breaks the cache up into separate chunks. That's the
cache backend's decision, frankly, and _cannot_ be imposed onto the cache
backend implementation from above. I moved eclass data into the cache
backend in cvs head explicitly to allow the cache to be effectively
standalone, and able to be bound to a remote tree. If you force this
change from above, it breaks the cache design (pure and simple), and
ultimately isn't what you're after (see below).

Frankly, any comments that this is going to make things faster are
ignoring the existing code. Why is emerge -S so damned slow?
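To illustrate point 2), the per-record search looks roughly like this
(a minimal sketch, not portage's actual code; FlatCache, get_record, and
search_descriptions are hypothetical names for illustration):

```python
# Toy model of a flat cache: one metadata record per cpv. Collapsing all
# records into a single file only changes where get_record's bytes come
# from; a description search still visits every record.

class FlatCache:
    """Hypothetical cache holding one metadata dict per cpv."""
    def __init__(self, records):
        self._records = records  # dict: cpv -> {"DESCRIPTION": ...}

    def iterkeys(self):
        return iter(self._records)

    def get_record(self, cpv):
        # In a real flat-file cache this would be an open()/read() per
        # cpv; syscall overhead changes, the record-by-record walk doesn't.
        return self._records[cpv]

def search_descriptions(cache, term):
    """emerge -S style search: walk every cpv, match DESCRIPTION."""
    matches = []
    for cpv in cache.iterkeys():
        if term.lower() in cache.get_record(cpv)["DESCRIPTION"].lower():
            matches.append(cpv)
    return matches

cache = FlatCache({
    "app-editors/vim-6.3": {"DESCRIPTION": "Vim, an improved vi-style editor"},
    "app-editors/nano-1.2": {"DESCRIPTION": "GNU GPL'd Pico clone"},
    "sys-apps/portage-2.0": {"DESCRIPTION": "Portage package manager"},
})
print(search_descriptions(cache, "editor"))  # -> ['app-editors/vim-6.3']
```

With 19000+ cpvs that loop runs 19000+ times no matter how the records
are laid out on disk.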
Better question: why is a mysql cache backend _still_ so damned slow on
emerge -S? That should be hella fast compared to opening 19000 files,
right?

Because the current stable cache design allows *only* for individual
record lookups. In other words, even with an rdbms implementation, it
goes record by record. What is needed is a way to hand off to the cache:
"hey you, give me all cpv's that have metadata matching this criteria".
Move the lookup/searching into the cache backend, which is already built
into the cache refactoring I wrote for cvs head. If you want to collapse
all of the description data into some faster lookup, fine, do so
_strictly_ within that cache backend, and modify that class so it has an
appropriate get_matches lookup that's able to do a specific metadata
lookup faster.

People are free to disagree, mind you, but this talk of speed gains
frankly seems to be missing the boat on how our cache actually works,
let alone the issues with it. Collapsing all metadata down into a single
file? Yeah, that would be nifty from the standpoint of fewer
files/wasted space on fs's. A centralized DESCRIPTION cache implemented
in xml? Eh...

~brian
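[For illustration: the "move the search into the backend" idea above
could be sketched like this. The mail only names get_matches; the class
shape, signature, and sqlite backend here are my own hypothetical
example, not portage's actual cache API.]

```python
# Sketch of a cache backend that answers "all cpvs whose metadata
# matches this criteria" itself, so an rdbms backend can use one query
# (and an index) instead of a record-by-record walk driven from above.

import sqlite3

class SQLCache:
    """Hypothetical rdbms-backed cache with search pushed into the backend."""
    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE metadata (cpv TEXT PRIMARY KEY, description TEXT)")

    def set_record(self, cpv, description):
        self.db.execute("INSERT OR REPLACE INTO metadata VALUES (?, ?)",
                        (cpv, description))

    def get_matches(self, description_substring):
        # The backend decides how to search; here it is a single LIKE
        # query rather than a python loop over every record.
        cur = self.db.execute(
            "SELECT cpv FROM metadata WHERE description LIKE ?",
            ("%" + description_substring + "%",))
        return [row[0] for row in cur]

cache = SQLCache()
cache.set_record("app-editors/vim-6.3", "Vim, an improved vi-style editor")
cache.set_record("sys-apps/portage-2.0", "Portage package manager")
print(cache.get_matches("editor"))  # -> ['app-editors/vim-6.3']
```

A flat-file backend would implement the same get_matches with a loop; the
point is that the caller no longer dictates record-by-record access.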