On Fri, May 27, 2005 at 01:47:37PM +0200, Danny van Dyk wrote:
> Hi Brian
> > What's the gain, aside from implication of  collapsing it into a 
> > single file?  Honestly my only use for metadata.xml is looking up who 
> > I get to poke about fixing broken ebuilds...
> The gain is:
> ... that you portage people could use it for emerge -s instead of using
>     a DESCRIPTION-cache.

'you portage people' ? :)

> ... we don't need to find the metadata.xml file before parsing it.

Portage's emerge -s doesn't use metadata.xml.  Guessing you meant 
emerge -S (--searchdesc), but that doesn't use metadata.xml either.

So, a few implications of what you're after, then.
1) This global description cache would have to be duplicated, and 
recreated on cvs->rsync runs.  Why?  Unless you're padding extra bytes 
in the description cache, updates _will_ kill performance.  
Personally, I'm not much for it because there is a minimal window for 
cvs->rsync infra-side to get its thing done, and this will jack up 
the runtime.

2) You're still doing entry by entry.  Y'all are assuming having this 
data shoved into one file is going to make it quicker for reads (in 
reality, you're still reading 19000+ records, your solution just reads 
them out of a single file).  This may be quicker due to lower syscall 
overhead, but I posit the drawbacks aren't worth it.

3) This complicates the hell out of cache updates, and still suffers 
the same issue eix/esearch suffer from, namely that it's not sensitive 
to cache updates.  If we make it sensitive to cache updates, you're 
looking at regen runtimes going through the roof (see the comment on 
updates in #1).  This holds regardless of whether it's a duplication 
approach or the description is stored in its own db outside of the 
normal flat_list cache files.

4) This proposal breaks the cache up into separate chunks.  That's 
the cache backend's decision frankly, and _cannot_ be imposed onto the 
cache backend implementation from above.

I moved eclass data into the cache backend in cvs head explicitly 
for the purpose of allowing the cache to be effectively standalone, 
and able to be bound to a remote tree.  You force this change from 
above, it breaks the cache design (pure and simple), and ultimately 
isn't what you're after (see below).
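
To put something concrete behind the padding remark in 1): a rough 
sketch, not anything portage ships.  The fixed width (RECORD_SIZE) and 
the one-record-per-cpv file layout are assumptions made up for the 
example.  With padded records an update is a seek plus one write; 
without padding, a description that changes length shifts everything 
after it, so the whole file gets rewritten.

```python
RECORD_SIZE = 128  # hypothetical fixed width per description, in bytes


def _pad(desc):
    """Encode a description and space-pad (or truncate) it to RECORD_SIZE."""
    data = desc.encode("utf-8")[:RECORD_SIZE]
    return data.ljust(RECORD_SIZE, b" ")


def write_all(path, descriptions):
    """Write every description as one fixed-width record."""
    with open(path, "wb") as f:
        for desc in descriptions:
            f.write(_pad(desc))


def update_record(path, index, desc):
    """Overwrite one record in place; no other record has to move."""
    with open(path, "r+b") as f:
        f.seek(index * RECORD_SIZE)
        f.write(_pad(desc))


def read_record(path, index):
    """Read one record back and strip the padding."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_SIZE)
        raw = f.read(RECORD_SIZE)
    # "replace" guards against a multi-byte character cut by truncation.
    return raw.rstrip(b" ").decode("utf-8", "replace")
```

The cost is wasted space and a hard cap on description length, which 
is the tradeoff the padding buys you.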


Frankly, any comments that this is going to make things faster are 
ignoring the existing code.  Why is emerge -S so damned slow?

Better question, why is it that a mysql cache backend _still_ is so 
damned slow on emerge -S?  That should be hella fast compared to 
opening 19000 files, right?

Because the current stable cache design allows *only* for individual 
record lookups.  In other words, even with an rdbms implementation, it 
goes record by record.  What is needed is a way to hand off to the 
cache "hey you, give me all cpvs whose metadata matches these 
criteria".  

Move the lookup/searching into the cache backend; support for that is 
already built into the cache refactoring I wrote for cvs head.
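
A toy illustration of the difference, using sqlite3 as a stand-in for 
the rdbms.  The table layout, the class and the get_matches signature 
are all invented for the example and are not the cvs head code.  The 
record-by-record search issues one query per cpv; get_matches hands 
the criteria to the database and issues one.

```python
import sqlite3


class SqlCache:
    """Toy rdbms-backed cache; the schema is made up for illustration."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE metadata (cpv TEXT PRIMARY KEY, description TEXT)")

    def __setitem__(self, cpv, description):
        self.db.execute(
            "INSERT OR REPLACE INTO metadata VALUES (?, ?)",
            (cpv, description))

    def __getitem__(self, cpv):
        # Individual record lookup: all the caller gets in the old design.
        row = self.db.execute(
            "SELECT description FROM metadata WHERE cpv = ?",
            (cpv,)).fetchone()
        if row is None:
            raise KeyError(cpv)
        return row[0]

    def keys(self):
        return [r[0] for r in self.db.execute(
            "SELECT cpv FROM metadata ORDER BY cpv")]

    def get_matches(self, substring):
        """One query; the database does the filtering.

        LIKE wildcards in `substring` are not escaped in this sketch.
        """
        return [r[0] for r in self.db.execute(
            "SELECT cpv FROM metadata WHERE description LIKE ? ORDER BY cpv",
            ("%" + substring + "%",))]


def search_record_by_record(cache, substring):
    """Search from above the cache: one lookup (one query) per cpv."""
    needle = substring.lower()
    return [cpv for cpv in cache.keys() if needle in cache[cpv].lower()]
```

Both return the same cpvs for plain ASCII input; the difference is 
where the loop runs and how many round trips it takes to get there.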

If you want to collapse all of the description data into some faster 
lookup, fine, do so _strictly_ within that cache backend, and modify 
that class so that it has an appropriate get_matches lookup that's 
able to do a specific metadata lookup faster.
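
Roughly what I mean, as a sketch only.  The class names and the 
get_matches(key, substring) signature are hypothetical, and the 
"index" here is just an in-memory dict standing in for whatever 
collapsed storage the backend picks.  The point is that the collapsed 
description data is a private detail of one backend and is updated in 
the same write path as the record itself, so it can't drift from the 
cache.

```python
class FlatCache:
    """Toy per-record cache: searching means walking every record."""

    def __init__(self):
        self._records = {}

    def __setitem__(self, cpv, metadata):
        self._records[cpv] = dict(metadata)

    def __getitem__(self, cpv):
        return self._records[cpv]

    def __delitem__(self, cpv):
        del self._records[cpv]

    def get_matches(self, key, substring):
        """Generic fallback: check `key` in every record."""
        needle = substring.lower()
        return sorted(cpv for cpv, md in self._records.items()
                      if needle in md.get(key, "").lower())


class DescIndexedCache(FlatCache):
    """Same interface; keeps collapsed DESCRIPTION data internally."""

    def __init__(self):
        super().__init__()
        self._desc = {}  # cpv -> lowercased DESCRIPTION, backend-private

    def __setitem__(self, cpv, metadata):
        super().__setitem__(cpv, metadata)
        # Index is updated together with the record, never separately.
        self._desc[cpv] = metadata.get("DESCRIPTION", "").lower()

    def __delitem__(self, cpv):
        super().__delitem__(cpv)
        self._desc.pop(cpv, None)

    def get_matches(self, key, substring):
        if key != "DESCRIPTION":
            return super().get_matches(key, substring)
        needle = substring.lower()
        return sorted(cpv for cpv, d in self._desc.items() if needle in d)
```

Callers only ever see get_matches; whether a backend answers it from 
a collapsed file, an sql index or a full walk is its own business.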

People are free to disagree mind you, but this talk of speed gains 
frankly seems to be missing the boat on how our cache actually works, 
let alone the issues with it.

Collapsing all metadata down into a single file, yeah that would be 
nifty from the standpoint of fewer files/wasted space on fs's.  
Centralized DESCRIPTION cache implemented in xml?  Eh...
~brian
