On Tuesday 25 October 2005 09:23, Richard Freeman wrote:
> John Myers wrote:
> > I designed a system that took feedback from consenting users, sending
> > the file lists back to my server, where I was going to do some data
> > crunching. The data from just _my_ system was over 60 MB.
>
> It sounds like you really only need to index each package a few times at
> most.  Sure, the raw data from a user could be 60MB each, but there are
> some ways to reduce that significantly:
Hm. I forgot to mention that the largest pieces (the file names and the
md5sums) are only stored once, and then referenced by a relatively small
integer (small compared to the size of, say, a file name).
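
For illustration, here's a minimal sketch of the interning scheme (the
names are made up for the example, not my actual code):

    # Each distinct filename or md5sum is stored once; everything else
    # refers to it by a small integer ID.
    interned = {}

    def intern(value):
        """Return a stable small integer ID for value."""
        if value not in interned:
            interned[value] = len(interned)
        return interned[value]

    # A per-file row then carries two small ints instead of two long strings:
    row = (intern("/usr/bin/emerge"),
           intern("d41d8cd98f00b204e9800998ecf8427e"))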

Here's how it breaks down:
        table         |  rows   |  size
----------------------+---------+--------
ebuilds               |     994 | 118.3K
filenames             | 381,200 |  27.1M
file info             | 383,168 |  19.9M
installations list    |   1,007 |  26.7K
extra install data    |   1,007 |  88.2K
file->install mapping | 464,193 |  13.1M

The above data includes some reinstallations and upgrades.

> 1.  Don't send in data for anything in the base system install.
>
> 2.  As you populate your database, publish a list of indexed packages
> via a URL.  Users would exclude any packages you've already indexed.  If
> this were a GLEP you could probably put the file in the portage
> directory and everybody would get it via rsync.
>
> 3.  Start by only indexing each package ONCE.  Don't worry about every
> combo of arches, CFLAGS, USE, etc.  That means that most users wouldn't
> upload anything at all, and the rest would only send their unique
> contributions.
Interesting thoughts.
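
To illustrate #2, the client side could be pretty simple; a rough sketch,
where the URL and the one-package-per-line format are just assumptions
(there's no such service yet):

    import urllib.request

    def unindexed(local_packages,
                  url="http://example.org/indexed-packages.txt"):
        """Return only the packages the server hasn't indexed yet."""
        with urllib.request.urlopen(url) as resp:
            indexed = set(resp.read().decode().split())
        return [pkg for pkg in local_packages if pkg not in indexed]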

> If you get everything working without indexing by USE, you could start
> adding that capability in.  Publish in #2 the list of USE flags indexed
> for each package, and individuals would only upload packages compiled
> with something that wasn't on that list.
>
> Sure, the final database could easily be 100MB or so, but if you just
> put it on a website you won't be sending the whole thing.  Just put it
> in mysql/postgres and build a php front end (sorry, not a web dev
> personally, but it isn't that hard to do from the little I've messed
> with it).
That's what the intention was, maybe with an XML-RPC service for a
command-line client to use. The data is stored in a MySQL database.
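
Something like this on the client side, say (the endpoint URL and method
name here are hypothetical):

    import xmlrpc.client

    server = xmlrpc.client.ServerProxy("http://example.org/filedb/RPC2")
    # e.g. ask which packages install a given file:
    print(server.lookup_file("/usr/bin/emerge"))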
>
> Sorry - I don't intend to make it sound like the whole thing can be done
> in 5 minutes, and I'm sure you've already poured hours into your effort.
>  However, I don't see any theoretical issues with it as long as the
> design is right.  The important thing is that users are only uploading
> diffs against your master repository - and not doing a complete dump of
> their entire system.  Otherwise you will get buried in data!
The biggest problem is that there are a lot of potential variations (arch,
CFLAGS, USE combinations), and they all really need to be represented for
this to be useful.
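
One way to keep that manageable might be to key each report by
(package, arch, sorted USE flags), so identical configurations collapse
into a single entry; purely illustrative:

    def variant_key(package, arch, use_flags):
        """Canonical key for one build configuration."""
        return (package, arch, tuple(sorted(use_flags)))

    seen = set()
    key = variant_key("app-editors/vim-7.0.17", "x86", ["gtk", "perl"])
    if key not in seen:
        seen.add(key)  # only an unseen variant would need uploading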
>
> I must admit that it is easy to just talk about ideas like this - I
> really do want to commend you on the work you've undoubtedly already
> accomplished!  OSS projects require lots of hard work by many volunteers
> and it is all too easy for people like me to just sit back and nitpick
> what could be done better...
Well, I think I might hack around on this a little more.
