On Tuesday 25 October 2005 09:23, Richard Freeman wrote:
> John Myers wrote:
> > I designed a system that took feedback from consenting users, sending
> > the file lists back to my server, where I was going to do some data
> > crunching. The data from just _my_ system was over 60 MB.
>
> It sounds like you really only need to index each package a few times at
> most. Sure, the raw data from a user could be 60MB each, but there are
> some ways to reduce that significantly:

Hm. I forgot to mention that the largest pieces (the file names and the
md5sums) are only stored once, and then referenced with a relatively small
integer (compared to the size of, say, a file name).
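To give an idea of what that means structurally, here's a rough sketch
(sqlite3 purely for illustration - the real database is MySQL, and the table
and column definitions below are made up for the example, not copied from my
actual schema):

```python
# Sketch only: made-up table/column definitions illustrating the idea that
# the big strings (paths, md5sums) are stored once and everything else
# refers to them by a small integer key.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE filenames (
    id   INTEGER PRIMARY KEY,
    path TEXT UNIQUE                -- each file name stored exactly once
);
CREATE TABLE fileinfo (
    id          INTEGER PRIMARY KEY,
    filename_id INTEGER REFERENCES filenames(id),  -- integer, not the path
    md5sum      TEXT,               -- one row per distinct version of a file
    UNIQUE (filename_id, md5sum)
);
CREATE TABLE installations (
    id     INTEGER PRIMARY KEY,
    ebuild TEXT                     -- e.g. 'dev-lang/python-2.3.5'
);
CREATE TABLE file_install (         -- the big mapping is just pairs of ints
    fileinfo_id     INTEGER REFERENCES fileinfo(id),
    installation_id INTEGER REFERENCES installations(id)
);
""")
```

So the really big table, the file->install mapping, ends up holding nothing
but pairs of small integers.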
Here's how it breaks down:

table                 |    rows |   size
----------------------+---------+--------
ebuilds               |     994 | 118.3K
filenames             | 381,200 |  27.1M
file info             | 383,168 |  19.9M
installations list    |   1,007 |  26.7K
extra install data    |   1,007 |  88.2K
file->install mapping | 464,193 |  13.1M

There are some reinstallations and upgrades in the above data.

> 1. Don't send in data for anything in the base system install.
>
> 2. As you populate your database, publish a list of indexed packages
> via a URL. Users would exclude any packages you've already indexed. If
> this were a GLEP you could probably put the file in the portage
> directory and everybody would get it via rsync.
>
> 3. Start by only indexing each package ONCE. Don't worry about every
> combo of arches, CFLAGS, USE, etc. That means that most users wouldn't
> upload anything at all, and the rest would only send their unique
> contributions.

Interesting thoughts.

> If you get everything working without indexing by USE, you could start
> adding that capability in. Publish in #2 the list of USE flags indexed
> for each package, and individuals would only upload packages compiled
> with something that wasn't on that list.
>
> Sure, the final database could easily be 100MB or so, but if you just
> put it on a website you won't be sending the whole thing. Just put it
> in mysql/postgres and build a php front end (sorry, not a web dev
> personally, but it isn't that hard to do from the little I've messed
> with it).

That's what the intention was, maybe with an XML-RPC service for a
command-line client to use (there's a rough sketch of such a client at the
end of this mail). The data is stored in a MySQL database.

> Sorry - I don't intend to make it sound like the whole thing can be done
> in 5 minutes, and I'm sure you've already poured hours into your effort.
> However, I don't see any theoretical issues with it as long as the
> design is right. The important thing is that users are only uploading
> diffs against your master repository - and not doing a complete dump of
> their entire system. Otherwise you will get buried in data!

The biggest problem is that there are a lot of potential variations, and
they all really need to be there for this to be useful.

> I must admit that it is easy to just talk about ideas like this - I
> really do want to commend you on the work you've undoubtedly already
> accomplished! OSS projects require lots of hard work by many volunteers
> and it is all too easy for people like me to just sit back and nitpick
> what could be done better...

Well, I think I might hack around on this a little more.
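For instance, the client end of that XML-RPC service might look roughly like
this (a sketch only - the server URL and the method names are invented for
the example, nothing like them exists yet):

```python
# Sketch only: the URL and the XML-RPC method names below are invented
# for illustration; they are not part of any existing implementation.
try:
    from xmlrpclib import ServerProxy       # Python 2
except ImportError:
    from xmlrpc.client import ServerProxy   # Python 3

def upload_missing(server_url, local_packages):
    """Send only the packages the server hasn't indexed yet."""
    server = ServerProxy(server_url)
    # Hypothetical call returning the list Richard suggests publishing in #2.
    already_indexed = set(server.indexed_packages())
    for cpv, filelist in local_packages.items():
        if cpv in already_indexed:
            continue                         # server has it, skip the upload
        # Hypothetical call: filelist is a list of (path, md5sum) pairs.
        server.submit_filelist(cpv, filelist)

# Example call (placeholder values):
# upload_missing("http://example.org/filedb/RPC2",
#                {"dev-lang/python-2.3.5":
#                     [("/usr/bin/python2.3", "d41d8cd98f00b204e9800998ecf8427e")]})
```

That would also take care of the "only upload diffs" point above: the client
first asks the server what it already has and only sends the rest.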