On Sun, 2006-12-17 at 23:09 -0500, Konstantin Ryabitsev wrote:
> On 12/17/06, seth vidal <[EMAIL PROTECTED]> wrote:
> > we store the checksum of the old db (compressed and uncompressed) as
> > well as the new db file in repomd.xml
> >
> > yum would download repomd.xml - if the checksum of its sqlite db files
> > is the same as the old checksum, then it downloads the sqlite
> > transaction diff b/c it can use it. when it doesn't match then grab the
> > whole file.
>
> Actually, it doesn't have to be that way. You can keep the "diffs" all
> the way back to the first run of createrepo. E.g.:
>
> initial run:
> create primary.sqlite
>
> during the next run createrepo does effectively the same thing yum
> does and instead of blowing away the old primary.sqlite, it does
> INSERT/DELETE operations, while creating changes.sqlite, which
> contains a table something like:
>
> |createrepo run timestamp|action(add/delete)|pkgdata[....]|
>
> So, let's say the initial primary.sqlite run was at 0 unix seconds.
> Next time we run createrepo, we have updated pkgA to 1.1 and removed
> pkgA-1.0:
>
> createrepo run at 111111 unix seconds:
> |111111|add|pkgA-1.1|
> |111111|del|pkgA-1.0|
>
> createrepo run at 222222 unix seconds:
> |222222|add|pkgB-3.0|
> |222222|del|pkgB-2.5|
>
> createrepo run at 333333 unix seconds:
> |333333|add|pkgC-1.2|
> |333333|add|pkgB-2.6|
>
> ...
>
> primary.sqlite contains the timestamp when it was generated last. So,
> if clientA downloads primary.sqlite when it was at 111111 unix
> seconds, and then gets the changes.sqlite some time later, at 333333
> unix seconds, it knows exactly what happened to primary.sqlite between
> these two revisions and what it needs to do to get from 111111 to
> 333333.
>
> In other words, changes.sqlite contains the entire history of what
> happened to primary.sqlite between the time when it was first
> generated, and until the last createrepo run.
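
To make the replay step concrete, here is a rough sketch of how a client might apply the rows newer than its own timestamp. It is only an illustration under assumptions: the table name `changes`, its column names, and the helpers `make_changes_db` and `apply_changes` are made up for this example, and the local primary db is modelled as a plain set of package names rather than a real primary.sqlite.

```python
import sqlite3


def make_changes_db(rows):
    """Build an in-memory stand-in for changes.sqlite (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE changes (run_ts INTEGER, action TEXT, pkg TEXT)")
    db.executemany("INSERT INTO changes VALUES (?, ?, ?)", rows)
    db.commit()
    return db


def apply_changes(local_pkgs, local_ts, changes_db):
    """Replay every change recorded after local_ts onto a set of packages.

    Returns the updated set and the timestamp of the newest run applied.
    """
    pkgs = set(local_pkgs)
    newest = local_ts
    cur = changes_db.execute(
        "SELECT run_ts, action, pkg FROM changes "
        "WHERE run_ts > ? ORDER BY run_ts, rowid",
        (local_ts,),
    )
    for run_ts, action, pkg in cur:
        if action == "add":
            pkgs.add(pkg)
        elif action == "del":
            pkgs.discard(pkg)
        else:
            raise ValueError("unknown action: %r" % (action,))
        newest = max(newest, run_ts)
    return pkgs, newest


# The three createrepo runs from the example above.
HISTORY = [
    (111111, "add", "pkgA-1.1"),
    (111111, "del", "pkgA-1.0"),
    (222222, "add", "pkgB-3.0"),
    (222222, "del", "pkgB-2.5"),
    (333333, "add", "pkgC-1.2"),
    (333333, "add", "pkgB-2.6"),
]

# clientA fetched primary.sqlite when it was at 111111.
client_pkgs = {"pkgA-1.1", "pkgB-2.5"}
updated, ts = apply_changes(client_pkgs, 111111, make_changes_db(HISTORY))
print(sorted(updated), ts)
```

Only the rows from the 222222 and 333333 runs are replayed here, since the client already has the state as of 111111.
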
>
> If at some point changes.sqlite becomes larger than primary.sqlite,
> then it should be blown away and started over, because any
> bandwidth-saving benefits would be moot. The repomd.xml will contain
> no "changes.sqlite" entry, so clients will know to download the
> primary db. If they get the changes.sqlite after it's started all over
> again, it would be easy for them to "see" that the "initial primary
> run" timestamp is after the last timestamp they have on record for
> that repository (hence, continuity is broken), so they should discard
> the downloaded changes.sqlite and download the primary db to start the
> process over again.
>
> This might seem complex, but it really isn't. The database operations
> for createrepo are limited to 2 simple actions -- insert row and delete
> row, which are simple to record in changes.sqlite. Using timestamps
> should help clients track how many transactions from changes.sqlite they
> need to rerun to get the latest changes to the repository.
>
> This shouldn't be too hard to implement, and once done, the benefit
> for large repositories like fedora extras would be very significant,
> since that would cut down on both download size and parsing speed --
> the things everyone complains about the most.
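
The reset rule and the continuity check described above could look roughly like the following. This is a sketch, not a proposal for the actual code: the function names, the return strings, and the idea of passing the "initial primary run" timestamp in as a plain integer are all assumptions made for illustration.

```python
def should_reset_changes(changes_size, primary_size):
    """createrepo side: start changes.sqlite over once it outgrows primary.sqlite."""
    return changes_size > primary_size


def client_plan(repomd_files, local_ts, changes_initial_ts):
    """Client side: decide between replaying changes and a full download.

    repomd_files        names listed in repomd.xml (hypothetical representation)
    local_ts            timestamp of the client's primary.sqlite, or None
    changes_initial_ts  "initial primary run" timestamp read from the
                        downloaded changes.sqlite
    """
    if local_ts is None:
        # Nothing cached yet, so there is nothing to replay onto.
        return "download-primary"
    if "changes.sqlite" not in repomd_files:
        # createrepo just reset the history; no changes file is offered.
        return "download-primary"
    if changes_initial_ts > local_ts:
        # The history was restarted after our copy was made: continuity
        # is broken, so discard changes.sqlite and fetch the full db.
        return "download-primary"
    return "replay-changes"


print(client_plan({"primary.sqlite", "changes.sqlite"}, 111111, 0))
print(client_plan({"primary.sqlite", "changes.sqlite"}, 111111, 222222))
```

The first call replays changes because the history reaches back past the client's timestamp; the second falls back to the full download because the history was restarted at 222222, after the client's copy.
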

I'm curious how quickly the changes.sqlite files would get big, but it
seems like a worthwhile thing to try out.

Would you be interested in working on the above for createrepo?

-sv

_______________________________________________
Yum-devel mailing list
[email protected]
https://lists.dulug.duke.edu/mailman/listinfo/yum-devel
