Karl Trygve Kalleberg wrote:
So... this basically is applicable (at this point) to snapshots, since fundamentally that's what it works on. Couple of flaws/issues, though.
Tarball entries are rounded up to the nearest multiple of 512 bytes for the file size, plus an additional 512 for the tar header. If, for the zsync checksum index, you're using block sizes above (basically) 1 KiB, you lose the ability to update individual files. Actually, you already lost it, because zsync requires two matching blocks, side by side. So that's more bandwidth, beyond just pulling the control file.
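To put numbers on that rounding, here's a minimal sketch (the helper name is mine) of what a single file costs inside a tar stream, checked against Python's tarfile:

```python
import io
import tarfile

def tar_entry_size(content_len):
    # 512-byte header, plus data padded up to a 512-byte boundary
    data_blocks = (content_len + 511) // 512
    return 512 + data_blocks * 512

# a 1-byte file still costs a full KiB inside the archive
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    info = tarfile.TarInfo(name="tiny.ebuild")
    info.size = 1
    tf.addfile(info, io.BytesIO(b"x"))
# tarfile also appends the end-of-archive marker and pads the whole
# archive out to the 10 KiB record size, hence >= rather than ==
assert len(buf.getvalue()) >= tar_entry_size(1)
```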


Actually, packing the tree in squashfs _without_ compression shaved
about 800 bytes per file. Having a tarball of the porttree is obviously
plain stupid, as the overhead is about as big as the content itself.
'cept tarballs _are_ what our snapshots currently are, which is what I was referencing (I was pointing out why zsync is not going to play nice with tarballs). I haven't compared squashfs snapshots without compression delta-wise, but I'd expect they're slightly larger (diffball knows about tarfile structures, and as such can enforce 'locality' for better matches).

Or... just have the repository module run directly off of the tarball, with an additional pregenerated index of file -> offset. (That's a ways off, but something I intend to try at some point.)
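Rough sketch of that index idea, assuming an uncompressed tarball (function names are mine; tarfile exposes the payload position as TarInfo.offset_data):

```python
import tarfile

def build_index(path):
    """Map member name -> (data offset, size) for an uncompressed tar."""
    index = {}
    with tarfile.open(path, mode="r:") as tf:
        for member in tf:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index

def read_member(path, index, name):
    # pregenerated index means one seek + one read, no archive scan
    offset, size = index[name]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```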


Actually, I hacked portage to do this a few years ago. I generated a
.zip[1] of the portdir (was ~16MB, compared to ~120MB uncompressed). The
server maintained diffs in the following scheme:

- Full snapshot every hour
- Deltas hourly back 24 hours
- Deltas daily back a week
- Deltas weekly back two months
Elaborate; back from when, the current time/date? Or just version 'leaps', as it were? If you're recalculating the delta all the way back for each hour, the cost adds up.

When a user synced, he downloaded a small manifest
Define small, and what was in the manifest, please.

from the server,
telling him the size and contents of the snapshot and deltas. Based on
time stamps
What about issues with users' clocks being wacky? Yes, systems should have a correct clock, but rsync (with our current opts) doesn't rely on mtime checks (iirc). Of course, just pulling the last timestamp from the server addresses this...

he would locally calculate which deltas he would need to
fetch.
One failing I'd see with this is that in generating a *total* tree-snapshot-to-tree-snapshot delta, the unmatched files (files that are new, or cannot be mapped back via filepath to the older snapshot) can't be easily diffed. That can be worked around, though.

If the size of the deltas were >= size of the full snapshot, just
go for the new snapshot.
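That decision rule is simple enough to sketch; the manifest layout below is hypothetical (the original's format isn't described), but the "deltas vs. full snapshot" cutoff is as stated above:

```python
def plan_fetch(have, manifest):
    """Hypothetical manifest layout:
    manifest = {
        "snapshot": ("snap-20060101", 16_000_000),      # (name, size)
        "deltas": [("20051231", "20060101", 35_000)],   # (from, to, size)
    }
    Walk the delta chain forward from the version we have; if there is
    no usable chain, or it costs at least as much as the full snapshot,
    just fetch the snapshot."""
    chain, cur = [], have
    for frm, to, size in manifest["deltas"]:
        if frm == cur:
            chain.append((frm, to, size))
            cur = to
    delta_cost = sum(size for _, _, size in chain)
    snap_name, snap_size = manifest["snapshot"]
    if not chain or delta_cost >= snap_size:
        return [("snapshot", snap_name, snap_size)]
    return [("delta", "%s-%s" % (frm, to), size) for frm, to, size in chain]
```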

This system didn't use xdelta, just .zips, but it could.

Locally, everything was stored in /usr/portage.zip (but could be
anywhere), and I hacked portage to read everything straight out the .zip
file instead of the file system.
Sounds like one helluva hack :)

Whenever a package was being merged, the ebuild and all stuff in files/
was extracted, so that the cli tools (bash, ebuild script) could get at
them.
I'd wonder how to integrate gpg/md5'ing of the snapshot into that. Shouldn't be hard, but would be expensive without careful management (i.e., don't re-verify a repo that has already been verified once).
Offhand, this *should* be possible in a clean way with a bit of work.
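The "don't re-verify" part could look something like this; a minimal sketch, with the stamp-file scheme and paths being mine, and sha1 standing in for whatever digest the manifest actually carries:

```python
import hashlib
import os

def verified(path, expected_sha1, stampdir="/var/cache/snapshot-verify"):
    """Hash the snapshot only if we haven't already recorded a
    successful check for this exact digest + mtime pair."""
    stamp = os.path.join(stampdir, os.path.basename(path) + ".ok")
    tag = "%s %s" % (expected_sha1, int(os.stat(path).st_mtime))
    try:
        with open(stamp) as f:
            if f.read() == tag:
                return True  # already verified; skip the expensive hash
    except OSError:
        pass
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    if h.hexdigest() != expected_sha1:
        return False
    os.makedirs(stampdir, exist_ok=True)
    with open(stamp, "w") as f:
        f.write(tag)  # remember the successful check
    return True
```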


Performance was not really an issue, since already then there was some
caching going on. emerge -s, emerge <package>, emerge -pv world were not
appreciably slower. emerge metadata was :/ This may have changed by now,
and unfavourably so.
emerge metadata in cvs head *now* pretty much requires 2*nodes in the new tree: read from metadata/cache, translate it[1], dump it. While doing this, build up a dict of invalid metadata on the local system, and wipe it post metadata transfer. So... uncompressing a file and then interpreting it would likely be slower than the current flat-list approach (which is actually pretty speedy in .19 and head). External cache db? sqlite seems like overkill, and anydbm has concurrency issues for updates, but since the repo is effectively 'frozen' (the user can't modify the ebuild), anydbm should suffice.
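A frozen-repo cache along those lines, sketched with Python's dbm (anydbm in the Python of the day); the entry format here is made up for illustration:

```python
import dbm

def write_cache(path, entries):
    # entries: {"cat/pkg-ver": "KEY=value\n..."}. Generated once at
    # sync time, then treated as read-only, which sidesteps dbm's
    # concurrency-on-update issues.
    with dbm.open(path, "n") as db:
        for cpv, metadata in entries.items():
            db[cpv] = metadata

def read_metadata(path, cpv):
    with dbm.open(path, "r") as db:
        return db[cpv].decode()
```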

[1] eclass translation- stable stores eclass data per cache entry in two locations, eclass_db and the cache backend. Had quite a few bugs with this, and it's kind of screwy in design. Head stores *all* of that entry's eclass data in the cache backend; thus, going from metadata/cache's bare INHERITED="eutils" (fex), you have to translate it to a _full_ eclass entry for the cache backend, eutils\tlocation\tmtime (roughly; the code isn't in front of me).
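The translation step, as I understand it, is roughly this shape (function name is mine, and the exact field layout is a guess, per the "roughly" above):

```python
def expand_inherited(inherited, eclass_db):
    """Expand metadata/cache's bare INHERITED="eutils" into full
    per-eclass entries for the cache backend, of (roughly) the form
    name\tlocation\tmtime. eclass_db maps name -> (location, mtime)."""
    entries = []
    for name in inherited.split():
        location, mtime = eclass_db[name]
        entries.append("\t".join((name, location, str(mtime))))
    return "\t".join(entries)
```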

However, the patch was obviously rather intrusive, and people liked
rsync a lot, so it never went in.

However, sign me up for hacking on the
"sync module", whenever that's gonna happen.
gentoo-src/portage/sync <-- cvs head/main.

'transports' (fetchcommand/resumecommand) are also abstracted into
gentoo-src/transports/fetchcommand (iirc). There's also a bundled httplib/ftplib that needs to be put to better use in a binhost-refactored repository db, in
gentoo-src/transports/bundled_lib (again, iirc; atm I'm stuck in windows land due to the holidays).



The reason I'm playing around with zsync, is that it's a lot less
intrusive than my zipfs patch.
URL for the zipfs patch?

Essentially, it's a bolt-on that can be
added without modifying portage at all, as long as users don't use
"emerge sync" to sync.
emerge sync should use the sync module bound to each repository (not finished yet, but intended). The sync refactoring code that's already in cvs head is the start of this; each sync instance just has a common hook you call. So... emerge sync is viable, assuming an appropriate sync class could be defined.
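The intended shape is roughly the following; class and method names here are hypothetical, not the actual cvs-head API:

```python
class Syncer:
    """Common hook every sync backend implements."""
    def sync(self, location):
        raise NotImplementedError

class RsyncSyncer(Syncer):
    def sync(self, location):
        return "rsync -> %s" % location

class ZsyncSyncer(Syncer):
    def sync(self, location):
        return "zsync -> %s" % location

def emerge_sync(repos):
    # each repository carries its own syncer; "emerge sync" just
    # walks the list and calls the common hook
    return [syncer.sync(location) for location, syncer in repos]
```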

[1] .zips have a central directory, which makes it faster to search than
tar.gz.  Also, they're directly supported by the python library, and you
can read out individual files pretty easily. Any compression format with
similar properties would do, of course.
Was commenting on uncompressed tarballs with a pregenerated file -> offset lookup. Working within *one* compressed stream (which a tar.gz is) wasn't the intention; doing random seeks in it isn't really viable. Heading off any "use gzseek" suggestions from others: gzseek either reads forward, or resets the stream and starts from the ground up. Aside from that, tarballs, too, are directly supported (tarfile) :)
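For comparison, the zip side of the argument in a few lines; the member name is illustrative, and zipfile serves the listing straight from the central directory:

```python
import io
import zipfile

# build a tiny archive in memory
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("app-misc/foo/foo-1.0.ebuild", "DESCRIPTION=demo\n")

# the central directory makes listing and single-member reads cheap,
# even though each member's data is individually deflated
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()                          # from the central directory
    data = zf.read("app-misc/foo/foo-1.0.ebuild")  # one-member random access
```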
~brian
--
[email protected] mailing list