On Sun, 2018-01-28 at 15:01 +0800, Jason Zaman wrote:
> On Sat, Jan 27, 2018 at 12:24:39AM +0100, Michał Górny wrote:
> > Migrating mirrors to the hashed structure
> > -----------------------------------------
> > The hard link solution allows us to save space on the master mirror.
> > Additionally, if ``-H`` option is used by the mirrors it avoids
> > transferring existing files again.  However, this option is known
> > to be expensive and could cause significant server load.  Without it,
> > all mirrors need to transfer a second copy of all the existing files.
> >
> > The symbolic link solution could be more reliable if we could rely
> > on mirrors using the ``--links`` rsync option.  Without that, symbolic
> > links are not transferred at all.
>
> These rsync options might help for mirrors too:
>     --compare-dest=DIR      also compare destination files relative to DIR
>     --copy-dest=DIR         ... and include copies of unchanged files
>     --link-dest=DIR         hardlink to files in DIR when unchanged
>
> > Using hashed structure for local distfiles
> > ------------------------------------------
> > The hashed structure defined above could also be used for local distfile
> > storage as used by the package manager.  For this to work, the package
> > manager authors need to ensure that:
> >
> > a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
> >    directory where distfiles specific to the package are linked
> >    in a flat structure.
> >
> > b. All tools are updated to support the nested structure.
> >
> > c. The package manager provides a tool for users to easily manipulate
> >    distfiles, in particular to add distfiles for fetch-restricted
> >    packages into an appropriate subdirectory.
> >
> > For extended compatibility, the package manager may support finding
> > distfiles in flat and nested structure simultaneously.
> >
> trying nested first then falling back to flat would make it easy for
> users if they have to download distfiles for fetch-restricted packages
> because then the instructions stay as "move it to
> /usr/portage/distfiles".
> or alternatively the tool could have a mode which will go through all
> files in the base dir and move it to where it should be in the nested
> tree. then you save everything to the same dir and run edist --fix

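For illustration only, the nested-then-flat lookup and the "fix" mode
suggested above could be sketched roughly as follows.  This is a minimal
sketch under stated assumptions: the hash function (BLAKE2b), the
two-hex-digit subdirectory name, and the helper names ``nested_subdir``,
``find_distfile`` and ``fix_layout`` are all hypothetical and are not taken
from the proposal or from Portage; ``edist`` is likewise only the tool name
floated in the quoted text.

```python
import hashlib
import os


def nested_subdir(filename: str, digits: int = 2) -> str:
    """Name a subdirectory after the leading hex digits of the filename hash.

    Assumption: BLAKE2b and two hex digits are illustrative choices only.
    """
    return hashlib.blake2b(filename.encode("utf-8")).hexdigest()[:digits]


def find_distfile(distdir: str, filename: str):
    """Try the nested location first, then fall back to the flat one."""
    candidates = (
        os.path.join(distdir, nested_subdir(filename), filename),
        os.path.join(distdir, filename),
    )
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None  # not present in either layout


def fix_layout(distdir: str) -> list:
    """Move regular files from the base dir into their nested subdirectory."""
    moved = []
    for name in sorted(os.listdir(distdir)):
        src = os.path.join(distdir, name)
        if not os.path.isfile(src):
            continue  # skip the nested subdirectories themselves
        dest_dir = os.path.join(distdir, nested_subdir(name))
        os.makedirs(dest_dir, exist_ok=True)
        os.replace(src, os.path.join(dest_dir, name))
        moved.append(name)
    return moved
```

With something like this, a user could keep saving fetch-restricted
distfiles into the base directory: ``find_distfile()`` would still locate
them through the flat fallback, and ``fix_layout()`` would later sort them
into the nested tree.
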
This is really outside the scope, and up to Portage maintainers.

> > Rationale
> > =========
> > Algorithm for splitting distfiles
> > ---------------------------------
> > In the original debate that occurred in bug #534528 [#BUG534528]_,
> > three possible solutions for splitting distfiles were listed:
> >
> > a. using initial portion of filename,
> >
> > b. using initial portion of file hash,
> >
> > c. using initial portion of filename hash.
> >
> > The significant advantage of the filename option was simplicity.  With
> > that solution, the users could easily determine the correct subdirectory
> > themselves.  However, its significant disadvantage was very uneven
> > shuffling of data.  In particular, the TeΧ Live packages alone count
> > almost 23500 distfiles and all use a common prefix, making it impossible
> > to split them further.
>
> the filename is the original upstream or the renamed one? eg
> SRC_URI="http://foo/foo.tar -> bar.tar" it will be bar.tar?

Renamed one.  This is what distfiles use already.  Otherwise we'd have
a lot of collisions on files named 'v1.2.3.tar.gz'.

> I think I'm in favour of using the initial part of the filename anyway.
> sure it's not balanced but it's still a hell of a lot more balanced than
> today and it's really easy.

'More balanced' does not mean it solves the problem.  If you have one
directory with ~25000 files, and others between almost empty and 4000,
then you still have a huge problem and a lot of silly reorganization
that looks like a 'good idea that misfired'.

> Another thing I'm wondering is if we can just use the same dir layout as
> the packages themselves. that would fix texlive since it has a whole lot
> of separate packages. eg /usr/portage/distfiles/app-cat/pkg/pkg-1.0.tgz

Then you're replacing the problem of many files in a single directory
with the problem of a huge number of almost empty directories.
In other words, you replace a performance problem of one kind with
a performance problem of another kind, plus a potential inode problem...

> there is a problem if many packages use the same distfiles (quite
> extensive for SELinux, every single one of the sec-policy/selinux-*
> packages has identical distfiles) so I'm not sure how to deal with it.

...and yes, the problem that we have a lot of distfiles shared between
different packages.  Also, frequently those distfiles are actually huge
(think of a big upstream tarball being split into N packages in Gentoo,
e.g. Qt).

> this would also make it easy in future to make the sandbox restrict
> access to files outside of that package if we wanted to do that.

I don't see how that's relevant at all.

> > The alternate option of using file hash has the advantage of having
> > a more balanced split.  Furthermore, since hashes are stored
> > in Manifests using them is zero-cost.  However, this solution has two
> > significant disadvantages:
> >
> > 1. The hash values are unknown for newly-downloaded distfiles, so
> >    ``repoman`` (or an equivalent tool) would have to use a temporary
> >    directory before locating the file in the appropriate subdirectory.
> >
> > 2. User-provided distfiles (e.g. for fetch-restricted packages) with
> >    hash mismatches would be placed in the wrong subdirectory,
> >    potentially causing confusing errors.
>
> Not just this, but on principle, I also think you should be able to read
> an ebuild and compute the url to download the file from the mirrors
> without any extra knowledge (especially downloading the distfile).
>
> > Using filename hashes has proven to provide a similar balance
> > to using file hashes.  Furthermore, since filenames are known up front
> > this solution does not suffer from either of the listed problems.  While
> > hashes need to be computed manually, hashing a short string should not
> > cause any performance problems.
> >
> > .. figure:: glep-0075-extras/by-filename.png
> >
> >    Distribution of distfiles by first character of filenames
> >
> > .. figure:: glep-0075-extras/by-csum.png
> >
> >    Distribution of distfiles by first hex-digit of checksum
> >    (x --- content checksum, + --- filename checksum)
> >
> > .. figure:: glep-0075-extras/by-csum2.png
> >
> >    Distribution of distfiles by first two hex-digits of checksum
> >    (x --- content checksum, + --- filename checksum)
>
> do you have an easy way to calculate how big the distfiles are per
> category or cat/pkg? I'd be interested to see.

Easy, no.  But it should be easy to write a script that does that.
The sources for my stuff are at:

https://github.com/mgorny/manifest-distfile-stats

Except most of it won't be useful for that case since it works
on combined and deduplicated Manifests.

If you want to do that, please also include a graph of total file sizes,
and mark how much of that is duplicated between groups.

> > Backwards Compatibility
> > =======================
> > Mirror compatibility
> > --------------------
> > The mirrored files are propagated to other mirrors as opaque directory
> > structure.  Therefore, there are no backwards compatibility concerns
> > on the mirroring side.
> >
> > Backwards compatibility with existing clients is detailed
> > in `migrating mirrors to the hashed structure`_ section.  Backwards
> > compatibility with the old clients will be provided by preserving
> > the flat structure during the transitional period.
>
> Even if there was no transition, things wouldn't be terrible because
> portage would fall back to just downloading from SRC_URI directly
> if the mirrors fail.
>

--
Best regards,
Michał Górny