Re: [gentoo-portage-dev] Google SoC and cache sync
Zac Medico wrote:
> Right. If you wanted to submit a competing multiple repository support soc proposal yourself then you might list the unavailable repository thing as one of your goals. However, that might be a little over-ambitious since multiple repository support alone would provide enough work for a soc project.

OK, I will probably submit a multiple repository support proposal, although if that other guy ends up submitting one I'm sure he'll be accepted over me; it looks like he's way more qualified. That's okay though, there's always next year, and just because I'm not doing summer of code doesn't mean I can't look at code over the summer :] I actually might just pass on this year's summer of code, and work some more on that search thing this summer, honing my real-world programming/python skills so that I'll be ready to kick some ass next summer.

Actually, do you think more work on faster search would be an adequate project for soc? In order for it to actually be a plausible modification to portage I still need to implement regex search and support for overlays. As I realized too soon before I had to submit the project to my prof, integrating regex search into the suffix-tree-like index that I created would require a pretty substantial overhaul of my implementation, if not a separate implementation to deal with regex queries. I'm sure it would be possible though.

The biggest problem remaining with my implementation, as I believe I said before, is that cPickle unpickles (and pickles?) the index way too slowly. Without a better serialization module my implementation is pretty much useless, but ignoring the time it takes to unpickle the index, my implementation is something like an order of magnitude faster than the current search implementation. That's promising, right? I haven't tried out any other picklers yet so I'm not sure how much of an improvement it might be possible to get. I am sure, however, that writing my own superior pickler would be beyond my abilities. Unless serialization is simpler than I think it is, and I could maybe throw together something that is optimized for this specific data structure.

> Expanding on this a bit... if you were going to pack an ebuild into a single file, you would need to include the eclasses which it inherits and also any patches that are included with it in cvs. If the eclasses are included in this way, each source package will contain a redundant copy of the inherited eclasses. Despite this redundancy, you might still have a net decrease in bandwidth usage since you'd only have to download the source packages that you actually want to build.

This is an interesting idea... but I think I agree with you on keeping the ebuild format the way it is. Not only because changing to an rpm-like system would be an almost fundamental change in the way portage works (ebuild-wise, anyway), but because this would add a whole new level of complication to ebuilds, and I think one of their strengths is their simplicity. Of course ideally one could write a tool that would simply package ebuilds into these rpms... but I suspect that would end up being more complicated than it seems. Plus, I don't know about you, but back when I used rpm-based distributions, I found that more often than not, rather than simplifying the install process, rpms were just a pain in the ass and didn't work half the time. This was a few years ago and I'm assuming things have changed... but I still think ebuilds are pretty cool and I don't really want to mess with that.
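A minimal sketch of the serialization concern above, not the implementation discussed in the thread. It assumes a nested-dict trie as a stand-in for the suffix-tree-like index (the real structure is not shown here), and every function name is hypothetical. In Python 2, cPickle falls back to the text-based protocol 0 unless a protocol argument is passed, so one cheap experiment before looking for another pickler is to compare protocol 0 with the highest binary protocol. The sketch uses the Python 3 `pickle` module, which takes the same `protocol` argument.

```python
import pickle
import time


def build_suffix_trie(names):
    """Index every suffix of every name in a nested-dict trie.

    Hypothetical stand-in for the index in the thread. Each node maps one
    character to a child node; the multi-character key "names" holds the
    set of names containing the substring spelled by the path so far.
    """
    root = {}
    for name in names:
        for start in range(len(name)):
            node = root
            for ch in name[start:]:
                node = node.setdefault(ch, {})
                node.setdefault("names", set()).add(name)
    return root


def search(root, query):
    """Return the sorted names that contain query as a substring."""
    node = root
    for ch in query:
        node = node.get(ch)
        if node is None:
            return []
    return sorted(node.get("names", ()))


def roundtrip(obj, protocol):
    """Pickle and unpickle obj; return (copy, size in bytes, seconds)."""
    start = time.perf_counter()
    blob = pickle.dumps(obj, protocol=protocol)
    copy = pickle.loads(blob)
    return copy, len(blob), time.perf_counter() - start


if __name__ == "__main__":
    # Made-up sample data; a real test would use the full package list.
    sample = ["portage", "esearch", "gentoolkit", "paludis"]
    trie = build_suffix_trie(sample)
    for protocol in (0, pickle.HIGHEST_PROTOCOL):
        _, size, seconds = roundtrip(trie, protocol)
        print("protocol", protocol, "bytes", size, "seconds", round(seconds, 6))
```

Running the same comparison on the real index would show whether the protocol choice accounts for the slowdown or whether the shape of the data structure is the larger cost.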
On Thu, Apr 2, 2009 at 1:38 AM, Zac Medico zmed...@gentoo.org wrote:
> Zac Medico wrote:
>> Emma Strubell wrote:
>>> And to clarify: the goal of the project is to modify portage so that instead of fetching all of the ebuilds in the portage tree (or in an overlay) upon a sync, portage only fetches the metadata and cache info (via the metadata/cache/ directory) of the tree, and the ebuilds of packages that are already installed (packages found in the world file?) And then, additional ebuilds would be fetched only when they are needed?
>>
>> The problem with fetching the ebuilds separately is that the remote repository might have changed. So, it's not a very reliable approach unless there is some kind of guarantee that the remote repository will provide a window of time during which older ebuilds that have already been removed from the main tree can still be downloaded. In order to accomplish this, you'd essentially have to devise a new source package format which can be downloaded as a single file (something like a source rpm file that an rpm based distro would provide).
>
> Expanding on this a bit... if you were going to pack an ebuild into a single file, you would need to include the eclasses which it inherits and also any patches that are included with it in cvs. If the eclasses are included in this way, each source package will contain a redundant copy of the
Re: [gentoo-portage-dev] Google SoC and cache sync
Emma Strubell wrote:
> Actually, do you think more work on faster search would be an adequate project for soc?

I don't think it would be enough work for a soc project.

> In order for it to actually be a plausible modification to portage I still need to implement regex search and support for overlays. As I realized too soon before I had to submit the project to my prof, integrating regex search into the suffix-tree-like index that I created would require a pretty substantial overhaul of my implementation, if not a separate implementation to deal with regex queries. I'm sure it would be possible though. The biggest problem remaining with my implementation, as I believe I said before, is that cPickle unpickles (and pickles?) the index way too slowly. Without a better serialization module my implementation is pretty much useless, but ignoring the time it takes to unpickle the index, my implementation is something like an order of magnitude faster than the current search implementation. That's promising, right?

Well, your serialization problem is somewhat surprising. How big is that index file? Considering the speed with which esearch is able to load and search its 2 MB database, I suspect that the approach you're using may not be optimal for the data set.

> I haven't tried out any other picklers yet so I'm not sure how much of an improvement it might be possible to get. I am sure, however, that writing my own superior pickler would be beyond my abilities. Unless serialization is simpler than I think it is, and I could maybe throw together something that is optimized for this specific data structure.

--
Thanks, Zac
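One way to read the suggestion that the approach "may not be optimal for the data set" is that a flat structure can be cheaper to serialize than a deeply nested one. The sketch below is an assumption-laden illustration: it is not how esearch works and not the index from the thread. It swaps in a different technique, a sorted list of (suffix, name) pairs searched with `bisect`, which is close in spirit to a suffix array. All names are hypothetical, and JSON is used only as an example of a simple flat format.

```python
import bisect
import json


def build_suffix_list(names):
    """Return a sorted list of (suffix, name) pairs, one per suffix of each name."""
    pairs = {(name[i:], name) for name in names for i in range(len(name))}
    return sorted(pairs)


def search(pairs, query):
    """Return the sorted names containing query as a substring.

    All suffixes that start with query sit next to each other in the
    sorted list, so one binary search finds where that run begins.
    """
    start = bisect.bisect_left(pairs, (query,))
    found = set()
    for suffix, name in pairs[start:]:
        if not suffix.startswith(query):
            break
        found.add(name)
    return sorted(found)


def dump_index(pairs):
    """Serialize the flat index as JSON text."""
    return json.dumps(pairs)


def load_index(text):
    """Rebuild the list of tuples from JSON text."""
    return [tuple(pair) for pair in json.loads(text)]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    index = build_suffix_list(["portage", "esearch", "gentoolkit"])
    print(search(load_index(dump_index(index)), "ear"))
```

The trade-off is memory: storing every suffix as its own string grows quadratically with name length, which is tolerable for short package names but would need rethinking for long descriptions.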
Re: [gentoo-portage-dev] Google SoC and cache sync
Zac Medico wrote:
> The way that I imagine the cache sync idea should be implemented is like paludis's unavailable repository which uses a tarball to distribute package metadata[1]. The tarball approach that they use seems pretty reasonable. However, it would probably also be nice to be able to use a protocol such as rsync to download the metadata/cache/ directory from the same URI which is used to fetch the ebuilds themselves (maybe paludis supports this already, I don't know).

You're offering two different ideas here, right? The unavailable repository method, and the method using the metadata/cache/ directory? If so, it makes sense to me to take the metadata/cache/ directory route, since, as you said, multiple repositories aren't yet supported in portage. At first I was thinking I could contact the guy who might be working on multiple repository support this summer and work with him to some extent... but the unavailable repository solution would basically be dependent on/building off of multiple repository support, and it seems like building off of something that isn't fully built would be a bad idea.

And to clarify: the goal of the project is to modify portage so that instead of fetching all of the ebuilds in the portage tree (or in an overlay) upon a sync, portage only fetches the metadata and cache info (via the metadata/cache/ directory) of the tree, and the ebuilds of packages that are already installed (packages found in the world file?) And then, additional ebuilds would be fetched only when they are needed? Or will only metadata/cache/ be fetched upon sync, and then all ebuilds will be fetched only when they are needed? Am I completely oversimplifying the project?

Thanks so much for your help,
Emma

On Tue, Mar 31, 2009 at 6:41 PM, Zac Medico zmed...@gentoo.org wrote:
> Emma Strubell wrote:
>> Hi all. So, I'd love to do Google's Summer of Code with you guys. I was perusing the list of ideas on the Gentoo wiki, and the cache sync idea seems pretty interesting, especially since it concerns the overall speed of portage, including search, which of course I've already started some work on. However, there is no contact person associated with that project! I figured I'd come here before going to #gentoo-soc to see if anyone is interested in mentoring me on this project, since it seemed like a few of you might be interested.
>
> The way that I imagine the cache sync idea should be implemented is like paludis's unavailable repository which uses a tarball to distribute package metadata[1]. The tarball approach that they use seems pretty reasonable. However, it would probably also be nice to be able to use a protocol such as rsync to download the metadata/cache/ directory from the same URI which is used to fetch the ebuilds themselves (maybe paludis supports this already, I don't know).
>
> In order for the clients to be able to download the metadata/cache/ directory, first that directory has to be populated (as is done on gentoo's master rsync server). I'm currently working on a tool called 'egencache' that overlay maintainers will be able to use in order to populate the metadata/cache/ directory [2]. It will be included in the next portage release.
>
> Before we implement something like unavailable repository for portage, first we'll have to add multiple repository support, and that's a decent-sized project of its own. Somebody has mentioned interest in multiple repository support on the gentoo-soc list [3], but they haven't submitted a proposal to http://socghop.appspot.com yet.
>
> [1] http://paludis.pioto.org/configuration/repositories/unavailable.html
> [2] http://bugs.gentoo.org/show_bug.cgi?id=261377
> [3] http://archives.gentoo.org/gentoo-soc/msg_e383863a6748e367e13fe53b092f3908.xml
>
> --
> Thanks, Zac
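The rsync variant of the cache sync idea can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the URI in the usage note is a placeholder, and nothing here is existing portage code. It builds an rsync invocation that transfers just the metadata/cache/ subdirectory from the same URI that serves the ebuilds, using the standard `--archive`, `--compress` and `--delete` options.

```python
import os
import subprocess


def metadata_sync_command(repo_uri, local_repo):
    """Build an rsync command that fetches only metadata/cache/.

    repo_uri is the URI normally used to sync the whole tree; local_repo
    is the local repository root. Both are supplied by the caller.
    """
    source = repo_uri.rstrip("/") + "/metadata/cache/"
    target = local_repo.rstrip("/") + "/metadata/cache/"
    return ["rsync", "--archive", "--compress", "--delete", source, target]


def sync_metadata(repo_uri, local_repo):
    """Create the local cache directory if needed, then run rsync."""
    command = metadata_sync_command(repo_uri, local_repo)
    # rsync does not create missing parent directories by default.
    os.makedirs(command[-1], exist_ok=True)
    subprocess.run(command, check=True)
```

A call such as `sync_metadata("rsync://rsync.example.org/gentoo-portage", "/usr/portage")` would then refresh only the cache. It does not address the later concern in the thread that individual ebuilds may have changed or disappeared on the server by the time they are fetched.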
Re: [gentoo-portage-dev] Google SoC and cache sync
Zac Medico wrote:
> Emma Strubell wrote:
>> And to clarify: the goal of the project is to modify portage so that instead of fetching all of the ebuilds in the portage tree (or in an overlay) upon a sync, portage only fetches the metadata and cache info (via the metadata/cache/ directory) of the tree, and the ebuilds of packages that are already installed (packages found in the world file?) And then, additional ebuilds would be fetched only when they are needed?
>
> The problem with fetching the ebuilds separately is that the remote repository might have changed. So, it's not a very reliable approach unless there is some kind of guarantee that the remote repository will provide a window of time during which older ebuilds that have already been removed from the main tree can still be downloaded. In order to accomplish this, you'd essentially have to devise a new source package format which can be downloaded as a single file (something like a source rpm file that an rpm based distro would provide).

Expanding on this a bit... if you were going to pack an ebuild into a single file, you would need to include the eclasses which it inherits and also any patches that are included with it in cvs. If the eclasses are included in this way, each source package will contain a redundant copy of the inherited eclasses. Despite this redundancy, you might still have a net decrease in bandwidth usage since you'd only have to download the source packages that you actually want to build.

If you are going to implement something like this, I imagine that you'd create a tool which would pack an ebuild into a source package and optionally sign it with a digital signature. Source packages would be uploaded to a server which would serve them along with a metadata cache file that clients would download for use in dependency calculations (similar to how $PKGDIR/Packages is currently used for binary packages).

--
Thanks, Zac
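The packing tool described above might look roughly like this. It is a sketch under several assumptions and not a proposed format: the archive layout, the function names, and the use of a gzip-compressed tarball are all hypothetical. It only picks up eclasses named directly on `inherit` lines in the ebuild (not eclasses inherited by other eclasses), and it leaves out both the optional signature and the server-side metadata cache file.

```python
import os
import re
import tarfile

# Matches "inherit foo bar" lines; a simplification of real ebuild parsing.
INHERIT_RE = re.compile(r"^[ \t]*inherit[ \t]+(.+)$", re.MULTILINE)


def inherited_eclasses(ebuild_text):
    """Return eclass names listed on inherit lines, in first-seen order."""
    names = []
    for match in INHERIT_RE.finditer(ebuild_text):
        # Drop any trailing comment before splitting into names.
        for name in match.group(1).split("#")[0].split():
            if name not in names:
                names.append(name)
    return names


def pack_source_package(ebuild_path, eclass_dir, out_path):
    """Pack an ebuild, its files/ directory and its eclasses into one tarball."""
    with open(ebuild_path, encoding="utf-8") as handle:
        eclasses = inherited_eclasses(handle.read())
    package_dir = os.path.dirname(os.path.abspath(ebuild_path))
    with tarfile.open(out_path, "w:gz") as archive:
        archive.add(ebuild_path, arcname=os.path.basename(ebuild_path))
        files_dir = os.path.join(package_dir, "files")
        if os.path.isdir(files_dir):
            # Patches shipped alongside the ebuild live in files/.
            archive.add(files_dir, arcname="files")
        for name in eclasses:
            # Each package carries its own (redundant) copy of the eclass.
            archive.add(
                os.path.join(eclass_dir, name + ".eclass"),
                arcname="eclass/" + name + ".eclass",
            )
    return eclasses
```

Bundling the eclasses is what makes each package self-contained, at the cost of the redundancy noted in the message.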
Re: [gentoo-portage-dev] Google SoC and cache sync
Emma Strubell wrote:
> Hi all. So, I'd love to do Google's Summer of Code with you guys. I was perusing the list of ideas on the Gentoo wiki, and the cache sync idea seems pretty interesting, especially since it concerns the overall speed of portage, including search, which of course I've already started some work on. However, there is no contact person associated with that project! I figured I'd come here before going to #gentoo-soc to see if anyone is interested in mentoring me on this project, since it seemed like a few of you might be interested.

The way that I imagine the cache sync idea should be implemented is like paludis's unavailable repository which uses a tarball to distribute package metadata[1]. The tarball approach that they use seems pretty reasonable. However, it would probably also be nice to be able to use a protocol such as rsync to download the metadata/cache/ directory from the same URI which is used to fetch the ebuilds themselves (maybe paludis supports this already, I don't know).

In order for the clients to be able to download the metadata/cache/ directory, first that directory has to be populated (as is done on gentoo's master rsync server). I'm currently working on a tool called 'egencache' that overlay maintainers will be able to use in order to populate the metadata/cache/ directory [2]. It will be included in the next portage release.

Before we implement something like unavailable repository for portage, first we'll have to add multiple repository support, and that's a decent-sized project of its own. Somebody has mentioned interest in multiple repository support on the gentoo-soc list [3], but they haven't submitted a proposal to http://socghop.appspot.com yet.

[1] http://paludis.pioto.org/configuration/repositories/unavailable.html
[2] http://bugs.gentoo.org/show_bug.cgi?id=261377
[3] http://archives.gentoo.org/gentoo-soc/msg_e383863a6748e367e13fe53b092f3908.xml

--
Thanks, Zac