Re: [gentoo-portage-dev] Google SoC and cache sync

2009-04-02 Thread Emma Strubell
Zac Medico wrote:

 Right. If you wanted to submit a competing multiple repository
 support soc proposal yourself then you might list the unavaliable
 repository thing as one of your goals. However, that might be a
 little over-ambitious since multiple repository support alone
 would provide enough work for a soc project.

ok, I will probably submit a multiple repository support proposal,
although if that other guy ends up submitting one I'm sure he'll be
accepted over me, it looks like he's way more qualified. That's okay
though, there's always next year, and just because I'm not doing
summer of code doesn't mean I can't look at code over the summer :] I
actually might just pass on this year's summer of code, and work some
more on that search thing this summer, honing my real-world
programming/python skills so that I'll be ready to kick some ass next
summer.

Actually, do you think more work on faster search would be an adequate
project for soc? In order for it to actually be a plausible
modification to portage I still need to implement regex search and
support for overlays. As I realized too soon before I had to submit
the project to my prof, integrating regex search into the
suffix-tree-like index that I created would require a pretty
substantial overhaul of my implementation, if not a separate
implementation to deal with regex queries. I'm sure it would be
possible though. The biggest problem remaining with my implementation,
as I believe I said before, is that cPickle unpickles (and pickles?)
the index wayy to slowly. Without a better serialization module my
implementation is pretty much useless, but ignoring the time it takes
to unpickle the index, my implementation is something like an order of
magnitude faster than the current search implementation. That's
promising, right? I haven't tried out any other picklers yet so I'm
not sure how much of an improvement it might be possible to get. I
sure, however, that writing my own superior pickler would be beyond my
abilities. Unless serialization is simpler than I think it is, and I
could maybe throw together something that is optimized for this
specific data structure.

 Expanding on this a bit... if you were going to pack an ebuild into
 a single file, you would need to include the eclasses which it
 inherits and also any patches that are included with it in cvs. If
 the eclasses are included in this way, each source package will
 contain a redundant copy of the inherited eclasses. Despite this
 redundancy, you might still have a net decrease in bandwidth usage
 since you'd only have to download the source packages that you
 actually want to build.

This is an interesting idea... but I think I agree with you on keeping
the ebuild format the way it is. Not only because changing to an
rpm-like system would be a an almost fundamental change in the way
portage works (ebuild-wise, anyway), but because this would add a
whole new level of complication to ebuilds, and I think one of their
strengths is their simplicity. Of course ideally one could write a
tool that would simply package ebuilds into these rpms... but I
suspect that would end up being more complicated than it seems. Plus,
I don't know about you but back when I used rpm-based distributions, I
found that more often than not, rather than simplifying the install
process, rpms were just a pain in the ass and didn't work half the
time. This was a few years ago and I'm assuming things have changed...
but I still think ebuilds are pretty cool and I don't really want to
mess with that.

On Thu, Apr 2, 2009 at 1:38 AM, Zac Medico zmed...@gentoo.org wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Zac Medico wrote:
 Emma Strubell wrote:
 And to clarify: the goal of the project is to modify portage so that
 instead of fetching all of the ebuilds in the portage tree (or in an
 overlay) upon a sync, portage only fetches the metadata and cache info
 (via the metadata/cache/ directory) of the tree, and the ebuilds of
 packages that are already installed (packages found in the world
 file?) And then, additional ebuilds would be fetched only when they
 are needed?

 The problem with fetching the ebuilds separately is that the remote
 repository might have changed. So, it's not a very reliable approach
 unless there is some kind of guarantee that the remote repository
 will provide a window of time during which older ebuilds that have
 already been removed from the main tree can still be downloaded. In
 order to accomplish this, you'd essentially have to devise a new
 source package format which can be downloaded as a single file
 (something like a source rpm file that an rpm based distro would
 provide).

 Expanding on this a bit... if you were going to pack an ebuild into
 a single file, you would need to include the eclasses which it
 inherits and also any patches that are included with it in cvs. If
 the eclasses are included in this way, each source package will
 contain a redundant copy of the 

Re: [gentoo-portage-dev] Google SoC and cache sync

2009-04-02 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Emma Strubell wrote:
 Actually, do you think more work on faster search would be an adequate
 project for soc?

I don't think it would be enough work for a soc project.

 In order for it to actually be a plausible
 modification to portage I still need to implement regex search and
 support for overlays. As I realized too soon before I had to submit
 the project to my prof, integrating regex search into the
 suffix-tree-like index that I created would require a pretty
 substantial overhaul of my implementation, if not a separate
 implementation to deal with regex queries. I'm sure it would be
 possible though. The biggest problem remaining with my implementation,
 as I believe I said before, is that cPickle unpickles (and pickles?)
 the index wayy to slowly. Without a better serialization module my
 implementation is pretty much useless, but ignoring the time it takes
 to unpickle the index, my implementation is something like an order of
 magnitude faster than the current search implementation. That's
 promising, right?

Well, your serialization problem is somewhat surprising. How big is
that index file? Considering the speed with which esearch is able to
load and search its 2 MB database, I suspect that the approach
you're using may not be optimal for the data set.

 I haven't tried out any other picklers yet so I'm
 not sure how much of an improvement it might be possible to get. I
 sure, however, that writing my own superior pickler would be beyond my
 abilities. Unless serialization is simpler than I think it is, and I
 could maybe throw together something that is optimized for this
 specific data structure.
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.11 (GNU/Linux)

iEYEARECAAYFAknVpTIACgkQ/ejvha5XGaMYqwCgtYkLaIWGBVPetvcBORs/Fnn8
/GoAnAgZc9zuodpL4FFj4mvMZ5JdpEwk
=+aJx
-END PGP SIGNATURE-



Re: [gentoo-portage-dev] Google SoC and cache sync

2009-04-01 Thread Emma Strubell
Zac Medico wrote:
 The way that I imagine the cache sync idea should be implemented
 is like paludis's unavailable repository which uses of tarball to
 distribute package metadata[1]. The tarball approach that they use
 seems pretty reasonable. However, it would probably also be nice to
 be able to use a protocol such as rsync to download the
 metadata/cache/ directory from the same URI which is used to fetch
 the ebuilds themselves (maybe paludis supports this already, I don't
 know).

You're offering two different ideas here, right? The unavailable
repository method, and the method using the metadata/cache/
directory?

If so, it makes sense to me to take the metadata/cache/ directory
route, since, as you said, multiple repositories aren't yet supported
in portage. At first I was thinking I could contact the guy who might
be working on multiple repository support this summer and work with
him to some extent... but the unavaliable repository solution would
basically be dependent on/building off of multiple repository support,
and it seems like building off of something that isn't fully built
would be a bad idea.

And to clarify: the goal of the project is to modify portage so that
instead of fetching all of the ebuilds in the portage tree (or in an
overlay) upon a sync, portage only fetches the metadata and cache info
(via the metadata/cache/ directory) of the tree, and the ebuilds of
packages that are already installed (packages found in the world
file?) And then, additional ebuilds would be fetched only when they
are needed? Or will only metadata/cache/ be fetched upon sync, and
then all ebuilds will be fetched only when they are needed? Am I
completely oversimplifying the project?

Thanks so much for your help,
Emma


On Tue, Mar 31, 2009 at 6:41 PM, Zac Medico zmed...@gentoo.org wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Emma Strubell wrote:
  Hi all.
 
  So, I'd love to do Google's Summer of Code with you guys. I was perusing
  the list of ideas on the Gentoo wiki, and the cache sync idea seems
  pretty interesting, especially since it concerns the overall speed of
  portage, including search, which of course I've already started some
  work on. However, there is no contact person associated with that
  project! I figured I'd come here before going to #gentoo-soc to see if
  anyone is interested in mentoring me on this project, since it seemed
  like a few of you might be interested.

 The way that I imagine the cache sync idea should be implemented
 is like paludis's unavailable repository which uses of tarball to
 distribute package metadata[1]. The tarball approach that they use
 seems pretty reasonable. However, it would probably also be nice to
 be able to use a protocol such as rsync to download the
 metadata/cache/ directory from the same URI which is used to fetch
 the ebuilds themselves (maybe paludis supports this already, I don't
 know).

 In order for the clients to be able to download the metadata/cache/
 directory, first that directory has to be populated (as is done on
 gentoo's master rsync server). I'm currently working on a tool
 called 'egencache' that overlay maintainers will be able to use in
 order to populate the metadata/cache/ directory [2]. It will be
 included in the next portage release.

 Before we implement something like unavailable repository for
 portage, first we'll have to add multiple repository support, and
 that's a decent sized project of it's own. Somebody has mentioned
 interest in multiple repository support on the gentoo-soc list
 [3], but they haven't submitted a proposal to
 http://socghop.appspot.com yet.

 [1] http://paludis.pioto.org/configuration/repositories/unavailable.html
 [2] http://bugs.gentoo.org/show_bug.cgi?id=261377
 [3]
 http://archives.gentoo.org/gentoo-soc/msg_e383863a6748e367e13fe53b092f3908.xml
 - --
 Thanks,
 Zac
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v2.0.11 (GNU/Linux)

 iEYEARECAAYFAknSnA4ACgkQ/ejvha5XGaO6tACgjzAsoXP0cJd0Vr1vJxU2CvLQ
 JtwAn2Sj+GxLyyRpOIdbejPirCljmF2c
 =k5u1
 -END PGP SIGNATURE-




Re: [gentoo-portage-dev] Google SoC and cache sync

2009-04-01 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Zac Medico wrote:
 Emma Strubell wrote:
 And to clarify: the goal of the project is to modify portage so that
 instead of fetching all of the ebuilds in the portage tree (or in an
 overlay) upon a sync, portage only fetches the metadata and cache info
 (via the metadata/cache/ directory) of the tree, and the ebuilds of
 packages that are already installed (packages found in the world
 file?) And then, additional ebuilds would be fetched only when they
 are needed?
 
 The problem with fetching the ebuilds separately is that the remote
 repository might have changed. So, it's not a very reliable approach
 unless there is some kind of guarantee that the remote repository
 will provide a window of time during which older ebuilds that have
 already been removed from the main tree can still be downloaded. In
 order to accomplish this, you'd essentially have to devise a new
 source package format which can be downloaded as a single file
 (something like a source rpm file that an rpm based distro would
 provide).

Expanding on this a bit... if you were going to pack an ebuild into
a single file, you would need to include the eclasses which it
inherits and also any patches that are included with it in cvs. If
the eclasses are included in this way, each source package will
contain a redundant copy of the inherited eclasses. Despite this
redundancy, you might still have a net decrease in bandwidth usage
since you'd only have to download the source packages that you
actually want to build.

If you are going to implement something like this, I imagine that
you'd create a tool which would pack an ebuild into a source package
and optionally sign it with a digital signature. Source packages
would be uploaded to a server which would serve them along with a
metadata cache file that clients would download for use in
dependency calculations (similar to how $PKGDIR/Packages is
currently used for binary packages).
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.11 (GNU/Linux)

iEYEARECAAYFAknUT0oACgkQ/ejvha5XGaPmugCfVs0I4a15trwTgLnPwBac2xOj
wI0AoInp1Jf6yaYV5rNvU2EXHbZ30AkS
=tNrz
-END PGP SIGNATURE-



Re: [gentoo-portage-dev] Google SoC and cache sync

2009-03-31 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Emma Strubell wrote:
 Hi all.
 
 So, I'd love to do Google's Summer of Code with you guys. I was perusing
 the list of ideas on the Gentoo wiki, and the cache sync idea seems
 pretty interesting, especially since it concerns the overall speed of
 portage, including search, which of course I've already started some
 work on. However, there is no contact person associated with that
 project! I figured I'd come here before going to #gentoo-soc to see if
 anyone is interested in mentoring me on this project, since it seemed
 like a few of you might be interested.

The way that I imagine the cache sync idea should be implemented
is like paludis's unavailable repository which uses of tarball to
distribute package metadata[1]. The tarball approach that they use
seems pretty reasonable. However, it would probably also be nice to
be able to use a protocol such as rsync to download the
metadata/cache/ directory from the same URI which is used to fetch
the ebuilds themselves (maybe paludis supports this already, I don't
know).

In order for the clients to be able to download the metadata/cache/
directory, first that directory has to be populated (as is done on
gentoo's master rsync server). I'm currently working on a tool
called 'egencache' that overlay maintainers will be able to use in
order to populate the metadata/cache/ directory [2]. It will be
included in the next portage release.

Before we implement something like unavailable repository for
portage, first we'll have to add multiple repository support, and
that's a decent sized project of it's own. Somebody has mentioned
interest in multiple repository support on the gentoo-soc list
[3], but they haven't submitted a proposal to
http://socghop.appspot.com yet.

[1] http://paludis.pioto.org/configuration/repositories/unavailable.html
[2] http://bugs.gentoo.org/show_bug.cgi?id=261377
[3]
http://archives.gentoo.org/gentoo-soc/msg_e383863a6748e367e13fe53b092f3908.xml
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.11 (GNU/Linux)

iEYEARECAAYFAknSnA4ACgkQ/ejvha5XGaO6tACgjzAsoXP0cJd0Vr1vJxU2CvLQ
JtwAn2Sj+GxLyyRpOIdbejPirCljmF2c
=k5u1
-END PGP SIGNATURE-