Re: [DISCUSS] contents of nutch release artifact
Doğacan Güney wrote: On Thu, Mar 19, 2009 at 23:46, Sami Siren ssi...@gmail.com wrote: Sami Siren wrote: Andrzej Bialecki wrote: How about the following: we build just 2 packages: * binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of plugins/ dir, and offer an option to create it from *.job. Assumption here is that this should be enough to run full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master. * source: no build artifacts, no .svn (equivalent to svn export), simple tgz. This sounds good to me. Additionally, some new documentation needs to be written too. I added a simple patch to NUTCH-728 to make a plain source release from svn. What do people think: should we add the plain source package to the next rc? I would not like to make changes to the binary package now, but propose that we do those changes post 1.0. +1 for including plain source release in next rc. As for local/distributed separation, it is a good idea, but I think we should hold it for 1.1 (or something else) if it requires architectural changes (thus needs review and testing). Yes, sorry for not being more explicit - my proposal was for 1.1, I think 1.0 has to go out as it is (and I'd even hesitate to create a source-only release now - we would have to test that it's still buildable and fully functional.) -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [DISCUSS] contents of nutch release artifact
Hi, On Fri, Mar 20, 2009 at 1:10 PM, Andrzej Bialecki a...@sigram.com wrote: Yes, sorry for not being more explicit - my proposal was for 1.1, I think 1.0 has to go out as it is (and I'd even hesitate to create a source-only release now - we would have to test that it's still buildable and fully functional.) To be accurate, the source release *is* the collection of bits that the release manager is using to produce binaries and other release artifacts. It's just a packaged svn export of the release tag. If the release manager can build and test the sources, then anyone else should be able to do the same using the exact same set of bits. Verifying that is one of the key parts of the release vote. From that perspective I'm even a bit worried about the idea of having an Ant target that exports and packages the tag, as it suggests that the release manager is not necessarily using that set of bits to build the release. BR, Jukka Zitting
Re: [DISCUSS] contents of nutch release artifact
Hi, On Sat, Mar 21, 2009 at 12:28 PM, Jukka Zitting jukka.zitt...@gmail.com wrote: To be accurate, the source release *is* the collection of bits that the release manager is using to produce binaries and other release artifacts. It's just a packaged svn export of the release tag. Or, to express this in another way, the release manager can produce the source release simply by packaging the entire source tree he's using just before invoking any Ant targets to produce the binaries. BR, Jukka Zitting
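For illustration only, the "package the source tree before invoking any Ant targets" step could look roughly like the Python sketch below. It is not the NUTCH-728 patch and not part of the Nutch build. The function name package_source_tree and the EXCLUDED_DIRS list are hypothetical, and the sketch assumes that the version-control metadata lives in .svn directories and that any build output would land in a directory named build.

```python
import tarfile

# Hypothetical exclusion list: Subversion metadata plus an assumed build output dir.
EXCLUDED_DIRS = {".svn", "build"}


def package_source_tree(src_dir, archive_path, top_level_name):
    """Write src_dir into a gzipped tarball, leaving out EXCLUDED_DIRS.

    top_level_name is assumed to be a single path component; it becomes the
    root directory inside the archive.
    """
    def skip_excluded(tarinfo):
        # Ignore the archive root itself and look only at the path below it.
        parts = tarinfo.name.split("/")[1:]
        # Returning None makes tarfile omit this member and everything under it.
        return None if EXCLUDED_DIRS.intersection(parts) else tarinfo

    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=top_level_name, filter=skip_excluded)
    return archive_path
```

Writing the archive to a path outside src_dir keeps the tarball from being packed into itself. If the tree is a clean svn export packaged before any build, the exclusion list should have nothing to filter, which is the point being made above.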
Re: [DISCUSS] contents of nutch release artifact
On Thu, Mar 19, 2009 at 23:46, Sami Siren ssi...@gmail.com wrote: Sami Siren wrote: Andrzej Bialecki wrote: How about the following: we build just 2 packages: * binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of plugins/ dir, and offer an option to create it from *.job. Assumption here is that this should be enough to run full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master. * source: no build artifacts, no .svn (equivalent to svn export), simple tgz. This sounds good to me. Additionally, some new documentation needs to be written too. I added a simple patch to NUTCH-728 to make a plain source release from svn. What do people think: should we add the plain source package to the next rc? I would not like to make changes to the binary package now, but propose that we do those changes post 1.0. +1 for including plain source release in next rc. As for local/distributed separation, it is a good idea, but I think we should hold it for 1.1 (or something else) if it requires architectural changes (thus needs review and testing). -- Sami Siren -- Doğacan Güney
[DISCUSS] contents of nutch release artifact
Jukka Zitting was suggesting we should rethink the Nutch release packaging because of its size. I don't see this as a blocker for 1.0, but we could perhaps start the discussion about this anyway, so throw in your opinions... The related snippet from the email discussion: Sami Siren wrote: Jukka Zitting wrote: * Why does the release package contain pre-built documentation and binaries? Downloading the 90MB package takes much longer than checking out and building the 40MB tag from svn. IMHO it would be a service to users to make the release contain just the svn export with instructions on how to build the rest. I see your point about the fat artifact, but I am not totally convinced that users (as in end users) would prefer the idea of fetching the development tools and compiling the software before they use it; at least I am not doing that with the software I use. I will discuss this with the rest of the devs and see what we can do here. One solution could be to split the release into two parts, binary-only and source, as you propose below (they would both be about the same size, since our build process currently copies jars around; I think that's mostly the reason for the gigantic size). -- Sami Siren
Re: [DISCUSS] contents of nutch release artifact
Sami Siren wrote: Jukka Zitting was suggesting we should rethink the Nutch release packaging because of its size. I don't see this as a blocker for 1.0, but we could perhaps start the discussion about this anyway, so throw in your opinions... I agree with you and Jukka that we should provide separate tarballs of source and binaries. This likely won't result in significant size reductions (anyway, what's a measly 90MB nowadays .. ;) but it would help other parties to deploy clean binaries and/or track the officially released sources. -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [DISCUSS] contents of nutch release artifact
On Mar 19, 2009, at 8:48 AM, Sami Siren wrote: Jukka Zitting was suggesting we should rethink the Nutch release packaging because of its size. I don't see this as a blocker for 1.0, but we could perhaps start the discussion about this anyway, so throw in your opinions... +1 for both binary and source releases. As I see it, it's not much more work and it gives people options. If we're looking to get more interest in Nutch, making things as easy as possible for people is a good thing. Eric -- Eric J. Christeson eric.christe...@ndsu.edu Enterprise Computing and Infrastructure, (701) 231-8693 (Voice), North Dakota State University, Fargo, North Dakota, USA
Re: [DISCUSS] contents of nutch release artifact
Hi, On Thu, Mar 19, 2009 at 3:38 PM, Andrzej Bialecki a...@getopt.org wrote: (anyway, what's a measly 90MB nowadays .. ;) It's a pretty long download unless you have a fast connection and a nearby mirror. BR, Jukka Zitting
Re: [DISCUSS] contents of nutch release artifact
On Thu, Mar 19, 2009 at 16:48, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, On Thu, Mar 19, 2009 at 3:38 PM, Andrzej Bialecki a...@getopt.org wrote: (anyway, what's a measly 90MB nowadays .. ;) It's a pretty long download unless you have a fast connection and a nearby mirror. I agree. Can't we also do a source-only release? Kind of like a checkout from svn (without, of course, svn bits)? I think this would be much more interesting to me if I wasn't using trunk. So, my suggestion is that we have 3 releases? Source only, binary only and full. BR, Jukka Zitting -- Doğacan Güney
Re: [DISCUSS] contents of nutch release artifact
Andrzej Bialecki wrote: Sami Siren wrote: Jukka Zitting was suggesting we should rethink the Nutch release packaging because of its size. I don't see this as a blocker for 1.0, but we could perhaps start the discussion about this anyway, so throw in your opinions... I agree with you and Jukka that we should provide separate tarballs of source and binaries. This likely won't result in significant size reductions (anyway, what's a measly 90MB nowadays .. ;) but it would help other parties to deploy clean binaries and/or track the officially released sources. The source package is the straightforward one. Size of source package would be about 30GB. But the binary package will still remain quite big if we need to allow it to run in local and distributed mode (plugins in exploded format and also the .job + .war); size of such binary package would still be nearly 80G. We could split the binary into yet smaller pieces: one for local mode, one for distributed mode, and the .war separately, but I am not sure if that's worth the effort. -- Sami Siren
Re: [DISCUSS] contents of nutch release artifact
Sami Siren wrote: Andrzej Bialecki wrote: Sami Siren wrote: Jukka Zitting was suggesting we should rethink the Nutch release packaging because of its size. I don't see this as a blocker for 1.0, but we could perhaps start the discussion about this anyway, so throw in your opinions... I agree with you and Jukka that we should provide separate tarballs of source and binaries. This likely won't result in significant size reductions (anyway, what's a measly 90MB nowadays .. ;) but it would help other parties to deploy clean binaries and/or track the officially released sources. The source package is the straightforward one. Size of source package would be about 30GB. But the binary package will still remain quite big if we

Now, this is big, indeed ;)

need to allow it to run in local and distributed mode (plugins in exploded format and also the .job + .war); size of such binary package would still be nearly 80G. We could split the binary into yet smaller pieces: one for local mode, one for distributed mode, and the .war separately, but I am not sure if that's worth the effort.

I don't think so either. Please remember also that each binary sub-package may create its own range of support issues ... How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of plugins/ dir, and offer an option to create it from *.job. Assumption here is that this should be enough to run full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master.
* source: no build artifacts, no .svn (equivalent to svn export), simple tgz.

-- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
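To make the "check for the presence of plugins/ dir, and offer an option to create it from *.job" idea concrete, here is a minimal Python sketch. It is only an illustration and not the actual bin/nutch logic. The function name ensure_plugins_dir is hypothetical, and the sketch assumes that the *.job file is a zip-compatible (jar-style) archive whose plugin files sit under a top-level plugins/ entry.

```python
import os
import zipfile


def ensure_plugins_dir(job_file, target_dir, prefix="plugins/"):
    """Create target_dir/plugins from the job archive unless it already exists.

    Returns True if files were extracted, False if plugins/ was already there.
    Assumes (not confirmed in this thread) that job_file is a zip-compatible
    archive holding its plugins under the given prefix.
    """
    plugins_dir = os.path.join(target_dir, prefix.rstrip("/"))
    if os.path.isdir(plugins_dir):
        return False  # nothing to do, keep the existing directory untouched
    with zipfile.ZipFile(job_file) as job:
        members = [name for name in job.namelist() if name.startswith(prefix)]
        if not members:
            raise ValueError("no %s entries found in %s" % (prefix, job_file))
        # Extract only the plugin entries, not classes or libraries.
        job.extractall(target_dir, members=members)
    return True
```

A launcher script could call this once at startup in local mode and ask the user before extracting. An existing plugins/ directory is never overwritten, so local edits to plugins survive.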
Re: [DISCUSS] contents of nutch release artifact
The source package is the straightforward one. Size of source package would be about 30GB. But the binary package will still remain quite big if we

Now, this is big, indeed ;)

Heh, some serious software, need to buy more disc just to download it (yes, I was thinking of M, not G) :) -- Sami Siren
Re: [DISCUSS] contents of nutch release artifact
Andrzej Bialecki wrote: How about the following: we build just 2 packages: * binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of plugins/ dir, and offer an option to create it from *.job. Assumption here is that this should be enough to run full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master. * source: no build artifacts, no .svn (equivalent to svn export), simple tgz. This sounds good to me. Additionally, some new documentation needs to be written too. -- Sami Siren
Re: [DISCUSS] contents of nutch release artifact
Eric J. Christeson wrote: On Mar 19, 2009, at 12:03 PM, Sami Siren wrote: Andrzej Bialecki wrote: How about the following: we build just 2 packages: * binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of plugins/ dir, and offer an option to create it from *.job. Assumption here is that this should be enough to run full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master. * source: no build artifacts, no .svn (equivalent to svn export), simple tgz. This sounds good to me. Additionally, some new documentation needs to be written too. Distributed is a little more complicated than just dropping *.job and bin/nutch on a hadoop install. Will this even work unless one edits config/stuff and builds a new .job? Anyone using distributed nutch probably wouldn't be interested in something trivial, so a step-by-step config how-to would probably be a good idea. Actually, this works very well and it _is_ just a matter of dropping the *.job file and a (slightly) modified bin/nutch. Some time ago I committed a fix that removed Hadoop artifacts from the nutch *.job file. This was exactly to avoid the confusion that multiple hadoop-site.xml and hadoop*.jar files caused (one in your Hadoop install and the other in your Nutch job jar). So now the only place where you should edit Hadoop-related stuff is in your Hadoop conf/ dir, and the only place where you should edit Nutch-related stuff is in your Nutch conf/ dir (and after that you do indeed need to rebuild the *.job jar and drop the new version onto your Hadoop master). -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [DISCUSS] contents of nutch release artifact
Sami Siren wrote: Andrzej Bialecki wrote: How about the following: we build just 2 packages: * binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of plugins/ dir, and offer an option to create it from *.job. Assumption here is that this should be enough to run full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master. * source: no build artifacts, no .svn (equivalent to svn export), simple tgz. This sounds good to me. Additionally, some new documentation needs to be written too. I added a simple patch to NUTCH-728 to make a plain source release from svn. What do people think: should we add the plain source package to the next rc? I would not like to make changes to the binary package now, but propose that we do those changes post 1.0. -- Sami Siren