Re: [VOTE] Release Apache Nutch 1.0
Fellow PMC members,

As you might know already we have posted a release candidate for Nutch 1.0 some time ago. However we have so far received only two +1 votes from Lucene PMC members and one more is required before we can actually finalize the release.

The vote thread as it currently is can be seen from: http://www.lucidimagination.com/search/document/33b2a26db25db492/vote_release_apache_nutch_1_0

We (as a Nutch community) would really appreciate if somebody from the PMC had the time to check it out.

Thanks for your time,
Sami Siren

Sami Siren wrote:
We're lacking one +1, could someone please take a look?

Thanks,
Sami Siren

Sami Siren wrote:
Hello,

I have packaged the second release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc1/

See the CHANGES.txt[1] file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/?pathrev=752004

Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Here's my +1

Thanks!

[1] http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/CHANGES.txt?view=log&pathrev=752004

--
Sami Siren
Re: [VOTE] Release Apache Nutch 1.0
Hi,

On Thu, Mar 19, 2009 at 10:32 AM, Sami Siren ssi...@gmail.com wrote:
We (as a Nutch community) would really appreciate if somebody from the PMC had the time to check it out.

-1

The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries.

Other comments based on a quick look:

* The LICENSE.txt file should have at least references to the licenses of the bundled libraries.
* The NOTICE.txt file should start with the following lines:
  Apache Nutch
  Copyright 2009 The Apache Software Foundation
* The NOTICE.txt file should contain the required copyright notices from all bundled libraries.
* The README.txt should start with "Apache Nutch" instead of "Nutch"
* Why does the release package contain pre-built documentation and binaries? Downloading the 90MB package takes much longer than checking out and building the 40MB tag from svn. IMHO it would be a service to users to make the release contain just the svn export with instructions on how to build the rest. We can also still provide pre-built binaries as separate downloads.

More notably: how am I to verify that the release came from the sources in our svn when it contains stuff that doesn't exist in the svn?

BR,
Jukka Zitting
Re: [VOTE] Release Apache Nutch 1.0
On Mar 15, 2009, at 2:32 PM, Sami Siren wrote:
Grant Ingersoll wrote:
Where's the KEYS file for Nutch?

hi, the keys file is at the top level nutch directory (eg: http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/nutch/KEYS)

OK, I think it should be in the tarball, too, at the top
Re: [VOTE] Release Apache Nutch 1.0
thanks Jukka,

Jukka Zitting wrote:
Hi, On Thu, Mar 19, 2009 at 10:32 AM, Sami Siren ssi...@gmail.com wrote: We (as a Nutch community) would really appreciate if somebody from the PMC had the time to check it out.

-1 The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries.

ok, we need to address that somehow.

Other comments based on a quick look:
* The LICENSE.txt file should have at least references to the licenses of the bundled libraries.
* The NOTICE.txt file should start with the following lines: Apache Nutch Copyright 2009 The Apache Software Foundation
* The NOTICE.txt file should contain the required copyright notices from all bundled libraries.
* The README.txt should start with Apache Nutch instead of Nutch
* Why does the release package contain pre-built documentation and binaries? Downloading the 90MB package takes much longer than checking out and building the 40MB tag from svn. IMHO it would be a service to users to make the release contain just the svn export with instruction on how to build the rest.

I see your point about the fat artifact but I am not totally convinced that users (as in end users) would prefer the idea of fetching the development tools and compiling the software before they use it, at least I am not doing that with the software I use. I will discuss this with the rest of the devs and see what we can do here. One solution could be to split the release in two parts, binary only and source (they would both be about the same size, since our build process currently copies jars around; I think that's mostly the reason for the gigantic size), as you propose below.

We can also still provide pre-built binaries as separate downloads. More notably: how am I to verify that the release came from the sources in our svn when it contains stuff that doesn't exist in the svn?

It may be that I don't understand what you're trying to say here, but isn't that always the case with binary releases (the difficulty of verifying that the binary is built from a certain tag in svn)?

--
Sami Siren
[jira] Created: (NUTCH-722) Nutch contains jars that we cannot redistribute
Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-723) LICENCE.txt is lacking info that should be there
LICENCE.txt is lacking info that should be there Key: NUTCH-723 URL: https://issues.apache.org/jira/browse/NUTCH-723 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Jukkas comment from email: * The LICENSE.txt file should have at least references to the licenses of the bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-725) NOTICE.txt is lacking info that should be there
NOTICE.txt is lacking info that should be there --- Key: NUTCH-725 URL: https://issues.apache.org/jira/browse/NUTCH-725 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sami Siren Jukkas comment from email: * The NOTICE.txt file should start with the the following lines: Apache Nutch Copyright 2009 The Apache Software Foundation * The NOTICE.txt file should contain the required copyright notices from all bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
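The NOTICE.txt header requirement quoted in this issue is easy to check mechanically before cutting a release candidate. What follows is only a minimal sketch: the class and method names are hypothetical and not part of the Nutch build, and the two expected lines are taken verbatim from Jukka's comment.

```java
import java.util.Arrays;
import java.util.List;

public class NoticeHeaderSketch {
    // The two lines the review comment says NOTICE.txt should start with.
    static final List<String> EXPECTED = Arrays.asList(
            "Apache Nutch",
            "Copyright 2009 The Apache Software Foundation");

    /** Returns true when the first lines of the file match the expected header exactly. */
    static boolean startsWithHeader(List<String> lines) {
        return lines.size() >= EXPECTED.size()
                && lines.subList(0, EXPECTED.size()).equals(EXPECTED);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the path to NOTICE.txt is passed as the first argument.
        if (args.length == 1) {
            List<String> lines = java.nio.file.Files.readAllLines(java.nio.file.Paths.get(args[0]));
            System.out.println(startsWithHeader(lines) ? "header ok" : "header missing");
        }
    }
}
```

A check like this covers only the header; the copyright notices of the bundled libraries still have to be reviewed by hand.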
[jira] Created: (NUTCH-724) Drop the JAI libraries
Drop the JAI libraries -- Key: NUTCH-724 URL: https://issues.apache.org/jira/browse/NUTCH-724 Project: Nutch Issue Type: Bug Reporter: Jukka Zitting Priority: Blocker Fix For: 1.0.0 The PDF parser plugin contains Java Advanced Imaging (JAI) libraries (jai_core.jar and jai_codec.jar) that are licensed under the Sun Binary Code License. The license is incompatible with Apache policies, so we need to drop those libraries. AFAIK (see PDFBOX-381) PDFBox only uses the JAI libraries for handling page rotations and tiff images, so simply dropping the JAI jars shouldn't have too much impact. A better solution would be to switch to using Apache PDFBox that has a proper workaround for this issue, but the first Apache PDFBox release has not yet been made. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-525) DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment
[ https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683463#action_12683463 ]

minhthucpham commented on NUTCH-525:

Can anyone guide me on how to install the deleteDups.patch? I have downloaded it but don't know how to install it. I use Cygwin on Windows and my JDK is jdk1.6.0_07. Thanks very much.

DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment
Key: NUTCH-525
URL: https://issues.apache.org/jira/browse/NUTCH-525
Project: Nutch
Issue Type: Bug
Affects Versions: 0.9.0
Environment: Fedora OS, JDK 1.6, Hadoop FS
Reporter: Vishal Shah
Fix For: 1.0.0
Attachments: deleteDups.patch, RededupUnitTest.patch

When trying to rerun dedup on a segment, we get the following Exception:

java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 261883
    at org.apache.lucene.util.BitVector.get(BitVector.java:72)
    at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
    at org.apache.nutch.indexer.DeleteDuplicates1$InputFormat$DDRecordReader.next(DeleteDuplicates1.java:167)
    at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)

To reproduce the error, try creating two segments with identical urls - fetch, parse, index and dedup the 2 segments. Then rerun dedup.

The error comes from the DDRecordReader.next() method:

    //skip past deleted documents
    while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;

If the last document in the index is deleted, then this loop will skip past the last document and call indexReader.isDeleted(doc) again, this time with doc equal to maxDoc. The conditions should be inverted in order to fix the problem. I've attached a patch here.

On a related note, why should we skip past deleted documents? The only time when this will happen is when we are rerunning dedup on a segment. If documents are not deleted for any reason other than dedup, then they should be given a chance to compete again, shouldn't they? We could fix this by putting an indexReader.undeleteAll() in the constructor for DDRecordReader. Any thoughts on this?

--
This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
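The condition ordering described in this issue can be illustrated in isolation. The sketch below is not the attached deleteDups.patch; it uses a plain boolean array as a hypothetical stand-in for the index reader's deletion flags (an out-of-range read on the array fails much like the bit vector lookup in the stack trace), and the class and method names are made up for illustration.

```java
public class SkipDeletedSketch {
    /**
     * Advances doc past deleted documents.
     * deleted[i] == true means document i is deleted (stand-in for indexReader.isDeleted(i)).
     */
    static int nextLiveDoc(boolean[] deleted, int doc, int maxDoc) {
        // The bound is tested first, so && short-circuits before deleted[maxDoc] is ever read.
        while (doc < maxDoc && deleted[doc]) {
            doc++;
        }
        return doc;
    }

    public static void main(String[] args) {
        // Three documents; the last two (including the final one) are deleted.
        boolean[] deleted = {false, true, true};
        System.out.println(nextLiveDoc(deleted, 0, 3)); // prints 0
        System.out.println(nextLiveDoc(deleted, 1, 3)); // prints 3, i.e. no live document left
    }
}
```

With the two operands swapped, the second call would read deleted[3] and throw ArrayIndexOutOfBoundsException, which is the failure mode reported above.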
[jira] Created: (NUTCH-726) README.txt is lacking info that should be there
README.txt is lacking info that should be there --- Key: NUTCH-726 URL: https://issues.apache.org/jira/browse/NUTCH-726 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren from Jukkas email: * The README.txt should start with Apache Nutch instead of Nutch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute
[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683467#action_12683467 ] Andrzej Bialecki commented on NUTCH-722: - I added these libs when upgrading PDFBox. During my tests I discovered that they are needed to correctly parse PDFs with certain types of images. If these libs are absent PDFBox throws a RuntimeException. Of course we should remove the libraries from our svn, but I wonder whether we shouldn't still download them on the fly. Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-727) Add KEYS file to release artifact
Add KEYS file to release artifact - Key: NUTCH-727 URL: https://issues.apache.org/jira/browse/NUTCH-727 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sami Siren comment from Grant: Where's the KEYS file for Nutch? hi, the keys file is at the top level nutch directory (eg: http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/nutch/KEYS) OK, I think it should be in the tarball, too, at the top -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[DISCUSS] contents of nutch release artifact
Jukka Zitting was suggesting we should rethink the Nutch release packaging because of its size. I don't see this as a blocker for 1.0 but we could perhaps start the discussion about this anyway, so throw in your opinions...

The related snippet from the email discussion:

Sami Siren wrote:
Jukka Zitting wrote:
* Why does the release package contain pre-built documentation and binaries? Downloading the 90MB package takes much longer than checking out and building the 40MB tag from svn. IMHO it would be a service to users to make the release contain just the svn export with instruction on how to build the rest.

I see your point about the fat artifact but I am not totally convinced that users (as in end users) would prefer the idea of fetching the development tools and compiling the software before they use it, at least I am not doing that with the software I use. I will discuss this with the rest of the devs and see what we can do here. One solution could be to split the release in two parts, binary only and source (they would both be about the same size, since our build process currently copies jars around; I think that's mostly the reason for the gigantic size), as you propose below.

--
Sami Siren
[jira] Resolved: (NUTCH-726) README.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-726. -- Resolution: Fixed Fix Version/s: 1.0.0 committed README.txt is lacking info that should be there --- Key: NUTCH-726 URL: https://issues.apache.org/jira/browse/NUTCH-726 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Fix For: 1.0.0 from Jukkas email: * The README.txt should start with Apache Nutch instead of Nutch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-724) Drop the JAI libraries
[ https://issues.apache.org/jira/browse/NUTCH-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-724. -- Resolution: Duplicate Drop the JAI libraries -- Key: NUTCH-724 URL: https://issues.apache.org/jira/browse/NUTCH-724 Project: Nutch Issue Type: Bug Reporter: Jukka Zitting Priority: Blocker Fix For: 1.0.0 The PDF parser plugin contains Java Advanced Imaging (JAI) libraries (jai_core.jar and jai_codec.jar) that are licensed under the Sun Binary Code License. The license is incompatible with Apache policies, so we need to drop those libraries. AFAIK (see PDFBOX-381) PDFBox only uses the JAI libraries for handling page rotations and tiff images, so simply dropping the JAI jars shouldn't have too much impact. A better solution would be to switch to using Apache PDFBox that has a proper workaround for this issue, but the first Apache PDFBox release has not yet been made. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [VOTE] Release Apache Nutch 1.0
Hi,

On Thu, Mar 19, 2009 at 2:15 PM, Sami Siren ssi...@gmail.com wrote:
Jukka Zitting wrote:
-1 The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries.

ok, we need to address that somehow.

See https://issues.apache.org/jira/browse/NUTCH-724 for some suggestions.

* Why does the release package contain pre-built documentation and binaries? Downloading the 90MB package takes much longer than checking out and building the 40MB tag from svn. IMHO it would be a service to users to make the release contain just the svn export with instruction on how to build the rest.

I see your point about the fat artifact but I am not totally convinced that users (as in end users) would prefer the idea of fetching the development tools and compiling the software before they use it, at least I am not doing that with the software I use.

Most end users are happy with just the binaries. But pure source releases are really useful, for example, for people who maintain custom modifications as patches against the official source releases (think of Linux distributions with system-specific changes, companies with proprietary extensions, etc.). I'm not sure if Nutch yet has such users.

I will discuss this with rest of the devs and see what we can do here. One solution could be to split the release in two parts binary only and source

That would be nice. Note that even the users who just want the binaries benefit from such a division, as their downloads will also be faster.

More notably: how am I to verify that the release came from the sources in our svn when it contains stuff that doesn't exist in the svn?

May be that I don't understand what you're trying to say here but isn't that always the case with binary releases (the difficulty to verify that the binary is build from certain tag from svn)?

Exactly. That's why it's so important to have a source-only release that preferably matches one-to-one the contents of the respective svn tag. That should be the official release package that the PMC reviews and approves. There is no reasonable way to accurately review binaries, so while we may (and should) test that they work as expected, ultimately we just need to trust the release manager when he or she says that the binaries are the result of building the source release. Thus we should treat binaries as secondary release artifacts that the release manager is providing as a convenience for users.

PS. I know there's a long tradition of doing releases the way you prepared Nutch 1.0, and I'm not claiming that it's necessarily the wrong way of doing things. My -1 was due to the JAI libraries, not due to the structure of the release. However, as described above, I personally much prefer the clear distinction between source releases and binaries.

BR,
Jukka Zitting
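The one-to-one match between a source release and its svn tag can be checked mechanically once both are unpacked side by side. The following is a rough sketch under stated assumptions, not an established ASF tool: it assumes the release tarball and an `svn export` of the tag have already been extracted into two local directories, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TreeCompareSketch {
    /** Maps each regular file under root (by relative path) to the hex SHA-256 of its contents. */
    static Map<String, String> digests(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, String> result = new TreeMap<>();
        for (Path file : files) {
            StringBuilder hex = new StringBuilder();
            for (byte b : sha256(Files.readAllBytes(file))) {
                hex.append(String.format("%02x", b));
            }
            // Normalise separators so results are comparable across platforms.
            result.put(root.relativize(file).toString().replace('\\', '/'), hex.toString());
        }
        return result;
    }

    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    /** Relative paths that exist on only one side or whose contents differ. */
    static SortedSet<String> differingPaths(Path release, Path export) throws IOException {
        Map<String, String> a = digests(release);
        Map<String, String> b = digests(export);
        SortedSet<String> all = new TreeSet<>(a.keySet());
        all.addAll(b.keySet());
        SortedSet<String> diff = new TreeSet<>();
        for (String path : all) {
            if (!Objects.equals(a.get(path), b.get(path))) {
                diff.add(path);
            }
        }
        return diff;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the unpacked release, args[1] the svn export of the tag.
        if (args.length == 2) {
            differingPaths(Paths.get(args[0]), Paths.get(args[1])).forEach(System.out::println);
        }
    }
}
```

An empty result means every file in the release is byte-identical to the export; anything listed (a bundled jar, generated documentation) is content a reviewer cannot trace back to the tag.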
[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute
[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683473#action_12683473 ] Jukka Zitting commented on NUTCH-722: - See PDFBOX-381 for how the JAI dependency issue was solved in the currently incubating Apache PDFBox. Unfortunately we don't yet have an official release of Apache PDFBox. Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute
[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683474#action_12683474 ] Jukka Zitting commented on NUTCH-722: - One acceptable alternative for now is to drop the jars and add a note to end users that they should explicitly get and add the JAI libraries if they want support for PDF documents with rotated pages or embedded TIFF images. Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute
[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683477#action_12683477 ] Andrzej Bialecki commented on NUTCH-722: - +1 for this solution. Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [DISCUSS] contents of nutch release artifact
Sami Siren wrote:
Jukka Zitting was suggesting we should rethink the Nutch release packaging because of its size. I don't see this as a blocker for 1.0 but we could perhaps start the discussion about this anyway so throw in your opinions...

I agree with you and Jukka that we should provide separate tarballs of source and binaries. This likely won't result in significant size reductions (anyway, what's a measly 90MB nowadays .. ;) but it would help other parties to deploy clean binaries and/or track the officially released sources.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: [DISCUSS] contents of nutch release artifact
On Mar 19, 2009, at 8:48 AM, Sami Siren wrote:
Jukka Zitting was suggesting we should rethink the Nutch release packaging because of its size. I don't see this as a blocker for 1.0 but we could perhaps start the discussion about this anyway so throw in your opinions...

+1 for both binary and source releases. As I see it, it's not much more work and it gives people options. If we're looking to get more interest in Nutch, making things as easy as possible for people is a good thing.

Eric

--
Eric J. Christeson eric.christe...@ndsu.edu
Enterprise Computing and Infrastructure (701) 231-8693 (Voice)
North Dakota State University, Fargo, North Dakota, USA
Re: [DISCUSS] contents of nutch release artifact
Hi, On Thu, Mar 19, 2009 at 3:38 PM, Andrzej Bialecki a...@getopt.org wrote: (anyway, what's a measly 90MB nowadays .. ;) It's a pretty long download unless you have a fast connection and a nearby mirror. BR, Jukka Zitting
[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute
[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683482#action_12683482 ] Sami Siren commented on NUTCH-722: -- +1, i am fine with this solution too Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of pdf parser) that we cannot redistribute. Jukkas comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [DISCUSS] contents of nutch release artifact
On Thu, Mar 19, 2009 at 16:48, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, On Thu, Mar 19, 2009 at 3:38 PM, Andrzej Bialecki a...@getopt.org wrote: (anyway, what's a measly 90MB nowadays .. ;) It's a pretty long download unless you have a fast connection and a nearby mirror. I agree. Can't we also do a source-only release? Kind of like a checkout from svn (without, of course, svn bits)? I think this would be much more interesting to me if I wasn't using trunk. So, my suggestion is that we have 3 releases? Source only, binary only and full. BR, Jukka Zitting -- Doğacan Güney
Re: [DISCUSS] contents of nutch release artifact
Andrzej Bialecki wrote:
Sami Siren wrote:
Jukka Zitting was suggesting we should rethink the Nutch release packaging because of its size. I don't see this as a blocker for 1.0 but we could perhaps start the discussion about this anyway so throw in your opinions...

I agree with you and Jukka that we should provide separate tarballs of source and binaries. This likely won't result in significant size reductions (anyway, what's a measly 90MB nowadays .. ;) but it would help other parties to deploy clean binaries and/or track the officially released sources.

The source package is the straightforward one. Size of source package would be about 30GB. but the binary package will still remain quite big if we need to allow it to run on local and distributed mode (plugins as exploded format and also the .job + .war), size of such binary package would still be nearly 80G. We could split the binary to yet smaller pieces: one for local mode, one for distributed mode, and the .war separately but I am not sure if that's worth the effort.

--
Sami Siren
Re: [DISCUSS] contents of nutch release artifact
Sami Siren wrote:
Andrzej Bialecki wrote:
Sami Siren wrote:
Jukka Zitting was suggesting we should rethink the Nutch release packaging because of its size. I don't see this as a blocker for 1.0 but we could perhaps start the discussion about this anyway so throw in your opinions...

I agree with you and Jukka that we should provide separate tarballs of source and binaries. This likely won't result in significant size reductions (anyway, what's a measly 90MB nowadays .. ;) but it would help other parties to deploy clean binaries and/or track the officially released sources.

The source package is straight forward one. Size of source package would be about 30GB. but the binary package will still remain quite big if we

Now, this is big, indeed ;)

need to allow it to run on local and distributed mode (plugins as exploded format and also the .job + .war), size of such binary package would still be nearly 80G. We could split the binary to yet smaller pieces: one for local mode, one for distributed mode, and the .war separately but I am not sure if that's worth the effort.

I don't think so either. Please remember also that each binary sub-package may create its own range of support issues ...

How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of plugins/ dir, and offer an option to create it from *.job. Assumption here is that this should be enough to run a full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master.
* source: no build artifacts, no .svn (equivalent to svn export), simple tgz.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
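The idea that a script could create plugins/ from the *.job file might look roughly like the sketch below. It rests on assumptions not confirmed in this thread: that the job file is an ordinary zip-format archive and that plugin files sit under a top-level plugins/ prefix inside it. The class and method names are hypothetical, and the thread itself talks about shell scripts rather than Java.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class PluginUnpackSketch {
    /**
     * Copies every file entry under "plugins/" from the archive into targetDir,
     * preserving relative paths. Returns the number of files written.
     */
    static int unpackPlugins(Path jobFile, Path targetDir) throws IOException {
        Path base = targetDir.toAbsolutePath().normalize();
        int written = 0;
        try (ZipFile zip = new ZipFile(jobFile.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory() || !entry.getName().startsWith("plugins/")) {
                    continue; // skip directories and everything outside plugins/
                }
                Path out = base.resolve(entry.getName()).normalize();
                if (!out.startsWith(base)) {
                    // Refuse entries such as plugins/../../x that would escape targetDir.
                    throw new IOException("Unsafe entry name: " + entry.getName());
                }
                Files.createDirectories(out.getParent());
                try (InputStream in = zip.getInputStream(entry)) {
                    Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                }
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the job file, args[1] the installation directory.
        if (args.length == 2 && !Files.isDirectory(Paths.get(args[1], "plugins"))) {
            System.out.println(unpackPlugins(Paths.get(args[0]), Paths.get(args[1])) + " files unpacked");
        }
    }
}
```

The main method mirrors the proposed behaviour: it only unpacks when no plugins/ directory is present yet.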
Re: [DISCUSS] contents of nutch release artifact
The source package is straight forward one. Size of source package would be about 30GB. but the binary package will still remain quite big if we Now, this is big, indeed ;) heh, some serious software, need to buy more disc just to download it (yes I was thinking of M not G) :) -- Sami Siren
Re: [DISCUSS] contents of nutch release artifact
Andrzej Bialecki wrote:
How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of plugins/ dir, and offer an option to create it from *.job. Assumption here is that this should be enough to run full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master.
* source: no build artifacts, no .svn (equivalent to svn export), simple tgz.

This sounds good to me. Additionally, some new documentation needs to be written too.

--
Sami Siren
[jira] Resolved: (NUTCH-725) NOTICE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-725. -- Resolution: Fixed went through the libs and added copyright notices NOTICE.txt is lacking info that should be there --- Key: NUTCH-725 URL: https://issues.apache.org/jira/browse/NUTCH-725 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sami Siren Jukkas comment from email: * The NOTICE.txt file should start with the the following lines: Apache Nutch Copyright 2009 The Apache Software Foundation * The NOTICE.txt file should contain the required copyright notices from all bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-723) LICENCE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren resolved NUTCH-723. -- Resolution: Fixed added licenses of 4rd party software LICENCE.txt is lacking info that should be there Key: NUTCH-723 URL: https://issues.apache.org/jira/browse/NUTCH-723 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Jukkas comment from email: * The LICENSE.txt file should have at least references to the licenses of the bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [DISCUSS] contents of nutch release artifact
Eric J. Christeson wrote:
On Mar 19, 2009, at 12:03 PM, Sami Siren wrote:
Andrzej Bialecki wrote:
How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of plugins/ dir, and offer an option to create it from *.job. Assumption here is that this should be enough to run full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master.
* source: no build artifacts, no .svn (equivalent to svn export), simple tgz.

this sounds good to me. additionally some new documentation needs to be written too.

Distributed is a little more complicated than just dropping *.job and bin/nutch on a hadoop install. Will this even work unless one edits config/stuff and builds a new .job? Anyone using distributed nutch probably wouldn't be interested in something trivial so a step-by-step config how-to would probably be a good idea.

Actually, this works very well and it _is_ just a matter of dropping the *.job file and a (slightly) modified bin/nutch. Some time ago I committed a fix that removed Hadoop artifacts from the nutch *.job file. This was exactly to avoid the confusion that multiple hadoop-site.xml and hadoop*.jar caused (one in your Hadoop install and the other in your Nutch job jar). So now the only place where you should edit Hadoop-related stuff is in your Hadoop conf/ dir, and the only place where you should edit Nutch-related stuff is in your Nutch conf/ dir (and after that indeed you need to rebuild the *.job jar and drop the new version to your Hadoop master).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
[jira] Issue Comment Edited: (NUTCH-723) LICENCE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683618#action_12683618 ] Sami Siren edited comment on NUTCH-723 at 3/19/09 2:11 PM: --- added licenses of 3rd party software was (Author: siren): added licenses of 4rd party software LICENCE.txt is lacking info that should be there Key: NUTCH-723 URL: https://issues.apache.org/jira/browse/NUTCH-723 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Jukka's comment from email: * The LICENSE.txt file should have at least references to the licenses of the bundled libraries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-728) Improve nutch release packaging
[ https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-728: - Attachment: NUTCH-728.patch Added a simple target to generate a source release tgz from an svn tag; did not touch the binary one. Improve nutch release packaging --- Key: NUTCH-728 URL: https://issues.apache.org/jira/browse/NUTCH-728 Project: Nutch Issue Type: Improvement Reporter: Sami Siren Attachments: NUTCH-728.patch See the discussion at http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact
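The attached patch itself is not reproduced in this thread, so the following is not its content. As a rough illustration of what "a source release tgz from svn tag" involves, here is a minimal sketch that packs an already exported tree while leaving out version-control metadata and build output. The function name, the excluded directory names and the archive prefix are assumptions made for the example.

```python
import tarfile
from pathlib import Path

# Assumed directory names for VCS metadata and build output.
EXCLUDED_DIRS = {".svn", "build"}


def make_source_tgz(export_dir: Path, out_file: Path, prefix: str) -> None:
    """Pack export_dir into a gzip-compressed tar rooted at prefix/.

    out_file should live outside export_dir so the archive does not
    try to include itself.
    """

    def skip_excluded(info: tarfile.TarInfo):
        # Returning None drops the entry (and, for a directory, its contents).
        parts = Path(info.name).parts
        return None if EXCLUDED_DIRS.intersection(parts) else info

    with tarfile.open(out_file, "w:gz") as tar:
        tar.add(export_dir, arcname=prefix, filter=skip_excluded)
```

Working from an "svn export" of the tag rather than a checkout means the .svn filter is only a safety net; the resulting archive can then be compared file by file against the tag.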
[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute
[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683634#action_12683634 ] Sami Siren commented on NUTCH-722: -- If there are no objections, I will commit this change tomorrow morning (EET). Nutch contains jars that we cannot redistribute --- Key: NUTCH-722 URL: https://issues.apache.org/jira/browse/NUTCH-722 Project: Nutch Issue Type: Bug Reporter: Sami Siren Priority: Blocker Fix For: 1.0.0 It seems that we have some jars (as part of the pdf parser) that we cannot redistribute. Jukka's comment from email: The release contains the Java Advanced Imaging libraries (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary Code License. We can't redistribute those libraries.
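A release candidate can be checked mechanically for the two jars named in this issue before it is put to a vote. The sketch below is one possible way to do that; the function name is hypothetical, and only the two jar names come from the issue text.

```python
import tarfile
from pathlib import PurePosixPath

# Jar names taken from the issue description above.
NON_REDISTRIBUTABLE = {"jai_core.jar", "jai_codec.jar"}


def find_forbidden_jars(tarball_path):
    """Return the sorted archive paths whose file name is on the forbidden list."""
    with tarfile.open(tarball_path) as tar:
        return sorted(
            name
            for name in tar.getnames()
            if PurePosixPath(name).name in NON_REDISTRIBUTABLE
        )
```

An empty result only means these two file names are absent; it says nothing about the licenses of the remaining bundled libraries.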
Re: [DISCUSS] contents of nutch release artifact
Sami Siren wrote: Andrzej Bialecki wrote: How about the following: we build just 2 packages: * binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of plugins/ dir, and offer an option to create it from *.job. Assumption here is that this should be enough to run full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master. * source: no build artifacts, no .svn (equivalent to svn export), simple tgz. this sounds good to me. additionally some new documentation needs to be written too. I added a simple patch to NUTCH-728 to make a plain source release from svn. What do people think: should we add the plain source package to the next RC? I would rather not change the binary package now, and propose that we make those changes after 1.0. -- Sami Siren
[jira] Commented: (NUTCH-725) NOTICE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683648#action_12683648 ] Jukka Zitting commented on NUTCH-725: - Looks good. NOTICE.txt is lacking info that should be there --- Key: NUTCH-725 URL: https://issues.apache.org/jira/browse/NUTCH-725 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sami Siren Jukka's comment from email: * The NOTICE.txt file should start with the following lines: Apache Nutch Copyright 2009 The Apache Software Foundation * The NOTICE.txt file should contain the required copyright notices from all bundled libraries.
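The first requirement in this issue is easy to verify automatically. The sketch below assumes that the header quoted above is two separate lines (product name, then the copyright line), which is how the flattened text reads but is not shown explicitly in this archive; the function name is hypothetical.

```python
# Assumed two-line layout of the header quoted in the issue.
REQUIRED_HEADER = [
    "Apache Nutch",
    "Copyright 2009 The Apache Software Foundation",
]


def notice_header_ok(notice_text: str) -> bool:
    """True if the first non-blank lines of NOTICE.txt match REQUIRED_HEADER."""
    lines = [line.strip() for line in notice_text.splitlines() if line.strip()]
    return lines[: len(REQUIRED_HEADER)] == REQUIRED_HEADER
```

The second requirement, that notices from all bundled libraries are present, depends on what each library demands and is not covered by this check.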
[jira] Commented: (NUTCH-723) LICENCE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683649#action_12683649 ] Jukka Zitting commented on NUTCH-723: - Looks good to me. PS. There's not really a need to repeat the ALv2 for all Apache components, the first copy at the beginning is enough to cover them all (except of course any non-ALv2 parts). But it's no problem to repeat the license if you think it's clearer to explicitly mention the full licensing terms of each bundled library. LICENCE.txt is lacking info that should be there Key: NUTCH-723 URL: https://issues.apache.org/jira/browse/NUTCH-723 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Jukka's comment from email: * The LICENSE.txt file should have at least references to the licenses of the bundled libraries.
[jira] Commented: (NUTCH-725) NOTICE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683745#action_12683745 ] Hudson commented on NUTCH-725: -- Integrated in Nutch-trunk #758 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/758/]) NOTICE.txt is lacking info that should be there --- Key: NUTCH-725 URL: https://issues.apache.org/jira/browse/NUTCH-725 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sami Siren Jukka's comment from email: * The NOTICE.txt file should start with the following lines: Apache Nutch Copyright 2009 The Apache Software Foundation * The NOTICE.txt file should contain the required copyright notices from all bundled libraries.
[jira] Commented: (NUTCH-727) Add KEYS file to release artifact
[ https://issues.apache.org/jira/browse/NUTCH-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683743#action_12683743 ] Hudson commented on NUTCH-727: -- Integrated in Nutch-trunk #758 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/758/]) Add KEYS file to release artifact - Key: NUTCH-727 URL: https://issues.apache.org/jira/browse/NUTCH-727 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sami Siren Comment from Grant: Where's the KEYS file for Nutch? Hi, the KEYS file is at the top-level nutch directory (e.g. http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/nutch/KEYS). OK, I think it should be in the tarball too, at the top.
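Whether a given tarball carries KEYS "at the top" can be checked with a few lines. The sketch below is an illustration under one assumption: "at the top" is taken to mean either the archive root or the single top-level directory that release tarballs usually unpack into. The function name is hypothetical.

```python
import tarfile
from pathlib import PurePosixPath


def has_top_level_keys(tarball_path) -> bool:
    """True if KEYS sits at the archive root or directly inside a top-level directory."""
    with tarfile.open(tarball_path) as tar:
        for name in tar.getnames():
            parts = PurePosixPath(name).parts
            # Depth 1 is "KEYS"; depth 2 is "<top-dir>/KEYS".
            if parts and parts[-1] == "KEYS" and len(parts) <= 2:
                return True
    return False
```

This only confirms that the file is present; verifying signatures against the keys it contains is a separate step.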
[jira] Commented: (NUTCH-723) LICENCE.txt is lacking info that should be there
[ https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683742#action_12683742 ] Hudson commented on NUTCH-723: -- Integrated in Nutch-trunk #758 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/758/]) LICENCE.txt is lacking info that should be there Key: NUTCH-723 URL: https://issues.apache.org/jira/browse/NUTCH-723 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Sami Siren Jukka's comment from email: * The LICENSE.txt file should have at least references to the licenses of the bundled libraries.