Re: [VOTE] Release Apache Nutch 1.0

2009-03-19 Thread Sami Siren

Fellow PMC members,

As you might already know, we posted a release candidate for Nutch 
1.0 some time ago. However, we have so far received only two +1 votes 
from Lucene PMC members, and one more is required before we can actually 
finalize the release.


The vote thread as it currently stands can be seen at:
http://www.lucidimagination.com/search/document/33b2a26db25db492/vote_release_apache_nutch_1_0

We (as a Nutch community) would really appreciate if somebody from the 
PMC had the time to check it out.


Thanks for your time,

 Sami Siren



Sami Siren wrote:

We're lacking one +1, could someone please take a look?

Thanks,

Sami Siren



Sami Siren wrote:

Hello,

I have packaged the second release candidate for Apache Nutch 1.0 
release at


http://people.apache.org/~siren/nutch-1.0/rc1/

See the CHANGES.txt[1] file for details on release contents and latest 
changes. The release was made from tag: 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/?pathrev=752004 



Please vote on releasing this package as Apache Nutch 1.0. The vote is 
open for the next 72 hours. Only votes from Lucene PMC members are 
binding, but everyone is welcome to check the release candidate and 
voice their approval or disapproval. The vote passes if at least 
three binding +1 votes are cast.


[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Here's my +1


Thanks!


[1] http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/CHANGES.txt?view=log&pathrev=752004

--
Sami Siren







Re: [VOTE] Release Apache Nutch 1.0

2009-03-19 Thread Jukka Zitting
Hi,

On Thu, Mar 19, 2009 at 10:32 AM, Sami Siren ssi...@gmail.com wrote:
 We (as a Nutch community) would really appreciate if somebody from the PMC
 had the time to check it out.

-1 The release contains the Java Advanced Imaging libraries
(jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary
Code License. We can't redistribute those libraries.

Other comments based on a quick look:

* The LICENSE.txt file should have at least references to the licenses
of the bundled libraries.

* The NOTICE.txt file should start with the following lines:

  Apache Nutch
  Copyright 2009 The Apache Software Foundation

* The NOTICE.txt file should contain the required copyright notices
from all bundled libraries.

* The README.txt should start with "Apache Nutch" instead of "Nutch"

* Why does the release package contain pre-built documentation and
binaries? Downloading the 90MB package takes much longer than checking
out and building the 40MB tag from svn. IMHO it would be a service to
users to make the release contain just the svn export with instruction
on how to build the rest. We can also still provide pre-built binaries
as separate downloads. More notably: how am I to verify that the
release came from the sources in our svn when it contains stuff that
doesn't exist in the svn?

BR,

Jukka Zitting


Re: [VOTE] Release Apache Nutch 1.0

2009-03-19 Thread Grant Ingersoll


On Mar 15, 2009, at 2:32 PM, Sami Siren wrote:


Grant Ingersoll wrote:

Where's the KEYS file for Nutch?


hi,

the keys file is at the top level nutch directory (eg: 
http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/nutch/KEYS)


OK, I think it should be in the tarball too, at the top


Re: [VOTE] Release Apache Nutch 1.0

2009-03-19 Thread Sami Siren

thanks Jukka,

Jukka Zitting wrote:

Hi,

On Thu, Mar 19, 2009 at 10:32 AM, Sami Siren ssi...@gmail.com wrote:

We (as a Nutch community) would really appreciate if somebody from the PMC
had the time to check it out.


-1 The release contains the Java Advanced Imaging libraries
(jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary
Code License. We can't redistribute those libraries.


ok, we need to address that somehow.


Other comments based on a quick look:

* The LICENSE.txt file should have at least references to the licenses
of the bundled libraries.

* The NOTICE.txt file should start with the following lines:

  Apache Nutch
  Copyright 2009 The Apache Software Foundation

* The NOTICE.txt file should contain the required copyright notices
from all bundled libraries.

* The README.txt should start with "Apache Nutch" instead of "Nutch"

* Why does the release package contain pre-built documentation and
binaries? Downloading the 90MB package takes much longer than checking
out and building the 40MB tag from svn.
IMHO it would be a service to users to make the release contain just the svn 
export with instruction
on how to build the rest. 


I see your point about the fat artifact but I am not totally convinced 
that users (as in end users) would prefer the idea of fetching the 
development tools and compiling the software before they use it, at 
least I am not doing that with the software I use.


I will discuss this with the rest of the devs and see what we can do here. 
One solution could be to split the release into two parts, binary-only and 
source, as you propose below. (They would both be about the same size, since 
our build process currently copies jars around; I think that's mostly the 
reason for the gigantic size.)



We can also still provide pre-built binaries
as separate downloads. 
More notably: how am I to verify that the

release came from the sources in our svn when it contains stuff that
doesn't exist in the svn?


Maybe I don't understand what you're trying to say here, but isn't 
that always the case with binary releases (the difficulty of verifying that 
the binary was built from a certain svn tag)?


--
 Sami Siren


[jira] Created: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Sami Siren (JIRA)
Nutch contains jars that we cannot redistribute
---

 Key: NUTCH-722
 URL: https://issues.apache.org/jira/browse/NUTCH-722
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Priority: Blocker
 Fix For: 1.0.0


It seems that we have some jars (as part of pdf parser) that we cannot 
redistribute.

Jukka's comment from email:

The release contains the Java Advanced Imaging libraries (jai_core.jar and 
jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
redistribute those libraries.





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)
LICENCE.txt is lacking info that should be there


 Key: NUTCH-723
 URL: https://issues.apache.org/jira/browse/NUTCH-723
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren


Jukka's comment from email:

* The LICENSE.txt file should have at least references to the licenses of the 
bundled libraries.




[jira] Created: (NUTCH-725) NOTICE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)
NOTICE.txt is lacking info that should be there
---

 Key: NUTCH-725
 URL: https://issues.apache.org/jira/browse/NUTCH-725
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren


Jukka's comment from email:

* The NOTICE.txt file should start with the following lines:

  Apache Nutch
  Copyright 2009 The Apache Software Foundation

* The NOTICE.txt file should contain the required copyright notices
from all bundled libraries.




[jira] Created: (NUTCH-724) Drop the JAI libraries

2009-03-19 Thread Jukka Zitting (JIRA)
Drop the JAI libraries
--

 Key: NUTCH-724
 URL: https://issues.apache.org/jira/browse/NUTCH-724
 Project: Nutch
  Issue Type: Bug
Reporter: Jukka Zitting
Priority: Blocker
 Fix For: 1.0.0


The PDF parser plugin contains Java Advanced Imaging (JAI) libraries 
(jai_core.jar and jai_codec.jar) that are licensed under the Sun Binary Code 
License. The license is incompatible with Apache policies, so we need to drop 
those libraries.

AFAIK (see PDFBOX-381) PDFBox only uses the JAI libraries for handling page 
rotations and tiff images, so simply dropping the JAI jars shouldn't have too 
much impact. A better solution would be to switch to using Apache PDFBox that 
has a proper workaround for this issue, but the first Apache PDFBox release has 
not yet been made.




[jira] Commented: (NUTCH-525) DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment

2009-03-19 Thread minhthucpham (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683463#action_12683463 ]

minhthucpham commented on NUTCH-525:


Can anyone tell me how to apply deleteDups.patch? I have downloaded it 
but don't know how to install it.

I use Cygwin on Windows and my JDK is jdk1.6.0_07.

Thanks very much

 DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to 
 rerun dedup on a segment
 -

 Key: NUTCH-525
 URL: https://issues.apache.org/jira/browse/NUTCH-525
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.9.0
 Environment: Fedora OS, JDK 1.6, Hadoop FS
Reporter: Vishal Shah
 Fix For: 1.0.0

 Attachments: deleteDups.patch, RededupUnitTest.patch


 When trying to rerun dedup on a segment, we get the following Exception:
 java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 261883
   at org.apache.lucene.util.BitVector.get(BitVector.java:72)
   at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
   at org.apache.nutch.indexer.DeleteDuplicates1$InputFormat$DDRecordReader.next(DeleteDuplicates1.java:167)
   at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
 To reproduce the error, try creating two segments with identical urls - 
 fetch, parse, index and dedup the 2 segments. Then rerun dedup.
 The error comes from the DDRecordReader.next() method:
 //skip past deleted documents
 while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;
 If the last document in the index is deleted, then this loop will skip past 
 the last document and call indexReader.isDeleted(doc) again.
 The conditions should be inverted in order to fix the problem.
 I've attached a patch here.
 On a related note, why should we skip past deleted documents? The only time 
 when this will happen is when we are rerunning dedup on a segment. If 
 documents are not deleted for any reason other than dedup, then they should 
 be given a chance to compete again, shouldn't they? We could fix this by putting an 
 indexReader.undeleteAll() in the constructor for DDRecordReader. Any thoughts 
 on this?
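
 The inverted condition described above can be sketched as follows. This is a
 minimal, self-contained illustration and not the attached patch: a plain
 boolean array stands in for the index reader's deletion flags, and the class
 and method names are hypothetical.

```java
class SkipDeletedSketch {
    /**
     * Returns the index of the first non-deleted document at or after doc,
     * or deleted.length if every remaining document is deleted.
     * The boolean array is a stand-in for indexReader.isDeleted(doc).
     */
    static int skipDeleted(boolean[] deleted, int doc) {
        int maxDoc = deleted.length;
        // Bounds check first: && short-circuits, so deleted[doc] is never
        // read once doc has reached maxDoc.
        while (doc < maxDoc && deleted[doc]) {
            doc++;
        }
        return doc;
    }

    public static void main(String[] args) {
        // The last two documents are deleted, which is the failing case
        // from the report.
        boolean[] deleted = {false, true, true};
        System.out.println(skipDeleted(deleted, 1)); // prints 3
    }
}
```

 With the original ordering, the same input would read one element past the
 end of the array before the range check had a chance to stop the loop.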




[jira] Created: (NUTCH-726) README.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)
README.txt is lacking info that should be there
---

 Key: NUTCH-726
 URL: https://issues.apache.org/jira/browse/NUTCH-726
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren


from Jukka's email:

* The README.txt should start with "Apache Nutch" instead of "Nutch"




[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Andrzej Bialecki (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683467#action_12683467 ]

Andrzej Bialecki  commented on NUTCH-722:
-

I added these libs when upgrading PDFBox. During my tests I discovered that 
they are needed to correctly parse PDFs with certain types of images. If these 
libs are absent PDFBox throws a RuntimeException.

Of course we should remove the libraries from our svn, but I wonder whether we 
shouldn't still download them on the fly.




[jira] Created: (NUTCH-727) Add KEYS file to release artifact

2009-03-19 Thread Sami Siren (JIRA)
Add KEYS file to release artifact
-

 Key: NUTCH-727
 URL: https://issues.apache.org/jira/browse/NUTCH-727
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren


comment from Grant:

 Where's the KEYS file for Nutch?

 hi,

 the keys file is at the top level nutch directory (eg: 
 http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/nutch/KEYS)

OK, I think it should be in the tarball too, at the top 





[DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren


Jukka Zitting was suggesting we should rethink the Nutch release 
packaging because of its size. I don't see this as a blocker for 1.0 
but we could perhaps start the discussion about this anyway so throw in 
your opinions...



the related snippet from email discussion:

Sami Siren wrote:
 Jukka Zitting wrote:
 * Why does the release package contain pre-built documentation and
 binaries? Downloading the 90MB package takes much longer than checking
 out and building the 40MB tag from svn.
 IMHO it would be a service to users to make the release contain just
 the svn export with instruction
 on how to build the rest.

 I see your point about the fat artifact but I am not totally convinced
 that users (as in end users) would prefer the idea of fetching the
 development tools and compiling the software before they use it, at
 least I am not doing that with the software I use.

 I will discuss this with the rest of the devs and see what we can do here.
 One solution could be to split the release into two parts, binary-only and
 source, as you propose below. (They would both be about the same size, since
 our build process currently copies jars around; I think that's mostly the
 reason for the gigantic size.)


--
 Sami Siren


[jira] Resolved: (NUTCH-726) README.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-726.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

committed




[jira] Resolved: (NUTCH-724) Drop the JAI libraries

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-724.
--

Resolution: Duplicate




Re: [VOTE] Release Apache Nutch 1.0

2009-03-19 Thread Jukka Zitting
Hi,

On Thu, Mar 19, 2009 at 2:15 PM, Sami Siren ssi...@gmail.com wrote:
 Jukka Zitting wrote:
 -1 The release contains the Java Advanced Imaging libraries
 (jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary
 Code License. We can't redistribute those libraries.

 ok, we need to address that somehow.

See https://issues.apache.org/jira/browse/NUTCH-724 for some suggestions.

 * Why does the release package contain pre-built documentation and
 binaries? Downloading the 90MB package takes much longer than checking
 out and building the 40MB tag from svn.
 IMHO it would be a service to users to make the release contain just the
 svn export with instruction on how to build the rest.

 I see your point about the fat artifact but I am not totally convinced that
 users (as in end users) would prefer the idea of fetching the development
 tools and compiling the software before they use it, at least I am not doing
 that with the software I use.

Most end users are happy with just the binaries. But pure source
releases are really useful for example for people that maintain custom
modifications as patches against the official source releases (think
of Linux distributions with system-specific changes, companies with
proprietary extensions, etc.). I'm not sure if Nutch yet has such
users.

 I will discuss this with rest of the devs and see what we can do here. One
 solution could be to split the release in two parts binary only and source

That would be nice. Note that even the users who just want the
binaries benefit from such a division as also their downloads will be
faster.

 More notably: how am I to verify that the
 release came from the sources in our svn when it contains stuff that
 doesn't exist in the svn?

 Maybe I don't understand what you're trying to say here, but isn't that
 always the case with binary releases (the difficulty of verifying that the
 binary was built from a certain svn tag)?

Exactly. That's why it's so important to have a source-only release
that preferably matches one-to-one to the contents of the respective
svn tag. That should be the official release package that the PMC
reviews and approves.

There is no reasonable way to accurately review binaries, so while we
may (and should) test that they work as expected, ultimately we just
need to trust the release manager when he or she says that the
binaries are the result of building the source release. Thus we should
treat binaries as secondary release artifacts that the release manager
is providing as a convenience for users.

PS. I know there's a long tradition of doing releases the way you
prepared Nutch 1.0, and I'm not claiming that it's necessarily the
wrong way of doing things. My -1 was due to the JAI libraries, not due
to the structure of the release. However, as described above, I
personally much prefer the clear distinction between source releases
and binaries.

BR,

Jukka Zitting


[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Jukka Zitting (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683473#action_12683473 ]

Jukka Zitting commented on NUTCH-722:
-

See PDFBOX-381 for how the JAI dependency issue was solved in the currently 
incubating Apache PDFBox. Unfortunately we don't yet have an official release 
of Apache PDFBox.




[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Jukka Zitting (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683474#action_12683474 ]

Jukka Zitting commented on NUTCH-722:
-

One acceptable alternative for now is to drop the jars and add a note to end 
users that they should explicitly get and add the JAI libraries if they want 
support for PDF documents with rotated pages or embedded TIFF images.




[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Andrzej Bialecki (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683477#action_12683477 ]

Andrzej Bialecki  commented on NUTCH-722:
-

+1 for this solution.




Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Andrzej Bialecki

Sami Siren wrote:


Jukka Zitting was suggesting we should rethink the Nutch release 
packaging because of its size. I don't see this as a blocker for 1.0 
but we could perhaps start the discussion about this anyway so throw in 
your opinions...


I agree with you and Jukka that we should provide separate tarballs of 
source and binaries. This likely won't result in significant size 
reductions (anyway, what's a measly 90MB nowadays .. ;) but it would 
help other parties to deploy clean binaries and/or track the officially 
released sources.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Eric J. Christeson


On Mar 19, 2009, at 8:48 AM, Sami Siren wrote:



Jukka Zitting was suggesting we should rethink the Nutch release  
packaging because of its size. I don't see this as a blocker for  
1.0 but we could perhaps start the discussion about this anyway so  
throw in your opinions...


+1 for both binary and source releases.  As I see it, it's not much  
more work and it gives people options.  If we're looking to get more  
interest in Nutch, making things as easy as possible for people is a  
good thing.


Eric

--
Eric J. Christeson  
eric.christe...@ndsu.edu

Enterprise Computing and Infrastructure    (701) 231-8693 (Voice)
North Dakota State University, Fargo, North Dakota, USA



Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Jukka Zitting
Hi,

On Thu, Mar 19, 2009 at 3:38 PM, Andrzej Bialecki a...@getopt.org wrote:
 (anyway, what's a measly 90MB nowadays .. ;)

It's a pretty long download unless you have a fast connection and a
nearby mirror.

BR,

Jukka Zitting


[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Sami Siren (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683482#action_12683482 ]

Sami Siren commented on NUTCH-722:
--

+1, i am fine with this solution too




Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Doğacan Güney
On Thu, Mar 19, 2009 at 16:48, Jukka Zitting jukka.zitt...@gmail.com wrote:
 Hi,

 On Thu, Mar 19, 2009 at 3:38 PM, Andrzej Bialecki a...@getopt.org wrote:
 (anyway, what's a measly 90MB nowadays .. ;)

 It's a pretty long download unless you have a fast connection and a
 nearby mirror.


I agree. Can't we also do a source-only release? Kind of like a checkout from
svn (without, of course, svn bits)? I think this would be much more interesting
to me if I wasn't using trunk.

So, my suggestion is that we have 3 releases? Source only, binary only and full.


 BR,

 Jukka Zitting




-- 
Doğacan Güney


Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren

Andrzej Bialecki wrote:

Sami Siren wrote:


Jukka Zitting was suggesting we should rethink the Nutch release 
packaging because of its size. I don't see this as a blocker for 1.0 
but we could perhaps start the discussion about this anyway so throw 
in your opinions...


I agree with you and Jukka that we should provide separate tarballs of 
source and binaries. This likely won't result in significant size 
reductions (anyway, what's a measly 90MB nowadays .. ;) but it would 
help other parties to deploy clean binaries and/or track the officially 
released sources.


The source package is the straightforward one; its size would be about 
30GB. But the binary package will still remain quite big if we 
need to allow it to run in local and distributed mode (plugins in 
exploded format and also the .job + .war); the size of such a binary package 
would still be nearly 80G.


We could split the binary to yet smaller pieces: one for local mode, one 
for distributed mode, and the .war separately but I am not sure if 
that's worth the effort.


--
 Sami Siren




Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Andrzej Bialecki

Sami Siren wrote:

Andrzej Bialecki wrote:

Sami Siren wrote:


Jukka Zitting was suggesting we should rethink the Nutch release 
packaging because of its size. I don't see this as a blocker for 1.0 
but we could perhaps start the discussion about this anyway so throw 
in your opinions...


I agree with you and Jukka that we should provide separate tarballs of 
source and binaries. This likely won't result in significant size 
reductions (anyway, what's a measly 90MB nowadays .. ;) but it would 
help other parties to deploy clean binaries and/or track the 
officially released sources.


The source package is straight forward one. Size of source package would 
be about 30GB. but the binary package will still remain quite big if we 

   

Now, this is big, indeed ;)

need to allow it to run on local and distributed mode (plugins as 
exploded format and also the .job + .war), size of such binary package 
would still be nearly 80G.


We could split the binary to yet smaller pieces: one for local mode, one 
for distributed mode, and the .war separately but I am not sure if 
that's worth the effort.


I don't think so either. Please remember also that each binary 
sub-package may create its own range of support issues ...


How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start a 
local job, no optional filesystems etc), the *.job and *.war files and 
scripts. Scripts would check for the presence of plugins/ dir, and offer 
an option to create it from *.job. Assumption here is that this should be 
enough to run full cycle in local mode, and that people who want to run 
a distributed cluster will first install a plain Hadoop release, and 
then just put the *.job and bin/nutch on the master.


* source: no build artifacts, no .svn (equivalent to svn export), simple 
tgz.
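
The plugins/ check described for the binary package could look roughly like 
the sketch below. It is written in Java rather than as a shell script, and it 
assumes that the *.job file is an ordinary zip archive with the plugin files 
stored under a plugins/ prefix; that layout, and the class and method names, 
are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class PluginsDirSketch {
    /**
     * If home/plugins is missing, unpacks every file entry under "plugins/"
     * from the given archive into home. Returns the number of files written,
     * or 0 if the directory already existed.
     */
    static int ensurePlugins(Path home, Path jobFile) throws IOException {
        if (Files.isDirectory(home.resolve("plugins"))) {
            return 0; // already unpacked, nothing to do
        }
        Path root = home.toAbsolutePath().normalize();
        int written = 0;
        try (ZipFile zip = new ZipFile(jobFile.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // Assumed layout: plugin files live under "plugins/".
                if (entry.isDirectory() || !entry.getName().startsWith("plugins/")) {
                    continue;
                }
                Path target = root.resolve(entry.getName()).normalize();
                // Refuse entries that would land outside the home directory.
                if (!target.startsWith(root)) {
                    throw new IOException("Entry escapes target dir: " + entry.getName());
                }
                Files.createDirectories(target.getParent());
                try (InputStream in = zip.getInputStream(entry)) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                }
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PluginsDirSketch <home-dir> <job-file>");
            return;
        }
        int n = ensurePlugins(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("files unpacked: " + n);
    }
}
```

A second call is a no-op once plugins/ exists, which matches the idea of a 
launcher script that only offers to create the directory when it is absent.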



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren
The source package is straight forward one. Size of source package 
would be about 30GB. but the binary package will still remain quite 
big if we 

   

Now, this is big, indeed ;)


heh, some serious software, need to buy more disc just to download it 
(yes I was thinking of M not G)  :)


--
 Sami Siren




Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren

Andrzej Bialecki wrote:

How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start a 
local job, no optional filesystems etc), the *.job and *.war files and 
scripts. Scripts would check for the presence of plugins/ dir, and offer 
an option to create it from *.job. Assumption here is that this should be 
enough to run full cycle in local mode, and that people who want to run 
a distributed cluster will first install a plain Hadoop release, and 
then just put the *.job and bin/nutch on the master.


* source: no build artifacts, no .svn (equivalent to svn export), simple 
tgz.



this sounds good to me. additionally some new documentation needs to be 
written too.


--
 Sami Siren



[jira] Resolved: (NUTCH-725) NOTICE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-725.
--

Resolution: Fixed

went through the libs and added copyright notices

 NOTICE.txt is lacking info that should be there
 ---

 Key: NUTCH-725
 URL: https://issues.apache.org/jira/browse/NUTCH-725
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukka's comment from email:
 * The NOTICE.txt file should start with the following lines:
   Apache Nutch
   Copyright 2009 The Apache Software Foundation
 * The NOTICE.txt file should contain the required copyright notices
 from all bundled libraries.
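
 The first requirement above can be checked mechanically. The sketch below is only an illustration: the two header lines are taken from the comment above, while the function name notice_starts_correctly is a hypothetical helper and not part of the Nutch build.

```python
# Header lines required by the comment quoted above.
REQUIRED_HEADER = (
    "Apache Nutch",
    "Copyright 2009 The Apache Software Foundation",
)


def notice_starts_correctly(notice_text):
    """Return True if the NOTICE text begins with the required two lines.

    Surrounding whitespace on each line is ignored; blank lines before the
    header are not tolerated in this sketch.
    """
    lines = [line.strip() for line in notice_text.splitlines()]
    return tuple(lines[:len(REQUIRED_HEADER)]) == REQUIRED_HEADER
```

 The second requirement (notices from all bundled libraries) still needs a manual pass through lib/.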

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-723.
--

Resolution: Fixed

added licenses of 4rd party software

 LICENCE.txt is lacking info that should be there
 

 Key: NUTCH-723
 URL: https://issues.apache.org/jira/browse/NUTCH-723
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukka's comment from email:
 * The LICENSE.txt file should have at least references to the licenses of the 
 bundled libraries.




Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Andrzej Bialecki

Eric J. Christeson wrote:


On Mar 19, 2009, at 12:03 PM, Sami Siren wrote:


Andrzej Bialecki wrote:

How about the following: we build just 2 packages:
* binary: this includes only base hadoop libs in lib/ (enough to 
start a local job, no optional filesystems etc), the *.job and *.war 
files and scripts. Scripts would check for the presence of plugins/ 
dir, and offer an option to create it from *.job. Assumption here is 
that this should be enough to run full cycle in local mode, and that 
people who want to run a distributed cluster will first install a 
plain Hadoop release, and then just put the *.job and bin/nutch on 
the master.
* source: no build artifacts, no .svn (equivalent to svn export), 
simple tgz.



this sounds good to me. additionally some new documentation needs to 
be written too.


Distributed is a little more complicated than just dropping *.job and 
bin/nutch on a hadoop install.  Will this even work unless one edits 
config/stuff and builds a new .job?  Anyone using distributed nutch 
probably wouldn't be interested in something trivial so a step-by-step 
config how-to would probably be a good idea.


Actually, this works very well and it _is_ just a matter of dropping the 
*.job file and a (slightly) modified bin/nutch.


Some time ago I committed a fix that removed Hadoop artifacts from nutch 
*.job file. This was exactly to avoid confusion that multiple 
hadoop-site.xml and hadoop*.jar caused (one in your Hadoop install and 
the other in your Nutch job jar). So now the only place where you should 
edit Hadoop-related stuff is in your Hadoop conf/ dir, and the only 
place where you should edit Nutch-related stuff is in your Nutch conf/ 
dir (and after that indeed you need to rebuild the *.job jar and drop 
the new version to your Hadoop master).
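
To verify that a given *.job file is free of the Hadoop artifacts mentioned above, something along these lines could be used. This is only a sketch: it assumes the job file is a plain zip/jar archive, and the function name find_hadoop_artifacts is hypothetical.

```python
import fnmatch
import zipfile
from pathlib import PurePosixPath

# File name patterns for the artifacts named above.
HADOOP_PATTERNS = ("hadoop-site.xml", "hadoop*.jar")


def find_hadoop_artifacts(job_file):
    """Return a sorted list of archive entries whose file name looks like a Hadoop artifact."""
    with zipfile.ZipFile(job_file) as archive:
        names = archive.namelist()
    return sorted(
        name
        for name in names
        if any(
            fnmatch.fnmatch(PurePosixPath(name).name, pattern)
            for pattern in HADOOP_PATTERNS
        )
    )
```

An empty result means the only Hadoop configuration and jars in play are the ones in the Hadoop install itself.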


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Issue Comment Edited: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683618#action_12683618
 ] 

Sami Siren edited comment on NUTCH-723 at 3/19/09 2:11 PM:
---

added licenses of 3rd party software

  was (Author: siren):
added licenses of 4rd party software
  
 LICENCE.txt is lacking info that should be there
 

 Key: NUTCH-723
 URL: https://issues.apache.org/jira/browse/NUTCH-723
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukka's comment from email:
 * The LICENSE.txt file should have at least references to the licenses of the 
 bundled libraries.




[jira] Updated: (NUTCH-728) Improve nutch release packaging

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-728:
-

Attachment: NUTCH-728.patch

add simple target to generate source release tgz from svn tag

- did not touch the binary one

 Improve nutch release packaging
 ---

 Key: NUTCH-728
 URL: https://issues.apache.org/jira/browse/NUTCH-728
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
 Attachments: NUTCH-728.patch


 see the discussion from 
 http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact




[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683634#action_12683634
 ] 

Sami Siren commented on NUTCH-722:
--

if there are no objections I will commit this change tomorrow morning (EET)

 Nutch contains jars that we cannot redistribute
 ---

 Key: NUTCH-722
 URL: https://issues.apache.org/jira/browse/NUTCH-722
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Priority: Blocker
 Fix For: 1.0.0


 It seems that we have some jars (as part of pdf parser) that we cannot 
 redistribute.
 Jukka's comment from email:
 
 The release contains the Java Advanced Imaging libraries (jai_core.jar and 
 jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
 redistribute those libraries.
 




Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren

Sami Siren wrote:

Andrzej Bialecki wrote:

How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start 
a local job, no optional filesystems etc), the *.job and *.war files 
and scripts. Scripts would check for the presence of plugins/ dir, and 
offer an option to create it from *.job. Assumption here is that this 
should be enough to run full cycle in local mode, and that people who 
want to run a distributed cluster will first install a plain Hadoop 
release, and then just put the *.job and bin/nutch on the master.


* source: no build artifacts, no .svn (equivalent to svn export), 
simple tgz.



this sounds good to me. additionally some new documentation needs to be 
written too.




I added a simple patch to NUTCH-728 to make a plain source release from 
svn. What do people think: should we add the plain source package to the 
next rc? I would not like to make changes to the binary package now, but 
propose that we do those changes post 1.0.
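
For anyone who wants to see what such a plain source package amounts to without reading the patch, here is a rough sketch. It is not the Ant target from NUTCH-728; it simply packs a checked-out tree into a tgz while skipping .svn and build directories. The directory names in EXCLUDED_DIRS and the function name make_source_tgz are assumptions made for illustration.

```python
import tarfile
from pathlib import PurePosixPath

# Assumption: these directory names mark svn metadata and build output.
EXCLUDED_DIRS = {".svn", "build"}


def _skip_excluded(tarinfo):
    """Drop any entry that is, or lives under, an excluded directory name."""
    parts = PurePosixPath(tarinfo.name).parts
    return None if EXCLUDED_DIRS.intersection(parts) else tarinfo


def make_source_tgz(source_dir, tgz_path, top_level_name):
    """Pack source_dir into a gzip-compressed tar under top_level_name.

    Returns the sorted list of regular files stored in the archive.
    """
    with tarfile.open(tgz_path, "w:gz") as tar:
        tar.add(source_dir, arcname=top_level_name, filter=_skip_excluded)
    with tarfile.open(tgz_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())
```

The result should match what an svn export of the tag would give, minus any build output left in the working copy.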


--
 Sami Siren


[jira] Commented: (NUTCH-725) NOTICE.txt is lacking info that should be there

2009-03-19 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683648#action_12683648
 ] 

Jukka Zitting commented on NUTCH-725:
-

Looks good.

 NOTICE.txt is lacking info that should be there
 ---

 Key: NUTCH-725
 URL: https://issues.apache.org/jira/browse/NUTCH-725
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukka's comment from email:
 * The NOTICE.txt file should start with the following lines:
   Apache Nutch
   Copyright 2009 The Apache Software Foundation
 * The NOTICE.txt file should contain the required copyright notices
 from all bundled libraries.




[jira] Commented: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683649#action_12683649
 ] 

Jukka Zitting commented on NUTCH-723:
-

Looks good to me.

PS. There's not really a need to repeat the ALv2 for all Apache components, the 
first copy at the beginning is enough to cover them all (except of course any 
non-ALv2 parts). But it's no problem to repeat the license if you think it's 
clearer to explicitly mention the full licensing terms of each bundled library.

 LICENCE.txt is lacking info that should be there
 

 Key: NUTCH-723
 URL: https://issues.apache.org/jira/browse/NUTCH-723
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukka's comment from email:
 * The LICENSE.txt file should have at least references to the licenses of the 
 bundled libraries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-725) NOTICE.txt is lacking info that should be there

2009-03-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683745#action_12683745
 ] 

Hudson commented on NUTCH-725:
--

Integrated in Nutch-trunk #758 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/758/])




 NOTICE.txt is lacking info that should be there
 ---

 Key: NUTCH-725
 URL: https://issues.apache.org/jira/browse/NUTCH-725
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukka's comment from email:
 * The NOTICE.txt file should start with the following lines:
   Apache Nutch
   Copyright 2009 The Apache Software Foundation
 * The NOTICE.txt file should contain the required copyright notices
 from all bundled libraries.




[jira] Commented: (NUTCH-727) Add KEYS file to release artifact

2009-03-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683743#action_12683743
 ] 

Hudson commented on NUTCH-727:
--

Integrated in Nutch-trunk #758 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/758/])



 Add KEYS file to release artifact
 -

 Key: NUTCH-727
 URL: https://issues.apache.org/jira/browse/NUTCH-727
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren

 comment from Grant:
  Where's the KEYS file for Nutch?
 
  hi,
 
  the keys file is at the top level nutch directory (eg: 
  http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/nutch/KEYS)
 OK, I think it should be in the tarball, too, at the top 




[jira] Commented: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683742#action_12683742
 ] 

Hudson commented on NUTCH-723:
--

Integrated in Nutch-trunk #758 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/758/])



 LICENCE.txt is lacking info that should be there
 

 Key: NUTCH-723
 URL: https://issues.apache.org/jira/browse/NUTCH-723
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukka's comment from email:
 * The LICENSE.txt file should have at least references to the licenses of the 
 bundled libraries.
