Re: [VOTE] Release Apache Nutch 1.0
my non-binding +1 marko On Mar 8, 2009, at 10:07 PM, Dennis Kubes wrote: Non-binding +1 too :) Sami Siren wrote: Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! -- Sami Siren
NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]
Doğacan Güney wrote: On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote: Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! That's great! I would like to see NUTCH-684 in but I guess I was too late :) Anyway, my non-binding +1. uh, I missed that one, sorry. Do you think it's ready to be included? (IMO that's an important feature) It's not a big deal for me to rebuild the package with that feature included. -- Sami Siren
Re: planning for nutch-1.0-rc1
Hello,

It's on 2 Linux boxes, one with CentOS and one with Ubuntu. Both properly run the old bin/nutch crawl. The problem is that it doesn't throw an exception on the command line or in Eclipse; it just writes to the logs, so it's hard to debug. One is running Nutch trunk from 07 March, and one is running today's rc1. Any hints? Maybe some log properties or something?

In hadoop.log it looks exactly the same:

2009-03-09 12:12:09,452 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-03-09 12:12:09,452 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-03-09 12:12:09,560 INFO field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@6210fb
2009-03-09 12:12:09,560 INFO field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-agniesia441/mapred/local/index/_-174719952 autoCommit=true mergepolicy=org.apache.lucene.index.logbytesizemergepol...@48edb5 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@1ee2c2c ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=1 index=
2009-03-09 12:12:09,585 WARN mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
    at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
    at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:1)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
    at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
    at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:1)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-09 12:12:10,021 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
    at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)

Thanks, Bartosz

Dennis Kubes wrote:
Sorry about the docs being sparse on this. I will write more about the process as time permits. Don't know about the problem below. What platform are you running on, Windows or Linux? Dennis

Bartosz Gadzimski wrote:
Hello, Thanks, Dennis, for updating the wiki; it helped a lot. You gave an example with indexing but didn't say much about it. Can you write some more? :) Anyway, I have problems at the last step (Nutch from 07 March): bin/nutch org.apache.nutch.indexer.field.FieldIndexer It simply stops somewhere:

2009-03-07 16:09:04,432 INFO field.FieldIndexer - FieldIndexer: starting
2009-03-07 16:09:04,436 INFO field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/basicfields
2009-03-07 16:09:04,498 INFO field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/anchorfields
2009-03-07 16:09:05,636 INFO plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - Registered Plugins:
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-03-07 16:09:06,437 INFO plugin.PluginRepository - Basic Query Filter (query-basic) plugins
2009-03-07 16:09:07,769 INFO field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@1b4a74b
2009-03-07 16:09:07,769 INFO field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-root/mapred/local/index/_-884655313 autoCommit=true mergepolicy=org.apache.lucene.index.logbytesizemergepol...@15356d5 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@69d02b ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=1 index=
2009-03-07 16:09:07,781 WARN mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
    at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
    at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:131)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
    at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
    at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:69)
    at
Re: NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]
On 09.Mar.2009, at 11:05, Sami Siren ssi...@gmail.com wrote: Doğacan Güney wrote: On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote: Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! That's great! I would like to see NUTCH-684 in but I guess I was too late :) Anyway, my non-binding +1. uh, I missed that one, sorry. Do you think it's ready to be included? (IMO that's an important feature) It's not a big deal for me to rebuild the package with that feature included. I only tested it on a small crawl. Still, I believe it is important too so I would like to include it. Worst case we release a 1.0.1 soon after:) -- Sami Siren
[jira] Commented: (NUTCH-684) Dedup support for Solr
[ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680173#action_12680173 ] Shalin Shekhar Mangar commented on NUTCH-684: - Just found this issue from Sami's post on Lucid blog. Are you guys aware of the Deduplication feature in Solr trunk? http://wiki.apache.org/solr/Deduplication and SOLR-799 Dedup support for Solr -- Key: NUTCH-684 URL: https://issues.apache.org/jira/browse/NUTCH-684 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Doğacan Güney Assignee: Doğacan Güney Attachments: NUTCH-684_bin_nutch.patch, NUTCH-684_solrdedup_v2.patch, solrdedup.patch, solrdedup_v2.patch After NUTCH-442, nutch now can index to both solr and lucene. However, duplicate deletion feature (based on digests) is only available in lucene. It should also be available for solr. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
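For readers unfamiliar with the idea, the digest-based duplicate deletion that the issue description mentions can be sketched roughly as below. This is only an illustration of the general technique, not the code in the attached patches: the `Doc` class, the choice of MD5 over the raw content, and the "keep the highest-scoring copy" tie-break are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DigestDedupSketch {

    /** Hypothetical stand-in for an indexed document; not a Nutch or Solr class. */
    static final class Doc {
        final String url;
        final String content;
        final float score;

        Doc(String url, String content, float score) {
            this.url = url;
            this.content = content;
            this.score = score;
        }
    }

    /** Hex-encoded MD5 of the content, used here as the duplicate key (an assumption). */
    static String digest(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Keeps one document per digest; preferring the highest score is an assumed policy. */
    static List<Doc> dedup(List<Doc> docs) {
        Map<String, Doc> best = new LinkedHashMap<>();
        for (Doc d : docs) {
            String key = digest(d.content);
            Doc current = best.get(key);
            if (current == null || d.score > current.score) {
                best.put(key, d);
            }
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("http://example.com/a", "same body", 1.0f));
        docs.add(new Doc("http://example.com/b", "same body", 2.0f));
        docs.add(new Doc("http://example.com/c", "other body", 0.5f));
        // The two "same body" documents share a digest, so only one of them survives.
        for (Doc d : dedup(docs)) {
            System.out.println(d.url);
        }
    }
}
```

In a real index the surviving duplicates would be deleted by id rather than filtered in memory; the in-memory map is used here only to keep the sketch self-contained.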
[jira] Created: (NUTCH-713) Config options for webgraph Scoring not documented
Config options for webgraph Scoring not documented -- Key: NUTCH-713 URL: https://issues.apache.org/jira/browse/NUTCH-713 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.0.0 Environment: All Reporter: Eric J. Christeson Priority: Minor There are a number of properties for webgraph scoring that are only documented in code. I have found these: link.ignore.internal.host link.ignore.internal.domain link.ignore.limit.domain link.ignore.limit.host link.ignore.limit.page link.loops.depth link.analyze.initial.score link.analyze.damping.factor link.analyze.rank.one link.analyze.iteration link.analyze.num.iterations I have a patch to add these to conf/nutch-default.xml with the best description I could find. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]
Doğacan Güney wrote: On 09.Mar.2009, at 11:05, Sami Siren ssi...@gmail.com wrote: Doğacan Güney wrote: On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote: Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! That's great! I would like to see NUTCH-684 in but I guess I was too late :) Anyway, my non-binding +1. uh, I missed that one, sorry. Do you think it's ready to be included? (IMO that's an important feature) It's not a big deal for me to rebuild the package with that feature included. I only tested it on a small crawl. Still, I believe it is important too so I would like to include it. Worst case we release a 1.0.1 soon after:) I am fine either way. So if you think it's good enough to go in just commit it and I'll build another rc. If not then we can release it later too when it's ready. -- Sami Siren -- Sami Siren
[jira] Updated: (NUTCH-713) Config options for webgraph Scoring not documented
[ https://issues.apache.org/jira/browse/NUTCH-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric J. Christeson updated NUTCH-713: - Attachment: webgraph-scoring.diff Patch to add config options to conf/nutch-default.xml Config options for webgraph Scoring not documented -- Key: NUTCH-713 URL: https://issues.apache.org/jira/browse/NUTCH-713 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.0.0 Environment: All Reporter: Eric J. Christeson Priority: Minor Attachments: webgraph-scoring.diff Original Estimate: 1h Remaining Estimate: 1h There are a number of properties for webgraph scoring that are only documented in code. I have found these: link.ignore.internal.host link.ignore.internal.domain link.ignore.limit.domain link.ignore.limit.host link.ignore.limit.page link.loops.depth link.analyze.initial.score link.analyze.damping.factor link.analyze.rank.one link.analyze.iteration link.analyze.num.iterations I have a patch to add these to conf/nutch-default.xml with the best description I could find. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [VOTE] Release Apache Nutch 1.0
non-binding +1 -- Eric J. Christeson eric.christe...@ndsu.edu Enterprise Computing and Infrastructure (701) 231-8693 (Voice) North Dakota State University
[jira] Commented: (NUTCH-684) Dedup support for Solr
[ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680194#action_12680194 ] Andrzej Bialecki commented on NUTCH-684: - Yes, I'm aware of this functionality. At this point however I thought that it would only complicate things, because users would have to install Nutch classes on Solr in order to use Signature implementations that we use. This is of course an open issue that we should investigate after 1.0 release. Dedup support for Solr -- Key: NUTCH-684 URL: https://issues.apache.org/jira/browse/NUTCH-684 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Doğacan Güney Assignee: Doğacan Güney Attachments: NUTCH-684_bin_nutch.patch, NUTCH-684_solrdedup_v2.patch, solrdedup.patch, solrdedup_v2.patch After NUTCH-442, nutch now can index to both solr and lucene. However, duplicate deletion feature (based on digests) is only available in lucene. It should also be available for solr. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]
On Mon, Mar 9, 2009 at 17:46, Sami Siren ssi...@gmail.com wrote: Doğacan Güney wrote: On 09.Mar.2009, at 11:05, Sami Siren ssi...@gmail.com wrote: Doğacan Güney wrote: On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote: Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! That's great! I would like to see NUTCH-684 in but I guess I was too late :) Anyway, my non-binding +1. uh, I missed that one, sorry. Do you think it's ready to be included? (IMO that's an important feature) It's not a big deal for me to rebuild the package with that feature included. I only tested it on a small crawl. Still, I believe it is important too so I would like to include it. Worst case we release a 1.0.1 soon after:) I am fine either way. So if you think it's good enough to go in just commit it and I'll build another rc. If not then we can release it later too when it's ready. Committed, thanks for waiting :) -- Sami Siren -- Sami Siren -- Doğacan Güney
[jira] Closed: (NUTCH-684) Dedup support for Solr
[ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney closed NUTCH-684. --- Resolution: Fixed Fix Version/s: 1.0.0 Fixed as of rev. 751774. Dedup support for Solr -- Key: NUTCH-684 URL: https://issues.apache.org/jira/browse/NUTCH-684 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 1.0.0 Attachments: NUTCH-684_bin_nutch.patch, NUTCH-684_solrdedup_v2.patch, solrdedup.patch, solrdedup_v2.patch After NUTCH-442, nutch now can index to both solr and lucene. However, duplicate deletion feature (based on digests) is only available in lucene. It should also be available for solr. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Nutch ML cleanup
Hi, This has been bugging me for a while now. For some reason Nutch MLs get the most junk emails - both rude/rudeish emails, as well as clear spam (with SPAM in the subject - something must be detecting it). I just looked at the headers of the clearly labeled spam messages and found that they all seem to come from SF: To: nutch-...@lists.sourceforge.net To: nutch-gene...@lists.sourceforge.net I assume there is some kind of a mail forward from the old Nutch MLs on SF to the new Nutch MLs at ASF. Do you think we could remove this forwarding and get rid of this spam? Sami and Andrzej seem to be members who might be able to make this change: http://sourceforge.net/project/memberlist.php?group_id=59548 Otis
[jira] Created: (NUTCH-714) Need a SFTP and SCP Protocol Handler
Need a SFTP and SCP Protocol Handler Key: NUTCH-714 URL: https://issues.apache.org/jira/browse/NUTCH-714 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.9.0 Reporter: Sanjoy Ghosh Fix For: 0.8.2 An SFTP and SCP Protocol handler is needed to fetch intranet content on an SFTP or SCP server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-714) Need a SFTP and SCP Protocol Handler
[ https://issues.apache.org/jira/browse/NUTCH-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680348#action_12680348 ] Chris A. Mattmann commented on NUTCH-714: - Hi Sanjoy, When you get a patch, let me know and I will work to integrate it. For reference, you were intending this as an upgrade for 0.8.2? I think we should probably do this as a post 1.0 upgrade (maybe 1.1)? Cheers, Chris Need a SFTP and SCP Protocol Handler Key: NUTCH-714 URL: https://issues.apache.org/jira/browse/NUTCH-714 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.9.0 Reporter: Sanjoy Ghosh Assignee: Chris A. Mattmann Fix For: 0.8.2 An SFTP and SCP Protocol handler is needed to fetch intranet content on an SFTP or SCP server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-714) Need a SFTP and SCP Protocol Handler
[ https://issues.apache.org/jira/browse/NUTCH-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-714: --- Assignee: Chris A. Mattmann Need a SFTP and SCP Protocol Handler Key: NUTCH-714 URL: https://issues.apache.org/jira/browse/NUTCH-714 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.9.0 Reporter: Sanjoy Ghosh Assignee: Chris A. Mattmann Fix For: 0.8.2 An SFTP and SCP Protocol handler is needed to fetch intranet content on an SFTP or SCP server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-684) Dedup support for Solr
[ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680374#action_12680374 ] Hudson commented on NUTCH-684: -- Integrated in Nutch-trunk #748 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/748/]) - Dedup support for Solr Dedup support for Solr -- Key: NUTCH-684 URL: https://issues.apache.org/jira/browse/NUTCH-684 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 1.0.0 Attachments: NUTCH-684_bin_nutch.patch, NUTCH-684_solrdedup_v2.patch, solrdedup.patch, solrdedup_v2.patch After NUTCH-442, nutch now can index to both solr and lucene. However, duplicate deletion feature (based on digests) is only available in lucene. It should also be available for solr. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-715) Subcollection plugin doesn't work with default subcollections.xml file
Subcollection plugin doesn't work with default subcollections.xml file -- Key: NUTCH-715 URL: https://issues.apache.org/jira/browse/NUTCH-715 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Reporter: Dmitry Lihachev Fix For: 1.0.0 The subcollection plugin can't parse its configuration file because the file contains a top-level comment (the ASF notice) and DomUtil does not account for top-level comments. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
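The kind of failure described in this report can be reproduced with the standard JDK DOM parser. The sketch below does not use Nutch's DomUtil and makes no claim about how that class is written; it only shows that when an XML file starts with a comment, the document's first child is the comment node rather than the root element, so code that assumes "first child is the root" breaks. The sample XML and the `subcollections` element name are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class TopLevelCommentSketch {

    /** Parses an XML string with the default JDK parser, which keeps comments. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative input: a license-style comment placed before the root element.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!-- license notice -->\n"
            + "<subcollections><subcollection/></subcollections>";
        Document doc = parse(xml);

        // The first child of the document is the comment, not the root element.
        Node first = doc.getFirstChild();
        System.out.println(first.getNodeType() == Node.COMMENT_NODE);

        // getDocumentElement() returns the root element regardless of leading comments.
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Using `getDocumentElement()`, or skipping non-element children explicitly, is one way to make such parsing tolerant of a leading comment; whether that matches the eventual fix for this issue is not stated in the report.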