[jira] Commented: (NUTCH-874) Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora
[ https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896479#action_12896479 ] Julien Nioche commented on NUTCH-874: - Some plugins have not been ported to the new API as it does not provide multi-valued parse results. See http://search.lucidimagination.com/search/document/844c48289f2d07db/nutchbase_multi_value_parseresult_missing#4ed6f352ebcce8ef This is probably not the case for the ExtParser though. We could rely on Tika's mechanism for external parsing instead of maintaining ours. WDYT? Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora -- Key: NUTCH-874 URL: https://issues.apache.org/jira/browse/NUTCH-874 Project: Nutch Issue Type: Bug Components: parser Environment: Nutch 2.0 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Critical Fix For: 2.0 I just noticed while fixing NUTCH-564 that the ExtParser hasn't been brought up to date with Nutch 2.0 trunk. We should review the plugins in src/plugin to make sure they all work with Gora/Nutchbase now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-864) Fetcher generates entries with status 0
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-864: --- Assignee: Doğacan Güney (was: Julien Nioche) Fetcher generates entries with status 0 --- Key: NUTCH-864 URL: https://issues.apache.org/jira/browse/NUTCH-864 Project: Nutch Issue Type: Bug Components: fetcher Environment: Gora with SQLBackend URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase Last Changed Rev: 980748 Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010) Reporter: Julien Nioche Assignee: Doğacan Güney Fix For: 2.0
After a round of fetching which got the following protocol status:
10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62
I ran:
./nutch org.apache.nutch.crawl.WebTableReader -stats
10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable:
10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls: 2690
10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
10/07/30 15:12:37 INFO crawl.WebTableReader: min score: 0.0
10/07/30 15:12:37 INFO crawl.WebTableReader: avg score: 0.7587361
10/07/30 15:12:37 INFO crawl.WebTableReader: max score: 1.0
10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched): 1177 (SUCCESS=1177)
10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone): 112
10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry): 93 (EXCEPTION=93)
10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp): 138 (TEMP_MOVED=138)
10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm): 521 (MOVED=521)
10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
There should not be any entries with status 0 (null). I will investigate a bit more...
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
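The counters quoted in this report can be cross-checked with a little arithmetic. The sketch below only sums the figures copied from the two logs; the class and method names are made up for illustration and are not part of Nutch, and the tally does not explain where the status 0 entries come from.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not Nutch code): sums the counters quoted in NUTCH-864.
final class Nutch864Tally {

    // Adds up every counter value in the map.
    static int sum(Map<String, Integer> counters) {
        int total = 0;
        for (int value : counters.values()) {
            total += value;
        }
        return total;
    }

    // Protocol status counters as printed by mapred.JobClient in the report.
    static Map<String, Integer> protocolStatus() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("ACCESS_DENIED", 2);
        m.put("SUCCESS", 1177);
        m.put("GONE", 3);
        m.put("TEMP_MOVED", 138);
        m.put("EXCEPTION", 93);
        m.put("MOVED", 521);
        m.put("NOTFOUND", 62);
        return m;
    }

    // Per-status counts as printed by WebTableReader -stats in the report.
    static Map<String, Integer> webTableStatus() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("status 0 (null)", 649);
        m.put("status 2 (status_fetched)", 1177);
        m.put("status 3 (status_gone)", 112);
        m.put("status 34 (status_retry)", 93);
        m.put("status 4 (status_redir_temp)", 138);
        m.put("status 5 (status_redir_perm)", 521);
        return m;
    }

    public static void main(String[] args) {
        int fetched = sum(protocolStatus());
        int stored = sum(webTableStatus());
        System.out.println("protocol status total: " + fetched);
        System.out.println("WebTable status total: " + stored);
        System.out.println("difference: " + (stored - fetched));
    }
}
```

With the figures quoted in the issue, the protocol counters sum to 1996 and the WebTable statuses sum to 2690, which matches the reported TOTAL urls.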
[jira] Resolved: (NUTCH-859) Diff trunk and NutchBase
[ https://issues.apache.org/jira/browse/NUTCH-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-859. - Resolution: Fixed NutchBase has become 2.0 and lives in the trunk. I had another look at its differences with 1.2 and could not find any improvement or recent change to the 1.x branch that was missing from NutchBase. However, since the move to the GORA API changed the code drastically it is possible that I missed something but hopefully this won't be the case. Diff trunk and NutchBase - Key: NUTCH-859 URL: https://issues.apache.org/jira/browse/NUTCH-859 Project: Nutch Issue Type: Task Reporter: Julien Nioche Priority: Blocker Fix For: 2.0 Before we turn NutchBase into trunk we need to make sure that all (more or less) recent changes in the trunk have been ported to NutchBase. I have done that recently but given that there is a very large number of changes I might have missed a few things here and there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-859) Diff trunk and NutchBase
[ https://issues.apache.org/jira/browse/NUTCH-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-859. --- Diff trunk and NutchBase - Key: NUTCH-859 URL: https://issues.apache.org/jira/browse/NUTCH-859 Project: Nutch Issue Type: Task Reporter: Julien Nioche Priority: Blocker Fix For: 2.0 Before we turn NutchBase into trunk we need to make sure that all (more or less) recent changes in the trunk have been ported to NutchBase. I have done that recently but given that there is a very large number of changes I might have missed a few things here and there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-875) Port Webgraph to Nutch 2.0
Port Webgraph to Nutch 2.0 -- Key: NUTCH-875 URL: https://issues.apache.org/jira/browse/NUTCH-875 Project: Nutch Issue Type: New Feature Components: linkdb Affects Versions: 2.1 Reporter: Julien Nioche Fix For: 2.1 The webgraph has not yet been ported to the GORA-based API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-851) Port logging to slf4j
[ https://issues.apache.org/jira/browse/NUTCH-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-851: Attachment: NUTCH-851-v2.patch Updated the patch to the 2.0 code. Will commit tomorrow if there aren't any objections Port logging to slf4j - Key: NUTCH-851 URL: https://issues.apache.org/jira/browse/NUTCH-851 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Fix For: 2.0 Attachments: NUTCH-851-v2.patch We are already inheriting a dependency on slf4j from Solr so we might as well use it :-) Any thoughts on this? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-851) Port logging to slf4j
[ https://issues.apache.org/jira/browse/NUTCH-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-851: Attachment: (was: NUTCH-851.patch) Port logging to slf4j - Key: NUTCH-851 URL: https://issues.apache.org/jira/browse/NUTCH-851 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Fix For: 2.0 Attachments: NUTCH-851-v2.patch We are already inheriting a dependency on slf4j from Solr so we might as well use it :-) Any thoughts on this? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-874) Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora
[ https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896552#action_12896552 ] Chris A. Mattmann commented on NUTCH-874: - Hey Julien, I think Jukka already worked on something really similar to the ExtParser in Tika. See: http://tika.apache.org/0.7/api/org/apache/tika/parser/ExternalParser.html If we go that route here in Nutch, then I think we should add an encoding attribute similar to NUTCH-564 and flow it through in parse-tika then. If we can do that, I think we're good! Cheers, Chris Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora -- Key: NUTCH-874 URL: https://issues.apache.org/jira/browse/NUTCH-874 Project: Nutch Issue Type: Bug Components: parser Environment: Nutch 2.0 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Critical Fix For: 2.0 I just noticed while fixing NUTCH-564 that the ExtParser hasn't been brought up to date with Nutch 2.0 trunk. We should review the plugins in src/plugin to make sure they all work with Gora/Nutchbase now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-876) Remove remaining robots/IP blocking code in lib-http
[ https://issues.apache.org/jira/browse/NUTCH-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-876: Attachment: NUTCH-876.patch Patch to fix the issue. If there are no objections I'll commit this shortly. Remove remaining robots/IP blocking code in lib-http Key: NUTCH-876 URL: https://issues.apache.org/jira/browse/NUTCH-876 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: NUTCH-876.patch There are remains of the (very old) blocking code in lib-http/.../HttpBase.java. This code was used with the OldFetcher to manage politeness limits. New trunk doesn't have OldFetcher anymore, so this code is useless. Furthermore, there is an actual bug here - FetcherJob forgets to set Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS to false, and the defaults in lib-http are set to true. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
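The bug described in this issue is a defaults problem: a flag that defaults to true stays on unless the caller switches it off explicitly. The sketch below illustrates that with plain java.util.Properties rather than the Hadoop Configuration that Nutch uses. The key strings standing in for Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are assumptions, as are the class and method names; this is not the attached patch.

```java
import java.util.Properties;

// Illustrative only: models "default is true unless the job sets it to false".
final class CheckFlagsSketch {

    // Assumed key names standing in for Protocol.CHECK_BLOCKING / CHECK_ROBOTS.
    static final String CHECK_BLOCKING = "protocol.plugin.check.blocking";
    static final String CHECK_ROBOTS = "protocol.plugin.check.robots";

    // Mirrors the behaviour described in the issue: a missing key reads as true.
    static boolean isEnabled(Properties conf, String key) {
        return Boolean.parseBoolean(conf.getProperty(key, "true"));
    }

    // What the issue says the fetcher job forgets to do.
    static void disableProtocolLevelChecks(Properties conf) {
        conf.setProperty(CHECK_BLOCKING, "false");
        conf.setProperty(CHECK_ROBOTS, "false");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Nothing set yet, so both checks fall back to the default.
        System.out.println("blocking before: " + isEnabled(conf, CHECK_BLOCKING));
        System.out.println("robots before:   " + isEnabled(conf, CHECK_ROBOTS));

        disableProtocolLevelChecks(conf);
        System.out.println("blocking after:  " + isEnabled(conf, CHECK_BLOCKING));
        System.out.println("robots after:    " + isEnabled(conf, CHECK_ROBOTS));
    }
}
```

Removing the dead blocking code, as the issue proposes, makes the explicit opt-out unnecessary for that flag.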
Re: [VOTE] Apache Nutch 1.2 Release Candidate #1
+1 to fixing it in 1.2 and rolling another RC, but -1 to reopening issues. I’m not a big fan of that, especially since we record issue fixes in CHANGES.txt and reopening them only leads to confusion and out of sync text files and JIRA. In the future it would be nice to just create a new issue in JIRA and then link your issue to the issue that you wanted to reopen. It’s just as easy and doesn’t cause the out of sync problem. OK, makes sense Cheers, Chris On 8/9/10 7:45 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: I reopened https://issues.apache.org/jira/browse/NUTCH-870. It would be good to fix it before releasing 1.2 On 9 August 2010 14:44, Andrzej Bialecki a...@getopt.org wrote: On 2010-08-08 03:04, Mattmann, Chris A (388J) wrote: Hi Folks, I have posted a release candidate for the Apache Nutch 1.2 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.2/rc1/ For more detailed information, see the included CHANGES.txt file for details on release contents and latest changes. The release was made using the Nutch release process, documented on the Wiki here: http://bit.ly/d5ugid A Nutch 1.2 tag is at: http://svn.apache.org/repos/asf/nutch/tags/release-1.2/ Sami Siren previously indicated to integrate RAT into the build, but I haven't had a chance to do it yet. If someone else has time, or wants to, please go ahead and I'd be happy to roll another RC. Please vote on releasing these packages as Apache Nutch 1.2. The vote is open for the next 72 hours. Only votes from Nutch PMC are binding, but folks are welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.2. [ ] -1 Do not release the packages because... 
+1 - all tests pass, a sample crawl works without problems, both in local and in distributed mode. ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- DigitalPebble Ltd Open Source Solutions for Text Engineering http://www.digitalpebble.com
[Nutch Wiki] Update of TikaPlugin by AndreRicardo
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The TikaPlugin page has been changed by AndreRicardo. http://wiki.apache.org/nutch/TikaPlugin?action=diff&rev1=5&rev2=6 -- '''js''': ? '''mp3''': Nutch identifies several fields (Title, Album, Artist) whereas Tika knows only about Titles, the rest is stored as paragraphs. + Tika-app can also identify in an mp3 id3v1 and id3v2 tags like: album, artista, audioSampleRate, composer, genre, logcomment, releaseDate, trackNumber. '''msexcel''': comparable (+ Tika able to represent content in structured way as XHTML tables which can be useful for HTML parser plugins)
Re: [VOTE] Apache Nutch 1.2 Release Candidate #1
I got yelled at, too. :-( I'll pull down 1.2 and do a big-stupid-crawl after that metadata issue is fixed. I'm not sure if it affects what I'm doing, but I see the word metadata and it gives me pause. Scott On Mon, Aug 9, 2010 at 8:01 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: +1 to fixing it in 1.2 and rolling another RC, but -1 to reopening issues. I’m not a big fan of that, especially since we record issue fixes in CHANGES.txt and reopening them only leads to confusion and out of sync text files and JIRA. In the future it would be nice to just create a new issue in JIRA and then link your issue to the issue that you wanted to reopen. It’s just as easy and doesn’t cause the out of sync problem. OK, makes sense Cheers, Chris On 8/9/10 7:45 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: I reopened https://issues.apache.org/jira/browse/NUTCH-870. It would be good to fix it before releasing 1.2 On 9 August 2010 14:44, Andrzej Bialecki a...@getopt.org wrote: On 2010-08-08 03:04, Mattmann, Chris A (388J) wrote: Hi Folks, I have posted a release candidate for the Apache Nutch 1.2 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.2/rc1/ For more detailed information, see the included CHANGES.txt file for details on release contents and latest changes. The release was made using the Nutch release process, documented on the Wiki here: http://bit.ly/d5ugid A Nutch 1.2 tag is at: http://svn.apache.org/repos/asf/nutch/tags/release-1.2/ Sami Siren previously indicated to integrate RAT into the build, but I haven't had a chance to do it yet. If someone else has time, or wants to, please go ahead and I'd be happy to roll another RC. Please vote on releasing these packages as Apache Nutch 1.2. The vote is open for the next 72 hours. 
Only votes from Nutch PMC are binding, but folks are welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.2. [ ] -1 Do not release the packages because... +1 - all tests pass, a sample crawl works without problems, both in local and in distributed mode. ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- DigitalPebble Ltd Open Source Solutions for Text Engineering http://www.digitalpebble.com
[jira] Created: (NUTCH-877) Allow setting of slop values for non-quote phrase queries on query-basic plugin
Allow setting of slop values for non-quote phrase queries on query-basic plugin --- Key: NUTCH-877 URL: https://issues.apache.org/jira/browse/NUTCH-877 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.2 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.2
Patch adds a configuration variable for setting slop values on phrase queries. The default slop value, which currently can't be changed through configuration, is Integer.MAX_VALUE. It produces something like this, which doesn't seem right to me. If you are searching for a phrase you usually want it within a certain distance:
2.9141337E-4 = weight(content:my phrase~2147483647 in 1029), product of:
  0.07163286 = queryWeight(content:my phrase~2147483647), product of:
    9.657982 = idf(content: my=13470 phrase=534)
    0.0074169594 = queryNorm
This patch adds the query.phrase.slop configuration value to the nutch-default.xml file. It has a default setting of 5.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
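To make the effect of the slop value concrete, here is a deliberately simplified sketch of a two-term sloppy phrase check. It is not Lucene's algorithm and does not use the Lucene or Nutch APIs; it only treats slop as the number of other tokens allowed between two terms that appear in order. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Simplified illustration of phrase slop; not how Lucene scores or matches.
final class SlopSketch {

    // Returns true if `first` is followed by `second` with at most `slop`
    // other tokens between them.
    static boolean matches(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                // Number of tokens strictly between positions i and j.
                int gap = j - i - 1;
                if (gap > slop) {
                    break;
                }
                if (tokens.get(j).equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "my" and "phrase" are separated by seven other tokens.
        List<String> doc = Arrays.asList(
            "my", "notes", "on", "how", "to", "write", "a", "good", "phrase");

        System.out.println("slop 5:         " + matches(doc, "my", "phrase", 5));
        System.out.println("slop MAX_VALUE: " + matches(doc, "my", "phrase", Integer.MAX_VALUE));
    }
}
```

Under this simplification, an unbounded slop accepts any document where the two words occur in order, however far apart, while a slop of 5 rejects the sample document because the words are seven tokens apart.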
Alternative search box for Nutch site
Hello, (sending this to d...@nutch instead of old nutch-...@lucene) Over at http://search-lucene.com we index Nutch's mailing lists, wiki, web site, source code, javadoc, jira... Would the community be interested in a patch that adds another search option to the search box on nutch.apache.org? I happened to try a few searches from nutch.a.o just now (now: yesterday) and I got stuff like this: Found 189 results in 6.211 seconds. Displaying page 1 of 19, sorted by Found 12,808 results in 64.342 seconds. Displaying page 1 of 1,281, sorted by Note the times. Ouch! This makes me think having an alternative option would be a good thing to have. Assuming people are for this, any suggestions for how the search should function by default or any specific instructions for how the search box should be modified would be great! Thanks, Otis
Build failed in Hudson: Nutch-trunk #1215
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1215/
--
[...truncated 986 lines...]
A src/plugin/subcollection/src/test/org/apache
A src/plugin/subcollection/src/test/org/apache/nutch
A src/plugin/subcollection/src/test/org/apache/nutch/collection
A src/plugin/subcollection/src/test/org/apache/nutch/collection/TestSubcollection.java
A src/plugin/subcollection/src/java
A src/plugin/subcollection/src/java/org
A src/plugin/subcollection/src/java/org/apache
A src/plugin/subcollection/src/java/org/apache/nutch
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU