[jira] [Resolved] (NUTCH-1394) backport NUTCH-1232 Remove site field from index-basic
[ https://issues.apache.org/jira/browse/NUTCH-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1394.
Resolution: Fixed
Assignee: Lewis John McGibbney

Committed @revision 1418750 in Nutch 2.x branch.

backport NUTCH-1232 Remove site field from index-basic
Key: NUTCH-1394
URL: https://issues.apache.org/jira/browse/NUTCH-1394
Project: Nutch
Issue Type: Improvement
Components: indexer, storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 2.2

This is a simple backport. The 2.0 Solr schema and mappings still contain the field site, which has been removed in 1.x (NUTCH-1232). This should also be done in 2.0: it's easier to maintain only one Solr installation for all Nutch versions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1038) Port IndexingFiltersChecker to 2.0
[ https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527265#comment-13527265 ]

Lewis John McGibbney commented on NUTCH-1038:

Hi Seb, this works great, however I 'think' there is a bug lingering in BasicIndexingFilter's doc.add(tstamp, tstamp) call. It doesn't seem right. I have posted it to the user@ list. I am +1 for committing though. The bug (if it is one) is not related to this patch.

Port IndexingFiltersChecker to 2.0
Key: NUTCH-1038
URL: https://issues.apache.org/jira/browse/NUTCH-1038
Project: Nutch
Issue Type: New Feature
Affects Versions: nutchgora
Reporter: Markus Jelsma
Fix For: 2.2
Attachments: NUTCH-1038.patch
[ANNOUNCE] Apache Nutch 1.6 Released
Hi All,

The Apache Nutch PMC are extremely pleased to announce the release of Apache Nutch v1.6. This release includes over 20 bug fixes and as many improvements, as well as new functionality including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type, and functional enhancements to the Indexer API including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8.

A full PMC statement can be found here [0].

The release can be found on official Apache mirrors [1] as well as sources in Maven Central [2].

Thank you
Lewis
On Behalf of the Nutch PMC

[0] http://s.apache.org/NFp
[1] http://www.apache.org/dyn/closer.cgi/nutch/
[2] http://search.maven.org/#artifactdetails|org.apache.nutch|nutch|1.6|jar

--
Lewis
[jira] [Resolved] (NUTCH-1183) Summary task for adding command line usage instructions to webgraph classes
[ https://issues.apache.org/jira/browse/NUTCH-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1183.
Resolution: Not A Problem

This issue is invalid. Thinking back I really haven't got a clue what I registered it for... Sorry troops. Closing.

Summary task for adding command line usage instructions to webgraph classes
Key: NUTCH-1183
URL: https://issues.apache.org/jira/browse/NUTCH-1183
Project: Nutch
Issue Type: Improvement
Components: documentation
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
Fix For: 1.7

The following files should provide output when called incorrectly from the command line. Something similar to
{code}
Usage: class -arg1, -arg2, etc etc
{code}
* webgraph
* linkrank
* scoreupdater
* nodedumper
* nodereader

If anyone would like to see further classes included in this task please add to the above list.
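The usage output requested in this issue could look roughly like the sketch below. It is only an illustration: UsageDemo, its option names (-webgraphdb, -output) and its return codes are hypothetical and are not taken from the webgraph classes listed above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical tool, not an actual Nutch class: shows one way to print
// a usage message when required arguments are missing.
class UsageDemo {

    // Illustrative option names only (assumption); real tools define their own.
    static final List<String> REQUIRED = Arrays.asList("-webgraphdb", "-output");

    static String usage() {
        return "Usage: UsageDemo -webgraphdb <dir> -output <dir>";
    }

    // Returns true when every required flag is present and followed by a value.
    static boolean validArgs(String[] args) {
        List<String> given = Arrays.asList(args);
        for (String flag : REQUIRED) {
            int i = given.indexOf(flag);
            if (i < 0 || i + 1 >= given.size() || given.get(i + 1).startsWith("-")) {
                return false;
            }
        }
        return true;
    }

    // Prints the usage line and returns -1 on bad input, 0 otherwise.
    static int run(String[] args) {
        if (!validArgs(args)) {
            System.err.println(usage());
            return -1;
        }
        return 0;
    }

    public static void main(String[] args) {
        // Called with no arguments: prints the usage line to stderr.
        System.out.println("exit code: " + run(new String[] {}));
        // Called with both flags and values: runs normally.
        System.out.println("exit code: "
            + run(new String[] {"-webgraphdb", "crawl/webgraphdb", "-output", "out"}));
    }
}
```

Returning a status code from run() rather than exiting directly keeps the argument check easy to test.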
[jira] [Commented] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field
[ https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527276#comment-13527276 ]

Lewis John McGibbney commented on NUTCH-1140:

Do we want to integrate this into trunk and 2.x? If so, I can write the trivial test case.

index-more plugin, resetTitle method creates multiple values in the Title field
Key: NUTCH-1140
URL: https://issues.apache.org/jira/browse/NUTCH-1140
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.3
Reporter: Joe Liedtke
Priority: Minor
Fix For: 1.7
Attachments: MoreIndexingFilter.093011.patch

From the comments in MoreIndexingFilter.java, the index-more plugin is meant to reset the Title field of a document if it contains a Content-Disposition header. The current behavior is to add a Title regardless of whether one exists or not, which can cause issues down the line with the Solr indexing process. Based on a thread in the nutch user list, it appears that this is causing some users to mark the title as multi-valued in the schema:
http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8

The following patch removes the title field before adding a new one, which has resolved the issue for me:

--- MoreIndexingFilter.old 2011-09-30 11:44:35.0 +
+++ MoreIndexingFilter.java 2011-09-30 09:58:48.0 +
@@ -276,6 +276,7 @@
 for (int i=0; i<patterns.length; i++) {
   if (matcher.contains(contentDisposition,patterns[i])) {
     result = matcher.getMatch();
+    doc.removeField("title");
     doc.add("title", result.group(1));
     break;
   }
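To illustrate why the order matters, here is a minimal sketch with a stand-in document class. SimpleDoc is hypothetical and is not the Nutch document class; it only assumes that add() appends to a list of values, which is the behaviour the issue describes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for an index document; NOT the real Nutch class.
class SimpleDoc {
    private final Map<String, List<String>> fields = new HashMap<>();

    // Appends a value; existing values for the field are kept.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Drops every value stored under the given field name.
    void removeField(String name) {
        fields.remove(name);
    }

    List<String> values(String name) {
        return fields.getOrDefault(name, new ArrayList<>());
    }

    // Replaces any existing title, mirroring the remove-then-add order in the patch.
    static void resetTitle(SimpleDoc doc, String newTitle) {
        doc.removeField("title");
        doc.add("title", newTitle);
    }

    public static void main(String[] args) {
        SimpleDoc doc = new SimpleDoc();
        doc.add("title", "Title from HTML");
        resetTitle(doc, "report.pdf");
        // Only the Content-Disposition title remains.
        System.out.println(doc.values("title"));
    }
}
```

Without the removeField call, the second add would leave two title values, which a single-valued Solr field rejects.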
[jira] [Commented] (NUTCH-1409) Remove deprecated properties in nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527278#comment-13527278 ]

Lewis John McGibbney commented on NUTCH-1409:

I am not sure about the logging for this one. If we are to remove the properties, thereby denying users the choice to override them, then why do we need to log that they should use some other settings instead? Saying that, this one is nearly ready though.

@Matthias, off the top of your head, I wonder if you were able to check out 2.x and comment on the code? Thank you very much for the patch.

Remove deprecated properties in nutch-default.xml
Key: NUTCH-1409
URL: https://issues.apache.org/jira/browse/NUTCH-1409
Project: Nutch
Issue Type: Improvement
Reporter: Matthias Agethle
Priority: Minor
Fix For: 1.7
Attachments: NUTCH-1409.patch

1) Remove deprecated properties from nutch-default.xml (generate.max.per.host and db.default.fetch.interval).
2) The already removed properties generate.max.per.host.by.ip and db.max.fetch.interval are still used in source code.
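One possible reading of the logging question: a user's own site configuration may still set a key after it disappears from nutch-default.xml, so a warning tells them the value is no longer the preferred setting. The sketch below shows that pattern with plain java.util.Properties. It is not the attached patch, and the replacement key name generate.max.count is an assumption used for illustration only.

```java
import java.util.Properties;

// Sketch of a "prefer new key, warn on deprecated key" lookup.
// Not taken from NUTCH-1409.patch; key names passed in are up to the caller.
class DeprecatedProps {

    // Reads newKey; falls back to deprecatedKey (with a warning), then to defaultValue.
    static int getInt(Properties conf, String newKey, String deprecatedKey, int defaultValue) {
        String value = conf.getProperty(newKey);
        if (value == null) {
            value = conf.getProperty(deprecatedKey);
            if (value != null) {
                System.err.println("WARN: " + deprecatedKey + " is deprecated, use " + newKey);
            }
        }
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // A user override that still uses the old name.
        conf.setProperty("generate.max.per.host", "100");
        // "generate.max.count" is an assumed replacement name.
        int max = getInt(conf, "generate.max.count", "generate.max.per.host", -1);
        System.out.println("effective limit: " + max);
    }
}
```

A fallback like this only makes sense when the old and new keys share units and semantics; otherwise a warning without the fallback is the safer choice.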
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-840:
Attachment: NUTCH-840v2.patch

This is for trunk. There is a problem here where the new tests (for parse-tika) also seem to be executed against (within?) other plugin testing scenarios... I am stuck atm as to why this is. Once we fix it we will port to 2.x.

Port tests from parse-html to parse-tika
Key: NUTCH-840
URL: https://issues.apache.org/jira/browse/NUTCH-840
Project: Nutch
Issue Type: Task
Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 2.2
Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
Re: [ANNOUNCE] Apache Nutch 1.6 Released
Great stuff! Thanks Lewis

On 8 December 2012 21:50, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi All,

The Apache Nutch PMC are extremely pleased to announce the release of Apache Nutch v1.6. This release includes over 20 bug fixes and as many improvements, as well as new functionality including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type, and functional enhancements to the Indexer API including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8.

A full PMC statement can be found here [0].

The release can be found on official Apache mirrors [1] as well as sources in Maven Central [2].

Thank you
Lewis
On Behalf of the Nutch PMC

[0] http://s.apache.org/NFp
[1] http://www.apache.org/dyn/closer.cgi/nutch/
[2] http://search.maven.org/#artifactdetails|org.apache.nutch|nutch|1.6|jar

--
Lewis

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-840:
Attachment: NUTCH-840-trunk.patch

Modified version of the patch to fix the tests post NUTCH-797.

Port tests from parse-html to parse-tika
Key: NUTCH-840
URL: https://issues.apache.org/jira/browse/NUTCH-840
Project: Nutch
Issue Type: Task
Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 2.2
Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
[jira] [Commented] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527362#comment-13527362 ]

Julien Nioche commented on NUTCH-840:

The tests now run OK with the patch I just attached.

bq. There is a problem here where the new tests (for parse-tika) also seem to be executed against (within?) other plugin testing scenarios

Can you give more detail on this please, Lewis?

Port tests from parse-html to parse-tika
Key: NUTCH-840
URL: https://issues.apache.org/jira/browse/NUTCH-840
Project: Nutch
Issue Type: Task
Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 2.2
Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-840:
Affects Version/s: 1.6
Fix Version/s: 1.7

Port tests from parse-html to parse-tika
Key: NUTCH-840
URL: https://issues.apache.org/jira/browse/NUTCH-840
Project: Nutch
Issue Type: Task
Components: parser
Affects Versions: 1.1, 1.6
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.7, 2.2
Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, NUTCH-840v2.patch

We don't have tests for HTML in parse-tika, so I'll copy them from the old parse-html plugin.
[jira] [Updated] (NUTCH-891) Nutch build should not depend on unversioned local deps
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-891:
Affects Version/s: 2.1

Probably not an issue anymore. Marking it as 2.x to triage unversioned issues; will check later.

Nutch build should not depend on unversioned local deps
Key: NUTCH-891
URL: https://issues.apache.org/jira/browse/NUTCH-891
Project: Nutch
Issue Type: Bug
Affects Versions: 2.1
Reporter: Andrzej Bialecki
Attachments: gora-49_v1.patch, gora.build.patch

The fix in NUTCH-873 introduces an unknown variable to the build process. Since local ivy artifacts are unversioned, different people who install Gora jars at different points in time will use the same artifact id, but in fact the artifacts (jars) will differ because they will come from different revisions of Gora sources. Therefore Nutch builds based on the same svn rev. won't be repeatable across different environments.

As much as it pains the ivy purists ;) until Gora publishes versioned artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars built from a known external rev. We can add a README that contains the commit id from Gora.
[jira] [Closed] (NUTCH-807) JSParseFilter produces malformed URL
[ https://issues.apache.org/jira/browse/NUTCH-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-807.
Resolution: Won't Fix

Closing old issues. The JSParseFilter is known to generate noisy URLs and is not used by default anymore. This won't get fixed.

JSParseFilter produces malformed URL
Key: NUTCH-807
URL: https://issues.apache.org/jira/browse/NUTCH-807
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.0.0
Environment: Redhat 2.6.18-128.1.6.el5PAE i686 i686 i386 GNU/Linux
Reporter: Minyao Zhu

This was found when crawling the site http://zhidao.baidu.com/ (a Chinese language site). It appears this page contains JavaScript that confused JSParseFilter, which produced a URL like this:
http://zhidao.baidu.com/){if(A===46){baidu.hide(

Not sure of the impact/scope of this issue in general. The observation for this specific site is that far fewer pages got crawled. Thanks.
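For readers who still enable the filter, a syntax check can catch candidates like the one reported above. The sketch below uses only java.net.URI and is not something Nutch is known to do. OutlinkCheck and isPlausibleUrl are hypothetical names, and the check is only a rough filter: it catches characters that are illegal in a URI but not every piece of script text mistaken for a link.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of JSParseFilter: rejects outlink candidates
// that do not parse as an absolute URI with a host.
class OutlinkCheck {

    static boolean isPlausibleUrl(String candidate) {
        try {
            URI uri = new URI(candidate);
            // Require a scheme and a host so relative fragments are rejected too.
            return uri.isAbsolute() && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Characters such as '{' are not legal in a URI.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "http://zhidao.baidu.com/",
            "http://zhidao.baidu.com/){if(A===46){baidu.hide("
        };
        for (String c : candidates) {
            System.out.println(isPlausibleUrl(c) + "  " + c);
        }
    }
}
```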
[jira] [Resolved] (NUTCH-62) Add html META tag information into metaData in index-more plugin
[ https://issues.apache.org/jira/browse/NUTCH-62?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-62.
Resolution: Implemented

This can be done in a more flexible way using index-metadata: https://issues.apache.org/jira/browse/NUTCH-1264

Add html META tag information into metaData in index-more plugin
Key: NUTCH-62
URL: https://issues.apache.org/jira/browse/NUTCH-62
Project: Nutch
Issue Type: Improvement
Components: indexer
Reporter: Jack Tang
Priority: Trivial
Attachments: index-more.patch.zip

Now (version dev-0.7), only some metaData in the HTTP response, such as type, date and content-length, are available in the index-more plugin. We cannot index/store the meta data in the HTML header (META, to be exact).
[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata
[ https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1267:
Assignee: Julien Nioche

urlmeta to delegate indexing to index-metadata
Key: NUTCH-1267
URL: https://issues.apache.org/jira/browse/NUTCH-1267
Project: Nutch
Issue Type: Sub-task
Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche