Re: Nutch Site
Hi Lewis,

Brilliant! Thanks a lot.

Julien

On 18 June 2013 05:32, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi All,

@Julien, a while ago you mentioned changing the Nutch site to point more directly to Downloads. I agreed with this, but as I didn't deal with it then and there, it got put to the bottom of my TODO list. Anyway, today I got around to it and our site is now more directly linked to the Downloads page. In time I will inevitably migrate our site and documentation over to the Apache CMS, but for the meantime this will do, I suppose. You'll now notice that in the feed strip the Download link is right up there.

Thanks for now,
Lewis

--
Lewis

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1527:

Attachment: NUTCH-1527.patch

OK, here's a new patch. If you set elastic.host (elastic.port defaults to 9300), TransportClient will be used. elastic.cluster remains mandatory. The properties are now also available in nutch-default. Please comment and report issues. Thanks.

Port nutch-elasticsearch-indexer to Nutch

Key: NUTCH-1527
URL: https://issues.apache.org/jira/browse/NUTCH-1527
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: Markus Jelsma
Priority: Minor
Fix For: 2.4
Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch

The source repos for this can be found here [0]. This issue should be in line with the work already done by Julien and others over at NUTCH-1047.

[0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
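The property handling described in the comment above can be sketched roughly as follows. This is only an illustration, not code from the patch: the class `ElasticClientChoice` and its `Mode` enum are hypothetical, plain `java.util.Properties` stands in for the Nutch configuration, and no Elasticsearch API is called. The comment does not say which client is used when elastic.host is absent, so that case is simply labelled `FALLBACK` here.

```java
import java.util.Properties;

/** Hypothetical helper illustrating the described property rules; not part of the NUTCH-1527 patch. */
final class ElasticClientChoice {
  /** TRANSPORT when elastic.host is set; FALLBACK is a placeholder for whatever is used otherwise. */
  enum Mode { TRANSPORT, FALLBACK }

  static final int DEFAULT_PORT = 9300; // default for elastic.port, as stated in the comment

  final Mode mode;
  final String cluster;
  final String host; // null in FALLBACK mode
  final int port;

  private ElasticClientChoice(Mode mode, String cluster, String host, int port) {
    this.mode = mode;
    this.cluster = cluster;
    this.host = host;
    this.port = port;
  }

  static ElasticClientChoice from(Properties conf) {
    // elastic.cluster is mandatory in every mode.
    String cluster = conf.getProperty("elastic.cluster");
    if (cluster == null || cluster.trim().isEmpty()) {
      throw new IllegalArgumentException("elastic.cluster is mandatory");
    }
    // elastic.port falls back to 9300 when it is not set.
    int port = Integer.parseInt(
        conf.getProperty("elastic.port", String.valueOf(DEFAULT_PORT)));
    // Setting elastic.host selects the transport client.
    String host = conf.getProperty("elastic.host");
    if (host != null && !host.trim().isEmpty()) {
      return new ElasticClientChoice(Mode.TRANSPORT, cluster, host.trim(), port);
    }
    return new ElasticClientChoice(Mode.FALLBACK, cluster, null, port);
  }
}
```

With only elastic.cluster set, `from` yields `FALLBACK`; adding elastic.host switches it to `TRANSPORT` on port 9300 unless elastic.port overrides it.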
[jira] [Created] (NUTCH-1583) Headings does not support multiValued headings
Markus Jelsma created NUTCH-1583:

Summary: Headings does not support multiValued headings
Key: NUTCH-1583
URL: https://issues.apache.org/jira/browse/NUTCH-1583
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.8

Headings can now support multiple values since NUTCH-1560 and NUTCH-1467.
[jira] [Updated] (NUTCH-1583) Headings does not support multiValued headings
[ https://issues.apache.org/jira/browse/NUTCH-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1583:

Attachment: NUTCH-1583.patch

Patch for trunk. If headings.multivalued=true, multiple values will be recorded and indexed. The default is false to preserve current behaviour. Comments?

Headings does not support multiValued headings

Key: NUTCH-1583
URL: https://issues.apache.org/jira/browse/NUTCH-1583
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.8
Attachments: NUTCH-1583.patch
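To make the effect of the headings.multivalued switch concrete, here is a small sketch. It is not taken from the patch: `HeadingCollector` is a hypothetical class, and it assumes that the existing single-valued behaviour keeps only the first non-empty heading found, which the comment does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a multi-valued switch; not the NUTCH-1583 patch. */
final class HeadingCollector {
  /**
   * Returns the headings to record for one heading element type (e.g. h1).
   * Assumption: when multiValued is false, only the first non-empty heading is kept.
   */
  static List<String> collect(List<String> found, boolean multiValued) {
    List<String> kept = new ArrayList<>();
    for (String heading : found) {
      String text = heading.trim();
      if (text.isEmpty()) {
        continue; // skip headings without text
      }
      kept.add(text);
      if (!multiValued) {
        break; // single-valued mode: stop after the first usable heading
      }
    }
    return kept;
  }
}
```

With the flag off the result has at most one entry, so an index schema with a single-valued field keeps working; with the flag on, the target field would need to accept multiple values.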
[jira] [Updated] (NUTCH-1475) Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1475:

Attachment: NUTCH-1475-trunk-v1.patch

Why not rely first on CrawlDatum's modifiedTime? See patch.

Index-More Plugin -- A better fall back value for date field

Key: NUTCH-1475
URL: https://issues.apache.org/jira/browse/NUTCH-1475
Project: Nutch
Issue Type: Bug
Affects Versions: 2.1, 1.5.1
Environment: All
Reporter: James Sullivan
Priority: Minor
Labels: index-more, plugins
Fix For: 2.3, 1.8
Attachments: index-more-1xand2x.patch, index-more-2x.patch, index-more-2x.patch, NUTCH-1475-trunk-v1.patch
Original Estimate: 1h
Remaining Estimate: 1h

Among other fields, the more plugin for Nutch 2.x provides a last modified field and a date field for the Solr index. The last modified field is the last modified date from the HTTP headers if available; otherwise it is left empty. Currently, the date field is the same as the last modified field unless that field is empty, in which case getFetchTime is used as a fallback. I think getFetchTime is not a good fallback, as it is the next fetch time and often a month or more in the future, which doesn't make sense for the date field. Users do not expect webpages/documents with future dates. A more sensible fallback would be the current date at the time the page is indexed. This is possible by simply changing line 97 of https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java from

  time = page.getFetchTime(); // use fetch time

to

  time = new Date().getTime();

Users interested in the getFetchTime value can still get it from the tstamp field.
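The fallback chain under discussion can be sketched as below. This is one reading of the thread, not the attached patch: `DateFallback` is a hypothetical class, the times are plain epoch milliseconds rather than Nutch's CrawlDatum or WebPage objects, and the order (Last-Modified header, then the stored modified time, then the indexing time) is an assumption that combines the reporter's proposal with Sebastian's suggestion.

```java
/** Hypothetical sketch of a date fallback chain; not the NUTCH-1475 patch itself. */
final class DateFallback {
  /**
   * Picks the value for the "date" field. All arguments are milliseconds since
   * the epoch; a value <= 0 means "not available".
   *
   * @param lastModifiedHeader time parsed from the HTTP Last-Modified header
   * @param storedModifiedTime modified time kept in the crawl record (assumed ordering)
   * @param indexingTime       time at which the page is being indexed
   */
  static long resolve(long lastModifiedHeader, long storedModifiedTime, long indexingTime) {
    if (lastModifiedHeader > 0) {
      return lastModifiedHeader; // best source: what the server reported
    }
    if (storedModifiedTime > 0) {
      return storedModifiedTime; // next: what the crawler recorded earlier
    }
    return indexingTime; // last resort: now, never a future fetch time
  }
}
```

A caller would pass `System.currentTimeMillis()` as `indexingTime`; taking it as a parameter keeps the method deterministic. None of the branches can return a date in the future unless an input is already in the future, which is the property the issue asks for.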
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686830#comment-13686830 ]

lufeng commented on NUTCH-1527:

Thanks Markus, I tried the patch and was able to index documents successfully. +1 for commit.

Port nutch-elasticsearch-indexer to Nutch

Key: NUTCH-1527
URL: https://issues.apache.org/jira/browse/NUTCH-1527
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: Markus Jelsma
Priority: Minor
Fix For: 2.4
Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch
Re: Nutch Site
Woot you da man Lewis

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: Julien Nioche lists.digitalpeb...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Tuesday, June 18, 2013 1:01 AM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Re: Nutch Site

Hi Lewis,

Brilliant! Thanks a lot.

Julien

On 18 June 2013 05:32, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi All,

@Julien, a while ago you mentioned changing the Nutch site to point more directly to Downloads. I agreed with this, but as I didn't deal with it then and there, it got put to the bottom of my TODO list. Anyway, today I got around to it and our site is now more directly linked to the Downloads page. In time I will inevitably migrate our site and documentation over to the Apache CMS, but for the meantime this will do, I suppose. You'll now notice that in the feed strip the Download link is right up there.

Thanks for now,
Lewis

--
Lewis

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
[jira] [Resolved] (NUTCH-1475) Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1475.

Resolution: Fixed

Committed @revision 1494234 in trunk. Thank you [~wastl-nagel] for the final patch.

Index-More Plugin -- A better fall back value for date field

Key: NUTCH-1475
URL: https://issues.apache.org/jira/browse/NUTCH-1475
Project: Nutch
Issue Type: Bug
Affects Versions: 2.1, 1.5.1
Environment: All
Reporter: James Sullivan
Priority: Minor
Labels: index-more, plugins
Fix For: 2.3, 1.8
Attachments: index-more-1xand2x.patch, index-more-2x.patch, index-more-2x.patch, NUTCH-1475-trunk-v1.patch
Original Estimate: 1h
Remaining Estimate: 1h
[jira] [Commented] (NUTCH-1475) Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687079#comment-13687079 ]

Hudson commented on NUTCH-1475:

Integrated in Nutch-trunk #2245 (See [https://builds.apache.org/job/Nutch-trunk/2245/])
NUTCH-1475 Index-More Plugin -- A better fall back value for date field (Revision 1494234)

Result = SUCCESS
lewismc : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1494234
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java

Index-More Plugin -- A better fall back value for date field

Key: NUTCH-1475
URL: https://issues.apache.org/jira/browse/NUTCH-1475
Project: Nutch
Issue Type: Bug
Affects Versions: 2.1, 1.5.1
Environment: All
Reporter: James Sullivan
Priority: Minor
Labels: index-more, plugins
Fix For: 2.3, 1.8
Attachments: index-more-1xand2x.patch, index-more-2x.patch, index-more-2x.patch, NUTCH-1475-trunk-v1.patch
Original Estimate: 1h
Remaining Estimate: 1h
Jenkins build is back to normal : Nutch-trunk #2245
See https://builds.apache.org/job/Nutch-trunk/2245/changes
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687176#comment-13687176 ]

Lewis John McGibbney commented on NUTCH-1527:

Hi Markus, the attached patch also includes your boilerpipe stuff ;) I am reverting those parts of the patch and trying it out right now.

Port nutch-elasticsearch-indexer to Nutch

Key: NUTCH-1527
URL: https://issues.apache.org/jira/browse/NUTCH-1527
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: Markus Jelsma
Priority: Minor
Fix For: 2.4
Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch
[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1527:

Attachment: NUTCH-1527v2.patch

New patch removing your Boilerpipe stuff, Markus. I am +1 for this to go in there. We keep index-solr as the default for the time being and push the RC with this included. It is a really nice addition to the release candidate.

Port nutch-elasticsearch-indexer to Nutch

Key: NUTCH-1527
URL: https://issues.apache.org/jira/browse/NUTCH-1527
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: Markus Jelsma
Priority: Minor
Fix For: 2.4
Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527v2.patch
[jira] [Created] (NUTCH-1584) Port NUTCH-1405 Allow to overwrite CrawlDatum's with injected entries to 2.x
Lewis John McGibbney created NUTCH-1584:

Summary: Port NUTCH-1405 Allow to overwrite CrawlDatum's with injected entries to 2.x
Key: NUTCH-1584
URL: https://issues.apache.org/jira/browse/NUTCH-1584
Project: Nutch
Issue Type: Improvement
Components: crawldb, injector
Reporter: Lewis John McGibbney
Priority: Minor
Fix For: 2.3

I was recently curious about what happens in 2.x when we inject similar but not identical seed lists in order to bootstrap a system. I started looking around and found NUTCH-1405. I think it would be great to port this concept to 2.x. This issue should do exactly that.
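The question of what happens when overlapping seed lists are injected can be illustrated with a toy model. This is not Nutch code and not the NUTCH-1405 implementation: `InjectMerge` is a hypothetical class, the crawl database is modelled as a plain map from URL to a metadata string, and the `overwrite` flag is an assumed stand-in for the behaviour the issue title describes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of injecting seeds into an existing store; hypothetical, not Nutch's injector. */
final class InjectMerge {
  /**
   * Merges injected seeds into the existing entries.
   *
   * @param existing  URL -> metadata already in the store
   * @param seeds     URL -> metadata from the newly injected seed list
   * @param overwrite if true, injected entries replace existing ones for the same URL;
   *                  if false, existing entries win and only new URLs are added
   */
  static Map<String, String> inject(Map<String, String> existing,
                                    Map<String, String> seeds,
                                    boolean overwrite) {
    Map<String, String> merged = new LinkedHashMap<>(existing);
    for (Map.Entry<String, String> seed : seeds.entrySet()) {
      if (overwrite || !merged.containsKey(seed.getKey())) {
        merged.put(seed.getKey(), seed.getValue());
      }
    }
    return merged;
  }
}
```

With two similar but not identical seed lists, URLs present in only one list are always added; the flag only decides which metadata survives for URLs present in both.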
[jira] [Assigned] (NUTCH-1585) Ensure duplicate tags do not exist in microformat-reltag tag set.
[ https://issues.apache.org/jira/browse/NUTCH-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-1585:

Assignee: Lewis John McGibbney

Ensure duplicate tags do not exist in microformat-reltag tag set.

Key: NUTCH-1585
URL: https://issues.apache.org/jira/browse/NUTCH-1585
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 1.6, 2.2
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 2.3, 1.8
Attachments: NUTCH-1585-2.x.patch, NUTCH-1585-trunk.patch

A WebPage can have a great many embedded tags and other such markup. Creating huge tag lists containing many duplicates is counterproductive to the process of parsing and extracting such structure. We should add a mechanism to include only a single occurrence of each tag for the microformats-reltag parser.
[jira] [Updated] (NUTCH-1585) Ensure duplicate tags do not exist in microformat-reltag tag set.
[ https://issues.apache.org/jira/browse/NUTCH-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1585:

Attachment: NUTCH-1585-trunk.patch
            NUTCH-1585-2.x.patch

Patches for trunk and 2.x. They simply check whether the tag exists in the set and, if it doesn't, add it. I suppose this check is expensive if the set is huge; however, by doing this check, the set is much smaller than it would be otherwise.

Ensure duplicate tags do not exist in microformat-reltag tag set.

Key: NUTCH-1585
URL: https://issues.apache.org/jira/browse/NUTCH-1585
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 1.6, 2.2
Reporter: Lewis John McGibbney
Fix For: 2.3, 1.8
Attachments: NUTCH-1585-2.x.patch, NUTCH-1585-trunk.patch
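The check-then-add approach described in the comment above can be sketched as follows. This is an illustration only, not the attached patches: `RelTagDedup` is a hypothetical class, and the choice of `LinkedHashSet` (to keep first-seen order) is an assumption about what the parser would want.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of tag de-duplication; not the NUTCH-1585 patches. */
final class RelTagDedup {
  /** Returns each tag once, in the order it was first seen. */
  static Set<String> uniqueTags(List<String> rawTags) {
    Set<String> tags = new LinkedHashSet<>();
    for (String tag : rawTags) {
      // Explicit check mirrors the description; Set.add alone would have the same effect.
      if (!tags.contains(tag)) {
        tags.add(tag);
      }
    }
    return tags;
  }
}
```

On the cost concern raised in the comment: with a hash-based set, `contains` and `add` take constant time on average, so the check stays cheap even for large tag sets. A list-based collection would need a linear scan for every tag.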