[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686169#comment-13686169 ]
Lewis John McGibbney edited comment on NUTCH-1475 at 6/17/13 11:32 PM:
-----------------------------------------------------------------------

Committed @revision 1493973 in 2.x HEAD
I made issue more general so that this issue can be kept open for trunk.
Thanks for the patch James. You are in CHANGES.txt... again ;)

was (Author: lewismc):
Committed @revision 1493973 in 2.x HEAD
I made issue more general so that this issue can be kept open for trunk.

> Index-More Plugin -- A better fall back value for date field
> ------------------------------------------------------------
>
>                 Key: NUTCH-1475
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1475
>             Project: Nutch
>          Issue Type: Bug
>   Affects Versions: 2.1, 1.5.1
>         Environment: All
>            Reporter: James Sullivan
>            Priority: Minor
>              Labels: index-more, plugins
>             Fix For: 2.3, 1.8
>
>         Attachments: index-more-1xand2x.patch, index-more-2x.patch,
> index-more-2x.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Among other fields, the more plugin for Nutch 2.x provides a "last modified"
> and "date" field for the Solr index. The "last modified" field is the last
> modified date from the http headers if available, if not available it is left
> empty. Currently, the "date" field is the same as the "last modified" field
> unless that field is empty in which case getFetchTime is used as a fall back.
> I think getFetchTime is not a good fall back as it is the next fetch time and
> often a month or more in the future which doesn't make sense for the date
> field. Users do not expect webpages/documents with future dates. A more
> sensible fallback would be current date at the time it is indexed.
> This is possible by simply changing line 97 of
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
> from
> time = page.getFetchTime(); // use fetch time
> to
> time = new Date().getTime();
> Users interested in the getFetchTime value can still get it from the "tstamp"
> field.
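
For readers who want to see the proposed fallback in isolation, here is a minimal stand-alone sketch. It is not the actual MoreIndexingFilter code: the class name DateFallbackSketch, the method resolveDate, and the use of -1 to mark a missing Last-Modified header are all assumptions made for illustration. It only shows the rule described in the issue: use the Last-Modified value when there is one, otherwise use the time of indexing rather than getFetchTime.

```java
import java.util.Date;

// Hypothetical sketch of the fallback described in NUTCH-1475.
// None of these names come from the Nutch code base.
class DateFallbackSketch {

    // Assumption: -1 marks a missing or unparseable Last-Modified header.
    static final long MISSING = -1L;

    /**
     * Picks the value for the "date" field.
     *
     * @param lastModifiedMillis Last-Modified header as epoch millis, or MISSING
     * @param indexTimeMillis    the time at which the document is being indexed
     * @return the header value if present, otherwise the indexing time
     */
    static long resolveDate(long lastModifiedMillis, long indexTimeMillis) {
        if (lastModifiedMillis != MISSING) {
            return lastModifiedMillis;
        }
        // Proposed fallback: "now" at indexing time, not the next fetch time.
        return indexTimeMillis;
    }

    public static void main(String[] args) {
        long now = new Date().getTime();
        long header = 1371511920000L; // arbitrary example timestamp

        // Header present: the header value is used.
        System.out.println(resolveDate(header, now) == header);
        // Header missing: the indexing time is used.
        System.out.println(resolveDate(MISSING, now) == now);
    }
}
```

Passing the indexing time in as a parameter, instead of calling new Date() inside the method, is only a choice made here to keep the sketch easy to test; the one-line change quoted above calls new Date().getTime() directly.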