[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688607#comment-13688607 ]

Hudson commented on NUTCH-1475:
-------------------------------

Integrated in Nutch-trunk #2249 (See [https://builds.apache.org/job/Nutch-trunk/2249/])
    NUTCH-1475 (fix after fix) fill field "date" with fetch time (as before) if modified time is unset (Revision 1494785)

     Result = SUCCESS
snagel : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1494785
Files :
* /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java


> Index-More Plugin -- A better fall back value for date field
> ------------------------------------------------------------
>
>                 Key: NUTCH-1475
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1475
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.1, 1.5.1
>         Environment: All
>            Reporter: James Sullivan
>            Assignee: Sebastian Nagel
>            Priority: Minor
>              Labels: index-more, plugins
>             Fix For: 2.3, 1.8
>
>         Attachments: index-more-1xand2x.patch, index-more-2x.patch,
> index-more-2x.patch, NUTCH-1475-trunk-v1.patch, NUTCH-1475-trunk-v2.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Among other fields, the more plugin for Nutch 2.x provides a "last modified"
> and "date" field for the Solr index. The "last modified" field is the last
> modified date from the http headers if available, if not available it is left
> empty. Currently, the "date" field is the same as the "last modified" field
> unless that field is empty in which case getFetchTime is used as a fall back.
> I think getFetchTime is not a good fall back as it is the next fetch time and
> often a month or more in the future which doesn't make sense for the date
> field. Users do not expect webpages/documents with future dates. A more
> sensible fallback would be current date at the time it is indexed.
> This is possible by simply changing line 97 of
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
> from
>       time = page.getFetchTime(); // use fetch time
> to
>       time = new Date().getTime();
> Users interested in the getFetchTime value can still get it from the "tstamp"
> field.
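
The fallback choice discussed in the description can be illustrated with a small standalone sketch. It does not reproduce MoreIndexingFilter: the class DateFallbackSketch and the helper resolveDate are hypothetical names, the "about 30 days ahead" fetch time is a made-up stand-in for the next fetch time the reporter describes, and treating a value of 0 or less as "modified time unset" is an assumption made only for this example.

```java
import java.util.Date;

// Illustrative sketch only: DateFallbackSketch and resolveDate are hypothetical
// names and are not part of Nutch or MoreIndexingFilter.
class DateFallbackSketch {

    // Returns the modified time when it is set, otherwise the supplied fallback.
    // Assumption: an unset modified time is represented by a value <= 0.
    static long resolveDate(long modifiedTime, long fallbackTime) {
        return modifiedTime > 0 ? modifiedTime : fallbackTime;
    }

    public static void main(String[] args) {
        // Current time, standing in for "the time it is indexed".
        long now = new Date().getTime();
        // Made-up stand-in for a next fetch time about 30 days in the future.
        long nextFetch = now + 30L * 24 * 60 * 60 * 1000;

        // Modified time unset, next fetch time as fallback: date lies in the future.
        System.out.println(resolveDate(0L, nextFetch) > now); // prints true

        // Modified time unset, index time as fallback: date is not in the future.
        System.out.println(resolveDate(0L, now) > now);       // prints false

        // Modified time set: the fallback is ignored.
        System.out.println(resolveDate(1000L, now) == 1000L); // prints true
    }
}
```

Passing the fallback in as a parameter keeps the two candidate behaviours (fetch time versus index time) side by side, so the difference the reporter describes shows up directly in the output.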