[ https://issues.apache.org/jira/browse/NUTCH-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882886#comment-13882886 ]
Markus Jelsma commented on NUTCH-1414:
--------------------------------------

Hi Luke,

* We send it to Solr using
    protected SimpleDateFormat formattedDate = new SimpleDateFormat("yyyy-MM-dd'T00:00:00'Z");
  That is the format Solr/Lucene expects to get.
* Yes, I did. I separated the tool from Nutch and made some small changes. One notable change is that a date extracted from the URL takes preference by default. You do have to expand the regexes a bit to ignore false dates in URLs.
* Makes sense. I limited the size to a) prevent the regular expressions from choking on very large pages and b) ignore dates that do not represent the article or publication date. This is also the reason we're not using this anymore but have tied it into our text extraction tool. There it knows the context of a page, so it won't yield many false positives.
* No, not likely, but the patch works, so you should not have much trouble using it. It might get committed if enough users express their interest.

> Date extraction parse filter
> ----------------------------
>
>                 Key: NUTCH-1414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1414
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
>
>
> Date extraction parse filter for Nutch to provide means to extract an
> arbitrary page date (article date) from the parse text.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
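
For anyone trying the date pattern quoted in the comment, here is a minimal sketch of what it produces. The class and method names (SolrDateSketch, formatAsQuoted, formatUtcLiteralZ) are hypothetical and not part of the patch. Two things are worth knowing: the time of day is a quoted literal, so every date is truncated to midnight, and the unquoted Z is SimpleDateFormat's RFC 822 zone offset (for example +0100), not a literal "Z". The second method is an assumed variant, not something the comment describes, that quotes the Z and pins the formatter to UTC, which matches the canonical yyyy-MM-dd'T'HH:mm:ss'Z' form in Solr's date documentation. Whether the offset form is accepted may depend on how the value is handed to Solr.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class SolrDateSketch {

    // Pattern exactly as quoted in the comment. 'T00:00:00' is a literal,
    // and the unquoted Z renders the zone offset of the formatter's time zone.
    static String formatAsQuoted(Date date, TimeZone zone) {
        SimpleDateFormat f =
            new SimpleDateFormat("yyyy-MM-dd'T00:00:00'Z", Locale.ROOT);
        f.setTimeZone(zone);
        return f.format(date);
    }

    // Assumed variant (not from the comment): the Z is part of the quoted
    // literal and the formatter is pinned to UTC.
    static String formatUtcLiteralZ(Date date) {
        SimpleDateFormat f =
            new SimpleDateFormat("yyyy-MM-dd'T00:00:00Z'", Locale.ROOT);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.format(date);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L); // 1970-01-01T00:00:00 UTC
        System.out.println(formatAsQuoted(epoch, TimeZone.getTimeZone("UTC")));
        System.out.println(formatAsQuoted(epoch, TimeZone.getTimeZone("GMT+01:00")));
        System.out.println(formatUtcLiteralZ(epoch));
    }
}
```

SimpleDateFormat is not thread-safe, so a shared field like the one in the comment needs external synchronization or one instance per thread if the filter runs concurrently.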