[jira] [Commented] (NUTCH-1414) Date extraction parse filter

Markus Jelsma (JIRA) Mon, 27 Jan 2014 07:30:02 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882886#comment-13882886
 ]


Markus Jelsma commented on NUTCH-1414:
--------------------------------------

Hi Luke,

* We send it to Solr using the formatter below. That is the format Solr/Lucene 
expects to get.

    protected SimpleDateFormat formattedDate =
        new SimpleDateFormat("yyyy-MM-dd'T00:00:00'Z");

* Yes, I did. I separated the tool from Nutch and made some small changes. One 
notable change is that a date extracted from the URL now takes precedence by 
default. You do have to expand the regexes a bit to ignore false dates in 
URLs.

* Makes sense. I limited the size to a) prevent the regular expressions from 
choking on very large pages and b) ignore dates that do not represent the 
article or published date. This is also the reason we no longer use this on 
its own but have tied it into our text extraction tool. There it knows the 
context of a page, so it won't yield many false positives.

* No, not likely, but the patch works, so you should not have much trouble 
using it. It might get committed if enough users express their interest.
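
To illustrate the first point, here is a minimal sketch of formatting a date 
for a Solr date field, which takes UTC timestamps such as 
2014-01-27T00:00:00Z. It is not the code from the patch: the class and method 
names are hypothetical, and it assumes the time of day is always fixed to 
midnight UTC. Note that in SimpleDateFormat an unquoted Z is a pattern letter 
that prints a numeric offset such as +0000, so the sketch keeps the Z inside 
the quoted literal.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper class, not part of Nutch or the attached patch.
class SolrDateSketch {

    // Formats the calendar day of the given instant (in UTC) followed by a
    // fixed midnight time and a literal trailing 'Z'.
    static String toSolrDate(Date date) {
        // Everything between the single quotes is emitted verbatim.
        SimpleDateFormat fmt =
            new SimpleDateFormat("yyyy-MM-dd'T00:00:00Z'", Locale.ROOT);
        // Without this the day would be computed in the JVM's default zone.
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        // The Unix epoch falls on 1970-01-01 in UTC.
        System.out.println(toSolrDate(new Date(0L)));
    }
}
```

SimpleDateFormat is not thread-safe, which is why the sketch creates a new 
instance per call instead of sharing one in a field.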

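The URL-first preference and the size limit described above could look 
roughly like the sketch below. Everything in it is an assumption made for 
illustration: the two regular expressions, the 2000 character limit, the 
accepted year range and all names are hypothetical and are not taken from the 
patch. Impossible dates such as month 99 are rejected, which is one simple way 
to skip false dates in URLs.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the implementation attached to this issue.
class UrlDateSketch {

    // Assumed URL layout: /yyyy/MM/dd or /yyyy-MM-dd, not followed by a digit.
    private static final Pattern URL_DATE =
        Pattern.compile("/(\\d{4})[/-](\\d{1,2})[/-](\\d{1,2})(?![0-9])");

    // Assumed text layout: ISO style yyyy-MM-dd.
    private static final Pattern TEXT_DATE =
        Pattern.compile("\\b(\\d{4})-(\\d{2})-(\\d{2})\\b");

    // Assumed limit: only the head of the parse text is searched.
    static final int MAX_TEXT_CHARS = 2000;

    // A date found in the URL wins; otherwise search the truncated text.
    static Optional<LocalDate> extract(String url, String text) {
        Optional<LocalDate> fromUrl = firstValid(URL_DATE, url);
        if (fromUrl.isPresent()) {
            return fromUrl;
        }
        String head = text.length() > MAX_TEXT_CHARS
            ? text.substring(0, MAX_TEXT_CHARS)
            : text;
        return firstValid(TEXT_DATE, head);
    }

    // Returns the first match that is a real calendar date in an assumed
    // plausible year range, skipping false matches such as /2014/99/99.
    private static Optional<LocalDate> firstValid(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            try {
                int year = Integer.parseInt(m.group(1));
                int month = Integer.parseInt(m.group(2));
                int day = Integer.parseInt(m.group(3));
                LocalDate date = LocalDate.of(year, month, day);
                if (year >= 1990 && year <= 2100) {
                    return Optional.of(date);
                }
            } catch (DateTimeException e) {
                // Not a valid calendar date; try the next match.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(
            extract("http://example.com/2014/01/27/story.html", ""));
    }
}
```
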
> Date extraction parse filter
> ----------------------------
>
>                 Key: NUTCH-1414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1414
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
>
>
> Date extraction parse filter for Nutch to provide means to extract an 
> arbitrary page date (article date) from the parse text.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
