[ 
https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2227:
---------------------------------
    Attachment: NUTCH-2227.patch

Patch for trunk! Tests pass.

> RegexParseFilter
> ----------------
>
>                 Key: NUTCH-2227
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2227
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2227.patch
>
>
> A parse filter that takes a regex and a field name. If the regex matches the
> HTML via matcher.find(), the field name is set to true in the CrawlDatum's
> metadata.
> Combined with the HostDB, it is easy to get a list of hosts that match some 
> regex criteria.
> {code}
> # Example configuration file for parsefilter-regex
> #
> # Parse metadata field <name> is set to true if the HTML matches the regex. The
> # source can either be html or text. If source is html, the regex is applied to
> # the entire HTML tree. If source is text, the regex is applied to the
> # extracted text.
> #
> # format: <name>\t<source>\t<regex>\n
> {code}
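
To illustrate the rule format and the matcher.find() semantics described above, here is a minimal standalone sketch. It is not the code from the attached patch: the class RegexRuleSketch, its Rule type, and the parseLine/apply methods are hypothetical names used only for illustration, and the sketch assumes that only matching rules are recorded in the resulting metadata map. It uses nothing beyond java.util.regex, so it runs without Nutch on the classpath.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the plugin class from NUTCH-2227.patch.
public class RegexRuleSketch {

    /** One rule parsed from a line of the form <name>\t<source>\t<regex>. */
    static final class Rule {
        final String name;     // metadata field to set
        final String source;   // "html" or "text"
        final Pattern pattern; // compiled regex

        Rule(String name, String source, Pattern pattern) {
            this.name = name;
            this.source = source;
            this.pattern = pattern;
        }
    }

    /** Parses one config line; returns null for blank lines and # comments. */
    static Rule parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null;
        }
        // Limit of 3 so that any tab inside the regex itself is preserved.
        String[] parts = trimmed.split("\t", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 tab-separated fields: " + line);
        }
        return new Rule(parts[0], parts[1], Pattern.compile(parts[2]));
    }

    /**
     * Applies each rule to the raw HTML or the extracted text, depending on
     * its source. Assumption of this sketch: a field is added (as "true")
     * only when its regex matches somewhere in the input.
     */
    static Map<String, String> apply(List<Rule> rules, String html, String text) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Rule rule : rules) {
            String input = "text".equals(rule.source) ? text : html;
            // find() succeeds on a match anywhere in the input, whereas
            // matches() would require the whole input to match.
            if (rule.pattern.matcher(input).find()) {
                metadata.put(rule.name, "true");
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
            parseLine("hasForm\thtml\t<form[^>]*>"),
            parseLine("mentionsNutch\ttext\t(?i)nutch"));
        String html = "<html><body><form action=\"/q\"></form>Apache Nutch</body></html>";
        String text = "Apache Nutch";
        System.out.println(apply(rules, html, text));
    }
}
```

Because find() is used rather than matches(), a rule does not need leading or trailing .* to match a fragment inside a large page.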



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
