Markus Jelsma created NUTCH-2227:
------------------------------------

             Summary: RegexParseFilter
                 Key: NUTCH-2227
                 URL: https://issues.apache.org/jira/browse/NUTCH-2227
             Project: Nutch
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.11
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.12


A parse filter that takes a regex and a field name. If the regex matches 
the HTML via matcher.find(), the field name is set to true in the 
CrawlDatum's metadata.
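
As a rough illustration of the matching step only, here is a minimal sketch 
in plain Java. It does not use the Nutch plugin API: the class name 
RegexFieldSketch, the apply() helper, the has_form field name and the 
Map standing in for the CrawlDatum's metadata are all assumptions made for 
this example, not the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for the filter logic; not the actual Nutch plugin code.
class RegexFieldSketch {

  // Sets fieldName to "true" in metadata when regex is found anywhere in html.
  // The Map is an assumed substitute for the CrawlDatum's metadata.
  static boolean apply(String html, Pattern regex, String fieldName,
      Map<String, String> metadata) {
    // find() looks for the pattern in any part of the input,
    // unlike matches(), which requires the whole input to match.
    boolean found = regex.matcher(html).find();
    if (found) {
      metadata.put(fieldName, "true");
    }
    return found;
  }

  public static void main(String[] args) {
    Map<String, String> metadata = new HashMap<>();
    // Example regex and field name, chosen only for illustration.
    Pattern regex = Pattern.compile("<form[^>]*>", Pattern.CASE_INSENSITIVE);
    String html = "<html><body><FORM action=\"/login\"></FORM></body></html>";
    apply(html, regex, "has_form", metadata);
    System.out.println(metadata);
  }
}
```

Using find() rather than matches() means the regex only has to occur 
somewhere in the page, so it does not need to describe the whole document.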

Combined with the HostDB, it is easy to get a list of hosts that match some 
regex criteria.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
