Markus Jelsma created NUTCH-2227:
------------------------------------

             Summary: RegexParseFilter
                 Key: NUTCH-2227
                 URL: https://issues.apache.org/jira/browse/NUTCH-2227
             Project: Nutch
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.11
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.12

A parse filter that takes a regex and a field name. If the regex matches the HTML via matcher.find(), the field with that name is set to true in the CrawlDatum's metadata. Combined with the HostDB, this makes it easy to get a list of hosts that match some regex criteria.
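
To illustrate the matching rule described above, here is a minimal sketch in plain Java. It is not the actual filter: the class name RegexFieldSketch, the helper markIfFound, the field name has_generator and the sample regex are all hypothetical, and a plain Map stands in for the CrawlDatum's metadata so the example runs without Nutch on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class RegexFieldSketch {

    // Hypothetical helper: if `regex` is found anywhere in `html` (Matcher.find(),
    // not a full-string match), store "true" under `field` in `metadata`.
    // The Map is an assumed stand-in for the CrawlDatum's metadata.
    static boolean markIfFound(Pattern regex, String field, String html,
                               Map<String, String> metadata) {
        boolean found = regex.matcher(html).find();
        if (found) {
            metadata.put(field, "true");
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();

        // Example regex (an assumption): pages that declare a generator meta tag.
        Pattern pattern = Pattern.compile("<meta[^>]+name=\"generator\"",
                                          Pattern.CASE_INSENSITIVE);
        String html = "<html><head><meta name=\"generator\" content=\"x\"></head></html>";

        markIfFound(pattern, "has_generator", html, metadata);
        System.out.println(metadata); // prints {has_generator=true}
    }
}
```

Because find() looks for the pattern anywhere in the input, the regex does not need to cover the whole document. When nothing matches, the sketch leaves the field unset rather than writing false.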