[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-2227:
---------------------------------
    Attachment: NUTCH-2227.patch

Updated patch; build.xml was missing.

> RegexParseFilter
> ----------------
>
>                 Key: NUTCH-2227
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2227
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2227.patch, NUTCH-2227.patch
>
>
> A parse filter that takes a regex and a field name. If the regex matches via
> matcher.find() on the HTML, the named field is set to true in the CrawlDatum's
> metadata.
> Combined with the HostDB, it is easy to get a list of hosts that match some
> regex criteria.
> {code}
> # Example configuration file for parsefilter-regex
> #
> # Parse metadata field <name> is set to true if the HTML matches the regex. The
> # source can either be html or text. If source is html, the regex is applied to
> # the entire HTML tree. If source is text, the regex is applied to the
> # extracted text.
> #
> # format: <name>\t<source>\t<regex>\n
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
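
For illustration, the rule format and the matcher.find() behaviour described above could be sketched roughly as follows. This is not the code from NUTCH-2227.patch: the class RegexRuleSketch, its Rule type, and the methods parseRules and apply are hypothetical names, and the sketch uses a plain Map in place of Nutch's metadata classes. It also assumes that a field is simply left unset when its regex does not match, which the issue text does not specify.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the implementation attached to NUTCH-2227.
class RegexRuleSketch {

  // One rule line in the format <name>\t<source>\t<regex>
  static final class Rule {
    final String name;     // metadata field to set
    final String source;   // "html" or "text"
    final Pattern pattern; // compiled regex

    Rule(String name, String source, Pattern pattern) {
      this.name = name;
      this.source = source;
      this.pattern = pattern;
    }
  }

  // Parses rules file content, skipping blank lines and '#' comment lines.
  static List<Rule> parseRules(String config) {
    List<Rule> rules = new ArrayList<>();
    for (String line : config.split("\\r?\\n")) {
      if (line.trim().isEmpty() || line.startsWith("#")) {
        continue;
      }
      // Limit of 3 keeps any tab inside the regex itself intact.
      String[] parts = line.split("\t", 3);
      if (parts.length != 3) {
        throw new IllegalArgumentException("Expected <name>\\t<source>\\t<regex>: " + line);
      }
      String source = parts[1].trim();
      if (!source.equals("html") && !source.equals("text")) {
        throw new IllegalArgumentException("Source must be html or text: " + line);
      }
      rules.add(new Rule(parts[0].trim(), source, Pattern.compile(parts[2])));
    }
    return rules;
  }

  // Applies each rule to the raw HTML or the extracted text, depending on its source.
  // Assumption: a field is set to "true" on a match and left unset otherwise.
  static Map<String, String> apply(List<Rule> rules, String html, String text) {
    Map<String, String> metadata = new LinkedHashMap<>();
    for (Rule rule : rules) {
      String input = rule.source.equals("html") ? html : text;
      // find() succeeds on a match anywhere in the input, unlike matches().
      if (rule.pattern.matcher(input).find()) {
        metadata.put(rule.name, "true");
      }
    }
    return metadata;
  }

  public static void main(String[] args) {
    String config = "# example rules\n"
        + "hasForm\thtml\t<form[^>]*>\n"
        + "mentionsNutch\ttext\t(?i)nutch\n";
    String html = "<html><body><form action=\"/s\"></form>Apache Nutch</body></html>";
    String text = "Apache Nutch";
    System.out.println(apply(parseRules(config), html, text));
  }
}
```

Because find() is used rather than matches(), a rule does not need leading or trailing `.*` to match somewhere inside a page.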