[ https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1838.
----------------------------------
    Resolution: Fixed

> Host and domain based regex and automaton filtering
> ---------------------------------------------------
>
>                 Key: NUTCH-1838
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1838
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch,
> NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch
>
>
> Both the regex and automaton filters pass all URLs through all rules,
> although this makes little sense if you have a lot of generated rules for
> many different hosts or domains. This patch allows users to configure
> rules for a specific host or domain only, making filtering much more
> efficient.
> Each rule has an optional hostOrDomain field. A rule is applied if it has
> no hostOrDomain, or if the URL's host name or domain name matches the
> rule's hostOrDomain.
> The following line enables hostOrDomain specific rules:
> {code}
> > www.example.org
> {code}
> The following line disables/resets it again:
> {code}
> <
> {code}
> Full example:
> {code}
> -some generic filter
> +another generic filter
> > www.example.org
> -rule only applied to URL's of www.example.org
> +another rule only applied to URL's of www.example.org
> > apache.org
> -rule only applied to URL's of apache.org
> +another rule only applied to URL's of apache.org
> <
> -more generic rules
> +and another one
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
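
For readers who want to see how the hostOrDomain scoping in the issue
description could behave, here is a minimal sketch in Python. It is not the
code from NUTCH-1838.patch. The names parse_rules, naive_domain and
filter_url are hypothetical, the "first matching rule decides, no match
means reject" behaviour is an assumption, and the domain is approximated as
the last two labels of the host name rather than looked up in a public
suffix list.

```python
import re
from urllib.parse import urlparse


def parse_rules(text):
    """Parse rule lines into (accept, pattern, host_or_domain) tuples."""
    rules = []
    scope = None  # None means the rule applies to every URL
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith(">"):
            # "> www.example.org" scopes the rules that follow
            scope = line[1:].strip() or None
        elif line.startswith("<"):
            # "<" resets to generic (unscoped) rules
            scope = None
        elif line[0] in "+-":
            rules.append((line[0] == "+", re.compile(line[1:]), scope))
        else:
            raise ValueError("unrecognized rule line: " + line)
    return rules


def naive_domain(host):
    """Assumption: the domain is the last two labels of the host name."""
    return ".".join(host.split(".")[-2:])


def filter_url(url, rules):
    """Return url if accepted, None if rejected (assumed first-match-wins)."""
    host = urlparse(url).hostname or ""
    domain = naive_domain(host)
    for accept, pattern, scope in rules:
        # Scoped rules are skipped unless the URL's host or domain matches
        if scope is not None and scope != host and scope != domain:
            continue
        if pattern.search(url):
            return url if accept else None
    return None  # assumption: URLs that match no rule are rejected


# Hypothetical rule file using the syntax from the issue description
RULES = r"""
-\.(gif|jpg|png)$
> www.example.org
-/private/
> apache.org
+/nutch/
<
+.
"""

rules = parse_rules(RULES)
```

With these rules, filter_url("http://www.example.org/private/a.html", rules)
is rejected by the rule scoped to www.example.org, while the same path on
other.example.com never sees that rule and falls through to the generic "+."
rule. The benefit the issue describes comes from the scope check: it is a
cheap string comparison, so rules for other hosts are skipped without
evaluating their patterns.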