[ https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089165#comment-15089165 ]
Hudson commented on NUTCH-1838:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3332 (See [https://builds.apache.org/job/Nutch-trunk/3332/])
NUTCH-1838 Host and domain based regex and automaton filtering (markus: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1723710])
* trunk/CHANGES.txt
* trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java
* trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
* trunk/src/plugin/urlfilter-automaton/src/java/org/apache/nutch/urlfilter/automaton/AutomatonURLFilter.java
* trunk/src/plugin/urlfilter-regex/sample/nutch1838.rules
* trunk/src/plugin/urlfilter-regex/sample/nutch1838.urls
* trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex/RegexURLFilter.java
* trunk/src/plugin/urlfilter-regex/src/test/org/apache/nutch/urlfilter/regex/TestRegexURLFilter.java


> Host and domain based regex and automaton filtering
> ---------------------------------------------------
>
>                 Key: NUTCH-1838
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1838
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch,
> NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch
>
>
> Both the regex and the automaton filter pass every URL through all rules,
> which makes little sense if you have a lot of generated rules for many
> different hosts or domains. This patch lets users configure rules that apply
> to a specific host or domain only, making filtering much more efficient.
> Each rule has an optional hostOrDomain field. The filter applies rules that
> have no hostOrDomain, and rules whose hostOrDomain matches the URL's host
> name or domain name.
> The following line enables hostOrDomain-specific rules:
> {code}
> > www.example.org
> {code}
> The following line disables/resets it again:
> {code}
> <
> {code}
> Full example:
> {code}
> -some generic filter
> +another generic filter
> > www.example.org
> -rule only applied to URL's of www.example.org
> +another rule only applied to URL's of www.example.org
> > apache.org
> -rule only applied to URL's of apache.org
> +another rule only applied to URL's of apache.org
> <
> -more generic rules
> +and another one
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
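
For readers following the issue, the rule-file semantics described above can be sketched roughly as follows. This is only an illustration, not the committed code in RegexURLFilterBase or AutomatonURLFilter: the class ScopedRuleSketch and its methods parse, filter and inScope are hypothetical names. It assumes that "#" starts a comment line, that the first applicable matching rule decides the outcome, and it approximates "domain" by a plain host-suffix comparison instead of Nutch's own domain-name lookup.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative sketch only; all names here are hypothetical, not Nutch API. */
final class ScopedRuleSketch {

  /** One +/- rule, optionally scoped to a host or domain (null means generic). */
  static final class Rule {
    final boolean accept;
    final String hostOrDomain;
    final Pattern pattern;

    Rule(boolean accept, String hostOrDomain, String regex) {
      this.accept = accept;
      this.hostOrDomain = hostOrDomain;
      this.pattern = Pattern.compile(regex);
    }
  }

  /** Parses rule lines; "> name" opens a host/domain scope and "<" resets it. */
  static List<Rule> parse(List<String> lines) {
    List<Rule> rules = new ArrayList<>();
    String scope = null; // current hostOrDomain, null while rules are generic
    for (String raw : lines) {
      String line = raw.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue; // assumption: blank lines and "#" comments are skipped
      }
      String rest = line.substring(1).trim();
      switch (line.charAt(0)) {
        case '>': scope = rest; break;
        case '<': scope = null; break;
        case '+': rules.add(new Rule(true, scope, rest)); break;
        case '-': rules.add(new Rule(false, scope, rest)); break;
        default:
          throw new IllegalArgumentException("Unrecognized rule line: " + line);
      }
    }
    return rules;
  }

  /** Returns the URL if the first applicable matching rule accepts it, else null. */
  static String filter(List<Rule> rules, String url) {
    String host = URI.create(url).getHost();
    for (Rule rule : rules) {
      // Scoped rules are skipped cheaply for URLs of other hosts or domains.
      if (rule.hostOrDomain != null && !inScope(host, rule.hostOrDomain)) {
        continue;
      }
      if (rule.pattern.matcher(url).find()) {
        return rule.accept ? url : null;
      }
    }
    return null; // assumption: a URL that matches no rule is rejected
  }

  /** Assumption: host equality or a dot-separated suffix stands in for "domain". */
  static boolean inScope(String host, String hostOrDomain) {
    return host != null
        && (host.equals(hostOrDomain) || host.endsWith("." + hostOrDomain));
  }
}
```

With the lines "-/private/", "> www.example.org", "-\.pdf$", "<", "+." parsed in that order, a PDF URL on www.example.org is rejected by the scoped rule, while the same path on another host skips that rule and falls through to the generic "+." rule. The efficiency gain described in the issue comes from that skip: a scoped rule costs a string comparison, not a regex match, for URLs outside its host or domain.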