[ https://issues.apache.org/jira/browse/NUTCH-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2689.
------------------------------------
    Resolution: Implemented

Thanks, [~markus17]! Merged.

> Speed up urlfilter-regex and urlfilter-automaton
> ------------------------------------------------
>
>                 Key: NUTCH-2689
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2689
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> The unit tests of urlfilter-regex and urlfilter-automaton include a
> benchmark. After playing and benchmarking modifications the following changes
> seem to significantly improve the performance:
> - do not extract host and domain name from the URL if not needed (no
>   host/domain-specific rules used, cf. NUTCH-1838)
> - use non-capturing groups if possible
> - use {{(?i)}} to make the patterns case insensitive and remove uppercase
>   variants
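
The last two points in the list above (non-capturing groups and {{(?i)}}) can be illustrated with a minimal Java sketch. The class name, the image-suffix pattern and the helper method are assumptions made for illustration only; they are not taken from the Nutch plugins or the shipped rule files.

```java
import java.util.regex.Pattern;

// Hypothetical illustration, not code from urlfilter-regex or urlfilter-automaton.
class UrlPatternSketch {

  // Before: a capturing group with upper-case variants spelled out by hand.
  static final Pattern BEFORE =
      Pattern.compile("\\.(gif|GIF|jpg|JPG|png|PNG)$");

  // After: (?i) makes the pattern case-insensitive, so the upper-case
  // variants can be dropped, and (?:...) groups the alternatives without
  // recording a capture.
  static final Pattern AFTER =
      Pattern.compile("(?i)\\.(?:gif|jpg|png)$");

  // True if the pattern is found anywhere in the URL.
  static boolean matches(Pattern pattern, String url) {
    return pattern.matcher(url).find();
  }

  public static void main(String[] args) {
    String[] urls = {
      "http://example.com/a.jpg",
      "http://example.com/a.JPG",
      "http://example.com/a.html"
    };
    for (String url : urls) {
      System.out.println(url
          + " before=" + matches(BEFORE, url)
          + " after=" + matches(AFTER, url));
    }
  }
}
```

Note that the two patterns are not strictly equivalent: the case-insensitive form also matches mixed-case suffixes such as {{.Jpg}}, which the hand-written variants do not cover.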