The following page has been changed by JeromeCharron:
http://wiki.apache.org/nutch/RegexURLFiltersBenchs

The comment on the change is:
Creation

New page:
== Introduction ==
This page provides performance benchmarks of the regular-expression-based URLFilters in Nutch (currently urlfilter-regex and urlfilter-automaton).

The '''urlfilter-regex''' plugin is based on the standard JDK [http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/package-summary.html java.util.regex] implementation, whereas the '''urlfilter-automaton''' plugin is based on [http://www.brics.dk/automaton/ dk.brics.automaton], a finite-state automata library for Java.

== Performance ==
=== Data set ===
These ''performance'' benchmarks were produced by collecting the results of the unit tests of each plugin, using the same rule file (`Benchmarks.rules`) and the same set of URLs to filter (`Benchmarks.urls`).

=== Raw results ===
The following matrix shows the processing time, in ''ms'', of the '''urlfilter-regex''' and '''urlfilter-automaton''' plugins. Each column gives the number of loops, that is, the number of times the `Benchmarks.urls` file was filtered.

|| ||'''50'''||'''100'''||'''200'''||'''400'''||'''800'''||
||'''regex'''||459||899||1917||3703||7873||
||'''automaton'''||335||419||657||1119||1997||

=== Graphical representation ===
[http://frutch.free.fr/images/nutch/regexfilters-benchs.png]

=== Conclusion ===
'''urlfilter-automaton''' supports fewer operators than '''urlfilter-regex''' but is considerably faster: in the matrix above it is about 1.4 times faster at 50 loops (335 ms against 459 ms) and almost 4 times faster at 800 loops (1997 ms against 7873 ms). It can therefore be useful in contexts where its reduced operator set is sufficient.

A next step could be to combine these two plugins, taking the best of each one, by using the '''`urlfilter.order`''' configuration property.
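
The measurement described under "Raw results" (filtering the same URL set repeatedly and timing each run) can be sketched roughly as below, for the `java.util.regex` side only. This is not the plugins' actual unit-test code. The class `RegexFilterBench`, its methods, the sample rules and the sample URLs are all hypothetical and stand in for `Benchmarks.rules` and `Benchmarks.urls`. The sketch assumes sign-prefixed rules (`+` to accept, `-` to reject), with the first matching rule deciding and unmatched URLs rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical harness: illustrates the "N loops over a URL list" timing idea only.
class RegexFilterBench {

    /** One rule: accept ('+') or reject ('-') plus a compiled java.util.regex pattern. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    /** Parses "+regex" / "-regex" lines; blank lines and '#' comments are skipped. */
    static List<Rule> parseRules(String[] lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', trimmed.substring(1)));
        }
        return rules;
    }

    /** Assumed semantics: first matching rule decides; no match means rejected (null). */
    static String filter(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null;
    }

    /** Filters the whole URL list 'loops' times and counts accepted URLs. */
    static int countAccepted(List<Rule> rules, String[] urls, int loops) {
        int accepted = 0;
        for (int i = 0; i < loops; i++) {
            for (String url : urls) {
                if (filter(rules, url) != null) {
                    accepted++;
                }
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Sample data, invented for illustration; not the real benchmark files.
        String[] ruleLines = {
            "# skip images",
            "-\\.(gif|jpg|png)$",
            "+^http://([a-z0-9-]+\\.)*example\\.org/"
        };
        String[] urls = {
            "http://www.example.org/index.html",
            "http://www.example.org/logo.gif",
            "http://other.example.com/"
        };
        List<Rule> rules = parseRules(ruleLines);

        // Same loop counts as the columns of the results matrix.
        for (int loops : new int[] {50, 100, 200, 400, 800}) {
            long start = System.nanoTime();
            int accepted = countAccepted(rules, urls, loops);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(loops + " loops: " + elapsedMs + " ms, " + accepted + " accepted");
        }
    }
}
```

With only three URLs the timings will be far smaller than those in the matrix, and JIT warm-up will distort the first runs. The sketch shows the shape of the measurement, not comparable numbers. An automaton-based variant would replace `Pattern` with the corresponding dk.brics.automaton classes, which are not shown here.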