Hi Kirby,

Thanks for sharing this. It is definitely relevant for Nutch, and I am sure quite a few people would be interested in giving it a try. Let's hope that this patch gets into the original library or that the Lucene people ship it in a separate jar. In the meantime, your patch would help with comparing performance.

Could you please open a new issue on JIRA and include the patch and a description? That will make it easier to comment on and track its progress.

Thanks a lot

Julien

On 25 July 2011 05:01, Kirby Bohling <kirby.bohl...@gmail.com> wrote:

> All,
>
> Not sure how much you guys care, but the Lucene folks (specifically
> rmuir and mikemcand) made some fairly significant performance speed
> ups to the Automaton library while working on the Lucene Fuzzy
> matching optimizations for the 4.0 release. I've backported them to
> the Automaton library and am trying to get them integrated into the
> mainline library (with permission from the Lucene devs). I haven't
> heard back from the Automaton author, but I figured that enough folks
> have made noise about how nice the performance boost of using Automaton
> vs. RegEx is that Nutch itself might want to integrate these types of
> changes, or re-use the ones from Lucene.
>
> The best version of the code itself is here:
>
> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
>
> Nutch would likely only use 1/2-2/3 of those files (only the stuff
> required to build RegExp).
>
> The patch I applied to the latest Automaton library is attached if
> anybody wants to rebuild and test. In some mainline code that does a
> _lot_ of NFA-to-DFA translation, it is a 4x speed up. For the actual
> execution of the DFAs, I'm not sure how much faster it actually is (I
> think 1.5-2.0x as fast). My patch doesn't include the UTF-32 fixes in
> the Lucene version (the Lucene code also converts the UTF-32 to UTF-8
> representation, and uses several Lucene internal implementations of
> memory growth, sorting, etc.). It is unfortunate that the Lucene
> version isn't broken out into a utility jar to be re-used. Lucene has
> several really nice, high-performance, non-trivial but highly useful CS
> data structure implementations.
>
> My patch itself applies to the latest Automaton library (1.11-7 as of
> this writing). If it is better to use the original Automaton library.
> One annoyance of the Automaton library is that you have to submit
> personal info to get the source, but it is all BSD licensed. No
> public repo of source.
>
> It might be worthwhile to port the plugins using the automaton
> library to use the version from Lucene or one with the patch applied
> and test the performance.
>
> Thanks,
> Kirby
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com