Hi Kirby,

Thanks for sharing this. It is definitely relevant for Nutch, and I am sure
that quite a few people would be interested in giving it a try.
Let's hope that this patch gets into the original library or that the Lucene
people ship it in a separate jar. In the meantime, your patch would help with
comparing performance. Could you please open a new issue on JIRA and
include the patch and a description? It will be easier to comment on it and
track its progress there.

Thanks a lot

Julien

On 25 July 2011 05:01, Kirby Bohling <kirby.bohl...@gmail.com> wrote:

> All,
>
>   Not sure how much you guys care, but the Lucene folks (specifically
> rmuir and mikemcand) made some fairly significant performance
> improvements to the Automaton library while working on the Lucene fuzzy
> matching optimizations for the 4.0 release.  I've backported them to
> the Automaton library and am trying to get them integrated into the
> mainline library (with permission from the Lucene devs).  I haven't
> heard back from the Automaton author, but I figured that enough folks
> have made noise about how nice the performance boost of using Automaton
> vs. RegEx is that Nutch itself might want to integrate these types of
> changes, or re-use the ones from Lucene.
>
>   The best version of the code itself is here:
>
>
> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
>
> Nutch would likely only use 1/2-2/3 of those files (only the stuff
> required to build RegExp).
>
> The patch I applied to the latest Automaton library is attached if
> anybody wants to rebuild and test.  In some mainline code that does a
> _lot_ of NFA-to-DFA translation, it is a 4x speedup.  For the actual
> execution of the DFAs, I'm not sure how much faster it actually is (I
> think 1.5-2.0x as fast).  My patch doesn't include the UTF-32 fixes in
> the Lucene version (the Lucene code also converts the UTF-32 to a UTF-8
> representation, and uses several Lucene-internal implementations of
> memory growth, sorting, etc.).  It is unfortunate that the Lucene
> version isn't broken out into a utility jar to be re-used.  Lucene has
> several really nice, high-performance implementations of non-trivial
> but highly useful CS data structures.
>
> My patch itself applies to the latest Automaton library (1.11-7 as of
> this writing), in case it is better to keep using the original Automaton
> library.  One annoyance of the Automaton library is that you have to
> submit personal info to get the source, but it is all BSD licensed.
> There is no public source repository.
>
> It might be worthwhile to port the plugins using the automaton
> library to use the version from Lucene or one with the patch applied
> and test the performance.
>
> Thanks,
>    Kirby
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
