Re: Automaton improvements

2011-07-25 Thread Dawid Weiss
I don't think this will make it into a separate library, Julien. It's a port
of brics and done specifically so that it fits Lucene's internal needs. If
anything, I would just make Nutch require Lucene as a dependency -- this
would provide more stable updates.

Dawid

On Mon, Jul 25, 2011 at 10:35 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 Hi Kirby,

 Thanks for sharing this. It is definitely relevant for Nutch and I am sure
 that there would be quite a few people interested in giving it a try.
 Let's hope that this patch gets into the original library or that the
 Lucene people ship it in a separate jar. In the meantime, your patch would
 help in comparing performance. Could you please open a new issue on JIRA
 and include the patch + description? It will be easier to comment and track
 its progress.

 Thanks a lot

 Julien


 On 25 July 2011 05:01, Kirby Bohling kirby.bohl...@gmail.com wrote:

 All,

   Not sure how much you guys care, but the Lucene folks (specifically
 rmuir and mikemcand) made some fairly significant speed-ups to the
 Automaton library while working on the Lucene fuzzy matching
 optimizations for the 4.0 release.  I've backported them to the
 Automaton library and am trying to get them integrated into the
 mainline library (with permission from the Lucene devs).  I haven't
 heard back from the Automaton author, but I figured that enough folks
 have made noise about the nice performance boost of using Automaton
 vs. RegEx that Nutch itself might want to integrate these types of
 changes, or re-use the ones from Lucene.

   The best version of the code itself is here:


 http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/

 Nutch would likely only use 1/2-2/3 of those files (only the stuff
 required to build RegExp).

 The patch I applied to the latest Automaton library is attached if
 anybody wants to rebuild and test.  In some mainline code that does a
 _lot_ of NFA-to-DFA translation, it is a 4x speed-up.  For the actual
 execution of the DFAs, I'm not sure how much faster it actually is (I
 think 1.5-2.0x as fast).  My patch doesn't include the UTF-32 fixes in
 the Lucene version (the Lucene code also converts the UTF-32 to a UTF-8
 representation, and uses several Lucene-internal implementations of
 memory growth, sorting, etc.).  It is unfortunate that the Lucene
 version isn't broken out into a utility jar to be re-used.  Lucene has
 several really nice, high-performance implementations of non-trivial
 but highly useful CS data structures.
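
 For readers unfamiliar with the NFA-to-DFA translation mentioned above,
 here is a minimal sketch of the textbook subset construction. It is not
 the brics or Lucene code; the class and method names are hypothetical,
 and it assumes an NFA without epsilon transitions whose start state is 0.

```java
import java.util.*;

// Hypothetical illustration: textbook subset construction, not brics or Lucene code.
final class SubsetConstructionSketch {

    // NFA without epsilon transitions; state 0 is the start state.
    static final class Nfa {
        final List<Map<Character, Set<Integer>>> transitions = new ArrayList<>();
        final Set<Integer> accepting = new TreeSet<>();

        int addState(boolean accept) {
            transitions.add(new HashMap<>());
            int id = transitions.size() - 1;
            if (accept) accepting.add(id);
            return id;
        }

        void addTransition(int from, char c, int to) {
            transitions.get(from).computeIfAbsent(c, k -> new TreeSet<>()).add(to);
        }
    }

    // DFA with one transition map per state; state 0 is the start state.
    static final class Dfa {
        final List<Map<Character, Integer>> transitions = new ArrayList<>();
        final List<Boolean> accepting = new ArrayList<>();

        boolean matches(String input) {
            int state = 0;
            for (char c : input.toCharArray()) {
                Integer next = transitions.get(state).get(c);
                if (next == null) return false; // no transition: reject
                state = next;
            }
            return accepting.get(state);
        }
    }

    // Each DFA state stands for a set of NFA states reachable on the same input.
    static Dfa determinize(Nfa nfa) {
        Dfa dfa = new Dfa();
        Map<Set<Integer>, Integer> ids = new HashMap<>();
        Deque<Set<Integer>> worklist = new ArrayDeque<>();

        Set<Integer> start = new TreeSet<>();
        start.add(0);
        ids.put(start, newDfaState(dfa, nfa, start));
        worklist.add(start);

        while (!worklist.isEmpty()) {
            Set<Integer> current = worklist.poll();
            int from = ids.get(current);

            // Group the NFA targets of every member state by input character.
            Map<Character, Set<Integer>> targets = new HashMap<>();
            for (int s : current) {
                for (Map.Entry<Character, Set<Integer>> e : nfa.transitions.get(s).entrySet()) {
                    targets.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
                }
            }

            // Reuse a DFA state if this set was seen before, else create one.
            for (Map.Entry<Character, Set<Integer>> e : targets.entrySet()) {
                Integer to = ids.get(e.getValue());
                if (to == null) {
                    to = newDfaState(dfa, nfa, e.getValue());
                    ids.put(e.getValue(), to);
                    worklist.add(e.getValue());
                }
                dfa.transitions.get(from).put(e.getKey(), to);
            }
        }
        return dfa;
    }

    // A DFA state accepts if any of its member NFA states accepts.
    private static int newDfaState(Dfa dfa, Nfa nfa, Set<Integer> members) {
        dfa.transitions.add(new HashMap<>());
        boolean accept = false;
        for (int s : members) {
            if (nfa.accepting.contains(s)) accept = true;
        }
        dfa.accepting.add(accept);
        return dfa.transitions.size() - 1;
    }

    // Hand-built NFA for the regular expression (a|b)*ab.
    static Nfa endsWithAb() {
        Nfa nfa = new Nfa();
        int q0 = nfa.addState(false);
        int q1 = nfa.addState(false);
        int q2 = nfa.addState(true);
        nfa.addTransition(q0, 'a', q0);
        nfa.addTransition(q0, 'b', q0);
        nfa.addTransition(q0, 'a', q1);
        nfa.addTransition(q1, 'b', q2);
        return nfa;
    }

    public static void main(String[] args) {
        Dfa dfa = determinize(endsWithAb());
        System.out.println(dfa.matches("abab")); // prints true
        System.out.println(dfa.matches("aba"));  // prints false
    }
}
```

 The worklist loop is the part whose cost grows with the number of
 reachable state sets, which is presumably why code that determinizes a
 lot benefits most from optimizations to this step.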

 My patch itself applies to the latest Automaton library (1.11-7 as of
 this writing).  If it is better to use the original Automaton library.
  One annoyance of the Automaton library is that you have to submit
 personal info to get the source, but it is all BSD licensed.  No
 public repo of source.

 It might be worthwhile to port the plugins using the automaton
 library to use the version from Lucene or one with the patch applied
 and test the performance.

 Thanks,
Kirby




 --
 *
 *Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com



Re: Automaton improvements

2011-07-25 Thread Julien Nioche
Hi Dawid,

This was a bit of wishful thinking indeed :-) With a bit of luck the
improvements will be added to brics, but as you pointed out we can always
use the Lucene jar anyway.

BTW you are too modest: you should have pointed to the video of your talk in
Berlin, http://vimeo.com/26517310, which is both informative and entertaining.

Thanks

Julien


-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: Automaton improvements

2011-07-25 Thread Dawid Weiss
It is actually Robert Muir and Mike McCandless doing the heavy lifting here,
so modesty has nothing to do with it :) I just think it'll stay inside
Lucene because it is often tweaked and tuned. Plus, there is the FSTBuilder
and associated classes which provide yet another way to build and traverse
automata in Lucene (this is not brics-dependent).

Dawid


Re: Automaton improvements

2011-07-25 Thread Kirby Bohling
https://issues.apache.org/jira/browse/NUTCH-1068

Issue created, patch attached.  Once I hear back from the author about
getting it included in the upstream library, I'll update the issue.  I'm
really not able to pursue it directly, as I'm not much of a Nutch user at
the moment.  I've lurked on the list because there is some good info, and I
previously used Nutch as part of an R&D project at work.  I use Lucene and
the Automaton library quite a bit, and found out about the Automaton library
here.  It's been a great find for us, so hopefully this is a way I can
contribute back.  Either way, the ASF likely already has better code that
Nutch could just pick up.

I wish the Lucene guys would peel these utility parts out into a separate
library.  I have several places where it'd be useful and where I really have
no need for all of core Lucene (also, I use a 3.x version in my project, and
this code is only in the 4.x branch; until that's released, I'll have to
maintain it myself).

Kirby

