Re: StandardTokenizer is slowing down highlighting a lot

Stanislaw Osinski Wed, 25 Jul 2007 23:54:13 -0700

On 25/07/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 7/25/07, Stanislaw Osinski <[EMAIL PROTECTED]> wrote:
> JavaCC is slow indeed.

JavaCC is a very fast parser for a large document... the issue is
small fields and JavaCC's use of an exception for flow control at the
end of a value.  As JVMs have advanced, exception-as-control-flow has
gotten comparatively slower.
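
To illustrate the point about exceptions as flow control, here is a minimal
sketch of two ways a scanner can signal the end of a value. This is not the
code JavaCC generates; the class and method names (EndOfInputStyles,
ThrowingScanner, SentinelScanner) are hypothetical and the whitespace
splitting is only a stand-in for real tokenization. The first scanner throws
when the input runs out, so every field pays for one exception no matter how
short it is; the second returns null instead.

```java
import java.util.ArrayList;
import java.util.List;

public class EndOfInputStyles {

    /** Hypothetical scanner that signals end of input by throwing. */
    static final class ThrowingScanner {
        private final String text;
        private int pos = 0;

        ThrowingScanner(String text) { this.text = text; }

        String next() {
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
            if (pos >= text.length()) {
                // One exception (and, by default, one captured stack trace) per field.
                throw new IllegalStateException("end of input");
            }
            int start = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
            return text.substring(start, pos);
        }
    }

    /** Hypothetical scanner that signals end of input with a null return. */
    static final class SentinelScanner {
        private final String text;
        private int pos = 0;

        SentinelScanner(String text) { this.text = text; }

        String next() {
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
            if (pos >= text.length()) {
                return null; // ordinary return value, no exception involved
            }
            int start = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
            return text.substring(start, pos);
        }
    }

    public static List<String> tokenizeWithException(String text) {
        List<String> tokens = new ArrayList<String>();
        ThrowingScanner scanner = new ThrowingScanner(text);
        try {
            while (true) {
                tokens.add(scanner.next());
            }
        } catch (IllegalStateException endOfInput) {
            // Expected: the exception is the loop's only exit.
        }
        return tokens;
    }

    public static List<String> tokenizeWithSentinel(String text) {
        List<String> tokens = new ArrayList<String>();
        SentinelScanner scanner = new SentinelScanner(text);
        for (String token = scanner.next(); token != null; token = scanner.next()) {
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String field = "a short search result snippet";
        System.out.println(tokenizeWithException(field));
        System.out.println(tokenizeWithSentinel(field));
    }
}
```

Both methods return the same tokens. The difference is the fixed cost of the
throw, which is spread over many tokens in a large document but is paid in
full for every small field.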



In Carrot2 we tokenize mostly very short documents (search results), so in
this context JFlex proved much faster. I did a very rough performance test
of Highlighter using JFlex and JavaCC-generated analyzers with medium-sized
documents (up to ~1kB), and JFlex was still faster. What size would a
'large' document be?

Does JFlex have a jar associated with it?  It's GPL (although you can
freely use the files it generates under any license), so if there were
other non-generated files required, we wouldn't be able to incorporate
them.



You need the JFlex jar only to generate the tokenizer (one Java class). The
generated tokenizer is standalone and doesn't need the JFlex jar to run.

Staszek
