On 25/07/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 7/25/07, Stanislaw Osinski <[EMAIL PROTECTED]> wrote:
> > JavaCC is slow indeed.
>
> JavaCC is a very fast parser for a large document... the issue is small fields and JavaCC's use of an exception for flow control at the end of a value. As JVMs have advanced, exception-as-control-flow has gotten comparatively slower.
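
A minimal sketch of the exception-for-flow-control point above. This is not JavaCC's generated code; the class, reader, and method names are all hypothetical. Both variants count whitespace-separated tokens in one field value, but the first learns that the input is exhausted by catching an exception, so it constructs one exception per field. That fixed cost is spread over many tokens in a large document and over very few in a short field.

```java
// Illustrative only: every name here is hypothetical, nothing is taken from JavaCC or JFlex.
class EndOfInputDemo {

    /** Exception used purely to signal "no more characters". */
    static final class EndOfInput extends Exception {}

    /** Reader that reports end of input by throwing. */
    static final class ThrowingReader {
        private final String text;
        private int pos;
        ThrowingReader(String text) { this.text = text; }
        char next() throws EndOfInput {
            if (pos >= text.length()) throw new EndOfInput();
            return text.charAt(pos++);
        }
    }

    /** Reader that reports end of input with a sentinel value (-1). */
    static final class SentinelReader {
        private final String text;
        private int pos;
        SentinelReader(String text) { this.text = text; }
        int next() { return pos < text.length() ? text.charAt(pos++) : -1; }
    }

    /** Counts tokens; the loop ends only when the reader throws. */
    static int countTokensWithException(String field) {
        ThrowingReader in = new ThrowingReader(field);
        int tokens = 0;
        boolean inToken = false;
        try {
            while (true) {
                char c = in.next();
                if (Character.isWhitespace(c)) {
                    inToken = false;
                } else if (!inToken) {
                    inToken = true;
                    tokens++;
                }
            }
        } catch (EndOfInput e) {
            // Normal termination: one exception is created for every field value.
        }
        return tokens;
    }

    /** Counts tokens; the loop ends when the reader returns -1. */
    static int countTokensWithSentinel(String field) {
        SentinelReader in = new SentinelReader(field);
        int tokens = 0;
        boolean inToken = false;
        for (int c = in.next(); c != -1; c = in.next()) {
            if (Character.isWhitespace((char) c)) {
                inToken = false;
            } else if (!inToken) {
                inToken = true;
                tokens++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String shortField = "quick brown fox"; // stands in for a small field value
        int rounds = 200_000;

        long t0 = System.nanoTime();
        int a = 0;
        for (int i = 0; i < rounds; i++) a += countTokensWithException(shortField);
        long t1 = System.nanoTime();
        int b = 0;
        for (int i = 0; i < rounds; i++) b += countTokensWithSentinel(shortField);
        long t2 = System.nanoTime();

        // Rough wall-clock timing only, not a proper benchmark.
        System.out.println("exception: " + (t1 - t0) / 1_000_000 + " ms, tokens=" + a);
        System.out.println("sentinel:  " + (t2 - t1) / 1_000_000 + " ms, tokens=" + b);
    }
}
```

The timings depend on the JVM and on warm-up, so treat any numbers from this as a rough indication at best.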

In Carrot2 we tokenize mostly very short documents (search results), so in this context JFlex proved much faster. I did a very rough performance test of Highlighter using JFlex and JavaCC-generated analyzers with medium-sized documents (up to ~1kB), and JFlex was still faster. What size would a 'large' document be?

> Does JFlex have a jar associated with it? It's GPL (although you can
> freely use the files it generates under any license), so if there were other non-generated files required, we wouldn't be able to incorporate them.

You need the JFlex jar only to generate the tokenizer (one Java class). The generated tokenizer is standalone and doesn't need the JFlex jar to run.

Staszek