Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets. You should first see if you can get away without it and use a different Analyzer, or if you can re-implement just the functionality you need in a custom Analyzer. Do you really need the support for abbreviations, companies, email address, etc?

If so:

You can use the TokenSources class in the highlighter package to rebuild a TokenStream without re-analyzing if you store term offsets and positions in the index. I have not found this to be super beneficial, even when using the StandardAnalyzer to re-analyze, but it certainly could be faster if you have large enough documents.

Your best bet is probably to use https://issues.apache.org/jira/browse/LUCENE-644, which is a non-positional Highlighter that finds offsets to highlight by looking up query term offset information in the index. For larger documents this can be much faster than using the standard contrib Highlighter, even if your using TokenSources. LUCENE-644 has a much flatter curve than the contrib Highlighter as document size goes up.

- Mark

Michael Stoppelman wrote:
Hi all,

I was tracking down slowness in the contrib highlighter code and it seems
the seemingly simple tokenStream.next() is the culprit.
I've seen multiple posts about this being a possible cause. Has anyone
looked into how to speed up StandardTokenizer? For my
documents it's taking about 70ms per document that's a big ugh! I was
thinking I might just cache the TermVectors in memory if
that will be faster. Anyone have another approach to solving this problem?

-M


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to