Re: StandardTokenizer is slowing down highlighting a lot

Mark Miller Wed, 18 Jul 2007 17:59:36 -0700

Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar, as it is already about as simple as it gets. You should first see if you can get away without it and use a different Analyzer, or if you can re-implement just the functionality you need in a custom Analyzer. Do you really need the support for abbreviations, companies, email addresses, etc.?


If so:

You can use the TokenSources class in the highlighter package to rebuild a TokenStream without re-analyzing if you store term offsets and positions in the index. I have not found this to be super beneficial, even when using the StandardAnalyzer to re-analyze, but it certainly could be faster if you have large enough documents.

Your best bet is probably to use https://issues.apache.org/jira/browse/LUCENE-644, which is a non-positional Highlighter that finds offsets to highlight by looking up query term offset information in the index. For larger documents this can be much faster than using the standard contrib Highlighter, even if you're using TokenSources. LUCENE-644 has a much flatter curve than the contrib Highlighter as document size goes up.
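To make the idea concrete, here is a rough, self-contained sketch of highlighting from stored offsets instead of re-tokenizing at highlight time. This is not the LUCENE-644 code and it does not use the Lucene API: the class OffsetHighlighter, its two methods, and the in-memory map that stands in for term offsets stored in the index are all hypothetical names made up for illustration, and the tokenizer is a plain letter/digit splitter rather than StandardTokenizer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Lucene or LUCENE-644.
public class OffsetHighlighter {

    /** Index-time step: tokenize once and record [start, end) offsets per lowercased term. */
    public static Map<String, List<int[]>> indexOffsets(String text) {
        Map<String, List<int[]>> offsets = new HashMap<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i).toLowerCase();
            offsets.computeIfAbsent(term, k -> new ArrayList<>()).add(new int[] {start, i});
        }
        return offsets;
    }

    /** Highlight-time step: no tokenizing, only offset lookups for the query terms. */
    public static String highlight(String text, Map<String, List<int[]>> offsets,
                                   List<String> queryTerms) {
        List<int[]> hits = new ArrayList<>();
        for (String term : queryTerms) {
            hits.addAll(offsets.getOrDefault(term.toLowerCase(), Collections.emptyList()));
        }
        hits.sort(Comparator.comparingInt(h -> h[0])); // order hits by start offset

        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] h : hits) {
            if (h[0] < last) {
                continue; // already covered, e.g. a query term listed twice
            }
            out.append(text, last, h[0])
               .append("<B>").append(text, h[0], h[1]).append("</B>");
            last = h[1];
        }
        return out.append(text.substring(last)).toString();
    }
}
```

In this sketch the cost at highlight time depends on the number of query-term hits rather than on re-running a tokenizer over the whole document, which is the reason an offset-lookup approach can scale better as documents grow.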


- Mark

Michael Stoppelman wrote:

Hi all,

I was tracking down slowness in the contrib highlighter code and it seems
the seemingly simple tokenStream.next() is the culprit.
I've seen multiple posts about this being a possible cause. Has anyone
looked into how to speed up StandardTokenizer? For my
documents it's taking about 70ms per document; that's a big ugh! I was
thinking I might just cache the TermVectors in memory if
that will be faster. Anyone have another approach to solving this problem?

-M

