On 7/19/07, Mark Miller <[EMAIL PROTECTED]> wrote:
> I think it goes without saying that a semi-complex NFA or DFA is going
> to be quite a bit slower than, say, breaking on whitespace. Not that I
> am against such a warning.
This is true for those very familiar with the code base and the
Tokenizer source code. I think a comment in the highlighting code
warning that a semi-complex NFA/DFA can be a major performance hit
would save others time, imho.
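For anyone who wants to see the gap on their own data, a rough timing
loop along these lines should show it. This is just an untested sketch
against the 2.x analysis API; WhitespaceAnalyzer and StandardAnalyzer
are the stock Lucene classes, and the synthetic document is only a
stand-in for real content:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenizerBench {

  // Tokenize the same text 'iters' times and report wall-clock millis.
  static long time(Analyzer a, String text, int iters) throws Exception {
    long start = System.currentTimeMillis();
    for (int i = 0; i < iters; i++) {
      TokenStream ts = a.tokenStream("f", new StringReader(text));
      while (ts.next() != null) { /* drain every token */ }
    }
    return System.currentTimeMillis() - start;
  }

  public static void main(String[] args) throws Exception {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < 2000; i++) {
      sb.append("plain words plus a url http://example.com and 12.34 ");
    }
    String doc = sb.toString();
    System.out.println("whitespace: " + time(new WhitespaceAnalyzer(), doc, 100) + " ms");
    System.out.println("standard:   " + time(new StandardAnalyzer(), doc, 100) + " ms");
  }
}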
> To support my point on writing a custom solution that is more exact
> towards your needs: if you just remove the <NUM> recognizer in
> StandardTokenizer.jj you will gain 20-25% speed in my tests on both
> small and large documents. Limiting what is considered a letter to
> just the languages/encodings you need might also get some good
> returns.
Both good ideas. I just realized that the tokenizer used for
highlighting doesn't need to be the same as the tokenizer used for
indexing, so I can make the highlighting tokenizer much simpler.
Everything will be fast and happy soon.
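Roughly what I have in mind is below. It's an untested sketch, and the
caveat is that whitespace-plus-lowercase tokens have to line up well
enough with the terms StandardAnalyzer put in the index (trailing
punctuation and the like), so whether it is good enough depends on the
documents:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Used only when re-analyzing stored text for the highlighter;
// indexing keeps using StandardAnalyzer.
public class HighlightAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // A whitespace split plus lowercasing: no NFA/DFA behind it, so
    // it is much cheaper per token than StandardTokenizer.
    return new LowerCaseFilter(new WhitespaceTokenizer(reader));
  }
}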
-M

> - Mark
>
> Michael Stoppelman wrote:
>> Might be nice to add a line of documentation to the highlighter on
>> the possible performance hit if one uses StandardAnalyzer, which is
>> probably a common case. Thanks for the speedy response.
>>
>> -M
>>
>> On 7/18/07, Mark Miller <[EMAIL PROTECTED]> wrote:
>>> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
>>> limited by JavaCC speed. You cannot shave much more performance out
>>> of the grammar as it is already about as simple as it gets. You
>>> should first see if you can get away without it and use a different
>>> Analyzer, or if you can re-implement just the functionality you need
>>> in a custom Analyzer. Do you really need the support for
>>> abbreviations, companies, email addresses, etc.?
>>>
>>> If so:
>>>
>>> You can use the TokenSources class in the highlighter package to
>>> rebuild a TokenStream without re-analyzing if you store term offsets
>>> and positions in the index. I have not found this to be super
>>> beneficial, even when using the StandardAnalyzer to re-analyze, but
>>> it certainly could be faster if you have large enough documents.
>>>
>>> Your best bet is probably to use
>>> https://issues.apache.org/jira/browse/LUCENE-644, which is a
>>> non-positional Highlighter that finds offsets to highlight by
>>> looking up query term offset information in the index. For larger
>>> documents this can be much faster than using the standard contrib
>>> Highlighter, even if you're using TokenSources. LUCENE-644 has a
>>> much flatter curve than the contrib Highlighter as document size
>>> goes up.
>>>
>>> - Mark
>>>
>>> Michael Stoppelman wrote:
>>>> Hi all,
>>>>
>>>> I was tracking down slowness in the contrib highlighter code and it
>>>> seems the seemingly simple tokenStream.next() is the culprit. I've
>>>> seen multiple posts about this being a possible cause. Has anyone
>>>> looked into how to speed up StandardTokenizer? For my documents
>>>> it's taking about 70ms per document; that's a big ugh! I was
>>>> thinking I might just cache the TermVectors in memory if that will
>>>> be faster. Anyone have another approach to solving this problem?
>>>>
>>>> -M
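P.S. For anyone else hitting this thread in the archives: the
TokenSources route Mark describes above looks roughly like the sketch
below. Untested, and it assumes the field was indexed with term
vectors storing positions and offsets; if I am reading TokenSources
right, getAnyTokenStream falls back to re-analyzing with the supplied
analyzer when the vectors are missing.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TokenSources;

public class VectorHighlight {
  // Highlight one stored document, preferring stored term vectors
  // over re-analysis.
  static String[] highlight(IndexReader reader, int docId, String field,
                            Query query, Analyzer fallback) throws Exception {
    String text = reader.document(docId).get(field);
    // Rebuilds the stream from stored offsets/positions if present,
    // otherwise re-analyzes with 'fallback'.
    TokenStream ts =
        TokenSources.getAnyTokenStream(reader, docId, field, fallback);
    Highlighter h = new Highlighter(new QueryScorer(query));
    return h.getBestFragments(ts, text, 3);
  }
}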