On 7/19/07, Mark Miller <[EMAIL PROTECTED]> wrote:
> I think it goes without saying that a semi-complex NFA or DFA is going
> to be quite a bit slower than, say, breaking on whitespace. Not that I
> am against such a warning.
This is true for those very familiar with the code base and the
Tokenizer source code. I think a comment in the highlighting code
warning that a semi-complex NFA/DFA can be a major performance hit
would save others time, imho.
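For anyone who wants to see the gap on their own data, a rough timing
loop along these lines should show it. This is just an untested sketch
against the 2.x analysis API; WhitespaceAnalyzer and StandardAnalyzer
are the stock Lucene classes, and the synthetic document is only a
stand-in for real content:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenizerBench {

  // Tokenize the same text 'iters' times and report wall-clock millis.
  static long time(Analyzer a, String text, int iters) throws Exception {
    long start = System.currentTimeMillis();
    for (int i = 0; i < iters; i++) {
      TokenStream ts = a.tokenStream("f", new StringReader(text));
      while (ts.next() != null) { /* drain every token */ }
    }
    return System.currentTimeMillis() - start;
  }

  public static void main(String[] args) throws Exception {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < 2000; i++) {
      sb.append("plain words plus a url http://example.com and 12.34 ");
    }
    String doc = sb.toString();
    System.out.println("whitespace: " + time(new WhitespaceAnalyzer(), doc, 100) + " ms");
    System.out.println("standard:   " + time(new StandardAnalyzer(), doc, 100) + " ms");
  }
}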
> To support my point on writing a custom solution that is more exact
> towards your needs: if you just remove the <NUM> recognizer in
> StandardTokenizer.jj you will gain 20-25% speed in my tests on both
> small and large documents. Limiting what is considered a letter to
> just the languages/encodings you need might also get some good
> returns.
Both good ideas. I just realized that the tokenizer used for
highlighting doesn't need to be the same as the tokenizer used for
indexing, so I can make the highlighting tokenizer much simpler.
Everything will be fast and happy soon.
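Roughly what I have in mind is below. It's an untested sketch, and the
caveat is that whitespace-plus-lowercase tokens have to line up well
enough with the terms StandardAnalyzer put in the index (trailing
punctuation and the like), so whether it is good enough depends on the
documents:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Used only when re-analyzing stored text for the highlighter;
// indexing keeps using StandardAnalyzer.
public class HighlightAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // A whitespace split plus lowercasing: no NFA/DFA behind it, so
    // it is much cheaper per token than StandardTokenizer.
    return new LowerCaseFilter(new WhitespaceTokenizer(reader));
  }
}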
-M

> - Mark
>
> Michael Stoppelman wrote:
>> Might be nice to add a line of documentation to the highlighter on
>> the possible performance hit if one uses StandardAnalyzer, which is
>> probably a common case. Thanks for the speedy response.
>>
>> -M
>>
>> On 7/18/07, Mark Miller <[EMAIL PROTECTED]> wrote:
>>> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
>>> limited by JavaCC speed. You cannot shave much more performance out
>>> of the grammar as it is already about as simple as it gets. You
>>> should first see if you can get away without it and use a different
>>> Analyzer, or if you can re-implement just the functionality you need
>>> in a custom Analyzer. Do you really need the support for
>>> abbreviations, companies, email addresses, etc.?
>>>
>>> If so:
>>>
>>> You can use the TokenSources class in the highlighter package to
>>> rebuild a TokenStream without re-analyzing if you store term offsets
>>> and positions in the index. I have not found this to be super
>>> beneficial, even when using the StandardAnalyzer to re-analyze, but
>>> it certainly could be faster if you have large enough documents.
>>>
>>> Your best bet is probably to use
>>> https://issues.apache.org/jira/browse/LUCENE-644, which is a
>>> non-positional Highlighter that finds offsets to highlight by
>>> looking up query term offset information in the index. For larger
>>> documents this can be much faster than using the standard contrib
>>> Highlighter, even if you're using TokenSources. LUCENE-644 has a
>>> much flatter curve than the contrib Highlighter as document size
>>> goes up.
>>>
>>> - Mark
>>>
>>> Michael Stoppelman wrote:
>>>> Hi all,
>>>>
>>>> I was tracking down slowness in the contrib highlighter code and it
>>>> seems the seemingly simple tokenStream.next() is the culprit. I've
>>>> seen multiple posts about this being a possible cause. Has anyone
>>>> looked into how to speed up StandardTokenizer? For my documents
>>>> it's taking about 70ms per document; that's a big ugh! I was
>>>> thinking I might just cache the TermVectors in memory if that will
>>>> be faster. Anyone have another approach to solving this problem?
>>>>
>>>> -M
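P.S. For anyone else hitting this thread in the archives: the
TokenSources route Mark describes above looks roughly like the sketch
below. Untested, and it assumes the field was indexed with term
vectors storing positions and offsets; if I am reading TokenSources
right, getAnyTokenStream falls back to re-analyzing with the supplied
analyzer when the vectors are missing.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TokenSources;

public class VectorHighlight {
  // Highlight one stored document, preferring stored term vectors
  // over re-analysis.
  static String[] highlight(IndexReader reader, int docId, String field,
                            Query query, Analyzer fallback) throws Exception {
    String text = reader.document(docId).get(field);
    // Rebuilds the stream from stored offsets/positions if present,
    // otherwise re-analyzes with 'fallback'.
    TokenStream ts =
        TokenSources.getAnyTokenStream(reader, docId, field, fallback);
    Highlighter h = new Highlighter(new QueryScorer(query));
    return h.getBestFragments(ts, text, 3);
  }
}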