[ https://issues.apache.org/jira/browse/LUCENE-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558066#action_12558066 ]

Mark Miller commented on LUCENE-644:
------------------------------------

Yes it is still an issue.

It's been a while since I looked at this, so I may be a little off, but I 
believe the speed gain comes from the fact that this implementation only 
considers the terms from the query and, using info from TermVectors, 
reconstructs the document in large chunks (the chunks between each query 
term). So a 200-page document with one query term will be put together from 
the original doc after examining one token.
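
For illustration, here is a rough sketch of that TermVector-based lookup, 
assuming the Lucene 2.x-era TermPositionVector API. This is not the attached 
FulltextHighlighter code; the class and method below are made up for the 
example:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.index.TermPositionVector;
    import org.apache.lucene.index.TermVectorOffsetInfo;

    public class QueryTermOffsets
    {
      /** Prints the character offsets of each query-term occurrence in one doc. */
      public static void printOffsets(IndexReader reader, int docId, String field,
                                      String[] queryTerms) throws Exception
      {
        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        if (!(tfv instanceof TermPositionVector))
        {
          return; // field was not indexed with positions and offsets
        }
        TermPositionVector tpv = (TermPositionVector) tfv;
        for (int i = 0; i < queryTerms.length; i++)
        {
          int idx = tpv.indexOf(queryTerms[i]); // only the query terms are examined
          if (idx == -1)
          {
            continue; // term not present in this document
          }
          TermVectorOffsetInfo[] offsets = tpv.getOffsets(idx);
          if (offsets == null)
          {
            continue; // offsets were not stored for this field
          }
          for (int j = 0; j < offsets.length; j++)
          {
            System.out.println(queryTerms[i] + ": [" + offsets[j].getStartOffset()
                + ", " + offsets[j].getEndOffset() + ")");
          }
        }
      }
    }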

The current Highlighter reconstructs the document by running over every term 
in the TokenStream. This doesn't scale well: a 200-page document will have 
every token analyzed and scored as the correct offsets from the original 
document are slowly built up.
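
For contrast, here is a minimal sketch of the standard contrib Highlighter 
path being described, where the whole field text is re-analyzed and every 
token is scored. It assumes the Lucene 2.x-era contrib highlighter API; the 
helper class name is just for the example:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    public class StandardHighlightSketch
    {
      /** Re-analyzes the full field text; cost grows with document length. */
      public static String highlight(Query query, Analyzer analyzer,
                                     String field, String text) throws Exception
      {
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        highlighter.setMaxDocBytesToAnalyze(Integer.MAX_VALUE); // analyze everything
        TokenStream tokens = analyzer.tokenStream(field, new StringReader(text));
        return highlighter.getBestFragment(tokens, text);
      }
    }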

The result is that Ronnie's highlighter is *much* faster with larger 
documents, but not with smaller documents (getting TermVector info is slow 
enough that you need large docs to benefit).

I think Mark H could probably incorporate this into the other Highlighter, but 
it certainly won't fit the existing framework, so you either have to change 
the framework quite radically (affecting code already out there, I suppose) or 
have two frameworks to choose from.

The other disadvantage of this approach is that I don't see any way to 
incorporate position-sensitive highlighting.

> Contrib: another highlighter approach
> -------------------------------------
>
>                 Key: LUCENE-644
>                 URL: https://issues.apache.org/jira/browse/LUCENE-644
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>            Reporter: Ronnie Kolehmainen
>            Priority: Minor
>         Attachments: FulltextHighlighter.java, FulltextHighlighter.java, 
> FulltextHighlighterTest.java, FulltextHighlighterTest.java, svn-diff.patch, 
> svn-diff.patch, TokenSources.java, TokenSources.java.diff
>
>
> Mark Harwood's highlighter package is a great contribution to Lucene; I've 
> used it a lot! However, when you have *large* documents (fields), 
> highlighting can be quite time-consuming if you increase the number of bytes 
> to analyze with setMaxDocBytesToAnalyze(int). The default value of 50k is 
> often too low for indexed PDFs etcetera, which results in empty highlight 
> strings.
> This is an alternative approach that uses term position vectors only to 
> build fragment info objects. A StringReader can then read the relevant 
> fragments and skip() between them. This is a lot faster. Also, this method 
> uses the *entire* field for finding the best fragments, so you're always 
> guaranteed to get a highlight snippet.
> Because this method only works with fields that have term positions stored, 
> one can check whether it works for a particular field using the following 
> code (taken from TokenSources.java):
>         TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, field);
>         if (tfv != null && tfv instanceof TermPositionVector)
>         {
>           // use FulltextHighlighter
>         }
>         else
>         {
>           // use standard Highlighter
>         }
> Someone else might find this useful so I'm posting the code here.
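
As a hypothetical illustration of the skip()-based reading described in the 
quoted text above (this is not the attached FulltextHighlighter API; the 
FragmentSpan helper is made up for the sketch), fragment boundaries derived 
from term offsets let a Reader jump directly between the relevant chunks 
instead of scanning the whole field:

    import java.io.Reader;
    import java.io.StringReader;

    public class SkipReaderSketch
    {
      /** Start/end character offsets of one fragment (hypothetical helper). */
      public static class FragmentSpan
      {
        public final int start;
        public final int end;
        public FragmentSpan(int start, int end) { this.start = start; this.end = end; }
      }

      /** Reads only the given fragments, assumed sorted by start offset. */
      public static String[] readFragments(String fieldText, FragmentSpan[] spans)
          throws Exception
      {
        Reader reader = new StringReader(fieldText);
        String[] fragments = new String[spans.length];
        int pos = 0;
        for (int i = 0; i < spans.length; i++)
        {
          reader.skip(spans[i].start - pos);              // jump over irrelevant text
          char[] buf = new char[spans[i].end - spans[i].start];
          int read = reader.read(buf, 0, buf.length);
          fragments[i] = new String(buf, 0, Math.max(read, 0));
          pos = spans[i].end;
        }
        return fragments;
      }
    }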

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

