[ 
https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002491#comment-13002491
 ] 

Uwe Schindler edited comment on SOLR-2400 at 3/4/11 8:44 AM:
-------------------------------------------------------------

Stefan, this is a general issue with TokenStreams that add Tokens. TokenStreams 
that remove Tokens *should* automatically preserve positions, but not even all 
of those do that correctly (we fixed some of them lately). The way Lucene 
analysis works makes it impossible to guarantee any correspondence between the 
position numbers. For the indexer, only what comes out at the end matters; the 
steps in between are not interesting. AnalysisReqHandler, on the other hand, 
uses some bad "hacks" to look "inside" the analysis (temporary TokenStreams 
that buffer tokens), which is not the general use case of TokenStreams.
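
To illustrate what "preserve position" means for a filter that removes Tokens, 
here is a minimal sketch in Python. It is a toy model, not Lucene code: the 
names Token, pos_inc, tokenize, stop_filter and absolute_positions are 
hypothetical, and the only assumption is that each token carries an increment 
relative to the previous token, with a removing filter adding the increments 
of dropped tokens to the next surviving one.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Set, Tuple


@dataclass
class Token:
    text: str
    pos_inc: int = 1  # distance to the previous token (hypothetical field)


def tokenize(text: str) -> Iterator[Token]:
    """Whitespace tokenizer: every token advances the position by one."""
    for word in text.split():
        yield Token(word)


def stop_filter(tokens: Iterable[Token], stopwords: Set[str]) -> Iterator[Token]:
    """Drop stopwords, carrying their increments over to the next kept token."""
    skipped = 0
    for tok in tokens:
        if tok.text in stopwords:
            skipped += tok.pos_inc
            continue
        yield Token(tok.text, tok.pos_inc + skipped)
        skipped = 0


def absolute_positions(tokens: Iterable[Token]) -> List[Tuple[str, int]]:
    """Sum the increments to get absolute positions (1-based in this sketch)."""
    pos = 0
    out = []
    for tok in tokens:
        pos += tok.pos_inc
        out.append((tok.text, pos))
    return out


result = absolute_positions(
    stop_filter(tokenize("the quick fox of doom"), {"the", "of"})
)
print(result)
```

In this sketch the surviving tokens keep the positions they had before 
filtering, so the gaps left by the removed tokens stay visible. A filter that 
adds Tokens has no comparably simple rule, which is the issue described above.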

I am a bit puzzled by your XML file: it only contains text and position, but 
it should also contain rawTerm, startOffset, and endOffset. When I call the 
analysis, I get all of those attributes, not only two of them. Is this a 
hand-made file, or what is the problem? Which Solr version are you using?

One possibility might be to use the character offsets in the original text: 
the request handler could use the start and end character offsets of each 
token in the original stream instead of the token position. But this is likely 
to break for lots of TokenFilters (WordDelimiterFilter would work as long as 
you don't do stemming before...). The problem is incorrect offset calculation 
(which also leads to bugs in highlighting) when the inserted terms are longer 
than their originals.
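
As a rough sketch of the offset idea, the following Python snippet relates the 
tokens of two analysis stages by overlapping character ranges. It is a toy 
model and not the request handler's code: Tok, relate_by_offsets and the 
sample tokens are hypothetical, and it assumes that every stage reports start 
and end offsets that are correct with respect to the original text, which is 
exactly the assumption that fails for many filters.

```python
from typing import List, NamedTuple, Tuple


class Tok(NamedTuple):
    text: str
    start: int  # start character offset in the original text
    end: int    # end character offset (exclusive) in the original text


def relate_by_offsets(
    before: List[Tok], after: List[Tok]
) -> List[Tuple[str, List[str]]]:
    """For each output token, list the input tokens whose range overlaps it."""
    relations = []
    for out in after:
        sources = [
            src.text
            for src in before
            if src.start < out.end and out.start < src.end
        ]
        relations.append((out.text, sources))
    return relations


# Original text: "Wi-Fi router"
before = [Tok("Wi-Fi", 0, 5), Tok("router", 6, 12)]
# Hypothetical output of a filter that splits on the hyphen.
after = [Tok("Wi", 0, 2), Tok("Fi", 3, 5), Tok("router", 6, 12)]

relations = relate_by_offsets(before, after)
print(relations)
```

With correct offsets, both split parts map back to the token they came from. 
If a filter reports offsets that no longer match the original text, the 
overlap test silently returns wrong or empty relations.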

Altogether: It's unlikely that you can implement this so that it works for all 
combinations of TokenStream components.

  
> FieldAnalysisRequestHandler; add information about token-relation
> -----------------------------------------------------------------
>
>                 Key: SOLR-2400
>                 URL: https://issues.apache.org/jira/browse/SOLR-2400
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: Stefan Matheis (steffkes)
>            Priority: Minor
>         Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 
> 110303_FieldAnalysisRequestHandler_view.png
>
>
> The XML output (simplified example attached) is missing one small piece of 
> information that could be very useful for building a nice analysis output: 
> the "token relation" (if there is a special/correct word for this, please 
> correct me).
> That is, it is currently not possible to "follow" the analysis process 
> completely when the Tokenizers/Filters drop Tokens (e.g. StopWord) or split 
> one into multiple Tokens (e.g. WordDelimiter).
> Would it be possible to include this information? If so, it would be possible 
> to create an improved Analysis page for the new Solr Admin (SOLR-2399); 
> short scribble attached.
