[ https://issues.apache.org/jira/browse/SOLR-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13690421#comment-13690421 ]
Shruthi Khatawkar edited comment on SOLR-4945 at 6/21/13 4:06 PM:
------------------------------------------------------------------

Hi Christian,

To me this looks like an issue with highlighting: suggestedTextHighlightOffset is returning incorrect values. When I run the same query with Autocomplete alone, everything looks fine.

We are using SenTokenizerFactory. In the Analysis tab I can see the words being tokenised correctly. In most cases the term position and the source start and end offsets are correct. For the scenarios above, all the offset values seem to be right.

One observation, though (this is pretty confusing): in a few cases SenTokenizerFactory returns the same offset values for repeated words. For example, if there are 2 instances of "product" in a sentence, both have the same source start and end, 0,4.

was (Author: shruthi10):

Hi Christian,

To me this looks like an issue with highlighting: suggestedTextHighlightOffset is returning incorrect values, only the first offset. When I run the same query with Autocomplete alone, everything looks fine.

We are using SenTokenizerFactory. In the Analysis tab I can see the words being tokenised correctly. In most cases the term position and the source start and end offsets are correct. For the scenarios above, all the offset values seem to be right.
One observation, though (this is pretty confusing): in a few cases SenTokenizerFactory returns the same offset values for repeated words. For example, if there are 2 instances of "product" in a sentence, both have the same source start and end, 0,4.

> Japanese Autocomplete and Highlighter broken
> --------------------------------------------
>
>                 Key: SOLR-4945
>                 URL: https://issues.apache.org/jira/browse/SOLR-4945
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Shruthi Khatawkar
>
> Autocomplete is implemented with Highlighter functionality. This works fine
> for most languages but breaks for Japanese.
> multiValued, termVectors, termPositions and termOffsets are set to true.
> Here is an example:
> Query: product classic.
> Result:
> Actual :
> この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 USB接続ケーブルをproduct 1 やclassic
> Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
> With Highlighter (<em> </em> tags being used):
> この商品の互換性の機種<em>にproduct</em> 1 <em>やclassic</em> Touch2 が記載が有りません。
> USB接続ケーブルをproduct 1 やclassic Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
> Though the query terms "product classic" appear twice, highlighting
> happens only on the first instance, as shown above.
> Solr returns only the first instance's offset; the second instance is ignored.
> It is also observed that the highlighter repeats the first letter of the token
> if the token is followed by a digit.
> For example, with the query "product" and the text "product1", the highlighter
> returns p<em>product</em>1.
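The observation that repeated words receive identical offsets may be enough to explain why only the first instance is highlighted. The sketch below is not Solr's highlighter. It is a minimal Python illustration that assumes a highlighter wraps each reported (start, end) span exactly once. The function name highlight, the sample text, and the offset values are all hypothetical and chosen only for illustration.

```python
# Hypothetical illustration only: this is NOT Solr's highlighter code.
# Assumption: the highlighter wraps every distinct (start, end) span it is
# given, and the spans come straight from the tokenizer's reported offsets.

TEXT = "product 1 と product 1"  # made-up sample with a repeated word


def highlight(text, offsets, pre="<em>", post="</em>"):
    """Wrap each (start, end) span of text in pre/post tags."""
    out = []
    cursor = 0
    # Identical spans collapse into one, so a repeated word that reports
    # the same offsets as its first occurrence is never wrapped itself.
    for start, end in sorted(set(offsets)):
        if start < cursor:
            continue  # skip spans overlapping one already emitted
        out.append(text[cursor:start])
        out.append(pre + text[start:end] + post)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Distinct offsets for each occurrence: both instances are highlighted.
correct = highlight(TEXT, [(0, 7), (12, 19)])

# Same offsets reported twice (the behaviour described for repeated
# words): only the first instance is highlighted.
duplicated = highlight(TEXT, [(0, 7), (0, 7)])

print(correct)
print(duplicated)
```

If the real highlighter behaves along these lines, the offsets reported for repeated tokens would be the first thing to verify, for example by comparing the start and end values shown in the Analysis tab for each occurrence.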