[ https://issues.apache.org/jira/browse/LUCENE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jinsong Hu updated LUCENE-4730:
-------------------------------
    Description:
We found that SmartChineseAnalyzer produces a wrong match offset with the following test code:

  public void testHighlight() throws Exception {
    String text = "My China  "; // ends with two spaces
    String queryText = "China";
    StringBuilder builder = new StringBuilder("<html>");
    Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_40);
    //Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    QueryParser parser = new QueryParser(Version.LUCENE_40, "text", analyzer);
    Query query = parser.parse(queryText);
    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span style=\"background: yellow\">", "</span>");
    TokenStream tokens = analyzer.tokenStream("text", new StringReader(text));
    QueryScorer scorer = new QueryScorer(query, "text");
    Highlighter highlighter = new Highlighter(formatter, scorer);
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer));
    String result = highlighter.getBestFragments(tokens, text, 10, "...");
    if (result.length() < text.length()) {
      result = text;
    }
    builder.append("<body>");
    builder.append(result);
    builder.append("</body>");
    builder.append("</html>");
    System.out.println(builder.toString());
  }

This method generates highlighted text, but the highlight position is obviously wrong. If we remove one space from the text, that is, change it from "My China  " (ends with two spaces) to "My China " (ends with one space), the highlight is correct. If we change the analyzer from SmartChineseAnalyzer to StandardAnalyzer, the highlight issue disappears.
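To make the symptom concrete, here is a minimal plain-Java sketch (no Lucene dependency) of how a highlighter applies a token's start and end offsets to the original text. The class name OffsetHighlightSketch, the helper wrap, and the <b> tags are made up for illustration. The shifted offsets in the second call are an assumption chosen only to show what a misplaced highlight looks like; the report does not state the actual offsets SmartChineseAnalyzer returns.

```java
public class OffsetHighlightSketch {

    // Wraps text[start, end) in the given tags, the way a highlighter
    // uses a token's start and end offsets to mark a match.
    static String wrap(String text, int start, int end, String pre, String post) {
        return text.substring(0, start)
                + pre + text.substring(start, end) + post
                + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "My China  "; // ends with two spaces, as in the report
        String pre = "<b>";
        String post = "</b>";

        // Correct offsets for the token "China" are [3, 8).
        System.out.println(wrap(text, 3, 8, pre, post));

        // Hypothetical offsets shifted by one character. These values are
        // an assumption for illustration, not taken from the report.
        System.out.println(wrap(text, 4, 9, pre, post));
    }
}
```

With correct offsets the tags enclose exactly "China"; with offsets that are off by even one character the tags cut into the word and include trailing whitespace, which matches the kind of misplaced highlight described above.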
> SmartChineseAnalyzer got wrong matched offset
> ---------------------------------------------
>
>                 Key: LUCENE-4730
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4730
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.0, 4.1
>         Environment: JDK1.7 Linux/Windows
>            Reporter: Jinsong Hu
>            Priority: Critical

--
This message is automatically generated by JIRA.