[jira] Commented: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725460#action_12725460 ]

Dima May commented on LUCENE-1723:
----------------------------------

Verified! You are absolutely correct; the bug has been fixed on the latest trunk. The next() method in KeywordTokenizer now sets both the start and end offsets:

{noformat}
reusableToken.setStartOffset(input.correctOffset(0));
reusableToken.setEndOffset(input.correctOffset(upto));
{noformat}

I will resolve and close the ticket. Sorry for the trouble, and thank you for the prompt attention.

> KeywordTokenizer does not properly set the end offset
> -----------------------------------------------------
>
>                 Key: LUCENE-1723
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1723
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.4.1
>            Reporter: Dima May
>            Priority: Minor
>         Attachments: AnalyzerBug.java
>
>
> KeywordTokenizer sets the Token's term length attribute but appears to omit
> the end offset. The issue was discovered while using a highlighter with the
> KeywordAnalyzer. KeywordAnalyzer delegates to KeywordTokenizer, propagating
> the bug.
>
> Below is a JUnit test (source is also attached) that exercises various
> analyzers via a Highlighter instance. Every analyzer but the KeywordAnalyzer
> successfully wraps the text with the highlight tags, such as
> "<B>thetext</B>". When using KeywordAnalyzer the tags appear before the
> text, for example: "<B></B>thetext".
>
> Please note the NewKeywordAnalyzer and NewKeywordTokenizer classes below.
> When using NewKeywordAnalyzer the tags are properly placed around the text.
> NewKeywordTokenizer overrides the next method of KeywordTokenizer, setting
> the end offset for the returned Token. NewKeywordAnalyzer utilizes
> NewKeywordTokenizer to produce a proper token.
>
> Unless there is an objection I will gladly post a patch in the very near
> future.
> --------------------------------------------------------------------
> package lucene;
>
> import java.io.IOException;
> import java.io.Reader;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.KeywordAnalyzer;
> import org.apache.lucene.analysis.KeywordTokenizer;
> import org.apache.lucene.analysis.SimpleAnalyzer;
> import org.apache.lucene.analysis.StopAnalyzer;
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.WhitespaceAnalyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.search.highlight.Highlighter;
> import org.apache.lucene.search.highlight.QueryScorer;
> import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
> import org.apache.lucene.search.highlight.WeightedTerm;
> import org.junit.Test;
> import static org.junit.Assert.*;
>
> public class AnalyzerBug {
>     @Test
>     public void testWithHighlighting() throws IOException {
>         String text = "thetext";
>         WeightedTerm[] terms = { new WeightedTerm(1.0f, text) };
>         Highlighter highlighter = new Highlighter(
>                 new SimpleHTMLFormatter("<B>", "</B>"), new QueryScorer(terms));
>         Analyzer[] analyzers = { new StandardAnalyzer(), new SimpleAnalyzer(),
>                 new StopAnalyzer(), new WhitespaceAnalyzer(),
>                 new NewKeywordAnalyzer(), new KeywordAnalyzer() };
>         // All analyzers pass except KeywordAnalyzer
>         for (Analyzer analyzer : analyzers) {
>             String highlighted = highlighter.getBestFragment(analyzer,
>                     "CONTENT", text);
>             assertEquals("Failed for " + analyzer.getClass().getName(),
>                     "<B>" + text + "</B>", highlighted);
>             System.out.println(analyzer.getClass().getName()
>                     + " passed, value highlighted: " + highlighted);
>         }
>     }
> }
>
> class NewKeywordAnalyzer extends KeywordAnalyzer {
>     @Override
>     public TokenStream reusableTokenStream(String fieldName, Reader reader)
>             throws IOException {
>         Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
>         if (tokenizer == null) {
>             tokenizer = new NewKeywordTokenizer(reader);
>             setPreviousTokenStream(tokenizer);
>         } else {
>             tokenizer.reset(reader);
>         }
>         return tokenizer;
>     }
>
>     @Override
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new NewKeywordTokenizer(reader);
>     }
> }
>
> class NewKeywordTokenizer extends KeywordToke
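The failure mode described above ("<B></B>thetext") can be illustrated without Lucene at all: a highlighter inserts the opening tag at the token's start offset and the closing tag at its end offset, so a token whose end offset is left at 0 produces an empty highlighted span at the front of the text. A minimal, self-contained sketch (the class and method names here are illustrative, not Lucene APIs):

```java
public class OffsetDemo {
    // Minimal stand-in for a highlighter: wrap the span [start, end) of
    // the text in <B>...</B>, leaving the rest of the text untouched.
    static String highlight(String text, int start, int end) {
        return text.substring(0, start) + "<B>"
                + text.substring(start, end) + "</B>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "thetext";
        // Buggy tokenizer: end offset left at 0 -> empty span up front.
        System.out.println(highlight(text, 0, 0));             // <B></B>thetext
        // Fixed tokenizer: end offset equals the term length.
        System.out.println(highlight(text, 0, text.length())); // <B>thetext</B>
    }
}
```

This is exactly the difference between the failing KeywordAnalyzer output and the expected output in the test above.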
[jira] Commented: (LUCENE-1723) KeywordTokenizer does not properly set the end offset
[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725448#action_12725448 ]

Robert Muir commented on LUCENE-1723:
-------------------------------------

Dima, have you tried your test against the latest lucene trunk? I got these results:

{noformat}
org.apache.lucene.analysis.standard.StandardAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.SimpleAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.StopAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.WhitespaceAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.NewKeywordAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.KeywordAnalyzer passed, value highlighted: <B>thetext</B>
{noformat}

maybe you can verify the same?

> KeywordTokenizer does not properly set the end offset
> -----------------------------------------------------
>
>                 Key: LUCENE-1723
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1723
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.4.1
>            Reporter: Dima May
>            Priority: Minor
>         Attachments: AnalyzerBug.java
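For context, the two trunk setter calls quoted earlier in this thread (setStartOffset/setEndOffset) sit at the end of KeywordTokenizer's read-everything loop: the tokenizer drains the Reader into one growing buffer, tracks the character count in `upto`, and then records `upto` as the end offset. A simplified, self-contained sketch of that pattern in plain Java (no Lucene types; the buffer-growth policy here is an assumption, not the exact trunk code):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;

public class KeywordScan {
    // Drain the whole reader into a single "token", growing the buffer as
    // needed; upto ends up as the term length -- the value the end offset
    // must carry for highlighting to work.
    static String readAll(Reader input) throws IOException {
        char[] buffer = new char[4]; // deliberately tiny to exercise growth
        int upto = 0;
        while (true) {
            int length = input.read(buffer, upto, buffer.length - upto);
            if (length == -1) break;
            upto += length;
            if (upto == buffer.length)
                buffer = Arrays.copyOf(buffer, 1 + buffer.length);
        }
        // In the fixed KeywordTokenizer, this is the point where both
        // offsets are recorded: setStartOffset(input.correctOffset(0))
        // and setEndOffset(input.correctOffset(upto)).
        return new String(buffer, 0, upto);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new StringReader("thetext"))); // thetext
    }
}
```

The 2.4.1 bug was simply that the loop recorded the term length but never propagated `upto` into the token's end offset.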