[jira] Commented: (LUCENE-1723) KeywordTokenizer does not properly set the end offset

2009-06-29 Thread Dima May (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725460#action_12725460 ]

Dima May commented on LUCENE-1723:
--

Verified! You are absolutely correct, the bug has been fixed on the latest 
trunk. The next() method in KeywordTokenizer now sets both the start and end 
offsets:

   reusableToken.setStartOffset(input.correctOffset(0));
   reusableToken.setEndOffset(input.correctOffset(upto));
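
For context, the relevant method on trunk now reads roughly as follows. This is 
a paraphrased sketch of KeywordTokenizer.next(Token), not a verbatim copy of the 
trunk source, so treat everything outside the two offset calls as illustrative:

{noformat}
// Sketch (paraphrased) of KeywordTokenizer.next(Token) on trunk: the whole
// input is read into the token's term buffer, then the term length and both
// offsets are recorded before the token is returned.
public Token next(final Token reusableToken) throws IOException {
    if (!done) {
        done = true;
        int upto = 0;
        reusableToken.clear();
        char[] buffer = reusableToken.termBuffer();
        while (true) {
            final int length = input.read(buffer, upto, buffer.length - upto);
            if (length == -1) break;
            upto += length;
            if (upto == buffer.length)
                buffer = reusableToken.resizeTermBuffer(1 + buffer.length);
        }
        reusableToken.setTermLength(upto);
        reusableToken.setStartOffset(input.correctOffset(0));
        reusableToken.setEndOffset(input.correctOffset(upto));
        return reusableToken;
    }
    return null;
}
{noformat}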

I will resolve and close the ticket. Sorry for the trouble and thank you for 
the prompt attention. 


> KeywordTokenizer does not properly set the end offset
> -
>
> Key: LUCENE-1723
> URL: https://issues.apache.org/jira/browse/LUCENE-1723
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.4.1
>Reporter: Dima May
>Priority: Minor
> Attachments: AnalyzerBug.java
>
>
> KeywordTokenizer sets the Token's term length attribute but appears to omit 
> the end offset. The issue was discovered while using a highlighter with the 
> KeywordAnalyzer. KeywordAnalyzer delegates to KeywordTokenizer, propagating 
> the bug. 
> Below is a JUnit test (source is also attached) that exercises various 
> analyzers via a Highlighter instance. Every analyzer but the KeywordAnalyzer 
> successfully wraps the text with the highlight tags, for example 
> "<B>thetext</B>". When using KeywordAnalyzer the tags appear before the text, 
> for example: "<B></B>thetext". 
> Please note the NewKeywordAnalyzer and NewKeywordTokenizer classes below. When 
> using NewKeywordAnalyzer the tags are properly placed around the text. 
> NewKeywordTokenizer overrides the next method of KeywordTokenizer, setting 
> the end offset on the returned Token. NewKeywordAnalyzer utilizes 
> NewKeywordTokenizer to produce a proper token.
> Unless there is an objection I will gladly post a patch in the very near 
> future.
> -
> package lucene;
>
> import java.io.IOException;
> import java.io.Reader;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.KeywordAnalyzer;
> import org.apache.lucene.analysis.KeywordTokenizer;
> import org.apache.lucene.analysis.SimpleAnalyzer;
> import org.apache.lucene.analysis.StopAnalyzer;
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.WhitespaceAnalyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.search.highlight.Highlighter;
> import org.apache.lucene.search.highlight.QueryScorer;
> import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
> import org.apache.lucene.search.highlight.WeightedTerm;
> import org.junit.Test;
> import static org.junit.Assert.*;
>
> public class AnalyzerBug {
>
>     @Test
>     public void testWithHighlighting() throws IOException {
>         String text = "thetext";
>         WeightedTerm[] terms = { new WeightedTerm(1.0f, text) };
>         Highlighter highlighter = new Highlighter(
>                 new SimpleHTMLFormatter("<B>", "</B>"), new QueryScorer(terms));
>         Analyzer[] analyzers = { new StandardAnalyzer(), new SimpleAnalyzer(),
>                 new StopAnalyzer(), new WhitespaceAnalyzer(),
>                 new NewKeywordAnalyzer(), new KeywordAnalyzer() };
>
>         // All analyzers pass except KeywordAnalyzer
>         for (Analyzer analyzer : analyzers) {
>             String highlighted = highlighter.getBestFragment(analyzer,
>                     "CONTENT", text);
>             assertEquals("Failed for " + analyzer.getClass().getName(),
>                     "<B>" + text + "</B>", highlighted);
>             System.out.println(analyzer.getClass().getName()
>                     + " passed, value highlighted: " + highlighted);
>         }
>     }
> }
>
> class NewKeywordAnalyzer extends KeywordAnalyzer {
>
>     @Override
>     public TokenStream reusableTokenStream(String fieldName, Reader reader)
>             throws IOException {
>         Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
>         if (tokenizer == null) {
>             tokenizer = new NewKeywordTokenizer(reader);
>             setPreviousTokenStream(tokenizer);
>         } else {
>             tokenizer.reset(reader);
>         }
>         return tokenizer;
>     }
>
>     @Override
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new NewKeywordTokenizer(reader);
>     }
> }
>
> class NewKeywordTokenizer extends KeywordToke
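
(The quoted source is cut off at this point. Based on the description above, 
NewKeywordTokenizer presumably just delegates to KeywordTokenizer and then fills 
in the missing end offset; the sketch below is a reconstruction under that 
assumption, not the attached code.)

{noformat}
// Hypothetical reconstruction of the truncated NewKeywordTokenizer class:
// delegate to KeywordTokenizer.next(Token), then set the offsets that the
// 2.4.1 KeywordTokenizer leaves at their defaults.
class NewKeywordTokenizer extends KeywordTokenizer {

    NewKeywordTokenizer(Reader input) {
        super(input);
    }

    @Override
    public Token next(final Token reusableToken) throws IOException {
        Token token = super.next(reusableToken);
        if (token != null) {
            token.setStartOffset(0);
            token.setEndOffset(token.termLength());
        }
        return token;
    }
}
{noformat}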

[jira] Commented: (LUCENE-1723) KeywordTokenizer does not properly set the end offset

2009-06-29 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725448#action_12725448 ]

Robert Muir commented on LUCENE-1723:
-

Dima, have you tried your test against the latest Lucene trunk?

I got these results:
{noformat}
org.apache.lucene.analysis.standard.StandardAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.SimpleAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.StopAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.WhitespaceAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.NewKeywordAnalyzer passed, value highlighted: <B>thetext</B>
org.apache.lucene.analysis.KeywordAnalyzer passed, value highlighted: <B>thetext</B>
{noformat}

maybe you can verify the same?
