[jira] Commented: (LUCENE-579) TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars

Michael McCandless (JIRA) Wed, 31 Dec 2008 04:17:11 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660073#action_12660073
 ]


Michael McCandless commented on LUCENE-579:
-------------------------------------------

Thank you for the patch, Andrew.

I think this issue is a dup of LUCENE-1448, where the plan is to effectively 
add a getFinalOffset() method to TokenStream, which for the core/contrib 
analyzers would in fact default to the total number of characters read from 
their Reader inputs.
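
To make the off-by-one concrete, here is a minimal plain-Java sketch that does not use Lucene. It assumes the indexer currently advances the base offset for each subsequent field value by the end offset of the last token plus 1, and contrasts that with advancing by the number of characters read plus 1, which is roughly what a getFinalOffset() default would give. The class OffsetSketch, its methods, and the letter-run tokenizer (a rough stand-in for SimpleAnalyzer) are hypothetical names for illustration only, not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: none of these names exist in Lucene.
class OffsetSketch {

    /** Hypothetical token: lowercased term plus absolute start/end offsets. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Splits each value into runs of letters and assigns offsets across values.
     * useFinalOffset=false: the next value starts at (end of last token + 1),
     *   which is the behaviour assumed to cause the reported bug.
     * useFinalOffset=true: the next value starts at (characters read + 1).
     */
    static List<Tok> tokenize(String[] values, boolean useFinalOffset) {
        List<Tok> out = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            int lastEnd = 0;
            int i = 0;
            while (i < value.length()) {
                if (!Character.isLetter(value.charAt(i))) {
                    i++;
                    continue;
                }
                int start = i;
                while (i < value.length() && Character.isLetter(value.charAt(i))) {
                    i++;
                }
                out.add(new Tok(value.substring(start, i).toLowerCase(),
                        base + start, base + i));
                lastEnd = i;
            }
            int consumed = useFinalOffset ? value.length() : lastEnd;
            base += consumed + 1; // +1 for the gap between field values
        }
        return out;
    }

    /** Joins the values with a trailing ' ' each and slices out every token by offset. */
    static String[] recover(String[] values, boolean useFinalOffset) {
        StringBuilder text = new StringBuilder();
        for (String v : values) {
            text.append(v).append(' ');
        }
        List<Tok> toks = tokenize(values, useFinalOffset);
        String[] result = new String[toks.size()];
        for (int k = 0; k < result.length; k++) {
            Tok t = toks.get(k);
            result[k] = text.substring(t.start, t.end);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] values = {"one.", "two"};
        System.out.println(String.join("|", recover(values, false))); // one| tw
        System.out.println(String.join("|", recover(values, true)));  // one|two
    }
}
```

With the values "one." and "two" from the report, the first mode slices " tw" for the second term, matching the reported output, and the second mode slices "two".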

> TermPositionVector offsets incorrect if indexed field has multiple values and 
> one ends with non-term chars
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-579
>                 URL: https://issues.apache.org/jira/browse/LUCENE-579
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9
>            Reporter: Keiron McCammon
>         Attachments: offsets.patch
>
>
> If you add multiple values for a field with term vector positions and offsets 
> enabled, and one of the values ends with a non-term character, then the 
> offsets for the terms from subsequent values are wrong. For example (note the 
> '.' in the first value):
>         IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
>         Document doc = new Document();
>         doc.add(new Field("", "one.", Field.Store.YES, Field.Index.TOKENIZED,
>             Field.TermVector.WITH_POSITIONS_OFFSETS));
>         doc.add(new Field("", "two", Field.Store.YES, Field.Index.TOKENIZED,
>             Field.TermVector.WITH_POSITIONS_OFFSETS));
>         writer.addDocument(doc);
>         writer.optimize();
>         writer.close();
>         IndexSearcher searcher = new IndexSearcher(directory);
>         Hits hits = searcher.search(new MatchAllDocsQuery());
>         Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),
>             new QueryScorer(new TermQuery(new Term("", "camera")),
>                 searcher.getIndexReader(), ""));
>         for (int i = 0; i < hits.length(); ++i) {
>             TermPositionVector v = (TermPositionVector)
>                 searcher.getIndexReader().getTermFreqVector(hits.id(i), "");
>             StringBuilder str = new StringBuilder();
>             for (String s : hits.doc(i).getValues("")) {
>                 str.append(s);
>                 str.append(" ");
>             }
>             
>             System.out.println(str);
>             TokenStream tokenStream = TokenSources.getTokenStream(v, false);
>             String[] terms = v.getTerms();
>             int[] freq = v.getTermFrequencies();
>             for (int j = 0; j < terms.length; ++j) {
>                 System.out.print(terms[j] + ":" + freq[j] + ":");
>                 
>                 int[] pos = v.getTermPositions(j);
>                 
>                 System.out.print(Arrays.toString(pos));
>                 
>                 TermVectorOffsetInfo[] offset = v.getOffsets(j); 
>                 for (int k = 0; k < offset.length; ++k) {
>                     
>                     System.out.print(":");
>                     
>                     System.out.print(str.substring(offset[k].getStartOffset(),
>                         offset[k].getEndOffset()));
>                 }
>                 
>                 System.out.println();
>             }
>         }
>         searcher.close();
> If I run the above I get:
>         one:1:[0]:one
>         two:1:[1]: tw
> Note that the offsets for the second term are off by 1.
> It seems that the length of the stored value is not taken into account when 
> calculating the offsets for the terms of the next value.
> I noticed this problem when using the highlight contrib package, which can 
> use term vectors for highlighting. I also noticed that the offset for the 
> second string is the end of the previous value plus 1, so when concatenating 
> the field values to pass to the highlighter I had to append a ' ' character 
> after each string, which is quite useful but not documented anywhere.

