[ https://issues.apache.org/jira/browse/LUCENE-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660073#action_12660073 ]
Michael McCandless commented on LUCENE-579:
-------------------------------------------

Thank you for the patch, Andrew. I think this issue is a dup of LUCENE-1448,
where the plan is to effectively add a getFinalOffset() method to TokenStream,
which for the core/contrib analyzers would in fact default to the total number
of characters read from their Reader inputs.

> TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-579
>                 URL: https://issues.apache.org/jira/browse/LUCENE-579
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9
>            Reporter: Keiron McCammon
>         Attachments: offsets.patch
>
>
> If you add multiple values for a field with term vector positions and offsets
> enabled and one of the values ends with a non-term character, then the offsets
> for the terms from subsequent values are wrong. For example (note the '.' in
> the first value):
>
>     IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
>     Document doc = new Document();
>     doc.add(new Field("", "one.", Field.Store.YES, Field.Index.TOKENIZED,
>         Field.TermVector.WITH_POSITIONS_OFFSETS));
>     doc.add(new Field("", "two", Field.Store.YES, Field.Index.TOKENIZED,
>         Field.TermVector.WITH_POSITIONS_OFFSETS));
>     writer.addDocument(doc);
>     writer.optimize();
>     writer.close();
>
>     IndexSearcher searcher = new IndexSearcher(directory);
>     Hits hits = searcher.search(new MatchAllDocsQuery());
>     Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),
>         new QueryScorer(new TermQuery(new Term("", "camera")),
>             searcher.getIndexReader(), ""));
>     for (int i = 0; i < hits.length(); ++i) {
>         TermPositionVector v = (TermPositionVector)
>             searcher.getIndexReader().getTermFreqVector(hits.id(i), "");
>         StringBuilder str = new StringBuilder();
>         for (String s : hits.doc(i).getValues("")) {
>             str.append(s);
>             str.append(" ");
>         }
>         System.out.println(str);
>         TokenStream tokenStream = TokenSources.getTokenStream(v, false);
>         String[] terms = v.getTerms();
>         int[] freq = v.getTermFrequencies();
>         for (int j = 0; j < terms.length; ++j) {
>             System.out.print(terms[j] + ":" + freq[j] + ":");
>             int[] pos = v.getTermPositions(j);
>             System.out.print(Arrays.toString(pos));
>             TermVectorOffsetInfo[] offset = v.getOffsets(j);
>             for (int k = 0; k < offset.length; ++k) {
>                 System.out.print(":");
>                 System.out.print(str.substring(offset[k].getStartOffset(),
>                     offset[k].getEndOffset()));
>             }
>             System.out.println();
>         }
>     }
>     searcher.close();
>
> If I run the above I get:
>
>     one:1:[0]:one
>     two:1:[1]: tw
>
> Note that the offsets for the second term are off by 1.
> It seems that the length of the value that is stored is not taken into
> account when calculating the offsets for the next value.
> I noticed this problem when using the highlight contrib package, which can make
> use of term vectors for highlighting. I also noticed that the offset for the
> second string starts one past the end of the previous value, so when
> concatenating the field values to pass to the highlighter I had to append a ' '
> character after each string... which is quite useful, but not documented anywhere.
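
The off-by-one described in the report can be reproduced without Lucene. The following is a minimal sketch, not Lucene's indexing code: the class `OffsetSketch`, the method `offsets`, the constant `VALUE_GAP`, and the letter-only tokenizer are hypothetical names that stand in for what SimpleAnalyzer appears to do in the example. It assumes, as the report suggests, that the base offset for each subsequent value is advanced by the end offset of the last token plus one, and contrasts that with advancing by the total number of characters in the value, which is roughly what the getFinalOffset() plan in LUCENE-1448 would amount to.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {
    // Gap between consecutive values; the report observes it is one character.
    static final int VALUE_GAP = 1;

    // Returns {start, end} offsets for every letter-only token across all values.
    // useValueLength = false: advance the base by the last token's end offset
    //                         (assumed model of the reported behavior).
    // useValueLength = true:  advance the base by the full length of the value
    //                         (assumed model of the "final offset" approach).
    static List<int[]> offsets(String[] values, boolean useValueLength) {
        List<int[]> result = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            int lastEnd = 0;
            int i = 0;
            while (i < value.length()) {
                if (!Character.isLetter(value.charAt(i))) {
                    i++; // skip non-term characters such as '.'
                    continue;
                }
                int start = i;
                while (i < value.length() && Character.isLetter(value.charAt(i))) {
                    i++;
                }
                result.add(new int[] {base + start, base + i});
                lastEnd = i;
            }
            base += (useValueLength ? value.length() : lastEnd) + VALUE_GAP;
        }
        return result;
    }

    public static void main(String[] args) {
        String[] values = {"one.", "two"};
        // Same concatenation as in the report: each value followed by a space.
        String text = "one. two ";
        for (boolean useValueLength : new boolean[] {false, true}) {
            StringBuilder line = new StringBuilder();
            for (int[] o : offsets(values, useValueLength)) {
                line.append('[').append(text.substring(o[0], o[1])).append(']');
            }
            System.out.println(line);
        }
        // prints "[one][ tw]" and then "[one][two]"
    }
}
```

With the last-token-end rule the second term maps to " tw", matching the output in the report; with the value-length rule it maps to "two".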