[jira] Commented: (LUCENE-579) TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars

2008-12-31 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660073#action_12660073
 ] 

Michael McCandless commented on LUCENE-579:
---

Thank you for the patch Andrew.

I think this issue is a duplicate of LUCENE-1448, where the plan is effectively 
to add a getFinalOffset() method to TokenStream. For the core/contrib 
analyzers it would default to the total number of characters read from 
their Reader inputs.
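
To illustrate the idea, here is a minimal stand-alone sketch, not Lucene code. The class FinalOffsetSketch, its Tok type, and the tokenize/index helpers are hypothetical names made up for this example. It assumes a letter-only tokenizer (roughly what SimpleAnalyzer does) and assumes the one-character gap between values is kept. The base offset for the next value advances by the total number of characters read, not by the end offset of the last token.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of offset accumulation; none of these names are Lucene API. */
final class FinalOffsetSketch {

    /** A token whose offsets are relative to the concatenated field values. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Emits lower-cased runs of letters, shifting every offset by base. */
    static List<Tok> tokenize(String value, int base) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            if (!Character.isLetter(value.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < value.length() && Character.isLetter(value.charAt(i))) {
                i++;
            }
            out.add(new Tok(value.substring(start, i).toLowerCase(), base + start, base + i));
        }
        return out;
    }

    /** Tokenizes every value of a multi-valued field. */
    static List<Tok> index(String[] values) {
        List<Tok> all = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            all.addAll(tokenize(value, base));
            // "Final offset" here means all characters read from this value,
            // including trailing non-term characters. The +1 gap is an assumption.
            base += value.length() + 1;
        }
        return all;
    }

    public static void main(String[] args) {
        String[] values = {"one.", "two"};
        String joined = String.join(" ", values); // "one. two"
        for (Tok t : index(values)) {
            System.out.println(t.text + " -> " + joined.substring(t.start, t.end));
        }
        // prints:
        // one -> one
        // two -> two
    }
}
```

With the values from the report ("one." and "two"), the second token gets offsets 5 to 8, which select "two" from the values joined by a single space.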

 TermPositionVector offsets incorrect if indexed field has multiple values and 
 one ends with non-term chars
 --

 Key: LUCENE-579
 URL: https://issues.apache.org/jira/browse/LUCENE-579
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 1.9
Reporter: Keiron McCammon
 Attachments: offsets.patch


 If you add multiple values for a field with term vector positions and offsets 
 enabled, and one of the values ends with a non-term character, then the 
 offsets for the terms from subsequent values are wrong. For example (note the 
 '.' in the first value):
 IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
 Document doc = new Document();
 doc.add(new Field("", "one.", Field.Store.YES, Field.Index.TOKENIZED,
     Field.TermVector.WITH_POSITIONS_OFFSETS));
 doc.add(new Field("", "two", Field.Store.YES, Field.Index.TOKENIZED,
     Field.TermVector.WITH_POSITIONS_OFFSETS));
 writer.addDocument(doc);
 writer.optimize();
 writer.close();
 IndexSearcher searcher = new IndexSearcher(directory);
 Hits hits = searcher.search(new MatchAllDocsQuery());
 Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),
     new QueryScorer(new TermQuery(new Term("", "camera")),
         searcher.getIndexReader(), ""));
 for (int i = 0; i < hits.length(); ++i) {
     TermPositionVector v = (TermPositionVector)
         searcher.getIndexReader().getTermFreqVector(hits.id(i), "");
     StringBuilder str = new StringBuilder();
     for (String s : hits.doc(i).getValues("")) {
         str.append(s);
         str.append(" ");
     }
     System.out.println(str);
     TokenStream tokenStream = TokenSources.getTokenStream(v, false);
     String[] terms = v.getTerms();
     int[] freq = v.getTermFrequencies();
     for (int j = 0; j < terms.length; ++j) {
         System.out.print(terms[j] + ":" + freq[j] + ":");
         int[] pos = v.getTermPositions(j);
         System.out.print(Arrays.toString(pos));
         TermVectorOffsetInfo[] offset = v.getOffsets(j);
         for (int k = 0; k < offset.length; ++k) {
             System.out.print(":");
             System.out.print(str.substring(offset[k].getStartOffset(),
                 offset[k].getEndOffset()));
         }
         System.out.println();
     }
 }
 searcher.close();
 If I run the above I get:
 one:1:[0]:one
 two:1:[1]: tw
 Note that the offsets for the second term are off by 1.
 It seems that the length of the stored value is not taken into account when 
 calculating the offsets for the terms of the next value.
 I noticed this problem when using the highlight contrib package, which can 
 make use of term vectors for highlighting. I also noticed that the offset for 
 the second string is +1 the end of the previous value, so when concatenating 
 the field's values to pass to the highlighter I had to append a ' ' character 
 after each string...which is quite useful, but not documented anywhere.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-579) TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars

2008-12-31 Thread Andrew Duffy (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660088#action_12660088
 ] 

Andrew Duffy commented on LUCENE-579:
-

It is a duplicate of LUCENE-1448; the fix proposed in that issue will fix the 
problem in a very comprehensive way.




[jira] Commented: (LUCENE-579) TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars

2007-07-18 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513604
 ] 

Grant Ingersoll commented on LUCENE-579:


Can you provide a unit test for this?




[jira] Commented: (LUCENE-579) TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars

2007-07-14 Thread Shahan Khatchadourian (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512743
 ] 

Shahan Khatchadourian commented on LUCENE-579:
--

DocumentWriter seems to be the culprit: it adds 1 to the previous token's 
endOffset. It may not be possible to supply token offsets that undo this, 
because tokens are grouped by field and the order in which they are handled 
does not necessarily correspond to document order. This would also interfere 
with custom synonym tokens, since custom token offsets are no longer 
guaranteed.

I suggest adding a flag to Fieldable or IndexWriter that allows the exact 
provided offsets to be stored rather than increased by one. There do not seem 
to be any sanity checks on offset values when reading or writing the term 
vector.

A current workaround to this issue is to store the correct startOffset, but 
leave the endOffset as -1. This has the effect of undoing the +1 of the 
previous token's endOffset but prevents endOffset information from being 
available without retokenizing/reparsing the original document.
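
The effect of basing the next value's offsets on the previous token's endOffset can be reproduced with a small stand-alone sketch. This is not the DocumentWriter source: LastTokenOffsetSketch and its helpers are hypothetical names, and the tokenizer is assumed to emit plain runs of letters. The values are joined with a trailing ' ' after each one, as in the original report.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the reported behaviour; not the DocumentWriter source. */
final class LastTokenOffsetSketch {

    /** Returns {start, end} pairs for every run of letters across all values. */
    static List<int[]> offsets(String[] values) {
        List<int[]> out = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            int lastEnd = -1; // end offset of the last token in this value, -1 if none
            int i = 0;
            while (i < value.length()) {
                if (!Character.isLetter(value.charAt(i))) {
                    i++;
                    continue;
                }
                int start = i;
                while (i < value.length() && Character.isLetter(value.charAt(i))) {
                    i++;
                }
                out.add(new int[] {base + start, base + i});
                lastEnd = i;
            }
            // Assumed rule: advance by the last token's end offset plus one,
            // which ignores any non-term characters after that token.
            if (lastEnd >= 0) {
                base += lastEnd + 1;
            }
        }
        return out;
    }

    /** Cuts a span out of the values joined with a trailing ' ' after each one. */
    static String slice(String[] values, int[] span) {
        StringBuilder joined = new StringBuilder();
        for (String value : values) {
            joined.append(value).append(' ');
        }
        return joined.substring(span[0], span[1]);
    }

    public static void main(String[] args) {
        String[] values = {"one.", "two"};
        for (int[] span : offsets(values)) {
            System.out.println("[" + slice(values, span) + "]");
        }
        // prints:
        // [one]
        // [ tw]
    }
}
```

The trailing '.' in the first value is never counted, so the second token starts at offset 4 instead of 5, matching the " tw" in the reported output.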


