[jira] Commented: (SOLR-1662) BufferedTokenStream incorrect cloning

2009-12-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791940#action_12791940
 ] 

Uwe Schindler commented on SOLR-1662:
-

+1 Looks good!

> BufferedTokenStream incorrect cloning
> -
>
> Key: SOLR-1662
> URL: https://issues.apache.org/jira/browse/SOLR-1662
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 1.4
>Reporter: Robert Muir
>Assignee: Shalin Shekhar Mangar
> Attachments: SOLR-1662.patch
>
>
> As part of writing tests for SOLR-1657, I rewrote one of the base classes 
> (BaseTokenTestCase) to use the new TokenStream API, but also with some 
> additional safety.
> {code}
>  public static String tsToString(TokenStream in) throws IOException {
> StringBuilder out = new StringBuilder();
> TermAttribute termAtt = (TermAttribute) 
> in.addAttribute(TermAttribute.class);
> // extra safety to enforce, that the state is not preserved and also
> // assign bogus values
> in.clearAttributes();
> termAtt.setTermBuffer("bogusTerm");
> while (in.incrementToken()) {
>   if (out.length() > 0)
> out.append(' ');
>   out.append(termAtt.term());
>   in.clearAttributes();
>   termAtt.setTermBuffer("bogusTerm");
> }
> in.close();
> return out.toString();
>   }
> {code}
> Setting the term text to bogus values helps find bugs in tokenstreams that do 
> not clear or clone properly. In this case there is a problem with a 
> tokenstream AB_AAB_Stream in TestBufferedTokenStream, it converts A B -> A A 
> B but does not clone, so the values get overwritten.
> This can be fixed in two ways: 
> * BufferedTokenStream does the cloning
> * subclasses are responsible for the cloning
> The question is which one should it be?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1662) BufferedTokenStream incorrect cloning

2009-12-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791888#action_12791888
 ] 

Robert Muir commented on SOLR-1662:
---

cool, will work up a patch with javadocs wordage (I think we need this at 
least, its not obvious and there is no mention of it), and any fixes.

> BufferedTokenStream incorrect cloning
> -
>
> Key: SOLR-1662
> URL: https://issues.apache.org/jira/browse/SOLR-1662
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 1.4
>Reporter: Robert Muir
>
> As part of writing tests for SOLR-1657, I rewrote one of the base classes 
> (BaseTokenTestCase) to use the new TokenStream API, but also with some 
> additional safety.
> {code}
>  public static String tsToString(TokenStream in) throws IOException {
> StringBuilder out = new StringBuilder();
> TermAttribute termAtt = (TermAttribute) 
> in.addAttribute(TermAttribute.class);
> // extra safety to enforce, that the state is not preserved and also
> // assign bogus values
> in.clearAttributes();
> termAtt.setTermBuffer("bogusTerm");
> while (in.incrementToken()) {
>   if (out.length() > 0)
> out.append(' ');
>   out.append(termAtt.term());
>   in.clearAttributes();
>   termAtt.setTermBuffer("bogusTerm");
> }
> in.close();
> return out.toString();
>   }
> {code}
> Setting the term text to bogus values helps find bugs in tokenstreams that do 
> not clear or clone properly. In this case there is a problem with a 
> tokenstream AB_AAB_Stream in TestBufferedTokenStream, it converts A B -> A A 
> B but does not clone, so the values get overwritten.
> This can be fixed in two ways: 
> * BufferedTokenStream does the cloning
> * subclasses are responsible for the cloning
> The question is which one should it be?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1662) BufferedTokenStream incorrect cloning

2009-12-17 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791867#action_12791867
 ] 

Uwe Schindler commented on SOLR-1662:
-

bq. I think cloning should be done by sub-classes before writing. If 
BufferedTokenStream clones the token then every sub-class pays the price even 
though the use-case may just be to throw the token away.

+1, that was what i said in my first comment, too. Because BufferedTokenStream 
itsself never reuses the token. The problem is the test and RemoveDuplicates, 
that push the same instance twice into the queue.

> BufferedTokenStream incorrect cloning
> -
>
> Key: SOLR-1662
> URL: https://issues.apache.org/jira/browse/SOLR-1662
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 1.4
>Reporter: Robert Muir
>
> As part of writing tests for SOLR-1657, I rewrote one of the base classes 
> (BaseTokenTestCase) to use the new TokenStream API, but also with some 
> additional safety.
> {code}
>  public static String tsToString(TokenStream in) throws IOException {
> StringBuilder out = new StringBuilder();
> TermAttribute termAtt = (TermAttribute) 
> in.addAttribute(TermAttribute.class);
> // extra safety to enforce, that the state is not preserved and also
> // assign bogus values
> in.clearAttributes();
> termAtt.setTermBuffer("bogusTerm");
> while (in.incrementToken()) {
>   if (out.length() > 0)
> out.append(' ');
>   out.append(termAtt.term());
>   in.clearAttributes();
>   termAtt.setTermBuffer("bogusTerm");
> }
> in.close();
> return out.toString();
>   }
> {code}
> Setting the term text to bogus values helps find bugs in tokenstreams that do 
> not clear or clone properly. In this case there is a problem with a 
> tokenstream AB_AAB_Stream in TestBufferedTokenStream, it converts A B -> A A 
> B but does not clone, so the values get overwritten.
> This can be fixed in two ways: 
> * BufferedTokenStream does the cloning
> * subclasses are responsible for the cloning
> The question is which one should it be?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1662) BufferedTokenStream incorrect cloning

2009-12-17 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791866#action_12791866
 ] 

Shalin Shekhar Mangar commented on SOLR-1662:
-

{quote}
So if we decide its the responsibility of the subclass, these implementations 
need thorough tests to see if they are ok or not.
If we add the cloning to BufferedTokenStream itself, then we know they are ok...
{quote}

I think cloning should be done by sub-classes before writing. If 
BufferedTokenStream clones the token then every sub-class pays the price even 
though the use-case may just be to throw the token away.

> BufferedTokenStream incorrect cloning
> -
>
> Key: SOLR-1662
> URL: https://issues.apache.org/jira/browse/SOLR-1662
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 1.4
>Reporter: Robert Muir
>
> As part of writing tests for SOLR-1657, I rewrote one of the base classes 
> (BaseTokenTestCase) to use the new TokenStream API, but also with some 
> additional safety.
> {code}
>  public static String tsToString(TokenStream in) throws IOException {
> StringBuilder out = new StringBuilder();
> TermAttribute termAtt = (TermAttribute) 
> in.addAttribute(TermAttribute.class);
> // extra safety to enforce, that the state is not preserved and also
> // assign bogus values
> in.clearAttributes();
> termAtt.setTermBuffer("bogusTerm");
> while (in.incrementToken()) {
>   if (out.length() > 0)
> out.append(' ');
>   out.append(termAtt.term());
>   in.clearAttributes();
>   termAtt.setTermBuffer("bogusTerm");
> }
> in.close();
> return out.toString();
>   }
> {code}
> Setting the term text to bogus values helps find bugs in tokenstreams that do 
> not clear or clone properly. In this case there is a problem with a 
> tokenstream AB_AAB_Stream in TestBufferedTokenStream, it converts A B -> A A 
> B but does not clone, so the values get overwritten.
> This can be fixed in two ways: 
> * BufferedTokenStream does the cloning
> * subclasses are responsible for the cloning
> The question is which one should it be?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1662) BufferedTokenStream incorrect cloning

2009-12-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791374#action_12791374
 ] 

Robert Muir commented on SOLR-1662:
---

bq. but Robert told me that e.g. RemoveDuplicates has a similar problem.

Right, there is no cloning in RemoveDuplicates. CommonGrams creates a new 
Token() when it grams, but its not clear that this one is correct either.

So if we decide its the responsibility of the subclass, these implementations 
need thorough tests to see if they are ok or not.
If we add the cloning to BufferedTokenStream itself, then we know they are 
ok... 


> BufferedTokenStream incorrect cloning
> -
>
> Key: SOLR-1662
> URL: https://issues.apache.org/jira/browse/SOLR-1662
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 1.4
>Reporter: Robert Muir
>
> As part of writing tests for SOLR-1657, I rewrote one of the base classes 
> (BaseTokenTestCase) to use the new TokenStream API, but also with some 
> additional safety.
> {code}
>  public static String tsToString(TokenStream in) throws IOException {
> StringBuilder out = new StringBuilder();
> TermAttribute termAtt = (TermAttribute) 
> in.addAttribute(TermAttribute.class);
> // extra safety to enforce, that the state is not preserved and also
> // assign bogus values
> in.clearAttributes();
> termAtt.setTermBuffer("bogusTerm");
> while (in.incrementToken()) {
>   if (out.length() > 0)
> out.append(' ');
>   out.append(termAtt.term());
>   in.clearAttributes();
>   termAtt.setTermBuffer("bogusTerm");
> }
> in.close();
> return out.toString();
>   }
> {code}
> Setting the term text to bogus values helps find bugs in tokenstreams that do 
> not clear or clone properly. In this case there is a problem with a 
> tokenstream AB_AAB_Stream in TestBufferedTokenStream, it converts A B -> A A 
> B but does not clone, so the values get overwritten.
> This can be fixed in two ways: 
> * BufferedTokenStream does the cloning
> * subclasses are responsible for the cloning
> The question is which one should it be?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1662) BufferedTokenStream incorrect cloning

2009-12-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791370#action_12791370
 ] 

Uwe Schindler commented on SOLR-1662:
-

Just the short desription from the API side in Lucene:
Lucene's documentation of TokenStream.next() says: "The returned Token is a 
"full private copy" (not re-used across calls to next())". 
AB_AAB_Stream.process() duplicates the token by just putting it uncloned into 
the outQueue. As the consumer of the BufferedTokenStream assumes that the Token 
is private it is allowed to change it - and by that it also changes the token 
in the outQueue. If you e.g. put another TokenFilter in fromt of this 
AB_AAB_Stream, and modify the token there it would break.
In my opinion, the responsibility to clone is in AB_AAB_Stream, 
BufferedTokenStream will never return the same token twice by itsself. So its a 
bug in the test. But Robert told me that e.g. RemoveDuplicates has a similar 
problem.
The general contract for writing such streams is: whenever you return a Token 
from next(), never put it somewhere else uncloned, because the caller can 
change it.

The fix is to do: write((Token) t.clone());

> BufferedTokenStream incorrect cloning
> -
>
> Key: SOLR-1662
> URL: https://issues.apache.org/jira/browse/SOLR-1662
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 1.4
>Reporter: Robert Muir
>
> As part of writing tests for SOLR-1657, I rewrote one of the base classes 
> (BaseTokenTestCase) to use the new TokenStream API, but also with some 
> additional safety.
> {code}
>  public static String tsToString(TokenStream in) throws IOException {
> StringBuilder out = new StringBuilder();
> TermAttribute termAtt = (TermAttribute) 
> in.addAttribute(TermAttribute.class);
> // extra safety to enforce, that the state is not preserved and also
> // assign bogus values
> in.clearAttributes();
> termAtt.setTermBuffer("bogusTerm");
> while (in.incrementToken()) {
>   if (out.length() > 0)
> out.append(' ');
>   out.append(termAtt.term());
>   in.clearAttributes();
>   termAtt.setTermBuffer("bogusTerm");
> }
> in.close();
> return out.toString();
>   }
> {code}
> Setting the term text to bogus values helps find bugs in tokenstreams that do 
> not clear or clone properly. In this case there is a problem with a 
> tokenstream AB_AAB_Stream in TestBufferedTokenStream, it converts A B -> A A 
> B but does not clone, so the values get overwritten.
> This can be fixed in two ways: 
> * BufferedTokenStream does the cloning
> * subclasses are responsible for the cloning
> The question is which one should it be?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.