Re: ICUTokenizer ArrayIndexOutOfBounds

2012-10-17 Thread Robert Muir
Calling reset() is a mandatory part of the consumer lifecycle before
calling incrementToken(); see:

https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html

A lot of people don't consume these correctly. That's why these
tokenizers now try to throw exceptions if you do it wrong, rather than
silently returning wrong results.

If you really want to test that your consumer code (queryparser,
whatever) is doing this correctly, test your code with
MockTokenizer/MockAnalyzer in the test-framework package. This has a
little state machine with a lot more checks.
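
To illustrate the kind of check meant here, below is a minimal,
self-contained sketch of a lifecycle-checking tokenizer and a consumer
that follows the reset() / incrementToken() / end() / close() order from
the TokenStream javadoc. CheckedTokenizer, LifecycleDemo and the state
names are hypothetical and written for this example only; this is not
the MockTokenizer source and it uses no Lucene classes, and it is much
simpler than the real thing (no setReader(), no reuse after close()).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical whitespace tokenizer that rejects out-of-order calls. Not a Lucene class. */
class CheckedTokenizer {
    private enum State { CREATED, RESET, EXHAUSTED, ENDED, CLOSED }

    private final String[] words;
    private int next;
    private String term;
    private State state = State.CREATED;

    CheckedTokenizer(String text) {
        String trimmed = text.trim();
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Must be called exactly once before the first incrementToken(). */
    void reset() {
        require(state == State.CREATED, "reset()");
        next = 0;
        state = State.RESET;
    }

    /** Advances to the next token; returns false once the input is used up. */
    boolean incrementToken() {
        require(state == State.RESET, "incrementToken()");
        if (next < words.length) {
            term = words[next++];
            return true;
        }
        state = State.EXHAUSTED;
        return false;
    }

    /** Must be called after incrementToken() has returned false. */
    void end() {
        require(state == State.EXHAUSTED, "end()");
        state = State.ENDED;
    }

    /** Always allowed, so it can sit in a finally block. */
    void close() {
        state = State.CLOSED;
    }

    String term() {
        return term;
    }

    // Fail loudly instead of producing wrong results on misuse.
    private void require(boolean ok, String call) {
        if (!ok) {
            throw new IllegalStateException(call + " called in state " + state);
        }
    }
}

class LifecycleDemo {
    /** Consumes a tokenizer in the documented order. */
    static List<String> consume(CheckedTokenizer tokenizer) {
        List<String> terms = new ArrayList<>();
        tokenizer.reset();
        try {
            while (tokenizer.incrementToken()) {
                terms.add(tokenizer.term());
            }
            tokenizer.end();
        } finally {
            tokenizer.close();
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume(new CheckedTokenizer("hello icu tokenizer")));
        try {
            // Skipping reset() is rejected immediately.
            new CheckedTokenizer("oops").incrementToken();
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A consumer that skips reset() fails on its first incrementToken() call
with an IllegalStateException, which is easier to diagnose than an
ArrayIndexOutOfBoundsException from inside a buffer.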

On Wed, Oct 17, 2012 at 6:56 AM, Shane Perry wrote:
> Hi,
>
> I've been playing around with using the ICUTokenizer from 4.0.0.
> Using the code below, I was receiving an ArrayIndexOutOfBounds
> exception on the call to tokenizer.incrementToken().  Looking at the
> ICUTokenizer source, I can see why this is occurring (usableLength
> defaults to -1).
>
> ICUTokenizer tokenizer = new ICUTokenizer(myReader);
> CharTermAttribute termAtt =
>     tokenizer.getAttribute(CharTermAttribute.class);
>
> while (tokenizer.incrementToken())
> {
>     System.out.println(termAtt.toString());
> }
>
> After poking around a little more, I found that I can just call
> tokenizer.reset() (initializes usableLength to 0) right after
> constructing the object
> (org.apache.lucene.analysis.icu.segmentation.TestICUTokenizer does a
> similar step in its superclass).  I was wondering if someone could
> explain why I need to call tokenizer.reset() prior to using the
> tokenizer for the first time.
>
> Thanks in advance,
>
> Shane


ICUTokenizer ArrayIndexOutOfBounds

2012-10-17 Thread Shane Perry
Hi,

I've been playing around with using the ICUTokenizer from 4.0.0.
Using the code below, I was receiving an ArrayIndexOutOfBounds
exception on the call to tokenizer.incrementToken().  Looking at the
ICUTokenizer source, I can see why this is occurring (usableLength
defaults to -1).

ICUTokenizer tokenizer = new ICUTokenizer(myReader);
CharTermAttribute termAtt =
    tokenizer.getAttribute(CharTermAttribute.class);

while (tokenizer.incrementToken())
{
    System.out.println(termAtt.toString());
}

After poking around a little more, I found that I can just call
tokenizer.reset() (initializes usableLength to 0) right after
constructing the object
(org.apache.lucene.analysis.icu.segmentation.TestICUTokenizer does a
similar step in its superclass).  I was wondering if someone could
explain why I need to call tokenizer.reset() prior to using the
tokenizer for the first time.

Thanks in advance,

Shane