Oh, yes, it does extend CharTokenizer. Thanks, Ahmet. I had searched the Lucene source code for 256 and found nothing suspicious, which was itself suspicious, because it clearly looked like an internal limit. Of course I should have searched for 255...
I'll see how to proceed, since I don't want to use a custom build.

On Wed, Apr 21, 2010 at 3:50 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
>> Is 256 some inner maximum too in some lucene internal that causes
>> this? What is happening is that the long word is split into smaller
>> words up to 256 and then the min and max limit applied. Is that
>> correct? I have removed LengthFilter and still see the splitting at
>> 256 happen. I would like not to have this, and instead remove
>> altogether any word longer than max, without decomposing it into
>> smaller ones. Is there a way to achieve this?
>>
>> Using lucene 3.0.1
>
> Assuming your Tokenizer extends CharTokenizer:
>
> CharTokenizer.java has this field:
> private static final int MAX_WORD_LEN = 255;
>
> you can modify CharTokenizer.java according to your needs.
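For the record, here is the rough, untested direction I have in mind instead of patching CharTokenizer: a small standalone Tokenizer written against the 3.0 attribute API that accumulates letter runs of any length and silently drops whole words longer than a configurable maximum, rather than splitting them at 255. This is only a sketch; the class name MaxLengthLetterTokenizer and the maxTokenLen parameter are my own placeholders, and the Character.isLetter() test stands in for whatever token-character rule your real tokenizer uses.

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Sketch only: reads letter runs of arbitrary length and discards any
// word longer than maxTokenLen instead of breaking it into pieces.
public final class MaxLengthLetterTokenizer extends Tokenizer {

  private final int maxTokenLen; // whatever your "max" is, e.g. 50
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private int pos = 0; // absolute read position in the character stream

  public MaxLengthLetterTokenizer(Reader input, int maxTokenLen) {
    super(input);
    this.maxTokenLen = maxTokenLen;
  }

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    StringBuilder buf = new StringBuilder();
    int start = -1;
    int c;
    while ((c = input.read()) != -1) {
      pos++;
      if (Character.isLetter((char) c)) { // stand-in for your token-char test
        if (start == -1) {
          start = pos - 1;
        }
        buf.append((char) c); // no 255-char cap here
      } else if (start != -1) {
        if (buf.length() <= maxTokenLen) {
          return emit(buf, start);
        }
        // too long: drop the whole word and keep scanning
        buf.setLength(0);
        start = -1;
      }
    }
    if (start != -1 && buf.length() <= maxTokenLen) {
      return emit(buf, start);
    }
    return false;
  }

  private boolean emit(StringBuilder buf, int start) {
    termAtt.setTermBuffer(buf.toString());
    offsetAtt.setOffset(correctOffset(start), correctOffset(start + buf.length()));
    return true;
  }

  @Override
  public void end() {
    int finalOffset = correctOffset(pos);
    offsetAtt.setOffset(finalOffset, finalOffset);
  }

  @Override
  public void reset(Reader input) throws IOException {
    super.reset(input);
    pos = 0;
  }
}

Used in place of the existing tokenizer inside an Analyzer, this would make a separate LengthFilter unnecessary for the over-long case, though a LengthFilter could still enforce the minimum length.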