How about defaulting to a max token size of 16K in StandardTokenizer, so
that it never causes an IndexWriter exception, with an option to reduce
that size?
The backward incompatibility is limited then - tokens exceeding 16K will
NOT cause an IndexWriter exception. In 3.0 we can reduce that default
to a useful size.
The option to truncate the token can be useful, I think. It will index
the max size prefix of the long tokens. You can still find them, pretty
accurately - this becomes a prefix search, but is unlikely to return
multiple values because it's a long prefix. It allows you to choose a
relatively small max, such as 32 or 64, reducing the overhead caused by
junk in the documents while minimizing the chance of not finding something.
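To make the truncate-versus-drop choice concrete, here is a rough sketch in plain Java. It is not the StandardTokenizer code and uses no Lucene API: the class, enum, and method names are made up for illustration, and it measures length in Java chars rather than bytes, which is an assumption on my part.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Lucene.
final class TokenLengthPolicy {

    // What to do with a token longer than the configured maximum.
    enum Mode { TRUNCATE, DROP }

    // Returns the tokens that would be indexed under the given policy.
    // maxLen is counted in Java chars (an assumption; the limit discussed
    // in this thread may be defined differently).
    static List<String> apply(List<String> tokens, int maxLen, Mode mode) {
        if (maxLen < 1) {
            throw new IllegalArgumentException("maxLen must be >= 1");
        }
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxLen) {
                out.add(token);                      // short enough, keep as-is
            } else if (mode == Mode.TRUNCATE) {
                int end = maxLen;
                // Avoid cutting a surrogate pair in half.
                if (Character.isHighSurrogate(token.charAt(end - 1))
                        && Character.isLowSurrogate(token.charAt(end))) {
                    end--;
                }
                out.add(token.substring(0, end));    // index the max-size prefix
            }
            // Mode.DROP: the over-long token is simply skipped.
        }
        return out;
    }
}
```

With a max of 32 or 64 and Mode.TRUNCATE, a query for a long token would have to be cut to the same prefix before searching, which is the prefix-search behavior described above.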
Gabi.
Michael McCandless wrote:
Gabi Steinberg wrote:
On balance, I think that dropping the document makes sense. I think
Yonik is right in that ensuring that keys are useful - and indexable -
is the tokenizer's job.
StandardTokenizer, in my opinion, should behave similarly to a person
looking at a document and deciding which tokens should be indexed.
Few people would argue that a 16K block of binary data is useful for
searching, but it's reasonable to suggest that the text around it is
useful.
I know that one can add the LengthFilter to avoid this problem, but
this is not really intuitive; one does not expect the standard
tokenizer to generate tokens that IndexWriter chokes on.
My vote is to:
- drop documents with tokens longer than 16K, as Mike and Yonik suggested
- because an uninformed user would start with StandardTokenizer, I think
it should limit token size to 128 bytes, and add options to change
that size, choose between truncating or dropping longer tokens, and in
no case produce tokens longer than what IndexWriter can digest.
I like this idea, though we probably can't do that until 3.0 so we don't
break backwards compatibility?
...