How about defaulting to a max token size of 16K in StandardTokenizer, so that it never causes an IndexWriter exception, with an option to reduce that size?

The backward incompatibility is limited then - tokens exceeding 16K will NOT cause an IndexWriter exception. In 3.0 we can reduce that default to a useful size.

The option to truncate the token can be useful, I think. It will index the max-size prefix of each long token. You can still find such tokens pretty accurately - the lookup becomes a prefix search, but it is unlikely to return multiple values because it's a long prefix. It allows you to choose a relatively small max, such as 32 or 64, reducing the overhead caused by junk in the documents while minimizing the chance of not finding something.
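
To make the truncation idea concrete, here is a minimal sketch in plain Java. It does not use any Lucene API: the class TruncatedTokenDemo, its methods, and the limit of 32 are hypothetical names and values chosen for illustration, and the length is counted in chars rather than bytes. It only shows that a long token stays findable when the same truncation is applied at index time and at query time.

```java
import java.util.Set;
import java.util.TreeSet;

public class TruncatedTokenDemo {

    /** Returns the token unchanged if it fits, otherwise its first maxLen chars. */
    public static String truncate(String token, int maxLen) {
        return token.length() <= maxLen ? token : token.substring(0, maxLen);
    }

    /** "Indexes" a token by storing its (possibly truncated) form in the term set. */
    public static void index(Set<String> terms, String token, int maxLen) {
        terms.add(truncate(token, maxLen));
    }

    /** Looks a query token up by truncating it the same way before matching. */
    public static boolean matches(Set<String> terms, String queryToken, int maxLen) {
        return terms.contains(truncate(queryToken, maxLen));
    }

    public static void main(String[] args) {
        // Hypothetical limit; the message above suggests something like 32 or 64.
        int maxLen = 32;
        Set<String> terms = new TreeSet<String>();
        String longToken = "a-very-long-serial-number-0123456789-abcdefghijklmnopqrstuvwxyz";

        index(terms, longToken, maxLen);
        System.out.println("indexed terms: " + terms);
        System.out.println("found by full token: " + matches(terms, longToken, maxLen));
    }
}
```

Two different long tokens that share the same leading 32 chars would collide under this scheme, which is the "unlikely to return multiple values" trade-off mentioned above.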

Gabi.

Michael McCandless wrote:
Gabi Steinberg wrote:

On balance, I think that dropping the document makes sense. I think Yonik is right in that ensuring that keys are useful - and indexable - is the tokenizer's job.

StandardTokenizer, in my opinion, should behave similarly to a person looking at a document and deciding which tokens should be indexed. Few people would argue that a 16K block of binary data is useful for searching, but it's reasonable to suggest that the text around it is useful.

I know that one can add the LengthFilter to avoid this problem, but this is not really intuitive; one does not expect the standard tokenizer to generate tokens that IndexWriter chokes on.

My vote is to:
- drop documents with tokens longer than 16K, as Mike and Yonik suggested
- because an uninformed user would start with StandardTokenizer, I think it should limit token size to 128 bytes, and add options to change that size, choose between truncating or dropping longer tokens, and in no case produce tokens longer than what IndexWriter can digest.
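
The proposal in the list above could be sketched roughly as follows. This is plain Java, not Lucene code: TokenLengthPolicy, Mode, and HARD_LIMIT are hypothetical names, the 16K ceiling is taken from the figure quoted in this thread, and lengths are counted in chars rather than bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenLengthPolicy {

    /** What to do with a token that exceeds the configured maximum. */
    public enum Mode { DROP, TRUNCATE }

    // Assumption: stands in for "what IndexWriter can digest" (16K in this thread).
    public static final int HARD_LIMIT = 16 * 1024;

    private final int maxLen;
    private final Mode mode;

    public TokenLengthPolicy(int maxLen, Mode mode) {
        if (maxLen < 1) {
            throw new IllegalArgumentException("maxLen must be at least 1");
        }
        // Never allow a configured maximum above the hard ceiling.
        this.maxLen = Math.min(maxLen, HARD_LIMIT);
        this.mode = mode;
    }

    /** Returns the tokens that survive the policy, in their original order. */
    public List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String token : tokens) {
            if (token.length() <= maxLen) {
                out.add(token);
            } else if (mode == Mode.TRUNCATE) {
                out.add(token.substring(0, maxLen));
            }
            // Mode.DROP: the over-long token is skipped, the rest of the text is kept.
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("short", "abcdefghij");
        System.out.println(new TokenLengthPolicy(8, Mode.DROP).apply(tokens));
        System.out.println(new TokenLengthPolicy(8, Mode.TRUNCATE).apply(tokens));
    }
}
```

Clamping the configured maximum in the constructor is one way to guarantee that no setting can produce a token larger than the ceiling; rejecting such a setting with an exception would be an equally reasonable choice.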

I like this idea, though we probably can't do that until 3.0 so we don't break backwards compatibility?

...
