I think this is a good approach -- any objections?
This way, IndexWriter is in-your-face (throws TermTooLongException on
seeing a massive term), but StandardAnalyzer is robust (silently
skips or truncates the too-long terms).
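
To make the split concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class and method names, the local TermTooLongException stand-in, and the 16K character limit are all assumptions for illustration. The writer-side check throws, while the analyzer-side step quietly drops the over-long token so the writer never sees it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: plain Java, not the Lucene API. All names are hypothetical.
final class TermLengthSketch {
    // Assumed hard limit (in chars), standing in for whatever the writer enforces.
    static final int MAX_TERM_LENGTH = 16 * 1024;

    // Local stand-in for the exception discussed in this thread.
    static final class TermTooLongException extends RuntimeException {
        TermTooLongException(int length) {
            super("term length " + length + " exceeds " + MAX_TERM_LENGTH);
        }
    }

    // Strict layer: the writer refuses an over-long term outright.
    static void addTerm(List<String> index, String term) {
        if (term.length() > MAX_TERM_LENGTH) {
            throw new TermTooLongException(term.length());
        }
        index.add(term);
    }

    // Robust layer: the analyzer silently skips over-long tokens,
    // so everything it emits is safe to hand to addTerm.
    static List<String> analyze(List<String> rawTokens) {
        List<String> kept = new ArrayList<>();
        for (String token : rawTokens) {
            if (token.length() <= MAX_TERM_LENGTH) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Someone who bypasses the analyzer and feeds a massive term straight to addTerm gets the exception; someone who goes through analyze first never does.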
Mike
Gabi Steinberg wrote:
How about defaulting to a max token size of 16K in
StandardTokenizer, so that it never causes an IndexWriter
exception, with an option to reduce that size?
The backward incompatibility is limited then - tokens exceeding 16K
will NOT cause an IndexWriter exception. In 3.0 we can reduce
that default to a useful size.
The option to truncate the token can be useful, I think. It will
index the max size prefix of the long tokens. You can still find
them, pretty accurately - this becomes a prefix search, but is
unlikely to return multiple values because it's a long prefix. It
allows you to choose a relatively small max, such as 32 or 64,
reducing the overhead caused by junk in the documents while
minimizing the chance of not finding something.
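
A rough sketch of the skip-or-truncate option, again in plain Java rather than anything from Lucene. The names (TruncatingTokenSketch, LongTokenPolicy, limit) are made up for illustration, and lengths are counted in chars, which is an assumption. The point is that if the query term is cut to the same max, it matches the indexed prefix exactly.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names, not the StandardTokenizer API.
final class TruncatingTokenSketch {
    // What to do with a token longer than the configured max.
    enum LongTokenPolicy { SKIP, TRUNCATE }

    // Cuts a single token to at most maxLength chars (no-op for short tokens).
    static String truncate(String token, int maxLength) {
        return token.length() <= maxLength ? token : token.substring(0, maxLength);
    }

    // Applies the policy to every token; short tokens pass through unchanged.
    static List<String> limit(List<String> tokens, int maxLength, LongTokenPolicy policy) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxLength) {
                out.add(token);
            } else if (policy == LongTokenPolicy.TRUNCATE) {
                // Index only the max-size prefix of the long token.
                out.add(truncate(token, maxLength));
            }
            // SKIP: drop the long token and keep going.
        }
        return out;
    }
}
```

With a max of 32, a 100-char token is indexed as its first 32 chars, and a search for the full token still hits once the query goes through the same truncate call.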
Gabi.
Michael McCandless wrote:
Gabi Steinberg wrote:
On balance, I think that dropping the document makes sense. I
think Yonik is right in that ensuring that keys are useful - and
indexable - is the tokenizer's job.
StandardTokenizer, in my opinion, should behave similarly to a
person looking at a document and deciding which tokens should be
indexed. Few people would argue that a 16K block of binary data
is useful for searching, but it's reasonable to suggest that the
text around it is useful.
I know that one can add the LengthFilter to avoid this problem,
but this is not really intuitive; one does not expect the
standard tokenizer to generate tokens that IndexWriter chokes on.
My vote is to:
- drop documents with tokens longer than 16K, as Mike and Yonik
suggested
- because an uninformed user would start with StandardTokenizer, I
think it should limit token size to 128 bytes, and add options to
change that size, choose between truncating or dropping longer
tokens, and in no case produce tokens longer than what
IndexWriter can digest.
I like this idea, though we probably can't do that until 3.0 so we
don't break backwards compatibility?
...
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]