Re: DocumentsWriter.checkMaxTermLength issues

Michael McCandless Fri, 21 Dec 2007 12:47:30 -0800


I think this is a good approach -- any objections?

This way, IndexWriter is in-your-face (throws TermTooLongException on seeing a massive term), but StandardAnalyzer is robust (silently skips the too-long terms or truncates them to a prefix).


Mike

Gabi Steinberg wrote:

How about defaulting to a max token size of 16K in StandardTokenizer, so that it never causes an IndexWriter exception, with an option to reduce that size?
The backward incompatibility is limited then - tokens exceeding 16K will NOT cause an IndexWriter exception. In 3.0 we can reduce that default to a useful size.
The option to truncate the token can be useful, I think. It will index the max-size prefix of the long tokens. You can still find them, pretty accurately - this becomes a prefix search, but is unlikely to return multiple values because it's a long prefix. It allows you to choose a relatively small max, such as 32 or 64, reducing the overhead caused by junk in the documents while minimizing the chance of not finding something.
Gabi.

Michael McCandless wrote:
Gabi Steinberg wrote:
On balance, I think that dropping the document makes sense. I think Yonik is right in that ensuring that keys are useful - and indexable - is the tokenizer's job.
StandardTokenizer, in my opinion, should behave similarly to a person looking at a document and deciding which tokens should be indexed. Few people would argue that a 16K block of binary data is useful for searching, but it's reasonable to suggest that the text around it is useful.
I know that one can add the LengthFilter to avoid this problem, but this is not really intuitive; one does not expect the standard tokenizer to generate tokens that IndexWriter chokes on.
My vote is to:
- drop documents with tokens longer than 16K, as Mike and Yonik suggested
- because an uninformed user would start with StandardTokenizer, I think it should limit token size to 128 bytes, and add options to change that size, choose between truncating or dropping longer tokens, and in no case produce tokens longer than what IndexWriter can digest.
I like this idea, though we probably can't do that until 3.0 so we don't break backwards compatibility?
...
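
The two behaviours discussed in this thread (a strict writer that throws on an over-long term, and a lenient tokenizer that skips or truncates long tokens before they reach the writer) can be sketched roughly as below. This is only an illustration and not Lucene code: the class name, the enum, the method names and the exception class are all hypothetical stand-ins, and length is counted in Java chars rather than bytes to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is NOT Lucene's API.
class TermLengthSketch {

    // What the tokenizer side does with a token longer than its configured maximum.
    enum LongTokenPolicy { SKIP, TRUNCATE }

    // Hard limit on the writer side; 16K is the figure used in the thread.
    // Assumption: measured in chars here, not bytes.
    static final int WRITER_MAX_TERM_LENGTH = 16 * 1024;

    // Stand-in for the TermTooLongException mentioned in the thread.
    static class TermTooLongException extends RuntimeException {
        TermTooLongException(int length) {
            super("term length " + length + " exceeds " + WRITER_MAX_TERM_LENGTH);
        }
    }

    // Tokenizer side: never emits a token longer than maxTokenLength.
    static List<String> limitTokens(List<String> tokens, int maxTokenLength,
                                    LongTokenPolicy policy) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxTokenLength) {
                out.add(token);
            } else if (policy == LongTokenPolicy.TRUNCATE) {
                // Keep only the max-size prefix of the long token.
                out.add(token.substring(0, maxTokenLength));
            }
            // SKIP: the long token is dropped; surrounding tokens are kept.
        }
        return out;
    }

    // Writer side: strict, fails loudly on a term it cannot index.
    static void checkMaxTermLength(String term) {
        if (term.length() > WRITER_MAX_TERM_LENGTH) {
            throw new TermTooLongException(term.length());
        }
    }
}
```

As long as the tokenizer-side maximum is no larger than the writer-side limit, tokens that pass through limitTokens can never trigger the exception in checkMaxTermLength, which is the property being asked for above.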

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


