Re: DocumentsWriter.checkMaxTermLength issues

Gabi Steinberg Thu, 20 Dec 2007 13:52:07 -0800

How about defaulting to a max token size of 16K in StandardTokenizer, so that it never causes an IndexWriter exception, with an option to reduce that size?

The backward incompatibility is limited then - tokens exceeding 16K will NOT cause an IndexWriter exception. In 3.0 we can reduce that default to a useful size.

The option to truncate the token can be useful, I think. It will index the max-size prefix of the long tokens. You can still find them, pretty accurately - this becomes a prefix search, but it is unlikely to return multiple values because it's a long prefix. It allows you to choose a relatively small max, such as 32 or 64, reducing the overhead caused by junk in the documents while minimizing the chance of not finding something.
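
To make the truncate-versus-drop choice concrete, here is a minimal sketch in plain Java. It does not use any Lucene API: the class TokenLengthPolicy, the Mode enum and the apply method are hypothetical names invented for illustration, and length is counted in Java chars rather than bytes, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: applies a max token length
// to an already tokenized list of strings.
final class TokenLengthPolicy {

    // What to do with a token that exceeds the limit.
    enum Mode { TRUNCATE, DROP }

    // Returns the tokens that survive the policy, in their original order.
    // Length is measured in Java chars (UTF-16 code units), an assumption;
    // a byte-based limit would need to encode the token first.
    static List<String> apply(List<String> tokens, int maxLen, Mode mode) {
        if (maxLen <= 0) {
            throw new IllegalArgumentException("maxLen must be positive");
        }
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxLen) {
                out.add(token);                        // short enough: keep as-is
            } else if (mode == Mode.TRUNCATE) {
                // Keep only the max-size prefix. Note: this simple cut could
                // split a surrogate pair; a real filter would guard against that.
                out.add(token.substring(0, maxLen));
            }
            // Mode.DROP: the long token is skipped, the rest of the text is kept.
        }
        return out;
    }
}
```

With a limit of 4, TRUNCATE would turn the tokens "ab" and "abcdef" into "ab" and "abcd", while DROP would keep only "ab". Searching for the long token then amounts to searching for its truncated prefix.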


Gabi.

Michael McCandless wrote:

Gabi Steinberg wrote:
On balance, I think that dropping the document makes sense. I think Yonik is right that ensuring that keys are useful - and indexable - is the tokenizer's job.
StandardTokenizer, in my opinion, should behave similarly to a person looking at a document and deciding which tokens should be indexed. Few people would argue that a 16K block of binary data is useful for searching, but it's reasonable to suggest that the text around it is useful.
I know that one can add the LengthFilter to avoid this problem, but this is not really intuitive; one does not expect the standard tokenizer to generate tokens that IndexWriter chokes on.
My vote is to:
- drop documents with tokens longer than 16K, as Mike and Yonik suggested
- because an uninformed user would start with StandardTokenizer, I think it should limit token size to 128 bytes, and add options to change that size, choose between truncating or dropping longer tokens, and in no case produce tokens longer than what IndexWriter can digest.
I like this idea, though we probably can't do that until 3.0 so we don't break backwards compatibility?

...

