I like the approach of configuring this behavior in the analysis chain
(so that IndexWriter can throw an exception on such errors).

It seems this should be a property of Analyzer rather than
just StandardAnalyzer, right?

It could probably be a "policy" property with two parameters:
1) maxLength, and 2) the action to take when a too-long token is
generated: chop/split/ignore/raiseException.
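
A rough sketch of the shape I have in mind (the names here -
LongTokenPolicy, Action, setLongTokenPolicy - are purely illustrative,
not an existing API):

    // illustrative only: a per-Analyzer policy for over-long tokens
    public class LongTokenPolicy {
      public enum Action { CHOP, SPLIT, IGNORE, RAISE_EXCEPTION }

      private final int maxLength;
      private final Action action;

      public LongTokenPolicy(int maxLength, Action action) {
        this.maxLength = maxLength;
        this.action = action;
      }

      public int getMaxLength() { return maxLength; }
      public Action getAction() { return action; }
    }

    // the Analyzer (or a TokenFilter it installs) would consult the policy, e.g.:
    //   analyzer.setLongTokenPolicy(
    //       new LongTokenPolicy(128, LongTokenPolicy.Action.CHOP));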

Doron

On Dec 21, 2007 10:46 PM, Michael McCandless <[EMAIL PROTECTED]>
wrote:

>
> I think this is a good approach -- any objections?
>
> This way, IndexWriter is in-your-face (it throws TermTooLongException
> on seeing a massive term), but StandardAnalyzer is robust (it silently
> skips too-long terms or truncates them to a prefix).
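>
> For example (TermTooLongException is the exception proposed here, not
> an existing class; I'm assuming it would be unchecked, and that writer
> and doc are already set up):
>
>     try {
>       writer.addDocument(doc);
>     } catch (TermTooLongException e) {
>       // the application hears loudly which document had the massive
>       // term, and can decide to skip it, strip the field, or re-throw
>     }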
>
> Mike
>
> Gabi Steinberg wrote:
>
> > How about defaulting to a max token size of 16K in
> > StandardTokenizer, so that it never causes an IndexWriter
> > exception, with an option to reduce that size?
> >
> > The backward incompatibility is limited then - tokens exceeding 16K
> > will NOT cause an IndexWriter exception.  In 3.0 we can reduce
> > that default to a useful size.
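> >
> > Something like this, say (setMaxTokenLength is hypothetical here,
> > just to show the shape of the option; dir is an existing Directory):
> >
> >     StandardAnalyzer analyzer = new StandardAnalyzer();
> >     analyzer.setMaxTokenLength(16 * 1024);  // hypothetical setter, 16K default
> >     IndexWriter writer = new IndexWriter(dir, analyzer, true);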
> >
> > The option to truncate the token can be useful, I think.  It will
> > index the max-size prefix of each long token.  You can still find
> > them fairly accurately - this effectively becomes a prefix search,
> > but it is unlikely to return multiple values because the prefix is
> > long.  It allows you to choose a relatively small max, such as 32 or
> > 64, reducing the overhead caused by junk in the documents while
> > minimizing the chance of not finding something.
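> >
> > (That works because the same truncating analyzer runs at query time,
> > so a query containing the full long token is reduced to the same
> > prefix that was indexed.  A sketch, assuming "contents" is the field
> > and analyzer/searcher are already set up:)
> >
> >     QueryParser parser = new QueryParser("contents", analyzer);
> >     Query query = parser.parse(QueryParser.escape(longToken));
> >     Hits hits = searcher.search(query);  // matches the truncated indexed term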
> >
> > Gabi.
> >
> > Michael McCandless wrote:
> >> Gabi Steinberg wrote:
> >>> On balance, I think that dropping the document makes sense.  I
> >>> think Yonik is right in that ensuring that keys are useful - and
> >>> indexable - is the tokenizer's job.
> >>>
> >>> StandardTokenizer, in my opinion, should behave similarly to a
> >>> person looking at a document and deciding which tokens should be
> >>> indexed.  Few people would argue that a 16K block of binary data
> >>> is useful for searching, but it's reasonable to suggest that the
> >>> text around it is useful.
> >>>
> >>> I know that one can add the LengthFilter to avoid this problem,
> >>> but this is not really intuitive; one does not expect the
> >>> standard tokenizer to generate tokens that IndexWriter chokes on.
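> >>>
> >>> (For reference, that workaround looks roughly like this - a custom
> >>> Analyzer reproducing StandardAnalyzer's chain with a LengthFilter
> >>> on top; the 128-character cap is an arbitrary choice:)
> >>>
> >>>     import java.io.Reader;
> >>>     import org.apache.lucene.analysis.*;
> >>>     import org.apache.lucene.analysis.standard.*;
> >>>
> >>>     public class CappedStandardAnalyzer extends Analyzer {
> >>>       public TokenStream tokenStream(String fieldName, Reader reader) {
> >>>         TokenStream stream = new StandardTokenizer(reader);
> >>>         stream = new StandardFilter(stream);
> >>>         stream = new LowerCaseFilter(stream);
> >>>         stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
> >>>         return new LengthFilter(stream, 1, 128);  // silently drops longer tokens
> >>>       }
> >>>     }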
> >>>
> >>> My vote is to:
> >>> - drop documents with tokens longer than 16K, as Mike and Yonik
> >>> suggested
> >>> - because an uninformed user would start with StandardTokenizer, I
> >>> think it should limit token size to 128 bytes, and add options to
> >>> change that size, choose between truncating or dropping longer
> >>> tokens, and in no case produce tokens longer than what
> >>> IndexWriter can digest.
> >> I like this idea, though we probably can't do that until 3.0 so we
> >> don't break backwards compatibility?
> > ...