I like the approach of configuring this behavior in the analysis layer (so that IndexWriter can still throw an exception on such errors).
It seems that this should be a property of Analyzer rather than just StandardAnalyzer, right? It could probably be a "policy" property with two parameters: 1) maxLength, and 2) action: chop/split/ignore/raiseException when a too-long token is generated.

Doron

On Dec 21, 2007 10:46 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
> I think this is a good approach -- any objections?
>
> This way, IndexWriter is in-your-face (throws TermTooLongException on
> seeing a massive term), but StandardAnalyzer is robust (silently
> skips or prefixes the too-long terms).
>
> Mike
>
> Gabi Steinberg wrote:
>
> > How about defaulting to a max token size of 16K in
> > StandardTokenizer, so that it never causes an IndexWriter
> > exception, with an option to reduce that size?
> >
> > The backward incompatibility is limited then - tokens exceeding 16K
> > will NOT cause an IndexWriter exception. In 3.0 we can reduce
> > that default to a useful size.
> >
> > The option to truncate the token can be useful, I think. It will
> > index the max-size prefix of the long tokens. You can still find
> > them, pretty accurately - this becomes a prefix search, but it is
> > unlikely to return multiple values because the prefix is long. It
> > allows you to choose a relatively small max, such as 32 or 64,
> > reducing the overhead caused by junk in the documents while
> > minimizing the chance of not finding something.
> >
> > Gabi.
> >
> > Michael McCandless wrote:
> >> Gabi Steinberg wrote:
> >>> On balance, I think that dropping the document makes sense. I
> >>> think Yonik is right in that ensuring that keys are useful - and
> >>> indexable - is the tokenizer's job.
> >>>
> >>> StandardTokenizer, in my opinion, should behave similarly to a
> >>> person looking at a document and deciding which tokens should be
> >>> indexed. Few people would argue that a 16K block of binary data
> >>> is useful for searching, but it's reasonable to suggest that the
> >>> text around it is useful.
> >>>
> >>> I know that one can add the LengthFilter to avoid this problem,
> >>> but this is not really intuitive; one does not expect the
> >>> standard tokenizer to generate tokens that IndexWriter chokes on.
> >>>
> >>> My vote is to:
> >>> - drop documents with tokens longer than 16K, as Mike and Yonik
> >>> suggested
> >>> - because an uninformed user would start with StandardTokenizer, I
> >>> think it should limit token size to 128 bytes, and add options to
> >>> change that size, choose between truncating or dropping longer
> >>> tokens, and in no case produce tokens longer than what
> >>> IndexWriter can digest.
> >> I like this idea, though we probably can't do that until 3.0 so we
> >> don't break backwards compatibility?
> > ...
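For reference, the LengthFilter workaround Gabi mentions is already possible with the current API. Below is a minimal sketch, not an existing Lucene class: the LengthLimitedAnalyzer name and its max-length parameter are made up for illustration. It wraps StandardTokenizer so that over-long tokens are silently dropped before they ever reach IndexWriter.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LengthFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/**
 * Illustrative sketch only (not part of Lucene): an analyzer that
 * drops tokens longer than a configurable limit, so IndexWriter never
 * sees an oversized term.
 */
public class LengthLimitedAnalyzer extends Analyzer {
  private final int maxTokenLength;

  public LengthLimitedAnalyzer(int maxTokenLength) {
    this.maxTokenLength = maxTokenLength;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    // LengthFilter silently drops tokens outside [1, maxTokenLength];
    // this is the "ignore" action discussed above. Truncating instead
    // of dropping would need a small custom filter.
    result = new LengthFilter(result, 1, maxTokenLength);
    return result;
  }
}

A "policy" property along the lines proposed above would essentially generalize this: the analyzer would choose between dropping, truncating, splitting, or raising an exception, rather than hard-wiring the drop behavior as this sketch does.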