Doron Cohen wrote:

On Dec 31, 2007 7:54 PM, Michael McCandless <[EMAIL PROTECTED]>
wrote:

I actually think indexing should try to be as robust as possible. You
could test like crazy and never hit a massive term, then go into
production (say, ship your app to lots of your customers' computers)
only to suddenly see this exception. In general it could be a long
time before you or your users "accidentally" hit this.

So I'm thinking we should have the default behavior, in IndexWriter,
be to skip immense terms?

Then people can use a TokenFilter to change this behavior if they want.


+1

OK I will take this approach.
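
For anyone who wants a different limit (or wants to drop the huge
tokens before IndexWriter ever sees them), a user-side filter is
simple enough. Roughly something like this against the 2.x
Token-based TokenStream API -- the class name and the limit are just
placeholders, and position increments are ignored for the moment (see
below):

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Drops any token longer than maxTokenLength; everything else
    // passes through unchanged.
    public class SkipImmenseTermsFilter extends TokenFilter {
      private final int maxTokenLength;

      public SkipImmenseTermsFilter(TokenStream input, int maxTokenLength) {
        super(input);
        this.maxTokenLength = maxTokenLength;
      }

      public Token next() throws IOException {
        for (Token t = input.next(); t != null; t = input.next()) {
          if (t.termText().length() <= maxTokenLength) {
            return t;
          }
          // else: silently drop the immense token
        }
        return null;  // end of stream
      }
    }

Wrapping the analyzer's stream with a filter like this in
Analyzer.tokenStream() keeps the huge token out of indexing entirely.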

At first I saw this as similar to IndexWriter.setMaxFieldLength(), but
that was the wrong comparison: #terms is a "real" indexing/search
characteristic that many applications can benefit from being able
to modify, whereas a huge token is in most cases a bug.
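
Just to illustrate the distinction (a hypothetical fragment; the path
and the limit are made up): setMaxFieldLength() bounds how many terms
get indexed per field, not how long any single term can be:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class MaxFieldLengthExample {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/testindex",
                                             new StandardAnalyzer(), true);
        // Index at most 100k terms per field -- a legitimate
        // per-application tunable.  It says nothing about the length
        // of any single term.
        writer.setMaxFieldLength(100000);
        writer.close();
      }
    }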

Just to make sure about the scenario: the only change is to skip
too-long tokens, while any other exception is still thrown (not
ignored).

Exactly. And, on any exception, we will immediately mark any partially indexed doc as deleted.

Also, for a skipped token I think the position increment of the
following token should be incremented.

Good point; I'll make sure we do.
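
E.g. in the hypothetical filter sketched earlier, next() would become
something like this, so the positions of dropped tokens are carried
forward:

    public Token next() throws IOException {
      int skippedPositions = 0;
      for (Token t = input.next(); t != null; t = input.next()) {
        if (t.termText().length() <= maxTokenLength) {
          // Account for any immense tokens we dropped so phrase/span
          // queries don't see the surviving neighbors as adjacent.
          t.setPositionIncrement(t.getPositionIncrement() + skippedPositions);
          return t;
        }
        skippedPositions += t.getPositionIncrement();
      }
      return null;
    }

So for a stream "a HUGE b" where HUGE gets skipped, "b" ends up with a
position increment of 2, and a phrase query "a b" with slop 0 won't
falsely match.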

Mike
