Doron Cohen wrote:

On Dec 31, 2007 7:54 PM, Michael McCandless <[EMAIL PROTECTED]>
wrote:

I actually think indexing should try to be as robust as possible. You
could test like crazy and never hit a massive term, then go into
production (say, ship your app to lots of your customers' computers)
only to suddenly see this exception. In general it could be a long
time before you or your users "accidentally" hit this.

So I'm thinking we should have the default behavior, in IndexWriter,
be to skip immense terms?

Then people can use a TokenFilter to change this behavior if they want.


+1

OK I will take this approach.
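
For anyone who wants a different limit (or wants to drop the huge
tokens before IndexWriter ever sees them), a user-side filter is
simple enough. Roughly something like this against the 2.x
Token-based TokenStream API -- the class name and the limit are just
placeholders, and position increments are ignored for the moment (see
below):

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Drops any token longer than maxTokenLength; everything else
    // passes through unchanged.
    public class SkipImmenseTermsFilter extends TokenFilter {
      private final int maxTokenLength;

      public SkipImmenseTermsFilter(TokenStream input, int maxTokenLength) {
        super(input);
        this.maxTokenLength = maxTokenLength;
      }

      public Token next() throws IOException {
        for (Token t = input.next(); t != null; t = input.next()) {
          if (t.termText().length() <= maxTokenLength) {
            return t;
          }
          // else: silently drop the immense token
        }
        return null;  // end of stream
      }
    }

Wrapping the analyzer's stream with a filter like this in
Analyzer.tokenStream() keeps the huge token out of indexing entirely.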

At first I saw this as similar to IndexWriter.setMaxFieldLength(), but
that was the wrong comparison: #terms is a "real" indexing/search
characteristic that many applications can benefit from being able
to modify, whereas a huge token is in most cases a bug.
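
Just to illustrate the distinction (a hypothetical fragment; the path
and the limit are made up): setMaxFieldLength() bounds how many terms
get indexed per field, not how long any single term can be:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class MaxFieldLengthExample {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/testindex",
                                             new StandardAnalyzer(), true);
        // Index at most 100k terms per field -- a legitimate
        // per-application tunable.  It says nothing about the length
        // of any single term.
        writer.setMaxFieldLength(100000);
        writer.close();
      }
    }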

Just to make sure about the scenario: the only change is to skip
too-long tokens, while any other exception is still thrown (not
ignored).

Exactly. And, on any exception, we will immediately mark any partially indexed doc as deleted.

Also, for a skipped token I think the position increment of the
following token should be incremented.

Good point; I'll make sure we do.
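
E.g. in the hypothetical filter sketched earlier, next() would become
something like this, so the positions of dropped tokens are carried
forward:

    public Token next() throws IOException {
      int skippedPositions = 0;
      for (Token t = input.next(); t != null; t = input.next()) {
        if (t.termText().length() <= maxTokenLength) {
          // Account for any immense tokens we dropped so phrase/span
          // queries don't see the surviving neighbors as adjacent.
          t.setPositionIncrement(t.getPositionIncrement() + skippedPositions);
          return t;
        }
        skippedPositions += t.getPositionIncrement();
      }
      return null;
    }

So for a stream "a HUGE b" where HUGE gets skipped, "b" ends up with a
position increment of 2, and a phrase query "a b" with slop 0 won't
falsely match.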

Mike
