Doron Cohen wrote:
On Dec 31, 2007 7:54 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
I actually think indexing should try to be as robust as possible. You could test like crazy and never hit a massive term, then go into production (say, ship your app to lots of your customers' computers) only to suddenly see this exception. In general it could be a long time before you (or your users) "accidentally" see this.
So I'm thinking we should have the default behavior, in IndexWriter, be to skip immense terms? Then people can use a TokenFilter to change this behavior if they want.
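For reference, a minimal sketch of what such a filter could look like. It is not the patch discussed in this thread; it uses the attribute-based TokenStream API from later Lucene releases, and the class name SkipImmenseTermsFilter and its length limit are made up for illustration. It also folds a dropped token's position increment into the next token, the point raised further down in the thread:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Drops tokens longer than maxTokenLength instead of letting them reach IndexWriter. */
public final class SkipImmenseTermsFilter extends TokenFilter {

  private final int maxTokenLength;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

  public SkipImmenseTermsFilter(TokenStream input, int maxTokenLength) {
    super(input);
    this.maxTokenLength = maxTokenLength;
  }

  @Override
  public boolean incrementToken() throws IOException {
    int skippedPositions = 0;
    while (input.incrementToken()) {
      if (termAtt.length() <= maxTokenLength) {
        // Fold the position increments of any dropped immense tokens into the
        // next surviving token, so phrase/proximity queries still see a hole.
        posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
        return true;
      }
      skippedPositions += posIncrAtt.getPositionIncrement();
    }
    return false;
  }
}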
+1
OK I will take this approach.
At first I saw this as similar to IndexWriter.setMaxFieldLength(), but that was a wrong comparison, because #terms is a "real" indexing/search characteristic that many applications can benefit from being able to modify, whereas a huge token is in most cases a bug.
Just to make sure on the scenario - the only change is to skip too-long tokens, while any other exception is thrown (not ignored)?
Exactly. And, on any exception, we will immediately mark any partially indexed doc as deleted.
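A caller-side sketch of that behavior, under the same assumptions as above (the helper name and error handling are illustrative only): the exception still reaches the caller, but the half-indexed document is never left visible because IndexWriter marks it deleted:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class AddDocumentSketch {
  /** Hypothetical helper: analysis errors surface here, not as half-indexed docs. */
  static void addOrSkip(IndexWriter writer, Document doc) {
    try {
      writer.addDocument(doc);
    } catch (IOException | RuntimeException e) {
      // The document may have been partially indexed before the exception;
      // IndexWriter marks it as deleted, so searches never see it. The caller
      // only has to decide whether to log, retry, or skip.
      System.err.println("Could not index document: " + e);
    }
  }
}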
Also, for a skipped token I think the position increment of the following token should be incremented.
Good point; I'll make sure we do.
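A quick way to check that, reusing the SkipImmenseTermsFilter sketched above; WhitespaceTokenizer and the attribute classes are from later Lucene releases, and the 10-character limit is arbitrary:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class SkipImmenseTermsDemo {
  public static void main(String[] args) throws Exception {
    WhitespaceTokenizer tok = new WhitespaceTokenizer();
    tok.setReader(new StringReader("aaa bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb ccc"));
    TokenStream ts = new SkipImmenseTermsFilter(tok, 10);  // filter sketched above
    CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncr = ts.getAttribute(PositionIncrementAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // Expected output: "aaa +1" then "ccc +2" -- the dropped over-long token
      // leaves a position hole in front of "ccc".
      System.out.println(term.toString() + " +" + posIncr.getPositionIncrement());
    }
    ts.end();
    ts.close();
  }
}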
Mike