On Dec 31, 2007 7:54 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> I actually think indexing should try to be as robust as possible. You
> could test like crazy and never hit a massive term, go into production
> (say, ship your app to lots of your customer's computers) only to
> suddenly see this exception. In general it could be a long time before
> you "accidentally" our users see this.
>
> So I'm thinking we should have the default behavior, in IndexWriter,
> be to skip immense terms?
>
> Then people can use TokenFilter to change this behavior if they want.

+1

At first I saw this as similar to IndexWriter.setMaxFieldLength(), but
that was the wrong comparison: #terms is a "real" indexing/search
characteristic that many applications can benefit from being able to
modify, whereas a huge token is in most cases a bug.

Just to make sure of the scenario: the only change is to skip tokens
that are too long, while any other exception is still thrown (not
ignored).

Also, for a skipped token I think the position increment of the
following token should be incremented.
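
To make the position-increment point concrete, here is a rough sketch of the behavior I have in mind. It deliberately does not use the Lucene classes: Tok, skipImmense and maxLen are hypothetical names made up for illustration, and the real change would live in IndexWriter or a TokenFilter. The idea is only that a skipped token hands its position increment to the next token that is kept, so phrase and span positions stay where they were.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a plain-Java stand-in, not the Lucene API.
final class SkipLongTokens {

    /** Hypothetical token: its text plus its position increment. */
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Drops tokens longer than maxLen characters. The increments of the
     * dropped tokens are added to the next kept token, so later tokens
     * keep the same absolute positions they had before skipping.
     */
    static List<Tok> skipImmense(List<Tok> in, int maxLen) {
        List<Tok> out = new ArrayList<>();
        int pending = 0; // increments carried over from skipped tokens
        for (Tok t : in) {
            if (t.text.length() > maxLen) {
                pending += t.positionIncrement;
                continue; // skip the immense token instead of failing
            }
            out.add(new Tok(t.text, t.positionIncrement + pending));
            pending = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> in = Arrays.asList(
                new Tok("quick", 1),
                new Tok("xxxxxxxxxxxxxxxxxxxx", 1), // 20 chars, over the limit
                new Tok("fox", 1));
        for (Tok t : skipImmense(in, 10)) {
            System.out.println(t.text + " +" + t.positionIncrement);
        }
    }
}
```

With a limit of 10 characters, "fox" ends up with an increment of 2 rather than 1, so a phrase query for "quick fox" would not match across the gap left by the skipped token.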