I actually think indexing should try to be as robust as possible. You could test like crazy and never hit a massive term, then go into production (say, ship your app to lots of your customers' computers) only to suddenly see this exception. In general it could be a long time before a user "accidentally" hits this.
So I'm thinking we should have the default behavior, in IndexWriter, be to skip immense terms? Then people can use a TokenFilter to change this behavior if they want.

Mike

Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On Dec 31, 2007 12:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> > Sure, but I mean in the >16K (in other words, in the case where
> > DocsWriter fails, which presumably only DocsWriter knows about) case.
> > I want the option to ignore tokens larger than that instead of
> > failing/throwing an exception.
>
> I think the issue here is what the default behavior for IndexWriter should be.
>
> If configuration is required because something other than the default
> is desired, then one could use a TokenFilter to change the behavior
> rather than changing options on IndexWriter. Using a TokenFilter is
> much more flexible.
>
> > Imagine I am charged w/ indexing some data
> > that I don't know anything about (i.e. computer forensics), my goal
> > would be to index as much as possible in my first raw pass, so that I
> > can then begin to explore the dataset. Having it completely discard
> > the document is not a good thing, but throwing away some large binary
> > tokens would be acceptable (especially if I get warnings about said
> > tokens) and robust.
>
> -Yonik
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
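P.S. As a sketch of what the TokenFilter approach could look like — this is standalone Java, not the actual Lucene TokenFilter API, and the class name, limit, and warning behavior here are all hypothetical — a filter that silently drops any term over a fixed length might be:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone illustration of a "skip immense terms" filter.
// A real Lucene TokenFilter would wrap a TokenStream; this just
// filters a List of token strings to show the idea.
public class SkipImmenseTermsFilter {

    // DocsWriter rejects terms over ~16K, so pick a limit at or below that.
    static final int MAX_TERM_LENGTH = 16384;

    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= MAX_TERM_LENGTH) {
                kept.add(token);
            } else {
                // Warn rather than throw, so the rest of the document
                // still gets indexed (the forensics use case above).
                System.err.println("skipping immense term, length="
                        + token.length());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        char[] huge = new char[20000];
        Arrays.fill(huge, 'x');
        List<String> tokens =
                Arrays.asList("normal", new String(huge), "term");
        // The 20,000-char token is dropped; the document survives.
        System.out.println(filter(tokens));
    }
}
```

With skip-by-default in IndexWriter, a filter like this would only be needed by people who want a different cutoff or want the warnings routed somewhere specific.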