Re: Term pollution from binary data

2007-11-13 Thread Michael McCandless
"Chuck Williams" <[EMAIL PROTECTED]> wrote: > Doug Cutting wrote on 11/07/2007 09:26 AM: > > Hadoop's MapFile is similar to Lucene's term index, and supports a > > feature where only a subset of the index entries are loaded > > (determined by io.map.index.skip). It would not be difficult to add

Re: Term pollution from binary data

2007-11-12 Thread Chuck Williams
Doug Cutting wrote on 11/07/2007 09:26 AM: Hadoop's MapFile is similar to Lucene's term index, and supports a feature where only a subset of the index entries are loaded (determined by io.map.index.skip). It would not be difficult to add such a feature to Lucene by changing TermInfosReader#ens

Re: Term pollution from binary data

2007-11-09 Thread Michael McCandless
"Doug Cutting" <[EMAIL PROTECTED]> wrote: > In any case, I think Michael is opting to skip this proposal for now. At least for the time being, yes. I think Lucene doesn't (yet) need this and we should stick with straightforward setters/args-to-constructors for now. Mike ---

Re: Term pollution from binary data

2007-11-09 Thread Doug Cutting
Nicolas Lalevée wrote: And from my point of view as a deep user of the Lucene API, generally I do not like generic property settings because they leave the API undocumented. The Javadoc around the setter and the getter of the property is as useful as: /** * Set a property * * @param prop

Re: Term pollution from binary data

2007-11-09 Thread Nicolas Lalevée
On Thursday 8 November 2007, Michael McCandless wrote: > "Doug Cutting" <[EMAIL PROTECTED]> wrote: > > Aren't indexes loaded lazily? That's an important optimization for > > merging, no? For performance reasons, opening an IndexReader shouldn't > > do much more than open files. However, if we bu

Re: Term pollution from binary data

2007-11-08 Thread Michael McCandless
"Doug Cutting" <[EMAIL PROTECTED]> wrote: > Aren't indexes loaded lazily? That's an important optimization for > merging, no? For performance reasons, opening an IndexReader shouldn't > do much more than open files. However, if we build a more generic > mechanism, we should not rely on that

Re: Term pollution from binary data

2007-11-08 Thread robert engels
I was thinking more along the lines of the Java ImageIO ImageRead/WriteParam stuff. class IndexReaderParam { get/set UseLargeBuffers(); get/set UseReadAhead(); ... etc. other "standard" options; a particular index reader is free to ignore them ... } a custom IndexReader would create a
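
For readers unfamiliar with the ImageIO pattern referred to above, here is a minimal sketch of how such a parameter object might look in Java. Only the name IndexReaderParam and the two "standard" options come from the message. The subclass TermIndexReaderParam and its termIndexDivisor field are hypothetical, added to illustrate how an implementation-specific option could sit beside the generic ones. Neither class is part of Lucene.

```java
// Sketch only: these classes do not exist in Lucene.
// IndexReaderParam and its two options are taken from the message above;
// TermIndexReaderParam is a hypothetical implementation-specific subclass.
class IndexReaderParam {
    private boolean useLargeBuffers;
    private boolean useReadAhead;

    boolean getUseLargeBuffers() { return useLargeBuffers; }
    void setUseLargeBuffers(boolean value) { useLargeBuffers = value; }

    boolean getUseReadAhead() { return useReadAhead; }
    void setUseReadAhead(boolean value) { useReadAhead = value; }
}

class TermIndexReaderParam extends IndexReaderParam {
    // 1 means "load every stored index term" (assumed default).
    private int termIndexDivisor = 1;

    int getTermIndexDivisor() { return termIndexDivisor; }

    void setTermIndexDivisor(int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1");
        }
        termIndexDivisor = divisor;
    }
}
```

The appeal of this shape is that each option keeps its own typed, documented setter, while a reader that does not understand a given option can ignore it.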

Re: Term pollution from binary data

2007-11-08 Thread Doug Cutting
robert engels wrote: I think it would be better to have IndexReaderProperties, and IndexWriterProperties. What methods would these have? The notion of a termIndexDivisor is specific to a particular IndexReader implementation, so probably shouldn't be handled by a generic IndexReaderPropertie

Re: Term pollution from binary data

2007-11-08 Thread robert engels
I think it would be better to have IndexReaderProperties, and IndexWriterProperties. Just seems an easier API for maintenance. It is more logical, as it keeps related items together. On Nov 8, 2007, at 12:04 PM, Doug Cutting wrote: Michael McCandless wrote: One thing is: I'd prefer to no

Re: Term pollution from binary data

2007-11-08 Thread Doug Cutting
Michael McCandless wrote: One thing is: I'd prefer to not use system property for this, since it's so global, but I'm not sure how to better do it. I agree. That was the quick-and-dirty hack. Ideally it should be a method on IndexReader. I can think of two ways to do that: 1. Add a generi

Re: Term pollution from binary data

2007-11-08 Thread Michael McCandless
I like this approach: it means, at search time, you can choose to further subsample the already subsampled (during indexing) set of terms for the TermInfosReader index. So you can easily turn the knob to trade off memory usage vs IO cost/latency during searching. I'll open an issue and work thro
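
A rough illustration of the subsampling idea described above, assuming the index terms stored on disk are already available as an array. The class and method names are hypothetical, and this is not the TermInfosReader code. It only shows the effect of keeping every Nth stored index term at load time.

```java
// Sketch only: hypothetical helper, not Lucene's TermInfosReader.
final class TermIndexSubsampler {
    private TermIndexSubsampler() {}

    /**
     * Keeps every divisor-th entry of the stored index terms, starting with the first.
     * With an indexing-time interval of I, the effective interval becomes I * divisor.
     */
    static String[] subsample(String[] storedIndexTerms, int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1");
        }
        // Round up so a trailing partial group still contributes its first entry.
        int kept = (storedIndexTerms.length + divisor - 1) / divisor;
        String[] result = new String[kept];
        for (int i = 0; i < kept; i++) {
            result[i] = storedIndexTerms[i * divisor];
        }
        return result;
    }
}
```

With five stored index terms and a divisor of 2, three terms stay in memory. A lookup then has to scan up to twice as many on-disk terms after seeking, which is the memory versus IO trade-off mentioned in the message.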

Re: Term pollution from binary data

2007-11-07 Thread Doug Cutting
Chuck Williams wrote: It appears that termIndexInterval is factored into the stored index and thus cannot be changed dynamically to work around the problem after an index has become polluted. Other than identifying the documents containing binary data, deleting them, and then optimizing the wh

Re: Term pollution from binary data

2007-11-06 Thread robert engels
I think the binary section recognizer is probably your best bet. If you write an analyzer that ignores terms that consist of only hexadecimal digits, and contain embedded digits, you will probably reduce the pollution quite a bit, and it is trivial to write, and not too expensive to check.
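
The check suggested above could be sketched as a plain predicate, independent of any Lucene class. The name HexTermHeuristic is hypothetical, and "contain embedded digits" is read here as "contains at least one decimal digit", which is an assumption. Under that reading, purely numeric terms such as years are also flagged, so a real filter would likely need a length threshold or similar refinement.

```java
// Sketch only: hypothetical helper, not part of Lucene.
final class HexTermHeuristic {
    private HexTermHeuristic() {}

    /**
     * Returns true if every character of the term is a hexadecimal digit
     * and at least one character is a decimal digit (0-9).
     */
    static boolean looksLikeHexNoise(String term) {
        if (term.isEmpty()) {
            return false;
        }
        boolean sawDecimalDigit = false;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c >= '0' && c <= '9') {
                sawDecimalDigit = true;
            } else if (!((c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'))) {
                return false; // a non-hex character: treat as a normal term
            }
        }
        return sawDecimalDigit;
    }
}
```

An analyzer would call this once per token and drop the token when it returns true. The cost is a single pass over the characters of each term.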

Term pollution from binary data

2007-11-06 Thread Chuck Williams
Hi All, We are experiencing OOM's when binary data contained in text files (e.g., a base64 section of a text file) is indexed. We have extensive recognition of file types but have encountered binary sections inside of otherwise normal text files. We are using the default value of 128 for te
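
To make the scale of the problem concrete, here is a small back-of-the-envelope sketch. It assumes that roughly one in every termIndexInterval unique terms is held in memory as an index entry. The helper name and the term count of 10,000,000 are made up for illustration and are not figures from this thread.

```java
// Sketch only: hypothetical estimate, not a Lucene API.
final class TermIndexEstimate {
    private TermIndexEstimate() {}

    /** Approximate number of in-memory index entries: ceil(uniqueTerms / interval). */
    static long loadedEntries(long uniqueTerms, int interval) {
        if (uniqueTerms < 0 || interval < 1) {
            throw new IllegalArgumentException("uniqueTerms must be >= 0 and interval >= 1");
        }
        return (uniqueTerms + interval - 1) / interval;
    }

    public static void main(String[] args) {
        long uniqueTerms = 10_000_000L; // illustrative figure only
        System.out.println(loadedEntries(uniqueTerms, 128));  // 78125
        System.out.println(loadedEntries(uniqueTerms, 1024)); // 9766
    }
}
```

Binary sections produce a near-unlimited supply of unique terms, so the number of in-memory entries grows with them unless the interval is raised or the terms are filtered out before indexing.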