"Chuck Williams" <[EMAIL PROTECTED]> wrote:
> Doug Cutting wrote on 11/07/2007 09:26 AM:
> > Hadoop's MapFile is similar to Lucene's term index, and supports a
> > feature where only a subset of the index entries are loaded
> > (determined by io.map.index.skip). It would not be difficult to add
Doug Cutting wrote on 11/07/2007 09:26 AM:
Hadoop's MapFile is similar to Lucene's term index, and supports a
feature where only a subset of the index entries are loaded
(determined by io.map.index.skip). It would not be difficult to add
such a feature to Lucene by changing TermInfosReader#ens
"Doug Cutting" <[EMAIL PROTECTED]> wrote:
> In any case, I think Michael is opting to skip this proposal for now.
At least for the time being, yes. I think Lucene doesn't (yet) need
this and we should stick with straightforward
setters/args-to-constructors for now.
Mike
---
Nicolas Lalevée wrote:
And from my point of view as a heavy user of the Lucene API, I generally do
not like generic property settings, because they leave the API undocumented.
The javadoc around the setter and the getter of such a property is about as
useful as:
/**
* Set a property
*
* @param prop
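To make that objection concrete, here is a hypothetical side-by-side of the two styles (neither class is real Lucene code; termIndexDivisor is borrowed from this thread purely for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Generic style: the javadoc can say nothing useful about any one option.
class GenericStyle {
    private final Map<String, Object> props = new HashMap<>();
    /** Set a property. @param prop the property name */
    void setProperty(String prop, Object value) { props.put(prop, value); }
    Object getProperty(String prop) { return props.get(prop); }
}

// Typed style: each setter documents its own meaning and accepted values.
class TypedStyle {
    private int termIndexDivisor = 1;
    /** Load only 1/n of the term index; larger n saves RAM at some seek cost. */
    void setTermIndexDivisor(int n) { termIndexDivisor = n; }
    int getTermIndexDivisor() { return termIndexDivisor; }
}
```

The generic style also loses compile-time checking: a typo in the property name or a wrong value type only fails at runtime, if at all.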
On Thursday, November 8, 2007, Michael McCandless wrote:
> "Doug Cutting" <[EMAIL PROTECTED]> wrote:
> > Aren't indexes loaded lazily? That's an important optimization for
> > merging, no? For performance reasons, opening an IndexReader shouldn't
> > do much more than open files. However, if we build a more generic
> > mechanism, we should not rely on that
I was thinking more along the lines of the Java ImageIO
ImageReadParam/ImageWriteParam stuff:
class IndexReaderParam {
    get/set UseLargeBuffers();
    get/set UseReadAhead();
    // ... etc. other "standard" options; a particular IndexReader is free
    // to ignore them ...
}
a custom IndexReader would create a
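For illustration, that sketch might be fleshed out like this (a hypothetical rendering of the ImageReadParam-style pattern; only the two option names come from the message above, everything else is assumed):

```java
// Hypothetical param object; a particular IndexReader implementation
// is free to ignore any option it does not understand.
class IndexReaderParam {
    private boolean useLargeBuffers = false;
    private boolean useReadAhead = false;

    public boolean getUseLargeBuffers() { return useLargeBuffers; }
    public void setUseLargeBuffers(boolean b) { useLargeBuffers = b; }

    public boolean getUseReadAhead() { return useReadAhead; }
    public void setUseReadAhead(boolean b) { useReadAhead = b; }
}

// A custom reader could subclass the param object for its own options,
// as ImageIO readers do with ImageReadParam subclasses.
class MyIndexReaderParam extends IndexReaderParam {
    private int termIndexDivisor = 1; // reader-specific option
    public int getTermIndexDivisor() { return termIndexDivisor; }
    public void setTermIndexDivisor(int n) { termIndexDivisor = n; }
}
```

The subclassing step is how reader-specific knobs like termIndexDivisor would avoid polluting the generic param class.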
robert engels wrote:
> I think it would be better to have IndexReaderProperties, and
> IndexWriterProperties.
What methods would these have?
The notion of a termIndexDivisor is specific to a particular IndexReader
implementation, so probably shouldn't be handled by a generic
IndexReaderProperties.
I think it would be better to have IndexReaderProperties, and
IndexWriterProperties.
Just seems an easier API for maintenance. It is more logical, as it
keeps related items together.
On Nov 8, 2007, at 12:04 PM, Doug Cutting wrote:
Michael McCandless wrote:
> One thing is: I'd prefer to not use a system property for this, since
> it's so global, but I'm not sure how to better do it.
I agree. That was the quick-and-dirty hack. Ideally it should be a
method on IndexReader. I can think of two ways to do that:
1. Add a generi
I like this approach: it means, at search time, you can choose to
further subsample the already subsampled (during indexing) set of
terms for the TermInfosReader index. So you can easily turn the
knob to trade off memory usage vs IO cost/latency during searching.
I'll open an issue and work thro
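The knob can be quantified: if the stored index keeps one entry per termIndexInterval terms, and a reader-side divisor loads only every divisor-th of those, then the reader holds roughly numTerms / (interval * divisor) entries in RAM, while a lookup scans at most interval * divisor terms on disk. A hypothetical back-of-the-envelope sketch (not Lucene code):

```java
// Hypothetical arithmetic for the memory-vs-IO trade-off discussed above.
class TermIndexMath {
    // Approximate number of term-index entries held in RAM.
    static long entriesInRam(long numTerms, int interval, int divisor) {
        return numTerms / ((long) interval * divisor);
    }

    // Worst-case number of terms scanned on disk to locate one term.
    static long worstCaseScan(int interval, int divisor) {
        return (long) interval * divisor;
    }
}
```

For example, with 256M unique terms and the default interval of 128, a divisor of 1 keeps ~2M entries in RAM; raising the divisor to 4 cuts that to ~500K, at the cost of scanning up to 512 terms per lookup.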
Chuck Williams wrote:
It appears that termIndexInterval is factored into the stored index and
thus cannot be changed dynamically to work around the problem after an
index has become polluted. Other than identifying the documents
containing binary data, deleting them, and then optimizing the wh
I think the binary section recognizer is probably your best bet.
If you write an analyzer that ignores terms that consist only of
hexadecimal digits, or that contain embedded digits, you will probably
reduce the pollution quite a bit; it is trivial to write, and not
too expensive to check.
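As a sketch of the check being suggested (plain Java rather than an actual Lucene TokenFilter; the exact heuristics, such as what counts as an "embedded" digit, are assumptions):

```java
// Hypothetical predicate an analyzer could use to drop binary-looking terms.
class BinaryTermHeuristic {
    // True if every character is a hexadecimal digit.
    static boolean isAllHex(String term) {
        if (term.isEmpty()) return false;
        for (char c : term.toCharArray()) {
            boolean hex = (c >= '0' && c <= '9')
                    || (c >= 'a' && c <= 'f')
                    || (c >= 'A' && c <= 'F');
            if (!hex) return false;
        }
        return true;
    }

    // True if a digit appears between two letters (an "embedded" digit).
    static boolean hasEmbeddedDigit(String term) {
        for (int i = 1; i < term.length() - 1; i++) {
            if (Character.isDigit(term.charAt(i))
                    && Character.isLetter(term.charAt(i - 1))
                    && Character.isLetter(term.charAt(i + 1))) {
                return true;
            }
        }
        return false;
    }

    static boolean looksBinary(String term) {
        return isAllHex(term) || hasEmbeddedDigit(term);
    }
}
```

A filter like this would also drop a few legitimate terms (e.g. pure numbers, or words like "deadbeef"), which is the usual precision cost of such heuristics.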
Hi All,
We are experiencing OOMs when binary data contained in text files
(e.g., a base64 section of a text file) is indexed. We have extensive
recognition of file types, but have encountered binary sections inside
otherwise normal text files.
We are using the default value of 128 for termIndexInterval