Re: Term pollution from binary data

robert engels Thu, 08 Nov 2007 10:09:41 -0800

I think it would be better to have IndexReaderProperties, and IndexWriterProperties.

Just seems an easier API for maintenance. It is more logical, as it keeps related items together.


On Nov 8, 2007, at 12:04 PM, Doug Cutting wrote:

Michael McCandless wrote:
One thing is: I'd prefer to not use system property for this, since
it's so global, but I'm not sure how to better do it.
I agree. That was the quick-and-dirty hack. Ideally it should be a method on IndexReader. I can think of two ways to do that:
1. Add a generic method like IndexReader#setProperty(String,String).
2. Add a specific method like IndexReader#setTermIndexDivisor(int).
I slightly prefer the former, as it permits various IndexReader implementations to support arbitrary properties, at the expense of being untyped, but that might be overkill. Thoughts?
We can't add a "setIndexDivisor(...)" method because the terms are
already loading (consuming too much ram) during the ctor.
Aren't indexes loaded lazily? That's an important optimization for merging, no? For performance reasons, opening an IndexReader shouldn't do much more than open files. However, if we build a more generic mechanism, we should not rely on that.
What if, instead, we passed down a Properties instance to IndexReader
ctors?  Or alternatively a dedicated class, eg,
"IndexReaderInitParameters"?  The advantage of a dedicated class is
it's strongly typed at compile time, and, you could put things in
there like an optional DeletionPolicy instance as well. I think there
is a growing list of these sorts of "advanced optional parameters
used during init" that could be handled with such an approach?
(I probably should have read your entire message before starting to respond... But it's nice to see that we think alike!) This is similar to my (2) approach, but attempts to solve the typing issue, although I'm not sure how...
The way we handle it in Hadoop is to pass around a <String,String> map in the abstract kernel, then have concrete implementation classes provide static methods that access it. So this might look something like:
public class LuceneProperties extends Properties {
  // utility methods to handle conversion of values to and from Strings
  void setInt(String prop, int value);
  int getInt(String prop);
  void setClass(String prop, Class value);
  Class getClass(String prop);
  Object newInstance(String prop);
  ...
}

public class SegmentReaderProperties {
  private static final String DIVISOR_PROP =
    "org.apache.lucene.index.SegmentReader.divisor";
  public static void setTermIndexDivisor(LuceneProperties props, int i) {
    props.setInt(DIVISOR_PROP, i);
  }
}
Then the IndexReader constructor methods could accept a LuceneProperties. No point in making this IndexReader specific, since it might be useful for, e.g., IndexWriter, Searchers, Directories, etc.
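
[Editorial sketch, not part of the original message.] The outline above leaves the method bodies out. One way they might be filled in is shown below, purely as an illustration. Everything beyond the names in the message is an assumption: the default argument on getInt, the getTermIndexDivisor accessor, the default divisor of 1, and the PropertiesDemo class are all hypothetical, and none of this is actual Lucene API.

```java
import java.util.Properties;

// Hypothetical: typed helpers layered over a String-to-String property map.
class LuceneProperties extends Properties {
  void setInt(String prop, int value) {
    setProperty(prop, Integer.toString(value));
  }

  // Assumption: callers supply a default for properties that were never set.
  int getInt(String prop, int defaultValue) {
    String v = getProperty(prop);
    return v == null ? defaultValue : Integer.parseInt(v);
  }

  // Classes are stored by name so the map stays String-to-String.
  void setClass(String prop, Class<?> value) {
    setProperty(prop, value.getName());
  }

  Class<?> getClass(String prop) throws ClassNotFoundException {
    String name = getProperty(prop);
    return name == null ? null : Class.forName(name);
  }

  // Instantiates the named class through its no-argument constructor.
  Object newInstance(String prop) throws ReflectiveOperationException {
    Class<?> c = getClass(prop);
    return c == null ? null : c.getDeclaredConstructor().newInstance();
  }
}

// Typed, per-component accessors hide the raw property key from callers.
class SegmentReaderProperties {
  private static final String DIVISOR_PROP =
      "org.apache.lucene.index.SegmentReader.divisor";

  static void setTermIndexDivisor(LuceneProperties props, int i) {
    props.setInt(DIVISOR_PROP, i);
  }

  // Assumption: a divisor of 1 is used when the property is absent.
  static int getTermIndexDivisor(LuceneProperties props) {
    return props.getInt(DIVISOR_PROP, 1);
  }
}

class PropertiesDemo {
  public static void main(String[] args) {
    LuceneProperties props = new LuceneProperties();
    SegmentReaderProperties.setTermIndexDivisor(props, 4);
    System.out.println(SegmentReaderProperties.getTermIndexDivisor(props));
  }
}
```

With this shape, the compile-time typing lives in the static accessors while the underlying map remains untyped strings, which is the trade-off the message describes.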
An advantage of a <String,String> map over a <String,Object> map for Hadoop is that it's trivial to serialize.
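
[Editorial sketch, not part of the original message.] To illustrate the serialization point, a string-valued java.util.Properties can be written out and read back with the standard store and load methods. The PropertiesRoundTrip class and its method names are hypothetical; this is not how Hadoop serializes its configuration, only a minimal demonstration that a String-to-String map needs no custom value encoding.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Properties;

class PropertiesRoundTrip {
  // Writes the map in the standard key=value text format.
  static String serialize(Properties props) throws IOException {
    StringWriter out = new StringWriter();
    props.store(out, null); // null: no leading comment line
    return out.toString();
  }

  // Parses the text format back into a fresh Properties instance.
  static Properties deserialize(String text) throws IOException {
    Properties props = new Properties();
    props.load(new StringReader(text));
    return props;
  }

  public static void main(String[] args) throws IOException {
    Properties original = new Properties();
    original.setProperty("org.apache.lucene.index.SegmentReader.divisor", "4");
    Properties copy = deserialize(serialize(original));
    System.out.println(copy.equals(original));
  }
}
```

A <String,Object> map would instead need a per-type encoding for every value it might hold.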
Is this what you had in mind?

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


