Re: Term pollution from binary data

Michael McCandless Thu, 08 Nov 2007 12:15:35 -0800

"Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Aren't indexes loaded lazily?  That's an important optimization for 
> merging, no?  For performance reasons, opening an IndexReader shouldn't 
> do much more than open files.  However, if we build a more generic 
> mechanism, we should not rely on that.


Whoops, you are right!  So in this case we could wait until after the
ctor to set the property.  I will take that approach for this, then, so
we can decouple it from the "generic properties" discussion.  I think,
also, I will throw an IllegalStateException if you try to set this
after the index has already been loaded.
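
A minimal sketch of that guard, assuming a hypothetical
LazyTermIndexReader class (not actual Lucene API) that pretends the
on-disk term index has 128 entries and keeps every Nth one on first use:

```java
// Hypothetical stand-in for a reader whose term index is loaded lazily.
// None of these names are actual Lucene API.
class LazyTermIndexReader {
    private static final int TOTAL_INDEX_TERMS = 128; // pretend on-disk size
    private int termIndexDivisor = 1;
    private int[] termIndex; // stays null until the first lookup

    // Allowed only before the index has been loaded.
    void setTermIndexDivisor(int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1: " + divisor);
        }
        if (termIndex != null) {
            throw new IllegalStateException("term index is already loaded");
        }
        termIndexDivisor = divisor;
    }

    // Triggers the lazy load and reports how many index terms were kept.
    int loadedIndexSize() {
        ensureIndexLoaded();
        return termIndex.length;
    }

    private void ensureIndexLoaded() {
        if (termIndex == null) {
            // Keep every termIndexDivisor-th term, rounding up.
            int kept = (TOTAL_INDEX_TERMS + termIndexDivisor - 1) / termIndexDivisor;
            termIndex = new int[kept];
            for (int i = 0; i < kept; i++) {
                termIndex[i] = i * termIndexDivisor; // position of each kept term
            }
        }
    }
}
```

Since opening the reader does no loading, the setter works any time
between the ctor and the first lookup, and fails loudly afterwards
instead of being silently ignored.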

For other things, eg the DeletionPolicy instance & lock timeout for
IndexWriter, and infoStream for both IndexWriter & IndexReader, we
need to use them in the ctor but we don't want to explode the number
of ctors.  Eg we now have setDefaultLockTimeout/setDefaultInfoStream
which we could deprecate if we can set these in generic properties
instead.

> > What if, instead, we passed down a Properties instance to IndexReader
> > ctors?  Or alternatively a dedicated class, eg,
> > "IndexReaderInitParameters"?  The advantage of a dedicated class is
> > it's strongly typed at compile time, and, you could put things in
> > there like an optional DeletionPolicy instance as well.  I think there
> > are a growing list of these sorts of "advanced optional parameters
> > used during init" that could be handled with such an approach?
> 
> (I probably should have read your entire message before starting to 
> respond...  But it's nice to see that we think alike!)

That is nice!

> This is similar to my (2) approach, but attempts to solve the typing
> issue, although I'm not sure how...
>
> The way we handle it in Hadoop is to pass around a <String,String> map 
> in the abstract kernel, then have concrete implementation classes 
> provide static methods that access it.  So this might look something
> like:
> 
> public class LuceneProperties extends Properties {
>    // utility methods to handle conversion of values to and from Strings
>    void setInt(String prop, int value);
>    int getInt(String prop);
>    void setClass(String prop, Class value);
>    Class getClass(String prop);
>    Object newInstance(String prop);
>    ...
> }
> 
> public class SegmentReaderProperties {
>    private static final String DIVISOR_PROP =
>      "org.apache.lucene.index.SegmentReader.divisor";
>    public static void setTermIndexDivisor(LuceneProperties props, int i) {
>      props.setInt(DIVISOR_PROP, i);
>    }
> }
> 
> Then the IndexReader constructor methods could accept a 
> LuceneProperties.  No point in making this IndexReader specific, since 
> it might be useful for, e.g., IndexWriter, Searchers, Directories, etc.
> 
> An advantage of a <String,String> map over a <String,Object> map for 
> Hadoop is that it's trivial to serialize.
> 
> Is this what you had in mind?

I like that approach!  I think I'd prefer <String,Object> so we could
put InfoStream, DeletionPolicy and other class instances in there?
(Without requiring that they have zero-arg ctors).  Unless there would
be some reason for Lucene to also need serialization?
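
A rough sketch of the <String,Object> variant, keeping the typed static
accessors from the quoted example.  ObjectProperties,
IndexWriterProperties, both property names and the 1000 ms default are
assumptions made up for illustration, not existing Lucene classes or
settings:

```java
import java.io.PrintStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical property bag holding live objects rather than Strings.
class ObjectProperties {
    private final Map<String, Object> values = new HashMap<String, Object>();

    void set(String key, Object value) {
        values.put(key, value);
    }

    // Returns the stored value cast to the requested type, or the default
    // when the key is absent.  Throws ClassCastException on a type mismatch.
    <T> T get(String key, Class<T> type, T defaultValue) {
        Object v = values.get(key);
        return v == null ? defaultValue : type.cast(v);
    }
}

// Typed static accessors, so callers never touch raw keys or casts.
class IndexWriterProperties {
    // Hypothetical property names.
    private static final String INFO_STREAM_PROP =
        "org.apache.lucene.index.IndexWriter.infoStream";
    private static final String LOCK_TIMEOUT_PROP =
        "org.apache.lucene.index.IndexWriter.lockTimeout";

    static void setInfoStream(ObjectProperties props, PrintStream stream) {
        props.set(INFO_STREAM_PROP, stream);
    }

    static PrintStream getInfoStream(ObjectProperties props) {
        return props.get(INFO_STREAM_PROP, PrintStream.class, null);
    }

    static void setLockTimeout(ObjectProperties props, long millis) {
        props.set(LOCK_TIMEOUT_PROP, Long.valueOf(millis));
    }

    static long getLockTimeout(ObjectProperties props) {
        // Assumed default of 1000 ms, for illustration only.
        return props.get(LOCK_TIMEOUT_PROP, Long.class, Long.valueOf(1000L));
    }
}
```

The instance goes in as-is, so nothing needs a zero-arg ctor; the cost
is that such a map is no longer trivial to serialize.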

(Actually, for infoStream I think eventually we should switch to a
logging framework).

Hmmm, one wrinkle: when would we "look at" a property?  I guess it's
per-property.  EG infoStream we could "look at" every time we needed
to print something to it.  But eg say we have "deletionPolicy" in
there, and you suddenly change it in your properties, then, when are
we supposed to notice that and re-init it?  That is a downside vs
putting set/get on the class directly because with set/get the class
obviously knows when the property is being changed.
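
To make the wrinkle concrete, here is a small sketch under stated
assumptions: PolicyHolder is a hypothetical class, and plain Strings
stand in for DeletionPolicy instances.  A value read once at init goes
stale when the shared map changes, while an explicit setter gives the
class a chance to react:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder; String policy names stand in for real instances.
class PolicyHolder {
    private final Map<String, Object> props;
    private String policyAtInit; // snapshot taken in the ctor
    int reinitCount = 0;         // how often the class noticed a change

    PolicyHolder(Map<String, Object> props) {
        this.props = props;
        this.policyAtInit = (String) props.get("deletionPolicy");
    }

    // Value captured at construction; blind to later map changes.
    String policyFromInit() {
        return policyAtInit;
    }

    // Re-reads the map on every call, like checking infoStream per print.
    String policyReadEachTime() {
        return (String) props.get("deletionPolicy");
    }

    // Explicit setter: the class knows exactly when the value changes.
    void setDeletionPolicy(String policy) {
        policyAtInit = policy;
        reinitCount++;
    }
}
```

Changing the map after construction leaves policyFromInit() on the old
value with reinitCount still at zero; only the setter path lets the
class re-init at a well-defined moment.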

OK, I'm no longer sure this is [yet] necessary for Lucene!  What
"properties" would we actually want to put here and NOT in the ctors
or set/gets on the class itself?  It feels like a vanishing set.

Mike
