Pluggable IndexReader (was 2.9/3.0 plan & Java 1.5)

Marvin Humphrey Fri, 12 Dec 2008 16:01:14 -0800

Doug Cutting:

> Folks are discussing whether generics are a special case for 
> back-compatibility.  This is an important discussion, since major 
> releases are defined by their back-compatibility.  This discussion thus 
> should have priority over the discussion of new 3.0 features.


Okeedoke.  Since I'm working on this right now for KS, though, I'd like to
continue the conversation under a new thread heading.

I have a bunch of file format changes to push through, and I'm hoping to
implement them using pluggable modules.  For instance, I'd like to be able to
swap out bit-vector-based deletions for tombstone-based deletions, just by
overriding a method or two.

Jason Rutherglen:

> Decoupling IndexReader for 3.0 would be great.  This includes making
> public SegmentReader, MultiSegmentReader.

I definitely think that IndexReader can and should be made more pluggable.  Is
exposing per-segment sub-readers a definite win, though?  Does it make sense
to leave open the door to index components which don't operate on segments?
Or even to eliminate SegmentReader entirely and have sub-components of
IndexReader manage collation?

I've been thinking about this with regard to tombstone-based deletions, where
you can't know everything about a segment unless you've opened up other
segments.
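
A minimal sketch of that dependency, assuming a model where a tombstone is
written into a later segment and names a document in an earlier one.  The
classes Tombstone, Segment and TombstoneResolver are hypothetical and exist
only for illustration; they are not KS or Lucene API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical: a tombstone records that one doc in some segment is deleted.
final class Tombstone {
    final String targetSegment;
    final int docId;

    Tombstone(String targetSegment, int docId) {
        this.targetSegment = targetSegment;
        this.docId = docId;
    }
}

// Hypothetical: a segment carries tombstones aimed at other segments.
final class Segment {
    final String name;
    final int docCount;
    final List<Tombstone> tombstones = new ArrayList<>();

    Segment(String name, int docCount) {
        this.name = name;
        this.docCount = docCount;
    }
}

final class TombstoneResolver {
    // Collect deletions for `target` by scanning every open segment's
    // tombstones.  The result is only complete if all segments are open.
    static BitSet deletedDocs(Segment target, List<Segment> openSegments) {
        BitSet deleted = new BitSet(target.docCount);
        for (Segment seg : openSegments) {
            for (Tombstone t : seg.tombstones) {
                if (t.targetSegment.equals(target.name)) {
                    deleted.set(t.docId);
                }
            }
        }
        return deleted;
    }
}

class TombstoneDemo {
    public static void main(String[] args) {
        Segment seg1 = new Segment("seg_1", 5);
        Segment seg2 = new Segment("seg_2", 3);
        // Deletions against seg_1 live in seg_2.
        seg2.tombstones.add(new Tombstone("seg_1", 1));
        seg2.tombstones.add(new Tombstone("seg_1", 3));

        List<Segment> onlyFirst = new ArrayList<>();
        onlyFirst.add(seg1);
        List<Segment> both = new ArrayList<>(onlyFirst);
        both.add(seg2);

        // With only seg_1 open, its deletions look empty.
        System.out.println(TombstoneResolver.deletedDocs(seg1, onlyFirst));
        // With seg_2 open as well, the two deletions appear.
        System.out.println(TombstoneResolver.deletedDocs(seg1, both));
    }
}
```

Under this assumed layout, a reader that sees seg_1 in isolation cannot report
its deletions correctly, which is why a strictly per-segment reader is an
awkward fit.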

> A constructor like new SegmentReader(TermsDictionary termDictionary,
> TermPostings termPostings, ColumnStrideFields csd, DocIdBitSet deletedDocs);

You end up with a proliferation of constructors that way.  Term vectors?
Arbitrary auxiliary components such as an R-tree component supporting
geographic search?

My original proposal to clean this up involved an "IndexComponent" class.
However, when I started implementing it, I ended up with a slew of new classes
with only two factory methods each.

We could possibly move those factory methods up into Schema, but I'm reluctant
to dirty it up, since it's a major public class in KS (as I anticipate it will
be in Lucy) and major public classes should be as simple as possible.

So, how about an IndexArchitecture or IndexPlan class?

  class MyArchitecture extends IndexArchitecture {
    public PostingsWriter PostingsWriter() {
      return new PForDeltaPostingsWriter();
    }
    public PostingsReader PostingsReader() {
      return new PForDeltaPostingsReader();
    }
    public DeletionsWriter DeletionsWriter() {
      return new TombstoneWriter();
    }
    public DeletionsReader DeletionsReader() {
      return new TombstoneReader();
    }
  }

Lucene:

  IndexWriter writer = new IndexWriter("/path/to/index", 
    new StandardAnalyzer(), new MyArchitecture());

Lucy with Java bindings:

  class MySchema extends Schema {
    public MySchema() {
      initField("title", "text");
      initField("content", "text");
    }
    public IndexArchitecture indexArchitecture() { 
      return new MyArchitecture(); 
    }
    public Analyzer analyzer() { 
      return new PolyAnalyzer("en"); 
    }
  }

  IndexWriter writer = new IndexWriter(MySchema.open("/path/to/index"));
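
To make the override mechanics concrete, here is a self-contained sketch of
the pattern, assuming a base IndexArchitecture that supplies default
components and a subclass that swaps out only deletions.  Everything in it is
hypothetical: the format() method, the default names, TombstoneArchitecture
and SketchIndexWriter are stand-ins, and the factory methods use lower
camelCase rather than the capitalized names in the example above.

```java
// Hypothetical component interfaces; format() just labels the implementation.
interface PostingsWriter {
    String format();
}

interface DeletionsWriter {
    String format();
}

// Base class: one factory method per pluggable component, each with a default.
class IndexArchitecture {
    public PostingsWriter postingsWriter() {
        return () -> "default-postings";
    }

    public DeletionsWriter deletionsWriter() {
        return () -> "bitvector";
    }
}

// Swap bit-vector deletions for tombstones by overriding a single method.
class TombstoneArchitecture extends IndexArchitecture {
    @Override
    public DeletionsWriter deletionsWriter() {
        return () -> "tombstone";
    }
}

// Stand-in for a writer that asks the architecture for its components
// instead of taking each one as a constructor argument.
class SketchIndexWriter {
    private final PostingsWriter postings;
    private final DeletionsWriter deletions;

    SketchIndexWriter(IndexArchitecture arch) {
        this.postings = arch.postingsWriter();
        this.deletions = arch.deletionsWriter();
    }

    String describe() {
        return postings.format() + "+" + deletions.format();
    }
}

class ArchitectureDemo {
    public static void main(String[] args) {
        System.out.println(new SketchIndexWriter(new IndexArchitecture()).describe());
        System.out.println(new SketchIndexWriter(new TombstoneArchitecture()).describe());
    }
}
```

Adding a new component type (term vectors, an R-tree) would then mean one more
factory method on the base class with a default, rather than another
constructor signature.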

> Decouple rollback, commit, IndexDeletionPolicy from DirectoryIndexReader
> into a class like SegmentsVersionSystem which could act as the controller
> for reopen types of methods.  There could be a SegmentVersionSystem that
> manages the versioning of a single segment.

I like it. :)

Sometimes you want to change up the merge policy for different writers against
the same index.  How does that fit into your plan?

My thought is that merge-policies would be application-specific rather than
index-specific.
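
As a rough illustration of that split, the sketch below passes the merge
policy to each writer rather than storing it with the index, so two writers
against the same path can plan merges differently.  MergePolicy here is a
hypothetical interface, not Lucene's class of the same name, and PolicyWriter,
MergeSmallSegments and MergeEverything are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical: given per-segment doc counts, pick segment positions to merge.
interface MergePolicy {
    List<Integer> selectForMerge(List<Integer> segmentDocCounts);
}

// Merge only segments smaller than a threshold.
final class MergeSmallSegments implements MergePolicy {
    private final int threshold;

    MergeSmallSegments(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public List<Integer> selectForMerge(List<Integer> segmentDocCounts) {
        List<Integer> picked = new ArrayList<>();
        for (int i = 0; i < segmentDocCounts.size(); i++) {
            if (segmentDocCounts.get(i) < threshold) {
                picked.add(i);
            }
        }
        return picked;
    }
}

// Merge every segment, as an optimize-style pass might.
final class MergeEverything implements MergePolicy {
    @Override
    public List<Integer> selectForMerge(List<Integer> segmentDocCounts) {
        List<Integer> picked = new ArrayList<>();
        for (int i = 0; i < segmentDocCounts.size(); i++) {
            picked.add(i);
        }
        return picked;
    }
}

// The policy belongs to the writer (the application), not to the index path.
final class PolicyWriter {
    private final String indexPath;
    private final MergePolicy policy;

    PolicyWriter(String indexPath, MergePolicy policy) {
        this.indexPath = indexPath;
        this.policy = policy;
    }

    String indexPath() {
        return indexPath;
    }

    List<Integer> planMerge(List<Integer> segmentDocCounts) {
        return policy.selectForMerge(segmentDocCounts);
    }
}
```

A nightly batch process and an interactive updater could then share one index
while each constructs its PolicyWriter with a different policy.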

Marvin Humphrey


