On Apr 10, 2008, at 3:10 AM, Michael McCandless wrote:
> Can't you compartmentalize while still serializing skip data into
> the single frq/prx file?
Yes, that's possible.
The way KS is set up right now, PostingList objects maintain i/o
state, and Posting's Read_Record() method just deals with whatever
instream gets passed to it. If the PostingList were to sneak in the
reading of a skip packet, the Posting would be none the wiser.
This is analogous to how videos are encoded. E.g. the AVI file format
is a "container" format, and it contains packets of video and packets
of audio, interleaved at the right rate so a player can play both in
sync. The "container" has no idea how to decode the audio and video
packets. Separate codecs do that.
Taking this back to Lucene, there's a container format that, using
TermInfo, knows where the frq/prx data (packet) is and where the skip
data (packet) is. And it calls on separate decoders to decode each.
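To make that concrete, here's a minimal Python sketch of the container/decoder split. All names (Container, PostingDecoder, SkipDecoder) and the packet layout are invented for illustration -- this is not KS or Lucene code:

```python
import io
import struct

class PostingDecoder:
    """Decodes a posting packet; knows nothing about the file layout."""
    def decode(self, stream):
        # Toy frq-style record: a fixed u32 freq (real formats use varints).
        (freq,) = struct.unpack("<I", stream.read(4))
        return {"freq": freq}

class SkipDecoder:
    """Decodes a skip packet; also layout-ignorant."""
    def decode(self, stream):
        doc_num, file_ptr = struct.unpack("<II", stream.read(8))
        return {"skip_doc": doc_num, "skip_ptr": file_ptr}

class Container:
    """Plays the TermInfo role: knows where each packet lives,
    and delegates the actual decoding to per-packet decoders."""
    def __init__(self, data, packets):
        # packets: list of (kind, offset, length) triples
        self.data = data
        self.packets = packets
        self.decoders = {"posting": PostingDecoder(), "skip": SkipDecoder()}

    def read_all(self):
        out = []
        for kind, offset, length in self.packets:
            # Hand each decoder a bounded stream: an overrun shows up
            # as EOF here rather than as lost sync downstream.
            stream = io.BytesIO(self.data[offset:offset + length])
            out.append(self.decoders[kind].decode(stream))
        return out
```

The bounded per-packet stream is the point: a decoder that reads too far hits EOF immediately, instead of silently consuming the next packet's bytes.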
This is an intriguing proposal. :)
The dev branch of KS currently uses oodles of per-segment files for
the lexicon and the postings:
* One postings file per field per segment. [SEGNAME-FIELDNUM.p]
* One lexicon file per field per segment. [SEGNAME-FIELDNUM.lex]
* One lexicon index file per field per segment. [SEGNAME-FIELDNUM.lexx]
Having so many files is something of a drawback, but it means that
each individual file can be very specialized, and that yields numerous
benefits:
* Each file has a simple format.
* File format spec easier to write and understand.
* Formats are pluggable.
    o Easy to deprecate.
    o Easy to isolate within a single class.
* PostingList objects are always single-field.
    o Simplified internals.
        * No field numbers to track.
        * Repeat one read operation to scan the whole file.
    o Pluggable using subclasses of Posting.
    o Fewer subclasses (e.g. SegmentTermPositions is not needed).
* Lexicon objects are always single-field.
    o Simplified internals.
        * No field numbers to track.
        * Repeat one read operation to scan the whole file.
    o Possible to extend with custom per-field sorting at index-time.
    o Easier to extend to non-text terms.
        * Comparison ops guaranteed to see like objects.
* Stream-related errors are comparatively easy to track down.
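A tiny helper makes the naming scheme above concrete. The .p/.lex/.lexx suffixes come from the list of dev-branch files earlier in this message; the function itself is invented for illustration:

```python
def postings_files(seg_name, field_num):
    """Map one field of one segment to its three specialized files,
    following the SEGNAME-FIELDNUM naming scheme."""
    base = "%s-%d" % (seg_name, field_num)
    return {
        "postings": base + ".p",          # posting data, this field only
        "lexicon": base + ".lex",         # term dictionary, this field only
        "lexicon_index": base + ".lexx",  # index into the lexicon
    }
```

Because the field is baked into the filename, no reader ever has to track field-number state -- each stream is single-field by construction.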
Some of these benefits are preserved when reading from a single
stream. However, there are some downsides:
* Container classes like PostingList more complex.
    o No longer single-field.
    o Harder to detect overruns that would have been EOF errors.
    o Easier to lose stream sync.
    o Periodic sampling for index records more complex.
        * Tricky to prevent inappropriate compareTo ops at boundaries.
* Harder to troubleshoot.
    o Glitch in one plugin can manifest as an error somewhere else.
    o Hexdump nearly impossible to interpret.
    o Mentally taxing to follow like packets in an interleaved stream.
* File corruption harder to recover from.
    o Only as reliable as the weakest plugin.
Benefits of the single stream include:
* Fewer hard disk seeks.
* Far fewer files.
If you're using Lucene's non-compound file format, having far fewer
files could be a significant benefit depending on the OS. But here's
the thing:
If you're using a custom virtual file system a la Lucene's compound
files, what's the difference between divvying up data using filenames
within the CompoundFileReader object, and divvying up data downstream
in some other object using some ad hoc mechanism?
My conclusion was that it was better to exploit the benefits of
bounded, single-purpose streams and simple file formats whenever
possible.
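For reference, here's a sketch of the virtual-file-system idea, in the spirit of Lucene's CompoundFileReader but with the directory structure simplified -- this is not the real on-disk format. Many logical "files" share one physical blob, and the directory maps each name to an (offset, length) slice:

```python
import io

class CompoundFileReader:
    """Toy compound file: one physical blob, many named logical files."""
    def __init__(self, blob, directory):
        # directory: {"_1-0.p": (offset, length), ...}
        self.blob = blob
        self.directory = directory

    def open_input(self, name):
        offset, length = self.directory[name]
        # Each caller gets a bounded slice, so the "simple,
        # single-purpose stream" property survives even though
        # everything lives in one physical file.
        return io.BytesIO(self.blob[offset:offset + length])
```

Seen this way, the filename-keyed directory is just another mechanism for divvying up one stream into bounded sub-streams -- which is the point of the question above.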
There's also a middle way, where each *format* gets its own file.
Then you wind up with fewer files, but you have to track field number
state.
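The field-number bookkeeping the middle way requires might look like this sketch, where one per-format file holds records for all fields and each record carries its field number. The record layout and function names are invented for illustration:

```python
import struct

def write_per_format(records):
    """Serialize (field_num, payload) pairs into one shared file.
    Every record must be tagged with its field number."""
    out = b""
    for field_num, payload in records:
        out += struct.pack("<II", field_num, len(payload)) + payload
    return out

def read_per_format(data):
    """Scan the shared file, yielding (field_num, payload) and
    tracking field-number state as we go."""
    pos = 0
    while pos < len(data):
        field_num, length = struct.unpack_from("<II", data, pos)
        pos += 8
        yield field_num, data[pos:pos + length]
        pos += length
```

The reader works, but it has reacquired exactly the state the per-field files avoided: every consumer must check the field number on every record.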
The nice thing is that packet-scoped plugins can be compatible with
ALL of these configurations:
> This way we can decouple the question of "how many files do I store
> my things in" from "how is each thing encoded/decoded". Maybe I want
> frq/prx/skip all in one file, or maybe I want them in 3 different
> files.
Well said.
>> The second problem is how to share a term dictionary over a cluster.
>> It would be nice to be able to plug modules into IndexReader that
>> represent clusters of machines but that are dedicated to specific
>> tasks: one cluster could be dedicated to fetching full documents and
>> applying highlighting; another cluster could be dedicated to
>> scanning through postings and finding/scoring hits; a third cluster
>> could store the entire term dictionary in RAM.
>>
>> A centralized term dictionary held in RAM would be particularly
>> handy for sorting purposes. The problem is that the file pointers of
>> a term dictionary are specific to indexes on individual machines. A
>> shared dictionary in RAM would have to contain pointers for *all*
>> clients, which isn't really workable.
>>
>> So, just how do you go about assembling task-specific clusters? The
>> stored documents cluster is easy, but the term dictionary and the
>> postings are hard.
>
> Phew! This is way beyond what I'm trying to solve now :)
Hmm. It doesn't look that difficult from my perspective. The problem
seems reasonably well isolated and contained. But I've worked hard to
make KS modular, so perhaps there's less distance left to travel.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/