On Apr 10, 2008, at 3:10 AM, Michael McCandless wrote:
> Can't you compartmentalize while still serializing skip data into
> the single frq/prx file?
Yes, that's possible.
The way KS is set up right now, PostingList objects maintain i/o
state, and Posting's Read_Record() method just deals with whatever
instream gets passed to it. If the PostingList were to sneak in the
reading of a skip packet, the Posting would be none the wiser.
This is analogous to how videos are encoded. E.g. the AVI file format
is a "container" format, and it contains packets of video and packets
of audio, interleaved at the right rate so a player can play both in
sync. The "container" has no idea how to decode the audio and video
packets. Separate codecs do that.
Taking this back to Lucene, there's a container format that, using
TermInfo, knows where the frq/prx data (packet) is and where the skip
data (packet) is. And it calls on separate decoders to decode each.
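To make that concrete, here's a minimal Python sketch of the container/decoder split. All names (Container, PostingDecoder, SkipDecoder) and the packet layout are invented for illustration -- this is not KS or Lucene code:

```python
import io
import struct

class PostingDecoder:
    """Decodes a posting packet; knows nothing about the file layout."""
    def decode(self, stream):
        # Toy frq-style record: a fixed u32 freq (real formats use varints).
        (freq,) = struct.unpack("<I", stream.read(4))
        return {"freq": freq}

class SkipDecoder:
    """Decodes a skip packet; also layout-ignorant."""
    def decode(self, stream):
        doc_num, file_ptr = struct.unpack("<II", stream.read(8))
        return {"skip_doc": doc_num, "skip_ptr": file_ptr}

class Container:
    """Plays the TermInfo role: knows where each packet lives,
    and delegates the actual decoding to per-packet decoders."""
    def __init__(self, data, packets):
        # packets: list of (kind, offset, length) triples
        self.data = data
        self.packets = packets
        self.decoders = {"posting": PostingDecoder(), "skip": SkipDecoder()}

    def read_all(self):
        out = []
        for kind, offset, length in self.packets:
            # Hand each decoder a bounded stream: an overrun shows up
            # as EOF here rather than as lost sync downstream.
            stream = io.BytesIO(self.data[offset:offset + length])
            out.append(self.decoders[kind].decode(stream))
        return out
```

The bounded per-packet stream is the point: a decoder that reads too far hits EOF immediately, instead of silently consuming the next packet's bytes.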
This is an intriguing proposal. :)
The dev branch of KS currently uses oodles of per-segment files for
the lexicon and the postings:
* One postings file per field per segment. [SEGNAME-FIELDNUM.p]
* One lexicon file per field per segment. [SEGNAME-FIELDNUM.lex]
* One lexicon index file per field per segment. [SEGNAME-FIELDNUM.lexx]
Having so many files is something of a drawback, but it means that
each individual file can be very specialized, and that yields numerous
benefits:
* Each file has a simple format.
* File format spec easier to write and understand.
* Formats are pluggable.
    o Easy to deprecate.
    o Easy to isolate within a single class.
* PostingList objects are always single-field.
    o Simplified internals.
        * No field numbers to track.
        * Repeat one read operation to scan the whole file.
    o Pluggable using subclasses of Posting.
    o Fewer subclasses (e.g. SegmentTermPositions is not needed).
* Lexicon objects are always single-field.
    o Simplified internals.
        * No field numbers to track.
        * Repeat one read operation to scan the whole file.
    o Possible to extend with custom per-field sorting at index-time.
    o Easier to extend to non-text terms.
        * Comparison ops guaranteed to see like objects.
* Stream-related errors are comparatively easy to track down.
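A tiny helper makes the naming scheme above concrete. The .p/.lex/.lexx suffixes come from the list of dev-branch files earlier in this message; the function itself is invented for illustration:

```python
def postings_files(seg_name, field_num):
    """Map one field of one segment to its three specialized files,
    following the SEGNAME-FIELDNUM naming scheme."""
    base = "%s-%d" % (seg_name, field_num)
    return {
        "postings": base + ".p",          # posting data, this field only
        "lexicon": base + ".lex",         # term dictionary, this field only
        "lexicon_index": base + ".lexx",  # index into the lexicon
    }
```

Because the field is baked into the filename, no reader ever has to track field-number state -- each stream is single-field by construction.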
Some of these benefits are preserved when reading from a single
stream. However, there are some downsides:
* Container classes like PostingList more complex.
    o No longer single-field.
    o Harder to detect overruns that would have been EOF errors.
    o Easier to lose stream sync.
    o Periodic sampling for index records more complex.
        * Tricky to prevent inappropriate compareTo ops at boundaries.
* Harder to troubleshoot.
    o Glitch in one plugin can manifest as an error somewhere else.
    o Hexdump nearly impossible to interpret.
    o Mentally taxing to follow like packets in an interleaved stream.
* File corruption harder to recover from.
    o Only as reliable as the weakest plugin.
Benefits of the single stream include:
* Fewer hard disk seeks.
* Far fewer files.
If you're using Lucene's non-compound file format, having far fewer
files could be a significant benefit depending on the OS. But here's
the thing:
If you're using a custom virtual file system a la Lucene's compound
files, what's the difference between divvying up data using filenames
within the CompoundFileReader object, and divvying up data downstream
in some other object using some ad hoc mechanism?
My conclusion was that it was better to exploit the benefits of
bounded, single-purpose streams and simple file formats whenever
possible.
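For reference, here's a sketch of the virtual-file-system idea, in the spirit of Lucene's CompoundFileReader but with the directory structure simplified -- this is not the real on-disk format. Many logical "files" share one physical blob, and the directory maps each name to an (offset, length) slice:

```python
import io

class CompoundFileReader:
    """Toy compound file: one physical blob, many named logical files."""
    def __init__(self, blob, directory):
        # directory: {"_1-0.p": (offset, length), ...}
        self.blob = blob
        self.directory = directory

    def open_input(self, name):
        offset, length = self.directory[name]
        # Each caller gets a bounded slice, so the "simple,
        # single-purpose stream" property survives even though
        # everything lives in one physical file.
        return io.BytesIO(self.blob[offset:offset + length])
```

Seen this way, the filename-keyed directory is just another mechanism for divvying up one stream into bounded sub-streams -- which is the point of the question above.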
There's also a middle way, where each *format* gets its own file.
Then you wind up with fewer files, but you have to track field number
state.
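The field-number bookkeeping the middle way requires might look like this sketch, where one per-format file holds records for all fields and each record carries its field number. The record layout and function names are invented for illustration:

```python
import struct

def write_per_format(records):
    """Serialize (field_num, payload) pairs into one shared file.
    Every record must be tagged with its field number."""
    out = b""
    for field_num, payload in records:
        out += struct.pack("<II", field_num, len(payload)) + payload
    return out

def read_per_format(data):
    """Scan the shared file, yielding (field_num, payload) and
    tracking field-number state as we go."""
    pos = 0
    while pos < len(data):
        field_num, length = struct.unpack_from("<II", data, pos)
        pos += 8
        yield field_num, data[pos:pos + length]
        pos += length
```

The reader works, but it has reacquired exactly the state the per-field files avoided: every consumer must check the field number on every record.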
The nice thing is that packet-scoped plugins can be compatible with
ALL of these configurations:
> This way we can decouple the question of "how many files do I store
> my things in" from "how is each thing encoded/decoded". Maybe I want
> frq/prx/skip all in one file, or maybe I want them in 3 different
> files.
Well said.
>> The second problem is how to share a term dictionary over a cluster.
>> It would be nice to be able to plug modules into IndexReader that
>> represent clusters of machines but that are dedicated to specific
>> tasks: one cluster could be dedicated to fetching full documents and
>> applying highlighting; another cluster could be dedicated to
>> scanning through postings and finding/scoring hits; a third cluster
>> could store the entire term dictionary in RAM.
>>
>> A centralized term dictionary held in RAM would be particularly
>> handy for sorting purposes. The problem is that the file pointers of
>> a term dictionary are specific to indexes on individual machines. A
>> shared dictionary in RAM would have to contain pointers for *all*
>> clients, which isn't really workable.
>>
>> So, just how do you go about assembling task-specific clusters? The
>> stored documents cluster is easy, but the term dictionary and the
>> postings are hard.
>
> Phew! This is way beyond what I'm trying to solve now :)
Hmm. It doesn't look that difficult from my perspective. The problem
seems reasonably well isolated and contained. But I've worked hard to
make KS modular, so perhaps there's less distance left to travel.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/