Marvin Humphrey <[EMAIL PROTECTED]> wrote:

> On Apr 10, 2008, at 3:10 AM, Michael McCandless wrote:
>
> > Can't you compartmentalize while still serializing skip data into
> > the single frq/prx file?
>
> Yes, that's possible.
>
> The way KS is set up right now, PostingList objects maintain i/o
> state, and Posting's Read_Record() method just deals with whatever
> instream gets passed to it. If the PostingList were to sneak in the
> reading of a skip packet, the Posting would be none the wiser.

Got it.
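To make sure I understand the division of labor, here's a rough Java
sketch -- all of the names are mine (hypothetical), not KS's actual
API, and the skip-packet framing is invented purely for illustration:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    // The pluggable per-record decoder: it only ever sees a stream
    // that is already positioned at a posting record.
    interface Posting {
      boolean readRecord(IndexInput in) throws IOException;
    }

    class PostingList {
      private final IndexInput in;    // single interleaved frq/prx stream
      private final Posting posting;  // pluggable decoder
      private long nextSkipPos;       // file offset of the next skip packet

      PostingList(IndexInput in, Posting posting, long firstSkipPos) {
        this.in = in;
        this.posting = posting;
        this.nextSkipPos = firstSkipPos;
      }

      boolean next() throws IOException {
        if (in.getFilePointer() == nextSkipPos) {
          // Assumed framing: a VInt byte length, opaque skip data,
          // then a VLong giving the offset of the next skip packet.
          // The container consumes the whole packet itself, so the
          // Posting remains none the wiser.
          int skipBytes = in.readVInt();
          in.seek(in.getFilePointer() + skipBytes);
          nextSkipPos = in.readVLong();
        }
        return posting.readRecord(in);
      }
    }

(In a real reader the skip data would of course be handed to a
skip-list decoder rather than skipped over, but the stream handling
is the same.)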
> > This is analogous to how videos are encoded. EG the AVI file
> > format is a "container" format, and it contains packets of video
> > and packets of audio, interleaved at the right rate so a player
> > can play both in sync. The "container" has no idea how to decode
> > the audio and video packets. Separate codecs do that.
> >
> > Taking this back to Lucene, there's a container format that, using
> > TermInfo, knows where the frq/prx data (packet) is and where the
> > skip data (packet) is. And it calls on separate decoders to decode
> > each.
>
> This is an intriguing proposal. :)
>
> The dev branch of KS currently uses oodles of per-segment files for
> the lexicon and the postings:
>
>   * One postings file per field per segment. [SEGNAME-FIELDNUM.p]
>   * One lexicon file per field per segment. [SEGNAME-FIELDNUM.lex]
>   * One lexicon index file per field per segment.
>     [SEGNAME-FIELDNUM.lexx]
>
> Having so many files is something of a drawback, but it means that
> each individual file can be very specialized, and that yields
> numerous benefits:
>
>   * Each file has a simple format.
>   * File Format spec easier to write and understand.
>   * Formats are pluggable.
>       o Easy to deprecate.
>       o Easy to isolate within a single class.
>   * PostingList objects are always single-field.
>       o Simplified internals.
>           * No field numbers to track.
>           * Repeat one read operation to scan the whole file.
>       o Pluggable using subclasses of Posting.
>       o Fewer subclasses (e.g. SegmentTermPositions is not needed).
>   * Lexicon objects are always single-field.
>       o Simplified internals.
>           * No field numbers to track.
>           * Repeat one read operation to scan the whole file.
>       o Possible to extend with custom per-field sorting at
>         index-time.
>       o Easier to extend to non-text terms.
>           * Comparison ops guaranteed to see like objects.
>   * Stream-related errors are comparatively easy to track down.
>
> Some of these benefits are preserved when reading from a single
> stream. However, there are some downsides:
>
>   * Container classes like PostingList more complex.
>       o No longer single-field.
>       o Harder to detect overruns that would have been EOF errors.
>       o Easier to lose stream sync.
>       o Periodic sampling for index records more complex.
>   * Tricky to prevent inappropriate compareTo ops at boundaries.
>   * Harder to troubleshoot.
>       o Glitch in one plugin can manifest as an error somewhere
>         else.
>       o Hexdump nearly impossible to interpret.
>       o Mentally taxing to follow like packets in an interleaved
>         stream.
>   * File corruption harder to recover from.
>       o Only as reliable as the weakest plugin.
>
> Benefits of the single stream include:
>
>   * Fewer hard disk seeks.
>   * Far fewer files.
>
> If you're using Lucene's non-compound file format, having far fewer
> files could be a significant benefit depending on the OS. But here's
> the thing:
>
> If you're using a custom virtual file system a la Lucene's compound
> files, what's the difference between divvying up data using
> filenames within the CompoundFileReader object, and divvying up data
> downstream in some other object using some ad hoc mechanism?

I think the major difference is locality? In a compound file, you
have to seek "far away" to reach the prx & skip data (if they are
separate). This is like "column stride" vs "row stride" serialization
of a matrix.

Relatively soon, though, we will all be on SSDs, so maybe this
locality argument becomes far less important ;)

Does KS allow non-compound format? I would think running out of file
descriptors is a common problem otherwise.
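Back to the container/codec analogy above, here's roughly what I'm
picturing, as a hypothetical Java sketch. None of these interfaces
exist in Lucene; TermPointers is just a stand-in for the per-term
metadata that TermInfo carries today, and the field names are
illustrative:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    // A codec decodes one kind of packet; it knows nothing about
    // where packets live in the file.
    interface PacketDecoder {
      void decode(IndexInput in) throws IOException;
    }

    // Stand-in for per-term metadata (what TermInfo tracks).
    class TermPointers {
      long freqPointer;   // where the frq/prx packet starts
      long skipPointer;   // where the skip packet starts
    }

    // The container knows *where* each packet lives but not *how* to
    // decode it -- like an AVI demuxer handing packets to the audio
    // and video codecs.
    class ContainerReader {
      private final IndexInput in;
      private final PacketDecoder postingsCodec;
      private final PacketDecoder skipCodec;

      ContainerReader(IndexInput in, PacketDecoder postingsCodec,
                      PacketDecoder skipCodec) {
        this.in = in;
        this.postingsCodec = postingsCodec;
        this.skipCodec = skipCodec;
      }

      void readTerm(TermPointers tp) throws IOException {
        in.seek(tp.freqPointer);
        postingsCodec.decode(in);
        in.seek(tp.skipPointer);
        skipCodec.decode(in);
      }
    }

Swapping in a different postings encoding would then be a matter of
registering a different PacketDecoder, without touching the container.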
Though, I think your Fibonacci merge policy is more "aggressive" than
Lucene's LogMergePolicy (ie, fewer segments for the same # docs).

> My conclusion was that it was better to exploit the benefits of
> bounded, single-purpose streams and simple file formats whenever
> possible.
>
> There's also a middle way, where each *format* gets its own file.
> Then you wind up with fewer files, but you have to track field
> number state.
>
> The nice thing is that packet-scoped plugins can be compatible with
> ALL of these configurations:

Right. This way users can pick & choose how to put things in the
index (with "healthy" defaults, of course).
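Something like this hypothetical sketch (again, invented names -- not
KS's or Lucene's API): the packet-scoped plugin only ever sees a
positioned stream, so the same decoder runs under any of the layouts
discussed above.

    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IndexInput;

    // How postings bytes are laid out on disk -- the only thing that
    // changes between the configurations above.
    interface PostingsLayout {
      IndexInput openPostings(String segment, int fieldNum)
          throws IOException;
    }

    // One postings file per field per segment (KS dev branch style).
    class PerFieldLayout implements PostingsLayout {
      private final Directory dir;
      PerFieldLayout(Directory dir) { this.dir = dir; }
      public IndexInput openPostings(String segment, int fieldNum)
          throws IOException {
        return dir.openInput(segment + "-" + fieldNum + ".p");
      }
    }

    // All fields in one file per segment; the layout seeks to the
    // field's region (offsets would come from per-segment metadata,
    // omitted here) before handing the stream to the plugin.
    class SingleFileLayout implements PostingsLayout {
      private final Directory dir;
      SingleFileLayout(Directory dir) { this.dir = dir; }
      public IndexInput openPostings(String segment, int fieldNum)
          throws IOException {
        IndexInput in = dir.openInput(segment + ".p");
        in.seek(fieldOffset(segment, fieldNum));
        return in;
      }
      private long fieldOffset(String segment, int fieldNum) {
        return 0; // placeholder: would be read from segment metadata
      }
    }

Either way, the packet-scoped Posting plugin just reads records from
whatever IndexInput it's handed.

Mike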