Marvin Humphrey <[EMAIL PROTECTED]> wrote:

> On Apr 8, 2008, at 10:25 AM, Michael McCandless wrote:
>
> > I've actually been working on factoring DocumentsWriter, as a first
> > step towards flexible indexing.
>
> The way I handled this in KS was to turn Posting into a class akin to
> TermBuffer: the individual Posting object persists, but its values change.
>
> Meanwhile, each Posting subclass has a Read_Raw method which generates a
> "RawPosting".  RawPosting objects are a serialized, sortable, lowest
> common denominator form of Posting which every subclass must be able to
> export.  They're allocated from a specialized MemoryPool, making them
> cheap to manufacture and to release.
>
> RawPosting is the only form PostingsWriter is actually required to know
> about:
>
>   // PostingsWriter loop:
>   RawPosting rawPosting;
>   while ((rawPosting = rawPostingQueue.pop()) != null) {
>     writeRawPosting(rawPosting);
>   }
>
> > I agree we would have an abstract base Posting class that just tracks
> > the term text.
>
> IMO, the abstract base Posting class should not track text.  It should
> include only one datum: a document number.  This keeps it in line with
> the simplest IR definition for a "posting": one document matching one
> term.
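[Editor's note: to make the "RawPosting is all PostingsWriter knows" loop above concrete, here is a minimal, self-contained Java sketch.  The RawPosting fields, the Deque standing in for the MemoryPool-backed queue, and the writeRawPosting output format are illustrative assumptions, not the actual KS API.]

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: RawPosting here is just (term, doc num); in KS it is a
// serialized, sortable blob allocated from a specialized MemoryPool.
class RawPosting {
    final String termText;  // serialized term text (a String for simplicity)
    final int docNum;
    RawPosting(String termText, int docNum) {
        this.termText = termText;
        this.docNum = docNum;
    }
}

class PostingsWriter {
    private final StringBuilder out = new StringBuilder();  // stand-in for the segment file

    // The writer loop from the message: drain the queue, write each RawPosting.
    // No Posting subclass ever appears here -- only the common RawPosting form.
    void writeAll(Deque<RawPosting> rawPostingQueue) {
        RawPosting rawPosting;
        while ((rawPosting = rawPostingQueue.poll()) != null) {
            writeRawPosting(rawPosting);
        }
    }

    private void writeRawPosting(RawPosting p) {
        out.append(p.termText).append(':').append(p.docNum).append('\n');
    }

    String contents() { return out.toString(); }
}
```

Because every Posting subclass exports the same RawPosting form, the writer stays oblivious to whether the source was a MatchPosting, ScorePosting, etc.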
But how do you then write out a segment with the terms packed, in sorted
order?  Your "generic" layer needs to know how to sort these Posting lists
by term text, right?

>   Posting: doc num (abstract)
>   MatchPosting: doc num
>   ScorePosting: doc num, freq, per-doc boost, positions
>   RichPosting: doc num, freq, positions with per-position boost
>   PayloadPosting: doc num, payload

OK, I now see that what we call Posting really should be called
PostingList: each instance of this class, in DW, tracks all documents that
contained that term.  Whereas for KS, a Posting is a single occurrence of a
term in a single doc, right?

Does a Posting contain all occurrences of the term in the doc (multiple
positions), or only one?

How do you do buffering/flushing?  After each document, do you re-sweep
your Posting instances and write them into a single segment?  Or do you
accumulate many of these Posting instances (so many docs are held in this
form) and flush to disk when RAM is full?

> Then, for search-time you have a PostingList class which takes the place
> of TermDocs/TermPositions, and uses an underlying Posting object to read
> the file.  (PostingList and its subclasses don't know anything about file
> formats.)

Wouldn't PostingList need to know something of the file format?  EG maybe
it's a sparse format (docID or gap encoded each time), or it's non-sparse
(like norms, column-stride fields).

> Each Posting subclass is associated with a subclass of TermScorer which
> implements its own Posting-subclass-specific scoring algorithm.
>
>   // MatchPostingScorer scoring algo...
>   while (postingList.next()) {
>     MatchPosting posting = postingList.getPosting();
>     collector.collect(posting.getDocNum(), 1.0f);
>   }
>
>   // ScorePostingScorer scoring algo...
>   while (postingList.next()) {
>     ScorePosting posting = (ScorePosting) postingList.getPosting();
>     int freq = posting.getFreq();
>     float score = freq < TERMSCORER_SCORE_CACHE_SIZE
>       ? scoreCache[freq]              // cache hit
>       : sim.tf(freq) * weightValue;
>     collector.collect(posting.getDocNum(), score);
>   }

> > And then the code that writes the current index format would plug into
> > this and should be fairly small and easy to understand.
>
> I'm pessimistic that anything that writes the current index format could
> be "easy to understand", because the spec is so dreadfully convoluted.

I'm quite a bit more optimistic here.

> As I have argued before, the key is to have each Posting subclass wholly
> define a file format.  That makes them pluggable, breaking the tight
> binding between the Lucene codebase and the Lucene file format spec.

It's not just Posting that defines the file format.  Things like stored
fields, norms, and column-stride fields have nothing to do with inversion.
So these writers/readers should "plug in" at a layer above the inversion?
OK, I see these below:

> > Then there would also be plugins that just tap into the entire
> > document (don't need inversion), like FieldsWriter.
>
> Yes.  Here's how things are set up in KS:
>
>   InvIndexer
>     SegWriter
>       DocWriter
>       PostingsWriter
>       LexWriter
>       TermVectorsWriter
>       // plug in more writers here?
>
> Ideally, all of the writers under SegWriter would be subclasses of an
> abstract SegDataWriter class, and would implement addInversion() and
> addSegment().  SegWriter.addDoc() would look something like this:
>
>   addDoc(Document doc) {
>     Inversion inversion = invert(doc);
>     for (int i = 0; i < writers.size; i++) {
>       writers[i].addInversion(inversion);
>     }
>   }

I think TermVectorsWriter should be seen as a consumer of the "inversion"
plugin API.  It's just that, unlike the frq/prx writer, which flushes when
RAM is full, the TermVectorsWriter flushes after each doc.  Ie, the generic
code does the inversion, feeding "you" Posting occurrences, and "you" write
them to a file however you want.
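[Editor's note: Marvin's SegWriter.addDoc() pseudocode above can be turned into a compilable Java sketch as follows.  Document, Inversion, and the CountingWriter are placeholder types invented for illustration; they are not the Lucene or KS API.]

```java
import java.util.ArrayList;
import java.util.List;

// Placeholder types; the real Document/Inversion carry much more state.
class Document { final String text; Document(String text) { this.text = text; } }
class Inversion { final String[] terms; Inversion(String[] terms) { this.terms = terms; } }

// Abstract base for everything under SegWriter, per the proposal:
// each sub-writer consumes the inversion of every added doc.
abstract class SegDataWriter {
    abstract void addInversion(Inversion inversion);
    abstract void addSegment();  // merge path, unused in this sketch
}

// A per-doc consumer in the TermVectorsWriter mold; here it only counts docs,
// but it could just as well flush term vectors after each addInversion() call.
class CountingWriter extends SegDataWriter {
    int docsSeen = 0;
    void addInversion(Inversion inversion) { docsSeen++; }
    void addSegment() {}
}

class SegWriter {
    private final List<SegDataWriter> writers = new ArrayList<>();

    void addWriter(SegDataWriter w) { writers.add(w); }  // the "plug in" point

    void addDoc(Document doc) {
        Inversion inversion = invert(doc);
        for (SegDataWriter w : writers) {
            w.addInversion(inversion);  // every plugin sees every doc's inversion
        }
    }

    // Trivial stand-in for real analysis/inversion: whitespace tokenization.
    private Inversion invert(Document doc) {
        return new Inversion(doc.text.split("\\s+"));
    }
}
```

The point of the sketch: the generic layer inverts once per doc, and each plugged-in writer decides for itself whether to write immediately (term vectors) or buffer until RAM is full (frq/prx).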
> In practice, three of the writers are required (one for the term
> dictionary/lexicon, one for postings, and one for some form of document
> storage), but the design allows for plugging in additional SegDataWriter
> subclasses.

OK.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]