On Apr 10, 2008, at 2:37 AM, Michael McCandless wrote:
IMO, the abstract base Posting class should not track text. It
should
include only one datum: a document number. This keeps it in line
with the
simplest IR definition for a "posting": one document matching one
term.
But how do you then write out a segment with the terms packed, in
sorted order? Your "generic" layer needs to know how to sort these
Posting lists by term text, right?
Yes. But the generic layer is the semi-serialized export form
RawPosting, not the abstract base class Posting. (Would it be clearer
if RawPosting was named "FlatPosting"?)
struct kino_Posting {
kino_VirtualTable* _;
kino_ref_t ref;
chy_u32_t doc_num;
};
struct kino_RawPosting {
kino_VirtualTable* _;
kino_ref_t ref;
chy_u32_t doc_num;
chy_u32_t freq;
chy_u32_t content_len;
chy_u32_t aux_len;
char blob[1]; /* flexible array */
};
The last member of RawPosting, "blob", uses the C "flexible array"
hack which takes advantage of C's lack of bounds checking. A
RawPosting consists of a single contiguous memory allocation, but the
size of that allocation varies depending on how much information the
object needs to carry.
The "blob" is divided into two parts: content, and "aux". Content is
term text. "aux" is encoded material which will be written to the
postings file. The total allocation for any given RawPosting can be
calculated like so:
offsetof(RawPosting, blob) + posting->content_len + posting->aux_len
Sorting is done by comparing first the "content" component of blob
using memcmp(), then breaking ties with content_len and finally,
doc_num.
The write method invoked by PostingsWriter is the same --
RawPost_Write_Record -- no matter what Posting subclass is assigned to
the field. The first part of the implementing function
RawPost_write_record may look familiar to you, as it uses a close
variant of Lucene's algo for encoding doc_num and freq together. Note
the last line, though:
void
RawPost_write_record(RawPosting *self, OutStream *outstream,
u32_t last_doc_num)
{
const u32_t delta_doc = self->doc_num - last_doc_num;
char *const aux_content = self->blob + self->content_len;
/* Write doc num, freq. */
if (self->freq == 1) {
const u32_t doc_code = (delta_doc << 1) | 1;
OutStream_Write_C32(outstream, doc_code);
}
else {
const u32_t doc_code = delta_doc << 1;
OutStream_Write_C32(outstream, doc_code);
OutStream_Write_C32(outstream, self->freq);
}
/* Print prepared data. */
OutStream_Write_Bytes(outstream, aux_content, self->aux_len);
}
The idea is that information other than doc num and freq gets
serialized during the export operation which creates the RawPosting --
then the last line of RawPost_write_record prints the pre-serialized
data to the OutStream.
OK I now see that what we call Posting really should be called
PostingList: each instance of this class, in DW, tracks all documents
that contained that term.
I suppose that would be more accurate... though in KS, the
"PostingList" class is the replacement for TermDocs, TermPositions, etc.
Whereas for KS, Posting is a single occurrence of term in a single
doc, right?
It's one term matching one document.
These are the definitions I'm using, as set out in
KinoSearch::Docs::IRTheory at <http://xrl.us/bi8w2>:
* document - An atomic unit of retrieval.
* term - An attribute which describes a document.
* posting - One term indexing one document.
* term list - The complete list of terms which describe a document.
* posting list - The complete list of documents which a term
indexes.
* inverted index - A data structure which maps from terms to
documents.
Does a Posting contain all
occurrences of the term in the doc (multiple positions) or only one?
In KS, a Posting object can represent multiple occurrences of a term
within a single document. That was somewhat arbitrary on my part.
How do you do buffering/flushing?
RawPostings are fed into an object which handles external sorting.
The external sorter tracks RAM usage, and checks once per document
whether it should flush a sorted run to temporary storage.
After the last document gets added, the external sorter flips to
fetching mode. The pre-sorted runs are merged into a sorted stream of
RawPostings, and the postings and lexicon files get written out.
Then, for search-time you have a PostingList class which takes the
place of
TermDocs/TermPositions, and uses an underlying Posting object to
read the
file. (PostingList and its subclasses don't know anything about file
formats.)
Wouldn't PostingList need to know something of the file format? EG
maybe it's a sparse format (docID or gap encoded each time), or, it's
non-sparse (like norms, column-stride fields).
Each subclass of Posting implements its own Read_Record() method.
PostingList just hands off the InStream to its internal Posting object
-- it doesn't know or care how much data from the InStream the Posting
consumes with each Read_Record().
As I have argued before, the key is to have each Posting subclass
wholly
define a file format. That makes them pluggable, breaking the
tight binding
between the Lucene codebase and the Lucene file format spec.
It's not just Posting that defines the file format.
I could have been clearer -- I meant a postings file format.
I think TermVectorsWriter should be seen as a consumer of the
"inversion" plugin API. It's just that, unlike the frq/prx writer,
which flushes when RAM is full, the TermVectorsWriter flushes after
each doc. Ie, the generic code does the inversion, feeding "you"
Posting occurrences, and "you" write this to a file however you want.
Yes, that's how it works in KS, too. :)
I haven't written a formal Inversion class yet, though. Upon
reflection, perhaps it would be better to have "Inversion" be specific
to a single field, and "InvertedDocument" collect everything together?
public void addDocument(Document document) {
InvertedDocument inverteDoc = invert(document)
for (i = 0; i < writers.length; i++) {
writers[i].addInvertedDocument(invertedDoc);
}
}
I think that's better in KS, because I can rename "TokenBatch" -- the
class that actually does the low-level inverting and that subclasses
of Analyzer need to deal with -- to "Inversion".
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]