Greets,
It's time to think about how to implement segment-centric sorted search in
Lucy, as discussed in the JIRA thread at
https://issues.apache.org/jira/browse/LUCENE-1458?focusedCommentId=12650377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12650377
... and implemented in Lucene via...
https://issues.apache.org/jira/browse/LUCENE-1483
[Jeez, JIRA permalinks are even longer than mail.apache.org permalinks.]
Lucy's implementation will differ markedly from Lucene's, because we will
write out sort cache files at index time to be mmap'd at search time, rather
than lazily building sort caches in process memory within IndexReader.
Our segment sort cache files will need two logical parts:
* Doc-num-to-ordinal mapping.
* Field cache values.
At present, KinoSearch divides up lexicon and postings files by field number
within each segment: three files for each field's lexicon and one for each
field's postings. Dividing up the content so aggressively allows the files to
have very simple formats, at the cost of file proliferation. However, since
KS uses a compound file format for all segment files (exclusively), it only
looks like there are a lot of files from KinoSearch's perspective; from the
operating system's perspective, only one file descriptor is held open per
segment.
In the future, it will make sense for us to divide up lexicon, postings, and
sort cache files per-unique-FieldSpec rather than per-unique-field-name; we
will then merely add offsets for each field, which we might store as metadata
within segmeta.json. However, that's a revision I plan to take up in a month
or two.
My immediate goal is to make a dev release of KinoSearch that includes
real-time indexing support. My goal for the release after that is to overhaul
the postings, lexicon, and sort cache formats to support "flexible indexing"
and PFOR. While those revisions are underway, it will make sense to undertake
the per-FieldSpec revamp.
At the end of that process, I hope to have completed a draft Lucy File Format
Spec version 0.1, which KinoSearch will adhere to. In the meantime, a
per-field format for sort caches will be adequate.
The obvious format for the doc-num-to-ord mapping is a stack of 32-bit
integers. The doc num would be implied by the array tick, and the ordinal
value by the integer value at that tick.
i32_t sort_ord = self->sort_ords[doc_num];
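The point of the mapping is that sorted search can then compare two docs in a
segment with nothing but array lookups. As a hypothetical sketch (the function
name and struct layout are illustrative, not settled API):
/* Hypothetical sketch: compare two docs in one segment by sort ordinal. */
static i32_t
compare_by_ord(SortCache *self, i32_t doc_a, i32_t doc_b)
{
    i32_t ord_a = self->sort_ords[doc_a];
    i32_t ord_b = self->sort_ords[doc_b];
    if      (ord_a < ord_b) { return -1; }
    else if (ord_a > ord_b) { return 1;  }
    else                    { return 0;  }
}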
We'd like to be able to use the field's lexicon to obtain the field values.
However, since lexicons use prefix compression, we can't jump to a random
entry. Therefore, we either need to change the lexicon format for sortable
fields so that it doesn't use prefix compression, or write redundant data.
In the interest of simplicity, let's explore the redundant-data route for now.
However, let's plan to revisit that at a later date and see whether we can
make the lexicon do double duty as a sort cache, because the smaller the
index, the more of it fits into the OS cache.
For pure Unicode character data, a two-file format would work well.
* A stack of 64-bit file offsets into the character data file.
* Pure character data, with string lengths implied by the offset file.
Under this scheme, the value at a given position would be obtained like
so[1][2]:
i64_t offset = self->offsets[sort_ord];
i64_t size = self->offsets[sort_ord + 1] - offset;
ViewCB_Assign_Str(value, self->char_data + offset, size);
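At index time, writing those two files would presumably amount to walking the
field's lexicon in sorted order, recording one offset per term, and appending
a final sentinel offset equal to the total length of the character data so
that the sort_ord + 1 lookup above stays in bounds. A schematic sketch using
plain stdio (a real implementation would go through KS's own I/O layer and
worry about byte order; the term_texts/term_lens arrays are hypothetical):
#include <stdint.h>
#include <stdio.h>
/* Hypothetical sketch: write the offsets and character data files for one
 * sortable field.  term_texts/term_lens hold the terms in lexicon order. */
static void
write_sort_cache(FILE *offsets_file, FILE *char_data_file,
                 const char **term_texts, const size_t *term_lens,
                 int num_terms)
{
    int64_t offset = 0;
    int i;
    for (i = 0; i < num_terms; i++) {
        /* One 64-bit offset per term... */
        fwrite(&offset, sizeof(offset), 1, offsets_file);
        /* ... and the raw UTF-8 term content. */
        fwrite(term_texts[i], 1, term_lens[i], char_data_file);
        offset += (int64_t)term_lens[i];
    }
    /* Sentinel offset, so that offsets[sort_ord + 1] is always valid. */
    fwrite(&offset, sizeof(offset), 1, offsets_file);
}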
In the interest of flexibility, we might allow arbitrary codecs, in which case
we would require the UTF-8 byte count to be prepended to the character data as
a C32:
i64_t offset = self->offsets[sort_ord];
char *data = self->data + offset;
i32_t size = MATH_DECODE_C32(data); /* advances data pointer */
ViewCB_Assign_Str(value, data, size);
However, I think it makes sense to set up the SortCacheReader to know what
type it is supposed to be retrieving.
Now it turns out that our offsets have a funny property: they sort in the same
order as the string data that they point to. Because of that, we can
theoretically use them as our sort-ords, and do away with the stack of 32-bit
sort ords.
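In other words, if we store one 64-bit offset per document rather than a
32-bit ord, a sort comparison is still a single array lookup per doc. Another
hypothetical sketch, with the same caveats about names:
/* Hypothetical sketch: the 64-bit offsets themselves serve as sort ords. */
static i32_t
compare_by_offset(SortCacheReader *self, i32_t doc_a, i32_t doc_b)
{
    i64_t offset_a = self->offsets[doc_a];
    i64_t offset_b = self->offsets[doc_b];
    if      (offset_a < offset_b) { return -1; }
    else if (offset_a > offset_b) { return 1;  }
    else                          { return 0;  }
}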
So, perhaps we only need two files for mmap'd sort-caches:
* A stack of 64-bit file pointers, mapping doc num to character data.
* One file of character data, with each string prefixed by its C32 byte
count.
The two files would be named field_cache-XXX.ix and field_cache-XXX.dat, with
XXX representing the field number.
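Retrieving a value under that layout would combine the two snippets above:
index the offsets file directly by doc num, then decode the C32 byte count
before handing the text off to a ViewCharBuf. Bounds checking is again
omitted:
i64_t offset = self->offsets[doc_num];   /* doc num -> char data offset */
char *data   = self->char_data + offset;
i32_t size   = MATH_DECODE_C32(data);    /* advances data pointer */
ViewCB_Assign_Str(value, data, size);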
Thoughts?
Marvin Humphrey
[1] Bounds checking has been omitted from these examples.
[2] ViewCB = ViewCharBuf, a string type that borrows content.