Re: Sort cache file format

Michael McCandless Wed, 01 Apr 2009 04:53:41 -0700

On Tue, Mar 31, 2009 at 1:21 AM, Marvin Humphrey <[email protected]> wrote:
> Greets,
>
> It's time to think about how to implement segment-centric sorted search in
> Lucy, as discussed in the JIRA thread at
>
>    
> https://issues.apache.org/jira/browse/LUCENE-1458?focusedCommentId=12650377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12650377
>
> ... and implemented in Lucene via...
>
>    https://issues.apache.org/jira/browse/LUCENE-1483
>
> [Jeez, JIRA permalinks are even longer than mail.apache.org permalinks.]
>
> Lucy's implementation will differ markedly from that of Lucene's because we
> will write out sort cache files at index time to be mmap'd at search-time,
> rather than lazily build sort caches in process memory within IndexReader.
>
> Our segment sort cache files will need two logical parts:
>
>  * Doc-num-to-ordinal mapping.
>  * Field cache values.
>
> At present, KinoSearch divides up lexicon and postings files by field number
> within each segment: three files for each field's lexicon and one for each
> fields postings.  Dividing up the content so aggressively allows the files to
> have very simple formats, at the cost of file proliferation.  However, since
> KS uses a compound file format for all segment files (exclusively), it only
> looks like there are a lot of files from KinoSearch's perspective; from the
> operating system's perspective, only one file descriptor is held open per
> segment.
>
> In the future, it will make sense for us to divide up lexicon, postings, and
> sort cache files per-unique-FieldSpec rather than per-unique-field-name; we
> will then merely add offsets for each field, which we might store as metadata
> within segmeta.json.  However, that's a revision I plan to take up in a month
> or two.


Can the same field name have more than one FieldSpec?  What's
otherwise the difference?

> My immediate goal is to make a dev release of KinoSearch that includes
> real-time indexing support.  My goal for the release after that is to overhaul
> the postings, lexicon, and sort cache formats to support "flexible indexing"
> and PFOR.

Nice!

> While those revisions are underway, it will make sense to make the
> per-FieldSpec revamp.
>
> At the end of that process, I hope to have completed a draft Lucy File Format
> Spec version 0.1 which KinoSearch will adhere to.  In the mean time, a
> per-field format for sort caches will be adequate.
>
> The obvious format for the doc-num-to-ord mapping is a stack of 32-bit
> integers.  The doc num would be implied by the array tick, and the ordinal
> value by the integer value at that tick.
>
>   i32_t sort_ord = self->sort_ords[doc_num];
>
> We'd like to be able to use the field's lexicon to obtain the field values.
> However, since lexicons use prefix compression, we can't jump to a random
> entry.  Therefore, we either need to change the lexicon format for sortable
> fields so that it doesn't use prefix compression, or write redundant data.

The requirement is that each doc has a single value, right?

For numeric fields, can't you simply store the values instead of
separate ord + values?  EG sorting float/long/doubles is very easy.
Also for smaller types (byte, short) you can be much more compact with
only the values.

Does/will Lucy have column-stride field storage?  In which case, that
file can already be your sort cache.  (Lucene does not, yet, and so we
must un-invert an inverted single-token-per-doc field, so even numeric
fields are costly for Lucene to warm).

> In the interest of simplicity, let's explore the redundant-data route for now.
> However, let's plan to revisit that at a later date and see if we can make the
> lexicon do double-duty as a sort cache, because the smaller the index, the
> more of it that fits into the OS cache.
>
> For pure Unicode character data, a two-file format would work well.
>
>   * A stack of 64-bit file offsets into the character data file.
>   * Pure character data, with string lengths implied by the offset file.
>
> Under this scheme, the value at a given position would be obtained like
> so[1][2]:
>
>   i64_t offset = self->offsets[sort_ord];
>   i64_t size   = self->offsets[sort_ord + 1] - offset;
>   ViewCB_Assign_Str(value, self->char_data + offset, size);
>
> In the interest of flexibility, we might allow arbitrary codecs, in which case
> we would require the UTF-8 byte count to be prepended to the character data as
> a C32:
>
>   i64_t  offset = self->offsets[sort_ord];
>   char  *data   = self->data + offset;
>   i32_t  size   = MATH_DECODE_C32(data); /* advances data pointer */
>   ViewCB_Assign_Str(value, data, size);
>
> However, I think it makes sense to set up the SortCacheReader to know what
> type it is supposed to be retrieving.
>
> Now it turns out that our offsets have a funny property: they sort in the same
> order as the string data that they point to.  Because of that, we can
> theoretically use them as our sort-ords, and do away with the stack of 32-bit
> sort ords.

Except, they take 2X the storage as int ords.

> So, perhaps we only need two files for mmap'd sort-caches:
>
>   * A stack of 64-bit file pointers, mapping doc num to character data.
>   * One file of character data, with each string prepended by a C32 byte
>     count.
>
> The two files would be named field_cache-XXX.ix and field_cache-XXX.dat, with
> XXX representing the field number.

Mike

Re: Sort cache file format

Reply via email to