On Tue, Mar 31, 2009 at 1:21 AM, Marvin Humphrey <[email protected]> wrote: > Greets, > > It's time to think about how to implement segment-centric sorted search in > Lucy, as discussed in the JIRA thread at > > > https://issues.apache.org/jira/browse/LUCENE-1458?focusedCommentId=12650377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12650377 > > ... and implemented in Lucene via... > > https://issues.apache.org/jira/browse/LUCENE-1483 > > [Jeez, JIRA permalinks are even longer than mail.apache.org permalinks.] > > Lucy's implementation will differ markedly from that of Lucene's because we > will write out sort cache files at index time to be mmap'd at search-time, > rather than lazily build sort caches in process memory within IndexReader. > > Our segment sort cache files will need two logical parts: > > * Doc-num-to-ordinal mapping. > * Field cache values. > > At present, KinoSearch divides up lexicon and postings files by field number > within each segment: three files for each field's lexicon and one for each > fields postings. Dividing up the content so aggressively allows the files to > have very simple formats, at the cost of file proliferation. However, since > KS uses a compound file format for all segment files (exclusively), it only > looks like there are a lot of files from KinoSearch's perspective; from the > operating system's perspective, only one file descriptor is held open per > segment. > > In the future, it will make sense for us to divide up lexicon, postings, and > sort cache files per-unique-FieldSpec rather than per-unique-field-name; we > will then merely add offsets for each field, which we might store as metadata > within segmeta.json. However, that's a revision I plan to take up in a month > or two.
Can the same field name have more than one FieldSpec? What's otherwise the difference? > My immediate goal is to make a dev release of KinoSearch that includes > real-time indexing support. My goal for the release after that is to overhaul > the postings, lexicon, and sort cache formats to support "flexible indexing" > and PFOR. Nice! > While those revisions are underway, it will make sense to make the > per-FieldSpec revamp. > > At the end of that process, I hope to have completed a draft Lucy File Format > Spec version 0.1 which KinoSearch will adhere to. In the mean time, a > per-field format for sort caches will be adequate. > > The obvious format for the doc-num-to-ord mapping is a stack of 32-bit > integers. The doc num would be implied by the array tick, and the ordinal > value by the integer value at that tick. > > i32_t sort_ord = self->sort_ords[doc_num]; > > We'd like to be able to use the field's lexicon to obtain the field values. > However, since lexicons use prefix compression, we can't jump to a random > entry. Therefore, we either need to change the lexicon format for sortable > fields so that it doesn't use prefix compression, or write redundant data. The requirement is that each doc has a single value, right? For numeric fields, can't you simply store the values instead of separate ord + values? EG sorting float/long/doubles is very easy. Also for smaller types (byte, short) you can be much more compact with only the values. Does/will Lucy have column-stride field storage? In which case, that file can already be your sort cache. (Lucene does not, yet, and so we must un-invert an inverted single-token-per-doc field, so even numeric fields are costly for Lucene to warm). > In the interest of simplicity, let's explore the redundant-data route for now. > However, let's plan to revisit that at a later date and see if we can make the > lexicon do double-duty as a sort cache, because the smaller the index, the > more of it that fits into the OS cache. > > For pure Unicode character data, a two-file format would work well. > > * A stack of 64-bit file offsets into the character data file. > * Pure character data, with string lengths implied by the offset file. > > Under this scheme, the value at a given position would be obtained like > so[1][2]: > > i64_t offset = self->offsets[sort_ord]; > i64_t size = self->offsets[sort_ord + 1] - offset; > ViewCB_Assign_Str(value, self->char_data + offset, size); > > In the interest of flexibility, we might allow arbitrary codecs, in which case > we would require the UTF-8 byte count to be prepended to the character data as > a C32: > > i64_t offset = self->offsets[sort_ord]; > char *data = self->data + offset; > i32_t size = MATH_DECODE_C32(data); /* advances data pointer */ > ViewCB_Assign_Str(value, data, size); > > However, I think it makes sense to set up the SortCacheReader to know what > type it is supposed to be retrieving. > > Now it turns out that our offsets have a funny property: they sort in the same > order as the string data that they point to. Because of that, we can > theoretically use them as our sort-ords, and do away with the stack of 32-bit > sort ords. Except, they take 2X the storage as int ords. > So, perhaps we only need two files for mmap'd sort-caches: > > * A stack of 64-bit file pointers, mapping doc num to character data. > * One file of character data, with each string prepended by a C32 byte > count. > > The two files would be named field_cache-XXX.ix and field_cache-XXX.dat, with > XXX representing the field number. Mike
