Re: Sort cache file format

Michael McCandless Thu, 09 Apr 2009 03:52:42 -0700

On Wed, Apr 8, 2009 at 1:43 PM, Marvin Humphrey wrote:


>> Can the same field name have more than one FieldSpec?
>
> Absolutely not.  I have always regarded that feature of Lucene as insane.
> It's like an SQL engine where any INSERT can silently ALTER TABLE.
>
> Unsurprisingly, the back-end code to support that insane feature is ginormous
> and convoluted and has been the source of many bugs over the years.
> LUCENE-1590 is only the latest.

Agreed, but this "flexibility" gets much love... (or at least any
possible step away from it gets much anti-love).

>> What's otherwise the difference?
>
> [regarding per-field files vs. per-FieldSpec files]
>
> Lexicon objects in KinoSearch only iterate over one field's terms.
> PostingLists are single-field, too -- you can't seek to a term in another
> field.

I like that approach.  LUCENE-1458 makes the same change to Lucene.

> In the Lucene file format, each term in the term dictionary has to list a
> field number.  That wastes space compared to the KinoSearch format, since in
> KS there is only one field number per file and it is encoded within the file's
> name.

Got it.  LUCENE-1458 stops wasting space as well, but we still store
multiple fields in a single file.

> Furthermore, per-field lexicon and postings files are easier to troubleshoot.
> You just look at the directory listings and you get a pretty good idea of
> what's going on: files that are missing, that have zero length, etc, are big
> red flags.

Yes, nice transparency.

> If you're using the compound file format, you don't have to worry about
> running out of file descriptors, no matter how many virtual files you use --
> because a single descriptor is shared by all of them.  Thus, the only
> consequence of dividing up files per-FieldSpec rather than per-field is... to
> make the index more opaque and harder to debug. :(  It's a step backwards.

Ahhhhh... OK this finally clarifies my [silly] confusion: a FieldSpec
corresponds to more than one field.  It's like FieldInfos in Lucene?

So for Lucy you'll move away from KS's required "compound file
format", and allow either compound or not.

> However, I think Lucy will have to accept that loss of clarity in order to
> support the non-compound format. :(

Darned OSs.  Why can't they give us more file descriptors?

>> For numeric fields, can't you simply store the values instead of
>> separate ord + values?  EG sorting float/long/doubles is very easy.
>> Also for smaller types (byte, short) you can be much more compact with
>> only the values.
>
> Right now there are three core FieldSpec classes in KS:
>
>  FullTextField
>  StringField
>  BlobField
>
> Only "fulltext" fields are associated with Analyzers (and they are *always*
> analyzed).  Only "string" fields -- which are always single-value since they
> aren't analyzed -- allow sorting.
>
> The plan in the near future is to add additional field types such as these:
>
>  Int8Field
>  Int16Field
>  Int32Field
>  Int64Field
>  Float32Field
>  TimeStamp32Field
>  TimeStamp64Field
>
> I think the sort cache format for a given field would be dependent on the
> FieldSpec class.  So would the encode/decode for the full document storage,
> for that matter.

How to divide up fields is a tricky matter... Lucene is challenged in
this area now :)  We've divided up "Field" into three classes.

But you didn't answer the original question: will you simply use the
numeric value as the ord?
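
To make the question concrete, here is a minimal sketch of using the
stored numeric value directly as the sort key, with no separate ord
array.  The names (values, sort_docs_by_value) are hypothetical and not
taken from KS, Lucy, or Lucene; the one-byte array only stands in for
what an Int8Field-style cache might hold.

```python
from array import array

# Hypothetical per-document values for an 8-bit integer field,
# one slot per doc id (an assumption for illustration only).
values = array("b", [12, -3, 12, 0, 7])

def sort_docs_by_value(doc_ids, values):
    """Order doc ids by field value; the value itself acts as the ord."""
    return sorted(doc_ids, key=lambda doc_id: values[doc_id])

hits = [4, 0, 3, 1]
ordered = sort_docs_by_value(hits, values)
print(ordered)
```

Each document costs one byte here, where a 32-bit ord plus a separate
value table would cost at least four.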

Also, for enumerated fields (small universe of possible values, eg
"yes" or "no" in the binary case) will you bit-pack the ords?
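
A rough sketch of what bit-packing the ords could look like for such a
field.  All names here are made up for illustration, and the
little-endian bit layout is just one possible choice, not a proposed
file format.

```python
def bits_needed(num_values):
    """Smallest bit width that can represent ords 0 .. num_values - 1."""
    return max(1, (num_values - 1).bit_length())

def pack_ords(ords, width):
    """Pack ords into bytes, `width` bits each, lowest bits first."""
    acc = 0
    for i, ord_ in enumerate(ords):
        acc |= ord_ << (i * width)
    total_bits = len(ords) * width
    return acc.to_bytes((total_bits + 7) // 8, "little")

def unpack_ord(packed, width, doc_id):
    """Read one document's ord (decodes everything; fine for a sketch)."""
    acc = int.from_bytes(packed, "little")
    return (acc >> (doc_id * width)) & ((1 << width) - 1)

terms = ["no", "yes"]                      # the binary case
ords = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]      # one ord per document
width = bits_needed(len(terms))
packed = pack_ords(ords, width)
print(width, len(packed))
```

Ten documents fit in two bytes at one bit per ord, versus forty bytes
with 32-bit ords.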

>> Does/will Lucy have column-stride field storage?
>
> I think these sort caches should be either column-stride or variable width.
> Which one would be most appropriate would be data dependent.  I don't know how
> important it is to expose an API to support multiple sort cache types.  I
> suppose that someone will need it eventually for maximum efficiency.

I think variable/fixed width storage vs column/row stride are
orthogonal.

> As I mentioned in this JIRA comment for Lucene column-stride fields this
> morning...
>
> https://issues.apache.org/jira/browse/LUCENE-1231?focusedCommentId=12697076&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12697076
>
> ... I think the only purpose for column-stride fields should be to aid search.

I don't really understand what "to aid search" means -- that's very
generic.

>> > For pure Unicode character data, a two-file format would work well.
>> >
>> >   * A stack of 64-bit file offsets into the character data file.
>> >   * Pure character data, with string lengths implied by the offset file.
>> >
>> > Now it turns out that our offsets have a funny property: they sort in the
>> > same order as the string data that they point to.  Because of that, we
>> > can theoretically use them as our sort-ords, and do away with the stack
>> > of 32-bit sort ords.
>>
>> Except, they take 2X the storage as int ords.
>
> True.  For a field with a small number of unique values, the 32-bit ords
> are a win.

Well, you can do better by bit packing.  For enumerated value fields
you usually can use far fewer than 32 bits per value.
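
For reference, the two-file layout quoted above, and the property that
offsets sort like the strings they point to, can be sketched as follows.
This is an in-memory illustration under stated assumptions: the function
names are hypothetical, the trailing sentinel offset is an added
assumption so that the last string's length is also implied, and the
values are assumed to be non-empty and written in sorted order.

```python
import struct

def write_sort_cache(sorted_unique_values):
    """Build both 'files' in memory: 64-bit offsets plus raw UTF-8 data."""
    offsets = bytearray()
    data = bytearray()
    for value in sorted_unique_values:
        offsets += struct.pack(">Q", len(data))
        data += value.encode("utf-8")
    # Assumed sentinel: end of data, so the final length is implied too.
    offsets += struct.pack(">Q", len(data))
    return bytes(offsets), bytes(data)

def read_value(offsets, data, index):
    """Recover string `index`; its length is the gap between two offsets."""
    start, end = struct.unpack_from(">QQ", offsets, index * 8)
    return data[start:end].decode("utf-8")

unique = sorted({"apple", "banana", "cherry"})
offsets, data = write_sort_cache(unique)

# Per-document cache: each doc stores the offset of its value.
offset_of = {v: struct.unpack_from(">Q", offsets, i * 8)[0]
             for i, v in enumerate(unique)}
doc_values = ["cherry", "apple", "banana", "apple"]
doc_offsets = [offset_of[v] for v in doc_values]

doc_ids = range(len(doc_values))
by_offset = sorted(doc_ids, key=lambda d: doc_offsets[d])
by_string = sorted(doc_ids, key=lambda d: doc_values[d])
print(by_offset == by_string)
```

Comparing the 64-bit offsets gives the same document order as comparing
the strings themselves, at eight bytes per document.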

>  // Less space with a small number of unique values:
>  32-bit ord => 64-bit file pointer => character data
>
>  // Less space with many unique values:
>  64-bit file pointer => character data
>
> However, I think that we should accommodate fields with few values using
> dedicated enum field types, e.g. a "StringEnumField" that requires you to
> declare all possible values in advance.

Is that too much of a fixed schema (requiring all values to be
declared in advance)?

> There's still an unoptimized case where a field has few values, but they
> aren't all known in advance.  It would be nice if we could come up with a
> field type for that.

AHH OK.

> Nevertheless, I think we cover a lot of cases if we support both an enum type
> and a one-value-per-doc type.

There are so many types to support... it's not clear how best to
design it.

EG "multi-valued" vs "single-valued" should be orthogonal to most
other attrs of a field's type.

Mike
