On Wed, Apr 8, 2009 at 1:43 PM, Marvin Humphrey <[email protected]> wrote:
>> Can the same field name have more than one FieldSpec?
>
> Absolutely not.  I have always regarded that feature of Lucene as insane.
> It's like an SQL engine where any INSERT can silently ALTER TABLE.
>
> Unsurprisingly, the back-end code to support that insane feature is ginormous
> and convoluted and has been the source of many bugs over the years.
> LUCENE-1590 is only the latest.

Agreed, but this "flexibility" gets much love... (or at least any
possible steps away from it gets much anti-love).

>> What's otherwise the difference?
>
> [regarding per-field files vs. per-FieldSpec files]
>
> Lexicon objects in KinoSearch only iterate over one field's terms.
> PostingLists are single-field, too -- you can't seek to a term in another
> field.

I like that approach.  LUCENE-1458 makes the same change to Lucene.

> In the Lucene file format, each term in the term dictionary has to list a
> field number.  That wastes space compared to the KinoSearch format, since in
> KS there is only one field number per file and it is encoded within the file's
> name.

Got it.  LUCENE-1458 stops wasting space as well, but we still store
multiple fields in a single file.

> Furthermore, per-field lexicon and postings files are easier to troubleshoot.
> You just look at the directory listings and you get a pretty good idea of
> what's going on: files that are missing, that have zero length, etc, are big
> red flags.

Yes, nice transparency.

> If you're using the compound file format, you don't have to worry about
> running out of file descriptors, no matter how many virtual files you use --
> because a single descriptor is shared by all of them.  Thus, the only
> consequence of dividing up files per-FieldSpec rather than per-field is... to
> make the index more opaque and harder to debug. :(  It's a step backwards.

Ahhhhh... OK this finally clarifies my [silly] confusion: a FieldSpec
corresponds to more than one field.  It's like FieldInfos in Lucene?

So for Lucy you'll move away from KS's required "compound file
format", and allow either compound or not.

> However, I think Lucy will have to accept that loss of clarity in order to
> support the non-compound format. :(

Darned OSs.  Why can't they give us more file descriptors?

>> For numeric fields, can't you simply store the values instead of
>> separate ord + values?  EG sorting float/long/doubles is very easy.
>> Also for smaller types (byte, short) you can be much more compact with
>> only the values.
>
> Right now there are three core FieldSpec classes in KS:
>
>   FullTextField
>   StringField
>   BlobField
>
> Only "fulltext" fields are associated with Analyzers (and they are *always*
> analyzed).  Only "string" fields -- which are always single-value since they
> aren't analyzed -- allow sorting.
>
> The plan in the near future is to add additional field types such as these:
>
>   Int8Field
>   Int16Field
>   Int32Field
>   Int64Field
>   Float32Field
>   TimeStamp32Field
>   TimeStamp64Field
>
> I think the sort cache format for a given field would be dependent on the
> FieldSpec class.  So would the encode/decode for the full document storage,
> for that matter.

How to divide up fields is a tricky matter... Lucene is challenged in
this area now :)  We've divided up "Field" into three classes.

But you didn't answer the original question: will you simply use the
numeric value as the ord?

Also, for enumerated fields (small universe of possible values, eg
"yes" or "no" in the binary case) will you bit-pack the ords?

>> Does/will Lucy have column-stride field storage?
>
> I think these sort caches should be either column-stride or variable width.
> Which one would be most appropriate would be data dependent.  I don't know how
> important it is to expose an API to support multiple sort cache types.  I
> suppose that someone will need it eventually for maximum efficiency.

I think variable/fixed width storage vs column/row stride are orthogonal.

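
To make the bit-packing question above concrete, here is a minimal
sketch in Python.  It is not Lucy, KinoSearch, or Lucene code; the
names bits_needed, pack_ords and unpack_ord are hypothetical, and the
layout (little-endian, ords packed back to back) is just one
assumption among several reasonable ones.

```python
def bits_needed(num_unique_values):
    """Bits per ord for ords 0 .. num_unique_values - 1 (at least 1 bit)."""
    return max(1, (num_unique_values - 1).bit_length())


def pack_ords(ords, bits):
    """Pack one ord per document, back to back, `bits` bits each."""
    packed = 0
    for doc_id, ord_ in enumerate(ords):
        if ord_ >> bits:  # also rejects negative ords
            raise ValueError("ord %d does not fit in %d bits" % (ord_, bits))
        packed |= ord_ << (doc_id * bits)
    num_bytes = (len(ords) * bits + 7) // 8
    return packed.to_bytes(num_bytes, "little")


def unpack_ord(buf, bits, doc_id):
    """Read the ord for one document, touching only the bytes it spans."""
    start_bit = doc_id * bits
    first_byte = start_bit // 8
    last_byte = (start_bit + bits - 1) // 8
    chunk = int.from_bytes(buf[first_byte:last_byte + 1], "little")
    return (chunk >> (start_bit % 8)) & ((1 << bits) - 1)


# A "yes"/"no" field: two unique values, so one bit per document.
values = ["no", "yes"]
doc_ords = [0, 1, 1, 0, 1, 0, 0, 1, 1]
bits = bits_needed(len(values))
packed = pack_ords(doc_ords, bits)
decoded = [unpack_ord(packed, bits, d) for d in range(len(doc_ords))]
```

With nine documents the packed form takes 2 bytes, where a stack of
32-bit ords would take 36.  The cost is a shift and a mask on every
lookup.
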
> As I mentioned in this JIRA comment for Lucene column-stride fields this
> morning...
>
>   https://issues.apache.org/jira/browse/LUCENE-1231?focusedCommentId=12697076&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12697076
>
> ... I think the only purpose for column-stride fields should be to aid search.

I don't really understand what "to aid search" means -- that's very generic.

>> > For pure Unicode character data, a two-file format would work well.
>> >
>> >   * A stack of 64-bit file offsets into the character data file.
>> >   * Pure character data, with string lengths implied by the offset file.
>> >
>> > Now it turns out that our offsets have a funny property: they sort in the
>> > same order as the string data that they point to.  Because of that, we can
>> > theoretically use them as our sort-ords, and do away with the stack of
>> > 32-bit sort ords.
>>
>> Except, they take 2X the storage as int ords.
>
> True.  For a field with a small number of unique values, the 32-bit ords are
> a win.

Well, you can do better by bit packing.  For enumerated value fields
you usually can use far fewer than 32 bits per value.

>   // Less space with a small number of unique values:
>   32-bit ord => 64-bit file pointer => character data
>
>   // Less space with many unique values:
>   64-bit file pointer => character data
>
> However, I think that we should accommodate fields with few values using
> dedicated enum field types, e.g. a "StringEnumField" that requires you to
> declare all possible values in advance.

Is that too much of a fixed schema (requiring all values to be
declared in advance)?

> There's still an unoptimized case where a field has few values, but they
> aren't all known in advance.  It would be nice if we could come up with a
> field type for that.

AHH OK.

> Nevertheless, I think we cover a lot of cases if we support both an enum type
> and a one-value-per-doc type.

There are so many types to support...
it's not clear how best to design it.  EG "multi-valued" vs
"single-valued" should be orthogonal to most other attrs of a field's
type.

Mike
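
For reference, the two-file string format quoted above (a stack of
offsets plus pure character data) can be sketched in Python as
follows.  This is an illustration under assumptions, not KinoSearch
or Lucy code: build_sort_cache is a hypothetical name, the unique
values are assumed to be written in sorted UTF-8 byte order, and a
trailing sentinel offset is added so that the last string's length is
implied as well.

```python
def build_sort_cache(values_per_doc):
    """Return (offsets, data, doc_offsets) for one single-valued string field.

    offsets     -- start of each unique value in `data`, plus an end sentinel;
                   on disk each entry would be a 64-bit integer
    data        -- the unique values concatenated in sorted order, as UTF-8
    doc_offsets -- for each document, the start offset of its value
    """
    unique = sorted({v.encode("utf-8") for v in values_per_doc})
    data = bytearray()
    offsets = []
    start_of = {}
    for encoded in unique:
        start_of[encoded] = len(data)
        offsets.append(len(data))
        data += encoded
    offsets.append(len(data))  # sentinel: implies the length of the last value
    doc_offsets = [start_of[v.encode("utf-8")] for v in values_per_doc]
    return offsets, bytes(data), doc_offsets


docs = ["pear", "apple", "fig", "apple"]
offsets, data, doc_offsets = build_sort_cache(docs)

# Because the values were written in sorted order, comparing two offsets
# gives the same answer as comparing the strings they point to.
by_offset = sorted(range(len(docs)), key=lambda d: doc_offsets[d])
by_string = sorted(range(len(docs)), key=lambda d: docs[d].encode("utf-8"))
```

One caveat of this sketch: an empty string shares its start offset
with the value that follows it, so the offsets are non-decreasing
rather than strictly increasing in that case.
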
