Re: Types and Schemas (was "Sort cache file format")

Michael McCandless Mon, 13 Apr 2009 06:45:48 -0700

On Sun, Apr 12, 2009 at 5:04 PM, Marvin Humphrey <[email protected]> wrote:
>> I think Lucene could continue to merge yet isolate information
>> (subdivision, subclassing).  At least I sure hope so :)
>>
>> > I see why subdividing options might be useful in Lucene, but I'm not
>> > sure it's necessary for Lucy.
>>
>> It's all still hazy to me :) Hopefully once we talk about it enough
>> I'll get some clarity...
>
> Actually, what we probably need are Python bindings so that you can start
> playing around.  :)


That'd be nice but I'm quite hurting for time these days ;)  Sudden
bursts of innovation all over the place...

> I've been trying to clean up Boilerplater enough so that porting
> Boilerplater::Binding::Perl to Boilerplater::Binding::Python would be a
> reasonable undertaking.  Perl's C API and object model are so complicated that
> other languages will probably be a lot easier -- but right now, it's not
> apparent from Boilerplater's API how you would get started.

OK.  It would also be good to have > 1 host language driving the
design... to keep things generic/portable.

>> it is sort of scary that we're inventing a type system.
>
> What's scary is that Java Lucene *has* a type system but won't admit it.

Yah.  In fact Lucene is "weakly typed", like Tcl.  We gleefully,
secretly "merge" one type with another.  I'd be happy to get to strong
but dynamic typing (ie the write once schema).

>> EG there are many things the FieldType should somehow tell us:
>>
>>   * How does FieldSpec model "multi-valued" fields? Is there a
>>     boolean in the base class?
>
> Because Lucy's Doc objects will be hash based, there will *never* be a case
> where the same field has two "values" per se within the same doc.
>
> However, it's fine if we support compound types via specific FieldType
> subclasses, e.g. Float32ArrayType, or StringArrayType.

I see -- does KS support multi-valued (compound) types today?  For
which "types"?  And I imagine for such types, "sortable" is not
allowed (yet "sortable" is set at the top FieldSpec, right?)?

> It's also important to distinguish between "multi-valued" and the
> "multi-token" FullTextType.  FullTextType fields are tokenized within the
> index, but in the context of the doc reader, they only have one string
> "value".  Note, however, that you cannot sort on a FullTextType field in KS.

So if I want to index & sort by "title" field, I make 2 separate fields?

>>   * "Has only one token" -- I guess this is implied by the class (ie
>>     only FullTextType may have > 1 token)
>
> For the near-to-middle-term future, yes -- FullTextType is the only
> multi-token, single-valued type.
>
> Looking down the road, I suppose other types like Int32ArrayType could have
> more than one "token", but it wouldn't be an ordinary string "token".

OK

>>   * Open vs closed (known set of values) enums
>
> It would be nice to add this later.  I don't think it's a high priority, since
> it's an optimization.

You mean you'd start with "open" enums?

>>   * Sortable
>
> I think this belongs in the base class -- that's where KS has it now.  That
> way, we can perform the following test, regardless of what the type is.
>
>   if (FieldType_Sortable(field_type)) {
>        /* Build sort cache. */
>        ...
>   }

Yeah... except multi-valued (compound) types would disable this, I
guess.  Though Lucene users seem to hit this limitation enough to make
it relaxable... and customize how SortCache gets created.

>>   * nulls sort on top or bottom
>
> This would be individual to each sort comparator.  Note that we might want to
> use a different sort comparator for NOT NULL fields for efficiency's sake,
> which complicates making the comparator a method on FieldSpec.

Yes, we're iterating on this now in LUCENE-831.  Though I wonder if
this ought to be the realm of source code specialization...
multiplying out all the combinations of "single comparator or not",
"scoring or not", "track max score or not", "string index may have
nulls or not", in Lucene's "true" sources (vs generated sources)
starts to get crazy.  Soon we'll also multiply in "docIDs guaranteed
to arrive in order to the collector, or not" as well.
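
For illustration, here is a minimal C sketch of a comparator with its
NULL policy baked in (NULLs last in this case).  The function names are
hypothetical and not taken from KS, Lucy, or Lucene; a field known to be
NOT NULL could skip the NULL checks and call strcmp directly, which is
the kind of specialization discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compare two string values where NULL means
 * "this doc has no value for the field".  NULLs sort after every
 * non-NULL value. */
int
compare_nulls_last(const char *a, const char *b) {
    if (a == NULL && b == NULL) { return 0; }
    if (a == NULL) { return 1; }     /* a sorts after b */
    if (b == NULL) { return -1; }    /* a sorts before b */
    return strcmp(a, b);
}

/* qsort-compatible wrapper for an array of (const char *). */
int
compare_nulls_last_qsort(const void *va, const void *vb) {
    assert(va != NULL && vb != NULL);
    const char *a = *(const char *const *)va;
    const char *b = *(const char *const *)vb;
    return compare_nulls_last(a, b);
}
```

Passing compare_nulls_last_qsort to qsort() over an array of string
pointers leaves the non-NULL values in strcmp order, followed by the
NULL entries.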

> My general inclination is to have NULLs sort towards the end of the array.
>
>>   * Omit norms, omit TFAP
>
> I'm putting this off for now.  It will be addressed when we refactor for
> flexible indexing.

OK.  These would seem to live nicely under FullTextType... oh actually
maybe not, because presumably I can index single-valued fields (the
equivalent of NOT_ANALYZED in Lucene).  EG an Int32Type may in fact be
indexed, and I would at that point want to put omit norms/TFAP there.
Hmmm, cross cutting concerns.  Maybe sub-typing is needed...

>>   * Binary or not (I guess BlobType <-> binary)
>
> BlobType is one binary type, but I propose adding others, e.g. Int32Type.
>
> Binary() should be an abstract method on the base class.  It shouldn't be a
> boolean flag member, because it's not something that can be switched up within
> a class.

OK.
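
To make the "abstract method rather than boolean member" idea concrete,
here is a rough C sketch using a hand-rolled vtable.  It is not
Boilerplater output; the struct layout and every name in it are
assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct FieldType FieldType;

/* Per-class data.  binary() has no base-class implementation, so every
 * concrete class has to supply one ("abstract" in spirit). */
typedef struct FieldTypeVTable {
    const char *class_name;
    bool (*binary)(const FieldType *self);
} FieldTypeVTable;

/* Per-instance data.  stored/sortable are ordinary boolean members
 * because they may differ from one field to the next. */
struct FieldType {
    const FieldTypeVTable *vtable;
    bool stored;
    bool sortable;
};

bool BlobType_binary(const FieldType *self)     { (void)self; return true; }
bool FullTextType_binary(const FieldType *self) { (void)self; return false; }

const FieldTypeVTable BLOBTYPE     = { "BlobType", BlobType_binary };
const FieldTypeVTable FULLTEXTTYPE = { "FullTextType", FullTextType_binary };

/* Dispatch through the vtable: the answer depends on the class, never
 * on a flag that could be flipped on an individual instance. */
bool
FieldType_binary(const FieldType *self) {
    assert(self != NULL && self->vtable->binary != NULL);
    return self->vtable->binary(self);
}
```

With this layout, a caller can ask FieldType_binary(type) of any field
type without knowing its class, and no instance can disagree with its
class about the answer.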

>>   * Term vectors or not, positions, offsets
>
> Term vectors are unique to FullTextType, since it is the only multi-token
> field.  Right now in KS, it's a boolean member var in FullTextType.

Single-token indexed fields might want term vectors too?

>>   * Stored or not -- toplevel?
>
> Yes.  As a boolean member.

Makes sense.

>>   * CSF'd or not
>
> Right now, I'd say keep this out of core.

OK, and, merge with sort cache somehow.  For most types they are one
and the same.

>>   * ValueSource is XYZ for this field
>
> I'd like to avoid ValueSource if we can.  I think it's better to add real
> binary types like Int32Type, DateStamp32, and so on -- instead of faking them
> with strings.

Well, that's UninversionValueSource you're thinking of (faking w/
strings).

But, yes, it's not good that ValueSource has type switching internal
to itself..... vs, you look up the FieldType for the field and use it
to "switch".

>>   * I will use RangeFilter on this field
>
> The "sortable" boolean member var fills this need, no?

They are different?  Eg you'll add aggregates (Trie*) to your index
for fast range constraints, but for sorting you just need a sort cache
computed.

>>   * Analyzer to use (exposed only FullTextType)
>
> Analyzer should be a required constructor arg to FullTextType.

OK

>>   * Extensibility -- so app can enroll new attrs / make new type
>>     subclasses
>
> So long as the core performs inheritance checks rather than absolute class
> membership checks, subclasses will work fine.

OK.
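
A small sketch of the difference between the two kinds of check, under
the assumption that each class records a pointer to its parent.  The
ClassInfo struct and the class names below are invented for the
example; they are not KS or Lucy API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical class descriptor: a name plus a link to the parent. */
typedef struct ClassInfo {
    const char *name;
    const struct ClassInfo *parent;
} ClassInfo;

const ClassInfo FIELDTYPE_CLASS = { "FieldType",    NULL };
const ClassInfo FULLTEXT_CLASS  = { "FullTextType", &FIELDTYPE_CLASS };
const ClassInfo BLOB_CLASS      = { "BlobType",     &FIELDTYPE_CLASS };
/* An application-defined subclass the core has never heard of. */
const ClassInfo MY_TEXT_CLASS   = { "MyTextType",   &FULLTEXT_CLASS };

/* Absolute membership check: rejects every subclass. */
bool
ClassInfo_is_exactly(const ClassInfo *klass, const ClassInfo *target) {
    return klass == target;
}

/* Inheritance check: walk up the parent chain looking for ancestor. */
bool
ClassInfo_is_a(const ClassInfo *klass, const ClassInfo *ancestor) {
    assert(ancestor != NULL);
    for (; klass != NULL; klass = klass->parent) {
        if (klass == ancestor) { return true; }
    }
    return false;
}
```

Core code written against ClassInfo_is_a() accepts MY_TEXT_CLASS
wherever a FullTextType is expected, while ClassInfo_is_exactly() would
turn it away.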

>> Remind me again: do custom subclasses get enrolled into the global
>> hash in Lucy's core?  I know you had said it's a thread risk, ie, not
>> read only...
>
> Yes.
>
>> I'm still confused.  Say StandardAnalyzer is implemented in C; maybe
>> you'd name it Lucy_Analysis_StandardAnalyzer (since C doesn't support
>> namespaces you put prefixes in front).
>
> FWIW, the current implementation of Boilerplater only supports two level
> namespacing (with nicknames).  Outside of core, fully qualified code would
> look like this:
>
>  lucy_StandardAnalyzer *analyzer = lucy_StdAnalyzer_new();
>  lucy_Inversion *inversion = Lucy_StdAnalyzer_Transform_Text(charbuf);

What are the two levels here?  Level 1 is "StdAnalyzer", and Level 2
is "new" and "Transform_Text"?

> One of the constraints the two-level limitation imposes is that the last part
> of every core class name must be unique.  However, it makes for fully
> qualified C names that are are just cumbersome rather than unworkably long.

OK

>> Any time something in core wants to use that class, it refers to it by
>> name (and the C compiler/linker maps it), not via the global hash?
>
> For the most part.  A quick once-over of the KS code seems to indicate that
> the exceptions to that rule are all related to Deserialize() and Load().

OK

>> But for deserializing a core object, when the deserializer is
>> implemented in C, I agree you'd need a global lookup; basically
>> because you can't consult the OBJ's symbol table dynamically.  (If you
>> have a hosty deserializer, then it would "import lucy; lucy.XXX" to
>> find its classes).
>>
>> (But it seems like that global hash should be readonly-able).
>
> If we readonly that Hash, we can't add subclasses to it -- and therefore we
> won't be able to retrieve their deserializers.

I guess it's only subclasses implemented in C where this is important?

Because a hosty subclass's deserializer is using/relying on the host's
namespace to find classes by name.
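
To picture why a C-side deserializer needs a writable lookup, here is a
rough sketch of a name-to-deserializer registry.  It uses a small
linear table instead of a real hash to keep the example short, and all
names (Registry_add, Registry_fetch, MyAnalyzer) are hypothetical, not
KS or Lucy functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical deserializer signature: turn serialized data into an
 * object (represented here as an opaque pointer). */
typedef const void *(*deserialize_fn)(const char *data);

#define REGISTRY_MAX 32

typedef struct RegistryEntry {
    const char     *class_name;
    deserialize_fn  deserialize;
} RegistryEntry;

/* Global, mutable, and unsynchronized: this is the thread risk.
 * Freezing the table would also block registering new subclasses. */
static RegistryEntry registry[REGISTRY_MAX];
static size_t registry_size = 0;

/* Enroll a class by name.  Returns 0 on success, -1 if the table is
 * full. */
int
Registry_add(const char *class_name, deserialize_fn fn) {
    assert(class_name != NULL && fn != NULL);
    if (registry_size >= REGISTRY_MAX) { return -1; }
    registry[registry_size].class_name  = class_name;
    registry[registry_size].deserialize = fn;
    registry_size++;
    return 0;
}

/* Look up a deserializer by the class name found in the serialized
 * data.  Returns NULL for classes that were never enrolled. */
deserialize_fn
Registry_fetch(const char *class_name) {
    for (size_t i = 0; i < registry_size; i++) {
        if (strcmp(registry[i].class_name, class_name) == 0) {
            return registry[i].deserialize;
        }
    }
    return NULL;
}

/* Stand-in deserializer for a subclass implemented in C. */
const void *
MyAnalyzer_deserialize(const char *data) {
    return data;
}
```

A subclass written in C would call Registry_add() once at load time; a
subclass written in the host language would not need an entry if its
deserializer resolves class names through the host's own namespace.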

Mike
