On Fri, Apr 10, 2009 at 5:38 PM, Marvin Humphrey <[email protected]> wrote:

>> Ie, maybe we should instantiate Field.Index and tweak its options
>> (norms, omitTFAP, etc.) and that instance becomes the type of your
>> field (at least wrt indexing).
>
> That's sort of the idea, but Field.Index is pretty limited in its options.
> FieldSpec does a lot more.
>
> As we have seen, FieldSpec is responsible for associating Analyzers with
> fulltext fields.  In Lucene, you have to do that via IndexWriter, QueryParser,
> PerFieldAnalyzerWrapper, and probably a few others I've forgotten.
>
> FieldSpec is similarly responsible for associating Similarity instances and
> posting formats with field names (as appropriate).  Looking forward, sort
> comparators also belong in FieldSpec.  And so on.

Does FieldSpec subdivide the options?  E.g., the options about indexing
could live in their own class, with commonly used constants like "NO".

This was the motivation behind that comment about Lucene: because we
don't subdivide, stored-only fields suddenly have to figure out what to
do with the omitNorms and omitTFAP booleans; if we had a Field.Index.NO,
that'd be better.
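
Rough sketch of the kind of thing I mean -- hypothetical class and
constant names, nothing that exists in Lucene today:

// Hypothetical sketch: bundle the index-time options into one immutable
// object, so a stored-only field can just say IndexOptions.NO and never
// has to answer the omitNorms / omitTFAP questions at all.
public final class IndexOptions {

  // Common cases as shared constants.
  public static final IndexOptions NO =
      new IndexOptions(false, false, false);
  public static final IndexOptions ANALYZED =
      new IndexOptions(true, false, false);
  public static final IndexOptions ANALYZED_NO_NORMS =
      new IndexOptions(true, true, false);

  private final boolean indexed;
  private final boolean omitNorms;
  private final boolean omitTermFreqAndPositions;

  private IndexOptions(boolean indexed, boolean omitNorms,
                       boolean omitTermFreqAndPositions) {
    this.indexed = indexed;
    this.omitNorms = omitNorms;
    this.omitTermFreqAndPositions = omitTermFreqAndPositions;
  }

  public boolean isIndexed() { return indexed; }
  public boolean omitNorms() { return omitNorms; }
  public boolean omitTermFreqAndPositions() { return omitTermFreqAndPositions; }
}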

>> This is also sort of like the crazy static types one can create with
>> generics, ie, a "type" used to be something nice simple (int, float,
>> your own class, etc.) but now can be a rich object (instance) in
>> itself.
>
> The FieldSpec approach is actually quite similar to the "flyweight" pattern.
>
> From <http://en.wikipedia.org/wiki/Flyweight_pattern>:
>
>    A classic example usage of the flyweight pattern are the data structures for
>    graphical representation of characters in a word processor. It would be nice
>    to have, for each character in a document, a glyph object containing its font
>    outline, font metrics, and other formatting data, but it would amount to
>    hundreds or thousands of bytes for each character. Instead, for every
>    character there might be a reference to a flyweight glyph object shared by
>    every instance of the same character in the document; only the position of
>    each character (in the document and/or the page) would need to be stored
>    externally.

Yes.

> In Lucy, Docs will be hash-based (rather than array-based as in Lucene) -- so
> each field value will be associated with a single field name.  When the
> document is submitted for indexing, we use the field name to associate the
> value with a FieldSpec object.  Making that association attaches a bunch of
> traits and behaviors to the value: whether it should be indexed, how it should
> sort, whether it should be stored and how it should be encoded when it is
> stored, etc.
>
> So, the difference is that in Lucene, every field value is an object with an
> arbitrary set of traits and behaviors, while in Lucy, values for a given field
> will have a uniform type.

Well, in Lucene we could better decouple a Field's value from its
"extended type".  The type would still travel with the Field's value
(not live in a global schema as in KS), but it would be strongly
decoupled from the value and shared across Field instances.
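
Something along these lines -- hypothetical names, MyField and
ExtendedFieldType aren't real Lucene classes; only the Analyzer import
is the existing Lucene one:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

// Hypothetical: the "extended type" is an immutable object built once and
// shared by every Field instance that uses it; only the value is per-instance.
final class ExtendedFieldType {
  final boolean stored;
  final boolean indexed;
  final boolean omitNorms;
  final Analyzer analyzer;

  ExtendedFieldType(boolean stored, boolean indexed, boolean omitNorms,
                    Analyzer analyzer) {
    this.stored = stored;
    this.indexed = indexed;
    this.omitNorms = omitNorms;
    this.analyzer = analyzer;
  }
}

final class MyField {
  final String name;
  final String value;               // per-instance
  final ExtendedFieldType type;     // shared flyweight

  MyField(String name, String value, ExtendedFieldType type) {
    this.name = name;
    this.value = value;
    this.type = type;
  }
}

class Example {
  // One type object, reused by every "body" field the app ever creates.
  static final ExtendedFieldType BODY_TYPE =
      new ExtendedFieldType(false, true, false, new WhitespaceAnalyzer());

  static MyField newBodyField(String text) {
    return new MyField("body", text, BODY_TYPE);
  }
}

The value stays cheap; everything heavyweight lives in the shared type
object, which is basically the flyweight idea from that quote.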

>> [A class and an instance really should not be different,
>> anyway (prototype languages like Self don't differentiate).]
>
> Haven't used Self, but I've done plenty of JavaScript programming, so I think
> I can comment.

[A fun aside: Wow I just did a Google search for "javascript self" and
it offered up respelling to "javascript this" -- they've got one smart
respeller!]

> In general, I don't think there's a way to implement the "objects are classes"
> model without making every object gigantic.  I mean, you're not so much
> merging the "class" and "instance" concepts as you are eliminating all
> class data and shoving everything down into the object.  But sharing class
> data is highly efficient in many, many situations.  Why make every character
> in a word processing document a gigantic object?

I think you're overstating the implementation cost of not distinguishing
between "class" and "instance".  E.g., the bindings are shared, not
copied, from a parent (each obj keeps a reference to its parent).  A
single character would share tons from its parent and hold very little
itself.

That said, there is clearly some implementation cost versus a strict
class-vs-instance language.

> With regards to fields and field values in Lucene and Lucy: Allowing
> individual field values to define their own behaviors is insane.  There are
> many high level objects which must act on groups of values.   The "freedom"
> that fields have to "morph" isn't free, because the high level objects can no
> longer know so much about the values, and thus must interact with those values
> in more indirect and inefficient ways.

It definitely causes problems.  E.g., people may think they are turning
off norms, but in fact Lucene silently turns them back on if any
instance of that field is in the index with norms enabled.
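
For the record, here's roughly how that bites you -- a minimal sketch
against the Lucene 2.4-ish API (details from memory, so treat it as
approximate):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class NormsComeBack {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
                                         true, IndexWriter.MaxFieldLength.UNLIMITED);

    // Doc 1: "title" indexed *without* norms.
    Document d1 = new Document();
    d1.add(new Field("title", "hello world",
                     Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));
    writer.addDocument(d1);

    // Doc 2: same field name, but with norms enabled.
    Document d2 = new Document();
    d2.add(new Field("title", "hello again",
                     Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(d2);

    writer.optimize();   // merge everything into one segment
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    // Prints true: once any instance of the field has norms, the merged
    // segment keeps norms for all docs, silently overriding doc 1's choice.
    System.out.println(reader.hasNorms("title"));
    reader.close();
  }
}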

Lucene in fact implicitly has a global schema, in that when segments are
merged, or when docs are added into a single segment, the schema for
each document or segment is "merged" according to certain rules.  Once
your index is optimized, you have your global schema.

FWIW, I agree that a fixed, up-front schema is cleaner, but I don't
think we can up and change that about Lucene today.

>> > BTW, in KS svn trunk, Schemas are now fully serialized and written
>> > to the index as "schema_NNN.json".  Including Analyzers.  :)
>>
>> How do you serialize Analyzers again?
>
> Dump them to a JSON-izable data structure.  Include the class name so that you
> can pick a deserialization routine at load time.

You rely on the same namespace -> obj mapping being present at
deserialize time?  I.e., it's the caller's responsibility to import the
same modules, ensure the names "map" to the same objects (or at least
compatible ones) as were used during serialization, etc.

Though, for core objects, you would use the global name -> vtable
mapping that Lucy core maintains?  (I still don't fully understand why
Lucy needs that global hash -- this is what namespaces are for).
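
In Java terms I picture it boiling down to something like this
(hypothetical helper, not actual KS/Lucy code):

import java.util.Map;

// Hypothetical sketch of the deserialization side: the "_class" entry only
// helps if that name resolves, at load time, to the same (or a compatible)
// class that produced the dump.
public class AnalyzerLoader {
  public static Object load(Map<String, Object> dump) throws Exception {
    String className = (String) dump.get("_class");
    // Throws ClassNotFoundException if the caller hasn't made the class
    // available -- the "same namespace -> obj mapping" requirement.
    Class<?> clazz = Class.forName(className);
    // Hand the dump to a per-class factory; in the simplest case that
    // factory can just ignore the dump and call a no-arg constructor.
    return clazz.getMethod("load", Map.class).invoke(null, dump);
  }
}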

> Here's a PolyAnalyzer example with three sub-analyzers:
>
>      {
>         "_class" : "KinoSearch::Analysis::PolyAnalyzer",
>         "analyzers" : [
>            {
>               "_class" : "KinoSearch::Analysis::CaseFolder"
>            },
>            {
>               "_class" : "KinoSearch::Analysis::Tokenizer",
>               "pattern" : "\\w+(?:['\\x{2019}]\\w+)*"
>            },
>            {
>               "_class" : "KinoSearch::Analysis::Stemmer",
>               "language" : "en"
>            }
>         ]
>      }
>
> Stopalizers take up more space because they require serialization of the
> stoplist.

> In the current KS implementation, Analyzers are required to implement custom
> Dump() and Load() methods; Dump creates a JSON-izable data structure, while
> Load() creates a new object based on the contents of the dump.
>
>   Analyzer clone = analyzer.load(analyzer.dump());
>
> In the simplest case, a custom Analyzer subclass can implement a no-argument
> constructor and call that from Load().

OK, so if I've made a custom Tokenizer doing some funky Python code
instead of a regexp, I could simply implement dump/load to do the
right thing.
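
E.g., roughly this shape (rendered in Java here just for illustration;
the real custom Tokenizer would live in the binding language, and these
class/method names are made up):

import java.util.HashMap;
import java.util.Map;

// Made-up example of the dump/load contract for a custom analyzer whose
// config isn't a simple regexp pattern.
public class FunkyTokenizer {

  private final String splitMode;   // stand-in for the "funky" configuration

  public FunkyTokenizer(String splitMode) {
    this.splitMode = splitMode;
  }

  // Dump: produce a JSON-izable structure, including the class name so the
  // loader can find its way back here.
  public Map<String, Object> dump() {
    Map<String, Object> d = new HashMap<String, Object>();
    d.put("_class", getClass().getName());
    d.put("split_mode", splitMode);
    return d;
  }

  // Load: rebuild an equivalent object from a dump.
  public static FunkyTokenizer load(Map<String, Object> dump) {
    return new FunkyTokenizer((String) dump.get("split_mode"));
  }
}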

Mike
