On Mar 10, 2007, at 3:27 PM, Michael Busch wrote:

> - Introduce index format. Nicolas has already written a lot of code
>   in this regard!

I worry that going the interface route is going to be too
restrictive. When I looked at Nicolas's index format spec, I
immediately wanted to add an Analyzer and a bunch of other stuff to
it. Other people are going to want to add their own stuff.
My suggestion is that the top-level plan for the index be called
Schema, and that it be an abstract class. An email to the KS list
explaining the rationale behind KinoSearch's current version of this
is below my sig. Here are the API docs:
http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema.html
http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema/FieldSpec.html
It uses global field semantics, which Hoss won't be happy about. ;)
However, I'm grateful to Hoss for past critiques, as they've helped
me to refine and improve how Schema works. For instance, as of KS
0.20_02 you can introduce new field_name => FieldSpec associations to
KS at any time during indexing.
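
To make the shape concrete, here is a minimal sketch of a Schema as an
abstract class holding index-wide field_name => FieldSpec associations.
It is written in Python purely for illustration and is not the
KinoSearch API: add_field, spec_for, the analyzer method, BlogSchema,
and the FieldSpec attributes are all hypothetical names, and the rule
that an existing association cannot be rebound is an assumption of the
sketch.

```python
from abc import ABC, abstractmethod


class FieldSpec:
    """Hypothetical per-field settings; attribute names are illustrative."""

    def __init__(self, indexed=True, stored=True):
        self.indexed = indexed
        self.stored = stored


class Schema(ABC):
    """Top-level plan for an index: one FieldSpec per field name, index-wide."""

    def __init__(self):
        self._specs = {}

    @abstractmethod
    def analyzer(self):
        """Return a callable that turns field text into a list of tokens."""

    def add_field(self, name, spec):
        # New associations may be introduced at any time during indexing,
        # but (assumption of this sketch) an existing one is never rebound.
        if name in self._specs:
            raise ValueError(f"field {name!r} already has a FieldSpec")
        self._specs[name] = spec

    def spec_for(self, name):
        return self._specs[name]


class BlogSchema(Schema):
    """A concrete schema: subclasses supply the fields and the analyzer."""

    def __init__(self):
        super().__init__()
        self.add_field("title", FieldSpec())
        self.add_field("content", FieldSpec())

    def analyzer(self):
        return lambda text: text.lower().split()
```

Because the plan is a class rather than an interface, a user who wants
to hang extra behavior on it can add methods in a subclass without
waiting for the base definition to change.
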
It may be that adapting Lucene to use something like what KS uses
would be too radical a change. However, I believe that one reason
that flexible indexing has been in incubation so long is that the
current mechanism for attaching semantics to field names does not
scale as well as it might.
For instance, the logical extension of the current FieldInfos system
is to add booleans as described at
<http://wiki.apache.org/lucene-java/FlexibleIndexing>. However,
conflict resolution during segment
merging is going to present challenges. What happens when in one
segment 'content' has freq and in another segment it doesn't? Things
are so much easier if the posting format, once set, never changes.
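
The merge problem can be illustrated with a toy model. This is only a
sketch under stated assumptions: each segment is represented as a plain
dict of per-field boolean flags, and merge_field_flags and the flag
names are hypothetical, not Lucene's FieldInfos code.

```python
def merge_field_flags(segments):
    """Combine per-segment {field: {flag: bool}} maps.

    Returns the merged map plus a list of (field, flag, segment_number)
    entries for every flag that disagrees with an earlier segment.
    """
    merged = {}
    conflicts = []
    for seg_no, fields in enumerate(segments):
        for field, flags in fields.items():
            if field not in merged:
                merged[field] = dict(flags)
                continue
            for flag, value in flags.items():
                # setdefault keeps the first value seen for this flag.
                if merged[field].setdefault(flag, value) != value:
                    conflicts.append((field, flag, seg_no))
    return merged, conflicts


seg_a = {"content": {"freq": True, "positions": True}}
seg_b = {"content": {"freq": False, "positions": True}}
merged, conflicts = merge_field_flags([seg_a, seg_b])
```

Here conflicts ends up holding one entry for 'content' and 'freq', and
the merger has to decide what to do with it: drop data, synthesize it,
or refuse the merge. If the format for a field is fixed up front, that
list is always empty.
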

> It will include different interfaces for the different extension
> points (FieldsFormat, PostingFormat, DictionaryFormat).

KS still uses TermDocs and its children, but I'm about to go in and
replace them with PostingList. What subclass of Posting the
PostingList returns would be controlled by the FieldSpec.
At present KS allows you to attach both a Similarity and an Analyzer
to a field name via a FieldSpec subclass. I haven't quite figured
out how to attach a posting format. Should it return an object, like
FieldSpec's similarity() method does? Should it actually implement a
codec? Not sure yet. What do you think?
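
For discussion, here is a rough sketch of the first option, where the
FieldSpec hands back something that decides which Posting subclass a
PostingList produces. It is Python for illustration only and does not
reflect existing KS code: posting_class, ScorePosting,
FullTextFieldSpec, and the tuple-based raw entries are all hypothetical.

```python
class Posting:
    """Base posting: just a document number."""

    def __init__(self, doc_num):
        self.doc_num = doc_num


class ScorePosting(Posting):
    """Hypothetical richer posting that also carries a term frequency."""

    def __init__(self, doc_num, freq=1):
        super().__init__(doc_num)
        self.freq = freq


class FieldSpec:
    def posting_class(self):
        # Default: the plainest posting format.
        return Posting


class FullTextFieldSpec(FieldSpec):
    def posting_class(self):
        # A subclass overrides the posting format for its fields.
        return ScorePosting


class PostingList:
    """Iterates raw entries, wrapping each in the spec's Posting subclass."""

    def __init__(self, spec, raw_entries):
        self._posting_class = spec.posting_class()
        self._raw_entries = raw_entries

    def __iter__(self):
        for entry in self._raw_entries:
            yield self._posting_class(*entry)
```

In this version the FieldSpec only names the format; the decoding lives
in the Posting subclass. The codec alternative would move the
read/write logic into the FieldSpec itself.
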
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
--------------------------------------------------------------------
Begin forwarded message:
From: Marvin Humphrey <[EMAIL PROTECTED]>
Date: February 27, 2007 1:08:33 AM PST
To: KinoSearch discussion forum <[EMAIL PROTECTED]>
Subject: [KinoSearch] KinoSearch::Schema - Rationale
Reply-To: KinoSearch discussion forum <[EMAIL PROTECTED]>
Greets,
The thing about Lucene/KS indexes is that all the information you
need to read them can never be stored in the index files alone
because there's always that bleedin' Analyzer. You can look at a
Lucene index and see that it has fields with certain names that are
indexed, stored, etc, but you can't actually make sense of the
index's content unless you know everything about all Analyzers used
at index-time.
Since the Analyzer is not hooked to the index file, but has to be
created anew in every app that interacts with the index, it's often
wrong, and analyzer mismatches are a constant source of confusion,
frustration, and error for users.
KinoSearch::Schema solves the Analyzer problem. Not only that, but
it sets the stage for attaching ever more semantic meaning to field
names. Not just booleans like "I'm indexed" and "I'm stored", but
behaviors, objects... For example, each field may now be associated
with its own Similarity implementation, which affects scoring. In
the reasonably near future, the plan is to allow each FieldSpec to
define a comparison sub which determines the sort order of terms.
And so on.
Schema is somewhat akin to SWISH's index configuration file, which
can hold regexes, stoplists, and so on. In fact, an earlier
incarnation of Schema was primarily concerned with reading/writing a
configuration file. It attempted to solve the Lucene Analyzer
problem by storing EVERYTHING, including a class name for the
Analyzer; at search-time, the Analyzer object was created by calling
a no-arg constructor.
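
Roughly, that earlier design worked like the following sketch. It uses
Python and JSON as stand-ins, and none of it is the actual code:
save_config, load_analyzer, the known_classes lookup table, and
WhitespaceAnalyzer are hypothetical.

```python
import json


class WhitespaceAnalyzer:
    """Stand-in analyzer; it must be constructible with no arguments."""

    def __init__(self):
        pass

    def analyze(self, text):
        return text.lower().split()


def save_config(field_names, analyzer):
    # Index time: record the analyzer's class name next to the field list.
    return json.dumps({
        "fields": field_names,
        "analyzer_class": type(analyzer).__name__,
    })


def load_analyzer(config_text, known_classes):
    # Search time: look the class up by name and call it with no arguments.
    config = json.loads(config_text)
    analyzer_class = known_classes[config["analyzer_class"]]
    return analyzer_class()
```

The restriction is visible in the last line: only the class name
survives the round trip, so any analyzer that needs constructor
arguments cannot be rebuilt this way.
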
I ash-canned that design after trying to write docs explaining the
bit about the no-arg constructor -- too confusing, not Perlish, and
ultimately, less direct than allowing the user to write arbitrary
code. It's hard to maintain security, though, when you allow data
files to contain code. (I'm sure SWISH manages it; I just don't want
the same headache.)
The thinking behind KinoSearch::Schema is, if you're going to create
an index configuration file that has code in it, why not go all the
way, and make it a Perl module? It's the best of all worlds. You
get to leverage the power of the language itself when defining your
index structure, but it's also a self-contained, complete spec that
both your indexing app and your search app can load.
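
The idea of one spec loaded by both apps can be sketched as follows,
with Python standing in for Perl. This is a toy under stated
assumptions, not KinoSearch code: SiteSchema, build_index, search, and
the in-memory dict index are all hypothetical.

```python
class SiteSchema:
    """Hypothetical schema: field names plus the one analyzer both sides use."""

    fields = ("title", "content")

    @staticmethod
    def analyze(text):
        tokens = (tok.strip(".,!?").lower() for tok in text.split())
        return [tok for tok in tokens if tok]


def build_index(schema, docs):
    # Indexing app: tokenize every schema field with the schema's analyzer.
    index = {}
    for doc_num, doc in enumerate(docs):
        for field in schema.fields:
            for token in schema.analyze(doc.get(field, "")):
                index.setdefault((field, token), set()).add(doc_num)
    return index


def search(schema, index, field, query):
    # Search app: the same analyzer is applied to the query, so query
    # terms are normalized exactly as the indexed terms were.
    hits = None
    for token in schema.analyze(query):
        docs = index.get((field, token), set())
        hits = docs if hits is None else hits & docs
    return sorted(hits or [])
```

Since both functions take the analyzer from the schema, neither app can
accidentally construct a different one, which is the mismatch described
above.
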
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/