On Mar 10, 2007, at 3:27 PM, Michael Busch wrote:

> - Introduce index format. Nicolas has already written a lot of code in this regard!

I worry that going the interface route is going to be too restrictive. When I looked at Nicolas's index format spec, I immediately wanted to add an Analyzer and a bunch of other stuff to it. Other people are going to want to add their own stuff.

My suggestion is that the top-level plan for the index be called Schema, and that it be an abstract class. An email to the KS list explaining the rationale behind KinoSearch's current version of this is below my sig. Here are the API docs:

http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema.html
http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema/FieldSpec.html

It uses global field semantics, which Hoss won't be happy about. ;) However, I'm grateful to Hoss for past critiques, as they've helped me to refine and improve how Schema works. For instance, as of KS 0.20_02 you can introduce new field_name => FieldSpec associations to KS at any time during indexing.
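
To make the shape of this concrete, here is a rough sketch in Python of an abstract Schema that binds field names to FieldSpec objects and accepts new bindings at any time. It is not the KinoSearch API (which is Perl); add_field, fetch_fspec, and the FieldSpec attributes are hypothetical names chosen only for illustration.

```python
from abc import ABC, abstractmethod


class FieldSpec:
    """Hypothetical per-field spec: plain flags here, behaviors later."""

    def __init__(self, indexed=True, stored=True):
        self.indexed = indexed
        self.stored = stored


class Schema(ABC):
    """Hypothetical top-level plan for an index (global field semantics)."""

    def __init__(self):
        self._fields = {}

    @abstractmethod
    def analyzer(self):
        """Return the callable used to tokenize text for this index."""

    def add_field(self, name, spec):
        # A name may be bound at any time during indexing, but once it
        # is bound, the binding is never replaced by a different spec.
        if name in self._fields and self._fields[name] is not spec:
            raise ValueError(f"field {name!r} is already bound")
        self._fields[name] = spec

    def fetch_fspec(self, name):
        return self._fields[name]


class MySchema(Schema):
    """A concrete schema; both indexer and searcher would load this."""

    def analyzer(self):
        return lambda text: text.lower().split()
```

Because the analyzer lives on the schema subclass, an indexing app and a search app that load the same class cannot disagree about it.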

It may be that adapting Lucene to use something like what KS uses would be too radical a change. However, I believe that one reason that flexible indexing has been in incubation so long is that the current mechanism for attaching semantics to field names does not scale as well as it might.

For instance, the logical extension of the current FieldInfos system is to add booleans as described at <http://wiki.apache.org/lucene-java/FlexibleIndexing>. However, conflict resolution during segment merging is going to present challenges. What happens when in one segment 'content' has freq and in another segment it doesn't? Things are so much easier if the posting format, once set, never changes.
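
A toy model of the merge problem, in Python rather than Lucene code. The per-segment flag dictionaries and the merge_flags helper are hypothetical stand-ins for FieldInfos-style booleans, not anything that exists in Lucene.

```python
def merge_flags(segments):
    """Combine per-segment field flags and report any disagreement.

    Each segment is a dict mapping field name -> dict of boolean flags.
    """
    merged = {}
    conflicts = []
    for seg_no, fields in enumerate(segments):
        for name, flags in fields.items():
            if name not in merged:
                merged[name] = dict(flags)
            elif merged[name] != flags:
                # No obviously right answer here: drop freq everywhere,
                # or invent it for a segment that never recorded it?
                conflicts.append((name, seg_no))
    return merged, conflicts


seg_a = {"content": {"freq": True}}
seg_b = {"content": {"freq": False}}
merged, conflicts = merge_flags([seg_a, seg_b])
```

If every segment took its flags from one schema whose field bindings never change, the conflicts list would stay empty and the merger would need no resolution policy at all.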

> It will include different interfaces for the different extension points (FieldsFormat, PostingFormat, DictionaryFormat).

KS still uses TermDocs and its children, but I'm about to go in and replace them with PostingList. What subclass of Posting the PostingList returns would be controlled by the FieldSpec.

At present KS allows you to attach both a Similarity and an Analyzer to a field name via a FieldSpec subclass. I haven't quite figured out how to attach a posting format. Should it return an object, like FieldSpec's similarity() method does? Should it actually implement a codec? Not sure yet. What do you think?
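
For the sake of discussion, here is one way the "return an object" option might look, sketched in Python. This is speculation about a design that is not settled: posting_class, FreqPosting, and IdFieldSpec are hypothetical names, and nothing here reflects actual KS code.

```python
class Posting:
    """Base posting: records only a document number."""

    def __init__(self, doc_num):
        self.doc_num = doc_num


class FreqPosting(Posting):
    """Posting that also records within-document term frequency."""

    def __init__(self, doc_num, freq):
        super().__init__(doc_num)
        self.freq = freq


class FieldSpec:
    def posting_class(self):
        # Hypothetical hook: the spec, not the reader, picks the format.
        return FreqPosting


class IdFieldSpec(FieldSpec):
    """Spec for a field where match-only postings are enough."""

    def posting_class(self):
        return Posting


class PostingList:
    """Iterates raw entries, wrapping each in the spec's Posting class."""

    def __init__(self, fspec, raw_entries):
        self._cls = fspec.posting_class()
        self._raw = raw_entries

    def __iter__(self):
        for entry in self._raw:
            yield self._cls(*entry)
```

The alternative, having the FieldSpec implement the codec itself, would put the encode and decode logic on the spec instead of on a separate Posting class.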

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

--------------------------------------------------------------------

Begin forwarded message:
From: Marvin Humphrey <[EMAIL PROTECTED]>
Date: February 27, 2007 1:08:33 AM PST
To: KinoSearch discussion forum <[EMAIL PROTECTED]>
Subject: [KinoSearch] KinoSearch::Schema - Rationale
Reply-To: KinoSearch discussion forum <[EMAIL PROTECTED]>

Greets,

The thing about Lucene/KS indexes is that all the information you need to read them can never be stored in the index files alone because there's always that bleedin' Analyzer. You can look at a Lucene index and see that it has fields with certain names that are indexed, stored, etc, but you can't actually make sense of the index's content unless you know everything about all Analyzers used at index-time.

Since the Analyzer is not hooked to the index file, but has to be created anew in every app that interacts with the index, it's often wrong, and analyzer mismatches are a constant source of confusion, frustration, and error for users.
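
A small illustration of such a mismatch, sketched in Python with a toy in-memory index. The two analyzers and the build_index and search helpers are made up for this example and are not Lucene or KinoSearch code.

```python
def lowercasing_analyzer(text):
    return text.lower().split()


def whitespace_analyzer(text):
    return text.split()


def build_index(docs, analyzer):
    """Map each term to the set of doc numbers that contain it."""
    index = {}
    for doc_num, text in enumerate(docs):
        for term in analyzer(text):
            index.setdefault(term, set()).add(doc_num)
    return index


def search(index, query, analyzer):
    """Return the docs that match every analyzed query term."""
    hits = None
    for term in analyzer(query):
        postings = index.get(term, set())
        hits = postings if hits is None else hits & postings
    return hits or set()


docs = ["Flexible Indexing", "segment merging"]
index = build_index(docs, lowercasing_analyzer)

# Same analyzer at search time: the query term is lowercased and found.
matched = search(index, "Flexible", lowercasing_analyzer)

# Different analyzer at search time: "Flexible" was never indexed as-is.
mismatched = search(index, "Flexible", whitespace_analyzer)
```

Nothing in the index itself says which analyzer built it, so the second search fails silently rather than with an error.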

KinoSearch::Schema solves the Analyzer problem. Not only that, but it sets the stage for attaching ever more semantic meaning to field names. Not just booleans like "I'm indexed" and "I'm stored", but behaviors, objects... For example, each field may now be associated with its own Similarity implementation, which affects scoring. In the reasonably near future, the plan is to allow each FieldSpec to define a comparison sub which determines the sort order of terms. And so on.

Schema is somewhat akin to SWISH's index configuration file, which can hold regexes, stoplists, and so on. In fact, an earlier incarnation of Schema was primarily concerned with reading/writing a configuration file. It attempted to solve the Lucene Analyzer problem by storing EVERYTHING, including a class name for the Analyzer; at search-time, the Analyzer object was created by calling a no-arg constructor.

I ash-canned that design after trying to write docs explaining the bit about the no-arg constructor -- too confusing, not Perlish, and ultimately, less direct than allowing the user to write arbitrary code. It's hard to maintain security, though, when you allow data files to contain code. (I'm sure SWISH manages it, I just don't want the same headache).

The thinking behind KinoSearch::Schema is, if you're going to create an index configuration file that has code in it, why not go all the way, and make it a Perl module? It's the best of all worlds. You get to leverage the power of the language itself when defining your index structure, but it's also a self-contained, complete spec that both your indexing app and your search app can load.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


