Re: Flexible indexing

Marvin Humphrey Mon, 12 Mar 2007 12:35:21 -0800


On Mar 10, 2007, at 3:27 PM, Michael Busch wrote:

- Introduce index format. Nicolas has already written a lot of code in this regard!

I worry that going the interface route is going to be too restrictive. When I looked at Nicolas's index format spec, I immediately wanted to add an Analyzer and a bunch of other stuff to it. Other people are going to want to add their own stuff.

My suggestion is that the top-level plan for the index be called Schema, and that it be an abstract class. An email to the KS list explaining the rationale behind KinoSearch's current version of this is below my sig. Here are the API docs:

http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema.html
http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema/FieldSpec.html

It uses global field semantics, which Hoss won't be happy about. ;) However, I'm grateful to Hoss for past critiques, as they've helped me to refine and improve how Schema works. For instance, as of KS 0.20_02 you can introduce new field_name => FieldSpec associations to KS at any time during indexing.
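
To make the shape concrete, here is a minimal sketch of an abstract Schema with global field semantics. It is written in Python rather than Perl or Java, and every name in it (add_field, fetch_spec, analyzer_for, BlogSchema) is an assumption made up for illustration, not the KinoSearch or Lucene API. The one rule it tries to show is that a field name may be introduced at any time, but its meaning cannot change once set.

```python
from abc import ABC, abstractmethod


class FieldSpec:
    """Hypothetical per-field spec: flags plus an optional analyzer override."""

    def __init__(self, indexed=True, stored=True, analyzer=None):
        self.indexed = indexed
        self.stored = stored
        self.analyzer = analyzer  # callable: str -> list of tokens, or None


class Schema(ABC):
    """Top-level plan for an index. Applications subclass it."""

    # Subclasses declare their initial field_name -> FieldSpec mapping here.
    fields = {}

    def __init__(self):
        # Copy, so fields added at runtime do not leak into the class.
        self._specs = dict(type(self).fields)

    @abstractmethod
    def analyzer(self):
        """Return the default analyzer, a callable from str to tokens."""

    def add_field(self, name, spec):
        """Introduce a field at any time; redefining a name is refused."""
        existing = self._specs.get(name)
        if existing is not None and existing is not spec:
            raise ValueError(f"field {name!r} already has a FieldSpec")
        self._specs[name] = spec

    def fetch_spec(self, name):
        return self._specs[name]

    def analyzer_for(self, name):
        # A per-field analyzer wins; otherwise fall back to the default.
        return self._specs[name].analyzer or self.analyzer()


class BlogSchema(Schema):
    """Example subclass; the field names are invented."""

    fields = {"title": FieldSpec(), "content": FieldSpec(stored=False)}

    def analyzer(self):
        return str.split
```

Because the subclass is ordinary code, people who want to hang extra behavior off a field can add methods to their own FieldSpec subclass without waiting for a new interface.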

It may be that adapting Lucene to use something like what KS uses would be too radical a change. However, I believe that one reason that flexible indexing has been in incubation so long is that the current mechanism for attaching semantics to field names does not scale as well as it might.

For instance, the logical extension of the current FieldInfos system is to add booleans as described at <http://wiki.apache.org/lucene-java/FlexibleIndexing>. However, conflict resolution during segment merging is going to present challenges. What happens when in one segment 'content' has freq and in another segment it doesn't? Things are so much easier if the posting format, once set, never changes.
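
A small sketch of the merge problem, in Python for brevity. The data layout (a dict of per-segment field flags) and the function name merge_field_flags are assumptions for illustration only; this is not how FieldInfos is stored. It just shows that with per-segment booleans the merger has to detect disagreements and then decide what to do about them.

```python
def merge_field_flags(segments):
    """Merge per-segment {field: {flag: bool}} maps and report conflicts.

    `segments` maps a segment name to that segment's field flags.
    Returns (merged, conflicts), where conflicts lists (field, segment)
    pairs whose flags disagree with the first segment that defined
    the field.
    """
    merged = {}
    conflicts = []
    for seg_name, fields in segments.items():
        for field, flags in fields.items():
            if field not in merged:
                merged[field] = dict(flags)
            elif merged[field] != flags:
                # No obvious right answer here: drop freq everywhere?
                # Synthesize freq=1 for the segment that lacks it?
                conflicts.append((field, seg_name))
    return merged, conflicts


segments = {
    "_0": {"content": {"freq": True}},
    "_1": {"content": {"freq": False}},
}
merged, conflicts = merge_field_flags(segments)
```

With a format that is fixed once per field name, the conflicts list is empty by construction and the merge code never has to pick a policy.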

It will include different interfaces for the different extension points (FieldsFormat, PostingFormat, DictionaryFormat).

KS still uses TermDocs and its children, but I'm about to go in and replace them with PostingList. What subclass of Posting the PostingList returns would be controlled by the FieldSpec.

At present KS allows you to attach both a Similarity and an Analyzer to a field name via a FieldSpec subclass. I haven't quite figured out how to attach a posting format. Should it return an object, like FieldSpec's similarity() method does? Should it actually implement a codec? Not sure yet. What do you think?
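
For discussion, here is a rough sketch of the first option only, where the FieldSpec hands back something that decides the Posting subclass. It is Python, and posting_class, make_posting, FreqPosting and FreqFieldSpec are all invented names; nothing here is existing KS or Lucene code, and it does not try to settle the codec question.

```python
from dataclasses import dataclass


@dataclass
class Posting:
    """Simplest posting: just a document number."""
    doc_id: int


@dataclass
class FreqPosting(Posting):
    """Posting that also carries a within-document term frequency."""
    freq: int = 1


class FieldSpec:
    # Hypothetical hook: the spec names the Posting subclass for its field.
    posting_class = Posting

    def make_posting(self, doc_id, **attrs):
        return self.posting_class(doc_id, **attrs)


class FreqFieldSpec(FieldSpec):
    posting_class = FreqPosting


class PostingList:
    """Iterates raw records, yielding whatever Posting type the spec dictates."""

    def __init__(self, spec, records):
        self._spec = spec
        self._records = records  # list of dicts, standing in for decoded bytes

    def __iter__(self):
        for record in self._records:
            yield self._spec.make_posting(**record)


plain = list(PostingList(FieldSpec(), [{"doc_id": 7}]))
with_freq = list(PostingList(FreqFieldSpec(), [{"doc_id": 7, "freq": 3}]))
```

The caller of PostingList never names a Posting subclass; the choice lives entirely in the FieldSpec attached to the field name.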


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

--------------------------------------------------------------------

Begin forwarded message:
From: Marvin Humphrey <[EMAIL PROTECTED]>
Date: February 27, 2007 1:08:33 AM PST
To: KinoSearch discussion forum <[EMAIL PROTECTED]>
Subject: [KinoSearch] KinoSearch::Schema - Rationale
Reply-To: KinoSearch discussion forum <[EMAIL PROTECTED]>

Greets,

The thing about Lucene/KS indexes is that all the information you need to read them can never be stored in the index files alone because there's always that bleedin' Analyzer. You can look at a Lucene index and see that it has fields with certain names that are indexed, stored, etc, but you can't actually make sense of the index's content unless you know everything about all Analyzers used at index-time.

Since the Analyzer is not hooked to the index file, but has to be created anew in every app that interacts with the index, it's often wrong, and analyzer mismatches are a constant source of confusion, frustration, and error for users.

KinoSearch::Schema solves the Analyzer problem. Not only that, but it sets the stage for attaching ever more semantic meaning to field names. Not just booleans like "I'm indexed" and "I'm stored", but behaviors, objects... For example, each field may now be associated with its own Similarity implementation, which affects scoring. In the reasonably near future, the plan is to allow each FieldSpec to define a comparison sub which determines the sort order of terms. And so on.

Schema is somewhat akin to SWISH's index configuration file, which can hold regexes, stoplists, and so on. In fact, an earlier incarnation of Schema was primarily concerned with reading/writing a configuration file. It attempted to solve the Lucene Analyzer problem by storing EVERYTHING, including a class name for the Analyzer; at search-time, the Analyzer object was created by calling a no-arg constructor.

I ash-canned that design after trying to write docs explaining the bit about the no-arg constructor -- too confusing, not Perlish, and ultimately, less direct than allowing the user to write arbitrary code. It's hard to maintain security, though, when you allow data files to contain code. (I'm sure SWISH manages it, I just don't want the same headache).

The thinking behind KinoSearch::Schema is, if you're going to create an index configuration file that has code in it, why not go all the way, and make it a Perl module? It's the best of all worlds. You get to leverage the power of the language itself when defining your index structure, but it's also a self-contained, complete spec that both your indexing app and your search app can load.
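
Here is a toy illustration of that idea, with Python standing in for the Perl module. SCHEMA, lowercase_word_analyzer, build_index and search are all made-up names, and the "index" is just an in-memory dict; the only point is that index-time and search-time code both pull the analyzer from the same definition, so they cannot drift apart.

```python
import re


def lowercase_word_analyzer(text):
    """Split on word characters and lowercase every token."""
    return re.findall(r"\w+", text.lower())


# The shared spec: field name -> analyzer. In a real setup this would live
# in one module that both the indexing app and the search app import.
SCHEMA = {
    "title": lowercase_word_analyzer,
    "content": lowercase_word_analyzer,
}


def build_index(docs):
    """Index-time side: tokenize each field with the schema's analyzer."""
    index = {}
    for doc_id, doc in enumerate(docs):
        for field, text in doc.items():
            for token in SCHEMA[field](text):
                index.setdefault((field, token), set()).add(doc_id)
    return index


def search(index, field, query):
    """Search-time side: same analyzer, then AND the per-token hits."""
    hits = None
    for token in SCHEMA[field](query):
        ids = index.get((field, token), set())
        hits = set(ids) if hits is None else hits & ids
    return hits or set()


index = build_index([
    {"title": "Flexible Indexing"},
    {"title": "rigid indexing"},
])
```

A search app that rolled its own case-preserving analyzer would look up ("title", "Flexible"), which is not in this index, and silently return nothing. Going through the schema, search(index, "title", "FLEXIBLE indexing") finds the first document.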


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


