Re: Flexible indexing

Nicolas Lalevée Tue, 13 Mar 2007 01:04:39 -0800

On Monday 12 March 2007 21:34, Marvin Humphrey wrote:
> On Mar 10, 2007, at 3:27 PM, Michael Busch wrote:
> > - Introduce index format. Nicolas has already written a lot of code
> > in this regard!
>
> I worry that going the interface route is going to be too
> restrictive.  When I looked at Nicolas's index format spec, I
> immediately wanted to add an Analyzer and a bunch of other stuff to
> it.  Other people are going to want to add their own stuff.
>
> My suggestion is that the top-level plan for the index be called
> Schema, and that it be an abstract class.  An email to the KS list
> explaining the rationale behind KinoSearch's current version of this
> is below my sig.  Here are the API docs:
>
>    http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema.html
>    http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema/FieldSpec.html
>
> It uses global field semantics, which Hoss won't be happy about.  ;)
> However, I'm grateful to Hoss for past critiques, as they've helped
> me to refine and improve how Schema works.  For instance, as of KS
> 0.20_02 you can introduce new field_name => FieldSpec associations to
> KS at any time during indexing.
>
> It may be that adapting Lucene to use something like what KS uses
> would be too radical a change.  However, I believe that one reason
> that flexible indexing has been in incubation so long is that the
> current mechanism for attaching semantics to field names does not
> scale as well as it might.
>
> For instance, the logical extension of the current FieldInfos system
> is to add booleans as described at
> <http://wiki.apache.org/lucene-java/FlexibleIndexing>.  However,
> conflict resolution during segment
> merging is going to present challenges.  What happens when in one
> segment 'content' has freq and in another segment it doesn't?  Things
> are so much easier if the posting format, once set, never changes.


Here you raise another issue. The "IndexFormat" in my submitted patch only 
covers how data is stored: the field data and the term/posting data. 
You are talking about how the terms and postings are created before they are 
stored in the index. I agree that the behaviour is not clearly defined when 
different kinds of indexing options are used for the same field. 
This produces bugs like LUCENE-766. And I think I am still confused about it 
myself because, thinking again about the attached patch, the term vector data 
will be computed even if the user has specified TermVector.NO.

This issue needs to be discussed of course, but this is related to the 
implementation of a specific new format proposed here 
<http://wiki.apache.org/lucene-java/FlexibleIndexing> and the design of the 
Field constructor.
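
To make the ambiguity above concrete, here is a minimal sketch of how 
conflicting options for one field could be detected across segments. It is 
not Lucene code: the class FieldOptionConflicts, the Feature enum and the 
findConflicts method are all hypothetical names, and a segment is modelled 
simply as a map from field name to the features written for it.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model, not Lucene API: each segment records which posting
// features were written for each field.
class FieldOptionConflicts {

    enum Feature { FREQ, POSITIONS }

    // Returns the names of fields whose feature sets differ between segments.
    static Set<String> findConflicts(List<Map<String, EnumSet<Feature>>> segments) {
        Map<String, EnumSet<Feature>> firstSeen = new HashMap<>();
        Set<String> conflicts = new TreeSet<>();
        for (Map<String, EnumSet<Feature>> segment : segments) {
            for (Map.Entry<String, EnumSet<Feature>> field : segment.entrySet()) {
                // Remember the first feature set seen for this field name.
                EnumSet<Feature> previous =
                        firstSeen.putIfAbsent(field.getKey(), field.getValue());
                if (previous != null && !previous.equals(field.getValue())) {
                    conflicts.add(field.getKey());
                }
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        Map<String, EnumSet<Feature>> segmentA = new HashMap<>();
        segmentA.put("content", EnumSet.of(Feature.FREQ));
        segmentA.put("title", EnumSet.of(Feature.FREQ));

        Map<String, EnumSet<Feature>> segmentB = new HashMap<>();
        segmentB.put("content", EnumSet.noneOf(Feature.class)); // no freq here
        segmentB.put("title", EnumSet.of(Feature.FREQ));

        // prints [content]
        System.out.println(findConflicts(Arrays.asList(segmentA, segmentB)));
    }
}
```

A check like this only reports the conflict. What a merge should do with 
'content' afterwards is exactly the open question.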

> > It will include different interfaces for the different extension
> > points (FieldsFormat, PostingFormat, DictionaryFormat).
>
> KS still uses TermDocs and its children, but I'm about to go in and
> replace them with PostingList.  What subclass of Posting the
> PostingList returns would be controlled by the FieldSpec.
>
> At present KS allows you to attach both a Similarity and an Analyzer
> to a field name via a FieldSpec subclass.  I haven't quite figured
> out how to attach a posting format.  Should it return an object, like
> FieldSpec's similarity() method does?  Should it actually implement a
> codec?  Not sure yet.  What do you think?

The posting format defines how you want to store the term data, and so it 
defines how to search. I don't think it is a good idea to mix different kinds 
of posting formats in the same index. It would give Lucene the responsibility 
of managing different kinds of readers instantiating different kinds of 
termEnums and so on. I don't really know what all the impacts of such a 
feature would be, but it might be quite difficult to manage correctly. But 
since the posting format can be redefined by the user, the user can implement 
a custom format which internally handles different kinds of data associated 
with terms.
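
As a rough illustration of that last point, the sketch below shows one 
format that handles two kinds of posting data internally by writing a tag 
byte in front of each entry. Everything in it is an assumption made for 
illustration: the class name TaggedPostingFormat, the tag values, the 
fixed-width layout, and the choice to report a frequency of 1 when none was 
stored. It is not taken from the patch or from Lucene.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical single format that stores two kinds of posting entries,
// distinguished by a leading tag byte.
class TaggedPostingFormat {

    static final byte DOC_ONLY = 0;      // entry holds only a document number
    static final byte DOC_AND_FREQ = 1;  // entry holds document number and freq

    // Encodes one posting entry: tag byte, doc number, optional frequency.
    static byte[] encode(int doc, int freq, boolean withFreq) {
        ByteBuffer buffer = ByteBuffer.allocate(withFreq ? 9 : 5);
        buffer.put(withFreq ? DOC_AND_FREQ : DOC_ONLY);
        buffer.putInt(doc);
        if (withFreq) {
            buffer.putInt(freq);
        }
        return buffer.array();
    }

    // Decodes one entry into {doc, freq}; freq defaults to 1 when not stored
    // (an assumption of this sketch).
    static int[] decode(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        byte kind = buffer.get();
        int doc = buffer.getInt();
        int freq = (kind == DOC_AND_FREQ) ? buffer.getInt() : 1;
        return new int[] { doc, freq };
    }

    public static void main(String[] args) {
        // prints [7, 3]
        System.out.println(Arrays.toString(decode(encode(7, 3, true))));
        // prints [7, 1]
        System.out.println(Arrays.toString(decode(encode(7, 3, false))));
    }
}
```

The reader always sees the same {doc, freq} shape, so the rest of the search 
code does not need to know which kind of entry was stored.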

-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com

>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
> --------------------------------------------------------------------
>
> Begin forwarded message:
> From: Marvin Humphrey <[EMAIL PROTECTED]>
> Date: February 27, 2007 1:08:33 AM PST
> To: KinoSearch discussion forum <[EMAIL PROTECTED]>
> Subject: [KinoSearch] KinoSearch::Schema - Rationale
> Reply-To: KinoSearch discussion forum <[EMAIL PROTECTED]>
>
> Greets,
>
> The thing about Lucene/KS indexes is that all the information you
> need to read them can never be stored in the index files alone
> because there's always that bleedin' Analyzer.  You can look at a
> Lucene index and see that it has fields with certain names that are
> indexed, stored, etc, but you can't actually make sense of the
> index's content unless you know everything about all Analyzers used
> at index-time.
>
> Since the Analyzer is not hooked to the index file, but has to be
> created anew in every app that interacts with the index, it's often
> wrong, and analyzer mismatches are a constant source of confusion,
> frustration, and error for users.
>
> KinoSearch::Schema solves the Analyzer problem.  Not only that, but
> it sets the stage for attaching ever more semantic meaning to field
> names.  Not just booleans like "I'm indexed" and "I'm stored", but
> behaviors, objects...  For example, each field may now be associated
> with its own Similarity implementation, which affects scoring.  In
> the reasonably near future, the plan is to allow each FieldSpec to
> define a comparison sub which determines the sort order of terms.
> And so on.
>
> Schema is somewhat akin to SWISH's index configuration file, which
> can hold regexes, stoplists, and so on.  In fact, an earlier
> incarnation of Schema was primarily concerned with reading/writing a
> configuration file.  It attempted to solve the Lucene Analyzer
> problem by storing EVERYTHING, including a class name for the
> Analyzer; at search-time, the Analyzer object was created by calling
> a no-arg constructor.
>
> I ash-canned that design after trying to write docs explaining the
> bit about the no-arg constructor -- too confusing, not Perlish, and
> ultimately, less direct than allowing the user to write arbitrary
> code.  It's hard to maintain security, though, when you allow data
> files to contain code.  (I'm sure SWISH manages it, I just don't want
> the same headache).
>
> The thinking behind KinoSearch::Schema is, if you're going to create
> an index configuration file that has code in it, why not go all the
> way, and make it a Perl module?  It's the best of all worlds.  You
> get to leverage the power of the language itself when defining your
> index structure, but it's also a self-contained, complete spec that
> both your indexing app and your search app can load.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>
> _______________________________________________
> KinoSearch mailing list
> [EMAIL PROTECTED]
> http://www.rectangular.com/mailman/listinfo/kinosearch
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


