Re: Flexible indexing

Marvin Humphrey Tue, 13 Mar 2007 20:19:43 -0800


On Mar 13, 2007, at 2:03 AM, Nicolas Lalevée wrote:

At present KS allows you to attach both a Similarity and an Analyzer
to a field name via a FieldSpec subclass.  I haven't quite figured
out how to attach a posting format.  Should it return an object, like
FieldSpec's similarity() method does?  Should it actually implement a
codec?  Not sure yet.  What do you think?

The posting format defines how you want to store the terms data, so it defines
how to search.

Hmm. I'm talking about the stuff currently held in .frq, .prx, and .fNNN in Lucene. That's not the terms data. I think we're miscommunicating.

KinoSearch 0.20_01 and forward move the postings data from .frq, .prx, and .fNNN to a single file per field, with the extension .pNNN. The philosophy of KS 0.20 is to have all binary "files" be decodable by launching a single iterator at the front of the file and having it read to the end. (They're actually virtual files within the compound file -- KS only supports the compound format.) That translates to one posting format per file.

I don't think it is a good idea to mix different kinds of
posting formats in the same index.

Allowing different fields to use different posting formats is very important.

When matching a value in a "category" field, all you might care about is whether the doc hits or not -- you don't care about freq, boost, per-position boost, any of that. The posting format for "category" would thus specify "doc num only", and the .pNNN file would consist entirely of a sequence of delta-doc_num VInts.
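
To make the "doc num only" case concrete, here is a minimal Python sketch of such a stream. It is an illustration, not KinoSearch code: the function names are hypothetical, and it assumes the Lucene-style VInt layout (7 data bits per byte, low-order bits first, high bit set when more bytes follow). The decoder is a single iterator that starts at the front and reads to the end.

```python
def encode_vint(n):
    """Encode a non-negative int as a Lucene-style VInt (assumed layout)."""
    if n < 0:
        raise ValueError("VInt values must be non-negative")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)                      # final byte, continuation bit clear
    return bytes(out)


def encode_doc_nums(doc_nums):
    """Encode ascending doc numbers as a stream of delta VInts."""
    out = bytearray()
    prev = 0
    for doc_num in doc_nums:
        if doc_num < prev:
            raise ValueError("doc numbers must be ascending")
        out += encode_vint(doc_num - prev)
        prev = doc_num
    return bytes(out)


def iter_doc_nums(data):
    """Decode the stream front to back, yielding absolute doc numbers."""
    doc_num = 0
    pos = 0
    while pos < len(data):
        delta = 0
        shift = 0
        while True:
            byte = data[pos]
            pos += 1
            delta |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        doc_num += delta
        yield doc_num
```

For example, `encode_doc_nums([3, 7, 300])` stores the deltas 3, 4, and 293, and `list(iter_doc_nums(...))` on the result recovers the original doc numbers.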

In contrast, a "content" field scoring HTML source material might specify a posting format that includes boost-per-position. Each record would have one doc_num, one freq, several positions, and several boosts. The file would be much more complex.
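
A rough sketch of what one such record might look like, again in Python. The byte layout here is an assumption made up for illustration (delta doc_num, freq, delta-encoded positions, then one boost byte per position); it is not the actual .pNNN layout, and `RichPosting` and `encode_rich_postings` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RichPosting:
    doc_num: int
    positions: List[int]  # ascending token positions within the doc
    boosts: List[int]     # one boost byte (0-255) per position


def encode_vint(n):
    """Encode a non-negative int with 7 data bits per byte, low bits first."""
    if n < 0:
        raise ValueError("VInt values must be non-negative")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_rich_postings(postings):
    """Encode records as: delta doc_num, freq, delta positions, boost bytes."""
    out = bytearray()
    prev_doc = 0
    for p in postings:
        if len(p.positions) != len(p.boosts):
            raise ValueError("need exactly one boost per position")
        out += encode_vint(p.doc_num - prev_doc)
        out += encode_vint(len(p.positions))  # freq
        prev_pos = 0
        for pos in p.positions:
            out += encode_vint(pos - prev_pos)
            prev_pos = pos
        out += bytes(p.boosts)                # raises if a boost is not 0-255
        prev_doc = p.doc_num
    return bytes(out)
```

Under this assumed layout, a single doc with two positions already takes six bytes, where a doc-num-only format would need one.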

If you want to score based on "content", but constrain results based on "category", you want to allow the simpler format for the "category" field, or you'll be wasting both disk and CPU.

It's actually possible to make multiple different posting formats work within a single monolithic postings file, but I opted to avoid that for the sake of simplicity and ease of debugging.

It will make it Lucene's responsibility to
manage different kinds of readers instantiating different kinds of termEnums,
and so on.

I've actually chosen to break up the term list into two separate files per field as well. This was a more costly and dubious choice, but was harmonious with KinoSearch's expansion of field semantics.

KS will soon allow users to determine sort order of term texts within each field. Keeping separate TermLists for each field means that I don't need to worry about either tracking field numbers/names or switching up comparators -- the TermList iterator terminates rather than proceeding on to another field like TermEnum does.
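
The per-field idea can be sketched in a few lines of Python. This is only a toy model with hypothetical names, not the KS TermList class: each field owns its own sorted list and its own sort key, so iteration never needs to track field names or swap comparators, and it simply ends when the field's terms run out.

```python
class TermList:
    """Toy per-field term list, ordered by that field's own sort key."""

    def __init__(self, terms, sort_key=None):
        # Deduplicate, then sort once using the field-specific key.
        self._terms = sorted(set(terms), key=sort_key)

    def __iter__(self):
        # Iteration stops at the end of this field; there is no next field.
        return iter(self._terms)


# One TermList per field, each with a different ordering (example data).
term_lists = {
    "title": TermList(["banana", "Apple", "cherry"], sort_key=str.lower),
    "price": TermList(["100", "9", "25"], sort_key=int),
}
```

Here `list(term_lists["title"])` iterates case-insensitively, while `list(term_lists["price"])` iterates in numeric order, and neither iterator knows the other field exists.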

I don't really know what the different kinds of impact of such a
feature will be, but it might be quite difficult to manage it correctly. But as
the posting format can be redefined by the user, he can implement a custom
format which internally handles different kinds of data associated with
terms.

If you guarantee that the posting format for a given field can never change by imposing global field semantics, it's not a big deal. If you break things up by field at both the file and the data structure level, it gets even easier.
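
One way to picture "global field semantics" is a registry that fixes a field's posting format the first time the field is defined and rejects any later attempt to change it. The Python sketch below is hypothetical (the `Schema` class, its methods, and the format names are invented for illustration and are not the KS API).

```python
class Schema:
    """Toy registry: a field's posting format is fixed once and for all."""

    def __init__(self):
        self._formats = {}

    def spec_field(self, name, posting_format):
        # The first definition wins; redefining with the same format is a no-op.
        existing = self._formats.setdefault(name, posting_format)
        if existing != posting_format:
            raise ValueError(
                f"field {name!r} already uses posting format {existing!r}"
            )

    def posting_format(self, name):
        return self._formats[name]


schema = Schema()
schema.spec_field("category", "doc_num_only")
schema.spec_field("content", "boost_per_position")
```

With that guarantee in place, a reader for a given field's file only ever has to handle one format.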


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


