On Mar 13, 2007, at 2:03 AM, Nicolas Lalevée wrote:

At present KS allows you to attach both a Similarity and an Analyzer
to a field name via a FieldSpec subclass.  I haven't quite figured
out how to attach a posting format.  Should it return an object, like
FieldSpec's similarity() method does?  Should it actually implement a
codec?  Not sure yet.  What do you think?

The posting format defines how you want to store the terms data, so it
also defines how to search.

Hmm. I'm talking about the stuff currently held in .frq, .prx, and .fNNN in Lucene. That's not the terms data. I think we're miscommunicating.

KinoSearch 0.20_01 and forward move the postings data from .frq, .prx, and .fNNN to a single file per field, with the extension .pNNN. The philosophy of KS 0.20 is to have all binary "files" be decodable by launching a single iterator at the front of the file and having it read to the end. (They're actually virtual files within the compound file -- KS only supports the compound format.) That translates to one posting format per file.

I don't think it is a good idea to mix different kinds of
posting formats in the same index.

Allowing different fields to use different posting formats is very important.

When matching a value in a "category" field, all you might care about is whether the doc hits or not -- you don't care about freq, boost, per-position boost, any of that. The posting format for "category" would thus specify "doc num only", and the .pNNN file would consist entirely of a sequence of delta-doc_num VInts.
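
To make the "doc num only" case concrete, here is a minimal sketch in Python. It is not KinoSearch code. It assumes the Lucene-style VInt layout (7 data bits per byte, low-order bits first, high bit set on every byte but the last), and the function names are hypothetical.

```python
def encode_vint(n):
    """Encode a non-negative int as a Lucene-style VInt."""
    if n < 0:
        raise ValueError("VInt cannot encode negative numbers")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits plus continuation flag
        n >>= 7
    out.append(n)  # final byte, high bit clear
    return bytes(out)


def encode_doc_only_postings(doc_nums):
    """Hypothetical "doc num only" format: one delta VInt per document.

    doc_nums must be sorted ascending; an out-of-order value produces a
    negative delta and raises ValueError.
    """
    out = bytearray()
    prev = 0
    for doc_num in doc_nums:
        out += encode_vint(doc_num - prev)
        prev = doc_num
    return bytes(out)


def decode_doc_only_postings(data):
    """Read front to back in a single pass, yielding absolute doc numbers."""
    doc_num = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes belong to this VInt
        else:
            doc_num += value  # undo the delta encoding
            yield doc_num
            value = 0
            shift = 0
```

The decoder never seeks, which is the property I want for every file: start one iterator at the front and read to the end.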

In contrast, a "content" field scoring HTML source material might specify a posting format that includes boost-per-position. Each record would have one doc_num, one freq, several positions, and several boosts. The file would be much more complex.
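
A rough Python sketch of what such a richer record could look like follows. The layout is an assumption made purely for illustration (delta doc_num, freq, then a delta position and a one-byte boost per occurrence); it is not the actual .pNNN format, and the function names are hypothetical.

```python
def encode_vint(n):
    """Encode a non-negative int as a Lucene-style VInt."""
    if n < 0:
        raise ValueError("VInt cannot encode negative numbers")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_rich_postings(postings):
    """Encode (doc_num, [(position, boost), ...]) records, sorted by doc_num.

    Assumed layout per record: delta doc_num, freq, then for each hit a
    delta position followed by a single boost byte (0-255).
    """
    out = bytearray()
    prev_doc = 0
    for doc_num, hits in postings:
        out += encode_vint(doc_num - prev_doc)
        out += encode_vint(len(hits))  # freq = number of positions
        prev_pos = 0
        for position, boost in hits:
            out += encode_vint(position - prev_pos)
            out.append(boost)  # assumption: one byte per boost
            prev_pos = position
        prev_doc = doc_num
    return bytes(out)


# Two documents: doc 3 with two hits, doc 7 with one hit.
example = [(3, [(0, 255), (5, 16)]), (7, [(2, 16)])]
encoded = encode_rich_postings(example)
# 10 bytes here, versus 2 bytes if only the doc deltas (3 and 4) were stored.
```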

If you want to score based on "content", but constrain results based on "category", you want to allow the simpler format for the "category" field, or you'll be wasting both disk and CPU.

It's actually possible to make multiple different posting formats work within a single monolithic postings file, but I opted to avoid that for the sake of simplicity and ease of debugging.

It will make it Lucene's responsibility to
manage different kinds of readers instantiating different kinds of termEnums,
and so on.

I've actually chosen to break up the term list into two separate files per field as well. This was a more costly and dubious choice, but was harmonious with KinoSearch's expansion of field semantics.

KS will soon allow users to determine sort order of term texts within each field. Keeping separate TermLists for each field means that I don't need to worry about either tracking field numbers/names or switching up comparators -- the TermList iterator terminates rather than proceed on to another field like TermEnum does.
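
As a loose illustration of the idea, and not the KS API, here is a Python sketch in which each field owns its term list and its sort order. The class name, constructor arguments, and field names are all hypothetical.

```python
class TermList:
    """Hypothetical per-field term list: one field, one sort order."""

    def __init__(self, field, terms, sort_key=None):
        self.field = field
        # Deduplicate, then sort with this field's own ordering.
        self._terms = sorted(set(terms), key=sort_key)

    def __iter__(self):
        # Iteration ends with this field's last term; it never runs on
        # into another field's terms.
        return iter(self._terms)


# Each field picks its own comparator; no switching happens mid-iteration.
term_lists = {
    "title": TermList("title", ["banana", "apple", "banana"]),
    "price": TermList("price", ["10", "9", "100"], sort_key=int),
}
```

Because a term list only ever holds one field's terms, the iterator needs neither a field number per entry nor a check for when to change ordering.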

I don't really know what the different impacts of such a feature
would be, but it might be quite difficult to manage correctly. But as the posting format can be redefined by the user, he can implement a custom format which internally handles different kinds of data associated with
terms.

If you guarantee that the posting format for a given field can never change by imposing global field semantics, it's not a big deal. If you break things up by field at both the file and the data structure level, it gets even easier.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


