On Mar 13, 2007, at 2:03 AM, Nicolas Lalevée wrote:
At present KS allows you to attach both a Similarity and an Analyzer
to a field name via a FieldSpec subclass. I haven't quite figured
out how to attach a posting format. Should it return an object, like
FieldSpec's similarity() method does? Should it actually implement a
codec? Not sure yet. What do you think?
The posting format defines how you want to store the terms data, so it
defines how to search.
Hmm. I'm talking about the stuff currently held in .frq, .prx,
and .fNNN in Lucene. That's not the terms data. I think we're
miscommunicating.
KinoSearch 0.20_01 and forward move the postings data
from .frq, .prx, and .fNNN to a single file per field, with the
extension .pNNN. The philosophy of KS 0.20 is to have all binary
"files" be decodable by launching a single iterator at the front of
the file and having it read to the end. (They're actually virtual
files within the compound file -- KS only supports the compound
format.) That translates to one posting format per file.
I don't think it is a good idea to mix different kinds of
posting formats in the same index.
Allowing different fields to use different posting formats is very
important.
When matching a value in a "category" field, all you might care about
is whether the doc hits or not -- you don't care about freq, boost,
per-position boost, any of that. The posting format for "category"
would thus specify "doc num only", and the .pNNN file would consist
entirely of a sequence of delta-doc_num VInts.
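To make the "doc num only" case concrete, here is a minimal sketch in Python (chosen for brevity; it is not KinoSearch code, and the function names are made up for illustration). It assumes the usual Lucene-style VInt: seven data bits per byte, low-order bits first, high bit set when more bytes follow.

```python
def encode_vint(n):
    """Encode a non-negative int as a VInt: 7 bits per byte, low-order first."""
    if n < 0:
        raise ValueError("VInt cannot encode a negative number")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_doc_only(doc_nums):
    """Encode ascending doc numbers as a sequence of delta VInts."""
    out = bytearray()
    prev = 0
    for doc_num in doc_nums:
        out += encode_vint(doc_num - prev)  # raises if input is not ascending
        prev = doc_num
    return bytes(out)


def iter_doc_only(data):
    """Read the whole byte string front to back, yielding doc numbers."""
    doc_num = 0
    delta = 0
    shift = 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            doc_num += delta
            yield doc_num
            delta = 0
            shift = 0
```

Under those assumptions, the four doc numbers 3, 7, 300, 301 fit in five bytes, and a single iterator started at the front of the data recovers all of them.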
In contrast, a "content" field scoring HTML source material might
specify a posting format that includes boost-per-position. Each
record would have one doc_num, one freq, several positions, and
several boosts. The file would be much more complex.
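For comparison, a rough Python sketch of such a richer record follows. The byte layout is an assumption made up for illustration (delta doc_num VInt, freq VInt, freq delta-position VInts, then one boost byte per position); it is not the actual KinoSearch file format, and the helper names are hypothetical.

```python
import io


def write_vint(buf, n):
    """Append n to a bytearray as a VInt (7 bits per byte, low-order first)."""
    while n >= 0x80:
        buf.append((n & 0x7F) | 0x80)
        n >>= 7
    buf.append(n)


def read_vint(stream):
    """Read one VInt from a binary stream."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("truncated VInt")
        byte = raw[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def encode_rich(postings):
    """postings: [(doc_num, [(position, boost_byte), ...]), ...], ascending."""
    buf = bytearray()
    prev_doc = 0
    for doc_num, hits in postings:
        write_vint(buf, doc_num - prev_doc)
        prev_doc = doc_num
        write_vint(buf, len(hits))            # freq
        prev_pos = 0
        for pos, _boost in hits:
            write_vint(buf, pos - prev_pos)   # delta-encoded positions
            prev_pos = pos
        buf.extend(boost for _pos, boost in hits)  # one byte per position
    return bytes(buf)


def iter_rich(data):
    """Decode front to back, yielding (doc_num, freq, positions, boosts)."""
    stream = io.BytesIO(data)
    doc_num = 0
    while stream.tell() < len(data):
        doc_num += read_vint(stream)
        freq = read_vint(stream)
        positions = []
        pos = 0
        for _ in range(freq):
            pos += read_vint(stream)
            positions.append(pos)
        boosts = list(stream.read(freq))
        yield doc_num, freq, positions, boosts
```

Even in this toy layout, two docs with three positions between them take ten bytes, where a doc-num-only encoding of the same two docs would take two.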
If you want to score based on "content", but constrain results based
on "category", you want to allow the simpler format for the
"category" field, or you'll be wasting both disk and CPU.
It's actually possible to make multiple different posting formats
work within a single monolithic postings file, but I opted to avoid
that for the sake of simplicity and ease of debugging.
It will make it Lucene's responsibility to
manage different kinds of readers instantiating different kinds of
termEnums, and so on.
I've actually chosen to break up the term list into two separate
files per field as well. This was a more costly and dubious choice,
but was harmonious with KinoSearch's expansion of field semantics.
KS will soon allow users to determine sort order of term texts within
each field. Keeping separate TermLists for each field means that I
don't need to worry about either tracking field numbers/names or
switching up comparators -- the TermList iterator terminates rather
than proceed on to another field like TermEnum does.
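A small Python sketch of the idea, with a hypothetical TermList class and comparator (neither is the KinoSearch API): each field gets its own list, sorted once with that field's comparator, and iteration simply stops when the field's terms run out.

```python
from functools import cmp_to_key


class TermList:
    """Terms for exactly one field, sorted with that field's comparator."""

    def __init__(self, field, terms, compare=None):
        self.field = field
        # With no comparator, fall back to plain code point order.
        key = cmp_to_key(compare) if compare else None
        self._terms = sorted(set(terms), key=key)

    def __iter__(self):
        # The iterator ends with this field; it never crosses into another.
        return iter(self._terms)


def by_lowercase(a, b):
    """Case-insensitive order, with the raw text as a tie-breaker."""
    ka, kb = (a.lower(), a), (b.lower(), b)
    return (ka > kb) - (ka < kb)


terms = ["apple", "Banana", "cherry"]
default_order = list(TermList("content", terms))
custom_order = list(TermList("title", terms, by_lowercase))
```

Here default_order comes out as Banana, apple, cherry, while custom_order comes out as apple, Banana, cherry, and neither list needs to know the other field or its comparator exists.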
I don't really know what the different kinds of impact of such a
feature will be, but it might be quite difficult to manage
correctly. But as the posting format can be redefined by the user, he
can implement a custom format which internally handles different
kinds of data associated with terms.
If you guarantee that the posting format for a given field can never
change by imposing global field semantics, it's not a big deal. If
you break things up by field at both the file and the data structure
level, it gets even easier.
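As a sketch of what "global field semantics" could look like, assuming a hypothetical Schema registry (the class, method, and format names below are invented for illustration, in Python): a field's posting format is fixed the first time it is declared, and any later attempt to change it is rejected.

```python
class Schema:
    """Index-wide registry mapping each field name to one posting format."""

    def __init__(self):
        self._formats = {}

    def declare(self, field, posting_format):
        # The first declaration wins; re-declaring the same format is harmless.
        existing = self._formats.setdefault(field, posting_format)
        if existing != posting_format:
            raise ValueError(
                f"field {field!r} already uses posting format {existing!r}"
            )

    def posting_format(self, field):
        return self._formats[field]


schema = Schema()
schema.declare("category", "doc_only")
schema.declare("content", "boost_per_position")
```

With a guarantee like this, a reader can pick the decoder for a field's postings once, from the field name alone, and never has to check the format per segment or per term.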
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/