On Apr 17, 2008, at 11:57 AM, Michael McCandless wrote:
If I have a pluggable indexer,
then on the querying side I need something (I'm not sure what/how)
that knows how to create the right demuxer (container) and codec
(decoder) to interact with whatever my indexing plugins wrote.
So I don't think it's "out of bounds" to have *Query classes that know
to use the non-prx variant of a field when positions aren't needed.
Though I agree it makes things more complex.
Perhaps switching up files based on Query type isn't "out of bounds",
but you get hit with a double whammy. First, there's the added
complexity -- which gets extra nasty when you consider that we
wouldn't want this optimization to be on by default. Second, you wind
up with duplicated information in the index -- resulting in greater
space requirements and increased indexing time.
That's an awful lot of inconvenience for the sake of an optimization.
I wouldn't bother.
My initial thought was that the unified postings file format that KS
is using now offers so many benefits that KS should go with it anyway,
despite the increased costs for simple TermQueries. But maybe we can
nail the design for the divide-by-data-type format right now.
This is like "column stride" vs "row stride" serialization
of a matrix.
Relatively soon, though, we will all be on SSDs, so maybe this
locality argument becomes far less important ;)
Yes, I've thought about that. It defeats the phrase-query locality
argument for unified postings files and recommends breaking things up
logically by type of data into frq/prx/payload/whatever.
Would it be possible to design a Posting plugin class that reads from
multiple files? I'm pretty sure the answer is yes. It messes up the
single-stream readRecord() method I've been detailing and might force
Posting to maintain state. But if Postings are scarce TermBuffer-style
objects where values mutate, rather than populous Term-style objects
where you need a new instance for each set of values, then it doesn't
matter if they're large.
If that could be done, I think it would be possible to retrofit the
Posting/PostingList concept into Lucene without a file format
change. FWIW.
I think this is possible.
We should aggressively explore this concept. By my tally, breaking up
posting content into multiple files by data type has the most benefits
and the fewest drawbacks.
Brainstorming...
In devel branch KS, ScorePost_Read_Record reads three blocks of data
from an InStream:
doc_num/freq
boost/norm byte
positions
Let's say that instead of ScorePost_Read_Record operating on a passed-
in InStream, we give ScorePosting a ScorePost_Next() method, and have
it operate on three internally maintained InStreams:
void
ScorePost_next(ScorePosting *self)
{
    u32_t num_prox;
    u32_t position = 0;
    u32_t *positions;
    const u32_t doc_code  = InStream_Read_C32(self->frq_in);
    const u32_t doc_delta = doc_code >> 1;

    /* Apply delta doc and retrieve freq. */
    self->doc_num += doc_delta;
    if (doc_code & 1)
        self->freq = 1;
    else
        self->freq = InStream_Read_C32(self->frq_in);

    /* Decode boost/norm byte. */
    self->impact = self->sim->norm_decoder[
        InStream_Read_U8(self->norms_in)
    ];

    /* Read positions. */
    num_prox = self->freq;
    if (num_prox > self->prox_cap) {
        self->prox     = REALLOCATE(self->prox, num_prox, u32_t);
        self->prox_cap = num_prox;
    }
    positions = self->prox;
    while (num_prox--) {
        position += InStream_Read_C32(self->prx_in);
        *positions++ = position;
    }
}
So, there we have a single operation, in a single class, defining a
codec -- achieving my main goal, even while reading from multiple files.
But check this out: we could feed the ScorePosting constructor the
same InStream object three times, and it wouldn't know the
difference. So the same read operation can be used against both a
multi-file format and a single file format.
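For instance, the stream hookup might look something like this -- a rough
sketch, with the function name and struct fields made up to match the
ScorePost_next() code above.  Pass the same InStream three times and you
get the unified-file behavior; pass three distinct InStreams and you get
the split frq/prx/norms behavior:

void
ScorePost_assign_streams(ScorePosting *self, InStream *frq_in,
                         InStream *prx_in, InStream *norms_in)
{
    /* The posting never checks whether these are distinct objects. */
    self->frq_in   = frq_in;    /* doc deltas and freqs */
    self->prx_in   = prx_in;    /* position deltas      */
    self->norms_in = norms_in;  /* boost/norm bytes     */
    self->doc_num  = 0;         /* doc deltas accumulate from zero */
    self->freq     = 0;
    self->prox     = NULL;
    self->prox_cap = 0;
}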
Seeking might get a little weird, I suppose.
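With split files, a seek would have to reposition all three streams in
lockstep, presumably from per-term file pointers handed back by the term
dictionary.  A rough sketch -- the seek call and pointer arguments are
assumptions, not real KS API:

void
ScorePost_seek(ScorePosting *self, u64_t frq_ptr, u64_t prx_ptr,
               u64_t norms_ptr)
{
    InStream_Seek(self->frq_in,   frq_ptr);
    InStream_Seek(self->prx_in,   prx_ptr);
    InStream_Seek(self->norms_in, norms_ptr);
    self->doc_num = 0;  /* doc deltas start over at each term */
}

With a unified file, the three pointers would collapse into one.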
I think a variant of that read op could be created against various
versions of the Lucene file format going back, making it possible to
isolate and archive obsolete codecs and clean up the container classes.
When reading an index, the
Posting/PostingList should be more like TermBuffer than Term.
Yes.
Now's a good time to remark on a difference between KS and Lucene when
reading the equivalent of TermPositions: In KS, all positions are read
in one grab during ScorePost_Read_Record -- there's no nextPosition()
method. That was a tradeoff I made primarily for simplicity's sake,
since it meant that PhraseScorer could be implemented with integer
arrays and pointer math. Reduced CPU overhead was another theoretical
benefit, but I've never benchmarked it.
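To illustrate what I mean by pointer math -- this isn't the real
PhraseScorer, just a two-term sketch: once all of a doc's positions have
been decoded into the prox arrays, testing for adjacency is a plain merge
over two sorted u32_t arrays:

static int
two_term_phrase_match(ScorePosting *first, ScorePosting *second)
{
    u32_t *a     = first->prox;
    u32_t *a_end = a + first->freq;
    u32_t *b     = second->prox;
    u32_t *b_end = b + second->freq;

    while (a < a_end && b < b_end) {
        if      (*a + 1 == *b) { return 1; }  /* adjacent -- phrase match */
        else if (*a + 1 <  *b) { a++; }       /* advance the earlier term */
        else                   { b++; }
    }
    return 0;
}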
If you didn't want to do that but you still wanted to implement a
PhraseScorer based around PostingList objects rather than
TermPositions objects, you have a bit of a quandary: PostingList only
advances in doc-sized increments, because it doesn't have a
nextPosition() method. So, nextPosition() would have to be
implemented by ScorePosting:
// Iterate over docs and positions.
while (postingList.next()) {
    ScorePosting posting = postingList.getPosting();
    while (posting.nextPosition()) {
        int position = posting.getPosition();
        ...
    }
}
If we did that, it would require a change in the behavior for
ScorePost_Next(), above.
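Roughly, the "Read positions" block would go away: ScorePost_next() would
just stash the count and leave the prx stream alone, and a new method
would decode positions one at a time.  A sketch, with the extra fields
and the method name made up:

int
ScorePost_Next_Position(ScorePosting *self, u32_t *position)
{
    if (self->prox_remaining == 0) {
        return 0;                    /* no more positions for this doc */
    }
    self->prox_remaining--;
    self->position += InStream_Read_C32(self->prx_in);  /* apply delta */
    *position = self->position;
    return 1;
}

ScorePost_next() would then set self->prox_remaining = self->freq and
self->position = 0 rather than looping over the prx stream itself.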
Thinking about a codec reading multiple files... I would also love
this: a codec that can write & read layered updates to the inverted
files.
IMO, it should not be the codec object that performs actions like this
-- it should be a container class that the user knows nothing about.
I agree, Lucene needs a stronger separation of "container" from
"codec". If someone just wants to plugin a new codec they should be
able to cleanly plug into an existing container and "only" provide the
codec.
EG, say I want to store say an arbitrary array of ints per document.
Not all documents have an array, and when they do, the array length
varies.
To do this, I'd like to re-use a container that knows how to store a
byte blob per document, sparsely, and say indexed with a multi-level
skip list. That container is exactly the container we now use for
storing frq/prx data, under each term. So, ideally, I can easily
re-use that container and just provide a codec (encoder & decoder)
that maps from an int[] to a byte blob, and back.
We need to factor things so that this container can be easily shared
and is entirely decoupled from the codecs it's using.
Perfect.
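To make that concrete, the codec half could be as small as this -- a
sketch only, with made-up names, assuming an OutStream_Write_C32 that
mirrors InStream_Read_C32; the sparse per-doc storage and the skip list
stay entirely inside the container:

void
IntArrayCodec_encode(OutStream *outstream, const u32_t *ints, u32_t count)
{
    u32_t i;
    OutStream_Write_C32(outstream, count);     /* blob starts with length */
    for (i = 0; i < count; i++) {
        OutStream_Write_C32(outstream, ints[i]);
    }
}

u32_t
IntArrayCodec_decode(InStream *instream, u32_t *ints, u32_t max)
{
    u32_t i;
    const u32_t count = InStream_Read_C32(instream);
    for (i = 0; i < count && i < max; i++) {
        ints[i] = InStream_Read_C32(instream);
    }
    return count;
}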
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/