On Apr 9, 2008, at 6:35 AM, Michael Busch wrote:
> We also need to come up with a good solution for the dictionary,
> because a term with frq/prx postings needs to store two (or three
> for skiplist) file pointers in the dictionary, whereas e.g. a
> "binary" posting list only needs one pointer.
This is something I'm working on as well, and I hope we can solve a
couple of design problems I've been turning over in my mind for some
time.
In KS, the information Lucene stores in the frq/prx files is carried
in one postings file per field, as discussed previously. However, I
made the additional change of breaking out skip data into a separate
file (shared across all fields). Isolating skip data sacrifices some
locality of reference, but buys substantial gains in simplicity and
compartmentalization. Individual Posting subclasses, each of which
defines a file format, don't have to know about skip algorithms at
all. :) Further, improvements in the skip algorithm only require
changes to the .skip file, and falling back to PostingList_Next still
works if the .skip file becomes corrupted since .skip carries only
optimization info and no real data.
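A minimal sketch in C of the fallback described above, not the actual KS code. The names `SkipEntry`, `PostingList`, `plist_next` and `plist_skip_to` are hypothetical, the postings and skip data are modeled as in-memory arrays rather than files, and a corrupted or missing .skip file is modeled simply as `skips == NULL`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-ins for a postings file and the .skip file. */
typedef struct {
    int32_t doc_id;  /* doc id found at docs[index] */
    size_t  index;   /* where that doc sits in the postings array */
} SkipEntry;

typedef struct {
    const int32_t   *docs;       /* ascending, non-negative doc ids (the real data) */
    size_t           num_docs;
    const SkipEntry *skips;      /* NULL when skip data is missing or unusable */
    size_t           num_skips;
    size_t           pos;        /* index of the next doc to be read */
} PostingList;

/* Plain iteration: return the next doc id, or -1 when exhausted. */
static int32_t plist_next(PostingList *pl) {
    assert(pl != NULL);
    if (pl->pos >= pl->num_docs) { return -1; }
    return pl->docs[pl->pos++];
}

/* Return the first doc id >= target, or -1 if there is none. */
static int32_t plist_skip_to(PostingList *pl, int32_t target) {
    assert(pl != NULL);
    if (pl->skips != NULL) {
        for (size_t i = 0; i < pl->num_skips; i++) {
            const SkipEntry *entry = &pl->skips[i];
            if (entry->doc_id > target) { break; }
            /* Only jump forward, and never past the end of the postings. */
            if (entry->index > pl->pos && entry->index < pl->num_docs) {
                pl->pos = entry->index;
            }
        }
    }
    /* With or without skip data, finish with ordinary iteration. */
    int32_t doc;
    while ((doc = plist_next(pl)) != -1) {
        if (doc >= target) { return doc; }
    }
    return -1;
}
```

The skip entries only move the starting point of the scan, so the result is the same whether or not they are present; only the number of calls to `plist_next` changes.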
For reasons I won't go into here, KS doesn't need to put a field
number in its TermInfo, but it does need doc freq, plus file
positions for the postings file, the skip file, and the primary
Lexicon file. (Lexicon is the KS term dictionary class, akin to
Lucene's TermEnum.)
    struct kino_TermInfo {
        kino_VirtualTable* _;
        kino_ref_t ref;
        chy_i32_t doc_freq;
        chy_u64_t post_filepos;
        chy_u64_t skip_filepos;
        chy_u64_t lex_filepos;
    };
There are two problems.
First is that I'd like to extend indexing with arbitrary subclasses of
SegDataWriter, and I'd like these classes to be able to put their own
file position bookmarks (or possibly other data) into TermInfo.
Making TermInfo hash-based would probably do it, but there would be
nasty performance and memory penalties since TermInfo objects are
numerous.
So, what's the best way to allow multiple, unrelated classes to extend
TermInfo and the term dictionary file format? Is it to break up
TermInfo information horizontally rather than vertically, so that
instead of a single array of TermInfo objects, we have a flexible
stack of arrays of 64-bit integers representing file positions?
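To make the "horizontal" option concrete, here is a rough sketch in C of a stack of parallel 64-bit arrays, one per contributing class, indexed by term number. Everything in it (`TermInfoColumns`, `FilePosColumn`, the `ticols_*` functions, the `MAX_COLUMNS` cap) is hypothetical and exists only for illustration; it says nothing about how the on-disk dictionary format would be laid out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_COLUMNS 8  /* arbitrary cap, chosen only for this sketch */

typedef struct {
    const char *name;    /* e.g. "post_filepos"; string is owned by the caller */
    uint64_t   *values;  /* one file position per term */
} FilePosColumn;

typedef struct {
    size_t        num_terms;
    size_t        num_columns;
    FilePosColumn columns[MAX_COLUMNS];
} TermInfoColumns;

static void ticols_init(TermInfoColumns *t, size_t num_terms) {
    assert(t != NULL && num_terms > 0);
    t->num_terms   = num_terms;
    t->num_columns = 0;
}

/* Register a new zero-filled column; return its index, or -1 on failure. */
static int ticols_add_column(TermInfoColumns *t, const char *name) {
    assert(t != NULL && name != NULL);
    if (t->num_columns == MAX_COLUMNS) { return -1; }
    uint64_t *values = calloc(t->num_terms, sizeof(uint64_t));
    if (values == NULL) { return -1; }
    t->columns[t->num_columns].name   = name;
    t->columns[t->num_columns].values = values;
    return (int)t->num_columns++;
}

static void ticols_set(TermInfoColumns *t, int col, size_t term_num,
                       uint64_t filepos) {
    assert(col >= 0 && (size_t)col < t->num_columns);
    assert(term_num < t->num_terms);
    t->columns[col].values[term_num] = filepos;
}

static uint64_t ticols_get(const TermInfoColumns *t, int col, size_t term_num) {
    assert(col >= 0 && (size_t)col < t->num_columns);
    assert(term_num < t->num_terms);
    return t->columns[col].values[term_num];
}

/* Free every column's storage. */
static void ticols_destroy(TermInfoColumns *t) {
    for (size_t i = 0; i < t->num_columns; i++) {
        free(t->columns[i].values);
    }
    t->num_columns = 0;
}
```

Each writer class would hold on to the column index it was given and touch only its own array, so adding a class costs one extra array of 8 bytes per term and no per-term hash.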
The second problem is how to share a term dictionary over a cluster.
It would be nice to be able to plug modules into IndexReader that
represent clusters of machines but that are dedicated to specific
tasks: one cluster could be dedicated to fetching full documents and
applying highlighting; another cluster could be dedicated to scanning
through postings and finding/scoring hits; a third cluster could store
the entire term dictionary in RAM.
A centralized term dictionary held in RAM would be particularly handy
for sorting purposes. The problem is that the file pointers of a term
dictionary are specific to indexes on individual machines. A shared
dictionary in RAM would have to contain pointers for *all* clients,
which isn't really workable.
So, just how do you go about assembling task-specific clusters? The
stored documents cluster is easy, but the term dictionary and the
postings are hard.
> For example, we should think about the Field APIs. Since we don't
> have global field semantics in Lucene I wonder how to handle
> conflict cases, e.g. when a document specifies a different posting
> list format than a previous one for the same field. The easiest way
> would be to not allow it and throw an exception. But this is kind of
> against Lucene's way of dealing with fields currently. But I'm
> scared of the complicated code to handle conflicts of all the
> possible combinations of posting list formats.
Yeah. Lucene's field definition conflict-resolution code is gnarly
already. :(
> KinoSearch doesn't have to worry about this, because it has a static
> schema (I think?), but isn't as flexible as Lucene.
Earlier versions of KS did not allow the addition of new fields on the
fly, but this has been changed. You can now add fields to an existing
Schema object like so:
    for my $doc (@docs) {
        # Dynamically define any new fields as 'text'.
        for my $field ( keys %$doc ) {
            $schema->add_field( $field => 'text' );
        }
        $invindexer->add_doc($doc);
    }
See the attached sample app for that snippet in context.
Here are some current differences between KS and Lucene:
* KS doesn't yet purge *old* dynamic field definitions which have
become obsolete. However, that should be possible to add later,
as a sweep triggered during full optimization.
* You can't change the definition of an existing field.
* Documents are hash-based, so you can't have multiple fields with
the same name within one document object. However, I consider
that capability a misfeature of Lucene.
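The add-only behavior can be sketched as a small registry, written here in C rather than Perl. This is an illustration under assumptions, not the KS implementation: `Schema`, `FieldDef`, `FieldStatus` and `schema_add_field` are hypothetical names, repeating an identical definition is assumed to be harmless (the loop above calls add_field for fields that already exist), and how KS actually reports a conflicting redefinition is not stated here, so the sketch simply returns an error code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FIELDS 16  /* arbitrary cap, chosen only for this sketch */

typedef enum {
    FIELD_OK       =  0,  /* field added, or already defined identically */
    FIELD_CONFLICT = -1,  /* field exists with a different type */
    FIELD_FULL     = -2   /* no room left in this fixed-size sketch */
} FieldStatus;

typedef struct {
    const char *name;  /* strings are owned by the caller */
    const char *type;
} FieldDef;

typedef struct {
    FieldDef fields[MAX_FIELDS];
    size_t   num_fields;
} Schema;

/* Add a field definition; existing definitions are never modified. */
static FieldStatus schema_add_field(Schema *s, const char *name,
                                    const char *type) {
    assert(s != NULL && name != NULL && type != NULL);
    for (size_t i = 0; i < s->num_fields; i++) {
        if (strcmp(s->fields[i].name, name) == 0) {
            /* Same name: accept only an identical definition. */
            return strcmp(s->fields[i].type, type) == 0
                 ? FIELD_OK
                 : FIELD_CONFLICT;
        }
    }
    if (s->num_fields == MAX_FIELDS) { return FIELD_FULL; }
    s->fields[s->num_fields].name = name;
    s->fields[s->num_fields].type = type;
    s->num_fields++;
    return FIELD_OK;
}
```

Because a definition is fixed once it exists, the only conflict to handle is "same name, different type", which avoids the combinatorial merging of posting list formats mentioned earlier in the thread.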
In summary, I don't think that global field semantics meaningfully
restrict flexibility for the vast majority of users.
The primary distinction is/was philosophical. IIRC, Doug didn't want
to force people to think about index design in advance, so the Field/
Document API was optimized for newbies. In contrast, KS wants you to
give it a Schema before indexing commences.
It's still true that full-power KS forces you to think about index
design up-front. However, there's now a KinoSearch::Simple API
targeted at newbies which hides the Schema API and handles field
definition automatically -- so Doug's ease-of-use design goal has been
achieved via different means.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
use strict;
use warnings;

package MySchema;
use base qw( KinoSearch::Schema );
use KinoSearch::Analysis::PolyAnalyzer;

sub analyzer {
    return KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
}

our %fields = (
    title   => 'text',
    content => 'text',
);

package main;
use KinoSearch::Store::RAMFolder;
use KinoSearch::InvIndex;
use KinoSearch::InvIndexer;
use KinoSearch::Searcher;

my $schema   = MySchema->new;
my $folder   = KinoSearch::Store::RAMFolder->new;
my $invindex = KinoSearch::InvIndex->open(
    schema => $schema,
    folder => $folder,
);
my $invindexer = KinoSearch::InvIndexer->new( invindex => $invindex );

my @docs = (
    { title => 'foo', content => 'foo foo', category => 'fooish' },
    { title => 'bar', content => 'bar bar', keyword  => 'barbarian' }
);

for my $doc (@docs) {
    # Dynamically define any new fields as 'text'.
    for my $field ( keys %$doc ) {
        $schema->add_field( $field => 'text' );
    }
    $invindexer->add_doc($doc);
}
$invindexer->finish;

my $searcher = KinoSearch::Searcher->new( invindex => $invindex );
my $hits     = $searcher->search( query => 'barbarian' );
print "Total hits: " . $hits->total_hits . "\n";