Re: Flexible indexing design

Marvin Humphrey Wed, 09 Apr 2008 13:29:22 -0700

On Apr 9, 2008, at 6:35 AM, Michael Busch wrote:

We also need to come up with a good solution for the dictionary, because a term with frq/prx postings needs to store two (or three for skiplist) file pointers in the dictionary, whereas e.g. a "binary" posting list only needs one pointer.

This is something I'm working on as well, and I hope we can solve a couple of design problems I've been turning over in my mind for some time.

In KS, the information Lucene stores in the frq/prx files is carried in one postings file per field, as discussed previously. However, I made the additional change of breaking out skip data into a separate file (shared across all fields). Isolating skip data sacrifices some locality of reference, but buys substantial gains in simplicity and compartmentalization. Individual Posting subclasses, each of which defines a file format, don't have to know about skip algorithms at all. :) Further, improvements in the skip algorithm only require changes to the .skip file, and falling back to PostingList_Next still works if the .skip file becomes corrupted, since .skip carries only optimization info and no real data.
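
To make the "optimization info only" point concrete, here is a minimal sketch in C. It is not KS code: ToyPostingList, ToySkipEntry, toy_next and toy_skip_to are hypothetical names, and the posting list is just an in-memory sorted array. The skip entries are hints that may be missing or damaged; the linear scan always produces the answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a posting list: ascending doc numbers. */
typedef struct {
    const int32_t *doc_nums;  /* strictly ascending */
    size_t         count;
    size_t         pos;       /* index of the next posting to read */
} ToyPostingList;

/* Hypothetical skip hint: the posting at `offset` has doc number `doc_num`. */
typedef struct {
    int32_t doc_num;
    size_t  offset;
} ToySkipEntry;

/* Read one posting; returns its doc number, or -1 when exhausted. */
static int32_t toy_next(ToyPostingList *pl) {
    assert(pl != NULL);
    if (pl->pos >= pl->count) return -1;
    return pl->doc_nums[pl->pos++];
}

/* Return the first doc number >= target, or -1 if there is none.
 * `skips` may be NULL; entries that fail validation are ignored. */
static int32_t toy_skip_to(ToyPostingList *pl, int32_t target,
                           const ToySkipEntry *skips, size_t num_skips) {
    /* Optimization only: jump ahead using trustworthy hints. */
    for (size_t i = 0; skips != NULL && i < num_skips; i++) {
        const ToySkipEntry *s = &skips[i];
        if (s->doc_num > target) break;            /* hint lies past target */
        if (s->offset >= pl->count) continue;      /* damaged: out of range */
        if (pl->doc_nums[s->offset] != s->doc_num) continue; /* damaged */
        if (s->offset > pl->pos) pl->pos = s->offset;
    }
    /* The real work: plain iteration, correct with or without hints. */
    int32_t doc;
    while ((doc = toy_next(pl)) != -1) {
        if (doc >= target) return doc;
    }
    return -1;
}
```

In this toy, a hint can be checked against the postings themselves, so a bad hint only costs a longer scan. How a real .skip file would be validated is left open here.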

For reasons I won't go into here, KS doesn't need to put a field number in its TermInfo, but it does need doc freq, plus file positions for the postings file, the skip file, and the primary Lexicon file. (Lexicon is the KS term dictionary class, akin to Lucene's TermEnum.)


  struct kino_TermInfo {
      kino_VirtualTable* _;
      kino_ref_t ref;
      chy_i32_t doc_freq;
      chy_u64_t post_filepos;
      chy_u64_t skip_filepos;
      chy_u64_t lex_filepos;
  };

There are two problems.

First is that I'd like to extend indexing with arbitrary subclasses of SegDataWriter, and I'd like these classes to be able to put their own file position bookmarks (or possibly other data) into TermInfo. Making TermInfo hash-based would probably do it, but there would be nasty performance and memory penalties since TermInfo objects are numerous.

So, what's the best way to allow multiple, unrelated classes to extend TermInfo and the term dictionary file format? Is it to break up TermInfo information horizontally rather than vertically, so that instead of a single array of TermInfo objects, we have a flexible stack of arrays of 64-bit integers representing file positions?
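
One way to picture the "stack of arrays" layout is the C sketch below. Everything in it is an assumption made for illustration (TermColumn, TermInfoTable, the table_* functions and the fixed MAX_COLUMNS limit are hypothetical, not KS or Lucene APIs). Each writer registers a named column of 64-bit values indexed by term number, so unrelated classes can add bookmarks without touching a shared struct.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_COLUMNS 8  /* arbitrary limit for this sketch */

/* One column: a 64-bit value per term, e.g. a file position. */
typedef struct {
    const char *name;    /* e.g. "post_filepos"; owned by the caller */
    uint64_t   *values;  /* num_terms slots, indexed by term number */
} TermColumn;

typedef struct {
    size_t     num_terms;
    size_t     num_columns;
    TermColumn columns[MAX_COLUMNS];
} TermInfoTable;

static void table_init(TermInfoTable *t, size_t num_terms) {
    t->num_terms   = num_terms;
    t->num_columns = 0;
}

/* Look up a column by name; returns its index, or -1 if absent. */
static int table_find_column(const TermInfoTable *t, const char *name) {
    for (size_t i = 0; i < t->num_columns; i++) {
        if (strcmp(t->columns[i].name, name) == 0) return (int)i;
    }
    return -1;
}

/* Register a new zero-filled column; returns its index, or -1 on failure. */
static int table_add_column(TermInfoTable *t, const char *name) {
    if (t->num_columns >= MAX_COLUMNS) return -1;
    if (table_find_column(t, name) != -1) return -1;  /* name already taken */
    uint64_t *values = calloc(t->num_terms, sizeof(uint64_t));
    if (values == NULL) return -1;
    t->columns[t->num_columns].name   = name;
    t->columns[t->num_columns].values = values;
    return (int)t->num_columns++;
}

static void table_set(TermInfoTable *t, int col, size_t term_num, uint64_t v) {
    assert(col >= 0 && (size_t)col < t->num_columns && term_num < t->num_terms);
    t->columns[col].values[term_num] = v;
}

static uint64_t table_get(const TermInfoTable *t, int col, size_t term_num) {
    assert(col >= 0 && (size_t)col < t->num_columns && term_num < t->num_terms);
    return t->columns[col].values[term_num];
}

static void table_destroy(TermInfoTable *t) {
    for (size_t i = 0; i < t->num_columns; i++) free(t->columns[i].values);
    t->num_columns = 0;
}
```

Under this layout each extra bookmark costs 8 bytes per term and no per-object overhead, which is the property a hash-based TermInfo would give up. The sketch says nothing about how the columns would be serialized into the term dictionary file.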

The second problem is how to share a term dictionary over a cluster. It would be nice to be able to plug modules into IndexReader that represent clusters of machines but that are dedicated to specific tasks: one cluster could be dedicated to fetching full documents and applying highlighting; another cluster could be dedicated to scanning through postings and finding/scoring hits; a third cluster could store the entire term dictionary in RAM.

A centralized term dictionary held in RAM would be particularly handy for sorting purposes. The problem is that the file pointers of a term dictionary are specific to indexes on individual machines. A shared dictionary in RAM would have to contain pointers for *all* clients, which isn't really workable.

So, just how do you go about assembling task-specific clusters? The stored documents cluster is easy, but the term dictionary and the postings are hard.

For example, we should think about the Field APIs. Since we don't have global field semantics in Lucene I wonder how to handle conflict cases, e.g. when a document specifies a different posting list format than a previous one for the same field. The easiest way would be to not allow it and throw an exception. But this is kind of against Lucene's way of dealing with fields currently. But I'm scared of the complicated code to handle conflicts of all the possible combinations of posting list formats.

Yeah. Lucene's field definition conflict-resolution code is gnarly already. :(

KinoSearch doesn't have to worry about this, because it has a static schema (I think?), but isn't as flexible as Lucene.

Earlier versions of KS did not allow the addition of new fields on the fly, but this has been changed. You can now add fields to an existing Schema object like so:


    for my $doc (@docs) {
        # Dynamically define any new fields as 'text'.
        for my $field ( keys %$doc ) {
            $schema->add_field( $field => 'text' );
        }
        $invindexer->add_doc($doc);
    }

See the attached sample app for that snippet in context.

Here are some current differences between KS and Lucene:

  * KS doesn't yet purge *old* dynamic field definitions which have
    become obsolete.  However, that should be possible to add later,
    as a sweep triggered during full optimization.
  * You can't change the definition of an existing field.
  * Documents are hash-based, so you can't have multiple fields with
    the same name within one document object.  However, I consider
    that capability a misfeature of Lucene.
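
The "add freely, never redefine" rule can be sketched in a few lines of C. This is only an illustration under stated assumptions: ToySchema, FieldDef and toy_schema_add_field are hypothetical names, and treating a repeated identical definition as a no-op is inferred from the loop above, which calls add_field for every field of every document. The actual KS behavior on a conflicting definition is not shown in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FIELDS 32  /* arbitrary limit for this sketch */

typedef enum { FIELD_OK, FIELD_CONFLICT, FIELD_FULL } FieldStatus;

typedef struct {
    const char *name;  /* e.g. "title"; owned by the caller */
    const char *type;  /* e.g. "text";  owned by the caller */
} FieldDef;

typedef struct {
    FieldDef fields[MAX_FIELDS];
    size_t   num_fields;
} ToySchema;

/* Add a field definition.  Re-adding an identical definition is a no-op
 * (assumption); redefining an existing field with a new type is refused. */
static FieldStatus toy_schema_add_field(ToySchema *schema,
                                        const char *name, const char *type) {
    assert(schema != NULL && name != NULL && type != NULL);
    for (size_t i = 0; i < schema->num_fields; i++) {
        if (strcmp(schema->fields[i].name, name) == 0) {
            return strcmp(schema->fields[i].type, type) == 0
                 ? FIELD_OK
                 : FIELD_CONFLICT;
        }
    }
    if (schema->num_fields >= MAX_FIELDS) return FIELD_FULL;
    schema->fields[schema->num_fields].name = name;
    schema->fields[schema->num_fields].type = type;
    schema->num_fields++;
    return FIELD_OK;
}
```

With a rule like this, the only conflict case is a single comparison, rather than resolution logic for every pair of posting list formats.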

In summary, I don't think that global field semantics meaningfully restrict flexibility for the vast majority of users.

The primary distinction is/was philosophical. IIRC, Doug didn't want to force people to think about index design in advance, so the Field/Document API was optimized for newbies. In contrast, KS wants you to give it a Schema before indexing commences.

It's still true that full-power KS forces you to think about index design up-front. However, there's now a KinoSearch::Simple API targeted at newbies which hides the Schema API and handles field definition automatically -- so Doug's ease-of-use design goal has been achieved via different means.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

use strict;
use warnings;

package MySchema;
use base qw( KinoSearch::Schema );
use KinoSearch::Analysis::PolyAnalyzer;

sub analyzer {
    return KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
}

our %fields = (
    title   => 'text',
    content => 'text',
);

package main;

use KinoSearch::Store::RAMFolder;
use KinoSearch::InvIndex;
use KinoSearch::InvIndexer;
use KinoSearch::Searcher;

my $schema   = MySchema->new;
my $folder   = KinoSearch::Store::RAMFolder->new;
my $invindex = KinoSearch::InvIndex->open(
    schema => $schema,
    folder => $folder,
);
my $invindexer = KinoSearch::InvIndexer->new( invindex => $invindex );

my @docs = (
    { title => 'foo', content => 'foo foo', category => 'fooish' },
    { title => 'bar', content => 'bar bar', keyword  => 'barbarian' }
);

for my $doc (@docs) {
    # Dynamically define any new fields as 'text'.
    for my $field ( keys %$doc ) {
        $schema->add_field( $field => 'text' );
    }
    $invindexer->add_doc($doc);
}

$invindexer->finish;

my $searcher = KinoSearch::Searcher->new( invindex => $invindex );
my $hits = $searcher->search( query => 'barbarian' );
print "Total hits: " . $hits->total_hits . "\n";
