Re: Flexible indexing

Marvin Humphrey Tue, 13 Mar 2007 18:41:39 -0800


On Mar 12, 2007, at 5:08 PM, Grant Ingersoll wrote:

I can see having storage at:
Index
Document/Field  //already exists
Token

I hadn't thought of it that way, as a logical extension outwards at all levels.

If I understand you correctly, it's a clever point, but the thing is, it's cake for someone to add arbitrary index-level data on their own, just by adding their own file. We'd have to come up with and support an infrastructure for handling this kind of data, and whatever we invented would be unlikely to suit all needs. Ergo, I think it makes sense for us to focus on the Token and Document/Field levels.

I think we can do much better with regards to opening up Document/Field retrieval. Under global field semantics, the fieldbits Byte is no longer needed. Go one step beyond that, and change the field number to a field name string, and documents can be handled as monolithic blobs when merging segments. Document storage becomes simply a combination of fixed width storage and (optional) variable width storage, and the possibilities for subclassing break wide open. Extended thoughts below.
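To make the "monolithic blobs" point concrete, here is a minimal sketch of a merge that never decodes a document. It assumes the layout described in the forwarded message (variable width blobs stored end to end, plus one 64-bit pointer per document); the big-endian byte order, the in-memory hashes, and the names make_segment and merge_segments are hypothetical and not part of KinoSearch or Lucene.

```perl
use strict;
use warnings;

# Hypothetical in-memory segment: 'ds' holds the serialized documents end
# to end, 'dsx' holds one 64-bit big-endian pointer per document.
sub make_segment {
    my @blobs = @_;
    my %seg = ( ds => '', dsx => '' );
    for my $blob (@blobs) {
        $seg{dsx} .= pack( 'Q>', length( $seg{ds} ) );
        $seg{ds}  .= $blob;
    }
    return \%seg;
}

# Merge without decoding any document: copy the blobs verbatim and shift
# each pointer by the number of bytes already written.
sub merge_segments {
    my @segments = @_;
    my %merged = ( ds => '', dsx => '' );
    for my $seg (@segments) {
        my $base = length( $merged{ds} );
        $merged{dsx}
            .= pack( 'Q>*', map { $_ + $base } unpack( 'Q>*', $seg->{dsx} ) );
        $merged{ds} .= $seg->{ds};
    }
    return \%merged;
}

my $merged = merge_segments(
    make_segment( 'title=a;body=x', 'title=b' ),
    make_segment('title=c;body=yz'),
);
my @pointers = unpack( 'Q>*', $merged->{dsx} );
print "@pointers\n";
```

Because each blob carries its own field names, nothing inside it depends on per-segment field numbering, so only the pointers need rewriting.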


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Begin forwarded message:
From: Marvin Humphrey <[EMAIL PROTECTED]>
Date: February 26, 2007 1:26:00 PM PST
To: KinoSearch discussion forum <[EMAIL PROTECTED]>
Subject: [KinoSearch] Subclassing DocWriter/DocReader
Reply-To: KinoSearch discussion forum <[EMAIL PROTECTED]>

Greets,

The file format changes in the new KS have opened up possibilities for subclassing DocWriter/DocReader, the classes responsible for storage/retrieval of serialized documents.


Here are some potential features that subclasses could implement:

  * storage of arbitrary data (e.g. arrayref values)
  * different field values for display and searching
  * complete document recovery
  * arbitrary compression algo choice
  * lazy loading
  * optimized external document storage (e.g. in SQL DB)

Anything else? The more ideas we dream up now and consider how to support, the better the design will be.

Right now, there are two files, _XXX.ds and _XXX.dsx, with .ds being "document storage", and .dsx being "document storage index". .ds is a stack of variable width records -- serialized documents -- stored end to end. .dsx is a stack of fixed width records: 64-bit pointers into the variable-width .ds file. (For a more extensive explanation, see <http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Docs/FileFormat.html>.)
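As an illustration of the two-file layout, here is a rough sketch of a lookup by document number, using in-memory strings in place of the real files. The big-endian byte order and the rule that a record ends where the next one begins are assumptions made for the example, and fetch_record is a hypothetical name; the actual format is whatever the FileFormat docs say.

```perl
use strict;
use warnings;

# Build the two "files" in memory: .ds holds variable width records end to
# end, .dsx holds one 64-bit offset per document (byte order is assumed).
my @docs = ( 'first doc', 'second, longer doc', 'third' );
my ( $ds, $dsx ) = ( '', '' );
for my $doc (@docs) {
    $dsx .= pack( 'Q>', length($ds) );    # where this record starts in .ds
    $ds  .= $doc;
}

# Fetch record $doc_num: index into .dsx by multiplication, then slice .ds.
sub fetch_record {
    my ( $ds_ref, $dsx_ref, $doc_num ) = @_;
    my $start = unpack( 'Q>', substr( $$dsx_ref, $doc_num * 8, 8 ) );

    # Assumed convention: a record ends where the next one starts, or at
    # the end of .ds for the last document.
    my $next_slot = ( $doc_num + 1 ) * 8;
    my $end
        = $next_slot < length($$dsx_ref)
        ? unpack( 'Q>', substr( $$dsx_ref, $next_slot, 8 ) )
        : length($$ds_ref);
    return substr( $$ds_ref, $start, $end - $start );
}

print fetch_record( \$ds, \$dsx, 1 ), "\n";
```

The fixed width of .dsx is what makes the first step a simple multiplication rather than a scan.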

The fixed width file, I intend to monkey with myself, because I'm going to start storing document boost as a 32-bit float within it. (That's what's driving this development track -- I need a place to put these doc boosts.)

My thinking is, why not add more than that? So long as the additional data is fixed width, we can still index into the .dsx file quickly.
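A minimal sketch of why extra fixed width data keeps lookups cheap: each record grows from 8 bytes to 12, and the seek is still a single multiplication. The record layout (pointer first, then boost), the byte order, and the names add_record and read_record are assumptions made for the example.

```perl
use strict;
use warnings;

# Assumed record layout: 64-bit big-endian .ds pointer followed by a
# 32-bit big-endian float boost.
my $TEMPLATE    = 'Q> f>';
my $RECORD_SIZE = length( pack( $TEMPLATE, 0, 0 ) );    # 8 + 4 = 12 bytes

my $dsx = '';    # stands in for the .dsx file

sub add_record {
    my ( $pointer, $boost ) = @_;
    $dsx .= pack( $TEMPLATE, $pointer, $boost );
}

# Records are fixed width, so the offset is doc_num * record size.
sub read_record {
    my ($doc_num) = @_;
    my $record = substr( $dsx, $doc_num * $RECORD_SIZE, $RECORD_SIZE );
    return unpack( $TEMPLATE, $record );
}

add_record( 0,   1.0 );
add_record( 140, 2.5 );
add_record( 975, 0.5 );

my ( $pointer, $boost ) = read_record(1);
print "pointer=$pointer boost=$boost\n";
```

A subclass that declared a larger fixed width size would only change the template and the multiplier.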

The variable width .ds file is up for grabs. Right now, docs are serialized using a scheme derived from Lucene which isn't really optimal for KS and doesn't need to be as complicated as it is. So long as we can recover a hash from the serialized data, we're fine.

Rough sketch example subclasses implementing storage of arbitrary data and external storage in a DB are below.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

#--------------------------------------------------------------------

package ArbitraryDataDocWriter;
use base qw( KinoSearch::Index::DocWriter );
use Storable qw( nfreeze );

sub store_doc {
    my ( $self, $doc ) = @_;
    my %ret_hash = ( var_width_data => nfreeze($doc) );
    return \%ret_hash;
}

package ArbitraryDataDocReader;
use base qw( KinoSearch::Index::DocReader );
use Storable qw( thaw );

sub fetch_doc {
    my ( $self, %args ) = @_;
    my $serialized;
    $self->read_var_width( \$serialized, $args{var_width_bytes} );
    # $serialized is a plain scalar filled in via the reference passed above.
    return thaw($serialized);
}


#--------------------------------------------------------------------

package DBDocWriter;
use base qw( KinoSearch::Index::DocWriter );
use DBI;

sub fixed_width_data_size { 8 }

sub store_doc {
    my ( $self, $doc ) = @_;
    $self->store_in_db($doc);
    my %ret_hash = ( fixed_width_data => $doc->{primary_key} );
    return \%ret_hash;
}

package DBDocReader;
use base qw( KinoSearch::Index::DocReader );
use DBI;

sub fixed_width_data_size { 8 }

sub fetch_doc {
    my ( $self, %args ) = @_;
    return $self->fetch_from_db( $args{fixed_width_data} );
}



