On Mar 12, 2007, at 5:08 PM, Grant Ingersoll wrote:

I can see having storage at:
Index
Document/Field  //already exists
Token

I hadn't thought of it that way, as a logical extension outwards at all levels.

If I understand you correctly, it's a clever point, but the thing is, it's cake for someone to add arbitrary index-level data on their own, just by adding their own file. We'd have to come up with and support an infrastructure for handling this kind of data, and whatever we invented would be unlikely to suit all needs. Ergo, I think it makes sense for us to focus on the Token and Document/Field levels.

I think we can do much better with regard to opening up Document/Field retrieval. Under global field semantics, the fieldbits byte is no longer needed. Go one step beyond that and change the field number to a field name string, and documents can be handled as monolithic blobs when merging segments. Document storage becomes simply a combination of fixed width storage and (optional) variable width storage, and the possibilities for subclassing break wide open. Extended thoughts below.
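To make the "monolithic blobs" point concrete, here is a rough sketch of what a merge could look like once a serialized document no longer carries per-segment field numbers. Nothing below is existing Lucene or KinoSearch API: merge_segments() and the ds/pointers hash keys are invented for illustration, and record lengths are assumed to be derivable from neighboring start offsets.

```perl
use strict;
use warnings;

# Hypothetical segment layout: { ds => $bytes, pointers => [ start offsets ] }.
# Each serialized doc is copied byte-for-byte; only the pointers are rebased.
sub merge_segments {
    my @segments = @_;
    my %merged = ( ds => '', pointers => [] );
    for my $seg (@segments) {
        my @starts = @{ $seg->{pointers} };
        for my $i ( 0 .. $#starts ) {
            # A record ends where the next one starts, or at end of file.
            my $end = $i < $#starts ? $starts[ $i + 1 ] : length( $seg->{ds} );
            push @{ $merged{pointers} }, length( $merged{ds} );
            $merged{ds} .= substr( $seg->{ds}, $starts[$i], $end - $starts[$i] );
        }
    }
    return \%merged;
}

my $merged = merge_segments(
    { ds => 'aaabb', pointers => [ 0, 3 ] },
    { ds => 'cdddd', pointers => [ 0, 1 ] },
);
print join( ',', @{ $merged->{pointers} } ), "\n";
```

The merge never decodes a document, which is the payoff: if records stored field numbers, each one would have to be parsed and renumbered against the merged segment's field list.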

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Begin forwarded message:
From: Marvin Humphrey <[EMAIL PROTECTED]>
Date: February 26, 2007 1:26:00 PM PST
To: KinoSearch discussion forum <[EMAIL PROTECTED]>
Subject: [KinoSearch] Subclassing DocWriter/DocReader
Reply-To: KinoSearch discussion forum <[EMAIL PROTECTED]>

Greets,

The file format changes in the new KS have opened up possibilities for subclassing DocWriter/DocReader, the classes responsible for storage/retrieval of serialized documents.

Here are some potential features that subclasses could implement:

  * storage of arbitrary data (e.g. arrayref values)
  * different field values for display and searching
  * complete document recovery
  * arbitrary compression algo choice
  * lazy loading
  * optimized external document storage (e.g. in SQL DB)

Anything else? The more ideas we dream up now and consider how to support, the better the design will be.

Right now, there are two files, _XXX.ds and _XXX.dsx, with .ds being "document storage", and .dsx being "document storage index". .ds is a stack of variable width records -- serialized documents -- stored end to end. .dsx is a stack of fixed width records: 64-bit pointers into the variable-width .ds file. (For a more extensive explanation, see <http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Docs/FileFormat.html>)
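A minimal sketch of that two-file lookup, using in-memory strings as stand-ins for the real files. The big-endian byte order, the 0-based doc numbers, and the trick of deriving a record's length from the next pointer are assumptions made for illustration, not a description of the actual KS format. append_doc() and fetch_serialized() are made-up names, and the 'Q>' pack template needs a perl built with 64-bit integer support.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the _XXX.ds and _XXX.dsx files.
my ( $ds, $dsx ) = ( '', '' );

# Append one serialized doc: record its start offset in .dsx (8 bytes),
# then add the variable-width bytes to .ds.
sub append_doc {
    my ($serialized) = @_;
    $dsx .= pack( 'Q>', length($ds) );
    $ds  .= $serialized;
}

# Fetch the raw bytes for a doc by seeking straight to its .dsx record.
sub fetch_serialized {
    my ($doc_num) = @_;
    my $num_docs = length($dsx) / 8;
    die "doc $doc_num out of range"
        if $doc_num < 0 || $doc_num >= $num_docs;
    my $start = unpack( 'Q>', substr( $dsx, $doc_num * 8, 8 ) );
    my $end =
        $doc_num + 1 < $num_docs
        ? unpack( 'Q>', substr( $dsx, ( $doc_num + 1 ) * 8, 8 ) )
        : length($ds);
    return substr( $ds, $start, $end - $start );
}

append_doc($_) for ( 'first doc', 'the second, longer doc', 'third' );
print fetch_serialized(1), "\n";
```

Because every .dsx record is the same size, finding doc N costs one multiplication and one read, no matter how large the variable-width file gets.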

The fixed width file, I intend to monkey with myself, because I'm going to start storing document boost as a 32-bit float within it. (That's what's driving this development track -- I need a place to put these doc boosts.)

My thinking is, why not add more than that? So long as the additional data is fixed width, we can still index into the .dsx file quickly.
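As a sketch of why extra fixed width data costs nothing at lookup time, here is a record that carries a 32-bit float boost next to the 64-bit pointer. The field order, the big-endian layout, and the names pack_record() and read_record() are assumptions for illustration only.

```perl
use strict;
use warnings;

# Assumed record layout: 64-bit pointer followed by 32-bit float boost.
use constant RECORD_TEMPLATE => 'Q> f>';
use constant RECORD_SIZE     => 12;        # 8 + 4 bytes

sub pack_record {
    my ( $pointer, $boost ) = @_;
    return pack( RECORD_TEMPLATE, $pointer, $boost );
}

# Records stay fixed width, so doc N still lives at N * RECORD_SIZE.
sub read_record {
    my ( $dsx, $doc_num ) = @_;
    my $offset = $doc_num * RECORD_SIZE;
    die "doc $doc_num out of range"
        if $doc_num < 0 || $offset + RECORD_SIZE > length($dsx);
    my ( $pointer, $boost ) =
        unpack( RECORD_TEMPLATE, substr( $dsx, $offset, RECORD_SIZE ) );
    return { pointer => $pointer, boost => $boost };
}

my $dsx = pack_record( 0, 1.5 ) . pack_record( 100, 0.25 );
my $rec = read_record( $dsx, 1 );
printf "pointer=%d boost=%s\n", $rec->{pointer}, $rec->{boost};
```

Only RECORD_SIZE changes when fields are added; the lookup arithmetic stays the same.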

The variable width .ds file is up for grabs. Right now, docs are serialized using a scheme derived from Lucene which isn't really optimal for KS and doesn't need to be as complicated as it is. So long as we can recover a hash from the serialized data, we're fine.

Rough sketch example subclasses implementing storage of arbitrary data and external storage in a DB are below.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

#--------------------------------------------------------------------

package ArbitraryDataDocWriter;
use base qw( KinoSearch::Index::DocWriter );
use Storable qw( nfreeze );

# Serialize the whole doc hash, nested data included, as one
# variable-width blob.
sub store_doc {
    my ( $self, $doc ) = @_;
    my %ret_hash = ( var_width_data => nfreeze($doc) );
    return \%ret_hash;
}

package ArbitraryDataDocReader;
use base qw( KinoSearch::Index::DocReader );
use Storable qw( thaw );

# Read the blob back and rebuild the original hash.
sub fetch_doc {
    my ( $self, %args ) = @_;
    my $serialized;
    $self->read_var_width( \$serialized, $args{var_width_bytes} );
    # $serialized holds the frozen bytes themselves, not a reference.
    return thaw($serialized);
}


#--------------------------------------------------------------------

package DBDocWriter;
use base qw( KinoSearch::Index::DocWriter );
use DBI;

# Each doc's fixed-width record holds only its database primary key.
sub fixed_width_data_size { 8 }

sub store_doc {
    my ( $self, $doc ) = @_;
    $self->store_in_db($doc);    # store_in_db() not shown
    my %ret_hash = ( fixed_width_data => $doc->{primary_key} );
    return \%ret_hash;
}

package DBDocReader;
use base qw( KinoSearch::Index::DocReader );
use DBI;

# Must match the writer's record size.
sub fixed_width_data_size { 8 }

sub fetch_doc {
    my ( $self, %args ) = @_;
    # fetch_from_db() not shown; it looks the doc up by primary key.
    return $self->fetch_from_db( $args{fixed_width_data} );
}




