Re: Sort cache file format

Marvin Humphrey Thu, 30 Apr 2009 16:18:09 -0700

On Sat, Apr 11, 2009 at 11:16:28AM -0400, Michael McCandless wrote:

> > But then, let's consider discrete vs. compound with regards to transparency.
> >
> > When we're talking about discrete segment files, we're only talking about
> > binary data -- because the metadata is all in segmeta.json.  Those binary
> > files are hard to examine without a tool anyway -- hexdumping is hard
> > core. :)
> >
> > So, transparency-wise, perhaps not so much is gained by going discrete.
> 
> You can list their size, and see their presence or not.


Well, in the current KS format, there are two "real" files which make up the
compound system:

  * cf.dat -- binary data.
  * cfmeta.json -- list of file names mapped to offset and length.

So, opening the cfmeta.json file is analogous to a directory listing, though
slightly less information-rich and intuitive.
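
To make the "directory listing" analogy concrete, here is a minimal Python sketch of a two-file compound layout.  Only the names cf.dat and cfmeta.json come from the description above; the JSON schema (an "offset" and a "length" key per entry), the helper functions, and the virtual file names are assumptions for illustration, not the actual KS format.

```python
import json
import os
import tempfile

# File names follow the message; the JSON schema ("offset"/"length" keys)
# is an assumption made for this sketch.
DATA_FILE = "cf.dat"
META_FILE = "cfmeta.json"


def write_compound(dirpath, files):
    """Concatenate {name: bytes} into cf.dat and record where each one landed."""
    meta = {}
    offset = 0
    with open(os.path.join(dirpath, DATA_FILE), "wb") as dat:
        for name, data in files.items():
            dat.write(data)
            meta[name] = {"offset": offset, "length": len(data)}
            offset += len(data)
    with open(os.path.join(dirpath, META_FILE), "w") as fh:
        json.dump(meta, fh, indent=2)


def load_meta(dirpath):
    """Parse cfmeta.json into a dict of name -> {"offset", "length"}."""
    with open(os.path.join(dirpath, META_FILE)) as fh:
        return json.load(fh)


def list_compound(dirpath):
    """The 'directory listing': virtual file name -> size in bytes."""
    return {name: entry["length"] for name, entry in load_meta(dirpath).items()}


def read_virtual_file(dirpath, name):
    """Seek to the recorded offset in cf.dat and read exactly `length` bytes."""
    entry = load_meta(dirpath)[name]
    with open(os.path.join(dirpath, DATA_FILE), "rb") as dat:
        dat.seek(entry["offset"])
        return dat.read(entry["length"])


# Hypothetical virtual file names, chosen only for the demo.
with tempfile.TemporaryDirectory() as tmp:
    write_compound(tmp, {"postings.dat": b"\x00\x01\x02", "lexicon.dat": b"abcd"})
    for name, size in list_compound(tmp).items():
        print(f"{size:>8}  {name}")
```

Under these assumptions, presence and size are both recoverable from the metadata file alone, which is the sense in which opening it stands in for listing a directory.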

> > FWIW...  I've already implemented a ByteBufDocReader proof-of-concept
> > class in pure Perl; instead of serializing all fields marked as "stored",
> > it stores one fixed-width byte array per document -- so doc storage is
> > essentially a flatfile.  I'm also pretty close to finishing a
> > ZlibDocReader that uses Zlib compression. (The "compressed" field spec
> > flag has been removed.)
> 
> How can doc storage be fixed width?  (text fields have different
> length).

It's not real doc storage.  The Stored() attribute is ignored by this
implementation; only one fixed-length byte array gets written for each doc.
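
A rough Python sketch of the flatfile idea follows.  It is not the ByteBufDocReader code (which is pure Perl); the 8-byte record width, the 0-based doc ids, the function names, and the choice of packing an external row key into each record are all assumptions for illustration.

```python
import io
import struct

# Assumed record width: every document gets exactly this many bytes.
RECORD_WIDTH = 8


def write_record(fh, doc_id, payload):
    """Write one fixed-width record at the slot implied by doc_id."""
    if len(payload) != RECORD_WIDTH:
        raise ValueError(f"payload must be exactly {RECORD_WIDTH} bytes")
    fh.seek(doc_id * RECORD_WIDTH)
    fh.write(payload)


def read_record(fh, doc_id):
    """Fetch a record with one seek; no per-document index is needed."""
    fh.seek(doc_id * RECORD_WIDTH)
    record = fh.read(RECORD_WIDTH)
    if len(record) != RECORD_WIDTH:
        raise IndexError(f"no record for doc {doc_id}")
    return record


# Demo: store a hypothetical 64-bit external key per document, e.g. the
# primary key of the row that holds the real document in a database.
store = io.BytesIO()
for doc_id, external_key in enumerate([1001, 1002, 1003]):
    write_record(store, doc_id, struct.pack(">Q", external_key))

(key,) = struct.unpack(">Q", read_record(store, 1))
```

Because every record has the same width, the byte offset is just doc_id times the width, so lookup needs no offset table.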

The main use case for this is when documents are stored externally -- perhaps
in a database, or potentially, on separate doc servers.  For large search
clusters, dedicated doc/highlight servers are a good idea, and this is a start
in that direction.

> So you removed "compressed" from FieldSpec and instead the user swaps
> out the DocReader component?  I wonder how compression compares if you
> did column-stride body text vs row stride body text plus all other
> fields.

Let a thousand flowers bloom -- make doc storage pluggable, and let people
experiment.
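
As one way to picture "pluggable", here is a small Python sketch in which the storage component is swapped rather than toggled by a per-field flag.  The class and method names are hypothetical and are not the KS DocReader API; the sketch keeps documents in memory and only shows the plug-in shape, with Python's standard zlib module standing in for the compression step.

```python
import zlib


class PlainDocStore:
    """Hypothetical baseline store: keeps each document as UTF-8 bytes."""

    def __init__(self):
        self._docs = []

    # Subclasses override these two hooks to change the on-disk encoding.
    def _encode(self, raw):
        return raw

    def _decode(self, blob):
        return blob

    def add(self, text):
        """Store a document and return its (0-based) doc id."""
        self._docs.append(self._encode(text.encode("utf-8")))
        return len(self._docs) - 1

    def fetch(self, doc_id):
        """Return the original text for doc_id."""
        return self._decode(self._docs[doc_id]).decode("utf-8")

    def stored_bytes(self):
        """Total bytes held after encoding, for comparing implementations."""
        return sum(len(blob) for blob in self._docs)


class ZlibDocStore(PlainDocStore):
    """Same interface, but each document is zlib-compressed individually."""

    def _encode(self, raw):
        return zlib.compress(raw)

    def _decode(self, blob):
        return zlib.decompress(blob)


# Callers depend only on add/fetch, so implementations can be swapped freely.
body = "sort cache file format " * 200
sizes = {}
for store_class in (PlainDocStore, ZlibDocStore):
    docs = store_class()
    doc_id = docs.add(body)
    assert docs.fetch(doc_id) == body
    sizes[store_class.__name__] = docs.stored_bytes()
```

With this shape, comparing storage strategies is a matter of writing another subclass and measuring it against the same documents.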

Marvin Humphrey
