On Fri, Apr 10, 2009 at 09:52:01AM -0400, Michael McCandless wrote:

> >> So for Lucy you'll move away from KS's required "compound file
> >> format", and allow either compound or not.
> >
> > I think so.  It would be nice not to foreclose on the option of a
> > discrete file format.  In some sense discrete files are actually
> > easier to troubleshoot because you can inspect the individual binary
> > files directly.  And of course, that extra file copy at index time
> > goes away.
> 
> I'm still torn on how crucial "live" transparency is, vs running a
> tool (CheckIndex, Luke) to see things, but yes it's a plus.

It can be pretty powerful stuff.  Case in point: today, a colleague and I were
examining a snapshot_NNN.json file, and we realized that we could implement
index data truncation just by deleting the entries for unwanted segment files
from the "entries" array.  First, we hacked it up using ordinary Perl JSON
editing tools,
and now we're contemplating how to make it work for real in public KS module
code.
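
To make that concrete, the hack boils down to something like this -- a rough
sketch using the JSON and File::Slurp CPAN modules, with a made-up snapshot
file name and a made-up seg_N entry prefix:

    use JSON qw( decode_json encode_json );
    use File::Slurp qw( read_file write_file );

    my $path     = 'snapshot_5.json';              # hypothetical name
    my $snapshot = decode_json( read_file($path) );

    # Truncate the index by dropping the entries for unwanted
    # segments -- here, seg_4 and up (illustrative).
    $snapshot->{entries}
        = [ grep { $_ !~ m{^seg_[4-9]} } @{ $snapshot->{entries} } ];

    write_file( $path, encode_json($snapshot) );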

I think it's really, really important to approach the Lucy file format
specification as a public API design task.  Not so much because we expect
ordinary users to look at it, but because something which is simple and easy
to grok will stimulate and facilitate innovation by expert users.

The index directory structure should be as transparent as possible (see the
sketch after this list):

  * Segments should be housed in individual directories, to reinforce that
    each segment is an independent, coherent entity.
  * Segment data files should be given meaningful names.
  * As previously discussed and provisionally resolved, snapshot and segment
    metadata should be human-readable and easily accessible.
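
Concretely, I'm picturing a layout along these lines (all names are
placeholders, not a proposal for the actual spec):

    snapshot_5.json        # human-readable JSON snapshot metadata
    seg_1/
        segmeta.json       # human-readable segment metadata
        lexicon.dat        # binary segment data, meaningfully named
        postings.dat
    seg_2/
        segmeta.json
        ...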

But then, let's consider discrete vs. compound with regard to transparency.

When we're talking about discrete segment files, we're only talking about
binary data -- because the metadata is all in segmeta.json.  Those binary
files are hard to examine without a tool anyway -- hexdumping is hard core. :)

So, transparency-wise, perhaps not so much is gained by going discrete.

> But: I've often wondered whether that extra copy gains performance.

Mmm.  On a fragmented disk, I suppose it might.  And search-time performance
is more important than index-time performance.

I guess I'm now leaning towards requiring the compound file after all.  Are
there any other criteria we should consider?

I guess another thing is that creating lots of little files is reportedly
expensive on Windows.

FWIW, in the KS prototype, compound files are handled at the Store level,
rather than the index level.  So e.g. a HadoopFolder implementation wouldn't
perform the extra file copy.
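
In sketch form -- class and method names here are hypothetical, not the
actual KS Store API:

    package MyHadoopFolder;
    use base qw( KinoSearch::Store::Folder );

    # Hypothetical hook: if building the compound file is the Folder's
    # job, a Folder that gains nothing from it can simply decline.
    sub consolidate {
        my ( $self, $seg_name ) = @_;
        return;    # no-op: leave the segment's files discrete
    }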

> I suppose we could do a similar trick to the discrete files too.

The only way we could do that would be to rewrite them all at the end of the
indexing session.  I'd want to see benchmarking data showing that the rewrite
actually bought something before committing to it.

> > Meaning, we'd use the column-stride fields during matching/scoring,
> > but IndexReader.fetchDoc() would never use them.
> 
> I'm still unsure.  I think how CSFs get used will be app dependent; I
> guess we throw them out there and see how they're used.

OK, fair enough.  

I guess my first priority is implementing mmap'd sort caches, and I consider
that important enough that I don't want any design compromises made to sort
caching just because we've decided that we want column-stride fields to do
double duty as both field storage and sort caches.  The field storage use
case is much less important, IMO.

> EG maybe the APP stores price & manuf only in CSF and not in the
> "normal" stored fields... the additional cost to retrieve stored field
> / excerpts for each of 10 hits on the page can be quite high.

FWIW...  I've already implemented a ByteBufDocReader proof-of-concept class in
pure Perl; instead of serializing all fields marked as "stored", it stores one
fixed-width byte array per document -- so doc storage is essentially a
flatfile.  I'm also pretty close to finishing a ZlibDocReader that uses Zlib
compression. (The "compressed" field spec flag has been removed.)  
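
The read path is the whole point, so here's a toy version of the idea in
plain Perl -- not the actual ByteBufDocReader code, and every name is
invented:

    package ToyByteBufDocReader;

    sub new {
        my ( $class, $path, $width ) = @_;
        open( my $fh, '<:raw', $path ) or die "Can't open '$path': $!";
        return bless { fh => $fh, width => $width }, $class;
    }

    # One fixed-width record per document: fetching a doc is a seek
    # plus a read, with no per-field deserialization.
    sub fetch_doc {
        my ( $self, $doc_num ) = @_;
        seek( $self->{fh}, $doc_num * $self->{width}, 0 );
        read( $self->{fh}, my $buf, $self->{width} );
        return $buf;
    }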

Lucy's plugin API will allow people to add their own column-stride field
storage if they so choose.  If we make the sort cache writer pluggable as
well, then someone could make something that performs double duty, if that's
important to them.

To my mind, it's more important to expose a plugin API that makes this kind of
thing possible than it is to provide support for column-stride fields in core
Lucy.

> Well if the terms are "largely" unique (eg a "title" field), you lose
> by treating them as enumerated (as Lucene does today) since the extra
> deref only adds waste.

That's true, but it's not important for sorted search.  The vast majority of
comparisons will occur within the segment using only the ords.  We only need
to retrieve the values for resolving inter-segment comparisons, and the extra
cost won't matter too much there because it scales with the number of segments
rather than the number of documents in the index.
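
Spelled out in illustrative Perl (all names invented -- this is the shape of
the logic, not KS code):

    # Within a segment, a sort comparison touches only the ords.
    sub compare_within_seg {
        my ( $ords, $doc_a, $doc_b ) = @_;
        return $ords->[$doc_a] <=> $ords->[$doc_b];
    }

    # Values only get dereferenced when merging hits across segments,
    # so that cost scales with segment count, not doc count.
    sub compare_across_segs {
        my ( $seg_a, $doc_a, $seg_b, $doc_b ) = @_;
        return $seg_a->value_for_ord( $seg_a->ord($doc_a) )
           cmp $seg_b->value_for_ord( $seg_b->ord($doc_b) );
    }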

Are there other uses for a field cache where the extra deref would
meaningfully impede performance?

> But then, if you are going to sort by that field, you need the ord
> [deref] anyway, so... yeah maybe we only implement "enumerated"
> values, at least for the sort cache.
> 
> If I just want to retrieve values, and the values are mostly unique,
> it's more efficient to inline the values.
> 
> And yes at indexing time we know precisely all stats on the field so
> we could pick and choose between the "hey it's obviously enumerated"
> vs the "hey most values are unique" cases, if we want to do both.

My inclination is to implement the enumerated version for now and add code to
support the "hey most values are unique" case if necessary later.

> > I prefer the two-file model over the block model because comparison
> > of long values which cross block boundaries gets messy unless you
> > build a contiguous value by copying.  With the two-file model, you
> > just pass around pointers into the mmap'd data.  Locality of
> > reference isn't a big deal if those two files are jammed right next
> > to each other in the compound file.
> 
> In the block-based approach, you'd still need to store the pointers
> (block + offset) for each value somewhere?

Ha, yeah, I guess that's true.   Block-based -1, two-file FTW.
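
For the record, here's roughly what that two-file read path looks like -- a
minimal sketch assuming the File::Map CPAN module, made-up file names, and an
offsets file holding N+1 packed 64-bit integers for N unique values:

    use File::Map qw( map_file );

    map_file( my $offsets, 'sort.ix',  '<' );    # packed offsets
    map_file( my $values,  'sort.dat', '<' );    # concatenated values

    # Two adjacent offsets bound a value, so "passing around pointers"
    # becomes a substr against the mmap'd region -- no copying needed
    # to stitch a value back together across block boundaries.
    sub value_for_ord {
        my $ord = shift;
        my ( $start, $end )
            = unpack 'Q2', substr( $offsets, $ord * 8, 16 );
        return substr( $values, $start, $end - $start );
    }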

Marvin Humphrey
