Multi-node stats within individual nodes (was "Baby steps...")

Marvin Humphrey Sun, 07 Mar 2010 08:43:34 -0800

On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:
> > Fortunately, beaming field length data around is an easier problem than
> > distributed IDF, because with rare exceptions, the number of fields in a
> > typical index is minuscule compared to the number of terms.
> 
> Right... so how do we control/configure when stats are fully
> recomputed corpus wide.... hmmm.  Should be fully app controllable.


Hmm, at first, I don't like the sound of that.  Right now, we're talking about
an esoteric need for a specific plugin, BM25 similarity.  The top level
indexer object should be oblivious to the implementation details of plugins.

However, the theme here is the need for an individual node to sync up with the
distributed corpus.  If you don't do that at index time, you have to do it at
search time, which isn't always ideal.  So I can see us building in some sort
of functionality to address that more general case.  It would be the flip of
the MultiSearcher-comprised-of-remote-searchables situation.

> > I guess you'd want to accumulate that average while building the segment...
> > oh wait, ugh, deletions are going to make that really messy.  :(
> >
> > Think about it for a sec, and see if you swing back to the desirability of
> > calculation on the fly using maxDoc(), like I just did.
> 
> I think we'd store a float (holding avg(tf) that you computed when
> inverting that doc, ie, for all unique terms in the doc what's the avg
> of their freqs) for every doc, in the index.  Then we can regen fully
> when needed right?  

Hmm, full regeneration would be expensive, so I'd discounted it.  You'd have
to iterate the entire posting list for every term, adding up freq() while
skipping deleted docs.
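
To make the cost concrete, here is a minimal sketch of that regeneration pass. It is not Lucene or Lucy code: the `postings` dict (term -> list of `(doc_id, freq)` pairs), the `deleted` set, and the function name are all hypothetical stand-ins for the real posting-list and deletions APIs. It also assumes the average is taken over live docs that have at least one term in the field.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory stand-in for one field's postings:
# term -> list of (doc_id, freq) pairs.
Postings = Dict[str, List[Tuple[int, int]]]


def regenerate_avg_field_length(postings: Postings, deleted: Set[int]) -> float:
    """Recompute the average field length by visiting every posting."""
    lengths: Dict[int, int] = {}
    for plist in postings.values():          # every term in the field
        for doc_id, freq in plist:           # every posting for that term
            if doc_id in deleted:
                continue                     # skip deleted docs
            lengths[doc_id] = lengths.get(doc_id, 0) + freq
    if not lengths:
        return 0.0                           # no live doc has the field
    return sum(lengths.values()) / len(lengths)


postings = {"a": [(0, 2), (1, 1)], "b": [(0, 1), (2, 3)]}
print(regenerate_avg_field_length(postings, deleted={2}))  # prints 2.0
```

The work is proportional to the total number of postings in the field, which is why it is expensive on a large index.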

> Or maybe we store sum(tf) and #unique terms... hmm.
> 
> Handling docs that did not have the field is a good point... but we
> can assign a special value (eg 0.0, or, any negative number say)  to
> encode that?

Where?

In the full field storage?  Too slow to recover.

In the term dictionary?  The term dictionary can't store nulls.  You'd have to
use sentinels... thus restricting the allowable content of the field?!  No
way.

In the Lucy-style mmap'd sort cache?  That would work, because we always have
a "null ord", to which documents which did not supply a value for the field
get assigned in the ords array.  However, sort/field caches are orthogonal to
this problem and we don't want to require them for an ancillary need.

I suppose you could do it by iterating all posting lists for a field and
flipping bits in a bit vector.  The bits that are left unset correspond to
docs with null values.
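
A rough sketch of that bit-vector pass, under the same caveats as any toy example: `postings` (term -> list of `(doc_id, freq)` pairs), `max_doc`, and the function name are hypothetical, and a Python int stands in for a real bit vector.

```python
from typing import Dict, List, Tuple


def docs_missing_field(postings: Dict[str, List[Tuple[int, int]]],
                       max_doc: int) -> List[int]:
    """Return doc ids in [0, max_doc) that never appear in the field's postings."""
    bits = 0                                  # Python int used as a bit vector
    for plist in postings.values():           # every posting list for the field
        for doc_id, _freq in plist:
            bits |= 1 << doc_id               # flip the bit for this doc
    # Bits left unset belong to docs that supplied no value for the field.
    return [d for d in range(max_doc) if not (bits >> d) & 1]


postings = {"x": [(0, 1), (2, 1)], "y": [(2, 4)]}
print(docs_missing_field(postings, max_doc=4))  # prints [1, 3]
```

Like full regeneration, this touches every posting in the field, so it is no cheaper.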

> Deletions I think across the board will skew stats until they are
> reclaimed.

Yes, and unless the stats are fully regenerated when a segment with deletions
gets merged away, the averages will be wrong to some degree, with the skew
potentially worsening over time.

Say that you have a segment with an average field length of 5 for the "tags"
field, but that the average is the result of most docs having 1 tag, while a
handful of docs have 100 tags.  Now say you delete all of the docs with 100
tags.  The recorded average for the "tags" field within the segment is now all
messed up -- it should be "1", but it's "5".  You have to regenerate a new,
correct average when building a new segment.  You can't use the existing value
of "5" as a shortcut, or the consolidated segment's averages will be wrong
from the get-go.
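
The arithmetic is easy to check. The doc counts below are an assumption chosen so the numbers come out exactly (95 docs with 1 tag plus 4 docs with 100 tags gives an average of 5); nothing here is index code.

```python
def average(lengths):
    """Mean of a list of field lengths; 0.0 for an empty list."""
    return sum(lengths) / len(lengths) if lengths else 0.0


# Assumed segment: 95 docs with 1 tag and 4 docs with 100 tags.
tag_counts = [1] * 95 + [100] * 4
recorded_avg = average(tag_counts)     # what the segment stored when built

# Delete every doc with 100 tags; only the 1-tag docs stay live.
live = [n for n in tag_counts if n != 100]
true_avg = average(live)               # what a full regeneration would yield

print(recorded_avg, true_avg)          # prints 5.0 1.0
```

Knowing only the recorded average and the count of deleted docs does not help: the correction depends on the lengths of the specific docs that were deleted.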

That's what I was getting at earlier.  However, I'd thought that we could get
around the problem by fudging with maxDoc(), and I no longer believe that.  I
think full regeneration is the only way.

Marvin Humphrey

