On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:

> > Fortunately, beaming field length data around is an easier problem than
> > distributed IDF, because with rare exceptions, the number of fields in a
> > typical index is minuscule compared to the number of terms.
>
> Right... so how do we control/configure when stats are fully
> recomputed corpus wide.... hmmm.  Should be fully app controllable.

Hmm, at first, I don't like the sound of that.  Right now, we're talking about
an esoteric need for a specific plugin, BM25 similarity.  The top level
indexer object should be oblivious to the implementation details of plugins.

However, the theme here is the need for an individual node to sync up with the
distributed corpus.  If you don't do that at index time, you have to do it at
search time, which isn't always ideal.  So I can see us building in some sort
of functionality to address that more general case.  It would be the flip of
the MultiSearcher-comprised-of-remote-searchables situation.

> > I guess you'd want to accumulate that average while building the segment...
> > oh wait, ugh, deletions are going to make that really messy.  :(
> >
> > Think about it for a sec, and see if you swing back to the desirability of
> > calculation on the fly using maxDoc(), like I just did.
>
> I think we'd store a float (holding avg(tf) that you computed when
> inverting that doc, ie, for all unique terms in the doc what's the avg
> of their freqs) for every doc, in the index.  Then we can regen fully
> when needed right?

Hmm, full regeneration would be expensive, so I'd discounted it.  You'd have
to iterate the entire posting list for every term, adding up freq() while
skipping deleted docs.

> Or maybe we store sum(tf) and #unique terms... hmm.
>
> Handling docs that did not have the field is a good point... but we
> can assign a special value (eg 0.0, or, any negative number say) to
> encode that?

Where?

In the full field storage?  Too slow to recover.

In the term dictionary?  The term dictionary can't store nulls.  You'd have to
use sentinels... thus restricting the allowable content of the field?!  No way.

In the Lucy-style mmap'd sort cache?  That would work, because we always have
a "null ord", to which documents which did not supply a value for the field
get assigned in the ords array.
However, sort/field caches are orthogonal to this problem and we don't want to
require them for an ancillary need.  I suppose you could do it by iterating
all posting lists for a field and flipping bits in a bit vector.  The bits
that are left unset correspond to docs with null values.

> Deletions I think across the board will skew stats until they are
> reclaimed.

Yes, and unless the stats are fully regenerated when a segment with deletions
gets merged away, the averages will be wrong to some degree, with the skew
potentially worsening over time.

Say that you have a segment with an average field length of 5 for the "tags"
field, but that that average is the result of most docs having 1 tag, while a
handful of docs have 100 tags.  Now say you delete all of the docs with 100
tags.  The recorded average for the "tags" field within the segment is now all
messed up -- it should be "1", but it's "5".

You have to regenerate a new, correct average when building a new segment.
You can't use the existing value of "5" as a shortcut, or the consolidated
segment's averages will be wrong from the get-go.

That's what I was getting at earlier.  However, I'd thought that we could get
around the problem by fudging with maxDoc(), and I no longer believe that.  I
think full regeneration is the only way.

Marvin Humphrey
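
A minimal sketch of the "tags" example above, written in Python rather than
as Lucene or Lucy code.  Everything named here is hypothetical and invented
only for illustration: the Segment class, its toy postings layout, and
regenerate_field_stats().  The split of 95 docs with 1 tag and 4 docs with
100 tags is an assumption, chosen only because it yields an average of 5.
Doc 99 is assumed to have no "tags" field at all, to show the bit vector idea.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy stand-in for one segment's postings for a single field."""
    postings: dict            # term -> {doc_id: freq}; hypothetical layout
    max_doc: int              # number of doc ids, including deleted docs
    deleted: set = field(default_factory=set)


def regenerate_field_stats(seg):
    """Walk every posting list, adding up freq and skipping deleted docs.

    Returns (average field length over live docs that have the field,
    bit vector of docs seen in at least one posting list).
    """
    lengths = {}
    for plist in seg.postings.values():
        for doc_id, freq in plist.items():
            if doc_id in seg.deleted:
                continue
            lengths[doc_id] = lengths.get(doc_id, 0) + freq
    # Bits left unset belong to docs with no value for the field.  In this
    # sketch deleted docs are also left unset, since they were skipped.
    has_field = [doc_id in lengths for doc_id in range(seg.max_doc)]
    avg = sum(lengths.values()) / len(lengths) if lengths else 0.0
    return avg, has_field


# Docs 0..94 carry one tag each; docs 95..98 carry 100 distinct tags each;
# doc 99 never supplied the field.
postings = {"common": {doc_id: 1 for doc_id in range(95)}}
for i in range(100):
    postings[f"tag{i}"] = {doc_id: 1 for doc_id in range(95, 99)}
seg = Segment(postings=postings, max_doc=100)

# Average recorded when the segment was built: (95*1 + 4*100) / 99.
stale_avg, _ = regenerate_field_stats(seg)

# Delete the four heavily tagged docs.  The recorded average does not change
# on its own; only a full pass over the postings produces the right value.
seg.deleted.update(range(95, 99))
fresh_avg, has_field = regenerate_field_stats(seg)

print(stale_avg, fresh_avg)
```

The cost is the point made above: the regeneration pass touches every posting
for the field, which is why it looked too expensive at first.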