On Thu, Mar 04, 2010 at 12:23:38PM -0500, Michael McCandless wrote:
> > In a multi-node search cluster, pre-calculating norms at index-time
> > wouldn't work well without additional communication between nodes to
> > gather corpus-wide stats.  But I suspect the same trick that works
> > for IDF in large corpuses would work for average field length: it
> > will tend to be stable over time, so you can update it
> > infrequently.
> 
> Right I imagine we'd need to use this trick within a single index,
> too.  Recomputing norms for the entire index when only a small new segment
> was added to the new NRT reader will probably be too costly.

Agreed.  But you definitely want corpus-wide stats, because you're not
guaranteed a consistent distribution of field lengths across nodes.

Hoss had a good example illustrating why per-node IDF doesn't always work well
in a cluster: a search cluster of news content with nodes divided by year, where
the top-scoring hit for "iphone" is a misspelling from 1997 (because it was an
extremely rare term on that search node).

Similarly, if you calc field length stats on one node where the "tags" field
averages 50 tokens and on another node where it averages 5, you're going to
get screwy results.

Fortunately, beaming field length data around is an easier problem than
distributed IDF, because with rare exceptions, the number of fields in a
typical index is minuscule compared to the number of terms.
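
As a rough sketch (all names below are hypothetical, not an existing API), each
node could periodically report per-field token and doc counts, and whoever
coordinates the cluster folds them into corpus-wide averages:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Hypothetical sketch: fold per-node stats (field name -> {tokenCount, docCount})
  // into corpus-wide average field lengths.
  class CorpusFieldStats {
    static Map<String, Double> avgFieldLength(List<Map<String, long[]>> perNode) {
      Map<String, long[]> totals = new HashMap<String, long[]>();
      for (Map<String, long[]> node : perNode) {
        for (Map.Entry<String, long[]> e : node.entrySet()) {
          long[] t = totals.get(e.getKey());
          if (t == null) {
            t = new long[2];
            totals.put(e.getKey(), t);
          }
          t[0] += e.getValue()[0];  // tokens
          t[1] += e.getValue()[1];  // docs containing the field
        }
      }
      Map<String, Double> avg = new HashMap<String, Double>();
      for (Map.Entry<String, long[]> e : totals.entrySet()) {
        long[] t = e.getValue();
        avg.put(e.getKey(), t[1] == 0 ? 0.0 : (double) t[0] / t[1]);
      }
      return avg;
    }
  }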

> Though one alternative (if you don't mind burning RAM) is to skip
> casting to norms, ie store the actual field length, and do the
> divide-by-avg during scoring (though that's a biggish hit to search
> perf).

I suppose that's theoretically available to a codec if desired, but it
wouldn't ever be a first choice of mine.
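
For contrast, here's a back-of-the-envelope sketch of the two options
(hypothetical names; the one-byte quantization just stands in for whatever
encoding a real codec would use):

  // Hypothetical sketch contrasting the two approaches to length normalization.
  class LengthNorm {
    // (a) Index time: quantize length/avg into one byte per doc; scoring is a lookup.
    static byte[] precompute(int[] fieldLengths, float avgFieldLength) {
      byte[] norms = new byte[fieldLengths.length];
      for (int docID = 0; docID < fieldLengths.length; docID++) {
        norms[docID] = encode(fieldLengths[docID] / avgFieldLength);
      }
      return norms;
    }

    // (b) Search time: keep the raw int lengths (4x the RAM of a byte[]) and
    //     divide by the current average inside the scorer, per hit.
    static float atScoreTime(int[] fieldLengths, int docID, float avgFieldLength) {
      return fieldLengths[docID] / avgFieldLength;
    }

    static byte encode(float ratio) {
      // Stand-in for a real 8-bit encoding of the length ratio.
      return (byte) Math.min(127, Math.round(ratio * 16));
    }
  }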

> >   token_counts: {
> >       segment: {
> >           title: 4,
> >           content: 154,
> >       },
> >       all: {
> >           title: 98342,
> >           content: 2854213
> >       }
> >   }
> >
> > (Would that suffice?  I don't recall the gory details of BM25.)
> 
> I think so, though why store all, per segment?  Reader can regen on
> open?  (That above json comes from a single segment right?).

You're right, no need to store "all", calculating on the fly is cheap.
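
Something like this on reader open would do it (hypothetical names):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Hypothetical sketch: regenerate the "all" totals by summing the per-segment
  // token counts when the reader opens, rather than persisting them.
  class TokenCountTotals {
    static Map<String, Long> onReaderOpen(List<Map<String, Long>> perSegment) {
      Map<String, Long> all = new HashMap<String, Long>();
      for (Map<String, Long> segment : perSegment) {
        for (Map.Entry<String, Long> e : segment.entrySet()) {
          Long prev = all.get(e.getKey());
          long sum = (prev == null ? 0L : prev.longValue()) + e.getValue().longValue();
          all.put(e.getKey(), sum);
        }
      }
      return all;
    }
  }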

> lnu.ltc would need sum(avg(tf)) as well.

Hmm, I was thinking you'd calc that on the fly, but then deriving the average
means you have to know the number of docs where the field was not null --
which could be different from maxDoc() for the segment.

I guess you'd want to accumulate that average while building the segment...
oh wait, ugh, deletions are going to make that really messy.  :(

Think about it for a sec, and see if you swing back to calculating it on the
fly using maxDoc(), like I just did.
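
Concretely, what I mean by calculating on the fly (hypothetical names):

  // Hypothetical sketch: derive the per-segment average field length on the fly
  // from the stored token count and maxDoc(), instead of tracking the exact
  // number of docs where the field is non-null (which deletions would skew anyway).
  class AvgFieldLength {
    static float forSegment(long tokenCount, int maxDoc) {
      // Deleted docs and docs that never had the field stay in the denominator,
      // so this is an approximation -- but a cheap, stable one.
      return maxDoc == 0 ? 0.0f : (float) tokenCount / maxDoc;
    }
  }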

> >> The norms array will be stored in this per-field sim instance.
> >
> > Interesting, but that wasn't where I was thinking of putting them.
> > Similarity objects need to be sent over the network, don't they?  At
> > least they do in KS.  So I think we need a local per-field
> > PostingsReader object to hold such cached data.
> 
> OK maybe not stored on them, but, accessible to them.  Maybe cached in
> the SegmentReader.

Well, I think SegmentReader should be as minimal as possible, with most of the
real action happening down in sub-readers -- so I think the cached norms
arrays belong in a sub-reader.  But we're almost on the same page.

> Though we need every norm(docID) lookup to be fast.  Maybe we ask the
> per-field Similarity to give us a scorer, that holds the right byte[]?

I absolutely agree that scorers need to be operating on a raw byte array.
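
In other words, something like this (hypothetical names), with the byte[]
cached by the posting sub-reader and handed to the scorer up front:

  // Hypothetical sketch: the per-field Similarity builds a scorer that closes
  // over the raw norms byte[] cached by the sub-reader, so the per-doc lookup
  // is a bare array access plus a 256-entry decode table.
  class TFIDFScorer {
    private final byte[] norms;        // one byte per doc, cached by the sub-reader
    private final float[] normDecoder; // 256 decoded values

    TFIDFScorer(byte[] norms, float[] normDecoder) {
      this.norms = norms;
      this.normDecoder = normDecoder;
    }

    float norm(int docID) {
      return normDecoder[norms[docID] & 0xFF];
    }
  }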

> > What do you do when you have to reconcile two posting codecs like this?
> >
> >  * doc id, freq, position, part-of-speech identifier
> >  * doc id, boost
> >
> > Do you silently drop all information except doc id?
> 
> I don't know -- we haven't hit that yet ;)  The closest we have is
> when <doc id> is merged with <doc id,freq,<position+>>, and in that
> case we drop the freq,<position+>.

OK, I suppose that answers my question.  I dislike the notion of silently
discarding data on merge conflict, as it becomes possible for one bunk
document to poison an entire index.  But then I also dislike the notion of
inventing new data ex nihilo, as happens when resolving omitNorms.  But then I
think the whole tangled mess is insane.

In any case, so long as there's a resolution policy in place and any
Similarity or posting format codec can fall back to doc-id-only, you can move
on past this challenge.

> With flex this'll be up to the codec's merge methods.

With the default being to fall back to doc-id-only and discard data when an
unknown posting format is encountered, I presume.
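
In sketch form (hypothetical interfaces), I'd expect that fallback merge to
look something like this:

  // Hypothetical sketch: when the source segment's posting format isn't
  // recognized, keep only the doc IDs and silently drop freqs, positions,
  // and any custom per-posting data.
  class FallbackMerger {
    interface DocIDSource {
      int NO_MORE_DOCS = Integer.MAX_VALUE;
      int nextDoc() throws java.io.IOException;
    }
    interface DocIDOnlyWriter {
      void addDoc(int docID) throws java.io.IOException;
    }

    void merge(DocIDSource source, DocIDOnlyWriter out) throws java.io.IOException {
      int doc;
      while ((doc = source.nextDoc()) != DocIDSource.NO_MORE_DOCS) {
        out.addDoc(doc);  // everything beyond the doc ID is discarded
      }
    }
  }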

> >> > Similarity is where we decode norms right now.  In my opinion, it
> >> > should be the Similarity object from which we specify per-field
> >> > posting formats.
> >>
> >> I agree.
> >
> > Great, I'm glad we're on the same page about that.
> 
> Actually [sorry] I'm no longer so sure I agree!
> 
> In flex we have a separate Codec class that's responsible for
> creating the necessary readers/writers.  It seems like Similarity is a
> consumer of these stats, but need not know what format is used to
> encode them on disk?

It's true that it's possible to separate out Similarity as a consumer.
However, I'm also thinking about how to make this API as easy to use as
possible.

One rationale behind the proposed elevation of Similarity is that I'm not a
fan of the name "Codec".  I think it's too generic to use for the class which
specifies a posting format.  "PostingCodec" is better, but might be too long.
In contrast, "Similarity" is more esoteric than "Codec", and thus conveys more
information.  

For Lucy, I'm imagining a stripped-down Similarity class compared to current
Lucene.  It would bear the responsibility for setting policy as to how scores
are calculated (in other words, judging how "similar" a document is to the
query), but what information it uses to calculate that score would be left
entirely open.  Methods such as tf(), idf(), encodeNorm(), etc. would move to
a TF/IDF-specific subclass.  Here's a sampling of possible Similarity
subclasses:

  * MatchSimilarity               // core
  * TFIDFSimilarity               // core
  * LongFieldTFIDFSimilarity      // contrib
  * BM25Similarity                // contrib
  * PartOfSpeechSimilarity        // contrib
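
In rough outline (hypothetical signatures), the split might look like this:

  // Hypothetical sketch: the base class only sets scoring policy; which stats
  // feed the score is left entirely open.
  abstract class Similarity {
    abstract float score(int docID, ScoreContext context);
  }

  // TF/IDF-specific machinery moves down into a dedicated subclass.
  abstract class TFIDFSimilarity extends Similarity {
    abstract float tf(float freq);
    abstract float idf(long docFreq, long numDocs);
    abstract byte encodeNorm(float norm);
    abstract float decodeNorm(byte b);
  }

  // Placeholder for whatever per-hit state the scorer threads through.
  interface ScoreContext {}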

For Lucy, Similarity would be specified as a member of a FieldType object
within a Schema.  No subclassing would be required to spec custom posting
formats:

   Schema schema = new Schema();
   FullTextType bm25Type = new FullTextType(new BM25Similarity());
   schema.specField("content", bm25Type);
   schema.specField("title", bm25Type);
   StringType matchType = new StringType(new MatchSimilarity());
   schema.specField("category", matchType);

Since the Similarity instance is settable rather than generated by a factory
method, it will have to be serialized within the schema JSON file, just as
analyzers must be.

I think it's important to make choosing a posting format reasonably easy.
Match-only fields should be accessible to someone learning basic index tuning
and optimization techniques.

Actually writing posting codecs is totally different.  Not many people are
going to want to do that, though we should make it easy for experts.

What's the flex API for specifying a custom posting format?

> > What's going to be a little tricky is that you can't have just one
> > Similarity.makePostingDecoder() method.  Sometimes you'll want a
> > match-only decoder.  Sometimes you'll want positions.  Sometimes
> > you'll want part-of-speech id.  It's more of an interface/roles
> > situation than a subclass situation.
> 
> match-only decoder is handled in flex now by asking for the DocsEnum
> and then, while iterating, only using .doc() (even if under the hood
> the codec spent effort decoding freq and maybe other things).
> 
> If you want positions you get a DocsAndPositionsEnum.

Right.  But what happens when you want a custom codec to use BM25 weighting
*and* inline a part-of-speech ID *and* use PFOR?

I think we have to supply a class object or class name when asking for the
enumerator, like you do with AttributeSource.
   
  // Hypothetical API sketch -- e.g. asking for a part-of-speech-aware posting list.
  Class<? extends PostingList> klass = PartOfSpeechPostingList.class;
  PostingList plist = null;
  PostingListReader pListReader = segReader.fetch(PostingListReader.class);
  if (pListReader != null) {
    PostingsReader pReader = pListReader.fetch(field);
    if (pReader != null) {
      plist = pReader.makePostingList(klass);
    }
  }

Marvin Humphrey

