Re: Baby steps towards making Lucene's scoring more flexible...

Marvin Humphrey Tue, 09 Mar 2010 12:58:52 -0800

On Tue, Mar 09, 2010 at 01:18:12PM -0500, Michael McCandless wrote:
> 
> >> You said "of course" before but... how in your proposal could one
> >> store all stats for a given field during indexing, but then sometimes
> >> use match-only and sometimes full-scoring when querying against that
> >> field?
> >
> > The same way that Lucene knows that sometimes it needs a docs-only-enum and
> > sometimes it needs a docs-and-positions enum.  Sometimes you need scores,
> > sometimes you don't.
> 
> But if user had specified BM25Sim when indexing... can they later just
> change that to MatchOnlySim at search time?


The user won't be able to modify the Schema by reaching into a FieldType
object and replacing its Similarity instance. 

However, internally, match-only iteration of a posting list would work just
fine.  I mean, the doc id data is there in one form or another.  Under a field
spec'd to use MatchSimilarity, the default would be to write only one file,
holding nothing but delta-encoded doc ids.  Under LuceneSimilarity, freq would
probably be embedded in the doc id file, but iterating that with match-only
just means throwing away freq.  Slightly less efficient, but still pretty
good.
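
To make that concrete, here is a minimal Python sketch (illustration only, not
Lucy or Lucene code).  It assumes a toy in-memory layout in which each posting
is a doc id delta followed by a freq, and all function names are hypothetical.
Match-only iteration walks the same stream and simply skips the freq slot.

```python
from typing import Iterator, List, Tuple


def encode_match_only(doc_ids: List[int]) -> List[int]:
    """MatchSimilarity-style file: nothing but delta-encoded doc ids."""
    deltas, prev = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas


def encode_with_freq(postings: List[Tuple[int, int]]) -> List[int]:
    """LuceneSimilarity-style file (toy layout): delta, freq, delta, freq, ..."""
    stream, prev = [], 0
    for doc_id, freq in postings:
        stream.extend((doc_id - prev, freq))
        prev = doc_id
    return stream


def iter_docs_and_freqs(stream: List[int]) -> Iterator[Tuple[int, int]]:
    """Full decode: yield (doc_id, freq) pairs for scoring."""
    doc_id = 0
    for i in range(0, len(stream), 2):
        doc_id += stream[i]
        yield doc_id, stream[i + 1]


def iter_match_only(stream: List[int]) -> Iterator[int]:
    """Match-only decode of the same stream: read the delta, skip the freq."""
    doc_id = 0
    for i in range(0, len(stream), 2):
        doc_id += stream[i]
        yield doc_id


postings = [(3, 2), (7, 1), (12, 5)]
stream = encode_with_freq(postings)
doc_ids = list(iter_match_only(stream))
```

The match-only pass still has to step over every freq entry, which is the
"slightly less efficient" part.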

So there would be polymorphism in the decoding phase while we're supplying
information the Similarity object needs to make its similarity judgments.
However, that polymorphism would be handled internally -- it wouldn't be the
responsibility of the user to determine whether a codec supported a particular
scoring model.

What Lucy users absolutely wouldn't be able to do is switch from BM25 weighting
to standard Lucene weighting at search time, because we'll be writing
pre-calculated boost bytes at index time.  Re-indexing will be required.
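
A rough Python sketch of why that is, under loudly stated assumptions: the
BM25-flavored length factor, the one-byte quantization, and every function name
below are hypothetical stand-ins, not the formulas or encodings Lucy or Lucene
actually use.  The point is only that the byte on disk depends on the
Similarity chosen at index time, and that the quantization is lossy.

```python
import math


def lucene_style_norm(field_length: int) -> float:
    """Length normalization in the spirit of classic Lucene: 1/sqrt(length)."""
    return 1.0 / math.sqrt(field_length)


def bm25_style_norm(field_length: int, avg_length: float = 10.0,
                    b: float = 0.75) -> float:
    """A BM25-flavored length factor (assumed form, for illustration only)."""
    return 1.0 / (1.0 - b + b * field_length / avg_length)


def to_boost_byte(value: float) -> int:
    """Toy lossy quantization to one byte; NOT a real norm encoding."""
    return max(0, min(255, round(value * 64)))


def write_boost_byte(field_length: int, norm_fn) -> int:
    """Index time: bake the chosen similarity's factor into a single byte."""
    return to_boost_byte(norm_fn(field_length))


# Same document, two similarities, two different bytes on disk.
lucene_byte = write_boost_byte(40, lucene_style_norm)
bm25_byte = write_boost_byte(40, bm25_style_norm)

# Distinct field lengths can collapse to the same byte, so the original
# length cannot be recovered from the stored byte at search time.
collides = (write_boost_byte(40, lucene_style_norm)
            == write_boost_byte(41, lucene_style_norm))
```

Since only the byte survives, a searcher has nothing to recompute a different
weighting from without going back to the source documents.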

I think that's a nice feature for Lucene to provide, but Lucy will have to
skip it because of our cheap-searcher requirement.

> > What users will be able to tell us is how they want the field to be used,
> > and we can use that information to help us optimize.  For example, when a user
> > declares that they want a field to be "match-only", we know we don't have to
> > write boost bytes, freq or positions, saving space.
> 
> Yeah.... so, I don't like that in Lucene you call "Field.setOmitTFAP"
> instead of saying "Field.matchOnly" (or something).  So I do agree
> that it'd be better if the API made it clear what the *search* time
> impact is of using this advanced Field API.

In my opinion, it makes sense to communicate "match only" by way of the
Similarity object as opposed to a boolean.  I think it's a good way to
introduce the Similarity class and get people comfortable with it, and I also
think that it's good to keep stuff out of the FieldType API when we can.

> We get users who are baffled that their phrase queries no longer work
> after setting omitTFAP.  

This is still a weakness of MatchSimilarity.

The default behavior of the KinoSearch QueryParser, which I expect Lucy to
follow, is to expand all TermQueries and PhraseQueries out to cover all
indexed fields.  If we include MatchSimilarity fields in that expansion, we'll
match terms but not phrases.  Maybe that would be a little hard for users to
understand -- shouldn't a MatchSimilarity field allow phrases to match without
contributing to scores?

On the other hand, typical candidates for MatchSimilarity...

  * unique_id
  * category
  * tags

... either won't contain multiple tokens, or won't generally return sensible
results for phrase queries.  

> (Today it silently returns no results... with flex you'll get an exception).

Mmm, tough call.

> > They could use better codecs under the format-follows-Similarity model, too.
> > They'd just have to subclass and override the factory methods that spawn
> > posting encoders/decoders.
> 
> Ahh, OK so that's how they'd do it.
> 
> So... I think we're making a mountain out of a molehill.

Well, I don't see it that way, because I place great value on designing
good public APIs, and I think it's important that we avoid forcing users to
know about codecs.

> In format-follows-Sim, it sounds like that simply means the Sim has a
> default codec, but you can override it if you want (and it's the Sim
> that "owns" (has the method for) handing out the Codec you'll use).

Yes.

> Whereas in Lucene the same defaulting will take place.  It's just that
> Sim won't "own" picking the Codec.

However, *something* down in Lucene besides the codec itself will be
influencing decoder polymorphism.  If there were only one decoding function,
you'd always iterate positions.  :)

Under format-follows-Sim, it would be the Similarity object that knows all
supported decoding configurations for the field.
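
As a sketch of that arrangement in Python (hypothetical class and method names
throughout; this is not the Lucy or Lucene API, and the posting layout is a toy
one), the Similarity owns a factory method that picks the decoder, and the
caller only says whether it needs scores:

```python
from typing import Iterator, List


class Similarity:
    """Hypothetical base class: the Similarity hands out posting decoders."""

    def make_decoder(self, stream: List[int], need_scores: bool) -> Iterator:
        raise NotImplementedError


class MatchSimilarity(Similarity):
    """Field stores only doc id deltas, so there is one way to decode it."""

    def make_decoder(self, stream, need_scores):
        return self._doc_ids(stream)

    @staticmethod
    def _doc_ids(stream):
        doc_id = 0
        for delta in stream:
            doc_id += delta
            yield doc_id


class LuceneSimilarity(Similarity):
    """Field stores (delta, freq) pairs; two decoding configurations."""

    def make_decoder(self, stream, need_scores):
        if need_scores:
            return self._docs_and_freqs(stream)
        return self._doc_ids_skipping_freq(stream)

    @staticmethod
    def _docs_and_freqs(stream):
        doc_id = 0
        for i in range(0, len(stream), 2):
            doc_id += stream[i]
            yield doc_id, stream[i + 1]

    @staticmethod
    def _doc_ids_skipping_freq(stream):
        doc_id = 0
        for i in range(0, len(stream), 2):
            doc_id += stream[i]  # freq at stream[i + 1] is thrown away
            yield doc_id


def collect(sim: Similarity, stream: List[int], need_scores: bool) -> list:
    """Caller never names a codec; it only states what it needs."""
    return list(sim.make_decoder(stream, need_scores))
```

Someone wanting a better codec would subclass and override `make_decoder`
(and the matching encoder factory, omitted here).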

> > You don't want to use the stronger, more constrictive check, right?
> 
> You mean single inheritance?  No.  Because then we hardwire the attrs
> to the Codec.  Standard codec should encode whatever attrs the app
> hands us... I think.

I might approach things the same way if Clownfish supported interface method
dispatch.  :)  

As it is, though, I'm not sure that the single inheritance requirement is an
important liability.

Marvin Humphrey


