Re: Baby steps towards making Lucene's scoring more flexible...

Michael McCandless Thu, 25 Mar 2010 03:25:07 -0700

On Mon, Mar 22, 2010 at 12:45 PM, Marvin Humphrey
<mar...@rectangular.com> wrote:
> On Thu, Mar 18, 2010 at 05:16:23AM -0500, Michael McCandless wrote:
>> Also, will Lucy store the original stats?
>
> These?
>
>   * Total number of tokens in the field.
>   * Number of unique terms in the field.
>   * Doc boost.
>   * Field boost.


Also sum(tf).  Robert can generate more :)

> That would depend on which Similarity the user specs for that field.  In
> other words, it's just another data-reduction decision: if the Sim needs it,
> keep it, and if it doesn't, throw it away.

OK.

> Incidentally, what are you planning to do about field boost if it's not always
> 1.0?  Are you going to store full 32-bit floats?

For starters, yes.  We may (later) want to make a new attr that sets
the #bits (levels/precision) you want... then uses packed ints to
encode.
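
A rough sketch of the idea, under stated assumptions: PackedBoost and
its methods are hypothetical names for illustration (not Lucene's
packed ints API), boosts are assumed to lie in [0, maxBoost], and bits
is assumed to be between 1 and 16.  Values never straddle a long, so a
few high bits per long may go unused.

```java
// Hypothetical sketch; PackedBoost is not an existing Lucene class.
final class PackedBoost {
    // Quantize a boost in [0, maxBoost] to one of 2^bits levels.
    static int encode(float boost, float maxBoost, int bits) {
        int maxCode = (1 << bits) - 1;
        float clamped = Math.max(0f, Math.min(boost, maxBoost));
        return Math.round(clamped / maxBoost * maxCode);
    }

    // Map a quantized code back to an approximate boost.
    static float decode(int code, float maxBoost, int bits) {
        int maxCode = (1 << bits) - 1;
        return (float) code / maxCode * maxBoost;
    }

    // Pack codes into longs, (64 / bits) codes per long.
    static long[] pack(int[] codes, int bits) {
        int perLong = 64 / bits;
        long[] out = new long[(codes.length + perLong - 1) / perLong];
        for (int i = 0; i < codes.length; i++) {
            out[i / perLong] |= ((long) codes[i]) << ((i % perLong) * bits);
        }
        return out;
    }

    // Read back the code stored at the given index.
    static int get(long[] packed, int index, int bits) {
        int perLong = 64 / bits;
        long mask = (1L << bits) - 1;
        return (int) ((packed[index / perLong] >>> ((index % perLong) * bits)) & mask);
    }
}
```

With bits = 4 this gives the "boost nibble" case; with bits = 8 it is
comparable in size to a boost byte, though with linear rather than
float-like spacing of levels.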

>> Ie so the chosen Sim can properly recompute all boost bytes (if it uses
>> those), for scoring models that "pivot" based on avg's of these stats?
>
> Yes, we could support that.
>
> It's not high on my todo-list for core Lucy, though: poor payoff for all the
> complexity it would introduce, particularly file format complexity with its
> heavy backwards compatibility burden.  Right now, we only have the boost
> bytes, and the fact that they are used for length normalization, field boost,
> and doc boost is incidental.  If we add all the raw stats, that's a bunch of
> stuff we have to support for a long time, yet which doesn't yield practical
> advantages for us yet.
>
> I'd be much more interested in finding a way to support such a feature as an
> extension.

I was specifically asking if Lucy will allow the user to force true
average to be recomputed, ie, at commit time from the writer.  It's
more costly and often not needed (ie, once your index is large enough,
new docs "typically" won't shift the average much).  But I imagine
some users will want "true average".
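
To make the "won't shift the average much" point concrete, here is a
small sketch.  FieldLengthStats is a hypothetical class, not anything
in Lucene or Lucy, and pivotedNorm uses one common form of pivoted
length normalization as an assumed example of a model that depends on
the average.

```java
// Hypothetical sketch; names are illustrative only.
final class FieldLengthStats {
    private long docCount;
    private long totalTokens;

    // Accumulate the raw stat for one document's field.
    void addDoc(int tokenCount) {
        docCount++;
        totalTokens += tokenCount;
    }

    // "True" average field length over everything added so far.
    double trueAverage() {
        return docCount == 0 ? 0.0 : (double) totalTokens / docCount;
    }

    // Assumed pivoted normalization: 1 / ((1 - slope) + slope * len / avg).
    static double pivotedNorm(int docLength, double avgLength, double slope) {
        return 1.0 / ((1.0 - slope) + slope * (docLength / avgLength));
    }
}
```

For example, after 1,000,000 docs of 100 tokens each, adding 1,000 docs
of 200 tokens moves the average only from 100.0 to about 100.1, so
norms computed against the older average would be only slightly off.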

>> > In any case, the proposal to start delaying Sim choice to search-time --
>> > while a nice feature for Lucene -- is a non-starter for Lucy.   We can't do
>> > that because it would kill the cheap-Searcher model to generate boost bytes
>> > at Searcher construction time and cache them within the object.  We need
>> > those boost bytes written to disk so we can mmap them and share them amongst
>> > many cheap Searchers.
>>
>> It'd seem like Lucy could re-gen the boost bytes if a different Sim
>> were selected, or, the current Sim hadn't yet computed & cached its
>> bytes?  But then logically this means a "reader" needs write
>> permission to the index dir, which is not good...
>
> Whatever's reading the boost bytes can't tell the difference between process
> RAM and mmap'd RAM, so write-permission on the index dir isn't required.

Hmm if you could somehow soften this... so that a custom Sim could
regen its boost bytes (if it needed to), write them into the index,
and then "whoever's reading" can mmap... that'd buy you some
flexibility back.

> What's trickier is that Schemas are not normally mutable, and that they are
> part of the index.  You don't have to supply an Analyzer, or a Similarity, or
> anything else when opening a Searcher -- you just provide the location of the
> index, and the Schema gets deserialized from the latest schema_NNN.json file.
> That has many advantages, e.g. inadvertent Analyzer conflicts are pretty much
> a thing of the past for us.

That's nice... though... is it too rigid?  Do users even want to pick
a different analyzer at search time?

> But it makes your feature request of runtime settability for
> Similarity awkward to implement: by the time you have a Schema
> object to work with, the Searcher is already open.
>
>  Searcher searcher = new Searcher("/path/to/index");
>  Schema schema = searcher.getSchema();
>  schema.setSim("content", altSim); // Too late, and not implemented anyway.

I see...

>> > To my mind, these are all related data reduction tasks:
>> >
>> >  * Omit doc-boost and field-boost, replacing them with a single float
>> >    docXfield multiplier -- because you never need doc-boost on its own.
>> >  * Omit length-in-tokens, term-cardinality, doc-boost, and field-boost,
>> >    replacing them all with a single boost byte -- because for the kind of
>> >    scoring you want to do, you don't need all those raw stats.
>> >  * Omit the boost byte, because you don't need to do scoring at all.
>> >  * Omit positions because you don't need PhraseQueries, etc. to match.
>>
>> I wouldn't group this one with the others -- I mean technically it is
>> "data reduction" -- but omitting positions means certain queries
>> (PhraseQuery) won't work even in "match only" searching.  Whereas the
>> rest of these examples affect how scoring is done (or whether it's
>> done).
>
> Couldn't disagree more.  Omitting positions is *exactly* the kind of data
> reduction task which we know is safe to perform when a user specifically tells
> us they don't need PhraseQueries by specifying a MinimalSimilarity.

Hmmm... they just seem to be different categories to me.  One category
prevents certain kinds of queries (span, phrase) from even matching
properly (let alone score).  The other affects how matched docs are
scored.

Sure, I agree these two categories can be broadly grouped under a
bigger "data reduction" umbrella... but it seems too big.

> MinimalSimilarity will be documented as a good choice for single-token field
> types like StringType, Int32Type, Float32Type, and so on -- because those
> can't match multi-token PhraseQueries anyway.  Usage with FullTextType will be
> discouraged.

OK.

> Maybe aggressive automatic data-reduction makes more sense in the context of
> "flexible matching", which is more expansive than "flexible scoring"?

I think so.  Maybe it shouldn't be called a Similarity?  To me that
name means "scoring" (though I'm carrying a heavy curse-of-knowledge
burden...).  Matcher?

>> > If that Sim turns out to be a MatchSimilarity, why on earth should
>> > we keep around the boost bytes?
>>
>> Well maybe some queries do scoring on the field and some don't...
>
> That would violate the contract the user made when they spec'd
> MatchSimilarity.  Saying that Lucy should keep the boost bytes under those
> circumstances is like saying that Lucene should outright ignore omitNorms()
> and always write boost bytes because users can't be trusted.

OK.

>> > I meant that if you're writing out boost bytes, there's no sensible way to
>> > execute the lossy data reduction and reduce the index size other than
>> > having Sim do it.
>>
Right, Sim is the right class to do this.  Heck one could even use
>> boost nibbles... or, use float.  This is an impl detail of the Sim
>> class.
>
> For Lucene, I think that makes sense, because the reduced form would be
> ephemeral.
>
> For Lucy, it's more complicated because the reduced data gets written to the
> index.

Right, Lucy must go through the filesystem...

> Core Sim implementations should all use the same algorithm in order to
> minimize the complexity of the index file spec.  However, it would be nice to
> offer an extension point enabling user-defined Sims to write non-standard
> formats.

OK.

>> I think this all boils down to how important flexible scoring is --
>
> Oh, please, Mike.  Search-time settability for Similarity isn't the same thing
> as "flexible scoring".  :(  Everybody thinks "flexible scoring" is important.
>
> Frankly, I think we're going to do a better job making "flexible scoring"
> available to our users because we're not going to make them fight through a
> thicket of jargon to get it.

But if one wants to tweak their Similarity, eg altering lengthNorm,
they have to fully reindex that field, with Lucy, right?

>> I'd like users to be able to try out different scoring at search
>> time, even if it means "having to understand low level stuff" when
>> setting their field types during indexing.
>>
>> You don't think flexible scoring is that important ("just reindex")
>> and that it's not great to have users understand low level stats for
>> indexing.
>
> I'm +0 (FWIW) on search-time Sim settability for Lucene.  It's a nice feature,
> but I don't think we've worked out all the problems yet.  If we can, I might
> switch to +1 (FWIW).

What problems remain, for Lucene?

> For Lucy, I'm -1 on search-time Sim settability, for a wide variety of
> reasons.

OK.

> Whether or not to perform automatic data-reduction based on Similarity choice
> or force the user to specify data-reduction manually is a separate issue.

Hmm they are all forms of "data reduction", but I think that's too
broad an umbrella.  I would decouple "reduction that causes certain
queries not to match" (discarding positions) from "reduction that
alters how matches are scored" (discarding freq, using boost
bytes/nibbles/floats, length norm pivot, etc).
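
One way to picture that decoupling, purely as a sketch: FieldReduction
and its enums are hypothetical names, not a proposed Lucene or Lucy
API.  The point is only that the matching-related choice and the
scoring-related choice are set independently.

```java
// Hypothetical sketch; not an existing or proposed API.
final class FieldReduction {
    // Reduction that changes which queries can match.
    enum Matching { POSITIONS, NO_POSITIONS }

    // Reduction that changes how (or whether) matches are scored.
    enum Scoring { RAW_STATS, BOOST_BYTE, NONE }

    final Matching matching;
    final Scoring scoring;

    FieldReduction(Matching matching, Scoring scoring) {
        this.matching = matching;
        this.scoring = scoring;
    }

    // Phrase and span queries need positions, whatever the scoring choice.
    boolean supportsPhraseQueries() {
        return matching == Matching.POSITIONS;
    }

    // Scoring needs some retained stats, whatever the matching choice.
    boolean supportsScoring() {
        return scoring != Scoring.NONE;
    }
}
```

Under this split, a field could keep positions yet drop all scoring
data (match-only phrase search), or drop positions yet keep a boost
byte.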

Mike

