Re: Baby steps towards making Lucene's scoring more flexible...

Michael McCandless Tue, 09 Mar 2010 02:06:40 -0800

On Mon, Mar 8, 2010 at 9:47 PM, Marvin Humphrey <mar...@rectangular.com> wrote:
> On Mon, Mar 08, 2010 at 01:13:53PM -0500, Michael McCandless wrote:
>> I think we can actually do so w/o losing Lucene's loose typing if we
>> simply peeled out [say] a FieldType class that holds the settings you
>> now set on each field (omitTFAP, omitNorms, TermVector, Store,
>> Index), and Field instance holds a ref to its FieldType.  We could
>> then store Analyzer and Codec on there, too.
>
> You can use shared FieldType instances to hold typing information without
> enforcing consistency.


Precisely.
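
A minimal sketch of the idea, assuming hypothetical class names (FieldTypeSketch, FieldSketch) that do not correspond to any existing Lucene API. Per-field settings move into an immutable type object, and each field instance only holds a reference to it. Nothing enforces that a field name always uses the same type, so the loose typing is preserved:

```java
// Hypothetical sketch only: these classes are illustrative, not Lucene's API.
final class FieldTypeSketch {
    // The settings named in the thread; final, i.e. "write once".
    final boolean stored;
    final boolean indexed;
    final boolean omitNorms;
    final boolean omitTermFreqAndPositions;
    final boolean storeTermVectors;

    FieldTypeSketch(boolean stored, boolean indexed, boolean omitNorms,
                    boolean omitTermFreqAndPositions, boolean storeTermVectors) {
        this.stored = stored;
        this.indexed = indexed;
        this.omitNorms = omitNorms;
        this.omitTermFreqAndPositions = omitTermFreqAndPositions;
        this.storeTermVectors = storeTermVectors;
    }
}

final class FieldSketch {
    final String name;
    final String value;
    final FieldTypeSketch type; // shared reference, not a per-field copy

    FieldSketch(String name, String value, FieldTypeSketch type) {
        this.name = name;
        this.value = value;
        this.type = type;
    }
}

class FieldTypeDemo {
    public static void main(String[] args) {
        FieldTypeSketch scored = new FieldTypeSketch(true, true, false, false, false);
        FieldTypeSketch matchOnly = new FieldTypeSketch(true, true, true, true, false);

        // Two fields share one type instance.
        FieldSketch a = new FieldSketch("title", "Flexible scoring", scored);
        FieldSketch b = new FieldSketch("title", "Baby steps", scored);
        // The same field name with a different type is allowed: no global schema.
        FieldSketch c = new FieldSketch("title", "No norms here", matchOnly);

        System.out.println(a.type == b.type); // true
        System.out.println(a.type == c.type); // false
    }
}
```

An Analyzer or Codec reference could be added to the type object as further final members, under the same assumptions.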

>> Lucene would still be "loosely typed" (ie, no global schema) in that
>> every time you index new docs you're free to make up a new FieldType
>> instance (ie it wouldn't be stored in the index -- it's "stored" in
>> your app's java sources), though probably FieldType itself would be
>> write once during an IndexWriter session.
>
> For what it's worth, that's sort of the way KS used to work: Schema/FieldType
> information was stored entirely in source code.  That's changed and now we
> serialize the whole schema including all Analyzers, but source-code-only is a
> viable approach.

Hmm but KS still somehow enforced strong typing across indexing
sessions?

>> Hmm big change though -- I don't want to gate landing flex with this.
>
> Perhaps factoring out FieldType from Field can be done on trunk, now?   From a
> distance, it looks to be a straightforward subtractive refactoring.

I agree but we need a volunteer ;)

>> > I see what you're getting at.  However, Similarity *already* affects the
>> > contents of the index, via encodeNorm()/decodeNorm() and lengthNorm().  So
>> > if you want to divorce Similarity from index format, you'll need to remove
>> > those methods.
>>
>> This brings us full circle -- it's exactly what I'd like to do as the
>> baby step ;)
>>
>> Ie, lengthNorm would no longer be publicly used (since, instead, the
>> true stats are written to the index).  (Privately, within Sim impls
>> it'd presumably still be used).
>>
>> encode/decodeNorm would also be private to the Sim impl -- that's just
>> a way to quantize a float into a single byte, to save RAM.  Other Sim
>> impls may just want to store a float directly, use 2 bytes to quantize
>> floats, use only 4 bits per norm, don't store anything (match only),
>> etc.
>
> OK, I see.  Note that although it would mean writing redundant data, Lucy
> could theoretically record the same raw stats.  It's just that Lucene would
> generate the derived data structures at search-time, while Lucy would generate
> them at index-time and then mmap the files at search-time.

Lucene may still also generate them @ indexing time & store in the
filesystem... it's an option.
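
To make "quantize a float into a single byte" concrete, here is a rough sketch. The linear 0..255 encoding and the class name ByteNormSketch are assumptions chosen for simplicity; this is not the encoding Lucene actually uses. The 1/sqrt(numTerms) length norm mirrors the classic default formula. The point is only that encode, decode and the decode table can all be private details of one Sim impl:

```java
// Hypothetical sketch: a linear one-byte quantizer, not Lucene's real norm encoding.
final class ByteNormSketch {
    // Decode table: one float per possible byte value, built once.
    private static final float[] DECODE = new float[256];
    static {
        for (int i = 0; i < 256; i++) {
            DECODE[i] = i / 255f;
        }
    }

    // Classic-style length norm: longer fields get a smaller norm.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Quantize a norm in [0, 1] to one byte; values outside the range are clamped.
    static byte encode(float norm) {
        float clamped = Math.max(0f, Math.min(1f, norm));
        return (byte) Math.round(clamped * 255f);
    }

    // Recover the approximate norm with a table lookup.
    static float decode(byte b) {
        return DECODE[b & 0xFF];
    }
}

class ByteNormDemo {
    public static void main(String[] args) {
        for (int numTerms : new int[] {1, 4, 100, 10000}) {
            float exact = ByteNormSketch.lengthNorm(numTerms);
            float approx = ByteNormSketch.decode(ByteNormSketch.encode(exact));
            System.out.println(numTerms + " terms: exact=" + exact + " decoded=" + approx);
        }
    }
}
```

A different Sim impl could swap this for two bytes, four bits, or a raw float without any other code needing to know.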

> I don't think we'd do that, though -- we'd just accept the lossiness and write
> out the derived data -- but preparing per-docXfield boost/norm info involves
> approximately the same amount of work no matter how you time-shift it.

Yes.

>> I do agree there's some connection -- if I don't store tf nor
>> positions then I can't use a Sim that needs these stats.
>>
>> > I also like the idea of novice/intermediate users being able to express the
>> > intent for how a field gets scored by choosing a Similarity subclass,
>> > without having to worry about the underlying details of posting format.
>>
>> Well.. I think standard codec in Lucene will store these 2 common
>> stats (field length, avg(tf)), then provide various Sim impls?  So w/
>> default codec user can still pick the Sim impl that does the scoring
>> they want?
>
> OK, that's actually handy, because it allows people to tweak length
> normalization without reindexing and presumably speeds up development.  Of all
> the knobs that Similarity gives us, lengthNorm() is far and away the most
> important.
>
> I guess you're OK with slowing down standard Lucene index opens to achieve
> this flexibility, since you're going to burn CPU deriving those boost/norm
> stats.  Subtle way of encouraging people to use the NRT API, eh?

Well, either .reopen or NRT, but yes.

Still we may allow storing in the index....

> I can see why you're so resistant to the idea of tying Similarity to format,
> now.  However, I think you've managed to persuade me that it's exactly the
> right thing to do from an API standpoint.  :)

I suppose that's a positive outcome then ;)

> Probably our perspectives and priorities diverge because of the fact that in
> Lucene, Similarity is index-wide, while in Lucy/KS, it's per-field.

I don't think that's the source of our different conclusions, because
with this proposal, Sim also becomes per-field in Lucene.

> E.g from your perspective, match-only indexes would be pretty
> esoteric, but from my perspective, match-only fields make perfect
> sense.

I don't think it's esoteric at all.  And I see match-only fields as
working fine in Lucene with this change.

It's just that a match-only Sim is a search time decision, and I see
it as strongly decoupled from what postings format had been used for
recording the stats in the index.

If I have an MP3 file I can use any number of players to play
it... the encoding of the file should be strongly decoupled from how
it's consumed.
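
A small sketch of that decoupling, under stated assumptions: the names SimSketch, MatchOnlySim and LengthNormSim are hypothetical, and the scoring formula (sqrt(tf) / sqrt(fieldLength)) is only a stand-in for a real Sim. The raw stats are recorded once, and the choice of Sim is made per query:

```java
// Hypothetical sketch: illustrative interfaces, not Lucene's Similarity API.
interface SimSketch {
    // Score one doc/field from raw stats as recorded at indexing time.
    float score(int termFreq, int fieldLength);
}

// Search-time choice 1: ignore the stats, report only whether the term matched.
final class MatchOnlySim implements SimSketch {
    public float score(int termFreq, int fieldLength) {
        return termFreq > 0 ? 1f : 0f;
    }
}

// Search-time choice 2: use the stats for tf and length normalization.
final class LengthNormSim implements SimSketch {
    public float score(int termFreq, int fieldLength) {
        if (termFreq <= 0 || fieldLength <= 0) {
            return 0f;
        }
        return (float) (Math.sqrt(termFreq) / Math.sqrt(fieldLength));
    }
}

class SimChoiceDemo {
    public static void main(String[] args) {
        // Stats written once by the (hypothetical) codec: tf=4 in a 16-term field.
        int termFreq = 4;
        int fieldLength = 16;

        SimSketch[] sims = { new MatchOnlySim(), new LengthNormSim() };
        for (SimSketch sim : sims) {
            System.out.println(sim.getClass().getSimpleName() + ": "
                    + sim.score(termFreq, fieldLength));
        }
    }
}
```

With these inputs the match-only Sim yields 1.0 and the length-normalized one yields 0.5, from the same stored stats.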

You said "of course" before but... how in your proposal could one
store all stats for a given field during indexing, but then sometimes
use match-only and sometimes full-scoring when querying against that
field?

>> If user switches up their codec then they'll need to ensure it also
>> stores stats required by their Sim(s).
>
> That's backwards, IMO.

I'm still baffled.  If I wanna play a movie on my 1080P monitor I'll
need to find a movie that was encoded hidef (ie, bluray not dvd).

I mean, I don't have to.  DVD content will play fine still... just
degraded quality.

I don't buy my monitor and all movies I may want to watch on it, at
the same time...

> The posting format encoding should be an implementation detail.  The general
> user should be expressing their intent as far as how they want the field to be
> scored, and the posting format should flow from that.

Maybe it's that it bothers you that with this proposed change the
user makes 2 decisions -- Codec and Sim?  Ie user will choose PFor or
Standard or Pulsing(PFor/Standard) codec, and then separately choose
Sim?

But these are important choices.  They should be separate.  Why
force-bundle them?

> Whether we use VInt, PFOR, group varint, hand-tuned bit shifting, etc under
> the hood to implement BM25, match-only, boost-per-position or whatever
> shouldn't be the user's concern.  As time goes on, we should allow ourselves
> the flexibility to use new compression techniques to write new segments.

But w/ the proposed change Lucene users will be free to use better
codecs?  Are you worried about proper defaulting?  We'll handle that
(under Version).

>> > Just a thought: why not make positions an attribute on a DocsEnum?
>>
>> Maybe... though I think the double method call (enum.next() then
>> posAttr.get()) is too much added cost.
>
> Why wouldn't it work to have the consumer extract the positions attribute from
> the DocsEnum during construction?

That's exactly what's done today.  Ie, you'd get a posAttr up front.

> There's no difference between calling enum.nextPosition() and
> positions.next(), is there?

Right now it's a 2 step process when you access via attr -- first you
ask the enum to next(), then you ask each attr associated w/ that enum
for their value.

I think that'd be too costly for stepping through positions vs the
1-step we have today.
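
The difference in call pattern can be sketched as follows. All class and method names here (OneStepPositions, TwoStepPositions, PositionAttr) are hypothetical and simplified, and the -1 end-of-stream sentinel is an assumption for the sketch; the real enum and attribute classes look different. The sketch only shows that the attribute route costs two calls per position even when the attr is fetched once up front:

```java
// Hypothetical sketch: simplified stand-ins, not Lucene's actual enum/attribute classes.

// One step: a single call advances and returns the position.
final class OneStepPositions {
    private final int[] positions;
    private int upto = -1;

    OneStepPositions(int[] positions) {
        this.positions = positions;
    }

    // Returns the next position, or -1 when exhausted (sentinel is an assumption).
    int nextPosition() {
        return ++upto < positions.length ? positions[upto] : -1;
    }
}

// Holder that the enum fills in on each next().
final class PositionAttr {
    private int position;

    void set(int position) {
        this.position = position;
    }

    int get() {
        return position;
    }
}

// Two steps: next() advances and updates the attr, then the caller reads the attr.
final class TwoStepPositions {
    private final int[] positions;
    private final PositionAttr posAttr = new PositionAttr();
    private int upto = -1;

    TwoStepPositions(int[] positions) {
        this.positions = positions;
    }

    // Fetched once, up front, by the consumer.
    PositionAttr positionAttr() {
        return posAttr;
    }

    boolean next() {
        if (++upto >= positions.length) {
            return false;
        }
        posAttr.set(positions[upto]);
        return true;
    }
}

class PositionsDemo {
    public static void main(String[] args) {
        int[] positions = {3, 7, 12};

        OneStepPositions one = new OneStepPositions(positions);
        int sumOne = 0;
        for (int p = one.nextPosition(); p != -1; p = one.nextPosition()) {
            sumOne += p; // one call per position
        }

        TwoStepPositions two = new TwoStepPositions(positions);
        PositionAttr attr = two.positionAttr();
        int sumTwo = 0;
        while (two.next()) {      // call 1: advance
            sumTwo += attr.get(); // call 2: read the value
        }

        System.out.println(sumOne + " " + sumTwo); // 22 22
    }
}
```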

Mike

