Re: Baby steps towards making Lucene's scoring more flexible...

Marvin Humphrey Sun, 07 Mar 2010 10:22:17 -0800

On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:
> It won't encounter an unknown posting format.  It's the codec.  It
> knows all posting formats by the time it sees it.


OK, so you're not going to handle this the way Lucene handles field types and
accept a new codec spec reference with each field in each Document.  There
will be per-index associations between field names and codecs and it will be
invalid to change those associations.
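
As an illustration of that constraint, here is a minimal sketch of a per-index table that binds each field name to one codec name and rejects any later attempt to rebind it. The class FieldCodecBindings and its methods are hypothetical and exist neither in Lucene's flex branch nor in KinoSearch; the real association might well be stored and enforced differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not a Lucene or KinoSearch class.
final class FieldCodecBindings {
    // One entry per field name, fixed for the life of the index.
    private final Map<String, String> codecByField = new HashMap<>();

    // Bind a field to a codec name. Rebinding to the same codec is a no-op;
    // rebinding to a different codec is rejected.
    void bind(String field, String codecName) {
        String existing = codecByField.putIfAbsent(field, codecName);
        if (existing != null && !existing.equals(codecName)) {
            throw new IllegalStateException(
                "field '" + field + "' is already bound to codec '" + existing + "'");
        }
    }

    // Returns the codec name for a field, or null if the field is unknown.
    String codecFor(String field) {
        return codecByField.get(field);
    }
}
```

Under this scheme a Document never carries a codec spec of its own; it only names fields, and the index resolves the codec.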

> Well, Codec is intentionally generic -- currently it "only" serves up
> readers & writers for postings, but over time I expect it'll
> be the class Lucene uses to get reader/writer for other parts of the
> index.

Huh?  What does the posting format specifier have to do with e.g. stored
fields?

What you're describing sounds more like the Architecture class in KinoSearch.

> I'm a little confused: if I indexed a field with full postings data,
> shouldn't I still be allowed to score with match only scoring?

Of course.

> When a movie is encoded to a file, the codec(s) determine all sorts of
> interesting details.  Then when you watch the movie you're free to do
> whatever you want -- watch as hidef, as normal def, cropped, sound
> only, listen to different languages, pick subtitles, etc.  How it's
> specifically encoded is strongly decoupled from how you use it.

I see what you're getting at.  However, Similarity *already* affects the
contents of the index, via encodeNorm()/decodeNorm() and lengthNorm().  So if
you want to divorce Similarity from index format, you'll need to remove those
methods.

In my opinion, it makes more sense to go the opposite direction, and have
Similarity objects spawn PostingEncoder objects which define the index format.
The ability of a search-time Similarity object to make relevance judgements
and assign scores is intimately tied to the information prepared for it in
advance and written at index-time.
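
A rough sketch of what that direction could look like, in Java for the sake of comparison with the flex code. Every name here (SketchSimilarity, PostingEncoder, MatchOnlySimilarity, TfSimilarity) is made up for illustration, and postings are reduced to plain int arrays rather than a real on-disk format:

```java
// Hypothetical sketch: none of these types exist in Lucene or KinoSearch.
interface PostingEncoder {
    // Return the values to be written for one posting.
    int[] encode(int docDelta, int freq);
}

abstract class SketchSimilarity {
    // Index time: the Similarity decides what information gets written.
    abstract PostingEncoder makePostingEncoder();

    // Search time: the same Similarity scores using what was written.
    abstract float score(int freq);
}

// Match-only scoring needs doc ids and nothing else.
class MatchOnlySimilarity extends SketchSimilarity {
    @Override
    PostingEncoder makePostingEncoder() {
        return (docDelta, freq) -> new int[] { docDelta };
    }

    @Override
    float score(int freq) {
        return 1.0f;
    }
}

// A tf-based scorer needs the within-document frequency as well.
class TfSimilarity extends SketchSimilarity {
    @Override
    PostingEncoder makePostingEncoder() {
        return (docDelta, freq) -> new int[] { docDelta, freq };
    }

    @Override
    float score(int freq) {
        return (float) Math.sqrt(freq);
    }
}
```

The point of the sketch is only that the encoder and the scorer come from the same object, so the scorer can never be asked for information that the encoder chose not to write.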

I also like the idea of novice/intermediate users being able to express the
intent for how a field gets scored by choosing a Similarity subclass, without
having to worry about the underlying details of posting format.

> > What's the flex API for specifying a custom posting format?
> 
> You implement a Codecs class, which within it knows about any number
> of Codec impls that it can retrieve by name.  

So you have both a class named "Codec" and a class named "Codecs"?  :(

Tell me, is this an array of Codecs or a Codecs?

   return codecs;

> Here's the default
> Codecs on flex now:
> 
> class DefaultCodecs extends Codecs {
>   DefaultCodecs() {
>     register(new StandardCodec());
>     register(new IntBlockCodec());
>     register(new PreFlexCodec());
>     register(new PulsingCodec());
>     register(new SepCodec());
>   }
> 
>   @Override
>   public Codec getWriter(SegmentWriteState state) {
>     return lookup("Standard");
>     //return lookup("Pulsing");
>     //return lookup("Sep");
>     //return lookup("IntBlock");
>   }
> }
> 
> getWriter returns the Codec that will write the current segment.

So...

  * The user needs to know about SegmentWriteState?
  * The "codec" is per-index, not per-field?  Presumably this will change?
  * The "codec" is a writer in this case, but since the name "codec" implies
    both coding and decoding, it must also be capable of functioning as a
    reader?

> > Right.  But what happens when you want a custom codec to use BM25 weighting
> > *and* inline a part-of-speech ID *and* use PFOR?
> 
> You'd use the PForCodec, and make an attr that injects POS.

OK.

I don't think we're likely to do things that way in Lucy.  The functions which
decode postings will operate directly on raw mmap'd memory, and they typically
won't make any external calls to either methods or non-inline functions.

If you wanted to use an esoteric custom format, you'd write your own decoder
function.  There won't be a lot of code reuse at this inner-loop level --
unrolling will be the rule rather than the exception.
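
To give a flavor of such a decoder, here is a minimal sketch. Lucy's version would be C operating on mmap'd memory; this one is Java reading from a ByteBuffer (a MappedByteBuffer would be the closest analogue) purely for illustration. The format is an assumption: doc ids stored as deltas, each delta a variable-length integer with 7 data bits per byte, low-order bytes first, and the high bit set on every byte except the last. The class name RawPostingDecoder is hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: the format and the class name are assumptions.
final class RawPostingDecoder {
    // Decode `count` delta-encoded doc ids starting at the buffer's
    // current position. The variable-length integer decoding is written
    // inline rather than delegated to a helper method.
    static int[] decodeDocIds(ByteBuffer buf, int count) {
        int[] docs = new int[count];
        int doc = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = buf.get();
                delta |= (b & 0x7F) << shift;  // low 7 bits carry data
                shift += 7;
            } while ((b & 0x80) != 0);         // high bit means "more bytes"
            doc += delta;                      // deltas accumulate into doc ids
            docs[i] = doc;
        }
        return docs;
    }
}
```

A format that also inlined, say, a part-of-speech ID after each doc delta would get its own decoder with its own loop instead of a hook into this one.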

> > I think we have to supply a class object or class name when asking for the
> > enumerator, like you do with AttributeSource.
> >
> >  PostingList plist = null;
> >  PostingListReader pListReader = segReader.fetch(PostingListReader);
> >  if (pListReader != null) {
> >    PostingsReader pReader = pListReader.fetch(field);
> >    if (pReader != null) {
> >      plist = pReader.makePostingList(klass); // e.g. PartOfSpeechPostingList
> >    }
> >  }
> 
> But is plist a "normal" postings iterator (ie, subclasses it) that has
> also exposed a dedicated POS API?

It's definitely a "normal" postings iterator.  As to whether we expose the
part-of-speech via an attribute or via a method, that's up in the air.  Hmm.

From a class-design perspective, it would probably be best to go with an
attribute, since Lucy has only single-inheritance and no interfaces.  A rigid
class hierarchy is going to cause problems when you need an iterator that
combines unrelated concepts like BM25 weighting and part-of-speech tagging.
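
A minimal sketch of the attribute approach, again in Java and with hypothetical names throughout (SketchPostingList and PartOfSpeechAttr are not Lucene's AttributeSource or any Lucy API). The iterator stays a plain postings iterator, and extra per-posting data hangs off it, keyed by class:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not Lucene's AttributeSource or any Lucy API.
class SketchPostingList {
    private final int[] docs;
    private int cursor = -1;
    private final Map<Class<?>, Object> attrs = new HashMap<>();

    SketchPostingList(int[] docs) {
        this.docs = docs;
    }

    // Attach extra per-posting data without subclassing the iterator.
    <T> void addAttribute(Class<T> klass, T attr) {
        attrs.put(klass, attr);
    }

    // Returns the attribute, or null if this list does not carry one.
    <T> T getAttribute(Class<T> klass) {
        return klass.cast(attrs.get(klass));
    }

    boolean next() {
        return ++cursor < docs.length;
    }

    int doc() {
        return docs[cursor];
    }

    int index() {
        return cursor;
    }
}

// Part-of-speech IDs, one per posting, looked up by posting index.
class PartOfSpeechAttr {
    private final int[] posIds;

    PartOfSpeechAttr(int[] posIds) {
        this.posIds = posIds;
    }

    int posId(int postingIndex) {
        return posIds[postingIndex];
    }
}
```

A caller would fetch the attribute once up front and consult it while iterating. A second, unrelated attribute (BM25 statistics, for instance) could be attached to the same list without a new iterator subclass for each combination.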

> In flex you'd get a "normal" DocsAndPositionsEnum, pull the POS attr
> up front, and as you're next'ing your way through it, optionally look
> up the POS of each position you step through, using the POS attr.

Just a thought: why not make positions an attribute on a DocsEnum?

Marvin Humphrey

