Hi Alex,

The entry point for block-max metadata is TermsEnum#impacts (
https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/TermsEnum.html#impacts(int))
which returns an ImpactsEnum, a view of the postings list that includes
block-max metadata. In particular, see the documentation for
ImpactsSource#advanceShallow
and ImpactsSource#getImpacts (
https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html
).

You can look at ImpactsDISI to see how this metadata is turned into score
upper bounds in practice. Those upper bounds are in turn used to skip
blocks of documents that cannot be competitive.
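
To make the skipping idea concrete, here is a minimal, self-contained
sketch. It is a toy model and not Lucene code: the Block class, the
candidates method and the hard-coded numbers are all hypothetical. In
Lucene the block boundary and the data behind the upper bound would come
from ImpactsSource#advanceShallow and ImpactsSource#getImpacts, whereas
here each block simply carries a precomputed maximum score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BlockMaxSkipSketch {

    /** Hypothetical block summary (not a Lucene class). */
    static final class Block {
        final int docIdUpTo;  // last doc ID covered by this block, inclusive
        final float maxScore; // upper bound on the score of any doc in the block

        Block(int docIdUpTo, float maxScore) {
            this.docIdUpTo = docIdUpTo;
            this.maxScore = maxScore;
        }
    }

    /** Index of the first posting at or after 'from' whose doc ID is greater than docId. */
    static int firstIndexAfter(int[] docIds, int from, int docId) {
        int lo = from, hi = docIds.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docIds[mid] <= docId) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /**
     * Returns the doc IDs that still need to be scored. docIds must be sorted
     * ascending; blocks must be sorted by docIdUpTo and cover every doc ID.
     */
    static List<Integer> candidates(int[] docIds, List<Block> blocks, float minCompetitiveScore) {
        List<Integer> result = new ArrayList<>();
        int i = 0; // current position in the postings
        for (Block block : blocks) {
            if (block.maxScore < minCompetitiveScore) {
                // No doc in this block can be competitive: jump past the whole block.
                i = firstIndexAfter(docIds, i, block.docIdUpTo);
                continue;
            }
            // The block may contain competitive docs: visit each one.
            while (i < docIds.length && docIds[i] <= block.docIdUpTo) {
                result.add(docIds[i++]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] docIds = {1, 3, 7, 12, 15, 20, 28};
        List<Block> blocks = Arrays.asList(
            new Block(7, 1.5f),
            new Block(15, 0.4f), // upper bound below the threshold, so docs 12 and 15 are skipped
            new Block(Integer.MAX_VALUE, 2.0f));
        System.out.println(candidates(docIds, blocks, 1.0f)); // prints [1, 3, 7, 20, 28]
    }
}
```

As the minimum competitive score rises while the top-k heap fills up,
more blocks fall below the threshold and can be skipped without visiting
their documents.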

On Mon, Oct 12, 2020 at 2:45 AM Alex K <aklib...@gmail.com> wrote:

> Hi all,
> There was some fairly recent work in Lucene to introduce Block-Max WAND
> Scoring (
>
> https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf
> , https://issues.apache.org/jira/browse/LUCENE-8135).
>
> I've been working on a use-case where I need very efficient top-k scoring
> for 100s of query terms (usually between 300 and 600 terms, k between 100
> and 10000, each term contributes a simple TF-IDF score). There's some
> discussion here: https://github.com/alexklibisz/elastiknn/issues/160.
>
> Now that block-based metadata are presumably available in Lucene, how would
> I access this metadata?
>
> I've read the WANDScorer.java code, but I couldn't quite understand how
> exactly it is leveraging a block-max codec or block-based statistics. In my
> own code, I'm exploring some ways to prune low-quality docs, and I figured
> there might be some block-max metadata that I can access to improve the
> pruning. I'm iterating over the docs matching each term using the
> .advance() and .nextDoc() methods on a PostingsEnum. I don't see any
> block-related methods on the PostingsEnum interface. I feel like I'm
> missing something.. hopefully something simple!
>
> I appreciate any tips or examples!
>
> Thanks,
> Alex
>


-- 
Adrien
