Hi Ankur,
Indeed, payloads are the standard way to solve this problem. For light
queries with a few top-N results, that should be efficient. For multi-term
queries it could become costly if you need to access the payloads of
too many terms.
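If you go the payload route, you have to pack the per-term feature vector
into the payload bytes yourself. Here is a minimal sketch of such a codec in
plain Java; the class and method names are hypothetical, not Lucene APIs, and
it assumes dimension ids fit in one byte (you said ~50 total):

```java
import java.nio.ByteBuffer;

// Hypothetical codec: packs a sparse feature vector as
// (count, [dim, value]*) into a byte[] suitable for a payload.
class SparseVectorCodec {

    // Encode the non-zero (dimension, value) pairs.
    static byte[] encode(int[] dims, float[] values) {
        // 1 byte for the entry count, then 1 byte dim + 4 byte float per entry.
        ByteBuffer buf = ByteBuffer.allocate(1 + dims.length * 5);
        buf.put((byte) dims.length);
        for (int i = 0; i < dims.length; i++) {
            buf.put((byte) dims[i]);
            buf.putFloat(values[i]);
        }
        return buf.array();
    }

    // Decode back into a dense float[totalDims] for scoring.
    static float[] decode(byte[] payload, int totalDims) {
        float[] vector = new float[totalDims];
        ByteBuffer buf = ByteBuffer.wrap(payload);
        int count = buf.get();
        for (int i = 0; i < count; i++) {
            int dim = buf.get();
            vector[dim] = buf.getFloat();
        }
        return vector;
    }
}
```

At search time you would read these bytes from something like
PostingsEnum.getPayload() in your scorer and decode them there.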
Also, there is an experimental PostingsFormat called
SharedTermsUniformSplit (class named STUniformSplitPostingsFormat) that
would let you effectively share the overlapping terms in the index
while keeping the 50 fields. This would solve the index-bloat issue, but
would not fully solve the seeks issue. You might want to benchmark this
approach too.
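Regarding the BinaryDocValues alternative you describe below, here is a rough
sketch of the freq-as-pointer lookup in plain Java, assuming the per-document
blob is a concatenation of fixed-size dense vectors and the overridden term
frequency is a 1-based slot index. The names and layout are illustrative only,
not an existing Lucene API:

```java
import java.nio.ByteBuffer;

// Illustrative sketch of the "term frequency as pointer" idea.
class FreqPointerLookup {

    static final int DIMS = 50; // number of dimensions, known at indexing time

    // Serialize all vectors for one document into a single byte[]
    // (the value that would be stored in BinaryDocValues).
    static byte[] serialize(float[][] vectors) {
        ByteBuffer buf = ByteBuffer.allocate(vectors.length * DIMS * Float.BYTES);
        for (float[] v : vectors) {
            for (float f : v) {
                buf.putFloat(f);
            }
        }
        return buf.array();
    }

    // At query time: freq (>= 1) points at the slot holding this term's vector.
    static float[] lookup(byte[] blob, int freq) {
        float[] v = new float[DIMS];
        ByteBuffer buf = ByteBuffer.wrap(blob);
        buf.position((freq - 1) * DIMS * Float.BYTES);
        for (int i = 0; i < DIMS; i++) {
            v[i] = buf.getFloat();
        }
        return v;
    }
}
```

The fixed slot size keeps the lookup O(1), at the cost of storing zeros; a
sparse per-slot encoding would trade that for a scan or an offset table.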

Bruno

On Fri, Oct 23, 2020 at 02:48, Ankur Goel <ankur.goe...@gmail.com> wrote:

> Hi Lucene Devs,
>            I have a need to store a sparse feature vector on a per-term
> basis. The total number of possible dimensions is small (~50) and known at
> indexing time. The feature values will be used in scoring along with corpus
> statistics. It looks like payloads
> <https://cwiki.apache.org/confluence/display/LUCENE/Payload+Planning#> were
> created for exactly this purpose, but some workaround is needed to
> minimize the performance penalty, as mentioned on the wiki
> <https://cwiki.apache.org/confluence/display/LUCENE/Payload+Planning#>.
>
> An alternative is to override the *term frequency* to be a *pointer* into
> a *Map<pointer, Feature_Vector>* serialized and stored in
> *BinaryDocValues*. At query time, the matching *docId* will be used to
> advance to the starting offset of this map. The term frequency will then
> be used to look up the *Feature_Vector* in the serialized map. That's my
> current plan, but I haven't benchmarked it yet.
>
> The problem I am trying to solve is to *reduce index bloat* and
> *eliminate unnecessary seeks*: currently these ~50 dimensions are stored
> as separate fields in the index with very high term overlap, and Lucene
> does not share the terms dictionary across different fields. Sharing it
> could itself be a new feature for Lucene, but I imagine it would require
> a lot of work.
>
> Any ideas are welcome :-)
>
> Thanks
> -Ankur
>
