Hi Lucene Devs,
I need to store a sparse feature vector on a per-term basis. The total
number of possible dimensions is small (~50) and known at indexing time.
The feature values will be used in scoring along with corpus statistics.
It looks like payloads
<https://cwiki.apache.org/confluence/display/LUCENE/Payload+Planning#> were
created for exactly this purpose, but some workaround is needed to minimize
the performance penalty mentioned on that wiki page.

An alternative is to override *term frequency* to be a *pointer* into a
*Map<pointer, Feature_Vector>* that is serialized and stored in
*BinaryDocValues*. At query time, the matching *docId* will be used to
position the reader at the starting offset of this map. The term frequency
will then be used to look up the *Feature_Vector* in the serialized map.
That's my current plan, but I haven't benchmarked it.
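
To make the layout concrete, here is a minimal sketch of one way the
per-document map could be serialized and looked up by pointer. It is plain
Java with no Lucene dependency. The class name FeatureVectorBlob, the byte
layout, and the choice of dense 1-based pointers (so the pointer stays a
valid positive term frequency) are all assumptions made for illustration,
not an existing Lucene API. In a real index the byte[] would presumably be
the per-document value read from BinaryDocValues.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
// Assumed layout of the per-document blob:
//   [int n] [n x int offset] [vector 1] ... [vector n]
// Each vector is [byte nnz] followed by nnz x (byte dimension, float value).
final class FeatureVectorBlob {
    private FeatureVectorBlob() {}

    /** Serializes the vectors; the vector at list index i gets pointer i + 1. */
    static byte[] encode(List<Map<Integer, Float>> vectors) {
        int n = vectors.size();
        int size = 4 + 4 * n;
        for (Map<Integer, Float> v : vectors) {
            size += 1 + 5 * v.size();
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(n);
        // Offset table: absolute start position of each vector in the blob.
        int offset = 4 + 4 * n;
        for (Map<Integer, Float> v : vectors) {
            buf.putInt(offset);
            offset += 1 + 5 * v.size();
        }
        // Sparse vectors: only non-zero dimensions are written (dimension < 256).
        for (Map<Integer, Float> v : vectors) {
            buf.put((byte) v.size());
            for (Map.Entry<Integer, Float> e : new TreeMap<>(v).entrySet()) {
                buf.put((byte) e.getKey().intValue());
                buf.putFloat(e.getValue());
            }
        }
        return buf.array();
    }

    /** Returns the sparse vector stored under the given 1-based pointer. */
    static Map<Integer, Float> lookup(byte[] blob, int pointer) {
        ByteBuffer buf = ByteBuffer.wrap(blob);
        int n = buf.getInt(0);
        if (pointer < 1 || pointer > n) {
            throw new IllegalArgumentException("pointer out of range: " + pointer);
        }
        // The offset slot for pointer p sits at byte 4 + 4 * (p - 1) = 4 * p.
        int pos = buf.getInt(4 * pointer);
        int nnz = buf.get(pos++) & 0xFF;
        Map<Integer, Float> out = new TreeMap<>();
        for (int i = 0; i < nnz; i++) {
            int dim = buf.get(pos) & 0xFF;
            out.put(dim, buf.getFloat(pos + 1));
            pos += 5;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Float> first = new TreeMap<>();
        first.put(3, 0.5f);
        first.put(17, 2.0f);
        Map<Integer, Float> second = new TreeMap<>();
        second.put(49, 1.25f);
        byte[] blob = encode(Arrays.asList(first, second));
        // A term indexed with frequency 2 in this document resolves to 'second'.
        System.out.println(lookup(blob, 2));
    }
}
```

With this layout a lookup is a single read from the offset table followed
by a short sequential scan, so cost does not depend on how many terms the
document has. Whether that beats payloads in practice is exactly what
would still need benchmarking.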

The problem I am trying to solve is to *reduce index bloat* and *eliminate
unnecessary seeks*. Currently these ~50 dimensions are stored as separate
fields in the index with very high term overlap, and Lucene does not share
the terms dictionary across fields. Sharing it could itself be a new
feature for Lucene, but I imagine it would require a lot of work.

Any ideas are welcome :-)

Thanks
-Ankur
