Hi everyone,

I work on the Lucene product search team at Amazon. We’ve been considering
indexing per-ordinal scoring signals in the taxonomy index, which could reduce
the size of the main index for some use cases.

Example

Let's consider a library of research papers, where each paper is represented by
a Lucene document and the paper's author is a facet field in that document. For
each author we store the total number of citations. We want to compute a
measure of each author's impact: the total number of citations divided by
the number of papers the author has published.
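
To make the arithmetic concrete, here is a minimal plain-Java sketch of the
impact measure. It does not use Lucene at all; the class name, method name,
author labels and citation numbers are all made up for illustration. The
paper count per author is what facet counting would give us, and the citation
total is the per-author value we would like to store only once.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Lucene.
class AuthorImpact {

    /**
     * @param paperAuthors   one entry per paper: the author facet value of that paper
     * @param totalCitations total citation count, stored once per author
     * @return impact per author = total citations / number of papers
     */
    static Map<String, Double> impact(List<String> paperAuthors,
                                      Map<String, Long> totalCitations) {
        // Count papers per author (the role facet counting plays at query time).
        Map<String, Integer> paperCounts = new LinkedHashMap<>();
        for (String author : paperAuthors) {
            paperCounts.merge(author, 1, Integer::sum);
        }
        // Divide each author's citation total by that author's paper count.
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : paperCounts.entrySet()) {
            long citations = totalCitations.getOrDefault(e.getKey(), 0L);
            result.put(e.getKey(), (double) citations / e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data: two papers by author-a, one by author-b.
        List<String> paperAuthors = List.of("author-a", "author-a", "author-b");
        Map<String, Long> totalCitations = Map.of("author-a", 300L, "author-b", 120L);
        System.out.println(impact(paperAuthors, totalCitations));
    }
}
```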

Implementation

Each author will be assigned an ordinal in the taxonomy. Lucene doesn't
currently support storing data about an ordinal, but the taxonomy is itself a
Lucene index, where each ordinal is represented by a document. Right now, the
ordinal document has only a few fields allowing it to model the taxonomy
structure, but we could conceivably add arbitrary fields to the ordinal
documents. We would index each author's total number of citations as a doc
values field in the corresponding ordinal document.
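
As a rough model of what this would look like, the sketch below keeps one
value per ordinal and combines it with per-ordinal paper counts. It is plain
Java and deliberately not the Lucene taxonomy API: OrdinalDataSketch,
addAuthor and impact are hypothetical names, and the list indexed by ordinal
only stands in for a numeric doc values field on the ordinal document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the idea; this is NOT the Lucene taxonomy API.
class OrdinalDataSketch {
    // Author label -> ordinal, like the taxonomy's category-to-ordinal mapping.
    private final Map<String, Integer> ordinalByAuthor = new HashMap<>();
    // One value per ordinal; stands in for a doc values field on the ordinal doc.
    private final List<Long> citationsByOrdinal = new ArrayList<>();

    /** Assigns an ordinal to a new author, or updates the value of a known one. */
    int addAuthor(String author, long totalCitations) {
        Integer existing = ordinalByAuthor.get(author);
        if (existing != null) {
            citationsByOrdinal.set(existing, totalCitations);
            return existing;
        }
        int ordinal = citationsByOrdinal.size();
        ordinalByAuthor.put(author, ordinal);
        citationsByOrdinal.add(totalCitations);
        return ordinal;
    }

    /**
     * Combines per-ordinal paper counts (as facet counting would produce them)
     * with the citation totals stored per ordinal. Ordinals with no matching
     * papers get an impact of 0.
     */
    double[] impact(int[] paperCountByOrdinal) {
        double[] result = new double[citationsByOrdinal.size()];
        for (int ord = 0; ord < result.length; ord++) {
            int papers = ord < paperCountByOrdinal.length ? paperCountByOrdinal[ord] : 0;
            result[ord] = papers == 0 ? 0.0 : (double) citationsByOrdinal.get(ord) / papers;
        }
        return result;
    }

    public static void main(String[] args) {
        OrdinalDataSketch taxonomy = new OrdinalDataSketch();
        int a = taxonomy.addAuthor("author-a", 300L); // made-up numbers
        int b = taxonomy.addAuthor("author-b", 120L);
        int[] paperCounts = new int[2];
        paperCounts[a] = 2; // two matching papers by author-a
        paperCounts[b] = 1; // one matching paper by author-b
        double[] impact = taxonomy.impact(paperCounts);
        System.out.println("author-a: " + impact[a] + ", author-b: " + impact[b]);
    }
}
```

The point of the model is that the citation total is written once per author,
no matter how many papers reference that author.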

Advantages

The alternative would be to denormalize the author data and store it on
every document that references that author, which duplicates the same value
once per paper. Since Lucene already has a document representation of the
author (the ordinal doc), it makes sense conceptually to associate data about
the author with the ordinal doc.


I'm curious if anyone else has tried something like this and if the approach
seems reasonable. I’ve made an attempt to code it and I can open a PR if this
sounds like a useful feature.

Stefan

