Hi Stefan,

This sounds interesting and useful. It's like static scores for Lucene
documents, only that we will apply them to ordinals. Since I assume it's
not a very common use case though, do you know if this new functionality
affects existing use cases? For example, will it change the API in a
non-backward-compatible way, or impact faceted search performance for the
common case?

Do you intend to support arbitrary signals, or only numeric ones? Numeric
signals will allow you to efficiently update the taxonomy index's ordinal
documents without updating the documents themselves (which will change
their ordinal!!). Other signals don't support this sort of update (yet), so
you might run into the issue of not being able to update them. And at least
for the author-citation-signal, that's definitely something you'll want to
update (unless you rebuild the index from time to time, when the signals
are updated).

Have you considered an alternative implementation that pulls that info from
another source during retrieval? I'm curious what the performance
implications would be, since an alternative source gives you the
flexibility to support other signals that are more complicated to update,
without affecting the taxonomy index.
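
To make the alternative concrete, here is a minimal sketch in plain Java of
a side store keyed by ordinal. It deliberately uses no Lucene APIs: the
class name, the method names, and the idea that the per-author paper count
comes from the facet counts at retrieval time are all assumptions for
illustration, not existing functionality.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical side store of per-ordinal signals, kept outside the taxonomy index. */
final class OrdinalSignalSketch {
    // ordinal -> total citations for the author with that ordinal (illustrative signal)
    private final Map<Integer, Long> citationsByOrdinal = new HashMap<>();

    /** Overwrites the signal in place; the ordinal itself is never touched. */
    void setCitations(int ordinal, long totalCitations) {
        citationsByOrdinal.put(ordinal, totalCitations);
    }

    /**
     * Impact = total citations / number of papers.
     * paperCount is assumed to be the facet count for this ordinal.
     * Returns 0.0 when the signal is missing or there are no papers.
     */
    double impact(int ordinal, int paperCount) {
        Long citations = citationsByOrdinal.get(ordinal);
        if (citations == null || paperCount <= 0) {
            return 0.0;
        }
        return (double) citations / paperCount;
    }
}
```

Updating a signal here is just a map write, so the open question is only
what the extra lookup per ordinal costs during retrieval.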

Generally though, I don't see a reason not to support it.

Shai

On Thu, May 11, 2023 at 1:03 PM Stefan Vodita <stefan.vod...@gmail.com>
wrote:

> Hi everyone,
>
> I work on the Lucene product search team at Amazon. We’ve been considering
> indexing scoring signals for ordinals into the taxonomy, which could reduce
> index size for some use-cases.
>
> Example
>
> Let's consider a library of research papers, where each paper is
> represented by a Lucene document and the paper's author is a facet field
> in that document. For each author we store the total number of citations.
> We want to compute a measure of each author's impact, the total number of
> citations divided by the number of articles published.
>
> Implementation
>
> Each author will be assigned an ordinal in the taxonomy. Lucene doesn't
> currently support storing data about an ordinal, but the taxonomy is
> itself a Lucene index, where each ordinal is represented by a document.
> Right now, the ordinal document has only a few fields allowing it to model
> the taxonomy structure, but we could conceivably add arbitrary fields to
> the ordinal documents. We would index the total number of citations an
> author has as a DocValue in the corresponding ordinal document.
>
> Advantages
>
> The alternative would be to denormalize data about the authors and have it
> on each doc that references that author. This leads to duplication. Since
> Lucene already has a document representation of the author (the ordinal
> doc), it makes sense conceptually that data about the author should be
> associated with the ordinal doc.
>
>
> I'm curious if anyone else has tried something like this and if the
> approach seems reasonable. I’ve made an attempt to code it and I can open
> a PR if this sounds like a useful feature.
>
> Stefan
