Hi Stefan,

This sounds interesting and useful. It's like static scores for Lucene documents, except that we would apply them to ordinals. Since I assume it's not a very common use case, do you know whether this new functionality affects existing use cases? For example, will it change the API in a non-backward-compatible way, or impact faceted search performance for the common case?

Do you intend to support arbitrary signals, or only numeric ones? Numeric signals will let you efficiently update the taxonomy index's ordinal documents in place, without reindexing the documents themselves (which would change their ordinals!). Other signals don't support this sort of update (yet), so you might run into the issue of not being able to update them. And at least for the author-citation signal, that's definitely something you'll want to update (unless you rebuild the index from time to time, when the signals are updated).

Have you considered an alternative implementation that pulls this information from another source during retrieval? I'm curious what the performance implications would be, since an alternative source gives you the flexibility to support other signals that are more complicated to update, without affecting the taxonomy index.

Generally though, I don't see a reason not to support it.

Shai

On Thu, May 11, 2023 at 1:03 PM Stefan Vodita <stefan.vod...@gmail.com> wrote:

> Hi everyone,
>
> I work on the Lucene product search team at Amazon. We’ve been considering
> indexing scoring signals for ordinals into the taxonomy, which could reduce
> index size for some use-cases.
>
> Example
>
> Let's consider a library of research papers, where each paper is
> represented by a Lucene document and the paper's author is a facet field
> in that document. For each author we store the total number of citations.
> We want to compute a measure of each author's impact, the total number of
> citations divided by the number of articles published.
>
> Implementation
>
> Each author will be assigned an ordinal in the taxonomy. Lucene doesn't
> currently support storing data about an ordinal, but the taxonomy is
> itself a Lucene index, where each ordinal is represented by a document.
> Right now, the ordinal document has only a few fields allowing it to model
> the taxonomy structure, but we could conceivably add arbitrary fields to
> the ordinal documents.
> We would index the total number of citations an author has as a DocValue
> in the corresponding ordinal document.
>
> Advantages
>
> The alternative would be to denormalize data about the authors and have it
> on each doc that references that author. This leads to duplication. Since
> Lucene already has a document representation of the author (the ordinal
> doc), it makes sense conceptually that data about the author should be
> associated with the ordinal doc.
>
> I'm curious if anyone else has tried something like this and if the
> approach seems reasonable. I’ve made an attempt to code it and I can open
> a PR if this sounds like a useful feature.
>
> Stefan
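
To make the arithmetic in the quoted example concrete, here is a toy sketch in plain Java. It uses no Lucene classes: `OrdinalSignals`, `setCitations`, and `impact` are hypothetical names, and the map only stands in for a numeric value stored once per ordinal document. It is an illustration under those assumptions, not the proposed implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of per-ordinal signals; hypothetical, no Lucene classes involved. */
public class OrdinalSignals {
    // ordinal -> total citations, stored once per author instead of on every paper
    private final Map<Integer, Long> citationsByOrdinal = new HashMap<>();

    /** Sets or overwrites the signal for one ordinal; the ordinal itself is unchanged. */
    public void setCitations(int ordinal, long citations) {
        citationsByOrdinal.put(ordinal, citations);
    }

    /**
     * Computes citations per paper for each author ordinal.
     * paperAuthorOrdinals holds one author ordinal per paper.
     */
    public Map<Integer, Double> impact(List<Integer> paperAuthorOrdinals) {
        // Count papers per author, as a facet count would.
        Map<Integer, Integer> paperCounts = new HashMap<>();
        for (int ordinal : paperAuthorOrdinals) {
            paperCounts.merge(ordinal, 1, Integer::sum);
        }
        // Divide each author's citation total by that author's paper count.
        Map<Integer, Double> result = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : paperCounts.entrySet()) {
            long citations = citationsByOrdinal.getOrDefault(e.getKey(), 0L);
            result.put(e.getKey(), (double) citations / e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        OrdinalSignals signals = new OrdinalSignals();
        signals.setCitations(1, 120L);
        signals.setCitations(2, 30L);
        // Three papers by author 1, one paper by author 2.
        Map<Integer, Double> impact = signals.impact(List.of(1, 1, 1, 2));
        System.out.println(impact.get(1)); // prints 40.0
        System.out.println(impact.get(2)); // prints 30.0
    }
}
```

Overwriting a value with `setCitations` leaves the ordinal and the papers untouched, which mirrors the point about numeric signals being updatable without reindexing.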