Hello, I’ve opened an issue [1] to continue this discussion and a PR [2] showing an easy way to add data about the ordinals to the taxonomy. Let me know if you think it's reasonable.
Thank you,
Stefan

[1] https://github.com/apache/lucene/issues/12336
[2] https://github.com/apache/lucene/pull/12337

On Sun, 14 May 2023 at 06:52, Shai Erera <ser...@gmail.com> wrote:
>
> Hi
>
> > There's two approaches we could take initially,
>
> Both approaches look fine to me. As long as we expose the right API. I assume
> that if we use updatable DV, then we'll have a proper API on TaxoWrite to
> update the fields, but otherwise (if we'll only allow updating during Taxo
> rewrite) we won't have any update API. Another option is to allow these
> rewrites during taxonomy merges, something we can think about.
>
> > Yes, we've considered things like a local database or a separate index.
>
> Another approach is to treat this like a rescore query: you aggregate the
> facets without their signals and then rescore the top-K (100, 1000, 10000)
> facets according to external signals. Just another idea to think about (yes,
> it's not perfect, but it might work OK-ish?)
>
> Shai
>
> On Sat, May 13, 2023 at 6:45 PM Stefan Vodita <stefan.vod...@gmail.com> wrote:
>>
>> Hello Shai,
>>
>> Thank you for the feedback! I'll try to answer each of the questions.
>>
>> > will it change the API in non-backward compatible way, or impact faceted
>> > search performance for the common case?
>>
>> The new API could overload FacetsConfig.build or provide a new method in
>> TaxonomyWriter to plug in ordinal data. It doesn't have to change the
>> functionality that already exists. A taxonomy index in the common case would
>> be indistinguishable before and after this change.
>>
>> > Do you intend to support arbitrary signals, or only numeric ones?
>>
>> This is a crucial question. I'd like to take one small step forward and leave
>> room for us to make improvements later. There's two approaches we could take
>> initially, which I think you've already identified in your email:
>>
>> 1. Allow only updatable DocValues as ordinal data.
>> This could become limiting at some point, but maybe it's a good first
>> solution.
>>
>> 2. Disallow updating ordinal data. New ordinal data can only come in when a
>> new taxonomy gets built.
>>
>> For the Amazon product search use case, option 2 is slightly better. We would
>> build new indexes more often than we would get ordinal data updates. But I'm
>> not sure what the better option is in the general case. This is where I'd
>> like feedback from other users. Maybe there's also some other approach I
>> haven't thought of.
>>
>> > Have you considered an alternative implementation of pulling that info
>> > from another source during retrieval?
>>
>> Yes, we've considered things like a local database or a separate index.
>> I haven't done a performance test, but my guess is that having the ordinal
>> data in the taxonomy is as fast as it gets for use-cases like the faceting
>> aggregation example in my previous email. Even if that isn't the case, the
>> taxonomy solution is more convenient and less burdensome from an operational
>> standpoint.
>>
>>
>> I hope that's useful. Thanks again for the feedback,
>>
>> Stefan
>>
>> On Thu, 11 May 2023 at 16:53, Shai Erera <ser...@gmail.com> wrote:
>> >
>> > Hi Stefan,
>> >
>> > This sounds interesting and useful. It's like static scores for Lucene
>> > documents, only that we will apply them to ordinals. Since I assume it's
>> > not a very common use case though, do you know if this new functionality
>> > affects existing use cases? For example, will it change the API in
>> > non-backward compatible way, or impact faceted search performance for the
>> > common case?
>> >
>> > Do you intend to support arbitrary signals, or only numeric ones? Numeric
>> > signals will allow you to efficiently update the taxonomy index's ordinal
>> > documents without updating the documents themselves (which will change
>> > their ordinal!!).
>> > Other signals don't support this sort of update (yet),
>> > so you might run into the issue of not being able to update them. And at
>> > least for the author-citation-signal, that's definitely something you'll
>> > want to update (unless you rebuild the index from time to time, when the
>> > signals are updated).
>> >
>> > Have you considered an alternative implementation of pulling that info
>> > from another source during retrieval? Just curious what would be the
>> > performance implications, since an alternative source can give you the
>> > flexibility of supporting other signals which are more complicated to
>> > update, but won't affect the taxonomy index.
>> >
>> > Generally though, I don't see a reason not to support it.
>> >
>> > Shai
>> >
>> > On Thu, May 11, 2023 at 1:03 PM Stefan Vodita <stefan.vod...@gmail.com>
>> > wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >> I work on the Lucene product search team at Amazon. We’ve been considering
>> >> indexing scoring signals for ordinals into the taxonomy, which could
>> >> reduce index size for some use-cases.
>> >>
>> >> Example
>> >>
>> >> Let's consider a library of research papers, where each paper is
>> >> represented by a Lucene document and the paper's author is a facet field
>> >> in that document. For each author we store the total number of citations.
>> >> We want to compute a measure of each author's impact, the total number of
>> >> citations divided by the number of articles published.
>> >>
>> >> Implementation
>> >>
>> >> Each author will be assigned an ordinal in the taxonomy. Lucene doesn't
>> >> currently support storing data about an ordinal, but the taxonomy is
>> >> itself a Lucene index, where each ordinal is represented by a document.
>> >> Right now, the ordinal document has only a few fields allowing it to model
>> >> the taxonomy structure, but we could conceivably add arbitrary fields to
>> >> the ordinal documents.
>> >> We would index the total number of citations an author has as a
>> >> DocValue in the corresponding ordinal document.
>> >>
>> >> Advantages
>> >>
>> >> The alternative would be to denormalize data about the authors and have
>> >> it on each doc that references that author. This leads to duplication.
>> >> Since Lucene already has a document representation of the author (the
>> >> ordinal doc), it makes sense conceptually that data about the author
>> >> should be associated with the ordinal doc.
>> >>
>> >>
>> >> I'm curious if anyone else has tried something like this and if the
>> >> approach seems reasonable. I’ve made an attempt to code it and I can open
>> >> a PR if this sounds like a useful feature.
>> >>
>> >> Stefan
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
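
As a rough illustration of two ideas from this thread, here is a minimal sketch in plain Python. It does not use Lucene and is not the code in the PR. The names `citations_by_ordinal`, `author_impact`, and `rescore_top_k`, along with all the numbers, are made up for the example. The first function models the author-impact computation with citation totals stored once per ordinal; the second models the "aggregate first, then rescore the top-K" alternative.

```python
from collections import Counter

# Hypothetical per-ordinal data, stored once per author. In the proposal this
# would live on the ordinal document in the taxonomy index (e.g. as a DocValue).
citations_by_ordinal = {1: 120, 2: 45}

# Each paper references its author only by ordinal, so the citation total is
# not duplicated on every paper document.
paper_author_ordinals = [1, 1, 1, 2, 2]


def author_impact(doc_ordinals, citations):
    """Return {ordinal: total citations / number of papers for that ordinal}."""
    paper_counts = Counter(doc_ordinals)  # plain facet counts per ordinal
    return {
        ordinal: citations[ordinal] / count
        for ordinal, count in paper_counts.items()
        if ordinal in citations  # skip ordinals with no stored data
    }


def rescore_top_k(doc_ordinals, external_signal, k):
    """Count facets without signals, keep the top k by count, then reorder
    those k ordinals by an external signal (highest signal first)."""
    counts = Counter(doc_ordinals)
    top_k = [ordinal for ordinal, _ in counts.most_common(k)]
    return sorted(top_k, key=lambda o: external_signal.get(o, 0), reverse=True)


impact = author_impact(paper_author_ordinals, citations_by_ordinal)
reranked = rescore_top_k([1, 1, 1, 2, 2, 3], {1: 5, 2: 50, 3: 500}, k=2)
```

In the second call, ordinal 3 has the strongest signal but never makes it into the top 2 by count, which is the kind of imperfection the rescore approach accepts.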