Re: Index ordinal data in the taxonomy

Shai Erera Sat, 13 May 2023 22:51:50 -0700

Hi

> There's two approaches we could take initially,


Both approaches look fine to me, as long as we expose the right API. I
assume that if we use updatable DV, then we'll have a proper API on
TaxonomyWriter to update the fields, but otherwise (if we only allow
updates during a taxonomy rewrite) we won't have any update API. Another
option is to allow these rewrites during taxonomy merges, something we can
think about.

> Yes, we've considered things like a local database or a separate index.

Another approach is to treat this like a rescore query: you aggregate the
facets without their signals and then rescore the top-K (100, 1000, 10000)
facets according to external signals. Just another idea to think about
(yes, it's not perfect, but it might work OK-ish?)
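
To make the rescore idea concrete, here is a minimal sketch in plain Java.
It assumes the facet counts have already been aggregated into a map and that
the external signals sit in a second map; the class, method, and field names
are all hypothetical and none of them are Lucene APIs. The way the count and
the signal are combined (signal divided by count, as in the citations per
paper example from the original email) is also just an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are illustrative and are not Lucene APIs.
class FacetRescoreSketch {

    /** A facet label, its aggregated count, and its score after rescoring. */
    static final class ScoredFacet {
        final String label;
        final int count;
        final double score;

        ScoredFacet(String label, int count, double score) {
            this.label = label;
            this.count = count;
            this.score = score;
        }
    }

    static List<ScoredFacet> rescoreTopK(
            Map<String, Integer> counts, Map<String, Double> signals, int k) {
        // Pass 1: rank every facet by count alone (ties broken by label).
        List<ScoredFacet> ranked = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            ranked.add(new ScoredFacet(e.getKey(), e.getValue(), e.getValue()));
        }
        Comparator<ScoredFacet> byCount =
                Comparator.comparingInt((ScoredFacet f) -> f.count).reversed();
        ranked.sort(byCount.thenComparing((ScoredFacet f) -> f.label));

        // Pass 2: look up the external signal only for the top-K candidates.
        List<ScoredFacet> rescored = new ArrayList<>();
        for (ScoredFacet f : ranked.subList(0, Math.min(k, ranked.size()))) {
            double signal = signals.getOrDefault(f.label, 0.0);
            // Assumed combination: signal per counted document.
            double score = f.count == 0 ? 0.0 : signal / f.count;
            rescored.add(new ScoredFacet(f.label, f.count, score));
        }
        rescored.sort(Comparator.comparingDouble((ScoredFacet f) -> f.score).reversed());
        return rescored;
    }

    public static void main(String[] args) {
        // Made-up data: papers per author and total citations per author.
        Map<String, Integer> counts = Map.of("alice", 10, "bob", 4, "carol", 2, "dave", 1);
        Map<String, Double> citations =
                Map.of("alice", 50.0, "bob", 80.0, "carol", 100.0, "dave", 500.0);
        for (ScoredFacet f : rescoreTopK(counts, citations, 3)) {
            System.out.println(f.label + " " + f.score);
        }
    }
}
```

With this made-up data and K=3, the result is ordered carol, bob, alice.
dave never makes it into the top-K by count, even though his signal is the
highest, which is the "not perfect" part.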

Shai

On Sat, May 13, 2023 at 6:45 PM Stefan Vodita <stefan.vod...@gmail.com>
wrote:

> Hello Shai,
>
> Thank you for the feedback! I'll try to answer each of the questions.
>
> > will it change the API in a non-backward-compatible way, or impact
> > faceted search performance for the common case?
>
> The new API could overload FacetsConfig.build or provide a new method in
> TaxonomyWriter to plug in ordinal data. It doesn't have to change the
> functionality that already exists. A taxonomy index in the common case
> would be indistinguishable before and after this change.
>
> > Do you intend to support arbitrary signals, or only numeric ones?
>
> This is a crucial question. I'd like to take one small step forward and
> leave room for us to make improvements later. There's two approaches we
> could take initially, which I think you've already identified in your
> email:
>
> 1. Allow only updatable DocValues as ordinal data. This could become
> limiting at some point, but maybe it's a good first solution.
>
> 2. Disallow updating ordinal data. New ordinal data can only come in when
> a new taxonomy gets built.
>
> For the Amazon product search use case, option 2 is slightly better. We
> would build new indexes more often than we would get ordinal data updates.
> But I'm not sure what the better option is in the general case. This is
> where I'd like feedback from other users. Maybe there's also some other
> approach I haven't thought of.
>
> > Have you considered an alternative implementation of pulling that info
> > from another source during retrieval?
>
> Yes, we've considered things like a local database or a separate index.
> I haven't done a performance test, but my guess is that having the ordinal
> data in the taxonomy is as fast as it gets for use-cases like the faceting
> aggregation example in my previous email. Even if that isn't the case, the
> taxonomy solution is more convenient and less burdensome from an
> operational standpoint.
>
>
> I hope that's useful. Thanks again for the feedback,
>
> Stefan
>
> On Thu, 11 May 2023 at 16:53, Shai Erera <ser...@gmail.com> wrote:
> >
> > Hi Stefan,
> >
> > This sounds interesting and useful. It's like static scores for Lucene
> > documents, only that we will apply them to ordinals. Since I assume it's
> > not a very common use case though, do you know if this new functionality
> > affects existing use cases? For example, will it change the API in a
> > non-backward-compatible way, or impact faceted search performance for the
> > common case?
> >
> > Do you intend to support arbitrary signals, or only numeric ones?
> > Numeric signals will allow you to efficiently update the taxonomy index's
> > ordinal documents without updating the documents themselves (which will
> > change their ordinal!!). Other signals don't support this sort of update
> > (yet), so you might run into the issue of not being able to update them.
> > And at least for the author-citation-signal, that's definitely something
> > you'll want to update (unless you rebuild the index from time to time,
> > when the signals are updated).
> >
> > Have you considered an alternative implementation of pulling that info
> > from another source during retrieval? Just curious what would be the
> > performance implications, since an alternative source can give you the
> > flexibility of supporting other signals which are more complicated to
> > update, but won't affect the taxonomy index.
> >
> > Generally though, I don't see a reason not to support it.
> >
> > Shai
> >
> > On Thu, May 11, 2023 at 1:03 PM Stefan Vodita <stefan.vod...@gmail.com> wrote:
> >>
> >> Hi everyone,
> >>
> >> I work on the Lucene product search team at Amazon. We’ve been
> >> considering indexing scoring signals for ordinals into the taxonomy,
> >> which could reduce index size for some use-cases.
> >>
> >> Example
> >>
> >> Let's consider a library of research papers, where each paper is
> >> represented by a Lucene document and the paper's author is a facet field
> >> in that document. For each author we store the total number of
> >> citations. We want to compute a measure of each author's impact, the
> >> total number of citations divided by the number of articles published.
> >>
> >> Implementation
> >>
> >> Each author will be assigned an ordinal in the taxonomy. Lucene doesn't
> >> currently support storing data about an ordinal, but the taxonomy is
> >> itself a Lucene index, where each ordinal is represented by a document.
> >> Right now, the ordinal document has only a few fields allowing it to
> >> model the taxonomy structure, but we could conceivably add arbitrary
> >> fields to the ordinal documents. We would index the total number of
> >> citations an author has as a DocValue in the corresponding ordinal
> >> document.
> >>
> >> Advantages
> >>
> >> The alternative would be to denormalize data about the authors and have
> >> it on each doc that references that author. This leads to duplication.
> >> Since Lucene already has a document representation of the author (the
> >> ordinal doc), it makes sense conceptually that data about the author
> >> should be associated with the ordinal doc.
> >>
> >>
> >> I'm curious if anyone else has tried something like this and if the
> >> approach seems reasonable. I’ve made an attempt to code it and I can
> >> open a PR if this sounds like a useful feature.
> >>
> >> Stefan
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
