Re: Index ordinal data in the taxonomy

Stefan Vodita Sat, 27 May 2023 05:55:01 -0700

Hello,

I’ve opened an issue [1] to continue this discussion and a PR [2] showing an
easy way to add data about the ordinals to the taxonomy. Let me know if you
think it's reasonable.


Thank you,
Stefan

[1] https://github.com/apache/lucene/issues/12336
[2] https://github.com/apache/lucene/pull/12337

On Sun, 14 May 2023 at 06:52, Shai Erera <ser...@gmail.com> wrote:
>
> Hi
>
> > There's two approaches we could take initially,
>
> Both approaches look fine to me, as long as we expose the right API. I assume 
> that if we use updatable DV, then we'll have a proper API on TaxonomyWriter 
> to update the fields, but otherwise (if we only allow updating during a 
> taxonomy rewrite) we won't have any update API. Another option is to allow 
> these rewrites during taxonomy merges, something we can think about.
>
> > Yes, we've considered things like a local database or a separate index.
>
> Another approach is to treat this like a rescore query: you aggregate the 
> facets without their signals and then rescore the top-K (100, 1000, 10000) 
> facets according to external signals. Just another idea to think about (yes, 
> it's not perfect, but it might work OK-ish?)
>
> Shai
>
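A rough sketch of the rescore idea above, in plain Java with no Lucene classes. The class and method names are hypothetical, and ranking the surviving facets by the external signal alone is an assumption; the counts array stands in for whatever the regular facet aggregation produces.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntToDoubleFunction;

/** Hypothetical two-pass facet ranking; not part of any Lucene API. */
final class FacetRescoreSketch {

  /**
   * First pass: keep the k ordinals with the highest counts.
   * Second pass: reorder only those k ordinals by an external signal.
   */
  static List<Integer> rescoreTopK(
      int[] countByOrdinal, int k, IntToDoubleFunction externalSignal) {
    List<Integer> ordinals = new ArrayList<>();
    for (int ord = 0; ord < countByOrdinal.length; ord++) {
      if (countByOrdinal[ord] > 0) {
        ordinals.add(ord); // skip ordinals that matched no documents
      }
    }
    // Rank by count, highest first, without touching the external signal.
    ordinals.sort(
        Comparator.comparingInt((Integer ord) -> countByOrdinal[ord]).reversed());
    List<Integer> topK =
        new ArrayList<>(ordinals.subList(0, Math.min(k, ordinals.size())));
    // The external source is consulted only for the k survivors.
    topK.sort(
        Comparator.comparingDouble(
                (Integer ord) -> externalSignal.applyAsDouble(ord))
            .reversed());
    return topK;
  }
}
```

An ordinal with a strong signal but a low count never reaches the second pass, which is the imperfection mentioned above.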
> On Sat, May 13, 2023 at 6:45 PM Stefan Vodita <stefan.vod...@gmail.com> wrote:
>>
>> Hello Shai,
>>
>> Thank you for the feedback! I'll try to answer each of the questions.
>>
>> > will it change the API in a non-backward-compatible way, or impact 
>> > faceted search performance for the common case?
>>
>> The new API could overload FacetsConfig.build or provide a new method in
>> TaxonomyWriter to plug in ordinal data. It doesn't have to change the
>> functionality that already exists. A taxonomy index in the common case
>> would be indistinguishable before and after this change.
>>
>> > Do you intend to support arbitrary signals, or only numeric ones?
>>
>> This is a crucial question. I'd like to take one small step forward and leave
>> room for us to make improvements later. There's two approaches we could take
>> initially, which I think you've already identified in your email:
>>
>> 1. Allow only updatable DocValues as ordinal data. This could become
>> limiting at some point, but maybe it's a good first solution.
>>
>> 2. Disallow updating ordinal data. New ordinal data can only come in when
>> a new taxonomy gets built.
>>
>> For the Amazon product search use case, option 2 is slightly better. We would
>> build new indexes more often than we would get ordinal data updates. But I'm
>> not sure what the better option is in the general case. This is where I'd
>> like feedback from other users. Maybe there's also some other approach I
>> haven't thought of.
>>
>> > Have you considered an alternative implementation of pulling that info 
>> > from another source during retrieval?
>>
>> Yes, we've considered things like a local database or a separate index.
>> I haven't done a performance test, but my guess is that having the ordinal
>> data in the taxonomy is as fast as it gets for use-cases like the faceting
>> aggregation example in my previous email. Even if that isn't the case, the
>> taxonomy solution is more convenient and less burdensome from an operational
>> standpoint.
>>
>>
>> I hope that's useful. Thanks again for the feedback,
>>
>> Stefan
>>
>> On Thu, 11 May 2023 at 16:53, Shai Erera <ser...@gmail.com> wrote:
>> >
>> > Hi Stefan,
>> >
>> > This sounds interesting and useful. It's like static scores for Lucene 
>> > documents, only that we will apply them to ordinals. Since I assume it's 
>> > not a very common use case though, do you know if this new functionality 
>> > affects existing use cases? For example, will it change the API in a 
>> > non-backward-compatible way, or impact faceted search performance for the 
>> > common case?
>> >
>> > Do you intend to support arbitrary signals, or only numeric ones? Numeric 
>> > signals will allow you to efficiently update the taxonomy index's ordinal 
>> > documents without updating the documents themselves (which will change 
>> > their ordinal!!). Other signals don't support this sort of update (yet), 
>> > so you might run into the issue of not being able to update them. And at 
>> > least for the author-citation-signal, that's definitely something you'll 
>> > want to update (unless you rebuild the index from time to time, when the 
>> > signals are updated).
>> >
>> > Have you considered an alternative implementation of pulling that info 
>> > from another source during retrieval? Just curious what would be the 
>> > performance implications, since an alternative source can give you the 
>> > flexibility of supporting other signals which are more complicated to 
>> > update, but won't affect the taxonomy index.
>> >
>> > Generally though, I don't see a reason not to support it.
>> >
>> > Shai
>> >
>> > On Thu, May 11, 2023 at 1:03 PM Stefan Vodita <stefan.vod...@gmail.com> wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >> I work on the Lucene product search team at Amazon. We’ve been considering
>> >> indexing scoring signals for ordinals into the taxonomy, which could
>> >> reduce index size for some use-cases.
>> >>
>> >> Example
>> >>
>> >> Let's consider a library of research papers, where each paper is
>> >> represented by a Lucene document and the paper's author is a facet field
>> >> in that document. For each author we store the total number of citations.
>> >> We want to compute a measure of each author's impact, the total number of
>> >> citations divided by the number of articles published.
>> >>
>> >> Implementation
>> >>
>> >> Each author will be assigned an ordinal in the taxonomy. Lucene doesn't
>> >> currently support storing data about an ordinal, but the taxonomy is
>> >> itself a Lucene index, where each ordinal is represented by a document.
>> >> Right now, the ordinal document has only a few fields allowing it to model
>> >> the taxonomy structure, but we could conceivably add arbitrary fields to
>> >> the ordinal documents. We would index the total number of citations an
>> >> author has as a DocValue in the corresponding ordinal document.
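To make the example concrete, here is a minimal sketch in plain Java (no Lucene classes). It assumes the per-author citation totals have already been read from the ordinal documents into an array indexed by ordinal, and that facet counting has produced a paper count per ordinal. The class and method names are hypothetical, and treating authors with no matching papers as impact 0 is an assumption.

```java
/** Hypothetical model of per-ordinal data; not part of any Lucene API. */
final class OrdinalImpactSketch {

  /**
   * citationsByOrdinal[ord] stands in for the numeric value stored on the
   * ordinal's taxonomy document; paperCountByOrdinal[ord] stands in for the
   * facet count of that ordinal over the matching papers.
   */
  static double[] impactByOrdinal(long[] citationsByOrdinal, int[] paperCountByOrdinal) {
    if (citationsByOrdinal.length != paperCountByOrdinal.length) {
      throw new IllegalArgumentException("arrays must be indexed by the same ordinals");
    }
    double[] impact = new double[citationsByOrdinal.length];
    for (int ord = 0; ord < impact.length; ord++) {
      int papers = paperCountByOrdinal[ord];
      // Assumption: an author with no matching papers gets impact 0
      // instead of a division by zero.
      impact[ord] = papers == 0 ? 0.0 : (double) citationsByOrdinal[ord] / papers;
    }
    return impact;
  }
}
```

The citation total is stored once per author rather than once per paper, which is where the index size saving would come from.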
>> >>
>> >> Advantages
>> >>
>> >> The alternative would be to denormalize data about the authors and have
>> >> it on each doc that references that author. This leads to duplication.
>> >> Since Lucene already has a document representation of the author (the
>> >> ordinal doc), it makes sense conceptually that data about the author
>> >> should be associated with the ordinal doc.
>> >>
>> >>
>> >> I'm curious if anyone else has tried something like this and if the
>> >> approach seems reasonable. I’ve made an attempt to code it and I can open
>> >> a PR if this sounds like a useful feature.
>> >>
>> >> Stefan
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org