Re: Computing multiple different aggregations over a match-set in one pass

Greg Miller Wed, 15 Feb 2023 07:48:22 -0800

Hi Stefan-

> In that case, iterating twice duplicates most of the work, correct?


I'm not sure I'd agree that it duplicates "most" of the work. This is an
association faceting example, which is a bit of a special case in some
ways. But, to your question: the duplicated work here is re-loading the
ordinals for each of the two aggregations. I suspect the more expensive
work is actually computing the different aggregations, which is not
duplicated. You're right that it would likely be more efficient to iterate
the hits once, loading the ordinals once and computing multiple
aggregations in one pass. There's no facility for doing that currently in
Lucene's faceting module, but you could always propose it! :)
That said, I'm not sure how common a case this really is for the majority
of users, but that's just a guess on my part.

Cheers,
-Greg
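
For illustration, here is a minimal sketch of the single-pass idea discussed
above, written in plain Java rather than against Lucene's faceting module
(which, as noted, has no such facility). Everything in it is hypothetical:
the `Hit` class stands in for a matching document with its score, a
`popularity` value, and its facet ordinals, and `aggregate` is an invented
helper. It computes `sum(_score * sqrt(popularity))` and `max(popularity)`
per ordinal while reading each hit's ordinals only once.

```java
import java.util.Arrays;

public class SinglePassAggregation {

    /** Hypothetical stand-in for a matching document (not a Lucene class). */
    static final class Hit {
        final float score;      // relevance score of the hit
        final long popularity;  // numeric value, as if read from a doc value
        final int[] ordinals;   // facet ordinals attached to the document

        Hit(float score, long popularity, int[] ordinals) {
            this.score = score;
            this.popularity = popularity;
            this.ordinals = ordinals;
        }
    }

    /** One accumulator slot per facet ordinal, for each aggregation. */
    static final class Result {
        final double[] weightedSum;
        final long[] maxPopularity;

        Result(int ordinalCount) {
            weightedSum = new double[ordinalCount];
            maxPopularity = new long[ordinalCount];
            // Long.MIN_VALUE marks "no hit seen yet" for the max aggregation.
            Arrays.fill(maxPopularity, Long.MIN_VALUE);
        }
    }

    /** Iterates the hits once and updates both aggregations per ordinal. */
    static Result aggregate(Hit[] hits, int ordinalCount) {
        Result result = new Result(ordinalCount);
        for (Hit hit : hits) {
            // Per-hit value is computed once and shared by all its ordinals.
            double weighted = hit.score * Math.sqrt(hit.popularity);
            for (int ord : hit.ordinals) { // ordinals are read a single time
                result.weightedSum[ord] += weighted;
                result.maxPopularity[ord] =
                    Math.max(result.maxPopularity[ord], hit.popularity);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Hit[] hits = {
            new Hit(1.0f, 4, new int[] {0, 1}),
            new Hit(2.0f, 9, new int[] {1}),
        };
        Result result = aggregate(hits, 2);
        System.out.println(Arrays.toString(result.weightedSum));   // [2.0, 8.0]
        System.out.println(Arrays.toString(result.maxPopularity)); // [4, 9]
    }
}
```

The sketch only shows the shape of the loop; whether it would be faster in
practice depends on how expensive loading ordinals is relative to the
aggregation work itself, which is the open question in this thread.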

On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita wrote:

> Hi Greg,
>
> I see now where my example didn’t give enough info. In my mind, `Genre /
> Author nationality / Author name` is stored in one hierarchical facet
> field.
> The data we’re aggregating over, like publish date or price, are stored in
> DocValues.
>
> The demo package shows something similar [1], where the aggregation
> is computed across a facet field using data from a `popularity` DocValue.
>
> In the demo, we compute `sum(_score * sqrt(popularity))`, but what if we
> want several other different aggregations with respect to the same facet
> field? Maybe we want `max(popularity)`. In that case, iterating twice
> duplicates most of the work, correct?
>
>
> Stefan
>
> [1]
> https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
>
> On Mon, 13 Feb 2023 at 22:46, Greg Miller wrote:
> >
> > Hi Stefan-
> >
> > That helps, thanks. I'm a bit confused about where you're concerned with
> > iterating over the match set multiple times. Is this a situation where the
> > ordinals you want to facet over are stored in different index fields, so
> > you have to create multiple Facets instances (one per field) to compute the
> > aggregations? If that's the case, then yes, you have to iterate over the
> > match set multiple times (once per field). I'm not sure that's such a big
> > issue given that you're doing novel work during each iteration, so the only
> > repetitive cost is actually iterating the hits. If the ordinals are
> > "packed" into the same field though (which is the default in Lucene if
> > you're using taxonomy faceting), then you should only need to do a single
> > iteration over that field.
> >
> > Cheers,
> > -Greg
> >
> > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita wrote:
> >
> > > Hi Greg,
> > >
> > > I’m assuming we have one match-set which was not constrained by any
> > > of the categories we want to aggregate over, so it may have books by
> > > Mark Twain, books by American authors, and sci-fi books.
> > >
> > > Maybe we can imagine we obtained it by searching for a keyword, say
> > > “Washington”, which is present in Mark Twain’s writing, and those of other
> > > American authors, and in sci-fi novels too.
> > >
> > > Does that make the example clearer?
> > >
> > >
> > > Stefan
> > >
> > >
> > > On Sat, 11 Feb 2023 at 00:16, Greg Miller wrote:
> > > >
> > > > Hi Stefan-
> > > >
> > > > Can you clarify your example a little bit? It sounds like you want to facet
> > > > over three different match sets (one constrained by "Mark Twain" as the
> > > > author, one constrained by "American authors" and one constrained by the
> > > > "sci-fi" genre). Is that correct?
> > > >
> > > > Cheers,
> > > > -Greg
> > > >
> > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Let’s say I have an index of books, similar to the example in the facet
> > > > > demo [1] with a hierarchical facet field encapsulating
> > > > > `Genre / Author’s nationality / Author’s name`.
> > > > >
> > > > > I might like to find the latest publish date of a book written by Mark
> > > > > Twain, the sum of the prices of books written by American authors, and
> > > > > the number of sci-fi novels.
> > > > >
> > > > > As far as I understand, this would require faceting 3 times over the
> > > > > match-set, one iteration for each aggregation of a different type
> > > > > (max(date), sum(price), count). That seems inefficient if we could
> > > > > instead compute all aggregations in one pass.
> > > > >
> > > > > Is there a way to do that?
> > > > >
> > > > >
> > > > > Stefan
> > > > >
> > > > > [1] https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: [email protected]
> > > > > For additional commands, e-mail: [email protected]
> > > > >
> > > > >
> > >
>
