Re: GROUP BY in Lucene

Gimantha Bandara Sat, 19 Mar 2016 04:36:38 -0700

Hi Rob,

Thank you for explaining your approach. I still have a few questions. Do I
need to mark the values being aggregated as STORED at indexing time? And
how does the collector handle a large number of documents when aggregating?
Say I have several million documents in an index and I want to compute the
SUM of a field called "subject_marks". How does the collector handle the
summation efficiently? Does it go through all the segments in parallel, or
something like that?


For now we have a facet field holding X, Y and Z, so I can get the documents
that belong to a specific XYZ group and aggregate over those records, and I
can do that for every group. But it is not fast. It is a simple Java loop
that goes through all the distinct facet values, aggregates the values of
the documents belonging to each one, and puts the results into a map. It is
slow because we do not store the field values in the Lucene documents; we
fetch the actual data from a DB. We keep only an ID as a STORED field in
each Lucene document. Once we have those IDs, we look up the DB and perform
the aggregation. This becomes really slow as the number of records grows.
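
To make sure I understand the map-based grouping, here is a minimal
plain-Java sketch of it, with no Lucene classes involved. GroupKey, Row and
GroupBySum are hypothetical names, and Row merely stands in for whatever
values a collector would read per matching document; this is not Rob's
actual code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Immutable composite key for GROUP BY X, Y, Z; the hash is computed once. */
final class GroupKey {
    private final String x;
    private final String y;
    private final String z;
    private final int hash;

    GroupKey(String x, String y, String z) {
        this.x = x;
        this.y = y;
        this.z = z;
        this.hash = Objects.hash(x, y, z); // precomputed, fields never change
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GroupKey)) return false;
        GroupKey k = (GroupKey) o;
        return hash == k.hash
                && Objects.equals(x, k.x)
                && Objects.equals(y, k.y)
                && Objects.equals(z, k.z);
    }
}

/** Hypothetical stand-in for the values read from one matching document. */
final class Row {
    final String x;
    final String y;
    final String z;
    final long d;

    Row(String x, String y, String z, long d) {
        this.x = x;
        this.y = y;
        this.z = z;
        this.d = d;
    }
}

final class GroupBySum {
    /** Computes SUM(d) per (x, y, z) group in a single pass over the rows. */
    static Map<GroupKey, Long> sumByGroup(List<Row> rows) {
        Map<GroupKey, Long> sums = new HashMap<>();
        for (Row r : rows) {
            // Insert the value for a new key, or add it to the existing sum.
            sums.merge(new GroupKey(r.x, r.y, r.z), r.d, Long::sum);
        }
        return sums;
    }
}
```

The point of the sketch is that every group is aggregated in one pass over
the matching documents, instead of one lookup loop per facet value.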

Thanks,
Gimantha

On Mon, Aug 10, 2015 at 6:26 PM, Rob Audenaerde <[email protected]>
wrote:

> You can write a custom (facet) collector to do this. I have done something
> similar, I'll describe my approach:
>
> For all the values that need grouping or aggregating, I have added a
> FacetField (an AssociatedFacetField, so I can store the value alongside
> the ordinal). The main search stays the same, in your case for example a
> NumericRangeQuery (if the date is stored in ms).
>
> Then I have a custom facet collector that does the grouping.
>
> Basically, it goes through all the MatchingDocs. For each doc, it creates a
> unique key (composed of X, Y and Z) and updates the aggregates as needed
> (sum D). These are stored in a map. If a key is already in the map, the new
> value is added to the existing aggregate. The tricky part is to make your
> unique key fast and immutable, so you can precompute the hashcode.
>
> This is fast enough if the number of unique keys is smallish (< 10,000
> keys, index size roughly 1M docs).
>
> -Rob
>
>
> On Mon, Aug 10, 2015 at 2:47 PM, Michael McCandless <
> [email protected]> wrote:
>
> > Lucene has a grouping module that has several approaches for grouping
> > search hits, though it's only by a single field I believe.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Sun, Aug 9, 2015 at 2:55 PM, Gimantha Bandara <[email protected]>
> > wrote:
> > > Hi all,
> > >
> > > Is there a way to achieve $subject? For example, consider the following
> > > SQL query.
> > >
> > > SELECT A, B, C, SUM(D) as E FROM `table` WHERE time BETWEEN fromDate
> > > AND toDate *GROUP BY X,Y,Z*
> > >
> > > In the above query we can group the records by X, Y and Z. Is there a
> > > way to achieve the same in Lucene? (I guess faceting would help, but is
> > > it possible to get all the categoryPaths along with the matching
> > > records?) Is there any way other than using facets?
> > >
> > > --
> > > Gimantha Bandara
> > > Software Engineer
> > > WSO2. Inc : http://wso2.com
> > > Mobile : +94714961919



