Re: How best to workaround time filtering + aggregation

Cory Zue Thu, 16 Jun 2011 15:15:28 -0700

Hi Aroj,

Welcome to my least favorite thing in couch!


If you anticipate high volumes of data I'd rule out (1). Likewise list
functions (2), I don't think will help you unless I misunderstand how
they work. You're still going to have to load all that data into the
list function which I doubt saves you much overhead as an alternative
to (1).

3, 4, and 5 are all solid options. The right answer for you will
likely be a tradeoff between how much you can tailor your use case to
fit in a relatively simple view (5), versus the desire to do more
complex and freeform querying in the future (3, 4). As for the choice
between SQL and Lucene, it just depends how comfortable you are with
the two technologies. Lucene will likely be less work to setup. SQL
will likely be easier to query.

[plug] If you end up going with option 3 you may find the blog post I
just wrote about almost this exact use case useful:
http://www.dimagi.com/pulling-data-from-couchdb-to-a-relational-database-made-easy-with-_changes/

Cory

--
Cory L. Zue
Dimagi, Inc
http://www.dimagi.com/



On Thu, Jun 16, 2011 at 4:38 PM, Aroj George <[email protected]> wrote:
> Hi All,
>
> We have this requirement where we want to do aggregations over a set of time
> series data after we have filtered it by time, in response to the time range
> values input by a user.
>
> Size: 10 million documents.
> Aggregations to be applied: Sum, Count, Average, Max, Min, Latest etc..
> Grouped By: Clinic Id,Field Name
> *
> Doc Structure:*
> { clinic id: 10, data :  { beds : 10, meds: 20, doctors : 30 }, timestamp :
> T1 }
>
> *View Structure:*
> key: timestamp, clinic id, field name ( T1, 10, beds )
> value: field value (10)
>
> The above view will allow us to filter by a time range, but not group by
> clinic id + field name, because of the timestamp being the first part of the
> key.
>
> So the only option it seems is to do the time filtering in Couch and do the
> aggregations on the filtered set outside of Couch.
>
> *Options:*
>
> 1.  *Python*
> Load the filtered set in the client, (python) and use pythons's built in
> reduce functions like sum, count etc.
>
> Concerns: Loading a huge set of data into the python client may not be
> efficient.
>
> 2. *List functions*
> Use the Couch view to do time filter + do the aggregation in a list
> function.
> The key advantage here could be that everything is being done in Couch. This
> not what a list function is meant for,
> but given that aggregation via map/reduce is not possible in this case,
> could this be a reasonable solution?
>
> Concerns:
> Will the javascript list function be slower than even doing the aggregation
> in python?
> Can I switch to an erlang list function to speed things up?
>
> 3. *SQL*
> Load the data into a SQL table at regular intervals or using the _changes
> view, with the below columns,
> and use SQL's group by and sum,count functions to return the query results.
>
> Entity ID, Field Name, Field Value, Timestamp
>
> Select sum(field value) group by entity id, field name where timestamp > T1
> and timestamp < T2
>
> 4. *Lucene*
> I am not sure, but can something like Elastic Search (
> http://www.elasticsearch.org/) or Solr help here?
>
> 5. *Multiple Queries*
> Another option I could think of is, to change the view key to
>
> *[clinic id, field name, timestamp]* ie add the timestamp at the end.
>
> Now do a query for each clinic and field, like start_key = [clinic id, field
> name, T1] and end_key = [clinic id, field name, T2] and group = 2.
> This will give me the result, but I will have to do a query for (number of
> clinics * number of fields) times.
> Which again will be too inefficient because of that many HTTP calls. Unless
> I can make all multiple such queries in a single HTTP GET?
> Can I optimise this somehow?
>
> What do you guys think, which of the above is the best option? How best
> would you solve this use case?
>
> The question essentially being, what's the most efficient  and scalable way
> to perform real time aggregations on a set of data with time as filter
> criteria?
> The primary impact of the time filter being that we can't really do the
> aggregations using the map reduce views.
>
> Rgds,
> Aroj
>

Re: How best to workaround time filtering + aggregation

Reply via email to