[ https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525667#comment-17525667 ]
Andres de la Peña commented on CASSANDRA-11871: ----------------------------------------------- One last detail, we should probably add an entry on {{NEWS.txt}}. > Allow to aggregate by time intervals > ------------------------------------ > > Key: CASSANDRA-11871 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11871 > Project: Cassandra > Issue Type: Improvement > Components: Legacy/CQL > Reporter: Benjamin Lerer > Assignee: Benjamin Lerer > Priority: Normal > Fix For: 4.x > > Time Spent: 3h 20m > Remaining Estimate: 0h > > For time series data it can be usefull to aggregate by time intervals. > The idea would be to add support for one or several functions in the {{GROUP > BY}} clause. > Regarding the implementation, even if in general I also prefer to follow the > SQL syntax, I do not believe it will be a good fit for Cassandra. > If we have a table like: > {code} > CREATE TABLE trades > { > symbol text, > date date, > time time, > priceMantissa int, > priceExponent tinyint, > volume int, > PRIMARY KEY ((symbol, date), time) > }; > {code} > The trades will be inserted with an increasing time and sorted in the same > order. As we can have to process a large amount of data, we want to try to > limit ourself to the cases where we can build the groups on the flight (which > is not a requirement in the SQL world). > If we want to get the number of trades per minutes with the SQL syntax we > will have to write: > {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' > AND date = '2016-01-11' GROUP BY hour(time), minute(time);}} > which is fine. The problem is that if the user invert by mistake the > functions like that: > {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' > AND date = '2016-01-11' GROUP BY minute(time), hour(time);}} > the query will return weird results. > The only way to prevent that would be to check the function order and make > sure that we do not allow to skip functions (e.g. {{GROUP BY hour(time), > second(time)}}). > In my opinion a function like {{floor(<columnName>, <time range>)}} will be > much better as it does not allow for this type of mistakes and is much more > flexible (you can create 5 minutes buckets if you want to). > {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date = > '2016-01-11' GROUP BY floor(time, m);}} > An important aspect to keep in mind with a function like {{floor}} is the > starting point. For a query like: {{SELECT floor(time, m), count() FROM > Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' > AND time =< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the > result should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}. > -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org