[jira] [Updated] (CASSANDRA-11871) Allow to aggregate by time intervals

2022-04-22 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-11871:
---
Description: 
For time series data it can be usefull to aggregate by time intervals.

The idea would be to add support for one or several functions in the {{GROUP 
BY}} clause.

Regarding the implementation, even if in general I also prefer to follow the 
SQL syntax, I do not believe it will be a good fit for Cassandra.

If we have a table like:
{code}
CREATE TABLE trades
{
symbol text,
date date,
time time,
priceMantissa int,
priceExponent tinyint,
volume int,
PRIMARY KEY ((symbol, date), time)
};
{code}
The trades will be inserted with an increasing time and sorted in the same 
order. As we can have to process a large amount of data, we want to try to 
limit ourself to the cases where we can build the groups on the flight (which 
is not a requirement in the SQL world).

If we want to get the number of trades per minutes with the SQL syntax we will 
have to write:

{{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
which is fine. The problem is that if the user invert by mistake the functions 
like that:
{{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
the query will return weird results.
The only way to prevent that would be to check the function order and make sure 
that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
second(time)}}).

In my opinion a function like {{floor(, )}} will be 
much better as it does not allow for this type of mistakes and is much more 
flexible (you can create 5 minutes buckets if you want to).
{code}SELECT floor(time, m), count() FROM Trades 
 WHERE symbol = 'AAPL' AND date = '2016-01-11'
 GROUP BY floor(time, m);{code}

An important aspect to keep in mind with a function like {{floor}} is the 
starting point. For a query like:  {{SELECT floor(time, m), count() FROM Trades 
WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' AND time 
=< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the result 
should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
 

  was:
For time series data it can be usefull to aggregate by time intervals.

The idea would be to add support for one or several functions in the {{GROUP 
BY}} clause.

Regarding the implementation, even if in general I also prefer to follow the 
SQL syntax, I do not believe it will be a good fit for Cassandra.

If we have a table like:
{code}
CREATE TABLE trades
{
symbol text,
date date,
time time,
priceMantissa int,
priceExponent tinyint,
volume int,
PRIMARY KEY ((symbol, date), time)
};
{code}
The trades will be inserted with an increasing time and sorted in the same 
order. As we can have to process a large amount of data, we want to try to 
limit ourself to the cases where we can build the groups on the flight (which 
is not a requirement in the SQL world).

If we want to get the number of trades per minutes with the SQL syntax we will 
have to write:

{{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
which is fine. The problem is that if the user invert by mistake the functions 
like that:
{{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
the query will return weird results.
The only way to prevent that would be to check the function order and make sure 
that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
second(time)}}).

In my opinion a function like {{floor(, )}} will be 
much better as it does not allow for this type of mistakes and is much more 
flexible (you can create 5 minutes buckets if you want to).
{code}SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date 
= '2016-01-11' GROUP BY floor(time, m);{code}

An important aspect to keep in mind with a function like {{floor}} is the 
starting point. For a query like:  {{SELECT floor(time, m), count() FROM Trades 
WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' AND time 
=< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the result 
should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
 


> Allow to aggregate by time intervals
> 
>
> Key: CASSANDRA-11871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Legacy/CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>Priority: 

[jira] [Updated] (CASSANDRA-11871) Allow to aggregate by time intervals

2022-04-22 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-11871:
---
Description: 
For time series data it can be usefull to aggregate by time intervals.

The idea would be to add support for one or several functions in the {{GROUP 
BY}} clause.

Regarding the implementation, even if in general I also prefer to follow the 
SQL syntax, I do not believe it will be a good fit for Cassandra.

If we have a table like:
{code}
CREATE TABLE trades
{
symbol text,
date date,
time time,
priceMantissa int,
priceExponent tinyint,
volume int,
PRIMARY KEY ((symbol, date), time)
};
{code}
The trades will be inserted with an increasing time and sorted in the same 
order. As we can have to process a large amount of data, we want to try to 
limit ourself to the cases where we can build the groups on the flight (which 
is not a requirement in the SQL world).

If we want to get the number of trades per minutes with the SQL syntax we will 
have to write:

{{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
which is fine. The problem is that if the user invert by mistake the functions 
like that:
{{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
the query will return weird results.
The only way to prevent that would be to check the function order and make sure 
that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
second(time)}}).

In my opinion a function like {{floor(, )}} will be 
much better as it does not allow for this type of mistakes and is much more 
flexible (you can create 5 minutes buckets if you want to).
{code}
SELECT floor(time, m), count() FROM Trades 
WHERE symbol = 'AAPL' AND date = '2016-01-11'
GROUP BY floor(time, m);
{code}
An important aspect to keep in mind with a function like {{floor}} is the 
starting point. For a query like:  {{SELECT floor(time, m), count() FROM Trades 
WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' AND time 
=< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the result 
should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
 

  was:
For time series data it can be usefull to aggregate by time intervals.

The idea would be to add support for one or several functions in the {{GROUP 
BY}} clause.

Regarding the implementation, even if in general I also prefer to follow the 
SQL syntax, I do not believe it will be a good fit for Cassandra.

If we have a table like:
{code}
CREATE TABLE trades
{
symbol text,
date date,
time time,
priceMantissa int,
priceExponent tinyint,
volume int,
PRIMARY KEY ((symbol, date), time)
};
{code}
The trades will be inserted with an increasing time and sorted in the same 
order. As we can have to process a large amount of data, we want to try to 
limit ourself to the cases where we can build the groups on the flight (which 
is not a requirement in the SQL world).

If we want to get the number of trades per minutes with the SQL syntax we will 
have to write:

{{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
which is fine. The problem is that if the user invert by mistake the functions 
like that:
{{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
the query will return weird results.
The only way to prevent that would be to check the function order and make sure 
that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
second(time)}}).

In my opinion a function like {{floor(, )}} will be 
much better as it does not allow for this type of mistakes and is much more 
flexible (you can create 5 minutes buckets if you want to).
{code}SELECT floor(time, m), count() FROM Trades 
 WHERE symbol = 'AAPL' AND date = '2016-01-11'
 GROUP BY floor(time, m);{code}

An important aspect to keep in mind with a function like {{floor}} is the 
starting point. For a query like:  {{SELECT floor(time, m), count() FROM Trades 
WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' AND time 
=< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the result 
should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
 


> Allow to aggregate by time intervals
> 
>
> Key: CASSANDRA-11871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Legacy/CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>Priority: 

[jira] [Updated] (CASSANDRA-11871) Allow to aggregate by time intervals

2022-04-22 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-11871:
---
  Fix Version/s: 4.1
 (was: 4.x)
Source Control Link: 
https://github.com/apache/cassandra/commit/1ad8bf67a9c82cbb5ff38e5cf785f9fe2516d009
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

Patch committed into trunk at 1ad8bf67a9c82cbb5ff38e5cf785f9fe2516d009

> Allow to aggregate by time intervals
> 
>
> Key: CASSANDRA-11871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Legacy/CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 4.1
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> For time series data it can be usefull to aggregate by time intervals.
> The idea would be to add support for one or several functions in the {{GROUP 
> BY}} clause.
> Regarding the implementation, even if in general I also prefer to follow the 
> SQL syntax, I do not believe it will be a good fit for Cassandra.
> If we have a table like:
> {code}
> CREATE TABLE trades
> {
> symbol text,
> date date,
> time time,
> priceMantissa int,
> priceExponent tinyint,
> volume int,
> PRIMARY KEY ((symbol, date), time)
> };
> {code}
> The trades will be inserted with an increasing time and sorted in the same 
> order. As we can have to process a large amount of data, we want to try to 
> limit ourself to the cases where we can build the groups on the flight (which 
> is not a requirement in the SQL world).
> If we want to get the number of trades per minutes with the SQL syntax we 
> will have to write:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
> which is fine. The problem is that if the user invert by mistake the 
> functions like that:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
> the query will return weird results.
> The only way to prevent that would be to check the function order and make 
> sure that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
> second(time)}}).
> In my opinion a function like {{floor(, )}} will be 
> much better as it does not allow for this type of mistakes and is much more 
> flexible (you can create 5 minutes buckets if you want to).
> {code}SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND 
> date = '2016-01-11' GROUP BY floor(time, m);{code}
> An important aspect to keep in mind with a function like {{floor}} is the 
> starting point. For a query like:  {{SELECT floor(time, m), count() FROM 
> Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' 
> AND time =< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the 
> result should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-11871) Allow to aggregate by time intervals

2022-04-22 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-11871:
---
Status: Ready to Commit  (was: Review In Progress)

> Allow to aggregate by time intervals
> 
>
> Key: CASSANDRA-11871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Legacy/CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 4.x
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> For time series data it can be usefull to aggregate by time intervals.
> The idea would be to add support for one or several functions in the {{GROUP 
> BY}} clause.
> Regarding the implementation, even if in general I also prefer to follow the 
> SQL syntax, I do not believe it will be a good fit for Cassandra.
> If we have a table like:
> {code}
> CREATE TABLE trades
> {
> symbol text,
> date date,
> time time,
> priceMantissa int,
> priceExponent tinyint,
> volume int,
> PRIMARY KEY ((symbol, date), time)
> };
> {code}
> The trades will be inserted with an increasing time and sorted in the same 
> order. As we can have to process a large amount of data, we want to try to 
> limit ourself to the cases where we can build the groups on the flight (which 
> is not a requirement in the SQL world).
> If we want to get the number of trades per minutes with the SQL syntax we 
> will have to write:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
> which is fine. The problem is that if the user invert by mistake the 
> functions like that:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
> the query will return weird results.
> The only way to prevent that would be to check the function order and make 
> sure that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
> second(time)}}).
> In my opinion a function like {{floor(, )}} will be 
> much better as it does not allow for this type of mistakes and is much more 
> flexible (you can create 5 minutes buckets if you want to).
> {code}SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND 
> date = '2016-01-11' GROUP BY floor(time, m);{code}
> An important aspect to keep in mind with a function like {{floor}} is the 
> starting point. For a query like:  {{SELECT floor(time, m), count() FROM 
> Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' 
> AND time =< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the 
> result should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-11871) Allow to aggregate by time intervals

2022-04-22 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-11871:
---
Description: 
For time series data it can be usefull to aggregate by time intervals.

The idea would be to add support for one or several functions in the {{GROUP 
BY}} clause.

Regarding the implementation, even if in general I also prefer to follow the 
SQL syntax, I do not believe it will be a good fit for Cassandra.

If we have a table like:
{code}
CREATE TABLE trades
{
symbol text,
date date,
time time,
priceMantissa int,
priceExponent tinyint,
volume int,
PRIMARY KEY ((symbol, date), time)
};
{code}
The trades will be inserted with an increasing time and sorted in the same 
order. As we can have to process a large amount of data, we want to try to 
limit ourself to the cases where we can build the groups on the flight (which 
is not a requirement in the SQL world).

If we want to get the number of trades per minutes with the SQL syntax we will 
have to write:

{{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
which is fine. The problem is that if the user invert by mistake the functions 
like that:
{{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
the query will return weird results.
The only way to prevent that would be to check the function order and make sure 
that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
second(time)}}).

In my opinion a function like {{floor(, )}} will be 
much better as it does not allow for this type of mistakes and is much more 
flexible (you can create 5 minutes buckets if you want to).
{code}SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date 
= '2016-01-11' GROUP BY floor(time, m);{code}

An important aspect to keep in mind with a function like {{floor}} is the 
starting point. For a query like:  {{SELECT floor(time, m), count() FROM Trades 
WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' AND time 
=< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the result 
should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
 

  was:
For time series data it can be usefull to aggregate by time intervals.

The idea would be to add support for one or several functions in the {{GROUP 
BY}} clause.

Regarding the implementation, even if in general I also prefer to follow the 
SQL syntax, I do not believe it will be a good fit for Cassandra.

If we have a table like:
{code}
CREATE TABLE trades
{
symbol text,
date date,
time time,
priceMantissa int,
priceExponent tinyint,
volume int,
PRIMARY KEY ((symbol, date), time)
};
{code}
The trades will be inserted with an increasing time and sorted in the same 
order. As we can have to process a large amount of data, we want to try to 
limit ourself to the cases where we can build the groups on the flight (which 
is not a requirement in the SQL world).

If we want to get the number of trades per minutes with the SQL syntax we will 
have to write:

{{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
which is fine. The problem is that if the user invert by mistake the functions 
like that:
{{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
the query will return weird results.
The only way to prevent that would be to check the function order and make sure 
that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
second(time)}}).

In my opinion a function like {{floor(, )}} will be 
much better as it does not allow for this type of mistakes and is much more 
flexible (you can create 5 minutes buckets if you want to).
{{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date = 
'2016-01-11' GROUP BY floor(time, m);}}

An important aspect to keep in mind with a function like {{floor}} is the 
starting point. For a query like:  {{SELECT floor(time, m), count() FROM Trades 
WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' AND time 
=< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the result 
should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
 


> Allow to aggregate by time intervals
> 
>
> Key: CASSANDRA-11871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Legacy/CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 

[jira] [Updated] (CASSANDRA-11871) Allow to aggregate by time intervals

2022-04-21 Thread Yifan Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Cai updated CASSANDRA-11871:
--
Reviewers: Andres de la Peña, Yifan Cai  (was: Andres de la Peña)
   Status: Review In Progress  (was: Patch Available)

> Allow to aggregate by time intervals
> 
>
> Key: CASSANDRA-11871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Legacy/CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 4.x
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> For time series data it can be usefull to aggregate by time intervals.
> The idea would be to add support for one or several functions in the {{GROUP 
> BY}} clause.
> Regarding the implementation, even if in general I also prefer to follow the 
> SQL syntax, I do not believe it will be a good fit for Cassandra.
> If we have a table like:
> {code}
> CREATE TABLE trades
> {
> symbol text,
> date date,
> time time,
> priceMantissa int,
> priceExponent tinyint,
> volume int,
> PRIMARY KEY ((symbol, date), time)
> };
> {code}
> The trades will be inserted with an increasing time and sorted in the same 
> order. As we can have to process a large amount of data, we want to try to 
> limit ourself to the cases where we can build the groups on the flight (which 
> is not a requirement in the SQL world).
> If we want to get the number of trades per minutes with the SQL syntax we 
> will have to write:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
> which is fine. The problem is that if the user invert by mistake the 
> functions like that:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
> the query will return weird results.
> The only way to prevent that would be to check the function order and make 
> sure that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
> second(time)}}).
> In my opinion a function like {{floor(, )}} will be 
> much better as it does not allow for this type of mistakes and is much more 
> flexible (you can create 5 minutes buckets if you want to).
> {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date = 
> '2016-01-11' GROUP BY floor(time, m);}}
> An important aspect to keep in mind with a function like {{floor}} is the 
> starting point. For a query like:  {{SELECT floor(time, m), count() FROM 
> Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' 
> AND time =< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the 
> result should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-11871) Allow to aggregate by time intervals

2022-03-23 Thread Jira


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andres de la Peña updated CASSANDRA-11871:
--
Reviewers: Andres de la Peña

> Allow to aggregate by time intervals
> 
>
> Key: CASSANDRA-11871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Legacy/CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 4.x
>
>
> For time series data it can be usefull to aggregate by time intervals.
> The idea would be to add support for one or several functions in the {{GROUP 
> BY}} clause.
> Regarding the implementation, even if in general I also prefer to follow the 
> SQL syntax, I do not believe it will be a good fit for Cassandra.
> If we have a table like:
> {code}
> CREATE TABLE trades
> {
> symbol text,
> date date,
> time time,
> priceMantissa int,
> priceExponent tinyint,
> volume int,
> PRIMARY KEY ((symbol, date), time)
> };
> {code}
> The trades will be inserted with an increasing time and sorted in the same 
> order. As we can have to process a large amount of data, we want to try to 
> limit ourself to the cases where we can build the groups on the flight (which 
> is not a requirement in the SQL world).
> If we want to get the number of trades per minutes with the SQL syntax we 
> will have to write:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
> which is fine. The problem is that if the user invert by mistake the 
> functions like that:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
> the query will return weird results.
> The only way to prevent that would be to check the function order and make 
> sure that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
> second(time)}}).
> In my opinion a function like {{floor(, )}} will be 
> much better as it does not allow for this type of mistakes and is much more 
> flexible (you can create 5 minutes buckets if you want to).
> {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date = 
> '2016-01-11' GROUP BY floor(time, m);}}
> An important aspect to keep in mind with a function like {{floor}} is the 
> starting point. For a query like:  {{SELECT floor(time, m), count() FROM 
> Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' 
> AND time =< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the 
> result should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-11871) Allow to aggregate by time intervals

2022-03-22 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-11871:
---
Test and Documentation Plan: The patch add new unit tests and DTests 
 Status: Patch Available  (was: In Progress)

> Allow to aggregate by time intervals
> 
>
> Key: CASSANDRA-11871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Legacy/CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 4.x
>
>
> For time series data it can be usefull to aggregate by time intervals.
> The idea would be to add support for one or several functions in the {{GROUP 
> BY}} clause.
> Regarding the implementation, even if in general I also prefer to follow the 
> SQL syntax, I do not believe it will be a good fit for Cassandra.
> If we have a table like:
> {code}
> CREATE TABLE trades
> {
> symbol text,
> date date,
> time time,
> priceMantissa int,
> priceExponent tinyint,
> volume int,
> PRIMARY KEY ((symbol, date), time)
> };
> {code}
> The trades will be inserted with an increasing time and sorted in the same 
> order. As we can have to process a large amount of data, we want to try to 
> limit ourself to the cases where we can build the groups on the flight (which 
> is not a requirement in the SQL world).
> If we want to get the number of trades per minutes with the SQL syntax we 
> will have to write:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY hour(time), minute(time);}}
> which is fine. The problem is that if the user invert by mistake the 
> functions like that:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' 
> AND date = '2016-01-11' GROUP BY minute(time), hour(time);}}
> the query will return weird results.
> The only way to prevent that would be to check the function order and make 
> sure that we do not allow to skip functions (e.g. {{GROUP BY hour(time), 
> second(time)}}).
> In my opinion a function like {{floor(, )}} will be 
> much better as it does not allow for this type of mistakes and is much more 
> flexible (you can create 5 minutes buckets if you want to).
> {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date = 
> '2016-01-11' GROUP BY floor(time, m);}}
> An important aspect to keep in mind with a function like {{floor}} is the 
> starting point. For a query like:  {{SELECT floor(time, m), count() FROM 
> Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' 
> AND time =< '07:30:00' GROUP BY floor(time, 2h);}}, I think that ideally the 
> result should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{05:30:00}}.  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org