Re: Storing bi-temporal data in Cassandra
"I am trying to get the state as of a particular transaction_time" -- In that case you should probably define your primary key in another order for the clustering columns:

  PRIMARY KEY (weatherstation_id, transaction_time, event_time)

Then:

  select * from temperatures
  where weatherstation_id = 'foo'
    and event_time >= '2015-01-01 00:00:00'
    and event_time < '2015-01-02 00:00:00'
    and transaction_time = '';

On Sat, Feb 14, 2015 at 3:06 AM, Raj N <raj.cassan...@gmail.com> wrote:

  Has anyone designed a bi-temporal table in Cassandra? It doesn't look like I can do this using CQL for now. Taking the time series example from well-known Cassandra modeling tutorials:

    CREATE TABLE temperatures (
      weatherstation_id text,
      event_time timestamp,
      temperature text,
      PRIMARY KEY (weatherstation_id, event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC);

  If I add another column, transaction_time:

    CREATE TABLE temperatures (
      weatherstation_id text,
      event_time timestamp,
      transaction_time timestamp,
      temperature text,
      PRIMARY KEY (weatherstation_id, event_time, transaction_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);

  and then try to run the following CQL, it throws an error:

    select * from temperatures
    where weatherstation_id = 'foo'
      and event_time >= '2015-01-01 00:00:00'
      and event_time < '2015-01-02 00:00:00'
      and transaction_time < '2015-01-02 00:00:00';

  It works if I use an equals clause for event_time. I am trying to get the state as of a particular transaction_time.

  -Raj
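Spelled out, the suggested reordering might look like the sketch below. The table name `temperatures_by_txn` and the literal timestamps are illustrative only; the point is that with an equality on `transaction_time` (now the first clustering column), a range on `event_time` becomes a single contiguous slice that CQL accepts:

```sql
-- Sketch: transaction_time moved ahead of event_time in the clustering order
CREATE TABLE temperatures_by_txn (
  weatherstation_id text,
  transaction_time timestamp,
  event_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, transaction_time, event_time)
) WITH CLUSTERING ORDER BY (transaction_time DESC, event_time DESC);

-- Equality on the first clustering column, range on the last:
-- this is a contiguous slice within the partition, so it is allowed.
SELECT * FROM temperatures_by_txn
WHERE weatherstation_id = 'foo'
  AND transaction_time = '2015-01-01 12:00:00'
  AND event_time >= '2015-01-01 00:00:00'
  AND event_time < '2015-01-02 00:00:00';
```

The trade-off is that this layout answers "all events recorded at exactly this transaction_time", not "the latest state as of a transaction_time", which is why the thread continues below.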
Re: How to speed up SELECT * query in Cassandra
Jirka,

"But I am really interested in how it can work well with Spark/Hadoop, where you basically need to read all the data as well (as far as I understand it)."

I can't give you any benchmarking between technologies (nor am I particularly interested in getting involved in such a discussion), but I can share our experiences with Cassandra, Hadoop, and Spark over the past 4+ years, and hopefully assure you that Cassandra+Spark is a smart choice.

On a four-node cluster we were running 5000+ small Hadoop jobs each day, each finishing within two minutes, often within one, resulting in (give or take) a billion records read and 150 million records written from and to C*. These small jobs incrementally process a limited set of partition keys each time. They primarily read from a raw events store that has a TTL of 3 months and 22+ GB of tombstones a day (reads over old partition keys are rare). We also run full-table-scan jobs and have never come across any issues particular to that. There are Hadoop map/reduce settings to increase durability if you have tables with troublesome partition keys. This is also a cluster that serves requests to web applications that need low latency.

We recently wrote a Spark job that does full table scans over 100 million+ rows, involves a handful of stages (two tables, 9 maps, 4 reduces, and 2 joins), and writes 5 million rows back to a new table. This job runs in ~260 seconds.

Spark is becoming a natural complement to schema evolution for Cassandra, something you'll want to do to keep your schema optimised against your read request patterns, even for little things like switching clustering keys around. With any new technology, hitting some hurdles (especially if you go wandering outside recommended practices) will of course be part of the game, but that said, I've only had positive experiences with this community's ability to help out (and do so quickly).
Starting from scratch I'd use Spark (on Scala) over Hadoop, no questions asked. Otherwise, Cassandra has always been our 'big data' platform; Hadoop/Spark is just an extra tool on top. We've never kept data in HDFS and are very grateful for having made that choice.

~mck

ref https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/
Re: Storing bi-temporal data in Cassandra
I don't think that solves my problem. The question really is: why can't we use ranges for both time columns when they are part of the primary key? They are in one row after all. Is this just a CQL limitation?

-Raj

On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan <doanduy...@gmail.com> wrote:

  "I am trying to get the state as of a particular transaction_time" -- In that case you should probably define your primary key in another order for the clustering columns:

    PRIMARY KEY (weatherstation_id, transaction_time, event_time)

  Then:

    select * from temperatures
    where weatherstation_id = 'foo'
      and event_time >= '2015-01-01 00:00:00'
      and event_time < '2015-01-02 00:00:00'
      and transaction_time = '';

  On Sat, Feb 14, 2015 at 3:06 AM, Raj N <raj.cassan...@gmail.com> wrote:

    Has anyone designed a bi-temporal table in Cassandra? It doesn't look like I can do this using CQL for now. Taking the time series example from well-known Cassandra modeling tutorials:

      CREATE TABLE temperatures (
        weatherstation_id text,
        event_time timestamp,
        temperature text,
        PRIMARY KEY (weatherstation_id, event_time)
      ) WITH CLUSTERING ORDER BY (event_time DESC);

    If I add another column, transaction_time:

      CREATE TABLE temperatures (
        weatherstation_id text,
        event_time timestamp,
        transaction_time timestamp,
        temperature text,
        PRIMARY KEY (weatherstation_id, event_time, transaction_time)
      ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);

    and then try to run the following CQL, it throws an error:

      select * from temperatures
      where weatherstation_id = 'foo'
        and event_time >= '2015-01-01 00:00:00'
        and event_time < '2015-01-02 00:00:00'
        and transaction_time < '2015-01-02 00:00:00';

    It works if I use an equals clause for event_time. I am trying to get the state as of a particular transaction_time.

    -Raj
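For what it's worth, it is largely a CQL restriction: within a partition, only the last restricted clustering column may carry a range, because the result must be one contiguous slice of the row. Since Cassandra 2.0.6, CQL does offer multi-column (tuple) relations on clustering columns, which express a single lexicographic slice across both columns at once; a sketch of the syntax against the schema above (timestamps illustrative):

```sql
-- A tuple relation compares (event_time, transaction_time) pairs
-- lexicographically: one combined slice, NOT two independent ranges.
SELECT * FROM temperatures
WHERE weatherstation_id = 'foo'
  AND (event_time, transaction_time) >= ('2015-01-01 00:00:00', '2015-01-01 00:00:00')
  AND (event_time, transaction_time) <  ('2015-01-02 00:00:00', '2015-01-02 00:00:00');
```

Note this is not the same predicate as `event_time` in one range AND `transaction_time` in another: rows with an in-range `event_time` but an out-of-range `transaction_time` can still match, because only the pair as a whole is bounded. Truly independent ranges on two clustering columns still require filtering client-side or in Spark/Hadoop.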