Re: Storing bi-temporal data in Cassandra

2015-02-14 Thread DuyHai Doan
I am trying to get the state as of a particular transaction_time

 -- In that case you should probably define the clustering columns of your
primary key in a different order

PRIMARY KEY (weatherstation_id,transaction_time,event_time)

Then, select * from temperatures where weatherstation_id = 'foo' and
event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
00:00:00' and transaction_time = ''
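The reason the reorder helps: in CQL (as of the 2.x line), a range restriction is only allowed on the last restricted clustering column, and every clustering column before it must be restricted by equality. A minimal Python sketch of that rule (the validator is illustrative, not Cassandra's actual code):

```python
def valid_where(clustering_cols, restrictions):
    """Mimic CQL's clustering restriction rule: columns must be restricted
    in declaration order, with equality on every column that precedes a
    range-restricted one. `restrictions` maps column -> 'eq' or 'range'."""
    seen_range = False
    for col in clustering_cols:
        r = restrictions.get(col)
        if r is None:
            break  # remaining columns must be unrestricted
        if seen_range:
            return False  # nothing may follow a range restriction
        if r == 'range':
            seen_range = True
    # restricted columns must form a prefix of the clustering columns
    restricted = [c for c in clustering_cols if c in restrictions]
    return restricted == clustering_cols[:len(restricted)]

# Original key order: ranges on both clustering columns are rejected
print(valid_where(['event_time', 'transaction_time'],
                  {'event_time': 'range', 'transaction_time': 'range'}))  # False

# Reordered key: equality on transaction_time, range on event_time is fine
print(valid_where(['transaction_time', 'event_time'],
                  {'transaction_time': 'eq', 'event_time': 'range'}))  # True
```

This is why putting transaction_time first only works when you can pin it with an equality clause.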



On Sat, Feb 14, 2015 at 3:06 AM, Raj N raj.cassan...@gmail.com wrote:

 Has anyone designed a bi-temporal table in Cassandra? Doesn't look like I
 can do this using CQL for now. Taking the time series example from well
 known modeling tutorials in Cassandra -

 CREATE TABLE temperatures (
 weatherstation_id text,
 event_time timestamp,
 temperature text,
 PRIMARY KEY (weatherstation_id, event_time)
 ) WITH CLUSTERING ORDER BY (event_time DESC);

 If I add another column transaction_time

 CREATE TABLE temperatures (
 weatherstation_id text,
 event_time timestamp,
 transaction_time timestamp,
 temperature text,
 PRIMARY KEY (weatherstation_id, event_time, transaction_time)
 ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);

 If I try to run a query using the following CQL, it throws an error -

 select * from temperatures where weatherstation_id = 'foo' and event_time
 >= '2015-01-01 00:00:00' and event_time < '2015-01-02 00:00:00' and
 transaction_time < '2015-01-02 00:00:00'

 It works if I use an equals clause for the event_time. I am trying to get
 the state as of a particular transaction_time.

 -Raj
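For reference, the "state as of a particular transaction_time" that a bi-temporal query computes can be sketched in plain Python over in-memory rows (the rows and values here are illustrative, not from a real station):

```python
from datetime import datetime

# (event_time, transaction_time, temperature) rows for one weather station
rows = [
    (datetime(2015, 1, 1, 0), datetime(2015, 1, 1, 1), "10C"),   # first recorded
    (datetime(2015, 1, 1, 0), datetime(2015, 1, 3, 0), "12C"),   # later correction
    (datetime(2015, 1, 1, 12), datetime(2015, 1, 1, 13), "11C"),
]

def as_of(rows, event_start, event_end, tx_time):
    """For each event_time in [event_start, event_end), return the latest
    value whose transaction_time is <= tx_time, i.e. the state of the
    data as it was known at tx_time."""
    latest = {}
    for ev, tx, temp in rows:
        if event_start <= ev < event_end and tx <= tx_time:
            if ev not in latest or tx > latest[ev][0]:
                latest[ev] = (tx, temp)
    return {ev: temp for ev, (tx, temp) in latest.items()}

# Queried as of Jan 2, the Jan 3 correction is invisible: we still see "10C"
print(as_of(rows, datetime(2015, 1, 1), datetime(2015, 1, 2),
            datetime(2015, 1, 2)))
```

This double filter, a range on event_time plus a cutoff on transaction_time, is exactly the shape of query the schema above cannot express in a single CQL WHERE clause.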



Re: How to speed up SELECT * query in Cassandra

2015-02-14 Thread mck
Jirka,

 But I am really interested in how it can work well with Spark/Hadoop, where
 you basically need to read all the data as well (as far as I understand
 it).


I can't give you any benchmarking between technologies (nor am I
particularly interested in getting involved in such a discussion), but I
can share our experiences with Cassandra, Hadoop, and Spark over the
past 4+ years, and hopefully assure you that Cassandra+Spark is a smart
choice.

On a four node cluster we were running 5000+ small Hadoop jobs each day,
each finishing within two minutes, often within one, resulting in (give
or take) a billion records read from, and 150 million records written
to, C*.
These small jobs incrementally process a limited set of partition keys
each time. They primarily read from a raw events store that has a TTL of
3 months and generates 22+ GB of tombstones a day (reads over old
partition keys are rare).

We also run full-table-scan jobs and have never come across any issues
particular to that. There are hadoop map/reduce settings to increase
durability if you have tables with troublesome partition keys.

This is also a cluster that serves requests to web applications that
need low latency.

We recently wrote a Spark job that does full table scans over 100
million+ rows, involves a handful of stages (two tables, 9 maps, 4
reduces, and 2 joins), and writes 5 million rows back to a new table.
This job runs in ~260 seconds.

Spark is becoming a natural complement to schema evolution for
Cassandra, something you'll want to do to keep your schema optimised
against your read request patterns, even for little things like
switching clustering keys around.

With any new technology, hitting some hurdles (especially if you go
wandering outside recommended practices) will of course be part of the
game, but that said I've only had positive experiences with this
community's ability to help out (and do so quickly).

Starting from scratch I'd use Spark (on Scala) over Hadoop, no questions
asked.
Otherwise, Cassandra has always been our 'big data' platform;
Hadoop/Spark is just an extra tool on top.
We've never kept data in HDFS and are very grateful for having made that
choice.

~mck

ref
https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/


Re: Storing bi-temporal data in Cassandra

2015-02-14 Thread Raj N
I don't think that solves my problem. The question really is why we can't
use ranges for both time columns when they are part of the primary key.
They are in one row after all. Is this just a CQL limitation?
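Until CQL can express a second range restriction, one workaround is to range-restrict only event_time server-side and filter transaction_time in the application. A client-side sketch (the driver call is omitted; rows are plain tuples and all names are illustrative):

```python
from datetime import datetime

def filter_transaction_time(fetched_rows, tx_cutoff):
    """fetched_rows: (event_time, transaction_time, temperature) tuples
    already narrowed by the server-side event_time range restriction;
    drop anything recorded at or after tx_cutoff."""
    return [r for r in fetched_rows if r[1] < tx_cutoff]

# Rows as they might come back from the event_time-range query
fetched = [
    (datetime(2015, 1, 1, 0), datetime(2015, 1, 1, 2), "10C"),
    (datetime(2015, 1, 1, 6), datetime(2015, 1, 5, 0), "13C"),  # recorded too late
]

# Keeps only the row known before Jan 2
print(filter_transaction_time(fetched, datetime(2015, 1, 2)))
```

The trade-off is that rows outside the transaction_time cutoff are still read off disk and shipped to the client, so it only works well when the event_time range keeps the fetched set small.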

-Raj

On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan doanduy...@gmail.com wrote:

 I am trying to get the state as of a particular transaction_time

  -- In that case you should probably define the clustering columns of your
 primary key in a different order

 PRIMARY KEY (weatherstation_id,transaction_time,event_time)

 Then, select * from temperatures where weatherstation_id = 'foo' and
 event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
 00:00:00' and transaction_time = ''



 On Sat, Feb 14, 2015 at 3:06 AM, Raj N raj.cassan...@gmail.com wrote:

 Has anyone designed a bi-temporal table in Cassandra? Doesn't look like I
 can do this using CQL for now. Taking the time series example from well
 known modeling tutorials in Cassandra -

 CREATE TABLE temperatures (
 weatherstation_id text,
 event_time timestamp,
 temperature text,
 PRIMARY KEY (weatherstation_id, event_time)
 ) WITH CLUSTERING ORDER BY (event_time DESC);

 If I add another column transaction_time

 CREATE TABLE temperatures (
 weatherstation_id text,
 event_time timestamp,
 transaction_time timestamp,
 temperature text,
 PRIMARY KEY (weatherstation_id, event_time, transaction_time)
 ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);

 If I try to run a query using the following CQL, it throws an error -

 select * from temperatures where weatherstation_id = 'foo' and event_time
 >= '2015-01-01 00:00:00' and event_time < '2015-01-02 00:00:00' and
 transaction_time < '2015-01-02 00:00:00'

 It works if I use an equals clause for the event_time. I am trying to get
 the state as of a particular transaction_time.

 -Raj