Avoiding Data Duplication

Abhishek Singh Bailoo Fri, 05 Jun 2015 14:11:04 -0700

Hello!

I have a column family to log in data coming from my GPS devices.


CREATE TABLE log(
  imei ascii,
  date ascii,
  dtime timestamp,
  data ascii,
  stime timestamp,
  PRIMARY KEY ((imei, date), dtime))
  WITH CLUSTERING ORDER BY (dtime DESC)
;

It is the standard schema for modeling time series data where
imei is the unique ID associated with each GPS device
date is the date taken from dtime
dtime is the date-time coming from the device
data is all the latitude, longitude etc that the device is sending us
stime is the date-time stamp of the server

The reason why I put dtime in the primary key as the clustering column is
because most of our queries are done on device time. There can be a delay
of a few minutes to a few hours (or a few days! in rare cases) between
dtime and stime if the network is not available.

However, now we want to query on server time as well for the purpose of
debugging. These queries will be not as common as queries on  device time.
Say for every 100 queries on dtime there will be just 1 query on stime.

What options do I have?

1. Seconday Index - not possible because stime is a timestamp and CQL does
not allow me to put < or > in the query for secondary index

2. Data duplication - I can build another column family where I will index
by stime but that means I am storing twice as much data. I know everyone
says that write operations are cheap and storage is cheap but how? If I
have to buy twice as many machines on AWS EC2 each with their own ephemeral
storage, then my bill doubles up!

Any other ideas I can try?

Many Thanks,
Abhishek

Avoiding Data Duplication

Reply via email to