Hello! I have a column family to log in data coming from my GPS devices.
CREATE TABLE log( imei ascii, date ascii, dtime timestamp, data ascii, stime timestamp, PRIMARY KEY ((imei, date), dtime)) WITH CLUSTERING ORDER BY (dtime DESC) ; It is the standard schema for modeling time series data where imei is the unique ID associated with each GPS device date is the date taken from dtime dtime is the date-time coming from the device data is all the latitude, longitude etc that the device is sending us stime is the date-time stamp of the server The reason why I put dtime in the primary key as the clustering column is because most of our queries are done on device time. There can be a delay of a few minutes to a few hours (or a few days! in rare cases) between dtime and stime if the network is not available. However, now we want to query on server time as well for the purpose of debugging. These queries will be not as common as queries on device time. Say for every 100 queries on dtime there will be just 1 query on stime. What options do I have? 1. Seconday Index - not possible because stime is a timestamp and CQL does not allow me to put < or > in the query for secondary index 2. Data duplication - I can build another column family where I will index by stime but that means I am storing twice as much data. I know everyone says that write operations are cheap and storage is cheap but how? If I have to buy twice as many machines on AWS EC2 each with their own ephemeral storage, then my bill doubles up! Any other ideas I can try? Many Thanks, Abhishek