On 2017-07-26 05:15 (-0700), Junaid Nasir <jna...@an10.io> wrote: > I have a C* cluster (3 nodes) with some 60gb data (replication factor 2). > when I started using C* coming from SQL background didn't give much thought > about modeling the data correctly. so what I did was > > CREATE TABLE data ( deviceId int, > time timestamp, > field1 text, > filed2 text, > field3 text, > PRIMARY KEY(deviceId, time)) WITH CLUSTERING ORDER > BY (time ASC); > > but most of the queries I run (using spark and datastax connector) compares > data of different devices for some time period. for example > > SELECT * FROM data WHERE time > '2017-07-01 12:00:00'; > > from my understanding this runs a full table scan. as shown in spark UI > (from DAG visualization "Scan > org.apache.spark.sql.cassandra.CassandraSourceRelation@32bb7d65") meaning > C* will read all the data and then filter for time. Spark jobs runs for > hours even for smaller time frames. > > what is the right approach for data modeling for such queries?. I want to > get a general idea of things to look for when modeling such data. > really appreciate all the help from this community :). if you need any > extra details please ask me here. >
The right approach for modeling all queries in Cassandra is to start with the SELECTs you'll want to do, and then build a table around it. If your typical behavior is to query all of the data for a time window, you probably want your partition keys to be time windows. For a larger cluster, this would give you hotspots (so you may want to rethink it if you grow significantly), but for your size where you're already using a majority of the cluster for every write, it shouldn't be a big deal. That, then, would give you a table like: > CREATE TABLE data ( deviceId int, > time timestamp, > field1 text, > filed2 text, > field3 text, > timeBucket text, > PRIMARY KEY(timeBucket, deviceId, time) WITH CLUSTERING > ORDER > BY (deviceId ASC, time ASC); Where timeBucket is some date-like string like "2017-07-26-01:00:00", for all entries in the first hour of Jul 26 2017. This gives you: 1) a way to query by time (primary use case), and 2) a way to query by ID if needed (though you'll need to issue a query for each time bucket and aggregate it client side). If 2 is insufficient, you would denormalize and create a second table. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For additional commands, e-mail: user-h...@cassandra.apache.org