On 2017-07-26 05:15 (-0700), Junaid Nasir <jna...@an10.io> wrote: 
> I have a C* cluster (3 nodes) with some 60gb data (replication factor 2).
> when I started using C* coming from SQL background didn't give much thought
> about modeling the data correctly. so what I did was
> 
> CREATE TABLE data ( deviceId int,
>                     time timestamp,
>                     field1 text,
>                     filed2 text,
>                     field3 text,
>                     PRIMARY KEY(deviceId, time)) WITH CLUSTERING ORDER
> BY (time ASC);
> 
> but most of the queries I run (using spark and datastax connector) compares
> data of different devices for some time period. for example
> 
> SELECT * FROM data WHERE time > '2017-07-01 12:00:00';
> 
> from my understanding this runs a full table scan. as shown in spark UI
> (from DAG visualization "Scan
> org.apache.spark.sql.cassandra.CassandraSourceRelation@32bb7d65") meaning
> C* will read all the data and then filter for time. Spark jobs runs for
> hours even for smaller time frames.
> 
> what is the right approach for data modeling for such queries?. I want to
> get a general idea of things to look for when modeling such data.
> really appreciate all the help from this community :). if you need any
> extra details please ask me here.
> 

The right approach for modeling all queries in Cassandra is to start with the 
SELECTs you'll want to do, and then build a table around it.

If your typical behavior is to query all of the data for a time window, you 
probably want your partition keys to be time windows. For a larger cluster, 
this would give you hotspots (so you may want to rethink it if you grow 
significantly), but for your size where you're already using a majority of the 
cluster for every write, it shouldn't be a big deal.

That, then, would give you a table like:

> CREATE TABLE data ( deviceId int,
>                     time timestamp,
>                     field1 text,
>                     filed2 text,
>                     field3 text,
>                     timeBucket text,
>                     PRIMARY KEY(timeBucket, deviceId, time) WITH CLUSTERING 
> ORDER
> BY (deviceId ASC, time ASC);

Where timeBucket is some date-like string like "2017-07-26-01:00:00", for all 
entries in the first hour of Jul 26 2017.

This gives you: 1) a way to query by time (primary use case), and 2) a way to 
query by ID if needed (though you'll need to issue a query for each time bucket 
and aggregate it client side). If 2 is insufficient, you would denormalize and 
create a second table.





---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org

Reply via email to