Jack,
Thanks for the response. I don't think I provided enough information and used 
the wrong terminology as your response is more the canned advice is response to 
Cassandra antipatterns.
To make this clearer, this is what we are doing:
create table sensorReadings (sensorUnitId int,
sensorId int,time timestamp,timeShard int,
readings blob,primary key((sensorUnitId, sensorId, timeShard), time);
where timeShard is a combination of year and week of year
For known time range based queries, this works great. However, the specific 
problem is in knowing the maximum and minimum timeShard values when we want to 
select the entire range of data. Our understanding is that if we update another 
related table with the maximum and minimum timeShard value for a given 
sensorUnitId and sensorId combination, we will create a hotspot and lots of 
tombstones. If we SELECT DISTINCT, we get a huge list of partition keys for the 
table because we cannot reduce the scope with a where clause.

If there is a recommended pattern that solves this, we haven't come across it.

I hope makes the problem clearer.
Thanks,
Jason

      From: Jack Krupansky <jack.krupan...@gmail.com>
 To: user@cassandra.apache.org; Jason Kania <jason.ka...@ymail.com> 
 Sent: Thursday, March 10, 2016 10:42 AM
 Subject: Re: Strategy for dividing wide rows beyond just adding to the 
partition key
   
There is an effort underway to support wider 
rows:https://issues.apache.org/jira/browse/CASSANDRA-9754

This won't help you now though. Even with that improvement you still may need a 
more optimal data model since large-scale scanning/filtering is always a very 
bad idea with Cassandra.
The data modeling methodology for Cassandra dictates that queries drive the 
data model and that each form of query requires a separate table ("query 
table.") Materialized view can automate that process for a lot of cases, but in 
any case it does sound as if some of your queries do require additional tables.
As a general proposition, Cassandra should not be used for heavy filtering - 
query tables with the filtering criteria baked into the PK is the way to go.

-- Jack Krupansky
On Thu, Mar 10, 2016 at 8:54 AM, Jason Kania <jason.ka...@ymail.com> wrote:

Hi,
We have sensor input that creates very wide rows and operations on these rows 
have started to timeout regulary. We have been trying to find a solution to 
dividing wide rows but keep hitting limitations that move the problem around 
instead of solving it.
We have a partition key consisting of a sensorUnitId and a sensorId and use a 
time field to access each column in the row. We tried adding a time based 
entry, timeShardId, to the partition key that consists of the year and week of 
year during which the reading was taken. This works for a number of queries but 
for scanning all the readings against a particular sensorUnitId and sensorId 
combination, we seem to be stuck.
We won't know the range of valid values of the timeShardId for a given 
sensorUnitId and sensorId combination so would have to write to an additional 
table to track the valid timeShardId. We suspect this would create tombstone 
accumulation problems given the number of updates required to the same row so 
haven't tried this option.

Alternatively, we hit a different bottleneck in the form of SELECT DISTINCT in 
trying to directly access the partition keys. Since SELECT DISTINCT does not 
allow for a where clause to filter on the partition key values, we have to 
filter several hundred thousand partition keys just to find those related to 
the relevant sensorUnitId and sensorId. This problem will only grow worse for 
us.

Are there any other approaches that can be suggested? We have been looking 
around, but haven't found any references beyond the initial suggestion to add 
some sort of shard id to the partition key to handle wide rows.
Thanks,
Jason




   

Reply via email to