You could add the column number as an additional clustering key. You could then also use COMPACT STORAGE for even more efficient storage and access (assuming there is only a single non-PK data column, the blob value). You can then access (read or write) an individual column/blob or a contiguous slice of them.
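A rough sketch of that layout (untested; the table and column names here are just placeholders):

create table data_split (
    id bigint,   -- sensor id + time bucket, same as the current table
    ts bigint,
    col int,     -- index of the logical unit (1..30)
    value blob,  -- the single non-PK data column that COMPACT STORAGE expects
    primary key (id, ts, col)
) with compact storage;

-- read one specific column/blob for a given timestamp:
select value from data_split where id = ? and ts = ? and col = ?;

-- read a contiguous slice of columns for a given timestamp:
select col, value from data_split where id = ? and ts = ? and col >= ? and col <= ?;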
-- Jack Krupansky

On Sun, Feb 14, 2016 at 5:22 PM, Gianluca Borello <gianl...@sysdig.com> wrote:

> Hi
>
> I've just painfully discovered a "little" detail in Cassandra: Cassandra
> touches all columns on a CQL select (related issues
> https://issues.apache.org/jira/browse/CASSANDRA-6586,
> https://issues.apache.org/jira/browse/CASSANDRA-6588,
> https://issues.apache.org/jira/browse/CASSANDRA-7085).
>
> My data model is fairly simple: I have a bunch of "sensors" reporting a
> blob of data (~10-100KB) periodically. When reading, 99% of the time I'm
> interested in a subportion of that blob of data across an arbitrary period
> of time. What I do is simply split those blobs of data into about 30
> logical units and write them to a CQL table such as:
>
> create table data (
> id bigint,
> ts bigint,
> column1 blob,
> column2 blob,
> column3 blob,
> ...
> column29 blob,
> column30 blob,
> primary key (id, ts)
> );
>
> id is a combination of the sensor id and a time bucket, in order to not
> get the row too wide. Essentially, I thought this was a very legit data
> model that helps me keep my application code very simple (because I can
> work on a single table, I can write a split sensor blob with a single CQL
> query, and I can read a subset of the columns very efficiently with a
> single CQL query).
>
> What I didn't realize is that Cassandra seems to always process all the
> columns of the CQL row, regardless of the fact that my query asks for just
> one column, and this has a dramatic effect on the performance of my reads.
>
> I wrote a simple isolated test case where I measure how long it takes to
> read one *single* column from a CQL table composed of several columns (at
> each iteration I add and populate 10 new columns), each filled with 1MB
> blobs:
>
> 10 columns: 209 ms
> 20 columns: 339 ms
> 30 columns: 510 ms
> 40 columns: 670 ms
> 50 columns: 884 ms
> 60 columns: 1056 ms
> 70 columns: 1527 ms
> 80 columns: 1503 ms
> 90 columns: 1600 ms
> 100 columns: 1792 ms
>
> In other words, even though the result set returned is exactly the same
> across all these iterations, the response time increases linearly with the
> size of the other columns, and this is really causing a lot of problems in
> my application.
>
> By reading the JIRA issues, it seems this is considered a very minor
> optimization not worth the effort of fixing, so I'm asking: is my use case
> really so anomalous that the horrible performance I'm experiencing is to
> be considered "expected" and needs to be fixed with some painful column
> family splitting and messy application code?
>
> Thanks