You could add the column number as an additional clustering key. Then you
can use COMPACT STORAGE for even more efficient storage and access
(assuming there is only a single non-PK data column, the blob value). With
that layout you can access (read or write) an individual column/blob or a
slice of them.
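
Something like this, for example (table and column names are illustrative,
not from your schema):

create table data (
    id bigint,
    ts bigint,
    col int,     -- chunk/column number, now part of the clustering key
    value blob,  -- the single non-PK data column
    primary key (id, ts, col)
) with compact storage;

-- read or write one individual column/blob:
select value from data where id = ? and ts = ? and col = ?;

-- read a contiguous slice of them:
select value from data where id = ? and ts = ? and col >= ? and col <= ?;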

-- Jack Krupansky

On Sun, Feb 14, 2016 at 5:22 PM, Gianluca Borello <gianl...@sysdig.com>
wrote:

> Hi
>
> I've just painfully discovered a "little" detail in Cassandra: Cassandra
> touches all columns on a CQL select (related issues
> https://issues.apache.org/jira/browse/CASSANDRA-6586,
> https://issues.apache.org/jira/browse/CASSANDRA-6588,
> https://issues.apache.org/jira/browse/CASSANDRA-7085).
>
> My data model is fairly simple: I have a bunch of "sensors" reporting a
> blob of data (~10-100KB) periodically. When reading, 99% of the time I'm
> interested in a subportion of that blob across an arbitrary period of
> time. What I do is simply split those blobs of data into about 30 logical
> units and write them to a CQL table such as:
>
> create table data (
> id bigint,
> ts bigint,
> column1 blob,
> column2 blob,
> column3 blob,
> ...
> column29 blob,
> column30 blob,
> primary key (id, ts)
> );
>
> id is a combination of the sensor id and a time bucket, so that rows don't
> get too wide. Essentially, I thought this was a very legit data model that
> keeps my application code very simple (because I can work on a single
> table, I can write a split sensor blob in a single CQL query, and I can
> read a subset of the columns very efficiently with a single CQL query).
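>
> For example, reading a subset of the columns across a time range is a
> single query along these lines (illustrative):
>
> select column2, column3 from data where id = ? and ts >= ? and ts < ?;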
>
> What I didn't realize is that Cassandra seems to always process all the
> columns of a CQL row, even when the query asks for just one column, and
> this has a dramatic effect on the performance of my reads.
>
> I wrote a simple isolated test case that measures how long it takes to
> read one *single* column from a CQL table with a growing number of columns
> (each iteration adds and populates 10 new columns, each filled with a 1MB
> blob); a sketch of one iteration is below the numbers:
>
> 10 columns: 209 ms
> 20 columns: 339 ms
> 30 columns: 510 ms
> 40 columns: 670 ms
> 50 columns: 884 ms
> 60 columns: 1056 ms
> 70 columns: 1527 ms
> 80 columns: 1503 ms
> 90 columns: 1600 ms
> 100 columns: 1792 ms
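>
> Each iteration boils down to statements like these, driven from a small
> script (table name and values illustrative):
>
> -- grow the table by 10 more blob columns, e.g.:
> alter table data add column11 blob;
> -- repopulate the row with 1MB blobs, then time a single-column read:
> select column1 from data where id = ? and ts = ?;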
>
> In other words, even though the result set returned is exactly the same
> across all these iterations, the response time increases linearly with the
> total size of the other columns, and this is really causing a lot of
> problems in my application.
>
> By reading the JIRA issues, it seems like this is considered a very minor
> optimization not worth the effort of fixing, so I'm asking: is my use case
> really so anomalous that the horrible performance I'm experiencing is to
> be considered "expected" and has to be worked around with some painful
> column family splitting and messy application code?
>
> Thanks
>
