Re: Performance issues with "many" CQL columns

Jack Krupansky Sun, 14 Feb 2016 16:34:51 -0800

You can definitely read all of the columns in a single SELECT. And the
N INSERTs can be batched, and will insert fewer cells in the storage engine
than the previous approach.
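
A minimal sketch of the layout suggested earlier in this thread, for
illustration only. The table name data_by_column, the column names
column_number and value, and all literal values are hypothetical. COMPACT
STORAGE is shown because it was suggested above; later Cassandra releases
deprecate it, so check that your version still supports it.

```cql
-- Hypothetical schema: one CQL row per (id, ts, column_number),
-- with a single non-PK data column holding the blob.
CREATE TABLE data_by_column (
    id bigint,
    ts bigint,
    column_number int,
    value blob,
    PRIMARY KEY (id, ts, column_number)
) WITH COMPACT STORAGE;

-- Write the pieces of one sensor report in a single batch.
-- Every statement targets the same partition (id = 1).
BEGIN UNLOGGED BATCH
    INSERT INTO data_by_column (id, ts, column_number, value)
        VALUES (1, 1455496491, 1, 0x00);
    INSERT INTO data_by_column (id, ts, column_number, value)
        VALUES (1, 1455496491, 2, 0x01);
APPLY BATCH;

-- Read a single piece at a single timestamp.
SELECT value FROM data_by_column
    WHERE id = 1 AND ts = 1455496491 AND column_number = 2;

-- Read every piece across a time range in one SELECT.
SELECT ts, column_number, value FROM data_by_column
    WHERE id = 1 AND ts >= 1455490000 AND ts < 1455500000;
```

With this layout the application filters on column_number itself when it
needs only a subset of the pieces across a time range.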


-- Jack Krupansky

On Sun, Feb 14, 2016 at 7:31 PM, Gianluca Borello <gianl...@sysdig.com>
wrote:

> Thank you for your reply.
>
> Your advice is definitely sound, although it still seems suboptimal to me
> because:
>
> 1) It requires N INSERT queries from the application code (where N is the
> number of columns)
>
> 2) It requires N SELECT queries from my application code (where N is the
> number of columns I need to read at any given time, which is determined at
> runtime). I can't even use the IN operator (e.g. WHERE column_number IN (1,
> 2, 3, ...)) because I am already using a non-EQ relation on the timestamp
> key and Cassandra restricts me to only one non-EQ relation.
>
> In summary, I can (and will) adapt my code to use a similar approach
> despite everything, but the goal of my message was mainly to understand why
> the jira issues I linked above are not full of dozens of "+1" comments.
>
> To me this really feels like a terrible performance issue that should be
> fixed by default (or in the very worst case clearly documented), even after
> understanding the motivation for reading all the columns in the CQL row.
>
> Thanks
>
> On Sun, Feb 14, 2016 at 3:05 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> You could add the column number as an additional clustering key. And then
>> you can actually use COMPACT STORAGE for even more efficient storage and
>> access (assuming there is only a single non-PK data column, the blob
>> value.) You can then access (read or write) an individual column/blob or a
>> slice of them.
>>
>> -- Jack Krupansky
>>
>> On Sun, Feb 14, 2016 at 5:22 PM, Gianluca Borello <gianl...@sysdig.com>
>> wrote:
>>
>>> Hi
>>>
>>> I've just painfully discovered a "little" detail in Cassandra: Cassandra
>>> touches all columns on a CQL select (related issues
>>> https://issues.apache.org/jira/browse/CASSANDRA-6586,
>>> https://issues.apache.org/jira/browse/CASSANDRA-6588,
>>> https://issues.apache.org/jira/browse/CASSANDRA-7085).
>>>
>>> My data model is fairly simple: I have a bunch of "sensors" reporting a
>>> blob of data (~10-100KB) periodically. When reading, 99% of the time I'm
>>> interested in a subportion of that blob of data across an arbitrary period
>>> of time. What I do is simply split those blobs of data into about 30
>>> logical units and write them to a CQL table such as:
>>>
>>> create table data (
>>> id bigint,
>>> ts bigint,
>>> column1 blob,
>>> column2 blob,
>>> column3 blob,
>>> ...
>>> column29 blob,
>>> column30 blob,
>>> primary key (id, ts)
>>> );
>>>
>>> id is a combination of the sensor id and a time bucket, in order to not
>>> get the row too wide. Essentially, I thought this was a very legit data
>>> model that helps me keep my application code very simple (because I can
>>> work on a single table, I can write a split sensor blob in a single CQL
>>> query and I can read a subset of the columns very efficiently with one
>>> single CQL query).
>>>
>>> What I didn't realize is that Cassandra seems to always process all the
>>> columns of the CQL row, even though my query asks for just one column,
>>> and this has a dramatic effect on the performance of my reads.
>>>
>>> I wrote a simple isolated test case where I test how long it takes to
>>> read one *single* column in a CQL table composed of several columns (at
>>> each iteration I add and populate 10 new columns), each filled with 1MB
>>> blobs:
>>>
>>> 10 columns: 209 ms
>>> 20 columns: 339 ms
>>> 30 columns: 510 ms
>>> 40 columns: 670 ms
>>> 50 columns: 884 ms
>>> 60 columns: 1056 ms
>>> 70 columns: 1527 ms
>>> 80 columns: 1503 ms
>>> 90 columns: 1600 ms
>>> 100 columns: 1792 ms
>>>
>>> In other words, even if the result set returned is exactly the same
>>> across all these iterations, the response time increases roughly linearly
>>> with the size of the other columns, and this is really causing a lot of
>>> problems in my application.
>>>
>>> From reading the JIRA issues, it seems like this is considered a very
>>> minor optimization not worth the effort of fixing, so I'm asking: is my use
>>> case really so anomalous that the horrible performance that I'm
>>> experiencing is to be considered "expected" and needs to be fixed with some
>>> painful column family splitting and messy application code?
>>>
>>> Thanks
>>>
>>
>>
>
