Re: Performance issues with "many" CQL columns

Gianluca Borello Sun, 14 Feb 2016 16:41:48 -0800

Thanks again.

One clarification about "reading in a single SELECT": in my point 2, I
mentioned the need to read a variable subset of columns every time, usually
in the range of ~5 out of 30. I can't find a way to do that in a single
SELECT unless I use the IN operator (which I can't, as explained).


Is there any other method you were thinking of, or your "reading in a
single SELECT" is just applicable when I need to read the whole set of
columns (which is never my case, unfortunately)?

Thanks


On Sun, Feb 14, 2016 at 4:34 PM, Jack Krupansky <[email protected]>
wrote:

> You can definitely read all of columns in a single SELECT. And the
> n-INSERTS can be batched and will insert fewer cells in the storage engine
> than the previous approach.
>
> -- Jack Krupansky
>
> On Sun, Feb 14, 2016 at 7:31 PM, Gianluca Borello <[email protected]>
> wrote:
>
>> Thank you for your reply.
>>
>> Your advice is definitely sound, although it still seems suboptimal to me
>> because:
>>
>> 1) It requires N INSERT queries from the application code (where N is the
>> number of columns)
>>
>> 2) It requires N SELECT queries from my application code (where N is the
>> number of columns I need to read at any given time, which is determined at
>> runtime). I can't even use the IN operator (e.g. WHERE column_number IN (1,
>> 2, 3, ...)) because I am already using a non-EQ relation on the timestamp
>> key and Cassandra restricts me to only one non-EQ relation.
>>
>> In summary, I can (and will) adapt my code to use a similar approach
>> despite everything, but the goal of my message was mainly to understand why
>> the jira issues I linked above are not full of dozens of "+1" comments.
>>
>> To me this really feels like a terrible performance issue that should be
>> fixed by default (or in the very worst case clearly documented), even after
>> understanding the motivation for reading all the columns in the CQL row.
>>
>> Thanks
>>
>> On Sun, Feb 14, 2016 at 3:05 PM, Jack Krupansky <[email protected]
>> > wrote:
>>
>>> You could add the column number as an additional clustering key. And
>>> then you can actually use COMPACT STORAGE for even more efficient storage
>>> and access (assuming there is only  a single non-PK data column, the blob
>>> value.) You can then access (read or write) an individual column/blob or a
>>> slice of them.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sun, Feb 14, 2016 at 5:22 PM, Gianluca Borello <[email protected]>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I've just painfully discovered a "little" detail in Cassandra:
>>>> Cassandra touches all columns on a CQL select (related issues
>>>> https://issues.apache.org/jira/browse/CASSANDRA-6586,
>>>> https://issues.apache.org/jira/browse/CASSANDRA-6588,
>>>> https://issues.apache.org/jira/browse/CASSANDRA-7085).
>>>>
>>>> My data model is fairly simple: I have a bunch of "sensors" reporting a
>>>> blob of data (~10-100KB) periodically. When reading, 99% of the times I'm
>>>> interested in a subportion of that blob of data across an arbitrary period
>>>> of time. What I do is simply splitting those blobs of data in about 30
>>>> logical units and write them in a CQL table such as:
>>>>
>>>> create table data (
>>>> id bigint,
>>>> ts bigint,
>>>> column1 blob,
>>>> column2 blob,
>>>> column3 blob,
>>>> ...
>>>> column29 blob,
>>>> column30 blob
>>>> primary key (id, ts)
>>>>
>>>> id is a combination of the sensor id and a time bucket, in order to not
>>>> get the row too wide. Essentially, I thought this was a very legit data
>>>> model that helps me keep my application code very simple (because I can
>>>> work on a single table, I can write a split sensor blob in a single CQL
>>>> query and I can read a subset of the columns very efficiently with one
>>>> single CQL query).
>>>>
>>>> What I didn't realize is that Cassandra seems to always process all the
>>>> columns of the CQL row, regardless of the fact that my query asks just one
>>>> column, and this has dramatic effect on the performance of my reads.
>>>>
>>>> I wrote a simple isolated test case where I test how long it takes to
>>>> read one *single* column in a CQL table composed of several columns (at
>>>> each iteration I add and populate 10 new columns), each filled with 1MB
>>>> blobs:
>>>>
>>>> 10 columns: 209 ms
>>>> 20 columns: 339 ms
>>>> 30 columns: 510 ms
>>>> 40 columns: 670 ms
>>>> 50 columns: 884 ms
>>>> 60 columns: 1056 ms
>>>> 70 columns: 1527 ms
>>>> 80 columns: 1503 ms
>>>> 90 columns: 1600 ms
>>>> 100 columns: 1792 ms
>>>>
>>>> In other words, even if the result set returned is exactly the same
>>>> across all these iteration, the response time increases linearly with the
>>>> size of the other columns, and this is really causing a lot of problems in
>>>> my application.
>>>>
>>>> By reading the JIRA issues, it seems like this is considered a very
>>>> minor optimization not worth the effort of fixing, so I'm asking: is my use
>>>> case really so anomalous that the horrible performance that I'm
>>>> experiencing are to be considered "expected" and need to be fixed with some
>>>> painful column family splitting and messy application code?
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>

Re: Performance issues with "many" CQL columns

Reply via email to