Considering the (simplified) table that I wrote before:

create table data (
    id bigint,
    ts bigint,
    column1 blob,
    column2 blob,
    column3 blob,
    ...
    column29 blob,
    column30 blob,
    primary key (id, ts)
);

A user request (varies every time) translates into a set of queries asking for a subset of the columns (< 10) for a specific set of sensors (< 100) over a specific time range (< 300):

SELECT column1, column7, column20, column25 FROM data where id = SENSOR_ID_1 and ts > x and ts < y
SELECT column1, column7, column20, column25 FROM data where id = SENSOR_ID_2 and ts > x and ts < y
...
SELECT column1, column7, column20, column25 FROM data where id = SENSOR_ID_N and ts > x and ts < y

To answer your question, each non-EQ predicate on the timestamp returns a few hundred rows (it's essentially a time series).

If I put the column number in the clustering key along with the timestamp, I'll have to further increase the number of queries and make my code more complicated:

SELECT value FROM data where id = SENSOR_ID_1 and ts > x and ts < y and column_number = 1
SELECT value FROM data where id = SENSOR_ID_1 and ts > x and ts < y and column_number = 7
SELECT value FROM data where id = SENSOR_ID_1 and ts > x and ts < y and column_number = 20
SELECT value FROM data where id = SENSOR_ID_1 and ts > x and ts < y and column_number = 25
...
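For reference, a sketch of how that alternative table would presumably be laid out (column_number and value are illustrative names, not something from the actual schema). Note that the column number has to precede the timestamp in the clustering order, since CQL only allows the range restriction on the last restricted clustering column:

-- hypothetical single-value layout; names are assumptions
create table data_by_column (
    id bigint,
    column_number int,
    ts bigint,
    value blob,
    primary key (id, column_number, ts)
);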
Again, not too terrible, and I'll definitely have to do something similar because the performance penalty I'm paying now is very significant, but in any case this seems to me a complication of the data model (and of my application).

Thanks again

On Sun, Feb 14, 2016 at 5:21 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

> What does your query actually look like today?
>
> Is your non-EQ on timestamp selecting a single row, a few rows, or many rows (dozens, hundreds, thousands)?
>
> -- Jack Krupansky
>
> On Sun, Feb 14, 2016 at 7:40 PM, Gianluca Borello <gianl...@sysdig.com> wrote:
>
>> Thanks again.
>>
>> One clarification about "reading in a single SELECT": in my point 2, I mentioned the need to read a variable subset of the columns every time, usually in the range of ~5 out of 30. I can't find a way to do that in a single SELECT unless I use the IN operator (which I can't, as explained).
>>
>> Is there any other method you were thinking of, or is your "reading in a single SELECT" only applicable when I need to read the whole set of columns (which is never my case, unfortunately)?
>>
>> Thanks
>>
>> On Sun, Feb 14, 2016 at 4:34 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>
>>> You can definitely read all of the columns in a single SELECT. And the N INSERTs can be batched, and will insert fewer cells into the storage engine than the previous approach.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sun, Feb 14, 2016 at 7:31 PM, Gianluca Borello <gianl...@sysdig.com> wrote:
>>>
>>>> Thank you for your reply.
>>>>
>>>> Your advice is definitely sound, although it still seems suboptimal to me because:
>>>>
>>>> 1) It requires N INSERT queries from the application code (where N is the number of columns).
>>>>
>>>> 2) It requires N SELECT queries from my application code (where N is the number of columns I need to read at any given time, which is determined at runtime). I can't even use the IN operator (e.g. WHERE column_number IN (1, 2, 3, ...)) because I am already using a non-EQ relation on the timestamp key, and Cassandra restricts me to only one non-EQ relation.
>>>>
>>>> In summary, I can (and will) adapt my code to use a similar approach despite everything, but the goal of my message was mainly to understand why the jira issues I linked above are not full of dozens of "+1" comments.
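To make that IN limitation concrete, the single query one would want to write looks something like the sketch below (using the hypothetical data_by_column layout from above), and, as explained in the thread, Cassandra rejects it because the IN restriction cannot be combined with the non-EQ restriction already on ts:

-- rejected: IN on column_number plus a range on ts is not allowed here
SELECT value FROM data_by_column
    where id = SENSOR_ID_1
    and column_number IN (1, 7, 20, 25)
    and ts > x and ts < y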
>>>> To me this really feels like a terrible performance issue that should be fixed by default (or, in the very worst case, clearly documented), even after understanding the motivation for reading all the columns in the CQL row.
>>>>
>>>> Thanks
>>>>
>>>> On Sun, Feb 14, 2016 at 3:05 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>
>>>>> You could add the column number as an additional clustering key. And then you can actually use COMPACT STORAGE for even more efficient storage and access (assuming there is only a single non-PK data column, the blob value). You can then access (read or write) an individual column/blob or a slice of them.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Sun, Feb 14, 2016 at 5:22 PM, Gianluca Borello <gianl...@sysdig.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've just painfully discovered a "little" detail in Cassandra: Cassandra touches all columns on a CQL select (related issues: https://issues.apache.org/jira/browse/CASSANDRA-6586, https://issues.apache.org/jira/browse/CASSANDRA-6588, https://issues.apache.org/jira/browse/CASSANDRA-7085).
>>>>>>
>>>>>> My data model is fairly simple: I have a bunch of "sensors" reporting a blob of data (~10-100KB) periodically. When reading, 99% of the time I'm interested in a subportion of that blob of data across an arbitrary period of time. What I do is simply split those blobs of data into about 30 logical units and write them to a CQL table such as:
>>>>>>
>>>>>> create table data (
>>>>>>     id bigint,
>>>>>>     ts bigint,
>>>>>>     column1 blob,
>>>>>>     column2 blob,
>>>>>>     column3 blob,
>>>>>>     ...
>>>>>>     column29 blob,
>>>>>>     column30 blob,
>>>>>>     primary key (id, ts)
>>>>>> );
>>>>>>
>>>>>> id is a combination of the sensor id and a time bucket, in order to not get the row too wide. Essentially, I thought this was a very legit data model that helps me keep my application code very simple: I can work on a single table, I can write a split sensor blob in a single CQL query, and I can read a subset of the columns very efficiently with one single CQL query.
>>>>>>
>>>>>> What I didn't realize is that Cassandra seems to always process all the columns of the CQL row, regardless of the fact that my query asks for just one column, and this has a dramatic effect on the performance of my reads.
>>>>>>
>>>>>> I wrote a simple isolated test case where I measure how long it takes to read one *single* column from a CQL table composed of several columns (at each iteration I add and populate 10 new columns), each filled with 1MB blobs:
>>>>>>
>>>>>> 10 columns: 209 ms
>>>>>> 20 columns: 339 ms
>>>>>> 30 columns: 510 ms
>>>>>> 40 columns: 670 ms
>>>>>> 50 columns: 884 ms
>>>>>> 60 columns: 1056 ms
>>>>>> 70 columns: 1527 ms
>>>>>> 80 columns: 1503 ms
>>>>>> 90 columns: 1600 ms
>>>>>> 100 columns: 1792 ms
>>>>>>
>>>>>> In other words, even though the result set returned is exactly the same across all these iterations, the response time increases linearly with the size of the other columns, and this is really causing a lot of problems in my application.
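The test itself isn't shown in the thread, but each iteration presumably boils down to CQL along these lines (test_data and the exact statements are a reconstruction, not the actual benchmark code):

-- grow the table by 10 columns (repeated with column12, column13, ...)
ALTER TABLE test_data ADD column11 blob;
-- after repopulating every column of the row with a 1MB blob, time a
-- single-column read; the returned result set is identical at every iteration
SELECT column1 FROM test_data where id = 0 and ts = 0;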
>>>>>> By reading the JIRA issues, it seems like this is considered a very minor optimization not worth the effort of fixing, so I'm asking: is my use case really so anomalous that the horrible performance I'm experiencing is to be considered "expected", and needs to be fixed with some painful column family splitting and messy application code?
>>>>>>
>>>>>> Thanks
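For completeness, a sketch of the compact, single-value layout Jack suggests above, together with the batched writes he mentions (names are again assumptions; keeping the column number before the timestamp in the clustering order means a time slice of one column stays a single contiguous read):

-- hypothetical compact layout: one non-PK column, so COMPACT STORAGE applies
create table data_compact (
    id bigint,
    column_number int,
    ts bigint,
    value blob,
    primary key (id, column_number, ts)
) with compact storage;

-- the per-column writes for one sensor sample, grouped into a single batch
BEGIN BATCH
    INSERT INTO data_compact (id, column_number, ts, value) VALUES (42, 1, 1455494400, 0xCAFE);
    INSERT INTO data_compact (id, column_number, ts, value) VALUES (42, 2, 1455494400, 0xBEEF);
APPLY BATCH;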