Re: Issues, understanding how CQL works

Marc Richter Wed, 22 Apr 2020 03:23:25 -0700

Hi Jeff,

thank you for your exhaustive and verbose answer!

Also, a very big "Thank you!" to everyone else who replied; I hope you understand that I am answering all your feedback in this single reply.

From what I understand from your answers, Cassandra seems to be optimized to store (and read) data only in exactly the way the data structure has been designed for. That makes it very inflexible, but in exchange it can do that single job very efficiently.

I also understand, the more I dig into Cassandra, that the team I am supporting is using Cassandra kind of wrong; for example, they have only one node and so use neither the load-balancing nor the redundancy capabilities Cassandra offers.
Thus, a maybe relevant side note: all the data resides on just one single node. Maybe that info is important, because we know which node the data is on (I know that Cassandra internally applies the same hashing voodoo as if there were 1k nodes, but maybe this is important anyway).

Anyway: I do not really care if a query or effort to find this information is sub-optimal or very "expensive" in terms of efficiency or system load, since this isn't something that I need to extract on a regular basis, but only once. Because of that, it doesn't need to be optimal or efficient; I also do not care if it blocks the node for several hours, since Cassandra would only be working on this single request. I really need this info (the most recent "insertdate") only once.

Considering this, is there a way to do that?

> Because you didn't provide a signalid and monthyear, it doesn't know
> which machine in your cluster to use to start the query.

I know this already; thanks for confirming that I got this right! But what do I do if I do not know all the "signalid"s? How can I learn them?

Is it maybe possible to get a full list of all "signalid"s? Or is it possible to "re-arrange" the data in the cluster, or something similar, that enables me to learn what the most recent "insertdate" is?
I really do not care if I need to do some expensive copy-all-data move, but I do not know what is possible and how to do it.


Best regards,
Marc Richter

On 21.04.20 19:20, Jeff Jirsa wrote:

On Tue, Apr 21, 2020 at 6:20 AM Marc Richter wrote:
    Hi everyone,

    I'm very new to Cassandra. I have, however, some experience with SQL.

The biggest thing to remember is that Cassandra is designed to scale out to massive clusters - like thousands of instances. To do that, you can't assume it's ever ok to read all of the data, because that doesn't scale. So Cassandra takes shortcuts / optimizations to make it possible to ADDRESS all of that data, but not SCAN it.

    I need to extract some information from a Cassandra database that has
    the following table definition:

    CREATE TABLE tagdata.central (
    signalid int,
    monthyear int,
    fromtime bigint,
    totime bigint,
    avg decimal,
    insertdate bigint,
    max decimal,
    min decimal,
    readings text,
    PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
    )


What your primary key REALLY MEANS is:
1. The database on reads and writes will hash(signalid+monthyear) to find which hosts have the data, then
2. In each data file, the data for a given (signalid,monthyear) is stored sorted by fromtime and totime
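
The two points above can be sketched as a toy in-memory model in Python. This is only an illustration, not how Cassandra is implemented: the node names, the `owner` and `insert` helpers, and the sample values are all made up, and a modulo over an MD5 digest stands in for the real token ring.

```python
import bisect
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def owner(signalid, monthyear):
    """Pick a node from the partition key alone.

    Simplification: Cassandra maps the key to a token on a ring; a plain
    modulo over an MD5 digest stands in for that here.
    """
    digest = hashlib.md5(f"{signalid}:{monthyear}".encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# node -> partition key -> rows kept sorted by (fromtime, totime)
storage = {node: defaultdict(list) for node in NODES}


def insert(signalid, monthyear, fromtime, totime, insertdate):
    """Route the row by its partition key, then keep the partition sorted."""
    partition = storage[owner(signalid, monthyear)][(signalid, monthyear)]
    bisect.insort(partition, (fromtime, totime, insertdate))


# Made-up sample rows, written out of order.
insert(4002, 201908, 300, 400, 9001)
insert(4002, 201908, 100, 200, 9003)
insert(4002, 201908, 200, 300, 9002)

node = owner(4002, 201908)
rows = storage[node][(4002, 201908)]
print(node, rows)
```

Note that the rows come back ordered by fromtime, while the insertdate values (the last element of each tuple) are in no particular order. Nothing in this layout helps with finding the largest insertdate without looking at every row.
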
    The database is already of round about 260 GB in size.
    I now need to know what is the most recent entry in it; the correct
    column to learn this would be "insertdate".

    In SQL I would do something like this:

    SELECT insertdate FROM tagdata.central
    ORDER BY insertdate DESC LIMIT 1;

    In CQL, however, I just can't get it to work.

    What I have tried already is this:

    SELECT insertdate FROM "tagdata.central"
    ORDER BY insertdate DESC LIMIT 1;

Because you didn't provide a signalid and monthyear, it doesn't know which machine in your cluster to use to start the query.

    But this gives me an error:
    ERROR: ORDER BY is only supported when the partition key is restricted
    by an EQ or an IN.

Because it's designed for potentially petabytes of data per cluster, it doesn't believe you really want to walk all the data and order ALL of it. Instead, it assumes that when you need to use an ORDER BY, you're going to have some very small piece of data - confined to a single signalid/monthyear pair. And even then, the ORDER is going to assume that you're ordering it by the ordering keys you've defined - fromtime first, and then totime.
So you can do

  SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime ASC
And you can do

  SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime DESC

And you can do ranges:

  SELECT ... WHERE signalid=? and monthyear=? AND fromtime >= ? ORDER BY fromtime DESC

But you have to work within the boundaries of how the data is stored. It's stored grouped by signalid+monthyear, and then sorted by fromtime, and then sorted by totime.
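
Why exactly these three query shapes are cheap can be seen in a small Python sketch. It is a simplified model under assumptions, not Cassandra code: the `partitions` dict, the `select` helper, and the sample rows are hypothetical. A partition is found by direct lookup, and because its rows are already sorted by fromtime, a range or a reversed order costs only a binary search and a slice.

```python
import bisect

# One partition, keyed by (signalid, monthyear). Rows are
# (fromtime, totime, insertdate) tuples, kept sorted by fromtime, totime.
partitions = {
    (4002, 201908): [(100, 200, 9003), (200, 300, 9002), (300, 400, 9001)],
}


def select(signalid, monthyear, fromtime_gte=None, descending=False):
    """Mimic SELECT ... WHERE signalid=? AND monthyear=?
    [AND fromtime >= ?] ORDER BY fromtime ASC|DESC."""
    # Direct lookup by the full partition key; no scan over other partitions.
    rows = partitions.get((signalid, monthyear), [])
    start = 0
    if fromtime_gte is not None:
        # Binary search is valid only because rows are sorted by fromtime.
        start = bisect.bisect_left(rows, (fromtime_gte,))
    result = rows[start:]
    # DESC is just the stored order read backwards.
    return result[::-1] if descending else result


print(select(4002, 201908))
print(select(4002, 201908, fromtime_gte=200, descending=True))
```

There is deliberately no way in this model to ask for rows ordered by insertdate, or to ask without a (signalid, monthyear) pair: both would mean visiting every row of every partition.
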
    So, after some trial and error and a lot of Googling, I learned that I
    must include all columns of the PRIMARY KEY from left to right in my
    query. Thus, this is the "best" I can get to work:


    SELECT
             *
    FROM
             "tagdata.central"
    WHERE
             "signalid" = 4002
             AND "monthyear" = 201908
    ORDER BY
             "fromtime" DESC
    LIMIT 10;


    The "monthyear" column, I crafted like a fool by incrementing the date
    one month after another until no results could be found anymore.
    The "signalid" I grabbed from one of the unrestricted "SELECT * FROM" -
    query results. But these can't be as easily guessed as the "monthyear"
    values could.

    This is where I'm stuck!

    1. This does not really feel like the ideal way to go. I think there is
    something more mature in modern IT systems. Can anyone tell me what
    is a better way to get this information?

You can denormalize. Because Cassandra allows you to have very large clusters, you can make multiple tables sorted in different ways to enable the queries you need to run. Normal data modeling is to build tables based on the SELECT statements you need to do (unless you're very advanced, in which case you do it based on the transaction semantics of the INSERT/UPDATE statements, but that's probably not you).
Or you can use a more flexible database.


    2. I need a way to learn all values that are in the "monthyear" and
    "signalid" columns in order to be able to craft that query.
    How can I achieve that in a reasonable way? As I said: The DB is round
    about 260 GB which makes it next to impossible to just "have a look" at
    the output of "SELECT *"..


You probably want to keep another table of monthyear + signalid pairs.
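
A minimal Python sketch of that idea, assuming the pairs are recorded on every write: a second structure remembers each (monthyear, signalid) pair, and the most recent insertdate is then found with one targeted read per known partition. The names `central`, `known_pairs`, `insert`, and `latest_insertdate`, as well as the sample values (including signalid 4711), are hypothetical and only model the approach in memory.

```python
# partition key (signalid, monthyear) -> rows of (fromtime, totime, insertdate)
central = {}
# The extra "table": every (monthyear, signalid) pair ever written.
known_pairs = set()


def insert(signalid, monthyear, fromtime, totime, insertdate):
    """Write the row and record its partition key alongside it."""
    central.setdefault((signalid, monthyear), []).append(
        (fromtime, totime, insertdate)
    )
    known_pairs.add((monthyear, signalid))


def latest_insertdate():
    """One targeted read per known partition, then the overall maximum."""
    latest = None
    for monthyear, signalid in sorted(known_pairs):
        for _fromtime, _totime, insertdate in central[(signalid, monthyear)]:
            if latest is None or insertdate > latest:
                latest = insertdate
    return latest


# Made-up sample data.
insert(4002, 201908, 100, 200, 9001)
insert(4002, 201909, 100, 200, 9007)
insert(4711, 201908, 100, 200, 9004)

print(sorted(known_pairs))
print(latest_insertdate())
```

This only helps for rows written after the pairs table exists. For data that is already stored, CQL's SELECT DISTINCT on the partition key columns (here signalid and monthyear) can list the existing pairs, although on roughly 260 GB it may be slow or run into timeouts.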

