On Tue, Apr 21, 2020 at 6:20 AM Marc Richter <m...@marc-richter.info> wrote:

> Hi everyone,
>
> I'm very new to Cassandra. I have, however, some experience with SQL.
>

The biggest thing to remember is that Cassandra is designed to scale out to
massive clusters - like thousands of instances. To do that, you can't
assume it's ever ok to read all of the data, because that doesn't scale. So
cassandra takes shortcuts / optimizations to make it possible to ADDRESS
all of that data, but not SCAN it.


>
> I need to extract some information from a Cassandra database that has
> the following table definition:
>
> CREATE TABLE tagdata.central (
> signalid int,
> monthyear int,
> fromtime bigint,
> totime bigint,
> avg decimal,
> insertdate bigint,
> max decimal,
> min decimal,
> readings text,
> PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
> )
>
>
What your primary key REALLY MEANS is:

The database on reads and writes will hash(signalid+monthyear) to find
which hosts have the data, then

In each data file, the data for a given (signalid,monthyear) is stored
sorted by fromtime and totime



> The database is already of round about 260 GB in size.
> I now need to know what is the most recent entry in it; the correct
> column to learn this would be "insertdate".
>
> In SQL I would do something like this:
>
> SELECT insertdate FROM tagdata.central
> ORDER BY insertdate DESC LIMIT 1;
>
> In CQL, however, I just can't get it to work.
>
> What I have tried already is this:
>
> SELECT insertdate FROM "tagdata.central"
> ORDER BY insertdate DESC LIMIT 1;
>

Because you didnt provide a signalid and monthyear, it doesn't know which
machine in your cluster to use to start the query.


>
> But this gives me an error:
> ERROR: ORDER BY is only supported when the partition key is restricted
> by an EQ or an IN.
>

Because it's designed for potentially petabytes of data per cluster, it
doesn't believe you really want to walk all the data and order ALL of it.
Instead, it assumes that when you need to use an ORDER BY, you're going to
have some very small piece of data - confined to a single
signalid/monthyear pair. And even then, the ORDER is going to assume that
you're ordering it by the ordering keys you've defined - fromtime  first,
and then totime.

So you can do

 SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime ASC
And you can do

 SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime DESC

And you can do ranges:

 SELECT ... WHERE signalid=? and monthyear=? AND fromtime >= ? ORDER BY
fromtime DESC

But you have to work within the boundaries of how the data is stored. It's
stored grouped by signalid+monthyear, and then sorted by fromtime, and then
sorted by totime.



> So, after some trial and error and a lot of Googling, I learned that I
> must include all rows from the PRIMARY KEY from left to right in my
> query. Thus, this is the "best" I can get to work:
>
>
> SELECT
>         *
> FROM
>         "tagdata.central"
> WHERE
>         "signalid" = 4002
>         AND "monthyear" = 201908
> ORDER BY
>         "fromtime" DESC
> LIMIT 10;
>
>
> The "monthyear" column, I crafted like a fool by incrementing the date
> one month after another until no results could be found anymore.
> The "signalid" I grabbed from one of the unrestricted "SELECT * FROM" -
> query results. But these can't be as easily guessed as the "monthyear"
> values could.
>
> This is where I'm stuck!
>
> 1. This does not really feel like the ideal way to go. I think there is
> something more mature in modern IT systems. Can anyone tell me what is a
> better way to get these informations?
>

You can denormalize. Because cassandra allows you to have very large
clusters, you can make multiple tables sorted in different ways to enable
the queries you need to run. Normal data modeling is to build tables based
on the SELECT statements you need to do (unless you're very advanced, in
which case you do it based on the transaction semantics of the
INSERT/UPDATE statements, but that's probably not you).

Or you can use a more flexible database.


>
> 2. I need a way to learn all values that are in the "monthyear" and
> "signalid" columns in order to be able to craft that query.
> How can I achieve that in a reasonable way? As I said: The DB is round
> about 260 GB which makes it next to impossible to just "have a look" at
> the output of "SELECT *"..
>

You probably want to keep another table of monthyear + signalid pairs.

Reply via email to