Re: Issues, understanding how CQL works

Alex Ott Wed, 22 Apr 2020 06:46:27 -0700

DSBulk also works with JSON...
if transformations of data are complex, I would go with Spark running in
local mode, and process data...


On Wed, Apr 22, 2020 at 3:38 PM Durity, Sean R <[email protected]>
wrote:

> I thought this might be a single-time use case request. I think my first
> approach would be to use something like dsbulk to unload the data and then
> reload it into a table designed for the query you want to do (as long as
> you have adequate disk space). I think like a DBA/admin first. Dsbulk
> creates csv files, so you could move that data to any kind of database, if
> you chose.
>
> An alternative approach would be to use a driver that supports paging (I
> think this would be most of them) and write a program to walk the data set
> and output what you need in whatever format you need.
>
> Or, since this is a single node scenario, you could try sstable2json to
> export the sstables (files on disk) into JSON, if that is a more workable
> format for you.
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
> -----Original Message-----
> From: Marc Richter <[email protected]>
> Sent: Wednesday, April 22, 2020 6:22 AM
> To: [email protected]
> Subject: [EXTERNAL] Re: Issues, understanding how CQL works
>
> Hi Jeff,
>
> thank you for your exhaustive and verbose answer!
> Also, a very big "Thank you!" to all the other replyers; I hope you
> understand that I summarize all your feedback in this single answer.
>
>  From what I understand from your answers, Cassandra seems to be
> optimized to store (and read) data in only exactly that way that the
> data structure has been designed for. That makes it very inflexible, but
> allows it to do that single job very effectively for a trade-off.
>
> I also understand, the more I dig into Cassandra, that the team I am
> supporting is using Cassandra kind of wrong; they for example do have
> only one node and so do not use neither the load-balancing, nor the
> redundancy-capabilities Cassandra offers.
> Thus, maybe relevant side-note: All the data resides on just one single
> node; maybe that info is important, because we know on which node the
> data is (I know that Cassandra internally is applying the same Hashing -
> Voodoo as if there were 1k nodes, but maybe this is important anyways).
>
> Anyways: I do not really care if a query or effort to find this
> information is sub-optimal or very "expensive" in means of effectivity
> or system-load, since this isn't something that I need to extract on a
> regular basis, but only once. Due to that, it doesn't need to be optimal
> or effective; I also do not care if it blocks the node for several
> hours, since Cassandra is only working on this single request. I really
> need this info (most recent "insertdate") only once.
> Is, considering this, a way to do that?
>
>  > Because you didnt provide a signalid and monthyear, it doesn't know
>  > which machine in your cluster to use to start the query.
>
> I know this already; thanks for confirming that I got this correct! But
> what do I do then if I do not know all "signalid"s? How to learn them?
>
> Is it maybe possible to get a full list of all "signalid"s? Or is it
> possible to "re-arrange" the data in the cluster or something that
> enables me to learn what's the most recent "insertdate"?
> I really do not care if I need to do some expensive copy-all-data -
> move, but I do not know about what is possible and how to do that.
>
> Best regards,
> Marc Richter
>
> On 21.04.20 19:20, Jeff Jirsa wrote:
> >
> >
> > On Tue, Apr 21, 2020 at 6:20 AM Marc Richter <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >     Hi everyone,
> >
> >     I'm very new to Cassandra. I have, however, some experience with SQL.
> >
> >
> > The biggest thing to remember is that Cassandra is designed to scale out
> > to massive clusters - like thousands of instances. To do that, you can't
> > assume it's ever ok to read all of the data, because that doesn't scale.
> > So cassandra takes shortcuts / optimizations to make it possible to
> > ADDRESS all of that data, but not SCAN it.
> >
> >
> >     I need to extract some information from a Cassandra database that has
> >     the following table definition:
> >
> >     CREATE TABLE tagdata.central (
> >     signalid int,
> >     monthyear int,
> >     fromtime bigint,
> >     totime bigint,
> >     avg decimal,
> >     insertdate bigint,
> >     max decimal,
> >     min decimal,
> >     readings text,
> >     PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
> >     )
> >
> >
> > What your primary key REALLY MEANS is:
> >
> > The database on reads and writes will hash(signalid+monthyear) to find
> > which hosts have the data, then
> >
> > In each data file, the data for a given (signalid,monthyear) is stored
> > sorted by fromtime and totime
> >
> >     The database is already of round about 260 GB in size.
> >     I now need to know what is the most recent entry in it; the correct
> >     column to learn this would be "insertdate".
> >
> >     In SQL I would do something like this:
> >
> >     SELECT insertdate FROM tagdata.central
> >     ORDER BY insertdate DESC LIMIT 1;
> >
> >     In CQL, however, I just can't get it to work.
> >
> >     What I have tried already is this:
> >
> >     SELECT insertdate FROM "tagdata.central"
> >     ORDER BY insertdate DESC LIMIT 1;
> >
> >
> > Because you didnt provide a signalid and monthyear, it doesn't know
> > which machine in your cluster to use to start the query.
> >
> >
> >     But this gives me an error:
> >     ERROR: ORDER BY is only supported when the partition key is
> restricted
> >     by an EQ or an IN.
> >
> >
> > Because it's designed for potentially petabytes of data per cluster, it
> > doesn't believe you really want to walk all the data and order ALL of
> > it. Instead, it assumes that when you need to use an ORDER BY, you're
> > going to have some very small piece of data - confined to a single
> > signalid/monthyear pair. And even then, the ORDER is going to assume
> > that you're ordering it by the ordering keys you've defined - fromtime
> > first, and then totime.
> >
> > So you can do
> >
> >   SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime ASC
> > And you can do
> >
> >   SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime DESC
> >
> > And you can do ranges:
> >
> >   SELECT ... WHERE signalid=? and monthyear=? AND fromtime >= ? ORDER BY
> > fromtime DESC
> >
> > But you have to work within the boundaries of how the data is stored.
> > It's stored grouped by signalid+monthyear, and then sorted by fromtime,
> > and then sorted by totime.
> >
> >
> >
> >     So, after some trial and error and a lot of Googling, I learned that
> I
> >     must include all rows from the PRIMARY KEY from left to right in my
> >     query. Thus, this is the "best" I can get to work:
> >
> >
> >     SELECT
> >              *
> >     FROM
> >              "tagdata.central"
> >     WHERE
> >              "signalid" = 4002
> >              AND "monthyear" = 201908
> >     ORDER BY
> >              "fromtime" DESC
> >     LIMIT 10;
> >
> >
> >     The "monthyear" column, I crafted like a fool by incrementing the
> date
> >     one month after another until no results could be found anymore.
> >     The "signalid" I grabbed from one of the unrestricted "SELECT *
> FROM" -
> >     query results. But these can't be as easily guessed as the
> "monthyear"
> >     values could.
> >
> >     This is where I'm stuck!
> >
> >     1. This does not really feel like the ideal way to go. I think there
> is
> >     something more mature in modern IT systems. Can anyone tell me what
> >     is a
> >     better way to get these informations?
> >
> >
> > You can denormalize. Because cassandra allows you to have very large
> > clusters, you can make multiple tables sorted in different ways to
> > enable the queries you need to run. Normal data modeling is to build
> > tables based on the SELECT statements you need to do (unless you're very
> > advanced, in which case you do it based on the transaction semantics of
> > the INSERT/UPDATE statements, but that's probably not you).
> >
> > Or you can use a more flexible database.
> >
> >
> >     2. I need a way to learn all values that are in the "monthyear" and
> >     "signalid" columns in order to be able to craft that query.
> >     How can I achieve that in a reasonable way? As I said: The DB is
> round
> >     about 260 GB which makes it next to impossible to just "have a look"
> at
> >     the output of "SELECT *"..
> >
> >
> > You probably want to keep another table of monthyear + signalid pairs.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
> ________________________________
>
> The information in this Internet Email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this Email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful. When addressed
> to our clients any opinions or advice contained in this Email are subject
> to the terms and conditions expressed in any applicable governing The Home
> Depot terms of business or client engagement letter. The Home Depot
> disclaims all responsibility and liability for the accuracy and content of
> this attachment and for any damages or losses arising from any
> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
> items of a destructive nature, which may be contained in this attachment
> and shall not be liable for direct, indirect, consequential or special
> damages in connection with this e-mail message or its attachment.
>


-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)

Re: Issues, understanding how CQL works

Reply via email to