Not directly related, but you can try using zstd as the compression codec - in
my tests it offloaded faster, with a slightly worse compression ratio.
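
For example, taking the dsbulk command quoted below and swapping only the
codec (assuming the dsbulk release in use lists zstd among the supported
values for connector.json.compression):

$ ../dsbulk-1.5.0/bin/dsbulk unload -h '["MY_CASSANDRA_IP"]' \
  --driver.advanced.auth-provider.class PlainTextAuthProvider \
  -u cassandra -p MY_PASSWORD -k tagdata -t central -c json \
  --connector.json.compression zstd -url /path/to/big/storage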

Marc Richter at "Wed, 22 Apr 2020 17:57:44 +0200" wrote:
 MR> Seems as if sstable2json is deprecated; see [1] and [2].

 MR> So, dsbulk [3] it is, I guess.

 MR> I downloaded it and crafted the following command line from the docs [4]
 MR> for my use case:

 MR> $ ../dsbulk-1.5.0/bin/dsbulk unload -h '["MY_CASSANDRA_IP"]' \
 MR>   --driver.advanced.auth-provider.class PlainTextAuthProvider \
 MR>   -u cassandra -p MY_PASSWORD -k tagdata -t central -c json \
 MR>   --connector.json.compression gzip -url /path/to/big/storage

 MR> This results in multiple JSON files compressed with GZIP; that seems to
 MR> be exactly what I need in this case!

 MR> There's only one thing that I do not really understand:
 MR> Besides the GZIP archives, it also creates two log files. One of them
 MR> (unload-errors.log) contains some Java stack traces. I do not understand
 MR> what those lines are supposed to say:

 MR> (I added it to a pastebin so as not to render the mail unreadable):

 MR> https://pastebin.com/WpYvqxAA

 MR> What are those lines supposed to tell me?
 MR> Marc Richter

 MR> [1] https://docs.datastax.com/en/cassandra-oss/2.2/cassandra/tools/toolsSSTable2Json.html
 MR> [2] https://issues.apache.org/jira/browse/CASSANDRA-9618
 MR> [3] https://downloads.datastax.com/#bulk-loader
 MR> [4] https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkRef.html

 MR> On 22.04.20 16:15, Marc Richter wrote:
 >> This sounds like a promising way; thank you for bringing this up!
 >>
 >> I will see if I can manage it with this approach.
 >>
 >> Best regards,
 >> Marc Richter
 >>
 >>
 >>
 >> On 22.04.20 15:38, Durity, Sean R wrote:
 >>> I thought this might be a one-time use case request. I think my first
 >>> approach would be to use something like dsbulk to unload the data and
 >>> then reload it into a table designed for the query you want to do (as
 >>> long as you have adequate disk space). I think like a DBA/admin first.
 >>> Dsbulk creates CSV files, so you could move that data to any kind of
 >>> database, if you chose.
 >>>
 >>> An alternative approach would be to use a driver that supports paging
 >>> (I think this would be most of them) and write a program to walk the
 >>> data set and output what you need in whatever format you need.
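 >>>
 >>> For illustration, such a program typically walks the table by token
 >>> ranges; a rough sketch in plain CQL, with an arbitrary example range
 >>> (the program would iterate over consecutive ranges until the whole
 >>> ring is covered):
 >>>
 >>>    SELECT signalid, monthyear, insertdate
 >>>    FROM tagdata.central
 >>>    WHERE token(signalid, monthyear) > -9223372036854775808
 >>>      AND token(signalid, monthyear) <= 0;
 >>>
 >>> Driver paging then keeps each such query from pulling too many rows
 >>> into memory at once.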
 >>>
 >>> Or, since this is a single-node scenario, you could try sstable2json to
 >>> export the SSTables (files on disk) into JSON, if that is a more
 >>> workable format for you.
 >>>
 >>> Sean Durity – Staff Systems Engineer, Cassandra
 >>>
 >>> -----Original Message-----
 >>> From: Marc Richter <m...@marc-richter.info>
 >>> Sent: Wednesday, April 22, 2020 6:22 AM
 >>> To: user@cassandra.apache.org
 >>> Subject: [EXTERNAL] Re: Issues, understanding how CQL works
 >>>
 >>> Hi Jeff,
 >>>
 >>> thank you for your exhaustive and detailed answer!
 >>> Also, a very big "Thank you!" to everyone else who replied; I hope you
 >>> understand that I am summarizing all your feedback in this single answer.
 >>>
 >>> From what I understand from your answers, Cassandra seems to be
 >>> optimized to store (and read) data in exactly the way its data structures
 >>> were designed for. That makes it very inflexible, but as a trade-off
 >>> allows it to do that single job very efficiently.
 >>>
 >>> I also understand, the more I dig into Cassandra, that the team I am
 >>> supporting is using Cassandra somewhat wrongly; for example, they have
 >>> only one node and so use neither the load-balancing nor the redundancy
 >>> capabilities Cassandra offers.
 >>> Thus, a maybe relevant side note: all the data resides on just one single
 >>> node; maybe that info is important, because we know on which node the
 >>> data is (I know that Cassandra internally applies the same hashing voodoo
 >>> as if there were 1,000 nodes, but maybe this is important anyway).
 >>>
 >>> Anyway: I do not really care if a query or effort to find this
 >>> information is sub-optimal or very "expensive" in terms of efficiency or
 >>> system load, since this isn't something that I need to extract on a
 >>> regular basis, but only once. Because of that, it doesn't need to be
 >>> optimal or efficient; I also do not care if it blocks the node for
 >>> several hours, since Cassandra will be working on only this single
 >>> request. I really need this info (the most recent "insertdate") only once.
 >>> Considering this, is there a way to do that?
 >>>
 >>>   > Because you didn't provide a signalid and monthyear, it doesn't know
 >>>   > which machine in your cluster to use to start the query.
 >>>
 >>> I know this already; thanks for confirming that I got this right! But
 >>> what do I do if I do not know all the "signalid"s? How can I learn them?
 >>>
 >>> Is it maybe possible to get a full list of all "signalid"s? Or is it
 >>> possible to "re-arrange" the data in the cluster, or something else that
 >>> enables me to learn what the most recent "insertdate" is?
 >>> I really do not care if I need to do some expensive copy-all-data move,
 >>> but I do not know what is possible or how to do it.
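 >>>
 >>> A sketch of one thing I could try, if it works the way I hope - CQL's
 >>> SELECT DISTINCT lists only the partition keys; that is still a full
 >>> scan, but one row per partition instead of one per data row:
 >>>
 >>>    SELECT DISTINCT signalid, monthyear FROM tagdata.central;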
 >>>
 >>> Best regards,
 >>> Marc Richter
 >>>
 >>> On 21.04.20 19:20, Jeff Jirsa wrote:
 >>>>
 >>>>
 >>>> On Tue, Apr 21, 2020 at 6:20 AM Marc Richter <m...@marc-richter.info
 >>>> <mailto:m...@marc-richter.info>> wrote:
 >>>>
 >>>>      Hi everyone,
 >>>>
 >>>>      I'm very new to Cassandra. I have, however, some experience with SQL.
 >>>>
 >>>>
 >>>> The biggest thing to remember is that Cassandra is designed to scale out
 >>>> to massive clusters - like thousands of instances. To do that, you can't
 >>>> assume it's ever OK to read all of the data, because that doesn't scale.
 >>>> So Cassandra takes shortcuts / optimizations to make it possible to
 >>>> ADDRESS all of that data, but not SCAN it.
 >>>>
 >>>>
 >>>>      I need to extract some information from a Cassandra database that has
 >>>>      the following table definition:
 >>>>
 >>>>      CREATE TABLE tagdata.central (
 >>>>      signalid int,
 >>>>      monthyear int,
 >>>>      fromtime bigint,
 >>>>      totime bigint,
 >>>>      avg decimal,
 >>>>      insertdate bigint,
 >>>>      max decimal,
 >>>>      min decimal,
 >>>>      readings text,
 >>>>      PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
 >>>>      )
 >>>>
 >>>>
 >>>> What your primary key REALLY MEANS is:
 >>>>
 >>>> On reads and writes, the database will hash(signalid + monthyear) to
 >>>> find which hosts have the data; then,
 >>>>
 >>>> in each data file, the data for a given (signalid, monthyear) is stored
 >>>> sorted by fromtime and totime.
 >>>>
 >>>>      The database is already roughly 260 GB in size.
 >>>>      I now need to know what the most recent entry in it is; the
 >>>>      correct column to learn this from would be "insertdate".
 >>>>
 >>>>      In SQL I would do something like this:
 >>>>
 >>>>      SELECT insertdate FROM tagdata.central
 >>>>      ORDER BY insertdate DESC LIMIT 1;
 >>>>
 >>>>      In CQL, however, I just can't get it to work.
 >>>>
 >>>>      What I have tried already is this:
 >>>>
 >>>>      SELECT insertdate FROM "tagdata"."central"
 >>>>      ORDER BY insertdate DESC LIMIT 1;
 >>>>
 >>>>
 >>>> Because you didn't provide a signalid and monthyear, it doesn't know
 >>>> which machine in your cluster to use to start the query.
 >>>>
 >>>>
 >>>>      But this gives me an error:
 >>>>      ERROR: ORDER BY is only supported when the partition key is
 >>>>      restricted by an EQ or an IN.
 >>>>
 >>>>
 >>>> Because it's designed for potentially petabytes of data per cluster, it
 >>>> doesn't believe you really want to walk all the data and order ALL of
 >>>> it. Instead, it assumes that when you need to use an ORDER BY, you're
 >>>> going to have some very small piece of data - confined to a single
 >>>> signalid/monthyear pair. And even then, the ORDER is going to assume
 >>>> that you're ordering it by the ordering keys you've defined - fromtime
 >>>> first, and then totime.
 >>>>
 >>>> So you can do
 >>>>
 >>>>    SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime ASC
 >>>> And you can do
 >>>>
 >>>>    SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime DESC
 >>>>
 >>>> And you can do ranges:
 >>>>
 >>>>    SELECT ... WHERE signalid=? and monthyear=? AND fromtime >= ? ORDER BY
 >>>> fromtime DESC
 >>>>
 >>>> But you have to work within the boundaries of how the data is stored.
 >>>> It's stored grouped by signalid+monthyear, and then sorted by fromtime,
 >>>> and then sorted by totime.
 >>>>
 >>>>
 >>>>
 >>>>      So, after some trial and error and a lot of Googling, I learned
 >>>>      that I must include all columns of the PRIMARY KEY, from left to
 >>>>      right, in my query. Thus, this is the "best" I can get to work:
 >>>>
 >>>>
 >>>>      SELECT
 >>>>               *
 >>>>      FROM
 >>>>               "tagdata"."central"
 >>>>      WHERE
 >>>>               "signalid" = 4002
 >>>>               AND "monthyear" = 201908
 >>>>      ORDER BY
 >>>>               "fromtime" DESC
 >>>>      LIMIT 10;
 >>>>
 >>>>
 >>>>      The "monthyear" column, I crafted like a fool by incrementing the 
 >>>> date
 >>>>      one month after another until no results could be found anymore.
 >>>>      The "signalid" I grabbed from one of the unrestricted "SELECT * 
 >>>> FROM" -
 >>>>      query results. But these can't be as easily guessed as the 
 >>>> "monthyear"
 >>>>      values could.
 >>>>
 >>>>      This is where I'm stuck!
 >>>>
 >>>>      1. This does not really feel like the ideal way to go. I think
 >>>>      there is something more mature in modern IT systems. Can anyone
 >>>>      tell me a better way to get this information?
 >>>>
 >>>>
 >>>> You can denormalize. Because Cassandra allows you to have very large
 >>>> clusters, you can make multiple tables sorted in different ways to
 >>>> enable the queries you need to run. Normal data modeling is to build
 >>>> tables based on the SELECT statements you need to do (unless you're very
 >>>> advanced, in which case you do it based on the transaction semantics of
 >>>> the INSERT/UPDATE statements, but that's probably not you).
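 >>>>
 >>>> A minimal sketch of what that could look like here (the table name and
 >>>> day-bucket scheme are invented for illustration; you'd size the bucket
 >>>> so partitions stay bounded):
 >>>>
 >>>>    CREATE TABLE tagdata.central_by_insertdate (
 >>>>        day int,            // e.g. insertdate truncated to a day
 >>>>        insertdate bigint,
 >>>>        signalid int,
 >>>>        monthyear int,
 >>>>        PRIMARY KEY ((day), insertdate, signalid, monthyear)
 >>>>    ) WITH CLUSTERING ORDER BY (insertdate DESC, signalid ASC, monthyear ASC);
 >>>>
 >>>>    SELECT insertdate FROM tagdata.central_by_insertdate
 >>>>    WHERE day = 20200420 LIMIT 1;
 >>>>
 >>>> which returns the newest insertdate within one day-bucket in a single
 >>>> partition read.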
 >>>>
 >>>> Or you can use a more flexible database.
 >>>>
 >>>>
 >>>>      2. I need a way to learn all the values that are in the
 >>>>      "monthyear" and "signalid" columns in order to be able to craft
 >>>>      that query. How can I achieve that in a reasonable way? As I said:
 >>>>      the DB is roughly 260 GB, which makes it next to impossible to
 >>>>      just "have a look" at the output of "SELECT *".
 >>>>
 >>>>
 >>>> You probably want to keep another table of monthyear + signalid pairs.
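 >>>>
 >>>> A minimal sketch, with an invented table name - each write to the main
 >>>> table would also insert its key pair here:
 >>>>
 >>>>    CREATE TABLE tagdata.known_keys (
 >>>>        signalid int,
 >>>>        monthyear int,
 >>>>        PRIMARY KEY ((signalid), monthyear)
 >>>>    );
 >>>>
 >>>> Scanning this table back out is then a walk over tiny rows rather than
 >>>> over the 260 GB main table.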
 >>>
 >>>
 >>




-- 
With best wishes,                    Alex Ott
Principal Architect, DataStax
http://datastax.com/

