Seems as if sstable2json is deprecated; see [1] and [2].

So, dsbulk [3] it is, I guess.

I downloaded it and crafted the following commandline from the docs [4] for my use case:

$ ../dsbulk-1.5.0/bin/dsbulk unload -h '["MY_CASSANDRA_IP"]' \
  --driver.advanced.auth-provider.class PlainTextAuthProvider \
  -u cassandra -p MY_PASSWORD -k tagdata -t central -c json \
  --connector.json.compression gzip -url /path/to/big/storage

This seems to result in multiple JSON files compressed with GZIP; seems to be exactly what I needed to help me in this case!

There's only one thing that I do not really understand what it means:
Besides the GZIP archives, it also creates two logfiles. One of them (unload-errors.log) contains some Java stacks. I do not understand what those lines are supposed to say:

(Added it to pastebin to not render the mail unreadable):

https://pastebin.com/WpYvqxAA

What are those lines supposed to tell me?

Best regards,
Marc Richter




[1] https://docs.datastax.com/en/cassandra-oss/2.2/cassandra/tools/toolsSSTable2Json.html
[2] https://issues.apache.org/jira/browse/CASSANDRA-9618
[3] https://downloads.datastax.com/#bulk-loader
[4] https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkRef.html

On 22.04.20 16:15, Marc Richter wrote:
This sounds like a promising way; thank you for bringing this up!

I will see if I can manage it with this approach.

Best regards,
Marc Richter



On 22.04.20 15:38, Durity, Sean R wrote:
I thought this might be a single-time use case request. I think my first approach would be to use something like dsbulk to unload the data and then reload it into a table designed for the query you want to do (as long as you have adequate disk space). I think like a DBA/admin first. Dsbulk creates csv files, so you could move that data to any kind of database, if you chose.

An alternative approach would be to use a driver that supports paging (I think this would be most of them) and write a program to walk the data set and output what you need in whatever format you need.

Or, since this is a single node scenario, you could try sstable2json to export the sstables (files on disk) into JSON, if that is a more workable format for you.

Sean Durity – Staff Systems Engineer, Cassandra

-----Original Message-----
From: Marc Richter <m...@marc-richter.info>
Sent: Wednesday, April 22, 2020 6:22 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Issues, understanding how CQL works

Hi Jeff,

thank you for your exhaustive and verbose answer!
Also, a very big "Thank you!" to all the other replyers; I hope you
understand that I summarize all your feedback in this single answer.

  From what I understand from your answers, Cassandra seems to be
optimized to store (and read) data in only exactly that way that the
data structure has been designed for. That makes it very inflexible, but
allows it to do that single job very effectively for a trade-off.

I also understand, the more I dig into Cassandra, that the team I am
supporting is using Cassandra kind of wrong; they for example do have
only one node and so do not use neither the load-balancing, nor the
redundancy-capabilities Cassandra offers.
Thus, maybe relevant side-note: All the data resides on just one single
node; maybe that info is important, because we know on which node the
data is (I know that Cassandra internally is applying the same Hashing -
Voodoo as if there were 1k nodes, but maybe this is important anyways).

Anyways: I do not really care if a query or effort to find this
information is sub-optimal or very "expensive" in means of effectivity
or system-load, since this isn't something that I need to extract on a
regular basis, but only once. Due to that, it doesn't need to be optimal
or effective; I also do not care if it blocks the node for several
hours, since Cassandra is only working on this single request. I really
need this info (most recent "insertdate") only once.
Is, considering this, a way to do that?

  > Because you didnt provide a signalid and monthyear, it doesn't know
  > which machine in your cluster to use to start the query.

I know this already; thanks for confirming that I got this correct! But
what do I do then if I do not know all "signalid"s? How to learn them?

Is it maybe possible to get a full list of all "signalid"s? Or is it
possible to "re-arrange" the data in the cluster or something that
enables me to learn what's the most recent "insertdate"?
I really do not care if I need to do some expensive copy-all-data -
move, but I do not know about what is possible and how to do that.

Best regards,
Marc Richter

On 21.04.20 19:20, Jeff Jirsa wrote:


On Tue, Apr 21, 2020 at 6:20 AM Marc Richter <m...@marc-richter.info
<mailto:m...@marc-richter.info>> wrote:

     Hi everyone,

     I'm very new to Cassandra. I have, however, some experience with SQL.


The biggest thing to remember is that Cassandra is designed to scale out
to massive clusters - like thousands of instances. To do that, you can't
assume it's ever ok to read all of the data, because that doesn't scale.
So cassandra takes shortcuts / optimizations to make it possible to
ADDRESS all of that data, but not SCAN it.


     I need to extract some information from a Cassandra database that has
     the following table definition:

     CREATE TABLE tagdata.central (
     signalid int,
     monthyear int,
     fromtime bigint,
     totime bigint,
     avg decimal,
     insertdate bigint,
     max decimal,
     min decimal,
     readings text,
     PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
     )


What your primary key REALLY MEANS is:

The database on reads and writes will hash(signalid+monthyear) to find
which hosts have the data, then

In each data file, the data for a given (signalid,monthyear) is stored
sorted by fromtime and totime

     The database is already of round about 260 GB in size.
     I now need to know what is the most recent entry in it; the correct
     column to learn this would be "insertdate".

     In SQL I would do something like this:

     SELECT insertdate FROM tagdata.central
     ORDER BY insertdate DESC LIMIT 1;

     In CQL, however, I just can't get it to work.

     What I have tried already is this:

     SELECT insertdate FROM "tagdata.central"
     ORDER BY insertdate DESC LIMIT 1;


Because you didnt provide a signalid and monthyear, it doesn't know
which machine in your cluster to use to start the query.


     But this gives me an error:
     ERROR: ORDER BY is only supported when the partition key is restricted
     by an EQ or an IN.


Because it's designed for potentially petabytes of data per cluster, it
doesn't believe you really want to walk all the data and order ALL of
it. Instead, it assumes that when you need to use an ORDER BY, you're
going to have some very small piece of data - confined to a single
signalid/monthyear pair. And even then, the ORDER is going to assume
that you're ordering it by the ordering keys you've defined - fromtime
first, and then totime.

So you can do

   SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime ASC
And you can do

   SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime DESC

And you can do ranges:

   SELECT ... WHERE signalid=? and monthyear=? AND fromtime >= ? ORDER BY
fromtime DESC

But you have to work within the boundaries of how the data is stored.
It's stored grouped by signalid+monthyear, and then sorted by fromtime,
and then sorted by totime.



     So, after some trial and error and a lot of Googling, I learned that I
     must include all rows from the PRIMARY KEY from left to right in my
     query. Thus, this is the "best" I can get to work:


     SELECT
              *
     FROM
              "tagdata.central"
     WHERE
              "signalid" = 4002
              AND "monthyear" = 201908
     ORDER BY
              "fromtime" DESC
     LIMIT 10;


     The "monthyear" column, I crafted like a fool by incrementing the date
     one month after another until no results could be found anymore.
     The "signalid" I grabbed from one of the unrestricted "SELECT * FROM" -      query results. But these can't be as easily guessed as the "monthyear"
     values could.

     This is where I'm stuck!

     1. This does not really feel like the ideal way to go. I think there is
     something more mature in modern IT systems. Can anyone tell me what
     is a
     better way to get these informations?


You can denormalize. Because cassandra allows you to have very large
clusters, you can make multiple tables sorted in different ways to
enable the queries you need to run. Normal data modeling is to build
tables based on the SELECT statements you need to do (unless you're very
advanced, in which case you do it based on the transaction semantics of
the INSERT/UPDATE statements, but that's probably not you).

Or you can use a more flexible database.


     2. I need a way to learn all values that are in the "monthyear" and
     "signalid" columns in order to be able to craft that query.
     How can I achieve that in a reasonable way? As I said: The DB is round      about 260 GB which makes it next to impossible to just "have a look" at
     the output of "SELECT *"..


You probably want to keep another table of monthyear + signalid pairs.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org


________________________________

The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org

Reply via email to