DSBulk also works with JSON... if transformations of data are complex, I would go with Spark running in local mode, and process data...
On Wed, Apr 22, 2020 at 3:38 PM Durity, Sean R <sean_r_dur...@homedepot.com> wrote: > I thought this might be a single-time use case request. I think my first > approach would be to use something like dsbulk to unload the data and then > reload it into a table designed for the query you want to do (as long as > you have adequate disk space). I think like a DBA/admin first. Dsbulk > creates csv files, so you could move that data to any kind of database, if > you chose. > > An alternative approach would be to use a driver that supports paging (I > think this would be most of them) and write a program to walk the data set > and output what you need in whatever format you need. > > Or, since this is a single node scenario, you could try sstable2json to > export the sstables (files on disk) into JSON, if that is a more workable > format for you. > > Sean Durity – Staff Systems Engineer, Cassandra > > -----Original Message----- > From: Marc Richter <m...@marc-richter.info> > Sent: Wednesday, April 22, 2020 6:22 AM > To: user@cassandra.apache.org > Subject: [EXTERNAL] Re: Issues, understanding how CQL works > > Hi Jeff, > > thank you for your exhaustive and verbose answer! > Also, a very big "Thank you!" to all the other replyers; I hope you > understand that I summarize all your feedback in this single answer. > > From what I understand from your answers, Cassandra seems to be > optimized to store (and read) data in only exactly that way that the > data structure has been designed for. That makes it very inflexible, but > allows it to do that single job very effectively for a trade-off. > > I also understand, the more I dig into Cassandra, that the team I am > supporting is using Cassandra kind of wrong; they for example do have > only one node and so do not use neither the load-balancing, nor the > redundancy-capabilities Cassandra offers. > Thus, maybe relevant side-note: All the data resides on just one single > node; maybe that info is important, because we know on which node the > data is (I know that Cassandra internally is applying the same Hashing - > Voodoo as if there were 1k nodes, but maybe this is important anyways). > > Anyways: I do not really care if a query or effort to find this > information is sub-optimal or very "expensive" in means of effectivity > or system-load, since this isn't something that I need to extract on a > regular basis, but only once. Due to that, it doesn't need to be optimal > or effective; I also do not care if it blocks the node for several > hours, since Cassandra is only working on this single request. I really > need this info (most recent "insertdate") only once. > Is, considering this, a way to do that? > > > Because you didnt provide a signalid and monthyear, it doesn't know > > which machine in your cluster to use to start the query. > > I know this already; thanks for confirming that I got this correct! But > what do I do then if I do not know all "signalid"s? How to learn them? > > Is it maybe possible to get a full list of all "signalid"s? Or is it > possible to "re-arrange" the data in the cluster or something that > enables me to learn what's the most recent "insertdate"? > I really do not care if I need to do some expensive copy-all-data - > move, but I do not know about what is possible and how to do that. > > Best regards, > Marc Richter > > On 21.04.20 19:20, Jeff Jirsa wrote: > > > > > > On Tue, Apr 21, 2020 at 6:20 AM Marc Richter <m...@marc-richter.info > > <mailto:m...@marc-richter.info>> wrote: > > > > Hi everyone, > > > > I'm very new to Cassandra. I have, however, some experience with SQL. > > > > > > The biggest thing to remember is that Cassandra is designed to scale out > > to massive clusters - like thousands of instances. To do that, you can't > > assume it's ever ok to read all of the data, because that doesn't scale. > > So cassandra takes shortcuts / optimizations to make it possible to > > ADDRESS all of that data, but not SCAN it. > > > > > > I need to extract some information from a Cassandra database that has > > the following table definition: > > > > CREATE TABLE tagdata.central ( > > signalid int, > > monthyear int, > > fromtime bigint, > > totime bigint, > > avg decimal, > > insertdate bigint, > > max decimal, > > min decimal, > > readings text, > > PRIMARY KEY (( signalid, monthyear ), fromtime, totime) > > ) > > > > > > What your primary key REALLY MEANS is: > > > > The database on reads and writes will hash(signalid+monthyear) to find > > which hosts have the data, then > > > > In each data file, the data for a given (signalid,monthyear) is stored > > sorted by fromtime and totime > > > > The database is already of round about 260 GB in size. > > I now need to know what is the most recent entry in it; the correct > > column to learn this would be "insertdate". > > > > In SQL I would do something like this: > > > > SELECT insertdate FROM tagdata.central > > ORDER BY insertdate DESC LIMIT 1; > > > > In CQL, however, I just can't get it to work. > > > > What I have tried already is this: > > > > SELECT insertdate FROM "tagdata.central" > > ORDER BY insertdate DESC LIMIT 1; > > > > > > Because you didnt provide a signalid and monthyear, it doesn't know > > which machine in your cluster to use to start the query. > > > > > > But this gives me an error: > > ERROR: ORDER BY is only supported when the partition key is > restricted > > by an EQ or an IN. > > > > > > Because it's designed for potentially petabytes of data per cluster, it > > doesn't believe you really want to walk all the data and order ALL of > > it. Instead, it assumes that when you need to use an ORDER BY, you're > > going to have some very small piece of data - confined to a single > > signalid/monthyear pair. And even then, the ORDER is going to assume > > that you're ordering it by the ordering keys you've defined - fromtime > > first, and then totime. > > > > So you can do > > > > SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime ASC > > And you can do > > > > SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime DESC > > > > And you can do ranges: > > > > SELECT ... WHERE signalid=? and monthyear=? AND fromtime >= ? ORDER BY > > fromtime DESC > > > > But you have to work within the boundaries of how the data is stored. > > It's stored grouped by signalid+monthyear, and then sorted by fromtime, > > and then sorted by totime. > > > > > > > > So, after some trial and error and a lot of Googling, I learned that > I > > must include all rows from the PRIMARY KEY from left to right in my > > query. Thus, this is the "best" I can get to work: > > > > > > SELECT > > * > > FROM > > "tagdata.central" > > WHERE > > "signalid" = 4002 > > AND "monthyear" = 201908 > > ORDER BY > > "fromtime" DESC > > LIMIT 10; > > > > > > The "monthyear" column, I crafted like a fool by incrementing the > date > > one month after another until no results could be found anymore. > > The "signalid" I grabbed from one of the unrestricted "SELECT * > FROM" - > > query results. But these can't be as easily guessed as the > "monthyear" > > values could. > > > > This is where I'm stuck! > > > > 1. This does not really feel like the ideal way to go. I think there > is > > something more mature in modern IT systems. Can anyone tell me what > > is a > > better way to get these informations? > > > > > > You can denormalize. Because cassandra allows you to have very large > > clusters, you can make multiple tables sorted in different ways to > > enable the queries you need to run. Normal data modeling is to build > > tables based on the SELECT statements you need to do (unless you're very > > advanced, in which case you do it based on the transaction semantics of > > the INSERT/UPDATE statements, but that's probably not you). > > > > Or you can use a more flexible database. > > > > > > 2. I need a way to learn all values that are in the "monthyear" and > > "signalid" columns in order to be able to craft that query. > > How can I achieve that in a reasonable way? As I said: The DB is > round > > about 260 GB which makes it next to impossible to just "have a look" > at > > the output of "SELECT *".. > > > > > > You probably want to keep another table of monthyear + signalid pairs. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org > For additional commands, e-mail: user-h...@cassandra.apache.org > > > ________________________________ > > The information in this Internet Email is confidential and may be legally > privileged. It is intended solely for the addressee. Access to this Email > by anyone else is unauthorized. If you are not the intended recipient, any > disclosure, copying, distribution or any action taken or omitted to be > taken in reliance on it, is prohibited and may be unlawful. When addressed > to our clients any opinions or advice contained in this Email are subject > to the terms and conditions expressed in any applicable governing The Home > Depot terms of business or client engagement letter. The Home Depot > disclaims all responsibility and liability for the accuracy and content of > this attachment and for any damages or losses arising from any > inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other > items of a destructive nature, which may be contained in this attachment > and shall not be liable for direct, indirect, consequential or special > damages in connection with this e-mail message or its attachment. > -- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)