Re: Mechanism to Bulk Export from Cassandra on daily Basis

2020-02-21 Thread Peter Corless
Question: would daily deltas be a good use of CDC? (Rather than export
entire tables.)

(I can understand that this might make analytics hard if you need to span
multiple resultant daily files.)

Perhaps along with CDC, maybe set up the tables for export via a Kafka
topic?

(https://docs.lenses.io/connectors/source/cassandra.html)

Or maybe some sort of exporter using Apache Spark?

https://github.com/scylladb/scylla-migrator

I'm just trying to throw out a few other ideas on how to solve the
exportation problem.
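
To make the daily-delta idea a bit more concrete, here is a minimal Python sketch of applying one day's change events to a downstream copy. The event shape (operation, key, row) and the name `apply_changes` are hypothetical; this is not the record format of Cassandra CDC or of any Kafka connector. It only illustrates that a delta feed can carry deletes, provided they are emitted as explicit events.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

# Hypothetical change event: (operation, primary key, row data or None).
ChangeEvent = Tuple[str, str, Optional[Dict[str, Any]]]


def apply_changes(
    snapshot: Dict[str, Dict[str, Any]], events: Iterable[ChangeEvent]
) -> Dict[str, Dict[str, Any]]:
    """Return a new snapshot with one day's change events applied in order."""
    result = dict(snapshot)  # shallow copy; the previous day's dict is not modified
    for op, key, row in events:
        if op == "upsert":
            # Merge the new column values over any existing row for this key.
            result[key] = {**result.get(key, {}), **(row or {})}
        elif op == "delete":
            # The delete arrives as an explicit event, so it is not lost.
            result.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


yesterday = {"a": {"qty": 1}, "b": {"qty": 2}}
todays_events = [
    ("upsert", "a", {"qty": 5}),
    ("delete", "b", None),
    ("upsert", "c", {"qty": 7}),
]
today = apply_changes(yesterday, todays_events)
print(today)
```

Whether this works in practice depends on the change feed actually surfacing deletes and TTL expirations, which would need to be verified for the chosen tool.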

On Fri, Feb 21, 2020, 8:45 AM Durity, Sean R 
wrote:

> I would also push for something besides a full refresh, if at all
> possible. It feels like a waste of resources to me – and not predictably
> scalable. Suggestions: use a queue to send writes to both systems. If the
> downstream system doesn’t handle TTL, perhaps set an expiration date and a
> purge query on the downstream target.
>
>
>
> If you have to do the full refresh, perhaps a Spark job would be a decent
> solution. I would probably create a separate DC (with a lower replication
> factor and smaller number of nodes) just to handle the analytical/unload
> kind of workload (if the other functions of the cluster might be impacted
> by the unload).
>
>
>
> DSBulk from DataStax is very fast and scriptable, too.
>
>
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
>
>
> *From:* JOHN, BIBIN 
> *Sent:* Wednesday, February 19, 2020 5:25 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] RE: Mechanism to Bulk Export from Cassandra on
> daily Basis
>
>
>
> Thank you for the suggestion. The full refresh is the current design because with
> deltas we cannot identify what got deleted, so the downstream systems prefer the
> full data set every day.
>
>
>
>
>
> Thanks
>
> Bibin John
>
>
>
> *From:* Reid Pinchback 
> *Sent:* Wednesday, February 19, 2020 3:14 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Mechanism to Bulk Export from Cassandra on daily Basis
>
>
>
> To the question of ‘best approach’, so far the comments have been about
> alternatives in tools.
>
>
>
> Another axis you might want to consider is from the data model viewpoint.
> So, for example, let’s say you have 600M rows.  You want to do a daily
> transfer of data for some reason.  First question that comes to mind is, do
> you need all the data every day?  Usually that would only be the case if
> all of the data is at risk of changing.
>
>
>
> Generally the way I’d cut down the pain on something like this is to
> figure out if the data model currently does, or could be made to, only
> mutate in a limited subset.  Then maybe all you are transferring are the
> daily changes.  Systems based on catching up to daily changes will usually
> be pulling single-digit percentages of data volume compared to the entire
> storage footprint.  That’s not only a lot less data to pull, it’s also a
> lot less impact on the ongoing operations of the cluster while you are
> pulling that data.
>
>
> R
>
>
>
> *From: *"JOHN, BIBIN" 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Wednesday, February 19, 2020 at 1:13 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Mechanism to Bulk Export from Cassandra on daily Basis
>
>
>
> *Message from External Sender*
>
> Team,
>
> We have a requirement to bulk export data from Cassandra on a daily basis.
> The table contains close to 600M records and the cluster has 12 nodes. What is
> the best approach to do this?
>
>
>
>
>
> Thanks
>
> Bibin John
>
> --
>
> The information in this Internet Email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this Email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful. When addressed
> to our clients any opinions or advice contained in this Email are subject
> to the terms and conditions expressed in any applicable governing The Home
> Depot terms of business or client engagement letter. The Home Depot
> disclaims all responsibility and liability for the accuracy and content of
> this attachment and for any damages or losses arising from any
> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
> items of a destructive nature, which may be contained in this attachment
> and shall not be liable for direct, indirect, consequential or special
> damages in connection with this e-mail message or its attachment.
>


Re: Predicting Read/Write Latency as a Function of Total Requests & Cluster Size

2019-12-10 Thread Peter Corless
The theoretical answer involves Little's Law
<https://en.wikipedia.org/wiki/Little%27s_law> (*L=λW*). But the practical
experience is, as you say, dependent on a fair number of factors. We wrote
a recent blog
<https://www.scylladb.com/2019/11/20/maximizing-performance-via-concurrency-while-minimizing-timeouts-in-distributed-databases/>
that
should be applicable to your thought processes about parallelism,
throughput, latency, and timeouts.
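
As a small worked example of Little's Law, here is a Python sketch. The numbers (10,000 ops/s, 5 ms mean latency) are made up for illustration, and the function names are my own. The law only relates long-run averages; it does not predict how latency changes as a cluster approaches saturation.

```python
def required_concurrency(throughput_ops_per_s: float, mean_latency_s: float) -> float:
    """Little's Law, L = lambda * W: average number of requests in flight."""
    return throughput_ops_per_s * mean_latency_s


def mean_latency(concurrency: float, throughput_ops_per_s: float) -> float:
    """Little's Law rearranged, W = L / lambda: average time per request."""
    if throughput_ops_per_s <= 0:
        raise ValueError("throughput must be positive")
    return concurrency / throughput_ops_per_s


# Illustrative numbers only: 10,000 ops/s at 5 ms mean latency.
in_flight = required_concurrency(10_000, 0.005)
print(f"average requests in flight: {in_flight:.0f}")

# With the same 50 requests in flight but only 5,000 ops/s of throughput,
# the mean latency doubles.
print(f"mean latency: {mean_latency(50, 5_000) * 1000:.0f} ms")
```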

Earlier this year, we also wrote a blog about sizing Scylla clusters
<https://www.scylladb.com/2019/06/20/sizing-up-your-scylla-cluster/> that
touches on latency and throughput. For example, a general rule of thumb is
that with the current generation of Intel cores, for payloads of <1 KB you
can get ~12.5k ops/core with Scylla. If there are similar blogs about
sizing Cassandra clusters, I'd be interested in reading them as well!
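
For a back-of-the-envelope use of that rule of thumb, here is a short Python sketch. The 12,500 ops/core figure is the Scylla number quoted above for small payloads; the function name and the target workloads are hypothetical, and the result ignores replication factor, headroom, and caching, so treat it as a lower bound at best.

```python
import math

OPS_PER_CORE = 12_500  # rule of thumb quoted above (Scylla, payloads under 1 KB)


def cores_needed(target_ops_per_s: int, ops_per_core: int = OPS_PER_CORE) -> int:
    """Estimate the minimum core count for a target throughput, rounding up."""
    if target_ops_per_s < 0 or ops_per_core <= 0:
        raise ValueError("throughput must be non-negative and ops_per_core positive")
    return math.ceil(target_ops_per_s / ops_per_core)


for target in (100_000, 250_000):
    print(f"{target} ops/s -> at least {cores_needed(target)} cores")
```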

Also, in terms of latency, I want to point out that there is a great deal
dependent on the nature of your data, queries and caching. For example, if
you have a very low cache hit rate, expect greater latencies — data will
still need to be read from storage even if you add more nodes.

On Tue, Dec 10, 2019 at 6:57 AM Fred Habash  wrote:

> I'm looking for an empirical way to answer these two questions:
>
> 1. If I increase application workload (read/write requests) by some
> percentage, how is it going to affect read/write latency? Of course, all
> other factors remain constant, e.g. EC2 instance class, SSD specs, number
> of nodes, etc.
>
> 2. How many nodes do I have to add to maintain a given read/write latency?
>
> Are there any methods or instruments out there that can help answer
> these que
>
>
>
> --------
> Thank you
>
>
>

-- 
Peter Corless
Technical Marketing Manager
ScyllaDB
e: pe...@scylladb.com
t: @petercorless <https://twitter.com/PeterCorless>
v: 650-906-3134


Re: Released an ACID-compliant transaction library on top of Cassandra

2019-01-16 Thread Peter Corless
>>> > > Regarding the licensing, we are thinking of releasing it with Apache 2
>>> > > if lots of developers are interested in it.
>>> > >
>>> > > Best regards,
>>> > > Hiroyuki
>>> > > On Wed, Oct 17, 2018 at 3:13 AM Jonathan Ellis 
>>> wrote:
>>> > > >
>>> > > > Which was followed up by
>>> https://www.researchgate.net/profile/Akon_Dey/publication/282156834_Scalable_Distributed_Transactions_across_Heterogeneous_Stores/links/56058b9608ae5e8e3f32b98d.pdf
>>> > > >
>>> > > > On Tue, Oct 16, 2018 at 1:02 PM Jonathan Ellis 
>>> wrote:
>>> > > >>
>>> > > >> It looks like it's based on this:
>>> http://www.vldb.org/pvldb/vol6/p1434-dey.pdf
>>> > > >>
>>> > > >> On Tue, Oct 16, 2018 at 11:37 AM Ariel Weisberg <
>>> ar...@weisberg.ws> wrote:
>>> > > >>>
>>> > > >>> Hi,
>>> > > >>>
>>> > > >>> Yes this does sound great. Does this rely on Cassandra's
>>> > > >>> internal SERIAL consistency and CAS functionality or is that
>>> > > >>> implemented at a higher level?
>>> > > >>>
>>> > > >>> Regards,
>>> > > >>> Ariel
>>> > > >>>
>>> > > >>> On Tue, Oct 16, 2018, at 12:31 PM, Jeff Jirsa wrote:
>>> > > >>> > This is great!
>>> > > >>> >
>>> > > >>> > --
>>> > > >>> > Jeff Jirsa
>>> > > >>> >
>>> > > >>> >
>>> > > >>> > > On Oct 16, 2018, at 5:47 PM, Hiroyuki Yamada <
>>> mogwa...@gmail.com> wrote:
>>> > > >>> > >
>>> > > >>> > > Hi all,
>>> > > >>> > >
>>> > > >>> > > # Sorry, I accidentally emailed the following to dev@, so
>>> > > >>> > > re-sending to here.
>>> > > >>> > >
>>> > > >>> > > We have been working on ACID-compliant transaction library
>>> > > >>> > > on top of Cassandra called Scalar DB,
>>> > > >>> > > and are pleased to announce the release of v.1.0 RC version
>>> > > >>> > > in open source.
>>> > > >>> > >
>>> > > >>> > > https://github.com/scalar-labs/scalardb/
>>> > > >>> > >
>>> > > >>> > > Scalar DB is a library that provides a distributed storage
>>> > > >>> > > abstraction and client-coordinated distributed transaction on the
>>> > > >>> > > storage, and makes non-ACID distributed database/storage
>>> > > >>> > > ACID-compliant.
>>> > > >>> > > And Cassandra is the first supported database implementation.
>>> > > >>> > >
>>> > > >>> > > It's been internally tested intensively and is jepsen-passed.
>>> > > >>> > > (see jepsen directory for more detail)
>>> > > >>> > > If you are looking for ACID transaction capability on top of
>>> > > >>> > > cassandra,
>>> > > >>> > > Please take a look and give us a feedback or contribution.
>>> > > >>> > >
>>> > > >>> > > Best regards,
>>> > > >>> > > Hiroyuki Yamada
>>> > > >>> > >
>>> > > >>> > >
>>> -
>>> > > >>> > > To unsubscribe, e-mail:
>>> user-unsubscr...@cassandra.apache.org
>>> > > >>> > > For additional commands, e-mail:
>>> user-h...@cassandra.apache.org
>>> > > >>> > >
>>> > > >>> >
>>> > > >>> >
>>> > > >>> >
>>> > > >>>
>>> > > >>>
>>> > > >>>
>>> > > >>
>>> > > >>
>>> > > >> --
>>> > > >> Jonathan Ellis
>>> > > >> co-founder, http://www.datastax.com
>>> > > >> @spyced
>>> > > >
>>> > > >
>>> > > >
>>> > > > --
>>> > > > Jonathan Ellis
>>> > > > co-founder, http://www.datastax.com
>>> > > > @spyced
>>> >
>>> >
>>>
>>>
>>>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
>


-- 
Peter Corless
Technical Marketing Manager
pe...@scylladb.com
650-906-3134


Re: Modeling Time Series data

2019-01-11 Thread Peter Corless
Hello Akash!

For Time Series, I'd make two recommendations:

   - Check out KairosDB <https://kairosdb.github.io/>. It works on
   Cassandra as well as Scylla; the project has been around for a while, and
   Brian Hawkins has spoken a great deal on KairosDB, both at Scylla Summit 2017
   <https://www.scylladb.com/tech-talk/learn-build-time-series-database-scylla-summit-2017/>
   and 2018
   <https://www.scylladb.com/tech-talk/scylla-and-kairosdb-in-smart-vehicle-diagnostics/>.
   - Also check out Newts <http://opennms.github.io/newts/>. It is also usable
   with both Cassandra and Scylla. This comes from the folks at OpenNMS, and
   Jesse White had a great talk at Scylla Summit 2018
   <https://www.scylladb.com/tech-talk/scaling-your-time-series-data-with-newts/>
   on the topic.

While you *can* do basic time series natively in Scylla or Cassandra
without such add-ons, look at these projects to do it with more emphasis on
the 'Time Series' part.
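
On the hotspot question for the native approach: in Cassandra, a partition is placed by hashing its partition key, and the table name is not part of that hash, so a key of (year, month, day, hour) sends every write in a given hour to one partition. A common mitigation is to add a bucket component to the partition key. Here is a rough Python sketch of the effect. It uses MD5 and a simple modulo as stand-ins for the real partitioner and token ranges, and the bucket count of 16 and the node count of 12 are arbitrary assumptions.

```python
import hashlib
from collections import Counter

NUM_NODES = 12    # assumed cluster size, for illustration only
NUM_BUCKETS = 16  # assumed bucket count; tune to write volume


def token(partition_key: tuple) -> int:
    """Stand-in for a partitioner: hash the partition key to an integer."""
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(partition_key: tuple, num_nodes: int = NUM_NODES) -> int:
    """Simplified placement; real clusters map tokens to ranges and vnodes."""
    return token(partition_key) % num_nodes


hour_key = (2019, 1, 11, 14)

# Without a bucket, all 1,000 writes in the hour share one partition key.
plain = Counter(owner(hour_key) for _ in range(1000))

# With a bucket, the writes are spread over NUM_BUCKETS partition keys.
bucketed_keys = {hour_key + (i % NUM_BUCKETS,) for i in range(1000)}
bucketed = Counter(owner(hour_key + (i % NUM_BUCKETS,)) for i in range(1000))

print("nodes receiving writes without bucket:", len(plain))
print("distinct partitions with bucket:", len(bucketed_keys))
print("nodes receiving writes with bucket:", len(bucketed))
```

The trade-off is on the read side: fetching one hour of data then means querying every bucket for that hour and merging the results.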

Would love to hear your thoughts, as well as the thoughts of others, if
they've done any analysis, or actually tested or gone to production with
either of these two projects.

Your friendly neighborhood,

-Peter.

On Fri, Jan 11, 2019 at 2:45 PM Akash Gangil  wrote:

> Hi,
>
> I have a data model where the partition key for a lot of tables is based
> on time
> (year, month, day, hour)
>
> Would this create a hotspot in my cluster, given all the writes/reads
> would go to the same node for a given hour? Or does the Cassandra storage
> engine also take into account table info, like the table name, when
> distributing the data?
>
> If the above model would be a problem, what's the suggested way to solve
> this? Add tablename to partition key?
>
>
> --
> Akash
>


-- 
Peter Corless
Technical Marketing Manager
pe...@scylladb.com
650-906-3134


Re: Alter table

2018-12-17 Thread Peter Corless
Alter table would change columns (the structure) of a table. Adding or
deleting a column, for instance.

Upserts would add (or edit) rows of an existing table.

ALTER TABLE <https://docs.scylladb.com/getting-started/ddl/#id10> vs. UPDATE
<https://docs.scylladb.com/getting-started/dml/#update-statement>

(These docs are for Scylla, but the syntax should be the same or similar in
Cassandra and DataStax.)
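
To show the difference in runnable form, here is a small Python sketch. It uses Python's built-in sqlite3 module as a stand-in, not Cassandra or Scylla. `INSERT OR REPLACE` is only SQLite's nearest analogue to a CQL upsert (it replaces the whole row rather than merging columns), and the `users` table is made up for the example.

```python
import sqlite3


def demo():
    """Contrast a schema change (ALTER TABLE) with a row-level upsert."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    # Upsert: adds the row if the key is new, otherwise overwrites it.
    conn.execute("INSERT OR REPLACE INTO users (id, name) VALUES (1, 'Ada')")

    # ALTER TABLE: changes the structure of the table, not its rows.
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

    # Upsert again: same key, so the existing row is overwritten.
    conn.execute(
        "INSERT OR REPLACE INTO users (id, name, email) "
        "VALUES (1, 'Ada', 'ada@example.com')"
    )

    # Column 1 of each table_info row is the column name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
    rows = conn.execute("SELECT id, name, email FROM users").fetchall()
    conn.close()
    return columns, rows


columns, rows = demo()
print(columns)
print(rows)
```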

On Mon, Dec 17, 2018 at 1:45 PM Mark Furlong  wrote:

> Why would I want to use alter table vs upserts with the new document
> format?
>
>
>
> *Mark Furlong*
>
> Sr. Database Administrator
>
> *mfurl...@ancestry.com *
> M: 801-859-7427
>
> O: 801-705-7115
>
> 1300 W Traverse Pkwy
>
> Lehi, UT 84043
>
>


-- 
Peter Corless
Technical Marketing Manager
pe...@scylladb.com
650-906-3134


Re: [EXTERNAL] Upcoming Cassandra-related Conferences

2018-10-08 Thread Peter Corless
Hey folks!

Sean: I did a blog on Distributed Data Summit
<https://www.scylladb.com/2018/09/19/overheard-at-distributed-data-summit/>.
On top of the Scylla-oriented content, I covered Nate's keynote and
highlighted the sidecar talk by Netflix (incl. YouTube video for anyone who
wanted to watch it after-the-fact). I'd be interested to read & compare any
other similar blogs on the event. (Summit-as-Rashomon, as it were.)

Max: Thanks for the shout-out on Scylla Summit
<https://www.scylladb.com/summit-2018-schedule/>. Besides Scylla-oriented
tracks, we'll also have presentations from Kong, Kafka (KSQL), KairosDB for
time-series, OpenNMS, and Red Hat talking about rebuilding Ceph on the
Seastar framework.

-Pete.

On Mon, Oct 8, 2018 at 7:29 AM Durity, Sean R 
wrote:

> Thank you. I do want to hear about future conferences. I would also love
> to hear reports/summaries/highlights from folks who went to Distributed
> Data Summit (or other conferences). I think user conferences are great!
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Max C. 
> *Sent:* Friday, October 05, 2018 8:33 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Upcoming Cassandra-related Conferences
>
>
>
> Some upcoming Cassandra-related conferences, if anyone is interested:
>
>
>
> *Scylla Summit*
>
> November 5-7, 2018
>
> Pullman San Francisco Bay Hotel, Redwood City CA
>
> https://www.scylladb.com/scylla-summit-2018/
>
>
>
> (This one seems to be almost entirely Scylla focussed, maybe not terribly
> useful for non-Scylla users)
>
>
>
> *DataStax Accelerate*
>
> May 21-23, 2019
> National Harbor, Maryland
>
> https://www.datastax.com/accelerate
>
>
>
> (No talks list or sponsors have been posted yet)
>
>
>
> *DISCLAIMER:*
>
> I’m not in the middle of the politics, nor do I have any affiliation
> with either of these companies.  I just thought lowly users like myself
> might appreciate the mention of these on the -users list.
>
>
>
> I wish we had had a post or two about the Distributed Data Summit;
>  I think we probably would have had an even better conference!  :-)
>
>
>
> - Max
>
>


-- 
Peter Corless
Technical Marketing Manager
pe...@scylladb.com
650-906-3134