Re: Predicting Read/Write Latency as a Function of Total Requests & Cluster Size

2019-12-10 Thread Peter Corless
The theoretical answer involves Little's Law (*L = λW*). But the practical
experience is, as you say, dependent on a fair number of factors. We wrote a
recent blog that should be applicable to your thought processes about
parallelism, throughput, latency, and timeouts.
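
(A rough worked example of L = λW with made-up numbers: at a sustained arrival
rate λ of 50,000 reads/sec and a mean latency W of 2 ms, you have
L = 50,000 × 0.002 = 100 requests in flight on average. If W degrades to 10 ms
at the same arrival rate, in-flight concurrency grows to 500, which is where
queueing and client timeouts start to bite.)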

Earlier this year, we also wrote a blog about sizing Scylla clusters that
touches on latency and throughput. For example, a general rule of thumb is
that with the current generation of Intel cores, for payloads of <1kb you
can get ~12.5k ops/core with Scylla. If there are similar blogs about
sizing Cassandra clusters, I'd be interested in reading them as well!
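
To make that rule of thumb concrete, here is a back-of-envelope sketch in Java.
The ~12.5k ops/core figure, the core count per node, the headroom factor, and
the workload numbers are all assumptions you would replace with your own
measurements; it's a starting point for capacity planning, not a substitute for
benchmarking.

// Rough cluster-size estimate from a throughput target.
// Every constant below is an assumption for illustration only.
public class SizingSketch {
    static final double OPS_PER_CORE = 12_500;  // rule-of-thumb ops/sec/core for <1kb payloads
    static final int CORES_PER_NODE = 16;       // hypothetical instance size
    static final double HEADROOM = 0.6;         // run at ~60% of peak to protect tail latency

    public static void main(String[] args) {
        double targetClientOpsPerSec = 600_000; // hypothetical client-facing workload
        int replicationFactor = 3;              // each write lands on RF replicas

        // Pessimistic: treat every op as touching RF replicas (QUORUM reads touch fewer).
        double replicaOpsPerSec = targetClientOpsPerSec * replicationFactor;
        double usableOpsPerNode = OPS_PER_CORE * CORES_PER_NODE * HEADROOM;
        long nodes = (long) Math.ceil(replicaOpsPerSec / usableOpsPerNode);

        System.out.printf("Estimated nodes: %d%n", nodes);
    }
}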

Also, in terms of latency, I want to point out that a great deal depends on
the nature of your data, queries, and caching. For example, if you have a very
low cache hit rate, expect greater latencies: data will still need to be read
from storage even if you add more nodes.

On Tue, Dec 10, 2019 at 6:57 AM Fred Habash  wrote:

> I'm looking for an empirical way to answer these two question:
>
> 1. If I increase application work load (read/write requests) by some
> percentage, how is it going to affect read/write latency. Of course, all
> other factors remaining constant e.g. ec2 instance class, ssd specs, number
> of nodes, etc.
>
> 2) How many nodes do I have to add to maintain a given read/write latency?
>
> Are there are any methods or instruments out there that can help answer
> these que
>
>
>
> 
> Thank you
>
>
>

-- 
Peter Corless
Technical Marketing Manager
ScyllaDB
e: pe...@scylladb.com
t: @petercorless 
v: 650-906-3134


Re: Connection Pooling in v4.x Java Driver

2019-12-10 Thread Jon Haddad
I'm not sure how closely the driver maintainers are following this list.
You might want to ask on the Java Driver mailing list:
https://groups.google.com/a/lists.datastax.com/forum/#!forum/java-driver-user




On Tue, Dec 10, 2019 at 5:10 PM Caravaggio, Kevin <
kevin.caravag...@lowes.com> wrote:

> Hello,
>
>
>
>
>
> When integrating with DataStax OSS Cassandra Java driver v4.x, I noticed 
> “Unlike
> previous versions of the driver, pools do not resize dynamically”
> 
> in reference to the connection pool configuration. Is anyone aware of the
> reasoning for this departure from dynamic pool sizing, which I believe was
> available in v3.x?
>
>
>
>
>
> Thanks,
>
>
>
>
>
> Kevin
>
>


Connection Pooling in v4.x Java Driver

2019-12-10 Thread Caravaggio, Kevin
Hello,


When integrating with DataStax OSS Cassandra Java driver v4.x, I noticed
“Unlike previous versions of the driver, pools do not resize dynamically”
in reference to the connection pool configuration. Is anyone aware of the
reasoning for this departure from dynamic pool sizing, which I believe was
available in v3.x?
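
For context, in 4.x the pool size is fixed up front through the driver
configuration rather than grown or shrunk at runtime. A minimal sketch,
assuming the 4.x programmatic config API; the option names come from the OSS
driver's reference configuration, and the contact point and datacenter name
are hypothetical, so verify against the driver version you are actually on.

import java.net.InetSocketAddress;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

public class FixedPoolSketch {
    public static void main(String[] args) {
        // Pools in the 4.x driver have a fixed size per node; set it explicitly here
        // instead of relying on the v3-style core/max dynamic resizing.
        DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
                .withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 4)  // connections per local-DC node
                .withInt(DefaultDriverOption.CONNECTION_POOL_REMOTE_SIZE, 2) // connections per remote-DC node
                .withInt(DefaultDriverOption.CONNECTION_MAX_REQUESTS, 1024)  // in-flight requests per connection
                .build();

        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("10.0.0.1", 9042))    // hypothetical contact point
                .withLocalDatacenter("dc1")                                  // hypothetical DC name
                .withConfigLoader(loader)
                .build()) {
            System.out.println("Connected as " + session.getName());
        }
    }
}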


Thanks,


Kevin




Re: Dynamo autoscaling: does it beat cassandra?

2019-12-10 Thread Dor Laor
Compression in 3.x is much better than in 2.x (see the attached graph from one
of our (Scylla) customers). However, it's not related to Dynamo's hot-partition
behavior and caching. In Dynamo, every tablet has its own limits and caching
isn't taken into account. Once the throughput goes beyond the tablet's
reservation/maximum, the tablet is either split or gets capped. So Zipfian
workloads get penalized a lot.

Both Cassandra and Scylla would rather have uniform workloads, but they handle
Zipfian workloads better.
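
A toy illustration of that point (not DynamoDB's actual implementation; the
cap, the hashing, and the one-second window are all assumptions): with a
per-tablet limit, a hot key can only consume its own tablet's share of the
provisioned throughput, no matter how much total capacity the table has.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy model of per-tablet throughput caps over a single one-second window.
// A real limiter would reset or decay the counters continuously.
public class TabletCapSketch {
    static final long PER_TABLET_OPS_PER_SEC = 1_000;  // assumed per-tablet cap
    private final Map<Integer, AtomicLong> opsThisSecond = new ConcurrentHashMap<>();

    int tabletFor(String partitionKey, int tabletCount) {
        // Stand-in for the service's internal key-range-to-tablet mapping.
        return Math.floorMod(partitionKey.hashCode(), tabletCount);
    }

    boolean tryOp(String partitionKey, int tabletCount) {
        int tablet = tabletFor(partitionKey, tabletCount);
        AtomicLong counter = opsThisSecond.computeIfAbsent(tablet, t -> new AtomicLong());
        // Requests beyond this tablet's cap are throttled, even if every other tablet
        // is idle -- which is why a Zipfian (hot-key) workload hits the ceiling early.
        return counter.incrementAndGet() <= PER_TABLET_OPS_PER_SEC;
    }

    public static void main(String[] args) {
        TabletCapSketch sketch = new TabletCapSketch();
        long accepted = 0;
        // Worst case: every request hits the same hot partition key.
        for (int i = 0; i < 5_000; i++) {
            if (sketch.tryOp("hot-key", 8)) accepted++;
        }
        System.out.println("Accepted " + accepted + " of 5000 ops against the hot key");
    }
}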

On Tue, Dec 10, 2019 at 12:19 PM Carl Mueller
 wrote:

> Dor and Reid: thanks, that was very helpful.
>
> Is the large amount of compression an artifact of pre-cass3.11 where the
> column names were per-cell (combined with the cluster key for extreme
> verbosity, I think), so compression would at least be effective against
> those portions of the sstable data? IIRC the cass commiters figured as long
> as you can shrink the data, the reduced size drops the time to read off of
> the disk, maybe even the time to get into CPU cache from memory and the CPU
> to decompress is somewhat "free" at that point since everything else is
> stalled for I/O or memory reads?
>
> But I don't know how the 3.11.x format works to avoid spamming of those
> column names, I haven't torn into that part of the code.
>
> On Tue, Dec 10, 2019 at 10:15 AM Reid Pinchback <
> rpinchb...@tripadvisor.com> wrote:
>
>> Note that DynamoDB I/O throughput scaling doesn’t work well with brief
>> spikes.  Unless you write your own machinery to manage the provisioning, by
>> the time AWS scales the I/O bandwidth your incident has long since passed.
>> It’s not a thing to rely on if you have a latency SLA.  It really only
>> works for situations like a sustained alteration in load, e.g. if you have
>> a sinusoidal daily traffic pattern, or periodic large batch operations that
>> run for an hour or two, and you need the I/O adjustment while that takes
>> place.
>>
>>
>>
>> Also note that DynamoDB routinely chokes on write contention, which C*
>> would rarely do.  About the only benefit DynamoDB has over C* is that more
>> of its operations function as atomic mutations of an existing row.
>>
>>
>>
>> One thing to also factor into the comparison is developer effort.  The
>> DynamoDB API isn’t exactly tuned to making developers productive.  Most of
>> the AWS APIs aren’t, really, once you use them for non-toy projects. AWS
>> scales in many dimensions, but total developer effort is not one of them
>> when you are talking about high-volume tier one production systems.
>>
>>
>>
>> To respond to one of the other original points/questions, yes key and row
>> caches don’t seem to be a win, but that would vary with your specific usage
>> pattern.  Caches need a good enough hit rate to offset the GC impact.  Even
>> when C* lets you move things off heap, you’ll see a fair number of GC-able
>> artifacts associated with data in caches.  Chunk cache somewhat wins with
>> being off-heap, because it isn’t just I/O avoidance with that cache, you’re
>> also benefitting from the decompression.  However I’ve started to wonder
>> how often sstable compression is worth the performance drag and internal C*
>> complexity.  If you compare to where a more traditional RDBMS would use
>> compression, e.g. Postgres, use of compression is more selective; you only
>> bear the cost in the places already determined to win from the tradeoff.
>>
>>
>>
>> *From: *Dor Laor 
>> *Reply-To: *"user@cassandra.apache.org" 
>> *Date: *Monday, December 9, 2019 at 5:58 PM
>> *To: *"user@cassandra.apache.org" 
>> *Subject: *Re: Dynamo autoscaling: does it beat cassandra?
>>
>>
>>
>> *Message from External Sender*
>>
>> The DynamoDB model has several key benefits over Cassandra's.
>>
>> The most notable one is the tablet concept - data is partitioned into 10GB
>>
>> chunks. So scaling happens where such a tablet reaches maximum capacity
>>
>> and it is automatically divided to two. It can happen in parallel across
>> the entire
>>
>> data set, thus there is no concept of growing the amount of nodes or
>> vnodes.
>>
>> As the actual hardware is multi-tenant, the average server should have
>> plenty
>>
>> of capacity to receive these streams.
>>
>>
>>
>> That said, when we benchmarked DynamoDB and just hit it with ingest
>> workload,
>>
>> even when it was reserved, we had to slow down the pace since we received
>> many
>>
>> 'error 500' which means internal server errors. Their hot partitions do
>> not behave great
>>
>> as well.
>>
>>
>>
>> So I believe a growth of 10% the capacity with good key distribution can
>> be handled well
>>
>> but a growth of 2x in a short time will fail. It's something you're
>> expect from any database
>>
>> but Dynamo has an advantage with tablets and multitenancy and issues with
>> hot partitions
>>
>> and accounting of hot keys which will get cached in Cassandra better.
>>
>>
>>
>> Dynamo allows you to detach compute from the storage which is a key
>> benefit in a serverless, spiky 

Re: Dynamo autoscaling: does it beat cassandra?

2019-12-10 Thread Reid Pinchback
Hi Carl,

I can’t speak to all of the internal mechanics and what the committers factored 
in.  I have no doubt that intelligent decisions were the goal given the context 
of the time.  More where I come from is that at least in our case, we see nodes 
with a fair hunk of file data sitting in buffer cache, and the benefits of that 
have to become throttled when the buffered data won’t be consumable until 
decompression.  Unfortunately with Java you aren’t in the tidy situation where 
you can just claim a page of memory from buffer cache with no real overhead, so
it's plausible that memory copy vs. memory traversal for decompression don't
differ terribly.  Definitely something I want to look into.

Most of where I’ve seen I/O stalling relates to flushing of dirty pages from 
the writes of sstable compaction.  What to do depends a lot on the specifics of 
your situation.  TL;DR summary is that short bursts of write ops can choke out 
everything else once the I/O queue is filled.  It doesn’t really pertain to 
mean/median/95th-percentile performance.  It starts to show at the 99th
percentile, and definitely at the 99.9th.

I don’t know if the interleaving with I/O wait results in some of the 
decompression being effectively free, it’s entirely plausible that this has 
been observed and the current approach improved accordingly. It’s a pretty 
reasonable CPU scheduling behavior unless cores are otherwise being kept busy, 
e.g. with memory copies or pauses to yank things from memory to CPU cache.  Jon 
Haddad recently pointed me to some resources that might explain getting less 
from CPU than the actual CPU numbers suggest, but I haven’t yet really wrapped 
my head around the details enough to decide how I would want to investigate 
reductions in CPU instructions executed.

I do know that we definitely saw from our latency metrics that read times are 
impacted when writes flush in a spike, so we tuned to mitigate it.  It probably 
doesn’t take much to achieve a read stall, as anything that stats a filesystem 
entry (either via cache miss on the dirnodes, or if you haven’t disabled atime) 
might be just as subject to stalling as anything that tries to read content 
from the file itself.

No opinion on 3.11.x handling of column metadata.  I’ve read that it is a great 
deal more complicated and a factor in various performance headaches, but like 
you, I haven’t gotten into the source around that so I don’t have a mental 
model for the details.

From: Carl Mueller 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, December 10, 2019 at 3:19 PM
To: "user@cassandra.apache.org" 
Subject: Re: Dynamo autoscaling: does it beat cassandra?

Message from External Sender
Dor and Reid: thanks, that was very helpful.

Is the large amount of compression an artifact of pre-cass3.11 where the column 
names were per-cell (combined with the cluster key for extreme verbosity, I 
think), so compression would at least be effective against those portions of 
the sstable data? IIRC the cass commiters figured as long as you can shrink the 
data, the reduced size drops the time to read off of the disk, maybe even the 
time to get into CPU cache from memory and the CPU to decompress is somewhat 
"free" at that point since everything else is stalled for I/O or memory reads?

But I don't know how the 3.11.x format works to avoid spamming of those column 
names, I haven't torn into that part of the code.

On Tue, Dec 10, 2019 at 10:15 AM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
Note that DynamoDB I/O throughput scaling doesn’t work well with brief spikes.  
Unless you write your own machinery to manage the provisioning, by the time AWS 
scales the I/O bandwidth your incident has long since passed.  It’s not a thing 
to rely on if you have a latency SLA.  It really only works for situations like 
a sustained alteration in load, e.g. if you have a sinusoidal daily traffic 
pattern, or periodic large batch operations that run for an hour or two, and 
you need the I/O adjustment while that takes place.

Also note that DynamoDB routinely chokes on write contention, which C* would 
rarely do.  About the only benefit DynamoDB has over C* is that more of its 
operations function as atomic mutations of an existing row.

One thing to also factor into the comparison is developer effort.  The DynamoDB 
API isn’t exactly tuned to making developers productive.  Most of the AWS APIs 
aren’t, really, once you use them for non-toy projects. AWS scales in many 
dimensions, but total developer effort is not one of them when you are talking 
about high-volume tier one production systems.

To respond to one of the other original points/questions, yes key and row 
caches don’t seem to be a win, but that would vary with your specific usage 
pattern.  Caches need a good enough hit rate to offset the GC impact.  Even 
when C* lets you move things off heap, you’ll see a fair number of GC-able 
artifacts associated with data in caches.  Chunk cache somewhat wins with being 

Re: Dynamo autoscaling: does it beat cassandra?

2019-12-10 Thread Carl Mueller
Dor and Reid: thanks, that was very helpful.

Is the large amount of compression an artifact of pre-cass3.11 where the
column names were per-cell (combined with the cluster key for extreme
verbosity, I think), so compression would at least be effective against
those portions of the sstable data? IIRC the cass committers figured as long
as you can shrink the data, the reduced size drops the time to read off of
the disk, maybe even the time to get into CPU cache from memory and the CPU
to decompress is somewhat "free" at that point since everything else is
stalled for I/O or memory reads?

But I don't know how the 3.11.x format works to avoid spamming of those
column names, I haven't torn into that part of the code.

On Tue, Dec 10, 2019 at 10:15 AM Reid Pinchback 
wrote:

> Note that DynamoDB I/O throughput scaling doesn’t work well with brief
> spikes.  Unless you write your own machinery to manage the provisioning, by
> the time AWS scales the I/O bandwidth your incident has long since passed.
> It’s not a thing to rely on if you have a latency SLA.  It really only
> works for situations like a sustained alteration in load, e.g. if you have
> a sinusoidal daily traffic pattern, or periodic large batch operations that
> run for an hour or two, and you need the I/O adjustment while that takes
> place.
>
>
>
> Also note that DynamoDB routinely chokes on write contention, which C*
> would rarely do.  About the only benefit DynamoDB has over C* is that more
> of its operations function as atomic mutations of an existing row.
>
>
>
> One thing to also factor into the comparison is developer effort.  The
> DynamoDB API isn’t exactly tuned to making developers productive.  Most of
> the AWS APIs aren’t, really, once you use them for non-toy projects. AWS
> scales in many dimensions, but total developer effort is not one of them
> when you are talking about high-volume tier one production systems.
>
>
>
> To respond to one of the other original points/questions, yes key and row
> caches don’t seem to be a win, but that would vary with your specific usage
> pattern.  Caches need a good enough hit rate to offset the GC impact.  Even
> when C* lets you move things off heap, you’ll see a fair number of GC-able
> artifacts associated with data in caches.  Chunk cache somewhat wins with
> being off-heap, because it isn’t just I/O avoidance with that cache, you’re
> also benefitting from the decompression.  However I’ve started to wonder
> how often sstable compression is worth the performance drag and internal C*
> complexity.  If you compare to where a more traditional RDBMS would use
> compression, e.g. Postgres, use of compression is more selective; you only
> bear the cost in the places already determined to win from the tradeoff.
>
>
>
> *From: *Dor Laor 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Monday, December 9, 2019 at 5:58 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: Dynamo autoscaling: does it beat cassandra?
>
>
>
> *Message from External Sender*
>
> The DynamoDB model has several key benefits over Cassandra's.
>
> The most notable one is the tablet concept - data is partitioned into 10GB
>
> chunks. So scaling happens where such a tablet reaches maximum capacity
>
> and it is automatically divided to two. It can happen in parallel across
> the entire
>
> data set, thus there is no concept of growing the amount of nodes or
> vnodes.
>
> As the actual hardware is multi-tenant, the average server should have
> plenty
>
> of capacity to receive these streams.
>
>
>
> That said, when we benchmarked DynamoDB and just hit it with ingest
> workload,
>
> even when it was reserved, we had to slow down the pace since we received
> many
>
> 'error 500' which means internal server errors. Their hot partitions do
> not behave great
>
> as well.
>
>
>
> So I believe a growth of 10% the capacity with good key distribution can
> be handled well
>
> but a growth of 2x in a short time will fail. It's something you're expect
> from any database
>
> but Dynamo has an advantage with tablets and multitenancy and issues with
> hot partitions
>
> and accounting of hot keys which will get cached in Cassandra better.
>
>
>
> Dynamo allows you to detach compute from the storage which is a key
> benefit in a serverless, spiky deployment.
>
>
>
> On Mon, Dec 9, 2019 at 1:02 PM Jeff Jirsa  wrote:
>
> Expansion probably much faster in 4.0 with complete sstable streaming
> (skips ser/deser), though that may have diminishing returns with vnodes
> unless you're using LCS.
>
> Dynamo on demand / autoscaling isn't magic - they're overprovisioning to
> give you the burst, then expanding on demand. That overprovisioning comes
> with a cost. Unless you're actively and regularly scaling, you're probably
> going to pay more for it.
>
>
> It'd be cool if someone focused on this - I think the faster streaming
> goes a long way. The way vnodes work today make it difficult to add more
> than one at a time without violating 

Re: Seeing tons of DigestMismatchException exceptions after upgrading from 2.2.13 to 3.11.4

2019-12-10 Thread Reid Pinchback
Colleen, to your question, yes there is a difference between 2.x and 3.x that 
would impact repairs.  The merkel tree computations changed, to having a 
default tree depth that is greater. That can cause significant memory drag, to 
the point that nodes sometimes even OOM.  This has been fixed in 4.x to make 
the setting tunable.  I think 3.11.5 now contains the same as a back-patch.

From: Reid Pinchback 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, December 10, 2019 at 11:23 AM
To: "user@cassandra.apache.org" 
Subject: Re: Seeing tons of DigestMismatchException exceptions after upgrading 
from 2.2.13 to 3.11.4

Message from External Sender
Carl, your speculation matches our observations, and we have a use case with 
that unfortunate usage pattern.  Write-then-immediately-read is not friendly to 
eventually-consistent data stores. It makes the reading pay a tax that really 
is associated with writing activity.

From: Carl Mueller 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, December 9, 2019 at 3:18 PM
To: "user@cassandra.apache.org" 
Subject: Re: Seeing tons of DigestMismatchException exceptions after upgrading 
from 2.2.13 to 3.11.4

Message from External Sender
My speculation on rapidly churning/fast reads of recently written data:

- data written at quorum (for RF3): write confirm is after two nodes reply
- data read very soon after (possibly code antipattern), and let's assume the 
third node update hasn't completed yet (e.g. AWS network "variance"). The read 
will pick a replica, and then there is a 50% chance the second replica chosen 
for quorum read is the stale node, which triggers a DigestMismatch read repair.

Is that plausible?

The code seems to log the exception in all read repair instances, so it doesn't 
seem to be an ERROR with red blaring klaxons, maybe it should be a WARN?

On Mon, Nov 25, 2019 at 11:12 AM Colleen Velo 
mailto:cmv...@gmail.com>> wrote:
Hello,

As part of the final stages of our 2.2 --> 3.11 upgrades, one of our clusters 
(on AWS/ 18 nodes/ m4.2xlarge) produced some post-upgrade fits. We started 
getting spikes of Cassandra read and write timeouts despite the fact the 
overall metrics volumes were unchanged. As part of the upgrade process, there 
was a TWCS table that we used a facade implementation to help change the 
namespace of the compaction class, but that has very low query volume.

The DigestMismatchException error messages, (based on sampling the hash keys 
and finding which tables have partitions for that hash key), seem to be 
occurring on the heaviest volume table (4,000 reads, 1600 writes per second per 
node approximately), and that table has semi-medium row widths with about 10-40 
column keys. (Or at least the digest mismatch partitions have that type of 
width). The keyspace is an RF3 using NetworkTopology, the CL is QUORUM for both 
reads and writes.

We have experienced the DigestMismatchException errors on all 3 of the 
Production clusters that we have upgraded (all of them are single DC in the 
us-east-1/eu-west-1/ap-northeast-2 AWS regions) and in all three cases, those 
DigestMismatchException errors were not there in either the  2.1.x or 2.2.x 
versions of Cassandra.
Does anyone know of changes from 2.2 to 3.11 that would produce additional 
timeout problems, such as heavier blocking read repair logic?  Also,

We ran repairs (via reaper v1.4.8) (much nicer in 3.11 than 2.1) on all of the 
tables and across all of the nodes, and our timeouts seemed to have 
disappeared, but we continue to see a rapid streaming of the Digest mismatches 
exceptions, so much so that our Cassandra debug logs are rolling over every 15 
minutes..   There is a mail list post from 2018 that indicates that some 
DigestMismatchException error messages are natural if you are reading while 
writing, but the sheer volume that we are getting is very concerning:
 - 
https://www.mail-archive.com/user@cassandra.apache.org/msg56078.html

Is that level of DigestMismatchException unusual? Or can that volume of
mismatches appear if semi-wide rows simply require a lot of resolution because
flurries of quorum reads/writes (RF3) on recent partitions have a decent chance 
of not having fully synced data on the replica reads? Does the digest mismatch 
error get debug-logged on every chance read repair? (edited)
Also, why are these DigestMismatchException only occurring once the upgrade to 
3.11 has occurred?

~

Sample DigestMismatchException error message:
DEBUG [ReadRepairStage:13] 2019-11-22 01:38:14,448 ReadCallback.java:242 - 
Digest mismatch:

Re: Seeing tons of DigestMismatchException exceptions after upgrading from 2.2.13 to 3.11.4

2019-12-10 Thread Reid Pinchback
Carl, your speculation matches our observations, and we have a use case with 
that unfortunate usage pattern.  Write-then-immediately-read is not friendly to 
eventually-consistent data stores. It makes the reading pay a tax that really 
is associated with writing activity.

From: Carl Mueller 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, December 9, 2019 at 3:18 PM
To: "user@cassandra.apache.org" 
Subject: Re: Seeing tons of DigestMismatchException exceptions after upgrading 
from 2.2.13 to 3.11.4

Message from External Sender
My speculation on rapidly churning/fast reads of recently written data:

- data written at quorum (for RF3): write confirm is after two nodes reply
- data read very soon after (possibly code antipattern), and let's assume the 
third node update hasn't completed yet (e.g. AWS network "variance"). The read 
will pick a replica, and then there is a 50% chance the second replica chosen 
for quorum read is the stale node, which triggers a DigestMismatch read repair.

Is that plausible?

The code seems to log the exception in all read repair instances, so it doesn't 
seem to be an ERROR with red blaring klaxons, maybe it should be a WARN?

On Mon, Nov 25, 2019 at 11:12 AM Colleen Velo 
mailto:cmv...@gmail.com>> wrote:
Hello,

As part of the final stages of our 2.2 --> 3.11 upgrades, one of our clusters 
(on AWS/ 18 nodes/ m4.2xlarge) produced some post-upgrade fits. We started 
getting spikes of Cassandra read and write timeouts despite the fact the 
overall metrics volumes were unchanged. As part of the upgrade process, there 
was a TWCS table that we used a facade implementation to help change the 
namespace of the compaction class, but that has very low query volume.

The DigestMismatchException error messages, (based on sampling the hash keys 
and finding which tables have partitions for that hash key), seem to be 
occurring on the heaviest volume table (4,000 reads, 1600 writes per second per 
node approximately), and that table has semi-medium row widths with about 10-40 
column keys. (Or at least the digest mismatch partitions have that type of 
width). The keyspace is an RF3 using NetworkTopology, the CL is QUORUM for both 
reads and writes.

We have experienced the DigestMismatchException errors on all 3 of the 
Production clusters that we have upgraded (all of them are single DC in the 
us-east-1/eu-west-1/ap-northeast-2 AWS regions) and in all three cases, those 
DigestMismatchException errors were not there in either the  2.1.x or 2.2.x 
versions of Cassandra.
Does anyone know of changes from 2.2 to 3.11 that would produce additional 
timeout problems, such as heavier blocking read repair logic?  Also,

We ran repairs (via reaper v1.4.8) (much nicer in 3.11 than 2.1) on all of the 
tables and across all of the nodes, and our timeouts seemed to have 
disappeared, but we continue to see a rapid streaming of the Digest mismatches 
exceptions, so much so that our Cassandra debug logs are rolling over every 15 
minutes..   There is a mail list post from 2018 that indicates that some 
DigestMismatchException error messages are natural if you are reading while 
writing, but the sheer volume that we are getting is very concerning:
 - 
https://www.mail-archive.com/user@cassandra.apache.org/msg56078.html

Is that level of DigestMismatchException unusual? Or can that volume of
mismatches appear if semi-wide rows simply require a lot of resolution because
flurries of quorum reads/writes (RF3) on recent partitions have a decent chance 
of not having fully synced data on the replica reads? Does the digest mismatch 
error get debug-logged on every chance read repair? (edited)
Also, why are these DigestMismatchException only occurring once the upgrade to 
3.11 has occurred?

~

Sample DigestMismatchException error message:
DEBUG [ReadRepairStage:13] 2019-11-22 01:38:14,448 ReadCallback.java:242 - 
Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key 
DecoratedKey(-6492169518344121155, 
66306139353831322d323064382d313037322d663965632d636565663165326563303965) 
(be2c0feaa60d99c388f9d273fdc360f7 vs 09eaded2d69cf2dd49718076edf56b36)
at 
org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92)
 ~[apache-cassandra-3.11.4.jar:3.11.4]
at 
org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233)
 ~[apache-cassandra-3.11.4.jar:3.11.4]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_77]
at 

Re: Dynamo autoscaling: does it beat cassandra?

2019-12-10 Thread Reid Pinchback
Note that DynamoDB I/O throughput scaling doesn’t work well with brief spikes.  
Unless you write your own machinery to manage the provisioning, by the time AWS 
scales the I/O bandwidth your incident has long since passed.  It’s not a thing 
to rely on if you have a latency SLA.  It really only works for situations like 
a sustained alteration in load, e.g. if you have a sinusoidal daily traffic 
pattern, or periodic large batch operations that run for an hour or two, and 
you need the I/O adjustment while that takes place.

Also note that DynamoDB routinely chokes on write contention, which C* would 
rarely do.  About the only benefit DynamoDB has over C* is that more of its 
operations function as atomic mutations of an existing row.

One thing to also factor into the comparison is developer effort.  The DynamoDB 
API isn’t exactly tuned to making developers productive.  Most of the AWS APIs 
aren’t, really, once you use them for non-toy projects. AWS scales in many 
dimensions, but total developer effort is not one of them when you are talking 
about high-volume tier one production systems.

To respond to one of the other original points/questions, yes key and row 
caches don’t seem to be a win, but that would vary with your specific usage 
pattern.  Caches need a good enough hit rate to offset the GC impact.  Even 
when C* lets you move things off heap, you’ll see a fair number of GC-able 
artifacts associated with data in caches.  Chunk cache somewhat wins with being 
off-heap, because it isn’t just I/O avoidance with that cache, you’re also 
benefitting from the decompression.  However I’ve started to wonder how often 
sstable compression is worth the performance drag and internal C* complexity.  
If you compare to where a more traditional RDBMS would use compression, e.g. 
Postgres, use of compression is more selective; you only bear the cost in the 
places already determined to win from the tradeoff.
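
Worth noting that compression is a per-table knob in Cassandra too, so you can
be similarly selective. A minimal sketch of the two usual moves, with
hypothetical keyspace/table names and settings you would benchmark rather than
copy:

import com.datastax.oss.driver.api.core.CqlSession;

public class CompressionTuningSketch {
    public static void main(String[] args) {
        // Assumes a reachable node on localhost with default driver config.
        try (CqlSession session = CqlSession.builder().build()) {
            // Smaller chunks mean a point read decompresses less data per hit...
            session.execute("ALTER TABLE my_ks.hot_read_table "
                    + "WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 16}");
            // ...or turn compression off entirely where it doesn't pay for itself.
            session.execute("ALTER TABLE my_ks.poorly_compressible_table "
                    + "WITH compression = {'enabled': 'false'}");
        }
    }
}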

From: Dor Laor 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, December 9, 2019 at 5:58 PM
To: "user@cassandra.apache.org" 
Subject: Re: Dynamo autoscaling: does it beat cassandra?

Message from External Sender
The DynamoDB model has several key benefits over Cassandra's.
The most notable one is the tablet concept: data is partitioned into 10GB
chunks, so scaling happens when such a tablet reaches maximum capacity and it
is automatically split in two. This can happen in parallel across the entire
data set, thus there is no concept of growing the number of nodes or vnodes.
As the actual hardware is multi-tenant, the average server should have plenty
of capacity to receive these streams.

That said, when we benchmarked DynamoDB and just hit it with an ingest
workload, even when it was reserved, we had to slow down the pace since we
received many 'error 500' responses, which means internal server errors. Their
hot partitions do not behave great either.

So I believe a growth of 10% of the capacity with good key distribution can be
handled well, but a growth of 2x in a short time will fail. It's something
you'd expect from any database, but Dynamo has an advantage with tablets and
multitenancy, and issues with hot partitions and accounting of hot keys, which
Cassandra will cache better.

Dynamo allows you to detach compute from the storage which is a key benefit in 
a serverless, spiky deployment.

On Mon, Dec 9, 2019 at 1:02 PM Jeff Jirsa 
mailto:jji...@gmail.com>> wrote:
Expansion probably much faster in 4.0 with complete sstable streaming (skips 
ser/deser), though that may have diminishing returns with vnodes unless you're 
using LCS.

Dynamo on demand / autoscaling isn't magic - they're overprovisioning to give 
you the burst, then expanding on demand. That overprovisioning comes with a 
cost. Unless you're actively and regularly scaling, you're probably going to 
pay more for it.

It'd be cool if someone focused on this - I think the faster streaming goes a 
long way. The way vnodes work today make it difficult to add more than one at a 
time without violating consistency, and thats unlikely to change, but if each 
individual node is much faster, that may mask it a bit.



On Mon, Dec 9, 2019 at 12:35 PM Carl Mueller 
 wrote:
Dynamo salespeople have been pushing autoscaling abilities that have been one 
of the key temptations to our management to switch off of cassandra.

Has anyone done any numbers on how well dynamo will autoscale demand spikes, 
and how we could architect cassandra to compete with such abilities?

We probably could overprovision and with the presumably higher cost of dynamo 
beat it, although the sales engineers claim they are closing the cost factor 
too. We could vertically scale to some degree, but node expansion seems close.

VNode expansion is still limited to one at a time?

We use VNodes so we can't do netflix's cluster doubling, correct? With cass 
4.0's alleged segregation of the data by token we could though and possibly 
also "prep" the node by having the necessary 

Re: Predicting Read/Write Latency as a Function of Total Requests & Cluster Size

2019-12-10 Thread Reid Pinchback
Latency SLAs are very much *not* Cassandra’s sweet spot, scaling throughput and 
storage is more where C*’s strengths shine.  If you want just median latency 
you’ll find things a bit more amenable to modeling, but not if you have 2 nines 
and particularly not 3 nines SLA expectations.  Basically, the harder you push 
on the nodes, the more you get sporadic but non-ignorable timing artifacts due 
to garbage collection and IO stalls when the flushing of the writes can choke 
out the disk reads.  Also, running in AWS, you’ll find that noisy neighbors are 
a routine issue no matter what the specifics of your use.

What your actual data model is, and what your patterns of reads and writes are, 
the impact of deletes and TTLs requiring tombstone cleanup, etc., all 
dramatically change the picture.

If you aren’t already aware of it, there is something called cassandra-stress 
that can help you do some experiments. The challenge though is determining if 
the experiments are representative of what your actual usage will be.  Because 
of the GC issues in anything implemented in a JVM or interpreter, it’s pretty 
easy to fall off the cliff of relevance.  TLP wrote an article about some of 
the challenges of this with cassandra-stress:

https://thelastpickle.com/blog/2017/02/08/Modeling-real-life-workloads-with-cassandra-stress.html
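
If it helps as a starting point, a minimal invocation might look like the
following; the node address, row count, duration, and thread count are
placeholders, and the op names in the user-profile form are whatever you define
in your own YAML. As the article above argues, the built-in schema is rarely
representative, so you'd normally move to a user profile that mirrors your
actual tables and query mix.

# Built-in workload, mostly useful for smoke-testing the cluster and the tool itself:
cassandra-stress write n=1000000 cl=QUORUM -rate threads=200 -node 10.0.0.1

# User profile mirroring your own schema (my_workload.yaml and the op names are hypothetical):
cassandra-stress user profile=./my_workload.yaml ops\(insert=1,point_read=3\) duration=30m cl=QUORUM -rate threads=200 -node 10.0.0.1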

Note that one way to not have to care a lot about variable latency is to make 
use of speculative retry.  Basically you’re trading off some of your median 
throughput to help achieve a latency SLA.  The tradeoff benefit breaks down 
when you get to 3 nines.
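
If you go that route, table-level speculative retry is one way to do it; a
minimal sketch with a hypothetical keyspace/table name, assuming a 3.x-era
cluster where percentile values like '99PERCENTILE' are supported (the same
statement works from cqlsh, of course):

import com.datastax.oss.driver.api.core.CqlSession;

public class SpeculativeRetrySketch {
    public static void main(String[] args) {
        // Assumes a reachable node on localhost with default driver config.
        try (CqlSession session = CqlSession.builder().build()) {
            // The coordinator issues a redundant replica read whenever the first replica is
            // slower than the table's p99 estimate: extra replica load for a tighter tail.
            session.execute("ALTER TABLE my_ks.my_table WITH speculative_retry = '99PERCENTILE'");
        }
    }
}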

I’m actually hoping to start on some modeling of what the latency surface looks 
like with different assumptions in the new year, not because I expect the 
specific numbers to translate to anybody else but just to show how the 
underlying dynamics evidence themselves in metrics when C* nodes are under
duress.

R


From: Fred Habash 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, December 10, 2019 at 9:57 AM
To: "user@cassandra.apache.org" 
Subject: Predicting Read/Write Latency as a Function of Total Requests & 
Cluster Size

Message from External Sender
I'm looking for an empirical way to answer these two question:

1. If I increase application work load (read/write requests) by some 
percentage, how is it going to affect read/write latency. Of course, all other 
factors remaining constant e.g. ec2 instance class, ssd specs, number of nodes, 
etc.

2) How many nodes do I have to add to maintain a given read/write latency?

Are there are any methods or instruments out there that can help answer these 
que




Thank you



Predicting Read/Write Latency as a Function of Total Requests & Cluster Size

2019-12-10 Thread Fred Habash
I'm looking for an empirical way to answer these two questions:

1. If I increase application workload (read/write requests) by some
percentage, how is it going to affect read/write latency? Of course, all
other factors remaining constant, e.g. ec2 instance class, ssd specs, number
of nodes, etc.

2) How many nodes do I have to add to maintain a given read/write latency?

Are there any methods or instruments out there that can help answer
these que




Thank you