About a year ago I remember a conversation about C* clusters with large
numbers of nodes. I think Jon Haddad had raised the point that at more than
100 nodes you start to run into issues, something related to a thread pool
with a size proportionate to the number of nodes, but that this problem would be
I don’t know if it’s the OP’s intent in this case, but the response latency
profile will likely be different for two clusters equivalent in total storage
but different in node count. Multiple reasons for that, but probably the
biggest would be that you’re changing a divisor in I/O queuing
Here’s an article link for repairing table corruption, something I’d saved back
last year in case I ever needed it:
https://blog.pythian.com/so-you-have-a-broken-cassandra-sstable-file/
Hope it helps.
R
From: F
Reply-To: "user@cassandra.apache.org"
Date: Thursday, July 2, 2020 at 12:50 PM
It’s pretty easy to use Ansible, or Python with Jinja by itself if you don’t
use Ansible, to templatize your config file so that the environment variables
get substituted.
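As a sketch of that approach (the file names and template variable are my own
invention, not from the thread):

    # Render cassandra.yaml from a Jinja2 template, substituting environment
    # variables into the placeholders.
    import os
    from jinja2 import Template

    # The template contains lines like: listen_address: {{ LISTEN_ADDRESS }}
    with open("cassandra.yaml.j2") as f:
        template = Template(f.read())

    with open("cassandra.yaml", "w") as f:
        f.write(template.render(**os.environ))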
From: Jeff Jirsa
Reply-To: "user@cassandra.apache.org"
Date: Monday, June 29, 2020 at 10:36 AM
To: cassandra
Subject:
If you’re using AWS with EBS then you can just handle that with KMS to encrypt
the volumes. If you’re using local storage on EC2, or you aren’t on AWS, then
you’ll have to do heavier lifting with luks and dm-crypt, or eCryptfs, etc. If
you’re using a container mechanism for your C*
Just to confirm, is this memory decline outside of the Cassandra process? If
so, I’d look at crond and at memory held for network traffic. Those are the
two areas I’ve seen leak. If you’ve configured to have swap=0, then you end up
in a position where even if the memory usage is stale,
I’d also take a look at the O/S level. You might be queued up on flushing of
dirty pages, which would also throttle your ability to write mempages. Once
the I/O gets throttled badly, I’ve seen it push back into what you see in C*.
To Aaron’s point, you want a balance in memory between C* and
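If you want a quick look at that dirty-page flushing pressure, a sketch (mine,
not from the thread) reading the Linux counters:

    # Report dirty and writeback page totals from /proc/meminfo (values in kB).
    def dirty_page_stats():
        stats = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, _, rest = line.partition(":")
                if key in ("Dirty", "Writeback"):
                    stats[key] = int(rest.split()[0])
        return stats

    print(dirty_page_stats())   # e.g. {'Dirty': 14328, 'Writeback': 0}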
Would updating disk boundaries be sensitive to disk I/O tuning? I’m
remembering Jon Haddad’s talk about typical throughput problems in disk page
sizing.
From: Jai Bheemsen Rao Dhanwada
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, June 2, 2020 at 10:48 AM
To:
I bootstrap the node or restart a C*
process.
I don't believe it's a GC issue. And a correction to the initial question: it's
not just bootstrap, every restart of the C* process causes this.
On Mon, Jun 1, 2020 at 3:22 PM Reid Pinchback wrote:
That gap seems a long time. Have you checked GC logs around the timeframe?
From: Jai Bheemsen Rao Dhanwada
Reply-To: "user@cassandra.apache.org"
Date: Monday, June 1, 2020 at 3:52 PM
To: "user@cassandra.apache.org"
Subject: Cassandra Bootstrap Sequence
Hello
By retry logic, I’m going to guess you are doing some kind of version
consistency trick where you have a non-key column managing a visibility horizon
to simulate a transaction, and you poll for a horizon value >= some threshold
that the app is keeping aware of.
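If that guess is right, the polling side would look roughly like this sketch
(keyspace, table, and column names are invented):

    import time
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")

    def wait_for_horizon(entity_id, threshold, timeout=30.0):
        # Poll a non-key 'horizon' column until it reaches the visibility
        # threshold the app is tracking -- a simulated transaction.
        deadline = time.time() + timeout
        while time.time() < deadline:
            row = session.execute(
                "SELECT horizon FROM state WHERE id = %s", (entity_id,)
            ).one()
            if row is not None and row.horizon >= threshold:
                return row
            time.sleep(0.1)
        raise TimeoutError("horizon never reached threshold")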
Note that these assorted
If you’re correct that the issue you linked to is the bug you are hitting, then
it was fixed in 3.11.3. You may have no choice but to upgrade. From the
discussion it doesn’t read as if any tuning tweaks avoided the issue; only the
patch fixed it.
If you do, I’d suggest going to at least
contain the off-heap memory. I understand that I have to test, as Eric said,
since I might get an OutOfMemoryError. Or are there any other better options
available for handling such situations?
On Tue, Apr 21, 2020 at 9:52 PM Reid Pinchback wrote:
No
Marc, have you had any exposure to DynamoDB at all? The API approach is
different, but the fundamental concepts are similar. That’s actually a better
reference point to have than an RDBMS, because really it’s a small subset of
usage patterns that would overlap with CQL. If you were, for
Note that from a performance standpoint, it’s hard to see a reason to care
about releasing the memory unless you are co-tenanting C* with something else
that’s significant in its memory demands, and significant on a schedule
anti-correlated with when C* needs that memory.
If you aren’t doing
I would pay attention to the dirty background writer activity at the O/S level.
If you see that it isn’t keeping up with flushing changes to disk, then you’ll
be in an even worse situation as you increase the JVM heap size, because that
will be done at the cost of the size of available buffer
I think there is some potential yak shaving in worrying excessively about swap.
The reality is that you should know the memory demands of what you are running
on your C* nodes and have things configured so that significant swap use would
be a highly abnormal situation.
I'd expect to see
If I understand the logic of things like SlabAllocator properly, this is
essentially buffer space that has been allocated for the purpose and C* pulls
off ByteBuffer hunks of it as needed. The notion of reclaiming by the kernel
wouldn’t apply, C* would be managing the use of the space itself.
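A toy model of that, under my reading of the allocator:

    # Grab one big region up front and hand out slices of it, so reclaim is
    # managed by the application rather than the kernel.
    class Slab:
        def __init__(self, size=1 << 20):            # one 1 MiB region
            self._buf = memoryview(bytearray(size))
            self._offset = 0

        def allocate(self, nbytes):
            # Hand back a ByteBuffer-like hunk of the preallocated space.
            if self._offset + nbytes > len(self._buf):
                raise MemoryError("slab exhausted; allocate a new slab")
            hunk = self._buf[self._offset:self._offset + nbytes]
            self._offset += nbytes
            return hunk

    slab = Slab()
    chunk = slab.allocate(128)   # a 128-byte hunk of the region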
To: "user@cassandra.apache.org"
Cc: Reid Pinchback
Subject: Re: OOM only on one datacenter nodes
We are using the JRE and not the JDK, hence we are not able to take a heap dump.
On Sun, 5 Apr 2020 at 19:21, Jeff Jirsa wrote:
Set the jvm flags to
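The message is cut off here, but the standard flags for getting a dump on OOM
(my completion, with a placeholder path, not the original text) are:

    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/var/lib/cassandra/java_heap.hprof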
Surbi:
If you aren’t seeing connection activity in DC2, I’d check to see if the
operations hitting DC1 are quorum ops instead of local quorum. That still
wouldn’t explain DC2 nodes going down, but would at least explain them doing
more work than might be on your radar right now.
The hint
I’ll add a few cautionary notes:
* JVM object overhead has memory-allocation efficiency issues once the heap
reaches 32 GiB or more (compressed object pointers no longer apply), but yes to
the added memory for off-heap storage and O/S buffer cache.
* C* creates a lot of threads, but the number active can sometimes be
rather small.
If you care about low-latency reads, I’d worry less about columnar data types,
and more about the general quality of the data modeling and usage patterns, and
tuning the things that you see cause latency spikes. There isn’t just a single
cause to latency spikes, so expect to spend a couple of
Our experience with G1GC was that 31gb wasn’t optimal (for us) because while
you have less frequent full GCs they are bigger when they do happen. But even
so, not to the point of a 9.5s full collection.
Unless it is a rare event associated with something weird happening outside of
the JVM
To the question of ‘best approach’, so far the comments have been about
alternatives in tools.
Another axis you might want to consider is from the data model viewpoint. So,
for example, let’s say you have 600M rows. You want to do a daily transfer of
data for some reason. First question
No actually in this case I didn’t really have an opinion because C* is an
architecturally different beast than an RDBMS. That’s kinda what ticked the
curiosity when you made the suggestion about co-locating commit and data. It
raises an interesting question for me. As for the 10 seconds
I was curious and did some digging. 400k is the max read IOPS on the 1-device
instance types; 3M IOPS is for the 8-device instance types.
From: Reid Pinchback
Reply-To: "user@cassandra.apache.org"
Date: Friday, February 14, 2020 at 11:24 AM
To: "user@cassandra.apache.org"
I’ve seen claims of 3M IOPS on reads for AWS, not sure about writes. I think
you just need a recent enough kernel to not get in the way of doing multiqueue
operations against the NVMe device.
Erick, a question purely as a point of curiosity. The entire model of a commit
log, historically
Since ping is ICMP, not TCP, you probably want to investigate a mix of TCP and
CPU stats to see what is behind the slow pings. I’d guess you are getting
network impacts beyond what the ping times are hinting at. ICMP isn’t subject
to retransmission, so your TCP situation could be far worse
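One way to check that (my sketch): compute the retransmit ratio from the
kernel's TCP counters.

    # Ratio of retransmitted to sent TCP segments, from /proc/net/snmp.
    def tcp_retrans_ratio():
        with open("/proc/net/snmp") as f:
            lines = [l.split() for l in f if l.startswith("Tcp:")]
        header, values = lines[0], lines[1]
        stats = dict(zip(header[1:], map(int, values[1:])))
        return stats["RetransSegs"] / max(stats["OutSegs"], 1)

    print(f"retransmit ratio: {tcp_retrans_ratio():.4%}")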
Hi Sergio,
We have a production cluster with vnodes=4 that is a bit larger than that, so
yes it is possible to do so. That said, we aren’t wedded to vnodes=4 and are
paying attention to discussions happening around the 4.0 work and mulling the
possibility of shifting to 16.
Note though, we
A caveat to the 31GB recommendation for G1GC. If you have tight latency SLAs
instead of throughput SLAs then this doesn’t necessarily pan out to be beneficial.
Yes the GCs are less frequent, but they can hurt more when they do happen. The
win is if your usage pattern is such that the added time
Just mulling this based on some code and log digging I was doing while trying
to have Reaper stay on top of our cluster.
I think maybe the caveat here relates to eventual consistency. C* doesn’t do
state changes as distributed transactions. The assumption here is that RF=3 is
implying that
Ankit, are the instance types identical in the new cluster, with I/O
configuration identical at the system level, and are the Java settings for C*
identical between the two clusters? With radical timing differences happening
periodically, the two things I’d have on my radar would be garbage
? and minute = ?
Sean Durity
From: Reid Pinchback
Sent: Thursday, February 6, 2020 4:10 PM
To: user@cassandra.apache.org
Subject: Re: [EXTERNAL] Re: Running select against cassandra
Abdul,
When in doubt, have a query model that immediately feeds you exactly what you
are looking for. That’s kind of the data model philosophy that you want to
shoot for as much as feasible with C*.
The point of Sean’s table isn’t the similarity to yours, it is how he has it
keyed because it
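As an illustration of keying for the query (table and columns invented for the
example):

    import datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")

    # One partition per (sensor, day); clustering order makes "latest first"
    # a sequential read, so the read asks for exactly what it needs.
    session.execute("""
        CREATE TABLE IF NOT EXISTS readings_by_sensor_day (
            sensor_id text,
            day       date,
            ts        timestamp,
            value     double,
            PRIMARY KEY ((sensor_id, day), ts)
        ) WITH CLUSTERING ORDER BY (ts DESC)
    """)

    rows = session.execute(
        "SELECT ts, value FROM readings_by_sensor_day "
        "WHERE sensor_id = %s AND day = %s LIMIT 100",
        ("sensor-42", datetime.date(2020, 2, 6)),
    )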
Another thing I'll add, since I don't think any of the other responders brought
it up.
This all assumes that you already believe that the update is safe. If you have
any kind of test cluster, I'd evaluate the change there first.
While I haven't hit it with C* specifically, I have seen
Just a thought along those lines. If the memtable flush isn’t keeping up, you
might find that manifested in the I/O queue length and dirty page stats leading
into the time the OOM event took place. If you do see that, then you might
need to do some I/O tuning as well.
From: Jeff Jirsa
Jon Haddad has previously made the case for num_tokens=4. His Accelerate 2019
talk is available at:
https://www.youtube.com/watch?v=swL7bCnolkU
You might want to check that out. Also I think the amount of effort you put
into evening out the token distribution increases as vnode count
Date: Wednesday, January 22, 2020 at 4:46 PM
To: Reid Pinchback
Cc: "user@cassandra.apache.org"
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days
to 8 days?
Thanks for the explanation. It deserves a blog post.
Sergio
From: Sergio
Date: Wednesday, January 22, 2020 at 4:08 PM
To: Reid Pinchback
Cc: "user@cassandra.apache.org"
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days
to 8 days?
Thank you very much for your extended response.
Should I look
of data, the probability of the degenerate cases becoming real cases becomes
vanishingly small.
R
From: Sergio
Date: Wednesday, January 22, 2020 at 1:41 PM
To: "user@cassandra.apache.org" , Reid Pinchback
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5
Sergio, if you’re looking for a new frequency for your repairs because of the
change, if you are using reaper, then I’d go for repair_freq <= gc_grace / 2.
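The arithmetic, for the 8-day gc_grace in question:

    gc_grace_seconds = 8 * 24 * 3600        # gc_grace raised to 8 days
    repair_freq_max = gc_grace_seconds // 2
    print(repair_freq_max / 86400)          # 4.0 -> repair at least every 4 days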
Just serendipity with a conversation I was having at work this morning. When
you actually watch the reaper logs then you can see
, 11:07 AM, "Reid Pinchback" wrote:
I would think that it would be largely driven by the replication factor. It
isn't that the sstables are forklifted from one dc to another, it's just that
the writes being made to the memtables are also shipped around by the
coordinator nodes as the writes happen. Operations at the sstable
I can’t find it anywhere either, but I’m looking at a 3.11.4 source image.
From the naming I’d bet that this is being used to feed the
cassandra.migration_task_wait_in_seconds property. It’s already coded to have
a default of 1 second, which matches what you are seeing in the shell script
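Presumably settable like any other system property, e.g. (the value is just an
example):

    -Dcassandra.migration_task_wait_in_seconds=5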
As others pointed out, compression will reduce the size and replication will
(across nodes) increase the total size.
The other thing to note is that you can have multiple versions of the data in
different sstables, and tombstones related to deletions and TTLs, and indexes,
and any snapshots,
Once upon a time the implication of ‘nosql’ was ‘not SQL’, but these days it
would be more accurate to characterize it as ‘not only SQL’.
‘schemaless’ also can be interpreted a little flexibly. In a relational
database structure, you can think of ‘schema’ (with respect to tables) as
meaning
Metrics are exposed via JMX. You can use something like jmxtrans or collectd
with the jmx plugin to capture metrics per-node and route them to whatever you
use to aggregate metrics.
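If you want a quick look before wiring up jmxtrans or collectd, a cruder sketch
that scrapes nodetool output instead of polling JMX directly:

    import subprocess

    # Parse `nodetool tpstats` thread-pool rows: name, active, pending, ...
    def tpstats():
        out = subprocess.run(["nodetool", "tpstats"],
                             capture_output=True, text=True, check=True).stdout
        metrics = {}
        for line in out.splitlines():
            parts = line.split()
            if len(parts) >= 4 and parts[1].isdigit():
                metrics[parts[0]] = {"active": int(parts[1]),
                                     "pending": int(parts[2])}
        return metrics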
From: Fred Habash
Reply-To: "user@cassandra.apache.org"
Date: Thursday, December 12, 2019 at 9:38 AM
To:
Also note that you should be expecting async operations to be slower on a
call-by-call basis. Async protocols have added overhead. The point of them
really is to leave the client free to interleave other computing activity
between the async calls. It’s not usually a better way to do batchy reads.
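With the DataStax Python driver, for example (the query and the interleaved
work are stand-ins), the win is the interleaving, not per-call speed:

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")

    def do_other_useful_work():
        pass   # placeholder for client-side activity

    # Fire the query without blocking...
    future = session.execute_async(
        "SELECT value FROM state WHERE id = %s", ("abc",)
    )
    do_other_useful_work()   # ...interleave other work here...
    rows = future.result()   # ...block only when the result is needed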
But I don't know how the 3.11.x format works to avoid spamming of those column
names, I haven't torn into that part of the code.
On Tue, Dec 10, 2019 at 10:15 AM Reid Pinchback wrote:
in 4.x to make
the setting tunable. I think 3.11.5 now contains the same as a back-patch.
From: Reid Pinchback
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, December 10, 2019 at 11:23 AM
To: "user@cassandra.apache.org"
Subject: Re: Seeing tons of DigestMismatchExcept
Carl, your speculation matches our observations, and we have a use case with
that unfortunate usage pattern. Write-then-immediately-read is not friendly to
eventually-consistent data stores. It makes the reading pay a tax that really
is associated with writing activity.
From: Carl Mueller
Note that DynamoDB I/O throughput scaling doesn’t work well with brief spikes.
Unless you write your own machinery to manage the provisioning, by the time AWS
scales the I/O bandwidth your incident has long since passed. It’s not a thing
to rely on if you have a latency SLA. It really only
Latency SLAs are very much *not* Cassandra’s sweet spot, scaling throughput and
storage is more where C*’s strengths shine. If you want just median latency
you’ll find things a bit more amenable to modeling, but not if you have 2 nines
and particularly not 3 nines SLA expectations. Basically,
Correction: “most of your database will be in chunk cache, or buffer cache
anyways.”
From: Reid Pinchback
Reply-To: "user@cassandra.apache.org"
Date: Friday, December 6, 2019 at 10:16 AM
To: "user@cassandra.apache.org"
Subject: Re: AWS ephemeral instances + backup
If you’re only going to have a small storage footprint per node like 100gb,
another option comes to mind. Use an instance type with large ram. Use an EBS
storage volume on an EBS-optimized instance type, and take EBS snapshots. Most
of your database will be in chunk cache anyways, so you only
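The snapshot side of that, sketched with boto3 (the volume id is a placeholder):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    snap = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",
        Description="nightly cassandra data volume snapshot",
    )
    print(snap["SnapshotId"])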
Reid,
I've only been working with Cassandra for 2 years, and this echoes my
experience as well.
Regarding the cache use, I know every use case is different, but have you
experimented and found any performance benefit to increasing its size?
Thanks,
John Belliveau
Rahul, if my memory of this is correct, that particular logging message is
noisy, the cache is pretty much always used to its limit (and why not, it’s a
cache, no point in using less than you have).
No matter what value you set, you’ll just change the “reached (….)” part of it.
I think what
I will try to continue providing additional information /
thoughts on the Cassandra ticket.
Regards,
Thomas
From: Reid Pinchback
Sent: Wednesday, 06 November 2019 18:28
To: user@cassandra.apache.org
Subject: Re: Cassandra 3.0.18 went OOM several hours after joining a cluster
The other thing that c
ore of them, until eventually…pop.
From: Reid Pinchback
Reply-To: "user@cassandra.apache.org"
Date: Wednesday, November 6, 2019 at 12:11 PM
To: "user@cassandra.apache.org"
Subject: Re: Cassandra 3.0.18 went OOM several hours after joining a cluster
Almost 15 minutes, that sounds suspiciously like blocking on a default TCP
socket timeout.
From: Rahul Reddy
Reply-To: "user@cassandra.apache.org"
Date: Wednesday, November 6, 2019 at 12:12 PM
To: "user@cassandra.apache.org"
Subject: Re: Aws instance stop and star with ebs
My first thought was that you were running into the merkle tree depth problem,
but the details on the ticket don’t seem to confirm that.
It does look like eden is too small. C* lives in Java’s GC pain point, a lot
of medium-lifetime objects. If you haven’t already done so, you’ll want to
It’s not a setting I’ve played with at all. I understand the gist of it though:
essentially it’ll let you automatically adjust your JVM size relative to
whatever you allocated to the cgroup. Unfortunately I’m not a K8s developer
(that may change shortly, but it isn’t the case at the moment). What you need to a
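The flags I’d expect to matter there, assuming a recent enough JDK (8u191+/10+;
the percentage is just an example):

    -XX:+UseContainerSupport
    -XX:MaxRAMPercentage=60.0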
Hi Ben, just catching up over the weekend.
The typical advice, per Sergio’s link reference, is an obvious starting point.
We use G1GC and normally I’d treat 8gig as the minimal starting point for a
heap. What sometimes doesn’t get talked about in the myriad of tunings, is
that you have to
Maybe I’m missing something. You’re expecting less than 1 gig of data per
node? Unless this is some situation of super-high data churn/brief TTL, it
sounds like you’ll end up with your entire database in memory.
From: Ben Mills
Reply-To: "user@cassandra.apache.org"
Date: Friday, November 1,
That is indeed what Amazon AMIs are for.
However if your question is “why don’t the C* developers do that for people?”
the answer is going to be some mix of “people only do so much work for free”
and “the ones that don’t do it for free have a company you pay to do things
like that
'net.ipv4.tcp_window_scaling' => 1,
'net.core.netdev_max_backlog' => 2500,
'net.core.somaxconn' => 65000,
'vm.max_map_count' => 1048575,
'vm.swappiness' => 0
}
These are my tweaked values and I used the values recommended by Datastax.
Do you have something different?
Best,
Sergio
Oh nvm, didn't see the later msg about just posting what your fix was.
R
On 10/30/19, 4:24 PM, "Reid Pinchback" wrote:
Hi Sergio,
Assuming nobody is actually mounting a SYN flood attack, then this sounds like
you're either being hammered with connection requests in very short periods of
time, or your TCP backlog tuning is off. At least, that's where I'd start
looking. If you take that log message and google
Oh, my mistake, there was also another subdirectory there with the old RPMs; I
missed that the first time. Thanks.
From: Reid Pinchback
Reply-To: "user@cassandra.apache.org"
Date: Wednesday, October 30, 2019 at 1:47 PM
To: "user@cassandra.apache.org"
Subject: Re: W
The old releases are removed by Apache automatically as part of their policy,
it's not specific to Cassandra.
On Wed, Oct 30, 2019 at 10:39 AM Reid Pinchback wrote:
Thanks Michael, that was exactly the info I needed.
On 10/30/19, 1:44 PM, "Michael Shuler" wrote:
On 10/30/19 12:39 PM, Reid Pinchback wrote:
> With the latest round of C* updates, the yum repo no longer has
> whate
With the latest round of C* updates, the yum repo no longer has whatever the
previous version is. For environments that try to do more controlled stepping
of release changes instead of just taking the latest, is there any URL for
previous versions of RPMs? Previous jars I can find easily
Ben, you may find this helpful:
https://blog.pythian.com/so-you-have-a-broken-cassandra-sstable-file/
From: Ben Mills
Reply-To: "user@cassandra.apache.org"
Date: Thursday, October 24, 2019 at 3:31 PM
To: "user@cassandra.apache.org"
Subject: Repair Issues
separate AZ?
Best,
Sergio
On Thu, Oct 24, 2019, 7:36 AM Reid Pinchback wrote:
Hey Sergio,
Forgive me, but I’m at work and had to skim the info quickly.
When in doubt, simplify. So 1 rack per DC. Distributed systems get rapidly
harder to reason about the more
Am I correct: if we have a keyspace with 100GB and Replication Factor = 3 and
RACKS = 3 => 100 * 3 * 3 = 900GB?
If I had only one rack across 2 or even 3 availability zone I would save in
space and I would have 300GB only. Please correct me if I am wrong.
Best,
Sergio
On Wed, Oct 23, 2019 at 09:21, Reid Pinchback wrote:
I haven’t seen much evidence that larger cluster = more performance, plus or
minus the statistics of speculative retry. It horizontally scales for storage
definitely, and somewhat for connection volume. If anything, per Sean’s
observation, you have less ability to have a stable tuning for a
Datacenters and racks are different concepts. While they don't have to be
associated with their historical meanings, the historical meanings probably
provide a helpful model for understanding what you want from them.
When companies own their own physical servers and have them housed somewhere,
CPU during GC pauses,
you can try using more GC threads by setting -XX:ParallelGCThreads to match the
number of cores you have, since by default it won't use them all. You've got
40 cores in the m4.10xlarge, try setting -XX:ParallelGCThreads to 40.
Jon
On Tue, Oct 22, 2019 at 11:38 A
Thomas, what is your frequency of metric collection? If it is minute-level
granularity, that can give a very false impression. I’ve seen CPU and disk
throttles that don’t even begin to show visibility until second-level
granularity around the time of the constraining event. Even clearer is
A high level of compaction seems highly likely to throttle you by sending the
service into a GC death spiral, doubly-so if any repairs happen to be underway
at the same time (I may or may not have killed a few nodes this way, but I
admit nothing!). Even if not in GC hell, it can cause you to
will probably make more sense for most
setups.
On Mon, Oct 21, 2019 at 10:21 AM Sergio wrote:
Hello!
This is the kernel that I am using
Linux 4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018 x86_64
x86_64 x86_64 GNU/Linux
Best,
Sergio
exact number?
Can you share the flags for ParNew + CMS so that I can play with them and
perform a test?
Best,
Sergio
On Mon, Oct 21, 2019 at 09:27, Reid Pinchback wrote:
Since the instance size is < 32gb, hopefully swap isn’t being used, so it
http://thelastpickle.com/tlp-stress
Jon
On Mon, Oct 21, 2019 at 10:24 AM Reid Pinchback wrote:
I don't know which distro and version you are using, but watch out for
surprises in what vm.swappiness=0 means. In older kernels it means "only use
swap when desperate". I believe that newer kernels changed to have 1 mean
that, and 0 means to always use the oomkiller. Neither situation is
An i3.xlarge has 30.5 GB of RAM but you’re using less than 4 GB for C*. So
minus room for other uses of JVM memory and for kernel activity, that’s about
25 GB for file cache. You’ll have to see if you either want a bigger heap to
allow for less frequent gc cycles, or you could save money on
Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com
I’d look to see if you have compactions fronting the p99’s. If so, then go
back to looking at the I/O. Disbelieve any metrics not captured at a high
resolution for a time window around the compactions, like 100ms. You could be
hitting I/O stalls where reads are blocked by the flushing of