Regular NullPointerExceptions from `nodetool compactionstats` on 3.7 node

2018-04-20 Thread Paul Pollack
Hi all,

We have a cluster running on Cassandra 3.7 (we already know this is
considered a "bad" version and plan to upgrade to 3.11 in the
not-too-distant future). We have a few Nagios checks that run `nodetool
compactionstats` to see how many compactions are currently pending, and to
check the bytes remaining for compactions so we know whether they will push
us past our comfortable disk utilization threshold.
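
For context, the check is essentially a thin wrapper around nodetool, along
the lines of this sketch (thresholds and parsing are simplified here, not our
exact plugin):

#!/usr/bin/env bash
# Simplified Nagios-style check: alert on the number of pending compactions.
WARN=${1:-20}
CRIT=${2:-50}

OUTPUT=$(nodetool compactionstats 2>&1)
RC=$?
if [ "$RC" -ne 0 ]; then
    # Map nodetool failures to UNKNOWN rather than passing its exit code (2)
    # straight through, since 2 also means CRITICAL to Nagios.
    echo "UNKNOWN: nodetool compactionstats failed (rc=$RC): $OUTPUT"
    exit 3
fi

PENDING=$(echo "$OUTPUT" | awk '/pending tasks/ {print $NF}')
if ! [[ "$PENDING" =~ ^[0-9]+$ ]]; then
    echo "UNKNOWN: could not parse pending tasks from nodetool output"
    exit 3
fi

if [ "$PENDING" -ge "$CRIT" ]; then
    echo "CRITICAL: $PENDING pending compactions"; exit 2
elif [ "$PENDING" -ge "$WARN" ]; then
    echo "WARNING: $PENDING pending compactions"; exit 1
fi
echo "OK: $PENDING pending compactions"; exit 0

Our real check also looks at the bytes remaining, but the flapping comes from
the nodetool call itself failing intermittently, not from the thresholds.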

The check regularly fails with an exit code of 2, and then shortly after
will run successfully, resulting in a check that flaps.

When I am able to reproduce the issue, the output looks like this:

ubuntu@statistic-timelines-11:~$ nodetool compactionstats
error: null
-- StackTrace --
java.lang.NullPointerException

ubuntu@statistic-timelines-11:~$ echo $?
2

I've seen a similar issue that was reported and fixed for 3.0.11, but this
seems slightly different, since in this case something is swallowing the
full stack trace.

So given all this I have a few questions:
- Has anyone seen this before and have an idea as to what might cause it?
- Is it possible that I have something misconfigured that's swallowing the
stack trace?
- Should I file an issue in the Cassandra JIRA for this?

Thanks,
Paul


Re: What is a node's "counter ID?"

2017-10-23 Thread Paul Pollack
Makes sense, thanks Blake!

On Fri, Oct 20, 2017 at 9:17 PM, Blake Eggleston <beggles...@apple.com>
wrote:

> I believe that’s just referencing a counter implementation detail. If I
> remember correctly, there was a fairly large improvement of the
> implementation of counters in 2.1, and the assignment of the id would
> basically be a format migration.
>
>
> On Oct 20, 2017, at 9:57 AM, Paul Pollack <paul.poll...@klaviyo.com>
> wrote:
>
> Hi,
>
> I was reading the doc page for nodetool cleanup
> https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsCleanup.html
> because I was planning to run it after replacing a node
> in my counter cluster and the sentence "Cassandra assigns a new counter ID
> to the node" gave me pause. I can't find any other reference to a node's
> counter ID in the docs and was wondering if anyone here could shed light on
> what this means, and how it would affect the data being stored on a node
> that had its counter ID changed?
>
> Thanks,
> Paul
>
>


What is a node's "counter ID?"

2017-10-20 Thread Paul Pollack
Hi,

I was reading the doc page for nodetool cleanup
https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsCleanup.html
because I was planning to run it after replacing a node in my counter
cluster and the sentence "Cassandra assigns a new counter ID to the node"
gave me pause. I can't find any other reference to a node's counter ID in
the docs and was wondering if anyone here could shed light on what this
means, and how it would affect the data being stored on a node that had its
counter ID changed?

Thanks,
Paul


Re: Drastic increase in disk usage after starting repair on 3.7

2017-09-21 Thread Paul Pollack
So I got to the bottom of this -- turns out it's not an issue with
Cassandra at all. It seems that when these instances were set up we had
originally mounted 2TB drives from /dev/xvdc and persisted those to
/etc/fstab, but at some point someone unmounted them and replaced them with
4TB drives on /dev/xvdf without updating fstab. So what essentially
happened is that I brought a node back into the cluster with a blank data
drive and started a repair, which then began streaming in all the data that
simply wasn't there. I've killed the repair and am going to replace that
node.
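
For anyone who hits something similar, the mismatch is quick to spot once you
know to look. Roughly the checks that gave it away, assuming the default data
directory path (device names are from our setup):

# what is actually backing the data directory right now
df -h /var/lib/cassandra/data
findmnt /var/lib/cassandra/data

# block devices attached to the instance (the 4TB volume shows up as xvdf)
lsblk

# what fstab says should be mounted (ours still referenced the old /dev/xvdc)
grep xvd /etc/fstab

If df shows the data directory sitting on the root volume instead of the big
EBS device, the node effectively came back up with an empty data drive.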

On Thu, Sep 21, 2017 at 7:58 AM, Paul Pollack <paul.poll...@klaviyo.com>
wrote:

> Thanks for the suggestions guys.
>
> Nicolas, I just checked nodetool listsnapshots and it doesn't seem like
> those are causing the increase:
>
> Snapshot Details:
> Snapshot name                              Keyspace name  Column family name          True size  Size on disk
> 1479343904106-statistic_segment_timeline  klaviyo        statistic_segment_timeline  91.73 MiB  91.73 MiB
> 1479343904516-statistic_segment_timeline  klaviyo        statistic_segment_timeline  69.42 MiB  69.42 MiB
> 1479343904607-statistic_segment_timeline  klaviyo        statistic_segment_timeline  69.43 MiB  69.43 MiB
>
> Total TrueDiskSpaceUsed: 91.77 MiB
>
> Kurt, we definitely do have a large backlog of compactions, but I would
> expect only the currently running compactions to take up 2x extra space,
> and for that space to be freed up once they complete. Is that an
> inaccurate picture of how compaction actually works? When the disk was
> almost full at 2TB I increased the EBS volume to 3TB, and now it's using
> 2.6TB, so I think it's only a matter of hours before it fills the rest of
> the volume. The largest files on disk are *-big-Data.db files. Is there
> anything else I can check that might indicate whether or not the repair is
> really the root cause of this issue?
>
> Thanks,
> Paul
>
> On Thu, Sep 21, 2017 at 4:02 AM, Nicolas Guyomar <
> nicolas.guyo...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> This might be a long shot, but some repairs might fail to clear their
>> snapshot (not sure if it's still the case with C* 3.7; I had the problem
>> on the 2.X branch).
>> What does nodetool listsnapshots indicate?
>>
>> On 21 September 2017 at 05:49, kurt greaves <k...@instaclustr.com> wrote:
>>
>>> repair does overstream by design, so if that node is inconsistent you'd
>>> expect a bit of an increase. if you've got a backlog of compactions that's
>>> probably due to repair and likely the cause of the increase. if you're
>>> really worried you can rolling restart to stop the repair, otherwise maybe
>>> try increasing compaction throughput.
>>>
>>
>>
>


Re: Drastic increase in disk usage after starting repair on 3.7

2017-09-21 Thread Paul Pollack
Thanks for the suggestions guys.

Nicolas, I just checked nodetool listsnapshots and it doesn't seem like
those are causing the increase:

Snapshot Details:
Snapshot name                              Keyspace name  Column family name          True size  Size on disk
1479343904106-statistic_segment_timeline  klaviyo        statistic_segment_timeline  91.73 MiB  91.73 MiB
1479343904516-statistic_segment_timeline  klaviyo        statistic_segment_timeline  69.42 MiB  69.42 MiB
1479343904607-statistic_segment_timeline  klaviyo        statistic_segment_timeline  69.43 MiB  69.43 MiB

Total TrueDiskSpaceUsed: 91.77 MiB

Kurt, we definitely do have a large backlog of compactions, but I would
expect only the currently running compactions to take up 2x extra space,
and for that space to be freed up once they complete. Is that an inaccurate
picture of how compaction actually works? When the disk was almost full at
2TB I increased the EBS volume to 3TB, and now it's using 2.6TB, so I think
it's only a matter of hours before it fills the rest of the volume. The
largest files on disk are *-big-Data.db files. Is there anything else I can
check that might indicate whether or not the repair is really the root
cause of this issue?
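
For reference, here is roughly what I have been watching so far (paths assume
the default data directory layout with our klaviyo keyspace):

# overall data volume usage
df -h /var/lib/cassandra/data

# pending count plus size and progress of the compactions in flight
nodetool compactionstats -H

# which tables are taking the space, and where the largest SSTables live
du -sh /var/lib/cassandra/data/klaviyo/* | sort -rh | head
find /var/lib/cassandra/data -name '*-big-Data.db' -size +100G -exec ls -lh {} +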

Thanks,
Paul

On Thu, Sep 21, 2017 at 4:02 AM, Nicolas Guyomar 
wrote:

> Hi Paul,
>
> This might be a long shot, but some repairs might fail to clear their
> snapshot (not sure if it's still the case with C* 3.7; I had the problem
> on the 2.X branch).
> What does nodetool listsnapshots indicate?
>
> On 21 September 2017 at 05:49, kurt greaves  wrote:
>
>> repair does overstream by design, so if that node is inconsistent you'd
>> expect a bit of an increase. if you've got a backlog of compactions that's
>> probably due to repair and likely the cause of the increase. if you're
>> really worried you can rolling restart to stop the repair, otherwise maybe
>> try increasing compaction throughput.
>>
>
>


Re: Drastic increase in disk usage after starting repair on 3.7

2017-09-20 Thread Paul Pollack
Just a quick additional note -- we have checked, and this is the only node
in the cluster exhibiting this behavior; disk usage is steady on all the
others. CPU load on the repairing node is slightly higher but nothing
significant.

On Wed, Sep 20, 2017 at 9:08 PM, Paul Pollack <paul.poll...@klaviyo.com>
wrote:

> Hi,
>
> I'm running a repair on a node in my 3.7 cluster and today got alerted on
> disk space usage. We keep the data and commit log directories on separate
> EBS volumes. The data volume is 2TB. The node went down due to EBS failure
> on the commit log drive. I stopped the instance and was later told by AWS
> support that the drive had recovered. I started the node back up and saw
> that it couldn't replay commit logs due to corrupted data, so I cleared the
> commit logs and then it started up again just fine. I'm not worried about
> anything there that wasn't flushed, I can replay that. I was unfortunately
> just outside the hinted handoff window so decided to run a repair.
>
> Roughly 24 hours after I started the repair is when I got the alert on
> disk space. I checked and saw that right before I started the repair the
> node was using almost 1TB of space, which is right where all the nodes
> sit, and over the course of 24 hours free space had dropped to about
> 200GB.
>
> My gut reaction was that the repair must have caused this increase, but
> I'm not convinced since the disk usage doubled and continues to grow. I
> figured we would see at most an increase of 2x the size of an SSTable
> undergoing compaction, unless there's more to the disk usage profile of a
> node during repair. We use SizeTieredCompactionStrategy on all the tables
> in this keyspace.
>
> Running nodetool compactionstats shows that there are a higher than usual
> number of pending compactions (currently 20), and there's been a large one
> of 292.82GB moving slowly.
>
> Is it plausible that the repair is the cause of this sudden increase in
> disk space usage? Are there any other things I can check that might provide
> insight into what happened?
>
> Thanks,
> Paul
>
>
>


Drastic increase in disk usage after starting repair on 3.7

2017-09-20 Thread Paul Pollack
Hi,

I'm running a repair on a node in my 3.7 cluster and today got alerted on
disk space usage. We keep the data and commit log directories on separate
EBS volumes. The data volume is 2TB. The node went down due to EBS failure
on the commit log drive. I stopped the instance and was later told by AWS
support that the drive had recovered. I started the node back up and saw
that it couldn't replay commit logs due to corrupted data, so I cleared the
commit logs and then it started up again just fine. I'm not worried about
anything there that wasn't flushed, I can replay that. I was unfortunately
just outside the hinted handoff window so decided to run a repair.

Roughly 24 hours after I started the repair is when I got the alert on disk
space. I checked and saw that right before I started the repair the node
was using almost 1TB of space, which is right where all the nodes sit, and
over the course of 24 hours free space had dropped to about 200GB.

My gut reaction was that the repair must have caused this increase, but I'm
not convinced since the disk usage doubled and continues to grow. I figured
we would see at most an increase of 2x the size of an SSTable undergoing
compaction, unless there's more to the disk usage profile of a node during
repair. We use SizeTieredCompactionStrategy on all the tables in this
keyspace.

Running nodetool compactionstats shows that there are a higher than usual
number of pending compactions (currently 20), and there's been a large one
of 292.82GB moving slowly.
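
For context, the numbers above come from spot checks along these lines (the
mount point below is just where our data volume lives):

# crude tracking of disk usage vs. compaction backlog every 5 minutes
while true; do
    date
    df -h /var/lib/cassandra/data | tail -n 1
    nodetool compactionstats -H | head -n 5
    sleep 300
done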

Is it plausible that the repair is the cause of this sudden increase in
disk space usage? Are there any other things I can check that might provide
insight into what happened?

Thanks,
Paul


Question about counters read before write behavior

2017-09-17 Thread Paul Pollack
Hi,

We're trying to confirm whether, on a counter write, the entire partition
is read from disk vs. just the row and column of the partition being
incremented. We've traced the code to this line. It looks like the code
only uses a filter on the partition for reading if the read does not
involve collections or counters. Can anyone familiar with the source code
confirm whether this is true, and whether we're looking at the right lines
of code that show what data is read from disk (or from an internal cache)?

Thanks,
Paul


Re: Bootstrapping node on Cassandra 3.7 causes cluster-wide performance issues

2017-09-11 Thread Paul Pollack
Thanks again guys; this has been a major blocker for us, and I think we've
made some major progress with your advice.

We have gone ahead with Lerh's suggestion and the cluster is operating much
more smoothly while the new node compacts. We read at quorum, so in the
event that we don't make it within the hinted handoff window, at least
there won't be inconsistent data from reads.

Kurt - what we've been observing is that after the node finishes getting
data streamed to it from other nodes, it goes into state UN and only then
starts the compactions; in this case it has about 130 pending. While it's
still joining we don't see an I/O bottleneck. I think the reason this may
be an issue for us is that our nodes generally are not OK: they're
constantly maxing out their disk throughput and have long queues, which is
why we're trying to increase capacity by both adding nodes and switching to
RAIDed disks. Under normal operating circumstances they're pushed to their
limits, so when a node gets backed up on compactions it really is enough to
tip over the cluster.

That's helpful to know regarding sstableofflinerelevel; in my dry run it
did appear that it would shuffle even more SSTables into L0.
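
For anyone curious, the dry run is just the stock invocation, run while
Cassandra is stopped on the node (the keyspace and table names below are
placeholders):

# run only while Cassandra is stopped on this node; names are placeholders
sstableofflinerelevel --dry-run my_keyspace my_counter_table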

On Mon, Sep 11, 2017 at 11:50 PM, kurt greaves  wrote:

>
>> Kurt - We're on 3.7, and our approach was to try throttling compaction
>> throughput as much as possible rather than the opposite. I had found some
>> resources that suggested unthrottling to get it over with, but wasn't
>> sure if this would really help in our situation since the I/O pipe was
>> already fully saturated.
>>
>
> You should unthrottle during bootstrap as the node won't receive read
> queries until it finishes streaming and joins the cluster. It seems
> unlikely that you'd be bottlenecked on I/O during the bootstrapping
> process. If you were, you'd certainly have bigger problems. The aim is to
> clear out the majority of compactions *before* the node joins and starts
> servicing reads. You might also want to increase concurrent_compactors.
> Typical advice is same as # CPU cores, but you might want to increase it
> for the bootstrapping period.
>
> sstableofflinerelevel could help but I wouldn't count on it. Usage is
> pretty straightforward but you may find that a lot of the existing SSTables
> in L0 just get put back in L0 anyways, which is where the main compaction
> backlog comes from. Plus you have to take the node offline which may not be
> ideal. In this case I would suggest the strategy Lerh suggested as being
> more viable.
>
> Regardless, if the rest of your nodes are OK (and you don't have RF1/using
> CL=ALL) Cassandra should pretty effectively route around the slow node so a
> single node backed up on compactions shouldn't be a big deal.
>


Re: Bootstrapping node on Cassandra 3.7 causes cluster-wide performance issues

2017-09-11 Thread Paul Pollack
Thanks for the responses Lerh and Kurt!

Lerh - We had been considering those particular nodetool commands but were
hesitant to perform them on a production node without either testing
adequately in a dev environment or getting some feedback from someone who
knew what they were doing (such as yourself), so thank you for that! Your
point about the blacklist makes complete sense. So I think we'll probably
end up running those after the node finishes streaming and we confirm that
the blacklist is not improving latency. Just out of curiosity, do you have
any experience with sstableofflinerelevel? Is this something that would be
helpful to run with any kind of regularity?

Kurt - We're on 3.7, and our approach was to try throttling compaction
throughput as much as possible rather than the opposite. I had found some
resources that suggested unthrottling to get it over with, but wasn't sure
if this would really help in our situation since the I/O pipe was already
fully saturated.

Best,
Paul

On Mon, Sep 11, 2017 at 9:16 PM, kurt greaves  wrote:

> What version are you using? There are improvements to streaming with LCS
> in 2.2.
> Also, are you unthrottling compaction throughput while the node is
> bootstrapping?
> ​
>


Bootstrapping node on Cassandra 3.7 causes cluster-wide performance issues

2017-09-11 Thread Paul Pollack
Hi,

We run a 48-node cluster that stores counts in wide rows. Each node uses
roughly 1TB of space on a 2TB EBS gp2 drive for its data directory, with
LeveledCompactionStrategy. We have been trying to bootstrap new nodes that
use a RAID 0 configuration over two 1TB EBS drives to raise the I/O
throughput cap from 160 MB/s to 250 MB/s (AWS limits). Every time a node
finishes streaming it is bombarded by a large number of compactions. We see
CPU load on the new node spike extremely high and CPU load on all the other
nodes in the cluster drop unreasonably low. Meanwhile our app's write
latency to this cluster averages 10 seconds or greater. We've already tried
throttling compaction throughput to 1 MB/s, and we've always had
concurrent_compactors set to 2, but the disk is still saturated. In every
case we have had to shut down the Cassandra process on the new node to
resume acceptable operations.
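
For reference, the throttle knobs in question (the 1 MB/s value is just what
we tried, not a recommendation, and 0 disables throttling entirely):

# cassandra.yaml on the joining node
concurrent_compactors: 2

# applied at runtime via nodetool, no restart needed
nodetool setcompactionthroughput 1
nodetool getcompactionthroughput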

We're currently upgrading all of our clients to the 3.11.0 version of the
DataStax Python driver, which will allow us to add the next newly
bootstrapped node to a blacklist. The hope is that if it doesn't accept
writes, the rest of the cluster can serve them adequately (as is the case
whenever we turn down the bootstrapping node), and the new node can finish
its compactions.

We were also interested in hearing if anyone has had much luck using the
sstableofflinerelevel tool, and if this is a reasonable approach for our
issue.

One of my colleagues found a post where a user had a similar issue and
found that bloom filters had an extremely high false positive ratio.
Although I didn't check that during any of these bootstrap attempts, it
seems to me that with this many compactions pending we're likely to observe
the same thing.
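
If we do check it next time, the ratio is visible in nodetool tablestats
output (the keyspace and table names here are placeholders):

nodetool tablestats my_keyspace.my_counter_table | grep -i 'bloom filter'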

Would appreciate any guidance anyone can offer.

Thanks,
Paul


Re: Cassandra 3.7 repair error messages

2017-09-11 Thread Paul Pollack
Thanks Erick, and sorry it took me so long to respond; I had to turn my
attention to other things. It definitely looks like there had been some
network blips on that node for a while before we saw it marked down from
every other node's perspective. Additionally, my original comment that all
of the failure messages referred to the same node was incorrect; every few
hours it would start logging failure messages for other nodes in turn.

I went through the logs on all of the other nodes that were reported failed
from .204's perspective and found that they all failed to create a merkle
tree. We decided to set the consistency level for reads on this cluster to
quorum, which has at least prevented any data inconsistencies and, as far
as we can tell, caused no noticeable performance loss.
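
For reference, the check on each node was roughly the following (the log path
assumes the default package install):

# look for validation (merkle tree) failures around the repair window
grep -iE '(merkle|validation).*(fail|error)' /var/log/cassandra/system.log*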

To answer your last question, I did once successfully run a repair on a
different node. It took about 12 hours.

I think before I dig further into why this repair could not run to
completion I have to address some other issues with the cluster -- namely
that we're hitting the Amazon EBS throughput cap on the data volumes for
our nodes, which is causing our disk queue length to grow and cluster-wide
throughput to tank.

Thanks again for your help,
Paul

On Wed, Aug 30, 2017 at 9:54 PM, Erick Ramirez <flightc...@gmail.com> wrote:

> No, it isn't normal for sessions to fail and you will need to investigate.
> You need to review the logs on node .204 to determine why the session
> failed. For example, did it timeout because of a very large sstable? Or did
> the connection get truncated after a while?
>
> You will need to address the cause of those failures. It could be external
> to the nodes, e.g. firewall closing the socket so you might need to
> configure TCP keep_alive. 33 hours sounds like a really long time. Have you
> successfully run a repair on this cluster before?
>
> On Thu, Aug 31, 2017 at 11:39 AM, Paul Pollack <paul.poll...@klaviyo.com>
> wrote:
>
>> Hi,
>>
>> I'm trying to run a repair on a node in my Cassandra cluster, version 3.7,
>> and was hoping someone may be able to shed light on an error message that
>> keeps cropping up.
>>
>> I started the repair on a node after discovering that it somehow became
>> partitioned from the rest of the cluster, e.g. nodetool status on all other
>> nodes showed it as DN, and on the node itself showed all other nodes as DN.
>> After restarting the Cassandra daemon the node seemed to re-join the
>> cluster just fine, so I began a repair.
>>
>> The repair has been running for about 33 hours (first incremental repair
>> on this cluster), and every so often I'll see a line like this:
>>
>> [2017-08-31 00:18:16,300] Repair session f7ae4e71-8ce3-11e7-b466-79eba0383e4f
>> for range [(-5606588017314999649,-5604469721630340065],
>> (9047587767449433379,9047652965163017217]] failed with error Endpoint /
>> 20.0.122.204 died (progress: 9%)
>>
>> Every one of these lines refers to the same node, 20.0.122.204.
>>
>> I'm mostly looking for guidance here. Do these errors indicate that the
>> entire repair will be worthless, or just for token ranges shared by these
>> two nodes? Is it normal to see error messages of this nature and for a
>> repair not to terminate?
>>
>> Thanks,
>> Paul
>>
>
>


Cassandra 3.7 repair error messages

2017-08-30 Thread Paul Pollack
Hi,

I'm trying to run a repair on a node in my Cassandra cluster, version 3.7, and
was hoping someone may be able to shed light on an error message that keeps
cropping up.

I started the repair on a node after discovering that it somehow became
partitioned from the rest of the cluster, e.g. nodetool status on all other
nodes showed it as DN, and on the node itself showed all other nodes as DN.
After restarting the Cassandra daemon the node seemed to re-join the
cluster just fine, so I began a repair.

The repair has been running for about 33 hours (first incremental repair on
this cluster), and every so often I'll see a line like this:

[2017-08-31 00:18:16,300] Repair session
f7ae4e71-8ce3-11e7-b466-79eba0383e4f for range
[(-5606588017314999649,-5604469721630340065],
(9047587767449433379,9047652965163017217]] failed with error Endpoint /
20.0.122.204 died (progress: 9%)

Every one of these lines refers to the same node, 20.0.122.204.

I'm mostly looking for guidance here. Do these errors indicate that the
entire repair will be worthless, or just for token ranges shared by these
two nodes? Is it normal to see error messages of this nature and for a
repair not to terminate?

Thanks,
Paul