NullPointerException when running nodetool stopdaemon

2019-02-22 Thread Timothy Palpant
I am trying to use `nodetool stopdaemon` to stop Cassandra but am hitting the
following error:

```
$ cassandra_ctl nodetool -h 127.0.0.1 -p 5100 stopdaemon
error: null
-- StackTrace --
java.lang.NullPointerException
at org.apache.cassandra.config.DatabaseDescriptor.getDiskFailurePolicy(DatabaseDescriptor.java:1877)
at org.apache.cassandra.utils.JVMStabilityInspector.inspectThrowable(JVMStabilityInspector.java:62)
at org.apache.cassandra.tools.nodetool.StopDaemon.execute(StopDaemon.java:39)
at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:254)
at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:168)
```

This looks very similar to:
https://issues.apache.org/jira/browse/CASSANDRA-13030
but I am running v3.11.1, which has that fix in it:

```
$ cassandra_ctl nodetool -h 127.0.0.1 -p 5100 version
ReleaseVersion: 3.11.1
```

Has anyone else run into this problem, or know of a way to work around it?
(or am I running the command incorrectly?)
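
(In the meantime I have been falling back to a drain plus a SIGTERM to the JVM, roughly
like the sketch below; the pid-file path is just what my setup happens to use.)

```
$ cassandra_ctl nodetool -h 127.0.0.1 -p 5100 drain   # flush memtables and stop accepting new writes
$ kill $(cat /var/run/cassandra/cassandra.pid)        # SIGTERM, the same signal the packaged scripts use to stop the daemon
```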

Thanks!
Tim


Re: Tombstones in memtable

2019-02-22 Thread Jeff Jirsa
If all of your data is TTL’d and you never explicitly delete a cell without
using a TTL, you can probably drop your GCGS to 1 hour (or less).

Which compaction strategy are you using? You need a way to clear out those 
tombstones. There are tombstone compaction sub-properties that encourage
compaction to grab sstables just because they’re full of tombstones, which
will probably help you.
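
As a rough sketch of what that looks like together (the keyspace/table name, the
strategy, and the numbers are placeholders you’d tune for your schema):

```
$ cqlsh -e "
  ALTER TABLE my_ks.my_table
  WITH gc_grace_seconds = 3600
  AND compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_threshold': '0.2',
    'tombstone_compaction_interval': '3600',
    'unchecked_tombstone_compaction': 'true'
  };"
```

unchecked_tombstone_compaction is the one that lets a single sstable be compacted on
its own once it crosses the tombstone ratio, which is what actually clears things out.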


-- 
Jeff Jirsa


> On Feb 22, 2019, at 8:37 AM, Kenneth Brotman  
> wrote:
> 
> Can we see the histogram?  Why wouldn’t you at times have that many
> tombstones?  It makes sense that you would.
>  
> Kenneth Brotman
>  
> From: Rahul Reddy [mailto:rahulreddy1...@gmail.com] 
> Sent: Thursday, February 21, 2019 7:06 AM
> To: user@cassandra.apache.org
> Subject: Tombstones in memtable
>  
> We have a small table with about 5k records.
> All inserts come with a 4-hour TTL, the table-level TTL is 1 day, and
> gc_grace_seconds is 3 hours. We do 5k reads a second during peak load, and
> during peak load we see alerts for the tombstone-scanned histogram reaching a
> million.
> Cassandra version 3.11.1. Please let me know how this tombstone scan in the
> memtable can be avoided.


RE: Tombstones in memtable

2019-02-22 Thread Kenneth Brotman
Can we see the histogram?  Why wouldn’t you at times have that many tombstones?
It makes sense that you would.

Kenneth Brotman

From: Rahul Reddy [mailto:rahulreddy1...@gmail.com] 
Sent: Thursday, February 21, 2019 7:06 AM
To: user@cassandra.apache.org
Subject: Tombstones in memtable

We have a small table with about 5k records.

All inserts come with a 4-hour TTL, the table-level TTL is 1 day, and
gc_grace_seconds is 3 hours. We do 5k reads a second during peak load, and during
peak load we see alerts for the tombstone-scanned histogram reaching a million.

Cassandra version 3.11.1. Please let me know how this tombstone scan in the
memtable can be avoided.



Re: Looking for feedback on automated root-cause system

2019-02-22 Thread Matt Stump
For some reason responses to the thread didn't hit my work email; I didn't
see them until I checked from my personal account.

The way the system works is that we install a collector that pulls a
bunch of metrics from each node and sends them up to our NOC every minute.
We've got a bunch of stream processors that take this data and do a bunch
of things with it. We've got some dumb ones that check for common
misconfigurations, bugs, etc.; they also populate dashboards and a couple
of minimal graphs. The more intelligent agents take a look at the metrics
and start generating a bunch of calculated/scaled metrics and events. If
one of these crosses a threshold, we kick off the ML that uses the stored
data to classify the root cause and point you to the correct knowledge
base article with remediation steps. Because we've got the cluster
history, we can identify an SLA breach and give you an answer in about a
minute. The goal is to get you from 0 to resolution as quickly as
possible.
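
(To make the collection side concrete: conceptually it's no more exotic than a
once-a-minute push of node metrics, something like the shell sketch below. The real
collector is not a shell loop, and the ingest endpoint here is made up.)

```
$ while true; do
    nodetool tpstats | curl -sS -X POST --data-binary @- https://noc.example.com/ingest
    sleep 60
  done
```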

We're looking for feedback on the existing system: do these events make
sense, do I need to beef up a knowledge base article, did it classify
correctly, or is there some big bug that everyone is running into that
needs to be publicized? We're also looking for where to go next: which
models are going to make your life easier?

The system works for C*, Elastic and Kafka. We'll be doing some blog posts
explaining in more detail how it works and some of the interesting things
we've found: for example, everything everyone thought they knew about
Cassandra thread pool tuning is wrong, nobody really knows how to tune
Kafka for large messages, and there are major issues with the Kubernetes
charts that people are using.



On Tue, Feb 19, 2019 at 4:40 PM Kenneth Brotman
 wrote:

> Any information you can share on the inputs it needs/uses would be helpful.
>
>
>
> Kenneth Brotman
>
>
>
> *From:* daemeon reiydelle [mailto:daeme...@gmail.com]
> *Sent:* Tuesday, February 19, 2019 4:27 PM
> *To:* user
> *Subject:* Re: Looking for feedback on automated root-cause system
>
>
>
> Welcome to the world of testing predictive analytics. I will pass this on
> to my folks at Accenture; I know of a couple of C* clients we run. What did
> you have in mind?
>
>
>
>
>
> *Daemeon C.M. Reiydelle*
>
> *email: daeme...@gmail.com *
>
> *San Francisco 1.415.501.0198/London 44 020 8144 9872/Skype
> daemeon.c.mreiydelle*
>
>
>
>
>
> On Tue, Feb 19, 2019 at 3:35 PM Matthew Stump 
> wrote:
>
> Howdy,
>
> I’ve been engaged in the Cassandra user community for a long time, almost
> 8 years, and have worked on hundreds of Cassandra deployments. One of the
> things I’ve noticed in myself and a lot of my peers who have done
> consulting, support, or worked on really big deployments is that we get
> burnt out. We fight a lot of the same fires over and over again and don’t
> get to work on new or interesting stuff. Also, what we do is really hard
> to transfer to other people because it’s based on experience.
>
> Over the past year my team and I have been working to overcome that gap,
> creating an assistant that’s able to scale some of this knowledge. We’ve
> got it to the point where it’s able to classify known root causes for an
> outage or an SLA breach in Cassandra with an accuracy greater than 90%. It
> can accurately diagnose bugs, data-modeling issues, or misuse of certain
> features, and when it does, it gives you specific remediation steps with
> links to knowledge base articles.
>
>
>
> We think we’ve seeded our database with enough root causes that it’ll
> catch the vast majority of issues but there is always the possibility that
> we’ll run into something previously unknown like CASSANDRA-11170 (one of
> the issues our system found in the wild).
>
> We’re looking for feedback and would like to know if anyone is interested
> in giving the product a trial. The process would be a collaboration, where
> we both get to learn from each other and improve how we’re doing things.
>
> Thanks,
> Matt Stump
>
>