We probably will, that'll come soon-ish (a couple of weeks perhaps). Right now we're limited by who we can engage with in order to collect feedback.
On Tue, Mar 5, 2019 at 11:34 AM Kenneth Brotman <kenbrot...@yahoo.com.invalid> wrote:

Simulators will never get you there. Why don't you let everyone plug in to the NOC in exchange for standard features or limited scale, and make some money on the big cats that you can make the value proposition attractive for anyway? You get the data you have to have, for free, and everyone's Cassandra cluster gets smart!

From: Matthew Stump [mailto:mst...@vorstella.com]
Sent: Tuesday, March 05, 2019 11:12 AM
To: user@cassandra.apache.org
Subject: Re: Looking for feedback on automated root-cause system

Getting people to send data to us can be a little bit of a PITA, but it's doable. We've got data from regulated/secure environments streaming in. None of the data we collect is a risk, but the default is to say no, and you've got to overcome that barrier. We've been through the audit a bunch of times; it gets easier each time because everyone asks more or less the same questions and requires the same set of disclosures.

Cold start for AI is always an issue, but we overcame it via two routes:

We had customers from a pre-existing line of business. We were probably the first ones to run production Cassandra workloads at scale in k8s. We funded the work behind some of the initial blog posts and had to figure out most of the ins and outs of making it work. This data is good for helping to identify edge cases and bugs that you wouldn't normally encounter, but it's super noisy, and in the beginning you've got to do a lot to isolate and/or derive value from the data if you're attempting to do root cause.

Leveraging the above, we built out an extensive simulations pipeline. It initially started as Python scripts targeting k8s, but it's since been fully automated with Spinnaker.
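A pipeline like the one described (Python scripts driving k8s simulations, later automated with Spinnaker) typically starts from an enumerated matrix of cluster specs, one per combination of parameters to simulate. A minimal sketch, with all version strings, topology sizes, and field names invented for illustration rather than taken from Vorstella's actual configuration:

```python
# Sketch of enumerating a simulation matrix: every combination of
# Cassandra version, cluster topology, and disk state becomes one
# simulated cluster spec to hand to a k8s pipeline. Values are illustrative.
from itertools import product

CASSANDRA_VERSIONS = ["2.2.14", "3.0.18", "3.11.4"]  # hypothetical versions
TOPOLOGIES = [
    {"nodes": 3, "racks": 1},
    {"nodes": 6, "racks": 3},
    {"nodes": 12, "racks": 3},
]
DISK_STATES = ["clean", "restore-from-snapshot"]

def simulation_matrix():
    """Yield one cluster spec per (version, topology, disk-state) combination."""
    for version, topo, disk in product(CASSANDRA_VERSIONS, TOPOLOGIES, DISK_STATES):
        yield {
            "cassandra_version": version,
            "nodes": topo["nodes"],
            "racks": topo["racks"],
            "disk_state": disk,
        }

specs = list(simulation_matrix())
print(len(specs))  # 3 versions x 3 topologies x 2 disk states = 18 clusters
```

Each spec would then be passed to whatever actually provisions the cluster (a Spinnaker pipeline stage, a Helm invocation, etc.), which is where the real complexity lives.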
We have a couple of simulations running all the time doing continuous integration with the models, collectors, and pipeline code, but we'll burst out to a couple hundred clusters if we need to test something complicated. It takes just a couple of minutes to spin up hundreds of different load generators, targeting different versions of C*, running with different topologies, using clean disks or restoring from previous snapshots.

As the corpus grows, simulations matter less, and it's easier to get signal from noise in a customer cluster.

On Tue, Mar 5, 2019 at 10:15 AM Kenneth Brotman <kenbrot...@yahoo.com.invalid> wrote:

Matt,

Do you anticipate having trouble getting clients to allow the collector to send data up to your NOC? Wouldn't a lot of companies be unable or uneasy about that?

Your ML can only work if it's got LOTS of data from many different scenarios. How are you addressing that? How are you able to get that much good-quality data?

Kenneth Brotman

From: Kenneth Brotman [mailto:kenbrot...@yahoo.com]
Sent: Tuesday, March 05, 2019 10:01 AM
To: 'user@cassandra.apache.org'
Subject: RE: Looking for feedback on automated root-cause system

I see they have a website now at https://vorstella.com/

From: Matt Stump [mailto:mrevilgn...@gmail.com]
Sent: Friday, February 22, 2019 7:56 AM
To: user
Subject: Re: Looking for feedback on automated root-cause system

For some reason responses to the thread didn't hit my work email; I didn't see them until I checked from my personal account.

The way the system works is that we install a collector that pulls a bunch of metrics from each node and sends them up to our NOC every minute. We've got a bunch of stream processors that take this data and do a bunch of things with it. We've got some dumb ones that check for common misconfigurations, bugs, etc.
They also populate dashboards and a couple of minimal graphs. The more intelligent agents look at the metrics and start generating a bunch of calculated/scaled metrics and events. If one of these triggers a threshold, we kick off the ML, which classifies the root cause using the stored data and points you to the correct knowledge base article with remediation steps. Because we've got the cluster history, we can identify a breach and give you an SLA determination in about 1 minute. The goal is to get you from zero to resolution as quickly as possible.

We're looking for feedback on the existing system: do these events make sense, do I need to beef up a knowledge base article, did it classify correctly, or is there some big bug that everyone is running into that needs to be publicized? We're also looking for where to go next: which models are going to make your life easier?

The system works for C*, Elastic, and Kafka. We'll be doing some blog posts explaining in more detail how it works and some of the interesting things we've found. For example: everything everyone thought they knew about Cassandra thread pool tuning is wrong, nobody really knows how to tune Kafka for large messages, and there are major issues with the Kubernetes charts that people are using.

On Tue, Feb 19, 2019 at 4:40 PM Kenneth Brotman <kenbrot...@yahoo.com.invalid> wrote:

Any information you can share on the inputs it needs/uses would be helpful.

Kenneth Brotman

From: daemeon reiydelle [mailto:daeme...@gmail.com]
Sent: Tuesday, February 19, 2019 4:27 PM
To: user
Subject: Re: Looking for feedback on automated root-cause system

Welcome to the world of testing predictive analytics. I will pass this on to my folks at Accenture; I know of a couple of C* clients we run. Wondering what you had in mind?

Daemeon C.M. Reiydelle
email: daeme...@gmail.com
San Francisco 1.415.501.0198 / London 44 020 8144 9872 / Skype daemeon.c.mreiydelle

On Tue, Feb 19, 2019 at 3:35 PM Matthew Stump <mst...@vorstella.com> wrote:

Howdy,

I've been engaged in the Cassandra user community for a long time, almost 8 years, and have worked on hundreds of Cassandra deployments. One of the things I've noticed in myself and in a lot of my peers who have done consulting, support, or really big deployments is that we get burnt out. We fight a lot of the same fires over and over again and don't get to work on new or interesting stuff. Also, what we do is really hard to transfer to other people because it's based on experience.

Over the past year my team and I have been working to close that gap, creating an assistant that's able to scale some of this knowledge. We've got it to the point where it's able to classify known root causes for an outage or an SLA breach in Cassandra with an accuracy greater than 90%. It can accurately diagnose bugs, data-modeling issues, or misuse of certain features, and when it does, it gives you specific remediation steps with links to knowledge base articles.

We think we've seeded our database with enough root causes that it'll catch the vast majority of issues, but there is always the possibility that we'll run into something previously unknown like CASSANDRA-11170 (one of the issues our system found in the wild).

We're looking for feedback and would like to know if anyone is interested in giving the product a trial. The process would be a collaboration, where we both get to learn from each other and improve how we're doing things.

Thanks,
Matt Stump
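Pulling the thread's pieces together, the event flow described (a derived metric crosses a threshold, a classifier assigns a known root cause, and the user gets a knowledge-base link with remediation steps) could be sketched roughly as below. The classifier here is a stand-in lookup; the real system presumably uses trained ML models over stored cluster history, and every name, threshold, and KB path is hypothetical.

```python
# Hypothetical end-to-end sketch: derived metric -> threshold event ->
# root-cause classification -> knowledge-base article with remediation steps.
# The "classifier" is a stand-in; a real system would use trained models.

THRESHOLDS = {"p99_read_latency_ms": 100.0, "pending_compactions": 100}

KNOWLEDGE_BASE = {
    "compaction_behind": "kb/compaction-falling-behind",
    "hot_partition": "kb/hot-partition-data-model",
}

def derive_events(metrics):
    """Emit an event for every derived metric that crosses its threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

def classify(events, history):
    """Stand-in classifier: map triggered events plus cluster history
    to a known root cause, or None if nothing matches."""
    if "pending_compactions" in events:
        return "compaction_behind"
    if "p99_read_latency_ms" in events and history.get("skewed_partition_reads"):
        return "hot_partition"
    return None

metrics = {"p99_read_latency_ms": 240.0, "pending_compactions": 12}
history = {"skewed_partition_reads": True}

events = derive_events(metrics)
cause = classify(events, history)
print(cause, "->", KNOWLEDGE_BASE.get(cause))
# hot_partition -> kb/hot-partition-data-model
```

The interesting engineering is in the middle step: keeping enough cluster history that, when a threshold trips, the classifier has the context to distinguish causes that present with identical surface symptoms.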