We probably will, that'll come soon-ish (a couple of weeks perhaps). Right now we're limited by who we can engage with in order to collect feedback.
On Tue, Mar 5, 2019 at 11:34 AM Kenneth Brotman <kenbrot...@yahoo.com.invalid> wrote:

Simulators will never get you there. Why don't you let everyone plug in to the NOC in exchange for standard features or limited scale, and make some money on the big cats that you can make the value proposition attractive for anyway? You get the data you have to have, for free, and everyone's Cassandra cluster gets smart!

From: Matthew Stump [mailto:mst...@vorstella.com]
Sent: Tuesday, March 05, 2019 11:12 AM
To: user@cassandra.apache.org
Subject: Re: Looking for feedback on automated root-cause system

Getting people to send data to us can be a little bit of a PITA, but it's doable. We've got data from regulated/secure environments streaming in. None of the data we collect is a risk, but the default is to say no, and you've got to overcome that barrier. We've been through the audit a bunch of times; it gets easier each time because everyone asks more or less the same questions and requires the same set of disclosures.

Cold start for AI is always an issue, but we overcame it via two routes:

We had customers from a pre-existing line of business. We were probably the first ones to run production Cassandra workloads at scale in k8s. We funded the work behind some of the initial blog posts and had to figure out most of the ins and outs of making it work. This data is good for helping to identify edge cases and bugs that you wouldn't normally encounter, but it's super noisy, and in the beginning you've got to do a lot to isolate and/or derive value from the data if you're attempting to do root cause.

Leveraging the above, we built out an extensive simulations pipeline. It initially started as Python scripts targeting k8s, but it's since been fully automated with Spinnaker.
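A pipeline like the one described (Python scripts driving k8s simulations, later automated with Spinnaker) typically starts from an enumerated matrix of cluster specs, one per combination of parameters to simulate. A minimal sketch, with all version strings, topology sizes, and field names invented for illustration rather than taken from Vorstella's actual configuration:

```python
# Sketch of enumerating a simulation matrix: every combination of
# Cassandra version, cluster topology, and disk state becomes one
# simulated cluster spec to hand to a k8s pipeline. Values are illustrative.
from itertools import product

CASSANDRA_VERSIONS = ["2.2.14", "3.0.18", "3.11.4"]  # hypothetical versions
TOPOLOGIES = [
    {"nodes": 3, "racks": 1},
    {"nodes": 6, "racks": 3},
    {"nodes": 12, "racks": 3},
]
DISK_STATES = ["clean", "restore-from-snapshot"]

def simulation_matrix():
    """Yield one cluster spec per (version, topology, disk-state) combination."""
    for version, topo, disk in product(CASSANDRA_VERSIONS, TOPOLOGIES, DISK_STATES):
        yield {
            "cassandra_version": version,
            "nodes": topo["nodes"],
            "racks": topo["racks"],
            "disk_state": disk,
        }

specs = list(simulation_matrix())
print(len(specs))  # 3 versions x 3 topologies x 2 disk states = 18 clusters
```

Each spec would then be passed to whatever actually provisions the cluster (a Spinnaker pipeline stage, a Helm invocation, etc.), which is where the real complexity lives.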
We have a couple of simulations running all the time doing continuous integration with the models, collectors, and pipeline code, but we'll burst out to a couple hundred clusters if we need to test something complicated. It takes just a couple of minutes to spin up hundreds of different load generators, targeting different versions of C*, running with different topologies, using clean disks or restoring from previous snapshots.

As the corpus grows, simulations matter less, and it's easier to get signal from noise in a customer cluster.

On Tue, Mar 5, 2019 at 10:15 AM Kenneth Brotman <kenbrot...@yahoo.com.invalid> wrote:

Matt,

Do you anticipate having trouble getting clients to allow the collector to send data up to your NOC? Wouldn't a lot of companies be unable or uneasy about that?

Your ML can only work if it's got LOTS of data from many different scenarios. How are you addressing that? How are you able to get that much good-quality data?

Kenneth Brotman

From: Kenneth Brotman [mailto:kenbrot...@yahoo.com]
Sent: Tuesday, March 05, 2019 10:01 AM
To: 'user@cassandra.apache.org'
Subject: RE: Looking for feedback on automated root-cause system

I see they have a website now at https://vorstella.com/

From: Matt Stump [mailto:mrevilgn...@gmail.com]
Sent: Friday, February 22, 2019 7:56 AM
To: user
Subject: Re: Looking for feedback on automated root-cause system

For some reason responses to the thread didn't hit my work email; I didn't see them until I checked from my personal account.

The way the system works is that we install a collector that pulls a bunch of metrics from each node and sends them up to our NOC every minute. We've got a bunch of stream processors that take this data and do a bunch of things with it. We've got some dumb ones that check for common misconfigurations, bugs, etc.
They also populate dashboards and a couple of minimal graphs. The more intelligent agents look at the metrics and start generating a bunch of calculated/scaled metrics and events. If one of these triggers a threshold, we kick off the ML, which classifies the root cause using the stored data and points you to the correct knowledge base article with remediation steps. Because we've got the cluster history, we can identify a breach and give you an SLA determination in about 1 minute. The goal is to get you from zero to resolution as quickly as possible.

We're looking for feedback on the existing system: do these events make sense, do I need to beef up a knowledge base article, did it classify correctly, or is there some big bug that everyone is running into that needs to be publicized? We're also looking for where to go next: which models are going to make your life easier?

The system works for C*, Elastic, and Kafka. We'll be doing some blog posts explaining in more detail how it works and some of the interesting things we've found. For example: everything everyone thought they knew about Cassandra thread pool tuning is wrong, nobody really knows how to tune Kafka for large messages, and there are major issues with the Kubernetes charts that people are using.

On Tue, Feb 19, 2019 at 4:40 PM Kenneth Brotman <kenbrot...@yahoo.com.invalid> wrote:

Any information you can share on the inputs it needs/uses would be helpful.

Kenneth Brotman

From: daemeon reiydelle [mailto:daeme...@gmail.com]
Sent: Tuesday, February 19, 2019 4:27 PM
To: user
Subject: Re: Looking for feedback on automated root-cause system

Welcome to the world of testing predictive analytics. I will pass this on to my folks at Accenture; I know of a couple of C* clients we run. Wondering what you had in mind?

Daemeon C.M. Reiydelle
email: daeme...@gmail.com
San Francisco 1.415.501.0198 / London 44 020 8144 9872 / Skype daemeon.c.mreiydelle

On Tue, Feb 19, 2019 at 3:35 PM Matthew Stump <mst...@vorstella.com> wrote:

Howdy,

I've been engaged in the Cassandra user community for a long time, almost 8 years, and have worked on hundreds of Cassandra deployments. One of the things I've noticed in myself and in a lot of my peers who have done consulting, support, or really big deployments is that we get burnt out. We fight a lot of the same fires over and over again and don't get to work on new or interesting stuff. Also, what we do is really hard to transfer to other people because it's based on experience.

Over the past year my team and I have been working to close that gap, creating an assistant that's able to scale some of this knowledge. We've got it to the point where it's able to classify known root causes for an outage or an SLA breach in Cassandra with an accuracy greater than 90%. It can accurately diagnose bugs, data-modeling issues, or misuse of certain features, and when it does, it gives you specific remediation steps with links to knowledge base articles.

We think we've seeded our database with enough root causes that it'll catch the vast majority of issues, but there is always the possibility that we'll run into something previously unknown like CASSANDRA-11170 (one of the issues our system found in the wild).

We're looking for feedback and would like to know if anyone is interested in giving the product a trial. The process would be a collaboration, where we both get to learn from each other and improve how we're doing things.

Thanks,
Matt Stump
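Pulling the thread's pieces together, the event flow described (a derived metric crosses a threshold, a classifier assigns a known root cause, and the user gets a knowledge-base link with remediation steps) could be sketched roughly as below. The classifier here is a stand-in lookup; the real system presumably uses trained ML models over stored cluster history, and every name, threshold, and KB path is hypothetical.

```python
# Hypothetical end-to-end sketch: derived metric -> threshold event ->
# root-cause classification -> knowledge-base article with remediation steps.
# The "classifier" is a stand-in; a real system would use trained models.

THRESHOLDS = {"p99_read_latency_ms": 100.0, "pending_compactions": 100}

KNOWLEDGE_BASE = {
    "compaction_behind": "kb/compaction-falling-behind",
    "hot_partition": "kb/hot-partition-data-model",
}

def derive_events(metrics):
    """Emit an event for every derived metric that crosses its threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

def classify(events, history):
    """Stand-in classifier: map triggered events plus cluster history
    to a known root cause, or None if nothing matches."""
    if "pending_compactions" in events:
        return "compaction_behind"
    if "p99_read_latency_ms" in events and history.get("skewed_partition_reads"):
        return "hot_partition"
    return None

metrics = {"p99_read_latency_ms": 240.0, "pending_compactions": 12}
history = {"skewed_partition_reads": True}

events = derive_events(metrics)
cause = classify(events, history)
print(cause, "->", KNOWLEDGE_BASE.get(cause))
# hot_partition -> kb/hot-partition-data-model
```

The interesting engineering is in the middle step: keeping enough cluster history that, when a threshold trips, the classifier has the context to distinguish causes that present with identical surface symptoms.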