I recently did some testing like this with simple calls to `tc` (the samples
I used were from the README for https://github.com/tylertreat/comcast). The
only notable bug I've found so far is that if you cut all the Kafka nodes
off from ZooKeeper entirely for, say, 60 seconds and then reconnect them, the
nodes don't crash and they report as healthy in JMX, but calls to fetch
metadata from them time out entirely. That can be fixed with a rolling
restart, but that doesn't sound ideal (especially on cloud networks, where
short-lived total network outages can and do happen).
Should I file a Jira detailing that bug?
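
For anyone wanting to try a similar partition, here is a rough sketch in
Python, not the exact commands used above. It assumes a Linux broker host
with `tc` and the netem qdisc available, root privileges, and an interface
named eth0. Note that `netem loss 100%` drops all egress traffic on the
interface, which is coarser than cutting off ZooKeeper alone. The function
names are hypothetical.

```python
import subprocess
import time


def partition_commands(interface="eth0"):
    """Build the tc commands that start and stop a total egress blackout.

    Assumption: dropping 100% of egress packets on the interface is an
    acceptable stand-in for "cut off from ZooKeeper"; it also cuts off
    every other peer reachable through that interface.
    """
    start = ["tc", "qdisc", "add", "dev", interface, "root", "netem", "loss", "100%"]
    stop = ["tc", "qdisc", "del", "dev", interface, "root"]
    return start, stop


def run_partition(interface="eth0", seconds=60, dry_run=True, sleep=time.sleep):
    """Apply the blackout, wait, then remove it. Returns the commands issued.

    With dry_run=True nothing is executed; the commands are only recorded.
    """
    start, stop = partition_commands(interface)
    issued = []

    def run(cmd):
        issued.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # needs root

    run(start)
    try:
        sleep(seconds)
    finally:
        # Always restore connectivity, even if the wait is interrupted.
        run(stop)
    return issued


if __name__ == "__main__":
    for cmd in run_partition(seconds=0, dry_run=True):
        print(" ".join(cmd))
```

Run it with dry_run=False on each broker at roughly the same time, then try
fetching metadata once the 60 seconds have passed.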

On Wed, Oct 5, 2016 at 7:26 PM, Gwen Shapira <g...@confluent.io> wrote:

> Yeah, totally agree on discussing what we want to test first and
> implement anything later :)
>
> It's just that whenever I have this discussion Jepsen comes up, so I was
> curious what was driving the interest and whether the specific
> framework is important to the community.
>
> On Tue, Oct 4, 2016 at 5:46 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
> > Hi Gwen,
> >
> >> I've also seen suggestions of using Jepsen for fault injection, but
> >> I'm not familiar with this framework.
> >>
> >> What do you guys think? Write our own failure injection? Or write
> >> Kafka tests in Jepsen?
> >>
> >
> > This would definitely add a lot of value and save a lot on release
> > validation overheads. I have heard of Jepsen (via the blog), but haven't
> > used it. At LinkedIn a couple of infra teams have been using Simoorg
> > <https://github.com/linkedin/simoorg>, which, being Python-based, would
> > perhaps be easier for system test writers to use than Clojure (under
> > Jepsen). The Ambry <https://github.com/linkedin/ambry> project at LinkedIn
> > uses it extensively (and I think has added several more failure scenarios
> > which don't seem to be reflected in the GitHub repo). Anyway, I think we
> > should at least enumerate what we want to test and evaluate the
> > alternatives before reinventing.
> >
> > Thanks,
> >
> > Joel
>
>
>
> --
> Gwen Shapira
> Product Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter | blog
>
