Reasons.

Investigated it thoroughly, believe me.  Some of the limitations that
Kinesis uses to protect itself are non-starters for us.

Forgot to mention, we are using Kafka 0.9.0.1-0.



On Tue, Jun 28, 2016 at 3:56 PM, Pradeep Gollakota <pradeep...@gmail.com>
wrote:

> Just out of curiosity, if you guys are in AWS for everything, why not use
> Kinesis?
>
> On Tue, Jun 28, 2016 at 3:49 PM, Charity Majors <char...@hound.sh> wrote:
>
> > Hi there,
> >
> > I just finished implementing Kafka + autoscaling groups in a way that
> > made sense to me.  I have a _lot_ of experience with ASGs and various
> > storage types, but I'm a Kafka noob (about 4-5 months of using it in
> > development, staging, and pre-launch production).
> >
> > It seems to be working fine from the Kafka POV but causing troubling
> > side effects elsewhere that I don't understand.  I don't know enough
> > about Kafka to know if my implementation is just fundamentally flawed
> > for some reason, or if so how and why.
> >
> > My process is basically this:
> >
> > - *Terminate a node*, or increment the size of the ASG by one.  (I'm not
> > doing any graceful shutdowns because I don't want to rely on graceful
> > shutdowns, and I'm not attempting to act upon more than one node at a
> > time.  Planning on doing a ZK lock or something later to enforce one
> > process at a time, if I can work the major kinks out.)
> >
> > - *Firstboot script*, which runs on all hosts from rc.init.  (We run ASGs
> > for *everything*.)  It infers things like the chef role, environment,
> > cluster name, etc., registers DNS, bootstraps and runs chef-client, and so
> > on.  For storage nodes, it formats and mounts a PIOPS volume under the
> > right mount point, or just remounts the volume if it already contains
> > data.  (A sketch of that format-or-remount decision is below the list.)
> >
> > - *Run a balancing script from firstboot* on Kafka nodes.  It checks how
> > many brokers there are and what their ids are, and looks for any
> > under-replicated partitions with fewer than 3 ISRs.  Then we generate a
> > new assignment file for rebalancing partitions and execute it.  We watch
> > on the host for all the partitions to finish rebalancing, then complete.
> > (A simplified sketch of this step is also below the list.)
> >
> > *- So far so good*.  I have repeatedly killed kafka nodes and had them
> > come up, rebalance the cluster, and everything on the kafka side looks
> > healthy.  All the partitions have the correct number of ISRs, etc.
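> >
> > (Since I referenced it above: the format-or-remount decision in the
> > firstboot script boils down to roughly the following.  This is a
> > simplified Go sketch, not the actual script; the device path and mount
> > point are illustrative.)
> >
> > // firstboot-storage-sketch.go: the "format or just remount" decision the
> > // firstboot script makes for storage nodes.  Device and mount point are
> > // illustrative; the real script figures these out at boot.
> > package main
> >
> > import (
> >     "log"
> >     "os/exec"
> > )
> >
> > func main() {
> >     const device = "/dev/xvdf"      // illustrative PIOPS volume
> >     const mountPoint = "/srv/kafka" // illustrative
> >
> >     // blkid exits non-zero when the device has no recognizable filesystem,
> >     // which is how we decide between "fresh volume" and "remount".
> >     if err := exec.Command("blkid", device).Run(); err != nil {
> >         log.Printf("no filesystem on %s, formatting", device)
> >         if out, err := exec.Command("mkfs.ext4", device).CombinedOutput(); err != nil {
> >             log.Fatalf("mkfs failed: %v\n%s", err, out)
> >         }
> >     } else {
> >         log.Printf("%s already has a filesystem, remounting", device)
> >     }
> >
> >     if out, err := exec.Command("mount", device, mountPoint).CombinedOutput(); err != nil {
> >         log.Fatalf("mount failed: %v\n%s", err, out)
> >     }
> >     log.Printf("%s mounted at %s", device, mountPoint)
> > }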
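> >
> > (And the rebalance step, in sketch form: list the live brokers, find any
> > partition whose ISR set is smaller than we want, and emit an assignment
> > file for kafka-reassign-partitions.sh to execute.  This uses Sarama's
> > metadata calls; the topic name, target replica count, and round-robin
> > placement are illustrative, and the real script also executes the plan
> > and waits for the reassignment to finish.)
> >
> > // rebalance-sketch.go: generate a reassignment plan for under-replicated
> > // partitions.  Simplified; assumes at least wantReplicas live brokers.
> > package main
> >
> > import (
> >     "encoding/json"
> >     "fmt"
> >     "io/ioutil"
> >
> >     "github.com/Shopify/sarama"
> > )
> >
> > // The JSON layout kafka-reassign-partitions.sh expects.
> > type assignment struct {
> >     Topic     string  `json:"topic"`
> >     Partition int32   `json:"partition"`
> >     Replicas  []int32 `json:"replicas"`
> > }
> >
> > type reassignment struct {
> >     Version    int          `json:"version"`
> >     Partitions []assignment `json:"partitions"`
> > }
> >
> > func main() {
> >     client, err := sarama.NewClient([]string{"localhost:9092"}, sarama.NewConfig())
> >     if err != nil {
> >         panic(err)
> >     }
> >     defer client.Close()
> >
> >     // Ids of every broker currently registered in the cluster.
> >     var brokerIDs []int32
> >     for _, b := range client.Brokers() {
> >         brokerIDs = append(brokerIDs, b.ID())
> >     }
> >
> >     const topic = "events"  // illustrative
> >     const wantReplicas = 3  // illustrative target ISR count
> >
> >     partitions, err := client.Partitions(topic)
> >     if err != nil {
> >         panic(err)
> >     }
> >
> >     plan := reassignment{Version: 1}
> >     for _, p := range partitions {
> >         isr, err := client.InSyncReplicas(topic, p)
> >         if err != nil || len(isr) >= wantReplicas {
> >             continue // healthy (or unknown), leave it alone
> >         }
> >         // Under-replicated: spread it across the live brokers round-robin.
> >         replicas := make([]int32, 0, wantReplicas)
> >         for i := 0; i < wantReplicas; i++ {
> >             replicas = append(replicas, brokerIDs[(int(p)+i)%len(brokerIDs)])
> >         }
> >         plan.Partitions = append(plan.Partitions, assignment{Topic: topic, Partition: p, Replicas: replicas})
> >     }
> >
> >     // kafka-reassign-partitions.sh --reassignment-json-file picks this up.
> >     out, err := json.Marshal(plan)
> >     if err != nil {
> >         panic(err)
> >     }
> >     if err := ioutil.WriteFile("/tmp/reassign.json", out, 0644); err != nil {
> >         panic(err)
> >     }
> >     fmt.Printf("wrote %d partition reassignments\n", len(plan.Partitions))
> > }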
> >
> > But after doing this, we have repeatedly gotten into a state where the
> > consumers pulling off the Kafka partitions end up with a last known
> > offset that is *ahead* of the last known offset for that partition, and
> > we can't recover from it.
> >
> > *An example.*  Last night I terminated ... I think it was broker 1002 or
> > 1005, and it came back up as broker 1009.  It rebalanced on boot, and
> > everything looked good from the Kafka side.  This morning we noticed that
> > the storage node that maps to partition 5 has been broken for about 22
> > hours: it thinks the next offset is too far ahead / out of bounds, so it
> > stopped consuming.  This happened shortly after broker 1009 came online
> > and the consumer caught up.
> >
> > From the storage node log:
> >
> > time="2016-06-28T21:51:48.286035635Z" level=info msg="Serving at
> > 0.0.0.0:8089..."
> > time="2016-06-28T21:51:48.293946529Z" level=error msg="Error creating
> > consumer" error="kafka server: The requested offset is outside the range
> of
> > offsets maintained by the server for the given topic/partition."
> > time="2016-06-28T21:51:48.294532365Z" level=error msg="Failed to start
> > services: kafka server: The requested offset is outside the range of
> > offsets maintained by the server for the given topic/partition."
> > time="2016-06-28T21:51:48.29461156Z" level=info msg="Shutting down..."
> >
> > From the mysql mapping of partitions to storage nodes/statuses:
> >
> > PRODUCTION ubuntu@retriever-112c6d8d:/srv/hound/retriever/log$ hound-kennel
> >
> > Listing by default. Use -action <listkafka, nextoffset, watchlive, setstate, addslot, removeslot, removenode> for other actions
> >
> > Part    Status          Last Updated                    Hostname
> > 0       live            2016-06-28 22:29:10 +0000 UTC   retriever-772045ec
> > 1       live            2016-06-28 22:29:29 +0000 UTC   retriever-75e0e4f2
> > 2       live            2016-06-28 22:29:25 +0000 UTC   retriever-78804480
> > 3       live            2016-06-28 22:30:01 +0000 UTC   retriever-c0da5f85
> > 4       live            2016-06-28 22:29:42 +0000 UTC   retriever-122c6d8e
> > 5                       2016-06-28 21:53:48 +0000 UTC
> >
> >
> > PRODUCTION ubuntu@retriever-112c6d8d:/srv/hound/retriever/log$ hound-kennel -partition 5 -action nextoffset
> >
> > Next offset for partition 5: 12040353
> >
> >
> > Interestingly, the leader for partition 5 is 1004, and its follower is
> > the new node 1009.  (Partition 2 has 1009 as its leader and 1004 as its
> > follower, and seems just fine.)
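> >
> > (For reference, the retriever consume path boils down to something like
> > the sketch below: it resumes from the offset stored in mysql, and the
> > ConsumePartition call is where the out-of-range error above surfaces.
> > This is a simplified sketch with illustrative names, not our actual code;
> > the GetOffset calls are only there to show how the stored offset can be
> > compared against the range the broker actually has.)
> >
> > // consume-sketch.go: roughly where "The requested offset is outside the
> > // range" comes from.  Broker address, topic, and the stored offset are
> > // illustrative.
> > package main
> >
> > import (
> >     "log"
> >
> >     "github.com/Shopify/sarama"
> > )
> >
> > func main() {
> >     const topic = "events" // illustrative
> >     const partition = int32(5)
> >     storedOffset := int64(12040353) // what mysql says the next offset is
> >
> >     client, err := sarama.NewClient([]string{"localhost:9092"}, sarama.NewConfig())
> >     if err != nil {
> >         log.Fatal(err)
> >     }
> >     defer client.Close()
> >
> >     // Ask the broker what range it actually holds for this partition.
> >     oldest, err := client.GetOffset(topic, partition, sarama.OffsetOldest)
> >     if err != nil {
> >         log.Fatal(err)
> >     }
> >     newest, err := client.GetOffset(topic, partition, sarama.OffsetNewest)
> >     if err != nil {
> >         log.Fatal(err)
> >     }
> >     log.Printf("broker has [%d, %d) for partition %d; stored offset is %d",
> >         oldest, newest, partition, storedOffset)
> >
> >     consumer, err := sarama.NewConsumerFromClient(client)
> >     if err != nil {
> >         log.Fatal(err)
> >     }
> >     defer consumer.Close()
> >
> >     // If storedOffset falls outside [oldest, newest), this fails with
> >     // exactly the offset-out-of-range error in the log above.
> >     pc, err := consumer.ConsumePartition(topic, partition, storedOffset)
> >     if err != nil {
> >         log.Fatalf("failed to start consuming: %v", err)
> >     }
> >     defer pc.Close()
> >
> >     for msg := range pc.Messages() {
> >         log.Printf("offset %d: %d bytes", msg.Offset, len(msg.Value))
> >     }
> > }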
> >
> > I've attached all the kafka logs for the broker 1009 node since it
> > launched yesterday.
> >
> > I guess my main question is: *Is there something I am fundamentally
> > missing about the Kafka model that makes it not play well with
> > autoscaling?*  I see a couple of other people on the internet talking
> > about using ASGs with Kafka, but always in the context of maintaining a
> > list of broker ids and reusing them.
> >
> > *I don't want to do that.  I want the path for hardware termination,
> > expanding the ASG size, and rolling entire ASGs to pick up new AMIs to
> > all be the same.*  I want all of these actions to be completely trivial
> > and no big deal.  Is there something I'm missing?  Does anyone know why
> > this is causing problems?
> >
> > Thanks so much for any help or insight anyone can provide,
> > charity.
> >
> >
> > P.S., some additional details about our kafka/consumer configuration:
> >
> > - We autogenerate/autoincrement broker ids from zk
> >
> > - We have one topic, with "many" partitions depending on the env, and a
> > replication factor of 2 (now bumping to 3)
> >
> > - We have our own in-house storage layer ("retriever") which consumes
> > Kafka partitions.  The mapping of partitions to storage nodes is stored
> > in mysql, along with the last known offset and some other details.
> > Partitions currently have a 1-1 mapping with storage nodes, e.g.
> > partition 5 => the retriever-112c6d8d storage node.
> >
> > - We are using the golang Sarama client, with the __consumer_offsets
> > internal topic.  This also seems to have weird problems.  It does not
> > rebalance the way the docs say it is supposed to when consumers are
> > added or restarted.  (In fact I haven't been able to figure out how to
> > get it to rebalance or how to change the replication factor ... but I
> > haven't really dived into this one and tried to debug it yet; I've been
> > deep in the ASG stuff.)  But I'm looking at this next, because it seems
> > very likely related in some way: the __consumer_offsets topic seems to
> > break at the same time.  `kafkacat` and `kafka-topics --describe` output
> > in the gist below:
> >
> > https://gist.github.com/charity/d83f25b5e3f4994eb202f35fae74e7d1
> >
> > As you can see, even though 2/3 of the __consumer_offsets replicas are
> > online, it thinks none of them are available, despite the fact that 5 of
> > 6 consumers are happily consuming away.
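> >
> > (In case it helps, the offset tracking on our side goes through something
> > like Sarama's OffsetManager, roughly the sketch below; the group, topic,
> > and partition names are illustrative.  Note that stock Sarama at this
> > vintage only manages offset commits against __consumer_offsets, not group
> > membership or partition rebalancing.)
> >
> > // offsets-sketch.go: roughly how offsets get committed to the
> > // __consumer_offsets topic through Sarama's OffsetManager.
> > package main
> >
> > import (
> >     "log"
> >
> >     "github.com/Shopify/sarama"
> > )
> >
> > func main() {
> >     config := sarama.NewConfig()
> >     config.Consumer.Offsets.Initial = sarama.OffsetOldest
> >
> >     client, err := sarama.NewClient([]string{"localhost:9092"}, config)
> >     if err != nil {
> >         log.Fatal(err)
> >     }
> >     defer client.Close()
> >
> >     // The offset manager talks to the group coordinator, which stores
> >     // commits in __consumer_offsets.
> >     om, err := sarama.NewOffsetManagerFromClient("retriever", client)
> >     if err != nil {
> >         log.Fatal(err)
> >     }
> >     defer om.Close()
> >
> >     pom, err := om.ManagePartition("events", 5)
> >     if err != nil {
> >         log.Fatal(err)
> >     }
> >     defer pom.Close()
> >
> >     // Where this group last left off for this partition.
> >     next, metadata := pom.NextOffset()
> >     log.Printf("resuming at offset %d (metadata %q)", next, metadata)
> >
> >     // ... consume a message at offset `next` ...
> >
> >     // Record progress; MarkOffset takes the next offset to be consumed.
> >     pom.MarkOffset(next+1, "")
> > }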
> >
> >
>
