Gwen, thanks for the response.

> 1.1 Your life may be a bit simpler if you have a way of starting a new

> broker with the same ID as the old one - this means it will
> automatically pick up the old replicas and you won't need to
> rebalance. Makes life slightly easier in some cases.
>

Yeah, this is definitely doable, I just don't *want* to do it.  I really
want all of these to share the same code path: 1) rolling all nodes in an
ASG to pick up a new AMI, 2) hardware failure / unintentional node
termination, 3) resizing the ASG and rebalancing the data across nodes.

Everything but the first one means generating new node IDs, so I would
rather just do that across the board.  It's the solution that really fits
the ASG model best, so I'm reluctant to give up on it.


> 1.2 Careful not to rebalance too many partitions at once - you only
> have so much bandwidth and currently Kafka will not throttle
> rebalancing traffic.
>

Nod, got it.  This is def something I plan to work on hardening once I have
the basic nut of things working (or if I've had to give up on it and accept
a lesser solution).


> 2. I think your rebalance script is not rebalancing the offsets topic?
> It still has a replica on broker 1002. You have two good replicas, so
> you are nowhere near disaster, but make sure you get this working
> too.
>

Yes, this is another problem I am working on in parallel.  The Shopify
sarama library <https://godoc.org/github.com/Shopify/sarama> uses the
__consumer_offsets topic, but it does *not* rebalance partition assignments
when consumers connect, disconnect, or restart.

"Note that Sarama's Consumer implementation does not currently support
automatic consumer-group rebalancing and offset tracking"

I'm working on trying to get sarama-cluster to do something here.  I
think these problems are likely related; I'm not sure wtf you are
*supposed* to do to rebalance this god damn topic.  It also seems like we
aren't using a consumer group, which sarama-cluster depends on to rebalance
a topic.  I'm still pretty confused by the 0.9 "consumer group" stuff.

Seriously considering downgrading to the latest 0.8 release, because
there's a massive gap in documentation for the new stuff in 0.9 (like
consumer groups) and we don't really need any of the new features.

> A common work-around is to configure the consumer to handle "offset
> out of range" exception by jumping to the last offset available in the
> log. This is the behavior of the Java client, and it would have saved
> your consumer here. Go client looks very low level, so I don't know
> how easy it is to do that.
>

Erf, this seems like it would almost guarantee data loss.  :(  Will check
it out tho.
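For anyone following along: with sarama's low-level consumer, I *think* the
equivalent of the Java behavior is to catch ErrOffsetOutOfRange when opening
the partition and retry from the newest offset.  Totally untested sketch --
the helper name and wiring are mine, only the sarama calls are from the godoc:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

// consumeFrom opens a partition at the stored offset, falling back to the
// newest offset if the stored one has already aged out of the log.  Note
// that jumping to OffsetNewest silently skips the missing messages -- this
// is exactly the "almost guaranteed data loss" trade-off above.
func consumeFrom(c sarama.Consumer, topic string, partition int32, stored int64) (sarama.PartitionConsumer, error) {
	pc, err := c.ConsumePartition(topic, partition, stored)
	if err == sarama.ErrOffsetOutOfRange {
		log.Printf("offset %d out of range for %s/%d, jumping to newest",
			stored, topic, partition)
		pc, err = c.ConsumePartition(topic, partition, sarama.OffsetNewest)
	}
	return pc, err
}
```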

> If I were you, I'd retest your ASG scripts without the auto leader
> election - since your own scripts can / should handle that.
>

Okay, this is straightforward enough.  Will try it.  And will keep trying
to figure out how to balance the __consumer_offsets topic, since I
increasingly think that's the key to this giant mess.
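(For the archives, assuming this is the knob you mean, it's one line in
server.properties:)

```properties
# Let our own ASG scripts drive leadership instead of the controller's
# periodic auto-rebalance.
auto.leader.rebalance.enable=false
```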

If anyone has any advice there, massively appreciated.

Thanks,

charity.
