Tom, on the implications you are referring to: to me they seem the same as for
the __consumer_offsets and __transaction_state topics. So I am wondering if we
can rely on the same solutions for them, like providing a *.replication.factor
config option.

Best regards,

Steven

On Mon 19 Mar 2018 at 14:30, Tom Bentley <t.j.bent...@gmail.com> wrote:

> Last week I was able to spend a bit of time working on KIP-236 again and,
> based on the discussion about that with Jun back in December, I refactored
> the controller to store the reassignment state in /brokers/topics/${topic}
> instead of introducing new ZK nodes. This morning I was wondering what to
> do as a next step, since these changes are more or less useless on their
> own, without APIs for discovering the current partitions and/or
> reassigning partitions. I started thinking again about this KIP, and
> realised that using an internal compacted topic (say
> __partition_reassignments), as suggested by Steven and Colin, would
> require changes in basically the same places.
>
> Thinking through some of the failure modes ("what if I update ZK, but
> can't produce to the topic?") I realised that it would actually be
> possible to stop storing this info in ZK entirely and just store it in
> the __partition_reassignments topic. Doing it that way would eliminate
> those failure modes and would allow clients interested in reassignment
> completion to consume from this topic and respond to records published
> with a null value (indicating completion of a reassignment).
>
> There are some interesting implications to doing this:
>
> 1. The __partition_reassignments topic would need to be replicated in
> order for reassignment to remain available (if the leader of a partition
> of __partition_reassignments is unavailable, then the partitions whose
> state is held by that partition of __partition_reassignments would not be
> reassignable).
> 2. We would want to avoid unclean leader election for this topic.
>
> But I am interested in what other people think about this approach.
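[Editor's note: Tom's proposal above implies that a client consuming the compacted __partition_reassignments topic only needs to maintain a map keyed by partition, treating a record with a null value as a tombstone marking completion. A minimal illustration of that bookkeeping follows; the "topic-partition" key and JSON value shown are assumptions for illustration, not a committed record format.]

```java
import java.util.*;

public class ReassignmentState {
    // In-flight reassignments, keyed by "topic-partition".
    private final Map<String, String> inFlight = new HashMap<>();

    // Apply one record from the (hypothetical) __partition_reassignments
    // topic: a non-null value upserts the reassignment state; a null
    // value (tombstone) marks the reassignment as complete.
    void onRecord(String key, String valueOrNull) {
        if (valueOrNull == null) {
            inFlight.remove(key);           // completion
        } else {
            inFlight.put(key, valueOrNull); // new or updated reassignment
        }
    }

    Set<String> inFlightPartitions() {
        return Collections.unmodifiableSet(inFlight.keySet());
    }

    public static void main(String[] args) {
        ReassignmentState state = new ReassignmentState();
        state.onRecord("foo-0", "{\"replicas\":[1,2,3]}");
        state.onRecord("foo-1", "{\"replicas\":[2,3,4]}");
        state.onRecord("foo-0", null); // tombstone: foo-0 done
        System.out.println(state.inFlightPartitions()); // prints [foo-1]
    }
}
```

As for Steven's replication point, Kafka already exposes analogous knobs for its other internal topics (offsets.topic.replication.factor and transaction.state.log.replication.factor), so a *.replication.factor option for this topic would follow an existing pattern.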
>
> Cheers,
>
> Tom
>
> On 9 January 2018 at 21:18, Colin McCabe <cmcc...@apache.org> wrote:
>
> > What if we had an internal topic which watchers could listen to for
> > information about partition reassignments? The information could be in
> > JSON, so if we want to add new fields later, we always could.
> >
> > This avoids introducing a new AdminClient API. For clients that want to
> > be notified about partition reassignments in a timely fashion, this
> > avoids the "polling an AdminClient API in a tight loop" antipattern. It
> > allows watchers to be notified in a simple and natural way about what
> > is going on. Access can be controlled by the existing topic ACL
> > mechanisms.
> >
> > best,
> > Colin
> >
> > On Fri, Dec 22, 2017, at 06:48, Tom Bentley wrote:
> > > Hi Steven,
> > >
> > > I must admit that I didn't really consider that option. I can see how
> > > attractive it is from your perspective. In practice it would come
> > > with lots of edge cases which would need to be thought through:
> > >
> > > 1. What happens if the controller can't produce a record to this
> > > topic because the partition's leader is unavailable?
> > > 2. One solution to that is for the topic to be replicated on every
> > > broker, so that the controller could elect itself leader on
> > > controller failover. But that raises another problem: what if, upon
> > > controller failover, the controller is ineligible for leader election
> > > because it's not in the ISR?
> > > 3. The above questions suggest the controller might not always be
> > > able to produce to the topic, but the controller isn't able to
> > > control when other brokers catch up replicating moved partitions and
> > > has to deal with those events. The controller would have to record
> > > (in memory) that the reassignment was complete but hadn't been
> > > published, and publish later, when it was able to.
> > > 4. Further to 3, we would need to recover the in-memory state of
> > > reassignments on controller failover. But now we have to consider
> > > what happens if the controller cannot *consume* from the topic.
> > >
> > > This seems pretty complicated to me. I think each of the above points
> > > has alternatives (or compromises) which might make the problem more
> > > tractable, so I'd welcome hearing from anyone who has ideas on that.
> > > In particular there are parallels with consumer offsets which might
> > > be worth thinking about some more.
> > >
> > > It would be useful to define better the use case we're trying to
> > > cater to here.
> > >
> > > * Is it just a notification that a given reassignment has finished
> > > that you're interested in?
> > > * What are the consequences if such a notification is delayed, or
> > > dropped entirely?
> > >
> > > Regards,
> > >
> > > Tom
> > >
> > > On 19 December 2017 at 20:34, Steven Aerts <steven.ae...@gmail.com>
> > > wrote:
> > >
> > > > Hello Tom,
> > > >
> > > > When you were working out KIP-236, did you consider migrating the
> > > > reassignment state from ZooKeeper to an internal Kafka topic, keyed
> > > > by partition and log compacted?
> > > >
> > > > It would allow an admin client and controller to easily subscribe
> > > > for those changes, without the need to extend the network protocol
> > > > as discussed in KIP-240.
> > > >
> > > > This is just a theoretical idea I wanted to share, as I can't find
> > > > a reason why it would be a stupid idea. But I assume that in
> > > > practice, this would imply too much change to the code base to be
> > > > viable.
> > > >
> > > > Regards,
> > > >
> > > > Steven
> > > >
> > > > 2017-12-18 11:49 GMT+01:00 Tom Bentley <t.j.bent...@gmail.com>:
> > > > > Hi Steven,
> > > > >
> > > > > > I think it would be useful to be able to subscribe yourself on
> > > > > > updates of reassignment changes.
> > > > >
> > > > > I agree this would be really useful, but, to the extent I
> > > > > understand the networking underpinnings of the admin client, it
> > > > > might be difficult to do well in practice. Part of the problem is
> > > > > that you might "set a watch" (to borrow the ZK terminology) via
> > > > > one broker (or the controller), only for that broker to fail (or
> > > > > the controller to be re-elected). Obviously you can detect the
> > > > > loss of connection and set a new watch via a different broker (or
> > > > > the new controller), but that couldn't be transparent to the
> > > > > user, because the AdminClient doesn't know what changed while it
> > > > > was disconnected/not watching.
> > > > >
> > > > > Another issue is that to avoid races you really need to combine
> > > > > fetching the current state with setting the watch (as is done in
> > > > > the native ZooKeeper API). I think there are lots of subtle
> > > > > issues of this sort which would need to be addressed to make
> > > > > something reliable.
> > > > >
> > > > > In the meantime, ZooKeeper already has a (proven and mature) API
> > > > > for watches, so there is, in principle, a good workaround. I say
> > > > > "in principle" because in the KIP-236 proposal right now the
> > > > > /admin/reassign_partitions znode is legacy and the reassignment
> > > > > is represented by /admin/reassigments/$topic/$partition. That
> > > > > naming scheme for the znode would make it harder for ZooKeeper
> > > > > clients like yours, because such clients would need to set a
> > > > > child watch per topic. The original proposal for the naming
> > > > > scheme was /admin/reassigments/$topic-$partition, which would
> > > > > mean clients like yours would need only one child watch. The
> > > > > advantage of /admin/reassigments/$topic/$partition is that it
> > > > > scales better. I don't currently know how well ZooKeeper copes
> > > > > with nodes with many children, so it's difficult for me to weigh
> > > > > those two options, but I would be happy to switch back to
> > > > > /admin/reassigments/$topic-$partition if we could reassure
> > > > > ourselves it would scale OK to the reassignment sizes people
> > > > > would need in practice.
> > > > >
> > > > > Overall I would prefer not to tackle something like this in
> > > > > *this* KIP, though it could be something for a future KIP. Of
> > > > > course I'm happy to hear more discussion about this too!
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Tom
> > > > >
> > > > > On 15 December 2017 at 18:51, Steven Aerts
> > > > > <steven.ae...@gmail.com> wrote:
> > > > >
> > > > >> Tom,
> > > > >>
> > > > >> I think it would be useful to be able to subscribe yourself on
> > > > >> updates of reassignment changes. Our internal Kafka supervisor
> > > > >> and monitoring tools are currently subscribed to these changes
> > > > >> in ZooKeeper so they can babysit our clusters.
> > > > >>
> > > > >> I think it would be nice if we could receive these events
> > > > >> through the admin client. In the API proposal, you can only poll
> > > > >> for changes.
> > > > >>
> > > > >> No clue how difficult it would be to implement; maybe you can
> > > > >> piggyback on some version number in the repartition messages or
> > > > >> on ZooKeeper.
> > > > >>
> > > > >> This is just an idea, not a must-have feature for me. We can
> > > > >> always poll over the proposed API.
> > > > >>
> > > > >> Regards,
> > > > >>
> > > > >> Steven
> > > > >>
> > > > >> On Fri 15 Dec 2017 at 19:16, Tom Bentley
> > > > >> <t.j.bent...@gmail.com> wrote:
> > > > >>
> > > > >> > Hi,
> > > > >> >
> > > > >> > KIP-236 lays the foundations for AdminClient APIs to do with
> > > > >> > partition reassignment. I'd now like to start discussing
> > > > >> > KIP-240, which adds APIs to the AdminClient to list and
> > > > >> > describe the current reassignments.
> > > > >> >
> > > > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-240%3A+AdminClient.listReassignments+AdminClient.describeReassignments
> > > > >> >
> > > > >> > Aside: I have fairly developed ideas for the API for starting
> > > > >> > a reassignment, but I intend to put that in a third KIP.
> > > > >> >
> > > > >> > Cheers,
> > > > >> >
> > > > >> > Tom
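
[Editor's note: the ZooKeeper-based workaround discussed in the older messages above hinges on the fact that the native client's getChildren(path, watcher) returns the current children and sets the watch in a single call, avoiding the fetch/watch race Tom mentions. A monitoring client like Steven's would then diff successive child listings of the reassignments znode to detect started and completed reassignments. The sketch below shows only that diffing step; the flat $topic-$partition child naming it assumes was still under discussion, and nothing here is committed API.]

```java
import java.util.*;

// On each child-watch event for the reassignments znode, a monitoring
// client re-fetches the child list (re-arming the watch) and diffs it
// against the previous listing. With flat $topic-$partition names, one
// child watch covers every reassignment in the cluster.
public class ReassignmentDiff {
    // Znodes that disappeared between listings => finished reassignments.
    static Set<String> completed(Set<String> before, Set<String> after) {
        Set<String> done = new TreeSet<>(before);
        done.removeAll(after);
        return done;
    }

    // Znodes that appeared between listings => newly started reassignments.
    static Set<String> started(Set<String> before, Set<String> after) {
        Set<String> fresh = new TreeSet<>(after);
        fresh.removeAll(before);
        return fresh;
    }

    public static void main(String[] args) {
        Set<String> before = Set.of("foo-0", "foo-1");
        Set<String> after  = Set.of("foo-1", "bar-0");
        System.out.println("completed: " + completed(before, after)); // [foo-0]
        System.out.println("started:   " + started(before, after));   // [bar-0]
    }
}
```

In a real client the two listings would come from successive ZooKeeper getChildren calls; the diff itself is where a missed or coalesced watch event is safely absorbed, since it compares full listings rather than individual events.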