On Thu, May 2, 2019, at 09:35, Viktor Somogyi-Vass wrote:
> Hey Colin & George,
> 
> Thinking on George's points, I was wondering whether it's feasible to submit a
> big reassignment to the controller and thus to Zookeeper, since frequent writes
> are slow because the quorum has to synchronize. Perhaps it should be the
> responsibility of KIP-435 <https://issues.apache.org/jira/browse/KIP-435>, but
> I'd like to note it here since we're changing the current znode layout in this
> KIP.

Hi Viktor,

This is conceptually similar to what happens when we lose a broker from the 
cluster.  In that case, we have to remove that node from the ISR of all the 
partitions it hosts, which means updating O(partitions_on_node) znodes.  It's 
also similar to completing a reassignment in the existing Kafka version, where 
we update the partition znodes to reflect new nodes joining the ISR for various 
partitions.  While you are right that ZK is a low-bandwidth system, in general 
writing to a few thousand znodes over the course of a second or two is OK.
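Just to make that concrete, here is a rough sketch of batching znode writes 
with ZooKeeper's multi() API.  (The controller actually goes through its own 
async ZK client, so treat this as illustrative only-- the point is that a few 
thousand writes doesn't have to mean a few thousand round trips.)

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooKeeper;

    // Illustrative only: apply partition-state updates in batches of 100
    // ops per multi() call, so a pass over a few thousand znodes takes a
    // handful of round trips rather than one per znode.
    static void updatePartitionStates(ZooKeeper zk, List<String> paths,
                                      List<byte[]> payloads) throws Exception {
        final int BATCH = 100;
        List<Op> ops = new ArrayList<>();
        for (int i = 0; i < paths.size(); i++) {
            ops.add(Op.setData(paths.get(i), payloads.get(i), -1)); // -1: any version
            if (ops.size() == BATCH || i == paths.size() - 1) {
                zk.multi(ops);  // each multi() is one atomic round trip
                ops.clear();
            }
        }
    }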

The existing reassignment znode requires the whole plan to fit within a single 
znode.  The maximum znode size is 1 megabyte by default, and almost nobody 
reconfigures this.  Assuming about 100 bytes per reassignment, we can't get 
much more than about 10,000 partitions into a reassignment today in any case.  
The current scalability bottleneck is much more on the side of "can Kafka 
actually handle the huge amount of extra traffic caused by ongoing 
reassignments?"
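To spell out the arithmetic: the 1 MB cap comes from ZooKeeper's jute.maxbuffer 
default (1,048,575 bytes), so at ~100 bytes per partition entry, 1,048,575 / 100 
comes out to roughly 10,000 entries before the plan stops fitting in the single 
/admin/reassign_partitions znode.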

That does bring up a good point, though-- we may want to have a "maximum 
concurrent reassignments" limit to avoid a common scenario that happens now, 
where people accidentally submit a plan that's way too big.  But this is not to 
protect ZooKeeper-- it is to protect the brokers.
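Purely as a sketch of what such a cap could look like broker-side (the config 
name and the check are invented for illustration-- nothing like this exists 
today):

    // Hypothetical guard: "max.reassignments.in.flight" is a made-up name,
    // not a real Kafka config.  Reject any submission that would push the
    // number of concurrently moving partitions over the cap.
    static void checkReassignmentCap(int inFlight, int requested, int cap) {
        if (inFlight + requested > cap) {
            throw new IllegalStateException("Submitting " + requested +
                " reassignments would put " + (inFlight + requested) +
                " partitions in flight, above the configured cap of " + cap);
        }
    }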

> I think ideally we should write these updates to Zookeeper in batches and
> otherwise store the plan in a replicated internal topic
> (__partition_reassignments). That would solve the scalability problem, as
> the failover controller would be able to read it back very quickly, and we
> would also spread the Zookeeper writes out over time. Just the current,
> actively replicated partitions would need to be present under
> /brokers/topics/[topic]/partitions/[partitionId]/state, so those partitions
> would know whether they have to do reassignment (even in case of a broker
> bounce). The controller, on the other hand, could regain its state by
> reading the last produced message from this __partition_reassignments topic
> and reading the Zookeeper state to figure out which batch it's currently
> doing (supposing it proceeds sequentially through the given reassignment).

As I wrote in my reply to the other email, this is not needed because we're not 
adding any controller startup overhead beyond what already exists.  We do have 
some plans to optimize this, but it's outside the scope of this KIP.

> I'll think a little bit more about this to fill in any gaps and perhaps
> add it to my KIP. That being said, we should probably do some benchmarking
> first to see whether this bulk read-write causes a problem at all, to avoid
> premature optimisation. I generally don't worry about reading this new
> information, as the controller would read the assignment anyway in
> initializeControllerContext().

Right, the controller will read those znodes on startup anyway.
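For anyone following along, that startup scan looks roughly like this-- the 
znode paths are the real layout, but the code is a sketch, not what the 
controller literally runs:

    import org.apache.zookeeper.ZooKeeper;

    // Sketch of the controller's startup scan: list every topic, then read
    // each partition's state znode (leader, leader_epoch, isr, ...).
    static void scanPartitionStates(ZooKeeper zk) throws Exception {
        for (String topic : zk.getChildren("/brokers/topics", false)) {
            String partitions = "/brokers/topics/" + topic + "/partitions";
            for (String p : zk.getChildren(partitions, false)) {
                byte[] state = zk.getData(partitions + "/" + p + "/state",
                                          false, null);
                // parse the JSON payload: leader, leader_epoch, isr,
                // controller_epoch, plus reassignment info under this KIP
            }
        }
    }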

> 
> A question on SubmitPartitionReassignmentsRequest and its connection with
> KIP-435 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-435>. Would
> the list of topic-partitions have the same ordering on the client side as
> on the broker side? I think that would be an advantage, as the user would
> know in which order the reassignments would be performed. It would also be
> useful for incrementalization, as users could figure out which replicas
> will be in a given batch (given they know the batch size).

The big advantage of doing batching on the controller is that the controller 
has more information about what is going on in the cluster.  So it can schedule 
reassignments in a more optimal way.  For instance, it can schedule 
reassignments so that the load is distributed evenly across nodes.  This 
advantage is lost if we have to adhere to a rigid ordering that is set up in 
advance.  We don't know exactly when anything will complete in any case; just 
because one partition reassignment was started before another doesn't mean it 
will finish first.

Additionally, there may be multiple clients submitting assignments and multiple 
clients querying them.  So I don't think ordering makes sense here.
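As a sketch of the unordered shape this implies (the request format here is 
illustrative, not the final API): keying the submission by partition makes it 
explicit that clients get no ordering guarantee:

    import java.util.*;
    import org.apache.kafka.common.TopicPartition;

    // Illustrative only: a reassignment submission keyed by partition.
    // A map has no meaningful submission order, which matches the point
    // above-- completion order is up to the controller, not the client.
    static Map<TopicPartition, List<Integer>> buildPlan() {
        Map<TopicPartition, List<Integer>> plan = new HashMap<>();
        plan.put(new TopicPartition("foo", 0), Arrays.asList(1, 2, 3));
        plan.put(new TopicPartition("foo", 1), Arrays.asList(2, 3, 4));
        return plan;
    }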

best,
Colin

> 
> Viktor
> 
> On Wed, May 1, 2019 at 8:33 AM George Li <sql_consult...@yahoo.com.invalid>
> wrote:
> 
> >  Hi Colin,
> >
> > Thanks for KIP-455!  Yes, KIP-236, etc. will depend on it.  It is a good
> > direction to go for the RP
> >
> > Regarding storing the new reassignments & original replicas at the
> > topic/partition level:  I have some concerns about controller failover
> > and the scalability of scanning the active reassignments from the ZK
> > topic/partition-level nodes. Please see my reply to Jason in the KIP-236
> > thread.
> >
> > Once the decision is made on where the new reassignment and original
> > replicas are stored, I will modify KIP-236 accordingly to cover how to
> > cancel/rollback the reassignments.
> >
> > Thanks,
> > George
> >
> >
> >     On Monday, April 15, 2019, 6:07:44 PM PDT, Colin McCabe <
> > cmcc...@apache.org> wrote:
> >
> >  Hi all,
> >
> > We've been having discussions on a few different KIPs (KIP-236, KIP-435,
> > etc.) about what the Admin Client replica reassignment API should look
> > like.  The current API is really hard to extend and maintain, which is a
> > big source of problems.  I think it makes sense to have a KIP that
> > establishes a clean API that we can use and extend going forward, so I
> > posted KIP-455.  Take a look.  :)
> >
> > best,
> > Colin
> >
>
