Hi Colin,

Thanks for explaining all this; it makes sense.

Viktor

On Sun, May 5, 2019 at 8:18 AM Colin McCabe <cmcc...@apache.org> wrote:

> On Thu, May 2, 2019, at 09:35, Viktor Somogyi-Vass wrote:
> > Hey Colin & George,
> >
> > Thinking on George's points, I was wondering whether it's feasible to
> > submit a big reassignment to the controller, and thus to Zookeeper, since
> > frequent writes are slow as the quorum has to synchronize. Perhaps it
> > should be the responsibility of KIP-435
> > <https://issues.apache.org/jira/browse/KIP-435>, but I'd like to note it
> > here as we're changing the current znode layout in this KIP.
>
> Hi Viktor,
>
> This is similar conceptually to if we lose a broker from the cluster.  In
> that case, we have to remove that node from the ISR of all the partitions
> it has, which means updating O(partitions_on_node) znodes.  It's also
> similar to completing a reassignment in the existing Kafka version, and
> updating the partition znodes to reflect new nodes joining the ISR for
> various partitions.  While you are right that ZK is a low-bandwidth system,
> in general, writing to a few thousand ZNodes over the course of a second or
> two is OK.
>
> The existing reassignment znode requires the whole plan to fit within a
> single znode.  The maximum znode size is 1 megabyte by default, and almost
> nobody reconfigures this.  Assuming about 100 bytes per reassignment, we
> can't get many more than about 10,000 partitions in a reassignment today in
> any case.  The current scalability bottleneck is much more on the side of
> "can Kafka actually handle a huge amount of extra traffic due to ongoing
> reassignments?"
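For what it's worth, the arithmetic above checks out directly; note the 100-bytes-per-entry figure is the assumption from the email, not a measured serialization size:

```python
# Back-of-the-envelope capacity of the single reassignment znode.
# BYTES_PER_REASSIGNMENT is the assumption from the discussion above,
# not a measured size of the JSON plan entries.
ZNODE_LIMIT_BYTES = 1_000_000       # ~1 MB default znode size limit
BYTES_PER_REASSIGNMENT = 100        # assumed size of one partition entry

max_partitions = ZNODE_LIMIT_BYTES // BYTES_PER_REASSIGNMENT
print(max_partitions)  # 10000 -- roughly 10,000 partitions per plan
```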
>
> That does bring up a good point, though-- we may want to have a "maximum
> concurrent reassignments" limit to avoid a common scenario that happens now,
> where people accidentally submit a plan that's way too big.  But this is
> not to protect ZooKeeper-- it is to protect the brokers.
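Such a guard could be as simple as the following sketch; the config name, function, and numbers are hypothetical, not anything in Kafka:

```python
# Hypothetical broker-protection guard: admit only as many new partition
# moves as a configured cap on concurrently moving partitions allows.
# MAX_CONCURRENT_REASSIGNMENTS is an illustrative name, not a real config.
MAX_CONCURRENT_REASSIGNMENTS = 1000

def validate_plan(in_flight: int, requested: list) -> list:
    """Admit moves up to the remaining budget; the rest are deferred."""
    budget = MAX_CONCURRENT_REASSIGNMENTS - in_flight
    if budget <= 0:
        return []
    return requested[:budget]

# 950 moves already in flight leaves a budget of 50.
admitted = validate_plan(in_flight=950, requested=list(range(200)))
print(len(admitted))  # 50 moves admitted, the remaining 150 deferred
```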
>
> > I think ideally we should add these writes in batches to zookeeper and
> > otherwise store it in a replicated internal topic
> > (__partition_reassignments). That would solve the scalability problem, as
> > the failover controller would be able to read it up very quickly, and we
> > would also spread the writes in Zookeeper over time. Just the current,
> > actively replicated partitions should be present under
> > /brokers/topics/[topic]/partitions/[partitionId]/state, so those partitions
> > will know if they have to do reassignment (even in case of a broker
> > bounce). The controller on the other hand could regain its state by reading
> > up the last produced message from this __partition_reassignments topic and
> > reading up the Zookeeper state to figure out which batch it's currently
> > doing (supposing it goes sequentially in the given reassignment).
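The failover-recovery idea described above (replay the last record of a hypothetical __partition_reassignments topic, then locate the in-progress batch against ZK state) could look roughly like this; the function and data shapes are illustrative only:

```python
# Sketch of locating the in-progress batch after a controller failover,
# assuming the plan is executed sequentially in fixed-size batches.
# All names and structures here are illustrative, not actual Kafka code.

def find_current_batch(plan: list, completed: set, batch_size: int) -> list:
    """Return the first batch that still contains unfinished partitions."""
    for i in range(0, len(plan), batch_size):
        batch = plan[i:i + batch_size]
        if any(p not in completed for p in batch):
            return batch
    return []  # the whole plan has finished

plan = ["t-0", "t-1", "t-2", "t-3", "t-4", "t-5"]
# Batch 1 (t-0, t-1) done; t-2 done but its batch-mate t-3 is not.
current = find_current_batch(plan, completed={"t-0", "t-1", "t-2"}, batch_size=2)
print(current)  # ['t-2', 't-3']
```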
>
> As I wrote in my reply to the other email, this is not needed because
> we're not adding any controller startup overhead beyond what already
> exists.  We do have some plans to optimize this, but it's outside the scope
> of this KIP.
>
> > I'll think a little bit more about this to fill out any gaps there are and
> > perhaps add it to my KIP. That being said, we'll probably need to do some
> > benchmarking first to see if this bulk read-write causes a problem at all,
> > to avoid premature optimisation. I generally don't really worry about
> > reading up this new information, as the controller would read up the
> > assignment anyway in initializeControllerContext().
>
> Right, the controller will read those znodes on startup anyway.
>
> >
> > A question on SubmitPartitionReassignmentsRequest and its connection with
> > KIP-435 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-435>. Would
> > the list of topic-partitions have the same ordering on the client side as
> > well as the broker side? I think it would be an advantage, as the user
> > would know in which order the reassignment would be performed. I think it's
> > useful when it comes to incrementalization, as they'd be able to figure out
> > what replicas will be in one batch (given they know about the batch size).
>
> The big advantage of doing batching on the controller is that the
> controller has more information about what is going on in the cluster.  So
> it can schedule reassignments in a more optimal way.  For instance, it can
> schedule reassignments so that the load is distributed evenly across
> nodes.  This advantage is lost if we have to adhere to a rigid ordering
> that is set up in advance.  We don't know exactly when anything will
> complete in any case.  Just because one partition reassignment was started
> before another doesn't mean it will finish first.
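As a rough illustration of the scheduling flexibility described above (not actual controller logic), a controller that is free to reorder can greedily admit moves so that no broker exceeds a per-broker cap on concurrent reassignments:

```python
# Illustrative greedy scheduler: admit pending moves only while every
# target broker stays under a per-broker cap, spreading load evenly.
# All names and the data shape are hypothetical.
from collections import Counter

def schedule(pending, per_broker_cap: int):
    """pending: list of (partition, target_brokers) pairs, in no
    particular order. Returns the partitions admitted this round."""
    load = Counter()           # moves currently assigned per broker
    admitted = []
    for partition, targets in pending:
        if all(load[b] < per_broker_cap for b in targets):
            admitted.append(partition)
            for b in targets:
                load[b] += 1
    return admitted

pending = [("p0", [1, 2]), ("p1", [1, 3]), ("p2", [2, 3]), ("p3", [4, 5])]
# With a cap of 1, p1 and p2 are deferred because brokers 1-3 are busy.
print(schedule(pending, per_broker_cap=1))  # ['p0', 'p3']
```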
>
> Additionally, there may be multiple clients submitting assignments and
> multiple clients querying them.  So I don't think ordering makes sense here.
>
> best,
> Colin
>
> >
> > Viktor
> >
> > On Wed, May 1, 2019 at 8:33 AM George Li <sql_consult...@yahoo.com.invalid>
> > wrote:
> >
> > >  Hi Colin,
> > >
> > > Thanks for KIP-455!  Yes, KIP-236, etc. will depend on it.  It is a good
> > > direction to go for the RP
> > >
> > > Regarding storing the new reassignments & original replicas at the
> > > topic/partition level: I have some concerns about controller failover,
> > > and about the scalability of scanning the active reassignments from the
> > > ZK topic/partition level nodes. Please see my reply to Jason in the
> > > KIP-236 thread.
> > >
> > > Once the decision is made on where the new reassignment and original
> > > replicas are stored, I will modify KIP-236 accordingly for how to
> > > cancel/rollback the reassignments.
> > >
> > > Thanks,
> > > George
> > >
> > >
> > >     On Monday, April 15, 2019, 6:07:44 PM PDT, Colin McCabe <
> > > cmcc...@apache.org> wrote:
> > >
> > >  Hi all,
> > >
> > > We've been having discussions on a few different KIPs (KIP-236, KIP-435,
> > > etc.) about what the Admin Client replica reassignment API should look
> > > like.  The current API is really hard to extend and maintain, which is a
> > > big source of problems.  I think it makes sense to have a KIP that
> > > establishes a clean API that we can use and extend going forward, so I
> > > posted KIP-455.  Take a look.  :)
> > >
> > > best,
> > > Colin
> > >
> >
>
