Hi all,

I realized that we need a new API in AdminClient in order to use the new
request/response added in KIP-113. Since this is required by KIP-113, I
chose to add the new interface in this KIP instead of creating a new KIP.

The documentation of the new API in AdminClient can be found here
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories#KIP-113:Supportreplicasmovementbetweenlogdirectories-AdminClient>.
Can you please review and comment if you have any concerns?
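
For reference, a very rough sketch of the shape of such an API is included
below. The method, parameter and result names here are illustrative only; the
wiki page above is the source of truth:

    // Illustrative sketch only -- see the KIP-113 wiki page for the actual
    // interface and types; the names below are made up for illustration.
    public interface LogDirAdminSketch {

        // Identifies one replica: a (topic, partition) on a particular broker.
        final class Replica {
            public final String topic; public final int partition; public final int brokerId;
            public Replica(String topic, int partition, int brokerId) {
                this.topic = topic; this.partition = partition; this.brokerId = brokerId;
            }
        }

        // Which log directory does each replica currently live in?
        java.util.Map<Replica, String> describeReplicaDirs(java.util.Collection<Replica> replicas);

        // Ask brokers to move each replica to the given log directory path.
        void alterReplicaDirs(java.util.Map<Replica, String> replicaToLogDir);
    }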

Thanks!
Dong



On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com> wrote:

> The protocol change has been updated in KIP-113
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories>
> .
>
> On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have made a minor change to the DescribeDirsRequest so that the user can
>> choose to query the status of a specific list of partitions. This is a bit
>> more fine-grained than the previous format, which allowed the user to query
>> the status of a specific list of topics. I realized that querying the status
>> of selected partitions can be useful to check whether the reassignment of
>> the replicas to the specified log directories has been completed.
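>>
>> To illustrate the shape of the change (the wiki page has the authoritative
>> schema; the field layout below is approximate), the relevant part of the
>> request goes from roughly
>>
>>     DescribeDirsRequest => [topic]                // query whole topics
>>
>> to roughly
>>
>>     DescribeDirsRequest => [topic [partition]]    // query specific partitions
>>
>> so that a caller can name individual partitions rather than whole topics.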
>>
>> I will assume this minor change is OK if there is no concern with it in
>> the community :)
>>
>> Thanks,
>> Dong
>>
>>
>>
>> On Mon, Jun 12, 2017 at 10:46 AM, Dong Lin <lindon...@gmail.com> wrote:
>>
>>> Hey Colin,
>>>
>>> Thanks for the suggestion. We have actually considered this and listed it
>>> as the first future work item in KIP-112
>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD>.
>>> The two advantages that you mentioned are exactly the motivation for this
>>> feature. Also, as you mentioned, this involves a tradeoff between disk
>>> performance and availability -- the more you distribute topics across
>>> disks, the more topics will be taken offline by a single disk failure.
>>>
>>> Complexity aside, it is not clear to me that the reduced rebalance
>>> overhead is worth the reduction in availability. I am optimistic that the
>>> rebalance overhead will not be that big a problem, since we are not too
>>> bothered by cross-broker rebalance as of now.
>>>
>>> Thanks,
>>> Dong
>>>
>>> On Mon, Jun 12, 2017 at 10:36 AM, Colin McCabe <cmcc...@apache.org>
>>> wrote:
>>>
>>>> Has anyone considered a scheme for sharding topic data across multiple
>>>> disks?
>>>>
>>>> For example, if you sharded topics across 3 disks, and you had 10 disks,
>>>> you could pick a different set of 3 disks for each topic.  If you
>>>> distribute them randomly then you have 10 choose 3 = 120 different
>>>> combinations.  You would probably never need rebalancing if you had a
>>>> reasonable distribution of topic sizes (could probably prove this with a
>>>> Monte Carlo or something).
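>>>>
>>>> (A rough Monte Carlo along those lines, purely as a sketch with assumed
>>>> numbers -- topic sizes drawn from a lognormal distribution, each topic
>>>> spread evenly over 3 of 10 randomly chosen disks -- could look like:
>>>>
>>>>     import java.util.*;
>>>>
>>>>     // How far does the fullest disk drift from the average disk load if
>>>>     // every topic is spread evenly over k randomly chosen disks?
>>>>     class ShardingMonteCarlo {
>>>>         public static void main(String[] args) {
>>>>             int nDisks = 10, k = 3, nTopics = 200, trials = 1000;
>>>>             Random rnd = new Random(42);
>>>>             double sumRatio = 0;
>>>>             for (int t = 0; t < trials; t++) {
>>>>                 double[] load = new double[nDisks];
>>>>                 for (int i = 0; i < nTopics; i++) {
>>>>                     double size = Math.exp(rnd.nextGaussian()); // assumed topic-size distribution
>>>>                     List<Integer> disks = new ArrayList<>();
>>>>                     for (int d = 0; d < nDisks; d++) disks.add(d);
>>>>                     Collections.shuffle(disks, rnd);
>>>>                     for (int j = 0; j < k; j++) load[disks.get(j)] += size / k;
>>>>                 }
>>>>                 double max = Arrays.stream(load).max().getAsDouble();
>>>>                 double avg = Arrays.stream(load).average().getAsDouble();
>>>>                 sumRatio += max / avg;
>>>>             }
>>>>             System.out.printf("avg max/mean disk load over %d trials: %.3f%n",
>>>>                 trials, sumRatio / trials);
>>>>         }
>>>>     }
>>>>
>>>> The closer that ratio stays to 1, the less rebalancing would be needed.)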
>>>>
>>>> The disadvantage is that if one of the 3 disks fails, then you have to
>>>> take the topic offline.  But if we assume independent disk failure
>>>> probabilities, probability of failure with RAID 0 is: 1 -
>>>> Psuccess^(num_disks) whereas the probability of failure with this scheme
>>>> is 1 - Psuccess ^ 3.
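>>>>
>>>> (Plugging in an example 2% annual failure rate per disk, i.e. Psuccess =
>>>> 0.98 per year: RAID 0 across 10 disks loses the whole broker with
>>>> probability 1 - 0.98^10 ~= 18% per year, while a 3-disk shard takes one
>>>> topic offline with probability 1 - 0.98^3 ~= 6% per year -- illustrative
>>>> numbers only.)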
>>>>
>>>> This addresses the biggest downsides of JBOD now:
>>>> * limiting a topic to the size of a single disk limits scalability
>>>> * the topic movement process is tricky to get right and involves "racing
>>>> against producers" and wasted double I/Os
>>>>
>>>> Of course, one other question is how frequently we add new disk drives
>>>> to an existing broker.  In this case, you might reasonably want disk
>>>> rebalancing to avoid overloading the new disk(s) with writes.
>>>>
>>>> cheers,
>>>> Colin
>>>>
>>>>
>>>> On Fri, Jun 9, 2017, at 18:46, Jun Rao wrote:
>>>> > Just a few comments on this.
>>>> >
>>>> > 1. One of the issues with using RAID 0 is that a single disk failure
>>>> > causes
>>>> > a hard failure of the broker. Hard failure increases the
>>>> unavailability
>>>> > window for all the partitions on the failed broker, which includes the
>>>> > failure detection time (tied to ZK session timeout right now) and
>>>> leader
>>>> > election time by the controller. If we support JBOD natively, when a
>>>> > single
>>>> > disk fails, only partitions on the failed disk will experience a hard
>>>> > failure. The availability of partitions on the rest of the disks is
>>>> > not affected.
>>>> >
>>>> > 2. For running things in the Cloud, such as AWS: currently, each EBS
>>>> > volume has a throughput limit of about 300MB/sec. If you get an enhanced EC2
>>>> > instance, you can get 20Gb/sec network. To saturate the network, you
>>>> may
>>>> > need about 7 EBS volumes. So, being able to support JBOD in the Cloud
>>>> is
>>>> > still potentially useful.
>>>> >
>>>> > 3. On the benefit of balancing data across disks within the same
>>>> broker.
>>>> > Data imbalance can happen across brokers as well as across disks
>>>> within
>>>> > the
>>>> > same broker. Balancing the data across disks within the broker has the
>>>> > benefit of saving network bandwidth as Dong mentioned. So, if intra
>>>> > broker
>>>> > load balancing is possible, it's probably better to avoid the more
>>>> > expensive inter broker load balancing. One of the reasons for disk
>>>> > imbalance right now is that partitions within a broker are assigned to
>>>> > disks just based on the partition count. So, it does seem possible for
>>>> > disks to get imbalanced from time to time. If someone can share some
>>>> > stats
>>>> > for that in practice, that will be very helpful.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Jun
>>>> >
>>>> >
>>>> > On Wed, Jun 7, 2017 at 2:30 PM, Dong Lin <lindon...@gmail.com> wrote:
>>>> >
>>>> > > Hey Sriram,
>>>> > >
>>>> > > I think there is one way to explain why the ability to move replicas
>>>> > > between disks can save space. Let's say the load is distributed to disks
>>>> > > independently of the broker. Sooner or later, the load imbalance will
>>>> > > exceed a threshold and we will need to rebalance load across disks. Now
>>>> > > our question is whether our rebalancing algorithm will be able to take
>>>> > > advantage of locality by moving replicas between disks on the same
>>>> > > broker.
>>>> > >
>>>> > > Say for a given disk, there is a 20% probability it is overloaded, a 20%
>>>> > > probability it is underloaded, and a 60% probability its load is around
>>>> > > the expected average load if the cluster is well balanced. Then for a
>>>> > > broker of 10 disks, we would expect 2 disks to need in-bound replica
>>>> > > movement, 2 disks to need out-bound replica movement, and 6 disks to need
>>>> > > no replica movement. Thus we would expect KIP-113 to be useful, since we
>>>> > > will be able to move replicas from the two overloaded disks to the two
>>>> > > underloaded disks on the same broker. Does this make sense?
>>>> > >
>>>> > > Thanks,
>>>> > > Dong
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Wed, Jun 7, 2017 at 2:12 PM, Dong Lin <lindon...@gmail.com>
>>>> wrote:
>>>> > >
>>>> > > > Hey Sriram,
>>>> > > >
>>>> > > > Thanks for raising these concerns. Let me answer these questions
>>>> below:
>>>> > > >
>>>> > > > - The benefit of the additional complexity to move the data stored on a
>>>> > > > disk within the broker is to avoid network bandwidth usage. Creating a
>>>> > > > replica on another broker is less efficient than creating a replica on
>>>> > > > another disk of the same broker IF there is actually a lightly-loaded
>>>> > > > disk on the same broker.
>>>> > > >
>>>> > > > - In my opinion the rebalance algorithm would be this: 1) we balance the
>>>> > > > load across brokers using the same algorithm we are using today; 2) we
>>>> > > > balance the load across disks on a given broker using a greedy algorithm,
>>>> > > > i.e. move replicas from the overloaded disks to the lightly loaded disks.
>>>> > > > The greedy algorithm would only consider the capacity and replica size.
>>>> > > > We can improve it to consider throughput in the future. (A rough sketch
>>>> > > > of such a greedy step is included after this list.)
>>>> > > >
>>>> > > > - With 30 brokers, each having 10 disks, using the rebalancing algorithm,
>>>> > > > the chances of choosing disks within the broker can be high. There will
>>>> > > > always be load imbalance across disks of the same broker for the
>>>> same
>>>> > > > reason that there will always be load imbalance across brokers.
>>>> The
>>>> > > > algorithm specified above will take advantage of the locality,
>>>> i.e. first
>>>> > > > balance load across disks of the same broker, and only balance
>>>> across
>>>> > > > brokers if some brokers are much more loaded than others.
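>>>> > > >
>>>> > > > To make the greedy step in 2) above concrete, here is a rough sketch
>>>> > > > (illustrative only, not the actual KIP-113 code) that looks only at
>>>> > > > capacity and replica size, as described above:
>>>> > > >
>>>> > > >     import java.util.*;
>>>> > > >
>>>> > > >     // Greedy intra-broker balancer sketch: repeatedly move a large replica
>>>> > > >     // from the most-loaded log dir to the least-loaded one.
>>>> > > >     class GreedyDiskBalancerSketch {
>>>> > > >         static final class Replica {
>>>> > > >             final String name; final long sizeBytes;
>>>> > > >             Replica(String n, long s) { name = n; sizeBytes = s; }
>>>> > > >         }
>>>> > > >
>>>> > > >         static List<String> plan(Map<String, List<Replica>> replicasByDir, long maxSpreadBytes) {
>>>> > > >             Map<String, Long> usage = new HashMap<>();
>>>> > > >             replicasByDir.forEach((dir, rs) ->
>>>> > > >                 usage.put(dir, rs.stream().mapToLong(r -> r.sizeBytes).sum()));
>>>> > > >             List<String> moves = new ArrayList<>();
>>>> > > >             while (true) {
>>>> > > >                 String hot = Collections.max(usage.entrySet(), Map.Entry.comparingByValue()).getKey();
>>>> > > >                 String cold = Collections.min(usage.entrySet(), Map.Entry.comparingByValue()).getKey();
>>>> > > >                 long spread = usage.get(hot) - usage.get(cold);
>>>> > > >                 if (spread <= maxSpreadBytes) break;
>>>> > > >                 // pick the largest replica whose move strictly reduces the imbalance
>>>> > > >                 Replica best = null;
>>>> > > >                 for (Replica r : replicasByDir.get(hot))
>>>> > > >                     if (2 * r.sizeBytes <= spread && (best == null || r.sizeBytes > best.sizeBytes))
>>>> > > >                         best = r;
>>>> > > >                 if (best == null) break;
>>>> > > >                 replicasByDir.get(hot).remove(best);
>>>> > > >                 replicasByDir.get(cold).add(best);
>>>> > > >                 usage.put(hot, usage.get(hot) - best.sizeBytes);
>>>> > > >                 usage.put(cold, usage.get(cold) + best.sizeBytes);
>>>> > > >                 moves.add(best.name + ": " + hot + " -> " + cold);
>>>> > > >             }
>>>> > > >             return moves;
>>>> > > >         }
>>>> > > >     }
>>>> > > >
>>>> > > > A real implementation would also have to respect in-progress movements and
>>>> > > > per-disk capacity limits; the sketch is only meant to show the shape of
>>>> > > > the greedy step.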
>>>> > > >
>>>> > > > I think it is useful to note that the load imbalance across disks
>>>> of the
>>>> > > > same broker is independent of the load imbalance across brokers.
>>>> Both are
>>>> > > > guaranteed to happen in any Kafka cluster for the same reason,
>>>> i.e.
>>>> > > > variation in the partition size. Say broker 1 has two disks that are 80%
>>>> > > > loaded and 20% loaded, and broker 2 also has two disks that are 80%
>>>> > > > loaded and 20% loaded. We can balance them without inter-broker traffic
>>>> > > > with KIP-113. This is why I think KIP-113 can be very useful.
>>>> > > >
>>>> > > > Do these explanations sound reasonable?
>>>> > > >
>>>> > > > Thanks,
>>>> > > > Dong
>>>> > > >
>>>> > > >
>>>> > > > On Wed, Jun 7, 2017 at 1:33 PM, Sriram Subramanian <
>>>> r...@confluent.io>
>>>> > > > wrote:
>>>> > > >
>>>> > > >> Hey Dong,
>>>> > > >>
>>>> > > >> Thanks for the explanation. I don't think anyone is denying that
>>>> we
>>>> > > should
>>>> > > >> rebalance at the disk level. I think it is important to restore
>>>> the disk
>>>> > > >> and not wait for disk replacement. There are also other benefits of
>>>> > > >> doing that, such as not needing to opt for hot-swap racks, which can
>>>> > > >> save cost.
>>>> > > >>
>>>> > > >> The question here is what do you save by trying to add
>>>> complexity to
>>>> > > move
>>>> > > >> the data stored on a disk within the broker? Why would you not simply
>>>> > > >> create another replica on a disk that results in a balanced load across
>>>> > > >> brokers and have it catch up? We are missing a few things here -
>>>> > > >> 1. What would your data balancing algorithm be? Would it include
>>>> just
>>>> > > >> capacity or will it also consider throughput on disk to decide
>>>> on the
>>>> > > >> final
>>>> > > >> location of a partition?
>>>> > > >> 2. With 30 brokers, each having 10 disks, using the rebalancing
>>>> > > >> algorithm, the chances of choosing disks within the broker are going to
>>>> > > >> be low. This probability further decreases with more brokers and
>>>> disks.
>>>> > > Given
>>>> > > >> that, why are we trying to save network cost? How much would
>>>> that saving
>>>> > > >> be
>>>> > > >> if you go that route?
>>>> > > >>
>>>> > > >> These questions are hard to answer without verifying empirically. My
>>>> > > >> suggestion is to avoid doing premature optimization that brings added
>>>> > > >> complexity to the code, and treat inter- and intra-broker movements of
>>>> > > >> partitions the same. Deploy the code, use it, and see if it is an actual
>>>> > > >> problem and whether you get great savings by avoiding the network route
>>>> > > >> to move partitions within the same broker. If so, add this optimization.
>>>> > > >>
>>>> > > >> On Wed, Jun 7, 2017 at 1:03 PM, Dong Lin <lindon...@gmail.com>
>>>> wrote:
>>>> > > >>
>>>> > > >> > Hey Jay, Sriram,
>>>> > > >> >
>>>> > > >> > Great point. If I understand you right, you are suggesting
>>>> that we can
>>>> > > >> > simply use RAID-0 so that the load can be evenly distributed
>>>> across
>>>> > > >> disks.
>>>> > > >> > And even though a disk failure will bring down the entire broker, the
>>>> > > >> > reduced availability as compared to using KIP-112 and KIP-113 may be
>>>> > > >> > negligible. And it may be better to just accept the slightly reduced
>>>> > > >> > availability instead of introducing the complexity from KIP-112 and
>>>> > > >> > KIP-113.
>>>> > > >> >
>>>> > > >> > Let's assume the following:
>>>> > > >> >
>>>> > > >> > - There are 30 brokers in a cluster and each broker has 10
>>>> disks
>>>> > > >> > - The replication factor is 3 and min.isr = 2.
>>>> > > >> > - The annual disk failure rate is 2% according to this
>>>> > > >> > <https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/> blog.
>>>> > > >> > - It takes 3 days to replace a disk.
>>>> > > >> >
>>>> > > >> > Here is my calculation for the probability of data loss due to disk
>>>> > > >> > failure:
>>>> > > >> > probability that a given disk fails in a given year: 2%
>>>> > > >> > probability that a given disk is offline on a given day: 2% / 365 * 3
>>>> > > >> > probability that a given broker is offline on a given day due to disk
>>>> > > >> > failure: 2% / 365 * 3 * 10
>>>> > > >> > probability that some broker is offline on a given day due to disk
>>>> > > >> > failure: 2% / 365 * 3 * 10 * 30 = 5%
>>>> > > >> > probability that any three brokers are offline on the same day due to
>>>> > > >> > disk failure: 5% * 5% * 5% = 0.0125%
>>>> > > >> > probability of data loss due to disk failure: 0.0125%
>>>> > > >> >
>>>> > > >> > Here is my calculation for the probability of service unavailability
>>>> > > >> > due to disk failure:
>>>> > > >> > probability that a given disk fails in a given year: 2%
>>>> > > >> > probability that a given disk is offline on a given day: 2% / 365 * 3
>>>> > > >> > probability that a given broker is offline on a given day due to disk
>>>> > > >> > failure: 2% / 365 * 3 * 10
>>>> > > >> > probability that some broker is offline on a given day due to disk
>>>> > > >> > failure: 2% / 365 * 3 * 10 * 30 = 5%
>>>> > > >> > probability that any two brokers are offline on the same day due to
>>>> > > >> > disk failure: 5% * 5% = 0.25%
>>>> > > >> > probability of unavailability due to disk failure: 0.25%
>>>> > > >> >
>>>> > > >> > Note that the unavailability due to disk failure will be unacceptably
>>>> > > >> > high in this case. And the probability of data loss due to disk failure
>>>> > > >> > will be higher than 0.01%. Neither is acceptable if Kafka is intended to
>>>> > > >> > achieve four-nines availability.
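>>>> > > >> >
>>>> > > >> > For anyone who wants to vary these assumptions, a small sketch that
>>>> > > >> > mirrors the same simplified independence model (not a precise
>>>> > > >> > availability model) is:
>>>> > > >> >
>>>> > > >> >     // Back-of-the-envelope calculator for the estimates above (sketch only).
>>>> > > >> >     class JbodAvailabilitySketch {
>>>> > > >> >         public static void main(String[] args) {
>>>> > > >> >             double annualDiskFailureRate = 0.02;  // example figure from the blog above
>>>> > > >> >             double daysToReplace = 3, disksPerBroker = 10, brokers = 30;
>>>> > > >> >             double pDiskDown = annualDiskFailureRate / 365 * daysToReplace;
>>>> > > >> >             double pBrokerDown = pDiskDown * disksPerBroker;
>>>> > > >> >             double pAnyBrokerDown = pBrokerDown * brokers;            // ~5%
>>>> > > >> >             System.out.printf("any broker down on a given day: %.2f%%%n",
>>>> > > >> >                 pAnyBrokerDown * 100);
>>>> > > >> >             System.out.printf("unavailability (two brokers down): %.4f%%%n",
>>>> > > >> >                 Math.pow(pAnyBrokerDown, 2) * 100);
>>>> > > >> >             System.out.printf("data loss (three brokers down): %.4f%%%n",
>>>> > > >> >                 Math.pow(pAnyBrokerDown, 3) * 100);
>>>> > > >> >         }
>>>> > > >> >     }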
>>>> > > >> >
>>>> > > >> > Thanks,
>>>> > > >> > Dong
>>>> > > >> >
>>>> > > >> >
>>>> > > >> > On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps <j...@confluent.io>
>>>> wrote:
>>>> > > >> >
>>>> > > >> > > I think Ram's point is that in place failure is pretty
>>>> complicated,
>>>> > > >> and
>>>> > > >> > > this is meant to be a cost saving feature, we should
>>>> construct an
>>>> > > >> > argument
>>>> > > >> > > for it grounded in data.
>>>> > > >> > >
>>>> > > >> > > Assume an annual failure rate of 1% (reasonable, but data is
>>>> > > available
>>>> > > >> > > online), and assume it takes 3 days to get the drive
>>>> replaced. Say
>>>> > > you
>>>> > > >> > have
>>>> > > >> > > 10 drives per server. Then the expected downtime for each
>>>> server is
>>>> > > >> > roughly
>>>> > > >> > > 1% * 3 days * 10 = 0.3 days/year (this is slightly off since
>>>> I'm
>>>> > > >> ignoring
>>>> > > >> > > the case of multiple failures, but I don't know that changes
>>>> it
>>>> > > >> much). So
>>>> > > >> > > the savings from this feature is 0.3/365 = 0.08%. Say you
>>>> have 1000
>>>> > > >> > servers
>>>> > > >> > > and they cost $3000/year fully loaded including power, the
>>>> cost of
>>>> > > >> the hw
>>>> > > >> > > amortized over its life, etc. Then this feature saves you
>>>> $3000 on
>>>> > > >> your
>>>> > > >> > > total server cost of $3m which seems not very worthwhile
>>>> compared to
>>>> > > >> > other
>>>> > > >> > > optimizations...?
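>>>> > > >> > >
>>>> > > >> > > (Spelling that arithmetic out: 1% x (3/365) x 10 drives ~= 0.08% of
>>>> > > >> > > server-time lost per year; 0.08% of 1000 servers x $3000/year ~=
>>>> > > >> > > $2.5k-3k per year, i.e. the same order of magnitude as above.)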
>>>> > > >> > >
>>>> > > >> > > Anyhow, not sure the arithmetic is right there, but I think
>>>> that is
>>>> > > >> the
>>>> > > >> > > type of argument that would be helpful to think about the
>>>> tradeoff
>>>> > > in
>>>> > > >> > > complexity.
>>>> > > >> > >
>>>> > > >> > > -Jay
>>>> > > >> > >
>>>> > > >> > >
>>>> > > >> > >
>>>> > > >> > > On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <
>>>> lindon...@gmail.com>
>>>> > > wrote:
>>>> > > >> > >
>>>> > > >> > > > Hey Sriram,
>>>> > > >> > > >
>>>> > > >> > > > Thanks for taking time to review the KIP. Please see below
>>>> my
>>>> > > >> answers
>>>> > > >> > to
>>>> > > >> > > > your questions:
>>>> > > >> > > >
>>>> > > >> > > > >1. Could you pick a hardware/Kafka configuration and go
>>>> over what
>>>> > > >> is
>>>> > > >> > the
>>>> > > >> > > > >average disk/partition repair/restore time that we are
>>>> targeting
>>>> > > >> for a
>>>> > > >> > > > >typical JBOD setup?
>>>> > > >> > > >
>>>> > > >> > > > We currently don't have this data. I think the disk/partition
>>>> > > >> > > > repair/restore time depends on the availability of hardware, the
>>>> > > >> > > > response time of the site-reliability engineers, the amount of data
>>>> > > >> > > > on the bad disk, etc. These vary between companies and even between
>>>> > > >> > > > clusters within the same company, and it is probably hard to
>>>> > > >> > > > determine what the average situation is.
>>>> > > >> > > >
>>>> > > >> > > > I am not very sure why we need this. Can you explain a bit
>>>> why
>>>> > > this
>>>> > > >> > data
>>>> > > >> > > is
>>>> > > >> > > > useful to evaluate the motivation and design of this KIP?
>>>> > > >> > > >
>>>> > > >> > > > >2. How often do we believe disks are going to fail (in
>>>> your
>>>> > > example
>>>> > > >> > > > >configuration) and what do we gain by avoiding the network
>>>> > > overhead
>>>> > > >> > and
>>>> > > >> > > > >doing all the work of moving the replica within the
>>>> broker to
>>>> > > >> another
>>>> > > >> > > disk
>>>> > > >> > > > >instead of balancing it globally?
>>>> > > >> > > >
>>>> > > >> > > > I think the chance of disk failure depends mainly on the
>>>> disk
>>>> > > itself
>>>> > > >> > > rather
>>>> > > >> > > > than the broker configuration. I don't have this data now.
>>>> I will
>>>> > > >> ask
>>>> > > >> > our
>>>> > > >> > > > SRE whether they know the mean-time-to-fail for our disk.
>>>> What I
>>>> > > was
>>>> > > >> > told
>>>> > > >> > > > by SRE is that disk failure is the most common type of
>>>> hardware
>>>> > > >> > failure.
>>>> > > >> > > >
>>>> > > >> > > > When there is a disk failure, I think it is reasonable to move the
>>>> > > >> > > > replica to another broker instead of another disk on the same broker.
>>>> > > >> > > > The reason we want to move replicas within a broker is mainly to
>>>> > > >> > > > optimize Kafka cluster performance when we balance load across disks.
>>>> > > >> > > >
>>>> > > >> > > > In comparison to balancing replicas globally, the benefits of moving
>>>> > > >> > > > replicas within a broker are that:
>>>> > > >> > > >
>>>> > > >> > > > 1) the movement is faster since it doesn't go through a socket or rely
>>>> > > >> > > > on the available network bandwidth;
>>>> > > >> > > > 2) there is much less impact on the replication traffic between brokers,
>>>> > > >> > > > since it does not take up bandwidth between brokers. Depending on the
>>>> > > >> > > > pattern of traffic, we may need to balance load across disks frequently,
>>>> > > >> > > > and it is necessary to prevent this operation from slowing down the
>>>> > > >> > > > existing operations (e.g. produce, consume, replication) in the Kafka
>>>> > > >> > > > cluster;
>>>> > > >> > > > 3) it gives us the opportunity to do automatic rebalancing between disks
>>>> > > >> > > > on the same broker.
>>>> > > >> > > >
>>>> > > >> > > >
>>>> > > >> > > > >3. Even if we had to move the replica within the broker,
>>>> why
>>>> > > >> cannot we
>>>> > > >> > > > just
>>>> > > >> > > > >treat it as another replica and have it go through the
>>>> same
>>>> > > >> > replication
>>>> > > >> > > > >code path that we have today? The downside here is
>>>> obviously that
>>>> > > >> you
>>>> > > >> > > need
>>>> > > >> > > > >to catchup from the leader but it is completely free!
>>>> What do we
>>>> > > >> think
>>>> > > >> > > is
>>>> > > >> > > > >the impact of the network overhead in this case?
>>>> > > >> > > >
>>>> > > >> > > > Good point. My initial proposal actually used the existing
>>>> > > >> > > > ReplicaFetcherThread (i.e. the existing code path) to move replicas
>>>> > > >> > > > between disks. However, I switched to using a separate thread pool
>>>> > > >> > > > after discussion with Jun and Becket.
>>>> > > >> > > >
>>>> > > >> > > > The main argument for using a separate thread pool is to actually keep
>>>> > > >> > > > the design simple and easy to reason about. There are a number of
>>>> > > >> > > > differences between inter-broker replication and intra-broker
>>>> > > >> > > > replication which make it cleaner to do them in separate code paths. I
>>>> > > >> > > > will list them below:
>>>> > > >> > > >
>>>> > > >> > > > - The throttling mechanism for inter-broker replication traffic and
>>>> > > >> > > > intra-broker replication traffic is different. For example, we may want
>>>> > > >> > > > to specify a per-topic quota for inter-broker replication traffic,
>>>> > > >> > > > because we may want some topics to be moved faster than others. But we
>>>> > > >> > > > don't care about the priority of topics for intra-broker movement. So
>>>> > > >> > > > the current proposal only allows the user to specify a per-broker quota
>>>> > > >> > > > for intra-broker replication traffic.
>>>> > > >> > > >
>>>> > > >> > > > - The quota value for inter-broker replication traffic and intra-broker
>>>> > > >> > > > replication traffic is different. The available bandwidth for
>>>> > > >> > > > intra-broker replication can probably be much higher than the bandwidth
>>>> > > >> > > > for inter-broker replication.
>>>> > > >> > > >
>>>> > > >> > > > - The ReplicaFetcherThread is per broker. Intuitively, the number of
>>>> > > >> > > > threads doing intra-broker data movement should be related to the
>>>> > > >> > > > number of disks in the broker, not the number of brokers in the
>>>> > > >> > > > cluster.
>>>> > > >> > > >
>>>> > > >> > > > - The leader replica has no ReplicaFetcherThread to start with. It
>>>> > > >> > > > seems weird to start one just for intra-broker replication.
>>>> > > >> > > >
>>>> > > >> > > > Because of these differences, we think it is simpler to use a separate
>>>> > > >> > > > thread pool and code path so that we can configure and throttle them
>>>> > > >> > > > separately.
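>>>> > > >> > > >
>>>> > > >> > > > To make "configure and throttle them separately" concrete, one could
>>>> > > >> > > > imagine broker configs along these lines (the first two are the
>>>> > > >> > > > existing KIP-73 inter-broker throttles; the last name is purely
>>>> > > >> > > > hypothetical and not what the KIP proposes):
>>>> > > >> > > >
>>>> > > >> > > >     # existing per-broker inter-broker replication throttles (KIP-73)
>>>> > > >> > > >     leader.replication.throttled.rate=10485760
>>>> > > >> > > >     follower.replication.throttled.rate=10485760
>>>> > > >> > > >     # hypothetical separate throttle for disk-to-disk movement within a broker
>>>> > > >> > > >     intra.broker.replica.movement.rate=104857600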
>>>> > > >> > > >
>>>> > > >> > > >
>>>> > > >> > > > >4. What are the chances that we will be able to identify
>>>> another
>>>> > > >> disk
>>>> > > >> > to
>>>> > > >> > > > >balance within the broker instead of another disk on
>>>> another
>>>> > > >> broker?
>>>> > > >> > If
>>>> > > >> > > we
>>>> > > >> > > > >have 100's of machines, the probability of finding a
>>>> better
>>>> > > >> balance by
>>>> > > >> > > > >choosing another broker is much higher than balancing
>>>> within the
>>>> > > >> > broker.
>>>> > > >> > > > >Could you add some info on how we are determining this?
>>>> > > >> > > >
>>>> > > >> > > > It is possible that we can find available space on a remote
>>>> > > broker.
>>>> > > >> The
>>>> > > >> > > > benefit of allowing intra-broker replication is that, when there is
>>>> > > >> > > > available space on both the current broker and a remote broker, the
>>>> > > >> > > > rebalance can be completed faster with much less impact on the
>>>> > > >> > > > inter-broker replication or the users' traffic. It is about taking
>>>> > > >> > > > advantage of locality when balancing the load.
>>>> > > >> > > >
>>>> > > >> > > > >5. Finally, in a cloud setup where more users are going to
>>>> > > >> leverage a
>>>> > > >> > > > >shared filesystem (example, EBS in AWS), all this change
>>>> is not
>>>> > > of
>>>> > > >> > much
>>>> > > >> > > > >gain since you don't need to balance between the volumes
>>>> within
>>>> > > the
>>>> > > >> > same
>>>> > > >> > > > >broker.
>>>> > > >> > > >
>>>> > > >> > > > You are right. KIP-113 is useful only if the user uses JBOD. If the
>>>> > > >> > > > user uses an extra storage layer of replication, such as RAID-10 or
>>>> > > >> > > > EBS, they don't need KIP-112 or KIP-113. Note that the user will
>>>> > > >> > > > replicate data more times than the replication factor of the Kafka
>>>> > > >> > > > topic if an extra storage layer of replication is used.
>>>> > > >> > > >
>>>> > > >> > >
>>>> > > >> >
>>>> > > >>
>>>> > > >
>>>> > > >
>>>> > >
>>>>
>>>
>>>
>>
>
