The protocol change has been updated in KIP-113 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories>.
On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com> wrote:

Hi all,

I have made a minor change to the DescribeDirsRequest so that the user can choose to query the status of a specific list of partitions. This is a bit more fine-grained than the previous format, which allowed the user to query the status of a specific list of topics. I realized that querying the status of selected partitions is useful for checking whether the reassignment of replicas to specific log directories has completed.

I will assume this minor change is OK if there is no concern with it in the community :)

Thanks,
Dong

On Mon, Jun 12, 2017 at 10:46 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Colin,

Thanks for the suggestion. We have actually considered this and listed it as the first future work in KIP-112 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD>. The two advantages you mention are exactly the motivation for this feature. As you note, it involves a tradeoff between disk performance and availability: the more disks a topic is distributed across, the more topics will be offline due to a single disk failure.

Given its complexity, it is not clear to me that the reduced rebalance overhead is worth the reduction in availability. I am optimistic that the rebalance overhead will not be that big a problem, since we are not too bothered by cross-broker rebalance as of now.

Thanks,
Dong

On Mon, Jun 12, 2017 at 10:36 AM, Colin McCabe <cmcc...@apache.org> wrote:

Has anyone considered a scheme for sharding topic data across multiple disks?

For example, if you sharded topics across 3 disks, and you had 10 disks, you could pick a different set of 3 disks for each topic. If you distribute them randomly, you have 10 choose 3 = 120 different combinations. You would probably never need rebalancing if you had a reasonable distribution of topic sizes (could probably prove this with a Monte Carlo simulation or something).

The disadvantage is that if one of the 3 disks fails, you have to take the topic offline. But if we assume independent disk failure probabilities, the probability of failure with RAID 0 is 1 - Psuccess^(num_disks), whereas the probability of failure with this scheme is 1 - Psuccess^3.

This addresses the biggest downsides of JBOD now:
* limiting a topic to the size of a single disk limits scalability
* the topic movement process is tricky to get right and involves "racing against producers" and wasted double I/Os

Of course, one other question is how frequently we add new disk drives to an existing broker. In that case, you might reasonably want disk rebalancing to avoid overloading the new disk(s) with writes.

cheers,
Colin
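For concreteness, Colin's closed-form comparison can be checked with a short Python sketch (the 2% annual per-disk failure rate here is an illustrative assumption, not a figure from his mail):

```python
from math import comb

# Illustrative assumptions: 10 disks per broker, topics sharded across 3 of
# them, and a 2% annual per-disk failure rate (so p_success = 0.98).
num_disks = 10
shard_size = 3
p_success = 0.98

# Number of distinct 3-disk shard sets on a 10-disk broker.
print(comb(num_disks, shard_size))          # 120

# RAID 0 stripes every topic across all disks, so any disk failure is fatal;
# with 3-disk sharding, a topic is lost only if one of *its* 3 disks fails.
p_fail_raid0 = 1 - p_success ** num_disks   # ~18.3% per year
p_fail_shard = 1 - p_success ** shard_size  # ~5.9% per year
print(f"RAID 0: {p_fail_raid0:.1%}  3-disk shard: {p_fail_shard:.1%}")
```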
On Fri, Jun 9, 2017, at 18:46, Jun Rao wrote:

Just a few comments on this.

1. One of the issues with using RAID 0 is that a single disk failure causes a hard failure of the broker. A hard failure increases the unavailability window for all the partitions on the failed broker, which includes the failure detection time (tied to the ZK session timeout right now) and the leader election time by the controller. If we support JBOD natively, when a single disk fails, only partitions on the failed disk experience a hard failure; the availability of partitions on the remaining disks is not affected.

2. For running things in the cloud, such as AWS: currently, each EBS volume has a throughput limit of about 300 MB/sec. If you get an enhanced EC2 instance, you can get 20 Gb/sec of network. To saturate the network, you may need about 7 EBS volumes. So being able to support JBOD in the cloud is still potentially useful.

3. On the benefit of balancing data across disks within the same broker: data imbalance can happen across brokers as well as across disks within the same broker. Balancing data across disks within the broker has the benefit of saving network bandwidth, as Dong mentioned. So if intra-broker load balancing is possible, it's probably better to avoid the more expensive inter-broker load balancing. One reason for disk imbalance right now is that partitions within a broker are assigned to disks based only on partition count, so it does seem possible for disks to become imbalanced from time to time. If someone can share some stats on that in practice, it would be very helpful.

Thanks,

Jun

On Wed, Jun 7, 2017 at 2:30 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Sriram,

I think there is one way to explain why the ability to move replicas between disks can save space. Say the load is distributed to disks independently of the broker. Sooner or later, the load imbalance will exceed a threshold and we will need to rebalance load across disks. The question is then whether our rebalancing algorithm can take advantage of locality by moving replicas between disks on the same broker.

Say that for a given disk there is a 20% probability it is overloaded, a 20% probability it is underloaded, and a 60% probability its load is around the expected average for a well-balanced cluster. Then for a broker with 10 disks, we would expect 2 disks to need in-bound replica movement, 2 disks to need out-bound replica movement, and 6 disks to need no replica movement. Thus we would expect KIP-113 to be useful, since we would be able to move replicas from the two overloaded disks to the two underloaded disks on the same broker. Does this make sense?

Thanks,
Dong
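Dong's expected counts follow directly from the 20/20/60 assumption, and the same numbers show how often a broker has a local move available at all (the inclusion-exclusion step is an added illustration, not from the thread):

```python
# The thread's assumption: each disk is overloaded with probability 20%,
# underloaded with probability 20%, and near the average with probability 60%.
p_over, p_under = 0.20, 0.20
disks = 10

# Expected out-bound / in-bound disks on a 10-disk broker.
print(p_over * disks, p_under * disks)  # 2.0 2.0

# Probability (by inclusion-exclusion) that a broker has at least one
# overloaded AND at least one underloaded disk, i.e. that an intra-broker
# move is possible at all.
p_no_over = (1 - p_over) ** disks
p_no_under = (1 - p_under) ** disks
p_neither = (1 - p_over - p_under) ** disks
print(f"{1 - p_no_over - p_no_under + p_neither:.1%}")  # ~79.1%
```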
On Wed, Jun 7, 2017 at 2:12 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Sriram,

Thanks for raising these concerns. Let me answer these questions below:

- The benefit of the additional complexity to move data stored on a disk within the broker is to avoid network bandwidth usage. Creating a replica on another broker is less efficient than creating a replica on another disk in the same broker IF there is actually a lightly loaded disk on the same broker.

- In my opinion the rebalance algorithm would be this: 1) we balance the load across brokers using the same algorithm we use today; 2) we balance the load across the disks of a given broker using a greedy algorithm, i.e. move replicas from the overloaded disks to the lightly loaded disks. The greedy algorithm would consider only capacity and replica size. We can improve it to consider throughput in the future. (See the sketch after this mail.)

- With 30 brokers, each having 10 disks, using the rebalancing algorithm, the chance of choosing disks within the broker can be high. There will always be load imbalance across the disks of the same broker, for the same reason that there will always be load imbalance across brokers. The algorithm specified above takes advantage of locality: first balance load across the disks of the same broker, and only balance across brokers if some brokers are much more loaded than others.

I think it is useful to note that load imbalance across the disks of the same broker is independent of load imbalance across brokers. Both are guaranteed to happen in any Kafka cluster for the same reason, i.e. variation in partition size. Say broker 1 has two disks that are 80% and 20% loaded, and broker 2 also has two disks that are 80% and 20% loaded. With KIP-113 we can balance them without any inter-broker traffic. This is why I think KIP-113 can be very useful.

Do these explanations sound reasonable?

Thanks,
Dong
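A minimal sketch of the greedy step described above, considering only replica size as in the proposal (the data structures and names are illustrative, not KIP-113's actual implementation):

```python
# Repeatedly move the largest replica that still shrinks the gap from the
# most-loaded disk to the least-loaded disk on the same broker.
def rebalance_disks(disks, max_moves=100):
    """disks: {disk: [(replica, size), ...]} -> list of (replica, src, dst)."""
    moves = []
    for _ in range(max_moves):
        load = {d: sum(size for _, size in reps) for d, reps in disks.items()}
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        gap = load[src] - load[dst]
        # Only a replica smaller than the gap brings the two disks closer.
        candidates = [r for r in disks[src] if r[1] < gap]
        if not candidates:
            break
        replica = max(candidates, key=lambda r: r[1])
        disks[src].remove(replica)
        disks[dst].append(replica)
        moves.append((replica[0], src, dst))
    return moves

layout = {"disk1": [("tA-0", 80), ("tB-0", 20)], "disk2": [("tC-0", 30)]}
print(rebalance_disks(layout))  # [('tB-0', 'disk1', 'disk2')]
```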
On Wed, Jun 7, 2017 at 1:33 PM, Sriram Subramanian <r...@confluent.io> wrote:

Hey Dong,

Thanks for the explanation. I don't think anyone is denying that we should rebalance at the disk level. I think it is important to restore the disk and not wait for disk replacement. Another benefit of doing that is that you don't need to opt for hot-swap racks, which can save cost.

The question here is what you save by adding the complexity to move the data stored on a disk within the broker. Why would you not simply create another replica on a disk that results in a balanced load across brokers and have it catch up? We are missing a few things here:
1. What would your data balancing algorithm be? Would it include just capacity, or would it also consider throughput on disk to decide on the final location of a partition?
2. With 30 brokers, each having 10 disks, using the rebalancing algorithm, the chance of choosing disks within the broker is going to be low. This probability further decreases with more brokers and disks. Given that, why are we trying to save network cost? How much would that saving be if you go that route?

These questions are hard to answer without verifying empirically. My suggestion is to avoid premature optimization that brings added complexity to the code, and to treat inter- and intra-broker movements of partitions the same. Deploy the code, use it, and see if it is an actual problem and whether you get great savings by avoiding the network route to move partitions within the same broker. If so, add this optimization.

On Wed, Jun 7, 2017 at 1:03 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jay, Sriram,

Great point. If I understand you right, you are suggesting that we can simply use RAID-0 so that the load is evenly distributed across disks, and that even though a disk failure will bring down the entire broker, the reduced availability compared to using KIP-112 and KIP-113 may be negligible. It may then be better to just accept the slightly reduced availability instead of introducing the complexity of KIP-112 and KIP-113.

Let's assume the following:

- There are 30 brokers in a cluster and each broker has 10 disks.
- The replication factor is 3 and min.isr = 2.
- The annual disk failure rate is 2%, according to this blog <https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/>.
- It takes 3 days to replace a disk.

Here is my calculation for the probability of data loss due to disk failure:
probability that a given disk fails in a given year: 2%
probability that a given disk is offline on a given day: 2% / 365 * 3
probability that a given broker is offline on a given day due to disk failure: 2% / 365 * 3 * 10
probability that some broker is offline on a given day due to disk failure: 2% / 365 * 3 * 10 * 30 = 5%
probability that some three brokers are offline on a given day due to disk failure: 5% * 5% * 5% = 0.0125%
probability of data loss due to disk failure: 0.0125%

The same steps give the probability of service unavailability due to disk failure, except that only two brokers need to be offline:
probability that some two brokers are offline on a given day due to disk failure: 5% * 5% = 0.25%
probability of unavailability due to disk failure: 0.25%

Note that the unavailability due to disk failure would be unacceptably high in this case, and the probability of data loss due to disk failure would be higher than 0.01%. Neither is acceptable if Kafka is intended to achieve four nines of availability.

Thanks,
Dong
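Spelled out in code, using only the assumptions stated in the mail (note the two-broker case is the product of two factors, 5% * 5%):

```python
# Assumptions from the mail: 30 brokers x 10 disks, 2% annual disk failure
# rate, 3 days to replace a disk, replication factor 3, min.isr = 2.
p_disk_offline = 0.02 / 365 * 3           # a given disk is offline today
p_broker_offline = p_disk_offline * 10    # a given broker is offline today
p_any_broker = p_broker_offline * 30      # ~4.9%, rounded to 5% in the mail

p_data_loss = p_any_broker ** 3    # three brokers down (0.0125% using 5% exactly)
p_unavailable = p_any_broker ** 2  # two brokers down (0.25% using 5% exactly)
print(f"{p_any_broker:.1%}  {p_data_loss:.4%}  {p_unavailable:.2%}")
```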
On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps <j...@confluent.io> wrote:

I think Ram's point is that in-place failure handling is pretty complicated, and since this is meant to be a cost-saving feature, we should construct an argument for it grounded in data.

Assume an annual failure rate of 1% (reasonable, but data is available online), and assume it takes 3 days to get the drive replaced. Say you have 10 drives per server. Then the expected downtime for each server is roughly 1% * 3 days * 10 = 0.3 days/year (this is slightly off since I'm ignoring the case of multiple failures, but I don't know that it changes much). So the savings from this feature is 0.3/365 = 0.08%. Say you have 1000 servers and they cost $3000/year fully loaded, including power, the cost of the hardware amortized over its life, etc. Then this feature saves you $3000 on your total server cost of $3m, which seems not very worthwhile compared to other optimizations...?

Anyhow, not sure the arithmetic is right there, but I think that is the type of argument that would be helpful for thinking about the tradeoff in complexity.

-Jay
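Jay's arithmetic, spelled out with his stated assumptions (the exact figure comes out near $2,500; the mail rounds to $3,000):

```python
# Jay's assumptions: 1% annual drive failure rate, 3-day replacement,
# 10 drives per server, 1000 servers at $3000/year fully loaded.
downtime_days = 0.01 * 3 * 10          # 0.3 days/year of expected downtime
downtime_frac = downtime_days / 365    # ~0.08% of the year
savings = downtime_frac * 1000 * 3000  # ~$2,466 of the $3M annual server cost
print(f"{downtime_days} days/yr = {downtime_frac:.2%} -> ${savings:,.0f}")
```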
On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Sriram,

Thanks for taking the time to review the KIP. Please see below my answers to your questions:

> 1. Could you pick a hardware/Kafka configuration and go over what is the average disk/partition repair/restore time that we are targeting for a typical JBOD setup?

We currently don't have this data. I think the disk/partition repair/restore time depends on the availability of hardware, the response time of the site-reliability engineers, the amount of data on the bad disk, etc. These vary between companies, and even between clusters within the same company, and it is probably hard to determine the average situation.

I am not very sure why we need this. Can you explain a bit why this data is useful for evaluating the motivation and design of this KIP?

> 2. How often do we believe disks are going to fail (in your example configuration), and what do we gain by avoiding the network overhead and doing all the work of moving the replica within the broker to another disk instead of balancing it globally?

I think the chance of disk failure depends mainly on the disk itself rather than on the broker configuration. I don't have this data now. I will ask our SREs whether they know the mean time to failure for our disks. What I was told by the SREs is that disk failure is the most common type of hardware failure.

When there is a disk failure, I think it is reasonable to move the replica to another broker instead of to another disk on the same broker. The reason we want to move replicas within a broker is mainly to optimize Kafka cluster performance when we balance load across disks.

In comparison to balancing replicas globally, the benefits of moving replicas within a broker are:

1) The movement is faster, since it doesn't go through a socket or rely on the available network bandwidth.
2) There is much less impact on inter-broker replication traffic, since no bandwidth between brokers is taken up. Depending on the traffic pattern, we may need to balance load across disks frequently, and it is necessary to prevent this operation from slowing down the existing operations (e.g. produce, consume, replication) in the Kafka cluster.
3) It gives us the opportunity to do automatic rebalancing between disks on the same broker.

> 3. Even if we had to move the replica within the broker, why can we not just treat it as another replica and have it go through the same replication code path that we have today? The downside here is obviously that you need to catch up from the leader, but it is completely free! What do we think is the impact of the network overhead in this case?

Good point. My initial proposal actually used the existing ReplicaFetcherThread (i.e. the existing code path) to move replicas between disks. However, I switched to using a separate thread pool after discussion with Jun and Becket.

The main argument for using a separate thread pool is to keep the design simple and easy to reason about. There are a number of differences between inter-broker replication and intra-broker replication which make it cleaner to do them in separate code paths. I will list them below:

- The throttling mechanisms for inter-broker replication traffic and intra-broker replication traffic are different.
For example, we may want to specify a per-topic quota for inter-broker replication traffic, because we may want some topics to be moved faster than others. But we don't care about the priority of topics for intra-broker movement, so the current proposal only allows the user to specify a per-broker quota for intra-broker replication traffic.

- The quota values for inter-broker replication traffic and intra-broker replication traffic are different. The available bandwidth for intra-broker replication can probably be much higher than the bandwidth for inter-broker replication.

- The ReplicaFetcherThread is per broker. Intuitively, the number of threads doing intra-broker data movement should be related to the number of disks in the broker, not the number of brokers in the cluster.

- The leader replica has no ReplicaFetcherThread to start with. It seems weird to start one just for intra-broker replication.

Because of these differences, we think it is simpler to use a separate thread pool and code path, so that we can configure and throttle them separately.

> 4. What are the chances that we will be able to identify another disk to balance within the broker instead of another disk on another broker? If we have 100's of machines, the probability of finding a better balance by choosing another broker is much higher than balancing within the broker. Could you add some info on how we are determining this?

It is possible that we can find available space on a remote broker. The benefit of allowing intra-broker replication is that, when there is available space on both the current broker and a remote broker, the rebalance can be completed faster, with much less impact on inter-broker replication or user traffic. It is about taking advantage of locality when balancing the load.

> 5. Finally, in a cloud setup where more users are going to leverage a shared filesystem (for example, EBS in AWS), all this change is not of much gain, since you don't need to balance between the volumes within the same broker.

You are right. KIP-113 is useful only if the user uses JBOD. If the user uses an extra storage layer of replication, such as RAID-10 or EBS, they don't need KIP-112 or KIP-113.
Note that the user will replicate the data more times than the replication factor of the Kafka topic if an extra storage layer of replication is used.