And the test plan has also been updated to simulate disk failure by
changing log directory permission to 000.
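
For illustration, here is a minimal, hypothetical Java sketch of that simulation
step (the actual system tests live elsewhere; the log directory path and the
surrounding harness here are assumptions, not the KIP's test code):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.EnumSet;

public class DiskFailureSimulation {

    // Equivalent to "chmod 000 <logDir>": any subsequent read or write by the
    // broker process (running as the directory owner) throws an IOException,
    // which the broker should treat as a log directory failure.
    static void makeLogDirInaccessible(Path logDir) throws Exception {
        Files.setPosixFilePermissions(logDir, EnumSet.noneOf(PosixFilePermission.class));
    }

    // Restore normal permissions after the test.
    static void restoreLogDir(Path logDir) throws Exception {
        Files.setPosixFilePermissions(logDir, PosixFilePermissions.fromString("rwxr-xr-x"));
    }

    public static void main(String[] args) throws Exception {
        Path logDir = Paths.get("/tmp/kafka-logs");  // hypothetical broker log directory
        makeLogDirInaccessible(logDir);
        // ... produce/consume and assert the broker marks the directory offline ...
        restoreLogDir(logDir);
    }
}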

On Mon, Feb 13, 2017 at 5:50 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hi Jun,
>
> Thanks for the reply. These comments are very helpful. Let me answer them
> inline.
>
>
> On Mon, Feb 13, 2017 at 3:25 PM, Jun Rao <j...@confluent.io> wrote:
>
>> Hi, Dong,
>>
>> Thanks for the reply. A few more replies and new comments below.
>>
>> On Fri, Feb 10, 2017 at 4:27 PM, Dong Lin <lindon...@gmail.com> wrote:
>>
>> > Hi Jun,
>> >
>> > Thanks for the detailed comments. Please see answers inline:
>> >
>> > On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao <j...@confluent.io> wrote:
>> >
>> > > Hi, Dong,
>> > >
>> > > Thanks for the updated wiki. A few comments below.
>> > >
>> > > 1. Topics get created
>> > > 1.1 Instead of storing successfully created replicas in ZK, could we
>> > store
>> > > unsuccessfully created replicas in ZK? Since the latter is less
>> common,
>> > it
>> > > probably reduces the load on ZK.
>> > >
>> >
>> > We can store unsuccessfully created replicas in ZK. But I am not sure if
>> > that can reduce write load on ZK.
>> >
>> > If we want to reduce the write load on ZK by storing unsuccessfully created
>> > replicas in ZK, then the broker should not write to ZK if all replicas are
>> > successfully created. It means that if
>> > /broker/topics/[topic]/partitions/[partitionId]/controller_managed_state
>> > doesn't exist in ZK for a given partition, we have to assume all replicas of
>> > this partition have been successfully created and send LeaderAndIsrRequest
>> > with create = false. This becomes a problem if the controller crashes before
>> > receiving the LeaderAndIsrResponse to validate whether a replica has been
>> > created.
>> >
>> > I think this approach can reduce the number of bytes stored in ZK. But I am
>> > not sure if this is a concern.
>> >
>> >
>> >
>> I was mostly concerned about the controller failover time. Currently, the
>> controller failover is likely dominated by the cost of reading
>> topic/partition level information from ZK. If we add another partition
>> level path in ZK, it probably will double the controller failover time. If
>> the approach of representing the non-created replicas doesn't work, have
>> you considered just adding the created flag in the leaderAndIsr path in
>> ZK?
>>
>>
> Yes, I have considered adding the created flag in the leaderAndIsr path in
> ZK. If we were to add a created flag per replica in the LeaderAndIsrRequest,
> then it would require a lot of change in the code base.
>
> If we don't add a created flag per replica in the LeaderAndIsrRequest, then
> the information in the leaderAndIsr path in ZK and in the LeaderAndIsrRequest
> would differ. Further, the procedure for the broker to update the ISR in ZK
> will be a bit complicated. When the leader updates the leaderAndIsr path in
> ZK, it will have to first read the created flags from ZK, change the ISR, and
> write leaderAndIsr back to ZK. It also needs to check the znode version and
> retry the write operation if the controller has updated ZK during this
> period. This is in contrast to the current implementation, where the leader
> either gets all the information from the LeaderAndIsrRequest sent by the
> controller, or determines the information by itself (e.g. ISR), before
> writing to the leaderAndIsr path in ZK.
>
> It seems to me that the above solution is a bit complicated and not clean.
> Thus I came up with the design in this KIP to store this created flag in a
> separate ZK path. The path is named controller_managed_state to indicate
> that this znode stores all information that is managed by the controller
> only, as opposed to the ISR.
>
> I agree with your concern about increased ZK read time during controller
> failover. How about we store the "created" information in the
> znode /brokers/topics/[topic]? We can change that znode to have the
> following data format:
>
> {
>   "version" : 2,
>   "created" : {
>     "1" : [1, 2, 3],
>     ...
>   },
>   "partition" : {
>     "1" : [1, 2, 3],
>     ...
>   }
> }
>
> We won't have any extra ZK read with this solution. It also seems reasonable
> to put the partition assignment information together with the replica
> creation information, since the latter only changes once after the partition
> is created or re-assigned.
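>
> To make the proposal concrete, here is a minimal, hypothetical sketch (not
> the KIP's implementation) of how a controller could interpret the proposed
> version-2 /brokers/topics/[topic] znode to decide whether a replica has been
> created. It assumes Jackson for JSON parsing; the class and method names are
> illustrative only.
>
> import com.fasterxml.jackson.databind.JsonNode;
> import com.fasterxml.jackson.databind.ObjectMapper;
>
> public class TopicZNodeExample {
>     private static final ObjectMapper MAPPER = new ObjectMapper();
>
>     // Returns true if the replica is recorded as successfully created for the
>     // partition. Missing "created" information is treated as "not yet
>     // confirmed", so the controller would send LeaderAndIsrRequest with
>     // create = true for it.
>     public static boolean isReplicaCreated(String znodeJson, int partition, int brokerId)
>             throws Exception {
>         JsonNode created = MAPPER.readTree(znodeJson).path("created").path(String.valueOf(partition));
>         for (JsonNode replica : created) {
>             if (replica.asInt() == brokerId) {
>                 return true;
>             }
>         }
>         return false;
>     }
>
>     public static void main(String[] args) throws Exception {
>         String json = "{\"version\":2,\"created\":{\"1\":[1,2,3]},\"partition\":{\"1\":[1,2,3]}}";
>         System.out.println(isReplicaCreated(json, 1, 2)); // true
>         System.out.println(isReplicaCreated(json, 1, 4)); // false -> send create = true
>     }
> }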
>
>
>>
>>
>> >
>> > > 1.2 If an error is received for a follower, does the controller
>> eagerly
>> > > remove it from ISR or do we just let the leader remove it after
>> timeout?
>> > >
>> >
>> > No, the controller will not actively remove it from the ISR. But the
>> > controller will recognize it as an offline replica and propagate this
>> > information to all brokers via UpdateMetadataRequest. Each leader can use
>> > this information to actively remove the offline replica from its ISR set. I
>> > have updated the wiki to clarify it.
>> >
>> >
>>
>> That seems inconsistent with how the controller deals with offline replicas
>> due to broker failures. When that happens, the controller will (1) select a
>> new leader if the offline replica is the leader; (2) remove the replica from
>> the ISR if the offline replica is a follower. So, intuitively, it seems that
>> we should be doing the same thing when dealing with offline replicas due to
>> disk failure.
>>
>
> My bad. I misunderstood how the controller currently handles broker
> failure and ISR change. Yes, we should do the same thing when dealing with
> offline replicas here. I have updated the KIP to specify that, when an
> offline replica is discovered by the controller, the controller removes the
> offline replica from the ISR in ZK and sends a LeaderAndIsrRequest with the
> updated ISR to be used by the partition leaders.
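>
> As a rough illustration of that handling (a self-contained sketch with a toy
> data model, not the actual controller code), the logic would look roughly
> like this:
>
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
>
> public class OfflineReplicaHandlingExample {
>     static class PartitionState {
>         int leader;
>         final List<Integer> isr;
>         PartitionState(int leader, List<Integer> isr) {
>             this.leader = leader;
>             this.isr = isr;
>         }
>     }
>
>     static void handleOfflineReplica(PartitionState state, int offlineReplica) {
>         // Shrink the ISR; the real controller would write the new ISR back to ZK.
>         state.isr.remove(Integer.valueOf(offlineReplica));
>         if (state.leader == offlineReplica) {
>             // Elect a new leader from the remaining ISR, or mark the partition
>             // leaderless (-1) if the ISR is now empty.
>             state.leader = state.isr.isEmpty() ? -1 : state.isr.get(0);
>         }
>         // The real controller would then send LeaderAndIsrRequest with the
>         // updated leader and ISR to the brokers hosting this partition.
>     }
>
>     public static void main(String[] args) {
>         PartitionState p = new PartitionState(1, new ArrayList<>(Arrays.asList(1, 2, 3)));
>         handleOfflineReplica(p, 1);
>         System.out.println("leader=" + p.leader + " isr=" + p.isr); // leader=2 isr=[2, 3]
>     }
> }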
>
>
>>
>>
>>
>> >
>> > > 1.3 Similarly, if an error is received for a leader, should the
>> controller
>> > > trigger leader election again?
>> > >
>> >
>> > Yes, controller will trigger leader election if leader replica is
>> offline.
>> > I have updated the wiki to clarify it.
>> >
>> >
>> > >
>> > > 2. A log directory stops working on a broker during runtime:
>> > > 2.1 It seems the broker remembers the failed directory after hitting
>> an
>> > > IOException and the failed directory won't be used for creating new
>> > > partitions until the broker is restarted? If so, could you add that to
>> > the
>> > > wiki.
>> > >
>> >
>> > Right, the broker assumes a log directory to be good after it starts, and
>> > marks a log directory as bad once there is an IOException when the broker
>> > attempts to access it. New replicas will only be created on good log
>> > directories. I just added this to the KIP.
>> >
>> >
>> > > 2.2 Could you be a bit more specific on how and during which operation
>> > the
>> > > broker detects directory failure? Is it when the broker hits an
>> > IOException
>> > > during writes, or both reads and writes?  For example, during broker
>> > > startup, it only reads from each of the log directories, if it hits an
>> > > IOException there, does the broker immediately mark the directory as
>> > > offline?
>> > >
>> >
>> > The broker marks a log directory as bad once there is an IOException when
>> > the broker attempts to access the log directory. This includes both reads
>> > and writes; the operations include log append, log read, log cleaning,
>> > watermark checkpoint, etc. If the broker hits an IOException when it reads
>> > from a log directory during startup, it immediately marks that directory as
>> > offline.
>> >
>> > I just updated the KIP to clarify it.
>> >
>> >
>> > > 3. Partition reassignment: If we know a replica is offline, do we
>> still
>> > > want to send StopReplicaRequest to it?
>> > >
>> >
>> > No, the controller doesn't send StopReplicaRequest for an offline replica.
>> > The controller treats this scenario in the same way that the existing Kafka
>> > implementation does when the broker of this replica is offline.
>> >
>> >
>> > >
>> > > 4. UpdateMetadataRequestPartitionState: For offline_replicas, do they
>> > only
>> > > include offline replicas due to log directory failures or do they also
>> > > include offline replicas due to broker failure?
>> > >
>> >
>> > UpdateMetadataRequestPartitionState's offline_replicas include offline
>> > replicas due to both log directory failure and broker failure. This is to
>> > make the semantics of this field easier to understand. A broker can
>> > distinguish whether a replica is offline due to broker failure or disk
>> > failure by checking whether the replica's broker is live in the
>> > UpdateMetadataRequest.
>> >
>> >
>> > >
>> > > 5. Tools: Could we add some kind of support in the tool to list
>> offline
>> > > directories?
>> > >
>> >
>> > In KIP-112 we don't have tools to list offline directories because we have
>> > intentionally avoided exposing log directory information (e.g. the log
>> > directory path) to users or other brokers. I think we can add this feature
>> > in KIP-113, in which we will have DescribeDirsRequest to list the log
>> > directory information (e.g. partition assignment, path, size) needed for
>> > rebalance.
>> >
>> >
>> Since we are introducing a new failure mode, if a replica becomes offline
>> due to failure in log directories, the first thing an admin wants to know
>> is which log directories are offline from the broker's perspective.  So,
>> including such a tool will be useful. Do you plan to do KIP-112 and
>> KIP-113
>>  in the same release?
>>
>>
> Yes, I agree that including such a tool is useful. This is probably better
> to add in KIP-113 because we need DescribeDirsRequest to get this
> information. I will update KIP-113 to include this tool.
>
> I plan to do KIP-112 and KIP-113 separately to make each KIP and its patch
> easier to review. I don't have a plan yet about which release these KIPs
> should land in; my plan is to do both of them ASAP. Is there a particular
> timeline you prefer for the code of these two KIPs to be checked in?
>
>
>> >
>> > >
>> > > 6. Metrics: Could we add some metrics to show offline directories?
>> > >
>> >
>> > Sure. I think it makes sense to have each broker report its number of
>> > offline replicas and offline log directories. The former metric was
>> > previously in KIP-113. I just added both metrics to KIP-112.
>> >
>> >
>> > >
>> > > 7. There are still references to kafka-log-dirs.sh. Are they valid?
>> > >
>> >
>> > My bad. I just removed this from "Changes in Operational Procedures" and
>> > "Test Plan" in the KIP.
>> >
>> >
>> > >
>> > > 8. Do you think KIP-113 is ready for review? One thing that KIP-113
>> > > mentions during partition reassignment is to first send
>> > > LeaderAndIsrRequest, followed by ChangeReplicaDirRequest. It seems
>> it's
>> > > better if the replicas are created in the right log directory in the
>> > first
>> > > place? The reason that I brought it up here is because it may affect
>> the
>> > > protocol of LeaderAndIsrRequest.
>> > >
>> >
>> > Yes, KIP-113 is ready for review. The advantage of the current design is
>> > that we can keep LeaderAndIsrRequest log-directory-agnostic. The
>> > implementation would be much easier to read if all log-directory-related
>> > logic (e.g. various errors) is put in ChangeReplicaDirRequest and the code
>> > path for handling replica movement is separated from leadership handling.
>> >
>> > In other words, I think Kafka may be easier to develop in the long term
>> if
>> > we separate these two requests.
>> >
>> > I agree that ideally we want to create replicas in the right log
>> directory
>> > in the first place. But I am not sure if there is any performance or
>> > correctness concern with the existing way of moving it after it is
>> created.
>> > Besides, does this decision affect the change proposed in KIP-112?
>> >
>> >
>> I am just wondering if you have considered including the log directory for
>> the replicas in the LeaderAndIsrRequest.
>>
>>
> Yeah, I have thought about this idea, but only briefly. I rejected this
> idea because the log directory is the broker's local information, and I
> prefer not to expose local config information to the cluster through
> LeaderAndIsrRequest.
>
>
>> 9. Could you describe when the offline replicas due to log directory
>> failure are removed from the replica fetch threads?
>>
>
> Yes. If the offline replica was a leader, either a new leader is elected
> or all follower brokers will stop fetching for this partition. If the
> offline replica is a follower, the broker will stop fetching for this
> replica immediately. A broker stops fetching data for a replica by removing
> the replica from the replica fetch threads. I have updated the KIP to
> clarify it.
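>
> As a toy illustration of that mechanism (names here are made up, this is not
> Kafka's actual replica fetcher manager code), removing a failed directory's
> replicas from the fetchers could look roughly like this:
>
> import java.util.HashMap;
> import java.util.HashSet;
> import java.util.Map;
> import java.util.Set;
>
> public class FetcherRemovalExample {
>     // partition -> log directory it lives in (toy stand-in for the broker's log manager)
>     private final Map<String, String> partitionToLogDir = new HashMap<>();
>     // source broker id -> partitions currently being fetched from it
>     private final Map<Integer, Set<String>> fetcherAssignments = new HashMap<>();
>
>     void addFetcher(int sourceBroker, String partition, String logDir) {
>         partitionToLogDir.put(partition, logDir);
>         fetcherAssignments.computeIfAbsent(sourceBroker, b -> new HashSet<>()).add(partition);
>     }
>
>     // Called when an IOException marks logDir offline: stop fetching for every
>     // replica stored on that directory by removing it from its fetcher.
>     void handleLogDirOffline(String logDir) {
>         for (Set<String> partitions : fetcherAssignments.values()) {
>             partitions.removeIf(p -> logDir.equals(partitionToLogDir.get(p)));
>         }
>     }
>
>     public static void main(String[] args) {
>         FetcherRemovalExample broker = new FetcherRemovalExample();
>         broker.addFetcher(1, "topicA-0", "/data/d1");
>         broker.addFetcher(2, "topicA-1", "/data/d2");
>         broker.handleLogDirOffline("/data/d1");
>         System.out.println(broker.fetcherAssignments); // e.g. {1=[], 2=[topicA-1]}
>     }
> }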
>
>
>>
>> 10. The wiki mentioned changing the log directory to a file for simulating
>> disk failure in system tests. Could we just change the permission of the
>> log directory to 000 to simulate that?
>>
>
>
> Sure,
>
>
>>
>> Thanks,
>>
>> Jun
>>
>>
>> > > Jun
>> > >
>> > > On Fri, Feb 10, 2017 at 9:53 AM, Dong Lin <lindon...@gmail.com>
>> wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > Can I replace zookeeper access with direct RPC for both ISR
>> > notification
>> > > > and disk failure notification in a future KIP, or do you feel we
>> should
>> > > do
>> > > > it in this KIP?
>> > > >
>> > > > Hi Eno, Grant and everyone,
>> > > >
>> > > > Is there further improvement you would like to see with this KIP?
>> > > >
>> > > > Thanks you all for the comments,
>> > > >
>> > > > Dong
>> > > >
>> > > >
>> > > >
>> > > > On Thu, Feb 9, 2017 at 4:45 PM, Dong Lin <lindon...@gmail.com>
>> wrote:
>> > > >
>> > > > >
>> > > > >
>> > > > > On Thu, Feb 9, 2017 at 3:37 PM, Colin McCabe <cmcc...@apache.org>
>> > > wrote:
>> > > > >
>> > > > >> On Thu, Feb 9, 2017, at 11:40, Dong Lin wrote:
>> > > > >> > Thanks for all the comments Colin!
>> > > > >> >
>> > > > >> > To answer your questions:
>> > > > >> > - Yes, a broker will shutdown if all its log directories are
>> bad.
>> > > > >>
>> > > > >> That makes sense.  Can you add this to the writeup?
>> > > > >>
>> > > > >
>> > > > > Sure. This has already been added. You can find it here
>> > > > > <https://cwiki.apache.org/confluence/pages/diffpagesbyversio
>> n.action
>> > ?
>> > > > pageId=67638402&selectedPageVersions=9&selectedPageVersions=10>
>> > > > > .
>> > > > >
>> > > > >
>> > > > >>
>> > > > >> > - I updated the KIP to explicitly state that a log directory
>> will
>> > be
>> > > > >> > assumed to be good until broker sees IOException when it tries
>> to
>> > > > access
>> > > > >> > the log directory.
>> > > > >>
>> > > > >> Thanks.
>> > > > >>
>> > > > >> > - Controller doesn't explicitly know whether there is new log
>> > > > directory
>> > > > >> > or
>> > > > >> > not. All controller knows is whether replicas are online or
>> > offline
>> > > > >> based
>> > > > >> > on LeaderAndIsrResponse. According to the existing Kafka
>> > > > implementation,
>> > > > >> > controller will always send LeaderAndIsrRequest to a broker
>> after
>> > it
>> > > > >> > bounces.
>> > > > >>
>> > > > >> I thought so.  It's good to clarify, though.  Do you think it's
>> > worth
>> > > > >> adding a quick discussion of this on the wiki?
>> > > > >>
>> > > > >
>> > > > > Personally I don't think it is needed. If broker starts with no
>> bad
>> > log
>> > > > > directory, everything should work as it is and we should not need to
>> > > clarify
>> > > > > it. The KIP has already covered the scenario when a broker starts
>> > with
>> > > > bad
>> > > > > log directory. Also, the KIP doesn't claim or hint that we support
>> > > > dynamic
>> > > > > addition of new log directories. I think we are good.
>> > > > >
>> > > > >
>> > > > >> best,
>> > > > >> Colin
>> > > > >>
>> > > > >> >
>> > > > >> > Please see this
>> > > > >> > <https://cwiki.apache.org/confluence/pages/diffpagesbyversio
>> > > > >> n.action?pageId=67638402&selectedPageVersions=9&
>> > > > selectedPageVersions=10>
>> > > > >> > for the change of the KIP.
>> > > > >> >
>> > > > >> > On Thu, Feb 9, 2017 at 11:04 AM, Colin McCabe <
>> cmcc...@apache.org
>> > >
>> > > > >> wrote:
>> > > > >> >
>> > > > >> > > On Thu, Feb 9, 2017, at 11:03, Colin McCabe wrote:
>> > > > >> > > > Thanks, Dong L.
>> > > > >> > > >
>> > > > >> > > > Do we plan on bringing down the broker process when all log
>> > > > >> directories
>> > > > >> > > > are offline?
>> > > > >> > > >
>> > > > >> > > > Can you explicitly state on the KIP that the log dirs are
>> all
>> > > > >> considered
>> > > > >> > > > good after the broker process is bounced?  It seems like an
>> > > > >> important
>> > > > >> > > > thing to be clear about.  Also, perhaps discuss how the
>> > > controller
>> > > > >> > > > becomes aware of the newly good log directories after a
>> broker
>> > > > >> bounce
>> > > > >> > > > (and whether this triggers re-election).
>> > > > >> > >
>> > > > >> > > I meant to write, all the log dirs where the broker can still
>> > read
>> > > > the
>> > > > >> > > index and some other files.  Clearly, log dirs that are
>> > completely
>> > > > >> > > inaccessible will still be considered bad after a broker
>> process
>> > > > >> bounce.
>> > > > >> > >
>> > > > >> > > best,
>> > > > >> > > Colin
>> > > > >> > >
>> > > > >> > > >
>> > > > >> > > > +1 (non-binding) aside from that
>> > > > >> > > >
>> > > > >> > > >
>> > > > >> > > >
>> > > > >> > > > On Wed, Feb 8, 2017, at 00:47, Dong Lin wrote:
>> > > > >> > > > > Hi all,
>> > > > >> > > > >
>> > > > >> > > > > Thank you all for the helpful suggestion. I have updated
>> the
>> > > KIP
>> > > > >> to
>> > > > >> > > > > address
>> > > > >> > > > > the comments received so far. See here
>> > > > >> > > > > <https://cwiki.apache.org/conf
>> > luence/pages/diffpagesbyversio
>> > > > >> n.action?
>> > > > >> > > pageId=67638402&selectedPageVersions=8&selectedPageVersions=
>> > 9>to
>> > > > >> > > > > read the changes of the KIP. Here is a summary of change:
>> > > > >> > > > >
>> > > > >> > > > > - Updated the Proposed Change section to change the
>> recovery
>> > > > >> steps.
>> > > > >> > > After
>> > > > >> > > > > this change, broker will also create replica as long as
>> all
>> > > log
>> > > > >> > > > > directories
>> > > > >> > > > > are working.
>> > > > >> > > > > - Removed kafka-log-dirs.sh from this KIP since user no
>> > longer
>> > > > >> needs to
>> > > > >> > > > > use
>> > > > >> > > > > it for recovery from bad disks.
>> > > > >> > > > > - Explained how the znode controller_managed_state is
>> > managed
>> > > in
>> > > > >> the
>> > > > >> > > > > Public
>> > > > >> > > > > interface section.
>> > > > >> > > > > - Explained what happens during controller failover,
>> > partition
>> > > > >> > > > > reassignment
>> > > > >> > > > > and topic deletion in the Proposed Change section.
>> > > > >> > > > > - Updated Future Work section to include the following
>> > > potential
>> > > > >> > > > > improvements
>> > > > >> > > > >   - Let broker notify controller of ISR change and disk
>> > state
>> > > > >> change
>> > > > >> > > via
>> > > > >> > > > > RPC instead of using zookeeper
>> > > > >> > > > >   - Handle various failure scenarios (e.g. slow disk) on
>> a
>> > > > >> case-by-case
>> > > > >> > > > > basis. For example, we may want to detect slow disk and
>> > > consider
>> > > > >> it as
>> > > > >> > > > > offline.
>> > > > >> > > > >   - Allow admin to mark a directory as bad so that it
>> will
>> > not
>> > > > be
>> > > > >> used.
>> > > > >> > > > >
>> > > > >> > > > > Thanks,
>> > > > >> > > > > Dong
>> > > > >> > > > >
>> > > > >> > > > >
>> > > > >> > > > >
>> > > > >> > > > > On Tue, Feb 7, 2017 at 5:23 PM, Dong Lin <
>> > lindon...@gmail.com
>> > > >
>> > > > >> wrote:
>> > > > >> > > > >
>> > > > >> > > > > > Hey Eno,
>> > > > >> > > > > >
>> > > > >> > > > > > Thanks much for the comment!
>> > > > >> > > > > >
>> > > > >> > > > > > I still think the complexity added to Kafka is
>> justified
>> > by
>> > > > its
>> > > > >> > > benefit.
>> > > > >> > > > > > Let me provide my reasons below.
>> > > > >> > > > > >
>> > > > >> > > > > > 1) The additional logic is easy to understand and thus
>> its
>> > > > >> complexity
>> > > > >> > > > > > should be reasonable.
>> > > > >> > > > > >
>> > > > >> > > > > > On the broker side, it needs to catch exception when
>> > access
>> > > > log
>> > > > >> > > directory,
>> > > > >> > > > > > mark log directory and all its replicas as offline,
>> notify
>> > > > >> > > controller by
>> > > > >> > > > > > writing the zookeeper notification path, and specify
>> error
>> > > in
>> > > > >> > > > > > LeaderAndIsrResponse. On the controller side, it will
>> > > listen
>> > > > >> to
>> > > > >> > > > > > zookeeper for disk failure notification, learn about
>> > offline
>> > > > >> > > replicas in
>> > > > >> > > > > > the LeaderAndIsrResponse, and take offline replicas
>> into
>> > > > >> > > consideration when
>> > > > >> > > > > > electing leaders. It also marks the replica as created in
>> > > zookeeper
>> > > > >> and
>> > > > >> > > use it
>> > > > >> > > > > > to determine whether a replica is created.
>> > > > >> > > > > >
>> > > > >> > > > > > That is all the logic we need to add in Kafka. I
>> > personally
>> > > > feel
>> > > > >> > > this is
>> > > > >> > > > > > easy to reason about.
>> > > > >> > > > > >
>> > > > >> > > > > > 2) The additional code is not much.
>> > > > >> > > > > >
>> > > > >> > > > > > I expect the code for KIP-112 to be around 1100 lines
>> new
>> > > > code.
>> > > > >> > > Previously
>> > > > >> > > > > > I have implemented a prototype of a slightly different
>> > > design
>> > > > >> (see
>> > > > >> > > here
>> > > > >> > > > > > <https://docs.google.com/docum
>> ent/d/1Izza0SBmZMVUBUt9s_
>> > > > >> > > -Dqi3D8e0KGJQYW8xgEdRsgAI/edit>)
>> > > > >> > > > > > and uploaded it to github (see here
>> > > > >> > > > > > <https://github.com/lindong28/kafka/tree/JBOD>). The
>> > patch
>> > > > >> changed
>> > > > >> > > 33
>> > > > >> > > > > > files, added 1185 lines and deleted 183 lines. The
>> size of
>> > > > >> prototype
>> > > > >> > > patch
>> > > > >> > > > > > is actually smaller than patch of KIP-107 (see here
>> > > > >> > > > > > <https://github.com/apache/kafka/pull/2476>) which is
>> > > already
>> > > > >> > > accepted.
>> > > > >> > > > > > The KIP-107 patch changed 49 files, added 1349 lines
>> and
>> > > > >> deleted 141
>> > > > >> > > lines.
>> > > > >> > > > > >
>> > > > >> > > > > > 3) Comparison with one-broker-per-multiple-volumes
>> > > > >> > > > > >
>> > > > >> > > > > > This KIP can improve the availability of Kafka in this
>> > case
>> > > > such
>> > > > >> > > that one
>> > > > >> > > > > > failed volume doesn't bring down the entire broker.
>> > > > >> > > > > >
>> > > > >> > > > > > 4) Comparison with one-broker-per-volume
>> > > > >> > > > > >
>> > > > >> > > > > > If each volume maps to multiple disks, then we still
>> have
>> > > > >> similar
>> > > > >> > > problem
>> > > > >> > > > > > such that the broker will fail if any disk of the
>> volume
>> > > > failed.
>> > > > >> > > > > >
>> > > > >> > > > > > If each volume maps to one disk, it means that we need
>> to
>> > > > >> deploy 10
>> > > > >> > > > > > brokers on a machine if the machine has 10 disks. I
>> will
>> > > > >> explain the
>> > > > >> > > > > > concern with this approach in order of their
>> importance.
>> > > > >> > > > > >
>> > > > >> > > > > > - It is weird if we were to tell kafka user to deploy
>> 50
>> > > > >> brokers on a
>> > > > >> > > > > > machine of 50 disks.
>> > > > >> > > > > >
>> > > > >> > > > > > - Either when user deploys Kafka on a commercial cloud
>> > > > platform
>> > > > >> or
>> > > > >> > > when
>> > > > >> > > > > > user deploys their own cluster, the size or largest
>> disk
>> > is
>> > > > >> usually
>> > > > >> > > > > > limited. There will be scenarios where user want to
>> > increase
>> > > > >> broker
>> > > > >> > > > > > capacity by having multiple disks per broker. This JBOD
>> > KIP
>> > > > >> makes it
>> > > > >> > > > > > feasible without hurting availability due to single
>> disk
>> > > > >> failure.
>> > > > >> > > > > >
>> > > > >> > > > > > - Automatic load rebalance across disks will be easier
>> and
>> > > > more
>> > > > >> > > flexible
>> > > > >> > > > > > if one broker has multiple disks. This can be future
>> work.
>> > > > >> > > > > >
>> > > > >> > > > > > - There is performance concern when you deploy 10
>> broker
>> > > vs. 1
>> > > > >> > > broker on
>> > > > >> > > > > > one machine. The metadata the cluster, including
>> > > FetchRequest,
>> > > > >> > > > > > ProduceResponse, MetadataRequest and so on will all be
>> 10X
>> > > > >> more. The
>> > > > >> > > > > > packet-per-second will be 10X higher which may limit
>> > > > >> performance if
>> > > > >> > > pps is
>> > > > >> > > > > > the performance bottleneck. The number of socket on the
>> > > > machine
>> > > > >> is
>> > > > >> > > 10X
>> > > > >> > > > > > higher. And the number of replication thread will be
>> 100X
>> > > > more.
>> > > > >> The
>> > > > >> > > impact
>> > > > >> > > > > > will be more significant with increasing number of
>> disks
>> > per
>> > > > >> > > machine. Thus
>> > > > >> > > > > > it will limit Kakfa's scalability in the long term.
>> > > > >> > > > > >
>> > > > >> > > > > > Thanks,
>> > > > >> > > > > > Dong
>> > > > >> > > > > >
>> > > > >> > > > > >
>> > > > >> > > > > > On Tue, Feb 7, 2017 at 1:51 AM, Eno Thereska <
>> > > > >> eno.there...@gmail.com
>> > > > >> > > >
>> > > > >> > > > > > wrote:
>> > > > >> > > > > >
>> > > > >> > > > > >> Hi Dong,
>> > > > >> > > > > >>
>> > > > >> > > > > >> To simplify the discussion today, on my part I'll zoom
>> > into
>> > > > one
>> > > > >> > > thing
>> > > > >> > > > > >> only:
>> > > > >> > > > > >>
>> > > > >> > > > > >> - I'll discuss the options called below :
>> > > > >> "one-broker-per-disk" or
>> > > > >> > > > > >> "one-broker-per-few-disks".
>> > > > >> > > > > >>
>> > > > >> > > > > >> - I completely buy the JBOD vs RAID arguments so
>> there is
>> > > no
>> > > > >> need to
>> > > > >> > > > > >> discuss that part for me. I buy it that JBODs are
>> good.
>> > > > >> > > > > >>
>> > > > >> > > > > >> I find the terminology can be improved a bit. Ideally
>> > we'd
>> > > be
>> > > > >> > > talking
>> > > > >> > > > > >> about volumes, not disks. Just to make it clear that
>> > Kafka
>> > > > >> > > understand
>> > > > >> > > > > >> volumes/directories, not individual raw disks. So by
>> > > > >> > > > > >> "one-broker-per-few-disks" what I mean is that the
>> admin
>> > > can
>> > > > >> pool a
>> > > > >> > > few
>> > > > >> > > > > >> disks together to create a volume/directory and give
>> that
>> > > to
>> > > > >> Kafka.
>> > > > >> > > > > >>
>> > > > >> > > > > >>
>> > > > >> > > > > >> The kernel of my question will be that the admin
>> already
>> > > has
>> > > > >> tools
>> > > > >> > > to 1)
>> > > > >> > > > > >> create volumes/directories from a JBOD and 2) start a
>> > > broker
>> > > > >> on a
>> > > > >> > > desired
>> > > > >> > > > > >> machine and 3) assign a broker resources like a
>> > directory.
>> > > I
>> > > > >> claim
>> > > > >> > > that
>> > > > >> > > > > >> those tools are sufficient to optimise resource
>> > allocation.
>> > > > I
>> > > > >> > > understand
>> > > > >> > > > > >> that a broker could manage point 3) itself, ie juggle
>> the
>> > > > >> > > directories. My
>> > > > >> > > > > >> question is whether the complexity added to Kafka is
>> > > > justified.
>> > > > >> > > > > >> Operationally it seems to me an admin will still have
>> to
>> > do
>> > > > >> all the
>> > > > >> > > three
>> > > > >> > > > > >> items above.
>> > > > >> > > > > >>
>> > > > >> > > > > >> Looking forward to the discussion
>> > > > >> > > > > >> Thanks
>> > > > >> > > > > >> Eno
>> > > > >> > > > > >>
>> > > > >> > > > > >>
>> > > > >> > > > > >> > On 1 Feb 2017, at 17:21, Dong Lin <
>> lindon...@gmail.com
>> > >
>> > > > >> wrote:
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > Hey Eno,
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > Thanks much for the review.
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > I think your suggestion is to split disks of a
>> machine
>> > > into
>> > > > >> > > multiple
>> > > > >> > > > > >> disk
>> > > > >> > > > > >> > sets and run one broker per disk set. Yeah this is
>> > > similar
>> > > > to
>> > > > >> > > Colin's
>> > > > >> > > > > >> > suggestion of one-broker-per-disk, which we have
>> > > evaluated
>> > > > at
>> > > > >> > > LinkedIn
>> > > > >> > > > > >> and
>> > > > >> > > > > >> > considered it to be a good short term approach.
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > As of now I don't think any of these approach is a
>> > better
>> > > > >> > > alternative in
>> > > > >> > > > > >> > the long term. I will summarize these here. I have
>> put
>> > > > these
>> > > > >> > > reasons in
>> > > > >> > > > > >> the
>> > > > >> > > > > >> > KIP's motivation section and rejected alternative
>> > > section.
>> > > > I
>> > > > >> am
>> > > > >> > > happy to
>> > > > >> > > > > >> > discuss more and I would certainly like to use an
>> > > > alternative
>> > > > >> > > solution
>> > > > >> > > > > >> that
>> > > > >> > > > > >> > is easier to do with better performance.
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > - JBOD vs. RAID-10: if we switch from RAID-10 with
>> > > > >> > > > > >> replication-factoer=2 to
>> > > > >> > > > > >> > JBOD with replicatio-factor=3, we get 25% reduction
>> in
>> > > disk
>> > > > >> usage
>> > > > >> > > and
>> > > > >> > > > > >> > doubles the tolerance of broker failure before data
>> > > > >> > > unavailability from
>> > > > >> > > > > >> 1
>> > > > >> > > > > >> > to 2. This is pretty huge gain for any company that
>> > uses
>> > > > >> Kafka at
>> > > > >> > > large
>> > > > >> > > > > >> > scale.
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > - JBOD vs. one-broker-per-disk: The benefit of
>> > > > >> > > one-broker-per-disk is
>> > > > >> > > > > >> that
>> > > > >> > > > > >> > no major code change is needed in Kafka. Among the
>> > > > >> disadvantage of
>> > > > >> > > > > >> > one-broker-per-disk summarized in the KIP and
>> previous
>> > > > email
>> > > > >> with
>> > > > >> > > Colin,
>> > > > >> > > > > >> > the biggest one is the 15% throughput loss compared
>> to
>> > > JBOD
>> > > > >> and
>> > > > >> > > less
>> > > > >> > > > > >> > flexibility to balance across disks. Further, it
>> > probably
>> > > > >> requires
>> > > > >> > > > > >> change
>> > > > >> > > > > >> > to internal deployment tools at various companies to
>> > deal
>> > > > >> with
>> > > > >> > > > > >> > one-broker-per-disk setup.
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > - JBOD vs. RAID-0: This is the setup that used at
>> > > > Microsoft.
>> > > > >> The
>> > > > >> > > > > >> problem is
>> > > > >> > > > > >> > that a broker becomes unavailable if any disk fail.
>> > > Suppose
>> > > > >> > > > > >> > replication-factor=2 and there are 10 disks per
>> > machine.
>> > > > >> Then the
>> > > > >> > > > > >> > probability of any message becoming unavailable
>> due
>> > to
>> > > > disk
>> > > > >> > > failure
>> > > > >> > > > > >> with
>> > > > >> > > > > >> > RAID-0 is 100X higher than that with JBOD.
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > - JBOD vs. one-broker-per-few-disks:
>> > > > one-broker-per-few-disk
>> > > > >> is
>> > > > >> > > > > >> somewhere
>> > > > >> > > > > >> > between one-broker-per-disk and RAID-0. So it
>> carries
>> > an
>> > > > >> averaged
>> > > > >> > > > > >> > disadvantages of these two approaches.
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > To answer your question regarding, I think it is
>> > > reasonable
>> > > > >> to
>> > > > >> > > mange
>> > > > >> > > > > >> disk
>> > > > >> > > > > >> > in Kafka. By "managing disks" we mean the
>> management of
>> > > > >> > > assignment of
>> > > > >> > > > > >> > replicas across disks. Here are my reasons in more
>> > > detail:
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > - I don't think this KIP is a big step change. By
>> > > allowing
>> > > > >> user to
>> > > > >> > > > > >> > configure Kafka to run multiple log directories or
>> > disks
>> > > as
>> > > > >> of
>> > > > >> > > now, it
>> > > > >> > > > > >> is
>> > > > >> > > > > >> > implicit that Kafka manages disks. It is just not a
>> > > > complete
>> > > > >> > > feature.
>> > > > >> > > > > >> > Microsoft and probably other companies are using
>> this
>> > > > feature
>> > > > >> > > under the
>> > > > >> > > > > >> > undesirable effect that a broker will fail any if
>> any
>> > > disk
>> > > > >> fail.
>> > > > >> > > It is
>> > > > >> > > > > >> good
>> > > > >> > > > > >> > to complete this feature.
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > - I think it is reasonable to manage disk in Kafka.
>> One
>> > > of
>> > > > >> the
>> > > > >> > > most
>> > > > >> > > > > >> > important work that Kafka is doing is to determine
>> the
>> > > > >> replica
>> > > > >> > > > > >> assignment
>> > > > >> > > > > >> > across brokers and make sure enough copies of a
>> given
>> > > > >> replica is
>> > > > >> > > > > >> available.
>> > > > >> > > > > >> > I would argue that it is not much different than
>> > > > determining
>> > > > >> the
>> > > > >> > > replica
>> > > > >> > > > > >> > assignment across disk conceptually.
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > - I would agree that this KIP improves the
>> performance of
>> > > > >> Kafka at
>> > > > >> > > the
>> > > > >> > > > > >> cost
>> > > > >> > > > > >> > of more complexity inside Kafka, by switching from
>> > > RAID-10
>> > > > to
>> > > > >> > > JBOD. I
>> > > > >> > > > > >> would
>> > > > >> > > > > >> > argue that this is a right direction. If we can gain
>> > 20%+
>> > > > >> > > performance by
>> > > > >> > > > > >> > managing NIC in Kafka as compared to existing
>> approach
>> > > and
>> > > > >> other
>> > > > >> > > > > >> > alternatives, I would say we should just do it.
>> Such a
>> > > gain
>> > > > >> in
>> > > > >> > > > > >> performance,
>> > > > >> > > > > >> > or equivalently reduction in cost, can save
>> millions of
>> > > > >> dollars
>> > > > >> > > per year
>> > > > >> > > > > >> > for any company running Kafka at large scale.
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > Thanks,
>> > > > >> > > > > >> > Dong
>> > > > >> > > > > >> >
>> > > > >> > > > > >> >
>> > > > >> > > > > >> > On Wed, Feb 1, 2017 at 5:41 AM, Eno Thereska <
>> > > > >> > > eno.there...@gmail.com>
>> > > > >> > > > > >> wrote:
>> > > > >> > > > > >> >
>> > > > >> > > > > >> >> I'm coming somewhat late to the discussion,
>> apologies
>> > > for
>> > > > >> that.
>> > > > >> > > > > >> >>
>> > > > >> > > > > >> >> I'm worried about this proposal. It's moving Kafka
>> to
>> > a
>> > > > >> world
>> > > > >> > > where it
>> > > > >> > > > > >> >> manages disks. So in a sense, the scope of the KIP
>> is
>> > > > >> limited,
>> > > > >> > > but the
>> > > > >> > > > > >> >> direction it sets for Kafka is quite a big step
>> > change.
>> > > > >> > > Fundamentally
>> > > > >> > > > > >> this
>> > > > >> > > > > >> >> is about balancing resources for a Kafka broker.
>> This
>> > > can
>> > > > be
>> > > > >> > > done by a
>> > > > >> > > > > >> >> tool, rather than by changing Kafka. E.g., the tool
>> > > would
>> > > > >> take a
>> > > > >> > > bunch
>> > > > >> > > > > >> of
>> > > > >> > > > > >> >> disks together, create a volume over them and
>> export
>> > > that
>> > > > >> to a
>> > > > >> > > Kafka
>> > > > >> > > > > >> broker
>> > > > >> > > > > >> >> (in addition to setting the memory limits for that
>> > > broker
>> > > > or
>> > > > >> > > limiting
>> > > > >> > > > > >> other
>> > > > >> > > > > >> >> resources). A different bunch of disks can then
>> make
>> > up
>> > > a
>> > > > >> second
>> > > > >> > > > > >> volume,
>> > > > >> > > > > >> >> and be used by another Kafka broker. This is
>> aligned
>> > > with
>> > > > >> what
>> > > > >> > > Colin is
>> > > > >> > > > > >> >> saying (as I understand it).
>> > > > >> > > > > >> >>
>> > > > >> > > > > >> >> Disks are not the only resource on a machine, there
>> > are
>> > > > >> several
>> > > > >> > > > > >> instances
>> > > > >> > > > > >> >> where multiple NICs are used for example. Do we
>> want
>> > > fine
>> > > > >> grained
>> > > > >> > > > > >> >> management of all these resources? I'd argue that
>> > opens
>> > > us
>> > > > >> the
>> > > > >> > > system
>> > > > >> > > > > >> to a
>> > > > >> > > > > >> >> lot of complexity.
>> > > > >> > > > > >> >>
>> > > > >> > > > > >> >> Thanks
>> > > > >> > > > > >> >> Eno
>> > > > >> > > > > >> >>
>> > > > >> > > > > >> >>
>> > > > >> > > > > >> >>> On 1 Feb 2017, at 01:53, Dong Lin <
>> > lindon...@gmail.com
>> > > >
>> > > > >> wrote:
>> > > > >> > > > > >> >>>
>> > > > >> > > > > >> >>> Hi all,
>> > > > >> > > > > >> >>>
>> > > > >> > > > > >> >>> I am going to initiate the vote If there is no
>> > further
>> > > > >> concern
>> > > > >> > > with
>> > > > >> > > > > >> the
>> > > > >> > > > > >> >> KIP.
>> > > > >> > > > > >> >>>
>> > > > >> > > > > >> >>> Thanks,
>> > > > >> > > > > >> >>> Dong
>> > > > >> > > > > >> >>>
>> > > > >> > > > > >> >>>
>> > > > >> > > > > >> >>> On Fri, Jan 27, 2017 at 8:08 PM, radai <
>> > > > >> > > radai.rosenbl...@gmail.com>
>> > > > >> > > > > >> >> wrote:
>> > > > >> > > > > >> >>>
>> > > > >> > > > > >> >>>> a few extra points:
>> > > > >> > > > > >> >>>>
>> > > > >> > > > > >> >>>> 1. broker per disk might also incur more client
>> <-->
>> > > > >> broker
>> > > > >> > > sockets:
>> > > > >> > > > > >> >>>> suppose every producer / consumer "talks" to >1
>> > > > partition,
>> > > > >> > > there's a
>> > > > >> > > > > >> >> very
>> > > > >> > > > > >> >>>> good chance that partitions that were co-located
>> on
>> > a
>> > > > >> single
>> > > > >> > > 10-disk
>> > > > >> > > > > >> >> broker
>> > > > >> > > > > >> >>>> would now be split between several single-disk
>> > broker
>> > > > >> > > processes on
>> > > > >> > > > > >> the
>> > > > >> > > > > >> >> same
>> > > > >> > > > > >> >>>> machine. hard to put a multiplier on this, but
>> > likely
>> > > > >x1.
>> > > > >> > > sockets
>> > > > >> > > > > >> are a
>> > > > >> > > > > >> >>>> limited resource at the OS level and incur some
>> > memory
>> > > > >> cost
>> > > > >> > > (kernel
>> > > > >> > > > > >> >>>> buffers)
>> > > > >> > > > > >> >>>>
>> > > > >> > > > > >> >>>> 2. there's a memory overhead to spinning up a JVM
>> > > > >> (compiled
>> > > > >> > > code and
>> > > > >> > > > > >> >> byte
>> > > > >> > > > > >> >>>> code objects etc). if we assume this overhead is
>> > ~300
>> > > MB
>> > > > >> > > (order of
>> > > > >> > > > > >> >>>> magnitude, specifics vary) than spinning up 10
>> JVMs
>> > > > would
>> > > > >> lose
>> > > > >> > > you 3
>> > > > >> > > > > >> GB
>> > > > >> > > > > >> >> of
>> > > > >> > > > > >> >>>> RAM. not a ton, but non negligible.
>> > > > >> > > > > >> >>>>
>> > > > >> > > > > >> >>>> 3. there would also be some overhead downstream
>> of
>> > > kafka
>> > > > >> in any
>> > > > >> > > > > >> >> management
>> > > > >> > > > > >> >>>> / monitoring / log aggregation system. likely
>> less
>> > > than
>> > > > >> x10
>> > > > >> > > though.
>> > > > >> > > > > >> >>>>
>> > > > >> > > > > >> >>>> 4. (related to above) - added complexity of
>> > > > administration
>> > > > >> > > with more
>> > > > >> > > > > >> >>>> running instances.
>> > > > >> > > > > >> >>>>
>> > > > >> > > > > >> >>>> is anyone running kafka with anywhere near 100GB
>> > > heaps?
>> > > > i
>> > > > >> > > thought the
>> > > > >> > > > > >> >> point
>> > > > >> > > > > >> >>>> was to rely on kernel page cache to do the disk
>> > > > buffering
>> > > > >> ....
>> > > > >> > > > > >> >>>>
>> > > > >> > > > > >> >>>> On Thu, Jan 26, 2017 at 11:00 AM, Dong Lin <
>> > > > >> > > lindon...@gmail.com>
>> > > > >> > > > > >> wrote:
>> > > > >> > > > > >> >>>>
>> > > > >> > > > > >> >>>>> Hey Colin,
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>> Thanks much for the comment. Please see me
>> comment
>> > > > >> inline.
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>> On Thu, Jan 26, 2017 at 10:15 AM, Colin McCabe <
>> > > > >> > > cmcc...@apache.org>
>> > > > >> > > > > >> >>>> wrote:
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>>> On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
>> > > > >> > > > > >> >>>>>>> Hey Colin,
>> > > > >> > > > > >> >>>>>>>
>> > > > >> > > > > >> >>>>>>> Good point! Yeah we have actually considered
>> and
>> > > > >> tested this
>> > > > >> > > > > >> >>>> solution,
>> > > > >> > > > > >> >>>>>>> which we call one-broker-per-disk. It would
>> work
>> > > and
>> > > > >> should
>> > > > >> > > > > >> require
>> > > > >> > > > > >> >>>> no
>> > > > >> > > > > >> >>>>>>> major change in Kafka as compared to this JBOD
>> > KIP.
>> > > > So
>> > > > >> it
>> > > > >> > > would
>> > > > >> > > > > >> be a
>> > > > >> > > > > >> >>>>> good
>> > > > >> > > > > >> >>>>>>> short term solution.
>> > > > >> > > > > >> >>>>>>>
>> > > > >> > > > > >> >>>>>>> But it has a few drawbacks which makes it less
>> > > > >> desirable in
>> > > > >> > > the
>> > > > >> > > > > >> long
>> > > > >> > > > > >> >>>>>>> term.
>> > > > >> > > > > >> >>>>>>> Assume we have 10 disks on a machine. Here are
>> > the
>> > > > >> problems:
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>> Hi Dong,
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>> Thanks for the thoughtful reply.
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>>>
>> > > > >> > > > > >> >>>>>>> 1) Our stress test result shows that
>> > > > >> one-broker-per-disk
>> > > > >> > > has 15%
>> > > > >> > > > > >> >>>> lower
>> > > > >> > > > > >> >>>>>>> throughput
>> > > > >> > > > > >> >>>>>>>
>> > > > >> > > > > >> >>>>>>> 2) Controller would need to send 10X as many
>> > > > >> > > LeaderAndIsrRequest,
>> > > > >> > > > > >> >>>>>>> MetadataUpdateRequest and StopReplicaRequest.
>> > This
>> > > > >> > > increases the
>> > > > >> > > > > >> >>>> burden
>> > > > >> > > > > >> >>>>>>> on
>> > > > >> > > > > >> >>>>>>> controller which can be the performance
>> > bottleneck.
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>> Maybe I'm misunderstanding something, but there
>> > > would
>> > > > >> not be
>> > > > >> > > 10x as
>> > > > >> > > > > >> >>>> many
>> > > > >> > > > > >> >>>>>> StopReplicaRequest RPCs, would there?  The
>> other
>> > > > >> requests
>> > > > >> > > would
>> > > > >> > > > > >> >>>> increase
>> > > > >> > > > > >> >>>>>> 10x, but from a pretty low base, right?  We are
>> > not
>> > > > >> > > reassigning
>> > > > >> > > > > >> >>>>>> partitions all the time, I hope (or else we
>> have
>> > > > bigger
>> > > > >> > > > > >> problems...)
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>> I think the controller will group
>> > StopReplicaRequest
>> > > > per
>> > > > >> > > broker and
>> > > > >> > > > > >> >> send
>> > > > >> > > > > >> >>>>> only one StopReplicaRequest to a broker during
>> > > > controlled
>> > > > >> > > shutdown.
>> > > > >> > > > > >> >>>> Anyway,
>> > > > >> > > > > >> >>>>> we don't have to worry about this if we agree
>> that
>> > > > other
>> > > > >> > > requests
>> > > > >> > > > > >> will
>> > > > >> > > > > >> >>>>> increase by 10X. One MetadataRequest to send to
>> > each
>> > > > >> broker
>> > > > >> > > in the
>> > > > >> > > > > >> >>>> cluster
>> > > > >> > > > > >> >>>>> every time there is leadership change. I am not
>> > sure
>> > > > >> this is
>> > > > >> > > a real
>> > > > >> > > > > >> >>>>> problem. But in theory this makes the overhead
>> > > > complexity
>> > > > >> > > O(number
>> > > > >> > > > > >> of
>> > > > >> > > > > >> >>>>> broker) and may be a concern in the future.
>> Ideally
>> > > we
>> > > > >> should
>> > > > >> > > avoid
>> > > > >> > > > > >> it.
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>>>
>> > > > >> > > > > >> >>>>>>> 3) Less efficient use of physical resource on
>> the
>> > > > >> machine.
>> > > > >> > > The
>> > > > >> > > > > >> number
>> > > > >> > > > > >> >>>>> of
>> > > > >> > > > > >> >>>>>>> socket on each machine will increase by 10X.
>> The
>> > > > >> number of
>> > > > >> > > > > >> connection
>> > > > >> > > > > >> >>>>>>> between any two machine will increase by 100X.
>> > > > >> > > > > >> >>>>>>>
>> > > > >> > > > > >> >>>>>>> 4) Less efficient way to management memory and
>> > > quota.
>> > > > >> > > > > >> >>>>>>>
>> > > > >> > > > > >> >>>>>>> 5) Rebalance between disks/brokers on the same
>> > > > machine
>> > > > >> will
>> > > > >> > > less
>> > > > >> > > > > >> >>>>>>> efficient
>> > > > >> > > > > >> >>>>>>> and less flexible. Broker has to read data
>> from
>> > > > another
>> > > > >> > > broker on
>> > > > >> > > > > >> the
>> > > > >> > > > > >> >>>>>>> same
>> > > > >> > > > > >> >>>>>>> machine via socket. It is also harder to do
>> > > automatic
>> > > > >> load
>> > > > >> > > balance
>> > > > >> > > > > >> >>>>>>> between
>> > > > >> > > > > >> >>>>>>> disks on the same machine in the future.
>> > > > >> > > > > >> >>>>>>>
>> > > > >> > > > > >> >>>>>>> I will put this and the explanation in the
>> > rejected
>> > > > >> > > alternative
>> > > > >> > > > > >> >>>>> section.
>> > > > >> > > > > >> >>>>>>> I
>> > > > >> > > > > >> >>>>>>> have a few questions:
>> > > > >> > > > > >> >>>>>>>
>> > > > >> > > > > >> >>>>>>> - Can you explain why this solution can help
>> > avoid
>> > > > >> > > scalability
>> > > > >> > > > > >> >>>>>>> bottleneck?
>> > > > >> > > > > >> >>>>>>> I actually think it will exacerbate the
>> > scalability
>> > > > >> problem
>> > > > >> > > due
>> > > > >> > > > > >> the
>> > > > >> > > > > >> >>>> 2)
>> > > > >> > > > > >> >>>>>>> above.
>> > > > >> > > > > >> >>>>>>> - Why can we push more RPC with this solution?
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>> To really answer this question we'd have to
>> take a
>> > > > deep
>> > > > >> dive
>> > > > >> > > into
>> > > > >> > > > > >> the
>> > > > >> > > > > >> >>>>>> locking of the broker and figure out how
>> > effectively
>> > > > it
>> > > > >> can
>> > > > >> > > > > >> >> parallelize
>> > > > >> > > > > >> >>>>>> truly independent requests.  Almost every
>> > > > multithreaded
>> > > > >> > > process is
>> > > > >> > > > > >> >>>> going
>> > > > >> > > > > >> >>>>>> to have shared state, like shared queues or
>> shared
>> > > > >> sockets,
>> > > > >> > > that is
>> > > > >> > > > > >> >>>>>> going to make scaling less than linear when you
>> > add
>> > > > >> disks or
>> > > > >> > > > > >> >>>> processors.
>> > > > >> > > > > >> >>>>>> (And clearly, another option is to improve that
>> > > > >> scalability,
>> > > > >> > > rather
>> > > > >> > > > > >> >>>>>> than going multi-process!)
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>> Yeah I also think it is better to improve
>> > scalability
>> > > > >> inside
>> > > > >> > > kafka
>> > > > >> > > > > >> code
>> > > > >> > > > > >> >>>> if
>> > > > >> > > > > >> >>>>> possible. I am not sure we currently have any
>> > > > scalability
>> > > > >> > > issue
>> > > > >> > > > > >> inside
>> > > > >> > > > > >> >>>>> Kafka that can not be removed without using
>> > > > >> multi-process.
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>>> - It is true that a garbage collection in one
>> > > broker
>> > > > >> would
>> > > > >> > > not
>> > > > >> > > > > >> affect
>> > > > >> > > > > >> >>>>>>> others. But that is after every broker only
>> uses
>> > > 1/10
>> > > > >> of the
>> > > > >> > > > > >> memory.
>> > > > >> > > > > >> >>>>> Can
>> > > > >> > > > > >> >>>>>>> we be sure that this will actually help
>> > > performance?
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>> The big question is, how much memory do Kafka
>> > > brokers
>> > > > >> use
>> > > > >> > > now, and
>> > > > >> > > > > >> how
>> > > > >> > > > > >> >>>>>> much will they use in the future?  Our
>> experience
>> > in
>> > > > >> HDFS
>> > > > >> > > was that
>> > > > >> > > > > >> >> once
>> > > > >> > > > > >> >>>>>> you start getting more than 100-200GB Java heap
>> > > sizes,
>> > > > >> full
>> > > > >> > > GCs
>> > > > >> > > > > >> start
>> > > > >> > > > > >> >>>>>> taking minutes to finish when using the
>> standard
>> > > JVMs.
>> > > > >> That
>> > > > >> > > alone
>> > > > >> > > > > >> is
>> > > > >> > > > > >> >> a
>> > > > >> > > > > >> >>>>>> good reason to go multi-process or consider
>> > storing
>> > > > more
>> > > > >> > > things off
>> > > > >> > > > > >> >> the
>> > > > >> > > > > >> >>>>>> Java heap.
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>> I see. Now I agree one-broker-per-disk should be
>> > more
>> > > > >> > > efficient in
>> > > > >> > > > > >> >> terms
>> > > > >> > > > > >> >>>> of
>> > > > >> > > > > >> >>>>> GC since each broker probably needs less than
>> 1/10
>> > of
>> > > > the
>> > > > >> > > memory
>> > > > >> > > > > >> >>>> available
>> > > > >> > > > > >> >>>>> on a typical machine nowadays. I will remove
>> this
>> > > from
>> > > > >> the
>> > > > >> > > reason of
>> > > > >> > > > > >> >>>>> rejection.
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>> Disk failure is the "easy" case.  The "hard"
>> case,
>> > > > >> which is
>> > > > >> > > > > >> >>>>>> unfortunately also the much more common case,
>> is
>> > > disk
>> > > > >> > > misbehavior.
>> > > > >> > > > > >> >>>>>> Towards the end of their lives, disks tend to
>> > start
>> > > > >> slowing
>> > > > >> > > down
>> > > > >> > > > > >> >>>>>> unpredictably.  Requests that would have
>> completed
>> > > > >> > > immediately
>> > > > >> > > > > >> before
>> > > > >> > > > > >> >>>>>> start taking 20, 100 500 milliseconds.  Some
>> files
>> > > may
>> > > > >> be
>> > > > >> > > readable
>> > > > >> > > > > >> and
>> > > > >> > > > > >> >>>>>> other files may not be.  System calls hang,
>> > > sometimes
>> > > > >> > > forever, and
>> > > > >> > > > > >> the
>> > > > >> > > > > >> >>>>>> Java process can't abort them, because the
>> hang is
>> > > in
>> > > > >> the
>> > > > >> > > kernel.
>> > > > >> > > > > >> It
>> > > > >> > > > > >> >>>> is
>> > > > >> > > > > >> >>>>>> not fun when threads are stuck in "D state"
>> > > > >> > > > > >> >>>>>> http://stackoverflow.com/quest
>> > > > >> ions/20423521/process-perminan
>> > > > >> > > > > >> >>>>>> tly-stuck-on-d-state
>> > > > >> > > > > >> >>>>>> .  Even kill -9 cannot abort the thread then.
>> > > > >> Fortunately,
>> > > > >> > > this is
>> > > > >> > > > > >> >>>>>> rare.
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>> I agree it is a harder problem and it is rare.
>> We
>> > > > >> probably
>> > > > >> > > don't
>> > > > >> > > > > >> have
>> > > > >> > > > > >> >> to
>> > > > >> > > > > >> >>>>> worry about it in this KIP since this issue is
>> > > > >> orthogonal to
>> > > > >> > > > > >> whether or
>> > > > >> > > > > >> >>>> not
>> > > > >> > > > > >> >>>>> we use JBOD.
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>> Another approach we should consider is for
>> Kafka
>> > to
>> > > > >> > > implement its
>> > > > >> > > > > >> own
>> > > > >> > > > > >> >>>>>> storage layer that would stripe across multiple
>> > > disks.
>> > > > >> This
>> > > > >> > > > > >> wouldn't
>> > > > >> > > > > >> >>>>>> have to be done at the block level, but could
>> be
>> > > done
>> > > > >> at the
>> > > > >> > > file
>> > > > >> > > > > >> >>>> level.
>> > > > >> > > > > >> >>>>>> We could use consistent hashing to determine
>> which
>> > > > >> disks a
>> > > > >> > > file
>> > > > >> > > > > >> should
>> > > > >> > > > > >> >>>>>> end up on, for example.
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>> Are you suggesting that we should distribute
>> log,
>> > or
>> > > > log
>> > > > >> > > segment,
>> > > > >> > > > > >> >> across
>> > > > >> > > > > >> >>>>> disks of brokers? I am not sure if I fully
>> > understand
>> > > > >> this
>> > > > >> > > > > >> approach. My
>> > > > >> > > > > >> >>>> gut
>> > > > >> > > > > >> >>>>> feel is that this would be a drastic solution
>> that
>> > > > would
>> > > > >> > > require
>> > > > >> > > > > >> >>>>> non-trivial design. While this may be useful to
>> > > Kafka,
>> > > > I
>> > > > >> would
>> > > > >> > > > > >> prefer
>> > > > >> > > > > >> >> not
>> > > > >> > > > > >> >>>>> to discuss this in detail in this thread unless
>> you
>> > > > >> believe
>> > > > >> > > it is
>> > > > >> > > > > >> >>>> strictly
>> > > > >> > > > > >> >>>>> superior to the design in this KIP in terms of
>> > > solving
>> > > > >> our
>> > > > >> > > use-case.
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>>> best,
>> > > > >> > > > > >> >>>>>> Colin
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>>>
>> > > > >> > > > > >> >>>>>>> Thanks,
>> > > > >> > > > > >> >>>>>>> Dong
>> > > > >> > > > > >> >>>>>>>
>> > > > >> > > > > >> >>>>>>> On Wed, Jan 25, 2017 at 11:34 AM, Colin
>> McCabe <
>> > > > >> > > > > >> cmcc...@apache.org>
>> > > > >> > > > > >> >>>>>>> wrote:
>> > > > >> > > > > >> >>>>>>>
>> > > > >> > > > > >> >>>>>>>> Hi Dong,
>> > > > >> > > > > >> >>>>>>>>
>> > > > >> > > > > >> >>>>>>>> Thanks for the writeup!  It's very
>> interesting.
>> > > > >> > > > > >> >>>>>>>>
>> > > > >> > > > > >> >>>>>>>> I apologize in advance if this has been
>> > discussed
>> > > > >> > > somewhere else.
>> > > > >> > > > > >> >>>>> But
>> > > > >> > > > > >> >>>>>> I
>> > > > >> > > > > >> >>>>>>>> am curious if you have considered the
>> solution
>> > of
>> > > > >> running
>> > > > >> > > > > >> multiple
>> > > > >> > > > > >> >>>>>>>> brokers per node.  Clearly there is a memory
>> > > > overhead
>> > > > >> with
>> > > > >> > > this
>> > > > >> > > > > >> >>>>>> solution
>> > > > >> > > > > >> >>>>>>>> because of the fixed cost of starting
>> multiple
>> > > JVMs.
>> > > > >> > > However,
>> > > > >> > > > > >> >>>>> running
>> > > > >> > > > > >> >>>>>>>> multiple JVMs would help avoid scalability
>> > > > >> bottlenecks.
>> > > > >> > > You
>> > > > >> > > > > >> could
>> > > > >> > > > > >> >>>>>>>> probably push more RPCs per second, for
>> example.
>> > > A
>> > > > >> garbage
>> > > > >> > > > > >> >>>>> collection
>> > > > >> > > > > >> >>>>>>>> in one broker would not affect the others.
>> It
>> > > would
>> > > > >> be
>> > > > >> > > > > >> interesting
>> > > > >> > > > > >> >>>>> to
>> > > > >> > > > > >> >>>>>>>> see this considered in the "alternate
>> designs"
>> > > > design,
>> > > > >> > > even if
>> > > > >> > > > > >> you
>> > > > >> > > > > >> >>>>> end
>> > > > >> > > > > >> >>>>>>>> up deciding it's not the way to go.
>> > > > >> > > > > >> >>>>>>>>
>> > > > >> > > > > >> >>>>>>>> best,
>> > > > >> > > > > >> >>>>>>>> Colin
>> > > > >> > > > > >> >>>>>>>>
>> > > > >> > > > > >> >>>>>>>>
>> > > > >> > > > > >> >>>>>>>> On Thu, Jan 12, 2017, at 10:46, Dong Lin
>> wrote:
>> > > > >> > > > > >> >>>>>>>>> Hi all,
>> > > > >> > > > > >> >>>>>>>>>
>> > > > >> > > > > >> >>>>>>>>> We created KIP-112: Handle disk failure for
>> > JBOD.
>> > > > >> Please
>> > > > >> > > find
>> > > > >> > > > > >> the
>> > > > >> > > > > >> >>>>> KIP
>> > > > >> > > > > >> >>>>>>>>> wiki
>> > > > >> > > > > >> >>>>>>>>> in the link https://cwiki.apache.org/confl
>> > > > >> > > > > >> >>>> uence/display/KAFKA/KIP-
>> > > > >> > > > > >> >>>>>>>>> 112%3A+Handle+disk+failure+for+JBOD.
>> > > > >> > > > > >> >>>>>>>>>
>> > > > >> > > > > >> >>>>>>>>> This KIP is related to KIP-113
>> > > > >> > > > > >> >>>>>>>>> <https://cwiki.apache.org/conf
>> > > > >> luence/display/KAFKA/KIP-
>> > > > >> > > > > >> >>>>>>>> 113%3A+Support+replicas+moveme
>> > > > >> nt+between+log+directories>:
>> > > > >> > > > > >> >>>>>>>>> Support replicas movement between log
>> > > directories.
>> > > > >> They
>> > > > >> > > are
>> > > > >> > > > > >> >>>> needed
>> > > > >> > > > > >> >>>>> in
>> > > > >> > > > > >> >>>>>>>>> order
>> > > > >> > > > > >> >>>>>>>>> to support JBOD in Kafka. Please help review
>> > the
>> > > > >> KIP. You
>> > > > >> > > > > >> >>>> feedback
>> > > > >> > > > > >> >>>>> is
>> > > > >> > > > > >> >>>>>>>>> appreciated!
>> > > > >> > > > > >> >>>>>>>>>
>> > > > >> > > > > >> >>>>>>>>> Thanks,
>> > > > >> > > > > >> >>>>>>>>> Dong
>> > > > >> > > > > >> >>>>>>>>
>> > > > >> > > > > >> >>>>>>
>> > > > >> > > > > >> >>>>>
>> > > > >> > > > > >> >>>>
>> > > > >> > > > > >> >>
>> > > > >> > > > > >> >>
>> > > > >> > > > > >>
>> > > > >> > > > > >>
>> > > > >> > > > > >
>> > > > >> > >
>> > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Reply via email to