Hey Jun,

Sure, here is my explanation.

Design B would not work if it doesn't store created replicas in ZK. For
example, say broker B is healthy when it is shut down. At that moment no
offline replica is written to ZK for this broker. Suppose a log directory is
damaged while the broker is offline. Then when the broker starts, it won't
know which replicas are in the bad log directory, and it won't be able to
specify those offline replicas in /failed-log-directory either.
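
To make the dependency concrete, here is a rough sketch (illustrative
names only, not actual broker code) of the startup derivation that
design B would need. Without a persisted set of created replicas, the
difference below simply cannot be computed:

  // Sketch only; TopicPartition and both input sets are illustrative.
  case class TopicPartition(topic: String, partition: Int)

  // createdReplicas: would have to be read from ZK at startup.
  // replicasOnGoodDirs: found by scanning the readable log directories.
  def deriveOfflineReplicas(createdReplicas: Set[TopicPartition],
                            replicasOnGoodDirs: Set[TopicPartition]): Set[TopicPartition] =
    createdReplicas -- replicasOnGoodDirs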

Now suppose design B does store created replicas in ZK. The next problem is
that, if multiple log directories are damaged while the broker is offline,
then when the broker starts it won't know the exact list of offline replicas
on each bad log directory. All it knows is the union of offline replicas
across all those bad log directories. Thus it is impossible for the broker
to specify offline replicas per log directory in this scenario.
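
Continuing the sketch above (all values made up), with two unreadable
log directories the broker can only compute the union of offline
replicas, so the per-directory znodes that design B needs cannot be
populated:

  // Sketch only. Suppose dir1 and dir2 are both unreadable at startup.
  val createdReplicas = Set(TopicPartition("t", 0), TopicPartition("t", 1), TopicPartition("t", 2))
  val replicasOnGoodDirs = Set(TopicPartition("t", 2))
  val offlineUnion = deriveOfflineReplicas(createdReplicas, replicasOnGoodDirs)
  // offlineUnion == Set(t-0, t-1), but whether t-0 was on dir1 or dir2 is
  // unknown, so neither of the hypothetical znodes
  //   /brokers/ids/[brokerId]/failed-log-directory/dir1
  //   /brokers/ids/[brokerId]/failed-log-directory/dir2
  // can be written with the correct per-directory list.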

I agree with your observation that, if the admin replaces dir1 with a good
empty disk but leaves dir2 untouched, design A won't re-create the replica
whereas design B can. But I am not sure that is a scenario we want to
optimize for. In practice it seems reasonable for the admin to fix both log
directories. If the admin fixes only one of the two log directories, we can
call it a partial fix, and Kafka won't re-create any offline replicas on
dir1 or dir2. Similar to the extra round of LeaderAndIsrRequest in case of
log directory failure, I think this is a pretty minor issue for design A
compared to design B.

Thanks,
Dong


On Thu, Feb 23, 2017 at 6:46 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Dong,
>
> My replies are inlined below.
>
> On Thu, Feb 23, 2017 at 4:47 PM, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hey Jun,
> >
> > Thanks for your reply! Let me first comment on the things that you listed
> > as advantages of B over A.
> >
> > 1) No change in LeaderAndIsrRequest protocol.
> >
> > I agree with this.
> >
> > 2) Step 1. One less round of LeaderAndIsrRequest and no additional ZK
> > writes to record the created flag.
> >
> > I don't think this is true. There will be one round of LeaderAndIsrRequest
> > in both A and B. In design A the controller needs to write to ZK once to
> > record this replica as created. In design B the broker needs to write to
> > zookeeper once to record this replica as created. So there is the same
> > number of LeaderAndIsrRequests and ZK writes.
> >
> > The broker needs to record created replicas in design B so that, when it
> > bootstraps with a failed log directory, the broker can derive the offline
> > replicas as the difference between created replicas and replicas found on
> > good log directories.
> >
> >
> Design B actually doesn't write created replicas in ZK. When a broker
> starts up, all offline replicas are stored in the /failed-log-directory
> path in ZK. So if a replica is not there and is not in the live log
> directories either, it's never created. Does this work?
>
>
>
> > 3) Step 2. One less round of LeaderAndIsrRequest and no additional logic
> to
> > handle LeaderAndIsrResponse.
> >
> > While I agree there is one less round of LeaderAndIsrRequest in design
> B, I
> > don't think one additional LeaderAndIsrRequest to handle log directory
> > failure is a big deal given that it doesn't happen frequently.
> >
> > Also, while there is no additional logic to handle LeaderAndIsrResponse
> in
> > design B, I actually think this is something that controller should do
> > anyway. Say the broker stops responding to any requests without removing
> > itself from zookeeper; the only way for the controller to realize this and
> > re-elect the leader is to send a request to this broker and handle the
> > response. The problem is that we don't do this as of now.
> >
> > 4) Step 6. Additional ZK reads proportional to # of failed log
> directories,
> > instead of # of partitions.
> >
> > If one znode is able to describe all topic partitions in a log directory,
> > then the existing znode /brokers/topics/[topic] should be able to
> describe
> > created replicas in addition to the assigned replicas for every partition
> > of the topic. In this case, design A requires no additional ZK reads
> > whereas design B requires ZK reads proportional to # of failed log directories.
> >
> > 5) Step 3. In design A, if a broker is restarted and the failed log
> > directory is unreadable, the broker doesn't know which replicas are on
> the
> > failed log directory. So, when the broker receives the LeaderAndIsrRequest
> > with created = false, it's a bit hard for the broker to decide whether it
> > should create the missing replica on other log directories. This is
> easier
> > in design B since the list of failed replicas is persisted in ZK.
> >
> > I don't understand why it is hard for broker to make decision in design
> A.
> > With design A, if a broker is started with a failed log directory and it
> > receives LeaderAndIsrRequest with created=false for a replica that can
> not
> > be found on any good log directory, broker will not create this replica.
> Is
> > there any drawback with this approach?
> >
> >
> >
> I was thinking about this case. Suppose two log directories dir1 and dir2
> fail. The admin replaces dir1 with an empty new disk. The broker is
> restarted with dir1 alive, and dir2 still failing. Now, when receiving a
> LeaderAndIsrRequest including a replica that was previously in dir1, the
> broker won't be able to create that replica when it could.
>
>
>
> > Here is my summary of pros and cons of design B as compared to design A.
> >
> > pros:
> >
> > 1) No change to LeaderAndIsrRequest.
> > 2) One less round of LeaderAndIsrRequest in case of log directory
> failure.
> >
> > cons:
> >
> > 1) It is impossible for the broker to figure out the log directory of
> offline
> > replicas for failed-log-directory/[directory] if multiple log
> directories
> > are unreadable when broker starts.
> >
> >
> Hmm, I am not sure that I get this point. If multiple log directories fail,
> design B stores each directory under /failed-log-directory, right?
>
> Thanks,
>
> Jun
>
>
>
> > 2) The znode size limit of failed-log-directory/[directory] essentially
> > limits the number of topic partitions that can exist on a log directory.
> It
> > becomes more of a problem when a broker is configured to use multiple log
> > directories each of which is a RAID-10 of large capacity. While this may
> > not be a problem in practice with additional requirement (e.g. don't use
> > more than one log directory if using RAID-10), ideally we want to avoid
> > such limit.
> >
> > 3) Extra ZK read of failed-log-directory/[directory] when broker starts
> >
> >
> > My main concern with the design B is the use of znode
> > /brokers/ids/[brokerId]/failed-log-directory/[directory]. I don't really
> > think other pros/cons of design B matter to us. Does my summary make
> sense?
> >
> > Thanks,
> > Dong
> >
> >
> > On Thu, Feb 23, 2017 at 2:20 PM, Jun Rao <j...@confluent.io> wrote:
> >
> > > Hi, Dong,
> > >
> > > Just so that we are on the same page. Let me spec out the alternative
> > > design a bit more and then compare. Let's call the current design A and
> > the
> > > alternative design B.
> > >
> > > Design B:
> > >
> > > New ZK path
> > > failed log directory path (persistent): This is created by a broker
> when
> > a
> > > log directory fails and is potentially removed when the broker is
> > > restarted.
> > > /brokers/ids/[brokerId]/failed-log-directory/directory1 => { json of
> the
> > > replicas in the log directory }.
> > >
> > > *1. Topic gets created*
> > > - Works the same as before.
> > >
> > > *2. A log directory stops working on a broker during runtime*
> > >
> > > - The controller watches the path /failed-log-directory for the new
> > znode.
> > >
> > > - The broker detects an offline log directory during runtime and marks
> > > affected replicas as offline in memory.
> > >
> > > - The broker writes the failed directory and all replicas in the failed
> > > directory under /failed-log-directory/directory1.
> > >
> > > - The controller reads /failed-log-directory/directory1 and stores in
> > > memory a list of failed replicas due to disk failures.
> > >
> > > - The controller moves those replicas due to disk failure to offline
> > state
> > > and triggers the state change in replica state machine.
> > >
> > >
> > > *3. Broker is restarted*
> > >
> > > - The broker reads /brokers/ids/[brokerId]/failed-log-directory, if
> any.
> > >
> > > - For each failed log directory it reads from ZK, if the log directory
> > > exists in log.dirs and is accessible now, or if the log directory no
> > longer
> > > exists in log.dirs, remove that log directory from
> failed-log-directory.
> > > Otherwise, the broker loads replicas in the failed log directory in
> > memory
> > > as offline.
> > >
> > > - The controller handles the failed log directory change event, if
> needed
> > > (same as #2).
> > >
> > > - The controller handles the broker registration event.
> > >
> > >
> > > *6. Controller failover*
> > > - Controller reads all child paths under /failed-log-directory to
> rebuild
> > > the list of failed replicas due to disk failures. Those replicas will
> be
> > > transitioned to the offline state during controller initialization.
> > >
> > > Comparing this with design A, I think the following are the things that
> > > design B simplifies.
> > > * No change in LeaderAndIsrRequest protocol.
> > > * Step 1. One less round of LeaderAndIsrRequest and no additional ZK
> > writes
> > > to record the created flag.
> > > * Step 2. One less round of LeaderAndIsrRequest and no additional logic
> > to
> > > handle LeaderAndIsrResponse.
> > > * Step 6. Additional ZK reads proportional to # of failed log
> > directories,
> > > instead of # of partitions.
> > > * Step 3. In design A, if a broker is restarted and the failed log
> > > directory is unreadable, the broker doesn't know which replicas are on
> > the
> > > failed log directory. So, when the broker receives the LeaderAndIsrRequest
> > > with created = false, it's a bit hard for the broker to decide whether it
> > > should create the missing replica on other log directories. This is
> > easier
> > > in design B since the list of failed replicas is persisted in ZK.
> > >
> > > Now, for some of the other things that you mentioned.
> > >
> > > * What happens if a log directory is renamed?
> > > I think this can be handled in the same way as non-existing log
> directory
> > > during broker restart.
> > >
> > > * What happens if replicas are moved manually across disks?
> > > Good point. Well, if all log directories are available, the failed log
> > > directory path will be cleared. In the rarer case that a log directory
> is
> > > still offline and one of the replicas registered in the failed log
> > > directory shows up in another available log directory, I am not quite
> > sure.
> > > Perhaps the simplest approach is to just error out and let the admin
> fix
> > > things manually?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > >
> > > On Wed, Feb 22, 2017 at 3:39 PM, Dong Lin <lindon...@gmail.com> wrote:
> > >
> > > > Hey Jun,
> > > >
> > > > Thanks much for the explanation. I have some questions about 21 but
> > that
> > > is
> > > > less important than 20. 20 would require considerable change to the
> KIP
> > > and
> > > > probably requires weeks to discuss again. Thus I would like to be
> very
> > > sure
> > > > that we agree on the problems with the current design as you
> mentioned
> > > and
> > > > there is no foreseeable problem with the alternate design.
> > > >
> > > > Please see my detailed response below. To summarize my points, I
> couldn't
> > > > figure out any non-trivial drawback of the current design as compared
> to
> > > the
> > > > alternative design; and I couldn't figure out a good way to store
> > offline
> > > > replicas in the alternative design. Can you see if these points make
> > > sense?
> > > > Thanks in advance for your time!!
> > > >
> > > >
> > > > 1) The alternative design requires slightly more dependency on ZK.
> > While
> > > > both solutions store created replicas in the ZK, the alternative
> design
> > > > would also store offline replicas in ZK but the current design
> doesn't.
> > > > Thus
> > > >
> > > > 2) I am not sure that we should store offline replicas in znode
> > > > /brokers/ids/[brokerId]/failed-log-directory/[directory]. We
> probably
> > > > don't
> > > > want to expose log directory path in zookeeper based on the concept
> > that
> > > we
> > > > should only store logical information (e.g. topic, brokerId) in
> > > zookeeper's
> > > > path name. More specifically, we probably don't want to rename path
> in
> > > > zookeeper simply because user renamed a log director. And we probably
> > > don't
> > > > want to read/write these znode just because user manually moved
> > replicas
> > > > between log directories.
> > > >
> > > > 3) I couldn't find a good way to store offline replicas in ZK in the
> > > > alternative design. We can store this information one znode
> per-topic,
> > > > per-brokerId, or per-brokerId-topic. All these choices have their own
> > > > problems. If we store it in per-topic znode then multiple brokers may
> > > need
> > > > to read/write offline replicas in the same znode which is generally
> > bad.
> > > If
> > > > we store it per-brokerId then we effectively limit the maximum number
> > of
> > > > topic-partition that can be stored on a broker by the znode size
> limit.
> > > > This contradicts the idea to expand the single broker capacity by
> > > throwing
> > > > in more disks. If we store it per-brokerId-topic, then when
> controller
> > > > starts, it needs to read number of brokerId*topic znodes which may
> > double
> > > > the overall znode reads during controller startup.
> > > >
> > > > 4) The alternative design is less efficient than the current design
> in
> > > case
> > > > of log directory failure. The alternative design requires extra znode
> > > reads
> > > > in order to read offline replicas from zk while the current design
> > > requires
> > > > only one pair of LeaderAndIsrRequest and LeaderAndIsrResponse. The
> > extra
> > > > znode reads will be proportional to the number of topics on the
> broker
> > if
> > > > we store offline replicas per-brokerId-topic.
> > > >
> > > > 5) While I agree that the failure reporting should be done where the
> > > > failure is originated, I think the current design is consistent with
> > what
> > > > we are already doing. With the current design, broker will send
> > > > notification via zookeeper and controller will send
> LeaderAndIsrRequest
> > > to
> > > > broker. This is similar to how broker sends isr change notification
> and
> > > > controller read latest isr from broker. If we do want broker to
> report
> > > > failure directly to controller, we should probably have broker send
> RPC
> > > > directly to controller as it sends ControllerShutdownRequest. I can
> do
> > > this
> > > > as well.
> > > >
> > > > 6) I don't think the current design requires additional state
> > management
> > > in
> > > > each of the existing state handling such as topic creation or
> > controller
> > > > failover. All these existing logic should stay exactly the same
> except
> > > that
> > > > the controller should recognize offline replicas on the live broker
> > > instead
> > > > of assuming all replicas on live brokers are live. But this
> additional
> > > > change is required in both the current design and the alternate
> design.
> > > > Thus there should be no difference between current design and the
> > > alternate
> > > > design with respect to these existing state handling logic in
> > controller.
> > > >
> > > > 7) While I agree that the current design requires additional
> complexity
> > > in
> > > > the controller in order to handle LeaderAndIsrResponse and
> potentially
> > > > change partition and replica state to offline in the state machines, I
> > > think
> > > > such logic is necessary in a well-designed controller either with the
> > > > alternate design or even without JBOD. Controller should be able to
> > > handle
> > > > error (e.g. ClusterAuthorizationException) in LeaderAndIsrResponse
> and
> > > > every responses in general. For example, if the controller hasn't
> > > received
> > > > LeaderAndIsrResponse after a given period of time, it probably means
> > the
> > > > broker has hung and the controller should consider this broker as
> > offline
> > > > and re-elect leader from other brokers. This would actually fix some
> > > > problem we have seen before at LinkedIn, where broker hangs due to
> > > > RAID-controller failure. In other words, I think it is a good idea
> for
> > > > controller to handle response.
> > > >
> > > > 8) I am not sure that the additional state management to handle
> > > > LeaderAndIsrResponse causes new types of synchronization. It is true
> > that
> > > > the logic is not handled by the ZK event handling
> > > > thread. But the existing ControllerShutdownRequest is also not
> handled
> > by
> > > > ZK event handling thread. The LeaderAndIsrResponse can be handled by
> the
> > > > same thread that is currently handling ControllerShutdownRequest so
> > that
> > > we
> > > > don't require new type of synchronization. Further, it should be rare
> > to
> > > > require additional synchronization to handle LeaderAndIsrResponse
> > because
> > > we
> > > > only need synchronization when there are offline replicas.
> > > >
> > > >
> > > > Thanks,
> > > > Dong
> > > >
> > > > On Wed, Feb 22, 2017 at 10:36 AM, Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Hi, Dong, Jiangjie,
> > > > >
> > > > > 20. (1) I agree that ideally we'd like to use direct RPC for
> > > > > broker-to-broker communication instead of ZK. However, in the
> > > alternative
> > > > > design, the failed log directory path also serves as the persistent
> > > state
> > > > > for remembering the offline partitions. This is similar to the
> > > > > controller_managed_state path in your design. The difference is
> that
> > > the
> > > > > alternative design stores the state in fewer ZK paths, which helps
> > > reduce
> > > > > the controller failover time. (2) I agree that we want the
> controller
> > > to
> > > > be
> > > > > the single place to make decisions. However, intuitively, the
> failure
> > > > > reporting should be done where the failure is originated. For
> > example,
> > > > if a
> > > > > broker fails, the broker reports failure by de-registering from ZK.
> > The
> > > > > failed log directory path is similar in that regard. (3) I am not
> > > worried
> > > > > about the additional load from extra LeaderAndIsrRequest. What I
> > worry
> > > > > about is any unnecessary additional complexity in the controller.
> To
> > > me,
> > > > > the additional complexity in the current design is the additional
> > state
> > > > > management in each of the existing state handling (e.g., topic
> > > creation,
> > > > > controller failover, etc), and the additional synchronization since
> > the
> > > > > additional state management is not initiated from the ZK event
> > handling
> > > > > thread.
> > > > >
> > > > > 21. One of the reasons that we need to send a StopReplicaRequest to
> > > > offline
> > > > > replica is to handle controlled shutdown. In that case, a broker is
> > > still
> > > > > alive, but indicates to the controller that it plans to shut down.
> > > Being
> > > > > able to stop the replica in the shutting down broker reduces churns
> > in
> > > > ISR.
> > > > > So, for simplicity, it's probably easier to always send a
> > > > > StopReplicaRequest
> > > > > to any offline replica.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > > On Tue, Feb 21, 2017 at 2:37 PM, Dong Lin <lindon...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hey Jun,
> > > > > >
> > > > > > Thanks much for your comments.
> > > > > >
> > > > > > I actually proposed the design to store both offline replicas and
> > > > created
> > > > > > replicas in per-broker znode before switching to the design in
> the
> > > > > current
> > > > > > KIP. The current design stores created replicas in per-partition
> > > znode
> > > > > and
> > > > > > transmits offline replicas via LeaderAndIsrResponse. The original
> > > > > solution
> > > > > > is roughly the same as what you suggested. The advantage of the
> > > current
> > > > > > solution is kind of philosophical: 1) we want to transmit data
> > (e.g.
> > > > > > offline replicas) using RPC and reduce dependency on zookeeper;
> 2)
> > we
> > > > > want
> > > > > > controller to be the only one that determines any state (e.g.
> > offline
> > > > > > replicas) that will be exposed to user. The advantage of the
> > solution
> > > > to
> > > > > > store offline replica in zookeeper is that we can save one
> > roundtrip
> > > > time
> > > > > > for controller to handle log directory failure. However, this
> extra
> > > > > > roundtrip time should not be a big deal since the log directory
> > > failure
> > > > > is
> > > > > > rare and inefficiency of extra latency is less of a problem when
> > > there
> > > > is
> > > > > > log directory failure.
> > > > > >
> > > > > > Do you think the two philosophical advantages of the current KIP
> > make
> > > > > > sense? If not, then I can switch to the original design that
> stores
> > > > > offline
> > > > > > replicas in zookeeper. It is actually written already. One
> > > disadvantage
> > > > > is
> > > > > > that we have to make non-trivial changes to the KIP (e.g. no create
> > flag
> > > in
> > > > > > LeaderAndIsrRequest and no created flag in zookeeper) and restart
> this
> > > KIP
> > > > > > discussion.
> > > > > >
> > > > > > Regarding 21, it seems to me that LeaderAndIsrRequest/
> > > > StopReplicaRequest
> > > > > > only makes sense when broker can make the choice (e.g. fetch data
> > for
> > > > > this
> > > > > > replica or not). In the case that the log directory of the
> replica
> > is
> > > > > > already offline, the broker has to stop fetching data for this
> replica
> > > > > > regardless of what controller tells it to do. Thus it seems
> cleaner
> > > for
> > > > > > broker to stop fetch data for this replica immediately. The
> > advantage
> > > > of
> > > > > > this solution is that the controller logic is simpler since it
> > > doesn't
> > > > > need
> > > > > > to send StopReplicaRequest in case of log directory failure, and
> > the
> > > > > log4j
> > > > > > log is also cleaner. Is there a specific advantage of having the
> > > > > > controller tell the broker to stop fetching data for offline replicas?
> > > > > >
> > > > > > Regarding 22, I agree with your observation that it will happen.
> I
> > > will
> > > > > > update the KIP and specify that the broker will exit with a proper
> error
> > > > > message
> > > > > > in the log and user needs to manually remove partitions and
> restart
> > > the
> > > > > > broker.
> > > > > >
> > > > > > Thanks!
> > > > > > Dong
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Feb 20, 2017 at 10:17 PM, Jun Rao <j...@confluent.io>
> > wrote:
> > > > > >
> > > > > > > Hi, Dong,
> > > > > > >
> > > > > > > Sorry for the delay. A few more comments.
> > > > > > >
> > > > > > > 20. One complexity that I found in the current KIP is that the
> > way
> > > > the
> > > > > > > broker communicates failed replicas to the controller is
> > > inefficient.
> > > > > > When
> > > > > > > a log directory fails, the broker only sends an indication
> > through
> > > ZK
> > > > > to
> > > > > > > the controller and the controller has to issue a
> > > LeaderAndIsrRequest
> > > > to
> > > > > > > discover which replicas are offline due to log directory
> failure.
> > > An
> > > > > > > alternative approach is that when a log directory fails, the
> > broker
> > > > > just
> > > > > > > writes the failed directory and the corresponding topic
> > > > partitions
> > > > > > in a
> > > > > > > new failed log directory ZK path like the following.
> > > > > > >
> > > > > > > Failed log directory path:
> > > > > > > /brokers/ids/[brokerId]/failed-log-directory/directory1 => {
> > json
> > > of
> > > > > the
> > > > > > > topic partitions in the log directory }.
> > > > > > >
> > > > > > > The controller just watches for child changes in
> > > > > > > /brokers/ids/[brokerId]/failed-log-directory.
> > > > > > > After reading this path, the controller knows the exact set of
> > replicas
> > > > > that
> > > > > > > are offline and can trigger that replica state change
> > accordingly.
> > > > This
> > > > > > > saves an extra round of LeaderAndIsrRequest handling.
> > > > > > >
> > > > > > > With this new ZK path, we can probably get rid of
> > > > > > > /broker/topics/[topic]/partitions/[partitionId]/controller_managed_state. The
> creation
> > > of a
> > > > > new
> > > > > > > replica is expected to always succeed unless all log
> directories
> > > > fail,
> > > > > in
> > > > > > > which case, the broker goes down anyway. Then, during
> controller
> > > > > > failover,
> > > > > > > the controller just needs to additionally read from ZK the
> extra
> > > > failed
> > > > > > log
> > > > > > > directory paths, which is many fewer than topics or partitions.
> > > > > > >
> > > > > > > On broker startup, if a log directory becomes available, the
> > > > > > corresponding
> > > > > > > log directory path in ZK will be removed.
> > > > > > >
> > > > > > > The downside of this approach is that the value of this new ZK
> > path
> > > > can
> > > > > > be
> > > > > > > large. However, even with 5K partition per log directory and
> 100
> > > > bytes
> > > > > > per
> > > > > > > partition, the size of the value is 500KB, still less than the
> > > > default
> > > > > > 1MB
> > > > > > > znode limit in ZK.
> > > > > > >
> > > > > > > 21. "Broker will remove offline replica from its replica
> fetcher
> > > > > > threads."
> > > > > > > The proposal lets the broker remove the replica from the
> replica
> > > > > fetcher
> > > > > > > thread when it detects a directory failure. An alternative is
> to
> > > only
> > > > > do
> > > > > > > that until the broker receives the LeaderAndIsrRequest/
> > > > > > StopReplicaRequest.
> > > > > > > The benefit of this is that the controller is the only one who
> > > > decides
> > > > > > > which replica to be removed from the replica fetcher threads.
> The
> > > > > broker
> > > > > > > also doesn't need additional logic to remove the replica from
> > > replica
> > > > > > > fetcher threads. The downside is that in a small window, the
> > > replica
> > > > > > fetch
> > > > > > > thread will keep writing to the failed log directory and may
> > > pollute
> > > > > the
> > > > > > > log4j log.
> > > > > > >
> > > > > > > 22. In the current design, there is a potential corner case
> issue
> > > > that
> > > > > > the
> > > > > > > same partition may exist in more than one log directory at some
> > > > point.
> > > > > > > Consider the following steps: (1) a new topic t1 is created and
> > the
> > > > > > > controller sends LeaderAndIsrRequest to a broker; (2) the
> broker
> > > > > creates
> > > > > > > partition t1-p1 in log dir1; (3) before the broker sends a
> > > response,
> > > > it
> > > > > > > goes down; (4) the broker is restarted with log dir1
> unreadable;
> > > (5)
> > > > > the
> > > > > > > broker receives a new LeaderAndIsrRequest and creates partition
> > > t1-p1
> > > > > on
> > > > > > > log dir2; (6) at some point, the broker is restarted with log
> > dir1
> > > > > fixed.
> > > > > > > Now partition t1-p1 is in two log dirs. The alternative
> approach
> > > > that I
> > > > > > > suggested above may suffer from a similar corner case issue.
> > Since
> > > > this
> > > > > > is
> > > > > > > rare, if the broker detects this during broker startup, it can
> > > > probably
> > > > > > > just log an error and exit. The admin can remove the redundant
> > > > > partitions
> > > > > > > manually and then restart the broker.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Sat, Feb 18, 2017 at 9:31 PM, Dong Lin <lindon...@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Hey Jun,
> > > > > > > >
> > > > > > > > Could you please let me know if the solutions above could
> > address
> > > > > your
> > > > > > > > concern? I really want to move the discussion forward.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Dong
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Feb 14, 2017 at 8:17 PM, Dong Lin <
> lindon...@gmail.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Jun,
> > > > > > > > >
> > > > > > > > > Thanks for all your help and time to discuss this KIP. When
> > you
> > > > get
> > > > > > the
> > > > > > > > > time, could you let me know if the previous answers address
> > the
> > > > > > > concern?
> > > > > > > > >
> > > > > > > > > I think the more interesting question in your last email is
> > > where
> > > > > we
> > > > > > > > > should store the "created" flag in ZK. I proposed the
> > solution
> > > > > that I
> > > > > > > > like
> > > > > > > > > most, i.e. store it together with the replica assignment
> data
> > > in
> > > > > the
> > > > > > > > /brokers/topics/[topic].
> > > > > > > > > In order to expedite discussion, let me provide another two
> > > ideas
> > > > > to
> > > > > > > > > address the concern just in case the first idea doesn't
> work:
> > > > > > > > >
> > > > > > > > > - We can avoid extra controller ZK read when there is no
> disk
> > > > > failure
> > > > > > > > > (95% of time?). When controller starts, it doesn't
> > > > > > > > > read controller_managed_state in ZK and sends
> > > LeaderAndIsrRequest
> > > > > > with
> > > > > > > > > "create = false". Only if LeaderAndIsrResponse shows
> failure
> > > for
> > > > > any
> > > > > > > > > replica, then controller will read controller_managed_state
> > for
> > > > > this
> > > > > > > > > partition and re-send LeaderAndIsrRequest with
> "create=true"
> > if
> > > > > this
> > > > > > > > > replica has not been created.
> > > > > > > > >
> > > > > > > > > - We can significantly reduce this ZK read time by making
> > > > > > > > > controller_managed_state a topic level information in ZK,
> > e.g.
> > > > > > > > > /brokers/topics/[topic]/state. Given that most topic has
> 10+
> > > > > > partition,
> > > > > > > > > the extra ZK read time should be less than 10% of the
> > existing
> > > > > total
> > > > > > zk
> > > > > > > > > read time during controller failover.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Dong
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Feb 14, 2017 at 7:30 AM, Dong Lin <
> > lindon...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Hey Jun,
> > > > > > > > >>
> > > > > > > > >> I just realized that you may be suggesting that a tool for
> > > > listing
> > > > > > > > >> offline directories is necessary for KIP-112 by asking
> > whether
> > > > > > KIP-112
> > > > > > > > and
> > > > > > > > >> KIP-113 will be in the same release. I think such a tool
> is
> > > > useful
> > > > > > but
> > > > > > > > >> doesn't have to be included in KIP-112. This is because as
> > of
> > > > now
> > > > > > > admin
> > > > > > > > >> needs to log into broker machine and check broker log to
> > > figure
> > > > > out
> > > > > > > the
> > > > > > > > >> cause of broker failure and the bad log directory in case
> of
> > > > disk
> > > > > > > > failure.
> > > > > > > > >> The KIP-112 won't make it harder since admin can still
> > figure
> > > > out
> > > > > > the
> > > > > > > > bad
> > > > > > > > >> log directory by doing the same thing. Thus it is probably
> > OK
> > > to
> > > > > > just
> > > > > > > > >> include this script in KIP-113. Regardless, my hope is to
> > > finish
> > > > > > both
> > > > > > > > KIPs
> > > > > > > > >> ASAP and make them in the same release since both KIPs are
> > > > needed
> > > > > > for
> > > > > > > > the
> > > > > > > > >> JBOD setup.
> > > > > > > > >>
> > > > > > > > >> Thanks,
> > > > > > > > >> Dong
> > > > > > > > >>
> > > > > > > > >> On Mon, Feb 13, 2017 at 5:52 PM, Dong Lin <
> > > lindon...@gmail.com>
> > > > > > > wrote:
> > > > > > > > >>
> > > > > > > > >>> And the test plan has also been updated to simulate disk
> > > > failure
> > > > > by
> > > > > > > > >>> changing log directory permission to 000.
> > > > > > > > >>>
> > > > > > > > >>> On Mon, Feb 13, 2017 at 5:50 PM, Dong Lin <
> > > lindon...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > > >>>
> > > > > > > > >>>> Hi Jun,
> > > > > > > > >>>>
> > > > > > > > >>>> Thanks for the reply. These comments are very helpful.
> Let
> > > me
> > > > > > answer
> > > > > > > > >>>> them inline.
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>> On Mon, Feb 13, 2017 at 3:25 PM, Jun Rao <
> > j...@confluent.io>
> > > > > > wrote:
> > > > > > > > >>>>
> > > > > > > > >>>>> Hi, Dong,
> > > > > > > > >>>>>
> > > > > > > > >>>>> Thanks for the reply. A few more replies and new
> comments
> > > > > below.
> > > > > > > > >>>>>
> > > > > > > > >>>>> On Fri, Feb 10, 2017 at 4:27 PM, Dong Lin <
> > > > lindon...@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >>>>>
> > > > > > > > >>>>> > Hi Jun,
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > Thanks for the detailed comments. Please see answers
> > > > inline:
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao <
> > > j...@confluent.io
> > > > >
> > > > > > > wrote:
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > > Hi, Dong,
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> > > Thanks for the updated wiki. A few comments below.
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> > > 1. Topics get created
> > > > > > > > >>>>> > > 1.1 Instead of storing successfully created
> replicas
> > in
> > > > ZK,
> > > > > > > could
> > > > > > > > >>>>> we
> > > > > > > > >>>>> > store
> > > > > > > > >>>>> > > unsuccessfully created replicas in ZK? Since the
> > latter
> > > > is
> > > > > > less
> > > > > > > > >>>>> common,
> > > > > > > > >>>>> > it
> > > > > > > > >>>>> > > probably reduces the load on ZK.
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > We can store unsuccessfully created replicas in ZK.
> > But I
> > > > am
> > > > > > not
> > > > > > > > >>>>> sure if
> > > > > > > > >>>>> > that can reduce write load on ZK.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > If we want to reduce write load on ZK by storing
> > > > > > > unsuccessfully
> > > > > > > > >>>>> created
> > > > > > > > >>>>> > replicas in ZK, then broker should not write to ZK if
> > all
> > > > > > > replicas
> > > > > > > > >>>>> are
> > > > > > > > >>>>> > successfully created. It means that if
> > > > > > > > /broker/topics/[topic]/partiti
> > > > > > > > >>>>> > ons/[partitionId]/controller_managed_state doesn't
> > exist
> > > > in
> > > > > ZK
> > > > > > > for
> > > > > > > > >>>>> a given
> > > > > > > > >>>>> > partition, we have to assume all replicas of this
> > > partition
> > > > > > have
> > > > > > > > been
> > > > > > > > >>>>> > successfully created and send LeaderAndIsrRequest
> with
> > > > > create =
> > > > > > > > >>>>> false. This
> > > > > > > > >>>>> > becomes a problem if controller crashes before
> > receiving
> > > > > > > > >>>>> > LeaderAndIsrResponse to validate whether a replica
> has
> > > been
> > > > > > > > created.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > I think this approach can reduce the number of bytes
> > > stored
> > > > > in
> > > > > > > ZK.
> > > > > > > > >>>>> But I am
> > > > > > > > >>>>> > not sure if this is a concern.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> I was mostly concerned about the controller failover
> > time.
> > > > > > > Currently,
> > > > > > > > >>>>> the
> > > > > > > > >>>>> controller failover is likely dominated by the cost of
> > > > reading
> > > > > > > > >>>>> topic/partition level information from ZK. If we add
> > > another
> > > > > > > > partition
> > > > > > > > >>>>> level path in ZK, it probably will double the
> controller
> > > > > failover
> > > > > > > > >>>>> time. If
> > > > > > > > >>>>> the approach of representing the non-created replicas
> > > doesn't
> > > > > > work,
> > > > > > > > >>>>> have
> > > > > > > > >>>>> you considered just adding the created flag in the
> > > > leaderAndIsr
> > > > > > > path
> > > > > > > > >>>>> in ZK?
> > > > > > > > >>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>> Yes, I have considered adding the created flag in the
> > > > > leaderAndIsr
> > > > > > > > path
> > > > > > > > >>>> in ZK. If we were to add created flag per replica in the
> > > > > > > > >>>> LeaderAndIsrRequest, then it requires a lot of change in
> > the
> > > > > code
> > > > > > > > base.
> > > > > > > > >>>>
> > > > > > > > >>>> If we don't add created flag per replica in the
> > > > > > LeaderAndIsrRequest,
> > > > > > > > >>>> then the information in leaderAndIsr path in ZK and
> > > > > > > > LeaderAndIsrRequest
> > > > > > > > >>>> would be different. Further, the procedure for broker to
> > > > update
> > > > > > ISR
> > > > > > > > in ZK
> > > > > > > > >>>> will be a bit complicated. When leader updates
> > leaderAndIsr
> > > > path
> > > > > > in
> > > > > > > > ZK, it
> > > > > > > > >>>> will have to first read created flags from ZK, change
> isr,
> > > and
> > > > > > write
> > > > > > > > >>>> leaderAndIsr back to ZK. And it needs to check znode
> > version
> > > > and
> > > > > > > > re-try
> > > > > > > > >>>> write operation in ZK if controller has updated ZK
> during
> > > this
> > > > > > > > period. This
> > > > > > > > >>>> is in contrast to the current implementation where the
> > > leader
> > > > > > either
> > > > > > > > gets
> > > > > > > > >>>> all the information from LeaderAndIsrRequest sent by
> > > > controller,
> > > > > > or
> > > > > > > > >>>> determine the information by itself (e.g. ISR), before
> > > writing
> > > > > to
> > > > > > > > >>>> leaderAndIsr path in ZK.
> > > > > > > > >>>>
> > > > > > > > >>>> It seems to me that the above solution is a bit
> > complicated
> > > > and
> > > > > > not
> > > > > > > > >>>> clean. Thus I come up with the design in this KIP to
> store
> > > > this
> > > > > > > > created
> > > > > > > > >>>> flag in a separate zk path. The path is named
> > > > > > > > controller_managed_state to
> > > > > > > > >>>> indicate that we can store in this znode all information
> > > that
> > > > is
> > > > > > > > managed by
> > > > > > > > >>>> controller only, as opposed to ISR.
> > > > > > > > >>>>
> > > > > > > > >>>> I agree with your concern of increased ZK read time
> during
> > > > > > > controller
> > > > > > > > >>>> failover. How about we store the "created" information
> in
> > > the
> > > > > > > > >>>> znode /brokers/topics/[topic]? We can change that znode
> to
> > > > have
> > > > > > the
> > > > > > > > >>>> following data format:
> > > > > > > > >>>>
> > > > > > > > >>>> {
> > > > > > > > >>>>   "version" : 2,
> > > > > > > > >>>>   "created" : {
> > > > > > > > >>>>     "1" : [1, 2, 3],
> > > > > > > > >>>>     ...
> > > > > > > > >>>>   }
> > > > > > > > >>>>   "partition" : {
> > > > > > > > >>>>     "1" : [1, 2, 3],
> > > > > > > > >>>>     ...
> > > > > > > > >>>>   }
> > > > > > > > >>>> }
> > > > > > > > >>>>
> > > > > > > > >>>> We won't have extra zk read using this solution. It also
> > > seems
> > > > > > > > >>>> reasonable to put the partition assignment information
> > > > together
> > > > > > with
> > > > > > > > >>>> replica creation information. The latter is only changed
> > > once
> > > > > > after
> > > > > > > > the
> > > > > > > > >>>> partition is created or re-assigned.
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > > 1.2 If an error is received for a follower, does
> the
> > > > > > controller
> > > > > > > > >>>>> eagerly
> > > > > > > > >>>>> > > remove it from ISR or do we just let the leader
> > removes
> > > > it
> > > > > > > after
> > > > > > > > >>>>> timeout?
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > No, Controller will not actively remove it from ISR.
> > But
> > > > > > > controller
> > > > > > > > >>>>> will
> > > > > > > > >>>>> > recognize it as offline replica and propagate this
> > > > > information
> > > > > > to
> > > > > > > > all
> > > > > > > > >>>>> > brokers via UpdateMetadataRequest. Each leader can
> use
> > > this
> > > > > > > > >>>>> information to
> > > > > > > > >>>>> > actively remove offline replica from ISR set. I have
> > > > updated
> > > > > to
> > > > > > > > wiki
> > > > > > > > >>>>> to
> > > > > > > > >>>>> > clarify it.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> >
> > > > > > > > >>>>>
> > > > > > > > >>>>> That seems inconsistent with how the controller deals
> > with
> > > > > > offline
> > > > > > > > >>>>> replicas
> > > > > > > > >>>>> due to broker failures. When that happens, the controller
> > will
> > > > (1)
> > > > > > > select
> > > > > > > > >>>>> a new
> > > > > > > > >>>>> leader if the offline replica is the leader; (2) remove
> > the
> > > > > > replica
> > > > > > > > >>>>> from
> > > > > > > > >>>>> ISR if the offline replica is the follower. So,
> > > intuitively,
> > > > it
> > > > > > > seems
> > > > > > > > >>>>> that
> > > > > > > > >>>>> we should be doing the same thing when dealing with
> > offline
> > > > > > > replicas
> > > > > > > > >>>>> due to
> > > > > > > > >>>>> disk failure.
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>> My bad. I misunderstood how the controller currently
> > handles
> > > > > > broker
> > > > > > > > >>>> failure and ISR change. Yes we should do the same thing
> > when
> > > > > > dealing
> > > > > > > > with
> > > > > > > > >>>> offline replicas here. I have updated the KIP to specify
> > > that,
> > > > > > when
> > > > > > > an
> > > > > > > > >>>> offline replica is discovered by controller, the
> > controller
> > > > > > removes
> > > > > > > > offline
> > > > > > > > >>>> replicas from ISR in the ZK and sends
> LeaderAndIsrRequest
> > > with
> > > > > > > > updated ISR
> > > > > > > > >>>> to be used by partition leaders.
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > > 1.3 Similar, if an error is received for a leader,
> > > should
> > > > > the
> > > > > > > > >>>>> controller
> > > > > > > > >>>>> > > trigger leader election again?
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > Yes, controller will trigger leader election if
> leader
> > > > > replica
> > > > > > is
> > > > > > > > >>>>> offline.
> > > > > > > > >>>>> > I have updated the wiki to clarify it.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> > > 2. A log directory stops working on a broker during
> > > > > runtime:
> > > > > > > > >>>>> > > 2.1 It seems the broker remembers the failed
> > directory
> > > > > after
> > > > > > > > >>>>> hitting an
> > > > > > > > >>>>> > > IOException and the failed directory won't be used
> > for
> > > > > > creating
> > > > > > > > new
> > > > > > > > >>>>> > > partitions until the broker is restarted? If so,
> > could
> > > > you
> > > > > > add
> > > > > > > > >>>>> that to
> > > > > > > > >>>>> > the
> > > > > > > > >>>>> > > wiki.
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > Right, broker assumes a log directory to be good
> after
> > it
> > > > > > starts,
> > > > > > > > >>>>> and mark
> > > > > > > > >>>>> > log directory as bad once there is IOException when
> > > broker
> > > > > > > attempts
> > > > > > > > >>>>> to
> > > > > > > > >>>>> > access the log directory. New replicas will only be
> > > created
> > > > > on
> > > > > > > good
> > > > > > > > >>>>> log
> > > > > > > > >>>>> > directory. I just added this to the KIP.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > > 2.2 Could you be a bit more specific on how and
> > during
> > > > > which
> > > > > > > > >>>>> operation
> > > > > > > > >>>>> > the
> > > > > > > > >>>>> > > broker detects directory failure? Is it when the
> > broker
> > > > > hits
> > > > > > an
> > > > > > > > >>>>> > IOException
> > > > > > > > >>>>> > > during writes, or both reads and writes?  For
> > example,
> > > > > during
> > > > > > > > >>>>> broker
> > > > > > > > >>>>> > > startup, it only reads from each of the log
> > > directories,
> > > > if
> > > > > > it
> > > > > > > > >>>>> hits an
> > > > > > > > >>>>> > > IOException there, does the broker immediately mark
> > the
> > > > > > > directory
> > > > > > > > >>>>> as
> > > > > > > > >>>>> > > offline?
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > Broker marks log directory as bad once there is
> > > IOException
> > > > > > when
> > > > > > > > >>>>> broker
> > > > > > > > >>>>> > attempts to access the log directory. This includes
> > read
> > > > and
> > > > > > > write.
> > > > > > > > >>>>> These
> > > > > > > > >>>>> > operations include log append, log read, log
> cleaning,
> > > > > > watermark
> > > > > > > > >>>>> checkpoint
> > > > > > > > >>>>> > etc. If broker hits IOException when it reads from
> each
> > > of
> > > > > the
> > > > > > > log
> > > > > > > > >>>>> > directory during startup, it immediately mark the
> > > directory
> > > > > as
> > > > > > > > >>>>> offline.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > I just updated the KIP to clarify it.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > > 3. Partition reassignment: If we know a replica is
> > > > offline,
> > > > > > do
> > > > > > > we
> > > > > > > > >>>>> still
> > > > > > > > >>>>> > > want to send StopReplicaRequest to it?
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > No, controller doesn't send StopReplicaRequest for an
> > > > offline
> > > > > > > > >>>>> replica.
> > > > > > > > >>>>> > Controller treats this scenario in the same way that
> > > > exiting
> > > > > > > Kafka
> > > > > > > > >>>>> > implementation does when the broker of this replica
> is
> > > > > offline.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> > > 4. UpdateMetadataRequestPartitionState: For
> > > > > > offline_replicas,
> > > > > > > do
> > > > > > > > >>>>> they
> > > > > > > > >>>>> > only
> > > > > > > > >>>>> > > include offline replicas due to log directory
> > failures
> > > or
> > > > > do
> > > > > > > they
> > > > > > > > >>>>> also
> > > > > > > > >>>>> > > include offline replicas due to broker failure?
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > UpdateMetadataRequestPartitionState's
> offline_replicas
> > > > > include
> > > > > > > > >>>>> offline
> > > > > > > > >>>>> > replicas due to both log directory failure and broker
> > > > > failure.
> > > > > > > This
> > > > > > > > >>>>> is to
> > > > > > > > >>>>> > make the semantics of this field easier to
> understand.
> > > > Broker
> > > > > > can
> > > > > > > > >>>>> > distinguish whether a replica is offline due to
> broker
> > > > > failure
> > > > > > or
> > > > > > > > >>>>> disk
> > > > > > > > >>>>> > failure by checking whether a broker is live in the
> > > > > > > > >>>>> UpdateMetadataRequest.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> > > 5. Tools: Could we add some kind of support in the
> > tool
> > > > to
> > > > > > list
> > > > > > > > >>>>> offline
> > > > > > > > >>>>> > > directories?
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > In KIP-112 we don't have tools to list offline
> > > directories
> > > > > > > because
> > > > > > > > >>>>> we have
> > > > > > > > >>>>> > intentionally avoided exposing log directory
> > information
> > > > > (e.g.
> > > > > > > log
> > > > > > > > >>>>> > directory path) to user or other brokers. I think we
> > can
> > > > add
> > > > > > this
> > > > > > > > >>>>> feature
> > > > > > > > >>>>> > in KIP-113, in which we will have DescribeDirsRequest
> > to
> > > > list
> > > > > > log
> > > > > > > > >>>>> directory
> > > > > > > > >>>>> > information (e.g. partition assignment, path, size)
> > > needed
> > > > > for
> > > > > > > > >>>>> rebalance.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> Since we are introducing a new failure mode, if a
> replica
> > > > > becomes
> > > > > > > > >>>>> offline
> > > > > > > > >>>>> due to failure in log directories, the first thing an
> > admin
> > > > > wants
> > > > > > > to
> > > > > > > > >>>>> know
> > > > > > > > >>>>> is which log directories are offline from the broker's
> > > > > > perspective.
> > > > > > > > >>>>> So,
> > > > > > > > >>>>> including such a tool will be useful. Do you plan to do
> > > > KIP-112
> > > > > > and
> > > > > > > > >>>>> KIP-113
> > > > > > > > >>>>>  in the same release?
> > > > > > > > >>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>> Yes, I agree that including such a tool is useful. This
> is
> > > > > probably
> > > > > > > > >>>> better to be added in KIP-113 because we need
> > > > > DescribeDirsRequest
> > > > > > to
> > > > > > > > get
> > > > > > > > >>>> this information. I will update KIP-113 to include this
> > > tool.
> > > > > > > > >>>>
> > > > > > > > >>>> I plan to do KIP-112 and KIP-113 separately to make each
> > KIP
> > > > and
> > > > > > > their
> > > > > > > > >>>> patch easier to review. I don't have any plan about
> which
> > > > > release
> > > > > > to
> > > > > > > > have
> > > > > > > > >>>> these KIPs. My plan is to finish both of them ASAP. Is there
> > > > particular
> > > > > > > > timeline
> > > > > > > > >>>> you prefer for code of these two KIPs to checked-in?
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> > > 6. Metrics: Could we add some metrics to show
> offline
> > > > > > > > directories?
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > Sure. I think it makes sense to have each broker
> report
> > > its
> > > > > > > number
> > > > > > > > of
> > > > > > > > >>>>> > offline replicas and offline log directories. The
> > > previous
> > > > > > metric
> > > > > > > > >>>>> was put
> > > > > > > > >>>>> > in KIP-113. I just added both metrics in KIP-112.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> > > 7. There are still references to kafka-log-dirs.sh.
> > Are
> > > > > they
> > > > > > > > valid?
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > My bad. I just removed this from "Changes in
> > Operational
> > > > > > > > Procedures"
> > > > > > > > >>>>> and
> > > > > > > > >>>>> > "Test Plan" in the KIP.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> > > 8. Do you think KIP-113 is ready for review? One
> > thing
> > > > that
> > > > > > > > KIP-113
> > > > > > > > >>>>> > > mentions during partition reassignment is to first
> > send
> > > > > > > > >>>>> > > LeaderAndIsrRequest, followed by
> > > ChangeReplicaDirRequest.
> > > > > It
> > > > > > > > seems
> > > > > > > > >>>>> it's
> > > > > > > > >>>>> > > better if the replicas are created in the right log
> > > > > directory
> > > > > > > in
> > > > > > > > >>>>> the
> > > > > > > > >>>>> > first
> > > > > > > > >>>>> > > place? The reason that I brought it up here is
> > because
> > > it
> > > > > may
> > > > > > > > >>>>> affect the
> > > > > > > > >>>>> > > protocol of LeaderAndIsrRequest.
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > Yes, KIP-113 is ready for review. The advantage of
> the
> > > > > current
> > > > > > > > >>>>> design is
> > > > > > > > >>>>> > that we can keep LeaderAndIsrRequest
> > > > log-directory-agnostic.
> > > > > > The
> > > > > > > > >>>>> > implementation would be much easier to read if all
> log
> > > > > related
> > > > > > > > logic
> > > > > > > > >>>>> (e.g.
> > > > > > > > >>>>> > various errors) are put in ChangeReplicaDirRequest
> and
> > > the
> > > > > code
> > > > > > > > path
> > > > > > > > >>>>> of
> > > > > > > > >>>>> > handling replica movement is separated from
> leadership
> > > > > > handling.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > In other words, I think Kafka may be easier to
> develop
> > in
> > > > the
> > > > > > > long
> > > > > > > > >>>>> term if
> > > > > > > > >>>>> > we separate these two requests.
> > > > > > > > >>>>> >
> > > > > > > > >>>>> > I agree that ideally we want to create replicas in
> the
> > > > right
> > > > > > log
> > > > > > > > >>>>> directory
> > > > > > > > >>>>> > in the first place. But I am not sure if there is any
> > > > > > performance
> > > > > > > > or
> > > > > > > > >>>>> > correctness concern with the existing way of moving
> it
> > > > after
> > > > > it
> > > > > > > is
> > > > > > > > >>>>> created.
> > > > > > > > >>>>> > Besides, does this decision affect the change
> proposed
> > in
> > > > > > > KIP-112?
> > > > > > > > >>>>> >
> > > > > > > > >>>>> >
> > > > > > > > >>>>> I am just wondering if you have considered including
> the
> > > log
> > > > > > > > directory
> > > > > > > > >>>>> for
> > > > > > > > >>>>> the replicas in the LeaderAndIsrRequest.
> > > > > > > > >>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>> Yeah I have thought about this idea, but only briefly. I
> > > > > rejected
> > > > > > > this
> > > > > > > > >>>> idea because log directory is broker's local information
> > > and I
> > > > > > > prefer
> > > > > > > > not
> > > > > > > > >>>> to expose local config information to the cluster
> through
> > > > > > > > >>>> LeaderAndIsrRequest.
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>>> 9. Could you describe when the offline replicas due to
> > log
> > > > > > > directory
> > > > > > > > >>>>> failure are removed from the replica fetch threads?
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>> Yes. If the offline replica was a leader, either a new
> > > leader
> > > > is
> > > > > > > > >>>> elected or all follower brokers will stop fetching for
> > this
> > > > > > > > partition. If
> > > > > > > > >>>> the offline replica is a follower, the broker will stop
> > > > fetching
> > > > > > for
> > > > > > > > this
> > > > > > > > >>>> replica immediately. A broker stops fetching data for a
> > > > replica
> > > > > by
> > > > > > > > removing
> > > > > > > > >>>> the replica from the replica fetch threads. I have
> updated
> > > the
> > > > > KIP
> > > > > > > to
> > > > > > > > >>>> clarify it.
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> 10. The wiki mentioned changing the log directory to a
> > file
> > > > for
> > > > > > > > >>>>> simulating
> > > > > > > > >>>>> disk failure in system tests. Could we just change the
> > > > > permission
> > > > > > > of
> > > > > > > > >>>>> the
> > > > > > > > >>>>> log directory to 000 to simulate that?
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>> Sure,
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> Thanks,
> > > > > > > > >>>>>
> > > > > > > > >>>>> Jun
> > > > > > > > >>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> > > Jun
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> > > On Fri, Feb 10, 2017 at 9:53 AM, Dong Lin <
> > > > > > lindon...@gmail.com
> > > > > > > >
> > > > > > > > >>>>> wrote:
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> > > > Hi Jun,
> > > > > > > > >>>>> > > >
> > > > > > > > >>>>> > > > Can I replace zookeeper access with direct RPC
> for
> > > both
> > > > > ISR
> > > > > > > > >>>>> > notification
> > > > > > > > >>>>> > > > and disk failure notification in a future KIP, or
> > do
> > > > you
> > > > > > feel
> > > > > > > > we
> > > > > > > > >>>>> should
> > > > > > > > >>>>> > > do
> > > > > > > > >>>>> > > > it in this KIP?
> > > > > > > > >>>>> > > >
> > > > > > > > >>>>> > > > Hi Eno, Grant and everyone,
> > > > > > > > >>>>> > > >
> > > > > > > > >>>>> > > > Is there further improvement you would like to
> see
> > > with
> > > > > > this
> > > > > > > > KIP?
> > > > > > > > >>>>> > > >
> > > > > > > > >>>>> > > > Thanks you all for the comments,
> > > > > > > > >>>>> > > >
> > > > > > > > >>>>> > > > Dong
> > > > > > > > >>>>> > > >
> > > > > > > > >>>>> > > >
> > > > > > > > >>>>> > > >
> > > > > > > > >>>>> > > > On Thu, Feb 9, 2017 at 4:45 PM, Dong Lin <
> > > > > > > lindon...@gmail.com>
> > > > > > > > >>>>> wrote:
> > > > > > > > >>>>> > > >
> > > > > > > > >>>>> > > > >
> > > > > > > > >>>>> > > > >
> > > > > > > > >>>>> > > > > On Thu, Feb 9, 2017 at 3:37 PM, Colin McCabe <
> > > > > > > > >>>>> cmcc...@apache.org>
> > > > > > > > >>>>> > > wrote:
> > > > > > > > >>>>> > > > >
> > > > > > > > >>>>> > > > >> On Thu, Feb 9, 2017, at 11:40, Dong Lin wrote:
> > > > > > > > >>>>> > > > >> > Thanks for all the comments Colin!
> > > > > > > > >>>>> > > > >> >
> > > > > > > > >>>>> > > > >> > To answer your questions:
> > > > > > > > >>>>> > > > >> > - Yes, a broker will shutdown if all its log
> > > > > > directories
> > > > > > > > >>>>> are bad.
> > > > > > > > >>>>> > > > >>
> > > > > > > > >>>>> > > > >> That makes sense.  Can you add this to the
> > > writeup?
> > > > > > > > >>>>> > > > >>
> > > > > > > > >>>>> > > > >
> > > > > > > > >>>>> > > > > Sure. This has already been added. You can find
> > it
> > > > here
> > > > > > > > >>>>> > > > > <https://cwiki.apache.org/confluence/pages/
> > > > > > > diffpagesbyversio
> > > > > > > > >>>>> n.action
> > > > > > > > >>>>> > ?
> > > > > > > > >>>>> > > > pageId=67638402&selectedPageVersions=9&
> > > > > > selectedPageVersions=
> > > > > > > > 10>
> > > > > > > > >>>>> > > > > .
> > > > > > > > >>>>> > > > >
> > > > > > > > >>>>> > > > >
> > > > > > > > >>>>> > > > >>
> > > > > > > > >>>>> > > > >> > - I updated the KIP to explicitly state
> that a
> > > log
> > > > > > > > >>>>> directory will
> > > > > > > > >>>>> > be
> > > > > > > > >>>>> > > > >> > assumed to be good until broker sees
> > IOException
> > > > > when
> > > > > > it
> > > > > > > > >>>>> tries to
> > > > > > > > >>>>> > > > access
> > > > > > > > >>>>> > > > >> > the log directory.
> > > > > > > > >>>>> > > > >>
> > > > > > > > >>>>> > > > >> Thanks.
> > > > > > > > >>>>> > > > >>
> > > > > > > > >>>>> > > > >> > - Controller doesn't explicitly know whether
> > > there
> > > > > is
> > > > > > > new
> > > > > > > > >>>>> log
> > > > > > > > >>>>> > > > directory
> > > > > > > > >>>>> > > > >> > or
> > > > > > > > >>>>> > > > >> > not. All controller knows is whether
> replicas
> > > are
> > > > > > online
> > > > > > > > or
> > > > > > > > >>>>> > offline
> > > > > > > > >>>>> > > > >> based
> > > > > > > > >>>>> > > > >> > on LeaderAndIsrResponse. According to the
> > > existing
> > > > > > Kafka
> > > > > > > > >>>>> > > > implementation,
> > > > > > > > >>>>> > > > >> > controller will always send
> > LeaderAndIsrRequest
> > > > to a
> > > > > > > > broker
> > > > > > > > >>>>> after
> > > > > > > > >>>>> > it
> > > > > > > > >>>>> > > > >> > bounces.
> > > > > > > > >>>>> > > > >>
> > > > > > > > >>>>> > > > >> I thought so.  It's good to clarify, though.
> Do
> > > you
> > > > > > think
> > > > > > > > >>>>> it's
> > > > > > > > >>>>> > worth
> > > > > > > > >>>>> > > > >> adding a quick discussion of this on the wiki?
> > > > > > > > >>>>> > > > >>
> > > > > > > > >>>>> > > > >
> > > > > > > > >>>>> > > > > Personally I don't think it is needed. If
> broker
> > > > starts
> > > > > > > with
> > > > > > > > >>>>> no bad
> > > > > > > > >>>>> > log
> > > > > > > > >>>>> > > > > directory, everything should work as it is and we
> > > should
> > > > > not
> > > > > > > > need
> > > > > > > > >>>>> to
> > > > > > > > >>>>> > > clarify
> > > > > > > > >>>>> > > > > it. The KIP has already covered the scenario
> > when a
> > > > > > broker
> > > > > > > > >>>>> starts
> > > > > > > > >>>>> > with
> > > > > > > > >>>>> > > > bad
> > > > > > > > >>>>> > > > > log directory. Also, the KIP doesn't claim or
> > hint
> > > > that
> > > > > > we
> > > > > > > > >>>>> support
> > > > > > > > >>>>> > > > dynamic
> > > > > > > > >>>>> > > > > addition of new log directories. I think we are
> > > good.
> > > > > > > > >>>>> > > > >
> > > > > > > > >>>>> > > > >
> > > > > > > > >>>>> > > > >> best,
> > > > > > > > >>>>> > > > >> Colin
> > > > > > > > >>>>> > > > >>
> > > > > > > > >>>>> > > > >> >
> > > > > > > > >>>>> > > > >> > Please see this
> > > > > > > > >>>>> > > > >> > <https://cwiki.apache.org/conf
> > > > > > > > >>>>> luence/pages/diffpagesbyversio
> > > > > > > > >>>>> > > > >> n.action?pageId=67638402&
> > selectedPageVersions=9&
> > > > > > > > >>>>> > > > selectedPageVersions=10>
> > > > > > > > >>>>> > > > >> > for the change of the KIP.
> > > > > > > > >>>>> > > > >> >
> > > > > > > > >>>>> > > > >> > On Thu, Feb 9, 2017 at 11:04 AM, Colin
> McCabe
> > <
> > > > > > > > >>>>> cmcc...@apache.org
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> > > > >> wrote:
> > > > > > > > >>>>> > > > >> >
> > > > > > > > >>>>> > > > >> > > On Thu, Feb 9, 2017, at 11:03, Colin
> McCabe
> > > > wrote:
> > > > > > > > >>>>> > > > >> > > > Thanks, Dong L.
> > > > > > > > >>>>> > > > >> > > >
> > > > > > > > >>>>> > > > >> > > > Do we plan on bringing down the broker
> > > process
> > > > > > when
> > > > > > > > all
> > > > > > > > >>>>> log
> > > > > > > > >>>>> > > > >> directories
> > > > > > > > >>>>> > > > >> > > > are offline?
> > > > > > > > >>>>> > > > >> > > >
> > > > > > > > >>>>> > > > >> > > > Can you explicitly state on the KIP that
> > the
> > > > log
> > > > > > > dirs
> > > > > > > > >>>>> are all
> > > > > > > > >>>>> > > > >> considered
> > > > > > > > >>>>> > > > >> > > > good after the broker process is
> bounced?
> > > It
> > > > > > seems
> > > > > > > > >>>>> like an
> > > > > > > > >>>>> > > > >> important
> > > > > > > > >>>>> > > > >> > > > thing to be clear about.  Also, perhaps
> > > > discuss
> > > > > > how
> > > > > > > > the
> > > > > > > > >>>>> > > controller
> > > > > > > > >>>>> > > > >> > > > becomes aware of the newly good log
> > > > directories
> > > > > > > after
> > > > > > > > a
> > > > > > > > >>>>> broker
> > > > > > > > >>>>> > > > >> bounce
> > > > > > > > >>>>> > > > >> > > > (and whether this triggers re-election).
> > > > > > > > >>>>> > > > >> > >
> > > > > > > > >>>>> > > > >> > > I meant to write, all the log dirs where
> the
> > > > > broker
> > > > > > > can
> > > > > > > > >>>>> still
> > > > > > > > >>>>> > read
> > > > > > > > >>>>> > > > the
> > > > > > > > >>>>> > > > >> > > index and some other files.  Clearly, log
> > dirs
> > > > > that
> > > > > > > are
> > > > > > > > >>>>> > completely
> > > > > > > > >>>>> > > > >> > > inaccessible will still be considered bad
> > > after
> > > > a
> > > > > > > broker
> > > > > > > > >>>>> process
> > > > > > > > >>>>> > > > >> bounce.
> > > > > > > > >>>>> > > > >> > >
> > > > > > > > >>>>> > > > >> > > best,
> > > > > > > > >>>>> > > > >> > > Colin
> > > > > > > > >>>>> > > > >> > >
> > > > > > > > >>>>> > > > >> > > >
> > > > > > > > >>>>> > > > >> > > > +1 (non-binding) aside from that
> > > > > > > > >>>>> > > > >> > > >
> > > > > > > > >>>>> > > > >> > > >
> > > > > > > > >>>>> > > > >> > > >
> > > > > > > > >>>>> > > > >> > > > On Wed, Feb 8, 2017, at 00:47, Dong Lin
> > > wrote:
> > > > > > > > >>>>> > > > >> > > > > Hi all,
> > > > > > > > >>>>> > > > >> > > > >
> > > > > > > > >>>>> > > > >> > > > > Thank you all for the helpful
> > suggestion.
> > > I
> > > > > have
> > > > > > > > >>>>> updated the
> > > > > > > > >>>>> > > KIP
> > > > > > > > >>>>> > > > >> to
> > > > > > > > >>>>> > > > >> > > > > address
> > > > > > > > >>>>> > > > >> > > > > the comments received so far. See here
> > > > > > > > >>>>> > > > >> > > > > <https://cwiki.apache.org/conf
> > > > > > > > >>>>> > luence/pages/diffpagesbyversio
> > > > > > > > >>>>> > > > >> n.action?
> > > > > > > > >>>>> > > > >> > > pageId=67638402&selectedPageVe
> > > > > > > > >>>>> rsions=8&selectedPageVersions=
> > > > > > > > >>>>> > 9>to
> > > > > > > > >>>>> > > > >> > > > > read the changes of the KIP. Here is a
> > > > summary
> > > > > > of
> > > > > > > > >>>>> change:
> > > > > > > > >>>>> > > > >> > > > >
> > > > > > > > >>>>> > > > >> > > > > - Updated the Proposed Change section
> to
> > > > > change
> > > > > > > the
> > > > > > > > >>>>> recovery
> > > > > > > > >>>>> > > > >> steps.
> > > > > > > > >>>>> > > > >> > > After
> > > > > > > > >>>>> > > > >> > > > > this change, broker will also create
> > > replica
> > > > > as
> > > > > > > long
> > > > > > > > >>>>> as all
> > > > > > > > >>>>> > > log
> > > > > > > > >>>>> > > > >> > > > > directories
> > > > > > > > >>>>> > > > >> > > > > are working.
> > > > > > > > >>>>> > > > >> > > > > - Removed kafka-log-dirs.sh from this
> > KIP
> > > > > since
> > > > > > > user
> > > > > > > > >>>>> no
> > > > > > > > >>>>> > longer
> > > > > > > > >>>>> > > > >> needs to
> > > > > > > > >>>>> > > > >> > > > > use
> > > > > > > > >>>>> > > > >> > > > > it for recovery from bad disks.
> > > > > > > > >>>>> > > > >> > > > > - Explained how the znode
> > > > > > controller_managed_state
> > > > > > > > is
> > > > > > > > >>>>> > managed
> > > > > > > > >>>>> > > in
> > > > > > > > >>>>> > > > >> the
> > > > > > > > >>>>> > > > >> > > > > Public
> > > > > > > > >>>>> > > > >> > > > > interface section.
> > > > > > > > >>>>> > > > >> > > > > - Explained what happens during
> > controller
> > > > > > > failover,
> > > > > > > > >>>>> > partition
> > > > > > > > >>>>> > > > >> > > > > reassignment
> > > > > > > > >>>>> > > > >> > > > > and topic deletion in the Proposed
> > Change
> > > > > > section.
> > > > > > > > >>>>> > > > >> > > > > - Updated Future Work section to
> include
> > > the
> > > > > > > > following
> > > > > > > > >>>>> > > potential
> > > > > > > > >>>>> > > > >> > > > > improvements
> > > > > > > > >>>>> > > > >> > > > >   - Let broker notify controller of
> ISR
> > > > change
> > > > > > and
> > > > > > > > >>>>> disk
> > > > > > > > >>>>> > state
> > > > > > > > >>>>> > > > >> change
> > > > > > > > >>>>> > > > >> > > via
> > > > > > > > >>>>> > > > >> > > > > RPC instead of using zookeeper
> > > > > > > > >>>>> > > > >> > > > >   - Handle various failure scenarios
> > (e.g.
> > > > > slow
> > > > > > > > disk)
> > > > > > > > >>>>> on a
> > > > > > > > >>>>> > > > >> case-by-case
> > > > > > > > >>>>> > > > >> > > > > basis. For example, we may want to
> > detect
> > > > slow
> > > > > > > disk
> > > > > > > > >>>>> and
> > > > > > > > >>>>> > > consider
> > > > > > > > >>>>> > > > >> it as
> > > > > > > > >>>>> > > > >> > > > > offline.
> > > > > > > > >>>>> > > > >> > > > >   - Allow admin to mark a directory as
> > bad
> > > > so
> > > > > > that
> > > > > > > > it
> > > > > > > > >>>>> will
> > > > > > > > >>>>> > not
> > > > > > > > >>>>> > > > be
> > > > > > > > >>>>> > > > >> used.
> > > > > > > > >>>>> > > > >> > > > >
> > > > > > > > >>>>> > > > >> > > > > Thanks,
> > > > > > > > >>>>> > > > >> > > > > Dong
> > > > > > > > >>>>> > > > >> > > > >
> > > > > > > > >>>>> > > > >> > > > >
> > > > > > > > >>>>> > > > >> > > > >
> > > > > > > > >>>>> > > > >> > > > > On Tue, Feb 7, 2017 at 5:23 PM, Dong
> > Lin <
> > > > > > > > >>>>> > lindon...@gmail.com
> > > > > > > > >>>>> > > >
> > > > > > > > >>>>> > > > >> wrote:
> > > > > > > > >>>>> > > > >> > > > >
> > > > > > > > >>>>> > > > >> > > > > > Hey Eno,
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > Thanks much for the comment!
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > I still think the complexity added
> to
> > > > Kafka
> > > > > is
> > > > > > > > >>>>> justified
> > > > > > > > >>>>> > by
> > > > > > > > >>>>> > > > its
> > > > > > > > >>>>> > > > >> > > benefit.
> > > > > > > > >>>>> > > > >> > > > > > Let me provide my reasons below.
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > 1) The additional logic is easy to
> > > > > understand
> > > > > > > and
> > > > > > > > >>>>> thus its
> > > > > > > > >>>>> > > > >> complexity
> > > > > > > > >>>>> > > > >> > > > > > should be reasonable.
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > On the broker side, it needs to
> catch
> > > > > > exceptions
> > > > > > > > when
> > > > > > > > >>>>> > accessing the
> > > > > > > > >>>>> > > > log
> > > > > > > > >>>>> > > > >> > > directory,
> > > > > > > > >>>>> > > > >> > > > > > mark log directory and all its
> > replicas
> > > as
> > > > > > > > offline,
> > > > > > > > >>>>> notify
> > > > > > > > >>>>> > > > >> > > controller by
> > > > > > > > >>>>> > > > >> > > > > > writing the zookeeper notification
> > path,
> > > > and
> > > > > > > > >>>>> specify error
> > > > > > > > >>>>> > > in
> > > > > > > > >>>>> > > > >> > > > > > LeaderAndIsrResponse. On the
> > controller
> > > > > side,
> > > > > > it
> > > > > > > > >>>>> will
> > > > > > > > >>>>> > > listen
> > > > > > > > >>>>> > > > >> to
> > > > > > > > >>>>> > > > >> > > > > > zookeeper for disk failure
> > notification,
> > > > > learn
> > > > > > > > about
> > > > > > > > >>>>> > offline
> > > > > > > > >>>>> > > > >> > > replicas in
> > > > > > > > >>>>> > > > >> > > > > > the LeaderAndIsrResponse, and take
> > > offline
> > > > > > > > replicas
> > > > > > > > >>>>> into
> > > > > > > > >>>>> > > > >> > > consideration when
> > > > > > > > >>>>> > > > >> > > > > > electing leaders. It also mark
> replica
> > > as
> > > > > > > created
> > > > > > > > in
> > > > > > > > >>>>> > > zookeeper
> > > > > > > > >>>>> > > > >> and
> > > > > > > > >>>>> > > > >> > > use it
> > > > > > > > >>>>> > > > >> > > > > > to determine whether a replica is
> > > created.
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > That is all the logic we need to add
> > in
> > > > > > Kafka. I
> > > > > > > > >>>>> > personally
> > > > > > > > >>>>> > > > feel
> > > > > > > > >>>>> > > > >> > > this is
> > > > > > > > >>>>> > > > >> > > > > > easy to reason about.
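> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > As a simplified sketch of the broker-side notification: the broker can
> > > > > > > > >>>>> > > > >> > > > > > create a sequential znode under a notification path that the controller
> > > > > > > > >>>>> > > > >> > > > > > watches. The path and payload below are illustrative, not the final names.
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > >     import java.nio.charset.StandardCharsets;
> > > > > > > > >>>>> > > > >> > > > > >     import org.apache.zookeeper.CreateMode;
> > > > > > > > >>>>> > > > >> > > > > >     import org.apache.zookeeper.ZooDefs.Ids;
> > > > > > > > >>>>> > > > >> > > > > >     import org.apache.zookeeper.ZooKeeper;
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > >     public class LogDirFailureNotifier {
> > > > > > > > >>>>> > > > >> > > > > >         public static void main(String[] args) throws Exception {
> > > > > > > > >>>>> > > > >> > > > > >             ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
> > > > > > > > >>>>> > > > >> > > > > >             // Assumes the parent path already exists; the controller watches its children.
> > > > > > > > >>>>> > > > >> > > > > >             byte[] payload = "{\"version\":1,\"broker\":1}".getBytes(StandardCharsets.UTF_8);
> > > > > > > > >>>>> > > > >> > > > > >             zk.create("/log_dir_event_notification/log_dir_event_", payload,
> > > > > > > > >>>>> > > > >> > > > > >                       Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
> > > > > > > > >>>>> > > > >> > > > > >             zk.close();
> > > > > > > > >>>>> > > > >> > > > > >         }
> > > > > > > > >>>>> > > > >> > > > > >     }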
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > 2) The additional code is not much.
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > I expect the code for KIP-112 to be
> > > around
> > > > > > 1100
> > > > > > > > >>>>> lines of new
> > > > > > > > >>>>> > > > code.
> > > > > > > > >>>>> > > > >> > > Previously
> > > > > > > > >>>>> > > > >> > > > > > I have implemented a prototype of a
> > > > slightly
> > > > > > > > >>>>> different
> > > > > > > > >>>>> > > design
> > > > > > > > >>>>> > > > >> (see
> > > > > > > > >>>>> > > > >> > > here
> > > > > > > > >>>>> > > > >> > > > > > <https://docs.google.com/docum
> > > > > > > > >>>>> ent/d/1Izza0SBmZMVUBUt9s_
> > > > > > > > >>>>> > > > >> > > -Dqi3D8e0KGJQYW8xgEdRsgAI/edit>)
> > > > > > > > >>>>> > > > >> > > > > > and uploaded it to github (see here
> > > > > > > > >>>>> > > > >> > > > > > <https://github.com/lindong28/
> > > > > kafka/tree/JBOD
> > > > > > > >).
> > > > > > > > >>>>> The
> > > > > > > > >>>>> > patch
> > > > > > > > >>>>> > > > >> changed
> > > > > > > > >>>>> > > > >> > > 33
> > > > > > > > >>>>> > > > >> > > > > > files, added 1185 lines and deleted
> > 183
> > > > > lines.
> > > > > > > The
> > > > > > > > >>>>> size of
> > > > > > > > >>>>> > > > >> prototype
> > > > > > > > >>>>> > > > >> > > patch
> > > > > > > > >>>>> > > > >> > > > > > is actually smaller than patch of
> > > KIP-107
> > > > > (see
> > > > > > > > here
> > > > > > > > >>>>> > > > >> > > > > > <https://github.com/apache/
> > > > kafka/pull/2476
> > > > > >)
> > > > > > > > which
> > > > > > > > >>>>> is
> > > > > > > > >>>>> > > already
> > > > > > > > >>>>> > > > >> > > accepted.
> > > > > > > > >>>>> > > > >> > > > > > The KIP-107 patch changed 49 files,
> > > added
> > > > > 1349
> > > > > > > > >>>>> lines and
> > > > > > > > >>>>> > > > >> deleted 141
> > > > > > > > >>>>> > > > >> > > lines.
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > 3) Comparison with
> > > > > > > one-broker-per-multiple-volumes
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > This KIP can improve the
> availability
> > of
> > > > > Kafka
> > > > > > > in
> > > > > > > > >>>>> this
> > > > > > > > >>>>> > case
> > > > > > > > >>>>> > > > such
> > > > > > > > >>>>> > > > >> > > that one
> > > > > > > > >>>>> > > > >> > > > > > failed volume doesn't bring down the
> > > > entire
> > > > > > > > broker.
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > 4) Comparison with
> > one-broker-per-volume
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > If each volume maps to multiple
> disks,
> > > > then
> > > > > we
> > > > > > > > >>>>> still have
> > > > > > > > >>>>> > > > >> similar
> > > > > > > > >>>>> > > > >> > > problem
> > > > > > > > >>>>> > > > >> > > > > > such that the broker will fail if
> any
> > > disk
> > > > > of
> > > > > > > the
> > > > > > > > >>>>> volume
> > > > > > > > >>>>> > > > failed.
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > If each volume maps to one disk, it
> > > means
> > > > > that
> > > > > > > we
> > > > > > > > >>>>> need to
> > > > > > > > >>>>> > > > >> deploy 10
> > > > > > > > >>>>> > > > >> > > > > > brokers on a machine if the machine
> > has
> > > 10
> > > > > > > disks.
> > > > > > > > I
> > > > > > > > >>>>> will
> > > > > > > > >>>>> > > > >> explain the
> > > > > > > > >>>>> > > > >> > > > > > concern with this approach in order
> of
> > > > their
> > > > > > > > >>>>> importance.
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > - It is weird if we were to tell
> kafka
> > > > user
> > > > > to
> > > > > > > > >>>>> deploy 50
> > > > > > > > >>>>> > > > >> brokers on a
> > > > > > > > >>>>> > > > >> > > > > > machine of 50 disks.
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > - Either when user deploys Kafka on
> a
> > > > > > commercial
> > > > > > > > >>>>> cloud
> > > > > > > > >>>>> > > > platform
> > > > > > > > >>>>> > > > >> or
> > > > > > > > >>>>> > > > >> > > when
> > > > > > > > >>>>> > > > >> > > > > > user deploys their own cluster, the
> > size
> > > > or
> > > > > > > > largest
> > > > > > > > >>>>> disk
> > > > > > > > >>>>> > is
> > > > > > > > >>>>> > > > >> usually
> > > > > > > > >>>>> > > > >> > > > > > limited. There will be scenarios
> where
> > > > user
> > > > > > want
> > > > > > > > to
> > > > > > > > >>>>> > increase
> > > > > > > > >>>>> > > > >> broker
> > > > > > > > >>>>> > > > >> > > > > > capacity by having multiple disks
> per
> > > > > broker.
> > > > > > > This
> > > > > > > > >>>>> JBOD
> > > > > > > > >>>>> > KIP
> > > > > > > > >>>>> > > > >> makes it
> > > > > > > > >>>>> > > > >> > > > > > feasible without hurting
> availability
> > > due
> > > > to
> > > > > > > > single
> > > > > > > > >>>>> disk
> > > > > > > > >>>>> > > > >> failure.
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > - Automatic load rebalance across
> > disks
> > > > will
> > > > > > be
> > > > > > > > >>>>> easier and
> > > > > > > > >>>>> > > > more
> > > > > > > > >>>>> > > > >> > > flexible
> > > > > > > > >>>>> > > > >> > > > > > if one broker has multiple disks.
> This
> > > can
> > > > > be
> > > > > > > > >>>>> future work.
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > - There is performance concern when
> > you
> > > > > deploy
> > > > > > > 10
> > > > > > > > >>>>> broker
> > > > > > > > >>>>> > > vs. 1
> > > > > > > > >>>>> > > > >> > > broker on
> > > > > > > > >>>>> > > > >> > > > > > one machine. The metadata of the
> cluster,
> > > > > > including
> > > > > > > > >>>>> > > FetchRequest,
> > > > > > > > >>>>> > > > >> > > > > > ProduceResponse, MetadataRequest and
> > so
> > > on
> > > > > > will
> > > > > > > > all
> > > > > > > > >>>>> be 10X
> > > > > > > > >>>>> > > > >> more. The
> > > > > > > > >>>>> > > > >> > > > > > packet-per-second will be 10X higher
> > > which
> > > > > may
> > > > > > > > limit
> > > > > > > > >>>>> > > > >> performance if
> > > > > > > > >>>>> > > > >> > > pps is
> > > > > > > > >>>>> > > > >> > > > > > the performance bottleneck. The
> number
> > > of
> > > > > > socket
> > > > > > > > on
> > > > > > > > >>>>> the
> > > > > > > > >>>>> > > > machine
> > > > > > > > >>>>> > > > >> is
> > > > > > > > >>>>> > > > >> > > 10X
> > > > > > > > >>>>> > > > >> > > > > > higher. And the number of
> replication
> > > > thread
> > > > > > > will
> > > > > > > > >>>>> be 100X
> > > > > > > > >>>>> > > > more.
> > > > > > > > >>>>> > > > >> The
> > > > > > > > >>>>> > > > >> > > impact
> > > > > > > > >>>>> > > > >> > > > > > will be more significant with
> > increasing
> > > > > > number
> > > > > > > of
> > > > > > > > >>>>> disks
> > > > > > > > >>>>> > per
> > > > > > > > >>>>> > > > >> > > machine. Thus
> > > > > > > > >>>>> > > > >> > > > > > it will limit Kafka's scalability in
> > the
> > > > > long
> > > > > > > > term.
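> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > To spell out the multipliers: the cluster now has 10X as many brokers, so
> > > > > > > > >>>>> > > > >> > > > > > cluster metadata and per-broker request traffic grow by roughly 10X; and
> > > > > > > > >>>>> > > > >> > > > > > because replication fetchers and connections are set up per pair of
> > > > > > > > >>>>> > > > >> > > > > > brokers, a pair of machines goes from 1 broker pair to 10 x 10 = 100
> > > > > > > > >>>>> > > > >> > > > > > pairs, which is where the roughly 100X in inter-machine connections and
> > > > > > > > >>>>> > > > >> > > > > > replication threads comes from.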
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > Thanks,
> > > > > > > > >>>>> > > > >> > > > > > Dong
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > > On Tue, Feb 7, 2017 at 1:51 AM, Eno
> > > > > Thereska <
> > > > > > > > >>>>> > > > >> eno.there...@gmail.com
> > > > > > > > >>>>> > > > >> > > >
> > > > > > > > >>>>> > > > >> > > > > > wrote:
> > > > > > > > >>>>> > > > >> > > > > >
> > > > > > > > >>>>> > > > >> > > > > >> Hi Dong,
> > > > > > > > >>>>> > > > >> > > > > >>
> > > > > > > > >>>>> > > > >> > > > > >> To simplify the discussion today,
> on
> > my
> > > > > part
> > > > > > > I'll
> > > > > > > > >>>>> zoom
> > > > > > > > >>>>> > into
> > > > > > > > >>>>> > > > one
> > > > > > > > >>>>> > > > >> > > thing
> > > > > > > > >>>>> > > > >> > > > > >> only:
> > > > > > > > >>>>> > > > >> > > > > >>
> > > > > > > > >>>>> > > > >> > > > > >> - I'll discuss the options called
> > > below :
> > > > > > > > >>>>> > > > >> "one-broker-per-disk" or
> > > > > > > > >>>>> > > > >> > > > > >> "one-broker-per-few-disks".
> > > > > > > > >>>>> > > > >> > > > > >>
> > > > > > > > >>>>> > > > >> > > > > >> - I completely buy the JBOD vs RAID
> > > > > arguments
> > > > > > > so
> > > > > > > > >>>>> there is
> > > > > > > > >>>>> > > no
> > > > > > > > >>>>> > > > >> need to
> > > > > > > > >>>>> > > > >> > > > > >> discuss that part for me. I buy it
> > that
> > > > > JBODs
> > > > > > > are
> > > > > > > > >>>>> good.
> > > > > > > > >>>>> > > > >> > > > > >>
> > > > > > > > >>>>> > > > >> > > > > >> I find the terminology can be
> > improved
> > > a
> > > > > bit.
> > > > > > > > >>>>> Ideally
> > > > > > > > >>>>> > we'd
> > > > > > > > >>>>> > > be
> > > > > > > > >>>>> > > > >> > > talking
> > > > > > > > >>>>> > > > >> > > > > >> about volumes, not disks. Just to
> > make
> > > it
> > > > > > clear
> > > > > > > > >>>>> that
> > > > > > > > >>>>> > Kafka
> > > > > > > > >>>>> > > > >> > > understand
> > > > > > > > >>>>> > > > >> > > > > >> volumes/directories, not individual
> > raw
> > > > > > disks.
> > > > > > > So
> > > > > > > > >>>>> by
> > > > > > > > >>>>> > > > >> > > > > >> "one-broker-per-few-disks" what I
> > mean
> > > is
> > > > > > that
> > > > > > > > the
> > > > > > > > >>>>> admin
> > > > > > > > >>>>> > > can
> > > > > > > > >>>>> > > > >> pool a
> > > > > > > > >>>>> > > > >> > > few
> > > > > > > > >>>>> > > > >> > > > > >> disks together to create a
> > > > volume/directory
> > > > > > and
> > > > > > > > >>>>> give that
> > > > > > > > >>>>> > > to
> > > > > > > > >>>>> > > > >> Kafka.
> > > > > > > > >>>>> > > > >> > > > > >>
> > > > > > > > >>>>> > > > >> > > > > >>
> > > > > > > > >>>>> > > > >> > > > > >> The kernel of my question will be
> > that
> > > > the
> > > > > > > admin
> > > > > > > > >>>>> already
> > > > > > > > >>>>> > > has
> > > > > > > > >>>>> > > > >> tools
> > > > > > > > >>>>> > > > >> > > to 1)
> > > > > > > > >>>>> > > > >> > > > > >> create volumes/directories from a
> > JBOD
> > > > and
> > > > > 2)
> > > > > > > > >>>>> start a
> > > > > > > > >>>>> > > broker
> > > > > > > > >>>>> > > > >> on a
> > > > > > > > >>>>> > > > >> > > desired
> > > > > > > > >>>>> > > > >> > > > > >> machine and 3) assign a broker
> > > resources
> > > > > > like a
> > > > > > > > >>>>> > directory.
> > > > > > > > >>>>> > > I
> > > > > > > > >>>>> > > > >> claim
> > > > > > > > >>>>> > > > >> > > that
> > > > > > > > >>>>> > > > >> > > > > >> those tools are sufficient to
> > optimise
> > > > > > resource
> > > > > > > > >>>>> > allocation.
> > > > > > > > >>>>> > > > I
> > > > > > > > >>>>> > > > >> > > understand
> > > > > > > > >>>>> > > > >> > > > > >> that a broker could manage point 3)
> > > > itself,
> > > > > > ie
> > > > > > > > >>>>> juggle the
> > > > > > > > >>>>> > > > >> > > directories. My
> > > > > > > > >>>>> > > > >> > > > > >> question is whether the complexity
> > > added
> > > > to
> > > > > > > Kafka
> > > > > > > > >>>>> is
> > > > > > > > >>>>> > > > justified.
> > > > > > > > >>>>> > > > >> > > > > >> Operationally it seems to me an
> admin
> > > > will
> > > > > > > still
> > > > > > > > >>>>> have to
> > > > > > > > >>>>> > do
> > > > > > > > >>>>> > > > >> all the
> > > > > > > > >>>>> > > > >> > > three
> > > > > > > > >>>>> > > > >> > > > > >> items above.
> > > > > > > > >>>>> > > > >> > > > > >>
> > > > > > > > >>>>> > > > >> > > > > >> Looking forward to the discussion
> > > > > > > > >>>>> > > > >> > > > > >> Thanks
> > > > > > > > >>>>> > > > >> > > > > >> Eno
> > > > > > > > >>>>> > > > >> > > > > >>
> > > > > > > > >>>>> > > > >> > > > > >>
> > > > > > > > >>>>> > > > >> > > > > >> > On 1 Feb 2017, at 17:21, Dong
> Lin <
> > > > > > > > >>>>> lindon...@gmail.com
> > > > > > > > >>>>> > >
> > > > > > > > >>>>> > > > >> wrote:
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > Hey Eno,
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > Thanks much for the review.
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > I think your suggestion is to
> split
> > > > disks
> > > > > > of
> > > > > > > a
> > > > > > > > >>>>> machine
> > > > > > > > >>>>> > > into
> > > > > > > > >>>>> > > > >> > > multiple
> > > > > > > > >>>>> > > > >> > > > > >> disk
> > > > > > > > >>>>> > > > >> > > > > >> > sets and run one broker per disk
> > set.
> > > > > Yeah
> > > > > > > this
> > > > > > > > >>>>> is
> > > > > > > > >>>>> > > similar
> > > > > > > > >>>>> > > > to
> > > > > > > > >>>>> > > > >> > > Colin's
> > > > > > > > >>>>> > > > >> > > > > >> > suggestion of
> one-broker-per-disk,
> > > > which
> > > > > we
> > > > > > > > have
> > > > > > > > >>>>> > > evaluated
> > > > > > > > >>>>> > > > at
> > > > > > > > >>>>> > > > >> > > LinkedIn
> > > > > > > > >>>>> > > > >> > > > > >> and
> > > > > > > > >>>>> > > > >> > > > > >> > considered it to be a good short
> > term
> > > > > > > approach.
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > As of now I don't think any of
> > these
> > > > > > approach
> > > > > > > > is
> > > > > > > > >>>>> a
> > > > > > > > >>>>> > better
> > > > > > > > >>>>> > > > >> > > alternative in
> > > > > > > > >>>>> > > > >> > > > > >> > the long term. I will summarize
> > these
> > > > > > here. I
> > > > > > > > >>>>> have put
> > > > > > > > >>>>> > > > these
> > > > > > > > >>>>> > > > >> > > reasons in
> > > > > > > > >>>>> > > > >> > > > > >> the
> > > > > > > > >>>>> > > > >> > > > > >> > KIP's motivation section and
> > rejected
> > > > > > > > alternative
> > > > > > > > >>>>> > > section.
> > > > > > > > >>>>> > > > I
> > > > > > > > >>>>> > > > >> am
> > > > > > > > >>>>> > > > >> > > happy to
> > > > > > > > >>>>> > > > >> > > > > >> > discuss more and I would
> certainly
> > > like
> > > > > to
> > > > > > > use
> > > > > > > > an
> > > > > > > > >>>>> > > > alternative
> > > > > > > > >>>>> > > > >> > > solution
> > > > > > > > >>>>> > > > >> > > > > >> that
> > > > > > > > >>>>> > > > >> > > > > >> > is easier to do with better
> > > > performance.
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. RAID-10: if we switch
> > from
> > > > > > RAID-10
> > > > > > > > >>>>> with
> > > > > > > > >>>>> > > > >> > > > > >> replication-factor=2 to
> > > > > > > > >>>>> > > > >> > > > > >> > JBOD with replication-factor=3, we
> > get
> > > > 25%
> > > > > > > > >>>>> reduction in
> > > > > > > > >>>>> > > disk
> > > > > > > > >>>>> > > > >> usage
> > > > > > > > >>>>> > > > >> > > and
> > > > > > > > >>>>> > > > >> > > > > >> > doubles the tolerance of broker
> > > failure
> > > > > > > before
> > > > > > > > >>>>> data
> > > > > > > > >>>>> > > > >> > > unavailability from
> > > > > > > > >>>>> > > > >> > > > > >> 1
> > > > > > > > >>>>> > > > >> > > > > >> > to 2. This is a pretty huge gain
> for
> > > any
> > > > > > > company
> > > > > > > > >>>>> that
> > > > > > > > >>>>> > uses
> > > > > > > > >>>>> > > > >> Kafka at
> > > > > > > > >>>>> > > > >> > > large
> > > > > > > > >>>>> > > > >> > > > > >> > scale.
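> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > To make the arithmetic explicit: RAID-10 with replication-factor=2 keeps
> > > > > > > > >>>>> > > > >> > > > > >> > 2 (mirror) x 2 (replicas) = 4 copies of every byte, while JBOD with
> > > > > > > > >>>>> > > > >> > > > > >> > replication-factor=3 keeps 3 copies, i.e. 3/4 of the raw disk usage,
> > > > > > > > >>>>> > > > >> > > > > >> > which is the 25% reduction; and with 3 replicas the data stays available
> > > > > > > > >>>>> > > > >> > > > > >> > after 2 broker failures instead of 1.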
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. one-broker-per-disk:
> The
> > > > > benefit
> > > > > > > of
> > > > > > > > >>>>> > > > >> > > one-broker-per-disk is
> > > > > > > > >>>>> > > > >> > > > > >> that
> > > > > > > > >>>>> > > > >> > > > > >> > no major code change is needed in
> > > > Kafka.
> > > > > > > Among
> > > > > > > > >>>>> the
> > > > > > > > >>>>> > > > >> disadvantage of
> > > > > > > > >>>>> > > > >> > > > > >> > one-broker-per-disk summarized in
> > the
> > > > KIP
> > > > > > and
> > > > > > > > >>>>> previous
> > > > > > > > >>>>> > > > email
> > > > > > > > >>>>> > > > >> with
> > > > > > > > >>>>> > > > >> > > Colin,
> > > > > > > > >>>>> > > > >> > > > > >> > the biggest one is the 15%
> > throughput
> > > > > loss
> > > > > > > > >>>>> compared to
> > > > > > > > >>>>> > > JBOD
> > > > > > > > >>>>> > > > >> and
> > > > > > > > >>>>> > > > >> > > less
> > > > > > > > >>>>> > > > >> > > > > >> > flexibility to balance across
> > disks.
> > > > > > Further,
> > > > > > > > it
> > > > > > > > >>>>> > probably
> > > > > > > > >>>>> > > > >> requires
> > > > > > > > >>>>> > > > >> > > > > >> change
> > > > > > > > >>>>> > > > >> > > > > >> > to internal deployment tools at
> > > various
> > > > > > > > >>>>> companies to
> > > > > > > > >>>>> > deal
> > > > > > > > >>>>> > > > >> with
> > > > > > > > >>>>> > > > >> > > > > >> > one-broker-per-disk setup.
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. RAID-0: This is the
> > setup
> > > > that
> > > > > > > used
> > > > > > > > at
> > > > > > > > >>>>> > > > Microsoft.
> > > > > > > > >>>>> > > > >> The
> > > > > > > > >>>>> > > > >> > > > > >> problem is
> > > > > > > > >>>>> > > > >> > > > > >> > that a broker becomes unavailable
> > if
> > > > any
> > > > > > disk
> > > > > > > > >>>>> fail.
> > > > > > > > >>>>> > > Suppose
> > > > > > > > >>>>> > > > >> > > > > >> > replication-factor=2 and there
> are
> > 10
> > > > > disks
> > > > > > > per
> > > > > > > > >>>>> > machine.
> > > > > > > > >>>>> > > > >> Then the
> > > > > > > > >>>>> > > > >> > > > > >> > probability of any message
> > becoming
> > > > > > > > >>>>> unavailable due
> > > > > > > > >>>>> > to
> > > > > > > > >>>>> > > > disk
> > > > > > > > >>>>> > > > >> > > failure
> > > > > > > > >>>>> > > > >> > > > > >> with
> > > > > > > > >>>>> > > > >> > > > > >> > RAID-0 is 100X higher than that
> > with
> > > > > JBOD.
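> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > As a quick sanity check on the 100X: let p be the probability that a
> > > > > > > > >>>>> > > > >> > > > > >> > given disk has failed. With RAID-0 a replica is offline if any of the 10
> > > > > > > > >>>>> > > > >> > > > > >> > disks on its broker fails (roughly 10p), so a partition with 2 replicas
> > > > > > > > >>>>> > > > >> > > > > >> > is unavailable with probability about (10p)^2 = 100p^2. With JBOD a
> > > > > > > > >>>>> > > > >> > > > > >> > replica only goes offline with its own disk (roughly p), so the partition
> > > > > > > > >>>>> > > > >> > > > > >> > is unavailable with probability about p^2, i.e. 100X lower (ignoring
> > > > > > > > >>>>> > > > >> > > > > >> > correlated failures).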
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > - JBOD vs.
> > one-broker-per-few-disks:
> > > > > > > > >>>>> > > > one-broker-per-few-disk
> > > > > > > > >>>>> > > > >> is
> > > > > > > > >>>>> > > > >> > > > > >> somewhere
> > > > > > > > >>>>> > > > >> > > > > >> > between one-broker-per-disk and
> > > RAID-0.
> > > > > So
> > > > > > it
> > > > > > > > >>>>> carries
> > > > > > > > >>>>> > an
> > > > > > > > >>>>> > > > >> averaged
> > > > > > > > >>>>> > > > >> > > > > >> > disadvantages of these two
> > > approaches.
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > To answer your question,
> > I
> > > > > think
> > > > > > it
> > > > > > > > is
> > > > > > > > >>>>> > > reasonable
> > > > > > > > >>>>> > > > >> to
> > > > > > > > >>>>> > > > >> > > manage
> > > > > > > > >>>>> > > > >> > > > > >> disks
> > > > > > > > >>>>> > > > >> > > > > >> > in Kafka. By "managing disks" we
> > mean
> > > > the
> > > > > > > > >>>>> management of
> > > > > > > > >>>>> > > > >> > > assignment of
> > > > > > > > >>>>> > > > >> > > > > >> > replicas across disks. Here are
> my
> > > > > reasons
> > > > > > in
> > > > > > > > >>>>> more
> > > > > > > > >>>>> > > detail:
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > - I don't think this KIP is a big
> > > step
> > > > > > > change.
> > > > > > > > By
> > > > > > > > >>>>> > > allowing
> > > > > > > > >>>>> > > > >> user to
> > > > > > > > >>>>> > > > >> > > > > >> > configure Kafka to run multiple
> log
> > > > > > > directories
> > > > > > > > >>>>> or
> > > > > > > > >>>>> > disks
> > > > > > > > >>>>> > > as
> > > > > > > > >>>>> > > > >> of
> > > > > > > > >>>>> > > > >> > > now, it
> > > > > > > > >>>>> > > > >> > > > > >> is
> > > > > > > > >>>>> > > > >> > > > > >> > implicit that Kafka manages
> disks.
> > It
> > > > is
> > > > > > just
> > > > > > > > >>>>> not a
> > > > > > > > >>>>> > > > complete
> > > > > > > > >>>>> > > > >> > > feature.
> > > > > > > > >>>>> > > > >> > > > > >> > Microsoft and probably other
> > > companies
> > > > > are
> > > > > > > > using
> > > > > > > > >>>>> this
> > > > > > > > >>>>> > > > feature
> > > > > > > > >>>>> > > > >> > > under the
> > > > > > > > >>>>> > > > >> > > > > >> > undesirable effect that a broker
> > will
> > > > > fail
> > > > > > > > >>>>> if any
> > > > > > > > >>>>> > > disk
> > > > > > > > >>>>> > > > >> fails.
> > > > > > > > >>>>> > > > >> > > It is
> > > > > > > > >>>>> > > > >> > > > > >> good
> > > > > > > > >>>>> > > > >> > > > > >> > to complete this feature.
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > - I think it is reasonable to
> > manage
> > > > disk
> > > > > > in
> > > > > > > > >>>>> Kafka. One
> > > > > > > > >>>>> > > of
> > > > > > > > >>>>> > > > >> the
> > > > > > > > >>>>> > > > >> > > most
> > > > > > > > >>>>> > > > >> > > > > >> > important work that Kafka is
> doing
> > is
> > > > to
> > > > > > > > >>>>> determine the
> > > > > > > > >>>>> > > > >> replica
> > > > > > > > >>>>> > > > >> > > > > >> assignment
> > > > > > > > >>>>> > > > >> > > > > >> > across brokers and make sure
> enough
> > > > > copies
> > > > > > > of a
> > > > > > > > >>>>> given
> > > > > > > > >>>>> > > > >> replica is
> > > > > > > > >>>>> > > > >> > > > > >> available.
> > > > > > > > >>>>> > > > >> > > > > >> > I would argue that it is not much
> > > > > different
> > > > > > > > than
> > > > > > > > >>>>> > > > determining
> > > > > > > > >>>>> > > > >> the
> > > > > > > > >>>>> > > > >> > > replica
> > > > > > > > >>>>> > > > >> > > > > >> > assignment across disk
> > conceptually.
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > - I would agree that this KIP
> > > > improves the
> > > > > > > > >>>>> performance of
> > > > > > > > >>>>> > > > >> Kafka at
> > > > > > > > >>>>> > > > >> > > the
> > > > > > > > >>>>> > > > >> > > > > >> cost
> > > > > > > > >>>>> > > > >> > > > > >> > of more complexity inside Kafka,
> by
> > > > > > switching
> > > > > > > > >>>>> from
> > > > > > > > >>>>> > > RAID-10
> > > > > > > > >>>>> > > > to
> > > > > > > > >>>>> > > > >> > > JBOD. I
> > > > > > > > >>>>> > > > >> > > > > >> would
> > > > > > > > >>>>> > > > >> > > > > >> > argue that this is a right
> > direction.
> > > > If
> > > > > we
> > > > > > > can
> > > > > > > > >>>>> gain
> > > > > > > > >>>>> > 20%+
> > > > > > > > >>>>> > > > >> > > performance by
> > > > > > > > >>>>> > > > >> > > > > >> > managing NIC in Kafka as compared
> > to
> > > > > > existing
> > > > > > > > >>>>> approach
> > > > > > > > >>>>> > > and
> > > > > > > > >>>>> > > > >> other
> > > > > > > > >>>>> > > > >> > > > > >> > alternatives, I would say we
> should
> > > > just
> > > > > do
> > > > > > > it.
> > > > > > > > >>>>> Such a
> > > > > > > > >>>>> > > gain
> > > > > > > > >>>>> > > > >> in
> > > > > > > > >>>>> > > > >> > > > > >> performance,
> > > > > > > > >>>>> > > > >> > > > > >> > or equivalently reduction in
> cost,
> > > can
> > > > > save
> > > > > > > > >>>>> millions of
> > > > > > > > >>>>> > > > >> dollars
> > > > > > > > >>>>> > > > >> > > per year
> > > > > > > > >>>>> > > > >> > > > > >> > for any company running Kafka at
> > > large
> > > > > > scale.
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > Thanks,
> > > > > > > > >>>>> > > > >> > > > > >> > Dong
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> > On Wed, Feb 1, 2017 at 5:41 AM,
> Eno
> > > > > > Thereska
> > > > > > > <
> > > > > > > > >>>>> > > > >> > > eno.there...@gmail.com>
> > > > > > > > >>>>> > > > >> > > > > >> wrote:
> > > > > > > > >>>>> > > > >> > > > > >> >
> > > > > > > > >>>>> > > > >> > > > > >> >> I'm coming somewhat late to the
> > > > > > discussion,
> > > > > > > > >>>>> apologies
> > > > > > > > >>>>> > > for
> > > > > > > > >>>>> > > > >> that.
> > > > > > > > >>>>> > > > >> > > > > >> >>
> > > > > > > > >>>>> > > > >> > > > > >> >> I'm worried about this proposal.
> > > It's
> > > > > > moving
> > > > > > > > >>>>> Kafka to
> > > > > > > > >>>>> > a
> > > > > > > > >>>>> > > > >> world
> > > > > > > > >>>>> > > > >> > > where it
> > > > > > > > >>>>> > > > >> > > > > >> >> manages disks. So in a sense,
> the
> > > > scope
> > > > > of
> > > > > > > the
> > > > > > > > >>>>> KIP is
> > > > > > > > >>>>> > > > >> limited,
> > > > > > > > >>>>> > > > >> > > but the
> > > > > > > > >>>>> > > > >> > > > > >> >> direction it sets for Kafka is
> > > quite a
> > > > > big
> > > > > > > > step
> > > > > > > > >>>>> > change.
> > > > > > > > >>>>> > > > >> > > Fundamentally
> > > > > > > > >>>>> > > > >> > > > > >> this
> > > > > > > > >>>>> > > > >> > > > > >> >> is about balancing resources
> for a
> > > > Kafka
> > > > > > > > >>>>> broker. This
> > > > > > > > >>>>> > > can
> > > > > > > > >>>>> > > > be
> > > > > > > > >>>>> > > > >> > > done by a
> > > > > > > > >>>>> > > > >> > > > > >> >> tool, rather than by changing
> > Kafka.
> > > > > E.g.,
> > > > > > > the
> > > > > > > > >>>>> tool
> > > > > > > > >>>>> > > would
> > > > > > > > >>>>> > > > >> take a
> > > > > > > > >>>>> > > > >> > > bunch
> > > > > > > > >>>>> > > > >> > > > > >> of
> > > > > > > > >>>>> > > > >> > > > > >> >> disks together, create a volume
> > over
> > > > > them
> > > > > > > and
> > > > > > > > >>>>> export
> > > > > > > > >>>>> > > that
> > > > > > > > >>>>> > > > >> to a
> > > > > > > > >>>>> > > > >> > > Kafka
> > > > > > > > >>>>> > > > >> > > > > >> broker
> > > > > > > > >>>>> > > > >> > > > > >> >> (in addition to setting the
> memory
> > > > > limits
> > > > > > > for
> > > > > > > > >>>>> that
> > > > > > > > >>>>> > > broker
> > > > > > > > >>>>> > > > or
> > > > > > > > >>>>> > > > >> > > limiting
> > > > > > > > >>>>> > > > >> > > > > >> other
> > > > > > > > >>>>> > > > >> > > > > >> >> resources). A different bunch of
> > > disks
> > > > > can
> > > > > > > > then
> > > > > > > > >>>>> make
> > > > > > > > >>>>> > up
> > > > > > > > >>>>> > > a
> > > > > > > > >>>>> > > > >> second
> > > > > > > > >>>>> > > > >> > > > > >> volume,
> > > > > > > > >>>>> > > > >> > > > > >> >> and be used by another Kafka
> > broker.
> > > > > This
> > > > > > is
> > > > > > > > >>>>> aligned
> > > > > > > > >>>>> > > with
> > > > > > > > >>>>> > > > >> what
> > > > > > > > >>>>> > > > >> > > Colin is
> > > > > > > > >>>>> > > > >> > > > > >> >> saying (as I understand it).
> > > > > > > > >>>>> > > > >> > > > > >> >>
> > > > > > > > >>>>> > > > >> > > > > >> >> Disks are not the only resource
> > on a
> > > > > > > machine,
> > > > > > > > >>>>> there
> > > > > > > > >>>>> > are
> > > > > > > > >>>>> > > > >> several
> > > > > > > > >>>>> > > > >> > > > > >> instances
> > > > > > > > >>>>> > > > >> > > > > >> >> where multiple NICs are used for
> > > > > example.
> > > > > > Do
> > > > > > > > we
> > > > > > > > >>>>> want
> > > > > > > > >>>>> > > fine
> > > > > > > > >>>>> > > > >> grained
> > > > > > > > >>>>> > > > >> > > > > >> >> management of all these
> resources?
> > > I'd
> > > > > > argue
> > > > > > > > >>>>> that
> > > > > > > > >>>>> > opens
> > > > > > > > >>>>> > > us
> > > > > > > > >>>>> > > > >> the
> > > > > > > > >>>>> > > > >> > > system
> > > > > > > > >>>>> > > > >> > > > > >> to a
> > > > > > > > >>>>> > > > >> > > > > >> >> lot of complexity.
> > > > > > > > >>>>> > > > >> > > > > >> >>
> > > > > > > > >>>>> > > > >> > > > > >> >> Thanks
> > > > > > > > >>>>> > > > >> > > > > >> >> Eno
> > > > > > > > >>>>> > > > >> > > > > >> >>
> > > > > > > > >>>>> > > > >> > > > > >> >>
> > > > > > > > >>>>> > > > >> > > > > >> >>> On 1 Feb 2017, at 01:53, Dong
> > Lin <
> > > > > > > > >>>>> > lindon...@gmail.com
> > > > > > > > >>>>> > > >
> > > > > > > > >>>>> > > > >> wrote:
> > > > > > > > >>>>> > > > >> > > > > >> >>>
> > > > > > > > >>>>> > > > >> > > > > >> >>> Hi all,
> > > > > > > > >>>>> > > > >> > > > > >> >>>
> > > > > > > > >>>>> > > > >> > > > > >> >>> I am going to initiate the vote
> > If
> > > > > there
> > > > > > is
> > > > > > > > no
> > > > > > > > >>>>> > further
> > > > > > > > >>>>> > > > >> concern
> > > > > > > > >>>>> > > > >> > > with
> > > > > > > > >>>>> > > > >> > > > > >> the
> > > > > > > > >>>>> > > > >> > > > > >> >> KIP.
> > > > > > > > >>>>> > > > >> > > > > >> >>>
> > > > > > > > >>>>> > > > >> > > > > >> >>> Thanks,
> > > > > > > > >>>>> > > > >> > > > > >> >>> Dong
> > > > > > > > >>>>> > > > >> > > > > >> >>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>
> > > > > > > > >>>>> > > > >> > > > > >> >>> On Fri, Jan 27, 2017 at 8:08
> PM,
> > > > radai
> > > > > <
> > > > > > > > >>>>> > > > >> > > radai.rosenbl...@gmail.com>
> > > > > > > > >>>>> > > > >> > > > > >> >> wrote:
> > > > > > > > >>>>> > > > >> > > > > >> >>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>> a few extra points:
> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>> 1. broker per disk might also
> > > incur
> > > > > more
> > > > > > > > >>>>> client <-->
> > > > > > > > >>>>> > > > >> broker
> > > > > > > > >>>>> > > > >> > > sockets:
> > > > > > > > >>>>> > > > >> > > > > >> >>>> suppose every producer /
> > consumer
> > > > > > "talks"
> > > > > > > to
> > > > > > > > >>>>> >1
> > > > > > > > >>>>> > > > partition,
> > > > > > > > >>>>> > > > >> > > there's a
> > > > > > > > >>>>> > > > >> > > > > >> >> very
> > > > > > > > >>>>> > > > >> > > > > >> >>>> good chance that partitions
> that
> > > > were
> > > > > > > > >>>>> co-located on
> > > > > > > > >>>>> > a
> > > > > > > > >>>>> > > > >> single
> > > > > > > > >>>>> > > > >> > > 10-disk
> > > > > > > > >>>>> > > > >> > > > > >> >> broker
> > > > > > > > >>>>> > > > >> > > > > >> >>>> would now be split between
> > several
> > > > > > > > single-disk
> > > > > > > > >>>>> > broker
> > > > > > > > >>>>> > > > >> > > processes on
> > > > > > > > >>>>> > > > >> > > > > >> the
> > > > > > > > >>>>> > > > >> > > > > >> >> same
> > > > > > > > >>>>> > > > >> > > > > >> >>>> machine. hard to put a
> > multiplier
> > > on
> > > > > > this,
> > > > > > > > but
> > > > > > > > >>>>> > likely
> > > > > > > > >>>>> > > > >x1.
> > > > > > > > >>>>> > > > >> > > sockets
> > > > > > > > >>>>> > > > >> > > > > >> are a
> > > > > > > > >>>>> > > > >> > > > > >> >>>> limited resource at the OS
> level
> > > and
> > > > > > incur
> > > > > > > > >>>>> some
> > > > > > > > >>>>> > memory
> > > > > > > > >>>>> > > > >> cost
> > > > > > > > >>>>> > > > >> > > (kernel
> > > > > > > > >>>>> > > > >> > > > > >> >>>> buffers)
> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>> 2. there's a memory overhead
> to
> > > > > spinning
> > > > > > > up
> > > > > > > > a
> > > > > > > > >>>>> JVM
> > > > > > > > >>>>> > > > >> (compiled
> > > > > > > > >>>>> > > > >> > > code and
> > > > > > > > >>>>> > > > >> > > > > >> >> byte
> > > > > > > > >>>>> > > > >> > > > > >> >>>> code objects etc). if we
> assume
> > > this
> > > > > > > > overhead
> > > > > > > > >>>>> is
> > > > > > > > >>>>> > ~300
> > > > > > > > >>>>> > > MB
> > > > > > > > >>>>> > > > >> > > (order of
> > > > > > > > >>>>> > > > >> > > > > >> >>>> magnitude, specifics vary)
> than
> > > > > spinning
> > > > > > > up
> > > > > > > > >>>>> 10 JVMs
> > > > > > > > >>>>> > > > would
> > > > > > > > >>>>> > > > >> lose
> > > > > > > > >>>>> > > > >> > > you 3
> > > > > > > > >>>>> > > > >> > > > > >> GB
> > > > > > > > >>>>> > > > >> > > > > >> >> of
> > > > > > > > >>>>> > > > >> > > > > >> >>>> RAM. not a ton, but non
> > > negligible.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>> 3. there would also be some
> > > overhead
> > > > > > > > >>>>> downstream of
> > > > > > > > >>>>> > > kafka
> > > > > > > > >>>>> > > > >> in any
> > > > > > > > >>>>> > > > >> > > > > >> >> management
> > > > > > > > >>>>> > > > >> > > > > >> >>>> / monitoring / log aggregation
> > > > system.
> > > > > > > > likely
> > > > > > > > >>>>> less
> > > > > > > > >>>>> > > than
> > > > > > > > >>>>> > > > >> x10
> > > > > > > > >>>>> > > > >> > > though.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>> 4. (related to above) - added
> > > > > complexity
> > > > > > > of
> > > > > > > > >>>>> > > > administration
> > > > > > > > >>>>> > > > >> > > with more
> > > > > > > > >>>>> > > > >> > > > > >> >>>> running instances.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>> is anyone running kafka with
> > > > anywhere
> > > > > > near
> > > > > > > > >>>>> 100GB
> > > > > > > > >>>>> > > heaps?
> > > > > > > > >>>>> > > > i
> > > > > > > > >>>>> > > > >> > > thought the
> > > > > > > > >>>>> > > > >> > > > > >> >> point
> > > > > > > > >>>>> > > > >> > > > > >> >>>> was to rely on kernel page
> cache
> > > to
> > > > do
> > > > > > the
> > > > > > > > >>>>> disk
> > > > > > > > >>>>> > > > buffering
> > > > > > > > >>>>> > > > >> ....
> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>> On Thu, Jan 26, 2017 at 11:00
> > AM,
> > > > Dong
> > > > > > > Lin <
> > > > > > > > >>>>> > > > >> > > lindon...@gmail.com>
> > > > > > > > >>>>> > > > >> > > > > >> wrote:
> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> Hey Colin,
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> Thanks much for the comment.
> > > Please
> > > > > see
> > > > > > > me
> > > > > > > > >>>>> comment
> > > > > > > > >>>>> > > > >> inline.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> On Thu, Jan 26, 2017 at 10:15
> > AM,
> > > > > Colin
> > > > > > > > >>>>> McCabe <
> > > > > > > > >>>>> > > > >> > > cmcc...@apache.org>
> > > > > > > > >>>>> > > > >> > > > > >> >>>> wrote:
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> On Wed, Jan 25, 2017, at
> > 13:50,
> > > > Dong
> > > > > > Lin
> > > > > > > > >>>>> wrote:
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> Hey Colin,
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> Good point! Yeah we have
> > > actually
> > > > > > > > >>>>> considered and
> > > > > > > > >>>>> > > > >> tested this
> > > > > > > > >>>>> > > > >> > > > > >> >>>> solution,
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> which we call
> > > > one-broker-per-disk.
> > > > > It
> > > > > > > > >>>>> would work
> > > > > > > > >>>>> > > and
> > > > > > > > >>>>> > > > >> should
> > > > > > > > >>>>> > > > >> > > > > >> require
> > > > > > > > >>>>> > > > >> > > > > >> >>>> no
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> major change in Kafka as
> > > compared
> > > > > to
> > > > > > > this
> > > > > > > > >>>>> JBOD
> > > > > > > > >>>>> > KIP.
> > > > > > > > >>>>> > > > So
> > > > > > > > >>>>> > > > >> it
> > > > > > > > >>>>> > > > >> > > would
> > > > > > > > >>>>> > > > >> > > > > >> be a
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> good
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> short term solution.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> But it has a few drawbacks
> > > which
> > > > > > makes
> > > > > > > it
> > > > > > > > >>>>> less
> > > > > > > > >>>>> > > > >> desirable in
> > > > > > > > >>>>> > > > >> > > the
> > > > > > > > >>>>> > > > >> > > > > >> long
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> term.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> Assume we have 10 disks on
> a
> > > > > machine.
> > > > > > > > Here
> > > > > > > > >>>>> are
> > > > > > > > >>>>> > the
> > > > > > > > >>>>> > > > >> problems:
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> Hi Dong,
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> Thanks for the thoughtful
> > reply.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> 1) Our stress test result
> > shows
> > > > > that
> > > > > > > > >>>>> > > > >> one-broker-per-disk
> > > > > > > > >>>>> > > > >> > > has 15%
> > > > > > > > >>>>> > > > >> > > > > >> >>>> lower
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> throughput
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> 2) Controller would need to
> > > send
> > > > > 10X
> > > > > > as
> > > > > > > > >>>>> many
> > > > > > > > >>>>> > > > >> > > LeaderAndIsrRequest,
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> MetadataUpdateRequest and
> > > > > > > > >>>>> StopReplicaRequest.
> > > > > > > > >>>>> > This
> > > > > > > > >>>>> > > > >> > > increases the
> > > > > > > > >>>>> > > > >> > > > > >> >>>> burden
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> on
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> controller which can be the
> > > > > > performance
> > > > > > > > >>>>> > bottleneck.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> Maybe I'm misunderstanding
> > > > > something,
> > > > > > > but
> > > > > > > > >>>>> there
> > > > > > > > >>>>> > > would
> > > > > > > > >>>>> > > > >> not be
> > > > > > > > >>>>> > > > >> > > 10x as
> > > > > > > > >>>>> > > > >> > > > > >> >>>> many
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> StopReplicaRequest RPCs,
> would
> > > > > there?
> > > > > > > The
> > > > > > > > >>>>> other
> > > > > > > > >>>>> > > > >> requests
> > > > > > > > >>>>> > > > >> > > would
> > > > > > > > >>>>> > > > >> > > > > >> >>>> increase
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> 10x, but from a pretty low
> > base,
> > > > > > right?
> > > > > > > > We
> > > > > > > > >>>>> are
> > > > > > > > >>>>> > not
> > > > > > > > >>>>> > > > >> > > reassigning
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> partitions all the time, I
> > hope
> > > > (or
> > > > > > else
> > > > > > > > we
> > > > > > > > >>>>> have
> > > > > > > > >>>>> > > > bigger
> > > > > > > > >>>>> > > > >> > > > > >> problems...)
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> I think the controller will
> > group
> > > > > > > > >>>>> > StopReplicaRequest
> > > > > > > > >>>>> > > > per
> > > > > > > > >>>>> > > > >> > > broker and
> > > > > > > > >>>>> > > > >> > > > > >> >> send
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> only one StopReplicaRequest
> to
> > a
> > > > > broker
> > > > > > > > >>>>> during
> > > > > > > > >>>>> > > > controlled
> > > > > > > > >>>>> > > > >> > > shutdown.
> > > > > > > > >>>>> > > > >> > > > > >> >>>> Anyway,
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> we don't have to worry about
> > this
> > > > if
> > > > > we
> > > > > > > > >>>>> agree that
> > > > > > > > >>>>> > > > other
> > > > > > > > >>>>> > > > >> > > requests
> > > > > > > > >>>>> > > > >> > > > > >> will
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> increase by 10X. One
> > > > MetadataRequest
> > > > > to
> > > > > > > > send
> > > > > > > > >>>>> to
> > > > > > > > >>>>> > each
> > > > > > > > >>>>> > > > >> broker
> > > > > > > > >>>>> > > > >> > > in the
> > > > > > > > >>>>> > > > >> > > > > >> >>>> cluster
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> every time there is
> leadership
> > > > > change.
> > > > > > I
> > > > > > > am
> > > > > > > > >>>>> not
> > > > > > > > >>>>> > sure
> > > > > > > > >>>>> > > > >> this is
> > > > > > > > >>>>> > > > >> > > a real
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> problem. But in theory this
> > makes
> > > > the
> > > > > > > > >>>>> overhead
> > > > > > > > >>>>> > > > complexity
> > > > > > > > >>>>> > > > >> > > O(number
> > > > > > > > >>>>> > > > >> > > > > >> of
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> broker) and may be a concern
> in
> > > the
> > > > > > > future.
> > > > > > > > >>>>> Ideally
> > > > > > > > >>>>> > > we
> > > > > > > > >>>>> > > > >> should
> > > > > > > > >>>>> > > > >> > > avoid
> > > > > > > > >>>>> > > > >> > > > > >> it.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> 3) Less efficient use of
> > > physical
> > > > > > > > resource
> > > > > > > > >>>>> on the
> > > > > > > > >>>>> > > > >> machine.
> > > > > > > > >>>>> > > > >> > > The
> > > > > > > > >>>>> > > > >> > > > > >> number
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> of
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> socket on each machine will
> > > > > increase
> > > > > > by
> > > > > > > > >>>>> 10X. The
> > > > > > > > >>>>> > > > >> number of
> > > > > > > > >>>>> > > > >> > > > > >> connection
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> between any two machine
> will
> > > > > increase
> > > > > > > by
> > > > > > > > >>>>> 100X.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> 4) Less efficient way to
> > > > manage
> > > > > > > > memory
> > > > > > > > >>>>> and
> > > > > > > > >>>>> > > quota.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> 5) Rebalance between
> > > > disks/brokers
> > > > > on
> > > > > > > the
> > > > > > > > >>>>> same
> > > > > > > > >>>>> > > > machine
> > > > > > > > >>>>> > > > >> will
> > > > > > > > >>>>> > > > >> > > be less
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> efficient
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> and less flexible. Broker
> has
> > > to
> > > > > read
> > > > > > > > data
> > > > > > > > >>>>> from
> > > > > > > > >>>>> > > > another
> > > > > > > > >>>>> > > > >> > > broker on
> > > > > > > > >>>>> > > > >> > > > > >> the
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> same
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> machine via socket. It is
> > also
> > > > > harder
> > > > > > > to
> > > > > > > > do
> > > > > > > > >>>>> > > automatic
> > > > > > > > >>>>> > > > >> load
> > > > > > > > >>>>> > > > >> > > balance
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> between
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> disks on the same machine
> in
> > > the
> > > > > > > future.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> I will put this and the
> > > > explanation
> > > > > > in
> > > > > > > > the
> > > > > > > > >>>>> > rejected
> > > > > > > > >>>>> > > > >> > > alternative
> > > > > > > > >>>>> > > > >> > > > > >> >>>>> section.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> I
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> have a few questions:
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> - Can you explain why this
> > > > solution
> > > > > > can
> > > > > > > > >>>>> help
> > > > > > > > >>>>> > avoid
> > > > > > > > >>>>> > > > >> > > scalability
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> bottleneck?
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> I actually think it will
> > > > exacerbate
> > > > > > the
> > > > > > > > >>>>> > scalability
> > > > > > > > >>>>> > > > >> problem
> > > > > > > > >>>>> > > > >> > > due
> > > > > > > > >>>>> > > > >> > > > > >> the
> > > > > > > > >>>>> > > > >> > > > > >> >>>> 2)
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> above.
> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> - Why can we push more RPC
> > with
> > > > > this
> > > > > > > > >>>>> solution?
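As a rough back-of-the-envelope illustration of the 10X / 100X numbers above
(assuming, say, 10 disks and therefore 10 brokers per machine in the
one-broker-per-disk deployment):

    brokers per machine:                    1 -> 10             (~10X sockets)
    broker pairs between any two machines:  1 -> 10 x 10 = 100  (~100X connections)

Every broker on machine A may need a replication connection to every broker
on machine B, which is where the quadratic blow-up comes from.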
> >
> > To really answer this question we'd have to take a deep dive into the
> > locking of the broker and figure out how effectively it can parallelize
> > truly independent requests. Almost every multithreaded process is going
> > to have shared state, like shared queues or shared sockets, that is
> > going to make scaling less than linear when you add disks or processors.
> > (And clearly, another option is to improve that scalability, rather than
> > going multi-process!)
> >
> Yeah I also think it is better to improve scalability inside the Kafka
> code if possible. I am not sure we currently have any scalability issue
> inside Kafka that cannot be removed without using multi-process.
>
> > > - It is true that a garbage collection in one broker would not affect
> > > others. But that is after every broker only uses 1/10 of the memory.
> > > Can we be sure that this will actually help performance?
> > >
> > The big question is, how much memory do Kafka brokers use now, and how
> > much will they use in the future? Our experience in HDFS was that once
> > you start getting more than 100-200GB Java heap sizes, full GCs start
> > taking minutes to finish when using the standard JVMs. That alone is a
> > good reason to go multi-process or consider storing more things off the
> > Java heap.
> >
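For example (sizes and paths below are illustrative only, not a
recommendation): with ten brokers on a ten-disk machine, each JVM can be
capped at a fraction of the heap a single large broker would need, and GC
pauses can be observed per broker with standard HotSpot options, e.g.

    export KAFKA_HEAP_OPTS="-Xms8g -Xmx8g"
    # plus standard GC logging flags such as:
    #   -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/path/to/gc-broker0.log

The underlying point stands either way: full-GC pause time tends to grow
with heap size, so smaller per-process heaps bound the worst-case pause.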
> I see. Now I agree one-broker-per-disk should be more efficient in terms
> of GC since each broker probably needs less than 1/10 of the memory
> available on a typical machine nowadays. I will remove this from the
> reasons for rejection.
>
> > Disk failure is the "easy" case. The "hard" case, which is unfortunately
> > also the much more common case, is disk misbehavior. Towards the end of
> > their lives, disks tend to start slowing down unpredictably. Requests
> > that would have completed immediately before start taking 20, 100, 500
> > milliseconds. Some files may be readable and other files may not be.
> > System calls hang, sometimes forever, and the Java process can't abort
> > them, because the hang is in the kernel. It is not fun when threads are
> > stuck in "D state"
> > (http://stackoverflow.com/questions/20423521/process-perminantly-stuck-on-d-state).
> > Even kill -9 cannot abort the thread then. Fortunately, this is rare.
> >
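(For what it's worth, threads stuck in uninterruptible sleep show up with a
"D" in the STAT column of ps, so a quick check on a suspect machine is
something like:

    ps -eo pid,stat,comm | awk '$2 ~ /^D/'

Anything that stays in that state across several samples is almost always
blocked in the kernel on I/O.)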
> I agree it is a harder problem and it is rare. We probably don't have to
> worry about it in this KIP since this issue is orthogonal to whether or
> not we use JBOD.
>
> > Another approach we should consider is for Kafka to implement its own
> > storage layer that would stripe across multiple disks. This wouldn't
> > have to be done at the block level, but could be done at the file level.
> > We could use consistent hashing to determine which disks a file should
> > end up on, for example.
> >
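For concreteness, here is one rough sketch of the file-level
consistent-hashing idea above (purely illustrative: the DiskRing class and
everything in it are made up for this example and are not part of the KIP
or of Kafka). Each log directory gets several points on a hash ring, and a
segment file is assigned to the first directory at or after the hash of its
name:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;
    import java.util.List;
    import java.util.TreeMap;

    public class DiskRing {
        // ring position -> log directory; several points per directory
        // smooth out the distribution
        private final TreeMap<Long, String> ring = new TreeMap<>();

        public DiskRing(List<String> logDirs, int pointsPerDir) {
            for (String dir : logDirs) {
                for (int i = 0; i < pointsPerDir; i++) {
                    ring.put(hash(dir + "#" + i), dir);
                }
            }
        }

        // Pick the log directory for a segment file: first ring point at or
        // after the file's hash, wrapping around to the start if needed.
        public String dirFor(String segmentFileName) {
            Long key = ring.ceilingKey(hash(segmentFileName));
            return ring.get(key != null ? key : ring.firstKey());
        }

        private static long hash(String s) {
            try {
                byte[] d = MessageDigest.getInstance("MD5")
                        .digest(s.getBytes(StandardCharsets.UTF_8));
                long h = 0;
                for (int i = 0; i < 8; i++) {
                    h = (h << 8) | (d[i] & 0xffL);
                }
                return h;
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }

        public static void main(String[] args) {
            DiskRing ring = new DiskRing(Arrays.asList("/data1", "/data2", "/data3"), 100);
            System.out.println(ring.dirFor("my-topic-0/00000000000000000000.log"));
        }
    }

A scheme like this keeps most of the mapping stable when a disk is added or
removed (only keys that hashed near the affected ring points move), which is
the usual reason to prefer consistent hashing over a plain modulo.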
> Are you suggesting that we should distribute the log, or log segments,
> across the disks of brokers? I am not sure if I fully understand this
> approach. My gut feel is that this would be a drastic solution that would
> require non-trivial design. While this may be useful to Kafka, I would
> prefer not to discuss this in detail in this thread unless you believe it
> is strictly superior to the design in this KIP in terms of solving our
> use-case.
>
> > best,
> > Colin
> >
> > > Thanks,
> > > Dong
> > >
> > > On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe <cmcc...@apache.org> wrote:
> > >
> > > > Hi Dong,
> > > >
> > > > Thanks for the writeup!  It's very interesting.
> > > >
> > > > I apologize in advance if this has been discussed somewhere else. But
> > > > I am curious if you have considered the solution of running multiple
> > > > brokers per node. Clearly there is a memory overhead with this
> > > > solution because of the fixed cost of starting multiple JVMs.
> > > > However, running multiple JVMs would help avoid scalability
> > > > bottlenecks. You could probably push more RPCs per second, for
> > > > example. A garbage collection in one broker would not affect the
> > > > others. It would be interesting to see this considered in the
> > > > "alternate designs" section, even if you end up deciding it's not the
> > > > way to go.
> > > >
> > > > best,
> > > > Colin
> > > >
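For reference, the multiple-brokers-per-node setup is mostly a deployment
exercise: each JVM gets its own broker.id, its own listener port, and its
own disk. A rough sketch of per-broker server.properties files (values are
illustrative only):

    # broker on /data0
    broker.id=100
    listeners=PLAINTEXT://:9092
    log.dirs=/data0/kafka-logs

    # broker on /data1
    broker.id=101
    listeners=PLAINTEXT://:9093
    log.dirs=/data1/kafka-logs

This is what buys the GC and lock isolation described above, at the price of
the fixed per-JVM overhead and the socket/connection growth discussed
earlier in the thread.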
> > > > On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
> > > > > Hi all,
> > > > >
> > > > > We created KIP-112: Handle disk failure for JBOD. Please find the
> > > > > KIP wiki in the link
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD.
> > > > >
> > > > > This KIP is related to KIP-113
> > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories>:
> > > > > Support replicas movement between log directories. They are needed
> > > > > in order to support JBOD in Kafka. Please help review the KIP. Your
> > > > > feedback is appreciated!
> > > > >
> > > > > Thanks,
> > > > > Dong
