13.01.2017, 22:29, "Dong Lin" <lindon...@gmail.com>:
> Hey Alexey,
>
> Thanks for your review and the alternative approach. Here is my
> understanding of your patch. Kafka's background threads are used to move
> data between replicas. When data movement is triggered, the log will be
> rolled, the new logs will be put in the new directory, and background
> threads will move segments from the old directory to the new directory.
>
> It is important to note that KIP-112 is intended to work with KIP-113 to
> support JBOD. I think your solution is definitely simpler and better under
> the current Kafka implementation, where a broker fails if any disk fails.
> But I am not sure we want to allow a broker to run with a partial disk
> failure. Let's say a replica is being moved from log_dir_old to
> log_dir_new, and then log_dir_old stops working due to disk failure. How
> would your existing patch handle it? To make the scenario a bit more
We will lose log_dir_old. After a broker restart we can read the data from
log_dir_new.

> complicated, let's say the broker is shut down, log_dir_old's disk fails,
> and the broker starts. In this case the broker doesn't even know whether
> log_dir_new has all the data needed for this replica. It becomes a problem
> if the broker is elected leader of this partition in this case.

log_dir_new contains the most recent data, so we will lose the tail of the
partition. This is not a big problem for us because we already delete tails
by hand (see https://issues.apache.org/jira/browse/KAFKA-1712). Also, we
don't use automatic leader balancing (auto.leader.rebalance.enable=false),
so this partition becomes the leader with low probability. I think my patch
can be modified to prohibit electing the partition as leader until it has
been moved completely.

> The solution presented in the KIP attempts to handle it by replacing the
> replica in an atomic fashion after the log in the new dir has fully
> caught up with the log in the old dir. At any time the log can be
> considered to exist on only one log directory.

As I understand it, your solution does not cover quotas. What happens if
someone starts to transfer 100 partitions?

> If yes, it will read a ByteBufferMessageSet from topicPartition.log and
> append the message set to topicPartition.move

i.e. processPartitionData will read data from the beginning of
topicPartition.log? What is the read size? A ReplicaFetchThread reads many
partitions, so if one of them does complicated work (= reads a lot of data
from disk), everything will slow down. I think the read size should not be
very big.

On the other hand, at this point (processPartitionData) one could use only
the new data (the ByteBufferMessageSet from the parameters) and wait until
(topicPartition.move.smallestOffset <= topicPartition.log.smallestOffset
&& topicPartition.move.largestOffset == topicPartition.log.largestOffset).
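For illustration, the catch-up condition above can be modeled as a tiny check: the move log must retain at least as much history as the original log, and its tail must have reached the same largest offset before the atomic swap. This is a minimal sketch; `LogView` and its field names are invented here and do not correspond to real Kafka internals.

```python
# Hypothetical model of the catch-up check discussed in the thread.
# The names smallestOffset/largestOffset come from the email; LogView
# is an invented stand-in for topicPartition.log / topicPartition.move.
from dataclasses import dataclass

@dataclass
class LogView:
    smallest_offset: int  # first offset still retained on disk
    largest_offset: int   # offset of the most recently appended message

def caught_up(move: LogView, log: LogView) -> bool:
    """True once the move log covers everything the original log still
    retains, i.e. the atomic swap described in the KIP could happen."""
    return (move.smallest_offset <= log.smallest_offset
            and move.largest_offset == log.largest_offset)

# Still lagging: the mover has not yet copied the oldest retained segments.
print(caught_up(LogView(100, 500), LogView(50, 500)))  # False
# Fully caught up: same tail, and history reaches at least as far back.
print(caught_up(LogView(40, 500), LogView(50, 500)))   # True
```

Under this model, retention deleting old segments from topicPartition.log only makes the first inequality easier to satisfy, which matches the intuition that waiting on new data alone can eventually converge.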
In this case the write speed to topicPartition.move and topicPartition.log
will be the same, so this will allow us to move many partitions to one disk.

> And to answer your question, yes, topicPartition.log refers to
> topic-partition/segment.log.
>
> Thanks,
> Dong
>
> On Fri, Jan 13, 2017 at 4:12 AM, Alexey Ozeritsky <aozerit...@yandex.ru>
> wrote:
>
>> Hi,
>>
>> We have a similar solution that has been working in production since
>> 2014. You can see it here:
>> https://github.com/resetius/kafka/commit/20658593e246d2184906879defa2e763c4d413fb
>> The idea is very simple:
>> 1. The disk balancer runs in a separate thread inside the scheduler pool.
>> 2. It does not touch empty partitions.
>> 3. Before it moves a partition, it forcibly creates a new segment on the
>> destination disk.
>> 4. It moves segments one by one, from new to old.
>> 5. The Log class works with segments on both disks.
>>
>> Your approach seems too complicated; moreover, it means that you have to
>> patch different components of the system.
>> Could you clarify what you mean by topicPartition.log? Is it
>> topic-partition/segment.log?
>>
>> 12.01.2017, 21:47, "Dong Lin" <lindon...@gmail.com>:
>> > Hi all,
>> >
>> > We created KIP-113: Support replicas movement between log directories.
>> > Please find the KIP wiki in the link
>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories
>> >
>> > This KIP is related to KIP-112
>> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD>:
>> > Handle disk failure for JBOD. They are needed in order to support JBOD
>> > in Kafka. Please help review the KIP. Your feedback is appreciated!
>> >
>> > Thanks,
>> > Dong
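The segment-by-segment scheme quoted above (roll a fresh segment on the destination disk, then move the existing segments over one at a time while the log stays readable from both disks) can be sketched roughly as follows. This is an illustrative toy, not the linked patch: the file names and layout are invented, and the real implementation operates on Kafka's Log/LogSegment classes rather than raw files.

```python
# Toy sketch of the disk-balancer idea: first a fresh "active" segment is
# created on the destination disk (so new writes land there), then the old
# segments are moved one by one. Names like "active.log" are invented.
import os
import shutil

def move_partition(src_dir: str, dst_dir: str) -> None:
    os.makedirs(dst_dir, exist_ok=True)
    # Step 3: force a new segment on the destination disk up front.
    open(os.path.join(dst_dir, "active.log"), "a").close()
    # Step 4: move existing segments one at a time; between iterations the
    # partition spans both disks, which is what step 5 (the Log class
    # handling segments on both disks) has to support.
    for name in sorted(f for f in os.listdir(src_dir) if f.endswith(".log")):
        shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
```

Moving one segment per iteration is what keeps the per-step I/O bounded, which is also why quotas matter less in this design than in a fetcher-driven copy.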