13.01.2017, 22:29, "Dong Lin" <lindon...@gmail.com>:
> Hey Alexey,
>
> Thanks for your review and the alternative approach. Here is my
> understanding of your patch. Kafka's background threads are used to move
> data between replicas. When data movement is triggered, the log will be
> rolled, the new logs will be put in the new directory, and background
> threads will move segments from the old directory to the new directory.
>
> It is important to note that KIP-112 is intended to work with KIP-113 to
> support JBOD. I think your solution is definitely simpler and better under
> the current Kafka implementation, where a broker fails if any disk fails.
> But I am not sure we want to allow a broker to run with a partial disk
> failure. Let's say a replica is being moved from log_dir_old to
> log_dir_new, and then log_dir_old stops working due to disk failure. How
> would your existing patch handle it? To make the scenario a bit more
We will lose log_dir_old. After a broker restart we can read the data from
log_dir_new.

> complicated, let's say the broker is shut down, log_dir_old's disk fails,
> and the broker starts. In this case the broker doesn't even know whether
> log_dir_new has all the data needed for this replica. It becomes a problem
> if the broker is elected leader of this partition in this case.

log_dir_new contains the most recent data, so we will lose the tail of the
partition. This is not a big problem for us because we already delete tails
by hand (see https://issues.apache.org/jira/browse/KAFKA-1712). Also, we
don't use automatic leader balancing (auto.leader.rebalance.enable=false),
so this partition becomes the leader with low probability. I think my patch
can be modified to prohibit electing the partition as leader until it has
been moved completely.

> The solution presented in the KIP attempts to handle it by replacing the
> replica in an atomic fashion after the log in the new dir has fully
> caught up with the log in the old dir. At any time the log can be
> considered to exist on only one log directory.

As I understand it, your solution does not cover quotas. What happens if
someone starts to transfer 100 partitions?

> If yes, it will read a ByteBufferMessageSet from topicPartition.log and
> append the message set to topicPartition.move

i.e. processPartitionData will read data from the beginning of
topicPartition.log? What is the read size? A ReplicaFetchThread reads many
partitions, so if one of them does complicated work (= reads a lot of data
from disk), everything will slow down. I think the read size should not be
very big.

On the other hand, at this point (processPartitionData) one could use only
the new data (the ByteBufferMessageSet from the parameters) and wait until
(topicPartition.move.smallestOffset <= topicPartition.log.smallestOffset
&& topicPartition.move.largestOffset == topicPartition.log.largestOffset).
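For illustration, the catch-up condition above can be modeled as a tiny check: the move log must retain at least as much history as the original log, and its tail must have reached the same largest offset before the atomic swap. This is a minimal sketch; `LogView` and its field names are invented here and do not correspond to real Kafka internals.

```python
# Hypothetical model of the catch-up check discussed in the thread.
# The names smallestOffset/largestOffset come from the email; LogView
# is an invented stand-in for topicPartition.log / topicPartition.move.
from dataclasses import dataclass

@dataclass
class LogView:
    smallest_offset: int  # first offset still retained on disk
    largest_offset: int   # offset of the most recently appended message

def caught_up(move: LogView, log: LogView) -> bool:
    """True once the move log covers everything the original log still
    retains, i.e. the atomic swap described in the KIP could happen."""
    return (move.smallest_offset <= log.smallest_offset
            and move.largest_offset == log.largest_offset)

# Still lagging: the mover has not yet copied the oldest retained segments.
print(caught_up(LogView(100, 500), LogView(50, 500)))  # False
# Fully caught up: same tail, and history reaches at least as far back.
print(caught_up(LogView(40, 500), LogView(50, 500)))   # True
```

Under this model, retention deleting old segments from topicPartition.log only makes the first inequality easier to satisfy, which matches the intuition that waiting on new data alone can eventually converge.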
In this case the write speed to topicPartition.move and topicPartition.log
will be the same, so this will allow us to move many partitions to one disk.

> And to answer your question, yes, topicPartition.log refers to
> topic-partition/segment.log.
>
> Thanks,
> Dong
>
> On Fri, Jan 13, 2017 at 4:12 AM, Alexey Ozeritsky <aozerit...@yandex.ru>
> wrote:
>
>> Hi,
>>
>> We have a similar solution that has been working in production since
>> 2014. You can see it here:
>> https://github.com/resetius/kafka/commit/20658593e246d2184906879defa2e763c4d413fb
>> The idea is very simple:
>> 1. The disk balancer runs in a separate thread inside the scheduler pool.
>> 2. It does not touch empty partitions.
>> 3. Before it moves a partition, it forcibly creates a new segment on the
>> destination disk.
>> 4. It moves segments one by one, from new to old.
>> 5. The Log class works with segments on both disks.
>>
>> Your approach seems too complicated; moreover, it means that you have to
>> patch different components of the system.
>> Could you clarify what you mean by topicPartition.log? Is it
>> topic-partition/segment.log?
>>
>> 12.01.2017, 21:47, "Dong Lin" <lindon...@gmail.com>:
>> > Hi all,
>> >
>> > We created KIP-113: Support replicas movement between log directories.
>> > Please find the KIP wiki in the link
>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories
>> >
>> > This KIP is related to KIP-112
>> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD>:
>> > Handle disk failure for JBOD. They are needed in order to support JBOD
>> > in Kafka. Please help review the KIP. Your feedback is appreciated!
>> >
>> > Thanks,
>> > Dong
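The segment-by-segment scheme quoted above (roll a fresh segment on the destination disk, then move the existing segments over one at a time while the log stays readable from both disks) can be sketched roughly as follows. This is an illustrative toy, not the linked patch: the file names and layout are invented, and the real implementation operates on Kafka's Log/LogSegment classes rather than raw files.

```python
# Toy sketch of the disk-balancer idea: first a fresh "active" segment is
# created on the destination disk (so new writes land there), then the old
# segments are moved one by one. Names like "active.log" are invented.
import os
import shutil

def move_partition(src_dir: str, dst_dir: str) -> None:
    os.makedirs(dst_dir, exist_ok=True)
    # Step 3: force a new segment on the destination disk up front.
    open(os.path.join(dst_dir, "active.log"), "a").close()
    # Step 4: move existing segments one at a time; between iterations the
    # partition spans both disks, which is what step 5 (the Log class
    # handling segments on both disks) has to support.
    for name in sorted(f for f in os.listdir(src_dir) if f.endswith(".log")):
        shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
```

Moving one segment per iteration is what keeps the per-step I/O bounded, which is also why quotas matter less in this design than in a fetcher-driven copy.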