Hey Ismael, Thanks much for your comments. Please see my reply inline.
On Mon, Aug 7, 2017 at 5:28 AM, Ismael Juma <ism...@juma.me.uk> wrote: > Hi Dong, > > Thanks for the explanation. Comments inline. > > On Fri, Aug 4, 2017 at 6:47 PM, Dong Lin <lindon...@gmail.com> wrote: > > > 1. Yes it has been considered. Here are the reasons why we don't do it > > through controller. > > > > - There can be use-cases where we only want to rebalance the load of log > > directories on a given broker. It seems unnecessary to go through > > controller in this case. > > > > Even though this is true, not sure how common it will be. > I think the frequency of the need to balance across disks on the same broker will be considerably higher (e.g. 2X or more) than the frequency needed to balance across brokers. This is because the underlying replica has the same size distribution but the capacity of broker can be 10X as much as the capacity of disk. I don't think this is a strong argument for having this logic only in the tool instead of controller. It is a nice to have feature if there is no strong reason to do it in controller. > > - If controller is responsible for sending ChangeReplicaDirRequest, and if > > the user-specified log directory is either invalid or offline, then > > controller probably needs a way to tell user that the partition > > reassignment has failed. We currently don't have a way to do this since > > kafka-reassign-partition.sh simply creates the reassignment znode without > > waiting for response. I am not sure that is a good solution to this. > > > > Since the JSON is provided by the user, we would ideally validate its > contents before storing it. Why are the log directories different than the > other information in the JSON? I think there are two difference between the log directory field and other fields in the JSON: - The log directory field take much more bytes than other fields in the reassignment znode. Due to the 1MB size limit of znode, Kafka admin currently have to split a large reassignment into multiple smaller reassignment which limits the number of partitions that can be moved concurrently. Currently the reassignment znode has 1 integer for each replica. The log directory will introduce probably 10+ characters for each replica. This can significantly lower the number of partitions that can be reassigned at the same time. - All other fields in the reassignment znode can be found and verified by the other znodes in the zookeeper. Thus controller only needs to register a ZK listener to be notified once the reassignment completes. However, the log directory of each replica is not in the zookeeper. The controller would have to periodically sending DescribeDirsRequet to check whether the replica has been successfully moved to the destination log directory. Currently there is nothing like this in the controller logic. Ideally we want to avoid adding this complexity and performance overhead in controller. > - If controller is responsible for sending ChangeReplicaDirRequest, the > > controller logic would be more complicated because controller needs to > > first send ChangeReplicaRequest so that the broker memorize the partition > > -> log directory mapping, send LeaderAndIsrRequest, and keep sending > > ChangeReplicaDirRequest (just in case broker restarted) until replica is > > created. Note that the last step needs repeat and timeout as the proposed > > in the KIP-113. > > > > Overall I think this adds quite a bit complexity to controller and we > > probably want to do this only if there is strong clear of doing so. > > Currently in KIP-113 the kafka-reassign-partitions.sh is responsible for > > sending ChangeReplicaDirRequest with repeat and provides error to user if > > it either fails or timeout. It seems to be much simpler and user > shouldn't > > care whether it is done through controller. > > > > If I understand correctly, the logic is the same in both cases, it's just a > question of where it lives. The advantage of placing it in the Controller > is that the whole reassignment logic is in one place (instead of split > between the tool and the Controller). That seems easier to reason about. > It seems that the main motivation for putting this logic in controller is to simplify the work for Kafka developer. I agree it is desirable to put the logic in the same place. On the other hand we developer also want to keep controller simple and efficient. I actually did this in the original design but later I was convinced by Jun that it is simpler to put the new logic in the reassignment tool. I think we can put this logic in controller if we can find good solution to the problems described above. > > Also, how do you think things would work in the context of KIP-179? Would > the tool still invoke these requests or would it be done by the broker > receiving the alterTopics/reassignPartitions protocol call? > My gut feel is that the tool will still invoke these requests. But I have a few questions to KIP-179 before I can answer this question. For example, is AlterTopicsRequest request sent to controller only? If the new assignment is not written in zookeeper, how is this information propagated to the new controller if the previous controller dies after it receives AlterTopicsRequest but before it sends LeaderAndIsrRequest? I can post these questions in that discussion thread later. > > And thanks for the suggestion. I will add this to the Rejected Alternative > > Section in the KIP-113. > > > > 2) I think user needs to be able to specify different log directories for > > the replicas of the same partition in order to rebalance load across log > > directories of all brokers. I am not sure I understand the question. Can > > you explain a bit more why "that the log directory has to be the same for > > all replicas of a given partition"? > > > I think I misunderstood the schema. The KIP has the following example: > > "partitions" : [ > { > "topic" : str, > "partition" : int, > "replicas" : [int], > "log_dirs" : [str] <-- NEW. A log directory can be either "any", > or a valid absolute path that begins with '/'. This is an optional filed. > It is treated as an array of "any" if this field is not explicitly > specified in the json file. > }, > ... > ] > > Is it right that `log_dirs` is an array in the same order as `replicas`? > That's a bit obscure and we should document it more clearly. Did we discard > the option of a more readable schema (i.e. a JSON object mapping a replica > id to a log dir) due to efficiency concerns? It would be good to include > that in the KIP. > Yes, we expect log_dirs to be in the same order as `replicas`. I agree that the new format is a bit obscure. But the map from replica to log directory seems a bit worse because it doesn't preserve the order of replicas of the partition. We need to specify the new replicas in an ordered list because it tells controller which one is the preferred leader. I am open to any other format that is less obscure while being explicit about the order of replicas > > Thanks, > Ismael >