Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Dong Lin Mon, 07 Aug 2017 10:51:12 -0700

Hey Ismael,

Thanks much for your comments. Please see my reply inline.


On Mon, Aug 7, 2017 at 5:28 AM, Ismael Juma <ism...@juma.me.uk> wrote:

> Hi Dong,
>
> Thanks for the explanation. Comments inline.
>
> On Fri, Aug 4, 2017 at 6:47 PM, Dong Lin <lindon...@gmail.com> wrote:
>
> > 1. Yes it has been considered. Here are the reasons why we don't do it
> > through controller.
> >
> > - There can be use-cases where we only want to rebalance the load of log
> > directories on a given broker. It seems unnecessary to go through
> > controller in this case.
> >
>
> Even though this is true, not sure how common it will be.
>

I think the frequency of the need to balance across disks on the same
broker will be considerably higher (e.g. 2X or more) than the frequency
needed to balance across brokers. This is because the underlying replica
has the same size distribution but the capacity of broker can be 10X as
much as the capacity of disk.

I don't think this is a strong argument for having this logic only in the
tool instead of controller. It is a nice to have feature if there is no
strong reason to do it in controller.


>
>  - If controller is responsible for sending ChangeReplicaDirRequest, and if
> > the user-specified log directory is either invalid or offline, then
> > controller probably needs a way to tell user that the partition
> > reassignment has failed. We currently don't have a way to do this since
> > kafka-reassign-partition.sh simply creates the reassignment znode without
> > waiting for response. I am not sure that is a good solution to this.
> >
>
> Since the JSON is provided by the user, we would ideally validate its
> contents before storing it. Why are the log directories different than the
> other information in the JSON?


I think there are two difference between the log directory field and other
fields in the JSON:

- The log directory field take much more bytes than other fields in the
reassignment znode. Due to the 1MB size limit of znode, Kafka admin
currently have to split a large reassignment into multiple smaller
reassignment which limits the number of partitions that can be moved
concurrently. Currently the reassignment znode has 1 integer for each
replica. The log directory will introduce probably 10+ characters for each
replica. This can significantly lower the number of partitions that can be
reassigned at the same time.

- All other fields in the reassignment znode can be found and verified by
the other znodes in the zookeeper. Thus controller only needs to register a
ZK listener to be notified once the reassignment completes. However, the
log directory of each replica is not in the zookeeper. The controller would
have to periodically sending DescribeDirsRequet to check whether the
replica has been successfully moved to the destination log directory.
Currently there is nothing like this in the controller logic. Ideally we
want to avoid adding this complexity and performance overhead in controller.



> - If controller is responsible for sending ChangeReplicaDirRequest, the
> > controller logic would be more complicated because controller needs to
> > first send ChangeReplicaRequest so that the broker memorize the partition
> > -> log directory mapping, send LeaderAndIsrRequest, and keep sending
> > ChangeReplicaDirRequest (just in case broker restarted) until replica is
> > created. Note that the last step needs repeat and timeout as the proposed
> > in the KIP-113.
> >
> > Overall I think this adds quite a bit complexity to controller and we
> > probably want to do this only if there is strong clear of doing so.
> > Currently in KIP-113 the kafka-reassign-partitions.sh is responsible for
> > sending ChangeReplicaDirRequest with repeat and provides error to user if
> > it either fails or timeout. It seems to be much simpler and user
> shouldn't
> > care whether it is done through controller.
> >
>
> If I understand correctly, the logic is the same in both cases, it's just a
> question of where it lives. The advantage of placing it in the Controller
> is that the whole reassignment logic is in one place (instead of split
> between the tool and the Controller). That seems easier to reason about.
>

It seems that the main motivation for putting this logic in controller is
to simplify the work for Kafka developer. I agree it is desirable to put
the logic in the same place. On the other hand we developer also want to
keep controller simple and efficient.

I actually did this in the original design but later I was convinced by Jun
that it is simpler to put the new logic in the reassignment tool. I think
we can put this logic in controller if we can find good solution to the
problems described above.


>
> Also, how do you think things would work in the context of KIP-179? Would
> the tool still invoke these requests or would it be done by the broker
> receiving the alterTopics/reassignPartitions protocol call?
>

My gut feel is that the tool will still invoke these requests. But I have a
few questions to KIP-179 before I can answer this question. For example, is
AlterTopicsRequest request sent to controller only? If the new assignment
is not written in zookeeper, how is this information propagated to the new
controller if the previous controller dies after it receives
AlterTopicsRequest but before it sends LeaderAndIsrRequest? I can post
these questions in that discussion thread later.


>
> And thanks for the suggestion. I will add this to the Rejected Alternative
> > Section in the KIP-113.
> >
> > 2) I think user needs to be able to specify different log directories for
> > the replicas of the same partition in order to rebalance load across log
> > directories of all brokers. I am not sure I understand the question. Can
> > you explain a bit more why "that the log directory has to be the same for
> > all replicas of a given partition"?
>
>
> I think I misunderstood the schema. The KIP has the following example:
>
> "partitions" : [
>     {
>       "topic" : str,
>       "partition" : int,
>       "replicas" : [int],
>       "log_dirs" : [str]    <-- NEW. A log directory can be either "any",
> or a valid absolute path that begins with '/'. This is an optional filed.
> It is treated as an array of "any" if this field is not explicitly
> specified in the json file.
>     },
>     ...
>   ]
>
> Is it right that `log_dirs` is an array in the same order as `replicas`?
> That's a bit obscure and we should document it more clearly. Did we discard
> the option of a more readable schema (i.e. a JSON object mapping a replica
> id to a log dir) due to efficiency concerns? It would be good to include
> that in the KIP.
>

Yes, we expect log_dirs to be in the same order as `replicas`. I agree that
the new format is a bit obscure. But the map from replica to log directory
seems a bit worse because it doesn't preserve the order of replicas of the
partition. We need to specify the new replicas in an ordered list because
it tells controller which one is the preferred leader. I am open to any
other format that is less obscure while being explicit about the order of
replicas



>
> Thanks,
> Ismael
>

Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Reply via email to