[ 
https://issues.apache.org/jira/browse/KAFKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901142#comment-14901142
 ] 

Aditya Auradkar commented on KAFKA-1599:
----------------------------------------

[~anigam] - Perhaps you can write up your proposal here? Based on what the 
committers say, you can write a KIP if required.

> Change preferred replica election admin command to handle large clusters
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-1599
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1599
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.8.2.0
>            Reporter: Todd Palino
>            Assignee: Abhishek Nigam
>              Labels: newbie++
>
> We ran into a problem with a cluster that has 70k partitions where we could 
> not trigger a preferred replica election for all topics and partitions using 
> the admin tool. Upon investigation, it was determined that this was because 
> the JSON object that was being written to the admin znode to tell the 
> controller to start the election was 1.8 MB in size. As the default ZooKeeper 
> data size limit is 1 MB, and it is non-trivial to change, we should come up 
> with a better way to represent the list of topics and partitions for this 
> admin command.
> I have several thoughts on this so far:
> 1) Trigger the command for all topics and partitions with a JSON object that 
> does not include an explicit list of them (i.e. a flag that says "all 
> partitions")
> 2) Use a more compact JSON representation. Currently, the JSON contains a 
> 'partitions' key which holds a list of dictionaries that each have a 'topic' 
> and 'partition' key, and there must be one list item for each partition. This 
> results in a lot of repetition of key names that is unneeded. Changing this 
> to a format like this would be much more compact:
> {'topics': {'topicName1': [0, 1, 2, 3], 'topicName2': [0,1]}, 'version': 1}
> 3) Use a representation other than JSON. Strings are inefficient. A binary 
> format would be the most compact. This does put a greater burden on tools and 
> scripts that do not use the inbuilt libraries, but it is not too high.
> 4) Use a representation that involves multiple znodes. A structured tree in 
> the admin command would probably provide the most complete solution. However, 
> we would need to make sure to not exceed the data size limit with a wide tree 
> (the list of children for any single znode cannot exceed the ZK data size of 
> 1 MB)
> Obviously, there could be a combination of #1 with a change in the 
> representation, which would likely be appropriate as well.
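
To get a rough feel for how much option 2 could save, here is a minimal sketch that serializes the same partition list in both layouts and compares the encoded sizes. The topic names, the 700 x 100 cluster shape, and the helper names (verbose_payload, compact_payload, encoded_size) are all made up for illustration, and the "version" key in the per-partition layout is an assumption; only the two JSON shapes come from the description above. Real topic names will give different numbers.

```python
import json

# Approximate default znode data limit, per the issue description (1 MB).
ZK_LIMIT_BYTES = 1024 * 1024


def verbose_payload(assignments):
    """Per-partition layout: one {'topic', 'partition'} dict per partition."""
    return {
        "version": 1,  # assumed; the description only mentions the 'partitions' key
        "partitions": [
            {"topic": topic, "partition": partition}
            for topic, partitions in assignments.items()
            for partition in partitions
        ],
    }


def compact_payload(assignments):
    """Layout from option 2: each topic name maps to a list of partition ids."""
    return {
        "topics": {topic: list(partitions) for topic, partitions in assignments.items()},
        "version": 1,
    }


def encoded_size(payload):
    """Size in bytes of the payload as whitespace-free UTF-8 JSON."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


# Hypothetical cluster: 700 topics with 100 partitions each (70,000 partitions).
assignments = {f"topic-{i:04d}": range(100) for i in range(700)}

verbose_size = encoded_size(verbose_payload(assignments))
compact_size = encoded_size(compact_payload(assignments))

print(f"per-partition layout: {verbose_size} bytes")
print(f"per-topic layout:     {compact_size} bytes")
print(f"fits in 1 MB znode:   verbose={verbose_size <= ZK_LIMIT_BYTES}, "
      f"compact={compact_size <= ZK_LIMIT_BYTES}")
```

Even the compact layout grows with the number of topics and partitions, so it postpones the limit rather than removing it, which is why combining it with the "all partitions" flag from option 1 still looks worthwhile.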



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
