[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086772#comment-17086772 ]
GEORGE LI edited comment on KAFKA-4084 at 4/19/20, 7:28 AM:
------------------------------------------------------------

[~blodsbror] I am not very familiar with the 5.4 setup. Do you have the error message from the crash in the log? Is it missing the zkclient jar, like the one below?

{code}
$ ls -l zk*.jar
-rw-r--r-- 1 georgeli engineering 74589 Nov 18 18:21 zkclient-0.11.jar

$ jar tvf zkclient-0.11.jar
     0 Mon Nov 18 18:11:58 UTC 2019 META-INF/
  1135 Mon Nov 18 18:11:58 UTC 2019 META-INF/MANIFEST.MF
     0 Mon Nov 18 18:11:58 UTC 2019 org/
     0 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/
     0 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/
  3486 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/ContentWatcher.class
   263 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/DataUpdater.class
{code}

If this jar file was there before, please copy it back. I need to find out why it was missing after the build; it may be some dependency setup in gradle. I have also updated the [install doc|https://docs.google.com/document/d/14vlPkbaog_5Xdd-HB4vMRaQQ7Fq4SlxddULsvc3PlbY/edit] to use `./gradlew clean build -x test`.

Also, make sure the startup script for kafka is not hard-coding the 5.4 jars, but takes the jars from the lib classpath, e.g.:

{code}
/usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dlog4j.configuration=file:/etc/kafka/log4j.xml -Xms22G -Xmx22G -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:NewSize=16G -XX:MaxNewSize=16G -XX:InitiatingHeapOccupancyPercent=3 -XX:G1MixedGCCountTarget=1 -XX:G1HeapWastePercent=1 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -verbose:gc -Xloggc:/var/log/kafka/gc-kafka.log -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=29010 -Djava.rmi.server.hostname=kafka12345-dca4 -cp '.:/usr/share/kafka/lib/*' kafka.Kafka /etc/kafka/server.properties
{code}

If you give us more details, we can help more. Thanks.

Actually, I just patched and added back the zkclient libs for the gradle build. Please "git clone https://github.com/sql888/kafka.git" (or git pull) and try to build again. I suspect that was the issue. Otherwise, we need to see the errors of the crash from the kafka logs.


> automated leader rebalance causes replication downtime for clusters with too
> many partitions
> --------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4084
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4084
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Tom Crayford
>            Priority: Major
>              Labels: reliability
>             Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and
> you have a cluster with many partitions, there is a severe amount of
> replication downtime following a restart. This causes
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes
> leaders for *all* imbalanced partitions at once, instead of doing it
> gradually.
> This effectively stops all replica fetchers in the cluster
> (assuming there are enough imbalanced partitions), and restarts them. This
> can take minutes on busy clusters, during which no replication is happening
> and user data is at risk. Clients with {{acks=-1}} also see issues at this
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election
> manually. There is also a broker configuration “auto.leader.rebalance.enable”
> which you can set to have the broker automatically perform the PLE when
> needed. DO NOT USE THIS OPTION. There are serious performance issues when
> doing so, especially on larger clusters. It needs some development work that
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of
> partitions to do automated leader rebalancing for at once, and *stop* once
> that number of leader rebalances are in flight, until they're done. There may
> be better mechanisms, and I'd love to hear if anybody has any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
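
The fix floated in the issue description above (cap the number of leader rebalances in flight, and start more only once those are done) can be illustrated with a small sketch. This is not Kafka's controller code and not necessarily how the fix shipped; `rebalance_in_batches`, `max_in_flight`, and the blocking `elect_preferred_leaders` callback are all hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List


def rebalance_in_batches(
    imbalanced: Iterable[str],
    max_in_flight: int,
    elect_preferred_leaders: Callable[[List[str]], None],
) -> int:
    """Move leadership for imbalanced partitions in bounded batches.

    `elect_preferred_leaders` is a hypothetical blocking callback: it is
    assumed to return only after every partition in the batch has finished
    its leader change. Returns the number of batches issued.
    """
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")

    pending = deque(imbalanced)
    batches = 0
    while pending:
        # Take at most max_in_flight partitions, so only that many
        # replica fetchers are disturbed at any one time.
        size = min(max_in_flight, len(pending))
        batch = [pending.popleft() for _ in range(size)]
        elect_preferred_leaders(batch)  # assumed to block until the batch is done
        batches += 1
    return batches


if __name__ == "__main__":
    issued: List[List[str]] = []
    partitions = [f"topic-{i}" for i in range(5)]
    count = rebalance_in_batches(partitions, 2, issued.append)
    print(count, issued)
```

The point of the design is that replication stays up for every partition outside the current batch, at the cost of the full rebalance taking longer to complete.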