[ https://issues.apache.org/jira/browse/KAFKA-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001331#comment-14001331 ]

Sriharsha Chintalapani commented on KAFKA-1298:
-----------------------------------------------

[~nehanarkhede] Thanks for the details. I tested a few combinations with 
controlled shutdown enabled and controlled.shutdown.max.retries=100. On a 
single broker with the replication factor set to 1, exceptions are thrown, but 
they are caught in PartitionStateMachine.handleStateChange and controlled 
shutdown goes through without blocking.
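For reference, the broker settings I enabled for these tests, in 
server.properties (the retry count is just the value I tested with, not a 
recommended default):

    controlled.shutdown.enable=true
    controlled.shutdown.max.retries=100
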
1) KafkaApis.handleControlledShutdownRequest calls 
KafkaController.shutdownBroker, which checks whether the broker about to shut 
down is a leader (in this case it is) and invokes 
PartitionStateMachine.handleStateChange.
2) PartitionStateMachine.electLeaderForPartition calls 
ControlledShutdownLeaderSelector.selectLeader, which checks whether any other 
broker is available to take over leadership. In this case the candidate set is 
empty, so it throws a StateChangeFailedException, which is caught in 
handleStateChange and logged to state-change.log (see the sketch after this 
list).
3) KafkaController.shutdownBroker never sees the exception and goes ahead and 
returns a successful controlled shutdown response.
4) KafkaController then computes remainingPartitions using the predicate
       leaderIsrAndControllerEpoch.leaderAndIsr.leader == id &&
       controllerContext.partitionReplicaAssignment(topicAndPartition).size > 1
   so partitions with replication factor 1 are excluded and the shutdown is 
   reported as complete.
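To make the failure path in step 2 concrete, here is a simplified, 
self-contained sketch of the selection logic (the names mirror the Kafka 
classes, but this is not the actual source):

    object LeaderSelectorSketch {
      case class StateChangeFailedException(msg: String)
        extends RuntimeException(msg)

      // Pick a new leader from the in-sync replicas, excluding the broker
      // that is shutting down. With replication factor 1 the candidate set
      // is empty and we fail, mirroring step 2 above.
      def selectLeader(isr: Seq[Int], shuttingDownBroker: Int): Int = {
        val candidates = isr.filterNot(_ == shuttingDownBroker)
        candidates.headOption.getOrElse(throw StateChangeFailedException(
          s"No other replicas in ISR [${isr.mkString(",")}] besides " +
          s"shutting down broker $shuttingDownBroker"))
      }

      def main(args: Array[String]): Unit = {
        // Replication factor 1: the ISR holds only the shutting-down broker,
        // so the exception path that handleStateChange catches is taken.
        try selectLeader(isr = Seq(0), shuttingDownBroker = 0)
        catch {
          case e: StateChangeFailedException =>
            println(s"caught: ${e.getMessage}")
        }
      }
    }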
I tested in a multi-node environment and it behaves the same way. So with 
controlled shutdown enabled by default, it won't block at this point. But I'll 
send a patch to skip the leadership handoff altogether when the replication 
factor is 1; a sketch of that check follows.
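A rough sketch of what that patch would check (simplified stand-in types, not 
the actual controller code): during controlled shutdown, only attempt a 
leadership handoff for partitions that have another replica to fail over to.

    object ShutdownHandoffSketch {
      case class TopicAndPartition(topic: String, partition: Int)

      // partition -> (current leader, assigned replicas)
      type Assignment = Map[TopicAndPartition, (Int, Seq[Int])]

      // Partitions that still need a leader handoff before broker `id` can
      // shut down: it leads them AND replication factor is greater than 1,
      // mirroring the predicate in step 4 above.
      def partitionsNeedingHandoff(assignment: Assignment,
                                   id: Int): Set[TopicAndPartition] =
        assignment.collect {
          case (tp, (leader, replicas)) if leader == id && replicas.size > 1 =>
            tp
        }.toSet

      def main(args: Array[String]): Unit = {
        val assignment: Assignment = Map(
          TopicAndPartition("t1", 0) -> (0, Seq(0)),    // RF=1: skip handoff
          TopicAndPartition("t2", 0) -> (0, Seq(0, 1))) // RF=2: needs handoff
        // Only t2-0 should be reported as needing a handoff.
        println(partitionsNeedingHandoff(assignment, id = 0))
      }
    }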



> Controlled shutdown tool doesn't seem to work out of the box
> ------------------------------------------------------------
>
>                 Key: KAFKA-1298
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1298
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jay Kreps
>              Labels: usability
>
> Download Kafka and try to use our shutdown tool. Got this:
> bin/kafka-run-class.sh kafka.admin.ShutdownBroker --zookeeper localhost:2181 
> --broker 0
> [2014-03-06 16:58:23,636] ERROR Operation failed due to controller failure 
> (kafka.admin.ShutdownBroker$)
> java.io.IOException: Failed to retrieve RMIServer stub: 
> javax.naming.ServiceUnavailableException [Root exception is 
> java.rmi.ConnectException: Connection refused to host: 
> jkreps-mn.linkedin.biz; nested exception is: 
>       java.net.ConnectException: Connection refused]
>       at 
> javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:340)
>       at 
> javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:249)
>       at 
> kafka.admin.ShutdownBroker$.kafka$admin$ShutdownBroker$$invokeShutdown(ShutdownBroker.scala:56)
>       at kafka.admin.ShutdownBroker$.main(ShutdownBroker.scala:109)
>       at kafka.admin.ShutdownBroker.main(ShutdownBroker.scala)
> Caused by: javax.naming.ServiceUnavailableException [Root exception is 
> java.rmi.ConnectException: Connection refused to host: 
> jkreps-mn.linkedin.biz; nested exception is: 
>       java.net.ConnectException: Connection refused]
>       at 
> com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)
>       at 
> com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:185)
>       at javax.naming.InitialContext.lookup(InitialContext.java:392)
>       at 
> javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1888)
>       at 
> javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1858)
>       at 
> javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:257)
>       ... 4 more
> Caused by: java.rmi.ConnectException: Connection refused to host: 
> jkreps-mn.linkedin.biz; nested exception is: 
>       java.net.ConnectException: Connection refused
>       at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:601)
>       at 
> sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:198)
>       at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)
>       at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322)
>       at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
>       at 
> com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:97)
>       ... 9 more
> Caused by: java.net.ConnectException: Connection refused
>       at java.net.PlainSocketImpl.socketConnect(Native Method)
>       at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:382)
>       at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:241)
>       at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:228)
>       at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:431)
>       at java.net.Socket.connect(Socket.java:527)
>       at java.net.Socket.connect(Socket.java:476)
>       at java.net.Socket.<init>(Socket.java:373)
>       at java.net.Socket.<init>(Socket.java:187)
>       at 
> sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:22)
>       at 
> sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:128)
>       at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:595)
>       ... 14 more
> Oh god, RMI?????!!!???
> Presumably this is because we stopped setting the JMX port by default. This 
> is good because setting the JMX port breaks the quickstart which requires 
> running multiple nodes on a single machine. The root cause imo is just using 
> RMI here instead of our regular RPC.
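
For context on the stack trace quoted above: the old ShutdownBroker tool talks 
to the broker over JMX/RMI, roughly like the sketch below (simplified; the 
real tool reads the host and JMX port from ZooKeeper, and the MBean name here 
is a placeholder, not verified against the Kafka source). With no JMX port 
exported by the broker, the connect call fails with exactly the 
ConnectException shown.

    import javax.management.ObjectName
    import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

    object JmxConnectSketch {
      def main(args: Array[String]): Unit = {
        val host = "localhost" // example value
        val port = 9999        // example: whatever JMX port the broker exports
        // The RMI lookup behind this connect is what throws
        // java.rmi.ConnectException when the broker has no JMX port set.
        val url = new JMXServiceURL(
          s"service:jmx:rmi:///jndi/rmi://$host:$port/jmxrmi")
        val connector = JMXConnectorFactory.connect(url)
        try {
          val mbsc = connector.getMBeanServerConnection
          // The tool then invokes a controller MBean operation; this
          // ObjectName is illustrative only.
          val name = new ObjectName("kafka.controller:type=KafkaController")
          println(s"connected; MBean registered: ${mbsc.isRegistered(name)}")
        } finally connector.close()
      }
    }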



--
This message was sent by Atlassian JIRA
(v6.2#6252)
