[ https://issues.apache.org/jira/browse/GIRAPH-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232012#comment-13232012 ]

Avery Ching commented on GIRAPH-154:
------------------------------------

Nice work, Zhiwei (+1). I verified it as well and committed it. I will close this once Hudson verifies it too.

> Worker ports are not synched properly with its peers
> ----------------------------------------------------
>
> Key: GIRAPH-154
> URL: https://issues.apache.org/jira/browse/GIRAPH-154
> Project: Giraph
> Issue Type: Bug
> Components: bsp
> Affects Versions: 0.2.0
> Reporter: Zhiwei Gu
> Assignee: Zhiwei Gu
> Attachments: GIRAPH-154.patch
>
>
> When a worker tries multiple ports to set up the RPC server, the final port is
> not synched with its peer workers properly, which results in peer workers
> sending messages to the default port.
> Here are some logs:
> ############################################################################
> Base port: 34900
> ############################################################################
> ############################################################################
> log for worker 161:
> ############################################################################
> IPC Server handler 98 on 36061: starting
> BasicRPCCommunications: Started RPC communication server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:36061 with 100 handlers and 199 flush threads on bind attempt 1
> IPC Server handler 99 on 36061: starting
> setup: Registering health of this worker...
> getJobState: Job state already exists (/_hadoopBsp/job_201203130609_14838/_masterJobState)
> getApplicationAttempt: Node /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir already exists!
> getApplicationAttempt: Node /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir already exists!
> registerHealth: Created my health node for attempt=0, superstep=-1 with /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir/gsta32085.tan.ygrid.yahoo.com_161 and workerInfo= Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, port=35061)
> process: partitionAssignmentsReadyChanged (partitions are assigned)
> startSuperstep: Ready for computation on superstep -1 since worker selection and vertex range assignments are done in /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir/0/_superstepDir/-1/_partitionAssignments
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 0 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 1 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 2 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 3 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 4 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 5 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 6 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 7 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 8 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 9 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 10 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 11 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 12 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 13 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 14 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 15 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 16 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 17 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 18 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 19 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 20 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 21 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 22 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 23 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 24 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 25 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 26 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 27 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 28 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 29 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 30 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 31 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 32 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 33 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 34 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 35 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 36 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 37 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 38 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 39 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 40 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 41 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 42 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 43 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 44 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 45 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 46 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 47 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 48 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 49 time(s).
> PriviledgedActionException as:job_201203130609_14838 (auth:SIMPLE) cause:java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception: java.net.ConnectException: Connection refused
> connectAllRPCProxys: Failed on attempt 0 of 5 to connect to (id=33,cur=Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, port=35061),prev=null,ckpt_file=null)
> java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception: java.net.ConnectException: Connection refused
>     at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1071)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>     at $Proxy8.getProtocolVersion(Unknown Source)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
>     at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:159)
>     at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:155)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
>     at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:153)
>     at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:51)
>     at org.apache.giraph.comm.BasicRPCCommunications.startPeerConnectionThread(BasicRPCCommunications.java:599)
>     at org.apache.giraph.comm.BasicRPCCommunications.connectAllRPCProxys(BasicRPCCommunications.java:542)
>     at org.apache.giraph.comm.BasicRPCCommunications.setup(BasicRPCCommunications.java:513)
>     at org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:550)
>     at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:458)
>     at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:630)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
>     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
>     at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1046)
>     ... 25 more
> ############################################################################
> log for worker 154
> ############################################################################
> PriviledgedActionException as:job_201203130609_14838 (auth:SIMPLE) cause:java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception: java.net.ConnectException: Connection refused
> connectAllRPCProxys: Failed on attempt 4 of 5 to connect to (id=33,cur=Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, port=35061),prev=null,ckpt_file=null)
> java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception: java.net.ConnectException: Connection refused
>     at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1071)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>     at $Proxy8.getProtocolVersion(Unknown Source)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
>     at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:159)
>     at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:155)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
>     at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:153)
>     at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:51)
>     at org.apache.giraph.comm.BasicRPCCommunications.startPeerConnectionThread(BasicRPCCommunications.java:599)
>     at org.apache.giraph.comm.BasicRPCCommunications.connectAllRPCProxys(BasicRPCCommunications.java:542)
>     at org.apache.giraph.comm.BasicRPCCommunications.setup(BasicRPCCommunications.java:513)
>     at org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:550)
>     at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:458)
>     at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:630)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
>     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
>     at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1046)
>     ... 25 more
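
For readers following the quoted logs: worker 161's RPC server started on port 36061 "on bind attempt 1", yet its health node advertised port=35061, so every peer (and the worker itself) kept getting "Connection refused" on 35061. The Java sketch below illustrates the general remedy of advertising the port that was actually bound rather than the first candidate. It is not the GIRAPH-154 patch and does not use Giraph or Hadoop APIs; the names PortSyncSketch, BoundServer, bindWithRetry and workerInfo are hypothetical. The first candidate (34900 + 161) and the 1000-port step between attempts are assumptions inferred from the numbers in the log.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical sketch, not Giraph code: bind with retries, then advertise the bound port. */
final class PortSyncSketch {

    /** An open server socket together with the port that peers must be told about. */
    static final class BoundServer implements AutoCloseable {
        final ServerSocket socket;
        final int port;        // port actually bound; this is the value to advertise
        final int bindAttempt; // 0 means the first candidate worked

        BoundServer(ServerSocket socket, int bindAttempt) {
            this.socket = socket;
            this.port = socket.getLocalPort();
            this.bindAttempt = bindAttempt;
        }

        @Override
        public void close() throws IOException {
            socket.close();
        }
    }

    /** Tries firstPort, firstPort + portIncrement, ... until one bind succeeds. */
    static BoundServer bindWithRetry(int firstPort, int portIncrement, int maxAttempts)
            throws IOException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int candidate = firstPort + attempt * portIncrement;
            if (candidate > 65535) {
                break; // ran out of valid TCP port numbers
            }
            ServerSocket socket = new ServerSocket(); // created unbound
            try {
                socket.bind(new InetSocketAddress(candidate));
                return new BoundServer(socket, attempt);
            } catch (IOException e) {
                socket.close(); // candidate unavailable; move on to the next one
            }
        }
        throw new BindException("No free port found starting at " + firstPort);
    }

    /** Builds the advertised worker description from the bound port, not the first candidate. */
    static String workerInfo(String hostname, int partition, BoundServer server) {
        return "Worker(hostname=" + hostname + ", MRpartition=" + partition
                + ", port=" + server.port + ")";
    }

    public static void main(String[] args) throws IOException {
        int firstPort = 34900 + 161; // base port and partition number from the log (assumed formula)
        try (BoundServer server = bindWithRetry(firstPort, 1000, 20)) {
            System.out.println(workerInfo("localhost", 161, server)
                    + " on bind attempt " + server.bindAttempt);
        }
    }
}
```

The point of the sketch is only that the value published to peers is read back from the socket after the bind succeeds, so a retry on a different port cannot leave the advertised port stale.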