[ 
https://issues.apache.org/jira/browse/GIRAPH-53?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127626#comment-13127626
 ] 

Avery Ching commented on GIRAPH-53:
-----------------------------------

Thanks for reporting the issue.  A few questions:

1)  Is it always the 103rd superstep?

2)  It looks like the task lost its connection to the ZooKeeper service.  
It would be good to see what happened to that task as well.  Most likely it 
crashed for some reason.
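
For question 1, one way to compare runs is to pull the superstep number out of 
the ConnectionLoss line in each failed task log.  The snippet below is only a 
sketch: the class and method names are made up for illustration, and it 
assumes the failing path contains "_superstepDir/<n>/" on one line, as in the 
log quoted below.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Giraph: extracts the superstep number
// from a log line that contains a path like
// .../_superstepDir/103/_vertexRangeAssignments
public class SuperstepFromLog {
    private static final Pattern SUPERSTEP =
        Pattern.compile("_superstepDir/(\\d+)/");

    /** Returns the superstep found in the line, or -1 if there is none. */
    static long superstepOf(String logLine) {
        Matcher m = SUPERSTEP.matcher(logLine);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        // Path copied from the ConnectionLoss exception in this issue.
        String line = "KeeperErrorCode = ConnectionLoss for "
            + "/_hadoopBsp/job_201110141621_0001/_applicationAttemptsDir/0/"
            + "_superstepDir/103/_vertexRangeAssignments";
        System.out.println(superstepOf(line));
    }
}
```

If the number differs from run to run, the failure is probably not tied to 
the algorithm reaching a particular superstep.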
                
> Unable to read additional data from server session, likely server has closed 
> socket
> -----------------------------------------------------------------------------------
>
>                 Key: GIRAPH-53
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-53
>             Project: Giraph
>          Issue Type: Bug
>            Reporter: locker
>
> I got an error recently. Everything goes well until the 103rd 
> superstep. 
> 2011-10-14 16:23:38,904 INFO org.apache.giraph.comm.BasicRPCCommunications: 
> prepareSuperstep
> 2011-10-14 16:23:39,018 WARN org.apache.giraph.graph.BspService: process: 
> Unknown and unprocessed event 
> (path=/_hadoopBsp/job_201110141621_0001/_applicationAttemptsDir/0/_superstepDir/101/_vertexRangeAssignments,
>  type=NodeDeleted, state=SyncConnected)
> 2011-10-14 16:23:39,057 INFO org.apache.giraph.graph.BspServiceWorker: 
> registerHealth: Created my health node for attempt=0, superstep=103 with 
> /_hadoopBsp/job_201110141621_0001/_applicationAttemptsDir/0/_superstepDir/103/_workerHealthyDir/locker-desktop_1
>  and hostnamePort = ["locker-desktop",30001]
> 2011-10-14 16:23:39,057 WARN org.apache.giraph.graph.BspService: process: 
> Unknown and unprocessed event 
> (path=/_hadoopBsp/job_201110141621_0001/_applicationAttemptsDir/0/_superstepDir/101/_superstepFinished,
>  type=NodeDeleted, state=SyncConnected)
> 2011-10-14 16:23:39,529 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
> additional data from server sessionid 0x1330186cff30001, likely server has 
> closed socket, closing socket connection and attempting reconnect
> 2011-10-14 16:23:39,630 ERROR org.apache.zookeeper.ClientCnxn: Error while 
> calling watcher 
> java.lang.RuntimeException: process: Disconnected from ZooKeeper, cannot 
> recover.
>       at org.apache.giraph.graph.BspService.process(BspService.java:995)
>       at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
> 2011-10-14 16:23:41,098 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server locker-desktop/10.13.30.90:22181
> 2011-10-14 16:23:41,099 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x1330186cff30001 for server null, unexpected error, closing socket 
> connection and attempting reconnect
> java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078)
> 2011-10-14 16:23:41,212 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
> Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2011-10-14 16:23:41,306 INFO org.apache.hadoop.io.nativeio.NativeIO: 
> Initialized cache for UID to User mapping with a cache timeout of 14400 
> seconds.
> 2011-10-14 16:23:41,307 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
> UserName dic for UID 1001 from the native implementation
> 2011-10-14 16:23:41,318 WARN org.apache.hadoop.mapred.Child: Error running 
> child
> java.lang.RuntimeException: 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for 
> /_hadoopBsp/job_201110141621_0001/_applicationAttemptsDir/0/_superstepDir/103/_vertexRangeAssignments
>       at 
> org.apache.giraph.graph.BspServiceWorker.startSuperstep(BspServiceWorker.java:836)
>       at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:551)
>       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>       at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>       at org.apache.hadoop.mapred.Child.main(Child.java:253)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for 
> /_hadoopBsp/job_201110141621_0001/_applicationAttemptsDir/0/_superstepDir/103/_vertexRangeAssignments
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
>       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
>       at 
> org.apache.giraph.graph.BspServiceWorker.startSuperstep(BspServiceWorker.java:830)
>       ... 9 more
> I don't know whether this should be called a bug or not. Waiting for some 
> help, thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
