[ https://issues.apache.org/jira/browse/GIRAPH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-356:
-------------------------------

    Attachment: GIRAPH-356.2.patch

Updated patch to address all the ZooKeeper issues I could find at scale.

-Configurable ZooKeeper connection attempts, min/max session timeout, force 
sync (off for perf), skip ACLs (no for perf)
-Do not kill the job on a disconnect event; the client can still reconnect, 
and only an expired session is fatal
-Dump failed workers on the master when a superstep does not get started due to 
missing ZooKeeper health
-Dump the last 100 lines of ZooKeeper process stdout/stderr when there is a 
failure that could be related to ZooKeeper
-Small change for a more descriptive message when the last good checkpoint 
can't be found
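
To illustrate the "last 100 lines" item above, here is a minimal sketch of a bounded stream collector. The class name LastLinesCollector and its methods are hypothetical and are not taken from the patch; the patch's own collector appears in the log below as ZooKeeperManager$StreamCollector, and its implementation may differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Hypothetical helper (not the patch's code): reads a process stream on its
 * own thread and retains only the most recent maxLines lines.
 */
class LastLinesCollector extends Thread {
    private final InputStream stream;
    private final int maxLines;
    private final Deque<String> lines = new ArrayDeque<>();

    LastLinesCollector(InputStream stream, int maxLines) {
        if (maxLines < 1) {
            throw new IllegalArgumentException("maxLines must be >= 1");
        }
        this.stream = stream;
        this.maxLines = maxLines;
        // Do not keep the JVM alive just to drain the child's output.
        setDaemon(true);
    }

    @Override
    public void run() {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                add(line);
            }
        } catch (IOException e) {
            add("collector: stream read failed: " + e);
        }
    }

    /** Appends a line, evicting the oldest one once the buffer is full. */
    private synchronized void add(String line) {
        if (lines.size() == maxLines) {
            lines.removeFirst();
        }
        lines.addLast(line);
    }

    /** Returns a copy of the retained lines, oldest first. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(lines);
    }
}
```

One way to use this, under the same assumptions, is to start the child with ProcessBuilder.redirectErrorStream(true), pass Process.getInputStream() to a collector created with maxLines set to 100, call start(), and log snapshot() only when a failure that might be ZooKeeper-related is caught.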
                
> Help debug ZooKeeper issues
> ---------------------------
>
>                 Key: GIRAPH-356
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-356
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-356.2.patch, GIRAPH-356.patch
>
>
> Currently, if the ZooKeeper process fails, we have little information on why 
> it failed or what happened.  This patch addresses this by keeping the last 
> 100 log lines and dumping them when the map task fails with a 
> RuntimeException.
> Here is an example of a master task failure when an invalid JVM argument is 
> passed to ZooKeeper.  The error is much more obvious now.
> 2012-10-04 15:05:28,916 WARN org.apache.giraph.zk.ZooKeeperManager: logZooKeeperOutput: Dumping up to last 100 lines of the ZooKeeper process STDOUT and STDERR.
> 2012-10-04 15:05:28,916 WARN org.apache.giraph.zk.ZooKeeperManager$StreamCollector: Unrecognized option: -BadOpt
> 2012-10-04 15:05:28,916 WARN org.apache.giraph.zk.ZooKeeperManager$StreamCollector: Could not create the Java virtual machine.
> 2012-10-04 15:05:28,919 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2012-10-04 15:05:28,959 WARN org.apache.hadoop.mapred.Child: Error running child
> java.lang.IllegalStateException: run: Caught an unrecoverable exception onlineZooKeeperServers: Failed to connect in 5 tries!
>        at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:591)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>        at org.apache.hadoop.mapred.Child.main(Child.java:253)
> Caused by: java.lang.IllegalStateException: onlineZooKeeperServers: Failed to connect in 5 tries!
>        at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:721)
>        at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:328)
>        at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:573)
>        ... 7 more
> 2012-10-04 15:05:28,963 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

