[ https://issues.apache.org/jira/browse/GIRAPH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Avery Ching updated GIRAPH-356:
---
Attachment: GIRAPH-356.2.patch
Updated patch to address all the ZooKeeper issues I could find at scale.
-Configurable ZooKeeper connection attempts, min/max session timeout, force
sync (off for perf), skip ACLs (no for perf)
-Do not kill the job on a disconnect event; the client can still reconnect.
Only an expired session is fatal
-Dump failed workers on the master when a superstep does not get started due to
missing ZooKeeper health
-Dump last 100 lines of ZooKeeper process stdout/stderr when there is a failure
that could be related to ZooKeeper
-Small change for a more descriptive message when the last good checkpoint
cannot be found
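
The disconnect-versus-expired rule in the list above can be sketched as a small decision function. This is only an illustration and not the code in the patch: `SessionState`, `Action`, and `SessionPolicy` are hypothetical names, and `SessionState` is a stand-in for the states the real ZooKeeper client reports through `Watcher.Event.KeeperState` (`SyncConnected`, `Disconnected`, `Expired`).

```java
/** Hypothetical stand-in for the ZooKeeper client states discussed above. */
enum SessionState { SYNC_CONNECTED, DISCONNECTED, EXPIRED }

/** Hypothetical reactions a worker or master could take. */
enum Action { CONTINUE, WAIT_FOR_RECONNECT, FAIL_JOB }

/** Maps a session state change to a reaction (illustrative only). */
class SessionPolicy {
    static Action onStateChange(SessionState state) {
        switch (state) {
            case SYNC_CONNECTED:
                // Connected (or reconnected): nothing to do.
                return Action.CONTINUE;
            case DISCONNECTED:
                // The client library keeps retrying while the session
                // is still alive on the server, so do not kill the job.
                return Action.WAIT_FOR_RECONNECT;
            case EXPIRED:
                // The session and its ephemeral nodes are gone; this
                // is the only state treated as unrecoverable here.
                return Action.FAIL_JOB;
            default:
                throw new IllegalArgumentException("Unknown state: " + state);
        }
    }
}
```

In a real watcher the same switch would sit inside `Watcher.process(WatchedEvent)` and read the state from `event.getState()`.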
> Help debug ZooKeeper issues
> ---
>
> Key: GIRAPH-356
> URL: https://issues.apache.org/jira/browse/GIRAPH-356
> Project: Giraph
> Issue Type: Improvement
> Reporter: Avery Ching
> Assignee: Avery Ching
> Attachments: GIRAPH-356.2.patch, GIRAPH-356.patch
>
>
> Currently, if the ZooKeeper process fails, we have little information on why
> and what happened. This patch addresses this by keeping the last 100 log
> lines and dumping them when the map task fails with a RuntimeException.
> Here is an example of a master task failure when there is an invalid JVM
> argument passed to ZooKeeper. The error is much more obvious now.
> 2012-10-04 15:05:28,916 WARN org.apache.giraph.zk.ZooKeeperManager: logZooKeeperOutput: Dumping up to last 100 lines of the ZooKeeper process STDOUT and STDERR.
> 2012-10-04 15:05:28,916 WARN org.apache.giraph.zk.ZooKeeperManager$StreamCollector: Unrecognized option: -BadOpt
> 2012-10-04 15:05:28,916 WARN org.apache.giraph.zk.ZooKeeperManager$StreamCollector: Could not create the Java virtual machine.
> 2012-10-04 15:05:28,919 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2012-10-04 15:05:28,959 WARN org.apache.hadoop.mapred.Child: Error running child
> java.lang.IllegalStateException: run: Caught an unrecoverable exception onlineZooKeeperServers: Failed to connect in 5 tries!
>     at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:591)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>     at org.apache.hadoop.mapred.Child.main(Child.java:253)
> Caused by: java.lang.IllegalStateException: onlineZooKeeperServers: Failed to connect in 5 tries!
>     at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:721)
>     at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:328)
>     at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:573)
>     ... 7 more
> 2012-10-04 15:05:28,963 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
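
The "keep the last 100 lines" idea from the issue description can be sketched with a bounded buffer that drops the oldest line once it is full. This is a minimal sketch and not the `StreamCollector` from the patch: `TailCollector` and its method names are hypothetical, and the line limit is passed in rather than fixed at 100.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: remembers only the last maxLines lines it is given. */
class TailCollector {
    private final int maxLines;
    private final Deque<String> lines = new ArrayDeque<>();

    TailCollector(int maxLines) {
        if (maxLines <= 0) {
            throw new IllegalArgumentException("maxLines must be positive");
        }
        this.maxLines = maxLines;
    }

    /** Appends one line, evicting the oldest line when the buffer is full. */
    synchronized void add(String line) {
        if (lines.size() == maxLines) {
            lines.removeFirst();
        }
        lines.addLast(line);
    }

    /** Reads the stream to its end, keeping only the trailing lines. */
    void drain(InputStream in) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                add(line);
            }
        }
    }

    /** Returns a copy of the retained lines, oldest first. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(lines);
    }
}
```

Assuming the ZooKeeper server runs as a child `Process`, a background thread could call `drain(process.getInputStream())`, and the failure path could log each entry of `snapshot()` before rethrowing. Memory stays bounded no matter how much the child process prints.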