[ 
https://issues.apache.org/jira/browse/GIRAPH-850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Fonseca updated GIRAPH-850:
-------------------------------------

    Attachment: GIRAPH-850-2.patch

I noticed a bug when offlining the ZooKeeper servers with Yarn. While 
offlining, it waits for this many tasks to finish:

{code}conf.getMapTasks(){code} 

In Yarn, this value is clearly incorrect (in fact, it's set to 2 no matter how 
many workers we have, at least in my setup). As a result, the computation 
never stops: the Master and AM containers keep running indefinitely without 
throwing any exception, even though the output has already been written.

Adding some extra LOG.info calls and a catch Throwable around the call to 
offlineZooKeeperServers yielded the following (this is a submission with -w 5):

{code}
14/02/13 17:51:22 INFO graph.GraphTaskManager: Joined with master thread
14/02/13 17:51:22 INFO graph.GraphTaskManager: Offlining ZooKeeper servers
14/02/13 17:51:22 INFO zk.ZooKeeperManager: createZooKeeperClosedStamp: Creating my filestamp _bsp/_defaultZkManagerDir/giraph_yarn_application_1392301880048_0010/_task/2.COMPUTATION_DONE
14/02/13 17:51:22 INFO zk.ZooKeeperManager: offlineZooKeeperServers: entered sync area
14/02/13 17:51:22 INFO zk.ZooKeeperManager: offlineZooKeeperServers: Will wait for 2 tasks
14/02/13 17:51:22 INFO zk.ZooKeeperManager: waitUntilAllTasksDone: entering attempt 0
14/02/13 17:51:22 INFO zk.ZooKeeperManager: waitUntilAllTasksDone: listing task directory
14/02/13 17:51:23 INFO zk.ZooKeeperManager: waitUntilAllTasksDone: task directory has 12 files
14/02/13 17:51:23 INFO zk.ZooKeeperManager: waitUntilAllTasksDone: 0.COMPUTATION_DONE
14/02/13 17:51:23 INFO zk.ZooKeeperManager: waitUntilAllTasksDone: name matches begin
14/02/13 17:51:23 INFO zk.ZooKeeperManager: waitUntilAllTasksDone: name matches end
14/02/13 17:51:23 INFO zk.ZooKeeperManager: waitUntilAllTasksDone: 1.COMPUTATION_DONE
14/02/13 17:51:23 INFO zk.ZooKeeperManager: waitUntilAllTasksDone: name matches begin
14/02/13 17:51:23 INFO zk.ZooKeeperManager: waitUntilAllTasksDone: name matches end
14/02/13 17:51:23 INFO zk.ZooKeeperManager: waitUntilAllTasksDone: 2.COMPUTATION_DONE
14/02/13 17:51:23 INFO zk.ZooKeeperManager: waitUntilAllTasksDone: name matches begin
14/02/13 17:51:23 ERROR graph.GraphTaskManager: Error offlining zookeeper
java.lang.ArrayIndexOutOfBoundsException: 2
  at org.apache.giraph.zk.ZooKeeperManager.waitUntilAllTasksDone(ZooKeeperManager.java:835)
  at org.apache.giraph.zk.ZooKeeperManager.offlineZooKeeperServers(ZooKeeperManager.java:900)
  at org.apache.giraph.graph.GraphTaskManager.cleanup(GraphTaskManager.java:857)
  at org.apache.giraph.yarn.GiraphYarnTask.run(GiraphYarnTask.java:93)
  at org.apache.giraph.yarn.GiraphYarnTask.main(GiraphYarnTask.java:196)
14/02/13 17:51:23 INFO graph.GraphTaskManager: Shutting down GiraphMetrics
{code}
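
One plausible reading of the trace above (an assumption on my part, since the 
code at ZooKeeperManager.java:835 is not quoted here) is that the done stamps 
are recorded in an array sized by the expected task count, so a stamp whose 
task id is not below that count overflows it. The class, method and pattern 
names below are hypothetical and only illustrate that failure mode:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DoneStampSketch {
    // Hypothetical pattern for stamp names such as "2.COMPUTATION_DONE".
    private static final Pattern STAMP =
        Pattern.compile("^(\\d+)\\.COMPUTATION_DONE$");

    /** Counts distinct task ids that have a done stamp. */
    static int countDone(List<String> fileNames, int expectedTasks) {
        boolean[] done = new boolean[expectedTasks];
        int count = 0;
        for (String name : fileNames) {
            Matcher m = STAMP.matcher(name);
            if (!m.matches()) {
                continue; // not a done stamp, ignore
            }
            int taskId = Integer.parseInt(m.group(1));
            // Throws ArrayIndexOutOfBoundsException if taskId >= expectedTasks,
            // which is what an undersized expected count would cause.
            if (!done[taskId]) {
                done[taskId] = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> stamps = Arrays.asList(
            "0.COMPUTATION_DONE", "1.COMPUTATION_DONE", "2.COMPUTATION_DONE");
        System.out.println("expected 6 tasks, done so far: " + countDone(stamps, 6));
        try {
            countDone(stamps, 2); // expected count too small, as in the log
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("undersized expected count: " + e);
        }
    }
}
```

With an expected count of 2, the stamp for task 2 is the first one that cannot 
be recorded, which would match the point where the log stops.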

This new patch (GIRAPH-850-2.patch) adds some logic so that, in Yarn 
executions, it waits for as many tasks as there are Yarn containers launched 
by the GiraphApplicationMaster (GiraphApplicationMaster:165):

{code}
    containersToLaunch = giraphConf.getMaxWorkers() + 1;
{code}
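
A minimal sketch of that selection, not the patch itself: the class name, 
method name and the isYarn flag are hypothetical, and it assumes that -w 5 
makes getMaxWorkers() return 5.

```java
final class ExpectedTasksSketch {
    /**
     * Number of done stamps to wait for before offlining ZooKeeper.
     * Under Yarn this mirrors containersToLaunch (max workers + 1);
     * otherwise it falls back to the configured map task count.
     */
    static int tasksToWaitFor(boolean isYarn, int mapTasks, int maxWorkers) {
        return isYarn ? maxWorkers + 1 : mapTasks;
    }

    public static void main(String[] args) {
        // Yarn run with 5 workers, where the map task count is reported as 2.
        System.out.println(tasksToWaitFor(true, 2, 5));
        // Non-Yarn run: keep using the map task count.
        System.out.println(tasksToWaitFor(false, 6, 5));
    }
}
```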

The patch also contains the catch Throwable I added during debugging since 
Hadoop seems to eat all exceptions thrown from the cleanup and I think it's 
useful to have them logged to notice these issues.
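
The idea can be sketched as a small wrapper around a cleanup step. This is 
only an illustration: the names are hypothetical, it uses java.util.logging 
to stay self-contained rather than Giraph's own logger, and whether the patch 
rethrows after logging is not stated here.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class CleanupLoggingSketch {
    private static final Logger LOG =
        Logger.getLogger(CleanupLoggingSketch.class.getName());

    /** A cleanup action that may fail with any exception. */
    interface CleanupStep {
        void run() throws Exception;
    }

    /**
     * Runs the step and logs anything it throws, so that the failure is
     * visible even if the caller would otherwise swallow it.
     * Returns true when the step completed normally.
     */
    static boolean runLogged(String description, CleanupStep step) {
        try {
            step.run();
            return true;
        } catch (Throwable t) {
            LOG.log(Level.SEVERE, "Error " + description, t);
            return false;
        }
    }

    public static void main(String[] args) {
        runLogged("offlining zookeeper", () -> {
            throw new ArrayIndexOutOfBoundsException(2);
        });
    }
}
```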

Tested with Hadoop 1.2.1 MR1, Hadoop 2.2.0 MR2, Hadoop 2.2.0 Yarn on a 5-node 
cluster and passes mvn verify.

> Improve internal zookeeper launching
> ------------------------------------
>
>                 Key: GIRAPH-850
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-850
>             Project: Giraph
>          Issue Type: Bug
>          Components: zookeeper
>            Reporter: Alexandre Fonseca
>             Fix For: 1.1.0
>
>         Attachments: GIRAPH-850-2.patch, GIRAPH-850.patch
>
>
> With the most up to date trunk, internal zookeeper launching only appears to 
> work with Hadoop 1.x.x MR1.
> With Hadoop 2.x.x MR2, trying to run a job without specifying an external 
> zookeeper location results in a failed job with the following in the logs:
> {code}
> 2014-02-12 17:30:30,281 INFO [main] org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Attempting to start ZooKeeper server with command [/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/jre/bin/java, -Xmx512m, -XX:ParallelGCThreads=4, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=70, -XX:MaxGCPauseMillis=100, -cp, /tmp/hadoop-yarn/staging/b.ajf/.staging/job_1392221733726_0002/job.jar, org.apache.zookeeper.server.quorum.QuorumPeerMain, /tmp/hadoop-b.ajf/nm-local-dir/usercache/b.ajf/appcache/application_1392221733726_0002/work/_bspZooKeeper/zoo.cfg] in directory /tmp/hadoop-b.ajf/nm-local-dir/usercache/b.ajf/appcache/application_1392221733726_0002/work/_bspZooKeeper
> (...)
> 2014-02-12 17:30:30,285 INFO [main] org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Connect attempt 0 of 10 max trying to connect to igraph-02.hi.inet:22181 with poll msecs = 3000
> 2014-02-12 17:30:30,289 WARN [main] org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Got ConnectException
> java.net.ConnectException: Connection refused
> (...)
> 2014-02-12 17:30:30,413 INFO [org.apache.giraph.zk.ZooKeeperManager$StreamCollector] org.apache.giraph.zk.ZooKeeperManager$StreamCollector: readLines: Error: Could not find or load main class org.apache.zookeeper.server.quorum.QuorumPeerMain
> (...)
> {code}
> It clearly is unable to launch Zookeeper as it can't find the necessary class 
> in the classpath. Looking at the command with which it tries to launch 
> Zookeeper, we can see that it has specified a classpath of:
> {code}
> -cp, /tmp/hadoop-yarn/staging/b.ajf/.staging/job_1392221733726_0002/job.jar
> {code}
> which is an HDFS location.
> It seems that with Hadoop 2.x.x, the function Job.getJar() returns an HDFS 
> path to the jar instead of the path to the local copy of the jar in the 
> DistributedCache. Hadoop 1.x.x appears to return a correct path as I didn't 
> detect any problem there.
> The whole logic of finding the Zookeeper classpath seems extremely convoluted 
> to me (not to mention broken as just shown for both MR2 and YARN). Since the 
> currently running Java process has to have the zookeeper classes in its 
> classpath anyway (because some of the classes in Giraph refer to Zookeeper 
> classes), wouldn't it make more sense to just have the child java process 
> starting Zookeeper simply inherit the classpath?
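
The suggestion at the end of the issue description, having the child process 
inherit the parent's classpath, could be sketched as follows. This is an 
illustration under assumptions, not the patch: the class and method names are 
hypothetical, and it relies on the standard java.home and java.class.path 
system properties, which may not cover classes loaded through custom class 
loaders.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class ChildJvmCommandSketch {
    /**
     * Builds a command line that starts mainClass in a child JVM using the
     * same java binary and the same classpath as the current process.
     */
    static List<String> build(String mainClass, List<String> jvmArgs,
                              List<String> programArgs) {
        String javaBin = System.getProperty("java.home")
            + File.separator + "bin" + File.separator + "java";
        List<String> command = new ArrayList<>();
        command.add(javaBin);
        command.addAll(jvmArgs);
        command.add("-cp");
        // Inherit the classpath of the running JVM instead of computing one.
        command.add(System.getProperty("java.class.path"));
        command.add(mainClass);
        command.addAll(programArgs);
        return command;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = new ArrayList<>();
        jvmArgs.add("-Xmx512m");
        List<String> programArgs = new ArrayList<>();
        programArgs.add("zoo.cfg"); // placeholder config path
        System.out.println(build(
            "org.apache.zookeeper.server.quorum.QuorumPeerMain",
            jvmArgs, programArgs));
    }
}
```

The resulting list can be passed to java.lang.ProcessBuilder, with its working 
directory set to the ZooKeeper directory, to actually start the child process.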



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
