ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters
I have successfully run the shortest path example using Avery's sample input data. I am now attempting to run the shortest-path algorithm on a much larger data set (300,000 nodes) and I am running into errors. I have a 4-node cluster and am running the following command:

./giraph -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar org.apache.giraph.examples.SimpleShortestPathsVertex -if org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip /user/hduser/insight -of org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op /user/hduser/insight-out -w 3

It appears as though the shortest path computation finishes; that is to say, I hit 100%. Then the job hangs for about 30 seconds, decreases its progress to 75%, and finally throws an exception:

No HADOOP_CONF_DIR set, using /opt/hadoop-1.0.3/conf
12/11/28 08:26:16 INFO mapred.JobClient: Running job: job_201211271542_0004
12/11/28 08:26:17 INFO mapred.JobClient: map 0% reduce 0%
12/11/28 08:26:33 INFO mapred.JobClient: map 25% reduce 0%
12/11/28 08:26:40 INFO mapred.JobClient: map 50% reduce 0%
12/11/28 08:26:42 INFO mapred.JobClient: map 75% reduce 0%
12/11/28 08:26:44 INFO mapred.JobClient: map 100% reduce 0%
12/11/28 08:27:45 INFO mapred.JobClient: map 75% reduce 0%
12/11/28 08:27:50 INFO mapred.JobClient: Task Id : attempt_201211271542_0004_m_00_0, Status : FAILED
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

Digging into the log files a little deeper, I noticed that the last node in my cluster contains more log directories than the previous three.
I see:

* attempt_201211280843_0001_m_00_0 - /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_00_0
* attempt_201211280843_0001_m_00_0.cleanup - /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_00_0.cleanup
* attempt_201211280843_0001_m_05_0 - /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_05_0
* job-acls.xml

Whereas the first 3 nodes only contain 1 log folder underneath the job, something like attempt_201211280843_0001_m_03_0. I am assuming this is because something went wrong on node 4 and some cleanup logic was attempted. At any rate, when I cd into the first log folder on the bad node (attempt_201211280843_0001_m_00_0) and look into syslog, I see the following error:

2012-11-28 08:45:36,212 INFO org.apache.giraph.graph.BspServiceMaster: barrierOnWorkerList: Waiting on [cap03_3, cap02_1, cap01_2]
2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster: collectAndProcessAggregatorValues: Processed aggregators
2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster: aggregateWorkerStats: Aggregation found (vtx=142711,finVtx=142711,edges=409320,msgCount=46846,haltComputation=false) on superstep = 98
2012-11-28 08:45:36,341 INFO org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Cleaning up old Superstep /_hadoopBsp/job_201211280843_0001/_applicationAttemptsDir/0/_superstepDir/97
2012-11-28 08:45:36,611 INFO org.apache.giraph.graph.MasterThread: masterThread: Coordination of superstep 98 took 0.445 seconds ended with state THIS_SUPERSTEP_DONE and is now on superstep 99
2012-11-28 08:45:36,611 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = Error: Exceeded limits on number of counters - Counters=120 Limit=120, exiting...
org.apache.hadoop.mapred.Counters$CountersExceededException: Error: Exceeded limits on number of counters - Counters=120 Limit=120
    at org.apache.hadoop.mapred.Counters$Group.getCounterForName(Counters.java:312)
    at org.apache.hadoop.mapred.Counters.findCounter(Counters.java:446)
    at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:596)
    at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:541)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.getCounter(TaskInputOutputContext.java:88)
    at org.apache.giraph.graph.MasterThread.run(MasterThread.java:131)
2012-11-28 08:45:36,612 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.

What exactly is this limit on MapReduce job counters? What is a MapReduce job counter? I assume it is some variable threshold to keep things in check, and I know that I can modify the value in mapred-site.xml:

<property>
  <name>mapreduce.job.counters.limit</name>
  <value>120</value>
  <description>I have no idea what this does!!!</description>
</property>

I have tried increasing and
Re: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters
Hi Bence,

on older versions of Hadoop there is a hard limit on counters, which a job cannot modify. Since the counters are not crucial for the functioning of Giraph, you can turn them off by setting giraph.useSuperstepCounters to false in your job config. I would also recommend looking into the GiraphConfiguration class, as it contains all the settings that you might be interested in (like checkpoint frequency etc.): https://github.com/apache/giraph/blob/trunk/giraph/src/main/java/org/apache/giraph/GiraphConfiguration.java

HTH

-Andre

2012/11/28 Magyar, Bence (US SSA) bence.mag...@baesystems.com:
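To make the suggestion concrete, the flag can be passed like any other -D property on the launch command. This is a sketch based on the command from the original post (the paths, source id, and worker count are the poster's, not generic values):

```shell
# Disable Giraph's per-superstep Hadoop counters so the job stays under
# the hard counter limit of older Hadoop versions (sketch, untested here).
./giraph -DSimpleShortestPathsVertex.sourceId=100 \
    -Dgiraph.useSuperstepCounters=false \
    ../target/giraph.jar org.apache.giraph.examples.SimpleShortestPathsVertex \
    -if org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat \
    -ip /user/hduser/insight \
    -of org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat \
    -op /user/hduser/insight-out \
    -w 3
```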
Re: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters
Bence,

I set that value to 100 - I think there is a recommendation to set this very high. Remember to reboot your cluster after making the change.

Jon

On Wed, Nov 28, 2012 at 6:07 AM, Magyar, Bence (US SSA) bence.mag...@baesystems.com wrote:
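For reference, the mapred-site.xml change being discussed would look like the following. The property name is the one from the original message (mapreduce.job.counters.limit); the value here is purely illustrative, not a recommendation, and on Hadoop 1.x the daemons read it at startup, hence the cluster restart:

```xml
<!-- mapred-site.xml (sketch): raise the per-job counter limit.
     The value 500 is an illustrative example only; pick one large
     enough for your job. Restart the cluster afterwards. -->
<property>
  <name>mapreduce.job.counters.limit</name>
  <value>500</value>
</property>
```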
Re: What a worker really is and other interesting runtime information
Oh, forgot one thing. You need to set the number of partitions to use, since each thread works on a single partition at a time. Try -Dhash.userPartitionCount=<number of threads>

On 11/28/12 5:29 AM, Alexandros Daglis wrote:

Dear Avery,

I followed your advice, but the application seems to be totally thread-count-insensitive: I literally observe zero performance scaling as I increase the thread count. Maybe you can point out if I am doing something wrong.

- Using only 4 cores on a single node at the moment
- Input graph: 14 million vertices, file size is 470 MB
- Running SSSP as follows: hadoop jar target/giraph-0.1-jar-with-dependencies.jar org.apache.giraph.examples.SimpleShortestPathsVertex -Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=X input output 12 1, where X=1,2,3,12,30
- I notice a total insensitivity to the number of threads I specify. Aggregate core utilization is always approximately the same (usually around 25-30% = only one of the cores running) and overall execution time is always the same (~8 mins)

Why is Giraph's performance not scaling? Is the input size / number of workers inappropriate? It's not an IO issue either, because even during really low core utilization, time is wasted on idle, not on IO.

Cheers,
Alexandros

On 28 November 2012 11:13, Alexandros Daglis alexandros.dag...@epfl.ch wrote:

Thank you Avery, that helped a lot!

Regards,
Alexandros

On 27 November 2012 20:57, Avery Ching ach...@apache.org wrote:

Hi Alexandros,

The extra task is for the master process (a coordination task). In your case, since you are using a single machine, you can use a single task (-Dgiraph.SplitMasterWorker=false) and you can try multithreading instead of multiple workers (-Dgiraph.numComputeThreads=12). The reason why CPU usage increases is due to netty threads handling network requests. By using multithreading instead, you should bypass this.
Avery

On 11/27/12 9:40 AM, Alexandros Daglis wrote:

Hello everybody,

I went through most of the documentation I could find for Giraph and also most of the messages in this email list, but still I have not figured out precisely what a worker really is. I would really appreciate it if you could help me understand how the framework works.

At first I thought that a worker has a one-to-one correspondence to a map task. Apparently this is not exactly the case, since I have noticed that if I ask for x workers, the job finishes after having used x+1 map tasks. What is this extra task for?

I have been trying out the example SSSP application on a single node with 12 cores. Giving an input graph of ~400MB and using 1 worker, around 10 GB of memory are used during execution. What intrigues me is that if I use 2 workers for the same input (and without limiting memory per map task), double the memory will be used. Furthermore, there is no improvement in performance; I rather notice a slowdown. Are these observations normal? Might it be the case that 1 and 2 workers are very few and I should go to the 30-100 range that is the proposed number of mappers for a conventional MapReduce job?

Finally, a last observation. Even though I use only 1 worker, I see that there are significant periods during execution where up to 90% of the 12 cores' computing power is consumed, that is, almost 10 cores are used in parallel. Does each worker spawn multiple threads and dynamically balance the load to utilize the available hardware?

Thanks a lot in advance!

Best,
Alexandros
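Putting the flags from this thread together, a single-node multithreaded run might look like the following. This is a sketch only: the jar name, input/output paths, and trailing arguments are the ones from Alexandros's command, and 12 threads/partitions is just an example count:

```shell
# Single-task, multithreaded SSSP run (sketch assembled from this thread):
# - giraph.SplitMasterWorker=false : run master and worker in one task
# - giraph.numComputeThreads       : number of compute threads
# - hash.userPartitionCount        : should be at least the thread count,
#   since each thread works on a single partition at a time
hadoop jar target/giraph-0.1-jar-with-dependencies.jar \
    org.apache.giraph.examples.SimpleShortestPathsVertex \
    -Dgiraph.SplitMasterWorker=false \
    -Dgiraph.numComputeThreads=12 \
    -Dhash.userPartitionCount=12 \
    input output 12 1
```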
RE: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters
Thank you Andre,

Setting giraph.useSuperstepCounters = false solved my issue. The job still hung at 100% and then eventually completed successfully.

-Bence

-----Original Message-----
From: André Kelpe [mailto:efeshundert...@googlemail.com]
Sent: Wednesday, November 28, 2012 10:45 AM
To: user@giraph.apache.org
Subject: Re: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters
Issue running Giraph on more mappers
Hi,

I am trying to run this workflow which uses Giraph. I am able to successfully run the Giraph job when I use a smaller number of mappers and less data, but it fails for more mappers. This is what the logs say for the master and worker nodes:

Master Node:

2012-11-29 00:01:10,235 INFO [main] org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Connected to gsta31113.tan.ygrid.yahoo.com/10.216.124.59:24681!
2012-11-29 00:01:10,235 INFO [main] org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Creating my filestamp _bsp/_defaultZkManagerDir/_zkServer/gsta31113.tan.ygrid.yahoo.com 3
2012-11-29 00:01:10,241 INFO [main] org.apache.giraph.graph.GraphMapper: setup: Starting up BspServiceMaster (master thread)...
2012-11-29 00:01:10,257 INFO [main] org.apache.giraph.graph.BspService: BspService: Connecting to ZooKeeper with job job_1353148790244_114419, 3 on gsta31113.tan.ygrid.yahoo.com:24681
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.4-1386507, built on 09/17/2012 08:33 GMT
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:host.name=gsta31113.tan.ygrid.yahoo.com
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.6.0_21
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:java.home=/home/Releases/gridjdk-1.6.0_21.1011192346-20110120-000/share/gridjdk-1.6.0_21/jre
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:java.class.path= {really long class path}
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/home/Releases/gridjdk-1.6.0_21.1011192346-20110120-000/share/gridjdk-1.6.0_21/jre/lib/i386/server:/home/Releases/gridjdk-1.6.0_21.1011192346-20110120-000/share/gridjdk-1.6.0_21/jre/lib/i386:/home/Releases/gridjdk-1.6.0_21.1011192346-20110120-000/share/gridjdk-1.6.0_21/jre/../lib/i386:/grid/2/tmp/yarn-local/usercache/nova_sln/appcache/application_1353148790244_114419/container_1353148790244_114419_01_09:/home/gs/hadoop/current/lib/native/Linux-i386-32:/usr/java/packages/lib/i386:/lib:/usr/lib
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/grid/2/tmp/yarn-local/usercache/nova_sln/appcache/application_1353148790244_114419/container_1353148790244_114419_01_09/tmp
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:os.arch=i386
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:os.version=2.6.18-238.19.1.el5.YAHOO.20111028
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:user.name=nova_sln
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:user.home=/homes/nova_sln
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/grid/2/tmp/yarn-local/usercache/nova_sln/appcache/application_1353148790244_114419/container_1353148790244_114419_01_09
2012-11-29 00:01:10,280 INFO [main] org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=gsta31113.tan.ygrid.yahoo.com:24681 sessionTimeout=6 watcher=org.apache.giraph.graph.BspServiceMaster@16f70a4
2012-11-29 00:01:10,304 INFO [main-SendThread(gsta31113.tan.ygrid.yahoo.com:24681)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server gsta31113.tan.ygrid.yahoo.com/10.216.124.59:24681. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2012-11-29 00:01:10,305 INFO [main-SendThread(gsta31113.tan.ygrid.yahoo.com:24681)] org.apache.zookeeper.ClientCnxn: Socket connection established to gsta31113.tan.ygrid.yahoo.com/10.216.124.59:24681, initiating session
2012-11-29 00:01:10,331 INFO [main-SendThread(gsta31113.tan.ygrid.yahoo.com:24681)] org.apache.zookeeper.ClientCnxn: Session establishment complete on server gsta31113.tan.ygrid.yahoo.com/10.216.124.59:24681, sessionid = 0x13b497783e4, negotiated timeout = 60
2012-11-29 00:01:10,333 INFO [main-EventThread] org.apache.giraph.graph.BspService: process: Asynchronous connection complete.
2012-11-29 00:01:10,335 INFO [main] org.apache.giraph.graph.GraphMapper: map: No need to do anything when not a worker
2012-11-29 00:01:10,335 INFO [main] org.apache.giraph.graph.GraphMapper: cleanup: Starting for MASTER_ZOOKEEPER_ONLY
2012-11-29 00:01:10,396 INFO [org.apache.giraph.graph.MasterThread] org.apache.giraph.graph.BspServiceMaster: becomeMaster: First child is
Re: _zkServer does not Exist
Hi,

Update on this one. I was able to resolve this error with the patch here: https://issues.apache.org/jira/browse/GIRAPH-391

Thanks,
Tripti.

From: Yahoo! Inc. tri...@yahoo-inc.com
Reply-To: user@giraph.apache.org
Date: Monday, October 22, 2012 5:19 PM
To: user@giraph.apache.org
Subject: Re: _zkServer does not Exist

Hi,

I am trying to build Giraph with the Hadoop_0.23 profile. When I try to run the PageRankBenchmark, I get the following error:

Cmd: hadoop jar giraph-0.71-SNAPSHOT-for-hadoop-0.23.4.1210022201-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -Dmapred.job.tracker=host:port -Dgiraph.zkManagerDirectory='/tmp/giraph/cc/_bsp/_defaultZkManagerDir' -c 1 -e 2 -s 2 -V 10 -w 1

2012-10-22 10:53:33,446 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: yarn.app.mapreduce.am.job.client.port-range; Ignoring.
2012-10-22 10:53:33,446 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.admin.reduce.child.java.opts; Ignoring.
2012-10-22 10:53:33,447 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: hadoop.tmp.dir; Ignoring.
2012-10-22 10:53:33,561 INFO [main] org.apache.giraph.graph.GraphMapper: setup: Set log level to info
2012-10-22 10:53:33,561 INFO [main] org.apache.giraph.graph.GraphMapper: Distributed cache is empty. Assuming fatjar.
2012-10-22 10:53:33,561 INFO [main] org.apache.giraph.graph.GraphMapper: setup: classpath @ {path}/job.jar for job org.apache.giraph.benchmark.PageRankBenchmark
2012-10-22 10:53:33,564 WARN [main] org.apache.hadoop.conf.Configuration: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
2012-10-22 10:53:33,564 WARN [main] org.apache.hadoop.conf.Configuration: mapred.job.id is deprecated. Instead, use mapreduce.job.id
2012-10-22 10:53:33,564 WARN [main] org.apache.hadoop.conf.Configuration: job.local.dir is deprecated. Instead, use mapreduce.job.local.dir
2012-10-22 10:53:33,565 INFO [main] org.apache.giraph.zk.ZooKeeperManager: createCandidateStamp: Made the directory /tmp/giraph/cc/_bsp/_defaultZkManagerDir
2012-10-22 10:53:33,568 INFO [main] org.apache.giraph.zk.ZooKeeperManager: createCandidateStamp: Creating my filestamp /tmp/giraph/cc/_bsp/_defaultZkManagerDir/_task/gsrd215n08.red.ygrid.yahoo.com 1
2012-10-22 10:53:33,601 INFO [main] org.apache.giraph.zk.ZooKeeperManager: getZooKeeperServerList: For task 1, got file 'zkServerList_gsrd208n08.red.ygrid.yahoo.com 0 ' (polling period is 3000)
2012-10-22 10:53:33,601 INFO [main] org.apache.giraph.zk.ZooKeeperManager: getZooKeeperServerList: Found [gsrd208n08.red.ygrid.yahoo.com, 0] 2 hosts in filename 'zkServerList_gsrd208n08.red.ygrid.yahoo.com 0 '
2012-10-22 10:53:33,603 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IllegalStateException: run: Caught an unrecoverable exception java.io.FileNotFoundException: File /tmp/giraph/cc/_bsp/_defaultZkManagerDir/_zkServer does not exist.
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:595)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:725)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File /tmp/giraph/cc/_bsp/_defaultZkManagerDir/_zkServer does not exist.
    at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:796)
    at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:328)
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:573)
    ... 7 more
Caused by: java.io.FileNotFoundException: File /tmp/giraph/cc/_bsp/_defaultZkManagerDir/_zkServer does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:362)
    at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:755)
    ... 9 more
2012-10-22 10:53:33,607 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
2012-10-22 10:53:33,609 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2012-10-22 10:53:33,610 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2012-10-22 10:53:33,610 INFO [main]