ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters

2012-11-28 Thread Magyar, Bence (US SSA)
I have successfully run the shortest path example using Avery's sample input 
data.  I am now attempting to run the shortest-path algorithm on a much larger 
data set (300,000 nodes) and I am running into errors.  I have a 4-node cluster 
and am running the following command:


./giraph -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar 
org.apache.giraph.examples.SimpleShortestPathsVertex -if 
org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip 
/user/hduser/insight -of 
org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op 
/user/hduser/insight-out -w 3
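For context, SimpleShortestPathsVertex computes single-source shortest paths in the Pregel/BSP style: in each superstep, a vertex that receives a distance shorter than its current one updates itself and sends improved distances to its neighbors, and the job halts once no messages are in flight. A minimal single-process sketch of that loop (illustration only, not Giraph code; the three-vertex graph is made up):

```python
# Toy, single-process sketch of the Pregel/BSP shortest-paths loop that
# SimpleShortestPathsVertex implements. Illustration only, not Giraph code.
INF = float("inf")

def bsp_sssp(edges, source):
    """edges: {vertex: [(neighbor, weight), ...]}; returns final distances."""
    dist = {v: INF for v in edges}
    inbox = {source: [0.0]}           # superstep 0: source receives distance 0
    while inbox:                      # halt when no messages are in flight
        outbox = {}
        for v, msgs in inbox.items():
            best = min(msgs)
            if best < dist[v]:        # vertex wakes up only on an improvement
                dist[v] = best
                for u, w in edges[v]:
                    outbox.setdefault(u, []).append(best + w)
        inbox = outbox                # barrier: messages arrive next superstep
    return dist

graph = {1: [(2, 1.0), (3, 4.0)], 2: [(3, 1.0)], 3: []}
print(bsp_sssp(graph, 1))  # → {1: 0.0, 2: 1.0, 3: 2.0}
```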


It appears as though the shortest path computation finishes.  That is to say, 
I hit 100%.  Then the job just hangs for about 30 seconds, decreases its 
progress to 75%, and then finally throws an exception:

No HADOOP_CONF_DIR set, using /opt/hadoop-1.0.3/conf
12/11/28 08:26:16 INFO mapred.JobClient: Running job: job_201211271542_0004
12/11/28 08:26:17 INFO mapred.JobClient:  map 0% reduce 0%
12/11/28 08:26:33 INFO mapred.JobClient:  map 25% reduce 0%
12/11/28 08:26:40 INFO mapred.JobClient:  map 50% reduce 0%
12/11/28 08:26:42 INFO mapred.JobClient:  map 75% reduce 0%
12/11/28 08:26:44 INFO mapred.JobClient:  map 100% reduce 0%
12/11/28 08:27:45 INFO mapred.JobClient:  map 75% reduce 0%
12/11/28 08:27:50 INFO mapred.JobClient: Task Id : 
attempt_201211271542_0004_m_00_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)


Digging into the log files a little deeper, I noticed that the last node in my 
cluster contains more log directories than the previous three.

I see:


* attempt_201211280843_0001_m_00_0 - 
/app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_00_0

* attempt_201211280843_0001_m_00_0.cleanup - 
/app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_00_0.cleanup

* attempt_201211280843_0001_m_05_0 - 
/app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_05_0

* job-acls.xml

Whereas the first 3 nodes only contain 1 log folder underneath the job, 
something like:  attempt_201211280843_0001_m_03_0.  I am assuming this is 
because something went wrong on node 4 and some cleanup logic was attempted.

At any rate, when I cd into the first log folder on the bad node, 
(attempt_201211280843_0001_m_00_0) and look into syslog, I see the 
following error:


2012-11-28 08:45:36,212 INFO org.apache.giraph.graph.BspServiceMaster: 
barrierOnWorkerList: Waiting on [cap03_3, cap02_1, cap01_2]
2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster: 
collectAndProcessAggregatorValues: Processed aggregators
2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster: 
aggregateWorkerStats: Aggregation found 
(vtx=142711,finVtx=142711,edges=409320,msgCount=46846,haltComputation=false) on 
superstep = 98
2012-11-28 08:45:36,341 INFO org.apache.giraph.graph.BspServiceMaster: 
coordinateSuperstep: Cleaning up old Superstep 
/_hadoopBsp/job_201211280843_0001/_applicationAttemptsDir/0/_superstepDir/97
2012-11-28 08:45:36,611 INFO org.apache.giraph.graph.MasterThread: 
masterThread: Coordination of superstep 98 took 0.445 seconds ended with state 
THIS_SUPERSTEP_DONE and is now on superstep 99
2012-11-28 08:45:36,611 FATAL org.apache.giraph.graph.GraphMapper: 
uncaughtException: OverrideExceptionHandler on thread 
org.apache.giraph.graph.MasterThread, msg = Error: Exceeded limits on number of 
counters - Counters=120 Limit=120, exiting...
org.apache.hadoop.mapred.Counters$CountersExceededException: Error: Exceeded 
limits on number of counters - Counters=120 Limit=120
at 
org.apache.hadoop.mapred.Counters$Group.getCounterForName(Counters.java:312)
at org.apache.hadoop.mapred.Counters.findCounter(Counters.java:446)
at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:596)
at org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:541)
at 
org.apache.hadoop.mapreduce.TaskInputOutputContext.getCounter(TaskInputOutputContext.java:88)
at org.apache.giraph.graph.MasterThread.run(MasterThread.java:131)
2012-11-28 08:45:36,612 WARN org.apache.giraph.zk.ZooKeeperManager: 
onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.


What exactly is this limit on MapReduce job counters?  What is a MapReduce 
job counter?  I assume it is some variable threshold to keep things in check, 
and I know that I can modify the value in mapred-site.xml:

<property>
  <name>mapreduce.job.counters.limit</name>
  <value>120</value>
  <description>I have no idea what this does!!!</description>
</property>
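Part of the answer (hedged, since Hadoop versions differ): counters are named statistics that tasks report back to the JobTracker, and older Hadoop enforces a global cap on how many a single job may create. Giraph registers additional counters as the job runs, including per-superstep statistics, so a job that runs for roughly 99 supersteps can walk into the cap even though no single superstep adds many. A toy model of that accumulation (the numbers, 10 base counters and 1 new counter per superstep, are made-up illustrations, not Giraph's actual counter layout):

```python
# Toy model of how per-superstep counters exhaust a fixed counter limit.
# The numbers (10 base counters, 1 new counter per superstep, limit 120)
# are illustrative assumptions, not Giraph's actual counter layout.
def supersteps_until_limit(base_counters, per_superstep, limit):
    counters = base_counters
    superstep = 0
    while True:
        counters += per_superstep     # each superstep registers new counters
        if counters > limit:
            return superstep          # the job dies entering this superstep
        superstep += 1

print(supersteps_until_limit(10, 1, 120))  # → 110 with these toy numbers
```

With these toy numbers the job survives 110 supersteps; with more counters per superstep it dies sooner, which matches a failure around superstep 98-99.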

I have tried increasing and 

Re: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters

2012-11-28 Thread André Kelpe
Hi Bence,

On older versions of Hadoop there is a hard limit on counters, which a
job cannot modify. Since the counters are not crucial to the
functioning of Giraph, you can turn them off by setting
giraph.useSuperstepCounters to false in your job config.

I would also recommend looking into the GiraphConfiguration class, as
it contains all the settings that you might be interested in (like
checkpoint frequency etc.):
https://github.com/apache/giraph/blob/trunk/giraph/src/main/java/org/apache/giraph/GiraphConfiguration.java

HTH

-Andre

2012/11/28 Magyar, Bence (US SSA) bence.mag...@baesystems.com:

Re: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters

2012-11-28 Thread Jonathan Bishop
Bence,

I set that value to 100 - I think there is a recommendation to set this
very high. Remember to reboot your cluster after making the change.
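For reference, raising the limit in mapred-site.xml would look like the following (the value 10000 here is just an arbitrary, generous example; on these older Hadoop versions the limit is read by the framework rather than the job, so the JobTracker and TaskTrackers need a restart to pick it up):

```xml
<property>
  <name>mapreduce.job.counters.limit</name>
  <value>10000</value>
  <description>Maximum number of counters a single job may create.</description>
</property>
```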

Jon


On Wed, Nov 28, 2012 at 6:07 AM, Magyar, Bence (US SSA) 
bence.mag...@baesystems.com wrote:

  I have successfully run the shortest path example using Avery’s sample
 input data.  I am now attempting to run the shortest-path algorithm on a
 much larger data set (300,000 nodes) and I am running into errors.  I have
 a 4-node cluster and am running the following command:

 ** **

 ** **

 ./giraph -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar
 org.apache.giraph.examples.SimpleShortestPathsVertex -if
 org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip
 /user/hduser/insight -of
 org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op
 /user/hduser/insight-out -w 3

 ** **

 ** **

 It appears as though the shortest path computation “finishes”.  That is to
 say, I hit “100%”.  Then the job just hangs for about 30 seconds, *
 decreases* it’s progress to 75%, and then finally throws an exception:

 ** **

 No HADOOP_CONF_DIR set, using /opt/hadoop-1.0.3/conf

 12/11/28 08:26:16 INFO mapred.JobClient: Running job: job_201211271542_0004
 

 12/11/28 08:26:17 INFO mapred.JobClient:  map 0% reduce 0%

 12/11/28 08:26:33 INFO mapred.JobClient:  map 25% reduce 0%

 12/11/28 08:26:40 INFO mapred.JobClient:  map 50% reduce 0%

 12/11/28 08:26:42 INFO mapred.JobClient:  map 75% reduce 0%

 12/11/28 08:26:44 INFO mapred.JobClient:  map 100% reduce 0%

 12/11/28 08:27:45 INFO mapred.JobClient:  map 75% reduce 0%

 12/11/28 08:27:50 INFO mapred.JobClient: Task Id :
 attempt_201211271542_0004_m_00_0, Status : FAILED

 java.lang.Throwable: Child Error

 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)***
 *

 Caused by: java.io.IOException: Task process exit with nonzero status of 1.
 

 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)***
 *

 ** **

 ** **

 Digging into the log files a little deeper, I noticed that the number of
 files generated by the *last* node in my cluster contains more log
 directories than the previous three.

 ** **

 I see:  

 ** **

 **·**attempt_201211280843_0001_m_00_0 -
 /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_00_0
 

 **·**attempt_201211280843_0001_m_00_0.cleanup -
 /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_00_0.cleanup
 

 **·**attempt_201211280843_0001_m_05_0 -
 /app/hadoop/tmp/mapred/local/userlogs/job_201211280843_0001/attempt_201211280843_0001_m_05_0
 

 **·**job-acls.xml

 ** **

 Whereas the first 3 nodes only contain 1 log folder underneath the job,
 something like:  “attempt_201211280843_0001_m_03_0”.  I am assuming
 this is because something went wrong on node 4 and some “cleanup logic” was
 attempted.

 ** **

 At any rate, when I cd into the first log folder on the bad node,
 (attempt_201211280843_0001_m_00_0) and look into “syslog”, I see the
 following error:

 ** **

 ** **

 2012-11-28 08:45:36,212 INFO org.apache.giraph.graph.BspServiceMaster:
 barrierOnWorkerList: Waiting on [cap03_3, cap02_1, cap01_2]

 2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster:
 collectAndProcessAggregatorValues: Processed aggregators

 2012-11-28 08:45:36,330 INFO org.apache.giraph.graph.BspServiceMaster:
 aggregateWorkerStats: Aggregation found
 (vtx=142711,finVtx=142711,edges=409320,msgCount=46846,haltComputation=false)
 on superstep = 98

 2012-11-28 08:45:36,341 INFO org.apache.giraph.graph.BspServiceMaster:
 coordinateSuperstep: Cleaning up old Superstep
 /_hadoopBsp/job_201211280843_0001/_applicationAttemptsDir/0/_superstepDir/97
 

 2012-11-28 08:45:36,611 INFO org.apache.giraph.graph.MasterThread:
 masterThread: Coordination of superstep 98 took 0.445 seconds ended with
 state THIS_SUPERSTEP_DONE and is now on superstep 99

 2012-11-28 08:45:36,611 FATAL org.apache.giraph.graph.GraphMapper:
 uncaughtException: OverrideExceptionHandler on thread
 org.apache.giraph.graph.MasterThread, msg = Error: Exceeded limits on
 number of counters - Counters=120 Limit=120, exiting...

 org.apache.hadoop.mapred.Counters$CountersExceededException: Error:
 Exceeded limits on number of counters - Counters=120 Limit=120

 at
 org.apache.hadoop.mapred.Counters$Group.getCounterForName(Counters.java:312)
 

 at org.apache.hadoop.mapred.Counters.findCounter(Counters.java:446)
 

 at
 org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:596)

 at
 org.apache.hadoop.mapred.Task$TaskReporter.getCounter(Task.java:541)

 at
 

Re: What a worker really is and other interesting runtime information

2012-11-28 Thread Avery Ching
Oh, forgot one thing.  You need to set the number of partitions to use, 
since each thread works on a single partition at a time.


Try -Dhash.userPartitionCount=<number of threads>
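The thread/partition interaction above can be sketched with a toy cost model: each compute thread works on one whole partition at a time, so threads beyond the partition count simply idle (the timings below are illustrative, not measurements):

```python
import math

# Toy cost model: a superstep takes as long as the busiest thread, and each
# thread processes whole partitions, so effective parallelism is capped at
# the partition count. Timings are illustrative, not measurements.
def superstep_time(partitions, threads, time_per_partition=1.0):
    usable = min(threads, partitions)        # extra threads have no work
    rounds = math.ceil(partitions / usable)  # partitions processed in waves
    return rounds * time_per_partition

# With a low partition count, adding compute threads changes nothing:
print(superstep_time(partitions=1, threads=12))   # → 1.0, same as threads=1
# With the partition count matching the thread count, threads help:
print(superstep_time(partitions=12, threads=12))  # → 1.0 vs 12.0 on one thread
```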

On 11/28/12 5:29 AM, Alexandros Daglis wrote:

Dear Avery,

I followed your advice, but the application seems to be totally 
thread-count-insensitive: I literally observe zero performance scaling 
as I increase the thread count. Maybe you can point out if I am doing 
something wrong.


- Using only 4 cores on a single node at the moment
- Input graph: 14 million vertices, file size is 470 MB
- Running SSSP as follows: hadoop jar 
target/giraph-0.1-jar-with-dependencies.jar 
org.apache.giraph.examples.SimpleShortestPathsVertex 
-Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=X input 
output 12 1

where X=1,2,3,12,30
- I notice a total insensitivity to the number of threads I specify. 
Aggregate core utilization is always approximately the same (usually 
around 25-30% = only one of the cores running) and overall execution 
time is always the same (~8 mins)


Why is Giraph's performance not scaling? Is the input size / number of 
workers inappropriate? It's not an IO issue either, because even 
during really low core utilization, time is wasted on idle, not on IO.


Cheers,
Alexandros



On 28 November 2012 11:13, Alexandros Daglis 
alexandros.dag...@epfl.ch wrote:


Thank you Avery, that helped a lot!

Regards,
Alexandros


On 27 November 2012 20:57, Avery Ching ach...@apache.org wrote:

Hi Alexandros,

The extra task is for the master process (a coordination
task). In your case, since you are using a single machine, you
can use a single task.

-Dgiraph.SplitMasterWorker=false

and you can try multithreading instead of multiple workers.

-Dgiraph.numComputeThreads=12

The rise in CPU usage comes from the Netty threads that
handle network requests.  By using multithreading instead, you
should bypass this.

Avery


On 11/27/12 9:40 AM, Alexandros Daglis wrote:

Hello everybody,

I went through most of the documentation I could find for
Giraph and also most of the messages in this email list,
but still I have not figured out precisely what a worker
really is. I would really appreciate it if you could help
me understand how the framework works.

At first I thought that a worker has a one-to-one
correspondence to a map task. Apparently this is not
exactly the case, since I have noticed that if I ask for x
workers, the job finishes after having used x+1 map tasks.
What is this extra task for?

I have been trying out the example SSSP application on a
single node with 12 cores. Giving an input graph of ~400MB
and using 1 worker, around 10 GBs of memory are used
during execution. What intrigues me is that if I use 2
workers for the same input (and without limiting memory
per map task), double the memory will be used.
Furthermore, there will be no improvement in performance.
I rather notice a slowdown. Are these observations normal?

Might it be the case that 1 and 2 workers are very few and
I should go to the 30-100 range that is the proposed
number of mappers for a conventional MapReduce job?

Finally, a last observation. Even though I use only 1
worker, I see that there are significant periods during
execution where up to 90% of the 12 cores' computing power
is consumed, that is, almost 10 cores are used in
parallel. Does each worker spawn multiple threads and
dynamically balance the load to utilize the available
hardware?

Thanks a lot in advance!

Best,
Alexandros









RE: ShortestPathExample on 300,000 node graph - Error: Exceeded limits on number of counters

2012-11-28 Thread Magyar, Bence (US SSA)
Thank you Andre, 

Setting giraph.useSuperstepCounters = false 

solved my issue.  The job still hung at 100% and then eventually completed 
successfully.

-Bence

-Original Message-
From: André Kelpe [mailto:efeshundert...@googlemail.com] 
Sent: Wednesday, November 28, 2012 10:45 AM
To: user@giraph.apache.org
Subject: Re: ShortestPathExample on 300,000 node graph - Error: Exceeded limits 
on number of counters


Issue running Giraph on more mappers

2012-11-28 Thread Tripti Singh
Hi,
I am trying to run this workflow which uses Giraph.
I am able to successfully run the Giraph job when I use fewer mappers and 
less data, but it fails with more mappers.
This is what the logs say for master and worker nodes:

Master Node:

2012-11-29 00:01:10,235 INFO [main] org.apache.giraph.zk.ZooKeeperManager: 
onlineZooKeeperServers: Connected to 
gsta31113.tan.ygrid.yahoo.com/10.216.124.59:24681!
2012-11-29 00:01:10,235 INFO [main] org.apache.giraph.zk.ZooKeeperManager: 
onlineZooKeeperServers: Creating my filestamp 
_bsp/_defaultZkManagerDir/_zkServer/gsta31113.tan.ygrid.yahoo.com 3
2012-11-29 00:01:10,241 INFO [main] org.apache.giraph.graph.GraphMapper: setup: 
Starting up BspServiceMaster (master thread)...
2012-11-29 00:01:10,257 INFO [main] org.apache.giraph.graph.BspService: 
BspService: Connecting to ZooKeeper with job job_1353148790244_114419, 3 on 
gsta31113.tan.ygrid.yahoo.com:24681
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:zookeeper.version=3.4.4-1386507, built on 09/17/2012 08:33 GMT
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:host.name=gsta31113.tan.ygrid.yahoo.com
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:java.version=1.6.0_21
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:java.vendor=Sun Microsystems Inc.
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:java.home=/home/Releases/gridjdk-1.6.0_21.1011192346-20110120-000/share/gridjdk-1.6.0_21/jre
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:java.class.path= {really long class path}
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:java.library.path=/home/Releases/gridjdk-1.6.0_21.1011192346-20110120-000/share/gridjdk-1.6.0_21/jre/lib/i386/server:/home/Releases/gridjdk-1.6.0_21.1011192346-20110120-000/share/gridjdk-1.6.0_21/jre/lib/i386:/home/Releases/gridjdk-1.6.0_21.1011192346-20110120-000/share/gridjdk-1.6.0_21/jre/../lib/i386:/grid/2/tmp/yarn-local/usercache/nova_sln/appcache/application_1353148790244_114419/container_1353148790244_114419_01_09:/home/gs/hadoop/current/lib/native/Linux-i386-32:/usr/java/packages/lib/i386:/lib:/usr/lib
2012-11-29 00:01:10,278 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:java.io.tmpdir=/grid/2/tmp/yarn-local/usercache/nova_sln/appcache/application_1353148790244_114419/container_1353148790244_114419_01_09/tmp
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:java.compiler=
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:os.name=Linux
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:os.arch=i386
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:os.version=2.6.18-238.19.1.el5.YAHOO.20111028
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:user.name=nova_sln
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:user.home=/homes/nova_sln
2012-11-29 00:01:10,279 INFO [main] org.apache.zookeeper.ZooKeeper: Client 
environment:user.dir=/grid/2/tmp/yarn-local/usercache/nova_sln/appcache/application_1353148790244_114419/container_1353148790244_114419_01_09
2012-11-29 00:01:10,280 INFO [main] org.apache.zookeeper.ZooKeeper: Initiating 
client connection, connectString=gsta31113.tan.ygrid.yahoo.com:24681 
sessionTimeout=6 watcher=org.apache.giraph.graph.BspServiceMaster@16f70a4
2012-11-29 00:01:10,304 INFO 
[main-SendThread(gsta31113.tan.ygrid.yahoo.com:24681)] 
org.apache.zookeeper.ClientCnxn: Opening socket connection to server 
gsta31113.tan.ygrid.yahoo.com/10.216.124.59:24681. Will not attempt to 
authenticate using SASL (Unable to locate a login configuration)
2012-11-29 00:01:10,305 INFO 
[main-SendThread(gsta31113.tan.ygrid.yahoo.com:24681)] 
org.apache.zookeeper.ClientCnxn: Socket connection established to 
gsta31113.tan.ygrid.yahoo.com/10.216.124.59:24681, initiating session
2012-11-29 00:01:10,331 INFO 
[main-SendThread(gsta31113.tan.ygrid.yahoo.com:24681)] 
org.apache.zookeeper.ClientCnxn: Session establishment complete on server 
gsta31113.tan.ygrid.yahoo.com/10.216.124.59:24681, sessionid = 
0x13b497783e4, negotiated timeout = 60
2012-11-29 00:01:10,333 INFO [main-EventThread] 
org.apache.giraph.graph.BspService: process: Asynchronous connection complete.
2012-11-29 00:01:10,335 INFO [main] org.apache.giraph.graph.GraphMapper: map: 
No need to do anything when not a worker
2012-11-29 00:01:10,335 INFO [main] org.apache.giraph.graph.GraphMapper: 
cleanup: Starting for MASTER_ZOOKEEPER_ONLY
2012-11-29 00:01:10,396 INFO [org.apache.giraph.graph.MasterThread] 
org.apache.giraph.graph.BspServiceMaster: becomeMaster: First child is 

Re: _zkServer does not Exist

2012-11-28 Thread Tripti Singh
Hi,
Update on this one.
I was able to resolve this error with the patch here:
https://issues.apache.org/jira/browse/GIRAPH-391

Thanks,
Tripti.

From: Yahoo! Inc. tri...@yahoo-inc.com
Reply-To: user@giraph.apache.org
Date: Monday, October 22, 2012 5:19 PM
To: user@giraph.apache.org
Subject: Re: _zkServer does not Exist

Hi,
I am trying to build Giraph with Hadoop_0.23 profile.
When I try to run the PageRankBenchmark, I get the following error:

Cmd: hadoop jar 
giraph-0.71-SNAPSHOT-for-hadoop-0.23.4.1210022201-jar-with-dependencies.jar 
org.apache.giraph.benchmark.PageRankBenchmark -Dmapred.job.tracker=host:port 
-Dgiraph.zkManagerDirectory='/tmp/giraph/cc/_bsp/_defaultZkManagerDir' -c 1 -e 
2 -s 2 -V 10 -w 1

2012-10-22 10:53:33,446 WARN [main] org.apache.hadoop.conf.Configuration: 
job.xml:an attempt to override final parameter: 
yarn.app.mapreduce.am.job.client.port-range;  Ignoring.
2012-10-22 10:53:33,446 WARN [main] org.apache.hadoop.conf.Configuration: 
job.xml:an attempt to override final parameter: 
mapreduce.admin.reduce.child.java.opts;  Ignoring.
2012-10-22 10:53:33,447 WARN [main] org.apache.hadoop.conf.Configuration: 
job.xml:an attempt to override final parameter: hadoop.tmp.dir;  Ignoring.
2012-10-22 10:53:33,561 INFO [main] org.apache.giraph.graph.GraphMapper: setup: 
Set log level to info
2012-10-22 10:53:33,561 INFO [main] org.apache.giraph.graph.GraphMapper: 
Distributed cache is empty. Assuming fatjar.
2012-10-22 10:53:33,561 INFO [main] org.apache.giraph.graph.GraphMapper: setup: 
classpath @ {path}/job.jar for job org.apache.giraph.benchmark.PageRankBenchmark
2012-10-22 10:53:33,564 WARN [main] org.apache.hadoop.conf.Configuration: 
mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
2012-10-22 10:53:33,564 WARN [main] org.apache.hadoop.conf.Configuration: 
mapred.job.id is deprecated. Instead, use mapreduce.job.id
2012-10-22 10:53:33,564 WARN [main] org.apache.hadoop.conf.Configuration: 
job.local.dir is deprecated. Instead, use mapreduce.job.local.dir
2012-10-22 10:53:33,565 INFO [main] org.apache.giraph.zk.ZooKeeperManager: 
createCandidateStamp: Made the directory 
/tmp/giraph/cc/_bsp/_defaultZkManagerDir
2012-10-22 10:53:33,568 INFO [main] org.apache.giraph.zk.ZooKeeperManager: 
createCandidateStamp: Creating my filestamp 
/tmp/giraph/cc/_bsp/_defaultZkManagerDir/_task/gsrd215n08.red.ygrid.yahoo.com 1
2012-10-22 10:53:33,601 INFO [main] org.apache.giraph.zk.ZooKeeperManager: 
getZooKeeperServerList: For task 1, got file 
'zkServerList_gsrd208n08.red.ygrid.yahoo.com 0 ' (polling period is 3000)
2012-10-22 10:53:33,601 INFO [main] org.apache.giraph.zk.ZooKeeperManager: 
getZooKeeperServerList: Found [gsrd208n08.red.ygrid.yahoo.com, 0] 2 hosts in 
filename 'zkServerList_gsrd208n08.red.ygrid.yahoo.com 0 '
2012-10-22 10:53:33,603 WARN [main] org.apache.hadoop.mapred.YarnChild: 
Exception running child : java.lang.IllegalStateException: run: Caught an 
unrecoverable exception java.io.FileNotFoundException: File 
/tmp/giraph/cc/_bsp/_defaultZkManagerDir/_zkServer does not exist.
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:595)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:725)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File 
/tmp/giraph/cc/_bsp/_defaultZkManagerDir/_zkServer does not exist.
at 
org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:796)
at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:328)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:573)
... 7 more
Caused by: java.io.FileNotFoundException: File 
/tmp/giraph/cc/_bsp/_defaultZkManagerDir/_zkServer does not exist.
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:362)
at 
org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:755)
... 9 more

2012-10-22 10:53:33,607 INFO [main] org.apache.hadoop.mapred.Task: Runnning 
cleanup for the task
2012-10-22 10:53:33,609 INFO [main] 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics 
system...
2012-10-22 10:53:33,610 INFO [main] 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system 
stopped.
2012-10-22 10:53:33,610 INFO [main]