Re: Zookeeper Problems when running Giraph (PageRankVertex) on large graphs.

2014-09-25 Thread Fontana, Peter C.
Hello,

Thank you for the response.

I tried to find the zookeeper logs from where the zookeeper servers store the 
data logs, but I have no entries with that date. I used

echo conf | nc server port

to find the log directory, and then used

java org.apache.zookeeper.server.LogFormatter 

to read the logs. The only entries I have report an error -101 and an error 
-110, neither of which I think is the problem.
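
For reference, the two codes can be looked up by name. The sketch below hard-codes a handful of values from ZooKeeper's KeeperException.Code enum; ZkErrorNames and describe() are hypothetical helpers written for this note, not part of ZooKeeper. With the ZooKeeper jar on the classpath, KeeperException.Code.get(int) should give the same mapping.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: names for a few ZooKeeper error codes, copied from
// the KeeperException.Code enum. Not part of ZooKeeper itself.
final class ZkErrorNames {
    private static final Map<Integer, String> NAMES = new HashMap<>();
    static {
        NAMES.put(-4, "CONNECTIONLOSS");
        NAMES.put(-101, "NONODE");       // the znode does not exist
        NAMES.put(-103, "BADVERSION");
        NAMES.put(-110, "NODEEXISTS");   // the znode already exists
        NAMES.put(-112, "SESSIONEXPIRED");
    }

    // Returns the symbolic name, or UNKNOWN(code) for codes not listed above.
    static String describe(int code) {
        return NAMES.getOrDefault(code, "UNKNOWN(" + code + ")");
    }

    public static void main(String[] args) {
        System.out.println(describe(-101));
        System.out.println(describe(-110));
    }
}
```

NONODE and NODEEXISTS are usually routine results of create and exists calls, which fits the impression that neither is the root cause here.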

If you do not mind me asking, what do you mean by “partitions”, and what 
ZooKeeper limit was exceeded? Is having too many partitions a Giraph issue, or is 
this an issue of putting the data into fewer files? I processed the input data 
in Hadoop, and the input directory contains multiple files.

Thank you for your time.

Best Wishes,
Peter

From: Lukas Nalezenec <lukas.naleze...@firma.seznam.cz>
Reply-To: user Giraph Mailing List <user@giraph.apache.org>
Date: Wednesday, September 24, 2014 at 8:50 AM
To: user Giraph Mailing List <user@giraph.apache.org>
Subject: Re: Zookeeper Problems when running Giraph (PageRankVertex) on large graphs.

Hi,
I had similar problems. The files that Giraph wrote to ZooKeeper were over the 
limit, so ZooKeeper crashed. In my case I had too many partitions with overly 
long input and checkpoint paths.
You can try to get logs from the Zookeeper process.

Lukas
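
The message above does not name the limit. One candidate that matches the description is ZooKeeper's cap on the size of a single znode's data, jute.maxbuffer, which defaults to 0xfffff bytes (just under 1 MB). The sketch below is a back-of-the-envelope check only: ZnodeSizeCheck, the 4-byte per-entry overhead, and the idea that the payload is a list of path strings are all assumptions made for illustration, not Giraph's actual serialization.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical estimate of whether a znode payload made of many path
// strings would exceed ZooKeeper's default jute.maxbuffer.
final class ZnodeSizeCheck {
    // Default value of jute.maxbuffer: 0xfffff bytes, just under 1 MB.
    static final int DEFAULT_JUTE_MAXBUFFER = 0xfffff;

    // Assumes each entry stores one path plus a 4-byte length prefix.
    static long estimateBytes(int entries, String path) {
        long perEntry = 4L + path.getBytes(StandardCharsets.UTF_8).length;
        return entries * perEntry;
    }

    static boolean exceedsDefaultLimit(long payloadBytes) {
        return payloadBytes > DEFAULT_JUTE_MAXBUFFER;
    }

    public static void main(String[] args) {
        // Placeholder path and entry count, chosen only to exercise the check.
        long bytes = estimateBytes(20000, "/user/someone/some/long/input/path/part-00000");
        System.out.println(bytes + " bytes, over default limit: " + exceedsDefaultLimit(bytes));
    }
}
```

If this is the limit in question, either fewer or shorter entries, or a larger jute.maxbuffer set consistently on servers and clients, would be the knobs to look at.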

On 24.9.2014 14:31, Fontana, Peter C. wrote:
Hello,

I am trying to run the PageRankVertex code on a large graph. I successfully got 
it to run on smaller examples, but when I try to run it on a large example 
(100M nodes, 10B edges, 300GB space), it does not finish. I get the following 
error.

java.lang.IllegalStateException: run: Caught an unrecoverable exception 
waitFor: ExecutionException occurred while waiting for 
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@2ffecaeb
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:102)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.IllegalStateException: waitFor: ExecutionException 
occurred while waiting for 
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@2ffecaeb
at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:151)
at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableU
---
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 65.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)


From looking at the logs, I think the following line is the cause of the error:

date hr:19:56,576 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: 
Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@2ffecaeb
date hr:19:56,577 INFO org.apache.zookeeper.ClientCnxn: Client session timed 
out, have not heard from server in 52735ms for sessionid 0x24889e30b710063, 
closing socket connection and attempting reconnect
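
A note on the session timeout in that log line: a ZooKeeper server does not have to grant the timeout a client asks for. By default it clamps the request to between 2 and 20 times its tickTime (minSessionTimeout and maxSessionTimeout), so a large client-side value may be reduced silently. The sketch below only illustrates that clamping rule; SessionTimeoutSketch is a hypothetical name, and the tickTime of 2000 ms is an assumed server setting, not something taken from this cluster.

```java
// Hypothetical illustration of ZooKeeper's default session-timeout clamping.
final class SessionTimeoutSketch {
    // Default bounds: minSessionTimeout = 2 * tickTime, maxSessionTimeout = 20 * tickTime.
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;
        int max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        // Assumed tickTime of 2000 ms; the request is an arbitrary large value.
        System.out.println(negotiate(600000, 2000));
    }
}
```

The `conf` four-letter command reports the server's tickTime and session timeout bounds, which would show whether a requested timeout can take effect.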

After the signature is the syntax of the command and the full error log of the 
failing nodes (with machine names, pathnames, and ip addresses replaced with 
generic names).

I looked at the following thread on the giraph mailing list: 
http://mail-archives.apache.org/mod_mbox/giraph-user/201310.mbox/%3cCAEv8GwXmg9YPTYoR6QAtwqqcWAgT8PbMaqFjKz=pn1+w51m...@mail.gmail.com%3e
 but that change did not solve the problem.

I have tried using an out-of-core graph, but that did not solve the problem. I 
have also tried enabling checkpoints with -Dgiraph.checkpointFrequency=1, but 
that did not solve the problem either; I get similar errors. I have also both 
increased and decreased the number of workers, but that did not solve the 
problem.

Does anybody have any thoughts? Is it a memory issue or is it something else? I 
am using Giraph 1.0.0 built from the 1.0.0 branch of the GitHub repository. 
PageRankVertex (and all the other classes) are example classes that are bundled 
with the Giraph source (giraph-examples), so I am using pre-built code rather 
than compiling my own Giraph code. I get a similar error using 
PageRankBenchmark.

Thank you for your time.

Best Wishes,
Peter


Command:
/usr/local/giraph$ hadoop jar giraph-examples.jar 
org.apache.giraph.GiraphRunner  
-Dgiraph.zkList=node1.loc:port,node2.loc:port,node3.loc:port 
-Dmapred.child.java-opts="-Xmx64g -Xms64g XX:+UseConcMarkSweepGC 
-XX:-UseGCOverheadLimit" -Dgiraph.zkJavaOpts="-Xmx64g -XX:ParallelGCThreads=4 
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 
-XX:MaxGCPauseMillis=100 -XX:-UseGCOverheadLimit" 
-Dgiraph.useSuperstepCounters=false  -Dgiraph.zkSessionMsecTime

Re: understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

2014-09-25 Thread Matthew Cornell
On Mon, Sep 22, 2014 at 2:10 PM, Matthew Saltz wrote:
> In the logs for the workers, do you have a line that looks like:
> 2014-09-21 18:12:13,021 INFO org.apache.giraph.worker.BspServiceWorker:
> finishSuperstep: Waiting on all requests, superstep 93 Memory
> (free/total/max) = 21951.08M / 36456.50M / 43691.00M
>
> Looking at the memory usage in the worker that fails at the end of
> superstep before failure could give you a clue.

Yes, all four workers when I use "-w 4" have those lines:

Task Logs: 'attempt_201409191450_0016_m_01_0': compute-0-1:
2014-09-25 09:28:13,425 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep -1 Memory
(free/total/max) = 242.41M / 438.06M / 1820.50M
2014-09-25 09:28:13,817 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 0 Memory
(free/total/max) = 194.77M / 438.06M / 1820.50M
2014-09-25 09:28:14,936 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 1 Memory
(free/total/max) = 383.74M / 600.38M / 1820.50M
2014-09-25 09:28:17,820 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 2 Memory
(free/total/max) = 362.14M / 1007.50M / 1820.50M
2014-09-25 09:28:31,680 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 3 Memory
(free/total/max) = 203.33M / 1661.50M / 1820.50M

Task Logs: 'attempt_201409191450_0016_m_02_0': compute-0-1:
2014-09-25 09:28:13,458 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep -1 Memory
(free/total/max) = 887.74M / 964.50M / 1820.50M
2014-09-25 09:28:14,381 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 0 Memory
(free/total/max) = 830.14M / 964.50M / 1820.50M
2014-09-25 09:28:15,337 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 1 Memory
(free/total/max) = 785.66M / 1217.00M / 1820.50M
2014-09-25 09:28:18,114 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 2 Memory
(free/total/max) = 661.72M / 1113.50M / 1820.50M
2014-09-25 09:28:52,451 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 3 Memory
(free/total/max) = 285.90M / 1831.00M / 1831.00M

Task Logs: 'attempt_201409191450_0016_m_03_0': wright:
2014-09-25 09:28:13,456 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep -1 Memory
(free/total/max) = 886.23M / 964.50M / 1820.50M
2014-09-25 09:28:14,399 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 0 Memory
(free/total/max) = 826.36M / 964.50M / 1820.50M
2014-09-25 09:28:15,556 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 1 Memory
(free/total/max) = 662.50M / 1217.00M / 1820.50M
2014-09-25 09:28:18,170 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 2 Memory
(free/total/max) = 581.14M / 1115.00M / 1820.50M
2014-09-25 09:29:31,673 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 3 Memory
(free/total/max) = 299.61M / 1834.00M / 1834.00M

Task Logs: 'attempt_201409191450_0016_m_04_0': wright:
2014-09-25 09:28:13,473 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep -1 Memory
(free/total/max) = 887.10M / 964.50M / 1820.50M
2014-09-25 09:28:14,374 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 0 Memory
(free/total/max) = 826.65M / 964.50M / 1820.50M
2014-09-25 09:28:15,755 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 1 Memory
(free/total/max) = 980.33M / 1217.00M / 1820.50M
2014-09-25 09:28:18,254 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 2 Memory
(free/total/max) = 517.13M / 1128.50M / 1820.50M
2014-09-25 09:29:34,392 INFO org.apache.giraph.worker.BspServiceWorker:
finishSuperstep: Waiting on all requests, superstep 3 Memory
(free/total/max) = 271.52M / 1858.50M / 1858.50M
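
To compare the workers above more easily, the used heap (total minus free) and its share of max can be pulled out of each entry. This is a minimal sketch: MemoryLogLine is a hypothetical helper, and the regular expression is written against the log lines quoted in this message only, so other Giraph versions may format the line differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the "Memory (free/total/max)" entries quoted above.
final class MemoryLogLine {
    // \s+ between tokens lets the pattern match entries wrapped over several lines.
    private static final Pattern MEMORY = Pattern.compile(
        "superstep\\s+(-?\\d+)\\s+Memory\\s+\\(free/total/max\\)\\s*=\\s*"
        + "([\\d.]+)M\\s*/\\s*([\\d.]+)M\\s*/\\s*([\\d.]+)M");

    // Returns {superstep, usedMB, usedFractionOfMax}, or null if nothing matches.
    static double[] parse(String entry) {
        Matcher m = MEMORY.matcher(entry);
        if (!m.find()) {
            return null;
        }
        double free = Double.parseDouble(m.group(2));
        double total = Double.parseDouble(m.group(3));
        double max = Double.parseDouble(m.group(4));
        double used = total - free;
        return new double[] {Double.parseDouble(m.group(1)), used, used / max};
    }

    public static void main(String[] args) {
        String entry = "finishSuperstep: Waiting on all requests, superstep 3 Memory\n"
            + "(free/total/max) = 203.33M / 1661.50M / 1820.50M";
        double[] r = parse(entry);
        System.out.println("superstep " + (int) r[0] + ": used " + r[1]
            + "M, fraction of max " + r[2]);
    }
}
```

For the first worker at superstep 3, this gives about 1458M used out of a 1820.50M maximum, roughly 80% of the available heap.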


I'm still not clear on a couple of things:

   1. Each compute node has 16GB of memory, but each task has a max of
   ~1820M (<2GB). In Cloudera's web UI, I set "MapReduce Child Java Maximum
   Heap Size" to 2GB (default is 1GB). I will try upping it to 8GB.
   2. I still don't understand why only two of my five possible nodes are
   being used.
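
On item 1, the "max" figure appears to follow the task JVM's maximum heap rather than the node's 16GB. Assuming the free/total/max triple comes from java.lang.Runtime (an assumption about how Giraph produces the line, not confirmed in this thread), the sketch below prints the same three numbers for whatever JVM it runs in. HeapReport is a hypothetical name.

```java
import java.util.Locale;

// Hypothetical check: print this JVM's heap figures in the same shape as the
// worker log lines, to see what "max" a given -Xmx setting produces.
final class HeapReport {
    private static final double MB = 1024.0 * 1024.0;

    static String report() {
        Runtime rt = Runtime.getRuntime();
        return String.format(Locale.ROOT,
            "Memory (free/total/max) = %.2fM / %.2fM / %.2fM",
            rt.freeMemory() / MB, rt.totalMemory() / MB, rt.maxMemory() / MB);
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Running it as `java -Xmx2g HeapReport` on a task node would show the maximum that a 2GB setting yields. Runtime.maxMemory() can come out somewhat below the configured -Xmx depending on the garbage collector, which would be consistent with seeing about 1820M under a 2GB limit.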

Thank you.



-- 
Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34 Dickinson
Street, Amherst MA 01002 | matthewcornell.org


Re: receiving messages that I didn't send

2014-09-25 Thread Matthew Cornell
Thanks for replying, Pavan. I figured out that my message Writable (an
ArrayListWritable) needed to call clear() in readFields() before
calling super.readFields():

@Override
public void readFields(DataInput in) throws IOException {
    // The instance is reused, so drop the previous message's elements first.
    clear();
    super.readFields(in);
}

This was an 'of course' moment when I realized it was, like other
Writables, being reused. But what I don't understand is why doesn't
ArrayListWritable#readFields() call clear? Isn't this a nasty bug? ...
Oh wait - sure enough:

ArrayListWritable object is not cleared in readFields()
https://issues.apache.org/jira/browse/GIRAPH-740
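
The effect can be reproduced without Hadoop or Giraph. The sketch below is an illustration only: IdList and ReuseDemo are hypothetical stand-ins built on java.io, not Giraph's ArrayListWritable. It reads two serialized messages into one reused instance, once without and once with the clear() call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a list-backed Writable.
final class IdList {
    final List<Long> ids = new ArrayList<>();
    private final boolean clearOnRead;

    IdList(boolean clearOnRead) {
        this.clearOnRead = clearOnRead;
    }

    void readFields(DataInput in) throws IOException {
        if (clearOnRead) {
            ids.clear();  // the fix: forget the previous message's contents
        }
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            ids.add(in.readLong());
        }
    }
}

final class ReuseDemo {
    // Serializes ids as a count followed by the values.
    static byte[] serialize(long... ids) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(ids.length);
            for (long id : ids) {
                out.writeLong(id);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads message {1, 2} and then message {3} into ONE reused instance.
    static List<Long> readTwice(boolean clearOnRead) {
        IdList reused = new IdList(clearOnRead);
        try {
            reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize(1, 2))));
            reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize(3))));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reused.ids;
    }

    public static void main(String[] args) {
        System.out.println("without clear(): " + readTwice(false));
        System.out.println("with clear():    " + readTwice(true));
    }
}
```

Without clear(), the second message appears to contain ids 1 and 2 that its sender never put there; with clear(), it contains only 3.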

Thanks again,

matt


On Tue, Sep 23, 2014 at 11:46 AM, Pavan Kumar A wrote:
> Can you give more context?
> What are the types of messages, patch of your compute method, etc.
> You will not receive messages that are not sent, but one thing that can
> happen is
> -- message can have multiple parameters.
> suppose message objects can have 2 parameters
> m - a,b
> say in m's write(out) you do not handle the case of b = null
> m1 sets b
> m2 has b=null
> then because of incorrect code for m's write() m2 can show b = m1.b
> that is because message objects will be re-used when receiving. This is a
> Giraph gotcha, because of
> object reuse in most iterators.
>
> Thanks
>
>> From: m...@matthewcornell.org
>> Date: Tue, 23 Sep 2014 10:10:48 -0400
>> Subject: receiving messages that I didn't send
>> To: user@giraph.apache.org
>
>>
>> Hi Folks. I am refactoring my compute() to use a set of ids as its
>> message type, and in my tests it is receiving a message that it
>> absolutely did not send. I've debugged it and am at a loss.
>> Interestingly, I encountered this once before and solved it by
>> creating a copy of a Writable instead of re-using it, but I haven't
>> been able to solve it this time. In general, does this anomalous
>> behavior indicate a Giraph/Hadoop gotcha? It's really confounding!
>> Thanks very much -- matt
>>
>> --
>> Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34
>> Dickinson Street, Amherst MA 01002 | matthewcornell.org



-- 
Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34
Dickinson Street, Amherst MA 01002 | matthewcornell.org