@Pankaj, I am running the ShortestPath example on a tiny graph now (5 nodes). That also hangs indefinitely in exactly the same way. This machine has 1 TB of memory and I have set -Xmx25g (25 GB) in the Java options, so this should not be a memory limitation. [(free/total/max) = 1706.68M / 1979.75M / 25242.25M]
@Lukas, I am trying to run the example packaged with the Giraph installation (SimpleShortestPathsVertex). I haven't written any code myself yet; I am just trying to get this working first. I am not getting any memory exception, and no dump file is being generated at the dump path.

$HADOOP_HOME/bin/hadoop jar ~/.local/bin/giraph-examples.jar \
    org.apache.giraph.GiraphRunner \
    -D giraph.logLevel="all" \
    -libjars ~/.local/bin/giraph-core.jar \
    org.apache.giraph.examples.SimpleShortestPathsVertex \
    -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
    -vip /user/vikesh/input/tiny_graph.txt \
    -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /user/vikesh/shortestPaths8 \
    -ca SimpleShortestPathsVertex.source=2 \
    -w 1

I am printing debug-level logs now, and I see these calls repeating indefinitely in both the ZooKeeper and worker tasks:

2014-04-07 14:45:32,325 DEBUG org.apache.hadoop.ipc.RPC: Call: statusUpdate 8
2014-04-07 14:45:35,326 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 sending #34
2014-04-07 14:45:35,327 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 got value #34
2014-04-07 14:45:35,327 DEBUG org.apache.hadoop.ipc.RPC: Call: ping 2
2014-04-07 14:45:38,328 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 sending #35
2014-04-07 14:45:38,329 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 got value #35
2014-04-07 14:45:38,329 DEBUG org.apache.hadoop.ipc.RPC: Call: ping 1
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Got timed signaled of false
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Wait for 0
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Got timed signaled of false
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Wait for 0

These calls go on for 10 minutes, and then the job is killed by Hadoop.

Thanks,
Vikesh Khanna,
Masters, Computer Science (Class of 2015)
Stanford University

----- Original Message -----
From: "Lukas Nalezenec" <lukas.naleze...@firma.seznam.cz>
To: user@giraph.apache.org
Sent: Monday, April 7, 2014 4:13:23 AM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

Hi,
Try making and analyzing a memory dump after the exception (JVM param: -XX:+HeapDumpOnOutOfMemoryError).
What configuration (mainly the Partition class) do you use?
Lukas

On 7.4.2014 11:45, Vikesh Khanna wrote:

Hi,

Any ideas why Giraph waits indefinitely? I've been stuck on this for a long time now.

Thanks,
Vikesh Khanna,
Masters, Computer Science (Class of 2015)
Stanford University

----- Original Message -----
From: "Vikesh Khanna" <vik...@stanford.edu>
To: user@giraph.apache.org
Sent: Friday, April 4, 2014 6:06:51 AM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

Hi Avery,

I tried both the options. It does appear to be a GC problem, and the problem continues with the second option as well :(. I have attached the logs after enabling the first set of options and using 1 worker. It would be very helpful if you could take a look.

This machine has 1 TB of memory. We ran benchmarks of various other graph libraries on this machine and they worked fine (even with graphs 10x larger than the Giraph PageRank benchmark - 40 million nodes). I am sure Giraph would work fine as well; this should not be a resource constraint.

Thanks,
Vikesh Khanna,
Masters, Computer Science (Class of 2015)
Stanford University

----- Original Message -----
From: "Avery Ching" <ach...@apache.org>
To: user@giraph.apache.org
Sent: Thursday, April 3, 2014 7:26:56 PM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

This appears to be for a single worker. Most likely your worker went into GC and never returned.
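[Editor's note: one thing worth double-checking in a hang like this is the vertex input file itself. JsonLongDoubleFloatDoubleVertexInputFormat expects one JSON array per line of the form [vertex_id, vertex_value, [[target_id, edge_weight], ...]]. A minimal sketch of a validator for that layout follows; the sample lines are illustrative of the documented format, not the actual contents of tiny_graph.txt.]

```python
import json

# Each input line is a JSON array: [vertex_id, vertex_value, [[target_id, edge_weight], ...]].
# These sample lines are illustrative only; they are not the real tiny_graph.txt.
SAMPLE_LINES = [
    '[0,0,[[1,1],[3,3]]]',
    '[1,0,[[0,1],[2,2],[3,1]]]',
    '[2,0,[[1,2],[4,4]]]',
]

def parse_vertex(line):
    """Parse one vertex line into (id, value, edge list), mirroring the expected schema."""
    vertex_id, value, edges = json.loads(line)
    return int(vertex_id), float(value), [(int(t), float(w)) for t, w in edges]

for line in SAMPLE_LINES:
    vid, value, edges = parse_vertex(line)
    print(vid, value, edges)
```

A malformed line (e.g. a missing bracket) raises json.JSONDecodeError here, which is a quick way to spot input that Giraph would choke on during vertex loading.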
You can try with GC logging turned on; try adding something like:

-XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc

You could also try the concurrent mark/sweep collector:

-XX:+UseConcMarkSweepGC

Any chance you can use more workers and/or get more memory?

Avery

On 4/3/14, 5:46 PM, Vikesh Khanna wrote:

@Avery,

Thanks for the help. I checked the task logs, and it turns out there was a "GC overhead limit exceeded" exception, due to which the benchmarks wouldn't even load the vertices. I got around it by increasing the heap size (mapred.child.java.opts) in mapred-site.xml. The benchmark is loading vertices now. However, the job is still getting stuck indefinitely (and eventually killed). I have attached the small log for the map task on 1 worker. I would really appreciate it if you could help me understand the cause.

Thanks,
Vikesh Khanna,
Masters, Computer Science (Class of 2015)
Stanford University

----- Original Message -----
From: "Praveen kumar s.k" <skpraveenkum...@gmail.com>
To: user@giraph.apache.org
Sent: Thursday, April 3, 2014 4:40:07 PM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

You have given -w 30; make sure that many map tasks are configured in your cluster.

On Thu, Apr 3, 2014 at 6:24 PM, Avery Ching <ach...@apache.org> wrote:
> My guess is that you don't get your resources. It would be very helpful to
> print the master log. You can find it while the job is running by looking at
> the Hadoop counters on the job UI page.
>
> Avery
>
> On 4/3/14, 12:49 PM, Vikesh Khanna wrote:
>
> Hi,
>
> I am running the PageRank benchmark under giraph-examples from the
> giraph-1.0.0 release. I am using the following command to run the job
> (as mentioned here):
>
> vikesh@madmax /lfs/madmax/0/vikesh/usr/local/giraph/giraph-examples/src/main/java/org/apache/giraph/examples
> $ $HADOOP_HOME/bin/hadoop jar \
>     $GIRAPH_HOME/giraph-core/target/giraph-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
>     org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 50000000 -w 30
>
> However, the job gets stuck at map 9% and is eventually killed by the
> JobTracker on reaching mapred.task.timeout (default 10 minutes). I tried
> increasing the timeout to a very large value, and the job ran for over 8
> hours without completing. I also tried ShortestPathsBenchmark, which
> fails the same way.
>
> Any help is appreciated.
>
> ****** ---------------- ***********
>
> Machine details:
>
> Linux version 2.6.32-279.14.1.el6.x86_64 (mockbu...@c6b8.bsys.dev.centos.org)
> (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)) #1 SMP Tue Nov 6 23:43:09 UTC 2012
>
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                64
> On-line CPU(s) list:   0-63
> Thread(s) per core:    1
> Core(s) per socket:    8
> CPU socket(s):         8
> NUMA node(s):          8
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 47
> Stepping:              2
> CPU MHz:               1064.000
> BogoMIPS:              5333.20
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              24576K
> NUMA node0 CPU(s):     1-8
> NUMA node1 CPU(s):     9-16
> NUMA node2 CPU(s):     17-24
> NUMA node3 CPU(s):     25-32
> NUMA node4 CPU(s):     0,33-39
> NUMA node5 CPU(s):     40-47
> NUMA node6 CPU(s):     48-55
> NUMA node7 CPU(s):     56-63
>
> I am using a pseudo-distributed Hadoop cluster on a single machine with
> 64 cores.
>
> *****-------------*******
>
> Thanks,
> Vikesh Khanna,
> Masters, Computer Science (Class of 2015)
> Stanford University
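[Editor's note: pulling the thread's suggestions together, here is a hedged mapred-site.xml sketch for Hadoop 0.20.x-era clusters. The property names match that generation of Hadoop; the heap size, slot count, and timeout are illustrative and must be sized for your machine. Note that in Giraph 1.0, the master runs as an additional map task alongside the workers (unless master/worker are co-located), so -w 30 needs at least 31 map slots.]

```xml
<!-- Illustrative mapred-site.xml fragment; tune values for your cluster. -->
<property>
  <name>mapred.child.java.opts</name>
  <!-- Heap size plus the heap-dump and GC-logging flags discussed in this thread. -->
  <value>-Xmx25g -XX:+HeapDumpOnOutOfMemoryError -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC</value>
</property>
<property>
  <!-- Giraph needs workers + 1 map slots (the master is a map task): -w 30 needs >= 31. -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>31</value>
</property>
<property>
  <!-- Task timeout in milliseconds; the default 600000 (10 min) is what kills hung jobs. -->
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
```

After editing, the TaskTracker (and JobTracker) must be restarted for slot changes to take effect.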