Hi Hassan,

Thanks for the detailed response. I am not running Giraph jobs in parallel; I am trying to run a single job with 900M edges. I have also removed the _bsp folder before every run.
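The cleanup step mentioned above can be sketched roughly as follows. This is only a minimal sketch: the helper name `remove_bsp_dir` and the example path are hypothetical, and it assumes the `_bsp` folder sits directly under the job's local directory, which depends on the Hadoop/YARN configuration.

```python
import shutil
from pathlib import Path


def remove_bsp_dir(local_job_dir):
    """Delete <local_job_dir>/_bsp if it exists.

    Returns True when a directory was removed, False when there was nothing to remove.
    """
    bsp_dir = Path(local_job_dir) / "_bsp"
    if bsp_dir.is_dir():
        # Removes the folder and everything under it, including _partitions.
        shutil.rmtree(bsp_dir)
        return True
    return False


# Hypothetical usage; the real local job directory depends on the cluster setup:
# remove_bsp_dir("/path/to/local/job/dir")
```

Since the folder is local to each machine, something like this would presumably have to run on every worker node before the job is submitted.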
I also checked out the latest code from the Phabricator commit. Out-of-core works perfectly for 300M records, but when I increase the data set to 500M I run into the problems below.

*Exception 1:*

*Script:*

hadoop jar /usr/local/giraph/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.7.1-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dmapreduce.task.timeout=12000000 -Dmapred.job.tracker=ip-172-31-42-220.eu-west-1.compute.internal:8021 -Dmapreduce.map.memory.mb=23480 -Dmapreduce.map.java.opts=-Xmx22480m org.apache.giraph.examples.ConnectedComponentsComputation -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip /VUID/input_500M -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /VUID/ouput_500M -w 4 -ca giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,mapred.map.max.attempts=2,giraph.numOutputThreads=10,giraph.numInputThreads=10,giraph.numComputeThreads=4,giraph.waitForPerWorkerRequests=true,giraph.zkSessionMsecTimeout=1200000

After superstep 1 the execution gets stuck. This is when I run without out-of-core:

2016-05-17 19:11:06,536 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 2267ms
2016-05-17 19:11:55,694 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 1314ms
2016-05-17 19:11:55,695 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 47407ms
2016-05-17 19:11:58,930 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 1465ms
2016-05-17 19:12:03,222 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 2655ms
2016-05-17 19:13:00,659 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 3013ms
2016-05-17 19:13:00,659 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 52769ms
2016-05-17 19:13:04,359 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 1961ms
2016-05-17 19:14:01,246 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 2925ms
2016-05-17 19:14:01,247 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 52743ms
2016-05-17 19:14:50,029 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 48410ms
2016-05-17 19:15:34,967 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 44449ms
2016-05-17 19:16:20,120 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 44219ms
2016-05-17 19:17:06,099 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 45196ms
2016-05-17 19:18:21,141 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 73103ms
2016-05-17 19:19:56,003 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 93005ms
2016-05-17 19:21:24,339 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 88099ms
2016-05-17 19:22:57,828 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 93429ms
2016-05-17 19:24:31,891 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 94049ms
2016-05-17 19:25:53,983 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: inst

*Exception 2:*

I get a heap space error after superstep 2 when I run with out-of-core enabled. However, with the code I had checked out from the Giraph Git trunk, I was able to run 600M edges successfully in memory.
I am not sure why I am getting a heap space error with the code I checked out from your commit.

2016-05-17 18:31:14,387 INFO [main] org.apache.giraph.worker.BspServiceWorker: finishSuperstep: Completed superstep 2 with global stats (vtx=451097689,finVtx=451097689,edges=499895018,msgCount=16110,msgBytesCount=129712,haltComputation=false, checkpointStatus=NONE) and classes (computation=org.apache.giraph.examples.ConnectedComponentsComputation,incoming=org.apache.giraph.conf.DefaultMessageClasses@18388a3c,outgoing=org.apache.giraph.conf.DefaultMessageClasses@1d035be3)
2016-05-17 18:31:14,395 INFO [main-EventThread] org.apache.giraph.worker.BspServiceWorker: processEvent : partitionExchangeChildrenChanged (at least one worker is done sending partitions)
2016-05-17 18:31:14,414 WARN [main-EventThread] org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_1463146675144_0120/_applicationAttemptsDir/0/_superstepDir/1/_superstepFinished, type=NodeDeleted, state=SyncConnected)
2016-05-17 18:31:14,414 WARN [main-EventThread] org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_1463146675144_0120/_applicationAttemptsDir/0/_superstepDir/1/_addressesAndPartitions, type=NodeDeleted, state=SyncConnected)
2016-05-17 18:31:50,368 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 35912ms
2016-05-17 18:31:50,371 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 0
2016-05-17 18:31:50,371 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 0
2016-05-17 18:32:53,125 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: call: last GC happened a while ago and the amount of used memory is high (used memory fraction is 0.96). Calling GC manually
2016-05-17 18:36:09,485 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: call: manual GC is done. It took 196.36 seconds. Used memory fraction is 0.96
2016-05-17 18:36:09,485 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 0
2016-05-17 18:36:09,485 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 0
2016-05-17 18:36:42,959 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: call: last GC happened a while ago and the amount of used memory is high (used memory fraction is 0.89). Calling GC manually
2016-05-17 18:37:45,432 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: call: manual GC is done. It took 62.47 seconds. Used memory fraction is 0.89
2016-05-17 18:37:45,432 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 18:37:45,432 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 5
2016-05-17 18:37:45,432 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 62656ms
2016-05-17 18:37:45,433 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 33689ms
2016-05-17 18:37:45,433 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 92383ms
2016-05-17 18:37:45,433 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Allocation Failure, duration = 70285ms
2016-05-17 18:37:45,433 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 33473ms
2016-05-17 18:37:45,434 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = System.gc(), duration = 63ms
2016-05-17 18:37:45,434 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = System.gc(), duration = 62409ms
2016-05-17 18:37:45,438 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
    at it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap.<init>(Int2ObjectOpenHashMap.java:107)
    at it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap.<init>(Int2ObjectOpenHashMap.java:115)
    at org.apache.giraph.comm.messages.primitives.IntByteArrayMessageStore.<init>(IntByteArrayMessageStore.java:89)
    at org.apache.giraph.comm.messages.InMemoryMessageStoreFactory.newStoreWithoutCombiner(InMemoryMessageStoreFactory.java:128)
    at org.apache.giraph.comm.messages.InMemoryMessageStoreFactory.newStore(InMemoryMessageStoreFactory.java:178)
    at org.apache.giraph.comm.messages.InMemoryMessageStoreFactory.newStore(InMemoryMessageStoreFactory.java:54)
    at org.apache.giraph.comm.ServerData.prepareSuperstep(ServerData.java:285)
    at org.apache.giraph.comm.netty.NettyWorkerServer.prepareSuperstep(NettyWorkerServer.java:97)
    at org.apache.giraph.worker.BspServiceWorker.startSuperstep(BspServiceWorker.java:700)
    at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:329)
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92)

*Exception 3:*

*I got this when I ran 600M edges with 5 nodes and 9 workers. I was getting this error with older versions of Giraph as well; by older version I mean the version checked out from trunk. I am having these issues only with huge data sets.*

2016-05-17 17:56:33,576
ERROR [ooc-io-0] org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed java.lang.RuntimeException: java.io.EOFException at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:114) at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36) at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:47) at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:290) at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:335) at org.apache.giraph.ooc.data.OutOfCoreDataManager.loadPartitionData(OutOfCoreDataManager.java:198) at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:368) at org.apache.giraph.ooc.io.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66) at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:102) ... 
6 more 2016-05-17 17:56:33,580 INFO [ooc-io-0] org.apache.giraph.ooc.OutOfCoreIOCallableFactory: afterExecute: an out-of-core thread terminated unexpectedly with java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.EOFException 2016-05-17 17:56:34,937 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 2016-05-17 17:56:34,937 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:56:37,437 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 2016-05-17 17:56:37,437 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:56:38,202 INFO [main] org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, task ids: [6] 2016-05-17 17:56:39,937 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 2016-05-17 17:56:39,938 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:56:42,438 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 2016-05-17 17:56:42,438 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:56:44,938 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 2016-05-17 17:56:44,938 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:56:47,439 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: 
updating the number of active threads to 1 2016-05-17 17:56:47,439 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:56:49,939 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 2016-05-17 17:56:49,939 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:56:52,440 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 2016-05-17 17:56:52,440 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:56:54,941 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 2016-05-17 17:56:54,941 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:56:55,420 INFO [netty-server-worker-7] org.apache.giraph.comm.netty.handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 0, MBytesReceived = 0.0007, ave received req MBytes = 0.0001, secs waited = 52.835 2016-05-17 17:56:55,421 INFO [main] org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 0 more aggregator requests 2016-05-17 17:56:55,421 INFO [main] org.apache.giraph.graph.GraphTaskManager: execute: 8 partitions to process with 1 compute thread(s), originally 1 thread(s) on superstep 3 2016-05-17 17:56:55,421 INFO [main] org.apache.giraph.ooc.OutOfCoreEngine: startIteration: with 0 partitions in memory and 1 active threads 2016-05-17 17:56:55,422 INFO [compute-0] org.apache.giraph.ooc.OutOfCoreEngine: getNextPartition: waiting until a partition becomes available! 
2016-05-17 17:56:57,442 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 2016-05-17 17:56:57,442 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:56:59,942 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 2016-05-17 17:56:59,942 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:57:02,443 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 2016-05-17 17:57:02,443 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:57:04,943 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 2016-05-17 17:57:04,943 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 2016-05-17 17:57:05,422 ERROR [compute-0] org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread! 
at org.apache.giraph.ooc.OutOfCoreEngine.getNextPartition(OutOfCoreEngine.java:285) at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187) at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:174) at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70) at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-05-17 17:57:05,423 ERROR [main] org.apache.giraph.graph.GraphMapper: Caught an unrecoverable exception Exception occurred java.lang.IllegalStateException: Exception occurred at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:253) at org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:817) at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:364) at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread! 
at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:206) at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:250) ... 10 more Caused by: java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread! at org.apache.giraph.ooc.OutOfCoreEngine.getNextPartition(OutOfCoreEngine.java:285) at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187) at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:174) at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70) at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-05-17 17:57:05,424 ERROR [main] org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/job_1463146675144_0119/_applicationAttemptsDir/0/_superstepDir/3/_workerHealthyDir/ip-172-31-42-220.eu-west-1.compute.internal_4 on superstep 3 2016-05-17 17:57:05,427 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IllegalStateException: run: Caught an unrecoverable exception Exception occurred at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:108) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at 
org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) Caused by: java.lang.IllegalStateException: Exception occurred at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:253) at org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:817) at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:364) at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92) ... 7 more Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread! at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:206) at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:250) ... 10 more Caused by: java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread! at org.apache.giraph.ooc.OutOfCoreEngine.getNextPartition(OutOfCoreEngine.java:285) at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187) at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:174) at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70) at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-05-17 17:57:05,428 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task On Tue, May 17, 2016 at 12:40 AM, Hassan Eslami <hsn.esl...@gmail.com> wrote: > Ramesh, > > The out-of-core mechanism keeps spilled data in files in local job > directory, which is usually obtained from Hadoop's "mapred.job.id". 
> This should be different from one run to another, so there shouldn't be any
> conflict between different runs using out-of-core mechanism. However, you
> may have manually overwritten related Hadoop/YARN config, so there may be
> conflict in your case. That means, if you run your jobs subsequently, a
> later job may make some decisions based on already existing files from a
> previous job. This can be one reason you are getting this error. Please
> make sure the local job directory is different from run to run, or simply
> delete the "_bsp/_partitions" directory from your local job directory every
> time you run your job using out-of-core.
>
> As a side note, you don't need to specify out-of-core messages (
> giraph.maxMessagesInMemory=100,giraph.useOutOfCoreMessages=true) anymore.
> Also, you can try a new out-of-core feature in which you don't have to
> specify the number of partitions in memory either (you can also get rid of
> giraph.maxPartitionsInMemory=5). This new feature is extensively tested,
> but is still under review and has not been pushed to the code base yet. You
> can access this feature here: https://reviews.facebook.net/D55479
>
> Best,
> Hassan
>
> On Sat, May 14, 2016 at 10:46 PM, Ramesh Krishnan <ramesh.154...@gmail.com> wrote:
>
>> Thanks Hassan.
>> I have removed the checkpointing, but I am still getting a different error.
>>
>> *Script:*
>>
>> hadoop jar
>> /usr/local/giraph.back.1.2.0/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.7.0-jar-with-dependencies.jar
>> org.apache.giraph.GiraphRunner -Dmapreduce.task.timeout=12000000
>> -Dmapred.job.tracker=ip-172-31-42-220.eu-west-1.compute.internal:8021
>> -Dmapreduce.map.memory.mb=23480 -Dmapreduce.map.java.opts=-Xmx22480m
>> org.apache.giraph.examples.ConnectedComponentsComputation -vif
>> org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip /test/input_10M
>> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
>> /test/ouput_10M -w 5 -ca
>> giraph.userPartitionCount=150,giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,giraph.maxPartitionsInMemory=5,mapred.map.max.attempts=2,giraph.maxMessagesInMemory=100,giraph.useOutOfCoreMessages=true,giraph.useOutOfCoreGraph=true
>>
>> *Exception:*
>>
>> 2016-05-15 05:34:28,113 INFO [ooc-io-0] org.apache.giraph.ooc.OutOfCoreIOCallable: call: execution of IO command LoadPartitionIOCommand: (partitionId = 107, superstep = 0) failed!
>> 2016-05-15 05:34:28,114 ERROR [ooc-io-0] >> org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed >> java.lang.RuntimeException: java.io.EOFException >> at >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:76) >> at >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:30) >> at >> org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51) >> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >> at java.lang.Thread.run(Thread.java:745) >> Caused by: java.io.EOFException >> at java.io.DataInputStream.readInt(DataInputStream.java:392) >> at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:47) >> at >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:286) >> at >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:329) >> at >> org.apache.giraph.ooc.data.OutOfCoreDataManager.loadPartitionData(OutOfCoreDataManager.java:195) >> at >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:360) >> at >> org.apache.giraph.ooc.io.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:64) >> at >> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:72) >> ... 6 more >> 2016-05-15 05:34:28,117 INFO [ooc-io-0] >> org.apache.giraph.ooc.OutOfCoreIOCallableFactory: afterExecute: an >> out-of-core thread terminated unexpectedly with >> java.util.concurrent.ExecutionException: java.lang.RuntimeException: >> java.io.EOFException >> 2016-05-15 05:34:28,441 INFO [compute-0] >> org.apache.giraph.ooc.FixedOutOfCoreEngine: doneProcessingPartition: >> processing partition 117 is done! 
>> 2016-05-15 05:34:29,111 INFO [compute-0] >> org.apache.giraph.ooc.FixedOutOfCoreEngine: doneProcessingPartition: >> processing partition 27 is done! >> 2016-05-15 05:34:29,620 INFO [compute-0] >> org.apache.giraph.ooc.FixedOutOfCoreEngine: doneProcessingPartition: >> processing partition 127 is done! >> 2016-05-15 05:34:30,123 INFO [compute-0] >> org.apache.giraph.ooc.FixedOutOfCoreEngine: doneProcessingPartition: >> processing partition 22 is done! >> 2016-05-15 05:34:30,123 INFO [compute-0] >> org.apache.giraph.ooc.FixedOutOfCoreEngine: getNextPartition: waiting until >> a partition becomes available! >> 2016-05-15 05:34:31,123 ERROR [compute-0] >> org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed >> java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO >> thread >> at >> org.apache.giraph.ooc.FixedOutOfCoreEngine.getNextPartition(FixedOutOfCoreEngine.java:81) >> at >> org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187) >> at >> org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:153) >> at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:69) >> at >> org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51) >> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >> at java.lang.Thread.run(Thread.java:745) >> 2016-05-15 05:34:31,124 ERROR [main] org.apache.giraph.graph.GraphMapper: >> Caught an unrecoverable exception Exception occurred >> java.lang.IllegalStateException: Exception occurred >> at >> org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:253) >> at >> org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:761) >> at >> 
org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:349) >> at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92) >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787) >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) >> at java.security.AccessController.doPrivileged(Native Method) >> at javax.security.auth.Subject.doAs(Subject.java:422) >> at >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) >> Caused by: java.util.concurrent.ExecutionException: >> java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO >> thread >> at java.util.concurrent.FutureTask.report(FutureTask.java:122) >> at java.util.concurrent.FutureTask.get(FutureTask.java:206) >> at >> org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:250) >> ... 
10 more
>> Caused by: java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread
>>     at org.apache.giraph.ooc.FixedOutOfCoreEngine.getNextPartition(FixedOutOfCoreEngine.java:81)
>>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
>>     at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:153)
>>     at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:69)
>>     at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>> 2016-05-15 05:34:31,125 ERROR [main] org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/job_1463146675144_0036/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-172-31-37-39.eu-west-1.compute.internal_2 on superstep 0
>>
>>
>>
>> On Sun, May 15, 2016 at 3:54 AM, Hassan Eslami <hsn.esl...@gmail.com> wrote:
>>
>>> Hi Ramesh!
>>>
>>> Thanks for bringing this up, and thanks for trying out the new
>>> out-of-core mechanism. The new out-of-core mechanism has not been
>>> integrated with checkpointing yet. This is part of an ongoing project, and
>>> we should have the integration within a few weeks. In the meantime, you can
>>> try out-of-core without checkpointing enabled.
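One way to make sure no checkpointing option slips into the `-ca` argument is to build that argument from a dictionary and drop the unwanted keys. The sketch below is illustrative only: `build_custom_args` is a hypothetical helper, the option values are examples, and it assumes checkpointing is switched on through `giraph.checkpointFrequency`, an option name that does not appear in this thread.

```python
def build_custom_args(options, drop=()):
    """Return a comma-separated key=value string for -ca, skipping keys listed in `drop`."""
    kept = {key: value for key, value in options.items() if key not in drop}
    return ",".join(f"{key}={value}" for key, value in kept.items())


# Example option set; the checkpoint entry is an assumed, illustrative value.
options = {
    "giraph.useOutOfCoreGraph": "true",
    "giraph.isStaticGraph": "true",
    "giraph.checkpointFrequency": "2",
}

# Assumption: this is the key that turns checkpointing on.
CHECKPOINT_KEYS = {"giraph.checkpointFrequency"}

custom_args = build_custom_args(options, drop=CHECKPOINT_KEYS)
print(custom_args)
```

The resulting string can then be passed after `-ca` on the `hadoop jar` command line.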
>>>
>>> Best,
>>> Hassan
>>>
>>> On Saturday, May 14, 2016, Ramesh Krishnan <ramesh.154...@gmail.com> wrote:
>>>
>>>> Please find attached the correct logs for the concurrent exception.
>>>>
>>>> 2016-05-14 19:10:55,733 ERROR [ooc-io-0] org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
>>>> java.lang.RuntimeException: java.io.EOFException
>>>> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:76)
>>>> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:30)
>>>> at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>> at java.lang.Thread.run(Thread.java:745)
>>>> Caused by: java.io.EOFException
>>>> at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>> at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:47)
>>>> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:286)
>>>> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:329)
>>>> at org.apache.giraph.ooc.data.OutOfCoreDataManager.loadPartitionData(OutOfCoreDataManager.java:195)
>>>> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:360)
>>>> at org.apache.giraph.ooc.io.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:64)
>>>> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:72)
>>>> ... 6 more
>>>> 2016-05-14 19:10:55,737 INFO [ooc-io-0] org.apache.giraph.ooc.OutOfCoreIOCallableFactory: afterExecute: an out-of-core thread terminated unexpectedly with java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.EOFException
>>>> 2016-05-14 19:10:55,739 INFO [checkpoint-vertices-7] org.apache.giraph.ooc.FixedOutOfCoreEngine: getNextPartition: waiting until a partition becomes available!
>>>> 2016-05-14 19:10:56,426 ERROR [checkpoint-vertices-6] org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
>>>> java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread
>>>> at org.apache.giraph.ooc.FixedOutOfCoreEngine.getNextPartition(FixedOutOfCoreEngine.java:81)
>>>> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
>>>> at org.apache.giraph.worker.BspServiceWorker$3$1.call(BspServiceWorker.java:1398)
>>>> at org.apache.giraph.worker.BspServiceWorker$3$1.call(BspServiceWorker.java:1392)
>>>> at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>> at java.lang.Thread.run(Thread.java:745)
>>>>
>>>> On Sun, May 15, 2016 at 1:02 AM, Ramesh Krishnan <ramesh.154...@gmail.com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> I have the latest build of Giraph running on a 5-node cluster. When I try to use the out-of-core graph option for a huge data set, around 600 million edges, I run into the following exception. Please find below the script being executed and the exception logs. I have tried everything I could think of and could not avoid this issue; I would really appreciate your help.
>>>>>
>>>>> *Script:* hadoop jar /usr/local/giraph.back.1.2.0/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.7.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dmapreduce.task.timeout=12000000 -Dmapred.job.tracker=ip-172-31-42-220.eu-west-1.compute.internal:8021 -Dmapreduce.map.memory.mb=23480 -Dmapreduce.map.java.opts=-Xmx22480m org.apache.giraph.examples.ConnectedComponentsComputation -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip /test/input_10M -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /test/ouput_10M -w 5 -ca giraph.userPartitionCount=150,giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,giraph.maxPartitionsInMemory=10,mapred.map.max.attempts=2,giraph.maxMessagesInMemory=100,giraph.numOutputThreads=10,giraph.useOutOfCoreMessages=true,giraph.numOutputThreads=4,giraph.numInputThreads=4,giraph.useOutOfCoreGraph=true,giraph.cleanupCheckpointsAfterSuccess=true,giraph.checkpointFrequency=1
>>>>>
>>>>> *Exception: hadoop jar /usr/local/giraph.back.1.2.0/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.7.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dmapreduce.task.timeout=12000000 -Dmapred.job.tracker=ip-172-31-42-220.eu-west-1.compute.internal:8021 -Dmapreduce.map.memory.mb=23480 -Dmapreduce.map.java.opts=-Xmx22480m org.apache.giraph.examples.ConnectedComponentsComputation -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip /test/input_10M -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /test/ouput_10M -w 5 -ca giraph.userPartitionCount=150,giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,giraph.maxPartitionsInMemory=10,mapred.map.max.attempts=2,giraph.maxMessagesInMemory=100,giraph.numOutputThreads=10,giraph.useOutOfCoreMessages=true,giraph.numOutputThreads=4,giraph.numInputThreads=4,giraph.useOutOfCoreGraph=true,giraph.cleanupCheckpointsAfterSuccess=true,giraph.checkpointFrequency=1*
>>>>>
>>>>> *thanks*
>>>>>
>>>>> *Ramesh*
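Hassan's suggestion in this thread, running out-of-core without checkpointing enabled, amounts to dropping the checkpoint-related entries from the -ca list in the original script. The snippet below is only a sketch of that step. The variable names CA_ORIG and CA_NO_CKPT are made up for illustration, and it assumes that leaving giraph.checkpointFrequency unset keeps checkpointing off, which should be checked against the Giraph build in use.

```shell
#!/bin/sh
# Custom arguments (-ca) copied verbatim from the original script in this thread.
CA_ORIG="giraph.userPartitionCount=150,giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,giraph.maxPartitionsInMemory=10,mapred.map.max.attempts=2,giraph.maxMessagesInMemory=100,giraph.numOutputThreads=10,giraph.useOutOfCoreMessages=true,giraph.numOutputThreads=4,giraph.numInputThreads=4,giraph.useOutOfCoreGraph=true,giraph.cleanupCheckpointsAfterSuccess=true,giraph.checkpointFrequency=1"

# Split the list on commas, drop every option whose name mentions
# "checkpoint" (case-insensitive), and join the rest back with commas.
# Assumption: with giraph.checkpointFrequency absent, no checkpoints are taken.
CA_NO_CKPT=$(printf '%s\n' "$CA_ORIG" | tr ',' '\n' | grep -vi 'checkpoint' | paste -s -d ',' -)

# Print the filtered list so it can be reviewed before reuse.
printf '%s\n' "$CA_NO_CKPT"
```

The filtered value would then replace the list after -ca in the hadoop jar command, with the rest of the command unchanged. Note that the original list sets giraph.numOutputThreads twice (10 and 4); the sketch leaves that as it is.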