Hi Hassan,

Thanks for the detailed response. I am not running Giraph jobs in parallel;
I am trying to run a single job with 900M edges.
I have also removed the _bsp folder before every run.

I also checked out the latest code from the Phabricator commit. Out-of-core
works perfectly for 300M records, but when I increase the data set to 500M
I run into the problems below.

*Exception 1:*
*script:*
hadoop jar
/usr/local/giraph/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.7.1-jar-with-dependencies.jar
org.apache.giraph.GiraphRunner
-Dmapreduce.task.timeout=12000000
-Dmapred.job.tracker=ip-172-31-42-220.eu-west-1.compute.internal:8021
-Dmapreduce.map.memory.mb=23480 -Dmapreduce.map.java.opts=-Xmx22480m
org.apache.giraph.examples.ConnectedComponentsComputation   -vif
org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip
/VUID/input_500M -vof
org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
/VUID/ouput_500M -w 4 -ca
giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,mapred.map.max.attempts=2,giraph.numOutputThreads=10,giraph.numInputThreads=10,giraph.numComputeThreads=4,giraph.waitForPerWorkerRequests=true,giraph.zkSessionMsecTimeout=1200000
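
To make the settings above easier to review, here is a small Python sketch (not part of Giraph; the helper name parse_custom_args is made up for illustration). It splits the -ca string into key/value pairs and computes how much of the YARN container is left once the JVM heap is subtracted, using only the values from the command above.

```python
def parse_custom_args(ca):
    """Split a Giraph -ca string of the form k1=v1,k2=v2 into a dict."""
    options = {}
    for pair in ca.split(","):
        key, _, value = pair.partition("=")
        options[key.strip()] = value.strip()
    return options


# Values copied from the command above.
CA = (
    "giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,"
    "mapred.map.max.attempts=2,giraph.numOutputThreads=10,"
    "giraph.numInputThreads=10,giraph.numComputeThreads=4,"
    "giraph.waitForPerWorkerRequests=true,giraph.zkSessionMsecTimeout=1200000"
)
CONTAINER_MB = 23480  # -Dmapreduce.map.memory.mb
HEAP_MB = 22480       # -Xmx in -Dmapreduce.map.java.opts

options = parse_custom_args(CA)
# Memory left in the container for everything outside the Java heap.
headroom_mb = CONTAINER_MB - HEAP_MB

if __name__ == "__main__":
    for key in sorted(options):
        print(f"{key} = {options[key]}")
    print(f"non-heap headroom: {headroom_mb} MB")
```

Whether that headroom is enough for off-heap usage on this cluster is something I have not verified; the sketch only makes the numbers explicit.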


After superstep 1 the execution gets stuck. This is what I see when I run
without out-of-core:

2016-05-17 19:11:06,536 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS Scavenge, action = end of minor GC, cause = Allocation Failure,
duration = 2267ms
2016-05-17 19:11:55,694 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS Scavenge, action = end of minor GC, cause = Allocation Failure,
duration = 1314ms
2016-05-17 19:11:55,695 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
47407ms
2016-05-17 19:11:58,930 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS Scavenge, action = end of minor GC, cause = Allocation Failure,
duration = 1465ms
2016-05-17 19:12:03,222 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS Scavenge, action = end of minor GC, cause = Allocation Failure,
duration = 2655ms
2016-05-17 19:13:00,659 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS Scavenge, action = end of minor GC, cause = Allocation Failure,
duration = 3013ms
2016-05-17 19:13:00,659 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
52769ms
2016-05-17 19:13:04,359 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS Scavenge, action = end of minor GC, cause = Allocation Failure,
duration = 1961ms
2016-05-17 19:14:01,246 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS Scavenge, action = end of minor GC, cause = Allocation Failure,
duration = 2925ms
2016-05-17 19:14:01,247 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
52743ms
2016-05-17 19:14:50,029 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
48410ms
2016-05-17 19:15:34,967 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
44449ms
2016-05-17 19:16:20,120 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
44219ms
2016-05-17 19:17:06,099 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
45196ms
2016-05-17 19:18:21,141 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
73103ms
2016-05-17 19:19:56,003 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
93005ms
2016-05-17 19:21:24,339 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
88099ms
2016-05-17 19:22:57,828 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
93429ms
2016-05-17 19:24:31,891 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
94049ms
2016-05-17 19:25:53,983 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: inst
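
To put a number on the log above, the following Python sketch estimates what fraction of wall-clock time went to GC pauses. It is only a rough helper written against the installGCMonitoring line format shown here; the function names are my own, and because some entries are logged late (several share a timestamp) the result is approximate and can even exceed 1.0.

```python
import re
from datetime import datetime

# One installGCMonitoring entry, after the wrapped lines have been joined.
GC_EVENT = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO \[Service Thread\] "
    r"org\.apache\.giraph\.graph\.GraphTaskManager: installGCMonitoring: "
    r"name = (?P<name>[\w ]+?), action = end of (?P<kind>minor|major) GC, "
    r"cause = (?P<cause>[^,]+), duration = (?P<ms>\d+)ms"
)


def gc_events(log_text):
    """Return (timestamp, kind, duration_ms) for every GC entry in the log."""
    flat = " ".join(log_text.split())  # undo the line wrapping from the mail client
    events = []
    for match in GC_EVENT.finditer(flat):
        when = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        events.append((when, match.group("kind"), int(match.group("ms"))))
    return events


def gc_pause_fraction(log_text):
    """Fraction of the time between the first and last entry spent in GC pauses."""
    events = gc_events(log_text)
    if len(events) < 2:
        return 0.0
    window_s = (events[-1][0] - events[0][0]).total_seconds()
    if window_s <= 0:
        return 0.0
    # The first pause ended at the start of the window, so it is not counted.
    paused_s = sum(ms for _, _, ms in events[1:]) / 1000.0
    return paused_s / window_s
```

A value close to 1.0 means the worker is doing almost nothing except garbage collection, which would be consistent with the job appearing stuck.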





*Exception 2:*
I got a heap space error after superstep 2 when I ran with out-of-core
enabled. With the code I had checked out from the Git trunk of Giraph, I was
able to run 600M edges successfully in memory. I am not sure why I am getting
a heap space error with the code I checked out from your commit.


2016-05-17 18:31:14,387 INFO [main]
org.apache.giraph.worker.BspServiceWorker: finishSuperstep: Completed
superstep 2 with global stats
(vtx=451097689,finVtx=451097689,edges=499895018,msgCount=16110,msgBytesCount=129712,haltComputation=false,
checkpointStatus=NONE) and classes
(computation=org.apache.giraph.examples.ConnectedComponentsComputation,incoming=org.apache.giraph.conf.DefaultMessageClasses@18388a3c,outgoing=org.apache.giraph.conf.DefaultMessageClasses@1d035be3)
2016-05-17 18:31:14,395 INFO [main-EventThread]
org.apache.giraph.worker.BspServiceWorker: processEvent :
partitionExchangeChildrenChanged (at least one worker is done sending
partitions)
2016-05-17 18:31:14,414 WARN [main-EventThread]
org.apache.giraph.bsp.BspService: process: Unknown and unprocessed
event 
(path=/_hadoopBsp/job_1463146675144_0120/_applicationAttemptsDir/0/_superstepDir/1/_superstepFinished,
type=NodeDeleted, state=SyncConnected)
2016-05-17 18:31:14,414 WARN [main-EventThread]
org.apache.giraph.bsp.BspService: process: Unknown and unprocessed
event 
(path=/_hadoopBsp/job_1463146675144_0120/_applicationAttemptsDir/0/_superstepDir/1/_addressesAndPartitions,
type=NodeDeleted, state=SyncConnected)
2016-05-17 18:31:50,368 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
35912ms
2016-05-17 18:31:50,371 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 0
2016-05-17 18:31:50,371 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 0
2016-05-17 18:32:53,125 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: call: last GC happened a
while ago and the amount of used memory is high (used memory fraction
is 0.96). Calling GC manually
2016-05-17 18:36:09,485 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: call: manual GC is done.
It took 196.36 seconds. Used memory fraction is 0.96
2016-05-17 18:36:09,485 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 0
2016-05-17 18:36:09,485 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 0
2016-05-17 18:36:42,959 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: call: last GC happened a
while ago and the amount of used memory is high (used memory fraction
is 0.89). Calling GC manually
2016-05-17 18:37:45,432 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: call: manual GC is done.
It took 62.47 seconds. Used memory fraction is 0.89
2016-05-17 18:37:45,432 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 18:37:45,432 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 5
2016-05-17 18:37:45,432 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
62656ms
2016-05-17 18:37:45,433 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
33689ms
2016-05-17 18:37:45,433 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
92383ms
2016-05-17 18:37:45,433 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Allocation Failure,
duration = 70285ms
2016-05-17 18:37:45,433 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = Ergonomics, duration =
33473ms
2016-05-17 18:37:45,434 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS Scavenge, action = end of minor GC, cause = System.gc(), duration =
63ms
2016-05-17 18:37:45,434 INFO [Service Thread]
org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name =
PS MarkSweep, action = end of major GC, cause = System.gc(), duration
= 62409ms
2016-05-17 18:37:45,438 FATAL [main]
org.apache.hadoop.mapred.YarnChild: Error running child :
java.lang.OutOfMemoryError: Java heap space
        at 
it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap.<init>(Int2ObjectOpenHashMap.java:107)
        at 
it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap.<init>(Int2ObjectOpenHashMap.java:115)
        at 
org.apache.giraph.comm.messages.primitives.IntByteArrayMessageStore.<init>(IntByteArrayMessageStore.java:89)
        at 
org.apache.giraph.comm.messages.InMemoryMessageStoreFactory.newStoreWithoutCombiner(InMemoryMessageStoreFactory.java:128)
        at 
org.apache.giraph.comm.messages.InMemoryMessageStoreFactory.newStore(InMemoryMessageStoreFactory.java:178)
        at 
org.apache.giraph.comm.messages.InMemoryMessageStoreFactory.newStore(InMemoryMessageStoreFactory.java:54)
        at 
org.apache.giraph.comm.ServerData.prepareSuperstep(ServerData.java:285)
        at 
org.apache.giraph.comm.netty.NettyWorkerServer.prepareSuperstep(NettyWorkerServer.java:97)
        at 
org.apache.giraph.worker.BspServiceWorker.startSuperstep(BspServiceWorker.java:700)
        at 
org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:329)
        at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92)
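
One thing that can be read off the ThresholdBasedOracle lines above is whether the manual GC actually freed anything. The Python sketch below pairs each "Calling GC manually" line with the following "manual GC is done" line and returns the used memory fraction before and after. It is a throwaway helper of my own (the name manual_gc_cycles is hypothetical) and assumes the exact message wording shown in this log.

```python
import re

# A "Calling GC manually" line followed by its "manual GC is done" line.
CYCLE = re.compile(
    r"used memory fraction is (?P<before>\d+(?:\.\d+)?)\)\. Calling GC manually"
    r".*?manual GC is done\. It took (?P<secs>\d+(?:\.\d+)?) seconds\. "
    r"Used memory fraction is (?P<after>\d+(?:\.\d+)?)"
)


def manual_gc_cycles(log_text):
    """Return (fraction_before, gc_seconds, fraction_after) per manual GC."""
    flat = " ".join(log_text.split())  # undo the line wrapping from the mail client
    return [
        (float(m.group("before")), float(m.group("secs")), float(m.group("after")))
        for m in CYCLE.finditer(flat)
    ]
```

If the fraction is the same before and after, as it appears to be in the entries above, the heap is probably full of live data rather than garbage, so more collections alone would not help.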



*Exception 3:*





*I got this when I ran 600M edges with 5 nodes and 9 workers. I was
already getting this error with older versions of Giraph as well. By
older version I mean the version checked out from trunk. I am having
these issues only with huge data sets.*


2016-05-17 17:56:33,576 ERROR [ooc-io-0]
org.apache.giraph.utils.LogStacktraceCallable: Execution of callable
failed
java.lang.RuntimeException: java.io.EOFException
        at 
org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:114)
        at 
org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
        at 
org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:47)
        at 
org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:290)
        at 
org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:335)
        at 
org.apache.giraph.ooc.data.OutOfCoreDataManager.loadPartitionData(OutOfCoreDataManager.java:198)
        at 
org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:368)
        at 
org.apache.giraph.ooc.io.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
        at 
org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:102)
        ... 6 more
2016-05-17 17:56:33,580 INFO [ooc-io-0]
org.apache.giraph.ooc.OutOfCoreIOCallableFactory: afterExecute: an
out-of-core thread terminated unexpectedly with
java.util.concurrent.ExecutionException: java.lang.RuntimeException:
java.io.EOFException
2016-05-17 17:56:34,937 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:56:34,937 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:56:37,437 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:56:37,437 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:56:38,202 INFO [main]
org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [6]
2016-05-17 17:56:39,937 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:56:39,938 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:56:42,438 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:56:42,438 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:56:44,938 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:56:44,938 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:56:47,439 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:56:47,439 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:56:49,939 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:56:49,939 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:56:52,440 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:56:52,440 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:56:54,941 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:56:54,941 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:56:55,420 INFO [netty-server-worker-7]
org.apache.giraph.comm.netty.handler.RequestDecoder: decode: Server
window metrics MBytes/sec received = 0, MBytesReceived = 0.0007, ave
received req MBytes = 0.0001, secs waited = 52.835
2016-05-17 17:56:55,421 INFO [main]
org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 0 more aggregator requests
2016-05-17 17:56:55,421 INFO [main]
org.apache.giraph.graph.GraphTaskManager: execute: 8 partitions to
process with 1 compute thread(s), originally 1 thread(s) on superstep
3
2016-05-17 17:56:55,421 INFO [main]
org.apache.giraph.ooc.OutOfCoreEngine: startIteration: with 0
partitions in memory and 1 active threads
2016-05-17 17:56:55,422 INFO [compute-0]
org.apache.giraph.ooc.OutOfCoreEngine: getNextPartition: waiting until
a partition becomes available!
2016-05-17 17:56:57,442 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:56:57,442 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:56:59,942 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:56:59,942 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:57:02,443 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:57:02,443 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:57:04,943 INFO [check-memory]
org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction:
updating the number of active threads to 1
2016-05-17 17:57:04,943 INFO [check-memory]
org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit:
updating the credit to 20
2016-05-17 17:57:05,422 ERROR [compute-0]
org.apache.giraph.utils.LogStacktraceCallable: Execution of callable
failed
java.lang.RuntimeException: Job Failed due to a failure in an
out-of-core IO thread!
        at 
org.apache.giraph.ooc.OutOfCoreEngine.getNextPartition(OutOfCoreEngine.java:285)
        at 
org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
        at 
org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:174)
        at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70)
        at 
org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2016-05-17 17:57:05,423 ERROR [main]
org.apache.giraph.graph.GraphMapper: Caught an unrecoverable exception
Exception occurred
java.lang.IllegalStateException: Exception occurred
        at 
org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:253)
        at 
org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:817)
        at 
org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:364)
        at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.util.concurrent.ExecutionException:
java.lang.RuntimeException: Job Failed due to a failure in an
out-of-core IO thread!
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:206)
        at 
org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:250)
        ... 10 more
Caused by: java.lang.RuntimeException: Job Failed due to a failure in
an out-of-core IO thread!
        at 
org.apache.giraph.ooc.OutOfCoreEngine.getNextPartition(OutOfCoreEngine.java:285)
        at 
org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
        at 
org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:174)
        at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70)
        at 
org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2016-05-17 17:57:05,424 ERROR [main]
org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got
failure, unregistering health on
/_hadoopBsp/job_1463146675144_0119/_applicationAttemptsDir/0/_superstepDir/3/_workerHealthyDir/ip-172-31-42-220.eu-west-1.compute.internal_4
on superstep 3
2016-05-17 17:57:05,427 WARN [main]
org.apache.hadoop.mapred.YarnChild: Exception running child :
java.lang.IllegalStateException: run: Caught an unrecoverable
exception Exception occurred
        at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:108)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.IllegalStateException: Exception occurred
        at 
org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:253)
        at 
org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:817)
        at 
org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:364)
        at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92)
        ... 7 more
Caused by: java.util.concurrent.ExecutionException:
java.lang.RuntimeException: Job Failed due to a failure in an
out-of-core IO thread!
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:206)
        at 
org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:250)
        ... 10 more
Caused by: java.lang.RuntimeException: Job Failed due to a failure in
an out-of-core IO thread!
        at 
org.apache.giraph.ooc.OutOfCoreEngine.getNextPartition(OutOfCoreEngine.java:285)
        at 
org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
        at 
org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:174)
        at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70)
        at 
org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

2016-05-17 17:57:05,428 INFO [main] org.apache.hadoop.mapred.Task:
Runnning cleanup for the task
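
The root cause in the trace above is the java.io.EOFException thrown from DataInputStream.readInt while DiskBackedPartitionStore.readOutEdges was loading a partition, which means the stream ended before the four bytes of an int could be read. The Python sketch below is only an analogue of that failure mode, not Giraph code: read_int is a stand-in I wrote to mimic readInt, and the byte buffer stands in for a partition file that is shorter than the reader expects. It says nothing about why the file is short.

```python
import io
import struct


def read_int(stream):
    """Read a 4-byte big-endian signed int, similar to DataInputStream.readInt."""
    data = stream.read(4)
    if len(data) < 4:
        # Java raises java.io.EOFException in this situation.
        raise EOFError(f"expected 4 bytes, got {len(data)}")
    return struct.unpack(">i", data)[0]


def read_all_ints(stream, count):
    """Read `count` ints, failing the same way a truncated file would."""
    return [read_int(stream) for _ in range(count)]


if __name__ == "__main__":
    # Two complete ints followed by two stray bytes: the third read fails.
    truncated = io.BytesIO(struct.pack(">ii", 7, 42) + b"\x00\x01")
    try:
        read_all_ints(truncated, 3)
    except EOFError as error:
        print("EOF while reading:", error)
```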



On Tue, May 17, 2016 at 12:40 AM, Hassan Eslami <hsn.esl...@gmail.com>
wrote:

> Ramesh,
>
> The out-of-core mechanism keeps spilled data in files in local job
> directory, which is usually obtained from Hadoop's "mapred.job.id". This
> should be different from one run to another, so there shouldn't be any
> conflict between different runs using out-of-core mechanism. However, you
> may have manually overwritten related Hadoop/YARN config, so there may be
> conflict in your case. That means, if you run your jobs subsequently, a
> later job may make some decisions based on already existing files from a
> previous job. This can be one reason you are getting this error. Please
> make sure the local job directory is different from run to run, or simply
> delete the "_bsp/_partitions" directory from your local job directory every
> time you run your job using out-of-core.
>
> As a side note, you don't need to specify out-of-core messages (
> giraph.maxMessagesInMemory=100,giraph.useOutOfCoreMessages=true) anymore.
> Also, you can try a new out-of-core feature in which you don't have to
> specify the number of partitions in memory either (you can also get rid of
> giraph.maxPartitionsInMemory=5). This new feature is extensively tested,
> but is still under review and has not been pushed to the code base yet. You
> can access this feature here: https://reviews.facebook.net/D55479
>
> Best,
> Hassan
>
> On Sat, May 14, 2016 at 10:46 PM, Ramesh Krishnan <ramesh.154...@gmail.com
> > wrote:
>
>> Thanks Hassan. I have removed the checkpointing, still getting a
>> different error
>>
>> *Script :*
>>
>> hadoop jar
>> /usr/local/giraph.back.1.2.0/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.7.0-jar-with-dependencies.jar
>> org.apache.giraph.GiraphRunner -Dmapreduce.task.timeout=12000000
>> -Dmapred.job.tracker=ip-172-31-42-220.eu-west-1.compute.internal:8021
>> -Dmapreduce.map.memory.mb=23480 -Dmapreduce.map.java.opts=-Xmx22480m
>> org.apache.giraph.examples.ConnectedComponentsComputation   -vif
>> org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip /test/input_10M
>> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
>> /test/ouput_10M -w 5 -ca
>> giraph.userPartitionCount=150,giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,giraph.maxPartitionsInMemory=5,mapred.map.max.attempts=2,giraph.maxMessagesInMemory=100,giraph.useOutOfCoreMessages=true,giraph.useOutOfCoreGraph=true
>>
>> *Exception:*
>>
>> 2016-05-15 05:34:28,113 INFO [ooc-io-0] 
>> org.apache.giraph.ooc.OutOfCoreIOCallable: call: execution of IO command 
>> LoadPartitionIOCommand: (partitionId = 107, superstep = 0) failed!
>> 2016-05-15 05:34:28,114 ERROR [ooc-io-0] 
>> org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
>> java.lang.RuntimeException: java.io.EOFException
>>      at 
>> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:76)
>>      at 
>> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:30)
>>      at 
>> org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
>>      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>      at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>      at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>      at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.io.EOFException
>>      at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>      at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:47)
>>      at 
>> org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:286)
>>      at 
>> org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:329)
>>      at 
>> org.apache.giraph.ooc.data.OutOfCoreDataManager.loadPartitionData(OutOfCoreDataManager.java:195)
>>      at 
>> org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:360)
>>      at 
>> org.apache.giraph.ooc.io.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:64)
>>      at 
>> org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:72)
>>      ... 6 more
>> 2016-05-15 05:34:28,117 INFO [ooc-io-0] 
>> org.apache.giraph.ooc.OutOfCoreIOCallableFactory: afterExecute: an 
>> out-of-core thread terminated unexpectedly with 
>> java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
>> java.io.EOFException
>> 2016-05-15 05:34:28,441 INFO [compute-0] 
>> org.apache.giraph.ooc.FixedOutOfCoreEngine: doneProcessingPartition: 
>> processing partition 117 is done!
>> 2016-05-15 05:34:29,111 INFO [compute-0] 
>> org.apache.giraph.ooc.FixedOutOfCoreEngine: doneProcessingPartition: 
>> processing partition 27 is done!
>> 2016-05-15 05:34:29,620 INFO [compute-0] 
>> org.apache.giraph.ooc.FixedOutOfCoreEngine: doneProcessingPartition: 
>> processing partition 127 is done!
>> 2016-05-15 05:34:30,123 INFO [compute-0] 
>> org.apache.giraph.ooc.FixedOutOfCoreEngine: doneProcessingPartition: 
>> processing partition 22 is done!
>> 2016-05-15 05:34:30,123 INFO [compute-0] 
>> org.apache.giraph.ooc.FixedOutOfCoreEngine: getNextPartition: waiting until 
>> a partition becomes available!
>> 2016-05-15 05:34:31,123 ERROR [compute-0] 
>> org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
>> java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO 
>> thread
>>      at 
>> org.apache.giraph.ooc.FixedOutOfCoreEngine.getNextPartition(FixedOutOfCoreEngine.java:81)
>>      at 
>> org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
>>      at 
>> org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:153)
>>      at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:69)
>>      at 
>> org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
>>      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>      at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>      at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>      at java.lang.Thread.run(Thread.java:745)
>> 2016-05-15 05:34:31,124 ERROR [main] org.apache.giraph.graph.GraphMapper: 
>> Caught an unrecoverable exception Exception occurred
>> java.lang.IllegalStateException: Exception occurred
>>      at 
>> org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:253)
>>      at 
>> org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:761)
>>      at 
>> org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:349)
>>      at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92)
>>      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>>      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>>      at java.security.AccessController.doPrivileged(Native Method)
>>      at javax.security.auth.Subject.doAs(Subject.java:422)
>>      at 
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
>> Caused by: java.util.concurrent.ExecutionException: 
>> java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO 
>> thread
>>      at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>      at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>>      at 
>> org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:250)
>>      ... 10 more
>> Caused by: java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread
>>      at org.apache.giraph.ooc.FixedOutOfCoreEngine.getNextPartition(FixedOutOfCoreEngine.java:81)
>>      at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
>>      at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:153)
>>      at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:69)
>>      at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
>>      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>      at java.lang.Thread.run(Thread.java:745)
>> 2016-05-15 05:34:31,125 ERROR [main] org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/job_1463146675144_0036/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-172-31-37-39.eu-west-1.compute.internal_2 on superstep 0
>>
>>
>>
>> On Sun, May 15, 2016 at 3:54 AM, Hassan Eslami <hsn.esl...@gmail.com>
>> wrote:
>>
>>> Hi Ramesh!
>>>
>>> Thanks for bringing this up, and thanks for trying out the new
>>> out-of-core mechanism. The new out-of-core mechanism has not been
>>> integrated with checkpointing yet. This is part of an ongoing project, and
>>> we should have the integration within a few weeks. In the meantime, you can
>>> try out-of-core without checkpointing enabled.
>>>
>>> Best,
>>> Hassan
>>>
>>>
>>> On Saturday, May 14, 2016, Ramesh Krishnan <ramesh.154...@gmail.com>
>>> wrote:
>>>
>>>> Please find below the correct logs for the concurrent exception.
>>>>
>>>> 2016-05-14 19:10:55,733 ERROR [ooc-io-0] org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
>>>> java.lang.RuntimeException: java.io.EOFException
>>>>    at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:76)
>>>>    at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:30)
>>>>    at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
>>>>    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>    at java.lang.Thread.run(Thread.java:745)
>>>> Caused by: java.io.EOFException
>>>>    at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>    at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:47)
>>>>    at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:286)
>>>>    at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:329)
>>>>    at org.apache.giraph.ooc.data.OutOfCoreDataManager.loadPartitionData(OutOfCoreDataManager.java:195)
>>>>    at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:360)
>>>>    at org.apache.giraph.ooc.io.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:64)
>>>>    at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:72)
>>>>    ... 6 more
>>>> 2016-05-14 19:10:55,737 INFO [ooc-io-0] org.apache.giraph.ooc.OutOfCoreIOCallableFactory: afterExecute: an out-of-core thread terminated unexpectedly with java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.EOFException
>>>> 2016-05-14 19:10:55,739 INFO [checkpoint-vertices-7] org.apache.giraph.ooc.FixedOutOfCoreEngine: getNextPartition: waiting until a partition becomes available!
>>>> 2016-05-14 19:10:56,426 ERROR [checkpoint-vertices-6] org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
>>>> java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread
>>>>    at org.apache.giraph.ooc.FixedOutOfCoreEngine.getNextPartition(FixedOutOfCoreEngine.java:81)
>>>>    at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
>>>>    at org.apache.giraph.worker.BspServiceWorker$3$1.call(BspServiceWorker.java:1398)
>>>>    at org.apache.giraph.worker.BspServiceWorker$3$1.call(BspServiceWorker.java:1392)
>>>>    at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
>>>>    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>    at java.lang.Thread.run(Thread.java:745)
>>>>
>>>>
>>>>
>>>> On Sun, May 15, 2016 at 1:02 AM, Ramesh Krishnan <
>>>> ramesh.154...@gmail.com> wrote:
>>>>
>>>>>
>>>>> Hi Team,
>>>>>
>>>>> I have the latest build of Giraph running on a 5-node cluster. When I
>>>>> try to use the out-of-core graph option for a large data set (600 million
>>>>> edges), I run into the following exception. Please find below the script
>>>>> being executed and the exception logs. I have tried everything I can
>>>>> think of and could not avoid this issue, so I would really appreciate
>>>>> your help.
>>>>>
>>>>> *Script:*
>>>>> hadoop jar
>>>>> /usr/local/giraph.back.1.2.0/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.7.0-jar-with-dependencies.jar
>>>>> org.apache.giraph.GiraphRunner -Dmapreduce.task.timeout=12000000
>>>>> -Dmapred.job.tracker=ip-172-31-42-220.eu-west-1.compute.internal:8021
>>>>> -Dmapreduce.map.memory.mb=23480 -Dmapreduce.map.java.opts=-Xmx22480m
>>>>> org.apache.giraph.examples.ConnectedComponentsComputation   -vif
>>>>> org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip 
>>>>> /test/input_10M
>>>>> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
>>>>> /test/ouput_10M -w 5 -ca
>>>>> giraph.userPartitionCount=150,giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,giraph.maxPartitionsInMemory=10,mapred.map.max.attempts=2,giraph.maxMessagesInMemory=100,giraph.numOutputThreads=10,giraph.useOutOfCoreMessages=true,giraph.numOutputThreads=4,giraph.numInputThreads=4,giraph.useOutOfCoreGraph=true,giraph.cleanupCheckpointsAfterSuccess=true,giraph.checkpointFrequency=1
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *thanks*
>>>>>
>>>>> *Ramesh*
>>>>>
>>>>>
>>>>
>>
>
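
For anyone following Hassan's suggestion in the quoted reply (try out-of-core without checkpointing enabled), here is a minimal sketch of one way to derive the `-ca` string. It takes a subset of the options from the original script and drops the two checkpoint-related ones. The helper name `strip_checkpoint_opts` is made up for illustration, and it assumes that leaving `giraph.checkpointFrequency` unset disables checkpointing in your build, which is worth confirming before relying on it.

```shell
#!/bin/sh
# Subset of the custom arguments from the original script in this thread.
ORIG_CA="giraph.userPartitionCount=150,giraph.maxPartitionsInMemory=10"
ORIG_CA="${ORIG_CA},giraph.useOutOfCoreGraph=true"
ORIG_CA="${ORIG_CA},giraph.cleanupCheckpointsAfterSuccess=true"
ORIG_CA="${ORIG_CA},giraph.checkpointFrequency=1"

# Hypothetical helper: split the comma-separated list into one option per
# line, drop any option whose name mentions "checkpoint", and join the
# remaining options back together with commas.
strip_checkpoint_opts() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -vi 'checkpoint' | paste -sd, -
}

NEW_CA="$(strip_checkpoint_opts "$ORIG_CA")"
echo "$NEW_CA"
```

The resulting string would then be passed after `-ca` in place of the original list, with the rest of the `hadoop jar` command left unchanged.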
