Hi Hassan,

Thank you for the really quick response!
I changed "giraph.isStaticGraph" to false and the error disappeared. As expected, the iteration became slow and the edges were written to disk once again in superstep 1. However, the computation failed at superstep 2 with "java.lang.OutOfMemoryError: GC overhead limit exceeded". It seems to be unrelated to the "isStaticGraph" issue, but I think it is worth mentioning to see the picture as a whole.

Are there any other tests I could run, or information I could check, to help pinpoint the "isStaticGraph" problem?

Best Regards,
Denis Dudinski

2016-11-07 20:00 GMT+03:00 Hassan Eslami <hsn.esl...@gmail.com>:
> Hi Denis,
>
> Thanks for bringing up the issue. In a previous conversation thread, a
> similar problem was reported even with a simpler example, a connected-
> component calculation. Back then, though, we were developing other
> performance-critical components of OOC.
>
> Let's debug this issue together to make the new OOC more stable. I suspect
> the problem is with "giraph.isStaticGraph=true" (as this is only an
> optimization, and most of our end-to-end testing was on cases where the
> graph could change). Let's get rid of it for now and see if the problem
> still exists.
>
> Best,
> Hassan
>
> On Mon, Nov 7, 2016 at 6:24 AM, Denis Dudinski <denis.dudin...@gmail.com>
> wrote:
>>
>> Hello,
>>
>> We are trying to calculate PageRank on a huge graph, which does not fit
>> into memory. For the calculation to succeed we tried to turn on the
>> OutOfCore feature of Giraph, but every launch we tried resulted in
>> com.esotericsoftware.kryo.KryoException: Buffer underflow.
>> Each time it happens on different servers, but always right after the
>> start of superstep 1.
>>
>> We are using Giraph 1.2.0 on Hadoop 2.7.3 (our prod version; we can't
>> step back to Giraph's officially supported version and had to patch
>> Giraph a little) placed on 11 servers + 3 master servers (namenodes
>> etc.) with a separate ZooKeeper cluster deployment.
>>
>> Our launch command:
>>
>> hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar \
>> org.apache.giraph.GiraphRunner com.prototype.di.pr.PageRankComputation \
>> -mc com.prototype.di.pr.PageRankMasterCompute \
>> -yj pr-job-jar-with-dependencies.jar \
>> -vif com.belprime.di.pr.input.HBLongVertexInputFormat \
>> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
>> -op /user/hadoop/output/pr_test \
>> -w 10 \
>> -c com.prototype.di.pr.PRDoubleCombiner \
>> -wc com.prototype.di.pr.PageRankWorkerContext \
>> -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
>> -ca giraph.logLevel=info \
>> -ca hbase.mapreduce.inputtable=di_test \
>> -ca hbase.mapreduce.scan.columns=di:n \
>> -ca hbase.defaults.for.version.skip=true \
>> -ca hbase.table.row.textkey=false \
>> -ca giraph.yarn.task.heap.mb=48000 \
>> -ca giraph.isStaticGraph=true \
>> -ca giraph.SplitMasterWorker=false \
>> -ca giraph.oneToAllMsgSending=true \
>> -ca giraph.metrics.enable=true \
>> -ca giraph.jmap.histo.enable=true \
>> -ca giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable \
>> -ca giraph.outgoingMessageValueClass=org.apache.hadoop.io.DoubleWritable \
>> -ca giraph.inputOutEdgesClass=org.apache.giraph.edge.LongNullArrayEdges \
>> -ca giraph.useOutOfCoreGraph=true \
>> -ca giraph.waitForPerWorkerRequests=true \
>> -ca giraph.maxNumberOfUnsentRequests=1000 \
>> -ca giraph.vertexInputFilterClass=com.prototype.di.pr.input.PagesFromSameDomainLimiter \
>> -ca giraph.useInputSplitLocality=true \
>> -ca hbase.mapreduce.scan.cachedrows=10000 \
>> -ca giraph.minPartitionsPerComputeThread=60 \
>> -ca giraph.graphPartitionerFactoryClass=com.prototype.di.pr.DomainAwareGraphPartitionerFactory \
>> -ca giraph.numInputThreads=1 \
>> -ca giraph.inputSplitSamplePercent=20 \
>> -ca giraph.pr.maxNeighborsPerVertex=50 \
>> -ca giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition \
>> -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \
>> -ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions
>>
>> Logs excerpt:
>>
>> 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in worker context
>> 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60 partitions to process with 1 compute thread(s), originally 1 thread(s) on superstep 1
>> 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60 partitions in memory and 1 active threads
>> 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR computation
>> 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.75
>> 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
>> 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
>> 16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 937ms
>> 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.72
>> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.74
>> 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
>> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
>> 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.76
>> 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 234 is done!
>> 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.79
>> 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
>> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 18
>> 16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave received req MBytes = 0.0138, secs waited = 30.058
>> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.82
>> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StorePartitionIOCommand: (partitionId = 234)
>> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StorePartitionIOCommand: (partitionId = 234) completed: bytes= 64419740, duration=351, bandwidth=175.03, bandwidth (excluding GC time)=175.03
>> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
>> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StoreIncomingMessageIOCommand: (partitionId = 234)
>> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StoreIncomingMessageIOCommand: (partitionId = 234) completed: bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC time)=NaN
>> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
>> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 3107ms
>> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 15064ms
>> 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
>> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
>> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.71
>> 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
>> JMap histo dump at Sun Nov 06 15:47:41 CET 2016
>> 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 364 is done!
>> 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
>> 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
>> -- num     #instances       #bytes  class name
>> -- ----------------------------------------------
>> --   1:    224004229  10752202992  java.util.concurrent.ConcurrentHashMap$Node
>> --   2:     19751666   6645730528  [B
>> --   3:    222135985   5331263640  com.belprime.di.pr.DomainPartAwareLongWritable
>> --   4:    214686483   5152475592  org.apache.hadoop.io.DoubleWritable
>> --   5:          353   4357261784  [Ljava.util.concurrent.ConcurrentHashMap$Node;
>> --   6:       486266    204484688  [I
>> --   7:      6017652    192564864  org.apache.giraph.utils.UnsafeByteArrayOutputStream
>> --   8:      3986203    159448120  org.apache.giraph.utils.UnsafeByteArrayInputStream
>> --   9:      2064182    148621104  org.apache.giraph.graph.ByteValueVertex
>> --  10:      2064182     82567280  org.apache.giraph.edge.ByteArrayEdges
>> --  11:      1886875     45285000  java.lang.Integer
>> --  12:       349409     30747992  java.util.concurrent.ConcurrentHashMap$TreeNode
>> --  13:       916970     29343040  java.util.Collections$1
>> --  14:       916971     22007304  java.util.Collections$SingletonSet
>> --  15:        47270      3781600  java.util.concurrent.ConcurrentHashMap$TreeBin
>> --  16:        26201      2590912  [C
>> --  17:        34175      1367000  org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
>> --  18:         6143      1067704  java.lang.Class
>> --  19:        25953       830496  java.lang.String
>> --  20:        34175       820200  org.apache.giraph.edge.EdgeNoValue
>> --  21:         4488       703400  [Ljava.lang.Object;
>> --  22:           70       395424  [Ljava.nio.channels.SelectionKey;
>> --  23:         2052       328320  java.lang.reflect.Method
>> --  24:         6600       316800  org.apache.giraph.utils.ByteArrayVertexIdMessages
>> --  25:         5781       277488  java.util.HashMap$Node
>> --  26:         5651       271248  java.util.Hashtable$Entry
>> --  27:         6604       211328  org.apache.giraph.factories.DefaultMessageValueFactory
>> 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of callable failed
>> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
>> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
>> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
>> at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
>> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
>> at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
>> at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
>> at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
>> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
>> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
>> at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
>> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
>> at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
>> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
>> ... 6 more
>> 16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException: OverrideExceptionHandler on thread ooc-io-0, msg = call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!, exiting...
>> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
>> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
>> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
>> at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
>> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
>> at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
>> at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
>> at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
>> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
>> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
>> at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
>> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
>> at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
>> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
>> ... 6 more
>> 16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/giraph_yarn_application_1478342673283_0009/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir/datanode6.webmeup.com_5 on superstep 1
>>
>> We looked into one earlier thread:
>> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%3DoLJAAPQw%40mail.gmail.com%3E
>> but it is rather old, and at that time the answer was "do not use it yet"
>> (see the reply at
>> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%40mail.gmail.com%3E).
>> Does that still hold today? We would like to use the new adaptive OOC
>> approach if possible...
>>
>> Thank you in advance; any help or hint would be really appreciated.
>>
>> Best Regards,
>> Denis Dudinski
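
P.S. Looking at the stack trace again: the failure is in DiskBackedPartitionStore.readOutEdges while loading a partition for superstep 2 that was stored during superstep 1. A "Buffer underflow" is what Kryo throws when the load path tries to read more bytes than the store path actually wrote, which would fit the idea that the isStaticGraph=true optimization skips re-serializing edges on store while the load still expects them. This is just a guess about the mechanism, of course. A minimal sketch of that failure mode using only the JDK (plain DataInput's EOFException standing in for Kryo's underflow; the class and method names here are made up for illustration, this is not Giraph code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class UnderflowSketch {

    // Store path: persists only the vertex id, skipping the edges -- the
    // kind of shortcut a static-graph optimization might take on later
    // supersteps.
    static byte[] storeWithoutEdges(long vertexId) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeLong(vertexId);
        }
        return buf.toByteArray();
    }

    // Load path: expects vertex id + out-edge count, i.e. more data than
    // was stored, so the second read overruns the buffer.
    static boolean loadExpectingEdgesUnderflows(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        in.readLong();       // vertex id: reads fine
        try {
            in.readLong();   // edge count: nothing left to read
            return false;
        } catch (EOFException e) {
            return true;     // the plain-DataInput analogue of
                             // KryoException: Buffer underflow
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(loadExpectingEdgesUnderflows(storeWithoutEdges(42L)));
    }
}
```

If that is the right failure mode, it would explain why the error vanished with isStaticGraph=false: the store path then writes the edges again, so the load path finds the bytes it expects.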