Hello, We are trying to calculate PageRank on huge graph, which does not fit into memory. For calculation to succeed we tried to turn on OutOfCore feature of Giraph, but every launch we tried resulted in com.esotericsoftware.kryo.KryoException: Buffer underflow. Each time it happens on different servers but exactly right after start of superstep 1.
We are using Giraph 1.2.0 on hadoop 2.7.3 (our prod version, can't back-step to Giraph's officially supported version and had to patch Giraph a little) placed on 11 servers + 3 master servers (namenodes etc.) with separate ZooKeeper cluster deployment. Our launch command: hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar org.apache.giraph.GiraphRunner com.prototype.di.pr.PageRankComputation \ -mc com.prototype.di.pr.PageRankMasterCompute \ -yj pr-job-jar-with-dependencies.jar \ -vif com.belprime.di.pr.input.HBLongVertexInputFormat \ -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \ -op /user/hadoop/output/pr_test \ -w 10 \ -c com.prototype.di.pr.PRDoubleCombiner \ -wc com.prototype.di.pr.PageRankWorkerContext \ -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \ -ca giraph.logLevel=info \ -ca hbase.mapreduce.inputtable=di_test \ -ca hbase.mapreduce.scan.columns=di:n \ -ca hbase.defaults.for.version.skip=true \ -ca hbase.table.row.textkey=false \ -ca giraph.yarn.task.heap.mb=48000 \ -ca giraph.isStaticGraph=true \ -ca giraph.SplitMasterWorker=false \ -ca giraph.oneToAllMsgSending=true \ -ca giraph.metrics.enable=true \ -ca giraph.jmap.histo.enable=true \ -ca giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable \ -ca giraph.outgoingMessageValueClass=org.apache.hadoop.io.DoubleWritable \ -ca giraph.inputOutEdgesClass=org.apache.giraph.edge.LongNullArrayEdges \ -ca giraph.useOutOfCoreGraph=true \ -ca giraph.waitForPerWorkerRequests=true \ -ca giraph.maxNumberOfUnsentRequests=1000 \ -ca giraph.vertexInputFilterClass=com.prototype.di.pr.input.PagesFromSameDomainLimiter \ -ca giraph.useInputSplitLocality=true \ -ca hbase.mapreduce.scan.cachedrows=10000 \ -ca giraph.minPartitionsPerComputeThread=60 \ -ca giraph.graphPartitionerFactoryClass=com.prototype.di.pr.DomainAwareGraphPartitionerFactory \ -ca giraph.numInputThreads=1 \ -ca giraph.inputSplitSamplePercent=20 \ -ca giraph.pr.maxNeighborsPerVertex=50 \ -ca giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition \ -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \ -ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions Logs excerpt: 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in worker context 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60 partitions to process with 1 compute thread(s), originally 1 thread(s) on superstep 1 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60 partitions in memory and 1 active threads 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR computation 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.75 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 937ms 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.72 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.74 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.76 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 234 is done! 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.79 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 18 16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave received req MBytes = 0.0138, secs waited = 30.058 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.82 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StorePartitionIOCommand: (partitionId = 234) 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StorePartitionIOCommand: (partitionId = 234) completed: bytes= 64419740, duration=351, bandwidth=175.03, bandwidth (excluding GC time)=175.03 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StoreIncomingMessageIOCommand: (partitionId = 234) 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StoreIncomingMessageIOCommand: (partitionId = 234) completed: bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC time)=NaN 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 3107ms 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 15064ms 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.71 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: LoadPartitionIOCommand: (partitionId = 234, superstep = 2) JMap histo dump at Sun Nov 06 15:47:41 CET 2016 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 364 is done! 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20 -- -- num #instances #bytes class name -- ---------------------------------------------- -- 1: 224004229 10752202992 java.util.concurrent.ConcurrentHashMap$Node -- 2: 19751666 6645730528 [B -- 3: 222135985 5331263640 com.belprime.di.pr.DomainPartAwareLongWritable -- 4: 214686483 5152475592 org.apache.hadoop.io.DoubleWritable -- 5: 353 4357261784 [Ljava.util.concurrent.ConcurrentHashMap$Node; -- 6: 486266 204484688 [I -- 7: 6017652 192564864 org.apache.giraph.utils.UnsafeByteArrayOutputStream -- 8: 3986203 159448120 org.apache.giraph.utils.UnsafeByteArrayInputStream -- 9: 2064182 148621104 org.apache.giraph.graph.ByteValueVertex -- 10: 2064182 82567280 org.apache.giraph.edge.ByteArrayEdges -- 11: 1886875 45285000 java.lang.Integer -- 12: 349409 30747992 java.util.concurrent.ConcurrentHashMap$TreeNode -- 13: 916970 29343040 java.util.Collections$1 -- 14: 916971 22007304 java.util.Collections$SingletonSet -- 15: 47270 3781600 java.util.concurrent.ConcurrentHashMap$TreeBin -- 16: 26201 2590912 [C -- 17: 34175 1367000 org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator -- 18: 6143 1067704 java.lang.Class -- 19: 25953 830496 java.lang.String -- 20: 34175 820200 org.apache.giraph.edge.EdgeNoValue -- 21: 4488 703400 [Ljava.lang.Object; -- 22: 70 395424 [Ljava.nio.channels.SelectionKey; -- 23: 2052 328320 java.lang.reflect.Method -- 24: 6600 316800 org.apache.giraph.utils.ByteArrayVertexIdMessages -- 25: 5781 277488 java.util.HashMap$Node -- 26: 5651 271248 java.util.Hashtable$Entry -- 27: 6604 211328 org.apache.giraph.factories.DefaultMessageValueFactory 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of callable failed java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed! at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115) at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36) at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow. at com.esotericsoftware.kryo.io.Input.require(Input.java:199) at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112) at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91) at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47) at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245) at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278) at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234) at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311) at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66) at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99) ... 6 more 16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException: OverrideExceptionHandler on thread ooc-io-0, msg = call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!, exiting... java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed! at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115) at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36) at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow. at com.esotericsoftware.kryo.io.Input.require(Input.java:199) at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112) at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91) at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47) at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245) at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278) at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234) at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311) at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66) at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99) ... 6 more 16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/giraph_yarn_application_1478342673283_0009/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir/datanode6.webmeup.com_5 on superstep 1 We looked into one thread http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%3DoLJAAPQw%40mail.gmail.com%3E but it is rather old and at that time the answer was "do not use it yet". (see reply http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%40mail.gmail.com%3E). Does it hold today? We would like to use new advanced adaptive OOC approach if possible... Thank you in advance, any help or hint would be really appreciated. Best Regards, Denis Dudinski