Hi Denis,

Thanks for bringing up the issue. In a previous conversation thread, a similar problem was reported with an even simpler example, a connected-component calculation. Back then, though, we were busy developing other performance-critical components of OOC.
Let's debug this issue together to make the new OOC more stable. I suspect the problem is with "giraph.isStaticGraph=true" (this is only an optimization, and most of our end-to-end testing was on cases where the graph could change). Let's get rid of it for now and see if the problem still exists.

Best,
Hassan

On Mon, Nov 7, 2016 at 6:24 AM, Denis Dudinski <denis.dudin...@gmail.com> wrote:
> Hello,
>
> We are trying to calculate PageRank on a huge graph, which does not fit
> into memory. For the calculation to succeed we tried to turn on the
> OutOfCore feature of Giraph, but every launch we tried resulted in
> com.esotericsoftware.kryo.KryoException: Buffer underflow.
> Each time it happens on different servers, but always right after the
> start of superstep 1.
>
> We are using Giraph 1.2.0 on Hadoop 2.7.3 (our prod version; we can't
> step back to Giraph's officially supported version and had to patch
> Giraph a little) placed on 11 servers + 3 master servers (namenodes
> etc.) with a separate ZooKeeper cluster deployment.
>
> Our launch command:
>
> hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar \
> org.apache.giraph.GiraphRunner com.prototype.di.pr.PageRankComputation \
> -mc com.prototype.di.pr.PageRankMasterCompute \
> -yj pr-job-jar-with-dependencies.jar \
> -vif com.belprime.di.pr.input.HBLongVertexInputFormat \
> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
> -op /user/hadoop/output/pr_test \
> -w 10 \
> -c com.prototype.di.pr.PRDoubleCombiner \
> -wc com.prototype.di.pr.PageRankWorkerContext \
> -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
> -ca giraph.logLevel=info \
> -ca hbase.mapreduce.inputtable=di_test \
> -ca hbase.mapreduce.scan.columns=di:n \
> -ca hbase.defaults.for.version.skip=true \
> -ca hbase.table.row.textkey=false \
> -ca giraph.yarn.task.heap.mb=48000 \
> -ca giraph.isStaticGraph=true \
> -ca giraph.SplitMasterWorker=false \
> -ca giraph.oneToAllMsgSending=true \
> -ca giraph.metrics.enable=true \
> -ca giraph.jmap.histo.enable=true \
> -ca giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable \
> -ca giraph.outgoingMessageValueClass=org.apache.hadoop.io.DoubleWritable \
> -ca giraph.inputOutEdgesClass=org.apache.giraph.edge.LongNullArrayEdges \
> -ca giraph.useOutOfCoreGraph=true \
> -ca giraph.waitForPerWorkerRequests=true \
> -ca giraph.maxNumberOfUnsentRequests=1000 \
> -ca giraph.vertexInputFilterClass=com.prototype.di.pr.input.PagesFromSameDomainLimiter \
> -ca giraph.useInputSplitLocality=true \
> -ca hbase.mapreduce.scan.cachedrows=10000 \
> -ca giraph.minPartitionsPerComputeThread=60 \
> -ca giraph.graphPartitionerFactoryClass=com.prototype.di.pr.DomainAwareGraphPartitionerFactory \
> -ca giraph.numInputThreads=1 \
> -ca giraph.inputSplitSamplePercent=20 \
> -ca giraph.pr.maxNeighborsPerVertex=50 \
> -ca giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition \
> -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \
> -ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions
>
> Logs excerpt:
>
> 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in worker context
> 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60 partitions to process with 1 compute thread(s), originally 1 thread(s) on superstep 1
> 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60 partitions in memory and 1 active threads
> 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR computation
> 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.75
> 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> 16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 937ms
> 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.72
> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.74
> 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.76
> 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 234 is done!
> 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.79
> 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 18
> 16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave received req MBytes = 0.0138, secs waited = 30.058
> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.82
> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StorePartitionIOCommand: (partitionId = 234)
> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StorePartitionIOCommand: (partitionId = 234) completed: bytes= 64419740, duration=351, bandwidth=175.03, bandwidth (excluding GC time)=175.03
> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StoreIncomingMessageIOCommand: (partitionId = 234)
> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StoreIncomingMessageIOCommand: (partitionId = 234) completed: bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC time)=NaN
> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 3107ms
> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 15064ms
> 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.71
> 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
> JMap histo dump at Sun Nov 06 15:47:41 CET 2016
> 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 364 is done!
> 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> --
> --  num   #instances      #bytes  class name
> -- ----------------------------------------------
> --    1:  224004229  10752202992  java.util.concurrent.ConcurrentHashMap$Node
> --    2:   19751666   6645730528  [B
> --    3:  222135985   5331263640  com.belprime.di.pr.DomainPartAwareLongWritable
> --    4:  214686483   5152475592  org.apache.hadoop.io.DoubleWritable
> --    5:        353   4357261784  [Ljava.util.concurrent.ConcurrentHashMap$Node;
> --    6:     486266    204484688  [I
> --    7:    6017652    192564864  org.apache.giraph.utils.UnsafeByteArrayOutputStream
> --    8:    3986203    159448120  org.apache.giraph.utils.UnsafeByteArrayInputStream
> --    9:    2064182    148621104  org.apache.giraph.graph.ByteValueVertex
> --   10:    2064182     82567280  org.apache.giraph.edge.ByteArrayEdges
> --   11:    1886875     45285000  java.lang.Integer
> --   12:     349409     30747992  java.util.concurrent.ConcurrentHashMap$TreeNode
> --   13:     916970     29343040  java.util.Collections$1
> --   14:     916971     22007304  java.util.Collections$SingletonSet
> --   15:      47270      3781600  java.util.concurrent.ConcurrentHashMap$TreeBin
> --   16:      26201      2590912  [C
> --   17:      34175      1367000  org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
> --   18:       6143      1067704  java.lang.Class
> --   19:      25953       830496  java.lang.String
> --   20:      34175       820200  org.apache.giraph.edge.EdgeNoValue
> --   21:       4488       703400  [Ljava.lang.Object;
> --   22:         70       395424  [Ljava.nio.channels.SelectionKey;
> --   23:       2052       328320  java.lang.reflect.Method
> --   24:       6600       316800  org.apache.giraph.utils.ByteArrayVertexIdMessages
> --   25:       5781       277488  java.util.HashMap$Node
> --   26:       5651       271248  java.util.Hashtable$Entry
> --   27:       6604       211328  org.apache.giraph.factories.DefaultMessageValueFactory
> 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of callable failed
> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
> at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
> at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
> at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
> at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
> ... 6 more
> 16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException: OverrideExceptionHandler on thread ooc-io-0, msg = call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!, exiting...
> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
> at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
> at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
> at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
> at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
> ... 6 more
> 16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on
> /_hadoopBsp/giraph_yarn_application_1478342673283_0009/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir/datanode6.webmeup.com_5
> on superstep 1
>
> We looked into one thread,
> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%3DoLJAAPQw%40mail.gmail.com%3E,
> but it is rather old, and at that time the answer was "do not use it yet"
> (see reply http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%40mail.gmail.com%3E).
> Does that advice still hold today? We would like to use the new advanced
> adaptive OOC approach if possible...
>
> Thank you in advance; any help or hint would be really appreciated.
>
> Best Regards,
> Denis Dudinski
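Concretely, the suggestion above amounts to re-running the exact same job with the static-graph optimization dropped. A minimal sketch of the change, showing only a few of the flags from Denis's command for brevity (every other `-ca` option stays exactly as in the original launch command):

```shell
# Same launch as before, with "-ca giraph.isStaticGraph=true" removed.
# Out-of-core stays enabled; only the static-graph optimization is disabled.
hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner com.prototype.di.pr.PageRankComputation \
  -mc com.prototype.di.pr.PageRankMasterCompute \
  -w 10 \
  -ca giraph.useOutOfCoreGraph=true \
  -ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions
  # ... all remaining -ca flags unchanged, except giraph.isStaticGraph
```

If the Kryo "Buffer underflow" disappears with this change, that would point at the static-graph read path (which re-reads edges serialized in an earlier superstep) rather than at out-of-core processing in general.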