Out of core computation fails with KryoException: Buffer underflow

Denis Dudinski Mon, 07 Nov 2016 04:24:40 -0800

Hello,

We are trying to calculate PageRank on huge graph, which does not fit
into memory. For calculation to succeed we tried to turn on OutOfCore
feature of Giraph, but every launch we tried resulted in
com.esotericsoftware.kryo.KryoException: Buffer underflow.
Each time it happens on different servers but exactly right after
start of superstep 1.


We are using Giraph 1.2.0 on hadoop 2.7.3 (our prod version, can't
back-step to Giraph's officially supported version and had to patch
Giraph a little) placed on 11 servers + 3 master servers (namenodes
etc.) with separate ZooKeeper cluster deployment.

Our launch command:

hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar
org.apache.giraph.GiraphRunner com.prototype.di.pr.PageRankComputation
\
-mc com.prototype.di.pr.PageRankMasterCompute \
-yj pr-job-jar-with-dependencies.jar \
-vif com.belprime.di.pr.input.HBLongVertexInputFormat \
-vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
-op /user/hadoop/output/pr_test \
-w 10 \
-c com.prototype.di.pr.PRDoubleCombiner \
-wc com.prototype.di.pr.PageRankWorkerContext \
-ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
-ca giraph.logLevel=info \
-ca hbase.mapreduce.inputtable=di_test \
-ca hbase.mapreduce.scan.columns=di:n \
-ca hbase.defaults.for.version.skip=true \
-ca hbase.table.row.textkey=false \
-ca giraph.yarn.task.heap.mb=48000 \
-ca giraph.isStaticGraph=true \
-ca giraph.SplitMasterWorker=false \
-ca giraph.oneToAllMsgSending=true \
-ca giraph.metrics.enable=true \
-ca giraph.jmap.histo.enable=true \
-ca giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable \
-ca giraph.outgoingMessageValueClass=org.apache.hadoop.io.DoubleWritable \
-ca giraph.inputOutEdgesClass=org.apache.giraph.edge.LongNullArrayEdges \
-ca giraph.useOutOfCoreGraph=true \
-ca giraph.waitForPerWorkerRequests=true \
-ca giraph.maxNumberOfUnsentRequests=1000 \
-ca 
giraph.vertexInputFilterClass=com.prototype.di.pr.input.PagesFromSameDomainLimiter
\
-ca giraph.useInputSplitLocality=true \
-ca hbase.mapreduce.scan.cachedrows=10000 \
-ca giraph.minPartitionsPerComputeThread=60 \
-ca 
giraph.graphPartitionerFactoryClass=com.prototype.di.pr.DomainAwareGraphPartitionerFactory
\
-ca giraph.numInputThreads=1 \
-ca giraph.inputSplitSamplePercent=20 \
-ca giraph.pr.maxNeighborsPerVertex=50 \
-ca giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition \
-ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \
-ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions

Logs excerpt:

16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in worker context
16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60 partitions
to process with 1 compute thread(s), originally 1 thread(s) on
superstep 1
16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60
partitions in memory and 1 active threads
16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR computation
16/11/06 15:47:15 INFO policy.ThresholdBasedOracle: getNextIOActions:
usedMemoryFraction = 0.75
16/11/06 15:47:16 INFO ooc.OutOfCoreEngine:
updateActiveThreadsFraction: updating the number of active threads to
1
16/11/06 15:47:16 INFO policy.ThresholdBasedOracle:
updateRequestsCredit: updating the credit to 20
16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring:
name = PS Scavenge, action = end of minor GC, cause = Allocation
Failure, duration = 937ms
16/11/06 15:47:17 INFO policy.ThresholdBasedOracle: getNextIOActions:
usedMemoryFraction = 0.72
16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: getNextIOActions:
usedMemoryFraction = 0.74
16/11/06 15:47:18 INFO ooc.OutOfCoreEngine:
updateActiveThreadsFraction: updating the number of active threads to
1
16/11/06 15:47:18 INFO policy.ThresholdBasedOracle:
updateRequestsCredit: updating the credit to 20
16/11/06 15:47:19 INFO policy.ThresholdBasedOracle: getNextIOActions:
usedMemoryFraction = 0.76
16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition:
processing partition 234 is done!
16/11/06 15:47:20 INFO policy.ThresholdBasedOracle: getNextIOActions:
usedMemoryFraction = 0.79
16/11/06 15:47:21 INFO ooc.OutOfCoreEngine:
updateActiveThreadsFraction: updating the number of active threads to
1
16/11/06 15:47:21 INFO policy.ThresholdBasedOracle:
updateRequestsCredit: updating the credit to 18
16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window
metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave
received req MBytes = 0.0138, secs waited = 30.058
16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions:
usedMemoryFraction = 0.82
16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next
IO command is: StorePartitionIOCommand: (partitionId = 234)
16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's
command StorePartitionIOCommand: (partitionId = 234) completed: bytes=
64419740, duration=351, bandwidth=175.03, bandwidth (excluding GC
time)=175.03
16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions:
usedMemoryFraction = 0.83
16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next
IO command is: StoreIncomingMessageIOCommand: (partitionId = 234)
16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's
command StoreIncomingMessageIOCommand: (partitionId = 234) completed:
bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC time)=NaN
16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions:
usedMemoryFraction = 0.83
16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring:
name = PS Scavenge, action = end of minor GC, cause = Allocation
Failure, duration = 3107ms
16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring:
name = PS MarkSweep, action = end of major GC, cause = Ergonomics,
duration = 15064ms
16/11/06 15:47:40 INFO ooc.OutOfCoreEngine:
updateActiveThreadsFraction: updating the number of active threads to
1
16/11/06 15:47:40 INFO policy.ThresholdBasedOracle:
updateRequestsCredit: updating the credit to 20
16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: getNextIOActions:
usedMemoryFraction = 0.71
16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's next
IO command is: LoadPartitionIOCommand: (partitionId = 234, superstep =
2)
JMap histo dump at Sun Nov 06 15:47:41 CET 2016
16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition:
processing partition 364 is done!
16/11/06 15:47:48 INFO ooc.OutOfCoreEngine:
updateActiveThreadsFraction: updating the number of active threads to
1
16/11/06 15:47:48 INFO policy.ThresholdBasedOracle:
updateRequestsCredit: updating the credit to 20
--
-- num     #instances         #bytes  class name
-- ----------------------------------------------
--   1:     224004229    10752202992
java.util.concurrent.ConcurrentHashMap$Node
--   2:      19751666     6645730528  [B
--   3:     222135985     5331263640
com.belprime.di.pr.DomainPartAwareLongWritable
--   4:     214686483     5152475592  org.apache.hadoop.io.DoubleWritable
--   5:           353     4357261784
[Ljava.util.concurrent.ConcurrentHashMap$Node;
--   6:        486266      204484688  [I
--   7:       6017652      192564864
org.apache.giraph.utils.UnsafeByteArrayOutputStream
--   8:       3986203      159448120
org.apache.giraph.utils.UnsafeByteArrayInputStream
--   9:       2064182      148621104  org.apache.giraph.graph.ByteValueVertex
--  10:       2064182       82567280  org.apache.giraph.edge.ByteArrayEdges
--  11:       1886875       45285000  java.lang.Integer
--  12:        349409       30747992
java.util.concurrent.ConcurrentHashMap$TreeNode
--  13:        916970       29343040  java.util.Collections$1
--  14:        916971       22007304  java.util.Collections$SingletonSet
--  15:         47270        3781600
java.util.concurrent.ConcurrentHashMap$TreeBin
--  16:         26201        2590912  [C
--  17:         34175        1367000
org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
--  18:          6143        1067704  java.lang.Class
--  19:         25953         830496  java.lang.String
--  20:         34175         820200  org.apache.giraph.edge.EdgeNoValue
--  21:          4488         703400  [Ljava.lang.Object;
--  22:            70         395424  [Ljava.nio.channels.SelectionKey;
--  23:          2052         328320  java.lang.reflect.Method
--  24:          6600         316800
org.apache.giraph.utils.ByteArrayVertexIdMessages
--  25:          5781         277488  java.util.HashMap$Node
--  26:          5651         271248  java.util.Hashtable$Entry
--  27:          6604         211328
org.apache.giraph.factories.DefaultMessageValueFactory
16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of
callable failed
java.lang.RuntimeException: call: execution of IO command
LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
at 
org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
at 
org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
at 
org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
at 
org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
at 
org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
at 
org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
... 6 more
16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException:
OverrideExceptionHandler on thread ooc-io-0, msg = call: execution of
IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
failed!, exiting...
java.lang.RuntimeException: call: execution of IO command
LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
at 
org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
at 
org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
at 
org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
at 
org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
at 
org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
at 
org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
... 6 more
16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth: Got
failure, unregistering health on
/_hadoopBsp/giraph_yarn_application_1478342673283_0009/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir/datanode6.webmeup.com_5
on superstep 1

We looked into one thread
http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%3DoLJAAPQw%40mail.gmail.com%3E
but it is rather old and at that time the answer was "do not use it
yet".
(see reply 
http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%40mail.gmail.com%3E).
Does it hold today? We would like to use new advanced adaptive OOC
approach if possible...

Thank you in advance, any help or hint would be really appreciated.

Best Regards,
Denis Dudinski

Out of core computation fails with KryoException: Buffer underflow

Reply via email to