Hi guys,

The OutOfMemoryError might be solved by adding "-Dmapreduce.map.memory.mb=14848". But in my tests I found some more problems while running the out-of-core graph.
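For reference, this is roughly how I pass it (a minimal sketch; the jar path, computation class and worker count are placeholders). Note that mapreduce.map.memory.mb only raises the YARN container limit; the map task's JVM heap is set separately through mapreduce.map.java.opts and should stay somewhat below the container size:

# Sketch only: raise the YARN container limit and the map JVM heap together.
# <giraph-examples-jar>, <computation-class> and <num-workers> are placeholders.
hadoop jar <giraph-examples-jar> org.apache.giraph.GiraphRunner \
  -Dmapreduce.map.memory.mb=14848 \
  -Dmapreduce.map.java.opts=-Xmx13000m \
  <computation-class> -w <num-workers>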
I ran two tests on a 150 GB input with 10^10 vertices using version 1.2, and it seems it is not necessary to set options like "giraph.userPartitionCount=1000, giraph.maxPartitionsInMemory=1", because the mechanism is adaptive. However, if I run without setting userPartitionCount and maxPartitionsInMemory, the job keeps running on superstep -1 forever; none of the workers can finish superstep -1. I can also see a warning in the logs, though I am not sure whether it is the problem:

WARN [netty-client-worker-3] org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught: Channel failed with remote address trantor21.umiacs.umd.edu/192.168.74.221:30172
java.lang.ArrayIndexOutOfBoundsException: 1075052544
        at org.apache.giraph.comm.flow_control.NoOpFlowControl.getAckSignalFlag(NoOpFlowControl.java:52)
        at org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:796)
        at org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87)
        at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
        at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
        at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
        at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
        at org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:74)
        at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
        at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
        at java.lang.Thread.run(Thread.java:745)

If I add giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1, the whole command is:

hadoop jar /home/hlan/giraph-1.2.0-hadoop2/giraph-examples/target/giraph-examples-1.2.0-hadoop2-for-hadoop-2.6.0-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  -Dgiraph.useOutOfCoreGraph=true \
  -Ddigraph.block_factory_configurators=org.apache.giraph.conf.FacebookConfiguration \
  -Dmapreduce.map.memory.mb=14848 \
  org.apache.giraph.examples.myTask \
  -vif org.apache.giraph.examples.LongFloatNullTextInputFormat \
  -vip /user/hlan/cube/tmp/out/ \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/hlan/output \
  -w 199 \
  -ca mapred.job.tracker=localhost:5431,steps=6,giraph.isStaticGraph=true,giraph.numInputThreads=10,giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1
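To make the out-of-core knobs in that command easier to see, here they are in isolation, with my reading of what each does (the values are from this particular run, not recommendations):

# Out-of-core settings from the command above.
-Dgiraph.useOutOfCoreGraph=true      # enable the out-of-core engine
-ca giraph.userPartitionCount=1000   # fix the total partition count; 1000 / 199 workers is about 5 per worker
-ca giraph.maxPartitionsInMemory=1   # cap on partitions held in memory (per worker, as I understand it); the rest spill to disk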
With this configuration the job passes superstep -1 quickly (around 10 minutes), but it is killed near the end of superstep 0. Master log excerpt:

2016-10-27 18:53:56,607 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Vertices - Mean: 9810049, Min: Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) - 9771533, Max: Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=49, port=30049) - 9995724
2016-10-27 18:53:56,608 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Edges - Mean: 0, Min: Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) - 0, Max: Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=49, port=30049) - 0
2016-10-27 18:53:56,634 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
[... the same barrierOnWorkerList line repeats every 30 seconds, still at 0 out of 199 workers, until 19:01:26 ...]
2016-10-27 19:01:29,610 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x158084f5b2100c6, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:745)
2016-10-27 19:01:29,612 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.212:53136 which had sessionid 0x158084f5b2100c6
2016-10-27 19:01:31,702 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /192.168.74.212:56696
2016-10-27 19:01:31,711 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew session 0x158084f5b2100c6 at /192.168.74.212:56696
2016-10-27 19:01:31,712 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.ZooKeeperServer: Established session 0x158084f5b2100c6 with negotiated timeout 600000 for client /192.168.74.212:56696
2016-10-27 19:01:56,681 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
[... the same end-of-stream / reconnect / renew-session cycle then repeats for sessions 0x158084f5b2100c5, 0x158084f5b2100b9 and 0x158084f5b2100be, with barrierOnWorkerList still reporting 0 out of 199 workers in between ...]
2016-10-27 19:03:56,719 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
2016-10-27 19:04:00,000 INFO [SessionTracker] org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x158084f5b2100b8, timeout of 600000ms exceeded
2016-10-27 19:04:00,001 INFO [SessionTracker] org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x158084f5b2100c2, timeout of 600000ms exceeded
2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x158084f5b2100b8
2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x158084f5b2100c2
2016-10-27 19:04:00,004 INFO [SyncThread:0] org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.203:51116 which had sessionid 0x158084f5b2100b8
2016-10-27 19:04:00,006 INFO [SyncThread:0] org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.212:53128 which had sessionid 0x158084f5b2100c2
2016-10-27 19:04:00,033 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: setJobState: {"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1} on superstep 0
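Given the repeated session drops above, and that the sessions expire at exactly their 600000 ms timeout, one thing I may try next is giving the workers more headroom before ZooKeeper expires them. A minimal sketch (I am assuming the option name giraph.zkSessionMsecTimeout here, and the value is only an example):

# Sketch only: raise the ZooKeeper session timeout.
# "..." stands for the rest of the command shown earlier.
hadoop jar <giraph-examples-jar> org.apache.giraph.GiraphRunner \
  ... \
  -ca giraph.zkSessionMsecTimeout=1200000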
Any idea about this?

Thanks,
Hai

On Tue, Nov 8, 2016 at 6:37 AM, Denis Dudinski <denis.dudin...@gmail.com> wrote:
> Hi Xenia,
>
> Thank you! I'll check the thread you mentioned.
>
> Best Regards,
> Denis Dudinski
>
> 2016-11-08 14:16 GMT+03:00 Xenia Demetriou <xenia...@gmail.com>:
> > Hi Denis,
> >
> > For the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error I
> > hope that the conversation at the link below can help you:
> > www.mail-archive.com/user@giraph.apache.org/msg02938.html
> >
> > Regards,
> > Xenia
> >
> > 2016-11-08 12:25 GMT+02:00 Denis Dudinski <denis.dudin...@gmail.com>:
> >>
> >> Hi Hassan,
> >>
> >> Thank you for the really quick response!
> >>
> >> I changed "giraph.isStaticGraph" to false and the error disappeared.
> >> As expected, the iteration became slow and wrote the edges to disk
> >> once again in superstep 1.
> >>
> >> However, the computation failed at superstep 2 with the error
> >> "java.lang.OutOfMemoryError: GC overhead limit exceeded". It seems to
> >> be unrelated to the "isStaticGraph" issue, but I think it is worth
> >> mentioning to see the picture as a whole.
> >>
> >> Are there any other tests/information I am able to execute/check to
> >> help pinpoint the "isStaticGraph" problem?
> >>
> >> Best Regards,
> >> Denis Dudinski
> >>
> >> 2016-11-07 20:00 GMT+03:00 Hassan Eslami <hsn.esl...@gmail.com>:
> >> > Hi Denis,
> >> >
> >> > Thanks for bringing up the issue. In a previous conversation thread, a
> >> > similar problem was reported even with a simpler connected-component
> >> > example. Back then, though, we were developing other
> >> > performance-critical components of OOC.
> >> >
> >> > Let's debug this issue together to make the new OOC more stable. I
> >> > suspect the problem is with "giraph.isStaticGraph=true" (as this is
> >> > only an optimization, and most of our end-to-end testing was on cases
> >> > where the graph could change). Let's get rid of it for now and see if
> >> > the problem still exists.
> >> >
> >> > Best,
> >> > Hassan
> >> >
> >> > On Mon, Nov 7, 2016 at 6:24 AM, Denis Dudinski <denis.dudin...@gmail.com> wrote:
> >> >>
> >> >> Hello,
> >> >>
> >> >> We are trying to calculate PageRank on a huge graph which does not fit
> >> >> into memory. For the calculation to succeed we tried to turn on the
> >> >> OutOfCore feature of Giraph, but every launch we tried resulted in
> >> >> com.esotericsoftware.kryo.KryoException: Buffer underflow.
> >> >> Each time it happens on different servers, but always right after the
> >> >> start of superstep 1.
> >> >>
> >> >> We are using Giraph 1.2.0 on Hadoop 2.7.3 (our production version; we
> >> >> can't step back to Giraph's officially supported version and had to
> >> >> patch Giraph a little), deployed on 11 servers + 3 master servers
> >> >> (namenodes etc.) with a separate ZooKeeper cluster.
> >> >>
> >> >> Our launch command:
> >> >>
> >> >> hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar \
> >> >> org.apache.giraph.GiraphRunner com.prototype.di.pr.PageRankComputation \
> >> >> -mc com.prototype.di.pr.PageRankMasterCompute \
> >> >> -yj pr-job-jar-with-dependencies.jar \
> >> >> -vif com.belprime.di.pr.input.HBLongVertexInputFormat \
> >> >> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
> >> >> -op /user/hadoop/output/pr_test \
> >> >> -w 10 \
> >> >> -c com.prototype.di.pr.PRDoubleCombiner \
> >> >> -wc com.prototype.di.pr.PageRankWorkerContext \
> >> >> -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
> >> >> -ca giraph.logLevel=info \
> >> >> -ca hbase.mapreduce.inputtable=di_test \
> >> >> -ca hbase.mapreduce.scan.columns=di:n \
> >> >> -ca hbase.defaults.for.version.skip=true \
> >> >> -ca hbase.table.row.textkey=false \
> >> >> -ca giraph.yarn.task.heap.mb=48000 \
> >> >> -ca giraph.isStaticGraph=true \
> >> >> -ca giraph.SplitMasterWorker=false \
> >> >> -ca giraph.oneToAllMsgSending=true \
> >> >> -ca giraph.metrics.enable=true \
> >> >> -ca giraph.jmap.histo.enable=true \
> >> >> -ca giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable \
> >> >> -ca giraph.outgoingMessageValueClass=org.apache.hadoop.io.DoubleWritable \
> >> >> -ca giraph.inputOutEdgesClass=org.apache.giraph.edge.LongNullArrayEdges \
> >> >> -ca giraph.useOutOfCoreGraph=true \
> >> >> -ca giraph.waitForPerWorkerRequests=true \
> >> >> -ca giraph.maxNumberOfUnsentRequests=1000 \
> >> >> -ca giraph.vertexInputFilterClass=com.prototype.di.pr.input.PagesFromSameDomainLimiter \
> >> >> -ca giraph.useInputSplitLocality=true \
> >> >> -ca hbase.mapreduce.scan.cachedrows=10000 \
> >> >> -ca giraph.minPartitionsPerComputeThread=60 \
> >> >> -ca giraph.graphPartitionerFactoryClass=com.prototype.di.pr.DomainAwareGraphPartitionerFactory \
> >> >> -ca giraph.numInputThreads=1 \
> >> >> -ca giraph.inputSplitSamplePercent=20 \
> >> >> -ca giraph.pr.maxNeighborsPerVertex=50 \
> >> >> -ca giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition \
> >> >> -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \
> >> >> -ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions
> >> >>
> >> >> Logs excerpt:
> >> >>
> >> >> 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in worker context
> >> >> 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60 partitions to process with 1 compute thread(s), originally 1 thread(s) on superstep 1
> >> >> 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60 partitions in memory and 1 active threads
> >> >> 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR computation
> >> >> 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.75
> >> >> 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> >> 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> >> 16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 937ms
> >> >> 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.72
> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.74
> >> >> 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> >> 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.76
> >> >> 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 234 is done!
> >> >> 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.79
> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 18
> >> >> 16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave received req MBytes = 0.0138, secs waited = 30.058
> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.82
> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StorePartitionIOCommand: (partitionId = 234)
> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StorePartitionIOCommand: (partitionId = 234) completed: bytes= 64419740, duration=351, bandwidth=175.03, bandwidth (excluding GC time)=175.03
> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StoreIncomingMessageIOCommand: (partitionId = 234)
> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StoreIncomingMessageIOCommand: (partitionId = 234) completed: bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC time)=NaN
> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 3107ms
> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 15064ms
> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.71
> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
> >> >> JMap histo dump at Sun Nov 06 15:47:41 CET 2016
> >> >> 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 364 is done!
> >> >> 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> >> 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> >> --
> >> >> -- num     #instances         #bytes  class name
> >> >> -- ----------------------------------------------
> >> >> --    1:     224004229    10752202992  java.util.concurrent.ConcurrentHashMap$Node
> >> >> --    2:      19751666     6645730528  [B
> >> >> --    3:     222135985     5331263640  com.belprime.di.pr.DomainPartAwareLongWritable
> >> >> --    4:     214686483     5152475592  org.apache.hadoop.io.DoubleWritable
> >> >> --    5:           353     4357261784  [Ljava.util.concurrent.ConcurrentHashMap$Node;
> >> >> --    6:        486266      204484688  [I
> >> >> --    7:       6017652      192564864  org.apache.giraph.utils.UnsafeByteArrayOutputStream
> >> >> --    8:       3986203      159448120  org.apache.giraph.utils.UnsafeByteArrayInputStream
> >> >> --    9:       2064182      148621104  org.apache.giraph.graph.ByteValueVertex
> >> >> --   10:       2064182       82567280  org.apache.giraph.edge.ByteArrayEdges
> >> >> --   11:       1886875       45285000  java.lang.Integer
> >> >> --   12:        349409       30747992  java.util.concurrent.ConcurrentHashMap$TreeNode
> >> >> --   13:        916970       29343040  java.util.Collections$1
> >> >> --   14:        916971       22007304  java.util.Collections$SingletonSet
> >> >> --   15:         47270        3781600  java.util.concurrent.ConcurrentHashMap$TreeBin
> >> >> --   16:         26201        2590912  [C
> >> >> --   17:         34175        1367000  org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
> >> >> --   18:          6143        1067704  java.lang.Class
> >> >> --   19:         25953         830496  java.lang.String
> >> >> --   20:         34175         820200  org.apache.giraph.edge.EdgeNoValue
> >> >> --   21:          4488         703400  [Ljava.lang.Object;
> >> >> --   22:            70         395424  [Ljava.nio.channels.SelectionKey;
> >> >> --   23:          2052         328320  java.lang.reflect.Method
> >> >> --   24:          6600         316800  org.apache.giraph.utils.ByteArrayVertexIdMessages
> >> >> --   25:          5781         277488  java.util.HashMap$Node
> >> >> --   26:          5651         271248  java.util.Hashtable$Entry
> >> >> --   27:          6604         211328  org.apache.giraph.factories.DefaultMessageValueFactory
> >> >> 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of callable failed
> >> >> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> >> >>         at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
> >> >>         at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
> >> >>         at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
> >> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> >>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
> >> >>         at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> >> >>         at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
> >> >>         at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
> >> >>         at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
> >> >>         at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
> >> >>         at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> >> >>         at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
> >> >>         at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
> >> >>         at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
> >> >>         at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
> >> >>         ... 6 more
> >> >> 16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException: OverrideExceptionHandler on thread ooc-io-0, msg = call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!, exiting...
> >> >> [the FATAL entry carries the same RuntimeException / KryoException "Buffer underflow." stack trace as above]
> >> >> 16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/giraph_yarn_application_1478342673283_0009/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir/datanode6.webmeup.com_5 on superstep 1
> >> >>
> >> >> We looked into one thread,
> >> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%3DoLJAAPQw%40mail.gmail.com%3E
> >> >> but it is rather old, and at that time the answer was "do not use it yet"
> >> >> (see the reply at
> >> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%40mail.gmail.com%3E).
> >> >> Does that still hold today? We would like to use the new advanced adaptive OOC
> >> >> approach if possible...
> >> >>
> >> >> Thank you in advance, any help or hint would be really appreciated.
> >> >>
> >> >> Best Regards,
> >> >> Denis Dudinski
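For anyone skimming this thread later, these are the out-of-core related options exercised above, collected in one place (values are taken from the commands in this thread, with my understanding in the comments; the last line reflects Hassan's suggestion to disable the static-graph optimization while debugging, not a setting anyone has confirmed as a fix):

# Out-of-core options seen in this thread (values from the posts above, not recommendations).
-ca giraph.useOutOfCoreGraph=true          # turn the out-of-core engine on
-ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions  # local dirs for spilled partitions
-ca giraph.waitForPerWorkerRequests=true   # flow control: wait when too many requests are unsent
-ca giraph.maxNumberOfUnsentRequests=1000  # threshold for that flow control
-ca giraph.isStaticGraph=false             # disable the static-graph optimization while debugging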