[ https://issues.apache.org/jira/browse/GIRAPH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653913#comment-14653913 ]
Hassan Eslami commented on GIRAPH-1026:
---------------------------------------

Thanks, Max, for bringing this up. I see that you are using "WeightedPageRankComputation", which generates synthetic data on the fly instead of reading it from a file. I suspect the job fails in the input superstep: data is generated much faster than the OOC mechanism can offload it to disk, so the OOC mechanism cannot keep up and sooner or later the job fails. The same problem can also appear in compute supersteps because of messages (especially in applications that do not use a message combiner). GIRAPH-1025 addresses these issues by controlling the flow of incoming data (the code is at https://reviews.facebook.net/D43395 and is still under review). In GIRAPH-1025, flow control for the input superstep is turned off by default; use "-Dgiraph.enableFlowControlInput=true" to enable it.

Also, I see that you are running the job with a 20GB Xmx. The default limits for the OOC mechanism are tuned for much larger Xmx values, so I suggest increasing the OOC limit parameters as well (for instance, -Dgiraph.lowFreeMemoryFraction=0.2 -Dgiraph.midFreeMemoryFraction=0.3 -Dgiraph.fairFreeMemoryFraction=0.4). A sketch of the adjusted command is included after the quoted report below.

> New Out-of-core mechanism does not work
> ---------------------------------------
>
>                 Key: GIRAPH-1026
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-1026
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 1.2.0-SNAPSHOT
>            Reporter: Max Garmash
>
> After releasing the new OOC mechanism we tried to test it on our data and it failed.
> Our environment:
> 4x (CPU 6 cores / 12 threads, RAM 64GB)
> We can successfully process about 75 million vertices.
> With 100-120M vertices it fails like this:
> {noformat}
> 2015-08-04 12:35:21,000 INFO [AMRM Callback Handler Thread] yarn.GiraphApplicationMaster (GiraphApplicationMaster.java:onContainersCompleted(574)) - Got container status for containerID=container_1438068521412_0193_01_000005, state=COMPLETE, exitStatus=-104, diagnostics=Container [pid=6700,containerID=container_1438068521412_0193_01_000005] is running beyond physical memory limits. Current usage: 20.3 GB of 20 GB physical memory used; 22.4 GB of 42 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1438068521412_0193_01_000005 : > |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) > SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE > |- 6704 6700 6700 6700 (java) 78760 20733 24033841152 5317812 java > -Xmx20480M -Xms20480M -cp > .:${CLASSPATH}:./*:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$MR2_CLASSPATH:./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/264-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*::./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/264-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*::./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/264-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*: > org.apache.giraph.yarn.GiraphYarnTask 1438068521412 193 5 1 > |- 6700 6698 6700 6700 (bash) 0 0 14376960 433 /bin/bash -c java > -Xmx20480M -Xms20480M -cp > .:${CLASSPATH}:./*:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$MR2_CLASSPATH:./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/264-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*: > org.apache.giraph.yarn.GiraphYarnTask 
1438068521412 193 5 1
> 1>/var/log/hadoop-yarn/container/application_1438068521412_0193/container_1438068521412_0193_01_000005/task-5-stdout.log
> 2>/var/log/hadoop-yarn/container/application_1438068521412_0193/container_1438068521412_0193_01_000005/task-5-stderr.log
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
> {noformat}
> Logs from container
> {noformat}
> 2015-08-04 12:34:51,258 INFO [netty-server-worker-4] handler.RequestDecoder (RequestDecoder.java:channelRead(74)) - decode: Server window metrics MBytes/sec received = 12.5315, MBytesReceived = 380.217, ave received req MBytes = 0.007, secs waited = 30.34
> 2015-08-04 12:35:16,258 INFO [check-memory] ooc.CheckMemoryCallable (CheckMemoryCallable.java:call(221)) - call: Memory is very limited now. Calling GC manually. freeMemory = 924.27MB
> {noformat}
> We are running our job like this:
> {noformat}
> hadoop jar giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.6.0-cdh5.4.4-jar-with-dependencies.jar \
>     org.apache.giraph.GiraphRunner \
>     -Dgiraph.yarn.task.heap.mb=20480 \
>     -Dgiraph.isStaticGraph=true \
>     -Dgiraph.useOutOfCoreGraph=true \
>     -Dgiraph.logLevel=info \
>     -Dgiraph.weightedPageRank.superstepCount=5 \
>     ru.isys.WeightedPageRankComputation \
>     -vif ru.isys.CrawlerInputFormat \
>     -vip /tmp/bigdata/input \
>     -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
>     -op /tmp/giraph \
>     -w 6 \
>     -yj giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.6.0-cdh5.4.4-jar-with-dependencies.jar
> {noformat}
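
For reference, here is a sketch of the same command with the OOC limits raised as suggested in the comment above. It only restates options that already appear in this thread; the fraction values are the example numbers from the comment, not tuned recommendations, so they should be validated against the Giraph build in use:

{noformat}
hadoop jar giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.6.0-cdh5.4.4-jar-with-dependencies.jar \
    org.apache.giraph.GiraphRunner \
    -Dgiraph.yarn.task.heap.mb=20480 \
    -Dgiraph.isStaticGraph=true \
    -Dgiraph.useOutOfCoreGraph=true \
    -Dgiraph.lowFreeMemoryFraction=0.2 \
    -Dgiraph.midFreeMemoryFraction=0.3 \
    -Dgiraph.fairFreeMemoryFraction=0.4 \
    -Dgiraph.logLevel=info \
    -Dgiraph.weightedPageRank.superstepCount=5 \
    ru.isys.WeightedPageRankComputation \
    -vif ru.isys.CrawlerInputFormat \
    -vip /tmp/bigdata/input \
    -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /tmp/giraph \
    -w 6 \
    -yj giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.6.0-cdh5.4.4-jar-with-dependencies.jar
{noformat}

If the fractions are applied against the maximum heap, as the option names suggest, 0.2 of a 20GB heap leaves roughly 4GB of free-memory headroom before the OOC engine starts reacting, whereas the container log above shows GC being forced only after free memory had already dropped to about 924MB. Once the GIRAPH-1025 patch (https://reviews.facebook.net/D43395) is applied, -Dgiraph.enableFlowControlInput=true would be appended as well; that option comes from a patch that is still under review, so the exact key may change before it is committed.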