[ 
https://issues.apache.org/jira/browse/GIRAPH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701081#comment-14701081
 ] 

Vitaly Tsvetkov commented on GIRAPH-1026:
-----------------------------------------

Hi Hassan!
After applying the GIRAPH-1025 patch, tuning the cluster, and experimenting with 
the memory fractions, we now successfully pass the input superstep with a large 
graph (v=273.7M, e=10166.5M). 
We run it like this:
{noformat}
hadoop fs -rm -R -skipTrash /tmp/giraph
hadoop jar graph-with-dependencies.jar \
 org.apache.giraph.GiraphRunner \
 -Dgiraph.yarn.task.heap.mb=58880 \
 -Dgiraph.useOutOfCoreGraph=true \
 -Dgiraph.enableFlowControlInput=true \
 -Dgiraph.userPartitionCount=36 \
 -Dgiraph.useBigDataIOForMessages=true \
 -Dgiraph.waitForPerWorkerRequests=true \
 -Dgiraph.maxNumberOfUnsentRequests=1000 \
 -Dgiraph.lowFreeMemoryFraction=0.27 -Dgiraph.midFreeMemoryFraction=0.4 \
 -Dgiraph.fairFreeMemoryFraction=0.41 -Dgiraph.highFreeMemoryFraction=0.42 \
 -Dhash.partitionBalanceAlgorithm=edges \
 -Dgiraph.messageCombinerClass=ru.isys.FloatSumMessageCombiner \
 -Dgiraph.weightedPageRank.superstepCount=5 \
 ru.isys.WeightedPageRankComputation \
 -vif ru.isys.CrawlerInputFormat -vip /tmp/bigdata/vk \
 -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /tmp/giraph \
 -w 3 \
 -yj graph-with-dependencies.jar
{noformat}
but we get a memory error like the one in the issue description as soon as 
Superstep 0 starts.
The main log says:
{noformat}
15/08/18 10:12:53 INFO yarn.GiraphApplicationMaster: Got response from RM for 
container ask, completedCnt=1
15/08/18 10:12:53 INFO yarn.GiraphApplicationMaster: Got container status for 
containerID=container_1439560605156_0053_01_000003, state=COMPLETE, 
exitStatus=-104, diagnostics=Container 
[pid=12175,containerID=container_1439560605156_0053_01_000003] is running 
beyond physical memory limits. Current usage: 57.9 GB of 57.5 GB physical 
memory used; 60.5 GB of 120.7 GB virtual memory used. Killing container.
Dump of the process-tree for container_1439560605156_0053_01_000003 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
        |- 12179 12175 12175 12175 (java) 3367972 66964 64979361792 15176127 
java -Xmx58880M -Xms58880M -cp 
.:${CLASSPATH}:./*:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$MR2_CLASSPATH:./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/735-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*::./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/735-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*::./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/735-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-
yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*:
 org.apache.giraph.yarn.GiraphYarnTask 1439560605156 53 3 1 
        |- 12175 12173 12175 12175 (bash) 0 0 14381056 275 /bin/bash -c java 
-Xmx58880M -Xms58880M -cp 
.:${CLASSPATH}:./*:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$MR2_CLASSPATH:./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/735-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*:
 org.apache.giraph.yarn.GiraphYarnTask 1439560605156 53 3 1 
1>/var/log/hadoop-yarn/container/application_1439560605156_0053/container_1439560605156_0053_01_000003/task-3-stdout.log
 
2>/var/log/hadoop-yarn/container/application_1439560605156_0053/container_1439560605156_0053_01_000003/task-3-stderr.log
  

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

{noformat}
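As a rough check of the numbers in the dump above, the resident set size can be recomputed from the RSSMEM_USAGE(PAGES) column and compared with the configured heap. This is only a sketch: the 4 KiB page size is an assumption (typical for Linux on x86-64, not stated in the log), and the class and method names are made up for illustration.

```java
import java.util.Locale;

/** Recomputes memory figures from the YARN process-tree dump (illustrative names). */
final class ContainerMemoryCheck {
    /** Assumed page size in bytes; the log itself does not state it. */
    static final long ASSUMED_PAGE_BYTES = 4096L;
    static final double BYTES_PER_GIB = 1024.0 * 1024.0 * 1024.0;

    /** Converts a resident page count into GiB under the assumed page size. */
    static double rssGib(long rssPages) {
        return (rssPages * ASSUMED_PAGE_BYTES) / BYTES_PER_GIB;
    }

    /** Converts a heap size given in MB (as passed to -Xmx) into GiB. */
    static double heapGib(long heapMb) {
        return heapMb / 1024.0;
    }

    public static void main(String[] args) {
        long rssPages = 15176127L; // java process, RSSMEM_USAGE(PAGES) column
        long heapMb = 58880L;      // -Xmx58880M, from giraph.yarn.task.heap.mb
        System.out.println(String.format(Locale.ROOT, "RSS  = %.1f GiB", rssGib(rssPages)));
        System.out.println(String.format(Locale.ROOT, "heap = %.1f GiB", heapGib(heapMb)));
    }
}
```

Under that assumption the RSS comes out at about 57.9 GiB, which matches the reported usage, while the heap alone (58880 MB) is already 57.5 GiB, the same as the container limit. So any memory the JVM uses outside the heap may be enough to push the process over the limit.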
The log container_1439560605156_0053_01_000003/task-3-stderr.log ends here 
without any error:
{noformat}
15/08/18 10:12:40 INFO yarn.GiraphYarnTask: [STATUS: task-1] startSuperstep: 
ALL_EXCEPT_ZOOKEEPER - Attempt=0, Superstep=0
15/08/18 10:12:40 INFO netty.NettyClient: connectAllAddresses: Successfully 
added 0 connections, (0 total connected) 0 failed, 0 failures total.
15/08/18 10:12:40 INFO ooc.DiskBackedPartitionStore: getPartition: start 
reading partition 16 from disk
15/08/18 10:12:50 INFO ooc.CheckMemoryCallable: call: Memory is very limited 
now. Calling GC manually. freeMemory = 15121.87MB
{noformat}
Any idea what we did wrong?
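For reference, the free memory reported in the last log line can be compared with the configured fractions. The sketch below assumes that the fraction is measured against the configured heap size (58880 MB); how CheckMemoryCallable actually computes it is not shown in this log, so treat this as an approximation. The class and method names are illustrative only.

```java
import java.util.Locale;

/** Compares the logged free memory with a configured threshold (illustrative names). */
final class FreeMemoryFraction {
    /** Fraction of the heap that is free, both values in MB. */
    static double freeFraction(double freeMb, double heapMb) {
        return freeMb / heapMb;
    }

    /** True if the free fraction is below the given threshold. */
    static boolean belowThreshold(double freeMb, double heapMb, double threshold) {
        return freeFraction(freeMb, heapMb) < threshold;
    }

    public static void main(String[] args) {
        double freeMb = 15121.87; // freeMemory from the CheckMemoryCallable log line
        double heapMb = 58880.0;  // giraph.yarn.task.heap.mb (assumed denominator)
        double low = 0.27;        // giraph.lowFreeMemoryFraction from the command line
        System.out.println(String.format(Locale.ROOT,
            "free fraction = %.4f, below low threshold: %b",
            freeFraction(freeMb, heapMb), belowThreshold(freeMb, heapMb, low)));
    }
}
```

With these inputs the free fraction is roughly 0.257, just under the 0.27 low threshold, which would be consistent with the "Memory is very limited" message even though about 15 GB of heap is still free.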

We found that with 12 partitions per worker, 2-3 partitions are in memory 
at the end of Superstep -1, so we thought we should specify 
*-Dgiraph.maxPartitionsInMemory=2* to avoid the OOM error (this is still 
possible, isn't it?). With this option set, we get:
{noformat}
java.lang.NullPointerException
        at 
org.apache.giraph.comm.requests.SendWorkerVerticesRequest.doRequest(SendWorkerVerticesRequest.java:111)
        at 
org.apache.giraph.comm.netty.handler.WorkerRequestServerHandler.processRequest(WorkerRequestServerHandler.java:62)
{noformat}
oocEngine is null (because we set giraph.maxPartitionsInMemory), but 
oocEngine.isSpilling() is still invoked. This looks like a bug.
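
The failure mode can be illustrated in isolation. The following is a minimal sketch, not Giraph's actual code: `OocEngine`, `VertexRequestHandler`, and both `shouldThrottle` methods are hypothetical stand-ins, and only the `isSpilling()` call on a null engine is taken from the stack trace above.

```java
/** Hypothetical stand-in for the out-of-core engine; only isSpilling() is from the report. */
interface OocEngine {
    boolean isSpilling();
}

/** Hypothetical request handler that may be constructed without an engine. */
final class VertexRequestHandler {
    /** Null when the adaptive out-of-core engine is not created. */
    private final OocEngine oocEngine;

    VertexRequestHandler(OocEngine oocEngine) {
        this.oocEngine = oocEngine;
    }

    /** Mirrors the reported behavior: throws NullPointerException if the engine is null. */
    boolean shouldThrottleUnguarded() {
        return oocEngine.isSpilling();
    }

    /** Guarded variant: a missing engine is treated as "not spilling". */
    boolean shouldThrottle() {
        return oocEngine != null && oocEngine.isSpilling();
    }

    public static void main(String[] args) {
        VertexRequestHandler handler = new VertexRequestHandler(null);
        System.out.println("guarded: " + handler.shouldThrottle());
        try {
            handler.shouldThrottleUnguarded();
        } catch (NullPointerException e) {
            System.out.println("unguarded call failed with NullPointerException");
        }
    }
}
```

Whether a null check like this is the right fix, or whether the engine should always be created, is for the maintainers to decide; the sketch only shows why the call fails when the engine is absent.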

Looking forward to your reply! 


> New Out-of-core mechanism does not work
> ---------------------------------------
>
>                 Key: GIRAPH-1026
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-1026
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 1.2.0-SNAPSHOT
>            Reporter: Max Garmash
>
> After the new OOC mechanism was released, we tried to test it on our data and 
> it failed.
> Our environment:
> 4x (CPU 6 cores / 12 threads, RAM 64GB) 
> We can successfully process about 75 million vertices. 
> With 100-120M vertices it fails like this:
> {noformat}
> 2015-08-04 12:35:21,000 INFO  [AMRM Callback Handler Thread] 
> yarn.GiraphApplicationMaster 
> (GiraphApplicationMaster.java:onContainersCompleted(574)) - Got container 
> status for containerID=container_1438068521412_0193_01_000005, 
> state=COMPLETE, exitStatus=-104, diagnostics=Container 
> [pid=6700,containerID=container_1438068521412_0193_01_000005] is running 
> beyond physical memory limits. Current usage: 20.3 GB of 20 GB physical 
> memory used; 22.4 GB of 42 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1438068521412_0193_01_000005 :
>       |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>       |- 6704 6700 6700 6700 (java) 78760 20733 24033841152 5317812 java 
> -Xmx20480M -Xms20480M -cp 
> .:${CLASSPATH}:./*:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$MR2_CLASSPATH:./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/264-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*::./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/264-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*::./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/264-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoo
p-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*:
>  org.apache.giraph.yarn.GiraphYarnTask 1438068521412 193 5 1 
>       |- 6700 6698 6700 6700 (bash) 0 0 14376960 433 /bin/bash -c java 
> -Xmx20480M -Xms20480M -cp 
> .:${CLASSPATH}:./*:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$MR2_CLASSPATH:./*:/etc/hadoop/conf.cloudera.yarn:/run/cloudera-scm-agent/process/264-yarn-NODEMANAGER:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/*:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-mapreduce/lib/*:
>  org.apache.giraph.yarn.GiraphYarnTask 1438068521412 193 5 1 
> 1>/var/log/hadoop-yarn/container/application_1438068521412_0193/container_1438068521412_0193_01_000005/task-5-stdout.log
>  
> 2>/var/log/hadoop-yarn/container/application_1438068521412_0193/container_1438068521412_0193_01_000005/task-5-stderr.log
>   
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
> {noformat}
> Logs from container
> {noformat}
> 2015-08-04 12:34:51,258 INFO  [netty-server-worker-4] handler.RequestDecoder 
> (RequestDecoder.java:channelRead(74)) - decode: Server window metrics 
> MBytes/sec received = 12.5315, MBytesReceived = 380.217, ave received req 
> MBytes = 0.007, secs waited = 30.34
> 2015-08-04 12:35:16,258 INFO  [check-memory] ooc.CheckMemoryCallable 
> (CheckMemoryCallable.java:call(221)) - call: Memory is very limited now. 
> Calling GC manually. freeMemory = 924.27MB
> {noformat}
> We are running our job like this:
> {noformat}
> hadoop jar giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.6.0-cdh5.4.4-jar-with-dependencies.jar \
>  org.apache.giraph.GiraphRunner \
>  -Dgiraph.yarn.task.heap.mb=20480 \
>  -Dgiraph.isStaticGraph=true \
>  -Dgiraph.useOutOfCoreGraph=true \
>  -Dgiraph.logLevel=info \
>  -Dgiraph.weightedPageRank.superstepCount=5 \
>  ru.isys.WeightedPageRankComputation \
>  -vif ru.isys.CrawlerInputFormat \
>  -vip /tmp/bigdata/input \
>  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
>  -op /tmp/giraph \
>  -w 6 \
>  -yj giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.6.0-cdh5.4.4-jar-with-dependencies.jar
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
