Re: YARN container killed as running beyond memory limits
Hi, you should disable the vmem check. See this: http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/

Thanks.

On Wednesday, June 17, 2015, Naganarasimha G R (Naga) wrote:

> Hi,
> From the logs it is pretty clear it is due to:
>
> *"Current usage: 576.2 MB of 2 GB physical memory used; 4.2 GB of 4.2 GB
> virtual memory used. Killing container."*
>
> Please increase the value of yarn.nodemanager.vmem-pmem-ratio from the
> default value 2 to something like 4 or 8, based on your app and system.
>
> + Naga
> --
> *From:* Arbi Akhina [arbi.akh...@gmail.com]
> *Sent:* Wednesday, June 17, 2015 17:19
> *To:* user@hadoop.apache.org
> *Subject:* YARN container killed as running beyond memory limits
>
> Hi, I have a YARN application that submits containers. In the
> ApplicationMaster logs I see that the container is killed. Here are the logs:
>
> Jun 17, 2015 1:31:27 PM com.heavenize.modules.RMCallbackHandler onContainersCompleted
> INFO: container 'container_1434471275225_0007_01_02' status is
> ContainerStatus: [ContainerId: container_1434471275225_0007_01_02, State:
> COMPLETE, Diagnostics: Container
> [pid=4069,containerID=container_1434471275225_0007_01_02] is running
> beyond virtual memory limits. Current usage: 576.2 MB of 2 GB physical memory
> used; 4.2 GB of 4.2 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1434471275225_0007_01_02:
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 4094 4093 4069 4069 (java) 2932 94 2916065280 122804 /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms512m -Xmx2048m -XX:MaxPermSize=250m -XX:+UseConcMarkSweepGC -Dosmoze.path=/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/container_1434471275225_0007_01_02/Osmoze -Dspring.profiles.active=webServer -jar /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/container_1434471275225_0007_01_02/heavenize-modules.jar
> |- 4093 4073 4069 4069 (sh) 0 0 4550656 164 /bin/sh /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/container_1434471275225_0007_01_02/startup.sh
> |- 4073 4069 4069 4069 (java) 249 34 1577267200 24239 /usr/lib/jvm/java-7-openjdk-amd64/bin/java com.heavenize.yarn.task.ModulesManager -containerId container_1434471275225_0007_01_02 -port 5369 -exe hdfs://hadoop-server/user/hadoop/heavenize/heavenize-modules.jar -conf hdfs://hadoop-server/user/hadoop/heavenize/config.zip
> |- 4069 1884 4069 4069 (bash) 0 0 12730368 304 /bin/bash -c /usr/lib/jvm/java-7-openjdk-amd64/bin/java com.heavenize.yarn.task.ModulesManager -containerId container_1434471275225_0007_01_02 -port 5369 -exe hdfs://hadoop-server/user/hadoop/heavenize/heavenize-modules.jar -conf hdfs://hadoop-server/user/hadoop/heavenize/config.zip 1> /usr/local/hadoop/logs/userlogs/application_1434471275225_0007/container_1434471275225_0007_01_02/stdout 2> /usr/local/hadoop/logs/userlogs/application_1434471275225_0007/container_1434471275225_0007_01_02/stderr
>
> I don't see any memory excess; any idea where this error comes from?
> There are no errors in the container, it just stops logging as a result of
> being killed.

-- Drake 민영근 Ph.D
kt NexR
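For reference, the two suggestions in this thread map onto two NodeManager properties in yarn-site.xml. A sketch with illustrative values; pick one approach based on your cluster:

```
<!-- yarn-site.xml: illustrative values, adjust for your cluster -->

<!-- Option 1 (raising the ratio): allow more virtual memory per unit of
     physical memory allocated to a container (the stock default is 2.1) -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>

<!-- Option 2 (disabling the check): skip the virtual-memory check entirely;
     JVM vmem reservations are often large and a poor kill criterion -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```

With the container in the log above, 4.2 GB of virtual memory against 2 GB of physical memory is a ratio of 2.1, which is exactly the default limit, so either change would stop the kill.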
HDFS Short-Circuit Local Reads
Hi,

We have been using HDP 2.1 for quite some time now (still, until Monday), and SC local reads have been enabled the whole time. In the beginning we followed the Hortonworks recommendations and set the SC cache size to 256, with the default 5 minutes to invalidate entries, and that's where the problems started.

At some point we started using multigets. After a very short time they started timing out on our side. We played with different timeouts, and graphite (metric hbase.regionserver.RegionServer.get_mean) showed that the load on three nodes out of all of them had increased drastically. Looking into logs, googling, and going through the documentation over and over again, we found a discussion saying the SC cache should be no lower than 4096. After setting it to 4096, our problem was solved. For some time.

At some point our data usage patterns changed, and since we already had monitoring for this, multigets started timing out again; monitoring showed they were timing out on two nodes where the number of open sockets was ~3-4k per node, while on all the others it was 400-500. Narrowing this down a bit, we found some strangely large regions, did some splitting and some manual merges, and HBase redistributed them, but the issue was still there.
And then I found the next three things (here are the questions coming):

- With a cache size of 4096, and the 30ms cache expiry timeout, we saw this error in the logs exactly every ten minutes:

  2015-06-18 14:26:07,093 WARN org.apache.hadoop.hdfs.BlockReaderLocal: error creating DomainSocket
  2015-06-18 14:26:07,093 WARN org.apache.hadoop.hdfs.client.ShortCircuitCache: ShortCircuitCache(0x3d1dc8c9): failed to load 1109699858_BP-1988583858-172.22.5.40-1424448407690
  --
  2015-06-18 14:36:07,135 WARN org.apache.hadoop.hdfs.BlockReaderLocal: error creating DomainSocket
  2015-06-18 14:36:07,136 WARN org.apache.hadoop.hdfs.client.ShortCircuitCache: ShortCircuitCache(0x3d1dc8c9): failed to load 1109704764_BP-1988583858-172.22.5.40-1424448407690
  --
  2015-06-18 14:46:07,137 WARN org.apache.hadoop.hdfs.BlockReaderLocal: error creating DomainSocket
  2015-06-18 14:46:07,138 WARN org.apache.hadoop.hdfs.client.ShortCircuitCache: ShortCircuitCache(0x3d1dc8c9): failed to load 1105787899_BP-1988583858-172.22.5.40-1424448407690

- After increasing the SC cache to 8192 (since on those couple of nodes that were getting up to 5-7k sockets, 4096 obviously wasn't enough):
  - Our multigets no longer take 20-30 seconds but again complete within 5 seconds, which is our client timeout.
  - netstat -tanlp | grep -c 50010 now shows ~2800 open local SC sockets on every node.

Why would those errors be logged exactly every 10 minutes with a 4096 cache size and a 5-minute expiry timeout? Why would increasing the SC cache also 'balance' the number of open SC sockets across all nodes?

Am I right that hbase.regionserver.RegionServer.get_mean shows the mean number of gets per unit of time, not the time needed to perform a get? If I'm right, increasing the cache made gets faster in our case. If I'm wrong, it made gets slower, but then it sped up our multigets, which has been twisting my brain after narrowing this down for a week.

How should the cache size and expiry timeout correlate to each other?

Thanks a lot!
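For anyone following along: the knobs being tuned here are client-side HDFS short-circuit read settings (read by the RegionServer as an HDFS client). A sketch of the relevant hdfs-site.xml entries, with values matching the 8192 cache size described above; treat them as illustrative, not a recommendation:

```
<!-- hdfs-site.xml (client side): illustrative short-circuit read settings -->

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>

<!-- Domain socket shared with the DataNode -->
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>

<!-- Max number of cached short-circuit file descriptors per client;
     the value this thread ended up raising from 256 to 4096 to 8192 -->
<property>
  <name>dfs.client.read.shortcircuit.streams.cache.size</name>
  <value>8192</value>
</property>

<!-- How long an unused cached descriptor is kept, in milliseconds
     (300000 ms = the 5-minute default mentioned above) -->
<property>
  <name>dfs.client.read.shortcircuit.streams.cache.expiry.ms</name>
  <value>300000</value>
</property>
```

Roughly, the cache size needs to cover the number of distinct block replicas the client touches within one expiry window; when it is too small, descriptors are evicted and reopened constantly, which matches the socket-count and latency symptoms described in this thread.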
accessing hadoop job history
Hi,

I am new to Hadoop, and when I run `hadoop job -history output` I get this:

    Ignore unrecognized file: output
    Exception in thread "main" java.io.IOException: Unable to initialize History Viewer
        at org.apache.hadoop.mapreduce.jobhistory.HistoryViewer.<init>(HistoryViewer.java:90)
        at org.apache.hadoop.mapreduce.tools.CLI.viewHistory(CLI.java:487)
        at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:330)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1237)
    Caused by: java.io.IOException: Unable to initialize History Viewer
        at org.apache.hadoop.mapreduce.jobhistory.HistoryViewer.<init>(HistoryViewer.java:84)
        ... 5 more

I checked the logs (history server logs, `mapred-*username*-historyserver-**.local.log`), and they are empty. How can I solve it?

Best regards,
Mehdi
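One likely cause, sketched below: `hadoop job -history` expects a path to job history data on HDFS, not an arbitrary local name like `output`, so the viewer fails to initialize. The paths here are hypothetical placeholders for illustration; substitute your own job's output directory or .jhist file:

```shell
# Hypothetical paths; replace with your own job's locations.

# Classic form: point the history viewer at the job's HDFS output
# directory (the viewer looks for the _logs/history data written there):
hadoop job -history /user/mehdi/wordcount-out

# On Hadoop 2.x, history files can also live under the JobHistory
# Server's done-dir (mapreduce.jobhistory.done-dir); listing it helps
# find the .jhist file for your job:
hdfs dfs -ls -R /mr-history/done
```

If the job never wrote history to that location (for example, because the JobHistory Server was not running when the job finished), the viewer has nothing to load, which would also explain the empty history server logs.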