Re: YARN container killed as running beyond memory limits

2015-06-20 Thread Drake민영근
Hi,

You should disable the vmem check. See this:
http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
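
A minimal yarn-site.xml sketch for the NodeManagers, covering both the
check-disable route and the ratio bump suggested in the quoted reply below
(the property names are the standard YARN ones; the value 4 is only
illustrative, so pick one of the two options and restart the NodeManagers):

<!-- Option 1: turn the virtual memory check off entirely -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

<!-- Option 2: keep the check, but allow more virtual memory per unit of
     physical memory (the default ratio is 2.1) -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>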


Thanks.

On Wednesday, June 17, 2015, Naganarasimha G R (Naga) wrote:

>  Hi,
> From the logs it's pretty clear it's due to
> *"Current usage: 576.2 MB of 2 GB physical memory used; 4.2 GB of 4.2 GB
> virtual memory used. Killing container."*
> Please increase the value of yarn.nodemanager.vmem-pmem-ratio from the
> default of 2.1 to something like 4 or 8, based on your app and system.
>
>  + Naga
> --
> *From:* Arbi Akhina [arbi.akh...@gmail.com]
> *Sent:* Wednesday, June 17, 2015 17:19
> *To:* user@hadoop.apache.org
> 
> *Subject:* YARN container killed as running beyond memory limits
>
>   Hi, I have a YARN application that submits containers. In the
> ApplicationMaster logs I see that the container is killed. Here are the logs:
>
>  Jun 17, 2015 1:31:27 PM com.heavenize.modules.RMCallbackHandler 
> onContainersCompleted
> INFO: container 'container_1434471275225_0007_01_02' status is 
> ContainerStatus: [ContainerId: container_1434471275225_0007_01_02, State: 
> COMPLETE, Diagnostics: Container 
> [pid=4069,containerID=container_1434471275225_0007_01_02] is running 
> beyond virtual memory limits. Current usage: 576.2 MB of 2 GB physical memory 
> used; 4.2 GB of 4.2 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1434471275225_0007_01_02 :
>   |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>   |- 4094 4093 4069 4069 (java) 2932 94 2916065280 122804 
> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms512m -Xmx2048m 
> -XX:MaxPermSize=250m -XX:+UseConcMarkSweepGC 
> -Dosmoze.path=/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/container_1434471275225_0007_01_02/Osmoze
>  -Dspring.profiles.active=webServer -jar 
> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/container_1434471275225_0007_01_02/heavenize-modules.jar
>   |- 4093 4073 4069 4069 (sh) 0 0 4550656 164 /bin/sh 
> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/container_1434471275225_0007_01_02/startup.sh
>   |- 4073 4069 4069 4069 (java) 249 34 1577267200 24239 
> /usr/lib/jvm/java-7-openjdk-amd64/bin/java 
> com.heavenize.yarn.task.ModulesManager -containerId 
> container_1434471275225_0007_01_02 -port 5369 -exe 
> hdfs://hadoop-server/user/hadoop/heavenize/heavenize-modules.jar -conf 
> hdfs://hadoop-server/user/hadoop/heavenize/config.zip
>   |- 4069 1884 4069 4069 (bash) 0 0 12730368 304 /bin/bash -c 
> /usr/lib/jvm/java-7-openjdk-amd64/bin/java 
> com.heavenize.yarn.task.ModulesManager -containerId 
> container_1434471275225_0007_01_02 -port 5369 -exe 
> hdfs://hadoop-server/user/hadoop/heavenize/heavenize-modules.jar -conf 
> hdfs://hadoop-server/user/hadoop/heavenize/config.zip 1> 
> /usr/local/hadoop/logs/userlogs/application_1434471275225_0007/container_1434471275225_0007_01_02/stdout
>  2> 
> /usr/local/hadoop/logs/userlogs/application_1434471275225_0007/container_1434471275225_0007_01_02/stderr
>
>
>  I don't see any memory excess; any idea where this error comes from?
>  There are no errors in the container, it just stops logging as a result of
> being killed.
>


-- 
Drake 민영근 Ph.D
kt NexR


HDFS Short-Circuit Local Reads

2015-06-20 Thread Dejan Menges
Hi,

We have been using HDP 2.1 for quite some time now (still, until Monday), and
SC local reads have been enabled the whole time. In the beginning, we followed
the Hortonworks recommendations and set the SC cache size to 256, with the
default 5 minutes to invalidate entries, and that's where the problems started.

At some point we started using multigets. After a very short time they started
timing out on our side. We played with different timeouts, and Graphite (the
hbase.regionserver.RegionServer.get_mean metric) showed that the load on three
nodes out of all of them had increased drastically. Looking into logs,
googling, and going through the documentation over and over again, we found a
discussion saying the SC cache should be no lower than 4096. After setting it
to 4096, our problem was solved. For some time.
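
For reference, these are the client-side knobs in question as they would look
in hdfs-site.xml on the RegionServers (the socket path below is just the usual
example; the cache size and expiry are the values discussed in this thread, and
the defaults are 256 entries and 300000 ms):

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
<!-- number of cached short-circuit file descriptors -->
<property>
  <name>dfs.client.read.shortcircuit.streams.cache.size</name>
  <value>4096</value>
</property>
<!-- how long an unused descriptor stays in the cache, in milliseconds -->
<property>
  <name>dfs.client.read.shortcircuit.streams.cache.expiry.ms</name>
  <value>300000</value>
</property>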

At some point our data usage patterns changed, and since we already had
monitoring for this, we saw multigets start timing out again; monitoring showed
they were timing out on two nodes where the number of open sockets was ~3-4k
per node, while on all the others it was 400-500. Narrowing this down a little,
we found some strangely oversized regions, did some splitting and some manual
merges, and HBase redistributed them, but the issue was still there. And then I
found the next three things (here come the questions):

- With a cache size of 4096 and the 5-minute cache expiry timeout, we saw
this error in the logs exactly every ten minutes:

2015-06-18 14:26:07,093 WARN org.apache.hadoop.hdfs.BlockReaderLocal: error
creating DomainSocket
2015-06-18 14:26:07,093 WARN
org.apache.hadoop.hdfs.client.ShortCircuitCache:
ShortCircuitCache(0x3d1dc8c9): failed to load
1109699858_BP-1988583858-172.22.5.40-1424448407690
--
2015-06-18 14:36:07,135 WARN org.apache.hadoop.hdfs.BlockReaderLocal: error
creating DomainSocket
2015-06-18 14:36:07,136 WARN
org.apache.hadoop.hdfs.client.ShortCircuitCache:
ShortCircuitCache(0x3d1dc8c9): failed to load
1109704764_BP-1988583858-172.22.5.40-1424448407690
--
2015-06-18 14:46:07,137 WARN org.apache.hadoop.hdfs.BlockReaderLocal: error
creating DomainSocket
2015-06-18 14:46:07,138 WARN
org.apache.hadoop.hdfs.client.ShortCircuitCache:
ShortCircuitCache(0x3d1dc8c9): failed to load
1105787899_BP-1988583858-172.22.5.40-1424448407690

- After increasing the SC cache to 8192 (since on the couple of nodes that were
getting up to 5-7k open sockets, 4096 obviously wasn't enough):
  - Our multigets no longer take 20-30 seconds but again complete within 5
seconds, which is our client timeout.
  - netstat -tanlp | grep -c 50010 now shows ~2800 open local SC sockets per
node.

Why would those errors be logged exactly every 10 minutes with a 4096 cache
size and a 5-minute expiry timeout?

Why would increasing the SC cache also 'balance' the number of open SC sockets
across all nodes?

Am I right that hbase.regionserver.RegionServer.get_mean shows the mean number
of gets per unit of time, and not the time needed to perform a get? If I'm
right, increasing this made gets faster in our case. If I'm wrong, it made gets
slower, but then it sped up our multigets, which has been twisting my brain
after narrowing this down for a week.

How should the cache size and the expiry timeout relate to each other?

Thanks a lot!


accessing hadoop job history

2015-06-20 Thread mehdi benchoufi
Hi,

I am new to Hadoop, and when I run

hadoop job -history output

I get this

Ignore unrecognized file: output
Exception in thread "main" java.io.IOException: Unable to initialize
History Viewer
at
org.apache.hadoop.mapreduce.jobhistory.HistoryViewer.<init>(HistoryViewer.java:90)
at org.apache.hadoop.mapreduce.tools.CLI.viewHistory(CLI.java:487)
at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:330)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1237)
Caused by: java.io.IOException: Unable to initialize History Viewer
at
org.apache.hadoop.mapreduce.jobhistory.HistoryViewer.<init>(HistoryViewer.java:84)
... 5 more


I checked the logs (the history server logs,
`mapred-*username*-historyserver-**.local.log`), and they are empty. How can I
solve this?

Best regards,
Mehdi