Hello, I am getting following error when running on 500MB dataset compressed in avro data format.
Container [pid=22961,containerID=container_1409834588043_0080_01_000010] is running beyond virtual memory limits. Current usage: 636.6 MB of 1 GB physical memory used; 2.1 GB of 2.1 GB virtual memory used. Killing container. Dump of the process-tree for container_1409834588043_0080_01_000010 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 22961 16896 22961 22961 (bash) 0 0 9424896 312 /bin/bash -c /usr/java/default/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx768m -Djava.io.tmpdir=/home/hadoop/yarn/local/usercache/jobsubmit/appcache/application_1409834588043_0080/container_1409834588043_0080_01_000010/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_000010 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 153.87.47.116 47184 attempt_1409834588043_0080_r_000000_0 10 1>/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_000010/stdout 2>/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_000010/stderr |- 22970 22961 22961 22961 (java) 24692 1165 2256662528 162659 /usr/java/default/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx768m -Djava.io.tmpdir=/home/hadoop/yarn/local/usercache/jobsubmit/appcache/application_1409834588043_0080/container_1409834588043_0080_01_000010/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_000010 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 153.87.47.116 47184 attempt_1409834588043_0080_r_000000_0 10 Container killed on request. Exit code is 143 I have read a lot about hadoop yarn memory settings but seems that something basic I am missing in my understanding of how yarn and MR2 works. I have pretty small testing cluster of 5 machines, 2nn and 3dn with following parameters set # hadoop - yarn-site.xml yarn.nodemanager.resource.memory-mb : 2048 yarn.scheduler.minimum-allocation-mb : 256 yarn.scheduler.maximum-allocation-mb : 2048 # hadoop - mapred-site.xml mapreduce.map.memory.mb : 768 mapreduce.map.java.opts : -Xmx512m mapreduce.reduce.memory.mb : 1024 mapreduce.reduce.java.opts : -Xmx768m mapreduce.task.io.sort.mb : 100 yarn.app.mapreduce.am.resource.mb : 1024 yarn.app.mapreduce.am.command-opts : -Xmx768m I understand the mathematics here for the parameters but what I do not understand is: Does your containers need to grow with the size of your dataset? e.g. setting of mapreduce.map.memory.mb and mapreduce.map.java.opts on per job basis? My reducer doesn't cache any data, it is simply in -> out just categorize data to multiple outputs as follows using AvroMultipleOutputs() @Override public void reduce(Text key, Iterable<AvroValue<PosData>> values, Context context) throws IOException, InterruptedException { try { log.info("Processing key {}", key.toString()); final StoreIdDob storeIdDob = separateKey(key); log.info("Processing DOB {}, SotoreId {}", storeIdDob.getDob(), storeIdDob.getStoreId()); int size = 0; Output out; String path; if (storeIdDob.getDob() != null && isValidDOB(storeIdDob.getDob()) && storeIdDob.getStoreId() != null && !storeIdDob.getStoreId().isEmpty()) { // reasonable data if (isHistoricalDOB(storeIdDob.getDob())) { out = Output.HISTORY; } else { out = Output.ACTUAL; } path = out.getKey() + "/" + storeIdDob.getDob() + "/" + storeIdDob.getStoreId(); } else { // error data out = Output.ERROR; path = out.getKey() + "/" + "part"; } for (AvroValue<PosData> posData : values) { amos.write(out.getKey(), new AvroKey<PosData> (posData.datum()), null, path); } } catch (Exception e) { log.error("Error on reducer ", e); //TODO audit log :-) } } Do I need to grow the container size with size of the dataset? That seems to me odd and I did expect that is what MR is for. Or am I missing some settings which decides the size of data chunks? Thx Jakub