Job with only map tasks and map output still on local disk?
Hi, I noticed that when a MapReduce job has no reduce tasks, YARN writes the map output to HDFS. I would like the job to keep the map output on the local disk instead. In YARN, is it possible to have a MapReduce job that only executes map tasks (no reduce tasks), with the map tasks still saving their output to the local disk? Thanks,
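A minimal sketch of a map-only job, assuming the standard MapReduce API (the output path and the identity mapper below are placeholders, not from the original message). With zero reducers the map output goes straight to the job's OutputFormat, so pointing the output at a file:// URI keeps it on the local filesystem of whichever node runs each task rather than in HDFS:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyJob {

        // Identity mapper: the default Mapper.map() emits input records unchanged.
        public static class IdentityMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only example");
            job.setJarByClass(MapOnlyJob.class);
            job.setMapperClass(IdentityMapper.class);
            job.setNumReduceTasks(0);                 // map-only: no shuffle, no reducers
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // A local path (placeholder); each map task writes on the node it runs on.
            FileOutputFormat.setOutputPath(job, new Path("file:///tmp/map-output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }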
Is there a way to submit a job using the YARN REST API?
Hi, Is there a way to submit a job using the YARN REST API? Thanks,
Submit a MapReduce job to a remote YARN cluster
Hi, I would like to submit a MapReduce job to a remote YARN cluster. Can I do this in Java, or using a REST API? Thanks,
Steps for container release
Hi everyone, I was trying to understand the process that makes the resources of a container available again to the ResourceManager. As far as I can tell from the logs, the AM: - sends a stop request to the NodeManager for the specific container - then immediately tells the RM about the release of the resources, which become available (queues are re-sorted). Actually, I was expecting the RM to wait for an acknowledgment from the NM (through the NM-RM heartbeat) about the real end of the container, but it looks to me like the resources are made available upon receiving this info from the AM (AM-RM heartbeat). Maybe the container decommission time is so small as to be irrelevant? The logs are at INFO level, and I can't change it to DEBUG since I'm not the only one using the cluster, so maybe I am missing something. Thanks Fabio
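For reference, a rough sketch of the AM-side calls involved, assuming an ApplicationMaster that already holds initialized AMRMClient/NMClient instances and a running Container (the surrounding setup is omitted); the comments only restate the behavior observed above, not a guarantee of the scheduler's internals:

    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerStatus;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.NMClient;

    void releaseContainer(AMRMClient<AMRMClient.ContainerRequest> amrmClient,
                          NMClient nmClient, Container container) throws Exception {
        // 1. Ask the NodeManager to stop the container process.
        nmClient.stopContainer(container.getId(), container.getNodeId());

        // 2. Tell the ResourceManager the container is no longer needed; in the
        //    behavior described above, the RM frees the resources on the next
        //    AM-RM heartbeat (allocate call) without waiting for NM confirmation.
        amrmClient.releaseAssignedContainer(container.getId());

        // 3. Completion statuses eventually arrive back via allocate().
        for (ContainerStatus status :
                amrmClient.allocate(0.5f).getCompletedContainersStatuses()) {
            System.out.println("Completed: " + status.getContainerId()
                + " exit=" + status.getExitStatus());
        }
    }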
Re: Is there a way to submit a job using the YARN REST API?
Please take a look at https://issues.apache.org/jira/browse/MAPREDUCE-5874 Cheers On Feb 20, 2015, at 3:11 AM, xeonmailinglist xeonmailingl...@gmail.com wrote: Hi, Is there a way to submit a job using the YARN REST API? Thanks,
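As context for that JIRA: the ResourceManager REST API does expose generic application submission (POST /ws/v1/cluster/apps/new-application followed by POST /ws/v1/cluster/apps), but there is no MapReduce-specific REST submission, which is what MAPREDUCE-5874 tracks. A very rough sketch of the two calls, assuming the default RM web port and with the JSON body reduced to a skeleton (the application-id placeholder and the am-container-spec must be filled in from the first response for a real submission):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Scanner;

    public class YarnRestSubmitSketch {
        public static void main(String[] args) throws Exception {
            String rm = "http://resourcemanager-host:8088";   // placeholder host

            // Step 1: ask the RM for a new application id.
            HttpURLConnection newApp = (HttpURLConnection)
                new URL(rm + "/ws/v1/cluster/apps/new-application").openConnection();
            newApp.setRequestMethod("POST");
            newApp.setDoOutput(true);
            System.out.println(
                new Scanner(newApp.getInputStream()).useDelimiter("\\A").next());

            // Step 2: submit an ApplicationSubmissionContext as JSON (skeleton only).
            String body = "{ \"application-id\": \"<id from step 1>\", "
                        + "\"application-name\": \"rest-example\", "
                        + "\"am-container-spec\": { }, "
                        + "\"application-type\": \"YARN\" }";
            HttpURLConnection submit = (HttpURLConnection)
                new URL(rm + "/ws/v1/cluster/apps").openConnection();
            submit.setRequestMethod("POST");
            submit.setRequestProperty("Content-Type", "application/json");
            submit.setDoOutput(true);
            try (OutputStream out = submit.getOutputStream()) {
                out.write(body.getBytes("UTF-8"));
            }
            System.out.println("HTTP " + submit.getResponseCode());
        }
    }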
BLOCK and Split size question
Hello everyone, I have a couple of doubts; can anyone please point me in the right direction? 1) What exactly happens when I copy a 1 TB file to the Hadoop cluster using the copyFromLocal command? What will the split size be? Will it be the same as the block size? 2) What is a block and what is a split? If we have a 100 MB file and a block size of 64 MB, as we know it will be divided into 2 blocks of 64 MB and 36 MB. The second block still has 28 MB of space left; what will happen to that free space? Will the cluster have unequal block sizes, or will it be occupied by another file? 3) Let's say a 64 MB block is on node A and replicated on 2 other nodes (B, C), and the input split size for the MapReduce program is 64 MB. Will this split just have the location of node A, or will it have the locations of all three nodes A, B, C? 4) How is it handled if the input split size is greater or smaller than the block size? Can anyone please help? Thanks SP
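On question 4, a small illustration (not from the original thread) of how the split size can be steered independently of the block size with the standard FileInputFormat knobs; the sizes below are arbitrary examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-size example");
            // Splits are a logical division of the input; blocks are the physical
            // HDFS unit. By default the split size equals the block size, but it
            // can be bounded explicitly:
            FileInputFormat.setMinInputSplitSize(job, 128 * 1024 * 1024L); // lower bound
            FileInputFormat.setMaxInputSplitSize(job, 256 * 1024 * 1024L); // upper bound
            // A split larger than a block simply spans several blocks, so its
            // preferred locations list the nodes holding those blocks.
        }
    }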
How to tune a Hadoop cluster from an administrator's perspective
Hi, how do I tune a Hadoop cluster from an administrator's perspective? What parameters should we consider, etc.? What should I look at for performance tuning? Thanks Krish
Scheduling in YARN according to available resources
I have 7 nodes in my Hadoop cluster [8 GB RAM and 4 VCPUs on each node], 1 NameNode + 6 DataNodes. I followed the link from Hortonworks [ http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html ], made calculations according to the hardware configuration of my nodes, and added the updated mapred-site.xml and yarn-site.xml files to my question. Still my application is crashing with the same exception.

My MapReduce application has 34 input splits with a block size of 128 MB.

**mapred-site.xml** has the following properties:
mapreduce.framework.name = yarn
mapred.child.java.opts = -Xmx2048m
mapreduce.map.memory.mb = 4096
mapreduce.map.java.opts = -Xmx2048m

**yarn-site.xml** has the following properties:
yarn.resourcemanager.hostname = hadoop-master
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.nodemanager.resource.memory-mb = 6144
yarn.scheduler.minimum-allocation-mb = 2048
yarn.scheduler.maximum-allocation-mb = 6144

Exception from container-launch: ExitCodeException exitCode=134:
/bin/bash: line 1: 3876 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8192m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1424264025191_0002/container_1424264025191_0002_01_11/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.12 50842 attempt_1424264025191_0002_m_05_0 11 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stdout 2 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stderr

How can I avoid this? Any help is appreciated. It looks to me that YARN is trying to launch all the containers simultaneously and not according to the available resources. Is there an option to restrict the number of containers on Hadoop nodes? Regards, Tariq
Get method in Writable
Am I able to get the values from a Writable of a previous job? I.e., I have 2 MR jobs.

MR 1: I need to pass 3 elements as values from the reducer, and the key is NullWritable. So I created a custom Writable class to achieve this:

    public class TreeInfoWritable implements Writable {
        DoubleWritable entropy;
        IntWritable sum;
        IntWritable clsCount;
        ...
    }

MR 2: I need to access the MR 1 result in the MR 2 mapper setup function, and I accessed it as a distributed cache (small file). Is there a way to get those values using get methods?

    while ((setupData = bf.readLine()) != null) {
        System.out.println("Setup Line " + setupData);
        TreeInfoWritable info = // something I can pass to TreeInfoWritable and get values from
        DoubleWritable entropy = info.getEntropy();
        System.out.println("entropy: " + entropy);
    }

I tried converting the Writable to Gson format. In MR 1:

    Gson gson = new Gson();
    String emitVal = gson.toJson(valEmit);
    context.write(out, new Text(emitVal));

But the parsing cannot be done in MR 2:

    TreeInfoWritable info = gson.toJson(setupData, TreeInfoWritable.class);
    Error: Type mismatch: cannot convert from String to TreeInfoWritable

Once it is changed to a String we cannot get the values. Is there a workaround for this, or should I just use POJO classes instead of Writable? I'm afraid that would become slower, as we would be depending on Java serialization instead of Hadoop's serializable classes.
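One possible workaround, sketched here rather than taken from the thread: keep the Writable for the MapReduce wire format, but round-trip the values through a plain POJO when going via JSON, since Gson's fromJson (not toJson) is the call that parses a string. The class and field names below mirror the question but are otherwise illustrative:

    import com.google.gson.Gson;

    public class TreeInfoJson {
        // Plain fields so Gson can (de)serialize them without Writable wrappers.
        double entropy;
        int sum;
        int clsCount;

        public double getEntropy() { return entropy; }

        public static void main(String[] args) {
            Gson gson = new Gson();

            // MR1 reducer side: serialize before context.write(key, new Text(json)).
            TreeInfoJson out = new TreeInfoJson();
            out.entropy = 0.42; out.sum = 10; out.clsCount = 3;
            String json = gson.toJson(out);

            // MR2 mapper setup side: parse each cached line back into the POJO.
            TreeInfoJson info = gson.fromJson(json, TreeInfoJson.class);
            System.out.println("entropy: " + info.getEntropy());
        }
    }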
secure checksum in HDFS
Hi, Is it possible to use SHA-256 or MD5 as the checksum for a file in HDFS? Thanks,
YARN container launch failed exception and mapred-site.xml configuration
I have 7 nodes in my Hadoop cluster [8 GB RAM and 4 VCPUs on each node], 1 NameNode + 6 DataNodes. **EDIT-1 @ARNON:** I followed the link, made calculations according to the hardware configuration of my nodes, and have added the updated mapred-site.xml and yarn-site.xml files to my question. Still my application is crashing with the same exception.

My MapReduce application has 34 input splits with a block size of 128 MB.

**mapred-site.xml** has the following properties:
mapreduce.framework.name = yarn
mapred.child.java.opts = -Xmx2048m
mapreduce.map.memory.mb = 4096
mapreduce.map.java.opts = -Xmx2048m

**yarn-site.xml** has the following properties:
yarn.resourcemanager.hostname = hadoop-master
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.nodemanager.resource.memory-mb = 6144
yarn.scheduler.minimum-allocation-mb = 2048
yarn.scheduler.maximum-allocation-mb = 6144

Exception from container-launch: ExitCodeException exitCode=134:
/bin/bash: line 1: 3876 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8192m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1424264025191_0002/container_1424264025191_0002_01_11/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.12 50842 attempt_1424264025191_0002_m_05_0 11 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stdout 2 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stderr

How can I avoid this? Any help is appreciated. Is there an option to restrict the number of containers on Hadoop nodes?
Re: Scheduling in YARN according to available resources
I had a very similar issue; I changed to and used the Oracle JDK. There is nothing I see wrong with your configuration at first look, thanks. Regards, Nair

On Sat, Feb 21, 2015 at 1:42 AM, tesm...@gmail.com tesm...@gmail.com wrote:

I have 7 nodes in my Hadoop cluster [8 GB RAM and 4 VCPUs on each node], 1 NameNode + 6 DataNodes. I followed the link from Hortonworks [ http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html ], made calculations according to the hardware configuration of my nodes, and added the updated mapred-site.xml and yarn-site.xml files to my question. Still my application is crashing with the same exception.

My MapReduce application has 34 input splits with a block size of 128 MB.

**mapred-site.xml** has the following properties:
mapreduce.framework.name = yarn
mapred.child.java.opts = -Xmx2048m
mapreduce.map.memory.mb = 4096
mapreduce.map.java.opts = -Xmx2048m

**yarn-site.xml** has the following properties:
yarn.resourcemanager.hostname = hadoop-master
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.nodemanager.resource.memory-mb = 6144
yarn.scheduler.minimum-allocation-mb = 2048
yarn.scheduler.maximum-allocation-mb = 6144

Exception from container-launch: ExitCodeException exitCode=134:
/bin/bash: line 1: 3876 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8192m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1424264025191_0002/container_1424264025191_0002_01_11/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.12 50842 attempt_1424264025191_0002_m_05_0 11 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stdout 2 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stderr

How can I avoid this? Any help is appreciated. It looks to me that YARN is trying to launch all the containers simultaneously and not according to the available resources. Is there an option to restrict the number of containers on Hadoop nodes? Regards, Tariq

-- Warmest Regards, Ravi Shankar
Re: Encryption At Rest Question
In the case of an SSL-enabled cluster, the DEK is encrypted on the wire by the SSL layer. In the case of a non-SSL-enabled cluster, it is not. But the interceptor only gets the DEK and not the encrypted data, so the data is still safe. Only if the interceptor also manages to gain access to the encrypted data block and associate it with the corresponding DEK is the data compromised. Given that each HDFS file has a different DEK, the interceptor has to gain quite a bit of access before the data is compromised.

On 18 February 2015 at 00:04, Plamen Jeliazkov plamen.jeliaz...@wandisco.com wrote: Hey guys, I had a question about the new file encryption work done primarily in HDFS-6134. I was just curious, how is the DEK protected on the wire? Particularly after the KMS decrypts the EDEK and returns it to the client. Thanks, -Plamen

-- Regards, Ranadip Chatterjee
Re: Scheduling in YARN according to available resources
Thanks for your answer, Nair. Is installing the Oracle JDK on Ubuntu as complicated as described in this link? http://askubuntu.com/questions/56104/how-can-i-install-sun-oracles-proprietary-java-jdk-6-7-8-or-jre Is there an alternative? Regards

On Sat, Feb 21, 2015 at 6:50 AM, R Nair ravishankar.n...@gmail.com wrote: I had a very similar issue; I changed to and used the Oracle JDK. There is nothing I see wrong with your configuration at first look, thanks. Regards, Nair

On Sat, Feb 21, 2015 at 1:42 AM, tesm...@gmail.com tesm...@gmail.com wrote:

I have 7 nodes in my Hadoop cluster [8 GB RAM and 4 VCPUs on each node], 1 NameNode + 6 DataNodes. I followed the link from Hortonworks [ http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html ], made calculations according to the hardware configuration of my nodes, and added the updated mapred-site.xml and yarn-site.xml files to my question. Still my application is crashing with the same exception.

My MapReduce application has 34 input splits with a block size of 128 MB.

**mapred-site.xml** has the following properties:
mapreduce.framework.name = yarn
mapred.child.java.opts = -Xmx2048m
mapreduce.map.memory.mb = 4096
mapreduce.map.java.opts = -Xmx2048m

**yarn-site.xml** has the following properties:
yarn.resourcemanager.hostname = hadoop-master
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.nodemanager.resource.memory-mb = 6144
yarn.scheduler.minimum-allocation-mb = 2048
yarn.scheduler.maximum-allocation-mb = 6144

Exception from container-launch: ExitCodeException exitCode=134:
/bin/bash: line 1: 3876 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8192m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1424264025191_0002/container_1424264025191_0002_01_11/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.12 50842 attempt_1424264025191_0002_m_05_0 11 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stdout 2 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stderr

How can I avoid this? Any help is appreciated. It looks to me that YARN is trying to launch all the containers simultaneously and not according to the available resources. Is there an option to restrict the number of containers on Hadoop nodes? Regards, Tariq

-- Warmest Regards, Ravi Shankar
Fwd: YARN container launch failed exception and mapred-site.xml configuration
I have 7 nodes in my Hadoop cluster [8 GB RAM and 4 VCPUs on each node], 1 NameNode + 6 DataNodes. I followed the link to Hortonworks [ http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html ], made calculations according to the hardware configuration of my nodes, and have added the updated mapred-site.xml and yarn-site.xml files to my question. Still my application is crashing with the same exception.

My MapReduce application has 34 input splits with a block size of 128 MB.

**mapred-site.xml** has the following properties:
mapreduce.framework.name = yarn
mapred.child.java.opts = -Xmx2048m
mapreduce.map.memory.mb = 4096
mapreduce.map.java.opts = -Xmx2048m

**yarn-site.xml** has the following properties:
yarn.resourcemanager.hostname = hadoop-master
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.nodemanager.resource.memory-mb = 6144
yarn.scheduler.minimum-allocation-mb = 2048
yarn.scheduler.maximum-allocation-mb = 6144

Exception from container-launch: ExitCodeException exitCode=134:
/bin/bash: line 1: 3876 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8192m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1424264025191_0002/container_1424264025191_0002_01_11/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.12 50842 attempt_1424264025191_0002_m_05_0 11 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stdout 2 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stderr

How can I avoid this? Any help is appreciated. Is there an option to restrict the number of containers on Hadoop nodes?
Re: secure checksum in HDFS
There seems to be some work done on this here: https://issues.apache.org/jira/browse/HADOOP-9209 A 3rd-party tool: https://github.com/rdsr/hdfs-checksum Regards, Shahab On Fri, Feb 20, 2015 at 12:39 PM, xeonmailinglist xeonmailingl...@gmail.com wrote: Hi, Is it possible to use SHA-256 or MD5 as the checksum for a file in HDFS? Thanks,
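Independent of those projects, a client-side sketch of the same idea: stream the file out of HDFS and digest it locally with java.security.MessageDigest. HDFS itself keeps using its CRC-based block checksums; this only computes an application-level digest over the file contents:

    import java.security.MessageDigest;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDigest {
        public static void main(String[] args) throws Exception {
            // args[0]: an HDFS path, e.g. hdfs://namenode:8020/data/file.txt (placeholder)
            Path path = new Path(args[0]);
            FileSystem fs = path.getFileSystem(new Configuration());
            MessageDigest md = MessageDigest.getInstance("SHA-256"); // or "MD5"
            byte[] buf = new byte[64 * 1024];
            try (FSDataInputStream in = fs.open(path)) {
                int n;
                while ((n = in.read(buf)) > 0) {
                    md.update(buf, 0, n);
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            System.out.println(hex);
        }
    }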
Re: How to get Hadoop's Generic Options value
Here is an example: https://adhoop.wordpress.com/2012/02/16/generate-a-list-of-anagrams-round-3/ -Rajesh On Thu, Feb 19, 2015 at 9:32 PM, Haoming Zhang haoming.zh...@outlook.com wrote: Thanks guys, I will try your solutions later and update the result! -- From: unmeshab...@gmail.com Date: Fri, 20 Feb 2015 10:04:38 +0530 Subject: Re: How to get Hadoop's Generic Options value To: user@hadoop.apache.org Try implementing your driver as public class YourDriver extends Configured implements Tool { main() run() } and then supply your property using the -D option. Thanks Unmesha Biju
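A fuller sketch of that driver skeleton (my.custom.key is just a placeholder property name):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class YourDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // Generic options such as -D my.custom.key=value have already been
            // parsed by ToolRunner and merged into the Configuration.
            Configuration conf = getConf();
            System.out.println("my.custom.key = " + conf.get("my.custom.key"));
            // ... build and submit the Job here ...
            return 0;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner strips the generic options (-D, -files, -libjars, ...)
            // before handing the remaining args to run().
            System.exit(ToolRunner.run(new Configuration(), new YourDriver(), args));
        }
    }

It would then be invoked along the lines of: hadoop jar yourapp.jar YourDriver -D my.custom.key=value <other args>.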
Re: Submit mapreduce job in remote YARN
Yes, you can do this in Java, if these conditions are satisfied: 1. your client is on the same network as the Hadoop cluster; 2. you add the Hadoop configuration to your Java classpath, so the JVM will load the Hadoop configuration. But the suggested way is hadoop jar. 2015-02-20 19:18 GMT+08:00 xeonmailinglist xeonmailingl...@gmail.com: Hi, I would like to submit a MapReduce job to a remote YARN cluster. Can I do this in Java, or using a REST API? Thanks,
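A minimal sketch of such a client, assuming the cluster addresses are known; the hostnames and ports below are placeholders, and in practice the same values would normally come from the cluster's core-site.xml and yarn-site.xml on the classpath, as suggested above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class RemoteSubmitSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:8020");          // placeholder
            conf.set("mapreduce.framework.name", "yarn");
            conf.set("yarn.resourcemanager.address", "rm-host:8032");       // placeholder
            conf.set("yarn.resourcemanager.scheduler.address", "rm-host:8030");

            Job job = Job.getInstance(conf, "remote submission example");
            job.setJarByClass(RemoteSubmitSketch.class);
            // ... set mapper/reducer classes, input and output paths here ...
            job.submit();   // submits asynchronously to the remote ResourceManager
            System.out.println("Submitted " + job.getJobID());
        }
    }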
Re: suspend and resume a job in execution?
I am not aware of an API that would let you do this. You may be able to move the application to a queue with 0 resources to achieve the desired behavior, but I'm not entirely sure. On Wednesday, February 18, 2015 9:24 AM, xeonmailinglist xeonmailingl...@gmail.com wrote: By job, I mean a MapReduce job. I would like to suspend and resume the MapReduce job while it is executing. On 18-02-2015 12:10, xeonmailinglist wrote: Hi, I want to suspend a job that is in execution when all map tasks finish, and then resume the job later. Can I do this in YARN? Is there an API for that, or must I use the command line? Thanks,
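If the "move to a starved queue" idea were pursued, one way to do the move programmatically is via YarnClient, sketched below; this assumes a Hadoop version that exposes moveApplicationAcrossQueues, the application id and queue name are placeholders, and whether a zero-resource queue actually pauses the job depends on the scheduler configuration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.util.ConverterUtils;

    public class MoveToQueueSketch {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new Configuration());
            yarn.start();
            // Placeholder application id and target queue name.
            ApplicationId appId =
                ConverterUtils.toApplicationId("application_1234567890123_0001");
            yarn.moveApplicationAcrossQueues(appId, "paused");
            yarn.stop();
        }
    }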