Re: How can I get the memory usage in Namenode and Datanode?
Where I work we run transient (temporary) clusters on Amazon EMR. When I was reading up on how things work, the suggested approach to monitoring was Ganglia, which covers memory usage, network usage, and so on. That matters because, depending on how things are set up (for example, pulling data into the cluster directly from an Amazon S3 bucket), the network link will always be saturated to keep a constant flow of data. So what I am suggesting is to take a look at Ganglia.

---
Regards,
Jonathan Aquilina
Founder, Eagle Eye T

On 2015-02-22 07:42, Fang Zhou wrote: [snip]
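For reference, pointing the Hadoop daemons at a Ganglia collector is a small configuration change. Below is a sketch of hadoop-metrics2.properties, assuming Ganglia 3.1 or newer and a gmond listening on gmond-host:8649 (both stand-ins for your own setup):

    # hadoop-metrics2.properties: send NameNode/DataNode metrics to Ganglia
    *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
    *.sink.ganglia.period=10
    namenode.sink.ganglia.servers=gmond-host:8649
    datanode.sink.ganglia.servers=gmond-host:8649

After restarting the daemons, NameNode and DataNode memory metrics appear in the Ganglia web UI alongside the host-level graphs.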
Re: hadoop learning
Hi Rishabh,

I didn't know anything about Hadoop a few months ago, and I started from the very beginning. I don't suggest starting with the online documentation; it is always fragmented, incomplete, and sometimes not even up to date. Also, starting by directly using Hadoop is the fastest way to frustration and will just lead you to abandon the technology. I can suggest two books I used to start with, and they were quite helpful for someone who didn't even know what MapReduce was. They provide many examples and use cases (especially the first one):

- O'Reilly - Hadoop: The Definitive Guide, 3rd Edition. This is quite old but, apart from the coding part, it explains quite well what Hadoop is, what it does, and how it works. It is mainly about old versions of Hadoop, but I believe it's something you should know, especially because most articles online still refer to the pre-YARN terminology.
- Addison-Wesley Professional - Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2. This is what I used to really understand the new Hadoop architecture and terminology. Sometimes it gives too many details, but better more than less. It also has a couple of chapters about installing Hadoop.

Good luck
Fabio

On Sat, Feb 21, 2015 at 3:33 PM, Ted Yu yuzhih...@gmail.com wrote: [snip]
Re: Scheduling in YARN according to available resources
Hi Tariq,

Glad to see that your issue is resolved, thank you. This re-affirms the compatibility issue with OpenJDK.

Thanks and regards,
Ravi

On Sat, Feb 21, 2015 at 1:40 PM, tesm...@gmail.com wrote:

Dear Nair, Your tip in your first email saved my day. Thanks once again. I am happy with Oracle JDK.

Regards, Tariq

On Sat, Feb 21, 2015 at 4:05 PM, R Nair ravishankar.n...@gmail.com wrote:

One of them is in the forum; if you search on Google you will find more. I am not saying it won't work, but you will have to select and apply some patches. One of my friends also had the same problem, and he only got it to work with great difficulty. So better to avoid it :) https://github.com/elasticsearch/elasticsearch-hadoop/issues/197

Thanks and regards, Nair

On Sat, Feb 21, 2015 at 8:20 AM, tesm...@gmail.com wrote:

Thanks Nair. Managed to install Oracle JDK and it is working great. Thanks for the tip. Any idea why OpenJDK crashes where Oracle JDK works?

Regards, Tariq

On Sat, Feb 21, 2015 at 7:14 AM, tesm...@gmail.com wrote:

Thanks for your answer, Nair. Is installing Oracle JDK on Ubuntu as complicated as described in this link? http://askubuntu.com/questions/56104/how-can-i-install-sun-oracles-proprietary-java-jdk-6-7-8-or-jre Is there an alternative?

Regards

On Sat, Feb 21, 2015 at 6:50 AM, R Nair ravishankar.n...@gmail.com wrote:

I had a very similar issue; I changed to Oracle JDK and used that. At first look there is nothing I can see wrong with your configuration, thanks.

Regards, Nair

On Sat, Feb 21, 2015 at 1:42 AM, tesm...@gmail.com wrote:

I have 7 nodes in my Hadoop cluster [8GB RAM and 4 VCPUs on each node], 1 Namenode + 6 Datanodes. I followed the link from Hortonworks [http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html] and made the calculations according to the hardware configuration of my nodes. I added the updated mapred-site.xml and yarn-site.xml files in my question. Still my application is crashing with the same exception. My MapReduce application has 34 input splits with a block size of 128MB.
**mapred-site.xml** has the following properties:

mapreduce.framework.name = yarn
mapred.child.java.opts = -Xmx2048m
mapreduce.map.memory.mb = 4096
mapreduce.map.java.opts = -Xmx2048m

**yarn-site.xml** has the following properties:

yarn.resourcemanager.hostname = hadoop-master
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.nodemanager.resource.memory-mb = 6144
yarn.scheduler.minimum-allocation-mb = 2048
yarn.scheduler.maximum-allocation-mb = 6144

Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 3876 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8192m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1424264025191_0002/container_1424264025191_0002_01_11/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.12 50842 attempt_1424264025191_0002_m_05_0 11 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stdout 2 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stderr

How can I avoid this? Any help is appreciated. It looks to me that YARN is trying to launch all the containers simultaneously and not according to the available resources. Is there an option to restrict the number of containers on Hadoop nodes?

Regards, Tariq

--
Warmest Regards,
Ravi Shankar
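One detail worth noting in the output above: the crashing container was launched with -Xmx8192m, which matches neither mapred.child.java.opts nor mapreduce.map.java.opts, so something else is overriding the heap. A way to rule that out is to force consistent values at submission time. This is a sketch, assuming the driver uses ToolRunner so that -D options are honored; the jar name, class, and paths are hypothetical:

    $ hadoop jar myapp.jar com.example.MyJob \
        -Dmapreduce.map.memory.mb=4096 \
        -Dmapreduce.map.java.opts=-Xmx3276m \
        -Dmapreduce.reduce.memory.mb=4096 \
        -Dmapreduce.reduce.java.opts=-Xmx3276m \
        /input /output

A common rule of thumb is to set -Xmx to roughly 80% of the matching memory.mb so the JVM's non-heap overhead still fits inside the container. On the scheduling worry: the NodeManager never hands out more than yarn.nodemanager.resource.memory-mb worth of containers, so with 6144 MB per node and 4096 MB maps at most one map runs per node. Exit code 134 is a SIGABRT from the JVM itself, which fits the OpenJDK crash diagnosed in this thread rather than a YARN over-allocation problem.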
Re: How can I get the memory usage in Namenode and Datanode?
Thank you for sharing. I appreciate it.

Tim

On Feb 22, 2015, at 1:23 AM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: [snip]
Re: How can I get the memory usage in Namenode and Datanode?
Hi Tim,

Not sure if this might be of any use in terms of improving overall cluster performance for you, but I hope it gives you and others some ideas: https://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf

---
Regards,
Jonathan Aquilina
Founder, Eagle Eye T

On 2015-02-22 07:57, Tim Chou wrote: [snip]
Re: How can I get the memory usage in Namenode and Datanode?
Can anyone help me?

Thanks,
Tim

On Feb 21, 2015, at 2:54 PM, Fang Zhou timchou@gmail.com wrote: [snip]
Re: How can I get the memory usage in Namenode and Datanode?
Hi Jonathan,

Very useful information. I will look at Ganglia. However, I do not have administrative privileges on the cluster, so I don't know whether I can install Ganglia there.

Thank you for your information.

Best,
Tim

2015-02-22 0:53 GMT-06:00 Jonathan Aquilina jaquil...@eagleeyet.net: [snip]
Re: Hadoop - HTTPS communication between nodes - How to Confirm ?
Hi,

Be careful: HTTPS only secures WebHDFS. If you want to protect all network streams you need more than that: https://s3.amazonaws.com/dev.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.2/bk_reference/content/reference_chap-wire-encryption.html

If you're just interested in HTTPS, an lsof -p <datanode pid> | grep TCP will show you the DN listening on 50075 for HTTP and 50475 for HTTPS. For the namenode that would be 50070 and 50470.

Ulul

On 21/02/2015 19:53, hadoop.supp...@visolve.com wrote:

Hello Everyone,

We are trying to measure performance between the HTTP and HTTPS versions of Hadoop DFS, MapReduce, and other related modules. So far we have tested several metrics in Hadoop's HTTP mode, and we are now trying to test the same metrics on the HTTPS platform. Our test cluster consists of one Master Node and two Slave Nodes. We have configured the HTTPS connection and now need to verify whether the nodes are communicating directly over HTTPS. We tried checking the logs, the cluster's WebHDFS UI, health check information, and the dfsadmin report, but to no avail. Since only limited documentation is available for HTTPS, we are unable to verify whether the nodes communicate over HTTPS. Can any experts here shed some light on how to confirm HTTPS communication status between nodes (with MapReduce/DFS)?

Note: If I have missed any information, feel free to check with me.

Thanks,
S.RagavendraGanesh
ViSolve Hadoop Support Team
ViSolve Inc. | San Jose, California
Website: www.visolve.com | email: servi...@visolve.com | Phone: 408-850-2243
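To confirm the channel is really TLS, rather than just checking which ports are open, one option is to attempt a handshake directly. This is a quick sketch, assuming the default secure ports; substitute your own hostnames:

    $ openssl s_client -connect namenode-host:50470 </dev/null | head -20
    $ openssl s_client -connect datanode-host:50475 </dev/null | head -20

A successful handshake prints the server certificate chain; the same command against a plain-HTTP port fails with a handshake error. In Hadoop 2.x, if the goal is HTTPS only, dfs.http.policy can be set to HTTPS_ONLY in hdfs-site.xml, after which curl -v http://namenode-host:50070/ should no longer answer on the old HTTP port.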
Re: How can I get the memory usage in Namenode and Datanode?
I am rather new to Hadoop, but wouldn't the difference potentially be in how the files are split in terms of size?

---
Regards,
Jonathan Aquilina
Founder, Eagle Eye T

On 2015-02-21 21:54, Fang Zhou wrote: [snip]
Re: How can I get the memory usage in Namenode and Datanode?
Hi Jonathan,

Thank you. The number of files does impact the memory usage of the Namenode. I just want to get the real memory usage of the Namenode. The heap usage always changes, so I have no idea which value is the right one.

Thanks,
Tim

On Feb 22, 2015, at 12:22 AM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: [snip]
hadoop learning
Hello,

Please tell me where I can learn the concepts of Big Data and Hadoop from scratch. Please provide some online links.

Rishabh Agrawal
Re: hadoop learning
I have been learning and trying to implement a Hadoop ecosystem for a POC over the last month or so, and I think the best way to learn is by doing. Hadoop as a concept has many implementations, and I picked the Hortonworks sandbox for learning. This has helped me in gauging some of the concepts and gaining some practical understanding as well.

Happy learning

Sent from my iPhone
Bhupendra Gupta

On 21-Feb-2015, at 1:39 pm, Rishabh Agrawal ss.rishab...@gmail.com wrote: [snip]
Re: Time taken by -copyFromLocalHost for transferring data
$ time hadoop fs -put <local file> <hdfs path>

For small files, I would expect the time to vary significantly between runs. For larger files, it should be more consistent (since the throughput will be bound by the network bandwidth of the local machine).

On 21 Feb 2015 08:43, tesm...@gmail.com wrote:

Hi, How can I measure the time taken by -copyFromLocal to transfer my data from the local host to HDFS?

Regards, Tariq
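For timing from inside a script rather than with the shell's time builtin, a minimal sketch (data.bin and the destination path are stand-ins for your own):

    #!/bin/bash
    # Time a single local-to-HDFS copy in whole seconds.
    START=$(date +%s)
    hadoop fs -copyFromLocal data.bin /user/$USER/data.bin
    END=$(date +%s)
    echo "copyFromLocal took $((END - START)) s"

Running it several times against the same file gives a feel for the run-to-run variance mentioned above.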
Running MapReduce jobs in batch mode on different data sets
Hi,

Is it possible to run jobs on Hadoop in batch mode? I have 5 different datasets in HDFS and need to run the same MapReduce application on these datasets one after the other. Right now I am doing it manually. How can I automate this? And how can I save the log of each execution to a text file for later processing?

Regards, Tariq
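A simple driver script covers both asks. This is a minimal sketch; the jar name, main class, and HDFS paths are hypothetical stand-ins for your own:

    #!/bin/bash
    # Run the same MapReduce job over five HDFS datasets, one after the
    # other, keeping a per-dataset log file for later processing.
    mkdir -p logs
    for ds in set1 set2 set3 set4 set5; do
        hadoop jar myapp.jar com.example.MyJob \
            "/data/$ds" "/output/$ds" \
            > "logs/$ds.log" 2>&1
    done

Each hadoop jar invocation blocks until its job finishes, so the jobs run strictly one after the other, and the client output (including the job counters) lands in logs/<dataset>.log.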
Re: hadoop learning
Rishabh: You can start with: http://wiki.apache.org/hadoop/HowToContribute

There are several components: common, hdfs, YARN, mapreduce, ... Which ones are you interested in?

Cheers

On Sat, Feb 21, 2015 at 12:18 AM, Bhupendra Gupta bhupendra1...@gmail.com wrote: [snip]
How can I get the memory usage in Namenode and Datanode?
Hi All,

I want to test the memory usage on the Namenode and Datanode. I tried using jmap, jstat, /proc/<pid>/stat, top, ps aux, and the Hadoop web interface to check the memory, and the values I get from them are all different. I also found that the memory usage always changes periodically. This is the first thing that confused me.

I thought that the more files stored in HDFS, the more memory would be used on the Namenode and Datanodes, and that the memory used by the Namenode should be larger than the memory used by each Datanode. However, some results show my ideas are wrong. For example, I tested the memory usage of the Namenode with 6000 files and with 1000 files: the "6000" memory is less than the "1000" memory in jmap's results. I also found that the memory usage on a Datanode is larger than the memory used on the Namenode.

I really don't know how to get the memory usage of the Namenode and Datanode. Can anyone give me some advice?

Thanks,
Tim
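For a number that is comparable across daemons, one option (a sketch, assuming the default unsecured web ports, 50070 for the Namenode and 50075 for the Datanode) is to read the JVM's own memory MBean through the /jmx servlet that every Hadoop daemon exposes:

    $ curl -s 'http://namenode-host:50070/jmx?qry=java.lang:type=Memory'
    $ curl -s 'http://datanode-host:50075/jmx?qry=java.lang:type=Memory'

The JSON reply contains a HeapMemoryUsage entry with init, committed, max, and used fields. The used figure rises and falls with garbage collection cycles, which is why jmap, top, and the web UI never agree and why the value changes periodically; committed (or a used reading taken right after a full GC) is a more stable basis for comparing a 1000-file namespace against a 6000-file one.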
RE: Yarn AM is abending job more information
Alex,

Thanks for looking at the output and for your feedback. I want to make sure I understand your input correctly. My cluster is a set of old dual-core machines and my client is a VirtualBox VM with 10 GB of memory allocated to it.

I did some more testing (and will continue to do so to track down the problem). I found that if I move my jar file to the resource manager server on the Dell cluster and execute it locally (rather than remotely), it runs to a successful completion. So there is definitely something not right somewhere, and I have to believe it is a setup problem on my part, not a hardware problem. Here is the job output:

Thanks - rd

From: Alexander Alten-Lorenz [mailto:wget.n...@gmail.com]
Sent: Friday, February 20, 2015 2:12 AM
To: user@hadoop.apache.org
Subject: Re: Yarn AM is abending job when submitting a remote job to cluster

15/02/20 19:38:21 INFO client.RMProxy: Connecting to ResourceManager at hadoop0.rdpratti.com/192.168.2.253:8032
15/02/20 19:38:22 INFO input.FileInputFormat: Total input paths to process : 5
15/02/20 19:38:22 INFO mapreduce.JobSubmitter: number of splits:5
15/02/20 19:38:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1424003606313_0015
15/02/20 19:38:22 INFO impl.YarnClientImpl: Submitted application application_1424003606313_0015
15/02/20 19:38:22 INFO mapreduce.Job: The url to track the job: http://hadoop0.rdpratti.com:8088/proxy/application_1424003606313_0015/
15/02/20 19:38:22 INFO mapreduce.Job: Running job: job_1424003606313_0015
15/02/20 19:38:36 INFO mapreduce.Job: Job job_1424003606313_0015 running in uber mode : false
15/02/20 19:38:36 INFO mapreduce.Job: map 0% reduce 0%
15/02/20 19:38:45 INFO mapreduce.Job: map 20% reduce 0%
15/02/20 19:38:47 INFO mapreduce.Job: map 40% reduce 0%
15/02/20 19:38:52 INFO mapreduce.Job: map 80% reduce 0%
15/02/20 19:38:59 INFO mapreduce.Job: map 100% reduce 0%
15/02/20 19:39:03 INFO mapreduce.Job: map 100% reduce 25%
15/02/20 19:39:08 INFO mapreduce.Job: map 100% reduce 50%
15/02/20 19:39:09 INFO mapreduce.Job: map 100% reduce 75%
15/02/20 19:39:10 INFO mapreduce.Job: map 100% reduce 100%
15/02/20 19:39:11 INFO mapreduce.Job: Job job_1424003606313_0015 completed successfully
15/02/20 19:39:11 INFO mapreduce.Job: Counters: 50
	File System Counters
		FILE: Number of bytes read=1628864
		FILE: Number of bytes written=4240224
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=5343866
		HDFS: Number of bytes written=624
		HDFS: Number of read operations=27
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=8
	Job Counters
		Launched map tasks=5
		Launched reduce tasks=4
		Data-local map tasks=2
		Rack-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=43715
		Total time spent by all reduces in occupied slots (ms)=30261
		Total time spent by all map tasks (ms)=43715
		Total time spent by all reduce tasks (ms)=30261
		Total vcore-seconds taken by all map tasks=43715
		Total vcore-seconds taken by all reduce tasks=30261
		Total megabyte-seconds taken by all map tasks=44764160
		Total megabyte-seconds taken by all reduce tasks=30987264
	Map-Reduce Framework
		Map input records=175558
		Map output records=974078
		Map output bytes=5844468
		Map output materialized bytes=1631237
		Input split bytes=659
		Combine input records=0
		Combine output records=0
		Reduce input groups=35
		Reduce shuffle bytes=1631237
		Reduce input records=974078
		Reduce output records=35
		Spilled Records=1948156
		Shuffled Maps =20
		Failed Shuffles=0
		Merged Map outputs=20
		GC time elapsed (ms)=862
		CPU time spent (ms)=30820
		Physical memory (bytes) snapshot=2817286144
		Virtual memory (bytes) snapshot=13831352320
		Total committed heap usage (bytes)=2295857152
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=5343207
	File Output Format Counters
		Bytes Written=624
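When a remotely submitted job aborts but the same jar runs fine locally, comparing the aggregated container logs of the two runs is usually the fastest way to spot the difference. A quick sketch, assuming yarn.log-aggregation-enable is set to true on the cluster; the application id is the one from the output above:

    $ yarn logs -applicationId application_1424003606313_0015 | less

This prints the stdout, stderr, and syslog of every container, including the ApplicationMaster, so classpath or configuration differences between the remote client and the cluster tend to show up there.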