Re: How can I get the memory usage in Namenode and Datanode?

2015-02-21 Thread Jonathan Aquilina
 

Where I work we run transient (temporary) clusters on Amazon EMR. When I
was reading up on how things work, the suggestion for monitoring was to
use Ganglia to watch memory usage, network usage, and so on. That way,
depending on how things are set up, for example pulling data directly into
the cluster from an Amazon S3 bucket, you can check that the network link
stays saturated and data keeps flowing.


What I am suggesting is to take a look at Ganglia. 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T


Re: hadoop learning

2015-02-21 Thread Fabio C.
Hi Rishabh,
I didn't know anything about Hadoop a few months ago, and I started from
the very beginning. I don't suggest starting with the online documentation,
which is often fragmented, incomplete, and sometimes not even up to date.
Starting by directly using Hadoop is also the fastest way to frustration
and will just lead you to abandon this technology.
I can suggest two books I used to start with, and they have been quite
helpful for someone who didn't even know what MapReduce is. They provide
many examples and use cases (especially the first one):
- O'Reilly - Hadoop: The Definitive Guide, 3rd Edition. This is quite old
but, apart from the coding part, it explains quite well what Hadoop
is, what it does, and how it works. It is mainly about old versions of
Hadoop, but I believe it's something you should know, if only because most
articles online still use the pre-YARN terminology.
- Addison-Wesley Professional - Apache Hadoop YARN: Moving beyond
MapReduce and Batch Processing with Apache Hadoop 2. This is what I
used to really understand the new Hadoop architecture and terminology.
Sometimes it gives too many details, but better more than less. It also has
a couple of chapters about installing Hadoop.

Good luck

Fabio



Re: Scheduling in YARN according to available resources

2015-02-21 Thread R Nair
Hi Tariq,

Glad to see that your issue is resolved, thank you. This re-affirms the
compatibility issue with OpenJDK.

Regards,
Ravi

On Sat, Feb 21, 2015 at 1:40 PM, tesm...@gmail.com wrote:

 Dear Nair,

 Your tip in your first email saved my day. Thanks once again. I am happy
 with the Oracle JDK.

 Regards,
 Tariq

 On Sat, Feb 21, 2015 at 4:05 PM, R Nair ravishankar.n...@gmail.com
 wrote:

 One of them is in the forum; if you search on Google you will find more. I
 am not saying it won't work, but you will have to select and apply some
 patches. One of my friends had the same problem and, with great
 difficulty, got it to work. So better to avoid it :)

 https://github.com/elasticsearch/elasticsearch-hadoop/issues/197

 Thanks and regards,
 Nair

 On Sat, Feb 21, 2015 at 8:20 AM, tesm...@gmail.com wrote:

 Thanks Nair.

 I managed to install the Oracle JDK and it is working great. Thanks for the
 tip.

 Any idea why OpenJDK is crashing and Oracle JDK works?

 Regards,
 Tariq




 On Sat, Feb 21, 2015 at 7:14 AM, tesm...@gmail.com wrote:

 Thanks for your answer, Nair.
 Is installing the Oracle JDK on Ubuntu as complicated as described in
 this link?

 http://askubuntu.com/questions/56104/how-can-i-install-sun-oracles-proprietary-java-jdk-6-7-8-or-jre

 Is there an alternative?

 Regards
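
For reference, the route most answers at that link converge on is the
WebUpd8 PPA, which wraps the Oracle download in an installer package. A
sketch of it (the PPA and package names are assumptions from that recipe,
not confirmed by this thread):

  sudo add-apt-repository ppa:webupd8team/java
  sudo apt-get update
  sudo apt-get install oracle-java7-installer
  # then point Hadoop at the new JDK, e.g. in hadoop-env.sh:
  # export JAVA_HOME=/usr/lib/jvm/java-7-oracle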


 On Sat, Feb 21, 2015 at 6:50 AM, R Nair ravishankar.n...@gmail.com
 wrote:

 I had a very similar issue; I switched to the Oracle JDK. At first look
 there is nothing I see wrong with your configuration. Thanks

 Regards,
 Nair

 On Sat, Feb 21, 2015 at 1:42 AM, tesm...@gmail.com wrote:

 I have 7 nodes in my Hadoop cluster [8 GB RAM and 4 VCPUs on each
 node], 1 Namenode + 6 Datanodes.

 I followed the link from Hortonworks [
 http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html
 ] and made calculations according to the hardware configuration of my
 nodes. I added the updated mapred-site.xml and yarn-site.xml files in my
 question. Still my application is crashing with the same exception.

 My MapReduce application has 34 input splits with a block size of
 128 MB.

 **mapred-site.xml** has the following properties:

 mapreduce.framework.name  = yarn
 mapred.child.java.opts= -Xmx2048m
 mapreduce.map.memory.mb   = 4096
 mapreduce.map.java.opts   = -Xmx2048m

 **yarn-site.xml** has the following properties:

 yarn.resourcemanager.hostname= hadoop-master
 yarn.nodemanager.aux-services= mapreduce_shuffle
 yarn.nodemanager.resource.memory-mb  = 6144
 yarn.scheduler.minimum-allocation-mb = 2048
 yarn.scheduler.maximum-allocation-mb = 6144
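
 A quick sanity check on those numbers (an editor's sketch, not from the
 original mail): each map container requests mapreduce.map.memory.mb =
 4096 MB, while each NodeManager offers yarn.nodemanager.resource.memory-mb
 = 6144 MB, so the scheduler can only run one map container per node at a
 time:

   # concurrent map containers per node = floor(NM memory / map container size)
   echo $(( 6144 / 4096 ))         # => 1 map container per node
   echo $(( 6 * (6144 / 4096) ))   # => 6 in flight across the 6 Datanodes

 So with these settings YARN should not be able to launch all 34 containers
 at once. Note too that the crash log below shows the child JVM starting
 with -Xmx8192m rather than the configured -Xmx2048m, which suggests the
 updated files may not be the ones the job picked up and is worth
 double-checking.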


  Exception from container-launch: ExitCodeException exitCode=134:
 /bin/bash: line 1:  3876 Aborted  (core dumped)
 /usr/lib/jvm/java-7-openjdk-amd64/bin/java
 -Djava.net.preferIPv4Stack=true
 -Dhadoop.metrics.log.level=WARN -Xmx8192m
 -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1424264025191_0002/container_1424264025191_0002_01_11/tmp
 -Dlog4j.configuration=container-log4j.properties
 -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11
 -Dyarn.app.container.log.filesize=0
 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild
 192.168.0.12 50842 attempt_1424264025191_0002_m_05_0 11 

 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stdout
 2

 /home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_11/stderr


 How can I avoid this? Any help is appreciated.

 It looks to me like YARN is trying to launch all the containers
 simultaneously and not according to the available resources. Is
 there an option to restrict the number of containers on Hadoop nodes?

 Regards,
 Tariq




 --
 Warmest Regards,

 Ravi Shankar






 --
 Warmest Regards,

 Ravi Shankar





-- 
Warmest Regards,

Ravi Shankar


Re: How can I get the memory usage in Namenode and Datanode?

2015-02-21 Thread Fang Zhou
Thank you for sharing.

I appreciate it.

Tim




Re: How can I get the memory usage in Namenode and Datanode?

2015-02-21 Thread Jonathan Aquilina
 

Hi Tim, 

Not sure whether this will be of any use for improving overall
cluster performance, but I hope it offers some ideas for you
and others. 

https://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T


Re: How can I get the memory usage in Namenode and Datanode?

2015-02-21 Thread Fang Zhou
Can anyone help me?

Thanks,
Tim




Re: How can I get the memory usage in Namenode and Datanode?

2015-02-21 Thread Tim Chou
Hi Jonathan,

Very useful information. I will take a look at Ganglia.

However, I do not have administrative privileges on the cluster, so I
don't know whether I can install Ganglia there.

Thank you for your information.

Best,
Tim
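
One avenue that needs no install rights (an editor's aside, not from the
original thread): every Hadoop 2.x daemon publishes its JVM metrics through
the /jmx servlet on the same port as its web UI, so a plain HTTP query
works. A minimal sketch, with placeholder host names and the default web
ports:

  # Heap usage of the NameNode JVM (default web UI port 50070):
  curl -s 'http://namenode-host:50070/jmx?qry=java.lang:type=Memory'

  # The same query against a DataNode (default web UI port 50075):
  curl -s 'http://datanode-host:50075/jmx?qry=java.lang:type=Memory'

The HeapMemoryUsage entry in the JSON response is the figure the web UI
itself reports, so it gives one consistent number for comparing the
NameNode against each DataNode.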





Re: Hadoop - HTTPS communication between nodes - How to Confirm ?

2015-02-21 Thread Ulul

Hi

Be careful: HTTPS secures WebHDFS. If you want to protect all 
network streams you need more than that:

https://s3.amazonaws.com/dev.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.2/bk_reference/content/reference_chap-wire-encryption.html

If you're just interested in HTTPS, an lsof -p <datanode pid> | grep TCP 
will show you the DN listening on 50075 for HTTP and 50475 for HTTPS. For 
the namenode that would be 50070 and 50470.
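
One step beyond the lsof check, as a sketch (substitute a real host; 50475
is the default DataNode HTTPS port mentioned above): a TLS handshake against
the port proves it is actually serving HTTPS rather than merely listening:

  # Prints the certificate subject and validity dates if the port speaks TLS;
  # against a plain-HTTP port the handshake simply fails:
  openssl s_client -connect datanode-host:50475 </dev/null 2>/dev/null |
    openssl x509 -noout -subject -dates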


Ulul

On 21/02/2015 19:53, hadoop.supp...@visolve.com wrote:


Hello Everyone,

We are trying to measure performance between HTTP and HTTPS version on 
Hadoop DFS, Mapreduce and other related modules.


So far, we have tested several metrics in Hadoop's HTTP mode. 
Similarly, we are trying to test the same metrics on the HTTPS platform. 
Our test cluster consists of one Master Node and two 
Slave Nodes.


We have configured the HTTPS connection and now we need to verify whether 
the nodes are actually communicating over HTTPS. We tried checking logs, 
the cluster's WebHDFS UI, health-check information, and the dfsadmin 
report, but to no avail. Since only limited documentation on HTTPS is 
available, we are unable to verify whether the nodes are communicating over HTTPS.


Hence, can any experts here shed some light on how to confirm the 
HTTPS communication status between nodes (for MapReduce/DFS, say)?


Note: if I have missed any information, feel free to check with me.


Thanks,

S. RagavendraGanesh

ViSolve Hadoop Support Team
ViSolve Inc. | San Jose, California
Website: www.visolve.com

email: servi...@visolve.com | Phone: 408-850-2243
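
Beyond HTTPS itself, a hedged pointer (an editor's sketch, not from this
thread; the property names come from the wire-encryption guide linked
above): whether RPC and DataNode block transfers are encrypted can be read
back from the effective configuration on any node:

  hdfs getconf -confKey hadoop.rpc.protection       # 'privacy' = encrypted RPC
  hdfs getconf -confKey dfs.encrypt.data.transfer   # 'true' = encrypted block transfer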






Re: How can I get the memory usage in Namenode and Datanode?

2015-02-21 Thread Jonathan Aquilina
 

I am rather new to Hadoop, but wouldn't the difference potentially be in
how the files are split in terms of size? 

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T


Re: How can I get the memory usage in Namenode and Datanode?

2015-02-21 Thread Fang Zhou
Hi Jonathan,

Thank you.

The number of files impacts the memory usage in the Namenode.

I just want to get the real memory usage of the Namenode.

The heap memory usage always changes, so I have no idea which value 
is the right one.

Thanks,
Tim




hadoop learning

2015-02-21 Thread Rishabh Agrawal
Hello,

Please tell me where can i learn the concepts of Big Data and Hadoop from
the scratch. Please provide some links online.



Rishabh Agrawal


Re: hadoop learning

2015-02-21 Thread Bhupendra Gupta
I have been learning and trying to implement a Hadoop ecosystem for a POC 
for the last month or so, and I think that the best way to learn is by 
doing it.

Hadoop as a concept has many implementations, and I picked the Hortonworks 
Sandbox for learning.
This has helped me grasp some of the concepts and gain some practical 
understanding as well.

Happy learning 

Sent from my iPhone

Bhupendra Gupta



Re: Time taken by -copyFromLocalHost for transferring data

2015-02-21 Thread Ranadip Chatterjee
$ time hadoop fs -put <local file> <hdfs path>

For small files, I would expect the time to have a significant variance
between runs. For larger files, it should be more consistent (since the
throughput will be bound by the network bandwidth of the local machine).
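
To make that concrete, a sketch with placeholder paths (and note the shell
flag is -copyFromLocal; hadoop fs -put is equivalent): repeat the copy a
few times and compare, since single runs on small files vary a lot:

  for i in 1 2 3; do
    hadoop fs -rm /tmp/bench.bin 2>/dev/null   # ignore the error on the first run
    time hadoop fs -put ./bench.bin /tmp/bench.bin
  done
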
On 21 Feb 2015 08:43, tesm...@gmail.com wrote:

 Hi,

 How can I measure the time taken by -copyFromLocalHost for transferring my
 data from local host to HDFS?

 Regards,
 Tariq



Running MapReduce jobs in batch mode on different data sets

2015-02-21 Thread tesm...@gmail.com
Hi,

Is it possible to run jobs on Hadoop in batch mode?

I have 5 different datasets in HDFS and need to run the same MapReduce
application on these datasets one after the other.

Right now I am doing it manually. How can I automate this?

How can I save the log of each execution in text files for later processing?

Regards,
Tariq
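
One way to do it, as a sketch rather than the only way (the jar name,
driver class, and HDFS paths below are placeholders): drive the runs from a
shell script and redirect each job's client output to its own file:

  #!/usr/bin/env bash
  # Run the same MapReduce application over five datasets, one after the other.
  for ds in set1 set2 set3 set4 set5; do
      hadoop jar myapp.jar com.example.MyJob \
          "/data/$ds" "/results/$ds" \
          > "run-$ds.log" 2>&1      # keep each run's client log as a text file
  done

For anything more elaborate (retries, dependencies, schedules), a workflow
scheduler such as Oozie is the usual next step.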


Re: hadoop learning

2015-02-21 Thread Ted Yu
Rishabh:
You can start with:
http://wiki.apache.org/hadoop/HowToContribute

There're several components: common, hdfs, YARN, mapreduce, ...
Which ones are you interested in ?

Cheers




How can I get the memory usage in Namenode and Datanode?

2015-02-21 Thread Fang Zhou
Hi All,

I want to test the memory usage of the Namenode and Datanodes.

I tried using jmap, jstat, /proc/<pid>/stat, top, ps aux, and the Hadoop web 
interface to check the memory.
The values I get from them all differ, and I also found that the memory usage 
changes periodically.
This is the first thing that confused me.

I thought that the more files are stored, the more memory the Namenode 
and Datanodes would use.
I also thought the memory used by the Namenode should be larger than the memory 
used by each Datanode.
However, some results show my assumptions are wrong.
For example, I tested the Namenode's memory usage with 6000 files and with 
1000 files.
The "6000" value is less than the "1000" value in jmap's results.
I also found that a Datanode uses more memory than the Namenode.

I really don't know how to get the memory usage of the Namenode and Datanodes.

Can anyone give me some advice?

Thanks,
Tim
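
One way to reconcile those numbers, as an editor's sketch (it assumes jps
is run as the same user as the daemons): sample the heap over time rather
than reading it once, since garbage collection makes used heap rise and
fall, which is the periodic change described above. Note also that top and
ps report the whole process (RSS), while jmap, jstat, and the web UI report
only the Java heap, so the tools will never agree exactly.

  # Find the NameNode JVM and take repeated heap samples:
  pid=$(jps | awk '$2 == "NameNode" {print $1}')
  jstat -gc  "$pid" 5s 5    # five samples, five seconds apart
  jmap -heap "$pid"         # one-off heap configuration and usage summary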

RE: Yarn AM is abending job - more information

2015-02-21 Thread Roland DePratti
Alex,

 

Thanks for looking at the output and your feedback.  I want to make sure I
understand your input correctly.

 

My cluster is a set of old dual-core machines and my client is a VirtualBox
VM with 10 GB of memory allocated to it.

 

I did some more testing (and will continue to do so to track down the
problem).

 

I found that if I move my jar file to the resource manager server on the
Dell cluster and execute it locally (rather than remotely), it runs to
successful completion. So there is definitely something not right somewhere,
and I have to believe it is a setup problem on my part, not a hardware
problem.

 

Here is the job output:

 

 

Thanks

 

-   rd

 

 

From: Alexander Alten-Lorenz [mailto:wget.n...@gmail.com] 
Sent: Friday, February 20, 2015 2:12 AM
To: user@hadoop.apache.org
Subject: Re: Yarn AM is abending job when submitting a remote job to cluster

 

15/02/20 19:38:21 INFO client.RMProxy: Connecting to ResourceManager at 
hadoop0.rdpratti.com/192.168.2.253:8032
15/02/20 19:38:22 INFO input.FileInputFormat: Total input paths to process : 5
15/02/20 19:38:22 INFO mapreduce.JobSubmitter: number of splits:5
15/02/20 19:38:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
job_1424003606313_0015
15/02/20 19:38:22 INFO impl.YarnClientImpl: Submitted application 
application_1424003606313_0015
15/02/20 19:38:22 INFO mapreduce.Job: The url to track the job: 
http://hadoop0.rdpratti.com:8088/proxy/application_1424003606313_0015/
15/02/20 19:38:22 INFO mapreduce.Job: Running job: job_1424003606313_0015
15/02/20 19:38:36 INFO mapreduce.Job: Job job_1424003606313_0015 running in 
uber mode : false
15/02/20 19:38:36 INFO mapreduce.Job:  map 0% reduce 0%
15/02/20 19:38:45 INFO mapreduce.Job:  map 20% reduce 0%
15/02/20 19:38:47 INFO mapreduce.Job:  map 40% reduce 0%
15/02/20 19:38:52 INFO mapreduce.Job:  map 80% reduce 0%
15/02/20 19:38:59 INFO mapreduce.Job:  map 100% reduce 0%
15/02/20 19:39:03 INFO mapreduce.Job:  map 100% reduce 25%
15/02/20 19:39:08 INFO mapreduce.Job:  map 100% reduce 50%
15/02/20 19:39:09 INFO mapreduce.Job:  map 100% reduce 75%
15/02/20 19:39:10 INFO mapreduce.Job:  map 100% reduce 100%
15/02/20 19:39:11 INFO mapreduce.Job: Job job_1424003606313_0015 completed 
successfully
15/02/20 19:39:11 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=1628864
FILE: Number of bytes written=4240224
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=5343866
HDFS: Number of bytes written=624
HDFS: Number of read operations=27
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Launched map tasks=5
Launched reduce tasks=4
Data-local map tasks=2
Rack-local map tasks=3
Total time spent by all maps in occupied slots (ms)=43715
Total time spent by all reduces in occupied slots (ms)=30261
Total time spent by all map tasks (ms)=43715
Total time spent by all reduce tasks (ms)=30261
Total vcore-seconds taken by all map tasks=43715
Total vcore-seconds taken by all reduce tasks=30261
Total megabyte-seconds taken by all map tasks=44764160
Total megabyte-seconds taken by all reduce tasks=30987264
Map-Reduce Framework
Map input records=175558
Map output records=974078
Map output bytes=5844468
Map output materialized bytes=1631237
Input split bytes=659
Combine input records=0
Combine output records=0
Reduce input groups=35
Reduce shuffle bytes=1631237
Reduce input records=974078
Reduce output records=35
Spilled Records=1948156
Shuffled Maps =20
Failed Shuffles=0
Merged Map outputs=20
GC time elapsed (ms)=862
CPU time spent (ms)=30820
Physical memory (bytes) snapshot=2817286144
Virtual memory (bytes) snapshot=13831352320
Total committed heap usage (bytes)=2295857152
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=5343207
File Output Format Counters
Bytes Written=624
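
On the local-versus-remote difference described above, a hedged suggestion
(an assumption, not something this thread confirms): make sure the remote
client resolves the cluster's addresses rather than local defaults. If the
driver uses ToolRunner, they can be pinned per run; 8020 below is the usual
HDFS default, not a value taken from this thread:

  hadoop jar myjob.jar com.example.MyJob \
      -D fs.defaultFS=hdfs://hadoop0.rdpratti.com:8020 \
      -D yarn.resourcemanager.address=hadoop0.rdpratti.com:8032 \
      input output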