Thank you, everyone, for your suggestions.

I think I have an idea of where I was going wrong: not only was I setting 
up and destroying the JVM for each Hadoop job, 
I was also creating numerous Hadoop jobs to process different files that 
could be handled in a single job.

Will try the solution that I think would help solve the problem.


From: Shekhar Sharma
Sent: Tuesday, December 17, 2013 9:12 PM
Subject: Re: Estimating the time of my hadoop jobs

Apart from what Devin has suggested, there are other factors worth noting 
when you are running your Hadoop cluster on virtual machines.

(1) How many map and reduce slots are there in the cluster?

Since you have not mentioned this, and you are using a 4-node Hadoop cluster, 
I will assume the default of 2 map slots and 2 reduce slots per node, which 
gives a total of 8 map slots and 8 reduce slots.
What does that mean?
It means that at any one time only 8 map tasks and 8 reduce tasks can run in 
parallel on your cluster, and the other tasks have to wait.

(2) You have not mentioned whether the 30GB of data is made up of many small 
files (smaller than the block size) or one big file. Let us do a simple 
calculation assuming a single 30GB file and a block size of 64MB:

30GB = 30 * 1024 * 1024 * 1024 = 32212254720 bytes

64MB = 64 * 1024 * 1024 = 67108864 bytes

Total number of blocks the data will be split into = 32212254720 / 67108864 = 
480 blocks

This means you will be running 480 map tasks (assuming input split size = 
block size). But since you have only 8 map slots, only 8 map tasks will run 
at a time and the others will be pending.

Assuming all 8 map tasks in a wave finish at the same time, you will have 
480/8 = 60 map waves.
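
The arithmetic above can be sketched in a few lines of Python. This is only a rough estimate under the same assumptions as the text: one large input file, input split size equal to block size, 2 map slots per node, and every task in a wave finishing together. The function name estimate_map_waves is made up for illustration and is not part of Hadoop.

```python
def ceil_div(a, b):
    """Integer ceiling division, so a partial last block still counts as one."""
    return -(-a // b)

def estimate_map_waves(input_bytes, block_bytes, nodes, map_slots_per_node):
    """Return (map_tasks, map_slots, waves) for a single large input file.

    Assumes input split size == block size and that all tasks in a wave
    finish at the same time (both are simplifications).
    """
    map_tasks = ceil_div(input_bytes, block_bytes)
    map_slots = nodes * map_slots_per_node
    waves = ceil_div(map_tasks, map_slots)
    return map_tasks, map_slots, waves

GB = 1024 ** 3
MB = 1024 ** 2

tasks, slots, waves = estimate_map_waves(30 * GB, 64 * MB,
                                         nodes=4, map_slots_per_node=2)
print(tasks, slots, waves)  # 480 8 60
```

Changing the block size or the slots per node in the call shows how quickly the number of waves moves, which is what the later suggestions try to reduce the cost of.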

(3) As you know, each task runs in a separate JVM; that is, for every task a 
JVM is created, and after the task finishes the JVM is torn down. Creating 
and destroying JVMs is also a bottleneck.

So try reusing the same JVM. There is an option that lets you reuse the JVM 
across tasks.

(4) Since you are working with such a large data set, try using a combiner.

(5) Also try compressing the data and the intermediate output of the mappers, 
as well as the reducer output:
   ---First try with a sequence file
   ---Then try with the Snappy compression codec
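
As a rough illustration of points (3) and (5), the sketch below builds a hadoop command line that passes JVM reuse and map-output compression as -D generic options. The property names are the Hadoop 1.x ones (mapred.job.reuse.jvm.num.tasks, mapred.compress.map.output, mapred.map.output.compression.codec). Everything else is an assumption for illustration: the jar name, driver class, and paths are hypothetical, the driver is assumed to use ToolRunner so that -D options are parsed, and the Snappy codec is assumed to be available natively on the nodes.

```python
def hadoop_generic_options(props):
    """Flatten a dict of job properties into ['-D', 'key=value', ...]."""
    args = []
    for key, value in props.items():
        args += ["-D", f"{key}={value}"]
    return args

# Hadoop 1.x property names; the values are suggestions to experiment with.
tuning = {
    # -1 means a JVM may be reused for an unlimited number of tasks of a job
    "mapred.job.reuse.jvm.num.tasks": "-1",
    # compress the intermediate map output before it is shuffled
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
}

# Hypothetical jar, driver class, and HDFS paths.
command = (["hadoop", "jar", "log-analysis.jar", "LogAnalysisDriver"]
           + hadoop_generic_options(tuning)
           + ["/logs/in", "/logs/out"])
print(" ".join(command))
```

The same properties could instead be set in mapred-site.xml or on the job configuration in the driver; the command-line form is just the quickest way to try them one at a time and compare run times.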

If the above pointers bring the time down to roughly 1 hour, then with the 
same 4-node cluster and Hadoop running on separate physical machines you will 
surely see the job complete in 15-30 minutes. [Please refer to Devin's 
comments.]

My suggestion is to get the optimal performance on your virtual machines and 
then go for a real Hadoop cluster. You will surely see the performance improve.

Som Shekhar Sharma

On Tue, Dec 17, 2013 at 6:42 PM, Devin Suiter RDX wrote:

One of the problems you run into with Hadoop in Virtual Machine environments is 
performance issues when they are all running on the same physical host. With a 
VM, even though you are giving them 4 GB of RAM, and a virtual CPU and disk, if 
the virtual machines are sharing physical components like processor and 
physical storage medium, they compete for resources at the physical level. Even 
if you have the VM on a single host, or on a multi-core host with multiple 
disks and they are sharing as few resources as possible, there will still be a 
performance hit when the VM information has to pass through the hypervisor 
layer - co-scheduling resources with the host and things like that.

Does that make sense?

It's generally accepted that, due to these issues, Hadoop in virtual 
environments does not offer the same performance benefits as a physical Hadoop 
cluster. It can be used pretty well with even low-quality hardware though, so 
maybe you can acquire some used desktops, install your favorite Linux 
flavor on them, and make a cluster. Some people have even run Hadoop on 
Raspberry Pi clusters.

Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556

On Tue, Dec 17, 2013 at 6:26 AM, Kandoi, Nikhil wrote:
I know it is foolish of me to ask this, because a lot of factors affect it, 
but why is it taking so much time? Can anyone suggest possible reasons for 
it, or has anyone faced such an issue before?

Nikhil Kandoi
P.S. I am using Hadoop-1.0.3 for this application, so I wonder if this 
version has something to do with it.

From: Azuryy Yu
Sent: Tuesday, December 17, 2013 4:14 PM
Subject: Re: Estimating the time of my hadoop jobs

Hi Kandoi,
It depends on:
how many cores there are on each VNode
how complicated your analysis application is

But I don't think it's normal to spend 3 hours processing 30GB of data, even 
on your *not good* hardware.

On Tue, Dec 17, 2013 at 6:39 PM, Kandoi, Nikhil wrote:
Hello everyone,

I am new to Hadoop and would like to see if I'm on the right track.
Currently I'm developing an application which will ingest logs on the order 
of 60-70 GB of data per day and then do some analysis on them.
The infrastructure that I have is a 4-node cluster (all nodes on virtual 
machines); all nodes have 4GB of RAM.

But when I try to run a dataset (a sample dataset at this point) of about 
30 GB, it takes about 3 hours to process all of it.

I would like to know whether it is normal for this kind of infrastructure to 
take this amount of time.

Thank you

Nikhil Kandoi
