RE: Why Hadoop is slow in Cloud

2011-01-21 Thread Evert Lammerts
 Even with the performance hit, there are still benefits to running Hadoop
 this way
   -as you only consume/pay for the CPU time you use, if you are only
 running batch jobs, it's lower cost than having a Hadoop cluster that is
 under-used.
 
   -if your data is stored in the cloud infrastructure, then you need to
 data mine it in VMs, unless you want to take the time and money hit of
 moving it out, and have somewhere to store it.
 
 -if the infrastructure lets you, you can lock down the cluster so it is
 secure.
 
 Where a physical cluster is good is that it is a very low cost way of
 storing data, provided you can analyse it with Hadoop, and provided you
 can keep that cluster busy most of the time, either with Hadoop work or
 other scheduled work. If your cluster is idle for computation, you are
 still paying the capital and (reduced) electricity costs, so the cost of
 storage and what compute you do effectively increases.

Agreed, but this has little to do with Hadoop as a middleware and more to do
with the benefits of virtualized vs physical infrastructure. I agree that it
is convenient to use HDFS as a DFS to keep your data local to your VMs, but
you could choose other DFSs as well.

The major benefit of Hadoop is its data-locality principle, and this is what
you give up when you move to the cloud. Regardless of whether you store your
data within your VM or on a NAS, it *will* have to travel over a line. As
soon as that happens you lose the benefit of data-locality and are left with
MapReduce as a way of doing parallel computing. And in that case you could
use less restrictive software, like maybe PBS. You could even install HOD
(Hadoop On Demand) on your virtual cluster, if you'd like the possibility
of MapReduce.

Adarsh, there are probably results around from more generic benchmark tools
(Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those should
give you a better idea of the penalties of virtualization. (Our experience
with a number of technologies on our OpenNebula cloud is, like Steve points
out, that you mainly pay for disk I/O performance.)
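
A rough first number can be had, before reaching for a full benchmark suite, by timing a sequential write and read on a VM and on a physical host. The sketch below is only an illustration: the function name, file size and block size are assumptions, and the read figure will be inflated by the page cache unless the file is larger than RAM or the cache is dropped first.

```python
import os
import tempfile
import time


def sequential_io_mb_per_s(directory, size_mb=32, block_kb=1024):
    """Time a sequential write (with fsync) and read of a temp file.

    Returns (write_mb_per_s, read_mb_per_s). Hypothetical helper, not part
    of any benchmark suite mentioned in this thread.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push the data down to the (virtual) disk
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(len(block)):
                pass
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)
    return size_mb / write_s, size_mb / read_s


if __name__ == "__main__":
    w, r = sequential_io_mb_per_s(tempfile.gettempdir())
    print(f"write {w:.1f} MB/s, read {r:.1f} MB/s")
```

Run it with the same parameters in a VM and on the host, pointing `directory` at the disk Hadoop would use, and compare the two pairs of numbers.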

I think a decision to go with either cloud or physical infrastructure should
be based on the frequency, intensity and types of computation you expect in
the short term (that should include operations dealing with data), and the
way you think these parameters will develop over the mid to long term. Then
compare the price of a physical cluster that meets those demands (make sure
to include power and operations) with the investment you would otherwise
need to make in the cloud.
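
The comparison can be sketched as a cost per hour of useful work. This is a minimal sketch under stated assumptions: the function names and every figure passed in are hypothetical, capital is amortised linearly, and operating cost is treated as constant even though an idle cluster draws somewhat less power.

```python
def physical_cost_per_busy_hour(capex, lifetime_years, opex_per_year, utilization):
    """Cost of one hour of useful work on owned hardware.

    capex is amortised linearly over lifetime_years; opex_per_year covers
    power and operations. utilization is the fraction of hours in which
    the cluster does useful work (0 < utilization <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hours_per_year = 365 * 24
    yearly_cost = capex / lifetime_years + opex_per_year
    return yearly_cost / (hours_per_year * utilization)


def break_even_utilization(capex, lifetime_years, opex_per_year, cloud_price_per_hour):
    """Utilization above which owning is cheaper than renting by the hour."""
    hours_per_year = 365 * 24
    yearly_cost = capex / lifetime_years + opex_per_year
    return yearly_cost / (hours_per_year * cloud_price_per_hour)


if __name__ == "__main__":
    # Illustrative numbers only: one node, 3-year lifetime.
    print(physical_cost_per_busy_hour(6000, 3, 1000, utilization=0.3))
    print(break_even_utilization(6000, 3, 1000, cloud_price_per_hour=0.68))
```

A break-even value above 1 means renting is cheaper even for a cluster that is busy all the time, at least under these simplified assumptions.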




Re: Why Hadoop is slow in Cloud

2011-01-21 Thread Steve Loughran

On 20/01/11 23:24, Marc Farnum Rendino wrote:

On Wed, Jan 19, 2011 at 2:50 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

As for virtualization, paravirtualization, emulation (whatever-ulization)


Wow; that's a really big category.


There are always a lot of variables, but the net result is always
less. It may be 2%, 10% or 15%, but it is always less.


If it's less of something I don't care about, it's not a factor (for me).

On the other hand, if I'm paying less and getting more of what I DO
care about, I'd rather go with that.

It's about the cost/benefit *ratio*.


There's also perf vs storage. On a big cluster, you could add a second 
Nehalem CPU and maybe get 10-15% boost on throughput, or for the same 
capex and opex add 10% new servers, which at scale means many more TB of 
storage and the compute to go with it. The decision rests with the team 
and their problems.
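
The trade-off can be put side by side in a few lines. This is a sketch only: the function name, the cluster size and the per-server storage are hypothetical, and it assumes throughput scales linearly with server count.

```python
def compare_upgrades(servers, tb_per_server, cpu_boost=0.10, extra_server_fraction=0.10):
    """Contrast the two upgrade options for the same spend.

    Assumes throughput scales linearly with server count, which is a
    simplification; all default figures are illustrative.
    """
    extra_servers = servers * extra_server_fraction
    return {
        "second_cpu": {"throughput_gain": cpu_boost, "extra_storage_tb": 0.0},
        "more_servers": {
            "throughput_gain": extra_server_fraction,
            "extra_storage_tb": extra_servers * tb_per_server,
        },
    }


if __name__ == "__main__":
    # Hypothetical cluster: 1000 servers with 8 TB each.
    for option, gains in compare_upgrades(1000, 8).items():
        print(option, gains)
```

Both options give a similar throughput gain in this model; only the second one also adds storage.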


Re: Why Hadoop is slow in Cloud

2011-01-21 Thread Steve Loughran

On 21/01/11 09:20, Evert Lammerts wrote:

Even with the performance hit, there are still benefits to running Hadoop
this way
   -as you only consume/pay for the CPU time you use, if you are only
running batch jobs, it's lower cost than having a Hadoop cluster that is
under-used.

   -if your data is stored in the cloud infrastructure, then you need to
data mine it in VMs, unless you want to take the time and money hit of
moving it out, and have somewhere to store it.

-if the infrastructure lets you, you can lock down the cluster so it is
secure.

Where a physical cluster is good is that it is a very low cost way of
storing data, provided you can analyse it with Hadoop, and provided you
can keep that cluster busy most of the time, either with Hadoop work or
other scheduled work. If your cluster is idle for computation, you are
still paying the capital and (reduced) electricity costs, so the cost of
storage and what compute you do effectively increases.


Agreed, but this has little to do with Hadoop as a middleware and more to do
with the benefits of virtualized vs physical infrastructure. I agree that it
is convenient to use HDFS as a DFS to keep your data local to your VMs, but
you could choose other DFSs as well.


We don't use HDFS, we bring up VMs close to where the data persists.

http://www.slideshare.net/steve_l/high-availability-hadoop



The major benefit of Hadoop is its data-locality principle, and this is what
you give up when you move to the cloud. Regardless of whether you store your
data within your VM or on a NAS, it *will* have to travel over a line. As
soon as that happens you lose the benefit of data-locality and are left with
MapReduce as a way of doing parallel computing. And in that case you could
use less restrictive software, like maybe PBS. You could even install HOD
(Hadoop On Demand) on your virtual cluster, if you'd like the possibility
of MapReduce.


We don't suffer locality hits so much, but you do pay for the extra
infrastructure costs of a more agile datacentre, and if you go for
redundancy in hardware over replication, you have fewer places to run
your code.


Even on EC2, which doesn't let you tell it what datasets you want to
play with for its VM placer to use in its decisions, once data is in the
datanodes you do get locality.




Adarsh, there are probably results around from more generic benchmark tools
(Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those should
give you a better idea of the penalties of virtualization. (Our experience
with a number of technologies on our OpenNebula cloud is, like Steve points
out, that you mainly pay for disk I/O performance.)


-would be interesting to see anything you can publish there...



I think a decision to go with either cloud or physical infrastructure should
be based on the frequency, intensity and types of computation you expect in
the short term (that should include operations dealing with data), and the
way you think these parameters will develop over the mid to long term. Then
compare the price of a physical cluster that meets those demands (make sure
to include power and operations) with the investment you would otherwise
need to make in the cloud.


+1



Re: Why Hadoop is slow in Cloud

2011-01-20 Thread Marc Farnum Rendino
On Wed, Jan 19, 2011 at 2:50 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
 As for virtualization, paravirtualization, emulation (whatever-ulization)

Wow; that's a really big category.

 There are always a lot of variables, but the net result is always
 less. It may be 2%, 10% or 15%, but it is always less.

If it's less of something I don't care about, it's not a factor (for me).

On the other hand, if I'm paying less and getting more of what I DO
care about, I'd rather go with that.

It's about the cost/benefit *ratio*.


Re: Why Hadoop is slow in Cloud

2011-01-19 Thread Marc Farnum Rendino
On Tue, Jan 18, 2011 at 8:59 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote:
 I want to know *AT WHAT COST* it comes.
 10-15% is tolerable, but at this rate it needs some work.

 As Steve rightly suggests, I am doing some CPU-bound testing to get the
 exact stats.

Yep; you've got to test your own workflow to see how it's affected by
your conditions - lots of variables.

BTW: For AWS (Amazon) there are significant differences in I/O, for
different instance types; if I recall correctly, for best I/O, start
no lower than m1.large. And the three storage types (instance, EBS,
and S3) have different characteristics as well; I'd start with EBS,
though I haven't worked much with S3 yet, and that does offer some
benefits.


Re: Why Hadoop is slow in Cloud

2011-01-19 Thread Edward Capriolo
On Wed, Jan 19, 2011 at 1:32 PM, Marc Farnum Rendino mvg...@gmail.com wrote:
 On Tue, Jan 18, 2011 at 8:59 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote:
 I want to know *AT WHAT COST* it comes.
 10-15% is tolerable, but at this rate it needs some work.

 As Steve rightly suggests, I am doing some CPU-bound testing to get the
 exact stats.

 Yep; you've got to test your own workflow to see how it's affected by
 your conditions - lots of variables.

 BTW: For AWS (Amazon) there are significant differences in I/O, for
 different instance types; if I recall correctly, for best I/O, start
 no lower than m1.large. And the three storage types (instance, EBS,
 and S3) have different characteristics as well; I'd start with EBS,
 though I haven't worked much with S3 yet, and that does offer some
 benefits.

As for virtualization, paravirtualization, emulation (whatever-ulization):
there are always a lot of variables, but the net result is always
less. It may be 2%, 10% or 15%, but it is always less. Take a $50,000
server: such a solution takes 10% performance right off the top. There
goes $5,000.00 of performance right out the window. I never thought
throwing away performance was acceptable (I was born without a silver
SSD in my crib). Plus some people even pay for virtualization
software (vendors will remain nameless). Truly paying for less.
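
The arithmetic behind that figure, for the three overheads mentioned, can be written out as below. This is a sketch of the framing used above, in which purchase price stands in for performance; the function name is hypothetical.

```python
def performance_value_lost(server_price, overhead_percent):
    """Dollar value of lost performance, treating price as a proxy for performance."""
    return server_price * overhead_percent // 100


for pct in (2, 10, 15):
    lost = performance_value_lost(50_000, pct)
    print(f"{pct}% overhead on a $50,000 server: ${lost:,}")
```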


Re: Why Hadoop is slow in Cloud

2011-01-18 Thread Marc Farnum Rendino
Virtualization != Emulation

Yes, virtualization does have its own costs (as does running directly
on hardware) - depending on the specifics of both the virtualization
*and* the task at hand.

If my task (in the general sense) is CPU bound, it doesn't matter (to
me) if the virtualization has a disk I/O penalty.

If on the other hand, my task is limited by a disk I/O penalty, I'll
weigh that into the *total* cost/benefit, and virtualization may not -
or may still - be an advantageous choice.

Context is king.
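
One way to make that weighing concrete is a simple weighted model: split the bare-metal runtime into fractions per resource and apply a separate slowdown to each. This is only a sketch under assumptions: the function name is hypothetical, the fractions and slowdown factors below are made-up illustrations, and the model ignores any overlap between CPU and I/O.

```python
def virtualized_runtime_ratio(time_fractions, slowdowns):
    """Ratio of virtualized to bare-metal runtime.

    time_fractions maps a resource name to the share of bare-metal runtime
    spent on it (shares must sum to 1). slowdowns maps a resource name to
    its slowdown factor in the VM (1.0 means no penalty); resources that
    are not listed are assumed to carry no penalty.
    """
    if abs(sum(time_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(
        share * slowdowns.get(resource, 1.0)
        for resource, share in time_fractions.items()
    )


if __name__ == "__main__":
    penalties = {"cpu": 1.05, "disk": 2.5}  # illustrative, not measured
    cpu_bound = {"cpu": 0.9, "disk": 0.1}
    io_bound = {"cpu": 0.2, "disk": 0.8}
    print("CPU-bound job:", virtualized_runtime_ratio(cpu_bound, penalties))
    print("I/O-bound job:", virtualized_runtime_ratio(io_bound, penalties))
```

With the same penalties, the CPU-bound job barely notices the disk penalty while the I/O-bound job is dominated by it, which is why the workload has to be measured before deciding.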

On Mon, Jan 17, 2011 at 10:41 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
 Everything you emulate you cut X% performance right off the top...


Re: Why Hadoop is slow in Cloud

2011-01-18 Thread Adarsh Sharma

Marc Farnum Rendino wrote:

Virtualization != Emulation

Yes, virtualization does have its own costs (as does running directly
on hardware) - depending on the specifics of both the virtualization
*and* the task at hand.
  

Absolutely right, and this is why I performed the initial testing.

I want to know *AT WHAT COST* it comes.
10-15% is tolerable, but at this rate it needs some work.

As Steve rightly suggests, I am doing some CPU-bound testing to get
the exact stats.


I will let you know after the work.


If my task (in the general sense) is CPU bound, it doesn't matter (to
me) if the virtualization has a disk I/O penalty.
  


But is it possible to tune the workflow of the VMs to improve
performance, or not?


If on the other hand, my task is limited by a disk I/O penalty, I'll

weigh that into the *total* cost/benefit, and virtualization may not -
or may still - be an advantageous choice.

  

Some reasons for the slowness would be highly helpful. Any guidance is appreciated.


Context is king.

  

Thanks and best regards

Adarsh Sharma


On Mon, Jan 17, 2011 at 10:41 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
  

Everything you emulate you cut X% performance right off the top...





Re: Why Hadoop is slow in Cloud

2011-01-17 Thread Steve Loughran

On 17/01/11 04:11, Adarsh Sharma wrote:

Dear all,

Yesterday I ran a comparison test between *Hadoop on Standalone
Servers* and *Hadoop in Cloud*.

I set up a Hadoop cluster of 4 nodes (standalone machines) in
which one node acts as master (Namenode, Jobtracker) and the remaining
nodes act as slaves (Datanodes, Tasktrackers).
On the other hand, for testing Hadoop in the *Cloud* (Eucalyptus), I made
one standalone machine the *Hadoop Master* and configured the slaves
on VMs in the cloud.

I am confused about the stats obtained from the testing. What I
concluded is that the VMs give half the performance of the
standalone servers.


Interesting stats, nothing that massively surprises me, especially as 
your benchmarks are very much streaming through datasets. If you were 
doing something more CPU intensive (graph work, for example), things 
wouldn't look so bad


I've done stuff in this area.
http://www.slideshare.net/steve_l/farming-hadoop-inthecloud





I expected some slowdown, but never at this level. Is
this genuine, or could there be some configuration problem?

I am using a 1 Gb (10-1000 Mb/s) LAN on the VM machines and 100 Mb/s on the
standalone servers.

Please have a look at the results and, if interested, comment on them.




The big killer here is file IO: with today's HDD controllers and virtual
filesystems, disk IO in a VM is way underpowered compared to physical disk
IO. Networking is reduced (but improving), and CPU can be pretty good, but
disk is bad.



Why?

1. Every access to a block in the VM is turned into virtual disk
controller operations, which are then interpreted by the VDC and turned
into reads/writes in the virtual disk drive.

2. Those reads/writes are then turned into seeks, reads and writes in the
physical hardware.

Some workarounds

-allocate physical disks for the HDFS filesystem, for the duration of 
the VMs.


-have the local hosts serve up a bit of their filesystem on a fast 
protocol (like NFS), and have every VM mount the local physical NFS 
filestore as their hadoop data dirs.
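
Either workaround ends with the datanode's storage directories pointing at the physical or host-served mount. The sketch below renders a minimal `hdfs-site.xml` for that; the mount paths and the helper name are hypothetical, and `dfs.data.dir` is the property name used by Hadoop releases of this era (later versions call it `dfs.datanode.data.dir`), so check the defaults for the version in use.

```python
import xml.etree.ElementTree as ET


def hdfs_site_xml(data_dirs, property_name="dfs.data.dir"):
    """Render a minimal hdfs-site.xml that points the datanode at data_dirs.

    data_dirs is a list of directories on the physical disks or on the
    host-served mount; Hadoop takes them as one comma-separated value.
    """
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = property_name
    ET.SubElement(prop, "value").text = ",".join(data_dirs)
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical mount points inside the VM.
    print(hdfs_site_xml(["/mnt/hostdisk1/hdfs", "/mnt/hostdisk2/hdfs"]))
```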




Re: Why Hadoop is slow in Cloud

2011-01-17 Thread Edward Capriolo
On Mon, Jan 17, 2011 at 6:08 AM, Steve Loughran ste...@apache.org wrote:
 On 17/01/11 04:11, Adarsh Sharma wrote:

 Dear all,

 Yesterday I ran a comparison test between *Hadoop on Standalone
 Servers* and *Hadoop in Cloud*.

 I set up a Hadoop cluster of 4 nodes (standalone machines) in
 which one node acts as master (Namenode, Jobtracker) and the remaining
 nodes act as slaves (Datanodes, Tasktrackers).
 On the other hand, for testing Hadoop in the *Cloud* (Eucalyptus), I made
 one standalone machine the *Hadoop Master* and configured the slaves
 on VMs in the cloud.

 I am confused about the stats obtained from the testing. What I
 concluded is that the VMs give half the performance of the
 standalone servers.

 Interesting stats, nothing that massively surprises me, especially as your
 benchmarks are very much streaming through datasets. If you were doing
 something more CPU intensive (graph work, for example), things wouldn't look
 so bad

 I've done stuff in this area.
 http://www.slideshare.net/steve_l/farming-hadoop-inthecloud




 I expected some slowdown, but never at this level. Is
 this genuine, or could there be some configuration problem?

 I am using a 1 Gb (10-1000 Mb/s) LAN on the VM machines and 100 Mb/s on the
 standalone servers.

 Please have a look at the results and, if interested, comment on them.



 The big killer here is file IO: with today's HDD controllers and virtual
 filesystems, disk IO in a VM is way underpowered compared to physical disk
 IO. Networking is reduced (but improving), and CPU can be pretty good, but
 disk is bad.


 Why?

 1. Every access to a block in the VM is turned into virtual disk controller
 operations, which are then interpreted by the VDC and turned into
 reads/writes in the virtual disk drive.

 2. Those reads/writes are then turned into seeks, reads and writes in the
 physical hardware.

 Some workarounds

 -allocate physical disks for the HDFS filesystem, for the duration of the
 VMs.

 -have the local hosts serve up a bit of their filesystem on a fast protocol
 (like NFS), and have every VM mount the local physical NFS filestore as
 their hadoop data dirs.



Q: Why is my Nintendo emulator slow on an 800 MHz computer made 10
years after Nintendo?
A: Emulation

Everything you emulate you cut X% performance right off the top.

Emulation is great when you want to run Mac on Windows or FreeBSD on
Linux or Nintendo on Linux. However, most people would do better with
technologies that use kernel-level isolation such as Linux containers,
Solaris Zones, Linux VServer (my favorite) http://linux-vserver.org/,
User Mode Linux or similar technologies that ISOLATE rather than
EMULATE.

Sorry, list, I feel I rant about this bi-annually. I have just always
been so shocked about how many people get lured into cloud and
virtualized solutions for better management and near native
performance.


Why Hadoop is slow in Cloud

2011-01-16 Thread Adarsh Sharma

Dear all,

Yesterday I ran a comparison test between *Hadoop on Standalone
Servers* and *Hadoop in Cloud*.

I set up a Hadoop cluster of 4 nodes (standalone machines) in
which one node acts as master (Namenode, Jobtracker) and the remaining
nodes act as slaves (Datanodes, Tasktrackers).
On the other hand, for testing Hadoop in the *Cloud* (Eucalyptus), I made
one standalone machine the *Hadoop Master* and configured the slaves
on VMs in the cloud.

I am confused about the stats obtained from the testing. What I
concluded is that the VMs give half the performance of the
standalone servers.

I expected some slowdown, but never at this level. Is
this genuine, or could there be some configuration problem?

I am using a 1 Gb (10-1000 Mb/s) LAN on the VM machines and 100 Mb/s on the
standalone servers.

Please have a look at the results and, if interested, comment on them.



Thanks and regards

Adarsh Sharma


hadoop_testing_new.ods
Description: application/vnd.oasis.opendocument.spreadsheet