Re: Any suggestion on performance improvement ?

2008-11-21 Thread Aaron Kimball
It's worth pointing out that Hadoop really isn't designed to run at such a
small scale. Hadoop's performance doesn't really begin to kick in until you've
got tens of GBs of data.

The question is sort of like asking how to make an 18-wheeler run faster when
it's carrying only a single bag of groceries.

There is a large amount of overhead associated with starting a Hadoop job; in
particular, starting a bunch of JVMs. The TaskTrackers only poll for new work
every 10 seconds, so every Hadoop job is going to take at least 10 seconds
before it actually gets slotted onto the worker nodes. The remaining 20 seconds
or so are most likely eaten up by similar overheads. These stop being a factor
once you actually have a sizeable amount of data worth reading.
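
(If you do end up needing to shave some of that per-job JVM startup cost, and
assuming you're on a Hadoop version new enough to have it (roughly 0.19 or
later), one knob to look at is JVM reuse, which lets a single task JVM run
several tasks of the same job instead of forking a fresh JVM per task:

    <property>
      <name>mapred.job.reuse.jvm.num.tasks</name>
      <!-- -1 means reuse the JVM for an unlimited number of tasks of one job -->
      <value>-1</value>
    </property>

It only pays off once a job actually has more than one task, though, so it
won't do much for a 60 MB input.)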

You are correct that adding more nodes won't help. For 60 MB of data, it's
only spawning one task, on one worker node.

You might want to configure Hadoop to run in single-threaded mode on a
single machine, and ditch the cluster entirely. Set 'mapred.job.tracker' to
'local' and 'fs.default.name' to 'file:///some/dir/in/the/local/machine',
and it should run Hadoop entirely within a single JVM.
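
For reference, that would look something like this in hadoop-site.xml (the
file:/// path is just the placeholder from above; point it at whatever local
directory you like):

    <configuration>
      <property>
        <!-- run the MapReduce framework in-process, with no JobTracker -->
        <name>mapred.job.tracker</name>
        <value>local</value>
      </property>
      <property>
        <!-- read and write the local filesystem instead of HDFS -->
        <name>fs.default.name</name>
        <value>file:///some/dir/in/the/local/machine</value>
      </property>
    </configuration>

Since you're running your queries through Pig, you may also be able to get the
same effect for a one-off run with 'pig -x local', depending on your Pig
version.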

- Aaron

On Fri, Nov 14, 2008 at 11:12 AM, souravm [EMAIL PROTECTED] wrote:

 Hi Alex,

 I get 30-40 secs of response time for around 60 MB of data. The number of
 Map and Reduce tasks is 1 each. This is because the default HDFS block size
 is 64 MB and Pig assigns 1 Map task for each HDFS block - I believe that is
 optimal.

 Since this is the unit of performance, I don't think increasing the number
 of nodes would make it any faster.

 Regards,
 Sourav
 -Original Message-
 From: Alex Loddengaard [mailto:[EMAIL PROTECTED]
 Sent: Friday, November 14, 2008 9:44 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Any suggestion on performance improvement ?

 How big is the data that you're loading and filtering?  Your cluster is
 pretty small, so if you have data on the order of tens or hundreds of
 GBs, then the performance you're describing is probably to be expected.
 How many map and reduce tasks are you running on each node?

 Alex

 On Thu, Nov 13, 2008 at 4:55 PM, souravm [EMAIL PROTECTED] wrote:

  Hi,
 
  I'm testing with a 4 node setup of Hadoop hdfs.
 
  Each node has 2 GB of memory, a dual-core CPU, and around 30-60 GB of disk
  space.
 
  I've kept files of different sizes in HDFS, ranging from 10 MB to 5 GB.
 
  I'm querying those files using Pig. What I'm seeing is that even a simple
  select query (LOAD and FILTER) takes at least 30-40 seconds. The map process
  on one node takes at least 25 seconds.
 
  I've set the JVM max heap size to 1024m.
 
  Any suggestions on how to improve the performance with different
  configuration at the Hadoop level (by changing HDFS and MapReduce
  parameters)?
 
  Regards,
  Sourav
 
 



Re: Any suggestion on performance improvement ?

2008-11-14 Thread Alex Loddengaard
How big is the data that you're loading and filtering?  Your cluster is
pretty small, so if you have data on the order of tens or hundreds of
GBs, then the performance you're describing is probably to be expected.
How many map and reduce tasks are you running on each node?

Alex

On Thu, Nov 13, 2008 at 4:55 PM, souravm [EMAIL PROTECTED] wrote:

 Hi,

 I'm testing with a 4 node setup of Hadoop hdfs.

 Each node has 2 GB of memory, a dual-core CPU, and around 30-60 GB of disk
 space.

 I've kept files of different sizes in HDFS, ranging from 10 MB to 5 GB.

 I'm querying those files using Pig. What I'm seeing is that even a simple
 select query (LOAD and FILTER) takes at least 30-40 seconds. The map process
 on one node takes at least 25 seconds.

 I've set the JVM max heap size to 1024m.

 Any suggestions on how to improve the performance with different
 configuration at the Hadoop level (by changing HDFS and MapReduce parameters)?

 Regards,
 Sourav




RE: Any suggestion on performance improvement ?

2008-11-14 Thread souravm
Hi Alex,

I get 30-40 secs of response time for around 60 MB of data. The number of Map
and Reduce tasks is 1 each. This is because the default HDFS block size is 64 MB
and Pig assigns 1 Map task for each HDFS block - I believe that is optimal.
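
(If you want to confirm that, fsck will show the block layout of the file; the
path below is just an example, substitute the actual location of the 60 MB
file:

    hadoop fsck /path/to/the/60MB/file -files -blocks -locations

With the default 64 MB block size it should report exactly one block, which is
why there is exactly one map task.)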

Since this is the unit of performance, I don't think increasing the number of
nodes would make it any faster.

Regards,
Sourav
-Original Message-
From: Alex Loddengaard [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 14, 2008 9:44 AM
To: core-user@hadoop.apache.org
Subject: Re: Any suggestion on performance improvement ?

How big is the data that you're loading and filtering?  Your cluster is
pretty small, so if you have data on the order of tens or hundreds of
GBs, then the performance you're describing is probably to be expected.
How many map and reduce tasks are you running on each node?

Alex

On Thu, Nov 13, 2008 at 4:55 PM, souravm [EMAIL PROTECTED] wrote:

 Hi,

 I'm testing with a 4 node setup of Hadoop hdfs.

 Each node has 2 GB of memory, a dual-core CPU, and around 30-60 GB of disk
 space.

 I've kept files of different sizes in HDFS, ranging from 10 MB to 5 GB.

 I'm querying those files using Pig. What I'm seeing is that even a simple
 select query (LOAD and FILTER) takes at least 30-40 seconds. The map process
 on one node takes at least 25 seconds.

 I've set the JVM max heap size to 1024m.

 Any suggestions on how to improve the performance with different
 configuration at the Hadoop level (by changing HDFS and MapReduce parameters)?

 Regards,
 Sourav
