Re: Inconsistent times in Hadoop web interface

2010-12-27 Thread yipeng
Ahh... I don't manage my cluster but you were spot on. Now I know who to
follow up with.

Thanks!!

Yipeng

On Mon, Dec 27, 2010 at 3:55 PM, Harsh J  wrote:

> Hey,
>
> On Mon, Dec 27, 2010 at 10:44 AM, yipeng  wrote:
> > Hi guys,
> >
> > I am seeing some inconsistent timings in the web interface. The job finish
> > time shown below is 47 secs, but the Map & Reduce phases took significantly
> > longer. I don't think I did anything that could have caused this. Any ideas
> > what might have caused it?
> >
>
> Are your cluster nodes' clocks synced right (via ntpd, etc.)?
>
> --
> Harsh J
> www.harshj.com
>


Re: Hadoop/Elastic MR on AWS

2010-12-27 Thread Sudhir Vallamkondu
We recently crossed this bridge and here are some insights. We did an
extensive study comparing costs and benchmarking a local cluster vs EMR for
our current needs and future trends.

- The scalability you get with EMR is unmatched, although you need to look at
your requirements and decide whether it is something you need.

- When using EMR, it's cheaper to use reserved instances than nodes spun up on
the fly. You can always add more nodes when required. I suggest looking at
your current computing needs, reserving instances for a year or two, using
these to run EMR, and adding nodes at peak times. In your cost estimate you
will need to factor in data transfer time/costs unless you are dealing with
public datasets already on S3.

- EMR fared similarly to the local cluster on CPU benchmarks (we used MRBench
to benchmark map/reduce); however, IO benchmarks were slower on EMR (we used
the DFSIO benchmark). For IO-intensive jobs you will need to add more nodes to
compensate for this.

- Compared to a local cluster, you will also need to factor in the time it
takes for the EMR cluster to set up when starting a job. This includes data
transfer time, cluster replication time, etc.

- The EMR API is very flexible; however, you will need to build a custom
interface on top of it to suit your job management and monitoring needs (a
sketch of driving the API follows this list).

- EMR bootstrap actions can satisfy most of your native library needs, so no
drawbacks there.
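
For illustration only (not part of the original message): a minimal sketch of
driving EMR programmatically with the AWS SDK for Java. The class and method
names below are recalled from the SDK's elasticmapreduce package, and the
bucket names, paths, instance counts, and credentials are placeholders, so
treat the whole thing as an assumption to check against the SDK docs rather
than a definitive recipe.

// Launches a small job flow that stays alive between jobs (the "alive" option
// mentioned later in this thread) and runs one bootstrap action before the step.
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.*;

public class EmrLaunchSketch {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("example-flow")
                .withLogUri("s3://my-bucket/emr-logs/")              // placeholder bucket
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(4)
                        .withMasterInstanceType("m1.large")
                        .withSlaveInstanceType("m1.large")
                        .withKeepJobFlowAliveWhenNoSteps(true))      // keep the cluster up between jobs
                .withBootstrapActions(new BootstrapActionConfig()
                        .withName("install-native-libs")
                        .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
                                .withPath("s3://my-bucket/bootstrap/install-libs.sh")))
                .withSteps(new StepConfig()
                        .withName("wordcount")
                        .withHadoopJarStep(new HadoopJarStepConfig()
                                .withJar("s3://my-bucket/jars/job.jar")
                                .withArgs("s3://my-bucket/input/", "s3://my-bucket/output/")));

        // runJobFlow submits the request; the returned id is what you use later
        // to add or resize instance groups when you need peak capacity.
        System.out.println("Started job flow: " + emr.runJobFlow(request).getJobFlowId());
    }
}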


-- Sudhir


On 12/26/10 5:26 AM, "common-user-digest-h...@hadoop.apache.org"
 wrote:

> From: Otis Gospodnetic 
> Date: Fri, 24 Dec 2010 04:41:46 -0800 (PST)
> To: 
> Subject: Re: Hadoop/Elastic MR on AWS
> 
> Hello Amandeep,
> 
> 
> 
> - Original Message 
>> From: Amandeep Khurana 
>> To: common-user@hadoop.apache.org
>> Sent: Fri, December 10, 2010 1:14:45 AM
>> Subject: Re: Hadoop/Elastic MR on AWS
>> 
>> Mark,
>> 
>> Using EMR makes it very easy to start a cluster and add/reduce capacity as
>> and when required. There are certain optimizations that make EMR an
>> attractive choice compared to building out your own cluster. Using EMR
> 
> 
> Could you please point out what optimizations you are referring to?
> 
> Thanks,
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> Hadoop ecosystem search :: http://search-hadoop.com/
> 
>> also ensures you are using a production-quality, stable system backed by the
>> EMR engineers. You can always use bootstrap actions to put your own tweaked
>> version of Hadoop in there if you want to do that.
>> 
>> Also, you don't have to tear down your cluster after every job. You can set
>> the alive option when you start your cluster and it will stay there even
>> after your Hadoop job completes.
>> 
>> If you face any issues with EMR, send me a mail offline and I'll be happy to
>> help.
>> 
>> -Amandeep
>> 
>> 
>> On Thu, Dec 9,  2010 at 9:47 PM, Mark   wrote:
>> 
>>> Does anyone have any thoughts/experiences on running Hadoop in AWS? What
>>> are some pros/cons?
>>> 
>>> Are there any good AMIs out there for this?
>>> 
>>> Thanks for any advice.
>>> 
>> 






Re: Hadoop/Elastic MR on AWS

2010-12-27 Thread Dave Viner
Hi Sudhir,

Can you publish your findings around pricing, and how you calculated the
various aspects?

This is great information.

Thanks
Dave Viner


On Mon, Dec 27, 2010 at 10:17 AM, Sudhir Vallamkondu <
sudhir.vallamko...@icrossing.com> wrote:

> We recently crossed this bridge and here are some insights. We did an
> extensive study comparing costs and benchmarking a local cluster vs EMR for
> our current needs and future trends.
> [...]


Re: Hadoop/Elastic MR on AWS

2010-12-27 Thread James Seigel
Thank you for sharing.

Sent from my mobile. Please excuse the typos.

On 2010-12-27, at 11:18 AM, Sudhir Vallamkondu
 wrote:

> We recently crossed this bridge and here are some insights. We did an
> extensive study comparing costs and benchmarking a local cluster vs EMR for
> our current needs and future trends.
> [...]


UI doesn't work

2010-12-27 Thread maha
Hi,

  I get a 404 error when I try to use the Hadoop web UI to monitor my job
execution. I'm using Hadoop 0.20.2, and the following are parts of my
configuration files.

in core-site.xml:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://speed.cs.ucsb.edu:9000</value>
  </property>

in mapred-site.xml:

  <property>
    <name>mapred.job.tracker</name>
    <value>speed.cs.ucsb.edu:9001</value>
  </property>


When I try to open http://speed.cs.ucsb.edu:50070/ I get the 404 error.


Any ideas?

  Thank you,
 Maha
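
As an aside (not from the original message): in Hadoop 0.20.x the page on port
50070 is the NameNode (HDFS) UI, while job execution is monitored on the
JobTracker UI, which listens on port 50030 by default. A minimal sketch that
probes both from Java follows; the host name is taken from the configuration
above, the class name is made up for illustration, and the port mapping
assumes the stock defaults.

import java.net.HttpURLConnection;
import java.net.URL;

public class CheckHadoopWebUi {
    public static void main(String[] args) {
        String[] urls = {
            "http://speed.cs.ucsb.edu:50070/",   // NameNode (HDFS) web UI
            "http://speed.cs.ucsb.edu:50030/"    // JobTracker (MapReduce) web UI
        };
        for (String u : urls) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(u).openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                // 200 means the daemon's embedded web server answered; 404 usually
                // means the server is up but the webapp did not deploy; a refused
                // connection or timeout points at the daemon not running or a firewall.
                System.out.println(u + " -> HTTP " + conn.getResponseCode());
                conn.disconnect();
            } catch (Exception e) {
                System.out.println(u + " -> " + e);
            }
        }
    }
}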



Re: UI doesn't work

2010-12-27 Thread James Seigel
Two quick questions first.

Is the job tracker running on that machine?
Is there a firewall in the way?

James

Sent from my mobile. Please excuse the typos.

On 2010-12-27, at 4:46 PM, maha  wrote:

> I get a 404 error when I try to use the Hadoop web UI to monitor my job
> execution. I'm using Hadoop 0.20.2 [...]


Hadoop RPC call response post processing

2010-12-27 Thread Stefan Groschupf
Hi All,
I've been browsing the RPC code for quite a while now, trying to find an entry
point / interceptor slot that allows me to handle an RPC call response Writable
after it has been sent over the wire.
Does anybody have an idea how to break into the RPC code from outside? All the
interesting methods are private. :(

Background:
Heavy use of the RPC layer allocates a huge number of Writable objects. We have
seen in multiple systems that the garbage collector can get so busy that the
JVM almost freezes for seconds. Things like ZooKeeper sessions time out in
those cases.
My idea is to create an object pool for Writables. Borrowing an object from the
pool is simple, since that happens in our custom code; the hard part is knowing
when the Writable response has been sent over the wire so it can be returned to
the pool.
A dirty hack would be to override the write(out) method in the Writable,
assuming that is the last thing done with it, but it turns out that this method
is called in other cases too, e.g. to measure throughput.

Any ideas?

Thanks, 
Stefan
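
To make the pooling idea concrete, here is a minimal sketch (not from the
thread) of the kind of Writable pool being described. It assumes that some hook
(for example the Responder suggestion in the reply below, or custom serving
code) eventually calls release() once the response has actually left the wire;
finding that hook is exactly the open question above.

import java.util.concurrent.ConcurrentLinkedQueue;
import org.apache.hadoop.io.Writable;

public class WritablePool<T extends Writable> {

    /** Factory used when the pool runs dry. */
    public interface Factory<T extends Writable> {
        T newInstance();
    }

    private final ConcurrentLinkedQueue<T> pool = new ConcurrentLinkedQueue<T>();
    private final Factory<T> factory;

    public WritablePool(Factory<T> factory) {
        this.factory = factory;
    }

    /** Borrow a pooled instance, creating a fresh one if none is available. */
    public T acquire() {
        T w = pool.poll();
        return (w != null) ? w : factory.newInstance();
    }

    /** Return an instance once it is known to be fully written out. */
    public void release(T w) {
        pool.offer(w);
    }
}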

Re: UI doesn't work

2010-12-27 Thread Harsh J
I remember facing such an issue with the JT (50030) once. None of the
JSP pages would load, except the index. It was some odd issue with the
webapps not getting loaded right at startup. I don't quite remember
how it got solved.

Did you run any ant operation on your release copy of Hadoop prior to
starting it, by the way?

On Tue, Dec 28, 2010 at 5:15 AM, maha  wrote:
> I get a 404 error when I try to use the Hadoop web UI to monitor my job
> execution. I'm using Hadoop 0.20.2 [...]



-- 
Harsh J
www.harshj.com


Re: Hadoop RPC call response post processing

2010-12-27 Thread Todd Lipcon
Hi Stefan,

Sounds interesting.

Maybe you're looking for o.a.h.ipc.Server$Responder?

-Todd

On Mon, Dec 27, 2010 at 8:07 PM, Stefan Groschupf  wrote:

> Hi All,
> I've been browsing the RPC code for quite a while now, trying to find an
> entry point / interceptor slot that allows me to handle an RPC call response
> Writable after it has been sent over the wire. [...]
>
> Thanks,
> Stefan




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: UI doesn't work

2010-12-27 Thread Adarsh Sharma

maha wrote:

I get a 404 error when I try to use the Hadoop web UI to monitor my job
execution. I'm using Hadoop 0.20.2 [...]

Check the logs of the NameNode and the JobTracker and post their contents.

Best Regards

Adarsh


Re: Hadoop RPC call response post processing

2010-12-27 Thread Ted Dunning
I would be very surprised if allocation itself is the problem, as opposed to
good old-fashioned excess copying.

It is very hard to write an allocator faster than the Java generational GC,
especially if you are talking about objects that are ephemeral.

Have you looked at the tenuring distribution?

On Mon, Dec 27, 2010 at 8:07 PM, Stefan Groschupf  wrote:

> Hi All,
> I've been browsing the RPC code for quite a while now, trying to find an
> entry point / interceptor slot that allows me to handle an RPC call response
> Writable after it has been sent over the wire. [...]
>
> Thanks,
> Stefan
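
As a footnote to the tenuring question (not from the thread): the HotSpot
flags -verbose:gc and -XX:+PrintTenuringDistribution print the tenuring
detail per collection, and the standard java.lang.management API can confirm
how much time the collectors are actually consuming. A minimal sketch follows;
the class name is made up for illustration.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPressureProbe {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // Counts and times are cumulative since JVM start; large jumps in
                // collection time between samples line up with the freezes described above.
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(10000);  // sample every 10 seconds
        }
    }
}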