I saw this discussion start a few days ago but didn't pay much attention to it. This morning I came across some of these messages and had a good laugh; too much drama. Based on my experience, there are some risks in using Hadoop.

1) Not real-time or mission-critical. You can consider Hadoop a good workhorse for offline processing and a good framework for large-scale data analysis and data processing; however, many factors affect Hadoop jobs. Even the most well-written and robust code can fail because of exceptional hardware or network problems. A small defensive-configuration sketch follows below.
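As a rough, hedged illustration (the property names are those used in the 0.20.x line, and the class and job names are made up), you can at least tell the framework to retry flaky tasks more aggressively, so a transient disk or network hiccup gets retried instead of failing the whole job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class RobustJobConf {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow more retries per task (the default is 4) so transient
        // hardware or network failures are retried rather than fatal.
        conf.setInt("mapred.map.max.attempts", 8);
        conf.setInt("mapred.reduce.max.attempts", 8);
        // Optionally tolerate a small percentage of failed map tasks if the
        // analysis can live with slightly incomplete input.
        conf.setInt("mapred.max.map.failures.percent", 5);

        Job job = new Job(conf, "robust offline job");
        // ... set mapper/reducer, input and output paths, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }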

2) Don't put too much hope in efficiency. Hadoop can do jobs that used to be impossible, but maybe not as fast as you imagine; there is no magic by which Hadoop finishes everything in a blink. The usual, safer approach is to break your entire large job into several pieces and save the data back to HDFS step by step. In this fashion Hadoop can really get huge jobs done, but it still requires a lot of manual effort; see the two-stage driver sketch below.
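To make that concrete, here is a minimal driver sketch of that break-it-into-pieces style (class names and argument layout are hypothetical): two jobs run back to back, with the intermediate result checkpointed on HDFS so a failure in stage 2 never forces a rerun of stage 1.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoStageDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path checkpoint = new Path(args[1]);   // intermediate output kept on HDFS
        Path output = new Path(args[2]);

        // Stage 1: do the first chunk of work and persist its output.
        Job stage1 = new Job(conf, "stage 1");
        stage1.setJarByClass(TwoStageDriver.class);
        // ... set mapper/reducer and key/value classes for stage 1 here ...
        FileInputFormat.addInputPath(stage1, input);
        FileOutputFormat.setOutputPath(stage1, checkpoint);
        if (!stage1.waitForCompletion(true)) System.exit(1);

        // Stage 2: restarts from the checkpoint, not from the raw input.
        Job stage2 = new Job(conf, "stage 2");
        stage2.setJarByClass(TwoStageDriver.class);
        // ... set mapper/reducer and key/value classes for stage 2 here ...
        FileInputFormat.addInputPath(stage2, checkpoint);
        FileOutputFormat.setOutputPath(stage2, output);
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
      }
    }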

3) No integrated workflow or open-source multi-user & administrative platform. This point is connected to the previous one, because once a huge Hadoop job has started, especially a statistical analysis or machine learning task that requires many iterations, manual care is indispensable. As far as I know, there is still no integrated workflow management system built for Hadoop tasks. Moreover, if you have a private cluster running Hadoop jobs, coordinating multiple users can be a problem. For a small group a shared schedule board is necessary; for a large group there can be a huge amount of work to configure hardware and virtual machines. In our experience, optimizing cluster performance for Hadoop is non-trivial and we ran into quite a lot of problems. Amazon EC2 is a good choice, but running long, large jobs there can be quite expensive.

4) Think about your problem carefully in a key-value fashion and try to minimize the use of reducers. Hadoop is essentially the shuffle, sort, and aggregation of key-value pairs. Many practical problems can easily be transformed into a key-value data structure, and much of the work can be done with mappers only. Don't jump into the reducer too early; just tag all the data with a simple key of a few bytes and finish as many mapper-only tasks as possible. In this way you avoid many unnecessary sort and aggregation steps; a mapper-only sketch follows below.
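A mapper-only job is just a normal job with the reduce count set to zero, so the map output is written straight to HDFS and the shuffle/sort is skipped entirely. A minimal sketch (the filter condition and class names are made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyFilter {

      public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text tag = new Text("e");   // tiny key of a few bytes

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          if (line.toString().contains("ERROR")) {   // hypothetical filter
            context.write(tag, line);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "map-only filter");
        job.setJarByClass(MapOnlyFilter.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);                 // no reducer: no shuffle, no sort
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }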

Shi



On 9/21/2011 1:01 PM, GOEKE, MATTHEW (AG/1000) wrote:
I would completely agree with Mike's comments, with one addition: Hadoop centers 
around manipulating the flow of data in a way that makes the framework work 
for your specific problem. There are recipes for common problems, but depending 
on your domain those might solve only 30-40% of your use cases. It should take 
little to no time for a good Java dev to understand how to write an MR program. 
It will take significantly more time for that Java dev to understand the domain 
and Hadoop well enough to consistently write *good* MR programs. Mike listed 
some great ways to cut down on that curve, but you really want someone who has 
not only an affinity for code but can also apply critical thinking to how 
you should pipeline your data. If you plan on using it purely with Pig/Hive 
abstractions on top, then this concern is significantly reduced.

Some might disagree, but that is my $0.02.
Matt

-----Original Message-----
From: Michael Segel [mailto:michael_se...@hotmail.com]
Sent: Wednesday, September 21, 2011 12:48 PM
To: common-user@hadoop.apache.org
Subject: RE: risks of using Hadoop


Kobina

Points 1 and 2 are definitely real risks. SPOF is not.

As I pointed out in my mini-rant to Tom, the end users / developers who use 
the cluster can do more harm to your cluster than a SPOF machine failure.

I don't know what one would consider a 'long learning curve'. With the adoption 
of any new technology, you're talking at least 3-6 months based on the 
individual and the overall complexity of the environment.

Take anyone who is a strong developer, put them through Cloudera's training, 
plus some play time, and you've shortened the learning curve.
The better the java developer, the easier it is for them to pick up Hadoop.

I would also suggest taking the approach of hiring a senior person who can 
cross train and mentor your staff. This too will shorten the runway.

HTH

-Mike


Date: Wed, 21 Sep 2011 17:02:45 +0100
Subject: Re: risks of using Hadoop
From: kobina.kwa...@gmail.com
To: common-user@hadoop.apache.org

Jignesh,

Will your point 2 still be valid if we hire very experienced Java
programmers?

Kobina.

On 20 September 2011 21:07, Jignesh Patel<jign...@websoft.com>  wrote:

@Kobina
1. Lack of skill set
2. Longer learning curve
3. Single point of failure


@Uma
I am curious to know about 0.20.2: is it stable? Is it the same as the one you
mention in your email (the Federation changes)? If I need a scalable NameNode and
append support, which version should I choose?

Regarding the single point of failure, I believe Hortonworks (a.k.a. Yahoo) is
updating the Hadoop API. When will that be integrated with Hadoop?

If I need


-Jignesh

On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote:

Hi Kobina,

Some experiences which may be helpful for you with respect to HDFS.

1. Selecting the correct version.
    I recommend using a 0.20.x version. It is a pretty stable line, well tested,
and most other organizations prefer it.
Don't go for the 0.21 version; it is not stable, and using it is a risk.

2. You should perform thorough testing with your customer operations
  (of course you will do this :-)).

3. The 0.20.x versions have the SPOF problem.
   If the NameNode goes down you will lose data. One way of recovering is
by using the SecondaryNameNode: you can recover the data up to the last
checkpoint, but manual intervention is required.
In the latest trunk, the SPOF will be addressed by HDFS-1623.

4. 0.20.x NameNodes cannot scale. Federation changes are included in later
versions (I think in 0.22). This may not be a problem for your cluster,
but please consider this aspect as well.
5. Please select the Hadoop version depending on your security
requirements. There are security-enabled releases in the 0.20.x line as well.
6. If you plan to use HBase, it requires append support. The 0.20-append branch has
support for append; the 0.20.205 release will also have append support but is not
yet released. Choose the correct version to avoid sudden surprises.


Regards,
Uma
----- Original Message -----
From: Kobina Kwarko<kobina.kwa...@gmail.com>
Date: Saturday, September 17, 2011 3:42 am
Subject: Re: risks of using Hadoop
To: common-user@hadoop.apache.org

We are planning to use Hadoop in my organisation for quality-of-service
analysis of CDR records from mobile operators. We are thinking of having
a small cluster of maybe 10-15 nodes and I'm preparing the proposal. My
office requires that I provide some risk analysis in the proposal.

thank you.

On 16 September 2011 20:34, Uma Maheswara Rao G 72686
<mahesw...@huawei.com>wrote:

Hello,

First of all, where are you planning to use Hadoop?

Regards,
Uma
----- Original Message -----
From: Kobina Kwarko<kobina.kwa...@gmail.com>
Date: Saturday, September 17, 2011 0:41 am
Subject: risks of using Hadoop
To: common-user<common-user@hadoop.apache.org>

Hello,

Please can someone point out some of the risks we may incur if we
decide to implement Hadoop?

BR,

Isaac.


                                        