I saw this discussion start a few days ago but didn't pay much attention to it. This morning I came across some of these messages and had a good laugh; too much drama. Based on my experience, there are some risks in using Hadoop.

1) Not real-time or mission-critical. You can consider Hadoop a good workhorse for offline processing and a good framework for large-scale data analysis and data processing; however, many factors affect Hadoop jobs. Even the most well-written and robust code can fail because of exceptional hardware or network problems. A small defensive-configuration sketch follows below.
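As a rough, hedged illustration (the property names are those used in the 0.20.x line, and the class and job names are made up), you can at least tell the framework to retry flaky tasks more aggressively, so a transient disk or network hiccup gets retried instead of failing the whole job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class RobustJobConf {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow more retries per task (the default is 4) so transient
        // hardware or network failures are retried rather than fatal.
        conf.setInt("mapred.map.max.attempts", 8);
        conf.setInt("mapred.reduce.max.attempts", 8);
        // Optionally tolerate a small percentage of failed map tasks if the
        // analysis can live with slightly incomplete input.
        conf.setInt("mapred.max.map.failures.percent", 5);

        Job job = new Job(conf, "robust offline job");
        // ... set mapper/reducer, input and output paths, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }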

2) Don't put too much hope in efficiency. Hadoop can do jobs that used to be impossible, but maybe not as fast as you imagine; there is no magic by which Hadoop finishes everything in a blink. The usual, safer approach is to break your entire large job into several pieces and save the data back to HDFS step by step. In this fashion Hadoop can really get huge jobs done, but it still requires a lot of manual effort; see the two-stage driver sketch below.
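To make that concrete, here is a minimal driver sketch of that break-it-into-pieces style (class names and argument layout are hypothetical): two jobs run back to back, with the intermediate result checkpointed on HDFS so a failure in stage 2 never forces a rerun of stage 1.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoStageDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path checkpoint = new Path(args[1]);   // intermediate output kept on HDFS
        Path output = new Path(args[2]);

        // Stage 1: do the first chunk of work and persist its output.
        Job stage1 = new Job(conf, "stage 1");
        stage1.setJarByClass(TwoStageDriver.class);
        // ... set mapper/reducer and key/value classes for stage 1 here ...
        FileInputFormat.addInputPath(stage1, input);
        FileOutputFormat.setOutputPath(stage1, checkpoint);
        if (!stage1.waitForCompletion(true)) System.exit(1);

        // Stage 2: restarts from the checkpoint, not from the raw input.
        Job stage2 = new Job(conf, "stage 2");
        stage2.setJarByClass(TwoStageDriver.class);
        // ... set mapper/reducer and key/value classes for stage 2 here ...
        FileInputFormat.addInputPath(stage2, checkpoint);
        FileOutputFormat.setOutputPath(stage2, output);
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
      }
    }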

3) No integrated workflow or open-source multi-user & administrative platform. This point is connected to the previous one, because once a huge Hadoop job has started, especially a statistical analysis or machine learning task that requires many iterations, manual care is indispensable. As far as I know, there is still no integrated workflow management system built for Hadoop tasks. Moreover, if you have a private cluster running Hadoop jobs, coordinating multiple users can be a problem. For a small group a shared schedule board is necessary; for a large group there can be a huge amount of work to configure hardware and virtual machines. In our experience, optimizing cluster performance for Hadoop is non-trivial and we ran into quite a lot of problems. Amazon EC2 is a good choice, but running long, large jobs there can be quite expensive.

4) Think about your problem carefully in a key-value fashion and try to minimize the use of reducers. Hadoop is essentially the shuffle, sort, and aggregation of key-value pairs. Many practical problems can easily be transformed into a key-value data structure, and much of the work can be done with mappers only. Don't jump into the reducer too early; just tag all the data with a simple key of a few bytes and finish as many mapper-only tasks as possible. In this way you avoid many unnecessary sort and aggregation steps; a mapper-only sketch follows below.
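A mapper-only job is just a normal job with the reduce count set to zero, so the map output is written straight to HDFS and the shuffle/sort is skipped entirely. A minimal sketch (the filter condition and class names are made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyFilter {

      public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text tag = new Text("e");   // tiny key of a few bytes

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          if (line.toString().contains("ERROR")) {   // hypothetical filter
            context.write(tag, line);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "map-only filter");
        job.setJarByClass(MapOnlyFilter.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);                 // no reducer: no shuffle, no sort
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }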

Shi



On 9/21/2011 1:01 PM, GOEKE, MATTHEW (AG/1000) wrote:
I would completely agree with Mike's comments, with one addition: Hadoop centers 
around manipulating the flow of data in a way that makes the framework work 
for your specific problem. There are recipes for common problems, but depending 
on your domain those might solve only 30-40% of your use cases. It should take 
little to no time for a good Java dev to understand how to write an MR program. 
It will take significantly more time for that Java dev to understand the domain 
and Hadoop well enough to consistently write *good* MR programs. Mike listed 
some great ways to cut down on that curve, but you really want someone who has 
not only an affinity for code but can also apply critical thinking to how 
you should pipeline your data. If you plan on using it purely with Pig/Hive 
abstractions on top, then this concern is significantly reduced.

Some might disagree, but that is my $0.02.
Matt

-----Original Message-----
From: Michael Segel [mailto:michael_se...@hotmail.com]
Sent: Wednesday, September 21, 2011 12:48 PM
To: common-user@hadoop.apache.org
Subject: RE: risks of using Hadoop


Kobina

Points 1 and 2 are definitely real risks. SPOF is not.

As I pointed out in my mini-rant to Tom, the end users / developers who use 
the cluster can do more harm to your cluster than a SPOF machine failure.

I don't know what one would consider a 'long learning curve'. With the adoption 
of any new technology, you're talking at least 3-6 months based on the 
individual and the overall complexity of the environment.

Take anyone who is a strong developer, put them through Cloudera's training, 
plus some play time, and you've shortened the learning curve.
The better the java developer, the easier it is for them to pick up Hadoop.

I would also suggest taking the approach of hiring a senior person who can 
cross train and mentor your staff. This too will shorten the runway.

HTH

-Mike


Date: Wed, 21 Sep 2011 17:02:45 +0100
Subject: Re: risks of using Hadoop
From: kobina.kwa...@gmail.com
To: common-user@hadoop.apache.org

Jignesh,

Will your point 2 still be valid if we hire very experienced Java
programmers?

Kobina.

On 20 September 2011 21:07, Jignesh Patel<jign...@websoft.com>  wrote:

@Kobina
1. Lack of skill set
2. Longer learning curve
3. Single point of failure


@Uma
I am curious to know about 0.20.2: is it stable? Is it the same as the one you
mention in your email (the Federation changes)? If I need a scalable NameNode and
append support, which version should I choose?

Regarding the single point of failure, I believe Hortonworks (a.k.a. Yahoo) is
updating the Hadoop API. When will that be integrated with Hadoop?

If I need


-Jignesh

On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote:

Hi Kobina,

Some experiences which may be helpful for you with respect to HDFS.

1. Selecting the correct version.
    I recommend using a 0.20.x version. It is a pretty stable line, well tested,
and most other organizations prefer it.
Don't go for the 0.21 version; it is not stable, and using it is a risk.

2. You should perform thorough testing with your customer operations
  (of course you will do this :-)).

3. The 0.20.x versions have the SPOF problem.
   If the NameNode goes down you will lose data. One way of recovering is
by using the SecondaryNameNode: you can recover the data up to the last
checkpoint, but manual intervention is required.
In the latest trunk, the SPOF will be addressed by HDFS-1623.

4. 0.20.x NameNodes cannot scale. Federation changes are included in later
versions (I think in 0.22). This may not be a problem for your cluster,
but please consider this aspect as well.
5. Please select the Hadoop version depending on your security
requirements. There are security-enabled releases in the 0.20.x line as well.
6. If you plan to use HBase, it requires append support. The 0.20-append branch has
support for append; the 0.20.205 release will also have append support but is not
yet released. Choose the correct version to avoid sudden surprises.


Regards,
Uma
----- Original Message -----
From: Kobina Kwarko<kobina.kwa...@gmail.com>
Date: Saturday, September 17, 2011 3:42 am
Subject: Re: risks of using Hadoop
To: common-user@hadoop.apache.org

We are planning to use Hadoop in my organisation for quality-of-service
analysis of CDR records from mobile operators. We are thinking of having
a small cluster of maybe 10-15 nodes and I'm preparing the proposal. My
office requires that I provide some risk analysis in the proposal.

thank you.

On 16 September 2011 20:34, Uma Maheswara Rao G 72686
<mahesw...@huawei.com>wrote:

Hello,

First of all, where are you planning to use Hadoop?

Regards,
Uma
----- Original Message -----
From: Kobina Kwarko<kobina.kwa...@gmail.com>
Date: Saturday, September 17, 2011 0:41 am
Subject: risks of using Hadoop
To: common-user<common-user@hadoop.apache.org>

Hello,

Please can someone point out some of the risks we may incur if we
decide to implement Hadoop?

BR,

Isaac.


                                        