Re: risks of using Hadoop

2011-09-21 Thread Steve Loughran

On 20/09/11 22:52, Michael Segel wrote:


PS... There's this junction box in your machine room that has this very large 
on/off switch. If pulled down, it will cut power to your cluster and you will 
lose everything. Now would you consider this a risk? Sure. But is it something 
you should really lose sleep over? Do you understand that there are risks and 
there are improbable risks?


We follow the @devops_borat Ops book and have a Post-it note on the 
switch saying "not a light switch".


Re: risks of using Hadoop

2011-09-21 Thread Dieter Plaetinck
On Wed, 21 Sep 2011 11:21:01 +0100
Steve Loughran ste...@apache.org wrote:

 On 20/09/11 22:52, Michael Segel wrote:
 
  PS... There's this junction box in your machine room that has this
  very large on/off switch. If pulled down, it will cut power to your
  cluster and you will lose everything. Now would you consider this a
  risk? Sure. But is it something you should really lose sleep over?
  Do you understand that there are risks and there are improbable
  risks?
 
 We follow the @devops_borat Ops book and have a post-it-note on the 
 switch saying not a light switch

:D


Re: risks of using Hadoop

2011-09-21 Thread Steve Loughran

On 21/09/11 11:30, Dieter Plaetinck wrote:

On Wed, 21 Sep 2011 11:21:01 +0100
Steve Loughranste...@apache.org  wrote:


On 20/09/11 22:52, Michael Segel wrote:


PS... There's this junction box in your machine room that has this
very large on/off switch. If pulled down, it will cut power to your
cluster and you will lose everything. Now would you consider this a
risk? Sure. But is it something you should really lose sleep over?
Do you understand that there are risks and there are improbable
risks?


We follow the @devops_borat Ops book and have a post-it-note on the
switch saying not a light switch


:D


Also we have a backup 4-port 1 GbE Linksys router for when the main 
switch fails. The biggest issue these days is that, since we switched the 
backplane to Ethernet over Powerline, a power outage leads to network 
partitioning even when the racks have UPS.



see also http://twitter.com/#!/DEVOPS_BORAT


RE: risks of using Hadoop

2011-09-21 Thread Tom Deutsch
I am truly sorry if at some point in your life someone dropped an IBM logo 
on your head and it left a dent - but you are being a jerk.

Right after you were engaging in your usual condescension, a person from 
Xerox posted on the very issue you were blowing off. Things happen. To any 
system.

I'm not knocking Hadoop - and frankly, making sure new users have a good 
experience, based on the real things they need to be aware of and manage, is 
in everyone's interest here in growing the footprint.

Please take note that nowhere in here have I ever said anything to 
discourage Hadoop deployments/use, or anything that is vendor-specific.



Tom Deutsch
Program Director
CTO Office: Information Management
Hadoop Product Manager / Customer Exec
IBM
3565 Harbor Blvd
Costa Mesa, CA 92626-1420
tdeut...@us.ibm.com




Michael Segel michael_se...@hotmail.com 
09/20/2011 02:52 PM
Please respond to
common-user@hadoop.apache.org


To
common-user@hadoop.apache.org
cc

Subject
RE: risks of using Hadoop







Tom,

I think it is arrogant to parrot FUD when you've never had your hands 
dirty in any real Hadoop environment. 
So how could your response reflect the operational realities of running a 
Hadoop cluster?

What Brian was saying was that the SPOF is an overplayed FUD trump card. 
Anyone who's built clusters will have mitigated the risks of losing the 
NN. 
Then there's MapR... where you don't have a SPOF. But again that's a 
derivative of Apache Hadoop.
(Derivative isn't a bad thing...)

You're right that you need to plan accordingly, however from risk 
perspective, this isn't a risk. 
In fact, I believe Tom White's book has a good layout to mitigate this and 
while I have First Ed, I'll have to double check the second ed to see if 
he modified it.

Again, the point Brian was making and one that I agree with is that the NN 
as a SPOF is an overblown 'risk'.

You have a greater chance of data loss than you do of losing your NN. 

Probably the reason why some of us are a bit irritated by the SPOF 
reference to the NN is that it's clowns who haven't done any work in this 
space who pick up on the FUD and spread it around. This makes it difficult 
for guys like me to get anything done because we constantly have to 
go back and reassure stakeholders that it's a non-issue.

With respect to naming vendors, I did name MapR outside of Apache because 
they do have their own derivative release that improves upon the 
limitations found in Apache's Hadoop.

-Mike
PS... There's this junction box in your machine room that has this very 
large on/off switch. If pulled down, it will cut power to your cluster and 
you will lose everything. Now would you consider this a risk? Sure. But is 
it something you should really lose sleep over? Do you understand that 
there are risks and there are improbable risks? 


 To: common-user@hadoop.apache.org
 Subject: RE: risks of using Hadoop
 From: tdeut...@us.ibm.com
 Date: Tue, 20 Sep 2011 12:48:05 -0700
 
 No worries Michael - it would be a stretch to see any arrogance or 
 disrespect in your response.
 
 Kobina has asked a fair question, and deserves a response that reflects 
 the operational realities of where we are. 
 
 If you are looking at doing large-scale CDR handling - which I believe is 
 the use case here - you need to plan accordingly. Even you use the term 
 mitigate - which is different than prevent. Kobina needs an 
 understanding of what they are looking at. That isn't a pro/con stance on 
 Hadoop, it is just reality and they should plan accordingly. 
 
 (Note - I'm not the one who brought vendors into this - which doesn't 
 strike me as appropriate for this list)
 
 
 Tom Deutsch
 Program Director
 CTO Office: Information Management
 Hadoop Product Manager / Customer Exec
 IBM
 3565 Harbor Blvd
 Costa Mesa, CA 92626-1420
 tdeut...@us.ibm.com
 
 
 
 
 Michael Segel michael_se...@hotmail.com 
 09/17/2011 07:37 PM
 Please respond to
 common-user@hadoop.apache.org
 
 
 To
 common-user@hadoop.apache.org
 cc
 
 Subject
 RE: risks of using Hadoop
 
 
 
 
 
 
 
 Gee Tom,
 No disrespect, but I don't believe you have any personal practical 
 experience in designing and building out clusters or putting them to the 

 test.
 
 Now to the points that Brian raised..
 
 1) SPOF... it sounds great on paper. Some FUD to scare someone away from 

 Hadoop. But in reality... you can mitigate your risks by setting up raid 

 on your NN/HM node. You can also NFS mount a copy to your SN (or 
whatever 
 they're calling it these days...) Or you can go to MapR which has 
 redesigned HDFS which removes this problem. But with your Apache Hadoop 
or 
 Cloudera's release, losing your NN is rare. Yes it can happen, but not 
 your greatest risk. (Not by a long shot)
 
 2) Data Loss.
 You can mitigate this as well. Do I need to go through all of the 
options 
 and DR/BCP planning? Sure there's always a chance that you

Re: risks of using Hadoop

2011-09-21 Thread Kobina Kwarko
Jignesh,

Will your point 2 still be valid if we hire very experienced Java
programmers?

Kobina.

On 20 September 2011 21:07, Jignesh Patel jign...@websoft.com wrote:


 @Kobina
 1. Lack of skill set
 2. Longer learning curve
 3. Single point of failure


 @Uma
 I am curious to know about 0.20.2 - is that stable? Is it the same as the one you
 mention in your email (Federation changes)? If I need a scaled NameNode and
 append support, which version should I choose?

 Regarding the single point of failure, I believe Hortonworks (a.k.a. Yahoo) is
 updating the Hadoop API. When will that be integrated with Hadoop?

 If I need


 -Jignesh

 On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote:

  Hi Kobina,
 
  Some experiences which may be helpful for you with respect to DFS.
 
  1. Selecting the correct version.
 I would recommend using a 0.20.x version. This is a pretty stable version
 and other organizations prefer it. Well tested as well.
  Don't go for the 0.21 version. That is not a stable version. It is a risk.
 
  2. You should perform thorough tests with your customer operations.
   (of course you will do this :-))
 
  3. The 0.20.x versions have the problem of a SPOF.
 If the NameNode goes down you will lose the data. One way of recovering is
 by using the SecondaryNameNode. You can recover the data up to the last
 checkpoint, but manual intervention is required.
  In the latest trunk the SPOF will be addressed by HDFS-1623.
 
  4. 0.20.x NameNodes cannot scale. Federation changes are included in later
 versions (I think in 0.22). This may not be a problem for your cluster,
 but please consider this aspect as well.
 
  5. Please select the Hadoop version depending on your security
 requirements. There are versions available with security as well in 0.20.x.
 
  6. If you plan to use HBase, it requires append support. 0.20-append has the
 support for append. The 0.20.205 release will also have append support but is
 not yet released. Choose your version correctly to avoid sudden surprises.
 
 
 
  Regards,
  Uma
  - Original Message -
  From: Kobina Kwarko kobina.kwa...@gmail.com
  Date: Saturday, September 17, 2011 3:42 am
  Subject: Re: risks of using Hadoop
  To: common-user@hadoop.apache.org
 
   We are planning to use Hadoop in my organisation for quality of
   services analysis out of CDR records from mobile operators. We are
   thinking of having
   a small cluster of maybe 10-15 nodes and I'm preparing the
   proposal. My
   office requires that I provide some risk analysis in the proposal.
 
  thank you.
 
  On 16 September 2011 20:34, Uma Maheswara Rao G 72686
  mahesw...@huawei.comwrote:
 
  Hello,
 
  First of all where you are planning to use Hadoop?
 
  Regards,
  Uma
  - Original Message -
  From: Kobina Kwarko kobina.kwa...@gmail.com
  Date: Saturday, September 17, 2011 0:41 am
  Subject: risks of using Hadoop
  To: common-user common-user@hadoop.apache.org
 
  Hello,
 
  Please can someone point some of the risks we may incur if we
  decide to
  implement Hadoop?
 
  BR,
 
  Isaac.
 
 
 




Re: risks of using Hadoop

2011-09-21 Thread Uma Maheswara Rao G 72686
Jignesh,
Please see my comments inline.

- Original Message -
From: Kobina Kwarko kobina.kwa...@gmail.com
Date: Wednesday, September 21, 2011 9:33 pm
Subject: Re: risks of using Hadoop
To: common-user@hadoop.apache.org

 Jignesh,
 
 Will your point 2 still be valid if we hire very experienced Java
 programmers?
 
 Kobina.
 
 On 20 September 2011 21:07, Jignesh Patel jign...@websoft.com wrote:
 
 
  @Kobina
  1. Lack of skill set
  2. Longer learning curve
  3. Single point of failure
 
 
  @Uma
  I am curious to know about .20.2 is that stable? Is it same as 
 the one you
  mention in your email(Federation changes), If I need scaled 
 nameNode and
  append support, which version I should choose.
 
  Regarding Single point of failure, I believe Hortonworks(a.k.a 
 Yahoo) is
  updating the Hadoop API. When that will be integrated with Hadoop.
 
  If I need
 

 Yes, the 0.20 versions are stable. Federation changes will not be available in 
the 0.20 versions. I think the Federation changes have been merged to the 0.23 
branch, so from 0.23 onwards you can get the Federation implementation. But there 
has been no release from the 0.23 branch yet.

Regarding NameNode High Availability, there is one issue, HDFS-1623, to 
build it (in progress). This may take a couple of months to integrate.


 
  -Jignesh
 
  On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote:
 
   Hi Kobina,
  
   Some experiences which may helpful for you with respective to DFS.
  
   1. Selecting the correct version.
  I will recommend to use 0.20X version. This is pretty stable 
 version and all other organizations prefers it. Well tested as well.
   Dont go for 21 version.This version is not a stable 
 version.This is risk.
  
   2. You should perform thorough test with your customer operations.
(of-course you will do this :-))
  
   3. 0.20x version has the problem of SPOF.
 If NameNode goes down you will loose the data.One way of 
 recovering is
  by using the secondaryNameNode.You can recover the data till last
  checkpoint.But here manual intervention is required.
   In latest trunk SPOF will be addressed bu HDFS-1623.
  
   4. 0.20x NameNodes can not scale. Federation changes included 
 in latest
  versions. ( i think in 22). this may not be the problem for your 
 cluster. But please consider this aspect as well.
  
   5. Please select the hadoop version depending on your security
  requirements. There are versions available for security as well 
 in 0.20X.
  
   6. If you plan to use Hbase, it requires append support. 
 20Append has the
  support for append. 0.20.205 release also will have append 
 support but not
  yet released. Choose your correct version to avoid sudden surprises.
  
  
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 3:42 am
   Subject: Re: risks of using Hadoop
   To: common-user@hadoop.apache.org
  
   We are planning to use Hadoop in my organisation for quality of
   servicesanalysis out of CDR records from mobile operators. We are
   thinking of having
   a small cluster of may be 10 - 15 nodes and I'm preparing the
   proposal. my
   office requires that i provide some risk analysis in the 
 proposal. 
   thank you.
  
   On 16 September 2011 20:34, Uma Maheswara Rao G 72686
   mahesw...@huawei.comwrote:
  
   Hello,
  
   First of all where you are planning to use Hadoop?
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 0:41 am
   Subject: risks of using Hadoop
   To: common-user common-user@hadoop.apache.org
  
   Hello,
  
   Please can someone point some of the risks we may incur if we
   decide to
   implement Hadoop?
  
   BR,
  
   Isaac.
  
  
  
 
 
 

Regards,
Uma


Re: risks of using Hadoop

2011-09-21 Thread Ahmed Nagy
Another way to decrease the risks is just to use Amazon Web Services. That
might be a bit expensive.

On Sun, Sep 18, 2011 at 12:11 AM, Brian Bockelman bbock...@cse.unl.edu
wrote:


 On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:

  Hi Kobina,
 
  Some experiences which may helpful for you with respective to DFS.
 
  1. Selecting the correct version.
 I will recommend to use 0.20X version. This is pretty stable version
and all other organizations prefers it. Well tested as well.
  Dont go for 21 version.This version is not a stable version.This is
risk.
 
  2. You should perform thorough test with your customer operations.
   (of-course you will do this :-))
 
  3. 0.20x version has the problem of SPOF.
If NameNode goes down you will loose the data.One way of recovering is
by using the secondaryNameNode.You can recover the data till last
checkpoint.But here manual intervention is required.
  In latest trunk SPOF will be addressed bu HDFS-1623.
 
  4. 0.20x NameNodes can not scale. Federation changes included in latest
versions. ( i think in 22). this may not be the problem for your cluster.
But please consider this aspect as well.
 

 With respect to (3) and (4) - these are often completely overblown for
many Hadoop use cases.  If you use Hadoop as originally designed (large
scale batch data processing), these likely don't matter.

 If you're looking at some of the newer use cases (low latency stuff or
time-critical processing), or if you architect your solution poorly (lots of
small files), these issues become relevant.  Another case where I see folks
get frustrated is using Hadoop as a plain old batch system; for non-data
workflows, it doesn't measure up against specialized systems.

 You really want to make sure that Hadoop is the best tool for your job.

 Brian


RE: risks of using Hadoop

2011-09-21 Thread Michael Segel

Tom,

Normally someone who has a personal beef with someone will take it offline and 
deal with it.
Clearly manners aren't your strong point... which unfortunately forces me to respond 
to you in public.

Since you asked, no, I don't have any beef with IBM. In fact, I happen to have 
quite a few friends within IBM's IM pillar. (Although many seem to have taken 
Elvis' advice and left the building...)

What I do have a problem with is you and your response to the posts in this 
thread.

It's bad enough that you really don't know what you're talking about. But this 
is compounded by the fact that your posts end with a job title that seems to 
indicate you are a thought leader from a well-known, brand-name company. 
So unlike some schmuck off the street, because of your job title, someone may 
actually pay attention to you and take what you say at face value.

The issue at hand is that the OP wanted to know the risks so that he can 
address them to give his pointy-haired stakeholders a warm fuzzy feeling.
SPOF isn't a risk, but a point of FUD that is constantly being brought out by 
people who have an alternative that they want to promote.
Brian pretty much put it into perspective. You attempted to correct him, and 
while Brian was polite, I'm not. Why? Because I happen to know enough 
people who still think that whatever BS IBM trots out must be true and taken at 
face value. 

I think you're more concerned with making an appearance than you are with 
anyone having a good experience. No offense, but again, you're not someone who 
has actual hands-on experience, so you're not in a position to give advice. I 
don't know whether to write off what you say as arrogance, but I have to wonder if 
you actually paid attention in your SSM class. Raising FUD and non-issues as 
risks doesn't help anyone promote Hadoop, regardless of the vendor. What it 
does is give the stakeholders reason to pause. Overstating risks can cause 
just as much harm as over-promising results. Again, it's Sales 101. Perhaps 
you're still trying to convert these folks off Hadoop on to IBM's DB2? No wait, 
that was someone else... and it wasn't Hadoop, it was Informix. (Sorry to the 
list, that was an inside joke that probably went over Tom's head, but for 
someone's benefit.) 

To drill the point home...
1) Look at MapR, an IBM competitor whose derivative already solves this SPOF 
problem.
2) Look at how to set up a cluster (Apache, Hortonworks, Cloudera) where you 
can mitigate this through your node configuration along with simple sysadmin tricks 
like NFS-mounting a drive from a different machine within the cluster 
(preferably on a different rack, as a backup); see the configuration sketch below.
3) Think about the backup and recovery of your NameNode's files.
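
For example, the simplest form of (2) on a 0.20-era cluster is just to list more 
than one directory in dfs.name.dir, so the NameNode writes its fsimage and edit log 
to every location, one of them an NFS mount exported from a machine on another rack. 
A minimal hdfs-site.xml sketch - the paths here are made up for illustration:

   <property>
     <name>dfs.name.dir</name>
     <!-- local disk first, then an NFS mount from another rack; the NN
          writes its metadata to every comma-separated directory -->
     <value>/data/1/dfs/nn,/mnt/remote-nn-backup/dfs/nn</value>
   </property>

If the NameNode host dies, the copy on the NFS mount gives you the metadata needed 
to bring up a replacement NameNode.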

There's more, and I would encourage you to actually talk to a professional 
before giving out advice. ;-)

HTH

-Mike

PS. My last PS talked about the big power switch in a switch box in the machine 
room that cuts the power. (When it's a lever, do you really need to tell someone 
that it's not a light switch? And I guess you could padlock it too.) 
Seriously, there is more risk of data loss and corruption from luser issues 
than there is from a SPOF (NN failure).




 To: common-user@hadoop.apache.org
 Subject: RE: risks of using Hadoop
 From: tdeut...@us.ibm.com
 Date: Wed, 21 Sep 2011 06:20:53 -0700
 
 I am truly sorry if at some point in your life someone dropped an IBM logo 
 on your head and it left a dent - but you are being a jerk.
 
 Right after you were engaging in your usual condescension a person from 
 Xerox posted on the very issue you were blowing off. Things happen. To any 
 system.
 
 I'm not knocking Hadoop - and frankly making sure new users have a good 
 experience based on the real things that need to be aware of / manage is 
 in everyone's interests here to grow the footprint.
 
 Please take note that no where in here have I ever said anything to 
 discourage Hadoop deployments/use or anything that is vendor specific.
 
 
 
 Tom Deutsch
 Program Director
 CTO Office: Information Management
 Hadoop Product Manager / Customer Exec
 IBM
 3565 Harbor Blvd
 Costa Mesa, CA 92626-1420
 tdeut...@us.ibm.com
 
 
 
 
 Michael Segel michael_se...@hotmail.com 
 09/20/2011 02:52 PM
 Please respond to
 common-user@hadoop.apache.org
 
 
 To
 common-user@hadoop.apache.org
 cc
 
 Subject
 RE: risks of using Hadoop
 
 
 
 
 
 
 
 Tom,
 
 I think it is arrogant to parrot FUD when you've never had your hands 
 dirty in any real Hadoop environment. 
 So how could your response reflect the operational realities of running a 
 Hadoop cluster?
 
 What Brian was saying was that the SPOF is an over played FUD trump card. 
 Anyone who's built clusters will have mitigated the risks of losing the 
 NN. 
 Then there's MapR... where you don't have a SPOF. But again that's a 
 derivative of Apache Hadoop.
 (Derivative isn't a bad thing...)
 
 You're right that you need

RE: risks of using Hadoop

2011-09-21 Thread Michael Segel

Kobina

The points 1 and 2 are definitely real risks. SPOF is not.

As I pointed out in my mini-rant to Tom, your end users / developers 
who use the cluster can do more harm to your cluster than a SPOF machine 
failure.

I don't know what one would consider a 'long learning curve'. With the adoption 
of any new technology, you're talking at least 3-6 months, depending on the 
individual and the overall complexity of the environment. 

Take anyone who is a strong developer, put them through Cloudera's training, 
plus some play time, and you've shortened the learning curve.
The better the Java developer, the easier it is for them to pick up Hadoop.

I would also suggest taking the approach of hiring a senior person who can 
cross-train and mentor your staff. This too will shorten the runway.

HTH

-Mike


 Date: Wed, 21 Sep 2011 17:02:45 +0100
 Subject: Re: risks of using Hadoop
 From: kobina.kwa...@gmail.com
 To: common-user@hadoop.apache.org
 
 Jignesh,
 
 Will your point 2 still be valid if we hire very experienced Java
 programmers?
 
 Kobina.
 
 On 20 September 2011 21:07, Jignesh Patel jign...@websoft.com wrote:
 
 
  @Kobina
  1. Lack of skill set
  2. Longer learning curve
  3. Single point of failure
 
 
  @Uma
  I am curious to know about .20.2 is that stable? Is it same as the one you
  mention in your email(Federation changes), If I need scaled nameNode and
  append support, which version I should choose.
 
  Regarding Single point of failure, I believe Hortonworks(a.k.a Yahoo) is
  updating the Hadoop API. When that will be integrated with Hadoop.
 
  If I need
 
 
  -Jignesh
 
  On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote:
 
   Hi Kobina,
  
   Some experiences which may helpful for you with respective to DFS.
  
   1. Selecting the correct version.
  I will recommend to use 0.20X version. This is pretty stable version
  and all other organizations prefers it. Well tested as well.
   Dont go for 21 version.This version is not a stable version.This is risk.
  
   2. You should perform thorough test with your customer operations.
(of-course you will do this :-))
  
   3. 0.20x version has the problem of SPOF.
 If NameNode goes down you will loose the data.One way of recovering is
  by using the secondaryNameNode.You can recover the data till last
  checkpoint.But here manual intervention is required.
   In latest trunk SPOF will be addressed bu HDFS-1623.
  
   4. 0.20x NameNodes can not scale. Federation changes included in latest
  versions. ( i think in 22). this may not be the problem for your cluster.
  But please consider this aspect as well.
  
   5. Please select the hadoop version depending on your security
  requirements. There are versions available for security as well in 0.20X.
  
   6. If you plan to use Hbase, it requires append support. 20Append has the
  support for append. 0.20.205 release also will have append support but not
  yet released. Choose your correct version to avoid sudden surprises.
  
  
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 3:42 am
   Subject: Re: risks of using Hadoop
   To: common-user@hadoop.apache.org
  
   We are planning to use Hadoop in my organisation for quality of
   servicesanalysis out of CDR records from mobile operators. We are
   thinking of having
   a small cluster of may be 10 - 15 nodes and I'm preparing the
   proposal. my
   office requires that i provide some risk analysis in the proposal.
  
   thank you.
  
   On 16 September 2011 20:34, Uma Maheswara Rao G 72686
   mahesw...@huawei.comwrote:
  
   Hello,
  
   First of all where you are planning to use Hadoop?
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 0:41 am
   Subject: risks of using Hadoop
   To: common-user common-user@hadoop.apache.org
  
   Hello,
  
   Please can someone point some of the risks we may incur if we
   decide to
   implement Hadoop?
  
   BR,
  
   Isaac.
  
  
  
 
 
  

RE: risks of using Hadoop

2011-09-21 Thread GOEKE, MATTHEW (AG/1000)
I would completely agree with Mike's comments, with one addition: Hadoop centers 
around how to manipulate the flow of data in a way that makes the framework work 
for your specific problem. There are recipes for common problems, but depending 
on your domain those might solve only 30-40% of your use cases. It should take 
little to no time for a good Java dev to understand how to write an MR program. 
It will take significantly more time for that Java dev to understand the domain 
and Hadoop well enough to consistently write *good* MR programs. Mike listed 
some great ways to cut down on that curve, but you really want someone who has 
not only an affinity for code but who can also apply critical thinking to how 
you should pipeline your data. If you plan on using it purely through Pig/Hive 
abstractions on top then this can be negated significantly.

Some might disagree but that is my $0.02
Matt 

-Original Message-
From: Michael Segel [mailto:michael_se...@hotmail.com] 
Sent: Wednesday, September 21, 2011 12:48 PM
To: common-user@hadoop.apache.org
Subject: RE: risks of using Hadoop


Kobina

The points 1 and 2 are definitely real risks. SPOF is not.

As I pointed out in my mini-rant to Tom was that your end users / developers 
who use the cluster can do more harm to your cluster than a SPOF machine 
failure.

I don't know what one would consider a 'long learning curve'. With the adoption 
of any new technology, you're talking at least 3-6 months based on the 
individual and the overall complexity of the environment. 

Take anyone who is a strong developer, put them through Cloudera's training, 
plus some play time, and you've shortened the learning curve.
The better the java developer, the easier it is for them to pick up Hadoop.

I would also suggest taking the approach of hiring a senior person who can 
cross train and mentor your staff. This too will shorten the runway.

HTH

-Mike


 Date: Wed, 21 Sep 2011 17:02:45 +0100
 Subject: Re: risks of using Hadoop
 From: kobina.kwa...@gmail.com
 To: common-user@hadoop.apache.org
 
 Jignesh,
 
 Will your point 2 still be valid if we hire very experienced Java
 programmers?
 
 Kobina.
 
 On 20 September 2011 21:07, Jignesh Patel jign...@websoft.com wrote:
 
 
  @Kobina
  1. Lack of skill set
  2. Longer learning curve
  3. Single point of failure
 
 
  @Uma
  I am curious to know about .20.2 is that stable? Is it same as the one you
  mention in your email(Federation changes), If I need scaled nameNode and
  append support, which version I should choose.
 
  Regarding Single point of failure, I believe Hortonworks(a.k.a Yahoo) is
  updating the Hadoop API. When that will be integrated with Hadoop.
 
  If I need
 
 
  -Jignesh
 
  On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote:
 
   Hi Kobina,
  
   Some experiences which may helpful for you with respective to DFS.
  
   1. Selecting the correct version.
  I will recommend to use 0.20X version. This is pretty stable version
  and all other organizations prefers it. Well tested as well.
   Dont go for 21 version.This version is not a stable version.This is risk.
  
   2. You should perform thorough test with your customer operations.
(of-course you will do this :-))
  
   3. 0.20x version has the problem of SPOF.
 If NameNode goes down you will loose the data.One way of recovering is
  by using the secondaryNameNode.You can recover the data till last
  checkpoint.But here manual intervention is required.
   In latest trunk SPOF will be addressed bu HDFS-1623.
  
   4. 0.20x NameNodes can not scale. Federation changes included in latest
  versions. ( i think in 22). this may not be the problem for your cluster.
  But please consider this aspect as well.
  
   5. Please select the hadoop version depending on your security
  requirements. There are versions available for security as well in 0.20X.
  
   6. If you plan to use Hbase, it requires append support. 20Append has the
  support for append. 0.20.205 release also will have append support but not
  yet released. Choose your correct version to avoid sudden surprises.
  
  
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 3:42 am
   Subject: Re: risks of using Hadoop
   To: common-user@hadoop.apache.org
  
   We are planning to use Hadoop in my organisation for quality of
   servicesanalysis out of CDR records from mobile operators. We are
   thinking of having
   a small cluster of may be 10 - 15 nodes and I'm preparing the
   proposal. my
   office requires that i provide some risk analysis in the proposal.
  
   thank you.
  
   On 16 September 2011 20:34, Uma Maheswara Rao G 72686
   mahesw...@huawei.comwrote:
  
   Hello,
  
   First of all where you are planning to use Hadoop?
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 0:41 am
   Subject: risks

Re: risks of using Hadoop

2011-09-21 Thread Shi Yu
I saw this discussion start a few days ago but didn't pay 
attention to it. This morning I came across some of these messages 
and, rofl, too much drama. In my experience, there are some 
risks of using Hadoop.


1) Not real-time or mission-critical. You may consider Hadoop a 
good workhorse for offline processing and a good framework for large-scale 
data analysis and data processing; however, there are many factors that 
affect Hadoop jobs. Even the most well-written and robust code can 
fail because of exceptional hardware and network problems.


2) Don't put too much hope in efficiency. It can do jobs that were 
previously impossible to achieve, but maybe not as fast as you imagine. There is 
no magic by which Hadoop creates everything in a blink. Usually, and to be safe, 
you may prefer to break your entire large job down into several pieces and 
save and back up the data step by step. In this fashion Hadoop can 
really get some huge jobs done, but it still requires a lot of manual effort.


3) No integrated workflow or open-source multi-user administrative 
platform. This point is connected to the previous one because once a 
huge Hadoop job has started, especially for statistical analysis and machine 
learning tasks that require many iterations, manual care is 
indispensable. As far as I know, there is as yet no integrated workflow 
management system built for Hadoop tasks. Moreover, if you have your own 
private cluster running Hadoop jobs, the coordination of multiple 
users can be a problem. For a small group a board schedule is necessary, 
while for a large group there might be a huge amount of work to configure 
hardware and virtual machines. In our experience, optimizing cluster 
performance for Hadoop is non-trivial and we ran into quite a few 
problems. Amazon EC2 is a good choice, but running long and large tasks 
on it can be quite expensive.


4) Think about your problem carefully in a key-value fashion and try to 
minimize the use of reducers. Hadoop is essentially shuffle, sort and 
aggregation of key-value pairs. Many practical problems can be 
easily transformed into a key-value data structure; however, much of the work 
can be done with mappers only. Don't jump into the reducer task too 
early; just tag all the data with a simple key of a few bytes and 
finish as many mapper-only tasks as possible. In this way you can 
avoid many unnecessary sort and aggregation steps (a minimal sketch of a 
map-only job follows below).
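
To make point 4 concrete, here is a minimal map-only sketch. The class name, 
filter condition and paths are invented for illustration, not taken from any real 
job; with setNumReduceTasks(0) the shuffle and sort phases are skipped entirely and 
the mapper output is written straight to HDFS.

   import java.io.IOException;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

   public class MapOnlyFilter {

     // Pass through only the records we care about; no reducer needed.
     public static class FilterMapper extends Mapper<Object, Text, Text, Text> {
       @Override
       protected void map(Object key, Text value, Context context)
           throws IOException, InterruptedException {
         if (value.toString().contains("ERROR")) {   // hypothetical filter condition
           context.write(new Text(""), value);
         }
       }
     }

     public static void main(String[] args) throws Exception {
       Job job = new Job(new Configuration(), "map-only filter");
       job.setJarByClass(MapOnlyFilter.class);
       job.setMapperClass(FilterMapper.class);
       job.setNumReduceTasks(0);               // zero reducers: no shuffle, no sort
       job.setOutputKeyClass(Text.class);
       job.setOutputValueClass(Text.class);
       FileInputFormat.addInputPath(job, new Path(args[0]));
       FileOutputFormat.setOutputPath(job, new Path(args[1]));
       System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
   }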


Shi



On 9/21/2011 1:01 PM, GOEKE, MATTHEW (AG/1000) wrote:

I would completely agree with Mike's comments with one addition: Hadoop centers 
around how to manipulate the flow of data in a way to make the framework work 
for your specific problem. There are recipes for common problems but depending 
on your domain that might solve only 30-40% of your use cases. It should take 
little to no time for a good java dev to understand how to make an MR program. 
It will take significantly more time for that java dev to understand the domain 
and Hadoop well enough to consistently write *good* MR programs. Mike listed 
some great ways to cut down on that curve but you really want someone who has 
not only an affinity for code but can also apply the critical thinking to how 
you should pipeline your data. If you plan on using it purely with Pig/Hive 
abstractions on top then this can be negated significantly.

Some my might disagree but that is my $0.02
Matt

-Original Message-
From: Michael Segel [mailto:michael_se...@hotmail.com]
Sent: Wednesday, September 21, 2011 12:48 PM
To: common-user@hadoop.apache.org
Subject: RE: risks of using Hadoop


Kobina

The points 1 and 2 are definitely real risks. SPOF is not.

As I pointed out in my mini-rant to Tom was that your end users / developers 
who use the cluster can do more harm to your cluster than a SPOF machine 
failure.

I don't know what one would consider a 'long learning curve'. With the adoption 
of any new technology, you're talking at least 3-6 months based on the 
individual and the overall complexity of the environment.

Take anyone who is a strong developer, put them through Cloudera's training, 
plus some play time, and you've shortened the learning curve.
The better the java developer, the easier it is for them to pick up Hadoop.

I would also suggest taking the approach of hiring a senior person who can 
cross train and mentor your staff. This too will shorten the runway.

HTH

-Mike



Date: Wed, 21 Sep 2011 17:02:45 +0100
Subject: Re: risks of using Hadoop
From: kobina.kwa...@gmail.com
To: common-user@hadoop.apache.org

Jignesh,

Will your point 2 still be valid if we hire very experienced Java
programmers?

Kobina.

On 20 September 2011 21:07, Jignesh Pateljign...@websoft.com  wrote:


@Kobina
1. Lack of skill set
2. Longer learning curve
3. Single point of failure


@Uma
I am curious to know about .20.2 is that stable? Is it same as the one you
mention in your email(Federation changes), If I need scaled nameNode and
append support

Re: risks of using Hadoop

2011-09-21 Thread Raj V
I have been following this thread. Over the last two years that I have been 
using Hadoop with a fairly large cluster, my biggest problem has been analyzing 
failures. In the beginning it was fairly simple - unformatted name node, task 
trackers not starting, heap allocation mistakes, version id mismatches, 
configuration mistakes - things that were easily fixed using this group's help 
and/or analyzing logs. Then the errors got a little more complicated - too many fetch 
failures, task exited with error code 134, error reading task output, etc. - 
where the logs were less useful and this mailing list and the source became more 
useful - and given that I am not a Java expert, I needed to rely on this group 
more and more. There are wonderful people like Harsha, Steve and Todd who 
sincerely and correctly answer many queries. But this is a complex system with 
so many knobs and so many variables that knowing all possible failures is 
probably close to impossible. This is just the framework. If you combine this 
with all the esoteric industries that Hadoop is used in, the complexity increases 
because of the domain expertise required. 

We won't even touch the voodoo magic that is involved in optimizing Hadoop 
runs. 

So to mitigate the risk of running Hadoop you need someone with four heads: 
the domain head - the one who can think about and solve domain problems; the 
Hadoop head - the person who can translate this into M/R; the Java head, who 
understands Java and can take a shot at looking at the source code and finding 
solutions to problems; and the system head, the person who keeps the cluster 
buzzing along smoothly. So unless you have these heads, or are able to get these 
heads as required, there is some definite risk. 

Thanks once again to this wonderful group and the many active people like Todd, 
Harsha, Steve and many others who have helped me and others get over that 
stumbling block.












From: Ahmed Nagy ahmed.n...@gmail.com
To: common-user@hadoop.apache.org
Sent: Wednesday, September 21, 2011 2:02 AM
Subject: Re: risks of using Hadoop

Another way to decrease the risks is just to use Amazon Web Services. That
might be a bit expensive

On Sun, Sep 18, 2011 at 12:11 AM, Brian Bockelman bbock...@cse.unl.edu
wrote:


 On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:

  Hi Kobina,
 
  Some experiences which may helpful for you with respective to DFS.
 
  1. Selecting the correct version.
     I will recommend to use 0.20X version. This is pretty stable version
and all other organizations prefers it. Well tested as well.
  Dont go for 21 version.This version is not a stable version.This is
risk.
 
  2. You should perform thorough test with your customer operations.
   (of-course you will do this :-))
 
  3. 0.20x version has the problem of SPOF.
    If NameNode goes down you will loose the data.One way of recovering is
by using the secondaryNameNode.You can recover the data till last
checkpoint.But here manual intervention is required.
  In latest trunk SPOF will be addressed bu HDFS-1623.
 
  4. 0.20x NameNodes can not scale. Federation changes included in latest
versions. ( i think in 22). this may not be the problem for your cluster.
But please consider this aspect as well.
 

 With respect to (3) and (4) - these are often completely overblown for
many Hadoop use cases.  If you use Hadoop as originally designed (large
scale batch data processing), these likely don't matter.

 If you're looking at some of the newer use cases (low latency stuff or
time-critical processing), or if you architect your solution poorly (lots of
small files), these issues become relevant.  Another case where I see folks
get frustrated is using Hadoop as a plain old batch system; for non-data
workflows, it doesn't measure up against specialized systems.

 You really want to make sure that Hadoop is the best tool for your job.

 Brian




RE: risks of using Hadoop

2011-09-21 Thread Bill Habermaas
Amen to that. I haven't heard a good rant in a long time; I am definitely 
amused and entertained. 

As a veteran of 3 years with Hadoop I will say that the SPOF issue is whatever 
you want to make it. But it has not, nor will it ever, deter me from using this 
great system. Every system has its risks, and they can be minimized by careful 
architectural crafting and intelligent usage. 

Bill

-Original Message-
From: Michael Segel [mailto:michael_se...@hotmail.com] 
Sent: Wednesday, September 21, 2011 1:48 PM
To: common-user@hadoop.apache.org
Subject: RE: risks of using Hadoop


Kobina

The points 1 and 2 are definitely real risks. SPOF is not.

As I pointed out in my mini-rant to Tom was that your end users / developers 
who use the cluster can do more harm to your cluster than a SPOF machine 
failure.

I don't know what one would consider a 'long learning curve'. With the adoption 
of any new technology, you're talking at least 3-6 months based on the 
individual and the overall complexity of the environment. 

Take anyone who is a strong developer, put them through Cloudera's training, 
plus some play time, and you've shortened the learning curve.
The better the java developer, the easier it is for them to pick up Hadoop.

I would also suggest taking the approach of hiring a senior person who can 
cross train and mentor your staff. This too will shorten the runway.

HTH

-Mike


 Date: Wed, 21 Sep 2011 17:02:45 +0100
 Subject: Re: risks of using Hadoop
 From: kobina.kwa...@gmail.com
 To: common-user@hadoop.apache.org
 
 Jignesh,
 
 Will your point 2 still be valid if we hire very experienced Java
 programmers?
 
 Kobina.
 
 On 20 September 2011 21:07, Jignesh Patel jign...@websoft.com wrote:
 
 
  @Kobina
  1. Lack of skill set
  2. Longer learning curve
  3. Single point of failure
 
 
  @Uma
  I am curious to know about .20.2 is that stable? Is it same as the one you
  mention in your email(Federation changes), If I need scaled nameNode and
  append support, which version I should choose.
 
  Regarding Single point of failure, I believe Hortonworks(a.k.a Yahoo) is
  updating the Hadoop API. When that will be integrated with Hadoop.
 
  If I need
 
 
  -Jignesh
 
  On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote:
 
   Hi Kobina,
  
   Some experiences which may helpful for you with respective to DFS.
  
   1. Selecting the correct version.
  I will recommend to use 0.20X version. This is pretty stable version
  and all other organizations prefers it. Well tested as well.
   Dont go for 21 version.This version is not a stable version.This is risk.
  
   2. You should perform thorough test with your customer operations.
(of-course you will do this :-))
  
   3. 0.20x version has the problem of SPOF.
 If NameNode goes down you will loose the data.One way of recovering is
  by using the secondaryNameNode.You can recover the data till last
  checkpoint.But here manual intervention is required.
   In latest trunk SPOF will be addressed bu HDFS-1623.
  
   4. 0.20x NameNodes can not scale. Federation changes included in latest
  versions. ( i think in 22). this may not be the problem for your cluster.
  But please consider this aspect as well.
  
   5. Please select the hadoop version depending on your security
  requirements. There are versions available for security as well in 0.20X.
  
   6. If you plan to use Hbase, it requires append support. 20Append has the
  support for append. 0.20.205 release also will have append support but not
  yet released. Choose your correct version to avoid sudden surprises.
  
  
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 3:42 am
   Subject: Re: risks of using Hadoop
   To: common-user@hadoop.apache.org
  
   We are planning to use Hadoop in my organisation for quality of
   servicesanalysis out of CDR records from mobile operators. We are
   thinking of having
   a small cluster of may be 10 - 15 nodes and I'm preparing the
   proposal. my
   office requires that i provide some risk analysis in the proposal.
  
   thank you.
  
   On 16 September 2011 20:34, Uma Maheswara Rao G 72686
   mahesw...@huawei.comwrote:
  
   Hello,
  
   First of all where you are planning to use Hadoop?
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 0:41 am
   Subject: risks of using Hadoop
   To: common-user common-user@hadoop.apache.org
  
   Hello,
  
   Please can someone point some of the risks we may incur if we
   decide to
   implement Hadoop?
  
   BR,
  
   Isaac.
  
  
  
 
 



Re: RE: risks of using Hadoop

2011-09-21 Thread Uma Maheswara Rao G 72686
Absolutely agree with you.
Mainly we should consider the SPOF and minimize the problem through careful planning 
(there are many ways to minimize this issue, as we have seen in this thread).

Regards,
Uma
- Original Message -
From: Bill Habermaas bill.haberm...@oracle.com
Date: Thursday, September 22, 2011 10:04 am
Subject: RE: risks of using Hadoop
To: common-user@hadoop.apache.org

 Amen to that. I haven't heard a good rant in a long time, I am 
 definitely amused end entertained. 
 
 As a veteran of 3 years with Hadoop I will say that the SPOF issue 
 is whatever you want to make it. But it has not, nor will it ever 
 defer me from using this great system. Every system has its risks 
 and they can be minimized by careful architectural crafting and 
 intelligent usage. 
 
 Bill
 
 -Original Message-
 From: Michael Segel [mailto:michael_se...@hotmail.com] 
 Sent: Wednesday, September 21, 2011 1:48 PM
 To: common-user@hadoop.apache.org
 Subject: RE: risks of using Hadoop
 
 
 Kobina
 
 The points 1 and 2 are definitely real risks. SPOF is not.
 
 As I pointed out in my mini-rant to Tom was that your end users / 
 developers who use the cluster can do more harm to your cluster 
 than a SPOF machine failure.
 
 I don't know what one would consider a 'long learning curve'. With 
 the adoption of any new technology, you're talking at least 3-6 
 months based on the individual and the overall complexity of the 
 environment. 
 
 Take anyone who is a strong developer, put them through Cloudera's 
 training, plus some play time, and you've shortened the learning 
 curve.The better the java developer, the easier it is for them to 
 pick up Hadoop.
 
 I would also suggest taking the approach of hiring a senior person 
 who can cross train and mentor your staff. This too will shorten 
 the runway.
 
 HTH
 
 -Mike
 
 
  Date: Wed, 21 Sep 2011 17:02:45 +0100
  Subject: Re: risks of using Hadoop
  From: kobina.kwa...@gmail.com
  To: common-user@hadoop.apache.org
  
  Jignesh,
  
  Will your point 2 still be valid if we hire very experienced Java
  programmers?
  
  Kobina.
  
  On 20 September 2011 21:07, Jignesh Patel jign...@websoft.com 
 wrote: 
  
   @Kobina
   1. Lack of skill set
   2. Longer learning curve
   3. Single point of failure
  
  
   @Uma
   I am curious to know about .20.2 is that stable? Is it same as 
 the one you
   mention in your email(Federation changes), If I need scaled 
 nameNode and
   append support, which version I should choose.
  
   Regarding Single point of failure, I believe Hortonworks(a.k.a 
 Yahoo) is
   updating the Hadoop API. When that will be integrated with Hadoop.
  
   If I need
  
  
   -Jignesh
  
   On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote:
  
Hi Kobina,
   
Some experiences which may helpful for you with respective 
 to DFS.
   
1. Selecting the correct version.
   I will recommend to use 0.20X version. This is pretty 
 stable version
   and all other organizations prefers it. Well tested as well.
Dont go for 21 version.This version is not a stable 
 version.This is risk.
   
2. You should perform thorough test with your customer 
 operations.(of-course you will do this :-))
   
3. 0.20x version has the problem of SPOF.
  If NameNode goes down you will loose the data.One way of 
 recovering is
   by using the secondaryNameNode.You can recover the data till last
   checkpoint.But here manual intervention is required.
In latest trunk SPOF will be addressed bu HDFS-1623.
   
4. 0.20x NameNodes can not scale. Federation changes 
 included in latest
   versions. ( i think in 22). this may not be the problem for 
 your cluster.
   But please consider this aspect as well.
   
5. Please select the hadoop version depending on your security
   requirements. There are versions available for security as 
 well in 0.20X.
   
6. If you plan to use Hbase, it requires append support. 
 20Append has the
   support for append. 0.20.205 release also will have append 
 support but not
   yet released. Choose your correct version to avoid sudden 
 surprises.  
   
   
Regards,
Uma
- Original Message -
From: Kobina Kwarko kobina.kwa...@gmail.com
Date: Saturday, September 17, 2011 3:42 am
Subject: Re: risks of using Hadoop
To: common-user@hadoop.apache.org
   
We are planning to use Hadoop in my organisation for 
 quality of
servicesanalysis out of CDR records from mobile operators. 
 We are
thinking of having
a small cluster of may be 10 - 15 nodes and I'm preparing the
proposal. my
office requires that i provide some risk analysis in the 
 proposal.  
thank you.
   
On 16 September 2011 20:34, Uma Maheswara Rao G 72686
mahesw...@huawei.comwrote:
   
Hello,
   
First of all where you are planning to use Hadoop?
   
Regards,
Uma
- Original Message -
From: Kobina Kwarko kobina.kwa...@gmail.com
Date: Saturday

RE: risks of using Hadoop

2011-09-20 Thread Michael Segel

Tom,

I think it is arrogant to parrot FUD when you've never had your hands dirty in 
any real Hadoop environment. 
So how could your response reflect the operational realities of running a 
Hadoop cluster?

What Brian was saying was that the SPOF is an overplayed FUD trump card. 
Anyone who's built clusters will have mitigated the risks of losing the NN. 
Then there's MapR... where you don't have a SPOF. But again that's a derivative 
of Apache Hadoop.
(Derivative isn't a bad thing...)

You're right that you need to plan accordingly; however, from a risk perspective, 
this isn't a risk. 
In fact, I believe Tom White's book has a good layout to mitigate this and 
while I have First Ed, I'll have to double check the second ed to see if he 
modified it.

Again, the point Brian was making and one that I agree with is that the NN as a 
SPOF is an overblown 'risk'.

You have a greater chance of data loss than you do of losing your NN. 

Probably the reason why some of us are a bit irritated by the SPOF reference to 
the NN is that it's clowns who haven't done any work in this space who pick up on 
the FUD and spread it around. This makes it difficult for guys like me to 
get anything done, because we constantly have to go back and reassure 
stakeholders that it's a non-issue.

With respect to naming vendors, I did name MapR outside of Apache because they 
do have their own derivative release that improves upon the limitations found 
in Apache's Hadoop.

-Mike
PS... There's this junction box in your machine room that has this very large 
on/off switch. If pulled down, it will cut power to your cluster and you will 
lose everything. Now would you consider this a risk? Sure. But is it something 
you should really lose sleep over? Do you understand that there are risks and 
there are improbable risks? 


 To: common-user@hadoop.apache.org
 Subject: RE: risks of using Hadoop
 From: tdeut...@us.ibm.com
 Date: Tue, 20 Sep 2011 12:48:05 -0700
 
 No worries Michael - it would be a stretch to see any arrogance or 
 disrespect in your response.
 
 Kobina has asked a fair question, and deserves a response that reflects 
 the operational realities of where we are. 
 
 If you are looking at doing large scale CDR handling - which I believe is 
 the use case here - you need to plan accordingly. Even you use the term 
 mitigate - which is different than prevent.  Kobina needs an 
 understanding of what they are looking at. That isn't a pro/con stance on 
 Hadoop, it is just reality and they should plan accordingly. 
 
 (Note - I'm not the one who brought vendors into this - which doesn't 
 strike me as appropriate for this list)
 
 
 Tom Deutsch
 Program Director
 CTO Office: Information Management
 Hadoop Product Manager / Customer Exec
 IBM
 3565 Harbor Blvd
 Costa Mesa, CA 92626-1420
 tdeut...@us.ibm.com
 
 
 
 
 Michael Segel michael_se...@hotmail.com 
 09/17/2011 07:37 PM
 Please respond to
 common-user@hadoop.apache.org
 
 
 To
 common-user@hadoop.apache.org
 cc
 
 Subject
 RE: risks of using Hadoop
 
 
 
 
 
 
 
 Gee Tom,
 No disrespect, but I don't believe you have any personal practical 
 experience in designing and building out clusters or putting them to the 
 test.
 
 Now to the points that Brian raised..
 
 1) SPOF... it sounds great on paper. Some FUD to scare someone away from 
 Hadoop. But in reality... you can mitigate your risks by setting up raid 
 on your NN/HM node. You can also NFS mount a copy to your SN (or whatever 
 they're calling it these days...) Or you can go to MapR which has 
 redesigned HDFS which removes this problem. But with your Apache Hadoop or 
 Cloudera's release, losing your NN is rare. Yes it can happen, but not 
 your greatest risk. (Not by a long shot)
 
 2) Data Loss.
 You can mitigate this as well. Do I need to go through all of the options 
 and DR/BCP planning? Sure there's always a chance that you have some Luser 
 who does something brain dead. This is true of all databases and systems. 
 (I know I can probably recount some of IBM's Informix and DB2 having data 
 loss issues. But that's a topic for another time. ;-)
 
 I can't speak for Brian, but I don't think he's trivializing it. In fact I 
 think he's doing a fine job of level setting expectations.
 
 And if you talk to Ted Dunning of MapR, I'm sure he'll point out that 
 their current release does address points 3 and 4 again making their risks 
 moot. (At least if you're using MapR)
 
 -Mike
 
 
  Subject: Re: risks of using Hadoop
  From: tdeut...@us.ibm.com
  Date: Sat, 17 Sep 2011 17:38:27 -0600
  To: common-user@hadoop.apache.org
  
  I disagree Brian - data loss and system down time (both potentially 
 non-trival) should not be taken lightly. Use cases and thus availability 
 requirements do vary, but I would not encourage anyone to shrug them off 
 as overblown, especially as Hadoop become more production oriented in 
 utilization

Re: risks of using Hadoop

2011-09-20 Thread Todd Lipcon
On Wed, Sep 21, 2011 at 6:52 AM, Michael Segel
michael_se...@hotmail.com wrote:
 PS... There's this junction box in your machine room that has this very large 
 on/off switch. If pulled down, it will cut power to your cluster and you will 
 lose everything. Now would you consider this a risk? Sure. But is it 
 something you should really lose sleep over? Do you understand that there are 
 risks and there are improbable risks?

http://www.datacenterknowledge.com/archives/2007/05/07/averting-disaster-with-the-epo-button/

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: risks of using Hadoop

2011-09-19 Thread Steve Loughran

On 18/09/11 02:32, Tom Deutsch wrote:

Not trying to give you a hard time Brian - we just have different 
users/customers/expectations on us.


Tom, I suggest you read "Apache Hadoop Goes Realtime at Facebook" and consider 
how you could adopt those features - and how to contribute them back to 
the ASF. Certainly I'd like to see their subcluster placement policy in 
the codebase.



For anyone doing batch work, I'd take the NN outage problem as an 
intermittent event that happens less often than OS upgrades - it's just 
something you should expect and test for before your system goes live: make 
sure your secondary NN is working and you know how to handle a restart.


Regarding the original discussion, 10-15 nodes has enough machines that 
the loss of one or two should be sustainable; with smaller clusters you 
get less benefit from replication (as each failing server loses a higher 
percentage of the blocks), but the probability of server failure is much 
less.


You can fit everything into a single rack with a ToR switch running at 1 
gigabit through the rack, 10 Gigabit if the servers have it on the 
mainboards and you can afford the difference, as it may mitigate some of 
the impact of server loss. Do think about expansion here; at least have 
enough ports for the entire rack, and the option of multiple 10 GbE 
interconnects to any other racks you may add later. Single switch 
clusters don't need any rack topology scripts, so you can skip one 
bit of setup.


As everyone says, you need to worry about namenode failure. You could 
put the secondary namenode on the same machine as the job tracker, and 
have them both write to NFS mounted filesystems. The trick in a small 
cluster is to use some (more than one) of the workers' disk space as 
those NFS mount points.
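
As a rough sketch of that layout (the directory names below are invented for 
illustration): on 0.20 the SecondaryNameNode takes its checkpoint locations from 
fs.checkpoint.dir which, like dfs.name.dir, accepts a comma-separated list, so one 
entry can sit on an NFS export backed by a worker's disk:

   <property>
     <name>fs.checkpoint.dir</name>
     <!-- a local directory plus an NFS mount exported by a worker node -->
     <value>/data/1/dfs/snn,/mnt/worker03-export/dfs/snn</value>
   </property>
   <property>
     <name>fs.checkpoint.period</name>
     <!-- seconds between checkpoints; bounds how much edit log you replay -->
     <value>3600</value>
   </property>

The same trick applies to dfs.name.dir on the namenode itself, so the image and 
edits always exist on a disk that survives the loss of the namenode host.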



Risks
 -security; you may want to isolate the cluster from the rest of your 
intranet
 -security: if I could run code on your cluster I could probably get at 
various server ports and read what I wanted. As all MR jobs are running 
code in the cluster, you have to trust people coding at the Java layer. 
If it's pig or hive jobs, life is simpler.
 -data integrity. Get ECC memory, monitor disks aggressively and take 
them offline if you think they are playing up. Run SATA VERIFY commands 
against the machines (in the sg3_utils package).
 -DoS to the rest of the Intranet. Bad code on a medium to large 
cluster can overload the rest of your network simply by making too many 
DNS requests, let alone lots of remote HTTP operations. This should not 
be a risk for smaller clusters.
 -Developers writing code that doesn't scale. You don't have to worry 
about this in a small cluster, but as you scale you will find that choke 
points (JT counters, shared remote filestores) may cause problems; there is 
a counter sketch after this list. Even excessive logging can be trouble.


-New feature for Ops: more monitoring to learn about. While the NN 
uptime matters, the worker nodes are less important. Don't have the team 
sprint to deal with a lost worker node. That said, for a small cluster 
I'd have a couple of replacement disks around, as the loss of a disk would 
have more impact on total capacity.
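
On the choke-point risk above, here is a hedged sketch (class, group and 
field names are made up) of the kind of counter misuse that turns the 
JobTracker into a bottleneck, contrasted with a bounded counter set:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class CounterHappyMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String subscriber = value.toString().split(",")[0];
      // Bad: one counter per distinct subscriber. With millions of
      // subscribers the JobTracker must track every counter from every
      // task, and it becomes a choke point.
      context.getCounter("subscribers", subscriber).increment(1);
      // Fine: a small, fixed set of counters.
      context.getCounter("records", "parsed").increment(1);
      context.write(new Text(subscriber), ONE);
    }
  }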



I've been looking at Hadoop Data Integrity, and now have a todo list 
based on my findings

http://www.slideshare.net/steve_l/did-you-reallywantthatdata

Because your cluster is small, you won't overload your NN even with 
small blocks, or the JT with jobs finishing too fast for it to keep up 
with, so you can use smaller blocks, which should improve data integrity.
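
A sketch of writing a file with a smaller-than-default block size (the path, 
sizes and class name are made up; dfs.block.size can also be lowered 
cluster-wide in hdfs-site.xml):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SmallBlockWrite {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      long blockSize = 32L * 1024 * 1024;   // 32 MB instead of the 64 MB default
      short replication = 3;
      int bufferSize = conf.getInt("io.file.buffer.size", 4096);
      FSDataOutputStream out = fs.create(
          new Path("/data/cdr/sample.txt"), true, bufferSize, replication, blockSize);
      out.writeBytes("one CDR record\n");
      out.close();
      fs.close();
    }
  }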


Otherwise, the main risk, as people note, is unrealistic expectations. 
Hadoop is not a replacement for a database with ACID transaction 
requirements; even reads are slower than from indexed tables. What it is good 
for is very-low-cost storage of large amounts of low-value data, and as 
a platform for the layers above.


-Steve


Re: risks of using Hadoop

2011-09-19 Thread Steve Loughran

On 18/09/11 03:37, Michael Segel wrote:





2) Data Loss.
You can mitigate this as well. Do I need to go through all of the options and 
DR/BCP planning? Sure there's always a chance that you have some Luser who does 
something brain dead. This is true of all databases and systems. (I know I can 
probably recount some of IBM's Informix and DB2 having data loss issues. But 
that's a topic for another time. ;-)



That raises one more point. Once your cluster grows it's hard to back it 
up except to other Hadoop clusters. If you want to survive loss-of-site 
events (power, communications) then you'll need to exchange copies of 
the high-value data between physically remote clusters. But you may not 
need to replicate at 3x remotely, because it's only backup data.
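
A sketch of the reduced-replication idea on a backup cluster (the cluster 
URI and paths are made up; the bulk copy between sites is normally done with 
the distcp tool):

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BackupReplicationTrim {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // The remote (backup) cluster's HDFS.
      FileSystem remote = FileSystem.get(new URI("hdfs://dr-site-nn:8020/"), conf);
      Path backupRoot = new Path("/backups/high-value");
      FileStatus[] entries = remote.listStatus(backupRoot);
      if (entries != null) {
        for (FileStatus stat : entries) {
          if (!stat.isDir()) {
            // Keep 2 replicas of backup copies instead of the usual 3.
            remote.setReplication(stat.getPath(), (short) 2);
          }
        }
      }
      remote.close();
    }
  }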



-steve



Re: risks of using Hadoop

2011-09-18 Thread Kobina Kwarko
Useful contributions. I want to find out one more thing: has Hadoop been
successfully simulated so far? Maybe using Opnet or ns2?

Regards,

kobina.


On 18 September 2011 03:37, Michael Segel michael_se...@hotmail.com wrote:


 Gee Tom,
 No disrespect, but I don't believe you have any personal practical
 experience in designing and building out clusters or putting them to the
 test.

 Now to the points that Brian raised..

 1) SPOF... it sounds great on paper. Some FUD to scare someone away from
 Hadoop. But in reality... you can mitigate your risks by setting up raid on
 your NN/HM node. You can also NFS mount a copy to your SN (or whatever
 they're calling it these days...) Or you can go to MapR which has redesigned
 HDFS which removes this problem. But with your Apache Hadoop or Cloudera's
 release, losing your NN is rare. Yes it can happen, but not your greatest
 risk. (Not by a long shot)

 2) Data Loss.
 You can mitigate this as well. Do I need to go through all of the options
 and DR/BCP planning? Sure there's always a chance that you have some Luser
 who does something brain dead. This is true of all databases and systems. (I
 know I can probably recount some of IBM's Informix and DB2 having data loss
 issues. But that's a topic for another time. ;-)

 I can't speak for Brian, but I don't think he's trivializing it. In fact I
 think he's doing a fine job of level setting expectations.

 And if you talk to Ted Dunning of MapR, I'm sure he'll point out that their
 current release does address points 3 and 4 again making their risks moot.
 (At least if you're using MapR)

 -Mike


  Subject: Re: risks of using Hadoop
  From: tdeut...@us.ibm.com
  Date: Sat, 17 Sep 2011 17:38:27 -0600
  To: common-user@hadoop.apache.org
 
  I disagree Brian - data loss and system down time (both potentially
 non-trival) should not be taken lightly. Use cases and thus availability
 requirements do vary, but I would not encourage anyone to shrug them off as
 overblown, especially as Hadoop become more production oriented in
 utilization.
 
  ---
  Sent from my Blackberry so please excuse typing and spelling errors.
 
 
  - Original Message -
  From: Brian Bockelman [bbock...@cse.unl.edu]
  Sent: 09/17/2011 05:11 PM EST
  To: common-user@hadoop.apache.org
  Subject: Re: risks of using Hadoop
 
 
 
 
  On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:
 
   Hi Kobina,
  
   Some experiences which may helpful for you with respective to DFS.
  
   1. Selecting the correct version.
  I will recommend to use 0.20X version. This is pretty stable version
 and all other organizations prefers it. Well tested as well.
   Dont go for 21 version.This version is not a stable version.This is
 risk.
  
   2. You should perform thorough test with your customer operations.
(of-course you will do this :-))
  
   3. 0.20x version has the problem of SPOF.
 If NameNode goes down you will loose the data.One way of recovering
 is by using the secondaryNameNode.You can recover the data till last
 checkpoint.But here manual intervention is required.
   In latest trunk SPOF will be addressed bu HDFS-1623.
  
   4. 0.20x NameNodes can not scale. Federation changes included in latest
 versions. ( i think in 22). this may not be the problem for your cluster.
 But please consider this aspect as well.
  
 
  With respect to (3) and (4) - these are often completely overblown for
 many Hadoop use cases.  If you use Hadoop as originally designed (large
 scale batch data processing), these likely don't matter.
 
  If you're looking at some of the newer use cases (low latency stuff or
 time-critical processing), or if you architect your solution poorly (lots of
 small files), these issues become relevant.  Another case where I see folks
 get frustrated is using Hadoop as a plain old batch system; for non-data
 workflows, it doesn't measure up against specialized systems.
 
  You really want to make sure that Hadoop is the best tool for your job.
 
  Brian




Re: risks of using Hadoop

2011-09-17 Thread Uma Maheswara Rao G 72686
Hi George,

You can use it normally as well. Append interfaces will be exposed.
For HBase, append support is very much required.

Regards,
Uma

- Original Message -
From: George Kousiouris gkous...@mail.ntua.gr
Date: Saturday, September 17, 2011 12:29 pm
Subject: Re: risks of using Hadoop
To: common-user@hadoop.apache.org
Cc: Uma Maheswara Rao G 72686 mahesw...@huawei.com

 
 Hi,
 
 When you say that 0.20.205 will support appends, you mean for 
 general 
 purpose writes on the HDFS? or only Hbase?
 
 Thanks,
 George
 
 On 9/17/2011 7:08 AM, Uma Maheswara Rao G 72686 wrote:
  6. If you plan to use Hbase, it requires append support. 20Append 
 has the support for append. 0.20.205 release also will have append 
 support but not yet released. Choose your correct version to avoid 
 sudden surprises.
 
 
 
  Regards,
  Uma
  - Original Message -
  From: Kobina Kwarko kobina.kwa...@gmail.com
  Date: Saturday, September 17, 2011 3:42 am
  Subject: Re: risks of using Hadoop
  To: common-user@hadoop.apache.org
 
  We are planning to use Hadoop in my organisation for quality of
  services analysis out of CDR records from mobile operators. We are
  thinking of having
  a small cluster of may be 10 - 15 nodes and I'm preparing the
  proposal. my
  office requires that i provide some risk analysis in the proposal.
 
  thank you.
 
  On 16 September 2011 20:34, Uma Maheswara Rao G 72686
  mahesw...@huawei.comwrote:
 
  Hello,
 
  First of all where you are planning to use Hadoop?
 
  Regards,
  Uma
  - Original Message -
  From: Kobina Kwarko kobina.kwa...@gmail.com
  Date: Saturday, September 17, 2011 0:41 am
  Subject: risks of using Hadoop
  To: common-usercommon-user@hadoop.apache.org
 
  Hello,
 
  Please can someone point some of the risks we may incur if we
  decide to
  implement Hadoop?
 
  BR,
 
  Isaac.
 
 
 
 
 -- 
 
 ---
 
 George Kousiouris
 Electrical and Computer Engineer
 Division of Communications,
 Electronics and Information Engineering
 School of Electrical and Computer Engineering
 Tel: +30 210 772 2546
 Mobile: +30 6939354121
 Fax: +30 210 772 2569
 Email: gkous...@mail.ntua.gr
 Site: http://users.ntua.gr/gkousiou/
 
 National Technical University of Athens
 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
 
 


Re: risks of using Hadoop

2011-09-17 Thread Todd Lipcon
To clarify, *append* is not supported and is known to be buggy. *sync*
support is what HBase needs and what 0.20.205 will support. Before 205
is released, you can also find these features in CDH3 or by building
your own release from SVN.
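
A minimal sketch of what that sync support looks like from the client side 
(the path and class name are made up; whether sync() is actually durable 
depends on running a sync-capable build such as 0.20-append, 0.20.205 or 
CDH3):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SyncSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      FSDataOutputStream out = fs.create(new Path("/tmp/wal-demo"));
      out.writeBytes("edit-1\n");
      // sync() pushes the written bytes to the datanodes so that readers and
      // crash recovery can see them even if this writer dies; this is what
      // HBase needs for its write-ahead log.
      out.sync();
      out.close();
      fs.close();
    }
  }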

-Todd

On Sat, Sep 17, 2011 at 4:59 AM, Uma Maheswara Rao G 72686
mahesw...@huawei.com wrote:
 Hi George,

 You can use it noramally as well. Append interfaces will be exposed.
 For Hbase, append support is required very much.

 Regards,
 Uma

 - Original Message -
 From: George Kousiouris gkous...@mail.ntua.gr
 Date: Saturday, September 17, 2011 12:29 pm
 Subject: Re: risks of using Hadoop
 To: common-user@hadoop.apache.org
 Cc: Uma Maheswara Rao G 72686 mahesw...@huawei.com


 Hi,

 When you say that 0.20.205 will support appends, you mean for
 general
 purpose writes on the HDFS? or only Hbase?

 Thanks,
 George

 On 9/17/2011 7:08 AM, Uma Maheswara Rao G 72686 wrote:
  6. If you plan to use Hbase, it requires append support. 20Append
 has the support for append. 0.20.205 release also will have append
 support but not yet released. Choose your correct version to avoid
 sudden surprises.
 
 
 
  Regards,
  Uma
  - Original Message -
  From: Kobina Kwarko kobina.kwa...@gmail.com
  Date: Saturday, September 17, 2011 3:42 am
  Subject: Re: risks of using Hadoop
  To: common-user@hadoop.apache.org
 
  We are planning to use Hadoop in my organisation for quality of
  services analysis out of CDR records from mobile operators. We are
  thinking of having
  a small cluster of may be 10 - 15 nodes and I'm preparing the
  proposal. my
  office requires that i provide some risk analysis in the proposal.
 
  thank you.
 
  On 16 September 2011 20:34, Uma Maheswara Rao G 72686
  mahesw...@huawei.comwrote:
 
  Hello,
 
  First of all where you are planning to use Hadoop?
 
  Regards,
  Uma
  - Original Message -
  From: Kobina Kwarko kobina.kwa...@gmail.com
  Date: Saturday, September 17, 2011 0:41 am
  Subject: risks of using Hadoop
  To: common-usercommon-user@hadoop.apache.org
 
  Hello,
 
  Please can someone point some of the risks we may incur if we
  decide to
  implement Hadoop?
 
  BR,
 
  Isaac.
 
 


 --

 ---

 George Kousiouris
 Electrical and Computer Engineer
 Division of Communications,
 Electronics and Information Engineering
 School of Electrical and Computer Engineering
 Tel: +30 210 772 2546
 Mobile: +30 6939354121
 Fax: +30 210 772 2569
 Email: gkous...@mail.ntua.gr
 Site: http://users.ntua.gr/gkousiou/

 National Technical University of Athens
 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece






-- 
Todd Lipcon
Software Engineer, Cloudera


Re: risks of using Hadoop

2011-09-17 Thread Uma Maheswara Rao G 72686
Yes, 
I was mentioning append before because the branch name itself is 20Append. sync is 
the main API name used to sync the edits. 

@George
  You mainly need to consider the HBase usage:
sync is supported.
The append API has some open issues, for example 
https://issues.apache.org/jira/browse/HDFS-1228

Apologies for any confusion.

Thanks a lot for the extra clarification!
 
Thanks
Uma
- Original Message -
From: Todd Lipcon t...@cloudera.com
Date: Sunday, September 18, 2011 1:35 am
Subject: Re: risks of using Hadoop
To: common-user@hadoop.apache.org

 To clarify, *append* is not supported and is known to be buggy. *sync*
 support is what HBase needs and what 0.20.205 will support. Before 205
 is released, you can also find these features in CDH3 or by building
 your own release from SVN.
 
 -Todd
 
 On Sat, Sep 17, 2011 at 4:59 AM, Uma Maheswara Rao G 72686
 mahesw...@huawei.com wrote:
  Hi George,
 
  You can use it noramally as well. Append interfaces will be exposed.
  For Hbase, append support is required very much.
 
  Regards,
  Uma
 
  - Original Message -
  From: George Kousiouris gkous...@mail.ntua.gr
  Date: Saturday, September 17, 2011 12:29 pm
  Subject: Re: risks of using Hadoop
  To: common-user@hadoop.apache.org
  Cc: Uma Maheswara Rao G 72686 mahesw...@huawei.com
 
 
  Hi,
 
  When you say that 0.20.205 will support appends, you mean for
  general
  purpose writes on the HDFS? or only Hbase?
 
  Thanks,
  George
 
  On 9/17/2011 7:08 AM, Uma Maheswara Rao G 72686 wrote:
   6. If you plan to use Hbase, it requires append support. 20Append
  has the support for append. 0.20.205 release also will have append
  support but not yet released. Choose your correct version to avoid
  sudden surprises.
  
  
  
   Regards,
   Uma
   - Original Message -
    From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 3:42 am
   Subject: Re: risks of using Hadoop
   To: common-user@hadoop.apache.org
  
   We are planning to use Hadoop in my organisation for quality of
    services analysis out of CDR records from mobile operators. We 
 are  thinking of having
   a small cluster of may be 10 - 15 nodes and I'm preparing the
   proposal. my
   office requires that i provide some risk analysis in the 
 proposal. 
   thank you.
  
   On 16 September 2011 20:34, Uma Maheswara Rao G 72686
   mahesw...@huawei.comwrote:
  
   Hello,
  
   First of all where you are planning to use Hadoop?
  
   Regards,
   Uma
   - Original Message -
    From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 0:41 am
   Subject: risks of using Hadoop
   To: common-usercommon-user@hadoop.apache.org
  
   Hello,
  
   Please can someone point some of the risks we may incur if we
   decide to
   implement Hadoop?
  
   BR,
  
   Isaac.
  
  
 
 
  --
 
  ---
 
  George Kousiouris
  Electrical and Computer Engineer
  Division of Communications,
  Electronics and Information Engineering
  School of Electrical and Computer Engineering
  Tel: +30 210 772 2546
  Mobile: +30 6939354121
  Fax: +30 210 772 2569
  Email: gkous...@mail.ntua.gr
  Site: http://users.ntua.gr/gkousiou/
 
  National Technical University of Athens
  9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
 
 
 
 
 
 
 -- 
 Todd Lipcon
 Software Engineer, Cloudera
 


Re: risks of using Hadoop

2011-09-17 Thread Brian Bockelman

On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:

 Hi Kobina,
 
 Some experiences which may helpful for you with respective to DFS. 
 
 1. Selecting the correct version.
I will recommend to use 0.20X version. This is pretty stable version and 
 all other organizations prefers it. Well tested as well.
 Dont go for 21 version.This version is not a stable version.This is risk.
 
 2. You should perform thorough test with your customer operations. 
  (of-course you will do this :-))
 
 3. 0.20x version has the problem of SPOF.
   If NameNode goes down you will loose the data.One way of recovering is by 
 using the secondaryNameNode.You can recover the data till last checkpoint.But 
 here manual intervention is required.
 In latest trunk SPOF will be addressed bu HDFS-1623.
 
 4. 0.20x NameNodes can not scale. Federation changes included in latest 
 versions. ( i think in 22). this may not be the problem for your cluster. But 
 please consider this aspect as well.
 

With respect to (3) and (4) - these are often completely overblown for many 
Hadoop use cases.  If you use Hadoop as originally designed (large scale batch 
data processing), these likely don't matter.

If you're looking at some of the newer use cases (low latency stuff or 
time-critical processing), or if you architect your solution poorly (lots of 
small files), these issues become relevant.  Another case where I see folks get 
frustrated is using Hadoop as a plain old batch system; for non-data 
workflows, it doesn't measure up against specialized systems.
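
On the lots-of-small-files point, one common mitigation is to pack them into 
a container file so the NameNode tracks one large file instead of millions of 
tiny ones. A hedged sketch (paths and class name are made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path in = new Path("/data/incoming");        // directory of many small files
      Path out = new Path("/data/packed.seq");     // one file keyed by original name
      SequenceFile.Writer writer =
          SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
      try {
        for (FileStatus stat : fs.listStatus(in)) {
          byte[] buf = new byte[(int) stat.getLen()];  // fine for small files
          FSDataInputStream is = fs.open(stat.getPath());
          try {
            is.readFully(buf);
          } finally {
            is.close();
          }
          writer.append(new Text(stat.getPath().getName()), new BytesWritable(buf));
        }
      } finally {
        writer.close();
      }
    }
  }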

You really want to make sure that Hadoop is the best tool for your job.

Brian

Re: risks of using Hadoop

2011-09-17 Thread Tom Deutsch
I disagree Brian - data loss and system downtime (both potentially non-trivial) 
should not be taken lightly. Use cases and thus availability requirements do 
vary, but I would not encourage anyone to shrug them off as overblown, 
especially as Hadoop becomes more production-oriented in utilization.

---
Sent from my Blackberry so please excuse typing and spelling errors.


- Original Message -
From: Brian Bockelman [bbock...@cse.unl.edu]
Sent: 09/17/2011 05:11 PM EST
To: common-user@hadoop.apache.org
Subject: Re: risks of using Hadoop




On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:

 Hi Kobina,

 Some experiences which may helpful for you with respective to DFS.

 1. Selecting the correct version.
I will recommend to use 0.20X version. This is pretty stable version and 
 all other organizations prefers it. Well tested as well.
 Dont go for 21 version.This version is not a stable version.This is risk.

 2. You should perform thorough test with your customer operations.
  (of-course you will do this :-))

 3. 0.20x version has the problem of SPOF.
   If NameNode goes down you will loose the data.One way of recovering is by 
 using the secondaryNameNode.You can recover the data till last checkpoint.But 
 here manual intervention is required.
 In latest trunk SPOF will be addressed bu HDFS-1623.

 4. 0.20x NameNodes can not scale. Federation changes included in latest 
 versions. ( i think in 22). this may not be the problem for your cluster. But 
 please consider this aspect as well.


With respect to (3) and (4) - these are often completely overblown for many 
Hadoop use cases.  If you use Hadoop as originally designed (large scale batch 
data processing), these likely don't matter.

If you're looking at some of the newer use cases (low latency stuff or 
time-critical processing), or if you architect your solution poorly (lots of 
small files), these issues become relevant.  Another case where I see folks get 
frustrated is using Hadoop as a plain old batch system; for non-data 
workflows, it doesn't measure up against specialized systems.

You really want to make sure that Hadoop is the best tool for your job.

Brian


Re: risks of using Hadoop

2011-09-17 Thread Brian Bockelman
Data loss in a batch-oriented environment is different than data loss in an 
online/production environment.  It's a trade-off, and I personally think many 
folks don't weigh the costs well.

As you mention - Hadoop is becoming more production oriented in utilization.  
*In those cases*, you definitely don't want to shrug off data loss / downtime.  
However, there are many people who simply don't need this.

If I'm told that I can buy a 10% larger cluster by accepting up to 15 minutes 
of data loss, I'd do it in a heartbeat where I work.

Brian

On Sep 17, 2011, at 6:38 PM, Tom Deutsch wrote:

 I disagree Brian - data loss and system down time (both potentially 
 non-trival) should not be taken lightly. Use cases and thus availability 
 requirements do vary, but I would not encourage anyone to shrug them off as 
 overblown, especially as Hadoop become more production oriented in 
 utilization.
 
 ---
 Sent from my Blackberry so please excuse typing and spelling errors.
 
 
 - Original Message -
 From: Brian Bockelman [bbock...@cse.unl.edu]
 Sent: 09/17/2011 05:11 PM EST
 To: common-user@hadoop.apache.org
 Subject: Re: risks of using Hadoop
 
 
 
 
 On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:
 
 Hi Kobina,
 
 Some experiences which may helpful for you with respective to DFS.
 
 1. Selecting the correct version.
   I will recommend to use 0.20X version. This is pretty stable version and 
 all other organizations prefers it. Well tested as well.
 Dont go for 21 version.This version is not a stable version.This is risk.
 
 2. You should perform thorough test with your customer operations.
 (of-course you will do this :-))
 
 3. 0.20x version has the problem of SPOF.
  If NameNode goes down you will loose the data.One way of recovering is by 
 using the secondaryNameNode.You can recover the data till last 
 checkpoint.But here manual intervention is required.
 In latest trunk SPOF will be addressed bu HDFS-1623.
 
 4. 0.20x NameNodes can not scale. Federation changes included in latest 
 versions. ( i think in 22). this may not be the problem for your cluster. 
 But please consider this aspect as well.
 
 
 With respect to (3) and (4) - these are often completely overblown for many 
 Hadoop use cases.  If you use Hadoop as originally designed (large scale 
 batch data processing), these likely don't matter.
 
 If you're looking at some of the newer use cases (low latency stuff or 
 time-critical processing), or if you architect your solution poorly (lots of 
 small files), these issues become relevant.  Another case where I see folks 
 get frustrated is using Hadoop as a plain old batch system; for non-data 
 workflows, it doesn't measure up against specialized systems.
 
 You really want to make sure that Hadoop is the best tool for your job.
 
 Brian



Re: risks of using Hadoop

2011-09-17 Thread Tom Deutsch
Not trying to give you a hard time Brian - we just have different 
users/customers/expectations on us.



---
Sent from my Blackberry so please excuse typing and spelling errors.


- Original Message -
From: Brian Bockelman [bbock...@cse.unl.edu]
Sent: 09/17/2011 08:10 PM EST
To: common-user@hadoop.apache.org
Subject: Re: risks of using Hadoop



Data loss in a batch-oriented environment is different than data loss in an 
online/production environment.  It's a trade-off, and I personally think many 
folks don't weigh the costs well.

As you mention - Hadoop is becoming more production oriented in utilization.  
*In those cases*, you definitely don't want to shrug off data loss / downtime.  
However, there's many people who simply don't need this.

If I'm told that I can buy a 10% larger cluster by accepting up to 15 minutes 
of data loss, I'd do it in a heartbeat where I work.

Brian

On Sep 17, 2011, at 6:38 PM, Tom Deutsch wrote:

 I disagree Brian - data loss and system down time (both potentially 
 non-trival) should not be taken lightly. Use cases and thus availability 
 requirements do vary, but I would not encourage anyone to shrug them off as 
 overblown, especially as Hadoop become more production oriented in 
 utilization.

 ---
 Sent from my Blackberry so please excuse typing and spelling errors.


 - Original Message -
 From: Brian Bockelman [bbock...@cse.unl.edu]
 Sent: 09/17/2011 05:11 PM EST
 To: common-user@hadoop.apache.org
 Subject: Re: risks of using Hadoop




 On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:

 Hi Kobina,

 Some experiences which may helpful for you with respective to DFS.

 1. Selecting the correct version.
   I will recommend to use 0.20X version. This is pretty stable version and 
 all other organizations prefers it. Well tested as well.
 Dont go for 21 version.This version is not a stable version.This is risk.

 2. You should perform thorough test with your customer operations.
 (of-course you will do this :-))

 3. 0.20x version has the problem of SPOF.
  If NameNode goes down you will loose the data.One way of recovering is by 
 using the secondaryNameNode.You can recover the data till last 
 checkpoint.But here manual intervention is required.
 In latest trunk SPOF will be addressed bu HDFS-1623.

 4. 0.20x NameNodes can not scale. Federation changes included in latest 
 versions. ( i think in 22). this may not be the problem for your cluster. 
 But please consider this aspect as well.


 With respect to (3) and (4) - these are often completely overblown for many 
 Hadoop use cases.  If you use Hadoop as originally designed (large scale 
 batch data processing), these likely don't matter.

 If you're looking at some of the newer use cases (low latency stuff or 
 time-critical processing), or if you architect your solution poorly (lots of 
 small files), these issues become relevant.  Another case where I see folks 
 get frustrated is using Hadoop as a plain old batch system; for non-data 
 workflows, it doesn't measure up against specialized systems.

 You really want to make sure that Hadoop is the best tool for your job.

 Brian



Re: risks of using Hadoop

2011-09-17 Thread Brian Bockelman
:) I think we can agree to that point.  Hopefully a plethora of viewpoints is 
good for the community!

(And when we run into something that needs higher availability, I'll drop by 
and say hi!)

On Sep 17, 2011, at 8:32 PM, Tom Deutsch wrote:

 Not trying to give you a hard time Brian - we just have different 
 users/customers/expectations on us.
 
 
 
 ---
 Sent from my Blackberry so please excuse typing and spelling errors.
 
 
 - Original Message -
 From: Brian Bockelman [bbock...@cse.unl.edu]
 Sent: 09/17/2011 08:10 PM EST
 To: common-user@hadoop.apache.org
 Subject: Re: risks of using Hadoop
 
 
 
 Data loss in a batch-oriented environment is different than data loss in an 
 online/production environment.  It's a trade-off, and I personally think many 
 folks don't weigh the costs well.
 
 As you mention - Hadoop is becoming more production oriented in utilization.  
 *In those cases*, you definitely don't want to shrug off data loss / 
 downtime.  However, there's many people who simply don't need this.
 
 If I'm told that I can buy a 10% larger cluster by accepting up to 15 minutes 
 of data loss, I'd do it in a heartbeat where I work.
 
 Brian
 
 On Sep 17, 2011, at 6:38 PM, Tom Deutsch wrote:
 
 I disagree Brian - data loss and system down time (both potentially 
 non-trival) should not be taken lightly. Use cases and thus availability 
 requirements do vary, but I would not encourage anyone to shrug them off as 
 overblown, especially as Hadoop become more production oriented in 
 utilization.
 
 ---
 Sent from my Blackberry so please excuse typing and spelling errors.
 
 
 - Original Message -
 From: Brian Bockelman [bbock...@cse.unl.edu]
 Sent: 09/17/2011 05:11 PM EST
 To: common-user@hadoop.apache.org
 Subject: Re: risks of using Hadoop
 
 
 
 
 On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:
 
 Hi Kobina,
 
 Some experiences which may helpful for you with respective to DFS.
 
 1. Selecting the correct version.
  I will recommend to use 0.20X version. This is pretty stable version and 
 all other organizations prefers it. Well tested as well.
 Dont go for 21 version.This version is not a stable version.This is risk.
 
 2. You should perform thorough test with your customer operations.
 (of-course you will do this :-))
 
 3. 0.20x version has the problem of SPOF.
 If NameNode goes down you will loose the data.One way of recovering is by 
 using the secondaryNameNode.You can recover the data till last 
 checkpoint.But here manual intervention is required.
 In latest trunk SPOF will be addressed bu HDFS-1623.
 
 4. 0.20x NameNodes can not scale. Federation changes included in latest 
 versions. ( i think in 22). this may not be the problem for your cluster. 
 But please consider this aspect as well.
 
 
 With respect to (3) and (4) - these are often completely overblown for many 
 Hadoop use cases.  If you use Hadoop as originally designed (large scale 
 batch data processing), these likely don't matter.
 
 If you're looking at some of the newer use cases (low latency stuff or 
 time-critical processing), or if you architect your solution poorly (lots of 
 small files), these issues become relevant.  Another case where I see folks 
 get frustrated is using Hadoop as a plain old batch system; for non-data 
 workflows, it doesn't measure up against specialized systems.
 
 You really want to make sure that Hadoop is the best tool for your job.
 
 Brian



RE: risks of using Hadoop

2011-09-17 Thread Michael Segel

Gee Tom,
No disrespect, but I don't believe you have any personal practical experience 
in designing and building out clusters or putting them to the test.

Now to the points that Brian raised..

1) SPOF... it sounds great on paper. Some FUD to scare someone away from 
Hadoop. But in reality... you can mitigate your risks by setting up RAID on 
your NN/HM node. You can also NFS-mount a copy to your SN (or whatever they're 
calling it these days...) Or you can go to MapR, which has redesigned HDFS to 
remove this problem. But with Apache Hadoop or Cloudera's release, losing 
your NN is rare. Yes, it can happen, but it's not your greatest risk. (Not by a 
long shot)

2) Data Loss.
You can mitigate this as well. Do I need to go through all of the options and 
DR/BCP planning? Sure there's always a chance that you have some Luser who does 
something brain dead. This is true of all databases and systems. (I know I can 
probably recount some of IBM's Informix and DB2 having data loss issues. But 
that's a topic for another time. ;-)

I can't speak for Brian, but I don't think he's trivializing it. In fact I 
think he's doing a fine job of level setting expectations.

And if you talk to Ted Dunning of MapR, I'm sure he'll point out that their 
current release does address points 3 and 4 again making their risks moot. (At 
least if you're using MapR)

-Mike


 Subject: Re: risks of using Hadoop
 From: tdeut...@us.ibm.com
 Date: Sat, 17 Sep 2011 17:38:27 -0600
 To: common-user@hadoop.apache.org
 
 I disagree Brian - data loss and system down time (both potentially 
 non-trival) should not be taken lightly. Use cases and thus availability 
 requirements do vary, but I would not encourage anyone to shrug them off as 
 overblown, especially as Hadoop become more production oriented in 
 utilization.
 
 ---
 Sent from my Blackberry so please excuse typing and spelling errors.
 
 
 - Original Message -
 From: Brian Bockelman [bbock...@cse.unl.edu]
 Sent: 09/17/2011 05:11 PM EST
 To: common-user@hadoop.apache.org
 Subject: Re: risks of using Hadoop
 
 
 
 
 On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:
 
  Hi Kobina,
  
  Some experiences which may helpful for you with respective to DFS. 
  
  1. Selecting the correct version.
 I will recommend to use 0.20X version. This is pretty stable version and 
  all other organizations prefers it. Well tested as well.
  Dont go for 21 version.This version is not a stable version.This is risk.
  
  2. You should perform thorough test with your customer operations. 
   (of-course you will do this :-))
  
  3. 0.20x version has the problem of SPOF.
If NameNode goes down you will loose the data.One way of recovering is by 
  using the secondaryNameNode.You can recover the data till last 
  checkpoint.But here manual intervention is required.
  In latest trunk SPOF will be addressed bu HDFS-1623.
  
  4. 0.20x NameNodes can not scale. Federation changes included in latest 
  versions. ( i think in 22). this may not be the problem for your cluster. 
  But please consider this aspect as well.
  
 
 With respect to (3) and (4) - these are often completely overblown for many 
 Hadoop use cases.  If you use Hadoop as originally designed (large scale 
 batch data processing), these likely don't matter.
 
 If you're looking at some of the newer use cases (low latency stuff or 
 time-critical processing), or if you architect your solution poorly (lots of 
 small files), these issues become relevant.  Another case where I see folks 
 get frustrated is using Hadoop as a plain old batch system; for non-data 
 workflows, it doesn't measure up against specialized systems.
 
 You really want to make sure that Hadoop is the best tool for your job.
 
 Brian
  

risks of using Hadoop

2011-09-16 Thread Kobina Kwarko
Hello,

Please can someone point some of the risks we may incur if we decide to
implement Hadoop?

BR,

Isaac.


Re: risks of using Hadoop

2011-09-16 Thread Uma Maheswara Rao G 72686
Hello,

First of all, where are you planning to use Hadoop?

Regards,
Uma
- Original Message -
From: Kobina Kwarko kobina.kwa...@gmail.com
Date: Saturday, September 17, 2011 0:41 am
Subject: risks of using Hadoop
To: common-user common-user@hadoop.apache.org

 Hello,
 
 Please can someone point some of the risks we may incur if we 
 decide to
 implement Hadoop?
 
 BR,
 
 Isaac.
 


Re: risks of using Hadoop

2011-09-16 Thread Harsh J
Hey Kobina,

You might find some interesting results with your data that may change
the world. Big risk, I'd say :-)

On Sat, Sep 17, 2011 at 12:41 AM, Kobina Kwarko kobina.kwa...@gmail.com wrote:
 Hello,

 Please can someone point some of the risks we may incur if we decide to
 implement Hadoop?

J/K. As Uma says, we need more context.

-- 
Harsh J


Re: risks of using Hadoop

2011-09-16 Thread Tom Deutsch

And that once your business folks see what they have been missing, you'll
never be able to stop giving them the benefit of that insight.

--Original Message--
From: Harsh J
To: common-user
ReplyTo: common-user
Subject: Re: risks of using Hadoop
Sent: Sep 16, 2011 12:38 PM

Hey Kobina,

You might find some interesting results with your data that may change
the world. Big risk, I'd say :-)

On Sat, Sep 17, 2011 at 12:41 AM, Kobina Kwarko kobina.kwa...@gmail.com
wrote:
 Hello,

 Please can someone point some of the risks we may incur if we decide to
 implement Hadoop?

J/K. As Uma says, we need more context.

--
Harsh J


---
Sent from my Blackberry so please excuse typing and spelling errors.



RE: risks of using Hadoop

2011-09-16 Thread Michael Segel

Risks?

Well, if you come to Hadoop World in November, we actually have a presentation 
that might help reduce some of your initial risks.

There are always risks when starting a new project. Regardless of the 
underlying technology, you have costs associated with failure, and unless you 
can level-set expectations you'll increase your odds of failure. 

Best advice... don't listen to sales critters or marketing folks. ;-) [Right 
Tom?]
They have an agenda.
 ;-)

 Date: Fri, 16 Sep 2011 20:11:20 +0100
 Subject: risks of using Hadoop
 From: kobina.kwa...@gmail.com
 To: common-user@hadoop.apache.org
 
 Hello,
 
 Please can someone point some of the risks we may incur if we decide to
 implement Hadoop?
 
 BR,
 
 Isaac.
  

Re: risks of using Hadoop

2011-09-16 Thread Kobina Kwarko
We are planning to use Hadoop in my organisation for quality of services
analysis out of CDR records from mobile operators. We are thinking of having
a small cluster of maybe 10 - 15 nodes and I'm preparing the proposal. My
office requires that I provide some risk analysis in the proposal.

thank you.

On 16 September 2011 20:34, Uma Maheswara Rao G 72686
mahesw...@huawei.comwrote:

 Hello,

 First of all where you are planning to use Hadoop?

 Regards,
 Uma
 - Original Message -
 From: Kobina Kwarko kobina.kwa...@gmail.com
 Date: Saturday, September 17, 2011 0:41 am
 Subject: risks of using Hadoop
 To: common-user common-user@hadoop.apache.org

  Hello,
 
  Please can someone point some of the risks we may incur if we
  decide to
  implement Hadoop?
 
  BR,
 
  Isaac.
 



Re: risks of using Hadoop

2011-09-16 Thread Uma Maheswara Rao G 72686
Hi Kobina,
 
 Some experiences which may be helpful for you with respect to DFS. 

 1. Selecting the correct version.
I recommend using a 0.20.x version. This is a pretty stable version and 
most other organizations prefer it. It is well tested as well. (There is a 
small version-check sketch after this list.)
 Don't go for the 0.21 version. It is not a stable release, so it is a risk.

2. You should perform thorough tests with your customer operations. 
  (of course you will do this :-))

3. 0.20.x versions have the problem of SPOF.
   If the NameNode goes down you will lose the data. One way of recovering is by 
using the SecondaryNameNode. You can recover the data up to the last checkpoint, 
but manual intervention is required here.
In the latest trunk, SPOF will be addressed by HDFS-1623.

4. 0.20.x NameNodes cannot scale. Federation changes are included in later 
versions (I think in 0.22). This may not be a problem for your cluster, but 
please consider this aspect as well.

5. Please select the Hadoop version depending on your security requirements. 
There are versions with security support available in 0.20.x as well.

6. If you plan to use HBase, it requires append support. The 20Append branch has 
the support for append. The 0.20.205 release will also have append support but is 
not yet released. Choose the correct version to avoid sudden surprises.
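
As a small, hedged sketch for point 1 (the class name is made up), you can 
print the version of the Hadoop jars on your classpath; the cluster daemons 
report their own build on their web UIs:

  import org.apache.hadoop.util.VersionInfo;

  public class ShowHadoopVersion {
    public static void main(String[] args) {
      // e.g. "0.20.2", plus the exact build it was compiled from.
      System.out.println("Hadoop version: " + VersionInfo.getVersion());
      System.out.println("Built from:     " + VersionInfo.getBuildVersion());
    }
  }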



Regards,
Uma
- Original Message -
From: Kobina Kwarko kobina.kwa...@gmail.com
Date: Saturday, September 17, 2011 3:42 am
Subject: Re: risks of using Hadoop
To: common-user@hadoop.apache.org

 We are planning to use Hadoop in my organisation for quality of 
 services analysis out of CDR records from mobile operators. We are 
 thinking of having
 a small cluster of may be 10 - 15 nodes and I'm preparing the 
 proposal. my
 office requires that i provide some risk analysis in the proposal.
 
 thank you.
 
 On 16 September 2011 20:34, Uma Maheswara Rao G 72686
 mahesw...@huawei.comwrote:
 
  Hello,
 
  First of all where you are planning to use Hadoop?
 
  Regards,
  Uma
  - Original Message -
  From: Kobina Kwarko kobina.kwa...@gmail.com
  Date: Saturday, September 17, 2011 0:41 am
  Subject: risks of using Hadoop
  To: common-user common-user@hadoop.apache.org
 
   Hello,
  
   Please can someone point some of the risks we may incur if we
   decide to
   implement Hadoop?
  
   BR,
  
   Isaac.
  
 
 


Re: risks of using Hadoop

2011-09-16 Thread Kobina Kwarko
Hi Uma,

Response very much appreciated.

Thanks.

Isaac.

On 17 September 2011 05:08, Uma Maheswara Rao G 72686
mahesw...@huawei.comwrote:

 Hi Kobina,

  Some experiences which may helpful for you with respective to DFS.

  1. Selecting the correct version.
I will recommend to use 0.20X version. This is pretty stable version and
 all other organizations prefers it. Well tested as well.
  Dont go for 21 version.This version is not a stable version.This is risk.

 2. You should perform thorough test with your customer operations.
  (of-course you will do this :-))

 3. 0.20x version has the problem of SPOF.
   If NameNode goes down you will loose the data.One way of recovering is by
 using the secondaryNameNode.You can recover the data till last
 checkpoint.But here manual intervention is required.
 In latest trunk SPOF will be addressed bu HDFS-1623.

 4. 0.20x NameNodes can not scale. Federation changes included in latest
 versions. ( i think in 22). this may not be the problem for your cluster.
 But please consider this aspect as well.

 5. Please select the hadoop version depending on your security
 requirements. There are versions available for security as well in 0.20X.

 6. If you plan to use Hbase, it requires append support. 20Append has the
 support for append. 0.20.205 release also will have append support but not
 yet released. Choose your correct version to avoid sudden surprises.



 Regards,
 Uma
 - Original Message -
 From: Kobina Kwarko kobina.kwa...@gmail.com
 Date: Saturday, September 17, 2011 3:42 am
 Subject: Re: risks of using Hadoop
 To: common-user@hadoop.apache.org

  We are planning to use Hadoop in my organisation for quality of
  services analysis out of CDR records from mobile operators. We are
  thinking of having
  a small cluster of may be 10 - 15 nodes and I'm preparing the
  proposal. my
  office requires that i provide some risk analysis in the proposal.
 
  thank you.
 
  On 16 September 2011 20:34, Uma Maheswara Rao G 72686
  mahesw...@huawei.comwrote:
 
   Hello,
  
   First of all where you are planning to use Hadoop?
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 0:41 am
   Subject: risks of using Hadoop
   To: common-user common-user@hadoop.apache.org
  
Hello,
   
Please can someone point some of the risks we may incur if we
decide to
implement Hadoop?
   
BR,
   
Isaac.