Fwd: Job Scheduler, Task Scheduler and Fair Scheduler

2011-09-17 Thread kartheek muthyala
Any updates!!

-- Forwarded message --
From: kartheek muthyala kartheek0...@gmail.com
Date: Fri, Sep 16, 2011 at 8:38 PM
Subject: Job Scheduler, Task Scheduler and Fair Scheduler
To: common-user@hadoop.apache.org


Hi all,
Can anyone explain the responsibilities of each scheduler? I am
interested in the flow of commands that goes between these schedulers. And does
anyone have any info on how the job scheduler schedules a job based
on data locality? As far as I know, there is some heartbeat mechanism that
goes from the task scheduler to the job scheduler, and in response the job scheduler
does something to find the node where the data is located most closely
and schedules the task on that node. Is there a more elaborate
explanation of this area? Any help will be greatly appreciated.
Thanks and Regards,
Kartheek.


Re: Job Scheduler, Task Scheduler and Fair Scheduler

2011-09-17 Thread Arun C Murthy

On Sep 16, 2011, at 11:26 PM, kartheek muthyala wrote:

 Any updates!!

A bit of patience will help. It also helps to do some homework and ask specific 
questions.

I don't know if you have looked at any of the code, but there are 3 schedulers:
JobQueueTaskScheduler (aka default scheduler or fifo scheduler)
Capacity Scheduler (CS)
Fair Scheduler (FS).

TaskScheduler is just an interface for all schedulers (default, CS, FS).

Then there is JobInProgress, which handles scheduling for the map tasks of an 
individual job based on data locality (JobInProgress.obtainNew*MapTask).

Other than that, each of the schedulers (default, CS, FS) uses different criteria 
for picking a job to offer a 'slot' on a given TT when one becomes available.
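
To make that flow concrete, here is a small, self-contained sketch of the control loop: a TaskTracker heartbeat reports a free slot, the configured scheduler picks a job, and the job's own bookkeeping picks a locality-aware task. The names loosely mirror the real classes (TaskScheduler, JobInProgress, obtainNewMapTask), but the types and signatures are simplified for illustration and are not the actual Hadoop API (the real code also tries rack-local and speculative tasks):

import java.util.*;

// Role of o.a.h.mapred.TaskScheduler: answer a TaskTracker heartbeat that
// reports a free slot with a list of tasks to run (simplified signature).
interface TaskScheduler {
  List<String> assignTasks(String trackerHost, List<JobInProgress> jobs);
}

// Per-job bookkeeping: knows which input splits are still pending and on
// which hosts their HDFS replicas live, so it can prefer node-local tasks.
class JobInProgress {
  final String jobId;
  final Map<String, List<String>> pendingSplits; // split id -> replica hosts

  JobInProgress(String jobId, Map<String, List<String>> splits) {
    this.jobId = jobId;
    this.pendingSplits = new LinkedHashMap<>(splits);
  }

  // Analogue of obtainNewMapTask: pick a split whose data is on the
  // requesting tracker if possible, otherwise fall back to any split.
  String obtainNewMapTask(String trackerHost) {
    Iterator<Map.Entry<String, List<String>>> it = pendingSplits.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, List<String>> e = it.next();
      if (e.getValue().contains(trackerHost)) {
        it.remove();
        return jobId + "/map-" + e.getKey() + " (node-local)";
      }
    }
    it = pendingSplits.entrySet().iterator();
    if (!it.hasNext()) return null;
    Map.Entry<String, List<String>> e = it.next();
    it.remove();
    return jobId + "/map-" + e.getKey() + " (non-local)";
  }
}

// FIFO policy, analogous to JobQueueTaskScheduler: offer the free slot to
// the oldest job that still has work. CS and FS differ mainly in how they
// choose which job gets the slot; the locality logic stays in JobInProgress.
class FifoScheduler implements TaskScheduler {
  public List<String> assignTasks(String trackerHost, List<JobInProgress> jobs) {
    for (JobInProgress jip : jobs) {
      String task = jip.obtainNewMapTask(trackerHost);
      if (task != null) return Collections.singletonList(task);
    }
    return Collections.emptyList();
  }
}

public class SchedulerSketch {
  public static void main(String[] args) {
    JobInProgress job = new JobInProgress("job_1", Map.of(
        "split0", List.of("nodeA", "nodeB"),
        "split1", List.of("nodeC")));
    TaskScheduler scheduler = new FifoScheduler();
    // Each heartbeat from a TaskTracker with a free slot gets an answer:
    System.out.println(scheduler.assignTasks("nodeC", List.of(job))); // node-local pick
    System.out.println(scheduler.assignTasks("nodeZ", List.of(job))); // non-local fallback
  }
}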

All of this has changed radically with MRv2, which is now in 
branch-0.23 and trunk and allows MR and non-MR apps on the same Hadoop cluster:
http://wiki.apache.org/hadoop/NextGenMapReduce

Arun

 
 -- Forwarded message --
 From: kartheek muthyala kartheek0...@gmail.com
 Date: Fri, Sep 16, 2011 at 8:38 PM
 Subject: Job Scheduler, Task Scheduler and Fair Scheduler
 To: common-user@hadoop.apache.org
 
 
 Hi all,
 Can anyone explain the responsibilities of each scheduler? I am
 interested in the flow of commands that goes between these schedulers. And does
 anyone have any info on how the job scheduler schedules a job based
 on data locality? As far as I know, there is some heartbeat mechanism that
 goes from the task scheduler to the job scheduler, and in response the job scheduler
 does something to find the node where the data is located most closely
 and schedules the task on that node. Is there a more elaborate
 explanation of this area? Any help will be greatly appreciated.
 Thanks and Regards,
 Kartheek.



Ganglia for hadoop monitoring

2011-09-17 Thread john smith
Hi all,

First of all, Ganglia integration with Hadoop is an awesome feature. Kudos
to the Hadoop devs. Unfortunately it's not working out for me. I am unable to
see Hadoop-specific metrics in my Ganglia frontend. My configurations are
as follows:

gmetad.conf:

data_source "hadoop test" <host-name of gmetad>   (I also tried data_source
"hadoop test" <list of machine:port separated by spaces>. None of them
worked.)

gmond conf:

cluster {
  name = "hadoop test"
  owner = "jS"
  latlong = 
  url = "abc.com"
}

udp_send_channel {
  host = host-name of gmetad
  port = 8649
  ttl = 1
}
udp_recv_channel {
  port = 8649
}

I restarted my cluster several times, and also the gmonds and gmetad. However I
am still unable to see Hadoop metrics on my page. In fact, nothing with the name
"hadoop test" turns up.

Am I missing something? I have a couple of doubts here.

1) Is it compulsory that the Hadoop NN runs the gmetad daemon and the
front end? Every tutorial on the net assumes it that way. However, I am
running gmetad on a separate node, and I included the NN in the node list
(running a gmond on it).
2) How does Hadoop communicate its metrics to the gmond? (A sketch of the
relevant config follows below.)
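
Regarding (2): in the 0.20-era metrics framework the Hadoop daemons themselves push their metrics to a gmond, driven by conf/hadoop-metrics.properties on every node. A rough sketch of what that usually looks like (assuming Ganglia 3.0; Ganglia 3.1 and later need GangliaContext31 instead, and the daemons must be restarted after the change, with values here purely illustrative):

# conf/hadoop-metrics.properties on every NN/DN/JT/TT node
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=<gmond host>:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=<gmond host>:8649

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=<gmond host>:8649

In that setup the NN should not need to run gmetad or the frontend itself; it only needs to be able to reach a gmond (often the one running locally), and gmetad then polls the gmonds as usual.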

Any help is highly appreciated.

Thanks ,
JS


Any way to recover CORRUPT/MISSING blocks? (was: HELP NEEDED: What to do after crash and fsck says that .2% Blocks missing. Namenode in safemode)

2011-09-17 Thread Robert J Berger
Just want to follow up, first to thank QwertyM, aka Harsh Chouraria, for helping 
me out on the IRC channel. Well beyond the call of duty! It's people like Harsh 
who make the HBase/Hadoop community what it is, and one of the joys of working 
with this technology. And then one follow-on question on how to recover from 
CORRUPT blocks.

The main thing I learnt (other than to be careful not to install packages on all 
the regionservers/slaves at one time, which may cause Out of Memory errors and 
crash all your Java processes) is this:

If your namenode is stuck in safe mode, even though the namenode log says that 
safe mode will be turned off automatically, and there is enough wrong with your 
HDFS system (like too many under-replicated blocks), it seems the namenode has to 
be taken out of safe mode before it can correct the problem... 

I had hallucinated that the datanodes, by doing their verifications, were doing the 
work to get the namenode out of safe mode, and I probably would have waited another 
few hours if Harsh hadn't helped me out and told me what probably everyone but me 
knew:

hadoop dfsadmin -safemode leave
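
For reference, the related commands and the knob the log message refers to on a stock 0.20 install (a sketch, not a recipe):

hadoop dfsadmin -safemode get     # prints whether safe mode is ON or OFF
hadoop dfsadmin -safemode wait    # blocks until the NN leaves safe mode
hadoop dfsadmin -safemode leave   # forces it out, as above

The 0.9990 threshold in the log message comes from dfs.safemode.threshold.pct in hdfs-site.xml (0.999 by default).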


CURRENT QUESTION ON CORRUPT BLOCKS:
--

After that the namenode did get all the under-replicated blocks replicated, but 
I ended up with about 200 blocks that fsck considered CORRUPT and/or MISSING. 
It looked like tables were being compacted when the outage occurred; otherwise 
I don't know why a lot of the bad blocks are in old tables rather than in data 
being written at the time of the crash. The HDFS filesystem dates also showed 
them as being old.

I am not sure what the best thing to do now is in order to recover the 
CORRUPT/MISSING blocks and get fsck to say all is healthy. 

Is the best thing to just do:

hadoop fsck -move

which will move what is left of the corrupt blocks into hdfs /lost+found?
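
For what it's worth, a rough way to see exactly which files own the bad blocks before deciding (standard fsck options; the output path is just an example):

hadoop fsck / -files -blocks -locations > /tmp/fsck-report.txt
grep -i -E 'CORRUPT|MISSING' /tmp/fsck-report.txt

# -move salvages whatever healthy blocks remain of each damaged file into
# /lost+found; -delete removes the damaged files outright
hadoop fsck / -move

Note that fsck works at the file level: a block whose replicas are all gone cannot be rebuilt from HDFS itself, so those files generally have to come back from a backup/export or be regenerated.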

Is there any way to recover those blocks? 

I may be able to get them from the backup/export of all our tables that we did 
recently, and I believe I can regenerate the rest. But it would be nice to know 
whether there is a way to recover them directly, if it came to that.

Thanks in advance.
Rob
 
On Sep 16, 2011, at 12:50 AM, Robert J Berger wrote:

 Just had an HDFS/HBase instance where all the slave/regionserver processes 
 crashed, but the namenode stayed up. I did a proper shutdown of the namenode.
 
 After bringing Hadoop back up, the namenode is stuck in safe mode. Fsck shows 
 235 corrupt/missing blocks out of 117280 blocks. All the slaves are doing 
 DataBlockScanner: Verification succeeded. As far as I can tell, there are no 
 errors on the datanodes.
 
 Can I expect it to self-heal? Or do I need to do something to help it along? 
 Any way to tell how long it will take to recover if I do have to just wait?
 
 Other than the verification messages on the datanodes, the namenode fsck 
 numbers are not changing and the namenode log continues to say:
 
 The ratio of reported blocks 0.9980 has not reached the threshold 0.9990. 
 Safe mode will be turned off automatically.
 
 The ratio has not changed for over an hour now.
 
 If you happen to know the answer, please get back to me right away by email 
 or on #hadoop IRC as I'm trying to figure it out now...
 
 Thanks!
 __
 Robert J Berger - CTO
 Runa Inc.
 +1 408-838-8896
 http://blog.ibd.com
 
 
 

__
Robert J Berger - CTO
Runa Inc.
+1 408-838-8896
http://blog.ibd.com





Re: Job Scheduler, Task Scheduler and Fair Scheduler

2011-09-17 Thread kartheek muthyala
Hey Arun,
Thanks for the information. And sorry for my previous mail regarding
updates!! I just wanted to emphasize the importance of the query. I couldn't
find enough time to go through the code; that's why I approached you guys, as
you are the experts in this area.
Thanks & Regards,
Kartheek.

On Sat, Sep 17, 2011 at 12:09 PM, Arun C Murthy a...@hortonworks.com wrote:


 On Sep 16, 2011, at 11:26 PM, kartheek muthyala wrote:

  Any updates!!

 A bit of patience will help. It also helps to do some homework and ask
 specific questions.

 I don't know if you have looked at any of the code, but there are 3
 schedulers:
 JobQueueTaskScheduler (aka default scheduler or fifo scheduler)
 Capacity Scheduler (CS)
 Fair Scheduler (FS).

 TaskScheduler is just an interface for all schedulers (default, CS, FS).

 Then there is JobInProgress which handles scheduling for map tasks of an
 individual job based on data locality (JobInProgress.obtainNew*MapTask).

 Other than that each of the schedulers (default, CS, FS) use different
 criteria for picking a certain job to offer a 'slot' on a given TT when it's
 available.

 All this has changed radically and completely with MRv2 which is now in
 branch-0.23 and trunk to allow MR and non-MR apps on same Hadoop cluster:
 http://wiki.apache.org/hadoop/NextGenMapReduce

 Arun

 
  -- Forwarded message --
  From: kartheek muthyala kartheek0...@gmail.com
  Date: Fri, Sep 16, 2011 at 8:38 PM
  Subject: Job Scheduler, Task Scheduler and Fair Scheduler
  To: common-user@hadoop.apache.org
 
 
  Hi all,
  Can anyone explain the responsibilities of each scheduler? I am
  interested in the flow of commands that goes between these schedulers. And does
  anyone have any info on how the job scheduler schedules a job based
  on data locality? As far as I know, there is some heartbeat mechanism that
  goes from the task scheduler to the job scheduler, and in response the job scheduler
  does something to find the node where the data is located most closely
  and schedules the task on that node. Is there a more elaborate
  explanation of this area? Any help will be greatly appreciated.
  Thanks and Regards,
  Kartheek.




Re: risks of using Hadoop

2011-09-17 Thread Uma Maheswara Rao G 72686
Hi George,

You can use it normally as well. Append interfaces will be exposed.
For HBase, append support is very much required.

Regards,
Uma

- Original Message -
From: George Kousiouris gkous...@mail.ntua.gr
Date: Saturday, September 17, 2011 12:29 pm
Subject: Re: risks of using Hadoop
To: common-user@hadoop.apache.org
Cc: Uma Maheswara Rao G 72686 mahesw...@huawei.com

 
 Hi,
 
 When you say that 0.20.205 will support appends, do you mean for
 general-purpose writes on HDFS, or only for HBase?
 
 Thanks,
 George
 
 On 9/17/2011 7:08 AM, Uma Maheswara Rao G 72686 wrote:
  6. If you plan to use HBase, it requires append support. 20Append 
 has support for append. The 0.20.205 release will also have append 
 support, but it is not yet released. Choose the correct version to avoid 
 sudden surprises.
 
 
 
  Regards,
  Uma
  - Original Message -
  From: Kobina Kwarkokobina.kwa...@gmail.com
  Date: Saturday, September 17, 2011 3:42 am
  Subject: Re: risks of using Hadoop
  To: common-user@hadoop.apache.org
 
  We are planning to use Hadoop in my organisation for quality of
  services analysis out of CDR records from mobile operators. We are
  thinking of having a small cluster of maybe 10-15 nodes, and I'm
  preparing the proposal. My office requires that I provide some risk
  analysis in the proposal.
 
  thank you.
 
  On 16 September 2011 20:34, Uma Maheswara Rao G 72686
  mahesw...@huawei.comwrote:
 
  Hello,
 
  First of all where you are planning to use Hadoop?
 
  Regards,
  Uma
  - Original Message -
  From: Kobina Kwarkokobina.kwa...@gmail.com
  Date: Saturday, September 17, 2011 0:41 am
  Subject: risks of using Hadoop
  To: common-usercommon-user@hadoop.apache.org
 
  Hello,
 
  Please can someone point some of the risks we may incur if we
  decide to
  implement Hadoop?
 
  BR,
 
  Isaac.
 
 
 
 
 -- 
 
 ---
 
 George Kousiouris
 Electrical and Computer Engineer
 Division of Communications,
 Electronics and Information Engineering
 School of Electrical and Computer Engineering
 Tel: +30 210 772 2546
 Mobile: +30 6939354121
 Fax: +30 210 772 2569
 Email: gkous...@mail.ntua.gr
 Site: http://users.ntua.gr/gkousiou/
 
 National Technical University of Athens
 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
 
 


Re: risks of using Hadoop

2011-09-17 Thread Todd Lipcon
To clarify, *append* is not supported and is known to be buggy. *sync*
support is what HBase needs and what 0.20.205 will support. Before 205
is released, you can also find these features in CDH3 or by building
your own release from SVN.
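
For concreteness, a minimal sketch of what that sync support looks like from a client, written against the 0.20-append / CDH3 style API; the path here is hypothetical, and later releases expose this as hflush()/hsync() rather than sync():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // HBase-style write-ahead logging: write an edit, then force it out to
    // the datanode pipeline so it survives a client/regionserver crash.
    FSDataOutputStream out = fs.create(new Path("/tmp/wal-demo")); // hypothetical path
    out.writeBytes("edit #1\n");
    out.sync();                  // the call HBase depends on
    out.writeBytes("edit #2\n");
    out.sync();
    out.close();

    // Reopening a *closed* file to add more data is append(), which is the
    // part that is still considered buggy in 0.20.
    fs.close();
  }
}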

-Todd

On Sat, Sep 17, 2011 at 4:59 AM, Uma Maheswara Rao G 72686
mahesw...@huawei.com wrote:
 Hi George,

 You can use it noramally as well. Append interfaces will be exposed.
 For Hbase, append support is required very much.

 Regards,
 Uma

 - Original Message -
 From: George Kousiouris gkous...@mail.ntua.gr
 Date: Saturday, September 17, 2011 12:29 pm
 Subject: Re: risks of using Hadoop
 To: common-user@hadoop.apache.org
 Cc: Uma Maheswara Rao G 72686 mahesw...@huawei.com


 Hi,

 When you say that 0.20.205 will support appends, you mean for
 general
 purpose writes on the HDFS? or only Hbase?

 Thanks,
 George

 On 9/17/2011 7:08 AM, Uma Maheswara Rao G 72686 wrote:
  6. If you plan to use Hbase, it requires append support. 20Append
 has the support for append. 0.20.205 release also will have append
 support but not yet released. Choose your correct version to avoid
 sudden surprises.
 
 
 
  Regards,
  Uma
  - Original Message -
  From: Kobina Kwarkokobina.kwa...@gmail.com
  Date: Saturday, September 17, 2011 3:42 am
  Subject: Re: risks of using Hadoop
  To: common-user@hadoop.apache.org
 
  We are planning to use Hadoop in my organisation for quality of
  services analysis out of CDR records from mobile operators. We are
  thinking of having
  a small cluster of may be 10 - 15 nodes and I'm preparing the
  proposal. my
  office requires that i provide some risk analysis in the proposal.
 
  thank you.
 
  On 16 September 2011 20:34, Uma Maheswara Rao G 72686
  mahesw...@huawei.comwrote:
 
  Hello,
 
  First of all where you are planning to use Hadoop?
 
  Regards,
  Uma
  - Original Message -
  From: Kobina Kwarkokobina.kwa...@gmail.com
  Date: Saturday, September 17, 2011 0:41 am
  Subject: risks of using Hadoop
  To: common-usercommon-user@hadoop.apache.org
 
  Hello,
 
  Please can someone point some of the risks we may incur if we
  decide to
  implement Hadoop?
 
  BR,
 
  Isaac.
 
 


 --

 ---

 George Kousiouris
 Electrical and Computer Engineer
 Division of Communications,
 Electronics and Information Engineering
 School of Electrical and Computer Engineering
 Tel: +30 210 772 2546
 Mobile: +30 6939354121
 Fax: +30 210 772 2569
 Email: gkous...@mail.ntua.gr
 Site: http://users.ntua.gr/gkousiou/

 National Technical University of Athens
 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece






-- 
Todd Lipcon
Software Engineer, Cloudera


Re: risks of using Hadoop

2011-09-17 Thread Uma Maheswara Rao G 72686
Yes, 
I was mentioning append before because the branch name itself is 20Append. sync is 
the main API used to sync the edits. 

@George,
You mainly need to consider the HBase usage:
sync is supported;
the append API has some open issues, for example 
https://issues.apache.org/jira/browse/HDFS-1228

Apologies for any confusion.

Thanks a lot for the clarification!
 
Thanks
Uma
- Original Message -
From: Todd Lipcon t...@cloudera.com
Date: Sunday, September 18, 2011 1:35 am
Subject: Re: risks of using Hadoop
To: common-user@hadoop.apache.org

 To clarify, *append* is not supported and is known to be buggy. *sync*
 support is what HBase needs and what 0.20.205 will support. Before 205
 is released, you can also find these features in CDH3 or by building
 your own release from SVN.
 
 -Todd
 
 On Sat, Sep 17, 2011 at 4:59 AM, Uma Maheswara Rao G 72686
 mahesw...@huawei.com wrote:
  Hi George,
 
  You can use it noramally as well. Append interfaces will be exposed.
  For Hbase, append support is required very much.
 
  Regards,
  Uma
 
  - Original Message -
  From: George Kousiouris gkous...@mail.ntua.gr
  Date: Saturday, September 17, 2011 12:29 pm
  Subject: Re: risks of using Hadoop
  To: common-user@hadoop.apache.org
  Cc: Uma Maheswara Rao G 72686 mahesw...@huawei.com
 
 
  Hi,
 
  When you say that 0.20.205 will support appends, you mean for
  general
  purpose writes on the HDFS? or only Hbase?
 
  Thanks,
  George
 
  On 9/17/2011 7:08 AM, Uma Maheswara Rao G 72686 wrote:
   6. If you plan to use Hbase, it requires append support. 20Append
  has the support for append. 0.20.205 release also will have append
  support but not yet released. Choose your correct version to avoid
  sudden surprises.
  
  
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarkokobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 3:42 am
   Subject: Re: risks of using Hadoop
   To: common-user@hadoop.apache.org
  
   We are planning to use Hadoop in my organisation for quality of
   services analysis out of CDR records from mobile operators. We 
  are thinking of having
   a small cluster of may be 10 - 15 nodes and I'm preparing the
   proposal. my
   office requires that i provide some risk analysis in the 
 proposal. 
   thank you.
  
   On 16 September 2011 20:34, Uma Maheswara Rao G 72686
   mahesw...@huawei.comwrote:
  
   Hello,
  
   First of all where you are planning to use Hadoop?
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarkokobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 0:41 am
   Subject: risks of using Hadoop
   To: common-usercommon-user@hadoop.apache.org
  
   Hello,
  
   Please can someone point some of the risks we may incur if we
   decide to
   implement Hadoop?
  
   BR,
  
   Isaac.
  
  
 
 
  --
 
  ---
 
  George Kousiouris
  Electrical and Computer Engineer
  Division of Communications,
  Electronics and Information Engineering
  School of Electrical and Computer Engineering
  Tel: +30 210 772 2546
  Mobile: +30 6939354121
  Fax: +30 210 772 2569
  Email: gkous...@mail.ntua.gr
  Site: http://users.ntua.gr/gkousiou/
 
  National Technical University of Athens
  9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
 
 
 
 
 
 
 -- 
 Todd Lipcon
 Software Engineer, Cloudera
 


Re: risks of using Hadoop

2011-09-17 Thread Brian Bockelman

On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:

 Hi Kobina,
 
 Some experiences which may be helpful for you with respect to DFS. 
  
 1. Selecting the correct version.
I would recommend using a 0.20.x version. This is a pretty stable version and 
 all other organizations prefer it. It is well tested as well.
 Don't go for the 0.21 version. That version is not stable; it is a risk.
  
 2. You should perform thorough tests with your customer operations. 
  (Of course you will do this :-))
  
 3. The 0.20.x versions have the problem of a SPOF.
   If the NameNode goes down you will lose the data. One way of recovering is by 
 using the SecondaryNameNode: you can recover the data up to the last checkpoint, 
 but manual intervention is required here.
 In the latest trunk the SPOF will be addressed by HDFS-1623.
  
 4. 0.20.x NameNodes cannot scale. Federation changes are included in the latest 
 versions (I think in 0.22). This may not be a problem for your cluster, but 
 please consider this aspect as well.
 

With respect to (3) and (4) - these are often completely overblown for many 
Hadoop use cases.  If you use Hadoop as originally designed (large scale batch 
data processing), these likely don't matter.

If you're looking at some of the newer use cases (low latency stuff or 
time-critical processing), or if you architect your solution poorly (lots of 
small files), these issues become relevant.  Another case where I see folks get 
frustrated is using Hadoop as a plain old batch system; for non-data 
workflows, it doesn't measure up against specialized systems.

You really want to make sure that Hadoop is the best tool for your job.

Brian

Re: risks of using Hadoop

2011-09-17 Thread Tom Deutsch
I disagree, Brian - data loss and system downtime (both potentially non-trivial) 
should not be taken lightly. Use cases and thus availability requirements do vary, 
but I would not encourage anyone to shrug them off as overblown, especially as 
Hadoop becomes more production-oriented in utilization.

---
Sent from my Blackberry so please excuse typing and spelling errors.


- Original Message -
From: Brian Bockelman [bbock...@cse.unl.edu]
Sent: 09/17/2011 05:11 PM EST
To: common-user@hadoop.apache.org
Subject: Re: risks of using Hadoop




On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:

 Hi Kobina,

 Some experiences which may helpful for you with respective to DFS.

 1. Selecting the correct version.
I will recommend to use 0.20X version. This is pretty stable version and 
 all other organizations prefers it. Well tested as well.
 Dont go for 21 version.This version is not a stable version.This is risk.

 2. You should perform thorough test with your customer operations.
  (of-course you will do this :-))

 3. 0.20x version has the problem of SPOF.
   If NameNode goes down you will loose the data.One way of recovering is by 
 using the secondaryNameNode.You can recover the data till last checkpoint.But 
 here manual intervention is required.
 In latest trunk SPOF will be addressed bu HDFS-1623.

 4. 0.20x NameNodes can not scale. Federation changes included in latest 
 versions. ( i think in 22). this may not be the problem for your cluster. But 
 please consider this aspect as well.


With respect to (3) and (4) - these are often completely overblown for many 
Hadoop use cases.  If you use Hadoop as originally designed (large scale batch 
data processing), these likely don't matter.

If you're looking at some of the newer use cases (low latency stuff or 
time-critical processing), or if you architect your solution poorly (lots of 
small files), these issues become relevant.  Another case where I see folks get 
frustrated is using Hadoop as a plain old batch system; for non-data 
workflows, it doesn't measure up against specialized systems.

You really want to make sure that Hadoop is the best tool for your job.

Brian


Re: risks of using Hadoop

2011-09-17 Thread Brian Bockelman
Data loss in a batch-oriented environment is different than data loss in an 
online/production environment.  It's a trade-off, and I personally think many 
folks don't weigh the costs well.

As you mention - Hadoop is becoming more production-oriented in utilization.  
*In those cases*, you definitely don't want to shrug off data loss / downtime.  
However, there are many people who simply don't need this.

If I'm told that I can buy a 10% larger cluster by accepting up to 15 minutes 
of data loss, I'd do it in a heartbeat where I work.

Brian

On Sep 17, 2011, at 6:38 PM, Tom Deutsch wrote:

 I disagree Brian - data loss and system down time (both potentially 
 non-trival) should not be taken lightly. Use cases and thus availability 
 requirements do vary, but I would not encourage anyone to shrug them off as 
 overblown, especially as Hadoop become more production oriented in 
 utilization.
 
 ---
 Sent from my Blackberry so please excuse typing and spelling errors.
 
 
 - Original Message -
 From: Brian Bockelman [bbock...@cse.unl.edu]
 Sent: 09/17/2011 05:11 PM EST
 To: common-user@hadoop.apache.org
 Subject: Re: risks of using Hadoop
 
 
 
 
 On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:
 
 Hi Kobina,
 
 Some experiences which may helpful for you with respective to DFS.
 
 1. Selecting the correct version.
   I will recommend to use 0.20X version. This is pretty stable version and 
 all other organizations prefers it. Well tested as well.
 Dont go for 21 version.This version is not a stable version.This is risk.
 
 2. You should perform thorough test with your customer operations.
 (of-course you will do this :-))
 
 3. 0.20x version has the problem of SPOF.
  If NameNode goes down you will loose the data.One way of recovering is by 
 using the secondaryNameNode.You can recover the data till last 
 checkpoint.But here manual intervention is required.
 In latest trunk SPOF will be addressed bu HDFS-1623.
 
 4. 0.20x NameNodes can not scale. Federation changes included in latest 
 versions. ( i think in 22). this may not be the problem for your cluster. 
 But please consider this aspect as well.
 
 
 With respect to (3) and (4) - these are often completely overblown for many 
 Hadoop use cases.  If you use Hadoop as originally designed (large scale 
 batch data processing), these likely don't matter.
 
 If you're looking at some of the newer use cases (low latency stuff or 
 time-critical processing), or if you architect your solution poorly (lots of 
 small files), these issues become relevant.  Another case where I see folks 
 get frustrated is using Hadoop as a plain old batch system; for non-data 
 workflows, it doesn't measure up against specialized systems.
 
 You really want to make sure that Hadoop is the best tool for your job.
 
 Brian



Re: risks of using Hadoop

2011-09-17 Thread Tom Deutsch
Not trying to give you a hard time Brian - we just have different 
users/customers/expectations on us.



---
Sent from my Blackberry so please excuse typing and spelling errors.


- Original Message -
From: Brian Bockelman [bbock...@cse.unl.edu]
Sent: 09/17/2011 08:10 PM EST
To: common-user@hadoop.apache.org
Subject: Re: risks of using Hadoop



Data loss in a batch-oriented environment is different than data loss in an 
online/production environment.  It's a trade-off, and I personally think many 
folks don't weigh the costs well.

As you mention - Hadoop is becoming more production oriented in utilization.  
*In those cases*, you definitely don't want to shrug off data loss / downtime.  
However, there's many people who simply don't need this.

If I'm told that I can buy a 10% larger cluster by accepting up to 15 minutes 
of data loss, I'd do it in a heartbeat where I work.

Brian

On Sep 17, 2011, at 6:38 PM, Tom Deutsch wrote:

 I disagree Brian - data loss and system down time (both potentially 
 non-trival) should not be taken lightly. Use cases and thus availability 
 requirements do vary, but I would not encourage anyone to shrug them off as 
 overblown, especially as Hadoop become more production oriented in 
 utilization.

 ---
 Sent from my Blackberry so please excuse typing and spelling errors.


 - Original Message -
 From: Brian Bockelman [bbock...@cse.unl.edu]
 Sent: 09/17/2011 05:11 PM EST
 To: common-user@hadoop.apache.org
 Subject: Re: risks of using Hadoop




 On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:

 Hi Kobina,

 Some experiences which may helpful for you with respective to DFS.

 1. Selecting the correct version.
   I will recommend to use 0.20X version. This is pretty stable version and 
 all other organizations prefers it. Well tested as well.
 Dont go for 21 version.This version is not a stable version.This is risk.

 2. You should perform thorough test with your customer operations.
 (of-course you will do this :-))

 3. 0.20x version has the problem of SPOF.
  If NameNode goes down you will loose the data.One way of recovering is by 
 using the secondaryNameNode.You can recover the data till last 
 checkpoint.But here manual intervention is required.
 In latest trunk SPOF will be addressed bu HDFS-1623.

 4. 0.20x NameNodes can not scale. Federation changes included in latest 
 versions. ( i think in 22). this may not be the problem for your cluster. 
 But please consider this aspect as well.


 With respect to (3) and (4) - these are often completely overblown for many 
 Hadoop use cases.  If you use Hadoop as originally designed (large scale 
 batch data processing), these likely don't matter.

 If you're looking at some of the newer use cases (low latency stuff or 
 time-critical processing), or if you architect your solution poorly (lots of 
 small files), these issues become relevant.  Another case where I see folks 
 get frustrated is using Hadoop as a plain old batch system; for non-data 
 workflows, it doesn't measure up against specialized systems.

 You really want to make sure that Hadoop is the best tool for your job.

 Brian



Re: risks of using Hadoop

2011-09-17 Thread Brian Bockelman
:) I think we can agree to that point.  Hopefully a plethora of viewpoints is 
good for the community!

(And when we run into something that needs higher availability, I'll drop by 
and say hi!)

On Sep 17, 2011, at 8:32 PM, Tom Deutsch wrote:

 Not trying to give you a hard time Brian - we just have different 
 users/customers/expectations on us.
 
 
 
 ---
 Sent from my Blackberry so please excuse typing and spelling errors.
 
 
 - Original Message -
 From: Brian Bockelman [bbock...@cse.unl.edu]
 Sent: 09/17/2011 08:10 PM EST
 To: common-user@hadoop.apache.org
 Subject: Re: risks of using Hadoop
 
 
 
 Data loss in a batch-oriented environment is different than data loss in an 
 online/production environment.  It's a trade-off, and I personally think many 
 folks don't weigh the costs well.
 
 As you mention - Hadoop is becoming more production oriented in utilization.  
 *In those cases*, you definitely don't want to shrug off data loss / 
 downtime.  However, there's many people who simply don't need this.
 
 If I'm told that I can buy a 10% larger cluster by accepting up to 15 minutes 
 of data loss, I'd do it in a heartbeat where I work.
 
 Brian
 
 On Sep 17, 2011, at 6:38 PM, Tom Deutsch wrote:
 
 I disagree Brian - data loss and system down time (both potentially 
 non-trival) should not be taken lightly. Use cases and thus availability 
 requirements do vary, but I would not encourage anyone to shrug them off as 
 overblown, especially as Hadoop become more production oriented in 
 utilization.
 
 ---
 Sent from my Blackberry so please excuse typing and spelling errors.
 
 
 - Original Message -
 From: Brian Bockelman [bbock...@cse.unl.edu]
 Sent: 09/17/2011 05:11 PM EST
 To: common-user@hadoop.apache.org
 Subject: Re: risks of using Hadoop
 
 
 
 
 On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:
 
 Hi Kobina,
 
 Some experiences which may helpful for you with respective to DFS.
 
 1. Selecting the correct version.
  I will recommend to use 0.20X version. This is pretty stable version and 
 all other organizations prefers it. Well tested as well.
 Dont go for 21 version.This version is not a stable version.This is risk.
 
 2. You should perform thorough test with your customer operations.
 (of-course you will do this :-))
 
 3. 0.20x version has the problem of SPOF.
 If NameNode goes down you will loose the data.One way of recovering is by 
 using the secondaryNameNode.You can recover the data till last 
 checkpoint.But here manual intervention is required.
 In latest trunk SPOF will be addressed bu HDFS-1623.
 
 4. 0.20x NameNodes can not scale. Federation changes included in latest 
 versions. ( i think in 22). this may not be the problem for your cluster. 
 But please consider this aspect as well.
 
 
 With respect to (3) and (4) - these are often completely overblown for many 
 Hadoop use cases.  If you use Hadoop as originally designed (large scale 
 batch data processing), these likely don't matter.
 
 If you're looking at some of the newer use cases (low latency stuff or 
 time-critical processing), or if you architect your solution poorly (lots of 
 small files), these issues become relevant.  Another case where I see folks 
 get frustrated is using Hadoop as a plain old batch system; for non-data 
 workflows, it doesn't measure up against specialized systems.
 
 You really want to make sure that Hadoop is the best tool for your job.
 
 Brian



RE: risks of using Hadoop

2011-09-17 Thread Michael Segel

Gee Tom,
No disrespect, but I don't believe you have any personal practical experience 
in designing and building out clusters or putting them to the test.

Now to the points that Brian raised..

1) SPOF... it sounds great on paper. Some FUD to scare someone away from 
Hadoop. But in reality... you can mitigate your risks by setting up RAID on 
your NN/HM node. You can also NFS mount a copy to your SN (or whatever they're 
calling it these days...). Or you can go to MapR, which has redesigned HDFS in a 
way that removes this problem. But with Apache Hadoop or Cloudera's release, losing 
your NN is rare. Yes, it can happen, but it is not your greatest risk (not by a 
long shot).
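
As a concrete illustration of that mitigation on stock 0.20 (the paths are hypothetical): write the namespace image and edits to more than one dfs.name.dir, one of them on an NFS mount, and keep the SecondaryNameNode checkpointing so you can rebuild from its last checkpoint in the worst case.

<!-- hdfs-site.xml: a local disk AND an NFS mount -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/dfs/name,/mnt/nfs/dfs/name</value>
</property>

# Worst case, start a replacement NN from the SecondaryNameNode's checkpoint
# (fs.checkpoint.dir must point at the copied checkpoint data):
hadoop namenode -importCheckpoint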

2) Data loss.
You can mitigate this as well. Do I need to go through all of the options and 
DR/BCP planning? Sure, there's always a chance that you have some luser who does 
something brain-dead; this is true of all databases and systems. (I could probably 
recount some of IBM's Informix and DB2 data loss issues, but that's a topic for 
another time. ;-)

I can't speak for Brian, but I don't think he's trivializing it. In fact, I 
think he's doing a fine job of level-setting expectations.

And if you talk to Ted Dunning of MapR, I'm sure he'll point out that their 
current release does address points 3 and 4, again making those risks moot (at 
least if you're using MapR).

-Mike


 Subject: Re: risks of using Hadoop
 From: tdeut...@us.ibm.com
 Date: Sat, 17 Sep 2011 17:38:27 -0600
 To: common-user@hadoop.apache.org
 
 I disagree Brian - data loss and system down time (both potentially 
 non-trival) should not be taken lightly. Use cases and thus availability 
 requirements do vary, but I would not encourage anyone to shrug them off as 
 overblown, especially as Hadoop become more production oriented in 
 utilization.
 
 ---
 Sent from my Blackberry so please excuse typing and spelling errors.
 
 
 - Original Message -
 From: Brian Bockelman [bbock...@cse.unl.edu]
 Sent: 09/17/2011 05:11 PM EST
 To: common-user@hadoop.apache.org
 Subject: Re: risks of using Hadoop
 
 
 
 
 On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote:
 
  Hi Kobina,
  
  Some experiences which may helpful for you with respective to DFS. 
  
  1. Selecting the correct version.
 I will recommend to use 0.20X version. This is pretty stable version and 
  all other organizations prefers it. Well tested as well.
  Dont go for 21 version.This version is not a stable version.This is risk.
  
  2. You should perform thorough test with your customer operations. 
   (of-course you will do this :-))
  
  3. 0.20x version has the problem of SPOF.
If NameNode goes down you will loose the data.One way of recovering is by 
  using the secondaryNameNode.You can recover the data till last 
  checkpoint.But here manual intervention is required.
  In latest trunk SPOF will be addressed bu HDFS-1623.
  
  4. 0.20x NameNodes can not scale. Federation changes included in latest 
  versions. ( i think in 22). this may not be the problem for your cluster. 
  But please consider this aspect as well.
  
 
 With respect to (3) and (4) - these are often completely overblown for many 
 Hadoop use cases.  If you use Hadoop as originally designed (large scale 
 batch data processing), these likely don't matter.
 
 If you're looking at some of the newer use cases (low latency stuff or 
 time-critical processing), or if you architect your solution poorly (lots of 
 small files), these issues become relevant.  Another case where I see folks 
 get frustrated is using Hadoop as a plain old batch system; for non-data 
 workflows, it doesn't measure up against specialized systems.
 
 You really want to make sure that Hadoop is the best tool for your job.
 
 Brian