Re: dfs.name.dir capacity for namenode backup?

2010-05-18 Thread Todd Lipcon
On Mon, May 17, 2010 at 5:10 PM, jiang licht licht_ji...@yahoo.com wrote:

 I am considering to use a machine to save a
 redundant copy of HDFS metadata through setting dfs.name.dir in
 hdfs-site.xml like this (as in YDN):

 <property>
   <name>dfs.name.dir</name>
   <value>/home/hadoop/dfs/name,/mnt/namenode-backup</value>
   <final>true</final>
 </property>

 where the two folders are on different machines so that
 /mnt/namenode-backup keeps a copy of hdfs file system information and its
 machine can be used to replace the first machine that fails as namenode.

 So, my question is how much space this hdfs metadata will consume? I guess it is
 proportional to the hdfs capacity. What ratio is that, or what size would it be
 for a 150TB hdfs?


On the order of a few GB, max (you really need double the size of your
image, so it has tmp space when downloading a checkpoint or performing an
upgrade). But on any disk you can buy these days you'll have plenty of
space.

-Todd


-- 
Todd Lipcon
Software Engineer, Cloudera


Hadoop User Group UK Meetup - June 3rd

2010-05-18 Thread Klaas Bosteels
Hi all,

I've picked up where Johan left off with the HUGUK meetups and the next one is 
planned for June 3rd. The main talks will be:

“Introduction to Sqoop” by Aaron Kimball (Cloudera)
“Hive at Last.fm” by Tim Sell (Last.fm)

More details are available at: http://dumbotics.com/2010/05/18/huguk-4/

-Klaas

Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Pierre ANCELOT
Hi,
I'm porting a legacy application to hadoop and it uses a bunch of small
files.
I'm aware that having such small files ain't a good idea but I'm not making
the technical decisions and the port has to be done for yesterday...
Of course such small files are a problem; loading 64MB blocks for a few
lines of text is an evident waste.
What will happen if I set smaller, or even way smaller (32kB), blocks?

Thank you.

Pierre ANCELOT.


Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Brian Bockelman
Hey Pierre,

These are not traditional filesystem blocks - if you save a file smaller than 
64MB, you don't lose 64MB of file space..

Hadoop will use 32KB to store a 32KB file (ok, plus a KB of metadata or so), 
not 64MB.

Brian

On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:

 Hi,
 I'm porting a legacy application to hadoop and it uses a bunch of small
 files.
 I'm aware that having such small files ain't a good idea but I'm not doing
 the technical decisions and the port has to be done for yesterday...
 Of course such small files are a problem, loading 64MB blocks for a few
 lines of text is an evident loss.
 What will happen if I set a smaller, or even way smaller (32kB) blocks?
 
 Thank you.
 
 Pierre ANCELOT.





Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Pierre ANCELOT
Hi, thanks for this fast answer :)
If so, what do you mean by blocks? If a file has to be split, it will be
split when larger than 64MB?




On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman bbock...@cse.unl.eduwrote:

 Hey Pierre,

 These are not traditional filesystem blocks - if you save a file smaller
 than 64MB, you don't lose 64MB of file space..

 Hadoop will use 32KB to store a 32KB file (ok, plus a KB of metadata or
 so), not 64MB.

 Brian

 On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:

  Hi,
  I'm porting a legacy application to hadoop and it uses a bunch of small
  files.
  I'm aware that having such small files ain't a good idea but I'm not
 doing
  the technical decisions and the port has to be done for yesterday...
  Of course such small files are a problem, loading 64MB blocks for a few
  lines of text is an evident loss.
  What will happen if I set a smaller, or even way smaller (32kB) blocks?
 
  Thank you.
 
  Pierre ANCELOT.




-- 
http://www.neko-consulting.com
Ego sum quis ego servo
Je suis ce que je protège
I am what I protect


Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Pierre ANCELOT
... and split into slices of 64MB then, I mean...?

On Tue, May 18, 2010 at 2:38 PM, Pierre ANCELOT pierre...@gmail.com wrote:

 Hi, thanks for this fast answer :)
 If so, what do you mean by blocks? If a file has to be splitted, it will be
 splitted when larger than 64MB?





 On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman bbock...@cse.unl.eduwrote:

 Hey Pierre,

 These are not traditional filesystem blocks - if you save a file smaller
 than 64MB, you don't lose 64MB of file space..

 Hadoop will use 32KB to store a 32KB file (ok, plus a KB of metadata or
 so), not 64MB.

 Brian

 On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:

  Hi,
  I'm porting a legacy application to hadoop and it uses a bunch of small
  files.
  I'm aware that having such small files ain't a good idea but I'm not
 doing
  the technical decisions and the port has to be done for yesterday...
  Of course such small files are a problem, loading 64MB blocks for a few
  lines of text is an evident loss.
  What will happen if I set a smaller, or even way smaller (32kB) blocks?
 
  Thank you.
 
  Pierre ANCELOT.




 --
 http://www.neko-consulting.com
 Ego sum quis ego servo
 Je suis ce que je protège
 I am what I protect




-- 
http://www.neko-consulting.com
Ego sum quis ego servo
Je suis ce que je protège
I am what I protect


Re: Data node decommission doesn't seem to be working correctly

2010-05-18 Thread Brian Bockelman
Hey Scott,

Hadoop tends to get confused by nodes with multiple hostnames or multiple IP 
addresses.  Is this your case?

I can't remember precisely what our admin does, but I think he puts in the IP 
address which Hadoop listens on in the exclude-hosts file.

Look in the output of 

hadoop dfsadmin -report

to determine precisely which IP address your datanode is listening on.

Brian

On May 17, 2010, at 11:32 PM, Scott White wrote:

 I followed the steps mentioned here:
 http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
 decommission a data node. What I see from the namenode is the hostname of
 the machine that I decommissioned shows up in both the list of dead nodes
 but also live nodes where its admin status is marked as 'In Service'. It's
 been twelve hours and there is no sign in the namenode logs that the node
 has been decommissioned. Any suggestions of what might be the problem and
 what to try to ensure that this node gets safely taken down?
 
 thanks in advance,
 Scott





Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Brian Bockelman

On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:

 Hi, thanks for this fast answer :)
 If so, what do you mean by blocks? If a file has to be splitted, it will be
 splitted when larger than 64MB?
 

For every 64MB of the file, Hadoop will create a separate block.  So, if you 
have a 32KB file, there will be one block of 32KB.  If the file is 65MB, then 
it will have one block of 64MB and another block of 1MB.

Splitting files is very useful for load-balancing and distributing I/O across 
multiple nodes.  At 32KB / file, you don't really need to split the files at 
all.

I recommend reading the HDFS design document for background issues like this:

http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
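
As an aside, the block size is a per-file attribute that can be requested when
a file is created. A minimal sketch, assuming the 0.20-era FileSystem.create()
overload (the path and the 1MB figure are purely illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long blockSize = 1L * 1024 * 1024;   // request 1MB blocks instead of the 64MB default
    short replication = 3;
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);
    // create(path, overwrite, bufferSize, replication, blockSize)
    FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"),
        true, bufferSize, replication, blockSize);
    out.writeUTF("a few lines of text");
    out.close();
  }
}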

Brian

 
 
 
 On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman bbock...@cse.unl.eduwrote:
 
 Hey Pierre,
 
 These are not traditional filesystem blocks - if you save a file smaller
 than 64MB, you don't lose 64MB of file space..
 
 Hadoop will use 32KB to store a 32KB file (ok, plus a KB of metadata or
 so), not 64MB.
 
 Brian
 
 On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
 
 Hi,
 I'm porting a legacy application to hadoop and it uses a bunch of small
 files.
 I'm aware that having such small files ain't a good idea but I'm not
 doing
 the technical decisions and the port has to be done for yesterday...
 Of course such small files are a problem, loading 64MB blocks for a few
 lines of text is an evident loss.
 What will happen if I set a smaller, or even way smaller (32kB) blocks?
 
 Thank you.
 
 Pierre ANCELOT.
 
 
 
 
 -- 
 http://www.neko-consulting.com
 Ego sum quis ego servo
 Je suis ce que je protège
 I am what I protect





Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Patrick Angeles
Pierre,

Adding to what Brian has said (some things are not explicitly mentioned in
the HDFS design doc)...

- If you have small files that take up < 64MB you do not actually use the
entire 64MB block on disk.
- You *do* use up RAM on the NameNode, as each block represents meta-data
that needs to be maintained in-memory in the NameNode.
- Hadoop won't perform optimally with very small block sizes. Hadoop I/O is
optimized for high sustained throughput per single file/block. There is a
penalty for doing too many seeks to get to the beginning of each block.
Additionally, you will have a MapReduce task per small file. Each MapReduce
task has a non-trivial startup overhead.
- The recommendation is to consolidate your small files into large files.
One way to do this is via SequenceFiles... put the filename in the
SequenceFile key field, and the file's bytes in the SequenceFile value
field.
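
For illustration, a minimal sketch of that consolidation, assuming the
0.20-era SequenceFile and FileSystem APIs (class name and paths are just
placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path(args[0]);   // directory holding the small files
    Path output = new Path(args[1]);  // target SequenceFile

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, output, Text.class, BytesWritable.class);
    try {
      for (FileStatus stat : fs.listStatus(input)) {
        if (stat.isDir()) continue;               // skip sub-directories
        byte[] buf = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          IOUtils.readFully(in, buf, 0, buf.length);   // each small file fits in memory
        } finally {
          in.close();
        }
        // key = original file name, value = raw file contents
        writer.append(new Text(stat.getPath().getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

Each map task can then read many logical files from one SequenceFile split
instead of paying the per-file startup cost.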

In addition to the HDFS design docs, I recommend reading this blog post:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/

Happy Hadooping,

- Patrick

On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT pierre...@gmail.com wrote:

 Okay, thank you :)


 On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman bbock...@cse.unl.edu
 wrote:

 
  On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
 
   Hi, thanks for this fast answer :)
   If so, what do you mean by blocks? If a file has to be splitted, it
 will
  be
   splitted when larger than 64MB?
  
 
  For every 64MB of the file, Hadoop will create a separate block.  So, if
  you have a 32KB file, there will be one block of 32KB.  If the file is
 65MB,
  then it will have one block of 64MB and another block of 1MB.
 
  Splitting files is very useful for load-balancing and distributing I/O
  across multiple nodes.  At 32KB / file, you don't really need to split
 the
  files at all.
 
  I recommend reading the HDFS design document for background issues like
  this:
 
  http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
 
  Brian
 
  
  
  
   On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman bbock...@cse.unl.edu
  wrote:
  
   Hey Pierre,
  
   These are not traditional filesystem blocks - if you save a file
 smaller
   than 64MB, you don't lose 64MB of file space..
  
   Hadoop will use 32KB to store a 32KB file (ok, plus a KB of metadata
 or
   so), not 64MB.
  
   Brian
  
   On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
  
   Hi,
   I'm porting a legacy application to hadoop and it uses a bunch of
 small
   files.
   I'm aware that having such small files ain't a good idea but I'm not
   doing
   the technical decisions and the port has to be done for yesterday...
   Of course such small files are a problem, loading 64MB blocks for a
 few
   lines of text is an evident loss.
   What will happen if I set a smaller, or even way smaller (32kB)
 blocks?
  
   Thank you.
  
   Pierre ANCELOT.
  
  
  
  
   --
   http://www.neko-consulting.com
   Ego sum quis ego servo
   Je suis ce que je protège
   I am what I protect
 
 


 --
 http://www.neko-consulting.com
 Ego sum quis ego servo
 Je suis ce que je protège
 I am what I protect



Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Pierre ANCELOT
Thank you,
Any way I can measure the startup overhead in terms of time?


On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles patr...@cloudera.comwrote:

 Pierre,

 Adding to what Brian has said (some things are not explicitly mentioned in
 the HDFS design doc)...

 - If you have small files that take up < 64MB you do not actually use the
 entire 64MB block on disk.
 - You *do* use up RAM on the NameNode, as each block represents meta-data
 that needs to be maintained in-memory in the NameNode.
 - Hadoop won't perform optimally with very small block sizes. Hadoop I/O is
 optimized for high sustained throughput per single file/block. There is a
 penalty for doing too many seeks to get to the beginning of each block.
 Additionally, you will have a MapReduce task per small file. Each MapReduce
 task has a non-trivial startup overhead.
 - The recommendation is to consolidate your small files into large files.
 One way to do this is via SequenceFiles... put the filename in the
 SequenceFile key field, and the file's bytes in the SequenceFile value
 field.

 In addition to the HDFS design docs, I recommend reading this blog post:
 http://www.cloudera.com/blog/2009/02/the-small-files-problem/

 Happy Hadooping,

 - Patrick

 On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT pierre...@gmail.com
 wrote:

  Okay, thank you :)
 
 
  On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman bbock...@cse.unl.edu
  wrote:
 
  
   On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
  
Hi, thanks for this fast answer :)
If so, what do you mean by blocks? If a file has to be splitted, it
  will
   be
splitted when larger than 64MB?
   
  
   For every 64MB of the file, Hadoop will create a separate block.  So,
 if
   you have a 32KB file, there will be one block of 32KB.  If the file is
  65MB,
   then it will have one block of 64MB and another block of 1MB.
  
   Splitting files is very useful for load-balancing and distributing I/O
   across multiple nodes.  At 32KB / file, you don't really need to split
  the
   files at all.
  
   I recommend reading the HDFS design document for background issues like
   this:
  
   http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
  
   Brian
  
   
   
   
On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman 
 bbock...@cse.unl.edu
   wrote:
   
Hey Pierre,
   
These are not traditional filesystem blocks - if you save a file
  smaller
than 64MB, you don't lose 64MB of file space..
   
Hadoop will use 32KB to store a 32KB file (ok, plus a KB of metadata
  or
so), not 64MB.
   
Brian
   
On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
   
Hi,
I'm porting a legacy application to hadoop and it uses a bunch of
  small
files.
I'm aware that having such small files ain't a good idea but I'm
 not
doing
the technical decisions and the port has to be done for
 yesterday...
Of course such small files are a problem, loading 64MB blocks for a
  few
lines of text is an evident loss.
What will happen if I set a smaller, or even way smaller (32kB)
  blocks?
   
Thank you.
   
Pierre ANCELOT.
   
   
   
   
--
http://www.neko-consulting.com
Ego sum quis ego servo
Je suis ce que je protège
I am what I protect
  
  
 
 
  --
  http://www.neko-consulting.com
  Ego sum quis ego servo
  Je suis ce que je protège
  I am what I protect
 




-- 
http://www.neko-consulting.com
Ego sum quis ego servo
Je suis ce que je protège
I am what I protect


Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Patrick Angeles
Should be evident in the total job running time... that's the only metric
that really matters :)

On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT pierre...@gmail.comwrote:

 Thank you,
 Any way I can measure the startup overhead in terms of time?


 On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles patr...@cloudera.com
 wrote:

  Pierre,
 
  Adding to what Brian has said (some things are not explicitly mentioned
 in
  the HDFS design doc)...
 
   - If you have small files that take up < 64MB you do not actually use the
  entire 64MB block on disk.
  - You *do* use up RAM on the NameNode, as each block represents meta-data
  that needs to be maintained in-memory in the NameNode.
  - Hadoop won't perform optimally with very small block sizes. Hadoop I/O
 is
  optimized for high sustained throughput per single file/block. There is a
  penalty for doing too many seeks to get to the beginning of each block.
  Additionally, you will have a MapReduce task per small file. Each
 MapReduce
  task has a non-trivial startup overhead.
  - The recommendation is to consolidate your small files into large files.
  One way to do this is via SequenceFiles... put the filename in the
  SequenceFile key field, and the file's bytes in the SequenceFile value
  field.
 
  In addition to the HDFS design docs, I recommend reading this blog post:
  http://www.cloudera.com/blog/2009/02/the-small-files-problem/
 
  Happy Hadooping,
 
  - Patrick
 
  On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT pierre...@gmail.com
  wrote:
 
   Okay, thank you :)
  
  
   On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman bbock...@cse.unl.edu
   wrote:
  
   
On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
   
 Hi, thanks for this fast answer :)
 If so, what do you mean by blocks? If a file has to be splitted, it
   will
be
 splitted when larger than 64MB?

   
For every 64MB of the file, Hadoop will create a separate block.  So,
  if
you have a 32KB file, there will be one block of 32KB.  If the file
 is
   65MB,
then it will have one block of 64MB and another block of 1MB.
   
Splitting files is very useful for load-balancing and distributing
 I/O
across multiple nodes.  At 32KB / file, you don't really need to
 split
   the
files at all.
   
I recommend reading the HDFS design document for background issues
 like
this:
   
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
   
Brian
   



 On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman 
  bbock...@cse.unl.edu
wrote:

 Hey Pierre,

 These are not traditional filesystem blocks - if you save a file
   smaller
 than 64MB, you don't lose 64MB of file space..

 Hadoop will use 32KB to store a 32KB file (ok, plus a KB of
 metadata
   or
 so), not 64MB.

 Brian

 On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:

 Hi,
 I'm porting a legacy application to hadoop and it uses a bunch of
   small
 files.
 I'm aware that having such small files ain't a good idea but I'm
  not
 doing
 the technical decisions and the port has to be done for
  yesterday...
 Of course such small files are a problem, loading 64MB blocks for
 a
   few
 lines of text is an evident loss.
 What will happen if I set a smaller, or even way smaller (32kB)
   blocks?

 Thank you.

 Pierre ANCELOT.




 --
 http://www.neko-consulting.com
 Ego sum quis ego servo
 Je suis ce que je protège
 I am what I protect
   
   
  
  
   --
   http://www.neko-consulting.com
   Ego sum quis ego servo
   Je suis ce que je protège
   I am what I protect
  
 



 --
 http://www.neko-consulting.com
 Ego sum quis ego servo
 Je suis ce que je protège
 I am what I protect



Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread He Chen
If you know how to use AspectJ to do aspect-oriented programming, you can
write an aspect class and let it just monitor the whole process of MapReduce.

On Tue, May 18, 2010 at 10:00 AM, Patrick Angeles patr...@cloudera.comwrote:

 Should be evident in the total job running time... that's the only metric
 that really matters :)

 On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT pierre...@gmail.com
 wrote:

  Thank you,
  Any way I can measure the startup overhead in terms of time?
 
 
  On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles patr...@cloudera.com
  wrote:
 
   Pierre,
  
   Adding to what Brian has said (some things are not explicitly mentioned
  in
   the HDFS design doc)...
  
   - If you have small files that take up < 64MB you do not actually use
 the
   entire 64MB block on disk.
   - You *do* use up RAM on the NameNode, as each block represents
 meta-data
   that needs to be maintained in-memory in the NameNode.
   - Hadoop won't perform optimally with very small block sizes. Hadoop
 I/O
  is
   optimized for high sustained throughput per single file/block. There is
 a
   penalty for doing too many seeks to get to the beginning of each block.
   Additionally, you will have a MapReduce task per small file. Each
  MapReduce
   task has a non-trivial startup overhead.
   - The recommendation is to consolidate your small files into large
 files.
   One way to do this is via SequenceFiles... put the filename in the
   SequenceFile key field, and the file's bytes in the SequenceFile value
   field.
  
   In addition to the HDFS design docs, I recommend reading this blog
 post:
   http://www.cloudera.com/blog/2009/02/the-small-files-problem/
  
   Happy Hadooping,
  
   - Patrick
  
   On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT pierre...@gmail.com
   wrote:
  
Okay, thank you :)
   
   
On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman 
 bbock...@cse.unl.edu
wrote:
   

 On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:

  Hi, thanks for this fast answer :)
  If so, what do you mean by blocks? If a file has to be splitted,
 it
will
 be
  splitted when larger than 64MB?
 

 For every 64MB of the file, Hadoop will create a separate block.
  So,
   if
 you have a 32KB file, there will be one block of 32KB.  If the file
  is
65MB,
 then it will have one block of 64MB and another block of 1MB.

 Splitting files is very useful for load-balancing and distributing
  I/O
 across multiple nodes.  At 32KB / file, you don't really need to
  split
the
 files at all.

 I recommend reading the HDFS design document for background issues
  like
 this:

 http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html

 Brian

 
 
 
  On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman 
   bbock...@cse.unl.edu
 wrote:
 
  Hey Pierre,
 
  These are not traditional filesystem blocks - if you save a file
smaller
  than 64MB, you don't lose 64MB of file space..
 
  Hadoop will use 32KB to store a 32KB file (ok, plus a KB of
  metadata
or
  so), not 64MB.
 
  Brian
 
  On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
 
  Hi,
  I'm porting a legacy application to hadoop and it uses a bunch
 of
small
  files.
  I'm aware that having such small files ain't a good idea but
 I'm
   not
  doing
  the technical decisions and the port has to be done for
   yesterday...
  Of course such small files are a problem, loading 64MB blocks
 for
  a
few
  lines of text is an evident loss.
  What will happen if I set a smaller, or even way smaller (32kB)
blocks?
 
  Thank you.
 
  Pierre ANCELOT.
 
 
 
 
  --
  http://www.neko-consulting.com
  Ego sum quis ego servo
  Je suis ce que je protège
  I am what I protect


   
   
--
http://www.neko-consulting.com
Ego sum quis ego servo
Je suis ce que je protège
I am what I protect
   
  
 
 
 
  --
  http://www.neko-consulting.com
  Ego sum quis ego servo
  Je suis ce que je protège
  I am what I protect
 




-- 
Best Wishes!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: what's the mechanism to determine the reducer number and reduce progress

2010-05-18 Thread stan lee
Thanks PanFeng, do you have a more detailed explanation of this? Is it
calculated by how many reduce tasks have completed each phase?

Also, what's the answer to my second question? Thanks!

On Mon, May 17, 2010 at 12:44 PM, 原攀峰 ypf...@163.com wrote:

 For a reduce task, the execution is divided into three phases, each of
 which accounts for 1/3 of the score:
 • The copy phase, when the task fetches map outputs.
 • The sort phase, when map outputs are sorted by key.
 • The reduce phase, when a user-defined function is applied to the list of
 map outputs with each key.
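
 As a rough illustration of that 1/3-per-phase scoring (a sketch only, not the
 actual Hadoop source):

 public class ReduceProgressSketch {
   // phasesDone: how many of copy/sort/reduce are fully complete (0..3)
   // currentPhaseProgress: fraction of the current phase done (0.0..1.0)
   static float reduceProgress(int phasesDone, float currentPhaseProgress) {
     return (phasesDone + currentPhaseProgress) / 3.0f;
   }

   public static void main(String[] args) {
     // Reduce tasks that have finished copy and sort but produced no reduce
     // output yet would each already report 2/3, i.e. about 67%.
     System.out.println(reduceProgress(2, 0.0f));
   }
 }
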
 --

 Yuan Panfeng(原攀峰) | BeiHang University

 TEL: +86-13426166934

 MSN: ypf...@hotmail.com

 EMAIL: ypf...@gmail.com

 QQ: 362889262




 On 2010-05-17 09:44:38, stan lee lee.stan...@gmail.com wrote:
 When I run the sort job, I found that when there are 70 reduce tasks running
 and none completed, the progress bar shows that it has finished about 80%,
 so how does the mapreduce mechanism calculate this?
 
 Also, when I run a job, as we know, we can set the total number of reduce
 tasks through the setNumReduceTasks() function, but how can I determine the
 number of reducers being used (I mean the number of tasktrackers which run
 the reduce tasks)?
 
 Thanks!
 Stan. Lee



Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Brian Bockelman
Hey Hassan,

1) The overhead is pretty small, measured in a small number of milliseconds on 
average
2) HDFS is not designed for online latency.  Even though the average is 
small, if something bad happens, your clients might experience a lot of 
delays while going through the retry stack.  The initial design was for batch 
processing, and latency-sensitive applications came later.

Additionally since the NN is a SPOF, you might want to consider your uptime 
requirements.  Each organization will have to balance these risks with the 
advantages (such as much cheaper hardware).

There's a nice interview with the GFS authors here where they touch upon the 
latency issues:

http://queue.acm.org/detail.cfm?id=1594206

As GFS and HDFS share many design features, the theoretical parts of their 
discussion might be useful for you.

As far as overall throughput of the system goes, it depends heavily upon your 
implementation and hardware.  Our HDFS routinely serves 5-10 Gbps.

Brian

On May 18, 2010, at 10:29 AM, Nyamul Hassan wrote:

 This is a very interesting thread to us, as we are thinking about deploying
 HDFS as a massive online storage for an online university, and then
 serving the video files to students who want to view them.
 
 We cannot control the size of the videos (and some class work files), as
 they will mostly be uploaded by the teachers providing the classes.
 
 How would the overall throughput of HDFS be affected in such a solution?
 Would HDFS be feasible at all for such a setup?
 
 Regards
 HASSAN
 
 
 
 On Tue, May 18, 2010 at 21:11, He Chen airb...@gmail.com wrote:
 
 If you know how to use AspectJ to do aspect oriented programming. You can
 write a aspect class. Let it just monitors the whole process of MapReduce
 
 On Tue, May 18, 2010 at 10:00 AM, Patrick Angeles patr...@cloudera.com
 wrote:
 
 Should be evident in the total job running time... that's the only metric
 that really matters :)
 
 On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT pierre...@gmail.com
 wrote:
 
 Thank you,
 Any way I can measure the startup overhead in terms of time?
 
 
 On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles patr...@cloudera.com
 wrote:
 
 Pierre,
 
 Adding to what Brian has said (some things are not explicitly
 mentioned
 in
 the HDFS design doc)...
 
 - If you have small files that take up < 64MB you do not actually use
 the
 entire 64MB block on disk.
 - You *do* use up RAM on the NameNode, as each block represents
 meta-data
 that needs to be maintained in-memory in the NameNode.
 - Hadoop won't perform optimally with very small block sizes. Hadoop
 I/O
 is
 optimized for high sustained throughput per single file/block. There
 is
 a
 penalty for doing too many seeks to get to the beginning of each
 block.
 Additionally, you will have a MapReduce task per small file. Each
 MapReduce
 task has a non-trivial startup overhead.
 - The recommendation is to consolidate your small files into large
 files.
 One way to do this is via SequenceFiles... put the filename in the
 SequenceFile key field, and the file's bytes in the SequenceFile
 value
 field.
 
 In addition to the HDFS design docs, I recommend reading this blog
 post:
 http://www.cloudera.com/blog/2009/02/the-small-files-problem/
 
 Happy Hadooping,
 
 - Patrick
 
 On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT pierre...@gmail.com
 
 wrote:
 
 Okay, thank you :)
 
 
 On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman 
 bbock...@cse.unl.edu
 wrote:
 
 
 On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
 
 Hi, thanks for this fast answer :)
 If so, what do you mean by blocks? If a file has to be
 splitted,
 it
 will
 be
 splitted when larger than 64MB?
 
 
 For every 64MB of the file, Hadoop will create a separate block.
 So,
 if
 you have a 32KB file, there will be one block of 32KB.  If the
 file
 is
 65MB,
 then it will have one block of 64MB and another block of 1MB.
 
 Splitting files is very useful for load-balancing and
 distributing
 I/O
 across multiple nodes.  At 32KB / file, you don't really need to
 split
 the
 files at all.
 
 I recommend reading the HDFS design document for background
 issues
 like
 this:
 
 http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
 
 Brian
 
 
 
 
 On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman 
 bbock...@cse.unl.edu
 wrote:
 
 Hey Pierre,
 
 These are not traditional filesystem blocks - if you save a
 file
 smaller
 than 64MB, you don't lose 64MB of file space..
 
 Hadoop will use 32KB to store a 32KB file (ok, plus a KB of
 metadata
 or
 so), not 64MB.
 
 Brian
 
 On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
 
 Hi,
 I'm porting a legacy application to hadoop and it uses a
 bunch
 of
 small
 files.
 I'm aware that having such small files ain't a good idea but
 I'm
 not
 doing
 the technical decisions and the port has to be done for
 yesterday...
 Of course such small files are a problem, loading 64MB blocks
 for
 a
 few
 lines of text is an evident loss.
 What will happen if I set a 

Do we need to install both 32 and 64 bit lzo2 to enable lzo compression and how can we use gzip compressoin codec in hadoop

2010-05-18 Thread stan lee
Hi Guys,

I am trying to use compression to reduce the IO workload when running a job,
but failed. I have several questions which need your help.

For lzo compression, I found a guide at
http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ. Why does it say
"Note that you must have both 32-bit and 64-bit liblzo2 installed"? I am not
sure whether it means that we also need 32-bit liblzo2 installed even when we
are on a 64-bit system. If so, why?

Also, if I don't use lzo compression and instead try to use gzip to compress
the final reduce output file, I just set the value below in mapred-site.xml,
but it doesn't seem to work (how can I find the final compressed .gz file? I
used "hadoop dfs -ls <dir>" and didn't find it). My question: can we use gzip
to compress the final result when it's not a streaming job? How can we ensure
that compression has been enabled during a job execution?

<property>
   <name>mapred.output.compress</name>
   <value>true</value>
</property>

Thanks!
Stan Lee


Re: Do we need to install both 32 and 64 bit lzo2 to enable lzo compression and how can we use gzip compressoin codec in hadoop

2010-05-18 Thread Ted Yu
32bit liblzo2 isn't needed on 64-bit systems.

On Tue, May 18, 2010 at 8:44 AM, stan lee lee.stan...@gmail.com wrote:

 Hi Guys,

 I am trying to use compression to reduce the IO workload when trying to run
 a job but failed. I have several questions which needs your help.

 For lzo compression, I found a guide
 http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ, why it said
 Note
 that you must have both 32-bit and 64-bit liblzo2 installed ? I am not
 sure
 whether it means that we also need 32bit liblzo2 installed even when we are
 on 64bit system. If so, why?

 Also if I don't use lzo compression and tried to use gzip to compress the
 final reduce output file, I just set below value in mapred-site.xml, but
 seems it doesn't work(how can I find the final .gz file compressed? I used
 hadoop dfs -l dir and didn't find that.). My question: can we use gzip
 to compress the final result when it's not streaming job? How can we ensure
 that the compression has been enabled during a job execution?

 <property>
    <name>mapred.output.compress</name>
    <value>true</value>
 </property>

 Thanks!
 Stan Lee



Re: Data node decommission doesn't seem to be working correctly

2010-05-18 Thread Koji Noguchi
Hi Scott, 

You might be hitting two different issues.

1) Decommission not finishing.
   https://issues.apache.org/jira/browse/HDFS-694  explains decommission
never finishing due to open files in 0.20

2) Nodes showing up both in live and dead nodes.
   I remember Suresh taking a look at this.
    It was something about the same node being registered with its hostname and IP
separately (when the datanode is rejumped and started fresh(?)).

Cc-ing Suresh.

Koji

On 5/17/10 9:32 PM, Scott White scottbl...@gmail.com wrote:

 I followed the steps mentioned here:
 http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
 decommission a data node. What I see from the namenode is the hostname of
 the machine that I decommissioned shows up in both the list of dead nodes
 but also live nodes where its admin status is marked as 'In Service'. It's
 been twelve hours and there is no sign in the namenode logs that the node
 has been decommissioned. Any suggestions of what might be the problem and
 what to try to ensure that this node gets safely taken down?
 
 thanks in advance,
 Scott



Re: Data node decommission doesn't seem to be working correctly

2010-05-18 Thread Scott White
dfsadmin -report reports the hostname for that machine and not the IP. That
machine happens to be the master node, which is why I am trying to
decommission the data node there, since I only want the data node running on
the slave nodes. dfsadmin -report reports all the IPs for the slave nodes.

One question: I believe that the namenode was accidentally restarted during
the 12 hours or so I was waiting for the decommission to complete. Would
this put things into a bad state? I did try running dfsadmin -refreshNodes
after it was restarted.

Scott


On Tue, May 18, 2010 at 5:44 AM, Brian Bockelman bbock...@cse.unl.eduwrote:

 Hey Scott,

 Hadoop tends to get confused by nodes with multiple hostnames or multiple
 IP addresses.  Is this your case?

 I can't remember precisely what our admin does, but I think he puts in the
 IP address which Hadoop listens on in the exclude-hosts file.

 Look in the output of

 hadoop dfsadmin -report

 to determine precisely which IP address your datanode is listening on.

 Brian

 On May 17, 2010, at 11:32 PM, Scott White wrote:

  I followed the steps mentioned here:
  http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
  decommission a data node. What I see from the namenode is the hostname of
  the machine that I decommissioned shows up in both the list of dead nodes
  but also live nodes where its admin status is marked as 'In Service'.
 It's
  been twelve hours and there is no sign in the namenode logs that the node
  has been decommissioned. Any suggestions of what might be the problem and
  what to try to ensure that this node gets safely taken down?
 
  thanks in advance,
  Scott




Re: Do we need to install both 32 and 64 bit lzo2 to enable lzo compression and how can we use gzip compressoin codec in hadoop

2010-05-18 Thread Harsh J
Hi stan,

You can do something of this sort if you use FileOutputFormat, from
within your Job Driver:

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
// GzipCodec from org.apache.hadoop.io.compress.
// and where 'job' is either JobConf or Job object.

This will write the simple file output in Gzip format. You also have BZip2Codec.

On Tue, May 18, 2010 at 9:14 PM, stan lee lee.stan...@gmail.com wrote:
 Hi Guys,

 I am trying to use compression to reduce the IO workload when trying to run
 a job but failed. I have several questions which needs your help.

 For lzo compression, I found a guide
 http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ, why it said Note
 that you must have both 32-bit and 64-bit liblzo2 installed ? I am not sure
 whether it means that we also need 32bit liblzo2 installed even when we are
 on 64bit system. If so, why?

 Also if I don't use lzo compression and tried to use gzip to compress the
 final reduce output file, I just set below value in mapred-site.xml, but
 seems it doesn't work(how can I find the final .gz file compressed? I used
 hadoop dfs -l dir and didn't find that.). My question: can we use gzip
 to compress the final result when it's not streaming job? How can we ensure
 that the compression has been enabled during a job execution?

 <property>
        <name>mapred.output.compress</name>
        <value>true</value>
 </property>

 Thanks!
 Stan Lee




-- 
Harsh J
www.harshj.com


Re: preserve JobTracker information

2010-05-18 Thread Harsh J
Preserved JobTracker history is already available at /jobhistory.jsp

There is a link at the end of the /jobtracker.jsp page that leads to
this. There's also free analysis to go with that! :)

On Tue, May 18, 2010 at 11:00 PM, Alan Miller someb...@squareplanet.de wrote:
 Hi,

 Is there a way to preserve previous job information (Completed Jobs, Failed
 Jobs)
 when the hadoop cluster is restarted?

 Everytime I start up my cluster (start-dfs.sh,start-mapred.sh) the
 JobTracker interface
 at http://myhost:50020/jobtracker.jsp is always empty.

 Thanks,
 Alan






-- 
Harsh J
www.harshj.com


Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Konstantin Boudnik
I ran an experiment with a block size of 10 bytes (sic!). This was _very_ slow
on the NN side. Writing 5 MB took 25 minutes or so :( No fun, to say the
least...

On Tue, May 18, 2010 at 10:56AM, Konstantin Shvachko wrote:
 You can also get some performance numbers and answers to the block size 
 dilemma problem here:
 
 http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html
 
 I remember some people were using Hadoop for storing or streaming videos.
 Don't know how well that worked.
 It would be interesting to learn about your experience.
 
 Thanks,
 --Konstantin
 
 
 On 5/18/2010 8:41 AM, Brian Bockelman wrote:
  Hey Hassan,
 
  1) The overhead is pretty small, measured in a small number of milliseconds 
  on average
  2) HDFS is not designed for online latency.  Even though the average is 
  small, if something bad happens, your clients might experience a lot of 
  delays while going through the retry stack.  The initial design was for 
  batch processing, and latency-sensitive applications came later.
 
  Additionally since the NN is a SPOF, you might want to consider your uptime 
  requirements.  Each organization will have to balance these risks with the 
  advantages (such as much cheaper hardware).
 
  There's a nice interview with the GFS authors here where they touch upon 
  the latency issues:
 
  http://queue.acm.org/detail.cfm?id=1594206
 
  As GFS and HDFS share many design features, the theoretical parts of their 
  discussion might be useful for you.
 
  As far as overall throughput of the system goes, it depends heavily upon 
  your implementation and hardware.  Our HDFS routinely serves 5-10 Gbps.
 
  Brian
 
  On May 18, 2010, at 10:29 AM, Nyamul Hassan wrote:
 
  This is a very interesting thread to us, as we are thinking about deploying
   HDFS as a massive online storage for an online university, and then
  serving the video files to students who want to view them.
 
  We cannot control the size of the videos (and some class work files), as
  they will mostly be uploaded by the teachers providing the classes.
 
   How would the overall throughput of HDFS be affected in such a solution?
  Would HDFS be feasible at all for such a setup?
 
  Regards
  HASSAN
 
 
 
  On Tue, May 18, 2010 at 21:11, He Chenairb...@gmail.com  wrote:
 
  If you know how to use AspectJ to do aspect oriented programming. You can
  write a aspect class. Let it just monitors the whole process of MapReduce
 
  On Tue, May 18, 2010 at 10:00 AM, Patrick Angelespatr...@cloudera.com
  wrote:
 
  Should be evident in the total job running time... that's the only metric
  that really matters :)
 
  On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOTpierre...@gmail.com
  wrote:
 
  Thank you,
  Any way I can measure the startup overhead in terms of time?
 
 
  On Tue, May 18, 2010 at 4:27 PM, Patrick Angelespatr...@cloudera.com
  wrote:
 
  Pierre,
 
  Adding to what Brian has said (some things are not explicitly
  mentioned
  in
  the HDFS design doc)...
 
  - If you have small files that take up < 64MB you do not actually use
  the
  entire 64MB block on disk.
  - You *do* use up RAM on the NameNode, as each block represents
  meta-data
  that needs to be maintained in-memory in the NameNode.
  - Hadoop won't perform optimally with very small block sizes. Hadoop
  I/O
  is
  optimized for high sustained throughput per single file/block. There
  is
  a
  penalty for doing too many seeks to get to the beginning of each
  block.
  Additionally, you will have a MapReduce task per small file. Each
  MapReduce
  task has a non-trivial startup overhead.
  - The recommendation is to consolidate your small files into large
  files.
  One way to do this is via SequenceFiles... put the filename in the
  SequenceFile key field, and the file's bytes in the SequenceFile
  value
  field.
 
  In addition to the HDFS design docs, I recommend reading this blog
  post:
  http://www.cloudera.com/blog/2009/02/the-small-files-problem/
 
  Happy Hadooping,
 
  - Patrick
 
  On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOTpierre...@gmail.com
 
  wrote:
 
  Okay, thank you :)
 
 
  On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman
  bbock...@cse.unl.edu
  wrote:
 
 
  On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
 
  Hi, thanks for this fast answer :)
  If so, what do you mean by blocks? If a file has to be
  splitted,
  it
  will
  be
  splitted when larger than 64MB?
 
 
  For every 64MB of the file, Hadoop will create a separate block.
  So,
  if
  you have a 32KB file, there will be one block of 32KB.  If the
  file
  is
  65MB,
  then it will have one block of 64MB and another block of 1MB.
 
  Splitting files is very useful for load-balancing and
  distributing
  I/O
  across multiple nodes.  At 32KB / file, you don't really need to
  split
  the
  files at all.
 
  I recommend reading the HDFS design document for background
  issues
  like
  this:
 
  

Re: Data node decommission doesn't seem to be working correctly

2010-05-18 Thread Brian Bockelman
Hey Scott,

If the node shows up in the dead nodes and the live nodes as you say, it's 
definitely not even attempting to be decommissioned.  If HDFS was attempting 
decommissioning and you restart the namenode, then it would only show up in the 
dead nodes list.

Another option is to just turn off HDFS on that node alone, and don't 
physically delete the data from the node until HDFS completely recovers.  This 
is not recommended for production usage, as it creates a period where the 
cluster is in danger of losing files.  However, it can be used as a one-off to 
get over this speed-hump.

Brian

On May 18, 2010, at 12:02 PM, Scott White wrote:

 Dfsadmin -report reports the hostname for that machine and not the ip. That
 machine happens to be the master node which is why I am trying to
 decommission the data node there since I only want the data node running on
 the slave nodes. Dfs admin -report reports all the ips for the slave nodes.
 
 One question: I believe that the namenode was accidentally restarted during
 the 12 hours or so I was waiting for the decommission to complete. Would
 this put things into a bad state? I did try running dfsadmin -refreshNodes
 after it was restarted.
 
 Scott
 
 
 On Tue, May 18, 2010 at 5:44 AM, Brian Bockelman bbock...@cse.unl.eduwrote:
 
 Hey Scott,
 
 Hadoop tends to get confused by nodes with multiple hostnames or multiple
 IP addresses.  Is this your case?
 
 I can't remember precisely what our admin does, but I think he puts in the
 IP address which Hadoop listens on in the exclude-hosts file.
 
 Look in the output of
 
 hadoop dfsadmin -report
 
 to determine precisely which IP address your datanode is listening on.
 
 Brian
 
 On May 17, 2010, at 11:32 PM, Scott White wrote:
 
 I followed the steps mentioned here:
 http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
 decommission a data node. What I see from the namenode is the hostname of
 the machine that I decommissioned shows up in both the list of dead nodes
 but also live nodes where its admin status is marked as 'In Service'.
 It's
 been twelve hours and there is no sign in the namenode logs that the node
 has been decommissioned. Any suggestions of what might be the problem and
 what to try to ensure that this node gets safely taken down?
 
 thanks in advance,
 Scott
 
 





Re: Do we need to install both 32 and 64 bit lzo2 to enable lzo compression and how can we use gzip compressoin codec in hadoop

2010-05-18 Thread Hong Tang

Stan,

See my comments inline.

Thanks, Hong

On May 18, 2010, at 8:44 AM, stan lee wrote:


Hi Guys,

I am trying to use compression to reduce the IO workload when trying  
to run

a job but failed. I have several questions which needs your help.

For lzo compression, I found a guide
http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ, why it  
said Note
that you must have both 32-bit and 64-bit liblzo2 installed ? I am  
not sure
whether it means that we also need 32bit liblzo2 installed even when  
we are

on 64bit system. If so, why?


The answer on the wiki page is to the question of how to set up the  
native libraries so that both 32-bit AND 64-bit java would work. If  
you adhere to an environment with the same flavor of java across the  
whole cluster, then the solution would not apply to you.


Also if I don't use lzo compression and tried to use gzip to  
compress the
final reduce output file, I just set below value in mapred-site.xml,  
but
seems it doesn't work(how can I find the final .gz file compressed?  
I used
hadoop dfs -l dir and didn't find that.). My question: can we  
use gzip
to compress the final result when it's not streaming job? How can we  
ensure

that the compression has been enabled during a job execution?

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>



The truth is, this option is honored by the implementation of the
OutputFormat classes. If you use TextOutputFormat, then you should
see files like part-*.gz in the output directory. If you write
your own output format class, then you should follow the
implementations of TextOutputFormat or SequenceFileOutputFormat to set
up compression properly.
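
A minimal driver-side sketch along those lines, assuming the 0.20-era
mapred.* property names and the old JobConf API (the class is just a
placeholder):

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class GzipOutputConfig {
  // Property-level equivalent of the FileOutputFormat calls shown earlier
  // in this thread.
  public static JobConf configure(JobConf conf) {
    conf.setBoolean("mapred.output.compress", true);
    conf.setClass("mapred.output.compression.codec",
                  GzipCodec.class, CompressionCodec.class);
    // With TextOutputFormat the reducers then write part-00000.gz, part-00001.gz, ...
    return conf;
  }
}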




Re: dfs.name.dir capacity for namenode backup?

2010-05-18 Thread Andrew Nguyen
Sorry to hijack but after following this thread, I had a related question to 
the secondary location of dfs.name.dir.  

Is the approach outlined below the preferred/suggested way to do this? Is this
what people mean when they say "stick it on NFS"?

Thanks!

On May 17, 2010, at 11:14 PM, Todd Lipcon wrote:

 On Mon, May 17, 2010 at 5:10 PM, jiang licht licht_ji...@yahoo.com wrote:
 
 I am considering to use a machine to save a
 redundant copy of HDFS metadata through setting dfs.name.dir in
 hdfs-site.xml like this (as in YDN):
 
 <property>
   <name>dfs.name.dir</name>
   <value>/home/hadoop/dfs/name,/mnt/namenode-backup</value>
   <final>true</final>
 </property>
 
 where the two folders are on different machines so that
 /mnt/namenode-backup keeps a copy of hdfs file system information and its
 machine can be used to replace the first machine that fails as namenode.
 
  So, my question is how much space this hdfs metadata will consume? I guess it is
  proportional to the hdfs capacity. What ratio is that, or what size would it be
  for a 150TB hdfs?
 
 
 On the order of a few GB, max (you really need double the size of your
 image, so it has tmp space when downloading a checkpoint or performing an
 upgrade). But on any disk you can buy these days you'll have plenty of
 space.
 
 -Todd
 
 
 -- 
 Todd Lipcon
 Software Engineer, Cloudera



Re: dfs.name.dir capacity for namenode backup?

2010-05-18 Thread Todd Lipcon
Yes, we recommend at least one local directory and one NFS directory for
dfs.name.dir in production environments. This allows an up-to-date recovery
of NN metadata if the NN should fail. In future versions the BackupNode
functionality will move us one step closer to not needing NFS for production
deployments.

Note that the NFS directory does not need to be anything fancy - you can
simply use an NFS mount on another normal Linux box.

-Todd

On Tue, May 18, 2010 at 11:19 AM, Andrew Nguyen and...@ucsfcti.org wrote:

 Sorry to hijack but after following this thread, I had a related question
 to the secondary location of dfs.name.dir.

  Is the approach outlined below the preferred/suggested way to do this? Is
  this what people mean when they say "stick it on NFS"?

 Thanks!

 On May 17, 2010, at 11:14 PM, Todd Lipcon wrote:

  On Mon, May 17, 2010 at 5:10 PM, jiang licht licht_ji...@yahoo.com
 wrote:
 
  I am considering to use a machine to save a
  redundant copy of HDFS metadata through setting dfs.name.dir in
  hdfs-site.xml like this (as in YDN):
 
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/dfs/name,/mnt/namenode-backup</value>
    <final>true</final>
  </property>
 
  where the two folders are on different machines so that
  /mnt/namenode-backup keeps a copy of hdfs file system information and
 its
  machine can be used to replace the first machine that fails as namenode.
 
   So, my question is how much space this hdfs metadata will consume? I guess
   it is proportional to the hdfs capacity. What ratio is that, or what size
   would it be for a 150TB hdfs?
 
 
  On the order of a few GB, max (you really need double the size of your
  image, so it has tmp space when downloading a checkpoint or performing an
  upgrade). But on any disk you can buy these days you'll have plenty of
  space.
 
  -Todd
 
 
  --
  Todd Lipcon
  Software Engineer, Cloudera




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Pierre ANCELOT
Thanks for the sarcasm, but with 30k small files and so 30k Mapper
instantiations, even though it's not (and never did I say it was) the only
metric that matters, it seems to me like something very interesting to check
out...
I have a hierarchy above me and they will be happy to understand my choices
with real numbers to base their understanding on.
Thanks.


On Tue, May 18, 2010 at 5:00 PM, Patrick Angeles patr...@cloudera.comwrote:

 Should be evident in the total job running time... that's the only metric
 that really matters :)

 On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT pierre...@gmail.com
 wrote:

  Thank you,
  Any way I can measure the startup overhead in terms of time?
 
 
  On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles patr...@cloudera.com
  wrote:
 
   Pierre,
  
   Adding to what Brian has said (some things are not explicitly mentioned
  in
   the HDFS design doc)...
  
   - If you have small files that take up < 64MB you do not actually use
 the
   entire 64MB block on disk.
   - You *do* use up RAM on the NameNode, as each block represents
 meta-data
   that needs to be maintained in-memory in the NameNode.
   - Hadoop won't perform optimally with very small block sizes. Hadoop
 I/O
  is
   optimized for high sustained throughput per single file/block. There is
 a
   penalty for doing too many seeks to get to the beginning of each block.
   Additionally, you will have a MapReduce task per small file. Each
  MapReduce
   task has a non-trivial startup overhead.
   - The recommendation is to consolidate your small files into large
 files.
   One way to do this is via SequenceFiles... put the filename in the
   SequenceFile key field, and the file's bytes in the SequenceFile value
   field.
  
   In addition to the HDFS design docs, I recommend reading this blog
 post:
   http://www.cloudera.com/blog/2009/02/the-small-files-problem/
  
   Happy Hadooping,
  
   - Patrick
  
   On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT pierre...@gmail.com
   wrote:
  
Okay, thank you :)
   
   
On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman 
 bbock...@cse.unl.edu
wrote:
   

 On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:

  Hi, thanks for this fast answer :)
  If so, what do you mean by blocks? If a file has to be splitted,
 it
will
 be
  splitted when larger than 64MB?
 

 For every 64MB of the file, Hadoop will create a separate block.
  So,
   if
 you have a 32KB file, there will be one block of 32KB.  If the file
  is
65MB,
 then it will have one block of 64MB and another block of 1MB.

 Splitting files is very useful for load-balancing and distributing
  I/O
 across multiple nodes.  At 32KB / file, you don't really need to
  split
the
 files at all.

 I recommend reading the HDFS design document for background issues
  like
 this:

 http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html

 Brian

 
 
 
  On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman 
   bbock...@cse.unl.edu
 wrote:
 
  Hey Pierre,
 
  These are not traditional filesystem blocks - if you save a file
smaller
  than 64MB, you don't lose 64MB of file space..
 
  Hadoop will use 32KB to store a 32KB file (ok, plus a KB of
  metadata
or
  so), not 64MB.
 
  Brian
 
  On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
 
  Hi,
  I'm porting a legacy application to hadoop and it uses a bunch
 of
small
  files.
  I'm aware that having such small files ain't a good idea but
 I'm
   not
  doing
  the technical decisions and the port has to be done for
   yesterday...
  Of course such small files are a problem, loading 64MB blocks
 for
  a
few
  lines of text is an evident loss.
  What will happen if I set a smaller, or even way smaller (32kB)
blocks?
 
  Thank you.
 
  Pierre ANCELOT.
 
 
 
 
  --
  http://www.neko-consulting.com
  Ego sum quis ego servo
  Je suis ce que je protège
  I am what I protect


   
   
--
http://www.neko-consulting.com
Ego sum quis ego servo
Je suis ce que je protège
I am what I protect
   
  
 
 
 
  --
  http://www.neko-consulting.com
  Ego sum quis ego servo
  Je suis ce que je protège
  I am what I protect
 




-- 
http://www.neko-consulting.com
Ego sum quis ego servo
Je suis ce que je protège
I am what I protect


Re: JAVA_HOME not set

2010-05-18 Thread David Howell
Are you using Cloudera's hadoop 0.20.2?

There's some logic in bin/hadoop-config.sh that seems to be failing if
JAVA_HOME isn't set, and it runs before hadoop-env.sh.

If you think it might be the same problem, please weigh in:

http://getsatisfaction.com/cloudera/topics/java_home_setting_in_hadoop_env_sh_not_respected_in_cdh_3


- David


On Tue, May 18, 2010 at 12:30 PM, Erik Test erik.shi...@gmail.com wrote:
 Hi All,

 I continually get this error when trying to run start-all.sh for hadoop
 0.20.2 on ubuntu. What confuses me is I DO have JAVA_HOME set in
 hadoop-env.sh to /usr/lib/jvm/jdk1.6.0_17. I've double-checked that
 JAVA_HOME is set to this by echoing the path before running the start script,
 but still no luck. I then tried adding bin to the path but then got errors
 saying /usr/lib/jvm/jdk1.6.0_17/bin/bin/java couldn't be found.

 Can someone give me suggestions on how to fix this problem please?

 Erik



Re: JAVA_HOME not set

2010-05-18 Thread Erik Test
Hm. I actually just changed to this version
Erik


On 18 May 2010 15:59, David Howell dehow...@gmail.com wrote:

 Are you using Cloudera's hadoop 0.20.2?

 There's some logic in bin/hadoop-config.sh that seems to be failing if
 JAVA_HOME isn't set, and it runs before hadoop-env.sh.

 If you think it might be the same problem, please weigh in:


 http://getsatisfaction.com/cloudera/topics/java_home_setting_in_hadoop_env_sh_not_respected_in_cdh_3


 - David


 On Tue, May 18, 2010 at 12:30 PM, Erik Test erik.shi...@gmail.com wrote:
  Hi All,
 
  I continually get this error when trying to run start-all.sh for hadoop
  0.20.2 on ubuntu. What confuses me is I DO have JAVA_HOME set in
  hadoop-env.sh to /usr/lib/jvm/jdk1.6.0_17. I've double checked to see
 that
  JAVA_HOME is set to this by echoing the path before running the start
 script
   but still no luck. I then tried adding bin to the path but then got
 errors
  saying /usr/lib/jvm/jdk1.6.0_17/bin/bin/java couldn't be found.
 
  Can someone give me suggestions on how to fix this problem please?
 
  Erik
 



Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Patrick Angeles
That wasn't sarcasm. This is what you do:

- Run your mapreduce job on 30k small files.
- Consolidate your 30k small files into larger files.
- Run mapreduce on the larger files.
- Compare the running time

The difference in runtime is made up by your task startup and seek overhead.

If you want to get the 'average' overhead per task, divide the total times
for each job by the number of map tasks. This won't be a true average
because with larger chunks of data, you will have longer running map tasks
that will hold up the shuffle phase. But the average doesn't really matter
here because you always have that trade-off going from small to large chunks
of data.


On Tue, May 18, 2010 at 7:31 PM, Pierre ANCELOT pierre...@gmail.com wrote:

 Thanks for the sarcasm, but with 30k small files and so 30k Mapper
 instantiations, even though it's not (and never did I say it was) the only
 metric that matters, it seems to me like something very interesting to check
 out...
 I have a hierarchy over me and they will be happy to understand my choices
 with real numbers to base their understanding on.
 Thanks.


 On Tue, May 18, 2010 at 5:00 PM, Patrick Angeles patr...@cloudera.com
 wrote:

  Should be evident in the total job running time... that's the only metric
  that really matters :)
 
  On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT pierre...@gmail.com
  wrote:
 
   Thank you,
   Any way I can measure the startup overhead in terms of time?
  
  
   On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles patr...@cloudera.com
   wrote:
  
Pierre,
   
Adding to what Brian has said (some things are not explicitly
 mentioned
   in
the HDFS design doc)...
   
- If you have small files that take up  64MB you do not actually use
  the
entire 64MB block on disk.
- You *do* use up RAM on the NameNode, as each block represents
  meta-data
that needs to be maintained in-memory in the NameNode.
- Hadoop won't perform optimally with very small block sizes. Hadoop
  I/O
   is
optimized for high sustained throughput per single file/block. There
 is
  a
penalty for doing too many seeks to get to the beginning of each
 block.
Additionally, you will have a MapReduce task per small file. Each
   MapReduce
task has a non-trivial startup overhead.
- The recommendation is to consolidate your small files into large
  files.
One way to do this is via SequenceFiles... put the filename in the
SequenceFile key field, and the file's bytes in the SequenceFile
 value
field.
   
In addition to the HDFS design docs, I recommend reading this blog
  post:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
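
For example, a minimal sketch of that SequenceFile approach (the class name
and paths are invented for illustration, small files are read fully into
memory, and the 0.20-era APIs are assumed):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // directory holding the small files
    Path outputFile = new Path(args[1]); // consolidated SequenceFile

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, outputFile, Text.class, BytesWritable.class);
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isDir()) {
          continue; // skip sub-directories in this simple sketch
        }
        // Small files only: read the whole file into memory.
        byte[] contents = new byte[(int) status.getLen()];
        FSDataInputStream in = fs.open(status.getPath());
        try {
          in.readFully(contents);
        } finally {
          in.close();
        }
        // key = the original file name, value = the file's bytes
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}

A job can then read the consolidated file with SequenceFileInputFormat,
getting a few large splits instead of one map task per small file.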
   
Happy Hadooping,
   
- Patrick
   
On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT pierre...@gmail.com
 
wrote:
   
 Okay, thank you :)


 On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman 
  bbock...@cse.unl.edu
 wrote:

 
  On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
 
   Hi, thanks for this fast answer :)
   If so, what do you mean by blocks? If a file has to be
 splitted,
  it
 will
  be
   splitted when larger than 64MB?
  
 
  For every 64MB of the file, Hadoop will create a separate block.
   So,
if
  you have a 32KB file, there will be one block of 32KB.  If the
 file
   is
 65MB,
  then it will have one block of 64MB and another block of 1MB.
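
  As a quick way to see this on a live cluster (the path here is invented for
  illustration), fsck can list the blocks behind a file:

  hadoop fsck /user/pierre/bigfile -files -blocks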
 
  Splitting files is very useful for load-balancing and
 distributing
   I/O
  across multiple nodes.  At 32KB / file, you don't really need to
   split
 the
  files at all.
 
  I recommend reading the HDFS design document for background
 issues
   like
  this:
 
  http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
 
  Brian
 
  
  
  
   On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman 
bbock...@cse.unl.edu
  wrote:
  
   Hey Pierre,
  
   These are not traditional filesystem blocks - if you save a
 file
 smaller
   than 64MB, you don't lose 64MB of file space..
  
   Hadoop will use 32KB to store a 32KB file (ok, plus a KB of
   metadata
 or
   so), not 64MB.
  
   Brian
  
   On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
  
   Hi,
   I'm porting a legacy application to hadoop and it uses a
 bunch
  of
 small
   files.
   I'm aware that having such small files ain't a good idea but
  I'm
not
   doing
   the technical decisions and the port has to be done for
yesterday...
   Of course such small files are a problem, loading 64MB blocks
  for
   a
 few
   lines of text is an evident loss.
   What will happen if I set a smaller, or even way smaller
 (32kB)
 blocks?
  
   Thank you.
  
   Pierre ANCELOT.
  
  
  
  
   --
   http://www.neko-consulting.com
   Ego sum quis 

RE: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Jones, Nick
I'm not familiar with how to use/create them, but shouldn't a HAR (Hadoop 
Archive) work well in this situation?  I thought it was designed to collect 
several small files together through another level of indirection, to avoid the NN 
load and the need to decrease the HDFS block size.
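
For reference, creating and inspecting one looks roughly like this (paths are
invented, and the exact flags vary by version; later releases add a required
-p <parent> option):

hadoop archive -archiveName small-files.har /user/pierre/small-files /user/pierre/archives
hadoop fs -lsr har:///user/pierre/archives/small-files.har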

Nick Jones
-Original Message-
From: patrickange...@gmail.com [mailto:patrickange...@gmail.com] On Behalf Of 
Patrick Angeles
Sent: Tuesday, May 18, 2010 4:36 PM
To: common-user@hadoop.apache.org
Subject: Re: Any possible to set hdfs block size to a value smaller than 64MB?

That wasn't sarcasm. This is what you do:

- Run your mapreduce job on 30k small files.
- Consolidate your 30k small files into larger files.
 - Run mapreduce on the larger files.
 - Compare the running times.

The difference in runtime is made up by your task startup and seek overhead.

If you want to get the 'average' overhead per task, divide the total times
for each job by the number of map tasks. This won't be a true average
because with larger chunks of data, you will have longer running map tasks
that will hold up the shuffle phase. But the average doesn't really matter
here because you always have that trade-off going from small to large chunks
of data.


On Tue, May 18, 2010 at 7:31 PM, Pierre ANCELOT pierre...@gmail.com wrote:

  Thanks for the sarcasm, but with 30k small files and so 30k Mapper
  instantiations, even though it's not (and never did I say it was) the only
  metric that matters, it seems to me like something very interesting to check
  out...
  I have a hierarchy over me and they will be happy to understand my choices
  with real numbers to base their understanding on.
 Thanks.


 On Tue, May 18, 2010 at 5:00 PM, Patrick Angeles patr...@cloudera.com
 wrote:

  Should be evident in the total job running time... that's the only metric
  that really matters :)
 
  On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT pierre...@gmail.com
  wrote:
 
   Thank you,
   Any way I can measure the startup overhead in terms of time?
  
  
   On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles patr...@cloudera.com
   wrote:
  
Pierre,
   
Adding to what Brian has said (some things are not explicitly
 mentioned
   in
the HDFS design doc)...
   
- If you have small files that take up  64MB you do not actually use
  the
entire 64MB block on disk.
- You *do* use up RAM on the NameNode, as each block represents
  meta-data
that needs to be maintained in-memory in the NameNode.
- Hadoop won't perform optimally with very small block sizes. Hadoop
  I/O
   is
optimized for high sustained throughput per single file/block. There
 is
  a
penalty for doing too many seeks to get to the beginning of each
 block.
Additionally, you will have a MapReduce task per small file. Each
   MapReduce
task has a non-trivial startup overhead.
- The recommendation is to consolidate your small files into large
  files.
One way to do this is via SequenceFiles... put the filename in the
SequenceFile key field, and the file's bytes in the SequenceFile
 value
field.
   
In addition to the HDFS design docs, I recommend reading this blog
  post:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
   
Happy Hadooping,
   
- Patrick
   
On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT pierre...@gmail.com
 
wrote:
   
 Okay, thank you :)


 On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman 
  bbock...@cse.unl.edu
 wrote:

 
  On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
 
   Hi, thanks for this fast answer :)
   If so, what do you mean by blocks? If a file has to be
 splitted,
  it
 will
  be
   splitted when larger than 64MB?
  
 
  For every 64MB of the file, Hadoop will create a separate block.
   So,
if
  you have a 32KB file, there will be one block of 32KB.  If the
 file
   is
 65MB,
  then it will have one block of 64MB and another block of 1MB.
 
  Splitting files is very useful for load-balancing and
 distributing
   I/O
  across multiple nodes.  At 32KB / file, you don't really need to
   split
 the
  files at all.
 
  I recommend reading the HDFS design document for background
 issues
   like
  this:
 
  http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
 
  Brian
 
  
  
  
   On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman 
bbock...@cse.unl.edu
  wrote:
  
   Hey Pierre,
  
   These are not traditional filesystem blocks - if you save a
 file
 smaller
   than 64MB, you don't lose 64MB of file space..
  
   Hadoop will use 32KB to store a 32KB file (ok, plus a KB of
   metadata
 or
   so), not 64MB.
  
   Brian
  
   On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
  
   Hi,
   I'm porting a legacy application to hadoop and it uses a
 bunch
  of
 

Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Todd Lipcon
On Tue, May 18, 2010 at 2:50 PM, Jones, Nick nick.jo...@amd.com wrote:

 I'm not familiar with how to use/create them, but shouldn't a HAR (Hadoop
 Archive) work well in this situation?  I thought it was designed to collect
 several small files together through another level of indirection, to avoid the
 NN load and the need to decrease the HDFS block size.


Yes, or CombineFileInputFormat. JVM reuse also helps somewhat, so long as
you're not talking about hundreds of thousands of files (in which case it
starts to hurt JT load, with that many tasks per job).
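
For the JVM reuse part, a minimal sketch with the old mapred API (the class
name is invented; the underlying property is mapred.job.reuse.jvm.num.tasks,
where -1 means no limit on reuse within a job):

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
  // Returns a JobConf whose task JVMs are reused for an unlimited number of
  // tasks from the same job, cutting per-task JVM startup cost.
  public static JobConf withJvmReuse(Class<?> jobClass) {
    JobConf conf = new JobConf(jobClass);
    conf.setNumTasksToExecutePerJvm(-1); // same as mapred.job.reuse.jvm.num.tasks=-1
    return conf;
  }
}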

There are a number of ways to combat the issue, but the rule of thumb is that
you shouldn't try to use HDFS to store tons of small files :)

-Todd

-Original Message-
 From: patrickange...@gmail.com [mailto:patrickange...@gmail.com] On Behalf
 Of Patrick Angeles
 Sent: Tuesday, May 18, 2010 4:36 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Any possible to set hdfs block size to a value smaller than
 64MB?

 That wasn't sarcasm. This is what you do:

 - Run your mapreduce job on 30k small files.
 - Consolidate your 30k small files into larger files.
  - Run mapreduce on the larger files.
  - Compare the running times.

 The difference in runtime is made up by your task startup and seek
 overhead.

 If you want to get the 'average' overhead per task, divide the total times
 for each job by the number of map tasks. This won't be a true average
 because with larger chunks of data, you will have longer running map tasks
 that will hold up the shuffle phase. But the average doesn't really matter
 here because you always have that trade-off going from small to large
 chunks
 of data.


 On Tue, May 18, 2010 at 7:31 PM, Pierre ANCELOT pierre...@gmail.com
 wrote:

   Thanks for the sarcasm, but with 30k small files and so 30k Mapper
   instantiations, even though it's not (and never did I say it was) the only
   metric that matters, it seems to me like something very interesting to
  check
   out...
   I have a hierarchy over me and they will be happy to understand my choices
   with real numbers to base their understanding on.
  Thanks.
 
 
  On Tue, May 18, 2010 at 5:00 PM, Patrick Angeles patr...@cloudera.com
  wrote:
 
   Should be evident in the total job running time... that's the only
 metric
   that really matters :)
  
   On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT pierre...@gmail.com
   wrote:
  
Thank you,
Any way I can measure the startup overhead in terms of time?
   
   
On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles 
 patr...@cloudera.com
wrote:
   
 Pierre,

 Adding to what Brian has said (some things are not explicitly
  mentioned
in
 the HDFS design doc)...

 - If you have small files that take up  64MB you do not actually
 use
   the
 entire 64MB block on disk.
 - You *do* use up RAM on the NameNode, as each block represents
   meta-data
 that needs to be maintained in-memory in the NameNode.
 - Hadoop won't perform optimally with very small block sizes.
 Hadoop
   I/O
is
 optimized for high sustained throughput per single file/block.
 There
  is
   a
 penalty for doing too many seeks to get to the beginning of each
  block.
 Additionally, you will have a MapReduce task per small file. Each
MapReduce
 task has a non-trivial startup overhead.
 - The recommendation is to consolidate your small files into large
   files.
 One way to do this is via SequenceFiles... put the filename in the
 SequenceFile key field, and the file's bytes in the SequenceFile
  value
 field.

 In addition to the HDFS design docs, I recommend reading this blog
   post:
 http://www.cloudera.com/blog/2009/02/the-small-files-problem/

 Happy Hadooping,

 - Patrick

 On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT 
 pierre...@gmail.com
  
 wrote:

  Okay, thank you :)
 
 
  On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman 
   bbock...@cse.unl.edu
  wrote:
 
  
   On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
  
Hi, thanks for this fast answer :)
If so, what do you mean by blocks? If a file has to be
  splitted,
   it
  will
   be
splitted when larger than 64MB?
   
  
   For every 64MB of the file, Hadoop will create a separate
 block.
So,
 if
   you have a 32KB file, there will be one block of 32KB.  If the
  file
is
  65MB,
   then it will have one block of 64MB and another block of 1MB.
  
   Splitting files is very useful for load-balancing and
  distributing
I/O
   across multiple nodes.  At 32KB / file, you don't really need
 to
split
  the
   files at all.
  
   I recommend reading the HDFS design document for background
  issues
like
   this:
  
   http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
  
   Brian
  
   
   
   
On Tue,