Re: how to improve the Hadoop's capability of dealing with small files

2009-05-12 Thread Rasit OZDAS
I have a similar situation: I have very small files.
I have never tried HBase (I want to), but you can also group them
and write (let's say) 20-30 of them into one big file, so that every small
file becomes a key in that big file.

There are methods in the API with which you can write an object as a file into HDFS,
and read it back to get the original object. Keeping the list of items inside that
object can solve this problem.
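
A rough sketch of the grouping idea, using SequenceFile as one way to pack
many small files into a single HDFS file keyed by file name (the local and
HDFS paths below are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.File;
import java.io.FileInputStream;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/data/packed/small-files.seq");   // example output path

    // key = original file name, value = raw file contents
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class);
    try {
      for (File f : new File("/local/small-files").listFiles()) {
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          IOUtils.readFully(in, buf, 0, buf.length);
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

Reading the entries back with SequenceFile.Reader gives you the original
file name and bytes for each small file.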


Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread Piotr Praczyk
If you want to use many small files, they probably share the same
purpose and structure.
Why not use HBase instead of raw HDFS? Many small files would be packed
together and the problem would disappear.
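
A minimal sketch of what I mean, assuming an HBase table named "small_files"
with a column family "f" already exists (the table and column names are just
examples, and the HTable/Put client API shown here is from later HBase
releases, so adjust it for the version you run):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SmallFileToHBase {
  // Store one small file's bytes as a single cell, keyed by its file name.
  public static void store(String fileName, byte[] contents) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "small_files");
    try {
      Put put = new Put(Bytes.toBytes(fileName));              // row key = file name
      put.add(Bytes.toBytes("f"), Bytes.toBytes("data"), contents);
      table.put(put);
    } finally {
      table.close();
    }
  }
}

HBase then packs many such rows into its own store files, so HDFS only sees
a few large files instead of millions of tiny ones.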

cheers
Piotr




Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread Jeff Hammerbacher
Hey,

You can read more about why small files are difficult for HDFS at
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.

Regards,
Jeff



Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread Edward Capriolo
When the small file problem comes up, most of the talk centers around
the inode table being held in namenode memory. The Cloudera blog points out
something further:

Furthermore, HDFS is not geared up to efficiently accessing small
files: it is primarily designed for streaming access of large files.
Reading through small files normally causes lots of seeks and lots of
hopping from datanode to datanode to retrieve each small file, all of
which is an inefficient data access pattern.

My application attempted to load 9000 6 KB files using a single-threaded
program and FSDataOutputStream objects to write each one directly to an HDFS
file. My plan was to have Hadoop merge these files in the next step. I had to
abandon this plan because the process was taking hours. I knew HDFS had a
small file problem, but I never realized that I could not approach the problem
the 'old fashioned way'. I merged the files locally instead, and uploading a
few larger files gave great throughput.
Small files are not just a permanent-storage issue; they are a serious
performance concern as well.
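
Roughly what the local-merge workaround looked like (the paths are examples,
and a plain concatenation loses the file boundaries, so something like the
SequenceFile approach earlier in the thread is a better fit if you need the
individual files back):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class MergeThenUpload {
  public static void main(String[] args) throws Exception {
    // Concatenate all the small local files into one local file.
    File merged = new File("/tmp/merged.dat");
    FileOutputStream out = new FileOutputStream(merged);
    byte[] buf = new byte[8192];
    for (File f : new File("/local/small-files").listFiles()) {
      FileInputStream in = new FileInputStream(f);
      int n;
      while ((n = in.read(buf)) > 0) {
        out.write(buf, 0, n);
      }
      in.close();
    }
    out.close();

    // One large upload instead of thousands of tiny ones.
    FileSystem fs = FileSystem.get(new Configuration());
    fs.copyFromLocalFile(new Path(merged.getAbsolutePath()),
                         new Path("/data/merged.dat"));
  }
}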


Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread jason hadoop
The way I typically address that is to write a zip file using the zip
utilities, commonly for output.
HDFS is not optimized for low latency, but for high throughput on bulk
operations.
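
Something along these lines, writing the zip directly into HDFS (the paths
and file list are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;
import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipSmallFiles {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream raw = fs.create(new Path("/data/packed/small-files.zip"));
    ZipOutputStream zip = new ZipOutputStream(raw);
    byte[] buf = new byte[8192];
    try {
      for (File f : new File("/local/small-files").listFiles()) {
        zip.putNextEntry(new ZipEntry(f.getName()));   // one entry per small file
        FileInputStream in = new FileInputStream(f);
        int n;
        while ((n = in.read(buf)) > 0) {
          zip.write(buf, 0, n);
        }
        in.close();
        zip.closeEntry();
      }
    } finally {
      zip.close();   // also closes the underlying HDFS stream
    }
  }
}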




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: how to improve the Hadoop's capability of dealing with small files

2009-05-06 Thread imcaptor
Please try -D dfs.block.size=4096000
The value must be specified in bytes (a suffix like 4M will not work).
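
If you prefer to set it per file from code rather than on the command line,
FileSystem.create also takes an explicit block size; a rough sketch (the path
and sizes are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallBlockWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    long blockSize = 4L * 1024 * 1024;   // 4 MB, given in bytes
    short replication = 3;
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);

    // This overload lets you override replication and block size for one file.
    FSDataOutputStream out = fs.create(
        new Path("/dest/file"), true, bufferSize, replication, blockSize);
    try {
      out.writeBytes("example payload\n");
    } finally {
      out.close();
    }
  }
}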


On Tue, May 5, 2009 at 4:47 AM, Christian Ulrik Søttrup soett...@nbi.dk
wrote:

 Hi all,

 I have a job that creates very big local files, so I need to split it across
 as many mappers as possible. With the DFS block size I'm using, this job is
 only split into 3 mappers. I don't want to change the cluster-wide HDFS block
 size because it works for my other jobs.

 Is there a way to give a specific file a different block size? The
 documentation says there is, but does not explain how.
 I've tried:
 hadoop dfs -D dfs.block.size=4M -put file  /dest/

 But that does not work.

 Any help would be appreciated.

 Cheers,
 Chrulle






Re: how to improve the Hadoop's capability of dealing with small files

2009-05-06 Thread Jonathan Cao
There are at least two design choices in Hadoop that have implications for
your scenario.
1. All the HDFS metadata is stored in namenode memory -- the memory size
is one limitation on how many small files you can have.

2. The efficiency of the map/reduce paradigm dictates that each mapper/reducer
task has enough work to offset the overhead of spawning it. It relies on each
task reading a contiguous chunk of data (typically 64MB); your small-file
situation will turn those efficient sequential reads into a large number of
inefficient random reads.

Of course, "small" is a relative term.
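
For a sense of scale on the first point, using the rough figure of about 150
bytes of namenode memory per file, directory, or block object from the
Cloudera post linked earlier in the thread:

  10,000,000 small files, each in its own block
  -> about 20,000,000 namenode objects (one per file plus one per block)
  -> 20,000,000 x 150 bytes, roughly 3 GB of namenode heap for metadata alone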

Jonathan

2009/5/6 陈桂芬 chenguifen...@163.com

 Hi:

 In my application there are many small files, but Hadoop is designed
 to deal with large files.

 I want to know why Hadoop doesn't support small files very well, where
 the bottleneck is, and what I can do to improve Hadoop's capability of
 dealing with small files.

 Thanks.