Re: how to improve the Hadoop's capability of dealing with small files
I have a similar situation: I have very small files. I never tried HBase (I want to), but you can also group them and write, say, 20-30 of them into one file, where every original file becomes a key in that big file. There are methods in the API that let you write an object as a file into HDFS and read it back to recover the original object. Keeping a list of items in one object can solve this problem.
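In Hadoop itself this grouping is usually done with a SequenceFile (filename as key, file bytes as value). The idea can be sketched stand-alone like this; the names and record layout below are hypothetical, just to show the key/value packing round trip:

```python
import struct

def pack(files: dict) -> bytes:
    """Pack {name: bytes} into one blob: [name_len][name][data_len][data]..."""
    out = bytearray()
    for name, data in files.items():
        key = name.encode("utf-8")
        out += struct.pack(">I", len(key)) + key
        out += struct.pack(">I", len(data)) + data
    return bytes(out)

def unpack(blob: bytes) -> dict:
    """Recover the original {name: bytes} mapping from a packed blob."""
    files, i = {}, 0
    while i < len(blob):
        (klen,) = struct.unpack_from(">I", blob, i); i += 4
        name = blob[i:i + klen].decode("utf-8"); i += klen
        (dlen,) = struct.unpack_from(">I", blob, i); i += 4
        files[name] = blob[i:i + dlen]; i += dlen
    return files

small = {"a.txt": b"alpha", "b.txt": b"beta"}
assert unpack(pack(small)) == small
```

With SequenceFile the same round trip is handled by SequenceFile.Writer and SequenceFile.Reader, and the packed file remains usable as MapReduce input.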
Re: how to improve the Hadoop's capability of dealing with small files
The way I typically address that is to write a zip file using the zip utilities, commonly for output. HDFS is not optimized for low latency, but for high throughput on bulk operations.

2009/5/7 Edward Capriolo
> [earlier messages in the thread trimmed]
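The zip approach above can be sketched with Python's standard zipfile module (file names below are hypothetical); many small inputs become one archive member each, and only the single archive file needs to touch HDFS:

```python
import io
import zipfile

# Pack several small "files" into one zip archive held in memory.
small = {"log-001.txt": b"first record", "log-002.txt": b"second record"}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in small.items():
        zf.writestr(name, data)

# Read a member back out to check nothing was lost.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    assert zf.read("log-001.txt") == b"first record"
    assert sorted(zf.namelist()) == ["log-001.txt", "log-002.txt"]
```

One trade-off to note: a plain zip archive is read as a unit, which suits the output-side use described above.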
Re: how to improve the Hadoop's capability of dealing with small files
2009/5/7 Jeff Hammerbacher
> [earlier messages in the thread trimmed]

When the small file problem comes up, most of the talk centers around the inode table being in memory. The Cloudera blog points out something more:

> Furthermore, HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.

My application attempted to load 9000 6 KB files using a single-threaded application and FSDataOutputStream objects to write directly to Hadoop files. My plan was to have Hadoop merge these files in the next step. I had to abandon this plan because the process was taking hours. I knew HDFS had a "small file problem", but I never realized that I could not do this the 'old-fashioned way'. I merged the files locally, and uploading a few large files gave great throughput. Small files are not just a permanent-storage issue; they are a serious optimization concern.
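The workaround described above (merge locally, then upload a few large files) is plain concatenation when the records are self-delimiting, e.g. line-oriented. A minimal sketch with throwaway file names; the final upload of the merged file is left as a comment since it needs a running cluster:

```python
import os
import tempfile

def merge_locally(paths, out_path):
    """Concatenate many small local files into one before uploading to HDFS."""
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as src:
                out.write(src.read())

# Demo with a few tiny temporary files standing in for the small inputs.
tmp = tempfile.mkdtemp()
parts = []
for i, payload in enumerate([b"one\n", b"two\n", b"three\n"]):
    p = os.path.join(tmp, f"part-{i}")
    with open(p, "wb") as f:
        f.write(payload)
    parts.append(p)

merged = os.path.join(tmp, "merged.dat")
merge_locally(parts, merged)
# One bulk upload instead of thousands of tiny writes:
#   hadoop fs -put merged.dat /data/merged.dat
```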
Re: how to improve the Hadoop's capability of dealing with small files
Hey,

You can read more about why small files are difficult for HDFS at http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.

Regards,
Jeff

2009/5/7 Piotr Praczyk
> [earlier messages in the thread trimmed]
Re: how to improve the Hadoop's capability of dealing with small files
If you want to use many small files, they probably serve the same purpose and have the same structure. Why not use HBase instead of raw HDFS? Many small files would be packed together and the problem would disappear.

cheers
Piotr

2009/5/7 Jonathan Cao
> [earlier messages in the thread trimmed]
Re: how to improve the Hadoop's capability of dealing with small files
There are at least two design choices in Hadoop that have implications for your scenario.

1. All the HDFS metadata is stored in namenode memory -- the memory size is one limitation on how many "small" files you can have.

2. The efficiency of the map/reduce paradigm dictates that each mapper/reducer task have enough work to offset the overhead of spawning it. It relies on each task reading a contiguous chunk of data (typically 64 MB); your small-file situation will change those efficient sequential reads into a larger number of inefficient random reads.

Of course, "small" is a relative term.

Jonathan

2009/5/6 陈桂芬
> Hi:
>
> In my application, there are many small files. But Hadoop is designed to deal with many large files.
>
> I want to know why Hadoop doesn't support small files very well and where the bottleneck is. And what can I do to improve Hadoop's capability of dealing with small files?
>
> Thanks.
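To put the first point in numbers: a commonly cited rule of thumb (from the Cloudera post linked elsewhere in this thread) is roughly 150 bytes of namenode heap per file-system object, and each small file costs at least one file entry plus one block entry. The 150-byte figure is an estimate, not a spec, but it makes the scaling concrete:

```python
BYTES_PER_OBJECT = 150      # rough namenode heap cost per file/dir/block entry
OBJECTS_PER_SMALL_FILE = 2  # one file entry + one block entry

def namenode_heap_bytes(num_files: int) -> int:
    """Estimated namenode heap consumed by metadata for num_files small files."""
    return num_files * OBJECTS_PER_SMALL_FILE * BYTES_PER_OBJECT

# Ten million small files cost about 3 GB of namenode heap for metadata alone.
print(namenode_heap_bytes(10_000_000) / 1024**3)  # roughly 2.8 GiB
```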
Re: how to improve the Hadoop's capability of dealing with small files
Please try -D dfs.block.size=4096000. The specification must be in bytes.

On Tue, May 5, 2009 at 4:47 AM, Christian Ulrik Søttrup wrote:
> Hi all,
>
> I have a job that creates very big local files, so I need to split it across as many mappers as possible. The DFS block size I'm using means that this job is only split into 3 mappers. I don't want to change the HDFS-wide block size because it works for my other jobs.
>
> Is there a way to give a specific file a different block size? The documentation says there is, but does not explain how.
> I've tried:
> hadoop dfs -D dfs.block.size=4M -put file /dest/
>
> But that does not work.
>
> Any help would be appreciated.
>
> Cheers,
> Chrulle

2009/5/7 陈桂芬
> [original question trimmed]
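The failing attempt used dfs.block.size=4M; as the reply notes, the value has to be spelled out in bytes (4096000 in the reply, or exactly 4 MiB as computed below). A trivial helper to avoid that mistake:

```python
def mib_to_bytes(mib: int) -> int:
    """dfs.block.size must be an integer number of bytes, not a suffix like '4M'."""
    return mib * 1024 * 1024

size = mib_to_bytes(4)
print(size)  # 4194304
# e.g.  hadoop dfs -D dfs.block.size=4194304 -put file /dest/
```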