hi,

Here is some useful info:

A small file is one that is significantly smaller than the HDFS block size 
(default 64MB). If you're storing small files, then you probably have lots of 
them (otherwise you wouldn't turn to Hadoop), and the problem is that HDFS 
can't efficiently handle large numbers of files.

Every file, directory and block in HDFS is represented as an object in the 
namenode's memory, each of which occupies about 150 bytes as a rule of thumb. 
So 10 million files, each using a block, amount to roughly 20 million objects 
(one file object plus one block object apiece), or about 3 gigabytes of 
namenode memory. Scaling up much beyond this level is a problem with current 
hardware; a billion files is certainly not feasible.
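
As a quick back-of-the-envelope check of that figure, here is a minimal sketch 
(my own illustration in Java, not from the blog post) that just multiplies the 
counts out, assuming one block per file and the 150-byte rule of thumb:

    // Back-of-the-envelope namenode heap estimate (assumptions: one block per
    // file, ~150 bytes per namenode object, directories ignored).
    public class NamenodeMemoryEstimate {
        public static void main(String[] args) {
            long files = 10000000L;          // 10 million small files
            long objectsPerFile = 2L;        // one file object + one block object
            long bytesPerObject = 150L;      // rule-of-thumb figure
            long totalBytes = files * objectsPerFile * bytesPerObject;
            System.out.printf("~%.1f GB of namenode heap%n", totalBytes / 1e9);
        }
    }

This prints roughly 3.0 GB, matching the estimate above.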

Furthermore, HDFS is not geared up for efficient access to small files: it is 
primarily designed for streaming access to large files. Reading through small 
files normally causes lots of seeks and lots of hopping from datanode to 
datanode to retrieve each small file, all of which adds up to an inefficient 
data access pattern.

Problems with small files and MapReduce

Map tasks usually process a block of input at a time (using the default 
FileInputFormat). If the files are very small and there are a lot of them, then 
each map task processes very little input, and there are many more map tasks, 
each of which imposes extra bookkeeping overhead. Compare a 1GB file broken 
into sixteen 64MB blocks with 10,000 or so 100KB files. The 10,000 files use 
one map each, and the job time can be tens or hundreds of times slower than 
the equivalent job with a single input file.
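
To make the comparison concrete, here is a small sketch (again my own 
illustration, not from the blog post) of how many map tasks each case produces, 
assuming the default behaviour of one split per block for a large file and one 
split per file for files smaller than a block:

    // Rough split-count comparison (assumption: a large file gets one split
    // per 64MB block, and a file smaller than a block gets one split of its own).
    public class SplitCountComparison {
        static long splitsForFile(long fileBytes, long blockBytes) {
            return Math.max(1L, (fileBytes + blockBytes - 1) / blockBytes); // ceiling division
        }
        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;   // 64 MB block
            long bigFile = 1024L * 1024 * 1024;   // one 1 GB file
            long smallFile = 100L * 1024;         // one 100 KB file
            System.out.println("Single 1GB file:      "
                    + splitsForFile(bigFile, blockSize) + " map tasks");
            System.out.println("10,000 x 100KB files: "
                    + 10000 * splitsForFile(smallFile, blockSize) + " map tasks");
        }
    }

That works out to 16 map tasks versus 10,000, which is where the extra 
scheduling and bookkeeping overhead comes from.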

There are a couple of features that help alleviate the bookkeeping overhead: 
task JVM reuse for running multiple map tasks in one JVM, thereby avoiding some 
JVM startup overhead (see the mapred.job.reuse.jvm.num.tasks property), and 
MultiFileInputSplit, which can run more than one split per map.
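
As a sketch of how those two knobs can be wired up with the old mapred API 
(the class name SmallFilesJobConfig is just for illustration; the property and 
Hadoop classes are the standard ones):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Sketch of a job configuration for many small input files
    // (old mapred API, circa Hadoop 0.20).
    public class SmallFilesJobConfig {
        public static JobConf configure(Class<?> jobClass) {
            JobConf conf = new JobConf(jobClass);
            conf.setInputFormat(TextInputFormat.class);
            // -1 = reuse one task JVM for an unlimited number of tasks,
            // instead of the default of a fresh JVM per task.
            conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
            // To pack several small files into one split, a subclass of
            // MultiFileInputFormat (or CombineFileInputFormat) would replace
            // TextInputFormat above.
            return conf;
        }
    }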

Just copied from Cloudera's blog:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/#comments

regards,
Uma

----- Original Message -----
From: lessonz <less...@q.com>
Date: Thursday, September 29, 2011 11:10 pm
Subject: Block Size
To: common-user <common-user@hadoop.apache.org>

> I'm new to Hadoop, and I'm trying to understand the implications of 
> a 64M
> block size in the HDFS. Is there a good reference that enumerates the
> implications of this decision and its effects on files stored in 
> the system
> as well as map-reduce jobs?
> 
> Thanks.
>
