I would strongly recommend leaving the block size large. Writing the small files is no big deal since no space is wasted to speak of.
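
On your first question below: block size in HDFS is a per-file attribute that is fixed when the file is created, so different files can happily carry different block sizes, and nothing stops the streaming code from just using the cluster default. Here is a minimal sketch, assuming the standard FileSystem.create overload that takes an explicit block size; the path and payload are invented for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithBlockSize {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Block size is chosen per file at create time; the cluster-wide
        // dfs.block.size is only the default you get when you don't specify one.
        long blockSize = 128L * 1024 * 1024;           // keep the 128 MB default
        short replication = fs.getDefaultReplication();

        FSDataOutputStream out = fs.create(new Path("/streams/chunk-0001"),
                                           true,       // overwrite
                                           4096,       // io buffer size
                                           replication,
                                           blockSize);
        out.write("payload goes here".getBytes("UTF-8"));
        out.close();
      }
    }

A 512 KB file written this way still only occupies 512 KB on the datanodes (plus replication); the 128 MB is an upper bound per block, not an allocation.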
At the data rate that you are talking about, the cost of merging should not be a big deal. You should definitely merge often enough to avoid having very many of these small files. If you have hundreds of them, you will definitely notice significant degradation in your ability to process them. One useful strategy is to merge them repeatedly. This costs you a little bit in repeated merging, but wins big by keeping the number of files much smaller. A rough sketch of such a merge pass is appended below the quoted mail.

For the future, lohit's comments are exactly correct ... archive files and append will make your problems much easier.

For coordinating which files are current and which are partially done, you might consider using ZooKeeper. Very nice for fast, reliable updates. There is a small sketch of that below as well.

On Fri, Jun 27, 2008 at 1:18 AM, Goel, Ankur <[EMAIL PROTECTED]> wrote:
> Hi Folks,
>        I have a setup wherein I am streaming data into HDFS from a
> remote location and creating a new file every X min. The file generated
> is of a very small size (512 KB - 6 MB). Since that is the size range,
> the streaming code sets the block size to 6 MB, whereas the default that
> we have set for the cluster is 128 MB. The idea behind such a thing is
> to generate small temporal data chunks from multiple sources and merge
> them periodically into a big chunk with our default (128 MB) block size.
>
> The webUI for DFS reports the block size for these files to be 6 MB. My
> questions are:
> 1. Can we have multiple files in DFS use different block sizes?
> 2. If we use default block size for these small chunks, is the DFS space
>    wasted?
>    If not, does it mean that a single DFS block can hold data from
>    more than one file?
>
> Thanks
> -Ankur

--
ted
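
Here is the merge sketch mentioned above. Nothing in it is specific to your setup; the directory names are invented, and the loop simply streams each small chunk into one output file created with the cluster default block size before deleting the originals:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class MergeSmallFiles {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path incoming = new Path("/streams/incoming");      // made-up paths
        Path merged = new Path("/streams/merged/part-" + System.currentTimeMillis());

        // One big output file, written with the cluster default block size.
        FSDataOutputStream out = fs.create(merged);
        FileStatus[] chunks = fs.listStatus(incoming);
        for (FileStatus chunk : chunks) {
          FSDataInputStream in = fs.open(chunk.getPath());
          IOUtils.copyBytes(in, out, 65536, false);   // keep the output open
          in.close();
        }
        out.close();

        // Only remove the small files once the merged copy is safely closed.
        for (FileStatus chunk : chunks) {
          fs.delete(chunk.getPath(), false);
        }
      }
    }

Run the same pass again over the merged outputs from time to time and the total file count stays small even as data keeps accumulating.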

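And a sketch of the ZooKeeper coordination idea, assuming the plain ZooKeeper Java client; the znode layout, quorum address, and class name are all made up. The writer creates a marker znode when a chunk is fully written, and the merge job only picks up chunks that have a marker:

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ChunkCoordinator {
      private final ZooKeeper zk;

      public ChunkCoordinator(String quorum) throws Exception {
        // A real client should react to connection events; this sketch ignores them.
        zk = new ZooKeeper(quorum, 30000, new Watcher() {
          public void process(WatchedEvent event) { }
        });
      }

      // Called by the streaming writer after a chunk file is closed.
      // Assumes the parent znode /chunks/complete already exists.
      public void markComplete(String chunkName) throws Exception {
        zk.create("/chunks/complete/" + chunkName, new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      }

      // Called by the merge job to find chunks that are safe to merge.
      public List<String> completedChunks() throws Exception {
        return zk.getChildren("/chunks/complete", false);
      }
    }

Anything listed in HDFS but not marked in ZooKeeper is treated as partially written and left alone until its marker shows up.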