Hey Niels,

The block size is a per-file property. Would putting/creating these
gzip files on the DFS with a very high block size (so that such files
never get split across blocks) be a valid solution to your problem
here?
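
A minimal sketch of what I mean, assuming you copy the files in with a
small client program of your own (the local filename, destination path
and 1 GiB block size below are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class PutWithLargeBlock {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            InputStream in = new FileInputStream("access_log.gz");  // local gzip file
            Path dst = new Path("/logs/access_log.gz");             // HDFS destination

            // The block size is chosen per file at create time; pick one
            // larger than the biggest gzip file so it lands in one block.
            long blockSize = 1024L * 1024 * 1024;                   // 1 GiB
            FSDataOutputStream out =
                    fs.create(dst, true, 4096, (short) 2, blockSize);

            IOUtils.copyBytes(in, out, 4096, true);                 // closes both streams
        }
    }

I believe overriding the block size property on the command line should
give the same effect with a plain "hadoop fs -put" (the property is
dfs.block.size on 0.20.x, dfs.blocksize on newer releases), e.g.
"-D dfs.block.size=1073741824".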

On Wed, Apr 27, 2011 at 1:25 PM, Niels Basjes <ni...@basjes.nl> wrote:
> Hi,
>
> In some scenarios you have gzipped files as input for your MapReduce
> job (Apache logfiles are a common example).
> Now some of those files are several hundred megabytes and as such will
> be split by HDFS into several blocks.
>
> When looking at a real 116 MiB file on HDFS I see this (4 nodes,
> replication = 2):
>
> Total number of blocks: 2
> 25063947863662497:     10.10.138.62:50010    10.10.138.61:50010
> 1014249434553595747:   10.10.138.64:50010    10.10.138.63:50010
>
> As you can see the file has been distributed over all 4 nodes.
>
> When actually reading those files, they are unsplittable due to the
> nature of the gzip codec.
> So a job will (in the above example) ALWAYS need to pull "the other
> half" of the file over the network. If the file is bigger and the
> cluster is bigger, then the percentage of the file that goes over the
> network will probably increase.
>
> Now if I could tell HDFS that a ".gz" file should always be "100% local"
> to the node that will be doing the processing, this would reduce the
> network IO during the job dramatically.
> Especially if you want to run several jobs against the same input.
>
> So my question is: is there a way to force/tell HDFS to make sure that
> any datanode that has a block of this file always has ALL blocks of
> this file?
>
> --
> Best regards,
>
> Niels Basjes
>



-- 
Harsh J
