Hey Niels,

The block size is a per-file property. Would putting/creating these gzip files on the DFS with a very high block size (so that such a file does not get split across blocks) be a valid solution to your problem here?
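For example, something along these lines should do it (a rough, untested sketch; the paths and the 1 GiB figure below are only placeholders):

    // Rough sketch: copy a local gzip file into HDFS with a per-file block
    // size large enough to keep the whole file in a single block.
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class PutGzipWithBigBlock {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 1024L * 1024 * 1024;   // 1 GiB, bigger than any of the gzip inputs
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        short replication = fs.getDefaultReplication();

        InputStream in = new FileInputStream("/local/logs/access.log.gz");
        // create(path, overwrite, bufferSize, replication, blockSize)
        OutputStream out = fs.create(new Path("/logs/access.log.gz"),
            true, bufferSize, replication, blockSize);
        IOUtils.copyBytes(in, out, bufferSize, true); // closes both streams
      }
    }

If you only need to do this from the shell, I believe passing -Ddfs.block.size=1073741824 to 'hadoop fs -put' achieves the same effect, since the block size is picked up at write time.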
On Wed, Apr 27, 2011 at 1:25 PM, Niels Basjes <ni...@basjes.nl> wrote:
> Hi,
>
> In some scenarios you have gzipped files as input for your map reduce
> job (apache logfiles is a common example).
> Now some of those files are several hundred megabytes and as such will
> be split by HDFS in several blocks.
>
> When looking at a real 116MiB file on HDFS I see this (4 nodes, replication = 2)
>
> Total number of blocks: 2
> 25063947863662497:    10.10.138.62:50010    10.10.138.61:50010
> 1014249434553595747:  10.10.138.64:50010    10.10.138.63:50010
>
> As you can see the file has been distributed over all 4 nodes.
>
> When actually reading those files they are unsplittable due to the
> nature of the Gzip codec.
> So a job will (in the above example) ALWAYS need to pull "the other
> half" of the file over the network, if a file is bigger and the
> cluster is bigger then the percentage of the file that goes over the
> network will probably increase.
>
> Now if I can tell HDFS that a ".gz" file should always be "100% local"
> for the node that will be doing the processing this would reduce the
> network IO during the job dramatically.
> Especially if you want to run several jobs against the same input.
>
> So my question is: Is there a way to force/tell HDFS to make sure that
> a datanode that has blocks of this file must always have ALL blocks of
> this file?
>
> --
> Best regards,
>
> Niels Basjes

--
Harsh J