Hi,

I did the following with a 1.6GB file:

hadoop fs -Ddfs.block.size=2147483648 -put /home/nbasjes/access-2010-11-29.log.gz /user/nbasjes

and I got
Total number of blocks: 1
4189183682512190568: 10.10.138.61:50010 10.10.138.62:50010

Yes, that does the trick. Thank you.

Niels

2011/4/27 Harsh J <ha...@cloudera.com>:
> Hey Niels,
>
> The block size is a per-file property. Would putting/creating these
> gzip files on the DFS with a very high block size (such that these
> files are not split across blocks) be a valid solution to your
> problem here?
>
> On Wed, Apr 27, 2011 at 1:25 PM, Niels Basjes <ni...@basjes.nl> wrote:
>> Hi,
>>
>> In some scenarios you have gzipped files as input for your map reduce
>> job (Apache log files are a common example).
>> Now some of those files are several hundred megabytes and as such will
>> be split by HDFS into several blocks.
>>
>> When looking at a real 116MiB file on HDFS I see this (4 nodes,
>> replication = 2):
>>
>> Total number of blocks: 2
>> 25063947863662497: 10.10.138.62:50010 10.10.138.61:50010
>> 1014249434553595747: 10.10.138.64:50010 10.10.138.63:50010
>>
>> As you can see the file has been distributed over all 4 nodes.
>>
>> When actually reading those files they are unsplittable due to the
>> nature of the Gzip codec.
>> So a job will (in the above example) ALWAYS need to pull "the other
>> half" of the file over the network. If the file is bigger and the
>> cluster is bigger, then the percentage of the file that goes over the
>> network will probably increase.
>>
>> Now if I can tell HDFS that a ".gz" file should always be "100% local"
>> for the node that will be doing the processing this would reduce the
>> network IO during the job dramatically.
>> Especially if you want to run several jobs against the same input.
>>
>> So my question is: Is there a way to force/tell HDFS to make sure that
>> a datanode that has blocks of this file must always have ALL blocks of
>> this file?
>>
>> --
>> Best regards,
>>
>> Niels Basjes
>>
>
>
>
> --
> Harsh J
>

--
Kind regards,

Niels Basjes
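
P.S. For reference, the block counts reported in this thread are what you would expect from dividing the file size by the block size and rounding up. Below is a minimal sketch of that arithmetic in Python. It is not part of Hadoop: `block_count` is a hypothetical helper, the 64 MiB default block size is an assumption about the cluster configuration (it is not stated in the thread), and "1.6GB" is taken as 1.6 * 10^9 bytes.

```python
# Hypothetical helper, not part of Hadoop: estimates how many HDFS blocks
# a file occupies for a given per-file block size.

MIB = 1024 * 1024

# Assumption: the cluster uses a 64 MiB default block size
# (dfs.block.size = 67108864). The thread does not state this.
DEFAULT_BLOCK_SIZE = 64 * MIB


def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Return the number of blocks needed, using ceiling division."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    # -(-a // b) is integer ceiling division for non-negative a.
    return -(-file_size_bytes // block_size_bytes)


if __name__ == "__main__":
    # The 116 MiB file from the original question, assumed default block size.
    print(block_count(116 * MIB, DEFAULT_BLOCK_SIZE))

    # The 1.6GB file (taken as 1.6e9 bytes) with dfs.block.size=2147483648.
    print(block_count(1_600_000_000, 2147483648))
```

Under these assumptions the first call matches the two blocks seen for the 116MiB file, and the second matches the single block seen after raising the block size to 2147483648 bytes (2 GiB), which is larger than the whole file.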