Bejoy - I tried this technique a number of times, and was not able to get it to work. My output files remain the same as the input files. Is there a version I need (beyond 0.20.2) to make this work, or another setting that could prevent it from working?
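In case it is something in my setup, here is roughly the driver I have been running, built from your description and using the new API (the Text key/value classes are placeholders for my real ones, and the paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ResizeSequenceFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "resize-sequence-files");
    job.setJarByClass(ResizeSequenceFiles.class);

    // Identity map, no reduce: the base Mapper passes records through unchanged.
    job.setMapperClass(Mapper.class);
    job.setNumReduceTasks(0);

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    // Placeholders -- these match the key/value classes in my sequence files.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Min and max split size both set to the 512MB you suggested.
    long targetSplit = 512L * 1024 * 1024;
    FileInputFormat.setMinInputSplitSize(job, targetSplit);
    FileInputFormat.setMaxInputSplitSize(job, targetSplit);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Chris - I also put a couple of questions about your approach below the quoted thread.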
On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <bejoy.had...@gmail.com> wrote:

> Hi Anna
>
> If you want to increase the block size of existing files, you can use an Identity Mapper with no reducer. Set the min and max split sizes to your requirement (512MB). Use SequenceFileInputFormat and SequenceFileOutputFormat for your job. Your job should be done.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> From: Chris Nauroth <cnaur...@hortonworks.com>
> Date: Mon, 1 Oct 2012 21:12:58 -0700
> To: <user@hadoop.apache.org>
> Reply-To: user@hadoop.apache.org
> Subject: Re: File block size use
>
> Hello Anna,
>
> If I understand correctly, you have a set of multiple sequence files, each much smaller than the desired block size, and you want to concatenate them into a set of fewer files, each one more closely aligned to your desired block size. Presumably, the goal is to improve the throughput of map reduce jobs that use those files as input, by running fewer map tasks that each read a larger number of input records.
>
> Whenever I've had this kind of requirement, I've run a custom map reduce job to implement the file consolidation. In my case, I was typically working with TextInputFormat (not sequence files). I used IdentityMapper and a custom reducer that passed through all values but with the key set to NullWritable, because the keys (input file offsets in the case of TextInputFormat) were not valuable data. For my input data, this was sufficient to achieve a fairly even distribution of data across the reducer tasks, and I could reasonably predict the input data set size, so I could reasonably set the number of reducers and get decent results. (This may or may not be true for your data set though.)
>
> A weakness of this approach is that the keys must pass from the map tasks to the reduce tasks, only to get discarded before writing the final output. Also, the distribution of input records to reduce tasks is not truly random, and therefore the reduce output files may be uneven in size. This could be solved by writing NullWritable keys out of the map task instead of the reduce task and writing a custom implementation of Partitioner to distribute them randomly.
>
> To expand on this idea, it could be possible to inspect the FileStatus of each input, sum the values of FileStatus.getLen(), and then use that information to make a decision about how many reducers to run (and therefore approximately set a target output file size). I'm not aware of any built-in or external utilities that do this for you though.
>
> Hope this helps,
> --Chris
>
> On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <annalah...@gmail.com> wrote:
>
>> I would like to be able to resize a set of inputs, already in SequenceFile format, to be larger.
>>
>> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not get what I expected. The outputs were exactly the same as the inputs.
>>
>> I also tried running a job with an IdentityMapper and IdentityReducer. Although that comes closer to a solution, it still requires that I know in advance how many reducers I need to get better file sizes.
>>
>> I was looking at the SequenceFile.Writer constructors and noticed that there are block size parameters that can be used. Using a writer constructed with a 512MB block size, there is nothing that splits the output and I simply get a single file the size of my inputs.
>>
>> What is the current standard for combining sequence files to create larger files for map-reduce jobs? I have seen code that tracks what it writes into the file, but that seems like the long version. I am hoping there is a shorter path.
>>
>> Thank you.
>>
>> Anna
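Chris - to make sure I am reading your suggestion correctly, is the sketch below roughly what you mean by writing NullWritable keys out of the map task and using a custom Partitioner to distribute the records randomly? The class names and generics are mine, and the three classes are nested in one container only to keep the example in a single file.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class ConsolidateJob {

  // Map side: discard the input keys and emit every value under NullWritable.
  public static class NullKeyMapper<K, V extends Writable>
      extends Mapper<K, V, NullWritable, V> {
    @Override
    protected void map(K key, V value, Context context)
        throws IOException, InterruptedException {
      context.write(NullWritable.get(), value);
    }
  }

  // Every key is NullWritable, so spread the records over the reducers at random.
  public static class RandomPartitioner<V> extends Partitioner<NullWritable, V> {
    private final Random random = new Random();
    @Override
    public int getPartition(NullWritable key, V value, int numPartitions) {
      return random.nextInt(numPartitions);
    }
  }

  // Reduce side: pass everything through; each reducer then writes one larger file.
  public static class PassThroughReducer<V>
      extends Reducer<NullWritable, V, NullWritable, V> {
    @Override
    protected void reduce(NullWritable key, Iterable<V> values, Context context)
        throws IOException, InterruptedException {
      for (V value : values) {
        context.write(key, value);
      }
    }
  }
}

I would wire these in with job.setMapperClass(...), job.setPartitionerClass(...), and job.setReducerClass(...), and set the map output key class to NullWritable and the map output value class to my record class.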
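And for choosing the number of reducers from the total input size, I assume you mean something along these lines, where targetBytes would be the output file size I am aiming for (this sketch assumes a flat input directory):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReducerCountEstimator {

  // Sum FileStatus.getLen() over the input files and size the reducer count
  // so that each reducer writes roughly targetBytes of output.
  public static int estimate(Configuration conf, Path inputDir, long targetBytes)
      throws IOException {
    FileSystem fs = inputDir.getFileSystem(conf);
    long totalBytes = 0;
    for (FileStatus status : fs.listStatus(inputDir)) {
      if (!status.isDir()) {
        totalBytes += status.getLen();
      }
    }
    return (int) Math.max(1, totalBytes / targetBytes);
  }
}

I would then call job.setNumReduceTasks(ReducerCountEstimator.estimate(conf, inputDir, 512L * 1024 * 1024)) before submitting. Does that match what you had in mind?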