Anna, I misunderstood your problem. I thought you wanted to change the block size of every file. I didn't realize that you were aggregating multiple small files into a different, smaller set of larger files with a bigger block size to improve performance.
I think, as Chris suggested, you need a custom M/R job, or you could probably get away with some scripting magic :-)

Raj

From: Anna Lahoud <annalah...@gmail.com>
To: user@hadoop.apache.org; Raj Vishwanathan <rajv...@yahoo.com>
Sent: Tuesday, October 9, 2012 7:01 AM
Subject: Re: File block size use

Raj - I was not able to get this to work either.

On Tue, Oct 2, 2012 at 10:52 AM, Raj Vishwanathan <rajv...@yahoo.com> wrote:

I haven't tried it, but this should also work:

    hadoop fs -Ddfs.block.size=<NEW BLOCK SIZE> -cp src dest

Raj

From: Anna Lahoud <annalah...@gmail.com>
To: user@hadoop.apache.org; bejoy.had...@gmail.com
Sent: Tuesday, October 2, 2012 7:17 AM
Subject: Re: File block size use

Thank you. I will try today.

On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <bejoy.had...@gmail.com> wrote:

Hi Anna,

If you want to increase the block size of existing files, you can use an identity mapper with no reducer. Set the min and max split sizes to your requirement (512 MB), and use SequenceFileInputFormat and SequenceFileOutputFormat for your job. Your job should be done.

Regards,
Bejoy KS

Sent from handheld, please excuse typos.

From: Chris Nauroth <cnaur...@hortonworks.com>
Date: Mon, 1 Oct 2012 21:12:58 -0700
To: user@hadoop.apache.org
Subject: Re: File block size use

Hello Anna,

If I understand correctly, you have a set of multiple sequence files, each much smaller than the desired block size, and you want to concatenate them into a set of fewer files, each one more closely aligned to your desired block size. Presumably, the goal is to improve the throughput of MapReduce jobs that use those files as input by running fewer map tasks, each reading a larger number of input records.

Whenever I've had this kind of requirement, I've run a custom MapReduce job to implement the file consolidation. In my case, I was typically working with TextInputFormat (not sequence files). I used IdentityMapper and a custom reducer that passed through all values, but with the key set to NullWritable, because the keys (input file offsets, in the case of TextInputFormat) were not valuable data. For my input data, this was sufficient to achieve a fairly even distribution of data across the reducer tasks, and I could reasonably predict the input data set size, so I could reasonably set the number of reducers and get decent results. (This may or may not be true for your data set, though.)

A weakness of this approach is that the keys must pass from the map tasks to the reduce tasks, only to get discarded before writing the final output. Also, the distribution of input records to reduce tasks is not truly random, so the reduce output files may be uneven in size. Both problems could be solved by writing NullWritable keys out of the map task instead of the reduce task, and writing a custom implementation of Partitioner to distribute records to the reducers randomly.
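For sequence files, a rough sketch of that variation might look like the following. (Untested; it assumes Hadoop 2's org.apache.hadoop.mapreduce API and BytesWritable keys/values, so substitute your actual types. The class names are only illustrative.)

    import java.io.IOException;
    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class ConsolidateSeqFiles {

      // Discard the input keys in the map phase; pass values straight through.
      public static class NullKeyMapper
          extends Mapper<BytesWritable, BytesWritable, NullWritable, BytesWritable> {
        @Override
        protected void map(BytesWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
          context.write(NullWritable.get(), value);
        }
      }

      // Spread records across reducers at random so output files come out even.
      public static class RandomPartitioner
          extends Partitioner<NullWritable, BytesWritable> {
        private final Random random = new Random();
        @Override
        public int getPartition(NullWritable key, BytesWritable value, int numPartitions) {
          return random.nextInt(numPartitions);
        }
      }

      // Pass-through reducer; each reduce task writes one consolidated file.
      public static class PassThroughReducer
          extends Reducer<NullWritable, BytesWritable, NullWritable, BytesWritable> {
        @Override
        protected void reduce(NullWritable key, Iterable<BytesWritable> values,
            Context context) throws IOException, InterruptedException {
          for (BytesWritable value : values) {
            context.write(key, value);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "consolidate");
        job.setJarByClass(ConsolidateSeqFiles.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(NullKeyMapper.class);
        job.setPartitionerClass(RandomPartitioner.class);
        job.setReducerClass(PassThroughReducer.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(BytesWritable.class);
        job.setNumReduceTasks(10); // pick based on total input size / target file size
        SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Each reduce task writes one output file, so the number of reducers effectively determines the output file sizes.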
To expand on this idea, it could be possible to inspect the FileStatus of each input, sum the values of FileStatus.getLen(), and then use that information to decide how many reducers to run (and therefore approximately set a target output file size). I'm not aware of any built-in or external utilities that do this for you, though (a sketch of one possible helper appears at the end of this thread).

Hope this helps,
--Chris

On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <annalah...@gmail.com> wrote:

I would like to be able to resize a set of inputs, already in SequenceFile format, to be larger.

I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not get what I expected. The outputs were exactly the same as the inputs.

I also tried running a job with an IdentityMapper and IdentityReducer. Although that approaches a better solution, it still requires that I know in advance how many reducers I need to get better file sizes.

I was looking at the SequenceFile.Writer constructors and noticed that there are block size parameters that can be used. Using a writer constructed with a 512MB block size, there is nothing that splits the output, and I simply get a single file the size of my inputs.

What is the current standard for combining sequence files to create larger files for map-reduce jobs? I have seen code that tracks what it writes into the file, but that seems like the long version. I am hoping there is a shorter path.

Thank you.

Anna
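To make Chris's sizing suggestion concrete, a small helper along these lines (untested; the class and method names and the flat directory listing are only illustrative) could sum FileStatus.getLen() over the inputs and derive the reducer count to pass to setNumReduceTasks in the job sketched earlier in the thread:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReducerCountHelper {

      // Sum the lengths of all files under inputDir and divide by the target
      // output file size, rounding up, to approximate a reducer count.
      public static int reducersForTargetSize(Configuration conf, Path inputDir,
          long targetBytesPerFile) throws IOException {
        FileSystem fs = inputDir.getFileSystem(conf);
        long totalBytes = 0;
        for (FileStatus status : fs.listStatus(inputDir)) {
          if (!status.isDir()) {
            totalBytes += status.getLen();
          }
        }
        return (int) Math.max(1, (totalBytes + targetBytesPerFile - 1) / targetBytesPerFile);
      }
    }

Calling job.setNumReduceTasks(ReducerCountHelper.reducersForTargetSize(conf, inputDir, 512L * 1024 * 1024)) in the driver would then target output files of roughly 512 MB each.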