Anna

I misunderstood your problem. I thought you wanted to change the block size of 
every file. I didn't realize that you were aggregating multiple small files 
into a different, albeit smaller, set of larger files with a bigger block size 
to improve performance. 

I think, as Chris suggested, you need a custom M/R job, or you could 
probably get away with some scripting magic :-)

Raj



>________________________________
> From: Anna Lahoud <annalah...@gmail.com>
>To: user@hadoop.apache.org; Raj Vishwanathan <rajv...@yahoo.com> 
>Sent: Tuesday, October 9, 2012 7:01 AM
>Subject: Re: File block size use
> 
>
>Raj - I was not able to get this to work either. 
>
>
>On Tue, Oct 2, 2012 at 10:52 AM, Raj Vishwanathan <rajv...@yahoo.com> wrote:
>
>I haven't tried it but this should also work
>>
>>
>> hadoop fs -Ddfs.block.size=<NEW BLOCK SIZE> -cp src dest
>>
>>
>>
>>Raj
>>
>>
>>
>>>________________________________
>>> From: Anna Lahoud <annalah...@gmail.com>
>>>To: user@hadoop.apache.org; bejoy.had...@gmail.com 
>>>Sent: Tuesday, October 2, 2012 7:17 AM
>>>
>>>Subject: Re: File block size use
>>> 
>>>
>>>
>>>Thank you. I will try today.
>>>
>>>
>>>On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <bejoy.had...@gmail.com> wrote:
>>>
>>>Hi Anna
>>>>
>>>>If you want to increase the block size of existing files, you can use an 
>>>>Identity Mapper with no reducer. Set the min and max split sizes to your 
>>>>requirement (512 MB). Use SequenceFileInputFormat and 
>>>>SequenceFileOutputFormat for your job.
>>>>Your job should be done.
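>>>>
>>>>A minimal driver sketch of that setup, assuming the new 
>>>>org.apache.hadoop.mapreduce API (the class name, key/value classes, and 
>>>>paths here are illustrative, not tested values):
>>>>
>>>>    // Identity map-only job that rewrites sequence files with large splits.
>>>>    import org.apache.hadoop.conf.Configuration;
>>>>    import org.apache.hadoop.fs.Path;
>>>>    import org.apache.hadoop.io.Text;
>>>>    import org.apache.hadoop.mapreduce.Job;
>>>>    import org.apache.hadoop.mapreduce.Mapper;
>>>>    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>>>    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
>>>>    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>>>    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
>>>>
>>>>    public class SequenceFileResizeJob {
>>>>      public static void main(String[] args) throws Exception {
>>>>        Configuration conf = new Configuration();
>>>>        Job job = new Job(conf, "resize sequence files");
>>>>        job.setJarByClass(SequenceFileResizeJob.class);
>>>>
>>>>        job.setInputFormatClass(SequenceFileInputFormat.class);
>>>>        job.setOutputFormatClass(SequenceFileOutputFormat.class);
>>>>
>>>>        // Identity mapper, no reducer: records pass straight through.
>>>>        job.setMapperClass(Mapper.class);
>>>>        job.setNumReduceTasks(0);
>>>>
>>>>        // Set these to the key/value classes of your sequence files.
>>>>        job.setOutputKeyClass(Text.class);
>>>>        job.setOutputValueClass(Text.class);
>>>>
>>>>        // Ask for ~512 MB splits.
>>>>        long target = 512L * 1024 * 1024;
>>>>        FileInputFormat.setMinInputSplitSize(job, target);
>>>>        FileInputFormat.setMaxInputSplitSize(job, target);
>>>>
>>>>        FileInputFormat.addInputPath(job, new Path(args[0]));
>>>>        FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>>        System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>      }
>>>>    }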
>>>>
>>>>
>>>>Regards
>>>>Bejoy KS
>>>>
>>>>Sent from handheld, please excuse typos.
>>>>________________________________
>>>>
>>>>From:  Chris Nauroth <cnaur...@hortonworks.com> 
>>>>Date: Mon, 1 Oct 2012 21:12:58 -0700
>>>>To: <user@hadoop.apache.org>
>>>>ReplyTo:  user@hadoop.apache.org 
>>>>Subject: Re: File block size use
>>>>
>>>>Hello Anna,
>>>>
>>>>
>>>>If I understand correctly, you have a set of multiple sequence files, each 
>>>>much smaller than the desired block size, and you want to concatenate them 
>>>>into a set of fewer files, each one more closely aligned to your desired 
>>>>block size.  Presumably, the goal is to improve throughput of map reduce 
>>>>jobs using those files as input by running fewer map tasks, each reading a 
>>>>larger number of input records.
>>>>
>>>>
>>>>Whenever I've had this kind of requirement, I've run a custom map reduce 
>>>>job to implement the file consolidation.  In my case, I was typically 
>>>>working with TextInputFormat (not sequence files).  I used IdentityMapper 
>>>>and a custom reducer that passed through all values but with key set to 
>>>>NullWritable, because the keys (input file offsets in the case of 
>>>>TextInputFormat) were not valuable data.  For my input data, this was 
>>>>sufficient to achieve fairly even distribution of data across the reducer 
>>>>tasks, and I could reasonably predict the input data set size, so I could 
>>>>reasonably set the number of reducers and get decent results.  (This may or 
>>>>may not be true for your data set though.)
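>>>>
>>>>A new-API sketch of that key-discarding reducer (the plain Mapper base 
>>>>class plays the IdentityMapper role here; the types assume TextInputFormat's 
>>>>LongWritable offsets and Text lines, and the class name is illustrative):
>>>>
>>>>    // Drops the file-offset keys and passes every value through unchanged.
>>>>    import java.io.IOException;
>>>>    import org.apache.hadoop.io.LongWritable;
>>>>    import org.apache.hadoop.io.NullWritable;
>>>>    import org.apache.hadoop.io.Text;
>>>>    import org.apache.hadoop.mapreduce.Reducer;
>>>>
>>>>    public class NullKeyReducer
>>>>        extends Reducer<LongWritable, Text, NullWritable, Text> {
>>>>      @Override
>>>>      protected void reduce(LongWritable key, Iterable<Text> values,
>>>>          Context context) throws IOException, InterruptedException {
>>>>        for (Text value : values) {
>>>>          context.write(NullWritable.get(), value);
>>>>        }
>>>>      }
>>>>    }
>>>>
>>>>It would be wired in with job.setMapperClass(Mapper.class), 
>>>>job.setReducerClass(NullKeyReducer.class), an explicit 
>>>>job.setNumReduceTasks(n), and map output classes set to LongWritable/Text 
>>>>since they differ from the final NullWritable/Text output.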
>>>>
>>>>
>>>>A weakness of this approach is that the keys must pass from the map tasks 
>>>>to the reduce tasks, only to get discarded before writing the final output. 
>>>> Also, the distribution of input records to reduce tasks is not truly 
>>>>random, and therefore the reduce output files may be uneven in size.  This 
>>>>could be solved by writing NullWritable keys out of the map task instead of 
>>>>the reduce task and writing a custom implementation of Partitioner to 
>>>>distribute them randomly.
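>>>>
>>>>One possible shape for that partitioner (types and class name assumed for 
>>>>illustration):
>>>>
>>>>    // Spreads NullWritable-keyed records across reduce tasks at random.
>>>>    import java.util.Random;
>>>>    import org.apache.hadoop.io.NullWritable;
>>>>    import org.apache.hadoop.io.Text;
>>>>    import org.apache.hadoop.mapreduce.Partitioner;
>>>>
>>>>    public class RandomPartitioner extends Partitioner<NullWritable, Text> {
>>>>      private final Random random = new Random();
>>>>
>>>>      @Override
>>>>      public int getPartition(NullWritable key, Text value, int numPartitions) {
>>>>        return random.nextInt(numPartitions);
>>>>      }
>>>>    }
>>>>
>>>>It would be registered with job.setPartitionerClass(RandomPartitioner.class).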
>>>>
>>>>
>>>>To expand on this idea, it could be possible to inspect the FileStatus of 
>>>>each input, sum the values of FileStatus.getLen(), and then use that 
>>>>information to make a decision about how many reducers to run (and 
>>>>therefore approximately set a target output file size).  I'm not aware of 
>>>>any built-in or external utilities that do this for you though.
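>>>>
>>>>Roughly, that sizing step could look like the following (the helper name 
>>>>and the 512 MB target are made up for illustration):
>>>>
>>>>    // Total input bytes / target bytes per output file -> reducer count.
>>>>    import java.io.IOException;
>>>>    import org.apache.hadoop.conf.Configuration;
>>>>    import org.apache.hadoop.fs.FileStatus;
>>>>    import org.apache.hadoop.fs.FileSystem;
>>>>    import org.apache.hadoop.fs.Path;
>>>>
>>>>    public class ReducerCountEstimator {
>>>>      public static int estimate(Configuration conf, Path inputDir,
>>>>          long targetBytes) throws IOException {
>>>>        FileSystem fs = inputDir.getFileSystem(conf);
>>>>        long totalBytes = 0;
>>>>        for (FileStatus status : fs.listStatus(inputDir)) {
>>>>          if (!status.isDir()) {
>>>>            totalBytes += status.getLen();
>>>>          }
>>>>        }
>>>>        // Round up so the last partial chunk still gets a reducer.
>>>>        return Math.max(1, (int) ((totalBytes + targetBytes - 1) / targetBytes));
>>>>      }
>>>>    }
>>>>
>>>>The result would then feed job.setNumReduceTasks(), e.g. 
>>>>job.setNumReduceTasks(ReducerCountEstimator.estimate(conf, inputDir, 
>>>>512L * 1024 * 1024)).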
>>>>
>>>>
>>>>Hope this helps,
>>>>--Chris
>>>>
>>>>
>>>>On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <annalah...@gmail.com> wrote:
>>>>
>>>>I would like to be able to resize a set of inputs, already in SequenceFile 
>>>>format, to be larger. 
>>>>>
>>>>>I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not 
>>>>>get what I expected. The outputs were exactly the same as the inputs. 
>>>>>
>>>>>I also tried running a job with an IdentityMapper and IdentityReducer. 
>>>>>Although that approaches a better solution, it still requires that I know 
>>>>>in advance how many reducers I need to get better file sizes. 
>>>>>
>>>>>I was looking at the SequenceFile.Writer constructors and noticed that 
>>>>>there are block size parameters that can be used. Using a writer 
>>>>>constructed with a 512MB block size, there is nothing that splits the 
>>>>>output and I simply get a single file the size of my inputs. 
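>>>>>
>>>>>For reference, the constructor path being described is presumably along 
>>>>>these lines; the key/value classes, path, and writer options are 
>>>>>assumptions, not what was actually run:
>>>>>
>>>>>    // The block size here only sets the HDFS block size of the one output
>>>>>    // file -- nothing in the Writer itself splits the stream into files.
>>>>>    import org.apache.hadoop.conf.Configuration;
>>>>>    import org.apache.hadoop.fs.FileSystem;
>>>>>    import org.apache.hadoop.fs.Path;
>>>>>    import org.apache.hadoop.io.SequenceFile;
>>>>>    import org.apache.hadoop.io.Text;
>>>>>
>>>>>    public class WriterSketch {
>>>>>      public static void main(String[] args) throws Exception {
>>>>>        Configuration conf = new Configuration();
>>>>>        FileSystem fs = FileSystem.get(conf);
>>>>>        SequenceFile.Writer writer = SequenceFile.createWriter(
>>>>>            fs, conf, new Path(args[0]), Text.class, Text.class,
>>>>>            conf.getInt("io.file.buffer.size", 4096),  // buffer size
>>>>>            fs.getDefaultReplication(),                // replication
>>>>>            512L * 1024 * 1024,                        // block size
>>>>>            SequenceFile.CompressionType.NONE, null,   // no compression
>>>>>            null, new SequenceFile.Metadata());
>>>>>        // ... append key/value pairs here ...
>>>>>        writer.close();
>>>>>      }
>>>>>    }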
>>>>>
>>>>>What is the current standard for combining sequence files to create larger 
>>>>>files for map-reduce jobs? I have seen code that tracks what it writes 
>>>>>into the file, but that seems like the long version. I am hoping there is 
>>>>>a shorter path.
>>>>>
>>>>>Thank you.
>>>>>
>>>>>Anna
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>
>
>
