I would like to take a set of inputs that are already in SequenceFile format and combine them into larger files.
I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not get what I expected: the outputs were exactly the same as the inputs.

I also tried running a job with an IdentityMapper and IdentityReducer (a rough sketch of the driver is at the end of this message). Although that comes closer to a solution, it still requires that I know in advance how many reducers I need in order to get reasonable file sizes.

I also looked at the SequenceFile.Writer constructors and noticed that some of them take a block size parameter. When I construct a writer with a 512MB block size, though, nothing splits the output, and I simply get a single file the size of all my inputs combined (that experiment is sketched below as well).

What is the current standard for combining sequence files to create larger files for map-reduce jobs? I have seen code that tracks how much it has written into the file and rolls over to a new one, but that seems like the long version. I am hoping there is a shorter path.

Thank you.

Anna
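P.S. In case the details help, this is roughly what my identity job driver looked like. I am on the old mapred API; the paths, the BytesWritable key/value classes, and the reducer count of 10 are placeholders for whatever the real data uses:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class MergeWithIdentityJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MergeWithIdentityJob.class);
        conf.setJobName("merge-sequence-files");

        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);

        // These must match whatever the input files actually contain;
        // BytesWritable is just a stand-in here.
        conf.setOutputKeyClass(BytesWritable.class);
        conf.setOutputValueClass(BytesWritable.class);

        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        // This is the part I would like to avoid: the number of output
        // files (and therefore their sizes) is pinned by the reducer
        // count I have to guess here.
        conf.setNumReduceTasks(10);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }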
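And here is the SequenceFile.Writer experiment, more or less. The 512MB figure is the one from my test; the paths, the BytesWritable classes, and the buffer size lookup are stand-ins, and I am using the ten-argument constructor that takes a block size, if I am reading the signatures correctly. Whatever block size I pass, everything still lands in one output file:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class WriterBlockSizeTest {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // 512MB block size on the output file.
        long blockSize = 512L * 1024 * 1024;

        SequenceFile.Writer writer = new SequenceFile.Writer(
            fs, conf, new Path(args[1]),
            BytesWritable.class, BytesWritable.class,
            conf.getInt("io.file.buffer.size", 4096),
            fs.getDefaultReplication(),
            blockSize,
            null,                          // no Progressable
            new SequenceFile.Metadata());

        BytesWritable key = new BytesWritable();
        BytesWritable value = new BytesWritable();

        // Copy every record from every input into the one writer; the
        // result is a single file the size of all the inputs combined.
        for (FileStatus in : fs.listStatus(new Path(args[0]))) {
          SequenceFile.Reader reader =
              new SequenceFile.Reader(fs, in.getPath(), conf);
          while (reader.next(key, value)) {
            writer.append(key, value);
          }
          reader.close();
        }
        writer.close();
      }
    }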
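Finally, for reference, this is the sort of thing I meant by "the long version": tracking how much has been written via Writer.getLength() and rolling to a new output file past a threshold. The 512MB target and the part-file naming are made up for illustration; I would rather not maintain something like this if a standard tool already exists:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class RollingMerge {
      // Target size per output file; 512MB is an arbitrary example.
      private static final long TARGET_SIZE = 512L * 1024 * 1024;

      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path outDir = new Path(args[1]);

        int part = 0;
        SequenceFile.Writer writer = open(fs, conf, outDir, part++);

        BytesWritable key = new BytesWritable();
        BytesWritable value = new BytesWritable();

        for (FileStatus in : fs.listStatus(new Path(args[0]))) {
          SequenceFile.Reader reader =
              new SequenceFile.Reader(fs, in.getPath(), conf);
          while (reader.next(key, value)) {
            // getLength() reports the current size of the file being
            // written, so we can roll over once we pass the target.
            if (writer.getLength() >= TARGET_SIZE) {
              writer.close();
              writer = open(fs, conf, outDir, part++);
            }
            writer.append(key, value);
          }
          reader.close();
        }
        writer.close();
      }

      private static SequenceFile.Writer open(FileSystem fs, Configuration conf,
                                              Path dir, int part) throws IOException {
        return SequenceFile.createWriter(fs, conf,
            new Path(dir, String.format("part-%05d", part)),
            BytesWritable.class, BytesWritable.class);
      }
    }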