I would like to take a set of inputs that are already in SequenceFile
format and repack them into larger files.

I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
get what I expected. The outputs were exactly the same as the inputs.

I also tried running a job with an IdentityMapper and IdentityReducer.
Although that comes closer to what I want, it still requires that I know
in advance how many reducers I need in order to get reasonable file sizes.

I was also looking at the SequenceFile.Writer constructors and noticed that
some of them take a block size parameter. With a writer constructed using a
512MB block size, nothing splits the output, so I simply end up with a single
file the size of all my inputs combined.
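The manual version I can see is something like the sketch below: read each
input with a SequenceFile.Reader and append everything into one writer (the
input directory and output path are made up for the example). Nothing here
splits the output either, which is why I end up tracking sizes by hand.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class Concat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inDir = new Path(args[0]);
    Path out = new Path(args[1]);

    SequenceFile.Writer writer = null;
    for (FileStatus stat : fs.listStatus(inDir)) {
      SequenceFile.Reader reader =
          new SequenceFile.Reader(fs, stat.getPath(), conf);
      Writable key = (Writable)
          ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable val = (Writable)
          ReflectionUtils.newInstance(reader.getValueClass(), conf);
      if (writer == null) {
        // Create one output writer using the key/value classes of the inputs.
        writer = SequenceFile.createWriter(fs, conf, out,
            reader.getKeyClass(), reader.getValueClass());
      }
      // Copy every record from this input into the single output file.
      while (reader.next(key, val)) {
        writer.append(key, val);
      }
      reader.close();
    }
    if (writer != null) {
      writer.close();
    }
  }
}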

What is the current standard practice for combining sequence files into
larger files for map-reduce jobs? I have seen code that tracks how much it
has written into each output file, but that seems like the long way around.
I am hoping there is a shorter path.

Thank you.

Anna
