Hi JJ
If you use the default TextInputFormat, it won't do the job, as it
generates at least one split per file. So in your case there would be a
minimum of 78 splits, since there are that many input files, hence 78
mappers and the same 78 output files. You need to use
CombineFileInputFormat
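A hedged sketch of how that might look with Hadoop streaming, assuming a Hadoop release that ships a concrete CombineTextInputFormat (the class and property names below are assumptions; older releases only provide the abstract CombineFileInputFormat, which you must subclass):

```shell
# Combine many small input files into ~1 GB splits (one mapper each).
# Class name and property names are assumptions; check your Hadoop release.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -Dmapred.max.split.size=1073741824 \
  -inputformat org.apache.hadoop.mapred.lib.CombineTextInputFormat \
  -input /data/in \
  -output /data/out \
  -mapper cat \
  -numReduceTasks 0
```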
Hi Bejoy,
This is what I tried initially, but in this case just running the job over 5 GB
of input takes more than an hour, as the RecordReader is LineRecordReader and
the split offset is around 64 MB. It's making performance really bad.
Thanks,
Anurag Tangri
On Wed, Dec 21, 2011 at 12:13 PM, Bejoy Ks
Hi Shevek/others,
I tried this.
The first job created about 78 files of 15 MB each.
I tried a second map-only job with IdentityMapper and
-Dmapred.min.split.size=1073741824, but it did not produce 1 GB output
files; I got the same output as above, i.e. 78 files of 15 MB each.
Is there a way
Hi Shevek,
Thanks for the explanation !
Can you point me to some documentation for specifying the size in an output
format?
If I set the size to 200 MB, would it apply per split or
overall?
I mean, would I end up with a 200 MB and a 50 MB file from the 1st mapper and
then, say, a 200 MB and a 10 MB file
Hi,
I am trying to create output files of a fixed size by using:
-Dmapred.max.split.size=6442450812 (6 GB)
But the problem is that the input data size and metadata vary, and I have
to adjust the above value manually to achieve a fixed size.
Is there a way I can programmatically
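One way to avoid the manual tuning would be to compute the split size from the total input size and a target output-file size. A minimal sketch of that arithmetic in plain Java (the call to obtain the total input size, e.g. via FileSystem.getContentSummary, is left out; the numbers in main mirror this thread's 78 x 15 MB case and are only illustrative):

```java
public class SplitSizeCalc {
    // Desired output-file size drives the split size: in a map-only job,
    // one mapper per split means one output file per split.
    static long splitSizeFor(long totalInputBytes, long targetFileBytes) {
        // How many output files at the target size (rounded up, at least 1).
        long files = Math.max(1, (totalInputBytes + targetFileBytes - 1) / targetFileBytes);
        // Round the split size up so we never produce more than `files` splits.
        return (totalInputBytes + files - 1) / files;
    }

    public static void main(String[] args) {
        long total = 78L * 15 * 1024 * 1024;   // ~78 files of 15 MB, as in this thread
        long target = 1024L * 1024 * 1024;     // 1 GB target output files
        System.out.println(splitSizeFor(total, target));
    }
}
```

The computed value would then be passed as mapred.max.split.size (together with a combining input format, since per-file splitting otherwise still applies).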
You can control the input to a computer program, but not (arbitrarily) how
much output it generates. The only way to generate output files of a fixed
size is to write a custom output format which shifts to a new filename every
time that size is exceeded, but you will still get some small bits left
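A minimal sketch of that rolling logic in plain java.io (this is not the Hadoop RecordWriter API itself; the part-NNNNN naming and the byte-counting scheme here are assumptions, shown only to illustrate "shift to a new file when the cap is exceeded"):

```java
import java.io.*;
import java.nio.file.*;

// Sketch of the size-capped "roll to a new file" logic a custom Hadoop
// output format's RecordWriter would implement; plain java.io so it stands alone.
public class RollingWriter implements Closeable {
    private final Path dir;
    private final long maxBytes;
    private long written = 0;
    private int part = 0;
    private BufferedWriter out;

    public RollingWriter(Path dir, long maxBytes) throws IOException {
        this.dir = dir;
        this.maxBytes = maxBytes;
        roll();
    }

    // Open the next part-NNNNN file, mirroring MapReduce's output naming.
    private void roll() throws IOException {
        if (out != null) out.close();
        out = Files.newBufferedWriter(dir.resolve(String.format("part-%05d", part++)));
        written = 0;
    }

    public void write(String record) throws IOException {
        byte[] bytes = (record + "\n").getBytes("UTF-8");
        // Roll *before* exceeding the cap, so each file stays under maxBytes.
        // A single record larger than the cap still goes into one file whole.
        if (written > 0 && written + bytes.length > maxBytes) roll();
        out.write(record);
        out.newLine();
        written += bytes.length;
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

Note this is exactly why "some small bits" remain: the last file holds whatever is left over after the final roll.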