Re: How to create Output files of about fixed size

2011-12-21 Thread Bejoy Ks
Hi JJ, if you use the default TextInputFormat, it won't do the job, as it generates at least one split per file. In your case there would be a minimum of 78 splits, since there are that many input files, and therefore 78 mappers and the same 78 output files. You need to use CombineFileInputFormat
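
To make the arithmetic behind this advice concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It is a simplified greedy model of packing small files into combined splits, not Hadoop's actual CombineFileInputFormat logic, which also takes node and rack locality into account. The class name `SplitPacking` and its `pack` method are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a simplified model, not Hadoop code.
final class SplitPacking {
    /** Greedily groups file sizes into splits of at most maxSplitBytes each. */
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current); // the last, usually smaller, split
        }
        return splits;
    }
}
```

Under this model, 78 files of 15 MB with a 1 GB limit pack into 2 splits (68 files and 10 files) instead of 78, so a map-only job would write 2 output files rather than 78.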

Re: How to create Output files of about fixed size

2011-12-21 Thread Mapred Learn
Hi Bejoy, This is what I tried initially, but in this case just running the job over 5 GB of input takes more than an hour, as the RecordReader is LineRecordReader and the offset is around 64 MB. It is making performance really bad. Thanks, Anurag Tangri On Wed, Dec 21, 2011 at 12:13 PM, Bejoy Ks

Re: How to create Output files of about fixed size

2011-12-20 Thread Mapred Learn
Hi Shevek/others, I tried this. The first job created about 78 files of 15 MB each. I then ran a second, map-only job with IdentityMapper and -Dmapred.min.split.size=1073741824, but it did not produce output files of 1 GB each; the output was the same as above, i.e. 78 files of 15 MB each. Is there a way

Re: How to create Output files of about fixed size

2011-10-28 Thread Mapred Learn
Hi Shevek, Thanks for the explanation! Can you point me to some documentation for specifying the size in an output format? If I set the size to 200 MB, would it roll over after 200 MB per split or overall? I mean, would I end up with a 200 MB and a 50 MB file from the 1st mapper and then, say, a 200 MB and a 10 MB

Fwd: How to create Output files of about fixed size

2011-10-26 Thread Mapred Learn
Hi, I am trying to create output files of a fixed size by using: -Dmapred.max.split.size=6442450812 (6 GB). The problem is that the input data size and metadata vary, so I have to adjust the above value manually to achieve a fixed size. Is there a way I can programmatically
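
As a small aside on the byte values passed to these flags, the sketch below computes them instead of typing them by hand. It assumes binary units (1 GB = 1024^3 bytes); the class and method names are hypothetical. The arithmetic uses `long`, since 6 * 1024 * 1024 * 1024 overflows a Java `int`.

```java
// Hypothetical helper for building split-size values; not part of Hadoop.
final class SplitSizes {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    /** Returns n gigabytes expressed in bytes, computed in long arithmetic. */
    static long gigabytes(long n) {
        return n * GB;
    }

    /** Formats the command-line flag used in the message above. */
    static String maxSplitFlag(long bytes) {
        return "-Dmapred.max.split.size=" + bytes;
    }
}
```

With these definitions, `SplitSizes.gigabytes(1)` is 1073741824 and `SplitSizes.gigabytes(6)` is 6442450944, which is slightly different from the hand-typed value quoted above.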

Re: How to create Output files of about fixed size

2011-10-26 Thread Shevek
You can control the input to a computer program, but not (arbitrarily) how much output it generates. The only way to generate output files of a fixed size is to write a custom output format which shifts to a new filename every time that size is exceeded, but you will still get some small bits left
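
The rollover idea described here can be sketched in plain Java, outside Hadoop. This is only an illustration of the technique under stated assumptions: `RollingWriter`, its `part-NNNNN` file naming, and the line-oriented records are all hypothetical, and a real Hadoop job would need this logic wrapped in a custom OutputFormat and RecordWriter, which is not shown.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: starts a new part file once the current one has reached maxBytes.
final class RollingWriter implements AutoCloseable {
    private final Path dir;
    private final long maxBytes;
    private final List<Path> parts = new ArrayList<>();
    private BufferedWriter current;
    private long written;

    RollingWriter(Path dir, long maxBytes) {
        this.dir = dir;
        this.maxBytes = maxBytes;
    }

    void writeLine(String line) {
        try {
            // Roll before writing if no file is open yet or the threshold was reached.
            if (current == null || written >= maxBytes) {
                roll();
            }
            current.write(line);
            current.write('\n');
            written += line.getBytes(StandardCharsets.UTF_8).length + 1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void roll() throws IOException {
        if (current != null) {
            current.close();
        }
        Path next = dir.resolve(String.format("part-%05d", parts.size()));
        parts.add(next);
        current = Files.newBufferedWriter(next, StandardCharsets.UTF_8);
        written = 0;
    }

    List<Path> parts() {
        return parts;
    }

    @Override
    public void close() {
        try {
            if (current != null) {
                current.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because records are never split, a file can overshoot the threshold by up to one record, and the last file of each writer holds whatever is left over. Each writer rolls independently, so every task would produce its own small remainder file.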