I have a question about how input files are split before they are handed out to map functions. Say I have an input directory containing 1000 files whose total size is 100 MB, I have 10 machines in my cluster, and I have configured mapred.map.tasks to 10 in hadoop-site.xml (the relevant entry is sketched below).

1. With this configuration, is there a way to know what size each split will be?

2. Does the split size depend on how many files are in the input directory? What if I have only 10 files in the input directory, but the total size of all the files is still 100 MB? Will that affect the split size?

Thanks.
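P.S. For reference, here is roughly what the mapred.map.tasks entry in my hadoop-site.xml looks like (a minimal sketch; any other properties in the file are omitted):

<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>10</value>
    <!-- This is only a hint: the framework decides the actual number
         of map tasks from the input splits it computes. -->
  </property>
</configuration>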