I find that in most cases where I solve Hadoop problems the solution consists of several jobs chained together. When the problem is solved, the result is almost never wanted in the form of a collection of files with names like part-r-00000; usually even the boundaries of the files have little to do with Hadoop. A good approach seems to be to run a last Hadoop job to convert the data into files that others can use.
I am currently working on a problem that can be imagined like this: I have a large number of 'customers', and when the job is done the next stage wants a series of files containing the customers living in each county, one file per county, in, say, CSV format. If we use the county name as the key, one reducer will receive all of the customers in that county. The reducer opens an HDFS file named for the county, with the task attempt ID and .tmp appended; when the key is finished, the file is renamed to the county name with .csv appended (a rough sketch follows below). My questions:

1) The rename is a small concurrency sin, since multiple attempts may try the same rename at the same time.
   a) It is unclear whether a rename on an HDFS file system succeeds if the destination path already exists - does it?
   b) Does a failed rename throw an exception or simply return false?

2) When one attempt succeeds, the others will be killed. These killed tasks may have open temporary files that should be deleted. Is there code that will be called as a task is killed - is cleanup() invoked, or is there some kill-cleanup hook that can delete temporary files?

Is there a better way, assuming files must be created per key rather than per task?
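For concreteness, here is a minimal sketch of the kind of reducer I have in mind (new mapreduce API). The output directory property "county.output.dir", the file naming, and the CSV formatting are just placeholders I made up for illustration, and the comments about rename() and cleanup() reflect my current understanding rather than documented guarantees:

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch only: writes one CSV per county directly to HDFS, one temp file
    // per (county, task attempt), renamed into place when the key is done.
    public class CountyCsvReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

        private FileSystem fs;
        private Path outputDir;
        private final List<Path> tempFiles = new ArrayList<>(); // temp files this attempt still owns

        @Override
        protected void setup(Context context) throws IOException {
            fs = FileSystem.get(context.getConfiguration());
            // "county.output.dir" is an assumed, job-specific property, not a Hadoop one.
            outputDir = new Path(context.getConfiguration().get("county.output.dir", "/tmp/counties"));
        }

        @Override
        protected void reduce(Text county, Iterable<Text> customers, Context context)
                throws IOException, InterruptedException {
            String attempt = context.getTaskAttemptID().toString();
            Path tmp = new Path(outputDir, county.toString() + "-" + attempt + ".tmp");
            Path dest = new Path(outputDir, county.toString() + ".csv");
            tempFiles.add(tmp);

            // Write all customers for this county to the attempt-specific temp file.
            try (PrintWriter out = new PrintWriter(fs.create(tmp, true))) {
                for (Text customer : customers) {
                    out.println(customer.toString()); // assume the value is already a CSV line
                }
            }

            // My understanding is that FileSystem.rename() returns false (rather than
            // throwing) when the destination already exists, so a false return is
            // treated here as "another attempt won the race" - this is exactly the
            // behaviour I am asking about in 1a/1b.
            if (!fs.rename(tmp, dest)) {
                if (fs.exists(dest)) {
                    fs.delete(tmp, false); // lost the race; discard our copy
                } else {
                    throw new IOException("rename failed for " + tmp);
                }
            }
            tempFiles.remove(tmp);
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            // Best-effort removal of any temp files still lying around. I am not sure
            // cleanup() is guaranteed to run for a task killed by the framework,
            // which is question 2 above.
            for (Path tmp : tempFiles) {
                if (fs.exists(tmp)) {
                    fs.delete(tmp, false);
                }
            }
        }
    }

The cleanup() part is the piece I am least confident about; if killed attempts never reach it, some separate sweep of leftover *.tmp files would presumably be needed.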