Thanks to the kind answers to my earlier questions and more reading in the 
elephant book, I now understand that mapper tasks write their partitioned 
output to local files, which are then served to reducers over HTTP:

"The output file's partitions are made available to the reducers over HTTP. The 
maximum number of worker threads used to serve the file partitions is 
controlled by the tasktracker.http.threads property; this setting is per 
tasktracker, not per map task slot. The default of 40 may need to be increased 
for large clusters running large jobs. In MapReduce 2, this property is not 
applicable because the maximum number of threads used is set automatically 
based on the number of processors on the machine. (MapReduce 2 uses Netty, 
which by default allows up to twice as many threads as there are processors.)"
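
For reference, my understanding is that on MapReduce 1 the setting described 
in that passage would be raised in mapred-site.xml on each tasktracker, 
roughly as sketched below. The value 80 is only an illustrative number I 
picked, not a recommendation, and the file placement is my assumption rather 
than something the quoted passage states:

```xml
<!-- mapred-site.xml (MapReduce 1 only; per the book, not applicable in MapReduce 2) -->
<property>
  <name>tasktracker.http.threads</name>
  <!-- Default is 40 per tasktracker; 80 is a hypothetical example value. -->
  <value>80</value>
</property>
```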

My question is: for a custom (non-MR) application running under YARN, how 
would I arrange for my application tasks' output data to be served over HTTP? 
Is there an API to control this, or are there predefined local directories 
whose contents are served automatically? And once I am finished with the 
temporary data, how do I request that the files be removed?

Thanks
John
