Currently, we've got defined:

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop/hadoop-metadata/cache/</value>
  </property>
In our experiments with SOLR, the intermediate files are so large that they tend to blow out disk space and fail (and, annoyingly, leave their huge failed attempts behind). We've had issues with this in the past, but we're going to have real problems with SOLR if we can't comfortably get more space out of hadoop.tmp.dir somehow.

1) It seems we never set *mapred.system.dir* to anything special, so it's defaulting to ${hadoop.tmp.dir}/mapred/system. Is this a problem? The docs seem to recommend against it when hadoop.tmp.dir has ${user.name} in it, which ours doesn't.

1b) The doc says mapred.system.dir is "the in-HDFS path to shared MapReduce system files." To me, that implies there must be a single path for mapred.system.dir, which effectively forces hadoop.tmp.dir to be a single path as well. Otherwise, one might imagine you could specify multiple paths for hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct interpretation? That is, could hadoop.tmp.dir live on multiple paths/disks if there were more mapping/lookup between mapred.system.dir and hadoop.tmp.dir?

2) IIRC, there's a -D switch for supplying config name/value pairs to individual jobs. Does such a switch exist? Googling for single letters is fruitless. If we had a path on our workers with more space (in our case, another hard disk), could we simply pass that path in as hadoop.tmp.dir for our SOLR jobs, without incurring any consistency issues on future jobs that might use the SOLR output on HDFS?
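For concreteness, this is the kind of invocation we're hoping exists — a sketch only, assuming the job's driver class runs through ToolRunner/GenericOptionsParser (the jar name, class name, and paths below are hypothetical placeholders, not our real setup):

```shell
# Hypothetical: override hadoop.tmp.dir for one job only.
# Note the space after -D (Hadoop generic option), unlike the JVM's -Dkey=value.
# Only honored if the main class parses generic options via ToolRunner.
hadoop jar solr-index.jar com.example.SolrIndexer \
  -D hadoop.tmp.dir=/mnt/bigdisk/hadoop-tmp \
  input_path output_path
```

If that's roughly right, the remaining question is whether a per-job override like this is safe, or whether it confuses later jobs reading the SOLR output from HDFS.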