Currently, we have the following defined:
  <property>
     <name>hadoop.tmp.dir</name>
     <value>/hadoop/hadoop-metadata/cache/</value>
  </property>

In our experiments with SOLR, the intermediate files are so large that they
tend to fill up the disk, and the jobs fail (annoyingly leaving behind the
huge files from their failed attempts). We've run into this in the past, but
with SOLR we'll have real problems unless we can comfortably get more space
out of hadoop.tmp.dir somehow.

1) It seems we never set *mapred.system.dir* to anything special, so it's
defaulting to ${hadoop.tmp.dir}/mapred/system.
Is this a problem? The docs seem to recommend against it when hadoop.tmp.dir
has ${user.name} in it, which ours doesn't.
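
For concreteness, here is roughly what I imagine pinning it explicitly would
look like. The value is a made-up placeholder, not something we have
deployed, and I'm assuming it would be read as an HDFS path:
  <property>
     <name>mapred.system.dir</name>
     <!-- hypothetical HDFS path, for illustration only -->
     <value>/hadoop/mapred/system</value>
  </property>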

1b) The doc says mapred.system.dir is "the in-HDFS path to shared MapReduce
system files." To me, that means there must be a single path for
mapred.system.dir, which sort of forces hadoop.tmp.dir to be a single path.
Otherwise, one might imagine that you could specify multiple paths for
hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct
interpretation? That is, could hadoop.tmp.dir live on multiple paths/disks
only if there were more mapping/lookup between mapred.system.dir and
hadoop.tmp.dir?
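
To illustrate what I mean by "like you can for dfs.data.dir": that property
takes a comma-separated list of directories, along these lines (the disk
paths here are invented for the example, not our actual layout):
  <property>
     <name>dfs.data.dir</name>
     <!-- hypothetical local paths, one per physical disk -->
     <value>/disk1/dfs/data,/disk2/dfs/data</value>
  </property>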

2) IIRC, there's a -D switch for supplying config name/value pairs to
individual jobs. Does such a switch exist? Googling for single letters is
fruitless. If we had a path on our workers with more space (in our case,
another hard disk), could we simply pass that path in as hadoop.tmp.dir for
our SOLR jobs, without incurring any consistency issues on future jobs that
might use the SOLR output on HDFS?
