Bryan,

Not sure you should be concerned with whether the output is on local disk vs. HDFS. I wouldn't expect much of a performance difference if you are doing streaming output (append) in both cases. Hadoop already uses local storage wherever possible (including for the task working directories, as far as I know). I've never had performance problems with side-effect files, as long as the correct setup is used.
Definitely, if multiple mounts are available locally where the tasks are running, you can add a comma-delimited list of them to mapreduce.cluster.local.dir in mapred-site.xml on those machines:
http://hadoop.apache.org/common/docs/current/cluster_setup.html#mapred-site.xml

Theoretically you can use the methods I listed below to create unique files/paths under /tmp or any other mount point you wish. However, it is much better to let Hadoop manage where the files are stored (i.e. use the work directory given to you). If you add multiple paths to mapreduce.cluster.local.dir, then Hadoop will spread the I/O from multiple mappers/reducers across these paths. Likewise, you can mount a RAID 0 (stripe) of multiple drives to get the same effect. You can use a single RAID 0 to keep the mapred-site.xml uniform. RAID 0 is fine, since speculative execution takes care of things if a disk fails.

It would be helpful to know your use case, since the primary option is normally to create multiple outputs from a reducer:
http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

Most likely you should try that before going into the realm of side-effect files (or messing with local temp on the task nodes). Try the multiple outputs if you are dealing with streaming data. If you absolutely cannot get them to work, then you may have to look at the other, more complex options.

Cheers,
Matt

On 5/4/11 1:07 PM, "Bryan Keller" <brya...@gmail.com> wrote:

> Am I mistaken, or are side-effect files on HDFS? I need my temp files to be on
> the local filesystem. Also, the Java working directory is not the reducer's
> local processing directory, so "./tmp" doesn't get me what I'm after. As it
> stands now I'm using java.io.tmpdir, which is not a long-term solution for me.
> I am looking to use the reducer's task-specific local directory, which should
> be balanced across my local drives.
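[A minimal mapred-site.xml fragment along the lines described above; the mount points /disk1–/disk3 are placeholders, not paths from this thread:]

```xml
<!-- mapred-site.xml: spread task-local I/O across several mounts.
     The /diskN paths are hypothetical; substitute your cluster's mounts. -->
<property>
  <name>mapreduce.cluster.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local</value>
</property>
```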
>
> On May 4, 2011, at 12:31 PM, Matt Pouttu-Clarke wrote:
>
>> Hi Bryan,
>>
>> These are called side-effect files, and I use them extensively:
>>
>> O'Reilly's Hadoop, 2nd Edition, p. 187
>> Pro Hadoop, p. 279
>>
>> You get the path to save the file(s) using:
>> http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath%28org.apache.hadoop.mapred.JobConf%29
>>
>> The output committer moves these files from the work directory to the output
>> directory when the task completes. That way you don't have duplicate files
>> due to speculative execution. You should also generate a unique name for
>> each of your output files by using this function to prevent file name
>> collisions:
>> http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/FileOutputFormat.html#getUniqueName%28org.apache.hadoop.mapred.JobConf,%20java.lang.String%29
>>
>> Hope this helps,
>> Matt
>>
>> On 5/4/11 12:18 PM, "Bryan Keller" <brya...@gmail.com> wrote:
>>
>>> Right. What I am struggling with is how to retrieve the path/drive that the
>>> reducer is using, so I can use the same path for local temp files.
>>>
>>> On May 4, 2011, at 9:03 AM, Robert Evans wrote:
>>>
>>>> Bryan,
>>>>
>>>> I believe that map/reduce gives you a single drive to write to so that your
>>>> reducer has less of an impact on other reducers/mappers running on the same
>>>> box. If you want to write to more drives, I thought the idea would then be
>>>> to increase the number of reducers you have and let mapred assign each to a
>>>> drive to use, instead of having one reducer eating up I/O bandwidth from
>>>> all of the drives.
>>>>
>>>> --Bobby Evans
>>>>
>>>> On 5/4/11 7:11 AM, "Bryan Keller" <brya...@gmail.com> wrote:
>>>>
>>>> I too am looking for the best place to put local temp files I create during
>>>> reduce processing.
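[The unique-name scheme behind FileOutputFormat.getUniqueName can be sketched in plain Java. This assumes the convention is "<name>-<task type>-<partition padded to 5 digits>"; the real method derives the task type and partition from the task attempt ID in the JobConf, which a standalone sketch cannot do:]

```java
// Sketch of the side-effect-file naming convention (assumption: the pattern is
// "<name>-<m|r>-<zero-padded partition>"; the real getUniqueName reads the
// task attempt ID from the JobConf instead of taking explicit arguments).
public class UniqueNameSketch {
    public static String uniqueName(String name, char taskType, int partition) {
        return String.format("%s-%c-%05d", name, taskType, partition);
    }

    public static void main(String[] args) {
        // e.g. the 4th reducer (partition 3) writing a file named "my-file"
        System.out.println(uniqueName("my-file", 'r', 3)); // my-file-r-00003
    }
}
```

Combined with getWorkOutputPath, each task writes under its own attempt directory with a collision-free name, and the output committer promotes only the winning speculative attempt's files.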
I am hoping there is a variable or property someplace that
>>>> defines a per-reducer temp directory. The "mapred.child.tmp" property is by
>>>> default simply the relative directory "./tmp", so it isn't useful on its
>>>> own.
>>>>
>>>> I have 5 drives being used in "mapred.local.dir", and I was hoping to use
>>>> them all for writing temp files, rather than specifying a single temp
>>>> directory that all my reducers use.
>>>>
>>>> On Apr 9, 2011, at 2:40 AM, Harsh J wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> On Tue, Apr 5, 2011 at 2:53 AM, W.P. McNeill <bill...@gmail.com> wrote:
>>>>>> If I try:
>>>>>>
>>>>>>   storePath = FileOutputFormat.getPathForWorkFile(context, "my-file", ".seq");
>>>>>>   writer = SequenceFile.createWriter(FileSystem.getLocal(configuration),
>>>>>>       configuration, storePath, IntWritable.class, itemClass);
>>>>>>   ...
>>>>>>   reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
>>>>>>       storePath, configuration);
>>>>>>
>>>>>> I get an exception about a mismatch in file systems when trying to read
>>>>>> from the file.
>>>>>>
>>>>>> Alternately, if I try:
>>>>>>
>>>>>>   storePath = new Path(SequenceFileOutputFormat.getUniqueFile(context,
>>>>>>       "my-file", ".seq"));
>>>>>>   writer = SequenceFile.createWriter(FileSystem.get(configuration),
>>>>>>       configuration, storePath, IntWritable.class, itemClass);
>>>>>>   ...
>>>>>>   reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
>>>>>>       storePath, configuration);
>>>>>
>>>>> FileOutputFormat.getPathForWorkFile will give back HDFS paths. And
>>>>> since you are looking to create local temporary files to be used only
>>>>> by the task within itself, you shouldn't really worry about unique
>>>>> filenames (stuff can go wrong).
>>>>>
>>>>> You're looking for the tmp/ directory locally created in the FS where
>>>>> the Task is running (at ${mapred.child.tmp}, which defaults to ./tmp).
>>>>> You can create a regular file there using vanilla Java APIs for files,
>>>>> or using RawLocalFS + your own created Path (not derived via
>>>>> OutputFormat/etc.):
>>>>>
>>>>>>   storePath = new Path(context.getConfiguration().get("mapred.child.tmp"),
>>>>>>       "my-file.seq");
>>>>>>   writer = SequenceFile.createWriter(FileSystem.getLocal(configuration),
>>>>>>       configuration, storePath, IntWritable.class, itemClass);
>>>>>>   ...
>>>>>>   reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
>>>>>>       storePath, configuration);
>>>>>
>>>>> The above should work, I think (haven't tried, but the idea is to use
>>>>> mapred.child.tmp).
>>>>>
>>>>> Also see:
>>>>> http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Directory+Structure
>>>>>
>>>>> --
>>>>> Harsh J
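[The resolution of mapred.child.tmp that the thread describes, sketched in plain Java. This is not Hadoop API code: System.getProperty stands in for reading the Hadoop Configuration, and the working-directory argument stands in for the task attempt's cwd. It only illustrates the rule "a relative value like ./tmp is resolved against the task's working directory":]

```java
import java.io.File;
import java.io.IOException;

// Sketch: resolve the task-local tmp directory as described above.
// Assumption: "mapred.child.tmp" is read here via a system property as a
// stand-in for the Hadoop Configuration a real task would consult.
public class ChildTmpSketch {
    public static File resolveChildTmp(String workingDir) {
        String tmp = System.getProperty("mapred.child.tmp", "./tmp");
        File tmpDir = new File(tmp);
        if (!tmpDir.isAbsolute()) {
            // relative values are taken under the task's working directory
            tmpDir = new File(workingDir, tmp);
        }
        tmpDir.mkdirs();
        return tmpDir;
    }

    public static void main(String[] args) throws IOException {
        // stand-in for the task attempt's working directory
        File dir = resolveChildTmp(System.getProperty("java.io.tmpdir"));
        File scratch = File.createTempFile("my-file", ".seq", dir);
        System.out.println(scratch.getAbsolutePath());
        scratch.delete();
    }
}
```

Because the framework sets the task's working directory under one of the mapreduce.cluster.local.dir mounts, files created this way land on the same drive the task was assigned, which is what the original question was after.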