Bryan,

I'm not sure you should be concerned with whether the output is on the
local filesystem vs. HDFS.  I wouldn't think there would be much of a
performance difference if you are doing streaming output (append) in both
cases.  Hadoop already uses local storage wherever possible (including for
the task working directories, as far as I know).  I've never had
performance problems with side-effect files, as long as the correct setup
is used.

If multiple mounts are available locally on the machines where the tasks
are running, you can definitely add a comma-delimited list of those paths
to mapreduce.cluster.local.dir in the mapred-site.xml of those machines:
http://hadoop.apache.org/common/docs/current/cluster_setup.html#mapred-site.xml
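
For example, a minimal sketch of that entry in mapred-site.xml (the mount
points here are purely illustrative, not from your cluster):

  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/mnt/disk1/mapred/local,/mnt/disk2/mapred/local</value>
  </property>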

Theoretically you can use the methods I listed below to create unique
files/paths under /tmp or any other mount point you wish.  However, it is
much better to let Hadoop manage where the files are stored (i.e., use the
work directory given to you).
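
For example, here is a minimal sketch of the side-effect file pattern with
the old mapred API (the class name and the "side-data" base name are
illustrative, not from your job):

  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class SideEffectFiles {
    public static FSDataOutputStream openSideFile(JobConf conf)
        throws IOException {
      // Task-attempt work directory; the output committer promotes its
      // contents to the final output directory when the task commits.
      Path workDir = FileOutputFormat.getWorkOutputPath(conf);
      // A unique per-task name prevents collisions when speculative
      // execution runs duplicate attempts of the same task.
      Path sideFile = new Path(workDir,
          FileOutputFormat.getUniqueName(conf, "side-data"));
      return sideFile.getFileSystem(conf).create(sideFile);
    }
  }

Anything written through that stream lands in the job's output directory
only after the task succeeds, so failed or speculative attempts leave no
duplicates behind.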

If you add multiple paths to mapreduce.cluster.local.dir, then Hadoop will
spread the I/O from multiple mappers/reducers across those paths.
Likewise, you can mount a RAID 0 (striped) array of multiple drives to get
the same effect, and a single RAID 0 volume keeps the mapred-site.xml
uniform across the cluster.  RAID 0 is fine here since speculative
execution takes care of re-running tasks if a disk fails.

It would be helpful to know your use case, since the primary option is
normally to create multiple outputs from a reducer via MultipleOutputs:
http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
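
If it helps, a minimal sketch of that approach against the new mapreduce
API follows; the named output "summary" and the key/value types are
assumptions for illustration:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

  public class MultiOutputReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    // In the driver, register the named output first:
    //   MultipleOutputs.addNamedOutput(job, "summary",
    //       TextOutputFormat.class, Text.class, IntWritable.class);
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
      mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));         // regular output
      mos.write("summary", key, new IntWritable(sum));  // extra named output
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      mos.close();  // flush and close the named outputs
    }
  }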

Most likely you should try that before going into the realm of side-effect
files (or messing with local temp directories on the task nodes).  Try the
multiple outputs first if you are dealing with streaming data.  If you
absolutely cannot get it to work, then you may have to cross-check the
other, more complex options.

Cheers,
Matt



On 5/4/11 1:07 PM, "Bryan Keller" <brya...@gmail.com> wrote:

> Am I mistaken or are side-effect files on HDFS? I need my temp files to be on
> the local filesystem. Also, the java working directory is not the reducer's
> local processing directory, thus "./tmp" doesn't get me what I'm after. As it
> stands now I'm using java.io.tmpdir which is not a long-term solution for me.
> I am looking to use the reducer's task-specific local directory which should
> be balanced across my local drives.
> 
> On May 4, 2011, at 12:31 PM, Matt Pouttu-Clarke wrote:
> 
>> Hi Bryan,
>> 
>> These are called side-effect files, and I use them extensively:
>> 
>> O'Reilly Hadoop, 2nd Edition, p. 187
>> Pro Hadoop, p. 279
>> 
>> You get the path under which to save the file(s) using:
>> http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath%28org.apache.hadoop.mapred.JobConf%29
>> 
>> The output committer moves these files from the work directory to the output
>> directory when the task completes.  That way you don't have duplicate files
>> due to speculative execution.  You should also generate a unique name for
>> each of your output files by using this function to prevent file name
>> collisions:
>> http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/FileOutputFormat.html#getUniqueName%28org.apache.hadoop.mapred.JobConf,%20java.lang.String%29
>> 
>> Hope this helps,
>> Matt
>> 
>> 
>> On 5/4/11 12:18 PM, "Bryan Keller" <brya...@gmail.com> wrote:
>> 
>>> Right. What I am struggling with is how to retrieve the path/drive that the
>>> reducer is using, so I can use the same path for local temp files.
>>> 
>>> On May 4, 2011, at 9:03 AM, Robert Evans wrote:
>>> 
>>>> Bryan,
>>>> 
>>>> I believe that map/reduce gives you a single drive to write to so that
>>>> your reducer has less of an impact on other reducers/mappers running on
>>>> the same box.  If you want to write to more drives, I thought the idea
>>>> would then be to increase the number of reducers you have and let mapred
>>>> assign each to a drive to use, instead of having one reducer eating up
>>>> I/O bandwidth from all of the drives.
>>>> 
>>>> --Bobby Evans
>>>> 
>>>> On 5/4/11 7:11 AM, "Bryan Keller" <brya...@gmail.com> wrote:
>>>> 
>>>> I too am looking for the best place to put local temp files I create
>>>> during reduce processing.  I am hoping there is a variable or property
>>>> someplace that defines a per-reducer temp directory.  The
>>>> "mapred.child.tmp" property is by default simply the relative directory
>>>> "./tmp", so it isn't useful on its own.
>>>> 
>>>> I have 5 drives being used in "mapred.local.dir", and I was hoping to use
>>>> them all for writing temp files, rather than specifying a single temp
>>>> directory that all my reducers use.
>>>> 
>>>> 
>>>> On Apr 9, 2011, at 2:40 AM, Harsh J wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> On Tue, Apr 5, 2011 at 2:53 AM, W.P. McNeill <bill...@gmail.com> wrote:
>>>>>> If I try:
>>>>>> 
>>>>>>    storePath = FileOutputFormat.getPathForWorkFile(context,
>>>>>>        "my-file", ".seq");
>>>>>>    writer = SequenceFile.createWriter(FileSystem.getLocal(configuration),
>>>>>>        configuration, storePath, IntWritable.class, itemClass);
>>>>>>    ...
>>>>>>    reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
>>>>>>        storePath, configuration);
>>>>>> 
>>>>>> I get an exception about a mismatch in file systems when trying to
>>>>>> read from the file.
>>>>>> 
>>>>>> Alternately if I try:
>>>>>> 
>>>>>>    storePath = new Path(
>>>>>>        SequenceFileOutputFormat.getUniqueFile(context, "my-file", ".seq"));
>>>>>>    writer = SequenceFile.createWriter(FileSystem.get(configuration),
>>>>>>        configuration, storePath, IntWritable.class, itemClass);
>>>>>>    ...
>>>>>>    reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
>>>>>>        storePath, configuration);
>>>>> 
>>>>> FileOutputFormat.getPathForWorkFile will give back HDFS paths. And
>>>>> since you are looking to create local temporary files to be used only
>>>>> by the task within itself, you shouldn't really worry about unique
>>>>> filenames (stuff can go wrong).
>>>>> 
>>>>> You're looking for the tmp/ directory locally created in the FS where
>>>>> the Task is running (at ${mapred.child.tmp}, which defaults to ./tmp).
>>>>> You can create a regular file there using vanilla Java APIs for files,
>>>>> or using RawLocalFS + your own created Path (not derived via
>>>>> OutputFormat/etc.).
>>>>> 
>>>>>>    storePath = new Path(context.getConfiguration().get("mapred.child.tmp"),
>>>>>>        "my-file.seq");
>>>>>>    writer = SequenceFile.createWriter(FileSystem.getLocal(configuration),
>>>>>>        configuration, storePath, IntWritable.class, itemClass);
>>>>>>    ...
>>>>>>    reader = new SequenceFile.Reader(FileSystem.getLocal(configuration),
>>>>>>        storePath, configuration);
>>>>> 
>>>>> The above should work, I think (haven't tried, but the idea is to use
>>>>> the mapred.child.tmp).
>>>>> 
>>>>> Also see:
>>>>> http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Directory+Structure
>>>>> 
>>>>> --
>>>>> Harsh J
>>>> 
>>>> 
>>> 
>> 
> 
