Thanks again.  This gives me a lot of options; we will see what works.

Do you know if there are any permissions issues if we directly access the 
folders of LOCAL_DIR_ENV?

Regarding LocalDirAllocator, I see its constructor: LocalDirAllocator(String 
contextCfgItemName) and a note mentioning that an example of this item is 
"mapred.local.dir".  Is that the correct usage, or is there something 
YARN-generic?

Cheers,
john

-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Sunday, October 20, 2013 11:58 PM
To: <user@hadoop.apache.org>
Subject: Re: temporary file locations for YARN applications

Hi,

MR does use multiple disks when spilling. But the work directory is also 
round-robined to spread I/O.

YARN sets an environment variable that's a comma-separated list of 
directories (ApplicationConstants.LOCAL_DIR_ENV) that your app container can 
use together. Perhaps read it in with 
StringUtils.getTrimmedStrings(System.getenv(ApplicationConstants.LOCAL_DIR_ENV));
and then round-robin internally over those paths (with free-space handling)?
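
A minimal sketch of that round-robin idea, in plain Java with no Hadoop 
dependency. The class name RoundRobinDirs, its methods, and the minFreeBytes 
threshold are hypothetical, made up for illustration; it assumes the 
comma-separated list has already been read from the environment as described 
above, and it treats "free-space handling" as simply skipping any directory 
whose usable space is below the threshold.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: round-robins over a comma-separated list of local dirs. */
public class RoundRobinDirs {
    private final List<File> dirs = new ArrayList<>();
    private final long minFreeBytes; // assumed threshold, not a YARN setting
    private int next = 0;

    public RoundRobinDirs(String commaSeparated, long minFreeBytes) {
        if (commaSeparated != null) {
            for (String s : commaSeparated.split(",")) {
                String t = s.trim();
                if (!t.isEmpty()) {
                    dirs.add(new File(t)); // skip empty entries, keep order
                }
            }
        }
        if (dirs.isEmpty()) {
            throw new IllegalArgumentException("no local directories given");
        }
        this.minFreeBytes = minFreeBytes;
    }

    /** The parsed directories, in the order they were listed. */
    public List<File> dirs() {
        return Collections.unmodifiableList(dirs);
    }

    /**
     * Returns the next directory in rotation that reports at least
     * minFreeBytes of usable space, trying each directory at most once.
     */
    public synchronized File nextDir() {
        for (int i = 0; i < dirs.size(); i++) {
            File d = dirs.get(next);
            next = (next + 1) % dirs.size();
            // getUsableSpace() returns 0 for a path that does not exist
            if (d.getUsableSpace() >= minFreeBytes) {
                return d;
            }
        }
        throw new IllegalStateException(
            "no directory has " + minFreeBytes + " bytes free");
    }
}
```

In a container this might be constructed once from the environment value and 
then asked for a directory each time a new temp file is opened, so that 
successive files land on different disks.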

Perhaps you can even reuse the org.apache.hadoop.fs.LocalDirAllocator
class, which is what MR uses. It has not been declared publicly stable yet, 
but we can do that over a JIRA.

On Mon, Oct 21, 2013 at 2:05 AM, John Lilley <john.lil...@redpoint.net> wrote:
> Harsh, thanks for the quick response.  These files don't need to be on the 
> DFS (although we use that too).  These are local files used during sorting, 
> joining, transitive closure.
>
> The task-relative folder might be good enough, but our app *can* make use of 
> multiple temp folders if they are available.  Our YARN app can be fairly I/O 
> intensive; is it possible to allocate more than one temp folder on different 
> physical devices?
>
> Or perhaps YARN might help us. Will YARN assign tasks to CWD folders on 
> different disks so that they do not compete with each other on I/O?
>
> For that matter, where does MR allocate the temporary files generated by 
> Mapper output?  Presumably MR has the same I/O parallelism requirements that 
> we do.
>
> Thanks
> John
>
>
> -----Original Message-----
> From: Harsh J [mailto:ha...@cloudera.com]
> Sent: Sunday, October 20, 2013 10:49 AM
> To: <user@hadoop.apache.org>
> Subject: Re: temporary file locations for YARN applications
>
> Every container gets its own local work directory (you can use the relative 
> ./) that's auto-cleaned up at the end of the container's life.
> This is the best place to store the temporary files. This is not something 
> you need custom configuration for.
>
> Do the files need to be on a distributed FS or a local one?
>
> On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <john.lil...@redpoint.net> wrote:
>> We have a pure YARN application (no MapReduce) that needs to store 
>> a significant amount of temporary data.  How can we know the best 
>> location for these files?  How can we ensure that our YARN tasks have 
>> write access to these locations?  Is this something that must be configured 
>> outside of YARN?
>> Thanks,
>> John
>
> --
> Harsh J



--
Harsh J
