I believe the framework checks the timestamps on HDFS to decide whether an
already localized copy of the file is valid or stale, since the archived files
are not cleaned up until a certain disk-usage (du) limit is reached, and no
APIs for cleanup are available. There was a thread on this on the list some
time back.
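
If it helps, the sketch below (written against the 0.20-era API, with the path
and the property names partly from memory, so please verify against your
version) shows roughly where that timestamp comes from when a job registers an
archive:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheTimestampSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheTimestampSketch.class);
    Path archive = new Path("/user/me/cache/dict.zip");   // hypothetical path

    // The HDFS modification time is what I believe the TaskTracker compares
    // against its already-localized copy to decide valid vs. stale.
    FileStatus status = FileSystem.get(conf).getFileStatus(archive);
    System.out.println("HDFS mtime used for validation: "
        + status.getModificationTime());

    // Register the archive; the "#dict" fragment is only the symlink name.
    DistributedCache.addCacheArchive(new URI(archive.toUri() + "#dict"), conf);
    DistributedCache.createSymlink(conf);

    // Localized copies stay on disk until the per-node cache grows past the
    // size set by local.cache.size (property name from memory), and I am not
    // aware of a public API to purge them earlier.
    // ... then configure mappers/reducers and submit with JobClient.runJob(conf).
  }
}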

Amogh

-----Original Message-----
From: Allen Wittenauer [mailto:awittena...@linkedin.com] 
Sent: Tuesday, September 29, 2009 10:41 PM
To: common-user@hadoop.apache.org
Subject: Re: Distributed cache - are files unique per job?




On 9/29/09 2:55 AM, "Erik Forsberg" <forsb...@opera.com> wrote:
> If I distribute files using the Distributed Cache (-archives option),
> are they guaranteed to be unique per job, or is there a risk that if I
> distribute a file named A with job 1, job 2, which also distributes a
> file named A, will read job 1's file?

From my understanding, at one point in time there was a 'shortcut' in the
system that did exactly what you fear.  If the same cache file name was
specified by multiple jobs, they'd get the same file, as it was assumed they
were the same file.  I *think* this has been fixed, though.

[Needless to say, for automated jobs that push security keys through a cache
file, this is bad.]
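
One defensive pattern (just a sketch of what a job could do on its own, not
anything the framework promises) is to stage each run's copy under a
job-unique HDFS path before registering it, so two jobs can never collide on a
bare name like "A"; the paths and names below are hypothetical:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class UniqueCacheName {
  public static void registerKeyFile(JobConf conf, Path localKeyFile)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    // Stage under a directory unique to this job run: job name plus a timestamp.
    Path staged = new Path("/tmp/cache-staging/"
        + conf.getJobName() + "-" + System.currentTimeMillis()
        + "/" + localKeyFile.getName());
    fs.copyFromLocalFile(localKeyFile, staged);

    // The "#A" fragment only controls the symlink name the task sees, so the
    // job code can still open "A" even though the HDFS path is unique.
    DistributedCache.addCacheFile(new URI(staged.toUri() + "#A"), conf);
    DistributedCache.createSymlink(conf);
  }
}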
