I believe the framework checks timestamps on HDFS to decide whether an already available copy of the file is valid or invalid, since archived files are not cleaned up until a certain disk-usage (du) limit is reached and no APIs for cleanup are available. There was a thread about this on the list some time back.
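Roughly, the idea is something like the sketch below (illustrative only, not the actual TaskTracker code; the class and method names are made up): the modification time recorded when the file was localized is compared against the current modification time on HDFS, and a mismatch marks the local copy invalid.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CacheTimestampCheck {
    // Returns true if the locally cached copy was made from a source whose
    // HDFS modification time has not changed since localization, i.e. the
    // copy can be reused instead of being fetched again.
    public static boolean isLocalCopyValid(Configuration conf,
                                           URI cacheUri,
                                           long timestampWhenLocalized)
            throws IOException {
        FileSystem fs = FileSystem.get(cacheUri, conf);
        FileStatus status = fs.getFileStatus(new Path(cacheUri.getPath()));
        // If the source on HDFS has been replaced since localization, the
        // timestamps differ and the cached copy is treated as stale.
        return status.getModificationTime() == timestampWhenLocalized;
    }
}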
Amogh

-----Original Message-----
From: Allen Wittenauer [mailto:awittena...@linkedin.com]
Sent: Tuesday, September 29, 2009 10:41 PM
To: common-user@hadoop.apache.org
Subject: Re: Distributed cache - are files unique per job?

On 9/29/09 2:55 AM, "Erik Forsberg" <forsb...@opera.com> wrote:
> If I distribute files using the Distributed Cache (-archives option),
> are they guaranteed to be unique per job, or is there a risk that if I
> distribute a file named A with job 1, job 2 which also distributes a
> file named A will read job 1's file?

From my understanding, at one point in time there was a 'shortcut' in the system that did exactly what you fear: if the same cache file name was specified by multiple jobs, they'd get the same file, as it was assumed they were the same file. I *think* this has been fixed, though.

[Needless to say, for automated jobs that push security keys through a cache file, this is bad.]
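One defensive pattern, sketched below assuming the 0.20-era org.apache.hadoop.filecache.DistributedCache API (the class name UniqueCacheArchive and the staging-path layout are just an illustrative convention, not anything the framework requires), is to stage each job's archive under a job-unique HDFS path before adding it to the cache, so two jobs that both ship a file named A never share a cache entry by name:

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class UniqueCacheArchive {
    // Copies the local archive to an HDFS path that embeds the job name,
    // then registers that job-unique path with the distributed cache.
    public static void addJobUniqueArchive(JobConf conf,
                                           Path localArchive,
                                           Path stagingDir)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        // Including the job name in the path keeps two jobs that both ship
        // a file named "A" from resolving to the same cache entry.
        Path staged = new Path(stagingDir,
                conf.getJobName() + "/" + localArchive.getName());
        fs.copyFromLocalFile(false, true, localArchive, staged);
        DistributedCache.addCacheArchive(staged.toUri(), conf);
    }
}

The same effect can be had from the command line by simply pointing -archives at per-job HDFS paths rather than reusing one shared path across jobs.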