[ 
https://issues.apache.org/jira/browse/AMBARI-9990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Fernandez updated AMBARI-9990:
----------------------------------------
    Attachment:     (was: AMBARI-9990.patch)

> CopyFromLocal failed to copy Tez tarball to HDFS because multiple processes 
> tried to copy to the same destination simultaneously
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMBARI-9990
>                 URL: https://issues.apache.org/jira/browse/AMBARI-9990
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.0.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>             Fix For: 2.0.0
>
>         Attachments: AMBARI-9990.patch, 
> hadoop-hdfs-datanode-c6408.ambari.apache.org.log, 
> hadoop-hdfs-datanode-c6410.ambari.apache.org.log, 
> hadoop-hdfs-namenode-c6408.ambari.apache.org.log, hdfs-audit.log
>
>
> Pig Service Check and Hive Server 2 START ran on two different machines 
> during the stack installation and failed to copy the Tez tarball to HDFS.
> I was able to reproduce this locally by calling copyFromLocal from two 
> clients simultaneously. See the HDFS audit log, the datanode logs on c6408 
> and c6410, and the namenode log on c6408.
> The copyFromLocal command's behavior is:
> * Try to create a temporary file <filename>._COPYING_ and write the real data 
> there
> * If any exception is hit, delete the file named <filename>._COPYING_
> Thus we have the following race condition in this test:
> * Process P1 created the file "tez.tar.gz._COPYING_" and wrote data to it.
> * Process P2 fired the same copyFromLocal command and hit an exception 
>   because it could not get the lease.
> * P2 then deleted the file "tez.tar.gz._COPYING_".
> * P1 could not close the file "tez.tar.gz._COPYING_" since it had been 
>   deleted by P2. The exception would say "could not find lease for file..."
> In general, the copyFromLocal command gives no synchronization guarantee 
> when several clients write to the same destination.
> One solution is for each process to copy to a unique destination file name 
> and then rename that file to the final name. Because the mv command is 
> synchronized by the namenode, at least one of them will succeed in naming 
> the file.
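
The proposed workaround could be sketched roughly as follows. This is only an illustration and not the contents of the attached patch: the function name, the temporary-name scheme, and the injectable `run` parameter are assumptions made for the example. It also assumes that `hadoop fs -mv` fails when the destination file already exists.

```python
import subprocess
import uuid


def copy_with_unique_temp(src, dest, run=subprocess.call):
    """Copy src to dest on HDFS through a uniquely named temporary file.

    `run` takes an argv list and returns the exit code (0 on success).
    Returns True if a file exists at `dest` when the call finishes.
    """
    # Hypothetical naming scheme: a random suffix so that concurrent
    # writers never share a temporary file (or its ._COPYING_ sibling).
    tmp = "%s.%s.tmp" % (dest, uuid.uuid4().hex)

    if run(["hadoop", "fs", "-copyFromLocal", src, tmp]) != 0:
        # Clean up only our own temporary file.
        run(["hadoop", "fs", "-rm", "-f", tmp])
        return False

    # The rename is serialized by the namenode, so one writer wins.
    if run(["hadoop", "fs", "-mv", tmp, dest]) == 0:
        return True

    # The rename failed, most likely because another process got there
    # first. Drop our copy and report whether dest is now present.
    run(["hadoop", "fs", "-rm", "-f", tmp])
    return run(["hadoop", "fs", "-test", "-e", dest]) == 0
```

Each caller touches only its own temporary file, so a failure in one process can no longer delete data that another process is still writing.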



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
