Chunyi Yang created HADOOP-18847:
------------------------------------

             Summary: mapreduce job encounters java.io.IOException when 
dfs.client.rbf.observer.read.enable is true
                 Key: HADOOP-18847
                 URL: https://issues.apache.org/jira/browse/HADOOP-18847
             Project: Hadoop Common
          Issue Type: Bug
          Components: common
            Reporter: Chunyi Yang


When running mapreduce job in a Router-Based Federation+Observer read enabled 
environment, we see approximately a 1% probability of encountering the 
following error. 

{code:java}
"java.io.IOException: Resource 
hdfs://XXXX/user/XXXX/.staging/job_XXXXXX/.tez/application_XXXXXX/tez-conf.pb 
changed on src filesystem - expected: \"2023-07-07T12:41:16.801+0900\", was: 
\"2023-07-07T12:41:16.822+0900\", current time: 
\"2023-07-07T12:41:22.386+0900\"",
{code}


This error happens in function verifyAndCopy inside FSDownload.java when 
nodemanager tries to download a file right after the file has been written to 
the HDFS. The write operation runs on active namenode and read operation runs 
on observer namenode as expected.


The edits file and hdfs-audit files show that the expected time seen in error 
message is the OP_CLOSE MTIME of the tez-conf.pb file(which is correct) while 
the actual timestamp it gets from the read operation is OP_ADD MTIME of the 
target tez-conf.pf file (which is wrong). This mismatch shows that the observer 
namenode responses to client before its edits file updates to the lastest 
stateid which causes the issue.







--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org

Reply via email to