[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338188#comment-16338188 ]

Hudson commented on MAPREDUCE-7015:
---

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #13549 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/13549/])
MAPREDUCE-7015. Possible race condition in JHS if the job is not loaded. (jlowe: rev cff9edd4b514bdcfe22cd49964e3707fb78ab876)
* (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/HistoryFileManager.java
* (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CachedHistoryStorage.java
* (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/test/java/org/apache/hadoop/mapreduce/v2/hs/TestJobHistory.java

> Possible race condition in JHS if the job is not loaded
> ---
>
> Key: MAPREDUCE-7015
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7015
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: MAPREDUCE-7015-001.patch, MAPREDUCE-7015-POC01.patch, MAPREDUCE-7015-POC02.patch
>
> There could be a race condition inside JHS.
> In our build environment, {{TestMRJobClient.testJobClient()}} failed with this exception:
> {noformat}
> java.io.FileNotFoundException: File does not exist: hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
>     at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
>     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
>     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
>     at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
>     at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
>     at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
>     at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
>     at org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
>     at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
> {noformat}
> Root cause:
> 1. MapReduce job completes
> 2. CLI calls {{cluster.getJob(jobid)}}
> 3. The job is finished and the client side gets redirected to JHS
> 4. The job data is missing from {{CachedHistoryStorage}} so JHS tries to find the job
> 5. First it scans the intermediate directory and finds the job
> 6. The call {{moveToDone()}} is scheduled for execution on a separate thread inside {{moveToDoneExecutor}} and it starts to run immediately
> 7. RPC invocation returns with the path pointing to {{/tmp/hadoop-yarn/staging/history/done_intermediate}}
> 8. The call to {{moveToDone()}} completes which moves the contents of {{done_intermediate}} to {{done}}
> 9. Hadoop CLI tries to download the config file from done_intermediate but it's no longer there
> Usually step #6 is slow enough to complete after #7, but sometimes it's faster, causing this race condition.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
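The root-cause sequence in the description can be sketched with plain local files. This is only an illustration under stated assumptions, not JHS code: the class `MoveRaceSketch`, the method `reproduce()`, and the file name `job_0001_conf.xml` are hypothetical, local temporary directories stand in for HDFS, and the unlucky ordering (step 8 before step 9) is forced by waiting on the move instead of being left to chance.

```java
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the JHS lookup path; not part of Hadoop.
class MoveRaceSketch {

    /** Returns true if the client read failed because the file had already been moved. */
    static boolean reproduce() throws Exception {
        Path root = Files.createTempDirectory("jhs-sketch");
        Path intermediate = Files.createDirectories(root.resolve("done_intermediate"));
        Path done = Files.createDirectories(root.resolve("done"));
        Path conf = Files.write(intermediate.resolve("job_0001_conf.xml"),
                "<configuration/>".getBytes());

        ExecutorService moveToDoneExecutor = Executors.newSingleThreadExecutor();
        try {
            // Step 6: the move is scheduled on a separate thread.
            Future<Path> move = moveToDoneExecutor.submit(
                    () -> Files.move(conf, done.resolve(conf.getFileName())));

            // Step 7: the "RPC" answers with the intermediate path.
            Path returnedToClient = conf;

            // Step 8: force the ordering in which the move wins the race.
            move.get();

            // Step 9: the client reads the path it was given.
            try {
                Files.readAllBytes(returnedToClient);
                return false;
            } catch (NoSuchFileException e) {
                return true;
            }
        } finally {
            moveToDoneExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("client read failed: " + reproduce());
    }
}
```

Without the `move.get()` call the outcome depends on thread scheduling, which matches the intermittent nature of the reported test failure.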
[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338185#comment-16338185 ]

Jason Lowe commented on MAPREDUCE-7015:
---

Apologies for the delay. +1, the latest patch lgtm. Committing this.
[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336084#comment-16336084 ]

Peter Bacsko commented on MAPREDUCE-7015:
---

ping [~jlowe] - could you also check MAPREDUCE-7020 please?
[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328676#comment-16328676 ]

Hadoop QA commented on MAPREDUCE-7015:
---

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 9s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 15m 22s | trunk passed |
| +1 | compile | 0m 25s | trunk passed |
| +1 | checkstyle | 0m 17s | trunk passed |
| +1 | mvnsite | 0m 28s | trunk passed |
| +1 | shadedclient | 10m 8s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 32s | trunk passed |
| +1 | javadoc | 0m 18s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 24s | the patch passed |
| +1 | compile | 0m 20s | the patch passed |
| +1 | javac | 0m 20s | the patch passed |
| +1 | checkstyle | 0m 13s | the patch passed |
| +1 | mvnsite | 0m 22s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 10m 28s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 38s | the patch passed |
| +1 | javadoc | 0m 17s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 3m 24s | hadoop-mapreduce-client-hs in the patch passed. |
| +1 | asflicense | 0m 20s | The patch does not generate ASF License warnings. |
| | | 44m 19s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 |
| JIRA Issue | MAPREDUCE-7015 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12906392/MAPREDUCE-7015-001.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 977fe21f2594 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 09efdfe |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7302/testReport/ |
| Max. process+thread count | 440 (vs. ulimit of 5000) |
| modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7302/console |
| Powered by | Apache Yetus 0.7.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328648#comment-16328648 ]

Peter Bacsko commented on MAPREDUCE-7015:
---

[~jlowe] modified the patch as you suggested plus added a unit test.
[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321081#comment-16321081 ]

Jason Lowe commented on MAPREDUCE-7015:
---

Thanks for the patch!

CountDownLatch seems like a little bit of overkill in this context. I think it would be sufficient to add this method to HistoryFileInfo:
{code}
public synchronized void waitUntilMoved() throws InterruptedException {
  // Block until the move is no longer pending or has been marked as failed.
  while (isMovePending() && !didMoveFail()) {
    wait();
  }
}
{code}
and then simply add a notifyAll() when the HistoryFileInfo state is written (i.e.: in deleted and moveToDone).

I also think it is unnecessary to add a waitForAllMoveToDone. CachedHistoryStorage#getAllPartialJobs can simply wait on each file info in the loop as it collects them.
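The wait/notifyAll approach suggested in this comment can be sketched in isolation as follows. This is a simplified stand-in rather than the real HistoryFileInfo: the class `FileInfoSketch`, its three-value `State` enum, and the `finishMove()` and `demo()` methods are assumptions made for illustration, and the definition of `isMovePending()` here is a guess chosen so that the `didMoveFail()` check matters.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical, simplified stand-in for HistoryFileInfo; not Hadoop code.
class FileInfoSketch {
    enum State { IN_INTERMEDIATE, IN_DONE, MOVE_FAILED }

    private State state = State.IN_INTERMEDIATE;

    // Assumption: a failed move still counts as pending so that it can be retried.
    synchronized boolean isMovePending() {
        return state == State.IN_INTERMEDIATE || state == State.MOVE_FAILED;
    }

    synchronized boolean didMoveFail() {
        return state == State.MOVE_FAILED;
    }

    synchronized State getState() {
        return state;
    }

    /** Called by the mover thread; every state write wakes up the waiters. */
    synchronized void finishMove(boolean success) {
        state = success ? State.IN_DONE : State.MOVE_FAILED;
        notifyAll();
    }

    /** Blocks the caller until the move has either completed or failed. */
    synchronized void waitUntilMoved() throws InterruptedException {
        while (isMovePending() && !didMoveFail()) {
            wait();
        }
    }

    /** Starts a background "move" and waits for it; returns the final state. */
    static State demo() throws InterruptedException {
        FileInfoSketch info = new FileInfoSketch();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            executor.execute(() -> {
                try {
                    Thread.sleep(50); // simulated move duration
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                info.finishMove(true);
            });
            info.waitUntilMoved();
            return info.getState();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("state after waiting: " + demo());
    }
}
```

The `while` loop around `wait()` re-checks the condition after every wake-up, so a spurious wake-up or a notification for an unrelated state change does not release the caller early.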
[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320625#comment-16320625 ]

Peter Bacsko commented on MAPREDUCE-7015:
---

ping [~jlowe]
[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313106#comment-16313106 ]

Peter Bacsko commented on MAPREDUCE-7015:
---

Thanks for the idea [~jlowe]. I agree that waiting for only the job-related files is a better approach.

I came up with another POC. It still needs some tweaks (e.g. a timeout when calling {{await()}}) and tests.
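The timeout mentioned for {{await()}} could look roughly like the sketch below. The POC patch itself is not shown in this thread, so everything here is an assumption: the class `LatchTimeoutSketch`, the helper `waitForMove()`, and the timeout values are hypothetical. Only `CountDownLatch.await(long, TimeUnit)` is the standard JDK call.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of a bounded wait on a move; not the POC patch.
class LatchTimeoutSketch {

    /** Returns true if the move finished in time, false if the wait timed out. */
    static boolean waitForMove(CountDownLatch moved, long timeoutMillis)
            throws InterruptedException {
        return moved.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Returns {completedInTime, timedOut} for one finishing and one stuck move. */
    static boolean[] demo() throws InterruptedException {
        // A move that finishes: the mover thread counts the latch down.
        CountDownLatch finished = new CountDownLatch(1);
        Thread mover = new Thread(finished::countDown);
        mover.start();
        boolean completedInTime = waitForMove(finished, 5000);
        mover.join();

        // A move that never finishes: the bounded wait gives up instead of hanging.
        CountDownLatch stuck = new CountDownLatch(1);
        boolean timedOut = !waitForMove(stuck, 50);

        return new boolean[] {completedInTime, timedOut};
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] result = demo();
        System.out.println("completed in time: " + result[0] + ", stuck move timed out: " + result[1]);
    }
}
```

With a bounded wait the caller has to decide what to do when `await` returns false, for example returning the intermediate path anyway or failing the request.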
[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309767#comment-16309767 ]

Jason Lowe commented on MAPREDUCE-7015:
---

IIRC one reason the moveToDoneExecutor was added was to avoid spending a long time processing every move necessary during the scan. Adding the sync parameter or removing the executor completely makes the scan move everything it finds inline. This can take quite a long time, since the scan not only has to wait for RPC calls but can also spend a long time waiting to acquire the lock, because another thread could be calling loadJob() at the time and waiting on a very slow datanode. That makes the scan single-threaded, since it can't make progress on other intermediate files until it finishes moving each one in order. So I don't think making this sync is the ideal solution.

It may make more sense to have the RPC call wait, out-of-band, for the intermediate scan results it is going to return to be moved, rather than forcing the entire scan to be single-threaded or always waiting for all intermediates to be moved. For example, HistoryFileManager#getFileInfo could explicitly wait on the move to complete for the one job it is interested in, if that job was found in the intermediate scan. Then we don't have to wait for _every_ intermediate job to be moved, just the one we care about. getAllFileInfo would need to wait for all of them, but they would be processed in parallel while we're waiting for each in turn.

> Possible race condition in JHS if the job is not loaded
> ---
>
> Key: MAPREDUCE-7015
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7015
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Attachments: MAPREDUCE-7015-POC01.patch
>
>
> There could be a race condition inside JHS. In our build environment,
> {{TestMRJobClient.testJobClient()}} failed with this exception:
> {noformat}
> java.io.FileNotFoundException: File does not exist: hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
>   at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
>   at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
>   at org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
>   at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
> {noformat}
> Root cause:
> 1. MapReduce job completes
> 2. CLI calls {{cluster.getJob(jobid)}}
> 3. The job is finished and the client side gets redirected to JHS
> 4. The job data is missing from {{CachedHistoryStorage}} so JHS tries to find the job
> 5. First it scans the intermediate directory and finds the job
> 6. The call {{moveToDone()}} is scheduled for execution on a separate thread inside {{moveToDoneExecutor}} and it starts to run immediately
> 7. RPC invocation returns with the path pointing to {{/tmp/hadoop-yarn/staging/history/done_intermediate}}
> 8. The call to {{moveToDone()}} completes which moves the contents of {{done_intermediate}} to {{done}}
> 9. Hadoop CLI tries to download the config file from done_intermediate but it's no longer there
> Usually step #6 is slow enough to complete after step #7, but sometimes it's faster, causing this race condition.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
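The per-job wait suggested in the comment above could look roughly like the following. This is only a minimal sketch under stated assumptions, not the committed patch: {{PendingMoveTracker}}, {{scheduleMove}}, {{getFinalPath}} and {{awaitAll}} are hypothetical names, the move is simulated with a sleep, and the rewrite from {{done_intermediate}} to {{done}} is a simplification of the real directory layout.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical illustration of "wait only for the job we care about"; not Hadoop code. */
class PendingMoveTracker {
    // Background pool, playing the role that moveToDoneExecutor has in the discussion.
    private final ExecutorService moveExecutor = Executors.newFixedThreadPool(4);
    // One future per job whose move has been scheduled and may not have finished yet.
    private final Map<String, CompletableFuture<String>> pendingMoves = new ConcurrentHashMap<>();

    /** Scan side: schedule the move and return at once, so the scan is never blocked. */
    void scheduleMove(String jobId, String intermediatePath) {
        pendingMoves.computeIfAbsent(jobId, id ->
            CompletableFuture.supplyAsync(() -> moveToDone(intermediatePath), moveExecutor));
    }

    /** RPC side for a single job: block until that one move is finished, then return its path. */
    String getFinalPath(String jobId) {
        CompletableFuture<String> move = pendingMoves.get(jobId);
        return move == null ? null : move.join();
    }

    /** RPC side for all jobs: moves keep running in parallel while each is awaited in turn. */
    void awaitAll() {
        pendingMoves.values().forEach(CompletableFuture::join);
    }

    /** Simulated move (assumption): sleeps briefly and rewrites the directory name. */
    private String moveToDone(String intermediatePath) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return intermediatePath.replace("done_intermediate", "done");
    }

    void shutdown() {
        moveExecutor.shutdown();
    }

    public static void main(String[] args) {
        PendingMoveTracker tracker = new PendingMoveTracker();
        tracker.scheduleMove("job_1", "/history/done_intermediate/jenkins/job_1_conf.xml");
        tracker.scheduleMove("job_2", "/history/done_intermediate/jenkins/job_2_conf.xml");
        // Waits for job_1 only; job_2 may still be in flight.
        System.out.println(tracker.getFinalPath("job_1"));
        tracker.awaitAll();
        tracker.shutdown();
    }
}
```

In this sketch the scan thread never blocks on a move, while a caller asking about one job pays only for that job's move. Whether the real {{HistoryFileManager}} locking allows such a structure is not something the sketch tries to answer.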
[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309554#comment-16309554 ]

Peter Bacsko commented on MAPREDUCE-7015:
---

Yes, my bad. I corrected the description.
[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308511#comment-16308511 ]

Haibo Chen commented on MAPREDUCE-7015:
---

[~pbacsko] I think you meant to say that step #6 is slow enough most of the time, right? Otherwise, the CLI client would almost always look for the config file in the intermediate directory after the file had already been moved into the done directory, and fail.
[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307925#comment-16307925 ]

Peter Bacsko commented on MAPREDUCE-7015:
---

[~jlowe], [~haibochen] what do you think about the solution? There's an even simpler one: just remove {{moveToDoneExecutor}} completely. Is that acceptable?
[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
[ https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274505#comment-16274505 ]

Peter Bacsko commented on MAPREDUCE-7015:
---

Uploaded a POC for this - I think making the call to {{moveToDone()}} synchronous solves the problem, although I can't tell whether it could affect performance negatively.
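The difference between scheduling the move and running it inline can be shown with a toy model. This is a sketch only: {{MoveRaceDemo}} and all of its members are hypothetical, the "file system" is just a set of path strings, and the asynchronous variant deliberately forces the unlucky ordering (path captured first, move completed before the client reads) so that the outcome is deterministic.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy model of the race; every name here is hypothetical, not Hadoop code. */
class MoveRaceDemo {
    static final String INTERMEDIATE = "done_intermediate/job_1_conf.xml";
    static final String DONE = "done/job_1_conf.xml";

    // Toy "file system": the set of paths that currently exist.
    private final Set<String> files = ConcurrentHashMap.newKeySet();
    // Where the server currently believes the config file lives.
    private volatile String currentPath = INTERMEDIATE;

    MoveRaceDemo() {
        files.add(INTERMEDIATE);
    }

    private void moveToDone() {
        files.remove(INTERMEDIATE);
        files.add(DONE);
        currentPath = DONE;
    }

    /** Asynchronous variant: the returned path is read before the scheduled move has run. */
    String lookupAsync(ExecutorService executor) throws Exception {
        String returned = currentPath;                      // path that goes back to the client
        Future<?> move = executor.submit(this::moveToDone); // move runs on another thread
        move.get();                                         // forced here: move finishes before the client reads
        return returned;
    }

    /** Synchronous variant: the move runs inline, so the returned path is already final. */
    String lookupSync() {
        moveToDone();
        return currentPath;
    }

    boolean exists(String path) {
        return files.contains(path);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        MoveRaceDemo async = new MoveRaceDemo();
        String stale = async.lookupAsync(executor);
        System.out.println(stale + " exists: " + async.exists(stale));
        MoveRaceDemo sync = new MoveRaceDemo();
        String fresh = sync.lookupSync();
        System.out.println(fresh + " exists: " + sync.exists(fresh));
        executor.shutdown();
    }
}
```

In the asynchronous case the caller ends up holding a {{done_intermediate}} path that no longer exists, which mirrors the {{FileNotFoundException}} in the report. The synchronous case avoids that, at the cost of doing the move on the calling thread.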