[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2018-01-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338188#comment-16338188
 ] 

Hudson commented on MAPREDUCE-7015:
---

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #13549 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/13549/])
MAPREDUCE-7015. Possible race condition in JHS if the job is not loaded. 
(jlowe: rev cff9edd4b514bdcfe22cd49964e3707fb78ab876)
* (edit) 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/HistoryFileManager.java
* (edit) 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CachedHistoryStorage.java
* (edit) 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/test/java/org/apache/hadoop/mapreduce/v2/hs/TestJobHistory.java


> Possible race condition in JHS if the job is not loaded
> ---
>
> Key: MAPREDUCE-7015
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7015
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: MAPREDUCE-7015-001.patch, MAPREDUCE-7015-POC01.patch, 
> MAPREDUCE-7015-POC02.patch
>
>
> There could be a race condition inside JHS. In our build environment, 
> {{TestMRJobClient.testJobClient()}} failed with this exception:
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
>   at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 
> org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
>   at 
> org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
>   at 
> org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
> {noformat}
> Root cause:
> 1. MapReduce job completes
> 2. CLI calls {{cluster.getJob(jobid)}}
> 3. The job is finished and the client side gets redirected to JHS
> 4. The job data is missing from {{CachedHistoryStorage}} so JHS tries to find 
> the job
> 5. First it scans the intermediate directory and finds the job
> 6. The call {{moveToDone()}} is scheduled for execution on a separate thread 
> inside {{moveToDoneExecutor}} and it starts to run immediately
> 7. RPC invocation returns with the path pointing to 
> {{/tmp/hadoop-yarn/staging/history/done_intermediate}}
> 8. The call to {{moveToDone()}} completes which moves the contents of 
> {{done_intermediate}} to {{done}}
> 9. Hadoop CLI tries to download the config file from done_intermediate but 
> it's no longer there
> Usually step #6 is slow enough to complete after #7, but sometimes it's 
> faster, causing this race condition.
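
The ordering problem described in the root-cause steps can be reproduced in miniature. The following is only a sketch, not Hadoop code: the in-memory set standing in for HDFS, the class name `StalePathRaceSketch`, and the `gate` latch used to force each interleaving are all assumptions made for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical in-memory model of the two history directories; not Hadoop code.
final class StalePathRaceSketch {
  static final String INTERMEDIATE = "done_intermediate/job_1_conf.xml";
  static final String DONE = "done/job_1_conf.xml";

  /** Returns true if the client still finds the file at the path it was given. */
  static boolean clientFindsFile(boolean moveWinsRace) {
    Set<String> fs = ConcurrentHashMap.newKeySet();
    fs.add(INTERMEDIATE);
    CountDownLatch gate = new CountDownLatch(1); // lets the demo pick the interleaving
    ExecutorService moveToDoneExecutor = Executors.newSingleThreadExecutor();
    try {
      // Steps 5-6: the scan finds the job and schedules the move on another thread.
      Future<Void> move = moveToDoneExecutor.submit(() -> {
        gate.await();
        fs.remove(INTERMEDIATE);
        fs.add(DONE);
        return null;
      });
      // Step 7: the reply carries the location that was seen during the scan.
      String returnedPath = INTERMEDIATE;
      boolean found;
      if (moveWinsRace) {
        gate.countDown();
        move.get();                          // step 8 finishes first
        found = fs.contains(returnedPath);   // step 9 sees a stale path
      } else {
        found = fs.contains(returnedPath);   // step 9 happens before the move
        gate.countDown();
        move.get();
      }
      return found;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      moveToDoneExecutor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("move wins race, file found: " + clientFindsFile(true));
    System.out.println("client wins race, file found: " + clientFindsFile(false));
  }
}
```

In the real server nothing forces either ordering, which is why the failure only shows up occasionally.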



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2018-01-24 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338185#comment-16338185
 ] 

Jason Lowe commented on MAPREDUCE-7015:
---

Apologies for the delay.  +1, the latest patch lgtm.

Committing this.




[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2018-01-23 Thread Peter Bacsko (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336084#comment-16336084
 ] 

Peter Bacsko commented on MAPREDUCE-7015:
-

ping [~jlowe] - could you also check MAPREDUCE-7020 please?




[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2018-01-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328676#comment-16328676
 ] 

Hadoop QA commented on MAPREDUCE-7015:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  9s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m  8s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 18s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 28s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 17s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 24s{color} | {color:green} hadoop-mapreduce-client-hs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 44m 19s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 |
| JIRA Issue | MAPREDUCE-7015 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12906392/MAPREDUCE-7015-001.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 977fe21f2594 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 09efdfe |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
|  Test Results | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7302/testReport/ |
| Max. process+thread count | 440 (vs. ulimit of 5000) |
| modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7302/console |
| Powered by | Apache Yetus 0.7.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2018-01-17 Thread Peter Bacsko (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328648#comment-16328648
 ] 

Peter Bacsko commented on MAPREDUCE-7015:
-

[~jlowe] I modified the patch as you suggested and added a unit test.




[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2018-01-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321081#comment-16321081
 ] 

Jason Lowe commented on MAPREDUCE-7015:
---

Thanks for the patch!

CountDownLatch seems like a little bit of overkill in this context.  I think it 
would be sufficient to add this method to HistoryFileInfo:
{code}
  // Blocks the caller until the pending move has either finished or failed.
  public synchronized void waitUntilMoved() throws InterruptedException {
    while (isMovePending() && !didMoveFail()) {
      wait();
    }
  }
{code}

and then simply add a notifyAll() when the HistoryFileInfo state is written 
(i.e.: in deleted and moveToDone).

I also think it is unnecessary to add a waitForAllMoveToDone.  
CachedHistoryStorage#getAllPartialJobs can simply wait on each file info in the 
loop as it collects them.
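
As a rough illustration of the wait/notifyAll idea above, here is a self-contained sketch. It is not the HistoryFileInfo implementation: the class `MoveStateSketch`, its `State` enum, the `finishMove` method, and the definition of `isMovePending()` as "not yet in done" are assumptions made for the example.

```java
// Hypothetical stand-in for HistoryFileInfo; only the monitor pattern is shown.
final class MoveStateSketch {
  enum State { IN_INTERMEDIATE, IN_DONE, MOVE_FAILED }

  private State state = State.IN_INTERMEDIATE;

  // Assumption: a failed move still counts as pending, so waiters must also
  // check didMoveFail() to avoid blocking forever.
  synchronized boolean isMovePending() {
    return state != State.IN_DONE;
  }

  synchronized boolean didMoveFail() {
    return state == State.MOVE_FAILED;
  }

  synchronized State state() {
    return state;
  }

  /** Called by the mover thread; every state write wakes up all waiters. */
  synchronized void finishMove(boolean success) {
    state = success ? State.IN_DONE : State.MOVE_FAILED;
    notifyAll();
  }

  /** Blocks until the move has either completed or failed. */
  synchronized void waitUntilMoved() throws InterruptedException {
    while (isMovePending() && !didMoveFail()) {
      wait(); // releases the monitor while waiting; loop guards against spurious wakeups
    }
  }

  /** Runs one mover thread and one waiter; returns the state the waiter observes. */
  static State demo(boolean success) {
    MoveStateSketch info = new MoveStateSketch();
    Thread mover = new Thread(() -> info.finishMove(success));
    mover.start();
    try {
      info.waitUntilMoved();
      mover.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return info.state();
  }
}
```

Because the condition is re-checked under the monitor before each `wait()`, it does not matter whether the mover finishes before or after the waiter arrives.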


> Possible race condition in JHS if the job is not loaded
> ---
>
> Key: MAPREDUCE-7015
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7015
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
> Attachments: MAPREDUCE-7015-POC01.patch, MAPREDUCE-7015-POC02.patch
>
>
> There could be a race condition inside JHS. In our build environment, 
> {{TestMRJobClient.testJobClient()}} failed with this exception:
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
>   at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 
> org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
>   at 
> org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
>   at 
> org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
> {noformat}
> Root cause:
> 1. MapReduce job completes
> 2. CLI calls {{cluster.getJob(jobid)}}
> 3. The job is finished and the client side gets redirected to JHS
> 4. The job data is missing from {{CachedHistoryStorage}} so JHS tries to find 
> the job
> 5. First it scans the intermediate directory and finds the job
> 6. The call {{moveToDone()}} is scheduled for execution on a separate thread 
> inside {{moveToDoneExecutor}} and it starts to run immediately
> 7. RPC invocation returns with the path pointing to 
> {{/tmp/hadoop-yarn/staging/history/done_intermediate}}
> 8. The call to {{moveToDone()}} completes which moves the contents of 
> {{done_intermediate}} to {{done}}
> 9. Hadoop CLI tries to download the config file from done_intermediate but 
> it's no longer there
> Usually step #6 is slow enough to complete after #7, but sometimes it's 
> faster, causing this race condition.






[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2018-01-10 Thread Peter Bacsko (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320625#comment-16320625
 ] 

Peter Bacsko commented on MAPREDUCE-7015:
-

ping [~jlowe]




[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2018-01-05 Thread Peter Bacsko (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313106#comment-16313106
 ] 

Peter Bacsko commented on MAPREDUCE-7015:
-

Thanks for the idea [~jlowe]. I agree that waiting only for the job-related 
files is a better approach. I came up with another POC. It still needs some 
tweaks (e.g. a timeout when calling {{await()}}) and tests.
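
For reference, a bounded wait on a latch might look like the sketch below. The POC patch itself is not shown in this thread, so the class name `LatchTimeoutSketch`, the `awaitMove` helper, and the choice of timeout values are purely illustrative assumptions; only `CountDownLatch.await(long, TimeUnit)` is standard JDK API.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; it does not reproduce the POC patch.
final class LatchTimeoutSketch {

  /**
   * Waits up to timeoutMillis for a simulated move to signal completion.
   * Returns true if the latch reached zero in time, false on timeout.
   */
  static boolean awaitMove(boolean moveCompletes, long timeoutMillis) {
    CountDownLatch moved = new CountDownLatch(1);
    if (moveCompletes) {
      // Simulated mover thread: signals completion right away.
      Thread mover = new Thread(moved::countDown);
      mover.start();
    }
    try {
      // The timed variant returns false instead of blocking forever
      // when the move never finishes.
      return moved.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

With a timeout, a caller whose move never completes gets control back and can decide how to report the problem rather than hanging the RPC handler.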




[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2018-01-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309767#comment-16309767
 ] 

Jason Lowe commented on MAPREDUCE-7015:
---

IIRC one reason the moveToDoneExecutor was added was to avoid spending a long 
time processing every move necessary during the scan.  By adding the sync 
parameter or removing the executor completely, that makes the scan move 
everything it finds inline. This can take quite a long time since it not only 
has to wait for RPC calls but also can spend a long time waiting to acquire the 
lock because another thread could be calling loadJob() at the time and waiting 
on a very slow datanode.  That makes the scan single-threaded since it can't 
make progress on other intermediate files until it finishes moving each one in 
order.  So I don't think making this sync is the ideal solution.

It may make more sense to have the RPC call wait until the intermediate scan 
results it is about to return have been moved, rather than forcing the entire 
scan to be single-threaded or always waiting for all intermediates to be moved.  For 
example, HistoryFileManager#getFileInfo could explicitly wait on the move to 
complete for the one job it is interested in if that job was found in the 
intermediate scan.  Then we don't have to wait for _every_ intermediate job to 
be moved, just the one we care about.  getAllFileInfo would need to wait for 
all of them, but they would be processed in parallel while we're waiting for 
each in turn.
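
The shape of that proposal can be sketched with plain JDK futures. This is an illustration under stated assumptions, not the HistoryFileManager code: the class `PerJobWaitSketch`, the string paths, and the use of `Future` as the per-job handle are invented for the example, and the sketch is meant to be driven from a single caller thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model: moves stay asynchronous, lookups wait per job.
final class PerJobWaitSketch {
  private final ExecutorService moveToDoneExecutor = Executors.newFixedThreadPool(4);
  private final Map<String, Future<String>> pendingMoves = new LinkedHashMap<>();

  /** Scan: schedule every move without blocking on any of them. */
  void scan(List<String> jobIds) {
    for (String jobId : jobIds) {
      // The task stands in for the move and yields the final location.
      pendingMoves.put(jobId, moveToDoneExecutor.submit(() -> "done/" + jobId));
    }
  }

  /** Single-job lookup: wait only for the job the caller asked about. */
  String getFileInfo(String jobId) {
    Future<String> move = pendingMoves.get(jobId);
    return move == null ? null : await(move);
  }

  /** All-jobs lookup: wait for each in turn while the others keep moving. */
  List<String> getAllFileInfo() {
    List<String> results = new ArrayList<>();
    for (Future<String> move : pendingMoves.values()) {
      results.add(await(move));
    }
    return results;
  }

  void shutdown() {
    moveToDoneExecutor.shutdown();
  }

  private static String await(Future<String> move) {
    try {
      return move.get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

A caller would run `scan(...)` once and then call `getFileInfo("job_2")`, which blocks only on that job's move, while `getAllFileInfo()` pays at most the cost of the slowest move rather than the sum of all of them.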

> Possible race condition in JHS if the job is not loaded
> ---
>
> Key: MAPREDUCE-7015
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7015
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
> Attachments: MAPREDUCE-7015-POC01.patch
>
>
> There could be a race condition inside JHS. In our build environment, 
> {{TestMRJobClient.testJobClient()}} failed with this exception:
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
>   at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 
> org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
>   at 
> org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
>   at 
> org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
> {noformat}
> Root cause:
> 1. MapReduce job completes
> 2. CLI calls {{cluster.getJob(jobid)}}
> 3. The job is finished and the client side gets redirected to JHS
> 4. The job data is missing from {{CachedHistoryStorage}} so JHS tries to find 
> the job
> 5. First it scans the intermediate directory and finds the job
> 6. The call {{moveToDone()}} is scheduled for execution on a separate thread 
> inside {{moveToDoneExecutor}} and it starts to run immediately
> 7. RPC invocation returns with the path pointing to 
> {{/tmp/hadoop-yarn/staging/history/done_intermediate}}
> 8. The call to {{moveToDone()}} completes which moves the contents of 
> {{done_intermediate}} to {{done}}
> 9. Hadoop CLI tries to download the config file from done_intermediate but 
> it's no longer there
> Usually step #6 is slow enough to complete after #7, but sometimes it's 
> faster, causing this race condition.
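
The ordering in steps 6 to 9 can be reproduced deterministically with a small 
stand-alone sketch. Everything in it is an assumption made for illustration: 
an in-memory set of path strings stands in for HDFS, a single-thread executor 
stands in for moveToDoneExecutor, and a latch forces one ordering or the other 
instead of leaving it to thread scheduling. It is not code from JHS or from 
the attached patch.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of the lookup/move race; paths and names are made up. */
final class MoveRaceSketch {
  // In-memory stand-in for HDFS: the set of paths that currently exist.
  private static final Set<String> FS = ConcurrentHashMap.newKeySet();

  /** Returns true if the client still finds the file at the path it was given. */
  public static boolean lookupThenRead(boolean moveWinsRace) throws Exception {
    final String intermediate = "done_intermediate/job_1_conf.xml";
    FS.clear();
    FS.add(intermediate);

    ExecutorService moveExecutor = Executors.newSingleThreadExecutor();
    CountDownLatch clientHasRead = new CountDownLatch(1);
    try {
      // Step 6: the move is scheduled on a separate thread.
      Future<?> move = moveExecutor.submit(() -> {
        if (!moveWinsRace) {
          clientHasRead.await(); // hold the move back until the client has read
        }
        FS.remove(intermediate);
        FS.add("done/job_1_conf.xml");
        return null;
      });

      // Step 7: the lookup answers with the intermediate path.
      String returnedPath = intermediate;

      if (moveWinsRace) {
        move.get(); // Step 8 happens before the client touches the path
      }

      // Step 9: the client tries to read from the path it was given.
      boolean found = FS.contains(returnedPath);

      clientHasRead.countDown();
      move.get(); // let the move finish in either ordering
      return found;
    } finally {
      moveExecutor.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(lookupThenRead(false)); // prints true: the read beat the move
    System.out.println(lookupThenRead(true));  // prints false: the path is stale
  }
}
```

The second call corresponds to the FileNotFoundException above: the path was 
valid when it was returned and stale by the time the client used it.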



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2018-01-03 Thread Peter Bacsko (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309554#comment-16309554
 ] 

Peter Bacsko commented on MAPREDUCE-7015:
-

Yes, my bad. I corrected the description.

> Possible race condition in JHS if the job is not loaded
> ---
>
> Key: MAPREDUCE-7015
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7015
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Attachments: MAPREDUCE-7015-POC01.patch
>
>
> There could be a race condition inside JHS. In our build environment, 
> {{TestMRJobClient.testJobClient()}} failed with this exception:
> {noformat}
> java.io.FileNotFoundException: File does not exist: hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
>   at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
>   at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
>   at org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
>   at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
> {noformat}
> Root cause:
> 1. MapReduce job completes
> 2. CLI calls {{cluster.getJob(jobid)}}
> 3. The job is finished and the client side gets redirected to JHS
> 4. The job data is missing from {{CachedHistoryStorage}} so JHS tries to find 
> the job
> 5. First it scans the intermediate directory and finds the job
> 6. The call {{moveToDone()}} is scheduled for execution on a separate thread 
> inside {{moveToDoneExecutor}} but does not get the chance to run immediately
> 7. RPC invocation returns with the path pointing to 
> {{/tmp/hadoop-yarn/staging/history/done_intermediate}}
> 8. The call to {{moveToDone()}} completes which moves the contents of 
> {{done_intermediate}} to {{done}}
> 9. Hadoop CLI tries to download the config file from done_intermediate but 
> it's no longer there
> Usually step #6 is slow enough to complete after #7, but sometimes it's 
> faster, causing this race condition.






[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2018-01-02 Thread Haibo Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308511#comment-16308511
 ] 

Haibo Chen commented on MAPREDUCE-7015:
---

[~pbacsko] I think you meant to say that step #6 is slow enough most of the 
time, right? Otherwise, the CLI client would almost always look for the config 
file in the intermediate directory after the file had already been moved into 
the done directory, and fail.

> Possible race condition in JHS if the job is not loaded
> ---
>
> Key: MAPREDUCE-7015
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7015
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Attachments: MAPREDUCE-7015-POC01.patch
>
>
> There could be a race condition inside JHS. In our build environment, 
> {{TestMRJobClient.testJobClient()}} failed with this exception:
> {noformat}
> java.io.FileNotFoundException: File does not exist: hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
>   at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
>   at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
>   at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
>   at org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
>   at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
> {noformat}
> Root cause:
> 1. MapReduce job completes
> 2. CLI calls {{cluster.getJob(jobid)}}
> 3. The job is finished and the client side gets redirected to JHS
> 4. The job data is missing from {{CachedHistoryStorage}} so JHS tries to find 
> the job
> 5. First it scans the intermediate directory and finds the job
> 6. The call {{moveToDone()}} is scheduled for execution on a separate thread 
> inside {{moveToDoneExecutor}} but does not get the chance to run immediately
> 7. RPC invocation returns with the path pointing to 
> {{/tmp/hadoop-yarn/staging/history/done_intermediate}}
> 8. The call to {{moveToDone()}} completes which moves the contents of 
> {{done_intermediate}} to {{done}}
> 9. Hadoop CLI tries to download the config file from done_intermediate but 
> it's no longer there
> Usually step #6 is fast enough to complete before step #7, but sometimes it 
> can get behind, causing this race condition.






[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2018-01-02 Thread Peter Bacsko (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307925#comment-16307925
 ] 

Peter Bacsko commented on MAPREDUCE-7015:
-

[~jlowe], [~haibochen] what do you think about the solution? There's an even 
simpler one: just remove {{moveToDoneExecutor}} completely. Is that acceptable?







[jira] [Commented] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

2017-12-01 Thread Peter Bacsko (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274505#comment-16274505
 ] 

Peter Bacsko commented on MAPREDUCE-7015:
-

Uploaded a POC for this - I think making the call to {{moveToDone()}} 
synchronous solves the problem, although I can't tell whether it affects 
performance negatively.



