[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2015-07-08 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619618#comment-14619618 ]

Hadoop QA commented on HDFS-7314:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch | 17m 38s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:red}-1{color} | javac | 8m 9s | The applied patch generated 1 additional warning messages. |
| {color:green}+1{color} | javadoc | 10m 6s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 26s | The applied patch generated 3 new checkstyle issues (total was 138, now 139). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 27s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 37s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native | 3m 10s | Pre-build of native portion |
| {color:red}-1{color} | hdfs tests | 159m 27s | Tests failed in hadoop-hdfs. |
| | | 204m 59s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestLeaseRecovery2 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12744319/HDFS-7314-8.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 2e3d83f |
| Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html |
| javac | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/diffJavacWarnings.txt |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/11630/console |


This message was automatically generated.

 Aborted DFSClient's impact on long running service like YARN
 -------------------------------------------------------------

                 Key: HDFS-7314
                 URL: https://issues.apache.org/jira/browse/HDFS-7314
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ming Ma
            Assignee: Ming Ma
              Labels: BB2015-05-TBR
         Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, HDFS-7314-5.patch, HDFS-7314-6.patch, HDFS-7314-7.patch, HDFS-7314-8.patch, HDFS-7314.patch

 It happened in the YARN nodemanager scenario, but it could happen to any long-running service that uses a cached instance of DistributedFileSystem.
 1. The active NN is under heavy load, so it became unavailable for 10 minutes; any DFSClient request will get ConnectTimeoutException.
 2. The YARN nodemanager uses DFSClient for certain write operations, such as the log aggregator or the shared cache in YARN-1492. The renewLease RPC of the DFSClient used by the YARN NM got ConnectTimeoutException.
 {noformat}
 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  Aborting ...
 {noformat}
 3. After the DFSClient is in the Aborted state, the YARN NM can't use that cached instance of DistributedFileSystem.
 {noformat}
 2014-10-29 20:26:23,991 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc...
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
 at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
 at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
 at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
 at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {noformat}
 We can make YARN or DFSClient more tolerant to temporary NN unavailability. Given that the call stack is YARN -> DistributedFileSystem -> DFSClient, this can be addressed at different layers (a sketch of the application-level option follows this description):
 * YARN closes the DistributedFileSystem object when it receives some well-defined exception; the next HDFS call will then create a new instance of DistributedFileSystem. We would have to fix all the relevant places in YARN, and other HDFS applications would need to address this as well.
 * DistributedFileSystem detects the aborted DFSClient and creates a new instance of DFSClient. We would need to fix all the places where DistributedFileSystem calls DFSClient.
 * After DFSClient gets into the Aborted state, it doesn't have to reject all requests; instead it can retry, and if the NN is available again it can transition back to the healthy state.
 Comments?
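Of the three options, the first can live entirely in the application. A minimal sketch of that application-side recovery, assuming the application can key off the "Filesystem closed" message that DFSClient#checkOpen produces (the class name and the retry-once policy are illustrative, not from any HDFS-7314 patch):

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: on "Filesystem closed", close the dead cached
// FileSystem (which evicts it from the static FileSystem cache) and
// retry once against the freshly created instance.
public class FsRecoveringStat {
  public static FileStatus statWithRecovery(Configuration conf, Path path)
      throws IOException {
    try {
      return FileSystem.get(conf).getFileStatus(path);
    } catch (IOException e) {
      if (!"Filesystem closed".equals(e.getMessage())) {
        throw e; // not the aborted-client case; rethrow
      }
      // FileSystem#close removes the instance from the FileSystem cache,
      // so the next get() constructs a new DistributedFileSystem.
      FileSystem.get(conf).close();
      return FileSystem.get(conf).getFileStatus(path);
    }
  }
}
{code}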
 

[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2015-01-29 Thread Ming Ma (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297133#comment-14297133 ]

Ming Ma commented on HDFS-7314:
---

Thanks, [~jira.shegalov]. That is interesting. That might work when applications request a new FileSystem object. However, there is the scenario where applications still hold a reference to the aborted FileSystem object and want to use it to create files; wouldn't those applications then need to be modified to catch the exception and recreate the FileSystem object? At the beginning of the jira, one of the 3 proposed solutions is to keep DistributedFileSystem alive and recreate DFSClient. Regardless of the approach, it will be good to keep it transparent to the applications.
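For comparison, a hedged sketch of that "keep the wrapper alive, recreate DFSClient" option; nothing here is taken from the HDFS-7314 patches, and the holder class, the {{ClientCall}} interface, and the retry-once policy are illustrative assumptions:

{code}
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSClient;

// Hypothetical holder that gives applications one long-lived handle and
// transparently swaps in a new DFSClient after the old one aborts.
public class SelfHealingClientHolder {
  public interface ClientCall<T> {
    T apply(DFSClient client) throws IOException;
  }

  private final URI uri;
  private final Configuration conf;
  private DFSClient dfs;

  public SelfHealingClientHolder(URI uri, Configuration conf)
      throws IOException {
    this.uri = uri;
    this.conf = conf;
    this.dfs = new DFSClient(uri, conf);
  }

  /** Runs the call; if the client aborted, creates a new one and retries once. */
  public synchronized <T> T run(ClientCall<T> call) throws IOException {
    try {
      return call.apply(dfs);
    } catch (IOException e) {
      // DFSClient#checkOpen reports an aborted/closed client this way.
      if (!"Filesystem closed".equals(e.getMessage())) {
        throw e;
      }
      dfs = new DFSClient(uri, conf);
      return call.apply(dfs);
    }
  }
}
{code}

An application would hold one such holder for its lifetime instead of a raw client, which is the kind of transparency discussed above.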



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2015-01-29 Thread Gera Shegalov (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297266#comment-14297266 ]

Gera Shegalov commented on HDFS-7314:
-

Actually, I need to take #1 back; I misspoke. DFS#close does call super.close():
{code}
  @Override
  public void close() throws IOException {
    try {
      dfs.closeOutputStreams(false);
      super.close();
    } finally {
      dfs.close();
    }
  }
{code}

So it's only about #2.



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2015-01-28 Thread Gera Shegalov (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296296#comment-14296296 ]

Gera Shegalov commented on HDFS-7314:
-

I think the real problem is that the {{FileSystem}}-level CACHE entry is not invalidated/evicted even though the DFSClient is closed.

# DistributedFileSystem#close does not call super.close(), which would achieve this.
# DFSClient#abort does not close the wrapping DFS object, nor does DFS try to intercept checkOpen to do this.

Solving these issues would solve the scenario described in the JIRA. What do you think, [~mingma]?
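The eviction contract at stake can be shown with a minimal toy model (plain Java, not the Hadoop source; {{CachedFs}} and its cache are illustrative): if close() skips the eviction step, get() keeps returning a dead instance, which is exactly the aborted-DFSClient symptom described in the JIRA.

{code}
import java.io.Closeable;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Toy model of a cached, FileSystem-like object. The crucial part is the
// cache eviction in close(); without it, get() keeps handing out an
// instance whose operations all fail with "Filesystem closed".
class CachedFs implements Closeable {
  private static final Map<String, CachedFs> CACHE =
      new HashMap<String, CachedFs>();

  private final String key;
  private volatile boolean open = true;

  private CachedFs(String key) {
    this.key = key;
  }

  static synchronized CachedFs get(String key) {
    CachedFs fs = CACHE.get(key);
    if (fs == null) {
      fs = new CachedFs(key);
      CACHE.put(key, fs);
    }
    return fs;
  }

  void checkOpen() throws IOException {
    if (!open) {
      throw new IOException("Filesystem closed");
    }
  }

  @Override
  public void close() {
    open = false;
    synchronized (CachedFs.class) {
      // Evict ourselves so the next get() constructs a fresh instance.
      if (CACHE.get(key) == this) {
        CACHE.remove(key);
      }
    }
  }
}
{code}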



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2015-01-28 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296425#comment-14296425 ]

Hadoop QA commented on HDFS-7314:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12680729/HDFS-7314-7.patch
  against trunk revision 5a0051f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9366//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9366//console

This message is automatically generated.


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-20 Thread Ming Ma (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219683#comment-14219683 ]

Ming Ma commented on HDFS-7314:
---

Thanks, Colin.

1. There is an existing static method, {{LeaseRenewer#getInstance}}. The methods of {{LeaseRenewer}} and {{LeaseRenewer#Factory}} are synchronized, but only at the level of each class instance. Some of these race conditions come from the lack of synchronization between class instances. We can try to fix those scenarios.

2. Alternatively, we can get rid of the {{LeaseRenewer}} thread/object recycle logic. For a short-duration program like MR job submission, it won't kick in anyway. For long-running services like YARN, it doesn't really matter: they already create several long-running threads, so it should be OK to keep a few {{LeaseRenewer}} threads around. In addition, given that these services might use HDFS regularly, {{LeaseRenewer}} threads would be recreated or just kept around anyway.



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-19 Thread Colin Patrick McCabe (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219046#comment-14219046 ]

Colin Patrick McCabe commented on HDFS-7314:


I need to think about this more. I think the root of the problem is the decision to expose the {{LeaseRenewer}} thread object instances outside LeaseRenewer.java, rather than simply having static methods (or the equivalent) that act on the current lease renewer for your UGI, without forcing you to know or care what that is.

I am fine with fixing this in another JIRA, but I really feel like we should fix it first. I don't feel good about the current synchronization at all.

Thanks for your patience, [~mingma].
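A minimal, self-contained sketch of that API shape (toy Java, not the HDFS code; {{LeaseRenewals}}, {{register}}, and {{unregister}} are hypothetical names): the renewer bookkeeping stays private, and callers only ever touch static methods keyed by a (ugi, NameNode) string, so no stale renewer reference can escape.

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical facade: clients register/unregister against a key and never
// see the per-key renewer state, which this class manages internally.
public final class LeaseRenewals {
  private static final Map<String, Set<String>> CLIENTS_BY_KEY =
      new HashMap<String, Set<String>>();

  private LeaseRenewals() {}

  public static synchronized void register(String key, String clientName) {
    Set<String> clients = CLIENTS_BY_KEY.get(key);
    if (clients == null) {
      clients = new HashSet<String>();
      CLIENTS_BY_KEY.put(key, clients);
      // A real implementation would start the renewal thread for this
      // key here; omitted in this sketch.
    }
    clients.add(clientName);
  }

  public static synchronized void unregister(String key, String clientName) {
    Set<String> clients = CLIENTS_BY_KEY.get(key);
    if (clients != null && clients.remove(clientName) && clients.isEmpty()) {
      CLIENTS_BY_KEY.remove(key);
      // A real implementation would stop the renewal thread here.
    }
  }
}
{code}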



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-13 Thread Colin Patrick McCabe (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211284#comment-14211284 ]

Colin Patrick McCabe commented on HDFS-7314:


I feel like this code is still not quite right.  We can get two 
{{LeaseRenewer}} objects now, right?

1. beginFileLease calls into getLeaseRenewer, gets LeaseRenewer #1
2. LeaseRenewer#closeClient (for LeaseRenewer #1) removes itself from 
Factory.INSTANCE.
3. another thread calls beginFileLease.  There is no LeaseRenewer object in 
Factory.INSTANCE any more, so a new one is created (call it #2).
4. first thread calls put, adds the DFSClient to LeaseRenewer #1 and LR1 to 
Factory.INSTANCE
5. second thread calls put, adds the DFSClient to LeaseRenewer #2 and LR2 to 
Factory.INSTANCE.

Won't we end up with two {{LeaseRenewer}} objects after this point?

The problem is basically that if we allow the {{LeaseRenewer}} object to escape 
from LeaseRenewer.java, and we accept that these objects can die, we have to 
accept that people can be using dead LeaseRenewer objects.

I'm not sure what the best way to fix this is... it is kind of a mess.  I guess 
maybe it's a pre-existing problem too?  If I'm understanding the situation 
correctly.
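The interleaving can be replayed deterministically with a toy factory that mimics the remove-when-empty behavior (self-contained Java; all names are illustrative, not the actual LeaseRenewer code):

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Deterministic replay of the interleaving above: after step 5, two live
// renewers exist for the same key, each tracking a different client.
public class TwoRenewersDemo {
  static class Renewer {
    final int id;
    final List<String> clients = new ArrayList<String>();
    Renewer(int id) { this.id = id; }
  }

  static class Factory {
    static final Map<String, Renewer> INSTANCES =
        new HashMap<String, Renewer>();
    static int nextId = 1;

    static synchronized Renewer get(String key) {
      Renewer r = INSTANCES.get(key);
      if (r == null) {
        r = new Renewer(nextId++);
        INSTANCES.put(key, r);
      }
      return r;
    }

    static synchronized void remove(String key, Renewer r) {
      if (INSTANCES.get(key) == r) {
        INSTANCES.remove(key);
      }
    }
  }

  public static void main(String[] args) {
    Renewer lr1 = Factory.get("ugi@nn");  // 1. beginFileLease gets LR #1
    Factory.remove("ugi@nn", lr1);        // 2. LR #1 removes itself
    Renewer lr2 = Factory.get("ugi@nn");  // 3. another thread creates LR #2
    lr1.clients.add("dfsClientA");        // 4. first thread puts into LR #1
    lr2.clients.add("dfsClientB");        // 5. second thread puts into LR #2
    System.out.println("LR" + lr1.id + " and LR" + lr2.id
        + " are both alive for the same key");
  }
}
{code}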



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-13 Thread Ming Ma (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211584#comment-14211584 ]

Ming Ma commented on HDFS-7314:
---

Thanks Colin for the good point. I also noticed that during the analysis, but assumed it was part of the original design.

1. The issue you described is in trunk. It can happen when a LeaseRenewer goes away, due to SocketTimeoutException or RenewerExpired.
2. In your steps above, the LeaseRenewer object is added to Factory.INSTANCE in step #3, not in steps #4 and #5. But that doesn't change the issue: when the first thread calls endFileLease, it will get hold of LR2, so LR1 will keep renewing the lease even after all files have been closed.

It appears we have discovered a bunch of race conditions regardless of whether the original issue is addressed or not. Given that, we can consider fixing the original issue here and opening another jira to address these race conditions.

As you mentioned, the issues come from the fact that LeaseRenewer tries to clean up the object and the thread when they are no longer used. IMHO, that is not necessary; we can just keep LeaseRenewer objects and their threads around once they are created, which was the idea in the original patch. LeaseRenewer objects are keyed by NN address and ugi. In a normal setup, with HDFS federation you can have several NN addresses, but the number of ugis should be limited. So it isn't expensive to keep these objects and their threads around.

If the long-term fix is to keep the LeaseRenewer object and thread around, we can start with the fix for SocketTimeoutException in this patch and address the RenewerExpired scenario in a later patch.


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-10 Thread Colin Patrick McCabe (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205204#comment-14205204 ]

Colin Patrick McCabe commented on HDFS-7314:


{code}
@@ -450,10 +455,11 @@ private void run(final int id) throws InterruptedException {
               + (elapsed/1000) + " seconds.  Aborting ...", ie);
           synchronized (this) {
             while (!dfsclients.isEmpty()) {
-              dfsclients.get(0).abort();
+              DFSClient dfsClient = dfsclients.get(0);
+              dfsClient.closeAllFilesBeingWritten(true);
+              closeClient(dfsClient);
             }
           }
-          break;
         } catch (IOException ie) {
           LOG.warn("Failed to renew lease for " + clientsString() + " for "
               + (elapsed/1000) + " seconds.  Will retry shortly ...", ie);
{code}
It seems like getting rid of the {{break}} here is going to lead to the {{LeaseRenewer}} thread for the client continuing to run after the client's lease has been aborted. This doesn't seem like what we want? After all, we are going to create a new {{LeaseRenewer}} if the {{DFSClient}} opens another file for write.



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-10 Thread Ming Ma (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205241#comment-14205241 ]

Ming Ma commented on HDFS-7314:
---

Thanks, Colin. The reason to keep the thread running is to handle the following race condition.

1. The LeaseRenewer thread is aborting.
2. The application creates files before the LeaseRenewer is removed from the factory, so the DFSClient is added to the LeaseRenewer object.
3. The LeaseRenewer thread exits, so nobody will renew the lease for that DFSClient.



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-10 Thread Colin Patrick McCabe (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205270#comment-14205270 ]

Colin Patrick McCabe commented on HDFS-7314:


Good catch.  This code is certainly somewhat subtle.  I think that the 
{{currentId}} variable was intended to address the problem you're describing.

Keeping the thread running seems strange.  Is it going to abort the clients 
it's tracking more than once?  I would rather stop it if at all possible.

It seems like maybe what we should do here is set {{emptyTime}} to 0 and break 
out of the loop to exit the thread.  This will lead to the current 
{{LeaseRenewer}} thread being considered expired and not used in 
{{LeaseRenewer#put}}.  So there should be no race condition then, because 
{{LeaseRenewer#put}} will create a new thread (and increment {{currentId}}) if 
the current one is expired.
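A toy rendering of that proposal (the {{emptyTime}} convention loosely follows LeaseRenewer, where Long.MAX_VALUE means the renewer still has clients; this is a sketch under those assumptions, not the actual source):

{code}
import java.util.ArrayList;
import java.util.List;

// Sketch of the expire-on-abort idea: abort() zeroes emptyTime so the
// renewer counts as expired immediately; put() then refuses it, forcing
// the factory to create a replacement with a fresh thread.
class ToyRenewer {
  private static final long GRACE_MS = 60 * 1000L;

  // Long.MAX_VALUE means "has clients"; any other value is the time at
  // which the renewer went empty.
  private long emptyTime = Long.MAX_VALUE;
  private final List<String> clients = new ArrayList<String>();

  synchronized boolean isExpired() {
    return emptyTime != Long.MAX_VALUE
        && System.currentTimeMillis() - emptyTime > GRACE_MS;
  }

  synchronized void abort() {
    clients.clear();
    emptyTime = 0;  // expired immediately, regardless of the grace period
    notifyAll();    // wake the run() loop so the daemon thread can exit
  }

  /** Returns false if this renewer is expired and must be replaced. */
  synchronized boolean put(String clientName) {
    if (isExpired()) {
      return false;
    }
    clients.add(clientName);
    emptyTime = Long.MAX_VALUE;
    return true;
  }
}
{code}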



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-10 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205631#comment-14205631 ]

Hadoop QA commented on HDFS-7314:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12680685/HDFS-7314-6.patch
  against trunk revision 68a0508.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestDistributedFileSystem

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8707//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8707//console

This message is automatically generated.

 Aborted DFSClient's impact on long running service like YARN
 

 Key: HDFS-7314
 URL: https://issues.apache.org/jira/browse/HDFS-7314
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Ming Ma
Assignee: Ming Ma
 Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch,
 HDFS-7314-5.patch, HDFS-7314-6.patch, HDFS-7314.patch


 It happened in the YARN nodemanager scenario, but it could happen to any long
 running service that uses a cached instance of DistributedFileSystem.
 1. The active NN is under heavy load, so it becomes unavailable for 10 minutes;
 any DFSClient request gets ConnectTimeoutException.
 2. The YARN nodemanager uses DFSClient for certain write operations, such as log
 aggregation or the shared cache in YARN-1492. The DFSClient used by the YARN NM's
 renewLease RPC got ConnectTimeoutException.
 {noformat}
 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to
 renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.
 Aborting ...
 {noformat}
 3. After the DFSClient is in the Aborted state, the YARN NM can't use that cached
 instance of DistributedFileSystem.
 {noformat}
 2014-10-29 20:26:23,991 INFO
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Failed to download rsrc...
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
 at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
 at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
 at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
 at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {noformat}
 We can make YARN or DFSClient more tolerant of temporary NN unavailability.
 Given the call stack is YARN -> DistributedFileSystem -> DFSClient, this can
 be addressed at different layers.
 * YARN closes the DistributedFileSystem object when it receives certain
 well-defined exceptions; the next HDFS call then creates a new instance of
 DistributedFileSystem (a sketch of this option follows below). We would have to
 fix all the relevant places in YARN, and other HDFS applications would need to
 address this as well.
 * DistributedFileSystem detects the aborted DFSClient and creates a new instance
 of DFSClient. We would need to fix all the places DistributedFileSystem calls
 DFSClient.
 * After DFSClient gets into the Aborted state, it doesn't have to reject all
 requests; instead it can retry, and if the NN becomes available again it can
 transition back to a healthy state.
 Comments?
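
To make the first option concrete, here is a minimal, hypothetical sketch. RefreshingFs is not an existing YARN or HDFS class, and the retry-once policy and the string match on the "Filesystem closed" message are illustrative assumptions:

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical wrapper a long running service could use: on the
 * "Filesystem closed" IOException thrown by an aborted DFSClient's
 * checkOpen, drop the cached FileSystem and retry once with a fresh one.
 */
public class RefreshingFs {
  private final Configuration conf = new Configuration();
  private FileSystem fs;

  private synchronized FileSystem fs() throws IOException {
    if (fs == null) {
      // newInstance bypasses the shared FileSystem cache entry.
      fs = FileSystem.newInstance(conf);
    }
    return fs;
  }

  private synchronized void invalidate() throws IOException {
    if (fs != null) {
      fs.close();  // discard the instance wrapping the aborted DFSClient
      fs = null;
    }
  }

  public FileStatus getFileStatus(Path p) throws IOException {
    try {
      return fs().getFileStatus(p);
    } catch (IOException e) {
      if (!"Filesystem closed".equals(e.getMessage())) {
        throw e;
      }
      invalidate();
      return fs().getFileStatus(p);  // retry once with a new instance
    }
  }
}
{code}

The drawback is exactly what the description notes: every call site in YARN, and in any other long running HDFS application, would need this kind of wrapping.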

[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205997#comment-14205997
 ] 

Hadoop QA commented on HDFS-7314:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12680729/HDFS-7314-7.patch
  against trunk revision 58e9bf4.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8711//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8711//console

This message is automatically generated.


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201763#comment-14201763
 ] 

Hadoop QA commented on HDFS-7314:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12680087/HDFS-7314-4.patch
  against trunk revision ba0a42c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.server.namenode.TestFsck
org.apache.hadoop.hdfs.server.namenode.TestDeleteRace

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8686//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8686//console

This message is automatically generated.


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-07 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202971#comment-14202971
 ] 

Colin Patrick McCabe commented on HDFS-7314:


bq. It turns out a new bug not related to this was discovered by this change.
If the DataStreamer thread exits and closes the stream before the application
closes it, DFSClient will keep renewing the lease. That is because DataStreamer's
closeInternal marks the stream closed but doesn't call DFSClient's endFileLease.
Later, when the application closes the stream, it skips DFSClient's endFileLease
because the stream has already been closed.

You're right that there is a bug here.  There is a lot of discussion about what 
to do about this issue in HDFS-4504.  It's not as simple as just calling 
{{endFileLease}}... if we missed calling {{completeFile}}, the NN will continue 
to think that we have a lease open on this file.  I think we should avoid 
modifying {{DFSOutputStream#close}} here.  We should try to keep this JIRA 
focused on just the description.  Plus HDFS-4504 is a complex issue, not easy 
to solve.

{{TestDFSClientRetries.java}}: let's get rid of the unnecessary whitespace 
change in the current patch.

I like the idea of getting rid of the {{DFSClient#abort}} function.

The patch looks good once these things are removed, should be ready to go soon!
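
For illustration, a toy model of the sequence described in the quote; these are not the real DFSOutputStream/DFSClient classes, just a self-contained sketch of how a streamer-side close that skips endFileLease leaves the application-side close with nothing to do:

{code}
// Toy classes, not the HDFS ones: they only model the closed-flag handoff.
class ToyClient {
  void endFileLease(ToyStream s) {
    // In real HDFS this stops lease renewal for the file (elided here).
  }
}

class ToyStream {
  private final ToyClient client = new ToyClient();
  private boolean closed = false;

  // Called from the streamer thread on an unrecoverable error.
  void streamerCloseInternal() {
    closed = true;  // the bug being discussed: no client.endFileLease(this)
  }

  // Called later by the application.
  public void close() {
    if (closed) {
      return;  // also skips endFileLease, so the lease is renewed forever
    }
    closed = true;
    client.endFileLease(this);
  }
}
{code}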


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203256#comment-14203256
 ] 

Hadoop QA commented on HDFS-7314:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12680355/HDFS-7314-5.patch
  against trunk revision 4a114dd.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8696//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8696//console

This message is automatically generated.


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-06 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200568#comment-14200568
 ] 

Colin Patrick McCabe commented on HDFS-7314:


bq. 1. abort is only used for this scenario. After we have LeaseRenewer call 
abortOpenFiles, abort won't be called by any functions.

Good point.  Let's get rid of {{DFSClient#abort}} completely then.  We don't 
need this function any more.

bq. 2. In addition to having DFSClient call closeAllFilesBeingWritten,
LeaseRenewer also needs to remove the DFSClient from its list via
dfsclients.remove(dfsc); so that DFSClient doesn't renew the lease when there
are no files open. This is achieved via LeaseRenewer's closeClient.

When a lease timeout occurs, {{LeaseRenewer}} can just call
{{DFSClient#closeAllFilesBeingWritten(abort=true)}} and then call
{{LeaseRenewer#closeClient}} on itself. This avoids the need to
modify {{LeaseRenewer#closeClient}}.

{code}
@@ -447,16 +453,17 @@ private void run(final int id) throws InterruptedException {
   lastRenewed = Time.now();
 } catch (SocketTimeoutException ie) {
   LOG.warn("Failed to renew lease for " + clientsString() + " for "
-  + (elapsed/1000) + " seconds.  Aborting ...", ie);
+  + ((Time.now() - lastRenewed)/1000) + " seconds.  Aborting ...",
+  ie);
   synchronized (this) {
 while (!dfsclients.isEmpty()) {
{code}

I don't think we need this change and the other similar change.
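
A self-contained sketch of the timeout flow suggested above; the names mirror {{LeaseRenewer}} and {{DFSClient}}, but this is a simplified toy, not the actual classes:

{code}
import java.util.ArrayList;
import java.util.List;

class ToyRenewer {
  interface Client {
    void closeAllFilesBeingWritten(boolean abort);
  }

  private final List<Client> clients = new ArrayList<>();

  synchronized void addClient(Client c) {
    clients.add(c);
  }

  synchronized void closeClient(Client c) {
    clients.remove(c);  // an empty renewer has nothing left to renew
  }

  // Invoked when lease renewal has failed for longer than the hard limit.
  synchronized void onLeaseTimeout() {
    while (!clients.isEmpty()) {
      Client c = clients.get(0);
      c.closeAllFilesBeingWritten(true);  // abort=true: open streams only
      closeClient(c);                     // deregister; client stays usable
    }
  }
}
{code}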


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201218#comment-14201218
 ] 

Hadoop QA commented on HDFS-7314:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12679945/HDFS-7314-3.patch
  against trunk revision 1670578.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestDFSClientRetries

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.qjournal.client.TestQuorumJournalManager

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8681//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8681//console

This message is automatically generated.


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-05 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14198765#comment-14198765
 ] 

Colin Patrick McCabe commented on HDFS-7314:


HDFS-7314-2.patch just seems to rename {{abort}} to {{abortOpenFiles}}.  What I 
was suggesting was creating a separate function, different from {{abort}}, 
which the {{LeaseRenewer}} would call.  Actually, looking at it, I wonder if 
the lease renewer can just call {{closeAllFilesBeingWritten}}?  I haven't 
looked at it in detail so maybe there's something else the lease renewer needs 
to do, but this at least looks like a good start.

We don't need all this {{boolean removeFromFactory}} stuff.  {{getInstance}} 
will re-add the {{DFSClient}} to the map later if needed.
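
A toy sketch of the factory behavior referred to here; ToyRenewerFactory is not the real {{LeaseRenewer}} factory, it only illustrates why an explicit {{removeFromFactory}} flag is unnecessary when {{getInstance}} re-registers clients on demand:

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ToyRenewerFactory {
  static class Renewer {
    private final Set<Object> clients = new HashSet<>();
    synchronized void addClient(Object c) {
      clients.add(c);
    }
  }

  private final Map<String, Renewer> renewers = new HashMap<>();

  // Keyed by something like the namenode authority in the real code.
  synchronized Renewer getInstance(String key, Object client) {
    Renewer r = renewers.computeIfAbsent(key, k -> new Renewer());
    r.addClient(client);  // re-adds the client if it was removed earlier
    return r;
  }
}
{code}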



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-05 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14198998#comment-14198998
 ] 

Ming Ma commented on HDFS-7314:
---

Thanks, Colin. Here are more explanations for the changes. Please let me know 
your thoughts. Appreciate your input.

1. {{abort}} is only used for this scenario. After we have {{LeaseRenewer}} 
call {{abortOpenFiles}}, {{abort}} won't be called by any functions.
2. In addition to having {{DFSClient}} call {{closeAllFilesBeingWritten}},
{{LeaseRenewer}} also needs to remove the {{DFSClient}} from its list via
{{dfsclients.remove(dfsc);}} so that {{DFSClient}} doesn't renew the lease when
there are no files open. This is achieved via {{LeaseRenewer}}'s
{{closeClient}}.
3. On whether {{LeaseRenewer}} should be removed from the factory when it gets
SocketTimeoutException: given that the {{LeaseRenewer}} thread won't exit on
SocketTimeoutException as part of this fix, removing the {{LeaseRenewer}} object
from the factory could leak the {{LeaseRenewer}} thread even though the old
{{LeaseRenewer}} object isn't used by other objects. In reality,
{{LeaseRenewer}} won't be removed from the factory inside {{closeClient}},
because {{isRenewerExpired()}} will return false. So {{removeFromFactory}} is
there mostly for the semantics; it isn't strictly necessary.



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195968#comment-14195968
 ] 

Hadoop QA commented on HDFS-7314:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12679168/HDFS-7314.patch
  against trunk revision 2bb327e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8638//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8638//console

This message is automatically generated.


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-04 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14196603#comment-14196603
 ] 

Colin Patrick McCabe commented on HDFS-7314:


Thanks, [~mingma].  It's interesting that all the unit tests pass with the 
changed behavior of {{DFSClient#abort}}.

I would prefer not to add this new configuration key, because I really can't 
think of any cases where I'd like to set it to {{true}}.

I think it would be better just to have the lease timeout logic call a function 
other than {{DFSClient#abort}}.  Basically create something like 
{{DFSClient#abortOpenFiles}} and have the lease timeout code call this instead 
of abort.  That way we don't get confused about what abort means, but we also 
have the nice behavior that our client continues to be useful after a lease 
timeout.
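
A simplified sketch of that split; SketchClient is not the actual DFSClient, and the field and method names are stand-ins:

{code}
class SketchClient {
  private volatile boolean clientRunning = true;

  void closeAllFilesBeingWritten(boolean abort) {
    // abort or cleanly close every open output stream (elided)
  }

  /** Existing semantics: the whole client rejects all further requests. */
  void abort() {
    clientRunning = false;
    closeAllFilesBeingWritten(true);
  }

  /** Proposed: called on lease timeout; only open streams are torn down. */
  void abortOpenFiles() {
    closeAllFilesBeingWritten(true);
    // clientRunning stays true, so later calls still pass checkOpen()
  }
}
{code}

The key difference is the clientRunning flag: after abortOpenFiles the client keeps accepting new calls, which is the behavior the nodemanager scenario needs.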



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14197061#comment-14197061
 ] 

Hadoop QA commented on HDFS-7314:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12679296/HDFS-7314-2.patch
  against trunk revision 1eed102.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8642//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8642//console

This message is automatically generated.


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-03 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194790#comment-14194790
 ] 

Colin Patrick McCabe commented on HDFS-7314:


bq. I think DFSClient#abort() can be changed so that only the existing output
streams are aborted. The underlying IPC client can try to reopen the connection
later.

Good idea.  The only thing to watch out for is that some unit tests might be 
using {{abort}} and expecting the current semantics.  Perhaps we can create a 
new function, {{abortOpenStreams}}?



[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-10-31 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191986#comment-14191986
 ] 

Kihwal Lee commented on HDFS-7314:
--

I think DFSClient#abort() can be changed so that only the existing output
streams are aborted. The underlying IPC client can try to reopen the connection
later.
