[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2014-03-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918845#comment-13918845
 ] 

Tsuyoshi OZAWA commented on YARN-1778:
--

A log of the test failure is available here: 
https://builds.apache.org/job/PreCommit-YARN-Build/3234//testReport/

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2014-03-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918899#comment-13918899
 ] 

Tsuyoshi OZAWA commented on YARN-1778:
--

The error message reported on HDFS-6048 is exactly the same.

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2014-04-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973413#comment-13973413
 ] 

Wangda Tan commented on YARN-1778:
--

I just tried, but cannot reproduce it. Can anyone else try to reproduce it as 
well?

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2014-04-18 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973879#comment-13973879
 ] 

Tsuyoshi OZAWA commented on YARN-1778:
--

I tried, but currently I cannot reproduce this problem. [~xgong], should we 
close this issue for now and reopen it when we find a way to reproduce it on 
trunk?

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-01-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297629#comment-14297629
 ] 

Jason Lowe commented on YARN-1778:
--

This test is still failing:

{noformat}
2015-01-28 22:54:42,813 INFO  [main] zookeeper.ZKTestCase 
(ZKTestCase.java:failed(65)) - FAILED testFSRMStateStoreClientRetry
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertFalse(Assert.java:64)
at org.junit.Assert.assertFalse(Assert.java:74)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore.testFSRMStateStoreClientRetry(TestFSRMStateStore.java:289)
{noformat}

https://builds.apache.org/job/PreCommit-YARN-Build/6442//testReport/ is one 
example with more details, and I've seen this fail recently on other precommit 
builds.

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-03 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303027#comment-14303027
 ] 

zhihai xu commented on YARN-1778:
-

This test fails randomly. When it fails, the exception is 
"java.io.IOException: NameNode still not started".
I found that the reason for this random failure is that the NameNode 
constructor [sets the started 
flag|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L780]
 at the end:
{code}
this.started.set(true);
{code}
However, it starts the 
[NameNodeRpcServer|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L765]
 earlier, by calling initialize before the started flag is set:
initialize => startCommonServices => [rpcServer.start|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L639]
{code}
rpcServer.start();
{code}
If the client (which tries to call mkdirs) connects to the NameNode server 
before the started flag is set, the java.io.IOException "[NameNode still not 
started|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java#L1875]"
 is thrown and the test fails. If the client connects to the NameNode server 
after the started flag is set, the test succeeds.
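
For reference, the server-side check that produces this message looks roughly 
like the following (paraphrased from the NameNodeRpcServer source linked 
above; treat the details as approximate):
{code}
// Paraphrased sketch: every NameNodeRpcServer handler first verifies the
// started flag, so an RPC that lands in the startup gap is rejected.
// RetriableException is org.apache.hadoop.ipc.RetriableException.
private void checkNNStartup() throws IOException {
  if (!nn.isStarted()) {
    // produces the "NameNode still not started" message seen by the test
    throw new RetriableException(nn.getRole() + " still not started");
  }
}
{code}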

Although we could fix this error by reordering the code in the NameNode 
constructor (moving rpcServer.start to the end, just before the started flag 
is set), this looks like normal behavior for the NameNode: it throws an 
exception when clients connect before the started flag is set. Since the gap 
is very small, it has little side effect, and clients can retry if the 
exception happens.
I submitted a patch that adds a delay before calling 
storeApplicationStateInternal and does not treat this exception ("NameNode 
still not started") as an error.
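
A rough sketch of that idea (not the actual YARN-1778.000.patch; doStore() is 
a hypothetical stand-in for the storeApplicationStateInternal call):
{code}
// Sketch only: tolerate the small startup gap by retrying the store call,
// while still failing fast on any other exception.
for (int attempt = 0; attempt < 10; attempt++) {
  try {
    doStore();  // hypothetical stand-in for storeApplicationStateInternal
    break;
  } catch (IOException e) {
    if (e.getMessage() == null
        || !e.getMessage().contains("still not started")) {
      throw e;  // only the startup-gap error is treated as retriable
    }
    Thread.sleep(100);  // the gap is very small, per the analysis above
  }
}
{code}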

The race described above is confirmed by the following logs:
{code}
2015-02-03 00:09:18,991 INFO  [Thread-2] namenode.NameNode 
(NameNode.java:(754)) - create NameNode
2015-02-03 00:09:18,991 INFO  [Thread-2] namenode.NameNode 
(NameNode.java:setClientNamenodeAddress(352)) - fs.defaultFS is 
hdfs://localhost:57792
2015-02-03 00:09:18,991 INFO  [Thread-2] namenode.NameNode 
(NameNode.java:setClientNamenodeAddress(372)) - Clients are to use 
localhost:57792 to access this namenode/service.
2015-02-03 00:09:18,992 INFO  [Thread-2] namenode.NameNode 
(NameNode.java:(766)) - create NameNode initialize
2015-02-03 00:09:18,996 INFO  [Thread-2] hdfs.DFSUtil 
(DFSUtil.java:httpServerTemplateForNNAndJN(1760)) - Starting Web-server for 
hdfs at: http://localhost:57791
2015-02-03 00:09:18,997 INFO  [Thread-2] http.HttpRequestLog 
(HttpRequestLog.java:getRequestLog(80)) - Http request log for 
http.requests.namenode is not defined
2015-02-03 00:09:18,997 INFO  [Thread-2] http.HttpServer2 
(HttpServer2.java:addGlobalFilter(621)) - Added global filter 'safety' 
(class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2015-02-03 00:09:18,998 INFO  [Thread-2] http.HttpServer2 
(HttpServer2.java:addFilter(599)) - Added filter static_user_filter 
(class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to 
context hdfs
2015-02-03 00:09:18,998 INFO  [Thread-2] http.HttpServer2 
(HttpServer2.java:addFilter(606)) - Added filter static_user_filter 
(class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to 
context static
2015-02-03 00:09:18,999 INFO  [Thread-2] http.HttpServer2 
(NameNodeHttpServer.java:initWebHdfs(86)) - Added filter 
'org.apache.hadoop.hdfs.web.AuthFilter' 
(class=org.apache.hadoop.hdfs.web.AuthFilter)
2015-02-03 00:09:18,999 INFO  [Thread-2] http.HttpServer2 
(HttpServer2.java:addJerseyResourcePackage(525)) - addJerseyResourcePackage: 
packageName=org.apache.hadoop.hdfs.server.namenode.web.resources;org.apache.hadoop.hdfs.web.resources,
 pathSpec=/webhdfs/v1/*
2015-02-03 00:09:19,000 INFO  [Thread-2] http.HttpServer2 
(HttpServer2.java:openListeners(808)) - Jetty bound to port 57791
2015-02-03 00:09:19,000 INFO  [Thread-2] mortbay.log (Slf4jLog.java:info(67)) - 
jetty-6.1.26
2015-02-03 00:09:19,019 INFO  [Thread-2] mortbay.log (Slf4jLog.java:info(67)) - 
Started SelectChannelConnector@localhost:57791
2015-02-03 00:09:19,021 INFO  [Thread-2] namenode.FSNamesystem 
(FSNamesystem.java:(721)) - No KeyProvider found.
2015-02-03 00:09:19,021 INFO  [Thread-2] namenode.FSNamesystem 
(FSNamesystem.java:(731)) - fsLock is fair:true
2015-02-03 00:09:19,022 INFO  [Thread-2] blockmanagement.DatanodeManager 
(DatanodeManager.java:(239)) - dfs.block.invalidate.limit=1000
2015-02-03 00:09:19,022 INFO  [Thread-2] blockmanagement.DatanodeManager 
(DatanodeManager.java:(245)) - 
dfs.namenode.datanode.registration.ip-hostname-check=t
{code}

[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303095#comment-14303095
 ] 

Hadoop QA commented on YARN-1778:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12696129/YARN-1778.000.patch
  against trunk revision 8cb4731.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6488//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6488//console

This message is automatically generated.

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303471#comment-14303471
 ] 

Jason Lowe commented on YARN-1778:
--

Thanks for the analysis and patch, [~zxu]!  I'm wondering if the test is trying 
to tell us there really is a problem with FSRMStateStore retries, and therefore 
fixing the test is actually masking a real problem that needs to be fixed in 
the main code.  If I understand the intent of the test correctly, it's trying 
to verify that FSRMStateStore will not throw an exception while namenodes are 
down or coming back up.  However if we make the test wait until the namenodes 
are back up before trying to connect then that defeats most of the point of the 
test.

I think the critical question is: should the "Namenode still not started" 
exception be retried by either the DFSClient layer or by FSRMStateStore?  I 
think it should, otherwise a client of FSRMStateStore is going to see this 
exception in a similar, real-world scenario where the Namenode was restarted 
and wonder why the framework didn't auto-retry.

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303547#comment-14303547
 ] 

Tsuyoshi OZAWA commented on YARN-1778:
--

I'd like to +1 retrying at the FSRMStateStore layer. We also have the option 
of adding a configuration setting for the maximum number of retries.

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-03 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303632#comment-14303632
 ] 

zhihai xu commented on YARN-1778:
-

Hi [~jlowe], that is a very good suggestion. I saw some HDFS errors from 
FSRMStateStore in YARN-2820 which caused an RM restart, and I already started 
thinking last night about adding retry logic in YARN-2820 to help 
FSRMStateStore recover from HDFS errors.
I will make these changes together in YARN-2820, post the patch there in the 
next one or two days, and resolve this issue in YARN-2820.
Thanks [~ozawa] for the review.

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303776#comment-14303776
 ] 

Jason Lowe commented on YARN-1778:
--

We may also want to check if this should be handled in DFSClient rather than 
FSRMStateStore.  In other words, is there a reason to have FSRMStateStore make 
a special case for this, or should all HDFS clients get the benefit of the 
retry logic?

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-03 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303866#comment-14303866
 ] 

zhihai xu commented on YARN-1778:
-

Hi [~jlowe], thanks for the information. I think retrying in the HDFS client 
would have a lot of corner cases to cover, and it may not be easy to cover 
them all. For example, in YARN-2820 we hit an HDFS IOException even after the 
HDFS client retried at dfsClient.namenode.complete, which is the low-level 
(sub-function) retry underneath FileSystemRMStateStore#updateFile, as shown in 
the following log:
{code}
2014-10-29 23:49:12,202 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
Updating info for attempt: appattempt_1409135750325_109118_01 at: 
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01

2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01.new.tmp retrying...

2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01.new.tmp retrying...

2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete
/tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/
appattempt_1409135750325_109118_01.new.tmp retrying...

2014-10-29 23:49:46,283 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
Error updating info for attempt: appattempt_1409135750325_109118_01
java.io.IOException: Unable to close file because the last block does not have 
enough number of replicas.
2014-10-29 23:49:46,284 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
Error storing/updating appAttempt: appattempt_1409135750325_109118_01
2014-10-29 23:49:46,916 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED.
{code}
The HDFS client retry is low level; it doesn't know how the upper layer uses 
it. IMO, it makes sense to also retry in the upper layer, covering the whole 
operation, similar to retrying at different network layers: the physical 
layer, the link layer, and the TCP/IP layer.



> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-05 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306840#comment-14306840
 ] 

zhihai xu commented on YARN-1778:
-

I submitted a patch, YARN-2820.000.patch, at YARN-2820 for review. I verified 
that YARN-2820.000.patch can recover from "java.io.IOException: NameNode still 
not started" by retrying, so it will solve this issue.

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-05 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306866#comment-14306866
 ] 

Tsuyoshi OZAWA commented on YARN-1778:
--

[~zxu] cc: [~jlowe] Thank you for the investigation. 
DFSOutputStream#completeFile includes retry logic, but it's hard-coded for 
now:

{code}
  if (retries == 0) {
throw new IOException("Unable to close file because the last block"
+ " does not have enough number of replicas.");
  }
  retries--;
  Thread.sleep(localTimeout);
  localTimeout *= 2;
  if (Time.now() - localstart > 5000) {
DFSClient.LOG.info("Could not complete " + src + " retrying...");
  }
{code}

How about making these timeouts and retry counts configurable, set via 
fs.state-store.num-retries and fs.state-store.retry-interval-ms? It would be a 
simpler way to deal with this problem.
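
A minimal sketch of that suggestion (both keys are proposed names, not 
existing properties, and the defaults here are illustrative):
{code}
// Sketch: the proposed keys would replace the hard-coded values in the
// quoted loop (the retry count and the sleep between attempts).
int retries = conf.getInt("fs.state-store.num-retries", 5);
long localTimeout = conf.getLong("fs.state-store.retry-interval-ms", 400L);
{code}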

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-05 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308663#comment-14308663
 ] 

zhihai xu commented on YARN-1778:
-

[~ozawa], I'm not sure what you mean. The retry count is not hard-coded, based 
on the following code in 
[DFSOutputStream#completeFile|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L1540]:
{code}
int retries = dfsClient.getConf().nBlockWriteLocateFollowingRetry;
{code}
nBlockWriteLocateFollowingRetry is determined by the configuration property 
"dfs.client.block.write.locateFollowingBlock.retries".
The problem for me is that the retry in DFSOutputStream#completeFile doesn't 
work here. Based on the log, it retried 5 times over more than 30 seconds and 
still failed; the resulting exception "Unable to close file because the last 
block does not have enough number of replicas", raised from 
[FileSystemRMStateStore#writeFile|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java#L583],
 caused an RM restart.
My patch will work better, with retries at both the high layer (new code) and 
the low layer (old code), because it retries in 
FileSystemRMStateStore#writeFile: if any exception happens, it will [overwrite 
the 
file|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java#L581]
 and redo everything, as sketched below.
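
A rough sketch of that shape (simplified; the method name, the fields, and the 
retry parameters are illustrative, not the actual patch):
{code}
// Illustrative high-level retry in the spirit of
// FileSystemRMStateStore#writeFile: any failure redoes the whole write,
// which is safe because create(path, true) overwrites the stale temp file.
private void writeFileWithRetries(Path outputPath, byte[] data)
    throws Exception {
  Path tempPath =
      new Path(outputPath.getParent(), outputPath.getName() + ".tmp");
  for (int attempt = 0; ; attempt++) {
    try {
      FSDataOutputStream out = fs.create(tempPath, true);  // overwrite = true
      out.write(data);
      out.close();  // completeFile runs here; low-level retry still applies
      fs.rename(tempPath, outputPath);
      return;
    } catch (IOException e) {
      if (attempt >= maxRetries) {   // assumed field: configured retry limit
        throw e;  // retries exhausted; surface the failure to the caller
      }
      Thread.sleep(retryIntervalMs); // assumed field: configured interval
    }
  }
}
{code}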

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-13 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319725#comment-14319725
 ] 

Tsuyoshi OZAWA commented on YARN-1778:
--

[~zxu], sorry for the confusion. I meant that we should make the retry period 
configurable - it's hard-coded as 5000 msec for now.

{code}
The problem for me is the retry in DFSOutputStream#completeFile doesn't work. 
{code}

How about making the retry count bigger in 
FileSystemRMStateStore#startInternal? I think that would work well.

{code}
My patch will work better with retry at both high layer(new code) and low 
layer(old code) because it retry in FileSystemRMStateStore#writeFile, if any 
exception happen, it will overwrite the file and redo everything.
{code}

What kind of failure are you thinking about? I think retrying completeFile 
here is the more straightforward and simpler solution.

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-16 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323050#comment-14323050
 ] 

zhihai xu commented on YARN-1778:
-

[~ozawa],
That is a good idea. Although we could increase 
"dfs.client.block.write.locateFollowingBlock.retries" in the configuration 
file, and FileSystemRMStateStore would pick up the change in startInternal via 
the following code, the change would also affect all the other modules that 
read that configuration. That may not be acceptable.
{code}
Configuration conf = new Configuration(getConfig());
fs = fsWorkingPath.getFileSystem(conf);
{code}
To increase flexibility, we can create a new configuration property to 
customize "dfs.client.block.write.locateFollowingBlock.retries" for 
FileSystemRMStateStore only, similar to how 
FS_RM_STATE_STORE_RETRY_POLICY_SPEC customizes "dfs.client.retry.policy.spec" 
for FileSystemRMStateStore in the following code from startInternal:
{code}
String retryPolicy =
conf.get(YarnConfiguration.FS_RM_STATE_STORE_RETRY_POLICY_SPEC,
  YarnConfiguration.DEFAULT_FS_RM_STATE_STORE_RETRY_POLICY_SPEC);
conf.set("dfs.client.retry.policy.spec", retryPolicy);
{code}
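
A minimal sketch of that idea (the new YARN key name here is hypothetical, not 
final):
{code}
// Sketch: plumb a state-store-specific retry count into the DFS client key,
// so only the state store's FileSystem instance is affected.
Configuration conf = new Configuration(getConfig());
int blockWriteRetries =
    conf.getInt("yarn.fs.state-store.num-retries", 5);  // hypothetical key
conf.setInt("dfs.client.block.write.locateFollowingBlock.retries",
    blockWriteRetries);
fs = fsWorkingPath.getFileSystem(conf);  // other modules keep their own conf
{code}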
I will implement a new patch based on this.
Thanks for the suggestion.
zhihai

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

2015-02-16 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323719#comment-14323719
 ] 

zhihai xu commented on YARN-1778:
-

Hi [~ozawa], I uploaded a new patch at YARN-2820. Could you review it? Thanks.

> TestFSRMStateStore fails on trunk
> -
>
> Key: YARN-1778
> URL: https://issues.apache.org/jira/browse/YARN-1778
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Assignee: zhihai xu
> Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)