[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2016-07-04 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15360989#comment-15360989
 ] 

He Xiaoqiao commented on HDFS-7858:
---

hi [~asuresh] when i patch this to 2.7.1, it throws some exception when submit 
job as following:
{quote}
2016-07-01 17:45:37,497 WARN [pool-9-thread-2] org.apache.hadoop.ipc.Client: 
Exception encountered while connecting to the server : 
java.nio.channels.ClosedByInterruptException
2016-07-01 17:45:37,542 WARN [pool-10-thread-2] org.apache.hadoop.ipc.Client: 
Exception encountered while connecting to the server : 
java.nio.channels.ClosedByInterruptException
2016-07-01 17:45:37,571 INFO [main] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Emitting job 
history data to the timeline server is not enabled
2016-07-01 17:45:37,572 WARN [pool-11-thread-2] org.apache.hadoop.ipc.Client: 
Exception encountered while connecting to the server : 
java.nio.channels.ClosedByInterruptException
2016-07-01 17:45:37,573 INFO [main] 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Recovery is enabled. Will try 
to recover from previous life on best effort basis.
2016-07-01 17:45:37,633 INFO [main] 
org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system 
[viewfs://ha/]
2016-07-01 17:45:37,698 INFO [main] 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at 
viewfs://ha/hadoop-yarn/staging/yarn/.staging/job_1467365572539_3212/job_1467365572539_3212_1.jhist
2016-07-01 17:45:37,713 WARN [main] 
org.apache.hadoop.hdfs.server.namenode.ha.RequestHedgingProxyProvider: 
Invocation returned exception on [nn1host/ip:port]
2016-07-01 17:45:37,716 WARN [pool-12-thread-2] org.apache.hadoop.ipc.Client: 
Exception encountered while connecting to the server : 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category READ is not supported in state standby
2016-07-01 17:45:37,717 WARN [main] 
org.apache.hadoop.hdfs.server.namenode.ha.RequestHedgingProxyProvider: 
Invocation returned exception on [nn2host/ip:port]
2016-07-01 17:45:37,725 WARN [main] 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Unable to parse prior job 
history, aborting recovery
MultiException[{java.util.concurrent.ExecutionException: 
java.lang.reflect.InvocationTargetException, 
java.util.concurrent.ExecutionException: 
java.lang.reflect.InvocationTargetException, }]
at 
org.apache.hadoop.hdfs.server.namenode.ha.RequestHedgingProxyProvider$RequestHedgingInvocationHandler.invoke(RequestHedgingProxyProvider.java:133)
at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201)
at 
org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
at 
org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:303)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:299)
at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:161)
at 
org.apache.hadoop.fs.viewfs.ChRootedFileSystem.open(ChRootedFileSystem.java:257)
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.open(ViewFileSystem.java:423)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:788)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.getPreviousJobHistoryStream(MRAppMaster.java:1199)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.parsePreviousJobHistory(MRAppMaster.java:1203)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.processRecovery(MRAppMaster.java:1175)
at 
org.apache.hadoop.mapreduce.v2.app.MRApp

[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-09-24 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906384#comment-14906384
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7858:
---

Never mind.  Thanks for the response.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.13.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Use hedged RPCs to simultaneously call multiple configured NNs to decide 
> which is the active Namenode.
> 2) Subsequent calls, will invoke the previously successful NN.
> 3) On failover of the currently active NN, the remaining NNs will be invoked 
> to decide which is the new active 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-09-23 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904584#comment-14904584
 ] 

Arun Suresh commented on HDFS-7858:
---

[~szetszwo], Unfortunately, I do not remember the specifics, but think it went 
in to minutes..

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.13.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Use hedged RPCs to simultaneously call multiple configured NNs to decide 
> which is the active Namenode.
> 2) Subsequent calls, will invoke the previously successful NN.
> 3) On failover of the currently active NN, the remaining NNs will be invoked 
> to decide which is the new active 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-09-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901642#comment-14901642
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7858:
---

> ... then those clients might not get a response soon enough to try the other 
> NN.

[~asuresh], do you recall how long have you seen for the client waiting?  I 
might hit a similar problem recently.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.13.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Use hedged RPCs to simultaneously call multiple configured NNs to decide 
> which is the active Namenode.
> 2) Subsequent calls, will invoke the previously successful NN.
> 3) On failover of the currently active NN, the remaining NNs will be invoked 
> to decide which is the new active 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644543#comment-14644543
 ] 

Hudson commented on HDFS-7858:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2216 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2216/])
HDFS-7858. Improve HA Namenode Failover detection on the client. (asuresh) 
(Arun Suresh: rev 030fcfa99c345ad57625486eeabedebf2fd4411f)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/ConfiguredFailoverProxyProvider.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestRequestHedgingProxyProvider.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/RequestHedgingProxyProvider.java
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithNFS.md
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/MultiException.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithQJM.md
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.13.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644532#comment-14644532
 ] 

Hudson commented on HDFS-7858:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #267 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/267/])
HDFS-7858. Improve HA Namenode Failover detection on the client. (asuresh) 
(Arun Suresh: rev 030fcfa99c345ad57625486eeabedebf2fd4411f)
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/MultiException.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestRequestHedgingProxyProvider.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithQJM.md
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/ConfiguredFailoverProxyProvider.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithNFS.md
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/RequestHedgingProxyProvider.java


> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.13.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644432#comment-14644432
 ] 

Hudson commented on HDFS-7858:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #259 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/259/])
HDFS-7858. Improve HA Namenode Failover detection on the client. (asuresh) 
(Arun Suresh: rev 030fcfa99c345ad57625486eeabedebf2fd4411f)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithQJM.md
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestRequestHedgingProxyProvider.java
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/MultiException.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/RequestHedgingProxyProvider.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithNFS.md
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/ConfiguredFailoverProxyProvider.java


> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.13.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644318#comment-14644318
 ] 

Hudson commented on HDFS-7858:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2197 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2197/])
HDFS-7858. Improve HA Namenode Failover detection on the client. (asuresh) 
(Arun Suresh: rev 030fcfa99c345ad57625486eeabedebf2fd4411f)
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/MultiException.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/RequestHedgingProxyProvider.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithQJM.md
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithNFS.md
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/ConfiguredFailoverProxyProvider.java
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestRequestHedgingProxyProvider.java


> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.13.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644217#comment-14644217
 ] 

Hudson commented on HDFS-7858:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1000 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1000/])
HDFS-7858. Improve HA Namenode Failover detection on the client. (asuresh) 
(Arun Suresh: rev 030fcfa99c345ad57625486eeabedebf2fd4411f)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestRequestHedgingProxyProvider.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithQJM.md
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithNFS.md
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/MultiException.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/ConfiguredFailoverProxyProvider.java
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/RequestHedgingProxyProvider.java


> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.13.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644206#comment-14644206
 ] 

Hudson commented on HDFS-7858:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #270 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/270/])
HDFS-7858. Improve HA Namenode Failover detection on the client. (asuresh) 
(Arun Suresh: rev 030fcfa99c345ad57625486eeabedebf2fd4411f)
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/MultiException.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/RequestHedgingProxyProvider.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestRequestHedgingProxyProvider.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithQJM.md
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/ConfiguredFailoverProxyProvider.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithNFS.md
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java


> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.13.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643911#comment-14643911
 ] 

Hudson commented on HDFS-7858:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8231 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8231/])
HDFS-7858. Improve HA Namenode Failover detection on the client. (asuresh) 
(Arun Suresh: rev 030fcfa99c345ad57625486eeabedebf2fd4411f)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestRequestHedgingProxyProvider.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/RequestHedgingProxyProvider.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithNFS.md
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSHighAvailabilityWithQJM.md
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/ConfiguredFailoverProxyProvider.java
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/MultiException.java


> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.13.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-27 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643902#comment-14643902
 ] 

Arun Suresh commented on HDFS-7858:
---

Test Case error was due to a timing issue. Modified test case to ensure that it 
doesn't happen

Committed to trunk and branch-2 

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.13.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643869#comment-14643869
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  22m 33s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:red}-1{color} | javac |   7m 38s | The applied patch generated  2  
additional warning messages. |
| {color:green}+1{color} | javadoc |   9m 51s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 57s | Site still builds. |
| {color:green}+1{color} | checkstyle |   2m  2s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 27s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 24s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  22m 20s | Tests passed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests | 158m 53s | Tests failed in hadoop-hdfs. |
| | | 233m  4s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.hdfs.server.namenode.ha.TestRequestHedgingProxyProvider |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12747474/HDFS-7858.13.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle site |
| git revision | trunk / 3572ebd |
| javac | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11852/artifact/patchprocess/diffJavacWarnings.txt
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11852/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11852/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11852/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11852/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.13.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643603#comment-14643603
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  21m 42s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:red}-1{color} | javac |   7m 31s | The applied patch generated  1  
additional warning messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 59s | Site still builds. |
| {color:green}+1{color} | checkstyle |   2m  1s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 27s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 19s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | common tests |  22m 20s | Tests failed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests | 162m 20s | Tests failed in hadoop-hdfs. |
| | | 235m 17s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.ipc.TestRPC |
|   | hadoop.fs.TestLocalFsFCStatistics |
|   | hadoop.hdfs.server.namenode.ha.TestStandbyIsHot |
|   | hadoop.hdfs.server.namenode.ha.TestRequestHedgingProxyProvider |
|   | hadoop.hdfs.server.namenode.TestFsck |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12747399/HDFS-7858.12.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle site |
| git revision | trunk / f36835f |
| javac | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11848/artifact/patchprocess/diffJavacWarnings.txt
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11848/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11848/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11848/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11848/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, HDFS-7858.4.patch, 
> HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, HDFS-7858.8.patch, 
> HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-27 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643339#comment-14643339
 ] 

Jing Zhao commented on HDFS-7858:
-

Thanks for working on this, [~asuresh]! The latest patch looks good to me. +1. 
Also agree with [~arpitagarwal] that we can keep testing and improving this 
since this is currently not default.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.12.patch, 
> HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, HDFS-7858.4.patch, 
> HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, HDFS-7858.8.patch, 
> HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-27 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643102#comment-14643102
 ] 

Arpit Agarwal commented on HDFS-7858:
-

I am +1 on the v11 patch. {{RequestHedgingProxyProvider}} is disabled by 
default so remaining issues can be addressed separately to avoid spinning on 
this forever. :-)

# One optimization with your new approach - In the common HA case with two 
NameNodes, after {{performFailover}} is called, {{toIgnore}} will be non-null. 
We don't need to create a thread pool/completion service, we can simply send 
the request to the single proxy in the callers thread.
# The TODO is not technically a TODO. We can just document this property in the 
class Javadoc that it can block indefinitely and depends on the caller 
implementing a timeout.
# Couple of documentation nitpicks:
## _The two implementations which currently ships_ -> _The two implementations 
which currently ship_
## _so use these_ --> _so use one of these unless you are using a custom proxy 
provider_

Will hold off committing in case [~jingzhao] has any further comments. Thanks 
for working on this [~asuresh].

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, 
> HDFS-7858.7.patch, HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641780#comment-14641780
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  19m 52s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 34s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 32s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 59s | Site still builds. |
| {color:green}+1{color} | checkstyle |   1m 39s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 28s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 21s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  22m 21s | Tests passed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests | 160m 51s | Tests failed in hadoop-hdfs. |
| | | 231m 36s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestAppendSnapshotTruncate |
|   | hadoop.hdfs.TestDistributedFileSystem |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12747199/HDFS-7858.11.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle site |
| git revision | trunk / 156f24e |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11840/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11840/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11840/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11840/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.11.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, 
> HDFS-7858.7.patch, HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641570#comment-14641570
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  21m 50s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 34s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 39s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 59s | Site still builds. |
| {color:green}+1{color} | checkstyle |   1m 59s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 29s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   4m 24s | The patch appears to introduce 2 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  22m 27s | Tests passed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests | 163m 44s | Tests failed in hadoop-hdfs. |
| | | 237m  6s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs |
| Failed unit tests | hadoop.hdfs.TestDistributedFileSystem |
|   | hadoop.hdfs.server.namenode.ha.TestStandbyIsHot |
| Timed out tests | org.apache.hadoop.hdfs.server.mover.TestStorageMover |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12747173/HDFS-7858.10.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle site |
| git revision | trunk / adcf5dd |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11838/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11838/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11838/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11838/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11838/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.10.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, 
> HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, 
> HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641238#comment-14641238
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  21m 46s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 35s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 43s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   3m  0s | Site still builds. |
| {color:green}+1{color} | checkstyle |   2m  2s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 29s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 31s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   4m 26s | The patch appears to introduce 1 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  22m 16s | Tests passed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests | 158m 42s | Tests failed in hadoop-hdfs. |
| | | 231m 57s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs |
| Failed unit tests | hadoop.hdfs.TestDistributedFileSystem |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12747095/HDFS-7858.10.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle site |
| git revision | trunk / d19d187 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11833/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11833/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11833/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11833/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11833/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, HDFS-7858.4.patch, 
> HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, HDFS-7858.8.patch, 
> HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-24 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641067#comment-14641067
 ] 

Jing Zhao commented on HDFS-7858:
-

bq. I need this to get name of the proxy which was successful (so i can key 
into the targetProxies map). CallResult catches the exception and sets it as 
the result.

Yeah, I did notice the exception has been captured by CallResult. But maybe we 
can use a future-->proxy map here? In this way we do not need to have a wrapper 
class like CallResult so maybe the code can be further simplified.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch, 
> HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, HDFS-7858.4.patch, 
> HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, HDFS-7858.8.patch, 
> HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-24 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640857#comment-14640857
 ] 

Jing Zhao commented on HDFS-7858:
-

Thanks again for updating the patch, [~asuresh]! Some minor comments on the 
latest patch:
# Do we need the latch in {{RequestHedgingInvocationHandler#invoke}}?
# I'm not sure if we need requestTimeout. Client/DN now already sets their 
specific socket timeout for their connection to NameNode thus it seems 
redundant to have an extra 2 min timeout when polling the CompletionService.
# We can use the ExecutionException thrown by {{callResultFuture.get()}} to get 
the exception thrown by the invocation.
# Maybe we should use debug/trace here?
{code}
+LOG.info("Invocation successful on ["
++ callResultFuture.get().name + "]");
{code}


> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, 
> HDFS-7858.7.patch, HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640395#comment-14640395
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  26m 54s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  1s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   9m 44s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  11m 53s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 27s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   3m 37s | Site still builds. |
| {color:green}+1{color} | checkstyle |   2m 39s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 45s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 43s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   5m 33s | The patch appears to introduce 1 
new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | common tests |  25m 29s | Tests failed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests | 161m 41s | Tests failed in hadoop-hdfs. |
| | | 250m 31s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs |
| Failed unit tests | hadoop.ha.TestZKFailoverController |
|   | hadoop.hdfs.TestAppendSnapshotTruncate |
|   | hadoop.hdfs.server.namenode.ha.TestRequestHedgingProxyProvider |
|   | hadoop.hdfs.TestDistributedFileSystem |
| Timed out tests | 
org.apache.hadoop.hdfs.server.namenode.ha.TestDelegationTokensWithHA |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12746976/HDFS-7858.9.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle site |
| git revision | trunk / e202efa |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11826/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11826/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11826/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11826/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11826/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, 
> HDFS-7858.7.patch, HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-24 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640152#comment-14640152
 ] 

Arun Suresh commented on HDFS-7858:
---

bq. in case of failover of HA, only one request will be invoked (SNN) in hedged 
invocations. Am I right?
yup.. although in the case of more than 2 NNs, the subsequent request will be 
hedged to ALL remaining NNs except the current failed-over NN. 

bq. This way I feel both ConfiguredFailoverProxyProvider and 
RequestHedgingProxyProvider work same way, except at the very first time. ..
Yup.. as well as the above mentioned condition. 

bq.  ..if no. of proxies to try to are more than 2 then 
RequestHedgingProxyProvider will be best.
yup.. now that HDFS-6440 is resolved, I am hoping ReqHedging would be default. 
It is also useful in cases where there are large number of adhoc clients (MR 
jobs) where many of the calls will be one time calls. 
RequestHedgingProxyProvider will ensure that these tasks don't have to wait for 
a timed-out request / Exception from a Failed NN to failover to failover to the 
SNN.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, 
> HDFS-7858.7.patch, HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-24 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640133#comment-14640133
 ] 

Vinayakumar B commented on HDFS-7858:
-

I have a small question here.
I believe all client operations will successfully talk to only Active NameNode.
In current {{ConfiguredFailoverProxyProvider}}, only at the beginning, when the 
client initializes, there will be a need of trying to both Nodes, if standby 
comes first.
During failover, if ANN goes down and SNN is still not failedover, then client 
has to try again to previous ANN and come back to current SNN to check for the 
failover one more time. Once the successful proxy found, all subsequent 
requests will go there.

In case of proposed {{RequestHedgingProxyProvider}}, Only at the beginning, 
there will not be any failed proxy, at that time hedged requests will goto both 
NNs.
During failover, current failed proxy (prev ANN) will be ignored for hedged 
requests, i.e. in case of failover of HA, only one request will be invoked 
(SNN) in hedged invocations. Am I right?

This way I feel both {{ConfiguredFailoverProxyProvider}} and 
{{RequestHedgingProxyProvider}} work same way, except at the very first time. 
And yes, if no. of proxies to try to are more than 2 then 
{{RequestHedgingProxyProvider}}  will be best.

Am I missing something here?

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, 
> HDFS-7858.7.patch, HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640114#comment-14640114
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  29m 21s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   9m 34s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  12m 13s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 30s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   3m 33s | Site still builds. |
| {color:green}+1{color} | checkstyle |   2m 36s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m  1s | The patch has 2  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 59s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 40s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   5m 36s | The patch appears to introduce 1 
new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | common tests |  24m 55s | Tests failed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests |   0m 25s | Tests failed in hadoop-hdfs. |
| | |  91m 30s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs |
| Failed unit tests | hadoop.io.retry.TestFailoverProxy |
| Failed build | hadoop-hdfs |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12746923/HDFS-7858.8.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle site |
| git revision | trunk / 02c0181 |
| whitespace | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11821/artifact/patchprocess/whitespace.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11821/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11821/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11821/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11821/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11821/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, 
> HDFS-7858.7.patch, HDFS-7858.8.patch, HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640012#comment-14640012
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  21m 48s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 32s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 58s | Site still builds. |
| {color:green}+1{color} | checkstyle |   2m  2s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m  1s | The patch has 2  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 29s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   4m 25s | The patch appears to introduce 1 
new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | common tests |  21m 31s | Tests failed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests | 171m 16s | Tests failed in hadoop-hdfs. |
| | | 243m 36s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs |
| Failed unit tests | hadoop.io.retry.TestFailoverProxy |
|   | hadoop.hdfs.TestDFSInotifyEventInputStream |
|   | hadoop.hdfs.tools.TestDFSZKFailoverController |
|   | hadoop.hdfs.TestDFSClientFailover |
|   | hadoop.hdfs.server.namenode.ha.TestDelegationTokensWithHA |
|   | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
|   | hadoop.hdfs.server.namenode.ha.TestFailoverWithBlockTokensEnabled |
|   | hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication |
|   | hadoop.hdfs.server.namenode.ha.TestDNFencing |
|   | hadoop.hdfs.server.namenode.ha.TestHAMetrics |
|   | hadoop.hdfs.server.namenode.ha.TestXAttrsWithHA |
|   | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
|   | hadoop.hdfs.TestDistributedFileSystem |
|   | hadoop.hdfs.server.namenode.ha.TestHAFsck |
|   | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes |
|   | hadoop.hdfs.server.namenode.ha.TestHAAppend |
|   | hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA |
|   | hadoop.hdfs.server.namenode.ha.TestHASafeMode |
|   | hadoop.hdfs.server.namenode.ha.TestStandbyIsHot |
|   | hadoop.hdfs.server.namenode.ha.TestHAStateTransitions |
|   | hadoop.hdfs.TestEncryptionZonesWithHA |
| Timed out tests | 
org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12746923/HDFS-7858.8.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle site |
| git revision | trunk / ab3197c |
| whitespace | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11817/artifact/patchprocess/whitespace.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11817/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11817/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11817/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11817/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11817/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, 
> HDFS-7858.7.patch, HDFS-7858.8.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a st

[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639890#comment-14639890
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  22m  3s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 39s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 38s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 59s | Site still builds. |
| {color:green}+1{color} | checkstyle |   2m  4s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m  1s | The patch has 2  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 30s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 31s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 21s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  22m 25s | Tests passed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests | 161m  5s | Tests failed in hadoop-hdfs. |
| | | 234m 42s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestDistributedFileSystem |
|   | hadoop.hdfs.server.namenode.ha.TestStandbyIsHot |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12746903/HDFS-7858.7.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle site |
| git revision | trunk / 1d3026e |
| whitespace | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11816/artifact/patchprocess/whitespace.txt
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11816/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11816/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11816/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf900.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11816/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, 
> HDFS-7858.7.patch, HDFS-7858.8.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-23 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639676#comment-14639676
 ] 

Jing Zhao commented on HDFS-7858:
-

bq. So, in the case of the ReqHedgingProxy, performFailover will be called only 
if ALL the proxies have failed (with retry/failover_and_retry.. )

But I guess if "the original successfulProxy is not null", this try only uses 
this single proxy thus the failover_and_retry only talks about it? 

bq. Oh.. I was thinking we should keep trunk Java 7 compilable ?

Java 7 does not require explicit type argument. Java 6 does.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch, 
> HDFS-7858.7.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-22 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637663#comment-14637663
 ] 

Arpit Agarwal commented on HDFS-7858:
-

Thanks for updating the patch Arun. The MultiException approach looks like a 
good alternative to refactoring RetryPolicy.

A few comments:
# I didn't understand the call to {{super.performFailover}} in 
{{RequestHedgingProxyProvider#getProxy}}.
# The documentation in HDFSHighAvailabilityWithQJM.md and 
HDFSHighAvailabilityWithNFS.md should be updated as it states _The only 
implementation which currently ships with Hadoop is the 
ConfiguredFailoverProxyProvider_. Okay to do this in a separate Jira.
# Agree with Jing's suggestion to use a {{CompletionService}}.

Also we should file a task to make {{RequestHedgingProxyProvider}} the default 
eventually.

Nitpicks:
# {{getDelayMillis}} javadoc is wrong.
# {{successfullproxy}} should be {{successfulproxy}}.
# _new LinkedList_ - explicit type argument redundant.
# _static interface ProxyFactory_ - static is redundant.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-22 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637494#comment-14637494
 ] 

Jing Zhao commented on HDFS-7858:
-

Thanks for updating the patch, [~asuresh]! The current approach looks good to 
me. Some quick comments about the patch:
# In {{RequestHedgingInvocationHandler#invoke}}, instead of polling the tasks 
every 10ms, can we use {{CompletionService}} here?
# For {{RequestHedgingProxyProvider#performFailover}}, if the original 
successfulProxy is not null, we can exclude it for the next time retry.



> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632744#comment-14632744
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  18m 44s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 34s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 37s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   2m 27s | The applied patch generated  
11 new checkstyle issues (total was 426, now 436). |
| {color:red}-1{color} | whitespace |   0m  0s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 20s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 23s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  22m 26s | Tests passed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests | 160m 22s | Tests failed in hadoop-hdfs. |
| | | 227m 53s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestAppendSnapshotTruncate |
|   | hadoop.hdfs.TestDistributedFileSystem |
|   | hadoop.hdfs.server.namenode.ha.TestStandbyIsHot |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12745991/HDFS-7858.6.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 176131f |
| checkstyle |  
https://builds.apache.org/job/PreCommit-HDFS-Build/11748/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11748/artifact/patchprocess/whitespace.txt
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11748/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11748/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11748/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11748/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch, HDFS-7858.6.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632690#comment-14632690
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  17m  6s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:red}-1{color} | javac |   7m 34s | The applied patch generated  1  
additional warning messages. |
| {color:red}-1{color} | javadoc |   9m 35s | The applied patch generated  1  
additional warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 24s | The applied patch generated  6 
new checkstyle issues (total was 7, now 12). |
| {color:red}-1{color} | whitespace |   0m  1s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 30s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 35s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 23s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  22m 31s | Tests passed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests | 160m 43s | Tests failed in hadoop-hdfs. |
| | | 226m  4s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestAppendSnapshotTruncate |
|   | hadoop.hdfs.TestDistributedFileSystem |
|   | hadoop.hdfs.server.namenode.ha.TestStandbyIsHot |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12745987/HDFS-7858.5.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 176131f |
| javac | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11747/artifact/patchprocess/diffJavacWarnings.txt
 |
| javadoc | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11747/artifact/patchprocess/diffJavadocWarnings.txt
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-HDFS-Build/11747/artifact/patchprocess/diffcheckstylehadoop-common.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11747/artifact/patchprocess/whitespace.txt
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11747/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11747/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11747/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11747/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch, HDFS-7858.5.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-17 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631806#comment-14631806
 ] 

Arpit Agarwal commented on HDFS-7858:
-

Thanks [~asuresh], I will review your patch by next week.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch, HDFS-7858.4.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630673#comment-14630673
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  18m 51s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:red}-1{color} | javac |   7m 39s | The applied patch generated  1  
additional warning messages. |
| {color:red}-1{color} | javadoc |   9m 45s | The applied patch generated  1  
additional warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 50s | The applied patch generated  6 
new checkstyle issues (total was 7, now 12). |
| {color:red}-1{color} | whitespace |   0m  1s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 32s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   4m 26s | The patch appears to introduce 4 
new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | common tests |  21m 29s | Tests failed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests | 177m  1s | Tests failed in hadoop-hdfs. |
| | | 243m 49s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-common |
| Failed unit tests | hadoop.io.retry.TestFailoverProxy |
|   | hadoop.ipc.TestIPC |
|   | hadoop.io.retry.TestRetryProxy |
|   | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
|   | hadoop.hdfs.server.namenode.ha.TestHASafeMode |
|   | hadoop.hdfs.TestEncryptionZonesWithHA |
|   | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes |
|   | hadoop.hdfs.server.namenode.ha.TestHAAppend |
|   | hadoop.hdfs.tools.TestDFSZKFailoverController |
|   | hadoop.hdfs.server.namenode.ha.TestHAStateTransitions |
|   | hadoop.hdfs.TestDistributedFileSystem |
|   | hadoop.hdfs.server.namenode.ha.TestXAttrsWithHA |
|   | hadoop.hdfs.server.namenode.ha.TestStandbyIsHot |
|   | hadoop.hdfs.server.namenode.TestCheckpoint |
|   | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
|   | hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication |
|   | hadoop.hdfs.server.namenode.TestNameNodeRetryCacheMetrics |
|   | hadoop.hdfs.TestGetBlocks |
|   | hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA |
|   | hadoop.hdfs.server.namenode.ha.TestFailoverWithBlockTokensEnabled |
|   | hadoop.hdfs.server.namenode.TestBlockUnderConstruction |
|   | hadoop.hdfs.TestDFSClientRetries |
|   | hadoop.hdfs.TestDFSInotifyEventInputStream |
|   | hadoop.hdfs.TestDFSClientFailover |
|   | hadoop.hdfs.server.namenode.ha.TestHAFsck |
|   | hadoop.hdfs.server.namenode.ha.TestDNFencing |
| Timed out tests | 
org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12745705/HDFS-7858.4.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 0bda84f |
| javac | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11733/artifact/patchprocess/diffJavacWarnings.txt
 |
| javadoc | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11733/artifact/patchprocess/diffJavadocWarnings.txt
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-HDFS-Build/11733/artifact/patchprocess/diffcheckstylehadoop-common.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11733/artifact/patchprocess/whitespace.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11733/artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11733/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11733/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11733/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf900.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11733/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858

[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-14 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626969#comment-14626969
 ] 

Arun Suresh commented on HDFS-7858:
---

[~arpitagarwal], apologize for sitting on this...

I was trying to refactor this as per [~jingzhao]'s suggestion (replacing 
RetryInvocationHandler with RequestHedgingInvocationHandler). Unfortunately, it 
was turning out to be a more far reaching impact (technically request hedging 
is different from retry.. so the whole policy framework etc. would need to be 
refactored)

If everyone is ok with the current approach, we can punt the larger refactoring 
to another JIRA and I can incorporate [~arpitagarwal]'s suggestion (skip 
standby for subsequent requests) and provide a quick patch.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626963#comment-14626963
 ] 

Hadoop QA commented on HDFS-7858:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply 
the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12702886/HDFS-7858.3.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 979c9ca |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/11704/console |


This message was automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-07-14 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626950#comment-14626950
 ] 

Arpit Agarwal commented on HDFS-7858:
-

Hi [~asuresh], were you thinking of posting an updated patch. The overall 
approach looks good.

One comment from a quick look - RequestHedgingProxyProvider sends all requests 
to both NNs. Should it skip the standby for subsequent requests?

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-04-24 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511949#comment-14511949
 ] 

Chris Nauroth commented on HDFS-7858:
-

This is very interesting.  Thanks for working on it, Arun!

bq. Yup, QJM ensures only 1 namenode can write, but fencing is still 
recommended since there is still a possibility of the stale reads from the old 
Active NN before going down (I am hoping this will not be too much of an issue)

I don't think the patch introduces any new problems here.  If two NameNodes 
think they are active, there is already a risk of reads being served by the 
wrong node.


> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-03-31 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388750#comment-14388750
 ] 

Arun Suresh commented on HDFS-7858:
---

Thanks for the review [~jingzhao].

bq. I'm thinking that how about providing RequestHedgingInvocationHandler as a 
replacement of RetryInvocationHandler? We need to add the retry logic into 
RequestHedgingInvocationHandler but the whole layer may look more clean.

Yup.. that makes sense.. let me give a shot at refactoring.. and will post an 
updated patch shortly

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-03-30 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387816#comment-14387816
 ] 

Jing Zhao commented on HDFS-7858:
-

Thanks for working on this, [~asuresh].

One concern is where we should put the new logic. Looks like the current patch 
wraps things in the following way: 

{{RequestHedgingInvocationHandler}} --> proxy returned by 
{{RequestHedgingProxyProvider#getProxy}} --> {{RetryInvocationHandler}}

I'm not sure if this is the best way to go. RetryInvocationHandler has its own 
logic for retry and failover, which is usually based on the type of the 
exception thrown by the invocation. With the new design, the exception caught 
by {{RetryInvocationHandler}} is identified based on the exceptions thrown by 
all the targets inside of {{RequestHedgingInvocationHandler}}. Since different 
targets may return different exceptions, looks like we cannot guarantee 
{{RetryInvocationHandler}} finally gets the exception from the correct target.

I'm thinking that how about providing {{RequestHedgingInvocationHandler}} as a 
replacement of {{RetryInvocationHandler}}? We need to add the retry logic into 
{{RequestHedgingInvocationHandler}} but the whole layer may look more clean.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-03-30 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387190#comment-14387190
 ] 

Arun Suresh commented on HDFS-7858:
---

ping [~atm], [~jingzhao], [~bikassaha],
Was wondering if I might get a review for the current patch..

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-03-14 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362020#comment-14362020
 ] 

Arun Suresh commented on HDFS-7858:
---

[~kasha],

bq. one is not required to configure a fencing mechanism when using QJM ?
Yup, QJM ensures only 1 namenode can write, but fencing is still recommended 
since there is still a possibility of the stale reads from the old Active NN 
before going down (I am hoping this will not be too much of an issue)

bq.  it would be nice to make the solution here accessible to YARN as well. 
The current patch extends the {{ConfigredFailoverProxyProvider}} in the hdfs 
code base. The {{ConfiguredRMFailoverProxyProvider}} looks like it belongs to 
the same class hierarchy.. so it shouldnt be too hard. But like you mentioned, 
if YARN is not deployed with {{ZKRMStateStore}}, there is a possibility of 
split-brain.. which leads mean to think.. wouldnt it be nice to incorporate QJM 
and JNs into YARN deployment ? thoughts ?

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-03-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361536#comment-14361536
 ] 

Karthik Kambatla commented on HDFS-7858:


If possible, it would be nice to make the solution here accessible to YARN as 
well. 

Simultaneously connecting to all the masters (NNs in HDFS and RMs in YARN) 
might work most of the time. How do we plan to handle a split-brain? In YARN, 
we don't use an explicit fencing mechanism. IIRR, one is not required to 
configure a fencing mechanism when using QJM? 


> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-03-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349609#comment-14349609
 ] 

Hadoop QA commented on HDFS-7858:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12702886/HDFS-7858.3.patch
  against trunk revision 952640f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestDecommission
  org.apache.hadoop.hdfs.server.balancer.TestBalancer

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9752//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9752//console

This message is automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-03-04 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347533#comment-14347533
 ] 

Arun Suresh commented on HDFS-7858:
---

This testcase failure seems unrelated..

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-03-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346880#comment-14346880
 ] 

Hadoop QA commented on HDFS-7858:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12702433/HDFS-7858.2.patch
  against trunk revision 3560180.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.TestFileTruncate

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9730//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9730//console

This message is automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-03-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346668#comment-14346668
 ] 

Hadoop QA commented on HDFS-7858:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12702429/HDFS-7858.2.patch
  against trunk revision 3560180.

{color:red}-1 patch{color}.  Trunk compilation may be broken.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9727//console

This message is automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-03-02 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344503#comment-14344503
 ] 

Jing Zhao commented on HDFS-7858:
-

bq. Optimistically try to connect to both configured NNs simultaneously

I like this better than letting clients connect JNs. JNs are in a critical code 
path for writing editlog and failing to write a quorum of JNs can cause NN to 
kill itself, thus maybe we need to be more careful when letting all clients 
directly connect to them.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-03-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344311#comment-14344311
 ] 

Hadoop QA commented on HDFS-7858:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12701992/HDFS-7858.1.patch
  against trunk revision ca1c00b.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 4 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common:

  
org.apache.hadoop.hdfs.server.namenode.TestNameNodeRetryCacheMetrics

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9699//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9699//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9699//console

This message is automatically generated.

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client

2015-03-02 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343889#comment-14343889
 ] 

Aaron T. Myers commented on HDFS-7858:
--

Hey folks, sorry to come into this discussion so late.

Given that some folks choose to use HDFS HA without auto failover at all, and 
thus without ZKFCs or ZK in sight, I think we should target any solution to 
this problem to work without ZK. I'm also a little leery of using a cache file, 
as I'm afraid of thundering herd effects (if the file is in HDFS or in a home 
dir which is network mounted), and also don't like the fact that in a large 
cluster all users on all machines might need to populate this cache file.

As such, I'd propose that we pursue either of the following two options:

# Optimistically try to connect to both configured NNs simultaneously, thus 
allowing that one (the standby) may take a while to respond, but also expecting 
that the active will always respond rather promptly. This is similar to 
Kihwal's suggestion.
# Have the client connect to the JNs to determine which NN is the likely the 
active. In my experience, even those who don't use automatic failover basically 
always use the QJM. I think those that continue to use NFS-based HA are very 
few and far between.

Thoughts?

> Improve HA Namenode Failover detection on the client
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7858.1.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client using Zookeeper

2015-02-28 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341849#comment-14341849
 ] 

Bikas Saha commented on HDFS-7858:
--

What are long lived client examples? How many such clients would be there in a 
large busy cluster? Will they be setting watches on ZK?

bq. Adding a cached entry to user's home dir to pick last active NN.  If entry 
is not present, the client picks the Standby from the configuration. 
This seems like a reasonable improvement to the current scheme which will allow 
a client to connect to the current active directly (even though it may be 
listed later in the NN names list).

Please do keep in mind that ZK is just a notifier in the leader election 
scheme. The real control lies in the FailoverController which is pluggable. A 
different FailoverController may not use ZK. The status of the master flag may 
not be valid/be-empty while the FailoverController is fencing the old master 
and bringing up the new master.

Getting configuration from ZK is related but probably orthogonal. The entire 
config for HDFS could be downloaded from ZK based on a well known HDFS service 
name.


> Improve HA Namenode Failover detection on the client using Zookeeper
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client using Zookeeper

2015-02-28 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341411#comment-14341411
 ] 

Arun Suresh commented on HDFS-7858:
---

[~bikassaha], you make a very valid point.

I guess the situation you mentioned can be alleviated as follows : 
Considering the fact a client knows apriori, both the Active and Standby, what 
if we do the following : if locally cached active namenode entry has become 
unavailable, yes there will be an initial surge of requests to the failed NN, 
but the client can directly retry to the Standby without consulting ZK. ZK 
connections will happen only in the following cases :
# If no cached entry is present in the user home directory.
# Long living clients 

Also I was thinking, maybe we break this into 2 separate JIRAs :
# Adding a cached entry to user's home dir to pick last active NN. If entry is 
not present, the client picks the Standby from the configuration. No ZK 
involvement for this, it only brings some determinism in which namenode is 
picked first. 
# Have another JIRA to add ZK client optimization. This would in addition to 
the ZK watch feature for long lived clients can bring in probably additional 
benefits such as having only the logical nameservice name in the Configuration. 
Namenodes when it starts up will register under a ZNode and clients find out 
the actual URI of the Active and Standby directly from ZK (like HBase clients). 
Short lived clients would then first query ZK, finding the active and standby 
NN URIs and cache them (rather than reading from the Configuration), so 
subsequent Client invocation do not hit ZK. 

> Improve HA Namenode Failover detection on the client using Zookeeper
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client using Zookeeper

2015-02-27 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341399#comment-14341399
 ] 

Bikas Saha commented on HDFS-7858:
--

bq. The client will proceed to connect to that NN first (thereby removing 
non-determinism from the current scheme).. and will most probably succeed. It 
will contact ZK only if the connection was unsuccessful..
Yes. It will most probably succeed. But when will it not succeed? When that NN 
has failed over or has crashed, right? Which means that every time a known 
primary NN becomes unavailable there will be surge of failed connections to it 
(from cached entries that point to it) and then these connections will be 
redirected to ZK. For a proxy of the number of connections consider MR jobs, 
where every Map task running on every machine has a DFS client to read  from 
HDFS and every Reduce task on every machine has a DFS client to write to HDFS. 
MR tasks are typically short lived clients.

> Improve HA Namenode Failover detection on the client using Zookeeper
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client using Zookeeper

2015-02-27 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341375#comment-14341375
 ] 

Arun Suresh commented on HDFS-7858:
---

[~kihwal], [~bikassaha], I understand your concerns over the use of ZK. But 
consider the following:

# Most DFSClients are cached (since the Filesystem objects are generally 
cached). Thus long lived clients will probably not have more than 1-2 
persistent connections to ZK.
# Short lived clients will first check if there is cached entry (possibly in 
the home directory, something like ~/.lastNN) that contains the last accessed 
active NN. The client will proceed to connect to that NN first (thereby 
removing non-determinism from the current scheme).. and will most probably 
succeed. It will contact ZK only if the connection was unsuccessful.. and we 
can limit this to just a ping (not a watch registration) so the connection is 
not persistent.
# A Client that has connected to a NN without the need for a ZK connection can 
continue to NOT talk to ZK till
## the client dies and no ZK connection is ever made
## after waiting for a configurable time (maybe an hour).. after which it is 
established that it is a ling lived client,  

Do you think this might reduce the total number of connection to ZK at any 
point of time ? 


> Improve HA Namenode Failover detection on the client using Zookeeper
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client using Zookeeper

2015-02-27 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341336#comment-14341336
 ] 

Bikas Saha commented on HDFS-7858:
--

We need to be careful about how many clients can be supported by ZK (either 
pinging for info or watchers). ZK is typically a shared service with YARN/HBase 
etc.

> Improve HA Namenode Failover detection on the client using Zookeeper
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7858) Improve HA Namenode Failover detection on the client using Zookeeper

2015-02-27 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340846#comment-14340846
 ] 

Kihwal Lee commented on HDFS-7858:
--

ZK may not scale to support thousands of clients. I think it will be better to 
use more aggressive timeout and proper retry policy to get around such problems.

> Improve HA Namenode Failover detection on the client using Zookeeper
> 
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>
> In an HA deployment, Clients are configured with the hostnames of both the 
> Active and Standby Namenodes.Clients will first try one of the NNs 
> (non-deterministically) and if its a standby NN, then it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talks to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover so they do not have to query ZK everytime to find out the 
> active NN
> 2) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)