[ 
https://issues.apache.org/jira/browse/HADOOP-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169559#comment-16169559
 ] 

Xiao Chen commented on HADOOP-14521:
------------------------------------

Sorry for the delayed response here; I was on leave last week.

Perhaps [my previous 
comment|https://issues.apache.org/jira/browse/HADOOP-14521?focusedCommentId=16159027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16159027]
 was lost among the revert messages - to restate the TL;DR:
This change will make the clients _not_ retry on certain IOEs. That means if a 
client upgrades, what was 'working' for them before will no longer 'work'. 
In other words, clients who did not have to be aware of 
HADOOP-14445/HADOOP-14841/any undiscovered bugs in that pattern will now see 
them and fail.

Any server bugs aside, I don't see a reason why we cannot keep the existing 
behavior and only add more retries for the {{failoverOnNetworkException}} 
types of exceptions, which seems to be the main reason to have this jira. In 
other words, the 'change' I'm proposing to add on top of the reverted patch is: 
s/TryOnceThenFail/Try(providers.length)ThenFail/g
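
To make that substitution concrete, here is a rough, self-contained sketch of the retry decision I have in mind. It is not the actual KMS client code: {{KmsRetrySketch}}, {{ProviderOp}}, {{doOp}} and {{maxNetworkAttempts}} are hypothetical names, and the set of exceptions treated as 'network' exceptions is only an assumption for illustration. The idea is that a non-network IOE still gets one attempt per provider (the existing behavior), while network exceptions get a larger budget.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: none of these names come from the Hadoop code base.
class KmsRetrySketch {

  /** Hypothetical stand-in for one KMS call made against one provider. */
  interface ProviderOp<T> {
    T call(String provider) throws IOException;
  }

  /** Assumption: which IOEs count as "network" exceptions; the real policy may differ. */
  static boolean isNetworkException(IOException e) {
    return e instanceof ConnectException
        || e instanceof NoRouteToHostException
        || e instanceof UnknownHostException
        || e instanceof SocketTimeoutException;
  }

  /**
   * Tries providers round-robin. A non-network IOE is rethrown once the total
   * number of failed attempts reaches providers.size() ("Try(providers.length)ThenFail");
   * a network exception is rethrown once it reaches maxNetworkAttempts.
   */
  static <T> T doOp(List<String> providers, int maxNetworkAttempts, ProviderOp<T> op)
      throws IOException {
    if (providers.isEmpty()) {
      throw new IllegalArgumentException("at least one provider is required");
    }
    int n = providers.size();
    int failures = 0;
    while (true) {
      // Move on to the next provider after every failed attempt.
      String provider = providers.get(failures % n);
      try {
        return op.call(provider);
      } catch (IOException e) {
        failures++;
        int limit = isNetworkException(e) ? maxNetworkAttempts : n;
        if (failures >= limit) {
          throw e;
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    List<String> providers = Arrays.asList("kms1", "kms2", "kms3");
    // kms1 fails with a non-network IOE; the call is retried and succeeds on kms2.
    String result = doOp(providers, 6, p -> {
      if (p.equals("kms1")) {
        throw new IOException("simulated server-side error");
      }
      return "ok from " + p;
    });
    System.out.println(result);
  }
}
```

With TryOnceThenFail in place of the providers.size() limit, the first non-network IOE from kms1 would surface to the caller immediately, which is the regression described above.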

[~shahrs87] are we on the same page now?

> KMS client needs retry logic
> ----------------------------
>
>                 Key: HADOOP-14521
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14521
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 2.6.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HADOOP-14521.09.patch, 
> HADOOP-14521-branch-2.8.002.patch, HADOOP-14521-branch-2.8.2.patch, 
> HADOOP-14521-trunk-10.patch, HDFS-11804-branch-2.8.patch, 
> HDFS-11804-trunk-1.patch, HDFS-11804-trunk-2.patch, HDFS-11804-trunk-3.patch, 
> HDFS-11804-trunk-4.patch, HDFS-11804-trunk-5.patch, HDFS-11804-trunk-6.patch, 
> HDFS-11804-trunk-7.patch, HDFS-11804-trunk-8.patch, HDFS-11804-trunk.patch
>
>
> The KMS client appears to have no retry logic – at all.  It's completely 
> decoupled from the IPC retry logic.  This has major impacts if the KMS is 
> unreachable for any reason, including but not limited to network connection 
> issues, timeouts, and the +restart during an upgrade+.
> This has some major ramifications:
> # Jobs may fail to submit, although oozie resubmit logic should mask it
> # Non-oozie launchers may experience higher failure rates if they do not 
> already have retry logic.
> # Tasks reading EZ files will fail, though this will probably be masked by 
> framework reattempts
> # EZ file creation fails after creating a 0-length file – client receives 
> EDEK in the create response, then fails when decrypting the EDEK
> # Bulk hadoop fs copies, and maybe distcp, will prematurely fail


