[ 
https://issues.apache.org/jira/browse/HADOOP-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159027#comment-16159027
 ] 

Xiao Chen commented on HADOOP-14521:
------------------------------------

Thanks for the prompt response [~shahrs87].
bq. The previous behavior was just masking the bugs on the server side. 
True, and I agree the server-side bugs should be fixed.

However, as noted in the last comment, in practice the existing behavior allows 
a client request to succeed after retry. With this patch, clients will 
fail immediately on the first failure. This incompatible behavior is painful for 
the client, and is the main reason I raised the concern above.
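
To make the difference concrete, here is a minimal sketch of the two behaviors. 
It is not the actual KMS client code: the Provider interface and both method 
names are hypothetical, and treating only connection-level errors as retriable 
is an assumption made for illustration. The patch's real policy may classify 
errors differently.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.List;

public class KmsRetrySketch {

    /** Hypothetical stand-in for one KMS instance; not a Hadoop interface. */
    interface Provider {
        String decrypt(String edek) throws IOException;
    }

    /** Existing behavior (simplified): any IOException moves on to the next instance. */
    static String failoverOnAnyError(List<Provider> providers, String edek)
            throws IOException {
        IOException last = null;
        for (Provider p : providers) {
            try {
                return p.decrypt(edek);
            } catch (IOException e) {
                last = e; // remember the error and try the next instance
            }
        }
        throw last != null ? last : new IOException("no providers configured");
    }

    /**
     * Patched behavior (simplified, assumed): only connection-level errors
     * trigger failover; any other IOException is rethrown to the caller at once.
     */
    static String failoverOnRetriableOnly(List<Provider> providers, String edek)
            throws IOException {
        IOException last = null;
        for (Provider p : providers) {
            try {
                return p.decrypt(edek);
            } catch (ConnectException e) {
                last = e; // assumed retriable: try the next instance
            } catch (IOException e) {
                throw e;  // assumed non-retriable: fail on the first failure
            }
        }
        throw last != null ? last : new IOException("no providers configured");
    }
}
```

With two instances where the first hits a server-side bug and the second is 
healthy, failoverOnAnyError returns the second instance's result, while 
failoverOnRetriableOnly surfaces the first instance's exception to the client.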

HADOOP-14445 and HADOOP-14841 are just examples of this kind of failure. 
Although they are nasty bugs, my biggest concern now is not specific to any of 
them. Rather, it's the behavior change that causes a previously working client 
to stop working. This sounds pretty pressing to me.

I can think of a few ways to keep the existing behavior, but with 3.0beta and 
2.8.2 coming, I'm inclined to revert this for now and re-commit it after 
improvements. [~andrew.wang] / [~djp] FYI.

> KMS client needs retry logic
> ----------------------------
>
>                 Key: HADOOP-14521
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14521
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 2.6.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>             Fix For: 2.9.0, 3.0.0-beta1, 2.8.2
>
>         Attachments: HADOOP-14521.09.patch, 
> HADOOP-14521-branch-2.8.002.patch, HADOOP-14521-branch-2.8.2.patch, 
> HADOOP-14521-trunk-10.patch, HDFS-11804-branch-2.8.patch, 
> HDFS-11804-trunk-1.patch, HDFS-11804-trunk-2.patch, HDFS-11804-trunk-3.patch, 
> HDFS-11804-trunk-4.patch, HDFS-11804-trunk-5.patch, HDFS-11804-trunk-6.patch, 
> HDFS-11804-trunk-7.patch, HDFS-11804-trunk-8.patch, HDFS-11804-trunk.patch
>
>
> The kms client appears to have no retry logic – at all.  It's completely 
> decoupled from the ipc retry logic.  This has major impacts if the KMS is 
> unreachable for any reason, including but not limited to network connection 
> issues, timeouts, the +restart during an upgrade+.
> This has some major ramifications:
> # Jobs may fail to submit, although oozie resubmit logic should mask it
> # Non-oozie launchers may experience higher failure rates if they do not 
> already have retry logic.
> # Tasks reading EZ files will fail, though this will probably be masked by 
> framework reattempts
> # EZ file creation fails after creating a 0-length file – client receives 
> EDEK in the create response, then fails when decrypting the EDEK
> # Bulk hadoop fs copies, and maybe distcp, will prematurely fail


