[jira] [Commented] (HADOOP-16284) KMS Cache Miss Storm

Daryn Sharp (JIRA) Fri, 10 May 2019 06:14:10 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-16284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837257#comment-16837257
 ]


Daryn Sharp commented on HADOOP-16284:
--------------------------------------

Wow, that's really bad.  IIRC, "No content to map" is symptomatic of Java's 
HttpURLConnection being completely broken.  It will internally retry failed POSTs 
or PUTs (forget which) and the retry doesn't resend the original payload.  Of 
course it can't buffer the payload w/o a high risk of OOM so not sure why Sun 
thought the retry was a good idea.

The bug is guaranteed to happen under load if Jetty's low resource monitor is 
enabled.  I'd recommend completely disabling it and increasing the client side 
timeout to mitigate the customer impact.
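
A rough client-side sketch of that mitigation, in case it helps. It assumes 
the silent retry described above is the one governed by the JDK system 
property sun.net.http.retryPost (default true); that mapping is my 
assumption, not something confirmed for this issue. The class name, the URL 
and the 120s timeout are made-up placeholders, and this is not how 
KMSClientProvider is actually wired.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class KmsClientSketch {

    /**
     * Ask the JDK not to silently re-send a failed POST.
     * Assumption: this must run before the JVM makes its first HTTP
     * connection, because the JDK reads the property once at startup of
     * its HTTP client code.
     */
    static void disableSilentPostRetry() {
        System.setProperty("sun.net.http.retryPost", "false");
    }

    /**
     * Build (but do not connect) a POST connection with a longer timeout.
     * timeoutMillis applies to both connect and read.
     */
    static HttpURLConnection open(String url, int timeoutMillis) throws IOException {
        disableSilentPostRetry();
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // a request body will be written later
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder endpoint; no network traffic happens until connect().
        HttpURLConnection conn =
            open("http://kms.example.com:16000/kms/v1/keys", 120_000);
        System.out.println(conn.getRequestMethod()
            + " readTimeout=" + conn.getReadTimeout());
    }
}
```

With the retry off, a dropped connection should surface to the caller as an 
IOException instead of an empty-bodied second request, so the caller can 
decide whether to rebuild and resend the payload itself.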

> KMS Cache Miss Storm
> --------------------
>
>                 Key: HADOOP-16284
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16284
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: kms
>    Affects Versions: 2.6.0
>         Environment: CDH 5.13.1, Kerberized, Cloudera Keytrustee Server
>            Reporter: Wei-Chiu Chuang
>            Priority: Major
>         Attachments: 4 kms, no KTS patch.png
>
>
> We recently stumbled upon a performance issue with KMS, where occasionally it 
> exhibited a "No content to map" error (this cluster ran an old version that 
> doesn't have HADOOP-14841) and jobs crashed. *We bumped the number of KMSes 
> from 2 to 4, and the situation got even worse.*
> Later, we realized this cluster had a few hundred encryption zones and a few 
> hundred encryption keys. This is pretty unusual because most of the 
> deployments known to us have at most a dozen keys. So in terms of number of 
> keys, this cluster is 1-2 orders of magnitude higher than anyone else.
> The high number of encryption keys increases the likelihood of key cache 
> miss in KMS. In Cloudera's setup, each cache miss forces KMS to sync with its 
> backend, the Cloudera Keytrustee Server. Plus the high number of KMSes 
> amplifies the latency, effectively causing a [cache miss 
> storm|https://en.wikipedia.org/wiki/Cache_stampede].
> We were able to reproduce this issue with KMS-o-meter (HDFS-14312) - I will 
> come up with a better name later surely - and discovered a scalability bug in 
> CKTS. The fix was verified again with the tool.
> Filing this bug so the community is aware of this issue. I don't have a 
> solution for now in KMS. But we want to address this scalability problem in 
> the near future because we are seeing use cases that require thousands of 
> encryption keys.
> ----
> On a side note, 4 KMS doesn't work well without HADOOP-14445 (and subsequent 
> fixes). A MapReduce job acquires at most 3 KMS delegation tokens, and so for 
> cases, such as distcp, it would fail to reach the 4th KMS on the remote 
> cluster. I imagine similar issues exist for other execution engines, but I 
> didn't test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
