[ https://issues.apache.org/jira/browse/KAFKA-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961345#comment-14961345 ]

Gwen Shapira commented on KAFKA-2664:
-------------------------------------

I'm not certain this is just quotas. We started using 
o.a.k.common.network.Selector in SocketServer, which adds a bunch of 
per-connection metrics. We tried to make that efficient, but it may have added 
significant overhead too.
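
For context, the pattern is roughly one small metric set per connection or 
client-id. Here's a minimal sketch of that registration pattern, assuming the 
0.9-era o.a.k.common.metrics API; the sensor and metric names below are made 
up for illustration and are not the ones SocketServer actually registers:

import java.util.Collections;
import java.util.Map;

import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Avg;
import org.apache.kafka.common.metrics.stats.Rate;

public class PerClientMetricsSketch {

    private final Metrics metrics = new Metrics();

    // Each previously unseen client-id gets its own sensor plus a couple of
    // metrics tagged with that client-id.
    public Sensor sensorFor(String clientId) {
        String name = "client-sketch-" + clientId;
        Sensor sensor = metrics.getSensor(name);
        if (sensor == null) {
            Map<String, String> tags = Collections.singletonMap("client-id", clientId);
            sensor = metrics.sensor(name);
            sensor.add(new MetricName("byte-rate", "client-sketch",
                    "bytes/sec tagged by client-id", tags), new Rate());
            sensor.add(new MetricName("request-size-avg", "client-sketch",
                    "average request size tagged by client-id", tags), new Avg());
        }
        return sensor;
    }

    public void record(String clientId, double bytes) {
        sensorFor(clientId).record(bytes);
    }
}

As far as I can tell nothing removes these sensors afterwards, so a stream of 
random client-ids grows the registry without bound.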

I'm wondering:
1. Can you specify which git hash you reverted to?
2. Did you profile the connection, or is this an educated guess about where 
the time went?

50-100ms to create a connection is pretty bad, so I think it's a great idea to 
improve our efficiency there.
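
If nobody has profiled it yet, a rough way to get a number is to time a single 
registration as the registry grows. This is just a throwaway micro-benchmark 
sketch against a plain o.a.k.common.metrics.Metrics instance (the group and 
metric names and the 50k count are arbitrary), not a claim about where the 
50-100ms actually goes:

import java.util.Collections;

import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;

public class MetricRegistrationTiming {

    private static void register(Metrics metrics, String clientId) {
        Sensor sensor = metrics.sensor("client-" + clientId);
        sensor.add(new MetricName("byte-rate", "bench", "bytes/sec",
                Collections.singletonMap("client-id", clientId)), new Rate());
    }

    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        for (int i = 0; i < 50_000; i++) {
            register(metrics, "existing-" + i);
            // Every 10k registrations, time one extra registration. If each
            // write copies the whole underlying map, this cost should grow
            // with the number of pre-existing metrics.
            if (i % 10_000 == 0) {
                long start = System.nanoTime();
                register(metrics, "probe-" + i);
                System.out.printf("~%d metrics: %.3f ms for one more registration%n",
                        i, (System.nanoTime() - start) / 1e6);
            }
        }
        metrics.close();
    }
}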

> Adding a new metric with several pre-existing metrics is very expensive
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2664
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2664
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Joel Koshy
>             Fix For: 0.9.0.1
>
>
> I know the summary sounds expected, but we recently ran into a socket server 
> request queue backup that I suspect was caused by a combination of improperly 
> implemented applications that reconnect with a different (random) client-id 
> each time; and the fact that for quotas we now register a new quota 
> metric-set for each client-id.
> So here is what happened: a broker went down and a handful of other brokers 
> started seeing queue times go up significantly. This caused the request 
> queue to backup, which caused socket timeouts and a further deluge of 
> reconnects. The only way we could get out of this was to fire-wall the broker 
> and downgrade to a version without quotas (or I think it would have worked to 
> just restart the broker).
> My guess is that there were a ton of pre-existing client-id metrics. I don’t 
> know for sure but I’m basing that on the fact that there were several new 
> unique client-ids showing up in the public access logs and request local 
> times for fetches started going up inexplicably. (It would have been useful 
> to have a metric for the number of metrics.) So it turns out that in the 
> above scenario (with say 50k pre-existing client-ids), the avg local time for 
> fetch can go up to the order of 50-100ms (at least with tests on a Linux box) 
> largely due to the time taken to create new metrics; and that’s because we 
> use a copy-on-write map underneath. If you have enough (say, hundreds) of 
> clients re-connecting at the same time with new client-ids, that can cause 
> the request queues to start backing up and the overall queuing system to 
> become unstable; and the line starts to spill out of the building.
> I think this is a fairly new scenario with quotas - i.e., I don’t think the 
> creation rate of past per-X metrics (e.g., per-topic) would ever come this 
> close.
> To be clear, the clients are doing the wrong thing here, but I think the 
> broker can and should protect itself adequately against such rogue scenarios.
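
The copy-on-write map mentioned in the description is the key part: every 
write copies the whole backing map, so adding the N-th metric costs on the 
order of N copies, and concurrent registrations serialize behind the write 
lock. A hand-rolled sketch of that general pattern (not Kafka's actual map 
class):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustration of a copy-on-write map: reads are lock-free and cheap, but
// every write copies the whole backing map before swapping it in, so
// inserting the N-th entry costs O(N).
public class CopyOnWritePutSketch<K, V> {

    private volatile Map<K, V> current = Collections.emptyMap();

    public synchronized V put(K key, V value) {
        Map<K, V> copy = new HashMap<>(current); // full copy on every write
        V previous = copy.put(key, value);
        current = Collections.unmodifiableMap(copy);
        return previous;
    }

    public V get(K key) {
        return current.get(key); // no locking on the read path
    }

    public int size() {
        return current.size();
    }
}

With ~50k metrics already registered, each additional registration copies 
~50k references, and hundreds of clients reconnecting at once with fresh 
client-ids all serialize behind that lock, which lines up with the local-time 
climb described above.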



