[ https://issues.apache.org/jira/browse/KAFKA-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961595#comment-14961595 ]

Joel Koshy commented on KAFKA-2664:
-----------------------------------

[~gwenshap] in general yes, it could, but if we did register per-connection
metrics they would be unlikely to cause as much of an issue as per-client-id
metrics do when clients improperly generate a new client-id on every
reconnect. This is because you would typically have on the order of a few
hundred to low thousands of connection-ids; and once those have been
registered you wouldn't need to add any more, even if many of those clients
reconnect frequently. That said, per-connection metrics are currently disabled
(i.e., we don't register them) in the server-side selector.
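
To make the cardinality point concrete, here is a minimal standalone sketch
(hypothetical names and tags, not broker code) that registers one sensor per
id in a client-side {{Metrics}} registry; the connection-id pool stays bounded
while fresh client-ids grow without bound:

{code:java}
import java.util.Collections;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;

public class CardinalitySketch {
    // Register a per-id sensor, but only on first sight of the id.
    static void register(Metrics metrics, String id) {
        if (metrics.getSensor(id) == null) {
            Sensor sensor = metrics.sensor(id);
            sensor.add(new MetricName("byte-rate", "sketch", "hypothetical per-id rate",
                    Collections.singletonMap("id", id)), new Rate());
        }
    }

    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        int reconnects = 10_000, pool = 1_000;
        for (int i = 0; i < reconnects; i++) {
            register(metrics, "conn-" + (i % pool)); // connection-ids: capped at the pool size
            register(metrics, "client-" + i);        // fresh client-id: one new metric set each time
        }
        // ~pool connection metrics vs. one metric per reconnect for client-ids
        System.out.println("registered metrics: " + metrics.metrics().size());
    }
}
{code}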

bq. 1. Can you specify which git-hash you reverted to?

The version we rolled back to does include KAFKA-1928 (if that's what you are
asking) as well as multi-port support; but since those per-connection metrics
are disabled, it is probably irrelevant here.

bq. 2. Did you profile the connection? Or is this an educated guess of where 
time went?

I forgot to mention this above, but after the above episode, when mild
suspicion fell on the quota metrics, I put together a separate stress test for
them - the easiest way to observe this is to synthetically call
{{QuotaManager.recordAndMaybeThrottle}} in a loop and profile it. Most of the
time is spent in the copy-on-write map and map resizes. So yes, it was an
educated guess until today: we deliberately reproduced the issue in production
and attached a profiler to the broker to verify that the higher local times
were due to the creation of per-client-id quota metrics.
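
The stress test was essentially the following sketch. It goes through the
public {{Metrics}} API rather than the internal {{QuotaManager}} (so the
metric names here are made up), but it exercises the same path: every
previously unseen client-id registers a fresh sensor into the registry, and
the cost of each registration grows with the number of pre-existing metrics:

{code:java}
import java.util.Collections;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;

public class QuotaMetricStress {
    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        long start = System.nanoTime();
        // Deliberately heavy: 50k distinct "client-ids", mirroring the scenario below.
        for (int i = 1; i <= 50_000; i++) {
            String clientId = "client-" + i; // simulates a new client-id on every call
            Sensor sensor = metrics.sensor(clientId);
            sensor.add(new MetricName("throttle-time", "quota-sketch",
                    "hypothetical per-client-id quota metric",
                    Collections.singletonMap("client-id", clientId)), new Rate());
            sensor.record(1.0);
            if (i % 10_000 == 0) // per-registration cost climbs as the registry grows
                System.out.printf("%d metrics, %.1fs elapsed%n",
                        metrics.metrics().size(), (System.nanoTime() - start) / 1e9);
        }
    }
}
{code}

Attaching a profiler to a loop like this is what showed most of the time going
to the copy-on-write map.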

> Adding a new metric with several pre-existing metrics is very expensive
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2664
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2664
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Joel Koshy
>             Fix For: 0.9.0.1
>
>
> I know the summary sounds expected, but we recently ran into a socket server 
> request queue backup that I suspect was caused by a combination of improperly 
> implemented applications that reconnect with a different (random) client-id 
> each time, and the fact that for quotas we now register a new quota 
> metric-set for each client-id.
> So here is what happened: a broker went down and a handful of other brokers 
> started seeing queue times go up significantly. This caused the request 
> queue to back up, which caused socket timeouts and a further deluge of 
> reconnects. The only way we could get out of this was to firewall off the 
> broker and downgrade to a version without quotas (although I think it would 
> also have worked to just restart the broker).
> My guess is that there were a ton of pre-existing client-id metrics. I don't 
> know for sure, but I'm basing that on the fact that several new unique 
> client-ids were showing up in the public access logs and request local 
> times for fetches started going up inexplicably. (It would have been useful 
> to have a metric for the number of metrics.) So it turns out that in the 
> above scenario (with, say, 50k pre-existing client-ids), the average local 
> time for a fetch can go up to the order of 50-100 ms (at least in tests on a 
> Linux box), largely due to the time taken to create new metrics; and that's 
> because we use a copy-on-write map underneath (a sketch of that cost follows 
> below). If you have enough (say, hundreds of) clients reconnecting at the 
> same time with new client-ids, that can cause the request queues to start 
> backing up and the overall queuing system to become unstable; and the line 
> starts to spill out of the building.
> I think this is a fairly new scenario with quotas - i.e., I don't think the 
> creation rate of past per-X metrics (per-topic, for example) would ever have 
> come this close.
> To be clear, the clients are clearly doing the wrong thing but I think the 
> broker can and should protect itself adequately against such rogue scenarios.
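
For intuition on the numbers above: with a copy-on-write map, inserting the
n-th metric copies the n-1 existing entries, so registering m new client-ids
on top of n existing ones costs O(m*n) entry copies on the request-handling
path. A minimal sketch against {{org.apache.kafka.common.utils.CopyOnWriteMap}}
(the copy-on-write map implementation shipped in the clients jar; timings are
machine-dependent and illustrative only):

{code:java}
import org.apache.kafka.common.utils.CopyOnWriteMap;

public class CowMapCost {
    public static void main(String[] args) {
        CopyOnWriteMap<String, Integer> map = new CopyOnWriteMap<>();
        long start = System.nanoTime();
        for (int i = 1; i <= 50_000; i++) {
            map.put("client-" + i, i); // each put copies all i-1 existing entries
            if (i % 10_000 == 0)       // cumulative time grows roughly quadratically
                System.out.printf("size=%d, total=%.2fs%n", map.size(),
                        (System.nanoTime() - start) / 1e9);
        }
    }
}
{code}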



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
