We ran into an incident a while back where one of our broker machines abruptly went down (AWS is fun). While the leadership transitions and so forth seemed to work correctly with the remaining brokers, our producers hung shortly thereafter. I should point out that we are using the old Scala producer in async mode. What happened was that the producer's queue filled up while the SyncProducer on the other end was blocked in a write() call, waiting for ACKs that would never come. My understanding of blocking IO on the JVM is that this call will block until the OS gives up on the TCP connection, which could take as long as 30 minutes.
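To make the failure mode concrete, here is a minimal sketch of a bounded queue whose consumer has stalled, written against plain `java.util.concurrent` rather than Kafka itself. The class and method names are hypothetical, and the three branches only mirror the documented meaning of queue.enqueue.timeout.ms (negative blocks indefinitely, zero never blocks, positive waits that long before giving up); it is not the producer's actual code.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative only: a bounded queue standing in for the async producer's
// internal buffer. Nothing drains it, as if the sender thread were stuck.
class EnqueueTimeoutSketch {

    // timeoutMs < 0  : block until space is available (can hang forever)
    // timeoutMs == 0 : never block; fail immediately if the queue is full
    // timeoutMs > 0  : wait up to timeoutMs for space, then give up
    static boolean enqueue(LinkedBlockingQueue<String> queue, String msg, long timeoutMs)
            throws InterruptedException {
        if (timeoutMs < 0) {
            queue.put(msg);
            return true;
        }
        if (timeoutMs == 0) {
            return queue.offer(msg);
        }
        return queue.offer(msg, timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // Capacity of 2 and no consumer: the third message cannot be enqueued.
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(2);
        System.out.println(enqueue(queue, "m1", 50)); // prints "true"
        System.out.println(enqueue(queue, "m2", 50)); // prints "true"
        System.out.println(enqueue(queue, "m3", 50)); // prints "false" after about 50 ms
    }
}
```

With a negative timeout the third call would never return, which matches the hang we saw. A positive timeout turns that into a bounded wait followed by a dropped message, but it does nothing to unblock the stalled sender.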
As a remedy, we're first going to set queue.enqueue.timeout.ms to some positive value, as we're willing to lose some of these particular messages to avoid blocking user requests. But this won't actually make the producer recover more quickly. Is lowering the OS level TCP keepalive time the right thing here? Also, can someone comment on whether this behavior would also happen with the new producer? We want to get there, but it hasn't been a priority.

--
Tommy Becker
Senior Software Engineer
tivo.com
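
For reference, a rough sketch of the settings in question for the old producer in async mode. The broker addresses and the 500 ms value are placeholders, not our real configuration, and the class name is hypothetical. In a real application the resulting `Properties` would be handed to `kafka.producer.ProducerConfig`; that step is left out so the snippet compiles without Kafka on the classpath.

```java
import java.util.Properties;

// Illustrative only: builds the properties for an old-style async producer.
class AsyncProducerSettingsSketch {

    static Properties build(String brokerList, int enqueueTimeoutMs) {
        Properties props = new Properties();
        // Brokers used to bootstrap metadata (placeholder hosts in main below).
        props.put("metadata.broker.list", brokerList);
        // Send through the background queue instead of on the caller's thread.
        props.put("producer.type", "async");
        // Positive value: wait this long for queue space, then drop the message
        // rather than blocking the calling thread indefinitely.
        props.put("queue.enqueue.timeout.ms", Integer.toString(enqueueTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("broker1:9092,broker2:9092", 500);
        props.list(System.out);
    }
}
```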