Hello good people of Kafka,

We have an old ticket open (https://issues.apache.org/jira/browse/KAFKA-15247) with no replies.
The fix has been working well for us for over a year now. Here's the main text of the ticket:

When the Kafka server is installed in an OpenShift environment, we are seeing cases where the clients receive OutOfMemory errors due to single large (1.2 GB) byte buffers being allocated by the client. From research, this appears to be a known issue when a plaintext client is configured to attempt connection to a TLS-secured endpoint. In this instance, however, we see successful communication via TLS, and then when the Kafka server is restarted (or connectivity is broken) both the consumers and producers can throw OutOfMemoryErrors with the following stacks:

[ ... removed stacks, see Jira ticket for details ... ]

We believe that when the Kafka server goes down, the route in the RHOS environment remains available for a short period, and the SaslClientAuthenticator is able to receive rogue packets, which it interprets as a length to read off the stream. For the consumer, since there is application code on the stack, we were able to implement a workaround by catching the OOM, but on the producer side the entire stack is Kafka client code.

I looked at the SaslClientAuthenticator code and I can see that its use of the network buffer is unbounded, so I applied two patches to this code. The first limits the buffer size for authentication to 10 MB; the second catches the OOM and fails authentication instead.

Using the patched client, the customer has gone from being able to recreate this on at least one app server for every Kafka server restart to not being able to reproduce the issue at all.

I am happy to submit a PR, but I wanted to get feedback before I did so.
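
To make the two changes concrete, here is a minimal, self-contained Java sketch of the idea. It is not the actual patch and does not use any Kafka classes: `BoundedAuthReceive`, `MAX_AUTH_RECEIVE_BYTES`, `AuthFailedException`, `readLengthPrefix` and `allocateForAuth` are hypothetical names chosen for illustration. The stray bytes in `main` are also an assumption (the ticket does not say what the rogue packets contained); ASCII text is used only to show how four arbitrary bytes, read as a big-endian length prefix, can request a buffer of roughly 1.2 GB.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BoundedAuthReceive {
    // Hypothetical cap, mirroring the 10 MB limit described above.
    static final int MAX_AUTH_RECEIVE_BYTES = 10 * 1024 * 1024;

    // Hypothetical exception standing in for "fail authentication".
    static class AuthFailedException extends RuntimeException {
        AuthFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Interpret the first four bytes on the wire as a big-endian size prefix.
    static int readLengthPrefix(byte[] firstFourBytes) {
        return ByteBuffer.wrap(firstFourBytes).getInt();
    }

    // Change 1: reject sizes outside [0, MAX_AUTH_RECEIVE_BYTES] before allocating.
    // Change 2: if allocation still runs out of memory, fail auth instead of
    // letting the OutOfMemoryError propagate into the application.
    static ByteBuffer allocateForAuth(int declaredSize) {
        if (declaredSize < 0 || declaredSize > MAX_AUTH_RECEIVE_BYTES) {
            throw new AuthFailedException(
                "Invalid receive during authentication (size = " + declaredSize + ")", null);
        }
        try {
            return ByteBuffer.allocate(declaredSize);
        } catch (OutOfMemoryError e) {
            throw new AuthFailedException(
                "Could not allocate " + declaredSize + " bytes during authentication", e);
        }
    }

    public static void main(String[] args) {
        // Assumed example of stray, non-Kafka bytes arriving on the connection.
        byte[] stray = "HTTP".getBytes(StandardCharsets.US_ASCII);
        int size = readLengthPrefix(stray);
        System.out.println(size); // prints 1213486160, i.e. about 1.2 GB

        try {
            allocateForAuth(size);
        } catch (AuthFailedException e) {
            System.out.println("auth failed: " + e.getMessage());
        }
    }
}
```

With the size check in place, the catch of `OutOfMemoryError` becomes a last resort for heaps too small even for a buffer within the cap, which is part of why the choice of maximum matters.
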
Specifically, I would like feedback on two points. First, is 10 MB a suitable maximum buffer size for authentication, or should the maximum be configurable instead? If so, what is best practice for providing this configuration? Second, catching the OOM doesn't feel like best practice to me; however, without doing this the entire application fails due to aggressive allocation of byte buffers in the SaslClientAuthenticator. Is there any alternative I should be considering?

Is there anything we can do to get some attention on the ticket so we don't have to patch every level of Kafka?

Thanks!
Andreas

--
Andreas Martens
Senior Engineer
App Connect Enterprise
IBM