Hello good people of Kafka,

We have an old ticket open (https://issues.apache.org/jira/browse/KAFKA-15247) 
with no replies.


The fix has been working well for us for over a year now. Here's the main text 
of the ticket:
When the Kafka server is installed in an OpenShift environment we are seeing 
cases where the clients receive OutOfMemory errors due to single large (1.2 GB) 
byte buffers being allocated by the client.



From research this appears to be a known issue when a plaintext client is 
configured to attempt connection to a TLS-secured endpoint. However, in this 
instance we see successful communication via TLS, and then when the Kafka 
server is restarted (or connectivity is broken) both the consumers and 
producers can throw OutOfMemoryErrors with the following stacks:
[stacks removed; see the Jira ticket for details]

We believe that what is happening is that when the Kafka server goes down, in 
the RHOS environment the route is still available for some small period of time 
and the SaslClientAuthenticator is able to receive rogue packets, which it 
interprets as a length to read off the stream.
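To illustrate the failure mode: a minimal sketch of how a size-prefixed read can turn a few rogue bytes into a huge allocation. This is not the actual Kafka code path; the class and method names here are illustrative, and I'm only assuming the client reads a 4-byte big-endian size prefix before allocating the receive buffer.

```java
import java.nio.ByteBuffer;

public class RogueLengthDemo {
    // Interpret the first four bytes of a packet as a big-endian size prefix,
    // the way a length-prefixed wire protocol would.
    static int readSize(byte[] packet) {
        return ByteBuffer.wrap(packet).getInt();
    }

    public static void main(String[] args) {
        // Four bytes of stray (e.g. TLS handshake) data read as a length prefix.
        byte[] rogue = {0x48, 0x00, 0x00, 0x00};
        int size = readSize(rogue);
        System.out.println(size); // 1207959552, i.e. ~1.2 GB
        // A subsequent ByteBuffer.allocate(size) of that magnitude is the kind
        // of single large allocation that produces the OOM described above.
    }
}
```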

For the consumer code, since there is application code on the stack, we were 
able to implement a workaround by catching the OOM, but on the producer side 
the entire stack is Kafka client code.

I looked at the SaslClientAuthenticator code and I can see that its use of the 
network buffer is unbounded, so I applied two patches to this code. The first 
limits the buffer size for authentication to 10 MB; the second catches the OOM 
and instead fails auth.
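In outline, the two patches amount to something like the following sketch. To be clear, this is not the actual patch against SaslClientAuthenticator; the class, constant, and exception choices below are illustrative only, under the assumption that the claimed receive size is checked before allocation.

```java
import java.nio.ByteBuffer;

public class BoundedAuthBuffer {
    // Patch 1: cap authentication-phase receives. The 10 MB figure matches the
    // value described above; the constant name is illustrative, not Kafka's.
    static final int MAX_AUTH_RECEIVE_BYTES = 10 * 1024 * 1024;

    // Reject implausible sizes up front, and turn an allocation failure into an
    // authentication failure rather than letting the OOM kill the application.
    static ByteBuffer allocateOrFail(int claimedSize) {
        if (claimedSize < 0 || claimedSize > MAX_AUTH_RECEIVE_BYTES) {
            throw new IllegalStateException(
                "Authentication receive of " + claimedSize
                + " bytes exceeds limit of " + MAX_AUTH_RECEIVE_BYTES);
        }
        try {
            return ByteBuffer.allocate(claimedSize);
        } catch (OutOfMemoryError e) {
            // Patch 2: fail auth instead of propagating the OOM.
            throw new IllegalStateException(
                "Failed to allocate " + claimedSize
                + " byte authentication buffer", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(allocateOrFail(512).capacity());
        try {
            allocateOrFail(1_207_959_552); // the rogue ~1.2 GB length
        } catch (IllegalStateException e) {
            System.out.println("auth failed: size rejected");
        }
    }
}
```

The size check is the load-bearing part; the OOM catch is a second line of defence for sizes under the cap that still cannot be satisfied.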

Using the patched client the customer has gone from being able to recreate this 
on at least one appserver for every Kafka server restart to not being able to 
reproduce the issue at all.

I am happy to submit a PR but I wanted to get feedback before I did so. For 
instance, is 10 MB a suitable maximum buffer size for auth? Should the maximum 
perhaps be configurable instead, and if so, what is best practice for providing 
this configuration?

Secondly, catching the OOM doesn't feel like best practice to me; however, 
without doing this the entire application fails due to aggressive allocation of 
byte buffers in the SaslClientAuthenticator. Is there any alternative I should 
be considering?

Is there anything we can do to get some attention on the ticket so we don't 
have to patch every level of Kafka?

Thanks!
Andreas

--
Andreas Martens
Senior Engineer
App Connect Enterprise
IBM

Unless otherwise stated above:

IBM United Kingdom Limited
Registered in England and Wales with number 741598
Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
