[ 
https://issues.apache.org/jira/browse/NIFI-12194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781104#comment-17781104
 ] 

Paul Grey commented on NIFI-12194:
----------------------------------

Following up on this. I've checked nifi/main, and unfortunately the problem 
exists there as well.

It appears to be caused by a problem in the Kafka client library, combined with 
how NiFi handles connection initialization for the processor. With this 
misconfiguration, the library attempts to allocate a large direct byte buffer, 
which fails in NiFi configurations with a small memory footprint. When the 
failure occurs, NiFi currently retries the connection immediately, and the 
retry fails in the same way. The large memory allocation combined with the 
immediate retry starves the NiFi process of CPU cycles, causing instability.

It is not straightforward to detect the Kafka misconfiguration in NiFi. A more 
reasonable solution seems to be to improve NiFi behavior in general on a 
connection initialization failure.

A fix is on the way that improves this behavior. On a connection 
initialization failure, a NiFi processor API is invoked that yields CPU 
resources for a configurable amount of time (by default, 1 second). This does 
not prevent the problem, but it should preserve enough CPU to keep the UI 
responsive (a bulletin indicates the error, and the processor can be stopped 
and its configuration adjusted).
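
To illustrate why yielding helps, here is a minimal, self-contained sketch. It 
is not the actual NiFi change and does not use the NiFi API (in a real 
processor, the yield would presumably go through ProcessContext.yield()). The 
class name, the method name, and the 10 ms cost per failed attempt are 
assumptions made for the example. It uses simulated time to count how many 
failing connection attempts fit into a fixed window, with and without a yield 
after each failure.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical illustration only; not part of NiFi or the Kafka client.
class YieldOnInitFailure {

    /**
     * Counts how many connection attempts occur within a simulated time budget.
     * Time is tracked with a counter rather than a real clock, so the result
     * is deterministic.
     *
     * @param budget        length of the simulated observation window
     * @param attemptCost   assumed time consumed by one failed attempt
     * @param yieldDuration time the processor backs off after a failure
     * @param connect       returns true if the connection succeeds
     */
    static int attemptsWithin(Duration budget, Duration attemptCost,
                              Duration yieldDuration, Supplier<Boolean> connect) {
        long nowNanos = 0;
        int attempts = 0;
        while (nowNanos < budget.toNanos()) {
            attempts++;
            nowNanos += attemptCost.toNanos();   // work done by the attempt
            if (connect.get()) {
                break;                           // connected, no more retries
            }
            nowNanos += yieldDuration.toNanos(); // back off before retrying
        }
        return attempts;
    }

    public static void main(String[] args) {
        Supplier<Boolean> alwaysFails = () -> false; // the misconfigured case
        Duration budget = Duration.ofSeconds(10);
        Duration cost = Duration.ofMillis(10);       // assumed cost per attempt

        int immediateRetry = attemptsWithin(budget, cost, Duration.ZERO, alwaysFails);
        int withYield = attemptsWithin(budget, cost, Duration.ofSeconds(1), alwaysFails);

        System.out.println("immediate retry: " + immediateRetry + " attempts");
        System.out.println("1 second yield:  " + withYield + " attempts");
    }
}
```

Under these assumptions, the expensive failing attempt runs far less often 
when each failure is followed by a yield, which leaves CPU time for the rest 
of the flow and for the UI.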

The fix is directed at the main line. Once merged, it can be backported to the 
1.x line, and would be included in an upcoming NiFi 1.x release.

Thanks very much for raising a red flag here!  Definitely an unusual problem; 
hope the fix helps you, and those who might encounter the problem in the future.

> Nifi fails when ConsumeKafka_2_6 processor is started with PLAINTEXT 
> securityProtocol
> -------------------------------------------------------------------------------------
>
>                 Key: NIFI-12194
>                 URL: https://issues.apache.org/jira/browse/NIFI-12194
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.21.0, 1.23.0
>            Reporter: Peter Schmitzer
>            Assignee: Paul Grey
>            Priority: Major
>         Attachments: image-2023-09-27-15-56-02-438.png
>
>
> When starting the ConsumeKafka_2_6 processor with SASL mechanism GSSAPI and 
> the securityProtocol PLAINTEXT (although SSL would be correct), the UI 
> crashed and NiFi was no longer accessible. Not only was the frontend no 
> longer accessible, but the other processors in our flow also stopped 
> performing well according to our dashboards.
> We were able to reproduce this by using the config as described above.
> Our NiFi in preprod (where this was detected) runs in a Kubernetes cluster.
>  * version 1.21.0
>  * 3 nodes
>  * jvmMemory: 1536m
>  * 3G memory (limit)
>  * 400m cpu (request)
>  * zookeeper
> The logs do not offer any unusual entries when the issue is triggered. 
> Inspecting the pod metrics we found a spike in memory.
> The issue is a bit scary for us because a rather innocent config parameter in 
> one single processor can bring down our whole cluster.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
