Hi all, I'm investigating a strange occurrence on one of the NiFi clusters I manage. I'm mailing the dev list rather than users because it seems likely to be a bug or quirk in SSL certificate handling by NiFi or the libraries it depends on.
One of our production clusters has been suffering from high CPU load (typically at or above the number of vCPUs) on all 3 nodes equally, and I wasn't able to trace it back to specific processors. Recently I had a breakthrough, completely by accident: after replacing the SSL certificates in the main keystores and restarting NiFi, CPU load dropped drastically and stayed low, with only the primary node seeing some spikes. You can see this in the graphs (left is % CPU used as reported by Linux, right is NiFi's own reported CPU load) a few minutes after putting back one of the old certificates. Load also rises only on the node with the old certificate, so its partner nodes are not affected by the certificate presented; only some internal mechanism for loading/verifying the certificate seems to cause the trouble. My guess is RPG traffic, as the cluster still has many RPGs looping back to its own cluster.

The cluster is running NiFi 1.18.0 on OpenJDK 1.8.0_352. Its Acceptance twin cluster never had this problem: it runs the same configuration, but at lower load, and had its certificates changed a week earlier with no noticeable change (CPU usage rarely stays above 25% for long).

Something must have been wrong with the old certificate. It has the same key length and key algorithm as the new one, though a different intermediate and root CA. Storing it in JKS or PKCS12 makes no difference. The CRL URI for both certificates is reachable but returns a 470 error (internal enterprise CA and CRL). I can't find any difference that would explain the vastly different CPU usage.

Does this ring a bell with anyone? And if not, what kind of logging/diagnostics settings would help me track this down?

Thanks,
Isha

[image: CPU load graphs]
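For what it's worth, this is roughly how I've been comparing the two certificates. The file names are placeholders for certs exported from the old and new keystores (e.g. via `keytool -exportcert -rfc`); the throwaway self-signed cert is only there so the snippet runs stand-alone:

```shell
# Placeholder for a cert exported from one of the keystores.
# For a self-contained demo, generate a throwaway self-signed cert:
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=demo" -keyout /tmp/demo.key -out /tmp/cert-old.pem 2>/dev/null

# Dump the fields that are worth diffing between old and new:
openssl x509 -in /tmp/cert-old.pem -noout -text |
  grep -E 'Signature Algorithm|Public-Key|Issuer|CRL|Key Usage'
```

On the diagnostics side, I assume adding `-Djavax.net.debug=ssl:handshake` as an extra `java.arg.N` line in bootstrap.conf would at least show whether the old certificate triggers extra handshake or revocation-checking work, though on JDK 8 that logging is very verbose.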