[ https://issues.apache.org/jira/browse/NIFI-16011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088622#comment-18088622 ]
ASF subversion and git services commented on NIFI-16011:
--------------------------------------------------------
Commit d1c7bd2ac94c3fa3e76f36f3a9823dbb5a08d15a in nifi's branch
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=d1c7bd2ac94 ]
NIFI-16011: Fix system-tests (#11325)
Make cluster node HTTP protocol version configurable and reduce LoadBalanceIT
request volume
Introduce a new nifi.properties setting,
nifi.cluster.node.protocol.http.version, that configures the HTTP
version that the cluster node web client prefers when replicating
requests to other nodes. Accepts HTTP_2 (the default) or HTTP_1_1;
invalid values log a warning and fall back to HTTP_2.
The setting is wired into FlowControllerConfiguration.webClientService()
so the cluster replication HttpClient honors the configured version.
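
As an illustration only, the parse-and-fall-back behavior described above
could look roughly like the sketch below. The class and method names are
hypothetical and are not taken from the NiFi code base; only the property
name, the two accepted values, and the HTTP_2 fallback come from this
commit message. Treating a missing or blank value as the default without a
warning, and matching values case-sensitively, are assumptions.

```java
import java.net.http.HttpClient;
import java.util.Properties;
import java.util.logging.Logger;

// Hypothetical helper for illustration; not the actual NiFi implementation.
final class ClusterHttpVersionSketch {
    private static final Logger LOGGER =
            Logger.getLogger(ClusterHttpVersionSketch.class.getName());

    // Property name as given in the commit message.
    static final String VERSION_PROPERTY = "nifi.cluster.node.protocol.http.version";

    // HTTP_2 is the documented default and fallback.
    static final HttpClient.Version DEFAULT_VERSION = HttpClient.Version.HTTP_2;

    private ClusterHttpVersionSketch() {
    }

    // Maps the raw property value to a JDK HttpClient.Version.
    static HttpClient.Version parseVersion(final String rawValue) {
        // Assumption: an unset or blank value silently means "use the default".
        if (rawValue == null || rawValue.isBlank()) {
            return DEFAULT_VERSION;
        }
        final String value = rawValue.trim();
        // Assumption: values are matched case-sensitively.
        if ("HTTP_2".equals(value)) {
            return HttpClient.Version.HTTP_2;
        }
        if ("HTTP_1_1".equals(value)) {
            return HttpClient.Version.HTTP_1_1;
        }
        // Invalid values log a warning and fall back to HTTP_2.
        LOGGER.warning("Invalid value '" + value + "' for " + VERSION_PROPERTY
                + "; falling back to " + DEFAULT_VERSION);
        return DEFAULT_VERSION;
    }

    // Builds a JDK HttpClient that prefers the configured version.
    static HttpClient buildClient(final Properties properties) {
        final HttpClient.Version version =
                parseVersion(properties.getProperty(VERSION_PROPERTY));
        return HttpClient.newBuilder().version(version).build();
    }
}
```

With this sketch, a Properties object that sets the property to HTTP_1_1
yields a client whose version() is HttpClient.Version.HTTP_1_1, while a
missing or unrecognized value yields HTTP_2. The JDK treats the version as
a preference, so a client that prefers HTTP_2 can still downgrade when the
server does not support it.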
The shipped conf/nifi.properties template defaults to HTTP_2 so
production traffic is unchanged. The system test resource templates
under nifi-system-test-suite/src/test/resources/conf/* set the value to
HTTP_1_1. This works around intermittent "RST_STREAM received Stream
cancelled" failures that the JDK java.net.http.HttpClient produces when
it talks to Jetty 12.1.10 (Jetty PR #15087 / issue #15009) under the heavy
disconnect / offload / restart patterns the system tests exercise. The
replicator cannot retry replicated POSTs, so when Jetty sends RST_STREAM
mid-stream the in-flight request is lost.
Also reduces the number of FlowFiles used in LoadBalanceIT from 100 to
20 so each test iterates over the queue with far fewer API calls,
reducing replication pressure and shortening the test.
Signed-off-by: Kevin Doran
> Repeated system test failures caused by LoadBalanceIT
> -----------------------------------------------------
>
> Key: NIFI-16011
> URL: https://issues.apache.org/jira/browse/NIFI-16011
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> We are consistently seeing system test failures. Looking at the logs from
> GitHub Actions, it appears that LoadBalanceIT is always the first one to
> fail, with the issue then cascading. It seems that the end of the
> LoadBalanceIT.testPartitionByAttribute test performs a queue listing for
> each of the 100 expected FlowFiles, and each listing is then replicated
> across the cluster.
> This, in turn, causes connection pool exhaustion, resulting in
> {code:java}
> IOException: RST_STREAM received {code}
> This comes back as an HTTP 500 error.
> That test can be tightened up by producing 20 FlowFiles instead of 100. This
> will reduce the number of requests by 5x, giving us much more breathing room.
>
> After digging in, the reduction from 100 FlowFiles to 20 did not provide the
> resilience I was looking for. The issue appears to stem from changes made in
> the latest version of Jetty, which explicitly and intentionally changed how
> RST_STREAM frames are handled. Reverting the recent Jetty version upgrade
> made the system tests pass, and restoring the latest version brought the
> failures back. It is important to keep current with Jetty, however, and
> these issues do not appear to affect production instances. They affect
> system tests because system tests constantly restart containers while also
> firing off huge numbers of HTTP requests in rapid succession.
> To this end, the approach that I will take is to make the HTTP version used
> for intra-cluster communication configurable. We will default to HTTP_2,
> remaining backward compatible, but system tests can use HTTP 1.1 in order to
> avoid these failures. Running all system tests over HTTP 1.1 is not a
> permanent solution, but it is preferable to the constant system test
> failures that we see currently.