[
https://issues.apache.org/jira/browse/NIFI-16011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Payne updated NIFI-16011:
------------------------------
Description:
We are consistently seeing system test failures. Looking at the logs from
GitHub Actions, it appears that LoadBalanceIT is always the first test to fail,
with the failure then cascading to later tests. It seems that the end of the
LoadBalanceIT.testPartitionByAttribute test performs a queue listing for
each of the 100 expected FlowFiles, and each listing request is then replicated
across the cluster.
This, in turn, causes connection pool exhaustion, resulting in
{code:java}
IOException: RST_STREAM received
{code}
which comes back as an HTTP 500 error.
That test can be tightened up by producing 20 FlowFiles instead of 100. This
will reduce the number of requests by 5x, giving us much more breathing room.
After digging in, the reduction from 100 FlowFiles to 20 did not provide the
resilience I was looking for. The issue appears to stem from changes made in
the latest version of Jetty, which appears to have explicitly and intentionally
changed how RST_STREAM resets are handled. Reverting the recent Jetty version
change confirmed that the system tests pass, and restoring the latest version
brought the failures back. It is important to keep current with Jetty, however,
and these issues do not appear to affect production instances. They affect
system tests because system tests constantly restart containers while also
firing off huge numbers of HTTP requests in very quick succession.
To this end, the approach that I will take is to make the HTTP version used
for intra-cluster communications configurable. We will default to HTTP_2,
remaining backward compatible, but system tests can use HTTP 1.1 in order to
avoid these failures. Running all system tests over HTTP 1.1 is not a permanent
solution, but it is more desirable than the constant system test failures that
we see currently.
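To illustrate the idea, here is a minimal sketch of selecting the client
protocol version from a configured value, defaulting to HTTP_2. It uses the
JDK's java.net.http.HttpClient purely for illustration. The class name, the
property name "example.cluster.http.version", and the accepted values are
assumptions made for this example and are not taken from the NiFi code base.

```java
import java.net.http.HttpClient;
import java.time.Duration;
import java.util.Locale;

public class ClusterHttpVersion {

    // Hypothetical property name, for illustration only; not an actual NiFi property.
    static final String PROPERTY = "example.cluster.http.version";

    /** Maps a configured value to a client protocol version, defaulting to HTTP/2. */
    static HttpClient.Version parse(String configured) {
        if (configured == null || configured.isBlank()) {
            // No value configured: keep the existing behavior.
            return HttpClient.Version.HTTP_2;
        }
        switch (configured.trim().toUpperCase(Locale.ROOT)) {
            case "HTTP_2":
                return HttpClient.Version.HTTP_2;
            case "HTTP_1_1":
                return HttpClient.Version.HTTP_1_1;
            default:
                throw new IllegalArgumentException("Unsupported HTTP version: " + configured);
        }
    }

    /** Builds a client that prefers the configured protocol version. */
    static HttpClient buildClient(String configured) {
        return HttpClient.newBuilder()
                .version(parse(configured))
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = buildClient(System.getProperty(PROPERTY));
        System.out.println("Intra-cluster client protocol: " + client.version());
    }
}
```

In a sketch like this, a system test harness would pass HTTP_1_1 (for example
with -Dexample.cluster.http.version=HTTP_1_1), while anything that leaves the
value unset keeps HTTP_2. With the JDK client, HTTP_2 is a preference, and the
client can still fall back to HTTP 1.1 when the server does not support it.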
> Repeated system test failures caused by LoadBalanceIT
> -----------------------------------------------------
>
> Key: NIFI-16011
> URL: https://issues.apache.org/jira/browse/NIFI-16011
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Time Spent: 1h
> Remaining Estimate: 0h
>