[ 
https://issues.apache.org/jira/browse/NIFI-16011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Villard resolved NIFI-16011.
-----------------------------------
    Fix Version/s: 2.10.0
       Resolution: Fixed

> Repeated system test failures caused by LoadBalanceIT
> -----------------------------------------------------
>
>                 Key: NIFI-16011
>                 URL: https://issues.apache.org/jira/browse/NIFI-16011
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 2.10.0
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> System tests are failing consistently. Looking at the logs from GitHub 
> Actions, it appears that LoadBalanceIT is always the first test to fail, 
> with the failure then cascading to later tests. It seems that the end of the 
> LoadBalanceIT.testPartitionByAttribute test performs a queue listing for 
> each of the 100 expected FlowFiles, and each listing is then replicated 
> across the cluster.
> This, in turn, causes connection pool exhaustion, resulting in
> {code:java}
> IOException: RST_STREAM received {code}
> which comes back to the caller as an HTTP 500 error.
> That test can be tightened up by producing 20 FlowFiles instead of 100. This 
> will cut the number of requests by a factor of 5, giving us much more 
> breathing room.
>  
> After digging in, the reduction from 100 FlowFiles to 20 did not provide the 
> resilience I was looking for. The issue appears to stem from changes made in 
> the latest version of Jetty, which appears to have intentionally changed how 
> RST_STREAM resets are handled. Reverting the recent Jetty version change 
> confirmed that system tests pass, and restoring the latest version brought 
> the failures back. It is important to keep current with Jetty, however, and 
> these issues do not appear to affect production instances. They affect 
> system tests because system tests constantly restart containers while also 
> firing off huge numbers of HTTP requests in very quick succession.
> To this end, I will expose a configuration option for the HTTP version used 
> for intra-cluster communications. It will default to HTTP_2, which remains 
> backward compatible, while system tests can use HTTP 1.1 to avoid these 
> failures. Running all system tests over HTTP 1.1 is not meant as a permanent 
> solution, but it is more desirable than the constant system test failures 
> that we see currently.
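
The approach described above, a configurable HTTP version that defaults to HTTP/2, could look roughly like the following. This is only an illustrative sketch using the JDK's java.net.http.HttpClient; it is not NiFi's actual implementation. The class name ClusterHttpVersion and the property name example.cluster.http.version are hypothetical and do not come from the issue.

```java
import java.net.http.HttpClient;
import java.util.Locale;
import java.util.Properties;

public class ClusterHttpVersion {
    // Hypothetical property name, used only for this sketch.
    public static final String PROPERTY = "example.cluster.http.version";

    // Resolve the configured version, defaulting to HTTP_2 when the property is unset.
    public static HttpClient.Version resolve(Properties props) {
        String raw = props.getProperty(PROPERTY, "HTTP_2").trim().toUpperCase(Locale.ROOT);
        switch (raw) {
            case "HTTP_1_1":
                return HttpClient.Version.HTTP_1_1;
            case "HTTP_2":
                return HttpClient.Version.HTTP_2;
            default:
                throw new IllegalArgumentException("Unsupported HTTP version: " + raw);
        }
    }

    // Build a client whose preferred protocol version follows the configuration.
    public static HttpClient buildClient(Properties props) {
        return HttpClient.newBuilder().version(resolve(props)).build();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(buildClient(props).version()); // default: HTTP_2

        // A test harness could opt in to HTTP/1.1 instead.
        props.setProperty(PROPERTY, "HTTP_1_1");
        System.out.println(buildClient(props).version());
    }
}
```

Defaulting to HTTP_2 when the property is absent leaves existing configurations unchanged, while a test harness can set the property to HTTP_1_1 explicitly.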



--
This message was sent by Atlassian Jira
(v8.20.10#820010)