[ 
https://issues.apache.org/jira/browse/NIFI-15535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18066177#comment-18066177
 ] 

Lou Vasquez edited comment on NIFI-15535 at 3/17/26 12:09 AM:
--------------------------------------------------------------

I had what I believe is this exact issue on EKS with NiFi 2.7.2 and created a 
patch that resolved it for my test case. I was able to reproduce it 
consistently with specific actions using a flow I will attach. I also 
reproduced it in NiFi 2.8.0.

The flow is:
 - 2 PutSQS processors (each fed by a GenerateFlowFile processor)
 - both use the same AWSCredentialsProviderControllerService
 - the service has default settings (default AWS credentials chain) and a web 
identity token configured in env vars via IRSA.

To reproduce:
 - start the service and both processors from a clean working state
 - both processors work fine
 - stop just 1 PutSQS processor
 - wait for the auth token to expire (typically an hour, but see the debug logs)
 - try to send a message using the other (still running) PutSQS processor

Analysis (some assumptions, which bear out in the fixed code): 
The AWS SDK appears to hold the STS HTTP client connection pool 
(PoolingHttpClientConnectionManager) alongside the SQS clients and share it 
between them. When either PutSQS is stopped ( SqsClient.close() ), the AWS 
SDK seems to close that shared pool. When the token expires, the still-running 
processor is not aware that its STS pool is closed, and the AWS SDK 
tries to use it to request a credential refresh. The still-running PutSQS 
continues to work until expiration because it holds valid credentials that have 
not yet expired; it simply cannot refresh them once the STS pool is gone.
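The shared-pool failure described above can be sketched with a minimal, self-contained analogue (illustrative classes only, not actual AWS SDK types): two clients share one STS pool, closing either client closes the pool, and the survivor only fails once it needs a credential refresh.

```java
// Minimal analogue of the hypothesized failure (illustrative names, not AWS SDK classes).
final class SharedStsPool {
    private boolean closed = false;
    void close() { closed = true; }
    String fetchFreshCredentials() {
        if (closed) throw new IllegalStateException("connection pool shut down");
        return "fresh-token";
    }
}

final class SketchClient implements AutoCloseable {
    private final SharedStsPool stsPool; // shared with other clients, not owned
    private String cachedToken = "initial-token";
    private boolean tokenExpired = false;

    SketchClient(SharedStsPool pool) { this.stsPool = pool; }

    void expireToken() { tokenExpired = true; }

    String send() {
        if (tokenExpired) {
            // credential refresh goes through the shared STS pool
            cachedToken = stsPool.fetchFreshCredentials();
            tokenExpired = false;
        }
        return "sent with " + cachedToken;
    }

    // Mirrors the suspected SqsClient.close() behavior: it tears down the shared pool.
    @Override public void close() { stsPool.close(); }
}
```

Closing one client here leaves the other working on its cached token; only the refresh attempt after expiry fails, which matches the delayed failure in the repro steps.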

Evidence for the analysis:
While the details are hidden without digging into/debugging the AWS SDK code, 
several inelegant (and one not terrible) solutions worked, which supports it.
 - using a separate underlying service (AWSCredentialsProviderControllerService) 
for each processor prevents the issue
 - modifying (implicit and explicit) 
DefaultCredentialsStrategy.getAwsCredentialsProvider to always return 
a new DefaultCredentialsProvider (builder().build()) resolved the issue. It is 
extremely abusive of STS/AWS, but it narrows down the source of the issue.

Final working patch:
 - every call to getAwsCredentialsProvider returns a new 
AwsCredentialsProvider
 - (thus) every processor gets its own AWS SDK provider
 - (presumably) this prevents the AWS SDK from sharing the STS pool across clients
 - avoids NiFi flow developers needing a separate service for every AWS 
processor (per the original fix & backup plan)
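The patch idea in the list above can be sketched the same way (again with illustrative classes, not the actual NiFi service code): the service hands each caller a brand-new provider that owns its own lifecycle, so closing one cannot break another.

```java
// Sketch of the fix: a fresh provider per getAwsCredentialsProvider() call
// (illustrative classes, not the real NiFi/AWS SDK types).
final class OwnedProvider implements AutoCloseable {
    private boolean closed = false;
    String resolveCredentials() {
        if (closed) throw new IllegalStateException("provider closed");
        return "token";
    }
    @Override public void close() { closed = true; }
}

final class CredentialsServiceSketch {
    // Before the patch, one cached instance was shared by every caller; after
    // the patch, each call returns a new provider with its own resources.
    OwnedProvider getAwsCredentialsProvider() {
        return new OwnedProvider();
    }
}
```

With per-caller providers, one processor stopping (and closing its provider) leaves every other processor's credential refresh path intact.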

I have my latest patch (attached), the flow I used to recreate the issue 
(attached), many debug logs (attached), and a list of helpful logback debug 
lines, and I can spin up an EKS cluster reasonably easily if you want me to try 
patches for you. 

NOTE: the issue did not occur on a local box (OSX podman w/ web token) with 
unpatched 2.7.2 (it may be EKS-only)
 * see attached files
debug logs of steps to recreate: lv_connection_pool.log
patch to the AWS nar that resolves it: lv_fix-aws-sts-connection-pool.patch
NiFi flow that recreates it: lv_simple_connect_pool_bug_flow.json


> AWS Controller Service Fault
> ----------------------------
>
>                 Key: NIFI-15535
>                 URL: https://issues.apache.org/jira/browse/NIFI-15535
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>         Environment: Test and Prod
>            Reporter: Christopher Gambino
>            Priority: Major
>         Attachments: image-2026-02-02-09-54-53-956.png, 
> jirasamplenifibug.txt, lv_connection_pool.log, 
> lv_fix-aws-sts-connection-pool.patch, lv_simple_connect_pool_bug_flow.json
>
>
> NiFi's AWS connection pool service is failing after the upgrade to 2.7.2.  
> This has been observed across multiple environments that were upgraded 
> from 2.6.x.  These environments have previously not had any issues with 
> NiFi's connection to AWS.  The failures happen during times of both high and 
> low load, so we do not believe it is load or memory related.  The bug is 
> observed primarily on SQS and S3 processors as that is what our flow has; we 
> do not have other processor types to validate against
>  
> The pods are running on EKS with pod-level identity set
>  
> For security reasons I can't post the whole log, but the main errors are 
> shown in the attached screenshots and text files
>  
> !image-2026-02-02-09-54-53-956.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
