[ 
https://issues.apache.org/jira/browse/HADOOP-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18073562#comment-18073562
 ] 

ASF GitHub Bot commented on HADOOP-19862:
-----------------------------------------

konstantinb opened a new pull request, #8426:
URL: https://github.com/apache/hadoop/pull/8426

   <!--
     Thanks for sending a pull request!
       1. If this is your first time, please read our contributor guidelines: 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
       2. Make sure your PR title starts with JIRA issue id, e.g., 
'HADOOP-17799. Your PR title ...'.
   -->
   
   ### Description of PR
   HADOOP-19862: S3A: introduce configurable shared thread pool for AWS SDK 
clients
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   ### AI Tooling
   
   If an AI tool was used:
   
   - [ ] The PR includes the phrase "Contains content generated by <tool>"
         where <tool> is the name of the AI tool used.
   - [ ] My use of AI contributions follows the ASF legal policy
         https://www.apache.org/legal/generative-tooling.html




> S3A: Thread leak from AWS SDK v2 ScheduledExecutorService
> ---------------------------------------------------------
>
>                 Key: HADOOP-19862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19862
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Konstantin Bereznyakov
>            Priority: Major
>
> AWS SDK v2 S3 clients create internal ScheduledExecutorService instances that 
> accumulate over time, causing unbounded thread growth. Thread dumps show 
> thousands of sdk-ScheduledExecutor-* threads in processes that create 
> multiple S3AFileSystem instances.
>   Environment
>   - Hadoop 3.4.x with AWS SDK v2
>   - Not observed in Hadoop 3.3.x (AWS SDK v1)
>   Observed Behavior
>   Thread dump comparison:
>   Hadoop 3.3.x (AWS SDK v1):  Normal thread count
>   Hadoop 3.4.x (AWS SDK v2):  1600+ "sdk-ScheduledExecutor-*" threads
>   Thread pattern:
>   "sdk-ScheduledExecutor-0-0" daemon prio=5 waiting
>   "sdk-ScheduledExecutor-0-1" daemon prio=5 waiting
>   ...
>   "sdk-ScheduledExecutor-0-4" daemon prio=5 waiting
>   "sdk-ScheduledExecutor-1-0" daemon prio=5 waiting
>   ...
>   Root Cause
>   AWS SDK v2's SdkDefaultClientBuilder creates a ScheduledThreadPoolExecutor 
> with 5 threads per client when no executor is explicitly provided 
> (https://github.com/aws/aws-sdk-java-v2/issues/1690):
>   Executors.newScheduledThreadPool(5,
>       new 
> ThreadFactoryBuilder().threadNamePrefix("sdk-ScheduledExecutor").build())
>   These threads are used for retry scheduling, timeout handling, and 
> credential refresh.
>   Contributing Factors
>   1. AbstractFileSystem has no caching
>   Unlike FileSystem.get() which uses CACHE.get(uri, conf), 
> AbstractFileSystem.get() always creates new instances:
>   // AbstractFileSystem.java:263-266
>   public static AbstractFileSystem get(final URI uri, final Configuration 
> conf) {
>       return createFileSystem(uri, conf);  // NO CACHING
>   }
>   Each FileContext.getFileContext() call with an S3 URI creates:
>   - New AbstractFileSystem (S3A)
>   - New S3AFileSystem
>   - New S3Client
>   - 5 new sdk-ScheduledExecutor threads
>   2. S3Client threads not released on close
>   As documented in https://github.com/aws/aws-sdk-java-v2/issues/1690:
>   "When using cached, ephemeral clients, I can see that the scheduled thread 
> pool will at times be leaked when the aws client is evicted"
>   3. Multiple client types affected
>   S3A creates multiple AWS SDK clients:
>   - S3Client (sync)
>   - S3AsyncClient
>   - STS client (for delegation tokens)
>   - KMS client (for encryption)
>   Each client instance creates its own 5-thread pool.
>   Impact
>   - Unbounded thread growth in any process using S3A
>   - Resource exhaustion leading to OOM or system instability
>   - Particularly affects YARN NodeManager, Spark drivers/executors, and other 
> services that create many filesystem instances
>   Related
>   - https://github.com/aws/aws-sdk-java-v2/issues/1690 - SDK issue 
> documenting the problem
>   - https://github.com/aws/aws-sdk-java-v2/pull/4002 - SDK fix allowing 
> shared executor configuration
>   - HADOOP-19624 - Similar thread leak in ABFS (AbfsClientThrottlingAnalyzer)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to