[
https://issues.apache.org/jira/browse/HADOOP-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18074450#comment-18074450
]
ASF GitHub Bot commented on HADOOP-19862:
-----------------------------------------
ajfabbri commented on PR #8426:
URL: https://github.com/apache/hadoop/pull/8426#issuecomment-4271719370
👋 Thanks for the PR! I've been meaning to help test and review this but
haven't had time to get to it. Will try to take a look next week.
> S3A: Thread leak from AWS SDK v2 ScheduledExecutorService
> ---------------------------------------------------------
>
> Key: HADOOP-19862
> URL: https://issues.apache.org/jira/browse/HADOOP-19862
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Konstantin Bereznyakov
> Priority: Major
> Labels: pull-request-available
>
> AWS SDK v2 S3 clients create internal ScheduledExecutorService instances that
> accumulate over time, causing unbounded thread growth. Thread dumps show
> thousands of sdk-ScheduledExecutor-* threads in processes that create
> multiple S3AFileSystem instances.
>
> Environment
> - Hadoop 3.4.x with AWS SDK v2
> - Not observed in Hadoop 3.3.x (AWS SDK v1)
>
> Observed Behavior
> Thread dump comparison:
>   Hadoop 3.3.x (AWS SDK v1): Normal thread count
>   Hadoop 3.4.x (AWS SDK v2): 1600+ "sdk-ScheduledExecutor-*" threads
> Thread pattern:
>   "sdk-ScheduledExecutor-0-0" daemon prio=5 waiting
>   "sdk-ScheduledExecutor-0-1" daemon prio=5 waiting
>   ...
>   "sdk-ScheduledExecutor-0-4" daemon prio=5 waiting
>   "sdk-ScheduledExecutor-1-0" daemon prio=5 waiting
>   ...
>
> Root Cause
> AWS SDK v2's SdkDefaultClientBuilder creates a ScheduledThreadPoolExecutor
> with 5 threads per client when no executor is explicitly provided
> (https://github.com/aws/aws-sdk-java-v2/issues/1690):
>   Executors.newScheduledThreadPool(5,
>       new ThreadFactoryBuilder().threadNamePrefix("sdk-ScheduledExecutor").build())
> These threads are used for retry scheduling, timeout handling, and
> credential refresh.
>
> Contributing Factors
> 1. AbstractFileSystem has no caching
> Unlike FileSystem.get(), which uses CACHE.get(uri, conf),
> AbstractFileSystem.get() always creates new instances:
>   // AbstractFileSystem.java:263-266
>   public static AbstractFileSystem get(final URI uri, final Configuration conf) {
>     return createFileSystem(uri, conf);  // NO CACHING
>   }
> Each FileContext.getFileContext() call with an S3 URI creates:
> - New AbstractFileSystem (S3A)
> - New S3AFileSystem
> - New S3Client
> - 5 new sdk-ScheduledExecutor threads
> 2. S3Client threads not released on close
> As documented in https://github.com/aws/aws-sdk-java-v2/issues/1690:
> "When using cached, ephemeral clients, I can see that the scheduled thread
> pool will at times be leaked when the aws client is evicted"
> 3. Multiple client types affected
> S3A creates multiple AWS SDK clients:
> - S3Client (sync)
> - S3AsyncClient
> - STS client (for delegation tokens)
> - KMS client (for encryption)
> Each client instance creates its own 5-thread pool.
>
> Impact
> - Unbounded thread growth in any process using S3A
> - Resource exhaustion leading to OOM or system instability
> - Particularly affects YARN NodeManager, Spark drivers/executors, and other
> services that create many filesystem instances
>
> Related
> - https://github.com/aws/aws-sdk-java-v2/issues/1690 - SDK issue
> documenting the problem
> - https://github.com/aws/aws-sdk-java-v2/pull/4002 - SDK fix allowing
> shared executor configuration
> - HADOOP-19624 - Similar thread leak in ABFS (AbfsClientThrottlingAnalyzer)
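
For illustration only, here is a minimal JDK-only sketch of the thread growth
described under "Root Cause" in the quoted issue, and of how one shared
scheduler bounds it. It is not the S3A or AWS SDK code and not the change
proposed in the PR. The class ScheduledExecutorLeakSketch, the method
threadsCreated, and the "sketch-ScheduledExecutor" thread name are
hypothetical, and each simulated "client" is assumed to do nothing but
schedule tasks on its pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: counts threads created by per-client vs. shared pools.
final class ScheduledExecutorLeakSketch {
    // Matches the 5-thread pool size reported in the issue.
    static final int POOL_SIZE = 5;

    // Thread factory that counts every thread it is asked to create.
    static ThreadFactory countingFactory(AtomicInteger created) {
        return r -> {
            Thread t = new Thread(r,
                "sketch-ScheduledExecutor-" + created.getAndIncrement());
            t.setDaemon(true);
            return t;
        };
    }

    // Simulates `clients` clients; returns how many threads were created.
    // shared == false: every client builds its own 5-thread pool.
    // shared == true:  all clients reuse a single 5-thread pool.
    static int threadsCreated(int clients, boolean shared) {
        AtomicInteger created = new AtomicInteger();
        ThreadFactory factory = countingFactory(created);
        List<ScheduledExecutorService> pools = new ArrayList<>();
        ScheduledExecutorService sharedPool = null;
        if (shared) {
            sharedPool = Executors.newScheduledThreadPool(POOL_SIZE, factory);
            pools.add(sharedPool);
        }
        for (int i = 0; i < clients; i++) {
            ScheduledExecutorService pool = sharedPool;
            if (pool == null) {
                pool = Executors.newScheduledThreadPool(POOL_SIZE, factory);
                pools.add(pool);
            }
            // Core threads start lazily, one per submitted task, until the
            // core size is reached; schedule enough work to reach it.
            for (int t = 0; t < POOL_SIZE; t++) {
                pool.schedule(() -> { }, 1, TimeUnit.HOURS);
            }
        }
        int count = created.get();
        // Without this shutdown the pool threads would stay alive, which is
        // the leak pattern the issue describes.
        for (ScheduledExecutorService pool : pools) {
            pool.shutdownNow();
        }
        return count;
    }
}
```

With three simulated clients, threadsCreated(3, false) returns 15 and
threadsCreated(3, true) returns 5. The second SDK link under "Related" is
described as allowing a shared executor to be configured; the exact SDK option
for that is not shown here.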