Hi team,

PR #12868 <https://github.com/apache/iceberg/pull/12868> addresses a
critical issue regarding FileIO resource management in Spark that requires
broader community discussion and review.

Issue Summary:

When Spark cleans up broadcast variables, calling FileIO.close() can
unintentionally shut down shared resources, such as HTTP connection pools.
This is particularly problematic when using S3FileIO with the default
ApacheHttpClient, and it can cause Spark read/write queries to fail.

Technical Details:

   - Spark broadcasts FileIO instances to executors.
   - During broadcast variable cleanup on the driver, calling close() can
   terminate shared resources (e.g., the HTTP client connection pool) that
   executors still depend on. This breaks core Iceberg functionality; see
   #12858 <https://github.com/apache/iceberg/issues/12858> and #12046
   <https://github.com/apache/iceberg/issues/12046>.
   - The issue is particularly acute with S3FileIO using ApacheHttpClient,
   because all instances share the same PoolingHttpClientConnectionManager
   instance (see the referenced code
   <https://github.com/apache/httpcomponents-client/blob/master/httpclient5/src/main/java/org/apache/hc/client5/http/impl/io/PoolingHttpClientConnectionManager.java#L263>).
   However, the core problem lies in the broader approach to managing shared
   resources within FileIO.
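
The failure mode above can be reduced to a minimal, self-contained sketch.
The class and method names below (SharedPool, PooledFileIO, lease) are
illustrative only, not Iceberg or AWS SDK APIs; the point is that when every
instance holds the same static pool, close() on the driver's broadcast copy
breaks reads on every executor copy:

```java
import java.util.concurrent.atomic.AtomicBoolean;

class SharedPool {
    private final AtomicBoolean open = new AtomicBoolean(true);

    void shutdown() { open.set(false); }

    boolean lease() {
        if (!open.get()) {
            throw new IllegalStateException("Connection pool is closed");
        }
        return true;
    }
}

class PooledFileIO implements AutoCloseable {
    // All instances share one pool, mirroring how ApacheHttpClient instances
    // can share a single PoolingHttpClientConnectionManager.
    private static final SharedPool POOL = new SharedPool();

    boolean read() { return POOL.lease(); }

    @Override
    public void close() { POOL.shutdown(); } // shuts down the SHARED pool
}

public class Demo {
    public static void main(String[] args) {
        PooledFileIO driverCopy = new PooledFileIO();   // held by the driver's broadcast
        PooledFileIO executorCopy = new PooledFileIO(); // deserialized on an executor

        System.out.println(executorCopy.read()); // pool is healthy, prints "true"

        driverCopy.close(); // broadcast variable cleanup on the driver

        try {
            executorCopy.read(); // executor reads now fail
        } catch (IllegalStateException e) {
            System.out.println("executor read failed: " + e.getMessage());
        }
    }
}
```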

Request for Community Input:

   - Review the proposed solution in PR #12868
   <https://github.com/apache/iceberg/pull/12868>.
   - Discuss whether this is the correct way to fix the issue.
   - Consider whether a more explicit resource ownership and lifecycle
   management model is needed.
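
As one possible shape for the last point, here is a hedged sketch of an
explicit ownership model based on reference counting. This is not the approach
taken in PR #12868, and the names (RefCountedPool, retain, release) are
hypothetical; it only illustrates the idea that a shared resource should shut
down when the last holder releases it, not when any single copy closes:

```java
import java.util.concurrent.atomic.AtomicInteger;

class RefCountedPool {
    private final AtomicInteger refs = new AtomicInteger(0);
    private volatile boolean open = true;

    // Each copy (driver broadcast, executor deserialization) takes a reference.
    RefCountedPool retain() {
        refs.incrementAndGet();
        return this;
    }

    // close() on any copy releases one reference; the pool shuts down only
    // when every holder is done with it.
    void release() {
        if (refs.decrementAndGet() == 0) {
            open = false;
        }
    }

    boolean isOpen() { return open; }
}

public class OwnershipDemo {
    public static void main(String[] args) {
        RefCountedPool pool = new RefCountedPool();
        pool.retain(); // driver copy
        pool.retain(); // executor copy

        pool.release(); // driver broadcast cleanup
        System.out.println(pool.isOpen()); // prints "true": executor still holds a ref

        pool.release(); // executor finishes
        System.out.println(pool.isOpen()); // prints "false": last reference released
    }
}
```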

This issue impacts many users running Iceberg on Spark in production, so
timely review and feedback would be appreciated.

Best,

Xiaoxuan Li
