boaz-gold opened a new issue, #15898:
URL: https://github.com/apache/iceberg/issues/15898

   ### Apache Iceberg version
   
   1.10.0
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Apache Iceberg version: 1.10.0

     Component: org.apache.iceberg.CachingCatalog
                                          
     Description
   
     CachingCatalog uses a Caffeine cache to hold Table objects. When an entry
     is evicted (by TTL via cache.expiration-interval-ms or by size via
     cache.max-total-bytes), the RemovalListener
     (MetadataTableInvalidatingRemovalListener) only invalidates related
     metadata table entries. It does not call table.io().close().

     This means any resources held by the FileIO implementation are never
     released on eviction.
      
     Impact

     With io-impl = org.apache.iceberg.aws.s3.S3FileIO:

     - Each evicted Table leaves behind a live AWS SDK v2 S3Client
     - Each S3Client owns a ScheduledExecutorService (sdk-ScheduledExecutor-N)
       with background threads for credential refresh (IMDSv2)
     - These threads are GC roots, so they can never be collected
     - In a long-running process (e.g. Spark Thrift Server), threads accumulate
       without bound until the JVM crashes with os::commit_memory failed;
       error='Not enough space' (errno=12)
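
The accumulation described in the bullets above can be reproduced without the AWS SDK. The sketch below is a minimal stand-in, not Iceberg or AWS code: LeakyClient, its thread name, and ThreadLeakDemo are hypothetical names invented for illustration. It assumes only that a client owns a scheduler with one live worker thread, and it compares dropping the reference with calling close().

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a client that owns a background scheduler (not the AWS SDK). */
final class LeakyClient implements Closeable {
  static final String THREAD_NAME = "demo-ScheduledExecutor";

  private final ScheduledThreadPoolExecutor scheduler;
  private volatile Thread worker;

  LeakyClient() {
    ThreadFactory factory = task -> {
      Thread t = new Thread(task, THREAD_NAME);
      t.setDaemon(true); // daemon only so that this demo JVM can still exit
      worker = t;
      return t;
    };
    scheduler = new ScheduledThreadPoolExecutor(1, factory);
    // A periodic no-op, standing in for credential refresh, starts the worker thread.
    scheduler.scheduleAtFixedRate(() -> { }, 1, 1, TimeUnit.HOURS);
  }

  @Override
  public void close() {
    scheduler.shutdownNow();
    Thread t = worker;
    if (t != null) {
      try {
        t.join(5_000); // wait for the worker thread to exit
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}

final class ThreadLeakDemo {
  /** Counts live threads that were created by LeakyClient instances. */
  static long liveDemoThreads() {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.isAlive() && LeakyClient.THREAD_NAME.equals(t.getName()))
        .count();
  }

  /** Creates clients, drops every reference without close(), and returns the thread growth. */
  static long threadsLeakedBy(int clients) {
    long before = liveDemoThreads();
    for (int i = 0; i < clients; i++) {
      new LeakyClient(); // reference discarded, close() never called
    }
    System.gc(); // only a hint; each live worker thread keeps its pool reachable
    return liveDemoThreads() - before;
  }

  /** Creates the same number of clients, closes each one, and returns the thread growth. */
  static long threadsLeftAfterClosing(int clients) {
    long before = liveDemoThreads();
    List<LeakyClient> opened = new ArrayList<>();
    for (int i = 0; i < clients; i++) {
      opened.add(new LeakyClient());
    }
    for (LeakyClient client : opened) {
      client.close();
    }
    return liveDemoThreads() - before;
  }

  public static void main(String[] args) {
    System.out.println("leaked without close(): " + threadsLeakedBy(5));
    System.out.println("left after close():     " + threadsLeftAfterClosing(5));
  }
}
```

In this sketch every client that is dropped without close() leaves one thread behind, while the closed clients leave none. The real S3Client may run more than one thread per instance, so the ratio in production can differ.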
                                                                                
                                                                                
                      
     Observed in production (Spark Thrift Server, ~24h uptime):

     Total JVM threads:          27,877
     sdk-ScheduledExecutor:      27,657
     Distinct pool instances:    8,075+
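
Counts like the ones above can be gathered from inside the JVM as well as from a thread dump. The following is a sketch of one way to do that, and not necessarily how the numbers in this report were produced. ThreadCensus and its method names are hypothetical, and the pool grouping assumes thread names of the form prefix-pool-thread, which should be checked against the names in an actual dump.

```java
import java.util.Set;
import java.util.stream.Collectors;

final class ThreadCensus {
  /** Counts live threads whose name starts with the given prefix. */
  static long countWithPrefix(String prefix) {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.getName().startsWith(prefix))
        .count();
  }

  /** Counts distinct pools, ASSUMING the trailing "-<digits>" segment is a per-thread index. */
  static long distinctPools(String prefix) {
    Set<String> pools = Thread.getAllStackTraces().keySet().stream()
        .map(Thread::getName)
        .filter(name -> name.startsWith(prefix))
        .map(name -> name.replaceFirst("-\\d+$", "")) // drop the assumed thread index
        .collect(Collectors.toSet());
    return pools.size();
  }

  public static void main(String[] args) {
    // The default prefix is taken from the thread name quoted in this report.
    String prefix = args.length > 0 ? args[0] : "sdk-ScheduledExecutor";
    System.out.println("Total JVM threads:       " + Thread.getAllStackTraces().size());
    System.out.println(prefix + ": " + countWithPrefix(prefix));
    System.out.println("Distinct pool instances: " + distinctPools(prefix));
  }
}
```

The methods only see threads in the JVM that calls them, so they would have to run inside the affected process (for example from a diagnostic endpoint) to be useful.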
                                                                                
                                            
                                       
     Proof from bytecode

     CachingCatalog$MetadataTableInvalidatingRemovalListener.onRemoval()
     decompiled from iceberg-spark-runtime-3.5_2.12-1.10.0:

     // logs debug
     // if EXPIRED and not a metadata table: cache.invalidateAll(metadataTableIdentifiers)
     // return   ← no close() call

     There is no table.io().close() call anywhere in the eviction path.
                                                                                
                                            
                     
     Proposed fix

     In CachingCatalog.java, MetadataTableInvalidatingRemovalListener.onRemoval():

     if (value != null && value.io() instanceof Closeable) {
         try {
             ((Closeable) value.io()).close();
         } catch (IOException e) {
             LOG.warn("Failed to close FileIO for evicted table {}", key, e);
         }
     }
                                                                                
                                            
                     
     Note: S3FileIO implements Closeable and its close() method calls
     S3Client.close(), which shuts down the ScheduledExecutorService and
     releases all threads. This fix is sufficient to resolve the leak.
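
The close-on-eviction pattern proposed above can be tried in isolation with the JDK alone. The sketch below is not Caffeine or Iceberg code: ClosingLruCache, CountingResource, and ClosingLruCacheDemo are hypothetical names, and a LinkedHashMap eviction hook stands in for the removal listener. It shows only that closing the value during eviction keeps the number of open resources bounded by the cache size.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical size-bounded LRU cache that closes Closeable values as they are evicted. */
final class ClosingLruCache<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;
  private final int maxEntries;

  ClosingLruCache(int maxEntries) {
    super(16, 0.75f, true); // access order: the eldest entry is the least recently used
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    if (size() <= maxEntries) {
      return false;
    }
    closeIfCloseable(eldest.getKey(), eldest.getValue());
    return true; // tell LinkedHashMap to drop the entry that was just closed
  }

  /** Same shape as the proposed fix: close the value, log and continue on failure. */
  static void closeIfCloseable(Object key, Object value) {
    if (value instanceof Closeable) {
      try {
        ((Closeable) value).close();
      } catch (IOException e) {
        System.err.println("Failed to close value for evicted key " + key + ": " + e);
      }
    }
  }
}

/** Hypothetical resource that tracks how many instances are still open. */
final class CountingResource implements Closeable {
  private final AtomicInteger open;

  CountingResource(AtomicInteger open) {
    this.open = open;
    open.incrementAndGet();
  }

  @Override
  public void close() {
    open.decrementAndGet();
  }
}

final class ClosingLruCacheDemo {
  /** Inserts the given number of resources and returns how many remain open. */
  static int openAfterInserting(int inserts, int maxEntries) {
    AtomicInteger open = new AtomicInteger();
    Map<Integer, CountingResource> cache = new ClosingLruCache<>(maxEntries);
    for (int i = 0; i < inserts; i++) {
      cache.put(i, new CountingResource(open));
    }
    return open.get();
  }

  public static void main(String[] args) {
    System.out.println("open resources: " + openAfterInserting(10, 3));
  }
}
```

The sketch covers only size-based eviction in a single-threaded map; explicit remove() or clear() calls bypass the hook. It also assumes that no caller still holds an evicted value when it is closed.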
      
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

