fengguangyuan opened a new issue #3340:
URL: https://github.com/apache/iceberg/issues/3340


   # What's the problem?
   In practice, it is reasonable for multiple threads to access 
`TableMetadata` through the Hive operations, but the current logic of 
`current()` and `refresh()` is designed for a single thread. In parallel 
scenarios this easily leads to GC issues in the driver/main thread.
   
   # Why does the problem occur?
   Consider the following case: each `AppendFiles` instance may hold a 
stale table metadata instance (referenced by `base`, a member variable of 
`SnapshotProducer`) because other threads or tasks have committed new 
snapshots in the meantime:
   ```java
   // Each pending append keeps a reference to the metadata it was created from.
   AppendFiles af1 = table.newAppend().appendFile(thread1File);
   AppendFiles af2 = table.newAppend().appendFile(thread2File);
   AppendFiles af3 = table.newAppend().appendFile(thread3File);
   ...
   ```
   With so many `AppendFiles` instances alive, the stale `TableMetadata` 
instances they reference cannot be reclaimed by GC in time. Because the size 
of a `TableMetadata` instance grows with the number of snapshots, GC issues 
follow, most commonly the `GC overhead limit exceeded` error.
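
To make the retention pattern concrete, here is a minimal, self-contained model of it. It does not use Iceberg classes: `MetadataModel`, `PendingAppendModel`, and `StaleBaseDemo` are hypothetical stand-ins, and the assumption is simply that every commit produces a new metadata object while each pending operation keeps a strong reference to the one it started from.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for table metadata; its size grows with the snapshot count.
final class MetadataModel {
  final List<Long> snapshotIds;

  MetadataModel(List<Long> snapshotIds) {
    this.snapshotIds = snapshotIds;
  }

  // A commit yields a new metadata object that carries one more snapshot.
  MetadataModel withNewSnapshot(long id) {
    List<Long> next = new ArrayList<>(snapshotIds);
    next.add(id);
    return new MetadataModel(next);
  }
}

// Hypothetical pending operation that pins the metadata it was created from.
final class PendingAppendModel {
  final MetadataModel base;

  PendingAppendModel(MetadataModel base) {
    this.base = base;
  }
}

class StaleBaseDemo {
  // Creates `appends` pending operations with a commit from another writer
  // between each, then counts the distinct metadata objects still reachable.
  static int retainedMetadataCount(int appends) {
    MetadataModel current = new MetadataModel(new ArrayList<>());
    List<PendingAppendModel> pending = new ArrayList<>();
    for (int i = 0; i < appends; i++) {
      pending.add(new PendingAppendModel(current));
      current = current.withNewSnapshot(i); // another writer commits
    }
    // Identity-based set: count objects, not equal values.
    Set<MetadataModel> retained = Collections.newSetFromMap(new IdentityHashMap<>());
    for (PendingAppendModel p : pending) {
      retained.add(p.base);
    }
    retained.add(current);
    return retained.size();
  }
}
```

In this model, three pending appends keep three stale metadata objects alive in addition to the current one, so `StaleBaseDemo.retainedMetadataCount(3)` returns 4. The retained memory therefore scales with both the number of pending operations and the snapshot count per metadata object.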
   
   # Resolution
   Make `current()` and `refresh()` synchronized. Please let me know if 
this is reasonable and sensible. :)
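
A rough sketch of the idea, not the actual Iceberg code: `MetadataSketch` and `SynchronizedOpsSketch` are hypothetical names, and the `shouldRefresh` flag and load counter are assumptions made for illustration. With both methods synchronized, concurrent callers of `current()` share one loaded metadata object instead of each triggering their own load.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a table metadata object; not an Iceberg class.
final class MetadataSketch {
  final int version;

  MetadataSketch(int version) {
    this.version = version;
  }
}

// Hypothetical operations class whose current() and refresh() are synchronized.
class SynchronizedOpsSketch {
  private final AtomicInteger loads = new AtomicInteger();
  private MetadataSketch current;        // guarded by this
  private boolean shouldRefresh = true;  // guarded by this

  // Returns the cached metadata, loading it first if it is marked stale.
  public synchronized MetadataSketch current() {
    if (shouldRefresh) {
      return refresh();
    }
    return current;
  }

  // Reloads metadata. The monitor is reentrant, so current() may call this
  // while already holding the lock; only one thread loads at a time.
  public synchronized MetadataSketch refresh() {
    current = new MetadataSketch(loads.incrementAndGet()); // stands in for an expensive load
    shouldRefresh = false;
    return current;
  }

  public int loadCount() {
    return loads.get();
  }

  // Starts `threads` concurrent callers of current() and returns the number of loads.
  static int demo(int threads) {
    SynchronizedOpsSketch ops = new SynchronizedOpsSketch();
    List<Thread> workers = new ArrayList<>();
    for (int i = 0; i < threads; i++) {
      Thread t = new Thread(ops::current);
      workers.add(t);
      t.start();
    }
    for (Thread t : workers) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return ops.loadCount();
  }
}
```

In this sketch, `SynchronizedOpsSketch.demo(8)` returns 1: the first thread to take the lock performs the load and clears `shouldRefresh`, and the remaining threads see the cached instance. The trade-off is that callers block while a refresh is in progress.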

