fengguangyuan opened a new issue #3340:
URL: https://github.com/apache/iceberg/issues/3340
# What's the problem?
In practice, we have multiple threads accessing `TableMetadata` through the
Hive operations, but the current logic of `current()` and `refresh()` is
designed for a single thread. In parallel scenarios, this can easily lead to
GC issues in the driver/main thread.
# Why does the problem occur?
Consider the following case: each `AppendFiles` instance may hold a stale
table metadata instance (referenced by `base`, a member variable of
`SnapshotProducer`) once other threads or tasks have committed new snapshots:
```java
// thread1File, thread2File, thread3File: DataFile instances produced by each thread
AppendFiles af1 = table.newAppend().appendFile(thread1File);
AppendFiles af2 = table.newAppend().appendFile(thread2File);
AppendFiles af3 = table.newAppend().appendFile(thread3File);
...
```
With so many `AppendFiles` instances alive, the stale `TableMetadata`
instances they reference cannot be reclaimed by GC in time. Since the size of
a `TableMetadata` instance grows with the number of snapshots, GC issues
follow, most commonly a `GC overhead limit exceeded` error.
# Resolution
Make `current()` and `refresh()` synchronized. Please let me know if this is
reasonable and sensible. :)
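
To make the proposal concrete, here is a minimal sketch of what I have in mind.
It is a simplified model, not the real Iceberg code: `SynchronizedOpsSketch`,
`Metadata`, `load()` and `distinctInstancesSeen()` are hypothetical names used
only for illustration, and `load()` merely simulates reading the metadata file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the table operations class; not the real Iceberg code.
class SynchronizedOpsSketch {
    // Stand-in for TableMetadata; the real object grows with the snapshot count.
    static final class Metadata {
        final int version;
        Metadata(int version) { this.version = version; }
    }

    private final AtomicInteger loads = new AtomicInteger();
    private Metadata current;             // guarded by "this"
    private boolean shouldRefresh = true; // guarded by "this"

    // Simulates loading the metadata file; each call creates a new instance.
    private Metadata load() {
        return new Metadata(loads.incrementAndGet());
    }

    // Only one thread at a time can decide whether a refresh is needed.
    public synchronized Metadata current() {
        if (shouldRefresh) {
            return refresh();
        }
        return current;
    }

    // Re-entrant: current() already holds the monitor when it calls refresh().
    public synchronized Metadata refresh() {
        current = load();
        shouldRefresh = false;
        return current;
    }

    public int loadCount() {
        return loads.get();
    }

    // Starts threadCount threads that all call current() and returns how many
    // distinct Metadata instances they observed.
    static int distinctInstancesSeen(int threadCount) {
        SynchronizedOpsSketch ops = new SynchronizedOpsSketch();
        Set<Metadata> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(() -> {
                Metadata m = ops.current();
                synchronized (seen) {
                    seen.add(m);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct instances: " + distinctInstancesSeen(8));
    }
}
```

In this model, concurrent callers of `current()` share a single loaded instance
instead of each triggering its own load. Whether that alone is enough to relieve
the GC pressure in the real code is exactly what I would like feedback on.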
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]