ZihanLi58 commented on code in PR #3569:
URL: https://github.com/apache/gobblin/pull/3569#discussion_r980310564
##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergTable.java:
##########
@@ -46,36 +51,89 @@
public class IcebergTable {
private final TableOperations tableOps;
+ /** @return metadata info limited to the most recent (current) snapshot */
public IcebergSnapshotInfo getCurrentSnapshotInfo() throws IOException {
TableMetadata current = tableOps.current();
- Snapshot snapshot = current.currentSnapshot();
+ return createSnapshotInfo(current.currentSnapshot(), Optional.of(current.metadataFileLocation()));
+ }
+
+ /** @return metadata info for all known snapshots, ordered historically, with *most recent last* */
+ public Iterator<IcebergSnapshotInfo> getAllSnapshotInfosIterator() {
+ TableMetadata current = tableOps.current();
+ long currentSnapshotId = current.currentSnapshot().snapshotId();
+ List<Snapshot> snapshots = current.snapshots();
+ return Iterators.transform(snapshots.iterator(), snapshot -> {
+ try {
+ return IcebergTable.this.createSnapshotInfo(
+ snapshot,
+ currentSnapshotId == snapshot.snapshotId() ? Optional.of(current.metadataFileLocation()) : Optional.empty()
+ );
+ } catch (IOException e) {
+ throw new RuntimeException(e);
+ }
+ });
+ }
+
+ /**
+ * @return metadata info for all known snapshots, but incrementally, so content overlap between snapshots appears
+ * only within the first as they're ordered historically, with *most recent last*
+ */
+ public Iterator<IcebergSnapshotInfo> getIncrementalSnapshotInfosIterator() {
Review Comment:
It makes sense to me if we are doing the MVP at this point. But if we just
want to copy the whole table, why not use a set to store the paths and
eliminate duplicates directly, instead of this complicated logic? As for
preserving the order, I believe most of the overhead I'm concerned about is
in the data files: even when copying just the current snapshot, you still
check all the old data files. Compared to that, the metadata files do not
seem to be a big deal here?
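
To make the set-based suggestion concrete, here is a minimal sketch of what I have in mind. It assumes each snapshot can be reduced to a flat list of file paths, oldest snapshot first. The class name `SnapshotPathDedup`, the method `dedupPaths`, and the file names are hypothetical and not part of Gobblin or Iceberg. A `LinkedHashSet` drops repeated paths while keeping first-seen order, so the historical ordering is still preserved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SnapshotPathDedup {

  /**
   * Flattens per-snapshot path lists (oldest snapshot first) into a single
   * list in which each path appears once, at the position it was first seen.
   */
  public static List<String> dedupPaths(List<List<String>> pathsPerSnapshot) {
    // LinkedHashSet ignores re-insertions and iterates in insertion order.
    Set<String> seen = new LinkedHashSet<>();
    for (List<String> paths : pathsPerSnapshot) {
      seen.addAll(paths);
    }
    return new ArrayList<>(seen);
  }

  public static void main(String[] args) {
    // Hypothetical paths: the second snapshot reuses a manifest and a data file.
    List<List<String>> snapshots = Arrays.asList(
        Arrays.asList("manifest-list-1.avro", "manifest-a.avro", "data-1.parquet"),
        Arrays.asList("manifest-list-2.avro", "manifest-a.avro", "manifest-b.avro",
            "data-1.parquet", "data-2.parquet"));
    System.out.println(dedupPaths(snapshots));
  }
}
```

The trade-off is that the set holds every path in memory at once, which an incremental iterator may be trying to avoid.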