[GitHub] [gobblin] ZihanLi58 commented on a diff in pull request #3569: [GOBBLIN-1707] Update `IcebergDataset` to incorporate all snapshots, not only the current one

GitBox Fri, 23 Sep 2022 16:03:24 -0700


ZihanLi58 commented on code in PR #3569:
URL: https://github.com/apache/gobblin/pull/3569#discussion_r979103212



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergCatalogFactory.java:
##########
@@ -18,14 +18,19 @@
 package org.apache.gobblin.data.management.copy.iceberg;
 
 import org.apache.hadoop.conf.Configuration;
-import org.apache.iceberg.hive.HiveCatalogs;
+import org.apache.iceberg.hive.HiveCatalog;
+
+import com.google.common.collect.Maps;
 
 
 /**
  * Provides an {@link IcebergCatalog}.
  */
 public class IcebergCatalogFactory {
   public static IcebergCatalog create(Configuration configuration) {
-    return new IcebergHiveCatalog(HiveCatalogs.loadCatalog(configuration));
+    HiveCatalog hcat = new HiveCatalog();

Review Comment:
   What's the reason for this change?



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergTable.java:
##########
@@ -46,36 +51,89 @@
 public class IcebergTable {
   private final TableOperations tableOps;
 
+  /** @return metadata info limited to the most recent (current) snapshot */
   public IcebergSnapshotInfo getCurrentSnapshotInfo() throws IOException {
     TableMetadata current = tableOps.current();
-    Snapshot snapshot = current.currentSnapshot();
+    return createSnapshotInfo(current.currentSnapshot(), 
Optional.of(current.metadataFileLocation()));
+  }
+
+  /** @return metadata info for all known snapshots, ordered historically, 
with *most recent last* */
+  public Iterator<IcebergSnapshotInfo> getAllSnapshotInfosIterator() {
+    TableMetadata current = tableOps.current();
+    long currentSnapshotId = current.currentSnapshot().snapshotId();
+    List<Snapshot> snapshots = current.snapshots();
+    return Iterators.transform(snapshots.iterator(), snapshot -> {
+      try {
+        return IcebergTable.this.createSnapshotInfo(
+            snapshot,
+            currentSnapshotId == snapshot.snapshotId() ? 
Optional.of(current.metadataFileLocation()) : Optional.empty()
+        );
+      } catch (IOException e) {
+        throw new RuntimeException(e);
+      }
+    });
+  }
+
+  /**
+   * @return metadata info for all known snapshots, but incrementally, so 
content overlap between snapshots appears
+   * only within the first as they're ordered historically, with *most recent 
last*
+   */
+  public Iterator<IcebergSnapshotInfo> getIncrementalSnapshotInfosIterator() {

Review Comment:
   Seems all this purpose is just to remove the overlap, then why not just 
simply use set as file path is just string there, and we can use set to 
eliminate all the duplicates? 
   When we talk about incremental, I think it's more about what's the last 
snapshot we have copied to the target and we only want to copy the new 
snapshots and new added data file since then? But this method still seems we 
need to look at all the available paths in one table which is not efficiency? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [gobblin] ZihanLi58 commented on a diff in pull request #3569: [GOBBLIN-1707] Update `IcebergDataset` to incorporate all snapshots, not only the current one

Reply via email to