[ 
https://issues.apache.org/jira/browse/GOBBLIN-1707?focusedWorklogId=811758&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-811758
 ]

ASF GitHub Bot logged work on GOBBLIN-1707:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 23/Sep/22 23:03
            Start Date: 23/Sep/22 23:03
    Worklog Time Spent: 10m 
      Work Description: ZihanLi58 commented on code in PR #3569:
URL: https://github.com/apache/gobblin/pull/3569#discussion_r979103212


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergCatalogFactory.java:
##########
@@ -18,14 +18,19 @@
 package org.apache.gobblin.data.management.copy.iceberg;
 
 import org.apache.hadoop.conf.Configuration;
-import org.apache.iceberg.hive.HiveCatalogs;
+import org.apache.iceberg.hive.HiveCatalog;
+
+import com.google.common.collect.Maps;
 
 
 /**
  * Provides an {@link IcebergCatalog}.
  */
 public class IcebergCatalogFactory {
   public static IcebergCatalog create(Configuration configuration) {
-    return new IcebergHiveCatalog(HiveCatalogs.loadCatalog(configuration));
+    HiveCatalog hcat = new HiveCatalog();

Review Comment:
   What's the reason for this change?



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergTable.java:
##########
@@ -46,36 +51,89 @@
 public class IcebergTable {
   private final TableOperations tableOps;
 
+  /** @return metadata info limited to the most recent (current) snapshot */
   public IcebergSnapshotInfo getCurrentSnapshotInfo() throws IOException {
     TableMetadata current = tableOps.current();
-    Snapshot snapshot = current.currentSnapshot();
+    return createSnapshotInfo(current.currentSnapshot(), Optional.of(current.metadataFileLocation()));
+  }
+
+  /** @return metadata info for all known snapshots, ordered historically, with *most recent last* */
+  public Iterator<IcebergSnapshotInfo> getAllSnapshotInfosIterator() {
+    TableMetadata current = tableOps.current();
+    long currentSnapshotId = current.currentSnapshot().snapshotId();
+    List<Snapshot> snapshots = current.snapshots();
+    return Iterators.transform(snapshots.iterator(), snapshot -> {
+      try {
+        return IcebergTable.this.createSnapshotInfo(
+            snapshot,
+            currentSnapshotId == snapshot.snapshotId() ? Optional.of(current.metadataFileLocation()) : Optional.empty()
+        );
+      } catch (IOException e) {
+        throw new RuntimeException(e);
+      }
+    });
+  }
+
+  /**
+   * @return metadata info for all known snapshots, but incrementally, so content overlap between snapshots appears
+   * only within the first as they're ordered historically, with *most recent last*
+   */
+  public Iterator<IcebergSnapshotInfo> getIncrementalSnapshotInfosIterator() {

Review Comment:
   It seems the only purpose of this method is to remove the overlap. Since a
   file path is just a string here, why not simply use a set to eliminate the
   duplicates?
   Also, when we talk about "incremental", I think it is more about tracking
   the last snapshot we copied to the target, so that we copy only the new
   snapshots and the data files added since then. This method still seems to
   require looking at every available path in the table, which is not
   efficient.
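
   As a rough illustration of the set-based idea, here is a minimal sketch.
   It is not code from this PR: the class and method names are hypothetical,
   and it works on plain path strings rather than on Iceberg or Gobblin
   types. It assumes the snapshots are supplied oldest first, matching the
   "most recent last" ordering described in the Javadoc above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Gobblin or Iceberg. */
final class SnapshotPathDedup {

  private SnapshotPathDedup() {}

  /**
   * Given the file paths of each snapshot (oldest snapshot first), returns one
   * list per snapshot holding only the paths that no earlier snapshot listed.
   */
  static List<List<String>> incrementalPaths(List<List<String>> pathsPerSnapshot) {
    Set<String> seen = new HashSet<>();
    List<List<String>> result = new ArrayList<>();
    for (List<String> paths : pathsPerSnapshot) {
      List<String> fresh = new ArrayList<>();
      for (String path : paths) {
        // Set.add returns false when the path was already present,
        // so each path is kept only at its first appearance.
        if (seen.add(path)) {
          fresh.add(path);
        }
      }
      result.add(fresh);
    }
    return result;
  }
}
```

   Note that this still visits every path of every snapshot once, so it
   addresses only the de-duplication point, not the concern about limiting
   work to the snapshots added since the last copy.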





Issue Time Tracking
-------------------

    Worklog Id:     (was: 811758)
    Time Spent: 40m  (was: 0.5h)

> Add Iceberg support to DistCp
> -----------------------------
>
>                 Key: GOBBLIN-1707
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1707
>             Project: Apache Gobblin
>          Issue Type: Task
>          Components: gobblin-core
>            Reporter: Kip Kohn
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Add capability for iceberg copy/replication to distcp.  Support incremental 
> copy (only of delta changes since last time) in addition to full copy on 
> first time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
