SourabhBadhya commented on code in PR #5251:
URL: https://github.com/apache/hive/pull/5251#discussion_r1629519259


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergMergeTaskProperties.java:
##########
@@ -42,14 +46,23 @@ public Path getTmpLocation() {
     return new Path(location + "/data/");
   }
 
-  public StorageFormatDescriptor getStorageFormatDescriptor() throws IOException {
-    FileFormat fileFormat = FileFormat.fromString(properties.getProperty(TableProperties.DEFAULT_FILE_FORMAT,
-            TableProperties.DEFAULT_FILE_FORMAT_DEFAULT));
-    StorageFormatDescriptor descriptor = storageFormatFactory.get(fileFormat.name());
-    if (descriptor == null) {
-      throw new IOException("Unsupported storage format descriptor");
+  @Override
+  public MergeSplitProperties getSplitProperties() throws IOException {
+    String tableName = properties.getProperty(Catalogs.NAME);
+    String snapshotRef = properties.getProperty(Catalogs.SNAPSHOT_REF);
+    Configuration configuration = SessionState.getSessionConf();
+    List<JobContext> originalContextList = IcebergTableUtil.generateJobContext(configuration, tableName, snapshotRef);
+    List<JobContext> jobContextList = originalContextList.stream()
+            .map(TezUtil::enrichContextWithVertexId)
+            .collect(Collectors.toList());
+    if (jobContextList.isEmpty()) {
+      return null;
     }
-    return descriptor;
+    List<ContentFile> contentFiles = new HiveIcebergOutputCommitter().getOutputContentFiles(jobContextList);
+    Map<Path, Object> pathToContentFile = Maps.newConcurrentMap();
+    contentFiles.forEach(contentFile -> {

Review Comment:
   I don't think it works that easily. We have to construct the Path object.
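
   For illustration, a minimal sketch of the construction I mean, assuming `ContentFile#path()` returns a `CharSequence` here (the variable names follow the diff above):

   ```java
   contentFiles.forEach(contentFile -> {
     // ContentFile#path() is a CharSequence, not a Hadoop Path, so the
     // Path object has to be constructed from its string form first.
     Path filePath = new Path(contentFile.path().toString());
     pathToContentFile.put(filePath, contentFile);
   });
   ```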



##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java:
##########
@@ -328,4 +334,40 @@ public static PartitionData toPartitionData(StructLike key, Types.StructType key
     }
     return data;
   }
+
+  /**
+   * Generates {@link JobContext}s for the OutputCommitter for the specific table.
+   * @param configuration The configuration used as the base of the JobConf
+   * @param tableName The name of the table we are planning to commit
+   * @param branchName The name of the branch
+   * @return The generated JobContext list, or an empty list if none is present
+   */
+  static List<JobContext> generateJobContext(Configuration configuration, String tableName,

Review Comment:
   Done.
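
   For reference, a hedged usage sketch of the new helper (the table name `"db.tbl"` and branch `"main"` are hypothetical, and the truncated signature is assumed to end with a `String branchName` parameter, per the javadoc):

   ```java
   Configuration conf = SessionState.getSessionConf();
   // One JobContext per pending commit for the table/branch;
   // the returned list is empty when there is nothing to commit.
   List<JobContext> contexts = IcebergTableUtil.generateJobContext(conf, "db.tbl", "main");
   ```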



##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapred/AbstractMapredIcebergRecordReader.java:
##########
@@ -23,16 +23,16 @@
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.RecordReader;
 import org.apache.hadoop.mapred.Reporter;
+import org.apache.hadoop.mapreduce.InputSplit;
 import org.apache.hadoop.mapreduce.TaskAttemptContext;
-import org.apache.iceberg.mr.mapreduce.IcebergSplit;
 
 @SuppressWarnings("checkstyle:VisibilityModifier")
public abstract class AbstractMapredIcebergRecordReader<T> implements RecordReader<Void, T> {
 
  protected final org.apache.hadoop.mapreduce.RecordReader<Void, ?> innerReader;
 
  public AbstractMapredIcebergRecordReader(org.apache.iceberg.mr.mapreduce.IcebergInputFormat<?> mapreduceInputFormat,
-      IcebergSplit split, JobConf job, Reporter reporter) throws IOException {
+      InputSplit split, JobConf job, Reporter reporter) throws IOException {

Review Comment:
   This is required since I have created a new split named IcebergMergeSplit, which extends FileSplit.
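
   A minimal sketch of the type relationship that motivates widening the parameter (the shape of IcebergMergeSplit is assumed from this PR, body elided):

   ```java
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.mapreduce.lib.input.FileSplit;

   // FileSplit already extends org.apache.hadoop.mapreduce.InputSplit, so a
   // constructor typed against InputSplit accepts the existing IcebergSplit
   // and the new IcebergMergeSplit alike.
   class IcebergMergeSplit extends FileSplit {
     IcebergMergeSplit(Path file, long start, long length) {
       super(file, start, length, null);
     }
   }
   ```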



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
