bhasudha commented on a change in pull request #689: [HUDI-25] Optimize 
HoodieInputFormat.listStatus for faster Hive Incremental queries
URL: https://github.com/apache/incubator-hudi/pull/689#discussion_r291493931
 
 

 ##########
 File path: hoodie-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/InputPathHandler.java
 ##########
 @@ -0,0 +1,138 @@
+/*
+ * Copyright (c) 2016 Uber Technologies, Inc. ([email protected])
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *           http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.uber.hoodie.hadoop;
+
+import static com.uber.hoodie.hadoop.HoodieInputFormat.getTableMetaClient;
+
+import com.uber.hoodie.common.table.HoodieTableMetaClient;
+import com.uber.hoodie.exception.DatasetNotFoundException;
+import com.uber.hoodie.exception.InvalidDatasetException;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+/**
+ * InputPathHandler takes in a set of input paths and incremental tables list. Then, classifies the
+ * input paths to incremental, non-incremental paths and non-hoodie paths. This is then accessed
+ * later to mutate the JobConf before processing incremental mode queries and snapshot queries.
+ */
+public class InputPathHandler {
+
+  public static final Log LOG = LogFactory.getLog(InputPathHandler.class);
+
+  private final Configuration conf;
+  // tablename to metadata mapping for all Hoodie tables(both incremental & non-incremental)
+  private final Map<String, HoodieTableMetaClient> tableMetaClientMap;
+  private final Map<HoodieTableMetaClient, List<Path>> groupedIncrementalPaths;
+  private final List<Path> nonIncrementalPaths;
+  private final List<Path> nonHoodieInputPaths;
+
+  InputPathHandler(Configuration conf, Path[] inputPaths, List<String> incrementalTables) throws IOException {
+    this.conf = conf;
+    tableMetaClientMap = new HashMap<>();
+    nonIncrementalPaths = new ArrayList<>();
 
 Review comment:
   They cannot be compared to the existing code, because the existing code does 
not look at the input paths inside HoodieInputFormat. Input paths are handled 
only inside FileInputFormat.
   
   **Implication of new data structures -** 
    An InputPathHandler is created once per listStatus() call. Within the 
InputPathHandler object, the three data structures (nonIncrementalPaths, 
incrementalPaths and groupedIncrementalPaths) split the input paths among 
them. Each input path contributes at most one entry, and to only one of these 
structures, so the memory overhead is on the order of the total number of 
input paths.
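
   The memory argument above can be illustrated with a minimal sketch. It is not the PR's implementation: `PathBuckets`, `tableOfPath`, and `totalEntries` are hypothetical names, plain strings stand in for `Path` and `HoodieTableMetaClient`, and the bucket names follow the fields in the quoted diff. It only shows that each input path lands in exactly one bucket, so the total number of entries equals the number of input paths.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for InputPathHandler; Strings replace Path and HoodieTableMetaClient. */
class PathBuckets {
  // Incremental paths, grouped by the table they belong to.
  final Map<String, List<String>> groupedIncrementalPaths = new HashMap<>();
  // Paths of Hoodie tables that are queried in non-incremental mode.
  final List<String> nonIncrementalPaths = new ArrayList<>();
  // Paths that do not belong to any Hoodie table.
  final List<String> nonHoodieInputPaths = new ArrayList<>();

  /**
   * @param inputPaths        all input paths for one listStatus() call
   * @param tableOfPath       assumed lookup from path to table name; a missing key means non-Hoodie
   * @param incrementalTables table names that are queried incrementally
   */
  PathBuckets(List<String> inputPaths, Map<String, String> tableOfPath,
      List<String> incrementalTables) {
    for (String path : inputPaths) {
      String table = tableOfPath.get(path);
      if (table == null) {
        nonHoodieInputPaths.add(path);
      } else if (incrementalTables.contains(table)) {
        groupedIncrementalPaths.computeIfAbsent(table, k -> new ArrayList<>()).add(path);
      } else {
        nonIncrementalPaths.add(path);
      }
    }
  }

  /** Total entries held across the three buckets; always equals the number of input paths. */
  int totalEntries() {
    int grouped = 0;
    for (List<String> paths : groupedIncrementalPaths.values()) {
      grouped += paths.size();
    }
    return grouped + nonIncrementalPaths.size() + nonHoodieInputPaths.size();
  }
}
```

   Because every branch of the loop adds the path to exactly one collection, the extra memory grows linearly with the number of input paths, independent of how they are distributed across tables.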

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
