RussellSpitzer commented on a change in pull request #4397:
URL: https://github.com/apache/iceberg/pull/4397#discussion_r836598309
##########
File path: data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java
##########
@@ -45,11 +47,14 @@
import org.apache.iceberg.mapping.NameMapping;
import org.apache.iceberg.orc.OrcMetrics;
import org.apache.iceberg.parquet.ParquetUtil;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.apache.iceberg.relocated.com.google.common.util.concurrent.MoreExecutors;
import org.apache.iceberg.relocated.com.google.common.util.concurrent.ThreadFactoryBuilder;
import org.apache.iceberg.util.Tasks;
public class TableMigrationUtil {
+ public static final String TABLE_MIGRATION_FILE_LISTING_RECURSIVE = "tableMigrationFileListingRecursive";
Review comment:
@kbendick, @puchengy Starting a new discussion here on how to pass parameters and what to name them.
I feel conflicted about the naming. This option should only be used in conjunction with Hive/Spark tables, and my gut feeling was to just respect their controlling Hadoop conf params, though that was mainly out of laziness.
My thinking was basically that anyone calling this code and expecting recursive behavior would already have set
```
mapred.input.dir.recursive=TRUE
hive.mapred.supports.subdirectories=TRUE
```
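To make the "respect the existing Hadoop/Hive params" idea concrete, here is a hedged sketch of a helper that checks those two keys. `RecursiveListingSketch` and `shouldListRecursively` are invented names, and a plain `Map` stands in for a Hadoop `Configuration` so the sketch is self-contained:

```java
import java.util.Map;

// Hypothetical helper (not in the PR): decide whether to list data files
// recursively based on the existing Hive/MapReduce settings. A plain Map
// stands in for a Hadoop Configuration here to keep the sketch self-contained.
public class RecursiveListingSketch {
  static boolean shouldListRecursively(Map<String, String> conf) {
    // Boolean.parseBoolean is case-insensitive, so "TRUE" works as well.
    return Boolean.parseBoolean(conf.getOrDefault("mapred.input.dir.recursive", "false"))
        && Boolean.parseBoolean(conf.getOrDefault("hive.mapred.supports.subdirectories", "false"));
  }
}
```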
So it would be rather seamless, since Spark supports those params as well. But bringing up the Spark option makes me consider that path too: if we accepted it, I think we would have to treat it as a Spark-specific option and rely on it being passed through the Hadoop conf. That in turn means it needs to get passed through `listPartition`, which already has too many parameters. I think we may need to refactor that method soon to take a context parameter instead of the current 8.
So currently I see two ways forward, but I'm willing to hear other options:

1) Re-use the Hadoop/Hive config parameters.
   Pros:
   - Should already be configured by any users using recursive dirs.
   Cons:
   - We (Iceberg) usually don't configure things via the Hadoop conf.
   - We miss any users who were using the DataSource option; they will have to read the docs or struggle to figure out how to enable it.

2) Add an official parameter to `listPartition`.
   Pros:
   - We could then allow any combination of Hadoop parameters, or force our own Iceberg-specific parameter.
   Cons:
   - Requires adding another parameter to an already heavily parameterized function.
   - Requires users to learn a new parameter.
   - Requires greater refactoring for what I believe is more of an edge case.

Any thoughts?
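For option 2, one way to limit the blast radius would be a backwards-compatible overload. The sketch below is heavily simplified (the real `listPartition` takes many more arguments, and the body here is a stub); only the delegation pattern is the point:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of option 2 as an overload, so existing callers are
// untouched while new callers can opt into recursive listing explicitly.
public class ListPartitionSketch {
  // Existing entry point keeps its current, non-recursive behavior.
  static List<String> listPartition(String partitionUri, String format) {
    return listPartition(partitionUri, format, /* recursiveListing= */ false);
  }

  // New overload carries the explicit flag.
  static List<String> listPartition(String partitionUri, String format, boolean recursiveListing) {
    List<String> files = new ArrayList<>();
    // A real implementation would walk the FileSystem, descending into
    // subdirectories only when recursiveListing is true; stubbed here.
    files.add(partitionUri + "/file-0." + format);
    return files;
  }
}
```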
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]