RussellSpitzer commented on a change in pull request #4397:
URL: https://github.com/apache/iceberg/pull/4397#discussion_r836598309
##########
File path: data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java
##########
@@ -45,11 +47,14 @@
import org.apache.iceberg.mapping.NameMapping;
import org.apache.iceberg.orc.OrcMetrics;
import org.apache.iceberg.parquet.ParquetUtil;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.apache.iceberg.relocated.com.google.common.util.concurrent.MoreExecutors;
import org.apache.iceberg.relocated.com.google.common.util.concurrent.ThreadFactoryBuilder;
import org.apache.iceberg.util.Tasks;
public class TableMigrationUtil {
+ public static final String TABLE_MIGRATION_FILE_LISTING_RECURSIVE = "tableMigrationFileListingRecursive";
Review comment:
@kbendick, @puchengy Starting a new discussion here on how to pass parameters and what to name them.
I feel conflicted about the naming. This option should only be used in conjunction with Hive/Spark tables, and my gut feeling was to just respect their controlling Hadoop conf params, though that was mainly out of laziness.
My thinking was basically that anyone calling this code and expecting recursive behavior would already have set
```
mapred.input.dir.recursive=TRUE
hive.mapred.supports.subdirectories=TRUE
```
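To make the "respect the existing Hadoop/Hive params" idea concrete, here is a hedged sketch of a helper that checks those two keys. `RecursiveListingSketch` and `shouldListRecursively` are invented names, and a plain `Map` stands in for a Hadoop `Configuration` so the sketch is self-contained:

```java
import java.util.Map;

// Hypothetical helper (not in the PR): decide whether to list data files
// recursively based on the existing Hive/MapReduce settings. A plain Map
// stands in for a Hadoop Configuration here to keep the sketch self-contained.
public class RecursiveListingSketch {
  static boolean shouldListRecursively(Map<String, String> conf) {
    // Boolean.parseBoolean is case-insensitive, so "TRUE" works as well.
    return Boolean.parseBoolean(conf.getOrDefault("mapred.input.dir.recursive", "false"))
        && Boolean.parseBoolean(conf.getOrDefault("hive.mapred.supports.subdirectories", "false"));
  }
}
```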
So it would be rather seamless, since Spark supports those params as well. But bringing up the Spark option makes me consider that path too: if we accepted it, I think we would have to treat it as a Spark-specific option and rely on it being passed through the Hadoop conf. That in turn means it needs to get passed through `listPartition`, which already has too many parameters. I think we may need to refactor that method soon to take a context parameter instead of the current 8.
So currently I see two ways forward, but I'm willing to hear other options:

1) Re-use the Hadoop/Hive config parameters.
   Pros:
   - Should already be configured by any users using recursive dirs.
   Cons:
   - We (Iceberg) usually don't configure things via the Hadoop conf.
   - We miss any users who were using the DataSource option; they will have to read the docs or struggle to figure out how to enable it.

2) Add an official parameter to `listPartition`.
   Pros:
   - We could then allow any combination of Hadoop parameters, or force our own Iceberg-specific parameter.
   Cons:
   - Requires adding another parameter to an already heavily parameterized function.
   - Requires users to learn a new parameter.
   - Requires greater refactoring for what I believe is more of an edge case.

Any thoughts?
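For option 2, one way to limit the blast radius would be a backwards-compatible overload. The sketch below is heavily simplified (the real `listPartition` takes many more arguments, and the body here is a stub); only the delegation pattern is the point:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of option 2 as an overload, so existing callers are
// untouched while new callers can opt into recursive listing explicitly.
public class ListPartitionSketch {
  // Existing entry point keeps its current, non-recursive behavior.
  static List<String> listPartition(String partitionUri, String format) {
    return listPartition(partitionUri, format, /* recursiveListing= */ false);
  }

  // New overload carries the explicit flag.
  static List<String> listPartition(String partitionUri, String format, boolean recursiveListing) {
    List<String> files = new ArrayList<>();
    // A real implementation would walk the FileSystem, descending into
    // subdirectories only when recursiveListing is true; stubbed here.
    files.add(partitionUri + "/file-0." + format);
    return files;
  }
}
```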
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]