suryaprasanna commented on code in PR #17850:
URL: https://github.com/apache/hudi/pull/17850#discussion_r2854714278
##########
hudi-aws/src/test/java/org/apache/hudi/aws/sync/TestGluePartitionPushdown.java:
##########
@@ -91,6 +95,8 @@ public void testPushDownFilters() {
   @Test
   public void testPushDownFilterIfExceedLimit() {
     Properties props = new Properties();
+    props.put(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key(),
+        MultiPartKeysValueExtractor.class.getName());
Review Comment:
   This was only needed when we defaulted to NonPartitionedExtractor. Removing this
   code change, so it won't be required anymore.
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -291,6 +291,8 @@ class HoodieSparkSqlWriterInternal {
       if (StringUtils.nonEmpty(hoodieConfig.getString(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME)))
         hoodieConfig.getString(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME)
       else KeyGeneratorType.getKeyGeneratorClassName(hoodieConfig)
+    val partitionValueExtractorClassName = hoodieConfig
+      .getStringOrDefault(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key(), null)
Review Comment:
True, let me just use DataSourceWriteOptions configs alone.
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java:
##########
@@ -452,7 +453,8 @@ HoodieTableMetaClient initializeEmptyTable(HoodieTableMetaClient.TableBuilder ta
       payloadClass = overridingMergeConfigs.get().getMiddle();
       mergeStrategyId = overridingMergeConfigs.get().getRight();
     }
-
+    String partitionValueExtractorClassName = props.getString(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key(),
+        HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.defaultValue());
Review Comment:
   DataSourceWriteOptions.PARTITION_EXTRACTOR_CLASS is actually the same as
   HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS. Let me use
   DataSourceWriteOptions.PARTITION_EXTRACTOR_CLASS here.
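   The equivalence can be sketched as below. The class and constant names are
   illustrative stand-ins (not the real Hudi classes); the point is just that
   both names are assumed to resolve to the same underlying config key:

   ```java
   import java.util.Properties;

   // Hedged sketch: DataSourceWriteOptions.PARTITION_EXTRACTOR_CLASS and
   // HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS are modeled here as
   // two aliases for one key, so a value written under either name is read
   // back under the other.
   public class ExtractorConfigAlias {
     // Shared key per the review comment.
     public static final String META_SYNC_PARTITION_EXTRACTOR_CLASS =
         "hoodie.datasource.hive_sync.partition_extractor_class";
     public static final String PARTITION_EXTRACTOR_CLASS =
         META_SYNC_PARTITION_EXTRACTOR_CLASS;

     // Reading through either alias yields the same property value.
     public static String resolve(Properties props, String key) {
       return props.getProperty(key);
     }
   }
   ```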
##########
hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/util/TestPartitionFilterGenerator.java:
##########
@@ -36,6 +38,7 @@ public class TestPartitionFilterGenerator {
   @Test
   public void testPushDownFilters() {
     Properties props = new Properties();
+    props.put(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key(),
+        MultiPartKeysValueExtractor.class.getName());
Review Comment:
   This was only needed when we defaulted to NonPartitionedExtractor. Removing this
   code change, so it won't be required anymore.
##########
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieSyncConfig.java:
##########
@@ -119,40 +118,10 @@ public class HoodieSyncConfig extends HoodieConfig {
   public static final ConfigProperty<String> META_SYNC_PARTITION_EXTRACTOR_CLASS = ConfigProperty
       .key("hoodie.datasource.hive_sync.partition_extractor_class")
       .defaultValue("org.apache.hudi.hive.MultiPartKeysValueExtractor")
-      .withInferFunction(cfg -> {
-        Option<String> partitionFieldsOpt;
-        if (StringUtils.nonEmpty(cfg.getString(META_SYNC_PARTITION_FIELDS))) {
-          partitionFieldsOpt = Option.ofNullable(cfg.getString(META_SYNC_PARTITION_FIELDS));
-        } else {
-          partitionFieldsOpt = HoodieTableConfig.getPartitionFieldProp(cfg)
-              .or(() -> Option.ofNullable(cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME)));
-        }
-        if (!partitionFieldsOpt.isPresent()) {
-          return Option.empty();
-        }
-        String partitionFields = partitionFieldsOpt.get();
-        if (StringUtils.nonEmpty(partitionFields)) {
-          int numOfPartFields = partitionFields.split(",").length;
-          if (numOfPartFields == 1) {
-            if (cfg.contains(HIVE_STYLE_PARTITIONING_ENABLE)
-                && cfg.getString(HIVE_STYLE_PARTITIONING_ENABLE).equals("true")) {
-              return Option.of("org.apache.hudi.hive.HiveStylePartitionValueExtractor");
-            } else if (cfg.contains(SLASH_SEPARATED_DATE_PARTITIONING)
-                && cfg.getString(SLASH_SEPARATED_DATE_PARTITIONING).equals("true")) {
-              return Option.of("org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor");
-            } else {
-              return Option.of("org.apache.hudi.hive.SinglePartPartitionValueExtractor");
-            }
-          } else {
-            return Option.of("org.apache.hudi.hive.MultiPartKeysValueExtractor");
-          }
-        } else {
-          return Option.of("org.apache.hudi.hive.NonPartitionedExtractor");
-        }
-      })
+      .withInferFunction(HoodieTableConfigUtils::inferPartitionValueExtractorClass)
Review Comment:
   1. Let's check whether META_SYNC_PARTITION_FIELDS is explicitly configured; if
   yes, let's take that as the first choice.
   > This cannot be done, as the meta sync configs are not part of the hudi-common
   module; they belong to the hudi-sync module.
   2. Missing SLASH_SEPARATED_DATE_PARTITIONING check
   > My bad, fixed it now.
   3. Different config key for hive-style partitioning
   > `public static final ConfigProperty<String> HIVE_STYLE_PARTITIONING_ENABLE = KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_ENABLE;`
   > So either of them can be used. Let me use KeyGeneratorOptions.
   4. Changed logic for empty partition fields
   > Reverting this change. We can make it as part of a separate PR.
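   For illustration, the decision table being consolidated into
   `HoodieTableConfigUtils.inferPartitionValueExtractorClass` can be sketched in
   plain Java. String and boolean inputs stand in for the real HoodieConfig
   lookups; this is a sketch of the rules, not the actual implementation (in
   particular, the real code distinguishes an unset partition-field prop, where
   it infers nothing, from an explicitly empty one):

   ```java
   // Hedged sketch of the extractor-inference rules: empty fields -> non-
   // partitioned; multiple fields -> multi-part; a single field picks hive-
   // style, slash-encoded-day, or single-part depending on the flags.
   public class PartitionExtractorInference {
     public static String infer(String partitionFields,
                                boolean hiveStylePartitioning,
                                boolean slashSeparatedDatePartitioning) {
       if (partitionFields == null || partitionFields.isEmpty()) {
         // Simplification: missing and empty both map to non-partitioned here.
         return "org.apache.hudi.hive.NonPartitionedExtractor";
       }
       int numOfPartFields = partitionFields.split(",").length;
       if (numOfPartFields > 1) {
         return "org.apache.hudi.hive.MultiPartKeysValueExtractor";
       }
       // Single partition field: hive-style takes precedence over
       // slash-separated-date, matching the order in the removed lambda.
       if (hiveStylePartitioning) {
         return "org.apache.hudi.hive.HiveStylePartitionValueExtractor";
       }
       if (slashSeparatedDatePartitioning) {
         return "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor";
       }
       return "org.apache.hudi.hive.SinglePartPartitionValueExtractor";
     }
   }
   ```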
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala:
##########
@@ -224,11 +226,13 @@ object HoodieSparkUtils extends SparkAdapterSupport with SparkVersionsSupport wi
                                  partitionPath: String,
                                  tableBasePath: StoragePath,
                                  tableSchema: StructType,
-                                 tableConfig: java.util.Map[String, String],
+                                 tableConfig: HoodieTableConfig,
                                  timeZoneId: String,
-                                 shouldValidatePartitionColumns: Boolean): Array[Object] = {
+                                 shouldValidatePartitionColumns: Boolean,
+                                 usePartitionValueExtractorOnRead: Boolean): Array[Object] = {
     val keyGeneratorClass = KeyGeneratorType.getKeyGeneratorClassName(tableConfig)
-    val timestampKeyGeneratorType = tableConfig.get(TimestampKeyGeneratorConfig.TIMESTAMP_TYPE_FIELD.key())
+    val timestampKeyGeneratorType = tableConfig.propsMap().get(TimestampKeyGeneratorConfig.TIMESTAMP_TYPE_FIELD.key())
+    val partitionValueExtractorClass = tableConfig.getPartitionValueExtractorClass
Review Comment:
   Yes, then the user needs to set usePartitionValueExtractorOnRead to false, so
   we don't rely on that field. That is why usePartitionValueExtractorOnRead
   defaults to false.
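   The gating described here can be sketched as follows. All names are
   illustrative stand-ins; the point is only that the extractor field is
   consulted solely when the opt-in flag (default false) is set:

   ```java
   // Hedged sketch of the read-path gate: the table-config partition value
   // extractor is used only when usePartitionValueExtractorOnRead is true and
   // an extractor class is present; otherwise the existing key-generator-based
   // partition parsing is kept.
   public class PartitionParseGate {
     public static String chooseParser(boolean usePartitionValueExtractorOnRead,
                                       String partitionValueExtractorClass) {
       if (usePartitionValueExtractorOnRead && partitionValueExtractorClass != null) {
         return "extractor:" + partitionValueExtractorClass;
       }
       // Default path: do not rely on the extractor field at all.
       return "key-generator";
     }
   }
   ```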
##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -1208,6 +1217,10 @@ public String getKeyGeneratorClassName() {
     return KeyGeneratorType.getKeyGeneratorClassName(this);
   }
+  public String getPartitionValueExtractorClass() {
Review Comment:
Good idea, yes doing that.
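   The diff truncates the new getter's body, so the following is only one
   plausible shape, modeled with a plain Map instead of the real
   HoodieTableConfig; the key and default mirror the ConfigProperty shown
   earlier in this thread:

   ```java
   import java.util.Map;

   // Hedged sketch of a HoodieTableConfig#getPartitionValueExtractorClass-style
   // getter: return the configured extractor class, falling back to the
   // documented default when the property is unset. Hypothetical illustration,
   // not the merged implementation.
   public class TableConfigSketch {
     public static final String PARTITION_EXTRACTOR_KEY =
         "hoodie.datasource.hive_sync.partition_extractor_class";
     public static final String DEFAULT_EXTRACTOR =
         "org.apache.hudi.hive.MultiPartKeysValueExtractor";

     public static String getPartitionValueExtractorClass(Map<String, String> props) {
       return props.getOrDefault(PARTITION_EXTRACTOR_KEY, DEFAULT_EXTRACTOR);
     }
   }
   ```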
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -749,7 +752,8 @@ class HoodieSparkSqlWriterInternal {
       String.valueOf(HoodieTableConfig.PARTITION_METAFILE_USE_BASE_FORMAT.defaultValue())
     ))
     val tableFormat = hoodieConfig.getStringOrDefault(HoodieTableConfig.TABLE_FORMAT)
-
+    val partitionValueExtractorClassName = hoodieConfig
+      .getStringOrDefault(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key(), null)
Review Comment:
Done.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]