nsivabalan commented on code in PR #17850:
URL: https://github.com/apache/hudi/pull/17850#discussion_r2849090232
##########
hudi-aws/src/test/java/org/apache/hudi/aws/sync/TestGluePartitionPushdown.java:
##########
@@ -91,6 +95,8 @@ public void testPushDownFilters() {
@Test
public void testPushDownFilterIfExceedLimit() {
Properties props = new Properties();
+ props.put(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key(),
+ MultiPartKeysValueExtractor.class.getName());
Review Comment:
oh, why are these required now?
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala:
##########
@@ -224,11 +226,13 @@ object HoodieSparkUtils extends SparkAdapterSupport with SparkVersionsSupport wi
       partitionPath: String,
       tableBasePath: StoragePath,
       tableSchema: StructType,
-      tableConfig: java.util.Map[String, String],
+      tableConfig: HoodieTableConfig,
       timeZoneId: String,
-      shouldValidatePartitionColumns: Boolean): Array[Object] = {
+      shouldValidatePartitionColumns: Boolean,
+      usePartitionValueExtractorOnRead: Boolean): Array[Object] = {
     val keyGeneratorClass = KeyGeneratorType.getKeyGeneratorClassName(tableConfig)
-    val timestampKeyGeneratorType = tableConfig.get(TimestampKeyGeneratorConfig.TIMESTAMP_TYPE_FIELD.key())
+    val timestampKeyGeneratorType = tableConfig.propsMap().get(TimestampKeyGeneratorConfig.TIMESTAMP_TYPE_FIELD.key())
+    val partitionValueExtractorClass = tableConfig.getPartitionValueExtractorClass
Review Comment:
Something to keep in mind; I could not locate any gaps, but as a reminder: we should also account for the scenario where someone uses a 1.x reader to read a 0.x table. In that case, the table config may not have the partition value extractor set.
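To make the concern concrete, here is a rough standalone sketch of the 0.x-table scenario: the table config written by an older writer has no extractor entry, so the reader must treat the lookup as optional and fall back. The key string, class names, and method name are placeholders for illustration, not the real Hudi constants or reader path.

```java
import java.util.Optional;
import java.util.Properties;

// Hedged sketch: resolveExtractor and EXTRACTOR_KEY are hypothetical
// stand-ins; the point is only that the lookup must be null-safe for
// tables whose config predates the new property.
class LegacyTableFallback {
  static final String EXTRACTOR_KEY = "hoodie.datasource.hive_sync.partition_extractor_class";

  static String resolveExtractor(Properties tableConfig) {
    // 0.x tables: the key is absent, so fall back to a sensible default
    // instead of returning null or throwing.
    return Optional.ofNullable(tableConfig.getProperty(EXTRACTOR_KEY))
        .orElse("org.apache.hudi.hive.MultiPartKeysValueExtractor");
  }
}
```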
##########
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieSyncConfig.java:
##########
@@ -119,40 +118,10 @@ public class HoodieSyncConfig extends HoodieConfig {
   public static final ConfigProperty<String> META_SYNC_PARTITION_EXTRACTOR_CLASS = ConfigProperty
       .key("hoodie.datasource.hive_sync.partition_extractor_class")
       .defaultValue("org.apache.hudi.hive.MultiPartKeysValueExtractor")
-      .withInferFunction(cfg -> {
-        Option<String> partitionFieldsOpt;
-        if (StringUtils.nonEmpty(cfg.getString(META_SYNC_PARTITION_FIELDS))) {
-          partitionFieldsOpt = Option.ofNullable(cfg.getString(META_SYNC_PARTITION_FIELDS));
-        } else {
-          partitionFieldsOpt = HoodieTableConfig.getPartitionFieldProp(cfg)
-              .or(() -> Option.ofNullable(cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME)));
-        }
-        if (!partitionFieldsOpt.isPresent()) {
-          return Option.empty();
-        }
-        String partitionFields = partitionFieldsOpt.get();
-        if (StringUtils.nonEmpty(partitionFields)) {
-          int numOfPartFields = partitionFields.split(",").length;
-          if (numOfPartFields == 1) {
-            if (cfg.contains(HIVE_STYLE_PARTITIONING_ENABLE)
-                && cfg.getString(HIVE_STYLE_PARTITIONING_ENABLE).equals("true")) {
-              return Option.of("org.apache.hudi.hive.HiveStylePartitionValueExtractor");
-            } else if (cfg.contains(SLASH_SEPARATED_DATE_PARTITIONING)
-                && cfg.getString(SLASH_SEPARATED_DATE_PARTITIONING).equals("true")) {
-              return Option.of("org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor");
-            } else {
-              return Option.of("org.apache.hudi.hive.SinglePartPartitionValueExtractor");
-            }
-          } else {
-            return Option.of("org.apache.hudi.hive.MultiPartKeysValueExtractor");
-          }
-        } else {
-          return Option.of("org.apache.hudi.hive.NonPartitionedExtractor");
-        }
-      })
+      .withInferFunction(HoodieTableConfigUtils::inferPartitionValueExtractorClass)
Review Comment:
Let's be very thorough with this change. I see a few minor misses here.
1. Lost preference for META_SYNC_PARTITION_FIELDS
Current code (lines 124-128) checks META_SYNC_PARTITION_FIELDS first; if unset, it uses HoodieTableConfig.getPartitionFieldProp(cfg); if that is also unset, cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME):
if (StringUtils.nonEmpty(cfg.getString(META_SYNC_PARTITION_FIELDS))) {
  partitionFieldsOpt = Option.ofNullable(cfg.getString(META_SYNC_PARTITION_FIELDS));
} else {
  partitionFieldsOpt = HoodieTableConfig.getPartitionFieldProp(cfg)
      .or(() -> Option.ofNullable(cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME)));
}
Patch version:
Option<String> partitionFieldsOpt = HoodieTableConfig.getPartitionFieldProp(cfg)
    .or(() -> Option.ofNullable(cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME)));
Let's check whether META_SYNC_PARTITION_FIELDS is explicitly configured; if yes, let's take that as the first choice.
2. Missing SLASH_SEPARATED_DATE_PARTITIONING check ❌
Current code (lines 140-142):
} else if (cfg.contains(SLASH_SEPARATED_DATE_PARTITIONING)
    && cfg.getString(SLASH_SEPARATED_DATE_PARTITIONING).equals("true")) {
  return Option.of("org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor");
Patch version:
// Missing entirely
The patch drops support for slash-separated date partitioning, which could break existing tables that rely on it.
3. Different config key for hive-style partitioning ⚠️
Current code:
cfg.contains(HIVE_STYLE_PARTITIONING_ENABLE)
cfg.getString(HIVE_STYLE_PARTITIONING_ENABLE)
Patch version:
cfg.contains(KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_ENABLE.key())
cfg.getString(KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_ENABLE.key())
The patch uses KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_ENABLE.key() instead of the table config constant, so it may be reading a different config key.
4. Changed logic for empty partition fields 🔄
Current code (lines 130-132):
if (!partitionFieldsOpt.isPresent()) {
  return Option.empty(); // returns empty Option
}
Patch version:
if (!partitionFieldsOpt.isPresent()) {
  return Option.of("org.apache.hudi.hive.NonPartitionedExtractor"); // returns NonPartitionedExtractor
}
The patch returns NonPartitionedExtractor instead of an empty Option when no partition fields are resolved at all.
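To make the expected precedence concrete, here is a rough standalone sketch of the inference order the points above add up to (explicit sync fields first, then table config, then the key-generator option, keeping the slash-separated date check and the empty-Option behavior of the current code). The literal property keys and the class name are placeholders for the Hudi constants, not the real API.

```java
import java.util.Optional;
import java.util.Properties;

// Hedged sketch only: property keys below stand in for
// META_SYNC_PARTITION_FIELDS, HoodieTableConfig.PARTITION_FIELDS,
// KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME, etc.
class InferExtractorSketch {
  static Optional<String> infer(Properties cfg) {
    // 1. An explicitly configured META_SYNC_PARTITION_FIELDS wins...
    String fields = cfg.getProperty("hoodie.datasource.hive_sync.partition_fields");
    if (fields == null || fields.isEmpty()) {
      // ...then the table config, then the key-generator option.
      fields = cfg.getProperty("hoodie.table.partition.fields");
      if (fields == null) {
        fields = cfg.getProperty("hoodie.datasource.write.partitionpath.field");
      }
    }
    if (fields == null) {
      return Optional.empty(); // point 4: current code infers nothing here
    }
    if (fields.isEmpty()) {
      return Optional.of("org.apache.hudi.hive.NonPartitionedExtractor");
    }
    if (fields.split(",").length > 1) {
      return Optional.of("org.apache.hudi.hive.MultiPartKeysValueExtractor");
    }
    if ("true".equals(cfg.getProperty("hoodie.datasource.write.hive_style_partitioning"))) {
      return Optional.of("org.apache.hudi.hive.HiveStylePartitionValueExtractor");
    }
    // point 2: keep the slash-separated date partitioning check
    if ("true".equals(cfg.getProperty("hoodie.datasource.hive_sync.assume_date_partitioning"))) {
      return Optional.of("org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor");
    }
    return Optional.of("org.apache.hudi.hive.SinglePartPartitionValueExtractor");
  }
}
```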
##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -1208,6 +1217,10 @@ public String getKeyGeneratorClassName() {
     return KeyGeneratorType.getKeyGeneratorClassName(this);
   }
+  public String getPartitionValueExtractorClass() {
Review Comment:
can we return Option<String> here?
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java:
##########
@@ -452,7 +453,8 @@ HoodieTableMetaClient initializeEmptyTable(HoodieTableMetaClient.TableBuilder ta
       payloadClass = overridingMergeConfigs.get().getMiddle();
       mergeStrategyId = overridingMergeConfigs.get().getRight();
     }
-
+    String partitionValueExtractorClassName = props.getString(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key(),
+        HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.defaultValue());
Review Comment:
shouldn't this refer to DataSourceWriteOptions.PARTITION_EXTRACTOR_CLASS.key() first, and then fall back to HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key()?
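The two-level lookup being suggested can be sketched roughly as below; the key strings passed in are placeholders, not the real Hudi constants, and the method is illustrative only.

```java
import java.util.Properties;

// Hedged sketch of "prefer the write option, fall back to the sync
// config key, then its default" resolution order.
class ExtractorKeyResolution {
  static String resolve(Properties props, String writeOptionKey,
                        String syncConfigKey, String defaultValue) {
    String explicit = props.getProperty(writeOptionKey);
    if (explicit != null) {
      return explicit; // first choice: the explicitly set write option
    }
    // fallback chain: sync config key, then the property's default
    return props.getProperty(syncConfigKey, defaultValue);
  }
}
```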
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -291,6 +291,8 @@ class HoodieSparkSqlWriterInternal {
       if (StringUtils.nonEmpty(hoodieConfig.getString(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME)))
         hoodieConfig.getString(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME)
       else KeyGeneratorType.getKeyGeneratorClassName(hoodieConfig)
+    val partitionValueExtractorClassName = hoodieConfig
+      .getStringOrDefault(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key(), null)
Review Comment:
shouldn't this refer to DataSourceWriteOptions.PARTITION_EXTRACTOR_CLASS.key() first, and then fall back to HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key()?
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -749,7 +752,8 @@ class HoodieSparkSqlWriterInternal {
         String.valueOf(HoodieTableConfig.PARTITION_METAFILE_USE_BASE_FORMAT.defaultValue())
       ))
     val tableFormat = hoodieConfig.getStringOrDefault(HoodieTableConfig.TABLE_FORMAT)
-
+    val partitionValueExtractorClassName = hoodieConfig
+      .getStringOrDefault(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key(), null)
Review Comment:
same here
##########
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieTableConfigUtils.java:
##########
@@ -84,4 +85,37 @@ public static HoodieTableVersion getTableVersion(HoodieConfig config) {
         ? HoodieTableVersion.fromVersionCode(config.getInt(HoodieTableConfig.VERSION))
         : HoodieTableConfig.VERSION.defaultValue();
   }
+
+  /**
+   * Infers the appropriate PartitionValueExtractor class based on table configuration.
+   * This function determines the correct extractor based on the number of partition fields
+   * and the partitioning style (Hive-style vs non-Hive-style).
+   *
+   * @param cfg HoodieConfig containing table configuration
+   * @return Option containing the inferred PartitionValueExtractor class name
+   */
+  public static Option<String> inferPartitionValueExtractorClass(HoodieConfig cfg) {
+    Option<String> partitionFieldsOpt = HoodieTableConfig.getPartitionFieldProp(cfg)
+        .or(() -> Option.ofNullable(cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME)));
+
+    if (!partitionFieldsOpt.isPresent()) {
+      return Option.of("org.apache.hudi.hive.NonPartitionedExtractor");
Review Comment:
didn't we discuss keeping this in a separate PR? Or did we agree to keep it in this patch?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]