yihua commented on code in PR #18378:
URL: https://github.com/apache/hudi/pull/18378#discussion_r3035868510


##########
hudi-common/src/main/java/org/apache/hudi/common/util/PartitionPathEncodeUtils.java:
##########
@@ -56,6 +56,7 @@ public class PartitionPathEncodeUtils {
 
     charToEscapeFilename.set('_');
     charToEscapeFilename.set('-');
+    charToEscapeFilename.set('.');

Review Comment:
   🤖 Adding `.` to `charToEscapeFilename` is a global change — it affects all 
callers, not just the new partitioned RLI path. For any existing Hudi table 
whose partition paths happen to contain dots (e.g. a table already partitioned 
by `fare.currency=USD` without this fix), those paths were stored in the 
metadata table with unescaped dots. After upgrading, any lookup that re-encodes 
the partition path with the new escaping will produce a different key than 
what's stored, silently breaking RLI lookups on existing tables. Could you 
verify whether this encoding is applied symmetrically at lookup time (so both 
stored and queried keys match), or whether there needs to be a 
migration/upgrade path for pre-existing metadata?
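
   The compatibility risk can be sketched with a toy model — a hypothetical percent-escaper over a configurable character set, not the real `PartitionPathEncodeUtils` logic: a key persisted under the old escape set no longer matches the key an upgraded reader computes for the same partition path.

   ```java
   import java.util.Set;

   public class EscapeSketch {
     // Hypothetical simplified model of partition-path escaping:
     // percent-encode any character found in the escape set.
     static String escape(String path, Set<Character> toEscape) {
       StringBuilder sb = new StringBuilder();
       for (char c : path.toCharArray()) {
         if (toEscape.contains(c)) {
           sb.append('%').append(String.format("%02X", (int) c));
         } else {
           sb.append(c);
         }
       }
       return sb.toString();
     }

     public static void main(String[] args) {
       Set<Character> before = Set.of('_', '-');      // pre-upgrade escape set
       Set<Character> after = Set.of('_', '-', '.');  // with '.' added by this PR
       String partition = "fare.currency=USD";
       String storedKey = escape(partition, before);  // what an old table persisted
       String lookupKey = escape(partition, after);   // what an upgraded reader computes
       System.out.println(storedKey);                 // fare.currency=USD
       System.out.println(lookupKey);                 // fare%2Ecurrency=USD
       System.out.println(storedKey.equals(lookupKey)); // false -> lookup misses
     }
   }
   ```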



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestRecordLevelIndex.scala:
##########
@@ -676,6 +676,75 @@ class TestRecordLevelIndex extends RecordLevelIndexTestBase with SparkDatasetMix
       s"Failed to create empty replacement file $candidateBaseFile")
     candidateBaseFile.getName
   }
+
+  @Test
+  def testPartitionedRecordLevelIndexWithHiveStylePartitioningAndDotInPartitionField(): Unit = {
+    initMetaClient(HoodieTableType.COPY_ON_WRITE)
+    val dataGen = new HoodieTestDataGenerator()
+    val inserts = dataGen.generateInserts("001", 10)
+    val insertDf = toDataset(spark, inserts)
+
+    // Use fare.currency as partition field to test dots in partition field names with Hive-style partitioning
+    val options = Map(HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+      DataSourceWriteOptions.TABLE_TYPE.key -> HoodieTableType.COPY_ON_WRITE.name(),
+      RECORDKEY_FIELD.key -> "_row_key",
+      PARTITIONPATH_FIELD.key -> "fare.currency",
+      HoodieTableConfig.ORDERING_FIELDS.key -> "timestamp",
+      HoodieMetadataConfig.GLOBAL_RECORD_LEVEL_INDEX_ENABLE_PROP.key() -> "false",
+      HoodieMetadataConfig.RECORD_LEVEL_INDEX_ENABLE_PROP.key() -> "true",
+      HoodieMetadataConfig.STREAMING_WRITE_ENABLED.key() -> "false",
+      HoodieCompactionConfig.INLINE_COMPACT.key() -> "false",
+      HoodieIndexConfig.INDEX_TYPE.key() -> RECORD_LEVEL_INDEX.name(),
+      DataSourceWriteOptions.HIVE_STYLE_PARTITIONING.key() -> "true")
+
+    insertDf.write.format("hudi")
+      .options(options)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+
+    assertEquals(10, spark.read.format("hudi").load(basePath).count())
+
+    val props = TypedProperties.fromMap(JavaConverters.mapAsJavaMapConverter(options).asJava)
+    val writeConfig = HoodieWriteConfig.newBuilder()
+      .withProps(props)
+      .withPath(basePath)
+      .build()
+
+    val metadata = metadataWriter(writeConfig).getTableMetadata
+    val recordKeys = inserts.asScala.map(i => i.getRecordKey).asJava.stream().collect(Collectors.toList())
+
+    // Verify record index entries for both partitions
+    // When using fare.currency field, the actual partition paths will be like "fare.currency=USD"
+    val usdPartitionLocations = readRecordIndex(metadata, recordKeys, HOption.of("fare.currency=USD"))
+
+    // All records should be found
+    assertEquals(10, usdPartitionLocations.size)
+
+    val df = spark.read.format("hudi").load(basePath).collect()

Review Comment:
   🤖 nit: the `if (usdPartitionLocations.nonEmpty)` guard is redundant: the `assertEquals(10, ...)` a few lines above already guarantees the collection is non-empty by the time execution reaches the guard, so its condition can never be false.
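
   For illustration, a minimal sketch (hypothetical names, plain Java in place of the Scala test) of why such a guard is always true once the size assertion has passed:

   ```java
   import java.util.List;

   public class GuardSketch {
     // Models the test flow: an assertion on size, then a non-emptiness check.
     static boolean guardAfterAssert(List<String> locations, int expectedSize) {
       if (locations.size() != expectedSize) {  // stand-in for assertEquals(10, ...)
         throw new AssertionError("size mismatch");
       }
       // If we get here with expectedSize > 0, the list is provably non-empty,
       // so this guard condition is always true.
       return !locations.isEmpty();
     }

     public static void main(String[] args) {
       System.out.println(guardAfterAssert(List.of("a", "b"), 2)); // true
     }
   }
   ```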



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestRecordLevelIndex.scala:
##########
@@ -676,6 +676,75 @@ class TestRecordLevelIndex extends RecordLevelIndexTestBase with SparkDatasetMix
       s"Failed to create empty replacement file $candidateBaseFile")
     candidateBaseFile.getName
   }
+
+  @Test
+  def testPartitionedRecordLevelIndexWithHiveStylePartitioningAndDotInPartitionField(): Unit = {
+    initMetaClient(HoodieTableType.COPY_ON_WRITE)
+    val dataGen = new HoodieTestDataGenerator()
+    val inserts = dataGen.generateInserts("001", 10)
+    val insertDf = toDataset(spark, inserts)
+
+    // Use fare.currency as partition field to test dots in partition field names with Hive-style partitioning
+    val options = Map(HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+      DataSourceWriteOptions.TABLE_TYPE.key -> HoodieTableType.COPY_ON_WRITE.name(),
+      RECORDKEY_FIELD.key -> "_row_key",
+      PARTITIONPATH_FIELD.key -> "fare.currency",
+      HoodieTableConfig.ORDERING_FIELDS.key -> "timestamp",
+      HoodieMetadataConfig.GLOBAL_RECORD_LEVEL_INDEX_ENABLE_PROP.key() -> "false",
+      HoodieMetadataConfig.RECORD_LEVEL_INDEX_ENABLE_PROP.key() -> "true",
+      HoodieMetadataConfig.STREAMING_WRITE_ENABLED.key() -> "false",
+      HoodieCompactionConfig.INLINE_COMPACT.key() -> "false",
+      HoodieIndexConfig.INDEX_TYPE.key() -> RECORD_LEVEL_INDEX.name(),
+      DataSourceWriteOptions.HIVE_STYLE_PARTITIONING.key() -> "true")
+
+    insertDf.write.format("hudi")
+      .options(options)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+
+    assertEquals(10, spark.read.format("hudi").load(basePath).count())
+
+    val props = TypedProperties.fromMap(JavaConverters.mapAsJavaMapConverter(options).asJava)
+    val writeConfig = HoodieWriteConfig.newBuilder()
+      .withProps(props)
+      .withPath(basePath)
+      .build()
+
+    val metadata = metadataWriter(writeConfig).getTableMetadata
+    val recordKeys = inserts.asScala.map(i => i.getRecordKey).asJava.stream().collect(Collectors.toList())
+
+    // Verify record index entries for both partitions
+    // When using fare.currency field, the actual partition paths will be like "fare.currency=USD"
+    val usdPartitionLocations = readRecordIndex(metadata, recordKeys, HOption.of("fare.currency=USD"))
+
+    // All records should be found

Review Comment:
   🤖 This assertion assumes all 10 generated records have `fare.currency=USD`, 
but `HoodieTestDataGenerator.generateInserts` may produce records with varied 
`fare.currency` values (e.g. USD, EUR, SGD). If even one record lands in a 
different partition, this assertion will fail intermittently. Could you either 
explicitly verify what values the generator produces for this field, or 
generate your own test records with a fixed `fare.currency` value to make the 
test deterministic?
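
   One way to make the assertion deterministic, sketched here with a hypothetical stand-in type rather than the actual `HoodieTestDataGenerator` records, is to derive the expected count per partition from the generated data itself:

   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;

   public class DeterministicCountSketch {
     // Hypothetical stand-in for a generated Hudi record; only the fields we need.
     record Trip(String key, String currency) {}

     // Derive the expected per-partition count from the data
     // instead of hardcoding the total insert count.
     static long countForPartition(List<Trip> inserts, String currency) {
       Map<String, Long> perPartition = inserts.stream()
           .collect(Collectors.groupingBy(Trip::currency, Collectors.counting()));
       return perPartition.getOrDefault(currency, 0L);
     }

     public static void main(String[] args) {
       // The real generator may mix currencies across records.
       List<Trip> inserts = List.of(
           new Trip("k1", "USD"), new Trip("k2", "EUR"), new Trip("k3", "USD"));
       System.out.println(countForPartition(inserts, "USD")); // 2
       System.out.println(countForPartition(inserts, "SGD")); // 0
     }
   }
   ```

   In the test this would amount to asserting `usdPartitionLocations.size` against a count computed from `inserts`, rather than the hardcoded 10.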



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
