hudi-bot opened a new issue, #15035:
URL: https://github.com/apache/hudi/issues/15035

   When the partition path contains Unicode characters, the upsert fails.
   h3. To reproduce
    # Create this dataframe in spark-shell (note the dotted capital İ, U+0130)
   {code:none}
   scala> res0.show(truncate=false)
   +---+---+
   |_c0|_c1|
   +---+---+
   |1  |İ  |
   +---+---+
   {code}
    # Write it to hudi (this write will create the hudi table and succeed)
   {code:none}
    res0.write.format("hudi").
      option("hoodie.table.name", "unicode_test").
      option("hoodie.datasource.write.precombine.field", "_c0").
      option("hoodie.datasource.write.recordkey.field", "_c0").
      option("hoodie.datasource.write.partitionpath.field", "_c1").
      mode("append").
      save("file:///Users/ji.qi/Desktop/unicode_test")
   {code}
    # Try to write {{res0}} again (this upsert will fail at index lookup stage)
   
   h3. Environment
    * Hudi version: 0.10.1
    * Spark version: 3.1.2
   
   h3. Stacktrace
   {code:none}
   22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
(http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0&basepath=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test&fileid=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0&lastinstantts=20220225182311228&timelinehash=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
   22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 
403)
   org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
parquet 
file:/Users/ji.qi/Desktop/unicode_test/İ/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
        at 
org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
        at 
org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
        at 
org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
        at 
org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
        at 
org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
        at 
org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
        at 
org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
        at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
        at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
        at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
        at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
        at scala.collection.AbstractIterator.to(Iterator.scala:1429)
        at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
        at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
        at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
        at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
        at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
        at 
org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.FileNotFoundException: File 
file:/Users/ji.qi/Desktop/unicode_test/İ/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
 does not exist
        at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:666)
        at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
        at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
        at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
        at 
org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
        at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
        at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:433)
        at 
org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:183)
        ... 33 more
   {code}
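   As a side note (not part of the original report), the percent-encoded partition parameter in the RemoteHoodieTableFileSystemView request above does decode back to the expected character, so the index-lookup request itself carries the correct bytes; a quick Python check:
   {code:none}
   from urllib.parse import quote, unquote

   # The partition parameter seen in the file-system-view request.
   encoded = "%C4%B0"

   # Percent-decoding (UTF-8) recovers the dotted capital I, U+0130.
   decoded = unquote(encoded)
   print(decoded)            # İ
   print(hex(ord(decoded)))  # 0x130

   # Round trip: encoding İ again yields the same escape sequence.
   print(quote("İ"))         # %C4%B0
   {code}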
   It seems the file name is being resolved incorrectly: the dotted İ somehow turns into A-umlaut plus a degree sign ("Ä°", which is what the UTF-8 bytes of İ look like when read as Latin-1). This is what the filesystem actually looks like:
   {code:none}
   .
   ├── .hoodie
   │   ├── .20220225181656520.commit.crc
   │   ├── .20220225181656520.commit.requested.crc
   │   ├── .20220225181656520.inflight.crc
   │   ├── .20220225182310482.commit.requested.crc
   │   ├── .20220225182311228.rollback.crc
   │   ├── .20220225182311228.rollback.inflight.crc
   │   ├── .20220225182311228.rollback.requested.crc
   │   ├── .aux
   │   │   └── .bootstrap
   │   │       ├── .fileids
   │   │       └── .partitions
   │   ├── .hoodie.properties.crc
   │   ├── .temp
   │   ├── 20220225181656520.commit
   │   ├── 20220225181656520.commit.requested
   │   ├── 20220225181656520.inflight
   │   ├── 20220225182310482.commit.requested
   │   ├── 20220225182311228.rollback
   │   ├── 20220225182311228.rollback.inflight
   │   ├── 20220225182311228.rollback.requested
   │   ├── archived
   │   └── hoodie.properties
   └── İ
       ├── ..hoodie_partition_metadata.crc
       ├── 
.31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet.crc
       ├── .hoodie_partition_metadata
       └── 
31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
   {code}
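   The "A-umlaut plus degree sign" rendering is consistent with the UTF-8 bytes of İ being reinterpreted as Latin-1 somewhere along the path-resolution chain (an observation, not a confirmed root cause). A small Python demonstration, plus a separate normalization hazard that may apply since the report's paths are on macOS:
   {code:none}
   import unicodedata

   part = "İ"  # dotted capital I, U+0130, as written by the first commit

   # UTF-8 encodes U+0130 as the two bytes C4 B0.
   raw = part.encode("utf-8")
   print(raw.hex())              # c4b0

   # Misreading those bytes as Latin-1 produces A-umlaut + degree sign.
   print(raw.decode("latin-1"))  # Ä°

   # Separately, HFS+ on macOS stores file names in decomposed form, so
   # U+0130 may come back as I + combining dot above (U+0307), which
   # compares unequal to the precomposed form byte-for-byte.
   print(unicodedata.normalize("NFD", part) == part)  # False
   {code}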
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-3517
   - Type: Bug
   - Epic: https://issues.apache.org/jira/browse/HUDI-5425
   - Affects version(s):
     - 0.10.1
   - Fix version(s):
     - 1.1.0

