hudi-bot opened a new issue, #15035:
URL: https://github.com/apache/hudi/issues/15035
When the partition path contains a non-ASCII Unicode character, the upsert fails.
h3. To reproduce
# Create this DataFrame in spark-shell (note the dotted capital I, İ, in {{_c1}})
{code:none}
scala> res0.show(truncate=false)
+---+---+
|_c0|_c1|
+---+---+
|1  |İ  |
+---+---+
{code}
# Write it to Hudi (this write creates the Hudi table and succeeds)
{code:none}
res0.write.format("hudi").
  option("hoodie.table.name", "unicode_test").
  option("hoodie.datasource.write.precombine.field", "_c0").
  option("hoodie.datasource.write.recordkey.field", "_c0").
  option("hoodie.datasource.write.partitionpath.field", "_c1").
  mode("append").
  save("file:///Users/ji.qi/Desktop/unicode_test")
{code}
# Try to write {{res0}} again with the same options (this upsert fails at the index lookup stage)
h3. Environment
* Hudi version: 0.10.1
* Spark version: 3.1.2
h3. Stacktrace
{code:none}
22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0&basepath=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test&fileid=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0&lastinstantts=20220225182311228&timelinehash=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
org.apache.hudi.exception.HoodieIOException: Failed to read footer for parquet file:/Users/ji.qi/Desktop/unicode_test/İ/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
at org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
at org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
at org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
at org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
at org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
at org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File file:/Users/ji.qi/Desktop/unicode_test/İ/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:666)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:433)
at org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:183)
... 33 more
{code}
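It seems the file name is being resolved incorrectly: the dotted I (İ, U+0130) somehow turns into Ä° (A with umlaut followed by a degree sign). That is how its UTF-8 bytes C4 B0 (the {{%C4%B0}} in the request URL above) read when decoded as ISO-8859-1. This is what the filesystem actually looks like: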
{code:none}
.
├── .hoodie
│   ├── .20220225181656520.commit.crc
│   ├── .20220225181656520.commit.requested.crc
│   ├── .20220225181656520.inflight.crc
│   ├── .20220225182310482.commit.requested.crc
│   ├── .20220225182311228.rollback.crc
│   ├── .20220225182311228.rollback.inflight.crc
│   ├── .20220225182311228.rollback.requested.crc
│   ├── .aux
│   │   └── .bootstrap
│   │       ├── .fileids
│   │       └── .partitions
│   ├── .hoodie.properties.crc
│   ├── .temp
│   ├── 20220225181656520.commit
│   ├── 20220225181656520.commit.requested
│   ├── 20220225181656520.inflight
│   ├── 20220225182310482.commit.requested
│   ├── 20220225182311228.rollback
│   ├── 20220225182311228.rollback.inflight
│   ├── 20220225182311228.rollback.requested
│   ├── archived
│   └── hoodie.properties
└── İ
    ├── ..hoodie_partition_metadata.crc
    ├── .31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet.crc
    ├── .hoodie_partition_metadata
    └── 31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
{code}
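The "A+umlaut+degrees" rendering described above is consistent with a charset mismatch somewhere between encoding and decoding the partition name. The following is a minimal sketch using only the JDK, runnable in spark-shell or a plain Scala REPL. It does not touch Hudi and makes no claim about where in Hudi, the timeline server, or Hadoop the mismatch happens. The value names ({{utf8Hex}}, {{encoded}}, {{mojibake}}, and so on) are illustrative only.

```scala
import java.net.{URLDecoder, URLEncoder}
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

// Partition value from the report: LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130).
// Written as an escape so the snippet does not depend on the source file encoding.
val partition = "\u0130"

// UTF-8 encodes U+0130 as two bytes.
val utf8Hex = partition.getBytes(UTF_8).map(b => f"${b & 0xff}%02X").mkString(" ")

// Percent-encoding those bytes gives the value seen in the request URL.
val encoded = URLEncoder.encode(partition, "UTF-8")

// Decoding with the matching charset round-trips to the original character.
val roundTrip = URLDecoder.decode(encoded, "UTF-8")

// Decoding the same bytes as ISO-8859-1 yields two characters instead:
// U+00C4 (A with umlaut) followed by U+00B0 (degree sign).
val mojibake = new String(partition.getBytes(UTF_8), ISO_8859_1)
val mojibakeViaUrl = URLDecoder.decode(encoded, "ISO-8859-1")

println(s"UTF-8 bytes:         $utf8Hex")
println(s"percent-encoded:     $encoded")
println(s"decoded as UTF-8:    $roundTrip")
println(s"decoded as Latin-1:  $mojibake")
```

If this matches what is seen in a failing environment, comparing {{file.encoding}} and {{sun.jnu.encoding}} on the driver and executors may be a reasonable next diagnostic step. That is a suggestion, not a confirmed root cause.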
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-3517
- Type: Bug
- Epic: https://issues.apache.org/jira/browse/HUDI-5425
- Affects version(s):
  - 0.10.1
- Fix version(s):
  - 1.1.0