[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2023-10-04 Thread Prashant Wason (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Wason updated HUDI-3517:
Fix Version/s: 0.14.1
   (was: 0.14.0)

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql, writer-core
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Assignee: Lokesh Jain
>Priority: Blocker
>  Labels: hudi-on-call
> Fix For: 0.14.1
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When the partition path contains Unicode characters, the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted I)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
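> For reference, an equivalent dataframe can be built directly in spark-shell (a minimal sketch; the original {{res0}} presumably came from a CSV read, hence the default {{_c0}}/{{_c1}} column names):
> {code:none}
> scala> // "İ" is U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE
> scala> val res0 = Seq(("1", "İ")).toDF("_c0", "_c1")
> {code}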
>  # Write it to Hudi (this initial write creates the Hudi table and succeeds)
> {code:none}
> res0.write.format("hudi")
>   .option("hoodie.table.name", "unicode_test")
>   .option("hoodie.datasource.write.precombine.field", "_c0")
>   .option("hoodie.datasource.write.recordkey.field", "_c0")
>   .option("hoodie.datasource.write.partitionpath.field", "_c1")
>   .mode("append")
>   .save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at the index lookup stage)
> h3. Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
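> The mangled directory name in the stacktrace below ({{Ä°}} instead of {{İ}}) matches what you get when the UTF-8 percent-encoding {{%C4%B0}} is decoded byte-by-byte as Latin-1. A minimal sketch of that mismatch on the plain JVM (illustrative only; not necessarily the exact decoding path Hudi takes):
> {code:none}
> scala> import java.net.{URLEncoder, URLDecoder}
> scala> URLEncoder.encode("İ", "UTF-8")           // "%C4%B0" -- what the filesystem-view request carries
> scala> URLDecoder.decode("%C4%B0", "UTF-8")      // "İ"  -- the correct round trip
> scala> URLDecoder.decode("%C4%B0", "ISO-8859-1") // "Ä°" -- the corrupted partition directory seen below
> {code}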
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0=20220225182311228=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 

2023-05-22 Thread Yue Zhang (Jira)

Yue Zhang updated HUDI-3517:
Fix Version/s: 0.14.0
   (was: 0.13.1)

2023-04-23 Thread sivabalan narayanan (Jira)

sivabalan narayanan updated HUDI-3517:
Fix Version/s: (was: 0.12.3)

2023-03-09 Thread Raymond Xu (Jira)

Raymond Xu updated HUDI-3517:
Issue Type: Bug  (was: Improvement)

2023-03-09 Thread Raymond Xu (Jira)

Raymond Xu updated HUDI-3517:
Fix Version/s: 0.12.3

2023-01-17 Thread sivabalan narayanan (Jira)

sivabalan narayanan updated HUDI-3517:
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3)

2023-01-17 Thread Raymond Xu (Jira)

Raymond Xu updated HUDI-3517:
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2, 0.13.0 Final Sprint 3  (was: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2)

2023-01-09 Thread Raymond Xu (Jira)

Raymond Xu updated HUDI-3517:
Sprint: 0.13.0 Final Sprint, 0.13.0 Final Sprint 2  (was: 0.13.0 Final Sprint)

2022-12-20 Thread Raymond Xu (Jira)

Raymond Xu updated HUDI-3517:
Epic Link: HUDI-5425

2022-12-20 Thread Raymond Xu (Jira)

Raymond Xu updated HUDI-3517:
Priority: Blocker  (was: Critical)

2022-12-19 Thread Sagar Sumit (Jira)

Sagar Sumit updated HUDI-3517:
Sprint: 2022/12/26

[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-3517:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql, writer-core
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: hudi-on-call
> Fix For: 0.13.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When there is unicode in the partition path, the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted I)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
>  # Write it to hudi (this write will create the hudi table and succeed)
> {code:none}
>  res0.write.format("hudi").option("hoodie.table.name", 
> "unicode_test").option("hoodie.datasource.write.precombine.field", 
> "_c0").option("hoodie.datasource.write.recordkey.field", 
> "_c0").option("hoodie.datasource.write.partitionpath.field", 
> "_c1").mode("append").save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at index lookup stage)
> Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0=20220225182311228=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
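
The mojibake in the failing path above ({{Ä°}} where the table was written with partition value {{İ}}) is consistent with UTF-8 bytes being re-decoded as ISO-8859-1: the file-system-view request percent-encodes the partition as {{%C4%B0}}, which is the UTF-8 byte sequence for U+0130. A minimal spark-shell sketch (not Hudi code; the charset names here are illustrative assumptions about where the mismatch happens) reproducing that decode mismatch:

{code:scala}
// Minimal sketch: the UTF-8 percent-encoding of "İ" (U+0130), seen as
// %C4%B0 in the request above, turns into "Ä°" -- the exact directory
// name in the HoodieIOException -- when the same bytes are decoded as
// ISO-8859-1 instead of UTF-8.
import java.net.{URLDecoder, URLEncoder}

val partition = "\u0130"                                  // "İ", dotted capital I
val encoded   = URLEncoder.encode(partition, "UTF-8")     // "%C4%B0"
val asLatin1  = URLDecoder.decode(encoded, "ISO-8859-1")  // "Ä°", matches the bad path
val asUtf8    = URLDecoder.decode(encoded, "UTF-8")       // "İ", the expected partition
println(s"encoded=$encoded latin1=$asLatin1 utf8=$asUtf8")
{code}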

[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-09-26 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3517:

Sprint:   (was: 2022/09/19)


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-09-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3517:
--
Sprint: 2022/09/19  (was: 2022/09/05)


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-08-20 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Sprint: 2022/09/05  (was: 2022/08/22)


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-08-19 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Priority: Critical  (was: Blocker)


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-08-19 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Story Points: 3


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-08-19 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Priority: Blocker  (was: Critical)


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3517:
--
Priority: Critical  (was: Major)


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3517:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Component/s: spark-sql


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Issue Type: Improvement  (was: Bug)


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-03-28 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-03-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3517:
--
Remaining Estimate: 2h
 Original Estimate: 2h


[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-03-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3517:
--
Sprint: Cont' improve - 2022/03/7

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: hudi-on-call
> Fix For: 0.11.0
>
>
> When there is Unicode in the partition path, it is resolved incorrectly and the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted capital I, U+0130)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
>  # Write it to Hudi (this write will create the Hudi table and succeed)
> {code:scala}
> res0.write.format("hudi").
>   option("hoodie.table.name", "unicode_test").
>   option("hoodie.datasource.write.precombine.field", "_c0").
>   option("hoodie.datasource.write.recordkey.field", "_c0").
>   option("hoodie.datasource.write.partitionpath.field", "_c1").
>   mode("append").
>   save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at the index lookup stage)
> Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0=20220225182311228=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 

[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-02-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Fix Version/s: 0.11.0

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Priority: Major
>  Labels: hudi-on-call
> Fix For: 0.11.0
>
>
> When there is Unicode in the partition path, it is resolved incorrectly and the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted capital I, U+0130)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
>  # Write it to Hudi (this write will create the Hudi table and succeed)
> {code:scala}
> res0.write.format("hudi").
>   option("hoodie.table.name", "unicode_test").
>   option("hoodie.datasource.write.precombine.field", "_c0").
>   option("hoodie.datasource.write.recordkey.field", "_c0").
>   option("hoodie.datasource.write.partitionpath.field", "_c1").
>   mode("append").
>   save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at the index lookup stage)
> Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0=20220225182311228=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: 

[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-02-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Labels: hudi-on-call  (was: )

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Priority: Major
>  Labels: hudi-on-call
>
> When there is Unicode in the partition path, it is resolved incorrectly and the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted capital I, U+0130)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
>  # Write it to Hudi (this write will create the Hudi table and succeed)
> {code:scala}
> res0.write.format("hudi").
>   option("hoodie.table.name", "unicode_test").
>   option("hoodie.datasource.write.precombine.field", "_c0").
>   option("hoodie.datasource.write.recordkey.field", "_c0").
>   option("hoodie.datasource.write.partitionpath.field", "_c1").
>   mode("append").
>   save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at the index lookup stage)
> Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0=20220225182311228=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: 

[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-02-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Component/s: writer-core

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Priority: Major
>
> When there is Unicode in the partition path, it is resolved incorrectly and the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted capital I, U+0130)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
>  # Write it to Hudi (this write will create the Hudi table and succeed)
> {code:scala}
> res0.write.format("hudi").
>   option("hoodie.table.name", "unicode_test").
>   option("hoodie.datasource.write.precombine.field", "_c0").
>   option("hoodie.datasource.write.recordkey.field", "_c0").
>   option("hoodie.datasource.write.partitionpath.field", "_c1").
>   mode("append").
>   save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at the index lookup stage)
> Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0=20220225182311228=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: File 
> 

[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-02-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3517:
-
Priority: Major  (was: Minor)

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Priority: Major
>
> When there is Unicode in the partition path, it is resolved incorrectly and the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted capital I, U+0130)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
>  # Write it to Hudi (this write will create the Hudi table and succeed)
> {code:scala}
> res0.write.format("hudi").
>   option("hoodie.table.name", "unicode_test").
>   option("hoodie.datasource.write.precombine.field", "_c0").
>   option("hoodie.datasource.write.recordkey.field", "_c0").
>   option("hoodie.datasource.write.partitionpath.field", "_c1").
>   mode("append").
>   save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at the index lookup stage)
> Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0=20220225182311228=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: File 
> 

[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-02-25 Thread Ji Qi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Qi updated HUDI-3517:

Description: 
When there is Unicode in the partition path, it is resolved incorrectly and the upsert fails.
h3. To reproduce
 # Create this dataframe in spark-shell (note the dotted capital I, U+0130)
{code:none}
scala> res0.show(truncate=false)
+---+---+
|_c0|_c1|
+---+---+
|1  |İ  |
+---+---+
{code}
 # Write it to Hudi (this write will create the Hudi table and succeed)
{code:scala}
res0.write.format("hudi").
  option("hoodie.table.name", "unicode_test").
  option("hoodie.datasource.write.precombine.field", "_c0").
  option("hoodie.datasource.write.recordkey.field", "_c0").
  option("hoodie.datasource.write.partitionpath.field", "_c1").
  mode("append").
  save("file:///Users/ji.qi/Desktop/unicode_test")
{code}
 # Try to write {{res0}} again (this upsert will fail at the index lookup stage)
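For convenience, the steps above as one self-contained spark-shell snippet (assumes the Hudi Spark bundle is on the classpath; building the DataFrame inline with {{Seq(...).toDF}} is an assumption, since the original session obtained {{res0}} elsewhere):
{code:scala}
// spark-shell pre-imports spark.implicits._, which provides toDF on a Seq.
val df = Seq(("1", "\u0130")).toDF("_c0", "_c1")  // U+0130 is the dotted capital I: "İ"

def writeOnce(): Unit = df.write.format("hudi").
  option("hoodie.table.name", "unicode_test").
  option("hoodie.datasource.write.precombine.field", "_c0").
  option("hoodie.datasource.write.recordkey.field", "_c0").
  option("hoodie.datasource.write.partitionpath.field", "_c1").
  mode("append").
  save("file:///Users/ji.qi/Desktop/unicode_test")

writeOnce()  // first write: creates the table and succeeds
writeOnce()  // second write: the upsert fails at the index lookup stage
{code}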

Environment
 * Hudi version: 0.10.1
 * Spark version: 3.1.2
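On disk the partition directory is the raw UTF-8 name İ, while the stacktrace below tries to read it as Ä°, i.e. the same two bytes (0xC4 0xB0) re-decoded as ISO-8859-1. A quick sanity check of the on-disk names (a sketch for a Scala 2.12 spark-shell, hence {{JavaConverters}}; the path is the local table from the repro):
{code:scala}
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Print each directory of the table path with its UTF-8 bytes;
// expect "İ -> C4 B0" alongside the .hoodie metadata directory.
Files.list(Paths.get("/Users/ji.qi/Desktop/unicode_test")).iterator().asScala
  .filter(Files.isDirectory(_))
  .foreach { p =>
    val name = p.getFileName.toString
    println(name + " -> " + name.getBytes("UTF-8").map(b => f"$b%02X").mkString(" "))
  }
{code}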

h3. Stacktrace
{code:none}
22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
(http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0=20220225182311228=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
org.apache.hudi.exception.HoodieIOException: Failed to read footer for parquet 
file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
at 
org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
at 
org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
at 
org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
at 
org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
at 
org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
at 
org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
at 
org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
at 
org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File 
file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
 does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:666)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
at 

[jira] [Updated] (HUDI-3517) Unicode in partition path causes it to be resolved wrongly

2022-02-25 Thread Ji Qi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Qi updated HUDI-3517:

Affects Version/s: 0.10.1

> Unicode in partition path causes it to be resolved wrongly
> --
>
> Key: HUDI-3517
> URL: https://issues.apache.org/jira/browse/HUDI-3517
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Ji Qi
>Priority: Minor
>
> When there is Unicode in the partition path, it is resolved incorrectly and the upsert fails.
> h3. To reproduce
>  # Create this dataframe in spark-shell (note the dotted capital I, U+0130)
> {code:none}
> scala> res0.show(truncate=false)
> +---+---+
> |_c0|_c1|
> +---+---+
> |1  |İ  |
> +---+---+
> {code}
>  # Write it to Hudi (this write will create the Hudi table and succeed)
> {code:scala}
> res0.write.format("hudi").
>   option("hoodie.table.name", "unicode_test").
>   option("hoodie.datasource.write.precombine.field", "_c0").
>   option("hoodie.datasource.write.recordkey.field", "_c0").
>   option("hoodie.datasource.write.partitionpath.field", "_c1").
>   mode("append").
>   save("file:///Users/ji.qi/Desktop/unicode_test")
> {code}
>  # Try to write {{res0}} again (this upsert will fail at the index lookup stage)
> Environment
>  * Hudi version: 0.10.1
>  * Spark version: 3.1.2
> h3. Stacktrace
> {code:none}
> 22/02/25 18:23:14 INFO RemoteHoodieTableFileSystemView: Sending request : 
> (http://192.168.1.148:54043/v1/hoodie/view/datafile/latest/partition?partition=%C4%B0=file%3A%2FUsers%2Fji.qi%2FDesktop%2Funicode_test=31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0=20220225182311228=65c5a6a5c6836dc4f7805550e81ca034b30ad85c38794f9f8ce68a9e914aab83)
> 22/02/25 18:23:14 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 403)
> org.apache.hudi.exception.HoodieIOException: Failed to read footer for 
> parquet 
> file:/Users/ji.qi/Desktop/unicode_test/Ä°/31517a5e-af56-4fbc-9aa6-1ef1729bb89d-0_0-30-2006_20220225181656520.parquet
>   at 
> org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
>   at 
> org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
>   at 
> org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
>   at 
> org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
>   at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: File 
>