[ 
https://issues.apache.org/jira/browse/SPARK-42198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17682041#comment-17682041
 ] 

Tarique Anwer edited comment on SPARK-42198 at 1/30/23 12:15 PM:
-----------------------------------------------------------------

I have updated the original comment to remove the specific file name. I'm
trying to read all the XML files in the folder together. While it works just
fine for files without accented characters in their filenames, I start getting
errors as soon as one such file is mixed into the lot.

Even if I try to read a single file with an accented character in its name, as
in the comment above, I get an error.
{code:python}
spark.conf.set("spark.sql.caseSensitive", "true")
df = (
    spark.read.format('xml')
    .option("rowTag", "ClinicalDocument")
    .load('/dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/José_Emilio366_Macías944_1e740307-8780-4542-abeb-7037a2557a0e.xml')
){code}
 
Error:

 
{code:java}
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist: 
dbfs:/dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/José_Emilio366_Macías944_1e740307-8780-4542-abeb-7037a2557a0e.xml{code}
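As a side note, here is a small diagnostic sketch (assuming the same Databricks-style /dbfs local mount as in the path above; the checks are only illustrative) to confirm the file is visible on the local mount and to see which Unicode normalization form the accented name is stored in:
{code:python}
import os
import unicodedata

# Illustrative: same file as in the failing load() above.
path = ("/dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/"
        "ccda/José_Emilio366_Macías944_1e740307-8780-4542-abeb-7037a2557a0e.xml")

# Is the file visible through the local /dbfs mount at all?
print("exists on local mount:", os.path.exists(path))

# Compare the accented literal against what the directory listing actually returns.
name = os.path.basename(path)
print("literal is NFC:", name == unicodedata.normalize("NFC", name))
for entry in os.listdir(os.path.dirname(path)):
    if "Emilio366" in entry:  # ASCII substring, immune to NFC/NFD differences
        print("on-disk name is NFC:", entry == unicodedata.normalize("NFC", entry))
        print("on-disk name == literal:", entry == name)
{code}
If the on-disk name and the literal in load() are in different normalization forms (NFC vs NFD), they render identically but compare unequal as strings, which could produce exactly this kind of "path does not exist" failure.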
 

 


> spark.read fails to read filenames with accented characters
> -----------------------------------------------------------
>
>                 Key: SPARK-42198
>                 URL: https://issues.apache.org/jira/browse/SPARK-42198
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Tarique Anwer
>            Priority: Major
>
> Unable to read files with accented characters in the filename.
> *Sample error:*
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 43 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 43.3 in stage 1.0 
> (TID 105) (10.139.64.5 executor 0): java.io.FileNotFoundException: 
> /4842022074360943/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/Amalia471_Magaña874_3912696a-0aef-492e-83ef-468262b82966.xml{code}
>  
> *{{Steps to reproduce the error:}}*
> {code:bash}
> %sh
> mkdir -p /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass
> wget https://synthetichealth.github.io/synthea-sample-data/downloads/synthea_sample_data_ccda_sep2019.zip -O ./synthea_sample_data_ccda_sep2019.zip
> unzip ./synthea_sample_data_ccda_sep2019.zip -d /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/
> {code}
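> After the unzip, a quick way to see which of the extracted files actually carry non-ASCII names (just a sketch; it assumes the same extraction directory as above and the local /dbfs mount):
> {code:python}
> import os
> 
> ccda_dir = "/dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda"
> 
> # Print only the filenames containing non-ASCII characters,
> # i.e. the ones this report is about.
> non_ascii = [f for f in os.listdir(ccda_dir) if not f.isascii()]
> print(len(non_ascii), "files with non-ASCII names")
> for f in non_ascii[:10]:
>     print(f)
> {code}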
>  
> {code:python}
> spark.conf.set("spark.sql.caseSensitive", "true")
> df = (
>     spark.read.format('xml')
>     .option("rowTag", "ClinicalDocument")
>     .load('/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/')
> ){code}
> Is there a way to handle this situation when I don't have control over the
> file names?
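> The kind of workaround I have been considering (only an untested sketch; it assumes I am allowed to copy the inputs, and the ccda_ascii directory name is just illustrative) is to copy the affected files to ASCII-only names before pointing spark.read at them:
> {code:python}
> import os
> import shutil
> import unicodedata
> 
> src_dir = "/dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda"
> dst_dir = "/dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda_ascii"
> os.makedirs(dst_dir, exist_ok=True)
> 
> def to_ascii(name):
>     # Strip accents: decompose (NFKD), then drop the combining marks.
>     return unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
> 
> for f in os.listdir(src_dir):
>     target = f if f.isascii() else to_ascii(f)
>     shutil.copyfile(os.path.join(src_dir, f), os.path.join(dst_dir, target))
> {code}
> and then point .load() at the ccda_ascii/ copy instead.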


