[jira] [Updated] (SPARK-42198) spark.read fails to read filenames with accented characters

Tarique Anwer (Jira) Fri, 25 Apr 2025 03:39:57 -0700


     [ https://issues.apache.org/jira/browse/SPARK-42198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tarique Anwer updated SPARK-42198:
----------------------------------
    Priority: Minor  (was: Major)

> spark.read fails to read filenames with accented characters
> -----------------------------------------------------------
>
>                 Key: SPARK-42198
>                 URL: https://issues.apache.org/jira/browse/SPARK-42198
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Tarique Anwer
>            Priority: Minor
>
> spark.read cannot load files whose names contain accented characters.
> *Sample error:*
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 43 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 43.3 in stage 1.0 
> (TID 105) (10.139.64.5 executor 0): java.io.FileNotFoundException: 
> /4842022074360943/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/Amalia471_Magaña874_3912696a-0aef-492e-83ef-468262b82966.xml{code}
>  
> *{{Steps to reproduce error:}}*
> {code:java}
> %sh
> mkdir -p /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass
> wget https://synthetichealth.github.io/synthea-sample-data/downloads/synthea_sample_data_ccda_sep2019.zip -O ./synthea_sample_data_ccda_sep2019.zip
> unzip ./synthea_sample_data_ccda_sep2019.zip -d /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/
> {code}
>  
> {code:java}
> spark.conf.set("spark.sql.caseSensitive", "true")
> df = (
>   spark.read.format("xml")
>     .option("rowTag", "ClinicalDocument")
>     .load("/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/")
> ){code}
> Is there a way to handle this when I have no control over the file names?
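
One possible workaround, offered as a sketch and not as a confirmed fix: if the source file names cannot be changed, the files could be copied to a staging directory under ASCII-only names and read from there. The issue does not establish the root cause, so whether this avoids the FileNotFoundException is an assumption. The helper names ascii_safe_name and copy_with_safe_names are hypothetical and are not part of Spark or PySpark.

```python
import shutil
import unicodedata
from pathlib import Path


def ascii_safe_name(name: str) -> str:
    """Return `name` with accents stripped, e.g. 'Magaña874' -> 'Magana874'."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is still non-ASCII with an underscore.
    return "".join(ch if ch.isascii() else "_" for ch in stripped)


def copy_with_safe_names(src_dir: str, dst_dir: str) -> dict:
    """Copy every file in src_dir into dst_dir under an ASCII-only name.

    Returns a mapping of new name -> original name so originals can be traced.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    mapping = {}
    for src in sorted(Path(src_dir).iterdir()):
        if not src.is_file():
            continue
        safe = ascii_safe_name(src.name)
        if safe in mapping:
            # Two different originals collapsed to the same sanitized name.
            raise ValueError(f"name collision after sanitizing: {src.name!r}")
        shutil.copyfile(src, dst / safe)
        mapping[safe] = src.name
    return mapping
```

After copying, the `.load(...)` call from the steps above would point at the staging directory instead of the original one. The source and staging paths passed to copy_with_safe_names must be ones visible to local Python file APIs, and the returned mapping keeps the original names available for later joins.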



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
