[ https://issues.apache.org/jira/browse/SPARK-42198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tarique Anwer updated SPARK-42198:
----------------------------------
    Priority: Minor  (was: Major)

> spark.read fails to read filenames with accented characters
> -----------------------------------------------------------
>
>                 Key: SPARK-42198
>                 URL: https://issues.apache.org/jira/browse/SPARK-42198
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Tarique Anwer
>            Priority: Minor
>
> Unable to read filenames with accented characters in the filename.
> *Sample error:*
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 43 in stage 1.0 failed 4 times, most recent failure: Lost task 43.3 in stage 1.0 (TID 105) (10.139.64.5 executor 0): java.io.FileNotFoundException: /4842022074360943/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/Amalia471_Magaña874_3912696a-0aef-492e-83ef-468262b82966.xml{code}
>
> *{{Steps to reproduce error:}}*
> {code:java}
> %sh
> mkdir -p /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass
> wget https://synthetichealth.github.io/synthea-sample-data/downloads/synthea_sample_data_ccda_sep2019.zip -O ./synthea_sample_data_ccda_sep2019.zip
> unzip ./synthea_sample_data_ccda_sep2019.zip -d /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/
> {code}
>
> {code:java}
> spark.conf.set("spark.sql.caseSensitive", "true")
> df = (
>     spark.read.format('xml')
>     .option("rowTag", "ClinicalDocument")
>     .load('/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/')
> ){code}
> Is there a way to deal with this situation where I don't have control over the file names for some reason?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
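
On the closing question, one possible workaround (not a fix confirmed anywhere in this issue) is to copy the files to a staging directory under ASCII-only names and point the reader at that directory instead. The sketch below assumes the failure is tied to the non-ASCII characters in the file names, which the report suggests but does not establish. The helper names `ascii_safe_name` and `stage_with_ascii_names` are hypothetical and are not part of Spark or spark-xml.

```python
import shutil
import unicodedata
from pathlib import Path


def ascii_safe_name(name: str) -> str:
    """Return `name` with accents stripped; other non-ASCII characters become '_'."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII has no simple base letter; replace it.
    return "".join(ch if ch.isascii() else "_" for ch in without_marks)


def stage_with_ascii_names(src_dir: str, dst_dir: str) -> dict:
    """Copy each regular file in src_dir into dst_dir under an ASCII-only name.

    Returns a mapping of original file name -> staged file name.
    Raises FileExistsError if two source names collapse to the same staged name.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    mapping = {}
    seen = set()
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue
        safe = ascii_safe_name(path.name)
        if safe in seen:
            raise FileExistsError(f"{path.name!r} collides with another file as {safe!r}")
        seen.add(safe)
        shutil.copy2(path, dst / safe)  # the original file is left untouched
        mapping[path.name] = safe
    return mapping
```

With the paths from the report, this would be called on the local `/dbfs/.../synthea_mass/ccda` directory with a staging directory of your choice (an assumption, for example a sibling `ccda_ascii` directory), and the staged directory would then be passed to `.load(...)` in place of the original. Copying rather than renaming leaves the source files as delivered, which matters when you do not control them.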