Hi there, I have an HDFS directory with thousands of files. It seems that some of them - and I don't know which ones - have a problem with their schema, and it's causing my Spark application to fail with this error:
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://ip-172-24-89-229.blaah.com:8020/user/hadoop/origdata/part-00000-8b83989a-e387-4f64-8ac5-22b16770095e-c000.snappy.parquet. Column: [price], Expected: double, Found: FIXED_LEN_BYTE_ARRAY

The problem is not only that the application fails, but that every time it does fail, I have to copy the offending file out of the directory and start the app again. I thought of trying to use try-except, but I can't seem to get that to work (I've put a rough sketch of my attempt in the P.S. below). Is there any advice anyone can give me? I really can't see myself going through thousands of files trying to figure out which ones are broken.

Thanks in advance,
hamish
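P.S. For reference, here's roughly what my try-except attempt looked like. The output path, app name, and the write step are made up as stand-ins for my real job; only the input path is the actual one from the error above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("origdata-job").getOrCreate()

try:
    # Reading the whole directory at once - Spark plans this lazily,
    # so nothing actually fails on this line
    df = spark.read.parquet("hdfs:///user/hadoop/origdata/")

    # The QueryExecutionException only surfaces once an action runs
    # (this write is a placeholder for my real processing)
    df.write.mode("overwrite").parquet("hdfs:///user/hadoop/cleandata/")
except Exception as e:
    # This does catch the failure, but by then the whole job has been
    # aborted - there's no way from here to skip just the broken file
    # and continue with the rest of the directory
    print("Job failed: %s" % e)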