Hi there, I have an HDFS directory with thousands of files. It seems that some of them - and I don't know which ones - have a problem with their schema, and it's causing my Spark application to fail with this error:
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://ip-172-24-89-229.blaah.com:8020/user/hadoop/origdata/part-00000-8b83989a-e387-4f64-8ac5-22b16770095e-c000.snappy.parquet. Column: [price], Expected: double, Found: FIXED_LEN_BYTE_ARRAY

The problem is not only that the application fails, but that every time it does fail, I have to copy the offending file out of the directory and start the app again. I thought of trying to use try-except, but I can't seem to get that to work (I've put a rough sketch of my attempt in the P.S. below). Is there any advice anyone can give me? I really can't see myself going through thousands of files trying to figure out which ones are broken.

Thanks in advance,
hamish
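P.S. For reference, here's roughly what my try-except attempt looked like. The output path, app name, and the write step are made up as stand-ins for my real job; only the input path is the actual one from the error above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("origdata-job").getOrCreate()

try:
    # Reading the whole directory at once - Spark plans this lazily,
    # so nothing actually fails on this line
    df = spark.read.parquet("hdfs:///user/hadoop/origdata/")

    # The QueryExecutionException only surfaces once an action runs
    # (this write is a placeholder for my real processing)
    df.write.mode("overwrite").parquet("hdfs:///user/hadoop/cleandata/")
except Exception as e:
    # This does catch the failure, but by then the whole job has been
    # aborted - there's no way from here to skip just the broken file
    # and continue with the rest of the directory
    print("Job failed: %s" % e)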