My company runs Java code that uses Spark to read from, and write to, Azure
Blob storage. This code runs more or less 24x7.
Recently we've noticed a few failures that leave stack traces in our logs; what
they have in common are exceptions that look variously like:

    Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Unrecognized type 0
    Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 14
    Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data!
    Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data!
I searched
https://stackoverflow.com/search?q=%5Bapache-spark%5D+java.io.IOException+can+not+read+class+org.apache.parquet.format.PageHeader
and found exactly one marginally relevant hit:
https://stackoverflow.com/questions/47211392/required-field-uncompressed-page-size-was-not-found-in-serialized-data-parque
It contains a suggested workaround which I haven't yet tried, but intend to try
soon.
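For the archives, here is my reading of that workaround; this is an assumption on my part (I may be misreading the answer), and it amounts to disabling Spark's vectorized Parquet reader so reads fall back to the non-vectorized path:

```java
import org.apache.spark.sql.SparkSession;

// Sketch only: my understanding of the Stack Overflow workaround, untested on
// our side. The config key exists in Spark 3.x; whether turning off the
// vectorized Parquet reader actually avoids these PageHeader errors is unknown.
public final class VectorizedReaderWorkaround {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("parquet-pageheader-workaround")
            // Fall back to the non-vectorized (parquet-mr) read path.
            .config("spark.sql.parquet.enableVectorizedReader", "false")
            .getOrCreate();

        // ... run the same read/write job as before ...

        spark.stop();
    }
}
```

The same setting can of course be passed as `--conf spark.sql.parquet.enableVectorizedReader=false` on spark-submit instead of being hard-coded.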
I searched the ASF archive for [email protected]; the only hit is
https://lists.apache.org/[email protected]:2022-9:can%20not%20read%20class%20org.apache.parquet.format.PageHeader
which is relevant but unhelpful.
It cites https://issues.apache.org/jira/browse/SPARK-11844, which is quite
relevant, but again unhelpful.
Unfortunately, we cannot provide the relevant parquet file to the mailing list,
since it of course contains proprietary data.
I've posted the stack trace at
https://gist.github.com/erich-truveta/f30d77441186a8c30c5f22f9c44bf59f
Here are the various Maven dependencies that might be relevant (taken from the
output of `mvn dependency:tree`):
org.apache.hadoop.thirdparty:hadoop-shaded-guava :jar:1.1.1
org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7 :jar:1.1.1
org.apache.hadoop:hadoop-annotations :jar:3.3.4
org.apache.hadoop:hadoop-auth :jar:3.3.4
org.apache.hadoop:hadoop-azure :jar:3.3.4
org.apache.hadoop:hadoop-client-api :jar:3.3.4
org.apache.hadoop:hadoop-client-runtime :jar:3.3.4
org.apache.hadoop:hadoop-client :jar:3.3.4
org.apache.hadoop:hadoop-common :jar:3.3.4
org.apache.hadoop:hadoop-hdfs-client :jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-common :jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-core :jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-jobclient :jar:3.3.4
org.apache.hadoop:hadoop-yarn-api :jar:3.3.4
org.apache.hadoop:hadoop-yarn-client :jar:3.3.4
org.apache.hadoop:hadoop-yarn-common :jar:3.3.4
org.apache.hive:hive-storage-api :jar:2.7.2
org.apache.parquet:parquet-column :jar:1.12.2
org.apache.parquet:parquet-common :jar:1.12.2
org.apache.parquet:parquet-encoding :jar:1.12.2
org.apache.parquet:parquet-format-structures :jar:1.12.2
org.apache.parquet:parquet-hadoop :jar:1.12.2
org.apache.parquet:parquet-jackson :jar:1.12.2
org.apache.spark:spark-catalyst_2.12 :jar:3.3.1
org.apache.spark:spark-core_2.12 :jar:3.3.1
org.apache.spark:spark-kvstore_2.12 :jar:3.3.1
org.apache.spark:spark-launcher_2.12 :jar:3.3.1
org.apache.spark:spark-network-common_2.12 :jar:3.3.1
org.apache.spark:spark-network-shuffle_2.12 :jar:3.3.1
org.apache.spark:spark-sketch_2.12 :jar:3.3.1
org.apache.spark:spark-sql_2.12 :jar:3.3.1
org.apache.spark:spark-tags_2.12 :jar:3.3.1
org.apache.spark:spark-unsafe_2.12 :jar:3.3.1
Thank you for any help you can provide!