I have a table with a few columns, some of which are arrays. Since
upgrading from Spark 1.6 to Spark 2.0.1, the array columns are always
null when the table is read into a DataFrame.

When writing the Parquet files, the schema of the column is specified as

StructField("packageIds", ArrayType(StringType))
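
For reference, a minimal sketch of the write path as it ran on Spark
1.6; the row data and the path are hypothetical stand-ins for the
real job:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Only the array column is shown; the real schema has the other
// columns as well, in the same order and casing as the Metastore.
val schema = StructType(Seq(
  StructField("packageIds", ArrayType(StringType))
))

// Hypothetical one-row RDD standing in for the production data.
val rowRDD = sc.parallelize(Seq(Row(Seq("pkg-1", "pkg-2"))))

val df = sqlContext.createDataFrame(rowRDD, schema)
df.write.parquet("/path/to/table") // hypothetical path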

The schema of the column in the Hive Metastore is

packageIds array<string>

The schema used in the writer exactly matches the schema in the
Metastore in all respects (order, casing, types, etc.).

The query is a simple "select *":

spark.sql("select * from tablename limit 1").collect() // null columns in Row
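
To narrow it down, I can compare the schema Spark 2.0.1 derives for
the Hive table against the schema read straight from the Parquet
footers (the path below is a hypothetical stand-in):

// Schema as Spark 2.0.1 sees the Hive table.
spark.table("tablename").printSchema()

// Schema read directly from the files, bypassing the Metastore.
spark.read.parquet("/path/to/table").printSchema()

If those two disagree, that would point at how the table schema is
reconciled with the file schema rather than at the files themselves.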

How can I begin debugging this issue? Notable things I've already
investigated:

   - The files were written using Spark 1.6.
   - The DataFrame reads correctly in Spark 1.5 and 1.6.
   - I've inspected the Parquet files using parquet-tools and can see
   the data (the sketch after this list runs the same check from
   inside Spark).
   - I also have another table written in exactly the same way and it
   doesn't have the issue.
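
The same parquet-tools check can be repeated from inside Spark 2.0.1
by reading the files directly (hypothetical path again):

// Bypasses the Metastore entirely; if the array values show up here
// but are null through the Hive table, the files themselves are fine.
spark.read.parquet("/path/to/table")
  .select("packageIds")
  .show(1, false)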
