Hi devs,

I ran into a problem when using Spark to read two parquet files written with two different versions of the same schema. For example, the first file stores one field as an int, while the same field in the second file is a long. I expected Spark to automatically merge the schemas, widening that field to long, and to use the merged schema to process both files.
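To make the setup concrete, here is a minimal sketch of how two such files could be produced (the nested msg.actual_eta field matches my query below; jsc, sqlContext, and inputPath are assumed to be in scope, and the paths are placeholders):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Version 1 of the schema: msg.actual_eta is an int.
StructType v1 = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("msg", DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("actual_eta", DataTypes.IntegerType, true))), true)));
// Version 2: the same field widened to a long.
StructType v2 = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("msg", DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("actual_eta", DataTypes.LongType, true))), true)));

JavaRDD<Row> rows1 = jsc.parallelize(Arrays.asList(RowFactory.create(RowFactory.create(1234))));
JavaRDD<Row> rows2 = jsc.parallelize(Arrays.asList(RowFactory.create(RowFactory.create(1234L))));

// Write one file per schema version under the same input directory.
sqlContext.createDataFrame(rows1, v1).saveAsParquetFile(inputPath + "/f1.parquet");
sqlContext.createDataFrame(rows2, v2).saveAsParquetFile(inputPath + "/f2.parquet");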
However, the following code does not work:

DataFrame df = sqlContext.parquetFile(inputPath);
df.registerTempTable("data");
sqlContext.sql("select count(msg.actual_eta) from data").collect();

It fails with:

parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file f1.parquet
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
    at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)

BTW, I am using Spark 1.3.1 and have already set "spark.sql.parquet.useDataSourceApi" to false.
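A possible workaround might be to load each file separately, cast the conflicting field to long explicitly, and union the results. A rough sketch under the same assumptions as above (file paths are placeholders, and I have only tried this on toy data):

// Read each schema version on its own, so each file is decoded with its own footer schema.
DataFrame df1 = sqlContext.parquetFile(inputPath + "/f1.parquet")
    .selectExpr("cast(msg.actual_eta as bigint) as actual_eta");
DataFrame df2 = sqlContext.parquetFile(inputPath + "/f2.parquet")
    .selectExpr("cast(msg.actual_eta as bigint) as actual_eta");
// Union the widened columns and query as before.
DataFrame merged = df1.unionAll(df2);
merged.registerTempTable("data");
sqlContext.sql("select count(actual_eta) from data").collect();

This gets clumsy with many files, though, so I would still like the automatic schema merging to handle it. Any help would be appreciated.

-Wei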