Hi all

My team has the same issue. It looks like Spark 1.3's Spark SQL cannot read
Parquet files generated by Spark 1.1. This will cost us a lot of migration
work when we want to upgrade to Spark 1.3.

Can anyone help with this?


Thanks

Wisely Chen


On Tue, Mar 10, 2015 at 5:06 PM, Pei-Lun Lee <pl...@appier.com> wrote:

> Hi,
>
> I found that if I try to read a Parquet file generated by Spark 1.1.1 using
> 1.3.0-rc3 with default settings, I get this error:
>
> com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'StructType':
> was expecting ('true', 'false' or 'null')
>  at [Source: StructType(List(StructField(a,IntegerType,false))); line: 1, column: 11]
>         at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1419)
>         at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:508)
>         at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2300)
>         at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1459)
>         at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:683)
>         at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3105)
>         at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3051)
>         at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161)
>         at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19)
>         at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44)
>         at org.apache.spark.sql.types.DataType$.fromJson(dataTypes.scala:41)
>         at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$1$$anonfun$25.apply(newParquet.scala:675)
>         at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$1$$anonfun$25.apply(newParquet.scala:675)
>
>
>
> this is how I saved the Parquet file with 1.1.1:
>
> sql("select 1 as a").saveAsParquetFile("/tmp/foo")
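>
> and this is roughly the read on 1.3.0-rc3 (default settings) that hits the
> error above; it is just a plain load, nothing special:
>
> sqlContext.parquetFile("/tmp/foo")   // throws the JsonParseException above while reading the schema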
>
>
>
> and this is the metadata of the 1.1.1 Parquet file:
>
> creator:     parquet-mr version 1.4.3
> extra:       org.apache.spark.sql.parquet.row.metadata =
> StructType(List(StructField(a,IntegerType,false)))
>
>
>
> by comparison, this is the 1.3.0 metadata:
>
> creator:     parquet-mr version 1.6.0rc3
> extra:       org.apache.spark.sql.parquet.row.metadata =
> {"type":"struct","fields":[{"name":"a","type":"integer","nullable":t
> [more]...
>
>
>
> It looks like ParquetRelation2 is now used to load Parquet files by default,
> and it only recognizes the JSON-format schema, while the 1.1.1 schema was
> written as a case-class string.
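>
> Just to illustrate the difference with the json4s that ships with Spark (the
> JSON string below is only a rough sketch of what 1.3.0 writes):
>
> import org.json4s.jackson.JsonMethods._
>
> // the 1.3.0 metadata is plain JSON, so this parses:
> parse("""{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]}""")
>
> // the 1.1.1 metadata is the Scala toString of the schema, not JSON,
> // so this throws the same JsonParseException:
> parse("StructType(List(StructField(a,IntegerType,false)))")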
>
> Setting spark.sql.parquet.useDataSourceApi to false works around it (see the
> snippet below), but I don't know what the differences between the two code
> paths are. Is this considered a bug? We have a lot of Parquet files from
> 1.1.1; should we disable the data source API in order to read them if we want
> to upgrade to 1.3?
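>
> For reference, this is roughly how I set it (the second form is equivalent;
> the re-save at the end is an untested sketch of how old files might be
> migrated, with an illustrative output path):
>
> sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
> // or: sqlContext.sql("SET spark.sql.parquet.useDataSourceApi=false")
> val old = sqlContext.parquetFile("/tmp/foo")   // reads the 1.1.1 file again
> old.saveAsParquetFile("/tmp/foo-migrated")     // untested: re-save so the file gets current metadata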
>
> Thanks,
> --
> Pei-Lun
>
