Re: SparkSQL saveAsParquetFile does not preserve AVRO schema
Note: In the code (org.apache.spark.sql.parquet.DefaultSource) I've found this:

    val relation = if (doInsertion) {
      // This is a hack. We always set nullable/containsNull/valueContainsNull
      // to true for the schema of a parquet data.
      val df = sqlContext.createDataFrame(
        data.queryExecution.toRdd,
        data.schema.asNullable)
      val createdRelation =
        createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
      createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
      createdRelation
    }

The culprit is "data.schema.asNullable". What's the real reason for this? Why not simply use the existing schema's nullable flags?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-saveAsParquetFile-does-not-preserve-AVRO-schema-tp2p24454.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
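The effect of that line can be sketched with simplified stand-in types (a toy model, not the real org.apache.spark.sql.types classes): asNullable rewrites every field's nullable flag to true before the rows are handed to the Parquet writer, which is exactly why the "required" constraint never reaches the output file.

```scala
// Toy model of Spark's schema types (NOT the real org.apache.spark.sql.types
// classes) to illustrate the effect of data.schema.asNullable.
case class Field(name: String, dataType: String, nullable: Boolean)

case class Schema(fields: Seq[Field]) {
  // Mirrors asNullable: every field is forced to nullable = true,
  // whatever its original flag was.
  def asNullable: Schema = Schema(fields.map(_.copy(nullable = true)))
}

object AsNullableDemo extends App {
  val original = Schema(Seq(
    Field("time", "long", nullable = false),
    Field("time_ymdhms", "long", nullable = false)))

  val written = original.asNullable
  written.fields.foreach(f => println(s"${f.name}: nullable = ${f.nullable}"))
  // Both fields now report nullable = true, so the writer emits them as
  // "optional" rather than "required".
}
```

In the real code the rewritten schema (df.schema) is what createRelation receives, so the relation that performs the insert never sees the original nullability at all.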
SparkSQL saveAsParquetFile does not preserve AVRO schema
Hi,

I have serious problems with saving a DataFrame as a parquet file. I read the data from the parquet file like this:

    val df = sparkSqlCtx.parquetFile(inputFile.toString)

and print the schema (you can see both fields are required):

    root
     |-- time: long (nullable = false)
     |-- time_ymdhms: long (nullable = false)
     ...omitted...

Now I try to save the DataFrame as a parquet file like this:

    df.saveAsParquetFile(outputFile.toString)

The code runs normally, but loading the file I saved in the previous step (outputFile) together with the same inputFile fails with this error:

    Caused by: parquet.schema.IncompatibleSchemaModificationException:
    repetition constraint is more restrictive: can not merge type
    required int64 time into optional int64 time

The problem is that saveAsParquetFile does not preserve the nullable flags! So once I load the outputFile parquet file and print the schema, I get this:

    root
     |-- time: long (nullable = true)
     |-- time_ymdhms: long (nullable = true)
     ...omitted...

I use Spark 1.3.0 with Parquet 1.6.0. Is it somehow possible to keep these flags as well? Or is it a bug?

Any help will be appreciated. Thanks in advance!

Petr

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-saveAsParquetFile-does-not-preserve-AVRO-schema-tp2.html
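The merge error quoted above follows from Parquet's repetition rules: "required" is a stricter constraint than "optional", so a required field cannot be merged into an optional one. A toy model of just that check (not the real parquet.schema API; only the rejection from the quoted stack trace is modeled faithfully):

```scala
// Toy model (NOT parquet-mr's real API) of the repetition check behind
// IncompatibleSchemaModificationException: an incoming "required" field
// cannot be merged into an existing "optional" one, because "required"
// is the more restrictive constraint.
sealed trait Repetition
case object Required extends Repetition
case object Optional extends Repetition

// Returns true when `incoming` can be merged into `existing`.
// Only the (Optional, Required) rejection mirrors the quoted error;
// the other cases are simplified.
def canMerge(existing: Repetition, incoming: Repetition): Boolean =
  (existing, incoming) match {
    case (Optional, Required) => false // more restrictive: rejected
    case _                    => true
  }

// inputFile declares time as required int64, while the rewritten
// outputFile declares it optional int64 - reading both together
// triggers exactly this rejection.
```

One workaround that is sometimes suggested (a sketch, not verified against your setup) is to re-apply the desired schema after loading, e.g. sparkSqlCtx.createDataFrame(df.rdd, schemaWithCorrectNullability), where schemaWithCorrectNullability is a StructType you build yourself with the nullable flags you need.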