[ https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905856#comment-14905856 ]
Cheng Lian edited comment on SPARK-10659 at 9/24/15 5:51 AM:
-------------------------------------------------------------

This behavior was once a hacky workaround for interoperability with Hive (fields in Hive schemata are always nullable). I think we can remove it now.

One design problem that still needs to be fixed: when persisting a DataFrame as a table in Parquet format into the Hive metastore, what should we do if the schema has non-nullable fields? Basically two choices:
# Persist the table in the Spark SQL data source specific format, which is Hive incompatible but preserves the Parquet schema
# Turn the schema into its nullable form and save it in a Hive compatible format
I'd go for 1.


> DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not
> nullable) flag in schema
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10659
>                 URL: https://issues.apache.org/jira/browse/SPARK-10659
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
>            Reporter: Vladimir Picka
>
> DataFrames currently promote all Parquet schema fields to optional
> automatically when they are written to an empty directory. The problem
> remains in v1.5.0.
> The culprit is this code:
> {code}
> val relation = if (doInsertion) {
>   // This is a hack. We always set nullable/containsNull/valueContainsNull to true
>   // for the schema of a parquet data.
>   val df =
>     sqlContext.createDataFrame(
>       data.queryExecution.toRdd,
>       data.schema.asNullable)
>   val createdRelation =
>     createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
>   createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
>   createdRelation
> }
> {code}
> which was implemented as part of this PR:
> https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b
> This is very unexpected behaviour for some use cases in which files are read
> from one place and written to another, such as small-file packing: it ends up
> with incompatible files, because "required" can't normally be promoted to
> "optional".
> It is the essence of a schema that it enforces the "required" invariant on
> data; it should be assumed that this is intended.
> I believe a better approach is for the default behaviour to keep the schema
> as is, and to provide e.g. a builder method or option that allows forcing
> fields to optional.
> Right now we have to override a private API so that our files are rewritten
> as is, with all its perils.
> Vladimir
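
For anyone trying to reproduce the reported behavior, here is a minimal sketch (assuming a Spark 1.5 spark-shell session where sc and sqlContext are predefined, and a scratch path /tmp/spark-10659): a field declared as required comes back as optional after a write/read round trip.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Build a DataFrame whose schema marks "id" as REQUIRED (nullable = false).
val schema = StructType(Seq(StructField("id", LongType, nullable = false)))
val rdd = sc.parallelize(Seq(Row(1L), Row(2L)))
val df = sqlContext.createDataFrame(rdd, schema)
df.printSchema()  // id: long (nullable = false)

// Write to Parquet and read back: the field is now reported as nullable.
df.write.parquet("/tmp/spark-10659")
sqlContext.read.parquet("/tmp/spark-10659").printSchema()  // id: long (nullable = true)
{code}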
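
And a sketch of what the opt-in suggested in the description could look like. This is purely hypothetical: the option name below does not exist in Spark and is only meant to illustrate the "builder method or option" idea, with the default left as "keep the schema as is".

{code}
// Hypothetical: by default the schema would be written as is (REQUIRED preserved);
// callers who need Hive-style all-optional output would opt in explicitly.
df.write
  .option("forceNullable", "true")  // hypothetical option, not an existing Spark API
  .parquet("/tmp/spark-10659-nullable")
{code}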