[ https://issues.apache.org/jira/browse/SPARK-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-6538:
-----------------------------------

    Assignee: Apache Spark

> Add missing nullable Metastore fields when merging a Parquet schema
> -------------------------------------------------------------------
>
>                 Key: SPARK-6538
>                 URL: https://issues.apache.org/jira/browse/SPARK-6538
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Adam Budde
>            Assignee: Apache Spark
>             Fix For: 1.3.1
>
>
> When Spark SQL infers a schema for a DataFrame, it will take the union of all
> field types present in the structured source data (e.g. an RDD of JSON data).
> When the source data for a row doesn't define a particular field on the
> DataFrame's schema, a null value will simply be assumed for this field. This
> workflow makes it very easy to construct tables and query over a set of
> structured data with a nonuniform schema. However, this behavior is not
> consistent in some cases when dealing with Parquet files and an external
> table managed by an external Hive metastore.
>
> In our particular use case, we use Spark Streaming to parse and transform our
> input data and then apply a window function to save an arbitrary-sized batch
> of data as a Parquet file, which itself will be added as a partition to an
> external Hive table via an "ALTER TABLE... ADD PARTITION..." statement. Since
> our input data is nonuniform, it is expected that not every partition batch
> will contain every field present in the table's schema obtained from the Hive
> metastore. As such, we expect that the schema of some of our Parquet files
> may not contain the same set of fields present in the full metastore schema.
>
> In such cases, it seems natural that Spark SQL would simply assume null
> values for any missing fields in the partition's Parquet file, assuming these
> fields are specified as nullable by the metastore schema. This is not the
> case in the current implementation of ParquetRelation2. The
> mergeMetastoreParquetSchema() method used to reconcile differences between a
> Parquet file's schema and a schema retrieved from the Hive metastore will
> raise an exception if the Parquet file doesn't match the same set of fields
> specified by the metastore.
>
> I propose altering this implementation in order to allow for any missing
> metastore fields marked as nullable to be merged into the Parquet file's
> schema before continuing with the checks present in
> mergeMetastoreParquetSchema().
>
> Classifying this as a bug as it exposes inconsistent behavior, IMHO. If you
> feel this should be an improvement or new feature instead, please feel free
> to reclassify this issue.
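
The merge step proposed in the issue can be illustrated with a minimal sketch. This is not the Spark implementation: `Field` and `add_missing_nullable_fields` are hypothetical names invented for illustration, schemas are modeled as plain lists rather than Spark's own schema types, and the case-insensitive name comparison is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (name, type, nullability)."""
    name: str
    data_type: str
    nullable: bool = True


def add_missing_nullable_fields(metastore_schema: List[Field],
                                parquet_schema: List[Field]) -> List[Field]:
    """Return the Parquet schema extended with nullable metastore fields it lacks.

    Names are compared case-insensitively; this is an assumption of the sketch.
    Missing fields that are NOT nullable are left out, so any later
    consistency check can still reject the file.
    """
    present = {f.name.lower() for f in parquet_schema}
    missing = [f for f in metastore_schema
               if f.nullable and f.name.lower() not in present]
    return list(parquet_schema) + missing


# Example: the partition's Parquet file lacks the nullable "score" column.
metastore = [Field("id", "int", nullable=False),
             Field("name", "string"),
             Field("score", "double")]
parquet = [Field("id", "int", nullable=False),
           Field("name", "string")]
merged = add_missing_nullable_fields(metastore, parquet)
```

After this step `merged` holds the same field names as the metastore schema, so a subsequent "same set of fields" check would pass and rows from that partition would read `score` as null. A missing field that the metastore marks as non-nullable is deliberately not filled in.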