[ 
https://issues.apache.org/jira/browse/SPARK-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6538:
-----------------------------------

    Assignee: Apache Spark

> Add missing nullable Metastore fields when merging a Parquet schema
> -------------------------------------------------------------------
>
>                 Key: SPARK-6538
>                 URL: https://issues.apache.org/jira/browse/SPARK-6538
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Adam Budde
>            Assignee: Apache Spark
>             Fix For: 1.3.1
>
>
> When Spark SQL infers a schema for a DataFrame, it will take the union of all 
> field types present in the structured source data (e.g. an RDD of JSON data). 
> When the source data for a row doesn't define a particular field on the 
> DataFrame's schema, a null value will simply be assumed for this field. This 
> workflow makes it very easy to construct tables and query over a set of 
> structured data with a nonuniform schema. However, this behavior is not 
> consistent in some cases when dealing with Parquet files and an external 
> table managed by an external Hive metastore.
> In our particular use case, we use Spark Streaming to parse and transform our 
> input data and then apply a window function to save an arbitrary-sized batch 
> of data as a Parquet file, which itself will be added as a partition to an 
> external Hive table via an "ALTER TABLE... ADD PARTITION..." statement. Since 
> our input data is nonuniform, it is expected that not every partition batch 
> will contain every field present in the table's schema obtained from the Hive 
> metastore. As such, we expect that the schema of some of our Parquet files 
> may not contain the same set of fields present in the full metastore schema.
> In such cases, it seems natural that Spark SQL would simply assume null 
> values for any missing fields in the partition's Parquet file, assuming these 
> fields are specified as nullable by the metastore schema. This is not the 
> case in the current implementation of ParquetRelation2. The 
> mergeMetastoreParquetSchema() method used to reconcile differences between a 
> Parquet file's schema and a schema retrieved from the Hive metastore will 
> raise an exception if the Parquet file doesn't contain the same set of fields 
> specified by the metastore.
> I propose altering this implementation in order to allow for any missing 
> metastore fields marked as nullable to be merged into the Parquet file's 
> schema before continuing with the checks present in 
> mergeMetastoreParquetSchema().
> Classifying this as a bug as it exposes inconsistent behavior, IMHO. If you 
> feel this should be an improvement or new feature instead, please feel free 
> to reclassify this issue.
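
The proposal in the description can be illustrated with a minimal, Spark-free sketch. This is not the ParquetRelation2 code: the Field class and both function names below are hypothetical, and the case-insensitive name matching is an assumption. The sketch only shows the order of operations described above: first add any nullable metastore fields the Parquet schema lacks, then run the existing "same set of fields" check.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (name, type, nullability)."""
    name: str
    data_type: str
    nullable: bool = True


def add_missing_nullable_fields(metastore_schema: List[Field],
                                parquet_schema: List[Field]) -> List[Field]:
    """Return the Parquet fields plus any nullable metastore fields they lack."""
    # Assumption: field names are compared case-insensitively.
    present = {f.name.lower() for f in parquet_schema}
    missing = [f for f in metastore_schema
               if f.nullable and f.name.lower() not in present]
    return list(parquet_schema) + missing


def merge_schemas(metastore_schema: List[Field],
                  parquet_schema: List[Field]) -> List[Field]:
    """Fill in nullable gaps, then require both schemas to name the same fields."""
    filled = add_missing_nullable_fields(metastore_schema, parquet_schema)
    by_name = {f.name.lower(): f for f in filled}
    expected = {f.name.lower() for f in metastore_schema}
    if set(by_name) != expected:
        # Still mismatched: a non-nullable metastore field is absent, or the
        # Parquet file has a field the metastore does not know about.
        raise ValueError("Parquet schema cannot be reconciled with metastore schema")
    # Keep metastore names, order, and nullability; take types from the filled schema.
    return [Field(m.name, by_name[m.name.lower()].data_type, m.nullable)
            for m in metastore_schema]
```

With a metastore schema of (id: non-nullable, name: nullable, score: nullable), a Parquet file containing only "id" would be reconciled to all three fields, while a file missing "id" would still raise, since a non-nullable field cannot be assumed null.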



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
