GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/4826
[SPARK-5950][SQL] Insert array into a metastore table saved as parquet should work when using datasource api

This PR contains the following changes:

1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is a middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it performs `equalsIgnoreNullability` and also checks whether the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is not compatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` instead of a strict equality check of the data types.
3. On the data source write path, when appending data, always use the schema of the existing table to write the data. This is important for parquet, since nullability directly impacts the way values are encoded and decoded. If we do not do this, we may see corrupted values when reading from a set of parquet files generated with different nullability settings.
4. When creating a new parquet table, always set nullable/containsNull/valueContainsNull to true. This way we never hit the situation where we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
5. Update the equality check of the JSON relation. Since JSON does not really care about nullability, `equalsIgnoreNullability` is a better choice for comparing the schemata of two JSON tables.
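The "compatible nullability" rule from point 1 can be sketched as follows. This is a simplified, self-contained model covering only integers and arrays, not Spark's actual `DataType` hierarchy or implementation: the idea is that `from` may only claim "contains null" where `to` also allows it.

```scala
// Simplified stand-ins for Spark SQL's data types (illustration only).
sealed trait DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

// Sketch of the compatibility check: `from` is compatible with `to`
// when every nullability flag set in `from` is also set in `to`.
def equalsIgnoreCompatibleNullability(from: DataType, to: DataType): Boolean =
  (from, to) match {
    case (ArrayType(fromElem, fromNull), ArrayType(toElem, toNull)) =>
      // Writing a maybe-null array into a no-null slot is unsafe.
      (toNull || !fromNull) && equalsIgnoreCompatibleNullability(fromElem, toElem)
    case (f, t) => f == t
  }

// A no-null array fits into a maybe-null slot...
assert(equalsIgnoreCompatibleNullability(
  ArrayType(IntegerType, containsNull = false),
  ArrayType(IntegerType, containsNull = true)))
// ...but a maybe-null array does not fit into a no-null slot.
assert(!equalsIgnoreCompatibleNullability(
  ArrayType(IntegerType, containsNull = true),
  ArrayType(IntegerType, containsNull = false)))
```

Note the asymmetry: unlike `equalsIgnoreNullability`, this check is directional, which is why it takes distinct `from` and `to` arguments.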
cc @marmbrus @liancheng

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark insertNullabilityCheck

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4826.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4826

----

commit 4ec17fd28a45e42db92144af9cb8a8e7e796eb40
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-27T21:20:00Z

    Failed test.

commit 8f19fe520080f064b50dc5885f221889c2612eea
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-27T21:20:57Z

    equalsIgnoreCompatibleNullability

commit 9a266114fca979c69468709ed82fbb99fe2595e6
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-27T21:26:33Z

    Make InsertIntoTable happy.

commit 0a703e751cf0ebcd481f2f7dd66cc7bdea529f04
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-27T21:38:07Z

    Test failed again since we cannot read correct content.

commit bf50d7383e499cbf1e3964a9391d4e9b56607f32
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-28T05:33:43Z

    When appending data, we use the schema of the existing table instead of the schema of the new data.

commit 8bd008b403140b430344d669727410de7b4bc235
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-28T05:34:54Z

    nullable, containsNull, and valueContainsNull will be always true for parquet data.

commit b2c06f8c4e67450650b2a58c5168eb31cd490641
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-28T05:35:30Z

    Ignore nullability in JSON relation's equality check.

commit e4f397cea7ec0dc21a714b75a7254bb275319fc2
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-28T05:35:54Z

    Unit tests.

----