GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/4826
[SPARK-5950][SQL] Insert array into a metastore table saved as parquet should work when using datasource api

This PR contains the following changes:

1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is a middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it performs `equalsIgnoreNullability` and also checks whether the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is not compatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` instead of a strict equality check of the data types.
3. On the data source write path, when appending data, always use the schema of the existing table to write the data. This is important for parquet, since nullability directly impacts the way values are encoded and decoded. If we do not do this, we may see corrupted values when reading from a set of parquet files generated with different nullability settings.
4. When creating a new parquet table, always set nullable/containsNull/valueContainsNull to true. This way we never hit the situation where we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
5. Update the equality check of the JSON relation. Since JSON does not really care about nullability, `equalsIgnoreNullability` is a better choice for comparing the schemata of two JSON tables.
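The "compatible nullability" rule from point 1 can be sketched as follows. This is a simplified, self-contained model covering only integers and arrays, not Spark's actual `DataType` hierarchy or implementation: the idea is that `from` may only claim "contains null" where `to` also allows it.

```scala
// Simplified stand-ins for Spark SQL's data types (illustration only).
sealed trait DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

// Sketch of the compatibility check: `from` is compatible with `to`
// when every nullability flag set in `from` is also set in `to`.
def equalsIgnoreCompatibleNullability(from: DataType, to: DataType): Boolean =
  (from, to) match {
    case (ArrayType(fromElem, fromNull), ArrayType(toElem, toNull)) =>
      // Writing a maybe-null array into a no-null slot is unsafe.
      (toNull || !fromNull) && equalsIgnoreCompatibleNullability(fromElem, toElem)
    case (f, t) => f == t
  }

// A no-null array fits into a maybe-null slot...
assert(equalsIgnoreCompatibleNullability(
  ArrayType(IntegerType, containsNull = false),
  ArrayType(IntegerType, containsNull = true)))
// ...but a maybe-null array does not fit into a no-null slot.
assert(!equalsIgnoreCompatibleNullability(
  ArrayType(IntegerType, containsNull = true),
  ArrayType(IntegerType, containsNull = false)))
```

Note the asymmetry: unlike `equalsIgnoreNullability`, this check is directional, which is why it takes distinct `from` and `to` arguments.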
cc @marmbrus @liancheng

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark insertNullabilityCheck

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4826.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4826

----

commit 4ec17fd28a45e42db92144af9cb8a8e7e796eb40
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-27T21:20:00Z

    Failed test.

commit 8f19fe520080f064b50dc5885f221889c2612eea
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-27T21:20:57Z

    equalsIgnoreCompatibleNullability

commit 9a266114fca979c69468709ed82fbb99fe2595e6
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-27T21:26:33Z

    Make InsertIntoTable happy.

commit 0a703e751cf0ebcd481f2f7dd66cc7bdea529f04
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-27T21:38:07Z

    Test failed again since we cannot read correct content.

commit bf50d7383e499cbf1e3964a9391d4e9b56607f32
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-28T05:33:43Z

    When appending data, we use the schema of the existing table instead of the schema of the new data.

commit 8bd008b403140b430344d669727410de7b4bc235
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-28T05:34:54Z

    nullable, containsNull, and valueContainsNull will be always true for parquet data.

commit b2c06f8c4e67450650b2a58c5168eb31cd490641
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-28T05:35:30Z

    Ignore nullability in JSON relation's equality check.

commit e4f397cea7ec0dc21a714b75a7254bb275319fc2
Author: Yin Huai <yh...@databricks.com>
Date: 2015-02-28T05:35:54Z

    Unit tests.

----