GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/23124

    [SPARK-25829][SQL] remove duplicated map keys with last wins policy

    ## What changes were proposed in this pull request?
    
    Currently, duplicated map keys are not handled consistently. For example, 
map lookup returns the value of the duplicated key that appears first, 
`Dataset.collect` keeps only the duplicated key that appears last, `MapKeys` 
returns all duplicated keys, etc.
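    
    The three behaviors above can be pictured with a toy model in plain 
Python. This is not Spark code: it assumes a map stored as two parallel 
arrays, and all function names are hypothetical.

```python
# Toy model only (not Spark internals): a map stored as two parallel
# arrays, where key 1 is duplicated.
keys = [1, 2, 1]
values = ["a", "b", "c"]


def lookup_first_match(keys, values, key):
    """Linear scan from the front: the first occurrence of the key wins."""
    for k, v in zip(keys, values):
        if k == key:
            return v
    return None


def collect_as_dict(keys, values):
    """Build a hash map: later entries overwrite earlier ones (last wins)."""
    return dict(zip(keys, values))


def map_keys(keys):
    """Return the key array as stored, duplicates included."""
    return list(keys)
```

Under these assumptions the same data yields three different views of key 
`1`, which is the inconsistency described here.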
    
    This PR proposes to remove duplicated map keys with a last-wins policy, 
following Java/Scala and Presto. It only applies to built-in functions, as 
users can create maps with duplicated keys via private APIs anyway.
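    
    A minimal sketch of a last-wins policy, written in plain Python rather 
than against Spark's classes. The function name `dedup_last_wins` is 
hypothetical, and keeping each key at the position of its first occurrence is 
an assumption; the description does not say how the result is ordered.

```python
def dedup_last_wins(keys, values):
    """Drop duplicated keys, keeping the value of the last occurrence.

    Assumption: a key stays at the position of its first occurrence,
    which is what an insertion-ordered map does when a key is overwritten.
    """
    merged = {}
    for k, v in zip(keys, values):
        merged[k] = v  # overwriting keeps the key's original position
    return list(merged.keys()), list(merged.values())


# Key 1 appears twice; only its last value ("c") survives.
new_keys, new_values = dedup_last_wins([1, 2, 1], ["a", "b", "c"])
```

After deduplication, lookup and key listing see the same single entry per 
key.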
    
    For other places:
    1. Data source v1 doesn't have this problem, as users need to provide a 
Java/Scala map, which can't have duplicated keys.
    2. Data source v2 may have this problem. I've added a note to 
`ArrayBasedMapData` asking the caller to take care of duplicated keys. In the 
future we should enforce it in the stable data APIs for data source v2.
    3. UDFs don't have this problem, as users need to provide a Java/Scala 
map, the same as for data source v1.
    4. File formats. I checked all of them, and only Parquet does not enforce 
unique map keys. For backward compatibility I changed nothing, but left a note 
saying that the behavior is undefined if users write maps with duplicated keys 
to Parquet files. Maybe we can add a config and fail by default if Parquet 
files contain maps with duplicated keys. This can be done in a follow-up.
    
    ## How was this patch tested?
    
    Updated existing tests and added new tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark map

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23124
    
----
commit cbcd5d7a937f8120ef8527f1f26150ed93f1de0a
Author: Wenchen Fan <wenchen@...>
Date:   2018-11-15T02:49:22Z

    remove duplicated map keys with last wins policy

----

