GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/23124
[SPARK-25829][SQL] remove duplicated map keys with last wins policy ## What changes were proposed in this pull request? Currently duplicated map keys are not handled consistently. For example, map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. This PR proposes to remove duplicated map keys with last wins policy, to follow Java/Scala and Presto. It only applies to built-in functions, as users can create map with duplicated map keys via private APIs anyway. For other places: 1. data source v1 doesn't have this problem, as users need to provide a java/scala map, which can't have duplicated keys. 2. data source v2 may have this problem. I've added a note to `ArrayBasedMapData` to ask the caller to take care of duplicated keys. In the future we should enforce it in the stable data APIs for data source v2. 3. UDF doesn't have this problem, as users need to provide a java/scala map. Same as data source v1. 4. file format. I checked all of them and only parquet does not enforce it. For backward compatibility reasons I change nothing but leave a note saying that the behavior will be undefined if users write map with duplicated keys to parquet files. Maybe we can add a config and fail by default if parquet files have map with duplicated keys. This can be done in followup. ## How was this patch tested? updated tests and new tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark map Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23124.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23124 ---- commit cbcd5d7a937f8120ef8527f1f26150ed93f1de0a Author: Wenchen Fan <wenchen@...> Date: 2018-11-15T02:49:22Z remove duplicated map keys with last wins policy ---- --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org