Thank you for the decision, all. As of now, to unblock this, the plan is to remove the newly added map-related functions from the function registry:

https://github.com/apache/spark/pull/22821

One problem here is that users can simply recover those functions, like this:

scala> spark.sessionState.functionRegistry.createOrReplaceTempFunction("map_filter", x => org.apache.spark.sql.catalyst.expressions.MapFilter(x(0), x(1)))
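For reference, here is a minimal, untested sketch of that recovery end-to-end on a 2.4.0 build with the PR applied. It assumes the analyzer can still resolve the lambda argument for the hidden expression class, which is what the one-liner above relies on:

scala> import org.apache.spark.sql.catalyst.expressions.MapFilter

scala> // Re-register the hidden expression under its SQL name, for this session only.
scala> spark.sessionState.functionRegistry.createOrReplaceTempFunction(
     |   "map_filter", args => MapFilter(args(0), args(1)))

scala> // The function is then usable from SQL again in this session.
scala> spark.sql("SELECT map_filter(map(1, 2, 3, 4), (k, v) -> k > 1)").show

In other words, removing the registry entries only hides the SQL-level names; the expression classes are still on the classpath.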
Technically, the PR looks like a compromise: it unblocks the release while still allowing some users to use that feature fully. At first glance, I thought it was a workaround that ignores the discussion context, but it sounds like one of the practical options for Apache Spark. (We had the Spark 2.0 Tech Preview before.)

I want to finalize the decision on the `map_filter` issue (and the three related functions). Are we good to go with https://github.com/apache/spark/pull/22821?

Bests,
Dongjoon.

PS. Also, there is a PR to completely remove them, too.
https://github.com/cloud-fan/spark/pull/11

On Wed, Oct 24, 2018 at 10:14 PM Xiao Li <lix...@databricks.com> wrote:

> @Dongjoon Hyun <dongjoon.h...@gmail.com> Thanks! This is a blocking
> ticket. It returns a wrong result due to our undefined behavior. I agree we
> should revert the newly added map-oriented functions. In the 3.0 release, we
> need to define the behavior of duplicate keys in the MAP data type and fix
> all the related issues that are confusing to our end users.
>
> Thanks,
>
> Xiao
>
> On Wed, Oct 24, 2018 at 9:54 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Ah, now I see the problem. `map_filter` has a very weird semantic that is
>> neither "earlier entry wins" nor "latter entry wins".
>>
>> I've opened https://github.com/apache/spark/pull/22821 to remove these
>> newly added map-related functions from FunctionRegistry (for 2.4.0), so that
>> they are invisible to end users and the weird behavior of the Spark map type
>> with duplicated keys is not escalated. We should fix it ASAP in the master
>> branch.
>>
>> If others are OK with it, I'll start a new RC after that PR is merged.
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> For the first question, it's the `bin/spark-sql` result. I didn't check STS,
>>> but it will return the same result as `bin/spark-sql`.
>>>
>>> > I think map_filter is implemented correctly. map(1,2,1,3) is actually
>>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>>> this will change in 2.4.1.
>>>
>>> For the second one, the `map_filter` issue is not about the `earlier entry
>>> wins` semantic. Please see the following examples.
>>>
>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
>>> map_concat(map(1,2), map(1,3)) m);
>>> {1:3}   {1:2}
>>>
>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
>>> map_concat(map(1,2), map(1,3)) m);
>>> {1:3}   {1:3}
>>>
>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
>>> map_concat(map(1,2), map(1,3)) m);
>>> {1:3}   {}
>>>
>>> In other words, `map_filter` behaves like a filter pushed down into the map
>>> in terms of the output result,
>>> while users assume that `map_filter` works on top of the result of `m`.
>>>
>>> This is a function semantic issue.
>>>
>>>
>>> On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>>> > {1:3}
>>>>
>>>> Are you running in the thrift-server? Then maybe this is caused by the
>>>> bug in `Dataset.collect` as I mentioned above.
>>>>
>>>> I think map_filter is implemented correctly. map(1,2,1,3) is actually
>>>> map(1,2) according to the "earlier entry wins" semantic.
>>>> I don't think this will change in 2.4.1.
>>>>
>>>> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thank you for the follow-ups.
>>>>>
>>>>> Then, Spark 2.4.1 will return `{1:2}`, differently from all of the
>>>>> following (including Spark/Scala), in the end?
>>>>>
>>>>> I hoped to fix `map_filter`, but now Spark looks inconsistent in
>>>>> many ways.
>>>>>
>>>>> scala> sql("select map(1,2,1,3)").show  // Spark 2.2.2
>>>>> +---------------+
>>>>> |map(1, 2, 1, 3)|
>>>>> +---------------+
>>>>> |    Map(1 -> 3)|
>>>>> +---------------+
>>>>>
>>>>>
>>>>> spark-sql> select map(1,2,1,3);  // Spark 2.4.0 RC4
>>>>> {1:3}
>>>>>
>>>>>
>>>>> hive> select map(1,2,1,3);  // Hive 1.2.2
>>>>> OK
>>>>> {1:3}
>>>>>
>>>>>
>>>>> presto> SELECT map_concat(map(array[1],array[2]),
>>>>> map(array[1],array[3]));  // Presto 0.212
>>>>>  _col0
>>>>> -------
>>>>>  {1=3}
>>>>>
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <cloud0...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Dongjoon,
>>>>>>
>>>>>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>>>>>
>>>>>> The problem is not about the function `map_filter`, but about how
>>>>>> map type values are created in Spark when there are duplicated keys.
>>>>>>
>>>>>> In programming languages like Java/Scala, when creating a map, the
>>>>>> later entry wins. e.g. in Scala:
>>>>>> scala> Map(1 -> 2, 1 -> 3)
>>>>>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>>>>>
>>>>>> scala> Map(1 -> 2, 1 -> 3).get(1)
>>>>>> res1: Option[Int] = Some(3)
>>>>>>
>>>>>> However, in Spark, the earlier entry wins:
>>>>>> scala> sql("SELECT map(1,2,1,3)[1]").show
>>>>>> +------------------+
>>>>>> |map(1, 2, 1, 3)[1]|
>>>>>> +------------------+
>>>>>> |                 2|
>>>>>> +------------------+
>>>>>>
>>>>>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>>>>>
>>>>>> But there are several bugs in Spark:
>>>>>>
>>>>>> scala> sql("SELECT map(1,2,1,3)").show
>>>>>> +----------------+
>>>>>> | map(1, 2, 1, 3)|
>>>>>> +----------------+
>>>>>> |[1 -> 2, 1 -> 3]|
>>>>>> +----------------+
>>>>>> The displayed string of map values has a bug and we should
>>>>>> deduplicate the entries. This is tracked by SPARK-25824.
>>>>>>
>>>>>>
>>>>>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>>>>>> res11: org.apache.spark.sql.DataFrame = []
>>>>>>
>>>>>> scala> sql("select * from t").show
>>>>>> +--------+
>>>>>> |     map|
>>>>>> +--------+
>>>>>> |[1 -> 3]|
>>>>>> +--------+
>>>>>> The Hive map value converter has a bug; we should respect the "earlier
>>>>>> entry wins" semantic. No ticket yet.
>>>>>>
>>>>>>
>>>>>> scala> sql("select map(1,2,1,3)").collect
>>>>>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>>>>>> The same bug happens in `collect`. No ticket yet.
>>>>>>
>>>>>> I'll create tickets and list all of them as known issues in 2.4.0.
>>>>>>
>>>>>> It's arguable whether the "earlier entry wins" semantic is reasonable.
>>>>>> Fixing it is a behavior change and we can only apply it to the master
>>>>>> branch.
>>>>>>
>>>>>> Going back to https://issues.apache.org/jira/browse/SPARK-25823,
>>>>>> it's just a symptom of the Hive map value converter bug. I think it's a
>>>>>> non-blocker.
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>>
>>>>>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> -0 due to the following issue.
>>>>>>> From Spark 2.4.0, users may get an incorrect result when they use the
>>>>>>> new `map_filter` function together with `map_concat`.
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-25823
>>>>>>>
>>>>>>> SPARK-25823 only aims to fix the data correctness issue in
>>>>>>> `map_filter`.
>>>>>>>
>>>>>>> PMC members are able to lower the priority. As always, I respect the
>>>>>>> PMC's decision.
>>>>>>>
>>>>>>> I'm sending this email to draw more attention to this bug and to
>>>>>>> warn the community about the new feature's limitation.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>> version 2.4.0.
>>>>>>>>
>>>>>>>> The vote is open until October 26 PST and passes if a majority of +1
>>>>>>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>>>
>>>>>>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>
>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>> http://spark.apache.org/
>>>>>>>>
>>>>>>>> The tag to be voted on is v2.4.0-rc4 (commit
>>>>>>>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>>>>>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>>>>>>
>>>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>>>>>>
>>>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>>>
>>>>>>>> The staging repository for this release can be found at:
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>>>>>>
>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>>>>>>
>>>>>>>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>>>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>>>>>>
>>>>>>>> FAQ
>>>>>>>>
>>>>>>>> =========================
>>>>>>>> How can I help test this release?
>>>>>>>> =========================
>>>>>>>>
>>>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>>>> an existing Spark workload and running it on this release candidate,
>>>>>>>> then reporting any regressions.
>>>>>>>>
>>>>>>>> If you're working in PySpark, you can set up a virtual env, install
>>>>>>>> the current RC, and see if anything important breaks. In Java/Scala,
>>>>>>>> you can add the staging repository to your project's resolvers and test
>>>>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>>>>> you don't end up building with an out-of-date RC going forward).
>>>>>>>>
>>>>>>>> ===========================================
>>>>>>>> What should happen to JIRA tickets still targeting 2.4.0?
>>>>>>>> ===========================================
>>>>>>>>
>>>>>>>> The current list of open tickets targeted at 2.4.0 can be found at:
>>>>>>>> https://issues.apache.org/jira/projects/SPARK and search for
>>>>>>>> "Target Version/s" = 2.4.0
>>>>>>>>
>>>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>>>>> be worked on immediately.
>>>>>>>> Everything else please retarget to an appropriate release.
>>>>>>>>
>>>>>>>> ==================
>>>>>>>> But my bug isn't fixed?
>>>>>>>> ==================
>>>>>>>>
>>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>>> release unless the bug in question is a regression from the previous
>>>>>>>> release. That being said, if there is something which is a regression
>>>>>>>> that has not been correctly targeted, please ping me or a committer
>>>>>>>> to help target the issue.