Thank you for the decision, all. As of now, to unblock this, the plan is to remove the newly added map-related functions from the function registry:

https://github.com/apache/spark/pull/22821

One problem here is that users can simply recover those functions, like this:

scala> spark.sessionState.functionRegistry.createOrReplaceTempFunction("map_filter", x => org.apache.spark.sql.catalyst.expressions.MapFilter(x(0), x(1)))
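For reference, here is a minimal, untested sketch of that recovery end-to-end on a 2.4.0 build with the PR applied. It assumes the analyzer can still resolve the lambda argument for the hidden expression class, which is what the one-liner above relies on:

scala> import org.apache.spark.sql.catalyst.expressions.MapFilter

scala> // Re-register the hidden expression under its SQL name, for this session only.
scala> spark.sessionState.functionRegistry.createOrReplaceTempFunction(
     |   "map_filter", args => MapFilter(args(0), args(1)))

scala> // The function is then usable from SQL again in this session.
scala> spark.sql("SELECT map_filter(map(1, 2, 3, 4), (k, v) -> k > 1)").show

In other words, removing the registry entries only hides the SQL-level names; the expression classes are still on the classpath.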
Technically, the PR looks like a compromise: it unblocks the release while still allowing some users to use that feature fully. At first glance, I thought it was a workaround that ignores the discussion context, but it sounds like one of the practical options for Apache Spark. (We had the Spark 2.0 Tech Preview before.)

I want to finalize the decision on the `map_filter` issue (and the three related functions). Are we good to go with https://github.com/apache/spark/pull/22821?

Bests,
Dongjoon.

PS. Also, there is a PR to completely remove them, too.
https://github.com/cloud-fan/spark/pull/11

On Wed, Oct 24, 2018 at 10:14 PM Xiao Li <lix...@databricks.com> wrote:

> @Dongjoon Hyun <dongjoon.h...@gmail.com> Thanks! This is a blocking
> ticket. It returns a wrong result due to our undefined behavior. I agree we
> should revert the newly added map-oriented functions. In the 3.0 release, we
> need to define the behavior of duplicate keys in the MAP data type and fix
> all the related issues that are confusing to our end users.
>
> Thanks,
>
> Xiao
>
> On Wed, Oct 24, 2018 at 9:54 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Ah, now I see the problem. `map_filter` has a very weird semantic that is
>> neither "earlier entry wins" nor "latter entry wins".
>>
>> I've opened https://github.com/apache/spark/pull/22821 to remove these
>> newly added map-related functions from FunctionRegistry (for 2.4.0), so that
>> they are invisible to end users and the weird behavior of the Spark map type
>> with duplicated keys is not escalated. We should fix it ASAP in the master
>> branch.
>>
>> If others are OK with it, I'll start a new RC after that PR is merged.
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> For the first question, it's the `bin/spark-sql` result. I didn't check STS,
>>> but it will return the same result as `bin/spark-sql`.
>>>
>>> > I think map_filter is implemented correctly. map(1,2,1,3) is actually
>>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>>> this will change in 2.4.1.
>>>
>>> For the second one, the `map_filter` issue is not about the `earlier entry
>>> wins` semantic. Please see the following examples.
>>>
>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
>>> map_concat(map(1,2), map(1,3)) m);
>>> {1:3}   {1:2}
>>>
>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
>>> map_concat(map(1,2), map(1,3)) m);
>>> {1:3}   {1:3}
>>>
>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
>>> map_concat(map(1,2), map(1,3)) m);
>>> {1:3}   {}
>>>
>>> In other words, `map_filter` behaves like a filter pushed down into the map
>>> in terms of the output result,
>>> while users assume that `map_filter` works on top of the result of `m`.
>>>
>>> This is a function semantic issue.
>>>
>>>
>>> On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>>> > {1:3}
>>>>
>>>> Are you running in the thrift-server? Then maybe this is caused by the
>>>> bug in `Dataset.collect` as I mentioned above.
>>>>
>>>> I think map_filter is implemented correctly. map(1,2,1,3) is actually
>>>> map(1,2) according to the "earlier entry wins" semantic.
>>>> I don't think this will change in 2.4.1.
>>>>
>>>> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thank you for the follow-ups.
>>>>>
>>>>> Then, Spark 2.4.1 will return `{1:2}`, differently from all of the
>>>>> following (including Spark/Scala), in the end?
>>>>>
>>>>> I hoped to fix `map_filter`, but now Spark looks inconsistent in
>>>>> many ways.
>>>>>
>>>>> scala> sql("select map(1,2,1,3)").show  // Spark 2.2.2
>>>>> +---------------+
>>>>> |map(1, 2, 1, 3)|
>>>>> +---------------+
>>>>> |    Map(1 -> 3)|
>>>>> +---------------+
>>>>>
>>>>>
>>>>> spark-sql> select map(1,2,1,3);  // Spark 2.4.0 RC4
>>>>> {1:3}
>>>>>
>>>>>
>>>>> hive> select map(1,2,1,3);  // Hive 1.2.2
>>>>> OK
>>>>> {1:3}
>>>>>
>>>>>
>>>>> presto> SELECT map_concat(map(array[1],array[2]),
>>>>> map(array[1],array[3]));  // Presto 0.212
>>>>>  _col0
>>>>> -------
>>>>>  {1=3}
>>>>>
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan <cloud0...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Dongjoon,
>>>>>>
>>>>>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>>>>>
>>>>>> The problem is not about the function `map_filter`, but about how
>>>>>> map type values are created in Spark when there are duplicated keys.
>>>>>>
>>>>>> In programming languages like Java/Scala, when creating a map, the
>>>>>> later entry wins. e.g. in Scala:
>>>>>> scala> Map(1 -> 2, 1 -> 3)
>>>>>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>>>>>
>>>>>> scala> Map(1 -> 2, 1 -> 3).get(1)
>>>>>> res1: Option[Int] = Some(3)
>>>>>>
>>>>>> However, in Spark, the earlier entry wins:
>>>>>> scala> sql("SELECT map(1,2,1,3)[1]").show
>>>>>> +------------------+
>>>>>> |map(1, 2, 1, 3)[1]|
>>>>>> +------------------+
>>>>>> |                 2|
>>>>>> +------------------+
>>>>>>
>>>>>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>>>>>
>>>>>> But there are several bugs in Spark:
>>>>>>
>>>>>> scala> sql("SELECT map(1,2,1,3)").show
>>>>>> +----------------+
>>>>>> | map(1, 2, 1, 3)|
>>>>>> +----------------+
>>>>>> |[1 -> 2, 1 -> 3]|
>>>>>> +----------------+
>>>>>> The displayed string of map values has a bug and we should
>>>>>> deduplicate the entries. This is tracked by SPARK-25824.
>>>>>>
>>>>>>
>>>>>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>>>>>> res11: org.apache.spark.sql.DataFrame = []
>>>>>>
>>>>>> scala> sql("select * from t").show
>>>>>> +--------+
>>>>>> |     map|
>>>>>> +--------+
>>>>>> |[1 -> 3]|
>>>>>> +--------+
>>>>>> The Hive map value converter has a bug; we should respect the "earlier
>>>>>> entry wins" semantic. No ticket yet.
>>>>>>
>>>>>>
>>>>>> scala> sql("select map(1,2,1,3)").collect
>>>>>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>>>>>> The same bug happens in `collect`. No ticket yet.
>>>>>>
>>>>>> I'll create tickets and list all of them as known issues in 2.4.0.
>>>>>>
>>>>>> It's arguable whether the "earlier entry wins" semantic is reasonable.
>>>>>> Fixing it is a behavior change and we can only apply it to the master
>>>>>> branch.
>>>>>>
>>>>>> Going back to https://issues.apache.org/jira/browse/SPARK-25823,
>>>>>> it's just a symptom of the Hive map value converter bug. I think it's a
>>>>>> non-blocker.
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>>
>>>>>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> -0 due to the following issue.
>>>>>>> From Spark 2.4.0, users may get an incorrect result when they use the
>>>>>>> new `map_filter` function together with `map_concat`.
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-25823
>>>>>>>
>>>>>>> SPARK-25823 only aims to fix the data correctness issue in
>>>>>>> `map_filter`.
>>>>>>>
>>>>>>> PMC members are able to lower the priority. As always, I respect the
>>>>>>> PMC's decision.
>>>>>>>
>>>>>>> I'm sending this email to draw more attention to this bug and to
>>>>>>> warn the community about the new feature's limitation.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan <cloud0...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>> version 2.4.0.
>>>>>>>>
>>>>>>>> The vote is open until October 26 PST and passes if a majority of +1
>>>>>>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>>>
>>>>>>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>
>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>> http://spark.apache.org/
>>>>>>>>
>>>>>>>> The tag to be voted on is v2.4.0-rc4 (commit
>>>>>>>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>>>>>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>>>>>>
>>>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>>>>>>
>>>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>>>
>>>>>>>> The staging repository for this release can be found at:
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>>>>>>
>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>>>>>>
>>>>>>>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>>>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>>>>>>
>>>>>>>> FAQ
>>>>>>>>
>>>>>>>> =========================
>>>>>>>> How can I help test this release?
>>>>>>>> =========================
>>>>>>>>
>>>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>>>> an existing Spark workload and running it on this release candidate,
>>>>>>>> then reporting any regressions.
>>>>>>>>
>>>>>>>> If you're working in PySpark, you can set up a virtual env, install
>>>>>>>> the current RC, and see if anything important breaks. In Java/Scala,
>>>>>>>> you can add the staging repository to your project's resolvers and test
>>>>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>>>>> you don't end up building with an out-of-date RC going forward).
>>>>>>>>
>>>>>>>> ===========================================
>>>>>>>> What should happen to JIRA tickets still targeting 2.4.0?
>>>>>>>> ===========================================
>>>>>>>>
>>>>>>>> The current list of open tickets targeted at 2.4.0 can be found at:
>>>>>>>> https://issues.apache.org/jira/projects/SPARK and search for
>>>>>>>> "Target Version/s" = 2.4.0
>>>>>>>>
>>>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>>>>> be worked on immediately.
>>>>>>>> Everything else please retarget to an appropriate release.
>>>>>>>>
>>>>>>>> ==================
>>>>>>>> But my bug isn't fixed?
>>>>>>>> ==================
>>>>>>>>
>>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>>> release unless the bug in question is a regression from the previous
>>>>>>>> release. That being said, if there is something which is a regression
>>>>>>>> that has not been correctly targeted, please ping me or a committer
>>>>>>>> to help target the issue.