Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Xiao Li
@Dongjoon Hyun Thanks! This is a blocking ticket. It returns a wrong result because the behavior is undefined. I agree we should revert the newly added map-oriented functions. In the 3.0 release, we need to define the behavior of duplicate keys in the data type MAP and fix all the related issues that

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
Ah, now I see the problem. `map_filter` has a very weird semantic that is neither "earlier entry wins" nor "later entry wins". I've opened https://github.com/apache/spark/pull/22821 , to remove these newly added map-related functions from FunctionRegistry (for 2.4.0), so that they are invisible to
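
For readers following along, a minimal sketch of the kind of reproduction under discussion, run in spark-shell against 2.4.0 RC4; the outputs are deliberately not asserted here, since the thread reports they differ across versions and code paths:

    // map(1,2,1,3) contains the duplicated key 1
    scala> sql("SELECT map(1,2,1,3)").show()                             // which entry does plain map() keep?
    scala> sql("SELECT map_filter(map(1,2,1,3), (k, v) -> true)").show() // does map_filter keep the same one?

If the two queries disagree about which entry survives, the semantic is neither "earlier entry wins" nor "later entry wins", which is the inconsistency described above.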

Re: What's a blocker?

2018-10-24 Thread Mark Hamstra
Yeah, I can pretty much agree with that. Before we get into release candidates, it's not as big a deal if something gets labeled as a blocker. Once we are into an RC, I'd like to see any discussions as to whether something is or isn't a blocker at least cross-referenced in the RC VOTE thread so

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Dongjoon Hyun
For the first question, that is the `bin/spark-sql` result. I didn't check STS, but it should return the same result as `bin/spark-sql`. > I think map_filter is implemented correctly. map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins" semantic. I don't think this will change in 2.4.1. For

Re: What's a blocker?

2018-10-24 Thread Saisai Shao
Just my two cents from past experience. As the release manager of Spark 2.3.2, I felt significant delays during the release caused by blocker issues. The vote failed several times because of one or two "blocker issues". I think that during the RC period, each "blocker issue" should be carefully evaluated by the related PMCs

Re: What's a blocker?

2018-10-24 Thread Hyukjin Kwon
> Let's understand statements like "X is not a blocker" to mean "I don't think that X is a blocker". Interpretations not proclamations, backed up by reasons, not all of which are appeals to policy and precedent. It might not be a big deal, and it's off the topic, but I rather hope people explicitly avoid

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4 > {1:3} Are you running in the thrift-server? Then maybe this is caused by the bug in `Dataset.collect` as I mentioned above. I think map_filter is implemented correctly. map(1,2,1,3) is actually map(1,2) according to the "earlier entry wins"

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Dongjoon Hyun
Thank you for the follow-ups. Then, will Spark 2.4.1 in the end return `{1:2}`, differently from the following (including Spark/Scala)? I hoped to fix `map_filter`, but now Spark looks inconsistent in many ways. scala> sql("select map(1,2,1,3)").show // Spark 2.2.2 +---+ |map(1,

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
Hi Dongjoon, Thanks for reporting it! This is indeed a bug that needs to be fixed. The problem is not about the function `map_filter`, but about how map type values are created in Spark when there are duplicated keys. In programming languages like Java/Scala, when creating a map, the later
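
For reference, the plain-Scala behavior being alluded to, where the later entry overwrites the earlier one:

    scala> Map(1 -> 2, 1 -> 3)
    res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)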

Re: queryable state & streaming

2018-10-24 Thread Arun Mahadevan
I don't think a separate API, RPCs, etc. are necessary for queryable state if the state can be exposed as just another data source. Then SQL queries can be issued against it just like against any other data source. For now I think the "memory" sink could be used as a
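
A minimal sketch of the "memory" sink approach mentioned above; the query name and the input stream here are placeholders for illustration:

    // Assume `inputDf` is a streaming DataFrame with a "key" column.
    // The memory sink materializes results as an in-memory table that
    // ordinary SQL queries can read.
    val query = inputDf.groupBy("key").count()
      .writeStream
      .outputMode("complete")
      .format("memory")
      .queryName("state_view")
      .start()

    spark.sql("SELECT * FROM state_view WHERE key = 'a'").show()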

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Dongjoon Hyun
Hi, All. -0 due to the following issue. From Spark 2.4.0, users may get an incorrect result when they use the new `map_filter` with `map_concat` functions. https://issues.apache.org/jira/browse/SPARK-25823 SPARK-25823 only aims to fix the data correctness issue from `map_filter`. PMC members

Re: KryoSerializer Implementation - Not using KryoPool

2018-10-24 Thread Sean Owen
I don't know; possibly just because it wasn't available whenever Kryo was first used in the project. Skimming the code, the KryoSerializerInstance looks like a wrapper that provides a Kryo object to do work. It already maintains a 'pool' of just 1 instance. Is the point that KryoSerializer can

KryoSerializer Implementation - Not using KryoPool

2018-10-24 Thread Patrick Brown
Hi, I am wondering about the implementation of KryoSerializer, specifically the lack of use of KryoPool, which is recommended by the Kryo authors themselves. Looking at the code, it seems that KryoSerializer.newInstance is called frequently, followed by a serialize, and then the instance goes out of scope,
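
For comparison, a minimal sketch of the pooling pattern Kryo recommends, written against the Kryo 4.x pool API (com.esotericsoftware.kryo.pool) that Spark depended on at the time; the factory body is a placeholder:

    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.pool.{KryoFactory, KryoPool}

    val factory = new KryoFactory {
      override def create(): Kryo = {
        val kryo = new Kryo()
        // register classes / configure here, as KryoSerializer.newKryo() does
        kryo
      }
    }

    // Soft references let idle Kryo instances be reclaimed under memory pressure.
    val pool = new KryoPool.Builder(factory).softReferences().build()

    val kryo = pool.borrow()
    try {
      // serialize/deserialize with the borrowed instance
    } finally {
      pool.release(kryo) // return it for reuse instead of letting it go out of scope
    }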

What's a blocker?

2018-10-24 Thread Sean Owen
Shifting this to dev@. See the PR https://github.com/apache/spark/pull/22144 for more context. There will be no objective, complete definition of a blocker, or even of a regression or correctness issue. Many cases are clear, some are not. We can draw up more guidelines; feel free to open PRs against

Re: Hadoop-Token-Across-Kerberized-Cluster

2018-10-24 Thread Davinder Kumar
Any update on this? Is anybody facing, or has anybody faced, a similar issue? Any suggestions would be appreciated. Thanks -Davinder From: Davinder Kumar Sent: Wednesday, October 17, 2018 11:01 AM To: dev Subject: Hadoop-Token-Across-Kerberized-Cluster Hello All, Need one

CVE-2018-11804: Apache Spark build/mvn runs zinc, and can expose information from build machines

2018-10-24 Thread Sean Owen
Severity: Low Vendor: The Apache Software Foundation Versions Affected: 1.3.x release branch and later, including master Description: Spark's Apache Maven-based build includes a convenience script, 'build/mvn', that downloads and runs a zinc server to speed up compilation. This server will