Re: [SQL] hash: 64-bits and seeding

2019-03-06 Thread Reynold Xin
Rather than calling it hash64, it'd be better to just call it xxhash64. The reason being ten years from now, we probably would look back and laugh at a specific hash implementation. It'd be better to just name the expression what it is. On Wed, Mar 06, 2019 at 7:59 PM, <

[SQL] hash: 64-bits and seeding

2019-03-06 Thread Huon.Wilson
Hi, I’m working on something that requires deterministic randomness, i.e. a row gets the same “random” value no matter the order of the DataFrame. A seeded hash seems to be the perfect way to do this, but the existing hashes have various limitations: - hash: 32-bit output (only 4 billion
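
A minimal sketch of the "deterministic randomness" idea described above, in plain Python rather than Spark. It uses a keyed BLAKE2b digest as a stand-in for a seeded 64-bit hash; it is not Spark's `hash` or the proposed `xxhash64` expression. The names `seeded_hash64` and `deterministic_uniform` are hypothetical and used only for illustration.

```python
import hashlib
import struct


def seeded_hash64(value: str, seed: int) -> int:
    """Return an unsigned 64-bit hash of `value` that depends on `seed`.

    Assumption: keyed BLAKE2b stands in for a seeded 64-bit hash here.
    """
    key = struct.pack("<Q", seed & 0xFFFFFFFFFFFFFFFF)  # seed as an 8-byte key
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "little")


def deterministic_uniform(value: str, seed: int) -> float:
    """Map the 64-bit hash to a float in [0, 1) using its top 53 bits."""
    return (seeded_hash64(value, seed) >> 11) / float(1 << 53)


rows = ["row-a", "row-b", "row-c"]

# The value for each row depends only on the row's content and the seed,
# so processing the rows in a different order gives the same result per row.
forward = {r: deterministic_uniform(r, seed=42) for r in rows}
backward = {r: deterministic_uniform(r, seed=42) for r in reversed(rows)}
```

With a 64-bit output, collisions between rows become far less likely than with a 32-bit hash, which is the limitation the message raises.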

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Saisai Shao
Do we have other blocker/critical issues for Spark 2.4.1, or are we waiting for something to be fixed? I roughly searched JIRA, and it seems there are no blocker/critical issues marked for 2.4.1. Thanks, Saisai. shane knapp wrote on Thu, Mar 7, 2019 at 4:57 AM: > i'll be popping in to the sig-big-data meeting on the 20th to talk

Re: Hive Hash in Spark

2019-03-06 Thread Ryan Blue
I think this was needed to add support for bucketed Hive tables. Like Tyson noted, if the other side of a join can be bucketed the same way, then Spark can use a bucketed join. I have long-term plans to support this in the DataSourceV2 API, but I don't think we are very close to implementing it

Re: Hive Hash in Spark

2019-03-06 Thread Reynold Xin
I think they might be used in bucketing? Not 100% sure. On Wed, Mar 06, 2019 at 1:40 PM, < tcon...@gmail.com > wrote: > Hi, > > I noticed the existence of a Hive Hash partitioning implementation in > Spark, but also noticed that it’s not being used, and that the

Hive Hash in Spark

2019-03-06 Thread tcondie
Hi, I noticed the existence of a Hive Hash partitioning implementation in Spark, but also noticed that it's not being used, and that the Spark hash partitioning function is presently hardcoded to Murmur3. My question is whether Hive Hash is dead code or are there future plans to support

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread shane knapp
i'll be popping in to the sig-big-data meeting on the 20th to talk about stuff like this. On Wed, Mar 6, 2019 at 12:40 PM Stavros Kontopoulos < stavros.kontopou...@lightbend.com> wrote: > Yes, it's a tough decision, and as we discussed today ( >

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Stavros Kontopoulos
Yes, it's a tough decision, and as we discussed today ( https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA ) "Kubernetes support window is 9 months, Spark is two years". So we may end up with old client versions on branches still supported like 2.4.x in the future. That

[build system] meet your build engineer @ the sparkAI summit!

2019-03-06 Thread shane knapp
i'll be there (again) working the riselab booth april 23-25 in SF... come by and say hi! we'll also have demos and information about some of our ongoing research projects... once we get the details hammered out i'll post more information here. looking forward to seeing everyone again. :)

Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

2019-03-06 Thread Reynold Xin
I think the general philosophy here should be Python should be the most liberal and support a column object, or a literal value. It's also super useful to support column name, but we need to decide what happens for a string column. Is a string passed in a literal string value, or a column name?

Re: Two spark applications listen on same port on same machine

2019-03-06 Thread Sean Owen
Two drivers can't be listening on port 4040 at the same time -- on the same machine. The OS wouldn't allow it. Are they actually on different machines or somehow different interfaces? or are you saying the reported port is wrong? On Wed, Mar 6, 2019 at 12:23 PM Moein Hosseini wrote: > I've
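
The point about the OS refusing a second listener on the same port can be checked with a small sketch. This uses plain Python sockets on the loopback interface, not Spark, and assumes neither socket sets `SO_REUSEADDR` or `SO_REUSEPORT`; the function name `second_bind_fails` is hypothetical.

```python
import socket


def second_bind_fails() -> bool:
    """Return True if a second socket cannot bind to an already-listening port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen()
        port = first.getsockname()[1]
        try:
            # Same address and port as the listening socket.
            second.bind(("127.0.0.1", port))
        except OSError:
            return True  # typically "address already in use"
        return False
    finally:
        second.close()
        first.close()


result = second_bind_fails()
```

Two processes on different machines, or bound to different interfaces on one machine, do not conflict in this way, which is why the reply asks about those cases.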

Two spark applications listen on same port on same machine

2019-03-06 Thread Moein Hosseini
I've submitted two Spark applications to a cluster of 3 standalone nodes at nearly the same time (I have a bash script that submits them one after another without delay). But something goes wrong. In the master UI, the Running Applications section shows both of my jobs with the correct configuration (cores, memory and

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Sean Owen
If the old client is basically unusable with the versions of K8S people mostly use now, and the new client still works with older versions, I could see including this in 2.4.1. Looking at https://github.com/fabric8io/kubernetes-client#compatibility-matrix it seems like the 4.1.1 client is needed

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Stavros Kontopoulos
Yes, Shane Knapp has done the work for that already, and the tests also pass. I am working on a PR now and could submit it for the 2.4 branch. I understand that this is a major dependency update, but the problem I see is that the client version is so old that I don't think it makes much sense for

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread shane knapp
On Wed, Mar 6, 2019 at 7:17 AM Sean Owen wrote: > The problem is that that's a major dependency upgrade in a maintenance > release. It didn't seem to work when we applied it to master. I don't > think it would block a release. > > i tested the k8s client 4.1.2 against master a couple of weeks

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Sean Owen
The problem is that that's a major dependency upgrade in a maintenance release. It didn't seem to work when we applied it to master. I don't think it would block a release. On Wed, Mar 6, 2019 at 6:32 AM Stavros Kontopoulos wrote: > > We need to resolve this

4 Apache Events in 2019: DC Roadshow soon; next up Chicago, Las Vegas, and Berlin!

2019-03-06 Thread Rich Bowen
Dear Apache Enthusiast, (You’re receiving this because you are subscribed to one or more user mailing lists for an Apache Software Foundation project.) TL;DR: * Apache Roadshow DC is in 3 weeks. Register now at https://apachecon.com/usroadshowdc19/ * Registration for Apache Roadshow Chicago is

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Stavros Kontopoulos
We need to resolve this https://issues.apache.org/jira/browse/SPARK-26742 as well for 2.4.1, to make k8s support meaningful as many people are now on 1.11+ Stavros On Tue, Mar 5, 2019 at 3:12 PM Saisai Shao wrote: > Hi DB, > > I saw that we already have 6 RCs, but the vote I can search by now

[no subject]

2019-03-06 Thread Dongxu Wang