Re: [SQL] hash: 64-bits and seeding

2019-03-06 Thread Reynold Xin
Rather than calling it hash64, it'd be better to just call it xxhash64. The
reason being that ten years from now, we would probably look back and laugh at
a specific hash implementation. It'd be better to just name the expression what
it is.

On Wed, Mar 06, 2019 at 7:59 PM, < huon.wil...@data61.csiro.au > wrote:

> Hi,
>
> I’m working on something that requires deterministic randomness, i.e. a
> row gets the same “random” value no matter the order of the DataFrame. A
> seeded hash seems to be the perfect way to do this, but the existing
> hashes have various limitations:
>
> - hash: 32-bit output (only 4 billion possibilities will result in a lot
> of collisions for many tables: the birthday paradox implies >50% chance of
> at least one for tables larger than 77000 rows, and likely ~1.6 billion
> collisions in a table of size 4 billion)
> - sha1/sha2/md5: single binary column input, string output
>
> It seems there’s already support for a 64-bit hash function that can work
> with an arbitrary number of arbitrary-typed columns (XxHash64), and
> exposing this for DataFrames seems like it’s essentially one line in
> sql/functions.scala to match `hash` (plus docs, tests, function registry
> etc.):
>
> def hash64(cols: Column*): Column = withExpr { new
> XxHash64(cols.map(_.expr)) }
>
> For my use case, this can then be used to get a 64-bit “random” column
> like
>
> val seed = rng.nextLong()
> hash64(lit(seed), col1, col2)
>
> I’ve created a (hopefully) complete patch by mimicking ‘hash’ at
> https://github.com/apache/spark/compare/master...huonw:hash64; should I
> open a JIRA and submit it as a pull request?
>
> Additionally, both hash and the new hash64 already have support for being
> seeded, but this isn’t exposed directly and instead requires something
> like the `lit` above. Would it make sense to add overloads like the
> following?
>
> def hash(seed: Int, cols: Column*) = …
> def hash64(seed: Long, cols: Column*) = …
>
> Though, it does seem a bit unfortunate to be forced to pass the seed
> first.
>
> (I sent this email to u...@spark.apache.org a few days ago, but didn't
> get any discussion about the Spark aspects of this, so I'm resending it
> here; I apologise in advance if I'm breaking protocol!)
>
> - Huon Wilson
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

[SQL] hash: 64-bits and seeding

2019-03-06 Thread Huon.Wilson
Hi,

I’m working on something that requires deterministic randomness, i.e. a row 
gets the same “random” value no matter the order of the DataFrame. A seeded 
hash seems to be the perfect way to do this, but the existing hashes have 
various limitations:

- hash: 32-bit output (only 4 billion possibilities will result in a lot of 
collisions for many tables: the birthday paradox implies  >50% chance of at 
least one for tables larger than 77000 rows, and likely ~1.6 billion collisions 
in a table of size 4 billion)
- sha1/sha2/md5: single binary column input, string output
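
For reference, a rough sketch of where those collision figures come from,
using the standard birthday-problem approximations; the 2^32 value space and
the row counts are the ones quoted above, and the object and method names are
only illustrative:

// Back-of-the-envelope collision estimates for a 32-bit hash.
object HashCollisionEstimate {
  val space: Double = math.pow(2, 32)

  // P(at least one collision) ~= 1 - exp(-n * (n - 1) / (2 * space))
  def pAtLeastOneCollision(n: Double): Double =
    1.0 - math.exp(-n * (n - 1) / (2 * space))

  // Expected number of rows whose hash value has already been used:
  // n minus the expected count of distinct hashes, space * (1 - exp(-n / space)).
  def expectedCollidingRows(n: Double): Double =
    n - space * (1.0 - math.exp(-n / space))

  def main(args: Array[String]): Unit = {
    println(f"P(>=1 collision) at 77000 rows: ${pAtLeastOneCollision(77000.0)}%.2f") // ~0.50
    println(f"colliding rows at 4e9 rows: ${expectedCollidingRows(4e9)}%.2e")        // ~1.4e9
  }
}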

It seems there’s already support for a 64-bit hash function that can work with 
an arbitrary number of arbitrary-typed columns (XxHash64), and exposing this 
for DataFrames seems like it’s essentially one line in sql/functions.scala to 
match `hash` (plus docs, tests, function registry etc.):

def hash64(cols: Column*): Column = withExpr { new 
XxHash64(cols.map(_.expr)) }

For my use case, this can then be used to get a 64-bit “random” column like

val seed = rng.nextLong()
hash64(lit(seed), col1, col2)

I’ve created a (hopefully) complete patch by mimicking ‘hash’ at 
https://github.com/apache/spark/compare/master...huonw:hash64; should I open a 
JIRA and submit it as a pull request?

Additionally, both hash and the new hash64 already have support for being 
seeded, but this isn’t exposed directly and instead requires something like the 
`lit` above. Would it make sense to add overloads like the following?

def hash(seed: Int, cols: Column*) = …
def hash64(seed: Long, cols: Column*) = …

Though, it does seem a bit unfortunate to be forced to pass the seed first.
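
As a rough sketch (not the actual patch), such overloads could sit next to
`hash` in sql/functions.scala and delegate to the seeded expression
constructors, assuming Murmur3Hash and XxHash64 accept an explicit seed and
reusing that file's private withExpr helper:

def hash(seed: Int, cols: Column*): Column = withExpr {
  new Murmur3Hash(cols.map(_.expr), seed)
}

def hash64(seed: Long, cols: Column*): Column = withExpr {
  new XxHash64(cols.map(_.expr), seed)
}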

(I sent this email to u...@spark.apache.org a few days ago, but didn't get any 
discussion about the Spark aspects of this, so I'm resending it here; I 
apologise in advance if I'm breaking protocol!)

- Huon Wilson


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Saisai Shao
Do we have any other blocker/critical issues for Spark 2.4.1, or are we
waiting for something to be fixed? I roughly searched JIRA, and it seems
there are no blocker/critical issues marked for 2.4.1.

Thanks
Saisai

shane knapp wrote on Thursday, March 7, 2019 at 4:57 AM:

> i'll be popping in to the sig-big-data meeting on the 20th to talk about
> stuff like this.
>
> On Wed, Mar 6, 2019 at 12:40 PM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> Yes, it's a tough decision, and as we discussed today (
>> https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA
>> )
>> "Kubernetes support window is 9 months, Spark is two years". So we may
>> end up with old client versions on branches still supported like 2.4.x in
>> the future.
>> That gives us no choice but to upgrade, if we want to be on the safe
>> side. We have tested 3.0.0 with 1.11 internally and it works but I dont
>> know what it means to run with old
>> clients.
>>
>>
>> On Wed, Mar 6, 2019 at 7:54 PM Sean Owen  wrote:
>>
>>> If the old client is basically unusable with the versions of K8S
>>> people mostly use now, and the new client still works with older
>>> versions, I could see including this in 2.4.1.
>>>
>>> Looking at
>>> https://github.com/fabric8io/kubernetes-client#compatibility-matrix
>>> it seems like the 4.1.1 client is needed for 1.10 and above. However
>>> it no longer supports 1.7 and below.
>>> We have 3.0.x, and versions through 4.0.x of the client support the
>>> same K8S versions, so no real middle ground here.
>>>
>>> 1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
>>> branches are maintained for 9 months per
>>> https://kubernetes.io/docs/setup/version-skew-policy/
>>>
>>> Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
>>> used the newer client from the start as at that point (?) 1.7 and
>>> earlier were already at least 7 months past EOL.
>>> If we update the client in 2.4.1, versions of K8S as recently
>>> 'supported' as a year ago won't work anymore. I'm guessing there are
>>> still 1.7 users out there? That wasn't that long ago but if the
>>> project and users generally move fast, maybe not.
>>>
>>> Normally I'd say, that's what the next minor release of Spark is for;
>>> update if you want later infra. But there is no Spark 2.5.
>>> I presume downstream distros could modify the dependency easily (?) if
>>> needed and maybe already do. It wouldn't necessarily help end users.
>>>
>>> Does the 3.0.x client not work at all with 1.10+ or just unsupported.
>>> If it 'basically works but no guarantees' I'd favor not updating. If
>>> it doesn't work at all, hm. That's tough. I think I'd favor updating
>>> the client but think it's a tough call both ways.
>>>
>>>
>>>
>>> On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos
>>>  wrote:
>>> >
>>> > Yes Shane Knapp has done the work for that already,  and also tests
>>> pass, I am working on a PR now, I could submit it for the 2.4 branch .
>>> > I understand that this is a major dependency update, but the problem I
>>> see is that the client version is so old that I dont think it makes
>>> > much sense for current users who are on k8s 1.10, 1.11 etc(
>>> https://github.com/fabric8io/kubernetes-client#compatibility-matrix,
>>> 3.0.0 does not even exist in there).
>>> > I dont know what it means to use that old version with current k8s
>>> clusters in terms of bugs etc.
>>>
>>
>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Hive Hash in Spark

2019-03-06 Thread Ryan Blue
I think this was needed to add support for bucketed Hive tables. Like Tyson
noted, if the other side of a join can be bucketed the same way, then Spark
can use a bucketed join. I have long-term plans to support this in the
DataSourceV2 API, but I don't think we are very close to implementing it
yet.
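
As a rough illustration of that shuffle-avoidance idea, here is a sketch
using Spark's own bucketing, which hashes with Murmur3 rather than Hive Hash
(reading Hive-bucketed tables this way is exactly the part that isn't
supported yet); the table names and bucket counts are arbitrary:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("bucketed-join-demo").getOrCreate()
import spark.implicits._

// Disable broadcast joins so the tiny demo tables still go through a sort-merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Write both sides bucketed (and sorted) the same way on the join key.
Seq((1, "a"), (2, "b")).toDF("id", "x")
  .write.bucketBy(8, "id").sortBy("id").saveAsTable("demo_a")
Seq((1, "y"), (2, "z")).toDF("id", "y")
  .write.bucketBy(8, "id").sortBy("id").saveAsTable("demo_b")

// With matching bucketing on the join key, the physical plan should show
// no Exchange (shuffle) on either side of the join.
spark.table("demo_a").join(spark.table("demo_b"), "id").explain()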

rb

On Wed, Mar 6, 2019 at 1:57 PM Reynold Xin  wrote:

> I think they might be used in bucketing? Not 100% sure.
>
>
> On Wed, Mar 06, 2019 at 1:40 PM,  wrote:
>
>> Hi,
>>
>>
>>
>> I noticed the existence of a Hive Hash partitioning implementation in
>> Spark, but also noticed that it’s not being used, and that the Spark hash
>> partitioning function is presently hardcoded to Murmur3. My question is
>> whether Hive Hash is dead code or are there future plans to support reading
>> and understanding data that has been partitioned using Hive Hash? By
>> understanding, I mean that I’m able to avoid a full shuffle join on Table A
>> (partitioned by Hive Hash) when joining with a Table B that I can shuffle
>> via Hive Hash to Table A.
>>
>>
>>
>> Thank you,
>>
>> Tyson
>>
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Hive Hash in Spark

2019-03-06 Thread Reynold Xin
I think they might be used in bucketing? Not 100% sure.

On Wed, Mar 06, 2019 at 1:40 PM, < tcon...@gmail.com > wrote:

> Hi,
>
> I noticed the existence of a Hive Hash partitioning implementation in
> Spark, but also noticed that it’s not being used, and that the Spark hash
> partitioning function is presently hardcoded to Murmur3. My question is
> whether Hive Hash is dead code or are there future plans to support
> reading and understanding data that has been partitioned using Hive Hash?
> By understanding, I mean that I’m able to avoid a full shuffle join on
> Table A (partitioned by Hive Hash) when joining with a Table B that I can
> shuffle via Hive Hash to Table A.
>
> Thank you,
>
> Tyson

Hive Hash in Spark

2019-03-06 Thread tcondie
Hi,

 

I noticed the existence of a Hive Hash partitioning implementation in Spark,
but also noticed that it's not being used, and that the Spark hash
partitioning function is presently hardcoded to Murmur3. My question is
whether Hive Hash is dead code or are there future plans to support reading
and understanding data that has been partitioned using Hive Hash? By
understanding, I mean that I'm able to avoid a full shuffle join on Table A
(partitioned by Hive Hash) when joining with a Table B that I can shuffle
via Hive Hash to Table A. 

 

Thank you,

Tyson



Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread shane knapp
i'll be popping in to the sig-big-data meeting on the 20th to talk about
stuff like this.

On Wed, Mar 6, 2019 at 12:40 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Yes, it's a tough decision, and as we discussed today (
> https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA
> )
> "Kubernetes support window is 9 months, Spark is two years". So we may
> end up with old client versions on branches still supported like 2.4.x in
> the future.
> That gives us no choice but to upgrade, if we want to be on the safe side.
> We have tested 3.0.0 with 1.11 internally and it works but I dont know what
> it means to run with old
> clients.
>
>
> On Wed, Mar 6, 2019 at 7:54 PM Sean Owen  wrote:
>
>> If the old client is basically unusable with the versions of K8S
>> people mostly use now, and the new client still works with older
>> versions, I could see including this in 2.4.1.
>>
>> Looking at
>> https://github.com/fabric8io/kubernetes-client#compatibility-matrix
>> it seems like the 4.1.1 client is needed for 1.10 and above. However
>> it no longer supports 1.7 and below.
>> We have 3.0.x, and versions through 4.0.x of the client support the
>> same K8S versions, so no real middle ground here.
>>
>> 1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
>> branches are maintained for 9 months per
>> https://kubernetes.io/docs/setup/version-skew-policy/
>>
>> Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
>> used the newer client from the start as at that point (?) 1.7 and
>> earlier were already at least 7 months past EOL.
>> If we update the client in 2.4.1, versions of K8S as recently
>> 'supported' as a year ago won't work anymore. I'm guessing there are
>> still 1.7 users out there? That wasn't that long ago but if the
>> project and users generally move fast, maybe not.
>>
>> Normally I'd say, that's what the next minor release of Spark is for;
>> update if you want later infra. But there is no Spark 2.5.
>> I presume downstream distros could modify the dependency easily (?) if
>> needed and maybe already do. It wouldn't necessarily help end users.
>>
>> Does the 3.0.x client not work at all with 1.10+ or just unsupported.
>> If it 'basically works but no guarantees' I'd favor not updating. If
>> it doesn't work at all, hm. That's tough. I think I'd favor updating
>> the client but think it's a tough call both ways.
>>
>>
>>
>> On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos
>>  wrote:
>> >
>> > Yes Shane Knapp has done the work for that already,  and also tests
>> pass, I am working on a PR now, I could submit it for the 2.4 branch .
>> > I understand that this is a major dependency update, but the problem I
>> see is that the client version is so old that I dont think it makes
>> > much sense for current users who are on k8s 1.10, 1.11 etc(
>> https://github.com/fabric8io/kubernetes-client#compatibility-matrix,
>> 3.0.0 does not even exist in there).
>> > I dont know what it means to use that old version with current k8s
>> clusters in terms of bugs etc.
>>
>
>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Stavros Kontopoulos
Yes, it's a tough decision, and as we discussed today (
https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA
)
"Kubernetes support window is 9 months, Spark is two years". So we may end
up with old client versions on branches still supported like 2.4.x in the
future.
That gives us no choice but to upgrade, if we want to be on the safe side.
We have tested 3.0.0 with 1.11 internally and it works but I dont know what
it means to run with old
clients.


On Wed, Mar 6, 2019 at 7:54 PM Sean Owen  wrote:

> If the old client is basically unusable with the versions of K8S
> people mostly use now, and the new client still works with older
> versions, I could see including this in 2.4.1.
>
> Looking at
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix
> it seems like the 4.1.1 client is needed for 1.10 and above. However
> it no longer supports 1.7 and below.
> We have 3.0.x, and versions through 4.0.x of the client support the
> same K8S versions, so no real middle ground here.
>
> 1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
> branches are maintained for 9 months per
> https://kubernetes.io/docs/setup/version-skew-policy/
>
> Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
> used the newer client from the start as at that point (?) 1.7 and
> earlier were already at least 7 months past EOL.
> If we update the client in 2.4.1, versions of K8S as recently
> 'supported' as a year ago won't work anymore. I'm guessing there are
> still 1.7 users out there? That wasn't that long ago but if the
> project and users generally move fast, maybe not.
>
> Normally I'd say, that's what the next minor release of Spark is for;
> update if you want later infra. But there is no Spark 2.5.
> I presume downstream distros could modify the dependency easily (?) if
> needed and maybe already do. It wouldn't necessarily help end users.
>
> Does the 3.0.x client not work at all with 1.10+ or just unsupported.
> If it 'basically works but no guarantees' I'd favor not updating. If
> it doesn't work at all, hm. That's tough. I think I'd favor updating
> the client but think it's a tough call both ways.
>
>
>
> On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos
>  wrote:
> >
> > Yes Shane Knapp has done the work for that already,  and also tests
> pass, I am working on a PR now, I could submit it for the 2.4 branch .
> > I understand that this is a major dependency update, but the problem I
> see is that the client version is so old that I dont think it makes
> > much sense for current users who are on k8s 1.10, 1.11 etc(
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix,
> 3.0.0 does not even exist in there).
> > I dont know what it means to use that old version with current k8s
> clusters in terms of bugs etc.
>


[build system] meet your build engineer @ the sparkAI summit!

2019-03-06 Thread shane knapp
i'll be there (again) working the riselab booth april 23-25 in SF...  come
by and say hi!

we'll also have demos and information about some of our ongoing research
projects...  once we get the details hammered out i'll post more
information here.

looking forward to seeing everyone again.  :)

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

2019-03-06 Thread Reynold Xin
I think the general philosophy here should be that Python should be the most
liberal and support a column object or a literal value. It's also super useful
to support column names, but we need to decide what happens for a string
argument: is a string that is passed in a literal string value, or a column name?
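
To make the ambiguity concrete, a small self-contained Scala sketch (the same
question applies to the Python API; the data and names are purely illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("string-vs-column").getOrCreate()
import spark.implicits._

val df = Seq(("alice", "x")).toDF("name", "tag")

// In Scala the caller disambiguates explicitly:
df.select(upper(col("name"))).show() // column reference -> ALICE
df.select(upper(lit("name"))).show() // literal string   -> NAME

// The open question for a string-accepting API is which of these two
// upper("name") should mean if plain strings were allowed everywhere.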

On Mon, Mar 04, 2019 at 6:00 AM, André Mello < ame...@palantir.com > wrote:

> Hey everyone,
>
> Progress has been made with PR #23882 (
> https://github.com/apache/spark/pull/23882 ), and it is now in a state
> where it could be merged with master.
>
> This is what we’re doing for now:
>
> * PySpark *will* support strings consistently throughout its API.
> * Arguably string support makes syntax closer to SQL and Scala, where you
> can use similar shorthands to specify columns, and the general direction
> of the PySpark API has been to be consistent with those other two;
> * This is a small, additive change that will not break anything;
> * The reason support was not there in the first place was because the code
> that generated functions was originally designed for aggregators, which
> all support column names, but it was being used for other functions (e.g.
> lower, abs) that did not, so it seems like it was not intentional.
>
> We are NOT going to:
>
> * Make any code changes in Scala;
> * This requires first deciding if string support is desirable or not;
> * Decide whether or not strings should be supported in the Scala API;
> * This requires a larger discussion and the above changes are independent
> of this;
> * Make PySpark support Column objects where it currently only supports
> strings (e.g. multi-argument version of drop());
> * Converting from Column to column name is not something the API does
> right now, so this is a stronger change;
> * This can be considered separately.
> * Do anything with R for now.
> * Anyone is free to take on this, but I have no experience with R.
>
> If you folks agree with this, let us know, so we can move forward with the
> merge.
>
> Best.
>
> -- André.
>
> *From:* Reynold Xin < r...@databricks.com >
> *Date:* Monday, 25 February 2019 at 00:49
> *To:* Felix Cheung < felixcheun...@hotmail.com >
> *Cc:* dev < dev@spark.apache.org >, Sean Owen < sro...@gmail.com >, André
> Mello < asmello...@gmail.com >
> *Subject:* Re: [DISCUSS][SQL][PySpark] Column name support for SQL
> functions
>
> The challenge with the Scala/Java API in the past is that when there are
> multiple parameters, it'd lead to an explosion of function overloads.
>
> On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung < felixcheun...@hotmail.com >
> wrote:
>
>> I hear three topics in this thread
>>
>> 1. I don’t think we should remove string. Column and string can both be
>> “type safe”. And I would agree we don’t *need* to break API compatibility
>> here.
>>
>> 2. Gaps in python API. Extending on #1, definitely we should be consistent
>> and add string as param where it is missed.
>>
>> 3. Scala API for string - hard to say but make sense if nothing but for
>> consistency. Though I can also see the argument of Column only in Scala.
>> String might be more natural in python and much less significant in Scala
>> because of $”foo” notation.
>>
>> (My 2 c)
>>
>> *From:* Sean Owen < sro...@gmail.com >
>> *Sent:* Sunday, February 24, 2019 6:59 AM
>> *To:* André Mello
>> *Cc:* dev
>> *Subject:* Re: [DISCUSS][SQL][PySpark] Column name support for SQL
>> functions
>>
>> I just commented on the PR -- I personally don't think it's worth
>> removing support for, say, max("foo") over max(col("foo")) or
>> max($"foo") in Scala. We can make breaking changes in Spark 3 but this
>> seems like it would unnecessarily break a lot of code. The string arg
>> is more concise in Python and I can't think of cases where it's
>> particularly ambiguous or confusing; on the contrary it's more natural
>> coming from SQL.
>>
>> What we do have are inconsistencies and errors in support of string vs
>> Column as fixed in the PR. I was surprised to see that
>> df.select(abs("col")) throws an error while df.select(

Re: Two spark applications listen on same port on same machine

2019-03-06 Thread Sean Owen
Two drivers can't be listening on port 4040 at the same time -- on the same
machine. The OS wouldn't allow it. Are they actually on different machines
or somehow different interfaces? or are you saying the reported port is
wrong?
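
If the goal is just to give each driver a predictable UI port, it can be
pinned per application; a minimal sketch, where spark.ui.port is the relevant
setting and the port numbers themselves are arbitrary (the same can be done
with --conf spark.ui.port=... on spark-submit):

import org.apache.spark.sql.SparkSession

// Pin the UI port explicitly for each submitted application (e.g. 4050 for
// one job, 4051 for the other) instead of relying on 4040 plus the default
// retry/auto-increment behaviour.
val spark = SparkSession.builder()
  .appName("job-a")
  .config("spark.ui.port", "4050")
  .getOrCreate()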

On Wed, Mar 6, 2019 at 12:23 PM Moein Hosseini  wrote:

> I've submitted two Spark applications to a cluster of 3 standalone nodes at
> nearly the same time (I have a bash script that submits them one after the
> other without delay). But something goes wrong: in the master UI, the
> Running Applications section shows both of my jobs with the correct
> configuration (cores, memory, and different application IDs), but both
> redirect to port 4040, which is the port the second submitted job listens on.
> I think it could be a race condition in the UI, but I found nothing in the
> logs. Could you help me figure out where I should look for the cause?
>
> Best Regards
> Moein
>
> --
>
> Moein Hosseini
> Data Engineer
> mobile: +98 912 468 1859
> site: www.moein.xyz
> email: moein...@gmail.com
>
>


Two spark applications listen on same port on same machine

2019-03-06 Thread Moein Hosseini
I've submitted two Spark applications to a cluster of 3 standalone nodes at
nearly the same time (I have a bash script that submits them one after the
other without delay). But something goes wrong: in the master UI, the Running
Applications section shows both of my jobs with the correct configuration
(cores, memory, and different application IDs), but both redirect to port
4040, which is the port the second submitted job listens on.
I think it could be a race condition in the UI, but I found nothing in the
logs. Could you help me figure out where I should look for the cause?

Best Regards
Moein

-- 

Moein Hosseini
Data Engineer
mobile: +98 912 468 1859
site: www.moein.xyz
email: moein...@gmail.com


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Sean Owen
If the old client is basically unusable with the versions of K8S
people mostly use now, and the new client still works with older
versions, I could see including this in 2.4.1.

Looking at https://github.com/fabric8io/kubernetes-client#compatibility-matrix
it seems like the 4.1.1 client is needed for 1.10 and above. However
it no longer supports 1.7 and below.
We have 3.0.x, and versions through 4.0.x of the client support the
same K8S versions, so no real middle ground here.

1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
branches are maintained for 9 months per
https://kubernetes.io/docs/setup/version-skew-policy/

Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
used the newer client from the start as at that point (?) 1.7 and
earlier were already at least 7 months past EOL.
If we update the client in 2.4.1, versions of K8S as recently
'supported' as a year ago won't work anymore. I'm guessing there are
still 1.7 users out there? That wasn't that long ago but if the
project and users generally move fast, maybe not.

Normally I'd say, that's what the next minor release of Spark is for;
update if you want later infra. But there is no Spark 2.5.
I presume downstream distros could modify the dependency easily (?) if
needed and maybe already do. It wouldn't necessarily help end users.

Does the 3.0.x client not work at all with 1.10+, or is it just unsupported?
If it 'basically works but no guarantees' I'd favor not updating. If
it doesn't work at all, hm. That's tough. I think I'd favor updating
the client but think it's a tough call both ways.



On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos
 wrote:
>
> Yes Shane Knapp has done the work for that already,  and also tests pass, I 
> am working on a PR now, I could submit it for the 2.4 branch .
> I understand that this is a major dependency update, but the problem I see is 
> that the client version is so old that I dont think it makes
> much sense for current users who are on k8s 1.10, 1.11 
> etc(https://github.com/fabric8io/kubernetes-client#compatibility-matrix, 
> 3.0.0 does not even exist in there).
> I dont know what it means to use that old version with current k8s clusters 
> in terms of bugs etc.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Stavros Kontopoulos
Yes Shane Knapp has done the work for that already,  and also tests pass, I
am working on a PR now, I could submit it for the 2.4 branch .
I understand that this is a major dependency update, but the problem I see
is that the client version is so old that I dont think it makes
much sense for current users who are on k8s 1.10, 1.11 etc(
https://github.com/fabric8io/kubernetes-client#compatibility-matrix, 3.0.0
does not even exist in there).
I dont know what it means to use that old version with current k8s clusters
in terms of bugs etc.

On Wed, Mar 6, 2019 at 6:32 PM shane knapp  wrote:

> On Wed, Mar 6, 2019 at 7:17 AM Sean Owen  wrote:
>
>> The problem is that that's a major dependency upgrade in a maintenance
>> release. It didn't seem to work when we applied it to master. I don't
>> think it would block a release.
>>
>> i tested the k8s client 4.1.2 against master a couple of weeks back and
> it worked fine.  i will doubly confirm when i get in to the office today.
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread shane knapp
On Wed, Mar 6, 2019 at 7:17 AM Sean Owen  wrote:

> The problem is that that's a major dependency upgrade in a maintenance
> release. It didn't seem to work when we applied it to master. I don't
> think it would block a release.
>
> i tested the k8s client 4.1.2 against master a couple of weeks back and it
worked fine.  i will doubly confirm when i get in to the office today.

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Sean Owen
The problem is that that's a major dependency upgrade in a maintenance
release. It didn't seem to work when we applied it to master. I don't
think it would block a release.

On Wed, Mar 6, 2019 at 6:32 AM Stavros Kontopoulos
 wrote:
>
> We need to resolve this https://issues.apache.org/jira/browse/SPARK-26742 as 
> well for 2.4.1, to make k8s support meaningful as many people are now on 1.11+
>
> Stavros
>
> On Tue, Mar 5, 2019 at 3:12 PM Saisai Shao  wrote:
>>
>> Hi DB,
>>
>> I saw that we already have 6 RCs, but the vote I can search by now was RC2, 
>> were they all canceled?
>>
>> Thanks
>> Saisai

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



4 Apache Events in 2019: DC Roadshow soon; next up Chicago, Las Vegas, and Berlin!

2019-03-06 Thread Rich Bowen
Dear Apache Enthusiast,

(You’re receiving this because you are subscribed to one or more user
mailing lists for an Apache Software Foundation project.)

TL;DR:
 * Apache Roadshow DC is in 3 weeks. Register now at
https://apachecon.com/usroadshowdc19/
 * Registration for Apache Roadshow Chicago is open.
http://apachecon.com/chiroadshow19
 * The CFP for ApacheCon North America is now open.
https://apachecon.com/acna19
 * Save the date: ApacheCon Europe will be held in Berlin, October 22nd
through 24th.  https://apachecon.com/aceu19


Registration is open for two Apache Roadshows; these are smaller events
with a more focused program and regional community engagement:

Our Roadshow event in Washington DC takes place in under three weeks, on
March 25th. We’ll be hosting a day-long event at the Fairfax campus of
George Mason University. The roadshow is a full day of technical talks
(two tracks) and an open source job fair featuring AWS, Bloomberg, dito,
GridGain, Linode, and Security University. More details about the
program, the job fair, and to register, visit
https://apachecon.com/usroadshowdc19/

Apache Roadshow Chicago will be held May 13-14th at a number of venues
in Chicago’s Logan Square neighborhood. This event will feature sessions
in AdTech, FinTech and Insurance, startups, “Made in Chicago”, Project
Shark Tank (innovations from the Apache Incubator), community diversity,
and more. It’s a great way to learn about various Apache projects “at
work” while playing at a brewery, a beercade, and a neighborhood bar.
Sign up today at https://www.apachecon.com/chiroadshow19/

We’re delighted to announce that the Call for Presentations (CFP) is now
open for ApacheCon North America in Las Vegas, September 9-13th! As the
official conference series of the ASF, ApacheCon North America will
feature over a dozen Apache project summits, including Cassandra,
Cloudstack, Tomcat, Traffic Control, and more. We’re looking for talks
in a wide variety of categories -- anything related to ASF projects and
the Apache development process. The CFP closes at midnight on May 26th.
In addition, the ASF will be celebrating its 20th Anniversary during the
event. For more details and to submit a proposal for the CFP, visit
https://apachecon.com/acna19/ . Registration will be opening soon.

Be sure to mark your calendars for ApacheCon Europe, which will be held
in Berlin, October 22-24th at the KulturBrauerei, a landmark of Berlin's
industrial history. In addition to innovative content from our projects,
we are collaborating with the Open Source Design community
(https://opensourcedesign.net/) to offer a track on design this year.
The CFP and registration will open soon at https://apachecon.com/aceu19/ .

Sponsorship opportunities are available for all events, with details
listed on each event’s site at http://apachecon.com/.

We look forward to seeing you!

Rich, for the ApacheCon Planners
@apachecon


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Stavros Kontopoulos
We need to resolve this https://issues.apache.org/jira/browse/SPARK-26742
as well for 2.4.1, to make k8s support meaningful as many people are now on
1.11+

Stavros

On Tue, Mar 5, 2019 at 3:12 PM Saisai Shao  wrote:

> Hi DB,
>
> I saw that we already have 6 RCs, but the vote I can search by now was
> RC2, were they all canceled?
>
> Thanks
> Saisai
>
> DB Tsai wrote on Friday, February 22, 2019 at 4:51 AM:
>
>> I am cutting a new rc4 with fix from Felix. Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0359BC9965359766
>>
>> On Thu, Feb 21, 2019 at 8:57 AM Felix Cheung 
>> wrote:
>> >
>> > I merged the fix to 2.4.
>> >
>> >
>> > 
>> > From: Felix Cheung 
>> > Sent: Wednesday, February 20, 2019 9:34 PM
>> > To: DB Tsai; Spark dev list
>> > Cc: Cesar Delgado
>> > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
>> >
>> > Could you hold for a bit - I have one more fix to get in
>> >
>> >
>> > 
>> > From: d_t...@apple.com on behalf of DB Tsai 
>> > Sent: Wednesday, February 20, 2019 12:25 PM
>> > To: Spark dev list
>> > Cc: Cesar Delgado
>> > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
>> >
>> > Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.
>> >
>> > DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple,
>> Inc
>> >
>> > > On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin
>>  wrote:
>> > >
>> > > Just wanted to point out that
>> > > https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
>> > > and is marked as a correctness bug. (The fix is in the 2.4 branch,
>> > > just not in rc2.)
>> > >
>> > > On Wed, Feb 20, 2019 at 12:07 PM DB Tsai 
>> wrote:
>> > >>
>> > >> Please vote on releasing the following candidate as Apache Spark
>> version 2.4.1.
>> > >>
>> > >> The vote is open until Feb 24 PST and passes if a majority +1 PMC
>> votes are cast, with
>> > >> a minimum of 3 +1 votes.
>> > >>
>> > >> [ ] +1 Release this package as Apache Spark 2.4.1
>> > >> [ ] -1 Do not release this package because ...
>> > >>
>> > >> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> > >>
>> > >> The tag to be voted on is v2.4.1-rc2 (commit
>> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
>> > >> https://github.com/apache/spark/tree/v2.4.1-rc2
>> > >>
>> > >> The release files, including signatures, digests, etc. can be found
>> at:
>> > >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
>> > >>
>> > >> Signatures used for Spark RCs can be found in this file:
>> > >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> > >>
>> > >> The staging repository for this release can be found at:
>> > >>
>> https://repository.apache.org/content/repositories/orgapachespark-1299/
>> > >>
>> > >> The documentation corresponding to this release can be found at:
>> > >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
>> > >>
>> > >> The list of bug fixes going into 2.4.1 can be found at the following
>> URL:
>> > >> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>> > >>
>> > >> FAQ
>> > >>
>> > >> =
>> > >> How can I help test this release?
>> > >> =
>> > >>
>> > >> If you are a Spark user, you can help us test this release by taking
>> > >> an existing Spark workload and running on this release candidate,
>> then
>> > >> reporting any regressions.
>> > >>
>> > >> If you're working in PySpark you can set up a virtual env and install
>> > >> the current RC and see if anything important breaks, in the
>> Java/Scala
>> > >> you can add the staging repository to your projects resolvers and
>> test
>> > >> with the RC (make sure to clean up the artifact cache before/after so
>> > >> you don't end up building with a out of date RC going forward).
>> > >>
>> > >> ===
>> > >> What should happen to JIRA tickets still targeting 2.4.1?
>> > >> ===
>> > >>
>> > >> The current list of open tickets targeted at 2.4.1 can be found at:
>> > >> https://issues.apache.org/jira/projects/SPARK and search for
>> "Target Version/s" = 2.4.1
>> > >>
>> > >> Committers should look at those and triage. Extremely important bug
>> > >> fixes, documentation, and API tweaks that impact compatibility should
>> > >> be worked on immediately. Everything else please retarget to an
>> > >> appropriate release.
>> > >>
>> > >> ==
>> > >> But my bug isn't fixed?
>> > >> ==
>> > >>
>> > >> In order to make timely releases, we will typically not hold the
>> > >> release unless the bug in question is a regression from the previous
>> > >> release. That being said, if there is something which is a regression
>> > >> that has not been correctly targeted please ping me or a committer to
>> > >> help target the issue.
>> > >>
>> > >>
>> > >> DB Tsai | Siri Open Source 

[no subject]

2019-03-06 Thread Dongxu Wang