Re: About introduce function sum0 to Spark

2018-10-23 Thread Wenchen Fan
This is logically `sum( if(isnull(col), 0, col) )` right?
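
For anyone following along, here is a small PySpark sketch of that equivalence; the data and column names are made up, and the coalesce-around-sum form is shown only as another spelling with the same behaviour for all-null groups:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("sum0-sketch").getOrCreate()

# group "a" has only nulls, group "b" has real values
df = spark.createDataFrame(
    [("a", None), ("a", None), ("b", 1), ("b", 2)],
    "k string, v int",
)

df.groupBy("k").agg(
    F.sum("v").alias("plain_sum"),  # null for group "a"
    F.sum(F.when(F.col("v").isNull(), 0).otherwise(F.col("v"))).alias("sum0_like"),  # 0 for group "a"
    F.coalesce(F.sum("v"), F.lit(0)).alias("coalesced_sum"),  # also 0 for group "a"
).show()

spark.stop()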

On Tue, Oct 23, 2018 at 2:58 PM 陶 加涛  wrote:

> The name is from Apache Calcite, and it doesn’t matter; we can introduce
> our own.
>
>
>
>
>
> ---
>
> Regards!
>
> Aron Tao
>
>
>
> *From:* Mark Hamstra
> *Date:* Tuesday, October 23, 2018, 12:28
> *To:* "taojia...@gmail.com"
> *Cc:* dev
> *Subject:* Re: About introduce function sum0 to Spark
>
>
>
> That's a horrible name. This is just a fold.
>
>
>
> On Mon, Oct 22, 2018 at 7:39 PM 陶 加涛  wrote:
>
> Hi, Calcite has the concept of sum0; here I quote the definition of
> sum0:
>
>
>
> Sum0 is an aggregator which returns the sum of the values which
>
> go into it like Sum. It differs in that when no non null values
>
> are applied, zero is returned instead of null.
>
>
>
> One scenario is that we can use sum0 to implement pre-calculated
> count (in a pre-calculation system like Apache Kylin).
>
>
>
> It is very easy to implement sum0 in Spark. If the community considers this
> necessary, I would like to open a JIRA and implement it.
>
>
>
> ---
>
> Regards!
>
> Aron Tao
>
>
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Hyukjin Kwon
I am sorry for raising this late. Out of curiosity, does anyone know why we
don't treat SPARK-24935 (https://github.com/apache/spark/pull/22144) as a
blocker?

It looks like it broke API compatibility, and an actual use case of an external
library (https://github.com/DataSketches/sketches-hive).
Also, it looks like sufficient discussion was made for its diagnosis (
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ
).


On Tue, Oct 23, 2018 at 12:03 PM, Darcy Shen wrote:

>
>
> +1
>
>
>  On Tue, 23 Oct 2018 01:42:06 +0800 Wenchen Fan
> wrote 
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 26 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc4 (commit
> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
> https://github.com/apache/spark/tree/v2.4.0-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1290
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
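
(For PySpark testers, a minimal smoke test of the sort described above might look like the sketch below; the job itself is purely illustrative.)

from pyspark.sql import SparkSession, functions as F

# after pip-installing the RC into a fresh virtual env
spark = SparkSession.builder.master("local[*]").appName("rc-smoke-test").getOrCreate()
print(spark.version)  # expect 2.4.0

# a tiny job exercising the DataFrame API end to end
df = spark.range(1000).withColumn("bucket", F.col("id") % 7)
assert df.groupBy("bucket").count().count() == 7

spark.stop()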
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
No, because the docs are built into the release too, and released to
the site from the released artifact.
As a practical matter, I think these docs are not critical for
release, and can follow in a maintenance release. I'd retarget to
2.4.1 or untarget.
I do know at times a release's docs have been edited after the fact,
but that's bad form. We'd not go change a class in the release after
it was released and call it the same release.

I'd still like some confirmation that someone can build and pass tests
with -Pkubernetes, maybe? It actually all passed with the 2.11 build.
I don't think it's a 2.12 incompatibility, but rather that the K8S
tests maybe don't quite work with the 2.12 build artifact naming. Or
else something to do with my env.

On Mon, Oct 22, 2018 at 9:08 PM Wenchen Fan  wrote:
>
> Regarding the doc tickets, I vaguely remember that we can merge doc PRs after 
> release and publish doc to spark website later. Can anyone confirm?
>
> On Tue, Oct 23, 2018 at 8:30 AM Sean Owen  wrote:
>>
>> This is what I got from a straightforward build of the source distro
>> here ... really, ideally, it builds as-is from source. You're saying
>> someone would have to first build a k8s distro from source too?
>> It's not a 'must' that this be automatic but nothing else fails out of the 
>> box.
>> I feel like I might be misunderstanding the setup here.
>> On Mon, Oct 22, 2018 at 7:25 PM Stavros Kontopoulos
>>  wrote:




Re: Hadoop 3 support

2018-10-23 Thread Steve Loughran



> On 16 Oct 2018, at 22:06, t4  wrote:
> 
> Has anyone got Spark jars working with Hadoop 3.1 that they can share? I am
> looking to be able to use the latest hadoop-aws fixes from v3.1.

We do, but with:

* a patched Hive JAR
* building Spark with the -Phive,yarn,hadoop-3.1,hadoop-cloud,kinesis profiles to
pull in the object store stuff *while leaving out the things which cause
conflict*
* some extra stuff to wire up the 0-rename-committer

W.r.t. hadoop-aws, the Hadoop 2.9 artifacts have the shaded AWS JAR (50 MB of
.class files to avoid Jackson dependency pain) and an early version of S3Guard. For
the new commit stuff you will need to go to Hadoop 3.1.

-steve
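
For what it's worth, once such a build is on the classpath, the simplest end-to-end check is just reading an s3a path from PySpark; a minimal sketch (the bucket and path are hypothetical, and credentials are assumed to come from the usual AWS provider chain):

from pyspark.sql import SparkSession

# assumes a Spark build with the hadoop-3.1 and hadoop-cloud profiles, so that
# hadoop-aws and the matching AWS SDK are already on the classpath
spark = SparkSession.builder.appName("s3a-check").getOrCreate()

# hypothetical bucket/path; swap in something you can actually read
df = spark.read.text("s3a://my-example-bucket/logs/2018/10/")
print(df.count())

spark.stop()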



> 
> 
> 
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 





Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Stavros Kontopoulos
Sean,

I will try it against 2.12 shortly.

You're saying someone would have to first build a k8s distro from source
> too?


OK, I missed the error one line above; before the distro error there is
another one:

fatal: not a git repository (or any of the parent directories): .git


So that seems to come from here. It seems that the test root is not set up
correctly. It should be the top git dir from which you built Spark.

Now regarding the distro thing: dev-run-integration-tests.sh should run
from within the cloned project after the distro is built. The distro is
required; it should fail otherwise.

Integration tests run the setup-integration-test-env.sh script:
dev-run-integration-tests.sh calls mvn, which in turn executes that setup
script.

How do you run the tests?

Stavros

On Tue, Oct 23, 2018 at 3:01 PM, Sean Owen  wrote:

> No, because the docs are built into the release too and released to
> the site too from the released artifact.
> As a practical matter, I think these docs are not critical for
> release, and can follow in a maintenance release. I'd retarget to
> 2.4.1 or untarget.
> I do know at times a release's docs have been edited after the fact,
> but that's bad form. We'd not go change a class in the release after
> it was released and call it the same release.
>
> I'd still like some confirmation that someone can build and pass tests
> with -Pkubernetes, maybe? It actually all passed with the 2.11 build.
> I don't think it's a 2.12 incompatibility, but rather than the K8S
> tests maybe don't quite work with the 2.12 build artifact naming. Or
> else something to do with my env.
>
> On Mon, Oct 22, 2018 at 9:08 PM Wenchen Fan  wrote:
> >
> > Regarding the doc tickets, I vaguely remember that we can merge doc PRs
> after release and publish doc to spark website later. Can anyone confirm?
> >
> > On Tue, Oct 23, 2018 at 8:30 AM Sean Owen  wrote:
> >>
> >> This is what I got from a straightforward build of the source distro
> >> here ... really, ideally, it builds as-is from source. You're saying
> >> someone would have to first build a k8s distro from source too?
> >> It's not a 'must' that this be automatic but nothing else fails out of
> the box.
> >> I feel like I might be misunderstanding the setup here.
> >> On Mon, Oct 22, 2018 at 7:25 PM Stavros Kontopoulos
> >>  wrote:
>



-- 
Stavros Kontopoulos

*Senior Software Engineer*
*Lightbend, Inc.*

*p: +30 6977967274*
*e: stavros.kontopou...@lightbend.com* 


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Hyukjin Kwon
I am searching for and checking some PRs and JIRAs that mention regressions. Let me
leave a link; it might be good to double-check
https://github.com/apache/spark/pull/22514 as well.

On Tue, Oct 23, 2018 at 11:58 PM, Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote:

> Sean,
>
> I will try it against 2.12 shortly.
>
> You're saying someone would have to first build a k8s distro from source
>> too?
>
>
> Ok I missed the error one line above, before the distro error there is
> another one:
>
> fatal: not a git repository (or any of the parent directories): .git
>
>
> So that seems to come from here
> .
> It seems that the test root is not set up correctly. It should be the top
> git dir from which you built Spark.
>
> Now regarding the distro thing. dev-run-integration-tests.sh should run
> from within the cloned project after the distro is built. The distro is
> required
> 
> , it should fail otherwise.
>
> Integration tests run the setup-integration-test-env.sh script. 
> dev-run-integration-tests.sh
> calls mvn
> 
>  which
> in turn executes that setup script
> 
> .
>
> How do you run the tests?
>
> Stavros
>
> On Tue, Oct 23, 2018 at 3:01 PM, Sean Owen  wrote:
>
>> No, because the docs are built into the release too and released to
>> the site too from the released artifact.
>> As a practical matter, I think these docs are not critical for
>> release, and can follow in a maintenance release. I'd retarget to
>> 2.4.1 or untarget.
>> I do know at times a release's docs have been edited after the fact,
>> but that's bad form. We'd not go change a class in the release after
>> it was released and call it the same release.
>>
>> I'd still like some confirmation that someone can build and pass tests
>> with -Pkubernetes, maybe? It actually all passed with the 2.11 build.
>> I don't think it's a 2.12 incompatibility, but rather than the K8S
>> tests maybe don't quite work with the 2.12 build artifact naming. Or
>> else something to do with my env.
>>
>> On Mon, Oct 22, 2018 at 9:08 PM Wenchen Fan  wrote:
>> >
>> > Regarding the doc tickets, I vaguely remember that we can merge doc PRs
>> after release and publish doc to spark website later. Can anyone confirm?
>> >
>> > On Tue, Oct 23, 2018 at 8:30 AM Sean Owen  wrote:
>> >>
>> >> This is what I got from a straightforward build of the source distro
>> >> here ... really, ideally, it builds as-is from source. You're saying
>> >> someone would have to first build a k8s distro from source too?
>> >> It's not a 'must' that this be automatic but nothing else fails out of
>> the box.
>> >> I feel like I might be misunderstanding the setup here.
>> >> On Mon, Oct 22, 2018 at 7:25 PM Stavros Kontopoulos
>> >>  wrote:
>>
>
>
>
> --
> Stavros Kontopoulos
>
> *Senior Software Engineer*
> *Lightbend, Inc.*
>
> *p: +30 6977967274*
> *e: stavros.kontopou...@lightbend.com* 
>
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
Yeah, that's maybe the issue here. This is a source release, not a git
checkout, and it still needs to work in this context.

I just added -Pkubernetes to my build and didn't do anything else. I think
the ideal is that a "mvn -P... -P... install" works from a source
release; that's a good expectation and consistent with the docs.

Maybe these tests simply don't need to run with the normal suite of tests,
and can be considered tests run manually by developers running these
scripts? Basically, KubernetesSuite shouldn't run in a normal mvn install?

I don't think this has to block the release even if so, just trying to get
to the bottom of it.


On Tue, Oct 23, 2018 at 10:58 AM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Ok I missed the error one line above, before the distro error there is
> another one:
>
> fatal: not a git repository (or any of the parent directories): .git
>
>
> So that seems to come from here
> .
> It seems that the test root is not set up correctly. It should be the top
> git dir from which you built Spark.
>
> Now regarding the distro thing. dev-run-integration-tests.sh should run
> from within the cloned project after the distro is built. The distro is
> required
> 
> , it should fail otherwise.
>
> Integration tests run the setup-integration-test-env.sh script. 
> dev-run-integration-tests.sh
> calls mvn
> 
>  which
> in turn executes that setup script
> 
> .
>
> How do you run the tests?
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Wenchen Fan
I read through the contributing guide; it only mentions that data
correctness and data loss issues should be marked as blockers. AFAIK we
also mark regressions of the current release as blockers, but not regressions
of previous releases.

SPARK-24935 is indeed a bug, and is a regression from Spark 2.2.0. We
should definitely fix it, but it doesn't seem like a blocker. BTW, the root
cause of SPARK-24935 is unknown (at least I can't tell from the PR), so
fixing it might take a while.

On Tue, Oct 23, 2018 at 11:58 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Sean,
>
> I will try it against 2.12 shortly.
>
> You're saying someone would have to first build a k8s distro from source
>> too?
>
>
> Ok I missed the error one line above, before the distro error there is
> another one:
>
> fatal: not a git repository (or any of the parent directories): .git
>
>
> So that seems to come from here
> .
> It seems that the test root is not set up correctly. It should be the top
> git dir from which you built Spark.
>
> Now regarding the distro thing. dev-run-integration-tests.sh should run
> from within the cloned project after the distro is built. The distro is
> required
> 
> , it should fail otherwise.
>
> Integration tests run the setup-integration-test-env.sh script. 
> dev-run-integration-tests.sh
> calls mvn
> 
>  which
> in turn executes that setup script
> 
> .
>
> How do you run the tests?
>
> Stavros
>
> On Tue, Oct 23, 2018 at 3:01 PM, Sean Owen  wrote:
>
>> No, because the docs are built into the release too and released to
>> the site too from the released artifact.
>> As a practical matter, I think these docs are not critical for
>> release, and can follow in a maintenance release. I'd retarget to
>> 2.4.1 or untarget.
>> I do know at times a release's docs have been edited after the fact,
>> but that's bad form. We'd not go change a class in the release after
>> it was released and call it the same release.
>>
>> I'd still like some confirmation that someone can build and pass tests
>> with -Pkubernetes, maybe? It actually all passed with the 2.11 build.
>> I don't think it's a 2.12 incompatibility, but rather than the K8S
>> tests maybe don't quite work with the 2.12 build artifact naming. Or
>> else something to do with my env.
>>
>> On Mon, Oct 22, 2018 at 9:08 PM Wenchen Fan  wrote:
>> >
>> > Regarding the doc tickets, I vaguely remember that we can merge doc PRs
>> after release and publish doc to spark website later. Can anyone confirm?
>> >
>> > On Tue, Oct 23, 2018 at 8:30 AM Sean Owen  wrote:
>> >>
>> >> This is what I got from a straightforward build of the source distro
>> >> here ... really, ideally, it builds as-is from source. You're saying
>> >> someone would have to first build a k8s distro from source too?
>> >> It's not a 'must' that this be automatic but nothing else fails out of
>> the box.
>> >> I feel like I might be misunderstanding the setup here.
>> >> On Mon, Oct 22, 2018 at 7:25 PM Stavros Kontopoulos
>> >>  wrote:
>>
>
>
>
> --
> Stavros Kontopoulos
>
> *Senior Software Engineer*
> *Lightbend, Inc.*
>
> *p: +30 6977967274*
> *e: stavros.kontopou...@lightbend.com* 
>
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
(I should add, I only observed this with the Scala 2.12 build. It all
seemed to work with 2.11. Therefore I'm not too worried about it. I
don't think it's a Scala version issue, but perhaps something looking
for a spark 2.11 tarball and not finding it. See
https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
a change that might address this kind of thing.)

On Tue, Oct 23, 2018 at 11:05 AM Sean Owen  wrote:
>
> Yeah, that's maybe the issue here. This is a source release, not a git 
> checkout, and it still needs to work in this context.
>
> I just added -Pkubernetes to my build and didn't do anything else. I think 
> the ideal is that a "mvn -P... -P... install" to work from a source release; 
> that's a good expectation and consistent with docs.
>
> Maybe these tests simply don't need to run with the normal suite of tests, 
> and can be considered tests run manually by developers running these scripts? 
> Basically, KubernetesSuite shouldn't run in a normal mvn install?
>
> I don't think this has to block the release even if so, just trying to get to 
> the bottom of it.




Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Xiao Li
Thanks for reporting this. https://github.com/apache/spark/pull/22514 is
not a blocker. We can fix it in the next minor release, if we are unable to
make it in this release.

Thanks,

Xiao

On Tue, Oct 23, 2018 at 9:14 AM, Sean Owen wrote:

> (I should add, I only observed this with the Scala 2.12 build. It all
> seemed to work with 2.11. Therefore I'm not too worried about it. I
> don't think it's a Scala version issue, but perhaps something looking
> for a spark 2.11 tarball and not finding it. See
> https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
> a change that might address this kind of thing.)
>
> On Tue, Oct 23, 2018 at 11:05 AM Sean Owen  wrote:
> >
> > Yeah, that's maybe the issue here. This is a source release, not a git
> checkout, and it still needs to work in this context.
> >
> > I just added -Pkubernetes to my build and didn't do anything else. I
> think the ideal is that a "mvn -P... -P... install" to work from a source
> release; that's a good expectation and consistent with docs.
> >
> > Maybe these tests simply don't need to run with the normal suite of
> tests, and can be considered tests run manually by developers running these
> scripts? Basically, KubernetesSuite shouldn't run in a normal mvn install?
> >
> > I don't think this has to block the release even if so, just trying to
> get to the bottom of it.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Xiao Li
https://github.com/apache/spark/pull/22144 is also not a blocker of Spark
2.4 release, as discussed in the PR.

Thanks,

Xiao

On Tue, Oct 23, 2018 at 9:20 AM, Xiao Li wrote:

> Thanks for reporting this. https://github.com/apache/spark/pull/22514 is
> not a blocker. We can fix it in the next minor release, if we are unable to
> make it in this release.
>
> Thanks,
>
> Xiao
>
> On Tue, Oct 23, 2018 at 9:14 AM, Sean Owen wrote:
>
>> (I should add, I only observed this with the Scala 2.12 build. It all
>> seemed to work with 2.11. Therefore I'm not too worried about it. I
>> don't think it's a Scala version issue, but perhaps something looking
>> for a spark 2.11 tarball and not finding it. See
>> https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
>> a change that might address this kind of thing.)
>>
>> On Tue, Oct 23, 2018 at 11:05 AM Sean Owen  wrote:
>> >
>> > Yeah, that's maybe the issue here. This is a source release, not a git
>> checkout, and it still needs to work in this context.
>> >
>> > I just added -Pkubernetes to my build and didn't do anything else. I
>> think the ideal is that a "mvn -P... -P... install" to work from a source
>> release; that's a good expectation and consistent with docs.
>> >
>> > Maybe these tests simply don't need to run with the normal suite of
>> tests, and can be considered tests run manually by developers running these
>> scripts? Basically, KubernetesSuite shouldn't run in a normal mvn install?
>> >
>> > I don't think this has to block the release even if so, just trying to
>> get to the bottom of it.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Hyukjin Kwon
https://github.com/apache/spark/pull/22514 sounds like a regression that
affects Hive CTAS in the write path (by not converting it to Spark's internal
data sources, and therefore a performance regression),
but I doubt whether we should block the release on this.

https://github.com/apache/spark/pull/22144 is just being discussed if I am
not mistaken.

Thanks.

On Wed, Oct 24, 2018 at 12:27 AM, Xiao Li wrote:

> https://github.com/apache/spark/pull/22144 is also not a blocker of Spark
> 2.4 release, as discussed in the PR.
>
> Thanks,
>
> Xiao
>
> On Tue, Oct 23, 2018 at 9:20 AM, Xiao Li wrote:
>
>> Thanks for reporting this. https://github.com/apache/spark/pull/22514 is
>> not a blocker. We can fix it in the next minor release, if we are unable to
>> make it in this release.
>>
>> Thanks,
>>
>> Xiao
>>
>> On Tue, Oct 23, 2018 at 9:14 AM, Sean Owen wrote:
>>
>>> (I should add, I only observed this with the Scala 2.12 build. It all
>>> seemed to work with 2.11. Therefore I'm not too worried about it. I
>>> don't think it's a Scala version issue, but perhaps something looking
>>> for a spark 2.11 tarball and not finding it. See
>>> https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
>>> a change that might address this kind of thing.)
>>>
>>> On Tue, Oct 23, 2018 at 11:05 AM Sean Owen  wrote:
>>> >
>>> > Yeah, that's maybe the issue here. This is a source release, not a git
>>> checkout, and it still needs to work in this context.
>>> >
>>> > I just added -Pkubernetes to my build and didn't do anything else. I
>>> think the ideal is that a "mvn -P... -P... install" to work from a source
>>> release; that's a good expectation and consistent with docs.
>>> >
>>> > Maybe these tests simply don't need to run with the normal suite of
>>> tests, and can be considered tests run manually by developers running these
>>> scripts? Basically, KubernetesSuite shouldn't run in a normal mvn install?
>>> >
>>> > I don't think this has to block the release even if so, just trying to
>>> get to the bottom of it.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
I can’t seem to find any documentation of the &, |, and ~ operators for
PySpark DataFrame columns. I assume that should be in our docs somewhere.

Was it always missing? Am I just missing something obvious?

Nick


Re: Documentation of boolean column operators missing?

2018-10-23 Thread Xiao Li
They are documented at the link below

https://spark.apache.org/docs/2.3.0/api/sql/index.html



On Tue, Oct 23, 2018 at 10:27 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I can’t seem to find any documentation of the &, |, and ~ operators for
> PySpark DataFrame columns. I assume that should be in our docs somewhere.
>
> Was it always missing? Am I just missing something obvious?
>
> Nick
>





Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
Nope, that’s different. I’m talking about the operators on DataFrame
columns in PySpark, not SQL functions.

For example:

(df
.where(~col('is_exiled') & (col('age') > 60))
.show()
)
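
For reference, a self-contained version of that snippet (the data and column values are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").appName("bool-ops").getOrCreate()

df = spark.createDataFrame(
    [("Arya", 18, False), ("Aemon", 102, True), ("Davos", 61, False)],
    "name string, age int, is_exiled boolean",
)

# ~ is logical NOT and & is logical AND, both overloaded on Column;
# the parentheses around the comparison matter because & binds tighter than >
df.where(~col("is_exiled") & (col("age") > 60)).show()  # only "Davos" survives

spark.stop()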


On Tue, Oct 23, 2018 at 1:48 PM Xiao Li  wrote:

> They are documented at the link below
>
> https://spark.apache.org/docs/2.3.0/api/sql/index.html
>
>
>
> On Tue, Oct 23, 2018 at 10:27 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I can’t seem to find any documentation of the &, |, and ~ operators for
>> PySpark DataFrame columns. I assume that should be in our docs somewhere.
>>
>> Was it always missing? Am I just missing something obvious?
>>
>> Nick
>>
>
>
> --
> [image: Spark+AI Summit North America 2019]
> 
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Stavros Kontopoulos
Sean,

OK, makes sense; I'm using a cloned repo. I built with the Scala 2.12 profile
using the related tag v2.4.0-rc4:

./dev/change-scala-version.sh 2.12
./dev/make-distribution.sh  --name test --r --tgz -Pscala-2.12 -Psparkr
-Phadoop-2.7 -Pkubernetes -Phive
I pushed images to Docker Hub (previous email) since I didn't use the minikube
daemon (the default behavior).

Then I ran the tests successfully against minikube:

TGZ_PATH=$(pwd)/spark-2.4.0-bin-test.gz
cd resource-managers/kubernetes/integration-tests

./dev/dev-run-integration-tests.sh --spark-tgz $TGZ_PATH --service-account
default --namespace default --image-tag k8s-scala-12 --image-repo skonto


[INFO]
[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 229 milliseconds.
Run starting. Expected test count is: 14
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
Run completed in 5 minutes, 24 seconds.
Total number of tests run: 14
Suites: completed 2, aborted 0
Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM 2.4.0 . SUCCESS [
4.491 s]
[INFO] Spark Project Tags . SUCCESS [
3.833 s]
[INFO] Spark Project Local DB . SUCCESS [
2.680 s]
[INFO] Spark Project Networking ... SUCCESS [
4.817 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
2.541 s]
[INFO] Spark Project Unsafe ... SUCCESS [
2.795 s]
[INFO] Spark Project Launcher . SUCCESS [
5.593 s]
[INFO] Spark Project Core . SUCCESS [
25.160 s]
[INFO] Spark Project Kubernetes Integration Tests 2.4.0 ... SUCCESS [05:30
min]
[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 06:23 min
[INFO] Finished at: 2018-10-23T18:39:11Z
[INFO]



but I had to modify this line and add -Pscala-2.12, otherwise it fails
(these tests inherit from the parent pom but the profile is not propagated
to the mvn command that launches the tests; I can create a PR to fix that).


On Tue, Oct 23, 2018 at 7:44 PM, Hyukjin Kwon  wrote:

> https://github.com/apache/spark/pull/22514 sounds like a regression that
> affects Hive CTAS in write path (by not replacing them into Spark internal
> datasources; therefore performance regression).
> but yea I suspect if we should block the release by this.
>
> https://github.com/apache/spark/pull/22144 is just being discussed if I
> am not mistaken.
>
> Thanks.
>
>> On Wed, Oct 24, 2018 at 12:27 AM, Xiao Li wrote:
>
>> https://github.com/apache/spark/pull/22144 is also not a blocker of
>> Spark 2.4 release, as discussed in the PR.
>>
>> Thanks,
>>
>> Xiao
>>
>> On Tue, Oct 23, 2018 at 9:20 AM, Xiao Li wrote:
>>
>>> Thanks for reporting this. https://github.com/apache/spark/pull/22514
>>> is not a blocker. We can fix it in the next minor release, if we are unable
>>> to make it in this release.
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> On Tue, Oct 23, 2018 at 9:14 AM, Sean Owen wrote:
>>>
 (I should add, I only observed this with the Scala 2.12 build. It all
 seemed to work with 2.11. Therefore I'm not too worried about it. I
 don't think it's a Scala version issue, but perhaps something looking
 for a spark 2.11 tarball and not finding it. See
 https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
 a change that might address this kind of thing.)

 On Tue, Oct 23, 2018 at 11:05 AM Sean Owen  wrote:
 >
 > Yeah, that's maybe the issue here. This is a source release, not a
 git checkout, and it still needs to work in this context.
 >
 > I just added -Pkubernetes to my build and didn't do anything else. I
 think the ideal is that a "mvn -P... -P... install" to work from a source
 release; that's a good expectation and consistent with docs.
 >
 > Maybe these tests simply don't need to run w

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Ilan Filonenko
+1 (non-binding) in reference to all k8s tests for 2.11 (including SparkR
Tests with R version being 3.4.1)

*[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
spark-kubernetes-integration-tests_2.11 ---*
*Discovery starting.*
*Discovery completed in 202 milliseconds.*
*Run starting. Expected test count is: 15*
*KubernetesSuite:*
*- Run SparkPi with no resources*
*- Run SparkPi with a very long application name.*
*- Use SparkLauncher.NO_RESOURCE*
*- Run SparkPi with a master URL without a scheme.*
*- Run SparkPi with an argument.*
*- Run SparkPi with custom labels, annotations, and environment variables.*
*- Run extraJVMOptions check on driver*
*- Run SparkRemoteFileTest using a remote data file*
*- Run SparkPi with env and mount secrets.*
*- Run PySpark on simple pi.py example*
*- Run PySpark with Python2 to test a pyfiles example*
*- Run PySpark with Python3 to test a pyfiles example*
*- Run PySpark with memory customization*
*- Run SparkR on simple dataframe.R example*
*- Run in client mode.*
*Run completed in 6 minutes, 47 seconds.*
*Total number of tests run: 15*
*Suites: completed 2, aborted 0*
*Tests: succeeded 15, failed 0, canceled 0, ignored 0, pending 0*
*All tests passed.*

Sean, in reference to your issues, the comment you linked is correct in
that you would need to build a Kubernetes distribution:
i.e. *dev/make-distribution.sh --pip --r --tgz -Psparkr -Phadoop-2.7
-Pkubernetes*
then set up minikube:
i.e. *minikube start --insecure-registry=localhost:5000 --cpus 6 --memory
6000*
and then run the appropriate tests:
i.e. *dev/dev-run-integration-tests.sh --spark-tgz
.../spark-2.4.0-bin-2.7.3.tgz*

The newest PR that you linked allows us to point to the local Kubernetes
cluster deployed via docker-for-mac as opposed to minikube, which gives us
another way to test, but it does not change the testing workflow AFAICT.

On Tue, Oct 23, 2018 at 9:14 AM Sean Owen  wrote:

> (I should add, I only observed this with the Scala 2.12 build. It all
> seemed to work with 2.11. Therefore I'm not too worried about it. I
> don't think it's a Scala version issue, but perhaps something looking
> for a spark 2.11 tarball and not finding it. See
> https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
> a change that might address this kind of thing.)
>
> On Tue, Oct 23, 2018 at 11:05 AM Sean Owen  wrote:
> >
> > Yeah, that's maybe the issue here. This is a source release, not a git
> checkout, and it still needs to work in this context.
> >
> > I just added -Pkubernetes to my build and didn't do anything else. I
> think the ideal is that a "mvn -P... -P... install" to work from a source
> release; that's a good expectation and consistent with docs.
> >
> > Maybe these tests simply don't need to run with the normal suite of
> tests, and can be considered tests run manually by developers running these
> scripts? Basically, KubernetesSuite shouldn't run in a normal mvn install?
> >
> > I don't think this has to block the release even if so, just trying to
> get to the bottom of it.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: About introduce function sum0 to Spark

2018-10-23 Thread Mark Hamstra
Yes, as long as you are only talking about summing numeric values. Part of
my point, though, is that this is just a special case of folding or
aggregating with an initial or 'zero' value. It doesn't need to be limited
to just numeric sums with zero = 0.
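
A tiny PySpark sketch of that framing, where the 'zero' and the combining function are arbitrary and a numeric sum with zero = 0 is just one instance:

import operator
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("fold-sketch").getOrCreate()
sc = spark.sparkContext

# the sum0 behaviour falls out of folding with a zero value:
# an empty dataset yields the zero, not null
print(sc.parallelize([]).fold(0, operator.add))         # 0
print(sc.parallelize([1, 2, 3]).fold(0, operator.add))  # 6

# and the same shape works for any (zero, op) pair, e.g. products
print(sc.parallelize([1, 2, 3, 4]).fold(1, operator.mul))  # 24

spark.stop()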

On Tue, Oct 23, 2018 at 12:23 AM Wenchen Fan  wrote:

> This is logically `sum( if(isnull(col), 0, col) )` right?
>
> On Tue, Oct 23, 2018 at 2:58 PM 陶 加涛  wrote:
>
>> The name is from Apache Calcite, And it doesn’t matter, we can introduce
>> our own.
>>
>>
>>
>>
>>
>> ---
>>
>> Regards!
>>
>> Aron Tao
>>
>>
>>
>> *From:* Mark Hamstra
>> *Date:* Tuesday, October 23, 2018, 12:28
>> *To:* "taojia...@gmail.com"
>> *Cc:* dev
>> *Subject:* Re: About introduce function sum0 to Spark
>>
>>
>>
>> That's a horrible name. This is just a fold.
>>
>>
>>
>> On Mon, Oct 22, 2018 at 7:39 PM 陶 加涛  wrote:
>>
>> Hi, in calcite, has the concept of sum0, here I quote the definition of
>> sum0:
>>
>>
>>
>> Sum0 is an aggregator which returns the sum of the values which
>>
>> go into it like Sum. It differs in that when no non null values
>>
>> are applied zero is returned instead of null..
>>
>>
>>
>> One scenario is that we can use sum0 to implement pre-calculation
>> count(pre-calculation system like Apache Kylin).
>>
>>
>>
>> It is very easy in Spark to implement sum0, if community consider this is
>> necessary, I would like to open a JIRA and implement this.
>>
>>
>>
>> ---
>>
>> Regards!
>>
>> Aron Tao
>>
>>
>>
>>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Dongjoon Hyun
BTW, for that integration suite, I saw the related artifacts in the RC4
staging directory.

Does Spark 2.4.0 need to start releasing these
`spark-kubernetes-integration-tests` artifacts?

   -
   
https://repository.apache.org/content/repositories/orgapachespark-1290/org/apache/spark/spark-kubernetes-integration-tests_2.11/
   -
   
https://repository.apache.org/content/repositories/orgapachespark-1290/org/apache/spark/spark-kubernetes-integration-tests_2.12/

Historically, Spark released `spark-docker-integration-tests` in the Spark
1.6.x era and stopped as of Spark 2.0.0.

   -
   
http://central.maven.org/maven2/org/apache/spark/spark-docker-integration-tests_2.10/
   -
   
http://central.maven.org/maven2/org/apache/spark/spark-docker-integration-tests_2.11/


Bests,
Dongjoon.

On Tue, Oct 23, 2018 at 11:43 AM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Sean,
>
> Ok makes sense, im using a cloned repo. I built with Scala 2.12 profile
> using the related tag v2.4.0-rc4:
>
> ./dev/change-scala-version.sh 2.12
> ./dev/make-distribution.sh  --name test --r --tgz -Pscala-2.12 -Psparkr
> -Phadoop-2.7 -Pkubernetes -Phive
> Pushed images to dockerhub (previous email) since I didnt use the minikube
> daemon (default behavior).
>
> Then run tests successfully against minikube:
>
> TGZ_PATH=$(pwd)/spark-2.4.0-bin-test.gz
> cd resource-managers/kubernetes/integration-tests
>
> ./dev/dev-run-integration-tests.sh --spark-tgz $TGZ_PATH --service-account
> default --namespace default --image-tag k8s-scala-12 --image-repo skonto
>
>
> [INFO]
> [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
> spark-kubernetes-integration-tests_2.12 ---
> Discovery starting.
> Discovery completed in 229 milliseconds.
> Run starting. Expected test count is: 14
> KubernetesSuite:
> - Run SparkPi with no resources
> - Run SparkPi with a very long application name.
> - Use SparkLauncher.NO_RESOURCE
> - Run SparkPi with a master URL without a scheme.
> - Run SparkPi with an argument.
> - Run SparkPi with custom labels, annotations, and environment variables.
> - Run extraJVMOptions check on driver
> - Run SparkRemoteFileTest using a remote data file
> - Run SparkPi with env and mount secrets.
> - Run PySpark on simple pi.py example
> - Run PySpark with Python2 to test a pyfiles example
> - Run PySpark with Python3 to test a pyfiles example
> - Run PySpark with memory customization
> - Run in client mode.
> Run completed in 5 minutes, 24 seconds.
> Total number of tests run: 14
> Suites: completed 2, aborted 0
> Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0
> All tests passed.
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM 2.4.0 . SUCCESS [
> 4.491 s]
> [INFO] Spark Project Tags . SUCCESS [
> 3.833 s]
> [INFO] Spark Project Local DB . SUCCESS [
> 2.680 s]
> [INFO] Spark Project Networking ... SUCCESS [
> 4.817 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
> 2.541 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 2.795 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 5.593 s]
> [INFO] Spark Project Core . SUCCESS [
> 25.160 s]
> [INFO] Spark Project Kubernetes Integration Tests 2.4.0 ... SUCCESS [05:30
> min]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 06:23 min
> [INFO] Finished at: 2018-10-23T18:39:11Z
> [INFO]
> 
>
>
> but had to modify this line
> 
>  and
> added -Pscala-2.12 , otherwise it fails (these tests inherit from the
> parent pom but the profile is not propagated to the mvn command that
> launches the tests, I can create a PR to fix that).
>
>
> On Tue, Oct 23, 2018 at 7:44 PM, Hyukjin Kwon  wrote:
>
>> https://github.com/apache/spark/pull/22514 sounds like a regression that
>> affects Hive CTAS in write path (by not replacing them into Spark internal
>> datasources; therefore performance regression).
>> but yea I suspect if we should block the release by this.
>>
>> https://github.com/apache/spark/pull/22144 is just being discussed if I
>> am not mistaken.
>>
>> Thanks.
>>
>> On Wed, Oct 24, 2018 at 12:27 AM, Xiao Li wrote:
>>
>>> https://github.com/apache/spark/pull/22144 is also not a blocker of
>>> Spark 2.4 release, as discussed in the PR.
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> On Tue, Oct 23, 2018 at 9:20 AM, Xiao Li wrote:
>>>
 Thanks for reporting this. https://github.com/apache/spark/pull/22514
 is no

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Sean Owen
Those should all be Column functions, really, and I see them at
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column

On Tue, Oct 23, 2018, 12:27 PM Nicholas Chammas 
wrote:

> I can’t seem to find any documentation of the &, |, and ~ operators for
> PySpark DataFrame columns. I assume that should be in our docs somewhere.
>
> Was it always missing? Am I just missing something obvious?
>
> Nick
>


Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
So it appears then that the equivalent operators for PySpark are completely
missing from the docs, right? That’s surprising. And if there are column
function equivalents for |, &, and ~, then I can’t find those either for
PySpark. Indeed, I don't think such a thing is possible in PySpark
(e.g. (col('age') > 0).and(...)).

I can file a ticket about this, but I’m just making sure I’m not missing
something obvious.

On Tue, Oct 23, 2018 at 2:50 PM Sean Owen  wrote:

> Those should all be Column functions, really, and I see them at
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
>
> On Tue, Oct 23, 2018, 12:27 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I can’t seem to find any documentation of the &, |, and ~ operators for
>> PySpark DataFrame columns. I assume that should be in our docs somewhere.
>>
>> Was it always missing? Am I just missing something obvious?
>>
>> Nick
>>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Stavros Kontopoulos
+1 (non-binding). Run k8s tests with Scala 2.12. Also included the
RTestsSuite (mentioned by Ilan) although not part of the 2.4 rc tag:

[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 239 milliseconds.
Run starting. Expected test count is: 15
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Run SparkR on simple dataframe.R example
Run completed in 6 minutes, 32 seconds.
Total number of tests run: 15
Suites: completed 2, aborted 0
Tests: succeeded 15, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM 2.4.0 . SUCCESS [
4.480 s]
[INFO] Spark Project Tags . SUCCESS [
3.898 s]
[INFO] Spark Project Local DB . SUCCESS [
2.773 s]
[INFO] Spark Project Networking ... SUCCESS [
5.063 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
2.651 s]
[INFO] Spark Project Unsafe ... SUCCESS [
2.662 s]
[INFO] Spark Project Launcher . SUCCESS [
5.103 s]
[INFO] Spark Project Core . SUCCESS [
25.703 s]
[INFO] Spark Project Kubernetes Integration Tests 2.4.0 ... SUCCESS [06:51
min]
[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 07:44 min
[INFO] Finished at: 2018-10-23T19:09:41Z
[INFO]


Stavros

On Tue, Oct 23, 2018 at 9:46 PM, Dongjoon Hyun 
wrote:

> BTW, for that integration suite, I saw the related artifacts in the RC4
> staging directory.
>
> Does Spark 2.4.0 need to start to release these `spark-kubernetes
> -integration-tests` artifacts?
>
>- https://repository.apache.org/content/repositories/
>orgapachespark-1290/org/apache/spark/spark-kubernetes-
>integration-tests_2.11/
>
> 
>- https://repository.apache.org/content/repositories/
>orgapachespark-1290/org/apache/spark/spark-kubernetes-
>integration-tests_2.12/
>
> 
>
> Historically, Spark released `spark-docker-integration-tests` at Spark
> 1.6.x era and stopped since Spark 2.0.0.
>
>- http://central.maven.org/maven2/org/apache/spark/spark-
>docker-integration-tests_2.10/
>- http://central.maven.org/maven2/org/apache/spark/spark-
>docker-integration-tests_2.11/
>
>
> Bests,
> Dongjoon.
>
> On Tue, Oct 23, 2018 at 11:43 AM Stavros Kontopoulos  lightbend.com> wrote:
>
>> Sean,
>>
>> Ok makes sense, im using a cloned repo. I built with Scala 2.12 profile
>> using the related tag v2.4.0-rc4:
>>
>> ./dev/change-scala-version.sh 2.12
>> ./dev/make-distribution.sh  --name test --r --tgz -Pscala-2.12 -Psparkr
>> -Phadoop-2.7 -Pkubernetes -Phive
>> Pushed images to dockerhub (previous email) since I didnt use the
>> minikube daemon (default behavior).
>>
>> Then run tests successfully against minikube:
>>
>> TGZ_PATH=$(pwd)/spark-2.4.0-bin-test.gz
>> cd resource-managers/kubernetes/integration-tests
>>
>> ./dev/dev-run-integration-tests.sh --spark-tgz $TGZ_PATH
>> --service-account default --namespace default --image-tag k8s-scala-12 
>> --image-repo
>> skonto
>>
>>
>> [INFO]
>> [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
>> spark-kubernetes-integration-tests_2.12 ---
>> Discovery starting.
>> Discovery completed in 229 milliseconds.
>> Run starting. Expected test count is: 14
>> KubernetesSuite:
>> - Run SparkPi with no resources
>> - Run SparkPi with a very long application name.
>> - Use SparkLauncher.NO_RESOURCE
>> - Run SparkPi with a master URL without a scheme.
>> - Run SparkPi with an argument.
>> - Run SparkPi with custom labels, annotations, and environment variables.
>> - Run extraJVMOptions check on driver
>> - Run SparkRemoteFileTest using a remote data file
>> - Run SparkPi with env and mount secre

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
Also, to clarify something for folks who don't work with PySpark: The
boolean column operators in PySpark are completely different from those in
Scala, and non-obvious to boot (since they overload Python's _bitwise_
operators). So their apparent absence from the docs is surprising.

On Tue, Oct 23, 2018 at 3:02 PM Nicholas Chammas 
wrote:

> So it appears then that the equivalent operators for PySpark are
> completely missing from the docs, right? That’s surprising. And if there
> are column function equivalents for |, &, and ~, then I can’t find those
> either for PySpark. Indeed, I don’t think such a thing is possible in
> PySpark. (e.g. (col('age') > 0).and(...))
>
> I can file a ticket about this, but I’m just making sure I’m not missing
> something obvious.
>
> On Tue, Oct 23, 2018 at 2:50 PM Sean Owen  wrote:
>
>> Those should all be Column functions, really, and I see them at
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
>>
>> On Tue, Oct 23, 2018, 12:27 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I can’t seem to find any documentation of the &, |, and ~ operators for
>>> PySpark DataFrame columns. I assume that should be in our docs somewhere.
>>>
>>> Was it always missing? Am I just missing something obvious?
>>>
>>> Nick
>>>
>>


Re: Documentation of boolean column operators missing?

2018-10-23 Thread Sean Owen
(& and | are both logical and bitwise operators in Java and Scala, FWIW)

I don't see them in the Python docs; they are defined in column.py but
they don't turn up in the docs. Then again, they're not documented:

...
__and__ = _bin_op('and')
__or__ = _bin_op('or')
__invert__ = _func_op('not')
__rand__ = _bin_op("and")
__ror__ = _bin_op("or")
...

I don't know if there's a good reason for it, but go ahead and doc
them if they can be.
While I suspect their meaning is obvious once it's clear they aren't
the bitwise operators, that part isn't obvious. While it matches
Java/Scala/Scala-Spark syntax, and that's probably most important, it
isn't typical for Python.

The comments say that it is not possible to overload 'and' and 'or',
which would have been more natural.
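
To make that last point concrete, a small sketch of the difference, assuming current PySpark behaviour where Column refuses to be coerced to a bool:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").appName("and-vs-ampersand").getOrCreate()

adult = col("age") > 18
local = col("country") == "GR"

# & goes through Column.__and__ and builds a new Column expression
both = adult & local
print(both)

# `and` cannot be overloaded: Python first evaluates bool(adult), and
# Column's __bool__ raises, pointing users at &, |, ~ instead
try:
    adult and local
except ValueError as e:
    print(e)

spark.stop()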

On Tue, Oct 23, 2018 at 2:20 PM Nicholas Chammas
 wrote:
>
> Also, to clarify something for folks who don't work with PySpark: The boolean 
> column operators in PySpark are completely different from those in Scala, and 
> non-obvious to boot (since they overload Python's _bitwise_ operators). So 
> their apparent absence from the docs is surprising.
>
> On Tue, Oct 23, 2018 at 3:02 PM Nicholas Chammas  
> wrote:
>>
>> So it appears then that the equivalent operators for PySpark are completely 
>> missing from the docs, right? That’s surprising. And if there are column 
>> function equivalents for |, &, and ~, then I can’t find those either for 
>> PySpark. Indeed, I don’t think such a thing is possible in PySpark. (e.g. 
>> (col('age') > 0).and(...))
>>
>> I can file a ticket about this, but I’m just making sure I’m not missing 
>> something obvious.
>>
>>
>> On Tue, Oct 23, 2018 at 2:50 PM Sean Owen  wrote:
>>>
>>> Those should all be Column functions, really, and I see them at 
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
>>>
>>> On Tue, Oct 23, 2018, 12:27 PM Nicholas Chammas 
>>>  wrote:

 I can’t seem to find any documentation of the &, |, and ~ operators for 
 PySpark DataFrame columns. I assume that should be in our docs somewhere.

 Was it always missing? Am I just missing something obvious?

 Nick




Re: Documentation of boolean column operators missing?

2018-10-23 Thread Maciej Szymkiewicz
Even if these were documented, Sphinx doesn't include dunder methods by
default (with the exception of __init__). There is a :special-members: option
which can be passed to, for example, autoclass.

On Tue, 23 Oct 2018 at 21:32, Sean Owen  wrote:

> (& and | are both logical and bitwise operators in Java and Scala, FWIW)
>
> I don't see them in the python docs; they are defined in column.py but
> they don't turn up in the docs. Then again, they're not documented:
>
> ...
> __and__ = _bin_op('and')
> __or__ = _bin_op('or')
> __invert__ = _func_op('not')
> __rand__ = _bin_op("and")
> __ror__ = _bin_op("or")
> ...
>
> I don't know if there's a good reason for it, but go ahead and doc
> them if they can be.
> While I suspect their meaning is obvious once it's clear they aren't
> the bitwise operators, that part isn't obvious/ While it matches
> Java/Scala/Scala-Spark syntax, and that's probably most important, it
> isn't typical for python.
>
> The comments say that it is not possible to overload 'and' and 'or',
> which would have been more natural.
>
> On Tue, Oct 23, 2018 at 2:20 PM Nicholas Chammas
>  wrote:
> >
> > Also, to clarify something for folks who don't work with PySpark: The
> boolean column operators in PySpark are completely different from those in
> Scala, and non-obvious to boot (since they overload Python's _bitwise_
> operators). So their apparent absence from the docs is surprising.
> >
> > On Tue, Oct 23, 2018 at 3:02 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> >>
> >> So it appears then that the equivalent operators for PySpark are
> completely missing from the docs, right? That’s surprising. And if there
> are column function equivalents for |, &, and ~, then I can’t find those
> either for PySpark. Indeed, I don’t think such a thing is possible in
> PySpark. (e.g. (col('age') > 0).and(...))
> >>
> >> I can file a ticket about this, but I’m just making sure I’m not
> missing something obvious.
> >>
> >>
> >> On Tue, Oct 23, 2018 at 2:50 PM Sean Owen  wrote:
> >>>
> >>> Those should all be Column functions, really, and I see them at
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
> >>>
> >>> On Tue, Oct 23, 2018, 12:27 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> 
>  I can’t seem to find any documentation of the &, |, and ~ operators
> for PySpark DataFrame columns. I assume that should be in our docs
> somewhere.
> 
>  Was it always missing? Am I just missing something obvious?
> 
>  Nick
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
To be clear I'm currently +1 on this release, with much commentary.

OK, the explanation for kubernetes tests makes sense. Yes I think we need
to propagate the scala-2.12 build profile to make it work. Go for it, if
you have a lead on what the change is.
This doesn't block the release as it's an issue for tests, and only affects
2.12. However if we had a clean fix for this and there were another RC, I'd
include it.

Dongjoon has a good point about the spark-kubernetes-integration-tests
artifact. That doesn't sound like it should be published in this way,
though, of course, we publish the test artifacts from every module already.
This is only a bit odd in being a non-test artifact meant for testing. But
it's special testing! So I also don't think that needs to block a release.

This happens because the integration tests module is enabled with the
'kubernetes' profile too, and also this output is copied into the release
tarball at kubernetes/integration-tests/tests. Do we need that in a binary
release?

If these integration tests are meant to be run ad hoc, manually, not part
of a normal test cycle, then I think we can just not enable it with
-Pkubernetes. If it is meant to run every time, then it sounds like we need
a little extra work shown in recent PRs to make that easier, but then, this
test code should just be the 'test' artifact parts of the kubernetes
module, no?


On Tue, Oct 23, 2018 at 1:46 PM Dongjoon Hyun 
wrote:

> BTW, for that integration suite, I saw the related artifacts in the RC4
> staging directory.
>
> Does Spark 2.4.0 need to start to release these 
> `spark-kubernetes-integration-tests`
> artifacts?
>
>-
>
> https://repository.apache.org/content/repositories/orgapachespark-1290/org/apache/spark/spark-kubernetes-integration-tests_2.11/
>-
>
> https://repository.apache.org/content/repositories/orgapachespark-1290/org/apache/spark/spark-kubernetes-integration-tests_2.12/
>
> Historically, Spark released `spark-docker-integration-tests` at Spark
> 1.6.x era and stopped since Spark 2.0.0.
>
>-
>
> http://central.maven.org/maven2/org/apache/spark/spark-docker-integration-tests_2.10/
>-
>
> http://central.maven.org/maven2/org/apache/spark/spark-docker-integration-tests_2.11/
>
>
> Bests,
> Dongjoon.
>
> On Tue, Oct 23, 2018 at 11:43 AM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> Sean,
>>
>> Ok makes sense, im using a cloned repo. I built with Scala 2.12 profile
>> using the related tag v2.4.0-rc4:
>>
>> ./dev/change-scala-version.sh 2.12
>> ./dev/make-distribution.sh  --name test --r --tgz -Pscala-2.12 -Psparkr
>> -Phadoop-2.7 -Pkubernetes -Phive
>> Pushed images to dockerhub (previous email) since I didnt use the
>> minikube daemon (default behavior).
>>
>> Then run tests successfully against minikube:
>>
>> TGZ_PATH=$(pwd)/spark-2.4.0-bin-test.gz
>> cd resource-managers/kubernetes/integration-tests
>>
>> ./dev/dev-run-integration-tests.sh --spark-tgz $TGZ_PATH
>> --service-account default --namespace default
>> --image-tag k8s-scala-12 --image-repo skonto
>>
>


Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
On Tue, 23 Oct 2018 at 21:32, Sean Owen  wrote:
>
>> The comments say that it is not possible to overload 'and' and 'or',
>> which would have been more natural.
>>
Yes, unfortunately, Python does not allow you to override and, or, or not.
They are not implemented as "dunder" methods (e.g. __add__()) and they
implement special short-circuiting logic that's not possible to reproduce
with a function call. I think we made the most practical choice in
overriding the bitwise operators.

In any case, I’ll file a JIRA ticket about this, and maybe also submit a PR
to close it, adding documentation about PySpark column boolean operators to
the programming guide.

Nick


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Dongjoon Hyun
Ur, Wenchen.

Building a distribution from the source release seems to fail by default.

https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz

$ dev/make-distribution.sh -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
-Phive-thriftserver
...
+ cp /spark-2.4.0/LICENSE-binary /spark-2.4.0/dist/LICENSE
cp: /spark-2.4.0/LICENSE-binary: No such file or directory


The root cause seems to be the following fix.

https://github.com/apache/spark/pull/22436/files#diff-01ca42240614718522afde4d4885b40dR175

Although Apache Spark provides the binary distributions, it would be great
if this succeeded out of the box.

Bests,
Dongjoon.


On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 26 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc4 (commit
> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
> https://github.com/apache/spark/tree/v2.4.0-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1290
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
Hm, so you're trying to build a source release from a binary release?
I don't think that needs to work nor do I expect it to for reasons
like this. They just have fairly different things.

On Tue, Oct 23, 2018 at 7:04 PM Dongjoon Hyun  wrote:
>
> Ur, Wenchen.
>
> Source distribution seems to fail by default.
>
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
>
> $ dev/make-distribution.sh -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive 
> -Phive-thriftserver
> ...
> + cp /spark-2.4.0/LICENSE-binary /spark-2.4.0/dist/LICENSE
> cp: /spark-2.4.0/LICENSE-binary: No such file or directory
>
>
> The root cause seems to be the following fix.
>
> https://github.com/apache/spark/pull/22436/files#diff-01ca42240614718522afde4d4885b40dR175
>
> Although Apache Spark provides the binary distributions, it would be great if 
> this succeeds out of the box.
>
> Bests,
> Dongjoon.
>




Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Ryan Blue
+1 (non-binding)

The Iceberg implementation of DataSourceV2 is passing all tests after
updating to the 2.4 API, although I've had to disable ORC support because
BufferHolder is no longer public.

One oddity is that the DSv2 API for batch sources now includes an epoch ID,
which I think will be removed in the refactor before 2.5 or 3.0 and wasn't
part of the 2.3 release. That's strange, but it's minor.

rb

On Tue, Oct 23, 2018 at 5:10 PM Sean Owen  wrote:

> Hm, so you're trying to build a source release from a binary release?
> I don't think that needs to work nor do I expect it to for reasons
> like this. They just have fairly different things.
>
> On Tue, Oct 23, 2018 at 7:04 PM Dongjoon Hyun 
> wrote:
> >
> > Ur, Wenchen.
> >
> > Source distribution seems to fail by default.
> >
> >
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
> >
> > $ dev/make-distribution.sh -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
> -Phive-thriftserver
> > ...
> > + cp /spark-2.4.0/LICENSE-binary /spark-2.4.0/dist/LICENSE
> > cp: /spark-2.4.0/LICENSE-binary: No such file or directory
> >
> >
> > The root cause seems to be the following fix.
> >
> >
> https://github.com/apache/spark/pull/22436/files#diff-01ca42240614718522afde4d4885b40dR175
> >
> > Although Apache Spark provides the binary distributions, it would be
> great if this succeeds out of the box.
> >
> > Bests,
> > Dongjoon.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Ryan Blue
Software Engineer
Netflix