Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread Xiangrui Meng
I posted my comment in the JIRA.
Main concerns here:

1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0
release someday.
2. ML/DL systems that can benefit from a columnar format are mostly in
Python.
3. Simple operations, though they benefit from vectorization, might not be
worth the data exchange overhead.

So would an improved Pandas UDF API be good enough? For example,
SPARK-26412 (a UDF that takes an iterator of Arrow batches).
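
To make the idea concrete, here is a minimal sketch of what an
iterator-of-batches Pandas UDF could look like. This is only an
illustration of the shape of the API: the function name, the DummyModel
stand-in, and the exact signature are assumptions, not the interface
proposed in SPARK-26412.

    from typing import Iterator

    import pandas as pd

    class DummyModel:
        """Stand-in for an expensive-to-load ML model (illustrative only)."""
        def predict(self, features: pd.Series) -> pd.Series:
            return features * 2.0

    def predict(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        # The expensive setup runs once per task rather than once per batch,
        # which is the main win of the iterator form for ML/DL scoring.
        model = DummyModel()
        for batch in batches:
            # Each pandas batch would be backed by an Arrow record batch.
            yield pd.DataFrame({"prediction": model.predict(batch["features"])})

An iterator form like this amortizes per-task initialization across
batches, which also softens concern 3 above for all but the cheapest
operations.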

Sorry, I should have joined the discussion earlier! Hope it is not too late :)

On Fri, Apr 19, 2019 at 1:20 PM  wrote:

> +1 (non-binding) for better columnar data processing support.
>
>
>
> *From:* Jules Damji 
> *Sent:* Friday, April 19, 2019 12:21 PM
> *To:* Bryan Cutler 
> *Cc:* Dev 
> *Subject:* Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
>
>
>
> + (non-binding)
>
> Sent from my iPhone
>
> Pardon the dumb thumb typos :)
>
>
> On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
>
> +1 (non-binding)
>
>
>
> On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
>
> +1 (non-binding).  Looking forward to seeing better support for processing
> columnar data.
>
>
>
> Jason
>
>
>
> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves 
> wrote:
>
> Hi everyone,
>
>
>
> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
>
>
>
> You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
>
>
>
> Please vote as early as you can, I will leave the vote open until next
> Monday (the 22nd), 2pm CST to give people plenty of time.
>
>
>
> [ ] +1: Accept the proposal as an official SPIP
>
> [ ] +0
>
> [ ] -1: I don't think this is a good idea because ...
>
>
>
>
>
> Thanks!
>
> Tom Graves
>
>


RE: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread tcondie
+1 (non-binding) for better columnar data processing support.

 

From: Jules Damji  
Sent: Friday, April 19, 2019 12:21 PM
To: Bryan Cutler 
Cc: Dev 
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
Processing Support

 

+ (non-binding)

Sent from my iPhone

Pardon the dumb thumb typos :)


On Apr 19, 2019, at 10:30 AM, Bryan Cutler wrote:

+1 (non-binding)

 

On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe wrote:

+1 (non-binding).  Looking forward to seeing better support for processing 
columnar data.

 

Jason

 

On Tue, Apr 16, 2019 at 10:38 AM Tom Graves wrote:

Hi everyone,

 

I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended 
Columnar Processing Support.  The proposal is to extend the support to allow 
for more columnar processing.

 

You can find the full proposal in the jira at: 
https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS 
thread in the dev mailing list.

 

Please vote as early as you can, I will leave the vote open until next Monday 
(the 22nd), 2pm CST to give people plenty of time.

 

[ ] +1: Accept the proposal as an official SPIP

[ ] +0

[ ] -1: I don't think this is a good idea because ...

 

 

Thanks!

Tom Graves



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread Jules Damji
+ (non-binding)

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
> 
> +1 (non-binding)
> 
>> On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
>> +1 (non-binding).  Looking forward to seeing better support for processing 
>> columnar data.
>> 
>> Jason
>> 
>>> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves  
>>> wrote:
>>> Hi everyone,
>>> 
>>> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended 
>>> Columnar Processing Support.  The proposal is to extend the support to 
>>> allow for more columnar processing.
>>> 
>>> You can find the full proposal in the jira at: 
>>> https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS 
>>> thread in the dev mailing list.
>>> 
>>> Please vote as early as you can, I will leave the vote open until next 
>>> Monday (the 22nd), 2pm CST to give people plenty of time.
>>> 
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don't think this is a good idea because ...
>>> 
>>> 
>>> Thanks!
>>> Tom Graves


Re: [VOTE] Release Apache Spark 2.4.2

2019-04-19 Thread shane knapp
-1, as i'd like to be sure that the python test infra change for jenkins is
included (https://github.com/apache/spark/pull/24379)

On Fri, Apr 19, 2019 at 12:01 PM Michael Armbrust 
wrote:

> +1 (binding), we've tested this and it LGTM.
>
> On Thu, Apr 18, 2019 at 7:51 PM Wenchen Fan  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.4.2.
>>
>> The vote is open until April 23 PST and passes if a majority +1 PMC votes
>> are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.2
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.4.2-rc1 (commit
>> a44880ba74caab7a987128cb09c4bee41617770a):
>> https://github.com/apache/spark/tree/v2.4.2-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1322/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-docs/
>>
>> The list of bug fixes going into 2.4.2 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12344996
>>
>> FAQ
>>
>> =========================
>> How can I help test this release?
>> =========================
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala you
>> can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===========================================================
>> What should happen to JIRA tickets still targeting 2.4.2?
>> ===========================================================
>>
>> The current list of open tickets targeted at 2.4.2 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.2
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==================
>> But my bug isn't fixed?
>> ==================
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [VOTE] Release Apache Spark 2.4.2

2019-04-19 Thread Michael Armbrust
+1 (binding), we've tested this and it LGTM.

On Thu, Apr 18, 2019 at 7:51 PM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.2.
>
> The vote is open until April 23 PST and passes if a majority +1 PMC votes
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.2-rc1 (commit
> a44880ba74caab7a987128cb09c4bee41617770a):
> https://github.com/apache/spark/tree/v2.4.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1322/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-docs/
>
> The list of bug fixes going into 2.4.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12344996
>
> FAQ
>
> =========================
> How can I help test this release?
> =========================
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala you
> can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===========================================================
> What should happen to JIRA tickets still targeting 2.4.2?
> ===========================================================
>
> The current list of open tickets targeted at 2.4.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==================
> But my bug isn't fixed?
> ==================
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: pyspark.sql.functions ide friendly

2019-04-19 Thread educhana
It's not only the linter, but also autocompletion and help.

As an aside, some functions in the module are declared statically, and the
reason for the difference is not clear.
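
For context, below is a condensed sketch of the two styles under
discussion. The runtime variant is simplified from what
pyspark/sql/functions.py does; the static variant is roughly what a fix
would generate, not existing code.

    from pyspark import SparkContext
    from pyspark.sql.column import Column, _to_java_column

    # Runtime style (simplified): names are bound in a loop over a dict,
    # so linters and autocompletion never see `asc` defined anywhere.
    _functions = {
        "asc": "Returns an ascending sort expression for the given column.",
    }

    def _create_function(name, doc=""):
        def _(col):
            sc = SparkContext._active_spark_context
            return Column(getattr(sc._jvm.functions, name)(_to_java_column(col)))
        _.__name__ = name
        _.__doc__ = doc
        return _

    for _name, _doc in _functions.items():
        globals()[_name] = _create_function(_name, _doc)

    # Static style (what a fix would emit): one plain def per function,
    # visible to IDEs, with identical behavior.
    def asc(col):
        """Returns an ascending sort expression for the given column."""
        sc = SparkContext._active_spark_context
        return Column(sc._jvm.functions.asc(_to_java_column(col)))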

On 2019/04/17 11:35:53, Sean Owen  wrote: 
> I use IntelliJ and have never seen an issue parsing the pyspark
> functions... you're just saying the linter has an optional inspection
> to flag it? just disable that?
> I don't think we want to complicate the Spark code just for this. They
> are declared at runtime for a reason.
> 
> On Wed, Apr 17, 2019 at 6:27 AM educh...@gmail.com  wrote:
> >
> > Hi,
> >
> > I'm aware of various workarounds to make this work smoothly in various 
> > IDEs, but wouldn't it be better to solve the root cause?
> >
> > I've seen the code and don't see anything that requires such a level of 
> > dynamic code; the translation is 99% trivial.
> >
> > On 2019/04/16 12:16:41, 880f0464 <880f0...@protonmail.com.INVALID> wrote:
> > > Hi.
> > >
> > > That's a problem with Spark as such and in general can be addressed on 
> > > IDE to IDE basis - see for example https://stackoverflow.com/q/40163106 
> > > for some hints.
> > >
> > >
> > > Sent with ProtonMail Secure Email.
> > >
> > > ‐‐‐ Original Message ‐‐‐
> > > On Tuesday, April 16, 2019 2:10 PM, educhana  wrote:
> > >
> > > > Hi,
> > > >
> > > > Currently using pyspark.sql.functions from an IDE like PyCharm is 
> > > > causing
> > > > the linters to complain due to the functions being declared at runtime.
> > > >
> > > > Would a PR fixing this be welcomed? Are there any problems/difficulties 
> > > > I'm
> > > > unaware of?
> > > >
> > > >
> > > > --
> > > >
> > > > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/



Re: Spark 2.4.2

2019-04-19 Thread Sean Owen
While we're on this subject, there are two more dependency updates
that we could consider including in 2.4.2 on the same grounds, as
they're dependencies with CVEs. However, it's not clear whether the
CVEs actually affect Spark. These are already in master.

https://issues.apache.org/jira/browse/SPARK-27469
https://issues.apache.org/jira/browse/SPARK-27470


On Fri, Apr 19, 2019 at 11:13 AM Sean Owen  wrote:
>
> All: here is the backport of changes to update to 2.9.8 from master back to 
> 2.4.
> https://github.com/apache/spark/pull/24418
>
> master has been on 2.9.8 for a while, so the concern isn't Spark so
> much. It's that user apps would face the same change of behavior if
> they used Jackson in a similar way. I'm moderately in favor of
> updating just as it's come up several times, but it's debatable.
>
> On Fri, Apr 19, 2019 at 11:01 AM Alessandro Bellina
>  wrote:
> >
> > We ran unit tests locally with the patch in 
> > https://github.com/apache/spark/pull/21596 on top of our 2.4 branch, so I 
> > think that branch-2.4 should pass that part.
> >
> > As far as dependencies go, things get murkier. I went through the dependencies 
> > where we have exclude clauses for databind. Here is the summary of those 
> > versions that were different than 2.9.6 (i.e. kafka-0.10 is using 2.9.6, so 
> > not listed below):
> >
> > hadoop is using jackson version 2.2.3
> > org.json4s moved to jackson.version 2.9.8, perhaps we should move to 2.9.8 
> > instead? (though it should be compatible with 2.9.6)
> > calcite 1.2.0-incubating uses jackson version 2.1.1, but it has moved to 
> > 2.9.6 in more recent versions.
> > arrow 0.10.0 uses jackson version 2.7.9, but moved to 2.9.8 in version 
> > apache-arrow-0.13.0.
> > io.fabric8 3.0+ uses jackson version 2.7.4.
> >
> > The risk of staying with the older version is in user code (or other 
> > libraries) not doing proper input data validation. The risk of moving up is 
> > that we can introduce instability, as there are a lot of moving parts. If we can 
> > think of a set of tests that can be done here, either manually or 
> > automated, I think that would be worthwhile to contribute.
> >
> > On Thu, Apr 18, 2019 at 9:53 PM Wenchen Fan  wrote:
> >>
> >> I've cut RC1. If people think we must upgrade Jackson in 2.4, I can cut 
> >> RC2 shortly.
> >>
> >> Thanks,
> >> Wenchen
> >>
> >> On Fri, Apr 19, 2019 at 3:32 AM Felix Cheung  
> >> wrote:
> >>>
> >>> Re shading - same argument I’ve made earlier today in a PR...
> >>>
> >>> (Context- in many cases Spark has light or indirect dependencies but 
> >>> bringing them into the process breaks users code easily)
> >>>
> >>>
> >>> 
> >>> From: Michael Heuer 
> >>> Sent: Thursday, April 18, 2019 6:41 AM
> >>> To: Reynold Xin
> >>> Cc: Sean Owen; Michael Armbrust; Ryan Blue; Spark Dev List; Wenchen Fan; 
> >>> Xiao Li
> >>> Subject: Re: Spark 2.4.2
> >>>
> >>> +100
> >>>
> >>>
> >>> On Apr 18, 2019, at 1:48 AM, Reynold Xin  wrote:
> >>>
> >>> We should have shaded all Spark’s dependencies :(
> >>>
> >>> On Wed, Apr 17, 2019 at 11:47 PM Sean Owen  wrote:
> 
>  For users that would inherit Jackson and use it directly, or whose
>  dependencies do. Spark itself (with modifications) should be OK with
>  the change.
>  It's risky and normally wouldn't backport, except that I've heard a
>  few times about concerns about CVEs affecting Databind, so wondering
>  who else out there might have an opinion. I'm not pushing for it
>  necessarily.
> 
>  On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin  wrote:
>  >
>  > For Jackson - are you worrying about JSON parsing for users or 
>  > internal Spark functionality breaking?
>  >
>  > On Wed, Apr 17, 2019 at 6:02 PM Sean Owen  wrote:
>  >>
>  >> There's only one other item on my radar, which is considering updating
>  >> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
>  >> a few times now that there are a number of CVEs open for 2.6.7. Cons:
>  >> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
>  >> behavior non-trivially. That said back-porting the update PR to 2.4
>  >> worked out OK locally. Any strong opinions on this one?
>  >>
>  >> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan  
>  >> wrote:
>  >> >
>  >> > I volunteer to be the release manager for 2.4.2, as I was also 
>  >> > going to propose 2.4.2 because of the reverting of SPARK-25250. Is 
>  >> > there any other ongoing bug fixes we want to include in 2.4.2? If 
>  >> > no I'd like to start the release process today (CST).
>  >> >
>  >> > Thanks,
>  >> > Wenchen
>  >> >
>  >> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen  wrote:
>  >> >>
>  >> >> I think the 'only backport bug fixes to branches' principle 
>  >> >> remains sound. But what's a bug fix? Something that changes 
>  >> >> behavior to match what is explicitly 

Re: Spark 2.4.2

2019-04-19 Thread Sean Owen
All: here is the backport of changes to update to 2.9.8 from master back to 2.4.
https://github.com/apache/spark/pull/24418

master has been on 2.9.8 for a while, so the concern isn't Spark so
much. It's that user apps would face the same change of behavior if
they used Jackson in a similar way. I'm moderately in favor of
updating just as it's come up several times, but it's debatable.

On Fri, Apr 19, 2019 at 11:01 AM Alessandro Bellina
 wrote:
>
> We ran unit tests locally with the patch in 
> https://github.com/apache/spark/pull/21596 on top of our 2.4 branch, so I 
> think that branch-2.4 should pass that part.
>
> As far as dependencies go, things get murkier. I went through the dependencies 
> where we have exclude clauses for databind. Here is the summary of those 
> versions that were different than 2.9.6 (i.e. kafka-0.10 is using 2.9.6, so 
> not listed below):
>
> hadoop is using jackson version 2.2.3
> org.json4s moved to jackson.version 2.9.8, perhaps we should move to 2.9.8 
> instead? (though it should be compatible with 2.9.6)
> calcite 1.2.0-incubating uses jackson version 2.1.1, but it has moved to 
> 2.9.6 in more recent versions.
> arrow 0.10.0 uses jackson version 2.7.9, but moved to 2.9.8 in version 
> apache-arrow-0.13.0.
> io.fabric8 3.0+ uses jackson version 2.7.4.
>
> The risk of staying with the older version is in user code (or other 
> libraries) not doing proper input data validation. The risk of moving up is 
> that we can introduce instability, as there are a lot of moving parts. If we can 
> think of a set of tests that can be done here, either manually or automated, 
> I think that would be worthwhile to contribute.
>
> On Thu, Apr 18, 2019 at 9:53 PM Wenchen Fan  wrote:
>>
>> I've cut RC1. If people think we must upgrade Jackson in 2.4, I can cut RC2 
>> shortly.
>>
>> Thanks,
>> Wenchen
>>
>> On Fri, Apr 19, 2019 at 3:32 AM Felix Cheung  
>> wrote:
>>>
>>> Re shading - same argument I’ve made earlier today in a PR...
>>>
>>> (Context- in many cases Spark has light or indirect dependencies but 
>>> bringing them into the process breaks users' code easily)
>>>
>>>
>>> 
>>> From: Michael Heuer 
>>> Sent: Thursday, April 18, 2019 6:41 AM
>>> To: Reynold Xin
>>> Cc: Sean Owen; Michael Armbrust; Ryan Blue; Spark Dev List; Wenchen Fan; 
>>> Xiao Li
>>> Subject: Re: Spark 2.4.2
>>>
>>> +100
>>>
>>>
>>> On Apr 18, 2019, at 1:48 AM, Reynold Xin  wrote:
>>>
>>> We should have shaded all Spark’s dependencies :(
>>>
>>> On Wed, Apr 17, 2019 at 11:47 PM Sean Owen  wrote:

 For users that would inherit Jackson and use it directly, or whose
 dependencies do. Spark itself (with modifications) should be OK with
 the change.
 It's risky and normally wouldn't backport, except that I've heard a
 few times about concerns about CVEs affecting Databind, so wondering
 who else out there might have an opinion. I'm not pushing for it
 necessarily.

 On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin  wrote:
 >
 > For Jackson - are you worrying about JSON parsing for users or internal 
 > Spark functionality breaking?
 >
 > On Wed, Apr 17, 2019 at 6:02 PM Sean Owen  wrote:
 >>
 >> There's only one other item on my radar, which is considering updating
 >> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
 >> a few times now that there are a number of CVEs open for 2.6.7. Cons:
 >> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
 >> behavior non-trivially. That said back-porting the update PR to 2.4
 >> worked out OK locally. Any strong opinions on this one?
 >>
 >> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan  wrote:
 >> >
 >> > I volunteer to be the release manager for 2.4.2, as I was also going 
 >> > to propose 2.4.2 because of the reverting of SPARK-25250. Is there 
 >> > any other ongoing bug fixes we want to include in 2.4.2? If no I'd 
 >> > like to start the release process today (CST).
 >> >
 >> > Thanks,
 >> > Wenchen
 >> >
 >> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen  wrote:
 >> >>
 >> >> I think the 'only backport bug fixes to branches' principle remains 
 >> >> sound. But what's a bug fix? Something that changes behavior to 
 >> >> match what is explicitly supposed to happen, or implicitly supposed 
 >> >> to happen -- implied by what other similar things do, by reasonable 
 >> >> user expectations, or simply how it worked previously.
 >> >>
 >> >> Is this a bug fix? I guess the criteria that matches is that 
 >> >> behavior doesn't match reasonable user expectations? I don't know 
 >> >> enough to have a strong opinion. I also don't think there is 
 >> >> currently an objection to backporting it, whatever it's called.
 >> >>
 >> >>
 >> >> Is the question whether this needs a new release? There's no harm in 
 >> >> another point release, 

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread Bryan Cutler
+1 (non-binding)

On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:

> +1 (non-binding).  Looking forward to seeing better support for processing
> columnar data.
>
> Jason
>
> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves 
> wrote:
>
>> Hi everyone,
>>
>> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
>> extended Columnar Processing Support.  The proposal is to extend the
>> support to allow for more columnar processing.
>>
>> You can find the full proposal in the jira at:
>> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
>> DISCUSS thread in the dev mailing list.
>>
>> Please vote as early as you can, I will leave the vote open until next
>> Monday (the 22nd), 2pm CST to give people plenty of time.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>>
>> Thanks!
>> Tom Graves
>>
>


Re: Spark 2.4.2

2019-04-19 Thread Driesprong, Fokko
For me a +1 on upgrading Jackson as well. This has been long overdue. There
are some behavioural changes regarding handling null/None. This is also
described in the PR:
https://github.com/apache/spark/pull/21596

It also has a positive impact on performance.

Cheers, Fokko

On Fri, 19 Apr 2019 at 19:16, Arun Mahadevan wrote:

> +1 to upgrade Jackson. It has come up multiple times due to CVEs, and the
> backport has worked out, but it may be good to include if it's not going to
> delay the release.
>
> On Thu, 18 Apr 2019 at 19:53, Wenchen Fan  wrote:
>
>> I've cut RC1. If people think we must upgrade Jackson in 2.4, I can cut
>> RC2 shortly.
>>
>> Thanks,
>> Wenchen
>>
>> On Fri, Apr 19, 2019 at 3:32 AM Felix Cheung 
>> wrote:
>>
>>> Re shading - same argument I’ve made earlier today in a PR...
>>>
>>> (Context- in many cases Spark has light or indirect dependencies but
>>> bringing them into the process breaks users' code easily)
>>>
>>>
>>> --
>>> *From:* Michael Heuer 
>>> *Sent:* Thursday, April 18, 2019 6:41 AM
>>> *To:* Reynold Xin
>>> *Cc:* Sean Owen; Michael Armbrust; Ryan Blue; Spark Dev List; Wenchen
>>> Fan; Xiao Li
>>> *Subject:* Re: Spark 2.4.2
>>>
>>> +100
>>>
>>>
>>> On Apr 18, 2019, at 1:48 AM, Reynold Xin  wrote:
>>>
>>> We should have shaded all Spark’s dependencies :(
>>>
>>> On Wed, Apr 17, 2019 at 11:47 PM Sean Owen  wrote:
>>>
 For users that would inherit Jackson and use it directly, or whose
 dependencies do. Spark itself (with modifications) should be OK with
 the change.
 It's risky and normally wouldn't backport, except that I've heard a
 few times about concerns about CVEs affecting Databind, so wondering
 who else out there might have an opinion. I'm not pushing for it
 necessarily.

 On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin 
 wrote:
 >
 > For Jackson - are you worrying about JSON parsing for users or
 internal Spark functionality breaking?
 >
 > On Wed, Apr 17, 2019 at 6:02 PM Sean Owen  wrote:
 >>
 >> There's only one other item on my radar, which is considering
 updating
 >> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come
 up
 >> a few times now that there are a number of CVEs open for 2.6.7. Cons:
 >> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
 >> behavior non-trivially. That said back-porting the update PR to 2.4
 >> worked out OK locally. Any strong opinions on this one?
 >>
 >> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan 
 wrote:
 >> >
 >> > I volunteer to be the release manager for 2.4.2, as I was also
 going to propose 2.4.2 because of the reverting of SPARK-25250. Is there
 any other ongoing bug fixes we want to include in 2.4.2? If no I'd like to
 start the release process today (CST).
 >> >
 >> > Thanks,
 >> > Wenchen
 >> >
 >> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen 
 wrote:
 >> >>
 >> >> I think the 'only backport bug fixes to branches' principle
 remains sound. But what's a bug fix? Something that changes behavior to
 match what is explicitly supposed to happen, or implicitly supposed to
 happen -- implied by what other similar things do, by reasonable user
 expectations, or simply how it worked previously.
 >> >>
 >> >> Is this a bug fix? I guess the criteria that matches is that
 behavior doesn't match reasonable user expectations? I don't know enough to
 have a strong opinion. I also don't think there is currently an objection
 to backporting it, whatever it's called.
 >> >>
 >> >>
 >> >> Is the question whether this needs a new release? There's no harm
 in another point release, other than needing a volunteer release manager.
 One could say, wait a bit longer to see what more info comes in about
 2.4.1. But given that 2.4.1 took like 2 months, it's reasonable to move
 towards a release cycle again. I don't see objection to that either (?)
 >> >>
 >> >>
 >> >> The meta question remains: is a 'bug fix' definition even agreed,
 and being consistently applied? There aren't correct answers, only best
 guesses from each person's own experience, judgment and priorities. These
 can differ even when applied in good faith.
 >> >>
 >> >> Sometimes the variance of opinion comes because people have
 different info that needs to be surfaced. Here, maybe it's best to share
 what about that offline conversation was convincing, for example.
 >> >>
 >> >> I'd say it's also important to separate what one would prefer
 from what one can't live with(out). Assuming one trusts the intent and
 experience of the handful of others with an opinion, I'd defer to someone
 who wants X and will own it, even if I'm moderately against it. Otherwise
 we'd get little done.
 >> >>
 >> >> In that light, it seems like both 

Re: Spark 2.4.2

2019-04-19 Thread Arun Mahadevan
+1 to upgrade Jackson. It has come up multiple times due to CVEs, and the
backport has worked out, but it may be good to include if it's not going to
delay the release.

On Thu, 18 Apr 2019 at 19:53, Wenchen Fan  wrote:

> I've cut RC1. If people think we must upgrade Jackson in 2.4, I can cut
> RC2 shortly.
>
> Thanks,
> Wenchen
>
> On Fri, Apr 19, 2019 at 3:32 AM Felix Cheung 
> wrote:
>
>> Re shading - same argument I’ve made earlier today in a PR...
>>
>> (Context- in many cases Spark has light or indirect dependencies but
>> bringing them into the process breaks users' code easily)
>>
>>
>> --
>> *From:* Michael Heuer 
>> *Sent:* Thursday, April 18, 2019 6:41 AM
>> *To:* Reynold Xin
>> *Cc:* Sean Owen; Michael Armbrust; Ryan Blue; Spark Dev List; Wenchen
>> Fan; Xiao Li
>> *Subject:* Re: Spark 2.4.2
>>
>> +100
>>
>>
>> On Apr 18, 2019, at 1:48 AM, Reynold Xin  wrote:
>>
>> We should have shaded all Spark’s dependencies :(
>>
>> On Wed, Apr 17, 2019 at 11:47 PM Sean Owen  wrote:
>>
>>> For users that would inherit Jackson and use it directly, or whose
>>> dependencies do. Spark itself (with modifications) should be OK with
>>> the change.
>>> It's risky and normally wouldn't backport, except that I've heard a
>>> few times about concerns about CVEs affecting Databind, so wondering
>>> who else out there might have an opinion. I'm not pushing for it
>>> necessarily.
>>>
>>> On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin  wrote:
>>> >
>>> > For Jackson - are you worrying about JSON parsing for users or
>>> internal Spark functionality breaking?
>>> >
>>> > On Wed, Apr 17, 2019 at 6:02 PM Sean Owen  wrote:
>>> >>
>>> >> There's only one other item on my radar, which is considering updating
>>> >> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
>>> >> a few times now that there are a number of CVEs open for 2.6.7. Cons:
>>> >> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
>>> >> behavior non-trivially. That said back-porting the update PR to 2.4
>>> >> worked out OK locally. Any strong opinions on this one?
>>> >>
>>> >> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan 
>>> wrote:
>>> >> >
>>> >> > I volunteer to be the release manager for 2.4.2, as I was also
>>> going to propose 2.4.2 because of the reverting of SPARK-25250. Is there
>>> any other ongoing bug fixes we want to include in 2.4.2? If no I'd like to
>>> start the release process today (CST).
>>> >> >
>>> >> > Thanks,
>>> >> > Wenchen
>>> >> >
>>> >> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen  wrote:
>>> >> >>
>>> >> >> I think the 'only backport bug fixes to branches' principle
>>> remains sound. But what's a bug fix? Something that changes behavior to
>>> match what is explicitly supposed to happen, or implicitly supposed to
>>> happen -- implied by what other similar things do, by reasonable user
>>> expectations, or simply how it worked previously.
>>> >> >>
>>> >> >> Is this a bug fix? I guess the criteria that matches is that
>>> behavior doesn't match reasonable user expectations? I don't know enough to
>>> have a strong opinion. I also don't think there is currently an objection
>>> to backporting it, whatever it's called.
>>> >> >>
>>> >> >>
>>> >> >> Is the question whether this needs a new release? There's no harm
>>> in another point release, other than needing a volunteer release manager.
>>> One could say, wait a bit longer to see what more info comes in about
>>> 2.4.1. But given that 2.4.1 took like 2 months, it's reasonable to move
>>> towards a release cycle again. I don't see objection to that either (?)
>>> >> >>
>>> >> >>
>>> >> >> The meta question remains: is a 'bug fix' definition even agreed,
>>> and being consistently applied? There aren't correct answers, only best
>>> guesses from each person's own experience, judgment and priorities. These
>>> can differ even when applied in good faith.
>>> >> >>
>>> >> >> Sometimes the variance of opinion comes because people have
>>> different info that needs to be surfaced. Here, maybe it's best to share
>>> what about that offline conversation was convincing, for example.
>>> >> >>
>>> >> >> I'd say it's also important to separate what one would prefer from
>>> what one can't live with(out). Assuming one trusts the intent and
>>> experience of the handful of others with an opinion, I'd defer to someone
>>> who wants X and will own it, even if I'm moderately against it. Otherwise
>>> we'd get little done.
>>> >> >>
>>> >> >> In that light, it seems like both of the PRs at issue here are not
>>> _wrong_ to backport. This is a good pair that highlights why, when there
>>> isn't a clear reason to do / not do something (e.g. obvious errors,
>>> breaking public APIs) we give benefit-of-the-doubt in order to get it later.
>>> >> >>
>>> >> >>
>>> >> >> On Wed, Apr 17, 2019 at 12:09 PM Ryan Blue <
>>> rb...@netflix.com.invalid> wrote:
>>> >> >>>
>>> >> >>> Sorry, I should be more clear about what I'm trying to say here.

DataSourceV2 sync, 17 April 2019

2019-04-19 Thread Ryan Blue
Here are my notes from the last DSv2 sync. As always:

   - If you’d like to attend the sync, send me an email and I’ll add you to
   the invite. Everyone is welcome.
   - These notes are what I wrote down and remember. If you have
   corrections or comments, please reply.

*Topics*:

   - TableCatalog PR #24246: https://github.com/apache/spark/pull/24246
   - Remove SaveMode PR #24233: https://github.com/apache/spark/pull/24233
   - Streaming capabilities PR #24129:
   https://github.com/apache/spark/pull/24129

*Attendees*:

Ryan Blue
John Zhuge
Matt Cheah
Yifei Huang
Bruce Robbins
Jamison Bennett
Russell Spitzer
Wenchen Fan
Yuanjian Li

(and others who arrived after the start)

*Discussion*:

   - TableCatalog PR: https://github.com/apache/spark/pull/24246
  - Wenchen and Matt had just reviewed the PR. Mostly what was in the
  SPIP so not much discussion of content.
  - Wenchen: Easier to review if the changes to move Table and
  TableCapability were in a separate PR (mostly import changes)
  - Ryan will open a separate PR for the move [Ed: #24410]
  - Russell: How should caching work? Has hit lots of problems with
  Spark caching data and getting out of date
  - Ryan: Spark should always call into the catalog and not cache to
  avoid those problems. However, Spark should ensure that it uses the same
  instance of a Table for all scans in the same query, for consistent
  self-joins.
  - Some discussion of self joins. Conclusion was that we don’t need to
  worry about this yet because it is unlikely.
  - Wenchen: should this include the namespace methods?
  - Ryan: No, those are a separate concern and can be added in a
  parallel PR.
   - Remove SaveMode PR: https://github.com/apache/spark/pull/24233
  - Wenchen: PR is on hold waiting for streaming capabilities, #24129,
  because the Noop sink doesn’t validate schema
  - Wenchen will open a PR to add a capability to opt out of schema
  validation, then come back to this PR.
   - Streaming capabilities PR: https://github.com/apache/spark/pull/24129
  - Ryan: This PR needs validation in the analyzer. The analyzer is
  where validations should exist, or else validations must be copied into
  every code path that produces a streaming plan.
  - Wenchen: the write check can’t be written because the write node is
  never passed to the analyzer. Fixing that is a larger problem.
  - Ryan: Agree that refactoring to pass the write node to the analyzer
  should be separate.
  - Wenchen: a check to ensure that either microbatch or continuous can
  be used is hard because some sources may fall back
  - Ryan: By the time this check runs, fallback has happened. Do v1
  sources support continuous mode?
  - Wenchen: No, v1 doesn’t support continuous
  - Ryan: Then this can be written to assume that v1 sources only
  support microbatch mode.
  - Wenchen will add this check
  - Wenchen: the check that tables in a v2 streaming relation support
   either microbatch or continuous won't catch anything and is unnecessary
  - Ryan: These checks still need to be in the analyzer so future uses
  do not break. We had the same problem moving to v2: because schema checks
  were specific to DataSource code paths, they were overlooked when adding
  v2. Running validations in the analyzer avoids problems like this.
  - Wenchen will add the validation.
   - Matt: Will v2 be ready in time for the 3.0 release?
  - Ryan: Once #24246 is in, we can work on PRs in parallel, but it is
  not looking good.

-- 
Ryan Blue
Software Engineer
Netflix


Re: [SPARK-25079][build system] the future of python3.6 is upon us!

2019-04-19 Thread shane knapp
and this is done.

welcome to the brave new world of python3.6!
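
for anyone who wants to verify that a worker picked up the change, a
quick spot-check (a sketch, assuming the updated env is already
activated; expected versions are from the footnote quoted below):

    import sys

    import pandas
    import pyarrow

    print("python :", sys.version.split()[0])  # expect 3.6.x
    print("pandas :", pandas.__version__)      # 0.19.2 on 2.3/2.4, 0.23.2 on master
    print("pyarrow:", pyarrow.__version__)     # 0.8.0 on 2.3/2.4, 0.12.1 on master
    assert sys.version_info[:2] == (3, 6), "env was not updated to python3.6"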

On Fri, Apr 19, 2019 at 9:34 AM shane knapp  wrote:

> i will actually be doing this now!
>
>
>
> On Thu, Apr 18, 2019 at 2:57 PM shane knapp  wrote:
>
>> well, upon us on monday.  :)
>>
>> firstly, an important note:  if you have an open PR, please check to see
>> if you need to rebase my changes on it before testing.
>>
>> monday @ 11am PST, i will begin.  in order:
>>
>> 0) jenkins enters quiet mode, running PRB builds cancelled
>>
>> 1)  existing p3k env on all workers will be updated to python3.6  [1]
>> 1a)  spot-check for the random 'us/pacific-new' bug
>>
>> 3)  remove the TODOs from the three PRs and merge
>>
>> 4)  jenkins exits quiet mode, builds launch
>>
>> 5)  ~5 hours later i'll check back in and make sure we're good.  :)
>>
>> steps 1-4 shouldn't take more than an hour and i really expect things to
>> be back up and running pretty quickly.  i will send updates as needed.
>>
>> shane
>>
>> 1--   this will be for 2.3/2.4 only, and tests against pandas 0.19.2 and
>> pyarrow 0.8.0.  master tests against pandas 0.23.2 and pyarrow 0.12.1
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [SPARK-25079][build system] the future of python3.6 is upon us!

2019-04-19 Thread shane knapp
i will actually be doing this now!



On Thu, Apr 18, 2019 at 2:57 PM shane knapp  wrote:

> well, upon us on monday.  :)
>
> firstly, an important note:  if you have an open PR, please check to see
> if you need to rebase my changes on it before testing.
>
> monday @ 11am PST, i will begin.  in order:
>
> 0) jenkins enters quiet mode, running PRB builds cancelled
>
> 1)  existing p3k env on all workers will be updated to python3.6  [1]
> 1a)  spot-check for the random 'us/pacific-new' bug
>
> 3)  remove the TODOs from the three PRs and merge
>
> 4)  jenkins exits quiet mode, builds launch
>
> 5)  ~5 hours later i'll check back in and make sure we're good.  :)
>
> steps 1-4 shouldn't take more than an hour and i really expect things to
> be back up and running pretty quickly.  i will send updates as needed.
>
> shane
>
> 1--   this will be for 2.3/2.4 only, and tests against pandas 0.19.2 and
> pyarrow 0.8.0.  master tests against pandas 0.23.2 and pyarrow 0.12.1
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: pyspark.sql.functions ide friendly

2019-04-19 Thread Hyukjin Kwon
+1 I'm good with changing too.

On Thu, 18 Apr 2019, 01:18 Reynold Xin,  wrote:

> Are you talking about the ones that are defined in a dictionary? If yes,
> that was actually not that great in hindsight (makes it harder to read &
> change), so I'm OK changing it.
>
> E.g.
>
> _functions = {
>     'lit': _lit_doc,
>     'col': 'Returns a :class:`Column` based on the given column name.',
>     'column': 'Returns a :class:`Column` based on the given column name.',
>     'asc': 'Returns a sort expression based on the ascending order of the given column name.',
>     'desc': 'Returns a sort expression based on the descending order of the given column name.',
> }
>
>
> On Wed, Apr 17, 2019 at 4:35 AM, Sean Owen  wrote:
>
>> I use IntelliJ and have never seen an issue parsing the pyspark
>> functions... you're just saying the linter has an optional inspection to
>> flag it? just disable that?
>> I don't think we want to complicate the Spark code just for this. They
>> are declared at runtime for a reason.
>>
>> On Wed, Apr 17, 2019 at 6:27 AM educh...@gmail.com 
>> wrote:
>>
>> Hi,
>>
>> I'm aware of various workarounds to make this work smoothly in various
>> IDEs, but wouldn't it be better to solve the root cause?
>>
>> I've seen the code and don't see anything that requires such a level of
>> dynamic code; the translation is 99% trivial.
>>
>> On 2019/04/16 12:16:41, 880f0464 <880f0...@protonmail.com.INVALID>
>> wrote:
>>
>> Hi.
>>
>> That's a problem with Spark as such and in general can be addressed on
>> IDE to IDE basis - see for example https://stackoverflow.com/q/40163106
>> for some hints.
>>
>> Sent with ProtonMail Secure Email.
>>
>> ‐‐‐ Original Message ‐‐‐
>> On Tuesday, April 16, 2019 2:10 PM, educhana  wrote:
>>
>> Hi,
>>
>> Currently using pyspark.sql.functions from an IDE like PyCharm is causing
>> the linters to complain due to the functions being declared at runtime.
>>
>> Would a PR fixing this be welcomed? Are there any problems/difficulties
>> I'm unaware of?
>>
>> --
>>
>>
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>
>


In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2019-04-19 Thread Hyukjin Kwon
Hi all,

It looks like 'spark/dev/github_jira_sync.py' is not running correctly somewhere.
Usually a JIRA's status should be updated to "IN PROGRESS" when
somebody opens a PR against it.
It looks like the script now only leaves a link and does not change the
JIRA's status.

Can someone who knows where it's running check this?

FWIW, I check every PR and JIRA almost every day, but ever since this
happened I have to check the JIRAs twice. Previously, when I went through
all the PRs and JIRAs, there was no duplication, because JIRAs with open
PRs had a different status, "IN PROGRESS"; now all JIRAs have "OPEN" status.
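
For whoever picks this up, here is a sketch of the transition the sync
script should be applying (assuming the third-party `jira` client library
and valid credentials; the issue key and transition name are illustrative
and depend on the workflow):

    from jira import JIRA  # third-party client: pip install jira

    client = JIRA(server="https://issues.apache.org/jira",
                  basic_auth=("user", "password"))  # placeholder credentials

    issue = client.issue("SPARK-12345")  # hypothetical JIRA with a fresh PR link
    transitions = {t["name"]: t["id"] for t in client.transitions(issue)}
    # "Start Progress" is the default transition to IN PROGRESS; the name
    # may differ under a customized workflow.
    if issue.fields.status.name == "Open" and "Start Progress" in transitions:
        client.transition_issue(issue, transitions["Start Progress"])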

Thanks.