DataSourceV2 sync notes - 12 June 2019

2019-06-14 Thread Ryan Blue
Here are the latest DSv2 sync notes. Please reply with updates or
corrections.

*Attendees*:

Ryan Blue
Michael Armbrust
Gengliang Wang
Matt Cheah
John Zhuge

*Topics*:

Wenchen’s reorganization proposal
Problems with TableProvider - property map isn’t sufficient

New PRs:

   - ReplaceTable: https://github.com/apache/spark/pull/24798
   - V2 Table Resolution: https://github.com/apache/spark/pull/24741
   - V2 Session Catalog: https://github.com/apache/spark/pull/24768

*Discussion*:

   - Wenchen’s organization proposal
  - Ryan: Wenchen proposed using
  `org.apache.spark.sql.connector.{catalog, expressions, read, write,
  extensions}`
  - Ryan: I’m not sure we need extensions, but otherwise it looks good
  to me
  - Matt: This is in the catalyst module, right?
  - Ryan: Right. The API is in catalyst. The extensions package would
  be used for any parts that need to be in SQL, but hopefully there aren’t
  any.
  - Consensus was to go with the proposed organization
   - Problems with TableProvider:
  - Gengliang: CREATE TABLE with an ORC v2 table can’t report its
  schema because there are no files
  - Ryan: We hit this when trying to use ORC in SQL unit tests for v2.
  The problem is that the source can’t be passed the schema and other
  information
  - Gengliang: Schema could be passed using the userSpecifiedSchema arg
  - Ryan: The user schema is for cases where the data lacks specific
  types and a user supplies them, like CSV. I don’t think it makes sense to
  reuse that to pass the schema from the catalog
  - Ryan: Other table metadata should be passed as well, like
  partitioning, so sources don’t infer it. I think this requires some
  thought. Anyone want to volunteer?
  - No one volunteered to fix the problem (see the sketch at the end of
  these notes for one possible shape)
   - ReplaceTable PR
  - Matt: Needs another update after comments, but about ready
  - Ryan: I agree it is almost ready to commit. I should point out that
  this includes a default implementation that will leave a table deleted if
  the write fails. I think this is expected behavior because REPLACE is a
  DROP combined with CTAS
  - Michael: Sources should be able to opt out of that behavior
  - Ryan: We want to ensure consistent behavior across sources
  - Resolution: sources can implement the staging and throw an
  exception if they choose to opt out
   - V2 table resolution:
  - John: this should be ready to go, only minor comments from Dongjoon
  left
  - This was merged the next day
   - V2 session catalog
  - Ryan: When testing, we realized that if a default catalog is used
  for v2 sources (like ORC v2) then you can run CREATE TABLE, which goes to
  some v2 catalog, but then you can’t load the same table using the same
  name because the session catalog doesn’t have it.
  - Ryan: To fix this, we need a v2 catalog that delegates to the
  session catalog. This should be used for all v2 operations when the
  session catalog can’t be used.
  - Ryan: Then the v2 default catalog should be used instead of the
  session catalog when it is set. This provides a smooth transition from
  the session catalog to v2 catalogs.
   - Gengliang: another topic: decimals
  - Gengliang: v2 doesn’t insert unsafe casts, so literals in SQL
  cannot be inserted to double/float columns
  - Michael: Shouldn’t queries use decimal literals so that floating
  point literals can be floats? What do other databases do?
  - Matt: is this a v2 problem?
  - Ryan: this is not specific to v2 and was discovered when converting
  v1 to use the v2 output rules
  - Ryan: we could add a new decimal type that doesn’t lose data but is
  allowed to be cast because it can only be used for literals where the
  intended type is unknown. There is precedent for this in the parser with
  Hive char and varchar types.
  - Conclusion: This isn’t really a v2 problem
   - Michael: Any work so far on MERGE INTO?
  - Ryan: Not yet, but feel free to make a proposal and start working
  - Ryan: Do you also need to pass extra metadata with each row?
  - Michael: No, this should be delegated to the source
  - Matt: That would be operator push-down
  - Ryan: I agree, that’s operator push-down. It would be great to hear
  how that would work, but I think MERGE INTO should have a default
  implementation. It should be supported across sources instead of in just
  one so we have a reference implementation.
  - Michael: Having only a reference implementation was the problem
  with v1. The behavior should be written down in a spec. Hive has a
  reasonable implementation to follow.
  - Ryan: Yes, but it is still valuable to have a reference
  implementation. And of course a spec is needed.
   - Matt: what does the roadmap look like for finishing in time for Spark
   3.0?
  - Ry
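
To make the TableProvider discussion above concrete, here is a rough,
hypothetical Scala sketch of the kind of change being asked for: letting a
CREATE TABLE code path hand the catalog's schema and partitioning to the
source instead of forcing the source to infer them from files. The trait and
method names below (SupportsExternalMetadata, the extra getTable overload,
partitioning passed as strings) are placeholders for illustration only, not
an agreed-on API.

import org.apache.spark.sql.types.StructType

// Stand-ins for the connector-side types under discussion.
trait Table {
  def name(): String
  def schema(): StructType
}

trait TableProvider {
  // Today-style entry point: the source only gets an options map, so an empty
  // ORC v2 table cannot report a schema because there are no files to read.
  def getTable(options: Map[String, String]): Table
}

// Hypothetical extension: a CREATE TABLE path could pass the catalog-supplied
// schema and partitioning so the source never has to infer them.
trait SupportsExternalMetadata extends TableProvider {
  def getTable(
      schema: StructType,
      partitioning: Seq[String],
      options: Map[String, String]): Table
}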

Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Bryan Cutler
Yeah, PyArrow is the only other PySpark dependency we check for a minimum
version. We updated that not too long ago to be 0.12.1, which I think we
are still good on for now.

On Fri, Jun 14, 2019 at 11:36 AM Felix Cheung 
wrote:

> How about pyArrow?
>
> --
> *From:* Holden Karau 
> *Sent:* Friday, June 14, 2019 11:06:15 AM
> *To:* Felix Cheung
> *Cc:* Bryan Cutler; Dongjoon Hyun; Hyukjin Kwon; dev; shane knapp
> *Subject:* Re: [DISCUSS] Increasing minimum supported version of Pandas
>
> Are there other Python dependencies we should consider upgrading at the
> same time?
>
> On Fri, Jun 14, 2019 at 7:45 PM Felix Cheung 
> wrote:
>
>> So to be clear, min version check is 0.23
>> Jenkins test is 0.24
>>
>> I’m ok with this. I hope someone will test 0.23 on releases though before
>> we sign off?
>>
> We should maybe add this to the release instruction notes?
>
>>
>> --
>> *From:* shane knapp 
>> *Sent:* Friday, June 14, 2019 10:23:56 AM
>> *To:* Bryan Cutler
>> *Cc:* Dongjoon Hyun; Holden Karau; Hyukjin Kwon; dev
>> *Subject:* Re: [DISCUSS] Increasing minimum supported version of Pandas
>>
>> excellent.  i shall not touch anything.  :)
>>
>> On Fri, Jun 14, 2019 at 10:22 AM Bryan Cutler  wrote:
>>
>>> Shane, I think 0.24.2 is probably more common right now, so if we were
>>> to pick one to test against, I still think it should be that one. Our
>>> Pandas usage in PySpark is pretty conservative, so it's pretty unlikely
>>> that we will add something that would break 0.23.X.
>>>
>>> On Fri, Jun 14, 2019 at 10:10 AM shane knapp 
>>> wrote:
>>>
 ah, ok...  should we downgrade the testing env on jenkins then?  any
 specific version?

 shane, who is loathe (and i mean LOATHE) to touch python envs ;)

 On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler 
 wrote:

> I should have stated this earlier, but when the user does something
> that requires Pandas, the minimum version is checked against what was
> imported and will raise an exception if it is a lower version. So I'm
> concerned that using 0.24.2 might be a little too new for users running
> older clusters. To give some release dates, 0.23.2 was released about a
> year ago, 0.24.0 in January and 0.24.2 in March.
>
 I think given that we’re switching to requiring Python 3, and also a bit
> of a way from cutting a release, 0.24 could be OK as a min version
> requirement
>
>>
>
> On Fri, Jun 14, 2019 at 9:27 AM shane knapp 
> wrote:
>
>> just so everyone knows, our python 3.6 testing infra is currently on
>> 0.24.2...
>>
>> On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Thank you for this effort, Bryan!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
>>> wrote:
>>>
 I’m +1 for upgrading, although since this is probably the last easy
 chance we’ll have to bump version numbers easily I’d suggest 0.24.2


 On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
 wrote:

> I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow
> and pandas combinations. Spark 3 should be good time to increase.
>
> On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote:
>
>> Hi All,
>>
>> We would like to discuss increasing the minimum supported version
>> of Pandas in Spark, which is currently 0.19.2.
>>
>> Pandas 0.19.2 was released nearly 3 years ago and there are some
>> workarounds in PySpark that could be removed if such an old version 
>> is not
>> required. This will help to keep code clean and reduce maintenance 
>> effort.
>>
>> The change is targeted for Spark 3.0.0 release, see
>> https://issues.apache.org/jira/browse/SPARK-28041. The current
>> thought is to bump the version to 0.23.2, but we would like to 
>> discuss
>> before making a change. Does anyone else have thoughts on this?
>>
>> Regards,
>> Bryan
>>
> --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

 --
 Shane Knapp
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High P

Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-14 Thread Imran Rashid
 +1 (binding)

I think this is a really important feature for spark.

First, there is already a lot of interest in alternative shuffle storage in
the community, from dynamic allocation in kubernetes, to even just improving
stability in standard on-premise use of Spark.  However, those efforts are
often stuck in forks of Spark, and done in ways that are not maintainable
(because they copy-paste many spark internals) or are incorrect (for not
correctly handling speculative execution & stage retries).

Second, I think the specific proposal strikes the right balance between
flexibility and complexity, and allows incremental improvements.  A lot of
work has already gone into figuring out which pieces are essential to make
alternative shuffle storage implementations feasible.

Of course, that means it doesn't include everything imaginable; some things
still aren't supported, and some will still choose to use the older
ShuffleManager api to give total control over all of shuffle.  But we know
there is a reasonable set of things which can be implemented behind the
api as the first step, and it can continue to evolve.

On Fri, Jun 14, 2019 at 12:13 PM Ilan Filonenko  wrote:

> +1 (non-binding). This API is versatile and flexible enough to handle
> Bloomberg's internal use-cases. The ability for us to vary implementation
> strategies is quite appealing. It is also worth noting the minimal changes
> to Spark core needed to make it work. This is a much-needed addition
> to the Spark shuffle story.
>
> On Fri, Jun 14, 2019 at 9:59 AM bo yang  wrote:
>
>> +1 This is great work, allowing plugin of different sort shuffle
>> write/read implementation! Also great to see it retain the current Spark
>> configuration
>> (spark.shuffle.manager=org.apache.spark.shuffle.YourShuffleManagerImpl).
>>
>>
>> On Thu, Jun 13, 2019 at 2:58 PM Matt Cheah  wrote:
>>
>>> Hi everyone,
>>>
>>>
>>>
>>> I would like to call a vote for the SPIP for SPARK-25299
>>> , which proposes to
>>> introduce a pluggable storage API for temporary shuffle data.
>>>
>>>
>>>
>>> You may find the SPIP document here
>>> 
>>> .
>>>
>>>
>>>
>>> The discussion thread for the SPIP was conducted here
>>> 
>>> .
>>>
>>>
>>>
>>> Please vote on whether or not this proposal is agreeable to you.
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>> -Matt Cheah
>>>
>>


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Felix Cheung
How about pyArrow?


From: Holden Karau 
Sent: Friday, June 14, 2019 11:06:15 AM
To: Felix Cheung
Cc: Bryan Cutler; Dongjoon Hyun; Hyukjin Kwon; dev; shane knapp
Subject: Re: [DISCUSS] Increasing minimum supported version of Pandas

Are there other Python dependencies we should consider upgrading at the same 
time?

On Fri, Jun 14, 2019 at 7:45 PM Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
So to be clear, min version check is 0.23
Jenkins test is 0.24

I’m ok with this. I hope someone will test 0.23 on releases though before we 
sign off?
We should maybe add this to the release instruction notes?


From: shane knapp mailto:skn...@berkeley.edu>>
Sent: Friday, June 14, 2019 10:23:56 AM
To: Bryan Cutler
Cc: Dongjoon Hyun; Holden Karau; Hyukjin Kwon; dev
Subject: Re: [DISCUSS] Increasing minimum supported version of Pandas

excellent.  i shall not touch anything.  :)

On Fri, Jun 14, 2019 at 10:22 AM Bryan Cutler 
mailto:cutl...@gmail.com>> wrote:
Shane, I think 0.24.2 is probably more common right now, so if we were to pick 
one to test against, I still think it should be that one. Our Pandas usage in 
PySpark is pretty conservative, so it's pretty unlikely that we will add 
something that would break 0.23.X.

On Fri, Jun 14, 2019 at 10:10 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
ah, ok...  should we downgrade the testing env on jenkins then?  any specific 
version?

shane, who is loathe (and i mean LOATHE) to touch python envs ;)

On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler 
mailto:cutl...@gmail.com>> wrote:
I should have stated this earlier, but when the user does something that 
requires Pandas, the minimum version is checked against what was imported and 
will raise an exception if it is a lower version. So I'm concerned that using 
0.24.2 might be a little too new for users running older clusters. To give some 
release dates, 0.23.2 was released about a year ago, 0.24.0 in January and 
0.24.2 in March.
I think given that we’re switching to requiring Python 3, and also a bit of a
way from cutting a release, 0.24 could be OK as a min version requirement


On Fri, Jun 14, 2019 at 9:27 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
just so everyone knows, our python 3.6 testing infra is currently on 0.24.2...

On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
+1

Thank you for this effort, Bryan!

Bests,
Dongjoon.

On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
I’m +1 for upgrading, although since this is probably the last easy chance 
we’ll have to bump version numbers easily I’d suggest 0.24.2


On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
mailto:gurwls...@gmail.com>> wrote:
I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow and pandas 
combinations. Spark 3 should be good time to increase.

On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote:
Hi All,

We would like to discuss increasing the minimum supported version of Pandas in 
Spark, which is currently 0.19.2.

Pandas 0.19.2 was released nearly 3 years ago and there are some workarounds in 
PySpark that could be removed if such an old version is not required. This will 
help to keep code clean and reduce maintenance effort.

The change is targeted for Spark 3.0.0 release, see 
https://issues.apache.org/jira/browse/SPARK-28041. The current thought is to 
bump the version to 0.23.2, but we would like to discuss before making a 
change. Does anyone else have thoughts on this?

Regards,
Bryan
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Holden Karau
Are there other Python dependencies we should consider upgrading at the
same time?

On Fri, Jun 14, 2019 at 7:45 PM Felix Cheung 
wrote:

> So to be clear, min version check is 0.23
> Jenkins test is 0.24
>
> I’m ok with this. I hope someone will test 0.23 on releases though before
> we sign off?
>
We should maybe add this to the release instruction notes?

>
> --
> *From:* shane knapp 
> *Sent:* Friday, June 14, 2019 10:23:56 AM
> *To:* Bryan Cutler
> *Cc:* Dongjoon Hyun; Holden Karau; Hyukjin Kwon; dev
> *Subject:* Re: [DISCUSS] Increasing minimum supported version of Pandas
>
> excellent.  i shall not touch anything.  :)
>
> On Fri, Jun 14, 2019 at 10:22 AM Bryan Cutler  wrote:
>
>> Shane, I think 0.24.2 is probably more common right now, so if we were to
>> pick one to test against, I still think it should be that one. Our Pandas
>> usage in PySpark is pretty conservative, so it's pretty unlikely that we
>> will add something that would break 0.23.X.
>>
>> On Fri, Jun 14, 2019 at 10:10 AM shane knapp  wrote:
>>
>>> ah, ok...  should we downgrade the testing env on jenkins then?  any
>>> specific version?
>>>
>>> shane, who is loathe (and i mean LOATHE) to touch python envs ;)
>>>
>>> On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler  wrote:
>>>
 I should have stated this earlier, but when the user does something
 that requires Pandas, the minimum version is checked against what was
 imported and will raise an exception if it is a lower version. So I'm
 concerned that using 0.24.2 might be a little too new for users running
 older clusters. To give some release dates, 0.23.2 was released about a
 year ago, 0.24.0 in January and 0.24.2 in March.

>>> I think given that we’re switching to requiring Python 3, and also a bit
of a way from cutting a release, 0.24 could be OK as a min version
requirement

>

 On Fri, Jun 14, 2019 at 9:27 AM shane knapp 
 wrote:

> just so everyone knows, our python 3.6 testing infra is currently on
> 0.24.2...
>
> On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> Thank you for this effort, Bryan!
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
>> wrote:
>>
>>> I’m +1 for upgrading, although since this is probably the last easy
>>> chance we’ll have to bump version numbers easily I’d suggest 0.24.2
>>>
>>>
>>> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
>>> wrote:
>>>
 I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow
 and pandas combinations. Spark 3 should be good time to increase.

 On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote:

> Hi All,
>
> We would like to discuss increasing the minimum supported version
> of Pandas in Spark, which is currently 0.19.2.
>
> Pandas 0.19.2 was released nearly 3 years ago and there are some
> workarounds in PySpark that could be removed if such an old version 
> is not
> required. This will help to keep code clean and reduce maintenance 
> effort.
>
> The change is targeted for Spark 3.0.0 release, see
> https://issues.apache.org/jira/browse/SPARK-28041. The current
> thought is to bump the version to 0.23.2, but we would like to discuss
> before making a change. Does anyone else have thoughts on this?
>
> Regards,
> Bryan
>
 --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>

>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Felix Cheung
So to be clear, min version check is 0.23
Jenkins test is 0.24

I’m ok with this. I hope someone will test 0.23 on releases though before we 
sign off?


From: shane knapp 
Sent: Friday, June 14, 2019 10:23:56 AM
To: Bryan Cutler
Cc: Dongjoon Hyun; Holden Karau; Hyukjin Kwon; dev
Subject: Re: [DISCUSS] Increasing minimum supported version of Pandas

excellent.  i shall not touch anything.  :)

On Fri, Jun 14, 2019 at 10:22 AM Bryan Cutler 
mailto:cutl...@gmail.com>> wrote:
Shane, I think 0.24.2 is probably more common right now, so if we were to pick 
one to test against, I still think it should be that one. Our Pandas usage in 
PySpark is pretty conservative, so it's pretty unlikely that we will add 
something that would break 0.23.X.

On Fri, Jun 14, 2019 at 10:10 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
ah, ok...  should we downgrade the testing env on jenkins then?  any specific 
version?

shane, who is loathe (and i mean LOATHE) to touch python envs ;)

On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler 
mailto:cutl...@gmail.com>> wrote:
I should have stated this earlier, but when the user does something that 
requires Pandas, the minimum version is checked against what was imported and 
will raise an exception if it is a lower version. So I'm concerned that using 
0.24.2 might be a little too new for users running older clusters. To give some 
release dates, 0.23.2 was released about a year ago, 0.24.0 in January and 
0.24.2 in March.

On Fri, Jun 14, 2019 at 9:27 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
just so everyone knows, our python 3.6 testing infra is currently on 0.24.2...

On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
+1

Thank you for this effort, Bryan!

Bests,
Dongjoon.

On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
I’m +1 for upgrading, although since this is probably the last easy chance 
we’ll have to bump version numbers easily I’d suggest 0.24.2


On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
mailto:gurwls...@gmail.com>> wrote:
I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow and pandas 
combinations. Spark 3 should be good time to increase.

On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote:
Hi All,

We would like to discuss increasing the minimum supported version of Pandas in 
Spark, which is currently 0.19.2.

Pandas 0.19.2 was released nearly 3 years ago and there are some workarounds in 
PySpark that could be removed if such an old version is not required. This will 
help to keep code clean and reduce maintenance effort.

The change is targeted for Spark 3.0.0 release, see 
https://issues.apache.org/jira/browse/SPARK-28041. The current thought is to 
bump the version to 0.23.2, but we would like to discuss before making a 
change. Does anyone else have thoughts on this?

Regards,
Bryan
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Exposing JIRA issue types at GitHub PRs

2019-06-14 Thread Dongjoon Hyun
Now, you can see the exposed component labels (ordered by the number of
PRs) here and click the component to search.

https://github.com/apache/spark/labels?sort=count-desc

Dongjoon.


On Fri, Jun 14, 2019 at 1:15 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> The JIRA and PR are ready for review.
>
> https://issues.apache.org/jira/browse/SPARK-28051 (Exposing JIRA issue
> component types at GitHub PRs)
> https://github.com/apache/spark/pull/24871
>
> Bests,
> Dongjoon.
>
>
> On Thu, Jun 13, 2019 at 10:48 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for the feedbacks and requirements, Hyukjin, Reynold, Marco.
>>
>> Sure, we can do whatever we want.
>>
>> I'll wait for more feedbacks and proceed to the next steps.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido 
>> wrote:
>>
>>> Hi Dongjoon,
>>> Thanks for the proposal! I like the idea. Maybe we can extend it to
>>> component too and to some jira labels such as correctness which may be
>>> worth to highlight in PRs too. My only concern is that in many cases JIRAs
>>> are created not very carefully so they may be incorrect at the moment of
>>> the pr creation and it may be updated later: so keeping them in sync may be
>>> an extra effort..
>>>
>>> On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:
>>>
 Seems like a good idea. Can we test this with a component first?

 On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> Since we use both Apache JIRA and GitHub actively for Apache Spark
> contributions, we have lots of JIRAs and PRs consequently. One specific
> thing I've been longing to see is `Jira Issue Type` in GitHub.
>
> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
> There are two main benefits:
> 1. It helps the communication between the contributors and reviewers
> with more information.
> (In some cases, some people only visit GitHub to see the PR and
> commits)
> 2. `Labels` is searchable. We don't need to visit Apache Jira to
> search PRs to see a specific type.
> (For example, the reviewers can see and review 'BUG' PRs first by
> using `is:open is:pr label:BUG`.)
>
> Of course, this can be done automatically without human intervention.
> Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
> can add the labels from the beginning. If needed, I can volunteer to 
> update
> the script.
>
> To show the demo, I labeled several PRs manually. You can see the
> result right now in Apache Spark PR page.
>
>   - https://github.com/apache/spark/pulls
>
> If you're surprised due to those manual activities, I want to
> apologize for that. I hope we can take advantage of the existing GitHub
> features to serve Apache Spark community in a way better than yesterday.
>
> What do you think about this specific suggestion?
>
> Bests,
> Dongjoon
>
> PS. I saw that `Request Review` and `Assign` features are already used
> for some purposes, but these features are out of scope for this email.
>



jQuery 3.4.1 update

2019-06-14 Thread Sean Owen
Just surfacing this change as it's probably pretty good to go, but, a)
I'm not a jQuery / JS expert and b) we don't have comprehensive UI
tests.

https://github.com/apache/spark/pull/24843

I'd like to get us up to a modern jQuery for 3.0, to keep up with
security fixes (which was the minor motivation here).

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread shane knapp
excellent.  i shall not touch anything.  :)

On Fri, Jun 14, 2019 at 10:22 AM Bryan Cutler  wrote:

> Shane, I think 0.24.2 is probably more common right now, so if we were to
> pick one to test against, I still think it should be that one. Our Pandas
> usage in PySpark is pretty conservative, so it's pretty unlikely that we
> will add something that would break 0.23.X.
>
> On Fri, Jun 14, 2019 at 10:10 AM shane knapp  wrote:
>
>> ah, ok...  should we downgrade the testing env on jenkins then?  any
>> specific version?
>>
>> shane, who is loathe (and i mean LOATHE) to touch python envs ;)
>>
>> On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler  wrote:
>>
>>> I should have stated this earlier, but when the user does something that
>>> requires Pandas, the minimum version is checked against what was imported
>>> and will raise an exception if it is a lower version. So I'm concerned that
>>> using 0.24.2 might be a little too new for users running older clusters. To
>>> give some release dates, 0.23.2 was released about a year ago, 0.24.0 in
>>> January and 0.24.2 in March.
>>>
>>> On Fri, Jun 14, 2019 at 9:27 AM shane knapp  wrote:
>>>
 just so everyone knows, our python 3.6 testing infra is currently on
 0.24.2...

 On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
 wrote:

> +1
>
> Thank you for this effort, Bryan!
>
> Bests,
> Dongjoon.
>
> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
> wrote:
>
>> I’m +1 for upgrading, although since this is probably the last easy
>> chance we’ll have to bump version numbers easily I’d suggest 0.24.2
>>
>>
>> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
>> wrote:
>>
>>> I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow
>>> and pandas combinations. Spark 3 should be good time to increase.
>>>
>>> On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote:
>>>
 Hi All,

 We would like to discuss increasing the minimum supported version
 of Pandas in Spark, which is currently 0.19.2.

 Pandas 0.19.2 was released nearly 3 years ago and there are some
 workarounds in PySpark that could be removed if such an old version is 
 not
 required. This will help to keep code clean and reduce maintenance 
 effort.

 The change is targeted for Spark 3.0.0 release, see
 https://issues.apache.org/jira/browse/SPARK-28041. The current
 thought is to bump the version to 0.23.2, but we would like to discuss
 before making a change. Does anyone else have thoughts on this?

 Regards,
 Bryan

>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

 --
 Shane Knapp
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Bryan Cutler
Shane, I think 0.24.2 is probably more common right now, so if we were to
pick one to test against, I still think it should be that one. Our Pandas
usage in PySpark is pretty conservative, so it's pretty unlikely that we
will add something that would break 0.23.X.

On Fri, Jun 14, 2019 at 10:10 AM shane knapp  wrote:

> ah, ok...  should we downgrade the testing env on jenkins then?  any
> specific version?
>
> shane, who is loathe (and i mean LOATHE) to touch python envs ;)
>
> On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler  wrote:
>
>> I should have stated this earlier, but when the user does something that
>> requires Pandas, the minimum version is checked against what was imported
>> and will raise an exception if it is a lower version. So I'm concerned that
>> using 0.24.2 might be a little too new for users running older clusters. To
>> give some release dates, 0.23.2 was released about a year ago, 0.24.0 in
>> January and 0.24.2 in March.
>>
>> On Fri, Jun 14, 2019 at 9:27 AM shane knapp  wrote:
>>
>>> just so everyone knows, our python 3.6 testing infra is currently on
>>> 0.24.2...
>>>
>>> On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
>>> wrote:
>>>
 +1

 Thank you for this effort, Bryan!

 Bests,
 Dongjoon.

 On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
 wrote:

> I’m +1 for upgrading, although since this is probably the last easy
> chance we’ll have to bump version numbers easily I’d suggest 0.24.2
>
>
> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
> wrote:
>
>> I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow
>> and pandas combinations. Spark 3 should be good time to increase.
>>
>> On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote:
>>
>>> Hi All,
>>>
>>> We would like to discuss increasing the minimum supported version of
>>> Pandas in Spark, which is currently 0.19.2.
>>>
>>> Pandas 0.19.2 was released nearly 3 years ago and there are some
>>> workarounds in PySpark that could be removed if such an old version is 
>>> not
>>> required. This will help to keep code clean and reduce maintenance 
>>> effort.
>>>
>>> The change is targeted for Spark 3.0.0 release, see
>>> https://issues.apache.org/jira/browse/SPARK-28041. The current
>>> thought is to bump the version to 0.23.2, but we would like to discuss
>>> before making a change. Does anyone else have thoughts on this?
>>>
>>> Regards,
>>> Bryan
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>

>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-14 Thread Ilan Filonenko
+1 (non-binding). This API is versatile and flexible enough to handle
Bloomberg's internal use-cases. The ability for us to vary implementation
strategies is quite appealing. It is also worth noting the minimal changes
to Spark core needed to make it work. This is a much-needed addition
to the Spark shuffle story.

On Fri, Jun 14, 2019 at 9:59 AM bo yang  wrote:

> +1 This is great work, allowing plugin of different sort shuffle
> write/read implementation! Also great to see it retain the current Spark
> configuration
> (spark.shuffle.manager=org.apache.spark.shuffle.YourShuffleManagerImpl).
>
>
> On Thu, Jun 13, 2019 at 2:58 PM Matt Cheah  wrote:
>
>> Hi everyone,
>>
>>
>>
>> I would like to call a vote for the SPIP for SPARK-25299
>> , which proposes to
>> introduce a pluggable storage API for temporary shuffle data.
>>
>>
>>
>> You may find the SPIP document here
>> 
>> .
>>
>>
>>
>> The discussion thread for the SPIP was conducted here
>> 
>> .
>>
>>
>>
>> Please vote on whether or not this proposal is agreeable to you.
>>
>>
>>
>> Thanks!
>>
>>
>>
>> -Matt Cheah
>>
>


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread shane knapp
ah, ok...  should we downgrade the testing env on jenkins then?  any
specific version?

shane, who is loathe (and i mean LOATHE) to touch python envs ;)

On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler  wrote:

> I should have stated this earlier, but when the user does something that
> requires Pandas, the minimum version is checked against what was imported
> and will raise an exception if it is a lower version. So I'm concerned that
> using 0.24.2 might be a little too new for users running older clusters. To
> give some release dates, 0.23.2 was released about a year ago, 0.24.0 in
> January and 0.24.2 in March.
>
> On Fri, Jun 14, 2019 at 9:27 AM shane knapp  wrote:
>
>> just so everyone knows, our python 3.6 testing infra is currently on
>> 0.24.2...
>>
>> On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
>> wrote:
>>
>>> +1
>>>
>>> Thank you for this effort, Bryan!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
>>> wrote:
>>>
 I’m +1 for upgrading, although since this is probably the last easy
 chance we’ll have to bump version numbers easily I’d suggest 0.24.2


 On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
 wrote:

> I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow and
> pandas combinations. Spark 3 should be good time to increase.
>
> On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote:
>
>> Hi All,
>>
>> We would like to discuss increasing the minimum supported version of
>> Pandas in Spark, which is currently 0.19.2.
>>
>> Pandas 0.19.2 was released nearly 3 years ago and there are some
>> workarounds in PySpark that could be removed if such an old version is 
>> not
>> required. This will help to keep code clean and reduce maintenance 
>> effort.
>>
>> The change is targeted for Spark 3.0.0 release, see
>> https://issues.apache.org/jira/browse/SPARK-28041. The current
>> thought is to bump the version to 0.23.2, but we would like to discuss
>> before making a change. Does anyone else have thoughts on this?
>>
>> Regards,
>> Bryan
>>
> --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Bryan Cutler
I should have stated this earlier, but when the user does something that
requires Pandas, the minimum version is checked against what was imported
and will raise an exception if it is a lower version. So I'm concerned that
using 0.24.2 might be a little too new for users running older clusters. To
give some release dates, 0.23.2 was released about a year ago, 0.24.0 in
January and 0.24.2 in March.

On Fri, Jun 14, 2019 at 9:27 AM shane knapp  wrote:

> just so everyone knows, our python 3.6 testing infra is currently on
> 0.24.2...
>
> On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> Thank you for this effort, Bryan!
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
>> wrote:
>>
>>> I’m +1 for upgrading, although since this is probably the last easy
>>> chance we’ll have to bump version numbers easily I’d suggest 0.24.2
>>>
>>>
>>> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
>>> wrote:
>>>
 I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow and
 pandas combinations. Spark 3 should be good time to increase.

 On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote:

> Hi All,
>
> We would like to discuss increasing the minimum supported version of
> Pandas in Spark, which is currently 0.19.2.
>
> Pandas 0.19.2 was released nearly 3 years ago and there are some
> workarounds in PySpark that could be removed if such an old version is not
> required. This will help to keep code clean and reduce maintenance effort.
>
> The change is targeted for Spark 3.0.0 release, see
> https://issues.apache.org/jira/browse/SPARK-28041. The current
> thought is to bump the version to 0.23.2, but we would like to discuss
> before making a change. Does anyone else have thoughts on this?
>
> Regards,
> Bryan
>
 --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-14 Thread bo yang
+1 This is great work, allowing plugin of different sort shuffle write/read
implementation! Also great to see it retain the current Spark configuration
(spark.shuffle.manager=org.apache.spark.shuffle.YourShuffleManagerImpl).
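
For anyone who wants to try that out, a minimal sketch of wiring up the
configuration looks like the following. The spark.shuffle.manager key is the
existing Spark configuration mentioned above; com.example.shuffle.MyShuffleManager
is a made-up placeholder class name for an implementation of
org.apache.spark.shuffle.ShuffleManager that would need to be on the classpath.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Select a pluggable shuffle implementation by fully qualified class name.
val conf = new SparkConf()
  .setAppName("shuffle-plugin-example")
  .set("spark.shuffle.manager", "com.example.shuffle.MyShuffleManager")

val spark = SparkSession.builder().config(conf).getOrCreate()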


On Thu, Jun 13, 2019 at 2:58 PM Matt Cheah  wrote:

> Hi everyone,
>
>
>
> I would like to call a vote for the SPIP for SPARK-25299
> , which proposes to
> introduce a pluggable storage API for temporary shuffle data.
>
>
>
> You may find the SPIP document here
> 
> .
>
>
>
> The discussion thread for the SPIP was conducted here
> 
> .
>
>
>
> Please vote on whether or not this proposal is agreeable to you.
>
>
>
> Thanks!
>
>
>
> -Matt Cheah
>


Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API

2019-06-14 Thread Matt Cheah
We opened a thread for voting yesterday, so please participate!

 

-Matt Cheah

 

From: Yue Li 
Date: Thursday, June 13, 2019 at 7:22 PM
To: Saisai Shao , Imran Rashid 
Cc: Matt Cheah , "Yifei Huang (PD)" , 
Mridul Muralidharan , Bo Yang , Ilan Filonenko 
, Imran Rashid , Justin Uang 
, Liang Tang , Marcelo Vanzin 
, Matei Zaharia , Min Shen 
, Reynold Xin , Ryan Blue 
, Vinoo Ganesh , Will Manning 
, "b...@fb.com" , "dev@spark.apache.org" 
, "fel...@uber.com" , 
"f...@linkedin.com" , "tgraves...@gmail.com" 
, "yez...@linkedin.com" , Cedric 
Zhuang 
Subject: Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API

 

+ Cedric, who is our lead developer of Splash shuffle manager at MemVerge. 

 

Fully agreed with Saisai. Thanks!

 

Best, 

 

Yue

 

From: Saisai Shao 
Date: Thursday, June 13, 2019 at 2:52 PM
To: Imran Rashid 
Cc: Matt Cheah , "Yifei Huang (PD)" , 
Mridul Muralidharan , Bo Yang , Ilan Filonenko 
, Imran Rashid , Justin Uang 
, Liang Tang , Marcelo Vanzin 
, Matei Zaharia , Min Shen 
, Reynold Xin , Ryan Blue 
, Vinoo Ganesh , Will Manning 
, "b...@fb.com" , "dev@spark.apache.org" 
, "fel...@uber.com" , 
"f...@linkedin.com" , "tgraves...@gmail.com" 
, "yez...@linkedin.com" , Yue Li 

Subject: Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API

 

I think maybe we could start a vote on this SPIP.  

 

This has been discussed for a while, and the current doc is pretty complete as
of now. Also, we have seen lots of demand in the community for building their
own shuffle storage.

 

Thanks

Saisai 

 

On Tue, Jun 11, 2019 at 3:27 AM, Imran Rashid wrote:

I would be happy to shepherd this.

 

On Wed, Jun 5, 2019 at 7:33 PM Matt Cheah  wrote:

Hi everyone,

 

I wanted to pick this back up again. The discussion has quieted down both on 
this thread and on the document.

 

We made a few revisions to the document to hopefully make it easier to read and 
to clarify our criteria for success in the project. Some of the APIs have also 
been adjusted based on further discussion and things we’ve learned.

 

I was hoping to discuss what our next steps could be here. Specifically:
- Would any PMC be willing to become the shepherd for this SPIP?
- Is there any more feedback regarding this proposal?
- What would we need to do to take this to a voting phase and to begin proposing
our work against upstream Spark?
 

Thanks,

 

-Matt Cheah

 

From: "Yifei Huang (PD)" 
Date: Monday, May 13, 2019 at 1:04 PM
To: Mridul Muralidharan 
Cc: Bo Yang , Ilan Filonenko , Imran Rashid 
, Justin Uang , Liang Tang 
, Marcelo Vanzin , Matei Zaharia 
, Matt Cheah , Min Shen 
, Reynold Xin , Ryan Blue 
, Vinoo Ganesh , Will Manning 
, "b...@fb.com" , "dev@spark.apache.org" 
, "fel...@uber.com" , 
"f...@linkedin.com" , "tgraves...@gmail.com" 
, "yez...@linkedin.com" , 
"yue...@memverge.com" 
Subject: Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API

 

Hi Mridul - thanks for taking the time to give us feedback! Thoughts on the 
points that you mentioned:

 

The API is meant to work with the existing SortShuffleManager algorithm. There 
aren't strict requirements on how other ShuffleManager implementations must 
behave, so it seems impractical to design an API that could also satisfy those 
unknown requirements. However, we do believe that the API is rather generic, 
using OutputStreams for writes and InputStreams for reads, and indexing the 
data by a shuffleId-mapId-reduceId combo, so if other shuffle algorithms treat 
the data in the same chunks and want an interface for storage, then they can 
also use this API from within their implementation.

 

About speculative execution, we originally made the assumption that each 
shuffle task is deterministic, which meant that even if a later mapper overrode 
a previously committed mapper's value, it's still the same contents. Having 
searched some tickets and read 
https://github.com/apache/spark/pull/22112/files more carefully, I 
think there are problems with our original thought if the writer writes all 
attempts of a task to the same location. One example is if the writer 
implementation writes each partition to the remote host in a sequence of 
chunks. In such a situation, a reducer might read data half written by the 
original task and half written by the running speculative task, which will not 
be the correct contents if the mapper output is unordered. Therefore, writes by 
a single mapper might have to be transactional, which is not clear from the 
API, and seems rather complex to reason about, so we shouldn't expect this from 
the implementer.

 

However, this doesn't affect the fundamentals of the API since we only need to 
add an additional attemptId to the storage data index (which can be stored 
within the MapStatus) to solve the problem of concurrent writes. This would 
also make it more clear that the writer should use attempt ID as an index to 
ensure that writes from speculative tasks don't interfere with one another (we 
can add that to the API docs as well
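
To make the storage indexing described above concrete, here is a rough Scala
sketch of that shape: writes as OutputStreams and reads as InputStreams keyed
by a shuffleId-mapId-reduceId combo plus an attemptId. This is only an
illustration of the idea with made-up names, not the API defined in the SPIP
document.

import java.io.{InputStream, OutputStream}

// One block of shuffle data: a single map task attempt's output for one
// reduce partition.
final case class ShuffleBlockCoordinate(
    shuffleId: Int,
    mapId: Int,
    reduceId: Int,
    attemptId: Int)

trait ShuffleStorage {
  // Open a stream to write one block of map output.
  def openForWrite(block: ShuffleBlockCoordinate): OutputStream

  // Mark a map task attempt as complete, so readers only ever observe a fully
  // written attempt (relevant for speculative execution and stage retries).
  def commitAttempt(shuffleId: Int, mapId: Int, attemptId: Int): Unit

  // Open a stream to read back a committed block.
  def openForRead(block: ShuffleBlockCoordinate): InputStream
}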

Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread shane knapp
just so everyone knows, our python 3.6 testing infra is currently on
0.24.2...

On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
wrote:

> +1
>
> Thank you for this effort, Bryan!
>
> Bests,
> Dongjoon.
>
> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau  wrote:
>
>> I’m +1 for upgrading, although since this is probably the last easy
>> chance we’ll have to bump version numbers easily I’d suggest 0.24.2
>>
>>
>> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon  wrote:
>>
>>> I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow and
>>> pandas combinations. Spark 3 should be good time to increase.
>>>
>>> On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote:
>>>
 Hi All,

 We would like to discuss increasing the minimum supported version of
 Pandas in Spark, which is currently 0.19.2.

 Pandas 0.19.2 was released nearly 3 years ago and there are some
 workarounds in PySpark that could be removed if such an old version is not
 required. This will help to keep code clean and reduce maintenance effort.

 The change is targeted for Spark 3.0.0 release, see
 https://issues.apache.org/jira/browse/SPARK-28041. The current thought
 is to bump the version to 0.23.2, but we would like to discuss before
 making a change. Does anyone else have thoughts on this?

 Regards,
 Bryan

>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] upcoming jenkins downtime: august 3rd 2019

2019-06-14 Thread Dongjoon Hyun
Thank you for the early notice, Shane! :)

Dongjoon

On Fri, Jun 14, 2019 at 9:13 AM shane knapp  wrote:

> the campus colo will be performing some electrical maintenance, which
> means that they'll be powering off the entire building.
>
> since the jenkins cluster is located in that colo, we are most definitely
> affected.  :)
>
> i'll be out of town that weekend, but will have one of my sysadmins bring
> everything back up on sunday, august 4th.  if they run in to issues, i will
> jump in first thing monday, august 5th.
>
> as the time approaches, i will send reminders and updates.
>
> thanks,
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Dongjoon Hyun
+1

Thank you for this effort, Bryan!

Bests,
Dongjoon.

On Fri, Jun 14, 2019 at 4:24 AM Holden Karau  wrote:

> I’m +1 for upgrading, although since this is probably the last easy chance
> we’ll have to bump version numbers easily I’d suggest 0.24.2
>
>
> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon  wrote:
>
>> I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow and
>> pandas combinations. Spark 3 should be good time to increase.
>>
>> On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote:
>>
>>> Hi All,
>>>
>>> We would like to discuss increasing the minimum supported version of
>>> Pandas in Spark, which is currently 0.19.2.
>>>
>>> Pandas 0.19.2 was released nearly 3 years ago and there are some
>>> workarounds in PySpark that could be removed if such an old version is not
>>> required. This will help to keep code clean and reduce maintenance effort.
>>>
>>> The change is targeted for Spark 3.0.0 release, see
>>> https://issues.apache.org/jira/browse/SPARK-28041. The current thought
>>> is to bump the version to 0.23.2, but we would like to discuss before
>>> making a change. Does anyone else have thoughts on this?
>>>
>>> Regards,
>>> Bryan
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


[build system] upcoming jenkins downtime: august 3rd 2019

2019-06-14 Thread shane knapp
the campus colo will be performing some electrical maintenance, which means
that they'll be powering off the entire building.

since the jenkins cluster is located in that colo, we are most definitely
affected.  :)

i'll be out of town that weekend, but will have one of my sysadmins bring
everything back up on sunday, august 4th.  if they run in to issues, i will
jump in first thing monday, august 5th.

as the time approaches, i will send reminders and updates.

thanks,

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Holden Karau
I’m +1 for upgrading, although since this is probably the last easy chance
we’ll have to bump version numbers easily I’d suggest 0.24.2


On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon  wrote:

> I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow and
> pandas combinations. Spark 3 should be good time to increase.
>
> On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote:
>
>> Hi All,
>>
>> We would like to discuss increasing the minimum supported version of
>> Pandas in Spark, which is currently 0.19.2.
>>
>> Pandas 0.19.2 was released nearly 3 years ago and there are some
>> workarounds in PySpark that could be removed if such an old version is not
>> required. This will help to keep code clean and reduce maintenance effort.
>>
>> The change is targeted for Spark 3.0.0 release, see
>> https://issues.apache.org/jira/browse/SPARK-28041. The current thought
>> is to bump the version to 0.23.2, but we would like to discuss before
>> making a change. Does anyone else have thoughts on this?
>>
>> Regards,
>> Bryan
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Exposing JIRA issue types at GitHub PRs

2019-06-14 Thread Dongjoon Hyun
Hi, All.

The JIRA and PR are ready for review.

https://issues.apache.org/jira/browse/SPARK-28051 (Exposing JIRA issue
component types at GitHub PRs)
https://github.com/apache/spark/pull/24871

Bests,
Dongjoon.


On Thu, Jun 13, 2019 at 10:48 AM Dongjoon Hyun 
wrote:

> Thank you for the feedbacks and requirements, Hyukjin, Reynold, Marco.
>
> Sure, we can do whatever we want.
>
> I'll wait for more feedbacks and proceed to the next steps.
>
> Bests,
> Dongjoon.
>
>
> On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido 
> wrote:
>
>> Hi Dongjoon,
>> Thanks for the proposal! I like the idea. Maybe we can extend it to
>> component too and to some jira labels such as correctness which may be
>> worth to highlight in PRs too. My only concern is that in many cases JIRAs
>> are created not very carefully so they may be incorrect at the moment of
>> the pr creation and it may be updated later: so keeping them in sync may be
>> an extra effort..
>>
>> On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:
>>
>>> Seems like a good idea. Can we test this with a component first?
>>>
>>> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 Since we use both Apache JIRA and GitHub actively for Apache Spark
 contributions, we have lots of JIRAs and PRs consequently. One specific
 thing I've been longing to see is `Jira Issue Type` in GitHub.

 How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
 There are two main benefits:
 1. It helps the communication between the contributors and reviewers
 with more information.
 (In some cases, some people only visit GitHub to see the PR and
 commits)
 2. `Labels` is searchable. We don't need to visit Apache Jira to search
 PRs to see a specific type.
 (For example, the reviewers can see and review 'BUG' PRs first by
 using `is:open is:pr label:BUG`.)

 Of course, this can be done automatically without human intervention.
 Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
 can add the labels from the beginning. If needed, I can volunteer to update
 the script.

 To show the demo, I labeled several PRs manually. You can see the
 result right now in Apache Spark PR page.

   - https://github.com/apache/spark/pulls

 If you're surprised due to those manual activities, I want to apologize
 for that. I hope we can take advantage of the existing GitHub features to
 serve Apache Spark community in a way better than yesterday.

 What do you think about this specific suggestion?

 Bests,
 Dongjoon

 PS. I saw that `Request Review` and `Assign` features are already used
 for some purposes, but these features are out of scope for this email.

>>>