Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-17 Thread shane knapp
pyarrow is currently testing against 0.12.1.


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-15 Thread Hyukjin Kwon
Oh btw, why is it 0.23.2, not 0.23.0 or 0.23.4?


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Bryan Cutler
Yeah, PyArrow is the only other PySpark dependency we check for a minimum
version. We updated that not too long ago to be 0.12.1, which I think we
are still good on for now.


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Felix Cheung
How about pyArrow?



Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Holden Karau
Are there other Python dependencies we should consider upgrading at the
same time?

On Fri, Jun 14, 2019 at 7:45 PM Felix Cheung 
wrote:

> So to be clear, min version check is 0.23
> Jenkins test is 0.24
>
> I’m ok with this. I hope someone will test 0.23 on releases though before
> we sign off?
>
We should maybe add this to the release instruction notes?

>>> On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler  wrote:
>>>
>>>> I should have stated this earlier, but when the user does something
>>>> that requires Pandas, the minimum version is checked against what was
>>>> imported and will raise an exception if it is a lower version. So I'm
>>>> concerned that using 0.24.2 might be a little too new for users running
>>>> older clusters. To give some release dates, 0.23.2 was released about a
>>>> year ago, 0.24.0 in January and 0.24.2 in March.
>>>>
I think, given that we’re switching to requiring Python 3 and are also a
bit of a way from cutting a release, 0.24 could be OK as a min version
requirement.

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Felix Cheung
So to be clear, min version check is 0.23
Jenkins test is 0.24

I’m ok with this. I hope someone will test 0.23 on releases though before we 
sign off?




Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread shane knapp
excellent.  i shall not touch anything.  :)

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Bryan Cutler
Shane, I think 0.24.2 is probably more common right now, so if we were to
pick one to test against, I still think it should be that one. Our Pandas
usage in PySpark is pretty conservative, so it's pretty unlikely that we
will add something that would break 0.23.X.



Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread shane knapp
ah, ok...  should we downgrade the testing env on jenkins then?  any
specific version?

shane, who is loathe (and i mean LOATHE) to touch python envs ;)

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Bryan Cutler
I should have stated this earlier, but when the user does something that
requires Pandas, the minimum version is checked against what was imported,
and an exception is raised if the imported version is lower. So I'm concerned
that requiring 0.24.2 might be a little too new for users running older
clusters. To give some release dates: 0.23.2 was released about a year ago,
0.24.0 in January, and 0.24.2 in March.

On Fri, Jun 14, 2019 at 9:27 AM shane knapp  wrote:

> just so everyone knows, our python 3.6 testing infra is currently on
> 0.24.2...
>
> On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> Thank you for this effort, Bryan!
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
>> wrote:
>>
>>> I’m +1 for upgrading, although since this is probably the last easy
>>> chance we’ll have to bump version numbers, I’d suggest 0.24.2
>>>
>>>
>>> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
>>> wrote:
>>>
 I am +1 to go for 0.23.2 - the old minimum brings some overhead to testing
 PyArrow and pandas combinations. Spark 3 should be a good time to increase it.

 2019년 6월 14일 (금) 오전 9:46, Bryan Cutler 님이 작성:

> Hi All,
>
> We would like to discuss increasing the minimum supported version of
> Pandas in Spark, which is currently 0.19.2.
>
> Pandas 0.19.2 was released nearly 3 years ago and there are some
> workarounds in PySpark that could be removed if such an old version is not
> required. This will help to keep code clean and reduce maintenance effort.
>
> The change is targeted for Spark 3.0.0 release, see
> https://issues.apache.org/jira/browse/SPARK-28041. The current
> thought is to bump the version to 0.23.2, but we would like to discuss
> before making a change. Does anyone else have thoughts on this?
>
> Regards,
> Bryan
>
 --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread shane knapp
just so everyone knows, our python 3.6 testing infra is currently on
0.24.2...

On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
wrote:

> +1
>
> Thank you for this effort, Bryan!
>
> Bests,
> Dongjoon.
>
> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau  wrote:
>
>> I’m +1 for upgrading, although since this is probably the last easy
>> chance we’ll have to bump version numbers, I’d suggest 0.24.2
>>
>>
>> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon  wrote:
>>
>>> I am +1 to go for 0.23.2 - the old minimum brings some overhead to testing
>>> PyArrow and pandas combinations. Spark 3 should be a good time to increase it.
>>>
>>> 2019년 6월 14일 (금) 오전 9:46, Bryan Cutler 님이 작성:
>>>
 Hi All,

 We would like to discuss increasing the minimum supported version of
 Pandas in Spark, which is currently 0.19.2.

 Pandas 0.19.2 was released nearly 3 years ago and there are some
 workarounds in PySpark that could be removed if such an old version is not
 required. This will help to keep code clean and reduce maintenance effort.

 The change is targeted for Spark 3.0.0 release, see
 https://issues.apache.org/jira/browse/SPARK-28041. The current thought
 is to bump the version to 0.23.2, but we would like to discuss before
 making a change. Does anyone else have thoughts on this?

 Regards,
 Bryan

>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Dongjoon Hyun
+1

Thank you for this effort, Bryan!

Bests,
Dongjoon.

On Fri, Jun 14, 2019 at 4:24 AM Holden Karau  wrote:

> I’m +1 for upgrading, although since this is probably the last easy chance
> we’ll have to bump version numbers, I’d suggest 0.24.2
>
>
> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon  wrote:
>
>> I am +1 to go for 0.23.2 - the old minimum brings some overhead to testing
>> PyArrow and pandas combinations. Spark 3 should be a good time to increase it.
>>
>> 2019년 6월 14일 (금) 오전 9:46, Bryan Cutler 님이 작성:
>>
>>> Hi All,
>>>
>>> We would like to discuss increasing the minimum supported version of
>>> Pandas in Spark, which is currently 0.19.2.
>>>
>>> Pandas 0.19.2 was released nearly 3 years ago and there are some
>>> workarounds in PySpark that could be removed if such an old version is not
>>> required. This will help to keep code clean and reduce maintenance effort.
>>>
>>> The change is targeted for Spark 3.0.0 release, see
>>> https://issues.apache.org/jira/browse/SPARK-28041. The current thought
>>> is to bump the version to 0.23.2, but we would like to discuss before
>>> making a change. Does anyone else have thoughts on this?
>>>
>>> Regards,
>>> Bryan
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Holden Karau
I’m +1 for upgrading, although since this is probably the last easy chance
we’ll have to bump version numbers, I’d suggest 0.24.2


On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon  wrote:

> I am +1 to go for 0.23.2 - the old minimum brings some overhead to testing
> PyArrow and pandas combinations. Spark 3 should be a good time to increase it.
>
> 2019년 6월 14일 (금) 오전 9:46, Bryan Cutler 님이 작성:
>
>> Hi All,
>>
>> We would like to discuss increasing the minimum supported version of
>> Pandas in Spark, which is currently 0.19.2.
>>
>> Pandas 0.19.2 was released nearly 3 years ago and there are some
>> workarounds in PySpark that could be removed if such an old version is not
>> required. This will help to keep code clean and reduce maintenance effort.
>>
>> The change is targeted for Spark 3.0.0 release, see
>> https://issues.apache.org/jira/browse/SPARK-28041. The current thought
>> is to bump the version to 0.23.2, but we would like to discuss before
>> making a change. Does anyone else have thoughts on this?
>>
>> Regards,
>> Bryan
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-13 Thread Hyukjin Kwon
I am +1 to go for 0.23.2 - the old minimum brings some overhead to testing
PyArrow and pandas combinations. Spark 3 should be a good time to increase it.

2019년 6월 14일 (금) 오전 9:46, Bryan Cutler 님이 작성:

> Hi All,
>
> We would like to discuss increasing the minimum supported version of
> Pandas in Spark, which is currently 0.19.2.
>
> Pandas 0.19.2 was released nearly 3 years ago and there are some
> workarounds in PySpark that could be removed if such an old version is not
> required. This will help to keep code clean and reduce maintenance effort.
>
> The change is targeted for Spark 3.0.0 release, see
> https://issues.apache.org/jira/browse/SPARK-28041. The current thought is
> to bump the version to 0.23.2, but we would like to discuss before making a
> change. Does anyone else have thoughts on this?
>
> Regards,
> Bryan
>