Re: Thoughts on Cloudpickle Update

2018-01-19 Thread Hyukjin Kwon
> So given that it fixes some real world bugs, any particular reason why?
> Would you be comfortable with doing it in 2.3.1?

Ah, I don't feel strongly about this, but RC2 will be coming up soon and
cloudpickle is quite a core piece of PySpark. I just thought we might want to
have enough time with the change.

One worry is that the upgrade also includes a fix around namedtuple, where
PySpark has its own custom fix.
I would like to check a few things about this.

So, yea, it's a vague concern. I won't stand in the way if you'd prefer to go ahead.
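
As quick background on that namedtuple worry, here is a minimal standalone
sketch of the underlying pickling issue (standard-library code only; this is
not PySpark's actual patch, which IIRC lives in pyspark/serializers.py, nor
the upstream cloudpickle fix):

    import collections
    import pickle
    import pickletools

    # A namedtuple class defined on the fly in the driver script (__main__),
    # the way users often declare row-like records in a PySpark job.
    Point = collections.namedtuple("Point", ["x", "y"])

    payload = pickle.dumps(Point(1, 2), protocol=2)

    # The disassembly shows the instance is stored as a reference to
    # __main__.Point; an executor process that never defined Point cannot
    # resolve that reference. That is why both cloudpickle and PySpark carry
    # special handling that ships the class definition itself, and why the
    # two fixes could overlap after an upgrade.
    pickletools.dis(payload)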




2018-01-19 16:42 GMT+09:00 Holden Karau :

>
>
> On Jan 19, 2018 7:28 PM, "Hyukjin Kwon"  wrote:
>
> > Is it an option to match the latest version of cloudpickle and still
> set protocol level 2?
>
> IMHO, I think this can be an option, but I am not fully sure yet whether we
> should or could go ahead with it within Spark 2.X. I need to do some
> investigation, including around Pyrolite.
>
> Let's go ahead with matching it to 0.4.2 first. I am quite clear on
> matching it to 0.4.2 at least.
>
> So, given that there is a follow-up which fixes a regression: if we're
> not comfortable doing the latest version, let's double-check that the
> version we do upgrade to doesn't have that regression.
>
>
>
> > I agree that upgrading to try and match version 0.4.2 would be a good
> starting point. Unless anyone objects, I will open up a JIRA and try to do
> this.
>
> Yup, but to be clear, I think we shouldn't put this into Spark 2.3.0.
>
> So given that it fixes some real world bugs, any particular reason why?
> Would you be comfortable with doing it in 2.3.1?
>
>
>
> > Also, let's try to keep track in our commit messages of which version of
> cloudpickle we end up upgrading to.
>
> +1: a PR description, commit message, or anything else that identifies each
> upgrade will be useful.
> It should be easier once we have a matched version.
>
>
>
> 2018-01-19 12:55 GMT+09:00 Holden Karau :
>
>> So if there are different versions of Python on the cluster machines, I
>> think that's already unsupported, so I'm not worried about that.
>>
>> I'd suggest going to the highest released version, since there appear to
>> be some useful fixes between 0.4.2 & 0.5.2.
>>
>> Also, let's try to keep track in our commit messages of which version of
>> cloudpickle we end up upgrading to.
>>
>> On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler  wrote:
>>
>>> Thanks for all the details and background, Hyukjin! Regarding the pickle
>>> protocol change, if I understand correctly, it is currently at level 2 in
>>> Spark, which is good for backwards compatibility for all of Python 2.
>>> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
>>> above, will pick a level determined by your Python version. So is the
>>> concern here for Spark that if someone has different versions of Python in
>>> their cluster, like 3.5 and 3.3, then different protocols will be used and
>>> deserialization might fail? Is it an option to match the latest version of
>>> cloudpickle and still set protocol level 2?
>>>
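To make the protocol point above concrete, here is a minimal sketch using
plain pickle (not Spark or cloudpickle code; the version-to-protocol mapping
below is taken from the CPython docs):

    import pickle

    data = {"rows": [1, 2, 3]}

    # cloudpickle 0.5.0+ defaults to pickle.HIGHEST_PROTOCOL, which depends on
    # the interpreter: 2 on Python 2.x, 3 on Python 3.0-3.3, 4 on Python
    # 3.4-3.7 (and 5 from 3.8).
    latest = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

    # A Python 3.5 driver would therefore emit protocol 4, which a Python 3.3
    # worker (whose highest supported protocol is 3) cannot read. Pinning the
    # protocol keeps the bytes loadable everywhere, down to Python 2:
    portable = pickle.dumps(data, protocol=2)

    # For protocol >= 2, the second byte of the stream records the protocol.
    print(latest[1], portable[1])
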
>>> I agree that upgrading to try and match version 0.4.2 would be a good
>>> starting point. Unless anyone objects, I will open up a JIRA and try to do
>>> this.
>>>
>>> Thanks,
>>> Bryan
>>>
>>> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon 
>>> wrote:
>>>
 Hi Bryan,

 Yup, I support matching the version. I have pushed this forward a few times
 before, both in Spark's copy and in cloudpickle itself with a few fixes, to
 match it with https://github.com/cloudpipe/cloudpickle.
 I believe our copy is closest to 0.4.1.

 I have been trying to follow the changes in cloudpipe/cloudpickle to decide
 which version we should match. I think we should match it with 0.4.2 first
 (I need to double-check), because IMHO they have been adding rather radical
 changes since 0.5.0, including the pickle protocol change (by default).

 Personally, I would like to match it with the latest eventually, because
 there have been some important changes. For example, see
 https://github.com/cloudpipe/cloudpickle/pull/138 too (it's still pending
 review), but 0.4.2 should be a good starting point.

 For the strategy, I think we can match it and then follow the 0.4.x line
 within Spark, as the conservative and safe choice with minimal cost.


 I tried to leave a few explicit answers to your questions, Bryan:

 > Spark is currently using a forked version and it seems like updates
 are made every now and then when
 > needed, but it's not really clear where the current state is and how
 much it has diverged.

 I am quite sure our cloudpickle copy is closest to 0.4.1, IIRC.


 > Are there any known issues with recent changes from those that follow
 cloudpickle dev?

 I am technically involved in cloudpickle dev, although less actively.
 They changed the default pickle protocol
 (https://github.com/cloudpipe/cloudpickle/pull/127). So, if we target 0.5.x+,
 we should double-check the potential com

Re: Thoughts on Cloudpickle Update

2018-01-19 Thread Holden Karau
So it is pretty core, but it's one of the better indirectly tested
components. I think the most reasonable path is probably to see what the
diff ends up looking like and make a call at that point on whether we want it
to go to master only or to master & branch-2.3?

On Fri, Jan 19, 2018 at 12:30 AM, Hyukjin Kwon  wrote:

> > So given that it fixes some real world bugs, any particular reason why?
> Would you be comfortable with doing it in 2.3.1?
>
> Ah, I don't feel strongly about this but RC2 will be running on and
> cloudpickle's quite core fix to PySpark. Just thought we might want to have
> enough time with it.
>
> One worry is, upgrading it includes a fix about namedtuple too where
> PySpark has a custom fix.
> I would like to check few things about this.
>
> So, yea, it's vague. I wouldn't stay against if you'd prefer.
>
>
>
>
> 2018-01-19 16:42 GMT+09:00 Holden Karau :
>
>>
>>
>> On Jan 19, 2018 7:28 PM, "Hyukjin Kwon"  wrote:
>>
>> > Is it an option to match the latest version of cloudpickle and still
>> set protocol level 2?
>>
>> IMHO, I think this can be an option but I am not fully sure yet if we
>> should/could go ahead for it within Spark 2.X. I need some
>> investigations including things about Pyrolite.
>>
>> Let's go ahead with matching it to 0.4.2 first. I am quite clear on
>> matching it to 0.4.2 at least.
>>
>> So given that there is a follow up on which fixes a regression if we're
>> not comfortable doing the latest version let's double-check that the
>> version we do upgrade to doesn't have that regression.
>>
>>
>>
>> > I agree that upgrading to try and match version 0.4.2 would be a good
>> starting point. Unless no one objects, I will open up a JIRA and try to do
>> this.
>>
>> Yup but I think we shouldn't make this into Spark 2.3.0 to be clear.
>>
>> So given that it fixes some real world bugs, any particular reason why?
>> Would you be comfortable with doing it in 2.3.1?
>>
>>
>>
>> > Also lets try to keep track in our commit messages which version of
>> cloudpickle we end up upgrading to.
>>
>> +1: PR description, commit message or any unit to identify each will be 
>> useful.
>> It should be easier once we have a matched version.
>>
>>
>>
>> 2018-01-19 12:55 GMT+09:00 Holden Karau :
>>
>>> So if there are different version of Python on the cluster machines I
>>> think that's already unsupported so I'm not worried about that.
>>>
>>> I'd suggest going to the highest released version since there appear to
>>> be some useful fixes between 0.4.2 & 0.5.2
>>>
>>> Also lets try to keep track in our commit messages which version of
>>> cloudpickle we end up upgrading to.
>>>
>>> On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler  wrote:
>>>
 Thanks for all the details and background Hyukjin! Regarding the pickle
 protocol change, if I understand correctly, it is currently at level 2 in
 Spark which is good for backwards compatibility for all of Python 2.
 Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
 above, will pick a level determined by your Python version. So is the
 concern here for Spark if someone has different versions of Python in their
 cluster, like 3.5 and 3.3, then different protocols will be used and
 deserialization might fail?  Is it an option to match the latest version of
 cloudpickle and still set protocol level 2?

 I agree that upgrading to try and match version 0.4.2 would be a good
 starting point. Unless no one objects, I will open up a JIRA and try to do
 this.

 Thanks,
 Bryan

 On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon 
 wrote:

> Hi Bryan,
>
> Yup, I support to match the version. I pushed it forward before to
> match it with https://github.com/cloudpipe/cloudpickle
> before few times in Spark's copy and also cloudpickle itself with few
> fixes. I believe our copy is closest to 0.4.1.
>
> I have been trying to follow up the changes in cloudpipe/cloudpickle
> for which version we should match, I think we should match
> it with 0.4.2 first (I need to double check) because IMHO they have
> been adding rather radical changes from 0.5.0, including
> pickle protocol change (by default).
>
> Personally, I would like to match it with the latest because there
> have been some important changes. For
> example, see this too - https://github.com/cloudpipe
> /cloudpickle/pull/138 (it's pending for reviewing yet) eventually but
> 0.4.2 should be
> a good start point.
>
> For the strategy, I think we can match it and follow 0.4.x within
> Spark for the conservative and safe choice + minimal cost.
>
>
> I tried to leave few explicit answers to the questions from you, Bryan:
>
> > Spark is currently using a forked version and it seems like updates
> are made every now and then when
> > needed, but it's not really clear where the current state is and how
> much it ha

Re: Thoughts on Cloudpickle Update

2018-01-19 Thread Hyukjin Kwon
Yea, that sounds good to me.

2018-01-19 18:29 GMT+09:00 Holden Karau :

> So it is pretty core, but its one of the better indirectly tested
> components. I think probably the most reasonable path is to see what the
> diff ends up looking like and make a call at that point for if we want it
> to go to master or master & branch-2.3?
>
> On Fri, Jan 19, 2018 at 12:30 AM, Hyukjin Kwon 
> wrote:
>
>> > So given that it fixes some real world bugs, any particular reason
>> why? Would you be comfortable with doing it in 2.3.1?
>>
>> Ah, I don't feel strongly about this but RC2 will be running on and
>> cloudpickle's quite core fix to PySpark. Just thought we might want to have
>> enough time with it.
>>
>> One worry is, upgrading it includes a fix about namedtuple too where
>> PySpark has a custom fix.
>> I would like to check few things about this.
>>
>> So, yea, it's vague. I wouldn't stay against if you'd prefer.
>>
>>
>>
>>
>> 2018-01-19 16:42 GMT+09:00 Holden Karau :
>>
>>>
>>>
>>> On Jan 19, 2018 7:28 PM, "Hyukjin Kwon"  wrote:
>>>
>>> > Is it an option to match the latest version of cloudpickle and still
>>> set protocol level 2?
>>>
>>> IMHO, I think this can be an option but I am not fully sure yet if we
>>> should/could go ahead for it within Spark 2.X. I need some
>>> investigations including things about Pyrolite.
>>>
>>> Let's go ahead with matching it to 0.4.2 first. I am quite clear on
>>> matching it to 0.4.2 at least.
>>>
>>> So given that there is a follow up on which fixes a regression if we're
>>> not comfortable doing the latest version let's double-check that the
>>> version we do upgrade to doesn't have that regression.
>>>
>>>
>>>
>>> > I agree that upgrading to try and match version 0.4.2 would be a good
>>> starting point. Unless no one objects, I will open up a JIRA and try to do
>>> this.
>>>
>>> Yup but I think we shouldn't make this into Spark 2.3.0 to be clear.
>>>
>>> So given that it fixes some real world bugs, any particular reason why?
>>> Would you be comfortable with doing it in 2.3.1?
>>>
>>>
>>>
>>> > Also lets try to keep track in our commit messages which version of
>>> cloudpickle we end up upgrading to.
>>>
>>> +1: PR description, commit message or any unit to identify each will be 
>>> useful.
>>> It should be easier once we have a matched version.
>>>
>>>
>>>
>>> 2018-01-19 12:55 GMT+09:00 Holden Karau :
>>>
 So if there are different version of Python on the cluster machines I
 think that's already unsupported so I'm not worried about that.

 I'd suggest going to the highest released version since there appear to
 be some useful fixes between 0.4.2 & 0.5.2

 Also lets try to keep track in our commit messages which version of
 cloudpickle we end up upgrading to.

 On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler 
 wrote:

> Thanks for all the details and background Hyukjin! Regarding the
> pickle protocol change, if I understand correctly, it is currently at 
> level
> 2 in Spark which is good for backwards compatibility for all of Python 2.
> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
> above, will pick a level determined by your Python version. So is the
> concern here for Spark if someone has different versions of Python in 
> their
> cluster, like 3.5 and 3.3, then different protocols will be used and
> deserialization might fail?  Is it an option to match the latest version 
> of
> cloudpickle and still set protocol level 2?
>
> I agree that upgrading to try and match version 0.4.2 would be a good
> starting point. Unless no one objects, I will open up a JIRA and try to do
> this.
>
> Thanks,
> Bryan
>
> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon 
> wrote:
>
>> Hi Bryan,
>>
>> Yup, I support to match the version. I pushed it forward before to
>> match it with https://github.com/cloudpipe/cloudpickle
>> before few times in Spark's copy and also cloudpickle itself with few
>> fixes. I believe our copy is closest to 0.4.1.
>>
>> I have been trying to follow up the changes in cloudpipe/cloudpickle
>> for which version we should match, I think we should match
>> it with 0.4.2 first (I need to double check) because IMHO they have
>> been adding rather radical changes from 0.5.0, including
>> pickle protocol change (by default).
>>
>> Personally, I would like to match it with the latest because there
>> have been some important changes. For
>> example, see this too - https://github.com/cloudpipe
>> /cloudpickle/pull/138 (it's pending for reviewing yet) eventually
>> but 0.4.2 should be
>> a good start point.
>>
>> For the strategy, I think we can match it and follow 0.4.x within
>> Spark for the conservative and safe choice + minimal cost.
>>
>>
>> I tried to leave few explicit answers to the questions fr

Kryo 4 serialized form changes -- a problem?

2018-01-19 Thread Sean Owen
See:

https://issues.apache.org/jira/browse/SPARK-23131
https://github.com/apache/spark/pull/20301#issuecomment-358473199

I expected a major Kryo upgrade to be problematic, but it worked fine. It
picks up a number of fixes:
https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0

It might be good for Spark 2.4.

Its serialized format isn't entirely compatible though. I'm trying to
recall whether this is a problem in practice. We don't guarantee wire
compatibility across mismatched Spark versions, right?

But does the Kryo serialized form show up in any persistent stored form? I
don't believe any normal output, even that of saveAsObjectFile, uses it.
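
For reference, where Kryo actually applies is behind an opt-in setting; a
small sketch of the usual configuration (the config keys are the standard
Spark settings, the app name is made up, and note that PySpark's Python
objects still go through pickle, since Kryo only covers the JVM side):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (
        SparkConf()
        .setAppName("kryo-upgrade-check")  # illustrative app name
        # Kryo serializes JVM-side data in flight (shuffle blocks) and RDDs
        # cached with serialized storage levels, not the files a job writes
        # out (so, per the reasoning above, a Kryo 3 to 4 format change
        # should not show up in saved output).
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Optional: fail fast on unregistered classes while auditing an upgrade.
        .set("spark.kryo.registrationRequired", "true")
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()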

I'm wondering if I'm simply not recalling why this would be a problem to update.

Sean


Re: Kryo 4 serialized form changes -- a problem?

2018-01-19 Thread Reynold Xin
I don’t think Spark relies on Kryo or Java serialization for persistence. User
programs might, though, so it would be great if we can shade it.

On Fri, Jan 19, 2018 at 5:55 AM Sean Owen  wrote:

> See:
>
> https://issues.apache.org/jira/browse/SPARK-23131
> https://github.com/apache/spark/pull/20301#issuecomment-358473199
>
> I expected a major Kryo upgrade to be problematic, but it worked fine. It
> picks up a number of fixes:
> https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
>
> It might be good for Spark 2.4.
>
> Its serialized format isn't entirely compatible though. I'm trying to
> recall whether this is a problem in practice. We don't guarantee wire
> compatibility across mismatched Spark versions, right?
>
> But does the Kryo serialized form show up in any persistent stored form? I
> don't believe any normal output, even that of saveAsObjectFile, uses it.
>
> I'm wondering if I am not recalling why this would be a problem to update?
>
> Sean
>


Re: Kryo 4 serialized form changes -- a problem?

2018-01-19 Thread Sean Owen
Good point. The good news is that it has always been shaded for us, because
it comes in via Chill, which shades it.

On Fri, Jan 19, 2018 at 10:28 AM Reynold Xin  wrote:

> I don’t think Spark relies on Kryo or Java for persistence. User programs
> might though so it would be great if we can shade it.
>
> On Fri, Jan 19, 2018 at 5:55 AM Sean Owen  wrote:
>
>> See:
>>
>> https://issues.apache.org/jira/browse/SPARK-23131
>> https://github.com/apache/spark/pull/20301#issuecomment-358473199
>>
>> I expected a major Kryo upgrade to be problematic, but it worked fine. It
>> picks up a number of fixes:
>> https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
>>
>> It might be good for Spark 2.4.
>>
>> Its serialized format isn't entirely compatible though. I'm trying to
>> recall whether this is a problem in practice. We don't guarantee wire
>> compatibility across mismatched Spark versions, right?
>>
>> But does the Kryo serialized form show up in any persistent stored form?
>> I don't believe any normal output, even that of saveAsObjectFile, uses it.
>>
>> I'm wondering if I am not recalling why this would be a problem to update?
>>
>> Sean
>>
>


Re: Kryo 4 serialized form changes -- a problem?

2018-01-19 Thread Koert Kuipers
it is mainly a problem because, for reasons of sanity, one wants to keep a
single kryo/chill version, and kryo/chill could be used in other places for
somewhat persistent serialization by the user.

i know, this is not spark's problem... it is the user's problem. but i would
find it odd to change kryo in a minor upgrade in general. not that it
cannot be done.



On Fri, Jan 19, 2018 at 8:55 AM, Sean Owen  wrote:

> See:
>
> https://issues.apache.org/jira/browse/SPARK-23131
> https://github.com/apache/spark/pull/20301#issuecomment-358473199
>
> I expected a major Kryo upgrade to be problematic, but it worked fine. It
> picks up a number of fixes:
> https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
>
> It might be good for Spark 2.4.
>
> Its serialized format isn't entirely compatible though. I'm trying to
> recall whether this is a problem in practice. We don't guarantee wire
> compatibility across mismatched Spark versions, right?
>
> But does the Kryo serialized form show up in any persistent stored form? I
> don't believe any normal output, even that of saveAsObjectFile, uses it.
>
> I'm wondering if I am not recalling why this would be a problem to update?
>
> Sean
>


Re: Kryo 4 serialized form changes -- a problem?

2018-01-19 Thread Sean Owen
Yeah, if users are using Kryo directly, they should be insulated from a
Spark-side change because of shading.
However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x. I
am not sure if that causes problems for apps.

Normally I'd avoid any major-version change in a minor release. This one
looked potentially entirely internal.
I think if there are any doubts, we can leave it for Spark 3. There was a
bug report that needed a fix from Kryo 4, but it might be minor after all.


On Fri, Jan 19, 2018 at 11:05 AM Koert Kuipers  wrote:

> it is mainly a problem because for reasons of sanity one wants to keep
> single kryo/chill version, and kryo/chill could be used in other places for
> somewhat persistent serialization by the user.
>
> i know, this is not spark's problem... it is the users problem. but i
> would find it odd to change kryo in a minor upgrade in general. not that it
> cannot be done.
>
>
>
> On Fri, Jan 19, 2018 at 8:55 AM, Sean Owen  wrote:
>
>> See:
>>
>> https://issues.apache.org/jira/browse/SPARK-23131
>> https://github.com/apache/spark/pull/20301#issuecomment-358473199
>>
>> I expected a major Kryo upgrade to be problematic, but it worked fine. It
>> picks up a number of fixes:
>> https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
>>
>> It might be good for Spark 2.4.
>>
>> Its serialized format isn't entirely compatible though. I'm trying to
>> recall whether this is a problem in practice. We don't guarantee wire
>> compatibility across mismatched Spark versions, right?
>>
>> But does the Kryo serialized form show up in any persistent stored form?
>> I don't believe any normal output, even that of saveAsObjectFile, uses it.
>>
>> I'm wondering if I am not recalling why this would be a problem to update?
>>
>> Sean
>>
>
>


Re: Kryo 4 serialized form changes -- a problem?

2018-01-19 Thread Koert Kuipers
i think it's probably fine, but i remember updating kryo and chill being a
major issue with scalding historically, exactly because kryo was also used
for serialized data on disk by some major users.

On Fri, Jan 19, 2018 at 12:13 PM, Sean Owen  wrote:

> Yeah, if users are using Kryo directly, they should be insulated from a
> Spark-side change because of shading.
> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x. I
> am not sure if that causes problems for apps.
>
> Normally I'd avoid any major-version change in a minor release. This one
> looked potentially entirely internal.
> I think if there are any doubts, we can leave it for Spark 3. There was a
> bug report that needed a fix from Kryo 4, but it might be minor after all.
>
>
> On Fri, Jan 19, 2018 at 11:05 AM Koert Kuipers  wrote:
>
>> it is mainly a problem because for reasons of sanity one wants to keep
>> single kryo/chill version, and kryo/chill could be used in other places for
>> somewhat persistent serialization by the user.
>>
>> i know, this is not spark's problem... it is the users problem. but i
>> would find it odd to change kryo in a minor upgrade in general. not that it
>> cannot be done.
>>
>>
>>
>> On Fri, Jan 19, 2018 at 8:55 AM, Sean Owen  wrote:
>>
>>> See:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-23131
>>> https://github.com/apache/spark/pull/20301#issuecomment-358473199
>>>
>>> I expected a major Kryo upgrade to be problematic, but it worked fine.
>>> It picks up a number of fixes:
>>> https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
>>>
>>> It might be good for Spark 2.4.
>>>
>>> Its serialized format isn't entirely compatible though. I'm trying to
>>> recall whether this is a problem in practice. We don't guarantee wire
>>> compatibility across mismatched Spark versions, right?
>>>
>>> But does the Kryo serialized form show up in any persistent stored form?
>>> I don't believe any normal output, even that of saveAsObjectFile, uses it.
>>>
>>> I'm wondering if I am not recalling why this would be a problem to
>>> update?
>>>
>>> Sean
>>>
>>
>>


Spark 3

2018-01-19 Thread Sean Owen
Forking this thread to muse about Spark 3. Like Spark 2, I assume it would
be more about making all those accumulated breaking changes and updating
lots of dependencies. Hadoop 3 looms large in that list as well as Scala
2.12.

Spark 1 was released in May 2014, and Spark 2 in July 2016. If Spark 2.3 is
out in Feb 2018 and it takes the now-usual 6 months until a next release,
Spark 3 could reasonably be next.

However the release cycles are naturally slowing down, and it could also be
said that 2019 would be more on schedule for Spark 3.

Nothing particularly urgent about deciding, but I'm curious if anyone had
an opinion on whether to move on to Spark 3 next or just continue with 2.4
later this year.

On Fri, Jan 19, 2018 at 11:13 AM Sean Owen  wrote:

> Yeah, if users are using Kryo directly, they should be insulated from a
> Spark-side change because of shading.
> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x. I
> am not sure if that causes problems for apps.
>
> Normally I'd avoid any major-version change in a minor release. This one
> looked potentially entirely internal.
> I think if there are any doubts, we can leave it for Spark 3. There was a
> bug report that needed a fix from Kryo 4, but it might be minor after all.
>
>>
>>


Re: Spark 3

2018-01-19 Thread Holden Karau
I think an interesting exercise would be to consider what changes we are
putting off for a major version, and whether they make enough of a difference
to warrant the work involved or whether we keep pushing them off.

Personally, the first thing that comes to mind is that I'd like to revisit the
accumulator APIs again and see if we can do something with them. What's top
of everyone else's mind?
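
For context, this is roughly what the accumulator API looks like in PySpark
today (a sketch of the existing sc.accumulator / AccumulatorParam usage, not a
proposal for a replacement; the VectorParam class and app name are made up):

    from pyspark import SparkContext, AccumulatorParam

    sc = SparkContext(appName="accumulator-demo")  # illustrative app name

    # Simple numeric accumulator: tasks only add to it, the driver reads it.
    counter = sc.accumulator(0)

    class VectorParam(AccumulatorParam):
        """Custom accumulator: element-wise sum of fixed-length lists."""
        def zero(self, value):
            return [0.0] * len(value)
        def addInPlace(self, v1, v2):
            return [a + b for a, b in zip(v1, v2)]

    totals = sc.accumulator([0.0, 0.0, 0.0], VectorParam())

    def visit(x):
        counter.add(1)
        totals.add([float(x), float(x) * x, 1.0])

    sc.parallelize(range(10)).foreach(visit)
    print(counter.value, totals.value)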

On Jan 20, 2018 6:32 AM, "Sean Owen"  wrote:

> Forking this thread to muse about Spark 3. Like Spark 2, I assume it would
> be more about making all those accumulated breaking changes and updating
> lots of dependencies. Hadoop 3 looms large in that list as well as Scala
> 2.12.
>
> Spark 1 was released in May 2014, and Spark 2 in July 2016. If Spark 2.3 is
> out in Feb 2018 and it takes the now-usual 6 months until a next release,
> Spark 3 could reasonably be next.
>
> However the release cycles are naturally slowing down, and it could also
> be said that 2019 would be more on schedule for Spark 3.
>
> Nothing particularly urgent about deciding, but I'm curious if anyone had
> an opinion on whether to move on to Spark 3 next or just continue with 2.4
> later this year.
>
> On Fri, Jan 19, 2018 at 11:13 AM Sean Owen  wrote:
>
>> Yeah, if users are using Kryo directly, they should be insulated from a
>> Spark-side change because of shading.
>> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x.
>> I am not sure if that causes problems for apps.
>>
>> Normally I'd avoid any major-version change in a minor release. This one
>> looked potentially entirely internal.
>> I think if there are any doubts, we can leave it for Spark 3. There was a
>> bug report that needed a fix from Kryo 4, but it might be minor after all.
>>
>>>
>>>


Re: Spark 3

2018-01-19 Thread Sean Owen
(Here are a few that have already been flagged for 3.0:
https://issues.apache.org/jira/browse/SPARK-22236?jql=project%20%3D%20SPARK%20AND%20%22Target%20Version%2Fs%22%20%20%3D%203.0.0
 )

On Fri, Jan 19, 2018 at 11:43 AM Holden Karau 
wrote:

> I think an interesting exercise would be to consider what changes we are
> putting off for a major version, and whether they make enough of a difference
> to warrant the work involved or whether we keep pushing them off.
>
> Personally, the first thing that comes to mind is that I'd like to revisit the
> accumulator APIs again and see if we can do something with them. What's top
> of everyone else's mind?
>
> On Jan 20, 2018 6:32 AM, "Sean Owen"  wrote:
>
>> Forking this thread to muse about Spark 3. Like Spark 2, I assume it
>> would be more about making all those accumulated breaking changes and
>> updating lots of dependencies. Hadoop 3 looms large in that list as well as
>> Scala 2.12.
>>
>> Spark 1 was released in May 2014, and Spark 2 in July 2016. If Spark 2.3
>> is out in Feb 2018 and it takes the now-usual 6 months until a next
>> release, Spark 3 could reasonably be next.
>>
>> However the release cycles are naturally slowing down, and it could also
>> be said that 2019 would be more on schedule for Spark 3.
>>
>> Nothing particularly urgent about deciding, but I'm curious if anyone had
>> an opinion on whether to move on to Spark 3 next or just continue with 2.4
>> later this year.
>>
>> On Fri, Jan 19, 2018 at 11:13 AM Sean Owen  wrote:
>>
>>> Yeah, if users are using Kryo directly, they should be insulated from a
>>> Spark-side change because of shading.
>>> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x.
>>> I am not sure if that causes problems for apps.
>>>
>>> Normally I'd avoid any major-version change in a minor release. This one
>>> looked potentially entirely internal.
>>> I think if there are any doubts, we can leave it for Spark 3. There was
>>> a bug report that needed a fix from Kryo 4, but it might be minor after all.
>>>




Re: Spark 3

2018-01-19 Thread Ryan Blue
I think that the DataFrame write API needs improvement, and I'm looking
forward to getting DataSourceV2 to be complete and reliable, but neither of
those requires breaking changes in the short term.

For both, I think we should develop in parallel until they are mature
enough to replace what we use currently, and only then make a breaking
change to remove the old code. For me, that means aiming for a breaking
release some time in 2019.

rb

On Fri, Jan 19, 2018 at 9:43 AM, Holden Karau 
wrote:

> I think an interesting exercise would be to consider what changes we are
> putting off for a major version, and whether they make enough of a difference
> to warrant the work involved or whether we keep pushing them off.
>
> Personally, the first thing that comes to mind is that I'd like to revisit the
> accumulator APIs again and see if we can do something with them. What's top
> of everyone else's mind?
>
> On Jan 20, 2018 6:32 AM, "Sean Owen"  wrote:
>
>> Forking this thread to muse about Spark 3. Like Spark 2, I assume it
>> would be more about making all those accumulated breaking changes and
>> updating lots of dependencies. Hadoop 3 looms large in that list as well as
>> Scala 2.12.
>>
>> Spark 1 was released in May 2014, and Spark 2 in July 2016. If Spark 2.3
>> is out in Feb 2018 and it takes the now-usual 6 months until a next
>> release, Spark 3 could reasonably be next.
>>
>> However the release cycles are naturally slowing down, and it could also
>> be said that 2019 would be more on schedule for Spark 3.
>>
>> Nothing particularly urgent about deciding, but I'm curious if anyone had
>> an opinion on whether to move on to Spark 3 next or just continue with 2.4
>> later this year.
>>
>> On Fri, Jan 19, 2018 at 11:13 AM Sean Owen  wrote:
>>
>>> Yeah, if users are using Kryo directly, they should be insulated from a
>>> Spark-side change because of shading.
>>> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x.
>>> I am not sure if that causes problems for apps.
>>>
>>> Normally I'd avoid any major-version change in a minor release. This one
>>> looked potentially entirely internal.
>>> I think if there are any doubts, we can leave it for Spark 3. There was
>>> a bug report that needed a fix from Kryo 4, but it might be minor after all.
>>>




-- 
Ryan Blue
Software Engineer
Netflix


Re: Spark 3

2018-01-19 Thread Koert Kuipers
i was expecting to be able to move to scala 2.12 sometime this year

if this cannot be done in spark 2.x then that could be a compelling reason
to move spark 3 up to 2018 i think

hadoop 3 sounds great but personally i have no use case for it yet

On Fri, Jan 19, 2018 at 12:31 PM, Sean Owen  wrote:

> Forking this thread to muse about Spark 3. Like Spark 2, I assume it would
> be more about making all those accumulated breaking changes and updating
> lots of dependencies. Hadoop 3 looms large in that list as well as Scala
> 2.12.
>
> Spark 1 was released in May 2014, and Spark 2 in July 2016. If Spark 2.3 is
> out in Feb 2018 and it takes the now-usual 6 months until a next release,
> Spark 3 could reasonably be next.
>
> However the release cycles are naturally slowing down, and it could also
> be said that 2019 would be more on schedule for Spark 3.
>
> Nothing particularly urgent about deciding, but I'm curious if anyone had
> an opinion on whether to move on to Spark 3 next or just continue with 2.4
> later this year.
>
> On Fri, Jan 19, 2018 at 11:13 AM Sean Owen  wrote:
>
>> Yeah, if users are using Kryo directly, they should be insulated from a
>> Spark-side change because of shading.
>> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x.
>> I am not sure if that causes problems for apps.
>>
>> Normally I'd avoid any major-version change in a minor release. This one
>> looked potentially entirely internal.
>> I think if there are any doubts, we can leave it for Spark 3. There was a
>> bug report that needed a fix from Kryo 4, but it might be minor after all.
>>
>>>
>>>


Re: Spark 3

2018-01-19 Thread Justin Miller
Would that mean supporting both 2.12 and 2.11? Could be a while before some of 
our libraries are off of 2.11.

Thanks,
Justin

> On Jan 19, 2018, at 10:53 AM, Koert Kuipers  wrote:
> 
> i was expecting to be able to move to scala 2.12 sometime this year
> 
> if this cannot be done in spark 2.x then that could be a compelling reason to 
> move spark 3 up to 2018 i think
> 
> hadoop 3 sounds great but personally i have no use case for it yet
> 
> On Fri, Jan 19, 2018 at 12:31 PM, Sean Owen  > wrote:
> Forking this thread to muse about Spark 3. Like Spark 2, I assume it would be 
> more about making all those accumulated breaking changes and updating lots of 
> dependencies. Hadoop 3 looms large in that list as well as Scala 2.12.
> 
> Spark 1 was released in May 2014, and Spark 2 in July 2016. If Spark 2.3 is
> out in Feb 2018 and it takes the now-usual 6 months until a next release, 
> Spark 3 could reasonably be next.
> 
> However the release cycles are naturally slowing down, and it could also be 
> said that 2019 would be more on schedule for Spark 3.
> 
> Nothing particularly urgent about deciding, but I'm curious if anyone had an 
> opinion on whether to move on to Spark 3 next or just continue with 2.4 later 
> this year.
> 
> On Fri, Jan 19, 2018 at 11:13 AM Sean Owen  > wrote:
> Yeah, if users are using Kryo directly, they should be insulated from a 
> Spark-side change because of shading.
> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x. I am 
> not sure if that causes problems for apps.
> 
> Normally I'd avoid any major-version change in a minor release. This one 
> looked potentially entirely internal.
> I think if there are any doubts, we can leave it for Spark 3. There was a bug 
> report that needed a fix from Kryo 4, but it might be minor after all.
> 
> 



Re: Spark 3

2018-01-19 Thread Reynold Xin
We can certainly provide a build for Scala 2.12, even in 2.x.


On Fri, Jan 19, 2018 at 10:17 AM, Justin Miller <
justin.mil...@protectwise.com> wrote:

> Would that mean supporting both 2.12 and 2.11? Could be a while before
> some of our libraries are off of 2.11.
>
> Thanks,
> Justin
>
>
> On Jan 19, 2018, at 10:53 AM, Koert Kuipers  wrote:
>
> i was expecting to be able to move to scala 2.12 sometime this year
>
> if this cannot be done in spark 2.x then that could be a compelling reason
> to move spark 3 up to 2018 i think
>
> hadoop 3 sounds great but personally i have no use case for it yet
>
> On Fri, Jan 19, 2018 at 12:31 PM, Sean Owen  wrote:
>
>> Forking this thread to muse about Spark 3. Like Spark 2, I assume it
>> would be more about making all those accumulated breaking changes and
>> updating lots of dependencies. Hadoop 3 looms large in that list as well as
>> Scala 2.12.
>>
>> Spark 1 was released in May 2014, and Spark 2 in July 2016. If Spark 2.3
>> is out in Feb 2018 and it takes the now-usual 6 months until a next
>> release, Spark 3 could reasonably be next.
>>
>> However the release cycles are naturally slowing down, and it could also
>> be said that 2019 would be more on schedule for Spark 3.
>>
>> Nothing particularly urgent about deciding, but I'm curious if anyone had
>> an opinion on whether to move on to Spark 3 next or just continue with 2.4
>> later this year.
>>
>> On Fri, Jan 19, 2018 at 11:13 AM Sean Owen  wrote:
>>
>>> Yeah, if users are using Kryo directly, they should be insulated from a
>>> Spark-side change because of shading.
>>> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x.
>>> I am not sure if that causes problems for apps.
>>>
>>> Normally I'd avoid any major-version change in a minor release. This one
>>> looked potentially entirely internal.
>>> I think if there are any doubts, we can leave it for Spark 3. There was
>>> a bug report that needed a fix from Kryo 4, but it might be minor after all.
>>>


>
>


Re: Thoughts on Cloudpickle Update

2018-01-19 Thread Bryan Cutler
Thanks Holden and Hyukjin. I agree, let's start doing the work first and
see if the changes are low-risk enough; then we can evaluate how best to
proceed. I made https://issues.apache.org/jira/browse/SPARK-23159 and will
get started on the update, and we can continue the discussion in the PR.

On Fri, Jan 19, 2018 at 1:32 AM, Hyukjin Kwon  wrote:

> Yea, that sounds good to me.
>
> 2018-01-19 18:29 GMT+09:00 Holden Karau :
>
>> So it is pretty core, but its one of the better indirectly tested
>> components. I think probably the most reasonable path is to see what the
>> diff ends up looking like and make a call at that point for if we want it
>> to go to master or master & branch-2.3?
>>
>> On Fri, Jan 19, 2018 at 12:30 AM, Hyukjin Kwon 
>> wrote:
>>
>>> > So given that it fixes some real world bugs, any particular reason
>>> why? Would you be comfortable with doing it in 2.3.1?
>>>
>>> Ah, I don't feel strongly about this but RC2 will be running on and
>>> cloudpickle's quite core fix to PySpark. Just thought we might want to have
>>> enough time with it.
>>>
>>> One worry is, upgrading it includes a fix about namedtuple too where
>>> PySpark has a custom fix.
>>> I would like to check few things about this.
>>>
>>> So, yea, it's vague. I wouldn't stay against if you'd prefer.
>>>
>>>
>>>
>>>
>>> 2018-01-19 16:42 GMT+09:00 Holden Karau :
>>>


 On Jan 19, 2018 7:28 PM, "Hyukjin Kwon"  wrote:

 > Is it an option to match the latest version of cloudpickle and still
 set protocol level 2?

 IMHO, I think this can be an option but I am not fully sure yet if we
 should/could go ahead for it within Spark 2.X. I need some
 investigations including things about Pyrolite.

 Let's go ahead with matching it to 0.4.2 first. I am quite clear on
 matching it to 0.4.2 at least.

 So given that there is a follow up on which fixes a regression if we're
 not comfortable doing the latest version let's double-check that the
 version we do upgrade to doesn't have that regression.



 > I agree that upgrading to try and match version 0.4.2 would be a
 good starting point. Unless no one objects, I will open up a JIRA and try
 to do this.

 Yup but I think we shouldn't make this into Spark 2.3.0 to be clear.

 So given that it fixes some real world bugs, any particular reason why?
 Would you be comfortable with doing it in 2.3.1?



 > Also lets try to keep track in our commit messages which version of
 cloudpickle we end up upgrading to.

 +1: PR description, commit message or any unit to identify each will be 
 useful.
 It should be easier once we have a matched version.



 2018-01-19 12:55 GMT+09:00 Holden Karau :

> So if there are different version of Python on the cluster machines I
> think that's already unsupported so I'm not worried about that.
>
> I'd suggest going to the highest released version since there appear
> to be some useful fixes between 0.4.2 & 0.5.2
>
> Also lets try to keep track in our commit messages which version of
> cloudpickle we end up upgrading to.
>
> On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler 
> wrote:
>
>> Thanks for all the details and background Hyukjin! Regarding the
>> pickle protocol change, if I understand correctly, it is currently at 
>> level
>> 2 in Spark which is good for backwards compatibility for all of Python 2.
>> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
>> above, will pick a level determined by your Python version. So is the
>> concern here for Spark if someone has different versions of Python in 
>> their
>> cluster, like 3.5 and 3.3, then different protocols will be used and
>> deserialization might fail?  Is it an option to match the latest version 
>> of
>> cloudpickle and still set protocol level 2?
>>
>> I agree that upgrading to try and match version 0.4.2 would be a good
>> starting point. Unless no one objects, I will open up a JIRA and try to 
>> do
>> this.
>>
>> Thanks,
>> Bryan
>>
>> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon 
>> wrote:
>>
>>> Hi Bryan,
>>>
>>> Yup, I support to match the version. I pushed it forward before to
>>> match it with https://github.com/cloudpipe/cloudpickle
>>> before few times in Spark's copy and also cloudpickle itself with
>>> few fixes. I believe our copy is closest to 0.4.1.
>>>
>>> I have been trying to follow up the changes in cloudpipe/cloudpickle
>>> for which version we should match, I think we should match
>>> it with 0.4.2 first (I need to double check) because IMHO they have
>>> been adding rather radical changes from 0.5.0, including
>>> pickle protocol change (by default).
>>>
>>> Personally, I would like to match it with