Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Reynold Xin
I understand the argument to add JDK 11 support just to extend the EOL, but
the other things seem kind of arbitrary and are not supported by your
arguments, especially DSv2, which is a massive change. DSv2, IIUC, is not API
stable yet and will continue to evolve in the 3.x line.

Spark is designed in a way that’s decoupled from storage, and as a result
one can run multiple versions of Spark in parallel during migration.

On Fri, Jun 12, 2020 at 9:40 PM DB Tsai  wrote:

> +1 for a 2.x release with DSv2, JDK11, and Scala 2.11 support
>
> We had an internal preview version of Spark 3.0 for our customers to try
> out for a while, and then we realized that it's very challenging for
> enterprise applications in production to move to Spark 3.0. For example,
> many of our customers' Spark applications depend on some internal projects
> that may not be owned by ETL teams; it requires much coordination with
> other teams to cross-build the dependencies that Spark applications depend
> on with Scala 2.12 in order to use Spark 3.0. Now, we removed the support
> of Scala 2.11 in Spark 3.0, this results in a really big gap to migrate
> from 2.x version to 3.0 based on my observation working with our customers.
>
> Also, JDK8 is already EOL, in some companies, using JDK8 is not supported
> by the infra team, and requires an exception to use unsupported JDK. Of
> course, for those companies, they can use vendor's Spark distribution such
> as CDH Spark 2.4 which supports JDK11 or they can maintain their own Spark
> release which is possible but not very trivial.
>
> As a result, having a 2.5 release with DSv2, JDK11, and Scala 2.11 support
> can definitely lower the gap, and users can still move forward using new
> features. Afterall, the reason why we are working on OSS is we like people
> to use our code, isn't it?
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
>
> On Fri, Jun 12, 2020 at 8:51 PM Jungtaek Lim 
> wrote:
>
>> I guess we already went through the same discussion, right? If anyone is
>> missed, please go through the discussion thread. [1] The consensus looks to
>> be not positive to migrate the new DSv2 into Spark 2.x version line,
>> because the change is pretty much huge, and also backward incompatible.
>>
>> What I can think of benefits of having Spark 2.5 is to avoid force
>> upgrade to the major release to have fixes for critical bugs. Not all
>> critical fixes were landed to 2.x as well, because some fixes bring
>> backward incompatibility. We don't land these fixes to the 2.x version line
>> because we didn't consider having Spark 2.5 before - we don't want to let
>> end users tolerate the inconvenience during upgrading bugfix version. End
>> users may be OK to tolerate during upgrading minor version, since they can
>> still live with 2.4.x to deny these fixes.
>>
>> In addition, given there's a huge time gap between Spark 2.4 and 3.0, we
>> might want to consider porting some of features which don't bring backward
>> incompatibility. Well, new major features of Spark 3.0 would be probably
>> better to be introduced in Spark 3.0, but some features could be,
>> especially if the feature resolves the long-standing issue or the feature
>> has been provided for a long time in competitive products.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1.
>> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Spark-2-5-release-td27963.html#a27979
>>
>> On Sat, Jun 13, 2020 at 10:13 AM Ryan Blue 
>> wrote:
>>
>>> +1 for a 2.x release with a DSv2 API that matches 3.0.
>>>
>>> There are a lot of big differences between the API in 2.4 and 3.0, and I
>>> think a release to help migrate would be beneficial to organizations like
>>> ours that will be supporting 2.x and 3.0 in parallel for quite a while.
>>> Migration to Spark 3 is going to take time as people build confidence in
>>> it. I don't think that can be avoided by leaving a larger feature gap
>>> between 2.x and 3.0.
>>>
>>> On Fri, Jun 12, 2020 at 5:53 PM Xiao Li  wrote:
>>>
 Based on my understanding, DSV2 is not stable yet. It still
 misses various features. Even our built-in file sources are still unable to
 fully migrate to DSV2. We plan to enhance it in the next few releases to
 close the gap.

 Also, the changes on DSV2 in Spark 3.0 did not break any existing
 application. We should encourage more users to try Spark 3 and increase the
 adoption of Spark 3.x.

 Xiao

 On Fri, Jun 12, 2020 at 5:36 PM Holden Karau 
 wrote:

> So I one of the things which we’re planning on backporting internally
> is DSv2, which I think being available in a community release in a 2 
> branch
> would be more broadly useful. Anything else on top of that would be on a
> case by case basis for if they make an easier upgrade path to 3.
>
> If we’re worried about people using 2.5 as 

Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread DB Tsai
+1 for a 2.x release with DSv2, JDK11, and Scala 2.11 support

We had an internal preview version of Spark 3.0 available for our customers to
try out for a while, and we realized that it's very challenging for
enterprise applications in production to move to Spark 3.0. For example,
many of our customers' Spark applications depend on internal projects
that may not be owned by the ETL teams; it takes a lot of coordination with
other teams to cross-build the dependencies that the Spark applications rely
on with Scala 2.12 in order to use Spark 3.0. Now that we have removed Scala
2.11 support in Spark 3.0, this creates a really big gap when migrating from
the 2.x line to 3.0, based on my observation working with our customers.
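For illustration, the cross-building involved usually boils down to publishing each
internal library for both Scala binary versions. A minimal sbt sketch might look like
the following; the module name and the exact Scala/Spark versions are placeholders,
not taken from any project mentioned in this thread:

// build.sbt - cross-build an internal library for Scala 2.11 (Spark 2.x)
// and Scala 2.12 (Spark 3.0). All names and versions are illustrative.
name := "internal-etl-utils"

crossScalaVersions := Seq("2.11.12", "2.12.10")

// Compile against the Spark line that matches the Scala binary version.
libraryDependencies += "org.apache.spark" %% "spark-sql" % {
  scalaBinaryVersion.value match {
    case "2.11" => "2.4.5"
    case _      => "3.0.0"
  }
} % "provided"

Running `sbt +publish` then produces both _2.11 and _2.12 artifacts; the coordination
cost is getting every owning team to do this before any application can move to 3.0.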

Also, JDK 8 is already EOL. In some companies, using JDK 8 is not supported
by the infra team and requires an exception to run an unsupported JDK. Of
course, those companies can use a vendor's Spark distribution, such as CDH
Spark 2.4, which supports JDK 11, or they can maintain their own Spark
release, which is possible but not trivial.

As a result, having a 2.5 release with DSv2, JDK 11, and Scala 2.11 support
can definitely narrow the gap, and users can still move forward and use new
features. After all, the reason we work on OSS is that we want people
to use our code, isn't it?

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1


On Fri, Jun 12, 2020 at 8:51 PM Jungtaek Lim 
wrote:

> I guess we already went through the same discussion, right? If anyone is
> missed, please go through the discussion thread. [1] The consensus looks to
> be not positive to migrate the new DSv2 into Spark 2.x version line,
> because the change is pretty much huge, and also backward incompatible.
>
> What I can think of benefits of having Spark 2.5 is to avoid force upgrade
> to the major release to have fixes for critical bugs. Not all critical
> fixes were landed to 2.x as well, because some fixes bring backward
> incompatibility. We don't land these fixes to the 2.x version line because
> we didn't consider having Spark 2.5 before - we don't want to let end users
> tolerate the inconvenience during upgrading bugfix version. End users may
> be OK to tolerate during upgrading minor version, since they can still live
> with 2.4.x to deny these fixes.
>
> In addition, given there's a huge time gap between Spark 2.4 and 3.0, we
> might want to consider porting some of features which don't bring backward
> incompatibility. Well, new major features of Spark 3.0 would be probably
> better to be introduced in Spark 3.0, but some features could be,
> especially if the feature resolves the long-standing issue or the feature
> has been provided for a long time in competitive products.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1.
> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Spark-2-5-release-td27963.html#a27979
>
> On Sat, Jun 13, 2020 at 10:13 AM Ryan Blue 
> wrote:
>
>> +1 for a 2.x release with a DSv2 API that matches 3.0.
>>
>> There are a lot of big differences between the API in 2.4 and 3.0, and I
>> think a release to help migrate would be beneficial to organizations like
>> ours that will be supporting 2.x and 3.0 in parallel for quite a while.
>> Migration to Spark 3 is going to take time as people build confidence in
>> it. I don't think that can be avoided by leaving a larger feature gap
>> between 2.x and 3.0.
>>
>> On Fri, Jun 12, 2020 at 5:53 PM Xiao Li  wrote:
>>
>>> Based on my understanding, DSV2 is not stable yet. It still
>>> misses various features. Even our built-in file sources are still unable to
>>> fully migrate to DSV2. We plan to enhance it in the next few releases to
>>> close the gap.
>>>
>>> Also, the changes on DSV2 in Spark 3.0 did not break any existing
>>> application. We should encourage more users to try Spark 3 and increase the
>>> adoption of Spark 3.x.
>>>
>>> Xiao
>>>
>>> On Fri, Jun 12, 2020 at 5:36 PM Holden Karau 
>>> wrote:
>>>
 So I one of the things which we’re planning on backporting internally
 is DSv2, which I think being available in a community release in a 2 branch
 would be more broadly useful. Anything else on top of that would be on a
 case by case basis for if they make an easier upgrade path to 3.

 If we’re worried about people using 2.5 as a long term home we could
 always mark it with “-transitional” or something similar?

 On Fri, Jun 12, 2020 at 4:33 PM Sean Owen  wrote:

> What is the functionality that would go into a 2.5.0 release, that
> can't be in a 2.4.7 release? I think that's the key question. 2.4.x is the
> 2.x maintenance branch, and I personally could imagine being open to more
> freely backporting a few new features for 2.x users, whereas usually it's
> only bug fixes. Making 2.5.0 implies that 2.5.x is the 2.x maintenance
> branch but there's something too big for a 

Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Jungtaek Lim
I guess we already went through the same discussion, right? If anyone missed
it, please go through the discussion thread. [1] The consensus did not look
positive on migrating the new DSv2 into the Spark 2.x line, because the change
is quite large and also backward incompatible.

The benefit I can think of for having Spark 2.5 is avoiding a forced upgrade
to the major release in order to get fixes for critical bugs. Not all critical
fixes have landed in 2.x, because some of them bring backward
incompatibility. We haven't landed those fixes in the 2.x line because
we didn't consider having a Spark 2.5 before - we don't want end users to
tolerate that inconvenience when upgrading to a bugfix version. End users may
be OK with tolerating it when upgrading to a minor version, since they can
still stay on 2.4.x and opt out of these fixes.

In addition, given the huge time gap between Spark 2.4 and 3.0, we
might want to consider porting some features that don't introduce backward
incompatibility. New major features of Spark 3.0 would probably be better
introduced in Spark 3.0 itself, but some features could be ported,
especially if a feature resolves a long-standing issue or has been available
for a long time in competing products.

Thanks,
Jungtaek Lim (HeartSaVioR)

1.
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Spark-2-5-release-td27963.html#a27979

On Sat, Jun 13, 2020 at 10:13 AM Ryan Blue 
wrote:

> +1 for a 2.x release with a DSv2 API that matches 3.0.
>
> There are a lot of big differences between the API in 2.4 and 3.0, and I
> think a release to help migrate would be beneficial to organizations like
> ours that will be supporting 2.x and 3.0 in parallel for quite a while.
> Migration to Spark 3 is going to take time as people build confidence in
> it. I don't think that can be avoided by leaving a larger feature gap
> between 2.x and 3.0.
>
> On Fri, Jun 12, 2020 at 5:53 PM Xiao Li  wrote:
>
>> Based on my understanding, DSV2 is not stable yet. It still
>> misses various features. Even our built-in file sources are still unable to
>> fully migrate to DSV2. We plan to enhance it in the next few releases to
>> close the gap.
>>
>> Also, the changes on DSV2 in Spark 3.0 did not break any existing
>> application. We should encourage more users to try Spark 3 and increase the
>> adoption of Spark 3.x.
>>
>> Xiao
>>
>> On Fri, Jun 12, 2020 at 5:36 PM Holden Karau 
>> wrote:
>>
>>> So I one of the things which we’re planning on backporting internally is
>>> DSv2, which I think being available in a community release in a 2 branch
>>> would be more broadly useful. Anything else on top of that would be on a
>>> case by case basis for if they make an easier upgrade path to 3.
>>>
>>> If we’re worried about people using 2.5 as a long term home we could
>>> always mark it with “-transitional” or something similar?
>>>
>>> On Fri, Jun 12, 2020 at 4:33 PM Sean Owen  wrote:
>>>
 What is the functionality that would go into a 2.5.0 release, that
 can't be in a 2.4.7 release? I think that's the key question. 2.4.x is the
 2.x maintenance branch, and I personally could imagine being open to more
 freely backporting a few new features for 2.x users, whereas usually it's
 only bug fixes. Making 2.5.0 implies that 2.5.x is the 2.x maintenance
 branch but there's something too big for a 'normal' maintenance release,
 and I think the whole question turns on what that is.

 If it's things like JDK 11 support, I think that is unfortunately
 fairly 'breaking' because of dependency updates. But maybe that's not it.


 On Fri, Jun 12, 2020 at 4:38 PM Holden Karau 
 wrote:

> Hi Folks,
>
> As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5
> release. Spark 3 brings a number of important changes, and by its nature 
> is
> not backward compatible. I think we'd all like to have as smooth an 
> upgrade
> experience to Spark 3 as possible, and I believe that having a Spark 2
> release some of the new functionality while continuing to support the 
> older
> APIs and current Scala version would make the upgrade path smoother.
>
> This pattern is not uncommon in other Hadoop ecosystem projects, like
> Hadoop itself and HBase.
>
> I know that Ryan Blue has indicated he is already going to be
> maintaining something like that internally at Netflix, and we'll be doing
> the same thing at Apple. It seems like having a transitional release could
> benefit the community with easy migrations and help avoid duplicated work.
>
> I want to be clear I'm volunteering to do the work of managing a 2.5
> release, so hopefully, this wouldn't create any substantial burdens on the
> community.
>
> Cheers,
>
> Holden
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> 

Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Ryan Blue
+1 for a 2.x release with a DSv2 API that matches 3.0.

There are a lot of big differences between the API in 2.4 and 3.0, and I
think a release to help migrate would be beneficial to organizations like
ours that will be supporting 2.x and 3.0 in parallel for quite a while.
Migration to Spark 3 is going to take time as people build confidence in
it. I don't think that can be avoided by leaving a larger feature gap
between 2.x and 3.0.

On Fri, Jun 12, 2020 at 5:53 PM Xiao Li  wrote:

> Based on my understanding, DSV2 is not stable yet. It still misses various
> features. Even our built-in file sources are still unable to fully migrate
> to DSV2. We plan to enhance it in the next few releases to close the gap.
>
> Also, the changes on DSV2 in Spark 3.0 did not break any existing
> application. We should encourage more users to try Spark 3 and increase the
> adoption of Spark 3.x.
>
> Xiao
>
> On Fri, Jun 12, 2020 at 5:36 PM Holden Karau  wrote:
>
>> So I one of the things which we’re planning on backporting internally is
>> DSv2, which I think being available in a community release in a 2 branch
>> would be more broadly useful. Anything else on top of that would be on a
>> case by case basis for if they make an easier upgrade path to 3.
>>
>> If we’re worried about people using 2.5 as a long term home we could
>> always mark it with “-transitional” or something similar?
>>
>> On Fri, Jun 12, 2020 at 4:33 PM Sean Owen  wrote:
>>
>>> What is the functionality that would go into a 2.5.0 release, that can't
>>> be in a 2.4.7 release? I think that's the key question. 2.4.x is the 2.x
>>> maintenance branch, and I personally could imagine being open to more
>>> freely backporting a few new features for 2.x users, whereas usually it's
>>> only bug fixes. Making 2.5.0 implies that 2.5.x is the 2.x maintenance
>>> branch but there's something too big for a 'normal' maintenance release,
>>> and I think the whole question turns on what that is.
>>>
>>> If it's things like JDK 11 support, I think that is unfortunately fairly
>>> 'breaking' because of dependency updates. But maybe that's not it.
>>>
>>>
>>> On Fri, Jun 12, 2020 at 4:38 PM Holden Karau 
>>> wrote:
>>>
 Hi Folks,

 As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5
 release. Spark 3 brings a number of important changes, and by its nature is
 not backward compatible. I think we'd all like to have as smooth an upgrade
 experience to Spark 3 as possible, and I believe that having a Spark 2
 release some of the new functionality while continuing to support the older
 APIs and current Scala version would make the upgrade path smoother.

 This pattern is not uncommon in other Hadoop ecosystem projects, like
 Hadoop itself and HBase.

 I know that Ryan Blue has indicated he is already going to be
 maintaining something like that internally at Netflix, and we'll be doing
 the same thing at Apple. It seems like having a transitional release could
 benefit the community with easy migrations and help avoid duplicated work.

 I want to be clear I'm volunteering to do the work of managing a 2.5
 release, so hopefully, this wouldn't create any substantial burdens on the
 community.

 Cheers,

 Holden
 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> 
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Xiao Li
Based on my understanding, DSv2 is not stable yet. It is still missing various
features; even our built-in file sources are still unable to fully migrate
to DSv2. We plan to enhance it over the next few releases to close the gap.

Also, the changes to DSv2 in Spark 3.0 did not break any existing
applications. We should encourage more users to try Spark 3 and increase the
adoption of Spark 3.x.

Xiao

On Fri, Jun 12, 2020 at 5:36 PM Holden Karau  wrote:

> So I one of the things which we’re planning on backporting internally is
> DSv2, which I think being available in a community release in a 2 branch
> would be more broadly useful. Anything else on top of that would be on a
> case by case basis for if they make an easier upgrade path to 3.
>
> If we’re worried about people using 2.5 as a long term home we could
> always mark it with “-transitional” or something similar?
>
> On Fri, Jun 12, 2020 at 4:33 PM Sean Owen  wrote:
>
>> What is the functionality that would go into a 2.5.0 release, that can't
>> be in a 2.4.7 release? I think that's the key question. 2.4.x is the 2.x
>> maintenance branch, and I personally could imagine being open to more
>> freely backporting a few new features for 2.x users, whereas usually it's
>> only bug fixes. Making 2.5.0 implies that 2.5.x is the 2.x maintenance
>> branch but there's something too big for a 'normal' maintenance release,
>> and I think the whole question turns on what that is.
>>
>> If it's things like JDK 11 support, I think that is unfortunately fairly
>> 'breaking' because of dependency updates. But maybe that's not it.
>>
>>
>> On Fri, Jun 12, 2020 at 4:38 PM Holden Karau 
>> wrote:
>>
>>> Hi Folks,
>>>
>>> As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5
>>> release. Spark 3 brings a number of important changes, and by its nature is
>>> not backward compatible. I think we'd all like to have as smooth an upgrade
>>> experience to Spark 3 as possible, and I believe that having a Spark 2
>>> release some of the new functionality while continuing to support the older
>>> APIs and current Scala version would make the upgrade path smoother.
>>>
>>> This pattern is not uncommon in other Hadoop ecosystem projects, like
>>> Hadoop itself and HBase.
>>>
>>> I know that Ryan Blue has indicated he is already going to be
>>> maintaining something like that internally at Netflix, and we'll be doing
>>> the same thing at Apple. It seems like having a transitional release could
>>> benefit the community with easy migrations and help avoid duplicated work.
>>>
>>> I want to be clear I'm volunteering to do the work of managing a 2.5
>>> release, so hopefully, this wouldn't create any substantial burdens on the
>>> community.
>>>
>>> Cheers,
>>>
>>> Holden
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 



Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Holden Karau
So, one of the things we're planning on backporting internally is DSv2, and I
think having it available in a community release on a 2.x branch would be
more broadly useful. Anything else on top of that would be considered on a
case-by-case basis, based on whether it makes for an easier upgrade path to 3.

If we're worried about people using 2.5 as a long-term home, we could always
mark it with “-transitional” or something similar?

On Fri, Jun 12, 2020 at 4:33 PM Sean Owen  wrote:

> What is the functionality that would go into a 2.5.0 release, that can't
> be in a 2.4.7 release? I think that's the key question. 2.4.x is the 2.x
> maintenance branch, and I personally could imagine being open to more
> freely backporting a few new features for 2.x users, whereas usually it's
> only bug fixes. Making 2.5.0 implies that 2.5.x is the 2.x maintenance
> branch but there's something too big for a 'normal' maintenance release,
> and I think the whole question turns on what that is.
>
> If it's things like JDK 11 support, I think that is unfortunately fairly
> 'breaking' because of dependency updates. But maybe that's not it.
>
>
> On Fri, Jun 12, 2020 at 4:38 PM Holden Karau  wrote:
>
>> Hi Folks,
>>
>> As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5
>> release. Spark 3 brings a number of important changes, and by its nature is
>> not backward compatible. I think we'd all like to have as smooth an upgrade
>> experience to Spark 3 as possible, and I believe that having a Spark 2
>> release some of the new functionality while continuing to support the older
>> APIs and current Scala version would make the upgrade path smoother.
>>
>> This pattern is not uncommon in other Hadoop ecosystem projects, like
>> Hadoop itself and HBase.
>>
>> I know that Ryan Blue has indicated he is already going to be maintaining
>> something like that internally at Netflix, and we'll be doing the same
>> thing at Apple. It seems like having a transitional release could benefit
>> the community with easy migrations and help avoid duplicated work.
>>
>> I want to be clear I'm volunteering to do the work of managing a 2.5
>> release, so hopefully, this wouldn't create any substantial burdens on the
>> community.
>>
>> Cheers,
>>
>> Holden
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Sean Owen
What is the functionality that would go into a 2.5.0 release, that can't be
in a 2.4.7 release? I think that's the key question. 2.4.x is the 2.x
maintenance branch, and I personally could imagine being open to more
freely backporting a few new features for 2.x users, whereas usually it's
only bug fixes. Making 2.5.0 implies that 2.5.x is the 2.x maintenance
branch but there's something too big for a 'normal' maintenance release,
and I think the whole question turns on what that is.

If it's things like JDK 11 support, I think that is unfortunately fairly
'breaking' because of dependency updates. But maybe that's not it.


On Fri, Jun 12, 2020 at 4:38 PM Holden Karau  wrote:

> Hi Folks,
>
> As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5
> release. Spark 3 brings a number of important changes, and by its nature is
> not backward compatible. I think we'd all like to have as smooth an upgrade
> experience to Spark 3 as possible, and I believe that having a Spark 2
> release some of the new functionality while continuing to support the older
> APIs and current Scala version would make the upgrade path smoother.
>
> This pattern is not uncommon in other Hadoop ecosystem projects, like
> Hadoop itself and HBase.
>
> I know that Ryan Blue has indicated he is already going to be maintaining
> something like that internally at Netflix, and we'll be doing the same
> thing at Apple. It seems like having a transitional release could benefit
> the community with easy migrations and help avoid duplicated work.
>
> I want to be clear I'm volunteering to do the work of managing a 2.5
> release, so hopefully, this wouldn't create any substantial burdens on the
> community.
>
> Cheers,
>
> Holden
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Xiao Li
Which new functionality are you referring to? In Spark SQL, most of the
major features in Spark 3.0 are difficult and time-consuming to backport
(adaptive query execution, for example). Releasing a new version is not hard,
but backporting, reviewing, and maintaining these features is very time-consuming.

Which old APIs are broken? If the impact is big, we should add them back
based on our earlier discussion:
http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html

Thanks,

Xiao


On Fri, Jun 12, 2020 at 2:38 PM Holden Karau  wrote:

> Hi Folks,
>
> As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5
> release. Spark 3 brings a number of important changes, and by its nature is
> not backward compatible. I think we'd all like to have as smooth an upgrade
> experience to Spark 3 as possible, and I believe that having a Spark 2
> release some of the new functionality while continuing to support the older
> APIs and current Scala version would make the upgrade path smoother.
>
> This pattern is not uncommon in other Hadoop ecosystem projects, like
> Hadoop itself and HBase.
>
> I know that Ryan Blue has indicated he is already going to be maintaining
> something like that internally at Netflix, and we'll be doing the same
> thing at Apple. It seems like having a transitional release could benefit
> the community with easy migrations and help avoid duplicated work.
>
> I want to be clear I'm volunteering to do the work of managing a 2.5
> release, so hopefully, this wouldn't create any substantial burdens on the
> community.
>
> Cheers,
>
> Holden
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 



Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Holden Karau
Hi Folks,

As we're getting closer to Spark 3, I'd like to revisit a Spark 2.5 release.
Spark 3 brings a number of important changes and, by its nature, is not
backward compatible. I think we'd all like the upgrade experience to Spark 3
to be as smooth as possible, and I believe that having a Spark 2 release that
includes some of the new functionality, while continuing to support the older
APIs and the current Scala version, would make the upgrade path smoother.

This pattern is not uncommon in other Hadoop ecosystem projects, like
Hadoop itself and HBase.

I know that Ryan Blue has indicated he is already going to be maintaining
something like that internally at Netflix, and we'll be doing the same
thing at Apple. It seems like having a transitional release could benefit
the community with easy migrations and help avoid duplicated work.

I want to be clear I'm volunteering to do the work of managing a 2.5
release, so hopefully, this wouldn't create any substantial burdens on the
community.

Cheers,

Holden
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [EXTERNAL] Re: ColumnarBatch to InternalRow Cast exception with codegen enabled.

2020-06-12 Thread Kris Mo
Hi Nasrulla,

Without details of your code or configuration, it's a bit hard to tell what
exactly went wrong, since there are a lot of places that could go
wrong...

But one thing is for sure: the interpreted code path (non-WSCG) and
the WSCG path are two separate things, and it wouldn't surprise me that one
works while the other doesn't, because they can have different features and
bugs.
Depending on which version/branch of Spark you're working with, you might
need to implement columnar support slightly differently; cf.
https://github.com/apache/spark/commit/c341de8b3e1f1d3327bd4ae3b0d2ec048f64d306
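As a work-around sketch while the wrapper doesn't handle columnar input (this only
sidesteps the mismatch, it doesn't add columnar support), you can force the Parquet
scan back onto the row-based path; both configs below exist in 2.4 and 3.0, and the
local session setup is only illustrative:

import org.apache.spark.sql.SparkSession

// Illustrative local session; only the two .config() lines matter here.
val spark = SparkSession.builder()
  .appName("columnar-debug")
  .master("local[*]")
  .config("spark.sql.parquet.enableVectorizedReader", "false") // scan emits InternalRow, not ColumnarBatch
  .config("spark.sql.codegen.wholeStage", "false")             // the setting already mentioned in this thread
  .getOrCreate()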


Best regards,
Kris
--

Kris Mok

Software Engineer Databricks Inc.

kris@databricks.com

databricks.com





On Fri, Jun 12, 2020 at 11:09 AM Nasrulla Khan Haris <
nasrulla.k...@microsoft.com> wrote:

>
>
> Thanks Kris for your inputs. Yes I have a new data source which wraps
> around built-in parquet data source. What I do not understand is with WSCG
> disabled, output is not columnar batch, if my changes do not handle
> columnar support, shouldn’t the behavior remain same with or without WSCG.
>
>
>
>
>
>
>
> *From:* Kris Mo 
> *Sent:* Friday, June 12, 2020 2:20 AM
> *To:* Nasrulla Khan Haris 
> *Cc:* dev@spark.apache.org
> *Subject:* [EXTERNAL] Re: ColumnarBatch to InternalRow Cast exception
> with codegen enabled.
>
>
>
> Hi Nasrulla,
>
>
>
> Not sure what your new code is doing, but the symptom looks like you're
> creating a new data source that wraps around the builtin Parquet data
> source?
>
>
>
> The problem here is, whole-stage codegen generated code for row-based
> input, but the actual input is columnar.
>
> In other words, in your setup, the vectorized Parquet reader is enabled
> (which produces columnar output), and you probably wrote a new operator
> that didn't properly interact with the columnar support, so that WSCG
> thought it should generate row-based code instead of columnar code.
>
>
>
> Hope it helps,
>
> Kris
>
> --
>
>
>
> Kris Mok
>
> Software Engineer Databricks Inc.
>
> kris@databricks.com
>
> databricks.com
> 
>
>
> 
>
>
>
>
>
> On Thu, Jun 11, 2020 at 5:41 PM Nasrulla Khan Haris
>  wrote:
>
> HI Spark developer,
>
>
>
> I have a new baseRelation which Initializes ParquetFileFormat object and
> when reading the data I am encountering Cast Exception below, however when
> I disable codegen support with config “spark.sql.codegen.wholeStage"=
> false, I do not encounter this exception.
>
>
>
>
>
> 20/06/11 17:35:39 INFO FileScanRDD: Reading File path: file:///D:/
> jvm/src/test/scala/resources/pems_sorted/station=402260/part-r-00245-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet,
> range: 0-50936, partition values: [402260]
>
> 20/06/11 17:35:39 INFO CodecPool: Got brand-new decompressor [.snappy]
>
> 20/06/11 17:35:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
> 0)
>
> java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
> Source)
>
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source)
>
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>
> at
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>
> at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
>
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>
> at
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>
> at
> 

RE: [EXTERNAL] Re: ColumnarBatch to InternalRow Cast exception with codegen enabled

2020-06-12 Thread Nasrulla Khan Haris

Thanks, Kris, for your input. Yes, I have a new data source that wraps around
the built-in Parquet data source. What I do not understand is that with WSCG
disabled, the output is not a columnar batch; if my changes do not handle
columnar support, shouldn't the behavior remain the same with or without WSCG?



From: Kris Mo 
Sent: Friday, June 12, 2020 2:20 AM
To: Nasrulla Khan Haris 
Cc: dev@spark.apache.org
Subject: [EXTERNAL] Re: ColumnarBatch to InternalRow Cast exception with
codegen enabled.

Hi Nasrulla,

Not sure what your new code is doing, but the symptom looks like you're 
creating a new data source that wraps around the builtin Parquet data source?

The problem here is, whole-stage codegen generated code for row-based input, 
but the actual input is columnar.
In other words, in your setup, the vectorized Parquet reader is enabled (which 
produces columnar output), and you probably wrote a new operator that didn't 
properly interact with the columnar support, so that WSCG thought it should 
generate row-based code instead of columnar code.

Hope it helps,
Kris
--

Kris Mok

Software Engineer Databricks Inc.

kris@databricks.com

databricks.com




On Thu, Jun 11, 2020 at 5:41 PM Nasrulla Khan Haris 
 wrote:
HI Spark developer,

I have a new baseRelation which Initializes ParquetFileFormat object and when 
reading the data I am encountering Cast Exception below, however when I disable 
codegen support with config “spark.sql.codegen.wholeStage"= false, I do not 
encounter this exception.


20/06/11 17:35:39 INFO FileScanRDD: Reading File path: file:///D:/ 
jvm/src/test/scala/resources/pems_sorted/station=402260/part-r-00245-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet,
 range: 0-50936, partition values: [402260]
20/06/11 17:35:39 INFO CodecPool: Got brand-new decompressor [.snappy]
20/06/11 17:35:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch 
cannot be cast to org.apache.spark.sql.catalyst.InternalRow
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


Appreciate your inputs.

Thanks,
NKH


RE: [EXTERNAL] Re: ColumnarBatch to InternalRow Cast exception with codegen enabled

2020-06-12 Thread Nasrulla Khan Haris
Thanks, Kris, for your input. Yes, I have a new data source that wraps around
the built-in Parquet data source. What I do not understand is that with WSCG
disabled, the output is not a columnar batch.


From: Kris Mo 
Sent: Friday, June 12, 2020 2:20 AM
To: Nasrulla Khan Haris 
Cc: dev@spark.apache.org
Subject: [EXTERNAL] Re: ColumnarBatch to InternalRow Cast exception with
codegen enabled.

Hi Nasrulla,

Not sure what your new code is doing, but the symptom looks like you're 
creating a new data source that wraps around the builtin Parquet data source?

The problem here is, whole-stage codegen generated code for row-based input, 
but the actual input is columnar.
In other words, in your setup, the vectorized Parquet reader is enabled (which 
produces columnar output), and you probably wrote a new operator that didn't 
properly interact with the columnar support, so that WSCG thought it should 
generate row-based code instead of columnar code.

Hope it helps,
Kris
--

Kris Mok

Software Engineer Databricks Inc.

kris@databricks.com

databricks.com




On Thu, Jun 11, 2020 at 5:41 PM Nasrulla Khan Haris 
 wrote:
HI Spark developer,

I have a new baseRelation which Initializes ParquetFileFormat object and when 
reading the data I am encountering Cast Exception below, however when I disable 
codegen support with config “spark.sql.codegen.wholeStage"= false, I do not 
encounter this exception.


20/06/11 17:35:39 INFO FileScanRDD: Reading File path: file:///D:/ 
jvm/src/test/scala/resources/pems_sorted/station=402260/part-r-00245-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet,
 range: 0-50936, partition values: [402260]
20/06/11 17:35:39 INFO CodecPool: Got brand-new decompressor [.snappy]
20/06/11 17:35:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch 
cannot be cast to org.apache.spark.sql.catalyst.InternalRow
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


Appreciate your inputs.

Thanks,
NKH


Why time difference while registering a new BlockManager (using BlockManagerMasterEndpoint)?

2020-06-12 Thread Jacek Laskowski
Hi,

Just noticed an inconsistency between the time at which a BlockManager is about
to be registered [1][2] and the time with which listeners are informed [3],
and got curious whether it's intentional or not.

Why is the `time` value not used for the SparkListenerBlockManagerAdded message?

[1]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L453
[2]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L478
[3]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L481
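To make the question concrete, here is a small self-contained sketch of the pattern
being asked about (illustrative only, with a hypothetical event class; this is not
the code at the links above):

// Hypothetical names, for illustrating the question only.
case class BlockManagerAddedEvent(time: Long, blockManagerId: String, maxMem: Long)

def register(id: String, maxMem: Long, post: BlockManagerAddedEvent => Unit): Unit = {
  val time = System.currentTimeMillis()  // timestamp captured when registration starts
  // ... bookkeeping that records the block manager using `time` ...
  // The listener event is posted with a fresh timestamp rather than `time`:
  post(BlockManagerAddedEvent(System.currentTimeMillis(), id, maxMem))
}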

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




Re: ColumnarBatch to InternalRow Cast exception with codegen enabled.

2020-06-12 Thread Kris Mo
Hi Nasrulla,

I'm not sure what your new code is doing, but the symptom looks like you're
creating a new data source that wraps around the built-in Parquet data
source?

The problem here is that whole-stage codegen generated code for row-based
input, but the actual input is columnar. In other words, in your setup the
vectorized Parquet reader is enabled (which produces columnar output), and
you probably wrote a new operator that didn't properly interact with the
columnar support, so WSCG thought it should generate row-based code instead
of columnar code.
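One quick way to see which side of the mismatch you are on (a sketch; the path below
and the existing `spark` session are assumed) is to print the executed physical plan
and check whether the scan node reports batched output while sitting inside a
whole-stage-codegen stage (the operators prefixed with `*(n)`):

// Assumes an existing `spark` session and some Parquet data at a path of yours.
val df = spark.read.parquet("/tmp/some/parquet/dir")
println(df.queryExecution.executedPlan.treeString)
// Scan nodes print `Batched: true` when they produce ColumnarBatch; if such a
// scan is consumed directly by row-based generated code, you get exactly the
// ColumnarBatch -> InternalRow ClassCastException from the original report.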

Hope it helps,
Kris
--

Kris Mok

Software Engineer Databricks Inc.

kris@databricks.com

databricks.com





On Thu, Jun 11, 2020 at 5:41 PM Nasrulla Khan Haris
 wrote:

> HI Spark developer,
>
>
>
> I have a new baseRelation which Initializes ParquetFileFormat object and
> when reading the data I am encountering Cast Exception below, however when
> I disable codegen support with config “spark.sql.codegen.wholeStage"=
> false, I do not encounter this exception.
>
>
>
>
>
> 20/06/11 17:35:39 INFO FileScanRDD: Reading File path: file:///D:/
> jvm/src/test/scala/resources/pems_sorted/station=402260/part-r-00245-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet,
> range: 0-50936, partition values: [402260]
>
> 20/06/11 17:35:39 INFO CodecPool: Got brand-new decompressor [.snappy]
>
> 20/06/11 17:35:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
> 0)
>
> java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
> Source)
>
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source)
>
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>
> at
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>
> at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
>
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>
> at
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
> at java.lang.Thread.run(Thread.java:748)
>
>
>
>
>
> Appreciate your inputs.
>
>
>
> Thanks,
>
> NKH
>