Re: What else could be removed in Spark 4?

2023-08-24 Thread Steve Loughran
I would recommend cutting them.

+ Historically they've pinned the version of the aws-sdk jar used in Spark
releases, meaning the s3a connector used through Spark rarely ran against
the same SDK release as the one qualified through the Hadoop SDK update
process; if there were incompatibilities, it was up to the Spark dev team
to deal with them (SPARK-39969)
+ Hadoop trunk will move to the AWS v2 SDK within a few days, as soon as we
can get the merge PR in: https://issues.apache.org/jira/browse/HADOOP-18073

+ I think we will do a (final?) Hadoop release with the v1 SDK in
September, with bug fixes and updates to the dependencies whose upgrades
are non-disruptive, covering the latest batch of CVEs (Guava, etc., but
not Jackson)

The v1 and v2 SDKs can coexist: different modules, different libraries,
different packages, every classname different (sometimes in case only),
but there is no compatibility. Everything needs to be rewritten for the
new SDK, where every class is now autogenerated from an IDL specification,
has builders, method names are generally different, etc. Moving even a
single class to the new release is hard work.
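
To give a feel for the scale of the rewrite, here is a minimal sketch of
the same S3 PUT against both SDKs (bucket and key names are made up):

    // v1: com.amazonaws classes, plain setters and method overloads
    import com.amazonaws.services.s3.AmazonS3ClientBuilder

    val s3v1 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build()
    s3v1.putObject("my-bucket", "my-key", "hello")

    // v2: software.amazon.awssdk classes, everything goes through builders
    import software.amazon.awssdk.core.sync.RequestBody
    import software.amazon.awssdk.regions.Region
    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model.PutObjectRequest

    val s3v2 = S3Client.builder().region(Region.US_EAST_1).build()
    s3v2.putObject(
      PutObjectRequest.builder().bucket("my-bucket").key("my-key").build(),
      RequestBody.fromString("hello"))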

Docs on the changes are in aws_sdk_upgrade.md.


Because of the coexistence, Spark can pull in a v2 hadoop-aws library,
with s3a using it to talk to S3, while kinesis builds, runs, tests etc.
against the existing v1 library. But moving to the new SDK is pretty
traumatic and one-way. Unless it is a critical feature for some people,
deletion is the easy option.
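
As a sketch, that coexistence in a build would look something like this;
the artifact coordinates are real, but the version numbers are
illustrative placeholders, not a recommendation:

    // build.sbt fragment: v1 and v2 SDKs side by side (versions illustrative)
    libraryDependencies ++= Seq(
      // s3a with HADOOP-18073 merged depends on the shaded v2 bundle
      "org.apache.hadoop" % "hadoop-aws" % "3.4.0",
      "software.amazon.awssdk" % "bundle" % "2.20.160",
      // the kinesis connector keeps using the v1 SDK via the Kinesis Client Library
      "com.amazonaws" % "amazon-kinesis-client" % "1.14.10"
    )

Since the two SDKs share no package or class names, the JVM treats them as
unrelated libraries and both can sit on the classpath at once.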

Happy to supply a patch for this, though because of the coexistence I'm
not worrying about it in my current cross-project testing.

steve


Re: What else could be removed in Spark 4?

2023-08-16 Thread Yang Jie
I would like to know how we should handle the two Kinesis-related modules in
Spark 4.0. They have a very low frequency of code updates, and because the
corresponding tests are not continuously executed in any GitHub Actions
pipeline, I think they significantly lack quality assurance. On top of that,
I am not certain whether the test cases in these modules that require AWS
credentials get verified during each Spark version release.

Thanks,
Jie Yang


Re: What else could be removed in Spark 4?

2023-08-08 Thread Cheng Pan
What do you think about removing HiveContext and even SQLContext?
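
For context, everything those entry points offer has had a SparkSession
equivalent since Spark 2.0; a minimal sketch of the migration:

    import org.apache.spark.sql.SparkSession

    // Legacy entry points, deprecated since Spark 2.0:
    //   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    //   val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    // SparkSession replaces both; enableHiveSupport() covers the HiveContext case.
    val spark = SparkSession.builder()
      .appName("migration-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.sql("SELECT 1")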

And as an extension of this question, should we re-implement the Hive
connector using the DSv2 API in Spark 4?

Developers who want to implement a custom DataSource plugin may want to
learn from the Spark built-in ones[1], and Hive is a good candidate. A
legacy implementation may confuse them.

It was discussed/requested in [2][3][4][5]

There have been requests for multiple Hive metastore support[6], and I have
seen users choose Presto/Trino over Spark because the former supports
multiple HMS instances.

BTW, there are known third-party Hive DSv2 implementations[7][8].
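
With a DSv2 catalog implementation, multiple metastores map naturally onto
Spark's multi-catalog support. A hypothetical sketch: the catalog class
name below follows the Kyuubi connector[8], and the catalog names and
metastore URIs are made up, so treat this as an illustration rather than a
tested configuration:

    import org.apache.spark.sql.SparkSession

    // Register one DSv2 Hive catalog per metastore.
    val spark = SparkSession.builder()
      .config("spark.sql.catalog.hms_prod",
        "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog")
      .config("spark.sql.catalog.hms_prod.hive.metastore.uris",
        "thrift://hms-prod:9083")
      .config("spark.sql.catalog.hms_test",
        "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog")
      .config("spark.sql.catalog.hms_test.hive.metastore.uris",
        "thrift://hms-test:9083")
      .getOrCreate()

    // Query across the two metastores by qualifying the catalog name.
    spark.sql("SELECT * FROM hms_prod.db.t JOIN hms_test.db.t USING (id)").show()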

[1] https://www.mail-archive.com/dev@spark.apache.org/msg30353.html
[2] https://www.mail-archive.com/dev@spark.apache.org/msg25715.html
[3] https://issues.apache.org/jira/browse/SPARK-31241
[4] https://issues.apache.org/jira/browse/SPARK-39797
[5] https://issues.apache.org/jira/browse/SPARK-44518
[6] https://www.mail-archive.com/dev@spark.apache.org/msg30228.html
[7] https://github.com/permanentstar/spark-sql-dsv2-extension
[8] 
https://github.com/apache/kyuubi/tree/master/extensions/spark/kyuubi-spark-connector-hive

Thanks,
Cheng Pan





Re: What else could be removed in Spark 4?

2023-08-08 Thread Cheng Pan
> Are there old Hive/Hadoop version combos we should just stop supporting?

Dropping support for Java 8 means dropping support for Hive versions below
2.0[1].

IsolatedClientLoader exists to allow using different Hive jars to
communicate with different versions of the HMS (a config sketch of what it
enables follows the list below). AFAIK, the current built-in Hive 2.3.9
client works well for communicating with Hive Metastore servers from 2.1
through 3.1 (maybe 2.0 too, not sure). This raises a new question: is
IsolatedClientLoader still required?

I think we should drop IsolatedClientLoader because

1. As explained above, we can use the built-in Hive 2.3.9 client to
communicate with HMS 2.1+.
2. Since SPARK-42539[2], the default Hive 2.3.9 client does not use
IsolatedClientLoader, and as explained in SPARK-42539, IsolatedClientLoader
causes some inconsistent behaviors.
3. It blocks the Guava upgrade. HIVE-27560[3] aims to make Hive 2.3.10
(unreleased) compatible with all Guava 14+ versions, but unfortunately
Guava is marked as `isSharedClass`[4] in IsolatedClientLoader, so
technically, if we want to upgrade Guava we would need to make all
supported Hive versions (2.1.x through 3.1.x) work with a newer Guava,
which I think is impossible.
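
For reference, this is the machinery IsolatedClientLoader backs. The
config keys below are real Spark settings; the version and jar path values
are illustrative:

    import org.apache.spark.sql.SparkSession

    // Point Spark at a Hive client other than the built-in 2.3.9
    // (the version and jar path here are made-up examples).
    val spark = SparkSession.builder()
      .config("spark.sql.hive.metastore.version", "3.1.3")
      .config("spark.sql.hive.metastore.jars", "path")
      .config("spark.sql.hive.metastore.jars.path", "/opt/hive-3.1.3/lib/*")
      .enableHiveSupport()
      .getOrCreate()

Dropping IsolatedClientLoader would mean only the built-in client remains,
relying on the 2.3.9 client's wire compatibility with HMS 2.1+.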

[1] 
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientVersions.scala
[2] https://issues.apache.org/jira/browse/SPARK-42539
[3] https://issues.apache.org/jira/browse/HIVE-27560
[4] https://github.com/apache/spark/pull/33989#issuecomment-926277286

Thanks,
Cheng Pan





Re: What else could be removed in Spark 4?

2023-08-07 Thread Wenchen Fan
I think the principle is that we should remove things that block us from
supporting new things like Java 21, or that come with a significant
maintenance cost. If there is no benefit to removing deprecated APIs (just
to keep the codebase clean?), I'd prefer to leave them there and not bother.



Re: What else could be removed in Spark 4?

2023-08-07 Thread Jia Fan
Thanks Sean for opening this discussion.

1. I think dropping Scala 2.12 is a good option.

2. Personally, I think we should remove most methods that have been
deprecated since 2.x/1.x unless there is no good replacement. The 3.x line
has already served as a buffer, and I don't think it is good practice to
keep using methods deprecated in 2.x on 4.x.

3. For Mesos, I think we should remove it from the docs first.


Jia Fan






What else could be removed in Spark 4?

2023-08-07 Thread Sean Owen
While we're noodling on the topic, what else might be worth removing in
Spark 4?

For example, it looks like we're finally hitting problems supporting Java 8
through 21 all at once, related to the Scala 2.13.x updates. It would be
reasonable to require Java 11, or even 17, as a baseline for the multi-year
lifecycle of Spark 4.

Dare I ask: drop Scala 2.12? Supporting 2.12 / 2.13 / 3.0 at once might get
hard otherwise.

There was a good discussion about whether old deprecated methods should be
removed. They can't be removed at other times, but that doesn't mean they
all *should* be. createExternalTable was brought up as a first example.
What deprecated methods are worth removing?
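
On that first example: createExternalTable has been deprecated since Spark
2.2 in favor of createTable. A minimal sketch of the swap (the table name
and path are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Deprecated since Spark 2.2:
    //   spark.catalog.createExternalTable("logs", "/data/logs")

    // Replacement: createTable registers an external table when a path is given.
    spark.catalog.createTable("logs", "/data/logs")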

There's Mesos support, long since deprecated, which seems like something to
prune.

Are there old Hive/Hadoop version combos we should just stop supporting?