I would recommend cutting them.

+ Historically these modules have pinned the version of the aws-sdk jar used
in Spark releases, meaning the s3a connector, when used through Spark, rarely
ran against the same SDK release that was qualified through the Hadoop SDK
update process. If there were incompatibilities, it was up to the Spark dev
team to deal with them (SPARK-39969).
+ Hadoop trunk will move to the AWS v2 SDK within a few days, as soon as we
can get the merge PR in: https://issues.apache.org/jira/browse/HADOOP-18073

+ I think we will do a (final?) Hadoop release with the v1 SDK in September:
bug fixes, plus updates of the non-destructive dependencies for the latest
batch of CVEs (guava etc., but not jackson).

The v1 and v2 SDKs can coexist: different modules, different libraries,
different packages, every classname different, sometimes only in case. But
there is no compatibility: everything needs to be rewritten for the new SDK,
where every class is now autogenerated from an IDL specification, has
builders, and method names are generally different. Moving even a single
class to the new SDK is hard work.
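
To give a feel for the scale of the rewrite, here is a minimal sketch of the
same S3 PUT against both SDKs. This is not code from either project; the
class name, bucket, key and region are invented for illustration, and
default credentials are assumed.

    // Both SDK trees can sit on the same classpath; only the imports and
    // call shapes differ. Example values only - not taken from hadoop-aws.
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class S3PutV1VsV2 {

      // v1: com.amazonaws.*, plain method overloads on the client
      static void putV1() {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
            .withRegion("us-east-1")
            .build();
        s3.putObject("example-bucket", "example-key", "hello");
      }

      // v2: software.amazon.awssdk.*, autogenerated model classes with builders
      static void putV2() {
        S3Client s3 = S3Client.builder()
            .region(Region.US_EAST_1)
            .build();
        s3.putObject(
            PutObjectRequest.builder()
                .bucket("example-bucket")
                .key("example-key")
                .build(),
            RequestBody.fromString("hello"));
      }
    }

Every call site changes shape like this, which is what makes moving even a
single class such hard work.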

Docs on the changes are in aws_sdk_upgrade.md:
<https://github.com/apache/hadoop/blob/feature-HADOOP-18073-s3a-sdk-upgrade/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/aws_sdk_upgrade.md>

Because of the coexistence, Spark can pull in a v2 hadoop-aws library, with
s3a using it to talk to S3, while Kinesis builds, runs, tests etc. on the
existing v1 library. But moving to the new SDK is pretty traumatic and
one-way. Unless Kinesis is a critical feature for some people, deletion is
the easy option.
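
As a rough sketch of that coexistence (again, the class name, region and use
of the default credential chain are invented for illustration): a v1 Kinesis
client of the kind the Kinesis modules build against and a v2 S3 client of
the kind the upgraded s3a connector uses can be constructed in the same JVM,
because the two SDK trees share no package or class names.

    import com.amazonaws.services.kinesis.AmazonKinesis;
    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.s3.S3Client;

    public class SdkCoexistence {
      public static void main(String[] args) {
        // v1 Kinesis client (aws-java-sdk-kinesis, or the v1 bundle)
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.standard()
            .withRegion("us-east-1")
            .build();
        // v2 S3 client (software.amazon.awssdk:s3, or the v2 bundle)
        S3Client s3 = S3Client.builder()
            .region(Region.US_EAST_1)
            .build();
        // No classpath clash: com.amazonaws.* and software.amazon.awssdk.*
        // are completely disjoint package trees.
        System.out.println(kinesis.getClass() + " / " + s3.getClass());
      }
    }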

Happy to supply a patch for this, though because of the coexistence I'm not
worrying about it in my current cross-project testing.

steve


On Thu, 17 Aug 2023 at 06:44, Yang Jie <yangji...@apache.org> wrote:

> I would like to know how we should handle the two Kinesis-related modules
> in Spark 4.0. They have a very low frequency of code updates, and because
> the corresponding tests are not continuously executed in any GitHub Actions
> pipeline, I think they significantly lack quality assurance. On top of
> that, I am not certain whether the test cases in these modules that require
> AWS credentials get verified during each Spark version release.
>
> Thanks,
> Jie Yang
>
> On 2023/08/08 08:28:37 Cheng Pan wrote:
> > What do you think about removing HiveContext and even SQLContext?
> >
> > And as an extension of this question, should we re-implement the Hive
> > data source using the DSv2 API in Spark 4?
> >
> > Developers who want to implement a custom DataSource plugin may want to
> > learn from the Spark built-in one[1], and Hive is a good candidate. A
> > legacy implementation may confuse them.
> >
> > It was discussed/requested in [2][3][4][5]
> >
> > There have been requests for multiple Hive metastore support[6], and I
> > have seen users choose Presto/Trino instead of Spark because the former
> > support multiple HMS instances.
> >
> > BTW, there are known third-party Hive DSv2 implementations[7][8].
> >
> > [1] https://www.mail-archive.com/dev@spark.apache.org/msg30353.html
> > [2] https://www.mail-archive.com/dev@spark.apache.org/msg25715.html
> > [3] https://issues.apache.org/jira/browse/SPARK-31241
> > [4] https://issues.apache.org/jira/browse/SPARK-39797
> > [5] https://issues.apache.org/jira/browse/SPARK-44518
> > [6] https://www.mail-archive.com/dev@spark.apache.org/msg30228.html
> > [7] https://github.com/permanentstar/spark-sql-dsv2-extension
> > [8]
> https://github.com/apache/kyuubi/tree/master/extensions/spark/kyuubi-spark-connector-hive
> >
> > Thanks,
> > Cheng Pan
> >
> >
> > > On Aug 8, 2023, at 10:09, Wenchen Fan <cloud0...@gmail.com> wrote:
> > >
> > > I think the principle is we should remove things that block us from
> supporting new things like Java 21, or come with a significant maintenance
> cost. If there is no benefit to removing deprecated APIs (just to keep the
> codebase clean?), I'd prefer to leave them there and not bother.
> > >
> > > On Tue, Aug 8, 2023 at 9:00 AM Jia Fan <fanjiaemi...@qq.com.invalid>
> wrote:
> > > Thanks Sean for opening this discussion.
> > >
> > > 1. I think dropping Scala 2.12 is a good option.
> > >
> > > 2. Personally, I think we should remove most methods that have been
> > > deprecated since 2.x/1.x unless no good replacement can be found. The 3.x
> > > line has already served as a buffer, and I don't think it is good practice
> > > to keep using methods deprecated in 2.x on 4.x.
> > >
> > > 3. For Mesos, I think we should remove it from the docs first.
> > > ________________________
> > >
> > > Jia Fan
> > >
> > >
> > >
> > >> On Aug 8, 2023, at 05:47, Sean Owen <sro...@gmail.com> wrote:
> > >>
> > >> While we're noodling on the topic, what else might be worth removing
> in Spark 4?
> > >>
> > >> For example, looks like we're finally hitting problems supporting
> Java 8 through 21 all at once, related to Scala 2.13.x updates. It would be
> reasonable to require Java 11, or even 17, as a baseline for the multi-year
> lifecycle of Spark 4.
> > >>
> > >> Dare I ask: drop Scala 2.12? Supporting 2.12 / 2.13 / 3.0 might get
> > >> hard otherwise.
> > >>
> > >> There was a good discussion about whether old deprecated methods
> > >> should be removed. They can't be removed at other times, but that doesn't
> > >> mean they all should be. createExternalTable was brought up as a first
> > >> example. What deprecated methods are worth removing?
> > >>
> > >> There's Mesos support, long since deprecated, which seems like
> something to prune.
> > >>
> > >> Are there old Hive/Hadoop version combos we should just stop
> supporting?
> > >
> >
> >
