Re: [VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-11 Thread Cheng Pan
-1 (non-binding)

A small question: the tag is an orphan commit, but I suppose it should belong to the 
master branch.

It seems the YARN integration is broken due to the javax => jakarta namespace migration. 
I filed SPARK-48238 and left some comments on 
https://github.com/apache/spark/pull/45154 (a minimal reproduction sketch follows the stack trace below).

Caused by: java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
jakarta.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99) 
~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
 ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$2(ServletHandler.java:724)
 ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
at 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
 ~[?:?]
at 
java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:734)
 ~[?:?]
at 
java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
 ~[?:?]
at 
org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:749)
 ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
... 38 more
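
As mentioned above, the root cause is that Jetty in Spark 4 expects filters to implement the jakarta.servlet API, while Hadoop's AmIpFilter (at least as of the Hadoop 3.3.x/3.4.x lines) still implements javax.servlet.Filter. A minimal sketch of the same check Jetty's FilterHolder performs, assuming hadoop-yarn-server-web-proxy and the jakarta.servlet API jar are on the classpath:

    // Prints "false": AmIpFilter implements javax.servlet.Filter, not
    // jakarta.servlet.Filter, which is exactly what FilterHolder.doStart rejects above.
    val amIpFilter = Class.forName(
      "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter")
    println(classOf[jakarta.servlet.Filter].isAssignableFrom(amIpFilter))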

Thanks,
Cheng Pan


> On May 11, 2024, at 13:55, Wenchen Fan  wrote:
> 
> Please vote on releasing the following candidate as Apache Spark version 
> 4.0.0-preview1.
> 
> The vote is open until May 16 PST and passes if a majority +1 PMC votes are 
> cast, with
> a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see http://spark.apache.org/
> 
> The tag to be voted on is v4.0.0-preview1-rc1 (commit 
> 7dcf77c739c3854260464d732dbfb9a0f54706e7):
> https://github.com/apache/spark/tree/v4.0.0-preview1-rc1
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1454/
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-docs/
> 
> The list of bug fixes going into 4.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353359
> 
> FAQ
> 
> =========================
> How can I help test this release?
> =========================
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
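
For the Java/Scala path above, a minimal sketch of pointing an sbt build at the staging repository listed earlier in this thread (Maven users would add the same URL as a <repository> entry in their pom):

    // build.sbt -- test against the RC, then drop the resolver and clear the
    // local ivy/maven caches so later builds don't pick up stale RC artifacts
    resolvers += "Spark 4.0.0-preview1 RC1 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1454/"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0-preview1" % Provided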


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Cheng Pan
+1 (non-binding)

Thanks,
Cheng Pan

On Sat, Apr 27, 2024 at 9:29 AM Holden Karau  wrote:
>
> +1
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh  wrote:
>>
>> +1
>>
>> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun  wrote:
>> >
>> > I'll start with my +1.
>> >
>> > Dongjoon.
>> >
>> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
>> > > Please vote on SPARK-46122 to set 
>> > > spark.sql.legacy.createHiveTableByDefault
>> > > to `false` by default. The technical scope is defined in the following 
>> > > PR.
>> > >
>> > > - DISCUSSION:
>> > > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
>> > > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
>> > > - PR: https://github.com/apache/spark/pull/46207
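
For context, a minimal sketch of what this default controls, assuming spark.sql.sources.default is left at its default of parquet:

    // CREATE TABLE without USING / STORED AS is what the flag decides.
    spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
    spark.sql("CREATE TABLE t_legacy (id INT)")   // Hive serde table (text format)

    spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
    spark.sql("CREATE TABLE t_native (id INT)")   // native data source table (parquet)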
>> > >
>> > > The vote is open until April 30th 1AM (PST) and passes
>> > > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> > >
>> > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by default
>> > > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because 
>> > > ...
>> > >
>> > > Thank you in advance.
>> > >
>> > > Dongjoon
>> > >
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Spark 4.0.0 release

2024-04-16 Thread Cheng Pan
will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?

Thanks,
Cheng Pan


> On Apr 15, 2024, at 09:58, Jungtaek Lim  wrote:
> 
> W.r.t. state data source - reader (SPARK-45511), there are several follow-up 
> tickets, but we don't plan to address them soon. The current implementation 
> is the final shape for Spark 4.0.0, unless there are demands on the follow-up 
> tickets.
> 
> We may want to check the plan for transformWithState - my understanding is 
> that we want to release the feature to 4.0.0, but there are several remaining 
> works to be done. While the tentative timeline for releasing is June 2024, 
> what would be the tentative timeline for the RC cut?
> (cc. Anish to add more context on the plan for transformWithState)
> 
> On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan  wrote:
> Hi all,
> 
> It's close to the previously proposed 4.0.0 release date (June 2024), and I 
> think it's time to prepare for it and discuss the ongoing projects:
> • ANSI by default
> • Spark Connect GA
> • Structured Logging
> • Streaming state store data source
> • new data type VARIANT
> • STRING collation support
> • Spark k8s operator versioning
> Please help to add more items to this list that are missed here. I would like 
> to volunteer as the release manager for Apache Spark 4.0.0 if there is no 
> objection. Thank you all for the great work that fills Spark 4.0!
> 
> Wenchen Fan


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Cheng Pan
+1, non-binding

Thanks,
Cheng Pan


> On Apr 15, 2024, at 14:14, John Zhuge  wrote:
> 
> +1 (non-binding)
> 
> On Sun, Apr 14, 2024 at 7:18 PM Jungtaek Lim  
> wrote:
> +1 (non-binding), thanks Dongjoon.
> 
> On Sun, Apr 14, 2024 at 7:22 AM Dongjoon Hyun  wrote:
> Please vote on SPARK-44444 to use ANSI SQL mode by default.
> The technical scope is defined in the following PR which is
> one line of code change and one line of migration guide.
> 
> - DISCUSSION: https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
> - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
> - PR: https://github.com/apache/spark/pull/46013
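
For context, a minimal sketch of the behavior this default flips (spark.sql.ansi.enabled is the underlying config):

    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT 1/0").show()   // returns NULL under the legacy behavior

    spark.conf.set("spark.sql.ansi.enabled", "true")
    spark.sql("SELECT 1/0").show()   // fails with a DIVIDE_BY_ZERO arithmetic error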
> 
> The vote is open until April 17th 1AM (PST) and passes
> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> 
> [ ] +1 Use ANSI SQL mode by default
> [ ] -1 Do not use ANSI SQL mode by default because ...
> 
> Thank you in advance.
> 
> Dongjoon
> 
> 
> -- 
> John Zhuge


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: External Spark shuffle service for k8s

2024-04-07 Thread Cheng Pan
Instead of the External Shuffle Service, Apache Celeborn might be a good option as a 
Remote Shuffle Service for Spark on K8s (a minimal config sketch follows the links below).

There are some useful resources you might be interested in.

[1] https://celeborn.apache.org/
[2] https://www.youtube.com/watch?v=s5xOtG6Venw
[3] https://github.com/aws-samples/emr-remote-shuffle-service
[4] https://github.com/apache/celeborn/issues/2140
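
As mentioned above, here is a rough sketch of wiring a Spark on K8s job to a Celeborn cluster; the shuffle manager class and config keys below are from my reading of the Celeborn docs and differ across Celeborn versions, so please verify them against the release you deploy:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // the Celeborn Spark client jar must be on the driver/executor classpath
      .set("spark.shuffle.manager", "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
      .set("spark.celeborn.master.endpoints", "celeborn-master-0.celeborn.svc:9097")
      // the node-local external shuffle service is not used with a remote shuffle service
      .set("spark.shuffle.service.enabled", "false")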

Thanks,
Cheng Pan


> On Apr 6, 2024, at 21:41, Mich Talebzadeh  wrote:
> 
> I have seen some older references for shuffle service for k8s,
> although it is not clear they are talking about a generic shuffle
> service for k8s.
> 
> Anyhow with the advent of genai and the need to allow for a larger
> volume of data, I was wondering if there has been any more work on
> this matter. Specifically larger and scalable file systems like HDFS,
> GCS , S3 etc, offer significantly larger storage capacity than local
> disks on individual worker nodes in a k8s cluster, thus allowing
> handling much larger datasets more efficiently. Also the degree of
> parallelism and fault tolerance  with these files systems come into
> it. I will be interested in hearing more about any progress on this.
> 
> Thanks
> .
> 
> Mich Talebzadeh,
> 
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> 
> London
> United Kingdom
> 
> 
>   view my Linkedin profile
> 
> 
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
> 
> 
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] MySQL version support policy

2024-03-25 Thread Cheng Pan
Thanks Dongjoon’s reply and questions,

> A. Adding a new Apache Spark community policy (contract) to guarantee MySQL
> LTS Versions Support.

Yes, at least the latest MySQL LTS version. To reduce the maintenance efforts 
on the Spark side, I think we can only run CI with the latest LTS version but 
accept reasonable patches for compatibility with older LTS versions. For 
example, Spark on K8s is only verified with the latest minikube in CI, and also 
accepts reasonable patches for older K8s.

> B. Dropping the support of non-LTS version support (MySQL 8.3/8.2/8.1)

Those versions likely work well too. For example, Spark currently officially 
supports JDK 17 and 21; it likely works on JDK 20 too, but that has not been 
verified by the community.

> 1. For (A), do you mean MySQL LTS versions are not supported by Apache Spark 
> releases properly due to the improper test suite?

Not yet. MySQL has retained good backward compatibility so far; I see a lot of 
users use the MySQL 8.0 driver to access both MySQL 5.7 and 8.0 servers through the 
Spark JDBC data source, and everything has gone well so far.
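
For illustration, the access path being discussed is the plain JDBC data source; a minimal sketch (host, database, table, and credentials are placeholders):

    // MySQL Connector/J 8.x on the classpath, reading from a 5.7 or 8.0 server
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/testdb")
      .option("dbtable", "orders")
      .option("user", "spark")
      .option("password", "******")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .load()
    df.printSchema()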

> 2. For (B), why does Apache Spark need to drop non-LTS MySQL support?

I think we can accept reasonable patches with careful review, but neither 
official support declaration nor CI verification is required, just like we do 
for JDK version support.

> 3. What about MariaDB? Do we need to stick to some versions?

I’m not familiar with MariaDB, but I would treat it as a MySQL-compatible 
product, in the same position as Amazon RDS for MySQL: neither an official support 
declaration nor CI verification is required, but considering the adoption rate 
of those products, reasonable patches should be considered too.

Thanks,
Cheng Pan

On 2024/03/25 06:47:10 Dongjoon Hyun wrote:
> Hi, Cheng.
> 
> Thank you for the suggestion. Your suggestion seems to have at least two
> themes.
> 
> A. Adding a new Apache Spark community policy (contract) to guarantee MySQL
> LTS Versions Support.
> B. Dropping the support of non-LTS version support (MySQL 8.3/8.2/8.1)
> 
> And, it brings me three questions.
> 
> 1. For (A), do you mean MySQL LTS versions are not supported by Apache
> Spark releases properly due to the improper test suite?
> 2. For (B), why does Apache Spark need to drop non-LTS MySQL support?
> 3. What about MariaDB? Do we need to stick to some versions?
> 
> To be clear, if needed, we can have daily GitHub Action CIs easily like
> Python CI (Python 3.8/3.10/3.11/3.12).
> 
> -
> https://github.com/apache/spark/blob/master/.github/workflows/build_python.yml
> 
> Thanks,
> Dongjoon.
> 
> 
> On Sun, Mar 24, 2024 at 10:29 PM Cheng Pan  wrote:
> 
> > Hi, Spark community,
> >
> > I noticed that the Spark JDBC connector MySQL dialect is testing against
> > the 8.3.0[1] now, a non-LTS version.
> >
> > MySQL changed the version policy recently[2], which is now very similar to
> > the Java version policy. In short, 5.5, 5.6, 5.7, 8.0 is the LTS version,
> > 8.1, 8.2, 8.3 is non-LTS, and the next LTS version is 8.4.
> >
> > I would say that MySQL is one of the most important infrastructures today,
> > I checked the AWS RDS MySQL[4] and Azure Database for MySQL[5] version
> > support policy, and both only support 5.7 and 8.0.
> >
> > Also, Spark officially only supports LTS Java versions, like JDK 17 and
> > 21, but not 22. I would recommend using MySQL 8.0 for testing until the
> > next MySQL LTS version (8.4) is available.
> >
> > Additional discussion can be found at [3]
> >
> > [1] https://issues.apache.org/jira/browse/SPARK-47453
> > [2]
> > https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
> > [3] https://github.com/apache/spark/pull/45581
> > [4] https://aws.amazon.com/rds/mysql/
> > [5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy
> >
> > Thanks,
> > Cheng Pan
> >
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
>  



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[DISCUSS] MySQL version support policy

2024-03-24 Thread Cheng Pan
Hi, Spark community,

I noticed that the Spark JDBC connector MySQL dialect is testing against the 
8.3.0[1] now, a non-LTS version.

MySQL changed its version policy recently[2], and it is now very similar to the 
Java version policy. In short, 5.5, 5.6, 5.7, and 8.0 are LTS versions; 8.1, 8.2, 
and 8.3 are non-LTS; and the next LTS version is 8.4.

I would say that MySQL is one of the most important infrastructures today, I 
checked the AWS RDS MySQL[4] and Azure Database for MySQL[5] version support 
policy, and both only support 5.7 and 8.0.

Also, Spark officially only supports LTS Java versions, like JDK 17 and 21, but 
not 22. I would recommend using MySQL 8.0 for testing until the next MySQL LTS 
version (8.4) is available.

Additional discussion can be found at [3]

[1] https://issues.apache.org/jira/browse/SPARK-47453
[2] 
https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
[3] https://github.com/apache/spark/pull/45581
[4] https://aws.amazon.com/rds/mysql/
[5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy

Thanks,
Cheng Pan



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-20 Thread Cheng Pan
+1 (non-binding)

- Build successfully from source code.
- Pass integration tests with Spark ClickHouse Connector[1]

[1] https://github.com/housepower/spark-clickhouse-connector/pull/299

Thanks,
Cheng Pan


> On Feb 20, 2024, at 10:56, Jungtaek Lim  wrote:
> 
> Thanks Sean, let's continue the process for this RC.
> 
> +1 (non-binding)
> 
> - downloaded all files from URL
> - checked signature
> - extracted all archives
> - ran all tests from source files in source archive file, via running "sbt 
> clean test package" - Ubuntu 20.04.4 LTS, OpenJDK 17.0.9.
> 
> Also bump to dev@ to encourage participation - looks like the timing is not 
> good for US folks but let's see more days.
> 
> 
> On Sat, Feb 17, 2024 at 1:49 AM Sean Owen  wrote:
> Yeah let's get that fix in, but it seems to be a minor test only issue so 
> should not block release.
> 
> On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:
> Very sorry. When I was fixing `SPARK-45242 
> (https://github.com/apache/spark/pull/43594)`, I noticed that its `Affects 
> Version` and `Fix Version` of SPARK-45242 were both 4.0, and I didn't realize 
> that it had also been merged into branch-3.5, so I didn't advocate for 
> SPARK-45357 to be backported to branch-3.5.
>  As far as I know, the condition to trigger this test failure is: when using 
> Maven to test the `connect` module, if  `sparkTestRelation` in 
> `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized, then 
> the `id` of `sparkTestRelation` will no longer be 0. So, I think this is 
> indeed related to the order in which Maven executes the test cases in the 
> `connect` module.
>  I have submitted a backport PR to branch-3.5, and if necessary, we can merge 
> it to fix this test issue.
>  Jie Yang
> From: Jungtaek Lim 
> Date: Friday, February 16, 2024, 22:15
> To: Sean Owen , Rui Wang 
> Cc: dev 
> Subject: Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
>   I traced back relevant changes and got a sense of what happened.
>   Yangjie figured out the issue via link. It's a tricky issue according to 
> the comments from Yangjie - the test is dependent on ordering of execution 
> for test suites. He said it does not fail in sbt, hence CI build couldn't 
> catch it.
> He fixed it via link, but we missed that the offending commit was also ported 
> back to 3.5 as well, hence the fix wasn't ported back to 3.5.
>   Surprisingly, I can't reproduce locally even with maven. In my attempt to 
> reproduce, SparkConnectProtoSuite was executed at third, 
> SparkConnectStreamingQueryCacheSuite, and ExecuteEventsManagerSuite, and then 
> SparkConnectProtoSuite. Maybe very specific to the environment, not just 
> maven? My env: MBP M1 pro chip, MacOS 14.3.1, Openjdk 17.0.9. I used 
> build/mvn (Maven 3.8.8).
>   I'm not 100% sure this is something we should fail the release as it's a 
> test only and sounds very environment dependent, but I'll respect your call 
> on vote.
>   Btw, looks like Rui also made a relevant fix via link (not to fix the 
> failing test but to fix other issues), but this also wasn't ported back to 
> 3.5. @Rui Wang Do you think this is a regression issue and warrants a new RC?
> On Fri, Feb 16, 2024 at 11:38 AM Sean Owen  wrote:
> Is anyone seeing this Spark Connect test failure? then again, I have some 
> weird issue with this env that always fails 1 or 2 tests that nobody else can 
> replicate. 
>   - Test observe *** FAILED ***
>   == FAIL: Plans do not match ===
>   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, 
> sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric, [min(id#0) AS 
> min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 44
>+- LocalRelation , [id#0, name#0]   
>   +- LocalRelation , [id#0, name#0] 
> (PlanTest.scala:179)
>   On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim  
> wrote:
> DISCLAIMER: RC for Apache Spark 3.5.1 starts with RC2 as I lately figured out 
> doc generation issue after tagging RC1.
>   Please vote on releasing the following candidate as Apache Spark version 
> 3.5.1.
> 
> The vote is open until February 18th 9AM (PST) and passes if a majority +1 
> PMC votes are cast, with
> a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 3.5.1
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see https://spark.apache.org/
> 
> The tag to be voted on is v3.5.1-rc2 (commit 
> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
> https://github.com/apache/spark/tree/v3.5.1-rc2
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bi

Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-14 Thread Cheng Pan
+1 (non-binding)

Thanks,
Cheng Pan


> On Nov 15, 2023, at 01:41, L. C. Hsieh  wrote:
> 
> Hi all,
> 
> I’d like to start a vote for SPIP: An Official Kubernetes Operator for
> Apache Spark.
> 
> The proposal is to develop an official Java-based Kubernetes operator
> for Apache Spark to automate the deployment and simplify the lifecycle
> management and orchestration of Spark applications and Spark clusters
> on k8s at prod scale.
> 
> This aims to reduce the learning curve and operation overhead for
> Spark users so they can concentrate on core Spark logic.
> 
> Please also refer to:
> 
>   - Discussion thread:
> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
>   - JIRA ticket: https://issues.apache.org/jira/browse/SPARK-45923
>   - SPIP doc: 
> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
> 
> 
> Please vote on the SPIP for the next 72 hours:
> 
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
> 
> 
> Thank you!
> 
> Liang-Chi Hsieh
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-10 Thread Cheng Pan
> Not really - this is not designed to be a replacement for the current 
> approach.

That's what I assumed too. But my question is, as a user, how to write a 
spark-submit command to submit a Spark app to leverage this operator?

Thanks,
Cheng Pan


> On Nov 11, 2023, at 03:21, Zhou Jiang  wrote:
> 
> Not really - this is not designed to be a replacement for the current 
> approach. Kubernetes operator fits in the scenario for automation and 
> application lifecycle management at scale. Users can choose between 
> spark-submit and operator approach based on their specific needs and 
> requirements.
> 
> On Thu, Nov 9, 2023 at 9:16 PM Cheng Pan  wrote:
> Thanks for this impressive proposal, I have a basic question, how does 
> spark-submit work with this operator? Or it enforces that we must use 
> `kubectl apply -f spark-job.yaml`(or K8s client in programming way) to submit 
> Spark app?
> 
> Thanks,
> Cheng Pan
> 
> 
> > On Nov 10, 2023, at 04:05, Zhou Jiang  wrote:
> > 
> > Hi Spark community,
> > I'm reaching out to initiate a conversation about the possibility of 
> > developing a Java-based Kubernetes operator for Apache Spark. Following the 
> > operator pattern 
> > (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark 
> > users may manage applications and related components seamlessly using 
> > native tools like kubectl. The primary goal is to simplify the Spark user 
> > experience on Kubernetes, minimizing the learning curve and operational 
> > complexities and therefore enable users to focus on the Spark application 
> > development.
> > Although there are several open-source Spark on Kubernetes operators 
> > available, none of them are officially integrated into the Apache Spark 
> > project. As a result, these operators may lack active support and 
> > development for new features. Within this proposal, our aim is to introduce 
> > a Java-based Spark operator as an integral component of the Apache Spark 
> > project. This solution has been employed internally at Apple for multiple 
> > years, operating millions of executors in real production environments. The 
> > use of Java in this solution is intended to accommodate a wider user and 
> > contributor audience, especially those who are familiar with Scala.
> > Ideally, this operator should have its dedicated repository, similar to 
> > Spark Connect Golang or Spark Docker, allowing it to maintain a loose 
> > connection with the Spark release cycle. This model is also followed by the 
> > Apache Flink Kubernetes operator.
> > We believe that this project holds the potential to evolve into a thriving 
> > community project over the long run. A comparison can be drawn with the 
> > Flink Kubernetes Operator: Apple has open-sourced internal Flink Kubernetes 
> > operator, making it a part of the Apache Flink project 
> > (https://github.com/apache/flink-kubernetes-operator). This move has gained 
> > wide industry adoption and contributions from the community. In a mere 
> > year, the Flink operator has garnered more than 600 stars and has attracted 
> > contributions from over 80 contributors. This showcases the level of 
> > community interest and collaborative momentum that can be achieved in 
> > similar scenarios.
> > More details can be found at SPIP doc : Spark Kubernetes Operator 
> > https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
> > Thanks,-- 
> > Zhou JIANG
> > 
> 
> 
> 
> -- 
> Zhou JIANG
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-09 Thread Cheng Pan
Thanks for this impressive proposal, I have a basic question, how does 
spark-submit work with this operator? Or it enforces that we must use `kubectl 
apply -f spark-job.yaml`(or K8s client in programming way) to submit Spark app?

Thanks,
Cheng Pan


> On Nov 10, 2023, at 04:05, Zhou Jiang  wrote:
> 
> Hi Spark community,
> I'm reaching out to initiate a conversation about the possibility of 
> developing a Java-based Kubernetes operator for Apache Spark. Following the 
> operator pattern 
> (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark 
> users may manage applications and related components seamlessly using native 
> tools like kubectl. The primary goal is to simplify the Spark user experience 
> on Kubernetes, minimizing the learning curve and operational complexities and 
> therefore enable users to focus on the Spark application development.
> Although there are several open-source Spark on Kubernetes operators 
> available, none of them are officially integrated into the Apache Spark 
> project. As a result, these operators may lack active support and development 
> for new features. Within this proposal, our aim is to introduce a Java-based 
> Spark operator as an integral component of the Apache Spark project. This 
> solution has been employed internally at Apple for multiple years, operating 
> millions of executors in real production environments. The use of Java in 
> this solution is intended to accommodate a wider user and contributor 
> audience, especially those who are familiar with Scala.
> Ideally, this operator should have its dedicated repository, similar to Spark 
> Connect Golang or Spark Docker, allowing it to maintain a loose connection 
> with the Spark release cycle. This model is also followed by the Apache Flink 
> Kubernetes operator.
> We believe that this project holds the potential to evolve into a thriving 
> community project over the long run. A comparison can be drawn with the Flink 
> Kubernetes Operator: Apple has open-sourced internal Flink Kubernetes 
> operator, making it a part of the Apache Flink project 
> (https://github.com/apache/flink-kubernetes-operator). This move has gained 
> wide industry adoption and contributions from the community. In a mere year, 
> the Flink operator has garnered more than 600 stars and has attracted 
> contributions from over 80 contributors. This showcases the level of 
> community interest and collaborative momentum that can be achieved in similar 
> scenarios.
> More details can be found at SPIP doc : Spark Kubernetes Operator 
> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
> Thanks,-- 
> Zhou JIANG
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 3.3.3 (RC1)

2023-08-11 Thread Cheng Pan
+1 (non-binding)

Passed integration test with Apache Kyuubi.

Thanks for driving this release.

Thanks,
Cheng Pan


> On Aug 11, 2023, at 06:36, L. C. Hsieh  wrote:
> 
> +1
> 
> Thanks Yuming.
> 
> On Thu, Aug 10, 2023 at 3:24 PM Dongjoon Hyun  wrote:
>> 
>> +1
>> 
>> Dongjoon
>> 
>> On 2023/08/10 07:14:07 yangjie01 wrote:
>>> +1
>>> Thanks, Jie Yang
>>> 
>>> 
>>> From: Yuming Wang 
>>> Date: Thursday, August 10, 2023, 13:33
>>> To: Dongjoon Hyun 
>>> Cc: dev 
>>> Subject: Re: [VOTE] Release Apache Spark 3.3.3 (RC1)
>>> 
>>> +1 myself.
>>> 
>>> On Tue, Aug 8, 2023 at 12:41 AM Dongjoon Hyun 
>>> mailto:dongjoon.h...@gmail.com>> wrote:
>>> Thank you, Yuming.
>>> 
>>> Dongjoon.
>>> 
>>> On Mon, Aug 7, 2023 at 9:30 AM yangjie01 
>>> mailto:yangji...@baidu.com>> wrote:
>>> HI,Dongjoon and Yuming
>>> 
>>> I submitted a PR a few days ago to try to fix this issue: 
>>> https://github.com/apache/spark/pull/42167.
>>>  The reason for the failure is that the branch daily test and the master 
>>> use the same yml file.
>>> 
>>> Jie Yang
>>> 
>>> From: Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> Date: Tuesday, August 8, 2023, 00:18
>>> To: Yuming Wang <yumw...@apache.org>
>>> Cc: dev <dev@spark.apache.org>
>>> Subject: Re: [VOTE] Release Apache Spark 3.3.3 (RC1)
>>> 
>>> Hi, Yuming.
>>> 
>>> One of the community GitHub Action test pipelines is unhealthy consistently 
>>> due to Python mypy linter.
>>> 
>>> https://github.com/apache/spark/actions/workflows/build_branch33.yml
>>> 
>>> It seems to be due to a pipeline difference, since the same Python mypy linter 
>>> check already passes in the commit build.
>>> 
>>> Dongjoon.
>>> 
>>> 
>>> On Fri, Aug 4, 2023 at 8:09 PM Yuming Wang 
>>> mailto:yumw...@apache.org>> wrote:
>>> Please vote on releasing the following candidate as Apache Spark version 
>>> 3.3.3.
>>> 
>>> The vote is open until 11:59pm Pacific time August 10th and passes if a 
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> 
>>> [ ] +1 Release this package as Apache Spark 3.3.3
>>> [ ] -1 Do not release this package because ...
>>> 
>>> To learn more about Apache Spark, please see https://spark.apache.org
>>> 
>>> The tag to be voted on is v3.3.3-rc1 (commit 
>>> 8c2b3319c6734250ff9d72f3d7e5cab56b142195):
>>> https://github.com/apache/spark/tree/v3.3.3-rc1
>>> 
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.3-rc1-bin
>>> 
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> 
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1445
>>> 
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.3-rc1-docs
>>> 
>>> The list of bug fixes going into 3.3.3 can be found at the following URL:
>>> https://s.apache.org/rjci4
>>> 
>>> This release is using the release script of the tag v3.3.3-rc1.
>

Re: What else could be removed in Spark 4?

2023-08-08 Thread Cheng Pan
What do you think about removing HiveContext and even SQLContext?
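
For context, both entry points have had a SparkSession replacement since Spark 2.0; a minimal migration sketch:

    // before (legacy entry points that are candidates for removal):
    //   val sqlContext  = new org.apache.spark.sql.SQLContext(sc)
    //   val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // after:
    val spark = org.apache.spark.sql.SparkSession.builder()
      .enableHiveSupport()   // covers the HiveContext use case
      .getOrCreate()
    spark.sql("SHOW TABLES").show()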

And as an extension of this question, should we re-implement the Hive using 
DSv2 API in Spark 4?

For developers who want to implement a custom DataSource plugin, he/she may 
want to learn something from the Spark built-in one[1], and Hive is a good 
candidate. A kind of legacy implementation may confuse the developers.

It was discussed/requested in [2][3][4][5]

There were some requests for multiple Hive metastores support[6], and I have 
experienced that users choose Presto/Trino instead of Spark because the former 
supports multi HMS.

BTW, there are known third-party Hive DSv2 implementations[7][8].

[1] https://www.mail-archive.com/dev@spark.apache.org/msg30353.html
[2] https://www.mail-archive.com/dev@spark.apache.org/msg25715.html
[3] https://issues.apache.org/jira/browse/SPARK-31241
[4] https://issues.apache.org/jira/browse/SPARK-39797
[5] https://issues.apache.org/jira/browse/SPARK-44518
[6] https://www.mail-archive.com/dev@spark.apache.org/msg30228.html
[7] https://github.com/permanentstar/spark-sql-dsv2-extension
[8] 
https://github.com/apache/kyuubi/tree/master/extensions/spark/kyuubi-spark-connector-hive

Thanks,
Cheng Pan


> On Aug 8, 2023, at 10:09, Wenchen Fan  wrote:
> 
> I think the principle is we should remove things that block us from 
> supporting new things like Java 21, or come with a significant maintenance 
> cost. If there is no benefit to removing deprecated APIs (just to keep the 
> codebase clean?), I'd prefer to leave them there and not bother.
> 
> On Tue, Aug 8, 2023 at 9:00 AM Jia Fan  wrote:
> Thanks Sean  for open this discussion.
> 
> 1. I think drop Scala 2.12 is a good option.
> 
> 2. Personally, I think we should remove most methods that are deprecated 
> since 2.x/1.x unless it can't find a good replacement. There is already a 3.x 
> version as a buffer and I don't think it is good practice to use the 
> deprecated method of 2.x on 4.x.
> 
> 3. For Mesos, I think we should remove it from doc first.
> 
> 
> Jia Fan
> 
> 
> 
>> On Aug 8, 2023, at 05:47, Sean Owen  wrote:
>> 
>> While we're noodling on the topic, what else might be worth removing in 
>> Spark 4?
>> 
>> For example, looks like we're finally hitting problems supporting Java 8 
>> through 21 all at once, related to Scala 2.13.x updates. It would be 
>> reasonable to require Java 11, or even 17, as a baseline for the multi-year 
>> lifecycle of Spark 4.
>> 
>> Dare I ask: drop Scala 2.12? supporting 2.12 / 2.13 / 3.0 might get hard 
>> otherwise.
>> 
>> There was a good discussion about whether old deprecated methods should be 
>> removed. They can't be removed at other times, but, doesn't mean they all 
>> should be. createExternalTable was brought up as a first example. What 
>> deprecated methods are worth removing?
>> 
>> There's Mesos support, long since deprecated, which seems like something to 
>> prune.
>> 
>> Are there old Hive/Hadoop version combos we should just stop supporting?
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: What else could be removed in Spark 4?

2023-08-08 Thread Cheng Pan
> Are there old Hive/Hadoop version combos we should just stop supporting?

Dropping support for Java 8 means dropping support for Hive versions below 2.0[1].

IsolatedClientLoader aims to allow using different Hive jars to communicate 
with different versions of HMS. AFAIK, the current built-in Hive 2.3.9 client 
works well for communicating with Hive Metastore servers from 2.1 through 3.1 
(maybe 2.0 too, not sure). This brings up a new question: is 
IsolatedClientLoader still required?
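
For reference, a sketch of the knobs involved today (values are examples only):

    val spark = org.apache.spark.sql.SparkSession.builder()
      .enableHiveSupport()
      // "builtin" uses the bundled Hive 2.3.9 client directly; "maven" or "path"
      // load a different client version through IsolatedClientLoader
      .config("spark.sql.hive.metastore.jars", "builtin")
      .config("spark.sql.hive.metastore.version", "2.3.9")
      .config("spark.hadoop.hive.metastore.uris", "thrift://hms-host:9083")
      .getOrCreate()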

I think we should drop IsolatedClientLoader because

1. As explained above, we can use the built-in Hive 2.3.9 client to communicate 
with HMS 2.1+
2. Since SPARK-42539[2], the default Hive 2.3.9 client does not use 
IsolatedClientLoader, and as explained in SPARK-42539, IsolatedClientLoader 
causes some inconsistent behaviors.
3. It blocks Guava upgrades. HIVE-27560[3] aims to make Hive 2.3.10 (unreleased) 
compatible with all Guava 14+ versions, but unfortunately, Guava is marked as 
`isSharedClass`[3] in IsolatedClientLoader, so technically, if we want to 
upgrade Guava, we need to make all supported Hive versions (2.1.x through 
3.1.x) support a higher version of Guava, which I think is impossible.

[1] 
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientVersions.scala
[2] https://issues.apache.org/jira/browse/SPARK-42539
[3] https://issues.apache.org/jira/browse/HIVE-27560
[4] https://github.com/apache/spark/pull/33989#issuecomment-926277286

Thanks,
Cheng Pan


> On Aug 8, 2023, at 05:47, Sean Owen  wrote:
> 
> Are there old Hive/Hadoop version combos we should just stop supporting?



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcome two new Apache Spark committers

2023-08-06 Thread Cheng Pan
Congratulations! Peter and Xiduo! 

Thanks,
Cheng Pan


> On Aug 7, 2023, at 10:58, Gengliang Wang  wrote:
> 
> Congratulations! Peter and Xiduo! 



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Cheng Pan
This API looks like it starts from scratch and has no relationship with the existing 
Java/Scala DataSourceV2 API. In particular, how can they support SQL?

We have been back and forth on the DataSource V2 design since 2.3; I believe 
there are some lessons to learn when introducing the Python DataSource API.

Thanks,
Cheng Pan




> On Jun 16, 2023, at 12:14, Allison Wang  
> wrote:
> 
> Hi everyone,
> 
> I would like to start a discussion on “Python Data Source API”.
> 
> This proposal aims to introduce a simple API in Python for Data Sources. The 
> idea is to enable Python developers to create data sources without having to 
> learn Scala or deal with the complexities of the current data source APIs. 
> The goal is to make a Python-based API that is simple and easy to use, thus 
> making Spark more accessible to the wider Python developer community. This 
> proposed approach is based on the recently introduced Python user-defined 
> table functions with extensions to support data sources.
> 
> SPIP Doc:  
> https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
> 
> SPIP JIRA: https://issues.apache.org/jira/browse/SPARK-44076
> 
> Looking forward to your feedback.
> 
> Thanks,
> Allison


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.5.0 Expectations (?)

2023-05-30 Thread Cheng Pan
@Bjørn Jørgensen

I did some investigation into upgrading Guava after Spark dropped Hadoop 2 support, 
but unfortunately, Hive still depends on it. The worse thing is that Guava's classes 
are marked as shared in IsolatedClientLoader[1], which means Spark cannot upgrade 
Guava, even after upgrading the built-in Hive from the current 2.3.9 to a new 
version that does not stick to an old Guava, without breaking old versions of the 
Hive Metastore client.

I can't find any clues as to why the Guava classes need to be marked as shared; 
can anyone provide some background?

[1] 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L215

Thanks,
Cheng Pan


> On May 31, 2023, at 03:49, Bjørn Jørgensen  wrote:
> 
> @Dongjoon Hyun Thank you.
> 
> I have two points to discuss. 
> First, we are currently conducting tests with Python versions 3.8 and 3.9. 
> Should we consider replacing 3.9 with 3.11?
> 
> Secondly, I'd like to know the status of Google Guava. 
> With Hadoop version 2 no longer being utilized, is there any other factor 
> that is posing a blockage for this?
> 
> On Tue, May 30, 2023 at 10:39, Mich Talebzadeh  wrote:
> I don't know whether it is related but Scala 2.12.17 is fine for the Spark 3 
> family (compile and run) . I spent a day compiling  Spark 3.4.0 code against 
> Scala 2.13.8 with maven and was getting all sorts of weird and wonderful 
> errors at runtime.
> 
> HTH
> 
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
> 
>view my Linkedin profile
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
>  Disclaimer: Use it at your own risk. Any and all responsibility for any 
> loss, damage or destruction of data or any other property which may arise 
> from relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction. 
>   
> 
> On Tue, 30 May 2023 at 01:59, Jungtaek Lim  
> wrote:
> Shall we initiate a new discussion thread for Scala 2.13 by default? While 
> I'm not an expert on this area, it sounds like the change is major and 
> (probably) breaking. It seems to be worth having a separate discussion thread 
> rather than just treat it like one of 25 items.
> 
> On Tue, May 30, 2023 at 9:54 AM Sean Owen  wrote:
> It does seem risky; there are still likely libs out there that don't cross 
> compile for 2.13. I would make it the default at 4.0, myself.
> 
> On Mon, May 29, 2023 at 7:16 PM Hyukjin Kwon  wrote:
> While I support going forward with a higher version, actually using Scala 
> 2.13 by default is a big deal especially in a way that:
> • Users would likely download the built-in version assuming that it’s 
> backward binary compatible.
> • PyPI doesn't allow specifying the Scala version, meaning that users 
> wouldn’t have a way to 'pip install pyspark' based on Scala 2.12.
> I wonder if it’s safer to do it in Spark 4 (which I believe will be discussed 
> soon).
> 
> 
> On Mon, 29 May 2023 at 13:21, Jia Fan  wrote:
> Thanks Dongjoon!
> There are some ticket I want to share.
> SPARK-39420 Support ANALYZE TABLE on v2 tables
> SPARK-42750 Support INSERT INTO by name
> SPARK-43521 Support CREATE TABLE LIKE FILE
> 
> On Mon, May 29, 2023 at 08:42, Dongjoon Hyun  wrote:
> Hi, All.
> 
> Apache Spark 3.5.0 is scheduled for August (1st Release Candidate) and 
> currently a few notable things are under discussions in the mailing list.
> 
> I believe it's a good time to share a short summary list (containing both 
> completed and in-progress items) to give a highlight in advance and to 
> collect your targets too.
> 
> Please share your expectations or working items if you want to prioritize 
> them more in the community in Apache Spark 3.5.0 timeframe.
> 
> (Sorted by ID)
> SPARK-40497 Upgrade Scala 2.13.11
> SPARK-42452 Remove hadoop-2 profile from Apache Spark 3.5.0
> SPARK-42913 Upgrade to Hadoop 3.3.5 (aws-java-sdk-bundle: 1.12.262 -> 
> 1.12.316)
> SPARK-43024 Upgrade Pandas to 2.0.0
> SPARK-43200 Remove Hadoop 2 reference in docs
> SPARK-43347 Remove Python 3.7 Support
> SPARK-43348 Support Python 3.8 in PyPy3
> SPARK-43351 Add Spark Connect Go prototype code and example
> SPARK-43379 Deprecate old Java 8 versions prior to 8u371
> SPARK-43394 Upgrade to Maven 3.8.8
> SPARK-43436 Upgrade to RocksDbjni 8.1.1.1
> SPARK-43446 Upgrade to Apache Arrow 12.0.0
> SPARK-43447 Support R 4.3.0
> SPARK-43489 Remove protobuf 2.5.0
> SPARK-43519 Bump Parquet to 1.13.1
> SPARK-43581 Upgrade kubernetes-client to 6.6.2
> SPARK-43588 Upgrade to ASM 9.5
> SPARK-43600 Update K8

Re: Remove protobuf 2.5.0 from Spark dependencies

2023-05-23 Thread Cheng Pan
+CC dev@hbase

Thanks,
Cheng Pan

On Fri, May 19, 2023 at 4:08 AM Steve Loughran
 wrote:
>
>
>
> On Thu, 18 May 2023 at 03:45, Cheng Pan  wrote:
>>
>> Steve, thanks for the information, I think HADOOP-17046 should be fine for 
>> the Spark case.
>>
>> Hadoop put the protobuf 3 into the pre-shaded hadoop-thirdparty, and the 
>> hadoop-client-runtime shades protobuf 2 during the package, which results in 
>> protobuf 2 and 3 co-exist in hadoop-client-runtime in different packages:
>>
>> - protobuf 2: org.apache.hadoop.shaded.com.google.protobuf
>> - protobuf 3: org.apache.hadoop.thirdparty.protobuf
>
> oh, so in fact that "put it back in unshaded" change doesn't do anything 
> useful through the hadoop-client lib. so it is very much useless.
>>
>>
>> As HADOOP-18487 plans to mark the protobuf 2 optional, will this make 
>> hadoop-client-runtime does not ship protobuf 2? If yes, things become worse 
>> for downstream projects who consumes hadoop shaded client, like Spark, 
>> because it requires the user to add vanilla protobuf 2 jar into the 
>> classpath if they want to access those API.
>
>
> Well, what applications are using  
> org.apache.hadoop.shaded.com.google.protobuf ? hadoop itself doesn't; it's 
> only referenced in unshaded form because hbase wanted the IPC library to 
> still work with the unshaded version they were still using. But if the 
> parquet2 lib is now only available shaded, their protobuf compiled .class 
> files aren't going to link to it, are they?
>
> does anyone know how spark + hbase + hadoop-client-runtime work so that spark 
> can talk to an hbase server? especially: what is needed on the classpath, and 
> what gets loaded for a call
>>
>>
>> In summary, I think the current state is fine. But for security purposes, 
>> the Hadoop community may want to remove the EOL protobuf 2 classes from 
>> hadoop-client-runtime.
>
>
>  +1. the shaded one which is in use also needs upgrading.
>
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> On May 17, 2023 at 04:10:43, Dongjoon Hyun  wrote:
>>>
>>> Thank you for sharing, Steve.
>>>
>>> Dongjoon
>>>
>>> On Tue, May 16, 2023 at 11:44 AM Steve Loughran 
>>>  wrote:
>>>>
>>>> I have some bad news here which is even though hadoop cut protobuf 2.5 
>>>> support, hbase team put it back in (HADOOP-17046). I don't know if the 
>>>> shaded hadoop client has removed that dependency on protobuf 2.5.
>>>>
>>>> In HADOOP-18487 i want to allow hadoop to cut that dependency, with hbase 
>>>> having to add it to the classpath if they still want it:
>>>> https://github.com/apache/hadoop/pull/4996
>>>>
>>>> It's been neglected -if you can help with review/test etc that'd be great. 
>>>> I'd love to get this into the 3.3.6 release.
>>>>
>>>> On Sat, 13 May 2023 at 08:36, Cheng Pan  wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> In SPARK-42452 (apache/spark#41153 [1]), I’m trying to remove protobuf 
>>>>> 2.5.0 from the Spark dependencies.
>>>>>
>>>>> Spark does not use protobuf 2.5.0 directly, instead, it comes from other 
>>>>> dependencies, with the following changes, now, Spark does not require 
>>>>> protobuf 2.5.0.
>>>>>
>>>>> - SPARK-40323 upgraded ORC 1.8.0, which moved from protobuf 2.5.0 to a 
>>>>> shaded protobuf 3
>>>>>
>>>>> - SPARK-33212 switched from Hadoop vanilla client to Hadoop shaded 
>>>>> client, also removed the protobuf 2 dependency. SPARK-42452 removed the 
>>>>> support for Hadoop 2.
>>>>>
>>>>> - SPARK-14421 shaded and relocated protobuf 2.6.1, which is required by 
>>>>> the kinesis client, into the kinesis assembly jar
>>>>>
>>>>> - Spark itself's core/connect/protobuf modules use protobuf 3, also 
>>>>> shaded and relocated all protobuf 3 deps.
>>>>>
>>>>> Feel free to comment if you still have any concerns.
>>>>>
>>>>> [1] https://github.com/apache/spark/pull/41153
>>>>>
>>>>> Thanks,
>>>>> Cheng Pan

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Re: Remove protobuf 2.5.0 from Spark dependencies

2023-05-17 Thread Cheng Pan
Steve, thanks for the information, I think HADOOP-17046 should be fine for
the Spark case.

Hadoop put the protobuf 3 into the pre-shaded hadoop-thirdparty, and the
hadoop-client-runtime shades protobuf 2 during the package, which results
in protobuf 2 and 3 co-exist in hadoop-client-runtime in different packages:

- protobuf 2: org.apache.hadoop.shaded.com.google.protobuf
- protobuf 3: org.apache.hadoop.thirdparty.protobuf

As HADOOP-18487 plans to mark the protobuf 2 optional, will this make
hadoop-client-runtime
does not ship protobuf 2? If yes, things become worse for downstream
projects who consumes hadoop shaded client, like Spark, because it requires
the user to add vanilla protobuf 2 jar into the classpath if they want to
access those API.

In summary, I think the current state is fine. But for security purposes,
the Hadoop community may want to remove the EOL protobuf 2 classes from
hadoop-client-runtime.

Thanks,
Cheng Pan


On May 17, 2023 at 04:10:43, Dongjoon Hyun  wrote:

> Thank you for sharing, Steve.
>
> Dongjoon
>
> On Tue, May 16, 2023 at 11:44 AM Steve Loughran
>  wrote:
>
>> I have some bad news here which is even though hadoop cut protobuf 2.5
>> support, hbase team put it back in (HADOOP-17046). I don't know if the
>> shaded hadoop client has removed that dependency on protobuf 2.5.
>>
>> In HADOOP-18487 i want to allow hadoop to cut that dependency, with hbase
>> having to add it to the classpath if they still want it:
>> https://github.com/apache/hadoop/pull/4996
>>
>> It's been neglected -if you can help with review/test etc that'd be
>> great. I'd love to get this into the 3.3.6 release.
>>
>> On Sat, 13 May 2023 at 08:36, Cheng Pan  wrote:
>>
>>> Hi all,
>>>
>>> In SPARK-42452 (apache/spark#41153 [1]), I’m trying to remove protobuf
>>> 2.5.0 from the Spark dependencies.
>>>
>>> Spark does not use protobuf 2.5.0 directly, instead, it comes from other
>>> dependencies, with the following changes, now, Spark does not require
>>> protobuf 2.5.0.
>>>
>>> - SPARK-40323 upgraded ORC 1.8.0, which moved from protobuf 2.5.0 to a
>>> shaded protobuf 3
>>>
>>> - SPARK-33212 switched from Hadoop vanilla client to Hadoop shaded
>>> client, also removed the protobuf 2 dependency. SPARK-42452 removed the
>>> support for Hadoop 2.
>>>
>>> - SPARK-14421 shaded and relocated protobuf 2.6.1, which is required by
>>> the kinesis client, into the kinesis assembly jar
>>>
>>> - Spark itself's core/connect/protobuf modules use protobuf 3, also
>>> shaded and relocated all protobuf 3 deps.
>>>
>>> Feel free to comment if you still have any concerns.
>>>
>>> [1] https://github.com/apache/spark/pull/41153
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>


Remove protobuf 2.5.0 from Spark dependencies

2023-05-13 Thread Cheng Pan
Hi all,

In SPARK-42452 (apache/spark#41153 [1]), I’m trying to remove protobuf
2.5.0 from the Spark dependencies.

Spark does not use protobuf 2.5.0 directly, instead, it comes from other
dependencies, with the following changes, now, Spark does not require
protobuf 2.5.0.

- SPARK-40323 upgraded ORC 1.8.0, which moved from protobuf 2.5.0 to a
shaded protobuf 3

- SPARK-33212 switched from Hadoop vanilla client to Hadoop shaded client,
also removed the protobuf 2 dependency. SPARK-42452 removed the support for
Hadoop 2.

- SPARK-14421 shaded and relocated protobuf 2.6.1, which is required by the
kinesis client, into the kinesis assembly jar

- Spark itself's core/connect/protobuf modules use protobuf 3, also shaded
and relocated all protobuf 3 deps.

Feel free to comment if you still have any concerns.

[1] https://github.com/apache/spark/pull/41153

Thanks,
Cheng Pan


Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Cheng Pan
There is a DSv2-based Hive connector in Apache Kyuubi[1] that supports
connecting multiple HMS in a single Spark application.

Some limitations

- currently only supports Spark 3.3
- has a known issue when used with `spark-sql`, but works fine with spark-shell and
normal jar-based Spark applications.

[1]
https://github.com/apache/kyuubi/tree/master/extensions/spark/kyuubi-spark-connector-hive
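
As mentioned above, here is a rough sketch of how two metastores can be exposed as separate DSv2 catalogs with that connector; the catalog implementation class and option names are from my recollection of the Kyuubi docs, so treat them as placeholders and check the project documentation for the exact values:

    val spark = org.apache.spark.sql.SparkSession.builder()
      .config("spark.sql.catalog.hms_a", "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog")
      .config("spark.sql.catalog.hms_a.hive.metastore.uris", "thrift://metastore-a:9083")
      .config("spark.sql.catalog.hms_b", "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog")
      .config("spark.sql.catalog.hms_b.hive.metastore.uris", "thrift://metastore-b:9083")
      .getOrCreate()

    // tables from both metastores can then be referenced in one query
    spark.sql("SELECT a.id FROM hms_a.db1.t1 a JOIN hms_b.db2.t2 b ON a.id = b.id")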

Thanks,
Cheng Pan


On Apr 18, 2023 at 00:38:23, Elliot West  wrote:

> Hi Ankit,
>
> While not a part of Spark, there is a project called 'WaggleDance' that
> can federate multiple Hive metastores so that they are accessible via a
> single URI: https://github.com/ExpediaGroup/waggle-dance
>
> This may be useful or perhaps serve as inspiration.
>
> Thanks,
>
> Elliot.
>
> On Mon, 17 Apr 2023 at 16:38, Ankit Gupta  wrote:
>
>> ++
>> User Mailing List
>>
>> Just a reminder, anyone who can help on this.
>>
>> Thanks a lot !
>>
>> Ankit Prakash Gupta
>>
>> On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta 
>> wrote:
>>
>>> Hi All
>>>
>>> The question is regarding the support of multiple Remote Hive Metastore
>>> catalogs with Spark. Starting Spark 3, multiple catalog support is added in
>>> spark, but have we implemented any CatalogPlugin that can help us configure
>>> multiple Remote Hive Metastore Catalogs ? If yes, can anyone help me with
>>> the Fully Qualified Class Name that I can try using for configuring a Hive
>>> Metastore Catalog. If not, I would like to work on the implementation of
>>> the CatalogPlugin that we can use to configure multiple Hive Metastore
>>> Servers' .
>>>
>>> Thanks and Regards.
>>>
>>> Ankit Prakash Gupta
>>> +91 8750101321
>>> info.ank...@gmail.com
>>>
>>>


Re: [Question] LimitedInputStream license issue in Spark source.

2023-02-28 Thread Cheng Pan
+CC @Justin Mclean and @Willem Jiang, since you raised the concerns quoted below.

> It looks like Spark may have incorrectly added that header. You could ask
> them why it was added perhaps or just leave it as is.
>
> Kind Regards,
> Justin

> I have the same question as Justin asked, do we need to add the ASF License
> header to this file[1]?
>
> [1] ./common/src/main/java/org/apache/celeborn/common/network/util/LimitedInputStream.java
>
> Willem Jiang


Thanks,
Cheng Pan


On Mar 1, 2023 at 15:04:52, Dongjoon Hyun  wrote:

> Since both license headers are Apache License 2.0, we don't see any issue
> there. They are compatible.
>
> The first line of the second license header means the file was copied from
> Google Guava project originally.
>
> Apache Spark community keeps the original header because it has
> `Authorship` part, `Copyright (C) 2007 The Guava Authors`.
>
> /*
>  * Based on LimitedInputStream.java from Google Guava
>  *
>  * Copyright (C) 2007 The Guava Authors
>  *
>  *Licensed under the Apache License, Version 2.0 (the "License");
>  *you may not use this file except in compliance with the License.
>  *You may obtain a copy of the License at
>  *
>  *http://www.apache.org/licenses/LICENSE-2.0
>  *
>  *Unless required by applicable law or agreed to in writing, software
>  *distributed under the License is distributed on an "AS IS" BASIS,
>  *WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> implied.
>  *See the License for the specific language governing permissions and
>  *limitations under the License.
>  */
>
> Dongjoon.
>
>
>
> On Tue, Feb 28, 2023 at 10:27 PM Ethan Feng  wrote:
>
>> Hi developers,
>> I have recently been running a release in the Celeborn (incubating) project,
>> and the incubator PPMC members have raised concerns[2] about the headers
>> in a source file[1] that we copied from the Spark source.
>> We don't understand why the source file[1] has two headers with
>> different copyright notices. Could you explain why the source
>> file[1] needs two headers?
>>
>> Regards,
>> Ethan Feng
>>
>> -
>> 1.  org/apache/spark/network/util/LimitedInputStream.java
>> 2.  https://lists.apache.org/thread/q8hchs4n2twhjl0qf3rn0ln0hyrvrnnp
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Allow Spark on K8s to integrate w/ External Log Service

2022-11-02 Thread Cheng Pan
The key point here is: how do you jump from the Spark UI to the log service to
explore or download the logs of each Pod, as you can with Spark on YARN?

Thanks,
Cheng Pan


On Nov 2, 2022 at 18:32:26, Martin Andersson 
wrote:

> Hello Cheng.
>
> I don't quite understand: why can't you configure the Log4j bundled with
> Spark to write logs in whatever format you need, then use something like
> Promtail to export the logs to whatever log service you want to use?
>
> BR,
> Martin
> ----------
> *From:* Cheng Pan 
> *Sent:* Tuesday, November 1, 2022 17:03
> *To:* spark-dev 
> *Subject:* Allow Spark on K8s to integrate w/ External Log Service
>
>
> EXTERNAL SENDER. Do not click links or open attachments unless you
> recognize the sender and know the content is safe. DO NOT provide your
> username or password.
>
> Hi Spark community,
>
> I'm currently migrating Spark workloads from YARN to K8s, and one of the
> issues I found is that there is no out-of-the-box log solution for Spark
> applications on K8s like the one YARN has, and Spark does not even provide
> a way to integrate with an external log service.
>
> Based on the above situation, I propose a way[1][2] to allow Spark to
> integrate with an external log service when running on K8s; a PR[3] is
> available for review and testing.
>
> I also found an earlier PR[4] that attempts something similar.
>
> Please let me know if you have any feedback on this idea, and other log
> solutions/ideas/suggestions for Spark on K8s are welcome as well.
>
> [1]
> https://docs.google.com/document/d/1MfB39LD4B4Rp7MDRxZbMKMbdNSe6V6mBmMQ-gkCnM-0/edit?usp=sharing
> [2] https://issues.apache.org/jira/browse/SPARK-40887
> [3] https://github.com/apache/spark/pull/38357
> [4] https://github.com/apache/spark/pull/32456
>
> Thanks,
> Cheng Pan
>


Allow Spark on K8s to integrate w/ External Log Service

2022-11-01 Thread Cheng Pan
Hi Spark community,

I'm currently migrating Spark workloads from YARN to K8s, and one of the
issues I found is that there is no out-of-the-box log solution for Spark
applications on K8s like the one YARN has, and Spark does not even provide a
way to integrate with an external log service.

Based on the above situation, I propose a way[1][2] to allow Spark to
integrate with an external log service when running on K8s; a PR[3] is
available for review and testing.

I also found an earlier PR[4] that attempts something similar.

Please let me know if you have any feedback on this idea, and other log
solutions/ideas/suggestions for Spark on K8s are welcome as well.

[1]
https://docs.google.com/document/d/1MfB39LD4B4Rp7MDRxZbMKMbdNSe6V6mBmMQ-gkCnM-0/edit?usp=sharing
[2] https://issues.apache.org/jira/browse/SPARK-40887
[3] https://github.com/apache/spark/pull/38357
[4] https://github.com/apache/spark/pull/32456

Thanks,
Cheng Pan


Re: [VOTE] Release Spark 3.3.1 (RC4)

2022-10-18 Thread Cheng Pan
+1 (non-binding)

- Passed Apache Kyuubi (Incubating) integration tests[1]
- Run some jobs on our internal K8s cluster

[1] https://github.com/apache/incubator-kyuubi/pull/3507

Thanks,
Cheng Pan

On Wed, Oct 19, 2022 at 9:13 AM Yikun Jiang  wrote:
>
> +1, tests also passed with the spark-docker workflow (download the rc4 tgz,
> extract, build the image, run the K8s IT)
>
> [1] https://github.com/Yikun/spark-docker/pull/9
>
> Regards,
> Yikun
>
> On Wed, Oct 19, 2022 at 8:59 AM Wenchen Fan  wrote:
>>
>> +1
>>
>> On Wed, Oct 19, 2022 at 4:59 AM Chao Sun  wrote:
>>>
>>> +1. Thanks Yuming!
>>>
>>> Chao
>>>
>>> On Tue, Oct 18, 2022 at 1:18 PM Thomas graves  wrote:
>>> >
>>> > +1. Ran internal test suite.
>>> >
>>> > Tom
>>> >
>>> > On Sun, Oct 16, 2022 at 9:14 PM Yuming Wang  wrote:
>>> > >
>>> > > Please vote on releasing the following candidate as Apache Spark 
>>> > > version 3.3.1.
>>> > >
>>> > > The vote is open until 11:59pm Pacific time October 21th and passes if 
>>> > > a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> > >
>>> > > [ ] +1 Release this package as Apache Spark 3.3.1
>>> > > [ ] -1 Do not release this package because ...
>>> > >
>>> > > To learn more about Apache Spark, please see https://spark.apache.org
>>> > >
>>> > > The tag to be voted on is v3.3.1-rc4 (commit 
>>> > > fbbcf9434ac070dd4ced4fb9efe32899c6db12a9):
>>> > > https://github.com/apache/spark/tree/v3.3.1-rc4
>>> > >
>>> > > The release files, including signatures, digests, etc. can be found at:
>>> > > https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-bin
>>> > >
>>> > > Signatures used for Spark RCs can be found in this file:
>>> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> > >
>>> > > The staging repository for this release can be found at:
>>> > > https://repository.apache.org/content/repositories/orgapachespark-1430
>>> > >
>>> > > The documentation corresponding to this release can be found at:
>>> > > https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-docs
>>> > >
>>> > > The list of bug fixes going into 3.3.1 can be found at the following 
>>> > > URL:
>>> > > https://s.apache.org/ttgz6
>>> > >
>>> > > This release is using the release script of the tag v3.3.1-rc4.
>>> > >
>>> > >
>>> > > FAQ
>>> > >
>>> > > ==
>>> > > What happened to v3.3.1-rc3?
>>> > > ==
>>> > > A performance regression (SPARK-40703) was found after tagging
>>> > > v3.3.1-rc3, which the Iceberg community hopes Spark 3.3.1 will fix.
>>> > > So we skipped the vote on v3.3.1-rc3.
>>> > >
>>> > > =
>>> > > How can I help test this release?
>>> > > =
>>> > > If you are a Spark user, you can help us test this release by taking
>>> > > an existing Spark workload and running on this release candidate, then
>>> > > reporting any regressions.
>>> > >
>>> > > If you're working in PySpark you can set up a virtual env and install
>>> > > the current RC and see if anything important breaks, in the Java/Scala
>>> > > you can add the staging repository to your projects resolvers and test
>>> > > with the RC (make sure to clean up the artifact cache before/after so
>>> > > you don't end up building with an out-of-date RC going forward).
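
For the Java/Scala route just described, a minimal build.sbt sketch could look
like the following; the resolver URL is the orgapachespark-1430 staging
repository from this thread, and the spark-sql module is only an example
dependency.

    // build.sbt -- sketch for compiling against the 3.3.1 RC4 staging artifacts.
    ThisBuild / scalaVersion := "2.12.15"

    resolvers += ("Spark 3.3.1 RC4 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1430/")

    // Any Spark module can be tested this way; spark-sql is just an example.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.1"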
>>> > >
>>> > > ===
>>> > > What should happen to JIRA tickets still targeting 3.3.1?
>>> > > ===
>>> > > The current list of open tickets targeted at 3.3.1 can be found at:
>>> > > https://issues.apache.org/jira/projects/SPARK and search for "Target 
>>> > > Version/s" = 3.3.1
>>> > >
>>> > > Committers should look at those and triage. Extremely important bug
>>> > > fixes, documentation, and API tweaks that impact compatibility should
>>> > > be worked on immediately. Everything else please retarget to an
>>> > > appropriate release.
>>> > >
>>> > > ==
>>> > > But my bug isn't fixed?
>>> > > ==
>>> > > In order to make timely releases, we will typically not hold the
>>> > > release unless the bug in question is a regression from the previous
>>> > > release. That being said, if there is something which is a regression
>>> > > that has not been correctly targeted please ping me or a committer to
>>> > > help target the issue.
>>> > >
>>> > >
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-07 Thread Cheng Pan
+1 (non-binding)

* Verified SPARK-39313 has been address[1]
* Passed integration test w/ Apache Kyuubi (Incubating)[2]

[1] https://github.com/housepower/spark-clickhouse-connector/pull/123
[2] https://github.com/apache/incubator-kyuubi/pull/2817

Thanks,
Cheng Pan

On Wed, Jun 8, 2022 at 7:04 AM Chris Nauroth  wrote:
>
> +1 (non-binding)
>
> * Verified all checksums.
> * Verified all signatures.
> * Built from source, with multiple profiles, to full success, for Java 11 and 
> Scala 2.13:
> * build/mvn -Phadoop-3 -Phadoop-cloud -Phive-thriftserver -Pkubernetes 
> -Pscala-2.13 -Psparkr -Pyarn -DskipTests clean package
> * Tests passed.
> * Ran several examples successfully:
> * bin/spark-submit --class org.apache.spark.examples.SparkPi 
> examples/jars/spark-examples_2.12-3.3.0.jar
> * bin/spark-submit --class 
> org.apache.spark.examples.sql.hive.SparkHiveExample 
> examples/jars/spark-examples_2.12-3.3.0.jar
> * bin/spark-submit 
> examples/src/main/python/streaming/network_wordcount.py localhost 
> * Tested some of the issues that blocked prior release candidates:
> * bin/spark-sql -e 'SELECT (SELECT IF(x, 1, 0)) AS a FROM (SELECT true) 
> t(x) UNION SELECT 1 AS a;'
> * bin/spark-sql -e "select date '2018-11-17' > 1"
> * SPARK-39293 ArrayAggregate fix
>
> Chris Nauroth
>
>
> On Tue, Jun 7, 2022 at 1:30 PM Cheng Su  wrote:
>>
>> +1 (non-binding). Built and ran some internal tests for Spark SQL.
>>
>>
>>
>> Thanks,
>>
>> Cheng Su
>>
>>
>>
>> From: L. C. Hsieh 
>> Date: Tuesday, June 7, 2022 at 1:23 PM
>> To: dev 
>> Subject: Re: [VOTE] Release Spark 3.3.0 (RC5)
>>
>> +1
>>
>> Liang-Chi
>>
>> On Tue, Jun 7, 2022 at 1:03 PM Gengliang Wang  wrote:
>> >
>> > +1 (non-binding)
>> >
>> > Gengliang
>> >
>> > On Tue, Jun 7, 2022 at 12:24 PM Thomas Graves  wrote:
>> >>
>> >> +1
>> >>
>> >> Tom Graves
>> >>
>> >> On Sat, Jun 4, 2022 at 9:50 AM Maxim Gekk
>> >>  wrote:
>> >> >
>> >> > Please vote on releasing the following candidate as Apache Spark 
>> >> > version 3.3.0.
>> >> >
>> >> > The vote is open until 11:59pm Pacific time June 8th and passes if a 
>> >> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >> >
>> >> > [ ] +1 Release this package as Apache Spark 3.3.0
>> >> > [ ] -1 Do not release this package because ...
>> >> >
>> >> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >> >
>> >> > The tag to be voted on is v3.3.0-rc5 (commit 
>> >> > 7cf29705272ab8e8c70e8885a3664ad8ae3cd5e9):
>> >> > https://github.com/apache/spark/tree/v3.3.0-rc5
>> >> >
>> >> > The release files, including signatures, digests, etc. can be found at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-bin/
>> >> >
>> >> > Signatures used for Spark RCs can be found in this file:
>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >> >
>> >> > The staging repository for this release can be found at:
>> >> > https://repository.apache.org/content/repositories/orgapachespark-1406
>> >> >
>> >> > The documentation corresponding to this release can be found at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-docs/
>> >> >
>> >> > The list of bug fixes going into 3.3.0 can be found at the following 
>> >> > URL:
>> >> > https://issues.apache.org/jira/projects/SPARK/versions/12350369
>> >> >
>> >> > This release is using the release script of the tag v3.3.0-rc5.
>> >> >
>> >> >
>> >> > FAQ
>> >> >
>> >> > =
>> >> > How can I help test this release?
>> >> > =
>> >> > If you are a Spark user, you can help us test this release by taking
>> >> > an existing Spark workload and running on this release candidate, then
>> >> > reporting any regressions.
>> >> >
>> >> > If you're working in PySpark you can set up a virtual env and install
>> >> > the current RC and see if anything important breaks, in the Java/Scala
>> >> > you can 

Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-09 Thread Cheng Pan
+1 (non-binding)

Integration test passed[1] with my project[2].

[1] https://github.com/housepower/spark-clickhouse-connector/runs/3834335017
[2] https://github.com/housepower/spark-clickhouse-connector

Thanks,
Cheng Pan


On Sat, Oct 9, 2021 at 2:01 PM Ye Zhou  wrote:

> +1 (non-binding).
>
> Ran a Maven build and tested within our YARN cluster, in client and cluster
> mode, with push-based shuffle enabled/disabled, shuffling a large
> amount of data. Applications ran successfully with the expected shuffle
> behavior.
>
> On Fri, Oct 8, 2021 at 10:06 PM sarutak  wrote:
>
>> +1
>>
>> I think no critical issue left.
>> Thank you Gengliang.
>>
>> Kousuke
>>
>> > +1
>> >
>> > Looks good.
>> >
>> > Liang-Chi
>> >
>> > On 2021/10/08 16:16:12, Kent Yao  wrote:
>> >> +1 (non-binding)
>> >>
>> >> BR,
>> >> Kent Yao @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>> >> A Spark enthusiast; kyuubi is a unified multi-tenant JDBC interface for
>> >> large-scale data processing and analytics, built on top of Apache Spark.