Spark on Kubernetes scheduler variety

2021-06-17 Thread Holden Karau
Hi Folks,

I'm continuing my adventures in making Spark on containers party, and I
was wondering if folks have experience with the different batch
scheduler options and which they prefer. I was thinking that, so we can
better support dynamic allocation, it might make sense for us to
support using different schedulers, and I wanted to see if there are
any that the community is more interested in.

I know that one of the Spark on Kube operators supports
Volcano/kube-batch, so I was thinking that might be a place to start
exploring, but I also want to be open to other schedulers that folks
might be interested in.
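
To make this concrete, here is a minimal sketch (assumptions mine, not an
agreed design) of how a non-default scheduler can already be requested
through the existing pod template confs, alongside shuffle-tracking dynamic
allocation:

import org.apache.spark.sql.SparkSession

// Sketch only: driver-volcano.yaml / executor-volcano.yaml are hypothetical
// pod templates whose spec.schedulerName points at the batch scheduler
// (e.g. volcano). The confs used below all exist today.
val spark = SparkSession.builder()
  .master("k8s://https://kubernetes.example.com:6443")
  .config("spark.kubernetes.container.image", "example/spark:3.1.2")
  .config("spark.kubernetes.driver.podTemplateFile", "/templates/driver-volcano.yaml")
  .config("spark.kubernetes.executor.podTemplateFile", "/templates/executor-volcano.yaml")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .getOrCreate()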

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.0.3 (RC1)

2021-06-17 Thread Sean Owen
+1 same result as ever. Signatures are OK, tags look good, tests pass.

On Thu, Jun 17, 2021 at 5:11 AM Yi Wu  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.0.3.
>
> The vote is open until June 21st at 3 AM (PST) and passes if a majority of
> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.0.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.0.3-rc1 (commit
> 65ac1e75dc468f53fc778cd2ce1ba3f21067aab8):
> https://github.com/apache/spark/tree/v3.0.3-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1386/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-docs/
>
> The list of bug fixes going into 3.0.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349723
>
> This release is using the release script of the tag v3.0.3-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install the
> current RC, and see if anything important breaks. In Java/Scala, you can
> add the staging repository to your project's resolvers and test with the
> RC (make sure to clean up the artifact cache before/after so you don't
> end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.0.3?
> ===
>
> The current list of open tickets targeted at 3.0.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.3
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted, please ping me or a committer to
> help target the issue.
>


Re: CRAN package SparkR

2021-06-17 Thread Felix Cheung
Any suggestions or comments on this? They are going to remove the package
by 2021-06-28.

It seems to me that if we have a switch to opt in to the install (off by
default), or prompt the user in an interactive session, that should be
good enough as user confirmation.



On Sun, Jun 13, 2021 at 11:25 PM Felix Cheung 
wrote:

> It looks like they would not allow caching the Spark
> Distribution.
>
> I’m not sure what can be done about this.
>
> If I recall, the package should remove this during tests. Or maybe make
> spark.install() optional (hence getting user confirmation)?
>
>
> -- Forwarded message -
> Date: Sun, Jun 13, 2021 at 10:19 PM
> Subject: CRAN package SparkR
> To: Felix Cheung 
> CC: 
>
>
> Dear maintainer,
>
> Checking this apparently creates the default directory as per
>
> #' @param localDir a local directory where Spark is installed. The directory
> #'contains version-specific folders of Spark packages. Default is
> #'path to the cache directory:
> #' \itemize{
> #'   \item Mac OS X: \file{~/Library/Caches/spark}
> #'   \item Unix: \env{$XDG_CACHE_HOME} if defined,
> #'otherwise \file{~/.cache/spark}
> #'   \item Windows: \file{\%LOCALAPPDATA\%\\Apache\\Spark\\Cache}.
> #' }
>
> However, the CRAN Policy says
>
>   - Packages should not write in the user’s home filespace (including
> clipboards), nor anywhere else on the file system apart from the R
> session’s temporary directory (or during installation in the
> location pointed to by TMPDIR: and such usage should be cleaned
> up). Installing into the system’s R installation (e.g., scripts to
> its bin directory) is not allowed.
>
> Limited exceptions may be allowed in interactive sessions if the
> package obtains confirmation from the user.
>
> For R version 4.0 or later (hence a version dependency is required
> or only conditional use is possible), packages may store
> user-specific data, configuration and cache files in their
> respective user directories obtained from tools::R_user_dir(),
> provided that by default sizes are kept as small as possible and the
> contents are actively managed (including removing outdated
> material).
>
> Can you please fix as necessary?
>
> Please fix before 2021-06-28 to safely retain your package on CRAN.
>
> Best
> -k
>


Re: UPDATE: Apache Spark 3.2 Release

2021-06-17 Thread Dongjoon Hyun
Thank you for the correction, Yikun.
Yes, it's 3.3.1. :)

On 2021/06/17 09:03:55, Yikun Jiang  wrote: 
> - Apache Hadoop 3.3.2 becomes the default Hadoop profile for Apache Spark
> 3.2 via SPARK-29250 today. We are observing big improvements in S3 use
> cases. Please try it and share your experience.
> 
> It should be Apache Hadoop 3.3.1 [1]. :)
> 
> Note that Apache Hadoop 3.3.0 was the first Hadoop release to ship both x86
> and aarch64 artifacts, and 3.3.1 does as well. Very happy to see 3.3.1
> become the default dependency of Spark 3.2.0.
> 
> [1] https://hadoop.apache.org/release/3.3.1.html
> 
> Regards,
> Yikun
> 
> 
> > Dongjoon Hyun  wrote on Thursday, 17 June 2021 at 5:58 AM:
> 
> > This is a continuation of the previous thread, `Apache Spark 3.2
> > Expectation`, in order to give you updates.
> >
> > -
> > https://lists.apache.org/thread.html/r61897da071729913bf586ddd769311ce8b5b068e7156c352b51f7a33%40%3Cdev.spark.apache.org%3E
> >
> > First of all, the AS-IS schedule is here
> >
> > - https://spark.apache.org/versioning-policy.html
> >
> >   July 1st Code freeze. Release branch cut.
> >   Mid July QA period. Focus on bug fixes, tests, stability and docs.
> > Generally, no new features merged.
> >   August   Release candidates (RC), voting, etc. until final release passes
> >
> > Second, Gengliang Wang volunteered as the release manager and has started
> > working in that role. Thank you! He shared the ongoing issues, and I want
> > to piggyback the following onto his list.
> >
> >
> > # Languages
> >
> > - Scala 2.13 Support: Although SPARK-25075 is almost done and we have a
> > Scala 2.13 Jenkins job on the master branch, we do not support Scala
> > 2.13.6. We should document this if Scala 2.13.7 does not arrive on time.
> >   Please see https://github.com/scala/scala/pull/9641 (Milestone Scala
> > 2.13.7).
> >
> > - SparkR CRAN publishing: Apache SparkR 3.1.2 is on CRAN as of today, but
> > we got policy violation warnings for the cache directory. The fix deadline
> > is 2021-06-28. If the package is removed again, we will need to retry via
> > Apache Spark 3.2.0 after making a fix.
> >   https://cran.r-project.org/web/packages/SparkR/index.html
> >
> >
> > # Dependencies
> >
> > - Apache Hadoop 3.3.2 becomes the default Hadoop profile for Apache Spark
> > 3.2 via SPARK-29250 today. We are observing big improvements in S3 use
> > cases. Please try it and share your experience.
> >
> > - Apache Hive 2.3.9 becomes the built-in Hive library, with more HMS
> > compatibility fixes added recently. We need to re-evaluate the previous
> > HMS incompatibility reports.
> >
> > - K8s 1.21 was released May 12th. K8s Client 5.4.1 supports it in Apache
> > Spark 3.2. In addition, public cloud vendors are starting to support K8s
> > 1.20. Please note that there is a breaking K8s API change from K8s Client
> > 4.x to 5.x.
> >
> > - SPARK-33913 upgraded the Apache Kafka Client dependency to 2.8.0, and
> > the Kafka community is considering deprecating Scala 2.12 support in
> > Apache Kafka 3.0.
> >
> > - SPARK-34542 upgraded the Apache Parquet dependency to 1.12.0. However,
> > we need SPARK-34859 to fix a column index issue before the release. In
> > addition, Apache Parquet encryption is added as a developer API; a custom
> > KMS client must be implemented.
> >
> > - SPARK-35489 upgraded the Apache ORC dependency to 1.6.8. We additionally
> > need ORC-804 for a better masking feature.
> >
> > - SPARK-34651 improved ZStandard support with ZStandard 1.4.9, and we are
> > currently evaluating the newly arrived ZStandard 1.5.0 as well. JDK11
> > performance is under investigation. In addition, SPARK-35181 (use zstd for
> > spark.io.compression.codec by default) is still on the way separately.
> >
> >
> > # Newly arrived items
> >
> > - SPARK-35779 Dynamic filtering for Data Source V2
> >
> > - SPARK-35781 Support Spark on Apple Silicon on macOS natively
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] Release Spark 3.0.3 (RC1)

2021-06-17 Thread Yi Wu
Please vote on releasing the following candidate as Apache Spark version
3.0.3.

The vote is open until June 21st at 3 AM (PST) and passes if a majority of
+1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v3.0.3-rc1 (commit
65ac1e75dc468f53fc778cd2ce1ba3f21067aab8):
https://github.com/apache/spark/tree/v3.0.3-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1386/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-docs/

The list of bug fixes going into 3.0.3 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12349723

This release is using the release script of the tag v3.0.3-rc1.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install the
current RC, and see if anything important breaks. In Java/Scala, you can
add the staging repository to your project's resolvers and test with the
RC (make sure to clean up the artifact cache before/after so you don't
end up building with an out-of-date RC going forward).
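
For the Java/Scala route, a minimal build.sbt sketch against the staging
repository above (assumptions: an sbt project; the RC publishes under the
final 3.0.3 version number):

resolvers += "Apache Spark 3.0.3 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1386/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.3"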

===
What should happen to JIRA tickets still targeting 3.0.3?
===

The current list of open tickets targeted at 3.0.3 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.0.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted, please ping me or a committer to
help target the issue.


Re: UPDATE: Apache Spark 3.2 Release

2021-06-17 Thread Yikun Jiang
- Apache Hadoop 3.3.2 becomes the default Hadoop profile for Apache Spark
3.2 via SPARK-29250 today. We are observing big improvements in S3 use
cases. Please try it and share your experience.

It should be Apache Hadoop 3.3.1 [1]. :)

Note that Apache Hadoop 3.3.0 was the first Hadoop release to ship both x86
and aarch64 artifacts, and 3.3.1 does as well. Very happy to see 3.3.1
become the default dependency of Spark 3.2.0.

[1] https://hadoop.apache.org/release/3.3.1.html

Regards,
Yikun


Dongjoon Hyun  wrote on Thursday, 17 June 2021 at 5:58 AM:

> This is a continuation of the previous thread, `Apache Spark 3.2
> Expectation`, in order to give you updates.
>
> -
> https://lists.apache.org/thread.html/r61897da071729913bf586ddd769311ce8b5b068e7156c352b51f7a33%40%3Cdev.spark.apache.org%3E
>
> First of all, the AS-IS schedule is here
>
> - https://spark.apache.org/versioning-policy.html
>
>   July 1st Code freeze. Release branch cut.
>   Mid July QA period. Focus on bug fixes, tests, stability and docs.
> Generally, no new features merged.
>   August   Release candidates (RC), voting, etc. until final release passes
>
> Second, Gengliang Wang volunteered as the release manager and has started
> working in that role. Thank you! He shared the ongoing issues, and I want
> to piggyback the following onto his list.
>
>
> # Languages
>
> - Scala 2.13 Support: Although SPARK-25075 is almost done and we have a
> Scala 2.13 Jenkins job on the master branch, we do not support Scala
> 2.13.6. We should document this if Scala 2.13.7 does not arrive on time.
>   Please see https://github.com/scala/scala/pull/9641 (Milestone Scala
> 2.13.7).
>
> - SparkR CRAN publishing: Apache SparkR 3.1.2 is on CRAN as of today, but
> we got policy violation warnings for the cache directory. The fix deadline
> is 2021-06-28. If the package is removed again, we will need to retry via
> Apache Spark 3.2.0 after making a fix.
>   https://cran.r-project.org/web/packages/SparkR/index.html
>
>
> # Dependencies
>
> - Apache Hadoop 3.3.2 becomes the default Hadoop profile for Apache Spark
> 3.2 via SPARK-29250 today. We are observing big improvements in S3 use
> cases. Please try it and share your experience.
>
> - Apache Hive 2.3.9 becomes the built-in Hive library, with more HMS
> compatibility fixes added recently. We need to re-evaluate the previous
> HMS incompatibility reports.
>
> - K8s 1.21 was released May 12th. K8s Client 5.4.1 supports it in Apache
> Spark 3.2. In addition, public cloud vendors are starting to support K8s
> 1.20. Please note that there is a breaking K8s API change from K8s Client
> 4.x to 5.x.
>
> - SPARK-33913 upgraded the Apache Kafka Client dependency to 2.8.0, and
> the Kafka community is considering deprecating Scala 2.12 support in
> Apache Kafka 3.0.
>
> - SPARK-34542 upgraded the Apache Parquet dependency to 1.12.0. However,
> we need SPARK-34859 to fix a column index issue before the release. In
> addition, Apache Parquet encryption is added as a developer API; a custom
> KMS client must be implemented.
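>
> As a hedged sketch (not from the original mail), wiring in a custom KMS
> client via the Parquet 1.12 key tools could look roughly like this;
> com.example.MyKmsClient is a hypothetical placeholder:
>
> // assuming an existing SparkSession `spark` and DataFrame `df`
> spark.sparkContext.hadoopConfiguration.set("parquet.crypto.factory.class",
>   "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.kms.client.class",
>   "com.example.MyKmsClient")  // the custom KMS client to be implemented
> df.write
>   .option("parquet.encryption.footer.key", "footerKeyId")
>   .option("parquet.encryption.column.keys", "colKeyId:ssn,credit_card")
>   .parquet("/path/to/encrypted")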
>
> - SPARK-35489 upgraded the Apache ORC dependency to 1.6.8. We additionally
> need ORC-804 for a better masking feature.
>
> - SPARK-34651 improved ZStandard support with ZStandard 1.4.9, and we are
> currently evaluating the newly arrived ZStandard 1.5.0 as well. JDK11
> performance is under investigation. In addition, SPARK-35181 (use zstd for
> spark.io.compression.codec by default) is still on the way separately.
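>
> For reference, a minimal sketch of opting in today with the existing confs
> (zstd is not the default yet):
>
> // both confs already exist; level 1 is the current default level
> val spark = SparkSession.builder()
>   .config("spark.io.compression.codec", "zstd")
>   .config("spark.io.compression.zstd.level", "1")
>   .getOrCreate()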
>
>
> # Newly arrived items
>
> - SPARK-35779 Dynamic filtering for Data Source V2
>
> - SPARK-35781 Support Spark on Apple Silicon on macOS natively
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Migrating from hive to spark

2021-06-17 Thread Mich Talebzadeh
OK, the first link gives some clues:

"... Hive excels in batch disc processing with a map reduce execution
engine. Actually, Hive can also use Spark as its execution engine which
also has a Hive context allowing us to query Hive tables. Despite all the
great things Hive can solve, this post is to talk about why we move our
ETL’s to the ‘not so new’ player for batch processing, ..."

Great, so you want to use Spark for ETL, as opposed to Hive, to clean up
your data once your upstream CDC files have landed on HDFS. Correct?





   view my Linkedin profile




Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 17 Jun 2021 at 08:17, Battula, Brahma Reddy 
wrote:

> Hi Talebzadeh,
>
>
>
> Looks like I caused confusion, sorry. I have now changed the subject to
> make it clear.
>
> Facebook has tried migrating from Hive to Spark. Check the following links
> for the same.
>
>
>
> https://www.dcsl.com/migrating-from-hive-to-spark/
>
>
> https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql
>
> https://www.cloudwalker.io/2019/02/19/spark-ad-hoc-querying/
>
>
>
>
>
> I would like to know: has anybody else migrated like this? Are there any
> challenges or prerequisites for migrating (like hardware)? Any tools to
> evaluate before we migrate?
>
>
>
>
>
>
>
>
>
> From: Mich Talebzadeh
> Date: Tuesday, 15 June 2021 at 10:36 PM
> To: Battula, Brahma Reddy
> Cc: Battula, Brahma Reddy, ayan guha <guha.a...@gmail.com>,
> dev@spark.apache.org, u...@spark.apache.org
> Subject: Re: Spark-sql can replace Hive ?
>
> OK you mean use spark.sql as opposed to HiveContext.sql?
>
>
>
> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> HiveContext.sql("")
>
>
>
> replace with
>
>
>
> spark.sql("")
>
> ?
>
>
>
>
>view my Linkedin profile
> 
>
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Tue, 15 Jun 2021 at 18:00, Battula, Brahma Reddy 
> wrote:
>
> Currently I am using the Hive SQL engine for ad hoc queries. As spark-sql
> also supports this, I want to migrate from Hive.
>
>
>
>
>
>
>
>
>
> From: Mich Talebzadeh
> Date: Thursday, 10 June 2021 at 8:12 PM
> To: Battula, Brahma Reddy
> Cc: ayan guha, dev@spark.apache.org, u...@spark.apache.org
> Subject: Re: Spark-sql can replace Hive ?
>
> These are different things. Spark provides a computational layer and a
> dialect of SQL based on Hive.
>
>
>
> Hive is a DW on top of HDFS. What are you trying to replace?
>
>
>
> HTH
>
>
>
>
>
>
>view my Linkedin profile
> 
>
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 10 Jun 2021 at 12:09, Battula, Brahma Reddy
>  wrote:
>
> Thanks for the prompt reply.
>
>
>
> I want to replace Hive with Spark.
>
>
>
>
>
>
>
>
>
> From: ayan guha
> Date: Thursday, 10 June 2021 at 4:35 PM
> To: Battula, Brahma Reddy
> Cc: dev@spark.apache.org, u...@spark.apache.org
> Subject: Re: Spark-sql can replace Hive ?
>
> Would you mind expanding the ask? Spark SQL can use Hive by itself.
>
>
>
> On Thu, 10 Jun 2021 at 8:58 pm, Battula, Brahma Reddy
>  wrote:
>
> Hi
>
>
>
> Would like to know any references/docs on replacing Hive with spark-sql
> completely, like how to migrate the existing data in Hive.

Migrating from hive to spark

2021-06-17 Thread Battula, Brahma Reddy
Hi Talebzadeh,

Looks like I caused confusion, sorry. I have now changed the subject to make
it clear. Facebook has tried migrating from Hive to Spark. Check the
following links for the same.

https://www.dcsl.com/migrating-from-hive-to-spark/
https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql
https://www.cloudwalker.io/2019/02/19/spark-ad-hoc-querying/


I would like to know: has anybody else migrated like this? Are there any
challenges or prerequisites for migrating (like hardware)? Any tools to
evaluate before we migrate?




From: Mich Talebzadeh 
Date: Tuesday, 15 June 2021 at 10:36 PM
To: Battula, Brahma Reddy 
Cc: Battula, Brahma Reddy , ayan guha 
, dev@spark.apache.org , 
u...@spark.apache.org 
Subject: Re: Spark-sql can replace Hive ?
OK you mean use spark.sql as opposed to HiveContext.sql?

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext.sql("")

replace with

spark.sql("")
?
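
For what it's worth, the modern equivalent is a single SparkSession with
Hive support enabled (HiveContext has been deprecated since Spark 2.0); a
minimal sketch:

import org.apache.spark.sql.SparkSession

// Reads hive-site.xml and talks to the existing Hive metastore.
val spark = SparkSession.builder()
  .appName("hive-on-spark-sql")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()  // existing Hive tables stay queryable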



 
view my Linkedin profile



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Tue, 15 Jun 2021 at 18:00, Battula, Brahma Reddy <bbatt...@visa.com> wrote:
Currently I am using the Hive SQL engine for ad hoc queries. As spark-sql
also supports this, I want to migrate from Hive.




From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Thursday, 10 June 2021 at 8:12 PM
To: Battula, Brahma Reddy
Cc: ayan guha <guha.a...@gmail.com>, dev@spark.apache.org, u...@spark.apache.org
Subject: Re: Spark-sql can replace Hive ?
These are different things. Spark provides a computational layer and a
dialect of SQL based on Hive.

Hive is a DW on top of HDFS. What are you trying to replace?

HTH





 
view my Linkedin profile



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 10 Jun 2021 at 12:09, Battula, Brahma Reddy  
wrote:
Thanks for the prompt reply.

I want to replace Hive with Spark.




From: ayan guha <guha.a...@gmail.com>
Date: Thursday, 10 June 2021 at 4:35 PM
To: Battula, Brahma Reddy
Cc: dev@spark.apache.org, u...@spark.apache.org
Subject: Re: Spark-sql can replace Hive ?
Would you mind expanding the ask? Spark SQL can use Hive by itself.

On Thu, 10 Jun 2021 at 8:58 pm, Battula, Brahma Reddy 
 wrote:
Hi

Would like to know any references/docs on replacing Hive with spark-sql
completely, like how to migrate the existing data in Hive?

thanks


--
Best Regards,
Ayan Guha


Re: Apache Spark 3.2 Expectation

2021-06-17 Thread Hyukjin Kwon
*GA -> QA

On Thu, 17 Jun 2021, 15:16 Hyukjin Kwon,  wrote:

> I think we should make sure to treat these items in the list as exceptions
> from the code freeze, and discourage pushing new APIs and features, though.
>
> In the GA period, ideally we should focus on bug fixes and polishing.
>
> It would be great if we can speed up on these items in the list too.
>
>
> On Thu, 17 Jun 2021, 15:08 Gengliang Wang,  wrote:
>
>> Thanks for the suggestions from Dongjoon, Liang-Chi, Min, and Xiao!
>> Now we have made it clear that it's a soft cut and that we can still merge
>> important code changes to branch-3.2 before the RC. Let's keep the branch
>> cut date as July 1st.
>>
>> On Thu, Jun 17, 2021 at 1:41 PM Dongjoon Hyun 
>> wrote:
>>
>>> > First, I think you are saying "branch-3.2";
>>>
>>> To Xiao: yes, it was a typo for "branch-3.2".
>>>
>>> > We do strongly prefer to cut the release for Spark 3.2.0 including
>>> all the patches under SPARK-30602.
>>> > This way, we can backport the other performance/operability
>>> enhancements tickets under SPARK-33235 into branch-3.2 to be released in
>>> future Spark 3.2.x patch releases.
>>>
>>> To Min, after releasing 3.2.0, only bug fixes are allowed for 3.2.1+ as
>>> Xiao wrote.
>>>
>>>
>>>
>>> On Wed, Jun 16, 2021 at 9:42 PM Xiao Li  wrote:
>>>
 To Liang-Chi, I'm -1 for postponing the branch cut because this is a
> soft cut and the committers still are able to commit to `branch-3.3`
> according to their decisions.


 First, I think you are saying "branch-3.2";

 Second, the "so cut" means no "code freeze", although we cut the
 branch. To avoid releasing half-baked and unready features, the release
 manager needs to be very careful when cutting the RC. Based on what is
 proposed here, the RC date is the actual code freeze date.

 This way, we can backport the other performance/operability
> enhancements tickets under SPARK-33235 into branch-3.2 to be released in
> future Spark 3.2.x patch releases.


 This is not allowed based on the policy. Only bug fixes can be merged
 to the patch releases. Thus, if we know it will introduce major performance
 regression, we have to turn the feature off by default.

 Xiao



 Min Shen  wrote on Wednesday, 16 June 2021 at 3:22 PM:

> Hi Gengliang,
>
> Thanks for volunteering as the release manager for Spark 3.2.0.
> Regarding the ongoing work of push-based shuffle in SPARK-30602, we
> are close to having all the patches merged to master to enable push-based
> shuffle.
> Currently, there are 2 PRs under SPARK-30602 that are under active
> review (SPARK-32922 and SPARK-35671), and hopefully can be merged soon.
> We should be able to post the PRs for the other 2 remaining tickets
> (SPARK-32923 and SPARK-35546) early next week.
>
> The tickets under SPARK-30602 are the minimum set of patches to enable
> push-based shuffle.
> We do have other performance/operability enhancements tickets under
> SPARK-33235 that are needed to fully contribute what we have internally
> for push-based shuffle.
> However, these are optional for enabling push-based shuffle.
> We do strongly prefer to cut the release for Spark 3.2.0 including all
> the patches under SPARK-30602.
> This way, we can backport the other performance/operability
> enhancements tickets under SPARK-33235 into branch-3.2 to be released in
> future Spark 3.2.x patch releases.
> I understand the preference of not postponing the branch cut date.
> We will check with Dongjoon regarding the soft cut date and the
> flexibility for including the remaining tickets under SPARK-30602 into
> branch-3.2.
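>
> For reference, a hedged sketch of the eventual client-side opt-in (the conf
> name is taken from the SPARK-30602 work and may change; an external shuffle
> service with push support is also required):
>
> val spark = SparkSession.builder()
>   .config("spark.shuffle.service.enabled", "true")
>   .config("spark.shuffle.push.enabled", "true")
>   .getOrCreate()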
>
> Best,
> Min
>
> On Wed, Jun 16, 2021 at 1:20 PM Liang-Chi Hsieh 
> wrote:
>
>>
>> Thanks Dongjoon. I've talked with Dongjoon offline to learn more about this.
>> As it is a soft cut date, there is no reason to postpone it.
>>
>> It sounds good then to keep original branch cut date.
>>
>> Thank you.
>>
>>
>>
>> Dongjoon Hyun-2 wrote
>> > Thank you for volunteering, Gengliang.
>> >
>> > Apache Spark 3.2.0 is the first version enabling AQE by default.
>> I'm also
>> > watching some on-going improvements on that.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-33828 (SQL Adaptive Query
>> > Execution QA)
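>> >
>> > A quick sketch of the relevant existing confs (AQE is on by default in
>> > 3.2 and can still be disabled per job):
>> >
>> > // assuming an existing SparkSession `spark`
>> > spark.conf.set("spark.sql.adaptive.enabled", "true")
>> > spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")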
>> >
>> > To Liang-Chi, I'm -1 for postponing the branch cut because this is
>> a soft
>> > cut and the committers still are able to commit to `branch-3.3`
>> according
>> > to their decisions.
>> >
>> > Given that Apache Spark had 115 commits in a week in various areas
>> > concurrently, we should start QA for Apache Spark 3.2 by creating
>> > branch-3.3 and allowing only limited backporting.
>> >
>> > https://github.com/apache/spark/graphs/commit-activity
>> 

Re: Apache Spark 3.2 Expectation

2021-06-17 Thread Hyukjin Kwon
I think we should make sure to treat these items in the list as exceptions
from the code freeze, and discourage pushing new APIs and features, though.

In the GA period, ideally we should focus on bug fixes and polishing.

It would be great if we can speed up on these items in the list too.


On Thu, 17 Jun 2021, 15:08 Gengliang Wang,  wrote:

> Thanks for the suggestions from Dongjoon, Liang-Chi, Min, and Xiao!
> Now we have made it clear that it's a soft cut and that we can still merge
> important code changes to branch-3.2 before the RC. Let's keep the branch
> cut date as July 1st.
>
> On Thu, Jun 17, 2021 at 1:41 PM Dongjoon Hyun 
> wrote:
>
>> > First, I think you are saying "branch-3.2";
>>
>> To Xiao: yes, it was a typo for "branch-3.2".
>>
>> > We do strongly prefer to cut the release for Spark 3.2.0 including all
>> the patches under SPARK-30602.
>> > This way, we can backport the other performance/operability
>> enhancements tickets under SPARK-33235 into branch-3.2 to be released in
>> future Spark 3.2.x patch releases.
>>
>> To Min, after releasing 3.2.0, only bug fixes are allowed for 3.2.1+ as
>> Xiao wrote.
>>
>>
>>
>> On Wed, Jun 16, 2021 at 9:42 PM Xiao Li  wrote:
>>
>>> To Liang-Chi, I'm -1 for postponing the branch cut because this is a
 soft cut and the committers still are able to commit to `branch-3.3`
 according to their decisions.
>>>
>>>
>>> First, I think you are saying "branch-3.2";
>>>
>>> Second, the "soft cut" means no "code freeze", although we cut the branch.
>>> To avoid releasing half-baked and unready features, the release
>>> manager needs to be very careful when cutting the RC. Based on what is
>>> proposed here, the RC date is the actual code freeze date.
>>>
>>> This way, we can backport the other performance/operability enhancements
 tickets under SPARK-33235 into branch-3.2 to be released in future Spark
 3.2.x patch releases.
>>>
>>>
>>> This is not allowed based on the policy. Only bug fixes can be merged to
>>> the patch releases. Thus, if we know it will introduce major performance
>>> regression, we have to turn the feature off by default.
>>>
>>> Xiao
>>>
>>>
>>>
 Min Shen  wrote on Wednesday, 16 June 2021 at 3:22 PM:
>>>
 Hi Gengliang,

 Thanks for volunteering as the release manager for Spark 3.2.0.
 Regarding the ongoing work of push-based shuffle in SPARK-30602, we are
 close to having all the patches merged to master to enable push-based
 shuffle.
 Currently, there are 2 PRs under SPARK-30602 that are under active
 review (SPARK-32922 and SPARK-35671), and hopefully can be merged soon.
 We should be able to post the PRs for the other 2 remaining tickets
 (SPARK-32923 and SPARK-35546) early next week.

 The tickets under SPARK-30602 are the minimum set of patches to enable
 push-based shuffle.
 We do have other performance/operability enhancements tickets under
 SPARK-33235 that are needed to fully contribute what we have internally for
 push-based shuffle.
 However, these are optional for enabling push-based shuffle.
 We do strongly prefer to cut the release for Spark 3.2.0 including all
 the patches under SPARK-30602.
 This way, we can backport the other performance/operability
 enhancements tickets under SPARK-33235 into branch-3.2 to be released in
 future Spark 3.2.x patch releases.
 I understand the preference of not postponing the branch cut date.
 We will check with Dongjoon regarding the soft cut date and the
 flexibility for including the remaining tickets under SPARK-30602 into
 branch-3.2.

 Best,
 Min

 On Wed, Jun 16, 2021 at 1:20 PM Liang-Chi Hsieh 
 wrote:

>
> Thanks Dongjoon. I've talked with Dongjoon offline to learn more about this.
> As it is a soft cut date, there is no reason to postpone it.
>
> It sounds good then to keep original branch cut date.
>
> Thank you.
>
>
>
> Dongjoon Hyun-2 wrote
> > Thank you for volunteering, Gengliang.
> >
> > Apache Spark 3.2.0 is the first version enabling AQE by default. I'm
> also
> > watching some on-going improvements on that.
> >
> > https://issues.apache.org/jira/browse/SPARK-33828 (SQL Adaptive Query
> > Execution QA)
> >
> > To Liang-Chi, I'm -1 for postponing the branch cut because this is a
> soft
> > cut and the committers still are able to commit to `branch-3.3`
> according
> > to their decisions.
> >
> > Given that Apache Spark had 115 commits in a week in various areas
> > concurrently, we should start QA for Apache Spark 3.2 by creating
> > branch-3.3 and allowing only limited backporting.
> >
> > https://github.com/apache/spark/graphs/commit-activity
> >
> > Bests,
> > Dongjoon.
> >
> >
> > On Wed, Jun 16, 2021 at 9:19 AM Liang-Chi Hsieh  wrote:
> >
> >> First, thanks for volunteering as the release manager of Spark 3.2.0,

Re: Apache Spark 3.2 Expectation

2021-06-17 Thread Gengliang Wang
Thanks for the suggestions from Dongjoon, Liang-Chi, Min, and Xiao!
Now we have made it clear that it's a soft cut and that we can still merge
important code changes to branch-3.2 before the RC. Let's keep the branch
cut date as July 1st.

On Thu, Jun 17, 2021 at 1:41 PM Dongjoon Hyun 
wrote:

> > First, I think you are saying "branch-3.2";
>
> To Xiao: yes, it was a typo for "branch-3.2".
>
> > We do strongly prefer to cut the release for Spark 3.2.0 including all
> the patches under SPARK-30602.
> > This way, we can backport the other performance/operability
> enhancements tickets under SPARK-33235 into branch-3.2 to be released in
> future Spark 3.2.x patch releases.
>
> To Min, after releasing 3.2.0, only bug fixes are allowed for 3.2.1+ as
> Xiao wrote.
>
>
>
> On Wed, Jun 16, 2021 at 9:42 PM Xiao Li  wrote:
>
>> To Liang-Chi, I'm -1 for postponing the branch cut because this is a soft
>>> cut and the committers still are able to commit to `branch-3.3` according
>>> to their decisions.
>>
>>
>> First, I think you are saying "branch-3.2";
>>
>> Second, the "soft cut" means no "code freeze", although we cut the branch.
>> To avoid releasing half-baked and unready features, the release
>> manager needs to be very careful when cutting the RC. Based on what is
>> proposed here, the RC date is the actual code freeze date.
>>
>> This way, we can backport the other performance/operability enhancements
>>> tickets under SPARK-33235 into branch-3.2 to be released in future Spark
>>> 3.2.x patch releases.
>>
>>
>> This is not allowed based on the policy. Only bug fixes can be merged to
>> the patch releases. Thus, if we know it will introduce major performance
>> regression, we have to turn the feature off by default.
>>
>> Xiao
>>
>>
>>
>> Min Shen  wrote on Wednesday, 16 June 2021 at 3:22 PM:
>>
>>> Hi Gengliang,
>>>
>>> Thanks for volunteering as the release manager for Spark 3.2.0.
>>> Regarding the ongoing work of push-based shuffle in SPARK-30602, we are
>>> close to having all the patches merged to master to enable push-based
>>> shuffle.
>>> Currently, there are 2 PRs under SPARK-30602 that are under active
>>> review (SPARK-32922 and SPARK-35671), and hopefully can be merged soon.
>>> We should be able to post the PRs for the other 2 remaining tickets
>>> (SPARK-32923 and SPARK-35546) early next week.
>>>
>>> The tickets under SPARK-30602 are the minimum set of patches to enable
>>> push-based shuffle.
>>> We do have other performance/operability enhancements tickets under
>>> SPARK-33235 that are needed to fully contribute what we have internally for
>>> push-based shuffle.
>>> However, these are optional for enabling push-based shuffle.
>>> We do strongly prefer to cut the release for Spark 3.2.0 including all
>>> the patches under SPARK-30602.
>>> This way, we can backport the other performance/operability enhancements
>>> tickets under SPARK-33235 into branch-3.2 to be released in future Spark
>>> 3.2.x patch releases.
>>> I understand the preference of not postponing the branch cut date.
>>> We will check with Dongjoon regarding the soft cut date and the
>>> flexibility for including the remaining tickets under SPARK-30602 into
>>> branch-3.2.
>>>
>>> Best,
>>> Min
>>>
>>> On Wed, Jun 16, 2021 at 1:20 PM Liang-Chi Hsieh 
>>> wrote:
>>>

 Thanks Dongjoon. I've talked with Dongjoon offline to learn more about this.
 As it is a soft cut date, there is no reason to postpone it.

 It sounds good then to keep original branch cut date.

 Thank you.



 Dongjoon Hyun-2 wrote
 > Thank you for volunteering, Gengliang.
 >
 > Apache Spark 3.2.0 is the first version enabling AQE by default. I'm
 also
 > watching some on-going improvements on that.
 >
 > https://issues.apache.org/jira/browse/SPARK-33828 (SQL Adaptive Query
 > Execution QA)
 >
 > To Liang-Chi, I'm -1 for postponing the branch cut because this is a
 soft
 > cut and the committers still are able to commit to `branch-3.3`
 according
 > to their decisions.
 >
 > Given that Apache Spark had 115 commits in a week in various areas
 > concurrently, we should start QA for Apache Spark 3.2 by creating
 > branch-3.3 and allowing only limited backporting.
 >
 > https://github.com/apache/spark/graphs/commit-activity
 >
 > Bests,
 > Dongjoon.
 >
 >
 > On Wed, Jun 16, 2021 at 9:19 AM Liang-Chi Hsieh  wrote:
 >
 >> First, thanks for volunteering as the release manager of Spark 3.2.0,
 >> Gengliang!
 >>
 >> And yes, for the two important Structured Streaming features, RocksDB
 >> StateStore and session window, we're working on them and expect to have
 >> them in the new release.
 >>
 >> So I propose to postpone the branch cut date.
 >>
 >> Thank you!
 >>
 >> Liang-Chi
 >>
 >>
 >> Gengliang Wang-2 wrote
 >> > Thanks, Hyukjin.