Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread huaxin gao
Thanks Anton for the updated proposal -- it looks great! I appreciate the hard work put into refining it. I am looking forward to the upcoming vote and moving forward with this initiative. Thanks, Huaxin On Thu, May 9, 2024 at 7:30 PM L. C. Hsieh wrote: > Thanks Anton. Thank you, Wenchen,

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Wenchen Fan
Thanks for leading this project! Let's move forward. On Fri, May 10, 2024 at 10:31 AM L. C. Hsieh wrote: > Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison and > others if I miss those who are participating in the discussion. > > I suppose we have reached a consensus or close to

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread L. C. Hsieh
Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison, and others, in case I missed anyone participating in the discussion. I suppose we have reached a consensus, or are close to it, on the design. If you have more comments, please let us know. If not, I will start a vote soon

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Anton Okolnychyi
Thanks to everyone who commented on the design doc. I updated the proposal and it is ready for another look. I hope we can converge and move forward with this effort! - Anton On Fri, Apr 19, 2024 at 15:54, Anton Okolnychyi wrote: > Hi folks, > > I'd like to start a discussion on SPARK-44167 that

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE: I've successfully uploaded the release packages: https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/ (I skipped SparkR as I was not able to fix the errors, I'll get back to it later) However, there is a new issue with doc building:

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
Please re-try to upload, Wenchen. ASF Infra team bumped up our upload limit based on our request. > Your upload limit has been increased to 650MB Dongjoon. On Thu, May 9, 2024 at 8:12 AM Wenchen Fan wrote: > I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776 > > On

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776 On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun wrote: > In addition, FYI, I was the latest release manager with Apache Spark 3.4.3 > (2024-04-15 Vote) > > According to my work log, I uploaded the following binaries to SVN

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
In addition, FYI, I was the latest release manager with Apache Spark 3.4.3 (2024-04-15 Vote) According to my work log, I uploaded the following binaries to SVN from EC2 (us-west-2) without any issues. -rw-r--r--. 1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz -rw-r--r--. 1 centos

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
Could you file an INFRA JIRA issue with the error message and context first, Wenchen? As you know, if we see something, we had better file a JIRA issue because it could be not only an Apache Spark project issue but also all ASF project issues. Dongjoon. On Thu, May 9, 2024 at 12:28 AM Wenchen

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Wenchen Fan
Thanks for starting the discussion! To add a bit more color, we should at least add a test job to make sure the release script can produce the packages correctly. Today it's kind of being manually tested by the release manager each time, which slows down the release process. It's better if we can

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Hussein Awala
Hello, I can answer some of your questions that are common with other Apache projects. > Who currently has permissions for GitHub Actions? Is there a specific owner for that today or a different volunteer each time? The Apache GitHub organization owns the GitHub Actions setup, and committers (contributors with write

[DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Nimrod Ofek
Following the conversation started with the Spark 4.0.0 release, this is a thread to discuss improvements to our release processes. I'll start by raising some questions that should probably have answers, to get the discussion going: 1. What is currently running in GitHub Actions? 2. Who currently

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE: After resolving a few issues in the release scripts, I can finally build the release packages. However, I can't upload them to the staging SVN repo due to a transmitting error, and it seems like a limitation from the server side. I tried it on both my local laptop and remote AWS instance,

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Prem Sahoo
Very helpful! On Wed, May 8, 2024 at 9:07 AM Mich Talebzadeh wrote: > *Potential reasons* > > >- Data Serialization: Spark needs to serialize the DataFrame into an >in-memory format suitable for storage. This process can be time-consuming, >especially for large datasets like 3.2 GB

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Holden Karau
That looks cool, maybe let’s split off a thread on how to improve our release processes? Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Erik Krogen
On that note, GitHub recently released (in public preview) a new feature called Artifact Attestations which may be relevant/useful here: Introducing Artifact Attestations - now in public beta - The GitHub Blog On Wed,

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Mich Talebzadeh
*Potential reasons* - Data Serialization: Spark needs to serialize the DataFrame into an in-memory format suitable for storage. This process can be time-consuming, especially for large datasets like 3.2 GB with complex schemas. - Shuffle Operations: If your transformations involve

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Prem Sahoo
Could anyone help me here? Sent from my iPhone > On May 7, 2024, at 4:30 PM, Prem Sahoo wrote: > >  > Hello Folks, > in Spark I have read a file and done some transformation and finally writing > to hdfs. > > Now I am interested in writing the same dataframe to MapRFS but for this > Spark

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Nimrod Ofek
I have no permissions so I can't do it, but I'm happy to help (although I am more familiar with GitLab CI/CD than GitHub Actions). Is there some point of contact that can provide me the needed context and permissions? I'd also love to see why the costs are high and see how we can reduce them... Thanks,

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
I think signing the artifacts produced from a secure CI sounds like a good idea. I know we’ve been asked to reduce our GitHub action usage but perhaps someone interested could volunteer to set that up. Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.):

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi, Thanks for the reply. From my experience, a build on a build server would be much more predictable and less error-prone than building on some laptop - and of course much faster for producing builds, snapshots, early preview releases, release candidates, or final releases. It

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
Indeed. We could conceivably build the release in CI/CD but the final verification / signing should be done locally to keep the keys safe (there was some concern from earlier release processes). Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.):

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Dongjoon Hyun
Thank you so much for the update, Wenchen! Dongjoon. On Tue, May 7, 2024 at 10:49 AM Wenchen Fan wrote: > UPDATE: > > Unfortunately, it took me quite some time to set up my laptop and get it > ready for the release process (docker desktop doesn't work anymore, my pgp > key is lost, etc.). I'll

caching a dataframe in Spark takes lot of time

2024-05-07 Thread Prem Sahoo
Hello Folks, in Spark I have read a file, done some transformations, and am finally writing to HDFS. Now I am interested in writing the same dataframe to MapRFS, but for this Spark will execute the full DAG again (recompute all the previous steps) (all the reads + transformations). I don't want

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi, Sorry for the novice question, Wenchen - the release is done manually from a laptop? Not using a CI/CD process on a build server? Thanks, Nimrod On Tue, May 7, 2024 at 8:50 PM Wenchen Fan wrote: > UPDATE: > > Unfortunately, it took me quite some time to set up my laptop and get it > ready

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Wenchen Fan
UPDATE: Unfortunately, it took me quite some time to set up my laptop and get it ready for the release process (docker desktop doesn't work anymore, my pgp key is lost, etc.). I'll start the RC process tomorrow (my time). Thanks for your patience! Wenchen On Fri, May 3, 2024 at 7:47 AM yangjie01

Spark not creating staging dir for insertInto partitioned table

2024-05-07 Thread Sanskar Modi
Hi Folks, I wanted to check why spark doesn't create staging dir while doing an insertInto on partitioned tables. I'm running below example code – ``` spark.sql("set hive.exec.dynamic.partition.mode=nonstrict") val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3))) val df =
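
For context, a hedged SQL sketch of the dynamic-partition insert the truncated snippet above is setting up (table and column names are made up, not from the thread):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition values are taken from the trailing column(s) of the SELECT
-- output instead of being hard-coded in the PARTITION clause:
INSERT INTO TABLE target_tbl PARTITION (part_col)
SELECT id, value, part_col FROM source_tbl;
```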

Re: ASF board report draft for May

2024-05-06 Thread Matei Zaharia
I’ll mention that we’re working toward a preview release, even if the details are not finalized by the time we send the report. > On May 6, 2024, at 10:52 AM, Holden Karau wrote: > > I trust Wenchen to manage the preview release effectively but if there are > concerns around how to manage a

Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
I trust Wenchen to manage the preview release effectively, but if there are concerns around how to manage a developer preview release, let's split that off from the board report discussion. On Mon, May 6, 2024 at 10:44 AM Mich Talebzadeh wrote: > I did some historical digging on this. > > Whilst

Re: Why spark-submit works with package not with jar

2024-05-06 Thread Mich Talebzadeh
Thanks David. I wanted to explain the difference between Package and Jar with comments from the community on previous discussions back a few years ago. cheers Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin

Re: ASF board report draft for May

2024-05-06 Thread Mich Talebzadeh
I did some historical digging on this. Whilst both preview release and RCs are pre-release versions, the main difference lies in their maturity and readiness for production use. Preview releases are early versions aimed at gathering feedback, while release candidates (RCs) are nearly finished

Re: Why spark-submit works with package not with jar

2024-05-06 Thread David Rabinowitz
Hi, It seems this library is several years old. Have you considered using the Google provided connector? You can find it in https://github.com/GoogleCloudDataproc/spark-bigquery-connector Regards, David Rabinowitz On Sun, May 5, 2024 at 6:07 PM Jeff Zhang wrote: > Are you sure

Re: ASF board report draft for May

2024-05-06 Thread Mich Talebzadeh
@Wenchen Fan Thanks for the update! To clarify, is the vote for approving a specific preview build, or is it for moving towards an RC stage? I gather there is a distinction between these two? Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United

Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
If folks are against the term soon we could say “in-progress” Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau On Mon, May 6, 2024 at

Re: ASF board report draft for May

2024-05-06 Thread Mich Talebzadeh
Hi, We should reconsider using the term "soon" for ASF board as it is subjective with no date (assuming this is an official communication on Wednesday). We ought to say "Spark 4, the next major release after Spark 3.x, is currently under development. We plan to make a preview version available

Re: ASF board report draft for May

2024-05-06 Thread Wenchen Fan
The preview release also needs a vote. I'll try my best to cut the RC on Monday, but the actual release may take some time. Hopefully, we can get it out this week but if the vote fails, it will take longer as we need more RCs. On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun wrote: > +1 for

Re: Why spark-submit works with package not with jar

2024-05-05 Thread Jeff Zhang
Are you sure com.google.api.client.http.HttpRequestInitialize is in the spark-bigquery-latest.jar or it may be in the transitive dependency of spark-bigquery_2.11? On Sat, May 4, 2024 at 7:43 PM Mich Talebzadeh wrote: > > Mich Talebzadeh, > Technologist | Architect | Data Engineer | Generative

Re: ASF board report draft for May

2024-05-05 Thread Dongjoon Hyun
+1 for Holden's comment. Yes, it would be great to mention `it` as "soon". (If Wenchen release it on Monday, we can simply mention the release) In addition, Apache Spark PMC received an official notice from ASF Infra team. https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg >

Re: ASF board report draft for May

2024-05-05 Thread Holden Karau
Do we want to include that we’re planning on having a preview release of Spark 4 so folks can see the APIs “soon”? Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams:

ASF board report draft for May

2024-05-05 Thread Matei Zaharia
It’s time for our quarterly ASF board report on Apache Spark this Wednesday. Here’s a draft, feel free to suggest changes. Description: Apache Spark is a fast and general purpose engine for large-scale data processing. It offers high-level APIs in Java, Scala, Python, R

Re: [SparkListener] Accessing classes loaded via the '--packages' option

2024-05-04 Thread Mich Talebzadeh
In answer to this part of your question: "..*Understanding the Issue:* Are there known reasons within Spark that could explain this difference in behavior when loading dependencies via `--packages` versus placing JARs directly?" 2. --jars adds only that jar; --packages adds the jar and its
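
The distinction can be sketched on the command line (the package coordinate and jar paths below are placeholders, not taken from the thread):

```shell
# --jars ships exactly the listed jars; transitive dependencies are NOT resolved:
spark-submit --jars /path/to/libA.jar,/path/to/libB.jar app.py

# --packages resolves the Maven coordinate AND its transitive dependencies via Ivy:
spark-submit --packages com.example:lib-a_2.12:1.0.0 app.py
```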

Fwd: Why spark-submit works with package not with jar

2024-05-04 Thread Mich Talebzadeh
Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is correct

Fwd: [SparkListener] Accessing classes loaded via the '--packages' option

2024-05-04 Thread Damien Hawes
Hi folks, I'm contributing to the OpenLineage project, specifically the Apache Spark integration. My current focus is on extending the project to support data lineage extraction for Spark Streaming, beginning with Apache Kafka sources and sinks. I've encountered an obstacle when attempting to

Re: Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Jungtaek Lim
(removing user@ as the topic is not aimed at the user group) I would like to make a clarification on SPIP, as there have been multiple improper proposals, and the ticket also mentions SPIP without fulfilling the effective requirements. A SPIP is only effective when there is a dedicated individual or

Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Mich Talebzadeh
Hi, I have raised a ticket, SPARK-48117, for enhancing Spark capabilities with Materialised Views (MVs). Currently both Hive and Databricks support this. I have added these potential benefits to the ticket: - Improved Query Performance

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Thanks for the comments I received. So in summary, Apache Spark itself doesn't directly manage materialized views (MVs), but it can work with them through integration with the underlying data storage systems like Hive or through Iceberg. I believe Databricks, through Unity Catalog, supports MVs, as

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess you must have created the view from Hive and are trying to drop it from Spark, and that is why you are running into the issue with DROP first. There is

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
An issue I encountered while working with Materialized Views in Spark SQL. It appears that there is an inconsistency between the behavior of Materialized Views in Spark SQL and Hive. When attempting to execute a statement like DROP MATERIALIZED VIEW IF EXISTS test.mv in Spark SQL, I encountered a

Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread yangjie01
+1 From: Jungtaek Lim Date: Thursday, May 2, 2024, 10:21 To: Holden Karau Cc: Chao Sun , Xiao Li , Tathagata Das , Wenchen Fan , Cheng Pan , Nicholas Chammas , Dongjoon Hyun , Cheng Pan , Spark dev list , Anish Shrigondekar Subject: Re: [DISCUSS] Spark 4.0.0 release +1 love to see it! On Thu, May 2,

Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread Mich Talebzadeh
- Integration with additional external data sources or systems, say Hive - Enhancements to the Spark UI for improved monitoring and debugging - Enhancements to machine learning (MLlib) algorithms and capabilities, like TensorFlow or PyTorch integration (if any is in the pipeline) HTH Mich

Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread Steve Loughran
There's a new parquet RC up this week which would be good to pull in. On Thu, 2 May 2024 at 03:20, Jungtaek Lim wrote: > +1 love to see it! > > On Thu, May 2, 2024 at 10:08 AM Holden Karau > wrote: > >> +1 :) yay previews >> >> On Wed, May 1, 2024 at 5:36 PM Chao Sun wrote: >> >>> +1 >>> >>>

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Will Raschkowski
To add some user perspective, I wanted to share our experience from automatically upgrading tens of thousands of jobs from Spark 2 to 3 at Palantir: We didn't mind "loud" changes that threw exceptions. We have some infra to try running jobs with Spark 3 and fall back to Spark 2 if there's an

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-02 Thread Nimrod Ofek
Hi Erik and Wenchen, I think a good practice with public APIs - and with internal APIs that have big impact and heavy usage - is to ease in changes by giving new parameters defaults that keep the former behaviour, keeping a method with the previous signature under a deprecation notice, and
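
The pattern described here - old signature keeps working, the new parameter defaults to the former behaviour, and a deprecation notice points callers at the replacement - can be sketched in plain Python (function and parameter names are illustrative, not Spark APIs):

```python
import warnings

def load_table(name, *, validate=False, strict=None):
    """`validate` is new and defaults to the former behaviour (off).

    `strict` is the old parameter, kept so existing call sites keep
    working; it now only forwards to `validate` with a notice.
    """
    if strict is not None:
        warnings.warn(
            "strict= is deprecated; use validate= instead",
            DeprecationWarning,
            stacklevel=2,
        )
        validate = strict
    return {"table": name, "validated": validate}

# Old call sites keep working unchanged and keep the old behaviour:
assert load_table("events") == {"table": "events", "validated": False}
```

Removing `strict` entirely then becomes a clearly announced, separate breaking change in a later major release.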

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Jungtaek Lim
+1 love to see it! On Thu, May 2, 2024 at 10:08 AM Holden Karau wrote: > +1 :) yay previews > > On Wed, May 1, 2024 at 5:36 PM Chao Sun wrote: > >> +1 >> >> On Wed, May 1, 2024 at 5:23 PM Xiao Li wrote: >> >>> +1 for next Monday. >>> >>> We can do more previews when the other features are

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Holden Karau
+1 :) yay previews On Wed, May 1, 2024 at 5:36 PM Chao Sun wrote: > +1 > > On Wed, May 1, 2024 at 5:23 PM Xiao Li wrote: > >> +1 for next Monday. >> >> We can do more previews when the other features are ready for preview. >> >> Tathagata Das wrote on Wednesday, May 1, 2024 at 08:46: >> >>> Next week sounds

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Hi Erik, Thanks for sharing your thoughts! Note: developer APIs are also public APIs (such as Data Source V2 API, Spark Listener API, etc.), so breaking changes should be avoided as much as we can and new APIs should be mentioned in the release notes. Breaking binary compatibility is also a

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Chao Sun
+1 On Wed, May 1, 2024 at 5:23 PM Xiao Li wrote: > +1 for next Monday. > > We can do more previews when the other features are ready for preview. > > Tathagata Das wrote on Wednesday, May 1, 2024 at 08:46: > >> Next week sounds great! Thank you Wenchen! >> >> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: >>

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Hyukjin Kwon
SGTM On Thu, 2 May 2024 at 02:06, Dongjoon Hyun wrote: > +1 for next Monday. > > Dongjoon. > > On Wed, May 1, 2024 at 8:46 AM Tathagata Das > wrote: > >> Next week sounds great! Thank you Wenchen! >> >> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: >> >>> Yea I think a preview release

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Xiao Li
+1 for next Monday. We can do more previews when the other features are ready for preview. Tathagata Das wrote on Wednesday, May 1, 2024 at 08:46: > Next week sounds great! Thank you Wenchen! > > On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: > >> Yea I think a preview release won't hurt (without a branch

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Dongjoon Hyun
+1 for next Monday. Dongjoon. On Wed, May 1, 2024 at 8:46 AM Tathagata Das wrote: > Next week sounds great! Thank you Wenchen! > > On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: > >> Yea I think a preview release won't hurt (without a branch cut). We don't >> need to wait for all the

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Erik Krogen
Thanks for raising this important discussion Wenchen! Two points I would like to raise, though I'm fully supportive of any improvements in this regard, my points below notwithstanding -- I am not intending to let perfect be the enemy of good here. On a similar note as Santosh's comment, we should

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Wenchen Fan
Good point, Santosh! I was originally targeting end users who write queries with Spark, as this is probably the largest user base. But we should definitely consider other users who deploy and manage Spark clusters. Those users are usually more tolerant of behavior changes and I think it should be

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Tathagata Das
Next week sounds great! Thank you Wenchen! On Wed, May 1, 2024 at 11:16 AM Wenchen Fan wrote: > Yea I think a preview release won't hurt (without a branch cut). We don't > need to wait for all the ongoing projects to be ready. How about we do a > 4.0 preview release based on the current master

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-01 Thread Santosh Pingale
Thanks Wenchen for starting this! How do we define "the user" for Spark? 1. End users: There are some users that use Spark as a service from a provider 2. Providers/Operators: There are some users that provide Spark as a service for their internal (on-prem setup with YARN/K8s) / external (something

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Wenchen Fan
Yea I think a preview release won't hurt (without a branch cut). We don't need to wait for all the ongoing projects to be ready. How about we do a 4.0 preview release based on the current master branch next Monday? On Wed, May 1, 2024 at 11:06 PM Tathagata Das wrote: > Hey all, > > Reviving

Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Tathagata Das
Hey all, Reviving this thread, but Spark master has already accumulated a huge amount of changes. As a downstream project maintainer, I want to really start testing the new features and other breaking changes, and it's hard to do that without a Preview release. So the sooner we make a Preview

Re: Potential Impact of Hive Upgrades on Spark Tables

2024-05-01 Thread Mich Talebzadeh
It is important to consider potential impacts on Spark tables stored in the Hive metastore during an "upgrade". Depending on the upgrade path, the Hive metastore schema or SerDes behavior might change, requiring adjustments in the Spark code or configurations. I mentioned the need to test the

[DISCUSS] clarify the definition of behavior changes

2024-04-30 Thread Wenchen Fan
Hi all, It's exciting to see innovations keep happening in the Spark community and Spark keeps evolving itself. To make these innovations available to more users, it's important to help users upgrade to newer Spark versions easily. We've done a good job on it: the PR template requires the author

Re: Potential Impact of Hive Upgrades on Spark Tables

2024-04-30 Thread Wenchen Fan
Yes, Spark has a shim layer to support all Hive versions. It shouldn't be an issue as many users create native Spark data source tables already today, by explicitly putting the `USING` clause in the CREATE TABLE statement. On Wed, May 1, 2024 at 12:56 AM Mich Talebzadeh wrote: > @Wenchen Fan
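
As a concrete sketch of the `USING` clause mentioned above (table and column names are made up), the table provider is what decides whether Spark's native reader/writer or the Hive SerDe path is used:

```sql
-- Spark native data source table: Spark's own Parquet reader/writer.
CREATE TABLE sales_native (id INT, amount DOUBLE) USING parquet;

-- Hive SerDe table: read and written through Hive SerDe classes.
CREATE TABLE sales_serde (id INT, amount DOUBLE) STORED AS PARQUET;
```

With spark.sql.legacy.createHiveTableByDefault set to false, a plain CREATE TABLE without either clause behaves like the first form.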

Potential Impact of Hive Upgrades on Spark Tables

2024-04-30 Thread Mich Talebzadeh
@Wenchen Fan Got your explanation, thanks! My understanding is that even if we create Spark tables using Spark's native data sources, by default, the metadata about these tables will be stored in the Hive metastore. As a consequence, a Hive upgrade can potentially affect Spark tables. For

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Kent Yao
+1 Kent Yao On 2024/04/30 09:07:21 Yuming Wang wrote: > +1 > > On Tue, Apr 30, 2024 at 3:31 PM Ye Xianjin wrote: > > > +1 > > Sent from my iPhone > > > > On Apr 30, 2024, at 3:23 PM, DB Tsai wrote: > > > >  > > +1 > > > > On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote: > > > >  > > To add

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Yuming Wang
+1 On Tue, Apr 30, 2024 at 3:31 PM Ye Xianjin wrote: > +1 > Sent from my iPhone > > On Apr 30, 2024, at 3:23 PM, DB Tsai wrote: > >  > +1 > > On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote: > >  > To add more color: > > Spark data source table and Hive Serde table are both stored in the

[VOTE][RESULT] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Dongjoon Hyun
The vote passes with 11 +1s (6 binding +1s) and one -1. Thanks to all who helped with the vote! (* = binding) +1: - Dongjoon Hyun * - Gengliang Wang * - Liang-Chi Hsieh * - Holden Karau * - Zhou Jiang - Cheng Pan - Hyukjin Kwon * - DB Tsai * - Ye Xianjin - XiDuo You - Nimrod Ofek +0: None -1: -

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Nimrod Ofek
+1 (non-binding) P.S. How do I become a binding voter? Thanks, Nimrod On Tue, Apr 30, 2024 at 10:53 AM Ye Xianjin wrote: > +1 > Sent from my iPhone > > On Apr 30, 2024, at 3:23 PM, DB Tsai wrote: > >  > +1 > > On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote: > >  > To add more color: > > Spark data

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread XiDuo You
+1 Dongjoon Hyun wrote on Saturday, April 27, 2024 at 03:50: > > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault > to `false` by default. The technical scope is defined in the following PR. > > - DISCUSSION: https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd > - JIRA:

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Ye Xianjin
+1 Sent from my iPhone. On Apr 30, 2024, at 3:23 PM, DB Tsai wrote: +1 On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote: To add more color: Spark data source table and Hive Serde table are both stored in the Hive metastore and keep the data files in the table directory. The only difference is they

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread DB Tsai
+1 On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote: To add more color: Spark data source table and Hive Serde table are both stored in the Hive metastore and keep the data files in the table directory. The only difference is they have different "table provider", which means Spark will use different

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Wenchen Fan
To add more color: Spark data source table and Hive Serde table are both stored in the Hive metastore and keep the data files in the table directory. The only difference is they have different "table provider", which means Spark will use different reader/writer. Ideally the Spark native data

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Wenchen Fan
@Mich Talebzadeh there seems to be a misunderstanding here. The Spark native data source table is still stored in the Hive metastore, it's just that Spark will use a different (and faster) reader/writer for it. `hive-site.xml` should work as it is today. On Tue, Apr 30, 2024 at 5:23 AM Hyukjin

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Hyukjin Kwon
Mich, it is a legacy config we should get rid of in the end, and it has been tested in production for a very long time. Spark should create a Spark table by default. On Tue, Apr 30, 2024 at 5:38 AM Mich Talebzadeh wrote: > Your point > > ".. it's a surprise to me to see that someone has different

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Hyukjin Kwon
+1 It's a legacy conf that we should eventually remove. Spark should create a Spark table by default, not a Hive table. Mich, for your workload, you can simply switch that conf off if it concerns you. We also enabled ANSI as well (which you agreed on). It's a bit awkward to stop in the middle

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Dongjoon Hyun
? I'm not sure why you think in that direction. What I wrote was the following. - You voted +1 for SPARK-4 on April 14th (https://lists.apache.org/thread/tp92yzf8y4yjfk6r3dkqjtlb060g82sy) - You voted -1 for SPARK-46122 on April 26th.

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Mich Talebzadeh
Your point ".. t's a surprise to me to see that someone has different positions in a very short period of time in the community" Well, I have been with Spark since 2015 and this is the article in the medium dated February 7, 2016 with regard to both Hive and Spark and also presented in

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-29 Thread Dongjoon Hyun
It's a surprise to me to see that someone has different positions in a very short period of time in the community. Mich cast +1 for SPARK-4 and -1 for SPARK-46122. - https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc -

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-28 Thread Mich Talebzadeh
Hi @Wenchen Fan Thanks for your response. I believe we have not had enough time to "DISCUSS" this matter. Currently in order to make Spark take advantage of Hive, I create a soft link in $SPARK_HOME/conf. FYI, my spark version is 3.4.0 and Hive is 3.1.1 /opt/spark/conf/hive-site.xml ->

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-28 Thread Wenchen Fan
@Mich Talebzadeh thanks for sharing your concern! Note: creating Spark native data source tables is usually Hive compatible as well, unless we use features that Hive does not support (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to create Spark native table in this case,

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-27 Thread Hussein Awala
+1 On Saturday, April 27, 2024, John Zhuge wrote: > +1 > > On Fri, Apr 26, 2024 at 8:41 AM Kent Yao wrote: > >> +1 >> >> yangjie01 wrote on Fri, Apr 26, 2024: >> > >> > +1 >> > >> > >> > >> > From: Ruifeng Zheng >> > Date: Friday, April 26, 2024, 15:05 >> > To: Xinrong Meng >> > Cc: Dongjoon Hyun ,

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-26 Thread John Zhuge
+1 On Fri, Apr 26, 2024 at 8:41 AM Kent Yao wrote: > +1 > > yangjie01 wrote on Fri, Apr 26, 2024: > > > > +1 > > > > > > > > From: Ruifeng Zheng > > Date: Friday, April 26, 2024, 15:05 > > To: Xinrong Meng > > Cc: Dongjoon Hyun , "dev@spark.apache.org" < dev@spark.apache.org> > > Subject: Re: [FYI]

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Cheng Pan
+1 (non-binding) Thanks, Cheng Pan On Sat, Apr 27, 2024 at 9:29 AM Holden Karau wrote: > > +1 > > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > > > On

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Zhou Jiang
+1 (non-binding) On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun wrote: > I'll start with my +1. > > Dongjoon. > > On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > > Please vote on SPARK-46122 to set > spark.sql.legacy.createHiveTableByDefault > > to `false` by default. The technical scope is

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Mich Talebzadeh
-1 for me Do not change spark.sql.legacy.createHiveTableByDefault because: 1. We have not had enough time to "DISCUSS" this matter. The discussion thread was opened almost 24 hours ago. 2. Compatibility: Changing the default behavior could potentially break existing workflows or

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Holden Karau
+1 Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh wrote: > +1 > > On Fri, Apr 26, 2024

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread L. C. Hsieh
+1 On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun wrote: > > I'll start with my +1. > > Dongjoon. > > On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault > > to `false` by default. The technical scope is defined in the

Re: Which version of spark version supports parquet version 2 ?

2024-04-26 Thread Prem Sahoo
Confirmed, closing this. Thanks everyone for the valuable information. Sent from my iPhone > On Apr 25, 2024, at 9:55 AM, Prem Sahoo wrote: > > Hello Spark, > After discussing with the Parquet and Pyarrow communities, we can use the > below config so that Spark can write Parquet V2 files. >
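The config itself is truncated in the archive. One plausible reconstruction, based on Parquet's Hadoop-level `parquet.writer.version` key (an assumption on my part, not confirmed by the thread), would be:

```shell
# Assumed reconstruction: Spark forwards spark.hadoop.* settings to the
# underlying Hadoop conf, where parquet-mr reads parquet.writer.version.
spark-submit --conf spark.hadoop.parquet.writer.version=v2 write_job.py
```

Note that Parquet V2 pages are not readable by all consumers, which is why the thread involved checking with the Parquet and PyArrow communities first.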

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Gengliang Wang
+1 On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun wrote: > I'll start with my +1. > > Dongjoon. > > On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > > Please vote on SPARK-46122 to set > spark.sql.legacy.createHiveTableByDefault > > to `false` by default. The technical scope is defined in the

Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
I'll start with my +1. Dongjoon. On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault > to `false` by default. The technical scope is defined in the following PR. > > - DISCUSSION: >

[VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault to `false` by default. The technical scope is defined in the following PR. - DISCUSSION: https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd - JIRA: https://issues.apache.org/jira/browse/SPARK-46122 - PR:

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-26 Thread Dongjoon Hyun
Thank you, Kent, Wenchen, Mich, Nimrod, Yuming, LiangChi. I'll start a vote. To Mich, for your question, Apache Spark has a long history of converting Hive-provider tables into Spark's datasource tables to handle better in a Spark way. > Can you please elaborate on the above specifically with

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-26 Thread Kent Yao
+1 yangjie01 wrote on Fri, Apr 26, 2024: > > +1 > > > > From: Ruifeng Zheng > Date: Friday, April 26, 2024, 15:05 > To: Xinrong Meng > Cc: Dongjoon Hyun , "dev@spark.apache.org" > > Subject: Re: [FYI] SPARK-47993: Drop Python 3.8 > > > > +1 > > > > On Fri, Apr 26, 2024 at 10:26 AM Xinrong Meng wrote: > > +1 > >

Survey: To Understand the requirements regarding TRAINING & TRAINING CONTENT in your ASF project

2024-04-26 Thread Mirko Kämpf
Hello ASF people, As a member of the ASF Training (Incubating) project, and in preparation for our presentation at the CoC conference in June in Bratislava, we are conducting a survey. The purpose is this: *We want to understand the requirements regarding training materials and procedures in various ASF
