ASF board report draft for February

2024-02-17 Thread Matei Zaharia
Hi all, I missed some reminder emails about our board report this month, but here is my draft. I’ll submit it tomorrow if that’s ok. == Issues for the board: - None Project status: - We made two patch releases: Spark 3.3.4 (EOL release) on December 16, 2023, and Spark 3.4.2 on

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread Sean Owen
Yeah, let's get that fix in, but it seems to be a minor, test-only issue, so it should not block the release. On Fri, Feb 16, 2024, 9:30 AM yangjie01 wrote: > Very sorry. When I was fixing `SPARK-45242 ( > https://github.com/apache/spark/pull/43594)` > , I

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread yangjie01
Very sorry. When I was fixing `SPARK-45242 (https://github.com/apache/spark/pull/43594)`, I noticed that the `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I didn't realize that it had also been merged into branch-3.5, so I didn't advocate for SPARK-45357 to be

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread Jungtaek Lim
I traced back the relevant changes and got a sense of what happened. Yangjie figured out the issue via the link. It's a tricky issue, according to the comments from Yangjie - the test is dependent on the ordering of execution of the test suites.

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-16 Thread Mich Talebzadeh
Hi Chao, As a cool feature - Compared to standard Spark, what kind of performance gains can be expected with Comet? - Can one use Comet on k8s in conjunction with something like a Volcano addon? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-15 Thread Sean Owen
Is anyone seeing this Spark Connect test failure? then again, I have some weird issue with this env that always fails 1 or 2 tests that nobody else can replicate. - Test observe *** FAILED *** == FAIL: Plans do not match === !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-15 Thread Jungtaek Lim
UPDATE: The vote thread is up now. https://lists.apache.org/thread/f28h0brncmkoyv5mtsqtxx38hx309c2j On Tue, Feb 6, 2024 at 11:30 AM Jungtaek Lim wrote: > Thanks all for the positive feedback! Will figure out time to go through > the RC process. Stay tuned! > > On Mon, Feb 5, 2024 at 7:46 AM

Re: Heads-up: Update on Spark 3.5.1 RC

2024-02-15 Thread Jungtaek Lim
UPDATE: Now the vote thread is up for RC2. https://lists.apache.org/thread/f28h0brncmkoyv5mtsqtxx38hx309c2j On Wed, Feb 14, 2024 at 2:59 AM Dongjoon Hyun wrote: > Thank you for the update, Jungtaek. > > Dongjoon. > > On Tue, Feb 13, 2024 at 7:29 AM Jungtaek Lim > wrote: > >> Hi, >> >> Just a

[VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-15 Thread Jungtaek Lim
DISCLAIMER: The RC for Apache Spark 3.5.1 starts with RC2, as I belatedly found a doc generation issue after tagging RC1. Please vote on releasing the following candidate as Apache Spark version 3.5.1. The vote is open until February 18th 9AM (PST) and passes if a majority of +1 PMC votes are cast,

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-15 Thread Mich Talebzadeh
Hi, I gather from the replies that the plugin is not currently available in the form expected, although I am aware of the shell script. Also, have you got some benchmark results from your tests that you can possibly share? Thanks, Mich Talebzadeh

Generating config docs automatically

2024-02-14 Thread Nicholas Chammas
I’m interested in automating our config documentation and need input from a committer who is interested in shepherding this work. We have around 60 tables of configs across our documentation. Here’s a typical example.
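The generation step described above could be sketched roughly as follows. This is illustrative only: the entries and the helper below are hand-written stand-ins, not pulled from Spark's actual config registry.

```python
# Hand-written stand-ins: illustrative entries, not Spark's actual config registry.
configs = [
    ("spark.sql.shuffle.partitions", "200",
     "Default number of partitions when shuffling data for joins or aggregations."),
    ("spark.sql.ansi.enabled", "false",
     "When true, Spark follows ANSI SQL semantics for casts and arithmetic."),
]

def to_markdown_table(rows):
    """Render (name, default, meaning) triples as a markdown config table."""
    lines = ["| Property Name | Default | Meaning |",
             "| --- | --- | --- |"]
    for name, default, meaning in rows:
        lines.append(f"| `{name}` | {default} | {meaning} |")
    return "\n".join(lines)

print(to_markdown_table(configs))
```

The appeal of this kind of generation is that the ~60 hand-maintained tables mentioned in the email would always match the code they document.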

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
Hi Praveen, We will add a "Getting Started" section in the README soon, but basically comet-spark-shell in the repo should provide a basic tool to build Comet and launch a Spark shell with it. Note that we haven't

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Liu(Laswift) Cao
This is very cool! Congrats on the amazing work, Chao and the team! It's exciting to see this native engine trend within the community. Other than Gluten, I ran into https://github.com/blaze-init/blaze as well (but haven't evaluated it in detail) On Wed, Feb 14, 2024 at 09:20 Chao Sun wrote: > >

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
> Out of interest what are the differences in the approach between this and > Gluten? Overall they are similar, although Gluten supports multiple backends including Velox and Clickhouse. One major difference is (obviously) Comet is based on DataFusion and Arrow, and written in Rust, while

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread John Zhuge
Congratulations! Excellent work! On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu wrote: > Absolutely thrilled to see the project going open-source! Huge congrats to > Chao and the entire team on this milestone! > > Yufei > > > On Tue, Feb 13, 2024 at 12:43 PM Chao Sun wrote: > >> Hi all, >> >> We are

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Yufei Gu
Absolutely thrilled to see the project going open-source! Huge congrats to Chao and the entire team on this milestone! Yufei On Tue, Feb 13, 2024 at 12:43 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via

Re: How do you debug a code-generated aggregate?

2024-02-13 Thread Mich Talebzadeh
Sure, thanks for the clarification. I gather what you are alluding to is: in a distributed environment, when one does operations that involve shuffling or repartitioning of data, the order in which this data is processed across partitions is not guaranteed. So when repartitioning a dataframe, the
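The order sensitivity described above can be reproduced with plain Python floats, independent of Spark:

```python
# With IEEE-754 doubles, addition is not associative, so the result of a
# distributed sum can depend on which partition's values are merged first.
values = [1e16, 1.0, -1e16]

left_to_right = (values[0] + values[1]) + values[2]  # 1e16 + 1.0 loses the 1.0
reordered     = (values[0] + values[2]) + values[1]  # cancels first, keeps the 1.0

print(left_to_right)  # 0.0
print(reordered)      # 1.0
```

The same values summed in two different orders give two different answers, which is exactly why a non-deterministic partition order can change a floating-point aggregate.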

Re: How do you debug a code-generated aggregate?

2024-02-13 Thread Jack Goodson
Apologies if it wasn't clear, I was meaning the difficulty of debugging, not floating point precision :) On Wed, Feb 14, 2024 at 2:03 AM Mich Talebzadeh wrote: > Hi Jack, > > " most SQL engines suffer from the same issue... "" > > Sure. This behavior is not a bug, but rather a consequence

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Holden Karau
This looks really cool :) Out of interest, what are the differences in the approach between this and Gluten? On Tue, Feb 13, 2024 at 12:42 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via leveraging DataFusion

Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Chao Sun
Hi all, We are very happy to announce that Project Comet, a plugin to accelerate Spark query execution via leveraging DataFusion and Arrow, has now been open sourced under the Apache Arrow umbrella. Please check the project repo https://github.com/apache/arrow-datafusion-comet for more details if

Re: Heads-up: Update on Spark 3.5.1 RC

2024-02-13 Thread Dongjoon Hyun
Thank you for the update, Jungtaek. Dongjoon. On Tue, Feb 13, 2024 at 7:29 AM Jungtaek Lim wrote: > Hi, > > Just a heads-up since I didn't give an update for a week after the last > update from the discussion thread. > > I've been following the automated release process and encountered several

Re: Extracting Input and Output Partitions in Spark

2024-02-13 Thread Daniel Saha
This would be helpful for a few use cases. For context my team works in security space, and customers access data through a wrapper around spark sql connected to hive metastore. 1. When snapshot (non-partitioned) tables are queried, it’s not clear when the underlying snapshot was last updated.

Heads-up: Update on Spark 3.5.1 RC

2024-02-13 Thread Jungtaek Lim
Hi, Just a heads-up since I didn't give an update for a week after the last update from the discussion thread. I've been following the automated release process and encountered several issues. Maybe I will file JIRA tickets and follow up with PRs. The issues I have figured out so far are 1) python library

Re: How do you debug a code-generated aggregate?

2024-02-13 Thread Mich Talebzadeh
Hi Jack, "most SQL engines suffer from the same issue..." Sure. This behavior is not a bug, but rather a consequence of the limitations of floating-point precision. The numbers involved in the example (see SPIP [SPARK-47024] Sum of floats/doubles may be incorrect depending on

Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Jack Goodson
I may be ignorant of other debugging methods in Spark but the best success I've had is using smaller datasets (if runs take a long time) and adding intermediate output steps. This is quite different from application development in non-distributed systems where a debugger is trivial to attach but I

Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Nicholas Chammas
OK, I figured it out. The details are in SPARK-47024 for anyone who’s interested. It turned out to be a floating point arithmetic “bug”. The main reason I was able to figure it out was because I’ve been investigating another, unrelated bug (a

Re: Extracting Input and Output Partitions in Spark

2024-02-12 Thread Aditya Sohoni
Sharing an example since a few people asked me off-list: We have stored the partition details in the read/write nodes of the physical plan. So this can be accessed via the plan like plan.getInputPartitions or plan.getOutputPartitions, which internally loops through the nodes in the plan and
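The accessor names above (`plan.getInputPartitions` / `plan.getOutputPartitions`) come from the email; a rough, Spark-free Python sketch of the described tree walk, with made-up node classes, might look like:

```python
class PlanNode:
    """Toy stand-in for a physical plan node; 'partitions' is populated only
    on read ("scan") and "write" nodes, as described in the email."""
    def __init__(self, name, partitions=None, children=()):
        self.name = name
        self.partitions = list(partitions or [])
        self.children = list(children)

def collect_partitions(node, kind):
    # Depth-first walk gathering partitions from nodes of the requested kind.
    found = list(node.partitions) if node.name == kind else []
    for child in node.children:
        found.extend(collect_partitions(child, kind))
    return found

scan_a = PlanNode("scan", ["ds=2024-01-29"])
scan_b = PlanNode("scan", ["ds=2024-01-30"])
plan = PlanNode("write", ["ds=2024-01-30"],
                children=[PlanNode("join", children=[scan_a, scan_b])])

print(collect_partitions(plan, "scan"))   # input partitions
print(collect_partitions(plan, "write"))  # output partitions
```

This mirrors the described usage: the caller asks the root of the plan for input or output partitions, and the walk does the looping internally.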

Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Herman van Hovell
There is no really easy way of getting the state of the aggregation buffer, unless you are willing to modify the code generation and sprinkle in some logging. What I would start with is dumping the generated code by calling explain('codegen') on the DataFrame. That helped me to find similar

How do you debug a code-generated aggregate?

2024-02-11 Thread Nicholas Chammas
Consider this example:

>>> from pyspark.sql.functions import sum
>>> spark.range(4).repartition(2).select(sum("id")).show()
+-------+
|sum(id)|
+-------+
|      6|
+-------+

I’m trying to understand how this works because I’m investigating a bug in this kind of aggregate. I see that
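For readers following this thread: the aggregate in question uses partial aggregation, where each partition computes a partial result that is then merged. A tiny Spark-free sketch of that split, using an illustrative two-way partitioning of the same ids:

```python
# spark.range(4) holds ids 0..3; an illustrative split into two partitions.
partitions = [[0, 2], [1, 3]]

# Partial phase: each partition computes its own partial sum.
partials = [sum(p) for p in partitions]

# Final phase: merge the partial sums into the answer shown by show().
total = sum(partials)
print(partials, total)
```

With integers the partial/final split is invisible; with floats, the merge order can change the result, which is the class of bug being investigated here.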

Re: Pyspark Write Batch Streaming Data to Snowflake Fails with more columns

2024-02-10 Thread Varun Shah
Hi Mich, Thanks for the suggestions. I checked the documentation regarding the issue in data types and found that the different timezone settings used in Spark and Snowflake were the issue. Specifying the timezone in the Spark options while writing the data to Snowflake worked. Documentation

Re: Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh
The full code is available from the link below: https://github.com/michTalebzadeh/Event_Driven_Real_Time_data_processor_with_SSS_and_API_integration Mich Talebzadeh

Re: Enhanced Console Sink for Structured Streaming

2024-02-09 Thread Neil Ramaswamy
Thanks for the comments, Anish and Jerry. To summarize so far, we are in agreement that: 1. Enhanced console sink is a good tool for new users to understand Structured Streaming semantics 2. It should be opt-in via an option (unlike my original proposal) 3. Out of the 2 modes of verbosity I

Re: Pyspark Write Batch Streaming Data to Snowflake Fails with more columns

2024-02-09 Thread Mich Talebzadeh
Hi Varun, I am no expert on Snowflake, however, the issue you are facing, particularly if it involves data trimming in a COPY statement and potential data mismatch, is likely related to how Snowflake handles data ingestion rather than being directly tied to PySpark. The COPY command in Snowflake

Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh
Appreciate your thoughts on this. Personally, I think Spark Structured Streaming can be used effectively in an Event-Driven Architecture (as well as for continuous streaming). From the link here

Pyspark Write Batch Streaming Data to Snowflake Fails with more columns

2024-02-09 Thread Varun Shah
Hi Team, We currently have implemented pyspark spark-streaming application on databricks, where we read data from s3 and write to the snowflake table using snowflake connector jars (net.snowflake:snowflake-jdbc v3.14.5 and net.snowflake:spark-snowflake v2.12:2.14.0-spark_3.3) . Currently facing

Re: Enhanced Console Sink for Structured Streaming

2024-02-08 Thread Jerry Peng
I am generally a +1 on this, as we can use this information in our docs to demonstrate certain concepts to potential users. I am in agreement with the other reviewers that we should keep the existing default behavior of the console sink. This new style of output should be enabled behind a flag. As

Re: Enhanced Console Sink for Structured Streaming

2024-02-08 Thread Anish Shrigondekar
Hi Neil, Thanks for putting this together. +1 to the proposal of enhancing the console sink further. I think it will help new users understand some of the streaming/micro-batch semantics a bit better in Spark. Agree with not having verbose mode enabled by default. I think step 1 described above

Re: Shuffle write and read phase optimizations for parquet+zstd write

2024-02-08 Thread Mich Talebzadeh
Hi, ... "Most of our jobs end up with a shuffle stage based on a partition column value before writing into a parquet, and most of the time we have data skewness in partitions." Have you considered the causes of these recurring issues and some potential alternative strategies? 1. -

Shuffle write and read phase optimizations for parquet+zstd write

2024-02-07 Thread satyajit vegesna
Hi Community, Can someone please help validate the idea below and suggest pros/cons? Most of our jobs end up with a shuffle stage based on a partition column value before writing into parquet, and most of the time we have data skewness in partitions. Currently most of the problems happen at
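One common mitigation for the skew described above is key salting. A minimal, Spark-free sketch (partition count and salt range are chosen arbitrarily for illustration):

```python
import random
from collections import Counter

def bucket(key, num_partitions):
    return hash(key) % num_partitions

def salted_bucket(key, num_partitions, salt_range, rng):
    # Appending a random salt spreads one hot key across many partitions;
    # the aggregation then needs a second pass to merge the salted groups.
    return hash((key, rng.randrange(salt_range))) % num_partitions

rng = random.Random(42)
keys = ["hot"] * 1000 + ["a", "b", "c"] * 10   # heavily skewed toward "hot"

plain = Counter(bucket(k, 8) for k in keys)
salted = Counter(salted_bucket(k, 8, 64, rng) for k in keys)

print("largest partition without salting:", max(plain.values()))
print("largest partition with salting:   ", max(salted.values()))
```

Without salting, every "hot" row lands in one partition; with salting, the hot key is spread out, at the cost of an extra merge step after the salted aggregation.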

Re: Enhanced Console Sink for Structured Streaming

2024-02-06 Thread Neil Ramaswamy
Jungtaek and Raghu, thanks for the input. I'm happy with the verbose mode being off by default. I think it's reasonable to have 1 or 2 levels of verbosity: 1. The first verbose mode could target new users, and take a highly opinionated view on what's important to understand streaming

Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Mich Talebzadeh
I don't think adding this to the streaming flow (at the micro level) will be that useful. However, this can be added to the Spark UI as an enhancement to the Streaming Query Statistics page. HTH Mich Talebzadeh

Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Raghu Angadi
Agree, the default behavior does not need to change. Neil, how about separating it into two sections: - Actual rows in the sink (same as current output) - Followed by metadata data

Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Jungtaek Lim
Maybe we could keep the default as it is, and explicitly turn on verboseMode to enable auxiliary information. I'm not a believer that anyone will parse the output of console sink (which means this could be a breaking change), but changing the default behavior should be taken conservatively. We can

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-05 Thread Jungtaek Lim
Thanks all for the positive feedback! Will figure out time to go through the RC process. Stay tuned! On Mon, Feb 5, 2024 at 7:46 AM Gengliang Wang wrote: > +1 > > On Sun, Feb 4, 2024 at 1:57 PM Hussein Awala wrote: > >> +1 >> >> On Sun, Feb 4, 2024 at 10:13 PM John Zhuge wrote: >> >>> +1 >>>

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-02-05 Thread kalyan
Hey, Disk space not enough is also a reliability concern, but might need a diff strategy to handle it. As suggested by Mridul, I am working on making things more configurable in another(new) module… with that, we can plug in new rules for each type of error. Regards Kalyan. On Mon, 5 Feb 2024 at

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-02-04 Thread Jay Han
Hi, what about supporting solving the disk space problem of "device space isn't enough"? I think it's the same as an OOM exception. kalyan wrote on Sat, Jan 27, 2024 at 13:00: > Hi all, > > Sorry for the delay in getting the first draft of (my first) SPIP out. > >

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Gengliang Wang
+1 On Sun, Feb 4, 2024 at 1:57 PM Hussein Awala wrote: > +1 > > On Sun, Feb 4, 2024 at 10:13 PM John Zhuge wrote: > >> +1 >> >> John Zhuge >> >> >> On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale >> wrote: >> >>> +1 >>> >>> On Sun, Feb 4, 2024, 8:18 PM Xiao Li >>> wrote: >>> +1

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Hussein Awala
+1 On Sun, Feb 4, 2024 at 10:13 PM John Zhuge wrote: > +1 > > John Zhuge > > > On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale > wrote: > >> +1 >> >> On Sun, Feb 4, 2024, 8:18 PM Xiao Li >> wrote: >> >>> +1 >>> >>> On Sun, Feb 4, 2024 at 6:07 AM beliefer wrote: >>> +1

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread John Zhuge
+1 John Zhuge On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale wrote: > +1 > > On Sun, Feb 4, 2024, 8:18 PM Xiao Li > wrote: > >> +1 >> >> On Sun, Feb 4, 2024 at 6:07 AM beliefer wrote: >> >>> +1 >>> >>> >>> >>> 在 2024-02-04 15:26:13,"Dongjoon Hyun" 写道: >>> >>> +1 >>> >>> On Sat, Feb 3,

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Santosh Pingale
+1 On Sun, Feb 4, 2024, 8:18 PM Xiao Li wrote: > +1 > > On Sun, Feb 4, 2024 at 6:07 AM beliefer wrote: > >> +1 >> >> >> >> On 2024-02-04 15:26:13, "Dongjoon Hyun" wrote: >> >> +1 >> >> On Sat, Feb 3, 2024 at 9:18 PM yangjie01 >> wrote: >> >>> +1 >>> >>> On 2024/2/4 13:13, "Kent

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread Xiao Li
+1 On Sun, Feb 4, 2024 at 6:07 AM beliefer wrote: > +1 > > > > On 2024-02-04 15:26:13, "Dongjoon Hyun" wrote: > > +1 > > On Sat, Feb 3, 2024 at 9:18 PM yangjie01 > wrote: > >> +1 >> >> On 2024/2/4 13:13, "Kent Yao" wrote: >> >> >> +1 >> >> >> Jungtaek Lim >

Re:Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread beliefer
+1 On 2024-02-04 15:26:13, "Dongjoon Hyun" wrote: +1 On Sat, Feb 3, 2024 at 9:18 PM yangjie01 wrote: +1 On 2024/2/4 13:13, "Kent Yao" wrote: +1 Jungtaek Lim wrote on Sat, Feb 3, 2024 at 21:14: > > Hi dev, > > looks like there are a huge

Re: [DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread Dongjoon Hyun
+1 On Sat, Feb 3, 2024 at 9:18 PM yangjie01 wrote: > +1 > > On 2024/2/4 13:13, "Kent Yao" wrote: > > > +1 > > > Jungtaek Lim wrote on Sat, Feb 3, 2024 at 21:14: > > > > Hi dev, > > > > looks like there are a huge number of commits being pushed to branch-3.5

Re: [DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread yangjie01
+1 On 2024/2/4 13:13, "Kent Yao" wrote: +1 Jungtaek Lim wrote on Sat, Feb 3, 2024 at 21:14: > > Hi dev, > > looks like there are a huge number of commits being pushed to branch-3.5 > after 3.5.0 was released, 200+ commits. > > $ git log --oneline

Re: [DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread Kent Yao
+1 Jungtaek Lim wrote on Sat, Feb 3, 2024 at 21:14: > > Hi dev, > > looks like there are a huge number of commits being pushed to branch-3.5 > after 3.5.0 was released, 200+ commits. > > $ git log --oneline v3.5.0..HEAD | wc -l > 202 > > Also, there are 180 JIRA tickets containing 3.5.1 as fixed version, and

Re: Enhanced Console Sink for Structured Streaming

2024-02-03 Thread Neil Ramaswamy
Re: verbosity: yes, it will be more verbose. A config I was planning to implement was a default-on console sink option, verboseMode, that you can set to be off if you just want sink data. I don't think that introduces additional complexity, as the last point suggests. (And also, nobody should be

[DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread Jungtaek Lim
Hi dev, looks like there are a huge number of commits being pushed to branch-3.5 after 3.5.0 was released, 200+ commits. $ git log --oneline v3.5.0..HEAD | wc -l 202 Also, there are 180 JIRA tickets containing 3.5.1 as fixed version, and 10 resolved issues are either marked as blocker (even

Re: Enhanced Console Sink for Structured Streaming

2024-02-03 Thread Mich Talebzadeh
Hi, As I understood, the proposal you mentioned suggests adding event-time and state store metadata to the console sink to better highlight the semantics of the Structured Streaming engine. While I agree this enhancement can provide valuable insights into the engine's behavior especially for

Community over Code EU 2024 Travel Assistance Applications now open!

2024-02-03 Thread Gavin McDonald
Hello to all users, contributors and Committers! The Travel Assistance Committee (TAC) is pleased to announce that travel assistance applications for Community over Code EU 2024 are now open! We will be supporting Community over Code EU in Bratislava, Slovakia, June 3rd - 5th, 2024. TAC exists

Enhanced Console Sink for Structured Streaming

2024-02-02 Thread Neil Ramaswamy
Hi all, I'd like to propose the idea of enhancing Structured Streaming's console sink to print event-time metrics and state store data, in addition to the sink's rows. I've noticed beginners often struggle to understand how watermarks, operator state, and output rows are all intertwined. By
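As a purely illustrative mock-up (the actual format and option names were still under discussion in this thread), the enhanced output might interleave the sink's rows with event-time and state metadata:

```python
def render_batch(batch_id, rows, watermark, state_rows):
    """Mock-up only: the layout and field names here are invented to
    illustrate the proposal, not an agreed-upon design."""
    lines = ["-" * 43,
             f"Batch: {batch_id}",
             "-" * 43]
    lines += [str(r) for r in rows]                  # sink rows (current behavior)
    lines += [f"Event-time watermark: {watermark}",  # proposed auxiliary info
              f"State rows: {state_rows}"]
    return "\n".join(lines)

print(render_batch(0, [("a", 2), ("b", 1)], "2024-02-02 10:00:00", 2))
```

The point of such output is that a beginner can see, batch by batch, how the watermark and operator state relate to the rows that were emitted.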

Re: Spark 3.5.1

2024-01-31 Thread Jungtaek Lim
Hi, I agree it's time to release 3.5.1. The 10 resolved issues are either marked as blocker (even correctness issues) or critical, which justifies the release. I had been trying to find the time to take a step, but had no luck with it. I'll give it another try this week (it needs some time as I'm

Extracting Input and Output Partitions in Spark

2024-01-30 Thread Aditya Sohoni
Hello Spark Devs! We are from Uber's Spark team. Our ETL jobs use Spark to read and write from Hive datasets stored in HDFS. The freshness of the partition written to depends on the freshness of the data in the input partition(s). We monitor this freshness score, so that partitions in our

Spark 3.5.1

2024-01-30 Thread Santosh Pingale
Hey there Spark 3.5 branch has accumulated 199 commits with quite a few bug fixes related to correctness. Are there any plans for releasing 3.5.1? Kind regards Santosh

Re: [QUESTION] Legal dependency with Oracle JDBC driver

2024-01-30 Thread Mich Talebzadeh
Hi Alex, Well, that is just Justin's opinion vis-à-vis this matter. It is different from mine. Bottom line: you can always refer to Oracle or a copyright expert on this matter and see what they suggest. HTH Mich Talebzadeh

Re: [QUESTION] Legal dependency with Oracle JDBC driver

2024-01-29 Thread Alex Porcelli
Hi Mich, Thank you for the prompt response. It looks like Justin Mclean has a slightly different perspective on Oracle's license, as you can see in [3]. On Mon, Jan 29, 2024 at 4:17 PM Mich Talebzadeh wrote: > Hi, > > This is not an official response and should not be taken as an > official

Re: [QUESTION] Legal dependency with Oracle JDBC driver

2024-01-29 Thread Mich Talebzadeh
Hi, This is not an official response and should not be taken as an official view; it is my own opinion. Looking at reference [1], I can see a host of inclusions of other JDBC vendors' drivers, such as IBM DB2 and MSSQL. With regard to link [2], it is already closed (3+ years old) and it is assumed

[QUESTION] Legal dependency with Oracle JDBC driver

2024-01-29 Thread Alex Porcelli
Hi Spark Devs, I'm reaching out to understand how you managed to include the Oracle JDBC driver as one of your dependencies [1]. According to legal tickets [2][3], this is considered a Category X dependency and is not allowed. (I'm part of the Apache KIE podling, and we are struggling with such a

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-26 Thread kalyan
Hi all, Sorry for the delay in getting the first draft of (my first) SPIP out. https://docs.google.com/document/d/1hxEPUirf3eYwNfMOmUHpuI5dIt_HJErCdo7_yr9htQc/edit?pli=1 Let me know what you think. Regards kalyan. On Sat, Jan 20, 2024 at 8:19 AM Ashish Singh wrote: > Hey all, > > Thanks for

Re: [EXTERNAL] Re: Spark Kafka Rack Aware Consumer

2024-01-26 Thread Raghu Angadi
Overall the proposal to make this an option for Kafka source SGTM. You can address the doc review and can send PR (in parallel or after the review). Note that currently executors cache client connection to Kafka and reuse the connection and buffered records for next micro-batch. Your proposal

Re: [EXTERNAL] Re: Spark Kafka Rack Aware Consumer

2024-01-26 Thread Schwager, Randall
Granted. Thanks for bearing with me. I’ve also opened up permissions to allow anyone with the link to edit the document. Thank you!

Re: [EXTERNAL] Re: Spark Kafka Rack Aware Consumer

2024-01-26 Thread Mich Talebzadeh
Ok, I made a request to access this document. Thanks, Mich Talebzadeh

Re: [EXTERNAL] Re: Spark Kafka Rack Aware Consumer

2024-01-26 Thread Schwager, Randall
Hi Mich, Thanks for responding. In the JIRA issue, the design doc you’re referring to describes the prior work. This is the design doc for the proposed change: https://docs.google.com/document/d/1RoEk_mt8AUh9sTQZ1NfzIuuYKf1zx6BP1K3IlJ2b8iM/edit#heading=h.pbt6pdb2jt5c I’ll re-word the

Re: Spark Kafka Rack Aware Consumer

2024-01-26 Thread Mich Talebzadeh
Your design doc, "Structured Streaming Kafka Source - Design Doc" (Google Docs), seems to have been around since 2016. Reading the comments, it was decided not to progress with it. What has changed

Re: Spark Kafka Rack Aware Consumer

2024-01-25 Thread Schwager, Randall
Bump. Am I asking these questions in the wrong place? Or should I forego design input and just write the PR? > Hello Spark Devs! After doing some detective

Re: Spark Kafka Rack Aware Consumer

2024-01-22 Thread Schwager, Randall
Hello Spark Devs! After doing some detective work, I’d like to revisit this idea in earnest. My understanding now is that setting `client.rack` dynamically on the executor will do nothing. This is because the driver assigns Kafka partitions to executors. I’ve summarized a design to enable rack
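Since the driver assigns partitions, rack awareness would have to happen at assignment time. A toy sketch of rack-preferring assignment (the names and fallback policy here are invented for illustration, not the proposed design):

```python
def assign_partitions(partition_racks, executor_racks):
    """Toy driver-side assignment: prefer an executor in the same rack as the
    partition's leader; otherwise fall back to round-robin."""
    assignment = {}
    fallback = 0
    executors = list(executor_racks)
    for partition, rack in partition_racks.items():
        same_rack = [e for e, r in executor_racks.items() if r == rack]
        if same_rack:
            # Spread same-rack partitions across the matching executors.
            assignment[partition] = same_rack[partition % len(same_rack)]
        else:
            # No executor in the leader's rack: fall back to round-robin.
            assignment[partition] = executors[fallback % len(executors)]
            fallback += 1
    return assignment

partition_racks = {0: "rack-a", 1: "rack-b", 2: "rack-c"}
executor_racks = {"exec-1": "rack-a", "exec-2": "rack-b"}
print(assign_partitions(partition_racks, executor_racks))
```

Rack-matched consumption lets Kafka's follower fetching (KIP-392) keep reads within a rack, which is the cost motivation behind this thread.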

Re: Removing Kinesis in Spark 4

2024-01-20 Thread Nicholas Chammas
Oh, that’s a very interesting dashboard. I was familiar with the Matomo snippet but never looked up where exactly those metrics were going. I see that the Kinesis docs do indeed have around 650 views in the past month, but for Kafka I see 11K and 1.3K views for the Structured Streaming and

Re: Removing Kinesis in Spark 4

2024-01-20 Thread Sean Owen
I'm not aware of much usage, but that doesn't mean a lot. FWIW, in the past month or so, the Kinesis docs page got about 700 views, compared to about 1400 for Kafka

Removing Kinesis in Spark 4

2024-01-20 Thread Nicholas Chammas
From the dev thread: What else could be removed in Spark 4? > On Aug 17, 2023, at 1:44 AM, Yang Jie wrote: > > I would like to know how we should handle the two Kinesis-related modules in > Spark 4.0. They have a very low

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-20 Thread Pavan Kotikalapudi
Here is the link to the voting thread https://lists.apache.org/thread/rlwqrw6ddxdkbvkp78kpd0zgvglgbbp8. Thank you, Pavan On Wed, Jan 17, 2024 at 7:15 PM Pavan Kotikalapudi wrote: > Thanks for the +1, I will propose voting in a new thread now. > > - Pavan > > On Wed, Jan 17, 2024 at 5:28 PM

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-19 Thread Ashish Singh
Hey all, Thanks for this discussion, the timing of this couldn't be better! At Pinterest, we recently started to look into reducing OOM failures while also reducing memory consumption of spark applications. We considered the following options. 1. Changing core count on executor to change memory

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-19 Thread Mich Talebzadeh
Everyone's vote matters, whether they are PMC or not. There is no monopoly here. HTH Mich Talebzadeh

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-19 Thread Pavan Kotikalapudi
+1, if my vote counts. Do only Spark PMC votes count? Thanks, Pavan On Thu, Jan 18, 2024 at 3:19 AM Adam Hobbs wrote: > +1 > -- > *From:* Pavan Kotikalapudi > *Sent:* Thursday, January 18, 2024 4:19:32 AM > *To:* Spark dev list > *Subject:* Re: Vote on Dynamic

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-17 Thread Adam Hobbs
+1 From: Pavan Kotikalapudi Sent: Thursday, January 18, 2024 4:19:32 AM To: Spark dev list Subject: Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-17 Thread Pavan Kotikalapudi
Thanks for proposing and voting for the feature Mich. adding some references to the thread. - Jira ticket - SPARK-24815 - Design Doc

Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-17 Thread Mich Talebzadeh
+1 for me (non binding) *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-17 Thread Mridul Muralidharan
Hi, We are internally exploring adding support for dynamically changing the resource profile of a stage based on runtime characteristics. This includes failures due to OOM and the like, slowness due to excessive GC, resource wastage due to excessive overprovisioning, etc. Essentially handles

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-17 Thread Tom Graves
It is interesting. I think there are definitely some discussion points around this. Reliability vs performance is always a trade-off, and it's great that it doesn't fail, but if it doesn't meet someone's SLA now, that could be as bad if it's hard to figure out why. I think if something like this

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-17 Thread Pavan Kotikalapudi
Thanks for the +1, I will propose voting in a new thread now. - Pavan On Wed, Jan 17, 2024 at 5:28 PM Mich Talebzadeh wrote: > I think we have discussed this enough and I consider it a useful > feature. I propose a vote on it. > > +1 for me > > Mich Talebzadeh, > Dad | Technologist |

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-17 Thread Mich Talebzadeh
I think we have discussed this enough and I consider it a useful feature. I propose a vote on it. +1 for me Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-16 Thread Holden Karau
Oh, interesting solution. A co-worker was suggesting something similar using resource profiles to increase memory, but your approach avoids a lot of complexity; I like it (and we could extend it to support resource profile growth too). I think an SPIP sounds like a great next step. On Tue,

[Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-16 Thread kalyan
Hello All, At Uber, we had recently done some work on improving the reliability of Spark applications in scenarios where fatter executors go out of memory, leading to application failure. Fatter executors are those that have more than one task running on them concurrently at a given time. This

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-16 Thread Adam Hobbs
Hi, This is my first time using the dev mailing list, so I hope this is the correct way to do it. I would like to lend my support to this proposal and offer my experiences as a consumer of Spark, and specifically Spark Structured Streaming (SSS). I am more of a cloud infrastructure devops
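For readers new to the thread: SPARK-24815 is about making Spark's existing, batch-oriented dynamic allocation behave sensibly for long-running streaming queries. The knobs involved today look roughly like this (a sketch; the executor counts and timeout are placeholder values):

```scala
import org.apache.spark.sql.SparkSession

// Today's batch-oriented dynamic allocation settings. SPARK-24815
// proposes making the scale-up/scale-down decisions aware of
// micro-batch processing times rather than generic task backlog.
val spark = SparkSession.builder()
  .appName("dra-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")   // placeholder
  .config("spark.dynamicAllocation.maxExecutors", "20")  // placeholder
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()
```

With a streaming query, the backlog-based heuristic these settings drive tends to pin executor counts at the maximum, which is the behaviour the proposal aims to fix.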

[VOTE][RESULT] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-11 Thread Jungtaek Lim
The vote passes with 12 +1s (3 binding +1s). Thanks to all who reviewed the SPIP doc and voted! (* = binding) +1: - Jungtaek Lim - Anish Shrigondekar - Mich Talebzadeh - Raghu Angadi - 刘唯 - Shixiong Zhu (*) - Bartosz Konieczny - Praveen Gattu - Burak Yavuz - Bhuwan Sahni - L. C. Hsieh (*) -

Re: [VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-11 Thread Jungtaek Lim
Thanks all for participating! The vote passed. I'll send out the result to a separate thread. On Thu, Jan 11, 2024 at 10:37 PM Wenchen Fan wrote: > +1 > > On Thu, Jan 11, 2024 at 9:32 AM L. C. Hsieh wrote: > >> +1 >> >> On Wed, Jan 10, 2024 at 9:06 AM Bhuwan Sahni >> wrote: >> >>> +1. This is

Re: [VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-10 Thread Wenchen Fan
+1 On Thu, Jan 11, 2024 at 9:32 AM L. C. Hsieh wrote: > +1 > > On Wed, Jan 10, 2024 at 9:06 AM Bhuwan Sahni > wrote: > >> +1. This is a good addition. >> >> >> *Bhuwan Sahni* >> Staff Software Engineer >> >> bhuwan.sa...@databricks.com >> 500 108th Ave. NE >>

Spark Kafka Rack Aware Consumer

2024-01-10 Thread Schwager, Randall
Hello Spark Devs! Has there been discussion around adding the ability to dynamically set the ‘client.rack’ Kafka parameter at the executor? The Kafka SQL connector code on master doesn’t seem to support this feature. One can easily set the ‘client.rack’ parameter at the driver, but that just
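As background for the question: the Kafka source forwards any `kafka.`-prefixed option straight into the underlying consumer config, which is why a driver-set, static rack value works but a per-executor value does not. A sketch of the static form (`spark` is an existing SparkSession; the broker address, rack id, and topic are placeholders):

```scala
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder
  // Static rack hint: every consumer on every executor gets the same
  // value -- there is currently no per-executor hook to vary it.
  .option("kafka.client.rack", "us-east-1a")          // placeholder
  .option("subscribe", "events")                      // placeholder
  .load()
```

The feature being asked about would need the connector to resolve `client.rack` on each executor (e.g. from its availability zone) when the consumer is created, rather than inheriting one value from the driver.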

Install Ruby 3 to build the docs

2024-01-10 Thread Nicholas Chammas
Just a quick heads up that, while Ruby 2.7 will continue to work, you should plan to install Ruby 3 in the near future in order to build the docs. (I recommend using rbenv to manage multiple Ruby versions.) Ruby 2 reached EOL in March 2023

Re: [VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-10 Thread L. C. Hsieh
+1 On Wed, Jan 10, 2024 at 9:06 AM Bhuwan Sahni wrote: > +1. This is a good addition. > > > *Bhuwan Sahni* > Staff Software Engineer > > bhuwan.sa...@databricks.com > 500 108th Ave. NE > Bellevue, WA 98004 > USA > > > On Wed, Jan 10, 2024 at 9:00 AM Burak Yavuz
