Spark 3.5.x on Java 21?

2024-05-08 Thread Stephen Coy
Hi everyone, We’re about to upgrade our Spark clusters from Java 11 and Spark 3.2.1 to Spark 3.5.1. I know that 3.5.1 is supposed to be fine on Java 17, but will it run OK on Java 21? Thanks, Steve C

Re: [Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-08 Thread Mich Talebzadeh
You may consider increasing the watermark retention: a longer watermark retention duration keeps records for a longer period before dropping them. However, this might increase processing latency and violate at-least-once semantics if the watermark lags behind real time.
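
A minimal sketch of widening the watermark in PySpark Structured Streaming, assuming a source with an event-time column; the source, column and window names below are illustrative, not taken from the thread.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("watermark-retention-sketch").getOrCreate()

# Stand-in source; replace with your Kafka reader and real event-time column.
events = (spark.readStream
          .format("rate")
          .load()
          .withColumnRenamed("timestamp", "event_time"))

# Keep state for up to 2 hours instead of, say, 10 minutes, so late records
# are retained longer before the watermark drops them.
agg = (events
       .withWatermark("event_time", "2 hours")
       .groupBy(F.window("event_time", "10 minutes"))
       .count())

query = agg.writeStream.outputMode("update").format("console").start()
```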

Spark not creating staging dir for insertInto partitioned table

2024-05-07 Thread Sanskar Modi
Hi Folks, I wanted to check why Spark doesn't create a staging dir while doing an insertInto on partitioned tables. I'm running the example code below – ``` spark.sql("set hive.exec.dynamic.partition.mode=nonstrict") val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3))) val df =
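
The snippet above is Scala and truncated; here is a hedged PySpark reconstruction of the same scenario (database, table and column names are hypothetical), assuming a Hive-backed partitioned table already exists.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("insertinto-partitioned-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

# Columns a, b plus a trailing partition column; insertInto resolves columns
# by position, so the partition column must come last.
df = spark.createDataFrame([(1, 5, 1), (2, 1, 2), (4, 4, 3)], ["a", "b", "part"])
df.write.insertInto("test_db.partitioned_table", overwrite=False)
```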

[Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-07 Thread Nandha Kumar
Hi Team, We are trying to use *spark structured streaming* for our use case. We will be joining 2 streaming sources (from Kafka topics) with watermarks. As time progresses, the records that are prior to the watermark timestamp are removed from the state. For our use case, we want to *store

unsubscribe

2024-05-07 Thread Wojciech Bombik
unsubscribe

unsubscribe

2024-05-06 Thread Moise
unsubscribe

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Mich Talebzadeh
Hi Karthick, Unfortunately materialised views are not available in Spark as yet. I raised Jira [SPARK-48117] Spark Materialized Views: Improve Query Performance and Data Management - ASF JIRA (apache.org) as a feature request. Let me think

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Karthick Nk
Thanks Mich, can you please confirm whether my understanding is correct? First, we have to create the materialized view based on the mapping details we have, using multiple tables as sources (since we have multiple join conditions from different tables). From the materialised view we can stream the

unsubscribe

2024-05-04 Thread chen...@birdiexx.com
unsubscribe

unsubscribe

2024-05-03 Thread Bing
unsubscribe

Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Mich Talebzadeh
Hi, I have raised a ticket SPARK-48117 for enhancing Spark capabilities with Materialised Views (MV). Currently both Hive and Databricks support this. I have added these potential benefits to the ticket: * Improved Query Performance

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Sadly Apache Spark sounds like it has nothing to do with materialised views. I was hoping it could read it! >>> *spark.sql("SELECT * FROM test.mv ").show()* Traceback (most recent call last): File "", line 1, in File "/opt/spark/python/pyspark/sql/session.py", line 1440, in

Help needed optimize spark history server performance

2024-05-03 Thread Vikas Tharyani
Dear Spark Community, I'm writing to seek your expertise in optimizing the performance of our Spark History Server (SHS) deployed on Amazon EKS. We're encountering timeouts (HTTP 504) when loading large event logs exceeding 5 GB. *Our Setup:* - Deployment: SHS on EKS with Nginx ingress (idle

Re: ********Spark streaming issue to Elastic data**********

2024-05-03 Thread Mich Talebzadeh
My recommendation: using materialized views (MVs) created in Hive with Spark Structured Streaming and Change Data Capture (CDC) is a good combination for efficiently streaming view data updates in your scenario. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI |

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Thanks for the comments I received. So in summary, Apache Spark itself doesn't directly manage materialized views (MVs), but it can work with them through integration with the underlying data storage systems like Hive or through Iceberg. I believe Databricks, through Unity Catalog, supports MVs as

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
(removing dev@ as I don't think this is dev@ related thread but more about "question") My understanding is that Apache Spark does not support Materialized View. That's all. IMHO it's not a proper expectation that all operations in Apache Hive will be supported in Apache Spark. They are different

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess you must have created the view from Hive and are trying to drop it from Spark, and that is why you are running into the issue with DROP first. There is

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
An issue I encountered while working with Materialized Views in Spark SQL. It appears that there is an inconsistency between the behavior of Materialized Views in Spark SQL and Hive. When attempting to execute a statement like DROP MATERIALIZED VIEW IF EXISTS test.mv in Spark SQL, I encountered a

********Spark streaming issue to Elastic data**********

2024-05-02 Thread Karthick Nk
Hi All, Requirements: I am working on a data flow which will use a view definition (already defined in the schema); there are multiple tables used in the view definition. Here we want to stream the view data into an Elastic index based on whether any of the tables (used in the view

unsubscribe

2024-05-01 Thread Nebi Aydin
unsubscribe

unsubscribe

2024-05-01 Thread Atakala Selam
unsubscribe

unsubscribe

2024-05-01 Thread Yoel Benharrous
unsubscribe

Traceback is missing content in pyspark when invoked with UDF

2024-05-01 Thread Indivar Mishra
Hi *Tl;Dr:* I have a scenario where I generate a code string on the fly and execute that code. If an error occurs I need the traceback, but for the executed code I only get a partial traceback, i.e. the line which caused the error is missing. Consider the MRC below: def fun(): from pyspark.sql
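
The MRC above is truncated; the following is a hedged reconstruction of the symptom (names and the failing expression are made up): frames compiled from a string have no backing source file, so the offending line cannot be shown in the traceback.

```python
import traceback

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.appName("udf-traceback-sketch").getOrCreate()

# Code generated on the fly and compiled from a string.
code = """
def fun(x):
    return 1 / 0   # deliberately fails inside the generated code
"""
namespace = {}
exec(compile(code, "<generated>", "exec"), namespace)

try:
    spark.range(1).select(udf(namespace["fun"])("id")).collect()
except Exception:
    # The frame for "<generated>" carries no source text, which is why the
    # line that caused the error is missing from the printed traceback.
    traceback.print_exc()
```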

Re: [spark-graphframes]: Generating incorrect edges

2024-05-01 Thread Mich Talebzadeh
Hi Steve, Thanks for your statement. I tend to use uuid myself to avoid collisions. This built-in function generates random IDs that are highly likely to be unique across systems. My concerns are on edge so to speak. If the Spark application runs for a very long time or encounters restarts, the

Re: [spark-graphframes]: Generating incorrect edges

2024-04-30 Thread Stephen Coy
Hi Mich, I was just reading random questions on the user list when I noticed that you said: On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh wrote: 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can

unsubscribe

2024-04-30 Thread Wood Super
unsubscribe

unsubscribe

2024-04-30 Thread junhua . xie
unsubscribe

unsubscribe

2024-04-30 Thread Yoel Benharrous

Re: spark.sql.shuffle.partitions=auto

2024-04-30 Thread Mich Talebzadeh
spark.sql.shuffle.partitions=auto - because Apache Spark itself does not manage clusters, this configuration option is specific to Databricks and their managed Spark offering. It allows Databricks to automatically determine an optimal number of shuffle partitions for your workload. HTH Mich Talebzadeh,
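
On open-source Spark, `auto` is not accepted for this setting; a rough analogue (a sketch, not the Databricks feature) is to let Adaptive Query Execution coalesce shuffle partitions at runtime:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("aqe-shuffle-partitions-sketch")
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         # Upper bound; AQE coalesces down from this at runtime based on stats.
         .config("spark.sql.shuffle.partitions", "400")
         .getOrCreate())
```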

Re: Spark on Kubernetes

2024-04-30 Thread Mich Talebzadeh
Hi, In k8s the driver is responsible for executor creation. The likely cause of your problem is insufficient memory allocated for executors in the K8s cluster. Even with dynamic allocation, k8s won't schedule executor pods if there is not enough free memory to fulfill their resource requests.
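
A hedged sketch of the executor sizing knobs being discussed (the values are illustrative and would need tuning to the node capacity of the EKS cluster):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("k8s-executor-sizing-sketch")
         .config("spark.executor.instances", "2")
         .config("spark.executor.memory", "1g")
         # Off-heap overhead also counts toward the pod's memory request.
         .config("spark.executor.memoryOverhead", "512m")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .getOrCreate())
```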

spark.sql.shuffle.partitions=auto

2024-04-30 Thread second_co...@yahoo.com.INVALID
May I know if spark.sql.shuffle.partitions=auto is only available on Databricks? What about on vanilla Spark? When I set this, it gives an error saying it needs an int. Is there any open source library that automatically finds the best partition and block size for a dataframe?

Spark on Kubernetes

2024-04-29 Thread Tarun raghav
Respected Sir/Madam, I am Tarunraghav. I have a query regarding spark on kubernetes. We have an eks cluster, within which we have spark installed in the pods. We set the executor memory as 1GB and set the executor instances as 2, I have also set dynamic allocation as true. So when I try to read a

Re: Python for the kids and now PySpark

2024-04-28 Thread Meena Rajani
Mich, you are right, these days the attention span is getting shorter. Christian could work on a completely new thing for 3 hours and is proud to explain it. It is amazing. Thanks for sharing. On Sat, Apr 27, 2024 at 9:40 PM Farshid Ashouri wrote: > Mich, this is absolutely amazing. > > Thanks

Re: Python for the kids and now PySpark

2024-04-27 Thread Farshid Ashouri
Mich, this is absolutely amazing. Thanks for sharing. On Sat, 27 Apr 2024, 22:26 Mich Talebzadeh, wrote: > Python for the kids. Slightly off-topic but worthwhile sharing. > > One of the things that may benefit kids is starting to learn something > new. Basically anything that can focus their

[Release Question]: Estimate on 3.5.2 release?

2024-04-26 Thread Paul Gerver
Hello, I'm curious if there is an estimate when 3.5.2 for Spark Core will be released. There are several bug and security vulnerability fixes in the dependencies we are excited to receive! If anyone has any insights, that would be greatly appreciated. Thanks! - ​Paul

[SparkListener] Accessing classes loaded via the '--packages' option

2024-04-26 Thread Damien Hawes
Hi folks, I'm contributing to the OpenLineage project, specifically the Apache Spark integration. My current focus is on extending the project to support data lineage extraction for Spark Streaming, beginning with Apache Kafka sources and sinks. I've encountered an obstacle when attempting to

unsubscribe

2024-04-25 Thread Code Tutelage

Re:RE: How to add MaxDOP option in spark mssql JDBC

2024-04-25 Thread Elite
Thank you. My main purpose is to pass "MAXDOP 1" to MSSQL to control the CPU usage. From the official doc, I guess the problem with my code is that Spark wraps the query as select * from (SELECT TOP 10 * FROM dbo.Demo with (nolock) WHERE Id = 1 option (maxdop 1)) spark_gen_alias Apparently, this

Re: [spark-graphframes]: Generating incorrect edges

2024-04-25 Thread Nijland, J.G.W. (Jelle, Student M-CS)
Hi Mich, Thanks for your suggestions. 1) It currently runs on one server with plenty of resources assigned. But I will keep it in mind to replace monotonically_increasing_id() with uuid() once we scale up. 2) I have replaced the null values in origin with a string

DataFrameReader: timestampFormat default value

2024-04-24 Thread keen
Is anyone familiar with [Datetime patterns]( https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) and `TimestampType` parsing in PySpark? When reading CSV or JSON files, timestamp columns need to be parsed via the datasource property `timestampFormat`. [According to documentation](
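
A small sketch of setting `timestampFormat` on a CSV read (the path, schema and pattern are illustrative only):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("header", "true")
      # Pattern follows the Spark datetime-pattern reference linked above.
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
      .schema("id INT, created_at TIMESTAMP")
      .csv("/tmp/events.csv"))
```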

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
OK let us have a look at these 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can generate the same ID. I suggest switching to UUIDs (e.g., uuid.uuid4()) for guaranteed uniqueness. 2) Missing values in
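
A sketch of the switch being suggested, using Spark's built-in `uuid()` SQL function instead of `monotonically_increasing_id()` (the DataFrame and column names are hypothetical); a Python UDF around `uuid.uuid4()` would be the equivalent mentioned in the thread.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

vertices = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# uuid() yields a random, collision-resistant identifier per row.
vertices_with_id = vertices.withColumn("id", F.expr("uuid()"))
```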

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
Hi Mich, Thanks for your reply, 1) ID generation is done using monotonically_increasing_id() this is then prefixed with "p_", "m_", "o_" or "org_" depending on the

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
OK few observations 1) ID Generation Method: How are you generating unique IDs (UUIDs, sequential numbers, etc.)? 2) Data Inconsistencies: Have you checked for missing values impacting ID generation? 3) Join Verification: If relevant, can you share the code for joining data points during ID

RE: How to add MaxDOP option in spark mssql JDBC

2024-04-24 Thread Appel, Kevin
You might be able to leverage the prepareQuery option, that is at https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option ... this was introduced in Spark 3.4.0 to handle temp table query and CTE query against MSSQL server since what you send in is not actually what
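
A hedged sketch of the `prepareQuery` approach mentioned above (Spark 3.4.0+): the prefix statement runs first and the `query` is then wrapped as a subquery, so the MAXDOP hint can live in the prefix. Connection details, table names and the exact SQL are placeholders, not verified against MSSQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://host:1433;databaseName=demo")  # placeholder
      .option("prepareQuery",
              "SELECT TOP 10 * INTO #Demo FROM dbo.Demo WITH (NOLOCK) "
              "WHERE Id = 1 OPTION (MAXDOP 1);")
      .option("query", "SELECT * FROM #Demo")
      .load())
```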

[spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
tags: pyspark,spark-graphframes Hello, I am running pyspark in a podman container and I have issues with incorrect edges when I build my graph. I start with loading a source dataframe from a parquet directory on my server. The source dataframe has the following columns:

How to add MaxDOP option in spark mssql JDBC

2024-04-23 Thread Elite
[QUESTION] How to pass MAXDOP option · Issue #2395 · microsoft/mssql-jdbc (github.com) Hi team, I was advised to seek help from the Spark community. We suspect Spark rewrites the query before passing it to MS SQL, and this leads to a syntax error. Is there any workaround to make my code work?

How to use Structured Streaming in Spark SQL

2024-04-22 Thread ????
In Flink, you can create streaming tables using Flink SQL and connect directly with SQL through CDC and Kafka. How can I use SQL for stream computation in Spark? 308027...@qq.com
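
A minimal sketch of the Structured Streaming equivalent: register the streaming DataFrame as a temp view and query it with spark.sql (the Kafka broker and topic names are placeholders).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-over-stream-sketch").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Expose the stream to SQL, then express the computation in SQL.
stream.selectExpr("CAST(value AS STRING) AS value").createOrReplaceTempView("events")
result = spark.sql("SELECT value, count(*) AS cnt FROM events GROUP BY value")

query = (result.writeStream
         .outputMode("complete")
         .format("console")
         .start())
```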

How to access the internal hidden columns of table by spark jdbc

2024-04-20 Thread casel.chen
I want to use spark jdbc to access alibaba cloud hologres (https://www.alibabacloud.com/product/hologres) internal hidden column `hg_binlog_timestamp_us ` but met the following error: Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'hg_binlog_timestamp_us'

Accounting the impact of failures in spark jobs

2024-04-19 Thread Faiz Halde
Hello, In my organization, we have an accounting system for spark jobs that uses the task execution time to determine how much time a spark job uses the executors for and we use it as a way to segregate cost. We sum all the task times per job and apply proportions. Our clusters follow a 1 task

StreamingQueryListener integration with Spark native metric sink (JmxSink)

2024-04-18 Thread Mason Chen
Hi all, Is it possible to integrate StreamingQueryListener with Spark metrics so that metrics can be reported through Spark's internal metric system? Ideally, I would like to report some custom metrics through StreamingQueryListener and export them to Spark's JmxSink. Best, Mason

[ANNOUNCE] Apache Spark 3.4.3 released

2024-04-18 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.3! Spark 3.4.3 is a maintenance release containing many fixes including security and correctness domains. This release is based on the branch-3.4 maintenance branch of Spark. We strongly recommend all 3.4 users to upgrade to this

[Spark SQL][How-To] Remove builtin function support from Spark

2024-04-17 Thread Matthew McMillian
Hello, I'm very new to the Spark ecosystem, apologies if this question is a bit simple. I want to modify a custom fork of Spark to remove function support. For example, I want to remove the query runners ability to call reflect and java_method. I saw that there exists a data structure in

should OutputCommitCoordinator fail stages for authorized committer failures when using s3a optimized committers?

2024-04-17 Thread Dylan McClelland
In https://issues.apache.org/jira/browse/SPARK-39195, OutputCommitCoordinator was modified to fail a stage if an authorized committer task fails. We run our spark jobs on a k8s cluster managed by karpenter and mostly built from spot instances. As a result, our executors are frequently killed.

[Spark SQL] xxhash64 default seed of 42 confusion

2024-04-16 Thread Igor Calabria
Hi all, I've noticed that Spark's xxhash64 output doesn't match other tools' due to using seed=42 as a default. I've looked at a few libraries and they use 0 as a default seed: - python https://github.com/ifduyue/python-xxhash - java https://github.com/OpenHFT/Zero-Allocation-Hashing/ - java

auto create event log directory if not exist

2024-04-15 Thread second_co...@yahoo.com.INVALID
Spark history server is set to use s3a, like below: spark.eventLog.enabled true spark.eventLog.dir s3a://bucket-test/test-directory-log Is there any configuration option I can set in the Spark config such that if the directory 'test-directory-log' does not exist it is auto-created before starting the Spark history

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Thanks, Mich, for your reply. I agree, it is not so scalable and efficient. But it works correctly for Kafka transactions, and there is no problem with committing offsets to Kafka asynchronously for now. Let me tell you some more details about my streaming job. CustomReceiver does not receive anything

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Mich Talebzadeh
Interesting. My concern is the infinite loop in *foreachRDD*: the *while(true)* loop within foreachRDD creates an infinite loop within each Spark executor. This might not be the most efficient approach, especially since offsets are committed asynchronously? HTH Mich Talebzadeh, Technologist |

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Because Spark Streaming for Kafka transactions does not work correctly for my needs, I moved to another approach using a raw Kafka consumer which handles read_committed messages from Kafka correctly. My code looks like the following. JavaDStream stream = ssc.receiverStream(new CustomReceiver());

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Kidong Lee
Thank you Mich for your reply. Actually, I tried to do most of your advice. When spark.streaming.kafka.allowNonConsecutiveOffsets=false, I got the following error. Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 3)

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Mich Talebzadeh
Hi Kidong, There may be a few potential reasons why the message counts from your Kafka producer and Spark Streaming consumer might not match, especially with transactional messages and the read_committed isolation level. 1) Just ensure that both your Spark Streaming job and the Kafka consumer written
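
For reference, a hedged sketch of requesting read_committed in Structured Streaming (the thread itself uses the DStream/receiver API, so this is only the analogous setting; broker and topic are placeholders). Kafka consumer properties are passed through with the `kafka.` prefix.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "tx-topic")
      # Defaults to read_uncommitted; set explicitly for transactional topics.
      .option("kafka.isolation.level", "read_committed")
      .load())
```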

Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-12 Thread Kidong Lee
Hi, I have a Kafka producer which sends messages transactionally to Kafka and a Spark Streaming job which should consume read_committed messages from Kafka. But there is a problem with Spark Streaming consuming read_committed messages. The count of messages sent by the Kafka producer transactionally is

Spark column headings, camelCase or snake case?

2024-04-11 Thread Mich Talebzadeh
I know this is a bit of a silly question. But what is the norm for Spark column headings? Is it camelCase or snake_case? For example, someone suggested, and I quote, "SumTotalInMillionGBP", which accurately conveys the meaning but is a bit long and uses camelCase, which is not the standard convention

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
I think this answers your question about what to do if you need more space on nodes. https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage Local Storage Spark supports using volumes to spill

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-11 Thread Ashley McManamon
Hi Mich, Thanks for the reply. I did come across that file but it didn't align with the appearance of `PartitionedFile`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala In fact, the code snippet you shared also

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
" In the end for my usecase I started using pvcs and pvc aware scheduling along with decommissioning. So far performance is good with this choice." How did you do this? tor. 11. apr. 2024 kl. 04:13 skrev Arun Ravi : > Hi Everyone, > > I had to explored IBM's and AWS's S3 shuffle plugins (some

Re: External Spark shuffle service for k8s

2024-04-10 Thread Arun Ravi
Hi Everyone, I had explored IBM's and AWS's S3 shuffle plugins (some time back); I had also explored AWS FSx Lustre in a few of my production jobs which have ~20TB of shuffle operations with 200-300 executors. What I have observed is that S3 and FSx behaviour was fine during the write phase, however I

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
interesting. So below should be the corrected code with the suggestion in the [SPARK-47718] .sql() does not recognize watermark defined upstream - ASF JIRA (apache.org) # Define schema for parsing Kafka messages schema = StructType([

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread 刘唯
Sorry, this is not a bug but essentially a user error. Spark throws a really confusing error and I'm also confused. Please see the reply in the ticket for how to make things correct. https://issues.apache.org/jira/browse/SPARK-47718 刘唯 wrote on Sat, 6 Apr 2024 at 11:41: > This indeed looks like a bug. I will

Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Femi Anthony
If you're using just Spark you could try turning on the history server and try to glean statistics from there. But there is no one location or log file which stores them all. Databricks, which is a managed Spark solution, provides such

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi, I believe this is the package https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala And the code case class FilePartition(index: Int, files: Array[PartitionedFile]) extends Partition with

Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Mich Talebzadeh
Well, you can do a fair bit with the available tools. The Spark UI, particularly the Stages and Executors tabs, provides some valuable insights related to database health metrics for applications using a JDBC source. Stage Overview: This section provides a summary of all the stages executed

[Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Ashley McManamon
Hi All, I've been diving into the source code to get a better understanding of how file splitting works from a user perspective. I've hit a deadend at `PartitionedFile`, for which I cannot seem to find a definition? It appears though it should be found at

Re: External Spark shuffle service for k8s

2024-04-08 Thread Mich Talebzadeh
Hi, First thanks everyone for their contributions I was going to reply to @Enrico Minack but noticed additional info. As I understand for example, Apache Uniffle is an incubating project aimed at providing a pluggable shuffle service for Spark. So basically, all these "external shuffle

Re: External Spark shuffle service for k8s

2024-04-08 Thread Vakaris Baškirov
I see that both Uniffle and Celeborn support S3/HDFS backends, which is great. In the case someone is using S3/HDFS, I wonder what would be the advantages of using Celeborn or Uniffle vs the IBM shuffle service plugin or the Cloud Shuffle Storage Plugin from AWS

How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread casel.chen
Hello, I have a spark application with jdbc source and do some calculation. To monitor application healthy, I need db related metrics per database like number of connections, sql execution time and sql fired time distribution etc. Does anybody know how to get them? Thanks!

Re: External Spark shuffle service for k8s

2024-04-08 Thread roryqi
Apache Uniffle (incubating) may be another solution. You can see https://github.com/apache/incubator-uniffle https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era Mich Talebzadeh wrote on Mon, 8 Apr 2024 at 07:15: > Splendid > > The

Re: External Spark shuffle service for k8s

2024-04-07 Thread Enrico Minack
There is the Apache incubator project Uniffle: https://github.com/apache/incubator-uniffle It stores shuffle data on remote servers in memory, on local disk and HDFS. Cheers, Enrico On 06.04.24 at 15:41, Mich Talebzadeh wrote: I have seen some older references for shuffle service for k8s,

Spark UDAF in examples fail with not serializable error

2024-04-07 Thread Owen Bell
The type-safe example given at https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html fails with a not serializable exception Is this a known issue?

Re: Idiomatic way to rate-limit streaming sources to avoid OutOfMemoryError?

2024-04-07 Thread Mich Talebzadeh
OK, this is a common issue in Spark Structured Streaming (SSS), where the source generates data faster than Spark can process it. SSS doesn't have a built-in mechanism for directly rate-limiting the incoming data stream itself. However, consider the following: - Limit the rate at which data
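
As an illustration of source-side rate limiting, a sketch for a Kafka source (the original thread uses a socket-like custom source, so these option names apply to Kafka only; broker and topic are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      # Cap the number of records consumed per micro-batch.
      .option("maxOffsetsPerTrigger", "100000")
      .load())
```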

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Thanks Cheng for the heads up. I will have a look. Cheers Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile

Re: External Spark shuffle service for k8s

2024-04-07 Thread Vakaris Baškirov
There is an IBM shuffle service plugin that supports S3 https://github.com/IBM/spark-s3-shuffle Though I would think a feature like this could be a part of the main Spark repo. Trino already has out-of-box support for s3 exchange (shuffle) and it's very useful. Vakaris On Sun, Apr 7, 2024 at

Idiomatic way to rate-limit streaming sources to avoid OutOfMemoryError?

2024-04-07 Thread Baran, Mert
Hi Spark community, I have a Spark Structured Streaming application that reads data from a socket source (implemented very similarly to the TextSocketMicroBatchStream). The issue is that the source can generate data faster than Spark can process it, eventually leading to an OutOfMemoryError

Re: External Spark shuffle service for k8s

2024-04-07 Thread Cheng Pan
Instead of the External Shuffle Service, Apache Celeborn might be a good option as a Remote Shuffle Service for Spark on K8s. There are some useful resources you might be interested in. [1] https://celeborn.apache.org/ [2] https://www.youtube.com/watch?v=s5xOtG6Venw [3]

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Splendid. The configurations below can be used with k8s deployments of Spark. Spark applications running on k8s can utilize these configurations to seamlessly access data stored in Google Cloud Storage (GCS) and Amazon S3. For Google GCS we may have spark_config_gcs = {
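
The dict in the thread is truncated; the following is only an illustrative sketch of the kind of GCS connector settings usually involved (the key file path and values are placeholders and should be checked against the GCS connector documentation).

```python
from pyspark.sql import SparkSession

spark_config_gcs = {
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.hadoop.google.cloud.auth.service.account.enable": "true",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/key.json",
}

builder = SparkSession.builder.appName("gcs-on-k8s-sketch")
for key, value in spark_config_gcs.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()
```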

Example UDAF fails with "not serializable" exception

2024-04-06 Thread Owen Bell
https://spark.apache.org/docs/3.3.2/sql-ref-functions-udf-aggregate.html I'm trying to run this example on Databricks, and it fails with the stacktrace below. It's literally a copy-paste from the example, what am I missing? Job aborted due to stage failure: Task not serializable:

Re: External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
Thanks for your suggestion that I take it as a workaround. Whilst this workaround can potentially address storage allocation issues, I was more interested in exploring solutions that offer a more seamless integration with large distributed file systems like HDFS, GCS, or S3. This would ensure

Re: External Spark shuffle service for k8s

2024-04-06 Thread Bjørn Jørgensen
You can make a PVC on K8S, call it 300GB, make a folder in your dockerfile: WORKDIR /opt/spark/work-dir RUN chmod g+w /opt/spark/work-dir then start spark adding this: .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
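
A hedged expansion of the PVC approach above (claim name, size and mount path are illustrative); the executor-side settings and the spark.local.dir line are my additions, shown for completeness.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pvc-work-dir-sketch")
         .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb")
         .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir")
         .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb")
         .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir")
         # Point scratch/spill space at the mounted volume (assumption).
         .config("spark.local.dir", "/opt/spark/work-dir")
         .getOrCreate())
```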

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-06 Thread 刘唯
This indeed looks like a bug. I will take some time to look into it. Mich Talebzadeh wrote on Wed, 3 Apr 2024 at 01:55: > > hm. you are getting below > > AnalysisException: Append output mode not supported when there are > streaming aggregations on streaming DataFrames/DataSets without watermark; > > The

Re: [External] Re: Issue of spark with antlr version

2024-04-06 Thread Bjørn Jørgensen
[[VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)]( https://lists.apache.org/thread/r0zn6rd8y25yn2dg59ktw3ttrwxzqrfb) Apache Spark 4.0.0 Release Plan === 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master branch. 2. Creating `branch-4.0` on April

Unsubscribe

2024-04-06 Thread rau-jannik
Unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
I have seen some older references for shuffle service for k8s, although it is not clear they are talking about a generic shuffle service for k8s. Anyhow with the advent of genai and the need to allow for a larger volume of data, I was wondering if there has been any more work on this matter.

Clarification on what "[id=#]" refers to in Physical Plan Exchange hashpartitioning

2024-04-04 Thread Tahj Anderson
Hello, While looking through Spark physical plans generated by the Spark history server log to find any bottlenecks in my code, I stumbled across an ID that shows up in a partitioning stage. My goal is to use the history server log to provide meaningful analysis of my Spark system

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
I don't really understand how Iceberg and the hadoop libraries can coexist in a deployment. The latest spark (3.5.1) base image contains the hadoop-client*-3.3.4.jar. The AWS v2 SDK is only supported in hadoop*-3.4.0.jar and onward. Iceberg AWS integration states AWS v2 SDK is

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
Swapping out the iceberg-aws-bundle for the very latest aws provided sdk ('software.amazon.awssdk:bundle:2.25.23') produces an incompatibility from a slightly different code path: java.lang.NoSuchMethodError: 'void

Participate in the ASF 25th Anniversary Campaign

2024-04-03 Thread Brian Proffitt
Hi everyone, As part of The ASF’s 25th anniversary campaign[1], we will be celebrating projects and communities in multiple ways. We invite all projects and contributors to participate in the following ways: * Individuals - submit your first contribution:

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
[sorry; replying all this time] With hadoop-*-3.3.6 in place of the 3.4.0 below I get java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException I think that the below iceberg-aws-bundle version supplies the v2 sdk. Dan From: Aaron Grubb Sent: 03

Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Aaron Grubb
Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should probably be considered as breaking for tools that build on < 3.4.0 while using AWS. From: Oxlade, Dan Sent: Wednesday, April 3, 2024 2:41:11 PM To: user@spark.apache.org Subject:

[Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
Hi all, I've struggled with this for quite some time. My requirement is to read a parquet file from s3 to a Dataframe then append to an existing iceberg table. In order to read the parquet I need the hadoop-aws dependency for s3a:// . In order to write to iceberg I need the iceberg dependency.
