Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-08 Thread Someshwar Kale
Hi Chhavi, Currently there is no way to handle backticks (`) in a Spark StructType. Hence the field names a.b and `a.b` are completely different within StructType. To handle that, I have added a custom implementation fixing StringIndexer#validateAndTransformSchema. You can refer to the code on my
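For readers hitting the same issue, a minimal sketch of the distinction in PySpark (the column names are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A top-level column literally named "a.b" vs. a nested field b inside struct a.
df = spark.createDataFrame([("x", ("y",))], "`a.b` string, a struct<b:string>")

df.select("`a.b`").show()  # backticks: the column whose name contains a dot
df.select("a.b").show()    # no backticks: the nested field a.b
```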

Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-08 Thread Chhavi Bansal
Hi Someshwar, Thanks for the response, I have added my comments to the ticket <https://issues.apache.org/jira/browse/SPARK-48463>. Thanks, Chhavi Bansal On Thu, 6 Jun 2024 at 17:28, Someshwar Kale wrote: > As a fix, you may consider adding a transformer to rename columns (perhaps

Re: OOM issue in Spark Driver

2024-06-08 Thread Andrzej Zera
Hey, do you perform stateful operations? Maybe your state is growing indefinitely - a screenshot with state metrics would help (you can find it in Spark UI -> Structured Streaming -> your query). Do you have a driver-only cluster or do you have workers too? What's the memory usage p
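For reference, the same state metrics can be pulled programmatically from the last progress report (a sketch; `query` is assumed to be your running StreamingQuery handle):

```python
# Inspect state-store metrics from the most recent micro-batch,
# the same numbers shown in the Structured Streaming UI tab.
progress = query.lastProgress
if progress and progress.get("stateOperators"):
    for op in progress["stateOperators"]:
        print(op.get("operatorName"), op["numRowsTotal"], op["memoryUsedBytes"])
```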

OOM issue in Spark Driver

2024-06-08 Thread Karthick Nk
Hi All, I am using pyspark structured streaming with Azure Databricks for a data load process. In the pipeline I am using a job cluster and I am running only one pipeline; I am getting an OUT OF MEMORY issue when it runs for a long time. When I inspect the metrics of the cluster I found that,

Re: 7368396 - Apache Spark 3.5.1 (Support)

2024-06-07 Thread Sadha Chilukoori
Hi Alex, Spark is an open source software available under Apache License 2.0 ( https://www.apache.org/licenses/), further details can be found here in the FAQ page (https://spark.apache.org/faq.html). Hope this helps. Thanks, Sadha On Thu, Jun 6, 2024, 1:32 PM SANTOS SOUZA, ALEX wrote

7368396 - Apache Spark 3.5.1 (Support)

2024-06-06 Thread SANTOS SOUZA, ALEX
Hey guys! I am part of the team responsible for software approval at EMBRAER S.A. We are currently in the process of approving the Apache Spark 3.5.1 software and are verifying the licensing of the application. Therefore, I would like to kindly request you to answer the questions below. -What

Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-06 Thread Someshwar Kale
") val si = new StringIndexer().setInputCol("location_longitude").setOutputCol("longitutdee") val pipeline = new Pipeline().setStages(Array(renameColumn, si)) pipeline.fit(flattenedDf).transform(flattenedDf).show() refer my comment <https://issues.apache.org/jira/b

[SPARK-48423] Unable to save ML Pipeline to azure blob storage

2024-06-05 Thread Chhavi Bansal
a ticket on spark https://issues.apache.org/jira/browse/SPARK-48423, where I describe the entire scenario. I tried debugging the code and found that this key is being explicitly asked for in the code. The only solution was to again set it as part of spark.conf, which could result in a race condition since
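For context, the usual way to make the storage key visible to the Hadoop layer is to set it when the session is built rather than afterwards; a sketch, with placeholder account/container names and an already-fitted `model` assumed:

```python
from pyspark.sql import SparkSession

# Placeholder storage account; the fs.azure key must be on the Hadoop
# configuration before the ABFS filesystem is first initialised.
spark = (SparkSession.builder
         .config("spark.hadoop.fs.azure.account.key.myaccount.dfs.core.windows.net",
                 "<storage-account-key>")
         .getOrCreate())

# model is an already-fitted PipelineModel (assumed).
model.write().overwrite().save("abfss://mycontainer@myaccount.dfs.core.windows.net/models/pipeline")
```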

[SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-05 Thread Chhavi Bansal
Hello team, I was exploring feature transformation exposed via MLlib on a nested dataset, and encountered an error while applying any transformer to a column with dot-notation naming. I thought of raising a ticket on spark https://issues.apache.org/jira/browse/SPARK-48463, where I have mentioned

Inquiry Regarding Security Compliance of Apache Spark Docker Image

2024-06-05 Thread Tonmoy Sagar
aimed at handling substantial volumes of data. As part of our deployment strategy, we are endeavouring to implement a Spark-based application on our Azure Kubernetes service. Regrettably, we have encountered challenges from a security perspective with the latest Apache Spark Docker image

[ANNOUNCE] Announcing Apache Spark 4.0.0-preview1

2024-06-03 Thread Wenchen Fan
Hi all, To enable wide-scale community testing of the upcoming Spark 4.0 release, the Apache Spark community has posted a preview release of Spark 4.0. This preview is not a stable release in terms of either API or functionality, but it is meant to give the community early access to try the code

[apache-spark][spark-dataframe] DataFrameWriter.partitionBy does not guarantee previous sort result

2024-05-31 Thread leeyc0
tionKey") .text("output"); However, as mentioned in SPARK-44512 ( https://issues.apache.org/jira/browse/SPARK-44512), this does not guarantee the output is globally sorted. (note: I found that even setting spark.sql.optimizer.plannedWrite.enabled=false still does not

Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Amin Mosayyebzadeh
I am reading from a single file: df = spark.read.text("s3a://test-bucket/testfile.csv") On Fri, May 31, 2024 at 5:26 AM Mich Talebzadeh wrote: > Tell Spark to read from a single file > > data = spark.read.text("s3a://test-bucket/testfile.csv") > > This clar

Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Mich Talebzadeh
Tell Spark to read from a single file data = spark.read.text("s3a://test-bucket/testfile.csv") This clarifies to Spark that you are dealing with a single file and avoids any bucket-like interpretation. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Amin Mosayyebzadeh
I will work on the first two possible causes. For the third one, which I guess is the real problem, Spark treats the testfile.csv object with the url s3a://test-bucket/testfile.csv as a bucket to access _spark_metadata with url s3a://test-bucket/testfile.csv/_spark_metadata testfile.csv

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
OK, some observations: - Spark job successfully lists the S3 bucket containing testfile.csv. - Spark job can retrieve the file size (33 Bytes) for testfile.csv. - Spark job fails to read the actual data from testfile.csv. - The printed content from testfile.csv is an empty list

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Amin Mosayyebzadeh
The code should read the testfile.csv file from s3 and print the content. It only prints an empty list although the file has content. I have also checked our custom s3 storage (Ceph based) logs and I see only LIST operations coming from Spark; there is no GET object operation for testfile.csv

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
Hello, Overall, the exit code of 0 suggests a successful run of your Spark job. Analyze the intended purpose of your code and verify the output or Spark UI for further confirmation. 24/05/30 01:23:43 INFO SparkContext: SparkContext is stopping with exitCode 0. what to check 1. Verify

Re: [s3a] Spark is not reading s3 object content

2024-05-29 Thread Amin Mosayyebzadeh
Hi Mich, Thank you for the help and sorry about the late reply. I ran your provided code but I got "exitCode 0". Here is the complete output: === 24/05/30 01:23:38 INFO SparkContext: Running Spark version 3.5.0 24/05/30 01:23:38 INFO SparkContext: OS info Li

[Spark on k8s] A issue of k8s resource creation order

2024-05-29 Thread Tao Yang
Hi, team! I have a Spark on k8s issue which I posted at https://stackoverflow.com/questions/78537132/spark-on-k8s-resource-creation-order Need help

Re: Spark Protobuf Deserialization

2024-05-27 Thread Sandish Kumar HN
Did you try using to_protobuf and from_protobuf ? https://spark.apache.org/docs/latest/sql-data-sources-protobuf.html On Mon, May 27, 2024 at 15:45 Satyam Raj wrote: > Hello guys, > We're using Spark 3.5.0 for processing Kafka source that contains protobuf > serialized data. T
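A minimal sketch of the suggested approach, assuming a compiled descriptor file and the message name from the thread (requires the spark-protobuf package on the classpath):

```python
from pyspark.sql.protobuf.functions import from_protobuf

# value is the raw Kafka message bytes; "Request" is the protobuf message
# name, and request.desc is a descriptor set built with
# protoc --descriptor_set_out=request.desc
decoded = kafka_df.select(
    from_protobuf("value", "Request", descFilePath="/path/to/request.desc").alias("req")
)
decoded.select("req.sent_ts", "req.event").printSchema()
```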

Re: [Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Mich Talebzadeh
When you use applyInPandasWithState, Spark processes each input row as it arrives, regardless of whether certain columns, such as the timestamp column, contain NULL values. This behavior is useful where you want to handle incomplete or missing data gracefully within your stateful processing logic

Spark Protobuf Deserialization

2024-05-27 Thread Satyam Raj
Hello guys, We're using Spark 3.5.0 for processing a Kafka source that contains protobuf-serialized data. The format is as follows: message Request { int64 sent_ts = 1; repeated Event event = 2; } message Event { string event_name = 1; bytes event_bytes = 2; } The event_bytes contains the data

[Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Juan Casse
Is this a supported feature of Spark? Can I rely on this behavior for my production code? Thanks, Juan

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Shay Elbaz
drawbacks, and is not recommended in general, but at least we had something working. The real, long-term solution was to replace the many ( > 200) withColumn() calls to only a few select() calls. You can easily find sources on the internet for why this is better. (it was on Spark 2.3, but I th
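For illustration, the kind of rewrite being described: collapsing a long withColumn chain into a single select (a sketch with made-up column logic):

```python
from pyspark.sql import functions as F

# Before: every withColumn call adds a new projection to the plan,
# and hundreds of them can blow up driver-side analysis.
# df = df.withColumn("a2", F.col("a") * 2)
# df = df.withColumn("b2", F.col("b") + 1)
# ... 200 more ...

# After: one select producing the same columns in a single projection.
df2 = df.select(
    "*",
    (F.col("a") * 2).alias("a2"),
    (F.col("b") + 1).alias("b2"),
)
```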

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Mich Talebzadeh
involves deep nesting, consider using techniques like explode or flattening to transform it into a less nested structure. This can reduce memory usage during operations like withColumn. 3. Lazy Evaluation: Use select before withColumn: this ensures lazy evaluation, meaning Spark only

Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Gaurav Madan
Dear Community, I'm reaching out to seek your assistance with a memory issue we've been facing while processing certain large and nested DataFrames using Apache Spark. We have encountered a scenario where the driver runs out of memory when applying the `withColumn` method on specific DataFrames

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Sorry, I thought I gave an explanation. The issue you are encountering with incorrect record numbers in the "ShuffleWrite Size/Records" column in the Spark DAG UI when data is read from cache/persist is a known limitation. This discrepancy arises due to the way Spark handles and repor

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Just to further clarify that the Shuffle Write Size/Records column in the Spark UI can be misleading when working with cached/persisted data because it reflects the shuffled data size and record count, not the entire cached/persisted data. So it is fair to say that this is a limitation

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Yep, the Spark UI's "Shuffle Write Size/Records" column can sometimes show incorrect record counts *when data is retrieved from cache or persisted data*. This happens because the record count reflects the number of records written to disk for shuffling, and not the actual number of re

Re: BUG :: UI Spark

2024-05-26 Thread Sathi Chowdhury
please assist me? On Fri, May 24, 2024 at 12:29 AM Prem Sahoo wrote: Does anyone have a clue ? On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: Hello Team, in spark DAG UI , we have Stages tab. Once you click on each stage you can view the tasks. In each task we have a column "Shuffle

Re: BUG :: UI Spark

2024-05-26 Thread Prem Sahoo
Can anyone please assist me ? On Fri, May 24, 2024 at 12:29 AM Prem Sahoo wrote: > Does anyone have a clue ? > > On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > >> Hello Team, >> in spark DAG UI , we have Stages tab. Once you click on each stage you >> can

Re: Can Spark Catalog Perform Multimodal Database Query Analysis

2024-05-24 Thread Mich Talebzadeh
Something like this in Python from pyspark.sql import SparkSession # Configure Spark Session with JDBC URLs spark_conf = SparkConf() \ .setAppName("SparkCatalogMultipleSources") \ .set("hive.metastore.uris", "thrift://hive1-metastore:9080,thrift://hive2-me
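For the JDBC part at least, Spark 3 can register an extra catalog directly; a sketch of a MySQL catalog via the built-in JDBCTableCatalog (names, URLs and namespace layout are placeholders, and joining two separate Hive metastores in one session is a different problem):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.catalog.mysql",
                 "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
         .config("spark.sql.catalog.mysql.url", "jdbc:mysql://mysql-host:3306/mydb")
         .config("spark.sql.catalog.mysql.driver", "com.mysql.cj.jdbc.Driver")
         .enableHiveSupport()  # hive1 reachable via the configured metastore
         .getOrCreate())

# Cross-catalog join between the Hive catalog and the JDBC catalog.
spark.sql("""
    SELECT * FROM spark_catalog.db1.table1 t1
    JOIN mysql.mydb.table1 t2 ON t1.id = t2.id
""").show()
```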

Can Spark Catalog Perform Multimodal Database Query Analysis

2024-05-24 Thread ????
I have two clusters hive1 and hive2, as well as a MySQL database. Can I use the Spark Catalog to register them, or can I only use one catalog at a time? Can multiple catalogs be joined across databases? select * from hive1.table1 join hive2.table2 join mysql.table1 where 308027

Re: BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Does anyone have a clue ? On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > Hello Team, > in spark DAG UI , we have Stages tab. Once you click on each stage you can > view the tasks. > > In each task we have a column "ShuffleWrite Size/Records " that column > pr

Re: [s3a] Spark is not reading s3 object content

2024-05-23 Thread Mich Talebzadeh
Could be a number of reasons First test reading the file with a cli aws s3 cp s3a://input/testfile.csv . cat testfile.csv Try this code with debug option to diagnose the problem from pyspark.sql import SparkSession from pyspark.sql.utils import AnalysisException try: # Initialize Spark

BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Hello Team, in the spark DAG UI we have a Stages tab. Once you click on each stage you can view the tasks. In each task we have a column "ShuffleWrite Size/Records"; that column prints wrong data when it gets the data from cache/persist. It typically will show the wrong record number thoug

[s3a] Spark is not reading s3 object content

2024-05-23 Thread Amin Mosayyebzadeh
I am trying to read an s3 object from a local S3 storage (Ceph based) using Spark 3.5.1. I see it can access the bucket and list the files (I have verified it on Ceph side by checking its logs), even returning the correct size of the object. But the content is not read. The object url is: s3a
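For anyone reproducing this against a non-AWS S3 endpoint, these are the s3a settings typically needed for Ceph/RGW (a sketch; endpoint and credentials are placeholders):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3a.endpoint", "http://ceph-rgw-host:8080")
         .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
         .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
         # Ceph RGW is usually addressed path-style, not virtual-host style.
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
         .getOrCreate())

df = spark.read.text("s3a://test-bucket/testfile.csv")
df.show(truncate=False)
```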

Remote File change detection in S3 when spark queries are running and parquet files in S3 changes

2024-05-22 Thread Raghvendra Yadav
Hello, We are hoping someone can help us understand the spark behavior for scenarios listed below. Q. *Will spark running queries fail when S3 parquet object changes underneath with S3A remote file change detection enabled? Is it 100%? * Our understanding is that S3A has
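For reference, the S3A change-detection knobs being discussed are Hadoop configuration options; a sketch of how they are typically set (the values shown are the documented choices, not a recommendation):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # What to compare to detect a changed object: "etag" or "versionid".
         .config("spark.hadoop.fs.s3a.change.detection.source", "etag")
         # How to react to a detected change: "server", "client", "warn" or "none".
         .config("spark.hadoop.fs.s3a.change.detection.mode", "server")
         .getOrCreate())
```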

Re: A handy tool called spark-column-analyser

2024-05-21 Thread ashok34...@yahoo.com.INVALID
cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun)". On Tue, 21 May 2024 at 16:21, Mich Talebzadeh wrote: I just wanted to share a tool I built called spark-column-analyzer. It's a Py

Re: A handy tool called spark-column-analyser

2024-05-21 Thread Mich Talebzadeh
as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". On Tue, 21 May 2024 at 16:21, Mich Talebzadeh wrote: > I just wanted to s

A handy tool called spark-column-analyser

2024-05-21 Thread Mich Talebzadeh
I just wanted to share a tool I built called *spark-column-analyzer*. It's a Python package that helps you dig into your Spark DataFrames with ease. Ever spend ages figuring out what's going on in your columns? Like, how many null values are there, or how many unique entries? Built with data
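Under the hood, such per-column stats come down to simple aggregations; a quick sanity-check sketch in plain PySpark for an arbitrary DataFrame df:

```python
from pyspark.sql import functions as F

# Null count and distinct count for every column of df.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
)
distinct_counts = df.select(
    [F.countDistinct(F.col(c)).alias(c) for c in df.columns]
)
null_counts.show()
distinct_counts.show()
```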

Request for Assistance: Adding User Authentication to Apache Spark Application

2024-05-16 Thread NIKHIL RAJ SHRIVASTAVA
Dear Team, I hope this email finds you well. My name is Nikhil Raj, and I am currently working with Apache Spark for one of my projects , where through the help of a parquet file we are creating an external table in Spark. I am reaching out to seek assistance regarding user authentication

Query Regarding UDF Support in Spark Connect with Kubernetes as Cluster Manager

2024-05-15 Thread Nagatomi Yasukazu
Hi Spark Community, I have a question regarding the support for User-Defined Functions (UDFs) in Spark Connect, specifically when using Kubernetes as the Cluster Manager. According to the Spark documentation, UDFs are supported by default for the shell and in standalone applications
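For context, a minimal Spark Connect client with a Python UDF looks like this (a sketch; the connect endpoint is a placeholder, and UDF execution ultimately depends on the server side having matching Python dependencies):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Connect to a Spark Connect server, e.g. the driver service in Kubernetes.
spark = SparkSession.builder.remote("sc://spark-connect-host:15002").getOrCreate()

@udf(returnType=IntegerType())
def plus_one(x):
    return x + 1

spark.range(5).select(plus_one("id")).show()
```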

Re: [spark-graphframes]: Generating incorrect edges

2024-05-11 Thread Nijland, J.G.W. (Jelle, Student M-CS)
nt M-CS) ; user@spark.apache.org Subject: Re: [spark-graphframes]: Generating incorrect edges Hi Steve, Thanks for your statement. I tend to use uuid myself to avoid collisions. This built-in function generates random IDs that are highly likely to be unique across systems. My concerns are

Spark 3.5.x on Java 21?

2024-05-08 Thread Stephen Coy
Hi everyone, We’re about to upgrade our Spark clusters from Java 11 and Spark 3.2.1 to Spark 3.5.1. I know that 3.5.1 is supposed to be fine on Java 17, but will it run OK on Java 21? Thanks, Steve C This email contains confidential information of and is the copyright of Infomedia

Re: [Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-08 Thread Mich Talebzadeh
ided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun <https://en.wikipedia.or

Spark not creating staging dir for insertInto partitioned table

2024-05-07 Thread Sanskar Modi
Hi Folks, I wanted to check why spark doesn't create a staging dir while doing an insertInto on partitioned tables. I'm running the below example code – ``` spark.sql("set hive.exec.dynamic.partition.mode=nonstrict") val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3
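The same repro in PySpark, for anyone following along (a sketch; the table name is a placeholder and must already exist as a partitioned Hive table):

```python
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

df = spark.createDataFrame([(1, 5, 1), (2, 1, 2), (4, 4, 3)], ["a", "b", "pt"])

# insertInto writes into the existing partitioned table, mapping the
# trailing column(s) of the DataFrame onto the partition spec.
df.write.mode("overwrite").insertInto("mydb.partitioned_table")
```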

[Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-07 Thread Nandha Kumar
Hi Team, We are trying to use *spark structured streaming* for our use case. We will be joining 2 streaming sources (from kafka topics) with watermarks. As time progresses, the records that are prior to the watermark timestamp are removed from the state. For our use case, we want to *store

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Mich Talebzadeh
Hi Kartrick, Unfortunately Materialised views are not available in Spark as yet. I raised Jira [SPARK-48117] Spark Materialized Views: Improve Query Performance and Data Management - ASF JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-48117> as a feature request. Let me

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Karthick Nk
the view data into elastic index by using cdc? Thanks in advance. On Fri, May 3, 2024 at 3:39 PM Mich Talebzadeh wrote: > My recommendation! is using materialized views (MVs) created in Hive with > Spark Structured Streaming and Change Data Capture (CDC) is a good > combination for ef

Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Mich Talebzadeh
Hi, I have raised a ticket SPARK-48117 <https://issues.apache.org/jira/browse/SPARK-48117> for enhancing Spark capabilities with Materialised Views (MV). Currently both Hive and Databricks support this. I have added these potential benefits to the ticket -* Improved Query Perfo

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Sadly it sounds like Apache Spark has nothing to do with materialised views. I was hoping it could read it! >>> *spark.sql("SELECT * FROM test.mv").show()* Traceback (most recent call last): File "", line 1, in File "/opt/spark/p

Help needed optimize spark history server performance

2024-05-03 Thread Vikas Tharyani
Dear Spark Community, I'm writing to seek your expertise in optimizing the performance of our Spark History Server (SHS) deployed on Amazon EKS. We're encountering timeouts (HTTP 504) when loading large event logs exceeding 5 GB. *Our Setup:* - Deployment: SHS on EKS with Nginx ingress (idle
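Two settings that often help with multi-GB logs are rolling event logs on the writer side and compaction on the SHS side; a sketch (thresholds are illustrative):

```python
from pyspark.sql import SparkSession

# On the applications writing the logs: roll event logs into smaller files.
spark = (SparkSession.builder
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.rolling.enabled", "true")
         .config("spark.eventLog.rolling.maxFileSize", "128m")
         .getOrCreate())

# On the history server side (spark-defaults.conf for the SHS process):
#   spark.history.fs.eventLog.rolling.maxFilesToRetain  2
# and give the daemon more heap via SPARK_DAEMON_MEMORY in spark-env.sh.
```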

Re: ********Spark streaming issue to Elastic data**********

2024-05-03 Thread Mich Talebzadeh
My recommendation! is using materialized views (MVs) created in Hive with Spark Structured Streaming and Change Data Capture (CDC) is a good combination for efficiently streaming view data updates in your scenario. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Thanks for the comments I received. So in summary, Apache Spark itself doesn't directly manage materialized views (MVs), but it can work with them through integration with the underlying data storage systems like Hive or through Iceberg. I believe Databricks through unity catalog supports MVs

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
(removing dev@ as I don't think this is dev@ related thread but more about "question") My understanding is that Apache Spark does not support Materialized View. That's all. IMHO it's not a proper expectation that all operations in Apache Hive will be supported in Apache Spark. They are

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess you must have created the view from Hive and are trying to drop it from Spark, and that is why you are running into the issue with DROP first

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
An issue I encountered while working with Materialized Views in Spark SQL. It appears that there is an inconsistency between the behavior of Materialized Views in Spark SQL and Hive. When attempting to execute a statement like DROP MATERIALIZED VIEW IF EXISTS test.mv in Spark SQL, I encountered

********Spark streaming issue to Elastic data**********

2024-05-02 Thread Karthick Nk
from view definition) by using spark structured streaming. Issue: 1. Here we are facing an issue - for each incoming id we run the view definition (so it will read all of the data every time) and check if any of the incoming ids is present in the collective ids of the view result, due to which

Re: [spark-graphframes]: Generating incorrect edges

2024-05-01 Thread Mich Talebzadeh
Hi Steve, Thanks for your statement. I tend to use uuid myself to avoid collisions. This built-in function generates random IDs that are highly likely to be unique across systems. My concerns are on edge so to speak. If the Spark application runs for a very long time or encounters restarts

Re: [spark-graphframes]: Generating incorrect edges

2024-04-30 Thread Stephen Coy
Hi Mich, I was just reading random questions on the user list when I noticed that you said: On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh wrote: 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can

Re: Spark on Kubernetes

2024-04-30 Thread Mich Talebzadeh
. My suggestions - Increase Executor Memory: Allocate more memory per executor (e.g., 2GB or 3GB) to allow for multiple executors within available cluster memory. - Adjust Driver Pod Resources: Ensure the driver pod has enough memory to run Spark and manage executors. - Optimize
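As a concrete illustration of those suggestions (sizes are examples only, to be tuned against the available cluster memory):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "2g")              # up from the 1g in the question
         .config("spark.executor.memoryOverhead", "512m")    # headroom for the executor pods
         .config("spark.driver.memory", "2g")                # driver pod must fit Spark plus scheduling state
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  # needed on k8s
         .config("spark.dynamicAllocation.maxExecutors", "4")
         .getOrCreate())
```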

Spark on Kubernetes

2024-04-29 Thread Tarun raghav
Respected Sir/Madam, I am Tarunraghav. I have a query regarding spark on kubernetes. We have an eks cluster, within which we have spark installed in the pods. We set the executor memory as 1GB and set the executor instances as 2, I have also set dynamic allocation as true. So when I try to read

Re:RE: How to add MaxDOP option in spark mssql JDBC

2024-04-25 Thread Elite
Thank you. My main purpose is to pass "MAXDOP 1" to MSSQL to control the CPU usage. From the official doc, I guess the problem with my code is that spark wraps the query into select * from (SELECT TOP 10 * FROM dbo.Demo with (nolock) WHERE Id = 1 option (maxdop 1)) spark_gen_alias

Re: [spark-graphframes]: Generating incorrect edges

2024-04-25 Thread Nijland, J.G.W. (Jelle, Student M-CS)
"128G" ).set("spark.executor.memoryOverhead", "32G" ).set("spark.driver.cores", "16" ).set("spark.driver.memory", "64G" ) I dont think b) applies as its a single machine. Kind regards, Jelle Fr

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
OK let us have a look at these 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can generate the same ID. I suggest switching to UUIDs (e.g., uuid.uuid4()) for guaranteed uniqueness. 2) Missing values
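For point 1, Spark's built-in SQL uuid() function is the easy swap; a sketch:

```python
from pyspark.sql import functions as F

# Replace monotonically_increasing_id() with uuid(), which does not
# depend on partition layout and will not collide across restarts.
df_with_ids = df.withColumn("vertex_id", F.expr("uuid()"))
```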

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
___ From: Mich Talebzadeh Sent: Wednesday, April 24, 2024 4:40 PM To: Nijland, J.G.W. (Jelle, Student M-CS) Cc: user@spark.apache.org Subject: Re: [spark-graphframes]: Generating incorrect edges OK few observations 1) ID Generation Method: How are you generating unique IDs (UUIDs, seque

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
jl...@student.utwente.nl> wrote: > tags: pyspark,spark-graphframes > > Hello, > > I am running pyspark in a podman container and I have issues with > incorrect edges when I build my graph. > I start with loading a source dataframe from a parquet directory on my &

RE: How to add MaxDOP option in spark mssql JDBC

2024-04-24 Thread Appel, Kevin
You might be able to leverage the prepareQuery option, that is at https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option ... this was introduced in Spark 3.4.0 to handle temp table query and CTE query against MSSQL server since what you send in is not actually what
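From the linked docs, prepareQuery is prefixed onto the query option to form the final statement; a hedged sketch of the pattern (whether MSSQL accepts the MAXDOP hint in this shape still needs testing):

```python
# prepareQuery (Spark >= 3.4) is prepended to the generated statement,
# letting CTE/temp-table syntax survive Spark's subquery wrapping.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://host;databaseName=db")
      .option("prepareQuery",
              "WITH t AS (SELECT TOP 10 * FROM dbo.Demo WITH (NOLOCK) WHERE Id = 1)")
      .option("query", "SELECT * FROM t")
      .load())
```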

[spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
tags: pyspark,spark-graphframes Hello, I am running pyspark in a podman container and I have issues with incorrect edges when I build my graph. I start with loading a source dataframe from a parquet directory on my server. The source dataframe has the following columns

How to add MaxDOP option in spark mssql JDBC

2024-04-23 Thread Elite
[QUESTION] How to pass MAXDOP option · Issue #2395 · microsoft/mssql-jdbc (github.com) Hi team, I was advised to ask the spark community for help. We suspect spark rewrites the query before passing it to MS SQL, and that leads to a syntax error. Is there any workaround to make my code work

How to use Structured Streaming in Spark SQL

2024-04-22 Thread ????
In Flink, you can create streaming tables using Flink SQL and connect directly with SQL through CDC and Kafka. How can I use SQL for streaming computation in Spark? 308027...@qq.com
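The closest Spark equivalent is registering the stream as a temp view and writing the transformation in SQL; a sketch with placeholder broker and topic names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))

events.createOrReplaceTempView("events")

# SQL over a streaming source returns another streaming DataFrame.
counts = spark.sql("SELECT value, count(*) AS cnt FROM events GROUP BY value")
query = counts.writeStream.outputMode("complete").format("console").start()
```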

How to access the internal hidden columns of table by spark jdbc

2024-04-20 Thread casel.chen
I want to use spark jdbc to access the alibaba cloud hologres (https://www.alibabacloud.com/product/hologres) internal hidden column `hg_binlog_timestamp_us` but met the following error: Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'hg_binlog_ti

Accounting the impact of failures in spark jobs

2024-04-19 Thread Faiz Halde
Hello, In my organization, we have an accounting system for spark jobs that uses the task execution time to determine how much time a spark job uses the executors for and we use it as a way to segregate cost. We sum all the task times per job and apply proportions. Our clusters follow a 1 task

StreamingQueryListener integration with Spark native metric sink (JmxSink)

2024-04-18 Thread Mason Chen
Hi all, Is it possible to integrate StreamingQueryListener with Spark metrics so that metrics can be reported through Spark's internal metric system? Ideally, I would like to report some custom metrics through StreamingQueryListener and export them to Spark's JmxSink. Best, Mason
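Registering the listener itself is straightforward; the open question is only the export path to JmxSink. A sketch of the listener side (PySpark 3.5; `spark` is an existing session):

```python
from pyspark.sql.streaming import StreamingQueryListener

class ProgressListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass
    def onQueryProgress(self, event):
        # Custom metrics could be pushed to a metrics registry here.
        print(event.progress.batchId, event.progress.inputRowsPerSecond)
    def onQueryIdle(self, event):
        pass
    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(ProgressListener())
```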

[ANNOUNCE] Apache Spark 3.4.3 released

2024-04-18 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.3! Spark 3.4.3 is a maintenance release containing many fixes including security and correctness domains. This release is based on the branch-3.4 maintenance branch of Spark. We strongly recommend all 3.4 users to upgrade

[Spark SQL][How-To] Remove builtin function support from Spark

2024-04-17 Thread Matthew McMillian
Hello, I'm very new to the Spark ecosystem, apologies if this question is a bit simple. I want to modify a custom fork of Spark to remove function support. For example, I want to remove the query runners ability to call reflect and java_method. I saw that there exists a data structure in spark

[Spark SQL] xxhash64 default seed of 42 confusion

2024-04-16 Thread Igor Calabria
(slice library, used by trino) https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/XxHash64.java Was there a special motivation behind this? or is 42 just used for the sake of the hitchhiker's guide reference? It's very common for spark to interact with other tools (either via

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
h consumes committed messages from kafka directly (which is not so scalable, I think). But the main point of this approach which I need is that the spark session needs to be used to save the rdd (parallelized consumed messages) to an iceberg table. Consumed messages will be converted to a spark rdd which wil

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Mich Talebzadeh
Interesting. My concern is an infinite loop in *foreachRDD*: the *while(true)* loop within foreachRDD creates an infinite loop within each Spark executor. This might not be the most efficient approach, especially since offsets are committed asynchronously? HTH Mich Talebzadeh, Technologist

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Because spark streaming for kafka transactions does not work correctly to suit my need, I moved to another approach using a raw kafka consumer which handles read_committed messages from kafka correctly. My code looks like the following. JavaDStream stream = ssc.receiverStream(new CustomReceiver

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Kidong Lee
) (chango-private-1.chango.private executor driver): java.lang.IllegalArgumentException: requirement failed: Got wrong record for spark-executor-school-student-group school-student-7 even after seeking to offset 11206961 got offset 11206962 instead. If this is a compacted topic, consider enabling

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Mich Talebzadeh
Hi Kidong, There may be a few potential reasons why the message counts from your Kafka producer and Spark Streaming consumer might not match, especially with transactional messages and the read_committed isolation level. 1) Just ensure that both your Spark Streaming job and the Kafka consumer written

Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-12 Thread Kidong Lee
Hi, I have a kafka producer which sends messages transactionally to kafka, and a spark streaming job which should consume read_committed messages from kafka. But there is a problem with spark streaming consuming read_committed messages. The count of messages sent by the kafka producer transactionally
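For comparison, with Structured Streaming the isolation level is passed straight through to the underlying consumer; a sketch (broker and topic names assumed):

```python
# kafka.-prefixed options are handed to the Kafka consumer as-is, so
# read_committed skips aborted transactional messages.
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "school-student")
      .option("kafka.isolation.level", "read_committed")
      .load())
```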

Spark column headings, camelCase or snake case?

2024-04-11 Thread Mich Talebzadeh
convention for Spark DataFrames (usually snake_case). Use snake_case for better readability like: "total_price_in_millions_gbp" So this is the gist +--+-+---+ |district |NumberOfOffshoreOwned|total_p
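A small sketch of enforcing that convention programmatically (the regex is illustrative):

```python
import re

def to_snake_case(name: str) -> str:
    # CamelCase / mixedCase -> snake_case,
    # e.g. NumberOfOffshoreOwned -> number_of_offshore_owned
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

df_snake = df.toDF(*[to_snake_case(c) for c in df.columns])
```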

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
I think this answers your question about what to do if you need more space on nodes. https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage Local Storage <https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage> Spark supports using volumes to
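From that doc page, local storage is grown by mounting volumes whose names start with spark-local-dir-; a sketch (paths are placeholders):

```python
from pyspark.sql import SparkSession

# Mount an emptyDir as executor scratch space; the spark-local-dir-
# name prefix makes Spark use it for shuffle/spill files.
spark = (SparkSession.builder
         .config("spark.kubernetes.executor.volumes.emptyDir"
                 ".spark-local-dir-1.mount.path", "/data/spark-tmp")
         .config("spark.kubernetes.executor.volumes.emptyDir"
                 ".spark-local-dir-1.mount.readOnly", "false")
         .getOrCreate())
```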

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-11 Thread Ashley McManamon
Hi Mich, Thanks for the reply. I did come across that file but it didn't align with the appearance of `PartitionedFile`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala In fact, the code snippet you shared also

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
thanks everyone for their contributions >> >> I was going to reply to @Enrico Minack but >> noticed additional info. As I understand for example, Apache Uniffle is an >> incubating project aimed at providing a pluggable shuffle service for >> Spark. So basically, all these &quo

Re: External Spark shuffle service for k8s

2024-04-10 Thread Arun Ravi
ons > > I was going to reply to @Enrico Minack but > noticed additional info. As I understand for example, Apache Uniffle is an > incubating project aimed at providing a pluggable shuffle service for > Spark. So basically, all these "external shuffle services" have in c

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
Interesting. So below should be the corrected code with the suggestion in [SPARK-47718] .sql() does not recognize watermark defined upstream - ASF JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-47718> # Define schema for parsing Kafka messages schema = Stru
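For readers following along, the general shape of the corrected approach is to define the watermark on the DataFrame before registering the view, then aggregate on the event-time window in SQL (a sketch; `kafka_df`, `schema` and the intervals are assumed, and the ticket has the authoritative fix):

```python
from pyspark.sql import functions as F

parsed = (kafka_df
          .select(F.from_json(F.col("value").cast("string"), schema).alias("m"))
          .select("m.*")
          .withWatermark("timestamp", "10 minutes"))

parsed.createOrReplaceTempView("events")

windowed = spark.sql("""
    SELECT window(timestamp, '5 minutes') AS w, count(*) AS cnt
    FROM events
    GROUP BY window(timestamp, '5 minutes')
""")
```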

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread 刘唯
Sorry, this is not a bug but essentially a user error. Spark throws a really confusing error and I'm also confused. Please see the reply in the ticket for how to make things correct. https://issues.apache.org/jira/browse/SPARK-47718 刘唯 wrote on Sat, 6 Apr 2024 at 11:41: > This indeed looks like a bug

Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Femi Anthony
If you're using just Spark you could try turning on the history server <https://spark.apache.org/docs/latest/monitoring.html> and try to glean statistics from there. But there is no one location or log file which stores them all. Databricks, which is a managed Spark solution, pr

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi, I believe this is the package https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala And the code case class FilePartition(index: Int, files: Array[PartitionedFile]) extends Partition

Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Mich Talebzadeh
Well you can do a fair bit with the available tools. The Spark UI, particularly the Stages and Executors tabs, does provide some valuable insights related to database health metrics for applications using a JDBC source. Stage Overview: This section provides a summary of all the stages executed

[Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Ashley McManamon
Hi All, I've been diving into the source code to get a better understanding of how file splitting works from a user perspective. I've hit a dead end at `PartitionedFile`, for which I cannot seem to find a definition? It appears as though it should be found at

Re: External Spark shuffle service for k8s

2024-04-08 Thread Mich Talebzadeh
Hi, First thanks everyone for their contributions I was going to reply to @Enrico Minack but noticed additional info. As I understand for example, Apache Uniffle is an incubating project aimed at providing a pluggable shuffle service for Spark. So basically, all these "external sh

Re: External Spark shuffle service for k8s

2024-04-08 Thread Vakaris Baškirov
I see that both Uniffle and Celeborn support S3/HDFS backends, which is great. In the case someone is using S3/HDFS, I wonder what would be the advantages of using Celeborn or Uniffle vs the IBM shuffle service plugin <https://github.com/IBM/spark-s3-shuffle> or Cloud Shuffle Storage Plugin fr

How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread casel.chen
Hello, I have a spark application with a jdbc source and do some calculation. To monitor application health, I need db-related metrics per database like number of connections, sql execution time and sql fired time distribution etc. Does anybody know how to get them? Thanks!
