Re: [ANNOUNCE] Apache Spark 4.0.1 released

2025-09-07 Thread Hyukjin Kwon
Yay! On Sun, 7 Sept 2025 at 13:54, Dongjoon Hyun wrote: > We are happy to announce the availability of Apache Spark 4.0.1! > > Spark 4.0.1 is the first maintenance release based on the branch-4.0 > maintenance branch of Spark. It contains many fixes including security and > corr

[ANNOUNCE] Apache Spark 4.0.1 released

2025-09-06 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 4.0.1! Spark 4.0.1 is the first maintenance release based on the branch-4.0 maintenance branch of Spark. It contains many fixes including security and correctness domains. We strongly recommend all 4.0 users to upgrade to this stable

[Spark SQL] [How-to] Can columns be excluded from a scan performed as part of an update?

2025-08-28 Thread William Muesing
Hello! This is my first time using a mailing list like this, apologies if I’ve missed something to conform to standards for it. I’m using the Java API to interact with a data source that’s column-based, and expensive to request entire rows from. However, using the interface that my Table needs to

Spark K8s auto scaling using Keda or similar tools

2025-08-06 Thread Nimrod Ofek
Hi everyone, I hope this message finds you well. We have several use cases involving Spark Structured Streaming that would benefit from auto-scaling. We understand that Dynamic Resource Allocation does not work optimally with Spark Structured Streaming, so we are exploring alternative solutions

RE: [PySpark] [Beginner] [Debug] Does Spark ReadStream support reading from a MinIO bucket?

2025-08-05 Thread Bhatt, Kashyap
>> option("path", "s3://bucketname") Shouldn't the scheme prefix be s3a instead of s3? From: 刘唯 Sent: Tuesday, August 5, 2025 5:34 PM To: Kleckner, Jade Cc: user@spark.apache.org Subject: Re: [PySpark] [Beginner] [Debug] Doe

Re: [PySpark] [Beginner] [Debug] Does Spark ReadStream support reading from a MinIO bucket?

2025-08-05 Thread 刘唯
This is not necessarily about the readStream / read API. As long as you correctly imported the needed dependencies and set up spark config, you should be able to readStream from s3 path. See https://stackoverflow.com/questions/46740670/no-filesystem-for-scheme-s3-with-pyspark Kleckner, Jade 于

[PySpark] [Beginner] [Debug] Does Spark ReadStream support reading from a MinIO bucket?

2025-08-05 Thread Kleckner, Jade
Hello all, I'm developing a pipeline to possibly read a stream from a MinIO bucket. I have no issues setting Hadoop s3a variables and reading files but when I try to create a bucket for Spark to use as a readStream location it produces the following errors: Example code: i
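
[Editor's note] The recurring fix in this thread is the s3a:// scheme plus the Hadoop S3A settings that a third-party store like MinIO needs. A minimal sketch follows; the endpoint and credentials are placeholder assumptions, while the option names are the standard Hadoop S3A ones.

```python
# Minimal sketch of the Spark conf a MinIO-backed readStream needs.
# Endpoint and credentials are placeholders (assumptions); the option
# names come from the Hadoop S3A connector.
minio_conf = {
    "spark.hadoop.fs.s3a.endpoint": "http://minio.local:9000",  # hypothetical
    "spark.hadoop.fs.s3a.access.key": "minio-access-key",       # hypothetical
    "spark.hadoop.fs.s3a.secret.key": "minio-secret-key",       # hypothetical
    "spark.hadoop.fs.s3a.path.style.access": "true",  # MinIO uses path-style URLs
}

# The path must use the s3a:// scheme; stock Spark/Hadoop builds register
# no FileSystem implementation for plain s3://.
stream_path = "s3a://bucketname/"

# With a session built from this conf (not shown here, needs a live cluster):
#   spark.readStream.schema(schema).format("json").load(stream_path)
```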

[SPARK-CORE] SerializationDebugger fails on Java 21

2025-08-04 Thread Clemens Ballarin
I am trying to convert an application to Spark and need to find out all the serialization issues. Unfortunately, the SerializationDebugger appears to no longer work with Java 21 and presumably also Java 17. The problem is reflective access to sun.security.action.GetBooleanAction, which is

[SPARK-CONNECT] [SPARK-4.0] Encountered end-of-stream mid-frame

2025-08-01 Thread Manas Bhardwaj
Hi Folks, Manas here from Data Platform Team of CRED <https://cred.club/> We have been running spark connect 4.0.0 in production and facing the following issue! A gRPC connection failure occurs when executing a PySpark DataFrame action that involves a complex, dynamically generated

Compatibility Issue: DescribeTopicsResult.all() missing in Kafka 4.0.0 used with Spark 4.0.0

2025-07-24 Thread Sandeep Ballu
Hi Spark team, We encountered a `NoSuchMethodError` when running a PySpark application with Spark 4.0.0 and kafka-clients-4.0.0: java.lang.NoSuchMethodError: org.apache.kafka.clients.admin.DescribeTopicsResult.all() This appears to be due to Spark’s Kafka integration module still calling

[Spark SQL]: Python Data Source API and spark.sql.execution.pyspark.python

2025-07-24 Thread Ilya
Dear Spark Community, Why is the Python Data Source API (pyspark.sql.datasource.DataSource) not using the "spark.sql.execution.pyspark.python" config, while UDFs do? DataSource: 1) the executor always looks for "python3", ignoring the "spark.sql.execution.pyspark.python" config 2) so pr

[Spark SQL]: Spark 4 logs warning and stack trace when loading dataframe from path containing wildcard

2025-07-22 Thread Glenn J
Hello. In Spark 4, loading a dataframe from a path that contains a wildcard produces a warning and a stack trace that doesn't happen in Spark 3. >>> spark.read.load('s3a://ullswater-dev/uw01/temp/test_parquet/*.parquet'

spark.api.mode property is not available in spark 4.0.0

2025-07-21 Thread Sangram Mohanty
Hi Team, I am trying to use the property "spark.api.mode" in the PySpark console, but it is not working. I have installed pyspark, pyspark-connect and other dependencies, set up Spark 4.0.0, and started a PySpark session on the command line, but it is not working and states this property is not

Re: Spark Job Stuck in Active State (v2.4.3, Cluster Mode)

2025-07-17 Thread Ángel Álvarez Pascua
Sounds super interesting ... El jue, 17 jul 2025, 14:17, Hitesh Vaghela escribió: > Hi Spark community! I’ve posted a detailed question on Stack Overflow > regarding a persistent issue where my Spark job remains in an “Active” > state even after successful dataset processing. No error

Spark Job Stuck in Active State (v2.4.3, Cluster Mode)

2025-07-17 Thread Hitesh Vaghela
Hi Spark community! I’ve posted a detailed question on Stack Overflow regarding a persistent issue where my Spark job remains in an “Active” state even after successful dataset processing. No errors in logs, and attempts to kill the job fail. I’d love your insights on root causes and how to

[Spark SQL]: Spark can't read views created via Trino using enableHiveSupport.

2025-07-15 Thread Tal Haimov
Dear Spark Community, I’m currently managing a data platform that uses Trino with Hive Metastore integration. Our Hive Metastore contains a mix of legacy Hive tables and views, alongside newer views created via Trino. As expected, Trino stores views in the metastore with viewOriginalText

RE: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-15 Thread Wolfgang Buchner
Hi Nimrod, I am also interested in your first point: what exactly does "false alarm" mean? Today I had the following scenario, which in my opinion is a false alarm: - Topic contains 'N' messages - Spark Streaming application consumed all 'N' messages

Re: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-14 Thread Khalid Mammadov
red-streaming-kafka-integration.html >> ): >> >> "latest" for streaming, "earliest" for batch >> >> >> On Thu, 10 Jul 2025, 11:04 Nimrod Ofek, wrote: >> >>> Hi everyone, >>> >>> I'm currently working wit

Re: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-13 Thread Nimrod Ofek
atest/streaming/structured-streaming-kafka-integration.html > ): > > "latest" for streaming, "earliest" for batch > > > On Thu, 10 Jul 2025, 11:04 Nimrod Ofek, wrote: > >> Hi everyone, >> >> I'm currently working with Spark Structured Streaming

Re: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-10 Thread Khalid Mammadov
://spark.apache.org/docs/latest/streaming/structured-streaming-kafka-integration.html ): "latest" for streaming, "earliest" for batch On Thu, 10 Jul 2025, 11:04 Nimrod Ofek, wrote: > Hi everyone, > > I'm currently working with Spark Structured Streaming integrated w

Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-10 Thread Nimrod Ofek
Hi everyone, I'm currently working with Spark Structured Streaming integrated with Kafka and had some questions regarding the failOnDataLoss option. The current documentation states: *"Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or
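
[Editor's note] For reference, failOnDataLoss is an option on the Kafka source. A hedged sketch of where it goes (the broker address and topic name are made up):

```python
# Kafka source options for Structured Streaming; broker and topic are
# hypothetical. "latest"/"earliest" defaults are noted in the replies above.
kafka_options = {
    "kafka.bootstrap.servers": "broker:9092",  # hypothetical
    "subscribe": "events",                     # hypothetical
    "startingOffsets": "latest",  # streaming default; "earliest" for batch
    # Spark fails the query when offsets it expects are gone (topic deleted,
    # or records aged out by retention before being read). "false" logs a
    # warning and continues from the earliest available offset instead.
    "failOnDataLoss": "false",
}

# Usage (requires a live session and the Kafka connector on the classpath):
#   spark.readStream.format("kafka").options(**kafka_options).load()
```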

Question about Spark Tag in TreeNode

2025-07-07 Thread Yifan Li
Hi, Spark friends This is Yifan. I am a software developer from Workday. I am not very familiar with Spark and I have a question about the Tag in TreeNode. We have a use case where we will add some information to Tag and we hope the tag will be persisted in Spark. But I noticed that the tag is

Spark checkpointing in batch mode fault tolerance problem

2025-07-05 Thread Martin Aras
Hi, I am new to Apache Spark. I created a Spark job that reads data from a MySQL database, does some processing on it, and then commits it to another table. The odd thing I faced was that Spark reads all the data from the table when I use `sparkSession.read.jdbc` and `sparkDf.rdd.map
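
[Editor's note] A common cause of the behavior described above is an unpartitioned JDBC read, plus lineage recomputation when dropping to the RDD API. A sketch of a partitioned read with the filter pushed into the query follows; all connection details, table and column names are hypothetical.

```python
# spark.read.jdbc pulls the whole table through one connection unless told
# how to partition the read. The subquery pushes the filter down to MySQL,
# and the bounds split the scan across tasks. Everything below is a
# placeholder except the parameter names, which match DataFrameReader.jdbc.
jdbc_opts = dict(
    url="jdbc:mysql://db-host:3306/mydb",                            # hypothetical
    table="(SELECT id, payload FROM src WHERE status = 'NEW') AS t", # pushdown
    column="id",            # numeric column to range-partition on
    lowerBound=1,
    upperBound=1_000_000,
    numPartitions=16,
    properties={"user": "app", "password": "secret"},                # hypothetical
)
# df = spark.read.jdbc(**jdbc_opts)
# df.cache()  # without this, each action on df.rdd recomputes the lineage
#             # and re-issues the JDBC read against the source table
```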

[ANNOUNCE] Apache Spark Kubernetes Operator 0.4.0 released

2025-07-03 Thread Dongjoon Hyun
Hi All. We are happy to announce the availability of Apache Spark Kubernetes Operator 0.4.0! - Website * https://s.apache.org/spark-kubernetes-operator/ - Artifact Hub * https://artifacthub.io/packages/helm/spark-kubernetes-operator/spark-kubernetes-operator/ - Release Note * https

Re: Performance evaluation of Trino 468, Spark 4.0.0-RC2, and Hive 4 on Tez/MR3

2025-07-02 Thread Sungwoo Park
Hello, We have published a follow-up blog that compares the latest versions: 1) Trino 476, 2) Spark 4.0.0, 3) Hive 4 on MR3 2.1. At the end, we discuss MPP and MapReduce. https://mr3docs.datamonad.com/blog/2025-07-02-performance-evaluation-2.1 --- Sungwoo On Tue, Apr 22, 2025 at 7:08 PM

Re: What is the current canonical way to join more than 2 watermarked streams (Spark 3.5.6)?

2025-06-26 Thread Jungtaek Lim
Hi, Starting from Spark 4.0.0, we support multiple stateful operators in append mode. You can perform the chain of stream-stream joins. One thing you need to care about is, the output of stream-stream join will have two different event time columns, which is ambiguous w.r.t. which column has to

Inquiry About User Impersonation Support in Spark Thrift Server (Spark 1.x to 4.x)

2025-06-26 Thread Allen Chu
Dear [Team / Support / Apache Spark Community], I hope this message finds you well. I'm reaching out to inquire about the support for *user impersonation* in the *Spark Thrift Server* across different versions of Apache Spark, specifically from *Spark 1.x through Spark 4.x*. We are curr

What is the current canonical way to join more than 2 watermarked streams (Spark 3.5.6)?

2025-06-25 Thread cheapsolutionarchit...@gmail.com
Hi, Given two Spark-Structured streams and using them as https://spark.apache.org/docs/3.5.6/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking, just works. Now if I want to join three streams using the same technique, Spark complains about multiple possible
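
[Editor's note] Per the reply above, Spark 4.0+ supports multiple stateful operators in append mode, so the chain can be sketched as below. Stream shapes, key names, and the interval bounds are hypothetical; on 3.5.x the second join is still rejected for multiple stateful operators.

```python
# Sketch of chaining two stream-stream joins (Spark 4.0+, append mode).
def build_chained_join(spark):
    from pyspark.sql.functions import col, expr

    def stream(alias):
        # Stand-in rate source; in practice a Kafka/file stream with a real schema.
        return (spark.readStream.format("rate").load()
                .select(col("value").alias(alias + "_key"),
                        col("timestamp").alias(alias + "_time"))
                .withWatermark(alias + "_time", "10 seconds"))

    a, b, c = stream("a"), stream("b"), stream("c")

    # The first join's output carries two event-time columns (a_time, b_time).
    # Drop one, as the reply above notes, so the event-time column used by
    # the next join's watermark is unambiguous.
    ab = (a.join(b, expr("a_key = b_key AND "
                         "b_time BETWEEN a_time AND a_time + INTERVAL 5 seconds"))
           .drop("b_time"))

    # Second join chains off the surviving event-time column a_time.
    return ab.join(c, expr("a_key = c_key AND "
                           "c_time BETWEEN a_time AND a_time + INTERVAL 5 seconds"))
```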

Spark on kubernete, configmap add log4j2.properties data

2025-06-23 Thread melin li
hello, The Spark streaming task runs for a long time and needs to dynamically adjust the log level (monitorInterval=30). It is more convenient to modify the log4j configuration through a ConfigMap.
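
[Editor's note] A hedged sketch of the kind of log4j2.properties such a ConfigMap could carry; the pattern and levels are illustrative, and only monitorInterval is essential to the hot-reload behavior described above.

```properties
# Hypothetical log4j2.properties mounted from a ConfigMap. monitorInterval
# makes Log4j2 re-read this file every 30s, so editing the ConfigMap adjusts
# log levels on a long-running driver/executor without a restart (note that
# kubelet propagation of ConfigMap updates to the mounted file adds a delay).
monitorInterval = 30
rootLogger.level = info
rootLogger.appenderRef.stdout.ref = console
appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```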

[SQL]: Registering spark extensions which utilise DataSourceV2Strategy in Spark 4

2025-06-16 Thread Jack Buggins
Dear community, I am working on a popular open-source connector that provides a custom Data Source V2 Strategy which is providing a useful planning extension to Spark, yet I can't seem to reconcile the API updates in spark 4 in relation to adding extensions. We add a custom planner str

[ANNOUNCE] Apache Spark Kubernetes Operator 0.3.0 released

2025-06-04 Thread Dongjoon Hyun
Hi All. We are happy to announce the availability of Apache Spark Kubernetes Operator 0.3.0! - Notable Changes * Built and tested with Apache Spark 4.0 and Spark Connect Swift Client * Running on Java 24 * Promoting CRDs to v1beta1 from v1alpha1 - Website * https://s.apache.org/spark

[ANNOUNCE] Apache Spark Connect Swift Client 0.3.0 released

2025-06-04 Thread Dongjoon Hyun
Hi All. We are happy to announce the availability of Apache Spark Connect Swift Client 0.3.0! This is the first release tested with the official Apache Spark 4.0.0. Website - https://apache.github.io/spark-connect-swift/ Release Note - https://github.com/apache/spark-connect-swift/releases

Re: Inquiry: Extending Spark ML Support via Spark Connect to Scala/Java APIs (SPARK-50812 Analogue)

2025-06-04 Thread Daniel Filev
Dear Apache Spark Community/Development Team, I was wondering whether you had a chance to take a look at my previous email. I would appreciate any and all information which you could provide on the aforementioned points. I hope all is well on your end and do thank you for your time and

Inquiry: Extending Spark ML Support via Spark Connect to Scala/Java APIs (SPARK-50812 Analogue)

2025-05-30 Thread Daniel Filev
Dear Apache Spark Community/Development Team, I hope this message finds you well. I am writing to inquire about the roadmap and future plans for extending Spark ML support through Spark Connect to the Scala API in a manner analogous to SPARK-50812. Specifically, my team is very interested in

[ANNOUNCE] Apache Spark 4.0.0 released

2025-05-28 Thread Wenchen Fan
Hi All, We are happy to announce the availability of *Apache Spark 4.0.0*! Apache Spark 4.0.0 is the first release of the 4.x line. This release resolves more than 5100 tickets with contributions from more than 390 individuals. To download Spark 4.0.0, head over to the download page: https

Re: Reg: spark delta table read failing

2025-05-21 Thread Bjørn Jørgensen
tting > generated successfully but the debug log is showing the unexpected > response. I tried from managed identity using python to access the storage > account. It is able to access the storage account without any issue but > from spark i am getting the following error. > > full log gist:

Reg: spark delta table read failing

2025-05-21 Thread Akram Shaik
account. It is able to access the storage account without any issue but from spark i am getting the following error. full log gist: Full Log <https://gist.github.com/akramshaik541/e231d578403f795adff5e6ecd493d445> Spark version using 3.5.5 Hadoop-azure 3.4.1 Hadoop-common 3.4.1 25/05/21 18

[ANNOUNCE] Announcing Apache Spark Kubernetes Operator 0.2.0

2025-05-20 Thread Dongjoon Hyun
Hi All. We are happy to announce the availability of Apache Spark Kubernetes Operator 0.2.0! - Website * https://s.apache.org/spark-kubernetes-operator/ - Artifact Hub * https://artifacthub.io/packages/helm/spark-kubernetes-operator/spark-kubernetes-operator/ - Release Note * https

[ANNOUNCE] Announcing Apache Spark Connect Swift Client 0.2.0

2025-05-20 Thread Dongjoon Hyun
Hi All. We are happy to announce the availability of Apache Spark Connect Swift Client 0.2.0! Website - https://apache.github.io/spark-connect-swift/ Release Note - https://github.com/apache/spark-connect-swift/releases/tag/0.2.0 - https://s.apache.org/spark-connect-swift-0.2.0 Swift

Re: Performance evaluation of Trino 468, Spark 4.0.0-RC2, and Hive 4 on Tez/MR3

2025-05-09 Thread Sungwoo Park
To answer the question on the configuration of Spark 4.0.0-RC2, this is spark-defaults.conf used in the benchmark. Any suggestion on adding or changing configuration values will be appreciated. spark.driver.cores=36 spark.driver.maxResultSize=0 spark.driver.memory=196g

Help requested: Spark security triage and followup

2025-05-09 Thread Apache Security Team
Dear Spark users and developers, As you know, the Apache Software Foundation takes our users' security seriously, and defines sensible release and security processes to make sure potential security issues are dealt with responsibly. These indirectly also protect our committers, shie

Re: [ANNOUNCE] Announcing Apache Spark Kubernetes Operator 0.1.0

2025-05-08 Thread Mridul Muralidharan
I had not checked the release. The release notes mention that Apache Spark 4.0 is supported, which has not yet been released. While I don't expect drastic changes, and most likely the support will continue to work, the messaging is not accurate. - Mridul On Wed, May 7, 2025 at 8:54 PM

[ANNOUNCE] Announcing Apache Spark Connect Swift Client 0.1.0

2025-05-07 Thread Dongjoon Hyun
Hi All. We are happy to announce the availability of Apache Spark Connect Swift Client 0.1.0! Release Note - https://github.com/apache/spark-connect-swift/releases/tag/v0.1.0 - https://s.apache.org/spark-connect-swift-0.1.0 Swift Package Index - https://swiftpackageindex.com/apache/spark

[ANNOUNCE] Announcing Apache Spark Kubernetes Operator 0.1.0

2025-05-07 Thread Dongjoon Hyun
Hi All. We are happy to announce the availability of Apache Spark Kubernetes Operator 0.1.0! - Release Note: * https://github.com/apache/spark-kubernetes-operator/releases/tag/v0.1.0 * https://s.apache.org/spark-kubernetes-operator-0.1.0 - Published Docker Image: * apache/spark-kubernetes

Re: [Spark SQL] spark.sql insert overwrite on existing partition not updating hive metastore partition transient_lastddltime and column_stats

2025-05-02 Thread Sathi Chowdhury
|UNKNOWN | |Created By |Spark 3.5.3 | |Type|EXTERNAL | |Provider

[Spark SQL] spark.sql insert overwrite on existing partition not updating hive metastore partition transient_lastddltime and column_stats

2025-05-01 Thread Pradeep
|db1 | |Table |table1 | |Owner |root | |Created Time|Tue Apr 15 15:30:00 UTC 2025 | |Last Access |UNKNOWN | |Created By

Issue with Spark Operator

2025-04-29 Thread nilanjan sarkar
Hello, I am trying to deploy a Spark streaming application using the Spark Kubernetes Operator, but the application crashes after a while. After describing CRD using *kubectl -n my-namespace describe sparkapplication my-app,* I see the following - Qos Class: Guaranteed

Performance evaluation of Trino 468, Spark 4.0.0-RC2, and Hive 4 on Tez/MR3

2025-04-22 Thread Sungwoo Park
Hello, We published a blog that reports the performance evaluation of Trino 468, Spark 4.0.0-RC2, and Hive 4 on Tez/MR3 2.0 using the TPC-DS Benchmark, 10TB scale factor. Hope you find it useful. https://mr3docs.datamonad.com/blog/2025-04-18-performance-evaluation-2.0 --- Sungwoo

Re: Checkpointing in foreachPartition in Spark batck

2025-04-17 Thread Abhishek Singla
@Ángel Álvarez Pascua Thanks, however I am thinking of some other solution which does not involve saving the dataframe result. Will update this thread with details soon. @daniel williams Thanks, I will surely check spark-testing-base out. Regards, Abhishek Singla On Thu, Apr 17, 2025 at 11

Re: Checkpointing in foreachPartition in Spark batck

2025-04-17 Thread daniel williams
I have not. Most of my work and development on Spark has been on the scala side of the house and I've built a suite of tools for Kafka integration with Spark for stream analytics along with spark-testing-base <https://github.com/holdenk/spark-testing-base> On Thu, Apr 17, 2025 at 12:

Re: Checkpointing in foreachPartition in Spark batck

2025-04-17 Thread Ángel Álvarez Pascua
Have you used the new equality functions introduced in Spark 3.5? https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.testing.assertDataFrameEqual.html El jue, 17 abr 2025, 13:18, daniel williams escribió: > Good call out. Yeah, once you take your work out of Spark it’s

Re: Checkpointing in foreachPartition in Spark batck

2025-04-17 Thread daniel williams
Good call out. Yeah, once you take your work out of Spark it's all on you. Any partition-level operations (e.g. map, flatMap, foreach) end up as a lambda in Catalyst. I've found, however, not using explode and doing things procedurally at this point with a sufficient amount of unit testing

Re: Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread Ángel Álvarez Pascua
Just a quick note on working at the RDD level in Spark — once you go down to that level, it’s entirely up to you to handle everything. You gain more control and flexibility, but Spark steps back and hands you the steering wheel. If tasks fail, it's usually because you're allowing them t

Re: Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread Abhishek Singla
failures. I wanted to know if there is an existing way in spark batch to checkpoint already processed rows of a partition if using foreachPartition or mapParitions, so that they are not processed again on rescheduling of task due to failure or retriggering of job due to failures. Regards, Abhish

Re: Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread daniel williams
o with task/job failures. I wanted to know if there is an > existing way in spark batch to checkpoint already processed rows of a > partition if using foreachPartition or mapParitions, so that they are not > processed again on rescheduling of task due to failure or retriggering of > job due to f

Re: Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread Ángel Álvarez Pascua
cribió: > Hi Team, > > We are using foreachPartition to send dataset row data to third system via > HTTP client. The operation is not idempotent. I wanna ensure that in case > of failures the previously processed dataset should not get processed > again. > > Is there a way to

Re: Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread daniel williams
client. The operation is not idempotent. I wanna ensure that in case > of failures the previously processed dataset should not get processed > again. > > Is there a way to checkpoint in Spark batch > 1. checkpoint processed partitions so that if there are 1000 partitions > and 100 were p

Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread Abhishek Singla
Hi Team, We are using foreachPartition to send dataset row data to a third system via an HTTP client. The operation is not idempotent. I want to ensure that in case of failures the previously processed dataset does not get processed again. Is there a way to checkpoint in Spark batch 1. checkpoint
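
[Editor's note] Spark batch has no built-in per-partition checkpoint, so one workaround is to key each partition's work on (run id, partition id) in an external store and skip already-marked partitions on task retry. A minimal sketch, with an in-memory set standing in for the durable store:

```python
# Sketch: make a non-idempotent HTTP sink effectively partition-idempotent.
# `processed` stands in for a durable marker store (a database table or an
# object-store marker in a real job); `post` stands in for the HTTP call.
processed = set()

def send_partition(job_run_id, partition_id, rows, post=print):
    key = (job_run_id, partition_id)
    if key in processed:      # task re-run after failure: skip whole partition
        return 0
    sent = 0
    for row in rows:
        post(row)             # the real HTTP call goes here
        sent += 1
    processed.add(key)        # mark only after the whole partition succeeds
    return sent

# In the real job this would run inside the partition lambda, e.g.
#   df.rdd.mapPartitionsWithIndex(lambda pid, it: [send_partition(run_id, pid, it)])
```

Note the caveat: if a task dies mid-partition, the marker is never written and the retry resends rows already posted, so this gives at-most-once per *completed* partition, not exactly-once per row; that would need per-row idempotency keys or a transactional sink.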

Re: Java coding with spark API

2025-04-10 Thread Jules Damji
Tim, Yes, you can use Java for your Spark workloads just fine. Cheers Jules Excuse the thumb typos On Fri, 04 Apr 2025 at 12:53 AM, tim wade wrote: > Hello > > I am just newbie to spark. I am programming with Java mainly, knowing > scala very bit. > > Can I just write code

Spark Streaming Dataset with Multiple S3 Sources is too Slow

2025-04-07 Thread Jevon Cowell
I have a Spark streaming dataset that is a union of 12 datasets (for 12 different S3 buckets). On startup, it takes nearly 18-20 mins for the Spark Streaming job to show up on the Spark Streaming UI and an additional 18-20 mins for the job to even start. When looking at the logs I see

Re: Java coding with spark API

2025-04-07 Thread Stephen Coy
Hi Tim, We have a large ETL project comprising about forty individual Apache Spark applications, all built exclusively in Java. They are executed on three different Spark clusters built on AWS EC2 instances. The applications are built in Java 17 for Spark 3.5.x. Cheers, Steve C > On 4

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-04-06 Thread karan alang
One issue I've seen is that after about 24 hours, the SparkApplication job pods seem to be getting evicted. I've installed the Spark history server and am verifying the case. It could be due to resource constraints; checking this. Please note: the kubeflow spark operator is installed in

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-04-06 Thread karan alang
Thanks, Megh ! I did some research and realized the same - PVC is not a good option for spark shuffle, primarily for latency issues. The same is the case with S3 or MinIO. I've implemented option 2, and am testing this out currently: Storing data in host path is possible regds, Karan

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-04-06 Thread megh vidani
t > me know. > > thanks! > > > On Mon, Mar 31, 2025 at 1:58 PM karan alang wrote: > >> hello all - checking to see if anyone has any input on this >> >> thanks! >> >> >> On Tue, Mar 25, 2025 at 12:22 PM karan alang >> wrote: >

kubernetes spark connect iceberg SparkWrite$WriterFactory not found

2025-04-06 Thread Razvan Mihai
Hello, I'm trying to run a simple Python client against a spark connect server running in Kubernetes as a proof-of-concept. The client writes a couple of records to a local Iceberg table. The Iceberg runtime is provisioned using "--packages" argument to the "start-connect-

Re: Java coding with spark API

2025-04-05 Thread Sonal Goyal
Java is very much supported in Spark. In our open source project, we haven’t done spark connect yet but we do a lot of transformations, ML and graph stuff using Java with Spark. Never faced the language barrier. Cheers, Sonal https://github.com/zinggAI/zingg On Sat, 5 Apr 2025 at 4:42 PM

Re: Java coding with spark API

2025-04-05 Thread Ángel Álvarez Pascua
I think you have more limitations using Spark Connect than Spark from Java. I used RDD, registered UDFs, ... from Java without any problems. El sáb, 5 abr 2025, 9:50, tim wade escribió: > Hello > > I only know Java programming. If I use Java to communicate with the > Spark API and

Re: Java coding with spark API

2025-04-05 Thread tim wade
Hello I only know Java programming. If I use Java to communicate with the Spark API and submit tasks to Spark API from Java, I'm not sure what disadvantages this might have. I see other people writing tasks in Scala, then compiling them and submitting to Spark using spark-submit. T

Re: Java coding with spark API

2025-04-05 Thread Ángel Álvarez Pascua
I think I did that some years ago in Spark 2.4 on a Hortonworks cluster with SSL and Kerberos enabled. It worked, but never went into production. El vie, 4 abr 2025, 9:54, tim wade escribió: > Hello > > I am just newbie to spark. I am programming with Java mainly, knowing > sc

Re: Java coding with spark API

2025-04-04 Thread Jevon Cowell
Hey Tim! What are you aiming to achieve exactly? Regards, Jevon C > On Apr 4, 2025, at 3:54 AM, tim wade wrote: > > Hello > > I am just newbie to spark. I am programming with Java mainly, knowing scala > very bit. > > Can I just write code with java to talk

Kubeflow Spark-Operator

2025-04-04 Thread Hamish Whittal
Hello folks, My colleague has posted this issue on Github: https://github.com/kubeflow/spark-operator/issues/2491 I'm wondering whether anyone here is using the kubeflow, Spark-Operator and could provide any insight into what's happening here. I know he's been stumped for a

Correctness Issue: UNIX_SECONDS() mismatch with TO_UTC_TIMESTAMP() result in Spark 3.5.1

2025-04-04 Thread Miguel Leite
Hi Spark Dev Team, I believe I've encountered a potential bug in spark 3.5.1 concerning the UNIX_SECONDS function when used with TO_UTC_TIMESTAMP. When converting a timestamp from a specific timezone (e.g., 'Europe/Amsterdam') to UTC and then getting its Unix seconds, the result
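
[Editor's note] A useful cross-check for this report is computing the expected value with Python's zoneinfo; the date below is an arbitrary example, and the Spark SQL line is the shape of the query the thread describes, not a confirmed reproduction.

```python
# Expected behavior, computed independently of Spark: a wall-clock time
# interpreted in Europe/Amsterdam (UTC+2 in summer) corresponds to a single
# instant, whose epoch seconds UNIX_SECONDS(TO_UTC_TIMESTAMP(...)) should match.
from datetime import datetime
from zoneinfo import ZoneInfo

local = datetime(2025, 4, 4, 12, 0, 0, tzinfo=ZoneInfo("Europe/Amsterdam"))
expected_unix_seconds = int(local.timestamp())  # 12:00 CEST == 10:00 UTC

# In Spark (hypothetical session; the thread reports a mismatch on 3.5.1):
#   spark.sql("SELECT UNIX_SECONDS(TO_UTC_TIMESTAMP("
#             "TIMESTAMP'2025-04-04 12:00:00', 'Europe/Amsterdam'))")
```

One thing worth ruling out before filing: UNIX_SECONDS interprets its timestamp argument in the session time zone (spark.sql.session.timeZone), which can make a correct conversion look wrong.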

Java coding with spark API

2025-04-04 Thread tim wade
Hello I am just a newbie to Spark. I am programming with Java mainly, knowing Scala very little. Can I just write code in Java to talk to Spark's Java API for submitting jobs? (the main job is a structured-streaming job). T

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-03-31 Thread Mich Talebzadeh
Yes, Apache Celeborn may be useful. You need to do some research though. https://celeborn.apache.org/ Have a look at this link as well: Spark Executor Shuffle Storage Options <https://iomete.com/resources/k8s/spark-executor-shuffle-storage-options> HTH, Dr Mich Talebzadeh, Architect | Data S

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-03-31 Thread karan alang
wrote: > >> hello All, >> >> I have kubeflow Spark Operator installed on k8s and from what i >> understand - Spark Shuffle is not officially supported on kubernetes. >> >> Looking for feedback from the community on what approach is being taken >> t

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-03-31 Thread karan alang
hello all - checking to see if anyone has any input on this thanks! On Tue, Mar 25, 2025 at 12:22 PM karan alang wrote: > hello All, > > I have kubeflow Spark Operator installed on k8s and from what i understand > - Spark Shuffle is not officially supported on kubernetes. >

Spark 3.3 job jar assembly with JDK 17 and JRE 11 runtime (java target/source = 8)

2025-03-28 Thread Kristopher Kane
Howdy All, The Spark 3.3 documentation states that it is Java 8/11/17 compatible, but I'm having a hard time finding an existing code base that is using JDK 17 for the userland compilation. Even the Spark 3.3 branch doesn't appear to compile/test with JDK 17 in the github actions for

Request for Support and Resources for Apache Spark User Groups in Bogotá and Mexico

2025-03-27 Thread Juan Diaz
Dear Apache Foundation Team, I hope this email finds you well. My name is Juan, and I am a co-organizer of two Apache Spark user groups: Apache Spark Bogotá <https://www.meetup.com/es/Apache-Spark-Bogota> and Apache Spark Mexico <https://www.meetup.com/es/apache-spark-mexicocity

performance issue Spark 3.5.2 on kubernetes

2025-03-26 Thread Prem Sahoo
Hello Team, I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO object storage. It was slower compared to writing to MapR FS with the same tech stack. I then moved on to an upgraded version, Spark 3.5.2 and Hadoop 3.4.1, which started writing to MinIO with V2

Spark Shuffle - in kubeflow spark operator installation on k8s

2025-03-25 Thread karan alang
hello All, I have kubeflow Spark Operator installed on k8s and from what i understand - Spark Shuffle is not officially supported on kubernetes. Looking for feedback from the community on what approach is being taken to handle this issue - especially since dynamicAllocation cannot be enabled

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-25 Thread Prem Sahoo
Just one more variable: Spark 3.5.2 runs on Kubernetes and Spark 3.2.0 runs on YARN. It seems Kubernetes can be a cause of slowness too. On Mar 24, 2025, at 7:10 PM, Prem Gmail wrote: Hello Spark Dev/users, Anyone have any clue why and how a newer version can have performance

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-24 Thread Prem Gmail
Hello Spark Dev/users, Anyone have any clue why and how a newer version can have a performance issue? I will be happy to raise a JIRA. On Mar 24, 2025, at 4:20 PM, Prem Sahoo wrote: The problem is on the writer's side. It takes longer to write to MinIO with Spark 3.5.2 and Hadoop

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-24 Thread Prem Sahoo
The problem is on the writer's side. It takes longer to write to MinIO with Spark 3.5.2 and Hadoop 3.4.1, so it seems there are some changes between Hadoop 2.7.6 and 3.4.1 that made the write process slower. On Sun, Mar 23, 2025 at 12:09 AM Ángel Álvarez Pascua < angel.alv

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Ángel Álvarez Pascua
@Prem Sahoo , could you test both versions of Spark+Hadoop by replacing your "write to MinIO" statement with write.format("noop")? This would help us determine whether the issue lies on the reader side or the writer side. El dom, 23 mar 2025 a las 4:53, Prem Gmail () escri
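
[Editor's note] The noop-sink experiment suggested above can be sketched as a small timing harness; the df variable and the s3a path are placeholders, and only format("noop") comes from the thread.

```python
# Benchmark sketch: run the same pipeline twice, once to MinIO and once to
# the "noop" sink, which performs all reads and transformations but discards
# the rows. If the noop run is equally fast on both Spark versions, the
# regression is on the write path (committer / S3A settings), not the reader.
import time

def timed_write(df, fmt, path=None):
    writer = df.write.format(fmt).mode("overwrite")
    start = time.time()
    writer.save(path) if path else writer.save()
    return time.time() - start

# With a live session and DataFrame (hypothetical path):
#   t_noop  = timed_write(df, "noop")
#   t_minio = timed_write(df, "parquet", "s3a://bucket/bench")
```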

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Prem Gmail
The V2 writer in 3.5.2 and Hadoop 3.4.1 should be much faster than Spark 3.2.0 and Hadoop 2.7.6, but that's not the case; I tried the magic committer option, which is again slower. So internally something changed which made this slow. May I know what? On Mar 22, 2025, at 11:05 PM

Re: High/Critical CVEs in jackson-mapper-asl (spark 3.5.5)

2025-03-18 Thread Ángel Álvarez Pascua
Seems like the Jackson version hasn't changed since Spark 1.4 (pom.xml <https://github.com/apache/spark/blob/branch-1.4/pom.xml>). Even Spark 4 is still using this super old (2013) version. Maybe it's time ... On Tue, 18 Mar 2025 at 16:05, Mohammad, Ejas Ali () wrote: >

High/Critical CVEs in jackson-mapper-asl (spark 3.5.5)

2025-03-18 Thread Mohammad, Ejas Ali
Hi Spark Community, I hope you are doing well. We have identified high and critical CVEs related to the jackson-mapper-asl package used in Apache Spark 3.5.5. We would like to understand if there are any official fixes or recommended mitigation steps available for these vulnerabilities. | CVE
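One mitigation sometimes used when no upstream fix exists (not an official recommendation from this thread, and only valid if your jobs do not exercise the legacy Jackson code paths — an assumption you must verify) is to exclude the vulnerable transitive jar in your own build:

```xml
<!-- pom.xml sketch: drop the legacy Codehaus Jackson jar pulled in
     transitively; confirm your workloads never touch it first -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.5.5</version>
  <exclusions>
    <exclusion>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Exclusion removes the jar from your artifact's scan surface, but a Spark distribution's own jars directory would still need to be handled separately (e.g. via a custom image).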

Spark Kubernetes Operator | Release Date

2025-03-17 Thread Dheeraj Panangat
Hi Team, Can you please help with a date when the community plans to release a stable PROD ready version for spark-kubernetes-operator <https://github.com/apache/spark-kubernetes-operator> ? Does Spark recommend using the kubeflow/spark-operator <https://github.com/kubeflow/spark

Re: Multiple CVE issues in apache/spark-py:3.4.0 + Pyspark 3.4.0

2025-03-15 Thread Soumasish
Two things come to mind as low-hanging fruit: update to Spark 3.5, which should reduce the CVEs. Alternatively, consider using Spark Connect, where you can address the client-side vulnerabilities yourself. Best Regards Soumasish Goswami in: www.linkedin.com/in/soumasish # (415) 530-0405 - On

Multiple CVE issues in apache/spark-py:3.4.0 + Pyspark 3.4.0

2025-03-15 Thread Mohammad, Ejas Ali
Hi Spark Community, I am using the official Docker image `apache/spark-py:v3.4.0` and installing `pyspark==3.4.0` on top of it. However, I have encountered multiple security vulnerabilities related to outdated dependencies in the base image. Issues: 1. Security Concerns: - Prisma scan

[ANNOUNCE] Version 2.0.0-beta1 of hnswlib spark released

2025-03-12 Thread jelmer
Hi spark users, A few years back I created a Java implementation of the HNSW algorithm in my spare time. HNSW is an algorithm for k-nearest-neighbour search — or, as people tend to refer to it now, vector search. It can be used to implement things like recommendation systems, image search

[CONNECT] Question on Spark Connect in Cluster Deploy Mode

2025-03-10 Thread Yasukazu Nagatomi
Hello everyone, I noticed that a recent PR appears to disable the start of Spark Connect when the deployment mode is set to "cluster". PR: [SPARK-42371][CONNECT] Add scripts to start and stop Spark Connect server by HyukjinKwon · Pull Request #39928 · apache/spark · GitHub https://

[ANNOUNCE] Apache Spark 3.5.5 released

2025-02-27 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.5.5! Spark 3.5.5 is the fifth maintenance release based on the branch-3.5 maintenance branch of Spark. It contains many fixes including security and correctness domains. We strongly recommend all 3.5 users to upgrade to this stable

Re: Spark connect: Table caching for global use?

2025-02-17 Thread Ángel
> Thanks Mich > > created on driver memory > > That I hadn't anticipated. Are you sure? > I understood that caching a table pegged the RDD partitions into the > memory of the executors holding the partition.

Re: Spark connect: Table caching for global use?

2025-02-17 Thread Subhasis Mukherjee
emory >> >> That I hadn't anticipated. Are you sure? >> I understood that caching a table pegged the RDD partitions into the >> memory of the executors holding the partition. >> >> >> >> >> On Sun, Feb 16, 2025 at 11:17 AM Mich Talebzadeh <

Re: Spark connect: Table caching for global use?

2025-02-16 Thread Mich Talebzadeh
tition. > > > > > On Sun, Feb 16, 2025 at 11:17 AM Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> yep. created on driver memory. watch for OOM if the size becomes too large >> >> spark-submit --driver-memory 8G ... >> >> HTH >

Re: Spark connect: Table caching for global use?

2025-02-16 Thread Tim Robertson
mory. watch for OOM if the size becomes too large > > spark-submit --driver-memory 8G ... > > HTH > > Dr Mich Talebzadeh, > Architect | Data Science | Financial Crime | Forensic Analysis | GDPR > >view my Linkedin profile > <https://www.linkedin.com/in/mich-tal

Re: Spark connect: Table caching for global use?

2025-02-16 Thread Mich Talebzadeh
Yep, created in driver memory. Watch for OOM if the size becomes too large: spark-submit --driver-memory 8G ... HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-520

Re: Spark connect: Table caching for global use?

2025-02-16 Thread Tim Robertson
occurrence_svampe"); On Sun, Feb 16, 2025 at 10:05 AM Tim Robertson wrote: > Hi folks > > Is it possible to cache a table for shared use across sessions with Spark > Connect? I'd like to load a read-only table once that many sessions will then > query to improve per
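One pattern worth trying for cross-session visibility is a global temporary view combined with CACHE TABLE (a sketch using standard Spark SQL; the view name and path are hypothetical, and whether Connect sessions against the same server actually share the cached blocks depends on your deployment — verify before relying on it):

```sql
-- Register once, then query from other sessions via the global_temp database
CREATE GLOBAL TEMPORARY VIEW occurrences AS
  SELECT * FROM parquet.`s3a://bucket/occurrences/`;

-- Pin the view's data in executor storage memory
CACHE TABLE global_temp.occurrences;

-- Subsequent sessions on the same Spark application can read it
SELECT COUNT(*) FROM global_temp.occurrences;
```

Global temp views live for the lifetime of the Spark application (not the individual session), which is why they are a candidate for the shared, load-once use case described above.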
