[Streaming (DStream)]: Does Spark Streaming support pause/resume consumption of messages from Kafka?

2023-12-01 Thread Saurabh Agrawal (180813)
Hi Spark Team, I am using Spark version 3.4.0 in my application, which is used to consume messages from Kafka topics. I have the queries below: 1. Does DStream support pausing/resuming streaming message consumption at runtime on a particular condition? If yes, please provide details. 2. I tried to revoke

Re: [ANNOUNCE] Apache Spark 3.4.2 released

2023-11-30 Thread beliefer
Congratulations! At 2023-12-01 01:23:55, "Dongjoon Hyun" wrote: We are happy to announce the availability of Apache Spark 3.4.2! Spark 3.4.2 is a maintenance release containing many fixes including security and correctness domains. This release is based on the branch-3.4 m

[ANNOUNCE] Apache Spark 3.4.2 released

2023-11-30 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.2! Spark 3.4.2 is a maintenance release containing many fixes including security and correctness domains. This release is based on the branch-3.4 maintenance branch of Spark. We strongly recommend all 3.4 users to upgrade

[sql] how to connect query stage to Spark job/stages?

2023-11-29 Thread Chenghao Lyu
Hi, I am seeking advice on measuring the performance of each QueryStage (QS) when AQE is enabled in Spark SQL. Specifically, I need help to automatically map a QS to its corresponding jobs (or stages) to get the QS runtime metrics. I recorded the QS structure via a customized injected Query

RE: Re: Spark Compatibility with Spring Boot 3.x

2023-11-29 Thread Guru Panda
Team, Do we have any updates when spark 4.x version will release in order to address below issues related to > java.lang.NoClassDefFoundError: javax/servlet/Servlet Thanks and Regards, Guru On 2023/10/05 17:19:51 Angshuman Bhattacharya wrote: > Thanks Ahmed. I am trying to bring t

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Pasha Finkelshtein
Finkelshteyn Developer Advocate for Data Engineering JetBrains asm0...@jetbrains.com https://linktr.ee/asm0dey Find out more <https://jetbrains.com> On Tue, 28 Nov 2023 at 17:04, Faiz Halde wrote: > Hey Pasha, > > Is your suggestion towards the spark team? I can make use of

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Faiz Halde
Hey Pasha, Is your suggestion towards the spark team? I can make use of the plugin system on the driver side of spark but considering spark is distributed, the executor side of spark needs to adapt to the pf4j framework I believe too Thanks Faiz On Tue, Nov 28, 2023, 16:57 Pasha Finkelshtein

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Pasha Finkelshtein
there, so even if it does work it’s something you’d need to pay > attention to on upgrades. Class path isolation is tricky to get right. > > On Mon, Nov 27, 2023 at 2:58 PM Faiz Halde wrote: > >> Hello, >> >> We are using spark 3.5.0 and were wondering if the following is &

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Faiz Halde
Thanks Holden, So you're saying even Spark connect is not going to provide that guarantee? The code referred to above is taken up from Spark connect implementation Could you explain which parts are tricky to get right? Just to be well prepared of the consequences On Tue, Nov 28, 2023, 01:30

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Holden Karau
So I don’t think we make any particular guarantees around class path isolation there, so even if it does work it’s something you’d need to pay attention to on upgrades. Class path isolation is tricky to get right. On Mon, Nov 27, 2023 at 2:58 PM Faiz Halde wrote: > Hello, > > We are us

Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Faiz Halde
Hello, We are using spark 3.5.0 and were wondering if the following is achievable using spark-core Our use case involves spinning up a spark cluster where the driver application loads user jars containing spark transformations at runtime. A single spark application can load multiple user jars

Re: Spark structured streaming tab is missing from spark web UI

2023-11-24 Thread Jungtaek Lim
The feature was added in Spark 3.0. Btw, you may want to check out the EOL date for Apache Spark releases - https://endoflife.date/apache-spark 2.x is already EOLed. On Fri, Nov 24, 2023 at 11:13 PM mallesh j wrote: > Hi Team, > > I am trying to test the performance of a spark

[Spark-sql 3.2.4] Wrong Statistic INFO From 'ANALYZE TABLE' Command

2023-11-24 Thread Nick Luo
Hi all, The ANALYZE TABLE command is run from Spark on a Hive table. Question: before I ran the ANALYZE TABLE command on the Spark-sql client, I ran the ANALYZE TABLE command on the Hive client, and the wrong statistic info shows up. For example: 1. run the analyze table command on the hive client - create table
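As a sketch of how the two stat sources can be reconciled (the table name `db.t` is hypothetical), re-running the analysis from Spark overwrites the Hive-written statistics, which can then be inspected:

```sql
-- Hedged sketch: recompute statistics from Spark SQL so that values
-- previously written by the Hive client are replaced.
ANALYZE TABLE db.t COMPUTE STATISTICS;

-- Check the resulting "Statistics" row in the table details.
DESC EXTENDED db.t;
```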

Re: Spark-submit without access to HDFS

2023-11-17 Thread Mich Talebzadeh
Hi, How are you submitting your spark job from your client? Your files can either be on HDFS or HCFS such as gs, s3 etc. With reference to --py-files hdfs://yarn-master-url hdfs://foo.py', I assume you want your spark-submit --verbose \ --deploy-mode cluster

RE: The job failed when we upgraded from spark 3.3.1 to spark3.4.1

2023-11-16 Thread Stevens, Clay
Perhaps you also need to upgrade Scala? Clay Stevens From: Hanyu Huang Sent: Wednesday, 15 November, 2023 1:15 AM To: user@spark.apache.org Subject: The job failed when we upgraded from spark 3.3.1 to spark3.4.1

Re: Spark-submit without access to HDFS

2023-11-16 Thread Jörn Franke
I am not 100% sure but I do not think this works - the driver would need access to HDFS.What you could try (have not tested it though in your scenario):- use SparkConnect: https://spark.apache.org/docs/latest/spark-connect-overview.html- host the zip file on a https server and use that url (I

Re: Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
Hi Eugene, As the logs indicate, when executing spark-submit, Spark will package and upload spark/conf to HDFS, along with uploading spark/jars. These files are uploaded to HDFS unless you specify uploading them to another OSS. To do so, you'll need to modify the configuration in hdfs

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread Eugene Miretsky
ioning properly. > It seems that the issue might be due to insufficient disk space. > > -- > eabour > > > *From:* Eugene Miretsky > *Date:* 2023-11-16 05:31 > *To:* user > *Subject:* Spark-submit without access to HDFS > Hey All, >

Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
to insufficient disk space. eabour From: Eugene Miretsky Date: 2023-11-16 05:31 To: user Subject: Spark-submit without access to HDFS Hey All, We are running Pyspark spark-submit from a client outside the cluster. The client has network connectivity only to the Yarn Master, not the HDFS

Spark-submit without access to HDFS

2023-11-15 Thread Eugene Miretsky
Hey All, We are running Pyspark spark-submit from a client outside the cluster. The client has network connectivity only to the Yarn Master, not the HDFS Datanodes. How can we submit the jobs? The idea would be to preload all the dependencies (job code, libraries, etc) to HDFS, and just submit

[Spark Structured Streaming] Two sink from Single stream

2023-11-15 Thread Subash Prabanantham
Hi Team, I am working on a basic streaming aggregation where I have one file stream source and two write sinks (Hudi table). The only difference is the aggregation performed is different, hence I am using the same spark session to perform both operations. (File Source) --> Agg1 -&g

The job failed when we upgraded from spark 3.3.1 to spark3.4.1

2023-11-15 Thread Hanyu Huang
The version our job originally ran was spark 3.3.1 and Apache Iceberg to 1.2.0, But since we upgraded to spark3.4.1 and Apache Iceberg to 1.3.1, jobs started to fail frequently, We tried to upgrade only iceberg without upgrading spark, and the job did not report an error. Detailed description

The job failed when we upgraded from spark 3.3.1 to spark3.4.1

2023-11-14 Thread Hanyu Huang
The version our job originally ran was spark 3.3.1 and Apache Iceberg to 1.2.0, But since we upgraded to spark3.4.1 and Apache Iceberg to 1.3.1, jobs started to fail frequently, We tried to upgrade only iceberg without upgrading spark, and the job did not report an error. Detailed description

Re: Okio Vulnerability in Spark 3.4.1

2023-11-14 Thread Bjørn Jørgensen
may be more straightforward to upgrade the > library that brings it in, assuming a later version brings in a later okio. > You can also manage up the version directly with a new entry in > > > However, does this affect Spark? all else equal it doesn't hurt to > upgrade, but wonde

help needed with SPARK-45598 and SPARK-45769

2023-11-09 Thread Maksym M
Greetings, tl;dr there must have been a regression in spark *connect*'s ability to retrieve data, more details in linked issues https://issues.apache.org/jira/browse/SPARK-45598 https://issues.apache.org/jira/browse/SPARK-45769 we have projects that depend on spark connect 3.5 and we'd

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-11-07 Thread Suyash Ajmera
> > On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, > wrote: > >> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am >> querying to Mysql Database and applying >> >> `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working >>

Re: Spark master shuts down when one of zookeeper dies

2023-11-07 Thread Mich Talebzadeh
Hi, Spark standalone mode does not use or rely on ZooKeeper by default. The Spark master and workers communicate directly with each other without using ZooKeeper. However, it appears that in your case you are relying on ZooKeeper to provide high availability for your standalone cluster

Spark master shuts down when one of zookeeper dies

2023-11-06 Thread Kaustubh Ghode
I am using spark-3.4.1. I have a setup with three ZooKeeper servers. Spark master shuts down when a Zookeeper instance is down; a new master is elected as leader and the cluster is up. But the original master that was down never comes back up. Can you please help me with this issue? Stackoverflow link

How to configure authentication from a pySpark client to a Spark Connect server ?

2023-11-05 Thread Xiaolong Wang
Hi, Our company is currently introducing the Spark Connect server to production. Most of the issues have been solved yet I don't know how to configure authentication from a pySpark client to the Spark Connect server. I noticed that there is some interceptor configs at the Scala client side

[Spark SQL] [Bug] Adding `checkpoint()` causes "column [...] cannot be resolved" error

2023-11-05 Thread Robin Zimmerman
Hi all, Wondering if anyone has run into this as I can't find any similar issues in JIRA, mailing list archives, Stack Overflow, etc. I had a query that was running successfully, but the query planning time was extremely long (4+ hours). To fix this I added `checkpoint()` calls earlier in the

Re: Spark / Scala conflict

2023-11-02 Thread Harry Jamison
Thanks Alonso, I think this gives me some ideas. My code is written in Python, and I use spark-submit to submit it. I am not sure what code is written in scala.  Maybe the Phoenix driver based on the stack trace? How do I tell which version of scala that was compiled against? Is there a jar

Re: Spark / Scala conflict

2023-11-02 Thread Aironman DirtDiver
The error message Caused by: java.lang.ClassNotFoundException: scala.Product$class indicates that the Spark job is trying to load a class that is not available in the classpath. This can happen if the Spark job is compiled with a different version of Scala than the version of Scala that is used
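The Scala binary version a jar targets is conventionally encoded in the artifact name after an underscore (e.g. `spark-sql_2.12-3.5.0.jar`). A small sketch of checking that naming convention follows; the helper name is ours, and jars that omit the suffix (like the Phoenix server jar mentioned in this thread) tell you nothing this way, so you would have to inspect their build metadata instead:

```python
import re

def scala_binary_version(jar_name: str):
    """Extract the Scala binary version from an artifact file name.

    Scala-dependent artifacts conventionally embed the Scala binary
    version after an underscore, e.g. spark-sql_2.12-3.5.0.jar.
    Returns None when the name carries no such suffix.
    """
    m = re.search(r"_(\d+\.\d+)(?:-|\.jar$)", jar_name)
    return m.group(1) if m else None

print(scala_binary_version("spark-sql_2.12-3.5.0.jar"))          # -> 2.12
print(scala_binary_version("phoenix-server-hbase-2.5-5.1.3.jar"))  # -> None
```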

Spark / Scala conflict

2023-11-01 Thread Harry Jamison
I am getting the error below when I try to run a spark job connecting to phoenix. It seems like I have an incorrect scala version that some part of the code is expecting. I am using spark 3.5.0, and I have copied these phoenix jars into the spark lib: phoenix-server-hbase-2.5-5.1.3.jar

Elasticity and scalability for Spark in Kubernetes

2023-10-30 Thread Mich Talebzadeh
I was thinking in line of elasticity and autoscaling for Spark in the context of Kubernetes. My experience with Kubernetes and Spark on the so-called autopilot has not been that great. This is mainly from my experience that in autopilot you let the choice of nodes be decided by the vendor's

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-29 Thread Nagatomi Yasukazu
Hi, eabour Thank you for the insights. Based on the information you provided, along with the PR [SPARK-42371][CONNECT] that add "./sbin/start-connect-server.sh" script, I'll experiment with launching the Spark Connect Server in Cluster Mode on Kubernetes. [SPARK-42371][CONNECT] A

Re: Spark join produce duplicate rows in resultset

2023-10-27 Thread Meena Rajani
id >> and rev. scode = I.scode; >> >> Thanks, >> Sadha >> >> On Sat, Oct 21, 2023 at 3:21 PM Meena Rajani >> wrote: >> >>> Hello all: >>> >>> I am using spark sql to join two tables. To my surprise I am >>> getting re

[Resolved] Re: spark.stop() cannot stop spark connect session

2023-10-25 Thread eab...@163.com
Hi all. I read source code at spark/python/pyspark/sql/connect/session.py at master · apache/spark (github.com) and the comment for the "stop" method is described as follows: def stop(self) -> None: # Stopping the session will only close the connection to the cur

spark schema conflict behavior records being silently dropped

2023-10-24 Thread Carlos Aguni
hi all, i noticed a weird behavior to when spark parses nested json with schema conflict. i also just noticed that spark "fixed" this in the most recent release 3.5.0 but since i'm working with AWS services being: * EMR 6: spark 3.3.* spark3.4.* * Glue 3: spark3.1.1 * Glue 4: spark 3

submitting tasks failed in Spark standalone mode due to missing failureaccess jar file

2023-10-24 Thread eab...@163.com
Hi Team. I use spark 3.5.0 to start Spark cluster with start-master.sh and start-worker.sh, when I use ./bin/spark-shell --master spark://LAPTOP-TC4A0SCV.:7077 and get error logs: ``` 23/10/24 12:00:46 ERROR TaskSchedulerImpl: Lost an executor 1 (already removed): Command exited with code

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Bjørn Jørgensen
t; Thanks, > Sadha > > On Sat, Oct 21, 2023 at 3:21 PM Meena Rajani > wrote: > >> Hello all: >> >> I am using spark sql to join two tables. To my surprise I am >> getting redundant rows. What could be the cause. >> >> >> select rev.* from rev >&

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Sadha Chilukoori
code left join item I on rev.sys = I.sys and rev.custumer_id = I.custumer_id and rev. scode = I.scode; Thanks, Sadha On Sat, Oct 21, 2023 at 3:21 PM Meena Rajani wrote: > Hello all: > > I am using spark sql to join two tables. To my surprise I am > getting redundant rows. What co

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Patrick Tucci
Hi Meena, It's not impossible, but it's unlikely that there's a bug in Spark SQL randomly duplicating rows. The most likely explanation is there are more records in the item table that match your sys/custumer_id/scode criteria than you expect. In your original query, try changing select rev
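The cause Patrick describes can be illustrated without Spark at all: an inner join emits one output row per matching pair, so a single left-side row joined against two right-side rows sharing the key comes back twice. A toy sketch in plain Python (column names borrowed from the query in this thread):

```python
# One rev row, but two item rows share its join key -> two output rows.
rev = [{"custumer_id": 1, "amount": 100}]
item = [{"custumer_id": 1, "scode": "A"},
        {"custumer_id": 1, "scode": "B"}]

# Nested-loop inner join: every matching (left, right) pair becomes a row.
joined = [dict(r, **i) for r in rev for i in item
          if r["custumer_id"] == i["custumer_id"]]

print(len(joined))  # -> 2, looking like "duplicates" of the single rev row
```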

Spark join produce duplicate rows in resultset

2023-10-21 Thread Meena Rajani
Hello all: I am using spark sql to join two tables. To my surprise I am getting redundant rows. What could be the cause. select rev.* from rev inner join customer c on rev.custumer_id =c.id inner join product p rev.sys = p.sys rev.prin = p.prin rev.scode= p.bcode left join item I on rev.sys

spark.stop() cannot stop spark connect session

2023-10-20 Thread eab...@163.com
Hi, my code: from pyspark.sql import SparkSession spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate() import pandas as pd # create a pandas DataFrame pdf = pd.DataFrame({ "name": ["Alice", "Bob", "Charlie"], "ag

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-19 Thread eab...@163.com
SparkConnectService. org.apache.spark.sql.connect.SparkConnectPlugin : To enable Spark Connect, simply make sure that the appropriate JAR is available in the CLASSPATH and the driver plugin is configured to load this class. org.apache.spark.sql.connect.SimpleSparkConnectService : A simple main class method

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-19 Thread eab...@163.com
Hi all, Has the spark connect server running on k8s functionality been implemented? From: Nagatomi Yasukazu Date: 2023-09-05 17:51 To: user Subject: Re: Running Spark Connect Server in Cluster Mode on Kubernetes Dear Spark Community, I've been exploring the capabilities of the Spark

Re: hive: spark as execution engine. class not found problem

2023-10-17 Thread Vijay Shankar
UNSUBSCRIBE On Tue, Oct 17, 2023 at 5:09 PM Amirhossein Kabiri < amirhosseikab...@gmail.com> wrote: > I used Ambari to config and install Hive and Spark. I want to insert into > a hive table using Spark execution Engine but I face to this weird error. > The error is:

hive: spark as execution engine. class not found problem

2023-10-17 Thread Amirhossein Kabiri
I used Ambari to configure and install Hive and Spark. I want to insert into a hive table using the Spark execution engine, but I am facing this weird error. The error is: Job failed with java.lang.ClassNotFoundException: ive_20231017100559_301568f9-bdfa-4f7c-89a6-f69a65b30aaf:1 2023-10-17 10:07:42,972

Re: Spark stand-alone mode

2023-10-17 Thread Ilango
compare to HPC local mode. They tested with some complex data science scripts using spark and other data science projects. The cluster is really stable and very performant. I enabled dynamic allocation and cap the memory and cpu accordingly at spark-defaults. Conf and at our spark framework code. So its
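The spark-defaults.conf settings alluded to above might look like the sketch below; the values are purely illustrative, and dynamic allocation typically also needs the external shuffle service enabled on standalone/YARN clusters:

```properties
# Illustrative spark-defaults.conf fragment: dynamic allocation with caps.
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   16
spark.shuffle.service.enabled          true
```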

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-13 Thread Suyash Ajmera
This issue is related to CharVarcharCodegenUtils readSidePadding method . Appending white spaces while reading ENUM data from mysql Causing issue in querying , writing the same data to Cassandra. On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, wrote: > I have upgraded my spark job from sp
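The padding effect described here can be reproduced in plain Python: when a CHAR(n) column comes back right-padded with spaces, equality against the unpadded literal fails even after upper-casing both sides, and trimming (not case conversion) is what restores the match:

```python
# Simulate what a CHAR(10) read-side padding returns for 'ERICSSON'.
stored = "ERICSSON".ljust(10)          # 'ERICSSON  ' (two trailing spaces)

print(stored.upper() == "ERICSSON")            # -> False: padding breaks it
print(stored.upper().rstrip() == "ERICSSON")   # -> True: trimming fixes it
```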

[ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-12 Thread Suyash Ajmera
I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am querying to Mysql Database and applying `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working as expected in spark 3.3.1 , but not working with 3.5.0. Where Condition :: `*UPPER(vn) = 'ERICSSON' AND (upper(st

Re: Autoscaling in Spark

2023-10-10 Thread Mich Talebzadeh
This has been brought up a few times. I will focus on Spark Structured Streaming Autoscaling does not support Spark Structured Streaming (SSS). Why because streaming jobs are typically long-running jobs that need to maintain state across micro-batches. Autoscaling is designed to scale up and down

Autoscaling in Spark

2023-10-10 Thread Kiran Biswal
Hello Experts Is there any true auto scaling option for spark? The dynamic auto scaling works only for batch. Any guidelines on spark streaming autoscaling and how that will be tied to any cluster level autoscaling solutions? Thanks

Re: Log file location in Spark on K8s

2023-10-09 Thread Prashant Sharma
pm Agrawal, Sanket, wrote: > Hi All, > > > > We are trying to send the spark logs using fluent-bit. We validated that > fluent-bit is able to move logs of all other pods except the > driver/executor pods. > > > > It would be great if someone can guide us whe

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Danilo Sousa
Unsubscribe > On 9 Oct 2023, at 07:03, Mich Talebzadeh wrote: > > Hi, > > Please see my responses below: > > 1) In Spark Structured Streaming does commit mean streaming data has been > delivered to the sink like Snowflake? > > No. a co

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Mich Talebzadeh
Your mileage varies. Often there is a flavour of Cloud Data warehouse already there. CDWs like BigQuery, Redshift, Snowflake and so forth. They can all do a good job for various degrees - Use efficient data types. Choose data types that are efficient for Spark to process. For example, use

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread ashok34...@yahoo.com.INVALID
Thank you for your feedback Mich. In general, how can one optimise the cloud data warehouses (the sink part) to handle streaming Spark data efficiently, avoiding the bottlenecks discussed. AK On Monday, 9 October 2023 at 11:04:41 BST, Mich Talebzadeh wrote: Hi, Please see my

Log file location in Spark on K8s

2023-10-09 Thread Agrawal, Sanket
Hi All, We are trying to send the spark logs using fluent-bit. We validated that fluent-bit is able to move logs of all other pods except the driver/executor pods. It would be great if someone can guide us where should I look for spark logs in Spark on Kubernetes with client/cluster mode

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Mich Talebzadeh
Hi, Please see my responses below: 1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake? No. a commit does not refer to data being delivered to a sink like Snowflake or bigQuery. The term commit refers to Spark Structured Streaming (SS

Clarification with Spark Structured Streaming

2023-10-08 Thread ashok34...@yahoo.com.INVALID
Hello team 1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake? 2) if sinks like Snowflake  cannot absorb or digest streaming data in a timely manner, will there be an impact on spark streaming itself? Thanks AK

Re: Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Igor Calabria
You might be affected by this issue: https://github.com/apache/iceberg/issues/8601 It was already patched but it isn't released yet. On Thu, Oct 5, 2023 at 7:47 PM Prashant Sharma wrote: > Hi Sanket, more details might help here. > > How does your spark configuration look like?

Re: Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Angshuman Bhattacharya
Thanks Ahmed. I am trying to bring this up with Spark DE community On Thu, Oct 5, 2023 at 12:32 PM Ahmed Albalawi < ahmed.albal...@capitalone.com> wrote: > Hello team, > > We are in the process of upgrading one of our apps to Spring Boot 3.x > while using Spark, and we have en

Re: Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Sean Owen
I think we already updated this in Spark 4. However for now you would have to also include a JAR with the jakarta.* classes instead. You are welcome to try Spark 4 now by building from master, but it's far from release. On Thu, Oct 5, 2023 at 11:53 AM Ahmed Albalawi wrote: > Hello team, >

Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Ahmed Albalawi
Hello team, We are in the process of upgrading one of our apps to Spring Boot 3.x while using Spark, and we have encountered an issue with Spark compatibility, specifically with Jakarta Servlet. Spring Boot 3.x uses Jakarta Servlet while Spark uses Javax Servlet. Can we get some guidance on how

Re: Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Prashant Sharma
Hi Sanket, more details might help here. How does your spark configuration look like? What exactly was done when this happened? On Thu, 5 Oct, 2023, 2:29 pm Agrawal, Sanket, wrote: > Hello Everyone, > > > > We are trying to stream the changes in our Iceberg tables stored

Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Agrawal, Sanket
Hello Everyone, We are trying to stream the changes in our Iceberg tables stored in AWS S3. We are achieving this through Spark-Iceberg Connector and using JAR files for Spark-AWS. Suddenly we have started receiving error "Connection pool shut down". Spark Version: 3.4.1 Iceberg:

[Spark Core]: Recomputation cost of a job due to executor failures

2023-10-04 Thread Faiz Halde
Hello, Due to the way Spark implements shuffle, a loss of an executor sometimes results in the recomputation of partitions that were lost The definition of a *partition* is the tuple ( RDD-ids, partition id ) RDD-ids is a sequence of RDD ids In our system, we define the unit of work performed

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jon Rodríguez Aranguren
Dear Jörn Franke, Jayabindu Singh and Spark Community members, Thank you profoundly for your initial insights. I feel it's necessary to provide more precision on our setup to facilitate a deeper understanding. We're interfacing with S3 Compatible storages, but our operational context is somewhat

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
Identity federation may ease this compared to a secret store. On 01.10.2023 at 08:27, Jon Rodríguez Aranguren wrote: Dear Jörn Franke, Jayabindu Singh and Spark Community members, Thank you profoundly for your initial insights. I feel it's necessary to provide more precision on our setup to facilitate

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
s arising from such loss, damage or destruction.   On Sun, 1 Oct 2023 at 06:36, Jayabindu Singh <jayabi...@gmail.com> wrote:Hi Jon,Using IAM as suggested by Jorn is the best approach.We recently moved our spark workload from HDP to Spark on K8 and utilizing IAM.It will save you from secret 

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Mich Talebzadeh
from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Sun, 1 Oct 2023 at 06:36, Jayabindu Singh wrote: > Hi Jon, > > Using IAM as suggested by Jorn is t

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jayabindu Singh
Hi Jon, Using IAM as suggested by Jorn is the best approach. We recently moved our spark workload from HDP to Spark on K8 and utilizing IAM. It will save you from secret management headaches and also allows a lot more flexibility on access control and option to allow access to multiple S3 buckets

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jörn Franke
uez Aranguren > : > >  > Dear Spark Community Members, > > I trust this message finds you all in good health and spirits. > > I'm reaching out to the collective expertise of this esteemed community with > a query regarding Spark on Kubernetes. As a newcomer, I ha

Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-29 Thread Jon Rodríguez Aranguren
Dear Spark Community Members, I trust this message finds you all in good health and spirits. I'm reaching out to the collective expertise of this esteemed community with a query regarding Spark on Kubernetes. As a newcomer, I have always admired the depth and breadth of knowledge shared within

Reading Glue Catalog Views through Spark.

2023-09-25 Thread Agrawal, Sanket
Hello Everyone, We have setup spark and setup Iceberg-Glue connectors as mentioned at https://iceberg.apache.org/docs/latest/aws/ to integrate Spark, Iceberg, and AWS Glue Catalog. We are able to read tables through this but we are unable to read data through views. PFB, the error

[PySpark][Spark logs] Is it possible to dynamically customize Spark logs?

2023-09-25 Thread Ayman Rekik
Hello, What would be the right way, if any, to inject a runtime variable into Spark logs. So that, for example, if Spark (driver/worker) logs some info/warning/error message, the variable will be output there (in order to help filtering logs for the sake of monitoring and troubleshooting
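For the JVM side, Spark's log4j configuration and its MDC pattern (`%X{key}`) are the usual route for this kind of contextual field. On the Python/driver side, one generic approach is a `logging.Filter` that stamps every record with the runtime variable, so it can appear in the format string and drive downstream filtering. A sketch, where the `job_id` field name is our own invention:

```python
import logging

class ContextFilter(logging.Filter):
    """Attach a runtime variable (a hypothetical job_id) to every record."""

    def __init__(self, job_id: str):
        super().__init__()
        self.job_id = job_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.job_id = self.job_id   # now usable as %(job_id)s in formats
        return True

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("[%(job_id)s] %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(ContextFilter("job-42"))

logger.warning("executor lost")   # prints: [job-42] WARNING executor lost
```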

Spark Connect Multi-tenant Support

2023-09-22 Thread Kezhi Xiong
Hi, From Spark Connect's official site's image, it mentions the "Multi-tenant Application Gateway" on the driver. Are there any more documents about it? Can I know how users can utilize such a feature? Thanks, Kezhi

[Spark 3.5.0] Is the protobuf-java JAR no longer shipped with Spark?

2023-09-20 Thread Gijs Hendriksen
Hi all, This week, I tried upgrading to Spark 3.5.0, as it contained some fixes for spark-protobuf that I need for my project. However, my code is no longer running under Spark 3.5.0. My build.sbt file is configured as follows: val sparkV  = "3.5.0" val hadoopV   

Spark streaming sourceArchiveDir does not move file to archive directory

2023-09-19 Thread Yunus Emre Gürses
Hello everyone, I'm using scala and spark with the version 3.4.1 in Windows 10. While streaming using Spark, I give the `cleanSource` option as "archive" and the `sourceArchiveDir` option as "archived" as in the code below. ``` spark.readStream .option("cleanSour

Re: Spark stand-alone mode

2023-09-19 Thread Patrick Tucci
Multiple applications can run at once, but you need to either configure Spark or your applications to allow that. In stand-alone mode, each application attempts to take all resources available by default. This section of the documentation has more details: https://spark.apache.org/docs/latest
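A minimal sketch of the per-application caps this answer refers to, set in conf/spark-defaults.conf so no single standalone application grabs the whole cluster (values illustrative):

```properties
# Illustrative caps so several standalone applications can run concurrently.
spark.cores.max          8
spark.executor.memory    8g
```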

Re: Spark stand-alone mode

2023-09-18 Thread Ilango
already. 3. We will stick with NFS for now and stand alone then may be will explore HDFS and YARN. Can you please confirm whether multiple users can run spark jobs at the same time? If so I will start working on it and let you know how it goes Mich, the link to Hadoop is not working. Can you

[Spark Core]: How does rpc threads influence shuffle?

2023-09-15 Thread Nebi Aydin
Hello all, I know that these parameters exist for shuffle tuning: *spark.shuffle.io.serverThreads*, *spark.shuffle.io.clientThreads*, *spark.shuffle.io.threads*. But we also have *spark.rpc.io.serverThreads*, *spark.rpc.io.clientThreads*, *spark.rpc.io.threads*. So specifically talking about *Shuffling,

Re: Spark stand-alone mode

2023-09-15 Thread Bjørn Jørgensen
use Hive Metastore called Derby :( ) is something respetable like > Postgres DB that can handle multiple concurrent spark jobs > > HTH > > > Mich Talebzadeh, > Distinguished Technologist, Solutions Architect & Engineer > London > United Kingdom > >

Re: Spark stand-alone mode

2023-09-15 Thread Mich Talebzadeh
multiple concurrent spark jobs HTH Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use

Re: Spark stand-alone mode

2023-09-15 Thread Sean Owen
Yes, should work fine, just set up according to the docs. There needs to be network connectivity between whatever the driver node is and these 4 nodes. On Thu, Sep 14, 2023 at 11:57 PM Ilango wrote: > > Hi all, > > We have 4 HPC nodes and installed spark individually in all node

Re: Spark stand-alone mode

2023-09-15 Thread Patrick Tucci
I use Spark in standalone mode. It works well, and the instructions on the site are accurate for the most part. The only thing that didn't work for me was the start_all.sh script. Instead, I use a simple script that starts the master node, then uses SSH to connect to the worker machines and start

Spark stand-alone mode

2023-09-14 Thread Ilango
Hi all, We have 4 HPC nodes and installed Spark individually on all nodes. Spark is used in local mode (each driver/executor has 8 cores and 65 GB) in sparklyr/PySpark using RStudio/Posit Workbench. Slurm is used as the scheduler. As this is local mode, we are facing performance issues (as only

Re: Write Spark Connection client application in Go

2023-09-14 Thread bo yang
at’s so cool! Great work y’all :) >> >> On Tue, Sep 12, 2023 at 8:14 PM bo yang wrote: >> >>> Hi Spark Friends, >>> >>> Anyone interested in using Golang to write Spark application? We created >>> a Spark Connect Go Client library >>>

Re: Write Spark Connection client application in Go

2023-09-13 Thread Martin Grund
This is absolutely awesome! Thank you so much for dedicating your time to this project! On Wed, Sep 13, 2023 at 6:04 AM Holden Karau wrote: > That’s so cool! Great work y’all :) > > On Tue, Sep 12, 2023 at 8:14 PM bo yang wrote: > >> Hi Spark Friends, >> >> Any

Re: Write Spark Connection client application in Go

2023-09-12 Thread Holden Karau
That’s so cool! Great work y’all :) On Tue, Sep 12, 2023 at 8:14 PM bo yang wrote: > Hi Spark Friends, > > Anyone interested in using Golang to write Spark application? We created a > Spark > Connect Go Client library <https://github.com/apache/spark-connect-go>. > Wou

APACHE Spark adoption/growth chart

2023-09-12 Thread Andrew Petersen
Hello Spark community Can anyone direct me to a simple graph/chart that shows APACHE Spark adoption, preferably one that includes recent years? Of less importance, a similar Databricks plot? An internet search gave me plots only up to 2015. I also searched spark.apache.org and databricks.com

Write Spark Connection client application in Go

2023-09-12 Thread bo yang
Hi Spark Friends, Anyone interested in using Golang to write Spark application? We created a Spark Connect Go Client library <https://github.com/apache/spark-connect-go>. Would love to hear feedback/thoughts from the community. Please see the quick start guide <https://github.com/apa

RE: Spark 3.4.1 and Hive 3.1.3

2023-09-08 Thread Agrawal, Sanket
Hi Yasukazu, I tried replacing the jar; the Spark code didn't work, but the vulnerability was removed. But I agree that even 3.1.3 has other vulnerabilities listed on the Maven page, though these are medium-level vulnerabilities. We are currently targeting Critical and High vulnerabilities

Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
@Alfie Davidson : Awesome, it worked with "org.elasticsearch.spark.sql". But as soon as I switched to *elasticsearch-spark-20_2.12*, "es" also worked. On Fri, Sep 8, 2023 at 12:45 PM Dipayan Dev wrote: > > Let me try that and get back. Just wondering, if there a ch
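The working write path confirmed above can be sketched as follows. This is a minimal illustration, not code from the thread: the index name, node address, and sample data are placeholders, and it assumes the matching connector jar (elasticsearch-spark-30 for Spark 3.x / Scala 2.12) is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object EsWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("es-write-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("1", "hello"), ("2", "world")).toDF("id", "message")

    // "org.elasticsearch.spark.sql" is the fully qualified format name;
    // the short alias "es" only resolves once the matching connector jar
    // is registered on the classpath.
    df.write
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200") // placeholder address
      .mode("append")
      .save("demo-index")                   // placeholder index name
  }
}
```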

Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
Let me try that and get back. Just wondering, is there a change in the way we pass the format in the connector from Spark 2 to 3? On Fri, 8 Sep 2023 at 12:35 PM, Alfie Davidson wrote: > I am pretty certain you need to change the write.format from “es” to > “org.elasticsearch.spark.sql”

RE: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Agrawal, Sanket
Hi, I tried replacing just this JAR but getting errors. From: Nagatomi Yasukazu Sent: Friday, September 8, 2023 9:35 AM To: Agrawal, Sanket Cc: Chao Sun ; Yeachan Park ; user@spark.apache.org Subject: [EXT] Re: Spark 3.4.1 and Hive 3.1.3 Hi Sanket, While migrating to Hive 3.1.3 may resolve

Re: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Nagatomi Yasukazu
Thursday, September 7, 2023 10:23 PM > *To:* Agrawal, Sanket > *Cc:* Yeachan Park ; user@spark.apache.org > *Subject:* [EXT] Re: Spark 3.4.1 and Hive 3.1.3 > > > > Hi Sanket, > > > > Spark 3.4.1 currently only works with Hive 2.3.9, and it would require a > lot of

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
same issue. > > org.elasticsearch : elasticsearch-spark-30_${scala.compat.version} : 7.12.1 > > On Fri, Sep 8, 2023 at 4:41 AM Sean Owen wrote: >> By marking it provided, you are not including this dependency with your >> app. If it is also

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
Hi Sean, Removed the provided scope, but still the same issue: org.elasticsearch : elasticsearch-spark-30_${scala.compat.version} : 7.12.1. On Fri, Sep 8, 2023 at 4:41 AM Sean Owen wrote: > By marking it provided, you are not including this dependency with your > app. If it i
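The flattened coordinates quoted in this thread correspond to a POM dependency along these lines (reconstructed; the `scala.compat.version` property is assumed to be defined elsewhere in the POM, e.g. as 2.12):

```xml
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark-30_${scala.compat.version}</artifactId>
  <version>7.12.1</version>
</dependency>
```

Per Sean's point above, omitting `<scope>provided</scope>` means the connector jar is bundled with the application rather than expected on the cluster classpath.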
