Re: Support Required: Issue with PySpark Code Execution Order

2025-08-15 Thread Mich Talebzadeh
miss a field). Also, target_col_list comes from the target schema, but type2_changes/source_new_non_migration_data_df come from the source schema. Those aren’t guaranteed to be identical. - You never re-project source_new_non_migration_data_df to target_col_list before the union.
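A minimal sketch of the re-projection being suggested, assuming `target_col_list`, `type2_changes`, and `source_new_non_migration_data_df` are the names from the thread; missing columns are added as nulls (cast them to the target types in real code):

```python
from pyspark.sql import functions as F

def align_to_target(df, target_col_list):
    # Add target columns the source lacks as nulls, then project in the
    # exact target order so the union lines up column-for-column.
    for c in set(target_col_list) - set(df.columns):
        df = df.withColumn(c, F.lit(None))
    return df.select(*target_col_list)

aligned_changes = align_to_target(type2_changes, target_col_list)
aligned_inserts = align_to_target(source_new_non_migration_data_df, target_col_list)
final_df = aligned_changes.unionByName(aligned_inserts)
```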

Re: Support Required: Issue with PySpark Code Execution Order

2025-08-13 Thread Karthick N
...withColumn(surrogate_key, row_number().over(Window.orderBy("PatientId")) + starting_key - 1) type1_updates_columns = list(set(type1_changes.columns) - set(type2_columns)) type1_updates = type1_changes.select(*type1_updates_columns) expired_records.createOrReplaceTempView(f"

Re: Support Required: Issue with PySpark Code Execution Order

2025-08-11 Thread Mich Talebzadeh
Hi Karthick, The problem seems to be that you were building three DataFrames from the same transformation "recipe" without materialisation, then writing back to that target table. Each MERGE re-evaluated its recipe at a different time, so they saw different snapshots → flaky/empty results. Fix (short
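A short sketch of the fix being described, assuming a Delta/Iceberg-style target that supports MERGE; the DataFrame names are illustrative. Each input is cached and forced with an action so every MERGE sees the same snapshot:

```python
# Pin each derived DataFrame to one snapshot before any MERGE runs.
dfs = [type1_updates.cache(), type2_changes.cache(), expired_records.cache()]
for df in dfs:
    df.count()  # eager action: materialises the cached plan once

dfs[0].createOrReplaceTempView("type1_updates_v")
spark.sql("""
    MERGE INTO target t
    USING type1_updates_v s ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
""")
```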

Re: Support Required: Issue with PySpark Code Execution Order

2025-08-11 Thread Karthick N
Hi *Ángel*, Thank you for checking on this. I’ll review the points you mentioned and get back to you with an update. Hi *Mich*, Looping you in here — could you please assist in reviewing this issue and share your inputs or suggestions? Your expertise would be really helpful in resolving it. Than

Re: Support Required: Issue with PySpark Code Execution Order

2025-08-10 Thread Ángel Álvarez Pascua
Have you tried disabling AQE? On Sun, Aug 10, 2025, 20:48, Karthick N wrote: > Hi Team, > > I’m facing an issue with the execution order in the PySpark code snippet > below. I’m not certain whether it’s caused by lazy evaluation, Spark plan > optimization, or something else. > > *Issue:* > Fo
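For anyone wanting to test this suggestion, AQE is controlled by a single runtime config (Spark 3.0+):

```python
# Disable Adaptive Query Execution for the current session.
spark.conf.set("spark.sql.adaptive.enabled", "false")
```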

Re: Support Required: Issue with PySpark Code Execution Order

2025-08-10 Thread Bjørn Jørgensen
When you run actions on the three final DataFrames (by creating temp views and running `MERGE`), Spark looks at the DAG. It sees that all three depend on `joined_df`. If `joined_df` hasn't been materialized and cached yet, the Spark optimizer might re-calculate it for each branch,
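A sketch of the materialisation being described, using the `joined_df` name from the thread:

```python
# Cache the shared parent and force it once; the three downstream MERGE
# branches then reuse one snapshot instead of recomputing the join.
joined_df = joined_df.cache()
joined_df.count()  # eager action
```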

RE: [PySpark] [Beginner] [Debug] Does Spark ReadStream support reading from a MinIO bucket?

2025-08-05 Thread Bhatt, Kashyap
>> option("path", "s3://bucketname") Shouldn’t the schema prefix be s3a instead of s3? Information Classification: General From: 刘唯 Sent: Tuesday, August 5, 2025 5:34 PM To: Kleckner, Jade Cc: user@spark.apache.org Subject: Re: [PySpark] [Beginner] [Debug] Doe

Re: [PySpark] [Beginner] [Debug] Does Spark ReadStream support reading from a MinIO bucket?

2025-08-05 Thread 刘唯
This is not necessarily about the readStream / read API. As long as you correctly imported the needed dependencies and set up spark config, you should be able to readStream from s3 path. See https://stackoverflow.com/questions/46740670/no-filesystem-for-scheme-s3-with-pyspark Kleckner, Jade 于202
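A minimal sketch of reading a MinIO bucket with readStream, assuming the hadoop-aws and AWS SDK jars are on the classpath; the endpoint, credentials, and schema are placeholders. Note the s3a:// scheme, per the previous message:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
         .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

# File sources require an explicit schema for streaming reads.
schema = StructType([StructField("id", StringType())])

stream = (spark.readStream
          .format("json")
          .schema(schema)
          .load("s3a://bucketname/path/"))
```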

Re: Spark Job Stuck in Active State (v2.4.3, Cluster Mode)

2025-07-17 Thread Ángel Álvarez Pascua
Sounds super interesting ... On Thu, Jul 17, 2025, 14:17, Hitesh Vaghela wrote: > Hi Spark community! I’ve posted a detailed question on Stack Overflow > regarding a persistent issue where my Spark job remains in an “Active” > state even after successful dataset processing. No errors in logs,

RE: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-15 Thread Wolfgang Buchner
Hi Nimrod, I am also interested in your first point: what exactly does "false alarm" mean? Today I had the following scenario, which in my opinion is a false alarm. Following example: - Topic contains 'N' messages - Spark Streaming application consumed all 'N' messages successfully - Checkpoints of s

Re: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-14 Thread Khalid Mammadov
1. I think false alarm in this context means you are ok to lose data, like in Dev and Test envs. 2. Not sure 3. Sorry, not sure again, but my guess would be that during your failover the checkpoint got out of sync. Sorry, that is all I used this feature for. If you think you can smoothly fail over to other clust

Re: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-13 Thread Nimrod Ofek
Thanks Khalid, Some follow-ups: 1. I'm still unsure what would count as "false alarms" 2. When there is data loss on some partitions - will that lead to all partitions getting reset? 3. I had an occurrence where I set failOnDataLoss to false and set the policy to earliest (which was about 24 h

Re: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-10 Thread Khalid Mammadov
I use this option in development environments where jobs are not actively running and Kafka topic has retention policy on. Meaning when a streaming job runs it may find that the last offset it read is not there anymore and in this case it falls back to starting position (i.e. earliest or latest) sp
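For context, the option under discussion is passed to the Kafka source like this (topic and server are placeholders):

```python
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my_topic")
      .option("startingOffsets", "earliest")
      # log and reset instead of failing when expected offsets are gone
      .option("failOnDataLoss", "false")
      .load())
```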

Re: Performance evaluation of Trino 468, Spark 4.0.0-RC2, and Hive 4 on Tez/MR3

2025-07-02 Thread Sungwoo Park
Hello, We have published a follow-up blog that compares the latest versions: 1) Trino 476, 2) Spark 4.0.0, 3) Hive 4 on MR3 2.1. At the end, we discuss MPP and MapReduce. https://mr3docs.datamonad.com/blog/2025-07-02-performance-evaluation-2.1 --- Sungwoo On Tue, Apr 22, 2025 at 7:08 PM Sungwoo

RE: pyspark4.0.0 still includes "jackson-mapper-asl.jar" that was supposed to be removed according to release note

2025-06-26 Thread Haibo.Wang
+ d...@spark.apache.org Hi All, could someone help look into this item? I would appreciate it if you could forward this thread to the correct team if this is not the right contact. Thanks. Regards Harper From: Wang, Harper (FRPPE) Sent: Wednesday, June 25, 2025 10:

Re: What is the current canonical way to join more than 2 watermarked streams (Spark 3.5.6)?

2025-06-26 Thread Jungtaek Lim
Hi, Starting from Spark 4.0.0, we support multiple stateful operators in append mode. You can perform a chain of stream-stream joins. One thing you need to take care of: the output of a stream-stream join will have two different event time columns, which is ambiguous w.r.t. which column has to b
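A rough sketch of the chained stream-stream join being described (Spark 4.0+, append mode); all stream and column names are illustrative. The point is to keep a single event-time column before chaining the next stateful operator:

```python
from pyspark.sql.functions import expr

s1 = stream1.withWatermark("ts1", "10 minutes")
s2 = stream2.withWatermark("ts2", "10 minutes")
s3 = stream3.withWatermark("ts3", "10 minutes")

j12 = s1.join(s2, expr("id1 = id2 AND ts2 BETWEEN ts1 AND ts1 + INTERVAL 5 MINUTES"))
j12 = j12.drop("ts2")  # resolve the two-event-time-column ambiguity

result = j12.join(s3, expr("id1 = id3 AND ts3 BETWEEN ts1 AND ts1 + INTERVAL 5 MINUTES"))
```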

Re: Using storage decommissioning on K8S cluster

2025-06-17 Thread Enrico Minack
Hi Spark users, those interested in storage decommissioning on Kubernetes might want to know about some known issues and how to fix them: https://issues.apache.org/jira/browse/SPARK-52504 Cheers, Enrico On 19.02.25 at 12:23, Enrico Minack wrote: Hi Spark users, I am trying to use the ex

Re: Inquiry: Extending Spark ML Support via Spark Connect to Scala/Java APIs (SPARK-50812 Analogue)

2025-06-04 Thread Daniel Filev
Dear Apache Spark Community/Development Team, I was wondering whether you had a chance to take a look at my previous email. I would appreciate any and all information which you could provide on the aforementioned points. I hope all is well on your end and do thank you for your time and considerat

Re: Structured streaming consumer group offset management in case of consumption of topic with same name from different Kafka clusters

2025-05-26 Thread megh vidani
Hello community, Any help here please? Thanks, Megh On Mon, May 19, 2025, 18:48 megh vidani wrote: > I'm aware that Spark does not rely on the kafka committed offsets. It is > purely for monitoring purposes. > > Thanks, > Megh > > On Mon, May 19, 2025, 18:46 megh vidani wrote: > >> Hi Prashan

Re: [PYSPARK] df.collect throws exception for MapType with ArrayType as key

2025-05-23 Thread Soumasish
This looks like a bug to me. https://github.com/apache/spark/blob/master/python/pyspark/serializers.py When cloudpickle.loads() tries to deserialize {["A", "B"]: "foo"}, a list as a dict key will break. Tuple ("A", "B") as Python input -> ArrayData works fine; ArrayData -> List ["A", "B"] breaks
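A minimal repro of the failure mode described: an array is a legal map key on the JVM side, but collect() has to build a Python dict, and a list key is unhashable:

```python
df = spark.sql("SELECT map(array('A', 'B'), 'foo') AS m")
df.show()     # fine: stays on the JVM
df.collect()  # fails in the Python deserializer: unhashable type 'list'
```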

Re: Reg: spark delta table read failing

2025-05-21 Thread Bjørn Jørgensen
I don't think pyspark 3.5.X and python 3.12 work that well together. On Wed, May 21, 2025 at 21:15, Akram Shaik wrote: > Hello, > > I am trying to access the azure storage account using the delta table from > pyspark shell using azure AAD identity. I see the token is getting > generated successfull

Re: Structured streaming consumer group offset management in case of consumption of topic with same name from different Kafka clusters

2025-05-19 Thread megh vidani
Hello Spark Dev Community, Reaching out for the below problem statement. Thanks, Megh On Mon, May 19, 2025, 13:16 megh vidani wrote: > Hello Spark Community, I have a structured streaming job in which I'm > consuming a topic with the same name in two different kafka clusters and > then creatin

Re: Structured streaming consumer group offset management in case of consumption of topic with same name from different Kafka clusters

2025-05-19 Thread megh vidani
I'm aware that Spark does not rely on the kafka committed offsets. It is purely for monitoring purposes. Thanks, Megh On Mon, May 19, 2025, 18:46 megh vidani wrote: > Hi Prashant, > > I would like to do it so that I can monitor the consumer group along with > my other consumer groups. > > Thank

Re: Structured streaming consumer group offset management in case of consumption of topic with same name from different Kafka clusters

2025-05-19 Thread megh vidani
Hi Prashant, I would like to do it so that I can monitor the consumer group along with my other consumer groups. Thanks, Megh On Mon, May 19, 2025, 18:21 Prashant Sharma wrote: > Spark does not rely on Kafka's commit, in fact it tracks the stream > progress itself and reads via offsets (e.g. f

Re: Structured streaming consumer group offset management in case of consumption of topic with same name from different Kafka clusters

2025-05-19 Thread Prashant Sharma
Spark does not rely on Kafka's commit, in fact it tracks the stream progress itself and reads via offsets (e.g. from last read points). Why do you want to commit? On Mon, May 19, 2025 at 5:58 PM megh vidani wrote: > Hello Spark Dev Community, > > Reaching out for the below problem statement. > >

Re: Structured Streaming Initial Listing Issue

2025-05-12 Thread Anastasiia Sokhova
Subject: Re: Structured Streaming Initial Listing Issue Hello. AFAIK the problem is not scoped to streaming and can't be mitigated with only maxResultSize

Re: Structured Streaming Initial Listing Issue

2025-05-12 Thread Andrei L
Hello. AFAIK the problem is not scoped to streaming and can't be mitigated with only maxResultSize for such input. Spark has to bring all file paths into driver memory even in case of streaming ( https://github.com/apache/spark/blob/37028fafc4f9fc873195a88f0840ab69edcf9d2b/sql/core/src/main/scala

Re: Structured Streaming Initial Listing Issue

2025-05-12 Thread 刘唯
That 1073.3 MiB isn't too much bigger than spark.driver.maxResultSize, can't you just increase that config to a larger number? / Wei Anastasiia Sokhova wrote on Wed, Apr 16, 2025 at 03:37: > Dear Spark Community, > > > > I run a Structured Streaming Query to read json files from S3 into an Iceberg table
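The config in question must be set before the driver starts; a sketch (the 4g value is just an example):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "4g")  # "0" means unlimited
         .getOrCreate())
```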

Re: Performance evaluation of Trino 468, Spark 4.0.0-RC2, and Hive 4 on Tez/MR3

2025-05-09 Thread Sungwoo Park
To answer the question on the configuration of Spark 4.0.0-RC2, this is the spark-defaults.conf used in the benchmark. Any suggestion on adding or changing configuration values will be appreciated.
spark.driver.cores=36
spark.driver.maxResultSize=0
spark.driver.memory=196g
spark.dynamicAllocation.enab

Re: [ANNOUNCE] Announcing Apache Spark Kubernetes Operator 0.1.0

2025-05-08 Thread Mridul Muralidharan
I had not checked the release. The release notes mention that Apache Spark 4.0 is supported - which has not yet been released. While I don’t expect drastic changes - and most likely the support will continue to work - the messaging is not accurate - Mridul On Wed, May 7, 2025 at 8:54 PM Do

Re: Unsubscribe

2025-05-06 Thread ypeng
please send an empty message to the following address for unsubscribing. user-unsubscr...@spark.apache.org shuang wu: Unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: High count of Active Jobs

2025-05-04 Thread Ángel Álvarez Pascua
ping On Tue, Apr 22, 2025, 6:05, Ángel Álvarez Pascua < angel.alvarez.pas...@gmail.com> wrote: > Any news? > > On Wed, Apr 16, 2025, 16:16, nayan sharma > wrote: > >> HI Ángel, >> >> I haven't tried disabling speculation but I will try running in DEBUG >> mode. >> >> Thanks & Regards, >

Re: [Spark SQL] spark.sql insert overwrite on existing partition not updating hive metastore partition transient_lastddltime and column_stats

2025-05-02 Thread Sathi Chowdhury
I think it is not happening because TRANSIENT_LASTDDLTIME is DDL-level metadata and an upsert operation does not recreate the partition; insert overwrite is just a DML statement. On Friday, May 2, 2025, 7:53 AM, Pradeep wrote: I have a partitioned hive external table as below scala> spark.sql("describe

Re: High count of Active Jobs

2025-04-21 Thread Ángel Álvarez Pascua
Any news? On Wed, Apr 16, 2025, 16:16, nayan sharma wrote: > HI Ángel, > > I haven't tried disabling speculation but I will try running in DEBUG mode. > > Thanks & Regards, > Nayan Sharma > *+91-8095382952* > > >

Re: Kafka Connector: producer throttling

2025-04-17 Thread daniel williams
@Abhishek Singla kafka.* options are for streaming applications, not batch. This is clearly documented on the structured streaming pages covering the Kafka integration. If you are transmitting data in a batch application it's best to do so via a foreachPartition operation as discussed, leveraging broadcast

Re: Checkpointing in foreachPartition in Spark batck

2025-04-17 Thread Abhishek Singla
@Ángel Álvarez Pascua Thanks, however I am thinking of some other solution which does not involve saving the dataframe result. Will update this thread with details soon. @daniel williams Thanks, I will surely check spark-testing-base out. Regards, Abhishek Singla On Thu, Apr 17, 2025 at 11:59 

Re: Kafka Connector: producer throttling

2025-04-17 Thread Rommel Yuan
I noticed this problem like a year ago; I just didn't pursue it further due to other issues. The solution is to use a broadcast Kafka config and create a producer in each partition. But if the config were honored natively, that would be great. On Thu, Apr 17, 2025, 15:27 Abhishek Singla wrote: > @daniel

Re: Kafka Connector: producer throttling

2025-04-17 Thread Abhishek Singla
@daniel williams It's batch and not streaming. I still don't understand why "kafka.linger.ms" and "kafka.batch.size" are not being honored by the kafka connector. @rommelhol...@gmail.com How did you fix it? @Jungtaek Lim Could you help out here in case I am missing something? Regards, Abhishe

Re: Checkpointing in foreachPartition in Spark batck

2025-04-17 Thread daniel williams
I have not. Most of my work and development on Spark has been on the scala side of the house and I've built a suite of tools for Kafka integration with Spark for stream analytics along with spark-testing-base On Thu, Apr 17, 2025 at 12:13 PM Ángel Ál

Re: Checkpointing in foreachPartition in Spark batck

2025-04-17 Thread Ángel Álvarez Pascua
Have you used the new equality functions introduced in Spark 3.5? https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.testing.assertDataFrameEqual.html On Thu, Apr 17, 2025, 13:18, daniel williams wrote: > Good call out. Yeah, once you take your work out of Spark it’s all on
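A usage sketch of the helper being linked (pyspark.testing, Spark 3.5+):

```python
from pyspark.testing import assertDataFrameEqual

expected = spark.createDataFrame([(1, "a")], ["id", "v"])
actual = spark.createDataFrame([(1, "a")], ["id", "v"])
assertDataFrameEqual(actual, expected)  # raises with a readable diff on mismatch
```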

Re: Checkpointing in foreachPartition in Spark batck

2025-04-17 Thread daniel williams
Good call out. Yeah, once you take your work out of Spark it’s all on you. Any partition-level operations (e.g. map, flatMap, foreach) end up as a lambda in Catalyst. I’ve found, however, that not using explode and doing things procedurally at this point, with a sufficient amount of unit testing, helps

Re: Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread Ángel Álvarez Pascua
Just a quick note on working at the RDD level in Spark — once you go down to that level, it’s entirely up to you to handle everything. You gain more control and flexibility, but Spark steps back and hands you the steering wheel. If tasks fail, it's usually because you're allowing them to — by not p

Re: Kafka Connector: producer throttling

2025-04-16 Thread daniel williams
The contract between the two is a dataset, yes; but did you construct the former via readStream? If not, it’s still batch. -dan On Wed, Apr 16, 2025 at 4:54 PM Abhishek Singla wrote: > > Implementation via Kafka Connector > > > org.apache.spark > spark-sql-kafka-0-10_${scala.version}

Re: Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread Abhishek Singla
@daniel williams > Operate your producer in transactional mode. Could you elaborate more on this. How to achieve this with foreachPartition and HTTP Client. > Checkpointing is an abstract concept only applicable to streaming. That means checkpointing is not supported in batch mode, right? @ange

Re: Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread daniel williams
Apologies. I was imagining a pure Kafka application with a transactional consumer on processing. Angel’s suggestion is the correct one: mapPartitions to maintain state across your broadcast Kafka producers, to retry or introduce back pressure given your producer and retry in its threadpool giv

Re: Kafka Connector: producer throttling

2025-04-16 Thread Abhishek Singla
Implementation via Kafka Connector

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_${scala.version}</artifactId>
  <version>3.1.2</version>
</dependency>

private static void publishViaKafkaConnector(Dataset<Row> dataset, Map<String, String> conf) {
  dataset.write().format("kafka").options(conf).save();
}

conf = { "kafka.bootstrap.servers": "localh

Re: High count of Active Jobs

2025-04-16 Thread nayan sharma
HI Ángel, I haven't tried disabling speculation but I will try running in DEBUG mode. Thanks & Regards, Nayan Sharma *+91-8095382952* On Wed, Apr 16, 2025 at 12:32 PM Ángel Álvarez Pas

Re: Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread Ángel Álvarez Pascua
What about returning the HTTP codes as a dataframe result in the foreachPartition, saving it to files/table/whatever and then performing a join to discard already-processed OK rows when you try it again? On Wed, Apr 16, 2025 at 15:01, Abhishek Singla (< abhisheksingla...@gmail.com>) wrote: >
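A sketch of this idea, assuming a unique `id` column; `call_endpoint` is a hypothetical HTTP helper returning a status code:

```python
def send_partition(rows):
    for row in rows:
        yield (row["id"], call_endpoint(row))  # (key, http_status)

results = df.rdd.mapPartitions(send_partition).toDF(["id", "status"])
results.write.mode("overwrite").parquet("/tmp/send_results")

# On retry: anti-join against the rows that already succeeded.
ok = spark.read.parquet("/tmp/send_results").filter("status = 200")
retry_df = df.join(ok.select("id"), "id", "left_anti")
```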

Re: Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread daniel williams
Yes. Operate your producer in transactional mode. Checkpointing is an abstract concept only applicable to streaming. -dan On Wed, Apr 16, 2025 at 7:02 AM Abhishek Singla wrote: > Hi Team, > > We are using foreachPartition to send dataset row data to third system via > HTTP client. The operatio

Re: Kafka Connector: producer throttling

2025-04-16 Thread daniel williams
If you are building a broadcast to construct a producer with a set of options then the producer is merely going to operate how it’s going to be configured - it has nothing to do with spark save that the foreachPartition is constructing it via the broadcast. A strategy I’ve used in the past is to *

Re: Kafka Connector: producer throttling

2025-04-16 Thread Abhishek Singla
Yes, producing via kafka-clients using foreachPartition works as expected. Kafka Producer is initialised within call(Iterator t) method. The issue is with using kafka connector with Spark batch. The configs are not being honored even when they are being set in ProducerConfig. This means kafka reco

Re: Kafka Connector: producer throttling

2025-04-16 Thread daniel williams
If you’re using in batch mode with foreachPartition you’ll want to pass in your options as a function of the configuration of your app, not spark (i.e. use scopt to pass in a map which then gets passed to your builder/broadcast). If you are using spark structured streaming you can use kafka.* prope

Re: Kafka Connector: producer throttling

2025-04-16 Thread Abhishek Singla
Sharing Producer Config on Spark Startup

25/04/16 15:11:59 INFO ProducerConfig: ProducerConfig values:
  acks = -1
  batch.size = 1000
  bootstrap.servers = [localhost:9092]
  buffer.memory = 33554432
  client.dns.lookup = use_all_dns_ips
  client.id = producer-1
  compression.type = none
  connections.max.id

Re: Kafka Connector: producer throttling

2025-04-16 Thread Abhishek Singla
Hi Daniel and Jungtaek, I am using Spark in batch. Tried with the kafka. prefix; now I can see the settings in Producer Config on Spark startup, but they are still not being honored. I have set "linger.ms": "1000" and "batch.size": "10". I am publishing 10 records and they are flushed to kafka serve

Re: High count of Active Jobs

2025-04-16 Thread Ángel Álvarez Pascua
Hi Nayan, Yesterday, I tried to reproduce the issue with speculation enabled, but no luck. Even when using threads and performing multiple actions simultaneously, the stages got queued—but they all eventually completed. I'm curious... have you tried disabling speculation? If so, please let me kno

Re: High count of Active Jobs

2025-04-14 Thread Ángel Álvarez Pascua
Hi Nayan, No worries about the logs. If speculation is enabled, that could explain everything. I understand it's your production environment, but would it be possible to disable speculation temporarily and see if the issue persists? Thanks, Ángel PS: I've started writing an article about this i

Re: High count of Active Jobs

2025-04-13 Thread nayan sharma
Hi Ángel, Yes, speculation is enabled. I will lower log4j logging and share. It will take at least 24 hours before we can capture anything. Thanks, Nayan Thanks & Regards, Nayan Sharma *+91-8095382952*

Re: High count of Active Jobs

2025-04-12 Thread nayan sharma
I found something https://issues.apache.org/jira/browse/SPARK-45101 I tried in a lower environment but found no issues. Although there is a version mismatch running spark 3.2.3 in lower env and *3.2.0 in production.* Thanks & Regards, Nayan Sharma *+91-8095382952*

Re: Unsubscribe

2025-04-10 Thread ypeng
On 2025-04-11 01:17, Shiva Punna wrote: Please send a message to: user-unsubscr...@spark.apache.org to unsubscribe. Thanks. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Java coding with spark API

2025-04-10 Thread Jules Damji
Tim, Yes, you can use Java for your Spark workloads just fine. Cheers Jules Excuse the thumb typos On Fri, 04 Apr 2025 at 12:53 AM, tim wade wrote: > Hello > > I am just newbie to spark. I am programming with Java mainly, knowing > scala very bit. > > Can I just write code with java to talk t

Re: Java coding with spark API

2025-04-07 Thread Stephen Coy
Hi Tim, We have a large ETL project comprising about forty individual Apache Spark applications, all built exclusively in Java. They are executed on three different Spark clusters built on AWS EC2 instances. The applications are built in Java 17 for Spark 3.5.x. Cheers, Steve C > On 4 Apr 20

Re: Inquiry in regards to a New onQuery Method for StreamingQueryListener

2025-04-06 Thread Jevon Cowell
I’ve been thinking about this quite a bit today and what an implementation on the Spark side would look like. After some deliberation I concluded: we should instead have an `onQueryTriggerStart` method that is published every time a MicroBatch is triggered. This should of course be disabled by d

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-04-06 Thread karan alang
One issue I've seen is that after about 24 hours, the sparkapplication job pods seem to be getting evicted .. i've installed spark history server, and am verifying the case. It could be due to resource constraints, checking this. Pls note : kubeflow spark operator is installed in namespace - so35

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-04-06 Thread karan alang
Thanks, Megh ! I did some research and realized the same - PVC is not a good option for spark shuffle, primarily for latency issues. The same is the case with S3 or MinIO. I've implemented option 2, and am testing this out currently: Storing data in host path is possible regds, Karan Alang O

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-04-06 Thread megh vidani
Hello Karan, Apart from Celeborn, there is Apache Uniffle (Incubating) as well. We also have similar setup as yours and we're trying out a PoC with Uniffle right now. What I've gathered so far is, with Uniffle: 1. Storing data in PVCs is not well supported 2. Storing data in host path is possible

Re: Java coding with spark API

2025-04-05 Thread Sonal Goyal
Java is very much supported in Spark. In our open source project, we haven’t done spark connect yet but we do a lot of transformations, ML and graph stuff using Java with Spark. Never faced the language barrier. Cheers, Sonal https://github.com/zinggAI/zingg On Sat, 5 Apr 2025 at 4:42 PM, Ángel

Re: Java coding with spark API

2025-04-05 Thread Ángel Álvarez Pascua
I think you have more limitations using Spark Connect than Spark from Java. I used RDDs, registered UDFs, ... from Java without any problems. On Sat, Apr 5, 2025, 9:50, tim wade wrote: > Hello > > I only know Java programming. If I use Java to communicate with the > Spark API and submit tasks t

Re: Java coding with spark API

2025-04-05 Thread tim wade
Hello I only know Java programming. If I use Java to communicate with the Spark API and submit tasks to it from Java, I'm not sure what disadvantages this might have. I see other people writing tasks in Scala, then compiling them and submitting them to Spark using spark-submit. Thanks. O

Re: Java coding with spark API

2025-04-05 Thread Ángel Álvarez Pascua
I think I did that some years ago in Spark 2.4 on a Hortonworks cluster with SSL and Kerberos enabled. It worked, but never went into production. On Fri, Apr 4, 2025, 9:54, tim wade wrote: > Hello > > I am just newbie to spark. I am programming with Java mainly, knowing > scala very bit. > > C

Re: Inquiry in regards to a New onQuery Method for StreamingQueryListener

2025-04-04 Thread Jevon Cowell
Hey Jungtaek! Wanted to update the mailing list on my current approach in case others want something similar. I created an asynchronous poller that iterates through all active queries and checks if the isTriggering boolean value is true. Here’s an example code snippet: ```java public static void checkAndUp
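The snippet above is cut off in the archive; a rough Python equivalent of the polling idea, using the public status dict (PySpark exposes an 'isTriggerActive' field there; the exact field Jevon checks may differ):

```python
import threading
import time

def poll_queries(spark, interval_s=5):
    # Periodically inspect every active streaming query on this session.
    while True:
        for q in spark.streams.active:
            if q.status.get("isTriggerActive"):
                print(f"query {q.name or q.id} is processing a trigger")
        time.sleep(interval_s)

threading.Thread(target=poll_queries, args=(spark,), daemon=True).start()
```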

Re: Java coding with spark API

2025-04-04 Thread Jevon Cowell
Hey Tim! What are you aiming to achieve exactly? Regards, Jevon C > On Apr 4, 2025, at 3:54 AM, tim wade wrote: > > Hello > > I am just newbie to spark. I am programming with Java mainly, knowing scala > very bit. > > Can I just write code with java to talk to Spark's java API for submit

Re: Executors not getting released dynamically once task is over

2025-04-04 Thread Soumasish
Hi Shivang, This sounds like classic Spark-on-Kubernetes behavior:
- Executors do not necessarily shut down immediately after finishing their tasks, unless dynamic resource allocation is enabled and Spark knows it can safely scale down.
- The Driver pod manages the whole lifecycle.
If you
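The settings alluded to, as a sketch; on Kubernetes, shuffle tracking is what lets Spark release idle executors without an external shuffle service (values are examples):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
         .getOrCreate())
```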

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-03-31 Thread Mich Talebzadeh
yes apache celeborn may be useful. You need to do some research though. https://celeborn.apache.org/ Have a look at this link as well: Spark Executor Shuffle Storage Options. HTH, Dr Mich Talebzadeh, Architect | Data Science |

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-03-31 Thread karan alang
seems apache-celeborn is also an option, if anyone has used this pls let me know. thanks! On Mon, Mar 31, 2025 at 1:58 PM karan alang wrote: > hello all - checking to see if anyone has any input on this > > thanks! > > > On Tue, Mar 25, 2025 at 12:22 PM karan alang > wrote: > >> hello All, >>

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-03-31 Thread karan alang
hello all - checking to see if anyone has any input on this thanks! On Tue, Mar 25, 2025 at 12:22 PM karan alang wrote: > hello All, > > I have kubeflow Spark Operator installed on k8s and from what i understand > - Spark Shuffle is not officially supported on kubernetes. > > Looking for feedb

Re: Inquiry in regards to a New onQuery Method for StreamingQueryListener

2025-03-27 Thread Jungtaek Lim
Hi Jevon, > From testing, I see that `onQueryIdle` does not trigger when a query is waiting for the next trigger interval. Yeah it's based on trigger - if no trigger has been triggered, the event cannot be sent. > I wanted to get thoughts on whether it’s worth implementing a new QueryListener me

Re: Kafka Connector: producer throttling

2025-03-26 Thread daniel williams
If you're using structured streaming you can pass producer settings as kafka.-prefixed options, as documented. If you're using Spark in batch form you'll want to do a foreach on a KafkaProducer via a Broadcast. All KafkaProducer-specific options

Re: Kafka Connector: producer throttling

2025-03-26 Thread Jungtaek Lim
Sorry I missed this. Did you make sure that you add "kafka." as a prefix on the Kafka-side configs when specifying Kafka source/sink options? On Mon, Feb 24, 2025 at 10:31 PM Abhishek Singla < abhisheksingla...@gmail.com> wrote: > Hi Team, > > I am using spark to read from S3 and write to Kafka. > > Spar
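What the prefix looks like in practice: anything starting with "kafka." is handed to the underlying producer, the rest are Spark options (this sketch assumes df already has a string `value` column):

```python
(df.write
   .format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")
   .option("kafka.linger.ms", "1000")  # passed through to the KafkaProducer
   .option("topic", "events")
   .save())
```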

Re: Kafka Connector: producer throttling

2025-03-26 Thread Rommel Yuan
It's been a month. I guess the answer is no. I have been running into the same issue. I guess building a Kafka client is the only option. Rommel On Mon, Feb 24, 2025, 09:20 Abhishek Singla wrote: > Isn't there a way to do it with kafka connector instead of kafka client? > Isn't there any way to

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-25 Thread Prem Sahoo
Just one more variable: Spark 3.5.2 runs on Kubernetes and Spark 3.2.0 runs on YARN. It seems Kubernetes can be a cause of slowness too. On Mar 24, 2025, at 7:10 PM, Prem Gmail wrote: Hello Spark Dev/users, anyone have any clue why and how a better version has a performance iss

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-24 Thread Prem Gmail
Hello Spark Dev/users, anyone have any clue why and how a better version has a performance issue? I will be happy to raise a JIRA. On Mar 24, 2025, at 4:20 PM, Prem Sahoo wrote: The problem is on the writer's side. It takes longer to write to Minio with Spark 3.5.2 and Hadoop 3.4.1

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-24 Thread Prem Sahoo
The problem is on the writer's side. It takes longer to write to Minio with Spark 3.5.2 and Hadoop 3.4.1, so it seems there are some tech changes between Hadoop 2.7.6 and 3.4.1 that affected the write speed. On Sun, Mar 23, 2025 at 12:09 AM Ángel Álvarez Pascua < angel.alvarez.pas...@gmail.c

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Ángel Álvarez Pascua
@Prem Sahoo , could you test both versions of Spark+Hadoop by replacing your "write to MinIO" statement with write.format("noop")? This would help us determine whether the issue lies on the reader side or the writer side. On Sun, Mar 23, 2025 at 4:53, Prem Gmail () wrote: > V2 writer in 3.
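The noop sink runs the full read/transform but discards the write, which isolates reader-side cost:

```python
df.write.format("noop").mode("overwrite").save()
```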

Re: Spark 3.5.2 and Hadoop 3.4.1 slow performance

2025-03-22 Thread Prem Gmail
V2 writer in 3.5.2 and Hadoop 3.4.1 should be much faster than Spark 3.2.0 and Hadoop 2.7.6, but that’s not the case; I tried the magic committer option, which is again slower. So internally something changed which made this slow. May I know what? On Mar 22, 2025, at 11:05 PM, Kristopher

Re: High/Critical CVEs in jackson-mapper-asl (spark 3.5.5)

2025-03-18 Thread Ángel Álvarez Pascua
Seems like the Jackson version hasn't changed since Spark 1.4 (pom.xml). Even Spark 4 is still using this super old (2013) version. Maybe it's time ... On Tue, Mar 18, 2025 at 16:05, Mohammad, Ejas Ali () wrote: > Hi Spark Community,

Re: Multiple CVE issues in apache/spark-py:3.4.0 + Pyspark 3.4.0

2025-03-15 Thread Soumasish
Two things come to mind as low-hanging fruit: update to Spark 3.5, which should reduce the CVEs. Alternatively consider using Spark Connect, where you can address the client-side vulnerabilities yourself. Best Regards Soumasish Goswami in: www.linkedin.com/in/soumasish # (415) 530-0405 - On

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Bjørn Jørgensen
Something like this, using a list comprehension (and it doesn't use pivot):

doc_types = ["AB", "AA", "AC"]
result = df.groupBy("code").agg(
    *[F.sum(F.when(F.col("doc_type") == dt, F.col("amount"))).alias(f"{dt}_amnt") for dt in doc_types],
    F.first("load_date").alias("load_date"),
)

On Sun

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Mich Talebzadeh
Well, I tried using windowing functions with pivot() and it did not work. From your reply, you are looking for a function that would ideally combine the conciseness of pivot() with the flexibility of explicit aggregations. While Spark provides powerful tools, there is not a single built-in function

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Dhruv Singla
Yes, this is it. I want to form this using a simple short command. The way I mentioned is a lengthy one. On Sun, Mar 9, 2025 at 10:16 PM Mich Talebzadeh wrote: > Is this what you are expecting? > > root > |-- code: integer (nullable = true) > |-- AB_amnt: long (nullable = true) > |-- AA_amnt:

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Dhruv Singla
Hey, I already know this and have written the same in my question. I know formatting can make the code a lot simpler and easier to understand, but I'm looking if there is already a function or a spark built-in for this. Thanks for the help though. On Sun, Mar 9, 2025 at 11:42 PM Mich Talebzadeh w

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Mich Talebzadeh
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.functions import struct
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Mich Talebzadeh
Is this what you are expecting?

root
 |-- code: integer (nullable = true)
 |-- AB_amnt: long (nullable = true)
 |-- AA_amnt: long (nullable = true)
 |-- AC_amnt: long (nullable = true)
 |-- load_date: date (nullable = true)

+----+-------+-------+-------+----------+
|code|AB_amnt|AA_amnt|AC_amnt|l

Re: Apache - GSOC'25 projects / Contributions

2025-02-24 Thread Mich Talebzadeh
Hi, To get started, you might want to go through the official Spark documentation and contributor guide: - Apache Spark Documentation - Apache Spark Contributor Guide Regarding GSOC 2025

Re: Kafka Connector: producer throttling

2025-02-24 Thread Abhishek Singla
Isn't there a way to do it with kafka connector instead of kafka client? Isn't there any way to throttle kafka connector? Seems like a common problem. Regards, Abhishek Singla On Mon, Feb 24, 2025 at 7:24 PM daniel williams wrote: > I think you should be using a foreachPartition and a broadcast

Re: Kafka Connector: producer throttling

2025-02-24 Thread daniel williams
I think you should be using a foreachPartition and a broadcast to build your producer. From there you will have full control of all options and serialization needed via direct access to the KafkaProducer, as well as all options therein associated (e.g. callbacks, interceptors, etc). -dan On Mon,
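A sketch of that pattern, shown here with the kafka-python client purely for illustration (the thread may equally mean the Java producer); config values are examples:

```python
producer_conf = {"bootstrap_servers": "localhost:9092", "linger_ms": 1000}
bc_conf = spark.sparkContext.broadcast(producer_conf)

def send_partition(rows):
    from kafka import KafkaProducer  # kafka-python, an assumed dependency
    # One producer per partition, built from the broadcast config.
    producer = KafkaProducer(**bc_conf.value)
    for row in rows:
        producer.send("events", row["value"].encode("utf-8"))
    producer.flush()

df.foreachPartition(send_partition)
```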

Re: Spark connect: Table caching for global use?

2025-02-17 Thread Ángel
Well, that's not strictly correct. I had two different memory leaks on the driver side because of caching. Both of them in stream processes; one of them in Scala (I forgot to unpersist the cached dataframe) and the other one in PySpark (unpersisting cached dataframes wasn't enough because of Python

Re: Spark connect: Table caching for global use?

2025-02-17 Thread Subhasis Mukherjee
> I understood that caching a table pegged the RDD partitions into the memory of the executors holding the partition. Your understanding is correct. Nothing to worry about on the driver side while creating the temp view. On Sun, Feb 16, 2025, 10:47 PM Mich Talebzadeh wrote: > Ok let us look at this >

Re: Spark connect: Table caching for global use?

2025-02-16 Thread Mich Talebzadeh
Ok, let us look at this:
- Temporary Views: metadata is stored on the driver; *data remains distributed across executors.*
- Caching/Persisting: *data is stored in the executors' memory or disk.*
- The statement *"created on driver memory"* refers to the metadata of temp
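Illustrating the distinction: registering a view is driver-side metadata only; caching is what pins data on the executors:

```python
df.createOrReplaceTempView("my_view")  # metadata only, held on the driver
spark.sql("CACHE TABLE my_view")       # eager: partitions cached on executors
```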

Re: Spark connect: Table caching for global use?

2025-02-16 Thread Tim Robertson
Thanks Mich > created on driver memory That I hadn't anticipated. Are you sure? I understood that caching a table pegged the RDD partitions into the memory of the executors holding the partition. On Sun, Feb 16, 2025 at 11:17 AM Mich Talebzadeh wrote: > yep. created on driver memory. watch
