Re: spark TLS connection to hive metastore

2024-10-04 Thread Ángel
Enable SSL debug and analyze the log (not an easy task, but better than getting stuck): https://docs.oracle.com/javase/8/docs/technotes/guides/security/jsse/ReadDebug.html Which Java version is used? 8? 11? Could you provide the full stacktrace? On Fri, Oct 4, 2024, 17:34 Stefano Bovina wrote: > Hi, >

Re: kubeflow spark-operator - error in querying strimzi kafka using structured streaming

2024-10-03 Thread Nimrod Ofek
Where is the checkpoint location? Not in GCS? Probably the location of the checkpoint is there, and you don't have permissions for it... On Thu, Oct 3, 2024, 02:43 karan alang wrote: > This seems to be the cause of this -> > github.com/kubeflow/spark-operator/issues/1619 .. the secret

Re: kubeflow spark-operator - error in querying strimzi kafka using structured streaming

2024-10-02 Thread karan alang
This seems to be the cause of this -> github.com/kubeflow/spark-operator/issues/1619 .. the secret is not getting mounted due to this error -> MountVolume.SetUp failed for volume “spark-conf-volume-driver”. I'm getting the same error in the event logs, and the mounted secret is not getting read. If anyone i

Re: How to run spark connect in kubernetes?

2024-10-02 Thread kant kodali
please ignore this. it was a dns issue On Wed, Oct 2, 2024 at 11:16 AM kant kodali wrote: > Here > > are more details about my question that I posted in SO > > On Tue, Oc

Re: How to run spark connect in kubernetes?

2024-10-02 Thread kant kodali
Here are more details about my question that I posted in SO On Tue, Oct 1, 2024 at 11:32 PM kant kodali wrote: > Hi All, > > Is it possible to run a Spark Connect server

Re: Bugs with joins and SQL in Structured Streaming

2024-10-01 Thread Andrzej Zera
Hello! Thank you for looking into these issues! I'm happy that you identified the root cause for OuterJoinTest and SqlSyntaxTest and are working on the fix. Regarding IntervalJoinTest, I think I understand your point. Thank you for explaining that. However, this can be confusing to a user. Let's mayb

Re: Bugs with joins and SQL in Structured Streaming

2024-09-30 Thread Jungtaek Lim
I figured out the issue which breaks the second test in SqlSyntaxTest. This is also a correctness issue, unfortunately. Issue and the fix for OuterJoinTest: https://issues.apache.org/jira/browse/SPARK-49829 Issue and the fix for SqlSyntaxTest: https://issues.apache.org/jira/browse/SPARK-49836 Tha

Re: Rust Spark Connect

2024-09-30 Thread ed elliott
Hi, There isn’t an official client (the only official ones are python/java/scala) but this uses the spark connect gRPC API and is well supported: https://github.com/sjrusso8/spark-connect-rs Ed Elliott From: Tarkhan Shakhbazyan Sent: Sunday, September 29, 2024

Re: Bugs with joins and SQL in Structured Streaming

2024-09-29 Thread Jungtaek Lim
I just quickly looked into SqlSyntaxTest - the first broken test looks to be fixed via SPARK-46062 which was released in Spark 3.5.1. The second broken test is a valid issue and I'm yet to know why this is happening. I'll file a JIRA ticket and le

Re: Bugs with joins and SQL in Structured Streaming

2024-09-28 Thread Jungtaek Lim
Sorry, I totally missed this email. It was forgotten for 6 months, but I'm happy that we have smart users reporting such complex edge-case issues! I haven't had time to validate all of them, but OuterJoinTest is a valid correctness issue indeed. Thanks for reporting it to us! I figured out the root cau

Re: Help - Learning/Understanding spark web UI

2024-09-26 Thread Daniel Aronovic
Hey Karthick, The best way to deepen your understanding is by using the Spark Web UI as much as possible while learning the fundamentals of Spark. To help ease the learning curve, I recommend trying an open-source project called *Dataflint*. It adds an extra tab to the Spark Web UI and presents m

Re: Help - Learning/Understanding spark web UI

2024-09-26 Thread Ilango
Hi Karthick, I found one of the Spark Summit talks on the Spark UI from a few years back quite useful. Just search on YouTube. Let me check it out, and I will share it with you if I find it again. Thanks, Elango On Thu, 26 Sep 2024 at 4:04 PM, Karthick Nk wrote: > Hi All, > I am looking to deepen my und

Re: Structured Streaming and Spark Connect

2024-09-23 Thread 刘唯
Hi Anastasiia, Thanks for the email. I think you can tweak this spark config *spark.connect.session.manager.defaultSessionTimeout, *this is defined here*: * https://github.com/apache/spark/blob/343471dac4b96b43a09763d759b6c30760fb626e/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/
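For reference, a minimal sketch of how that timeout might be raised, assuming the config can be set like any other conf when the Spark Connect server starts (the 24h value is purely illustrative, not a recommendation):

```properties
# spark-defaults.conf on the Spark Connect server; the value is an arbitrary example
spark.connect.session.manager.defaultSessionTimeout  24h
```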

Re: Structured Streaming and Spark Connect

2024-09-23 Thread Mich Talebzadeh
Hi Anastasia, My take is that in its current form, Spark Connect is not suitable for running long-lived Structured Streaming queries in Standalone mode, especially with long trigger intervals. The lack of support for detached streaming queries makes it problematic for this particular use case. To

Re: is it possible to run spark2 on EMR 7.2.0

2024-09-20 Thread Prem Sahoo
It is not possible, as EMR 7.2.0 comes with Hadoop 3.x and Spark 3.x by default. If you are looking to migrate from Spark 2 to 3, then use EMR 6.x, probably 6.2. Sent from my iPhone > On Sep 20, 2024, at 9:18 AM, joachim rodrigues > wrote: > > > I'd like to start a migration from spark2 to spark

Re: ERROR: GROUP BY position 0 is not in select list , when using catalyst parser

2024-09-17 Thread joshita mishra
Unsubscribe On Wed, Sep 18, 2024, 09:39 Sudhanshu wrote: > unsubscribe > > On Wed, Sep 11, 2024 at 1:51 PM Rommel Holmes > wrote: > >> i am using spark 3.3.1 >> here is the sql_string to query a ds partitioned table >> >> ``` >> SELECT >> '2024-09-09' AS ds, >> AVG(v1) AS avg_v1, >> AVG(v2) A

Re: ERROR: GROUP BY position 0 is not in select list , when using catalyst parser

2024-09-17 Thread Sudhanshu
unsubscribe On Wed, Sep 11, 2024 at 1:51 PM Rommel Holmes wrote: > i am using spark 3.3.1 > here is the sql_string to query a ds partitioned table > > ``` > SELECT > '2024-09-09' AS ds, > AVG(v1) AS avg_v1, > AVG(v2) AS avg_v2, > AVG(v3) AS avg_v3 > FROM schema.t1 > WHERE ds = '2024-09-09' >

Re: [CONNECT] Why Can't We Specify Cluster Deploy Mode for Spark Connect?

2024-09-09 Thread Nagatomi Yasukazu
I apologize if my previous explanation was unclear, and I realize I didn’t provide enough context for my question. The reason I want to submit a Spark application to a Kubernetes cluster using the Spark Operator is that I want to use Kubernetes as the Cluster Manager, rather than Standalone mode.

Re: [CONNECT] Why Can't We Specify Cluster Deploy Mode for Spark Connect?

2024-09-09 Thread Prabodh Agarwal
Oh. This issue is pretty straightforward to solve, actually. Particularly in spark-3.5.2: just download the `spark-connect` maven jar and place it in `$SPARK_HOME/jars`, then rebuild the docker image. I saw that I had posted a comment on this Jira as well. I could fix this up for standalone cluste

Re: [CONNECT] Why Can't We Specify Cluster Deploy Mode for Spark Connect?

2024-09-09 Thread Nagatomi Yasukazu
Hi Prabodh, Thank you for your response. As you can see from the following JIRA issue, it is possible to run the Spark Connect Driver on Kubernetes: https://issues.apache.org/jira/browse/SPARK-45769 However, this issue describes a problem that occurs when the Driver and Executors are running on

Re: Spark Thrift Server - Not Scaling Down Executors 3.4.2+

2024-09-05 Thread Cheng Pan
The default value of spark.dynamicAllocation.shuffleTracking.enabled was changed from false to true in Spark 3.4.0, disabling it might help. [1] https://spark.apache.org/docs/latest/core-migration-guide.html#upgrading-from-core-33-to-34 Thanks, Cheng Pan > On Sep 6, 2024, at 00:36, Jayabindu
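A sketch of the suggested change as conf entries. Only the shuffleTracking property comes from the message above; the other lines are common companions in a dynamic-allocation setup and are assumptions, not part of the original advice:

```properties
spark.dynamicAllocation.enabled                  true
# Restore the pre-3.4.0 default so idle executors holding shuffle data can be released
spark.dynamicAllocation.shuffleTracking.enabled  false
# With shuffle tracking off, shuffle files must be served some other way,
# e.g. an external shuffle service (assumption; depends on your deployment)
spark.shuffle.service.enabled                    true
```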

Re: Question about Releases and EOL

2024-08-29 Thread Mich Talebzadeh
CCed to spark dev as well. OK, this is my take. From my experience, the EOL for Spark releases is determined by a combination of factors, including: 1) Community support: the level of community activity and contributions to a particular release branch plays a significant role. If a branch continues

Re: unable to deploy Pyspark application on GKE, Spark installed using bitnami helm chart

2024-08-27 Thread Mat Schaffer
I use https://github.com/kubeflow/spark-operator rather than bitnami chart, but https://medium.com/@kayvan.sol2/spark-on-kubernetes-d566158186c6 shows running spark submit from a master pod exec. Might be something to try. On Mon, Aug 26, 2024 at 12:22 PM karan alang wrote: > We are currently us

Re: Spark Reads from MapR and Write to MinIO fails for few batches

2024-08-24 Thread Prem Sahoo
Issue resolved, thanks for your time folks. Sent from my iPhone. On Aug 21, 2024, at 5:38 PM, Prem Sahoo wrote: Hello Team, Could you please check on this request? On Mon, Aug 19, 2024 at 7:00 PM Prem Sahoo wrote: Hello Spark and User, could you please shed some light? On Thu, Au

Re: Batch to Kafka

2024-08-23 Thread Rommel Holmes
Hi, community. Right now I am using the batch write to Kafka in pyspark to send a dataframe into Kafka: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-the-output-of-batch-queries-to-kafka The solution works with a not-so-big dataframe. When the dataframe is big, si

Re: Spark Reads from MapR and Write to MinIO fails for few batches

2024-08-21 Thread Prem Sahoo
Hello Team, Could you please check on this request ? On Mon, Aug 19, 2024 at 7:00 PM Prem Sahoo wrote: > Hello Spark and User, > could you please shed some light ? > > On Thu, Aug 15, 2024 at 7:15 PM Prem Sahoo wrote: > >> Hello Spark and User, >> we have a Spark project which is a long runnin

Re: Hitting SPARK-45858 on Kubernetes - Unavoidable bug or misconfiguration?

2024-08-20 Thread Cheng Pan
I searched [1] using the keywords “reliable” and got nothing, so I cannot draw the same conclusion as you. If an implementation claims to support reliable storage, it should inherit interface ShuffleDriverComponents and override method supportsReliableStorage [2] to return true, for example, Ap

Re: Hitting SPARK-45858 on Kubernetes - Unavoidable bug or misconfiguration?

2024-08-20 Thread Cheng Pan
org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO does NOT support reliable storage, so the condition 4) is false even with this configuration. I’m not sure why you think it does. Thanks, Cheng Pan > On Aug 20, 2024, at 18:27, Aaron Grubb wrote: > > Adding spark.shuffle.useOldFetchProt

Re: Hitting SPARK-45858 on Kubernetes - Unavoidable bug or misconfiguration?

2024-08-20 Thread Aaron Grubb
Adding spark.shuffle.useOldFetchProtocol=true changed the outcome of the job however it still was not stable in the face of spot instances going away. Adding spark.decommission.enabled=true, spark.storage.decommission.enabled=true and spark.executor.decommission.killInterval=110 appears to have
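Collected as conf entries, the combination reported above looks roughly like this (the property names and the 110 value are taken from the message; treat it as a sketch of one user's working setup, not a verified recipe):

```properties
spark.decommission.enabled                 true
spark.storage.decommission.enabled         true
# Interval reported in the message above; check units/semantics against the docs
spark.executor.decommission.killInterval   110
```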

Re: Handling load distribution and addressing data skew.

2024-08-19 Thread Raghavendra Ganesh
Hi, Have you tried https://spark.apache.org/docs/latest/sql-performance-tuning.html#spliting-skewed-shuffle-partitions ? Another way of handling the skew is to split the task into multiple (2 or more) stages involving a random salt as a key in the intermediate stages. In the above case, val maxSa
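The salting idea above can be illustrated in plain Python (this is a sketch of the technique, not Spark API code; all names here are made up for the example). A hot key's work is split across salted sub-keys in stage 1, then recombined in stage 2:

```python
import random
from collections import defaultdict

def salted_two_stage_sum(records, num_salts=4):
    """Sum values per key in two stages, spreading a hot key over num_salts buckets."""
    # Stage 1: aggregate on (key, salt); a skewed key's work is split num_salts ways,
    # so in Spark each salted sub-key would land in a different shuffle partition
    partial = defaultdict(int)
    for key, value in records:
        partial[(key, random.randrange(num_salts))] += value
    # Stage 2: drop the salt and combine the partial sums per original key
    final = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal
    return dict(final)

data = [("hot", 1)] * 1000 + [("cold", 2)] * 3  # heavily skewed toward "hot"
print(salted_two_stage_sum(data))  # {'hot': 1000, 'cold': 6}
```

The same result as a direct group-by, but no single reducer ever sees all of the hot key's rows.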

Re: Spark Reads from MapR and Write to MinIO fails for few batches

2024-08-19 Thread Prem Sahoo
Hello Spark and User, could you please shed some light ? On Thu, Aug 15, 2024 at 7:15 PM Prem Sahoo wrote: > Hello Spark and User, > we have a Spark project which is a long running Spark session where it > does below > 1. We are reading from Mapr FS and writing to MapR FS. > 2. Another parall

Re: [External] Re: Redundant(?) shuffle after join

2024-08-19 Thread Ofir Manor
my two cents, Ofir From: Mich Talebzadeh Sent: Friday, August 16, 2024 7:54 PM To: Shay Elbaz Cc: Shay Elbaz ; user@spark.apache.org Subject: [External] Re: Redundant(?) shuffle after join Hi Shay, Let me address the points you raised using the STAR

Re: Redundant(?) shuffle after join

2024-08-16 Thread Mich Talebzadeh
ame way as before > the shuffle - the DF was already partitioned (and locally sorted) by the > same key. > > Thanks again, > > Shay > > > > -- > *From:* Mich Talebzadeh > *Sent:* Thursday, August 15, 2024 17:21 > *To:* Shay Elbaz &g

Re: Redundant(?) shuffle after join

2024-08-16 Thread Shay Elbaz
From: Mich Talebzadeh Sent: Thursday, August 15, 2024 17:21 To: Shay Elbaz Cc: user@spark.apache.org Subject: Re: Redundant(?) shuffle after join This message contains hyperlinks, take precaution before opening these links. The actual code is not given, so I am going with the plan output and yo

Re: Redundant(?) shuffle after join

2024-08-15 Thread Mich Talebzadeh
The actual code is not given, so I am going with the plan output and your explanation - You're joining a large, bucketed table with a smaller DataFrame on a common column (key_col). - The subsequent window function also uses key_col - However, a shuffle occurs for the window function

Re: Need help understanding tuning docs

2024-08-14 Thread Subhasis Mukherjee
You are mixing up storage and execution memory. The following is the sequence of storage retention/eviction: - Execution and storage share a unified region (M). - When no Spark execution is underway, storage activity can take up the whole of M, and vice versa for execution activity. - When both
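The region sizes can be made concrete with the usual unified-memory arithmetic. The 0.6/0.5 fraction defaults and the 300 MB reservation below reflect my understanding of recent Spark versions (check your version's tuning docs); the heap size is illustrative:

```python
def unified_memory(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Return (M, R) in MB: the unified region and its storage-protected part."""
    reserved = 300  # fixed reservation taken off the heap first
    m = (heap_mb - reserved) * memory_fraction  # M: shared by execution and storage
    r = m * storage_fraction                    # R: storage below R is never evicted
    return m, r

m, r = unified_memory(10 * 1024)  # a 10 GB executor heap
print(f"M = {m} MB, R = {r} MB")  # M = 5964.0 MB, R = 2982.0 MB
```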

Re: [ANNOUNCE] Apache Spark 3.5.2 released

2024-08-12 Thread Xiao Li
Thank you, Kent! On Mon, Aug 12, 2024 at 08:03, Kent Yao wrote: > We are happy to announce the availability of Apache Spark 3.5.2! > > Spark 3.5.2 is the second maintenance release containing security > and correctness fixes. This release is based on the branch-3.5 > maintenance branch of Spark. We strongly re

Re: dynamically infer json data not working as expected

2024-08-08 Thread Perez
Also, I checked your code but it will again give the same result even if I do sampling because the schema of the "data" attribute is not fixed. Any suggestions? On Thu, Aug 8, 2024 at 12:34 PM Perez wrote: > Hi Mich, > > Thanks a lot for your answer but there is one more scenario to it. > > Th

Re: dynamically infer json data not working as expected

2024-08-08 Thread Perez
Hi Mich, Thanks a lot for your answer, but there is one more scenario to it. The schema of the data attribute inside the steps column is not fixed. For some records, I see it as a struct, and for others, I see it as an array of objects. So in the end it treats it as a string, since it gets confused

Re: [spark connect] unable to utilize stand alone cluster

2024-08-06 Thread Prabodh Agarwal
Glad to help! On Tue, 6 Aug, 2024, 17:37 Ilango, wrote: > > Thanks Prabodh. Passing the --master attr in the spark connect command worked like > a charm. I am able to submit spark connect jobs to my existing stand-alone cluster > > Thanks for saving my day once again :) > > Thanks, > Elango > > > On Tue, 6 Aug

Re: [spark connect] unable to utilize stand alone cluster

2024-08-06 Thread Ilango
Thanks Prabodh. Passing the --master attr in the spark connect command worked like a charm. I am able to submit spark connect jobs to my existing stand-alone cluster. Thanks for saving my day once again :) Thanks, Elango On Tue, 6 Aug 2024 at 6:08 PM, Prabodh Agarwal wrote: > Do you get some error on passing

Re: Spark 3.5.0 bug - Writing a small paraquet dataframe to storage using spark 3.5.0 taking too long

2024-08-06 Thread Bijoy Deb
Hi Spark community, Any resolution would be highly appreciated. A few additional findings from my analysis: The lag in writing parquet exists in Spark 3.5.0, but there is no lag in Spark 3.1.2 or 2.4.5. Also, I found that the task WholeStageCodeGen(1) --> ColumnarToRow is the one which is taking the most

Re: [spark connect] unable to utilize stand alone cluster

2024-08-06 Thread Prabodh Agarwal
Do you get some error on passing the master option to your spark connect command? On Tue, 6 Aug, 2024, 15:36 Ilango, wrote: > > > > Thanks Prabodh. I'm having an issue with the Spark Connect connection as > the `spark.master` value is set to `local[*]` in Spark Connect UI, whereas > the actual m

Re: [spark connect] unable to utilize stand alone cluster

2024-08-06 Thread Ilango
Thanks Prabodh. I'm having an issue with the Spark Connect connection as the `spark.master` value is set to `local[*]` in Spark Connect UI, whereas the actual master node for our Spark standalone cluster is different. I am passing that master node ip in the Spark Connect Connection. But still it is

Re: [spark connect] unable to utilize stand alone cluster

2024-08-06 Thread Prabodh Agarwal
There is an executors tab on spark connect. Its contents are generally similar to the workers section of the spark master UI. You might need to specify the --master option in your spark connect command if you haven't done so yet. On Tue, 6 Aug, 2024, 14:19 Ilango, wrote: > > Hi all, > > I am evalu

Re: dynamically infer json data not working as expected

2024-08-05 Thread Mich Talebzadeh
I gave an answer in SO HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD Imperial College London London, United Kingdom view my Linkedin profile

Re: dynamically infer json data not working as expected

2024-08-05 Thread Perez
https://stackoverflow.com/questions/78835509/dynamically-infer-schema-of-json-data-using-pyspark Any help would be appreciated. Thanks, On Mon, Aug 5, 2024 at 10:35 PM Perez wrote: > Hello everyone, > > I have described my problem on the SO blog : > >

Re: [Issue] Spark SQL - broadcast failure

2024-08-01 Thread Sudharshan V
Hi all, Do we have any update on this? Thanks On Tue, 23 Jul, 2024, 12:54 pm Sudharshan V, wrote: > We removed the explicit broadcast for that particular table and it took > longer time since the join type changed from BHJ to SMJ. > > I wanted to understand how I can find what went wrong with the

Re: [Spark Connect] connection issue

2024-07-29 Thread Prabodh Agarwal
Glad it worked! On Tue, 30 Jul, 2024, 11:12 Ilango, wrote: > > Thanks Prabodh. I copied the spark connect jar to the $SPARK_HOME/jars > folder and passed the location as the --jars attr. It's working now. I could > submit spark jobs via spark connect. > > Really appreciate the help. > > > > Thanks, > E

Re: [Spark Connect] connection issue

2024-07-29 Thread Ilango
Thanks Prabodh. I copied the spark connect jar to the $SPARK_HOME/jars folder and passed the location as the --jars attr. It's working now. I could submit spark jobs via spark connect. Really appreciate the help. Thanks, Elango On Tue, 30 Jul 2024 at 11:05 AM, Prabodh Agarwal wrote: > Yeah. I unde

Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Meena Rajani
You probably have to increase jvm/jdk memory size https://stackoverflow.com/questions/1565388/increase-heap-size-in-java On Mon, Jul 29, 2024 at 9:36 PM mike Jadoo wrote: > Thanks. I just downloaded the corretto but I got this error message, > which was the same as before. [It was shared wi

Re: [Spark Connect] connection issue

2024-07-29 Thread Prabodh Agarwal
Yeah. I understand the problem. One of the ways is to actually place the spark connect jar in the $SPARK_HOME/jars folder. That is how we run spark connect. Using the `--packages` or the `--jars` option is flaky in the case of spark connect. You can instead manually place the relevant spark connect ja
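As an illustration of the jar placement described above (the version and Scala suffix in the URL are assumptions; match them to your Spark build, and for an air-gapped node fetch the jar elsewhere and copy it over):

```shell
# Place the spark-connect jar alongside Spark's bundled jars,
# then bake it into the image / restart the server
cd "$SPARK_HOME/jars"
curl -O "https://repo1.maven.org/maven2/org/apache/spark/spark-connect_2.12/3.5.2/spark-connect_2.12-3.5.2.jar"
```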

Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Sadha Chilukoori
Hi Mike, This appears to be an access issue on Windows + Python. Can you try setting up the PYTHON_PATH environment variable as described in this stackoverflow post https://stackoverflow.com/questions/60414394/createprocess-error-5-access-is-denied-pyspark - Sadha On Mon, Jul 29, 2024 at 3:39 PM

Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread mike Jadoo
Thanks. I just downloaded Corretto, but I got this error message, which was the same as before. [It was shared with me that this is saying that I have limited resources, I think] --- Py4JJavaError

Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Sadha Chilukoori
Hi Mike, I'm not sure about the minimum requirements of a machine for running Spark. But to run some PySpark scripts (and Jupyter notebooks) on a local machine, I found the following steps are the easiest. I installed Amazon Corretto and updated the JAVA_HOME variable as instructed here https:/

Re: [Spark Connect] connection issue

2024-07-29 Thread Ilango
Thanks Prabodh, Yes, I can see the spark connect logs in the $SPARK_HOME/logs path. It seems like a spark connect dependency issue. My spark node is air-gapped, so no internet is allowed. Can I download the spark connect jar and pom files locally and share the local paths? How can I share the loca

Re: [Spark Connect] connection issue

2024-07-29 Thread Prabodh Agarwal
The spark connect startup prints the log location. Is that not feasible for you? For me log comes to $SPARK_HOME/logs On Mon, 29 Jul, 2024, 15:30 Ilango, wrote: > > Hi all, > > > I am facing issues with a Spark Connect application running on a Spark > standalone cluster (without YARN and HDFS).

Re: Issue with comparing structs (possible bug)

2024-07-26 Thread Dhruv Singla
The Spark version is 3.5.1. On Fri, Jul 26, 2024 at 6:54 PM Dhruv Singla wrote: > Hey Everyone > > Hope you are doing well > > I am trying to compare structs with structs using the IN clause. Here is > what I found. > The following query comparing structs gives an error > > SELECT struct(1, 2) IN ( > S

Re: [Issue] Spark SQL - broadcast failure

2024-07-23 Thread Sudharshan V
We removed the explicit broadcast for that particular table, and it took longer since the join type changed from BHJ to SMJ. I wanted to understand how I can find what went wrong with the broadcast now. How do I know the size of the table inside Spark memory? I have tried to cache the tabl

Re: [Issue] Spark SQL - broadcast failure

2024-07-23 Thread Sudharshan V
Hi all, apologies for the delayed response. We are using Spark version 3.4.1 in the jar and the EMR 6.11 runtime. We have always disabled auto broadcast and would broadcast the smaller tables explicitly. It was working fine historically, and only now is it failing. The data sizes I men

Re: issue forwarding SPARK_CONF_DIR to start workers

2024-07-20 Thread Holden Karau
This might be a good discussion for the dev@ list; I don't know much about SLURM deployments personally. Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user

Re: issue forwarding SPARK_CONF_DIR to start workers

2024-07-20 Thread Patrice Duroux
Hi, Here is a small patch that solves this issue. Considering all the scripts, I'm not sure if sbin/stop-workers.sh and sbin/stop-worker.sh need a similar change. Do they really care about SPARK_CONF_DIR to do the job? Note that I have also removed the following part in the script: cd "${SPARK_HO

Re: [Issue] Spark SQL - broadcast failure

2024-07-16 Thread Meena Rajani
Can you try disabling broadcast join and see what happens? On Mon, Jul 8, 2024 at 12:03 PM Sudharshan V wrote: > Hi all, > > Been facing a weird issue lately. > In our production code base , we have an explicit broadcast for a small > table. > It is just a look up table that is around 1gb in siz

Re: [Issue] Spark SQL - broadcast failure

2024-07-16 Thread Mich Talebzadeh
It will help if you mention the Spark version and the piece of problematic code HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD Imperial College London

Re: Help wanted on securing spark with Apache Knox / JWT

2024-07-12 Thread Adam Binford
You need to use the spark.ui.filters setting on the history server https://spark.apache.org/docs/latest/configuration.html#spark-ui: spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.pa
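The snippet above is cut off mid-property; a fuller sketch of the shape of such a config follows. Only the two property names shown above come from the message; the handler class and parameter values below are placeholders to show where the JWT/Knox details would go, not verified settings:

```properties
spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
# Per-filter params follow the spark.<filter class name>.param.<param name> pattern
spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.param.type=org.apache.hadoop.security.authentication.server.JWTRedirectAuthenticationHandler
spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.param.authentication.provider.url=https://knox-host:8443/gateway/knoxsso/api/v1/websso
```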

Re: AttributeError: 'MulticlassMetrics' object has no attribute '_sc'

2024-06-23 Thread Saurabh Kumar
Please Unsubscribe me On Mon, 24 Jun 2024 at 07:02, Azhuvath, RajeevX wrote: > Getting the error “AttributeError: 'MulticlassMetrics' object has no > attribute '_sc'” while executing the standalone attached code in a bare > metal system. > > > > Thanks and Regards, > > Rajeev > > ---

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-06-20 Thread Anil Dasari
Hello @Tathagata Das Could you share your thoughts on https://issues.apache.org/jira/browse/SPARK-48418 ? Let me know if you have any questions. thanks. Regards, Anil On Fri, May 24, 2024 at 12:13 AM Anil Dasari wrote: > It appears that structured streaming and Dstream have entirely different

Re: Spark Decommission

2024-06-20 Thread Rajesh Mahindra
Thanks Ahmed, that's useful information. On Wed, Jun 19, 2024 at 1:36 AM Khaldi, Ahmed wrote: > Hey Rajesh, > > > > From my experience, it's a stable feature, however you must keep in mind > that it will not guarantee that you will not lose the data that is on the > pods of the nodes getting a spot

Re: Help in understanding Exchange in Spark UI

2024-06-20 Thread Mich Talebzadeh
OK, I gave an answer in StackOverflow. Happy reading Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD Imperial College London London, United Kingdom

Re: [SPARK-48423] Unable to save ML Pipeline to azure blob storage

2024-06-19 Thread Chhavi Bansal
Hello Team, I am pinging back on this thread to get a pair of eyes on this issue. Ticket: https://issues.apache.org/jira/browse/SPARK-48423 On Thu, 6 Jun 2024 at 00:19, Chhavi Bansal wrote: > Hello team, > I was exploring on how to save ML pipeline to azure blob storage, but was > setback by an

Re: Spark Decommission

2024-06-19 Thread Khaldi, Ahmed
Hey Rajesh, From my experience, it's a stable feature; however, you must keep in mind that it will not guarantee that you will not lose the data that is on the pods of the nodes getting a spot kill. Once you have a spot kill, you have 120s to give the node back to the cloud provider. This is w

Re: Update mode in spark structured streaming

2024-06-15 Thread Mich Talebzadeh
Best to qualify your thoughts with an example. By using the foreachBatch function combined with the update output mode in Spark Structured Streaming, you can effectively handle and integrate late-arriving data into your aggregations. This approach will allow you to continuously update your aggregat
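The update-mode semantics described above can be sketched in plain Python (this simulates what the mode does; it is not Spark code, and every name is made up for the illustration). Each micro-batch, including a late one, changes only some rows of the running aggregate, and only those changed rows are handed to the sink:

```python
def apply_batch(counts, batch_keys):
    """Fold one micro-batch into running per-window counts and return only the
    rows whose aggregate changed, which is what update mode emits to the sink."""
    updated = {}
    for key in batch_keys:
        counts[key] = counts.get(key, 0) + 1
        updated[key] = counts[key]  # changed rows only, never the full state
    return updated

counts = {}
apply_batch(counts, ["10:00", "10:05"])   # on-time events
late = apply_batch(counts, ["10:00"])     # a late event for an old window
print(late)  # {'10:00': 2} -- the sink upserts just this one row
```

In real code the per-batch upsert would live inside the function passed to `foreachBatch`.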

Re: Re: OOM issue in Spark Driver

2024-06-11 Thread Mich Talebzadeh
t; may cause OOM. > Checking logs will always be a good start. And it would be better if some > colleague of you is familiar with JVM and OOM related issues. > > BS > Lingzhe Sun > > > *From:* Karthick Nk > *Date:* 2024-06-11 13:28 > *To:* Lingzhe Sun > *CC:* Andr

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-10 Thread Gourav Sengupta
Hi, another thing we can consider while parallelising connection with the upstream sources is that it means you are querying the system simultaneously and that causes usage spikes, and in case the source system is facing a lot of requests during production workloads the best time to parallelise w

Re: Unable to load MongoDB atlas data via PySpark because of BsonString error

2024-06-09 Thread Perez
Hi Team, Any help in this matter would be greatly appreciated. TIA On Sun, Jun 9, 2024 at 11:26 AM Perez wrote: > Hi Team, > > this is the problem > https://stackoverflow.com/questions/78593858/unable-to-load-mongodb-atlas-data-via-pyspark-jdbc-in-glue > > I can't go ahead with *StructType* ap

Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-08 Thread Someshwar Kale
Hi Chhavi, Currently there is no way to handle a backtick (`) in Spark's StructType; hence the field names a.b and `a.b` are completely different within StructType. To handle that, I have added a custom implementation fixing StringIndexer#validateAndTransformSchema. You can refer to the code on my github

Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-08 Thread Chhavi Bansal
Hi Someshwar, Thanks for the response, I have added my comments to the ticket. Thanks, Chhavi Bansal On Thu, 6 Jun 2024 at 17:28, Someshwar Kale wrote: > As a fix, you may consider adding a transformer to rename columns (perhaps > replace all

Re: OOM issue in Spark Driver

2024-06-08 Thread Andrzej Zera
Hey, do you perform stateful operations? Maybe your state is growing indefinitely - a screenshot with state metrics would help (you can find it in Spark UI -> Structured Streaming -> your query). Do you have a driver-only cluster or do you have workers too? What's the memory usage profile at worker

Re: 7368396 - Apache Spark 3.5.1 (Support)

2024-06-07 Thread Sadha Chilukoori
Hi Alex, Spark is an open source software available under Apache License 2.0 ( https://www.apache.org/licenses/), further details can be found here in the FAQ page (https://spark.apache.org/faq.html). Hope this helps. Thanks, Sadha On Thu, Jun 6, 2024, 1:32 PM SANTOS SOUZA, ALEX wrote: > H

Re: Kubernetes cluster: change log4j configuration using uploaded `--files`

2024-06-06 Thread Mich Talebzadeh
The issue you are encountering is due to the order of operations when Spark initializes the JVM for driver and executor pods. The JVM options (-Dlog4j2.configurationFile) are evaluated when the JVM starts, but the --files option copies the files after the JVM has already started. Hence, the log4j c
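A common workaround consistent with that explanation is to point the JVM at a file that already exists before startup, e.g. baked into the image or mounted from a ConfigMap, instead of shipping it late via --files (the mount path below is an assumption):

```properties
spark.driver.extraJavaOptions    -Dlog4j2.configurationFile=file:/opt/spark/conf/log4j2.properties
spark.executor.extraJavaOptions  -Dlog4j2.configurationFile=file:/opt/spark/conf/log4j2.properties
```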

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Also, can I take my lower bound starting from 1, or is it the index? On Thu, Jun 6, 2024 at 8:42 PM Perez wrote: > Thanks again Mich. It gives the clear picture but I have again couple of > doubts: > > 1) I know that there will be multiple threads that will be executed with > 10 segment sizes each unt

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Thanks again, Mich. It gives a clear picture, but I again have a couple of doubts: 1) I know that there will be multiple threads executed with 10 segment sizes each until the upper bound is reached, but I didn't get this part of the code exactly: segments = [(i, min(i + segment_size, uppe
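The truncated segments line being asked about presumably builds (lower, upper) ranges for the worker threads. A hedged plain-Python reconstruction of that pattern (the function name and the assumption that ranges are half-open are mine, inferred from the quoted fragment):

```python
def make_segments(lower_bound, upper_bound, segment_size):
    """Split [lower_bound, upper_bound) into (start, end) ranges of at most
    segment_size values; each range can then be fetched by its own thread."""
    return [(i, min(i + segment_size, upper_bound))
            for i in range(lower_bound, upper_bound, segment_size)]

print(make_segments(1, 35, 10))  # [(1, 11), (11, 21), (21, 31), (31, 35)]
```

The min() simply clips the final segment so it never runs past the upper bound.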

Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-06 Thread Someshwar Kale
As a fix, you may consider adding a transformer to rename columns (perhaps replace all columns with dot to underscore) and use the renamed columns in your pipeline as below- val renameColumn = new RenameColumn().setInputCol("location.longitude").setOutputCol("location_longitude") val si = new Str

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Mich Talebzadeh
Well, you can dynamically determine the upper bound by first querying the database to find the maximum value of the partition column and using it as the upper bound for your partitioning logic. def get_max_value(spark, mongo_config, column_name): max_value_df = spark.read.format("com.mongodb.spar

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Thanks, Mich, for your response. However, I have multiple doubts, as below: 1) I am trying to load the data for an incremental batch, so I am not sure what my upper bound would be. So what can we do? 2) As each thread loads the desired segment size's data into a dataframe, if I want to aggregate a

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Mich Talebzadeh
Yes, partitioning and parallel loading can significantly improve the performance of data extraction from JDBC sources or databases like MongoDB. This approach can leverage Spark's distributed computing capabilities, allowing you to load data in parallel, thus speeding up the overall data loading pr
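The advice above maps onto Spark's built-in JDBC partitioning options. Here they are assembled as a plain dict builder so the shape is easy to see and test (the table, column, bounds, and the commented connection URL are placeholders, not values from the thread):

```python
def jdbc_partition_options(table, column, lower, upper, num_partitions):
    """Build Spark JDBC partitioning options: with these set, Spark issues
    num_partitions parallel range queries on `column` instead of one big scan."""
    return {
        "dbtable": table,
        "partitionColumn": column,      # must be numeric, date, or timestamp
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }

opts = jdbc_partition_options("schema.t1", "id", 1, 1_000_000, 8)
# Usage sketch (placeholder URL):
# df = (spark.read.format("jdbc")
#       .option("url", "jdbc:postgresql://host/db")
#       .options(**opts)
#       .load())
```

Note the bounds only shape how rows are split across partitions; they do not filter rows.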

Re: Terabytes data processing via Glue

2024-06-05 Thread Perez
Thanks Nitin and Russel for your responses. Much appreciated. On Mon, Jun 3, 2024 at 9:47 PM Russell Jurney wrote: > You could use either Glue or Spark for your job. Use what you’re more > comfortable with. > > Thanks, > Russell Jurney @rjurney > russell.jur...@gmail

Re: Classification request

2024-06-04 Thread Dirk-Willem van Gulik
Actually - that answer may oversimplify things / be rather incorrect depending on the exact question of the entity that asks and the exact situation (who ships what code from where). For this reason it is probably best to refer this original poster to: https://www.apache.org/licenses/ex

Re: Classification request

2024-06-04 Thread Artemis User
Sara, Apache Spark is open source under Apache License 2.0 (https://github.com/apache/spark/blob/master/LICENSE).  It is not under export control of any country!  Please feel free to use, reproduce and distribute, as long as your practice is compliant with the license. Having said that, some c

Re: Terabytes data processing via Glue

2024-06-03 Thread Russell Jurney
You could use either Glue or Spark for your job. Use what you’re more comfortable with. Thanks, Russell Jurney @rjurney russell.jur...@gmail.com LI FB datasyndrome.com On Sun, Jun 2, 2024 at 9:59 PM

Re: Terabytes data processing via Glue

2024-06-02 Thread Perez
Hello, Can I get some suggestions? On Sat, Jun 1, 2024 at 1:18 PM Perez wrote: > Hi Team, > > I am planning to load and process around 2 TB historical data. For that > purpose I was planning to go ahead with Glue. > > So is it ok if I use glue if I calculate my DPUs needed correctly? or > shoul

Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Amin Mosayyebzadeh
I am reading from a single file: df = spark.read.text("s3a://test-bucket/testfile.csv") On Fri, May 31, 2024 at 5:26 AM Mich Talebzadeh wrote: > Tell Spark to read from a single file > > data = spark.read.text("s3a://test-bucket/testfile.csv") > > This clarifies to Spark that you are dealing w

Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Mich Talebzadeh
Tell Spark to read from a single file data = spark.read.text("s3a://test-bucket/testfile.csv") This clarifies to Spark that you are dealing with a single file and avoids any bucket-like interpretation. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD
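A minimal sketch of the single-object batch read being suggested, assuming a SparkSession already configured with the hadoop-aws s3a connector and credentials (the bucket and key below are the thread's example values):

```python
def s3a_path(bucket, key):
    # Full s3a URI for one object, e.g. "s3a://test-bucket/testfile.csv".
    return f"s3a://{bucket}/{key}"

def read_single_object(spark, bucket, key):
    # Runs only with a live SparkSession; pointing read.text at the exact
    # object key tells Spark this is a single file, avoiding the
    # bucket/directory-style interpretation described in the thread.
    return spark.read.text(s3a_path(bucket, key))
```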

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-30 Thread Subhasis Mukherjee
laid=6087618 From: Gera Shegalov Sent: Wednesday, May 29, 2024 7:57:56 am To: Prem Sahoo Cc: eab...@163.com ; Vibhor Gupta ; user @spark Subject: Re: Re: EXT: Dual Write to HDFS and MinIO in faster way I agree with the previous answers that (if requirements allow it

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Amin Mosayyebzadeh
I will work on the first two possible causes. For the third one, which I guess is the real problem, Spark treats the testfile.csv object at the URL s3a://test-bucket/testfile.csv as a bucket, trying to access _spark_metadata at the URL s3a://test-bucket/testfile.csv/_spark_metadata. testfile.csv is an objec

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
ok some observations - Spark job successfully lists the S3 bucket containing testfile.csv. - Spark job can retrieve the file size (33 Bytes) for testfile.csv. - Spark job fails to read the actual data from testfile.csv. - The printed content from testfile.csv is an empty list. - S

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Amin Mosayyebzadeh
The code should read the testfile.csv file from s3 and print the content. It only prints an empty list although the file has content. I have also checked our custom s3 storage (Ceph based) logs and I see only LIST operations coming from Spark; there is no GET object operation for testfile.csv The only

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
Hello, Overall, the exit code of 0 suggests a successful run of your Spark job. Analyze the intended purpose of your code and verify the output or Spark UI for further confirmation. 24/05/30 01:23:43 INFO SparkContext: SparkContext is stopping with exitCode 0. what to check 1. Verify Output

Re: [s3a] Spark is not reading s3 object content

2024-05-29 Thread Amin Mosayyebzadeh
Hi Mich, Thank you for the help and sorry about the late reply. I ran your provided but I got "exitCode 0". Here is the complete output: === 24/05/30 01:23:38 INFO SparkContext: Running Spark version 3.5.0 24/05/30 01:23:38 INFO SparkContext: OS info Linux, 5.4.0-182

Re: OOM concern

2024-05-28 Thread Perez
Thanks Mich for the detailed explanation. On Tue, May 28, 2024 at 9:53 PM Mich Talebzadeh wrote: > Russell mentioned some of these issues before. So in short your mileage > varies. For a 100 GB data transfer, the speed difference between Glue and > EMR might not be significant, especially consid
