pyspark.ml.recommendation is using the wrong python version

2023-09-04 Thread Harry Jamison
I am using Python 3.7 and Spark 2.4.7. I am trying to figure out why my job is using the wrong Python version. This is how it is starting up; the logs confirm that I am using Python 3.7. But I later see the error message showing it is trying to use 3.8, and I am not sure where it is picking that up.
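The usual fix, sketched below, is to pin the same interpreter for the driver and the executors before the session starts. PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are the standard variable names; the interpreter path is a placeholder to adjust for your cluster:

```python
import os

# Placeholder path -- point at whichever python3.7 your cluster actually uses.
PY37 = "/usr/bin/python3.7"

# Must be set before the SparkContext/SparkSession is created;
# otherwise executors may pick up whatever `python3` resolves to (e.g. 3.8).
os.environ["PYSPARK_PYTHON"] = PY37
os.environ["PYSPARK_DRIVER_PYTHON"] = PY37
```

The same settings can also be made in spark-env.sh, or via the spark.pyspark.python / spark.pyspark.driver.python configuration entries.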

Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-04 Thread Nagatomi Yasukazu
Hello Mich, Thank you for your questions. Here are my responses: > 1. What investigation have you done to show that it is running in local mode? I have verified through the History Server's Environment tab that: - "spark.master" is set to local[*] - "spark.app.id" begins with local-xxx -

Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-04 Thread Mich Talebzadeh
personally I have not used this feature myself. However, some points 1. What investigation have you done to show that it is running in local mode? 2. who has configured this kubernetes cluster? Is it supplied by a cloud vendor? 3. Confirm that you have configured Spark Connect

Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-03 Thread Nagatomi Yasukazu
Hi Cley, Thank you for taking the time to respond to my query. Your insights on Spark cluster deployment are much appreciated. However, I'd like to clarify that my specific challenge is related to running the Spark Connect Server on Kubernetes in Cluster Mode. While I understand the general

Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-03 Thread Cleyson Barros
Hi Nagatomi, Use the Apache images, then run your master node, then start your workers. You can add a command line in the Dockerfiles to call the master, using the Docker container names in your service composition. If you wish to run 2 masters (active and standby), follow the instructions in

Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-02 Thread Nagatomi Yasukazu
Hello Apache Spark community, I'm currently trying to run Spark Connect Server on Kubernetes in Cluster Mode and facing some challenges. Any guidance or hints would be greatly appreciated. ## Environment: Apache Spark version: 3.4.1 Kubernetes version: 1.23 Command executed:

[Spark Connect]Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-02 Thread Nagatomi Yasukazu
Hello Apache Spark community, I'm currently trying to run Spark Connect Server on Kubernetes in Cluster Mode and facing some challenges. Any guidance or hints would be greatly appreciated. ## Environment: Apache Spark version: 3.4.1 Kubernetes version: 1.23 Command executed:

Re: Spark Connect, Master, and Workers

2023-09-01 Thread James Yu
Can I simply understand Spark Connect this way: The client process is now the Spark driver? From: Brian Huynh Sent: Thursday, August 10, 2023 10:15 PM To: Kezhi Xiong Cc: user@spark.apache.org Subject: Re: Spark Connect, Master, and Workers Hi Kezhi, Yes,

Re: Elasticsearch support for Spark 3.x

2023-09-01 Thread Koert Kuipers
could the provided scope be the issue? On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev wrote: > Using the following dependency for Spark 3 in the POM file (my Scala version is 2.12.14): org.elasticsearch : elasticsearch-spark-30_2.12 : 7.12.0, scope provided. > The code throws error

Reg read json inference schema

2023-08-31 Thread Manoj Babu
Hi Team, I am getting the below error when reading a column whose value is a JSON string. json_schema_ctx_rdd = record_df.rdd.map(lambda row: row.contexts_parsed) spark.read.option("mode", "PERMISSIVE").option("inferSchema", "true").option("inferTimestamp", "false").json(json_schema_ctx_rdd)
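For context, PERMISSIVE mode means malformed records are kept (routed to a corrupt-record column) rather than failing the read. The stdlib sketch below is not Spark code, just an illustration of that contract on a list of JSON strings:

```python
import json

def parse_permissive(s):
    # Mimic Spark's PERMISSIVE read mode for one JSON string:
    # return the parsed record, or park the raw value under _corrupt_record.
    try:
        return json.loads(s)
    except (json.JSONDecodeError, TypeError):
        return {"_corrupt_record": s}

rows = ['{"a": 1}', "not json", None]
parsed = [parse_permissive(r) for r in rows]
# parsed[0] == {"a": 1}; the malformed and null rows land in _corrupt_record
```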

Re:

2023-08-31 Thread leibnitz
me too. ayan guha wrote on Thu, Aug 24, 2023 at 09:02: > Unsubscribe -- > Best Regards, > Ayan Guha >

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Bjørn Jørgensen
I have tried to upgrade it. It comes from kubernetes-client: [SPARK-43990][BUILD] Upgrade kubernetes-client to 6.7.2. On Thu, Aug 31, 2023 at 14:47, Agrawal, Sanket wrote: > I don't see an entry in pom.xml while building spark. I think

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Sean Owen
It's a dependency of some other HTTP library. Use mvn dependency:tree to see where it comes from. It may be more straightforward to upgrade the library that brings it in, assuming a later version brings in a later okio. You can also manage up the version directly with a new entry in However,
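A sketch of the "manage up the version" option Sean mentions, via a dependencyManagement entry in your own pom.xml; the okio version below is a placeholder to verify against whatever HTTP library pulls it in:

```xml
<!-- Sketch only: pins the transitive okio version for all dependents.
     Verify the chosen version is binary-compatible with the library
     that brings okio in (check with `mvn dependency:tree`). -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.squareup.okio</groupId>
      <artifactId>okio</artifactId>
      <version>1.17.6</version> <!-- placeholder version -->
    </dependency>
  </dependencies>
</dependencyManagement>
```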

RE: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Agrawal, Sanket
I don’t see an entry in pom.xml while building spark. I think it is being downloaded as part of some other dependency. From: Sean Owen Sent: Thursday, August 31, 2023 5:10 PM To: Agrawal, Sanket Cc: user@spark.apache.org Subject: [EXT] Re: Okio Vulnerability in Spark 3.4.1 Does the

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Sean Owen
Does the vulnerability affect Spark? In any event, have you tried updating Okio in the Spark build? I don't believe you could just replace the JAR, as other libraries probably rely on it and compiled against the current version. On Thu, Aug 31, 2023 at 6:02 AM Agrawal, Sanket wrote: > Hi All, >

Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Agrawal, Sanket
Hi All, Amazon inspector has detected a vulnerability in okio-1.15.0.jar JAR in Spark 3.4.1. It suggests to upgrade the jar version to 3.4.0. But when we try this version of jar then the spark application is failing with below error: py4j.protocol.Py4JJavaError: An error occurred while calling

CommunityOverCode(CoC) 2023

2023-08-28 Thread Uma Maheswara Rao Gangumalla
Hi All, The CommunityOverCode (CoC) 2023 Conference is approaching quickly. This year's conference is happening in Halifax, Nova Scotia, Canada (Oct 07 - Oct 10, 2023). We have an exciting set of talks lined up from compute and storage experts. Please take a moment to check the compute track

Registration open for Community Over Code North America

2023-08-28 Thread Rich Bowen
Hello! Registration is still open for the upcoming Community Over Code NA event in Halifax, NS! We invite you to register for the event https://communityovercode.org/registration/ Apache Committers, note that you have a special discounted rate for the conference at US$250. To take advantage of

Re: Elasticsearch support for Spark 3.x

2023-08-27 Thread Dipayan Dev
Using the following dependency for Spark 3 in the POM file (my Scala version is 2.12.14): org.elasticsearch : elasticsearch-spark-30_2.12 : 7.12.0, scope provided. The code throws an error at this line: df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name") The same code
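The flattened coordinates above correspond to a POM entry like the sketch below. Note Koert's question elsewhere in this thread about the provided scope: provided keeps the connector off the runtime classpath, which would itself explain an es.DefaultSource ClassNotFoundException unless the cluster supplies the jar:

```xml
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark-30_2.12</artifactId>
  <version>7.12.0</version>
  <!-- "provided" = not bundled with the application; drop this line
       (or ship the jar via --jars/--packages) if the cluster lacks it -->
  <scope>provided</scope>
</dependency>
```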

Re: Elasticsearch support for Spark 3.x

2023-08-27 Thread Holden Karau
What’s the version of the ES connector you are using? On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev wrote: > Hi All, > > We're using Spark 2.4.x to write dataframe into the Elasticsearch index. > As we're upgrading to Spark 3.3.0, it throwing out error > Caused by:

Two new tickets for Spark on K8s

2023-08-26 Thread Mich Talebzadeh
Hi, @holden Karau recently created two Jiras that deal with two items of interest namely: 1. Improve Spark Driver Launch Time SPARK-44950 2. Improve Spark Dynamic Allocation SPARK-44951

Re: Spark 2.4.7

2023-08-26 Thread Mich Talebzadeh
Sorry for forgetting. Add this line to the top of the code import sys Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile

Re: Spark 2.4.7

2023-08-26 Thread Mich Talebzadeh
Hi guys, You can try the code below in PySpark, relying on the urllib library to download the contents of the URL and then create a new column in the DataFrame to store the downloaded contents. Spark 4.3.0 The limit explained by Varun from pyspark.sql import SparkSession from

Re: Spark 2.4.7

2023-08-26 Thread Harry Jamison
Thank you Varun, this makes sense. I understand a separate process for content ingestion. I was thinking it would be a separate spark job, but it sounds like you are suggesting that ideally I should do it outside of Hadoop entirely? Thanks Harry On Saturday, August 26, 2023 at 09:19:33

Re: Spark 2.4.7

2023-08-26 Thread Varun Shah
Hi Harry, Ideally, you should not be fetching a url in your transformation job but do the API calls separately (outside the cluster if possible). Ingesting data should be treated separately from transformation / cleaning / join operations. You can create another dataframe of urls, dedup if

Unsubscribe

2023-08-26 Thread Ozair Khan
Unsubscribe Regards, Ozair Khan

Elasticsearch support for Spark 3.x

2023-08-26 Thread Dipayan Dev
Hi All, We're using Spark 2.4.x to write a dataframe into the Elasticsearch index. As we're upgrading to Spark 3.3.0, it is throwing the error Caused by: java.lang.ClassNotFoundException: es.DefaultSource at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476) at

Spark 2.4.7

2023-08-25 Thread Harry Jamison
I am using Python 3.7 and Spark 2.4.7. I am not sure what the best way to do this is. I have a dataframe with a url in one of the columns, and I want to download the contents of that url and put it in a new column. Can someone point me in the right direction on how to do this? I looked at the UDFs
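A minimal sketch of the UDF route, assuming stdlib urllib; the fetch lives in a plain function so the error handling can be tested outside Spark (and, per Varun's advice elsewhere in this thread, bulk ingestion is better done outside the transformation job anyway):

```python
from urllib.request import urlopen
from urllib.error import URLError

def fetch_url(url, timeout=10):
    # Return the page body as text, or None on any fetch problem,
    # so one bad URL cannot fail the whole Spark task.
    if not url:
        return None
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (URLError, ValueError, OSError):
        return None

# Hypothetical Spark wiring (names assumed, not from the thread):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   df = df.withColumn("content", udf(fetch_url, StringType())(df["url"]))
```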

Re: mysterious spark.sql.utils.AnalysisException Union in spark 3.3.2, but not seen in 3.4.0+

2023-08-25 Thread Mich Talebzadeh
Hi Srivatsan, Ground investigation: 1. Does this union explicitly exist in your code? If not, where are the 7 and 6 column counts coming from? 2. On 3.3.1, have you looked at the Spark UI and the relevant DAG diagram? 3. Check the query execution plan using the explain() functionality. 4. Can

mysterious spark.sql.utils.AnalysisException Union in spark 3.3.2, but not seen in 3.4.0+

2023-08-25 Thread Srivatsan vn
Hello Users, I have been seeing some weird issues since I upgraded my EMR setup to 6.11 (which uses Spark 3.3.2). The call stack seems to point to a code location where there is no explicit union; also, I have unionByName everywhere in the codebase with allowMissingColumns set
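When the exception reports e.g. 7 versus 6 columns, the first diagnostic is simply diffing the two schemas; a stdlib sketch (column names made up). unionByName with allowMissingColumns=True null-fills exactly the columns such a diff reports, which is why a stray plain union() in a plan can still fail:

```python
def union_mismatch(left_cols, right_cols):
    # Report which columns each side of a union is missing,
    # mirroring the count mismatch behind the AnalysisException.
    left, right = set(left_cols), set(right_cols)
    return {
        "missing_on_right": sorted(left - right),
        "missing_on_left": sorted(right - left),
    }

diff = union_mismatch(list("abcdefg"), list("abcdef"))  # 7 vs 6 columns
# diff == {"missing_on_right": ["g"], "missing_on_left": []}
```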

Unsubscribe

2023-08-25 Thread Dipayan Dev

Spark Connect: API mismatch in SparkSession#execute

2023-08-25 Thread Stefan Hagedorn
Hi everyone, I’m trying to use the “extension” feature of the Spark Connect CommandPlugin (Spark 3.4.1). I created a simple protobuf message `MyMessage` that I want to send from the connect client-side to the connect server (where I registered my plugin). The SparkSession class in

Unsubscribe

2023-08-24 Thread Hemanth Dendukuri
Unsubscribe

Fwd:  Wednesday: Join 6 Members at "Ofir Press | Complementing Scale: Novel Guidance Methods for Improving LMs"

2023-08-24 Thread Mich Talebzadeh
They recently held a combined Apache Spark and AI meeting in London. An online session worth attending for some? HTH Mich view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk.

Unsubscribe

2023-08-24 Thread Никита Романов
Unsubscribe

Unsubscribe

2023-08-23 Thread Nizam Shaik
Unsubscribe

Unsubscribe

2023-08-23 Thread Aayush Ostwal
Unsubscribe

Unsubscribe

2023-08-23 Thread Dipayan Dev
Unsubscribe

[no subject]

2023-08-23 Thread ayan guha
Unsubscribe-- Best Regards, Ayan Guha

Re: $SPARK_HOME/sbin/start-worker.sh spark://{main_host}:{cluster_port} failing

2023-08-23 Thread Mich Talebzadeh
Hi Jeremy, This error concerns me "23/08/23 20:01:03 ERROR LevelDBProvider: error opening leveldb file file:/mnt/data_ebs/infrastructure/spark/tmp/registeredExecutors.ldb. Creating new file, will not be able to recover state for existing applications

$SPARK_HOME/sbin/start-worker.sh spark://{main_host}:{cluster_port} failing

2023-08-23 Thread Jeremy Brent
Hi Spark Community, We have a cluster running with Spark 3.3.1. All nodes are AWS EC2’s with an Ubuntu OS version 22.04. One of the workers disconnected from the main node. When we run $SPARK_HOME/sbin/start-worker.sh spark://{main_host}:{cluster_port} it appears to run successfully; there is no

unsubscribe

2023-08-22 Thread heri wijayanto
unsubscribe

Re: error trying to save to database (Phoenix)

2023-08-22 Thread Gera Shegalov
If you look at the dependencies of the 5.0.0-HBase-2.0 artifact https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark/5.0.0-HBase-2.0 it was built against Spark 2.3.0, Scala 2.11.8 You may need to check with the Phoenix community if your setup with Spark 3.4.1 etc is supported by

Fwd: Recap on current status of "SPIP: Support Customized Kubernetes Schedulers"

2023-08-22 Thread Mich Talebzadeh
I found some of the notes on Volcano and my tests back in Feb 2022. I did my volcano tests on Spark 3.1.1. The results were not very great then. Hence I asked in thread from @santosh, if any updated comparisons are available. I will try the test with Spark 3.4.1 at some point. Maybe some users

[ANNOUNCE] Apache Spark 3.3.3 released

2023-08-22 Thread Yuming Wang
We are happy to announce the availability of Apache Spark 3.3.3! Spark 3.3.3 is a maintenance release containing stability fixes. This release is based on the branch-3.3 maintenance branch of Spark. We strongly recommend all 3.3 users to upgrade to this stable release. To download Spark 3.3.3,

Unsubscribe

2023-08-21 Thread Dipayan Dev
-- With Best Regards, Dipayan Dev Author of *Deep Learning with Hadoop * M.Tech (AI), IISc, Bangalore

Re: error trying to save to database (Phoenix)

2023-08-21 Thread Kal Stevens
Sorry for being so dense, and thank you for your help. I was using this version, phoenix-spark-5.0.0-HBase-2.0.jar, because it was the latest in this repo https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark On Mon, Aug 21, 2023 at 5:07 PM Sean Owen wrote: > It is. But you have a

Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
It is. But you have a third party library in here which seems to require a different version. On Mon, Aug 21, 2023, 7:04 PM Kal Stevens wrote: > OK, it was my impression that scala was packaged with Spark to avoid a > mismatch > https://spark.apache.org/downloads.html > > It looks like spark

Re: error trying to save to database (Phoenix)

2023-08-21 Thread Kal Stevens
OK, it was my impression that Scala was packaged with Spark to avoid a mismatch https://spark.apache.org/downloads.html It looks like Spark 3.4.1 (my version) uses Scala 2.12. How do I specify the Scala version? On Mon, Aug 21, 2023 at 4:47 PM Sean Owen wrote: > That's a mismatch in the

Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
That's a mismatch in the version of scala that your library uses vs spark uses. On Mon, Aug 21, 2023, 6:46 PM Kal Stevens wrote: > I am having a hard time figuring out what I am doing wrong here. > I am not sure if I have an incompatible version of something installed or > something else. > I

error trying to save to database (Phoenix)

2023-08-21 Thread Kal Stevens
I am having a hard time figuring out what I am doing wrong here. I am not sure if I have an incompatible version of something installed or something else. I cannot find anything relevant on Google to figure out what I am doing wrong. I am using Spark 3.4.1 and Python 3.10. This is my code to

DataFrame cache keeps growing

2023-08-21 Thread Varun .N
Hi Team, While trying to understand/looking out for a problem of "where size of dataframe keeps growing" , I realized that a similar question was asked a couple of years ago. Need your help in resolving this.

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Bjørn Jørgensen
In your file /home/spark/real-estate/pullhttp/pull_apartments.py, replace import org.apache.spark.SparkContext with from pyspark import SparkContext. On Mon, Aug 21, 2023 at 15:13, Kal Stevens wrote: > I am getting a class not found error > import org.apache.spark.SparkContext > > It sounds

Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Kal Stevens
I am getting a class not found error import org.apache.spark.SparkContext It sounds like this is because pyspark is not installed, but as far as I can tell it is. Pyspark is installed in the correct python version root@namenode:/home/spark/# pip3.10 install pyspark Requirement already

Spark doesn’t create SUCCESS file when external path is passed

2023-08-21 Thread Dipayan Dev
Hi Team, I need some help and if someone can replicate the issue at their end, or let me know if I am doing anything wrong. https://issues.apache.org/jira/browse/SPARK-44884 We have recently upgraded to Spark 3.3.0 in our Production Dataproc. We have a lot of downstream application that relies

Unsubscribe

2023-08-21 Thread Umesh Bansal

Re: k8s+ YARN Spark

2023-08-21 Thread Mich Talebzadeh
Interesting. Spark supports the following cluster managers - Standalone: A cluster-manager, limited in features, shipped with Spark. - Apache Hadoop YARN is the most widely used resource manager not just for Spark but for other artefacts as well. On-premise YARN is used extensively.

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Mich Talebzadeh
This should work; check your path. `which pyspark` should point to /opt/spark/bin/pyspark. And your installation should contain: cd $SPARK_HOME /opt/spark> ls LICENSE NOTICE R README.md RELEASE bin conf data examples jars kubernetes licenses logs python sbin yarn You should

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Kal Stevens
Nevermind I was doing something dumb On Sun, Aug 20, 2023 at 9:53 PM Kal Stevens wrote: > Are there installation instructions for Spark 3.4.1? > > I defined SPARK_HOME as it describes here > > https://spark.apache.org/docs/latest/api/python/getting_started/install.html > > ls

k8s+ YARN Spark

2023-08-21 Thread Крюков Виталий Семенович
Good afternoon. Perhaps you will be discouraged by what I will write below, but nevertheless, I ask for help in solving my problem. Perhaps the architecture of our solution will not seem correct to you. There are backend services that communicate with a service that implements spark-driver.

Problem with spark 3.4.1 not finding spark java classes

2023-08-20 Thread Kal Stevens
Are there installation instructions for Spark 3.4.1? I defined SPARK_HOME as it describes here https://spark.apache.org/docs/latest/api/python/getting_started/install.html ls $SPARK_HOME/python/lib py4j-0.10.9.7-src.zip PY4J_LICENSE.txt pyspark.zip I am getting a class not found error

Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-20 Thread Dipayan Dev
Hi Mich, It's not specific to ORC, and looks like a bug from Hadoop Common project. I have raised a bug and am happy to contribute to Hadoop 3.3.0 version. Do you know if anyone could help me to set the Assignee? https://issues.apache.org/jira/browse/HADOOP-18856 With Best Regards, Dipayan Dev

Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Mich Talebzadeh
Under gs directory "gs://test_dd1/abc/" What do you see? gsutil ls gs://test_dd1/abc and the same gs://test_dd1/ gsutil ls gs://test_dd1 I suspect you need a folder for multiple ORC slices! Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin

Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Dipayan Dev
Hi Everyone, I'm stuck with one problem, where I need to provide a custom GCS location for the Hive table from Spark. The code fails while doing an *'insert into'* whenever my Hive table has a flat GCS location like gs://, but works for nested locations like gs://bucket_name/blob_name. Is anyone

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-19 Thread Nebi Aydin
Here's the executor logs ``` java.io.IOException: Connection from ip-172-31-16-143.ec2.internal/172.31.16.143:7337 closed at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146) at

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-19 Thread Mich Talebzadeh
That error message, FetchFailedException: Failed to connect to on port 7337, happens when a task running on one executor node tries to fetch data from another executor node but fails to establish a connection to the specified port (7337 in this case). In a nutshell it is performing network IO

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
Hi, sorry for duplicates. First time user :) I keep getting FetchFailedException on port 7337 (closed), which is the external shuffle service port. I was trying to tune these parameters. I have around 1000 executors and 5000 cores. I tried to set spark.shuffle.io.serverThreads to 2k. Should I also set

Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Mich Talebzadeh
Hi, These two threads that you sent seem to be duplicates of each other? Anyhow I trust that you are familiar with the concept of shuffle in Spark. Spark Shuffle is an expensive operation since it involves the following - Disk I/O - Involves data serialization and deserialization

[Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
I want to learn differences among below thread configurations. spark.shuffle.io.serverThreads spark.shuffle.io.clientThreads spark.shuffle.io.threads spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads Thanks.
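For reference, these keys are plain spark-defaults.conf entries; a sketch with illustrative (not recommended) values. My understanding, to verify against your Spark version's docs, is that the server/client variants override the shared *.io.threads value when set:

```properties
# spark-defaults.conf -- illustrative values only
spark.shuffle.io.serverThreads   64   # netty server threads serving shuffle blocks
spark.shuffle.io.clientThreads   64   # netty client threads fetching shuffle blocks
spark.rpc.io.serverThreads       32   # netty server threads for RPC
spark.rpc.io.clientThreads       32   # netty client threads for RPC
```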

[Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
I want to learn differences among below thread configurations. spark.shuffle.io.serverThreads spark.shuffle.io.clientThreads spark.shuffle.io.threads spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads Thanks.

[no subject]

2023-08-18 Thread Dipayan Dev
Unsubscribe -- With Best Regards, Dipayan Dev Author of *Deep Learning with Hadoop * M.Tech (AI), IISc, Bangalore

Re: read dataset from only one node in YARN cluster

2023-08-18 Thread Mich Talebzadeh
Hi, Where do you see this? In spark UI. So data is skewed most probably as one node gets all the data and others nothing as I understand? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

read dataset from only one node in YARN cluster

2023-08-18 Thread marc nicole
Hi, Spark 3.2, Hadoop 3.2, using YARN cluster mode, if one wants to read a dataset that is found in one node of the cluster and not in the others, how to tell Spark that? I expect through DataframeReader and using path like *IP:port/pathOnLocalNode* PS: loading the dataset in HDFS is not an

RE: Re: Spark Vulnerabilities

2023-08-18 Thread Sankavi Nagalingam
Hi @Bjørn Jørgensen, Thank you for your quick response. Based on the PR shared, we are doing analysis on our side. For the few jars for which you requested the CVE IDs, I have updated them in the attached document. Kindly verify it from your side and get back to us.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-18 Thread Mich Talebzadeh
Yes, it sounds like it. So the broadcast DF size seems to be between 1 and 4GB. So I suggest that you leave it as it is. I have not used the standalone mode since spark-2.4.3 so I may be missing a fair bit of context here. I am sure there are others like you that are still using it! HTH Mich

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
No, the driver memory was not set explicitly. So it was likely the default value, which appears to be 1GB. On Thu, Aug 17, 2023, 16:49 Mich Talebzadeh wrote: > One question, what was the driver memory before setting it to 4G? Did you > have it set at all before? > > HTH > > Mich Talebzadeh, >

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
One question, what was the driver memory before setting it to 4G? Did you have it set at all before? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Mich, Here are my config values from spark-defaults.conf: spark.eventLog.enabled true spark.eventLog.dir hdfs://10.0.50.1:8020/spark-logs spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider spark.history.fs.logDirectory hdfs://10.0.50.1:8020/spark-logs

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Hello Patrick, As a matter of interest, what parameters and their respective values do you use in spark-submit? I assume it is running in YARN mode. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Mich, Yes, that's the sequence of events. I think the big breakthrough is that (for now at least) Spark is throwing errors instead of the queries hanging. Which is a big step forward. I can at least troubleshoot issues if I know what they are. When I reflect on the issues I faced and the

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Hi Patrick, glad that you have managed to sort this problem out. Hopefully it will go away for good. Still we are in the dark about how this problem is going away and coming back :( As I recall the chronology of events was as follows: 1. The issue with the hanging Spark job reported 2.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Everyone, I just wanted to follow up on this issue. This issue has continued since our last correspondence. Today I had a query hang and couldn't resolve the issue. I decided to upgrade my Spark install from 3.4.0 to 3.4.1. After doing so, instead of the query hanging, I got an error message

Managing python modules in docker for PySpark?

2023-08-16 Thread Mich Talebzadeh
Hi, This is a bit of an old hat but worth getting opinions on it. Current options that I believe apply are: 1. Installing them individually via pip in the docker build process 2. Installing them together via pip in the build process via requirements.txt 3. Installing them to a
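A sketch of option 2, assuming one of the stock PySpark base images (the image tag and UID are assumptions to verify for your setup):

```dockerfile
# Base image tag is a placeholder -- match it to your Spark version.
FROM apache/spark-py:v3.4.1

USER root
COPY requirements.txt /tmp/requirements.txt
# Pinning versions in requirements.txt keeps driver and executor
# environments identical across rebuilds.
RUN pip3 install --no-cache-dir -r /tmp/requirements.txt
# 185 is the non-root spark user in the stock images (assumption).
USER 185
```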

Re: why advisoryPartitionSize <= maxShuffledHashJoinLocalMapThreshold

2023-08-15 Thread XiDuo You
CoalesceShufflePartitions will merge small partitions into bigger ones. Say, if you set maxShuffledHashJoinLocalMapThreshold to 32MB but the advisoryPartitionSize is 64MB, then each final reducer partition size will be close to 64MB. It breaks the maxShuffledHashJoinLocalMapThreshold. So we
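In configuration terms the constraint reads as below; the key names are the spark.sql.adaptive.* spellings I believe these map to (verify for your version), and the values are illustrative:

```properties
# threshold >= advisory size, so coalesced partitions stay hash-joinable
spark.sql.adaptive.advisoryPartitionSizeInBytes          32m
spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold  64m
```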

why advisoryPartitionSize <= maxShuffledHashJoinLocalMapThreshold

2023-08-15 Thread ??????
dear community, I want to set maxShuffledHashJoinLocalMapThreshold to enable converting SMJ to hash join, but I found maxShuffledHashJoinLocalMapThreshold must be larger than advisoryPartitionSize. I want to know what happens if maxShuffledHashJoinLocalMapThreshold

Re: Spark Vulnerabilities

2023-08-14 Thread Cheng Pan
For the Guava case, you may be interested in https://github.com/apache/spark/pull/42493 Thanks, Cheng Pan > On Aug 14, 2023, at 16:50, Sankavi Nagalingam > wrote: > > Hi Team, > We could see there are many dependent vulnerabilities present in the latest > spark-core:3.4.1.jar. PFA > Could

Re: Spark Vulnerabilities

2023-08-14 Thread Sean Owen
Yeah, we generally don't respond to "look at the output of my static analyzer". Some of these are already addressed in a later version. Some don't affect Spark. Some are possibly an issue but hard to change without breaking lots of things - they are really issues with upstream dependencies. But

Re: Spark Vulnerabilities

2023-08-14 Thread Bjørn Jørgensen
I have added links to the GitHub PRs, or a comment for those that I have not seen before. Apache Spark has very many dependencies; some can easily be upgraded while others are very hard to fix. Please feel free to open a PR if you want to help. On Mon, Aug 14, 2023 at 14:06, Sankavi Nagalingam wrote:

Spark Vulnerabilities

2023-08-14 Thread Sankavi Nagalingam
Hi Team, We could see there are many dependent vulnerabilities present in the latest spark-core:3.4.1.jar. PFA Could you please let us know when will be the fix version available for the users. Thanks, Sankavi The information in this e-mail and any attachments is confidential and may be

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Mich Talebzadeh
OK I use Hive 3.1.1 My suggestion is to put your hive issues to u...@hive.apache.org and for JAVA version compatibility They will give you better info. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci
I attempted to install Hive yesterday. The experience was similar to other attempts at installing Hive: it took a few hours and at the end of the process, I didn't have a working setup. The latest stable release would not run. I never discovered the cause, but similar StackOverflow questions

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
OK, you would not have known unless you went through the process, so to speak. Let us do something revolutionary here: install Hive and its metastore. You already have Hadoop anyway. https://cwiki.apache.org/confluence/display/hive/adminmanual+installation hive metastore

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Yes, on premise. Unfortunately after installing Delta Lake and re-writing all tables as Delta tables, the issue persists. On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh wrote: > ok sure. > > Is this Delta Lake going to be on-premise? > > Mich Talebzadeh, > Solutions Architect/Engineering

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
ok sure. Is this Delta Lake going to be on-premise? Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Hi Mich, Thanks for the feedback. My original intention after reading your response was to stick to Hive for managing tables. Unfortunately, I'm running into another case of SQL scripts hanging. Since all tables are already Parquet, I'm out of troubleshooting options. I'm going to migrate to

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Hi Patrick, There is not anything wrong with Hive. On-premise, it is the best data warehouse there is. Hive handles both ORC and Parquet formats well. They are both columnar implementations of the relational model. What you are seeing is the Spark API to Hive, which prefers Parquet. I found out a few

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
Thanks for the reply Stephen and Mich. Stephen, you're right, it feels like Spark is waiting for something, but I'm not sure what. I'm the only user on the cluster and there are plenty of resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark and the host servers to make sure

Re: unsubscribe

2023-08-11 Thread Mich Talebzadeh
To unsubscribe e-mail: user-unsubscr...@spark.apache.org Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at

unsubscribe

2023-08-11 Thread Yifan LI
unsubscribe

Re: Extracting Logical Plan

2023-08-11 Thread Vibhatha Abeykoon
Hello Winston, I looked into the suggested code snippet. But I am getting the following error ``` value listenerManager is not a member of org.apache.spark.sql.SparkSession ``` Although I can see it is available in the API.
