Two new tickets for Spark on K8s

2023-08-26 Thread Mich Talebzadeh
Hi, Holden Karau recently created two JIRAs that deal with two items of interest, namely: 1. Improve Spark Driver Launch Time (SPARK-44950) 2. Improve Spark Dynamic Allocation (SPARK-44951)

Re: Spark 2.4.7

2023-08-26 Thread Mich Talebzadeh
Sorry for forgetting. Add this line to the top of the code: import sys. Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile

Re: Spark 2.4.7

2023-08-26 Thread Mich Talebzadeh
Hi guys, You can try the code below in PySpark, relying on the *urllib* library to download the contents of the URL and then create a new column in the DataFrame to store the downloaded contents. Spark 4.3.0 The limit explained by Varun from pyspark.sql import SparkSession from
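
A minimal sketch of the approach described above, assuming a DataFrame with a `url` column and that a per-row UDF fetch is acceptable for small volumes (names are illustrative, not the original code):

```python
from urllib.request import urlopen

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("url-fetch").getOrCreate()
df = spark.createDataFrame([("https://example.com",)], ["url"])

def fetch(url):
    # Fetch the page body; return None instead of failing the task on error.
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return None

fetch_udf = udf(fetch, StringType())
df_with_content = df.withColumn("content", fetch_udf("url"))
df_with_content.show(truncate=50)
```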

Re: Spark 2.4.7

2023-08-26 Thread Harry Jamison
Thank you Varun, this makes sense. I understand a separate process for content ingestion. I was thinking it would be a separate spark job, but it sounds like you are suggesting that ideally I should do it outside of Hadoop entirely? Thanks Harry On Saturday, August 26, 2023 at 09:19:33

Re: Spark 2.4.7

2023-08-26 Thread Varun Shah
Hi Harry, Ideally, you should not be fetching a url in your transformation job but do the API calls separately (outside the cluster if possible). Ingesting data should be treated separately from transformation / cleaning / join operations. You can create another dataframe of urls, dedup if

Unsubscribe

2023-08-26 Thread Ozair Khan
Unsubscribe Regards, Ozair Khan

Elasticsearch support for Spark 3.x

2023-08-26 Thread Dipayan Dev
Hi All, We're using Spark 2.4.x to write a dataframe into the Elasticsearch index. As we're upgrading to Spark 3.3.0, it is throwing the error Caused by: java.lang.ClassNotFoundException: es.DefaultSource at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476) at
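
That ClassNotFoundException typically means the ES-Hadoop connector on the classpath was built for a different Spark/Scala line. A hedged sketch, assuming the Spark-3 connector artifact; the version shown is only illustrative, so check the elasticsearch-hadoop compatibility matrix for the exact coordinates:

```python
from pyspark.sql import SparkSession

# Launch with a Spark-3-compatible connector on the classpath, e.g.:
#   spark-submit --packages org.elasticsearch:elasticsearch-spark-30_2.12:8.9.0 app.py
spark = SparkSession.builder.appName("es-write").getOrCreate()
df = spark.createDataFrame([(1, "alice")], ["id", "name"])

(df.write
   .format("org.elasticsearch.spark.sql")   # long-form name of the "es" data source
   .option("es.nodes", "localhost")
   .option("es.port", "9200")
   .mode("append")
   .save("my_index"))
```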

Spark 2.4.7

2023-08-25 Thread Harry Jamison
I am using Python 3.7 and Spark 2.4.7. I am not sure what the best way to do this is. I have a dataframe with a URL in one of the columns, and I want to download the contents of that URL and put it in a new column. Can someone point me in the right direction on how to do this? I looked at the UDFs

Re: mysterious spark.sql.utils.AnalysisException Union in spark 3.3.2, but not seen in 3.4.0+

2023-08-25 Thread Mich Talebzadeh
Hi Srivastan, Ground investigation: 1. Does this union explicitly exist in your code? If not, where are the 7 and 6 column counts coming from? 2. On 3.3.1, have you looked at the Spark UI and the relevant DAG diagram? 3. Check the query execution plan using the explain() functionality. 4. Can

mysterious spark.sql.utils.AnalysisException Union in spark 3.3.2, but not seen in 3.4.0+

2023-08-25 Thread Srivatsan vn
Hello Users, I have been seeing some weird issues since I upgraded my EMR setup to 6.11 (which uses Spark 3.3.2). The call stack seems to point to a code location where there is no explicit union; also, I have unionByName everywhere in the codebase with allowMissingColumns set
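
For reference, a minimal sketch of the behaviour in question: plain `union` requires the same column count, while `unionByName(..., allowMissingColumns=True)` (available since Spark 3.1) fills missing columns with nulls. Column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-demo").getOrCreate()

df7 = spark.createDataFrame([(1, 2, 3, 4, 5, 6, 7)], ["a", "b", "c", "d", "e", "f", "g"])
df6 = spark.createDataFrame([(1, 2, 3, 4, 5, 6)], ["a", "b", "c", "d", "e", "f"])

# df7.union(df6) would raise an AnalysisException: union can only be performed
# on inputs with the same number of columns (7 vs 6).
ok = df7.unionByName(df6, allowMissingColumns=True)  # column "g" is null for df6 rows
ok.show()
```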

Unsubscribe

2023-08-25 Thread Dipayan Dev

Spark Connect: API mismatch in SparkSesession#execute

2023-08-25 Thread Stefan Hagedorn
Hi everyone, I’m trying to use the “extension” feature of the Spark Connect CommandPlugin (Spark 3.4.1). I created a simple protobuf message `MyMessage` that I want to send from the connect client-side to the connect server (where I registered my plugin). The SparkSession class in

Unsubscribe

2023-08-24 Thread Hemanth Dendukuri
Unsubscribe

Fwd:  Wednesday: Join 6 Members at "Ofir Press | Complementing Scale: Novel Guidance Methods for Improving LMs"

2023-08-24 Thread Mich Talebzadeh
They recently combined the Apache Spark and AI meetings in London. An online session worth attending for some? HTH Mich view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk.

Unsubscribe

2023-08-24 Thread Никита Романов
Unsubscribe

Unsubscribe

2023-08-23 Thread Nizam Shaik
Unsubscribe

Unsubscribe

2023-08-23 Thread Aayush Ostwal
Unsubscribe

Unsubscribe

2023-08-23 Thread Dipayan Dev
Unsubscribe

[no subject]

2023-08-23 Thread ayan guha
Unsubscribe -- Best Regards, Ayan Guha

Re: $SPARK_HOME/sbin/start-worker.sh spark://{main_host}:{cluster_port} failing

2023-08-23 Thread Mich Talebzadeh
Hi Jeremy, This error concerns me "23/08/23 20:01:03 ERROR LevelDBProvider: error opening leveldb file file:/mnt/data_ebs/infrastructure/spark/tmp/registeredExecutors.ldb. Creating new file, will not be able to recover state for existing applications

$SPARK_HOME/sbin/start-worker.sh spark://{main_host}:{cluster_port} failing

2023-08-23 Thread Jeremy Brent
Hi Spark Community, We have a cluster running with Spark 3.3.1. All nodes are AWS EC2’s with an Ubuntu OS version 22.04. One of the workers disconnected from the main node. When we run $SPARK_HOME/sbin/start-worker.sh spark://{main_host}:{cluster_port} it appears to run successfully; there is no

unsubscribe

2023-08-22 Thread heri wijayanto
unsubscribe

Re: error trying to save to database (Phoenix)

2023-08-22 Thread Gera Shegalov
If you look at the dependencies of the 5.0.0-HBase-2.0 artifact https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark/5.0.0-HBase-2.0 it was built against Spark 2.3.0, Scala 2.11.8 You may need to check with the Phoenix community if your setup with Spark 3.4.1 etc is supported by

Fwd: Recap on current status of "SPIP: Support Customized Kubernetes Schedulers"

2023-08-22 Thread Mich Talebzadeh
I found some of the notes on Volcano and my tests back in Feb 2022. I did my Volcano tests on Spark 3.1.1. The results were not very great then. Hence I asked in the thread from @santosh whether any updated comparisons are available. I will try the test with Spark 3.4.1 at some point. Maybe some users

[ANNOUNCE] Apache Spark 3.3.3 released

2023-08-22 Thread Yuming Wang
We are happy to announce the availability of Apache Spark 3.3.3! Spark 3.3.3 is a maintenance release containing stability fixes. This release is based on the branch-3.3 maintenance branch of Spark. We strongly recommend all 3.3 users to upgrade to this stable release. To download Spark 3.3.3,

Unsubscribe

2023-08-21 Thread Dipayan Dev
-- With Best Regards, Dipayan Dev Author of *Deep Learning with Hadoop* M.Tech (AI), IISc, Bangalore

Re: error trying to save to database (Phoenix)

2023-08-21 Thread Kal Stevens
Sorry for being so dense, and thank you for your help. I was using this version, phoenix-spark-5.0.0-HBase-2.0.jar, because it was the latest in this repo https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark On Mon, Aug 21, 2023 at 5:07 PM Sean Owen wrote: > It is. But you have a

Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
It is. But you have a third party library in here which seems to require a different version. On Mon, Aug 21, 2023, 7:04 PM Kal Stevens wrote: > OK, it was my impression that scala was packaged with Spark to avoid a > mismatch > https://spark.apache.org/downloads.html > > It looks like spark

Re: error trying to save to database (Phoenix)

2023-08-21 Thread Kal Stevens
OK, it was my impression that Scala was packaged with Spark to avoid a mismatch https://spark.apache.org/downloads.html It looks like Spark 3.4.1 (my version) uses Scala 2.12. How do I specify the Scala version? On Mon, Aug 21, 2023 at 4:47 PM Sean Owen wrote: > That's a mismatch in the
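
You do not pick the Scala version separately: it is baked into the Spark build (2.12 for the stock Spark 3.4.1 distribution), so the third-party artifact must carry a matching `_2.12` suffix. A hedged sketch of checking and matching it (the connector coordinates are only placeholders; per the reply above, phoenix-spark-5.0.0-HBase-2.0 was built against Scala 2.11.8):

```bash
# Prints the Scala line your Spark build uses, e.g. "Using Scala version 2.12.17".
spark-submit --version

# Then pull in artifacts whose suffix matches, e.g. _2.12 for Spark 3.4.1:
#   spark-submit --packages some.group:some-connector_2.12:1.2.3 app.py
# An artifact built for Scala 2.11 cannot be made to link against a 2.12 Spark.
```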

Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
That's a mismatch between the version of Scala that your library uses and the one Spark uses. On Mon, Aug 21, 2023, 6:46 PM Kal Stevens wrote: > I am having a hard time figuring out what I am doing wrong here. > I am not sure if I have an incompatible version of something installed or > something else. > I

error trying to save to database (Phoenix)

2023-08-21 Thread Kal Stevens
I am having a hard time figuring out what I am doing wrong here. I am not sure if I have an incompatible version of something installed or something else. I cannot find anything relevant on Google to figure out what I am doing wrong. I am using *spark 3.4.1* and *python3.10*. This is my code to

DataFrame cache keeps growing

2023-08-21 Thread Varun .N
Hi Team, While trying to understand a problem where the size of a cached DataFrame keeps growing, I realized that a similar question was asked a couple of years ago. Need your help in resolving this.
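
A minimal sketch, assuming the growth comes from repeatedly caching new DataFrame versions without releasing the old ones; explicit `unpersist()` (or `spark.catalog.clearCache()`) frees the storage:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(1_000_000).cache()
df.count()                      # materialise the cache

df2 = df.selectExpr("id * 2 AS id").cache()
df2.count()

df.unpersist()                  # release the old cached data once it is no longer needed
# spark.catalog.clearCache()    # or drop everything cached in this session
```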

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Bjørn Jørgensen
In your file /home/spark/real-estate/pullhttp/pull_apartments.py replace import org.apache.spark.SparkContext with from pyspark import SparkContext. On Mon, 21 Aug 2023 at 15:13, Kal Stevens wrote: > I am getting a class not found error > import org.apache.spark.SparkContext > > It sounds

Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Kal Stevens
I am getting a class not found error import org.apache.spark.SparkContext It sounds like this is because pyspark is not installed, but as far as I can tell it is. Pyspark is installed in the correct Python version root@namenode:/home/spark/# pip3.10 install pyspark Requirement already

Spark doesn’t create SUCCESS file when external path is passed

2023-08-21 Thread Dipayan Dev
Hi Team, I need some help; if someone can replicate the issue at their end, or let me know if I am doing anything wrong. https://issues.apache.org/jira/browse/SPARK-44884 We have recently upgraded to Spark 3.3.0 in our production Dataproc. We have a lot of downstream applications that rely

Unsubscribe

2023-08-21 Thread Umesh Bansal

Re: k8s+ YARN Spark

2023-08-21 Thread Mich Talebzadeh
Interesting. Spark supports the following cluster managers - Standalone: A cluster-manager, limited in features, shipped with Spark. - Apache Hadoop YARN is the most widely used resource manager not just for Spark but for other artefacts as well. On-premise YARN is used extensively.

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Mich Talebzadeh
This should work. Check your path; which pyspark should return /opt/spark/bin/pyspark. And your installation should contain: cd $SPARK_HOME /opt/spark> ls LICENSE NOTICE R README.md RELEASE bin conf data examples jars kubernetes licenses logs python sbin yarn You should

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Kal Stevens
Nevermind I was doing something dumb On Sun, Aug 20, 2023 at 9:53 PM Kal Stevens wrote: > Are there installation instructions for Spark 3.4.1? > > I defined SPARK_HOME as it describes here > > https://spark.apache.org/docs/latest/api/python/getting_started/install.html > > ls

k8s+ YARN Spark

2023-08-21 Thread Крюков Виталий Семенович
Good afternoon. Perhaps you will be discouraged by what I will write below, but nevertheless, I ask for help in solving my problem. Perhaps the architecture of our solution will not seem correct to you. There are backend services that communicate with a service that implements spark-driver.

Problem with spark 3.4.1 not finding spark java classes

2023-08-20 Thread Kal Stevens
Are there installation instructions for Spark 3.4.1? I defined SPARK_HOME as it describes here https://spark.apache.org/docs/latest/api/python/getting_started/install.html ls $SPARK_HOME/python/lib py4j-0.10.9.7-src.zip PY4J_LICENSE.txt pyspark.zip I am getting a class not found error

Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-20 Thread Dipayan Dev
Hi Mich, It's not specific to ORC, and looks like a bug from Hadoop Common project. I have raised a bug and am happy to contribute to Hadoop 3.3.0 version. Do you know if anyone could help me to set the Assignee? https://issues.apache.org/jira/browse/HADOOP-18856 With Best Regards, Dipayan Dev

Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Mich Talebzadeh
Under the gs directory "gs://test_dd1/abc/", what do you see? gsutil ls gs://test_dd1/abc And the same for gs://test_dd1/: gsutil ls gs://test_dd1 I suspect you need a folder for multiple ORC slices! Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin

Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Dipayan Dev
Hi Everyone, I'm stuck with one problem where I need to provide a custom GCS location for the Hive table from Spark. The code fails while doing an *'insert into'* whenever my Hive table has a flat GCS location like gs://, but works for nested locations like gs://bucket_name/blob_name. Is anyone

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-19 Thread Nebi Aydin
Here's the executor logs ``` java.io.IOException: Connection from ip-172-31-16-143.ec2.internal/172.31.16.143:7337 closed at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146) at

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-19 Thread Mich Talebzadeh
That error message, *FetchFailedException: Failed to connect to on port 7337*, happens when a task running on one executor node tries to fetch data from another executor node but fails to establish a connection to the specified port (7337 in this case). In a nutshell it is performing network IO

Re: [External Email] Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
Hi, sorry for the duplicates. First time user :) I keep getting FetchFailedException: port 7337 closed, which is the external shuffle service port. I was trying to tune these parameters. I have around 1000 executors and 5000 cores. I tried to set spark.shuffle.io.serverThreads to 2k. Should I also set

Re: [Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Mich Talebzadeh
Hi, These two threads that you sent seem to be duplicates of each other? Anyhow I trust that you are familiar with the concept of shuffle in Spark. Spark Shuffle is an expensive operation since it involves the following - Disk I/O - Involves data serialization and deserialization

[Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
I want to learn the differences among the thread configurations below. spark.shuffle.io.serverThreads spark.shuffle.io.clientThreads spark.shuffle.io.threads spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads Thanks.
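
For context, the serverThreads/clientThreads pairs size the netty thread pools of the shuffle and RPC transport services, and they are set like any other Spark conf. A hedged example of passing them on submit; the values are only placeholders, not recommendations:

```bash
spark-submit \
  --conf spark.shuffle.io.serverThreads=64 \
  --conf spark.shuffle.io.clientThreads=64 \
  --conf spark.rpc.io.serverThreads=64 \
  --conf spark.rpc.io.clientThreads=64 \
  app.py
```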

[Spark Core]: What's difference among spark.shuffle.io.threads

2023-08-18 Thread Nebi Aydin
I want to learn the differences among the thread configurations below. spark.shuffle.io.serverThreads spark.shuffle.io.clientThreads spark.shuffle.io.threads spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads Thanks.

[no subject]

2023-08-18 Thread Dipayan Dev
Unsubscribe -- With Best Regards, Dipayan Dev Author of *Deep Learning with Hadoop* M.Tech (AI), IISc, Bangalore

Re: read dataset from only one node in YARN cluster

2023-08-18 Thread Mich Talebzadeh
Hi, Where do you see this? In the Spark UI? So the data is most probably skewed, as one node gets all the data and the others nothing, as I understand? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

read dataset from only one node in YARN cluster

2023-08-18 Thread marc nicole
Hi, Spark 3.2, Hadoop 3.2, using YARN cluster mode: if one wants to read a dataset that is found on one node of the cluster and not on the others, how does one tell Spark that? I expect through DataFrameReader, using a path like *IP:port/pathOnLocalNode*. PS: loading the dataset into HDFS is not an
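
A hedged sketch of the usual workaround when the data cannot go to HDFS: a plain file:// path only works if the file exists at the same path on every executor, so for a single-node file one option is to read it on the driver (which must run on the node that has the file, e.g. client mode launched there) and parallelise it — fine for modest sizes. Paths and schema are illustrative:

```python
import csv

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("local-file-on-driver").getOrCreate()

# Read the file on the driver, which is the only process that needs the local path ...
with open("/data/local_only/dataset.csv", newline="") as f:
    rows = [tuple(r) for r in csv.reader(f)]

# ... then hand it to Spark; executors never touch the local filesystem.
df = spark.createDataFrame(rows[1:], schema=list(rows[0]))
df.show()
```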

RE: Re: Spark Vulnerabilities

2023-08-18 Thread Sankavi Nagalingam
Hi @Bjørn Jørgensen, Thank you for your quick response. Based on the PR shared, we are doing the analysis on our side. For the few jars for which you requested the CVE IDs, I have updated them in the attached document. Kindly verify it from your side and revert back to us.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-18 Thread Mich Talebzadeh
Yes, it sounds like it. So the broadcast DF size seems to be between 1 and 4GB. So I suggest that you leave it as it is. I have not used the standalone mode since spark-2.4.3 so I may be missing a fair bit of context here. I am sure there are others like you that are still using it! HTH Mich

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
No, the driver memory was not set explicitly. So it was likely the default value, which appears to be 1GB. On Thu, Aug 17, 2023, 16:49 Mich Talebzadeh wrote: > One question, what was the driver memory before setting it to 4G? Did you > have it set at all before? > > HTH > > Mich Talebzadeh, >

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
One question, what was the driver memory before setting it to 4G? Did you have it set at all before? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Mich, Here are my config values from spark-defaults.conf: spark.eventLog.enabled true spark.eventLog.dir hdfs://10.0.50.1:8020/spark-logs spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider spark.history.fs.logDirectory hdfs://10.0.50.1:8020/spark-logs
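
For readability, a hedged sketch of how those spark-defaults.conf entries might look laid out, with the driver-memory setting discussed elsewhere in this thread added as an illustration (the 4g figure echoes the thread, not a recommendation):

```
# spark-defaults.conf (sketch)
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://10.0.50.1:8020/spark-logs
spark.history.provider           org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory    hdfs://10.0.50.1:8020/spark-logs
spark.driver.memory              4g
```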

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Hello Patrick, As a matter of interest, what parameters and their respective values do you use in spark-submit? I assume it is running in YARN mode. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Mich, Yes, that's the sequence of events. I think the big breakthrough is that (for now at least) Spark is throwing errors instead of the queries hanging. Which is a big step forward. I can at least troubleshoot issues if I know what they are. When I reflect on the issues I faced and the

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Hi Patrick, glad that you have managed to sort this problem out. Hopefully it will go away for good. Still, we are in the dark about how this problem is going away and coming back :( As I recall, the chronology of events was as follows: 1. The issue with the hanging Spark job reported 2.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
Hi Everyone, I just wanted to follow up on this issue. This issue has continued since our last correspondence. Today I had a query hang and couldn't resolve the issue. I decided to upgrade my Spark install from 3.4.0 to 3.4.1. After doing so, instead of the query hanging, I got an error message

Managing python modules in docker for PySpark?

2023-08-16 Thread Mich Talebzadeh
Hi, This is a bit of an old hat but worth getting opinions on it. Current options that I believe apply are: 1. Installing them individually via pip in the docker build process 2. Installing them together via pip in the build process via requirements.txt 3. Installing them to a
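
A minimal sketch of option 2 from the list above, assuming a requirements.txt checked in next to the Dockerfile; the base image tag and user id are illustrative:

```dockerfile
# Dockerfile (sketch) -- base image tag is illustrative
FROM apache/spark-py:v3.4.0

USER root
# Copy the pinned module list and install everything in one layer.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Drop back to the non-root user used by the official spark-py images.
USER 185
```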

Re: why advisoryPartitionSize <= maxShuffledHashJoinLocalMapThreshold

2023-08-15 Thread XiDuo You
CoalesceShufflePartitions will merge small partitions into bigger ones. Say, if you set maxShuffledHashJoinLocalMapThreshold to 32MB but the advisoryPartitionSize is 64MB, then each final reducer partition size will be close to 64MB. It breaks the maxShuffledHashJoinLocalMapThreshold. So we
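
A hedged illustration of the constraint described above: keep the AQE advisory partition size at or below the shuffled-hash-join threshold so coalescing cannot merge partitions past it (the values are only examples):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("shj-threshold-demo")
    # Threshold below which AQE may turn a sort-merge join into a shuffled hash join.
    .config("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold", "64m")
    # Keep the advisory size <= the threshold, otherwise coalesced partitions could exceed it.
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "32m")
    .getOrCreate()
)
```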

why advisoryPartitionSize <= maxShuffledHashJoinLocalMapThreshold

2023-08-15 Thread ??????
dear community, I want to set maxShuffledHashJoinLocalMapThreshold to enable converting SMJ to hash join, but I found maxShuffledHashJoinLocalMapThreshold must be larger than advisoryPartitionSize. I want to know what happens if maxShuffledHashJoinLocalMapThreshold

Re: Spark Vulnerabilities

2023-08-14 Thread Cheng Pan
For the Guava case, you may be interested in https://github.com/apache/spark/pull/42493 Thanks, Cheng Pan > On Aug 14, 2023, at 16:50, Sankavi Nagalingam > wrote: > > Hi Team, > We could see there are many dependent vulnerabilities present in the latest > spark-core:3.4.1.jar. PFA > Could

Re: Spark Vulnerabilities

2023-08-14 Thread Sean Owen
Yeah, we generally don't respond to "look at the output of my static analyzer". Some of these are already addressed in a later version. Some don't affect Spark. Some are possibly an issue but hard to change without breaking lots of things - they are really issues with upstream dependencies. But

Re: Spark Vulnerabilities

2023-08-14 Thread Bjørn Jørgensen
I have added links to the GitHub PR, or a comment for those that I have not seen before. Apache Spark has very many dependencies; some can easily be upgraded while others are very hard to fix. Please feel free to open a PR if you want to help. On Mon, 14 Aug 2023 at 14:06, Sankavi Nagalingam wrote:

Spark Vulnerabilities

2023-08-14 Thread Sankavi Nagalingam
Hi Team, We could see there are many dependent vulnerabilities present in the latest spark-core:3.4.1.jar. PFA. Could you please let us know when the fix version will be available for users. Thanks, Sankavi The information in this e-mail and any attachments is confidential and may be

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Mich Talebzadeh
OK, I use Hive 3.1.1. My suggestion is to put your Hive issues, and the Java version compatibility questions, to u...@hive.apache.org. They will give you better info. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci
I attempted to install Hive yesterday. The experience was similar to other attempts at installing Hive: it took a few hours and at the end of the process, I didn't have a working setup. The latest stable release would not run. I never discovered the cause, but similar StackOverflow questions

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
OK, you would not have known unless you went through the process, so to speak. Let us do something revolutionary here: install Hive and its metastore. You already have Hadoop anyway. https://cwiki.apache.org/confluence/display/hive/adminmanual+installation hive metastore

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Yes, on premise. Unfortunately after installing Delta Lake and re-writing all tables as Delta tables, the issue persists. On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh wrote: > ok sure. > > Is this Delta Lake going to be on-premise? > > Mich Talebzadeh, > Solutions Architect/Engineering

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
ok sure. Is this Delta Lake going to be on-premise? Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Hi Mich, Thanks for the feedback. My original intention after reading your response was to stick to Hive for managing tables. Unfortunately, I'm running into another case of SQL scripts hanging. Since all tables are already Parquet, I'm out of troubleshooting options. I'm going to migrate to

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Hi Patrick, There is not anything wrong with Hive. On-premise, it is the best data warehouse there is. Hive handles both ORC and Parquet formats well; they are both columnar implementations of the relational model. What you are seeing is the Spark API to Hive, which prefers Parquet. I found out a few

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
Thanks for the reply Stephen and Mich. Stephen, you're right, it feels like Spark is waiting for something, but I'm not sure what. I'm the only user on the cluster and there are plenty of resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark and the host servers to make sure

Re: unsubscribe

2023-08-11 Thread Mich Talebzadeh
To unsubscribe e-mail: user-unsubscr...@spark.apache.org Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at

unsubscribe

2023-08-11 Thread Yifan LI
unsubscribe

Re: Extracting Logical Plan

2023-08-11 Thread Vibhatha Abeykoon
Hello Winston, I looked into the suggested code snippet. But I am getting the following error ``` value listenerManager is not a member of org.apache.spark.sql.SparkSession ``` Although I can see it is available in the API.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Steve may have a valid point. You raised an issue with concurrent writes before, if I recall correctly. This limitation may be due to the Hive metastore. By default Spark uses Apache Derby for its database persistence. *However, it is limited to only one Spark session at any time for the purposes

Re: Spark Connect, Master, and Workers

2023-08-10 Thread Brian Huynh
Hi Kezhi, Yes, you no longer need to start a master to make the client work. Please see the quickstart. https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html You can think of Spark Connect as an API on top of Master so workers can be added to the cluster same
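
A minimal sketch of that flow, assuming Spark 3.4.x and the default connect port (15002); the commands mirror the quickstart linked above:

```python
# First start the Spark Connect server from the Spark distribution, e.g.:
#   ./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.1
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(5).show()   # executes on the connect server, not in this client process
```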

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick, When this has happened to me in the past (admittedly via spark-submit) it has been because another job was still running and had already claimed some of the resources (cores and memory). I think this can also happen if your configuration tries to claim resources that will never be

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich, I don't believe Hive is installed. I set up this cluster from scratch. I installed Hadoop and Spark by downloading them from their project websites. If Hive isn't bundled with Hadoop or Spark, I don't believe I have it. I'm running the Thrift server distributed with Spark, like so:
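
The preview cuts off before the command; for context, a hedged sketch of starting the Thrift server bundled with Spark and connecting with beeline. Host, port, and memory values here are illustrative, not the original command:

```bash
# Start the JDBC/ODBC (HiveServer2-compatible) Thrift server shipped with Spark.
$SPARK_HOME/sbin/start-thriftserver.sh \
  --master spark://10.0.50.1:7077 \
  --driver-memory 4g

# Connect with beeline (the Thrift server listens on port 10000 by default).
$SPARK_HOME/bin/beeline -u jdbc:hive2://10.0.50.1:10000 -n hadoop
```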

Re: dockerhub does not contain apache/spark-py 3.4.1

2023-08-10 Thread Mich Talebzadeh
Hi Mark, I created a Spark 3.4.1 docker file. Details: spark-py-3.4.1-scala_2.12-11-jre-slim-buster. Pull instructions are given: docker pull

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
sorry host is 10.0.50.1 Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any and all

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Hi Patrick, That beeline on port 1 is a Hive thrift server running on your Hive host 10.0.50.1:1. If you can access that host, you should be able to log into Hive by typing hive. The OS user is hadoop in your case, and it sounds like there is no password! Once inside that host, the hive logs

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich, Thanks for the reply. Unfortunately I don't have Hive set up on my cluster. I can explore this if there are no other ways to troubleshoot. I'm using beeline to run commands against the Thrift server. Here's the command I use: ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:1 -n

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Can you run this SQL query through Hive itself? Are you using this command or something similar for your thrift server? beeline -u jdbc:hive2:///1/default org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my

Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hello, I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS for storage. The query is as follows: SELECT ME.*, MB.BenefitID FROM MemberEnrollment ME JOIN MemberBenefits MB ON ME.ID =

Re: [PySpark][UDF][PickleException]

2023-08-10 Thread Bjørn Jørgensen
I pasted your text into ChatGPT and this is what I got back: Your problem arises due to how Apache Spark serializes Python objects to be used in Spark tasks. When a User-Defined Function (UDF) is defined, Spark uses Python's `pickle` library to serialize the Python function and any required objects

[PySpark][UDF][PickleException]

2023-08-10 Thread Sanket Sharma
Hi, I've been trying to debug a Spark UDF for a couple of days now but I can't seem to figure out what is going on. The UDF essentially pads a 2D array to a certain fixed length. When the code uses NumPy, it fails with a PickleException. When I rewrite it using plain Python, it works like a charm:
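
A common cause is returning NumPy types from the UDF, which Spark's pickler cannot map to Spark SQL types; converting to plain Python (e.g. with `tolist()`) before returning usually resolves it. A minimal sketch under that assumption — the names and target length are illustrative, not the original code:

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

spark = SparkSession.builder.appName("pad-udf").getOrCreate()
df = spark.createDataFrame([([[1.0, 2.0], [3.0]],)], ["matrix"])

MAX_LEN = 4

def pad_rows(matrix):
    # Pad each inner row to MAX_LEN with zeros, then convert the NumPy arrays
    # back to plain Python lists/floats so pickling to Spark types works.
    padded = [np.pad(np.array(row, dtype=float), (0, MAX_LEN - len(row))) for row in matrix]
    return [p.tolist() for p in padded]

pad_udf = udf(pad_rows, ArrayType(ArrayType(DoubleType())))
df.withColumn("padded", pad_udf("matrix")).show(truncate=False)
```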

Re: [PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread lnxpgn
Yes, ls -l /tmp/app-submodules.zip and hdfs dfs -ls /tmp/app-submodules.zip can both show the file. On 2023/8/9 22:48, Mich Talebzadeh wrote: If you are running in the cluster mode, that zip file should exist in all the nodes! Is that the case? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead

Spark Connect, Master, and Workers

2023-08-09 Thread Kezhi Xiong
Hi, I'm recently learning Spark Connect but have some questions regarding the connect server's relation to the master and workers: when I'm using the connect server, I don't have to start a master alongside it to make clients work. Is the connect server simply using "local[*]" as master? Then, if I

unsubscribe

2023-08-09 Thread heri wijayanto
unsubscribe

Re: dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mich Talebzadeh
Hi Mark, you can build it yourself, no big deal :) REPOSITORY TAG IMAGE ID CREATED SIZE sparkpy/spark-py 3.4.1-scala_2.12-11-jre-slim-buster-Dockerfile a876102b2206 1 second ago
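
One way to do what is suggested above, sketched with the docker-image-tool.sh script that ships in the Spark 3.4.1 distribution; the repository name and tag are illustrative:

```bash
# From an unpacked Spark 3.4.1 distribution: build the base JVM image plus the
# PySpark binding under your own repository name.
cd spark-3.4.1-bin-hadoop3
./bin/docker-image-tool.sh -r myrepo -t 3.4.1 \
  -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

# Result: myrepo/spark:3.4.1 and myrepo/spark-py:3.4.1 in the local docker daemon.
```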

dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mark Elliot
Hello, I noticed that the apache/spark-py image for Spark's 3.4.1 release is not available (apache/spark@3.4.1 is available). Would it be possible to get the 3.4.1 release build for the apache/spark-py image published? Thanks, Mark -- This communication, together with any

Re: [PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread Mich Talebzadeh
If you are running in the cluster mode, that zip file should exist in all the nodes! Is that the case? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

[PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread lnxpgn
Hi, I am using Spark 3.4.1, running on YARN. Hadoop runs on a single node in pseudo-distributed mode. spark-submit --master yarn --deploy-mode cluster --py-files /tmp/app-submodules.zip app.py The YARN application ran successfully, but there is a warning log message:
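
A hedged sketch of the usual fix suggested in the reply above: stage the zip somewhere every YARN container can reach (e.g. HDFS) instead of a driver-local file:///tmp path. The HDFS path is illustrative:

```bash
# Make the submodules visible to every YARN container by staging them on HDFS.
hdfs dfs -put /tmp/app-submodules.zip /user/spark/app-submodules.zip

spark-submit --master yarn --deploy-mode cluster \
  --py-files hdfs:///user/spark/app-submodules.zip \
  app.py
```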

unsubscribe

2023-08-08 Thread Daniel Tavares de Santana
unsubscribe
