Two new tickets for Spark on K8s

2023-08-26 Thread Mich Talebzadeh
Hi, @holden Karau recently created two Jiras that deal with two items of interest, namely: 1. Improve Spark Driver Launch Time (SPARK-44950) 2. Improve Spark Dynamic Allocation (SPARK-44951)

Re: Spark 2.4.7

2023-08-26 Thread Mich Talebzadeh
Sorry, I forgot one thing. Add this line to the top of the code: import sys Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer, London, United Kingdom

Re: Spark 2.4.7

2023-08-26 Thread Mich Talebzadeh
Hi guys, You can try the code below in PySpark. It relies on the urllib library to download the contents of the URL and then creates a new column in the DataFrame to store the downloaded contents. Spark 4.3.0 The limit explained by Varun from pyspark.sql import SparkSession from
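The code in the message above is cut off, so what follows is not Mich's code. It is a minimal sketch of the approach the message describes: fetch each URL with urllib and store the result in a new DataFrame column. The names download_url, add_content_column, url_col and content_col are hypothetical, and the timeout and max_bytes defaults are assumptions rather than values from the thread.

```python
import urllib.error
import urllib.request


def download_url(url, timeout=10.0, max_bytes=1_000_000):
    """Return the body at `url` as text, or None if it cannot be fetched.

    `timeout` and `max_bytes` are illustrative defaults, not values
    taken from the thread.
    """
    if url is None:
        return None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Read at most max_bytes so one large page cannot exhaust memory.
            body = response.read(max_bytes)
    except (urllib.error.URLError, ValueError, OSError):
        # Unreachable host, malformed URL, timeout and similar failures.
        return None
    return body.decode("utf-8", errors="replace")


def add_content_column(df, url_col="url", content_col="content"):
    """Add `content_col` to a Spark DataFrame by downloading each `url_col`.

    PySpark is imported here so the module still loads without it.
    """
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    download_udf = udf(download_url, StringType())
    return df.withColumn(content_col, download_udf(col(url_col)))
```

Returning None on failure keeps one bad URL from failing the whole job, at the cost of hiding the reason for the failure. Note that a UDF like this runs on the executors, so every row triggers a network call from inside the cluster.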

Re: Spark 2.4.7

2023-08-26 Thread Harry Jamison
Thank you Varun, this makes sense. I understand a separate process for content ingestion. I was thinking it would be a separate Spark job, but it sounds like you are suggesting that ideally I should do it outside of Hadoop entirely? Thanks, Harry On Saturday, August 26, 2023 at 09:19:33

Re: Spark 2.4.7

2023-08-26 Thread Varun Shah
Hi Harry, Ideally, you should not be fetching a URL in your transformation job; make the API calls separately instead (outside the cluster if possible). Ingesting data should be treated separately from transformation / cleaning / join operations. You can create another DataFrame of URLs, dedup if
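The message above is truncated, so the following is only a sketch of the general idea it describes: collect the URLs, remove duplicates, and fetch each one once in a step that is separate from the transformation job. The function name fetch_unique and the pluggable fetch callable are hypothetical, introduced here for illustration.

```python
from typing import Callable, Dict, Iterable, Optional


def fetch_unique(
    urls: Iterable[Optional[str]],
    fetch: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Fetch each distinct URL exactly once and return {url: content}.

    `fetch` is any callable that takes a URL and returns its content
    (or None on failure); it is passed in so the ingestion step can be
    swapped or tested without network access.
    """
    results: Dict[str, Optional[str]] = {}
    for url in urls:
        # Skip missing values and URLs that were already fetched.
        if url is None or url in results:
            continue
        results[url] = fetch(url)
    return results
```

The resulting mapping could be written to storage by the ingestion step and later joined back to the main dataset on the URL, so the transformation job itself makes no network calls.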

Unsubscribe

2023-08-26 Thread Ozair Khan
Unsubscribe Regards, Ozair Khan

Elasticsearch support for Spark 3.x

2023-08-26 Thread Dipayan Dev
Hi All, We're using Spark 2.4.x to write a DataFrame into an Elasticsearch index. As we're upgrading to Spark 3.3.0, it throws the following error: Caused by: java.lang.ClassNotFoundException: es.DefaultSource at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476) at