Re: Running Spark on Kubernetes (GKE) - failing on spark-submit
Hi Ye,

This is the error I get when I don't set spark.kubernetes.file.upload.path. Any ideas on how to fix this?

```
Exception in thread "main" org.apache.spark.SparkException: Please specify spark.kubernetes.file.upload.path property.
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:299)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:248)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:247)
	at org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:173)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:164)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:89)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
	at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

On Tue, Feb 14, 2023 at 1:33 AM Ye Xianjin wrote:

> The configuration of '…file.upload.path' is wrong. It means a distributed
> fs path to store your archives/resources/jars temporarily, which are then
> distributed by Spark to drivers/executors.
> For your case, you don't need to set this configuration.
> Sent from my iPhone
>
> On Feb 14, 2023, at 5:43 AM, karan alang wrote:
>
> Hello All,
>
> I'm trying to run a simple application on GKE (Kubernetes), and it is
> failing.
> Note: I have Spark (bitnami spark chart) installed on GKE using helm
> install.
>
> Here is what was done:
> 1. Created a docker image using a Dockerfile
>
> Dockerfile:
> ```
> FROM python:3.7-slim
>
> RUN apt-get update && \
>     apt-get install -y default-jre && \
>     apt-get install -y openjdk-11-jre-headless && \
>     apt-get clean
>
> ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64
>
> RUN pip install pyspark
> RUN mkdir -p /myexample && chmod 755 /myexample
> WORKDIR /myexample
>
> COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py
>
> CMD ["pyspark"]
> ```
>
> Simple pyspark application:
> ```
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()
>
> data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
> df = spark.createDataFrame(data, ('id', 'salary'))
>
> df.show(5, False)
> ```
>
> Spark-submit command:
> ```
> spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster --name pyspark-example --conf spark.kubernetes.container.image=pyspark-example:0.1 --conf spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
> ```
>
> Error I get:
> ```
> 23/02/13 13:18:27 INFO KubernetesUtils: Uploading file: /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py to dest: /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...
>
> Exception in t
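[Editorial note, not part of the original thread: the "Please specify spark.kubernetes.file.upload.path" error is raised because the main application resource is given as a client-local path (src/StructuredStream-on-gke.py), which spark-submit then tries to stage. Since the Dockerfile already copies the script into the image at /myexample, one way to avoid both the upload step and the upload.path setting is to reference the in-image copy with the local:// scheme. A hedged sketch reusing the values from the thread:]

```
spark-submit \
  --master k8s://https://34.74.22.140:7077 \
  --deploy-mode cluster \
  --name pyspark-example \
  --conf spark.kubernetes.container.image=pyspark-example:0.1 \
  local:///myexample/StructuredStream-on-gke.py
```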
ADLS Gen2 adfs sample yaml configuration
Hello,

I need help with a sample yaml configuration for ADLS Gen2 with ADFS (Active Directory Federation Services): how do I configure the ADLS Gen2 (ADFS) settings in the yaml file for the Spark history server? I would like to see running jobs from a JupyterLab notebook with the SparkOnK8sV3.0.2 kernel shell. Any help is much appreciated.

Thanks,
Kondal
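[Editorial note, not part of the original thread: the exact yaml layout depends on the chart used to deploy the history server, so "sparkConf" below is a hypothetical chart key; <storage-account>, <container>, <tenant-id>, and the client credentials are placeholders. The fs.azure.* keys are the hadoop-azure ABFS OAuth (Azure AD client-credentials) settings, which is the usual way to reach ADLS Gen2; a pure ADFS federation setup may differ. A sketch:]

```
sparkConf:
  spark.history.fs.logDirectory: "abfss://<container>@<storage-account>.dfs.core.windows.net/spark-event-logs"
  spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net: "OAuth"
  spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net: "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
  spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net: "<client-id>"
  spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net: "<client-secret>"
  spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net: "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
```

The same keys, pointed at the same abfss:// event-log directory via spark.eventLog.dir, would also be needed on the jobs writing the logs.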
How to explode array columns of a dataframe having the same length
Hello guys,

I have the following dataframe:

col1                col2                col3
["A","B","null"]    ["C","D","null"]    ["E","null","null"]

I want to explode it to the following dataframe:

col1    col2    col3
"A"     "C"     "E"
"B"     "D"     "null"
"null"  "null"  "null"

How can I do that (preferably in Java) using the explode() method, knowing that something like the following won't yield the correct output?

```
for (String colName : dataset.columns())
    dataset = dataset.withColumn(colName, explode(dataset.col(colName)));
```
Re: Running Spark on Kubernetes (GKE) - failing on spark-submit
The configuration of '…file.upload.path' is wrong. It means a distributed fs path to store your archives/resources/jars temporarily, which are then distributed by Spark to drivers/executors. For your case, you don't need to set this configuration.

Sent from my iPhone

On Feb 14, 2023, at 5:43 AM, karan alang wrote:

Hello All,

I'm trying to run a simple application on GKE (Kubernetes), and it is failing.
Note: I have Spark (bitnami spark chart) installed on GKE using helm install.

Here is what was done:
1. Created a docker image using a Dockerfile

Dockerfile:
```
FROM python:3.7-slim

RUN apt-get update && \
    apt-get install -y default-jre && \
    apt-get install -y openjdk-11-jre-headless && \
    apt-get clean

ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64

RUN pip install pyspark
RUN mkdir -p /myexample && chmod 755 /myexample
WORKDIR /myexample

COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py

CMD ["pyspark"]
```

Simple pyspark application:
```
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()

data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
df = spark.createDataFrame(data, ('id', 'salary'))

df.show(5, False)
```

Spark-submit command:
```
spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster --name pyspark-example --conf spark.kubernetes.container.image=pyspark-example:0.1 --conf spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
```

Error I get:
```
23/02/13 13:18:27 INFO KubernetesUtils: Uploading file: /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py to dest: /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...

Exception in thread "main" org.apache.spark.SparkException: Uploading file /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py failed...
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:296)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:270)
	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configureForPython(DriverCommandFeatureStep.scala:109)
	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configurePod(DriverCommandFeatureStep.scala:44)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:59)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:89)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
	at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Error uploading file StructuredStream-on-gke.py
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:319)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:292)
	... 21 more
Caused by: java.io.IOException: Mkdirs failed to create /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:317)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2369)
	at org.apache.hadoop.fs.FilterFileSystem.copyFromLocalFile(FilterFileSystem.java:368)
	at org.
```
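[Editorial note, not part of the original thread: when the application file really does live only on the submitting machine, Ye's point implies spark.kubernetes.file.upload.path must name a Hadoop-compatible distributed filesystem path, which on GKE is typically a GCS bucket, not a directory inside the image. A hedged sketch reusing the thread's values; the bucket name is a placeholder, and the client and images would need the GCS connector on the classpath:]

```
spark-submit \
  --master k8s://https://34.74.22.140:7077 \
  --deploy-mode cluster \
  --name pyspark-example \
  --conf spark.kubernetes.container.image=pyspark-example:0.1 \
  --conf spark.kubernetes.file.upload.path=gs://<bucket>/spark-uploads \
  src/StructuredStream-on-gke.py
```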
Re: Running Spark on Kubernetes (GKE) - failing on spark-submit
I am not a k8s expert, but I think you have a permission issue. Try 777 as an example to see if it works.

On Mon, 13 Feb 2023, 21:42 karan alang, wrote:

> Hello All,
>
> I'm trying to run a simple application on GKE (Kubernetes), and it is
> failing.
> Note: I have Spark (bitnami spark chart) installed on GKE using helm
> install.
>
> Here is what was done:
> 1. Created a docker image using a Dockerfile
>
> Dockerfile:
> ```
> FROM python:3.7-slim
>
> RUN apt-get update && \
>     apt-get install -y default-jre && \
>     apt-get install -y openjdk-11-jre-headless && \
>     apt-get clean
>
> ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64
>
> RUN pip install pyspark
> RUN mkdir -p /myexample && chmod 755 /myexample
> WORKDIR /myexample
>
> COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py
>
> CMD ["pyspark"]
> ```
>
> Simple pyspark application:
> ```
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()
>
> data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
> df = spark.createDataFrame(data, ('id', 'salary'))
>
> df.show(5, False)
> ```
>
> Spark-submit command:
> ```
> spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster --name pyspark-example --conf spark.kubernetes.container.image=pyspark-example:0.1 --conf spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
> ```
>
> Error I get:
> ```
> 23/02/13 13:18:27 INFO KubernetesUtils: Uploading file: /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py to dest: /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...
>
> Exception in thread "main" org.apache.spark.SparkException: Uploading file /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py failed...
> 	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:296)
> 	at org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:270)
> 	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configureForPython(DriverCommandFeatureStep.scala:109)
> 	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configurePod(DriverCommandFeatureStep.scala:44)
> 	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:59)
> 	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
> 	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
> 	at scala.collection.immutable.List.foldLeft(List.scala:89)
> 	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
> 	at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
> 	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
> 	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
> 	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
> 	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
> 	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
> 	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
> 	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> 	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> 	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> 	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.SparkException: Error uploading file StructuredStream-on-gke.py
> 	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:319)
> 	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:292)
> 	... 21 more
> Caused by: java.io.IOException: Mkdirs failed to create /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a
> 	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:317)
> 	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
> 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
> 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
> 	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
> 	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
> 	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2369
> ```
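[Editorial note, not part of the original thread: the Mkdirs failure comes from RawLocalFileSystem, i.e. the path is being created on the machine running spark-submit, so the permission suggestion above applies on that host. A minimal sketch of it; 777 is for debugging only, and a relative directory is used here to avoid needing root for the thread's /myexample:]

```shell
# Create the staging directory on the submitting host and open it up
# for a quick test (tighten permissions once the submit succeeds).
mkdir -p ./myexample
chmod 777 ./myexample
ls -ld ./myexample
```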