trouble using spark in kubernetes

2022-05-03 Thread Andreas Klos

Hello everyone,

I am trying to run a minimal example in my k8s cluster.

First, I cloned the petastorm github repo: https://github.com/uber/petastorm

Second, I created a Docker image as follows:

FROM ubuntu:20.04
RUN apt-get update -qq
RUN apt-get install -qq -y software-properties-common
RUN add-apt-repository -y ppa:deadsnakes/ppa
RUN apt-get update -qq

RUN apt-get -qq install -y \
  build-essential \
  cmake \
  openjdk-8-jre-headless \
  git \
  python \
  python3-pip \
  python3.9 \
  python3.9-dev \
  python3.9-venv \
  virtualenv \
  wget \
  && rm -rf /var/lib/apt/lists/*
RUN wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist.bz2 \
  -P /data/mnist/
RUN mkdir /petastorm
ADD setup.py /petastorm/
ADD README.rst /petastorm/
ADD petastorm /petastorm/petastorm
RUN python3.9 -m pip install pip --upgrade
RUN python3.9 -m pip install wheel
RUN python3.9 -m venv /petastorm_venv3.9
RUN /petastorm_venv3.9/bin/pip3.9 install --no-cache scikit-build
RUN /petastorm_venv3.9/bin/pip3.9 install --no-cache -e \
  /petastorm/[test,tf,torch,docs,opencv] \
  --only-binary pyarrow --only-binary opencv-python
RUN /petastorm_venv3.9/bin/pip3.9 install -U pyarrow==3.0.0 numpy==1.19.3 \
  tensorflow==2.5.0 pyspark==3.0.0
RUN /petastorm_venv3.9/bin/pip3.9 install opencv-python-headless
RUN rm -r /petastorm
ADD docker/run_in_venv.sh /

Afterwards, I create a namespace called spark in my k8s cluster, a
ServiceAccount (spark-driver), and a RoleBinding for the service account
as follows:


kubectl create ns spark
kubectl create serviceaccount spark-driver
kubectl create rolebinding spark-driver-rb --clusterrole=cluster-admin \
  --serviceaccount=spark:spark-driver


Finally, I create a pod in the spark namespace as follows:

apiVersion: v1
kind: Pod
metadata:
  name: "petastorm-ds-creator"
  namespace: spark
  labels:
    app: "petastorm-ds-creator"
spec:
  serviceAccount: spark-driver
  containers:
  - name: petastorm-ds-creator
    image: "imagename"
    command:
    - "/bin/bash"
    - "-c"
    - "--"
    args:
    - "while true; do sleep 30; done;"
    resources:
      limits:
        cpu: 2000m
        memory: 5000Mi
      requests:
        cpu: 2000m
        memory: 5000Mi
    ports:
    - containerPort: 80
      name: http
    - containerPort: 443
      name: https
    - containerPort: 20022
      name: exposed
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: spark-geodata-nfs-pvc-20220503
  restartPolicy: Always

I expose port 20022 of the pod with a headless service:

kubectl expose pod petastorm-ds-creator --port=20022 --type=ClusterIP \
  --cluster-ip=None -n spark


Finally, I run the following code in the created container/pod:

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = SparkConf()
spark_conf.setMaster("k8s://https://kubernetes.default:443")
spark_conf.setAppName("PetastormDsCreator")
spark_conf.set(
    "spark.kubernetes.namespace",
    "spark"
)
spark_conf.set(
    "spark.kubernetes.authenticate.driver.serviceAccountName",
    "spark-driver"
)
spark_conf.set(
    "spark.kubernetes.authenticate.caCertFile",
    "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
)
spark_conf.set(
    "spark.kubernetes.authenticate.oauthTokenFile",
    "/var/run/secrets/kubernetes.io/serviceaccount/token"
)
spark_conf.set(
    "spark.executor.instances",
    "2"
)
spark_conf.set(
    "spark.driver.host",
    "petastorm-ds-creator"
)
spark_conf.set(
    "spark.driver.port",
    "20022"
)
spark_conf.set(
    "spark.kubernetes.container.image",
    "imagename"
)
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
sc = spark.sparkContext
t = sc.parallelize(range(10))
r = t.sumApprox(3)
print('Approximate sum: %s' % r)

Unfortunately, it does not work.

With kubectl describe po podname-exec-1 I get the following error message:

Error: failed to start container "spark-kubernetes-executor": Error 
response from daemon: OCI runtime create failed: container_linux.go:349: 
starting container process caused "exec: \"executor\": executable file 
not found in $PATH": unknown


Could somebody give me a hint as to what I am doing wrong? Is my SparkSession
configuration incorrect?
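
One more thought: as far as I understand, the executor pods are also started
from spark.kubernetes.container.image, and the image I build above is a plain
Ubuntu/petastorm image without Spark's Kubernetes entrypoint, so there is no
"executor" command inside it. Would I have to point the executors at an image
built with Spark's bin/docker-image-tool.sh instead, roughly like this (the
image name below is only a placeholder)?

# Hypothetical sketch: use an image that ships Spark's Kubernetes entrypoint
# (e.g. one built with bin/docker-image-tool.sh from the Spark distribution),
# so the executor container understands the "executor" argument.
spark_conf.set(
    "spark.kubernetes.container.image",
    "my-registry/spark-py:3.0.0"  # placeholder tag
)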


Best regards

Andreas


Re: Spark error with jupyter

2022-05-03 Thread Bjørn Jørgensen
I use JupyterLab and Spark, and I have not seen this before.

Jupyter has a Docker stack with PySpark that you can try.

On Thu, 21 Apr 2022 at 11:07, Wassim Yaich  wrote:

> Hi Folks,
> I am working with Spark in Jupyter, but I get a small error on each run.
> Does anyone have the same error or a solution? Please tell me.



-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


REMINDER - Travel Assistance available for ApacheCon NA New Orleans 2022

2022-05-03 Thread Gavin McDonald
Hi All Contributors and Committers,

This is a first reminder email that travel
assistance applications for ApacheCon NA 2022 are now open!

We will be supporting ApacheCon North America in New Orleans, Louisiana,
on October 3rd through 6th, 2022.

TAC exists to help those who would like to attend ApacheCon events but
are unable to do so for financial reasons. This year, we are supporting
both committers and non-committers involved with projects at the
Apache Software Foundation, or open source projects in general.

For more info on this year's applications and qualifying criteria, please
visit the TAC website at http://www.apache.org/travel/
Applications are open and will close on the 1st of July 2022.

Important: Applicants have until the closing date above to submit their
applications (which should contain as much supporting material as required
to efficiently and accurately process their request); this will enable TAC
to announce successful awards shortly afterwards.

As usual, TAC expects to deal with a range of applications from a diverse
range of backgrounds. We therefore encourage (as always) anyone thinking
about sending in an application to do so ASAP.

Why should you attend as a TAC recipient? We encourage you to read stories
from past recipients at https://apache.org/travel/stories/. Also note that
previous TAC recipients have gone on to become Committers, PMC Members, ASF
Members, Directors of the ASF Board and Infrastructure Staff members.
Others have gone from Committer to full time Open Source Developers!

How far can you go? Let TAC help get you there.


Re: [EXTERNAL] Parse Execution Plan from PySpark

2022-05-03 Thread Walaa Eldin Moustafa
Hi Pablo,

Do you mean an in-memory plan? You can access one by implementing a Spark
Listener. Here is an example from the Datahub project [1].

If you end up parsing the SQL plan string, you may consider using/extending
Coral [2, 3]. There is already a POC for that. See some test cases [4].

Thanks,
Walaa.

[1]
https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/spark-lineage/src/main/java/datahub/spark/DatahubSparkListener.java
[2] https://engineering.linkedin.com/blog/2020/coral
[3] https://github.com/linkedin/coral
[4]
https://github.com/linkedin/coral/blob/master/coral-spark-plan/src/test/java/com/linkedin/coral/sparkplan/SparkPlanToIRRelConverterTest.java


On Tue, May 3, 2022 at 1:18 AM Shay Elbaz  wrote:

> Hi Pablo,
>
> As you probably know, Spark SQL generates custom Java code for the SQL
> functions. You can use geometry.debugCodegen() to print out the generated
> code.
>
> Shay
>
> *From:* Pablo Alcain 
> *Sent:* Tuesday, May 3, 2022 6:07 AM
> *To:* user@spark.apache.org
> *Subject:* [EXTERNAL] Parse Execution Plan from PySpark
>
> Hello all! I'm working with PySpark trying to reproduce some of the
> results we see on batch through streaming processes, just as a PoC for now.
> For this, I'm thinking of trying to interpret the execution plan and
> eventually write it back to Python (I'm doing something similar with pandas
> as well, and I'd like both approaches to be as similar as possible).
>
> Let me clarify with an example: suppose that starting with a
> `geometry.csv` file with `width` and `height` I want to calculate the
> `area` doing this:
>
> >>> geometry = spark.read.csv('geometry.csv', header=True)
> >>> geometry = geometry.withColumn('area', F.col('width') *
> F.col('height'))
>
> I would like to extract from the execution plan the fact that area is
> calculated as the product of width * height. One possibility would be to
> parse the execution plan:
>
> >>> geometry.explain(True)
>
> ...
> == Optimized Logical Plan ==
> Project [width#45, height#46, (cast(width#45 as double) * cast(height#46
> as double)) AS area#64]
> +- Relation [width#45,height#46] csv
> ...
>
> From the first line of the Logical Plan we can parse the formula "area =
> height * width" and then write the function back in any language.
>
> However, even though I'm getting the logical plan as a string, there has
> to be some internal representation that I could leverage and avoid
> the string parsing. Do you know if/how I can access that internal
> representation from Python? I've been trying to navigate the scala source
> code to find it, but this is definitely beyond my area of expertise, so any
> pointers would be more than welcome.
>
> Thanks in advance,
> Pablo


RE: [EXTERNAL] Parse Execution Plan from PySpark

2022-05-03 Thread Shay Elbaz
Hi Pablo,

As you probably know, Spark SQL generates custom Java code for the SQL 
functions. You can use geometry.debugCodegen() to print out the generated code.

Shay
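
A small PySpark-side sketch of the same idea, assuming Spark 3.0+ where
explain() accepts a mode argument:

# Prints the generated (whole-stage codegen) code directly from PySpark:
geometry.explain(mode="codegen")

# Prints the parsed/analyzed/optimized logical plans and the physical plan:
geometry.explain(mode="extended")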

From: Pablo Alcain 
Sent: Tuesday, May 3, 2022 6:07 AM
To: user@spark.apache.org
Subject: [EXTERNAL] Parse Execution Plan from PySpark

Hello all! I'm working with PySpark trying to reproduce some of the results we 
see on batch through streaming processes, just as a PoC for now. For this, I'm 
thinking of trying to interpret the execution plan and eventually write it back 
to Python (I'm doing something similar with pandas as well, and I'd like both 
approaches to be as similar as possible).

Let me clarify with an example: suppose that starting with a `geometry.csv` 
file with `width` and `height` I want to calculate the `area` doing this:

>>> geometry = spark.read.csv('geometry.csv', header=True)
>>> geometry = geometry.withColumn('area', F.col('width') * F.col('height'))

I would like to extract from the execution plan the fact that area is 
calculated as the product of width * height. One possibility would be to parse 
the execution plan:

>>> geometry.explain(True)

...
== Optimized Logical Plan ==
Project [width#45, height#46, (cast(width#45 as double) * cast(height#46 as 
double)) AS area#64]
+- Relation [width#45,height#46] csv
...

From the first line of the Logical Plan we can parse the formula "area = height 
* width" and then write the function back in any language.
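
Just to make that parsing idea concrete, a very rough sketch (the column ids
like #45 and the casts already make this fragile, so it is only an illustration):

import re

plan_line = ('Project [width#45, height#46, (cast(width#45 as double) * '
             'cast(height#46 as double)) AS area#64]')

# naive extraction of "<expression> AS <alias>#<id>" pairs from a Project node
for expr, name in re.findall(r"([^,\[]+) AS (\w+)#\d+", plan_line):
    expr = re.sub(r"#\d+|cast\(|\s+as\s+double\)", "", expr).strip()
    print(name, "=", expr)  # -> area = (width * height)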

However, even though I'm getting the logical plan as a string, there has to be 
some internal representation that I could leverage and avoid the string 
parsing. Do you know if/how I can access that internal representation from 
Python? I've been trying to navigate the scala source code to find it, but this 
is definitely beyond my area of expertise, so any pointers would be more than 
welcome.

Thanks in advance,
Pablo
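
A minimal sketch of one way to reach that in-memory plan from Python, going
through the DataFrame's internal _jdf handle (not a public API, so it may
change between Spark versions):

# _jdf is the underlying JVM Dataset; everything below is py4j calls into the JVM.
qe = geometry._jdf.queryExecution()   # org.apache.spark.sql.execution.QueryExecution
plan = qe.optimizedPlan()             # Catalyst LogicalPlan (a TreeNode)

print(plan.toString())                # same text as the "Optimized Logical Plan" section
print(plan.prettyJson())              # JSON rendering of the tree, easier to walk than the string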