Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Xiaolong Wang
Hi, Spark community,

I encountered a weird bug when using the Spark Connect server to integrate with
Iceberg. I added the iceberg-spark-runtime dependency with the `--packages`
param, and the driver/connect-server pod did get the correct dependencies. But
when I looked at the executors' libraries, the jar had not been passed along.
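
To make the symptom concrete, a query like the one sketched below (the connect endpoint and table name are placeholders) is analyzed fine on the driver/connect server, but its scan tasks run on the executors, which is exactly where the missing Iceberg jar bites:

from pyspark.sql import SparkSession

# Sketch with a placeholder endpoint and Iceberg table name. The connect
# server resolved the Iceberg runtime via --packages, but the scan tasks
# below execute on executors whose classpath lacks the jar.
spark = SparkSession.builder.remote("sc://spark-connect-ads:15002").getOrCreate()

spark.sql("SELECT count(*) FROM spark_catalog.db.some_iceberg_table").show()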

To work around this, I have to package the required dependencies into my
image, which is neither flexible nor elegant.

I'm wondering if anyone has seen this kind of error before.

FYI, my Spark Connect server deployment looks something like the following:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spark-connect-ads
    component: spark-connect
  name: spark-connect-ads
  namespace: realtime-streaming
spec:
  selector:
    matchLabels:
      app: spark-connect-ads
      component: spark-connect
  template:
    metadata:
      labels:
        app: spark-connect-ads
        component: spark-connect
      name: spark-connect-ads
      namespace: realtime-streaming
    spec:
      containers:
        - command:
            - sh
            - -c
            - >-
              /opt/spark/sbin/start-connect-server.sh --master k8s://https://xxx.com
              --packages org.apache.spark:spark-connect_2.12:3.5.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2
              --conf spark.sql.catalogImplementation=hive
              --conf spark.kubernetes.container.image=xxx/spark-py:3.5-prd
              --conf spark.kubernetes.executor.podNamePrefix=spark-connect-ads
              --conf spark.kubernetes.driver.pod.name=$(hostname)
              --conf spark.driver.host=spark-connect-ads
              --conf spark.kubernetes.namespace=realtime-streaming
              --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
              --conf spark.sql.catalog.spark_catalog.type=hive
              --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
              --conf spark.sql.iceberg.handle-timestamp-without-timezone=true
              --conf spark.kubernetes.container.image.pullPolicy=Always
              && tail -100f /opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-$(hostname).out
          image: 165463520094.dkr.ecr.ap-northeast-1.amazonaws.com/realtime-streaming/spark-py:3.5-prd
          imagePullPolicy: IfNotPresent
          name: spark-connect


How to configure authentication from a pySpark client to a Spark Connect server?

2023-11-05 Thread Xiaolong Wang
Hi,

Our company is currently introducing the Spark Connect server to
production.

Most of the issues have been solved, yet I still don't know how to configure
authentication from a pySpark client to the Spark Connect server.

I noticed that there are interceptor configs on the Scala client side, where
users can call code like the following:

val spark = SparkSession.builder().remote(host).interceptor(...)
to configure a client-side interceptor. On the server side, there is a
corresponding config called spark.connect.grpc.interceptor.classes.

I'm wondering if there is any way to pass authentication information on the
pySpark side. If not, are there any plans on the roadmap to support this?
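
For reference, the closest thing I have found on the Python side is the client connection string itself, which seems to accept extra parameters such as a bearer token and a TLS flag. A minimal sketch (placeholder host, port and token; it assumes the server actually validates the Authorization header, e.g. via a server-side interceptor) would be:

from pyspark.sql import SparkSession

# Sketch only: the endpoint and token are placeholders. `use_ssl=true`
# enables TLS on the gRPC channel, and `token=...` is sent by the Python
# client as a Bearer token in the Authorization header.
spark = (
    SparkSession.builder
    .remote("sc://spark-connect-ads:15002/;use_ssl=true;token=REPLACE_ME")
    .getOrCreate()
)

spark.sql("SELECT 1").show()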