Hi, Spark community,
I ran into a strange problem when using the Spark Connect server with
Iceberg. I added the iceberg-spark-runtime dependency via the `--packages`
parameter, and the driver (Connect server) pod did get the correct
dependencies. However, when I looked at the jars available on the executors,
the Iceberg jar had not been passed along to them.
To work around this, I have to bake the required dependencies into my
image, which is neither flexible nor elegant.
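For reference, the workaround amounts to roughly the sketch below, run at image build time. It assumes Maven Central's standard repository layout and that the image keeps Spark's jars under /opt/spark/jars; the helper name `maven_jar_url` is made up for illustration and is not part of Spark.

```shell
#!/bin/sh
# Hypothetical helper: build the Maven Central download URL for a
# group / artifact / version coordinate (assumes the standard repo layout).
maven_jar_url() {
  # Dots in the group id become path separators in the repository.
  group_path=$(printf '%s' "$1" | tr '.' '/')
  printf 'https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar\n' \
    "$group_path" "$2" "$3" "$2" "$3"
}

# The Iceberg runtime coordinate from my --packages list.
ICEBERG_URL=$(maven_jar_url org.apache.iceberg iceberg-spark-runtime-3.5_2.12 1.4.2)
echo "$ICEBERG_URL"

# In the image build one would then fetch it, for example:
#   curl -fSL -o /opt/spark/jars/iceberg-spark-runtime-3.5_2.12-1.4.2.jar "$ICEBERG_URL"
# (assumption: the image's Spark jars directory is /opt/spark/jars)
```

With the jar baked into the image this way, both the driver and the executor pods see it without relying on `--packages`, at the cost of rebuilding the image for every dependency change.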
I'm wondering if anyone has seen this kind of error before.
FYI, my Spark Connect server deployment looks something like the following:
> apiVersion: apps/v1
> kind: Deployment
> metadata:
>   labels:
>     app: spark-connect-ads
>     component: spark-connect
>   name: spark-connect-ads
>   namespace: realtime-streaming
> spec:
>   selector:
>     matchLabels:
>       app: spark-connect-ads
>       component: spark-connect
>   template:
>     metadata:
>       labels:
>         app: spark-connect-ads
>         component: spark-connect
>       name: spark-connect-ads
>       namespace: realtime-streaming
>     spec:
>       containers:
>       - command:
>         - sh
>         - -c
>         - /opt/spark/sbin/start-connect-server.sh --master k8s://https://xxx.com
>           --packages org.apache.spark:spark-connect_2.12:3.5.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2
>           --conf spark.sql.catalogImplementation=hive
>           --conf spark.kubernetes.container.image=xxx/spark-py:3.5-prd
>           --conf spark.kubernetes.executor.podNamePrefix=spark-connect-ads
>           --conf spark.kubernetes.driver.pod.name=$(hostname)
>           --conf spark.driver.host=spark-connect-ads
>           --conf spark.kubernetes.namespace=realtime-streaming
>           --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
>           --conf spark.sql.catalog.spark_catalog.type=hive
>           --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
>           --conf spark.sql.iceberg.handle-timestamp-without-timezone=true
>           --conf spark.kubernetes.container.image.pullPolicy=Always
>           && tail -100f /opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-$(hostname).out
>         image: 165463520094.dkr.ecr.ap-northeast-1.amazonaws.com/realtime-streaming/spark-py:3.5-prd
>         imagePullPolicy: IfNotPresent
>         name: spark-connect
>