The issue you're encountering with the iceberg-spark-runtime dependency not
being properly passed to the executors in your Spark Connect server
deployment could be due to a couple of factors:

   1. *Spark Submit Packaging:* The --packages parameter resolves the Maven
   coordinates with Ivy on the driver; the executors do not get the JARs on
   their classpath automatically and must fetch them separately. This can
   lead to issues if the resolved JARs are not accessible from the
   executors, such as when running in a distributed environment like
   Kubernetes.
   2. *Kubernetes Container Image:* The Spark Connect server container image
   (xxx/spark-py:3.5-prd) might not have the iceberg-spark-runtime
   dependency pre-installed. This means that even if the JARs are available
   on the driver, the executors won't have access to them.
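
To confirm this diagnosis, you can check what an executor pod actually has
on disk. A minimal check, with illustrative paths (the download location
depends on your image and spark.local.dir settings):

   # List the executor pods created by the Spark Connect server
   kubectl -n realtime-streaming get pods | grep spark-connect-ads

   # Search a running executor pod for the Iceberg JAR
   kubectl -n realtime-streaming exec <executor-pod-name> -- \
     sh -c "find /opt/spark /var/data -name '*iceberg*' 2>/dev/null"

If nothing shows up, the executor never received the JAR.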

To address this issue, consider the following solutions:

   1. *Package Dependencies into Image:* As you mentioned, packaging the
   required dependencies into your container image is a viable option. This
   ensures that the executors have direct access to the JARs, eliminating
   the need for downloading or copying during job execution (see the sketch
   for option 1 after this list).
   2. *Use Spark Submit with --jars Option:* Instead of relying on
   --packages, you can explicitly list the JAR files with the --jars option
   in spark-submit. Spark distributes the listed JARs and adds them to the
   classpath of both the driver and the executors, as long as the paths you
   give are reachable from the executor pods (see the sketch for option 2
   after this list).
   3. *Mount JARs as Shared Volume:* If the iceberg-spark-runtime dependency
   is already installed on the cluster nodes, you can mount the JARs as a
   shared volume accessible to both the driver and executors. This avoids
   the need to package or download the JARs. Mounting JARs as a shared
   volume in your Spark Connect server deployment involves creating a
   shared volume that stores the JARs and then mounting that volume into
   both the driver and executor containers. Here's a step-by-step guide:

   Create a Shared Volume: Create a shared volume using a persistent
   storage solution like NFS, GlusterFS, or AWS EFS. Ensure that all cluster
   nodes have access to the shared volume.

   Copy JARs to Shared Volume: Copy the required JARs, including
   iceberg-spark-runtime, to the shared volume. This will make them accessible
   to both the driver and executor containers.

   Mount Shared Volume to Driver Container: In your Spark Connect server
   deployment configuration, specify the shared volume as a mount point for
   the driver container. This will make the JARs available to the driver.

   Mount Shared Volume to Executor Containers: Because the executor pods
   are created by Spark itself rather than by your Deployment, mount the
   volume through the spark.kubernetes.executor.volumes.* properties in the
   Spark configuration. This will make the JARs available to the executors.

   Update Spark Connect Server Configuration: Point Spark at the mounted
   JARs, for example by adding the mount path to spark.driver.extraClassPath
   and spark.executor.extraClassPath, and keep your existing Iceberg
   settings (spark.sql.extensions and the spark.sql.catalog.spark_catalog
   entries) so Spark loads the Iceberg extensions from those JARs.

   By following these steps, you can mount JARs as a shared volume in your
   Spark Connect server deployment, eliminating the need to package or
   download the JARs (a minimal example appears in the sketch for option 3
   after this list).
   4. *Use Spark Connect Server with Remote Resources:* Spark Connect Server
   supports accessing remote resources, such as JARs stored in a distributed
   file system or a cloud storage service. By pointing spark.jars (or
   --jars) at URLs that every executor can fetch, you can avoid packaging
   the dependencies into the container image (see the sketch for option 4
   after this list).
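
Sketch for option 1: a minimal Dockerfile that bakes the Iceberg runtime
into your existing image. The base tag is taken from your deployment; the
Maven Central URL and the uid 185 (the default spark user in the stock
Spark images) are assumptions to adjust for your build:

   FROM xxx/spark-py:3.5-prd
   USER root
   # Place the Iceberg runtime on the default classpath of driver and executors
   ADD https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.4.2/iceberg-spark-runtime-3.5_2.12-1.4.2.jar /opt/spark/jars/
   # ADD from a URL creates a root-owned file; make it readable for the spark user
   RUN chmod 644 /opt/spark/jars/iceberg-spark-runtime-3.5_2.12-1.4.2.jar
   USER 185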
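
Sketch for option 2: the same start-connect-server.sh invocation as in your
Deployment, but passing the Iceberg runtime as an explicit JAR via --jars.
The s3a:// bucket is a placeholder, and s3a:// URLs require the S3A
connector (hadoop-aws and the AWS SDK) in your image; any URL reachable
from both the driver and the executor pods works the same way:

   /opt/spark/sbin/start-connect-server.sh --master k8s://https://xxx.com \
     --packages org.apache.spark:spark-connect_2.12:3.5.0 \
     --jars s3a://my-bucket/jars/iceberg-spark-runtime-3.5_2.12-1.4.2.jar \
     ...   # remaining --conf flags unchanged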
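
Sketch for option 3: a minimal shared-volume setup, assuming an NFS server
at nfs.example.com exporting /exports/jars (both placeholders). The
Deployment mounts the volume into the Spark Connect server (driver)
container:

   # In the spark-connect container of the Deployment
   volumeMounts:
   - name: spark-extra-jars
     mountPath: /opt/spark/extra-jars
     readOnly: true
   # In the pod spec
   volumes:
   - name: spark-extra-jars
     nfs:
       server: nfs.example.com
       path: /exports/jars

while the executor pods, which Spark creates itself, mount it through
configuration, with the classpath pointed at the mount:

   --conf spark.kubernetes.executor.volumes.nfs.spark-extra-jars.mount.path=/opt/spark/extra-jars
   --conf spark.kubernetes.executor.volumes.nfs.spark-extra-jars.mount.readOnly=true
   --conf spark.kubernetes.executor.volumes.nfs.spark-extra-jars.options.server=nfs.example.com
   --conf spark.kubernetes.executor.volumes.nfs.spark-extra-jars.options.path=/exports/jars
   --conf spark.driver.extraClassPath=/opt/spark/extra-jars/*
   --conf spark.executor.extraClassPath=/opt/spark/extra-jars/*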
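
Sketch for option 4: referencing the JAR from a remote location via
spark.jars. An https:// URL needs no extra filesystem connector, so the
Maven Central artifact can be used directly:

   --conf spark.jars=https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.4.2/iceberg-spark-runtime-3.5_2.12-1.4.2.jar

The same works with hdfs:// or s3a:// URLs, provided the corresponding
filesystem implementation is already on the classpath of both the driver
and the executors.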

By implementing one of these solutions, you should be able to resolve the
issue of the iceberg-spark-runtime dependency not being properly passed to
the executors in your Spark Connect server deployment.

Let me know if any of these proposals works for you.

Alonso

On Mon, Dec 4, 2023 at 11:44, Xiaolong Wang
(<xiaolong.w...@smartnews.com.invalid>) wrote:

> Hi, Spark community,
>
> I encountered a weird bug when using Spark Connect server to integrate
> with Iceberg. I added the iceberg-spark-runtime dependency with
> `--packages` param, the driver/connect-server pod did get the correct
> dependencies. But when looking at the executor's library, the jar was not
> properly passed.
>
> To work around this, I need to package the required dependencies into my
> image, which is neither flexible nor elegant.
>
> I'm wondering if anyone has seen this kind of error before.
>
> FYI, my Spark Connect server deployment looks something like the following:
>
>> apiVersion: apps/v1
>> kind: Deployment
>> metadata:
>>   labels:
>>     app: spark-connect-ads
>>     component: spark-connect
>>   name: spark-connect-ads
>>   namespace: realtime-streaming
>> spec:
>>   selector:
>>     matchLabels:
>>       app: spark-connect-ads
>>       component: spark-connect
>>   template:
>>     metadata:
>>       labels:
>>         app: spark-connect-ads
>>         component: spark-connect
>>       name: spark-connect-ads
>>       namespace: realtime-streaming
>>     spec:
>>       containers:
>>       - command:
>>         - sh
>>         - -c
>>         - /opt/spark/sbin/start-connect-server.sh --master k8s://https://xxx.com
>>           --packages org.apache.spark:spark-connect_2.12:3.5.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2
>>           --conf spark.sql.catalogImplementation=hive
>>           --conf spark.kubernetes.container.image=xxx/spark-py:3.5-prd
>>           --conf spark.kubernetes.executor.podNamePrefix=spark-connect-ads
>>           --conf spark.kubernetes.driver.pod.name=$(hostname)
>>           --conf spark.driver.host=spark-connect-ads
>>           --conf spark.kubernetes.namespace=realtime-streaming
>>           --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
>>           --conf spark.sql.catalog.spark_catalog.type=hive
>>           --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
>>           --conf spark.sql.iceberg.handle-timestamp-without-timezone=true
>>           --conf spark.kubernetes.container.image.pullPolicy=Always
>>           && tail -100f /opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-$(hostname).out
>>         image: 165463520094.dkr.ecr.ap-northeast-1.amazonaws.com/realtime-streaming/spark-py:3.5-prd
>>         imagePullPolicy: IfNotPresent
>>         name: spark-connect
>>
>

-- 
Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman
