So this sounds like a bug to me: in the help output for both regular
spark-submit and ./sbin/start-connect-server.sh we say:
"  --packages                  Comma-separated list of maven coordinates of
jars to include
                              on the driver and executor classpaths. Will
search the local
                              maven repo, then maven central and any
additional remote
                              repositories given by --repositories. The
format for the
                              coordinates should be
groupId:artifactId:version."

If the behaviour is intentional for Spark Connect, it would be good to
understand why (and then also update the docs).
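
For reference, a minimal sketch of what that help text promises (the
iceberg coordinate is the one from the report below):

    ./sbin/start-connect-server.sh \
      --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2

Per the documented behaviour, that artifact should be resolved from the
local maven repo or Maven Central and end up on both the driver and
executor classpaths.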

On Mon, Dec 4, 2023 at 3:33 PM Aironman DirtDiver <alons...@gmail.com>
wrote:

> The issue you're encountering with the iceberg-spark-runtime dependency
> not being properly passed to the executors in your Spark Connect server
> deployment could be due to a couple of factors:
>
> 1. *Spark Submit Packaging:* When you use the --packages parameter in
>    spark-submit, the JARs are resolved and downloaded on the driver; the
>    executors still need to fetch them separately. This can lead to issues
>    if the resolved JARs are not reachable from the executors, such as when
>    running in a distributed environment like Kubernetes.
>
> 2. *Kubernetes Container Image:* The Spark Connect server container image
>    (xxx/spark-py:3.5-prd) might not have the iceberg-spark-runtime
>    dependency pre-installed. This means that even if the JARs are
>    available on the driver, the executors won't have access to them.
>
> To address this issue, consider the following solutions (hedged sketches
> for options 1-3 follow after this list):
>
> 1. *Package Dependencies into Image:* As you mentioned, packaging the
>    required dependencies into your container image is a viable option.
>    This ensures that the executors have direct access to the JARs,
>    eliminating the need for downloading or copying during job execution.
>
> 2. *Use Spark Submit with --jars Option:* Instead of relying on
>    --packages, you can explicitly specify the JARs using the --jars
>    option. The listed JARs are shipped with the application and added to
>    both the driver and executor classpaths.
>
> 3. *Mount JARs as Shared Volume:* If the iceberg-spark-runtime dependency
>    is already available on the cluster nodes, you can mount the JARs as a
>    shared volume accessible to both the driver and executors. This avoids
>    the need to package or download the JARs. Doing so involves creating a
>    shared volume that stores the JARs and then mounting that volume into
>    both the driver and executor containers. Here's a step-by-step guide:
>
>    - Create a shared volume: Use a persistent storage solution like NFS,
>      GlusterFS, or AWS EFS. Ensure that all cluster nodes have access to
>      the shared volume.
>
>    - Copy JARs to the shared volume: Copy the required JARs, including
>      iceberg-spark-runtime, to the shared volume so they are accessible
>      to both the driver and executor containers.
>
>    - Mount the shared volume into the driver container: In your Spark
>      Connect server deployment configuration, specify the shared volume
>      as a mount point for the driver container. This makes the JARs
>      available to the driver.
>
>    - Mount the shared volume into the executor containers: Likewise,
>      specify the shared volume as a mount point for the executor
>      containers, making the JARs available to the executors.
>
>    - Update the Spark Connect server configuration: Make sure the Iceberg
>      catalog is configured, e.g.
>      spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
>      (as in your deployment below), so Spark uses the Iceberg catalog
>      implementation.
>
>    By following these steps, you can mount JARs as a shared volume in
>    your Spark Connect server deployment, eliminating the need to package
>    or download the JARs.
> 4. *Use Spark Connect Server with Remote Resources:* Spark Connect Server
>    supports accessing remote resources, such as JARs stored in a
>    distributed file system or a cloud storage service. By pointing Spark
>    at remote JAR locations, you can avoid packaging the dependencies into
>    the container image.
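>
> As promised above, a few illustrative sketches. These are hedged, not
> definitive: the bucket, PVC name, volume label, and mount path below are
> made-up placeholders, while the flags and spark.kubernetes.*.volumes.*
> properties themselves are standard Spark configuration.
>
> For option 1, the image build could fetch the artifact straight from
> Maven Central, e.g.:
>
>     curl -fLo /opt/spark/jars/iceberg-spark-runtime-3.5_2.12-1.4.2.jar \
>       https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.4.2/iceberg-spark-runtime-3.5_2.12-1.4.2.jar
>
> For option 2 (and option 4's remote resources), pass the JAR explicitly
> from a location readable by both the driver and the executors
> (s3a://my-bucket is a hypothetical bucket):
>
>     /opt/spark/sbin/start-connect-server.sh \
>       --master k8s://https://xxx.com \
>       --jars s3a://my-bucket/jars/iceberg-spark-runtime-3.5_2.12-1.4.2.jar \
>       ...
>
> For option 3, Spark on Kubernetes can declare the volume mounts for you
> ("jars" is just a volume label and "spark-jars-pvc" a hypothetical
> PersistentVolumeClaim holding the copied JARs):
>
>     /opt/spark/sbin/start-connect-server.sh \
>       --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.jars.options.claimName=spark-jars-pvc \
>       --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.jars.mount.path=/opt/extra-jars \
>       --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.jars.options.claimName=spark-jars-pvc \
>       --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.jars.mount.path=/opt/extra-jars \
>       --conf spark.driver.extraClassPath="/opt/extra-jars/*" \
>       --conf spark.executor.extraClassPath="/opt/extra-jars/*" \
>       ...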
>
> By implementing one of these solutions, you should be able to resolve the
> issue of the iceberg-spark-runtime dependency not being properly passed to
> the executors in your Spark Connect server deployment.
>
> Let me know if any of these proposals works for you.
>
> Alonso
>
> On Mon, Dec 4, 2023 at 11:44 AM, Xiaolong Wang
> (<xiaolong.w...@smartnews.com.invalid>) wrote:
>
>> Hi, Spark community,
>>
>> I encountered a weird bug when using the Spark Connect server to
>> integrate with Iceberg. I added the iceberg-spark-runtime dependency with
>> the `--packages` param, and the driver/connect-server pod did get the
>> correct dependencies. But when looking at the executor's libraries, the
>> jar had not been passed along properly.
>>
>> To work around this, I need to package the required dependencies into my
>> image, which is neither flexible nor elegant.
>>
>> I'm wondering if anyone has seen this kind of error before.
>>
>> FYI, my Spark Connect server deployment looks something like the
>> following:
>>
>>> apiVersion: apps/v1
>>> kind: Deployment
>>> metadata:
>>>   labels:
>>>     app: spark-connect-ads
>>>     component: spark-connect
>>>   name: spark-connect-ads
>>>   namespace: realtime-streaming
>>> spec:
>>>   selector:
>>>     matchLabels:
>>>       app: spark-connect-ads
>>>       component: spark-connect
>>>   template:
>>>     metadata:
>>>       labels:
>>>         app: spark-connect-ads
>>>         component: spark-connect
>>>       name: spark-connect-ads
>>>       namespace: realtime-streaming
>>>     spec:
>>>       containers:
>>>       - command:
>>>         - sh
>>>         - -c
>>>         - >-
>>>           /opt/spark/sbin/start-connect-server.sh
>>>           --master k8s://https://xxx.com
>>>           --packages org.apache.spark:spark-connect_2.12:3.5.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2
>>>           --conf spark.sql.catalogImplementation=hive
>>>           --conf spark.kubernetes.container.image=xxx/spark-py:3.5-prd
>>>           --conf spark.kubernetes.executor.podNamePrefix=spark-connect-ads
>>>           --conf spark.kubernetes.driver.pod.name=$(hostname)
>>>           --conf spark.driver.host=spark-connect-ads
>>>           --conf spark.kubernetes.namespace=realtime-streaming
>>>           --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
>>>           --conf spark.sql.catalog.spark_catalog.type=hive
>>>           --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
>>>           --conf spark.sql.iceberg.handle-timestamp-without-timezone=true
>>>           --conf spark.kubernetes.container.image.pullPolicy=Always
>>>           && tail -100f /opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-$(hostname).out
>>>         image: 165463520094.dkr.ecr.ap-northeast-1.amazonaws.com/realtime-streaming/spark-py:3.5-prd
>>>         imagePullPolicy: IfNotPresent
>>>         name: spark-connect
>>>
>>
>
> --
> Alonso Isidoro Roman
> https://about.me/alonso.isidoro.roman
>


-- 
Cell : 425-233-8271
