Re: Spark-Connect: Param `--packages` does not take effect for executors.
So I think this sounds like a bug to me. In the help options for both regular spark-submit and ./sbin/start-connect-server.sh we say:

"--packages  Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version."

If the behaviour is intentional for Spark Connect, it would be good to understand why (and then also update the docs).

On Mon, Dec 4, 2023 at 3:33 PM Aironman DirtDiver wrote:
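As a concrete illustration of the help text quoted above, an invocation might look like the following. This is a sketch only: the Iceberg coordinate is the one used later in this thread, while the master URL and repository are placeholders.

```shell
# Sketch: passing Maven coordinates (groupId:artifactId:version) via --packages.
# The Iceberg coordinate comes from this thread; the master URL is a placeholder.
./sbin/start-connect-server.sh \
  --master k8s://https://example.com \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2 \
  --repositories https://repo1.maven.org/maven2
```

Per the help text, jars resolved this way should land on both the driver and executor classpaths; the report in this thread is that only the driver sees them.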
Re: Spark-Connect: Param `--packages` does not take effect for executors.
The issue you're encountering with the iceberg-spark-runtime dependency not being properly passed to the executors in your Spark Connect server deployment could be due to a couple of factors:

1. *Spark Submit Packaging:* When you use the --packages parameter in spark-submit, it only adds the JARs to the driver classpath. The executors still need to download and load the JARs separately. This can lead to issues if the JARs are not accessible from the executors, such as when running in a distributed environment like Kubernetes.

2. *Kubernetes Container Image:* The Spark Connect server container image (xxx/spark-py:3.5-prd) might not have the iceberg-spark-runtime dependency pre-installed. This means that even if the JARs are available on the driver, the executors won't have access to them.

To address this issue, consider the following solutions:

1. *Package Dependencies into Image:* As you mentioned, packaging the required dependencies into your container image is a viable option. This ensures that the executors have direct access to the JARs, eliminating the need for downloading or copying during job execution.

2. *Use Spark Submit with --jars Option:* Instead of relying on --packages, you can explicitly specify the JARs using the --jars option in spark-submit. The listed JARs are distributed with the application, making them available to both the driver and executors.

3. *Mount JARs as Shared Volume:* If the iceberg-spark-runtime dependency is already available on the cluster nodes, you can mount the JARs as a shared volume accessible to both the driver and executors. This avoids the need to package or download the JARs. Mounting JARs as a shared volume in your Spark Connect server deployment involves creating a shared volume that stores the JARs and then mounting that volume to both the driver and executor containers. Here's a step-by-step guide:

   - Create a Shared Volume: Create a shared volume using a persistent storage solution like NFS, GlusterFS, or AWS EFS. Ensure that all cluster nodes have access to the shared volume.

   - Copy JARs to Shared Volume: Copy the required JARs, including iceberg-spark-runtime, to the shared volume. This will make them accessible to both the driver and executor containers.

   - Mount Shared Volume to Driver Container: In your Spark Connect server deployment configuration, specify the shared volume as a mount point for the driver container. This will make the JARs available to the driver.

   - Mount Shared Volume to Executor Containers: In the Spark Connect server deployment configuration, specify the shared volume as a mount point for the executor containers. This will make the JARs available to the executors.

   - Update Spark Connect Server Configuration: In your Spark Connect server configuration, ensure that the Iceberg catalog is configured (e.g. spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog, as in your deployment below). This will instruct Spark to use the Iceberg catalog implementation.

   By following these steps, you can successfully mount JARs as a shared volume in your Spark Connect server deployment, eliminating the need to package or download the JARs.

4. *Use Spark Connect Server with Remote Resources:* Spark Connect Server supports accessing remote resources, such as JARs stored in a distributed file system or a cloud storage service. By configuring Spark Connect Server to use remote resources, you can avoid packaging the dependencies into the container image.

By implementing one of these solutions, you should be able to resolve the issue of the iceberg-spark-runtime dependency not being properly passed to the executors in your Spark Connect server deployment.

Let me know if any of the proposals works for you.

Alonso

On Mon, Dec 4, 2023 at 11:44 AM, Xiaolong Wang wrote:
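For the shared-volume option above, Spark on Kubernetes can mount volumes into the driver and executor pods directly through configuration. A minimal sketch follows; the volume name `spark-jars`, the claim name `spark-jars-pvc`, and the mount path `/opt/spark/extra-jars` are illustrative assumptions, not taken from the thread.

```shell
# Sketch: mounting a pre-populated PersistentVolumeClaim of jars into both
# driver and executor pods, then pointing --jars at the in-pod path.
# Names below (spark-jars, spark-jars-pvc, /opt/spark/extra-jars) are illustrative.
./sbin/start-connect-server.sh \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-jars.mount.path=/opt/spark/extra-jars \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-jars.options.claimName=spark-jars-pvc \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-jars.mount.path=/opt/spark/extra-jars \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-jars.options.claimName=spark-jars-pvc \
  --jars local:///opt/spark/extra-jars/iceberg-spark-runtime-3.5_2.12-1.4.2.jar
```

The `local://` scheme tells Spark the jar is already present inside each pod, so nothing needs to be downloaded or shipped at submit time.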
Spark-Connect: Param `--packages` does not take effect for executors.
Hi, Spark community,

I encountered a weird bug when using Spark Connect server to integrate with Iceberg. I added the iceberg-spark-runtime dependency with the `--packages` param, and the driver/connect-server pod did get the correct dependencies. But when looking at the executor's library, the jar was not properly passed.

To work around this, I need to package the required dependencies into my image, which is not flexible or elegant.

I'm wondering if anyone has seen this kind of error before.

FYI, my Spark Connect server deployment looks something like the following:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spark-connect-ads
    component: spark-connect
  name: spark-connect-ads
  namespace: realtime-streaming
spec:
  selector:
    matchLabels:
      app: spark-connect-ads
      component: spark-connect
  template:
    metadata:
      labels:
        app: spark-connect-ads
        component: spark-connect
      name: spark-connect-ads
      namespace: realtime-streaming
    spec:
      containers:
        - command:
            - sh
            - -c
            - >-
              /opt/spark/sbin/start-connect-server.sh
              --master k8s://https://xxx.com
              --packages org.apache.spark:spark-connect_2.12:3.5.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2
              --conf spark.sql.catalogImplementation=hive
              --conf spark.kubernetes.container.image=xxx/spark-py:3.5-prd
              --conf spark.kubernetes.executor.podNamePrefix=spark-connect-ads
              --conf spark.kubernetes.driver.pod.name=$(hostname)
              --conf spark.driver.host=spark-connect-ads
              --conf spark.kubernetes.namespace=realtime-streaming
              --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
              --conf spark.sql.catalog.spark_catalog.type=hive
              --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
              --conf spark.sql.iceberg.handle-timestamp-without-timezone=true
              --conf spark.kubernetes.container.image.pullPolicy=Always
              && tail -100f /opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-$(hostname).out
          image: 165463520094.dkr.ecr.ap-northeast-1.amazonaws.com/realtime-streaming/spark-py:3.5-prd
          imagePullPolicy: IfNotPresent
          name: spark-connect
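One way to confirm the symptom described above (jar present on the driver but missing on the executors) is to list the directories where resolved dependencies normally end up in each pod. A sketch, with the caveat that the pod names are illustrative and the exact paths can vary with the image:

```shell
# Sketch: compare resolved jars on the connect-server pod vs. an executor pod.
# Pod names are illustrative; --packages jars are typically resolved via Ivy
# into ~/.ivy2/jars on the driver and fetched into the executor's work dir.
kubectl -n realtime-streaming exec spark-connect-ads-<pod-suffix> -- \
  sh -c 'ls -l ~/.ivy2/jars 2>/dev/null | grep -i iceberg'
kubectl -n realtime-streaming exec spark-connect-ads-exec-1 -- \
  sh -c 'ls -l /opt/spark/work-dir /opt/spark/jars 2>/dev/null | grep -i iceberg'
```

If the first command shows the iceberg-spark-runtime jar and the second shows nothing, that matches the behaviour reported in this thread.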