This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new dd3f81c3d610 [SPARK-47152][SQL][BUILD] Provide `CodeHaus Jackson` dependencies via a new optional directory
dd3f81c3d610 is described below

commit dd3f81c3d6102fe1427702e97f7f42aa64b0bf5e
Author: Dongjoon Hyun <dh...@apple.com>
AuthorDate: Sat Feb 24 11:05:41 2024 -0800

    [SPARK-47152][SQL][BUILD] Provide `CodeHaus Jackson` dependencies via a new optional directory
    
    ### What changes were proposed in this pull request?
    
    This PR aims to provide `Apache Hive`'s `CodeHaus Jackson` dependencies
    via a new optional directory, `hive-jackson`, instead of the standard
    `jars` directory of the Apache Spark binary distribution. Additionally,
    two internal configurations are added whose default value is
    `hive-jackson/*` (a programmatic sketch follows the list).
    
      - `spark.driver.defaultExtraClassPath`
      - `spark.executor.defaultExtraClassPath`
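
    As an illustration only (not part of this PR), both defaults could also
    be overridden programmatically through the existing `SparkLauncher` API.
    A minimal sketch, assuming a hypothetical application jar and main class:
    ```
    import org.apache.spark.launcher.SparkLauncher;

    public class LaunchWithoutHiveJackson {
      public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
          .setAppResource("/path/to/app.jar")                  // hypothetical application jar
          .setMainClass("com.example.MyApp")                   // hypothetical main class
          .setConf("spark.driver.defaultExtraClassPath", "")   // drop hive-jackson/* on the driver
          .setConf("spark.executor.defaultExtraClassPath", "") // ...and on the executors
          .launch();
        spark.waitFor();
      }
    }
    ```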
    
    For example, Apache Spark distributions have been providing the
    `spark-*-yarn-shuffle.jar` file under the `yarn` directory instead of
    `jars`.
    
    **YARN SHUFFLE EXAMPLE**
    ```
    $ ls -al yarn/*jar
    -rw-r--r--  1 dongjoon  staff  77352048 Sep  8 19:08 yarn/spark-3.5.0-yarn-shuffle.jar
    ```
    
    This PR relocates `Apache Hive`'s `CodeHaus Jackson` dependencies in the
    same way.
    
    **BEFORE**
    ```
    $ ls -al jars/*asl*
    -rw-r--r--  1 dongjoon  staff  232248 Sep  8 19:08 jars/jackson-core-asl-1.9.13.jar
    -rw-r--r--  1 dongjoon  staff  780664 Sep  8 19:08 jars/jackson-mapper-asl-1.9.13.jar
    ```
    
    **AFTER**
    ```
    $ ls -al jars/*asl*
    zsh: no matches found: jars/*asl*
    
    $ ls -al hive-jackson
    total 1984
    drwxr-xr-x   4 dongjoon  staff     128 Feb 23 15:37 .
    drwxr-xr-x  16 dongjoon  staff     512 Feb 23 16:34 ..
    -rw-r--r--   1 dongjoon  staff  232248 Feb 23 15:37 jackson-core-asl-1.9.13.jar
    -rw-r--r--   1 dongjoon  staff  780664 Feb 23 15:37 jackson-mapper-asl-1.9.13.jar
    ```
    
    ### Why are the changes needed?
    
    Since Apache Hadoop 3.3.5, Apache Hive is the only component that still
    requires the old CodeHaus Jackson dependencies.
    
    Apache Spark 3.5.0 tried to eliminate them completely, but the change was
    reverted to keep Hive UDF support.
    
      - https://github.com/apache/spark/pull/40893
      - https://github.com/apache/spark/pull/42446
    
    SPARK-47119 added a way to exclude Apache Hive Jackson dependencies at
    the distribution building stage for Apache Spark 4.0.0.
    
      - #45201
    
    This PR provides a way to exclude Apache Hive Jackson dependencies at
    runtime for Apache Spark 4.0.0.
    
    - Spark Shell without Apache Hive Jackson dependencies.
    ```
    $ bin/spark-shell --driver-default-class-path ""
    ```
    
    - Spark SQL Shell without Apache Hive Jackson dependencies.
    ```
    $ bin/spark-sql --driver-default-class-path ""
    ```
    
    - Spark Thrift Server without Apache Hive Jackson dependencies.
    ```
    $ sbin/start-thriftserver.sh --driver-default-class-path ""
    ```
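
    Passing an empty value works because the new `--driver-default-class-path`
    option maps directly to `spark.driver.defaultExtraClassPath` (see the
    `SparkSubmitOptionParser` and `SparkSubmitCommandBuilder` changes below),
    so the empty string replaces the `hive-jackson/*` default.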
    
    In addition, this PR eliminates `CodeHaus Jackson` dependencies from the
    following Apache Spark daemons (launched via `spark-daemon.sh start`)
    because they don't require Hive's `CodeHaus Jackson` dependencies.
    
    - Spark Master
    - Spark Worker
    - Spark History Server
    
    ```
    $ grep 'spark-daemon.sh start' *
    start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
    start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
    start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. There is no user-facing change by default.
    
    - For distributions built with the `hive-jackson-provided` profile, the
      `scope` of the Apache Hive Jackson dependencies is `provided` and the
      `hive-jackson` directory is not created at all.
    - For distributions built with the default settings, the `scope` of the
      Apache Hive Jackson dependencies is still `compile`, and they are on
      Apache Spark's built-in class path, as shown below (the merge logic is
      sketched after this list).
    
    ![Screenshot 2024-02-23 at 16 48 08](https://github.com/apache/spark/assets/9700541/99ed0f02-2792-4666-ae19-ce4f4b7b8ff9)
    
    - The following Spark daemons don't use `CodeHaus Jackson` dependencies.
      - Spark Master
      - Spark Worker
      - Spark History Server
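
    The class-path merge itself is small. A condensed sketch of the logic
    added to `SparkSubmitCommandBuilder` (see the diff below): the default
    stands alone when no extra class path is set and is appended after the
    user-provided entries otherwise.
    ```
    import java.io.File;

    final class ClassPathMerge {
      // Mirrors the SparkSubmitCommandBuilder change: user-provided entries
      // come first, then the default ("hive-jackson/*" unless overridden).
      static String merge(String extraClassPath, String defaultExtraClassPath) {
        if (extraClassPath == null || extraClassPath.trim().isEmpty()) {
          return defaultExtraClassPath;
        }
        return extraClassPath + File.pathSeparator + defaultExtraClassPath;
      }
    }
    ```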
    
    ### How was this patch tested?
    
    Pass the CIs. Also, manually build a distribution and check the class
    paths in the `Environment` tab.
    
    ```
    $ dev/make-distribution.sh -Phive,hive-thriftserver
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #45237 from dongjoon-hyun/SPARK-47152.
    
    Authored-by: Dongjoon Hyun <dh...@apple.com>
    Signed-off-by: Dongjoon Hyun <dh...@apple.com>
---
 .../org/apache/spark/internal/config/package.scala      | 17 +++++++++++++++++
 dev/make-distribution.sh                                |  6 ++++++
 .../apache/spark/launcher/AbstractCommandBuilder.java   |  2 ++
 .../java/org/apache/spark/launcher/SparkLauncher.java   |  8 ++++++++
 .../spark/launcher/SparkSubmitCommandBuilder.java       |  8 ++++++++
 .../apache/spark/launcher/SparkSubmitOptionParser.java  |  2 ++
 6 files changed, 43 insertions(+)

diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala
index 0b026a888e88..7caac5884c74 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/package.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala
@@ -17,6 +17,7 @@
 
 package org.apache.spark.internal
 
+import java.io.File
 import java.util.Locale
 import java.util.concurrent.TimeUnit
 
@@ -64,8 +65,16 @@ package object config {
       .stringConf
       .createOptional
 
+  private[spark] val DRIVER_DEFAULT_EXTRA_CLASS_PATH =
+    ConfigBuilder(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH)
+      .internal()
+      .version("4.0.0")
+      .stringConf
+      .createWithDefault(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE)
+
   private[spark] val DRIVER_CLASS_PATH =
     ConfigBuilder(SparkLauncher.DRIVER_EXTRA_CLASSPATH)
+      .withPrepended(DRIVER_DEFAULT_EXTRA_CLASS_PATH.key, File.pathSeparator)
       .version("1.0.0")
       .stringConf
       .createOptional
@@ -254,8 +263,16 @@ package object config {
   private[spark] val EXECUTOR_ID =
     ConfigBuilder("spark.executor.id").version("1.2.0").stringConf.createOptional
 
+  private[spark] val EXECUTOR_DEFAULT_EXTRA_CLASS_PATH =
+    ConfigBuilder(SparkLauncher.EXECUTOR_DEFAULT_EXTRA_CLASS_PATH)
+      .internal()
+      .version("4.0.0")
+      .stringConf
+      .createWithDefault(SparkLauncher.EXECUTOR_DEFAULT_EXTRA_CLASS_PATH_VALUE)
+
   private[spark] val EXECUTOR_CLASS_PATH =
     ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_CLASSPATH)
+      .withPrepended(EXECUTOR_DEFAULT_EXTRA_CLASS_PATH.key, File.pathSeparator)
       .version("1.0.0")
       .stringConf
       .createOptional
diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh
index ce5c94197d4a..5c4c36df37a6 100755
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -189,6 +189,12 @@ echo "Build flags: $@" >> "$DISTDIR/RELEASE"
 # Copy jars
 cp "$SPARK_HOME"/assembly/target/scala*/jars/* "$DISTDIR/jars/"
 
+# Only create the hive-jackson directory if the Hive Jackson jars exist.
+for f in "$DISTDIR"/jars/jackson-*-asl-*.jar; do
+  mkdir -p "$DISTDIR"/hive-jackson
+  mv "$f" "$DISTDIR"/hive-jackson/
+done
+
 # Only create the yarn directory if the yarn artifacts were built.
 if [ -f "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar ]; then
   mkdir "$DISTDIR/yarn"
diff --git a/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java b/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
index 21861bdcb55e..914f4e4d4570 100644
--- a/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
+++ b/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
@@ -271,6 +271,8 @@ abstract class AbstractCommandBuilder {
       Properties p = loadPropertiesFile();
       p.stringPropertyNames().forEach(key ->
         effectiveConfig.computeIfAbsent(key, p::getProperty));
+      effectiveConfig.putIfAbsent(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH,
+        SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE);
     }
     return effectiveConfig;
   }
diff --git a/launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java b/launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java
index 5d36ef56d2cf..f41474e12df9 100644
--- a/launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java
+++ b/launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java
@@ -54,6 +54,10 @@ public class SparkLauncher extends AbstractLauncher<SparkLauncher> {
 
   /** Configuration key for the driver memory. */
   public static final String DRIVER_MEMORY = "spark.driver.memory";
+  /** Configuration key for the driver default extra class path. */
+  public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH =
+    "spark.driver.defaultExtraClassPath";
+  public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE = "hive-jackson/*";
   /** Configuration key for the driver class path. */
   public static final String DRIVER_EXTRA_CLASSPATH = "spark.driver.extraClassPath";
   /** Configuration key for the default driver VM options. */
@@ -65,6 +69,10 @@ public class SparkLauncher extends AbstractLauncher<SparkLauncher> {
 
   /** Configuration key for the executor memory. */
   public static final String EXECUTOR_MEMORY = "spark.executor.memory";
+  /** Configuration key for the executor default extra class path. */
+  public static final String EXECUTOR_DEFAULT_EXTRA_CLASS_PATH =
+    "spark.executor.defaultExtraClassPath";
+  public static final String EXECUTOR_DEFAULT_EXTRA_CLASS_PATH_VALUE = "hive-jackson/*";
   /** Configuration key for the executor class path. */
   public static final String EXECUTOR_EXTRA_CLASSPATH = "spark.executor.extraClassPath";
   /** Configuration key for the default executor VM options. */
diff --git a/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java b/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java
index 5469b36cf961..d884f7e474c0 100644
--- a/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java
+++ b/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java
@@ -267,6 +267,12 @@ class SparkSubmitCommandBuilder extends AbstractCommandBuilder {
     Map<String, String> config = getEffectiveConfig();
     boolean isClientMode = isClientMode(config);
     String extraClassPath = isClientMode ? config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH) : null;
+    String defaultExtraClassPath = config.get(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH);
+    if (extraClassPath == null || extraClassPath.trim().isEmpty()) {
+      extraClassPath = defaultExtraClassPath;
+    } else {
+      extraClassPath += File.pathSeparator + defaultExtraClassPath;
+    }
 
     List<String> cmd = buildJavaCommand(extraClassPath);
     // Take Thrift/Connect Server as daemon
@@ -498,6 +504,8 @@ class SparkSubmitCommandBuilder extends AbstractCommandBuilder {
         case DRIVER_MEMORY -> conf.put(SparkLauncher.DRIVER_MEMORY, value);
         case DRIVER_JAVA_OPTIONS -> conf.put(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, value);
         case DRIVER_LIBRARY_PATH -> conf.put(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH, value);
+        case DRIVER_DEFAULT_CLASS_PATH ->
+          conf.put(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH, value);
         case DRIVER_CLASS_PATH -> conf.put(SparkLauncher.DRIVER_EXTRA_CLASSPATH, value);
         case CONF -> {
           checkArgument(value != null, "Missing argument to %s", CONF);
diff --git a/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitOptionParser.java b/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitOptionParser.java
index ea54986daab7..df4fccd0f01e 100644
--- a/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitOptionParser.java
+++ b/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitOptionParser.java
@@ -40,6 +40,7 @@ class SparkSubmitOptionParser {
   protected final String CONF = "--conf";
   protected final String DEPLOY_MODE = "--deploy-mode";
   protected final String DRIVER_CLASS_PATH = "--driver-class-path";
+  protected final String DRIVER_DEFAULT_CLASS_PATH = "--driver-default-class-path";
   protected final String DRIVER_CORES = "--driver-cores";
   protected final String DRIVER_JAVA_OPTIONS =  "--driver-java-options";
   protected final String DRIVER_LIBRARY_PATH = "--driver-library-path";
@@ -94,6 +95,7 @@ class SparkSubmitOptionParser {
     { DEPLOY_MODE },
     { DRIVER_CLASS_PATH },
     { DRIVER_CORES },
+    { DRIVER_DEFAULT_CLASS_PATH },
     { DRIVER_JAVA_OPTIONS },
     { DRIVER_LIBRARY_PATH },
     { DRIVER_MEMORY },

