This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new e53abbbceaa [SPARK-45371][CONNECT] Fix shading issues in the Spark Connect Scala Client
e53abbbceaa is described below

commit e53abbbceaa2c41babaa23fe4c2f282f559b4c03
Author: Herman van Hovell <her...@databricks.com>
AuthorDate: Mon Oct 2 13:03:06 2023 -0400

    [SPARK-45371][CONNECT] Fix shading issues in the Spark Connect Scala Client
    
    ### What changes were proposed in this pull request?
    This PR fixes shading for the Spark Connect Scala Client Maven build. The following things are addressed (a small verification sketch follows the list):
    - Guava and protobuf are now included in the shaded jar. They were missing, causing users to see `ClassNotFoundException`s.
    - Fixed duplicate shading of Guava. We now use the relocation defined in the parent pom.
    - Fixed the duplicate Netty dependency (shaded and transitive). One copy was used for gRPC and the other was needed by Arrow; this is fixed by pulling Arrow into the shaded jar.
    - Use the same shading package as defined in the parent pom.
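
    As a quick way to verify the fix, here is a minimal sketch (not part of this PR) that lists the entries of the installed shaded jar and checks that Guava, protobuf, Netty and Arrow are bundled under the relocated packages. It assumes `${spark.shade.packageName}` resolves to `org.sparkproject` as in the parent pom, and takes the jar path (e.g. the jar that Step 1 below installs under `~/.m2/repository/org/apache/spark/spark-connect-client-jvm_2.13/`) as its argument.
    ```scala
    import java.util.jar.JarFile
    import scala.jdk.CollectionConverters._

    // Minimal verification sketch (illustrative, not part of this change).
    // Assumes ${spark.shade.packageName} resolves to org.sparkproject (parent pom).
    object CheckShadedJar {
      def main(args: Array[String]): Unit = {
        val jar = new JarFile(args(0))
        val entries = jar.entries().asScala.map(_.getName).toList
        jar.close()

        def has(prefix: String): Boolean = entries.exists(_.startsWith(prefix))

        // Guava must be bundled, but only under the parent pom's relocation.
        println(s"relocated guava:    ${has("org/sparkproject/guava/")}")
        println(s"unshaded guava:     ${has("com/google/common/")}")
        // Protobuf is relocated together with the rest of com.google.
        println(s"relocated protobuf: ${has("org/sparkproject/com/google/protobuf/")}")
        // Netty and Arrow should now both live in the shaded jar.
        println(s"relocated netty:    ${has("org/sparkproject/io/netty/")}")
        println(s"relocated arrow:    ${has("org/sparkproject/org/apache/arrow/")}")
      }
    }
    ```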
    
    ### Why are the changes needed?
    The maven artifacts for the Spark Connect Scala Client are currently broken.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manual tests.
    #### Step 1: Build the new shaded library and install it in the local Maven repository
    `build/mvn clean install -pl connector/connect/client/jvm -am -DskipTests`
    #### Step 2: Start Connect Server
    `connector/connect/bin/spark-connect`
    #### Step 3: Launch REPL using the newly created library
    This step requires [coursier](https://get-coursier.io/) to be installed.
    `cs launch --jvm zulu:17.0.8 --scala 2.13.9 -r m2Local com.lihaoyi:::ammonite:2.5.11 org.apache.spark::spark-connect-client-jvm:4.0.0-SNAPSHOT --java-opt --add-opens=java.base/java.nio=ALL-UNNAMED -M org.apache.spark.sql.application.ConnectRepl`
    #### Step 4: Run a bunch of commands:
    ```scala
    // Check version
    spark.version
    
    // Run a simple query
    {
      spark.range(1, 10000, 1)
        .select($"id", $"id" % 5 as "group", rand(1).as("v1"), rand(2).as("v2"))
        .groupBy($"group")
        .agg(
          avg($"v1").as("v1_avg"),
          avg($"v2").as("v2_avg"))
        .show()
    }
    
    // Run a streaming query
    {
      import org.apache.spark.sql.execution.streaming.ProcessingTimeTrigger
      val query_name = "simple_streaming"
      val stream = spark.readStream
        .format("rate")
        .option("numPartitions", "1")
        .option("rowsPerSecond", "10")
        .load()
        .withWatermark("timestamp", "10 milliseconds")
        .groupBy(window(col("timestamp"), "10 milliseconds"))
        .count()
        .selectExpr("window.start as timestamp", "count as num_events")
        .writeStream
        .format("memory")
        .queryName(query_name)
        .trigger(ProcessingTimeTrigger.create("10 milliseconds"))
      // run for 20 seconds
      val query = stream.start()
      val start = System.currentTimeMillis()
      val end = System.currentTimeMillis() + 20 * 1000
      while (System.currentTimeMillis() < end) {
        println(s"time: ${System.currentTimeMillis() - start} ms")
        println(query.status)
        spark.sql(s"select * from ${query_name}").show()
        Thread.sleep(500)
      }
      query.stop()
    }
    ```
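
    Optionally, as a last sanity check from the same REPL, the relocated classes can be loaded directly to confirm the client is using the shaded copies. This is a small sketch assuming `${spark.shade.packageName}` resolves to `org.sparkproject`, matching the relocations in this PR and the Guava relocation in the parent pom.
    ```scala
    // Each of these should resolve to a class bundled in the shaded client jar.
    Seq(
      "org.sparkproject.io.grpc.ManagedChannel",       // relocated gRPC
      "org.sparkproject.guava.cache.CacheBuilder",     // Guava, relocated by the parent pom
      "org.sparkproject.com.google.protobuf.Message"   // relocated protobuf
    ).foreach { name =>
      println(s"$name -> ${scala.util.Try(Class.forName(name)).isSuccess}")
    }
    ```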
    
    Closes #43195 from hvanhovell/SPARK-45371.
    
    Authored-by: Herman van Hovell <her...@databricks.com>
    Signed-off-by: Herman van Hovell <her...@databricks.com>
---
 connector/connect/client/jvm/pom.xml | 39 +++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/connector/connect/client/jvm/pom.xml b/connector/connect/client/jvm/pom.xml
index 9ca66b5c29c..a9040107f38 100644
--- a/connector/connect/client/jvm/pom.xml
+++ b/connector/connect/client/jvm/pom.xml
@@ -50,10 +50,20 @@
       <artifactId>spark-sketch_${scala.binary.version}</artifactId>
       <version>${project.version}</version>
     </dependency>
+    <!--
+      We need to define guava and protobuf here because we need to change the scope of both from
+      provided to compile. If we don't do this we can't shade these libraries.
+    -->
     <dependency>
       <groupId>com.google.guava</groupId>
       <artifactId>guava</artifactId>
       <version>${connect.guava.version}</version>
+      <scope>compile</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.google.protobuf</groupId>
+      <artifactId>protobuf-java</artifactId>
+      <scope>compile</scope>
     </dependency>
     <dependency>
       <groupId>com.lihaoyi</groupId>
@@ -85,6 +95,7 @@
         <artifactId>maven-shade-plugin</artifactId>
         <configuration>
           <shadedArtifactAttached>false</shadedArtifactAttached>
+          <promoteTransitiveDependencies>true</promoteTransitiveDependencies>
           <artifactSet>
             <includes>
               <include>com.google.android:*</include>
@@ -92,52 +103,62 @@
               <include>com.google.code.findbugs:*</include>
               <include>com.google.code.gson:*</include>
               <include>com.google.errorprone:*</include>
-              <include>com.google.guava:*</include>
               <include>com.google.j2objc:*</include>
               <include>com.google.protobuf:*</include>
+              <include>com.google.flatbuffers:*</include>
               <include>io.grpc:*</include>
               <include>io.netty:*</include>
               <include>io.perfmark:*</include>
+              <include>org.apache.arrow:*</include>
               <include>org.codehaus.mojo:*</include>
              <include>org.checkerframework:*</include>
              <include>org.apache.spark:spark-connect-common_${scala.binary.version}</include>
+              <include>org.apache.spark:spark-sql-api_${scala.binary.version}</include>
             </includes>
           </artifactSet>
           <relocations>
             <relocation>
               <pattern>io.grpc</pattern>
-              <shadedPattern>${spark.shade.packageName}.connect.client.io.grpc</shadedPattern>
+              <shadedPattern>${spark.shade.packageName}.io.grpc</shadedPattern>
               <includes>
                 <include>io.grpc.**</include>
               </includes>
             </relocation>
             <relocation>
               <pattern>com.google</pattern>
-              <shadedPattern>${spark.shade.packageName}.connect.client.com.google</shadedPattern>
+              <shadedPattern>${spark.shade.packageName}.com.google</shadedPattern>
+              <excludes>
+                <!-- Guava is relocated to ${spark.shade.packageName}.guava (see the parent pom.xml) -->
+                <exclude>com.google.common.**</exclude>
+              </excludes>
             </relocation>
             <relocation>
               <pattern>io.netty</pattern>
-              <shadedPattern>${spark.shade.packageName}.connect.client.io.netty</shadedPattern>
+              <shadedPattern>${spark.shade.packageName}.io.netty</shadedPattern>
             </relocation>
             <relocation>
               <pattern>org.checkerframework</pattern>
-              <shadedPattern>${spark.shade.packageName}.connect.client.org.checkerframework</shadedPattern>
+              <shadedPattern>${spark.shade.packageName}.org.checkerframework</shadedPattern>
             </relocation>
             <relocation>
               <pattern>javax.annotation</pattern>
-              <shadedPattern>${spark.shade.packageName}.connect.client.javax.annotation</shadedPattern>
+              <shadedPattern>${spark.shade.packageName}.javax.annotation</shadedPattern>
             </relocation>
             <relocation>
               <pattern>io.perfmark</pattern>
-              <shadedPattern>${spark.shade.packageName}.connect.client.io.perfmark</shadedPattern>
+              <shadedPattern>${spark.shade.packageName}.io.perfmark</shadedPattern>
             </relocation>
             <relocation>
               <pattern>org.codehaus</pattern>
-              <shadedPattern>${spark.shade.packageName}.connect.client.org.codehaus</shadedPattern>
+              <shadedPattern>${spark.shade.packageName}.org.codehaus</shadedPattern>
+            </relocation>
+            <relocation>
+              <pattern>org.apache.arrow</pattern>
+              <shadedPattern>${spark.shade.packageName}.org.apache.arrow</shadedPattern>
             </relocation>
             <relocation>
               <pattern>android.annotation</pattern>
-              <shadedPattern>${spark.shade.packageName}.connect.client.android.annotation</shadedPattern>
+              <shadedPattern>${spark.shade.packageName}.android.annotation</shadedPattern>
             </relocation>
           </relocations>
           <!--SPARK-42228: Add `ServicesResourceTransformer` to relocation class names in META-INF/services for grpc-->


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
