rahil-c commented on code in PR #1862:
URL: https://github.com/apache/polaris/pull/1862#discussion_r2180478143


##########
plugins/spark/v3.5/integration/build.gradle.kts:
##########
@@ -60,12 +60,51 @@ dependencies {
     exclude("org.apache.logging.log4j", "log4j-core")
     exclude("org.slf4j", "jul-to-slf4j")
   }
+
+  // Add spark-hive for Hudi integration - provides HiveExternalCatalog that Hudi needs
+  testImplementation("org.apache.spark:spark-hive_${scalaVersion}:${spark35Version}") {
+    // exclude log4j dependencies to match spark-sql exclusions
+    exclude("org.apache.logging.log4j", "log4j-slf4j2-impl")
+    exclude("org.apache.logging.log4j", "log4j-1.2-api")
+    exclude("org.apache.logging.log4j", "log4j-core")
+    exclude("org.slf4j", "jul-to-slf4j")
+    // exclude old slf4j 1.x to log4j 2.x bridge that conflicts with slf4j 2.x bridge
+    exclude("org.apache.logging.log4j", "log4j-slf4j-impl")
+  }
   // enforce the usage of log4j 2.24.3. This is for the log4j-api compatibility
   // of spark-sql dependency
   testRuntimeOnly("org.apache.logging.log4j:log4j-core:2.24.3")
   testRuntimeOnly("org.apache.logging.log4j:log4j-slf4j2-impl:2.24.3")
 
   testImplementation("io.delta:delta-spark_${scalaVersion}:3.3.1")
+  testImplementation("org.apache.hudi:hudi-spark3.5-bundle_${scalaVersion}:0.15.0") {
+    // exclude log4j dependencies to match spark-sql exclusions and prevent version conflicts

Review Comment:
   
   Similar to the comment above: https://github.com/apache/polaris/pull/1862#discussion_r2180424419
   
   When running the following command:
   ```
   jar tf hudi-spark3.5-bundle_2.12-0.15.0.jar | grep -i "org/apache/spark/sql/hive"
   ```
   
   I do not see the `HiveExternalCatalog` provided by the `hudi-spark-bundle`.
   ```
   org/apache/spark/sql/hive/
   org/apache/spark/sql/hive/HiveClientUtils.class
   org/apache/spark/sql/hive/HiveClientUtils$.class
   ```
   
   Based on my understanding, the `hudi-spark-bundle` aims to provide the core Hudi dependencies needed for the Spark-Hudi integration to work, but it expects certain Spark dependencies to already be on the engine classpath.
   For example, when checking the OSS Spark engine's `jars` folder (downloaded via the commands below), Spark itself provides these:
   
   ```
   cd ~
   wget 
https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
   mkdir spark-3.5
   tar xzvf spark-3.5.5-bin-hadoop3.tgz  -C spark-3.5 --strip-components=1
   cd spark-3.5
   ```
   
   ```
    rahil@mac  ~/spark-3.5/jars  ls -l | grep hive
   -rw-r--r--@ 1 rahil  staff    183633 Feb 23 12:45 hive-beeline-2.3.9.jar
   -rw-r--r--@ 1 rahil  staff     44704 Feb 23 12:45 hive-cli-2.3.9.jar
   -rw-r--r--@ 1 rahil  staff    436169 Feb 23 12:45 hive-common-2.3.9.jar
   -rw-r--r--@ 1 rahil  staff  10840949 Feb 23 12:45 hive-exec-2.3.9-core.jar
   -rw-r--r--@ 1 rahil  staff    116364 Feb 23 12:45 hive-jdbc-2.3.9.jar
   -rw-r--r--@ 1 rahil  staff    326585 Feb 23 12:45 hive-llap-common-2.3.9.jar
   -rw-r--r--@ 1 rahil  staff   8195966 Feb 23 12:45 hive-metastore-2.3.9.jar
   -rw-r--r--@ 1 rahil  staff    916630 Feb 23 12:45 hive-serde-2.3.9.jar
   -rw-r--r--@ 1 rahil  staff   1679366 Feb 23 12:45 hive-service-rpc-3.1.3.jar
   -rw-r--r--@ 1 rahil  staff     53902 Feb 23 12:45 hive-shims-0.23-2.3.9.jar
   -rw-r--r--@ 1 rahil  staff      8786 Feb 23 12:45 hive-shims-2.3.9.jar
   -rw-r--r--@ 1 rahil  staff    120293 Feb 23 12:45 hive-shims-common-2.3.9.jar
   -rw-r--r--@ 1 rahil  staff     12923 Feb 23 12:45 hive-shims-scheduler-2.3.9.jar
   -rw-r--r--@ 1 rahil  staff    258346 Feb 23 12:45 hive-storage-api-2.8.1.jar
   -rw-r--r--@ 1 rahil  staff    572320 Feb 23 12:45 spark-hive-thriftserver_2.12-3.5.5.jar
   -rw-r--r--@ 1 rahil  staff    725252 Feb 23 12:45 spark-hive_2.12-3.5.5.jar
   ```
   
   This is why Hudi does not hit a `ClassNotFoundException` (at least when testing via my local Spark). Therefore I believe we will need to explicitly provide `spark-hive` in the test environment in order for the Spark-Hudi integration tests to work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
