rahil-c commented on code in PR #1862:
URL: https://github.com/apache/polaris/pull/1862#discussion_r2180478143
##########
plugins/spark/v3.5/integration/build.gradle.kts:
##########
@@ -60,12 +60,51 @@ dependencies {
exclude("org.apache.logging.log4j", "log4j-core")
exclude("org.slf4j", "jul-to-slf4j")
}
+
+  // Add spark-hive for Hudi integration - provides HiveExternalCatalog that Hudi needs
+  testImplementation("org.apache.spark:spark-hive_${scalaVersion}:${spark35Version}") {
+    // exclude log4j dependencies to match spark-sql exclusions
+    exclude("org.apache.logging.log4j", "log4j-slf4j2-impl")
+    exclude("org.apache.logging.log4j", "log4j-1.2-api")
+    exclude("org.apache.logging.log4j", "log4j-core")
+    exclude("org.slf4j", "jul-to-slf4j")
+    // exclude old slf4j 1.x to log4j 2.x bridge that conflicts with slf4j 2.x bridge
+    exclude("org.apache.logging.log4j", "log4j-slf4j-impl")
+  }
// enforce the usage of log4j 2.24.3. This is for the log4j-api compatibility
// of spark-sql dependency
testRuntimeOnly("org.apache.logging.log4j:log4j-core:2.24.3")
testRuntimeOnly("org.apache.logging.log4j:log4j-slf4j2-impl:2.24.3")
testImplementation("io.delta:delta-spark_${scalaVersion}:3.3.1")
+  testImplementation("org.apache.hudi:hudi-spark3.5-bundle_${scalaVersion}:0.15.0") {
+    // exclude log4j dependencies to match spark-sql exclusions and prevent version conflicts
Review Comment:
Similar to my comment above:
https://github.com/apache/polaris/pull/1862#discussion_r2180424419
When running the following command
`jar tf hudi-spark3.5-bundle_2.12-0.15.0.jar | grep -i "org/apache/spark/sql/hive"`
I do not see `HiveExternalCatalog` provided by the `hudi-spark-bundle`:
```
org/apache/spark/sql/hive/
org/apache/spark/sql/hive/HiveClientUtils.class
org/apache/spark/sql/hive/HiveClientUtils$.class
```
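As a cross-check (my own sketch, not something from this PR): `HiveExternalCatalog` is packaged in Spark's `spark-hive` module, so grepping that jar should find it, while the Hudi bundle does not contain it:
```
# sketch: verify HiveExternalCatalog ships in spark-hive rather than the hudi bundle
jar tf spark-hive_2.12-3.5.5.jar | grep -i "HiveExternalCatalog"
```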
Based on my understanding, the `hudi-spark-bundle` aims to provide the core Hudi dependencies needed for the Spark-Hudi integration to work, but it expects certain Spark dependencies (such as `spark-hive`) to already be on the engine classpath.
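To illustrate that expectation (a sketch following the usual Hudi quickstart pattern, not taken from this PR): with OSS Spark, the bundle jar is handed to an engine whose `jars/` folder already ships `spark-hive`:
```
# sketch: launching OSS Spark with the Hudi bundle; spark-hive comes from the
# engine's own jars/ directory, so HiveExternalCatalog resolves at runtime
./bin/spark-shell \
  --jars hudi-spark3.5-bundle_2.12-0.15.0.jar \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
```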
For example, when checking the OSS Spark engine's `jars` folder after running the following commands, Spark itself provides these:
```
cd ~
wget https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
mkdir spark-3.5
tar xzvf spark-3.5.5-bin-hadoop3.tgz -C spark-3.5 --strip-components=1
cd spark-3.5
```
```
rahil@mac ~/spark-3.5/jars ls -l | grep hive
-rw-r--r--@ 1 rahil staff 183633 Feb 23 12:45 hive-beeline-2.3.9.jar
-rw-r--r--@ 1 rahil staff 44704 Feb 23 12:45 hive-cli-2.3.9.jar
-rw-r--r--@ 1 rahil staff 436169 Feb 23 12:45 hive-common-2.3.9.jar
-rw-r--r--@ 1 rahil staff 10840949 Feb 23 12:45 hive-exec-2.3.9-core.jar
-rw-r--r--@ 1 rahil staff 116364 Feb 23 12:45 hive-jdbc-2.3.9.jar
-rw-r--r--@ 1 rahil staff 326585 Feb 23 12:45 hive-llap-common-2.3.9.jar
-rw-r--r--@ 1 rahil staff 8195966 Feb 23 12:45 hive-metastore-2.3.9.jar
-rw-r--r--@ 1 rahil staff 916630 Feb 23 12:45 hive-serde-2.3.9.jar
-rw-r--r--@ 1 rahil staff 1679366 Feb 23 12:45 hive-service-rpc-3.1.3.jar
-rw-r--r--@ 1 rahil staff 53902 Feb 23 12:45 hive-shims-0.23-2.3.9.jar
-rw-r--r--@ 1 rahil staff 8786 Feb 23 12:45 hive-shims-2.3.9.jar
-rw-r--r--@ 1 rahil staff 120293 Feb 23 12:45 hive-shims-common-2.3.9.jar
-rw-r--r--@ 1 rahil staff 12923 Feb 23 12:45 hive-shims-scheduler-2.3.9.jar
-rw-r--r--@ 1 rahil staff 258346 Feb 23 12:45 hive-storage-api-2.8.1.jar
-rw-r--r--@ 1 rahil staff 572320 Feb 23 12:45 spark-hive-thriftserver_2.12-3.5.5.jar
-rw-r--r--@ 1 rahil staff 725252 Feb 23 12:45 spark-hive_2.12-3.5.5.jar
rahil@mac ~/spark-3.5/jars
```
That is what allows Hudi to avoid the ClassNotFoundException (at least when testing via my local Spark). Therefore I believe we will need to explicitly provide `spark-hive` in the test environment in order for the Spark-Hudi integration tests to work.
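If useful for verifying the fix, a dependency report should now show `spark-hive` on the test runtime classpath (the Gradle project path below is my guess; substitute the integration module's actual path):
```
# sketch: the project path here is hypothetical - use the module's real path
./gradlew :plugins:spark:v3.5:integration:dependencies \
  --configuration testRuntimeClasspath | grep spark-hive
```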