sassai opened a new issue #1646:
URL: https://github.com/apache/incubator-hudi/issues/1646


   **Describe the problem you faced**
   
   I am facing a problem when querying Hudi tables (MOR) through Spark SQL. 
Spark SQL queries against the Hive tables fail with `java.lang.ClassNotFoundException: 
org.apache.hudi.hadoop.HoodieParquetInputFormat`. Any help is appreciated.
   
   I'm running Hudi on the Cloudera Data Platform. I have configured Hive to 
use the additional hudi-hadoop-mr jar as described in 
https://hudi.apache.org/docs/0.5.2-querying_data.html#hive, and I am able to query 
the data through Apache Hue and the Hive CLI.
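   
   As a sanity check, the table's registered input format can be inspected from the 
Spark shell (a hypothetical snippet, not part of the original repro; `DESCRIBE FORMATTED` 
only reads metastore metadata, so it is expected to succeed even though the table scan 
below fails):
   
   ```scala
   // Hypothetical sanity check: the InputFormat row should show
   // org.apache.hudi.hadoop.HoodieParquetInputFormat -- the same class that
   // Spark's embedded Hive client fails to load during the query below.
   spark.sql("describe formatted departments_ro")
     .filter("col_name = 'InputFormat'")
     .show(false)
   ```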
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Start a Spark shell:
   
   ```
   spark-shell \
     --jars "/shared/jars/hive/hudi-hadoop-mr-bundle-0.5.2-incubating.jar,/shared/scripts/hudi-test/lib/hudi-spark-bundle_2.11-0.5.2-incubating.jar" \
     --conf spark.sql.hive.convertMetastoreParquet="false" \
     --conf spark.hadoop.fs.azure.ext.cab.required.group="data-eng" \
     --num-executors 1 \
     --driver-memory 2g \
     --executor-memory 2g \
     --master yarn
   ```
   
   2. List the tables:
   
   ```
   scala> spark.sql("show tables").show()
   +--------+--------------------+-----------+
   |database|           tableName|isTemporary|
   +--------+--------------------+-----------+
   | default|      departments_ro|      false|
   | default|      departments_rt|      false|
   +--------+--------------------+-----------+
   ```
   
   3. Query the MOR table:
   
   ```
   scala> spark.sql("select * from departments_ro").show()
   ```
   
   4. The query fails with:
   
   ```
   Hive Session ID = 87983a80-2a32-4cc4-a0d0-45cdce434c05
   java.lang.ClassNotFoundException: org.apache.hudi.hadoop.HoodieParquetInputFormat
   ```
   
   (See the **Stacktrace** section below for the full trace.)
   
   
   **Expected behavior**
   
   A Hive table containing Hudi Parquet files can be queried through Spark SQL.
   
   **Environment Description**
   
   * Hudi version :  0.5.2
   
   * Spark version : 2.4.0.7.1.0.0-714
   
   * Hive version : Hive 3.1.3000.7.1.0.0-714
   
   * Hadoop version : Hadoop 3.1.1.7.1.0.0-714
   
   * Storage (HDFS/S3/GCS..) : ADLS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Querying the data through the Spark datasource API works fine, as sketched below:
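   
   (The base path and partition glob below are placeholders, not the actual ADLS 
location; the snippet only illustrates the datasource read that succeeds.)
   
   ```scala
   // Illustrative sketch only -- the real ADLS path and partition layout differ.
   // Reading through the Hudi datasource bypasses the Hive metastore, so
   // HoodieParquetInputFormat is never looked up by the Hive client.
   val df = spark.read
     .format("org.apache.hudi")
     .load("abfs://container@account.dfs.core.windows.net/hudi/departments/*")
   df.show()
   ```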
   
   
   **Stacktrace**
   
   
   ```
   java.lang.ClassNotFoundException: org.apache.hudi.hadoop.HoodieParquetInputFormat
     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
     at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
     at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
     at java.lang.Class.forName0(Native Method)
     at java.lang.Class.forName(Class.java:348)
     at org.apache.spark.util.Utils$.classForName(Utils.scala:193)
     at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$toInputFormat(HiveClientImpl.scala:974)
     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$toHiveTable$4.apply(HiveClientImpl.scala:1010)
     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$toHiveTable$4.apply(HiveClientImpl.scala:1010)
     at scala.Option.map(Option.scala:146)
     at org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1010)
     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitions$1.apply(HiveClientImpl.scala:727)
     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitions$1.apply(HiveClientImpl.scala:726)
     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:313)
     at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:247)
     at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:246)
     at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:296)
     at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitions(HiveClientImpl.scala:726)
     at org.apache.spark.sql.hive.client.HiveClient$class.getPartitions(HiveClient.scala:210)
     at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitions(HiveClientImpl.scala:86)
     at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitions$1.apply(HiveExternalCatalog.scala:1197)
     at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitions$1.apply(HiveExternalCatalog.scala:1195)
     at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
     at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1195)
     at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:246)
     at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:948)
     at org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:178)
     at org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:166)
     at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
     at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
     at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2428)
     at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:191)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:392)
     at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
     at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:636)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
     at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
     at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
     at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
     at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
     at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
     at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
     at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
     at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
     at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
     at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
     at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
     at org.apache.spark.sql.Dataset.show(Dataset.scala:751)
     at org.apache.spark.sql.Dataset.show(Dataset.scala:710)
     at org.apache.spark.sql.Dataset.show(Dataset.scala:719)
     ... 49 elided
   ```
   

