sassai opened a new issue #1646: URL: https://github.com/apache/incubator-hudi/issues/1646
**Describe the problem you faced**

I am facing a problem when querying Hudi Tables (MOR) through Spark SQL. Spark SQL queries to Hive are resulting in `java.lang.ClassNotFoundException: org.apache.hudi.hadoop.HoodieParquetInputFormat`. Any help is appreciated.

I'm running Hudi on the Cloudera Data Platform. I have configured Hive to use the additional hudi-hadoop-mr jar as described in https://hudi.apache.org/docs/0.5.2-querying_data.html#hive. I am able to query the data through Apache Hue and Hive CLI.

**To Reproduce**

Steps to reproduce the behavior:

1. Start a spark shell

       spark-shell \
         --jars "/shared/jars/hive/hudi-hadoop-mr-bundle-0.5.2-incubating.jar,/shared/scripts/hudi-test/lib/hudi-spark-bundle_2.11-0.5.2-incubating.jar" \
         --conf spark.sql.hive.convertMetastoreParquet="false" \
         --conf spark.hadoop.fs.azure.ext.cab.required.group="data-eng" \
         --num-executors 1 \
         --driver-memory 2g \
         --executor-memory 2g \
         --master yarn

2. List tables

       scala> spark.sql("show tables").show()
       +--------+--------------------+-----------+
       |database|           tableName|isTemporary|
       +--------+--------------------+-----------+
       | default|      departments_ro|      false|
       | default|      departments_rt|      false|
       +--------+--------------------+-----------+

3. Query MOR Table

       spark.sql("select * from departments_ro").show()

4. Results in

       Hive Session ID = 87983a80-2a32-4cc4-a0d0-45cdce434c05
       java.lang.ClassNotFoundException: org.apache.hudi.hadoop.HoodieParquetInputFormat
         at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
         at java.lang.Class.forName0(Native Method)
         at java.lang.Class.forName(Class.java:348)
         at org.apache.spark.util.Utils$.classForName(Utils.scala:193)
         at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$toInputFormat(HiveClientImpl.scala:974)
         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$toHiveTable$4.apply(HiveClientImpl.scala:1010)
         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$toHiveTable$4.apply(HiveClientImpl.scala:1010)
         at scala.Option.map(Option.scala:146)

**Expected behavior**

Hive Table containing Hudi Parquet Files can be queried through Spark SQL.

**Environment Description**

* Hudi version : 0.5.2
* Spark version : 2.4.0.7.1.0.0-714
* Hive version : Hive 3.1.3000.7.1.0.0-714
* Hadoop version : Hadoop 3.1.1.7.1.0.0-714
* Storage (HDFS/S3/GCS..) : ADLS
* Running on Docker?
(yes/no) : no

**Additional context**

Querying the data through the Spark Datasource API works fine.

**Stacktrace**

```
java.lang.ClassNotFoundException: org.apache.hudi.hadoop.HoodieParquetInputFormat
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.spark.util.Utils$.classForName(Utils.scala:193)
  at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$toInputFormat(HiveClientImpl.scala:974)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$toHiveTable$4.apply(HiveClientImpl.scala:1010)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$toHiveTable$4.apply(HiveClientImpl.scala:1010)
  at scala.Option.map(Option.scala:146)
  at org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1010)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitions$1.apply(HiveClientImpl.scala:727)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitions$1.apply(HiveClientImpl.scala:726)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:313)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:247)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:246)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:296)
  at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitions(HiveClientImpl.scala:726)
  at org.apache.spark.sql.hive.client.HiveClient$class.getPartitions(HiveClient.scala:210)
  at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitions(HiveClientImpl.scala:86)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitions$1.apply(HiveExternalCatalog.scala:1197)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitions$1.apply(HiveExternalCatalog.scala:1195)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1195)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:246)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:948)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:178)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:166)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
  at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2428)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:191)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:392)
  at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:636)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:751)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:710)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:719)
  ... 49 elided
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
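
The exception in this report is raised while the class name of the table's input format is resolved inside the Spark session. A minimal diagnostic sketch, which can be pasted into the same `spark-shell`, checks whether that class is visible to a given class loader. The helper `isLoadable` is a hypothetical name introduced for illustration; it is not part of Spark or Hudi, and the check only narrows down where the class is missing. It is not a fix, and the class loader Spark actually uses for the Hive client may differ from the two probed here.

```scala
import scala.util.Try

// Hypothetical helper: true if `className` can be resolved by `loader`.
// `initialize = false` avoids running the class's static initializers.
def isLoadable(className: String, loader: ClassLoader): Boolean =
  Try(Class.forName(className, false, loader)).isSuccess

// Class name taken from the exception in this report.
val inputFormat = "org.apache.hudi.hadoop.HoodieParquetInputFormat"

// Probe the thread's context class loader and the system class loader.
val contextLoader = Thread.currentThread().getContextClassLoader
val systemLoader = ClassLoader.getSystemClassLoader

println(s"context class loader sees it: ${isLoadable(inputFormat, contextLoader)}")
println(s"system class loader sees it:  ${isLoadable(inputFormat, systemLoader)}")
```

If both checks print `false`, the bundle passed via `--jars` is likely not reaching the driver at all; if one prints `true`, the class is present in the session but may not be visible to the loader used on the failing code path.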