yihua commented on code in PR #8885:
URL: https://github.com/apache/hudi/pull/8885#discussion_r1224771994


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieParquetFileFormat.scala:
##########
@@ -34,6 +34,15 @@ class HoodieParquetFileFormat extends ParquetFileFormat with SparkAdapterSupport
 
   override def toString: String = "Hoodie-Parquet"
 
+  override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
+    if (HoodieSparkUtils.gteqSpark3_4) {

Review Comment:
   The tests fail for other Spark versions if I don't add this check.
   ```
   Merge Hudi to Hudi *** FAILED ***
   2023-06-06T23:38:24.7660935Z   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3194.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3194.0 (TID 3768) (fv-az1128-658 executor driver): java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatchRow cannot be cast to org.apache.spark.sql.vectorized.ColumnarBatch
   2023-06-06T23:38:24.7662056Z         at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:560)
   2023-06-06T23:38:24.7662628Z         at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:549)
   2023-06-06T23:38:24.7663391Z         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
   ```
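
   For context, a minimal sketch of the kind of version gate under discussion. The branch bodies below are illustrative assumptions (the PR's actual bodies are elided from the quoted diff): defer to Spark's stock Parquet decision on 3.4+, and report row-wise output on older versions so the scan never expects a ColumnarBatch.
   ```scala
   import org.apache.hudi.HoodieSparkUtils
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
   import org.apache.spark.sql.types.StructType

   // Hypothetical sketch, not the PR's implementation.
   class VersionGatedParquetFileFormat extends ParquetFileFormat {
     override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
       if (HoodieSparkUtils.gteqSpark3_4) {
         // Spark 3.4+: defer to the stock Parquet logic.
         super.supportBatch(sparkSession, schema)
       } else {
         // Older Sparks: report row-wise output so FileSourceScanExec never
         // casts the reader's output to ColumnarBatch (assumed fallback).
         false
       }
     }
   }
   ```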



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##########
@@ -66,17 +66,21 @@ case class BaseFileOnlyRelation(override val sqlContext: SQLContext,
   // NOTE: This override has to mirror semantic of whenever this Relation is converted into [[HadoopFsRelation]],
   //       which is currently done for all cases, except when Schema Evolution is enabled
   override protected val shouldExtractPartitionValuesFromPartitionPath: Boolean =
-    internalSchemaOpt.isEmpty
+  internalSchemaOpt.isEmpty
 
   override lazy val mandatoryFields: Seq[String] = Seq.empty
 
+  // Before Spark 3.4.0: PartitioningAwareFileIndex.BASE_PATH_PARAM
+  // Since Spark 3.4.0: FileIndexOptions.BASE_PATH_PARAM
+  val BASE_PATH_PARAM = "basePath"
+
   override def updatePrunedDataSchema(prunedSchema: StructType): Relation =
     this.copy(prunedDataSchema = Some(prunedSchema))
 
   override def imbueConfigs(sqlContext: SQLContext): Unit = {
     super.imbueConfigs(sqlContext)
     // TODO Issue with setting this to true in spark 332
-    if (!HoodieSparkUtils.gteqSpark3_3_2) {
+    if (HoodieSparkUtils.gteqSpark3_4 || !HoodieSparkUtils.gteqSpark3_3_2) {

Review Comment:
   The tests fail for other Spark versions if I don't add this check.
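
   A minimal sketch of the resulting gate, assuming the guarded block enables the vectorized Parquet reader (the guarded body is elided from the quoted diff, so the conf key here is an illustrative assumption):
   ```scala
   import org.apache.hudi.HoodieSparkUtils
   import org.apache.spark.sql.SQLContext

   // Hypothetical helper mirroring the guard; not the PR's implementation.
   def imbueConfigsSketch(sqlContext: SQLContext): Unit = {
     // True on Spark >= 3.4 and on Spark < 3.3.2; false only in the
     // 3.3.2 <= version < 3.4 window flagged by the TODO above.
     if (HoodieSparkUtils.gteqSpark3_4 || !HoodieSparkUtils.gteqSpark3_3_2) {
       sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "true")
     }
   }
   ```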



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##########
@@ -66,17 +66,21 @@ case class BaseFileOnlyRelation(override val sqlContext: SQLContext,
   // NOTE: This override has to mirror semantic of whenever this Relation is converted into [[HadoopFsRelation]],
   //       which is currently done for all cases, except when Schema Evolution is enabled
   override protected val shouldExtractPartitionValuesFromPartitionPath: Boolean =
-    internalSchemaOpt.isEmpty
+  internalSchemaOpt.isEmpty
 
   override lazy val mandatoryFields: Seq[String] = Seq.empty
 
+  // Before Spark 3.4.0: PartitioningAwareFileIndex.BASE_PATH_PARAM
+  // Since Spark 3.4.0: FileIndexOptions.BASE_PATH_PARAM
+  val BASE_PATH_PARAM = "basePath"
+
   override def updatePrunedDataSchema(prunedSchema: StructType): Relation =
     this.copy(prunedDataSchema = Some(prunedSchema))
 
   override def imbueConfigs(sqlContext: SQLContext): Unit = {
     super.imbueConfigs(sqlContext)
     // TODO Issue with setting this to true in spark 332
-    if (!HoodieSparkUtils.gteqSpark3_3_2) {
+    if (HoodieSparkUtils.gteqSpark3_4 || !HoodieSparkUtils.gteqSpark3_3_2) {

Review Comment:
   ```
   Merge Hudi to Hudi *** FAILED ***
   2023-06-06T23:38:24.7660935Z   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3194.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3194.0 (TID 3768) (fv-az1128-658 executor driver): java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatchRow cannot be cast to org.apache.spark.sql.vectorized.ColumnarBatch
   2023-06-06T23:38:24.7662056Z         at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:560)
   2023-06-06T23:38:24.7662628Z         at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:549)
   2023-06-06T23:38:24.7663391Z         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
   ```
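
   For reference, the trace points at the unchecked cast in FileSourceScanExec's batch iterator: once supportBatch reports true, each element the reader yields is cast to ColumnarBatch, so a reader that actually emits rows (ColumnarBatchRow) triggers exactly this ClassCastException. A simplified schematic of the failing pattern, not Spark's actual code:
   ```scala
   import org.apache.spark.sql.vectorized.ColumnarBatch

   // Schematic only: with batch output enabled, the scan casts every element
   // of the reader's iterator to ColumnarBatch; row-wise output blows up here.
   def asBatches(readerOutput: Iterator[Any]): Iterator[ColumnarBatch] =
     readerOutput.map(_.asInstanceOf[ColumnarBatch])
   ```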


