[GitHub] [hudi] rahil-c commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2
rahil-c commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1540548174

> There was one failure in the CI: TestAvroSchemaResolutionSupport.testDataTypePromotions

Had an offline conversation; we will disable this test for Spark 3.3.2 for now.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
rahil-c commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1537768291

@xiarixiaoyao Thanks for your analysis. I've tried adding the code block you linked in this PR. The one thing I am seeing from the tests is a new failure, since this "returning_batch" config does not seem to be getting set internally by Spark:

```
java.lang.IllegalArgumentException: OPTION_RETURNING_BATCH should always be set for ParquetFileFormat. To workaround this issue, set spark.sql.parquet.enableVectorizedReader=false.
```

Do you have any idea why applying this fix from Spark is causing issues? From my understanding, the property should be set within Spark itself, inside `DataSourceScanExec.scala` (see https://github.com/apache/hudi/pull/8082/files):

```scala
lazy val inputRDD: RDD[InternalRow] = {
  val options = relation.options +
    (FileFormat.OPTION_RETURNING_BATCH -> supportsColumnar.toString)
  val readFile: (PartitionedFile) => Iterator[InternalRow] =
    relation.fileFormat.buildReaderWithPartitionValues(
      sparkSession = relation.sparkSession,
      dataSchema = relation.dataSchema,
      partitionSchema = relation.partitionSchema,
      requiredSchema = requiredSchema,
      filters = pushedDownFilters,
      options = options,
      hadoopConf = relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options))
```
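As a self-contained sketch of the pattern being discussed: the string `"returning_batch"` below mirrors the value of Spark 3.3.2's `FileFormat.OPTION_RETURNING_BATCH` constant, and the helper name `withReturningBatch` is hypothetical, not part of Hudi or Spark.

```scala
// Hypothetical helper: make sure the "returning_batch" entry (the value of
// Spark 3.3.2's FileFormat.OPTION_RETURNING_BATCH constant) is present in
// the reader options before buildReaderWithPartitionValues is invoked;
// Spark 3.3.2's ParquetFileFormat throws IllegalArgumentException when a
// scan path builds the reader without this key in its options map.
val OPTION_RETURNING_BATCH = "returning_batch"

def withReturningBatch(options: Map[String, String],
                       supportsColumnar: Boolean): Map[String, String] =
  options + (OPTION_RETURNING_BATCH -> supportsColumnar.toString)
```

Under this sketch, a custom scan path would pass `withReturningBatch(relation.options, supportsColumnar)` as the `options` argument instead of the raw `relation.options`.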
rahil-c commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1536481643

@xiarixiaoyao At least from the Java CI perspective, when disabling this vectorized reader config for Spark 3.3.2, all the tests are passing. cc @danny0405 @yihua
rahil-c commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1535683746

> If we'd like to have this fix in the 0.13.1 release without introducing performance problems for existing Spark versions, could we consider the following to triage the scope of impact?
>
> (1) Could we disable the optimization rule of nested schema pruning for Spark 3.3.2 only, and see if the tests can pass (without the config change of the vectorized reader)? This is done by not adding `org.apache.spark.sql.execution.datasources.Spark33NestedSchemaPruning` for Spark 3.3.2 in [HoodieAnalysis](https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala#L132).
>
> (2) If the above does not work, could we disable the vectorized reader for Spark 3.3.2 only? And still use Spark 3.3.1 as the compile dependency in this case?
>
> (3) Could we also list all the failed tests and see what they have in common for further investigation?

@yihua

(1) When I tried disabling the optimization rule and running the tests, the issue was still present in several failed tests.

(2) I think this path can work; I added a check so that this config is set to true only for Spark versions other than 3.3.2. I'm not sure what you mean about using the Spark 3.3.1 compile dependency, though.

(3) The test failure list is above.

cc @danny0405 @xiarixiaoyao
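A minimal sketch of option (2), gating the vectorized-reader default on the runtime Spark version. The function name `vectorizedReaderDefault` is hypothetical; the config key is Spark's real `spark.sql.parquet.enableVectorizedReader`, and how Hudi actually detects the Spark version is not shown here.

```scala
// Hypothetical version gate: only Spark 3.3.2 falls back to the row-based
// Parquet reader; every other version keeps the vectorized reader enabled.
// Returns the (configKey, configValue) pair to apply to the Spark session.
def vectorizedReaderDefault(sparkVersion: String): (String, String) =
  ("spark.sql.parquet.enableVectorizedReader",
    if (sparkVersion == "3.3.2") "false" else "true")
```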
rahil-c commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1535003728

> spark.sql.parquet.enableVectorizedReader", "true")

@xiarixiaoyao @danny0405 Thank you both for taking a look at this. Just to confirm, are we saying we can't disable `spark.sql.parquet.enableVectorizedReader`, since that would cause a performance regression? Is there any other workaround you are aware of?
rahil-c commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1503840585

@hudi-bot run azure
rahil-c commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1503656500

> Any info on when will this be merged?

I will try to get this in, ideally sometime this week. @yihua
rahil-c commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1460701128

@hudi-bot run azure
rahil-c commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1452764196

@hudi-bot run azure