Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
yihua merged PR #10304: URL: https://github.com/apache/hudi/pull/10304 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
yihua commented on code in PR #10304:
URL: https://github.com/apache/hudi/pull/10304#discussion_r1425904302

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala: ##
@@ -259,65 +256,43 @@ object DefaultSource {
         CDCRelation.getCDCRelation(sqlContext, metaClient, parameters)
       }
     } else {
-      lazy val fileFormatUtils = if ((isMultipleBaseFileFormatsEnabled && !isBootstrappedTable)
-        || (useNewParquetFileFormat)) {
-        val formatUtils = new HoodieSparkFileFormatUtils(sqlContext, metaClient, parameters, userSchema)
-        if (formatUtils.hasSchemaOnRead) Option.empty else Some(formatUtils)
-      } else {
-        Option.empty
-      }
-
-      if (isMultipleBaseFileFormatsEnabled) {
-        if (isBootstrappedTable) {
-          throw new HoodieException(s"Multiple base file formats are not supported for bootstrapped table")
-        }
-        resolveMultiFileFormatRelation(tableType, queryType, fileFormatUtils.get)

Review Comment:
See the changes in other classes.
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
yihua commented on code in PR #10304:
URL: https://github.com/apache/hudi/pull/10304#discussion_r1425903940

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala: ##
@@ -235,15 +236,11 @@ object DefaultSource {
       Option(schema)
     }
 
-    val useNewParquetFileFormat =
-      parameters.getOrElse(
-        USE_NEW_HUDI_PARQUET_FILE_FORMAT.key,
-        USE_NEW_HUDI_PARQUET_FILE_FORMAT.defaultValue).toBoolean &&
-      !metaClient.isMetadataTable &&
-      (globPaths == null || globPaths.isEmpty) &&
-      parameters.getOrElse(REALTIME_MERGE.key(), REALTIME_MERGE.defaultValue())
-        .equalsIgnoreCase(REALTIME_PAYLOAD_COMBINE_OPT_VAL)
-
+    val useNewParquetFileFormat = parameters.getOrDefault(HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key(),

Review Comment:
Discussed offline. We should keep one config and use `hoodie.file.group.reader.enabled` to control whether new parquet file format is used, along with new file group reader, in Spark. `hoodie.file.group.reader.enabled` should be used for other engines too, to avoid introducing config per engine.
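The single-config approach described in the comment above can be sketched in plain Scala. This is a minimal illustration, not the code from the PR: only the key string `hoodie.file.group.reader.enabled` comes from the thread, while `ReaderFlag`, `fileGroupReaderEnabled`, and the explicit `default` argument are hypothetical names introduced here. The default is passed in because the thread does not state the config's default value.

```scala
// Hypothetical helper: one flag decides whether the new reader path is used.
object ReaderFlag {
  // The key is quoted from the review comment; everything else is an assumption.
  val FileGroupReaderEnabledKey: String = "hoodie.file.group.reader.enabled"

  // Resolve the flag from user-supplied read options.
  // `default` is explicit because the thread does not say what the default is.
  def fileGroupReaderEnabled(parameters: Map[String, String], default: Boolean): Boolean =
    parameters.get(FileGroupReaderEnabledKey)
      .map(_.trim.toBoolean) // throws IllegalArgumentException on non-boolean input
      .getOrElse(default)
}
```

A caller would then branch once on `ReaderFlag.fileGroupReaderEnabled(parameters, default = false)` instead of consulting a separate engine-specific option.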
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
yihua commented on code in PR #10304:
URL: https://github.com/apache/hudi/pull/10304#discussion_r1425900890

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala: ##
@@ -86,15 +86,6 @@ object DataSourceReadOptions {
       s"payload implementation to merge (${REALTIME_PAYLOAD_COMBINE_OPT_VAL}) or skip merging altogether" +
       s"${REALTIME_SKIP_MERGE_OPT_VAL}")
 
-  val USE_NEW_HUDI_PARQUET_FILE_FORMAT: ConfigProperty[String] = ConfigProperty

Review Comment:
OK, `FILE_GROUP_READER_ENABLED` controls it.

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala: ##
@@ -86,15 +86,6 @@ object DataSourceReadOptions {
       s"payload implementation to merge (${REALTIME_PAYLOAD_COMBINE_OPT_VAL}) or skip merging altogether" +
       s"${REALTIME_SKIP_MERGE_OPT_VAL}")
 
-  val USE_NEW_HUDI_PARQUET_FILE_FORMAT: ConfigProperty[String] = ConfigProperty

Review Comment:
OK, `FILE_GROUP_READER_ENABLED` controls the fallback.
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1853217050 ## CI report: * 79da2586916d604900592995b283ce281b0ef2ae Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21475) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
yihua commented on code in PR #10304:
URL: https://github.com/apache/hudi/pull/10304#discussion_r1424808466

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala: ##
@@ -234,19 +232,15 @@ class HoodieMergeOnReadSnapshotHadoopFsRelationFactory(override val sqlContext:
   override def buildFileIndex(): FileIndex = fileIndex
 
   override def buildFileFormat(): FileFormat = {
-    if (fileGroupReaderEnabled) {
-      new HoodieFileGroupReaderBasedParquetFileFormat(
-        tableState, HoodieTableSchema(tableStructSchema, tableAvroSchema.toString, internalSchemaOpt),
-        metaClient.getTableConfig.getTableName, mergeType, mandatoryFields,
-        true, isBootstrap, false, shouldUseRecordPosition, Seq.empty)
-    } else if (metaClient.getTableConfig.isMultipleBaseFileFormatsEnabled && !isBootstrap) {
+    if (metaClient.getTableConfig.isMultipleBaseFileFormatsEnabled && !isBootstrap) {
       new HoodieMultipleBaseFileFormat(sparkSession.sparkContext.broadcast(tableState),
         sparkSession.sparkContext.broadcast(HoodieTableSchema(tableStructSchema, tableAvroSchema.toString, internalSchemaOpt)),
         metaClient.getTableConfig.getTableName, mergeType, mandatoryFields, true, false, Seq.empty)
     } else {
-      new NewHoodieParquetFileFormat(sparkSession.sparkContext.broadcast(tableState),
-        sparkSession.sparkContext.broadcast(HoodieTableSchema(tableStructSchema, tableAvroSchema.toString, internalSchemaOpt)),
-        metaClient.getTableConfig.getTableName, mergeType, mandatoryFields, true, isBootstrap, false, Seq.empty)
+      new HoodieFileGroupReaderBasedParquetFileFormat(

Review Comment:
nit: Can we create a builder to build the file format instance? It's hard to map which values are for what, especially for the booleans.

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala: ##
@@ -259,65 +256,43 @@ object DefaultSource {
         CDCRelation.getCDCRelation(sqlContext, metaClient, parameters)
       }
     } else {
-      lazy val fileFormatUtils = if ((isMultipleBaseFileFormatsEnabled && !isBootstrappedTable)
-        || (useNewParquetFileFormat)) {
-        val formatUtils = new HoodieSparkFileFormatUtils(sqlContext, metaClient, parameters, userSchema)
-        if (formatUtils.hasSchemaOnRead) Option.empty else Some(formatUtils)
-      } else {
-        Option.empty
-      }
-
-      if (isMultipleBaseFileFormatsEnabled) {
-        if (isBootstrappedTable) {
-          throw new HoodieException(s"Multiple base file formats are not supported for bootstrapped table")
-        }
-        resolveMultiFileFormatRelation(tableType, queryType, fileFormatUtils.get)

Review Comment:
How do we support multiple base file formats now?
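The builder requested in the first review comment of the message above might look roughly like the following. This is a sketch under stated assumptions, not Hudi's API: `FileFormatArgs`, `FileFormatArgsBuilder`, and every field and method name are hypothetical, and the thread does not say what each positional boolean in the constructor call means, so the names `isMOR`, `isBootstrap`, `isIncremental`, and `useRecordPosition` are guesses used only to show how named setters remove the ambiguity.

```scala
// Hypothetical stand-in for the constructor arguments; names are assumptions.
final case class FileFormatArgs(
    tableName: String,
    mergeType: String,
    mandatoryFields: Seq[String],
    isMOR: Boolean,
    isBootstrap: Boolean,
    isIncremental: Boolean,
    useRecordPosition: Boolean)

// Immutable builder: each setter returns a modified copy, so every boolean
// is set through a named method instead of a bare positional `true`/`false`.
final case class FileFormatArgsBuilder(
    tableName: Option[String] = None,
    mergeType: Option[String] = None,
    mandatoryFields: Seq[String] = Seq.empty,
    isMOR: Boolean = false,
    isBootstrap: Boolean = false,
    isIncremental: Boolean = false,
    useRecordPosition: Boolean = false) {

  def withTableName(name: String): FileFormatArgsBuilder = copy(tableName = Some(name))
  def withMergeType(value: String): FileFormatArgsBuilder = copy(mergeType = Some(value))
  def withMandatoryFields(fields: Seq[String]): FileFormatArgsBuilder = copy(mandatoryFields = fields)
  def mergeOnRead(enabled: Boolean): FileFormatArgsBuilder = copy(isMOR = enabled)
  def bootstrap(enabled: Boolean): FileFormatArgsBuilder = copy(isBootstrap = enabled)
  def incremental(enabled: Boolean): FileFormatArgsBuilder = copy(isIncremental = enabled)
  def recordPosition(enabled: Boolean): FileFormatArgsBuilder = copy(useRecordPosition = enabled)

  // Fail fast when a required value was never supplied.
  def build(): FileFormatArgs = {
    require(tableName.isDefined, "tableName is required")
    require(mergeType.isDefined, "mergeType is required")
    FileFormatArgs(tableName.get, mergeType.get, mandatoryFields,
      isMOR, isBootstrap, isIncremental, useRecordPosition)
  }
}
```

A call site would then read `FileFormatArgsBuilder().withTableName("trips").withMergeType("payload_combine").mergeOnRead(true).build()`, where the table name and merge type values are placeholders. In Scala, named and default constructor arguments would give much of the same readability without a separate builder class.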
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1853212464 ## CI report: * 79da2586916d604900592995b283ce281b0ef2ae UNKNOWN
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
yihua commented on code in PR #10304:
URL: https://github.com/apache/hudi/pull/10304#discussion_r1424802397

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala: ##
@@ -235,15 +236,11 @@ object DefaultSource {
       Option(schema)
     }
 
-    val useNewParquetFileFormat =
-      parameters.getOrElse(
-        USE_NEW_HUDI_PARQUET_FILE_FORMAT.key,
-        USE_NEW_HUDI_PARQUET_FILE_FORMAT.defaultValue).toBoolean &&
-      !metaClient.isMetadataTable &&
-      (globPaths == null || globPaths.isEmpty) &&
-      parameters.getOrElse(REALTIME_MERGE.key(), REALTIME_MERGE.defaultValue())
-        .equalsIgnoreCase(REALTIME_PAYLOAD_COMBINE_OPT_VAL)
-
+    val useNewParquetFileFormat = parameters.getOrDefault(HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key(),

Review Comment:
Looks like `FILE_GROUP_READER_ENABLED` controls whether the `HadoopFsRelation` with the new file format is used, which is kind of a weird state. Should we keep `hoodie.datasource.read.use.new.parquet.file.format` and, when it's turned on, always use the new file group reader for Spark?

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala: ##
@@ -86,15 +86,6 @@ object DataSourceReadOptions {
       s"payload implementation to merge (${REALTIME_PAYLOAD_COMBINE_OPT_VAL}) or skip merging altogether" +
       s"${REALTIME_SKIP_MERGE_OPT_VAL}")
 
-  val USE_NEW_HUDI_PARQUET_FILE_FORMAT: ConfigProperty[String] = ConfigProperty

Review Comment:
Do we still have the config to fall back to existing relations for reading Hudi tables in Spark?
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
jonvex commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1853105133 @yihua all tests passing, including azure
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1852765884 ## CI report: * 79c9af943ac8928216e8245752517f20893b0b42 UNKNOWN * 79da2586916d604900592995b283ce281b0ef2ae Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21475)
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1852594104 ## CI report: * 79c9af943ac8928216e8245752517f20893b0b42 UNKNOWN * 7ce0e45df128a45407b2747a6d1004036e0d3ee8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21473) * 79da2586916d604900592995b283ce281b0ef2ae Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21475)
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1852532591 ## CI report: * 79c9af943ac8928216e8245752517f20893b0b42 UNKNOWN * 7ce0e45df128a45407b2747a6d1004036e0d3ee8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21473) * 79da2586916d604900592995b283ce281b0ef2ae UNKNOWN
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1852496889 ## CI report: * 79c9af943ac8928216e8245752517f20893b0b42 UNKNOWN * 7ce0e45df128a45407b2747a6d1004036e0d3ee8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21473)
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1852401372 ## CI report: * d858eaac14b3de45d4066165622738d91ff603fe Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21456) * 79c9af943ac8928216e8245752517f20893b0b42 UNKNOWN * 7ce0e45df128a45407b2747a6d1004036e0d3ee8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21473)
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1852320625 ## CI report: * d858eaac14b3de45d4066165622738d91ff603fe Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21456) * 79c9af943ac8928216e8245752517f20893b0b42 UNKNOWN * 7ce0e45df128a45407b2747a6d1004036e0d3ee8 UNKNOWN
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1852305331 ## CI report: * d858eaac14b3de45d4066165622738d91ff603fe Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21456) * 79c9af943ac8928216e8245752517f20893b0b42 UNKNOWN
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1851243823 ## CI report: * d858eaac14b3de45d4066165622738d91ff603fe Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21456)
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1851011432 ## CI report: * d858eaac14b3de45d4066165622738d91ff603fe Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21456)
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1851002365 ## CI report: * d858eaac14b3de45d4066165622738d91ff603fe UNKNOWN
[PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
jonvex opened a new pull request, #10304:
URL: https://github.com/apache/hudi/pull/10304

### Change Logs
Delete NewHoodieParquetFileFormat and all references to it

### Impact
all attention will be on filegroup reader

### Risk level (write none, low medium or high below)
low

### Documentation Update
N/A

### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed