Re: [PR] [HUDI-7224] HoodieSparkSqlWriter metasync success or not show details messages log [hudi]
xuzifu666 commented on code in PR #10314: URL: https://github.com/apache/hudi/pull/10314#discussion_r1423577968

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:

@@ -975,6 +975,13 @@ class HoodieSparkSqlWriterInternal {
     val metaSyncSuccess = metaSync(spark, HoodieWriterUtils.convertMapToHoodieConfig(parameters), tableInstantInfo.basePath, schema)
+    if (metaSyncSuccess) {
+      log.info("metaSyncSuccess " + tableInstantInfo.instantTime + " successful!")
+    }

Review Comment: The problem is that the job would stop without any error or details to dig into the cause; that would take more work without the log. @danny0405

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7224] HoodieSparkSqlWriter metasync success or not show details messages log [hudi]
xuzifu666 closed pull request #10314: [HUDI-7224] HoodieSparkSqlWriter metasync success or not show details messages log URL: https://github.com/apache/hudi/pull/10314
[I] [SUPPORT] how to config hudi table TTL in S3? The table_meta can be separated into a directory? [hudi]
zyclove opened a new issue, #10316: URL: https://github.com/apache/hudi/issues/10316

How can a TTL policy be configured for a Hudi data table? Can the metadata directory (.hoodie) be separated out into its own directory? Then only the data directory needs a TTL, data cleaning can use tiered storage with different life cycles, and historical data can be cleaned automatically by the object storage service at no cost.

E.g.:
* s3://big-data-eu/hudi/data/bi_ods/table_name/dt=20231211/data
  * < 30 days: STANDARD S3; > 30 days: delete by TTL with no cost.
  * or: < 15 days: STANDARD S3; > 15 days: GLACIER_IR; > 105 days: delete by TTL with no cost.
* s3://big-data-eu/hudi/table_meta/bi_ods/table_name/.hoodie

As mentioned above, if many data tables under the data storage share the same storage period, I can just configure the storage period on the directory and rely on the object storage to automatically clean up the historical data at no cost. E.g.:
* s3://big-data-eu/hudi/data/30days/bi_ods/table_name/dt=20231211/data
  * < 30 days: STANDARD S3; > 30 days: delete by TTL with no cost.
* s3://big-data-eu/hudi/data/90days/bi_ods/table_name/dt=20231211/data
  * < 30 days: STANDARD S3; 30 to 90 days: GLACIER_IR; > 90 days: delete by TTL with no cost.
  * or: < 15 days: STANDARD S3; > 15 days: GLACIER_IR; > 105 days: delete by TTL with no cost.
* s3://big-data-eu/hudi/table_meta/bi_ods/table_name/.hoodie
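The tiering described above is object-store configuration rather than a Hudi feature; as a sketch (the prefixes are the issue's examples, and the policy shape follows AWS S3's lifecycle-rule schema), the 30-day and 90-day tiers could be expressed as a lifecycle policy document like this, then applied with boto3's `put_bucket_lifecycle_configuration`:

```python
# Sketch only: builds the S3 lifecycle policy document for the tiers
# described above; nothing here talks to AWS.

def lifecycle_rule(prefix, transitions, expire_days):
    """One lifecycle rule: optional (days, storage_class) transitions,
    then expiration after expire_days."""
    rule = {
        "ID": "ttl-" + prefix.strip("/").replace("/", "-"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": expire_days},
    }
    if transitions:  # S3 rejects an empty Transitions list
        rule["Transitions"] = [
            {"Days": d, "StorageClass": sc} for d, sc in transitions
        ]
    return rule

lifecycle = {
    "Rules": [
        # data kept 30 days in STANDARD, then deleted by TTL
        lifecycle_rule("hudi/data/30days/", [], 30),
        # data tiered to GLACIER_IR after 30 days, deleted after 90
        lifecycle_rule("hudi/data/90days/", [(30, "GLACIER_IR")], 90),
        # deliberately no rule under hudi/table_meta/, so .hoodie never expires
    ]
}
```

Keeping `.hoodie` under a separate `table_meta/` prefix, as the issue proposes, is what would make it possible to scope rules like these to data only.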
Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]
hudi-bot commented on PR #10313: URL: https://github.com/apache/hudi/pull/10313#issuecomment-1851464138

## CI report:
* b9ebe136bdcafc4d5bbd407691f2420ccab45adc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21466)
* 5273d8cc9ed428d2ac6896f52664618ed02c98a1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21468)

Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [I] [SUPPORT] How to skip some partitions in a table when readStreaming in Spark at the init stage [hudi]
lei-su-awx commented on issue #10315: URL: https://github.com/apache/hudi/issues/10315#issuecomment-1851457169

Hi @danny0405, do you mean like this: https://github.com/apache/hudi/assets/19327659/946c95b5-77f5-4d34-a315-a284c6b95b37

I tried this, but it still reads other partitions' data files to resolve the schema. I think this kind of filter takes effect after the source has loaded all partitions' files, but I want a config that tells the source to read only the partitions in my configs.
Re: [PR] [HUDI-7224] HoodieSparkSqlWriter metasync success or not show details messages log [hudi]
danny0405 commented on code in PR #10314: URL: https://github.com/apache/hudi/pull/10314#discussion_r1423561174

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:

@@ -975,6 +975,13 @@ class HoodieSparkSqlWriterInternal {
     val metaSyncSuccess = metaSync(spark, HoodieWriterUtils.convertMapToHoodieConfig(parameters), tableInstantInfo.basePath, schema)
+    if (metaSyncSuccess) {
+      log.info("metaSyncSuccess " + tableInstantInfo.instantTime + " successful!")
+    }

Review Comment: I kind of remember that if the meta sync fails, the whole job just fails.
Re: [PR] [HUDI-7208] Do writing stage should shutdown with error when insert failed to reduce user execute time and show error details [hudi]
xuzifu666 commented on code in PR #10297: URL: https://github.com/apache/hudi/pull/10297#discussion_r1423553849

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:

@@ -508,10 +509,7 @@ protected void doWrite(HoodieRecord record, Schema schema, TypedProperties props
       flushToDiskIfRequired(record, false);
       writeToBuffer(record);
     } catch (Throwable t) {
-      // Not throwing exception from here, since we don't want to fail the entire job
-      // for a single record
-      writeStatus.markFailure(record, t, recordMetadata);
-      LOG.error("Error writing record " + record, t);
+      throw new HoodieInsertException("Error writing record " + record, t);

Review Comment: Spark does not seem to have that config @danny0405
[jira] [Commented] (HUDI-7222) Fix the loose Scala style check
[ https://issues.apache.org/jira/browse/HUDI-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795632#comment-17795632 ]

Lin Liu commented on HUDI-7222:
---

We found that the order of the imports is not enforced and unused imports are not removed. We tried to enforce these rules, which caused hundreds of error messages. We need to understand the rationale for not enforcing these rules originally.

> Fix the loose Scala style check
> ---
>
> Key: HUDI-7222
> URL: https://issues.apache.org/jira/browse/HUDI-7222
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Lin Liu
> Assignee: Lin Liu
> Priority: Major
>
> We have seen that the Scala code style is loose in many places, like malformed imports. We have to fix this kind of problem.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
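One way to enforce import ordering and drop unused imports is the community `OrganizeImports` scalafix rule — an assumption here, since Hudi's build may not wire scalafix in today; note `removeUnused = true` additionally requires compiling with `-Ywarn-unused`. A minimal sketch:

```
// .scalafix.conf (hypothetical placement at the repo root)
rules = [OrganizeImports]
OrganizeImports {
  removeUnused   = true   // drop imports the compiler flags as unused
  groupedImports = Merge  // merge imports from the same package
  groups         = ["java.", "scala.", "*"]  // enforced group order
}
```

Running it once across the modules would surface the hundreds of violations as auto-fixable rewrites instead of style-check errors.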
Re: [I] [SUPPORT] Flink cannot write data to HUDI when set metadata.enabled to true [hudi]
danny0405 commented on issue #10306: URL: https://github.com/apache/hudi/issues/10306#issuecomment-1851449422

Do we have other exceptions? It looks like this exception is not the root cause; some other exception relayed its message to interrupt these tasks.
Re: [PR] [HUDI-6979][RFC-76] support event time based compaction strategy [hudi]
danny0405 commented on code in PR #10266: URL: https://github.com/apache/hudi/pull/10266#discussion_r1423554270

## rfc/rfc-76/rfc-76.md:

@@ -0,0 +1,238 @@
+# RFC-[74]: [support EventTimeBasedCompactionStrategy]
+
+## Proposers
+- @waitingF
+
+## Approvers
+- @
+- @
+
+## Status
+JIRA: [HUDI-6979](https://issues.apache.org/jira/browse/HUDI-6979)
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Currently, to gain low ingestion latency, we can adopt the MergeOnRead table, which supports appending log files and
+compacting log files into base files later. When querying the snapshot table (RT table) generated by MOR, the
+query side has to perform a compaction to get all the data, which is expectedly time-consuming and adds query latency.
+At the same time, Hudi provides the read-optimized table (RO table) for low query latency, just like COW.
+
+But currently there is no compaction strategy based on event time, so there is no data freshness guarantee for the RO table.
+In cases where users want all data before a specified time, they have to query the RT table, with the expected high query latency.

Review Comment: Can you modify the doc based on our new file slicing algorithm?
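The selection the RFC abstract describes can be sketched conceptually (names below are illustrative, not Hudi's `CompactionStrategy` API): compact only the log files whose records all fall at or before a freshness watermark, so the read-optimized view is guaranteed complete up to that watermark.

```python
from dataclasses import dataclass

@dataclass
class LogFile:
    path: str
    max_event_time: int  # max event time across records in this log file

def select_for_compaction(log_files, watermark):
    """Pick log files safe to compact under an event-time watermark:
    everything in them is at or before the watermark, so after compaction
    the read-optimized view is fresh up to `watermark`."""
    return [f for f in log_files if f.max_event_time <= watermark]

files = [LogFile("log.1", 1000), LogFile("log.2", 2000), LogFile("log.3", 3000)]
chosen = select_for_compaction(files, watermark=2000)
print([f.path for f in chosen])  # → ['log.1', 'log.2']
```

A real strategy would also have to interact with file slicing, which is what the review comment asks the doc to address.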
Re: [PR] [HUDI-7208] Do writing stage should shutdown with error when insert failed to reduce user execute time and show error details [hudi]
xuzifu666 commented on code in PR #10297: URL: https://github.com/apache/hudi/pull/10297#discussion_r1423553849

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:

@@ -508,10 +509,7 @@ protected void doWrite(HoodieRecord record, Schema schema, TypedProperties props
       flushToDiskIfRequired(record, false);
       writeToBuffer(record);
     } catch (Throwable t) {
-      // Not throwing exception from here, since we don't want to fail the entire job
-      // for a single record
-      writeStatus.markFailure(record, t, recordMetadata);
-      LOG.error("Error writing record " + record, t);
+      throw new HoodieInsertException("Error writing record " + record, t);

Review Comment: Spark does not seem to have that config.
Re: [I] [SUPPORT] How to skip some partitions in a table when readStreaming in Spark at the init stage [hudi]
danny0405 commented on issue #10315: URL: https://github.com/apache/hudi/issues/10315#issuecomment-1851443273

Did you try adding a filter condition on the partition fields?
Re: [PR] [HUDI-7208] Do writing stage should shutdown with error when insert failed to reduce user execute time and show error details [hudi]
danny0405 commented on code in PR #10297: URL: https://github.com/apache/hudi/pull/10297#discussion_r1423550203

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:

@@ -508,10 +509,7 @@ protected void doWrite(HoodieRecord record, Schema schema, TypedProperties props
       flushToDiskIfRequired(record, false);
       writeToBuffer(record);
     } catch (Throwable t) {
-      // Not throwing exception from here, since we don't want to fail the entire job
-      // for a single record
-      writeStatus.markFailure(record, t, recordMetadata);
-      LOG.error("Error writing record " + record, t);
+      throw new HoodieInsertException("Error writing record " + record, t);

Review Comment: Does Spark have a similar option?
Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]
danny0405 commented on code in PR #10313: URL: https://github.com/apache/hudi/pull/10313#discussion_r1423548002

## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java:

@@ -86,7 +86,7 @@ private static Configuration addProjectionField(Configuration conf, String field
   public static void addProjectionField(Configuration conf, String[] fieldName) {
     if (fieldName.length > 0) {
-      List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList());
+      List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS, "").split(",")).collect(Collectors.toList());
       Arrays.stream(fieldName).forEach(field -> {

Review Comment: When is `LIST_COLUMNS` set, and when not?
Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]
hudi-bot commented on PR #10313: URL: https://github.com/apache/hudi/pull/10313#issuecomment-1851403241

## CI report:
* b9ebe136bdcafc4d5bbd407691f2420ccab45adc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21466)
* 5273d8cc9ed428d2ac6896f52664618ed02c98a1 UNKNOWN
Re: [PR] [HUDI-7224] HoodieSparkSqlWriter metasync success or not show details messages log [hudi]
hudi-bot commented on PR #10314: URL: https://github.com/apache/hudi/pull/10314#issuecomment-1851396191

## CI report:
* 88b9f8d9518f5afd376479ba9c87a8dd30170ffc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21467)
Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]
hudi-bot commented on PR #10313: URL: https://github.com/apache/hudi/pull/10313#issuecomment-1851396149

## CI report:
* b9ebe136bdcafc4d5bbd407691f2420ccab45adc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21466)
Re: [PR] [HUDI-7132] Data may be lost for flink task failure [hudi]
hudi-bot commented on PR #10312: URL: https://github.com/apache/hudi/pull/10312#issuecomment-1851396107

## CI report:
* 5c971e1a0cafb635ad9cfed0f452751314bdb21c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21465)
Re: [PR] [HUDI-7224] HoodieSparkSqlWriter metasync success or not show details messages log [hudi]
hudi-bot commented on PR #10314: URL: https://github.com/apache/hudi/pull/10314#issuecomment-1851388190

## CI report:
* 88b9f8d9518f5afd376479ba9c87a8dd30170ffc UNKNOWN
Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]
hudi-bot commented on PR #10313: URL: https://github.com/apache/hudi/pull/10313#issuecomment-1851388151

## CI report:
* b9ebe136bdcafc4d5bbd407691f2420ccab45adc UNKNOWN
Re: [PR] [HUDI-7132] Data may be lost for flink task failure [hudi]
hudi-bot commented on PR #10312: URL: https://github.com/apache/hudi/pull/10312#issuecomment-1851388109

## CI report:
* 5c971e1a0cafb635ad9cfed0f452751314bdb21c UNKNOWN
[I] [SUPPORT] How to skip some partitions in a table when readStreaming in Spark at the init stage [hudi]
lei-su-awx opened a new issue, #10315: URL: https://github.com/apache/hudi/issues/10315

**Describe the problem you faced**

I have a table partitioned by operation type and ingestion date (like insert/2023-12-11/, update/2023-12-11/, delete/2023-12-11/). When I read this table with spark readStream, I only want to read data under the `update` partition. I found the config 'hoodie.datasource.read.incr.path.glob' and set it to `update/202*`. But the Spark job initializes very slowly, and I found it was stuck: https://github.com/apache/hudi/assets/19327659/febba726-b441-4b01-abb0-2e12f8bc62d7 That parquet file is not under the `update` partition; it is under the `insert` partition, which is very confusing. So I want to ask: is there a config that reads only the target partition, skips the others, and also does not read the other partitions' data files to get the schema?

**Expected behavior**

A config that reads only the target partition, skips the others, and does not read the other partitions' data files to get the schema.

**Environment Description**

* Hudi version : 0.14.0
* Spark version : 3.4.1
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : GCS
* Running on Docker? (yes/no) : no

**Additional context**

I use the below configurations to write to the table:

    hudi_write_options = {
        'hoodie.table.name': hudi_table_name,
        'hoodie.datasource.write.partitionpath.field': 'operation_type, ingestion_dt',
        'hoodie.datasource.write.operation': 'insert',
        'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
        'hoodie.parquet.compression.codec': 'zstd',
        'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
        'hoodie.datasource.write.reconcile.schema': True,
        'hoodie.metadata.enable': True
    }

And these configurations to read from the table:

    read_streaming_hudi_options = {
        'maxFilesPerTrigger': 5,
        'hoodie.datasource.read.incr.path.glob': 'update/202*',
        'hoodie.read.timeline.holes.resolution.policy': 'BLOCK',
        'hoodie.datasource.read.file.index.listing.partition-path-prefix.analysis.enabled': False,
        'hoodie.file.listing.parallelism': 1000,
        'hoodie.metadata.enable': True,
        'hoodie.datasource.read.schema.use.end.instanttime': True,
        'hoodie.datasource.streaming.startOffset': '202312110'
    }
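What the user is asking for — pruning at listing time, so files under other partitions are never opened, not even to resolve the schema — can be illustrated with plain glob matching over partition paths. This is a conceptual sketch, not Hudi's implementation of `hoodie.datasource.read.incr.path.glob`:

```python
from fnmatch import fnmatch

# Hypothetical partition paths mirroring the issue's layout.
partitions = [
    "insert/2023-12-11",
    "update/2023-12-11",
    "update/2023-12-10",
    "delete/2023-12-11",
]

def prune(partitions, glob):
    """Keep only partitions matching the glob; files under the pruned
    partitions would never be listed or opened."""
    return [p for p in partitions if fnmatch(p, glob)]

print(prune(partitions, "update/202*"))  # → ['update/2023-12-11', 'update/2023-12-10']
```

The issue reports that the actual glob config filters later, after files from all partitions have already been touched, which is why the job still reads `insert` partition files.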
Re: [PR] [HUDI-7208] Do writing stage should shutdown with error when insert failed to reduce user execute time and show error details [hudi]
hudi-bot commented on PR #10297: URL: https://github.com/apache/hudi/pull/10297#issuecomment-1851381361

## CI report:
* 68f3125e06ebb01154d659c2b4452c7ac4c5aa25 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21461)
[jira] [Updated] (HUDI-7211) Relax need of ordering/precombine field for tables with autogenerated record keys for DeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Goenka updated HUDI-7211:
---
Description:

[https://github.com/apache/hudi/issues/10233]

```
NOW=$(date '+%Y%m%dt%H%M%S')
${SPARK_HOME}/bin/spark-submit \
  --jars ${path_prefix}/jars/${SPARK_V}/hudi-spark${SPARK_VERSION}-bundle_2.12-${HUDI_VERSION}.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  ${path_prefix}/jars/${SPARK_V}/hudi-utilities-slim-bundle_2.12-${HUDI_VERSION}.jar \
  --target-base-path ${path_prefix}/testcases/stocks/data/target/${NOW} \
  --target-table stocks${NOW} \
  --table-type COPY_ON_WRITE \
  --base-file-format PARQUET \
  --props ${path_prefix}/testcases/stocks/configs/hoodie.properties \
  --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=${path_prefix}/testcases/stocks/data/schema_without_ts.avsc \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=${path_prefix}/testcases/stocks/data/schema_without_ts.avsc \
  --op UPSERT \
  --spark-master yarn \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=${path_prefix}/testcases/stocks/data/source_without_ts \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
  --hoodie-conf hoodie.datasource.write.keygenerator.type=SIMPLE \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=false \
  --hoodie-conf hoodie.metadata.enable=true
```

was:
[https://github.com/apache/hudi/issues/10233]
Reproducible code - https://github.com/apache/hudi/issues/10233#issuecomment-1849561433

> Relax need of ordering/precombine field for tables with autogenerated record keys for DeltaStreamer
> ---
>
> Key: HUDI-7211
> URL: https://issues.apache.org/jira/browse/HUDI-7211
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: Aditya Goenka
> Priority: Critical
> Fix For: 0.14.1
>
> [https://github.com/apache/hudi/issues/10233]
[jira] [Updated] (HUDI-7211) Relax need of ordering/precombine field for tables with autogenerated record keys for DeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Goenka updated HUDI-7211:
---
Summary: Relax need of ordering/precombine field for tables with autogenerated record keys for DeltaStreamer (was: Relax need of precombine key for tables with autogenerated record keys)

> Relax need of ordering/precombine field for tables with autogenerated record keys for DeltaStreamer
> ---
>
> Key: HUDI-7211
> URL: https://issues.apache.org/jira/browse/HUDI-7211
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: Aditya Goenka
> Priority: Critical
> Fix For: 0.14.1
>
> [https://github.com/apache/hudi/issues/10233]
>
> Reproducible code - https://github.com/apache/hudi/issues/10233#issuecomment-1849561433
Re: [I] [SUPPORT] restart flink job got InvalidAvroMagicException: Not an Avro data file [hudi]
zlinsc commented on issue #10285: URL: https://github.com/apache/hudi/issues/10285#issuecomment-1851376859 > see the fix: https://github.com/apache/hudi/pull/10298/files thx @danny0405
Re: [I] The Schema Evolution Not working For Hudi 0.12.3 [hudi]
ad1happy2go commented on issue #10309: URL: https://github.com/apache/hudi/issues/10309#issuecomment-1851374103

@Amar1404 Can you give more details, like the table/writer configurations you are using? I tried a simple scenario and schema evolution from long to double works fine.

```
schema1 = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("value", LongType(), True)
    ]
)
schema2 = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("value", DoubleType(), True)
    ]
)
data1 = [Row(1, 100), Row(2, 100), Row(3, 100)]
data2 = [Row(1, 100.1), Row(2, 200.2), Row(3, 100.0)]
hudi_configs = {
    "hoodie.table.name": TABLE_NAME,
    "hoodie.datasource.write.precombine.field": "value",
    "hoodie.datasource.write.recordkey.field": "id"
}
df = spark.createDataFrame(spark.sparkContext.parallelize(data1), schema1)
df.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)
spark.read.format("org.apache.hudi").load(PATH).printSchema()
spark.read.format("org.apache.hudi").load(PATH).show()
df = spark.createDataFrame(spark.sparkContext.parallelize(data2), schema2)
df.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)
spark.read.format("org.apache.hudi").load(PATH).printSchema()
spark.read.format("org.apache.hudi").load(PATH).show()
```
[jira] [Updated] (HUDI-7224) HoodieSparkSqlWriter metasync success or not show details messages log
[ https://issues.apache.org/jira/browse/HUDI-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xy updated HUDI-7224:
---
Description: HoodieSparkSqlWriter meta sync should log detailed messages whether it succeeds or not; currently only commit success is logged.

> HoodieSparkSqlWriter metasync success or not show details messages log
> ---
>
> Key: HUDI-7224
> URL: https://issues.apache.org/jira/browse/HUDI-7224
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark-sql
> Reporter: xy
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.14.1
>
> HoodieSparkSqlWriter meta sync should log detailed messages whether it succeeds or not; currently only commit success is logged.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7224) HoodieSparkSqlWriter metasync success or not show details messages log
[ https://issues.apache.org/jira/browse/HUDI-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7224:
- Labels: pull-request-available (was: )

> Key: HUDI-7224
> URL: https://issues.apache.org/jira/browse/HUDI-7224
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark-sql
> Reporter: xy
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.14.1
[PR] [HUDI-7224] HoodieSparkSqlWriter metasync success or not show details messages log [hudi]
xuzifu666 opened a new pull request, #10314: URL: https://github.com/apache/hudi/pull/10314

### Change Logs

HoodieSparkSqlWriter now logs detailed messages on meta sync success or failure, to help users confirm errors during the Spark SQL write process.

### Impact

none

### Risk level (write none, low medium or high below)

none

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
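The diff under review only logs the success branch; logging both outcomes is what actually helps diagnose a job that stops silently. A minimal sketch of the intended logging pattern, in Python for brevity (the real change lives in Scala inside `HoodieSparkSqlWriter`; the function name and message text here are illustrative, not the PR's code):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("HoodieSparkSqlWriter")

def report_meta_sync(meta_sync_success: bool, instant_time: str) -> str:
    """Log the meta sync outcome for a given commit instant.

    Covering both branches is the point of the change: a failed sync
    previously left no trace in the logs to dig into.
    """
    if meta_sync_success:
        msg = f"Meta sync for instant {instant_time} succeeded"
        log.info(msg)
    else:
        msg = f"Meta sync for instant {instant_time} failed"
        log.warning(msg)
    return msg
```

Returning the message is only for testability; the Scala change would call `log.info`/`log.warn` in place.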
[jira] [Updated] (HUDI-7224) HoodieSparkSqlWriter metasync success or not show details messages log
[ https://issues.apache.org/jira/browse/HUDI-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xy updated HUDI-7224:
- Fix Version/s: 0.14.1

> Key: HUDI-7224
> URL: https://issues.apache.org/jira/browse/HUDI-7224
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark-sql
> Reporter: xy
> Priority: Major
> Fix For: 0.14.1
[jira] [Created] (HUDI-7224) HoodieSparkSqlWriter metasync success or not show details messages log
xy created HUDI-7224:

> Summary: HoodieSparkSqlWriter metasync success or not show details messages log
> Key: HUDI-7224
> URL: https://issues.apache.org/jira/browse/HUDI-7224
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: xy
[jira] [Updated] (HUDI-7224) HoodieSparkSqlWriter metasync success or not show details messages log
[ https://issues.apache.org/jira/browse/HUDI-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xy updated HUDI-7224:
- Component/s: spark-sql

> Key: HUDI-7224
> URL: https://issues.apache.org/jira/browse/HUDI-7224
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark-sql
> Reporter: xy
> Priority: Major
[PR] NPE fix while adding projection field & added its test cases [hudi]
prathit06 opened a new pull request, #10313: URL: https://github.com/apache/hudi/pull/10313

### Change Logs

While using `HoodieParquetInputFormat` to read a partitioned Hudi table from Flink, we encountered the exception below:

```
java.lang.NullPointerException
    at org.apache.hudi.hadoop.utils.HoodieRealtimeInputFormatUtils.addProjectionField(HoodieRealtimeInputFormatUtils.java:88)
    at org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:90)
    at org.apache.flink.api.java.hadoop.mapred.HadoopInputFormatBase.open(HadoopInputFormatBase.java:176)
    at org.apache.flink.api.java.hadoop.mapred.HadoopInputFormatBase.open(HadoopInputFormatBase.java:58)
    at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:84)
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
    at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
```

Table type : cow & insert_overwrite
Hudi version : 0.13.1
Flink version : 1.16.0

This PR fixes the NPE and also adds a simple test case in `HoodieRealtimeInputFormatUtils` for any further testing.

### Impact

NA

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
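The NPE originates in `addProjectionField`, where a field-position lookup can come back empty for partitioned reads. A Python model of the defensive pattern such a fix typically applies (illustrative only; the actual fix is in Java, and the function and mapping below are assumptions, not the PR's code):

```python
def add_projection_fields(read_columns, field_positions):
    """Append projection fields, skipping names with no known position.

    field_positions maps column name -> ordinal; a None value models
    the absent lookup result that raised the NPE in
    HoodieRealtimeInputFormatUtils.addProjectionField.
    """
    projected = list(read_columns)
    for name, pos in field_positions.items():
        if pos is None:
            # Guard: unknown position -- skip instead of dereferencing.
            continue
        if name not in projected:
            projected.append(name)
    return projected
```

The design choice is the usual one for NPE fixes: validate the lookup before use rather than letting the reader crash mid-open.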
[jira] [Updated] (HUDI-7132) Data may be lost in Flink checkpoint
[ https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7132:
- Labels: pull-request-available (was: )

> Key: HUDI-7132
> URL: https://issues.apache.org/jira/browse/HUDI-7132
> Project: Apache Hudi
> Issue Type: Bug
> Components: flink
> Affects Versions: 0.13.1, 0.14.0
> Reporter: Bo Cui
> Priority: Major
> Labels: pull-request-available
>
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L524C23-L524C35
> Before this line runs, eventBuffer may be updated by `subtaskFailed`, and some elements of eventBuffer are null.
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L305C10-L305C21
[PR] [HUDI-7132] Data may be lost in Flink checkpoint [hudi]
danny0405 opened a new pull request, #10312: URL: https://github.com/apache/hudi/pull/10312

### Change Logs

See the context in https://issues.apache.org/jira/browse/HUDI-7132. There is no need to clean the events on a task failure signal; the events are overridden when new events are received.

### Impact

Fixes the potential data loss.

### Risk level (write none, low medium or high below)

none

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
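The reasoning behind the fix — a subtask's slot in the coordinator's event buffer should be overwritten by the retried subtask's fresh event rather than nulled out on the failure signal, so the commit path never observes a surprise null — can be sketched with a simplified model. This is Python and purely illustrative; the real component is `StreamWriteOperatorCoordinator` in Java, and the class below is an assumption about its buffer semantics, not its code:

```python
class WriteCoordinatorBuffer:
    """Simplified model of the coordinator's per-subtask event buffer."""

    def __init__(self, parallelism: int):
        # One slot per write subtask; None means "no event received yet".
        self.events = [None] * parallelism

    def on_event(self, subtask_id: int, event):
        # New events override old ones -- no explicit cleanup on failure
        # is needed, which is the point of the fix.
        self.events[subtask_id] = event

    def ready_to_commit(self) -> bool:
        # The commit path only proceeds once every slot is populated.
        return all(e is not None for e in self.events)
```

Under this model, clearing a slot on `subtaskFailed` between event receipt and commit is exactly what could drop data; letting the retry overwrite the slot avoids the race.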
Re: [PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]
hudi-bot commented on PR #10308: URL: https://github.com/apache/hudi/pull/10308#issuecomment-1851342245

## CI report:

* 737e09fc37912e88f640393b11357cb8b27a29c5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21464)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]
hudi-bot commented on PR #10308: URL: https://github.com/apache/hudi/pull/10308#issuecomment-1851336257

## CI report:

* 737e09fc37912e88f640393b11357cb8b27a29c5 UNKNOWN

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7223] Cleaner KEEP_LATEST_BY_HOURS should retain latest commit before earliest commit to retain [hudi]
hudi-bot commented on PR #10307: URL: https://github.com/apache/hudi/pull/10307#issuecomment-1851330034

## CI report:

* 8b67cc8faf3a4e76866bed27c67ab8687eff5c40 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21462)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [I] Incoming batch schema is not compatible with the table's one [hudi]
njalan commented on issue #9980: URL: https://github.com/apache/hudi/issues/9980#issuecomment-1851329200

@ad1happy2go I just raised PR https://github.com/apache/hudi/pull/10308. Could you please kindly review it? It is my first PR for Hudi, and I am not sure whether it can be merged.
Re: [I] The Schema Evolution Not working For Hudi 0.12.3 [hudi]
Amar1404 closed issue #10310: The Schema Evolution Not working For Hudi 0.12.3 URL: https://github.com/apache/hudi/issues/10310
Re: [I] [SUPPORT] [hudi]
Amar1404 closed issue #10311: [SUPPORT] URL: https://github.com/apache/hudi/issues/10311
[I] [SUPPORT] [hudi]
Amar1404 opened a new issue, #10311: URL: https://github.com/apache/hudi/issues/10311

**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org.
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

A clear and concise description of the problem.

**To Reproduce**

Steps to reproduce the behavior:
1.
2.
3.
4.

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version :
* Spark version :
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) :
* Running on Docker? (yes/no) :

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```Add the stacktrace of the error.```
[I] The Schema Evolution Not working For Hudi 0.12.3 [hudi]
Amar1404 opened a new issue, #10310: URL: https://github.com/apache/hudi/issues/10310

**Describe the problem you faced**

I have a column that was earlier a LONG and was changed to DOUBLE. Schema evolution is not working: the data is written, but while reading the table the old parquet data file cannot be read, throwing **ERROR: the parquet file could not be read since the column expects double but got INT64**.

**To Reproduce**

Steps to reproduce the behavior:
1. I have a column
2.
3.
4.

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.12.3
* Spark version : 3.3
* Hive version : 3
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : EMR

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```Add the stacktrace of the error.```
[I] The Schema Evolution Not working For Hudi 0.12.3 [hudi]
Amar1404 opened a new issue, #10309: URL: https://github.com/apache/hudi/issues/10309

**Describe the problem you faced**

I have a column that was earlier a LONG and was changed to DOUBLE. Schema evolution is not working: the data is written, but while reading the table the old parquet data file cannot be read, throwing **ERROR: the parquet file could not be read since the column expects double but got INT64**.

**To Reproduce**

Steps to reproduce the behavior:
1. I have a column
2.
3.
4.

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.12.3
* Spark version : 3.3
* Hive version : 3
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : EMR

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```Add the stacktrace of the error.```
[PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]
njalan opened a new pull request, #10308: URL: https://github.com/apache/hudi/pull/10308

### Change Logs

Use the meta sync database to fill hoodie.table.name if it is not set.

### Impact

Fixes the issue where two tables with the same name under the Hive 'default' schema cause "Incoming batch schema is not compatible with the table's one".

### Risk level (write none, low medium or high below)

low

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
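The proposed defaulting can be sketched as follows. This is a hedged Python model under the assumption that the fallback combines the sync database with the sync table name; the config keys mirror real Hudi options, but the exact behavior of the PR may differ:

```python
def resolve_table_name(params: dict) -> str:
    """Fall back to a database-qualified name when hoodie.table.name
    is not set, so two tables with the same short name in different
    databases no longer collide on the 'default' schema.
    """
    explicit = params.get("hoodie.table.name")
    if explicit:
        return explicit
    db = params.get("hoodie.datasource.hive_sync.database", "default")
    table = params.get("hoodie.datasource.hive_sync.table")
    if not table:
        raise ValueError("no table name configured")
    return f"{db}.{table}"
```

An explicitly configured name always wins; only the unset case falls through to the sync-config derivation.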
[jira] [Updated] (HUDI-7132) Data may be lost in Flink checkpoint
[ https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7132:
- Summary: Data may be lost in Flink checkpoint (was: Data may be lost in flink#chk)

> Key: HUDI-7132
> URL: https://issues.apache.org/jira/browse/HUDI-7132
> Project: Apache Hudi
> Issue Type: Bug
> Components: flink
> Affects Versions: 0.13.1, 0.14.0
> Reporter: Bo Cui
> Priority: Major
>
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L524C23-L524C35
> Before this line runs, eventBuffer may be updated by `subtaskFailed`, and some elements of eventBuffer are null.
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L305C10-L305C21
[jira] [Closed] (HUDI-6658) Implement MOR Incremental for new file format
[ https://issues.apache.org/jira/browse/HUDI-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit closed HUDI-6658.
- Resolution: Done

> Key: HUDI-6658
> URL: https://issues.apache.org/jira/browse/HUDI-6658
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Implement MOR Incremental for new file format
(hudi) branch master updated: [HUDI-6658] Inject filters for incremental query (#10225)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new cacbb82254c [HUDI-6658] Inject filters for incremental query (#10225) cacbb82254c is described below commit cacbb82254c840b96f879c8a577a6e91aff3f57e Author: Jon Vexler AuthorDate: Tue Dec 12 00:08:23 2023 -0500 [HUDI-6658] Inject filters for incremental query (#10225) Add incremental filters to the query plan Also fix some tests that use partition path as the precombine field Incremental queries will now work the new filegroup reader --- .../spark/sql/HoodieCatalystPlansUtils.scala | 11 +- .../main/scala/org/apache/hudi/DefaultSource.scala | 14 +- .../scala/org/apache/hudi/HoodieCDCFileIndex.scala | 5 + .../hudi/HoodieHadoopFsRelationFactory.scala | 6 +- .../apache/hudi/HoodieIncrementalFileIndex.scala | 2 +- .../sql/FileFormatUtilsForFileGroupReader.scala| 128 +++ ...odieFileGroupReaderBasedParquetFileFormat.scala | 36 ++- .../parquet/NewHoodieParquetFileFormat.scala | 2 + .../spark/sql/hudi/analysis/HoodieAnalysis.scala | 2 +- .../TestGlobalIndexEnableUpdatePartitions.java | 6 + .../TestIncrementalReadWithFullTableScan.scala | 27 +-- .../hudi/functional/TestSparkDataSource.scala | 4 + .../hudi/procedure/TestClusteringProcedure.scala | 255 +++-- .../procedure/TestHoodieLogFileProcedure.scala | 11 +- .../spark/sql/HoodieSpark2CatalystPlanUtils.scala | 27 ++- .../spark/sql/HoodieSpark3CatalystPlanUtils.scala | 25 +- .../spark/sql/HoodieSpark30CatalystPlanUtils.scala | 4 +- .../spark/sql/HoodieSpark31CatalystPlanUtils.scala | 4 +- .../spark/sql/HoodieSpark32CatalystPlanUtils.scala | 4 +- .../spark/sql/HoodieSpark33CatalystPlanUtils.scala | 4 +- .../spark/sql/HoodieSpark34CatalystPlanUtils.scala | 4 +- .../spark/sql/HoodieSpark35CatalystPlanUtils.scala | 4 +- .../deltastreamer/TestHoodieDeltaStreamer.java | 4 +- 
.../utilities/sources/TestHoodieIncrSource.java| 6 + 24 files changed, 372 insertions(+), 223 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala index 64ee645ba0f..b9110f1ed93 100644 --- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala +++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala @@ -102,9 +102,11 @@ trait HoodieCatalystPlansUtils { * Spark requires file formats to append the partition path fields to the end of the schema. * For tables where the partition path fields are not at the end of the schema, we don't want * to return the schema in the wrong order when they do a query like "select *". To fix this - * behavior, we apply a projection onto FileScan when the file format is NewHudiParquetFileFormat + * behavior, we apply a projection onto FileScan when the file format has HoodieFormatTrait + * + * Additionally, incremental queries require filters to be added to the plan */ - def applyNewHoodieParquetFileFormatProjection(plan: LogicalPlan): LogicalPlan + def maybeApplyForNewFileFormat(plan: LogicalPlan): LogicalPlan /** * Decomposes [[InsertIntoStatement]] into its arguments allowing to accommodate for API @@ -140,4 +142,9 @@ trait HoodieCatalystPlansUtils { def failAnalysisForMIT(a: Attribute, cols: String): Unit = {} def createMITJoin(left: LogicalPlan, right: LogicalPlan, joinType: JoinType, condition: Option[Expression], hint: String): LogicalPlan + + /** + * true if both plans produce the same attributes in the the same order + */ + def produceSameOutput(a: LogicalPlan, b: LogicalPlan): Boolean } diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala index 168502b3f08..ac8286b1bde 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala @@ -285,7 +285,12 @@ object DefaultSource { resolveBaseFileOnlyRelation(sqlContext, globPaths, userSchema, metaClient, parameters) } case (COPY_ON_WRITE, QUERY_TYPE_INCREMENTAL_OPT_VAL, _) => - new IncrementalRelation(sqlContext, parameters, userSchema, metaClient) + if (fileFormatUtils.isDefined) { +new HoodieCopyOnWriteIncrementalHadoopFsRelationFactory( + sqlContext, metaClient, parameters, userSchema, isBootstrappedTable).build() + }
Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]
codope merged PR #10225: URL: https://github.com/apache/hudi/pull/10225
[jira] [Updated] (HUDI-6658) Implement MOR Incremental for new file format
[ https://issues.apache.org/jira/browse/HUDI-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit updated HUDI-6658:
- Fix Version/s: 1.0.0

> Key: HUDI-6658
> URL: https://issues.apache.org/jira/browse/HUDI-6658
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Implement MOR Incremental for new file format
[jira] [Updated] (HUDI-7132) Data may be lost in flink#chk
[ https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7132:
- Affects Version/s: (was: 1.1.0)

> Key: HUDI-7132
> URL: https://issues.apache.org/jira/browse/HUDI-7132
> Project: Apache Hudi
> Issue Type: Bug
> Components: flink
> Affects Versions: 0.13.1, 0.14.0
> Reporter: Bo Cui
> Priority: Major
>
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L524C23-L524C35
> Before this line runs, eventBuffer may be updated by `subtaskFailed`, and some elements of eventBuffer are null.
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L305C10-L305C21
[jira] [Commented] (HUDI-7132) Data may be lost in flink#chk
[ https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795589#comment-17795589 ]

Danny Chen commented on HUDI-7132:
- Did you try this fix: https://github.com/apache/hudi/pull/9867/files

> Key: HUDI-7132
> URL: https://issues.apache.org/jira/browse/HUDI-7132
> Project: Apache Hudi
> Issue Type: Bug
> Components: flink
> Affects Versions: 0.13.1, 0.14.0
> Reporter: Bo Cui
> Priority: Major
>
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L524C23-L524C35
> Before this line runs, eventBuffer may be updated by `subtaskFailed`, and some elements of eventBuffer are null.
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L305C10-L305C21
[jira] [Updated] (HUDI-7132) Data may be lost in flink#chk
[ https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7132:
- Affects Version/s: 0.14.0, 0.13.1

> Key: HUDI-7132
> URL: https://issues.apache.org/jira/browse/HUDI-7132
> Project: Apache Hudi
> Issue Type: Bug
> Components: flink
> Affects Versions: 0.13.1, 0.14.0, 1.1.0
> Reporter: Bo Cui
> Priority: Major
>
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L524C23-L524C35
> Before this line runs, eventBuffer may be updated by `subtaskFailed`, and some elements of eventBuffer are null.
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L305C10-L305C21
Re: [I] [SUPPORT] CoW: Hudi Upsert not working when there is a timestamp field in the composite key [hudi]
ad1happy2go commented on issue #10303: URL: https://github.com/apache/hudi/issues/10303#issuecomment-1851290166

@srinikandi I see a fix (https://github.com/apache/hudi/pull/4201) was tried but then reverted due to another issue. Will look into it. Thanks for raising this again.
Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]
hudi-bot commented on PR #10225: URL: https://github.com/apache/hudi/pull/10225#issuecomment-1851286731

## CI report:

* 8cc63ddef9ccc12489d0461f1c928e5c579b9277 UNKNOWN
* bf5722620a6d6c4ab30944c2ebec3d17cd65f625 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21458)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7176] Add file group reader test framework [hudi]
yihua commented on code in PR #10263: URL: https://github.com/apache/hudi/pull/10263#discussion_r1423390827 ## hudi-common/src/test/java/org/apache/hudi/common/testutils/reader/DataGenerationPlan.java: ## @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.common.testutils.reader; + +import java.util.List; + +/** + * The blueprint of records that will be generated + * by the data generator. + * + * Current limitations: + * 1. One plan generates one file, either a base file, or a log file. + * 2. One file contains one operation, e.g., insert, delete, or update. + */ +public class DataGenerationPlan { Review Comment: Is this going to be integrated with `HoodieTestTable` or similar to generate test tables without using transactional writes? ## hudi-common/src/test/java/org/apache/hudi/common/testutils/reader/HoodieTestReaderContext.java: ## @@ -0,0 +1,163 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.common.testutils.reader; + +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.engine.HoodieReaderContext; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieAvroIndexedRecord; +import org.apache.hudi.common.model.HoodieAvroRecordMerger; +import org.apache.hudi.common.model.HoodieEmptyRecord; +import org.apache.hudi.common.model.HoodieKey; +import org.apache.hudi.common.model.HoodieOperation; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.util.ConfigUtils; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.collection.ClosableIterator; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.io.storage.HoodieAvroParquetReader; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericRecordBuilder; +import org.apache.avro.generic.IndexedRecord; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +import java.io.IOException; +import java.util.Map; +import java.util.function.UnaryOperator; + +import static org.apache.hudi.common.model.HoodieRecordMerger.DEFAULT_MERGER_STRATEGY_UUID; +import static 
org.apache.hudi.common.testutils.reader.HoodieFileSliceTestUtils.ROW_KEY; + +public class HoodieTestReaderContext extends HoodieReaderContext<IndexedRecord> { + @Override + public FileSystem getFs(String path, Configuration conf) { +return FSUtils.getFs(path, conf); + } + + @Override + public ClosableIterator<IndexedRecord> getFileRecordIterator( + Path filePath, + long start, + long length, + Schema dataSchema, + Schema requiredSchema, + Configuration conf + ) throws IOException { +HoodieAvroParquetReader reader = new HoodieAvroParquetReader(conf, filePath); +return reader.getIndexedRecordIterator(dataSchema, requiredSchema); + } + + @Override + public IndexedRecord convertAvroRecord(IndexedRecord record) { +return record; + } + + @Override + public HoodieRecordMerger getRecordMerger(String mergerStrategy) { +switch (mergerStrategy) { + case DEFAULT_MERGER_STRATEGY_UUID: +return new HoodieAvroRecordMerger(); + default: +throw new HoodieException( +"The merger strategy UUID is not supported: " + mergerStrategy); +
Re: [PR] [HUDI-7223] Cleaner KEEP_LATEST_BY_HOURS should retain latest commit before earliest commit to retain [hudi]
hudi-bot commented on PR #10307: URL: https://github.com/apache/hudi/pull/10307#issuecomment-1851281091 ## CI report: * 8b67cc8faf3a4e76866bed27c67ab8687eff5c40 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21462)
Re: [PR] [HUDI-7223] Cleaner KEEP_LATEST_BY_HOURS should retain latest commit before earliest commit to retain [hudi]
hudi-bot commented on PR #10307: URL: https://github.com/apache/hudi/pull/10307#issuecomment-1851252749 ## CI report: * 8b67cc8faf3a4e76866bed27c67ab8687eff5c40 UNKNOWN
Re: [PR] [HUDI-7223] Cleaner KEEP_LATEST_BY_HOURS should retain latest commit before earliest commit to retain [hudi]
the-other-tim-brown commented on code in PR #10307: URL: https://github.com/apache/hudi/pull/10307#discussion_r1423363621 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java: ## @@ -351,23 +351,15 @@ private Pair> getFilesToCleanKeepingLatestCommits(S continue; } - if (policy == HoodieCleaningPolicy.KEEP_LATEST_COMMITS) { + if (policy == HoodieCleaningPolicy.KEEP_LATEST_COMMITS || policy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) { Review Comment: If we agree this is the correct behavior, I can update or add unit tests
[PR] [HUDI-7223] Cleaner KEEP_LATEST_BY_HOURS should retain latest commit before earliest commit to retain [hudi]
the-other-tim-brown opened a new pull request, #10307: URL: https://github.com/apache/hudi/pull/10307 ### Change Logs The `KEEP_LATEST_BY_HOURS` cleaner policy will find the earliest time to retain and then find the earliest commit that is greater than or equal to that time: ``` String earliestTimeToRetain = HoodieActiveTimeline.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant())); earliestCommitToRetain = Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i -> HoodieTimeline.compareTimestamps(i.getTimestamp(), HoodieTimeline.GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst()); ``` In the CleanPlanner, any file slice before this `earliestCommitToRetain` will be removed. This can result in deleting files that were still relevant N hours ago and users will not be able to query the table as of that time. For example, imagine you have configured keep latest 24 hours. You have a new base file written for a file group at time `NOW-48hr`. An update comes in at time `NOW-8hr` and does not touch this file group. Another commit writes a new base file at time `NOW-1hr` and the cleaner kicks in. It will see `NOW-8hr` as the latest commit that is not more than 24 hours old causing the file written in `NOW-48hr` to be removed even though it would be required if someone wants to do a time travel query for a point in time 8+ hours in the past. ### Impact Fixes expectations for time based cleaning and retention. ### Risk level (write none, low medium or high below) Low, will keep files longer for those using this policy ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
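The retention scenario from the change log above can be simulated with a minimal sketch. This is illustrative only (hypothetical class and method names, hours-ago integers instead of Hudi instant timestamps); it shows why picking the earliest commit inside the retention window as the cleaning boundary removes the `NOW-48hr` base file that a time-travel query as of `NOW-8hr` still needs:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the KEEP_LATEST_BY_HOURS selection described above.
// Commits are represented as hours-ago values; this is not the Hudi API.
public class RetainByHoursSketch {
    // Current policy: earliest completed commit at or after (now - hoursRetained).
    static Optional<Integer> earliestCommitToRetain(List<Integer> commitsHoursAgo, int hoursRetained) {
        return commitsHoursAgo.stream()
                .sorted((a, b) -> b - a)          // oldest (largest hours-ago) first
                .filter(h -> h <= hoursRetained)  // completed within the retention window
                .findFirst();
    }

    public static void main(String[] args) {
        // Commits at NOW-48h (base file), NOW-8h (update elsewhere), NOW-1h.
        List<Integer> commits = Arrays.asList(48, 8, 1);
        int earliest = earliestCommitToRetain(commits, 24).get();

        // Any file slice older than the earliest retained commit is cleaned, so
        // the NOW-48h base file is removed even though a time-travel query as of
        // NOW-8h still needs it -- the gap this PR addresses.
        int baseFileCommit = 48;
        boolean cleaned = baseFileCommit > earliest;
        System.out.println("earliestCommitToRetain=NOW-" + earliest + "h, 48h base file cleaned=" + cleaned);
        // prints: earliestCommitToRetain=NOW-8h, 48h base file cleaned=true
    }
}
```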
[jira] [Updated] (HUDI-7223) Hudi Cleaner removing files still required for view N hours old
[ https://issues.apache.org/jira/browse/HUDI-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7223: - Labels: pull-request-available (was: ) > Hudi Cleaner removing files still required for view N hours old > --- > > Key: HUDI-7223 > URL: https://issues.apache.org/jira/browse/HUDI-7223 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Priority: Major > Labels: pull-request-available > > If a user is using time based cleaner policy, they will expect that they can > query the table state as of N hours ago. This means that they do not want to > clean up files older than N hours but files that are no longer relevant to > the table N hours ago. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304: URL: https://github.com/apache/hudi/pull/10304#issuecomment-1851243823 ## CI report: * d858eaac14b3de45d4066165622738d91ff603fe Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21456)
Re: [I] [SUPPORT] Flink cannot write data to HUDI when set metadata.enabled to true [hudi]
zdl1 commented on issue #10306: URL: https://github.com/apache/hudi/issues/10306#issuecomment-1851240670 When trying to restart the job, there are other exceptions: ``` 2023-12-12 11:23:25 org.apache.flink.util.FlinkException: Global failure triggered by OperatorCoordinator for 'stream_write: test1' (operator 8b0eff726c52aac1276bd5cfcb9bf178). at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder$LazyInitializedCoordinatorContext.failJob(OperatorCoordinatorHolder.java:545) at org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$start$0(StreamWriteOperatorCoordinator.java:191) at org.apache.hudi.sink.utils.NonThrownExecutor.handleException(NonThrownExecutor.java:142) at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:133) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.hudi.exception.HoodieException: Executor executes action [commits the instant 20231212110524061] error ...
6 more Caused by: org.apache.hudi.exception.HoodieException: Heartbeat for instant 20231212110524061 has expired, last heartbeat 0 at org.apache.hudi.client.heartbeat.HeartbeatUtils.abortIfHeartbeatExpired(HeartbeatUtils.java:95) at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:225) at org.apache.hudi.client.HoodieFlinkWriteClient.commit(HoodieFlinkWriteClient.java:111) at org.apache.hudi.client.HoodieFlinkWriteClient.commit(HoodieFlinkWriteClient.java:74) at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:199) at org.apache.hudi.sink.StreamWriteOperatorCoordinator.doCommit(StreamWriteOperatorCoordinator.java:540) at org.apache.hudi.sink.StreamWriteOperatorCoordinator.commitInstant(StreamWriteOperatorCoordinator.java:516) at org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$notifyCheckpointComplete$2(StreamWriteOperatorCoordinator.java:246) at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130) ... 3 more ```
[I] [SUPPORT] Flink cannot write data to HUDI when set metadata.enabled to true [hudi]
zdl1 opened a new issue, #10306: URL: https://github.com/apache/hudi/issues/10306 **Describe the problem you faced** When I set metadata.enabled to true by Flink, HUDI cannot delta_commit successfully and always restarts the job **To Reproduce** Steps to reproduce the behavior: 1. start flink sql client ``` create table test1 ( c1 int primary key, c2 int, c3 int ) with ( 'connector' = 'hudi', 'path' = 'hdfs:/flink/test1', 'table.type' = 'MERGE_ON_READ', 'metadata.enabled' = 'true' ); ``` 2. ``` create table datagen1 ( c1 int, c2 int, c3 int ) with ( 'connector' = 'datagen', 'number-of-rows' = '300', 'rows-per-second' = '10' ); ``` 3. ``` SET execution.checkpointing.interval=1000; insert into test1 select * from datagen1; ``` **Expected behavior** A clear and concise description of what you expected to happen. **Environment Description** * Hudi version : 0.13.1 * Flink version : 0.14 * Hive version : * Hadoop version : * Storage (HDFS/S3/GCS..) : HDFS * Running on Docker? (yes/no) : no **Stacktrace** ``` 2023-12-12 11:07:55,633 INFO org.apache.flink.streaming.api.functions.source.datagen.DataGeneratorSource [] - generated 10 rows 2023-12-12 11:07:55,633 WARN org.apache.hadoop.hdfs.client.impl.BlockReaderFactory[] - I/O error constructing remote block reader. 
java.nio.channels.ClosedByInterruptException: null at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) ~[?:1.8.0_382] at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:658) ~[?:1.8.0_382] at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192) ~[flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:2.8.3-10.0] at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) ~[flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:2.8.3-10.0] at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2946) ~[flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:2.8.3-10.0] at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815) ~[flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:2.8.3-10.0] at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740) ~[flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:2.8.3-10.0] at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385) ~[flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:2.8.3-10.0] at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:696) ~[flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:2.8.3-10.0] at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:655) ~[flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:2.8.3-10.0] at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:926) ~[flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:2.8.3-10.0] at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:982) ~[flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:2.8.3-10.0] at java.io.DataInputStream.read(DataInputStream.java:149) ~[?:1.8.0_382] at java.io.DataInputStream.read(DataInputStream.java:100) ~[?:1.8.0_382] at java.util.Properties$LineReader.readLine(Properties.java:435) ~[?:1.8.0_382] at java.util.Properties.load0(Properties.java:353) ~[?:1.8.0_382] at java.util.Properties.load(Properties.java:341) ~[?:1.8.0_382] 
at org.apache.hudi.common.table.HoodieTableConfig.fetchConfigs(HoodieTableConfig.java:351) ~[hudi-flink1.14-bundle-0.13.1.jar:0.13.1] at org.apache.hudi.common.table.HoodieTableConfig.(HoodieTableConfig.java:284) ~[hudi-flink1.14-bundle-0.13.1.jar:0.13.1] at org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:138) ~[hudi-flink1.14-bundle-0.13.1.jar:0.13.1] at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:689) ~[hudi-flink1.14-bundle-0.13.1.jar:0.13.1] at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:81) ~[hudi-flink1.14-bundle-0.13.1.jar:0.13.1] at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:770) ~[hudi-flink1.14-bundle-0.13.1.jar:0.13.1] at org.apache.hudi.table.HoodieFlinkTable.create(HoodieFlinkTable.java:62) ~[hudi-flink1.14-bundle-0.13.1.jar:0.13.1] at org.apache.hudi.sink.partitioner.profile.WriteProfile.getTable(WriteProfile.java:138) ~[hudi-flink1.14-bundle-0.13.1.jar:0.13.1] at org.apache.hudi.sink.partitioner.profile.DeltaWriteProfile.getFileSystemView(DeltaWriteProfile.java:93) ~[hudi-flink1.14-bu
[jira] [Created] (HUDI-7223) Hudi Cleaner removing files still required for view N hours old
Timothy Brown created HUDI-7223: --- Summary: Hudi Cleaner removing files still required for view N hours old Key: HUDI-7223 URL: https://issues.apache.org/jira/browse/HUDI-7223 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown If a user is using time based cleaner policy, they will expect that they can query the table state as of N hours ago. This means that they do not want to clean up files older than N hours but files that are no longer relevant to the table N hours ago.
[jira] [Created] (HUDI-7222) Fix the loose Scala style check
Lin Liu created HUDI-7222: - Summary: Fix the loose Scala style check Key: HUDI-7222 URL: https://issues.apache.org/jira/browse/HUDI-7222 Project: Apache Hudi Issue Type: Task Reporter: Lin Liu Assignee: Lin Liu We have seen that the Scala code style is loose in many places, such as malformed imports. We need to fix this kind of problem.
Re: [PR] [HUDI-6979][RFC-76] support event time based compaction strategy [hudi]
waitingF commented on code in PR #10266: URL: https://github.com/apache/hudi/pull/10266#discussion_r1423345721 ## rfc/rfc-76/rfc-76.md: ## @@ -0,0 +1,238 @@ + +# RFC-[74]: [support EventTimeBasedCompactionStrategy] + +## Proposers + +- @waitingF + +## Approvers + - @ + - @ + +## Status + +JIRA: [HUDI-6979](https://issues.apache.org/jira/browse/HUDI-6979) + +> Please keep the status updated in `rfc/README.md`. + +## Abstract + +Currently, to gain low ingestion latency, we can adopt the MergeOnRead table, which support appending log files and +compact log files into base file later. When querying the snapshot table (RT table) generated by MOR, +query side have to perform a compaction so that they can get all data, which is expected time-consuming causing query latency. +At the time, hudi provide read-optimized table (RO table) for low query latency just like COW. + +But currently, there is no compaction strategy based on event time, so there is no data freshness guarantee for RO table. +For cases, user want all data before a specified time, user have to query the RT table to get all data with expected high query latency. Review Comment: > With our new file slicing under unbounded io compaction strategy, a compaction plan at t is designated as including all the log files complete before t, does that make sense to your use case? You can then query the ro table after the compaction completes, the ro table data freshness is at least up to t. I don't think so: there will be file groups in pending compaction that get skipped when scheduling the compaction plan, and in that case it breaks the rule that "the ro table data freshness is at least up to `t`"; there may be history data in those file groups. We should ensure all log files before `t` are compacted, which means we should only generate a new plan when no file group is in pending compaction/clustering, i.e., when no pending compaction or clustering is left. For this, we can introduce a new trigger.
> The question is how to tell the reader the freshness of the ro table? Yes, this is part of the RFC. We can extract the freshness from the log files during compaction. ## rfc/rfc-76/rfc-76.md: ## @@ -0,0 +1,238 @@ + +# RFC-[74]: [support EventTimeBasedCompactionStrategy] + +## Proposers + +- @waitingF + +## Approvers + - @ + - @ + +## Status + +JIRA: [HUDI-6979](https://issues.apache.org/jira/browse/HUDI-6979) + +> Please keep the status updated in `rfc/README.md`. + +## Abstract + +Currently, to gain low ingestion latency, we can adopt the MergeOnRead table, which support appending log files and Review Comment: yeah, looks like so
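The trigger proposed in the discussion above can be sketched as follows. This is a hypothetical illustration, not Hudi's actual scheduling code: it only schedules an event-time based compaction plan when no file group is still under a pending compaction or clustering, so the guarantee "the RO table is fresh at least up to event time `t`" holds once the plan completes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an event-time compaction trigger; all names are illustrative.
public class EventTimeCompactionTrigger {

    static final class FileGroup {
        final long maxEventTimeInLogs;           // freshness extracted from log files while compacting
        final boolean pendingCompactionOrClustering;

        FileGroup(long maxEventTimeInLogs, boolean pending) {
            this.maxEventTimeInLogs = maxEventTimeInLogs;
            this.pendingCompactionOrClustering = pending;
        }
    }

    // Returns the event time t the RO table would be fresh up to after the plan
    // completes, or -1 when scheduling must wait: a skipped file group in pending
    // compaction could still hold data older than t.
    static long schedule(List<FileGroup> groups) {
        if (groups.stream().anyMatch(g -> g.pendingCompactionOrClustering)) {
            return -1;
        }
        return groups.stream().mapToLong(g -> g.maxEventTimeInLogs).min().orElse(-1);
    }

    public static void main(String[] args) {
        List<FileGroup> ready = Arrays.asList(new FileGroup(100, false), new FileGroup(90, false));
        List<FileGroup> blocked = Arrays.asList(new FileGroup(100, false), new FileGroup(95, true));
        System.out.println(schedule(ready));   // 90: every group is compacted through at least t=90
        System.out.println(schedule(blocked)); // -1: wait until the pending plan finishes
    }
}
```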
Re: [PR] [HUDI-7208] Do writing stage should shutdown with error when insert failed to reduce user execute time and show error details [hudi]
hudi-bot commented on PR #10297: URL: https://github.com/apache/hudi/pull/10297#issuecomment-1851216322 ## CI report: * fe0a9fb6f96859b5d2bc2254899f2f5c8624d841 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21438) * 68f3125e06ebb01154d659c2b4452c7ac4c5aa25 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21461)
[jira] [Created] (HUDI-7221) Move Hudi Option class from hudi-common to hudi-io module
Ethan Guo created HUDI-7221: --- Summary: Move Hudi Option class from hudi-common to hudi-io module Key: HUDI-7221 URL: https://issues.apache.org/jira/browse/HUDI-7221 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo
[jira] [Assigned] (HUDI-7221) Move Hudi Option class from hudi-common to hudi-io module
[ https://issues.apache.org/jira/browse/HUDI-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-7221: --- Assignee: Ethan Guo > Move Hudi Option class from hudi-common to hudi-io module > - > > Key: HUDI-7221 > URL: https://issues.apache.org/jira/browse/HUDI-7221 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > >
[jira] [Updated] (HUDI-7221) Move Hudi Option class from hudi-common to hudi-io module
[ https://issues.apache.org/jira/browse/HUDI-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7221: Fix Version/s: 1.0.0 > Move Hudi Option class from hudi-common to hudi-io module > - > > Key: HUDI-7221 > URL: https://issues.apache.org/jira/browse/HUDI-7221 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Fix For: 1.0.0 > >
Re: [PR] [HUDI-7208] Do writing stage should shutdown with error when insert failed to reduce user execute time and show error details [hudi]
hudi-bot commented on PR #10297: URL: https://github.com/apache/hudi/pull/10297#issuecomment-1851211389 ## CI report: * fe0a9fb6f96859b5d2bc2254899f2f5c8624d841 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21438) * 68f3125e06ebb01154d659c2b4452c7ac4c5aa25 UNKNOWN
Re: [PR] [HUDI-7176] Add file group reader test framework [hudi]
hudi-bot commented on PR #10263: URL: https://github.com/apache/hudi/pull/10263#issuecomment-1851206311 ## CI report: * 9450ab6f8a73add7f5b25b92c73429ce5f3f25ec UNKNOWN * 15f28aa59354f27033840dc140f1e71483257fc4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21455)
Re: [PR] [HUDI-7209] Add configuration to skip not exists file in streaming read [hudi]
danny0405 commented on code in PR #10295: URL: https://github.com/apache/hudi/pull/10295#discussion_r142755 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java: ## @@ -210,9 +210,9 @@ public Result inputSplits( LOG.warn("No partitions found for reading in user provided path."); return Result.EMPTY; } - FileStatus[] files = WriteProfiles.getFilesFromMetadata(path, metaClient.getHadoopConf(), metadataList, metaClient.getTableType(), false); - if (files == null) { -LOG.warn("Found deleted files in metadata, fall back to full table scan."); + Boolean skipNotExistsFile = conf.getBoolean(FlinkOptions.READ_STREAMING_SKIP_NOT_EXIST_FILE); Review Comment: Did you know that this code only works for batch queries?
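The two behaviors discussed in the diff above can be sketched like this. It is an illustrative simplification (not the actual `IncrementalInputSplits` code): when commit metadata references files that were already deleted from storage, the default path bails out so the caller falls back to a full table scan, while a skip option such as the proposed `READ_STREAMING_SKIP_NOT_EXIST_FILE` drops the missing files and keeps reading incrementally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the missing-file policy; names are illustrative.
public class MissingFilePolicySketch {

    /** Returns the readable files, or null to signal "fall back to full table scan". */
    static List<String> resolveFiles(List<String> filesFromMetadata,
                                     Set<String> existingOnStorage,
                                     boolean skipNotExistFile) {
        List<String> readable = new ArrayList<>();
        for (String file : filesFromMetadata) {
            if (existingOnStorage.contains(file)) {
                readable.add(file);
            } else if (!skipNotExistFile) {
                return null; // deleted file found: caller logs a warning and full-scans
            }
            // else: skip the deleted file and continue
        }
        return readable;
    }

    public static void main(String[] args) {
        List<String> meta = Arrays.asList("f1.parquet", "f2.parquet");
        Set<String> onDisk = new HashSet<>(Arrays.asList("f1.parquet"));
        System.out.println(resolveFiles(meta, onDisk, true));  // [f1.parquet]
        System.out.println(resolveFiles(meta, onDisk, false)); // null
    }
}
```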
Re: [PR] [HUDI-7170][WIP] Implement HFile reader independent of HBase [hudi]
yihua commented on code in PR #10241: URL: https://github.com/apache/hudi/pull/10241#discussion_r142777 ## hudi-io/src/main/java/org/apache/hudi/io/hfile/HFileDataBlock.java: ## @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.io.hfile; + +import org.apache.hudi.io.util.IOUtils; + +import java.util.Optional; + +import static org.apache.hudi.io.hfile.KeyValue.KEY_OFFSET; + +/** + * Represents a {@link HFileBlockType#DATA} block in the "Scanned block" section. + */ +public class HFileDataBlock extends HFileBlock { + protected HFileDataBlock(HFileContext context, + byte[] byteBuff, + int startOffsetInBuff) { +super(context, HFileBlockType.DATA, byteBuff, startOffsetInBuff); + } + + /** + * Seeks to the key to look up. + * + * @param key Key to look up. + * @return The {@link KeyValue} instance in the block that contains the exact same key as the + * lookup key; or empty {@link Optional} if the lookup key does not exist. 
+ */ + public Optional<KeyValue> seekTo(Key key) { +int offset = startOffsetInBuff + HFILEBLOCK_HEADER_SIZE; +int endOffset = offset + onDiskSizeWithoutHeader; +// TODO: check last 4 bytes in the data block +while (offset + HFILEBLOCK_HEADER_SIZE < endOffset) { + // Full length is not known yet until parsing + KeyValue kv = new KeyValue(byteBuff, offset, -1); Review Comment: Removed as the length is not used.
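The "reading long instead of two integers" TODO in the block above refers to an optimization used by HBase's `KeyValue` parsing: the key length and value length are two consecutive 4-byte big-endian ints, which can be fetched with one 8-byte read and split with shifts. A minimal sketch (helper names are illustrative, not the Hudi or HBase API):

```java
// Sketch of fetching keyLength/valueLength with one long read instead of two int reads.
public class KeyValueLengths {
    static int readInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    static long readLong(byte[] b, int off) {
        // High 32 bits = first int, low 32 bits = second int.
        return ((long) readInt(b, off) << 32) | (readInt(b, off + 4) & 0xffffffffL);
    }

    public static void main(String[] args) {
        byte[] buf = {0, 0, 0, 12, 0, 0, 0, 7}; // keyLength=12, valueLength=7
        // Two separate reads:
        int keyLen = readInt(buf, 0);
        int valLen = readInt(buf, 4);
        // Single read, then split:
        long both = readLong(buf, 0);
        int keyLen2 = (int) (both >>> 32);
        int valLen2 = (int) both;
        assert keyLen == keyLen2 && valLen == valLen2;
        System.out.println(keyLen2 + " " + valLen2); // prints: 12 7
    }
}
```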
Re: [PR] [HUDI-7208] Do writing stage should shutdown with error when insert failed to reduce user execute time and show error details [hudi]
xuzifu666 commented on code in PR #10297: URL: https://github.com/apache/hudi/pull/10297#discussion_r1423332239 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java: ## @@ -508,10 +509,7 @@ protected void doWrite(HoodieRecord record, Schema schema, TypedProperties props flushToDiskIfRequired(record, false); writeToBuffer(record); } catch (Throwable t) { - // Not throwing exception from here, since we don't want to fail the entire job - // for a single record - writeStatus.markFailure(record, t, recordMetadata); - LOG.error("Error writing record " + record, t); + throw new HoodieInsertException("Error writing record " + record, t); Review Comment: revert changes in HoodieAppendHandle @danny0405
Re: [PR] [HUDI-7170][WIP] Implement HFile reader independent of HBase [hudi]
yihua commented on code in PR #10241: URL: https://github.com/apache/hudi/pull/10241#discussion_r1423331913 ## hudi-io/src/main/java/org/apache/hudi/io/hfile/HFileDataBlock.java: ## @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.io.hfile; + +import org.apache.hudi.io.util.IOUtils; + +import java.util.Optional; + +import static org.apache.hudi.io.hfile.KeyValue.KEY_OFFSET; + +/** + * Represents a {@link HFileBlockType#DATA} block in the "Scanned block" section. + */ +public class HFileDataBlock extends HFileBlock { + protected HFileDataBlock(HFileContext context, + byte[] byteBuff, + int startOffsetInBuff) { +super(context, HFileBlockType.DATA, byteBuff, startOffsetInBuff); + } + + /** + * Seeks to the key to look up. + * + * @param key Key to look up. + * @return The {@link KeyValue} instance in the block that contains the exact same key as the + * lookup key; or empty {@link Optional} if the lookup key does not exist. 
+ */ + public Optional seekTo(Key key) { +int offset = startOffsetInBuff + HFILEBLOCK_HEADER_SIZE; +int endOffset = offset + onDiskSizeWithoutHeader; +// TODO: check last 4 bytes in the data block +while (offset + HFILEBLOCK_HEADER_SIZE < endOffset) { + // Full length is not known yet until parsing + KeyValue kv = new KeyValue(byteBuff, offset, -1); + // TODO: Reading long instead of two integers per HBase Review Comment: Filed HUDI-7220 for benchmarking. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
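The `seekTo` loop above walks serialized key-values sequentially because, as the inline comment notes, an entry's full length is not known until its length prefixes are parsed. The following sketch illustrates that scan pattern with a deliberately simplified layout (`[keyLen:int][valLen:int][key][value]`); real HFile cells carry additional fields (timestamp, type, MVCC id), so this is an illustration of the scanning technique, not the actual HFile encoding.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Optional;

public class DataBlockScanSketch {
    // Assumed layout per entry: [keyLen:int][valLen:int][key][value].
    // Real HFile cells carry extra fields, so this is only illustrative.
    static Optional<byte[]> seekTo(byte[] block, byte[] lookupKey) {
        ByteBuffer buf = ByteBuffer.wrap(block);
        // Entry length is unknown until the two length prefixes are parsed,
        // mirroring the "full length is not known yet until parsing" note above.
        while (buf.remaining() > 8) {
            int keyLen = buf.getInt();
            int valLen = buf.getInt();
            byte[] key = new byte[keyLen];
            buf.get(key);
            byte[] val = new byte[valLen];
            buf.get(val);
            if (Arrays.equals(key, lookupKey)) {
                return Optional.of(val);
            }
        }
        return Optional.empty();
    }

    static byte[] encode(String k, String v) {
        ByteBuffer buf = ByteBuffer.allocate(8 + k.length() + v.length());
        buf.putInt(k.length()).putInt(v.length());
        buf.put(k.getBytes()).put(v.getBytes());
        return buf.array();
    }

    public static void main(String[] args) {
        ByteBuffer block = ByteBuffer.allocate(64);
        byte[] e1 = encode("k1", "v1");
        byte[] e2 = encode("k2", "v2");
        block.put(e1).put(e2);
        byte[] data = Arrays.copyOf(block.array(), e1.length + e2.length);
        if (!new String(seekTo(data, "k2".getBytes()).get()).equals("v2")) throw new AssertionError();
        if (seekTo(data, "k9".getBytes()).isPresent()) throw new AssertionError();
        System.out.println("ok");
    }
}
```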
Re: [PR] [HUDI-7208] Do writing stage should shutdown with error when insert failed to reduce user execute time and show error details [hudi]
xuzifu666 commented on code in PR #10297: URL: https://github.com/apache/hudi/pull/10297#discussion_r1423332239 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java: ## @@ -508,10 +509,7 @@ protected void doWrite(HoodieRecord record, Schema schema, TypedProperties props flushToDiskIfRequired(record, false); writeToBuffer(record); } catch (Throwable t) { - // Not throwing exception from here, since we don't want to fail the entire job - // for a single record - writeStatus.markFailure(record, t, recordMetadata); - LOG.error("Error writing record " + record, t); + throw new HoodieInsertException("Error writing record " + record, t); Review Comment: revert changes in HoodieAppendHandle @danny0405
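The diff under review contrasts two error-handling strategies in `doWrite`: the original behavior marks the failure on the `WriteStatus` so a single bad record does not fail the whole job, while the PR's change throws immediately so the error surfaces fast. A minimal sketch of the two strategies, using hypothetical stand-in types (this `WriteStatus` is not Hudi's class, and the toggle is an illustration, not Hudi's actual configuration):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hudi's WriteStatus, for illustration only.
class WriteStatus {
    final List<String> failedRecords = new ArrayList<>();
    void markFailure(String recordKey, Throwable t) {
        failedRecords.add(recordKey + ": " + t.getMessage());
    }
}

public class AppendHandleSketch {
    // PR #10297 effectively flips this to true; the review asks to keep it false here.
    static final boolean FAIL_FAST = false;

    static void doWrite(WriteStatus status, String recordKey, Runnable write) {
        try {
            write.run();
        } catch (Throwable t) {
            if (FAIL_FAST) {
                // Fail-fast: surface the error immediately and stop the job
                throw new RuntimeException("Error writing record " + recordKey, t);
            }
            // Tolerant: record the failure so one bad record does not fail the entire job
            status.markFailure(recordKey, t);
        }
    }

    public static void main(String[] args) {
        WriteStatus status = new WriteStatus();
        doWrite(status, "key-1", () -> { throw new IllegalStateException("bad schema"); });
        doWrite(status, "key-2", () -> { /* succeeds */ });
        if (status.failedRecords.size() != 1) throw new AssertionError();
        System.out.println(status.failedRecords);
    }
}
```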
[jira] [Updated] (HUDI-7220) Benchmark new HFile reader
[ https://issues.apache.org/jira/browse/HUDI-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7220: Description: Do micro-benchmarks on the HFile reader on different operations, including reading the whole file, do point and prefix lookups (seekTo), etc. > Benchmark new HFile reader > -- > > Key: HUDI-7220 > URL: https://issues.apache.org/jira/browse/HUDI-7220 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Priority: Major > > Do micro-benchmarks on the HFile reader on different operations, including > reading the whole file, do point and prefix lookups (seekTo), etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7220) Benchmark new HFile reader
[ https://issues.apache.org/jira/browse/HUDI-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7220: Fix Version/s: 1.0.0 > Benchmark new HFile reader > -- > > Key: HUDI-7220 > URL: https://issues.apache.org/jira/browse/HUDI-7220 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > > Do micro-benchmarks on the HFile reader on different operations, including > reading the whole file, do point and prefix lookups (seekTo), etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7220) Benchmark new HFile reader
[ https://issues.apache.org/jira/browse/HUDI-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7220: Priority: Blocker (was: Major) > Benchmark new HFile reader > -- > > Key: HUDI-7220 > URL: https://issues.apache.org/jira/browse/HUDI-7220 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > > > Do micro-benchmarks on the HFile reader on different operations, including > reading the whole file, do point and prefix lookups (seekTo), etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7220) Benchmark new HFile reader
[ https://issues.apache.org/jira/browse/HUDI-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-7220: --- Assignee: Ethan Guo > Benchmark new HFile reader > -- > > Key: HUDI-7220 > URL: https://issues.apache.org/jira/browse/HUDI-7220 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > > Do micro-benchmarks on the HFile reader on different operations, including > reading the whole file, do point and prefix lookups (seekTo), etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7220) Benchmark new HFile reader
Ethan Guo created HUDI-7220: --- Summary: Benchmark new HFile reader Key: HUDI-7220 URL: https://issues.apache.org/jira/browse/HUDI-7220 Project: Apache Hudi Issue Type: New Feature Reporter: Ethan Guo
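HUDI-7220 calls for micro-benchmarks of reader operations (full scans, point and prefix `seekTo` lookups). A proper benchmark would use a harness such as JMH; the naive timing loop below is only a rough sketch of the measurement idea, with a crude warm-up pass and none of the JIT/dead-code safeguards a real harness provides.

```java
public class MicroBenchSketch {
    // Not JMH: a naive timing loop, only a rough sketch of what HUDI-7220 would measure properly.
    static long timeNanos(Runnable op, int iterations) {
        op.run(); // crude warm-up; a real harness (e.g. JMH) handles JIT warm-up and dead-code elimination
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            op.run();
        }
        return (System.nanoTime() - start) / iterations;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for a seekTo call on the reader under test
        long perOp = timeNanos(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100; i++) sb.append(i);
        }, 10_000);
        if (perOp < 0) throw new AssertionError();
        System.out.println("avg ns/op: " + perOp);
    }
}
```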
Re: [PR] [HUDI-7170][WIP] Implement HFile reader independent of HBase [hudi]
yihua commented on code in PR #10241: URL: https://github.com/apache/hudi/pull/10241#discussion_r1423330790 ## hudi-io/src/test/java/org/apache/hudi/io/hfile/TestHFileReader.java: ## @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.io.hfile; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.ValueSource; + +import java.io.IOException; +import java.util.Optional; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; + +public class TestHFileReader { Review Comment: I'll add more tests and clean up existing test logic. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] [HUDI-7170][WIP] Implement HFile reader independent of HBase [hudi]
yihua commented on code in PR #10241: URL: https://github.com/apache/hudi/pull/10241#discussion_r1423330150 ## hudi-io/src/main/java/org/apache/hudi/io/hfile/HFileReader.java: ## @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.io.hfile; + +import org.apache.hadoop.fs.FSDataInputStream; + +import java.io.ByteArrayInputStream; +import java.io.DataInputStream; +import java.io.IOException; +import java.util.Collections; +import java.util.List; +import java.util.Optional; + +/** + * A reader reading a HFile. + */ +public class HFileReader { + private final FSDataInputStream stream; + private final long fileSize; + private boolean isMetadataInitialized = false; + private HFileContext context; + private List blockIndexEntryList; + private HFileBlock metaIndexBlock; + private HFileBlock fileInfoBlock; + + public HFileReader(FSDataInputStream stream, long fileSize) { +this.stream = stream; +this.fileSize = fileSize; + } + + /** + * Initializes the metadata by reading the "Load-on-open" section. + * + * @throws IOException upon error. 
+ */ + public void initializeMetadata() throws IOException { +assert !this.isMetadataInitialized; + +// Read Trailer (serialized in Proto) +HFileTrailer trailer = readTrailer(stream, fileSize); +this.context = HFileContext.builder() +.compressAlgo(trailer.getCompressionCodec()) +.build(); +HFileBlockReader blockReader = new HFileBlockReader( +context, stream, trailer.getLoadOnOpenDataOffset(), fileSize - HFileTrailer.getTrailerSize()); +HFileRootIndexBlock dataIndexBlock = +(HFileRootIndexBlock) blockReader.nextBlock(HFileBlockType.ROOT_INDEX); +this.blockIndexEntryList = dataIndexBlock.readDataIndex(trailer.getDataIndexCount()); +this.metaIndexBlock = blockReader.nextBlock(HFileBlockType.ROOT_INDEX); +this.fileInfoBlock = blockReader.nextBlock(HFileBlockType.FILE_INFO); + +this.isMetadataInitialized = true; + } + + /** + * Seeks to the key to look up. + * + * @param key Key to look up. + * @return The {@link KeyValue} instance in the block that contains the exact same key as the + * lookup key; or empty {@link Optional} if the lookup key does not exist. + * @throws IOException upon error. + */ + public Optional seekTo(Key key) throws IOException { +BlockIndexEntry lookUpKey = new BlockIndexEntry(key, -1, -1); +int rootLevelBlockIndex = searchBlockByKey(lookUpKey); +if (rootLevelBlockIndex < 0) { + // Key smaller than the start key of the first block + return Optional.empty(); +} +BlockIndexEntry blockToRead = blockIndexEntryList.get(rootLevelBlockIndex); +HFileBlockReader blockReader = new HFileBlockReader( +context, stream, blockToRead.getOffset(), blockToRead.getOffset() + (long) blockToRead.getSize()); +HFileDataBlock dataBlock = (HFileDataBlock) blockReader.nextBlock(HFileBlockType.DATA); +return seekToKeyInBlock(dataBlock, key); + } + + /** + * Reads the HFile major version from the input. + * + * @param bytes Input data. + * @param offset Offset to start reading. + * @return Major version of the file. 
+ */ + public static int readMajorVersion(byte[] bytes, int offset) { Review Comment: This only reads three bytes which is uncommon so I kept it here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
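The `readMajorVersion` snippet above composes three big-endian bytes into an `int`; the `& 0xFF` masks are what prevent Java's signed `byte` values from sign-extending. A self-contained copy of that arithmetic (taken directly from the diff) with a quick check:

```java
public class MajorVersionSketch {
    // Same arithmetic as the snippet in the diff: three big-endian bytes into an int.
    // Masking with 0xFF avoids sign extension of Java's signed byte type.
    public static int readMajorVersion(byte[] bytes, int offset) {
        int ch1 = bytes[offset] & 0xFF;
        int ch2 = bytes[offset + 1] & 0xFF;
        int ch3 = bytes[offset + 2] & 0xFF;
        return (ch1 << 16) + (ch2 << 8) + ch3;
    }

    public static void main(String[] args) {
        // 0x00 0x00 0x03 -> 3; the high bytes weigh 65536 and 256 respectively.
        if (readMajorVersion(new byte[]{0, 0, 3}, 0) != 3) throw new AssertionError();
        if (readMajorVersion(new byte[]{1, 2, 3}, 0) != (1 << 16) + (2 << 8) + 3) throw new AssertionError();
        // A negative byte like (byte) 0xFF contributes 255, not -1, thanks to the mask.
        if (readMajorVersion(new byte[]{0, 0, (byte) 0xFF}, 0) != 255) throw new AssertionError();
        System.out.println("ok");
    }
}
```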
Re: [PR] [HUDI-7170][WIP] Implement HFile reader independent of HBase [hudi]
yihua commented on code in PR #10241: URL: https://github.com/apache/hudi/pull/10241#discussion_r1423329790 ## hudi-io/src/main/java/org/apache/hudi/io/hfile/HFileReader.java: ## @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.io.hfile; + +import org.apache.hadoop.fs.FSDataInputStream; + +import java.io.ByteArrayInputStream; +import java.io.DataInputStream; +import java.io.IOException; +import java.util.Collections; +import java.util.List; +import java.util.Optional; + +/** + * A reader reading a HFile. + */ +public class HFileReader { + private final FSDataInputStream stream; + private final long fileSize; + private boolean isMetadataInitialized = false; + private HFileContext context; + private List blockIndexEntryList; + private HFileBlock metaIndexBlock; + private HFileBlock fileInfoBlock; + + public HFileReader(FSDataInputStream stream, long fileSize) { +this.stream = stream; +this.fileSize = fileSize; + } + + /** + * Initializes the metadata by reading the "Load-on-open" section. + * + * @throws IOException upon error. 
+ */ + public void initializeMetadata() throws IOException { +assert !this.isMetadataInitialized; + +// Read Trailer (serialized in Proto) +HFileTrailer trailer = readTrailer(stream, fileSize); +this.context = HFileContext.builder() +.compressAlgo(trailer.getCompressionCodec()) +.build(); +HFileBlockReader blockReader = new HFileBlockReader( +context, stream, trailer.getLoadOnOpenDataOffset(), fileSize - HFileTrailer.getTrailerSize()); +HFileRootIndexBlock dataIndexBlock = +(HFileRootIndexBlock) blockReader.nextBlock(HFileBlockType.ROOT_INDEX); +this.blockIndexEntryList = dataIndexBlock.readDataIndex(trailer.getDataIndexCount()); +this.metaIndexBlock = blockReader.nextBlock(HFileBlockType.ROOT_INDEX); +this.fileInfoBlock = blockReader.nextBlock(HFileBlockType.FILE_INFO); + +this.isMetadataInitialized = true; + } + + /** + * Seeks to the key to look up. + * + * @param key Key to look up. + * @return The {@link KeyValue} instance in the block that contains the exact same key as the + * lookup key; or empty {@link Optional} if the lookup key does not exist. + * @throws IOException upon error. + */ + public Optional seekTo(Key key) throws IOException { +BlockIndexEntry lookUpKey = new BlockIndexEntry(key, -1, -1); +int rootLevelBlockIndex = searchBlockByKey(lookUpKey); +if (rootLevelBlockIndex < 0) { + // Key smaller than the start key of the first block + return Optional.empty(); +} +BlockIndexEntry blockToRead = blockIndexEntryList.get(rootLevelBlockIndex); +HFileBlockReader blockReader = new HFileBlockReader( +context, stream, blockToRead.getOffset(), blockToRead.getOffset() + (long) blockToRead.getSize()); +HFileDataBlock dataBlock = (HFileDataBlock) blockReader.nextBlock(HFileBlockType.DATA); Review Comment: More APIs need to be added to the HFile reader based on the current prefix search implemented in `HoodieAvroHFileReader` (https://github.com/apache/hudi/blob/release-0.14.0/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java#L298). 
I plan to cover this in a separate PR: HUDI-7217
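The `seekTo` path above first runs `searchBlockByKey` against the root-level block index, returning a negative result when the lookup key sorts before the first block's start key. A sketch of that index-search behavior over a sorted list of block start keys, using `Collections.binarySearch` insertion-point semantics (the real implementation and entry type differ; this is only an assumption-laden illustration):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BlockIndexSearchSketch {
    // Hypothetical simplification: each index entry is just the first key of a block, sorted.
    static int searchBlockByKey(List<String> firstKeys, String lookupKey) {
        int idx = Collections.binarySearch(firstKeys, lookupKey);
        if (idx >= 0) {
            return idx;                 // exact match on a block's first key
        }
        int insertion = -idx - 1;       // index of the first key greater than the lookup key
        return insertion - 1;           // block whose range may contain the key; -1 if before all blocks
    }

    public static void main(String[] args) {
        List<String> firstKeys = Arrays.asList("b", "f", "m");
        // Key smaller than the first block's start key -> negative, as in the reader's seekTo
        if (searchBlockByKey(firstKeys, "a") != -1) throw new AssertionError();
        if (searchBlockByKey(firstKeys, "f") != 1) throw new AssertionError();
        // "k" falls inside the block starting at "f"
        if (searchBlockByKey(firstKeys, "k") != 1) throw new AssertionError();
        System.out.println("ok");
    }
}
```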
Re: [PR] [HUDI-7213] When using wrong tabe.type value in hudi catalog happends npe [hudi]
danny0405 commented on code in PR #10300: URL: https://github.com/apache/hudi/pull/10300#discussion_r1423326760 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/TableOptionProperties.java: ## @@ -189,7 +190,16 @@ public static Map translateFlinkTableProperties2Spark( return properties.entrySet().stream() .filter(e -> KEY_MAPPING.containsKey(e.getKey()) && !catalogTable.getOptions().containsKey(KEY_MAPPING.get(e.getKey( .collect(Collectors.toMap(e -> KEY_MAPPING.get(e.getKey()), -e -> e.getKey().equalsIgnoreCase(FlinkOptions.TABLE_TYPE.key()) ? VALUE_MAPPING.get(e.getValue()) : e.getValue())); +e -> { + if (e.getKey().equalsIgnoreCase(FlinkOptions.TABLE_TYPE.key())) { + String sparkTableType = VALUE_MAPPING.get(e.getValue()); + if (sparkTableType == null) { +throw new HoodieValidationException(String.format("%s's value is invalid", e.getKey())); + } Review Comment: Nice catch, can you write a UT in `TestHoodieCatalog` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
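The fix above replaces a silent `null` from `VALUE_MAPPING.get(...)` (which later surfaces as an NPE) with an immediate validation error. The pattern can be sketched as follows; the mapping values and exception type here are illustrative stand-ins, not the exact contents of Hudi's `TableOptionProperties`:

```java
import java.util.HashMap;
import java.util.Map;

public class TableTypeValidationSketch {
    // Illustrative mapping, mirroring the Flink-to-Spark table-type translation discussed above.
    static final Map<String, String> VALUE_MAPPING = new HashMap<>();
    static {
        VALUE_MAPPING.put("COPY_ON_WRITE", "cow");
        VALUE_MAPPING.put("MERGE_ON_READ", "mor");
    }

    static String translateTableType(String flinkValue) {
        String sparkTableType = VALUE_MAPPING.get(flinkValue);
        if (sparkTableType == null) {
            // Fail loudly at the point of translation instead of propagating a null
            // that would later surface as an NPE far from the bad option value.
            throw new IllegalArgumentException("table.type's value is invalid: " + flinkValue);
        }
        return sparkTableType;
    }

    public static void main(String[] args) {
        if (!translateTableType("COPY_ON_WRITE").equals("cow")) throw new AssertionError();
        boolean threw = false;
        try {
            translateTableType("COPY_ON_WRIT"); // misspelled value is rejected immediately
        } catch (IllegalArgumentException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError();
        System.out.println("ok");
    }
}
```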
Re: [PR] [MINOR] add a util to print hudi options [hudi]
danny0405 commented on PR #10301: URL: https://github.com/apache/hudi/pull/10301#issuecomment-1851190487 Can you give us an example of why printing the options is necessary?
Re: [I] [SUPPORT] Hudi Bootstrap with METADATA_ONLY with Hive Sync Fails on EMR Serverlkess 6.10 [hudi]
soumilshah1995 closed issue #8565: [SUPPORT] Hudi Bootstrap with METADATA_ONLY with Hive Sync Fails on EMR Serverlkess 6.10 URL: https://github.com/apache/hudi/issues/8565
Re: [I] [SUPPORT] Spark job stuck after completion, due to some non daemon threads still running [hudi]
zyclove commented on issue #9826: URL: https://github.com/apache/hudi/issues/9826#issuecomment-1851189708 This problem was recently verified on version 0.14 together with #10224. It has been running for a week and the problem has not reproduced so far. The fix needs to be merged into release-0.14.1. @nsivabalan
Re: [I] [SUPPORT] Hudi Offline Compaction in EMR Serverless 6.10 for YouTube Video [hudi]
soumilshah1995 closed issue #8400: [SUPPORT] Hudi Offline Compaction in EMR Serverless 6.10 for YouTube Video URL: https://github.com/apache/hudi/issues/8400
Re: [I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]
soumilshah1995 closed issue #10231: [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore URL: https://github.com/apache/hudi/issues/10231
Re: [PR] [HUDI-7170][WIP] Implement HFile reader independent of HBase [hudi]
yihua commented on code in PR #10241: URL: https://github.com/apache/hudi/pull/10241#discussion_r1423320093 ## hudi-io/src/main/java/org/apache/hudi/io/hfile/HFileReader.java: ## @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.io.hfile; + +import org.apache.hadoop.fs.FSDataInputStream; + +import java.io.ByteArrayInputStream; +import java.io.DataInputStream; +import java.io.IOException; +import java.util.Collections; +import java.util.List; +import java.util.Optional; + +/** + * A reader reading a HFile. + */ +public class HFileReader { + private final FSDataInputStream stream; + private final long fileSize; + private boolean isMetadataInitialized = false; + private HFileContext context; + private List blockIndexEntryList; + private HFileBlock metaIndexBlock; + private HFileBlock fileInfoBlock; + + public HFileReader(FSDataInputStream stream, long fileSize) { +this.stream = stream; +this.fileSize = fileSize; + } + + /** + * Initializes the metadata by reading the "Load-on-open" section. + * + * @throws IOException upon error. 
+ */ + public void initializeMetadata() throws IOException { +assert !this.isMetadataInitialized; + +// Read Trailer (serialized in Proto) +HFileTrailer trailer = readTrailer(stream, fileSize); +this.context = HFileContext.builder() +.compressAlgo(trailer.getCompressionCodec()) +.build(); +HFileBlockReader blockReader = new HFileBlockReader( +context, stream, trailer.getLoadOnOpenDataOffset(), fileSize - HFileTrailer.getTrailerSize()); +HFileRootIndexBlock dataIndexBlock = +(HFileRootIndexBlock) blockReader.nextBlock(HFileBlockType.ROOT_INDEX); +this.blockIndexEntryList = dataIndexBlock.readDataIndex(trailer.getDataIndexCount()); +this.metaIndexBlock = blockReader.nextBlock(HFileBlockType.ROOT_INDEX); +this.fileInfoBlock = blockReader.nextBlock(HFileBlockType.FILE_INFO); + +this.isMetadataInitialized = true; + } + + /** + * Seeks to the key to look up. + * + * @param key Key to look up. + * @return The {@link KeyValue} instance in the block that contains the exact same key as the + * lookup key; or empty {@link Optional} if the lookup key does not exist. + * @throws IOException upon error. + */ + public Optional seekTo(Key key) throws IOException { +BlockIndexEntry lookUpKey = new BlockIndexEntry(key, -1, -1); +int rootLevelBlockIndex = searchBlockByKey(lookUpKey); +if (rootLevelBlockIndex < 0) { + // Key smaller than the start key of the first block + return Optional.empty(); +} +BlockIndexEntry blockToRead = blockIndexEntryList.get(rootLevelBlockIndex); +HFileBlockReader blockReader = new HFileBlockReader( +context, stream, blockToRead.getOffset(), blockToRead.getOffset() + (long) blockToRead.getSize()); +HFileDataBlock dataBlock = (HFileDataBlock) blockReader.nextBlock(HFileBlockType.DATA); +return seekToKeyInBlock(dataBlock, key); + } + + /** + * Reads the HFile major version from the input. + * + * @param bytes Input data. + * @param offset Offset to start reading. + * @return Major version of the file. 
+ */ + public static int readMajorVersion(byte[] bytes, int offset) { +int ch1 = bytes[offset] & 0xFF; +int ch2 = bytes[offset + 1] & 0xFF; +int ch3 = bytes[offset + 2] & 0xFF; +return ((ch1 << 16) + (ch2 << 8) + ch3); + } + + /** + * Reads and parses the HFile trailer. + * + * @param stream HFile input. + * @param fileSize HFile size. + * @return {@link HFileTrailer} instance. + * @throws IOException upon error. + */ + private static HFileTrailer readTrailer(FSDataInputStream stream, + long fileSize) throws IOException { +int bufferSize = HFileTrailer.getTrailerSize(); +long seekPos = fileSize - bufferSize; +if (seekPos < 0) { + // It is hard to imagine such a small HFile. + seekPos = 0; + bufferSize = (int) fileSize; +} +stream.seek(seekPos); + +byte[] byteBuff = new byte[bufferSize]; +stream.readFully(byteBuff); + +int majorVersion = readMajorVersi
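The `readTrailer` method above seeks to `fileSize - trailerSize` and clamps both the seek position and buffer size when the file is smaller than the trailer buffer. That arithmetic, extracted into a standalone helper for clarity (the method name and return shape here are mine, not Hudi's):

```java
public class TrailerSeekSketch {
    // Mirrors the readTrailer arithmetic in the diff above: read the last
    // trailerSize bytes, clamping for files smaller than the trailer buffer.
    static long[] trailerReadRange(long fileSize, int trailerSize) {
        long seekPos = fileSize - trailerSize;
        int bufferSize = trailerSize;
        if (seekPos < 0) {
            // It is hard to imagine such a small HFile, but handle it anyway
            seekPos = 0;
            bufferSize = (int) fileSize;
        }
        return new long[]{seekPos, bufferSize};
    }

    public static void main(String[] args) {
        long[] r = trailerReadRange(10_000, 4096);
        if (r[0] != 10_000 - 4096 || r[1] != 4096) throw new AssertionError();
        // File smaller than the trailer buffer: read the whole file from offset 0
        long[] tiny = trailerReadRange(100, 4096);
        if (tiny[0] != 0 || tiny[1] != 100) throw new AssertionError();
        System.out.println("ok");
    }
}
```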
Re: [I] [SUPPORT] restart flink job got InvalidAvroMagicException: Not an Avro data file [hudi]
danny0405 closed issue #10285: [SUPPORT] restart flink job got InvalidAvroMagicException: Not an Avro data file URL: https://github.com/apache/hudi/issues/10285
Re: [PR] [HUDI-7208] Do writing stage should shutdown with error when insert failed to reduce user execute time and show error details [hudi]
danny0405 commented on code in PR #10297: URL: https://github.com/apache/hudi/pull/10297#discussion_r1423319781 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java: ## @@ -508,10 +509,7 @@ protected void doWrite(HoodieRecord record, Schema schema, TypedProperties props flushToDiskIfRequired(record, false); writeToBuffer(record); } catch (Throwable t) { - // Not throwing exception from here, since we don't want to fail the entire job - // for a single record - writeStatus.markFailure(record, t, recordMetadata); - LOG.error("Error writing record " + record, t); + throw new HoodieInsertException("Error writing record " + record, t); Review Comment: In Flink SQL, option `write.ignore.failed` controls the throwing behavior of errors. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7170][WIP] Implement HFile reader independent of HBase [hudi]
yihua commented on code in PR #10241: URL: https://github.com/apache/hudi/pull/10241#discussion_r1423319899 ## hudi-io/src/main/java/org/apache/hudi/io/hfile/HFileReader.java: ## @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.io.hfile; + +import org.apache.hadoop.fs.FSDataInputStream; + +import java.io.ByteArrayInputStream; +import java.io.DataInputStream; +import java.io.IOException; +import java.util.Collections; +import java.util.List; +import java.util.Optional; + +/** + * A reader reading a HFile. + */ +public class HFileReader { + private final FSDataInputStream stream; + private final long fileSize; + private boolean isMetadataInitialized = false; + private HFileContext context; + private List blockIndexEntryList; + private HFileBlock metaIndexBlock; + private HFileBlock fileInfoBlock; + + public HFileReader(FSDataInputStream stream, long fileSize) { +this.stream = stream; +this.fileSize = fileSize; + } + + /** + * Initializes the metadata by reading the "Load-on-open" section. + * + * @throws IOException upon error. 
+ */ + public void initializeMetadata() throws IOException { +assert !this.isMetadataInitialized; + +// Read Trailer (serialized in Proto) +HFileTrailer trailer = readTrailer(stream, fileSize); +this.context = HFileContext.builder() +.compressAlgo(trailer.getCompressionCodec()) +.build(); +HFileBlockReader blockReader = new HFileBlockReader( +context, stream, trailer.getLoadOnOpenDataOffset(), fileSize - HFileTrailer.getTrailerSize()); +HFileRootIndexBlock dataIndexBlock = +(HFileRootIndexBlock) blockReader.nextBlock(HFileBlockType.ROOT_INDEX); +this.blockIndexEntryList = dataIndexBlock.readDataIndex(trailer.getDataIndexCount()); +this.metaIndexBlock = blockReader.nextBlock(HFileBlockType.ROOT_INDEX); +this.fileInfoBlock = blockReader.nextBlock(HFileBlockType.FILE_INFO); + +this.isMetadataInitialized = true; + } + + /** + * Seeks to the key to look up. + * + * @param key Key to look up. + * @return The {@link KeyValue} instance in the block that contains the exact same key as the + * lookup key; or empty {@link Optional} if the lookup key does not exist. + * @throws IOException upon error. + */ + public Optional seekTo(Key key) throws IOException { +BlockIndexEntry lookUpKey = new BlockIndexEntry(key, -1, -1); +int rootLevelBlockIndex = searchBlockByKey(lookUpKey); +if (rootLevelBlockIndex < 0) { + // Key smaller than the start key of the first block + return Optional.empty(); +} +BlockIndexEntry blockToRead = blockIndexEntryList.get(rootLevelBlockIndex); +HFileBlockReader blockReader = new HFileBlockReader( +context, stream, blockToRead.getOffset(), blockToRead.getOffset() + (long) blockToRead.getSize()); +HFileDataBlock dataBlock = (HFileDataBlock) blockReader.nextBlock(HFileBlockType.DATA); +return seekToKeyInBlock(dataBlock, key); + } + + /** + * Reads the HFile major version from the input. + * + * @param bytes Input data. + * @param offset Offset to start reading. + * @return Major version of the file. 
+ */ + public static int readMajorVersion(byte[] bytes, int offset) { +int ch1 = bytes[offset] & 0xFF; +int ch2 = bytes[offset + 1] & 0xFF; +int ch3 = bytes[offset + 2] & 0xFF; +return ((ch1 << 16) + (ch2 << 8) + ch3); + } + + /** + * Reads and parses the HFile trailer. + * + * @param stream HFile input. + * @param fileSize HFile size. + * @return {@link HFileTrailer} instance. + * @throws IOException upon error. + */ + private static HFileTrailer readTrailer(FSDataInputStream stream, + long fileSize) throws IOException { +int bufferSize = HFileTrailer.getTrailerSize(); +long seekPos = fileSize - bufferSize; +if (seekPos < 0) { + // It is hard to imagine such a small HFile. + seekPos = 0; + bufferSize = (int) fileSize; +} +stream.seek(seekPos); + +byte[] byteBuff = new byte[bufferSize]; +stream.readFully(byteBuff); + +int majorVersion = readMajorVersi
[jira] [Closed] (HUDI-7210) In CleanFunction#open, triggers the cleaning under option 'clean.async.enabled'
[ https://issues.apache.org/jira/browse/HUDI-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-7210. Resolution: Fixed Fixed via master branch: 1cdae9c83ecff683e014c74621c78281677c3fac > In CleanFunction#open, triggers the cleaning under option > 'clean.async.enabled' > --- > > Key: HUDI-7210 > URL: https://issues.apache.org/jira/browse/HUDI-7210 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1, 1.0.0 > >
(hudi) branch master updated (5f3bc96dc9e -> 1cdae9c83ec)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 5f3bc96dc9e [HUDI-7023] Support querying without syncing partition metadata to catalog (#10153) add 1cdae9c83ec [HUDI-7210] In CleanFunction#open, triggers the cleaning under option 'clean.async.enabled' (#10298) No new revisions were added by this update. Summary of changes: .../main/java/org/apache/hudi/sink/CleanFunction.java | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-)
Re: [PR] [HUDI-7210] In CleanFunction#open, triggers the cleaning under option 'clean.async.enabled' [hudi]
danny0405 merged PR #10298: URL: https://github.com/apache/hudi/pull/10298