[GitHub] [hudi] hudi-bot commented on pull request #6719: [HUDI-4718] Add Kerberos kinit command support for cli.
hudi-bot commented on PR #6719: URL: https://github.com/apache/hudi/pull/6719#issuecomment-1251923320 ## CI report: * 4b88f9a2c614e5002bdb029a267a6c351386357b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11514) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline
[ https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606933#comment-17606933 ] Danny Chen edited comment on HUDI-4880 at 9/20/22 6:41 AM: --- Yeah, that seems reasonable; can you file a PR to roll it back? I now remember why we did this: a subsequent writing task would generate a marker file with the same name, which conflicts and throws a file-already-exists exception. In {{CompactionCommitSink}} we attempt a rollback for the failed compaction; here we only schedule instants in REQUESTED state, so each should be a fresh compaction instant and the marker dir is expected not to exist. was (Author: danny0405): Yeah, seems reasonable, can you fire a PR to rollback it then ? > Corrupted parquet file found in Hudi Flink MOR pipeline > --- > Key: HUDI-4880 > URL: https://issues.apache.org/jira/browse/HUDI-4880 > Project: Apache Hudi > Issue Type: Bug > Components: compaction, flink > Reporter: Teng Huo > Assignee: Teng Huo > Priority: Major > h2. Env > Hudi version: 0.11.1 (but I believe this issue still exists in the current version) > Flink version: 1.13 > Pipeline type: MOR, online compaction > h2. TLDR > The marker mechanism for cleaning corrupted parquet files is no longer effective in Flink MOR online compaction due to this PR: [https://github.com/apache/hudi/pull/5611] > h2. Issue description > Recently we hit an issue where corrupted parquet files appeared in a Hudi table, making the table unreadable and causing the compaction task to fail constantly. > E.g. a Spark application complained that a parquet file was too small:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 66 in stage 12.0 failed 4 times, most recent failure: Lost task 66.3 in stage 12.0 (TID 156) (executor 6): java.lang.RuntimeException: hdfs://.../0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet is not a Parquet file (too small length: 0)
> at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:514)
> {code}
> h2. Root cause > After troubleshooting, I believe we have found the root cause of this issue. > Initially, this Flink MOR pipeline failed for some reason, which left a number of unfinished parquet files in the Hudi table. That is acceptable for Hudi because they can be cleaned up later via the marker mechanism in the method "finalizeWrite", which calls a method named "reconcileAgainstMarkers": it finds the files that are in the marker folder but not in the commit metadata, marks them as corrupted files, and deletes them. > However, I found this part of the code did not work as expected: the corrupted parquet file "0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet" was not deleted in "20220919020324533.commit". > We then found [a piece of code|https://github.com/apache/hudi/blob/13eb892081fc4ddd5e1592ef8698831972012666/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionPlanOperator.java#L134] that deletes the marker folder at the beginning of every batch of compaction. This defeats the mechanism for deleting corrupt files, since all marker files created before the current batch are deleted. > HDFS audit logs also show that this marker folder "hdfs://.../.hoodie/.temp/20220919020324533" was deleted multiple times within a single Flink application, which confirms the current behavior of "CompactionPlanOperator": it deletes the marker folder every time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
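The reconciliation step described under "Root cause" can be sketched as follows. This is a hypothetical simplification (the class and method names below are invented for illustration), not Hudi's actual `reconcileAgainstMarkers` implementation:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified sketch: a data file that has a marker but is missing from the
// commit metadata is assumed to be a leftover of a failed write and is
// selected for deletion.
public class MarkerReconcileSketch {
  public static List<String> findCorruptFiles(Set<String> markedFiles, Set<String> committedFiles) {
    return markedFiles.stream()
        .filter(f -> !committedFiles.contains(f)) // marked but never committed
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Set<String> markers = Set.of("a.parquet", "b.parquet", "c.parquet");
    Set<String> committed = Set.of("a.parquet", "c.parquet");
    // b.parquet was marked but never committed, so it is the corrupt leftover.
    System.out.println(findCorruptFiles(markers, committed)); // [b.parquet]
  }
}
```

Deleting the marker folder before this comparison runs leaves the set of marked files empty, so nothing is ever flagged, which is exactly the failure the issue describes.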
[jira] [Commented] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline
[ https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606933#comment-17606933 ] Danny Chen commented on HUDI-4880: -- Yeah, that seems reasonable; can you file a PR to roll it back?
[GitHub] [hudi] ksrihari93 commented on issue #5875: [SUPPORT] Hoodie Delta streamer Job with Kafka Source fetching the same offset again and again Commiting the same offset again and again
ksrihari93 commented on issue #5875: URL: https://github.com/apache/hudi/issues/5875#issuecomment-1251882183 What happened was that the offset had expired within one partition. Fetching the offset by timestamp (which, as suggested above, is not supported) did not work out either. So the job was recovered by writing a few records to a temporary path and copying that offset into the original directory. For now we can close this issue. We are upgrading the Hudi version and will get back to you after testing if needed.
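The failure mode above, a checkpointed offset that has aged out of Kafka retention, can be handled by clamping the resume position to the earliest offset still available. This is a hypothetical illustration (the helper below is invented, not DeltaStreamer code):

```java
// Hypothetical helper, not actual DeltaStreamer code: if the checkpointed
// offset has already expired past the topic's retention window, resume from
// the earliest offset Kafka still retains instead of failing repeatedly.
public class OffsetRecoverySketch {
  public static long resumeFrom(long checkpointedOffset, long earliestAvailable) {
    return Math.max(checkpointedOffset, earliestAvailable);
  }

  public static void main(String[] args) {
    // Checkpoint says 100, but retention already purged everything below 250.
    System.out.println(resumeFrom(100L, 250L)); // 250
    // Checkpoint still valid: resume exactly where we left off.
    System.out.println(resumeFrom(500L, 250L)); // 500
  }
}
```

Note that skipping from 100 to 250 silently loses the purged records; the manual workaround in the comment (seeding a fresh checkpoint and copying it over) has the same effect, done by hand.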
[GitHub] [hudi] xushiyan commented on a diff in pull request #6722: [HUDI-4326] Fix hive sync serde properties
xushiyan commented on code in PR #6722: URL: https://github.com/apache/hudi/pull/6722#discussion_r974919332 ## hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java: ## @@ -303,6 +301,7 @@ public void testSyncCOWTableWithProperties(boolean useSchemaFromCommitMetadata, hiveDriver.run("SHOW CREATE TABLE " + dbTableName); hiveDriver.getResults(results); String ddl = String.join("\n", results); +assertTrue(ddl.contains(String.format("ROW FORMAT SERDE \n '%s'", ParquetHiveSerDe.class.getName()))); Review Comment: At least the Hive version used for the test is pinned now, so the output will always be formatted like that.
[GitHub] [hudi] hudi-bot commented on pull request #6722: [HUDI-4326] Fix hive sync serde properties
hudi-bot commented on PR #6722: URL: https://github.com/apache/hudi/pull/6722#issuecomment-1251874266 ## CI report: * e70039066ec51b7328e4f69cb2f266bdd1cc065e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11521)
[GitHub] [hudi] hudi-bot commented on pull request #6665: [HUDI-4850] Incremental Ingestion from GCS
hudi-bot commented on PR #6665: URL: https://github.com/apache/hudi/pull/6665#issuecomment-1251874119 ## CI report: * a068aefd47b77ceb65c0f7ca3857e438af2d2d2b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11509) * 2c9da2578674f51f1a68db91b3a1defe29d5cfcc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11520)
[GitHub] [hudi] codope commented on a diff in pull request #6722: [HUDI-4326] Fix hive sync serde properties
codope commented on code in PR #6722: URL: https://github.com/apache/hudi/pull/6722#discussion_r974911483 ## hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java: ## @@ -303,6 +301,7 @@ public void testSyncCOWTableWithProperties(boolean useSchemaFromCommitMetadata, hiveDriver.run("SHOW CREATE TABLE " + dbTableName); hiveDriver.getResults(results); String ddl = String.join("\n", results); +assertTrue(ddl.contains(String.format("ROW FORMAT SERDE \n '%s'", ParquetHiveSerDe.class.getName()))); Review Comment: Is `ROW FORMAT SERDE \n '%s'` a fixed format? The `getTable` API in the hive client was introduced earlier just for this testing purpose.
[GitHub] [hudi] hudi-bot commented on pull request #6722: [HUDI-4326] Fix hive sync serde properties
hudi-bot commented on PR #6722: URL: https://github.com/apache/hudi/pull/6722#issuecomment-1251871135 ## CI report: * e70039066ec51b7328e4f69cb2f266bdd1cc065e UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6665: [HUDI-4850] Incremental Ingestion from GCS
hudi-bot commented on PR #6665: URL: https://github.com/apache/hudi/pull/6665#issuecomment-1251871016 ## CI report: * a068aefd47b77ceb65c0f7ca3857e438af2d2d2b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11509) * 2c9da2578674f51f1a68db91b3a1defe29d5cfcc UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
hudi-bot commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251870554 ## CI report: * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512) * e75f6d0031490025107040c1b0093c3c5720a67d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11519)
[GitHub] [hudi] hudi-bot commented on pull request #6580: [HUDI-4792] Batch clean files to delete
hudi-bot commented on PR #6580: URL: https://github.com/apache/hudi/pull/6580#issuecomment-1251867741 ## CI report: * ff98ae0dda69ee611e4814fbae9c8ddc0a93a4f1 UNKNOWN * 99451dc89547f803eb6823b2baa620096e76459e UNKNOWN * 675221955f01a2a4fdc138af346fc78a2d11a41b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11513)
[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
hudi-bot commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251867212 ## CI report: * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512) * e75f6d0031490025107040c1b0093c3c5720a67d UNKNOWN
[GitHub] [hudi] xushiyan opened a new pull request, #6722: [HUDI-4326] Fix hive sync serde properties
xushiyan opened a new pull request, #6722: URL: https://github.com/apache/hudi/pull/6722 ### Change Logs Improve the API and refactor meta sync code for serde properties. ### Impact **Risk level: low** ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] xushiyan commented on a diff in pull request #5920: [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool
xushiyan commented on code in PR #5920: URL: https://github.com/apache/hudi/pull/5920#discussion_r974895469 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java: ## @@ -290,6 +290,9 @@ private boolean syncSchema(String tableName, boolean tableExists, boolean useRea // Sync the table properties if the schema has changed if (config.getString(HIVE_TABLE_PROPERTIES) != null || config.getBoolean(HIVE_SYNC_AS_DATA_SOURCE_TABLE)) { syncClient.updateTableProperties(tableName, tableProperties); + HoodieFileFormat baseFileFormat = HoodieFileFormat.valueOf(config.getStringOrDefault(META_SYNC_BASE_FILE_FORMAT).toUpperCase()); + String serDeFormatClassName = HoodieInputFormatUtils.getSerDeClassName(baseFileFormat); + syncClient.updateTableSerDeInfo(tableName, serDeFormatClassName, serdeProperties); Review Comment: We don't need to pass the serde class through the API; it is determined by the base file format, which is taken from the sync config.
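The point in the review comment above, that the serde class is derived from the configured base file format rather than passed in by callers, can be sketched as follows. The mapping below is a hypothetical simplification; in Hudi the actual lookup is `HoodieInputFormatUtils.getSerDeClassName`:

```java
// Hypothetical sketch of resolving the Hive serde class from the base file
// format string that comes out of the sync config.
public class SerdeLookupSketch {
  public static String serdeClassFor(String baseFileFormat) {
    switch (baseFileFormat.toUpperCase()) {
      case "PARQUET":
        return "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe";
      case "ORC":
        return "org.apache.hadoop.hive.ql.io.orc.OrcSerde";
      default:
        throw new IllegalArgumentException("Unsupported base file format: " + baseFileFormat);
    }
  }

  public static void main(String[] args) {
    // Because the format comes from config (META_SYNC_BASE_FILE_FORMAT),
    // callers of updateTableSerDeInfo never choose the serde themselves.
    System.out.println(serdeClassFor("parquet"));
  }
}
```

Keeping the mapping behind the sync client keeps the public API smaller and guarantees the serde always matches the table's actual storage format.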
[hudi] branch master updated: [MINOR] fix indent to make build pass (#6721)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 6b7157536d [MINOR] fix indent to make build pass (#6721) 6b7157536d is described below commit 6b7157536d52ee1853a0ff362d750f2d473566c2 Author: Yann Byron AuthorDate: Tue Sep 20 13:08:42 2022 +0800 [MINOR] fix indent to make build pass (#6721) --- .../src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java b/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java index 7b58f5c3f3..ba9d33a662 100644 --- a/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java +++ b/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java @@ -160,8 +160,8 @@ public class TestHiveSyncTool { assertTrue(hiveClient.tableExists(HiveTestUtil.TABLE_NAME), "Table " + HiveTestUtil.TABLE_NAME + " should exist after sync completes"); assertEquals("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", - hiveClient.getTable(HiveTestUtil.TABLE_NAME).getSd().getSerdeInfo().getSerializationLib(), - "SerDe info not updated or does not match"); + hiveClient.getTable(HiveTestUtil.TABLE_NAME).getSd().getSerdeInfo().getSerializationLib(), +"SerDe info not updated or does not match"); assertEquals(hiveClient.getMetastoreSchema(HiveTestUtil.TABLE_NAME).size(), hiveClient.getStorageSchema().getColumns().size() + 1, "Hive Schema should match the table schema + partition field");
[GitHub] [hudi] xushiyan merged pull request #6721: [MINOR] fix indent to make build pass
xushiyan merged PR #6721: URL: https://github.com/apache/hudi/pull/6721
[GitHub] [hudi] hudi-bot commented on pull request #6721: [MINOR] fix indent to make build pass
hudi-bot commented on PR #6721: URL: https://github.com/apache/hudi/pull/6721#issuecomment-1251834580 ## CI report: * d1f2d32e451793819be47b80d18ac96252165daa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11517)
[GitHub] [hudi] hudi-bot commented on pull request #6697: [HUDI-3478] Spark CDC Write
hudi-bot commented on PR #6697: URL: https://github.com/apache/hudi/pull/6697#issuecomment-1251834522 ## CI report: * da97ef7f2f1d034d15851c3676b27f06294dc557 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11507) * 596d2554f2fff757e40c9f1fad4a02034123fa12 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11515) * 539da709d04c12834b7105daa0fc3aa80f398ef2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11516)
[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
hudi-bot commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251834066 ## CI report: * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN * 20f64af242ac3e6df5d1555edf0766e7dcdd698a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470) * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512) * e75f6d0031490025107040c1b0093c3c5720a67d UNKNOWN
[jira] [Commented] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline
[ https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606881#comment-17606881 ] Teng Huo commented on HUDI-4880: Thanks Danny for replying. Yes, the new marker files are generated properly during writing. The problem is that when a task for the same compaction-request instant failed previously, its old markers get deleted, so the commit action in the next attempt does not know about the files left behind by the failed task: all the marker files it sees were generated by the second compaction attempt. As a result, the reconciliation code cannot work properly. Here I assume a marker file is created whenever a new parquet file is created; I haven't read the code for how these marker files are created, so please correct me if I'm wrong.
[GitHub] [hudi] hudi-bot commented on pull request #6721: [MINOR] fix indent to make build pass
hudi-bot commented on PR #6721: URL: https://github.com/apache/hudi/pull/6721#issuecomment-1251828724 ## CI report: * d1f2d32e451793819be47b80d18ac96252165daa UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6697: [HUDI-3478] Spark CDC Write
hudi-bot commented on PR #6697: URL: https://github.com/apache/hudi/pull/6697#issuecomment-1251828487 ## CI report: * da97ef7f2f1d034d15851c3676b27f06294dc557 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11507) * 596d2554f2fff757e40c9f1fad4a02034123fa12 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11515) * 539da709d04c12834b7105daa0fc3aa80f398ef2 UNKNOWN
[GitHub] [hudi] YannByron commented on pull request #6721: [MINOR] fix indent to make build pass
YannByron commented on PR #6721: URL: https://github.com/apache/hudi/pull/6721#issuecomment-1251822538 @nsivabalan @xushiyan https://github.com/apache/hudi/pull/5920 introduced an indentation problem that causes the build/CI to fail.
[jira] [Commented] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline
[ https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606876#comment-17606876 ] Danny Chen commented on HUDI-4880: -- Thanks for the nice analyze, the purpose for this PR is to clean the marker files on each compaction task start up, but then the compaction task would re-generate these markers when writing, so when committing compaction, the marker dir/files exists right ? > Corrupted parquet file found in Hudi Flink MOR pipeline > --- > > Key: HUDI-4880 > URL: https://issues.apache.org/jira/browse/HUDI-4880 > Project: Apache Hudi > Issue Type: Bug > Components: compaction, flink >Reporter: Teng Huo >Assignee: Teng Huo >Priority: Major > > h2. Env > Hudi version : 0.11.1 (but I believe this issue still exist in the current > version) > Flink version : 1.13 > Pipeline type: MOR, online compaction > h2. TLDR > Marker mechanism for cleaning corrupted parquet files is not effective now in > Flink MOR online compaction due to this PR: > [https://github.com/apache/hudi/pull/5611] > h2. Issue description > Recently, we suffered an issue which said there were corrupted parquet files > in Hudi table, so this Hudi table is not readable, or compaction task will > constantly fail. > e.g. Spark application complained this parquet file is too small. > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 66 in > stage 12.0 failed 4 times, most recent failure: Lost task 66.3 in stage 12.0 > (TID 156) (executor 6): java.lang.RuntimeException: > hdfs://.../0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet > is not a Parquet file (too small length: 0) > at > org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:514) > {code} > h2. Root cause > After trouble shooting, I believe we might find the root cause of this issue. 
> At the beginning, this Flink MOR pipeline failed for some reason, which > left a bunch of unfinished parquet files in the Hudi table. This is acceptable > for Hudi because it can clean them up later with the "Marker" mechanism in the method > "finalizeWrite", which calls a method named "reconcileAgainstMarkers". That method > finds the files which are in the marker folder but not in the > commit metadata, marks them as corrupted files, and then deletes them. > However, I found this part of the code didn't work as expected: the > corrupted parquet file > "0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet" was > not deleted in "20220919020324533.commit". > Then, we found [a piece of > code|https://github.com/apache/hudi/blob/13eb892081fc4ddd5e1592ef8698831972012666/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionPlanOperator.java#L134] > that deletes the marker folder at the beginning of every batch of compaction. > This defeats the mechanism for deleting corrupt files, since > all marker files created before the current batch are deleted. > We also found HDFS audit logs showing this marker folder > "hdfs://.../.hoodie/.temp/20220919020324533" was deleted multiple times in a > single Flink application, confirming the current behavior of > "CompactionPlanOperator": it deletes the marker folder every time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
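The reconciliation step described above boils down to a set difference: files that were marked (a write was started) but never made it into the commit metadata are leftovers from a failed write. A minimal sketch of that idea — a hypothetical helper, not Hudi's actual `reconcileAgainstMarkers` implementation:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the marker reconciliation idea from the analysis above: a file
// that has a marker but is absent from the commit metadata is a leftover
// from a failed write and should be deleted. Hypothetical helper, NOT the
// real HoodieTable.reconcileAgainstMarkers.
public class MarkerReconcileSketch {

    // Returns the paths that were marked (i.e. a write was started) but are
    // absent from the committed file list.
    public static Set<String> findLeftoverFiles(Set<String> markedPaths, Set<String> committedPaths) {
        Set<String> leftovers = new HashSet<>(markedPaths);
        leftovers.removeAll(committedPaths); // marked but never committed => corrupted leftover
        return leftovers;
    }
}
```

Note that if the marker folder is wiped before the compaction commits, `markedPaths` is empty for the earlier writes, so a leftover file never shows up in this difference — which is exactly the failure mode reported here.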
[GitHub] [hudi] YannByron opened a new pull request, #6721: [MINOR] fix indent to make build pass
YannByron opened a new pull request, #6721: URL: https://github.com/apache/hudi/pull/6721 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] wzx140 commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
wzx140 commented on code in PR #5629: URL: https://github.com/apache/hudi/pull/5629#discussion_r974865936 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -505,7 +514,9 @@ public class HoodieWriteConfig extends HoodieConfig { private HoodieMetadataConfig metadataConfig; private HoodieMetastoreConfig metastoreConfig; private HoodieCommonConfig commonConfig; + private HoodieStorageConfig storageConfig; private EngineType engineType; + private HoodieRecordMerger recordMerger; Review Comment: getRecordMerger will be called more than once when getting the recordType (SPARK, AVRO). Would holding the recordMerger be better? Or we could make it lazily loaded.
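The "make it lazy" option mentioned in the review could look like a small memoizing holder — a generic sketch, not tied to Hudi's actual `HoodieWriteConfig` internals:

```java
import java.util.function.Supplier;

// Generic lazy holder: the factory runs at most once, on the first get().
// Double-checked locking with a volatile field keeps repeated get() calls
// cheap while staying thread-safe.
public class LazyRef<T> {
    private final Supplier<T> factory;
    private volatile T value;

    public LazyRef(Supplier<T> factory) {
        this.factory = factory;
    }

    public T get() {
        T v = value;
        if (v == null) {               // fast path: already initialized
            synchronized (this) {
                v = value;             // re-check under the lock
                if (v == null) {
                    v = factory.get(); // runs exactly once
                    value = v;
                }
            }
        }
        return v;
    }
}
```

A config field could then be declared as `new LazyRef<>(this::createRecordMerger)` (where `createRecordMerger` is a hypothetical factory method), so repeated `getRecordMerger()` calls reuse one instance instead of re-resolving it.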
[GitHub] [hudi] TengHuo commented on pull request #5611: [HUDI-4108] Clean the marker files before starting new flink compaction
TengHuo commented on PR #5611: URL: https://github.com/apache/hudi/pull/5611#issuecomment-1251800694 We found an issue which might be related to this PR; details in https://issues.apache.org/jira/browse/HUDI-4880 Could anyone double-check whether it is the root cause? Really appreciated.
[jira] [Commented] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline
[ https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606864#comment-17606864 ] Teng Huo commented on HUDI-4880: Related with this issue: https://issues.apache.org/jira/browse/HUDI-4108
[GitHub] [hudi] IsisPolei opened a new issue, #6720: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieRemoteException: Connect to 192.168.64.107:34446 [/192.168.64.107] failed: Connection refused (
IsisPolei opened a new issue, #6720: URL: https://github.com/apache/hudi/issues/6720 hudi: 0.10.1 spark: 3.1.3_scala2.12 Background story: I use SparkRDDWriteClient to write to Hudi; both the app and the Spark standalone cluster run in Docker. When the app and Spark cluster containers run on the same local machine, my app works well. But when I deploy the Spark cluster on a different machine I get a series of connection problems. machineA (192.168.64.107): Spark driver (SparkRDDWriteClient app) machineB (192.168.64.121): Spark standalone cluster (master and worker running in two containers) Due to the Spark network connection mechanism, I have set the connection params below: spark.master.url: spark://192.168.64.121:7077 spark.driver.bindAddress: 0.0.0.0 spark.driver.host: 192.168.64.107 spark.driver.port: 1 The HoodieSparkContext initializes correctly and I can see the Spark job running in the Spark web UI. But when the code reaches sparkRDDWriteClient.upsert(), this exception occurs: Caused by: org.apache.hudi.exception.HoodieRemoteException: Connect to 192.168.64.107:34446 [/192.168.64.107] failed: Connection refused (Connection refused) at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.refresh(RemoteHoodieTableFileSystemView.java:420) at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.sync(RemoteHoodieTableFileSystemView.java:484) at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.sync(PriorityBasedFileSystemView.java:257) at org.apache.hudi.client.SparkRDDWriteClient.getTableAndInitCtx(SparkRDDWriteClient.java:493) at org.apache.hudi.client.SparkRDDWriteClient.getTableAndInitCtx(SparkRDDWriteClient.java:448) at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:157) Caused by: org.apache.http.conn.HttpHostConnectException: Connect to 192.168.64.107:34446 [/192.168.64.107] failed: Connection refused (Connection refused) at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156) at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at org.apache.http.client.fluent.Request.internalExecute(Request.java:173) at org.apache.http.client.fluent.Request.execute(Request.java:177) at org.apache.hudi.common.table.view It seems like these two containers can't connect to each other through the hoodie.filesystem.view.remote.port. So I exposed this port of my app container, but it doesn't work. Please tell me what I did wrong. These are my docker-compose.yml files: app: app: image: xxx container_name: app ports: - "5008:5005" - "1:1" - "10001:10001" spark: version: '3' services: master: image: bitnami/spark:3.1 container_name: master hostname: master environment: MASTER: spark://master:7077 restart: always ports: - "7077:7077" - "9080:8080" worker: image: bitnami/spark:3.1 container_name: worker restart: always environment: SPARK_WORKER_CORES: 5 SPARK_WORKER_MEMORY: 2g SPARK_WORKER_PORT: 8881 depends_on: - master links: - master ports: - "8081:8081" expose: - "8881" I hope I have described the situation clearly, please help.
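One common cause of a `Connection refused` on an ephemeral port like 34446 is Hudi's embedded timeline server binding to a random port that the containers never publish. A sketch of pinning it via write-client properties — the config key name and the idea that pinning resolves this are assumptions to verify against the Hudi configuration reference for your version, not a confirmed fix:

```java
import java.util.Properties;

// Hypothetical sketch: build write-client properties with the embedded
// timeline server pinned to a fixed port, so that port can be published in
// docker-compose alongside spark.driver.port. The key name
// "hoodie.embed.timeline.server.port" is an assumption to check against the
// Hudi configuration reference for the version in use (0.10.1 here).
public class TimelineServerPortSketch {
    public static Properties buildProps(int fixedPort) {
        Properties props = new Properties();
        props.setProperty("hoodie.embed.timeline.server.port", String.valueOf(fixedPort));
        return props;
    }
}
```

The pinned port (e.g. 26754) would then be added to the app container's `ports:` list the same way 5008 and 10001 are, so the workers can reach the driver-side timeline server.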
[GitHub] [hudi] wzx140 commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
wzx140 commented on code in PR #5629: URL: https://github.com/apache/hudi/pull/5629#discussion_r974857657 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecord.java: ## @@ -291,59 +284,51 @@ public void checkState() { } } - // - - // - // NOTE: This method duplicates those ones of the HoodieRecordPayload and are placed here - // for the duration of RFC-46 implementation, until migration off `HoodieRecordPayload` - // is complete - // - public abstract HoodieRecord mergeWith(HoodieRecord other, Schema readerSchema, Schema writerSchema) throws IOException; + /** + * Get column in record to support RDDCustomColumnsSortPartitioner + */ + public abstract Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled); - public abstract HoodieRecord rewriteRecord(Schema recordSchema, Schema targetSchema, TypedProperties props) throws IOException; + /** + * Support bootstrap. + */ + public abstract HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws IOException; Review Comment: What this method does is stitch two records together. The logic of the bootstrap merge is fixed. We should not let users customize the implementation of this method, right?
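The "stitching" the review describes can be pictured as a plain field-wise union of a skeleton record (Hudi meta fields) and a data record (source columns) — a toy sketch where field maps stand in for Avro/Spark rows, not Hudi's actual bootstrap `mergeWith`:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of bootstrap stitching: combine the Hudi skeleton fields with
// the source data fields into one record. Maps stand in for Avro/Spark rows;
// this is an illustration, not Hudi's real mergeWith implementation.
public class StitchSketch {
    public static Map<String, Object> stitch(Map<String, Object> skeleton, Map<String, Object> data) {
        Map<String, Object> merged = new LinkedHashMap<>(skeleton);
        merged.putAll(data); // data columns appended after the meta columns
        return merged;
    }
}
```

Because the stitch is a fixed, mechanical union like this, the reviewer's point is that there is nothing for a user-supplied merger to customize here.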
[jira] [Updated] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline
[ https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Huo updated HUDI-4880: --- Description: h2. Env Hudi version : 0.11.1 (but I believe this issue still exist in the current version) Flink version : 1.13 Pipeline type: MOR, online compaction h2. TLDR Marker mechanism for cleaning corrupted parquet files is not effective now in Flink MOR online compaction due to this PR: [https://github.com/apache/hudi/pull/5611] h2. Issue description Recently, we suffered an issue which said there were corrupted parquet files in Hudi table, so this Hudi table is not readable, or compaction task will constantly fail. e.g. Spark application complained this parquet file is too small. {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 66 in stage 12.0 failed 4 times, most recent failure: Lost task 66.3 in stage 12.0 (TID 156) (executor 6): java.lang.RuntimeException: hdfs://.../0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet is not a Parquet file (too small length: 0) at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:514) {code} h2. Root cause After trouble shooting, I believe we might find the root cause of this issue. At the beginning, this Flink MOR pipeline failed due to some reason, which left a bunch of unfinished parquet files in this Hudi table. It is acceptable for Hudi because we can clean them later with "Marker" in the method "finalizeWrite". It will call a method named "reconcileAgainstMarkers". It will find out these files which are in the marker folder, but not in the commit metadata, mark them as corrupted files, then delete them. However, I found this part of code didn't work properly as expect, this corrupted parquet file "0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet" was not deleted in "20220919020324533.commit". 
Then, we found there is [a piece of code|https://github.com/apache/hudi/blob/13eb892081fc4ddd5e1592ef8698831972012666/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionPlanOperator.java#L134] deleting the marker folder at the beginning of every batch of compaction. This causes the mechanism of deleting corrupt files to be a failure, since all marker files created before the current batch were deleted. And we found HDFS audit logs showing this marker folder "hdfs://.../.hoodie/.temp/20220919020324533" was deleted multiple times in a single Flink application, which proved the current behavior of "CompactionPlanOperator", it deletes marker folder every time. was: Adding content
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
alexeykudinkin commented on code in PR #5629: URL: https://github.com/apache/hudi/pull/5629#discussion_r974403023 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordCompatibilityInterface.java: ## @@ -18,21 +18,28 @@ package org.apache.hudi.common.model; +import java.io.IOException; +import java.util.Properties; import org.apache.avro.Schema; import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.collection.Pair; +import org.apache.hudi.keygen.BaseKeyGenerator; -import java.io.IOException; -import java.io.Serializable; -import java.util.Properties; +public interface HoodieRecordCompatibilityInterface { -/** - * HoodieMerge defines how to merge two records. It is a stateless component. - * It can implement the merging logic of HoodieRecord of different engines - * and avoid the performance consumption caused by the serialization/deserialization of Avro payload. - */ -public interface HoodieMerge extends Serializable { - - HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer); + /** + * This method used to extract HoodieKey not through keyGenerator. + */ + HoodieRecord wrapIntoHoodieRecordPayloadWithParams( Review Comment: 👍 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java: ## @@ -147,19 +141,19 @@ public static HoodieMergedLogRecordScanner.Builder newBuilder() { } @Override - protected void processNextRecord(HoodieRecord hoodieRecord) throws IOException { + protected void processNextRecord(HoodieRecord hoodieRecord) throws IOException { String key = hoodieRecord.getRecordKey(); if (records.containsKey(key)) { // Merge and store the merged record. The HoodieRecordPayload implementation is free to decide what should be // done when a DELETE (empty payload) is encountered before or after an insert/update. 
- HoodieRecord oldRecord = records.get(key); - HoodieRecordPayload oldValue = oldRecord.getData(); - HoodieRecordPayload combinedValue = (HoodieRecordPayload) merge.preCombine(oldRecord, hoodieRecord).getData(); + HoodieRecord oldRecord = records.get(key); + T oldValue = oldRecord.getData(); + T combinedValue = ((HoodieRecord) recordMerger.merge(oldRecord, hoodieRecord, readerSchema, this.hoodieTableMetaClient.getTableConfig().getProps()).get()).getData(); // If combinedValue is oldValue, no need rePut oldRecord if (combinedValue != oldValue) { -HoodieOperation operation = hoodieRecord.getOperation(); -records.put(key, new HoodieAvroRecord<>(new HoodieKey(key, hoodieRecord.getPartitionPath()), combinedValue, operation)); +hoodieRecord.setData(combinedValue); Review Comment: Why are we resetting the data instead of using new `HoodieRecord` returned by the Merger? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -126,11 +129,17 @@ public class HoodieWriteConfig extends HoodieConfig { .withDocumentation("Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. " + "This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective"); - public static final ConfigProperty MERGE_CLASS_NAME = ConfigProperty - .key("hoodie.datasource.write.merge.class") - .defaultValue(HoodieAvroRecordMerge.class.getName()) - .withDocumentation("Merge class provide stateless component interface for merging records, and support various HoodieRecord " - + "types, such as Spark records or Flink records."); + public static final ConfigProperty MERGER_IMPLS = ConfigProperty + .key("hoodie.datasource.write.merger.impls") + .defaultValue(HoodieAvroRecordMerger.class.getName()) + .withDocumentation("List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. 
" + + "These merger impls will filter by hoodie.datasource.write.merger.strategy " + + "Hudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc)"); + + public static final ConfigProperty MERGER_STRATEGY = ConfigProperty + .key("hoodie.datasource.write.merger.strategy") + .defaultValue(StringUtils.DEFAULT_MERGER_STRATEGY_UUID) Review Comment: Let's move this to HoodieMerger, rather than `StringUtils` (we can do it in a follow-up) ## hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java: ## @@ -156,11 +155,10 @@ public class HoodieTableConfig extends HoodieConfig { .withDocumentation("Payload class to use for performing compactions, i.e merge delta logs with current base file and then " + " produce a new base file."); - public static final ConfigProperty MERGE_CLASS_NAME = ConfigProperty - .key("hoodie.compact
[GitHub] [hudi] hudi-bot commented on pull request #6580: [HUDI-4792] Batch clean files to delete
hudi-bot commented on PR #6580: URL: https://github.com/apache/hudi/pull/6580#issuecomment-1251791338 ## CI report: * ff98ae0dda69ee611e4814fbae9c8ddc0a93a4f1 UNKNOWN * 99451dc89547f803eb6823b2baa620096e76459e UNKNOWN * 11ba7cd991ca83773aae03b1fd7271364079be21 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11435) * 675221955f01a2a4fdc138af346fc78a2d11a41b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11513) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6697: [HUDI-3478] Spark CDC Write
hudi-bot commented on PR #6697: URL: https://github.com/apache/hudi/pull/6697#issuecomment-1251791430 ## CI report: * da97ef7f2f1d034d15851c3676b27f06294dc557 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11507) * 596d2554f2fff757e40c9f1fad4a02034123fa12 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11515) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
hudi-bot commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251790938 ## CI report: * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN * 20f64af242ac3e6df5d1555edf0766e7dcdd698a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470) * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline
[ https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Huo updated HUDI-4880: --- Description: Adding content > Corrupted parquet file found in Hudi Flink MOR pipeline > --- > > Key: HUDI-4880 > URL: https://issues.apache.org/jira/browse/HUDI-4880 > Project: Apache Hudi > Issue Type: Bug > Components: compaction, flink >Reporter: Teng Huo >Assignee: Teng Huo >Priority: Major > > Adding content -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #6719: [HUDI-4718] Add Kerberos kinit command support for cli.
hudi-bot commented on PR #6719: URL: https://github.com/apache/hudi/pull/6719#issuecomment-1251788958 ## CI report: * 4b88f9a2c614e5002bdb029a267a6c351386357b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11514) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6697: [HUDI-3478] Spark CDC Write
hudi-bot commented on PR #6697: URL: https://github.com/apache/hudi/pull/6697#issuecomment-125170 ## CI report: * da97ef7f2f1d034d15851c3676b27f06294dc557 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11507) * 596d2554f2fff757e40c9f1fad4a02034123fa12 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6580: [HUDI-4792] Batch clean files to delete
hudi-bot commented on PR #6580: URL: https://github.com/apache/hudi/pull/6580#issuecomment-1251788754 ## CI report: * ff98ae0dda69ee611e4814fbae9c8ddc0a93a4f1 UNKNOWN * 99451dc89547f803eb6823b2baa620096e76459e UNKNOWN * 11ba7cd991ca83773aae03b1fd7271364079be21 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11435) * 675221955f01a2a4fdc138af346fc78a2d11a41b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6516: [HUDI-4729] Fix fq can not be queried in pending compaction when query ro table with spark
hudi-bot commented on PR #6516: URL: https://github.com/apache/hudi/pull/6516#issuecomment-1251788682 ## CI report: * f5cc48018e7d4e542c3d0b2dc677f4b708f03f11 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11502) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
hudi-bot commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251788382 ## CI report: * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN * 20f64af242ac3e6df5d1555edf0766e7dcdd698a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470) * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4878) Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS
[ https://issues.apache.org/jira/browse/HUDI-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4878: - Labels: pull-request-available (was: ) > Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS > > > Key: HUDI-4878 > URL: https://issues.apache.org/jira/browse/HUDI-4878 > Project: Apache Hudi > Issue Type: Improvement > Components: cleaning >Reporter: sivabalan narayanan >Assignee: nicolas paris >Priority: Major > Labels: pull-request-available > > clean based on LATEST_FILE_VERSIONS can be improved further since incremental > clean is not enabled. lets see if we can improvise. > > context from author: > > > Currently incremental cleaning is run for both KEEP_LATEST_COMMITS, > KEEP_LATEST_BY_HOURS > policies. It is not run when KEEP_LATEST_FILE_VERSIONS. > This can lead to not cleaning files. This PR fixes this problem by enabling > incremental cleaning for KEEP_LATEST_FILE_VERSIONS only. > Here is the scenario of the problem: > Say we have 3 committed files in partition-A and we add a new commit in > partition-B, and we trigger cleaning for the first time (full partition scan): > {{partition-A/ > commit-0.parquet > commit-1.parquet > commit-2.parquet > partition-B/ > commit-3.parquet}} > In the case say we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3, > the cleaner will remove the commit-0.parquet to keep 3 commits. > For the next cleaning, incremental cleaning will trigger, and won't consider > partition-A/ until a new commit change it. In case no later commit changes > partition-A then commit-1.parquet will stay forever. However it should be > removed by the cleaner. > Now if in case of KEEP_LATEST_FILE_VERSIONS, the cleaner will only keep > commit-2.parquet. Then it makes sense that incremental cleaning won't > consider partition-A until it is changed. Because there is only one commit. 
> This is why incremental cleaning should only be enabled with KEEP_LATEST_FILE_VERSIONS. > Hope this is clear enough. -- This message was sent by Atlassian Jira (v8.20.10#820010)
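The retention arithmetic in this scenario can be simulated with plain collections. The class below (CleanerSim, keepLatestCommits, keepLatestFileVersions) is an illustrative stand-in, not Hudi's actual CleanPlanner; KEEP_LATEST_FILE_VERSIONS is simulated with a single retained version per partition.

```java
import java.util.*;

// Illustrative stand-in for the two cleaning policies in the scenario above.
// Each partition maps to the commit numbers of the parquet files it holds.
public class CleanerSim {

    // KEEP_LATEST_COMMITS: files written before the retention window
    // (the last `retained` commits, globally) are eligible for cleaning.
    static List<String> keepLatestCommits(Map<String, List<Integer>> parts, int retained) {
        int maxCommit = parts.values().stream()
                .flatMap(List::stream).max(Integer::compare).orElse(0);
        int earliestRetained = maxCommit - retained + 1;
        List<String> toDelete = new ArrayList<>();
        parts.forEach((p, commits) -> commits.stream()
                .filter(c -> c < earliestRetained)
                .forEach(c -> toDelete.add(p + "/commit-" + c + ".parquet")));
        return toDelete;
    }

    // KEEP_LATEST_FILE_VERSIONS with 1 version: keep only the newest file per partition.
    static List<String> keepLatestFileVersions(Map<String, List<Integer>> parts) {
        List<String> toDelete = new ArrayList<>();
        parts.forEach((p, commits) -> {
            int newest = Collections.max(commits);
            commits.stream().filter(c -> c != newest)
                    .forEach(c -> toDelete.add(p + "/commit-" + c + ".parquet"));
        });
        return toDelete;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> table = new LinkedHashMap<>();
        table.put("partition-A", Arrays.asList(0, 1, 2));
        table.put("partition-B", Arrays.asList(3));

        // First (full-scan) clean under KEEP_LATEST_COMMITS, retained=3:
        System.out.println(keepLatestCommits(table, 3)); // [partition-A/commit-0.parquet]

        // Commit-4 lands in partition-B; an incremental clean scans only the
        // partitions touched since the last clean, so partition-A is skipped
        // and commit-1.parquet leaks even though the window moved past it:
        Map<String, List<Integer>> touchedOnly = new LinkedHashMap<>();
        touchedOnly.put("partition-B", Arrays.asList(3, 4));
        System.out.println(keepLatestCommits(touchedOnly, 3)); // []

        // KEEP_LATEST_FILE_VERSIONS cleans partition-A down to one file in the
        // first pass, so skipping unchanged partitions afterwards is safe:
        System.out.println(keepLatestFileVersions(table));
        // [partition-A/commit-0.parquet, partition-A/commit-1.parquet]
    }
}
```

This makes the asymmetry concrete: under KEEP_LATEST_COMMITS the retention boundary is global and keeps moving, so skipping unchanged partitions loses deletions; under KEEP_LATEST_FILE_VERSIONS the decision is purely local to each partition.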
[GitHub] [hudi] nsivabalan commented on pull request #6498: [HUDI-4878] Fix incremental cleaner use case
nsivabalan commented on PR #6498: URL: https://github.com/apache/hudi/pull/6498#issuecomment-1251786668 @parisni : hey hi. we have a code freeze coming up in a week's time for 0.12.1. Just wanted to keep you informed.
[GitHub] [hudi] hudi-bot commented on pull request #6719: [HUDI-4718] Add Kerberos kinit command support for cli.
hudi-bot commented on PR #6719: URL: https://github.com/apache/hudi/pull/6719#issuecomment-1251786271 ## CI report: * 4b88f9a2c614e5002bdb029a267a6c351386357b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6516: [HUDI-4729] Fix fq can not be queried in pending compaction when query ro table with spark
hudi-bot commented on PR #6516: URL: https://github.com/apache/hudi/pull/6516#issuecomment-1251786002 ## CI report: * f5cc48018e7d4e542c3d0b2dc677f4b708f03f11 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11502) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nsivabalan commented on pull request #4015: [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data
nsivabalan commented on PR #4015: URL: https://github.com/apache/hudi/pull/4015#issuecomment-1251785335 Also, do you think you could write a unit test to cover the scenario? We have some tests around corrupt blocks already [here](https://github.com/apache/hudi/blob/e03c0388d9ddea7e4d650b3f7a64dff41e180d50/hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java#L606).
[GitHub] [hudi] nsivabalan commented on pull request #4015: [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data
nsivabalan commented on PR #4015: URL: https://github.com/apache/hudi/pull/4015#issuecomment-1251784271 @hj2016 : thanks for the elaborate explanation. The fix makes sense. The only comment I have is: within isBlockCorrupted(), we return in two or three places in case of a corrupt block. So maybe we could avoid seeking inside isBlockCorrupted() for corrupt blocks and reset the stream from the caller's side instead:
```
long blockSize = inputStream.readLong();
...
boolean isBlockCorrupt = isBlockCorrupted(blockSize);
if (isBlockCorrupt) {
  inputStream.seek(blockStartPos);
  return createCorruptBlock();
}
```
Within isBlockCorrupted(blockSize), we should not do any seek in case of a corrupt block. Let me know what you think.
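The pattern being suggested here — a corruption check with no stream-position side effects, and a single seek-back owned by the caller — can be sketched with stand-in types. This is an illustration of the pattern only: SeekableBuffer and the size check below are made up for the sketch, not Hudi's actual HoodieLogFileReader code.

```java
// Stand-ins only: SeekableBuffer mimics a seekable input stream; the size check
// is a simplified corruption test, not Hudi's actual block validation.
public class CorruptBlockSketch {

    static class SeekableBuffer {
        final byte[] data;
        long pos;
        SeekableBuffer(byte[] data) { this.data = data; }
        void seek(long p) { pos = p; }
        long getPos() { return pos; }
        long length() { return data.length; }
    }

    // Pure predicate: whatever probing it does, it restores the position it
    // started from, so callers never have to reason about the cursor.
    static boolean isBlockCorrupted(SeekableBuffer in, long blockSize) {
        long start = in.getPos();
        boolean corrupt = blockSize <= 0 || start + blockSize > in.length();
        in.seek(start); // always restore, corrupt or not
        return corrupt;
    }

    // The caller owns the reset-to-block-start decision, in exactly one place.
    static String readBlock(SeekableBuffer in, long blockStartPos, long blockSize) {
        if (isBlockCorrupted(in, blockSize)) {
            in.seek(blockStartPos); // single seek-back on the corrupt path
            return "CORRUPT_BLOCK";
        }
        in.seek(in.getPos() + blockSize); // consume the block
        return "DATA_BLOCK";
    }

    public static void main(String[] args) {
        SeekableBuffer in = new SeekableBuffer(new byte[16]);
        in.seek(4);
        System.out.println(readBlock(in, 4, 8));    // DATA_BLOCK
        System.out.println(readBlock(in, 12, 100)); // CORRUPT_BLOCK
    }
}
```

The design point is that the predicate has one observable contract (position unchanged), so the multiple early-return sites inside it no longer need to coordinate seeks with the caller.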
[GitHub] [hudi] xushiyan commented on a diff in pull request #5920: [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool
xushiyan commented on code in PR #5920: URL: https://github.com/apache/hudi/pull/5920#discussion_r974843935 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveSyncClient.java: ## @@ -316,4 +353,11 @@ public void updateTableComments(String tableName, List fromMetastor
   }
 }
+  Table getTable(String tableName) {
+    try {
+      return client.getTable(databaseName, tableName);
+    } catch (TException e) {
+      throw new HoodieHiveSyncException(String.format("Database: %s, Table: %s does not exist", databaseName, tableName), e);
+    }
+  }
Review Comment: Where do we use this method? Other methods in this class just call client.getTable(). We should not introduce random helpers that lower code quality.
[jira] [Created] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline
Teng Huo created HUDI-4880: -- Summary: Corrupted parquet file found in Hudi Flink MOR pipeline Key: HUDI-4880 URL: https://issues.apache.org/jira/browse/HUDI-4880 Project: Apache Hudi Issue Type: Bug Components: compaction, flink Reporter: Teng Huo Assignee: Teng Huo
[GitHub] [hudi] wzx140 commented on pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback
wzx140 commented on PR #6132: URL: https://github.com/apache/hudi/pull/6132#issuecomment-1251779085 @alexeykudinkin I have changed func updateMetadataValues(Schema recordSchema, Properties props, MetadataValues metadataValues). truncateRecordKey will be put in HoodieRecordCompatibilityInterface.
[jira] [Closed] (HUDI-4326) Hudi spark datasource error after migrate from 0.8 to 0.11
[ https://issues.apache.org/jira/browse/HUDI-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-4326. - Resolution: Fixed > Hudi spark datasource error after migrate from 0.8 to 0.11 > -- > > Key: HUDI-4326 > URL: https://issues.apache.org/jira/browse/HUDI-4326 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Reporter: Kyle Zhike Chen >Assignee: Kyle Zhike Chen >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.1 > > > After updating Hudi from 0.8 to 0.11, using {{spark.table(fullTableName)}} to read a Hudi table no longer works. The table has been synced to the Hive metastore and Spark is connected to the metastore. The error is:
> org.sparkproject.guava.util.concurrent.UncheckedExecutionException: org.apache.hudi.exception.HoodieException: 'path' or 'Key: 'hoodie.datasource.read.paths' , default: null description: Comma separated list of file paths to read within a Hudi table. since version: version is not defined deprecated after: version is not defined)' or both must be specified.
> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
> at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.
> ...
> Caused by: org.apache.hudi.exception.HoodieException: 'path' or 'Key: 'hoodie.datasource.read.paths' , default: null description: Comma separated list of file paths to read within a Hudi table. since version: version is not defined deprecated after: version is not defined)' or both must be specified.
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:78)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
> at org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:261)
> at org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
> at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> After changing the table to a Spark data source table, the table SerDeInfo is missing. I created a pull request.
>
> Related GH issue: https://github.com/apache/hudi/issues/5861
[hudi] branch master updated (b5dc4312a4 -> e03c0388d9)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from b5dc4312a4 [HUDI-4877] Fix org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not work correct issue (#6717) add e03c0388d9 [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool (#5920) No new revisions were added by this update. Summary of changes: .../java/org/apache/hudi/hive/HiveSyncTool.java| 3 ++ .../org/apache/hudi/hive/HoodieHiveSyncClient.java | 44 ++ .../org/apache/hudi/hive/TestHiveSyncTool.java | 3 ++ .../hudi/sync/common/HoodieMetaSyncOperations.java | 6 +++ 4 files changed, 56 insertions(+)
[GitHub] [hudi] nsivabalan merged pull request #5920: [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool
nsivabalan merged PR #5920: URL: https://github.com/apache/hudi/pull/5920
[GitHub] [hudi] nsivabalan commented on pull request #6580: [HUDI-4792] Batch clean files to delete
nsivabalan commented on PR #6580: URL: https://github.com/apache/hudi/pull/6580#issuecomment-1251769488 @parisni : can u checkout the CI failures from last run https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=11435&view=results
[GitHub] [hudi] nsivabalan commented on pull request #6580: [HUDI-4792] Batch clean files to delete
nsivabalan commented on PR #6580: URL: https://github.com/apache/hudi/pull/6580#issuecomment-1251769268 have pushed out a commit addressing feedback. we should be good to land once CI succeeds.
[GitHub] [hudi] microbearz commented on pull request #6516: [HUDI-4729] Fix fq can not be queried in pending compaction when query ro table with spark
microbearz commented on PR #6516: URL: https://github.com/apache/hudi/pull/6516#issuecomment-1251768404 @hudi-bot run azure
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6580: [HUDI-4792] Batch clean files to delete
nsivabalan commented on code in PR #6580: URL: https://github.com/apache/hudi/pull/6580#discussion_r974834548 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java: ## @@ -290,9 +296,10 @@ private Pair<Boolean, List<CleanFileInfo>> getFilesToCleanKeepingLatestCommits(S
 * @return A {@link Pair} whose left is boolean indicating whether partition itself needs to be deleted,
 * and right is a list of {@link CleanFileInfo} about the files in the partition that needs to be deleted.
 */
- private Pair<Boolean, List<CleanFileInfo>> getFilesToCleanKeepingLatestCommits(String partitionPath, int commitsRetained, HoodieCleaningPolicy policy) {
   LOG.info("Cleaning " + partitionPath + ", retaining latest " + commitsRetained + " commits. ");
   List<CleanFileInfo> deletePaths = new ArrayList<>();
+  Map<String, Pair<Boolean, List<CleanFileInfo>>> map = new HashMap<>();
Review Comment: minor. `map` -> `cleanFileInfoPerPartitionMap`
## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java: ## @@ -290,9 +296,10 @@ private Pair<Boolean, List<CleanFileInfo>> getFilesToCleanKeepingLatestCommits(S
 * @return A {@link Pair} whose left is boolean indicating whether partition itself needs to be deleted,
 * and right is a list of {@link CleanFileInfo} about the files in the partition that needs to be deleted.
 */
- private Pair<Boolean, List<CleanFileInfo>> getFilesToCleanKeepingLatestCommits(String partitionPath, int commitsRetained, HoodieCleaningPolicy policy) {
+ private Map<String, Pair<Boolean, List<CleanFileInfo>>> getFilesToCleanKeepingLatestCommits(List<String> partitionPath, int commitsRetained, HoodieCleaningPolicy policy) {
Review Comment: minor. Let's name the argument as plural: `partitionPath` -> `partitionPaths`
[GitHub] [hudi] YannByron commented on a diff in pull request #6697: [HUDI-3478] Spark CDC Write
YannByron commented on code in PR #6697: URL: https://github.com/apache/hudi/pull/6697#discussion_r974834574 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java: ## @@ -0,0 +1,263 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.io; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; +import org.apache.avro.generic.IndexedRecord; + +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +import org.apache.hudi.avro.HoodieAvroUtils; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieAvroPayload; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.table.HoodieTableConfig; +import org.apache.hudi.common.table.cdc.HoodieCDCOperation; +import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode; +import org.apache.hudi.common.table.cdc.HoodieCDCUtils; +import org.apache.hudi.common.table.log.AppendResult; +import org.apache.hudi.common.table.log.HoodieLogFormat; +import org.apache.hudi.common.table.log.block.HoodieCDCDataBlock; +import org.apache.hudi.common.table.log.block.HoodieLogBlock; +import org.apache.hudi.common.util.DefaultSizeEstimator; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.StringUtils; +import org.apache.hudi.common.util.collection.ExternalSpillableMap; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.exception.HoodieUpsertException; + +import java.io.Closeable; +import java.io.IOException; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +/** + * This class encapsulates all the cdc-writing functions. 
+ */ +public class HoodieCDCLogger implements Closeable { + + private final String commitTime; + + private final String keyField; + + private final Schema dataSchema; + + private final boolean populateMetaFields; + + // writer for cdc data + private final HoodieLogFormat.Writer cdcWriter; + + private final boolean cdcEnabled; + + private final HoodieCDCSupplementalLoggingMode cdcSupplementalLoggingMode; + + private final Schema cdcSchema; + + // the cdc data + private final Map cdcData; + + public HoodieCDCLogger( + String commitTime, + HoodieWriteConfig config, + HoodieTableConfig tableConfig, + Schema schema, + HoodieLogFormat.Writer cdcWriter, + long maxInMemorySizeInBytes) { +try { + this.commitTime = commitTime; + this.dataSchema = HoodieAvroUtils.removeMetadataFields(schema); + this.populateMetaFields = config.populateMetaFields(); + this.keyField = populateMetaFields ? HoodieRecord.RECORD_KEY_METADATA_FIELD + : tableConfig.getRecordKeyFieldProp(); + this.cdcWriter = cdcWriter; + + this.cdcEnabled = config.getBooleanOrDefault(HoodieTableConfig.CDC_ENABLED); + this.cdcSupplementalLoggingMode = HoodieCDCSupplementalLoggingMode.parse( + config.getStringOrDefault(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE)); + + if (cdcSupplementalLoggingMode.equals(HoodieCDCSupplementalLoggingMode.WITH_BEFORE_AFTER)) { +this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA; + } else if (cdcSupplementalLoggingMode.equals(HoodieCDCSupplementalLoggingMode.WITH_BEFORE)) { +this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA_OP_RECORDKEY_BEFORE; + } else { +this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA_OP_AND_RECORDKEY; + } + + this.cdcData = new ExternalSpillableMap<>( + maxInMemorySizeInBytes, + config.getSpillableMapBasePath(), + new DefaultSizeEstimator<>(), + new DefaultSizeEstimator<>(), + config.getCommonConfig().getSpillableDiskMapType(), + config.getCommonConfig().isBitCaskDiskMapCompressionEnabled() + ); +} catch (IOException e) { + throw new HoodieUpsertException("Failed to initial
[GitHub] [hudi] YannByron commented on a diff in pull request #6697: [HUDI-3478] Spark CDC Write
YannByron commented on code in PR #6697: URL: https://github.com/apache/hudi/pull/6697#discussion_r974829925 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java: ## @@ -0,0 +1,263 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.io; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; +import org.apache.avro.generic.IndexedRecord; + +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +import org.apache.hudi.avro.HoodieAvroUtils; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieAvroPayload; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.table.HoodieTableConfig; +import org.apache.hudi.common.table.cdc.HoodieCDCOperation; +import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode; +import org.apache.hudi.common.table.cdc.HoodieCDCUtils; +import org.apache.hudi.common.table.log.AppendResult; +import org.apache.hudi.common.table.log.HoodieLogFormat; +import org.apache.hudi.common.table.log.block.HoodieCDCDataBlock; +import org.apache.hudi.common.table.log.block.HoodieLogBlock; +import org.apache.hudi.common.util.DefaultSizeEstimator; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.StringUtils; +import org.apache.hudi.common.util.collection.ExternalSpillableMap; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.exception.HoodieUpsertException; + +import java.io.Closeable; +import java.io.IOException; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +/** + * This class encapsulates all the cdc-writing functions. 
+ */ +public class HoodieCDCLogger implements Closeable { + + private final String commitTime; + + private final String keyField; + + private final Schema dataSchema; + + private final boolean populateMetaFields; + + // writer for cdc data + private final HoodieLogFormat.Writer cdcWriter; + + private final boolean cdcEnabled; + + private final HoodieCDCSupplementalLoggingMode cdcSupplementalLoggingMode; + + private final Schema cdcSchema; + + // the cdc data + private final Map cdcData; + + public HoodieCDCLogger( + String commitTime, + HoodieWriteConfig config, + HoodieTableConfig tableConfig, + Schema schema, + HoodieLogFormat.Writer cdcWriter, + long maxInMemorySizeInBytes) { +try { + this.commitTime = commitTime; + this.dataSchema = HoodieAvroUtils.removeMetadataFields(schema); + this.populateMetaFields = config.populateMetaFields(); + this.keyField = populateMetaFields ? HoodieRecord.RECORD_KEY_METADATA_FIELD + : tableConfig.getRecordKeyFieldProp(); + this.cdcWriter = cdcWriter; + + this.cdcEnabled = config.getBooleanOrDefault(HoodieTableConfig.CDC_ENABLED); + this.cdcSupplementalLoggingMode = HoodieCDCSupplementalLoggingMode.parse( + config.getStringOrDefault(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE)); + + if (cdcSupplementalLoggingMode.equals(HoodieCDCSupplementalLoggingMode.WITH_BEFORE_AFTER)) { +this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA; + } else if (cdcSupplementalLoggingMode.equals(HoodieCDCSupplementalLoggingMode.WITH_BEFORE)) { +this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA_OP_RECORDKEY_BEFORE; + } else { +this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA_OP_AND_RECORDKEY; + } + + this.cdcData = new ExternalSpillableMap<>( + maxInMemorySizeInBytes, + config.getSpillableMapBasePath(), + new DefaultSizeEstimator<>(), + new DefaultSizeEstimator<>(), + config.getCommonConfig().getSpillableDiskMapType(), + config.getCommonConfig().isBitCaskDiskMapCompressionEnabled() + ); +} catch (IOException e) { + throw new HoodieUpsertException("Failed to initial
[jira] [Updated] (HUDI-4718) Hudi cli does not support Kerberized Hadoop cluster
[ https://issues.apache.org/jira/browse/HUDI-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4718: - Labels: pull-request-available (was: ) > Hudi cli does not support Kerberized Hadoop cluster > --- > > Key: HUDI-4718 > URL: https://issues.apache.org/jira/browse/HUDI-4718 > Project: Apache Hudi > Issue Type: Bug > Components: cli >Reporter: Yao Zhang >Assignee: Yao Zhang >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > > The Hudi CLI connect command cannot read a table from a Kerberized Hadoop cluster, and there is no way to perform Kerberos authentication. > I plan to add this feature.
[GitHub] [hudi] paul8263 opened a new pull request, #6719: [HUDI-4718] Add Kerberos kinit command support for cli.
paul8263 opened a new pull request, #6719: URL: https://github.com/apache/hudi/pull/6719 ### Change Logs Added a Kerberos kinit command for the Hudi CLI. It now supports connecting to a Kerberized Hadoop cluster. As it may be complicated to prepare a temporary Kerberized environment for unit tests, I have tested it with a local Kerberized Hadoop cluster. ### Impact It has no impact on performance. **Risk level: none** ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] YannByron commented on a diff in pull request #6697: [HUDI-3478] Spark CDC Write
YannByron commented on code in PR #6697: URL: https://github.com/apache/hudi/pull/6697#discussion_r974827018 ## hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCUtils.java: ## @@ -0,0 +1,150 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.cdc; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; + +import org.apache.hudi.exception.HoodieException; + +public class HoodieCDCUtils { + + public static final String CDC_LOGFILE_SUFFIX = "-cdc"; + + /* the `op` column represents how a record is changed. */ + public static final String CDC_OPERATION_TYPE = "op"; + + /* the `ts` column represents when a record is changed. 
*/ + public static final String CDC_COMMIT_TIMESTAMP = "ts"; + + /* the pre-image before one record is changed */ + public static final String CDC_BEFORE_IMAGE = "before"; + + /* the post-image after one record is changed */ + public static final String CDC_AFTER_IMAGE = "after"; + + /* the key of the changed record */ + public static final String CDC_RECORD_KEY = "record_key"; + + public static final String[] CDC_COLUMNS = new String[] { + CDC_OPERATION_TYPE, + CDC_COMMIT_TIMESTAMP, + CDC_BEFORE_IMAGE, + CDC_AFTER_IMAGE + }; + + /** + * This is the standard CDC output format. + * Also, this is the schema of cdc log file in the case `hoodie.table.cdc.supplemental.logging.mode` is 'cdc_data_before_after'. + */ + public static final String CDC_SCHEMA_STRING = "{\"type\":\"record\",\"name\":\"Record\"," + + "\"fields\":[" + + "{\"name\":\"op\",\"type\":[\"string\",\"null\"]}," + + "{\"name\":\"ts\",\"type\":[\"string\",\"null\"]}," + + "{\"name\":\"before\",\"type\":[\"string\",\"null\"]}," + + "{\"name\":\"after\",\"type\":[\"string\",\"null\"]}" + + "]}"; + + public static final Schema CDC_SCHEMA = new Schema.Parser().parse(CDC_SCHEMA_STRING); + + /** + * The schema of cdc log file in the case `hoodie.table.cdc.supplemental.logging.mode` is 'cdc_data_before'. + */ + public static final String CDC_SCHEMA_OP_RECORDKEY_BEFORE_STRING = "{\"type\":\"record\",\"name\":\"Record\"," + + "\"fields\":[" + + "{\"name\":\"op\",\"type\":[\"string\",\"null\"]}," + + "{\"name\":\"record_key\",\"type\":[\"string\",\"null\"]}," + + "{\"name\":\"before\",\"type\":[\"string\",\"null\"]}" + + "]}"; + + public static final Schema CDC_SCHEMA_OP_RECORDKEY_BEFORE = + new Schema.Parser().parse(CDC_SCHEMA_OP_RECORDKEY_BEFORE_STRING); + + /** + * The schema of cdc log file in the case `hoodie.table.cdc.supplemental.logging.mode` is 'op_key'. 
+ */ + public static final String CDC_SCHEMA_OP_AND_RECORDKEY_STRING = "{\"type\":\"record\",\"name\":\"Record\"," + + "\"fields\":[" + + "{\"name\":\"op\",\"type\":[\"string\",\"null\"]}," + + "{\"name\":\"record_key\",\"type\":[\"string\",\"null\"]}" + + "]}"; + + public static final Schema CDC_SCHEMA_OP_AND_RECORDKEY = + new Schema.Parser().parse(CDC_SCHEMA_OP_AND_RECORDKEY_STRING); + + public static final Schema schemaBySupplementalLoggingMode(HoodieCDCSupplementalLoggingMode supplementalLoggingMode) { +switch (supplementalLoggingMode) { + case WITH_BEFORE_AFTER: +return CDC_SCHEMA; + case WITH_BEFORE: +return CDC_SCHEMA_OP_RECORDKEY_BEFORE; + case OP_KEY: +return CDC_SCHEMA_OP_AND_RECORDKEY; + default: +throw new HoodieException("not support this supplemental logging mode: " + supplementalLoggingMode); +} + } + + /** + * Build the cdc record which has all the cdc fields when `hoodie.table.cdc.supplemental.logging.mode` is 'cdc_data_before_after'. + */ + public static GenericData.Record cdcRecord( + String op, String commitTime, GenericRecord before, GenericRecord after) { +String beforeJsonStr = recordToJson(before); Review Comment: if the before/after value is store as the json string, we can load the cdc log file and return without any serialize/deserialize. IMO, persisting cdc data is an operation that consumes more storage in exchange
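The three schema constants above differ only in which CDC columns each supplemental logging mode retains. A minimal sketch of that relationship (illustrative Python, not Hudi code; the mode names and field lists are taken from the diff above):

```python
import json

# Field sets per `hoodie.table.cdc.supplemental.logging.mode`,
# mirroring the three Avro schema strings in the diff above.
CDC_FIELDS_BY_MODE = {
    "cdc_data_before_after": ["op", "ts", "before", "after"],
    "cdc_data_before": ["op", "record_key", "before"],
    "op_key": ["op", "record_key"],
}

def cdc_schema_json(mode):
    """Build the Avro record-schema JSON string for the given logging mode."""
    if mode not in CDC_FIELDS_BY_MODE:
        raise ValueError("unsupported supplemental logging mode: " + mode)
    return json.dumps({
        "type": "record",
        "name": "Record",
        # every CDC column is a nullable string, as in CDC_SCHEMA_STRING
        "fields": [{"name": f, "type": ["string", "null"]}
                   for f in CDC_FIELDS_BY_MODE[mode]],
    })
```

Richer modes store more per log block but let readers serve before/after images without re-reading data files, which is the storage-vs-read-cost trade-off the review comment raises.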
[GitHub] [hudi] wzx140 commented on pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback
wzx140 commented on PR #6132: URL: https://github.com/apache/hudi/pull/6132#issuecomment-1251754051 @alexeykudinkin Thank you for your suggestion. This will be fixed soon.
[GitHub] [hudi] wzx140 commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
wzx140 commented on PR #5629: URL: https://github.com/apache/hudi/pull/5629#issuecomment-1251753429 @alexeykudinkin I've already rebased on master and added the mergerStrategy config with a UUID. You can do the final review.
[GitHub] [hudi] YannByron commented on a diff in pull request #6697: [HUDI-3478] Spark CDC Write
YannByron commented on code in PR #6697: URL: https://github.com/apache/hudi/pull/6697#discussion_r974821737 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java: ## @@ -0,0 +1,263 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.io; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; +import org.apache.avro.generic.IndexedRecord; + +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +import org.apache.hudi.avro.HoodieAvroUtils; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieAvroPayload; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.table.HoodieTableConfig; +import org.apache.hudi.common.table.cdc.HoodieCDCOperation; +import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode; +import org.apache.hudi.common.table.cdc.HoodieCDCUtils; +import org.apache.hudi.common.table.log.AppendResult; +import org.apache.hudi.common.table.log.HoodieLogFormat; +import org.apache.hudi.common.table.log.block.HoodieCDCDataBlock; +import org.apache.hudi.common.table.log.block.HoodieLogBlock; +import org.apache.hudi.common.util.DefaultSizeEstimator; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.StringUtils; +import org.apache.hudi.common.util.collection.ExternalSpillableMap; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.exception.HoodieUpsertException; + +import java.io.Closeable; +import java.io.IOException; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +/** + * This class encapsulates all the cdc-writing functions. 
+ */ +public class HoodieCDCLogger implements Closeable { + + private final String commitTime; + + private final String keyField; + + private final Schema dataSchema; + + private final boolean populateMetaFields; + + // writer for cdc data + private final HoodieLogFormat.Writer cdcWriter; + + private final boolean cdcEnabled; + + private final HoodieCDCSupplementalLoggingMode cdcSupplementalLoggingMode; + + private final Schema cdcSchema; + + // the cdc data + private final Map cdcData; + + public HoodieCDCLogger( + String commitTime, + HoodieWriteConfig config, + HoodieTableConfig tableConfig, + Schema schema, + HoodieLogFormat.Writer cdcWriter, + long maxInMemorySizeInBytes) { +try { + this.commitTime = commitTime; + this.dataSchema = HoodieAvroUtils.removeMetadataFields(schema); + this.populateMetaFields = config.populateMetaFields(); + this.keyField = populateMetaFields ? HoodieRecord.RECORD_KEY_METADATA_FIELD + : tableConfig.getRecordKeyFieldProp(); + this.cdcWriter = cdcWriter; + + this.cdcEnabled = config.getBooleanOrDefault(HoodieTableConfig.CDC_ENABLED); + this.cdcSupplementalLoggingMode = HoodieCDCSupplementalLoggingMode.parse( + config.getStringOrDefault(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE)); + + if (cdcSupplementalLoggingMode.equals(HoodieCDCSupplementalLoggingMode.WITH_BEFORE_AFTER)) { +this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA; + } else if (cdcSupplementalLoggingMode.equals(HoodieCDCSupplementalLoggingMode.WITH_BEFORE)) { +this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA_OP_RECORDKEY_BEFORE; + } else { +this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA_OP_AND_RECORDKEY; + } + + this.cdcData = new ExternalSpillableMap<>( Review Comment: thank you for pointing this out. i will think deeply about it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [hudi] ankitchandnani commented on issue #6404: [SUPPORT] Hudi Deltastreamer CSV ingestion issue
ankitchandnani commented on issue #6404: URL: https://github.com/apache/hudi/issues/6404#issuecomment-1251749105 Hi @nsivabalan, could you please share the exact properties, config, and test data you tried out? I am having difficulty making this work for my use case.
[GitHub] [hudi] YannByron commented on a diff in pull request #6697: [HUDI-3478] Spark CDC Write
YannByron commented on code in PR #6697: URL: https://github.com/apache/hudi/pull/6697#discussion_r974820909 ## hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCUtils.java: ## @@ -0,0 +1,150 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.cdc; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; + +import org.apache.hudi.exception.HoodieException; + +public class HoodieCDCUtils { + + public static final String CDC_LOGFILE_SUFFIX = "-cdc"; + + /* the `op` column represents how a record is changed. */ + public static final String CDC_OPERATION_TYPE = "op"; + + /* the `ts_ms` column represents when a record is changed. */ + public static final String CDC_COMMIT_TIMESTAMP = "ts_ms"; Review Comment: see https://github.com/apache/hudi/pull/6476#discussion_r974703425 so, we will choose `ts` or `ts_ms`. @xushiyan @alexeykudinkin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
boneanxs commented on code in PR #6046: URL: https://github.com/apache/hudi/pull/6046#discussion_r974818835 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala: ## @@ -160,9 +160,6 @@ class BaseFileOnlyRelation(sqlContext: SQLContext, fileFormat = fileFormat, optParams)(sparkSession) } else { - val readPathsStr = optParams.get(DataSourceReadOptions.READ_PATHS.key) Review Comment: I see, thanks for clearing that up!
[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
boneanxs commented on code in PR #6046: URL: https://github.com/apache/hudi/pull/6046#discussion_r974817505 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala: ## @@ -105,6 +108,52 @@ object HoodieDatasetBulkInsertHelper extends Logging { partitioner.repartitionRecords(trimmedDF, config.getBulkInsertShuffleParallelism) } + /** + * Perform bulk insert for [[Dataset]], will not change timeline/index, return + * information about write files. + */ + def bulkInsert(dataset: Dataset[Row], + instantTime: String, + table: HoodieTable[_ <: HoodieRecordPayload[_ <: HoodieRecordPayload[_ <: AnyRef]], _, _, _], + writeConfig: HoodieWriteConfig, + partitioner: BulkInsertPartitioner[Dataset[Row]], + parallelism: Int, + shouldPreserveHoodieMetadata: Boolean): HoodieData[WriteStatus] = { +val repartitionedDataset = partitioner.repartitionRecords(dataset, parallelism) +val arePartitionRecordsSorted = partitioner.arePartitionRecordsSorted +val schema = dataset.schema +val writeStatuses = repartitionedDataset.queryExecution.toRdd.mapPartitions(iter => { + val taskContextSupplier: TaskContextSupplier = table.getTaskContextSupplier + val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get + val taskId = taskContextSupplier.getStageIdSupplier.get.toLong + val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get + val writer = new BulkInsertDataInternalWriterHelper( +table, +writeConfig, +instantTime, +taskPartitionId, +taskId, +taskEpochId, +schema, +writeConfig.populateMetaFields, +arePartitionRecordsSorted, +shouldPreserveHoodieMetadata) + + try { +iter.foreach(writer.write) + } catch { +case t: Throwable => + writer.abort() + throw t + } finally { +writer.close() + } + + writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator +}).collect() +table.getContext.parallelize(writeStatuses.toList.asJava) Review Comment: `writeStatuses` is an `Array`, but parallelize func needs `list` -- This is an automated message 
[hudi] branch master updated: [HUDI-4877] Fix org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not work correct issue (#6717)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new b5dc4312a4 [HUDI-4877] Fix org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not work correct issue (#6717) b5dc4312a4 is described below commit b5dc4312a4038e62374c6a12f601df2889cced56 Author: FocusComputing AuthorDate: Tue Sep 20 09:31:50 2022 +0800 [HUDI-4877] Fix org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not work correct issue (#6717) Co-authored-by: xiaoxingstack --- .../org/apache/hudi/index/bucket/TestHoodieSimpleBucketIndex.java | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bucket/TestHoodieSimpleBucketIndex.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bucket/TestHoodieSimpleBucketIndex.java index ea6418696c..34728c6cf3 100644 --- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bucket/TestHoodieSimpleBucketIndex.java +++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bucket/TestHoodieSimpleBucketIndex.java @@ -139,7 +139,9 @@ public class TestHoodieSimpleBucketIndex extends HoodieClientTestHarness { .filter(r -> BucketIdentifier.bucketIdFromFileId(r.getCurrentLocation().getFileId()) != getRecordBucketId(r)).findAny().isPresent()); assertTrue(taggedRecordRDD.collectAsList().stream().filter(r -> r.getPartitionPath().equals("2015/01/31") -&& !r.isCurrentLocationKnown()).count() == 1L); +&& !r.isCurrentLocationKnown()).count() == 1L); +assertTrue(taggedRecordRDD.collectAsList().stream().filter(r -> r.getPartitionPath().equals("2016/01/31") +&& r.isCurrentLocationKnown()).count() == 3L); } private HoodieWriteConfig makeConfig() {
[GitHub] [hudi] danny0405 merged pull request #6717: [HUDI-4877] Fix org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not work correct issue
danny0405 merged PR #6717: URL: https://github.com/apache/hudi/pull/6717
[GitHub] [hudi] nsivabalan commented on issue #5875: [SUPPORT] Hoodie Delta streamer Job with Kafka Source fetching the same offset again and again Commiting the same offset again and again
nsivabalan commented on issue #5875: URL: https://github.com/apache/hudi/issues/5875#issuecomment-1251731380 these are the latest checkpoints as per the logs shared. ``` 22/06/14 18:27:12 INFO DeltaSync: Checkpoint to resume from : Option{val=dp.packet,0:23816747280,1:23766354517,2:23605513356,3:23434358328,4:23529418069,5:13014012706,6:13111477149,7:13326541492,8:13418171123,9:12818988564,10:13199646406,11:13018719613,12:13171335493,13:13202912834,14:13139217975,15:13177133210,16:1285564,17:13190680738,18:13384909168,19:13076367894,20:12931040506,21:13179604368,22:13103929138,23:13129130931,24:12962399070,25:13209169847,26:13261971379,27:13223540895,28:12800622145,29:13135630345,30:13286314622,31:12910064270,32:13012723305,33:12942700314,34:13300903762,35:13452813697,36:12774627787,37:13149143084,38:13397339159,39:12943639180,40:12850660061,41:13287830095,42:13416968091,43:13251840311,44:12975405300,45:13129020620,46:13319529463,47:13645113762,48:13171132949,49:13341802693,50:13160916594,51:12797360849,52:13231051973,53:13159710596,54:13462835274,55:13218075514,56:13228939350,57:13026346757,58:13197365542,59:12782050600,60:13274602048,61:1301 
9911553,62:13093034410,63:12946710535,64:12735821947,65:13521932586,66:12885611345,67:12804964853,68:13190226613,69:13119906383,70:13037133163,71:13037077649,72:13249184425,73:13034553149,74:12596466583,75:13197572654,76:13068376212,77:13394048883,78:12949166912,79:12947874565,80:12766226593,81:12887001480,82:12961256747,83:12640403833,84:13209947935,85:12990821869,86:12967972824,87:13062323012,88:12801634102,89:13377026742,90:13075492590,91:12899740426,92:13105955253,93:12811456735,94:13018855871,95:12837481047,96:13143601548,97:12797197623,98:12990305191,99:13092561101,100:13133162523,101:12559759129,102:13091848951,103:12889825622,104:12749143212,105:13041769115,106:13023952197,107:13081277534,108:13043234272,109:13020451301,110:12607811366,111:13056149918,112:13283818745,113:12922522456,114:12828248592,115:12997400759,116:12837921515,117:13035132730,118:12979892771,119:13093824502} 22/06/14 20:08:29 INFO DeltaSync: Checkpoint to resume from : Option{val=dp.packet,0:23821747280,1:23771354517,2:23610513356,3:23439358328,4:23534418069,5:13019012706,6:13116477149,7:13331541492,8:13423171123,9:12823988564,10:13204646406,11:13023719613,12:13176335493,13:13207912834,14:13144217975,15:13182133210,16:1286064,17:13195680738,18:13389909168,19:13081367894,20:12936040506,21:13184604368,22:13108929138,23:13134130931,24:12967399070,25:13214169847,26:13266971379,27:13228540895,28:12805622145,29:13140630345,30:13291314622,31:12915064270,32:13017723305,33:12947700314,34:13305903762,35:13457813697,36:12779627787,37:13154143084,38:13402339159,39:12948639180,40:12855660061,41:13292830095,42:13421968091,43:13256840311,44:12980405300,45:13134020620,46:13324529463,47:13650113762,48:13176132949,49:13346802693,50:13165916594,51:12802360849,52:13236051973,53:13164710596,54:13467835274,55:13223075514,56:13233939350,57:13031346757,58:13202365542,59:12787050600,60:13279602048,61:1302 
4911553,62:13098034410,63:12951710535,64:12740821947,65:13526932586,66:12890611345,67:12809964853,68:13195226613,69:13124906383,70:13042133163,71:13042077649,72:13254184425,73:13039553149,74:12601466583,75:13202572654,76:13073376212,77:13399048883,78:12954166912,79:12952874565,80:12771226593,81:12892001480,82:12966256747,83:12645403833,84:13214947935,85:12995821869,86:12972972824,87:13067323012,88:12806634102,89:13382026742,90:13080492590,91:12904740426,92:13110955253,93:12816456735,94:13023855871,95:12842481047,96:13148601548,97:12802197623,98:12995305191,99:13097561101,100:13138162523,101:12564759129,102:13096848951,103:12894825622,104:12754143212,105:13046769115,106:13028952197,107:13086277534,108:13048234272,109:13025451301,110:12612811366,111:13061149918,112:13288818745,113:12927522456,114:12833248592,115:13002400759,116:12842921515,117:13040132730,118:12984892771,119:13098824502} 22/06/14 20:53:50 INFO DeltaSync: Checkpoint to resume from : Option{val=dp.packet,0:23826747280,1:23776354517,2:23615513356,3:23444358328,4:23539418069,5:13024012706,6:13121477149,7:13336541492,8:13428171123,9:12828988564,10:13209646406,11:13028719613,12:13181335493,13:13212912834,14:13149217975,15:13187133210,16:1286564,17:13200680738,18:13394909168,19:13086367894,20:12941040506,21:13189604368,22:13113929138,23:13139130931,24:12972399070,25:13219169847,26:13271971379,27:13233540895,28:12810622145,29:13145630345,30:13296314622,31:12920064270,32:13022723305,33:12952700314,34:13310903762,35:13462813697,36:12784627787,37:13159143084,38:13407339159,39:12953639180,40:12860660061,41:13297830095,42:13426968091,43:13261840311,44:12985405300,45:13139020620,46:13329529463,47:13655113762,48:13181132949,49:13351802693,50:13170916594,51:12807360849,52:13241051973,53:13169710596,54:13472835274,55:13228075514,56:13238939350,57:13036346757,58:13207365542,59:12792050600,60:13284602048,61:1302 9911553,62:
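The `Checkpoint to resume from` values logged above use DeltaStreamer's Kafka checkpoint layout, `topic,partition:offset,partition:offset,...`. A small sketch (hypothetical helper, not part of Hudi) for parsing two such strings and checking whether the committed offsets actually advanced per partition:

```python
def parse_checkpoint(cp):
    """Split 'topic,p:o,p:o,...' into (topic, {partition: offset})."""
    topic, *pairs = cp.split(",")
    offsets = {}
    for pair in pairs:
        partition, offset = pair.split(":")
        offsets[int(partition)] = int(offset)
    return topic, offsets

def offset_deltas(old_cp, new_cp):
    """Per-partition offset advance between two checkpoints of the same topic."""
    old_topic, old_offsets = parse_checkpoint(old_cp)
    new_topic, new_offsets = parse_checkpoint(new_cp)
    if old_topic != new_topic:
        raise ValueError("checkpoints are for different topics")
    # a delta of zero for every partition would mean the same offsets
    # are being committed again and again, as the issue describes
    return {p: new_offsets[p] - old_offsets.get(p, 0) for p in new_offsets}
```

Comparing successive checkpoints this way makes it easy to see whether a partition is stuck at the same offset or genuinely progressing between commits.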
[GitHub] [hudi] nsivabalan commented on issue #6101: [SUPPORT] Hudi Delete Not working with EMR, AWS Glue & S3
nsivabalan commented on issue #6101: URL: https://github.com/apache/hudi/issues/6101#issuecomment-1251727223 Hey, I tried to test delete_partitions for multiple partitions and could not reproduce. Would you mind giving us a reproducible script? It would help us find the root cause.
[GitHub] [hudi] alexeykudinkin commented on issue #6635: [SUPPORT] Failed to build hudi 0.12.0 with spark 3.2.2
alexeykudinkin commented on issue #6635: URL: https://github.com/apache/hudi/issues/6635#issuecomment-1251722860 @xushiyan you can assign this one to me
[GitHub] [hudi] hudi-bot commented on pull request #6349: [HUDI-4433] Hudi-CLI repair deduplicate not working with non-partitio…
hudi-bot commented on PR #6349: URL: https://github.com/apache/hudi/pull/6349#issuecomment-1251684276 ## CI report: * f6fde8b6313b3b0e250c41858c77cc425325a6db Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11508) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nsivabalan commented on issue #5777: [SUPPORT] Hudi table has duplicate data.
nsivabalan commented on issue #5777: URL: https://github.com/apache/hudi/issues/5777#issuecomment-1251682905 @jiangjiguang: can you respond to the above requests? We cannot proceed without the further logs and info requested. Whenever you get a chance, please respond.
[GitHub] [hudi] nsivabalan commented on issue #6404: [SUPPORT] Hudi Deltastreamer CSV ingestion issue
nsivabalan commented on issue #6404: URL: https://github.com/apache/hudi/issues/6404#issuecomment-1251682385 sure, thanks.
[GitHub] [hudi] nsivabalan commented on issue #6716: [SUPPORT] Unable to archive if no non-table service actions are performed on the data table
nsivabalan commented on issue #6716: URL: https://github.com/apache/hudi/issues/6716#issuecomment-1251679200 @yihua: Can you take this up, please?
[GitHub] [hudi] hudi-bot commented on pull request #6697: [HUDI-3478] Spark CDC Write
hudi-bot commented on PR #6697: URL: https://github.com/apache/hudi/pull/6697#issuecomment-1251678959 ## CI report: * da97ef7f2f1d034d15851c3676b27f06294dc557 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11507)
[GitHub] [hudi] hudi-bot commented on pull request #6665: [HUDI-4850] Incremental Ingestion from GCS
hudi-bot commented on PR #6665: URL: https://github.com/apache/hudi/pull/6665#issuecomment-1251678908 ## CI report: * a068aefd47b77ceb65c0f7ca3857e438af2d2d2b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11509)
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6697: [HUDI-3478] Spark CDC Write
alexeykudinkin commented on code in PR #6697: URL: https://github.com/apache/hudi/pull/6697#discussion_r974740166 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java: ## @@ -358,11 +357,10 @@ private Pair> getFilesToCleanKeepingLatestCommits(S deletePaths.add(new CleanFileInfo(hoodieDataFile.getBootstrapBaseFile().get().getPath(), true)); } }); -if (hoodieTable.getMetaClient().getTableType() == HoodieTableType.MERGE_ON_READ) { - // If merge on read, then clean the log files for the commits as well - deletePaths.addAll(aSlice.getLogFiles().map(lf -> new CleanFileInfo(lf.getPath().toString(), false)) - .collect(Collectors.toList())); -} +// since the cow tables may also write out the log files in cdc scenario, we need to clean the log files +// for this commit no matter the table type is mor or cow. +deletePaths.addAll(aSlice.getLogFiles().map(lf -> new CleanFileInfo(lf.getPath().toString(), false)) Review Comment: We should scope this: guard this by checking that CDC is enabled and only cleaning up CDC files (assuming we will have separate naming scheme for these). Overly broad conditionals like this one (cleaning all log-files) is a time-bomb. ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java: ## @@ -0,0 +1,263 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hudi.io;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.hudi.avro.HoodieAvroUtils;
import org.apache.hudi.common.fs.FSUtils;
import org.apache.hudi.common.model.HoodieAvroPayload;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.model.HoodieWriteStat;
import org.apache.hudi.common.table.HoodieTableConfig;
import org.apache.hudi.common.table.cdc.HoodieCDCOperation;
import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode;
import org.apache.hudi.common.table.cdc.HoodieCDCUtils;
import org.apache.hudi.common.table.log.AppendResult;
import org.apache.hudi.common.table.log.HoodieLogFormat;
import org.apache.hudi.common.table.log.block.HoodieCDCDataBlock;
import org.apache.hudi.common.table.log.block.HoodieLogBlock;
import org.apache.hudi.common.util.DefaultSizeEstimator;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.StringUtils;
import org.apache.hudi.common.util.collection.ExternalSpillableMap;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.exception.HoodieException;
import org.apache.hudi.exception.HoodieIOException;
import org.apache.hudi.exception.HoodieUpsertException;

import java.io.Closeable;
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * This class encapsulates all the cdc-writing functions.
 */
public class HoodieCDCLogger implements Closeable {

  private final String commitTime;

  private final String keyField;

  private final Schema dataSchema;

  private final boolean populateMetaFields;

  // writer for cdc data
  private final HoodieLogFormat.Writer cdcWriter;

  private final boolean cdcEnabled;

  private final HoodieCDCSupplementalLoggingMode cdcSupplementalLoggingMode;

  private final Schema cdcSchema;

  // the cdc data
  private final Map cdcData;

  public HoodieCDCLogger(
      String commitTime,
      HoodieWriteConfig config,
      HoodieTableConfig tableConfig,
      Schema schema,
      HoodieLogFormat.Writer cdcWriter,
      long maxInMemorySizeInBytes) {
    try {
      this.commitTime = commitTime;
      this.dataSchema = HoodieAvroUtils.removeMetadataFields(schema);
      this.populateMetaFields = config.populateMetaFields();
      this.k
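The `HoodieCDCLogger` fragment above buffers CDC records in a (spillable) map keyed by record key before flushing them out as a CDC log block. A toy in-memory analogue of that buffering pattern, using plain Java and ignoring spilling, Avro, and the actual log format (all names here are illustrative, not Hudi APIs):

```java
import java.io.Closeable;
import java.util.LinkedHashMap;
import java.util.Map;

public class ToyCdcLogger implements Closeable {
    private final String commitTime;
    // Stands in for ExternalSpillableMap: record key -> serialized CDC payload.
    private final Map<String, String> cdcData = new LinkedHashMap<>();

    ToyCdcLogger(String commitTime) {
        this.commitTime = commitTime;
    }

    void put(String recordKey, String op, String afterJson) {
        // Keyed by record key, so a later change to the same key replaces the buffered entry.
        cdcData.put(recordKey, "{\"op\":\"" + op + "\",\"after\":" + afterJson + "}");
    }

    int bufferedRecords() {
        return cdcData.size();
    }

    @Override
    public void close() {
        // The real logger would append a HoodieCDCDataBlock here; we just clear the buffer.
        cdcData.clear();
    }

    public static void main(String[] args) {
        try (ToyCdcLogger logger = new ToyCdcLogger("20220920000000")) {
            logger.put("id:1", "i", "{\"id\":1}");
            logger.put("id:1", "u", "{\"id\":1,\"name\":\"a1\"}"); // replaces the buffered entry for id:1
            System.out.println(logger.bufferedRecords()); // 1
        }
    }
}
```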
[GitHub] [hudi] hudi-bot commented on pull request #6349: [HUDI-4433] Hudi-CLI repair deduplicate not working with non-partitio…
hudi-bot commented on PR #6349: URL: https://github.com/apache/hudi/pull/6349#issuecomment-1251630547 ## CI report: * ed04e960bbc93c1d4d47c54f15f47191fef06fa3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11506) * f6fde8b6313b3b0e250c41858c77cc425325a6db Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11508) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6561: [HUDI-4760] Fixing repeated trigger of data file creations w/ clustering
alexeykudinkin commented on code in PR #6561: URL: https://github.com/apache/hudi/pull/6561#discussion_r974727375 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java: ## @@ -249,7 +249,7 @@ protected HoodieWriteMetadata> executeClustering(HoodieC HoodieData statuses = updateIndex(writeStatusList, writeMetadata); writeMetadata.setWriteStats(statuses.map(WriteStatus::getStat).collectAsList()); writeMetadata.setPartitionToReplaceFileIds(getPartitionToReplacedFileIds(clusteringPlan, writeMetadata)); -validateWriteResult(clusteringPlan, writeMetadata); +// if we don't cache the write statuses above, validation will call isEmpty which might retrigger the execution again. Review Comment: @nsivabalan was this addressed?
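The comment in the diff above warns that calling `isEmpty` on an uncached, lazily evaluated collection can re-trigger the whole write. The same hazard can be illustrated with a plain lazily evaluated supplier and a memoizing wrapper (illustrative sketch, not Hudi or Spark code):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyCacheDemo {
    static final AtomicInteger executions = new AtomicInteger();

    // Simulates an expensive, lazily evaluated computation (like an uncached RDD).
    static Supplier<java.util.List<Integer>> expensive() {
        return () -> {
            executions.incrementAndGet(); // each call re-runs the "job"
            return java.util.List.of(1, 2, 3);
        };
    }

    // Memoize once so later inspections (an isEmpty-style validation) reuse the result.
    static <T> Supplier<T> cached(Supplier<T> s) {
        return new Supplier<T>() {
            private T value;
            @Override public synchronized T get() {
                if (value == null) value = s.get();
                return value;
            }
        };
    }

    public static void main(String[] args) {
        Supplier<java.util.List<Integer>> raw = expensive();
        raw.get();                            // the "write"
        raw.get().isEmpty();                  // the "validation" re-triggers the computation
        int uncachedRuns = executions.get();  // 2 runs without caching

        executions.set(0);
        Supplier<java.util.List<Integer>> memo = cached(expensive());
        memo.get();
        memo.get().isEmpty();                 // reuses the cached result
        System.out.println(uncachedRuns + " vs " + executions.get()); // 2 vs 1
    }
}
```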
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback
alexeykudinkin commented on code in PR #6132: URL: https://github.com/apache/hudi/pull/6132#discussion_r974723729 ## rfc/rfc-46/rfc-46.md: ## @@ -128,21 +173,88 @@ Following major components will be refactored: 1. `HoodieWriteHandle`s will be 1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion) - 2. Using Combining API engine to merge records (when necessary) + 2. Using Record Merge API to merge records (when necessary) 3. Passes `HoodieRecord` as is to `FileWriter` 2. `HoodieFileWriter`s will be 1. Accepting `HoodieRecord` 2. Will be engine-specific (so that they're able to handle internal record representation) 3. `HoodieRealtimeRecordReader`s 1. API will be returning opaque `HoodieRecord` instead of raw Avro payload +### Config for Record Merge +The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default is chosen according to your engine type. + +### Public API in HoodieRecord +Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data types (Avro, InternalRow, RowData). +Its public API will look like the following: + +```java +import java.util.Properties; + +class HoodieRecord { + + /** +* Get column in record to support RDDCustomColumnsSortPartitioner +*/ + Object getRecordColumnValues(Schema recordSchema, String[] columns, + boolean consistentLogicalTimestampEnabled); + + /** +* Support bootstrap. +*/ + HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws IOException; + + /** +* Rewrite record into new schema (add meta columns) +*/ + HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) + throws IOException; + + /** +* Support schema evolution.
+*/ + HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, + Map renameCols) throws IOException; + + HoodieRecord updateValues(Schema recordSchema, Properties props, Review Comment: @wzx140 we should split these up: - The only legitimate use-case for us to update fields is Hudi's metadata - `HoodieHFileDataBlock` shouldn't be modifying the existing payload but should instead be _rewriting_ it w/o the field it wants to omit. We will tackle that separately, and for the sake of RFC-46 we can create a temporary method `truncateRecordKey` which will be overwriting the record-key value for now (we will deprecate and remove this method after we address this). We should not leave a loophole that allows a record to be modified, to make sure that nobody can start building against this API
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback
alexeykudinkin commented on code in PR #6132: URL: https://github.com/apache/hudi/pull/6132#discussion_r938178095 ## rfc/rfc-46/rfc-46.md: ## @@ -156,13 +187,76 @@ Following major components will be refactored: 3. `HoodieRealtimeRecordReader`s 1. API will be returning opaque `HoodieRecord` instead of raw Avro payload +### Config for Record Merge +The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default is chosen according to your engine type. + +### Public API in HoodieRecord +Because we implement different types of records, we need to move some functions from AvroUtils into HoodieRecord for the different data types (Avro, InternalRow, RowData). +Its public API will look like the following: + +```java +class HoodieRecord { + + /** +* Get column in record to support RDDCustomColumnsSortPartitioner +*/ + Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled); + + /** +* Support bootstrap. +*/ + HoodieRecord mergeWith(HoodieRecord other) throws IOException; + + /** +* Rewrite record into new schema (add meta columns) +*/ + HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException; + + /** +* Support schema evolution. +*/ + HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map renameCols) throws IOException; + + HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map metadataValues) throws IOException; + + /** +* Is deleted. +*/ + boolean isPresent(Schema recordSchema, Properties props) throws IOException; + + /** +* Is EmptyRecord. Generated by ExpressionPayload. +*/ + boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException; Review Comment: We should probably update the java-doc then to avoid referencing any particular implementation
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback
alexeykudinkin commented on code in PR #6132: URL: https://github.com/apache/hudi/pull/6132#discussion_r974718156 ## rfc/rfc-46/rfc-46.md: ## @@ -128,21 +173,88 @@ Following major components will be refactored: 1. `HoodieWriteHandle`s will be 1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion) - 2. Using Combining API engine to merge records (when necessary) + 2. Using Record Merge API to merge records (when necessary) 3. Passes `HoodieRecord` as is to `FileWriter` 2. `HoodieFileWriter`s will be 1. Accepting `HoodieRecord` 2. Will be engine-specific (so that they're able to handle internal record representation) 3. `HoodieRealtimeRecordReader`s 1. API will be returning opaque `HoodieRecord` instead of raw Avro payload +### Config for Record Merge +The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default is chosen according to your engine type. + +### Public API in HoodieRecord +Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data types (Avro, InternalRow, RowData). +Its public API will look like the following: + +```java +import java.util.Properties; + +class HoodieRecord { + + /** +* Get column in record to support RDDCustomColumnsSortPartitioner +*/ + Object getRecordColumnValues(Schema recordSchema, String[] columns, Review Comment: @wzx140 understand where you're coming from. We should have already deprecated `getRecordColumnValues`, as this method is heavily coupled to where it's currently used and unfortunately isn't generic enough to serve its purpose. In this particular case, converting the values and concatenating them as strings doesn't make sense for a generic utility -- whenever someone requests a list of column values they expect to get a list of values (as they are), not a string (!) of concatenated values.
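The objection above is that a generic column-value accessor should return the values themselves, not a single concatenated string. The two shapes can be contrasted with a hypothetical sketch over plain Java maps (not Hudi's actual classes or signatures):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class ColumnValuesDemo {
    // Coupled shape: concatenates values into one string -- convenient as a
    // sort key for one caller, but useless as a general-purpose accessor.
    static String concatenated(Map<String, Object> record, String[] columns) {
        return Arrays.stream(columns)
                .map(c -> String.valueOf(record.get(c)))
                .collect(Collectors.joining(","));
    }

    // Generic shape: returns the values as they are, letting callers decide
    // how to combine or convert them.
    static Object[] values(Map<String, Object> record, String[] columns) {
        return Arrays.stream(columns).map(record::get).toArray();
    }

    public static void main(String[] args) {
        Map<String, Object> record = Map.of("id", 1, "name", "a0", "ts", 1000L);
        String[] cols = {"id", "ts"};
        System.out.println(concatenated(record, cols));         // 1,1000
        System.out.println(Arrays.toString(values(record, cols))); // [1, 1000]
    }
}
```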
[jira] [Closed] (HUDI-4836) Remove "hbase-default.xml" colliding w/ "hbase-site.xml" in Hudi bundles
[ https://issues.apache.org/jira/browse/HUDI-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin closed HUDI-4836. - Resolution: Not A Problem User was able to resolve their original issue by overriding that setting in Cloudera Manager: [https://github.com/apache/hudi/issues/6398#issuecomment-1250633915] It seems to me that Cloudera isn't actually affecting "hbase-default.xml", but rather is writing out its own "hbase-site.xml" that overrides the one we're providing in our bundles. > Remove "hbase-default.xml" colliding w/ "hbase-site.xml" in Hudi bundles > > > Key: HUDI-4836 > URL: https://issues.apache.org/jira/browse/HUDI-4836 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.12.1 > > > This was discovered in the process of testing 0.12: > [https://github.com/apache/hudi/issues/6398#issuecomment-1244930086] > > The original issues were meant to be addressed by > [https://github.com/apache/hudi/pull/6114], but it ultimately led to > "hbase-default.xml" (from "hbase-common") being bundled along w/ our > "hbase-site.xml" intended to override the \{hbase.defaults.for.version.skip} > configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] kazdy commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default
kazdy commented on code in PR #6196: URL: https://github.com/apache/hudi/pull/6196#discussion_r974709033 ## hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java: ## @@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig { public static final ConfigProperty RECONCILE_SCHEMA = ConfigProperty .key("hoodie.datasource.write.reconcile.schema") - .defaultValue(false) + .defaultValue(true) Review Comment: @alexeykudinkin AFAIK the Schema Evolution config is there because it's an experimental feature, and soon it will become GA? Then that config should be enabled by default or deprecated; will this logic hold then? I feel like the Hudi config surface is already very broad and therefore a bit hard to grasp, and users would appreciate it if this were one switch instead of a combination of two
[GitHub] [hudi] alexeykudinkin commented on issue #6398: [SUPPORT] Metadata table thows hbase exceptions
alexeykudinkin commented on issue #6398: URL: https://github.com/apache/hudi/issues/6398#issuecomment-1251581367 @rbtrtr glad that you've sorted it out!
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi
alexeykudinkin commented on code in PR #6476: URL: https://github.com/apache/hudi/pull/6476#discussion_r974703425 ## hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCUtils.java: ## @@ -0,0 +1,149 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.cdc; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; + +import org.apache.hudi.common.table.HoodieTableConfig; +import org.apache.hudi.exception.HoodieException; + +public class HoodieCDCUtils { + + /* the `op` column represents how a record is changed. */ + public static final String CDC_OPERATION_TYPE = "op"; + + /* the `ts_ms` column represents when a record is changed. */ + public static final String CDC_COMMIT_TIMESTAMP = "ts_ms"; Review Comment: Yeah, i've realized this later that it's also `ts_ms` in Debezium. Let's keep it as is to keep it consistent then. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
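The thread above settles on Debezium's field names: `op` for how a record changed and `ts_ms` for when. A Debezium-style change event also carries `before` and `after` images of the row; a minimal sketch using plain Java maps (field names follow the discussion, everything else is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class CdcRecordDemo {
    static Map<String, Object> changeEvent(String op, long tsMs,
                                           Map<String, Object> before,
                                           Map<String, Object> after) {
        Map<String, Object> event = new HashMap<>();
        event.put("op", op);         // i = insert, u = update, d = delete
        event.put("ts_ms", tsMs);    // when the record was changed
        event.put("before", before); // null for inserts
        event.put("after", after);   // null for deletes
        return event;
    }

    public static void main(String[] args) {
        // An update: both before and after images are present.
        Map<String, Object> update = changeEvent("u", 1663650000000L,
                Map.of("id", 1, "name", "a0"), Map.of("id", 1, "name", "a1"));
        System.out.println(update.get("op"));    // u
        System.out.println(update.get("ts_ms")); // 1663650000000
    }
}
```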
[jira] [Updated] (HUDI-4875) NoSuchTableException is thrown while dropping temporary view after applied HoodieSparkSessionExtension in Spark 3.2
[ https://issues.apache.org/jira/browse/HUDI-4875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4875: -- Fix Version/s: 0.12.1 > NoSuchTableException is thrown while dropping temporary view after applied > HoodieSparkSessionExtension in Spark 3.2 > --- > > Key: HUDI-4875 > URL: https://issues.apache.org/jira/browse/HUDI-4875 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.11.1 > Environment: Spark 3.2.2 >Reporter: dohongdayi >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > > NoSuchTableException is thrown while dropping temporary view after applied > HoodieSparkSessionExtension in Spark 3.2: > {code:java} > org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view > 'test_view' not found in database 'default' > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:225) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableRawMetadata(SessionCatalog.scala:516) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:502) > at > org.apache.spark.sql.hudi.SparkAdapter.isHoodieTable(SparkAdapter.scala:160) > at > org.apache.spark.sql.hudi.SparkAdapter.isHoodieTable$(SparkAdapter.scala:159) > at > org.apache.spark.sql.adapter.BaseSpark3Adapter.isHoodieTable(BaseSpark3Adapter.scala:45) > at > org.apache.spark.sql.hudi.analysis.HoodiePostAnalysisRule.apply(HoodieAnalysis.scala:539) > at > org.apache.spark.sql.hudi.analysis.HoodiePostAnalysisRule.apply(HoodieAnalysis.scala:530) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > at scala.collection.immutable.List.foldLeft(List.scala:91) > at > 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200) > at scala.collection.immutable.List.foreach(List.scala:431) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:222) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:218) > at > org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:167) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:218) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:182) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179) > at > org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:203) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:202) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:75) > at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:183) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:788) > at > org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:183) > at > 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:75) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:73) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:629) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:788) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:620) > ... 51 elided {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
alexeykudinkin commented on code in PR #6046: URL: https://github.com/apache/hudi/pull/6046#discussion_r974679847 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala: ## @@ -105,6 +108,52 @@ object HoodieDatasetBulkInsertHelper extends Logging { partitioner.repartitionRecords(trimmedDF, config.getBulkInsertShuffleParallelism) } + /** + * Perform bulk insert for [[Dataset]], will not change timeline/index, return + * information about write files. + */ + def bulkInsert(dataset: Dataset[Row], + instantTime: String, + table: HoodieTable[_ <: HoodieRecordPayload[_ <: HoodieRecordPayload[_ <: AnyRef]], _, _, _], + writeConfig: HoodieWriteConfig, + partitioner: BulkInsertPartitioner[Dataset[Row]], + parallelism: Int, + shouldPreserveHoodieMetadata: Boolean): HoodieData[WriteStatus] = { +val repartitionedDataset = partitioner.repartitionRecords(dataset, parallelism) +val arePartitionRecordsSorted = partitioner.arePartitionRecordsSorted +val schema = dataset.schema +val writeStatuses = repartitionedDataset.queryExecution.toRdd.mapPartitions(iter => { + val taskContextSupplier: TaskContextSupplier = table.getTaskContextSupplier + val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get + val taskId = taskContextSupplier.getStageIdSupplier.get.toLong + val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get + val writer = new BulkInsertDataInternalWriterHelper( +table, +writeConfig, +instantTime, +taskPartitionId, +taskId, +taskEpochId, +schema, +writeConfig.populateMetaFields, +arePartitionRecordsSorted, +shouldPreserveHoodieMetadata) + + try { +iter.foreach(writer.write) + } catch { +case t: Throwable => + writer.abort() + throw t + } finally { +writer.close() + } + + writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator +}).collect() +table.getContext.parallelize(writeStatuses.toList.asJava) Review Comment: nit: no need for `toList` ## 
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkAdapterSupport.scala: ## @@ -26,17 +26,6 @@ import org.apache.spark.sql.hudi.SparkAdapter */ trait SparkAdapterSupport { - lazy val sparkAdapter: SparkAdapter = { Review Comment: Instead of moving this to Java, let's do the following: - Create companion object `ScalaAdapterSupport` - Move this conditional there - Keep this var (for compatibility) referencing the static one from the object ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java: ## @@ -183,8 +183,17 @@ public boolean accept(Path path) { metaClientCache.put(baseDir.toString(), metaClient); } - fsView = FileSystemViewManager.createInMemoryFileSystemView(engineContext, - metaClient, HoodieInputFormatUtils.buildMetadataConfig(getConf())); + if (getConf().get("as.of.instant") != null) { Review Comment: Good catch! ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSpatialCurveSortPartitioner.java: ## @@ -91,27 +81,4 @@ public JavaRDD> repartitionRecords(JavaRDD> reco return hoodieRecord; }); } - - private Dataset reorder(Dataset dataset, int numOutputGroups) { Review Comment: Thanks for cleaning that up!
## hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieSparkMergeOnReadTableClustering.java: ## @@ -60,20 +61,30 @@ class TestHoodieSparkMergeOnReadTableClustering extends SparkClientFunctionalTes private static Stream testClustering() { return Stream.of( -Arguments.of(true, true, true), -Arguments.of(true, true, false), -Arguments.of(true, false, true), -Arguments.of(true, false, false), -Arguments.of(false, true, true), -Arguments.of(false, true, false), -Arguments.of(false, false, true), -Arguments.of(false, false, false) -); +Arrays.asList(true, true, true), +Arrays.asList(true, true, false), +Arrays.asList(true, false, true), +Arrays.asList(true, false, false), +Arrays.asList(false, true, true), +Arrays.asList(false, true, false), +Arrays.asList(false, false, true), +Arrays.asList(false, false, false)) +.flatMap(arguments -> { + ArrayList enableRowClusteringArgs = new ArrayList<>(); + enableRowClusteringArgs.add(true); + enableRowClusteringArgs.addAll(arguments); + ArrayList disableRowClusteringArgs = new ArrayList<>(); + disableRowClusteringArgs.add(false); Review Comment
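The test diff above rewrites the JUnit argument sets so that every base combination is expanded via `flatMap` with the new row-clustering flag, doubling the number of cases. The expansion itself can be sketched with plain Java streams (illustrative, not the actual test class):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ArgumentExpansionDemo {
    public static void main(String[] args) {
        // Four base combinations of two boolean parameters.
        List<List<Boolean>> base = Arrays.asList(
                Arrays.asList(true, true), Arrays.asList(true, false),
                Arrays.asList(false, true), Arrays.asList(false, false));

        // Prepend both values of the new flag to every existing combination.
        List<List<Boolean>> expanded = base.stream()
                .flatMap(combo -> Stream.of(true, false).map(flag -> {
                    List<Boolean> withFlag = new ArrayList<>();
                    withFlag.add(flag);
                    withFlag.addAll(combo);
                    return withFlag;
                }))
                .collect(Collectors.toList());

        System.out.println(expanded.size()); // 8: 4 base combinations x 2 flag values
    }
}
```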
[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
hudi-bot commented on PR #5629: URL: https://github.com/apache/hudi/pull/5629#issuecomment-1251566082 ## CI report: * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN * ef85e9b3eefd14a098a1bc5b277fd5989ef8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11504)
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support
alexeykudinkin commented on code in PR #6256: URL: https://github.com/apache/hudi/pull/6256#discussion_r971371531 ## rfc/rfc-51/rfc-51.md: ## @@ -62,73 +63,79 @@ We follow the debezium output format: four columns as shown below - u: represent `update`; when `op` is `u`, both `before` and `after` don't be null; - d: represent `delete`; when `op` is `d`, `after` is always null; -Note: the illustration here ignores all the Hudi metadata columns like `_hoodie_commit_time` in `before` and `after` columns. +**Note** -## Goals +* In case of the same record having operations like insert -> delete -> insert, CDC data should be produced to reflect the exact behaviors. +* The illustration above ignores all the Hudi metadata columns like `_hoodie_commit_time` in `before` and `after` columns. -1. Support row-level CDC records generation and persistence; -2. Support both MOR and COW tables; -3. Support all the write operations; -4. Support Spark DataFrame/SQL/Streaming Query; +## Design Goals -## Implementation +1. Support row-level CDC records generation and persistence +2. Support both MOR and COW tables +3. Support all the write operations +4. Support incremental queries in CDC format across supported engines Review Comment: Let's also explicitly call out that: - For CDC-enabled Tables performance of non-CDC queries should not be affected ## rfc/rfc-51/rfc-51.md: ## @@ -64,69 +65,72 @@ We follow the debezium output format: four columns as shown below Note: the illustration here ignores all the Hudi metadata columns like `_hoodie_commit_time` in `before` and `after` columns. -## Goals +## Design Goals 1. Support row-level CDC records generation and persistence; 2. Support both MOR and COW tables; 3. Support all the write operations; 4. Support Spark DataFrame/SQL/Streaming Query; -## Implementation +## Configurations -### CDC Architecture +| key | default | description | +|-|--|--| +| hoodie.table.cdc.enabled| `false` | The master switch of the CDC features. 
If `true`, writers and readers will respect CDC configurations and behave accordingly.| +| hoodie.table.cdc.supplemental.logging | `false` | If `true`, persist the required information about the changed data, including `before`. If `false`, only `op` and record keys will be persisted. | +| hoodie.table.cdc.supplemental.logging.include_after | `false` | If `true`, persist `after` as well. | -![](arch.jpg) +To perform CDC queries, users need to set `hoodie.table.cdc.enable=true` and `hoodie.datasource.query.type=incremental`. -Note: Table operations like `Compact`, `Clean`, `Index` do not write/change any data. So we don't need to consider them in CDC scenario. - -### Modifiying code paths +| key| default| description | +|||--| +| hoodie.table.cdc.enabled | `false`| set to `true` for CDC queries| +| hoodie.datasource.query.type | `snapshot` | set to `incremental` for CDC queries | +| hoodie.datasource.read.start.timestamp | - | requried. | +| hoodie.datasource.read.end.timestamp | - | optional. | -![](points.jpg) +### Logical File Types -### Config Definitions +We define 4 logical file types for the CDC scenario. Review Comment: Agree w/ @xushiyan proposal, let's simplify this -- having a table mapping an action on the Data table into the action on CDC log makes message much more clear. ## rfc/rfc-51/rfc-51.md: ## @@ -148,20 +155,46 @@ hudi_cdc_table/ Under a partition directory, the `.log` file with `CDCBlock` above will keep the changing data we have to materialize. -There is an option to control what data is written to `CDCBlock`, that is `hoodie.table.cdc.supplemental.logging`. See the description of this config above. + Persisting CDC in MOR: Write-on-indexing vs Write-on-compaction + +2 design choices on when to persist CDC in MOR tables: + +Write-on-indexing allows CDC info to be persisted at the earliest, however, in case of Flink writer or Bucket +indexing, `op` (I/U/D) data is not available at indexing. 
+ +Write-on-compaction can always persist CDC info and achieve standardization of implementation log
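The RFC's configuration table above defines `hoodie.table.cdc.supplemental.logging` (persist `before` rather than only `op` and record keys) and `hoodie.table.cdc.supplemental.logging.include_after` (persist `after` as well). One illustrative reading of which CDC fields get persisted under each setting (a sketch, not Hudi's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class SupplementalLoggingDemo {
    // Per the RFC table: with supplemental logging off, only op and record keys
    // are persisted; turning it on adds `before`, and include_after adds `after`.
    static List<String> persistedFields(boolean supplementalLogging, boolean includeAfter) {
        List<String> fields = new ArrayList<>(List.of("op", "record_key"));
        if (supplementalLogging) {
            fields.add("before");
            if (includeAfter) {
                fields.add("after");
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(persistedFields(false, false)); // minimal logging
        System.out.println(persistedFields(true, true));   // full before/after images
    }
}
```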
[GitHub] [hudi] alexeykudinkin commented on issue #6354: [SUPPORT] Sparksql cow non-partition table execute 'merge into ' sqlstatment occure error after setting the tblproperties param "hoodie.dat
alexeykudinkin commented on issue #6354: URL: https://github.com/apache/hudi/issues/6354#issuecomment-1251523902 Created https://issues.apache.org/jira/browse/HUDI-4879
[jira] [Updated] (HUDI-4879) MERGE INTO fails when setting "hoodie.datasource.write.payload.class"
[ https://issues.apache.org/jira/browse/HUDI-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4879: -- Description: As reported by the user: [https://github.com/apache/hudi/issues/6354] Currently, setting \{{hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.DefaultHoodieRecordPayload'}} will result in the following exception: {code:java} org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0 at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329) at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244) at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102) at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386) at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1498) at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384) at
org.apache.spark.rdd.RDD.iterator(RDD.scala:335) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to combine/merg e new record with old value in storage, for new record {HoodieRecord{key=HoodieKey { recordKey=id:1 partitionPath=}, currentLocation='HoodieRecordLocation {instantTime=20220810095846644, fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}', newLocation='HoodieRecordLocation {instantTime=20220810101719437, fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}'}}, old value {{"_hoodie_commit_time": "20220810095824514", "_hoodie_commit_seqno": "20220810095824514_0_0", "_hoodie_record_key": "id:1", "_hoodie_partition_path": "", "_hoodie_file_name": "60c04f95-ca5e-4f82-9558-40da29cc022e-0_0-937-24808_20220810095846644.parquet", "id": 1, "name": "a0", "ts": 1000}} at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149) at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358) at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349) at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322) ... 28 more Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to combine/merge new record with old value in storage, for new record {HoodieRecord{key=HoodieKey { recordKey=id:1 partitionPath=}, currentLocation='HoodieRecordLocation {instantTime=20220810095846644, fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}', newLocation='HoodieRecordLocation {instantTime=20220810101719437, fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}'}}, old value {{"_hoodie_commit_time": "20220810095824514", "_hoodie_commit_seqno": "20220810095824514_0_0", "_hoodie_record_key": "id:1", "_hoodie_partition_path": "", "_hoodie_file_name": "60c04f95-ca5e-4f82-9558-40da29cc022e-0_0-937-24808_202208100
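For context, a Spark SQL repro along these lines should hit the failure. This is a hedged sketch, not the reporter's exact statement: the table and source names are hypothetical, while the columns {{id}}, {{name}}, {{ts}} are taken from the record printed in the stack trace above.
{code:sql}
-- Hypothetical repro sketch; target_hudi_table / source_updates are illustrative names.
-- Setting the payload class is what triggers the reported exception on MERGE INTO.
SET hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.DefaultHoodieRecordPayload';

MERGE INTO target_hudi_table AS t
USING source_updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.name = s.name, t.ts = s.ts
WHEN NOT MATCHED THEN INSERT *;
{code}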
[jira] [Created] (HUDI-4879) MERGE INTO fails when setting "hoodie.datasource.write.payload.class"
Alexey Kudinkin created HUDI-4879: - Summary: MERGE INTO fails when setting "hoodie.datasource.write.payload.class" Key: HUDI-4879 URL: https://issues.apache.org/jira/browse/HUDI-4879 Project: Apache Hudi Issue Type: Bug Reporter: Alexey Kudinkin Assignee: Alexey Kudinkin Fix For: 0.12.1 As reported by the user: [https://github.com/apache/hudi/issues/6354] Currently, setting {{hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.DefaultHoodieRecordPayload'}} -- This message was sent by Atlassian Jira (v8.20.10#820010)