Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]
wombatu-kun commented on code in PR #11559: URL: https://github.com/apache/hudi/pull/11559#discussion_r1665211039 ## rfc/rfc-80/rfc-80.md: ## @@ -0,0 +1,161 @@ + +# RFC-80: Support column families for wide tables + +## Proposers + +- @xiarixiaoyao +- @wombatu-kun + +## Approvers + - + - + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI- + +## Abstract + +In streaming processing, there are often scenarios where a table is widened. Today, real-time table widening is mostly accomplished through Flink's multi-layer joins; +Flink's joins cache a large amount of data in the state backend. As the data set grows, the pressure on the Flink task's state backend gradually increases, and the backend may even become unavailable. +In multi-layer join scenarios, this problem is even more pronounced. + +## Background +Currently, Hudi organizes data at fileGroup granularity. We further divide the fileGroup into column clusters, introducing the columnFamily concept. +Hudi files are then organized according to the following rules: +The data in a partition is divided into buckets by hash; the files in each bucket are divided by columnFamily; the multiple colFamily files in a bucket form a complete fileGroup; when there is only one columnFamily, this degenerates into the native Hudi bucket table. + +![table](table.png) + +After splitting the fileGroup by columnFamily, the naming rules for base files and log files change. We add a cfName suffix to all file names so that Hudi itself can distinguish column families. This suffix is compatible with Hudi's original naming scheme and causes no conflicts. + +![filenames](filenames.png) + +## Implementation +Describe the new thing you want to do in appropriate detail, how it fits into the project architecture. +Provide a detailed description of how you intend to implement this feature. This may be fairly extensive and have large subsections of its own. 
+Or it may be a few sentences. Use judgement based on the scope of the change. + +### Constraints and Restrictions +1. The overall design relies on the lock-free concurrent writing feature of Hudi 1.0. +2. Lower versions of Hudi cannot read or write column family tables. +3. Only MOR bucketed tables support setting column families. +4. Column families do not support repartitioning or renaming. +5. Schema evolution does not take effect on the current column family table. +6. Like native bucket tables, clustering operations are not supported. + +### Model change +After column families are introduced, the storage structure of the entire Hudi bucket table changes: + +![bucket](bucket.png) + +The bucket is divided into multiple columnFamilies by column cluster. When there is only one columnFamily, it automatically degenerates into the native bucket table. + +![file-group](file-group.png) + +### Specifying column families when creating a table +In the table creation statement, the column family division is specified in the options/tblproperties attributes. +Column family attributes are specified in key-value form: +* Key is the column family name, in the format hoodie.colFamily.<columnFamilyName>. +* Value is the content of the column family: all the columns included in the column family plus the precombine field. Format: "col1,col2,...,colN;precombineCol", where the column family list and the preCombine field are separated by ";" and the columns within the list are separated by ",". + +Constraints: the column family list must contain the primary key, and the columns of different column families cannot overlap except for the primary key. The preCombine field is optional; if not specified, it defaults to the primary key. + +After the table is created, the column family attributes are persisted to hoodie's metadata for subsequent use. 
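The key/value format and constraints above can be made concrete with a small sketch. The following is illustrative Python under assumed names (`parse_column_family`, `validate_families`); it is not Hudi's actual implementation:

```python
# Illustrative sketch only: parse "col1,col2,...,colN;precombineCol" values
# and enforce the constraints stated above. Function names are hypothetical.

def parse_column_family(value: str, primary_key: str):
    """Split a column-family value into (columns, precombine field).

    The precombine field is optional and defaults to the primary key."""
    parts = value.split(";")
    columns = [c.strip() for c in parts[0].split(",") if c.strip()]
    precombine = parts[1].strip() if len(parts) > 1 and parts[1].strip() else primary_key
    if primary_key not in columns:
        raise ValueError("a column family must contain the primary key")
    return columns, precombine

def validate_families(families: dict, primary_key: str):
    """Columns of different families must not overlap, except for the primary key."""
    seen = set()
    for name, (columns, _) in families.items():
        for col in columns:
            if col == primary_key:
                continue
            if col in seen:
                raise ValueError(f"column '{col}' appears in more than one family")
            seen.add(col)

families = {
    "cf1": parse_column_family("id,name,age;ts", "id"),
    "cf2": parse_column_family("id,addr", "id"),  # precombine defaults to 'id'
}
validate_families(families, "id")
print(families["cf2"])  # (['id', 'addr'], 'id')
```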
+ +### Adding and deleting column families in an existing table +Use the SQL ALTER command to modify the column family attributes and persist them: +* Execute ALTER TABLE table_name SET TBLPROPERTIES ('hoodie.columnFamily.k'='a,b,c;a'); to add a new column family. +* Execute ALTER TABLE table_name UNSET TBLPROPERTIES('hoodie.columnFamily.k'); to delete a column family. + +The specific steps are as follows: +1. Execute the ALTER command to modify the column family. +2. Verify that the column family modification is legal. It must meet the following conditions, otherwise verification fails: +* The name of an existing column family cannot be modified. +* Columns belonging to other column families cannot be moved into new column families. +* A newly created column family must meet the format requirements from the previous section. +3. Save the modified column family to the .hoodie directory. + +### Writing data +The Hudi kernel divides t
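The verification rules above can be sketched as follows (hypothetical Python; `validate_alter_add` and its parameters are illustrative, not Hudi's API):

```python
# Hypothetical sketch of the three ALTER verification rules listed above.

def validate_alter_add(existing: dict, name: str, new_columns: list, pk: str):
    """existing maps family name -> column list; pk is the table's primary key."""
    if name in existing:
        # Rule 1: an existing column family cannot be modified or renamed.
        raise ValueError(f"column family '{name}' already exists")
    if pk not in new_columns:
        # Rule 3: a new family must meet the format requirements (contain the PK).
        raise ValueError("a new column family must contain the primary key")
    taken = {c for cols in existing.values() for c in cols if c != pk}
    overlap = taken & {c for c in new_columns if c != pk}
    if overlap:
        # Rule 2: columns of other families cannot be moved into a new family.
        raise ValueError(f"columns already owned by another family: {sorted(overlap)}")

existing = {"cf1": ["id", "name"]}
validate_alter_add(existing, "cf2", ["id", "age"], "id")  # passes all three rules
print("cf2 accepted")
```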
Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]
wombatu-kun commented on code in PR #11559: URL: https://github.com/apache/hudi/pull/11559#discussion_r1665207108 ## rfc/rfc-80/rfc-80.md: ## @@ -0,0 +1,161 @@
Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]
wombatu-kun commented on code in PR #11559: URL: https://github.com/apache/hudi/pull/11559#discussion_r1665204402 ## rfc/rfc-80/rfc-80.md: ## @@ -0,0 +1,161 @@ +After splitting the fileGroup by columnFamily, the naming rules for base files and log files change. We add the cfName suffix to all file names to facilitate Hudi itself to distinguish column families. The addition of this suffix is compatible with Hudi's original naming method and has no conflict. + Review Comment: Yes, it will be the part of the write token. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
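The answer above indicates the cfName suffix rides along as part of the write token. A toy sketch of such a scheme (purely illustrative Python; the delimiter and helper name are assumptions, not Hudi's actual file-name format):

```python
# Toy illustration: embed the column-family name in the write token so a
# base file name still splits into the usual fileId/writeToken/instantTime
# fields. The "-<cfName>" delimiter is an assumption for this sketch.

def base_file_name(file_id, write_token, instant_time, cf_name=None, ext="parquet"):
    token = f"{write_token}-{cf_name}" if cf_name else write_token
    return f"{file_id}_{token}_{instant_time}.{ext}"

name = base_file_name("fg1-0", "1-0-1", "20240703101530000", cf_name="cf1")
print(name)  # fg1-0_1-0-1-cf1_20240703101530000.parquet
# Splitting on '_' still yields the same three fields as the original scheme:
assert len(name[: -len(".parquet")].split("_")) == 3
```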
Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]
hudi-bot commented on PR #11553: URL: https://github.com/apache/hudi/pull/11553#issuecomment-2208210061 ## CI report: * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN * c88af1906b580dd3d1497d48daa072728d6f8127 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24704) * cc375bc8ee00bee501b2937dbb6f2054c0fbe2d4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24705) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]
hudi-bot commented on PR #11553: URL: https://github.com/apache/hudi/pull/11553#issuecomment-2208196940 ## CI report: * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN * c88af1906b580dd3d1497d48daa072728d6f8127 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24704) * cc375bc8ee00bee501b2937dbb6f2054c0fbe2d4 UNKNOWN
Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]
hudi-bot commented on PR #11556: URL: https://github.com/apache/hudi/pull/11556#issuecomment-2208144242 ## CI report: * 77a10246fe Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24702)
Re: [PR] [MINOR] if sync mode is glue, fix the sync tool class [hudi]
hudi-bot commented on PR #11543: URL: https://github.com/apache/hudi/pull/11543#issuecomment-2208144158 ## CI report: * b1b0628d83c17467402de524a54829925aec9925 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24703)
Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]
hudi-bot commented on PR #11556: URL: https://github.com/apache/hudi/pull/11556#issuecomment-2208138126 ## CI report: * 77a10246fec770a8e0f3bfa1fe2fa4d3ffee33d1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24702) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24700) * 77a10246fe UNKNOWN
Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]
hudi-bot commented on PR #11553: URL: https://github.com/apache/hudi/pull/11553#issuecomment-2208138089 ## CI report: * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN * c88af1906b580dd3d1497d48daa072728d6f8127 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24704)
Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]
danny0405 commented on code in PR #11545: URL: https://github.com/apache/hudi/pull/11545#discussion_r1665134216 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/EightToSevenDowngradeHandler.java: ## @@ -32,6 +39,28 @@ public class EightToSevenDowngradeHandler implements DowngradeHandler { @Override public Map downgrade(HoodieWriteConfig config, HoodieEngineContext context, String instantTime, SupportsUpgradeDowngrade upgradeDowngradeHelper) { +HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder().setConf(context.getStorageConf().newInstance()).setBasePath(config.getBasePath()).build(); +List instants = metaClient.getActiveTimeline().getInstants(); +if (!instants.isEmpty()) { + context.map(instants, instant -> { +if (!instant.getFileName().contains("_")) { + return false; +} +try { + // Rename the metadata file name from the ${instant_time}_${completion_time}.action[.state] format in version 1.x to the ${instant_time}.action[.state] format in version 0.x. + StoragePath fromPath = new StoragePath(metaClient.getMetaPath(), instant.getFileName()); + StoragePath toPath = new StoragePath(metaClient.getMetaPath(), instant.getFileName().replaceAll("_\\d+", "")); + boolean success = metaClient.getStorage().rename(fromPath, toPath); + // TODO: We need to rename the action-related part of the metadata file name here when we bring separate action name for clustering/compaction in 1.x as well. + if (!success) { +throw new HoodieIOException("Error when rename the instant file: " + fromPath + " to: " + toPath); + } + return true; +} catch (IOException e) { + throw new HoodieException("Can not to complete the downgrade from version eight to version seven.", e); Review Comment: One thing that needs caution here: after the renaming, the file modification time changes, and that modification time is used as the completion time in the 0.x branch. -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
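For illustration, the renaming in the quoted diff can be mirrored in a few lines (Python used as a sketch; `downgrade_instant_name` is a hypothetical name, and a real handler would also need to account for the modification-time caveat raised above):

```python
import re

# Strip the "_<completionTime>" part of a 1.x completed-instant file name to
# recover the 0.x name, mirroring replaceAll("_\\d+", "") in the quoted diff.
def downgrade_instant_name(file_name: str) -> str:
    return re.sub(r"_\d+", "", file_name, count=1)

print(downgrade_instant_name("20240703101530111_20240703101545222.commit"))
# 20240703101530111.commit
print(downgrade_instant_name("20240703101530111.commit"))  # already 0.x: unchanged
```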
Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]
danny0405 commented on code in PR #11545: URL: https://github.com/apache/hudi/pull/11545#discussion_r1665133407 ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java: ## @@ -262,8 +259,12 @@ private String getPendingFileName() { } private String getCompleteFileName(String completionTime) { -ValidationUtils.checkArgument(!StringUtils.isNullOrEmpty(completionTime), "Completion time should not be empty"); -String timestampWithCompletionTime = timestamp + "_" + completionTime; +String timestampWithCompletionTime; +if (StringUtils.isNullOrEmpty(completionTime)) { Review Comment: When could the completion time be empty, then? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]
danny0405 commented on code in PR #11545: URL: https://github.com/apache/hudi/pull/11545#discussion_r1665132485 ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java: ## @@ -136,10 +136,7 @@ public HoodieInstant(StoragePathInfo pathInfo) { state = State.COMPLETED; } } - completionTime = timestamps.length > 1 - ? timestamps[1] - // for backward compatibility with 0.x release. - : state == State.COMPLETED ? pathInfo.getModificationTime() + "" : null; + completionTime = timestamps.length > 1 ? timestamps[1] : null; Review Comment: I think we still need to keep this logic; having the downgrade logic there is good, but we also need to stay compatible with 0.x for some code paths. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
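To make the two naming formats under discussion concrete, here is a hedged Python sketch (not Hudi's code) of reading a completion time from both schemes, including the 0.x fallback to file modification time that the quoted diff removes:

```python
# Sketch only: "100_105.commit" (1.x completed) vs "100.commit" (0.x).
# The modification-time fallback mimics the backward-compatibility branch
# the reviewer asks to keep; names here are illustrative.

def completion_time(file_name, modification_time=None, completed=True):
    stem = file_name.split(".")[0]          # instant part before the action
    timestamps = stem.split("_")
    if len(timestamps) > 1:                 # 1.x: completion time is embedded
        return timestamps[1]
    # 0.x completed instants: fall back to the file's modification time
    return modification_time if completed else None

print(completion_time("100_105.commit"))                 # 105
print(completion_time("100.commit", "107"))              # 107 (0.x fallback)
print(completion_time("100.inflight", completed=False))  # None
```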
Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]
hudi-bot commented on PR #11553: URL: https://github.com/apache/hudi/pull/11553#issuecomment-2208131882 ## CI report: * 3e526156f7bf7121008c4965bedeeadd969f798a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24687) * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN * c88af1906b580dd3d1497d48daa072728d6f8127 UNKNOWN
Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]
danny0405 commented on code in PR #11514: URL: https://github.com/apache/hudi/pull/11514#discussion_r1665124619 ## rfc/rfc-78/rfc-78.md: ## @@ -0,0 +1,220 @@ + +# RFC-76: [Bridge release for 1.x] + +## Proposers + +- @nsivabalan +- @vbalaji + +## Approvers + - @yihua + - @codope + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-7882 + +> Please keep the status updated in `rfc/README.md`. + +## Abstract + +[Hudi 1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md) is a powerful +re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming +years. It introduces a lot of differentiating features for Apache Hudi. We shipped beta releases that were meant for +enthusiastic developers/users to try out advanced features. But as we work towards 1.0 GA, we are proposing +a bridge release (0.16.0) for smoother migration for existing hudi users. + +## Objectives +The goal is to have a smooth migration experience for users from 0.x to 1.0. We plan to have a 0.16.0 bridge release, asking everyone to first migrate to 0.16.0 before they can upgrade to 1.x. + +- 1.x reader should be able to read 0.16.x tables w/o any loss in functionality and no data inconsistencies. +- 0.16.x should have read capability for 1.x tables w/ some limitations. For features ported over from 0.x, no loss in functionality should be guaranteed. But for new features that were introduced in 1.x, we may not be able to support all of them. We will call out which new features may not work with a 0.16.x reader. In this case, we explicitly request users not to turn on these features till readers are completely on 1.x. +- Document upgrade steps from 0.16.x to 1.x with limited user-perceived latency. This will be an auto upgrade, but document clearly what needs to be done. +- Downgrade from 1.x to 0.16.x documented with callouts on any functionality. 
+ +### Considerations when choosing Migration strategy +- While migration is happening, we want to allow readers to continue reading data. This means we cannot employ a stop-the-world strategy when we are migrating. +All the actions that we perform as part of the table upgrade should not have any side effects that break snapshot isolation for readers. +- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do not want to add read support for very old versions of hudi in 1.x (e.g. 0.7.0). +- So, in an effort to bring everyone to the latest hudi versions, the 1.x reader will have full read capabilities for 0.16.x, but for older hudi versions, the 1.x reader may not have full reader support. +The recommended guideline is to upgrade all readers and writers to 0.16.x and then slowly start upgrading to 1.x (readers followed by writers). + +Before we dive in further, let's understand the format changes: + +## Format changes +### Table properties +- Payload class ➝ payload type. +- New metadata partitions could be added (optionally enabled) + +### MDT changes +- New MDT partitions are available in 1.x. MDT schema upgraded. +- RLI schema is upgraded to hold row position + +### Timeline: +- [storage changes] Completed write commits have completion times in the file name. +- [storage changes] Completed and inflight write commits are in Avro format; they were JSON in 0.x. +- We are switching the action type for clustering from “replace commit” to “cluster”. +- Similarly, for completed compaction, we are switching from “commit” to “compaction” in an effort to standardize actions for a given write operation. +- [storage changes] Timeline ➝ LSM timeline. There is no archived timeline in 1.x +- [In-memory changes] HoodieInstant changes due to presence of completion time for completed HoodieInstants. + +### Filegroup/FileSlice changes: +- Log files contain delta commit time instead of base instant time. +- Log appends are disabled in 1.x. 
In other words, each log block is already appended to a new log file. +- File Slice determination logic for log files changed (in 0.x, we have the base instant time in log files and it's straightforward. In 1.x, we find the completion time for a log file and then find the base instant time (parsed from base files) which has the highest value less than the completion time of the log file). +- Log file ordering within a file slice (in 0.x, we use base instant time ➝ log file versions ➝ write token to order different log files; in 1.x, we will be using completion time to order). + +### Log format changes: +- We have added new header types in 1.x. (IS_PARTIAL) + +## Changes to be ported over 0.16.x to support reading 1.x tables +### What will be supported +- For features introduced in 0.x, and tables written in 1.x, a 0.16.0 reader should be able to provide consistent reads w/o any breakage. +### What will not be supported +- A 0.16 writer cannot write to a table that has been upgraded-to/created usin
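The 1.x file-slice rule quoted above ("the base instant time with the highest value less than the log file's completion time") is essentially a predecessor search. A minimal illustrative sketch, with names of my own choosing:

```python
from bisect import bisect_left

# Sketch of the 1.x assignment rule: pick the greatest base instant time that
# is strictly less than the log file's completion time. Names are illustrative.
def assign_file_slice(base_instants, log_completion_time):
    ordered = sorted(base_instants)
    i = bisect_left(ordered, log_completion_time)
    return ordered[i - 1] if i > 0 else None

bases = ["100", "200", "300"]
print(assign_file_slice(bases, "250"))  # 200
print(assign_file_slice(bases, "300"))  # 200 (strictly less than)
print(assign_file_slice(bases, "050"))  # None (no eligible base instant)
```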
Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]
danny0405 commented on code in PR #11514: URL: https://github.com/apache/hudi/pull/11514#discussion_r1665122005 ## rfc/rfc-78/rfc-78.md: ## @@ -0,0 +1,220 @@ +- [In-memory changes] HoodieInstant changes due to presence of completion time for completed HoodieInstants. Review Comment: We do not introduce the completion time based inc queries for Spark yet, but for the GA release, we might need to have a compatible solution for migration. 
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]
danny0405 commented on code in PR #11514: URL: https://github.com/apache/hudi/pull/11514#discussion_r1665122268 ## rfc/rfc-78/rfc-78.md: ## @@ -0,0 +1,220 @@ + +# RFC-78: [Bridge release for 1.x] + +## Proposers + +- @nsivabalan +- @vbalaji + +## Approvers + - @yihua + - @codope + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-7882 + +> Please keep the status updated in `rfc/README.md`. + +## Abstract + +[Hudi 1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md) is a powerful +re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming +years. It introduces a lot of differentiating features for Apache Hudi. We released beta releases which were meant for +enthusiastic developers/users to try out the advanced features. But as we work towards 1.0 GA, we are proposing +a bridge release (0.16.0) for smoother migration for existing hudi users. + +## Objectives +The goal is to have a smooth migration experience for users from 0.x to 1.0. We plan to have a 0.16.0 bridge release, asking everyone to first migrate to 0.16.0 before they can upgrade to 1.x. + +- The 1.x reader should be able to read 0.16.x tables w/o any loss in functionality and no data inconsistencies. +- 0.16.x should have read capability for 1.x tables w/ some limitations. For features ported over from 0.x, no loss in functionality should be guaranteed. But for new features that were introduced in 1.x, we may not be able to support all of them. We will call out which new features may not work with the 0.16.x reader. In this case, we explicitly request users not to turn on these features till readers are completely on 1.x. +- Document upgrade steps from 0.16.x to 1.x with limited user-perceived latency. This will be an auto upgrade, but document clearly what needs to be done. +- Downgrade from 1.x to 0.16.x documented with call-outs on any functionality limitations.
+ +### Considerations when choosing Migration strategy +- While migration is happening, we want to allow readers to continue reading data. This means we cannot employ a stop-the-world strategy when we are migrating. +All the actions that we perform as part of the table upgrade should not have any side-effects of breaking snapshot isolation for readers. +- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do not want to add read support for very old versions of hudi in 1.x (e.g. 0.7.0). +- So, in an effort to bring everyone to the latest hudi versions, the 1.x reader will have full read capabilities for 0.16.x, but for older hudi versions, the 1.x reader may not have full reader support. +The recommended guideline is to upgrade all readers and writers to 0.16.x, and then slowly start upgrading to 1.x (readers followed by writers). + +Before we dive in further, let's understand the format changes: + +## Format changes +### Table properties +- Payload class ➝ payload type. +- New metadata partitions could be added (optionally enabled) + +### MDT changes +- New MDT partitions are available in 1.x. MDT schema upgraded. +- RLI schema is upgraded to hold row position + +### Timeline: +- [storage changes] Completed write commits have completion times in the file name. +- [storage changes] Completed and inflight write commits are in avro format, which were json in 0.x. +- We are switching the action type for clustering from “replace commit” to “cluster”. +- Similarly, for completed compaction, we are switching from “commit” to “compaction” in an effort to standardize actions for a given write operation. +- [storage changes] Timeline ➝ LSM timeline. There is no archived timeline in 1.x +- [In-memory changes] HoodieInstant changes due to presence of completion time for completed HoodieInstants. + +### Filegroup/FileSlice changes: +- Log files contain delta commit time instead of base instant time. +- Log appends are disabled in 1.x.
In other words, each log block is already appended to a new log file. Review Comment: +1
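The FileSlice changes quoted above can be illustrated with a small sketch. This is NOT Hudi source code: the log-file name pattern, version counter, and write token below are assumptions for illustration; the point shown is that 1.x names log files by delta commit time and, with appends disabled, rolls every log block into a fresh log file.

```python
# Illustrative sketch (NOT Hudi source code) of 1.x log-file naming:
# delta commit time in the name, and a new log file per block (no appends).
class LogFileWriter1x:
    def __init__(self, file_id: str, write_token: str = "0-1-0"):
        self.file_id = file_id
        self.write_token = write_token  # assumed token format
        self.version = 0

    def next_log_file(self, delta_commit_time: str) -> str:
        # Appends are disabled in 1.x: every block rolls to a new log file.
        self.version += 1
        return f".{self.file_id}_{delta_commit_time}.log.{self.version}_{self.write_token}"
```

Under this model a 0.x reader that keys log files by base instant time cannot slice 1.x file groups, which is the compatibility gap the bridge release addresses.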
Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]
danny0405 commented on code in PR #11514: URL: https://github.com/apache/hudi/pull/11514#discussion_r1665121084 ## rfc/rfc-78/rfc-78.md: ## @@ -0,0 +1,220 @@
+ +### Considerations when choosing Migration strategy +- While migration is happening, we want to allow readers to continue reading data. This means we cannot employ a stop-the-world strategy when we are migrating. +All the actions that we perform as part of the table upgrade should not have any side-effects of breaking snapshot isolation for readers. +- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do not want to add read support for very old versions of hudi in 1.x (e.g. 0.7.0). +- So, in an effort to bring everyone to the latest hudi versions, the 1.x reader will have full read capabilities for 0.16.x, but for older hudi versions, the 1.x reader may not have full reader support. +The recommended guideline is to upgrade all readers and writers to 0.16.x, and then slowly start upgrading to 1.x (readers followed by writers). + +Before we dive in further, let's understand the format changes: + +## Format changes +### Table properties +- Payload class ➝ payload type. Review Comment: Might not be related, but should `hoodie.record.merge.mode` be a table config instead of a write config?
(hudi-rs) branch main updated: feat: implement datafusion API using ParquetExec (#35)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/hudi-rs.git The following commit(s) were added to refs/heads/main by this push: new e8fde26 feat: implement datafusion API using ParquetExec (#35) e8fde26 is described below commit e8fde26df8cdd5355aacce4232138222ce00baf4 Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com> AuthorDate: Wed Jul 3 23:59:11 2024 -0500 feat: implement datafusion API using ParquetExec (#35) - upgrade arrow from `50` to `52.0.0` - upgrade datafusion `35` to `39.0.0` - leverage `ParquetExec` for implementing TableProvider for Hudi in datafusion - add `hoodie.read.input.partitions` config --- Cargo.toml | 38 +++--- crates/core/src/config/mod.rs| 118 ++ crates/core/src/lib.rs | 3 +- crates/core/src/storage/mod.rs | 8 +- crates/core/src/storage/utils.rs | 47 ++- crates/core/src/table/mod.rs | 17 ++- crates/datafusion/Cargo.toml | 8 +- crates/datafusion/src/lib.rs | 261 ++- python/Cargo.toml| 2 +- python/hudi/_internal.pyi| 3 +- python/hudi/_utils.py| 23 python/hudi/table.py | 10 +- python/src/lib.rs| 12 +- python/tests/test_table_read.py | 4 +- 14 files changed, 344 insertions(+), 210 deletions(-) diff --git a/Cargo.toml b/Cargo.toml index 1b66057..82f0383 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -30,27 +30,27 @@ rust-version = "1.75.0" [workspace.dependencies] # arrow -arrow = { version = "50", features = ["pyarrow"] } -arrow-arith = { version = "50" } -arrow-array = { version = "50", features = ["chrono-tz"] } -arrow-buffer = { version = "50" } -arrow-cast = { version = "50" } -arrow-ipc = { version = "50" } -arrow-json = { version = "50" } -arrow-ord = { version = "50" } -arrow-row = { version = "50" } -arrow-schema = { version = "50" } -arrow-select = { version = "50" } -object_store = { version = "0.9.1" } -parquet = { version = "50" } +arrow = { version = "52.0.0", features = ["pyarrow"]} +arrow-arith = { version = 
"52.0.0" } +arrow-array = { version = "52.0.0", features = ["chrono-tz"] } +arrow-buffer = { version = "52.0.0" } +arrow-cast = { version = "52.0.0" } +arrow-ipc = { version = "52.0.0" } +arrow-json = { version = "52.0.0" } +arrow-ord = { version = "52.0.0" } +arrow-row = { version = "52.0.0" } +arrow-schema = { version = "52.0.0" } +arrow-select = { version = "52.0.0" } +object_store = { version = "0.10.1" } +parquet = { version = "52.0.0" } # datafusion -datafusion = { version = "35" } -datafusion-expr = { version = "35" } -datafusion-common = { version = "35" } -datafusion-proto = { version = "35" } -datafusion-sql = { version = "35" } -datafusion-physical-expr = { version = "35" } +datafusion = { version = "39.0.0" } +datafusion-expr = { version = "39.0.0" } +datafusion-common = { version = "39.0.0" } +datafusion-proto = { version = "39.0.0" } +datafusion-sql = { version = "39.0.0" } +datafusion-physical-expr = { version = "39.0.0" } # serde serde = { version = "1.0.203", features = ["derive"] } diff --git a/crates/core/src/config/mod.rs b/crates/core/src/config/mod.rs new file mode 100644 index 000..3322df3 --- /dev/null +++ b/crates/core/src/config/mod.rs @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. 
See the License for the + * specific language governing permissions and limitations + * under the License. + */ +use std::collections::HashMap; + +use anyhow::{anyhow, Context, Result}; + +pub trait OptionsParser { +    type Output; + +    fn parse_value(&self, options: &HashMap<String, String>) -> Result<Self::Output>; + +    fn parse_value_or_default(&self, options: &HashMap<String, String>) -> Self::Output; +} + +#[derive(Clone, Debug, PartialEq, Eq, Hash)] +pub enum HudiConfig { +    ReadInputPartitions, +} + +#[derive(Debug)] +pub enum HudiConfigValue { +    Integer(isize), +} + +impl HudiConfigValue { +    pub fn cast<T: From<isize> + TryFrom<isize> + std::fmt::Debug>(&self) -> T { +        match self { +            HudiConfigValue::Integer(value) => T::try_from(*value).unwrap_or_else(|_| { +                panic!("Failed to convert isize to {}", std::any::type_name::<T>()) +
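The `OptionsParser` pattern in the commit above (parse a typed value from a string option map, or fall back to a default) can be mirrored in a short Python analogue. This is illustrative only, not part of hudi-rs; the config key comes from the commit message (`hoodie.read.input.partitions`), and the default value of 0 is an assumption.

```python
# Illustrative Python analogue (NOT hudi-rs code) of the OptionsParser pattern:
# parse a typed config value from string options, falling back to a default.
from typing import Dict

class ReadInputPartitions:
    key = "hoodie.read.input.partitions"  # key from the commit message
    default = 0                           # assumed default

    def parse_value(self, options: Dict[str, str]) -> int:
        # Strict parse: missing key or a non-integer value is an error.
        if self.key not in options:
            raise KeyError(f"config '{self.key}' not found")
        return int(options[self.key])

    def parse_value_or_default(self, options: Dict[str, str]) -> int:
        try:
            return self.parse_value(options)
        except (KeyError, ValueError):
            return self.default
```

Keeping the strict and defaulting paths separate, as the Rust trait does, lets callers choose between failing fast on bad user input and tolerating absent options.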
Re: [PR] feat: implement datafusion API using ParquetExec [hudi-rs]
xushiyan merged PR #35: URL: https://github.com/apache/hudi-rs/pull/35
[jira] [Commented] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862922#comment-17862922 ] Geser Dugarov commented on HUDI-7938: - [~yihua] , if you don't mind, could you please clarify what to do with the registration of the Hudi serializer in Spark? > NullPointerException during read from PySpark > - > > Key: HUDI-7938 > URL: https://issues.apache.org/jira/browse/HUDI-7938 > Project: Apache Hudi > Issue Type: Bug >Reporter: Geser Dugarov >Assignee: Geser Dugarov >Priority: Major > > HUDI-7567 Add schema evolution to the filegroup reader (#10957), > but broke integration with PySpark. > When trying to call > {quote}df_load = > spark.read.format("org.apache.hudi").load(tmp_dir_path) > df_load.collect() > {quote} > > got: > > {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 > (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException > at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842) > at > org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73) > at > org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36) > at > org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58) > at > org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125) > at >
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:331) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) > at > org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > at org.apache.spark.scheduler.Task.run(Task.scala:139) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > {quote} > Spark 3.4.3 was used. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862921#comment-17862921 ] Geser Dugarov commented on HUDI-7938: - To support running from PySpark without setting spark.kryo.registrator, this MR has been landed: [https://github.com/apache/hudi/pull/11355] But after [https://github.com/apache/hudi/pull/10957] landed, we need to set it again. For now, I don't know whether we should make this configuration mandatory or make some changes in the code. Leaving this task as it is for some time.
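The comments on HUDI-7938 above indicate that, after #10957 landed, PySpark reads again require registering Hudi's Kryo serializers explicitly. A typical Spark configuration fragment would look like the following sketch (illustrative; the registrar class name is an assumption and should be verified against the Hudi release in use):

```properties
# Illustrative spark-defaults.conf fragment (verify class names for your Hudi version)
spark.serializer         org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator   org.apache.spark.HoodieSparkKryoRegistrar
```

The same pair can be passed as `--conf` flags to spark-submit or set on the `SparkSession` builder before any Hudi read.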
Re: [PR] [HUDI-7758] Only consider files in Hudi partitions when initializing MDT [hudi]
hudi-bot commented on PR #11219: URL: https://github.com/apache/hudi/pull/11219#issuecomment-2208095346 ## CI report: * 09e49d7c4856c6baf4089b538784f2d6cc7b143a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24701) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark
[ https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geser Dugarov updated HUDI-7938: Status: Open (was: In Progress)
[jira] [Assigned] (HUDI-7949) insert into hudi table with columns specified(reordered and not in table schema order) throws exception
[ https://issues.apache.org/jira/browse/HUDI-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KnightChess reassigned HUDI-7949: - Assignee: KnightChess > insert into hudi table with columns specified(reordered and not in table > schema order) throws exception > --- > > Key: HUDI-7949 > URL: https://issues.apache.org/jira/browse/HUDI-7949 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: KnightChess >Assignee: KnightChess >Priority: Major > > https://github.com/apache/hudi/issues/11552
Re: [PR] [MINOR] if sync mode is glue, fix the sync tool class [hudi]
hudi-bot commented on PR #11543: URL: https://github.com/apache/hudi/pull/11543#issuecomment-2208090398 ## CI report: * 90ef8064511b401f50d1f8796f75dd7bde7b155e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24648) * b1b0628d83c17467402de524a54829925aec9925 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24703)
Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]
hudi-bot commented on PR #11556: URL: https://github.com/apache/hudi/pull/11556#issuecomment-2208085682 ## CI report: * 77a10246fec770a8e0f3bfa1fe2fa4d3ffee33d1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24702) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24700)
Re: [PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]
hudi-bot commented on PR #11563: URL: https://github.com/apache/hudi/pull/11563#issuecomment-2208085724 ## CI report: * d7a6c5a6d873b6d07e8e3f0a9b15040dd1942d59 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24699)
Re: [PR] [MINOR] if sync mode is glue, fix the sync tool class [hudi]
hudi-bot commented on PR #11543: URL: https://github.com/apache/hudi/pull/11543#issuecomment-2208085624 ## CI report: * 90ef8064511b401f50d1f8796f75dd7bde7b155e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24648) * b1b0628d83c17467402de524a54829925aec9925 UNKNOWN
Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]
hudi-bot commented on PR #11534: URL: https://github.com/apache/hudi/pull/11534#issuecomment-2208085580 ## CI report: * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN * ea4e4adbc06a4be8f4cd739e6b1750927b284f63 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24696)
Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]
watermelon12138 commented on PR #11545: URL: https://github.com/apache/hudi/pull/11545#issuecomment-2208074798 @danny0405 Hi, all checks have passed and all suggestions have been resolved.
Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]
watermelon12138 commented on PR #11545: URL: https://github.com/apache/hudi/pull/11545#issuecomment-2208073498 > @watermelon12138 would you mind to fix the compile errors: https://github.com/apache/hudi/actions/runs/9756232191/job/26926142959?pr=11545 @danny0405 @balaji-varadarajan
Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]
hudi-bot commented on PR #11545: URL: https://github.com/apache/hudi/pull/11545#issuecomment-2208022864 ## CI report: * 0e3cb49fb72bdc14dee9e67fe0aaeb0d271608f2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24698)
Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]
hudi-bot commented on PR #11556: URL: https://github.com/apache/hudi/pull/11556#issuecomment-2208023157 ## CI report: * 192707054c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24691) * 77a10246fec770a8e0f3bfa1fe2fa4d3ffee33d1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24702)
Re: [PR] [HUDI-7758] Only consider files in Hudi partitions when initializing MDT [hudi]
hudi-bot commented on PR #11219: URL: https://github.com/apache/hudi/pull/11219#issuecomment-2208019950 ## CI report: * ef710f1f1e981fc83f69a3e2db164aa0c139e0c5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24013) * 09e49d7c4856c6baf4089b538784f2d6cc7b143a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24701)
Re: [PR] [MINOR] if sync mode is glue, fix the sync tool class [hudi]
prabodh1194 commented on code in PR #11543: URL: https://github.com/apache/hudi/pull/11543#discussion_r1665053757 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -60,13 +60,13 @@ import org.apache.hudi.sync.common.HoodieSyncConfig import org.apache.hudi.sync.common.util.SyncUtilHelpers import org.apache.hudi.sync.common.util.SyncUtilHelpers.getHoodieMetaSyncException import org.apache.hudi.util.SparkKeyGenUtils - import org.apache.avro.Schema import org.apache.avro.generic.GenericData import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileSystem, Path} import org.apache.hadoop.hive.conf.HiveConf import org.apache.hadoop.hive.shims.ShimLoader +import org.apache.hudi.hive.ddl.HiveSyncMode Review Comment: done now
Re: [PR] [HUDI-7758] Only consider files in Hudi partitions when initializing MDT [hudi]
hudi-bot commented on PR #11219: URL: https://github.com/apache/hudi/pull/11219#issuecomment-2207981834 ## CI report: * ef710f1f1e981fc83f69a3e2db164aa0c139e0c5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24013) * 09e49d7c4856c6baf4089b538784f2d6cc7b143a UNKNOWN
Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]
hudi-bot commented on PR #11556: URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207984886 ## CI report: * 192707054c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24691) * 77a10246fec770a8e0f3bfa1fe2fa4d3ffee33d1 UNKNOWN
Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]
hudi-bot commented on PR #11553: URL: https://github.com/apache/hudi/pull/11553#issuecomment-2207962441 ## CI report: * 3e526156f7bf7121008c4965bedeeadd969f798a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24687) * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN
Re: [PR] [HUDI-7709] ClassCastException while reading the data using `TimestampBasedKeyGenerator` [hudi]
geserdugarov commented on code in PR #11501: URL: https://github.com/apache/hudi/pull/11501#discussion_r1665028011 ## hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java: ## @@ -360,13 +363,22 @@ protected HoodieTimeline getActiveTimeline() { } private Object[] parsePartitionColumnValues(String[] partitionColumns, String partitionPath) { -Object[] partitionColumnValues = doParsePartitionColumnValues(partitionColumns, partitionPath); -if (shouldListLazily && partitionColumnValues.length != partitionColumns.length) { - throw new HoodieException("Failed to parse partition column values from the partition-path:" - + " likely non-encoded slashes being used in partition column's values. You can try to" - + " work this around by switching listing mode to eager"); +HoodieTableConfig tableConfig = metaClient.getTableConfig(); +Object[] partitionColumnValues; +if (null != tableConfig.getKeyGeneratorClassName() +&& tableConfig.getKeyGeneratorClassName().equals(KeyGeneratorType.TIMESTAMP.getClassName()) +&& tableConfig.propsMap().get(TimestampKeyGeneratorConfig.TIMESTAMP_TYPE_FIELD.key()).matches("SCALAR|UNIX_TIMESTAMP|EPOCHMILLISECONDS")) { + // For TIMESTAMP key generator when TYPE is SCALAR, UNIX_TIMESTAMP or EPOCHMILLISECONDS, + // we couldn't reconstruct initial partition column values from partition paths due to lost data after formatting in most cases + partitionColumnValues = new Object[partitionColumns.length]; Review Comment: I will check it.
Re: [PR] [HUDI-7709] ClassCastException while reading the data using `TimestampBasedKeyGenerator` [hudi]
yihua commented on code in PR #11501: URL: https://github.com/apache/hudi/pull/11501#discussion_r1665023864 ## hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java: Review Comment: Partition column values are empty. Can this cause the partition pruning to return wrong or empty results from `SparkHoodieTableFileIndex::listMatchingPartitionPaths`?
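The concern in the thread above, that partition column values formatted by a timestamp-based key generator (SCALAR, UNIX_TIMESTAMP, EPOCHMILLISECONDS types) cannot be reconstructed from the partition path, comes down to the formatting step being lossy. A minimal Python sketch illustrates this; the function name and date format are illustrative assumptions, not Hudi's actual API:

```python
from datetime import datetime, timezone

def partition_path_from_epoch_millis(epoch_millis: int, fmt: str = "%Y/%m/%d") -> str:
    """Loosely mimics a timestamp-based key generator formatting an
    epoch-millis value into a partition path (hypothetical helper)."""
    dt = datetime.fromtimestamp(epoch_millis / 1000.0, tz=timezone.utc)
    return dt.strftime(fmt)

# Two distinct source values collapse to the same partition path,
# so the original epoch-millis value cannot be recovered from the path.
a = partition_path_from_epoch_millis(1720000000000)
b = partition_path_from_epoch_millis(1720000005000)  # 5 seconds later
assert a == b == "2024/07/03"
```

Because the mapping is many-to-one, any reconstruction attempt must either fall back to empty values (as in the quoted patch) or accept lossy defaults.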
Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]
lokeshj1703 commented on PR #11553: URL: https://github.com/apache/hudi/pull/11553#issuecomment-2207900106 @hudi-bot run azure
Re: [PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]
hudi-bot commented on PR #11563: URL: https://github.com/apache/hudi/pull/11563#issuecomment-2207885328 ## CI report: * 16b1d5c2603ef3eb68a40bc14572751b787d8d2f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24694) * d7a6c5a6d873b6d07e8e3f0a9b15040dd1942d59 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24699)
Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]
hudi-bot commented on PR #11545: URL: https://github.com/apache/hudi/pull/11545#issuecomment-2207885219 ## CI report: * fe7aa032f4463035775029ad486ca73ea2d02ac0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24668) * 0e3cb49fb72bdc14dee9e67fe0aaeb0d271608f2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24698)
Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]
hudi-bot commented on PR #11539: URL: https://github.com/apache/hudi/pull/11539#issuecomment-2207885164 ## CI report: * 57e40251eba6a0d7dc68cd10b832478f4d2decb3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24697)
Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]
hudi-bot commented on PR #11534: URL: https://github.com/apache/hudi/pull/11534#issuecomment-2207885098 ## CI report: * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN * 7c75f078faf19390ceac585790181032570d184d Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24695) * ea4e4adbc06a4be8f4cd739e6b1750927b284f63 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24696)
Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]
hudi-bot commented on PR #11545: URL: https://github.com/apache/hudi/pull/11545#issuecomment-2207877464 ## CI report: * fe7aa032f4463035775029ad486ca73ea2d02ac0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24668) * 0e3cb49fb72bdc14dee9e67fe0aaeb0d271608f2 UNKNOWN
Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]
hudi-bot commented on PR #11553: URL: https://github.com/apache/hudi/pull/11553#issuecomment-2207877507 ## CI report: * 3e526156f7bf7121008c4965bedeeadd969f798a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24687) * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN
Re: [PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]
hudi-bot commented on PR #11563: URL: https://github.com/apache/hudi/pull/11563#issuecomment-2207877586 ## CI report: * 16b1d5c2603ef3eb68a40bc14572751b787d8d2f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24694) * d7a6c5a6d873b6d07e8e3f0a9b15040dd1942d59 UNKNOWN
Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]
hudi-bot commented on PR #11539: URL: https://github.com/apache/hudi/pull/11539#issuecomment-2207877428 ## CI report: * 88f5236331c0fdf66bca1617679abe2940f9e930 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24684) * 57e40251eba6a0d7dc68cd10b832478f4d2decb3 UNKNOWN
Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]
hudi-bot commented on PR #11534: URL: https://github.com/apache/hudi/pull/11534#issuecomment-2207877386 ## CI report: * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN * 7c75f078faf19390ceac585790181032570d184d Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24695) * ea4e4adbc06a4be8f4cd739e6b1750927b284f63 UNKNOWN
Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]
hudi-bot commented on PR #11553: URL: https://github.com/apache/hudi/pull/11553#issuecomment-2207871843 ## CI report: * 3e526156f7bf7121008c4965bedeeadd969f798a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24687) * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN
Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]
hudi-bot commented on PR #11534: URL: https://github.com/apache/hudi/pull/11534#issuecomment-2207871783 ## CI report: * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN * 7c75f078faf19390ceac585790181032570d184d Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24695)
Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]
hudi-bot commented on PR #11559: URL: https://github.com/apache/hudi/pull/11559#issuecomment-2207871874 ## CI report: * 206eabc0c6a752e7a1e1d2206db231bf9a831570 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24693)
[jira] [Updated] (HUDI-7950) Shade roaring bitmap dependency in root POM
[ https://issues.apache.org/jira/browse/HUDI-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7950: Reviewers: Lokesh Jain (was: Lokesh Jain) > Shade roaring bitmap dependency in root POM > --- > > Key: HUDI-7950 > URL: https://issues.apache.org/jira/browse/HUDI-7950 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0-beta2, 1.0.0, 0.15.1 > > > We should unify the shading rule of roaring bitmap dependency in the root POM > for consistency among bundles. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7950) Shade roaring bitmap dependency in root POM
[ https://issues.apache.org/jira/browse/HUDI-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7950: Reviewers: Lokesh Jain
Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]
danny0405 commented on code in PR #11559: URL: https://github.com/apache/hudi/pull/11559#discussion_r1664980551 ## rfc/rfc-80/rfc-80.md: ## @@ -0,0 +1,161 @@ + +# RFC-80: Support column families for wide tables + +## Proposers + +- @xiarixiaoyao +- @wombatu-kun + +## Approvers + - + - + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI- + +## Abstract + +In streaming processing, there are often scenarios where the table is widened. Currently, real-time widening is mainly implemented through Flink's multi-layer join; +Flink's join caches a large amount of data in the state backend. As the data set grows, the pressure on the Flink task state backend gradually increases, and the backend may even become unavailable. +In multi-layer join scenarios, this problem is even more pronounced. + +## Background +Currently, Hudi organizes data at fileGroup granularity. We further divide the fileGroup into column clusters, introducing the columnFamily concept. +Hudi files are organized according to the following rules: +the data in a partition is divided into buckets by hash; the files in each bucket are divided by columnFamily; multiple colFamily files in a bucket form a complete fileGroup; when there is only one columnFamily, it degenerates into the native Hudi bucket table. + +![table](table.png) + +After splitting the fileGroup by columnFamily, the naming rules for base files and log files change. We add a cfName suffix to all file names so that Hudi itself can distinguish column families. This suffix is compatible with Hudi's original naming scheme and introduces no conflicts. + +![filenames](filenames.png) + +## Implementation +Describe the new thing you want to do in appropriate detail, how it fits into the project architecture. +Provide a detailed description of how you intend to implement this feature. This may be fairly extensive and have large subsections of its own. 
+Or it may be a few sentences. Use judgement based on the scope of the change. + +### Constraints and Restrictions +1. The overall design relies on the lock-free concurrent writing feature of Hudi 1.0. +2. Lower versions of Hudi cannot read or write column family tables. +3. Only MOR bucketed tables support setting column families. +4. Column families do not support repartitioning and renaming. +5. Schema evolution does not take effect on the current column family table. +6. Like native bucket tables, clustering operations are not supported. + +### Model change +After the column family is introduced, the storage structure of the entire Hudi bucket table changes: + +![bucket](bucket.png) + +The bucket is divided into multiple columnFamilies by column cluster. When there is only one columnFamily, it automatically degenerates into the native bucket table. + +![file-group](file-group.png) + +### Specifying column families when creating a table +In the table creation statement, column family division is specified in the options/tblproperties attributes; +Column family attributes are specified in key-value form: +* Key is the column family name, in the format hoodie.colFamily.<name>. +* Value is the content of the column family: all columns included in the column family plus the precombine field. Format: "col1,col2...colN;precombineCol", where the column family list and the preCombine field are separated by ";" and the columns within the list are separated by ",". + +Constraints: The column family list must contain the primary key, and columns contained in different column families cannot overlap except for the primary key. The preCombine field does not need to be specified; if not specified, the primary key is used by default. + +After the table is created, the column family attributes are persisted to hoodie's metadata for subsequent use. 
+ +### Adding and deleting column families in an existing table +Use the SQL alter command to modify the column family attributes and persist them: +* Execute ALTER TABLE table_name SET TBLPROPERTIES ('hoodie.columnFamily.k'='a,b,c;a'); to add a new column family. +* Execute ALTER TABLE table_name UNSET TBLPROPERTIES('hoodie.columnFamily.k'); to delete a column family. + +Specific steps are as follows: +1. Execute the ALTER command to modify the column family. +2. Verify that the modified column family definition is legal. The modification must meet the following conditions, otherwise verification fails: +* The name of an existing column family cannot be modified. +* Columns in other column families cannot be moved into a new column family. +* A new column family must meet the format requirements from the previous chapter. +3. Save the modified column family to the .hoodie directory. + +### Writing data +The Hudi kernel divides the
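The column-family spec format and constraints quoted in the review hunk above (a "col1,col2...colN;precombineCol" value, primary key required in every family, no column overlap between families except the primary key, preCombine defaulting to the primary key) can be sketched as a small validator. Everything here is illustrative: the helper names and the example `hoodie.colFamily.*` keys are assumptions based on the RFC text, not Hudi APIs.

```python
def parse_column_family(value: str, primary_key: str):
    """Parse a spec of the form "col1,col2,...,colN;precombineCol".
    Hypothetical helper mirroring the RFC's described format."""
    cols_part, _, precombine = value.partition(";")
    columns = [c.strip() for c in cols_part.split(",") if c.strip()]
    precombine = precombine.strip() or primary_key  # default: primary key
    if primary_key not in columns:
        raise ValueError("column family list must contain the primary key")
    return columns, precombine

def validate_families(families: dict, primary_key: str):
    """Reject overlapping columns across families (primary key excepted)."""
    seen = {}
    for name, spec in families.items():
        cols, _ = parse_column_family(spec, primary_key)
        for col in cols:
            if col != primary_key and col in seen:
                raise ValueError(f"column '{col}' appears in both '{seen[col]}' and '{name}'")
            seen.setdefault(col, name)

families = {
    "hoodie.colFamily.user": "id,name,age;ts",
    "hoodie.colFamily.addr": "id,city,zip",  # preCombine omitted, defaults to primary key
}
validate_families(families, primary_key="id")
cols, precombine = parse_column_family(families["hoodie.colFamily.addr"], "id")
assert precombine == "id"
```

A spec that repeats a non-key column across two families, or omits the primary key, would raise a ValueError here, matching the verification step the RFC describes for ALTER TABLE changes.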
Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]
danny0405 commented on code in PR #11559: URL: https://github.com/apache/hudi/pull/11559#discussion_r1664981596 ## rfc/rfc-80/rfc-80.md:
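For illustration, the cfName file-name suffix described in the RFC's Background section could be modeled as below. The `fileId_writeToken_instantTime` layout and the suffix position are assumptions for illustration only; the RFC's filenames.png defines the actual scheme.

```python
def base_file_name(file_id: str, write_token: str, instant_time: str,
                   cf_name: str = None, ext: str = ".parquet") -> str:
    """Sketch of a base-file name with an optional column-family suffix.
    When cf_name is None (single column family), the name degenerates
    to the original scheme, matching the compatibility claim in the RFC."""
    name = f"{file_id}_{write_token}_{instant_time}"
    if cf_name:
        name += f"_{cf_name}"  # assumed suffix position
    return name + ext

assert base_file_name("f1", "1-0-1", "20240703", "cf1") == "f1_1-0-1_20240703_cf1.parquet"
assert base_file_name("f1", "1-0-1", "20240703") == "f1_1-0-1_20240703.parquet"
```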
Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]
danny0405 commented on code in PR #11559: URL: https://github.com/apache/hudi/pull/11559#discussion_r1664980071 ## rfc/rfc-80/rfc-80.md: (quoted RFC context trimmed; the comment is anchored at the constraints paragraph on primary keys, non-overlapping column families, and the default preCombine field) Review Comment: Is there any SQL syntax we can reference in industry? 
Like the cockroach db: https://www.cockroachlabs.com/docs/stable/column-families#:~:text=A%20column%20family%20is%20a,%2C%20UPDATE%20%2C%20and%20DELETE%20operations. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]
danny0405 commented on code in PR #11559: URL: https://github.com/apache/hudi/pull/11559#discussion_r1664978373 ## rfc/rfc-80/rfc-80.md: (quoted RFC context trimmed; the comment is anchored at constraint 1, "The overall design relies on the lock-free concurrent writing feature of Hudi 1.0.") Review Comment: It's not lock-free, it's just non-blocking; the Hudi table utilizes the lock to keep the instant time generation monotonically increasing.
Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]
danny0405 commented on code in PR #11559: URL: https://github.com/apache/hudi/pull/11559#discussion_r1664977832 ## rfc/rfc-80/rfc-80.md: (quoted RFC context trimmed; the comment is anchored at the Implementation section's template placeholder text, "Describe the new thing you want to do in appropriate detail, how it fits into the project architecture.") Review Comment: This needs to be elaborated.
Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]
danny0405 commented on code in PR #11559: URL: https://github.com/apache/hudi/pull/11559#discussion_r1664977594 ## rfc/rfc-80/rfc-80.md: (quoted RFC context trimmed; the comment is anchored at the file-naming change that adds the cfName suffix to base and log file names) Review Comment: It looks like the colFamilyName is part of the write token now?
Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]
lokeshj1703 commented on PR #11553: URL: https://github.com/apache/hudi/pull/11553#issuecomment-2207852135 @hudi-bot run azure
[jira] [Updated] (HUDI-7950) Shade roaring bitmap dependency in root POM
[ https://issues.apache.org/jira/browse/HUDI-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7950: Status: Patch Available (was: In Progress) > Shade roaring bitmap dependency in root POM > --- > > Key: HUDI-7950 > URL: https://issues.apache.org/jira/browse/HUDI-7950 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0-beta2, 1.0.0, 0.15.1 > > > We should unify the shading rule of roaring bitmap dependency in the root POM > for consistency among bundles. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]
hudi-bot commented on PR #11563: URL: https://github.com/apache/hudi/pull/11563#issuecomment-2207802336 ## CI report: * 16b1d5c2603ef3eb68a40bc14572751b787d8d2f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24694) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]
hudi-bot commented on PR #11534: URL: https://github.com/apache/hudi/pull/11534#issuecomment-2207801978 ## CI report: * a4b0e88de32cb689056c049fcf207b72a7df7fb4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24623) * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN * 7c75f078faf19390ceac585790181032570d184d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24695)
Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]
hudi-bot commented on PR #11534: URL: https://github.com/apache/hudi/pull/11534#issuecomment-2207782132 ## CI report: * a4b0e88de32cb689056c049fcf207b72a7df7fb4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24623) * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN * 7c75f078faf19390ceac585790181032570d184d UNKNOWN
[jira] [Updated] (HUDI-7937) Fix handling of decimals in StreamSync and Clustering
[ https://issues.apache.org/jira/browse/HUDI-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7937: - Labels: pull-request-available (was: ) > Fix handling of decimals in StreamSync and Clustering > - > > Key: HUDI-7937 > URL: https://issues.apache.org/jira/browse/HUDI-7937 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > When decimals are using a small precision, we need to write them in legacy > format to ensure all hudi components can read them back.
Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]
hudi-bot commented on PR #11534: URL: https://github.com/apache/hudi/pull/11534#issuecomment-2207762315 ## CI report: * a4b0e88de32cb689056c049fcf207b72a7df7fb4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24623) * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN
Re: [I] Does Hudi has the warm/cold data archive solution [hudi]
njalan closed issue #11457: Does Hudi has the warm/cold data archive solution URL: https://github.com/apache/hudi/issues/11457
Re: [PR] [HUDI-7921] Fixing file system view closures in MDT [hudi]
danny0405 commented on code in PR #11496: URL: https://github.com/apache/hudi/pull/11496#discussion_r1664947146 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -1082,8 +1083,10 @@ private HoodieData getFunctionalIndexUpdates(HoodieCommitMetadata HoodieIndexDefinition indexDefinition = getFunctionalIndexDefinition(indexPartition); List> partitionFileSlicePairs = new ArrayList<>(); HoodieTableFileSystemView fsView = HoodieTableMetadataUtil.getFileSystemView(dataMetaClient); +fileSystemViews.add(fsView); +HoodieTableFileSystemView finalFsView = fsView; commitMetadata.getPartitionToWriteStats().forEach((dataPartition, value) -> { - List fileSlices = getPartitionLatestFileSlicesIncludingInflight(dataMetaClient, Option.ofNullable(fsView), dataPartition); + List fileSlices = getPartitionLatestFileSlicesIncludingInflight(dataMetaClient, Option.ofNullable(finalFsView), dataPartition); Review Comment: Can we move the instantiation of `fsView` inside `getPartitionLatestFileSlicesIncludingInflight`. Then there is no need for the fs view cache.
Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]
danny0405 commented on issue #11554: URL: https://github.com/apache/hudi/issues/11554#issuecomment-2207698467 The cmd writes an empty data frame using spark writer: https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableDropPartitionCommand.scala, and that would trigger a batch sync of partitions with the https://github.com/apache/hudi/blob/master/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java
Re: [I] [SUPPORT] insert into hudi table with columns specified(reordered and not in table schema order) throws exception [hudi]
danny0405 commented on issue #11552: URL: https://github.com/apache/hudi/issues/11552#issuecomment-2207682862 @KnightChess Thanks so much for taking care of the fix
Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]
danny0405 commented on PR #11545: URL: https://github.com/apache/hudi/pull/11545#issuecomment-2207680792 @watermelon12138 would you mind fixing the compile errors: https://github.com/apache/hudi/actions/runs/9756232191/job/26926142959?pr=11545
Re: [PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]
hudi-bot commented on PR #11563: URL: https://github.com/apache/hudi/pull/11563#issuecomment-2207639885 ## CI report: * 16b1d5c2603ef3eb68a40bc14572751b787d8d2f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24694)
Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]
hudi-bot commented on PR #11559: URL: https://github.com/apache/hudi/pull/11559#issuecomment-2207639749 ## CI report: * 1c8b5ccd83bfd1861d050c47b78a4addd6e558a1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24685) * 206eabc0c6a752e7a1e1d2206db231bf9a831570 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24693)
Re: [PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]
hudi-bot commented on PR #11563: URL: https://github.com/apache/hudi/pull/11563#issuecomment-2207619588 ## CI report: * 16b1d5c2603ef3eb68a40bc14572751b787d8d2f UNKNOWN
Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]
hudi-bot commented on PR #11559: URL: https://github.com/apache/hudi/pull/11559#issuecomment-2207619441 ## CI report: * 1c8b5ccd83bfd1861d050c47b78a4addd6e558a1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24685) * 206eabc0c6a752e7a1e1d2206db231bf9a831570 UNKNOWN
Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]
hudi-bot commented on PR #11556: URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207598201 ## CI report: * 192707054c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24691)
[jira] [Assigned] (HUDI-7951) Classes using avro causing conflict in hudi-aws-bundle
[ https://issues.apache.org/jira/browse/HUDI-7951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Chang reassigned HUDI-7951: - Assignee: Shawn Chang > Classes using avro causing conflict in hudi-aws-bundle > -- > > Key: HUDI-7951 > URL: https://issues.apache.org/jira/browse/HUDI-7951 > Project: Apache Hudi > Issue Type: Bug >Reporter: Shawn Chang >Assignee: Shawn Chang >Priority: Major > Labels: pull-request-available > > Hudi 0.15 added some Hudi classes with avro usages > (ParquetTableSchemaResolver in this case), also had hudi-aws-bundle depend on > hudi-hadoop-common. hudi-aws-bundle won't relocate avro classes to be > compatible with hudi-spark. > > The issue would happen when using hudi-flink-bundle with hudi-aws-bundle. > hudi-flink-bundle has relocated avro classes and would cause class conflict: > {code:java} > java.lang.NoSuchMethodError: 'org.apache.parquet.schema.MessageType > org.apache.hudi.common.table.ParquetTableSchemaResolver.convertAvroSchemaToParquet(org.apache.hudi.org.apache.avro.Schema, > org.apache.hadoop.conf.Configuration)' > {code}
[jira] [Updated] (HUDI-7951) Classes using avro causing conflict in hudi-aws-bundle
[ https://issues.apache.org/jira/browse/HUDI-7951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7951: - Labels: pull-request-available (was: ) > Classes using avro causing conflict in hudi-aws-bundle > -- > > Key: HUDI-7951 > URL: https://issues.apache.org/jira/browse/HUDI-7951 > Project: Apache Hudi > Issue Type: Bug >Reporter: Shawn Chang >Priority: Major > Labels: pull-request-available > > Hudi 0.15 added some Hudi classes with avro usages > (ParquetTableSchemaResolver in this case), also had hudi-aws-bundle depend on > hudi-hadoop-common. hudi-aws-bundle won't relocate avro classes to be > compatible with hudi-spark. > > The issue would happen when using hudi-flink-bundle with hudi-aws-bundle. > hudi-flink-bundle has relocated avro classes and would cause class conflict: > {code:java} > java.lang.NoSuchMethodError: 'org.apache.parquet.schema.MessageType > org.apache.hudi.common.table.ParquetTableSchemaResolver.convertAvroSchemaToParquet(org.apache.hudi.org.apache.avro.Schema, > org.apache.hadoop.conf.Configuration)' > {code}
[PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]
CTTY opened a new pull request, #11563: URL: https://github.com/apache/hudi/pull/11563 ### Change Logs Hudi 0.15 added some Hudi classes with avro usages (e.g. `ParquetTableSchemaResolver`), also had `hudi-aws-bundle` depend on `hudi-hadoop-common`. `hudi-aws-bundle` won't relocate avro classes to be compatible with hudi-spark. The issue would happen when using hudi-flink-bundle with hudi-aws-bundle. hudi-flink-bundle has relocated avro classes and would cause class conflict: ``` java.lang.NoSuchMethodError: 'org.apache.parquet.schema.MessageType org.apache.hudi.common.table.ParquetTableSchemaResolver.convertAvroSchemaToParquet(org.apache.hudi.org.apache.avro.Schema, org.apache.hadoop.conf.Configuration)' ``` ### Impact none ### Risk level (write none, low medium or high below) low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-7951) Classes using avro causing conflict in hudi-aws-bundle
Shawn Chang created HUDI-7951: - Summary: Classes using avro causing conflict in hudi-aws-bundle Key: HUDI-7951 URL: https://issues.apache.org/jira/browse/HUDI-7951 Project: Apache Hudi Issue Type: Bug Reporter: Shawn Chang Hudi 0.15 added some Hudi classes with avro usages (ParquetTableSchemaResolver in this case), also had hudi-aws-bundle depend on hudi-hadoop-common. hudi-aws-bundle won't relocate avro classes to be compatible with hudi-spark. The issue would happen when using hudi-flink-bundle with hudi-aws-bundle. hudi-flink-bundle has relocated avro classes and would cause class conflict: {code:java} java.lang.NoSuchMethodError: 'org.apache.parquet.schema.MessageType org.apache.hudi.common.table.ParquetTableSchemaResolver.convertAvroSchemaToParquet(org.apache.hudi.org.apache.avro.Schema, org.apache.hadoop.conf.Configuration)' {code}
Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]
hudi-bot commented on PR #11556: URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207484212

## CI report:

* 23d89d4a510f44094d65a95a02490e5cd7a9b165 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24692)
* 192707054c UNKNOWN

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]
hudi-bot commented on PR #11556: URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207404846

## CI report:

* 23d89d4a510f44094d65a95a02490e5cd7a9b165 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24692)
Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]
hudi-bot commented on PR #11556: URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207397295

## CI report:

* 192707054c0d633621d2db4f706d6487974a74bb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24691) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24690)
* 23d89d4a510f44094d65a95a02490e5cd7a9b165 UNKNOWN
Re: [PR] [HUDI-7950] Shade roaring bitmap dependency in root POM [hudi]
hudi-bot commented on PR #11561: URL: https://github.com/apache/hudi/pull/11561#issuecomment-2207297970

## CI report:

* e73a045009286da007f0ade464d3e24d87d08c9d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24688)
Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]
hudi-bot commented on PR #11556: URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207297893

## CI report:

* 192707054c0d633621d2db4f706d6487974a74bb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24691) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24690)
Re: [I] [SUPPORT] - Performance Variation in Hudi 0.14 [hudi]
RuyRoaV commented on issue #11481: URL: https://github.com/apache/hudi/issues/11481#issuecomment-2207209448

Hi Aditya, I have tried out your recommendation and found the following:

**Using SIMPLE index**

The average execution time was reduced from 20 min to around 11 min, which is great. In the Spark UI screenshots you can see that a large percentage of the execution time is taken by a `countByKey at JavaPairRDD` action in the `SparkCommitUpsert` step, especially during the shuffle write.

![Screenshot 2024-07-03 at 16 44 57](https://github.com/apache/hudi/assets/173461014/deb0599e-00e0-4cab-a6b5-8d4dcb8fb557)
![Screenshot 2024-07-03 at 16 52 19](https://github.com/apache/hudi/assets/173461014/3a37ef31-cbbf-4425-9f0d-f2c96948c4e9)
![Screenshot 2024-07-03 at 16 52 46](https://github.com/apache/hudi/assets/173461014/1440ea2d-9bab-44b9-85a8-9395375abba9)

**We need to reduce the job runtime even further. Is there any other recommendation regarding the configurations we can set?**

We may try deactivating archival beyond the savepoint a bit later, but I am curious: why would that help improve performance?

**Using RECORD LEVEL index**

I replaced the index on a table whose upsert Glue job was already running in under 5 minutes. Overall, the job runtime has remained the same, with most of the time spent in `count at HoodieSparkSqlWriter.scala:1072` during `SparkCommitUpsert`. This is similar to the case presented when submitting this ticket. I'll try it with one of our long-running jobs and will let you know the outcome.

By the way, **is there a way to check the index type of a table?**

Thanks, best regards
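To make the two setups being compared concrete, here is a minimal sketch (not from the thread) of Hudi write options that select the index type via the documented `hoodie.index.type` config key. The table name, operation, and the surrounding Spark call are illustrative assumptions, not the reporter's actual job settings.

```python
# Sketch: Hudi write-option dicts for the two index types discussed above.
# "wide_table" and the upsert operation are hypothetical placeholders.
simple_index_opts = {
    "hoodie.table.name": "wide_table",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "SIMPLE",  # simple index, as tried first
}

record_index_opts = {
    "hoodie.table.name": "wide_table",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "RECORD_INDEX",              # record-level index
    "hoodie.metadata.record.index.enable": "true",    # record index lives in the metadata table
}

# In a Spark job these would be applied roughly as:
# df.write.format("hudi").options(**simple_index_opts).mode("append").save(table_path)
```

On the question of checking a table's index type: the index is a write-side config rather than an intrinsic table property, so one practical approach is to inspect the options the writing job passes (as above) rather than the table files themselves.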
Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]
hudi-bot commented on PR #11556: URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207181205

## CI report:

* 761085a6fa9cc6eeca493c1c116caea56b3693f8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24680)
* 192707054c0d633621d2db4f706d6487974a74bb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24691)
Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]
hudi-bot commented on PR #11556: URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207162234

## CI report:

* 761085a6fa9cc6eeca493c1c116caea56b3693f8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24680)
* 192707054c0d633621d2db4f706d6487974a74bb UNKNOWN
Re: [PR] refactor: implement datafusion API using ParquetExec [hudi-rs]
codecov[bot] commented on PR #35: URL: https://github.com/apache/hudi-rs/pull/35#issuecomment-2207161986

## [Codecov](https://app.codecov.io/gh/apache/hudi-rs/pull/35?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report

Attention: Patch coverage is `70.0%` with `9 lines` in your changes missing coverage. Please review.

> Project coverage is 84.81%. Comparing base [(`52a9245`)](https://app.codecov.io/gh/apache/hudi-rs/commit/52a924557ee18effadc02749ec7cdb1001ad6b58?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`8f5f96a`)](https://app.codecov.io/gh/apache/hudi-rs/commit/8f5f96a813faf8a686d210eb652235ab247d8b57?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

| [Files](https://app.codecov.io/gh/apache/hudi-rs/pull/35?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines |
|---|---|---|
| [crates/datafusion/src/lib.rs](https://app.codecov.io/gh/apache/hudi-rs/pull/35?src=pr&el=tree&filepath=crates%2Fdatafusion%2Fsrc%2Flib.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y3JhdGVzL2RhdGFmdXNpb24vc3JjL2xpYi5ycw==) | 55.00% | [9 Missing :warning:](https://app.codecov.io/gh/apache/hudi-rs/pull/35?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #35      +/-   ##
==========================================
- Coverage   88.84%   84.81%   -4.04%
==========================================
  Files          10       10
  Lines         511      507       -4
==========================================
- Hits          454      430      -24
- Misses         57       77      +20
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/hudi-rs/pull/35?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
[PR] refactor: implement datafusion API using ParquetExec [hudi-rs]
xushiyan opened a new pull request, #35: URL: https://github.com/apache/hudi-rs/pull/35

- upgrade arrow from `50` to `52.0.0`
- upgrade datafusion from `35` to `39.0.0`
- leverage `ParquetExec` for implementing TableProvider for Hudi in datafusion
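The dependency bumps listed above can be sketched as a `Cargo.toml` fragment. This is illustrative only, assuming plain crate names without feature flags; the real manifest in the PR may pin features or split the dependencies across workspace crates.

```toml
# Sketch of the version bumps described in the PR body.
[dependencies]
arrow = "52.0.0"        # upgraded from 50
datafusion = "39.0.0"   # upgraded from 35
```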
Re: [PR] [HUDI-7921] Making HoodieTable closeable [hudi]
nsivabalan closed pull request #11494: [HUDI-7921] Making HoodieTable closeable URL: https://github.com/apache/hudi/pull/11494
Re: [PR] [HUDI-7950] Shade roaring bitmap dependency in root POM [hudi]
hudi-bot commented on PR #11561: URL: https://github.com/apache/hudi/pull/11561#issuecomment-2207028303

## CI report:

* e73a045009286da007f0ade464d3e24d87d08c9d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24688)
Re: [PR] [DNM] Temp diff testing 1.x reads with 0.x branch [hudi]
hudi-bot commented on PR #11562: URL: https://github.com/apache/hudi/pull/11562#issuecomment-2207028360

## CI report:

* 013aef32a3ad3aa995beb626f5855d9a05234cbf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24689)
Re: [PR] [HUDI-7950] Shade roaring bitmap dependency in root POM [hudi]
hudi-bot commented on PR #11561: URL: https://github.com/apache/hudi/pull/11561#issuecomment-2207017035

## CI report:

* e73a045009286da007f0ade464d3e24d87d08c9d UNKNOWN
Re: [PR] [DNM] Temp diff testing 1.x reads with 0.x branch [hudi]
hudi-bot commented on PR #11562: URL: https://github.com/apache/hudi/pull/11562#issuecomment-2207017098

## CI report:

* 013aef32a3ad3aa995beb626f5855d9a05234cbf UNKNOWN
Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]
nsivabalan commented on PR #11514: URL: https://github.com/apache/hudi/pull/11514#issuecomment-2206976055

Here is a glimpse of the changes I had to make to the 0.x timeline to support 1.x table reads: https://github.com/apache/hudi/pull/11562. This is just a draft/hacky PR, in case you want to take a peek.
Re: [PR] [DNM] Temp diff testing 1.x reads with 0.x branch [hudi]
nsivabalan commented on code in PR #11562: URL: https://github.com/apache/hudi/pull/11562#discussion_r1664607967

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineMetadataUtils.java:

## @@ -209,6 +225,14 @@ public static T deserializeAvroMetadata(byte[] by
     return fileReader.next();
   }

+  public static HoodieCommitMetadata deserializeCommitMetadata(byte[] bytes) throws IOException {
+    return deserializeAvroMetadata(bytes, HoodieCommitMetadata.class);

Review Comment: NTR: supporting deser avro commit metadata