Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]

2024-07-03 Thread via GitHub


wombatu-kun commented on code in PR #11559:
URL: https://github.com/apache/hudi/pull/11559#discussion_r1665211039


##
rfc/rfc-80/rfc-80.md:
##
@@ -0,0 +1,161 @@
+
+# RFC-80: Support column families for wide tables
+
+## Proposers
+
+- @xiarixiaoyao
+- @wombatu-kun
+
+## Approvers
+ - 
+ - 
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-
+
+## Abstract
+
+In stream processing, there are often scenarios where a table must be widened. Today, real-time widening is mostly done with Flink's multi-layer joins;
+Flink's joins cache large amounts of data in the state backend. As the data set grows, the pressure on the Flink task's state backend grows with it, and the job may even become unavailable.
+In multi-layer join scenarios this problem is even more pronounced.
+
+## Background
+Currently, Hudi organizes data at fileGroup granularity. This RFC further divides the fileGroup into column clusters, introducing the columnFamily concept.  
+Hudi files are then organized according to the following rules:  
+The data in a partition is divided into buckets by hash; the files in each bucket are split by columnFamily; the multiple columnFamily files in a bucket together form a complete fileGroup; when there is only one columnFamily, this degenerates into the native Hudi bucket table.
+
+![table](table.png)
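
A minimal sketch of the routing rule described above, under stated assumptions: `zlib.crc32` is an illustrative stand-in for Hudi's actual bucket hash, and the `families` dict layout is not a Hudi API.

```python
import zlib

def bucket_id(record_key: str, num_buckets: int) -> int:
    # Illustrative stand-in for Hudi's bucket hash function.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets

def family_of(column: str, families: dict[str, list[str]]) -> str:
    # Route a column to the columnFamily that owns it.
    for name, cols in families.items():
        if column in cols:
            return name
    raise KeyError(f"column '{column}' belongs to no column family")

# e.g. families = {"cf1": ["id", "name"], "cf2": ["id", "price"]}
```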
+
+After splitting the fileGroup by columnFamily, the naming rules for base files and log files change. We add a cfName suffix to all file names so that Hudi itself can distinguish column families. This suffix is compatible with Hudi's original naming scheme and introduces no conflicts.
+
+![filenames](filenames.png)
+
+## Implementation
+
+### Constraints and Restrictions
+1. The overall design relies on the lock-free concurrent writing feature of 
Hudi 1.0.  
+2. Older versions of Hudi cannot read or write column family tables.  
+3. Only MOR bucketed tables support setting column families.  
+4. Column families do not support repartitioning and renaming.  
+5. Schema evolution does not take effect on the current column family table.  
+6. Like native bucket tables, clustering operations are not supported.
+
+### Model change
+After the column family is introduced, the storage structure of the entire 
Hudi bucket table changes:
+
+![bucket](bucket.png)
+
+The bucket is divided into multiple columnFamilies by column cluster. When there is only one columnFamily, it automatically degenerates into the native bucket table.
+
+![file-group](file-group.png)
+
+### Specifying column families when creating a table
+In the table creation statement, the column family division is specified in the options/tblproperties attributes.
+Column family attributes are specified as key-value pairs:  
+* The key is the column family name. Format: `hoodie.colFamily.<column family name>`, where the name follows the column family naming rules.  
+* The value is the content of the column family: all the columns included in the column family plus the precombine field. Format: `"col1,col2...colN;precombineCol"`: the column family list and the preCombine field are separated by ";", and within the column family list the columns are separated by ",".  
+
+Constraints: the column family list must contain the primary key, and the columns contained in different column families cannot overlap except for the primary key. The preCombine field does not need to be specified; if it is not specified, the primary key is used by default.
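
A hypothetical Spark SQL sketch of such a table creation (not part of this RFC's text). The property-key spelling `hoodie.columnFamily.<name>` follows the ALTER examples below, the table and column names are illustrative, and the bucket-index options a real MOR bucket table would also need are omitted.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cf-ddl-demo").getOrCreate()

# Two column families; both contain the primary key 'id' and use 'ts'
# as the precombine field, per the "col1,...,colN;precombineCol" value format.
spark.sql("""
    CREATE TABLE wide_table (
        id BIGINT, name STRING, price DOUBLE, ts BIGINT
    ) USING hudi
    TBLPROPERTIES (
        type = 'mor',
        primaryKey = 'id',
        preCombineField = 'ts',
        'hoodie.columnFamily.cf_base'  = 'id,name;ts',
        'hoodie.columnFamily.cf_price' = 'id,price;ts'
    )
""")
```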
+
+After the table is created, the column family attributes will be persisted to 
hoodie's metadata for subsequent use.
+
+### Adding and deleting column families in an existing table
+Use the SQL ALTER command to modify the column family attributes and persist them:
+* Execute `ALTER TABLE table_name SET TBLPROPERTIES ('hoodie.columnFamily.k'='a,b,c;a');` to add a new column family.  
+* Execute `ALTER TABLE table_name UNSET TBLPROPERTIES ('hoodie.columnFamily.k');` to delete a column family.
+
+The specific steps are as follows:
+1. Execute the ALTER command to modify the column family.
+2. Verify that the modified column family is legal. A column family modification must meet the following conditions, otherwise verification fails (a validation sketch follows this list):
+* The name of an existing column family cannot be modified.  
+* Columns belonging to other column families cannot be moved into a new column family.  
+* A new column family must meet the format requirements from the previous chapter.  
+3. Save the modified column family to the .hoodie directory.
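
A hypothetical sketch of the verification rules above; the function names and the `dict` layout are illustrative, not actual Hudi APIs.

```python
def parse_family_value(value: str, primary_key: str) -> tuple[list[str], str]:
    """Parse the 'col1,col2...colN;precombineCol' value format."""
    cols_part, _, precombine = value.partition(";")
    cols = [c.strip() for c in cols_part.split(",") if c.strip()]
    return cols, (precombine.strip() or primary_key)  # default precombine = primary key

def validate_new_family(name: str, cols: list[str],
                        existing: dict[str, list[str]], primary_key: str) -> None:
    """Apply the verification rules above to a newly added column family."""
    if name in existing:
        raise ValueError(f"column family '{name}' already exists; renaming is not allowed")
    if primary_key not in cols:
        raise ValueError("a column family must contain the primary key")
    # Columns (other than the primary key) may belong to only one family.
    taken = {c for fam in existing.values() for c in fam if c != primary_key}
    overlap = taken.intersection(c for c in cols if c != primary_key)
    if overlap:
        raise ValueError(f"columns {sorted(overlap)} already belong to another family")
```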
+
+### Writing data
+The Hudi kernel divides t

Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]

2024-07-03 Thread via GitHub


wombatu-kun commented on code in PR #11559:
URL: https://github.com/apache/hudi/pull/11559#discussion_r1665207108


##
rfc/rfc-80/rfc-80.md:
##

Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]

2024-07-03 Thread via GitHub


wombatu-kun commented on code in PR #11559:
URL: https://github.com/apache/hudi/pull/11559#discussion_r1665204402


##
rfc/rfc-80/rfc-80.md:
##
+After splitting the fileGroup by columnFamily, the naming rules for base files and log files change. We add a cfName suffix to all file names so that Hudi itself can distinguish column families. This suffix is compatible with Hudi's original naming scheme and introduces no conflicts.
+

Review Comment:
   Yes, it will be part of the write token.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2208210061

   
   ## CI report:
   
   * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN
   * c88af1906b580dd3d1497d48daa072728d6f8127 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24704)
 
   * cc375bc8ee00bee501b2937dbb6f2054c0fbe2d4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24705)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2208196940

   
   ## CI report:
   
   * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN
   * c88af1906b580dd3d1497d48daa072728d6f8127 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24704)
 
   * cc375bc8ee00bee501b2937dbb6f2054c0fbe2d4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2208144242

   
   ## CI report:
   
   * 77a10246fe Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24702)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] if sync mode is glue, fix the sync tool class [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11543:
URL: https://github.com/apache/hudi/pull/11543#issuecomment-2208144158

   
   ## CI report:
   
   * b1b0628d83c17467402de524a54829925aec9925 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24703)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2208138126

   
   ## CI report:
   
   * 77a10246fec770a8e0f3bfa1fe2fa4d3ffee33d1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24702)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24700)
 
   * 77a10246fe UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2208138089

   
   ## CI report:
   
   * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN
   * c88af1906b580dd3d1497d48daa072728d6f8127 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24704)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11545:
URL: https://github.com/apache/hudi/pull/11545#discussion_r1665134216


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/EightToSevenDowngradeHandler.java:
##
@@ -32,6 +39,28 @@
 public class EightToSevenDowngradeHandler implements DowngradeHandler {
   @Override
  public Map<ConfigProperty, String> downgrade(HoodieWriteConfig config, HoodieEngineContext context, String instantTime, SupportsUpgradeDowngrade upgradeDowngradeHelper) {
+    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder().setConf(context.getStorageConf().newInstance()).setBasePath(config.getBasePath()).build();
+    List<HoodieInstant> instants = metaClient.getActiveTimeline().getInstants();
+    if (!instants.isEmpty()) {
+      context.map(instants, instant -> {
+        if (!instant.getFileName().contains("_")) {
+          return false;
+        }
+        try {
+          // Rename the metadata file name from the ${instant_time}_${completion_time}.action[.state] format in version 1.x to the ${instant_time}.action[.state] format in version 0.x.
+          StoragePath fromPath = new StoragePath(metaClient.getMetaPath(), instant.getFileName());
+          StoragePath toPath = new StoragePath(metaClient.getMetaPath(), instant.getFileName().replaceAll("_\\d+", ""));
+          boolean success = metaClient.getStorage().rename(fromPath, toPath);
+          // TODO: We need to rename the action-related part of the metadata file name here when we bring a separate action name for clustering/compaction in 1.x as well.
+          if (!success) {
+            throw new HoodieIOException("Error when renaming the instant file: " + fromPath + " to: " + toPath);
+          }
+          return true;
+        } catch (IOException e) {
+          throw new HoodieException("Cannot complete the downgrade from version eight to version seven.", e);

Review Comment:
   One thing that needs caution here: after the renaming, the file modification time changes, and the modification time is used as the completion time in the 0.x branch.
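
A hypothetical local-filesystem illustration of this caution (not what the PR does, and object stores behave differently): capture the timestamps before the rename and restore them afterwards, so that a 0.x reader still sees the original completion time.

```python
import os

def rename_preserving_mtime(src: str, dst: str) -> None:
    st = os.stat(src)                          # capture access/modification times first
    os.rename(src, dst)                        # the rename itself
    os.utime(dst, (st.st_atime, st.st_mtime))  # restore atime/mtime on the target
```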






Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11545:
URL: https://github.com/apache/hudi/pull/11545#discussion_r1665133407


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java:
##
@@ -262,8 +259,12 @@ private String getPendingFileName() {
   }
 
   private String getCompleteFileName(String completionTime) {
-    ValidationUtils.checkArgument(!StringUtils.isNullOrEmpty(completionTime), "Completion time should not be empty");
-    String timestampWithCompletionTime = timestamp + "_" + completionTime;
+    String timestampWithCompletionTime;
+    if (StringUtils.isNullOrEmpty(completionTime)) {

Review Comment:
   When could the completion time be empty, then?






Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11545:
URL: https://github.com/apache/hudi/pull/11545#discussion_r1665132485


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java:
##
@@ -136,10 +136,7 @@ public HoodieInstant(StoragePathInfo pathInfo) {
   state = State.COMPLETED;
 }
   }
-      completionTime = timestamps.length > 1
-          ? timestamps[1]
-          // for backward compatibility with 0.x release.
-          : state == State.COMPLETED ? pathInfo.getModificationTime() + "" : null;
+      completionTime = timestamps.length > 1 ? timestamps[1] : null;

Review Comment:
   I think we still need to keep this logic; having downgrade logic there is good, but we also need to stay compatible with 0.x on some code paths.






Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2208131882

   
   ## CI report:
   
   * 3e526156f7bf7121008c4965bedeeadd969f798a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24687)
 
   * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN
   * c88af1906b580dd3d1497d48daa072728d6f8127 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1665124619


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,220 @@
+
+# RFC-78: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md) is a powerful
+re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years.
+It introduces many differentiating features for Apache Hudi. We shipped beta releases meant for enthusiastic developers/users to try the advanced features.
+But as we work towards 1.0 GA, we propose a bridge release (0.16.0) for a smoother migration for existing Hudi users.
+
+## Objectives 
+The goal is a smooth migration experience from 0.x to 1.0. We plan a 0.16.0 bridge release and ask everyone to first migrate to 0.16.0 before upgrading to 1.x.
+
+- The 1.x reader should be able to read 0.16.x tables w/o any loss in functionality and with no data inconsistencies.
+- 0.16.x should be able to read 1.x tables w/ some limitations. For features ported over from 0.x, no loss in functionality should be guaranteed. But for new features introduced in 1.x, we may not be able to support all of them; we will call out which new features may not work with the 0.16.x reader. In such cases, we explicitly ask users not to turn on these features until all readers are on 1.x.
+- Document the upgrade steps from 0.16.x to 1.x with limited user-perceived latency. The upgrade will be automatic, but we document clearly what needs to be done.
+- Document the downgrade from 1.x to 0.16.x with call-outs on any affected functionality.
+
+### Considerations when choosing a migration strategy
+- While the migration is happening, we want to allow readers to continue reading data. This means we cannot employ a stop-the-world strategy while migrating.
+None of the actions performed as part of the table upgrade may have the side effect of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do not want to add read support for very old versions of Hudi in 1.x (e.g. 0.7.0).
+- So, in an effort to bring everyone to the latest Hudi versions, the 1.x reader will have full read capabilities for 0.16.x, but for older Hudi versions the 1.x reader may not have full reader support.
+The recommended guideline is to upgrade all readers and writers to 0.16.x, and then slowly start upgrading to 1.x (readers followed by writers).
+
+Before we dive in further, let's understand the format changes:
+
+## Format changes
+### Table properties
+- Payload class ➝ payload type.
+- New metadata partitions could be added (optionally enabled)
+
+### MDT changes
+- New MDT partitions are available in 1.x. MDT schema upgraded.
+- RLI schema is upgraded to hold row position
+
+### Timeline:
+- [storage changes] Completed write commits have completion times in the file name.
+- [storage changes] Completed and inflight write commits are in Avro format; they were JSON in 0.x.
+- We are switching the action type for clustering from “replace commit” to “cluster”.
+- Similarly, for completed compaction, we are switching from “commit” to “compaction” in an effort to standardize actions for a given write operation.
+- [storage changes] Timeline ➝ LSM timeline. There is no archived timeline in 1.x.
+- [In-memory changes] HoodieInstant changes due to the presence of a completion time for completed HoodieInstants.
+
+### Filegroup/FileSlice changes:
+- Log files contain the delta commit time instead of the base instant time.
+- Log appends are disabled in 1.x. In other words, each log block is written to a new log file.
+- The file-slice determination logic for log files changed: in 0.x, log files carry the base instant time, so assignment is straightforward; in 1.x, we take the completion time of a log file and find the base instant time (parsed from the base files) with the highest value less than that completion time (see the sketch after this list).
+- Log-file ordering within a file slice changed: in 0.x, we order log files by base instant time ➝ log file version ➝ write token; in 1.x, we order by completion time.
+
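
A hypothetical sketch of the 1.x file-slice assignment rule described in the list above: pick the greatest base instant time strictly less than the log file's completion time. Timestamps are compared as strings, which works for fixed-width Hudi instant times; the function name and inputs are illustrative, not Hudi APIs.

```python
import bisect

def assign_file_slice(base_instant_times: list[str], log_completion_time: str) -> str:
    # Greatest base instant time strictly less than the log's completion time.
    ordered = sorted(base_instant_times)
    i = bisect.bisect_left(ordered, log_completion_time)
    if i == 0:
        raise ValueError("no base instant earlier than this log file's completion time")
    return ordered[i - 1]

# e.g. assign_file_slice(["20240701090000000", "20240702090000000"],
#                        "20240702100000000") -> "20240702090000000"
```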
+### Log format changes:
+- We have added new header types in 1.x. (IS_PARTIAL)
+
+## Changes to be ported to 0.16.x to support reading 1.x tables
+### What will be supported
+- For features introduced in 0.x and tables written in 1.x, the 0.16.0 reader should provide consistent reads w/o any breakage.
+### What will not be supported
+- A 0.16 writer cannot write to a table that has been upgraded-to/created 
usin

Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1665122005


##
rfc/rfc-78/rfc-78.md:
##
+- [In-memory changes] HoodieInstant changes due to the presence of a completion time for completed HoodieInstants.

Review Comment:
   We have not introduced completion-time-based incremental queries for Spark yet, but for the GA release we might need a compatible solution for migration.






Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1665122268


##
rfc/rfc-78/rfc-78.md:
##
+- Log appends are disabled in 1.x. In other words, each log block is written to a new log file.

Review Comment:
   +1






Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1665121084


##
rfc/rfc-78/rfc-78.md:
##
+- Payload class ➝ payload type.

Review Comment:
   Might not be related, but should `hoodie.record.merge.mode` be a table config instead of a write config?






(hudi-rs) branch main updated: feat: implement datafusion API using ParquetExec (#35)

2024-07-03 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/hudi-rs.git


The following commit(s) were added to refs/heads/main by this push:
 new e8fde26  feat: implement datafusion API using ParquetExec (#35)
e8fde26 is described below

commit e8fde26df8cdd5355aacce4232138222ce00baf4
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Wed Jul 3 23:59:11 2024 -0500

feat: implement datafusion API using ParquetExec (#35)

- upgrade arrow from `50` to `52.0.0`
- upgrade datafusion `35` to `39.0.0`
- leverage `ParquetExec` for implementing TableProvider for Hudi in 
datafusion
- add `hoodie.read.input.partitions` config
---
 Cargo.toml   |  38 +++---
 crates/core/src/config/mod.rs| 118 ++
 crates/core/src/lib.rs   |   3 +-
 crates/core/src/storage/mod.rs   |   8 +-
 crates/core/src/storage/utils.rs |  47 ++-
 crates/core/src/table/mod.rs |  17 ++-
 crates/datafusion/Cargo.toml |   8 +-
 crates/datafusion/src/lib.rs | 261 ++-
 python/Cargo.toml|   2 +-
 python/hudi/_internal.pyi|   3 +-
 python/hudi/_utils.py|  23 
 python/hudi/table.py |  10 +-
 python/src/lib.rs|  12 +-
 python/tests/test_table_read.py  |   4 +-
 14 files changed, 344 insertions(+), 210 deletions(-)

diff --git a/Cargo.toml b/Cargo.toml
index 1b66057..82f0383 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -30,27 +30,27 @@ rust-version = "1.75.0"
 
 [workspace.dependencies]
 # arrow
-arrow = { version = "50", features = ["pyarrow"] }
-arrow-arith = { version = "50" }
-arrow-array = { version = "50", features = ["chrono-tz"] }
-arrow-buffer = { version = "50" }
-arrow-cast = { version = "50" }
-arrow-ipc = { version = "50" }
-arrow-json = { version = "50" }
-arrow-ord = { version = "50" }
-arrow-row = { version = "50" }
-arrow-schema = { version = "50" }
-arrow-select = { version = "50" }
-object_store = { version = "0.9.1" }
-parquet = { version = "50" }
+arrow = { version = "52.0.0", features = ["pyarrow"]}
+arrow-arith = { version = "52.0.0" }
+arrow-array = { version = "52.0.0", features = ["chrono-tz"] }
+arrow-buffer = { version = "52.0.0" }
+arrow-cast = { version = "52.0.0" }
+arrow-ipc = { version = "52.0.0" }
+arrow-json = { version = "52.0.0" }
+arrow-ord = { version = "52.0.0" }
+arrow-row = { version = "52.0.0" }
+arrow-schema = { version = "52.0.0" }
+arrow-select = { version = "52.0.0" }
+object_store = { version = "0.10.1" }
+parquet = { version = "52.0.0" }
 
 # datafusion
-datafusion = { version = "35" }
-datafusion-expr = { version = "35" }
-datafusion-common = { version = "35" }
-datafusion-proto = { version = "35" }
-datafusion-sql = { version = "35" }
-datafusion-physical-expr = { version = "35" }
+datafusion = { version = "39.0.0" }
+datafusion-expr = { version = "39.0.0" }
+datafusion-common = { version = "39.0.0" }
+datafusion-proto = { version = "39.0.0" }
+datafusion-sql = { version = "39.0.0" }
+datafusion-physical-expr = { version = "39.0.0" }
 
 # serde
 serde = { version = "1.0.203", features = ["derive"] }
diff --git a/crates/core/src/config/mod.rs b/crates/core/src/config/mod.rs
new file mode 100644
index 000..3322df3
--- /dev/null
+++ b/crates/core/src/config/mod.rs
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+use std::collections::HashMap;
+
+use anyhow::{anyhow, Context, Result};
+
+pub trait OptionsParser {
+    type Output;
+
+    fn parse_value(&self, options: &HashMap<String, String>) -> Result<Self::Output>;
+
+    fn parse_value_or_default(&self, options: &HashMap<String, String>) -> Self::Output;
+}
+
+#[derive(Clone, Debug, PartialEq, Eq, Hash)]
+pub enum HudiConfig {
+    ReadInputPartitions,
+}
+
+#[derive(Debug)]
+pub enum HudiConfigValue {
+    Integer(isize),
+}
+
+impl HudiConfigValue {
+    pub fn cast<T: From<isize> + TryFrom<isize> + std::fmt::Debug>(&self) -> T {
+        match self {
+            HudiConfigValue::Integer(value) => T::try_from(*value).unwrap_or_else(|_| {
+                panic!("Failed to convert isize to {}", std::any::type_name::<T>())

Re: [PR] feat: implement datafusion API using ParquetExec [hudi-rs]

2024-07-03 Thread via GitHub


xushiyan merged PR #35:
URL: https://github.com/apache/hudi-rs/pull/35





[jira] [Commented] (HUDI-7938) NullPointerException during read from PySpark

2024-07-03 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862922#comment-17862922
 ] 

Geser Dugarov commented on HUDI-7938:
-

[~yihua], if you don't mind, could you please clarify what to do with the registration of the Hudi serializer in Spark?

> NullPointerException during read from PySpark
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 added schema evolution to the filegroup reader (#10957),
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 
> (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at 
> org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7938) NullPointerException during read from PySpark

2024-07-03 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862921#comment-17862921
 ] 

Geser Dugarov commented on HUDI-7938:
-

To support running from PySpark without setting spark.kryo.registrator, this PR was landed:

https://github.com/apache/hudi/pull/11355

But after

https://github.com/apache/hudi/pull/10957

landed, we need to set it again.

For now, I don't know whether we should make this configuration mandatory or make some changes in the code. Leaving this task as it is for some time.
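
For reference, a minimal PySpark sketch of setting the registrator explicitly; the table path is illustrative, and `org.apache.spark.HoodieSparkKryoRegistrar` is assumed here to be the registrar class shipped in the Hudi Spark bundle.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-read")
    # Work around the NPE by registering the Hudi Kryo registrar explicitly.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
    .getOrCreate()
)

df_load = spark.read.format("org.apache.hudi").load("/tmp/hudi_table")  # illustrative path
df_load.collect()
```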

> NullPointerException during read from PySpark
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>





Re: [PR] [HUDI-7758] Only consider files in Hudi partitions when initializing MDT [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11219:
URL: https://github.com/apache/hudi/pull/11219#issuecomment-2208095346

   
   ## CI report:
   
   * 09e49d7c4856c6baf4089b538784f2d6cc7b143a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24701)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7938) NullPointerException during read from PySpark

2024-07-03 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7938:

Status: Open  (was: In Progress)

> NullPointerException during read from PySpark
> -
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>
> HUDI-7567 Add schema evolution to the filegroup reader (#10957),
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = 
> spark.read.format({color:#067d17}"org.apache.hudi"{color}).load(tmp_dir_path)
> df_load.collect()
> {quote}
>  
> got:
>  
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.
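
For context, a minimal PySpark reproduction sketch of the reported read path
(a hedged sketch: the table path, schema, and write options below are
illustrative, not taken from the report):

# Hedged repro sketch for the read path above; assumes a Spark 3.4 session
# with the Hudi bundle on the classpath. All names and paths are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-7938-repro-sketch")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

tmp_dir_path = "/tmp/hudi_7938_table"  # hypothetical path

# Write a trivial table first so the read below has something to scan.
df = spark.createDataFrame([(1, "a", 1000, "p1")], ["id", "name", "ts", "part"])
(df.write.format("org.apache.hudi")
   .option("hoodie.table.name", "hudi_7938_table")
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.precombine.field", "ts")
   .option("hoodie.datasource.write.partitionpath.field", "part")
   .mode("overwrite")
   .save(tmp_dir_path))

# The failing step from the report: the NPE surfaces on executors while the
# file group reader unwraps the Hadoop configuration during collect().
df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
df_load.collect()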



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7949) insert into hudi table with columns specified (reordered and not in table schema order) throws exception

2024-07-03 Thread KnightChess (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KnightChess reassigned HUDI-7949:
-

Assignee: KnightChess

> insert into hudi table with columns specified (reordered and not in table 
> schema order) throws exception
> ---
>
> Key: HUDI-7949
> URL: https://issues.apache.org/jira/browse/HUDI-7949
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>
> https://github.com/apache/hudi/issues/11552
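
For reference, a hedged sketch of the statement shape that triggers this
(table and column names are hypothetical; the concrete case is in the linked
issue):

# Hedged sketch: INSERT INTO with an explicit, reordered column list.
# Table and column names are hypothetical, not taken from the linked issue.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-7949-sketch").getOrCreate()

spark.sql("""
  CREATE TABLE IF NOT EXISTS h0 (id INT, name STRING, price DOUBLE, ts BIGINT)
  USING hudi
  TBLPROPERTIES (primaryKey = 'id', preCombineField = 'ts')
""")

# Columns listed out of table-schema order; per this report, this shape
# throws an exception instead of mapping values to columns by name.
spark.sql("INSERT INTO h0 (name, id, ts, price) VALUES ('a1', 1, 1000, 10.0)")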



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [MINOR] if sync mode is glue, fix the sync tool class [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11543:
URL: https://github.com/apache/hudi/pull/11543#issuecomment-2208090398

   
   ## CI report:
   
   * 90ef8064511b401f50d1f8796f75dd7bde7b155e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24648)
 
   * b1b0628d83c17467402de524a54829925aec9925 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24703)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2208085682

   
   ## CI report:
   
   * 77a10246fec770a8e0f3bfa1fe2fa4d3ffee33d1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24702)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24700)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11563:
URL: https://github.com/apache/hudi/pull/11563#issuecomment-2208085724

   
   ## CI report:
   
   * d7a6c5a6d873b6d07e8e3f0a9b15040dd1942d59 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24699)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] if sync mode is glue, fix the sync tool class [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11543:
URL: https://github.com/apache/hudi/pull/11543#issuecomment-2208085624

   
   ## CI report:
   
   * 90ef8064511b401f50d1f8796f75dd7bde7b155e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24648)
 
   * b1b0628d83c17467402de524a54829925aec9925 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11534:
URL: https://github.com/apache/hudi/pull/11534#issuecomment-2208085580

   
   ## CI report:
   
   * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN
   * ea4e4adbc06a4be8f4cd739e6b1750927b284f63 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24696)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-03 Thread via GitHub


watermelon12138 commented on PR #11545:
URL: https://github.com/apache/hudi/pull/11545#issuecomment-2208074798

   @danny0405 Hi, all checks have passed and all suggestions have been resolved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-03 Thread via GitHub


watermelon12138 commented on PR #11545:
URL: https://github.com/apache/hudi/pull/11545#issuecomment-2208073498

   > @watermelon12138 would you mind to fix the compile errors: 
https://github.com/apache/hudi/actions/runs/9756232191/job/26926142959?pr=11545
   
   @danny0405 @balaji-varadarajan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11545:
URL: https://github.com/apache/hudi/pull/11545#issuecomment-2208022864

   
   ## CI report:
   
   * 0e3cb49fb72bdc14dee9e67fe0aaeb0d271608f2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24698)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2208023157

   
   ## CI report:
   
   * 192707054c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24691)
 
   * 77a10246fec770a8e0f3bfa1fe2fa4d3ffee33d1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24702)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7758] Only consider files in Hudi partitions when initializing MDT [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11219:
URL: https://github.com/apache/hudi/pull/11219#issuecomment-2208019950

   
   ## CI report:
   
   * ef710f1f1e981fc83f69a3e2db164aa0c139e0c5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24013)
 
   * 09e49d7c4856c6baf4089b538784f2d6cc7b143a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24701)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] if sync mode is glue, fix the sync tool class [hudi]

2024-07-03 Thread via GitHub


prabodh1194 commented on code in PR #11543:
URL: https://github.com/apache/hudi/pull/11543#discussion_r1665053757


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -60,13 +60,13 @@ import org.apache.hudi.sync.common.HoodieSyncConfig
 import org.apache.hudi.sync.common.util.SyncUtilHelpers
 import org.apache.hudi.sync.common.util.SyncUtilHelpers.getHoodieMetaSyncException
 import org.apache.hudi.util.SparkKeyGenUtils
-
 import org.apache.avro.Schema
 import org.apache.avro.generic.GenericData
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, Path}
 import org.apache.hadoop.hive.conf.HiveConf
 import org.apache.hadoop.hive.shims.ShimLoader
+import org.apache.hudi.hive.ddl.HiveSyncMode

Review Comment:
   done now
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7758] Only consider files in Hudi partitions when initializing MDT [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11219:
URL: https://github.com/apache/hudi/pull/11219#issuecomment-2207981834

   
   ## CI report:
   
   * ef710f1f1e981fc83f69a3e2db164aa0c139e0c5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24013)
 
   * 09e49d7c4856c6baf4089b538784f2d6cc7b143a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207984886

   
   ## CI report:
   
   * 192707054c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24691)
 
   * 77a10246fec770a8e0f3bfa1fe2fa4d3ffee33d1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2207962441

   
   ## CI report:
   
   * 3e526156f7bf7121008c4965bedeeadd969f798a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24687)
 
   * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7709] ClassCastException while reading the data using `TimestampBasedKeyGenerator` [hudi]

2024-07-03 Thread via GitHub


geserdugarov commented on code in PR #11501:
URL: https://github.com/apache/hudi/pull/11501#discussion_r1665028011


##
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##
@@ -360,13 +363,22 @@ protected HoodieTimeline getActiveTimeline() {
   }
 
   private Object[] parsePartitionColumnValues(String[] partitionColumns, String partitionPath) {
-    Object[] partitionColumnValues = doParsePartitionColumnValues(partitionColumns, partitionPath);
-    if (shouldListLazily && partitionColumnValues.length != partitionColumns.length) {
-      throw new HoodieException("Failed to parse partition column values from the partition-path:"
-          + " likely non-encoded slashes being used in partition column's values. You can try to"
-          + " work this around by switching listing mode to eager");
+    HoodieTableConfig tableConfig = metaClient.getTableConfig();
+    Object[] partitionColumnValues;
+    if (null != tableConfig.getKeyGeneratorClassName()
+        && tableConfig.getKeyGeneratorClassName().equals(KeyGeneratorType.TIMESTAMP.getClassName())
+        && tableConfig.propsMap().get(TimestampKeyGeneratorConfig.TIMESTAMP_TYPE_FIELD.key()).matches("SCALAR|UNIX_TIMESTAMP|EPOCHMILLISECONDS")) {
+      // For TIMESTAMP key generator when TYPE is SCALAR, UNIX_TIMESTAMP or EPOCHMILLISECONDS,
+      // we couldn't reconstruct initial partition column values from partition paths due to lost data after formatting in most cases
+      partitionColumnValues = new Object[partitionColumns.length];

Review Comment:
   I will check it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7709] ClassCastException while reading the data using `TimestampBasedKeyGenerator` [hudi]

2024-07-03 Thread via GitHub


yihua commented on code in PR #11501:
URL: https://github.com/apache/hudi/pull/11501#discussion_r1665023864


##
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##
@@ -360,13 +363,22 @@ protected HoodieTimeline getActiveTimeline() {
   }
 
   private Object[] parsePartitionColumnValues(String[] partitionColumns, String partitionPath) {
-    Object[] partitionColumnValues = doParsePartitionColumnValues(partitionColumns, partitionPath);
-    if (shouldListLazily && partitionColumnValues.length != partitionColumns.length) {
-      throw new HoodieException("Failed to parse partition column values from the partition-path:"
-          + " likely non-encoded slashes being used in partition column's values. You can try to"
-          + " work this around by switching listing mode to eager");
+    HoodieTableConfig tableConfig = metaClient.getTableConfig();
+    Object[] partitionColumnValues;
+    if (null != tableConfig.getKeyGeneratorClassName()
+        && tableConfig.getKeyGeneratorClassName().equals(KeyGeneratorType.TIMESTAMP.getClassName())
+        && tableConfig.propsMap().get(TimestampKeyGeneratorConfig.TIMESTAMP_TYPE_FIELD.key()).matches("SCALAR|UNIX_TIMESTAMP|EPOCHMILLISECONDS")) {
+      // For TIMESTAMP key generator when TYPE is SCALAR, UNIX_TIMESTAMP or EPOCHMILLISECONDS,
+      // we couldn't reconstruct initial partition column values from partition paths due to lost data after formatting in most cases
+      partitionColumnValues = new Object[partitionColumns.length];

Review Comment:
   Partition column values are empty.  Can this cause the partition pruning to 
return wrong or empty results from 
`SparkHoodieTableFileIndex::listMatchingPartitionPaths`?
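
To make the irreversibility concrete, a hedged PySpark sketch (illustrative
values; the option keys follow the TimestampBasedKeyGenerator configs
referenced in the diff above):

# Hedged sketch: two distinct epoch-millis values land in the same
# partition path, so the path alone cannot be parsed back into the
# original values. Paths and names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-7709-sketch").getOrCreate()

df = spark.createDataFrame([(1, 1719569526000), (2, 1719569527000)], ["id", "ts"])

hudi_options = {
    "hoodie.table.name": "ts_keygen_sketch",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "ts",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.TimestampBasedKeyGenerator",
    "hoodie.keygen.timebased.timestamp.type": "EPOCHMILLISECONDS",
    "hoodie.keygen.timebased.output.dateformat": "yyyy/MM/dd",
}

df.write.format("org.apache.hudi").options(**hudi_options) \
    .mode("overwrite").save("/tmp/ts_keygen_sketch")

# Both rows format to partition path "2024/06/28"; reconstructing the
# original millisecond values from that path is impossible, which is why
# the patched parsePartitionColumnValues returns empty values for these
# timestamp types (and why the pruning question raised above matters).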



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-03 Thread via GitHub


lokeshj1703 commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2207900106

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11563:
URL: https://github.com/apache/hudi/pull/11563#issuecomment-2207885328

   
   ## CI report:
   
   * 16b1d5c2603ef3eb68a40bc14572751b787d8d2f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24694)
 
   * d7a6c5a6d873b6d07e8e3f0a9b15040dd1942d59 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24699)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11545:
URL: https://github.com/apache/hudi/pull/11545#issuecomment-2207885219

   
   ## CI report:
   
   * fe7aa032f4463035775029ad486ca73ea2d02ac0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24668)
 
   * 0e3cb49fb72bdc14dee9e67fe0aaeb0d271608f2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24698)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2207885164

   
   ## CI report:
   
   * 57e40251eba6a0d7dc68cd10b832478f4d2decb3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24697)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11534:
URL: https://github.com/apache/hudi/pull/11534#issuecomment-2207885098

   
   ## CI report:
   
   * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN
   * 7c75f078faf19390ceac585790181032570d184d Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24695)
 
   * ea4e4adbc06a4be8f4cd739e6b1750927b284f63 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24696)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11545:
URL: https://github.com/apache/hudi/pull/11545#issuecomment-2207877464

   
   ## CI report:
   
   * fe7aa032f4463035775029ad486ca73ea2d02ac0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24668)
 
   * 0e3cb49fb72bdc14dee9e67fe0aaeb0d271608f2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2207877507

   
   ## CI report:
   
   * 3e526156f7bf7121008c4965bedeeadd969f798a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24687)
 
   * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11563:
URL: https://github.com/apache/hudi/pull/11563#issuecomment-2207877586

   
   ## CI report:
   
   * 16b1d5c2603ef3eb68a40bc14572751b787d8d2f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24694)
 
   * d7a6c5a6d873b6d07e8e3f0a9b15040dd1942d59 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7915] Spark4 + Hadoop3 [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11539:
URL: https://github.com/apache/hudi/pull/11539#issuecomment-2207877428

   
   ## CI report:
   
   * 88f5236331c0fdf66bca1617679abe2940f9e930 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24684)
 
   * 57e40251eba6a0d7dc68cd10b832478f4d2decb3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11534:
URL: https://github.com/apache/hudi/pull/11534#issuecomment-2207877386

   
   ## CI report:
   
   * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN
   * 7c75f078faf19390ceac585790181032570d184d Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24695)
 
   * ea4e4adbc06a4be8f4cd739e6b1750927b284f63 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2207871843

   
   ## CI report:
   
   * 3e526156f7bf7121008c4965bedeeadd969f798a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24687)
 
   * c5bde7f662a930b9a10b79fa38f9567300c0674a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11534:
URL: https://github.com/apache/hudi/pull/11534#issuecomment-2207871783

   
   ## CI report:
   
   * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN
   * 7c75f078faf19390ceac585790181032570d184d Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24695)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11559:
URL: https://github.com/apache/hudi/pull/11559#issuecomment-2207871874

   
   ## CI report:
   
   * 206eabc0c6a752e7a1e1d2206db231bf9a831570 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24693)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7950) Shade roaring bitmap dependency in root POM

2024-07-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7950:

Reviewers: Lokesh Jain  (was: Lokesh Jain)

> Shade roaring bitmap dependency in root POM
> ---
>
> Key: HUDI-7950
> URL: https://issues.apache.org/jira/browse/HUDI-7950
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0, 0.15.1
>
>
> We should unify the shading rule of the roaring bitmap dependency in the root POM 
> for consistency among bundles.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7950) Shade roaring bitmap dependency in root POM

2024-07-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7950:

Reviewers: Lokesh Jain

> Shade roaring bitmap dependency in root POM
> ---
>
> Key: HUDI-7950
> URL: https://issues.apache.org/jira/browse/HUDI-7950
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0, 0.15.1
>
>
> We should unify the shading rule of the roaring bitmap dependency in the root POM 
> for consistency among bundles.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11559:
URL: https://github.com/apache/hudi/pull/11559#discussion_r1664980551


##
rfc/rfc-80/rfc-80.md:
##
@@ -0,0 +1,161 @@
+
+# RFC-80: Support column families for wide tbles
+
+## Proposers
+
+- @xiarixiaoyao
+- @wombatu-kun
+
+## Approvers
+ - 
+ - 
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-
+
+## Abstract
+
+In streaming processing, there are often scenarios where the table is widened. 
The current mainstream real-time stretching is completed through Flink's 
multi-layer join;
+Flink's join will cache a large amount of data in the state backend. As the 
data set increases, the pressure on the Flink task state backend will gradually 
increase, and may even become unavailable.
+In multi-layer join scenarios, this problem is more obvious.
+
+## Background
+Currently, Hudi organizes data according to fileGroup granularity. The 
fileGroup is further divided into column clusters to introduce the columnFamily 
concept.  
+The organizational form of Hudi files is divided according to the following 
rules:  
+The data in the partition is divided into buckets according to hash; the files 
in each bucket are divided according to columnFamily; multiple colFamily files 
in the bucket form a completed fileGroup; when there is only one columnFamily, 
it degenerates into the native Hudi bucket table.
+
+![table](table.png)
+
+After splitting the fileGroup by columnFamily, the naming rules for base files 
and log files change. We add the cfName suffix to all file names to facilitate 
Hudi itself to distinguish column families. The addition of this suffix is 
compatible with Hudi's original naming method and has no conflict.
+
+![filenames](filenames.png)
+
+## Implementation
+Describe the new thing you want to do in appropriate detail, how it fits into 
the project architecture. 
+Provide a detailed description of how you intend to implement this 
feature.This may be fairly extensive and have large subsections of its own. 
+Or it may be a few sentences. Use judgement based on the scope of the change.
+
+### Constraints and Restrictions
+1. The overall design relies on the lock-free concurrent writing feature of 
Hudi 1.0.  
+2. Lower version Hudi cannot read and write column family tables.  
+3. Only MOR bucketed tables support setting column families.  
+4. Column families do not support repartitioning and renaming.  
+5. Schema evolution does not take effect on the current column family table.  
+6. Like native bucket tables, clustering operations are not supported.
+
+### Model change
+After the column family is introduced, the storage structure of the entire 
Hudi bucket table changes:
+
+![bucket](bucket.png)
+
+The bucket is divided into multiple columnFamilies by column cluster. When 
columnFamily is 1, it will automatically degenerate into the native bucket 
table.
+
+![file-group](file-group.png)
+
+### Specifying column families when creating a table
+In the table creation statement, column family division is specified in the 
options/tblproperties attribute;
+Column family attributes are specified in key-value mode:  
+* Key is the column family name. Format: hoodie.colFamily. Column family name  
  naming rules specified.  
+* Value is the specific content of the column family: it consists of all the 
columns included in the column family plus the precombine field. Format: " 
col1,col2...colN; precombineCol", the column family list and the preCombine 
field are separated by ";"; in the column family list the columns are split by 
",".  
+
+Constraints: The column family list must contain the primary key, and columns 
contained in different column families cannot overlap except for the primary 
key. The preCombie field does not need to be specified. If not specified, the 
primary key will be taken by default.
+
+After the table is created, the column family attributes will be persisted to 
hoodie's metadata for subsequent use.
+
+### Adding and deleting column families in existing table
+Use the SQL alter command to modify the column family attributes and persist 
it:
+* Execute ALTER TABLE table_name SET TBLPROPERTIES 
('hoodie.columnFamily.k'='a,b,c;a'); to add a new column family.  
+* Execute ALTER TABLE table_name UNSET TBLPROPERTIES('hoodie.columnFamily.k'); 
to delete the column family.
+
+Specific steps are as follows:
+1. Execute the ALTER command to modify the column family
+2. Verify whether the column family modified by alter is legal. Column family 
modification must meet the following conditions, otherwise the verification 
will not pass:
+* The column family name of an existing column family cannot be modified.  
+* Columns in other column families cannot be divided into new column 
families.  
+* When creating a new column family, it must meet the format requirements 
from previous chapter.  
+3. Save the modified column family to the .hoodie directory.
+
+### Writing data
+The Hudi kernel divides the

Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11559:
URL: https://github.com/apache/hudi/pull/11559#discussion_r1664981596


##
rfc/rfc-80/rfc-80.md:
##
@@ -0,0 +1,161 @@
+
+# RFC-80: Support column families for wide tbles
+
+## Proposers
+
+- @xiarixiaoyao
+- @wombatu-kun
+
+## Approvers
+ - 
+ - 
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-
+
+## Abstract
+
+In streaming processing, there are often scenarios where the table is widened. 
The current mainstream real-time stretching is completed through Flink's 
multi-layer join;
+Flink's join will cache a large amount of data in the state backend. As the 
data set increases, the pressure on the Flink task state backend will gradually 
increase, and may even become unavailable.
+In multi-layer join scenarios, this problem is more obvious.
+
+## Background
+Currently, Hudi organizes data according to fileGroup granularity. The 
fileGroup is further divided into column clusters to introduce the columnFamily 
concept.  
+The organizational form of Hudi files is divided according to the following 
rules:  
+The data in the partition is divided into buckets according to hash; the files 
in each bucket are divided according to columnFamily; multiple colFamily files 
in the bucket form a completed fileGroup; when there is only one columnFamily, 
it degenerates into the native Hudi bucket table.
+
+![table](table.png)
+
+After splitting the fileGroup by columnFamily, the naming rules for base files 
and log files change. We add the cfName suffix to all file names to facilitate 
Hudi itself to distinguish column families. The addition of this suffix is 
compatible with Hudi's original naming method and has no conflict.
+
+![filenames](filenames.png)
+
+## Implementation
+Describe the new thing you want to do in appropriate detail, how it fits into 
the project architecture. 
+Provide a detailed description of how you intend to implement this 
feature.This may be fairly extensive and have large subsections of its own. 
+Or it may be a few sentences. Use judgement based on the scope of the change.
+
+### Constraints and Restrictions
+1. The overall design relies on the lock-free concurrent writing feature of 
Hudi 1.0.  
+2. Lower version Hudi cannot read and write column family tables.  
+3. Only MOR bucketed tables support setting column families.  
+4. Column families do not support repartitioning and renaming.  
+5. Schema evolution does not take effect on the current column family table.  
+6. Like native bucket tables, clustering operations are not supported.
+
+### Model change
+After the column family is introduced, the storage structure of the entire 
Hudi bucket table changes:
+
+![bucket](bucket.png)
+
+The bucket is divided into multiple columnFamilies by column cluster. When 
columnFamily is 1, it will automatically degenerate into the native bucket 
table.
+
+![file-group](file-group.png)
+
+### Specifying column families when creating a table
+In the table creation statement, column family division is specified in the 
options/tblproperties attribute;
+Column family attributes are specified in key-value mode:  
+* Key is the column family name. Format: hoodie.colFamily. Column family name  
  naming rules specified.  
+* Value is the specific content of the column family: it consists of all the 
columns included in the column family plus the precombine field. Format: " 
col1,col2...colN; precombineCol", the column family list and the preCombine 
field are separated by ";"; in the column family list the columns are split by 
",".  
+
+Constraints: The column family list must contain the primary key, and columns 
contained in different column families cannot overlap except for the primary 
key. The preCombie field does not need to be specified. If not specified, the 
primary key will be taken by default.
+
+After the table is created, the column family attributes will be persisted to 
hoodie's metadata for subsequent use.
+
+### Adding and deleting column families in existing table
+Use the SQL alter command to modify the column family attributes and persist 
it:
+* Execute ALTER TABLE table_name SET TBLPROPERTIES 
('hoodie.columnFamily.k'='a,b,c;a'); to add a new column family.  
+* Execute ALTER TABLE table_name UNSET TBLPROPERTIES('hoodie.columnFamily.k'); 
to delete the column family.
+
+Specific steps are as follows:
+1. Execute the ALTER command to modify the column family
+2. Verify whether the column family modified by alter is legal. Column family 
modification must meet the following conditions, otherwise the verification 
will not pass:
+* The column family name of an existing column family cannot be modified.  
+* Columns in other column families cannot be divided into new column 
families.  
+* When creating a new column family, it must meet the format requirements 
from previous chapter.  
+3. Save the modified column family to the .hoodie directory.
+
+### Writing data
+The Hudi kernel divides the

Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11559:
URL: https://github.com/apache/hudi/pull/11559#discussion_r1664980071


##
rfc/rfc-80/rfc-80.md:
##
@@ -0,0 +1,161 @@
+
+# RFC-80: Support column families for wide tbles
+
+## Proposers
+
+- @xiarixiaoyao
+- @wombatu-kun
+
+## Approvers
+ - 
+ - 
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-
+
+## Abstract
+
+In streaming processing, there are often scenarios where the table is widened. 
The current mainstream real-time stretching is completed through Flink's 
multi-layer join;
+Flink's join will cache a large amount of data in the state backend. As the 
data set increases, the pressure on the Flink task state backend will gradually 
increase, and may even become unavailable.
+In multi-layer join scenarios, this problem is more obvious.
+
+## Background
+Currently, Hudi organizes data according to fileGroup granularity. The 
fileGroup is further divided into column clusters to introduce the columnFamily 
concept.  
+The organizational form of Hudi files is divided according to the following 
rules:  
+The data in the partition is divided into buckets according to hash; the files 
in each bucket are divided according to columnFamily; multiple colFamily files 
in the bucket form a completed fileGroup; when there is only one columnFamily, 
it degenerates into the native Hudi bucket table.
+
+![table](table.png)
+
+After splitting the fileGroup by columnFamily, the naming rules for base files 
and log files change. We add the cfName suffix to all file names to facilitate 
Hudi itself to distinguish column families. The addition of this suffix is 
compatible with Hudi's original naming method and has no conflict.
+
+![filenames](filenames.png)
+
+## Implementation
+Describe the new thing you want to do in appropriate detail, how it fits into 
the project architecture. 
+Provide a detailed description of how you intend to implement this 
feature.This may be fairly extensive and have large subsections of its own. 
+Or it may be a few sentences. Use judgement based on the scope of the change.
+
+### Constraints and Restrictions
+1. The overall design relies on the lock-free concurrent writing feature of 
Hudi 1.0.  
+2. Lower version Hudi cannot read and write column family tables.  
+3. Only MOR bucketed tables support setting column families.  
+4. Column families do not support repartitioning and renaming.  
+5. Schema evolution does not take effect on the current column family table.  
+6. Like native bucket tables, clustering operations are not supported.
+
+### Model change
+After the column family is introduced, the storage structure of the entire 
Hudi bucket table changes:
+
+![bucket](bucket.png)
+
+The bucket is divided into multiple columnFamilies by column cluster. When 
columnFamily is 1, it will automatically degenerate into the native bucket 
table.
+
+![file-group](file-group.png)
+
+### Specifying column families when creating a table
+In the table creation statement, column family division is specified in the 
options/tblproperties attribute;
+Column family attributes are specified in key-value mode:  
+* Key is the column family name. Format: hoodie.colFamily. Column family name  
  naming rules specified.  
+* Value is the specific content of the column family: it consists of all the 
columns included in the column family plus the precombine field. Format: " 
col1,col2...colN; precombineCol", the column family list and the preCombine 
field are separated by ";"; in the column family list the columns are split by 
",".  
+
+Constraints: The column family list must contain the primary key, and columns 
contained in different column families cannot overlap except for the primary 
key. The preCombie field does not need to be specified. If not specified, the 
primary key will be taken by default.

Review Comment:
   Is there any SQL syntax we can reference in the industry? Like CockroachDB: 
https://www.cockroachlabs.com/docs/stable/column-families#:~:text=A%20column%20family%20is%20a,%2C%20UPDATE%20%2C%20and%20DELETE%20operations.
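
For comparison, a hedged side-by-side sketch: the CockroachDB form follows
its public docs, while the Hudi form restates this RFC's proposed
TBLPROPERTIES syntax (key prefix as in the RFC's ALTER example; all table,
column, and family names are hypothetical):

# CockroachDB declares families inline in CREATE TABLE, e.g.:
#   CREATE TABLE users (
#     id INT PRIMARY KEY,
#     name STRING,
#     bio STRING,
#     FAMILY base (id, name),
#     FAMILY detail (bio)
#   );
#
# RFC-80 instead carries the grouping in table properties; each family
# value lists its columns plus the precombine field after ";", and every
# family must include the primary key.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rfc80-syntax-sketch").getOrCreate()

spark.sql("""
  CREATE TABLE users (id INT, name STRING, bio STRING, ts BIGINT)
  USING hudi
  TBLPROPERTIES (
    'primaryKey' = 'id',
    'preCombineField' = 'ts',
    'hoodie.columnFamily.base' = 'id,name;ts',
    'hoodie.columnFamily.detail' = 'id,bio;ts'
  )
""")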



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11559:
URL: https://github.com/apache/hudi/pull/11559#discussion_r1664978373


##
rfc/rfc-80/rfc-80.md:
##
@@ -0,0 +1,161 @@
+
+# RFC-80: Support column families for wide tbles
+
+## Proposers
+
+- @xiarixiaoyao
+- @wombatu-kun
+
+## Approvers
+ - 
+ - 
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-
+
+## Abstract
+
+In streaming processing, there are often scenarios where the table is widened. 
The current mainstream real-time stretching is completed through Flink's 
multi-layer join;
+Flink's join will cache a large amount of data in the state backend. As the 
data set increases, the pressure on the Flink task state backend will gradually 
increase, and may even become unavailable.
+In multi-layer join scenarios, this problem is more obvious.
+
+## Background
+Currently, Hudi organizes data according to fileGroup granularity. The 
fileGroup is further divided into column clusters to introduce the columnFamily 
concept.  
+The organizational form of Hudi files is divided according to the following 
rules:  
+The data in the partition is divided into buckets according to hash; the files 
in each bucket are divided according to columnFamily; multiple colFamily files 
in the bucket form a completed fileGroup; when there is only one columnFamily, 
it degenerates into the native Hudi bucket table.
+
+![table](table.png)
+
+After splitting the fileGroup by columnFamily, the naming rules for base files 
and log files change. We add the cfName suffix to all file names to facilitate 
Hudi itself to distinguish column families. The addition of this suffix is 
compatible with Hudi's original naming method and has no conflict.
+
+![filenames](filenames.png)
+
+## Implementation
+Describe the new thing you want to do in appropriate detail, how it fits into 
the project architecture. 
+Provide a detailed description of how you intend to implement this 
feature.This may be fairly extensive and have large subsections of its own. 
+Or it may be a few sentences. Use judgement based on the scope of the change.
+
+### Constraints and Restrictions
+1. The overall design relies on the lock-free concurrent writing feature of 
Hudi 1.0.  

Review Comment:
   It's not lock-free, it's just non-blocking: the Hudi table utilizes the lock to keep instant time generation monotonically increasing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11559:
URL: https://github.com/apache/hudi/pull/11559#discussion_r1664977832


##
rfc/rfc-80/rfc-80.md:
##
@@ -0,0 +1,161 @@
+
+# RFC-80: Support column families for wide tbles
+
+## Proposers
+
+- @xiarixiaoyao
+- @wombatu-kun
+
+## Approvers
+ - 
+ - 
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-
+
+## Abstract
+
+In streaming processing, there are often scenarios where the table is widened. 
The current mainstream real-time stretching is completed through Flink's 
multi-layer join;
+Flink's join will cache a large amount of data in the state backend. As the 
data set increases, the pressure on the Flink task state backend will gradually 
increase, and may even become unavailable.
+In multi-layer join scenarios, this problem is more obvious.
+
+## Background
+Currently, Hudi organizes data according to fileGroup granularity. The 
fileGroup is further divided into column clusters to introduce the columnFamily 
concept.  
+The organizational form of Hudi files is divided according to the following 
rules:  
+The data in the partition is divided into buckets according to hash; the files 
in each bucket are divided according to columnFamily; multiple colFamily files 
in the bucket form a completed fileGroup; when there is only one columnFamily, 
it degenerates into the native Hudi bucket table.
+
+![table](table.png)
+
+After splitting the fileGroup by columnFamily, the naming rules for base files 
and log files change. We add the cfName suffix to all file names to facilitate 
Hudi itself to distinguish column families. The addition of this suffix is 
compatible with Hudi's original naming method and has no conflict.
+
+![filenames](filenames.png)
+
+## Implementation
+Describe the new thing you want to do in appropriate detail, how it fits into 
the project architecture. 

Review Comment:
   This needs to be elaborated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11559:
URL: https://github.com/apache/hudi/pull/11559#discussion_r1664977594


##
rfc/rfc-80/rfc-80.md:
##
@@ -0,0 +1,161 @@
+
+# RFC-80: Support column families for wide tbles
+
+## Proposers
+
+- @xiarixiaoyao
+- @wombatu-kun
+
+## Approvers
+ - 
+ - 
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-
+
+## Abstract
+
+In streaming processing, there are often scenarios where the table is widened. 
The current mainstream real-time stretching is completed through Flink's 
multi-layer join;
+Flink's join will cache a large amount of data in the state backend. As the 
data set increases, the pressure on the Flink task state backend will gradually 
increase, and may even become unavailable.
+In multi-layer join scenarios, this problem is more obvious.
+
+## Background
+Currently, Hudi organizes data according to fileGroup granularity. The 
fileGroup is further divided into column clusters to introduce the columnFamily 
concept.  
+The organizational form of Hudi files is divided according to the following 
rules:  
+The data in the partition is divided into buckets according to hash; the files 
in each bucket are divided according to columnFamily; multiple colFamily files 
in the bucket form a completed fileGroup; when there is only one columnFamily, 
it degenerates into the native Hudi bucket table.
+
+![table](table.png)
+
+After splitting the fileGroup by columnFamily, the naming rules for base files 
and log files change. We add the cfName suffix to all file names to facilitate 
Hudi itself to distinguish column families. The addition of this suffix is 
compatible with Hudi's original naming method and has no conflict.
+

Review Comment:
   It looks like the colFamilyName is part of the write token now?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7905] Use cluster action for clustering pending instants [hudi]

2024-07-03 Thread via GitHub


lokeshj1703 commented on PR #11553:
URL: https://github.com/apache/hudi/pull/11553#issuecomment-2207852135

   @hudi-bot run azure





[jira] [Updated] (HUDI-7950) Shade roaring bitmap dependency in root POM

2024-07-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7950:

Status: Patch Available  (was: In Progress)

> Shade roaring bitmap dependency in root POM
> ---
>
> Key: HUDI-7950
> URL: https://issues.apache.org/jira/browse/HUDI-7950
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0-beta2, 1.0.0, 0.15.1
>
>
> We should unify the shading rule of roaring bitmap dependency in the root POM 
> for consistency among bundles.





Re: [PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11563:
URL: https://github.com/apache/hudi/pull/11563#issuecomment-2207802336

   
   ## CI report:
   
   * 16b1d5c2603ef3eb68a40bc14572751b787d8d2f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24694)
 
   
   
   





Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11534:
URL: https://github.com/apache/hudi/pull/11534#issuecomment-2207801978

   
   ## CI report:
   
   * a4b0e88de32cb689056c049fcf207b72a7df7fb4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24623)
 
   * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN
   * 7c75f078faf19390ceac585790181032570d184d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24695)
 
   
   
   





Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11534:
URL: https://github.com/apache/hudi/pull/11534#issuecomment-2207782132

   
   ## CI report:
   
   * a4b0e88de32cb689056c049fcf207b72a7df7fb4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24623)
 
   * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN
   * 7c75f078faf19390ceac585790181032570d184d UNKNOWN
   
   
   





[jira] [Updated] (HUDI-7937) Fix handling of decimals in StreamSync and Clustering

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7937:
-
Labels: pull-request-available  (was: )

> Fix handling of decimals in StreamSync and Clustering
> -
>
> Key: HUDI-7937
> URL: https://issues.apache.org/jira/browse/HUDI-7937
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> When decimals are using a small precision, we need to write them in legacy 
> format to ensure all hudi components can read them back. 





Re: [PR] [HUDI-7937] Handle legacy writer requirements in StreamSync and Clustering [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11534:
URL: https://github.com/apache/hudi/pull/11534#issuecomment-2207762315

   
   ## CI report:
   
   * a4b0e88de32cb689056c049fcf207b72a7df7fb4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24623)
 
   * 05576aa12434670872e6ddbb5ada85f6cd56dbe3 UNKNOWN
   
   
   





Re: [I] Does Hudi has the warm/cold data archive solution [hudi]

2024-07-03 Thread via GitHub


njalan closed issue #11457: Does Hudi has the warm/cold data archive solution
URL: https://github.com/apache/hudi/issues/11457





Re: [PR] [HUDI-7921] Fixing file system view closures in MDT [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on code in PR #11496:
URL: https://github.com/apache/hudi/pull/11496#discussion_r1664947146


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1082,8 +1083,10 @@ private HoodieData<HoodieRecord> getFunctionalIndexUpdates(HoodieCommitMetadata
 HoodieIndexDefinition indexDefinition = getFunctionalIndexDefinition(indexPartition);
 List<Pair<String, FileSlice>> partitionFileSlicePairs = new ArrayList<>();
 HoodieTableFileSystemView fsView = HoodieTableMetadataUtil.getFileSystemView(dataMetaClient);
+fileSystemViews.add(fsView);
+HoodieTableFileSystemView finalFsView = fsView;
 commitMetadata.getPartitionToWriteStats().forEach((dataPartition, value) -> {
-  List<FileSlice> fileSlices = getPartitionLatestFileSlicesIncludingInflight(dataMetaClient, Option.ofNullable(fsView), dataPartition);
+  List<FileSlice> fileSlices = getPartitionLatestFileSlicesIncludingInflight(dataMetaClient, Option.ofNullable(finalFsView), dataPartition);

Review Comment:
   Can we move the instantiation of `fsView` inside 
`getPartitionLatestFileSlicesIncludingInflight`? Then there is no need for the 
fs view cache.
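   
   A minimal sketch of that suggestion (hypothetical: the class and method 
names come from the diff above, `fetchLatestFileSlicesIncludingInflight` is an 
invented placeholder for the existing slice-fetching logic, and it assumes 
`HoodieTableFileSystemView` is closeable so it can sit in try-with-resources):
   
   ```java
   private List<FileSlice> getPartitionLatestFileSlicesIncludingInflight(
       HoodieTableMetaClient dataMetaClient, String dataPartition) {
     // Build the view inside the method so its lifecycle is scoped to this
     // call and the external fileSystemViews cache becomes unnecessary.
     try (HoodieTableFileSystemView fsView =
              HoodieTableMetadataUtil.getFileSystemView(dataMetaClient)) {
       // Placeholder for the existing fetch logic.
       return fetchLatestFileSlicesIncludingInflight(fsView, dataPartition);
     }
   }
   ```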









Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on issue #11554:
URL: https://github.com/apache/hudi/issues/11554#issuecomment-2207698467

   The cmd writes an empty data frame using the Spark writer: 
https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableDropPartitionCommand.scala,
 and that triggers a batch sync of partitions through 
https://github.com/apache/hudi/blob/master/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java
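   
   In sketch form (simplified; the option keys are standard Hudi write configs, 
`partitionsToDrop` and `basePath` are stand-ins, and the real command lives in 
the Scala file linked above):
   
   ```java
   // ALTER TABLE ... DROP PARTITION boils down to an "empty" write whose
   // commit carries the delete_partition operation; the post-commit meta-sync
   // then lists/updates partitions in Glue, which is the expensive step when
   // there are many partitions.
   spark.emptyDataFrame().write().format("hudi")
       .option("hoodie.datasource.write.operation", "delete_partition")
       .option("hoodie.datasource.write.partitions.to.delete",
               String.join(",", partitionsToDrop))
       .mode(SaveMode.Append)
       .save(basePath);
   ```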





Re: [I] [SUPPORT] insert into hudi table with columns specified(reordered and not in table schema order) throws exception [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on issue #11552:
URL: https://github.com/apache/hudi/issues/11552#issuecomment-2207682862

   @KnightChess Thanks so much for taking care of the fix





Re: [PR] [HUDI-7859] Rename instant files to be consistent with 0.x naming format when downgrade [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on PR #11545:
URL: https://github.com/apache/hudi/pull/11545#issuecomment-2207680792

   @watermelon12138 would you mind fixing the compile errors: 
https://github.com/apache/hudi/actions/runs/9756232191/job/26926142959?pr=11545





Re: [PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11563:
URL: https://github.com/apache/hudi/pull/11563#issuecomment-2207639885

   
   ## CI report:
   
   * 16b1d5c2603ef3eb68a40bc14572751b787d8d2f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24694)
 
   
   
   





Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11559:
URL: https://github.com/apache/hudi/pull/11559#issuecomment-2207639749

   
   ## CI report:
   
   * 1c8b5ccd83bfd1861d050c47b78a4addd6e558a1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24685)
 
   * 206eabc0c6a752e7a1e1d2206db231bf9a831570 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24693)
 
   
   
   





Re: [PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11563:
URL: https://github.com/apache/hudi/pull/11563#issuecomment-2207619588

   
   ## CI report:
   
   * 16b1d5c2603ef3eb68a40bc14572751b787d8d2f UNKNOWN
   
   
   





Re: [PR] [HUDI-7948] RFC-80: Support column families for wide tables [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11559:
URL: https://github.com/apache/hudi/pull/11559#issuecomment-2207619441

   
   ## CI report:
   
   * 1c8b5ccd83bfd1861d050c47b78a4addd6e558a1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24685)
 
   * 206eabc0c6a752e7a1e1d2206db231bf9a831570 UNKNOWN
   
   
   





Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207598201

   
   ## CI report:
   
   * 192707054c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24691)
 
   
   
   





[jira] [Assigned] (HUDI-7951) Classes using avro causing conflict in hudi-aws-bundle

2024-07-03 Thread Shawn Chang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Chang reassigned HUDI-7951:
-

Assignee: Shawn Chang

> Classes using avro causing conflict in hudi-aws-bundle
> --
>
> Key: HUDI-7951
> URL: https://issues.apache.org/jira/browse/HUDI-7951
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Shawn Chang
>Assignee: Shawn Chang
>Priority: Major
>  Labels: pull-request-available
>
> Hudi 0.15 added some Hudi classes that use avro (ParquetTableSchemaResolver 
> in this case) and made hudi-aws-bundle depend on hudi-hadoop-common. 
> hudi-aws-bundle does not relocate avro classes, in order to stay compatible 
> with hudi-spark.
>  
> The issue happens when using hudi-flink-bundle together with hudi-aws-bundle: 
> hudi-flink-bundle relocates avro classes, which causes a class conflict:
> {code:java}
> java.lang.NoSuchMethodError: 'org.apache.parquet.schema.MessageType 
> org.apache.hudi.common.table.ParquetTableSchemaResolver.convertAvroSchemaToParquet(org.apache.hudi.org.apache.avro.Schema,
>  org.apache.hadoop.conf.Configuration)'
> {code}





[jira] [Updated] (HUDI-7951) Classes using avro causing conflict in hudi-aws-bundle

2024-07-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7951:
-
Labels: pull-request-available  (was: )

> Classes using avro causing conflict in hudi-aws-bundle
> --
>
> Key: HUDI-7951
> URL: https://issues.apache.org/jira/browse/HUDI-7951
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Shawn Chang
>Priority: Major
>  Labels: pull-request-available
>
> Hudi 0.15 added some Hudi classes that use avro (ParquetTableSchemaResolver 
> in this case) and made hudi-aws-bundle depend on hudi-hadoop-common. 
> hudi-aws-bundle does not relocate avro classes, in order to stay compatible 
> with hudi-spark.
>  
> The issue happens when using hudi-flink-bundle together with hudi-aws-bundle: 
> hudi-flink-bundle relocates avro classes, which causes a class conflict:
> {code:java}
> java.lang.NoSuchMethodError: 'org.apache.parquet.schema.MessageType 
> org.apache.hudi.common.table.ParquetTableSchemaResolver.convertAvroSchemaToParquet(org.apache.hudi.org.apache.avro.Schema,
>  org.apache.hadoop.conf.Configuration)'
> {code}





[PR] [HUDI-7951] Fix conflict caused by classes using avro in hudi-aws-bundle [hudi]

2024-07-03 Thread via GitHub


CTTY opened a new pull request, #11563:
URL: https://github.com/apache/hudi/pull/11563

   ### Change Logs
   Hudi 0.15 added some Hudi classes that use avro (e.g. 
`ParquetTableSchemaResolver`) and made `hudi-aws-bundle` depend on 
`hudi-hadoop-common`. `hudi-aws-bundle` does not relocate avro classes, in 
order to stay compatible with hudi-spark.
   
   The issue happens when using hudi-flink-bundle together with 
hudi-aws-bundle: hudi-flink-bundle relocates avro classes, which causes a class 
conflict:
   
   ```
   java.lang.NoSuchMethodError: 'org.apache.parquet.schema.MessageType 
org.apache.hudi.common.table.ParquetTableSchemaResolver.convertAvroSchemaToParquet(org.apache.hudi.org.apache.avro.Schema,
 org.apache.hadoop.conf.Configuration)'
   ```
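   
   For context, the relocation that produces the 
`org.apache.hudi.org.apache.avro` prefix in that stack trace is the usual 
maven-shade rule of this shape (illustrative snippet, not the literal 
hudi-flink-bundle POM):
   
   ```xml
   <!-- A bundle applying a relocation like this can no longer link against
        classes compiled against plain org.apache.avro, e.g. ones pulled in
        unrelocated via hudi-aws-bundle. -->
   <relocation>
     <pattern>org.apache.avro.</pattern>
     <shadedPattern>org.apache.hudi.org.apache.avro.</shadedPattern>
   </relocation>
   ```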
   
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-7951) Classes using avro causing conflict in hudi-aws-bundle

2024-07-03 Thread Shawn Chang (Jira)
Shawn Chang created HUDI-7951:
-

 Summary: Classes using avro causing conflict in hudi-aws-bundle
 Key: HUDI-7951
 URL: https://issues.apache.org/jira/browse/HUDI-7951
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Shawn Chang


Hudi 0.15 added some Hudi classes that use avro (ParquetTableSchemaResolver in 
this case) and made hudi-aws-bundle depend on hudi-hadoop-common. 
hudi-aws-bundle does not relocate avro classes, in order to stay compatible 
with hudi-spark.

The issue happens when using hudi-flink-bundle together with hudi-aws-bundle: 
hudi-flink-bundle relocates avro classes, which causes a class conflict:


{code:java}
java.lang.NoSuchMethodError: 'org.apache.parquet.schema.MessageType 
org.apache.hudi.common.table.ParquetTableSchemaResolver.convertAvroSchemaToParquet(org.apache.hudi.org.apache.avro.Schema,
 org.apache.hadoop.conf.Configuration)'
{code}







Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207484212

   
   ## CI report:
   
   * 23d89d4a510f44094d65a95a02490e5cd7a9b165 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24692)
 
   * 192707054c UNKNOWN
   
   
   





Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207404846

   
   ## CI report:
   
   * 23d89d4a510f44094d65a95a02490e5cd7a9b165 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24692)
 
   
   
   





Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207397295

   
   ## CI report:
   
   * 192707054c0d633621d2db4f706d6487974a74bb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24691)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24690)
 
   * 23d89d4a510f44094d65a95a02490e5cd7a9b165 UNKNOWN
   
   
   





Re: [PR] [HUDI-7950] Shade roaring bitmap dependency in root POM [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11561:
URL: https://github.com/apache/hudi/pull/11561#issuecomment-2207297970

   
   ## CI report:
   
   * e73a045009286da007f0ade464d3e24d87d08c9d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24688)
 
   
   
   





Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207297893

   
   ## CI report:
   
   * 192707054c0d633621d2db4f706d6487974a74bb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24691)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24690)
 
   
   
   





Re: [I] [SUPPORT] - Performance Variation in Hudi 0.14 [hudi]

2024-07-03 Thread via GitHub


RuyRoaV commented on issue #11481:
URL: https://github.com/apache/hudi/issues/11481#issuecomment-2207209448

   Hi Aditya,
   
   I have tried out your recommendation and found the following:
   
   **Using SIMPLE INDEX**
   
   The average execution time was reduced from 20 min to around 11 min, which 
is great. In the Spark UI screenshots, you can see that a big percentage of the 
execution time is taken by a `countByKey at JavaPairRDD` action in 
`SparkCommitUpsert`, especially during the `ShuffleWrite` part.
   
   ![Screenshot 2024-07-03 at 16 44 
57](https://github.com/apache/hudi/assets/173461014/deb0599e-00e0-4cab-a6b5-8d4dcb8fb557)
   ![Screenshot 2024-07-03 at 16 52 
19](https://github.com/apache/hudi/assets/173461014/3a37ef31-cbbf-4425-9f0d-f2c96948c4e9)
   ![Screenshot 2024-07-03 at 16 52 
46](https://github.com/apache/hudi/assets/173461014/1440ea2d-9bab-44b9-85a8-9395375abba9)
   
   **We need to reduce the job runtime even further; is there any other 
recommendation regarding the configurations we can set?** 
   
   We may try deactivating the archival beyond the savepoint a bit later. But 
I am curious why that would help us improve performance.
   
   **Using RECORD LEVEL** 
   
   I replaced the index for a table whose upsert Glue job was already running 
in under 5 minutes. Overall, the job runtime has remained the same, with most 
of the time spent in `count at HoodieSparkSqlWriter.scala:1072` during 
`SparkCommitUpsert`. This is similar to the case presented when submitting this 
ticket. 
   
   I'll try with one of our long-running jobs and will let you know the outcome.
   
   By the way, **is there a way to check the index type of a table?** 
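   
   (For reference, a minimal sketch of where that setting lives, since the 
index type is a write-side config rather than table metadata; the keys below 
are standard Hudi write configs, and `df`/`basePath` are placeholders:)
   
   ```java
   // The index type is whatever the writing job passes, so "checking" it
   // typically means checking the options the job supplies to the writer.
   df.write().format("hudi")
       .option("hoodie.index.type", "RECORD_INDEX")            // or SIMPLE, BLOOM, ...
       .option("hoodie.metadata.record.index.enable", "true")  // required for RECORD_INDEX
       .mode(SaveMode.Append)
       .save(basePath);
   ```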
   
   Thanks
   
   Best regards





Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207181205

   
   ## CI report:
   
   * 761085a6fa9cc6eeca493c1c116caea56b3693f8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24680)
 
   * 192707054c0d633621d2db4f706d6487974a74bb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24691)
 
   
   
   





Re: [PR] [HUDI-7945] Fix file pruning using PARTITION_STATS index in Spark [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11556:
URL: https://github.com/apache/hudi/pull/11556#issuecomment-2207162234

   
   ## CI report:
   
   * 761085a6fa9cc6eeca493c1c116caea56b3693f8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24680)
 
   * 192707054c0d633621d2db4f706d6487974a74bb UNKNOWN
   
   
   





Re: [PR] refactor: implement datafusion API using ParquetExec [hudi-rs]

2024-07-03 Thread via GitHub


codecov[bot] commented on PR #35:
URL: https://github.com/apache/hudi-rs/pull/35#issuecomment-2207161986

   ## 
[Codecov](https://app.codecov.io/gh/apache/hudi-rs/pull/35?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   Attention: Patch coverage is `70.0%` with `9 lines` in your changes 
missing coverage. Please review.
   > Project coverage is 84.81%. Comparing base 
[(`52a9245`)](https://app.codecov.io/gh/apache/hudi-rs/commit/52a924557ee18effadc02749ec7cdb1001ad6b58?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`8f5f96a`)](https://app.codecov.io/gh/apache/hudi-rs/commit/8f5f96a813faf8a686d210eb652235ab247d8b57?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   | 
[Files](https://app.codecov.io/gh/apache/hudi-rs/pull/35?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 | Patch % | Lines |
   |---|---|---|
   | 
[crates/datafusion/src/lib.rs](https://app.codecov.io/gh/apache/hudi-rs/pull/35?src=pr&el=tree&filepath=crates%2Fdatafusion%2Fsrc%2Flib.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y3JhdGVzL2RhdGFmdXNpb24vc3JjL2xpYi5ycw==)
 | 55.00% | [9 Missing :warning: 
](https://app.codecov.io/gh/apache/hudi-rs/pull/35?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   
   Additional details and impacted files
   
   
   ```diff
   @@            Coverage Diff             @@
   ##             main      #35      +/-   ##
   ==========================================
   - Coverage   88.84%   84.81%   -4.04%     
   ==========================================
     Files          10       10              
     Lines         511      507       -4     
   ==========================================
   - Hits          454      430      -24     
   - Misses         57       77      +20     
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/hudi-rs/pull/35?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   





[PR] refactor: implement datafusion API using ParquetExec [hudi-rs]

2024-07-03 Thread via GitHub


xushiyan opened a new pull request, #35:
URL: https://github.com/apache/hudi-rs/pull/35

   - upgrade arrow from `50` to `52.0.0`
   - upgrade datafusion from `35` to `39.0.0`
   - leverage `ParquetExec` for implementing TableProvider for Hudi in 
datafusion





Re: [PR] [HUDI-7921] Making HoodieTable closeable [hudi]

2024-07-03 Thread via GitHub


nsivabalan closed pull request #11494: [HUDI-7921] Making HoodieTable closeable
URL: https://github.com/apache/hudi/pull/11494





Re: [PR] [HUDI-7950] Shade roaring bitmap dependency in root POM [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11561:
URL: https://github.com/apache/hudi/pull/11561#issuecomment-2207028303

   
   ## CI report:
   
   * e73a045009286da007f0ade464d3e24d87d08c9d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24688)
 
   
   
   





Re: [PR] [DNM] Temp diff testing 1.x reads with 0.x branch [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11562:
URL: https://github.com/apache/hudi/pull/11562#issuecomment-2207028360

   
   ## CI report:
   
   * 013aef32a3ad3aa995beb626f5855d9a05234cbf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24689)
 
   
   
   





Re: [PR] [HUDI-7950] Shade roaring bitmap dependency in root POM [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11561:
URL: https://github.com/apache/hudi/pull/11561#issuecomment-2207017035

   
   ## CI report:
   
   * e73a045009286da007f0ade464d3e24d87d08c9d UNKNOWN
   
   
   





Re: [PR] [DNM] Temp diff testing 1.x reads with 0.x branch [hudi]

2024-07-03 Thread via GitHub


hudi-bot commented on PR #11562:
URL: https://github.com/apache/hudi/pull/11562#issuecomment-2207017098

   
   ## CI report:
   
   * 013aef32a3ad3aa995beb626f5855d9a05234cbf UNKNOWN
   
   
   





Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-03 Thread via GitHub


nsivabalan commented on PR #11514:
URL: https://github.com/apache/hudi/pull/11514#issuecomment-2206976055

   Here is a glimpse of the changes I had to make to the 0.x timeline to 
support 1.x table reads:
   https://github.com/apache/hudi/pull/11562 
   This is just a draft/hacky PR, just in case you wanna take a peek. 





Re: [PR] [DNM] Temp diff testing 1.x reads with 0.x branch [hudi]

2024-07-03 Thread via GitHub


nsivabalan commented on code in PR #11562:
URL: https://github.com/apache/hudi/pull/11562#discussion_r1664607967


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineMetadataUtils.java:
##
@@ -209,6 +225,14 @@ public static <T extends SpecificRecordBase> T 
deserializeAvroMetadata(byte[] by
 return fileReader.next();
   }
 
+  public static HoodieCommitMetadata deserializeCommitMetadata(byte[] bytes) 
throws IOException {
+return deserializeAvroMetadata(bytes, HoodieCommitMetadata.class);

Review Comment:
   NTR: supporting deser avro commit metadata





