[GitHub] [hudi] hudi-bot commented on pull request #6677: [HUDI-4294][Stacked on 4293] Introduce build action to actually perform index data generation

2022-09-14 Thread GitBox


hudi-bot commented on PR #6677:
URL: https://github.com/apache/hudi/pull/6677#issuecomment-1247664423

   
   ## CI report:
   
   * 59e2196397ef68d75697a35b1b91e661ef9d3aa4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11372)
 
   * 0ce0aee73e1641f071abdfc44d4f5473a425befb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table

2022-09-14 Thread GitBox


hudi-bot commented on PR #4676:
URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247657488

   
   ## CI report:
   
   * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN
   * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN
   * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN
   * e3914eb7b48fc4c5e3bd6f0fd00888ac6da8fa21 UNKNOWN
   * 24747bb7e1f23d6db70672cab3795cb131ce8dcb Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11371)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hackergin commented on a diff in pull request #6628: [HUDI-4806] Use Avro version from the root pom for Flink bundle

2022-09-14 Thread GitBox


hackergin commented on code in PR #6628:
URL: https://github.com/apache/hudi/pull/6628#discussion_r971586922


##
packaging/hudi-flink-bundle/pom.xml:
##
@@ -501,8 +501,7 @@
 
       <groupId>org.apache.avro</groupId>
       <artifactId>avro</artifactId>
-
-      <version>1.10.0</version>
+      <version>${avro.version}</version>

Review Comment:
   > Yes, we've tested with the Flink streamer loading data from a Kafka datasource in Hudi format. And it works fine
   
   Hi @CTTY, I hit a java.lang.ClassNotFoundException when using the latest master code. The class org.apache.avro.LogicalTypes$LocalTimestampMillis seems to exist only in Avro 1.10 and later. Could you help confirm this problem? Correct me if I am wrong.
   ```
   Caused by: java.lang.ClassNotFoundException: org.apache.hudi.org.apache.avro.LogicalTypes$LocalTimestampMillis
   at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[?:1.8.0_202]
   at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_202]
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) ~[?:1.8.0_202]
   at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_202]
   at org.apache.hudi.table.HoodieTableFactory.inferAvroSchema(HoodieTableFactory.java:346) ~[hudi-flink1.14-bundle-0.13.0-SNAPSHOT.jar:0.13.0-SNAPSHOT]
   ```
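   
   To narrow this down, a quick probe like the following can confirm whether the Avro on the classpath is older than 1.10 (a hypothetical sketch: the class names are taken from the stack trace above, and the shaded name assumes the bundle relocates Avro under `org.apache.hudi`):
   ```java
   // Hypothetical probe: check which LogicalTypes inner classes are loadable
   // from the current classpath (e.g. with the Flink bundle jar on it).
   public class AvroClassProbe {
     public static void main(String[] args) {
       String[] candidates = {
           "org.apache.avro.LogicalTypes$LocalTimestampMillis",                 // Avro >= 1.10
           "org.apache.hudi.org.apache.avro.LogicalTypes$LocalTimestampMillis"  // shaded copy
       };
       for (String name : candidates) {
         try {
           Class.forName(name);
           System.out.println("found:   " + name);
         } catch (ClassNotFoundException e) {
           System.out.println("missing: " + name);
         }
       }
     }
   }
   ```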



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] fengjian428 commented on a diff in pull request #4676: [HUDI-3304] support partial update on mor table

2022-09-14 Thread GitBox


fengjian428 commented on code in PR #4676:
URL: https://github.com/apache/hudi/pull/4676#discussion_r971585340


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieWriteHelper.java:
##
@@ -51,16 +55,20 @@ protected HoodieData<HoodieRecord<T>> tag(HoodieData<HoodieRecord<T>> dedupedRec
 
   @Override
   public HoodieData<HoodieRecord<T>> deduplicateRecords(
-      HoodieData<HoodieRecord<T>> records, HoodieIndex<?, ?> index, int parallelism) {
+      HoodieData<HoodieRecord<T>> records, HoodieIndex<?, ?> index, int parallelism, String jsonSchema) {
     boolean isIndexingGlobal = index.isGlobal();
+    final Schema[] schema = {null};
     return records.mapToPair(record -> {
       HoodieKey hoodieKey = record.getKey();
       // If index used is global, then records are expected to differ in their partitionPath
       Object key = isIndexingGlobal ? hoodieKey.getRecordKey() : hoodieKey;
       return Pair.of(key, record);
     }).reduceByKey((rec1, rec2) -> {
+      if (schema[0] == null) {
+        schema[0] = new Schema.Parser().parse(jsonSchema);

Review Comment:
   How about passing a `SerializableSchema` to every executor? It seems it is not easy to get the Spark context here.
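   
   As a rough illustration of that suggestion (a minimal sketch, not Hudi's actual `SerializableSchema` class): wrap the schema JSON in a serializable holder created once on the driver, so each executor parses it at most once instead of once per `reduceByKey` invocation.
   ```java
   import org.apache.avro.Schema;
   import java.io.Serializable;
   
   // Sketch: Avro's Schema is not Serializable, so ship the JSON string and
   // parse it lazily (and cache it) on whichever executor first needs it.
   public class LazySchemaHolder implements Serializable {
     private final String jsonSchema;
     private transient volatile Schema parsed;
   
     public LazySchemaHolder(String jsonSchema) {
       this.jsonSchema = jsonSchema;
     }
   
     public Schema get() {
       if (parsed == null) {
         parsed = new Schema.Parser().parse(jsonSchema); // idempotent, benign race
       }
       return parsed;
     }
   }
   ```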



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] boundarymate commented on a diff in pull request #3951: [HUDI-2715] The BitCaskDiskMap iterator may cause memory leak

2022-09-14 Thread GitBox


boundarymate commented on code in PR #3951:
URL: https://github.com/apache/hudi/pull/3951#discussion_r971492356


##
hudi-common/src/main/java/org/apache/hudi/common/util/collection/BitCaskDiskMap.java:
##
@@ -275,6 +282,7 @@ public void close() {
       }
     }
     writeOnlyFile.delete();
+    this.iterators.forEach(ClosableIterator::close);

Review Comment:
   Hi Danny,
   Why not close these in a `finally` block?
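   
   For reference, the shape being suggested would look roughly like this (a sketch only; the original body of `close()` is elided, and the field names come from the diff above):
   ```java
   // Sketch: run cleanup in finally so open iterators are closed even if
   // closing/deleting the backing file throws.
   public void close() {
     try {
       // ... original close logic: flush buffers, close file handles ...
     } finally {
       writeOnlyFile.delete();
       this.iterators.forEach(ClosableIterator::close);
     }
   }
   ```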



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #6600: [RFC-62] Diagnostic Reporter

2022-09-14 Thread GitBox


zhangyue19921010 commented on code in PR #6600:
URL: https://github.com/apache/hudi/pull/6600#discussion_r971576211


##
rfc/rfc-62/rfc-62.md:
##
@@ -0,0 +1,443 @@
+
+# RFC-62: Diagnostic Reporter
+
+
+
+## Proposers
+
+- zhangyue19921...@163.com
+
+## Approvers
+ - @codope
+ - @xushiyan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4707
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+With the development of Hudi, more and more users choose Hudi to build their own ingestion pipelines to support real-time or batch upsert requirements.
+Subsequently, some of them may ask the community for help: how can they improve the performance of their Hudi ingestion jobs? Why did their Hudi jobs fail? And so on.
+
+For the volunteers in the Hudi community, dealing with such issues usually means asking users to provide a list of information, including engine context, job configs,
+data pattern, Spark UI, etc. Users need to spend extra effort to review their own jobs, collect metrics one by one according to the list, and give feedback to the volunteers.
+Moreover, unexpected errors may occur while users manually collect this information.
+
+Obviously, there are relatively high communication costs for both volunteers and users.
+
+On the other hand, advanced users also need some way to efficiently understand the characteristics of their Hudi tables, including data volume, upsert pattern, and so on.
+
+## Background
+As we know, Hudi already has its own metrics system and metadata framework. This information is very important for Hudi job tuning and troubleshooting. For example:
+
+1. Hudi records the complete timeline in the .hoodie directory, including the active timeline and the archived timeline. From this we can trace the historical state of a Hudi job.
+
+2. The Hudi metadata table records all the partitions, all the data files, etc.
+
+3. Each commit records various metadata and runtime metrics for the data currently written, such as:
+```json
+{
+"partitionToWriteStats":{
+"20210623/0/20210825":[
+{
+"fileId":"4ae31921-eedd-4c56-8218-bb47849397a4-0",
+
"path":"20210623/0/20210825/4ae31921-eedd-4c56-8218-bb47849397a4-0_0-27-2006_20220818134233973.parquet",
+"prevCommit":"null",
+"numWrites":123352,
+"numDeletes":0,
+"numUpdateWrites":0,
+"numInserts":123352,
+"totalWriteBytes":4675371,
+"totalWriteErrors":0,
+"tempPath":null,
+"partitionPath":"20210623/0/20210825",
+"totalLogRecords":0,
+"totalLogFilesCompacted":0,
+"totalLogSizeCompacted":0,
+"totalUpdatedRecordsCompacted":0,
+"totalLogBlocks":0,
+"totalCorruptLogBlock":0,
+"totalRollbackBlocks":0,
+"fileSizeInBytes":4675371,
+"minEventTime":null,
+"maxEventTime":null
+}
+]
+},
+"compacted":false,
+"extraMetadata":{
+"schema":""
+},
+"operationType":"UPSERT",
+"totalRecordsDeleted":0,
+"totalLogFilesSize":0,
+"totalScanTime":0,
+"totalCreateTime":21051,
+"totalUpsertTime":0,
+"minAndMaxEventTime":{
+"Optional.empty":{
+"val":null,
+"present":false
+}
+},
+"writePartitionPaths":[
+"20210623/0/20210825"
+],
+"fileIdAndRelativePaths":{
+
"c144908e-ca7d-401f-be1c-613de98d96a3-0":"20210623/0/20210825/c144908e-ca7d-401f-be1c-613de98d96a3-0_3-33-2009_20220818134233973.parquet"
+},
+"totalLogRecordsCompacted":0,
+"totalLogFilesCompacted":0,
+"totalCompactedRecordsUpdated":0
+}
+```
+
+In order to expose Hudi table context more efficiently, this RFC proposes a Diagnostic Reporter Tool.
+This tool can be turned on as the final stage of an ingestion job after commit; it will collect common troubleshooting information, including engine runtime information (taking Spark as the example here), and generate a diagnostic report JSON file.
+
+Or users can trigger this diagnostic reporter tool using hudi-cli to generate the report JSON file.
+
+## Implementation
+
+This Diagnostic Reporter Tool will go through the whole Hudi table and generate a report JSON file which contains all the necessary information. This tool will also package the .hoodie folder as a zip-compressed file.
+
+Users can use this Diagnostic Reporter Tool in the following two ways:
+1. Users can directly enable the diagnostic reporter in their writing jobs; the tool will then go through the current Hudi table and generate report files as the last stage after commit.
+2. Users can directly generate the corresponding report file for a Hudi table through the hudi-cli command

[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #6600: [RFC-62] Diagnostic Reporter

2022-09-14 Thread GitBox


zhangyue19921010 commented on code in PR #6600:
URL: https://github.com/apache/hudi/pull/6600#discussion_r971574398


##
rfc/rfc-62/rfc-62.md:
##
@@ -0,0 +1,443 @@
+
+# RFC-62: Diagnostic Reporter
+
+
+
+## Proposers
+
+- zhangyue19921...@163.com
+
+## Approvers
+ - @codope
+ - @xushiyan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4707
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+With the development of Hudi, more and more users choose Hudi to build their own ingestion pipelines to support real-time or batch upsert requirements.
+Subsequently, some of them may ask the community for help: how can they improve the performance of their Hudi ingestion jobs? Why did their Hudi jobs fail? And so on.
+
+For the volunteers in the Hudi community, dealing with such issues usually means asking users to provide a list of information, including engine context, job configs,
+data pattern, Spark UI, etc. Users need to spend extra effort to review their own jobs, collect metrics one by one according to the list, and give feedback to the volunteers.
+Moreover, unexpected errors may occur while users manually collect this information.
+
+Obviously, there are relatively high communication costs for both volunteers and users.
+
+On the other hand, advanced users also need some way to efficiently understand the characteristics of their Hudi tables, including data volume, upsert pattern, and so on.
+
+## Background
+As we know, Hudi already has its own metrics system and metadata framework. This information is very important for Hudi job tuning and troubleshooting. For example:
+
+1. Hudi records the complete timeline in the .hoodie directory, including the active timeline and the archived timeline. From this we can trace the historical state of a Hudi job.
+
+2. The Hudi metadata table records all the partitions, all the data files, etc.
+
+3. Each commit records various metadata and runtime metrics for the data currently written, such as:
+```json
+{
+"partitionToWriteStats":{
+"20210623/0/20210825":[
+{
+"fileId":"4ae31921-eedd-4c56-8218-bb47849397a4-0",
+
"path":"20210623/0/20210825/4ae31921-eedd-4c56-8218-bb47849397a4-0_0-27-2006_20220818134233973.parquet",
+"prevCommit":"null",
+"numWrites":123352,
+"numDeletes":0,
+"numUpdateWrites":0,
+"numInserts":123352,
+"totalWriteBytes":4675371,
+"totalWriteErrors":0,
+"tempPath":null,
+"partitionPath":"20210623/0/20210825",
+"totalLogRecords":0,
+"totalLogFilesCompacted":0,
+"totalLogSizeCompacted":0,
+"totalUpdatedRecordsCompacted":0,
+"totalLogBlocks":0,
+"totalCorruptLogBlock":0,
+"totalRollbackBlocks":0,
+"fileSizeInBytes":4675371,
+"minEventTime":null,
+"maxEventTime":null
+}
+]
+},
+"compacted":false,
+"extraMetadata":{
+"schema":""
+},
+"operationType":"UPSERT",
+"totalRecordsDeleted":0,
+"totalLogFilesSize":0,
+"totalScanTime":0,
+"totalCreateTime":21051,
+"totalUpsertTime":0,
+"minAndMaxEventTime":{
+"Optional.empty":{
+"val":null,
+"present":false
+}
+},
+"writePartitionPaths":[
+"20210623/0/20210825"
+],
+"fileIdAndRelativePaths":{
+
"c144908e-ca7d-401f-be1c-613de98d96a3-0":"20210623/0/20210825/c144908e-ca7d-401f-be1c-613de98d96a3-0_3-33-2009_20220818134233973.parquet"
+},
+"totalLogRecordsCompacted":0,
+"totalLogFilesCompacted":0,
+"totalCompactedRecordsUpdated":0
+}
+```
+
+In order to expose Hudi table context more efficiently, this RFC proposes a Diagnostic Reporter Tool.
+This tool can be turned on as the final stage of an ingestion job after commit; it will collect common troubleshooting information, including engine runtime information (taking Spark as the example here), and generate a diagnostic report JSON file.
+
+Or users can trigger this diagnostic reporter tool using hudi-cli to generate the report JSON file.
+
+## Implementation
+
+This Diagnostic Reporter Tool will go through the whole Hudi table and generate a report JSON file which contains all the necessary information. This tool will also package the .hoodie folder as a zip-compressed file.
+
+Users can use this Diagnostic Reporter Tool in the following two ways:
+1. Users can directly enable the diagnostic reporter in their writing jobs; the tool will then go through the current Hudi table and generate report files as the last stage after commit.
+2. Users can directly generate the corresponding report file for a Hudi table through the hudi-cli command

[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #6600: [RFC-62] Diagnostic Reporter

2022-09-14 Thread GitBox


zhangyue19921010 commented on code in PR #6600:
URL: https://github.com/apache/hudi/pull/6600#discussion_r971551331


##
rfc/rfc-62/rfc-62.md:
##
@@ -0,0 +1,443 @@
+
+# RFC-62: Diagnostic Reporter
+
+
+
+## Proposers
+
+- zhangyue19921...@163.com
+
+## Approvers
+ - @codope
+ - @xushiyan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4707
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+With the development of Hudi, more and more users choose Hudi to build their own ingestion pipelines to support real-time or batch upsert requirements.
+Subsequently, some of them may ask the community for help: how can they improve the performance of their Hudi ingestion jobs? Why did their Hudi jobs fail? And so on.
+
+For the volunteers in the Hudi community, dealing with such issues usually means asking users to provide a list of information, including engine context, job configs,
+data pattern, Spark UI, etc. Users need to spend extra effort to review their own jobs, collect metrics one by one according to the list, and give feedback to the volunteers.
+Moreover, unexpected errors may occur while users manually collect this information.
+
+Obviously, there are relatively high communication costs for both volunteers and users.
+
+On the other hand, advanced users also need some way to efficiently understand the characteristics of their Hudi tables, including data volume, upsert pattern, and so on.
+
+## Background
+As we know, Hudi already has its own metrics system and metadata framework. This information is very important for Hudi job tuning and troubleshooting. For example:
+
+1. Hudi records the complete timeline in the .hoodie directory, including the active timeline and the archived timeline. From this we can trace the historical state of a Hudi job.
+
+2. The Hudi metadata table records all the partitions, all the data files, etc.
+
+3. Each commit records various metadata and runtime metrics for the data currently written, such as:
+```json
+{
+"partitionToWriteStats":{
+"20210623/0/20210825":[
+{
+"fileId":"4ae31921-eedd-4c56-8218-bb47849397a4-0",
+
"path":"20210623/0/20210825/4ae31921-eedd-4c56-8218-bb47849397a4-0_0-27-2006_20220818134233973.parquet",
+"prevCommit":"null",
+"numWrites":123352,
+"numDeletes":0,
+"numUpdateWrites":0,
+"numInserts":123352,
+"totalWriteBytes":4675371,
+"totalWriteErrors":0,
+"tempPath":null,
+"partitionPath":"20210623/0/20210825",
+"totalLogRecords":0,
+"totalLogFilesCompacted":0,
+"totalLogSizeCompacted":0,
+"totalUpdatedRecordsCompacted":0,
+"totalLogBlocks":0,
+"totalCorruptLogBlock":0,
+"totalRollbackBlocks":0,
+"fileSizeInBytes":4675371,
+"minEventTime":null,
+"maxEventTime":null
+}
+]
+},
+"compacted":false,
+"extraMetadata":{
+"schema":""
+},
+"operationType":"UPSERT",
+"totalRecordsDeleted":0,
+"totalLogFilesSize":0,
+"totalScanTime":0,
+"totalCreateTime":21051,
+"totalUpsertTime":0,
+"minAndMaxEventTime":{
+"Optional.empty":{
+"val":null,
+"present":false
+}
+},
+"writePartitionPaths":[
+"20210623/0/20210825"
+],
+"fileIdAndRelativePaths":{
+
"c144908e-ca7d-401f-be1c-613de98d96a3-0":"20210623/0/20210825/c144908e-ca7d-401f-be1c-613de98d96a3-0_3-33-2009_20220818134233973.parquet"
+},
+"totalLogRecordsCompacted":0,
+"totalLogFilesCompacted":0,
+"totalCompactedRecordsUpdated":0
+}
+```
+
+In order to expose Hudi table context more efficiently, this RFC proposes a Diagnostic Reporter Tool.
+This tool can be turned on as the final stage of an ingestion job after commit; it will collect common troubleshooting information, including engine runtime information (taking Spark as the example here), and generate a diagnostic report JSON file.
+
+Or users can trigger this diagnostic reporter tool using hudi-cli to generate the report JSON file.
+
+## Implementation
+
+This Diagnostic Reporter Tool will go through the whole Hudi table and generate a report JSON file which contains all the necessary information. This tool will also package the .hoodie folder as a zip-compressed file.
+
+Users can use this Diagnostic Reporter Tool in the following two ways:
+1. Users can directly enable the diagnostic reporter in their writing jobs; the tool will then go through the current Hudi table and generate report files as the last stage after commit.
+2. Users can directly generate the corresponding report file for a Hudi table through the hudi-cli command

[GitHub] [hudi] hudi-bot commented on pull request #5933: [HUDI-4293] Implement Create/Drop/Show/Refresh Index Command for Secondary Index

2022-09-14 Thread GitBox


hudi-bot commented on PR #5933:
URL: https://github.com/apache/hudi/pull/5933#issuecomment-1247608019

   
   ## CI report:
   
   * 65359879df848d75b6693f4c313dc9453d635edd Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11370)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6677: [HUDI-4294][Stacked on 4293] Introduce build action to actually perform index data generation

2022-09-14 Thread GitBox


hudi-bot commented on PR #6677:
URL: https://github.com/apache/hudi/pull/6677#issuecomment-1247605650

   
   ## CI report:
   
   * 59e2196397ef68d75697a35b1b91e661ef9d3aa4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11372)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #6600: [RFC-62] Diagnostic Reporter

2022-09-14 Thread GitBox


zhangyue19921010 commented on code in PR #6600:
URL: https://github.com/apache/hudi/pull/6600#discussion_r971542395


##
rfc/rfc-62/rfc-62.md:
##
@@ -0,0 +1,443 @@
+
+# RFC-62: Diagnostic Reporter
+
+
+
+## Proposers
+
+- zhangyue19921...@163.com
+
+## Approvers
+ - @codope
+ - @xushiyan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4707
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+With the development of Hudi, more and more users choose Hudi to build their own ingestion pipelines to support real-time or batch upsert requirements.
+Subsequently, some of them may ask the community for help: how can they improve the performance of their Hudi ingestion jobs? Why did their Hudi jobs fail? And so on.
+
+For the volunteers in the Hudi community, dealing with such issues usually means asking users to provide a list of information, including engine context, job configs,
+data pattern, Spark UI, etc. Users need to spend extra effort to review their own jobs, collect metrics one by one according to the list, and give feedback to the volunteers.
+Moreover, unexpected errors may occur while users manually collect this information.
+
+Obviously, there are relatively high communication costs for both volunteers and users.
+
+On the other hand, advanced users also need some way to efficiently understand the characteristics of their Hudi tables, including data volume, upsert pattern, and so on.
+
+## Background
+As we know, Hudi already has its own metrics system and metadata framework. This information is very important for Hudi job tuning and troubleshooting. For example:
+
+1. Hudi records the complete timeline in the .hoodie directory, including the active timeline and the archived timeline. From this we can trace the historical state of a Hudi job.
+
+2. The Hudi metadata table records all the partitions, all the data files, etc.
+
+3. Each commit records various metadata and runtime metrics for the data currently written, such as:
+```json
+{
+"partitionToWriteStats":{
+"20210623/0/20210825":[
+{
+"fileId":"4ae31921-eedd-4c56-8218-bb47849397a4-0",
+
"path":"20210623/0/20210825/4ae31921-eedd-4c56-8218-bb47849397a4-0_0-27-2006_20220818134233973.parquet",
+"prevCommit":"null",
+"numWrites":123352,
+"numDeletes":0,
+"numUpdateWrites":0,
+"numInserts":123352,
+"totalWriteBytes":4675371,
+"totalWriteErrors":0,
+"tempPath":null,
+"partitionPath":"20210623/0/20210825",
+"totalLogRecords":0,
+"totalLogFilesCompacted":0,
+"totalLogSizeCompacted":0,
+"totalUpdatedRecordsCompacted":0,
+"totalLogBlocks":0,
+"totalCorruptLogBlock":0,
+"totalRollbackBlocks":0,
+"fileSizeInBytes":4675371,
+"minEventTime":null,
+"maxEventTime":null
+}
+]
+},
+"compacted":false,
+"extraMetadata":{
+"schema":""
+},
+"operationType":"UPSERT",
+"totalRecordsDeleted":0,
+"totalLogFilesSize":0,
+"totalScanTime":0,
+"totalCreateTime":21051,
+"totalUpsertTime":0,
+"minAndMaxEventTime":{
+"Optional.empty":{
+"val":null,
+"present":false
+}
+},
+"writePartitionPaths":[
+"20210623/0/20210825"
+],
+"fileIdAndRelativePaths":{
+
"c144908e-ca7d-401f-be1c-613de98d96a3-0":"20210623/0/20210825/c144908e-ca7d-401f-be1c-613de98d96a3-0_3-33-2009_20220818134233973.parquet"
+},
+"totalLogRecordsCompacted":0,
+"totalLogFilesCompacted":0,
+"totalCompactedRecordsUpdated":0
+}
+```
+
+In order to expose Hudi table context more efficiently, this RFC proposes a Diagnostic Reporter Tool.
+This tool can be turned on as the final stage of an ingestion job after commit; it will collect common troubleshooting information, including engine runtime information (taking Spark as the example here), and generate a diagnostic report JSON file.

Review Comment:
   I think if users turn on this feature, Hudi is going to generate this report no matter whether the current commit is successful or not.
   
   This report will be created under `.hoodie/report/instant-time/HudiTableName + "_" + HudiVersion + "_" + appName + "_" + applicationId + "_" + applicationAttemptId + "_" + isLocal + format`
   
   Of course, collecting some contents of the report JSON is pretty time-consuming, such as the zip file or `Data information`; these could default to false.
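   
   Assembled as code, the naming scheme above would look roughly like this (illustrative only; the parameter names are taken from the comment, not from an actual implementation):
   ```java
   // Illustrative sketch of the proposed report path layout.
   static String reportPath(String basePath, String instantTime, String tableName,
                            String hudiVersion, String appName, String applicationId,
                            String applicationAttemptId, boolean isLocal, String format) {
     String fileName = String.join("_", tableName, hudiVersion, appName,
         applicationId, applicationAttemptId, String.valueOf(isLocal)) + format;
     return basePath + "/.hoodie/report/" + instantTime + "/" + fileName;
   }
   ```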



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


[hudi] branch master updated: [HUDI-4837] Stop sleeping where it is not necessary after the success (#6270)

2022-09-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 35d03e9a1b [HUDI-4837] Stop sleeping where it is not necessary after 
the success (#6270)
35d03e9a1b is described below

commit 35d03e9a1bede05d10f10c6e4b57ffe66ca7f330
Author: Volodymyr Burenin 
AuthorDate: Thu Sep 15 00:11:34 2022 -0500

[HUDI-4837] Stop sleeping where it is not necessary after the success 
(#6270)

Co-authored-by: Volodymyr Burenin 
Co-authored-by: Y Ethan Guo 
---
 .../org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java | 4 +++-
 .../java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java   | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
index 1e78610ced..81f06a0f9f 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
@@ -315,7 +315,9 @@ public class KafkaOffsetGen {
       // TODO(HUDI-4625) cleanup, introduce retrying client
       partitionInfos = consumer.partitionsFor(topicName);
       try {
-        TimeUnit.SECONDS.sleep(10);
+        if (partitionInfos == null) {
+          TimeUnit.SECONDS.sleep(10);
+        }
       } catch (InterruptedException e) {
         LOG.error("Sleep failed while fetching partitions");
       }
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
index 2b99f19b27..1147736143 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
@@ -249,7 +249,7 @@ abstract class BaseTestKafkaSource extends SparkClientFunctionalTestHarness {
     // create a topic with very short retention
     final String topic = TEST_TOPIC_PREFIX + "testFailOnDataLoss";
     Properties topicConfig = new Properties();
-    topicConfig.setProperty("retention.ms", "1");
+    topicConfig.setProperty("retention.ms", "8000");
     testUtils.createTopic(topic, 1, topicConfig);
 
     TypedProperties failOnDataLossProps = createPropsForKafkaSource(topic, null, "earliest");
@@ -261,7 +261,7 @@ abstract class BaseTestKafkaSource extends SparkClientFunctionalTestHarness {
     assertEquals(2, fetch1.getBatch().get().count());
 
     // wait for the checkpoint to expire
-    Thread.sleep(10001);
+    Thread.sleep(3);
     Throwable t = assertThrows(HoodieDeltaStreamerException.class, () -> {
       kafkaSource.fetchNewDataInAvroFormat(Option.of(fetch1.getCheckpointForNextBatch()), Long.MAX_VALUE);
    });
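
For context, the behavior after this patch amounts to backing off only when the broker returned no partition metadata. A simplified sketch follows (not the actual KafkaOffsetGen code; the real method's loop structure and field names differ):

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;

class PartitionFetchSketch {
  // Sleep between attempts only when partitionsFor() yields nothing,
  // instead of sleeping unconditionally after a successful fetch.
  static List<PartitionInfo> fetchPartitionInfos(KafkaConsumer<?, ?> consumer,
                                                 String topicName, long timeoutMs) {
    List<PartitionInfo> partitionInfos = null;
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (partitionInfos == null && System.currentTimeMillis() < deadline) {
      partitionInfos = consumer.partitionsFor(topicName);
      if (partitionInfos == null) {
        try {
          TimeUnit.SECONDS.sleep(10); // back off only on failure
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }
    return partitionInfos;
  }
}
```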



[GitHub] [hudi] yihua merged pull request #6270: [HUDI-4837] Stop sleeping where it is not necessary after the success

2022-09-14 Thread GitBox


yihua merged PR #6270:
URL: https://github.com/apache/hudi/pull/6270


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on pull request #6270: [HUDI-4837] Stop sleeping where it is not necessary after the success

2022-09-14 Thread GitBox


yihua commented on PR #6270:
URL: https://github.com/apache/hudi/pull/6270#issuecomment-1247585982

   CI is green.
   https://user-images.githubusercontent.com/2497195/190318997-d8438ceb-c0c9-457b-9fbf-19c26fb01e0a.png
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6677: [HUDI-4294][Stacked on 4293] Introduce build action to actually perform index data generation

2022-09-14 Thread GitBox


hudi-bot commented on PR #6677:
URL: https://github.com/apache/hudi/pull/6677#issuecomment-1247572632

   
   ## CI report:
   
   * 59e2196397ef68d75697a35b1b91e661ef9d3aa4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11372)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table

2022-09-14 Thread GitBox


hudi-bot commented on PR #4676:
URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247571390

   
   ## CI report:
   
   * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN
   * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN
   * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN
   * 94c9b7bdfd83828a9552dfeab418403a4594c649 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11368)
 
   * e3914eb7b48fc4c5e3bd6f0fd00888ac6da8fa21 UNKNOWN
   * 24747bb7e1f23d6db70672cab3795cb131ce8dcb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11371)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6677: [HUDI-4294][Stacked on 4293] Introduce build action to actually perform index data generation

2022-09-14 Thread GitBox


hudi-bot commented on PR #6677:
URL: https://github.com/apache/hudi/pull/6677#issuecomment-1247569658

   
   ## CI report:
   
   * 59e2196397ef68d75697a35b1b91e661ef9d3aa4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon

2022-09-14 Thread GitBox


hudi-bot commented on PR #6668:
URL: https://github.com/apache/hudi/pull/6668#issuecomment-1247569608

   
   ## CI report:
   
   * 2480ce4c97130601e2727ab82851c428ea7a84bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11345)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table

2022-09-14 Thread GitBox


hudi-bot commented on PR #4676:
URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247568324

   
   ## CI report:
   
   * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN
   * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN
   * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN
   * 94c9b7bdfd83828a9552dfeab418403a4594c649 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11368)
 
   * e3914eb7b48fc4c5e3bd6f0fd00888ac6da8fa21 UNKNOWN
   * 24747bb7e1f23d6db70672cab3795cb131ce8dcb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (8c296e0356 -> 1f2e72e06e)

2022-09-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 8c296e0356 [HUDI-4691] Cleaning up duplicated classes in Spark 3.3 
module (#6550)
 add 1f2e72e06e [HUDI-4752] Add dedup support for MOR table in cli (#6608)

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/hudi/cli/DedupeSparkJob.scala |   4 +-
 .../hudi/cli/integ/ITTestRepairsCommand.java   | 117 ++---
 2 files changed, 82 insertions(+), 39 deletions(-)



[GitHub] [hudi] yihua merged pull request #6608: [HUDI-4752] Add dedup support for MOR table in cli

2022-09-14 Thread GitBox


yihua merged PR #6608:
URL: https://github.com/apache/hudi/pull/6608


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon

2022-09-14 Thread GitBox


hudi-bot commented on PR #6668:
URL: https://github.com/apache/hudi/pull/6668#issuecomment-124750

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5933: [HUDI-4293] Implement Create/Drop/Show/Refresh Index Command for Secondary Index

2022-09-14 Thread GitBox


hudi-bot commented on PR #5933:
URL: https://github.com/apache/hudi/pull/5933#issuecomment-1247565916

   
   ## CI report:
   
   * 3d5de064b208083601499666d925df2ec151afd9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11353)
 
   * 65359879df848d75b6693f4c313dc9453d635edd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11370)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table

2022-09-14 Thread GitBox


hudi-bot commented on PR #4676:
URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247565279

   
   ## CI report:
   
   * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN
   * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN
   * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN
   * 94c9b7bdfd83828a9552dfeab418403a4594c649 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11368)
 
   * e3914eb7b48fc4c5e3bd6f0fd00888ac6da8fa21 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Zouxxyy commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon

2022-09-14 Thread GitBox


Zouxxyy commented on PR #6668:
URL: https://github.com/apache/hudi/pull/6668#issuecomment-1247552005

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Zouxxyy commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon

2022-09-14 Thread GitBox


Zouxxyy commented on PR #6668:
URL: https://github.com/apache/hudi/pull/6668#issuecomment-1247551903

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4294) Introduce build action to actually perform index data generation

2022-09-14 Thread shibei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shibei updated HUDI-4294:
-
Status: In Progress  (was: Open)

> Introduce build action to actually perform index data generation
> 
>
> Key: HUDI-4294
> URL: https://issues.apache.org/jira/browse/HUDI-4294
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: shibei
>Assignee: shibei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> In this issue, we introduce a new action type called build to actually perform index data generation. This action contains two steps, as the clustering action does:
>  # Generate an action plan to clarify which files and which indexes need to be built;
>  # Execute the index build according to the action plan generated by step one;
>  
> Call procedures will be implemented as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4294) Introduce build action to actually perform index data generation

2022-09-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4294:
-
Labels: pull-request-available  (was: )

> Introduce build action to actually perform index data generation
> 
>
> Key: HUDI-4294
> URL: https://issues.apache.org/jira/browse/HUDI-4294
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: shibei
>Assignee: shibei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> In this issue, we introduce a new action type called build to actually perform index data generation. This action contains two steps, as the clustering action does:
>  # Generate an action plan to clarify which files and which indexes need to be built;
>  # Execute the index build according to the action plan generated by step one;
>  
> Call procedures will be implemented as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] huberylee opened a new pull request, #6677: [HUDI-4294][Stacked on 4293] Introduce build action to actually perform index data generation

2022-09-14 Thread GitBox


huberylee opened a new pull request, #6677:
URL: https://github.com/apache/hudi/pull/6677

   ### Change Logs
   
   Introducing a new action type called build to actually perform index data generation. This action contains two steps, as the clustering action does:
   - Generate an action plan to clarify which files and which indexes need to be built;
   - Execute the index build according to the action plan generated by step one.

   Call procedures will be implemented as well to show or run the build action.
   
   Classes in the package ``org.apache.hudi.secondary.index.lucene.hadoop`` were copied from the package ``org.apache.solr.hdfs.store`` in the Apache Solr project.
   
   
   ### Impact
   
   Users can use ``call show_build(table => '$table'[, path => $path], limit => $limit, show_involved_partition => [true/false])`` to list build commits, and use ``call run_build(table => '$table'[, path => $path], predicate => '$predicate', show_involved_partition => [true/false])`` to trigger a new build action if conditions are satisfied.
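   
   For example, from a Spark session the procedures above could be invoked as follows (a sketch: the table name and predicate are placeholders, while the procedure signatures come from this description):
   ```java
   import org.apache.spark.sql.SparkSession;
   
   public class BuildActionExample {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("hudi-build-action-demo")
           .getOrCreate();
       // List existing build commits for the table.
       spark.sql("call show_build(table => 'hudi_table', limit => 10, "
           + "show_involved_partition => true)").show();
       // Trigger a new build action for files matching the predicate.
       spark.sql("call run_build(table => 'hudi_table', predicate => 'dt = 20220914', "
           + "show_involved_partition => false)").show();
     }
   }
   ```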
   
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhengyuan-cn commented on issue #6596: [SUPPORT] with Impala 4.0 Records lost

2022-09-14 Thread GitBox


zhengyuan-cn commented on issue #6596:
URL: https://github.com/apache/hudi/issues/6596#issuecomment-1247545755

   > > I replaced impala hudi dependency jar (hudi-common-0.5.0-incubating.jar, 
hudi-hadoop-mr-0.5.0-incubating.jar) with (hudi-common-0.12.0.jar, 
hudi-hadoop-mr-0.12.0.jar),issues still.
   > 
   > > ENV: impala4.0+hive3.1.1 with hudi 0.11 is correct.
   > 
   > @zhengyuan-cn do you mean you replaced `hudi-*-0.5.0` with `hudi-*-0.11.0` 
and it worked?
   
   No. In the env (Impala 4.0 + Hive 3.1.1 with Hudi 0.11) it worked, and the result is correct.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #5933: [HUDI-4293] Implement Create/Drop/Show/Refresh Index Command for Secondary Index

2022-09-14 Thread GitBox


hudi-bot commented on PR #5933:
URL: https://github.com/apache/hudi/pull/5933#issuecomment-1247537120

   
   ## CI report:
   
   * 3d5de064b208083601499666d925df2ec151afd9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11353)
 
   * 65359879df848d75b6693f4c313dc9453d635edd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon

2022-09-14 Thread GitBox


hudi-bot commented on PR #6668:
URL: https://github.com/apache/hudi/pull/6668#issuecomment-1247534796

   
   ## CI report:
   
   * 2480ce4c97130601e2727ab82851c428ea7a84bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11345)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon

2022-09-14 Thread GitBox


hudi-bot commented on PR #6668:
URL: https://github.com/apache/hudi/pull/6668#issuecomment-1247532061

   
   ## CI report:
   
   * 2480ce4c97130601e2727ab82851c428ea7a84bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11345)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table

2022-09-14 Thread GitBox


hudi-bot commented on PR #4676:
URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247530990

   
   ## CI report:
   
   * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN
   * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN
   * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN
   * e5af3c2bc8310bf3d41560fed377bfdd078505be Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11362)
 
   * 94c9b7bdfd83828a9552dfeab418403a4594c649 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11368)
 
   * e3914eb7b48fc4c5e3bd6f0fd00888ac6da8fa21 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Zouxxyy commented on pull request #6668: [HUDI-4839] Upgrade rocksdbjni for compatibility with Apple Silicon

2022-09-14 Thread GitBox


Zouxxyy commented on PR #6668:
URL: https://github.com/apache/hudi/pull/6668#issuecomment-1247530533

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] fengjian428 commented on pull request #4676: [HUDI-3304] support partial update on mor table

2022-09-14 Thread GitBox


fengjian428 commented on PR #4676:
URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247530055

   > Thanks for the work, I have reviewed it and written a patch: 
[3304.patch.zip](https://github.com/apache/hudi/files/9571179/3304.patch.zip)
   
   Thanks, @danny0405, applied your patch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-09-14 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1247526048

   
   ## CI report:
   
   * 288d166c49602a4593b1e97763a467811903737d UNKNOWN
   * 7d716951c917fb1e173da31798736adc172800c4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11369)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] TJX2014 commented on pull request #6634: [HUDI-4813] Fix infer keygen not work in sparksql side issue

2022-09-14 Thread GitBox


TJX2014 commented on PR #6634:
URL: https://github.com/apache/hudi/pull/6634#issuecomment-1247515534

   @danny0405 Please help review; it seems OK in the last CI test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] TJX2014 commented on pull request #6634: [HUDI-4813] Fix infer keygen not work in sparksql side issue

2022-09-14 Thread GitBox


TJX2014 commented on PR #6634:
URL: https://github.com/apache/hudi/pull/6634#issuecomment-1247514620

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] guanlisheng commented on issue #6659: [SUPPORT] query hudi table with Spark SQL on Hive return empty result

2022-09-14 Thread GitBox


guanlisheng commented on issue #6659:
URL: https://github.com/apache/hudi/issues/6659#issuecomment-1247494876

   Hey @xushiyan, 
   it is 0.9.0.
   After further debugging with the 0.9.0 bundle, I suspect it is related to 
#6007. 
   Hence I am waiting for 0.11.0 on EMR 5.x and will also try it on the 
non-partitioned table. 
   
   The workaround for now is to unset the `spark.sql.sources.provider` table 
property with Hive SQL.
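   For illustration, the statement would look roughly like this (table name is 
hypothetical; run it from the Hive CLI):
   ```sql
   -- Hypothetical table name. Removing the provider property makes Spark SQL
   -- fall back to reading the table as a plain Hive table.
   ALTER TABLE my_db.my_hudi_table UNSET TBLPROPERTIES ('spark.sql.sources.provider');
   ```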


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] guanlisheng closed issue #6659: [SUPPORT] query hudi table with Spark SQL on Hive return empty result

2022-09-14 Thread GitBox


guanlisheng closed issue #6659: [SUPPORT] query hudi table with Spark SQL on 
Hive return empty result
URL: https://github.com/apache/hudi/issues/6659


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #4676: [HUDI-3304] support partial update on mor table

2022-09-14 Thread GitBox


danny0405 commented on PR #4676:
URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247491717

   Thanks for the work, I have reviewed it and written a patch:
   [3304.patch.zip](https://github.com/apache/hudi/files/9571179/3304.patch.zip)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-09-14 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1247488723

   
   ## CI report:
   
   * 8915ca346137d319276026dd7aa396a9c7bd2b29 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11128)
 
   * 288d166c49602a4593b1e97763a467811903737d UNKNOWN
   * 7d716951c917fb1e173da31798736adc172800c4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11369)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-09-14 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1247485971

   
   ## CI report:
   
   * 8915ca346137d319276026dd7aa396a9c7bd2b29 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11128)
 
   * 288d166c49602a4593b1e97763a467811903737d UNKNOWN
   * 7d716951c917fb1e173da31798736adc172800c4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-09-14 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1247483231

   
   ## CI report:
   
   * 8915ca346137d319276026dd7aa396a9c7bd2b29 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11128)
 
   * 288d166c49602a4593b1e97763a467811903737d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6673: [HUDI-4785] Fix partition discovery in bootstrap operation

2022-09-14 Thread GitBox


hudi-bot commented on PR #6673:
URL: https://github.com/apache/hudi/pull/6673#issuecomment-1247480519

   
   ## CI report:
   
   * d549379aa13fdd32255ab4b47b184ae98014d44f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11366)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6672: [HUDI-4757] Create pyspark examples

2022-09-14 Thread GitBox


hudi-bot commented on PR #6672:
URL: https://github.com/apache/hudi/pull/6672#issuecomment-1247480503

   
   ## CI report:
   
   * 25fad9af64012f22e0bb00d1a454026de0902f92 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11367)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6608: [HUDI-4752] Add dedup support for MOR table in cli

2022-09-14 Thread GitBox


hudi-bot commented on PR #6608:
URL: https://github.com/apache/hudi/pull/6608#issuecomment-1247480389

   
   ## CI report:
   
   * 74ab4e17c851a4d7a910269b5e36e52880321ba8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11348)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-09-14 Thread GitBox


alexeykudinkin commented on code in PR #6358:
URL: https://github.com/apache/hudi/pull/6358#discussion_r971445275


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -169,23 +179,107 @@ object HoodieSparkSqlWriter {
   }
 
   val commitActionType = CommitUtils.getCommitActionType(operation, 
tableConfig.getTableType)
-  val dropPartitionColumns = 
hoodieConfig.getBoolean(DataSourceWriteOptions.DROP_PARTITION_COLUMNS)
+
+  // Register Avro classes ([[Schema]], [[GenericData]]) w/ Kryo
+  sparkContext.getConf.registerKryoClasses(
+Array(classOf[org.apache.avro.generic.GenericData],
+  classOf[org.apache.avro.Schema]))
+
+  val (structName, nameSpace) = 
AvroConversionUtils.getAvroRecordNameAndNamespace(tblName)
+  val reconcileSchema = 
parameters(DataSourceWriteOptions.RECONCILE_SCHEMA.key()).toBoolean
+
+  val schemaEvolutionEnabled = 
parameters.getOrDefault(DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.key(), 
"false").toBoolean
+  var internalSchemaOpt = getLatestTableInternalSchema(fs, basePath, 
sparkContext)
+
+  val sourceSchema = 
AvroConversionUtils.convertStructTypeToAvroSchema(df.schema, structName, 
nameSpace)
+  val latestTableSchemaOpt = getLatestTableSchema(spark, basePath, 
tableIdentifier, sparkContext.hadoopConfiguration)
+
+  val writerSchema: Schema = latestTableSchemaOpt match {
+// In case table schema is empty we're just going to use the source 
schema as a
+// writer's schema. No additional handling is required
+case None => sourceSchema
+// Otherwise, we need to make sure we reconcile incoming and latest 
table schemas
+case Some(latestTableSchema) =>
+  if (reconcileSchema) {
+// In case we need to reconcile the schema and schema evolution is 
enabled,
+// we will force-apply schema evolution to the writer's schema
+if (schemaEvolutionEnabled && internalSchemaOpt.isEmpty) {
+  internalSchemaOpt = 
Some(AvroInternalSchemaConverter.convert(sourceSchema))
+}
+
+if (internalSchemaOpt.isDefined) {
+  // Apply schema evolution, by auto-merging write schema and read 
schema
+  val mergedInternalSchema = 
AvroSchemaEvolutionUtils.reconcileSchema(sourceSchema, internalSchemaOpt.get)
+  AvroInternalSchemaConverter.convert(mergedInternalSchema, 
latestTableSchema.getName)
+} else if (TableSchemaResolver.isSchemaCompatible(sourceSchema, 
latestTableSchema)) {
+  // In case schema reconciliation is enabled and source and 
latest table schemas
+  // are compatible (as defined by 
[[TableSchemaResolver#isSchemaCompatible]]), then we
+  // will rebase incoming batch onto the table's latest schema 
(ie, reconcile them)
+  //
+  // NOTE: Since we'll be converting incoming batch from 
[[sourceSchema]] into [[latestTableSchema]]
+  //   we're validating in that order (where [[sourceSchema]] 
is treated as a reader's schema,
+  //   and [[latestTableSchema]] is treated as a writer's 
schema)
+  latestTableSchema
+} else {
+  log.error(
+s"""
+   |Failed to reconcile incoming batch schema with the table's 
one.
+   |Incoming schema ${sourceSchema.toString(true)}
+
+   |Table's schema ${latestTableSchema.toString(true)}
+
+   |""".stripMargin)
+  throw new SchemaCompatibilityException("Failed to reconcile 
incoming schema with the table's one")
+}
+  } else {
+// Before validating whether schemas are compatible, we need to 
"canonicalize" source's schema
+// relative to the table's one, by doing a (minor) reconciliation 
of the nullability constraints:
+// for ex, if in incoming schema column A is designated as 
non-null, but it's designated as nullable
+// in the table's one we want to proceed w/ such operation, simply 
relaxing such constraint in the
+// source schema.
+val canonicalizedSourceSchema = 
AvroSchemaEvolutionUtils.canonicalizeColumnNullability(sourceSchema, 
latestTableSchema)
+// In case reconciliation is disabled, we have to validate that 
the source's schema
+// is compatible w/ the table's latest schema, such that we're 
able to read existing table's
+// records using [[sourceSchema]].
+if (TableSchemaResolver.isSchemaCompatible(latestTableSchema, 
canonicalizedSourceSchema)) {
+  canonicalizedSourceSchema
+} else {
+  log.error(
+s"""
+   |Incoming batch schema is not compatible with the table's 
one.
+   |Incoming schema ${canonicalizedSourceSc

[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-09-14 Thread GitBox


alexeykudinkin commented on code in PR #6358:
URL: https://github.com/apache/hudi/pull/6358#discussion_r971444542


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -169,23 +179,107 @@ object HoodieSparkSqlWriter {
   }
 
   val commitActionType = CommitUtils.getCommitActionType(operation, 
tableConfig.getTableType)
-  val dropPartitionColumns = 
hoodieConfig.getBoolean(DataSourceWriteOptions.DROP_PARTITION_COLUMNS)
+
+  // Register Avro classes ([[Schema]], [[GenericData]]) w/ Kryo
+  sparkContext.getConf.registerKryoClasses(
+Array(classOf[org.apache.avro.generic.GenericData],
+  classOf[org.apache.avro.Schema]))

Review Comment:
   We always had that; this code has just been moved from below to make sure we 
handle the schema in the same way for bulk-insert (w/ row-writing) as we do for 
any other operation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-09-14 Thread GitBox


alexeykudinkin commented on code in PR #6358:
URL: https://github.com/apache/hudi/pull/6358#discussion_r971444217


##
hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java:
##
@@ -295,91 +295,19 @@ private MessageType convertAvroSchemaToParquet(Schema 
schema) {
   }
 
   /**
-   * HUDI specific validation of schema evolution. Ensures that a newer schema 
can be used for the dataset by
-   * checking if the data written using the old schema can be read using the 
new schema.
+   * Establishes whether {@code prevSchema} is compatible w/ {@code 
newSchema}, as
+   * defined by Avro's {@link SchemaCompatibility}
*
-   * HUDI requires a Schema to be specified in HoodieWriteConfig and is used 
by the HoodieWriteClient to
-   * create the records. The schema is also saved in the data files (parquet 
format) and log files (avro format).
-   * Since a schema is required each time new data is ingested into a HUDI 
dataset, schema can be evolved over time.
-   *
-   * New Schema is compatible only if:
-   * A1. There is no change in schema
-   * A2. A field has been added and it has a default value specified
-   *
-   * New Schema is incompatible if:
-   * B1. A field has been deleted
-   * B2. A field has been renamed (treated as delete + add)
-   * B3. A field's type has changed to be incompatible with the older type
-   *
-   * Issue with org.apache.avro.SchemaCompatibility:
-   *  org.apache.avro.SchemaCompatibility checks schema compatibility between 
a writer schema (which originally wrote
-   *  the AVRO record) and a readerSchema (with which we are reading the 
record). It ONLY guarantees that that each
-   *  field in the reader record can be populated from the writer record. 
Hence, if the reader schema is missing a
-   *  field, it is still compatible with the writer schema.
-   *
-   *  In other words, org.apache.avro.SchemaCompatibility was written to 
guarantee that we can read the data written
-   *  earlier. It does not guarantee schema evolution for HUDI (B1 above).
-   *
-   * Implementation: This function implements specific HUDI specific checks 
(listed below) and defers the remaining
-   * checks to the org.apache.avro.SchemaCompatibility code.
-   *
-   * Checks:
-   * C1. If there is no change in schema: success
-   * C2. If a field has been deleted in new schema: failure
-   * C3. If a field has been added in new schema: it should have default value 
specified
-   * C4. If a field has been renamed(treated as delete + add): failure
-   * C5. If a field type has changed: failure
-   *
-   * @param oldSchema Older schema to check.
-   * @param newSchema Newer schema to check.
-   * @return True if the schema validation is successful
-   *
-   * TODO revisit this method: it's implemented incorrectly as it might be 
applying different criteria
-   *  to top-level record and nested record (for ex, if that nested record 
is contained w/in an array)
+   * @param prevSchema previous instance of the schema
+   * @param newSchema new instance of the schema
*/
-  public static boolean isSchemaCompatible(Schema oldSchema, Schema newSchema) 
{
-if (oldSchema.getType() == newSchema.getType() && newSchema.getType() == 
Schema.Type.RECORD) {
-  // record names must match:
-  if (!SchemaCompatibility.schemaNameEquals(newSchema, oldSchema)) {
-return false;
-  }
-
-  // Check that each field in the oldSchema can populated the newSchema
-  for (final Field oldSchemaField : oldSchema.getFields()) {
-final Field newSchemaField = 
SchemaCompatibility.lookupWriterField(newSchema, oldSchemaField);
-if (newSchemaField == null) {
-  // C4 or C2: newSchema does not correspond to any field in the 
oldSchema
-  return false;
-} else {
-  if (!isSchemaCompatible(oldSchemaField.schema(), 
newSchemaField.schema())) {
-// C5: The fields do not have a compatible type
-return false;
-  }
-}
-  }
-
-  // Check that new fields added in newSchema have default values as they 
will not be
-  // present in oldSchema and hence cannot be populated on reading records 
from existing data.
-  for (final Field newSchemaField : newSchema.getFields()) {
-final Field oldSchemaField = 
SchemaCompatibility.lookupWriterField(oldSchema, newSchemaField);
-if (oldSchemaField == null) {
-  if (newSchemaField.defaultVal() == null) {
-// C3: newly added field in newSchema does not have a default value
-return false;
-  }
-}
-  }
-
-  // All fields in the newSchema record can be populated from the 
oldSchema record
-  return true;
-} else {
-  // Use the checks implemented by Avro
-  // newSchema is the schema which will be used to read the records 
written earlier using oldSchema. Hence, in the
-  // check below, use newSchema as the reader schema and oldSchema as the 
wr

[GitHub] [hudi] xushiyan commented on issue #6610: [QUESTION] Faster approach for backfilling older data without stopping realtime pipelines

2022-09-14 Thread GitBox


xushiyan commented on issue #6610:
URL: https://github.com/apache/hudi/issues/6610#issuecomment-1247472941

   Is this a Spark Streaming job you're running? Does it scale accordingly 
when backfill traffic spikes? The OOM also hints that you may need to tune 
Spark configs properly, like executor memory and 
`spark.memory.storageFraction`, to give more execution memory.
   It looks like the order of records does not matter here, since you pump them 
into the same topic. Why not start a batch job just for backfill? That's how 
people usually run backfill jobs.
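   As a rough sketch (values are illustrative, not a recommendation), shrinking 
the storage share of Spark's unified memory pool frees up execution memory:
   ```scala
   import org.apache.spark.SparkConf

   // Illustrative values only; tune against the actual workload.
   val conf = new SparkConf()
     .set("spark.executor.memory", "8g")         // total executor heap
     .set("spark.memory.fraction", "0.8")        // share of heap for the unified pool
     .set("spark.memory.storageFraction", "0.3") // smaller storage share leaves more for execution
   ```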


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4453) Support partition pruning for tables Bootstrapped from Source Hive Style partitioned tables

2022-09-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4453:
-
Labels: pull-request-available  (was: )

> Support partition pruning for tables Bootstrapped from Source Hive Style 
> partitioned tables
> ---
>
> Key: HUDI-4453
> URL: https://issues.apache.org/jira/browse/HUDI-4453
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Udit Mehrotra
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> As of now the *Bootstrap* feature determines the source schema by reading it 
> from the source parquet files => 
> [https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/bootstrap/ParquetBootstrapMetadataHandler.java#L61]
> This does not consider parquet tables which might be Hive style partitioned. 
> Thus, partition columns would be missing from the source schema and not 
> written to the target Hudi table either. Also, because of this, partition 
> pruning does not work, as we are unable to prune out source partitions. We 
> should improve this logic to determine the partition schema correctly from 
> the partition paths in the case of Hive style partitioned tables, and write 
> the partition column values correctly in the target Hudi table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua opened a new pull request, #6676: [HUDI-4453] Fix schema to include partition columns in bootstrap operation

2022-09-14 Thread GitBox


yihua opened a new pull request, #6676:
URL: https://github.com/apache/hudi/pull/6676

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan closed issue #6579: [SUPPORT] How to participate in HUDI code contribution

2022-09-14 Thread GitBox


xushiyan closed issue #6579: [SUPPORT] How to participate in HUDI code 
contribution
URL: https://github.com/apache/hudi/issues/6579


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on issue #6579: [SUPPORT] How to participate in HUDI code contribution

2022-09-14 Thread GitBox


xushiyan commented on issue #6579:
URL: https://github.com/apache/hudi/issues/6579#issuecomment-1247465847

   hi @azhsmesos thanks for your interest in contributing! please check out 
https://hudi.apache.org/docs/quick-start-guide for quick start examples (both 
spark and flink) and many more guides under 
https://hudi.apache.org/docs/overview
   There is a `new-to-hudi` label you can [filter on from 
jira](https://issues.apache.org/jira/browse/HUDI-4752?jql=project%20%3D%20HUDI%20and%20labels%20%3D%20new-to-hudi%20and%20statusCategory%20%20!%3D%20done).
 It's not fully up-to-date, as we have not deliberately gone through all 
issues to add this label, but it can be somewhere to start. I'd also 
suggest going through https://hudi.apache.org/contribute/how-to-contribute and 
other related pages. Please provide feedback if you have further questions on 
these guides. cc @bhasudha 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on issue #6596: [SUPPORT] with Impala 4.0 Records lost

2022-09-14 Thread GitBox


xushiyan commented on issue #6596:
URL: https://github.com/apache/hudi/issues/6596#issuecomment-1247459150

   > I replaced the impala hudi dependency jars (hudi-common-0.5.0-incubating.jar, 
hudi-hadoop-mr-0.5.0-incubating.jar) with (hudi-common-0.12.0.jar, 
hudi-hadoop-mr-0.12.0.jar), the issues persist.
   
   > ENV: impala4.0+hive3.1.1 with hudi 0.11 is correct.
   
   @zhengyuan-cn do you mean you replaced `hudi-*-0.5.0` with `hudi-*-0.11.0` 
and it worked?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on issue #6618: Caused by: org.apache.http.NoHttpResponseException: xxxxxx:34812 failed to respond[SUPPORT]

2022-09-14 Thread GitBox


xushiyan commented on issue #6618:
URL: https://github.com/apache/hudi/issues/6618#issuecomment-1247453345

   @Aload can you verify whether the patch is included in your version of Hudi, 
and whether you are still having the problem?
   
   > I have encountered this problem, this PR may solve your problem: #6393
   
   To help diagnose, we also need more info to reproduce it, like configs and 
a code snippet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on issue #6626: [SUPPORT] HUDI merge into via spark sql not working

2022-09-14 Thread GitBox


xushiyan commented on issue #6626:
URL: https://github.com/apache/hudi/issues/6626#issuecomment-1247449888

   @arunb2w noticed that you're on Hudi 0.10. Would you also verify whether 
0.12 has the same behavior?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on issue #6644: Hudi Multi Writer DynamoDBBasedLocking issue

2022-09-14 Thread GitBox


xushiyan commented on issue #6644:
URL: https://github.com/apache/hudi/issues/6644#issuecomment-1247444513

   > Is it mandatory to set AWS_ACCESS_KEY,AWS_SECRET_KEY ?
   
   No, you should not need to. In an AWS environment you just rely on whatever 
roles allow your service to access another service. Please raise a support case 
with AWS and get help to configure the roles properly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Xiaohan-Shen closed issue #6653: [SUPPORT] Hudi table COW taking up significant space for a small table

2022-09-14 Thread GitBox


Xiaohan-Shen closed issue #6653: [SUPPORT] Hudi table COW taking up significant 
space for a small table
URL: https://github.com/apache/hudi/issues/6653


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Xiaohan-Shen commented on issue #6653: [SUPPORT] Hudi table COW taking up significant space for a small table

2022-09-14 Thread GitBox


Xiaohan-Shen commented on issue #6653:
URL: https://github.com/apache/hudi/issues/6653#issuecomment-1247442099

   I just figured out the problem: I used the primary key field for 
partitioning, so it was creating one partition for every row. My bad. 
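   For anyone who lands here with the same issue, a minimal sketch of the 
distinction (column names, `df` and `basePath` are hypothetical), keeping the 
record key unique and the partition path coarse:
   ```scala
   df.write.format("hudi")
     .option("hoodie.table.name", "my_table")
     .option("hoodie.datasource.write.recordkey.field", "id")             // unique per record
     .option("hoodie.datasource.write.partitionpath.field", "event_date") // few distinct values
     .mode("append")
     .save(basePath)
   ```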


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on issue #6653: [SUPPORT] Hudi table COW taking up significant space for a small table

2022-09-14 Thread GitBox


xushiyan commented on issue #6653:
URL: https://github.com/apache/hudi/issues/6653#issuecomment-1247438706

   Likely a lot of small files were created. @Xiaohan-Shen how many files were 
created in S3? And what do the file sizes look like?
   cc @zhangyue19921010 @yihua this can be a good data point for the diagnostic 
reporter to capture, i.e. what the file size distribution looks like, to help 
diagnose parquet size setting issues for example.
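   Relatedly, a hedged sketch of the base-file sizing knobs (values are 
illustrative; `df` and `basePath` assumed from context):
   ```scala
   df.write.format("hudi")
     .option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString)    // target max base file size
     .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString) // below this, bin-pack into existing files
     .mode("append")
     .save(basePath)
   ```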


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on issue #6655: [SUPPORT] tryComposeIndexFilterExpr in dataskip util could support InSet expression of spark?

2022-09-14 Thread GitBox


xushiyan commented on issue #6655:
URL: https://github.com/apache/hudi/issues/6655#issuecomment-1247434901

   @alexeykudinkin can you take a look pls?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-3780) improve drop partitions

2022-09-14 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-3780.
-
Resolution: Fixed

> improve drop partitions
> ---
>
> Key: HUDI-3780
> URL: https://issues.apache.org/jira/browse/HUDI-3780
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Forward Xu
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-14 Thread GitBox


boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r971405570


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##
@@ -273,6 +330,60 @@ private HoodieData> 
readRecordsForGroupBaseFiles(JavaSparkContex
 .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get the dataset of all records for the group. This includes all records 
from the file slice (applying updates from log files, if any).
+   */
+  private Dataset readRecordsForGroupAsRow(JavaSparkContext jsc,
+   HoodieClusteringGroup 
clusteringGroup,
+   String instantTime) {
+List clusteringOps = 
clusteringGroup.getSlices().stream()
+.map(ClusteringOperation::create).collect(Collectors.toList());
+boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> 
op.getDeltaFilePaths().size() > 0);
+SQLContext sqlContext = new SQLContext(jsc.sc());
+
+String[] baseFilePaths = clusteringOps
+.stream()
+.map(op -> {
+  ArrayList readPaths = new ArrayList<>();
+  if (op.getBootstrapFilePath() != null) {
+readPaths.add(op.getBootstrapFilePath());
+  }
+  if (op.getDataFilePath() != null) {
+readPaths.add(op.getDataFilePath());
+  }
+  return readPaths;
+})
+.flatMap(Collection::stream)
+.filter(path -> !path.isEmpty())
+.toArray(String[]::new);
+String[] deltaPaths = clusteringOps
+.stream()
+.filter(op -> !op.getDeltaFilePaths().isEmpty())
+.flatMap(op -> op.getDeltaFilePaths().stream())
+.toArray(String[]::new);
+
+Dataset inputRecords;
+if (hasLogFiles) {
+  String compactionFractor = 
Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))
+  .orElse("0.75");
+  String[] paths = CollectionUtils.combine(baseFilePaths, deltaPaths);
+  inputRecords = sqlContext.read()

Review Comment:
   Good idea! I'll give it a try.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-3953) Flink Hudi module should support low-level read and write APIs

2022-09-14 Thread Kenneth William Krugler (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605011#comment-17605011
 ] 

Kenneth William Krugler commented on HUDI-3953:
---

I was initially wondering why Hudi didn't have a regular Flink sink. But after 
having implemented code to write Pinot segments, I can see advantages to having 
control over partitioning, which isn't possible at the sink level.

> Flink Hudi module should support  low-level read and write APIs
> ---
>
> Key: HUDI-3953
> URL: https://issues.apache.org/jira/browse/HUDI-3953
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: yuemeng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Currently, the Flink Hudi module only supports SQL APIs. People who want to 
> use low-level APIs, such as for operating on Flink state or for other 
> purposes, don't have a friendly way to do so.
> Low-level APIs can be provided for users to write/read hoodie data.
> The API design and main changes will be:
>  # add sink and source API in Pipelines
>  # getSinkRuntimeProvider in HoodieTableSink call Pipelines.sink(...) to 
> return DataStreamSink
>  # getScanRuntimeProvider in HoodieTableSource call Pipelines.source() to 
> return DataStream
>  # move some common methods such as getInputFormat in util class
>  # low-level API such as read and write just call Pipelines.sink(...)  and 
> Pipelines.source()



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi

2022-09-14 Thread GitBox


alexeykudinkin commented on code in PR #6476:
URL: https://github.com/apache/hudi/pull/6476#discussion_r971364028


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java:
##
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.avro.SerializableRecord;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.cdc.HoodieCDCOperation;
+import org.apache.hudi.common.table.cdc.HoodieCDCUtils;
+import org.apache.hudi.common.table.log.AppendResult;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.block.HoodieCDCDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieUpsertException;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+public class HoodieCDCLogger implements 
Closeable {
+
+  private final String partitionPath;
+
+  private final String fileName;
+
+  private final String commitTime;
+
+  private final List keyFields;
+
+  private final int taskPartitionId;
+
+  private final boolean populateMetaFields;
+
+  // writer for cdc data
+  private final HoodieLogFormat.Writer cdcWriter;
+
+  private final boolean cdcEnabled;
+
+  private final String cdcSupplementalLoggingMode;
+
+  // the cdc data
+  private final Map cdcData;
+
+  private final Function rewriteRecordFunc;
+
+  // the count of records currently being written, used to generate the same 
seqno for the cdc data
+  private final AtomicLong writtenRecordCount = new AtomicLong(-1);
+
+  public HoodieCDCLogger(
+  String partitionPath,
+  String fileName,
+  String commitTime,
+  HoodieWriteConfig config,
+  List keyFields,
+  int taskPartitionId,
+  HoodieLogFormat.Writer cdcWriter,
+  long maxInMemorySizeInBytes,
+  Function rewriteRecordFunc) {
+try {
+  this.partitionPath = partitionPath;
+  this.fileName = fileName;
+  this.commitTime = commitTime;
+  this.keyFields = keyFields;
+  this.taskPartitionId = taskPartitionId;
+  this.populateMetaFields = config.populateMetaFields();
+  this.cdcWriter = cdcWriter;
+  this.rewriteRecordFunc = rewriteRecordFunc;
+
+  this.cdcEnabled = 
config.getBooleanOrDefault(HoodieTableConfig.CDC_ENABLED);
+  this.cdcSupplementalLoggingMode = 
config.getStringOrDefault(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE);
+  this.cdcData = new ExternalSpillableMap<>(
+  maxInMemorySizeInBytes,
+  config.getSpillableMapBasePath(),
+  new DefaultSizeEstimator<>(),
+  new DefaultSizeEstimator<>(),
+  config.getCommonConfig().getSpillableDiskMapType(),
+  config.getCommonConfig().isBitCaskDiskMapCompressionEnabled()
+  );
+} catch (IOException e) {
+  throw new HoodieUpsertException("Failed to initialize HoodieCDCLogger", 
e);
+}
+  }
+
+  public void put(HoodieRecord hoodieRecord, GenericRecord oldRecord, 
Option indexedRecord) {
+if (cdcEnabled) {
+  String recordKey;
+  if (oldRecord == null) {
+recordKey = hoodieRecord.getRecordKey();
+  } else {
+recordKey = StringUtils.join(

Review Comment:
   Please check my previous comment.

[GitHub] [hudi] xushiyan commented on issue #6659: [SUPPORT] query hudi table with Spark SQL on Hive return empty result

2022-09-14 Thread GitBox


xushiyan commented on issue #6659:
URL: https://github.com/apache/hudi/issues/6659#issuecomment-1247408240

   From what you listed, 
`org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4`
 is Hudi 0.5.3. Can you confirm which Hudi version you have the problem 
with?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi

2022-09-14 Thread GitBox


hudi-bot commented on PR #6476:
URL: https://github.com/apache/hudi/pull/6476#issuecomment-1247402607

   
   ## CI report:
   
   * 34088aeee92daffe28ef3a17c04bb8e000f233e7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11363)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi

2022-09-14 Thread GitBox


alexeykudinkin commented on code in PR #6476:
URL: https://github.com/apache/hudi/pull/6476#discussion_r971363419


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##
@@ -273,6 +283,33 @@ protected HoodieFileWriter createNewFileWriter(String 
instantTime, Path path, Ho
 return HoodieFileWriterFactory.getFileWriter(instantTime, path, 
hoodieTable, config, schema, taskContextSupplier);
   }
 
+  protected HoodieLogFormat.Writer createLogWriter(
+  Option fileSlice, String baseCommitTime) throws IOException {
+int logVersion = HoodieLogFile.LOGFILE_BASE_VERSION;
+long logFileSize = 0L;
+String logWriteToken = writeToken;
+if (fileSlice.isPresent()) {
+  Option latestLogFileOpt = 
fileSlice.get().getLatestLogFile();
+  if (latestLogFileOpt.isPresent()) {
+HoodieLogFile latestLogFile = latestLogFileOpt.get();
+logVersion = latestLogFile.getLogVersion();
+logFileSize = latestLogFile.getFileSize();
+logWriteToken = 
FSUtils.getWriteTokenFromLogPath(latestLogFile.getPath());
+  }
+}
+return HoodieLogFormat.newWriterBuilder()
+
.onParentPath(FSUtils.getPartitionPath(hoodieTable.getMetaClient().getBasePath(),
 partitionPath))
+.withFileId(fileId)
+.overBaseCommit(baseCommitTime)
+.withLogVersion(logVersion)
+.withFileSize(logFileSize)
+.withSizeThreshold(config.getLogFileMaxSize())
+.withFs(fs)
+.withRolloverLogWriteToken(writeToken)
+.withLogWriteToken(logWriteToken)
+.withFileExtension(HoodieLogFile.DELTA_EXTENSION).build();

Review Comment:
   So one of the pre-requisites of the CDC is: 
   
   - When we're issuing normal Data query (and not a CDC one), there should be 
**no performance impact** to it
   
   Moreover, we should clearly disambiguate the CDC infra from the Data infra 
w/o the need to even fetch the first block of the file (we can still use the 
same Log format, but we should definitely create separate naming scheme for CDC 
Log files to not mix these up w/ the Data Delta Log files)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi

2022-09-14 Thread GitBox


alexeykudinkin commented on code in PR #6476:
URL: https://github.com/apache/hudi/pull/6476#discussion_r971358810


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -399,9 +453,65 @@ protected void writeIncomingRecords() throws IOException {
 }
   }
 
+  protected SerializableRecord createCDCRecord(HoodieCDCOperation operation, 
String recordKey, String partitionPath,
+   GenericRecord oldRecord, 
GenericRecord newRecord) {
+GenericData.Record record;
+if 
(cdcSupplementalLoggingMode.equals(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE_WITH_BEFORE_AFTER))
 {
+  record = HoodieCDCUtils.cdcRecord(operation.getValue(), instantTime,
+  oldRecord, addCommitMetadata(newRecord, recordKey, partitionPath));
+} else if 
(cdcSupplementalLoggingMode.equals(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE_WITH_BEFORE))
 {
+  record = HoodieCDCUtils.cdcRecord(operation.getValue(), recordKey, 
oldRecord);
+} else {
+  record = HoodieCDCUtils.cdcRecord(operation.getValue(), recordKey);
+}
+return new SerializableRecord(record);
+  }
+
+  protected GenericRecord addCommitMetadata(GenericRecord record, String 
recordKey, String partitionPath) {

Review Comment:
   Meta fields carry purely semantic information related to their 
_persistence_ by Hudi. 
   These aren't part of the record's payload, and we shouldn't be carrying 
them w/in the CDC payload.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi

2022-09-14 Thread GitBox


alexeykudinkin commented on code in PR #6476:
URL: https://github.com/apache/hudi/pull/6476#discussion_r971357775


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -102,6 +118,15 @@
   protected Map> keyToNewRecords;
   protected Set writtenRecordKeys;
   protected HoodieFileWriter fileWriter;
+  // a flag that indicates whether the change data is allowed to be written 
out to a cdc log file.
+  protected boolean cdcEnabled = false;
+  protected String cdcSupplementalLoggingMode;

Review Comment:
   I was reviewing this before I caught up w/ the updated version of the RFC, 
so I got confused.
   Yeah, let's use an enum for this one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #6341: [SUPPORT] Hudi delete not working via spark apis

2022-09-14 Thread GitBox


nsivabalan commented on issue #6341:
URL: https://github.com/apache/hudi/issues/6341#issuecomment-1247373856

   sure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #6101: [SUPPORT] Hudi Delete Not working with EMR, AWS Glue & S3

2022-09-14 Thread GitBox


nsivabalan commented on issue #6101:
URL: https://github.com/apache/hudi/issues/6101#issuecomment-1247372251

   I assume you are referring to delete_partitions, right? How are you 
triggering delete_partition? Are you passing in a regular dataframe as you 
would for other write operations? 
   Or are you setting the config 
https://hudi.apache.org/docs/configurations#hoodiedatasourcewritepartitionstodelete
 ? You can set a comma-separated list of partition values that need to be 
deleted, as in the sketch below. 
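   A minimal sketch of the config-driven route (partition values, table name 
and `basePath` are hypothetical):
   ```scala
   import org.apache.spark.sql.SaveMode

   // Some versions may want a DataFrame with the table schema instead of an
   // empty one; the config below drives which partitions get dropped.
   spark.emptyDataFrame.write.format("hudi")
     .option("hoodie.table.name", "my_table")
     .option("hoodie.datasource.write.operation", "delete_partition")
     .option("hoodie.datasource.write.partitions.to.delete", "2022/09/01,2022/09/02")
     .mode(SaveMode.Append)
     .save(basePath)
   ```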
   
   I might need to reproduce your exact scenario and go from there. In the 
meantime, if you have a reproducible script, let me know. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] bhasudha commented on pull request #6674: [DOCS] Standardize blog images sizes

2022-09-14 Thread GitBox


bhasudha commented on PR #6674:
URL: https://github.com/apache/hudi/pull/6674#issuecomment-1247371882

   Done. @yihua redrawing the entire image might be a bigger effort. I 
changed the images so their aspect ratio does not change and they are closer 
to 1200 * 600 in either width or height. Please take a look and flag any 
image that comes out odd.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #6463: [SUPPORT]Caused by: java.lang.IllegalArgumentException at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)

2022-09-14 Thread GitBox


nsivabalan commented on issue #6463:
URL: https://github.com/apache/hudi/issues/6463#issuecomment-1247370168

   Let us know if you are still looking for any assistance. If not, we can go 
ahead and close out the issue. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #6463: [SUPPORT]Caused by: java.lang.IllegalArgumentException at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)

2022-09-14 Thread GitBox


nsivabalan commented on issue #6463:
URL: https://github.com/apache/hudi/issues/6463#issuecomment-1247369694

   Common configs required for any lock provider:
   
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   hoodie.cleaner.policy.failed.writes=LAZY
   hoodie.write.lock.provider=
   
   configs for zookeeper based lock
   ```
   
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
   hoodie.write.lock.zookeeper.url
   hoodie.write.lock.zookeeper.port
   hoodie.write.lock.zookeeper.lock_key
   hoodie.write.lock.zookeeper.base_path
   ```
   
   Configs for hive metastore based lock:
   ```
   
hoodie.write.lock.provider=org.apache.hudi.hive.HiveMetastoreBasedLockProvider
   hoodie.write.lock.hivemetastore.database
   hoodie.write.lock.hivemetastore.table
   ```
   
   DynamoDb based lock:
   ```
   
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
   hoodie.write.lock.dynamodb.table
   hoodie.write.lock.dynamodb.partition_key
   hoodie.write.lock.dynamodb.region
   hoodie.write.lock.dynamodb.endpoint_url
   hoodie.write.lock.dynamodb.billing_mode
   ```
   ```
   hoodie.aws.access.key
   hoodie.aws.secret.key
   hoodie.aws.session.token
   ```
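   Wired into a datasource write, the ZooKeeper variant looks roughly like this 
(host, paths, `df` and `basePath` are hypothetical):
   ```scala
   df.write.format("hudi")
     .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
     .option("hoodie.cleaner.policy.failed.writes", "LAZY")
     .option("hoodie.write.lock.provider",
       "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
     .option("hoodie.write.lock.zookeeper.url", "zk1.example.com") // hypothetical host
     .option("hoodie.write.lock.zookeeper.port", "2181")
     .option("hoodie.write.lock.zookeeper.lock_key", "my_table")
     .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks")
     .mode("append")
     .save(basePath)
   ```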
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6672: [HUDI-4757] Create pyspark examples

2022-09-14 Thread GitBox


hudi-bot commented on PR #6672:
URL: https://github.com/apache/hudi/pull/6672#issuecomment-1247369500

   
   ## CI report:
   
   * 7864afbc773d4dde0fca7fad439d2da39cfa8c78 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11364)
 
   * 25fad9af64012f22e0bb00d1a454026de0902f92 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11367)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #6463: [SUPPORT]Caused by: java.lang.IllegalArgumentException at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)

2022-09-14 Thread GitBox


nsivabalan commented on issue #6463:
URL: https://github.com/apache/hudi/issues/6463#issuecomment-1247368648

   Yes, you can find the configs that need to be set for a ZooKeeper-based 
lock or a Hive metastore based lock here: 
   https://hudi.apache.org/docs/concurrency_control
   
   We also have a DynamoDB-based lock if you are interested. 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #6591: [SUPPORT]Duplicate records in MOR

2022-09-14 Thread GitBox


nsivabalan commented on issue #6591:
URL: https://github.com/apache/hudi/issues/6591#issuecomment-1247367471

   Yes, I get it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table

2022-09-14 Thread GitBox


hudi-bot commented on PR #4676:
URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247360153

   
   ## CI report:
   
   * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN
   * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN
   * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN
   * e5af3c2bc8310bf3d41560fed377bfdd078505be Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11362)
 
   * 94c9b7bdfd83828a9552dfeab418403a4594c649 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11368)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6516: [HUDI-4729] Fix fq can not be queried in pending compaction when query ro table with spark

2022-09-14 Thread GitBox


hudi-bot commented on PR #6516:
URL: https://github.com/apache/hudi/pull/6516#issuecomment-1247357228

   
   ## CI report:
   
   * 8b06e2b181eb0d913a3d9a465e06082cd040bfec Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11361)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4759) Fix website Quick start guide to add validations

2022-09-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4759:
-
Labels: pull-request-available  (was: )

> Fix website Quick start guide to add validations
> 
>
> Key: HUDI-4759
> URL: https://issues.apache.org/jira/browse/HUDI-4759
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] jonvex opened a new pull request, #6675: [HUDI-4759] added validations and some pyspark edits to the quick start guide

2022-09-14 Thread GitBox


jonvex opened a new pull request, #6675:
URL: https://github.com/apache/hudi/pull/6675

   ### Change Logs
   
   Added pyspark and scala validations to the quickstart. Added a pyspark 
insert overwrite example. Fixed some errors in the existing pyspark examples.
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table

2022-09-14 Thread GitBox


hudi-bot commented on PR #4676:
URL: https://github.com/apache/hudi/pull/4676#issuecomment-1247355671

   
   ## CI report:
   
   * 5944f5cbe9ce73fe6b7e27a0d381eaeb80dead38 UNKNOWN
   * 4ef7b451c3dd795906f3f68571256baeb330a59f UNKNOWN
   * 6aeb3d0d8f09aeab2a5766cf9d25ecb30537 UNKNOWN
   * b0c4d706cad14fba7cd31f3f22090f3867fbd2a7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11360)
 
   * e5af3c2bc8310bf3d41560fed377bfdd078505be Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11362)
 
   * 94c9b7bdfd83828a9552dfeab418403a4594c649 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11368)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on pull request #6674: [DOCS] Standardize blog images sizes

2022-09-14 Thread GitBox


yihua commented on PR #6674:
URL: https://github.com/apache/hudi/pull/6674#issuecomment-1247343367

   > I definitely agree. But one question: would it be okay to decouple that 
and fix individual images in a second PR? I want to make the bulk change 
quickly and go from there. But I'm open to ideas. What do you think?
   
   Given this affects how the website is rendered, let's have the changes that 
make the images look better in one PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] bhasudha commented on pull request #6674: [DOCS] Standardize blog images sizes

2022-09-14 Thread GitBox


bhasudha commented on PR #6674:
URL: https://github.com/apache/hudi/pull/6674#issuecomment-1247338781

   I definitely agree. But one question: would it be okay to decouple that and 
fix individual images in a second PR? I want to make the bulk change quickly 
and go from there. But I'm open to ideas. What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-1275) Incremental Timeline Syncing causes compaction to fail with FileNotFound exception

2022-09-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-1275.
-
Resolution: Incomplete

> Incremental Timeline Syncing causes compaction to fail with FileNotFound 
> exception
> --
>
> Key: HUDI-1275
> URL: https://issues.apache.org/jira/browse/HUDI-1275
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.1
>
>
> Context: [https://github.com/apache/hudi/issues/2020]
>  
>  
> {{20/08/25 07:17:13 WARN TaskSetManager: Lost task 3.0 in stage 41.0 (TID 
> 2540, ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal, executor 1): 
> org.apache.hudi.exception.HoodieException: java.io.FileNotFoundException: No 
> such file or directory 
> 's3://myBucket/absolute_path_to/daas_date=2020/56be5da5-f5f3-4675-8dec-433f3656f839-0_3-816-50630_20200825065331.parquet'
> at 
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:207)
> at 
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:190)
> at 
> org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.compact(HoodieMergeOnReadTableCompactor.java:139)
> at 
> org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.lambda$compact$644ebad7$1(HoodieMergeOnReadTableCompactor.java:98)
> at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1040)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
> at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1182)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory 
> 's3://myBucket/absolute_path_to/daas_date=2020/56be5da5-f5f3-4675-8dec-433f3656f839-0_3-816-50630_20200825065331.parquet'
> at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:617)
> at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:553)
> at 
> org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
> at 
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:202)
> ... 26 more}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-1275) Incremental Timeline Syncing causes compaction to fail with FileNotFound exception

2022-09-14 Thread Alexey Kudinkin (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17604989#comment-17604989
 ] 

Alexey Kudinkin commented on HUDI-1275:
---

Seems like the original issue was filed against Hudi 0.5.3.



At this point I don't think we captured enough context to even try to reproduce 
this issue, so we will have to close it without resolution, unfortunately.

> Incremental Timeline Syncing causes compaction to fail with FileNotFound 
> exception
> --
>
> Key: HUDI-1275
> URL: https://issues.apache.org/jira/browse/HUDI-1275
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.1
>
>
> Context: [https://github.com/apache/hudi/issues/2020]
>  
>  
> {{20/08/25 07:17:13 WARN TaskSetManager: Lost task 3.0 in stage 41.0 (TID 
> 2540, ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal, executor 1): 
> org.apache.hudi.exception.HoodieException: java.io.FileNotFoundException: No 
> such file or directory 
> 's3://myBucket/absolute_path_to/daas_date=2020/56be5da5-f5f3-4675-8dec-433f3656f839-0_3-816-50630_20200825065331.parquet'
> at 
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:207)
> at 
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:190)
> at 
> org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.compact(HoodieMergeOnReadTableCompactor.java:139)
> at 
> org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.lambda$compact$644ebad7$1(HoodieMergeOnReadTableCompactor.java:98)
> at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1040)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
> at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1182)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory 
> 's3://myBucket/absolute_path_to/daas_date=2020/56be5da5-f5f3-4675-8dec-433f3656f839-0_3-816-50630_20200825065331.parquet'
> at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:617)
> at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:553)
> at 
> org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
> at 
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:202)
> ... 26 more}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] bhasudha opened a new pull request, #6674: [DOCS] Standardize blog images sizes

2022-09-14 Thread GitBox


bhasudha opened a new pull request, #6674:
URL: https://github.com/apache/hudi/pull/6674

   ### Change Logs
   
   Changed the image sizes to a standard 1200 * 600 for most images, for 
better rendering of the blog landing page.
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3915) Error upserting bucketType UPDATE for partition :0

2022-09-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3915:
--
Status: Open  (was: In Progress)

> Error upserting bucketType UPDATE for partition :0
> --
>
> Key: HUDI-3915
> URL: https://issues.apache.org/jira/browse/HUDI-3915
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Neetu Gupta
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.1
>
>
> I updated the Hudi partition columns from 'year,month' to 'year'. Then I 
> ran the process in overwrite mode. The process executed successfully and the 
> Hudi table got created. 
> However, when the process got triggered in 'append' mode, I started getting 
> the error mentioned below:
> '
> Task 0 in stage 32.0 failed 4 times; aborting job java.lang.Exception: Job 
> aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 32.0 (TID 1207, 
> ip-10-73-110-184.ec2.internal, executor 6): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :0 at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:305)
> '
> Then I reverted the partition columns back to 'year,month' but still got the 
> same error. However, when I write the data to a different folder in 'append' 
> mode, the script runs fine and I can see the Hudi table. 
> In short, the process does not work when I try to append data to the same 
> path. Can you please look into this? This is critical to us because the jobs 
> are stuck.
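
For illustration, a minimal sketch of the write sequence described above 
(PySpark; table name, path, and columns are placeholders, and this is an 
assumed reconstruction of the scenario, not the reporter's actual job):
```
# Sketch of the reported scenario. Requires the Hudi Spark bundle on
# the classpath; all names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-3915-sketch").getOrCreate()
df = spark.createDataFrame(
    [("k1", "2022", "09", 1)], ["id", "year", "month", "ts"]
)

base_opts = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
}
path = "s3://bucket/events"

# First run: partitioned by year,month; overwrite recreates the table.
df.write.format("hudi").options(**base_opts) \
    .option("hoodie.datasource.write.partitionpath.field", "year,month") \
    .mode("overwrite").save(path)

# Later run: partition field switched to 'year' only, in append mode.
# The mismatch with the existing table layout is where the reported
# HoodieUpsertException surfaces.
df.write.format("hudi").options(**base_opts) \
    .option("hoodie.datasource.write.partitionpath.field", "year") \
    .mode("append").save(path)
```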



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3915) Error upserting bucketType UPDATE for partition :0

2022-09-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3915:
--
Sprint: 2022/09/19

> Error upserting bucketType UPDATE for partition :0
> --
>
> Key: HUDI-3915
> URL: https://issues.apache.org/jira/browse/HUDI-3915
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Neetu Gupta
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.1
>
>
> I updated the Hudi partition columns from 'year,month' to 'year'. Then I 
> ran the process in overwrite mode. The process executed successfully and the 
> Hudi table got created. 
> However, when the process got triggered in 'append' mode, I started getting 
> the error mentioned below:
> '
> Task 0 in stage 32.0 failed 4 times; aborting job java.lang.Exception: Job 
> aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 32.0 (TID 1207, 
> ip-10-73-110-184.ec2.internal, executor 6): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :0 at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:305)
> '
> Then I reverted the partition columns back to 'year,month' but still got the 
> same error. However, when I write the data to a different folder in 'append' 
> mode, the script runs fine and I can see the Hudi table. 
> In short, the process does not work when I try to append data to the same 
> path. Can you please look into this? This is critical to us because the jobs 
> are stuck.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3915) Error upserting bucketType UPDATE for partition :0

2022-09-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3915:
--
Sprint:   (was: 2022/09/05)

> Error upserting bucketType UPDATE for partition :0
> --
>
> Key: HUDI-3915
> URL: https://issues.apache.org/jira/browse/HUDI-3915
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Neetu Gupta
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.1
>
>
> I updated the Hudi partition columns from 'year,month' to 'year'. Then I 
> ran the process in overwrite mode. The process executed successfully and the 
> Hudi table got created. 
> However, when the process got triggered in 'append' mode, I started getting 
> the error mentioned below:
> '
> Task 0 in stage 32.0 failed 4 times; aborting job java.lang.Exception: Job 
> aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 32.0 (TID 1207, 
> ip-10-73-110-184.ec2.internal, executor 6): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :0 at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:305)
> '
> Then I reverted the partition columns back to 'year,month' but still got the 
> same error. However, when I write the data to a different folder in 'append' 
> mode, the script runs fine and I can see the Hudi table. 
> In short, the process does not work when I try to append data to the same 
> path. Can you please look into this? This is critical to us because the jobs 
> are stuck.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-3915) Error upserting bucketType UPDATE for partition :0

2022-09-14 Thread Alexey Kudinkin (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17604987#comment-17604987
 ] 

Alexey Kudinkin commented on HUDI-3915:
---

[~ngupta2206] can you please provide the full stack trace? 

Also, which Spark and Hudi versions are you using?

> Error upserting bucketType UPDATE for partition :0
> --
>
> Key: HUDI-3915
> URL: https://issues.apache.org/jira/browse/HUDI-3915
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Neetu Gupta
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.1
>
>
> I updated the Hudi partition columns from 'year,month' to 'year'. Then I 
> ran the process in overwrite mode. The process executed successfully and the 
> Hudi table got created. 
> However, when the process got triggered in 'append' mode, I started getting 
> the error mentioned below:
> '
> Task 0 in stage 32.0 failed 4 times; aborting job java.lang.Exception: Job 
> aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 32.0 (TID 1207, 
> ip-10-73-110-184.ec2.internal, executor 6): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :0 at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:305)
> '
> Then I reverted the partition columns back to 'year,month' but still got the 
> same error. However, when I write the data to a different folder in 'append' 
> mode, the script runs fine and I can see the Hudi table. 
> In short, the process does not work when I try to append data to the same 
> path. Can you please look into this? This is critical to us because the jobs 
> are stuck.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3915) Error upserting bucketType UPDATE for partition :0

2022-09-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3915:
--
Status: In Progress  (was: Open)

> Error upserting bucketType UPDATE for partition :0
> --
>
> Key: HUDI-3915
> URL: https://issues.apache.org/jira/browse/HUDI-3915
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Neetu Gupta
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.1
>
>
> I updated the Hudi partition columns from 'year,month' to 'year'. Then I 
> ran the process in overwrite mode. The process executed successfully and the 
> Hudi table got created. 
> However, when the process got triggered in 'append' mode, I started getting 
> the error mentioned below:
> '
> Task 0 in stage 32.0 failed 4 times; aborting job java.lang.Exception: Job 
> aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 32.0 (TID 1207, 
> ip-10-73-110-184.ec2.internal, executor 6): 
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
> UPDATE for partition :0 at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:305)
> '
> Then I reverted the partition columns back to 'year,month' but still got the 
> same error. However, when I write the data to a different folder in 'append' 
> mode, the script runs fine and I can see the Hudi table. 
> In short, the process does not work when I try to append data to the same 
> path. Can you please look into this? This is critical to us because the jobs 
> are stuck.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4363) Support Clustering row writer to improve performance

2022-09-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4363:
--
Status: In Progress  (was: Open)

> Support Clustering row writer to improve performance
> 
>
> Key: HUDI-4363
> URL: https://issues.apache.org/jira/browse/HUDI-4363
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance, writer-core
>Reporter: Hui An
>Assignee: Hui An
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-07-05 at 17.25.13.png
>
>
> 1. Integrate clustering with the datasource read and write APIs. In this way, 
> - clustering can use the Dataset API
> - read and write operations are unified, so if the read/write logic 
> improves, clustering also benefits (for example, from vectorized reads)
> 2. Use {{hoodie.datasource.read.paths}} to pass paths for each 
> clusteringOperation.
> 3. Introduce {{HoodieInternalWriteStatusCoordinator}} to persist the 
> {{InternalWriteStatus}} of a clustering action, since we cannot get it when 
> using the Spark datasource.
> 4. Add new configs to control this behavior.
> h4. Test performance
> A test table has 21 columns and 710716 rows; raw data size is 929g (in Spark 
> memory), 38.3g after compression.
> Executor memory: 50g, 20 instances, with global_sort enabled.
> Without row-writer clustering: 32 min, 12 sec
> With row-writer clustering: 9 min, 51 sec
> The performance improvement can also be seen in the tests 
> {{TestHoodieSparkMergeOnReadTableClustering}} and 
> {{testLayoutOptimizationFunctional}}.
>  !Screen Shot 2022-07-05 at 17.25.13.png! 
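
For context, a minimal sketch of enabling inline clustering on a datasource 
write (PySpark; names are placeholders, and the exact row-writer toggle 
introduced by this ticket is deliberately left out; see the PR for its config 
key):
```
# Sketch: inline clustering with a sort column. Table name, path, and
# columns are placeholders; the row-writer toggle added by HUDI-4363 is
# omitted on purpose (see the PR for the exact key). Requires the Hudi
# Spark bundle on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-clustering-sketch").getOrCreate()
df = spark.createDataFrame(
    [("k1", "2022", "07", 1)], ["id", "year", "month", "ts"]
)

clustering_opts = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Run clustering inline, scheduled after every 4 commits.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Sort records by these columns while rewriting file groups.
    "hoodie.clustering.plan.strategy.sort.columns": "year,month",
}

df.write.format("hudi").options(**clustering_opts).mode("append").save(
    "s3://bucket/events")
```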



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

