[jira] [Created] (HUDI-4820) ORC dependency conflict with spark 3.1

2022-09-08 Thread xi chaomin (Jira)
xi chaomin created HUDI-4820:


 Summary: ORC dependency conflict with spark 3.1
 Key: HUDI-4820
 URL: https://issues.apache.org/jira/browse/HUDI-4820
 Project: Apache Hudi
  Issue Type: Bug
Reporter: xi chaomin








[GitHub] [hudi] paul8263 commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-09-08 Thread GitBox


paul8263 commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1241525099

   @hudi-bot run azure





[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-09-08 Thread GitBox


hudi-bot commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1241501161

   
   ## CI report:
   
   * 252b9f49e2c86c7aad7908e5887064b0d0b36932 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11258)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6550: [HUDI-4691] Cleaning up duplicated classes in Spark 3.3 module

2022-09-08 Thread GitBox


hudi-bot commented on PR #6550:
URL: https://github.com/apache/hudi/pull/6550#issuecomment-1241498549

   
   ## CI report:
   
   * 684ca9bdec8d75a27bf78ec09bf2ba31f67bdda4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11132)
 
   * bd41f0de1043153c18b1a716e6362c56cf03bbf9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11263)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-09-08 Thread GitBox


hudi-bot commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1241498437

   
   ## CI report:
   
   * 252b9f49e2c86c7aad7908e5887064b0d0b36932 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11258)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Created] (HUDI-4819) run_sync_tool.sh in hudi-hive-sync fails with classpath errors on release-0.12.0

2022-09-08 Thread Pramod Biligiri (Jira)
Pramod Biligiri created HUDI-4819:
-

 Summary: run_sync_tool.sh in hudi-hive-sync fails with classpath 
errors on release-0.12.0
 Key: HUDI-4819
 URL: https://issues.apache.org/jira/browse/HUDI-4819
 Project: Apache Hudi
  Issue Type: Bug
  Components: hive, meta-sync
Affects Versions: 0.12.0
Reporter: Pramod Biligiri
 Attachments: modified_run_sync_tool.sh

I ran the run_sync_tool.sh script after cloning and building a fresh checkout 
of apache-hudi (branch: release-0.12.0). The script failed with classpath-related 
errors. Below is the relevant sequence of commands I used:

$ git branch
* (HEAD detached at release-0.12.0)

$ mvn -Dspark3.2 -Dscala-2.12 -DskipTests  -Dcheckstyle.skip -Drat.skip clean 
install

$ echo $HADOOP_HOME
/home/pramod/2installers/hadoop-2.7.4

$ echo $HIVE_HOME
/home/pramod/2installers/apache-hive-3.1.3-bin

$ /run_sync_tool.sh  --jdbc-url jdbc:hive2://hiveserver:1 
--partitioned-by bucket --base-path 
/2-pramod/tmp/gcs-integration-test/data/meta-gcs --database default --table 
gcs_meta_hive_4 > log.out 2>&1
setting hadoop conf dir
Running Command : java -cp 
/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-metastore-3.1.3.jar::/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-service-3.1.3.jar::/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-exec-3.1.3.jar::/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-jdbc-3.1.3.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/hive-jdbc-handler-3.1.3.jar::/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-annotations-2.12.0.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-core-2.12.0.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-core-asl-1.9.13.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-databind-2.12.0.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-dataformat-smile-2.12.0.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-mapper-asl-1.9.13.jar:/home/pramod/2installers/apache-hive-3.1.3-bin/lib/jackson-module-scala_2.11-2.12.0.jar::/home/pramod/2installers/hadoop-2.7.4/share/hadoop/common/*:/home/pramod/2installers/hadoop-2.7.4/share/hadoop/mapreduce/*:/home/pramod/2installers/hadoop-2.7.4/share/hadoop/hdfs/*:/home/pramod/2installers/hadoop-2.7.4/share/hadoop/common/lib/*:/home/pramod/2installers/hadoop-2.7.4/share/hadoop/hdfs/lib/*:/home/pramod/2installers/hadoop-2.7.4/etc/hadoop:/3-pramod/3workspace/apache-hudi/hudi-sync/hudi-hive-sync/../../packaging/hudi-hive-sync-bundle/target/hudi-hive-sync-bundle-0.12.0.jar
 org.apache.hudi.hive.HiveSyncTool --jdbc-url jdbc:hive2://hiveserver:1 
--partitioned-by bucket --base-path 
/2-pramod/tmp/gcs-integration-test/data/meta-gcs --database default --table 
gcs_meta_hive_4
2022-09-08 10:53:24,335 INFO  [main] conf.HiveConf 
(HiveConf.java:findConfigFile(187)) - Found configuration file 
file:/home/pramod/2installers/apache-hive-3.1.3-bin/conf/hive-site.xml
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by 
org.apache.hadoop.security.authentication.util.KerberosUtil 
(file:/2-pramod/installers/hadoop-2.7.4/share/hadoop/common/lib/hadoop-auth-2.7.4.jar)
 to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of 
org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
2022-09-08 10:53:25,876 WARN  [main] util.NativeCodeLoader 
(NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for 
your platform... using builtin-java classes where applicable
2022-09-08 10:53:26,359 INFO  [main] table.HoodieTableMetaClient 
(HoodieTableMetaClient.java:<init>(121)) - Loading HoodieTableMetaClient from 
/2-pramod/tmp/gcs-integration-test/data/meta-gcs
2022-09-08 10:53:26,568 INFO  [main] table.HoodieTableConfig 
(HoodieTableConfig.java:<init>(243)) - Loading table properties from 
/2-pramod/tmp/gcs-integration-test/data/meta-gcs/.hoodie/hoodie.properties
2022-09-08 10:53:26,585 INFO  [main] table.HoodieTableMetaClient 
(HoodieTableMetaClient.java:<init>(140)) - Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
/2-pramod/tmp/gcs-integration-test/data/meta-gcs
2022-09-08 10:53:26,586 INFO  [main] table.HoodieTableMetaClient 
(HoodieTableMetaClient.java:<init>(143)) - Loading Active commit timeline for 
/2-pramod/tmp/gcs-integration-test/data/meta-gcs
2022-09-08 10:53:26,727 INFO  [main] timeline.HoodieActiveTimeline 
(HoodieActiveTimeline.java:<init>(129)) - Loaded instants upto : 
Option{val=[20220907220948700__commit__COMPLETED]}
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/http/config/Lookup
    at 
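The missing class org.apache.http.config.Lookup is provided by Apache 
httpcore/httpclient. A possible workaround (a sketch only; the jar versions and 
the classpath variable name below are assumptions based on a stock Hive 3.1.3 
lib directory and the script's structure) is to append those jars to the 
classpath the script builds:

# Sketch: append the httpcore/httpclient jars, which provide
# org.apache.http.config.Lookup, to the classpath run_sync_tool.sh assembles.
HTTP_JARS=$(ls "$HIVE_HOME"/lib/httpcore-*.jar "$HIVE_HOME"/lib/httpclient-*.jar | tr '\n' ':')
HIVE_JARS="$HIVE_JARS:$HTTP_JARS"  # HIVE_JARS stands in for the script's classpath variable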

[GitHub] [hudi] paul8263 commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-09-08 Thread GitBox


paul8263 commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1241480297

   @hudi-bot run azure





[GitHub] [hudi] hudi-bot commented on pull request #6550: [HUDI-4691] Cleaning up duplicated classes in Spark 3.3 module

2022-09-08 Thread GitBox


hudi-bot commented on PR #6550:
URL: https://github.com/apache/hudi/pull/6550#issuecomment-1241471920

   
   ## CI report:
   
   * 684ca9bdec8d75a27bf78ec09bf2ba31f67bdda4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11132)
 
   * bd41f0de1043153c18b1a716e6362c56cf03bbf9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-09-08 Thread GitBox


danny0405 commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r966599692


##
rfc/rfc-51/rfc-51.md:
##
@@ -215,18 +245,31 @@ Note:
 
 - Only instants that are active can be queried in a CDC scenario.
 - `CDCReader` manages all the things on CDC, and all the spark 
entrances(DataFrame, SQL, Streaming) call the functions in `CDCReader`.
-- If `hoodie.table.cdc.supplemental.logging` is false, we need to do more work 
to get the change data. The following illustration explains the difference when 
this config is true or false.
+- If `hoodie.table.cdc.supplemental.logging.mode=KEY_OP`, we need to compute 
the changed data. The following illustrates the difference.
 
 ![](read_cdc_log_file.jpg)
 
 #### COW table
 
-Reading COW table in CDC query mode is equivalent to reading a simplified MOR 
table that has no normal log files.
+Reading COW tables in CDC query mode is equivalent to reading MOR tables in RO 
mode.
 
 #### MOR table
 
-According to the design of the writing part, only the cases where writing mor 
tables will write out the base file (which call the `HoodieMergeHandle` and 
it's subclasses) will write out the cdc files.
-In other words, cdc files will be written out only for the index and file size 
reasons.
+According to the section "Persisting CDC in MOR", CDC data is available upon 
base files' generation.
+
+When users want to get fresher real-time CDC results:
+
+- users are to set `hoodie.datasource.query.incremental.type=snapshot`
+- the implementation logic is to compute the results in-flight by reading log 
files and the corresponding base files (
+  current and previous file slices).
+- this is equivalent to running incremental-query on MOR RT tables
+
+When users want to optimize compute-cost and are tolerant with latency of CDC 
results,
+
+- users are to set `hoodie.datasource.query.incremental.type=read_optimized`
+- the implementation logic is to extract the results by reading persisted CDC 
data and the corresponding base files (
+  current and previous file slices).

Review Comment:
   > can you pls clarify what cases exactly represent writer anomalies
   
   I'm a little worried about the CDC semantics correctness before it is widely 
used in production, such as the deletion, the merging, and the data sequence.
   
   > 2nd step to support read_optimized i
   
   We can call it an optimization, but it should not be exposed as read_optimized, 
which conflicts a little with the current read_optimized view. WDYT?
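   
   For illustration, a rough sketch of the query shape being discussed, from 
Spark (option keys are from the RFC text above; the path, timestamp, and the 
`spark` session are placeholders):
   
   ```scala
   // Sketch only: "snapshot" computes CDC results on the fly from log + base
   // files, while the debated "read_optimized" flavor would serve persisted
   // CDC data at some latency cost.
   val cdc = spark.read.format("hudi")
     .option("hoodie.datasource.query.type", "incremental")
     .option("hoodie.datasource.query.incremental.type", "snapshot")
     .option("hoodie.datasource.read.start.timestamp", "20220908000000")
     .load("/path/to/hudi_cdc_table")
   ```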
   






[GitHub] [hudi] hudi-bot commented on pull request #6631: [HUDI-4810] Fixing Hudi bundles requiring log4j2 on the classpath

2022-09-08 Thread GitBox


hudi-bot commented on PR #6631:
URL: https://github.com/apache/hudi/pull/6631#issuecomment-1241469308

   
   ## CI report:
   
   * e8e8c4d8047b5985764f7534bd84e82763c3ad28 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11243)
 
   * cf25a4e0d37980de4284afe841eced2f205b97a5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11262)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-09-08 Thread GitBox


danny0405 commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r966598104


##
rfc/rfc-51/rfc-51.md:
##
@@ -148,20 +152,46 @@ hudi_cdc_table/
 
 Under a partition directory, the `.log` file with `CDCBlock` above will keep 
the changing data we have to materialize.
 
-There is an option to control what data is written to `CDCBlock`, that is 
`hoodie.table.cdc.supplemental.logging`. See the description of this config 
above.
+#### Persisting CDC in MOR: Write-on-indexing vs Write-on-compaction
+
+2 design choices on when to persist CDC in MOR tables:
+
+Write-on-indexing allows CDC info to be persisted at the earliest, however, in 
case of Flink writer or Bucket
+indexing, `op` (I/U/D) data is not available at indexing.
+
+Write-on-compaction can always persist CDC info and achieve standardization of 
implementation logic across engines,
+however, some delays are added to the CDC query results. Based on the business 
requirements, Log Compaction (RFC-48) or
+scheduling more frequent compaction can be used to minimize the latency.
 
-Spark DataSource example:
+The semantics we propose to establish are: when base files are written, the 
corresponding CDC data is also persisted.
+
+- For Spark
+  - inserts are written to base files: the CDC data `op=I` will be persisted
+  - updates/deletes that written to log files are compacted into base files: 
the CDC data `op=U|D` will be persisted
+- For Flink
+  - inserts/updates/deletes that written to log files are compacted into base 
files: the CDC data `op=I|U|D` will be
+persisted
+

Review Comment:
   That's true, but by default Flink does not do the lookup (for better 
throughput), and we may generate the changelog for the reader on the fly.






[GitHub] [hudi] hudi-bot commented on pull request #6631: [HUDI-4810] Fixing Hudi bundles requiring log4j2 on the classpath

2022-09-08 Thread GitBox


hudi-bot commented on PR #6631:
URL: https://github.com/apache/hudi/pull/6631#issuecomment-1241466870

   
   ## CI report:
   
   * e8e8c4d8047b5985764f7534bd84e82763c3ad28 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11243)
 
   * cf25a4e0d37980de4284afe841eced2f205b97a5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-09-08 Thread GitBox


hudi-bot commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1241466655

   
   ## CI report:
   
   * 252b9f49e2c86c7aad7908e5887064b0d0b36932 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11258)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-09-08 Thread GitBox


hudi-bot commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1241464172

   
   ## CI report:
   
   * 252b9f49e2c86c7aad7908e5887064b0d0b36932 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11258)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] TJX2014 commented on pull request #6595: [HUDI-4777] Fix flink gen bucket index of mor table not consistent wi…

2022-09-08 Thread GitBox


TJX2014 commented on PR #6595:
URL: https://github.com/apache/hudi/pull/6595#issuecomment-1241461060

   > > but on the Flink side, I think deduplication should also be enabled as a 
default option for MOR tables; when duplicates are written to the log file it is 
very hard for compaction to read, and it also makes the MOR table unstable 
because the duplicate records are read into memory twice.
   > 
   > Do you mean that there are two clients writing to the same partition at the 
same time?
   
   Not exactly. Deduplicating the records in memory and then writing to the log 
is cleaner for MOR because the result is the same. As @danny0405 says, in the CDC 
situation we need to retain the original records rather than compacting them in 
memory first, which is acceptable.





[GitHub] [hudi] TJX2014 commented on pull request #6595: [HUDI-4777] Fix flink gen bucket index of mor table not consistent wi…

2022-09-08 Thread GitBox


TJX2014 commented on PR #6595:
URL: https://github.com/apache/hudi/pull/6595#issuecomment-1241460834

   > Not exactly. Deduplicating the records in memory and then writing to the log 
is cleaner for MOR because the result is the same. As @danny0405 says, in the CDC 
situation we need to retain the original records rather than compacting them in 
memory first, which is acceptable.
   
   Not exactly. Deduplicating the records in memory and then writing to the log 
is cleaner for MOR because the result is the same. As @danny0405 says, in the CDC 
situation we need to retain the original records rather than compacting them in 
memory first, which is acceptable.





[GitHub] [hudi] TJX2014 commented on pull request #6595: [HUDI-4777] Fix flink gen bucket index of mor table not consistent wi…

2022-09-08 Thread GitBox


TJX2014 commented on PR #6595:
URL: https://github.com/apache/hudi/pull/6595#issuecomment-1241460494

   > 
   
   Not exactly. Deduplicating the records in memory and then writing to the log 
is cleaner for MOR because the result is the same. As @danny0405 says, in the CDC 
situation we need to retain the original records rather than compacting them in 
memory first, which is acceptable.





[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-09-08 Thread GitBox


danny0405 commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r966591772


##
rfc/rfc-51/rfc-51.md:
##
@@ -64,69 +65,72 @@ We follow the debezium output format: four columns as shown 
below
 
 Note: the illustration here ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
 
-## Goals
+## Design Goals
 
 1. Support row-level CDC records generation and persistence;
 2. Support both MOR and COW tables;
 3. Support all the write operations;
 4. Support Spark DataFrame/SQL/Streaming Query;
 
-## Implementation
+## Configurations
 
-### CDC Architecture
+| key | default | description |
+|---|---|---|
+| hoodie.table.cdc.enabled | `false` | The master switch of the CDC features. If `true`, writers and readers will respect CDC configurations and behave accordingly. |
+| hoodie.table.cdc.supplemental.logging | `false` | If `true`, persist the required information about the changed data, including `before`. If `false`, only `op` and record keys will be persisted. |
+| hoodie.table.cdc.supplemental.logging.include_after | `false` | If `true`, persist `after` as well. |
 
-![](arch.jpg)
+To perform CDC queries, users need to set `hoodie.table.cdc.enable=true` and 
`hoodie.datasource.query.type=incremental`.
 
-Note: Table operations like `Compact`, `Clean`, `Index` do not write/change 
any data. So we don't need to consider them in CDC scenario.
- 
-### Modifiying code paths
+| key | default | description |
+|---|---|---|
+| hoodie.table.cdc.enabled | `false` | set to `true` for CDC queries |
+| hoodie.datasource.query.type | `snapshot` | set to `incremental` for CDC queries |
+| hoodie.datasource.read.start.timestamp | - | required. |
+| hoodie.datasource.read.end.timestamp | - | optional. |
 
-![](points.jpg)
+### Logical File Types
 
-### Config Definitions
+We define 4 logical file types for the CDC scenario.

Review Comment:
   > always easier for people to digest when leverage on existing knowledge 
(actions here) instead of bringing up new definitions
   
   Couldn't agree more; let's not introduce new terminology.
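   
   As a usage sketch for the configuration tables quoted above (values and paths 
are placeholders; an existing DataFrame `df` is assumed):
   
   ```scala
   // Sketch only: enabling the proposed table-level CDC switches on the write path.
   df.write.format("hudi")
     .option("hoodie.table.name", "hudi_cdc_table")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.table.cdc.enabled", "true")
     .option("hoodie.table.cdc.supplemental.logging", "true")
     .mode("append")
     .save("/path/to/hudi_cdc_table")
   ```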






[GitHub] [hudi] paul8263 commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-09-08 Thread GitBox


paul8263 commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1241457900

   @hudi-bot run azure





[GitHub] [hudi] TJX2014 commented on a diff in pull request #6630: [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo…

2022-09-08 Thread GitBox


TJX2014 commented on code in PR #6630:
URL: https://github.com/apache/hudi/pull/6630#discussion_r966590804


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -72,6 +73,26 @@ public static List<HoodieBaseFile> getLatestBaseFilesForPartition(
 return Collections.emptyList();
   }
 
+  /**
+   * Fetches Pair of partition path and {@link FileSlice}s for interested partitions.
+   *
+   * @param partition   Partition of interest
+   * @param hoodieTable Instance of {@link HoodieTable} of interest
+   * @return the list of {@link FileSlice}
+   */
+  public static List<FileSlice> getLatestFileSlicesForPartition(
+      final String partition,
+      final HoodieTable hoodieTable) {
+    Option<HoodieInstant> latestCommitTime = hoodieTable.getMetaClient().getCommitsTimeline()
+        .filterCompletedInstants().lastInstant();
+    if (latestCommitTime.isPresent()) {
+      return hoodieTable.getHoodieView()
+          .getLatestFileSlicesBeforeOrOn(partition, latestCommitTime.get().getTimestamp(), true)
+          .collect(toList());

Review Comment:
   Ok, I will add a test for this PR.
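   
   For reference, a hypothetical call site for the new helper above (assumes an 
initialized `HoodieTable` named `hoodieTable`; the partition path is illustrative):
   
   ```scala
   import org.apache.hudi.common.model.FileSlice
   import org.apache.hudi.index.HoodieIndexUtils
   
   // Sketch: fetch the latest file slices for one partition of interest.
   val fileSlices: java.util.List[FileSlice] =
     HoodieIndexUtils.getLatestFileSlicesForPartition("2022/09/08", hoodieTable)
   ```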






[GitHub] [hudi] TJX2014 commented on a diff in pull request #6634: [HUDI-4813] Fix infer keygen not work in sparksql side issue

2022-09-08 Thread GitBox


TJX2014 commented on code in PR #6634:
URL: https://github.com/apache/hudi/pull/6634#discussion_r966590414


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -787,9 +787,13 @@ object DataSourceOptionsHelper {
 
   def inferKeyGenClazz(props: TypedProperties): String = {
 val partitionFields = 
props.getString(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), null)
+val recordsKeyFields = 
props.getString(DataSourceWriteOptions.RECORDKEY_FIELD.key(), 
DataSourceWriteOptions.RECORDKEY_FIELD.defaultValue())
+genKeyGenerator(recordsKeyFields, partitionFields)
+  }
+
+  def genKeyGenerator(recordsKeyFields: String, partitionFields: String): 
String = {
 if (partitionFields != null) {

Review Comment:
   Ok, I will do it right now.
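   
   For context, a minimal sketch of how the helper in the diff above is fed (the 
property keys are the DataSourceWriteOptions keys it reads; the field values are 
illustrative):
   
   ```scala
   import org.apache.hudi.DataSourceOptionsHelper
   import org.apache.hudi.common.config.TypedProperties
   
   // Sketch: infer the key generator class from the two write options.
   val props = new TypedProperties()
   props.setProperty("hoodie.datasource.write.recordkey.field", "id")
   props.setProperty("hoodie.datasource.write.partitionpath.field", "dt")
   val keyGenClazz: String = DataSourceOptionsHelper.inferKeyGenClazz(props)
   ```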






[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6631: [HUDI-4810] Fixing Hudi bundles requiring log4j2 on the classpath

2022-09-08 Thread GitBox


the-other-tim-brown commented on code in PR #6631:
URL: https://github.com/apache/hudi/pull/6631#discussion_r966589943


##
hudi-client/hudi-flink-client/pom.xml:
##
@@ -35,7 +35,21 @@
 [pom.xml diff hunk: the XML elements were stripped by the mail archive]

Review Comment:
   nitpick: indentation is off here






[jira] [Updated] (HUDI-4810) Fix Hudi bundles requiring log4j2 on the classpath

2022-09-08 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4810:
--
Status: Patch Available  (was: In Progress)

> Fix Hudi bundles requiring log4j2 on the classpath
> --
>
> Key: HUDI-4810
> URL: https://issues.apache.org/jira/browse/HUDI-4810
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> As part of addressing HUDI-4441, we erroneously rebased Hudi onto the 
> "log4j-1.2-api" module under the impression that it is an API module (as 
> advertised), which turned out not to be the case: it is actually a bridge 
> implementation that requires Log4j2 on the classpath as a required dependency.
> For versions of Spark < 3.3 this triggers exceptions like the following one 
> (reported by [~akmodi])
>  
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>     at org.apache.hudi.metrics.datadog.DatadogReporter.<init>(DatadogReporter.java:55)
>     at org.apache.hudi.metrics.datadog.DatadogMetricsReporter.<init>(DatadogMetricsReporter.java:62)
>     at org.apache.hudi.metrics.MetricsReporterFactory.createReporter(MetricsReporterFactory.java:70)
>     at org.apache.hudi.metrics.Metrics.<init>(Metrics.java:50)
>     at org.apache.hudi.metrics.Metrics.init(Metrics.java:96)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamerMetrics.<init>(HoodieDeltaStreamerMetrics.java:44)
>     at org.apache.hudi.utilities.deltastreamer.DeltaSync.<init>(DeltaSync.java:243)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:663)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:143)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:116)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:562)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>     at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1000)
>     at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>     at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1089)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1098)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: org.apache.logging.log4j.LogManager
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>     ... 23 more {code}
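
Until the fix lands, a possible workaround (an untested sketch, not something 
stated in the ticket) is to supply Log4j2 on the application classpath 
explicitly, e.g.:

{code:bash}
spark-submit --packages org.apache.logging.log4j:log4j-core:2.17.2 ...
{code}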





[jira] [Updated] (HUDI-3391) presto and hive beeline fails to read MOR table w/ 2 or more array fields

2022-09-08 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3391:
--
Status: Patch Available  (was: In Progress)

> presto and hive beeline fails to read MOR table w/ 2 or more array fields
> -
>
> Key: HUDI-3391
> URL: https://issues.apache.org/jira/browse/HUDI-3391
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies, reader-core, trino-presto
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.1
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> We have an issue reported by a user 
> [here|https://github.com/apache/hudi/issues/2657]. Looks like w/ 0.10.0 or 
> later, spark datasource read works, but hive beeline does not. spark.sql 
> (hive table) querying works as well. 
> Another related ticket: 
> [https://github.com/apache/hudi/issues/3834#issuecomment-997307677]
>  
> Steps that I tried:
> [https://gist.github.com/nsivabalan/fdb8794104181f93b9268380c7f7f079]
> From beeline, you will encounter below exception
> {code:java}
> Failed with exception 
> java.io.IOException:org.apache.hudi.org.apache.avro.SchemaParseException: 
> Can't redefine: array {code}
> All linked tickets state that upgrading parquet to 1.11.0 or greater should 
> work. We need to try it out w/ the latest master and go from there. 
>  
>  
>  
>  





[jira] [Closed] (HUDI-4465) Optimizing file-listing path in MT

2022-09-08 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-4465.
-
Resolution: Done

> Optimizing file-listing path in MT
> --
>
> Key: HUDI-4465
> URL: https://issues.apache.org/jira/browse/HUDI-4465
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We should review the file-listing path and try to optimize it as much as 
> possible.





[hudi] branch master updated: [HUDI-4465] Optimizing file-listing sequence of Metadata Table (#6016)

2022-09-08 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 4af60dcfba [HUDI-4465] Optimizing file-listing sequence of Metadata 
Table (#6016)
4af60dcfba is described below

commit 4af60dcfbaeea5e79bc4b9457477e9ee0f9cdb79
Author: Alexey Kudinkin 
AuthorDate: Thu Sep 8 20:09:00 2022 -0700

[HUDI-4465] Optimizing file-listing sequence of Metadata Table (#6016)

Optimizes file-listing sequence of the Metadata Table to make sure it's on 
par or better than FS-based file-listing

Change log:

- Cleaned up avoidable instantiations of Hadoop's Path
- Replaced new Path w/ createUnsafePath where possible
- Cached TimestampFormatter, DateFormatter for timezone
- Avoid loading defaults in Hadoop conf when init-ing HFile reader
- Avoid re-instantiating BaseTableMetadata twice w/in 
BaseHoodieTableFileIndex
- Avoid looking up FileSystem for every partition when listing partitioned 
table, instead do it just once
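
To illustrate the first two bullets, a toy sketch of the caching idea (made-up 
names, not the actual CachingPath API):

```scala
import java.util.concurrent.ConcurrentHashMap
import org.apache.hadoop.fs.Path

// Toy sketch: cache child Paths per parent so Hadoop's URI parsing is not
// repeated for every file listed under the same directory.
class ChildPathCache(parent: Path) {
  private val children = new ConcurrentHashMap[String, Path]()
  def child(name: String): Path =
    children.computeIfAbsent(name, n => new Path(parent, n))
}
```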
---
 docker/demo/config/dfs-source.properties   |   4 +
 .../metadata/HoodieBackedTableMetadataWriter.java  |   6 +-
 .../apache/hudi/keygen/BuiltinKeyGenerator.java|   1 -
 .../org/apache/hudi/keygen/SimpleKeyGenerator.java |   7 ++
 .../datasources/SparkParsePartitionUtil.scala  |  12 +-
 .../org/apache/spark/sql/hudi/SparkAdapter.scala   |   2 +-
 .../functional/TestConsistentBucketIndex.java  |   2 +-
 .../hudi/table/TestHoodieMergeOnReadTable.java |  26 +++-
 .../TestHoodieSparkMergeOnReadTableCompaction.java |  19 ++-
 ...HoodieSparkMergeOnReadTableIncrementalRead.java |   2 +-
 .../TestHoodieSparkMergeOnReadTableRollback.java   |   9 +-
 .../hudi/testutils/HoodieClientTestHarness.java|   6 +-
 .../SparkClientFunctionalTestHarness.java  |  12 +-
 .../org/apache/hudi/BaseHoodieTableFileIndex.java  |  87 +
 .../java/org/apache/hudi/common/fs/FSUtils.java|  13 +-
 .../org/apache/hudi/common/model/BaseFile.java |   9 ++
 .../hudi/common/table/HoodieTableMetaClient.java   |   2 +-
 .../table/log/block/HoodieHFileDataBlock.java  |  26 ++--
 .../table/view/AbstractTableFileSystemView.java|  13 +-
 .../apache/hudi/common/util/CollectionUtils.java   |   9 ++
 .../java/org/apache/hudi/hadoop/CachingPath.java   |  59 -
 .../org/apache/hudi/hadoop/SerializablePath.java   |   9 +-
 .../apache/hudi/io/storage/HoodieHFileUtils.java   |   4 +-
 .../apache/hudi/metadata/BaseTableMetadata.java| 135 -
 .../hudi/metadata/HoodieBackedTableMetadata.java   |  11 +-
 .../hudi/metadata/HoodieMetadataPayload.java   |  18 ++-
 .../apache/hudi/metadata/HoodieTableMetadata.java  |  21 +++-
 .../hudi/common/testutils/HoodieTestUtils.java |  35 +++---
 .../hudi/hadoop/testutils/InputFormatTestUtil.java |   5 +-
 .../apache/hudi/SparkHoodieTableFileIndex.scala|  28 +++--
 .../hudi/functional/TestTimeTravelQuery.scala  |   6 +-
 .../apache/spark/sql/adapter/Spark2Adapter.scala   |   2 +-
 .../datasources/Spark2ParsePartitionUtil.scala |  14 +--
 .../apache/hudi/spark3/internal/ReflectUtil.java   |   4 +-
 .../spark/sql/adapter/BaseSpark3Adapter.scala  |   4 +-
 .../datasources/Spark3ParsePartitionUtil.scala |  48 
 .../functional/TestHoodieDeltaStreamer.java|  10 +-
 .../utilities/sources/TestHoodieIncrSource.java|   4 +-
 38 files changed, 451 insertions(+), 233 deletions(-)

diff --git a/docker/demo/config/dfs-source.properties 
b/docker/demo/config/dfs-source.properties
index ac7080e141..a90629ef8e 100644
--- a/docker/demo/config/dfs-source.properties
+++ b/docker/demo/config/dfs-source.properties
@@ -19,6 +19,10 @@ include=base.properties
 # Key fields, for kafka example
 hoodie.datasource.write.recordkey.field=key
 hoodie.datasource.write.partitionpath.field=date
+# NOTE: We have to duplicate configuration since this is being used
+#   w/ both Spark and DeltaStreamer
+hoodie.table.recordkey.fields=key
+hoodie.table.partition.fields=date
 # Schema provider props (change to absolute path based on your installation)
 
hoodie.deltastreamer.schemaprovider.source.schema.file=/var/demo/config/schema.avsc
 
hoodie.deltastreamer.schemaprovider.target.schema.file=/var/demo/config/schema.avsc
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index c7cc50967a..962875fb92 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -1006,21 +1006,21 @@ public abstract class HoodieBackedTableMetadataWriter 

[GitHub] [hudi] codope merged pull request #6016: [HUDI-4465] Optimizing file-listing sequence of Metadata Table

2022-09-08 Thread GitBox


codope merged PR #6016:
URL: https://github.com/apache/hudi/pull/6016





[GitHub] [hudi] codope commented on a diff in pull request #6016: [HUDI-4465] Optimizing file-listing sequence of Metadata Table

2022-09-08 Thread GitBox


codope commented on code in PR #6016:
URL: https://github.com/apache/hudi/pull/6016#discussion_r966582484


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/SimpleKeyGenerator.java:
##
@@ -46,6 +47,12 @@ public SimpleKeyGenerator(TypedProperties props) {
 
   SimpleKeyGenerator(TypedProperties props, String recordKeyField, String 
partitionPathField) {
 super(props);
+// Make sure key-generator is configured properly
+ValidationUtils.checkArgument(recordKeyField == null || 
!recordKeyField.isEmpty(),
+"Record key field has to be non-empty!");
+ValidationUtils.checkArgument(partitionPathField == null || 
!partitionPathField.isEmpty(),

Review Comment:
   Discussed offline. It will be taken up separately. @alexeykudinkin, in case 
you have a JIRA, please link it here. For simple keygen we need the validation 
because some tests were misconfigured to pass “” as the partition fields.






[GitHub] [hudi] LinMingQiang commented on issue #6618: Caused by: org.apache.http.NoHttpResponseException: xxxxxx:34812 failed to respond[SUPPORT]

2022-09-08 Thread GitBox


LinMingQiang commented on issue #6618:
URL: https://github.com/apache/hudi/issues/6618#issuecomment-1241441100

   I have encountered this problem; this PR may solve it: 
https://github.com/apache/hudi/pull/6393





[GitHub] [hudi] Zhifeiyu commented on issue #6640: [SUPPORT] HUDI partition table duplicate data cow hudi 0.10.0 flink 1.13.1

2022-09-08 Thread GitBox


Zhifeiyu commented on issue #6640:
URL: https://github.com/apache/hudi/issues/6640#issuecomment-1241440579

   mark





[jira] [Updated] (HUDI-4818) Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex

2022-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4818:
-
Labels: pull-request-available  (was: )

> Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex
> ---
>
> Key: HUDI-4818
> URL: https://issues.apache.org/jira/browse/HUDI-4818
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently using `CustomKeyGenerator` with the partition-path config 
> {hoodie.datasource.write.partitionpath.field=ts:timestamp} fails w/
> {code:java}
> Caused by: java.lang.RuntimeException: Failed to cast value `2022-05-11` to 
> `LongType` for partition column `ts_ms`
>   at 
> org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:72)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$1(Spark3ParsePartitionUtil.scala:65)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:63)
>   at 
> org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionPath(SparkHoodieTableFileIndex.scala:274)
>   at 
> org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionColumnValues(SparkHoodieTableFileIndex.scala:258)
>   at 
> org.apache.hudi.BaseHoodieTableFileIndex.lambda$getAllQueryPartitionPaths$3(BaseHoodieTableFileIndex.java:190)
>   at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>   at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:193)
>  {code}
>  
> This occurs b/c SparkHoodieTableFileIndex produces incorrect partition schema 
> at XXX
> where it properly handles only `TimestampBasedKeyGenerator`s but not the 
> other key-generators that might be changing the data-type of the 
> partition-value as compared to the source partition-column (in this case it 
> has `ts` as a long in the source table schema, but it produces 
> partition-value as string)





[GitHub] [hudi] alexeykudinkin opened a new pull request, #6641: [WIP][HUDI-4818] Fixing SparkHoodieTableFileIndex handling of KeyGenerators changing the type of returned partition-value

2022-09-08 Thread GitBox


alexeykudinkin opened a new pull request, #6641:
URL: https://github.com/apache/hudi/pull/6641

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-09-08 Thread GitBox


hudi-bot commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1241435958

   
   ## CI report:
   
   * 252b9f49e2c86c7aad7908e5887064b0d0b36932 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11258)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Created] (HUDI-4818) Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex

2022-09-08 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-4818:
-

 Summary: Using CustomKeyGenerator fails w/ 
SparkHoodieTableFileIndex
 Key: HUDI-4818
 URL: https://issues.apache.org/jira/browse/HUDI-4818
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.13.0


Currently using `CustomKeyGenerator` with the partition-path config 
{hoodie.datasource.write.partitionpath.field=ts:timestamp} fails w/
{code:java}
Caused by: java.lang.RuntimeException: Failed to cast value `2022-05-11` to 
`LongType` for partition column `ts_ms`
at 
org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:72)
at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$1(Spark3ParsePartitionUtil.scala:65)
at scala.Option.map(Option.scala:230)
at 
org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:63)
at 
org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionPath(SparkHoodieTableFileIndex.scala:274)
at 
org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionColumnValues(SparkHoodieTableFileIndex.scala:258)
at 
org.apache.hudi.BaseHoodieTableFileIndex.lambda$getAllQueryPartitionPaths$3(BaseHoodieTableFileIndex.java:190)
at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at 
java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at 
org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:193)
 {code}
 

This occurs b/c SparkHoodieTableFileIndex produces incorrect partition schema 
at XXX

where it properly handles only `TimestampBasedKeyGenerator`s but not the other 
key-generators that might be changing the data-type of the partition-value as 
compared to the source partition-column (in this case it has `ts` as a long in 
the source table schema, but it produces partition-value as string)
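
For reproduction context, the failing setup looks roughly like this (a sketch; 
the record-key field, path, and the omitted timestamp-keygen properties are 
illustrative):

{code:scala}
// Sketch of a write using CustomKeyGenerator with a timestamp partition field.
df.write.format("hudi")
  .option("hoodie.datasource.write.keygenerator.class",
    "org.apache.hudi.keygen.CustomKeyGenerator")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "ts:timestamp")
  .mode("append")
  .save("/path/to/table")
{code}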





[jira] [Updated] (HUDI-4818) Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex

2022-09-08 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4818:
--
Story Points: 4  (was: 2)

> Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex
> ---
>
> Key: HUDI-4818
> URL: https://issues.apache.org/jira/browse/HUDI-4818
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Currently using `CustomKeyGenerator` with the partition-path config 
> {hoodie.datasource.write.partitionpath.field=ts:timestamp} fails w/
> {code:java}
> Caused by: java.lang.RuntimeException: Failed to cast value `2022-05-11` to 
> `LongType` for partition column `ts_ms`
>   at 
> org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:72)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$1(Spark3ParsePartitionUtil.scala:65)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:63)
>   at 
> org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionPath(SparkHoodieTableFileIndex.scala:274)
>   at 
> org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionColumnValues(SparkHoodieTableFileIndex.scala:258)
>   at 
> org.apache.hudi.BaseHoodieTableFileIndex.lambda$getAllQueryPartitionPaths$3(BaseHoodieTableFileIndex.java:190)
>   at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>   at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:193)
>  {code}
>  
> This occurs b/c SparkHoodieTableFileIndex produces incorrect partition schema 
> at XXX
> where it properly handles only `TimestampBasedKeyGenerator`s but not the 
> other key-generators that might be changing the data-type of the 
> partition-value as compared to the source partition-column (in this case it 
> has `ts` as a long in the source table schema, but it produces 
> partition-value as string)





[GitHub] [hudi] zwj0110 opened a new issue, #6640: [SUPPORT] HUDI partition table duplicate data cow

2022-09-08 Thread GitBox


zwj0110 opened a new issue, #6640:
URL: https://github.com/apache/hudi/issues/6640

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   A clear and concise description of the problem.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   ## hudi sink config
   ```sql
   'connector' = 'hudi',
   'hoodie.table.name' = 'xxx',
   'table.type' = 'COPY_ON_WRITE',
   'path' = '',
   'hoodie.datasource.write.keygenerator.type' = 'COMPLEX',
   'hoodie.datasource.write.recordkey.field' = 'id',
   'hoodie.cleaner.policy' = 'KEEP_LATEST_FILE_VERSIONS',
   'hoodie.cleaner.fileversions.retained' = '20',
   'hoodie.keep.min.commits' = '30',
   'hoodie.keep.max.commits' = '40',
   'hoodie.cleaner.commits.retained' = '20',
   'write.operation' = 'upsert',
   'write.commit.ack.timeout' = '6000',
   'write.sort.memory' = '128',
   'write.task.max.size' = '1024',
   'write.merge.max_memory' = '100',
   'write.tasks' = '96',
   'write.precombine' = 'true',
   'write.precombine.field' = 'meta_es_offset',
   'index.state.ttl' = '0',
   'index.global.enabled' = 'false',
   'hive_sync.enable' = 'true',
   'hive_sync.table' = 'xxx',
   'hive_sync.auto_create_db' = 'true',
   'hive_sync.mode' = 'hms',
   'hive_sync.metastore.uris' = 'xxx',
   'hive_sync.db' = 'xxx',
   'hoodie.datasource.write.partitionpath.field' = 'year,month,day',
   'hoodie.datasource.write.hive_style_partitioning' = 'true',
   'hive_sync.partition_fields' = 'year,month,day',
   'hive_sync.partition_extractor_class' = 
'org.apache.hudi.hive.MultiPartKeysValueExtractor',
   'index.bootstrap.enabled' = 'true'
   ```
   ## description:
   After initializing the history data, we set `scan.startup.mode` to `timestamp` 
and set the timestamp ahead; duplicates then occur. If we restart the job from a 
checkpoint, the data is correct.
   ## data duplicate result:
   
![image](https://user-images.githubusercontent.com/44424308/189260717-03668b84-c3dd-4785-8b5c-a874f43a084f.png)
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * flink version : 1.13.1
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.1.0
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wqwl611 commented on pull request #6636: add new index RANGE_BUCKET , when primary key is auto-increment like most mysql table

2022-09-08 Thread GitBox


wqwl611 commented on PR #6636:
URL: https://github.com/apache/hudi/pull/6636#issuecomment-1241432142

   > Nice feature. Can we log a JIRA issue and change the commit title to 
"[HUDI-${JIRA_ID}] ${your title}"?
   
   @danny0405 Yes, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wqwl611 commented on pull request #6636: add new index RANGE_BUCKET , when primary key is auto-increment like most mysql table

2022-09-08 Thread GitBox


wqwl611 commented on PR #6636:
URL: https://github.com/apache/hudi/pull/6636#issuecomment-1241430662

   > Nice feature. Can we log a JIRA issue and change the commit title to 
"[HUDI-${JIRA_ID}] ${your title}"?
   
   Yes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YuweiXiao commented on pull request #6632: [HUDI-4753] more accurate record size estimation for log writing and spillable map

2022-09-08 Thread GitBox


YuweiXiao commented on PR #6632:
URL: https://github.com/apache/hudi/pull/6632#issuecomment-1241421818

   @yihua Hey Yihua, could you help review this PR. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YuweiXiao commented on pull request #6629: [HUDI-4807] Use base table instant for metadata table initialization

2022-09-08 Thread GitBox


YuweiXiao commented on PR #6629:
URL: https://github.com/apache/hudi/pull/6629#issuecomment-1241421204

   > @YuweiXiao Can you please add more details in the PR description? It would 
be great if you add a test as well.
   
   I have added some details. It does not feel easy to add a test for this, as 
the failure only occurs when we modify `HoodieAppendHandle` (i.e., in the 
consistent hashing PR).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #6595: [HUDI-4777] Fix flink gen bucket index of mor table not consistent wi…

2022-09-08 Thread GitBox


danny0405 commented on PR #6595:
URL: https://github.com/apache/hudi/pull/6595#issuecomment-1241417330

   > > 
   > 
   > I will give a PR fix on the Spark side too, but on the Flink side I think 
deduplication should also be enabled as the default option for MOR tables. When 
duplicates are written to the log file, it is very hard for compaction to read 
them, and it also makes the MOR table unstable because the duplicate records are 
read into memory twice.
   
   The initial idea is to keep the details of the log records, such as in the 
cdc change log feed.
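   
   To make the trade-off concrete, a hedged Java sketch (`HoodiePipeline` is Hudi's 
Flink API; the path, columns, and table name below are placeholders) of how 
changelog mode retains the per-record change details instead of deduplicating:
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   import org.apache.hudi.util.HoodiePipeline;
   
   Map<String, String> options = new HashMap<>();
   options.put("path", "s3://bucket/tbl");
   options.put("table.type", "MERGE_ON_READ");
   // Retain each intermediate change (+I/-U/+U/-D) in the log files so that a
   // downstream CDC consumer sees the full change-log feed.
   options.put("changelog.enabled", "true");
   HoodiePipeline.Builder builder = HoodiePipeline.builder("tbl")
       .column("id int")
       .column("ts bigint")
       .pk("id")
       .options(options);
   ```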


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #6634: [HUDI-4813] Fix infer keygen not work in sparksql side issue

2022-09-08 Thread GitBox


danny0405 commented on code in PR #6634:
URL: https://github.com/apache/hudi/pull/6634#discussion_r966556852


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -787,9 +787,13 @@ object DataSourceOptionsHelper {
 
   def inferKeyGenClazz(props: TypedProperties): String = {
 val partitionFields = 
props.getString(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), null)
+val recordsKeyFields = 
props.getString(DataSourceWriteOptions.RECORDKEY_FIELD.key(), 
DataSourceWriteOptions.RECORDKEY_FIELD.defaultValue())
+genKeyGenerator(recordsKeyFields, partitionFields)
+  }
+
+  def genKeyGenerator(recordsKeyFields: String, partitionFields: String): 
String = {
 if (partitionFields != null) {

Review Comment:
   Let's just rename `genKeyGenerator` to `inferKeyGenClazz`; that is okay because 
they have different method signatures. Also, can we write a test case for Spark 
SQL?
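   
   For reference, a rough Java sketch of the inference rule under discussion (a 
simplified assumption, not the exact Scala implementation; the real 
`inferKeyGenClazz` also special-cases timestamp-style partition fields):
   
   ```java
   static String inferKeyGenClazz(String recordKeyFields, String partitionFields) {
     if (partitionFields == null || partitionFields.isEmpty()) {
       return "org.apache.hudi.keygen.NonpartitionedKeyGenerator";
     }
     // A composite record key or composite partition path needs the complex generator.
     boolean composite = recordKeyFields.contains(",") || partitionFields.contains(",");
     return composite
         ? "org.apache.hudi.keygen.ComplexKeyGenerator"
         : "org.apache.hudi.keygen.SimpleKeyGenerator";
   }
   ```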



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-4811) Fix the checkstyle of hudi flink

2022-09-08 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602058#comment-17602058
 ] 

Danny Chen commented on HUDI-4811:
--

Fixed via master branch: 13eb892081fc4ddd5e1592ef8698831972012666

> Fix the checkstyle of hudi flink
> 
>
> Key: HUDI-4811
> URL: https://issues.apache.org/jira/browse/HUDI-4811
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink-sql
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4811) Fix the checkstyle of hudi flink

2022-09-08 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen resolved HUDI-4811.
--

> Fix the checkstyle of hudi flink
> 
>
> Key: HUDI-4811
> URL: https://issues.apache.org/jira/browse/HUDI-4811
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink-sql
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4811) Fix the checkstyle of hudi flink

2022-09-08 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-4811:
-
Fix Version/s: 0.12.1

> Fix the checkstyle of hudi flink
> 
>
> Key: HUDI-4811
> URL: https://issues.apache.org/jira/browse/HUDI-4811
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink-sql
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated (e1da06fa70 -> 13eb892081)

2022-09-08 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from e1da06fa70 [MINOR] Typo fix for kryo in flink-bundle (#6639)
 add 13eb892081 [HUDI-4811] Fix the checkstyle of hudi flink (#6633)

No new revisions were added by this update.

Summary of changes:
 hudi-client/hudi-flink-client/pom.xml  | 561 +++--
 .../hudi/client/FlinkTaskContextSupplier.java  |   2 +-
 .../apache/hudi/client/HoodieFlinkWriteClient.java |   8 +
 .../client/common/HoodieFlinkEngineContext.java|   7 +-
 .../hudi/execution/FlinkLazyInsertIterable.java|   5 +
 .../java/org/apache/hudi/io/MiniBatchHandle.java   |   3 +-
 .../io/storage/row/HoodieRowDataParquetWriter.java |   2 +-
 .../row/parquet/ParquetSchemaConverter.java|   5 +-
 .../FlinkHoodieBackedTableMetadataWriter.java  |   3 +
 .../hudi/table/ExplicitWriteHandleTable.java   |  22 +-
 .../hudi/table/HoodieFlinkCopyOnWriteTable.java|  23 +-
 .../hudi/table/HoodieFlinkMergeOnReadTable.java|   3 +
 .../org/apache/hudi/table/HoodieFlinkTable.java|   3 +
 .../commit/FlinkDeleteCommitActionExecutor.java|   3 +
 .../table/action/commit/FlinkDeleteHelper.java |   3 +
 .../commit/FlinkInsertCommitActionExecutor.java|   3 +
 .../FlinkInsertOverwriteCommitActionExecutor.java  |   3 +
 ...nkInsertOverwriteTableCommitActionExecutor.java |   3 +
 .../FlinkInsertPreppedCommitActionExecutor.java|   3 +
 .../hudi/table/action/commit/FlinkMergeHelper.java |   3 +
 .../commit/FlinkUpsertCommitActionExecutor.java|   3 +
 .../FlinkUpsertPreppedCommitActionExecutor.java|   3 +
 .../delta/BaseFlinkDeltaCommitActionExecutor.java  |   3 +
 .../FlinkUpsertDeltaCommitActionExecutor.java  |   3 +
 ...linkUpsertPreppedDeltaCommitActionExecutor.java |   3 +
 .../common/TestHoodieFlinkEngineContext.java   |   2 +-
 .../index/bloom/TestFlinkHoodieBloomIndex.java |  13 +-
 .../testutils/HoodieFlinkClientTestHarness.java|   9 +-
 .../testutils/HoodieFlinkWriteableTestTable.java   |   5 +-
 .../apache/hudi/configuration/FlinkOptions.java|   2 +-
 .../hudi/configuration/HadoopConfigurations.java   |   2 +-
 .../sink/clustering/FlinkClusteringConfig.java |  10 +-
 .../sink/clustering/HoodieFlinkClusteringJob.java  |   2 +-
 .../hudi/sink/compact/CompactionPlanOperator.java  |   2 +-
 .../hudi/sink/compact/HoodieFlinkCompactor.java|   2 +-
 .../java/org/apache/hudi/sink/utils/Pipelines.java |  18 +-
 .../hudi/source/stats/ExpressionEvaluator.java |   1 +
 .../apache/hudi/streamer/FlinkStreamerConfig.java  |   4 +-
 .../org/apache/hudi/table/HoodieTableSource.java   |   2 +-
 .../apache/hudi/table/catalog/HiveSchemaUtils.java |  16 +-
 .../hudi/table/catalog/HoodieHiveCatalog.java  |   2 +-
 .../org/apache/hudi/util/AvroSchemaConverter.java  |   2 +-
 .../java/org/apache/hudi/util/ClusteringUtil.java  |   2 +-
 .../java/org/apache/hudi/util/CompactionUtil.java  |   2 +-
 .../java/org/apache/hudi/util/DataTypeUtils.java   |   2 +-
 .../hudi/util/FlinkStateBackendConverter.java  |   6 +-
 .../java/org/apache/hudi/util/HoodiePipeline.java  |  62 +--
 .../java/org/apache/hudi/util/StreamerUtil.java|   4 +-
 .../apache/hudi/sink/ITTestDataStreamWrite.java|   4 +-
 .../org/apache/hudi/sink/TestWriteMergeOnRead.java |  52 +-
 .../sink/compact/ITTestHoodieFlinkCompactor.java   |   2 +-
 .../sink/compact/TestCompactionPlanStrategy.java   |  39 +-
 .../hudi/source/stats/TestExpressionEvaluator.java |   2 +-
 .../apache/hudi/table/ITTestHoodieDataSource.java  |   4 +-
 .../hudi/table/catalog/HoodieCatalogTestUtils.java |   8 +-
 .../table/catalog/TestHoodieCatalogFactory.java|   2 +-
 .../hudi/table/catalog/TestHoodieHiveCatalog.java  |   4 +-
 .../test/java/org/apache/hudi/utils/TestData.java  |  16 +-
 .../hudi-flink/src/test/resources/hive-site.xml|  16 +-
 .../test-catalog-factory-conf/hive-site.xml|  10 +-
 .../format/cow/vector/HeapMapColumnVector.java |   2 +-
 61 files changed, 546 insertions(+), 470 deletions(-)



[GitHub] [hudi] danny0405 merged pull request #6633: [HUDI-4811] Fix the checkstyle of hudi flink

2022-09-08 Thread GitBox


danny0405 merged PR #6633:
URL: https://github.com/apache/hudi/pull/6633


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-09-08 Thread GitBox


hudi-bot commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1241399541

   
   ## CI report:
   
   * 3ae4fb8b374e12b1097a86d56e5996b7dc0ac79f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11215)
 
   * 252b9f49e2c86c7aad7908e5887064b0d0b36932 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11258)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #6636: add new index RANGE_BUCKET , when primary key is auto-increment like most mysql table

2022-09-08 Thread GitBox


danny0405 commented on PR #6636:
URL: https://github.com/apache/hudi/pull/6636#issuecomment-1241399179

   Nice feature. Can we log a JIRA issue and change the commit title to 
"[HUDI-${JIRA_ID}] ${your title}"?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [MINOR] Typo fix for kryo in flink-bundle (#6639)

2022-09-08 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new e1da06fa70 [MINOR] Typo fix for kryo in flink-bundle (#6639)
e1da06fa70 is described below

commit e1da06fa7024b9fa84811ec030b8f5f06ce52b06
Author: Xingcan Cui 
AuthorDate: Thu Sep 8 21:25:31 2022 -0400

[MINOR] Typo fix for kryo in flink-bundle (#6639)
---
 packaging/hudi-flink-bundle/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packaging/hudi-flink-bundle/pom.xml 
b/packaging/hudi-flink-bundle/pom.xml
index 8ccf92c77b..de0478ff06 100644
--- a/packaging/hudi-flink-bundle/pom.xml
+++ b/packaging/hudi-flink-bundle/pom.xml
@@ -130,7 +130,7 @@
   javax.servlet:javax.servlet-api
 
   
-  com.esotericsoftware:kryo-shaded
+  com.esotericsoftware:kryo-shaded
 
   
org.apache.flink:${flink.hadoop.compatibility.artifactId}
   org.apache.flink:flink-json



[GitHub] [hudi] danny0405 merged pull request #6639: [MINOR] Typo fix for kryo in flink-bundle

2022-09-08 Thread GitBox


danny0405 merged PR #6639:
URL: https://github.com/apache/hudi/pull/6639


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-09-08 Thread GitBox


hudi-bot commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1241397077

   
   ## CI report:
   
   * 3ae4fb8b374e12b1097a86d56e5996b7dc0ac79f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11215)
 
   * 252b9f49e2c86c7aad7908e5887064b0d0b36932 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] dongkelun commented on a diff in pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning

2022-09-08 Thread GitBox


dongkelun commented on code in PR #5478:
URL: https://github.com/apache/hudi/pull/5478#discussion_r966535607


##
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:
##
@@ -539,4 +543,19 @@ public void handle(@NotNull Context context) throws 
Exception {
   }
 }
   }
+
+  /**
+   * Determine whether to throw an exception when local view of table's 
timeline is behind that of client's view.
+   */
+  private boolean shouldThrowExceptionIfLocalViewBehind(HoodieTimeline 
localTimeline, String lastInstantTs) {

Review Comment:
   I modified the logic of the `shouldThrowExceptionIfLocalViewBehind` method. 
I think it is better now.
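   
   For readers of this thread, one plausible shape of that check (an illustrative 
sketch only, not the PR's exact logic):
   
   ```java
   private boolean shouldThrowExceptionIfLocalViewBehind(HoodieTimeline localTimeline,
                                                         String lastInstantTs) {
     // If the local timeline already contains the client's instant (or the
     // timeline has moved past it, e.g. after async cleaning), the local view
     // is not truly behind and no exception should be thrown.
     return !localTimeline.containsOrBeforeTimelineStarts(lastInstantTs);
   }
   ```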



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6639: [MINOR] Typo fix for kryo in flink-bundle

2022-09-08 Thread GitBox


hudi-bot commented on PR #6639:
URL: https://github.com/apache/hudi/pull/6639#issuecomment-1241323712

   
   ## CI report:
   
   * e83471bf24d848fbb3c8ec16decf1bcbe0d5449a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11257)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4817) Markers are not deleted after bootstrap operation

2022-09-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4817:

Sprint: 2022/09/05

> Markers are not deleted after bootstrap operation
> -
>
> Key: HUDI-4817
> URL: https://issues.apache.org/jira/browse/HUDI-4817
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Affects Versions: 0.12.0
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: bootstrap
> Fix For: 0.12.1
>
>
> After the bootstrap operation finishes, the markers are left behind, not 
> cleaned up.
>  
> {code:java}
> ls -l .hoodie/.temp/02/
> total 24
> -rw-r--r--   MARKERS.type
> -rw-r--r--   MARKERS0
> -rw-r--r--   MARKERS4 {code}
> {code:java}
> ls -l .hoodie/
> total 40
> -rw-r--r--   02.commit
> -rw-r--r--   02.commit.requested
> -rw-r--r--   02.inflight
> drwxr-xr-x   archived
> -rw-r--r--   hoodie.properties
> drwxr-xr-x   metadata {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4817) Markers are not deleted after bootstrap operation

2022-09-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4817:

Fix Version/s: 0.12.1

> Markers are not deleted after bootstrap operation
> -
>
> Key: HUDI-4817
> URL: https://issues.apache.org/jira/browse/HUDI-4817
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.12.1
>
>
> After the bootstrap operation finishes, the markers are left behind, not 
> cleaned up.
>  
> {code:java}
> ls -l .hoodie/.temp/02/
> total 24
> -rw-r--r--   MARKERS.type
> -rw-r--r--   MARKERS0
> -rw-r--r--   MARKERS4 {code}
> {code:java}
> ls -l .hoodie/
> total 40
> -rw-r--r--   02.commit
> -rw-r--r--   02.commit.requested
> -rw-r--r--   02.inflight
> drwxr-xr-x   archived
> -rw-r--r--   hoodie.properties
> drwxr-xr-x   metadata {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4817) Markers are not deleted after bootstrap operation

2022-09-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4817:

Priority: Critical  (was: Major)

> Markers are not deleted after bootstrap operation
> -
>
> Key: HUDI-4817
> URL: https://issues.apache.org/jira/browse/HUDI-4817
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Affects Versions: 0.12.0
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: bootstrap
> Fix For: 0.12.1
>
>
> After the bootstrap operation finishes, the markers are left behind, not 
> cleaned up.
>  
> {code:java}
> ls -l .hoodie/.temp/02/
> total 24
> -rw-r--r--   MARKERS.type
> -rw-r--r--   MARKERS0
> -rw-r--r--   MARKERS4 {code}
> {code:java}
> ls -l .hoodie/
> total 40
> -rw-r--r--   02.commit
> -rw-r--r--   02.commit.requested
> -rw-r--r--   02.inflight
> drwxr-xr-x   archived
> -rw-r--r--   hoodie.properties
> drwxr-xr-x   metadata {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4817) Markers are not deleted after bootstrap operation

2022-09-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4817:

Affects Version/s: 0.12.0

> Markers are not deleted after bootstrap operation
> -
>
> Key: HUDI-4817
> URL: https://issues.apache.org/jira/browse/HUDI-4817
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Affects Versions: 0.12.0
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: bootstrap
> Fix For: 0.12.1
>
>
> After the bootstrap operation finishes, the markers are left behind, not 
> cleaned up.
>  
> {code:java}
> ls -l .hoodie/.temp/02/
> total 24
> -rw-r--r--   MARKERS.type
> -rw-r--r--   MARKERS0
> -rw-r--r--   MARKERS4 {code}
> {code:java}
> ls -l .hoodie/
> total 40
> -rw-r--r--   02.commit
> -rw-r--r--   02.commit.requested
> -rw-r--r--   02.inflight
> drwxr-xr-x   archived
> -rw-r--r--   hoodie.properties
> drwxr-xr-x   metadata {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4817) Markers are not deleted after bootstrap operation

2022-09-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-4817:
---

Assignee: Ethan Guo

> Markers are not deleted after bootstrap operation
> -
>
> Key: HUDI-4817
> URL: https://issues.apache.org/jira/browse/HUDI-4817
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>
> After the bootstrap operation finishes, the markers are left behind, not 
> cleaned up.
>  
> {code:java}
> ls -l .hoodie/.temp/02/
> total 24
> -rw-r--r--   MARKERS.type
> -rw-r--r--   MARKERS0
> -rw-r--r--   MARKERS4 {code}
> {code:java}
> ls -l .hoodie/
> total 40
> -rw-r--r--   02.commit
> -rw-r--r--   02.commit.requested
> -rw-r--r--   02.inflight
> drwxr-xr-x   archived
> -rw-r--r--   hoodie.properties
> drwxr-xr-x   metadata {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4817) Markers are not deleted after bootstrap operation

2022-09-08 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-4817:
---

 Summary: Markers are not deleted after bootstrap operation
 Key: HUDI-4817
 URL: https://issues.apache.org/jira/browse/HUDI-4817
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ethan Guo


After the bootstrap operation finishes, the markers are left behind, not 
cleaned up.

 
{code:java}
ls -l .hoodie/.temp/02/
total 24
-rw-r--r--   MARKERS.type
-rw-r--r--   MARKERS0
-rw-r--r--   MARKERS4 {code}
{code:java}
ls -l .hoodie/
total 40
-rw-r--r--   02.commit
-rw-r--r--   02.commit.requested
-rw-r--r--   02.inflight
drwxr-xr-x   archived
-rw-r--r--   hoodie.properties
drwxr-xr-x   metadata {code}
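
A hedged sketch of the cleanup expected after each write (assuming Hudi's 
WriteMarkers API; wiring this into the bootstrap code path is what this issue 
calls for):
{code:java}
// After a successful commit, the marker directory under .hoodie/.temp/<instant>
// should be removed. Sketch only, not the actual bootstrap call site:
WriteMarkers writeMarkers = WriteMarkersFactory.get(
    config.getMarkersType(), table, instantTime);
writeMarkers.quietDeleteMarkerDir(context, config.getMarkersDeleteParallelism());
{code}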
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4817) Markers are not deleted after bootstrap operation

2022-09-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4817:

Labels: bootstrap  (was: )

> Markers are not deleted after bootstrap operation
> -
>
> Key: HUDI-4817
> URL: https://issues.apache.org/jira/browse/HUDI-4817
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: bootstrap
> Fix For: 0.12.1
>
>
> After the bootstrap operation finishes, the markers are left behind, not 
> cleaned up.
>  
> {code:java}
> ls -l .hoodie/.temp/02/
> total 24
> -rw-r--r--   MARKERS.type
> -rw-r--r--   MARKERS0
> -rw-r--r--   MARKERS4 {code}
> {code:java}
> ls -l .hoodie/
> total 40
> -rw-r--r--   02.commit
> -rw-r--r--   02.commit.requested
> -rw-r--r--   02.inflight
> drwxr-xr-x   archived
> -rw-r--r--   hoodie.properties
> drwxr-xr-x   metadata {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4817) Markers are not deleted after bootstrap operation

2022-09-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4817:

Component/s: bootstrap

> Markers are not deleted after bootstrap operation
> -
>
> Key: HUDI-4817
> URL: https://issues.apache.org/jira/browse/HUDI-4817
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: bootstrap
> Fix For: 0.12.1
>
>
> After the bootstrap operation finishes, the markers are left behind, not 
> cleaned up.
>  
> {code:java}
> ls -l .hoodie/.temp/02/
> total 24
> -rw-r--r--   MARKERS.type
> -rw-r--r--   MARKERS0
> -rw-r--r--   MARKERS4 {code}
> {code:java}
> ls -l .hoodie/
> total 40
> -rw-r--r--   02.commit
> -rw-r--r--   02.commit.requested
> -rw-r--r--   02.inflight
> drwxr-xr-x   archived
> -rw-r--r--   hoodie.properties
> drwxr-xr-x   metadata {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6637: Fix AWSDmsAvroPayload#getInsertValue,combineAndGetUpdateValue to invo…

2022-09-08 Thread GitBox


hudi-bot commented on PR #6637:
URL: https://github.com/apache/hudi/pull/6637#issuecomment-1241264591

   
   ## CI report:
   
   * 5b7d712d175b64de73ce924bbdb95962ebb790fe Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11256)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4816) Update asf-site docs for GlobalDeleteKeyGenerator

2022-09-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4816:

Description: The GlobalDeleteKeyGenerator should be used with a global index 
to delete records based solely on the record key, and it works only for batches 
with deletes.  The key generator can be used for both partitioned and 
non-partitioned tables.  Note that when using GlobalDeleteKeyGenerator, the 
config hoodie.[bloom|simple|hbase].index.update.partition.path should be set to 
false to avoid redundant data being written to storage.

> Update asf-site docs for GlobalDeleteKeyGenerator
> -
>
> Key: HUDI-4816
> URL: https://issues.apache.org/jira/browse/HUDI-4816
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.12.1
>
>
> The GlobalDeleteKeyGenerator should be used with a global index to delete 
> records based solely on the record key, and it works only for batches with 
> deletes.  The key generator can be used for both partitioned and 
> non-partitioned tables.  Note that when using GlobalDeleteKeyGenerator, the 
> config hoodie.[bloom|simple|hbase].index.update.partition.path should be set to 
> false to avoid redundant data being written to storage.
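
For example, a hedged write-path sketch matching the description (standard Hudi 
option keys; the DataFrame, record-key field, table name, and path are 
placeholders):
{code:java}
// deletesDf holds only the records to delete, keyed by the record key.
deletesDf.write().format("hudi")
    .option("hoodie.table.name", "my_table")
    .option("hoodie.datasource.write.operation", "delete")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.keygenerator.class",
        "org.apache.hudi.keygen.GlobalDeleteKeyGenerator")
    .option("hoodie.index.type", "GLOBAL_BLOOM")
    // Per the note above: prevents redundant data under a changed partition path.
    .option("hoodie.bloom.index.update.partition.path", "false")
    .mode("append")
    .save("/path/to/my_table");
{code}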



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4816) Update asf-site docs for GlobalDeleteKeyGenerator

2022-09-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4816:

Component/s: docs

> Update asf-site docs for GlobalDeleteKeyGenerator
> -
>
> Key: HUDI-4816
> URL: https://issues.apache.org/jira/browse/HUDI-4816
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4816) Update asf-site docs for GlobalDeleteKeyGenerator

2022-09-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-4816:
---

Assignee: Bhavani Sudha

> Update asf-site docs for GlobalDeleteKeyGenerator
> -
>
> Key: HUDI-4816
> URL: https://issues.apache.org/jira/browse/HUDI-4816
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Bhavani Sudha
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4816) Update asf-site docs for GlobalDeleteKeyGenerator

2022-09-08 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-4816:
---

 Summary: Update asf-site docs for GlobalDeleteKeyGenerator
 Key: HUDI-4816
 URL: https://issues.apache.org/jira/browse/HUDI-4816
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4816) Update asf-site docs for GlobalDeleteKeyGenerator

2022-09-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4816:

Fix Version/s: 0.12.1

> Update asf-site docs for GlobalDeleteKeyGenerator
> -
>
> Key: HUDI-4816
> URL: https://issues.apache.org/jira/browse/HUDI-4816
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] umehrot2 commented on pull request #6637: Fix AWSDmsAvroPayload#getInsertValue,combineAndGetUpdateValue to invo…

2022-09-08 Thread GitBox


umehrot2 commented on PR #6637:
URL: https://github.com/apache/hudi/pull/6637#issuecomment-1241218981

   Fix LGTM.
   However, we should not be adding this whole end-to-end test to 
`TestCOWDataSource` and `TestMORDataSource`. Those tests exist to cover overall 
datasource functionality and should not be used to test something as specific 
as the DMS payload. There should be no need to run an entire end-to-end test to 
discover this issue.
   
   There is a `TestAWSDmsAvroPayload` class. We should understand why tests in 
that class did not catch the issue, and just modify them or add a new test as 
needed to be able to catch this issue.
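   
   For example, a minimal unit-test sketch in that spirit (the inline schema and 
field names are illustrative assumptions; the real test should reuse 
`TestAWSDmsAvroPayload`'s setup):
   
   ```java
   Schema schema = new Schema.Parser().parse(
       "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
           + "{\"name\":\"id\",\"type\":\"int\"},"
           + "{\"name\":\"Op\",\"type\":\"string\"}]}");
   GenericRecord deleteRecord = new GenericData.Record(schema);
   deleteRecord.put("id", 1);
   deleteRecord.put("Op", "D"); // DMS marks deletes with Op = "D"
   AWSDmsAvroPayload payload = new AWSDmsAvroPayload(Option.of(deleteRecord));
   // The bug under review: the delete marker must be honored, so the payload
   // should resolve to an empty (deleted) record.
   assertFalse(payload.getInsertValue(schema).isPresent());
   ```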
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning

2022-09-08 Thread GitBox


hudi-bot commented on PR #5478:
URL: https://github.com/apache/hudi/pull/5478#issuecomment-1241218459

   
   ## CI report:
   
   * 57cd61a9a02c62becbf0b763d322d0f70e68b588 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11254)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6639: [MINOR] Typo fix for kryo in flink-bundle

2022-09-08 Thread GitBox


hudi-bot commented on PR #6639:
URL: https://github.com/apache/hudi/pull/6639#issuecomment-1241210693

   
   ## CI report:
   
   * e83471bf24d848fbb3c8ec16decf1bcbe0d5449a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11257)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6636: add new index RANGE_BUCKET , when primary key is auto-increment like most mysql table

2022-09-08 Thread GitBox


hudi-bot commented on PR #6636:
URL: https://github.com/apache/hudi/pull/6636#issuecomment-1241210655

   
   ## CI report:
   
   * b837b813fb706508b1fccc0924f839275e9373c3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11253)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6639: [MINOR] Typo fix for kryo in flink-bundle

2022-09-08 Thread GitBox


hudi-bot commented on PR #6639:
URL: https://github.com/apache/hudi/pull/6639#issuecomment-1241205871

   
   ## CI report:
   
   * e83471bf24d848fbb3c8ec16decf1bcbe0d5449a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xccui opened a new pull request, #6639: [MINOR] Typo fix for kryo in flink-bundle

2022-09-08 Thread GitBox


xccui opened a new pull request, #6639:
URL: https://github.com/apache/hudi/pull/6639

   ### Change Logs
   
   There's a typo in the `kryo-shaded` item in the flink-bundle pom file, which 
causes the lib not to be packaged into the bundle. I'm not sure if we should 
include it or just remove it.
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: high**
   
   The Kryo lib causes a lot of issues when we integrate Hudi with Flink. This 
change could potentially introduce dependency conflicts into existing jobs.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] bhasudha opened a new pull request, #6638: [DO NOT MERGE] [DOCS] Add tags to blog pages

2022-09-08 Thread GitBox


bhasudha opened a new pull request, #6638:
URL: https://github.com/apache/hudi/pull/6638

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #5320: [HUDI-3861] update tblp 'path' when rename table

2022-09-08 Thread GitBox


xushiyan commented on code in PR #5320:
URL: https://github.com/apache/hudi/pull/5320#discussion_r966357792


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTable.scala:
##
@@ -194,9 +194,14 @@ class TestAlterTable extends HoodieSparkSqlTestBase {
 val oldLocation = spark.sessionState.catalog.getTableMetadata(new 
TableIdentifier(tableName)).properties.get("path")
 spark.sql(s"alter table $tableName rename to $newTableName")
 val newLocation = spark.sessionState.catalog.getTableMetadata(new 
TableIdentifier(newTableName)).properties.get("path")
-assertResult(false)(
-  newLocation.equals(oldLocation)
-)
+if (HoodieSparkUtils.isSpark3_2) {

Review Comment:
   @KnightChess Spark 3.3 is supported; does this also apply to 3.3?



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTable.scala:
##
@@ -169,4 +168,77 @@ class TestAlterTable extends HoodieSparkSqlTestBase {
   }
 }
   }
+
+  test("Test Alter Rename Table") {
+withTempDir { tmp =>
+  Seq("cow", "mor").foreach { tableType =>
+val tableName = generateTableName
+// Create table
+spark.sql(
+  s"""
+ |create table $tableName (
+ |  id int,
+ |  name string,
+ |  price double,
+ |  ts long
+ |) using hudi
+ | tblproperties (
+ |  type = '$tableType',
+ |  primaryKey = 'id',
+ |  preCombineField = 'ts'
+ | )
+   """.stripMargin)
+
+// alter table name.
+val newTableName = s"${tableName}_1"
+val oldLocation = spark.sessionState.catalog.getTableMetadata(new 
TableIdentifier(tableName)).properties.get("path")
+spark.sql(s"alter table $tableName rename to $newTableName")
+val newLocation = spark.sessionState.catalog.getTableMetadata(new 
TableIdentifier(newTableName)).properties.get("path")
+if (HoodieSparkUtils.isSpark3_2) {
+  assertResult(false)(
+newLocation.equals(oldLocation)
+  )
+} else {
+  assertResult(None) (oldLocation)
+  assertResult(None) (newLocation)
+}
+
+
+// Create table with location
+val locTableName = s"${tableName}_loc"
+val tablePath = s"${tmp.getCanonicalPath}/$locTableName"
+spark.sql(
+  s"""
+ |create table $locTableName (
+ |  id int,
+ |  name string,
+ |  price double,
+ |  ts long
+ |) using hudi
+ | location '$tablePath'
+ | tblproperties (
+ |  type = '$tableType',
+ |  primaryKey = 'id',
+ |  preCombineField = 'ts'
+ | )
+   """.stripMargin)
+
+// alter table name.
+val newLocTableName = s"${locTableName}_1"
+val oldLocation2 = spark.sessionState.catalog.getTableMetadata(new 
TableIdentifier(locTableName))
+  .properties.get("path")
+spark.sql(s"alter table $locTableName rename to $newLocTableName")
+val newLocation2 = spark.sessionState.catalog.getTableMetadata(new 
TableIdentifier(newLocTableName))
+  .properties.get("path")
+if (HoodieSparkUtils.isSpark3_2) {

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-2733) Adding Thrift support in HiveSyncTool

2022-09-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2733:
-
Sprint: Cont' improve -  2021/01/24, Cont' improve -  2021/01/31, Cont' 
improve -  2022/02/07, Cont' improve -  2022/02/14  (was: Cont' improve -  
2021/01/24, Cont' improve -  2021/01/31, Cont' improve -  2022/02/07, Cont' 
improve -  2022/02/14, 2022/09/05)

> Adding Thrift support in HiveSyncTool
> -
>
> Key: HUDI-2733
> URL: https://issues.apache.org/jira/browse/HUDI-2733
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: meta-sync, Utilities
>Reporter: Satyam Raj
>Assignee: Satyam Raj
>Priority: Major
>  Labels: hive-sync, pull-request-available
> Fix For: 0.13.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Introduction of Thrift Metastore client to sync Hudi data in Hive warehouse.
> Suggested client to integrate with:
> https://github.com/akolb1/hclient/tree/master/tools-common



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2733) Adding Thrift support in HiveSyncTool

2022-09-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2733:
-
Sprint: Cont' improve -  2021/01/24, Cont' improve -  2021/01/31, Cont' 
improve -  2022/02/07, Cont' improve -  2022/02/14, 2022/09/19  (was: Cont' 
improve -  2021/01/24, Cont' improve -  2021/01/31, Cont' improve -  
2022/02/07, Cont' improve -  2022/02/14)

> Adding Thrift support in HiveSyncTool
> -
>
> Key: HUDI-2733
> URL: https://issues.apache.org/jira/browse/HUDI-2733
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: meta-sync, Utilities
>Reporter: Satyam Raj
>Assignee: Satyam Raj
>Priority: Major
>  Labels: hive-sync, pull-request-available
> Fix For: 0.13.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Introduction of Thrift Metastore client to sync Hudi data in Hive warehouse.
> Suggested client to integrate with:
> https://github.com/akolb1/hclient/tree/master/tools-common



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4585) Optimize query performance on Presto Hudi connector

2022-09-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4585:
-
Reviewers: Sagar Sumit  (was: Raymond Xu, Sagar Sumit)

> Optimize  query performance on Presto Hudi connector
> 
>
> Key: HUDI-4585
> URL: https://issues.apache.org/jira/browse/HUDI-4585
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] xushiyan commented on a diff in pull request #6550: [HUDI-4691] Cleaning up duplicated classes in Spark 3.3 module

2022-09-08 Thread GitBox


xushiyan commented on code in PR #6550:
URL: https://github.com/apache/hudi/pull/6550#discussion_r966341103


##
pom.xml:
##
@@ -377,9 +377,17 @@
     <exclude>org.sl4fj:slf4j-jcl</exclude>
     <exclude>log4j:log4j</exclude>
     <exclude>ch.qos.logback:logback-classic</exclude>
+    <exclude>org.apache.hbase:hbase-common:*</exclude>
+    <exclude>org.apache.hbase:hbase-client:*</exclude>
+    <exclude>org.apache.hbase:hbase-server:*</exclude>
   </excludes>
   <includes>
     <include>org.slf4j:slf4j-simple:*:*:test</include>
+    <include>org.apache.hbase:hbase-common:${hbase.version}</include>
+    <include>org.apache.hbase:hbase-client:${hbase.version}</include>
+    <include>org.apache.hbase:hbase-server:${hbase.version}</include>
   </includes>

Review Comment:
   Why are these "exclude"s repeated here under "includes"?



##
packaging/hudi-spark-bundle/pom.xml:
##
@@ -140,14 +140,14 @@
 
-        <relocation>
-          <pattern>javax.servlet.</pattern>
-          <shadedPattern>org.apache.hudi.javax.servlet.</shadedPattern>
-        </relocation>
         <relocation>
           <pattern>org.apache.spark.sql.avro.</pattern>
           <shadedPattern>org.apache.hudi.org.apache.spark.sql.avro.</shadedPattern>
         </relocation>
+        <relocation>
+          <pattern>javax.servlet.</pattern>
+          <shadedPattern>org.apache.hudi.javax.servlet.</shadedPattern>
+        </relocation>

Review Comment:
   Unnecessary moving of these lines.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #6550: [HUDI-4691] Cleaning up duplicated classes in Spark 3.3 module

2022-09-08 Thread GitBox


xushiyan commented on code in PR #6550:
URL: https://github.com/apache/hudi/pull/6550#discussion_r966335560


##
hudi-utilities/pom.xml:
##
@@ -139,14 +125,22 @@
 
   
 
+
+
+
+
 
   org.apache.hudi
-  hudi-spark-common_${scala.binary.version}
+  hudi-common

Review Comment:
   @nsivabalan the note above explained it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

2022-09-08 Thread GitBox


hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1241126360

   
   ## CI report:
   
   * 1dfb9ffa267bce2c73bdc10e285a3ab2d3e15939 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11252)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #6550: [HUDI-4691] Cleaning up duplicated classes in Spark 3.3 module

2022-09-08 Thread GitBox


xushiyan commented on code in PR #6550:
URL: https://github.com/apache/hudi/pull/6550#discussion_r966311837


##
hudi-spark-datasource/hudi-spark3.2plus-common/pom.xml:
##
@@ -0,0 +1,234 @@
+
+
+http://maven.apache.org/POM/4.0.0;
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
http://maven.apache.org/xsd/maven-4.0.0.xsd;>
+
+hudi-spark-datasource
+org.apache.hudi
+0.13.0-SNAPSHOT
+
+4.0.0
+
+hudi-spark3.2plus-common

Review Comment:
   ok sounds good to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on pull request #6535: [HUDI-4193] change protoc version so it compiles on m1 mac

2022-09-08 Thread GitBox


xushiyan commented on PR #6535:
URL: https://github.com/apache/hudi/pull/6535#issuecomment-1241092456

   > I don't think anything needs to be added to the README. It has the 
activation tag and checks whether the OS is macOS and the processor type is 
aarch64; if both are true, it activates the profile. When you run `mvn clean 
package` on an M1 Mac, it will automatically enable that profile since those 
conditions are met.
   
   @jonvex got it.
   
   > Please do not use different versions of protoc depending on the 
environment, because the build will not be reproducible. That's why I proposed 
bumping protoc to one of the latest versions. We made a similar change in Hive 
and Thrift.
   
   @slachiewicz Sounds good. Would you take this version alignment along with 
your fix (#5784)? We can simply remove the property override in this new 
profile.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6637: Fix AWSDmsAvroPayload#getInsertValue,combineAndGetUpdateValue to invo…

2022-09-08 Thread GitBox


hudi-bot commented on PR #6637:
URL: https://github.com/apache/hudi/pull/6637#issuecomment-1241064229

   
   ## CI report:
   
   * 5b7d712d175b64de73ce924bbdb95962ebb790fe Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11256)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua closed issue #6623: [SUPPORT] java.lang.ClassNotFoundException: Class org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener with HBase Index

2022-09-08 Thread GitBox


yihua closed issue #6623: [SUPPORT] java.lang.ClassNotFoundException: Class 
org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener with 
HBase Index
URL: https://github.com/apache/hudi/issues/6623


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on issue #6623: [SUPPORT] java.lang.ClassNotFoundException: Class org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener with HBase Index

2022-09-08 Thread GitBox


yihua commented on issue #6623:
URL: https://github.com/apache/hudi/issues/6623#issuecomment-1241060251

   @praveenkmr Great to hear that.
   
   For the upgrade to OSS Hudi 0.12.0 (the latest release), using 
hudi-spark-bundle should be sufficient as OSS Hudi 0.12.0 bundle jars work 
out-of-the-box on the EMR environment.  Prior to this version, you need to 
follow the workaround above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6637: Fix AWSDmsAvroPayload#getInsertValue,combineAndGetUpdateValue to invo…

2022-09-08 Thread GitBox


hudi-bot commented on PR #6637:
URL: https://github.com/apache/hudi/pull/6637#issuecomment-1241058661

   
   ## CI report:
   
   * 5b7d712d175b64de73ce924bbdb95962ebb790fe UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] fengjian428 commented on a diff in pull request #4676: [HUDI-3304] support partial update on mor table

2022-09-08 Thread GitBox


fengjian428 commented on code in PR #4676:
URL: https://github.com/apache/hudi/pull/4676#discussion_r966271696


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java:
##
@@ -0,0 +1,196 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Properties;
+import java.util.concurrent.ConcurrentHashMap;
+
+/**
+ * subclass of OverwriteNonDefaultsWithLatestAvroPayload used for delta streamer.
+ *
+ * preCombine - Picks the latest delta record for a key, based on an ordering field;
+ * combineAndGetUpdateValue/getInsertValue - overwrite storage for specified fields
+ * that doesn't equal defaultValue.
+ */
+public class PartialUpdateAvroPayload extends OverwriteNonDefaultsWithLatestAvroPayload {
+
+  public static ConcurrentHashMap<String, Schema> schemaRepo = new ConcurrentHashMap<>();
+
+  public PartialUpdateAvroPayload(GenericRecord record, Comparable orderingVal) {
+    super(record, orderingVal);
+  }
+
+  public PartialUpdateAvroPayload(Option<GenericRecord> record) {
+    super(record); // natural order
+  }
+
+  @Override
+  public PartialUpdateAvroPayload preCombine(OverwriteWithLatestAvroPayload oldValue, Properties properties) {
+    String schemaStringIn = properties.getProperty("schema");
+    Schema schemaInstance;
+    if (!schemaRepo.containsKey(schemaStringIn)) {
+      schemaInstance = new Schema.Parser().parse(schemaStringIn);
+      schemaRepo.put(schemaStringIn, schemaInstance);
+    } else {
+      schemaInstance = schemaRepo.get(schemaStringIn);
+    }
+    if (oldValue.recordBytes.length == 0) {
+      // use natural order for delete record
+      return this;
+    }
+
+    try {
+      GenericRecord indexedOldValue = (GenericRecord) oldValue.getInsertValue(schemaInstance).get();
+      Option<IndexedRecord> optValue = combineAndGetUpdateValue(indexedOldValue, schemaInstance, this.orderingVal.toString());
+      // Rebuild ordering value if required
+      String newOrderingVal = rebuildOrderingValMap((GenericRecord) optValue.get(), this.orderingVal.toString());
+      if (optValue.isPresent()) {
+        return new PartialUpdateAvroPayload((GenericRecord) optValue.get(), newOrderingVal);
+      }
+    } catch (Exception ex) {
+      return this;
+    }
+    return this;
+  }
+
+  public Option<IndexedRecord> combineAndGetUpdateValue(
+      IndexedRecord currentValue, Schema schema, String newOrderingValWithMappings) throws IOException {
+    Option<IndexedRecord> recordOption = getInsertValue(schema);
+    if (!recordOption.isPresent()) {
+      return Option.empty();
+    }
+
+    // Perform a deserialization again to prevent resultRecord from sharing the same reference as recordOption
+    GenericRecord resultRecord = (GenericRecord) getInsertValue(schema).get();
+    List<Schema.Field> fields = schema.getFields();
+
+    // newOrderingValWithMappings = _ts1:name1,price1=999;_ts2:name2,price2=;
+    for (String newOrderingFieldMapping : newOrderingValWithMappings.split(";")) {
+      String orderingField = newOrderingFieldMapping.split(":")[0];
+      String newOrderingVal = getNewOrderingVal(orderingField, newOrderingFieldMapping);
+      String oldOrderingVal = HoodieAvroUtils.getNestedFieldValAsString(
+          (GenericRecord) currentValue, orderingField, true, false);
+      if (oldOrderingVal == null) {
+        oldOrderingVal = "";
+      }
+
+      // No update required
+      if (oldOrderingVal.isEmpty() && newOrderingVal.isEmpty()) {
+        continue;
+      }
+
+      // Pick the payload with greatest ordering value as insert record
+      boolean isBaseRecord = false;
+      try {
+        if (Long.parseLong(oldOrderingVal) > Long.parseLong(newOrderingVal)) {
+          isBaseRecord = true;
+        }
+      } catch (NumberFormatException e) {
+        if (oldOrderingVal.compareTo(newOrderingVal) > 0) {
+          isBaseRecord = true;
+        }
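
To make the ordering-value mapping format in the comment above concrete, here is a 
standalone sketch of how a string like "_ts1:name1,price1=999;_ts2:name2,price2=" 
splits into per-ordering-field values (the parse helper is illustrative only and is 
not the PR's actual getNewOrderingVal implementation):

import java.util.LinkedHashMap;
import java.util.Map;

public class OrderingValMappingDemo {
  // Each ';'-separated group is "<orderingField>:<field1>,<field2>=<orderingValue>".
  static Map<String, String> parse(String mappings) {
    Map<String, String> orderingValByField = new LinkedHashMap<>();
    for (String group : mappings.split(";")) {
      String orderingField = group.split(":")[0];                  // e.g. "_ts1"
      int eq = group.lastIndexOf('=');
      String orderingVal = eq >= 0 ? group.substring(eq + 1) : ""; // e.g. "999", may be empty
      orderingValByField.put(orderingField, orderingVal);
    }
    return orderingValByField;
  }

  public static void main(String[] args) {
    System.out.println(parse("_ts1:name1,price1=999;_ts2:name2,price2="));
    // prints: {_ts1=999, _ts2=}
  }
}

Note that the fallback in the quoted code compares ordering values lexicographically 
when they are not numeric, so mixed-width numeric strings (e.g. "9" vs "10") would 
sort incorrectly if they ever reached the String branch.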

[GitHub] [hudi] hudi-bot commented on pull request #6634: [HUDI-4813] Fix infer keygen not work in sparksql side issue

2022-09-08 Thread GitBox


hudi-bot commented on PR #6634:
URL: https://github.com/apache/hudi/pull/6634#issuecomment-1241053170

   
   ## CI report:
   
   * 1a7511e07003745ea2d0d7a802ea5bf1731bd9a4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11251)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6633: [HUDI-4811] Fix the checkstyle of hudi flink

2022-09-08 Thread GitBox


hudi-bot commented on PR #6633:
URL: https://github.com/apache/hudi/pull/6633#issuecomment-1241053135

   
   ## CI report:
   
   * 0c0a1c94e70675f5034fd4fd8bcc0812f27a272c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11249)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rahil-c commented on issue #6552: [SUPPORT] AWSDmsAvroPayload does not work correctly with any version above 0.10.0

2022-09-08 Thread GitBox


rahil-c commented on issue #6552:
URL: https://github.com/apache/hudi/issues/6552#issuecomment-1241038767

   Draft pr: https://github.com/apache/hudi/pull/6637 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] minihippo commented on a diff in pull request #4676: [HUDI-3304] support partial update on mor table

2022-09-08 Thread GitBox


minihippo commented on code in PR #4676:
URL: https://github.com/apache/hudi/pull/4676#discussion_r966225565


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java:
##
@@ -0,0 +1,196 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Properties;
+import java.util.concurrent.ConcurrentHashMap;
+
+/**
+ * subclass of OverwriteNonDefaultsWithLatestAvroPayload used for delta streamer.
+ *
+ * preCombine - Picks the latest delta record for a key, based on an ordering field;
+ * combineAndGetUpdateValue/getInsertValue - overwrite storage for specified fields
+ * that doesn't equal defaultValue.
+ */
+public class PartialUpdateAvroPayload extends OverwriteNonDefaultsWithLatestAvroPayload {
+
+  public static ConcurrentHashMap<String, Schema> schemaRepo = new ConcurrentHashMap<>();
+
+  public PartialUpdateAvroPayload(GenericRecord record, Comparable orderingVal) {
+    super(record, orderingVal);
+  }
+
+  public PartialUpdateAvroPayload(Option<GenericRecord> record) {
+    super(record); // natural order
+  }
+
+  @Override
+  public PartialUpdateAvroPayload preCombine(OverwriteWithLatestAvroPayload oldValue, Properties properties) {
+    String schemaStringIn = properties.getProperty("schema");
+    Schema schemaInstance;
+    if (!schemaRepo.containsKey(schemaStringIn)) {
+      schemaInstance = new Schema.Parser().parse(schemaStringIn);
+      schemaRepo.put(schemaStringIn, schemaInstance);
+    } else {
+      schemaInstance = schemaRepo.get(schemaStringIn);
+    }
+    if (oldValue.recordBytes.length == 0) {
+      // use natural order for delete record
+      return this;
+    }
+
+    try {
+      GenericRecord indexedOldValue = (GenericRecord) oldValue.getInsertValue(schemaInstance).get();
+      Option<IndexedRecord> optValue = combineAndGetUpdateValue(indexedOldValue, schemaInstance, this.orderingVal.toString());
+      // Rebuild ordering value if required
+      String newOrderingVal = rebuildOrderingValMap((GenericRecord) optValue.get(), this.orderingVal.toString());
+      if (optValue.isPresent()) {
+        return new PartialUpdateAvroPayload((GenericRecord) optValue.get(), newOrderingVal);
+      }
+    } catch (Exception ex) {
+      return this;
+    }
+    return this;
+  }
+
+  public Option<IndexedRecord> combineAndGetUpdateValue(
+      IndexedRecord currentValue, Schema schema, String newOrderingValWithMappings) throws IOException {
+    Option<IndexedRecord> recordOption = getInsertValue(schema);
+    if (!recordOption.isPresent()) {
+      return Option.empty();
+    }
+
+    // Perform a deserialization again to prevent resultRecord from sharing the same reference as recordOption
+    GenericRecord resultRecord = (GenericRecord) getInsertValue(schema).get();
+    List<Schema.Field> fields = schema.getFields();
+
+    // newOrderingValWithMappings = _ts1:name1,price1=999;_ts2:name2,price2=;
+    for (String newOrderingFieldMapping : newOrderingValWithMappings.split(";")) {
+      String orderingField = newOrderingFieldMapping.split(":")[0];
+      String newOrderingVal = getNewOrderingVal(orderingField, newOrderingFieldMapping);
+      String oldOrderingVal = HoodieAvroUtils.getNestedFieldValAsString(
+          (GenericRecord) currentValue, orderingField, true, false);
+      if (oldOrderingVal == null) {
+        oldOrderingVal = "";
+      }
+
+      // No update required
+      if (oldOrderingVal.isEmpty() && newOrderingVal.isEmpty()) {
+        continue;
+      }
+
+      // Pick the payload with greatest ordering value as insert record
+      boolean isBaseRecord = false;
+      try {
+        if (Long.parseLong(oldOrderingVal) > Long.parseLong(newOrderingVal)) {
+          isBaseRecord = true;
+        }
+      } catch (NumberFormatException e) {
+        if (oldOrderingVal.compareTo(newOrderingVal) > 0) {
+          isBaseRecord = true;
+        }
+
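
As a side note on the schemaRepo cache in the diff above: the containsKey/put pair 
is not atomic, so two threads can parse the same schema string concurrently. A 
possible alternative (a sketch, not part of the PR) is 
ConcurrentHashMap.computeIfAbsent, which makes the lookup-or-parse step atomic:

import java.util.concurrent.ConcurrentHashMap;

import org.apache.avro.Schema;

public class SchemaCacheSketch {
  private static final ConcurrentHashMap<String, Schema> SCHEMA_REPO = new ConcurrentHashMap<>();

  // Atomically returns the cached Schema, parsing and caching it on first use.
  static Schema getOrParse(String schemaString) {
    return SCHEMA_REPO.computeIfAbsent(schemaString, s -> new Schema.Parser().parse(s));
  }
}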

[GitHub] [hudi] rahil-c opened a new pull request, #6637: Fix AWSDmsAvroPayload#getInsertValue,combineAndGetUpdateValue to invo…

2022-09-08 Thread GitBox


rahil-c opened a new pull request, #6637:
URL: https://github.com/apache/hudi/pull/6637

   …ke correct api
   
   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


