[hudi] branch master updated (417cca94df -> 75f6266594)

2022-08-08 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 417cca94df [HUDI-4570] Fix hive sync path error due to reuse of 
storage descriptors. (#6329)
 add 75f6266594 [HUDI-4571] Fix partition extractor infer function when 
partition field mismatch (#6333)

No new revisions were added by this update.

Summary of changes:
 .../replication/TestHiveSyncGlobalCommitTool.java  |  3 ++
 .../apache/hudi/sync/common/HoodieSyncConfig.java  | 30 +++
 .../hudi/sync/common/TestHoodieSyncConfig.java | 60 +-
 3 files changed, 70 insertions(+), 23 deletions(-)



[GitHub] [hudi] codope merged pull request #6333: [HUDI-4571] Fix partition extractor infer function when partition field mismatch

2022-08-08 Thread GitBox


codope merged PR #6333:
URL: https://github.com/apache/hudi/pull/6333


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6339: [MINOR] Fix wrong key to determine sync sql cascade

2022-08-08 Thread GitBox


hudi-bot commented on PR #6339:
URL: https://github.com/apache/hudi/pull/6339#issuecomment-1208974741

   
   ## CI report:
   
   * 0557033c7d3a88cca59469d671e0d98e8ff447ef Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10692)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6339: [MINOR] Fix wrong key to determine sync sql cascade

2022-08-08 Thread GitBox


hudi-bot commented on PR #6339:
URL: https://github.com/apache/hudi/pull/6339#issuecomment-1208971538

   
   ## CI report:
   
   * 0557033c7d3a88cca59469d671e0d98e8ff447ef UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


hudi-bot commented on PR #6337:
URL: https://github.com/apache/hudi/pull/6337#issuecomment-1208971505

   
   ## CI report:
   
   * 121f67ce16a6b6102fcb7f246ab1c6c6e289ec8f UNKNOWN
   * 535a74909d59fa79c6267bc9883165344f7ddef1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10685)
 
   * 3995309352a994b48a8ea247b9a6db69e89767e4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10687)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6333: [HUDI-4571] Fix partition extractor infer function when partition field mismatch

2022-08-08 Thread GitBox


hudi-bot commented on PR #6333:
URL: https://github.com/apache/hudi/pull/6333#issuecomment-1208968320

   
   ## CI report:
   
   * e37527b7c57836656d2255967d5781226c5b76c6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10691)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6312: [HUDI-4551] The default value of READ_TASKS, WRITE_TASKS, CLUSTERING_TASKS is the parallelism of the execution environment

2022-08-08 Thread GitBox


hudi-bot commented on PR #6312:
URL: https://github.com/apache/hudi/pull/6312#issuecomment-1208968268

   
   ## CI report:
   
   * 33048bf5a19016121cbe271a9353803d3fe1d261 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10610)
 
   * c13c77b28b3dcd95461dbb19ab6d0caf2c0c0dc7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10690)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan opened a new pull request, #6339: [MINOR] Fix wrong key to determine sync sql cascade

2022-08-08 Thread GitBox


xushiyan opened a new pull request, #6339:
URL: https://github.com/apache/hudi/pull/6339

   ### Change Logs
   
   During refactoring (#5854), the condition to determine the sync sql cascade option was mistyped. Fixing it to check `META_SYNC_PARTITION_FIELDS`.
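
   A minimal sketch of the intended check (hypothetical names, not the exact patch), deriving the cascade flag from the partition-fields key:

   ```java
   import java.util.Properties;

   // Hypothetical sketch: key the Hive "ALTER TABLE ... CASCADE" decision off
   // the partition fields config rather than an unrelated key.
   public class CascadeCheckSketch {
     static boolean shouldCascade(Properties props) {
       String partitionFields =
           props.getProperty("hoodie.datasource.hive_sync.partition_fields", "");
       // Cascade schema changes to partitions only when the table is partitioned.
       return !partitionFields.trim().isEmpty();
     }
   }
   ```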
   
   ### Impact
   
   **Risk level: low**
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] bhasudha opened a new pull request, #6338: [DOCS] Add prestocon 2022 to talks page

2022-08-08 Thread GitBox


bhasudha opened a new pull request, #6338:
URL: https://github.com/apache/hudi/pull/6338

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6312: [HUDI-4551] The default value of READ_TASKS, WRITE_TASKS, CLUSTERING_TASKS is the parallelism of the execution environment

2022-08-08 Thread GitBox


hudi-bot commented on PR #6312:
URL: https://github.com/apache/hudi/pull/6312#issuecomment-1208965092

   
   ## CI report:
   
   * 33048bf5a19016121cbe271a9353803d3fe1d261 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10610)
 
   * c13c77b28b3dcd95461dbb19ab6d0caf2c0c0dc7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6333: [HUDI-4571] Fix partition extractor infer function when partition field mismatch

2022-08-08 Thread GitBox


hudi-bot commented on PR #6333:
URL: https://github.com/apache/hudi/pull/6333#issuecomment-1208965160

   
   ## CI report:
   
   * e37527b7c57836656d2255967d5781226c5b76c6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope commented on a diff in pull request #6329: [HUDI-4570] Fix hive sync path error due to reuse of storage descript…

2022-08-08 Thread GitBox


codope commented on code in PR #6329:
URL: https://github.com/apache/hudi/pull/6329#discussion_r940942901


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java:
##
@@ -225,8 +225,9 @@ public void updatePartitionsToTable(String tableName, List<String> changedPartitions) {
     String fullPartitionPath = StorageSchemes.HDFS.getScheme().equals(partitionScheme)
         ? FSUtils.getDFSFullPartitionPath(syncConfig.getHadoopFileSystem(), partitionPath) : partitionPath.toString();
     List<String> partitionValues = partitionValueExtractor.extractPartitionValuesInPath(partition);
-    sd.setLocation(fullPartitionPath);
-    return new Partition(partitionValues, databaseName, tableName, 0, 0, sd, null);
+    StorageDescriptor partitionSd = sd.deepCopy();
+    partitionSd.setLocation(fullPartitionPath);

Review Comment:
   Great catch! Let's take it in 0.12.0. We should cover adding multiple 
partitions in a test. I can add it later.
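
   A minimal sketch of such a test (class and values are illustrative, not the committed test), relying on the Thrift-generated `StorageDescriptor#deepCopy`:

   ```java
   import org.apache.hadoop.hive.metastore.api.StorageDescriptor;

   public class MultiPartitionLocationSketch {
     public static void main(String[] args) {
       StorageDescriptor tableSd = new StorageDescriptor();
       tableSd.setLocation("hdfs://ns/warehouse/db/tbl");

       // Mimic the fixed updatePartitionsToTable(): deep-copy, then set location.
       StorageDescriptor sd1 = tableSd.deepCopy();
       sd1.setLocation("hdfs://ns/warehouse/db/tbl/dt=2022-08-08");
       StorageDescriptor sd2 = tableSd.deepCopy();
       sd2.setLocation("hdfs://ns/warehouse/db/tbl/dt=2022-08-09");

       // With the shared sd of the old code, both partitions would report the
       // last location written; with deep copies they stay distinct.
       assert !sd1.getLocation().equals(sd2.getLocation());
       assert "hdfs://ns/warehouse/db/tbl".equals(tableSd.getLocation());
     }
   }
   ```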



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on pull request #6333: [HUDI-4571] Fix partition extractor infer function when partition field mismatch

2022-08-08 Thread GitBox


xushiyan commented on PR #6333:
URL: https://github.com/apache/hudi/pull/6333#issuecomment-1208963231

   
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10681&view=results
   
   ![Screen Shot 2022-08-09 at 1 22 18 
AM](https://user-images.githubusercontent.com/2701446/183578776-56fb84b1-16ab-423b-a9c7-e96ca889cc7e.png)
   
   this actually passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


hudi-bot commented on PR #6337:
URL: https://github.com/apache/hudi/pull/6337#issuecomment-1208962075

   
   ## CI report:
   
   * 121f67ce16a6b6102fcb7f246ab1c6c6e289ec8f UNKNOWN
   * 1b8a77eec4fb127274bd5ee3b52133a0b26c769d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10684)
 
   * 535a74909d59fa79c6267bc9883165344f7ddef1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10685)
 
   * 3995309352a994b48a8ea247b9a6db69e89767e4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10687)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6333: [HUDI-4571] Fix partition extractor infer function when partition field mismatch

2022-08-08 Thread GitBox


hudi-bot commented on PR #6333:
URL: https://github.com/apache/hudi/pull/6333#issuecomment-1208962034

   
   ## CI report:
   
   * 679fd111f43f4eea205f4eee81972f1132a5ae1e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10680)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4557) Support validation of column stats of avro log files in tests

2022-08-08 Thread Vamshi Gudavarthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vamshi Gudavarthi updated HUDI-4557:

Component/s: tests-ci

> Support validation of column stats of avro log files in tests
> -
>
> Key: HUDI-4557
> URL: https://issues.apache.org/jira/browse/HUDI-4557
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.13.0
>
>
> In TestColumnStatsIndex, when comparing the column stats with the actual data 
> files, only parquet files are supported.  We need to support avro log files 
> as well.  Note that, to validate the column stat of avro log files, we use 
> resource files storing the expected column stat table content for validation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4537) why hudi clean service actually retain CLEAN_RETAIN_COMMITS + 1

2022-08-08 Thread Vamshi Gudavarthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vamshi Gudavarthi updated HUDI-4537:

Component/s: cleaning
 configs

> why hudi clean service actually retain CLEAN_RETAIN_COMMITS + 1
> ---
>
> Key: HUDI-4537
> URL: https://issues.apache.org/jira/browse/HUDI-4537
> Project: Apache Hudi
>  Issue Type: Wish
>  Components: cleaning, configs
>Reporter: yonghua jian
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4539) Make Hudi's CLI API consistent

2022-08-08 Thread Vamshi Gudavarthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vamshi Gudavarthi updated HUDI-4539:

Component/s: cli

> Make Hudi's CLI API consistent
> --
>
> Key: HUDI-4539
> URL: https://issues.apache.org/jira/browse/HUDI-4539
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Reporter: Alexey Kudinkin
>Priority: Major
>
> Currently API provided by the CLI is inconsistent:
>  # Some of the commands (to display metadata for ex) are applicable to some 
> commits/actions but not others
>  # Same actions should be applicable to both active and archived timeline 
> (from the CLI standpoint there should be essentially no difference)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] rohit-m-99 closed issue #6335: [SUPPORT] Deltastreamer updates not supporting the addition of new columns

2022-08-08 Thread GitBox


rohit-m-99 closed issue #6335: [SUPPORT] Deltastreamer updates not supporting 
the addition of new columns
URL: https://github.com/apache/hudi/issues/6335


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on issue #6335: [SUPPORT] Deltastreamer updates not supporting the addition of new columns

2022-08-08 Thread GitBox


yihua commented on issue #6335:
URL: https://github.com/apache/hudi/issues/6335#issuecomment-1208945535

   @rohit-m-99 feel free to close the issue if you are all good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on issue #6335: [SUPPORT] Deltastreamer updates not supporting the addition of new columns

2022-08-08 Thread GitBox


yihua commented on issue #6335:
URL: https://github.com/apache/hudi/issues/6335#issuecomment-1208944879

   After discussion with @rohit-m-99 , the problem is not due to the write side, but rather how the Hudi table is read.  The glob pattern, i.e., `spark.read.format("hudi").load("<base path>/*")`, is used to read the table, which causes inconsistent results.  Using `spark.read.format("hudi").load("<base path>")` solves the problem and returns the correct data.
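
   For illustration, a hedged sketch of the two read styles (paths and session setup are placeholders, not taken from the issue):

   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;

   public class HudiReadSketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("hudi-read-sketch").master("local[*]").getOrCreate();

       // Base path: the Hudi datasource resolves the latest table schema and
       // file versions from the timeline.
       Dataset<Row> consistent = spark.read().format("hudi")
           .load("s3a://bucket/path/to/table");

       // Glob over raw files: bypasses that resolution and, per this issue,
       // can return inconsistent results after schema changes.
       Dataset<Row> inconsistent = spark.read().format("hudi")
           .load("s3a://bucket/path/to/table/*");
     }
   }
   ```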


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohit-m-99 commented on issue #6335: [SUPPORT] Deltastreamer updates not supporting the addition of new columns

2022-08-08 Thread GitBox


rohit-m-99 commented on issue #6335:
URL: https://github.com/apache/hudi/issues/6335#issuecomment-1208944780

   This issue was resolved by removing the asterisk instead:
   
   `stat_data_frame = (session.read.format("hudi").option("hoodie.datasource.write.reconcile.schema", "true").load("s3a://example-prod-output/stats/querying"))`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rohit-m-99 commented on issue #6335: [SUPPORT] Deltastreamer updates not supporting the addition of new columns

2022-08-08 Thread GitBox


rohit-m-99 commented on issue #6335:
URL: https://github.com/apache/hudi/issues/6335#issuecomment-1208940584

   ```
   stat_data_frame = 
(session.read.format("hudi").option("hoodie.datasource.write.reconcile.schema", 
"true").load("s3a://example-prod-output/stats/querying/0e6a3669-1f94-4ec4-93e8-6b5b25053b7e-0_0-70-1046_20220809052311671.parquet"))
   len(stat_data_frame.columns) # returns 616
   ```
   however
   
   ```
   stat_data_frame = 
(session.read.format("hudi").option("hoodie.datasource.write.reconcile.schema", 
"true").load("s3a://example-prod-output/stats/querying/*"))
   len(stat_data_frame.columns) # returns 551
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4571) unintended value set for partition_extractor_class if hoodie.datasource.write.partitionpath.field and hoodie.table.partition.fields not aligned

2022-08-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4571:
-
Reviewers: Raymond Xu

> unintended value set for partition_extractor_class if 
> hoodie.datasource.write.partitionpath.field and hoodie.table.partition.fields 
> not aligned
> ---
>
> Key: HUDI-4571
> URL: https://issues.apache.org/jira/browse/HUDI-4571
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jian Feng
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4571) unintended value set for partition_extractor_class if hoodie.datasource.write.partitionpath.field and hoodie.table.partition.fields not aligned

2022-08-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-4571:


Assignee: Sagar Sumit

> unintended value set for partition_extractor_class if 
> hoodie.datasource.write.partitionpath.field and hoodie.table.partition.fields 
> not aligned
> ---
>
> Key: HUDI-4571
> URL: https://issues.apache.org/jira/browse/HUDI-4571
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jian Feng
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4571) unintended value set for partition_extractor_class if hoodie.datasource.write.partitionpath.field and hoodie.table.partition.fields not aligned

2022-08-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4571:
-
Fix Version/s: 0.12.0

> unintended value set for partition_extractor_class if 
> hoodie.datasource.write.partitionpath.field and hoodie.table.partition.fields 
> not aligned
> ---
>
> Key: HUDI-4571
> URL: https://issues.apache.org/jira/browse/HUDI-4571
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jian Feng
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4570) Fix hive sync path error due to reuse of storage descriptors.

2022-08-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-4570:


Assignee: Ying Lin

> Fix hive sync path error due to reuse of storage descriptors.
> -
>
> Key: HUDI-4570
> URL: https://issues.apache.org/jira/browse/HUDI-4570
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, meta-sync
>Reporter: Ying Lin
>Assignee: Ying Lin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] BruceKellan commented on a diff in pull request #6329: [HUDI-4570] Fix hive sync path error due to reuse of storage descript…

2022-08-08 Thread GitBox


BruceKellan commented on code in PR #6329:
URL: https://github.com/apache/hudi/pull/6329#discussion_r940916733


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java:
##
@@ -225,8 +225,9 @@ public void updatePartitionsToTable(String tableName, List<String> changedPartitions) {
     String fullPartitionPath = StorageSchemes.HDFS.getScheme().equals(partitionScheme)
         ? FSUtils.getDFSFullPartitionPath(syncConfig.getHadoopFileSystem(), partitionPath) : partitionPath.toString();
     List<String> partitionValues = partitionValueExtractor.extractPartitionValuesInPath(partition);
-    sd.setLocation(fullPartitionPath);
-    return new Partition(partitionValues, databaseName, tableName, 0, 0, sd, null);
+    StorageDescriptor partitionSd = sd.deepCopy();
+    partitionSd.setLocation(fullPartitionPath);

Review Comment:
   Yes, the location for the partition is wrong. It will cause query results to be wrong when using Hive or Trino, since these queries rely on the partition location recorded in the metastore.
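
   For anyone following along, a small illustrative sketch of the aliasing problem (not the actual sync code): mutating one shared `StorageDescriptor` makes every partition that holds a reference to it report whichever location was set last.

   ```java
   import org.apache.hadoop.hive.metastore.api.StorageDescriptor;

   public class SharedSdBugSketch {
     public static void main(String[] args) {
       StorageDescriptor sd = new StorageDescriptor();

       // Buggy pattern: the same sd instance is mutated for each partition...
       sd.setLocation("hdfs://ns/tbl/dt=2022-08-08");
       StorageDescriptor p1 = sd; // partition 1 keeps a reference
       sd.setLocation("hdfs://ns/tbl/dt=2022-08-09");
       StorageDescriptor p2 = sd; // partition 2 too

       // ...so both partitions now report the last location written.
       System.out.println(p1.getLocation()); // hdfs://ns/tbl/dt=2022-08-09
       System.out.println(p2.getLocation()); // hdfs://ns/tbl/dt=2022-08-09
     }
   }
   ```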



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-4570) Fix hive sync path error due to reuse of storage descriptors.

2022-08-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-4570.

Resolution: Fixed

> Fix hive sync path error due to reuse of storage descriptors.
> -
>
> Key: HUDI-4570
> URL: https://issues.apache.org/jira/browse/HUDI-4570
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, meta-sync
>Reporter: Ying Lin
>Assignee: Ying Lin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4570) Fix hive sync path error due to reuse of storage descriptors.

2022-08-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4570:
-
Fix Version/s: 0.12.0

> Fix hive sync path error due to reuse of storage descriptors.
> -
>
> Key: HUDI-4570
> URL: https://issues.apache.org/jira/browse/HUDI-4570
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, meta-sync
>Reporter: Ying Lin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] rohit-m-99 commented on issue #6335: [SUPPORT] Deltastreamer updates not supporting the addition of new columns

2022-08-08 Thread GitBox


rohit-m-99 commented on issue #6335:
URL: https://github.com/apache/hudi/issues/6335#issuecomment-1208933957

   Also wanted to note that in the logs I am seeing
   
   `22/08/09 05:13:37 INFO DeltaSync: Seeing new schema. Source :{`
   
   This has the correct schema, for both the source and target. However, the output data isn't matching the schema.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] prasannarajaperumal commented on a diff in pull request #6268: [HUDI-4519] Initial version of the Hudi storage format specification doc

2022-08-08 Thread GitBox


prasannarajaperumal commented on code in PR #6268:
URL: https://github.com/apache/hudi/pull/6268#discussion_r940912690


##
website/src/pages/tech-specs.md:
##
@@ -0,0 +1,371 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms 
immutable cloud/file storage systems into transactional data lakes. 
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collections 
of files/objects:
+
+- streaming primitives like incremental merges, change streams, etc.
+- database primitives like tables, transactions, mutability, indexes, and query 
performance optimizations
+
+Apache Hudi is an open source data lake platform built on top of the Hudi 
Storage Format, and it unlocks the following features:
+
+- **Unified Computation model** - a unified way to combine large batch style 
operations and frequent near real time streaming operations over a single 
unified dataset
+- **Self-Optimized Storage** - automatically handles all table storage 
maintenance such as compaction, clustering, and vacuuming, asynchronously and 
without blocking actual data changes
+- **Cloud Native Database** - abstracts Table/Schema from actual storage and 
ensures up-to-date metadata and indexes, unlocking multi-fold read and write 
performance optimizations
+- **Engine neutrality** - designed to be neutral, with no preferred computation 
engine. Apache Hudi manages metadata and provides common abstractions and 
pluggable interfaces to most/all common computational engines.
+
+
+
+## Storage Format
+
+### Layout Hierarchy
+
+At a high level, Hudi organizes data into a high level directory structure 
under the base path (root directory for the Hudi table). The directory 
structure is based on coarse-grained partitioning values set for the dataset. 
Non-partitioned data sets store all the data files under the base path. Hudi 
storage format has a special reserved *.hoodie* directory under the base path 
that is used to store transaction logs and metadata.
+
+```
+/data/hudi_trips/  <== BASE PATH

Review Comment:
   Hmm looks like tabs have different space lengths. Correcting it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (f4b2782886 -> 417cca94df)

2022-08-08 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from f4b2782886 [HUDI-4572] Fix 'Not a valid schema field: ts' error in 
HoodieFlinkCompactor if precombine field is not ts (#6331)
 add 417cca94df [HUDI-4570] Fix hive sync path error due to reuse of 
storage descriptors. (#6329)

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java   | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)



[GitHub] [hudi] xushiyan merged pull request #6329: [HUDI-4570] Fix hive sync path error due to reuse of storage descript…

2022-08-08 Thread GitBox


xushiyan merged PR #6329:
URL: https://github.com/apache/hudi/pull/6329


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] prasannarajaperumal commented on a diff in pull request #6268: [HUDI-4519] Initial version of the Hudi storage format specification doc

2022-08-08 Thread GitBox


prasannarajaperumal commented on code in PR #6268:
URL: https://github.com/apache/hudi/pull/6268#discussion_r940911356


##
website/src/pages/tech-specs.md:
##
@@ -0,0 +1,371 @@
+# Apache Hudi Storage Format Specification [DRAFT]
+
+
+
+This document is a specification for the Hudi Storage Format which transforms 
immutable cloud/file storage systems into transactional data lakes. 
+
+## Overview
+
+Hudi Storage Format enables the following features over very large collections 
of files/objects:
+
+- streaming primitives like incremental merges, change streams, etc.
+- database primitives like tables, transactions, mutability, indexes, and query 
performance optimizations
+
+Apache Hudi is an open source data lake platform built on top of the Hudi 
Storage Format, and it unlocks the following features:
+
+- **Unified Computation model** - a unified way to combine large batch style 
operations and frequent near real time streaming operations over a single 
unified dataset
+- **Self-Optimized Storage** - automatically handles all table storage 
maintenance such as compaction, clustering, and vacuuming, asynchronously and 
without blocking actual data changes

Review Comment:
   added: "in most cases"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6333: [HUDI-4571] Fix partition extractor infer function when partition field mismatch

2022-08-08 Thread GitBox


hudi-bot commented on PR #6333:
URL: https://github.com/apache/hudi/pull/6333#issuecomment-1208930208

   
   ## CI report:
   
   * e37527b7c57836656d2255967d5781226c5b76c6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10681)
 
   * 679fd111f43f4eea205f4eee81972f1132a5ae1e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10680)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


hudi-bot commented on PR #6337:
URL: https://github.com/apache/hudi/pull/6337#issuecomment-1208927890

   
   ## CI report:
   
   * 121f67ce16a6b6102fcb7f246ab1c6c6e289ec8f UNKNOWN
   * 1b8a77eec4fb127274bd5ee3b52133a0b26c769d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10684)
 
   * 535a74909d59fa79c6267bc9883165344f7ddef1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10685)
 
   * 3995309352a994b48a8ea247b9a6db69e89767e4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10687)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6333: [HUDI-4571] Fix partition extractor infer function when partition field mismatch

2022-08-08 Thread GitBox


hudi-bot commented on PR #6333:
URL: https://github.com/apache/hudi/pull/6333#issuecomment-1208927863

   
   ## CI report:
   
   * e37527b7c57836656d2255967d5781226c5b76c6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10681)
 
   * 679fd111f43f4eea205f4eee81972f1132a5ae1e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4576) Fix schema evolution docs

2022-08-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4576:
-
Sprint: 2022/08/08

> Fix schema evolution docs
> -
>
> Key: HUDI-4576
> URL: https://issues.apache.org/jira/browse/HUDI-4576
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Fix any incorrect information on the schema evolution docs page.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4577) Add more test coverage for Spark SQL, Spark Quickstart guide

2022-08-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4577:
-
Sprint: 2022/08/08

> Add more test coverage for Spark SQL,  Spark Quickstart guide
> -
>
> Key: HUDI-4577
> URL: https://issues.apache.org/jira/browse/HUDI-4577
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
>
> We should add more test coverage, in particular in these areas:
>  # Add tests for "DELETE FROM" clauses
>  # Make sure the Spark Quickstart guide matches the one on the website



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4579) [DOCS] Add docs on manually upgrading and downgrading table through CLI

2022-08-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4579:
-
Sprint: 2022/08/08

> [DOCS] Add docs on manually upgrading and downgrading table through CLI
> ---
>
> Key: HUDI-4579
> URL: https://issues.apache.org/jira/browse/HUDI-4579
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli, docs
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Docs for the upgrade and downgrade commands in the Hudi CLI are missing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4580) [DOCS] Update quickstart: Spark SQL create table statement fails with "partitioned by"

2022-08-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4580:
-
Sprint: 2022/08/08

> [DOCS] Update quickstart: Spark SQL create table statement fails with 
> "partitioned by"
> --
>
> Key: HUDI-4580
> URL: https://issues.apache.org/jira/browse/HUDI-4580
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: docs, spark-sql
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Spark 3.2.2, Hudi master
> Steps to reproduce
> {code:java}
> Spark shell
> export SPARK_HOME=/Users/ethan/Work/lib/spark-3.2.2-bin-hadoop3.2
> spark-3.2.2-bin-hadoop3.2/bin/spark-shell \
>   --master local[6] \
>   --driver-memory 5g \
>   --num-executors 6 --executor-cores 1 \
>   --executor-memory 1g \
>   --conf spark.ui.port= \
>   --conf spark.driver.maxResultSize=1g \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
>  \
>   --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
>   --conf spark.sql.catalogImplementation=in-memory \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --jars 
> $HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
>   
> Prepare dataset in spark shell
> // spark-shell
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val tableName = "hudi_trips_cow"
> val basePath = "file:///tmp/hudi_trips_cow"
> val dataGen = new DataGenerator
> val inserts = convertToStringList(dataGen.generateInserts(10))
> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
> df.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option(TABLE_NAME, tableName).
>   mode(Overwrite).
>   save(basePath)
>   
> Spark SQL
> spark-3.2.2-bin-hadoop3.2/bin/spark-sql \
>   --master local[6] \
>   --driver-memory 5g \
>   --num-executors 6 --executor-cores 1 \
>   --executor-memory 1g \
>   --conf spark.ui.port= \
>   --conf spark.driver.maxResultSize=1g \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
>  \
>   --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
>   --conf spark.sql.catalogImplementation=in-memory \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --jars 
> $HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
>  
> spark-sql> create table hudi_trips_cow_ext using hudi
>  > partitioned by (partitionpath)
>  > location 'file:///tmp/hudi_trips_cow';
> Error in query: It is not allowed to specify partition columns when the table 
> schema is not defined. When the table schema is not provided, schema and 
> partition columns will be inferred.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


hudi-bot commented on PR #6337:
URL: https://github.com/apache/hudi/pull/6337#issuecomment-1208925266

   
   ## CI report:
   
   * 121f67ce16a6b6102fcb7f246ab1c6c6e289ec8f UNKNOWN
   * 1b8a77eec4fb127274bd5ee3b52133a0b26c769d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10684)
 
   * 535a74909d59fa79c6267bc9883165344f7ddef1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10685)
 
   * 3995309352a994b48a8ea247b9a6db69e89767e4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-2669) Upgrade Java toolset/runtime to JDK11

2022-08-08 Thread Sagar Sumit (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577138#comment-17577138
 ] 

Sagar Sumit commented on HUDI-2669:
---

I have made it a blocker for next major release of Hudi 0.13.0.

Let's keep track of the following page 
[https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions]

It looks like compiling Hadoop with Java 11 is still not supported, but Hadoop 
3.3+ offers runtime support.

> Upgrade Java toolset/runtime to JDK11
> -
>
> Key: HUDI-2669
> URL: https://issues.apache.org/jira/browse/HUDI-2669
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance
>Reporter: sivabalan narayanan
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: performance
> Fix For: 0.13.0
>
>
> We should upgrade to at least JDK11, or preferably current latest LTS JDK17
>  
> Plan for migration:
> *Compilation*
> JDK8 will still be used to *compile* source code (both source/target will 
> stay `1.8`): this is required to make sure as we migrate to JDK11, we don't 
> add dependencies on features not compatible w/ JDK8. Migrating off JDK8 in 
> toolset, will happen at a later point, when we would stop providing any 
> assurances about the code being able to be run on 1.8.
> *Runtime*
> JDK11 will be used to *run* the code: due to JVM b/w compatibility there 
> should be no issues of running the code compiled for 1.8 on JDK11+, other 
> than dependencies compatibility. For that we would make sure that all our 
> test-suites do run against JDK11+.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on a diff in pull request #6329: [HUDI-4570] Fix hive sync path error due to reuse of storage descript…

2022-08-08 Thread GitBox


danny0405 commented on code in PR #6329:
URL: https://github.com/apache/hudi/pull/6329#discussion_r940896442


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java:
##
@@ -225,8 +225,9 @@ public void updatePartitionsToTable(String tableName, List<String> changedPartitions) {
     String fullPartitionPath = StorageSchemes.HDFS.getScheme().equals(partitionScheme)
         ? FSUtils.getDFSFullPartitionPath(syncConfig.getHadoopFileSystem(), partitionPath) : partitionPath.toString();
     List<String> partitionValues = partitionValueExtractor.extractPartitionValuesInPath(partition);
-    sd.setLocation(fullPartitionPath);
-    return new Partition(partitionValues, databaseName, tableName, 0, 0, sd, null);
+    StorageDescriptor partitionSd = sd.deepCopy();
+    partitionSd.setLocation(fullPartitionPath);

Review Comment:
   You mean the location for the partition is wrong? What is the effect of that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #6329: [HUDI-4570] Fix hive sync path error due to reuse of storage descript…

2022-08-08 Thread GitBox


danny0405 commented on code in PR #6329:
URL: https://github.com/apache/hudi/pull/6329#discussion_r940894720


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java:
##
@@ -225,8 +225,9 @@ public void updatePartitionsToTable(String tableName, 
List changedPartit
 String fullPartitionPath = 
StorageSchemes.HDFS.getScheme().equals(partitionScheme)
 ? 
FSUtils.getDFSFullPartitionPath(syncConfig.getHadoopFileSystem(), 
partitionPath) : partitionPath.toString();
 List partitionValues = 
partitionValueExtractor.extractPartitionValuesInPath(partition);
-sd.setLocation(fullPartitionPath);
-return new Partition(partitionValues, databaseName, tableName, 0, 0, 
sd, null);
+StorageDescriptor partitionSd = sd.deepCopy();
+partitionSd.setLocation(fullPartitionPath);

Review Comment:
   Got you, let's make this into release 0.12.0.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] XuQianJin-Stars commented on pull request #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


XuQianJin-Stars commented on PR #6337:
URL: https://github.com/apache/hudi/pull/6337#issuecomment-1208912457

   > How much gain do we have after this change? Guess the checkpoint lock would not block frequently?
   
   To fundamentally solve the problem of stream read timeouts and locking, it is necessary to optimize the number of `MergeOnReadInputSplit`s generated and read.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-2669) Upgrade Java toolset/runtime to JDK11

2022-08-08 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2669:
--
Fix Version/s: 0.13.0
   (was: 0.12.0)

> Upgrade Java toolset/runtime to JDK11
> -
>
> Key: HUDI-2669
> URL: https://issues.apache.org/jira/browse/HUDI-2669
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance
>Reporter: sivabalan narayanan
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: performance
> Fix For: 0.13.0
>
>
> We should upgrade to at least JDK11, or preferably current latest LTS JDK17
>  
> Plan for migration:
> *Compilation*
> JDK8 will still be used to *compile* source code (both source/target will 
> stay `1.8`): this is required to make sure as we migrate to JDK11, we don't 
> add dependencies on features not compatible w/ JDK8. Migrating off JDK8 in 
> toolset, will happen at a later point, when we would stop providing any 
> assurances about the code being able to be run on 1.8.
> *Runtime*
> JDK11 will be used to *run* the code: due to JVM b/w compatibility there 
> should be no issues of running the code compiled for 1.8 on JDK11+, other 
> than dependencies compatibility. For that we would make sure that all our 
> test-suites do run against JDK11+.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2669) Upgrade Java toolset/runtime to JDK11

2022-08-08 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2669:
--
Priority: Blocker  (was: Critical)

> Upgrade Java toolset/runtime to JDK11
> -
>
> Key: HUDI-2669
> URL: https://issues.apache.org/jira/browse/HUDI-2669
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance
>Reporter: sivabalan narayanan
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: performance
> Fix For: 0.13.0
>
>
> We should upgrade to at least JDK11, or preferably current latest LTS JDK17
>  
> Plan for migration:
> *Compilation*
> JDK8 will still be used to *compile* source code (both source/target will 
> stay `1.8`): this is required to make sure as we migrate to JDK11, we don't 
> add dependencies on features not compatible w/ JDK8. Migrating off JDK8 in 
> toolset, will happen at a later point, when we would stop providing any 
> assurances about the code being able to be run on 1.8.
> *Runtime*
> JDK11 will be used to *run* the code: due to JVM b/w compatibility there 
> should be no issues of running the code compiled for 1.8 on JDK11+, other 
> than dependencies compatibility. For that we would make sure that all our 
> test-suites do run against JDK11+.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on a diff in pull request #6320: [HUDI-4558] lost 'hoodie.table.keygenerator.class' in hoodie.properties

2022-08-08 Thread GitBox


danny0405 commented on code in PR #6320:
URL: https://github.com/apache/hudi/pull/6320#discussion_r940893433


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java:
##
@@ -235,6 +235,11 @@ private static void setupHoodieKeyOptions(Configuration 
conf, CatalogTable table
   conf.setString(FlinkOptions.KEYGEN_CLASS_NAME, 
ComplexAvroKeyGenerator.class.getName());
   LOG.info("Table option [{}] is reset to {} because record key or 
partition path has two or more fields",
   FlinkOptions.KEYGEN_CLASS_NAME.key(), 
ComplexAvroKeyGenerator.class.getName());
+} else if (!conf.getOptional(FlinkOptions.KEYGEN_CLASS_NAME).isPresent()) {
+  String keyGenName = FlinkOptions.getKeyGenClassNameByType(conf);
+  conf.setString(FlinkOptions.KEYGEN_CLASS_NAME, keyGenName);
+  LOG.info("Table option [{}] is reset to {} because of {}",
+  FlinkOptions.KEYGEN_CLASS_NAME.key(), keyGenName, 
FlinkOptions.KEYGEN_TYPE);

Review Comment:
   Can we fix StreamerUtil line 218 instead? Just giving a default value of `SimpleAvroKeyGenerator` should be enough; this is a table config, and the write config key gen type could overwrite it.
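
   A rough sketch of that suggestion (assumed shape and location, not the merged change):

   ```java
   import org.apache.flink.configuration.Configuration;
   import org.apache.hudi.configuration.FlinkOptions;
   import org.apache.hudi.keygen.SimpleAvroKeyGenerator;

   public class KeyGenDefaultSketch {
     // Fall back to SimpleAvroKeyGenerator when no key generator class is set;
     // a write-config keygen type can still overwrite this table config later.
     static String resolveKeyGenClass(Configuration conf) {
       return conf.getOptional(FlinkOptions.KEYGEN_CLASS_NAME)
           .orElse(SimpleAvroKeyGenerator.class.getName());
     }
   }
   ```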



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #6333: [HUDI-4571] Fix partition extractor infer function when partition field mismatch

2022-08-08 Thread GitBox


xushiyan commented on code in PR #6333:
URL: https://github.com/apache/hudi/pull/6333#discussion_r940886446


##
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieSyncConfig.java:
##
@@ -72,31 +75,38 @@ public class HoodieSyncConfig extends HoodieConfig {
   public static final ConfigProperty META_SYNC_TABLE_NAME = 
ConfigProperty
   .key("hoodie.datasource.hive_sync.table")
   .defaultValue("unknown")
-  .withInferFunction(cfg -> 
Option.ofNullable(cfg.getString(HOODIE_WRITE_TABLE_NAME_KEY))
-  .or(() -> Option.ofNullable(cfg.getString(HOODIE_TABLE_NAME_KEY
+  .withInferFunction(cfg -> 
Option.ofNullable(cfg.getString(HOODIE_TABLE_NAME_KEY))
+  .or(() -> 
Option.ofNullable(cfg.getString(HOODIE_WRITE_TABLE_NAME_KEY
   .withDocumentation("The name of the destination table that we should 
sync the hudi table to.");
 
   public static final ConfigProperty META_SYNC_BASE_FILE_FORMAT = 
ConfigProperty
   .key("hoodie.datasource.hive_sync.base_file_format")
   .defaultValue("PARQUET")
-  .withInferFunction(cfg -> 
Option.ofNullable(cfg.getString(HoodieTableConfig.BASE_FILE_FORMAT)))
+  .withInferFunction(cfg -> 
Option.ofNullable(cfg.getString(BASE_FILE_FORMAT)))
   .withDocumentation("Base file format for the sync.");
 
   public static final ConfigProperty META_SYNC_PARTITION_FIELDS = 
ConfigProperty
   .key("hoodie.datasource.hive_sync.partition_fields")
   .defaultValue("")
-  .withInferFunction(cfg -> 
Option.ofNullable(cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME)))
+  .withInferFunction(cfg -> 
Option.ofNullable(cfg.getString(PARTITION_FIELDS))
+  .or(() -> 
Option.ofNullable(cfg.getString(PARTITIONPATH_FIELD_NAME
   .withDocumentation("Field in the table to use for determining hive 
partition columns.");
 
   public static final ConfigProperty 
META_SYNC_PARTITION_EXTRACTOR_CLASS = ConfigProperty
   .key("hoodie.datasource.hive_sync.partition_extractor_class")
   .defaultValue("org.apache.hudi.hive.MultiPartKeysValueExtractor")
   .withInferFunction(cfg -> {
-if 
(StringUtils.nonEmpty(cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME)))
 {
-  int numOfPartFields = 
cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME).split(",").length;
+Option partitionFieldsOpt = 
Option.ofNullable(cfg.getString(PARTITION_FIELDS))

Review Comment:
   Deltastreamer does not load table props; instead it persists writer props into
   table props in `org.apache.hudi.utilities.deltastreamer.DeltaSync#refreshTimeline`,
   so the writer props are the source of truth. Reading table props or not is fine
   either way. As mentioned by @codope, MultiPartKeysValueExtractor handles the
   non-partitioned case too.
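   
   For illustration, how the new fallback chain resolves (values are hypothetical):
   
   // Table config wins when present; otherwise fall back to the writer config.
   Option<String> fields = Option.ofNullable(cfg.getString(PARTITION_FIELDS))     // e.g. "dt,hh"
       .or(() -> Option.ofNullable(cfg.getString(PARTITIONPATH_FIELD_NAME)));     // e.g. "dt"
   // fields resolves to "dt,hh" when both are set, and to "dt" when only the
   // writer config is set.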



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


danny0405 commented on PR #6337:
URL: https://github.com/apache/hudi/pull/6337#issuecomment-1208898309

   How much do we gain from this change? I guess the checkpoint lock would not 
be contended frequently?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4580) [DOCS] Update quickstart: Spark SQL create table statement fails with "partitioned by"

2022-08-08 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4580:
--
Component/s: docs
 spark-sql

> [DOCS] Update quickstart: Spark SQL create table statement fails with 
> "partitioned by"
> --
>
> Key: HUDI-4580
> URL: https://issues.apache.org/jira/browse/HUDI-4580
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: docs, spark-sql
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Spark 3.2.2, Hudi master
> Steps to reproduce
> {code:java}
> Spark shell
> export SPARK_HOME=/Users/ethan/Work/lib/spark-3.2.2-bin-hadoop3.2
> spark-3.2.2-bin-hadoop3.2/bin/spark-shell \
>   --master local[6] \
>   --driver-memory 5g \
>   --num-executors 6 --executor-cores 1 \
>   --executor-memory 1g \
>   --conf spark.ui.port= \
>   --conf spark.driver.maxResultSize=1g \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
>  \
>   --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
>   --conf spark.sql.catalogImplementation=in-memory \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --jars 
> $HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
>   
> Prepare dataset in spark shell
> // spark-shell
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val tableName = "hudi_trips_cow"
> val basePath = "file:///tmp/hudi_trips_cow"
> val dataGen = new DataGenerator
> val inserts = convertToStringList(dataGen.generateInserts(10))
> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
> df.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option(TABLE_NAME, tableName).
>   mode(Overwrite).
>   save(basePath)
>   
> Spark SQL
> spark-3.2.2-bin-hadoop3.2/bin/spark-sql \
>   --master local[6] \
>   --driver-memory 5g \
>   --num-executors 6 --executor-cores 1 \
>   --executor-memory 1g \
>   --conf spark.ui.port= \
>   --conf spark.driver.maxResultSize=1g \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
>  \
>   --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
>   --conf spark.sql.catalogImplementation=in-memory \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --jars 
> $HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
>  
> spark-sql> create table hudi_trips_cow_ext using hudi
>  > partitioned by (partitionpath)
>  > location 'file:///tmp/hudi_trips_cow';
> Error in query: It is not allowed to specify partition columns when the table 
> schema is not defined. When the table schema is not provided, schema and 
> partition columns will be inferred.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4580) [DOCS] Update quickstart: Spark SQL create table statement fails with "partitioned by"

2022-08-08 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4580:
--
Summary: [DOCS] Update quickstart: Spark SQL create table statement fails 
with "partitioned by"  (was: Spark SQL create table statement fails with 
"partitioned by")

> [DOCS] Update quickstart: Spark SQL create table statement fails with 
> "partitioned by"
> --
>
> Key: HUDI-4580
> URL: https://issues.apache.org/jira/browse/HUDI-4580
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Spark 3.2.2, Hudi master
> Steps to reproduce
> {code:java}
> Spark shell
> export SPARK_HOME=/Users/ethan/Work/lib/spark-3.2.2-bin-hadoop3.2
> spark-3.2.2-bin-hadoop3.2/bin/spark-shell \
>   --master local[6] \
>   --driver-memory 5g \
>   --num-executors 6 --executor-cores 1 \
>   --executor-memory 1g \
>   --conf spark.ui.port= \
>   --conf spark.driver.maxResultSize=1g \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
>  \
>   --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
>   --conf spark.sql.catalogImplementation=in-memory \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --jars 
> $HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
>   
> Prepare dataset in spark shell
> // spark-shell
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val tableName = "hudi_trips_cow"
> val basePath = "file:///tmp/hudi_trips_cow"
> val dataGen = new DataGenerator
> val inserts = convertToStringList(dataGen.generateInserts(10))
> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
> df.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option(TABLE_NAME, tableName).
>   mode(Overwrite).
>   save(basePath)
>   
> Spark SQL
> spark-3.2.2-bin-hadoop3.2/bin/spark-sql \
>   --master local[6] \
>   --driver-memory 5g \
>   --num-executors 6 --executor-cores 1 \
>   --executor-memory 1g \
>   --conf spark.ui.port= \
>   --conf spark.driver.maxResultSize=1g \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
>  \
>   --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
>   --conf spark.sql.catalogImplementation=in-memory \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --jars 
> $HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
>  
> spark-sql> create table hudi_trips_cow_ext using hudi
>  > partitioned by (partitionpath)
>  > location 'file:///tmp/hudi_trips_cow';
> Error in query: It is not allowed to specify partition columns when the table 
> schema is not defined. When the table schema is not provided, schema and 
> partition columns will be inferred.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4580) Spark SQL create table statement fails with "partitioned by"

2022-08-08 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4580:
--
Priority: Blocker  (was: Major)

> Spark SQL create table statement fails with "partitioned by"
> 
>
> Key: HUDI-4580
> URL: https://issues.apache.org/jira/browse/HUDI-4580
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Spark 3.2.2, Hudi master
> Steps to reproduce
> {code:java}
> Spark shell
> export SPARK_HOME=/Users/ethan/Work/lib/spark-3.2.2-bin-hadoop3.2
> spark-3.2.2-bin-hadoop3.2/bin/spark-shell \
>   --master local[6] \
>   --driver-memory 5g \
>   --num-executors 6 --executor-cores 1 \
>   --executor-memory 1g \
>   --conf spark.ui.port= \
>   --conf spark.driver.maxResultSize=1g \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
>  \
>   --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
>   --conf spark.sql.catalogImplementation=in-memory \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --jars 
> $HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
>   
> Prepare dataset in spark shell
> // spark-shell
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val tableName = "hudi_trips_cow"
> val basePath = "file:///tmp/hudi_trips_cow"
> val dataGen = new DataGenerator
> val inserts = convertToStringList(dataGen.generateInserts(10))
> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
> df.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option(TABLE_NAME, tableName).
>   mode(Overwrite).
>   save(basePath)
>   
> Spark SQL
> spark-3.2.2-bin-hadoop3.2/bin/spark-sql \
>   --master local[6] \
>   --driver-memory 5g \
>   --num-executors 6 --executor-cores 1 \
>   --executor-memory 1g \
>   --conf spark.ui.port= \
>   --conf spark.driver.maxResultSize=1g \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
>  \
>   --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
>   --conf spark.sql.catalogImplementation=in-memory \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --jars 
> $HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
>  
> spark-sql> create table hudi_trips_cow_ext using hudi
>  > partitioned by (partitionpath)
>  > location 'file:///tmp/hudi_trips_cow';
> Error in query: It is not allowed to specify partition columns when the table 
> schema is not defined. When the table schema is not provided, schema and 
> partition columns will be inferred.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-4580) Spark SQL create table statement fails with "partitioned by"

2022-08-08 Thread Sagar Sumit (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577130#comment-17577130
 ] 

Sagar Sumit commented on HUDI-4580:
---

Thanks for highlighting the issue. The schema validation is intended; see 
[https://github.com/apache/hudi/pull/4584/files#diff-22e36b6bde89ee0d7e76ec0760025e27401bdca102d911162b0f48de3881945a].
We should update the quickstart guide instead of going back to the previous 
logic. I've updated the ticket accordingly.
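
For instance, the quickstart could either spell out the schema or simply drop the "partitioned by" clause and let schema and partition columns be inferred from the existing table. A sketch based on the error message above, not the final doc wording:

spark-sql> create table hudi_trips_cow_ext using hudi
         > location 'file:///tmp/hudi_trips_cow';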

> Spark SQL create table statement fails with "partitioned by"
> 
>
> Key: HUDI-4580
> URL: https://issues.apache.org/jira/browse/HUDI-4580
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.12.0
>
>
> Spark 3.2.2, Hudi master
> Steps to reproduce
> {code:java}
> Spark shell
> export SPARK_HOME=/Users/ethan/Work/lib/spark-3.2.2-bin-hadoop3.2
> spark-3.2.2-bin-hadoop3.2/bin/spark-shell \
>   --master local[6] \
>   --driver-memory 5g \
>   --num-executors 6 --executor-cores 1 \
>   --executor-memory 1g \
>   --conf spark.ui.port= \
>   --conf spark.driver.maxResultSize=1g \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
>  \
>   --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
>   --conf spark.sql.catalogImplementation=in-memory \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --jars 
> $HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
>   
> Prepare dataset in spark shell
> // spark-shell
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val tableName = "hudi_trips_cow"
> val basePath = "file:///tmp/hudi_trips_cow"
> val dataGen = new DataGenerator
> val inserts = convertToStringList(dataGen.generateInserts(10))
> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
> df.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option(TABLE_NAME, tableName).
>   mode(Overwrite).
>   save(basePath)
>   
> Spark SQL
> spark-3.2.2-bin-hadoop3.2/bin/spark-sql \
>   --master local[6] \
>   --driver-memory 5g \
>   --num-executors 6 --executor-cores 1 \
>   --executor-memory 1g \
>   --conf spark.ui.port= \
>   --conf spark.driver.maxResultSize=1g \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
>  \
>   --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
>   --conf spark.sql.catalogImplementation=in-memory \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --jars 
> $HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
>  
> spark-sql> create table hudi_trips_cow_ext using hudi
>  > partitioned by (partitionpath)
>  > location 'file:///tmp/hudi_trips_cow';
> Error in query: It is not allowed to specify partition columns when the table 
> schema is not defined. When the table schema is not provided, schema and 
> partition columns will be inferred.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4579) [DOCS] Add docs on manually upgrading and downgrading table through CLI

2022-08-08 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-4579:
-

Assignee: Sagar Sumit

> [DOCS] Add docs on manually upgrading and downgrading table through CLI
> ---
>
> Key: HUDI-4579
> URL: https://issues.apache.org/jira/browse/HUDI-4579
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli, docs
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Docs for the upgrade and downgrade commands in the Hudi CLI are missing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4579) [DOCS] Add docs on manually upgrading and downgrading table through CLI

2022-08-08 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4579:
--
Priority: Blocker  (was: Major)

> [DOCS] Add docs on manually upgrading and downgrading table through CLI
> ---
>
> Key: HUDI-4579
> URL: https://issues.apache.org/jira/browse/HUDI-4579
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli, docs
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Docs for the upgrade and downgrade commands in the Hudi CLI are missing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4579) [DOCS] Add docs on manually upgrading and downgrading table through CLI

2022-08-08 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4579:
--
Fix Version/s: 0.12.0
   (was: 0.13.0)

> [DOCS] Add docs on manually upgrading and downgrading table through CLI
> ---
>
> Key: HUDI-4579
> URL: https://issues.apache.org/jira/browse/HUDI-4579
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli, docs
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.12.0
>
>
> Docs for the upgrade and downgrade commands in the Hudi CLI are missing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4579) [DOCS] Add docs on manually upgrading and downgrading table through CLI

2022-08-08 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4579:
--
Component/s: cli

> [DOCS] Add docs on manually upgrading and downgrading table through CLI
> ---
>
> Key: HUDI-4579
> URL: https://issues.apache.org/jira/browse/HUDI-4579
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli, docs
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.13.0
>
>
> Docs for the upgrade and downgrade commands in the Hudi CLI are missing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4579) [DOCS] Add docs on manually upgrading and downgrading table through CLI

2022-08-08 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4579:
--
Summary: [DOCS] Add docs on manually upgrading and downgrading table 
through CLI  (was: Add docs on manually upgrading and downgrading table through 
CLI)

> [DOCS] Add docs on manually upgrading and downgrading table through CLI
> ---
>
> Key: HUDI-4579
> URL: https://issues.apache.org/jira/browse/HUDI-4579
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.13.0
>
>
> Docs for the upgrade and downgrade commands in the Hudi CLI are missing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4579) Add docs on manually upgrading and downgrading table through CLI

2022-08-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4579:

Description: Docs for the upgrade and downgrade commands in the Hudi CLI are 
missing.

> Add docs on manually upgrading and downgrading table through CLI
> 
>
> Key: HUDI-4579
> URL: https://issues.apache.org/jira/browse/HUDI-4579
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.13.0
>
>
> Docs for the upgrade and downgrade commands in the Hudi CLI are missing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6324: [HUDI-4561] Improve incremental query using the fileSlice adjacent to read.end-commit

2022-08-08 Thread GitBox


hudi-bot commented on PR #6324:
URL: https://github.com/apache/hudi/pull/6324#issuecomment-1208889122

   
   ## CI report:
   
   * 3c9babcf29cc35f60f211ff5e4e52ddc32887c9d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10674)
 
   * 536d4d41ba51e74fb5e1e04456b45975f4fbba13 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10686)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


hudi-bot commented on PR #6337:
URL: https://github.com/apache/hudi/pull/6337#issuecomment-1208886685

   
   ## CI report:
   
   * 121f67ce16a6b6102fcb7f246ab1c6c6e289ec8f UNKNOWN
   * 1b8a77eec4fb127274bd5ee3b52133a0b26c769d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10684)
 
   * 535a74909d59fa79c6267bc9883165344f7ddef1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10685)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6333: [HUDI-4571] Fix partition extractor infer function when partition field mismatch

2022-08-08 Thread GitBox


hudi-bot commented on PR #6333:
URL: https://github.com/apache/hudi/pull/6333#issuecomment-1208886657

   
   ## CI report:
   
   * e37527b7c57836656d2255967d5781226c5b76c6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10681)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6324: [HUDI-4561] Improve incremental query using the fileSlice adjacent to read.end-commit

2022-08-08 Thread GitBox


hudi-bot commented on PR #6324:
URL: https://github.com/apache/hudi/pull/6324#issuecomment-1208886620

   
   ## CI report:
   
   * 3c9babcf29cc35f60f211ff5e4e52ddc32887c9d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10674)
 
   * 536d4d41ba51e74fb5e1e04456b45975f4fbba13 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5885: [HUDI-3478] Support CDC for Spark in Hudi

2022-08-08 Thread GitBox


hudi-bot commented on PR #5885:
URL: https://github.com/apache/hudi/pull/5885#issuecomment-1208886160

   
   ## CI report:
   
   * b41281075d76c23436f8e77380976009e9450cf2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10582)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


hudi-bot commented on PR #6337:
URL: https://github.com/apache/hudi/pull/6337#issuecomment-1208883935

   
   ## CI report:
   
   * 121f67ce16a6b6102fcb7f246ab1c6c6e289ec8f UNKNOWN
   * 1b8a77eec4fb127274bd5ee3b52133a0b26c769d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10684)
 
   * 535a74909d59fa79c6267bc9883165344f7ddef1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6333: [HUDI-4571] Fix partition extractor infer function when partition field mismatch

2022-08-08 Thread GitBox


hudi-bot commented on PR #6333:
URL: https://github.com/apache/hudi/pull/6333#issuecomment-1208883903

   
   ## CI report:
   
   * e37527b7c57836656d2255967d5781226c5b76c6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6318: [HUDI-4577] Adding test coverage for `DELETE FROM`, Spark Quickstart guide

2022-08-08 Thread GitBox


hudi-bot commented on PR #6318:
URL: https://github.com/apache/hudi/pull/6318#issuecomment-1208883815

   
   ## CI report:
   
   * 25172a471763150066600d2d0d01d21f4e8d2df7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10682)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5885: [HUDI-3478] Support CDC for Spark in Hudi

2022-08-08 Thread GitBox


hudi-bot commented on PR #5885:
URL: https://github.com/apache/hudi/pull/5885#issuecomment-1208883214

   
   ## CI report:
   
   * b41281075d76c23436f8e77380976009e9450cf2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10582)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4580) Spark SQL create table statement fails with "partitioned by"

2022-08-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4580:

Description: 
Spark 3.2.2, Hudi master

Steps to reproduce
{code:java}
Spark shell

export SPARK_HOME=/Users/ethan/Work/lib/spark-3.2.2-bin-hadoop3.2
spark-3.2.2-bin-hadoop3.2/bin/spark-shell \
  --master local[6] \
  --driver-memory 5g \
  --num-executors 6 --executor-cores 1 \
  --executor-memory 1g \
  --conf spark.ui.port= \
  --conf spark.driver.maxResultSize=1g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf 
'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
 \
  --conf 
'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf spark.sql.catalogImplementation=in-memory \
  --conf 
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
 \
  --jars 
$HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
  
Prepare dataset in spark shell

// spark-shell
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
  
Spark SQL

spark-3.2.2-bin-hadoop3.2/bin/spark-sql \
  --master local[6] \
  --driver-memory 5g \
  --num-executors 6 --executor-cores 1 \
  --executor-memory 1g \
  --conf spark.ui.port= \
  --conf spark.driver.maxResultSize=1g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf 
'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
 \
  --conf 
'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf spark.sql.catalogImplementation=in-memory \
  --conf 
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
 \
  --jars 
$HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
 

spark-sql> create table hudi_trips_cow_ext using hudi
 > partitioned by (partitionpath)
 > location 'file:///tmp/hudi_trips_cow';
Error in query: It is not allowed to specify partition columns when the table 
schema is not defined. When the table schema is not provided, schema and 
partition columns will be inferred.{code}

> Spark SQL create table statement fails with "partitioned by"
> 
>
> Key: HUDI-4580
> URL: https://issues.apache.org/jira/browse/HUDI-4580
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.12.0
>
>
> Spark 3.2.2, Hudi master
> Steps to reproduce
> {code:java}
> Spark shell
> export SPARK_HOME=/Users/ethan/Work/lib/spark-3.2.2-bin-hadoop3.2
> spark-3.2.2-bin-hadoop3.2/bin/spark-shell \
>   --master local[6] \
>   --driver-memory 5g \
>   --num-executors 6 --executor-cores 1 \
>   --executor-memory 1g \
>   --conf spark.ui.port= \
>   --conf spark.driver.maxResultSize=1g \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
>  \
>   --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
>   --conf spark.sql.catalogImplementation=in-memory \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --jars 
> $HUDI_DIR/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar
>   
> Prepare dataset in spark shell
> // spark-shell
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val tableName = "hudi_trips_cow"
> val basePath = "file:///tmp/hudi_trips_cow"
> val dataGen = new DataGenerator
> val inserts = convertToStringList(dataGen.generateInserts(10))
> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
> df.write.format("hudi").
>   options

[jira] [Updated] (HUDI-4580) Spark SQL create table statement fails with "partitioned by"

2022-08-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4580:

Fix Version/s: 0.12.0

> Spark SQL create table statement fails with "partitioned by"
> 
>
> Key: HUDI-4580
> URL: https://issues.apache.org/jira/browse/HUDI-4580
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4580) Spark SQL create table statement fails with "partitioned by"

2022-08-08 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-4580:
---

 Summary: Spark SQL create table statement fails with "partitioned 
by"
 Key: HUDI-4580
 URL: https://issues.apache.org/jira/browse/HUDI-4580
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4580) Spark SQL create table statement fails with "partitioned by"

2022-08-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-4580:
---

Assignee: Sagar Sumit

> Spark SQL create table statement fails with "partitioned by"
> 
>
> Key: HUDI-4580
> URL: https://issues.apache.org/jira/browse/HUDI-4580
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] wuwenchi commented on a diff in pull request #6320: [HUDI-4558] lost 'hoodie.table.keygenerator.class' in hoodie.properties

2022-08-08 Thread GitBox


wuwenchi commented on code in PR #6320:
URL: https://github.com/apache/hudi/pull/6320#discussion_r940866883


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -875,6 +883,33 @@ public static <T> boolean isDefaultValueDefined(Configuration conf, ConfigOption<T> option) {
         || conf.get(option).equals(option.defaultValue());
   }
 
+  public static String getKeyGenClassNameByType(Configuration conf) {
+    String genType = conf.get(FlinkOptions.KEYGEN_TYPE);

Review Comment:
   Without this handling, `KEYGEN_CLASS_NAME` is never assigned, and no
   KEYGEN_CLASS_NAME property ends up in the table properties.
   If we set a default value for `KEYGEN_CLASS_NAME` as you suggested above, then
   the `KEYGEN_TYPE` parameter becomes ineffective.
   Therefore, I initialize `KEYGEN_CLASS_NAME` from `KEYGEN_TYPE`: the value of
   `KEYGEN_CLASS_NAME` gets saved in the table properties, and `KEYGEN_TYPE`
   still takes effect according to the original logic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies

2022-08-08 Thread GitBox


the-other-tim-brown commented on code in PR #6170:
URL: https://github.com/apache/hudi/pull/6170#discussion_r940865956


##
hudi-spark-datasource/hudi-spark/pom.xml:
##
@@ -267,6 +252,14 @@
   org.apache.logging.log4j
   log4j-1.2-api
 
+

Review Comment:
   Is there a way to know which dependencies each runtime environment will
   provide? My strategy so far has been to include a dependency at compile scope
   whenever the tests were failing at runtime due to logging.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies

2022-08-08 Thread GitBox


the-other-tim-brown commented on code in PR #6170:
URL: https://github.com/apache/hudi/pull/6170#discussion_r940863541


##
hudi-integ-test/pom.xml:
##
@@ -525,24 +519,9 @@
 
   
   
-  org.scalatest

Review Comment:
   I moved the configuration up to the parent pom so we don't need to define it
   in every module that wants to use the plugin. This also lets us specify the
   logging configuration once instead of hoping each module sets it properly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies

2022-08-08 Thread GitBox


the-other-tim-brown commented on code in PR #6170:
URL: https://github.com/apache/hudi/pull/6170#discussion_r940863124


##
hudi-client/hudi-client-common/pom.xml:
##
@@ -193,6 +193,12 @@
 
 
 
+
+  org.apache.hudi
+  hudi-tests-common
+  ${project.version}
+  test
+
 
   org.junit.jupiter

Review Comment:
   Do you know why junit-vintage-engine is being used, by any chance? Do we still
   need to support running JUnit 4 tests in some modules?
   
   I can add junit-jupiter-api and junit-jupiter-engine to the common test module.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies

2022-08-08 Thread GitBox


the-other-tim-brown commented on code in PR #6170:
URL: https://github.com/apache/hudi/pull/6170#discussion_r940862177


##
hudi-examples/hudi-examples-spark/pom.xml:
##
@@ -230,6 +230,27 @@
 
 
 
+
+
+org.apache.logging.log4j

Review Comment:
   This log4j-1.2-api jar specifically contains implementations of the log4j 1.x
   APIs that make legacy code compatible with log4j2. I don't expect Spark to
   provide it for us.
   
   The other dependencies below are either logging APIs, which we should declare
   if we directly rely on them, or bridges that were required to make the
   examples log properly.
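   
   To make that concrete, a hedged example: code written against the log4j 1.x
   API keeps working unchanged once log4j-1.2-api routes it to a log4j2 backend.
   
   import org.apache.log4j.Logger;  // 1.x API, supplied by the log4j-1.2-api bridge
   
   public class LegacyLogging {
     private static final Logger LOG = Logger.getLogger(LegacyLogging.class);
   
     public static void main(String[] args) {
       // Dispatched to the log4j2 core through the bridge at runtime.
       LOG.info("hello from the legacy API");
     }
   }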



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies

2022-08-08 Thread GitBox


the-other-tim-brown commented on code in PR #6170:
URL: https://github.com/apache/hudi/pull/6170#discussion_r940860174


##
docker/demo/config/log4j2.properties:
##
@@ -0,0 +1,60 @@
+###
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+###
+status = warn
+name = HudiConsoleLog
+
+# Set everything to be logged to the console
+appender.console.type = Console
+appender.console.name = CONSOLE
+appender.console.layout.type = PatternLayout
+appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
+
+# Root logger level
+rootLogger.level = warn
+# Root logger referring to console appender
+rootLogger.appenderRef.stdout.ref = CONSOLE

Review Comment:
   I think it will fallback to stdout



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies

2022-08-08 Thread GitBox


the-other-tim-brown commented on code in PR #6170:
URL: https://github.com/apache/hudi/pull/6170#discussion_r940858499


##
docker/demo/config/log4j2.properties:
##
@@ -0,0 +1,60 @@
+###
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+###
+status = warn
+name = HudiConsoleLog
+
+# Set everything to be logged to the console
+appender.console.type = Console
+appender.console.name = CONSOLE
+appender.console.layout.type = PatternLayout
+appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
+
+# Root logger level
+rootLogger.level = warn
+# Root logger referring to console appender
+rootLogger.appenderRef.stdout.ref = CONSOLE
+
+# Set the default spark-shell log level to WARN. When running the spark-shell, the
+# log level for this class is used to overwrite the root logger's log level, so that
+# the user can have different defaults for the shell and regular Spark apps.
+logger.apache_spark_repl.name = org.apache.spark.repl.Main
+logger.apache_spark_repl.level = warn
+# Set logging of integration testsuite to INFO level
+logger.hudi_integ.name = org.apache.hudi.integ.testsuite
+logger.hudi_integ.level = info
+# Settings to quiet third party logs that are too verbose
+logger.apache_spark_jetty.name = org.spark_project.jetty
+logger.apache_spark_jetty.level = warn
+logger.apache_spark_jett_lifecycle.name = org.spark_project.jetty.util.component.AbstractLifeCycle

Review Comment:
   Personally, I'm not a fan of using the `.properties` format. The yaml format 
is my preference and even the xml seems a bit more natural than the syntax here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-3958) Resolve parquet-avro conflict in hudi-gcp-bundle and hudi-spark3.1-bundle

2022-08-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3958.

Fix Version/s: (was: 0.12.0)
   Resolution: Not A Problem

Cannot reproduce this with the latest master version. It was probably caused by 
misusing a jar from a local build.

> Resolve parquet-avro conflict in hudi-gcp-bundle and hudi-spark3.1-bundle
> -
>
> Key: HUDI-3958
> URL: https://issues.apache.org/jira/browse/HUDI-3958
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>
> In gcp bundle (master version) we include parquet-avro, which results in 
> issue running in dataproc 2.0.34-ubuntu18 with spark3.1-bundle and 
> utilities-slim bundle
> {code:text}
> 22/04/23 15:02:14 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 
> 0.0 in stage 36.0 (TID 93) 
> (cluster-4275-m.asia-southeast1-a.c.hudi-bq.internal executor 1): 
> java.lang.RuntimeException: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: 
> org/apache/parquet/schema/LogicalTypeAnnotation$UUIDLogicalTypeAnnotation
>   at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: 
> org/apache/parquet/schema/LogicalTypeAnnotation$UUIDLogicalTypeAnnotation
>   at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:94)
>   at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:37)
>   at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>   ... 22 more
> Caused by: org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: 
> org/apache/parquet/schema/LogicalTypeAnnotation$UUIDLogicalTypeAnnotation
>   at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:160)
>   at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:90)
>   ... 24 more
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.NoClassDefFoundError: 
> org/apache/parquet/schema/LogicalTypeAnnotation$UUIDLogicalTypeAnnotation
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:154)
>   ... 25 more
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/parquet/schema/LogicalTypeAnnotation$UUIDLogicalTypeAnnotation
>   at 
> 

[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies

2022-08-08 Thread GitBox


the-other-tim-brown commented on code in PR #6170:
URL: https://github.com/apache/hudi/pull/6170#discussion_r940855205


##
.github/workflows/bot.yml:
##
@@ -9,6 +9,8 @@ on:
 branches:
   - master
   - 'release-*'
+env:
+  MVN_ARGS: -ntp -B -V -Pwarn-log -Dorg.slf4j.simpleLogger.log.org.apache.maven.plugins.shade=warn -Dorg.slf4j.simpleLogger.log.org.apache.maven.plugins.dependency=warn

Review Comment:
   For the shade and dependency plugins, I unfortunately didn't find a way to
   specify the log level in the plugin configuration. We could put these
   overrides in a basic logging configuration file that Maven reads, though.
   
   For the `-ntp -B -V`, that will need to stay here.
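   
   For example (a sketch, not part of this PR), the two log-level overrides could
   live in a `.mvn/maven.config` file, which Maven 3.3+ picks up automatically,
   one argument per line:
   
   -Dorg.slf4j.simpleLogger.log.org.apache.maven.plugins.shade=warn
   -Dorg.slf4j.simpleLogger.log.org.apache.maven.plugins.dependency=warn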



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4579) Add docs on manually upgrading and downgrading table through CLI

2022-08-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4579:

Component/s: docs

> Add docs on manually upgrading and downgrading table through CLI
> 
>
> Key: HUDI-4579
> URL: https://issues.apache.org/jira/browse/HUDI-4579
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4579) Add docs on manually upgrading and downgrading table through CLI

2022-08-08 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4579:

Fix Version/s: 0.13.0

> Add docs on manually upgrading and downgrading table through CLI
> 
>
> Key: HUDI-4579
> URL: https://issues.apache.org/jira/browse/HUDI-4579
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4579) Add docs on manually upgrading and downgrading table through CLI

2022-08-08 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-4579:
---

 Summary: Add docs on manually upgrading and downgrading table 
through CLI
 Key: HUDI-4579
 URL: https://issues.apache.org/jira/browse/HUDI-4579
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] BruceKellan commented on a diff in pull request #6329: [HUDI-4570] Fix hive sync path error due to reuse of storage descript…

2022-08-08 Thread GitBox


BruceKellan commented on code in PR #6329:
URL: https://github.com/apache/hudi/pull/6329#discussion_r940851172


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java:
##
@@ -225,8 +225,9 @@ public void updatePartitionsToTable(String tableName, List<String> changedPartitions) {
         String fullPartitionPath = StorageSchemes.HDFS.getScheme().equals(partitionScheme)
             ? FSUtils.getDFSFullPartitionPath(syncConfig.getHadoopFileSystem(), partitionPath) : partitionPath.toString();
         List<String> partitionValues = partitionValueExtractor.extractPartitionValuesInPath(partition);
-        sd.setLocation(fullPartitionPath);
-        return new Partition(partitionValues, databaseName, tableName, 0, 0, sd, null);
+        StorageDescriptor partitionSd = sd.deepCopy();
+        partitionSd.setLocation(fullPartitionPath);

Review Comment:
   It uses the same single storage descriptor instance in a loop and returns
   partitions that all hold a reference to that sd.
   
   So if we update multiple partitions, the sd of every partition is the same
   reference and the partition paths end up wrong.



##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java:
##
@@ -225,8 +225,9 @@ public void updatePartitionsToTable(String tableName, List<String> changedPartitions) {
         String fullPartitionPath = StorageSchemes.HDFS.getScheme().equals(partitionScheme)
             ? FSUtils.getDFSFullPartitionPath(syncConfig.getHadoopFileSystem(), partitionPath) : partitionPath.toString();
         List<String> partitionValues = partitionValueExtractor.extractPartitionValuesInPath(partition);
-        sd.setLocation(fullPartitionPath);
-        return new Partition(partitionValues, databaseName, tableName, 0, 0, sd, null);
+        StorageDescriptor partitionSd = sd.deepCopy();
+        partitionSd.setLocation(fullPartitionPath);

Review Comment:
   It uses the same single storage descriptor instance in a loop and returns
   partitions that all hold a reference to that sd.
   
   So if we update multiple partitions, the sd of every partition is the same
   reference and the paths of some partitions end up wrong.
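   
   A small sketch of the aliasing problem (`values`, `dbName`, `tableName` and
   the paths are placeholders):
   
   StorageDescriptor sd = table.getSd();
   List<Partition> partitions = new ArrayList<>();
   for (String path : Arrays.asList("/tbl/dt=2022-08-07", "/tbl/dt=2022-08-08")) {
     sd.setLocation(path);  // mutates the single shared instance
     partitions.add(new Partition(values, dbName, tableName, 0, 0, sd, null));
   }
   // Both partitions now report location "/tbl/dt=2022-08-08"; taking
   // sd.deepCopy() per partition avoids this.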



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron commented on pull request #5885: [HUDI-3478] Support CDC for Spark in Hudi

2022-08-08 Thread GitBox


YannByron commented on PR #5885:
URL: https://github.com/apache/hudi/pull/5885#issuecomment-1208862487

   @hudi-bot run azure 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] BruceKellan commented on a diff in pull request #6329: [HUDI-4570] Fix hive sync path error due to reuse of storage descript…

2022-08-08 Thread GitBox


BruceKellan commented on code in PR #6329:
URL: https://github.com/apache/hudi/pull/6329#discussion_r940851172


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java:
##
@@ -225,8 +225,9 @@ public void updatePartitionsToTable(String tableName, List<String> changedPartitions
     String fullPartitionPath = StorageSchemes.HDFS.getScheme().equals(partitionScheme)
         ? FSUtils.getDFSFullPartitionPath(syncConfig.getHadoopFileSystem(), partitionPath) : partitionPath.toString();
     List<String> partitionValues = partitionValueExtractor.extractPartitionValuesInPath(partition);
-    sd.setLocation(fullPartitionPath);
-    return new Partition(partitionValues, databaseName, tableName, 0, 0, sd, null);
+    StorageDescriptor partitionSd = sd.deepCopy();
+    partitionSd.setLocation(fullPartitionPath);

Review Comment:
   It reuses the same storage descriptor across the loop and returns partitions that all hold a reference to that one sd.

   So if multiple partitions are updated, the sd of all partitions will be the same object.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


xiarixiaoyao commented on code in PR #6337:
URL: https://github.com/apache/hudi/pull/6337#discussion_r940849240


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java:
##
@@ -195,35 +193,40 @@ public void monitorDirAndForwardSplits(SourceContext<MergeOnReadInputSplit> context) {
       // table does not exist
       return;
     }
+
+    long start = System.currentTimeMillis();
     IncrementalInputSplits.Result result =
         incrementalInputSplits.inputSplits(metaClient, this.hadoopConf, this.issuedInstant);
     if (result.isEmpty()) {
       // no new instants, returns early
       return;
     }
 
-    for (MergeOnReadInputSplit split : result.getInputSplits()) {
-      context.collect(split);
+    LOG.debug(
+        "Discovered {} splits, time elapsed {}ms",
+        result.getInputSplits().size(),
+        System.currentTimeMillis() - start);
+
+    // only need to hold the checkpoint lock when emitting the splits
+    start = System.currentTimeMillis();
+    synchronized (context.getCheckpointLock()) {

Review Comment:
   use checkpointLock ?
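
   For reference, a minimal sketch of the narrowed locking pattern, assuming Flink's SourceFunction.SourceContext API; forwardSplits and the generic split type are stand-ins rather than the actual Hudi method:

       import java.util.List;

       import org.apache.flink.streaming.api.functions.source.SourceFunction;

       class NarrowLockSketch {

         // Split discovery (the slow part) is done before calling this method,
         // so the checkpoint lock is held only for the cheap emission loop.
         static <T> void forwardSplits(SourceFunction.SourceContext<T> context, List<T> splits) {
           synchronized (context.getCheckpointLock()) {
             for (T split : splits) {
               context.collect(split);
             }
           }
         }
       }

   Keeping timeline scanning outside the synchronized block means checkpoint barriers are no longer blocked for the duration of split discovery.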



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


hudi-bot commented on PR #6337:
URL: https://github.com/apache/hudi/pull/6337#issuecomment-1208859863

   
   ## CI report:
   
   * 121f67ce16a6b6102fcb7f246ab1c6c6e289ec8f UNKNOWN
   * 1b8a77eec4fb127274bd5ee3b52133a0b26c769d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10684)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6333: [HUDI-4571] Fix partition extractor infer function when partition field mismatch

2022-08-08 Thread GitBox


hudi-bot commented on PR #6333:
URL: https://github.com/apache/hudi/pull/6333#issuecomment-1208859846

   
   ## CI report:
   
   * 679fd111f43f4eea205f4eee81972f1132a5ae1e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10680)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron commented on a diff in pull request #6264: [HUDI-4503] support for parsing identifier with catalog

2022-08-08 Thread GitBox


YannByron commented on code in PR #6264:
URL: https://github.com/apache/hudi/pull/6264#discussion_r940847876


##
hudi-spark-datasource/hudi-spark3-common/src/main/scala/org/apache/spark/sql/HoodieSpark3CatalystPlanUtils.scala:
##
@@ -52,8 +57,56 @@ abstract class HoodieSpark3CatalystPlanUtils extends HoodieCatalystPlansUtils {
     }
   }
 
-  override def toTableIdentifier(relation: UnresolvedRelation): TableIdentifier = {
-    relation.multipartIdentifier.asTableIdentifier
+  override def resolve(spark: SparkSession, relation: UnresolvedRelation): Option[CatalogTable] = {
+    val catalogManager = spark.sessionState.catalogManager
+    val nameParts = relation.multipartIdentifier
+    val expandedNameParts = expandIdentifier(spark, nameParts)
+    HoodieCatalogAndIdentifier.parse(catalogManager, expandedNameParts) match {
+      case Some((catalog, ident)) =>
+        CatalogV2Util.loadTable(catalog, ident) match {
+          case Some(table) =>
+            table match {
+              case v1Table: V1Table =>
+                Some(v1Table.v1Table)
+              case withFallback: V2TableWithV1Fallback =>
+                Some(withFallback.v1Table)
+              case _ =>
+                logWarning(s"It's not a hoodie table: $table")
+                None
+            }
+          case _ =>
+            logWarning(s"Can not load this catalog and identifier: ${catalog.name()}, $ident")
+            None
+        }
+      case _ =>
+        logWarning(s"Can not parse this name parts: ${expandedNameParts.mkString(",")}")

Review Comment:
   @alexeykudinkin if no other places need to be changed, can we just leave it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron commented on a diff in pull request #6264: [HUDI-4503] support for parsing identifier with catalog

2022-08-08 Thread GitBox


YannByron commented on code in PR #6264:
URL: https://github.com/apache/hudi/pull/6264#discussion_r940847507


##
hudi-spark-datasource/hudi-spark3-common/src/main/scala/org/apache/spark/sql/HoodieSpark3CatalystPlanUtils.scala:
##
@@ -52,8 +57,56 @@ abstract class HoodieSpark3CatalystPlanUtils extends HoodieCatalystPlansUtils {
     }
   }
 
-  override def toTableIdentifier(relation: UnresolvedRelation): TableIdentifier = {
-    relation.multipartIdentifier.asTableIdentifier
+  override def resolve(spark: SparkSession, relation: UnresolvedRelation): Option[CatalogTable] = {

Review Comment:
   Sure. See the user case: https://github.com/apache/hudi/issues/6223.
   If we use a three-part identifier like `catalog.database.table` in Spark 3.x, we need the ability to parse such identifiers.
   And the reason Spark's default `ResolveRelations` can't help here is that the `UnresolvedRelation` has not yet been resolved when the cases/rules in `HoodieAnalysis` are applied.
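
   As a hypothetical illustration of that user case, the snippet below issues a query with a three-part name in Spark 3; the local session settings and the table spark_catalog.default.hudi_tbl are made up and assume such a Hudi table already exists:

       import org.apache.spark.sql.SparkSession;

       class ThreePartNameSketch {
         public static void main(String[] args) {
           SparkSession spark = SparkSession.builder()
               .appName("three-part-name")
               .master("local[1]")
               .getOrCreate();
           // Without catalog-aware parsing, Hudi's analysis rules would only see
           // an UnresolvedRelation for this three-part name and fail to match it
           // to the underlying table.
           spark.sql("SELECT * FROM spark_catalog.default.hudi_tbl").show();
           spark.stop();
         }
       }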



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


hudi-bot commented on PR #6337:
URL: https://github.com/apache/hudi/pull/6337#issuecomment-1208857829

   
   ## CI report:
   
   * 121f67ce16a6b6102fcb7f246ab1c6c6e289ec8f UNKNOWN
   * 1b8a77eec4fb127274bd5ee3b52133a0b26c769d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6333: [HUDI-4571] Fix partition extractor infer function when partition field mismatch

2022-08-08 Thread GitBox


hudi-bot commented on PR #6333:
URL: https://github.com/apache/hudi/pull/6333#issuecomment-1208857814

   
   ## CI report:
   
   * e37527b7c57836656d2255967d5781226c5b76c6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10681)
 
   * 679fd111f43f4eea205f4eee81972f1132a5ae1e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


hudi-bot commented on PR #6337:
URL: https://github.com/apache/hudi/pull/6337#issuecomment-1208855620

   
   ## CI report:
   
   * 121f67ce16a6b6102fcb7f246ab1c6c6e289ec8f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4578) Reduce the scope and duration of holding checkpoint lock in stream read

2022-08-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4578:
-
Labels: pull-request-available  (was: )

>  Reduce the scope and duration of holding checkpoint lock in stream read
> 
>
> Key: HUDI-4578
> URL: https://issues.apache.org/jira/browse/HUDI-4578
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] XuQianJin-Stars opened a new pull request, #6337: [HUDI-4578] Reduce the scope and duration of holding checkpoint lock …

2022-08-08 Thread GitBox


XuQianJin-Stars opened a new pull request, #6337:
URL: https://github.com/apache/hudi/pull/6337

   …in stream read
   
   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


