[GitHub] [hudi] hudi-bot commented on pull request #5744: [HUDI-4139]improvement for flink write operator name to identify tables easily

2022-06-02 Thread GitBox


hudi-bot commented on PR #5744:
URL: https://github.com/apache/hudi/pull/5744#issuecomment-1145644610

   
   ## CI report:
   
   * 4e442fa7861311c3e00623a392c8d8f5c4c99d0c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9057)
 
   * 4d0027edc4d1b3e2be507e3ef15aeb3d90dbb58f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9058)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5744: [HUDI-4139]improvement for flink write operator name to identify tables easily

2022-06-02 Thread GitBox


hudi-bot commented on PR #5744:
URL: https://github.com/apache/hudi/pull/5744#issuecomment-1145636603

   
   ## CI report:
   
   * 4e442fa7861311c3e00623a392c8d8f5c4c99d0c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9057)
 
   * 4d0027edc4d1b3e2be507e3ef15aeb3d90dbb58f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Gatsby-Lee commented on pull request #5112: [HUDI-3638] Make ZookeeperBasedLockProvider serializable

2022-06-02 Thread GitBox


Gatsby-Lee commented on PR #5112:
URL: https://github.com/apache/hudi/pull/5112#issuecomment-1145627513

   @yihua Thank you for the fix 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-4179) spark clustering with sort columns invalid

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-4179.
---
Resolution: Fixed

> spark clustering with sort columns invalid
> --
>
> Key: HUDI-4179
> URL: https://issues.apache.org/jira/browse/HUDI-4179
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> Using HoodieClusteringJob with sort columns is invalid: the data is not sorted by the 
> specified columns when the default sort rule *LINEAR* is used.
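
A minimal sketch of the configuration being discussed, assuming the commonly documented clustering plan strategy sort-columns key; the key string and column names below are assumptions to verify against the Hudi configuration reference for your version:

{code:java}
import java.util.Properties;

// Sketch: configure clustering to sort by specific columns. The config key below is the
// clustering plan strategy sort-columns option as commonly documented; treat it as an
// assumption and confirm against the Hudi configuration reference.
public class ClusteringSortColumnsExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    // "col_a,col_b" are placeholder column names the clustering job should sort by.
    props.setProperty("hoodie.clustering.plan.strategy.sort.columns", "col_a,col_b");
    // These props would then be passed to HoodieClusteringJob; with the default LINEAR
    // layout strategy the rewritten file groups are expected to be sorted by these columns.
    System.out.println(props);
  }
}
{code}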



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4179) spark clustering with sort columns invalid

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4179:

Fix Version/s: 0.11.1

> spark clustering with sort columns invalid
> --
>
> Key: HUDI-4179
> URL: https://issues.apache.org/jira/browse/HUDI-4179
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> Using HoodieClusteringJob with sort columns is invalid: the data is not sorted by the 
> specified columns when the default sort rule *LINEAR* is used.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3670) SqlQueryBasedTransformer leaks temp views in continuous mode

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3670:

Fix Version/s: 0.11.1

> SqlQueryBasedTransformer leaks temp views in continuous mode
> 
>
> Key: HUDI-3670
> URL: https://issues.apache.org/jira/browse/HUDI-3670
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Ji Qi
>Assignee: Ji Qi
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> In the Sql transformers, a new temp view with a random name is created for 
> each incoming batch, but the temp view is never dropped.
> This causes a resource leak in Spark SessionState's Catalog. For long-running 
> deltastreamer jobs, the temp views created cause HiveSessionCatalog to take 
> up a lot of memory.
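
A minimal sketch of the fix direction described above, assuming a Spark SQL transform driven through a per-batch temp view; the class, method, view name, and "<SRC>" placeholder here are illustrative, not the actual Hudi code:

{code:java}
import java.util.UUID;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch: run a SQL transform through a per-batch temp view and drop the view afterwards,
// so a long-running job does not accumulate temp-view entries in the session catalog.
public class TempViewScopedTransform {
  public static Dataset<Row> apply(SparkSession spark, Dataset<Row> batch, String query) {
    String tmpView = "TMP_SRC_" + UUID.randomUUID().toString().replace("-", "_");
    batch.createOrReplaceTempView(tmpView);
    try {
      // spark.sql() analyzes the plan eagerly, so the view can be dropped once it returns.
      return spark.sql(query.replace("<SRC>", tmpView));
    } finally {
      spark.catalog().dropTempView(tmpView); // releases the entry held by the catalog
    }
  }
}
{code}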



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-3670) SqlQueryBasedTransformer leaks temp views in continuous mode

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-3670.
---
Resolution: Fixed

> SqlQueryBasedTransformer leaks temp views in continuous mode
> 
>
> Key: HUDI-3670
> URL: https://issues.apache.org/jira/browse/HUDI-3670
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Ji Qi
>Assignee: Ji Qi
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> In the Sql transformers, a new temp view with a random name is created for 
> each incoming batch, but the temp view is never dropped.
> This causes a resource leak in Spark SessionState's Catalog. For long-running 
> deltastreamer jobs, the temp views created cause HiveSessionCatalog to take 
> up a lot of memory.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-4107) Introduce --sync-tool-classes parameter in HoodieMultiTableDeltaStreamer

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-4107.
---
Resolution: Fixed

> Introduce --sync-tool-classes parameter in HoodieMultiTableDeltaStreamer 
> -
>
> Key: HUDI-4107
> URL: https://issues.apache.org/jira/browse/HUDI-4107
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Kumud Kumar Srivatsava Tirupati
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 0.11.1
>
>
> * HoodieDeltaStreamer added support for --sync-tool-classes to enable meta 
> syncs to multiple providers. The same option is missing in 
> HoodieMultiTableDeltaStreamer



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4149) Drop-Table fails when underlying table directory is broken

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4149:

Fix Version/s: 0.11.1

> Drop-Table fails when underlying table directory is broken
> --
>
> Key: HUDI-4149
> URL: https://issues.apache.org/jira/browse/HUDI-4149
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jin Xing
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> If a hudi table directory is lost on DFS due to some misoperation and cannot be 
> restored, the user can no longer drop the table from the SQL interface. We may provide an 
> easy way for the user to drop the table and do the cleanup work.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-4149) Drop-Table fails when underlying table directory is broken

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-4149.
---
Resolution: Fixed

> Drop-Table fails when underlying table directory is broken
> --
>
> Key: HUDI-4149
> URL: https://issues.apache.org/jira/browse/HUDI-4149
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jin Xing
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> If a hudi table directory is lost on DFS due to some misoperation and cannot be 
> restored, the user can no longer drop the table from the SQL interface. We may provide an 
> easy way for the user to drop the table and do the cleanup work.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4086) thread factory optimization in async service

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4086:

Fix Version/s: 0.11.1

> thread factory optimization in async service
> 
>
> Key: HUDI-4086
> URL: https://issues.apache.org/jira/browse/HUDI-4086
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: scx
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> The old thread factory used anonymous classes and created the same thread 
> name for each thread. We can use an already existing thread 
> factory (org.apache.hudi.common.util.CustomizedThreadFactory) to create threads.
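
A minimal sketch of what such a factory does, assuming the goal is simply distinct, prefixed thread names; the class below is illustrative and not the actual CustomizedThreadFactory implementation:

{code:java}
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a reusable naming thread factory: each created thread gets a distinct,
// prefixed name instead of the identical name produced by a throwaway anonymous factory.
public class NamedThreadFactory implements ThreadFactory {
  private final String prefix;
  private final AtomicLong counter = new AtomicLong();

  public NamedThreadFactory(String prefix) {
    this.prefix = prefix;
  }

  @Override
  public Thread newThread(Runnable runnable) {
    Thread thread = new Thread(runnable, prefix + "-" + counter.incrementAndGet());
    thread.setDaemon(true); // async services typically should not block JVM shutdown
    return thread;
  }
}
{code}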



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3551) Add OCS StorageScheme to support Oracle Cloud

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3551:

Fix Version/s: 0.12.0
   (was: 0.11.1)

> Add OCS StorageScheme to support Oracle Cloud
> -
>
> Key: HUDI-3551
> URL: https://issues.apache.org/jira/browse/HUDI-3551
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: storage-management
>Reporter: Rajesh
>Assignee: Carter Shanklin
>Priority: Minor
>  Labels: Oracle, Storage, pull-request-available
> Fix For: 0.12.0
>
>
> StorageSchemes currently does not support OCS from Oracle Cloud for storage. 
> This will enable Spark jobs to read from and write to OCS. Other integrations 
> with the Metastore and query engines need to be looked at once the basic 
> integration with OCI DataFlow jobs is added.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4160) Make database regex of MaxwellJsonKafkaSourcePostProcessor optional

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4160:

Fix Version/s: 0.11.1
   (was: 0.12.0)

> Make database regex of MaxwellJsonKafkaSourcePostProcessor optional
> ---
>
> Key: HUDI-4160
> URL: https://issues.apache.org/jira/browse/HUDI-4160
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Xianghu Wang
>Assignee: Xianghu Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4151) flink split_reader supports rocksdb

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4151:

Fix Version/s: 0.12.0
   (was: 0.11.1)

> flink split_reader supports rocksdb
> ---
>
> Key: HUDI-4151
> URL: https://issues.apache.org/jira/browse/HUDI-4151
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink-sql
>Reporter: Bo Cui
>Assignee: Bo Cui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-4162) Fix metasync constant mappings

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-4162.
---
Resolution: Fixed

> Fix metasync constant mappings
> --
>
> Key: HUDI-4162
> URL: https://issues.apache.org/jira/browse/HUDI-4162
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: configs
>Reporter: YangXuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4124) Add valid check in Spark Datasource configs

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4124:

Fix Version/s: 0.11.1

> Add valid check in Spark Datasource configs
> ---
>
> Key: HUDI-4124
> URL: https://issues.apache.org/jira/browse/HUDI-4124
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Frank Wong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-4132) Fix target schema w/ delta sync when table is empty and when pulled in data is empty batch

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-4132.
---
Resolution: Fixed

> Fix target schema w/ delta sync when table is empty and when pulled in data 
> is empty batch
> --
>
> Key: HUDI-4132
> URL: https://issues.apache.org/jira/browse/HUDI-4132
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-2207) Support independent flink hudi clustering function

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2207:

Fix Version/s: 0.12.0

> Support independent flink hudi clustering function
> --
>
> Key: HUDI-2207
> URL: https://issues.apache.org/jira/browse/HUDI-2207
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Zhaojing Yu
>Assignee: Zhaojing Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4145) Archives the metadata file in HoodieInstant.State sequence

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4145:

Fix Version/s: (was: 0.12.0)

> Archives the metadata file in HoodieInstant.State sequence
> --
>
> Key: HUDI-4145
> URL: https://issues.apache.org/jira/browse/HUDI-4145
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1, 0.11.0
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> An error reported by the user:
> a file-not-exists exception is thrown for the active timeline,
> !screenshot-1.png|width=681,height=296!
> from this picture, we can see the file was deleted by the archiver:
> !screenshot-2.png|width=681,height=296!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4084) Add support to test any yaml w/ deltastreamer continuous mode and async table services

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4084:

Fix Version/s: (was: 0.12.0)

> Add support to test any yaml w/ deltastreamer continuous mode and async table 
> services
> --
>
> Key: HUDI-4084
> URL: https://issues.apache.org/jira/browse/HUDI-4084
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4134) Fix Method naming consistency issues in FSUtils

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4134:

Fix Version/s: 0.11.1
   (was: 0.12.0)

> Fix Method naming consistency issues in FSUtils
> ---
>
> Key: HUDI-4134
> URL: https://issues.apache.org/jira/browse/HUDI-4134
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality
>Reporter: Heap
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> This is a small problem. In the hudi-common module, the FSUtils method that 
> generates the name for a base file (.parquet) is makeDataFileName, while the 
> method that generates the name for a log file is makeLogFileName. Renaming the 
> former to makeBaseFileName would be more in line with the naming convention and 
> would clearly distinguish it from makeLogFileName.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4129) Initializes a new fs view for WriteProfile#reload

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4129:

Fix Version/s: (was: 0.12.0)

> Initializes a new fs view for WriteProfile#reload
> -
>
> Key: HUDI-4129
> URL: https://issues.apache.org/jira/browse/HUDI-4129
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> To avoid unnecessary #sync of remote timeline service.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4100) CTAS failed to clean up when given an illegal MANAGED table definition

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4100:

Fix Version/s: 0.11.1

> CTAS failed to clean up when given an illegal MANAGED table definition
> --
>
> Key: HUDI-4100
> URL: https://issues.apache.org/jira/browse/HUDI-4100
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jin Xing
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> Currently, HoodieStagedTable#abortStagedChanges cleans up the data path based on the 
> table's location property, which doesn't work for a MANAGED table.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4122) Fix NPE caused by adding kafka nodes

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4122:

Fix Version/s: 0.11.1
   (was: 0.12.0)

> Fix NPE caused by adding kafka nodes
> 
>
> Key: HUDI-4122
> URL: https://issues.apache.org/jira/browse/HUDI-4122
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Xianghu Wang
>Assignee: Xianghu Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> Yesterday, when we added more nodes to the Kafka cluster, some of our hudi tasks 
> failed with an NPE as below:
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen.getNextOffsetRanges(KafkaOffsetGen.java:241)
> at 
> org.apache.hudi.utilities.sources.JsonKafkaSource.fetchNewData(JsonKafkaSource.java:67)
> at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:76)
> at 
> org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInAvroFormat(SourceFormatAdapter.java:63)
> at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:430)
> at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:283)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:641)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[hudi] branch asf-site updated: [MINOR][UI] Show tags when there are tags and align all blogs by their titles (#5726)

2022-06-02 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 51da042741 [MINOR][UI] Show tags when there are tags and align all 
blogs by their titles (#5726)
51da042741 is described below

commit 51da042741d8cd7db31722892d0dd0c65ea0c37c
Author: yadav-jai <97013124+yadav-...@users.noreply.github.com>
AuthorDate: Fri Jun 3 11:23:53 2022 +0530

[MINOR][UI] Show tags when there are tags and align all blogs by their 
titles (#5726)

* Added tags and authors list

* Tags visible only if provided and minor UI changes

* Removed dummy tags
---
 ...e-data-capture-with-debezium-and-apache-hudi.md |  1 +
 ...-building-Lakehouse-Architecture-at-Halodoc.mdx |  1 +
 ...odal-Index-for-the-Lakehouse-in-Apache-Hudi.mdx |  1 +
 website/src/css/custom.css | 20 +++--
 website/src/theme/BlogPostItem/index.js| 91 ++
 website/src/theme/BlogPostItem/styles.module.css   | 82 +++
 6 files changed, 173 insertions(+), 23 deletions(-)

diff --git 
a/website/blog/2022-01-14-change-data-capture-with-debezium-and-apache-hudi.md 
b/website/blog/2022-01-14-change-data-capture-with-debezium-and-apache-hudi.md
index fc87dc04d0..c92880e2aa 100644
--- 
a/website/blog/2022-01-14-change-data-capture-with-debezium-and-apache-hudi.md
+++ 
b/website/blog/2022-01-14-change-data-capture-with-debezium-and-apache-hudi.md
@@ -4,6 +4,7 @@ excerpt: "A review of new Debezium source connector for Apache 
Hudi"
 author: Rajesh Mahindra
 category: blog
 image: /assets/images/blog/debezium.png
+
 ---
 
 As of Hudi v0.10.0, we are excited to announce the availability of 
[Debezium](https://debezium.io/) sources for 
[Deltastreamer](https://hudi.apache.org/docs/hoodie_deltastreamer) that provide 
the ingestion of change capture data (CDC) from Postgres and Mysql databases to 
your data lake. For more details, please refer to the original 
[RFC](https://github.com/apache/hudi/blob/master/rfc/rfc-39/rfc-39.md).
diff --git 
a/website/blog/2022-04-04-Key-Learnings-on-Using-Apache-HUDI-in-building-Lakehouse-Architecture-at-Halodoc.mdx
 
b/website/blog/2022-04-04-Key-Learnings-on-Using-Apache-HUDI-in-building-Lakehouse-Architecture-at-Halodoc.mdx
index bcbd27d43a..35a2d78592 100644
--- 
a/website/blog/2022-04-04-Key-Learnings-on-Using-Apache-HUDI-in-building-Lakehouse-Architecture-at-Halodoc.mdx
+++ 
b/website/blog/2022-04-04-Key-Learnings-on-Using-Apache-HUDI-in-building-Lakehouse-Architecture-at-Halodoc.mdx
@@ -5,6 +5,7 @@ authors:
 category: blog
 image: /assets/images/blog/2022-04-04-halodoc-lakehouse-architecture.png
 
+
 ---
 
 import Redirect from '@site/src/components/Redirect';
diff --git 
a/website/blog/2022-05-17-Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi.mdx
 
b/website/blog/2022-05-17-Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi.mdx
index d007cf7a81..cbfa3ffcbe 100644
--- 
a/website/blog/2022-05-17-Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi.mdx
+++ 
b/website/blog/2022-05-17-Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi.mdx
@@ -5,6 +5,7 @@ authors:
 - name: Ethan Guo
 category: blog
 image: /assets/images/blog/2022-05-17-multimodal-index.gif
+
 ---
 
 import Redirect from '@site/src/components/Redirect';
diff --git a/website/src/css/custom.css b/website/src/css/custom.css
index af1d677bc3..cee132bb1b 100644
--- a/website/src/css/custom.css
+++ b/website/src/css/custom.css
@@ -176,6 +176,7 @@ footer .container {
 
 .blog-wrapper .container {
   max-width: 100%;
+  
 }
 
 .who-uses {
@@ -208,16 +209,21 @@ footer .container {
   width: 100%;
 }
 
+
+
 .blog-list-page article {
   display: inline-flex;
-  width: 28%;
+  width: 45%;
   margin: 1.2em;
-  vertical-align: bottom;
+  vertical-align: text-top;
+  
+  
+  
 }
-@media(max-width:1145px){
+@media(max-width:1391px){
   .blog-list-page article {
 display: inline-flex;
-width: 40%;
+width: 80%;
 margin: 1.2em;
 vertical-align: bottom;
   }
@@ -237,8 +243,10 @@ footer .container {
 }
 
 .blogPostTitle_src-theme-BlogPostItem-styles-module{
-  height: 60px;
+  display:inline;
   overflow: hidden;
+ 
+  
 }
 
 h1.blogPostTitle_src-theme-BlogPostItem-styles-module{
@@ -249,5 +257,3 @@ h1.blogPostTitle_src-theme-BlogPostItem-styles-module{
 
 
 
-
-
diff --git a/website/src/theme/BlogPostItem/index.js 
b/website/src/theme/BlogPostItem/index.js
index 508f058365..1478e1861f 100644
--- a/website/src/theme/BlogPostItem/index.js
+++ b/website/src/theme/BlogPostItem/index.js
@@ -16,7 +16,8 @@
  import styles from './styles.module.css';
  import TagsListInline from '@theme/TagsListInline';
  import BlogPostAuthors from '@theme/BlogPostAuthors'; // Very simple 
pluralization: probably good enough for now
- 
+ import Tag from '@theme

[GitHub] [hudi] bhasudha merged pull request #5726: [MINOR][UI] Show tags when there are tags and align all blogs by their titles

2022-06-02 Thread GitBox


bhasudha merged PR #5726:
URL: https://github.com/apache/hudi/pull/5726


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4130) Remove the upgrade/downgrade for flink #initTable

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4130:

Fix Version/s: (was: 0.12.0)

> Remove the upgrade/downgrade for flink #initTable
> -
>
> Key: HUDI-4130
> URL: https://issues.apache.org/jira/browse/HUDI-4130
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.11.0
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4119) the first read result is incorrect when Flink upsert-kafka connector is used in Hudi

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4119:

Fix Version/s: 0.11.1
   (was: 0.11.0)

>  the first read result is incorrect when Flink upsert-kafka connector is 
> used in Hudi 
> 
>
> Key: HUDI-4119
> URL: https://issues.apache.org/jira/browse/HUDI-4119
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: yanxiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
>  The first read result is incorrect when the Flink upsert-kafka connector is 
> used with Hudi.
>  
>  ETL path: flink upsert-kafka connector -> hudi table (MOR table, queried by 
> stream)
>  
> Here is the case:
>  
> 1. First time: write two records with the same primary key into Kafka 
> and insert them into the hudi table. The query result should be three records: 
> +I first record, -U first record, +U second record. But the first time I 
> queried the hudi table, I found that all the data operations were +I: +I first 
> record, +I first record and +I second record, and there was no update 
> operation. 
>  The three +I records affect hudi's subsequent ETL process; the data from the 
> groupBy is inaccurate. 
> 2. Second time: exit the first query and restart the query job on the hudi table; 
> the query results are then normal: +I first record, -U first record, +U second 
> record.
>  
> Reason:
> There is a bug in the program. When no data log file is generated, the 
> schema does not include the column '_hoodie_operation'. Please refer to the 
> following link for details:
> [https://www.jianshu.com/p/29f9ec5e606e]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4116) Unify clustering/compaction related procedures' output type

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4116:

Fix Version/s: 0.11.1

> Unify clustering/compaction related procedures' output type
> ---
>
> Key: HUDI-4116
> URL: https://issues.apache.org/jira/browse/HUDI-4116
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: shibei
>Assignee: shibei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5744: [HUDI-4139]improvement for flink write operator name to identify tables easily

2022-06-02 Thread GitBox


hudi-bot commented on PR #5744:
URL: https://github.com/apache/hudi/pull/5744#issuecomment-1145608386

   
   ## CI report:
   
   * 4e442fa7861311c3e00623a392c8d8f5c4c99d0c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9057)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4111) Bump ANTLR runtime version in Spark 3.x

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4111:

Fix Version/s: 0.11.1

> Bump ANTLR runtime version in Spark 3.x
> ---
>
> Key: HUDI-4111
> URL: https://issues.apache.org/jira/browse/HUDI-4111
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: dzcxzl
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> Spark 3.2 uses ANTLR version 4.8 while Hudi uses 4.7; using the same version avoids 
> ANTLR version-check warnings in the logs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4108) Clean the marker files before starting new flink compaction

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4108:

Fix Version/s: 0.11.1

> Clean the marker files before starting new flink compaction
> ---
>
> Key: HUDI-4108
> URL: https://issues.apache.org/jira/browse/HUDI-4108
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> Caused by: org.apache.hadoop.ipc.RemoteException: 
> /xxx/.hoodie/.temp/20220513175804790/kafka_ts=20220513/0041-c9e0-42a3-b267-28f2ada94f83_1-4-1_20220513175804790.parquet.marker.MERGE
>  for client  already exists



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4109) Copy the old record directly when it is chosen for merging

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4109:

Fix Version/s: (was: 0.12.0)

> Copy the old record directly when it is chosen for merging
> --
>
> Key: HUDI-4109
> URL: https://issues.apache.org/jira/browse/HUDI-4109
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4110) Clean the marker files for flink compaction

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4110:

Fix Version/s: (was: 0.12.0)

> Clean the marker files for flink compaction
> ---
>
> Key: HUDI-4110
> URL: https://issues.apache.org/jira/browse/HUDI-4110
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5744: [HUDI-4139]improvement for flink write operator name to identify tables easily

2022-06-02 Thread GitBox


hudi-bot commented on PR #5744:
URL: https://github.com/apache/hudi/pull/5744#issuecomment-1145606506

   
   ## CI report:
   
   * 4e442fa7861311c3e00623a392c8d8f5c4c99d0c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4087) Support dropping RO and RT table in DropHoodieTableCommand

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4087:

Fix Version/s: (was: 0.12.0)

> Support dropping RO and RT table in DropHoodieTableCommand
> --
>
> Key: HUDI-4087
> URL: https://issues.apache.org/jira/browse/HUDI-4087
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jin Xing
>Assignee: Jin Xing
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> For a MOR table 'mor_simple' with its RO table 'mor_simple_ro' and RT table 
> 'mor_simple_rt', if the user drops 'mor_simple_ro' without purging, there might 
> be an issue. Reproduce as below:
> {code:java}
> 1. Create table as below:
> CREATE TABLE mor_simple (
>   `id` INT,
>   `name` STRING,
>   `price` DOUBLE)
> USING hudi
> location '/user/hive/warehous/mor_simple'
> OPTIONS(
>   'type' = 'mor',
>   'primaryKey' = 'id',
>   'hoodie.table.precombine.field' = 'id'
> ); 
> 2. Trigger hive-sync by inserting values 
>  insert into mor_simple values (1, 'z3', 1)
> 3. 'show tables' -- we will find mor_simple_rt and mor_simple_ro show up;
> 4. 'drop table mor_simple_ro' and 'show tables' -- we will find mor_simple is 
> dropped but mor_simple_ro can never be dropped.{code}
> We might need to refine DropHoodieTableCommand and give more consideration to the 
> identifiers of RO/RT tables.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] yanenze commented on a diff in pull request #5661: [HUDI-4139] improvement for flink sink operators so we can easily identify the table which write to hudi

2022-06-02 Thread GitBox


yanenze commented on code in PR #5661:
URL: https://github.com/apache/hudi/pull/5661#discussion_r888627316


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java:
##
@@ -137,7 +137,7 @@ public static DataStreamSink 
bulkInsert(Configuration conf, RowType rowT
 SortOperatorGen sortOperatorGen = new SortOperatorGen(rowType, 
partitionFields);
 // sort by partition keys
 dataStream = dataStream
-.transform("partition_key_sorter",
+.transform("partition_key_sorter" + ":" + 
conf.getString(FlinkOptions.TABLE_NAME),
 TypeInformation.of(RowData.class),

Review Comment:
   i have created a new PR in https://github.com/apache/hudi/pull/5744



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yanenze commented on a diff in pull request #5661: [HUDI-4139] improvement for flink sink operators so we can easily identify the table which write to hudi

2022-06-02 Thread GitBox


yanenze commented on code in PR #5661:
URL: https://github.com/apache/hudi/pull/5661#discussion_r888627316


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java:
##
@@ -137,7 +137,7 @@ public static DataStreamSink 
bulkInsert(Configuration conf, RowType rowT
 SortOperatorGen sortOperatorGen = new SortOperatorGen(rowType, 
partitionFields);
 // sort by partition keys
 dataStream = dataStream
-.transform("partition_key_sorter",
+.transform("partition_key_sorter" + ":" + 
conf.getString(FlinkOptions.TABLE_NAME),
 TypeInformation.of(RowData.class),

Review Comment:
   i have created a new PR in https://github.com/apache/hudi/pull/5744



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4104) DeltaWriteProfile includes the pending compaction file slice when deciding small buckets

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4104:

Fix Version/s: (was: 0.12.0)

> DeltaWriteProfile includes the pending compaction file slice when deciding 
> small buckets
> 
>
> Key: HUDI-4104
> URL: https://issues.apache.org/jira/browse/HUDI-4104
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4101) BucketIndexPartitioner should take partition path for better dispersion

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4101:

Fix Version/s: (was: 0.12.0)

> BucketIndexPartitioner should take partition path for better dispersion
> ---
>
> Key: HUDI-4101
> URL: https://issues.apache.org/jira/browse/HUDI-4101
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] yanenze opened a new pull request, #5744: [HUDI-4139]improvement for flink write operator name to identify tables easily

2022-06-02 Thread GitBox


yanenze opened a new pull request, #5744:
URL: https://github.com/apache/hudi/pull/5744

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3123) Consistent hashing index for upsert/insert write path

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3123:

Fix Version/s: (was: 0.11.1)

> Consistent hashing index for upsert/insert write path
> -
>
> Key: HUDI-3123
> URL: https://issues.apache.org/jira/browse/HUDI-3123
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: index
>Reporter: Yuwei Xiao
>Assignee: Yuwei Xiao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Basic write path (insert/upsert) implementation of consistent hashing index.
>  
> A framework will be provided for flexibly plugging in different dynamic hashing 
> schemes, e.g., consistent hashing or extendible hashing.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3980) Support kerberos hbase index

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3980:

Fix Version/s: 0.12.0
   (was: 0.11.1)

> Support kerberos hbase index
> ---
>
> Key: HUDI-3980
> URL: https://issues.apache.org/jira/browse/HUDI-3980
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: xi chaomin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] yanenze closed pull request #5661: [HUDI-4139] improvement for flink sink operators so we can easily identify the table which write to hudi

2022-06-02 Thread GitBox


yanenze closed pull request #5661: [HUDI-4139] improvement for flink sink 
operators so we can easily identify the table which write to hudi
URL: https://github.com/apache/hudi/pull/5661


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4072) Clustering fails when there is an empty SCHEMA entry in commit metadata with deltastreamer

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4072:

Fix Version/s: 0.11.1
   (was: 0.12.0)

> Clustering fails when there is an empty SCHEMA entry in commit metadata with 
> deltastreamer
> --
>
> Key: HUDI-4072
> URL: https://issues.apache.org/jira/browse/HUDI-4072
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> When deltastreamer has an empty commit (no records to commit, but the commit has 
> to happen since the checkpoint has changed), we add NULL_SCHEMA or an empty string 
> as the schema in the extra metadata for the commit. This causes issues in follow-up 
> commits when the write client is instantiated from this commit. 
>  
> stacktrace1:
> {code:java}
> Caused by: org.apache.avro.AvroRuntimeException: Not a record: "null"
>   at org.apache.avro.Schema.getFields(Schema.java:279)
>   at 
> org.apache.hudi.avro.HoodieAvroUtils.addMetadataFields(HoodieAvroUtils.java:208)
>   at 
> org.apache.hudi.io.HoodieWriteHandle.(HoodieWriteHandle.java:115)
>   at 
> org.apache.hudi.io.HoodieWriteHandle.(HoodieWriteHandle.java:104)
>   at 
> org.apache.hudi.io.HoodieMergeHandle.(HoodieMergeHandle.java:124)
>   at 
> org.apache.hudi.io.HoodieMergeHandle.(HoodieMergeHandle.java:117)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.getUpdateHandle(BaseSparkCommitActionExecutor.java:376)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:347)
>   at 
> org.apache.hudi.table.action.deltacommit.BaseSparkDeltaCommitActionExecutor.handleUpdate(BaseSparkDeltaCommitActionExecutor.java:80)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:321)
>  {code}
>  
> stacktrace2: 
> {code:java}
> Exception in thread "main" org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: Async clustering failed.  Shutting 
> down Delta Sync...
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:184)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:179)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:530)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieException: Async clustering failed.  Shutting 
> down Delta Sync...
>   at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>   at 
> org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:103)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:182)
>   ... 15 more
> Caused by: org.apache.hudi.exception.HoodieException: Async clustering 
> failed.  Shutting down Delta Sync...
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:690)
>   at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker

[jira] [Updated] (HUDI-4078) BootstrapOperator cannot load all index data

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4078:

Fix Version/s: (was: 0.12.0)

> BootstrapOperator cannot load all index data
> 
>
> Key: HUDI-4078
> URL: https://issues.apache.org/jira/browse/HUDI-4078
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.9.0, 0.11.0
>Reporter: Bo Cui
>Assignee: Bo Cui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> The BootstrapOperator cannot obtain all the parquet and log files from 
> hoodieTable#getSliceView()#getLatestFileSlicesBeforeOrOn
> Procedure:
> 1) write 10k records to the HUDI table by stream mode.
> create table() with (
>  'table.type' = 'MERGE_ON_READ',
>  'index.bootstrap.enabled' =  'true',
>  'archive.max_commits' = '4200',
> 'archive.min_commits' = '4000',
> 'clean.retain_commits' = '3999', 
> ...
> )
> 2) stop job, and delete the last compaction commit, like 
> `.hoodie/20220505131426.commit`
> 3) restart job without chk/savepoint and not write data.
> 4)  Observe how much index data is loaded to the bootstrapOperator.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4018) Prepare minimal set of yamls to be tested against any write mode and against any query engine

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4018:

Fix Version/s: 0.11.1
   (was: 0.12.0)

> Prepare minimal set of yamls to be tested against any write mode and against 
> any query engine
> -
>
> Key: HUDI-4018
> URL: https://issues.apache.org/jira/browse/HUDI-4018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Prepare a minimal set of 5 to 8 yamls that can be used against any write mode, 
> any query engine, and any table type. 
>  
> For example:
> let's say we come up with 6 yamls covering all cases. 
> The same set should work for all possible combinations from the categories below. 
>  
> Table type: 
> COW/MOR
> Metadata:
> enable/disable
> Dataset type:
> partitioned/non-partitioned
> Write mode:
> delta streamer, spark datasource, spark sql, spark streaming sink
>  
> Query engine: 
> spark datasource, hive, presto, trino
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4027) add support to test non-core write operations (insert overwrite, delete partitions) to integ test framework

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4027:

Fix Version/s: 0.11.1
   (was: 0.12.0)

> add support to test non-core write operations (insert overwrite, delete 
> partitions) to integ test framework
> ---
>
> Key: HUDI-4027
> URL: https://issues.apache.org/jira/browse/HUDI-4027
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.11.1
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> we need support for testing non-core operations. 
> insert overwrite
> insert overwrite table
> delete partitions
>  
> spark-datasource writes
> spark-sql writes. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4038) Avoid invoking `getDataSize` in the hot-path

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4038:

Fix Version/s: (was: 0.12.0)

> Avoid invoking `getDataSize` in the hot-path
> 
>
> Key: HUDI-4038
> URL: https://issues.apache.org/jira/browse/HUDI-4038
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> `getDataSize` has non-trivial overhead of traversing already encoded Column 
> Groups stored in memory. We should sample its invocations to amortize its 
> costs.
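
One way to read "sample its invocations": call the expensive size check only every N records and reuse the last value in between. A minimal sketch under that assumption; the class, interval, and field names are illustrative, not Hudi's actual code:

{code:java}
import org.apache.parquet.hadoop.ParquetWriter;

// Sketch: amortize ParquetWriter#getDataSize() by calling it only every SAMPLE_INTERVAL
// writes and reusing the last sampled value in between.
public class SampledDataSizeEstimator {
  private static final long SAMPLE_INTERVAL = 100; // assumed interval, for illustration only

  private long writesSinceLastSample = 0;
  private long lastSampledSize = 0;

  public long estimate(ParquetWriter<?> writer) {
    if (++writesSinceLastSample >= SAMPLE_INTERVAL) {
      lastSampledSize = writer.getDataSize(); // expensive: walks the already-encoded column groups
      writesSinceLastSample = 0;
    }
    return lastSampledSize;
  }
}
{code}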



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4079) Supports showing table comment for hudi with spark3

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4079:

Fix Version/s: (was: 0.12.0)

> Supports showing table comment for hudi with spark3
> ---
>
> Key: HUDI-4079
> URL: https://issues.apache.org/jira/browse/HUDI-4079
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jin Xing
>Assignee: Jin Xing
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> When creating a table like below with a 'comment' and checking via "show create 
> table", the comment is not shown.
>  
> {code:java}
> create table test(
>   id int,
>   name string,
>   price double,
>   ts long
>  ) using hudi
>  comment "This is a simple hudi table"
>  tblproperties (
>    primaryKey = 'id',
>    preCombineField = 'ts'
>  ){code}
>  
> The cause is as below:
>  # Current hudi & spark3 invokes ShowCreateTableExec when "show create ..."
>  # ShowCreateTableExec checks table property of 'comment' for result
>  # Spark HiveClientImpl hides property of 'comment', but set it to 
> Catalog#comment when returning a CatalogTable 
> (https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L487)
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4003) Flink offline compaction may cause NPE when log file only contains delete operation

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4003:

Fix Version/s: 0.11.1
   (was: 0.11.0)

> Flink offline compaction may cause NPE when log file only contains delete 
> operation
> ---
>
> Key: HUDI-4003
> URL: https://issues.apache.org/jira/browse/HUDI-4003
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, flink
>Affects Versions: 0.11.0
>Reporter: lanyuanxiaoyao
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> Environment: Hudi 0.12.0 (Latest master), Flink 1.13.3, JDK 8
> My test:
>  # Two partitions: p1, p2
>  # Write data such that p1 has only delete records and p2 has only update records
>  # Run offline compaction and it causes an NPE
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
>     at 
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:264)
>     at 
> org.apache.hudi.common.table.TableSchemaResolver.convertParquetSchemaToAvro(TableSchemaResolver.java:341)
>     at 
> org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchemaFromDataFile(TableSchemaResolver.java:148)
>     at 
> org.apache.hudi.util.CompactionUtil.inferChangelogMode(CompactionUtil.java:131)
>     at 
> org.apache.hudi.sink.compact.HoodieFlinkCompactor$AsyncCompactionService.(HoodieFlinkCompactor.java:173)
>     at com.lanyuanxiaoyao.Compactor.main(Compactor.java:25) {code}
> Reason & Resolution:
>  # Flink offline compaction gets the schema from the latest data file to check 
> whether the '_hoodie_operation' field has been set or not 
> (org.apache.hudi.util.CompactionUtil#inferChangelogMode).
>  # For a MOR table, it may get the schema from a log file chosen at random. But if it 
> chooses a log file that only contains delete operations, the code will get 
> NULL as the result 
> (org.apache.hudi.common.table.TableSchemaResolver#readSchemaFromLogFile(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path)).
>  # Finally, it throws an NPE when the code tries to get the schema name 
> (org.apache.parquet.avro.AvroSchemaConverter#convert(org.apache.parquet.schema.MessageType)).
> {code:java}
> case MERGE_ON_READ:
>   // For MOR table, the file has data written may be a parquet file, .log 
> file, orc file or hfile.
>   // Determine the file format based on the file name, and then extract 
> schema from it.
>   if (instantAndCommitMetadata.isPresent()) {
> HoodieCommitMetadata commitMetadata = 
> instantAndCommitMetadata.get().getRight();
> String filePath = 
> commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().stream().findAny().get();
> if (filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
>   // this is a log file
>   return readSchemaFromLogFile(new Path(filePath));
> } else {
>   return readSchemaFromBaseFile(filePath);
> }
>   } {code}
> I think the code can try another log file to parse the schema when it gets NULL from a 
> log file.
> My solution is to make the code scan all the file paths and try to parse the schema 
> until it succeeds.
> {code:java}
> case MERGE_ON_READ:
>   // For MOR table, the file has data written may be a parquet file, .log 
> file, orc file or hfile.
>   // Determine the file format based on the file name, and then extract 
> schema from it.
>   if (instantAndCommitMetadata.isPresent()) {
> HoodieCommitMetadata commitMetadata = 
> instantAndCommitMetadata.get().getRight();
> Iterator<String> filePaths = 
> commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().iterator();
> MessageType type = null;
> while (filePaths.hasNext() && type == null) {
>   String filePath = filePaths.next();
>   if (filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
> // this is a log file
> type = readSchemaFromLogFile(new Path(filePath));
>   } else {
> type = readSchemaFromBaseFile(filePath);
>   }
> }
> return type;
>   } {code}
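Outside of the Hudi classes, the fallback loop proposed above boils down to the following self-contained sketch; the reader calls are stubbed here for illustration, while the real logic lives in TableSchemaResolver:

{code:java}
import java.util.Arrays;
import java.util.List;

public class SchemaFallback {
  /** Returns the schema of the first file that yields one, or null if none do. */
  static String readFirstAvailableSchema(List<String> filePaths) {
    for (String path : filePaths) {
      String schema = path.contains(".log")
          ? readSchemaFromLogFile(path)    // may return null for delete-only log files
          : readSchemaFromBaseFile(path);
      if (schema != null) {
        return schema;
      }
    }
    return null;
  }

  static String readSchemaFromLogFile(String path) {
    return null; // stub: a delete-only log file carries no data block, hence no schema
  }

  static String readSchemaFromBaseFile(String path) {
    return "base-file-schema"; // stub
  }

  public static void main(String[] args) {
    List<String> paths = Arrays.asList("p1/f1.log.1_0-1-0", "p2/f2.parquet");
    // Falls through the delete-only log file and picks the schema of the parquet file.
    System.out.println(readFirstAvailableSchema(paths));
  }
}
{code}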



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4003) Flink offline compaction may cause NPE when log file only contain delete opereation

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4003:

Affects Version/s: (was: 0.12.0)

> Flink offline compaction may cause NPE when log file only contain delete 
> opereation
> ---
>
> Key: HUDI-4003
> URL: https://issues.apache.org/jira/browse/HUDI-4003
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, flink
>Affects Versions: 0.11.0
>Reporter: lanyuanxiaoyao
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Environment: Hudi 0.12.0 (Latest master), Flink 1.13.3, JDK 8
> My test:
>  # Two partitions: p1, p2
>  # Write data such that p1 receives only delete records and p2 only update records
>  # Run offline compaction; it causes an NPE
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
>     at 
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:264)
>     at 
> org.apache.hudi.common.table.TableSchemaResolver.convertParquetSchemaToAvro(TableSchemaResolver.java:341)
>     at 
> org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchemaFromDataFile(TableSchemaResolver.java:148)
>     at 
> org.apache.hudi.util.CompactionUtil.inferChangelogMode(CompactionUtil.java:131)
>     at 
> org.apache.hudi.sink.compact.HoodieFlinkCompactor$AsyncCompactionService.<init>(HoodieFlinkCompactor.java:173)
>     at com.lanyuanxiaoyao.Compactor.main(Compactor.java:25) {code}
> Reason & Resolution:
>  # Flink offline compaction gets the schema from the latest data file to check 
> whether the '_hoodie_operation' field has been set 
> (org.apache.hudi.util.CompactionUtil#inferChangelogMode).
>  # For a MOR table, it may pick the schema from a log file at random. If it 
> picks a log file that contains only delete operations, the call returns 
> NULL. 
> (org.apache.hudi.common.table.TableSchemaResolver#readSchemaFromLogFile(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path))
>  # Finally, it throws an NPE when the code tries to get the schema name. 
> (org.apache.parquet.avro.AvroSchemaConverter#convert(org.apache.parquet.schema.MessageType))
> {code:java}
> case MERGE_ON_READ:
>   // For MOR table, the file has data written may be a parquet file, .log 
> file, orc file or hfile.
>   // Determine the file format based on the file name, and then extract 
> schema from it.
>   if (instantAndCommitMetadata.isPresent()) {
> HoodieCommitMetadata commitMetadata = 
> instantAndCommitMetadata.get().getRight();
> String filePath = 
> commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().stream().findAny().get();
> if (filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
>   // this is a log file
>   return readSchemaFromLogFile(new Path(filePath));
> } else {
>   return readSchemaFromBaseFile(filePath);
> }
>   } {code}
> I think the code can try another log file to parse the schema when it gets NULL from a 
> log file.
> My solution is to make the code scan all the file paths and try to parse the schema 
> until it succeeds.
> {code:java}
> case MERGE_ON_READ:
>   // For MOR table, the file has data written may be a parquet file, .log 
> file, orc file or hfile.
>   // Determine the file format based on the file name, and then extract 
> schema from it.
>   if (instantAndCommitMetadata.isPresent()) {
> HoodieCommitMetadata commitMetadata = 
> instantAndCommitMetadata.get().getRight();
> Iterator<String> filePaths = 
> commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().iterator();
> MessageType type = null;
> while (filePaths.hasNext() && type == null) {
>   String filePath = filePaths.next();
>   if (filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
> // this is a log file
> type = readSchemaFromLogFile(new Path(filePath));
>   } else {
> type = readSchemaFromBaseFile(filePath);
>   }
> }
> return type;
>   } {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4003) Flink offline compaction may cause NPE when log file only contain delete opereation

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4003:

Fix Version/s: (was: 0.12.0)

> Flink offline compaction may cause NPE when log file only contain delete 
> opereation
> ---
>
> Key: HUDI-4003
> URL: https://issues.apache.org/jira/browse/HUDI-4003
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, flink
>Affects Versions: 0.11.0, 0.12.0
>Reporter: lanyuanxiaoyao
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Environment: Hudi 0.12.0 (Latest master), Flink 1.13.3, JDK 8
> My test:
>  # Two partitions: p1, p2
>  # Write data such that p1 receives only delete records and p2 only update records
>  # Run offline compaction; it causes an NPE
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
>     at 
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:264)
>     at 
> org.apache.hudi.common.table.TableSchemaResolver.convertParquetSchemaToAvro(TableSchemaResolver.java:341)
>     at 
> org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchemaFromDataFile(TableSchemaResolver.java:148)
>     at 
> org.apache.hudi.util.CompactionUtil.inferChangelogMode(CompactionUtil.java:131)
>     at 
> org.apache.hudi.sink.compact.HoodieFlinkCompactor$AsyncCompactionService.<init>(HoodieFlinkCompactor.java:173)
>     at com.lanyuanxiaoyao.Compactor.main(Compactor.java:25) {code}
> Reason & Resolution:
>  # Flink offline compaction gets the schema from the latest data file to check 
> whether the '_hoodie_operation' field has been set 
> (org.apache.hudi.util.CompactionUtil#inferChangelogMode).
>  # For a MOR table, it may pick the schema from a log file at random. If it 
> picks a log file that contains only delete operations, the call returns 
> NULL. 
> (org.apache.hudi.common.table.TableSchemaResolver#readSchemaFromLogFile(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path))
>  # Finally, it throws an NPE when the code tries to get the schema name. 
> (org.apache.parquet.avro.AvroSchemaConverter#convert(org.apache.parquet.schema.MessageType))
> {code:java}
> case MERGE_ON_READ:
>   // For MOR table, the file has data written may be a parquet file, .log 
> file, orc file or hfile.
>   // Determine the file format based on the file name, and then extract 
> schema from it.
>   if (instantAndCommitMetadata.isPresent()) {
> HoodieCommitMetadata commitMetadata = 
> instantAndCommitMetadata.get().getRight();
> String filePath = 
> commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().stream().findAny().get();
> if (filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
>   // this is a log file
>   return readSchemaFromLogFile(new Path(filePath));
> } else {
>   return readSchemaFromBaseFile(filePath);
> }
>   } {code}
> I think the code can try another log file to parse the schema when it gets NULL from a 
> log file.
> My solution is to make the code scan all the file paths and try to parse the schema 
> until it succeeds.
> {code:java}
> case MERGE_ON_READ:
>   // For MOR table, the file has data written may be a parquet file, .log 
> file, orc file or hfile.
>   // Determine the file format based on the file name, and then extract 
> schema from it.
>   if (instantAndCommitMetadata.isPresent()) {
> HoodieCommitMetadata commitMetadata = 
> instantAndCommitMetadata.get().getRight();
> Iterator<String> filePaths = 
> commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().iterator();
> MessageType type = null;
> while (filePaths.hasNext() && type == null) {
>   String filePath = filePaths.next();
>   if (filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
> // this is a log file
> type = readSchemaFromLogFile(new Path(filePath));
>   } else {
> type = readSchemaFromBaseFile(filePath);
>   }
> }
> return type;
>   } {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4044) When reading data from flink-hudi to external storage, the result is incorrect

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4044:

Fix Version/s: (was: 0.12.0)

> When reading data from flink-hudi to external storage, the result is incorrect
> --
>
> Key: HUDI-4044
> URL: https://issues.apache.org/jira/browse/HUDI-4044
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.11.0
>Reporter: yanxiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> When reading data from flink-hudi into external storage, the result can be 
> incorrect because of a concurrency issue.
>  
> Here's the case:
>  
> There is a split_monitor task that listens for changes on the timeline every 
> N seconds, and four split_reader tasks that process the changed data and 
> sink it to external storage:
>  
> (1) First, split_monitor picks up the Instance1 changes, whose corresponding 
> fileId is log1. split_monitor distributes the fileId information to 
> split_reader task 1 in Rebalance mode for processing.
>  
> (2) Then, split_monitor picks up the Instance2 changes. The corresponding 
> fileId is again log1 (assuming the changed data has the same primary key). 
> split_monitor distributes the fileId information to split_reader task 2, 
> again in Rebalance mode.
>  
> (3) split_reader task 1 and split_reader task 2 now process data for the same 
> primary key, and their processing speeds differ. As a result, the order in 
> which the data is sunk to external storage is wrong: the earlier 
> modification can overwrite the later one, producing incorrect data.
>  
>  
> Solution:
> After split_monitor detects the data changes, it should distribute them to 
> the split_reader tasks by hashing the fileId, so that files with the same 
> fileId are always processed by the same split_reader task (a minimal sketch 
> of this routing rule follows below). This resolves the problem.
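A self-contained illustration of that routing rule, independent of the actual Flink operators involved (the class and method names below are purely illustrative, not the real Hudi fix):

{code:java}
import java.util.Arrays;
import java.util.List;

public class FileIdRouting {
  /** Maps a fileId to a split_reader subtask; the same fileId always lands on the same subtask. */
  static int readerFor(String fileId, int parallelism) {
    return (fileId.hashCode() & Integer.MAX_VALUE) % parallelism;
  }

  public static void main(String[] args) {
    int parallelism = 4;
    List<String> fileIds = Arrays.asList("log1", "log1", "log2");
    for (String fileId : fileIds) {
      System.out.println(fileId + " -> split_reader " + readerFor(fileId, parallelism));
    }
    // "log1" maps to the same subtask both times, so its updates are applied in
    // order by a single reader, avoiding the out-of-order sink described above.
  }
}
{code}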



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4053) Flaky ITTestHoodieDataSource.testStreamWriteBatchReadOptimized

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4053:

Fix Version/s: (was: 0.12.0)

> Flaky ITTestHoodieDataSource.testStreamWriteBatchReadOptimized
> --
>
> Key: HUDI-4053
> URL: https://issues.apache.org/jira/browse/HUDI-4053
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> ITTestHoodieDataSource.testStreamWriteBatchReadOptimized is flaky in our 
> azure CI. 
>  
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/8473/logs/22]
>  
> {code:java}
> 2022-05-06T15:34:29.7180500Z 1226235 [PermanentBlobCache shutdown hook] INFO  
> org.apache.flink.runtime.blob.PermanentBlobCache  - Shutting down BLOB cache
> 2022-05-06T15:34:29.7186034Z 1226236 [FileChannelManagerImpl-io shutdown 
> hook] INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl  - 
> FileChannelManager removed spill file directory 
> /tmp/flink-io-dc0ae218-8df6-4dab-a6ff-c5525b560da5
> 2022-05-06T15:34:29.7188305Z 1226235 [TransientBlobCache shutdown hook] INFO  
> org.apache.flink.runtime.blob.TransientBlobCache  - Shutting down BLOB cache
> 2022-05-06T15:34:29.7190064Z 1226235 [FileChannelManagerImpl-io shutdown 
> hook] INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl  - 
> FileChannelManager removed spill file directory 
> /tmp/flink-io-9a505012-2bf9-45fa-bdd0-d31380c9053b
> 2022-05-06T15:34:29.7242484Z 1226241 [BlobServer shutdown hook] INFO  
> org.apache.flink.runtime.blob.BlobServer  - Stopped BLOB server at 
> 0.0.0.0:36159
> 2022-05-06T15:34:30.2525532Z [INFO] 
> 2022-05-06T15:34:30.2526886Z [INFO] Results:
> 2022-05-06T15:34:30.2527439Z [INFO] 
> 2022-05-06T15:34:30.2528004Z [ERROR] Failures: 
> 2022-05-06T15:34:30.2528707Z [ERROR]   
> ITTestHoodieDataSource.testStreamWriteBatchReadOptimized:243 
> 2022-05-06T15:34:30.2540353Z Expected: is "[+I[id1, Danny, 23, 
> 1970-01-01T00:00:01, par1], +I[id2, Stephen, 33, 1970-01-01T00:00:02, par1], 
> +I[id3, Julian, 53, 1970-01-01T00:00:03, par2], +I[id4, Fabian, 31, 
> 1970-01-01T00:00:04, par2], +I[id5, Sophia, 18, 1970-01-01T00:00:05, par3], 
> +I[id6, Emma, 20, 1970-01-01T00:00:06, par3], +I[id7, Bob, 44, 
> 1970-01-01T00:00:07, par4], +I[id8, Han, 56, 1970-01-01T00:00:08, par4]]"
> 2022-05-06T15:34:30.2543461Z  but: was "[+I[id1, Danny, 23, 
> 1970-01-01T00:00:01, par1], +I[id2, Stephen, 33, 1970-01-01T00:00:02, par1], 
> +I[id3, Julian, 53, 1970-01-01T00:00:03, par2], +I[id4, Fabian, 31, 
> 1970-01-01T00:00:04, par2]]" {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4055) use loop replace recursive call in ratelimiter

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4055:

Fix Version/s: (was: 0.12.0)

> use loop replace recursive call in ratelimiter
> --
>
> Key: HUDI-4055
> URL: https://issues.apache.org/jira/browse/HUDI-4055
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: ZiyueGuan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> The rate limiter recursively calls acquire, which may lead to a stack overflow.
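
For illustration, the pattern being proposed looks like the following self-contained sketch; this is not Hudi's actual RateLimiter, just the recursion-to-loop idea:

{code:java}
public class SimpleRateLimiter {
  private final int maxPermitsPerSecond;
  private int permitsUsed = 0;
  private long windowStartMs = System.currentTimeMillis();

  public SimpleRateLimiter(int maxPermitsPerSecond) {
    this.maxPermitsPerSecond = maxPermitsPerSecond;
  }

  public synchronized void acquire() throws InterruptedException {
    // A loop instead of a recursive call to acquire(): no matter how long the
    // caller has to wait, the stack depth stays constant.
    while (true) {
      long now = System.currentTimeMillis();
      if (now - windowStartMs >= 1000) {
        windowStartMs = now;
        permitsUsed = 0;
      }
      if (permitsUsed < maxPermitsPerSecond) {
        permitsUsed++;
        return;
      }
      wait(Math.max(1, 1000 - (now - windowStartMs)));
    }
  }
}
{code}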



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3849) AvroDeserializer supports AVRO_REBASE_MODE_IN_READ configuration

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3849:

Fix Version/s: (was: 0.12.0)

> AvroDeserializer supports AVRO_REBASE_MODE_IN_READ configuration
> 
>
> Key: HUDI-3849
> URL: https://issues.apache.org/jira/browse/HUDI-3849
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: dzcxzl
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> Currently the datetimeRebaseMode of AvroDeserializer is hardcoded to the value 
> "EXCEPTION"



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-2875) Concurrent call to HoodieMergeHandler cause parquet corruption

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2875:

Fix Version/s: (was: 0.12.0)

> Concurrent call to HoodieMergeHandler cause parquet corruption
> --
>
> Key: HUDI-2875
> URL: https://issues.apache.org/jira/browse/HUDI-2875
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core, writer-core
>Reporter: ZiyueGuan
>Assignee: ZiyueGuan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Problem:
> Some corrupted parquet files are generated and exceptions will be thrown when 
> read.
> e.g.
>  
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file 
>     at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
>     at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
>     at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
>     at 
> org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
>     at 
> org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
>     at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:112)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     ... 4 more
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read 
> page Page [bytes.size=1054316, valueCount=237, uncompressedSize=1054316] in 
> col  required binary col
>     at 
> org.apache.parquet.column.impl.ColumnReaderImpl.readPageV1(ColumnReaderImpl.java:599)
>     at 
> org.apache.parquet.column.impl.ColumnReaderImpl.access$300(ColumnReaderImpl.java:57)
>     at 
> org.apache.parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:536)
>     at 
> org.apache.parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:533)
>     at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:95)
>     at 
> org.apache.parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:533)
>     at 
> org.apache.parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:525)
>     at 
> org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:638)
>     at 
> org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:353)
>     at 
> org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:80)
>     at 
> org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:75)
>     at 
> org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
>     at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
>     at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
>     at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
>     at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
>     at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
>     at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
>     ... 11 more
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readFully(DataInputStream.java:197)
>     at java.io.DataInputStream.readFully(DataInputStream.java:169)
>     at 
> org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:286)
>     at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)
>     at org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)
>     at 
> org.apache.parquet.column.impl.ColumnReaderImpl.readPageV1(ColumnReaderImpl.java:592)
>  
> How to reproduce:
> We need a way to interrupt one task without shutting down the JVM, for example 
> speculation. When speculation is triggered, other tasks running on the same 
> executor are at risk of producing a wrong parquet file. This does not always 
> result in a corrupted parquet file: roughly half of the affected writes throw an 
> exception when read back, while a few tasks succeed without any visible signal.
> RootCause:
> ParquetWriter is not thread safe. Callers must take proper measures to 
> guarantee that there are no concurrent calls into the same ParquetWriter.
> In the following code: 
> [https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkMergeHelper.java#L103]
>
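As a generic illustration of that guarantee (not Hudi's SparkMergeHelper fix, which sits behind the link above), access to a non-thread-safe writer can be serialized and fenced against use after close:

{code:java}
public class SingleWriterGuard {
  private final Object lock = new Object();
  private boolean closed = false;

  /** Serializes all writes and rejects writes after close, so a lagging
   *  producer thread from an interrupted task cannot corrupt the output. */
  public void write(String record) {
    synchronized (lock) {
      if (closed) {
        throw new IllegalStateException("writer already closed");
      }
      // delegate to the underlying non-thread-safe writer here
      System.out.println("wrote: " + record);
    }
  }

  public void close() {
    synchronized (lock) {
      closed = true;
    }
  }

  public static void main(String[] args) {
    SingleWriterGuard guard = new SingleWriterGuard();
    guard.write("row-1");
    guard.close();
    // guard.write("row-2"); // would throw IllegalStateException instead of corrupting the file
  }
}
{code}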

[jira] [Updated] (HUDI-4031) Avoid clustering update handling when clustering is disabled

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4031:

Fix Version/s: 0.11.1
   (was: 0.12.0)

> Avoid clustering update handling when clustering is disabled
> 
>
> Key: HUDI-4031
> URL: https://issues.apache.org/jira/browse/HUDI-4031
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> We call distinct().collectAsList() on RDD to determine conflicting filegroups 
> while handling updates with clustering. See 
> [https://github.com/apache/hudi/blob/6af1ff7a663da57438db8847ca0dfda5a6e381f5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/update/strategy/BaseSparkUpdateStrategy.java#L50]
>  
> While this is needed when clustering is enabled alongside the regular writer, it can 
> be avoided when clustering is disabled and there are no pending 
> replacecommits.
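
As a rough sketch of the proposed guard (the names below are assumptions for illustration, not the actual patch), the expensive RDD action would only run when clustering can actually interfere:

{code:java}
public class ClusteringUpdateGuard {
  /**
   * Decides whether the conflicting-file-group check, which triggers
   * distinct().collectAsList() on the driver, needs to run at all.
   */
  public static boolean needsConflictCheck(boolean clusteringEnabled, int pendingReplaceCommits) {
    return clusteringEnabled || pendingReplaceCommits > 0;
  }

  public static void main(String[] args) {
    // clustering disabled and no pending replacecommits -> skip the RDD action
    System.out.println(needsConflictCheck(false, 0)); // false
    // clustering enabled -> still perform the check
    System.out.println(needsConflictCheck(true, 0));  // true
  }
}
{code}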



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3667) Unit tests in hudi-integ-tests are not executed in CI

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3667:

Fix Version/s: 0.11.1
   (was: 0.11.0)

> Unit tests in hudi-integ-tests are not executed in CI
> -
>
> Key: HUDI-3667
> URL: https://issues.apache.org/jira/browse/HUDI-3667
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-4005) Update release script to help validation

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-4005.
---
Resolution: Fixed

> Update release script to help validation
> 
>
> Key: HUDI-4005
> URL: https://issues.apache.org/jira/browse/HUDI-4005
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-3978) Fix hudi use partition path field as hive partition field in flink

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-3978.
---
Resolution: Fixed

> Fix hudi use partition path field as hive partition field in flink
> --
>
> Key: HUDI-3978
> URL: https://issues.apache.org/jira/browse/HUDI-3978
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Priority: Critical
>  Labels: flink, pull-request-available
> Fix For: 0.11.1
>
>
> See https://github.com/apache/hudi/issues/5394



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[hudi] branch asf-site updated: [DOCS] Add images to rest of the blogs (#5742)

2022-06-02 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 42e9786370 [DOCS] Add images to rest of the blogs (#5742)
42e9786370 is described below

commit 42e9786370d6d4c30881cfdf529d0f5b20827f39
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Thu Jun 2 21:53:28 2022 -0700

[DOCS] Add images to rest of the blogs (#5742)

Co-authored-by: Bhavani Sudha Saktheeswaran 
---
 website/blog/2019-03-07-batch-vs-incremental.md|   1 +
 website/blog/2020-01-20-change-capture-using-aws.md|   1 +
 ...ansactional-Data-Lake-at-Uber-Using-Apache-Hudi.mdx |   1 +
 website/blog/2020-08-04-PrestoDB-and-Apache-Hudi.mdx   |   1 +
 ...-08-18-hudi-incremental-processing-on-data-lakes.md |   1 +
 ...8-20-efficient-migration-of-large-parquet-tables.md |   1 +
 .../2020-10-06-cdc-solution-using-hudi-by-nclouds.md   |   1 +
 .../blog/2020-10-15-apache-hudi-meets-apache-flink.md  |   1 +
 .../2020-10-19-Origins-of-Data-Lake-at-Grofers.mdx |   1 +
 .../blog/2020-10-19-hudi-meets-aws-emr-and-aws-dms.md  |   1 +
 ...ern-Enterprise-at-Data-Summit-Connect-Fall-2020.mdx |   1 +
 ...ge-Capture-using-Apache-Hudi-and-Amazon-AMS-EMR.mdx |   1 +
 website/blog/2020-11-11-hudi-indexing-mechanisms.md|   1 +
 ...2020-11-29-Can-Big-Data-Solutions-Be-Affordable.mdx |   1 +
 ...1-high-perf-data-lake-with-hudi-and-alluxio-t3go.md |   1 +
 website/blog/2021-01-27-hudi-clustering-intro.md   |   1 +
 ...me-travel-operations-in-Hopsworks-Feature-Store.mdx |   1 +
 ...Next-Generation-of-Data-Lakes-using-Apache-Hudi.mdx |   1 +
 website/blog/2021-03-01-hudi-file-sizing.md|   1 +
 ...data-stream-for-amazon-dynamodb-and-apache-hudi.mdx |   1 +
 ...-11-New-features-from-Apache-hudi-in-Amazon-EMR.mdx |   1 +
 ...with-Apache-Spark-and-Apache-Hudi-on-Amazon-EMR.mdx |   1 +
 .../blog/2021-05-12-Experts-primer-on-Apache-Hudi.mdx  |   1 +
 ...07-16-Amazon-Athena-expands-Apache-Hudi-support.mdx |   1 +
 ...-lake-with-amazon-athena-Read-optimized-queries.mdx |   1 +
 website/src/theme/BlogPostItem/index.js|   2 +-
 ...ansactional-Data-Lake-at-Uber-Using-Apache-Hudi.png | Bin 0 -> 75857 bytes
 .../blog/2020-08-04-PrestoDB-and-Apache-Hudi.png   | Bin 0 -> 77282 bytes
 .../2020-10-06-cdc-solution-using-hudi-by-nclouds.jpg  | Bin 0 -> 51232 bytes
 .../blog/2020-10-15-apache-hudi-meets-apache-flink.png | Bin 0 -> 116012 bytes
 .../2020-10-19-Origins-of-Data-Lake-at-Grofers.gif | Bin 0 -> 282866 bytes
 .../2020-10-19-hudi-meets-aws-emr-and-aws-dms.jpeg | Bin 0 -> 41844 bytes
 ...e-Capture-using-Apache-Hudi-and-Amazon-AMS-EMR.jpeg | Bin 0 -> 65066 bytes
 ...2020-11-29-Can-Big-Data-Solutions-Be-Affordable.jpg | Bin 0 -> 33672 bytes
 .../images/blog/2021-01-27-hudi-clustering-intro.png   | Bin 0 -> 416119 bytes
 .../blog/2021-02-24-featurestore_incremental_pull.png  | Bin 0 -> 469022 bytes
 ...Next-Generation-of-Data-Lakes-using-Apache-Hudi.png | Bin 0 -> 182290 bytes
 .../assets/images/blog/2021-03-01-hudi-file-sizing.png | Bin 0 -> 44237 bytes
 ...on-kinesis-for-amazon-dynamodb-and-apache-hudi.jpeg | Bin 0 -> 68544 bytes
 .../2021-07-16-query-hudi-using-athena-ro-queries.png  | Bin 0 -> 65767 bytes
 .../static/assets/images/blog/data-summit-connect.jpeg | Bin 0 -> 12956 bytes
 41 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/website/blog/2019-03-07-batch-vs-incremental.md 
b/website/blog/2019-03-07-batch-vs-incremental.md
index 2273227f2a..aaec1d25e6 100644
--- a/website/blog/2019-03-07-batch-vs-incremental.md
+++ b/website/blog/2019-03-07-batch-vs-incremental.md
@@ -2,6 +2,7 @@
 title: "Big Batch vs Incremental Processing"
 author: vinoth
 category: blog
+image: /assets/images/blog/batch_vs_incremental.png
 ---
 
 ![](/assets/images/blog/batch_vs_incremental.png)
diff --git a/website/blog/2020-01-20-change-capture-using-aws.md 
b/website/blog/2020-01-20-change-capture-using-aws.md
index b1ebe4ec81..a757bca98e 100644
--- a/website/blog/2020-01-20-change-capture-using-aws.md
+++ b/website/blog/2020-01-20-change-capture-using-aws.md
@@ -3,6 +3,7 @@ title: "Change Capture Using AWS Database Migration Service and 
Hudi"
 excerpt: "In this blog, we will build an end-end solution for capturing 
changes from a MySQL instance running on AWS RDS to a Hudi table on S3, using 
capabilities in the Hudi 0.5.1 release."
 author: vinoth
 category: blog
+image: /assets/images/blog/change-capture-architecture.png
 ---
 
 One of the core use-cases for Apache Hudi is enabling seamless, efficient 
database ingestion to your data lake. Even though a lot has been talked about 
and even users already adopting this model, content on how to go about this is 
sparse.
diff --git 
a/website/blog/2020-06-09-Building-a-Large-scale-Transactional-Data-Lake-at-Uber-Using-Apache-Hudi.m

[GitHub] [hudi] xushiyan merged pull request #5742: [DOCS] Add images to rest of the blogs

2022-06-02 Thread GitBox


xushiyan merged PR #5742:
URL: https://github.com/apache/hudi/pull/5742


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3862) Fix default configurations of HoodieHBaseIndexConfig

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3862:

Fix Version/s: 0.11.1

> Fix default configurations of HoodieHBaseIndexConfig
> 
>
> Key: HUDI-3862
> URL: https://issues.apache.org/jira/browse/HUDI-3862
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index
>Reporter: xi chaomin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> For example, GET_BATCH_SIZE, MAX_QPS_PER_REGION_SERVER, and 
> QPS_ALLOCATOR_CLASS_NAME all declare default values, but those default values 
> were not actually being used.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] yanenze closed pull request #5661: [HUDI-4139] improvement for flink sink operators so we can easily identify the table which write to hudi

2022-06-02 Thread GitBox


yanenze closed pull request #5661: [HUDI-4139] improvement for flink sink 
operators so we can easily identify the table which write to hudi
URL: https://github.com/apache/hudi/pull/5661


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4159) non existant partition field values has issues w/ partition pruning

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4159:

Fix Version/s: 0.11.1
   (was: 0.11.0)

> non existant partition field values has issues w/ partition pruning
> ---
>
> Key: HUDI-4159
> URL: https://issues.apache.org/jira/browse/HUDI-4159
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.1
>
>
> Let's say some records don't have partition field values and they are null. Based 
> on our key generator code, we will set the "default" value for them. But partition 
> pruning may run into issues if the original datatype of the partition field is 
> non-string (long or integer, for instance). I ran into this on 0.10.0.
>  
> {code:java}
> org.sparkproject.guava.util.concurrent.UncheckedExecutionException: 
> java.lang.RuntimeException: Failed to cast value `default` to `IntegerType` 
> for partition column `ss_sold_date_sk`
>   at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
>   at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCachedPlan(SessionCatalog.scala:155)
>   at 
> org.apache.spark.sql.execution.datasources.FindDataSourceTable.org$apache$spark$sql$execution$datasources$FindDataSourceTable$$readDataSourceTable(DataSourceStrategy.scala:249)
>   at 
> org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:278)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(L

[jira] [Updated] (HUDI-3758) Optimize flink partition table with BucketIndex

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3758:

Fix Version/s: (was: 0.11.0)

> Optimize flink partition table with BucketIndex
> ---
>
> Key: HUDI-3758
> URL: https://issues.apache.org/jira/browse/HUDI-3758
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: konwu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
> Attachments: image-2022-03-31-15-44-30-480.png, 
> image-2022-03-31-15-44-34-450.png
>
>
> When using the flink bucket index, I meet two problems
>  * not all streamWriter tasks are used when a partitioned table has a small bucket 
> number
>  * the job crashes with the following steps
>  # start the job
>  # kill it before the first commit succeeds (leaving some log files)
>  # restart the job; it runs normally after one successful commit
>  # kill the job and restart; it throws `Duplicate fileID 
> 0001-6f57-4c71-bf6f-ee7616ec7b14 from bucket 1 of partition  found during 
> the BucketStreamWriteFunction index bootstrap`
> !image-2022-03-31-15-44-34-450.png!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3943) Some description fixes for 0.10.1 docs

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3943:

Fix Version/s: 0.11.1

> Some description fixes for 0.10.1 docs
> --
>
> Key: HUDI-3943
> URL: https://issues.apache.org/jira/browse/HUDI-3943
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Chuang Lee
>Assignee: Chuang Lee
>Priority: Minor
>  Labels: docs, pull-request-available
> Fix For: 0.11.1
>
>
> There are some detailed description errors in the 0.10.1 documentation.
> For example, the default value of METADATA_COMPACTION_DELTA_COMMITS is 10, 
> but it is documented as 24.
> Therefore, I would like to open an issue to collect the problems found. Please 
> also comment here with any description problems you spot in the docs, 
> and I will fix them together.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3815) Hudi Option metadata.compaction.delta_commits

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3815:

Fix Version/s: 0.11.1

> Hudi Option metadata.compaction.delta_commits
> -
>
> Key: HUDI-3815
> URL: https://issues.apache.org/jira/browse/HUDI-3815
> Project: Apache Hudi
>  Issue Type: Wish
>  Components: docs
>Reporter: Ibson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.1
>
> Attachments: 截屏2022-04-07 下午3.42.16.png
>
>
> h4.     The parameter description of 'metadata.compaction.delta_commits'  
> indicates that the default value is 24, but the actual default value is 10.
>     [docs 
> url|https://hudi.apache.org/docs/next/configurations/#metadatacompactiondelta_commits]
>  
>     [code 
> url|https://github.com/apache/hudi/blob/master/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java#L105]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-3945) After the async compaction operation is complete, the task should exit.

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-3945.
---
Resolution: Fixed

> After the async compaction operation is complete, the task should exit.
> ---
>
> Key: HUDI-3945
> URL: https://issues.apache.org/jira/browse/HUDI-3945
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: YangXuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> This problem occurs if you perform the following operations:
> 1、Create a mor table and perform the upsert operation on it for five times.
> 2、Gets the timestamp of the last upsert execution, assuming it is 
> 20220318191221.
> 3、Do compaction schedule by spark-submit and the timestamp is 20220318191221 
> plus 5.
> spark-submit --conf 
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/opt/hudi/error_log4j.properties"
>  --jars /opt/client/Hudi/hudi/lib/hudi-client-common*.jar --class 
> org.apache.hudi.utilities.HoodieCompactor 
> /opt/client/Hudi/hudi/lib/hudi-utilities*.jar --base-path 
> /tmp/testdb/tb_test_mor --table-name tb_test_mor --parallelism 100 
> --spark-memory 1G --schema-file /tmp/json/compact_tb_base.json --instant-time 
> 20220318191226 --schedule --strategy 
> org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy
> 4、Run compaction by spark-submit and you will see that the task of running 
> compaction is executed successfully, but the spark task does not exit.
> spark-submit --conf 
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/opt/hudi/error_log4j.properties"
>  --num-executors 4 --jars 
> /opt/client/Hudi/hudi/lib/hudi-client-common-{_}.jar --class 
> org.apache.hudi.utilities.HoodieCompactor 
> /opt/client/Hudi/hudi/lib/hudi-utilities_{_}.jar --base-path 
> /tmp/testdb/tb_test_mor --table-name tb_test_mor --parallelism 100 
> --spark-memory 1G --schema-file /tmp/json/compact_tb_base.json --instant-time 
> 20220318191226



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3945) After the async compaction operation is complete, the task should exit.

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3945:

Fix Version/s: 0.11.1

> After the async compaction operation is complete, the task should exit.
> ---
>
> Key: HUDI-3945
> URL: https://issues.apache.org/jira/browse/HUDI-3945
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: YangXuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> This problem occurs if you perform the following operations:
> 1、Create a mor table and perform the upsert operation on it for five times.
> 2、Gets the timestamp of the last upsert execution, assuming it is 
> 20220318191221.
> 3、Do compaction schedule by spark-submit and the timestamp is 20220318191221 
> plus 5.
> spark-submit --conf 
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/opt/hudi/error_log4j.properties"
>  --jars /opt/client/Hudi/hudi/lib/hudi-client-common*.jar --class 
> org.apache.hudi.utilities.HoodieCompactor 
> /opt/client/Hudi/hudi/lib/hudi-utilities*.jar --base-path 
> /tmp/testdb/tb_test_mor --table-name tb_test_mor --parallelism 100 
> --spark-memory 1G --schema-file /tmp/json/compact_tb_base.json --instant-time 
> 20220318191226 --schedule --strategy 
> org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy
> 4、Run compaction by spark-submit and you will see that the task of running 
> compaction is executed successfully, but the spark task does not exit.
> spark-submit --conf 
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/opt/hudi/error_log4j.properties"
>  --num-executors 4 --jars 
> /opt/client/Hudi/hudi/lib/hudi-client-common-{_}.jar --class 
> org.apache.hudi.utilities.HoodieCompactor 
> /opt/client/Hudi/hudi/lib/hudi-utilities_{_}.jar --base-path 
> /tmp/testdb/tb_test_mor --table-name tb_test_mor --parallelism 100 
> --spark-memory 1G --schema-file /tmp/json/compact_tb_base.json --instant-time 
> 20220318191226



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3977) Flink hudi table with date type partition path throws HoodieNotSupportedException

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3977:

Fix Version/s: (was: 0.11.0)

> Flink hudi table with date type partition path throws 
> HoodieNotSupportedException
> -
>
> Key: HUDI-3977
> URL: https://issues.apache.org/jira/browse/HUDI-3977
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.10.0, 0.10.1
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5743: [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables

2022-06-02 Thread GitBox


hudi-bot commented on PR #5743:
URL: https://github.com/apache/hudi/pull/5743#issuecomment-1145526231

   
   ## CI report:
   
   * da04030deb85fcd36fd463793be08270e0216aad Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9056)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] KnightChess commented on pull request #5739: [HUDI-4179] cluster with sort cloumns invalid

2022-06-02 Thread GitBox


KnightChess commented on PR #5739:
URL: https://github.com/apache/hudi/pull/5739#issuecomment-1145497178

   thanks for review @XuQianJin-Stars @leesf 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3946) Validate option path in flink hudi sink

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3946:

Fix Version/s: (was: 0.11.0)

> Validate option path in flink hudi sink
> ---
>
> Key: HUDI-3946
> URL: https://issues.apache.org/jira/browse/HUDI-3946
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Ruguo Yu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> We should do a non-null check on option 'path'[{color:#ff}*1*{color}] in 
> flink hudi sink so that flink can expose the 'path' problem as early as 
> possible, instead of throwing the error[{color:#FF}*2*{color}] at runtime.
> [{color:#ff}*1*{color}]
> {code:java}
> CREATE TABLE t1(
>   uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
>   name VARCHAR(10),
>   age INT,
>   ts TIMESTAMP(3),
>   `partition` VARCHAR(20)
> )
> PARTITIONED BY (`partition`)
> WITH (
>   'connector' = 'hudi',
>   'path' = '${path}'
> );
> INSERT INTO t1 VALUES
>   ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
>   ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
>   ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
>   ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
>   ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
>   ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
>   ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
>   ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4'); {code}
> [{color:#ff}*2*{color}]
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Can not create a Path from a 
> null string
>         at org.apache.hadoop.fs.Path.checkPathArg(Path.java:122) 
> ~[hadoop-common-2.7.6.jar:?]
>         at org.apache.hadoop.fs.Path.<init>(Path.java:134) 
> ~[hadoop-common-2.7.6.jar:?]
>         at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:103) 
> ~[hudi-flink1.14-bundle_2.11-0.11.0-rc1.jar:0.11.0-rc1]
>         at 
> org.apache.hudi.util.StreamerUtil.tableExists(StreamerUtil.java:289) 
> ~[hudi-flink1.14-bundle_2.11-0.11.0-rc1.jar:0.11.0-rc1]
>         at 
> org.apache.hudi.util.StreamerUtil.initTableIfNotExists(StreamerUtil.java:258) 
> ~[hudi-flink1.14-bundle_2.11-0.11.0-rc1.jar:0.11.0-rc1]
>         at 
> org.apache.hudi.sink.StreamWriteOperatorCoordinator.start(StreamWriteOperatorCoordinator.java:172)
>  ~[hudi-flink1.14-bundle_2.11-0.11.0-rc1.jar:0.11.0-rc1]
>         at 
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:194)
>  ~[flink-dist_2.11-1.14.2.jar:1.14.2]
>         at 
> org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85)
>  ~[flink-dist_2.11-1.14.2.jar:1.14.2]
>         at 
> org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:584)
>  ~[flink-dist_2.11-1.14.2.jar:1.14.2]
>         at 
> org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:965)
>  ~[flink-dist_2.11-1.14.2.jar:1.14.2]
>         at 
> org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:882)
>  ~[flink-dist_2.11-1.14.2.jar:1.14.2]
>         at 
> org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:389) 
> ~[flink-dist_2.11-1.14.2.jar:1.14.2]
>         at 
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
>  ~[flink-dist_2.11-1.14.2.jar:1.14.2]
>         at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.lambda$start$0(AkkaRpcActor.java:624)
>  ~[flink-rpc-akka_a8af7b4c-9c0c-4ac4-a1b1-1690068e50df.jar:1.14.2]
>         at 
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>  ~[flink-rpc-akka_a8af7b4c-9c0c-4ac4-a1b1-1690068e50df.jar:1.14.2]
>         at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:623)
>  ~[flink-rpc-akka_a8af7b4c-9c0c-4ac4-a1b1-1690068e50df.jar:1.14.2]
>         ... 20 more {code}
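A minimal sketch of the early validation being proposed, written against a plain option map rather than Hudi's actual FlinkOptions plumbing (the class and method names are illustrative assumptions):

{code:java}
import java.util.Map;

public class PathValidation {
  /**
   * Fails fast when the required 'path' option is missing or blank,
   * instead of letting the IllegalArgumentException above surface later
   * inside the operator coordinator at runtime.
   */
  public static String requirePath(Map<String, String> options) {
    String path = options.get("path");
    if (path == null || path.trim().isEmpty()) {
      throw new IllegalArgumentException(
          "Option 'path' is required for the hudi connector but was not set");
    }
    return path;
  }
}
{code}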



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3848) Restore fails when files pertaining to a commit has been cleaned up

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3848:

Fix Version/s: 0.11.0
   (was: 0.11.1)

> Restore fails when files pertaining to a commit has been cleaned up
> ---
>
> Key: HUDI-3848
> URL: https://issues.apache.org/jira/browse/HUDI-3848
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sivabalan narayanan
>Assignee: Forward Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> DC1, DC2, savepoint DC2, DC3, C4 (compaction), DC5, DC6, C7 (compaction), 
> DC8, Clean9 (this will clean up file slice 2 since file slice 1 is 
> savepointed). Hence all files added as part of DC5 and DC6 will be cleaned up. 
>  
> DC10. Restore to DC2. 
> This fails because we recently moved to listing-based rollback: the files to be 
> deleted are fetched from the commit metadata and fs.listStatus is called at one 
> point. In this case, the DC5 and DC6 data files have already been cleaned up, 
> hence the failure. 
>  
> {code:java}
> 22/04/10 22:35:41 ERROR Executor: Exception in task 1.0 in stage 24.0 (TID 79)
> java.io.FileNotFoundException: File 
> file:/tmp/hudi_trips_cow/asia/india/chennai/.4bd46734-6490-4efd-806f-eb7a2b8c36f6-0_20220410223029836.log.2_2-271-544
>  does not exist
>     at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
>     at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1594)
>     at 
> org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$listStatus$21(HoodieWrapperFileSystem.java:595)
>     at 
> org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:101)
>     at 
> org.apache.hudi.common.fs.HoodieWrapperFileSystem.listStatus(HoodieWrapperFileSystem.java:594)
>     at 
> org.apache.hudi.table.action.rollback.ListingBasedRollbackStrategy.fetchFilesFromCommitMetadata(ListingBasedRollbackStrategy.java:242)
>     at 
> org.apache.hudi.table.action.rollback.ListingBasedRollbackStrategy.fetchFilesFromInstant(ListingBasedRollbackStrategy.java:228)
>     at 
> org.apache.hudi.table.action.rollback.ListingBasedRollbackStrategy.lambda$getRollbackRequests$762d0ff4$1(ListingBasedRollbackStrategy.java:101)
>     at 
> org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:137)
>     at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
>     at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>     at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1334)
>     at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
>     at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
>     at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$15.apply(RDD.scala:990)
>     at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
>     at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 22/04/10 22:35:41 ERROR Executor: Exception in task 0.0 in stage 24.0 (TID 78)
> java.io.FileNotFoundException: File 
> file:/tmp/hudi_trips_cow/americas/brazil/sao_paulo/.82cffa84-

[jira] [Updated] (HUDI-3848) Restore fails when files pertaining to a commit has been cleaned up

2022-06-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3848:

Fix Version/s: 0.11.1
   (was: 0.12.0)

> Restore fails when files pertaining to a commit has been cleaned up
> ---
>
> Key: HUDI-3848
> URL: https://issues.apache.org/jira/browse/HUDI-3848
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: sivabalan narayanan
>Assignee: Forward Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> DC1, DC2, savepoint DC2, DC3, C4 (compaction), DC5, DC6, C7 (compaction), 
> DC8, Clean9 (this will clean up file slice 2 since file slice 1 is 
> savepointed). Hence all files added as part of DC5 and DC6 will be cleaned up. 
>  
> DC10. Restore to DC2. 
> This fails because we recently moved to listing-based rollback: the files to be 
> deleted are fetched from the commit metadata and fs.listStatus is called at one 
> point. In this case, the DC5 and DC6 data files have already been cleaned up, 
> hence the failure. 
>  
> {code:java}
> 22/04/10 22:35:41 ERROR Executor: Exception in task 1.0 in stage 24.0 (TID 79)
> java.io.FileNotFoundException: File 
> file:/tmp/hudi_trips_cow/asia/india/chennai/.4bd46734-6490-4efd-806f-eb7a2b8c36f6-0_20220410223029836.log.2_2-271-544
>  does not exist
>     at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
>     at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1594)
>     at 
> org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$listStatus$21(HoodieWrapperFileSystem.java:595)
>     at 
> org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:101)
>     at 
> org.apache.hudi.common.fs.HoodieWrapperFileSystem.listStatus(HoodieWrapperFileSystem.java:594)
>     at 
> org.apache.hudi.table.action.rollback.ListingBasedRollbackStrategy.fetchFilesFromCommitMetadata(ListingBasedRollbackStrategy.java:242)
>     at 
> org.apache.hudi.table.action.rollback.ListingBasedRollbackStrategy.fetchFilesFromInstant(ListingBasedRollbackStrategy.java:228)
>     at 
> org.apache.hudi.table.action.rollback.ListingBasedRollbackStrategy.lambda$getRollbackRequests$762d0ff4$1(ListingBasedRollbackStrategy.java:101)
>     at 
> org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:137)
>     at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
>     at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>     at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1334)
>     at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
>     at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
>     at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$15.apply(RDD.scala:990)
>     at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
>     at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 22/04/10 22:35:41 ERROR Executor: Exception in task 0.0 in stage 24.0 (TID 78)
> java.io.FileNotFoundException: File 
> file:/tmp/hudi_trips_cow/americas/brazil/sao_paulo/.82cffa84-

[GitHub] [hudi] XuQianJin-Stars commented on a diff in pull request #5738: [HUDI-4168] Add Call Procedure for marker deletion

2022-06-02 Thread GitBox


XuQianJin-Stars commented on code in PR #5738:
URL: https://github.com/apache/hudi/pull/5738#discussion_r888520737


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/DeleteMarkerProcedure.scala:
##
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.hudi.table.HoodieSparkTable
+import org.apache.hudi.table.marker.WriteMarkersFactory
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, 
StructType}
+
+import java.util.function.Supplier
+import scala.util.{Failure, Success, Try}
+
+class DeleteMarkerProcedure extends BaseProcedure with ProcedureBuilder with 
Logging {
+  private val PARAMETERS = Array[ProcedureParameter](
+ProcedureParameter.required(0, "table", DataTypes.StringType, None),
+ProcedureParameter.required(1, "instant_Time", DataTypes.StringType, None)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+StructField("delete_marker_result", DataTypes.BooleanType, nullable = 
true, Metadata.empty))
+  )
+
+  def parameters: Array[ProcedureParameter] = PARAMETERS
+
+  def outputType: StructType = OUTPUT_TYPE
+
+  override def call(args: ProcedureArgs): Seq[Row] = {
+super.checkArgs(PARAMETERS, args)
+
+val tableName = getArgValueOrDefault(args, PARAMETERS(0))
+val instantTime = getArgValueOrDefault(args, 
PARAMETERS(1)).get.asInstanceOf[String]
+val basePath = getBasePath(tableName)
+
+val result = Try {
+  val client = createHoodieClient(jsc, basePath)
+  val config = client.getConfig
+  val context = client.getEngineContext
+  val table = HoodieSparkTable.create(config, context)

Review Comment:
   It is better to refresh the timeline here.
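   
   A minimal sketch of that suggestion, assuming the procedure goes on to delete 
   markers via `WriteMarkersFactory` and that reloading the active timeline on the 
   table's meta client is the intended refresh (the names below reuse the `config`, 
   `context` and `instantTime` values from the patch above):
   
   ```scala
   val table = HoodieSparkTable.create(config, context)
   // Reload the active timeline so the marker deletion sees the latest instants,
   // not a timeline cached when the write client was created.
   table.getMetaClient.reloadActiveTimeline()
   WriteMarkersFactory
     .get(config.getMarkersType, table, instantTime)
     .quietDeleteMarkerDir(context, config.getMarkersDeleteParallelism)
   ```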



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5743: [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables

2022-06-02 Thread GitBox


hudi-bot commented on PR #5743:
URL: https://github.com/apache/hudi/pull/5743#issuecomment-1145478721

   
   ## CI report:
   
   * da04030deb85fcd36fd463793be08270e0216aad Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9056)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5743: [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables

2022-06-02 Thread GitBox


hudi-bot commented on PR #5743:
URL: https://github.com/apache/hudi/pull/5743#issuecomment-1145477072

   
   ## CI report:
   
   * da04030deb85fcd36fd463793be08270e0216aad UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4183) Fix using HoodieCatalog to create non-hudi tables

2022-06-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4183:
-
Labels: pull-request-available  (was: )

> Fix using HoodieCatalog to create non-hudi tables
> -
>
> Key: HUDI-4183
> URL: https://issues.apache.org/jira/browse/HUDI-4183
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] leesf opened a new pull request, #5743: [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables

2022-06-02 Thread GitBox


leesf opened a new pull request, #5743:
URL: https://github.com/apache/hudi/pull/5743

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4183) Fix using HoodieCatalog to create non-hudi tables

2022-06-02 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-4183:

Summary: Fix using HoodieCatalog to create non-hudi tables  (was: Fix using 
HoodieCatalog to create non hudi tables)

> Fix using HoodieCatalog to create non-hudi tables
> -
>
> Key: HUDI-4183
> URL: https://issues.apache.org/jira/browse/HUDI-4183
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4183) Fix using HoodieCatalog to create non hudi tables

2022-06-02 Thread leesf (Jira)
leesf created HUDI-4183:
---

 Summary: Fix using HoodieCatalog to create non hudi tables
 Key: HUDI-4183
 URL: https://issues.apache.org/jira/browse/HUDI-4183
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: leesf
Assignee: leesf






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-4175) Hive Sync Options pointing to wrong new options

2022-06-02 Thread Shawn Chang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Chang closed HUDI-4175.
-
Resolution: Duplicate

Duplicate of another Jira issue: https://issues.apache.org/jira/browse/HUDI-4162

Closing this one now

> Hive Sync Options pointing to wrong new options
> ---
>
> Key: HUDI-4175
> URL: https://issues.apache.org/jira/browse/HUDI-4175
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Shawn Chang
>Priority: Minor
>  Labels: pull-request-available
>
>  
> {code:java}
> @Deprecated
> val META_SYNC_ENABLED_OPT_KEY = HoodieSyncConfig.META_SYNC_DATABASE_NAME.key()
> /** @deprecated Use {@link HIVE_DATABASE} and its methods instead */
> ...
> @Deprecated
> val HIVE_TABLE_OPT_KEY = HoodieSyncConfig.META_SYNC_DATABASE_NAME.key() {code}
> The options above in 
> hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
>  are wrong: both deprecated keys point at META_SYNC_DATABASE_NAME.
>  
> This causes issues for apps that still use the older configs, even though they are deprecated.
>  
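
A hedged sketch of what the corrected mapping presumably looks like, assuming 
HoodieSyncConfig also exposes META_SYNC_ENABLED and META_SYNC_TABLE_NAME keys 
(the snippet above points both deprecated options at META_SYNC_DATABASE_NAME, 
which is the reported bug; the actual merged fix may differ):

{code:java}
@Deprecated
val META_SYNC_ENABLED_OPT_KEY = HoodieSyncConfig.META_SYNC_ENABLED.key()
...
@Deprecated
val HIVE_TABLE_OPT_KEY = HoodieSyncConfig.META_SYNC_TABLE_NAME.key()
{code}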



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] CTTY commented on pull request #5732: [HUDI-4175] Fix hive sync options

2022-06-02 Thread GitBox


CTTY commented on PR #5732:
URL: https://github.com/apache/hudi/pull/5732#issuecomment-1145458366

   I see there is a PR that fixed this already; closing this one now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] CTTY closed pull request #5732: [HUDI-4175] Fix hive sync options

2022-06-02 Thread GitBox


CTTY closed pull request #5732: [HUDI-4175] Fix hive sync options
URL: https://github.com/apache/hudi/pull/5732


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a diff in pull request #5737: [HUDI-4178][Stacked on 5733] Fixing `HoodieSpark3Analysis` missing to pass schema from Spark Catalog

2022-06-02 Thread GitBox


vinothchandar commented on code in PR #5737:
URL: https://github.com/apache/hudi/pull/5737#discussion_r888475377


##
hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark3Analysis.scala:
##
@@ -45,16 +45,22 @@ case class HoodieSpark3Analysis(sparkSession: SparkSession) 
extends Rule[Logical
   with SparkAdapterSupport with ProvidesHoodieConfig {
 
   override def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsDown {
-case dsv2 @ DataSourceV2Relation(d: HoodieInternalV2Table, _, _, _, _) =>
-  val output = dsv2.output
-  val catalogTable = if (d.catalogTable.isDefined) {
-Some(d.v1Table)
-  } else {
-None
-  }
-  val relation = new DefaultSource().createRelation(new 
SQLContext(sparkSession),
-buildHoodieConfig(d.hoodieCatalogTable))
-  LogicalRelation(relation, output, catalogTable, isStreaming = false)
+// NOTE: This step is required since Hudi relations don't currently 
implement DS V2 Read API
+case dsv2 @ DataSourceV2Relation(tbl: HoodieInternalV2Table, _, _, _, _) =>
+  val qualifiedTableName = QualifiedTableName(tbl.v1Table.database, 
tbl.v1Table.identifier.table)
+  val catalog = sparkSession.sessionState.catalog
+
+  catalog.getCachedPlan(qualifiedTableName, () => {

Review Comment:
   @leesf I propose we revert to V1 and push out 0.11.1 to fix these perf 
regressions. Do you see any concerns with that? 
   
   @YannByron would any of the follow-on SQL work break if we revert? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5737: [HUDI-4178][Stacked on 5733] Fixing `HoodieSpark3Analysis` missing to pass schema from Spark Catalog

2022-06-02 Thread GitBox


alexeykudinkin commented on code in PR #5737:
URL: https://github.com/apache/hudi/pull/5737#discussion_r888469016


##
hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark3Analysis.scala:
##
@@ -45,16 +45,22 @@ case class HoodieSpark3Analysis(sparkSession: SparkSession) 
extends Rule[Logical
   with SparkAdapterSupport with ProvidesHoodieConfig {
 
   override def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsDown {
-case dsv2 @ DataSourceV2Relation(d: HoodieInternalV2Table, _, _, _, _) =>
-  val output = dsv2.output
-  val catalogTable = if (d.catalogTable.isDefined) {
-Some(d.v1Table)
-  } else {
-None
-  }
-  val relation = new DefaultSource().createRelation(new 
SQLContext(sparkSession),
-buildHoodieConfig(d.hoodieCatalogTable))
-  LogicalRelation(relation, output, catalogTable, isStreaming = false)
+// NOTE: This step is required since Hudi relations don't currently 
implement DS V2 Read API
+case dsv2 @ DataSourceV2Relation(tbl: HoodieInternalV2Table, _, _, _, _) =>
+  val qualifiedTableName = QualifiedTableName(tbl.v1Table.database, 
tbl.v1Table.identifier.table)
+  val catalog = sparkSession.sessionState.catalog
+
+  catalog.getCachedPlan(qualifiedTableName, () => {

Review Comment:
   Yes, the issue is that it's never invalidated. V1 actually has a notion of the 
catalog; it's just handled differently by V1 and V2, and since we're in between 
those two worlds we can't really make use of it.
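   
   For illustration only: a minimal sketch of one way the cached plan could be 
   dropped after a write, assuming `SessionCatalog#invalidateCachedTable` and 
   `refreshTable` are the right hooks (this is not an agreed fix, just a way to 
   picture the missing invalidation):
   
   ```scala
   // After a successful Hudi write through this relation, evict the plan that
   // getCachedPlan() built, so the next resolution re-reads schema/timeline.
   val qualifiedTableName = QualifiedTableName(tbl.v1Table.database, tbl.v1Table.identifier.table)
   sparkSession.sessionState.catalog.invalidateCachedTable(qualifiedTableName)
   // or, more broadly, refresh by table identifier:
   sparkSession.sessionState.catalog.refreshTable(tbl.v1Table.identifier)
   ```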



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a diff in pull request #5737: [HUDI-4178][Stacked on 5733] Fixing `HoodieSpark3Analysis` missing to pass schema from Spark Catalog

2022-06-02 Thread GitBox


vinothchandar commented on code in PR #5737:
URL: https://github.com/apache/hudi/pull/5737#discussion_r888460126


##
hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark3Analysis.scala:
##
@@ -45,16 +45,22 @@ case class HoodieSpark3Analysis(sparkSession: SparkSession) 
extends Rule[Logical
   with SparkAdapterSupport with ProvidesHoodieConfig {
 
   override def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsDown {
-case dsv2 @ DataSourceV2Relation(d: HoodieInternalV2Table, _, _, _, _) =>
-  val output = dsv2.output
-  val catalogTable = if (d.catalogTable.isDefined) {
-Some(d.v1Table)
-  } else {
-None
-  }
-  val relation = new DefaultSource().createRelation(new 
SQLContext(sparkSession),
-buildHoodieConfig(d.hoodieCatalogTable))
-  LogicalRelation(relation, output, catalogTable, isStreaming = false)
+// NOTE: This step is required since Hudi relations don't currently 
implement DS V2 Read API
+case dsv2 @ DataSourceV2Relation(tbl: HoodieInternalV2Table, _, _, _, _) =>
+  val qualifiedTableName = QualifiedTableName(tbl.v1Table.database, 
tbl.v1Table.identifier.table)
+  val catalog = sparkSession.sessionState.catalog
+
+  catalog.getCachedPlan(qualifiedTableName, () => {

Review Comment:
   So the issue is that this cache is never invalidated on write, and V1 does 
not have a notion of a Catalog (to be used for writes).



##
hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark3Analysis.scala:
##
@@ -45,16 +45,22 @@ case class HoodieSpark3Analysis(sparkSession: SparkSession) 
extends Rule[Logical
   with SparkAdapterSupport with ProvidesHoodieConfig {
 
   override def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsDown {
-case dsv2 @ DataSourceV2Relation(d: HoodieInternalV2Table, _, _, _, _) =>
-  val output = dsv2.output
-  val catalogTable = if (d.catalogTable.isDefined) {
-Some(d.v1Table)
-  } else {
-None
-  }
-  val relation = new DefaultSource().createRelation(new 
SQLContext(sparkSession),
-buildHoodieConfig(d.hoodieCatalogTable))
-  LogicalRelation(relation, output, catalogTable, isStreaming = false)
+// NOTE: This step is required since Hudi relations don't currently 
implement DS V2 Read API
+case dsv2 @ DataSourceV2Relation(tbl: HoodieInternalV2Table, _, _, _, _) =>
+  val qualifiedTableName = QualifiedTableName(tbl.v1Table.database, 
tbl.v1Table.identifier.table)
+  val catalog = sparkSession.sessionState.catalog
+
+  catalog.getCachedPlan(qualifiedTableName, () => {

Review Comment:
   So the issue is that this cache is never invalidated on write, and V1 does 
not have a notion of a Catalog (to be used for writes)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] bhasudha opened a new pull request, #5742: [DOCS] Add images to rest of the blogs

2022-06-02 Thread GitBox


bhasudha opened a new pull request, #5742:
URL: https://github.com/apache/hudi/pull/5742

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4182) HoodieIndexer failing when trying to invoke it as Single Writer

2022-06-02 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-4182:
-

 Summary: HoodieIndexer failing when trying to invoke it as Single 
Writer
 Key: HUDI-4182
 URL: https://issues.apache.org/jira/browse/HUDI-4182
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Sagar Sumit
 Fix For: 0.12.0


Trying to invoke the Async Indexer with the following properties fails:
{code:java}
hoodie.metadata.enable=true
hoodie.metadata.index.async=true
hoodie.metadata.index.column.stats.enable=true
hoodie.write.concurrency.mode=single_writer
hoodie.cleaner.policy.failed.writes=EAGER {code}
{code:java}
2022-06-02 21:51:58,123 ERROR utilities.UtilHelpers: Indexer failed
org.apache.hudi.exception.HoodieIndexException: Need to set 
hoodie.write.concurrency.mode as OPTIMISTIC_CONCURRENCY_CONTROL and configure 
lock provider class
        at 
org.apache.hudi.table.action.index.ScheduleIndexActionExecutor.validateBeforeScheduling(ScheduleIndexActionExecutor.java:137)
        at 
org.apache.hudi.table.action.index.ScheduleIndexActionExecutor.execute(ScheduleIndexActionExecutor.java:84)
        at 
org.apache.hudi.table.HoodieSparkCopyOnWriteTable.scheduleIndexing(HoodieSparkCopyOnWriteTable.java:286)
        at 
org.apache.hudi.client.BaseHoodieWriteClient.scheduleIndexing(BaseHoodieWriteClient.java:1016)
        at 
org.apache.hudi.utilities.HoodieIndexer.doSchedule(HoodieIndexer.java:234)
        at 
org.apache.hudi.utilities.HoodieIndexer.scheduleAndRunIndexing(HoodieIndexer.java:276)
        at 
org.apache.hudi.utilities.HoodieIndexer.lambda$start$1(HoodieIndexer.java:198)
        at org.apache.hudi.utilities.UtilHelpers.retry(UtilHelpers.java:541)
        at org.apache.hudi.utilities.HoodieIndexer.start(HoodieIndexer.java:185)
        at org.apache.hudi.utilities.HoodieIndexer.main(HoodieIndexer.java:154)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
        at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code}
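
Based on the validation error above, a hedged workaround is to run the indexer 
with optimistic concurrency control and a lock provider, mirroring the 
configuration reported in HUDI-4156 (whether single-writer mode should also be 
accepted here is the actual question this issue raises):

{code:java}
hoodie.metadata.enable=true
hoodie.metadata.index.async=true
hoodie.metadata.index.column.stats.enable=true
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
{code}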



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default

2022-06-02 Thread GitBox


hudi-bot commented on PR #5402:
URL: https://github.com/apache/hudi/pull/5402#issuecomment-1145366721

   
   ## CI report:
   
   * 8c6f6e19940ce7ac04dfcfce52da3ccdaf3a8b0f UNKNOWN
   * c4799803cff8adffef56e889a5cd4d52599fcf73 UNKNOWN
   * c5616888bb267cb505a12b88cad3e99f9dd18d9b UNKNOWN
   * c02afe06f4b0d02291112351f62b1f4046faccc1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9055)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5732: [HUDI-4175] Fix hive sync options

2022-06-02 Thread GitBox


hudi-bot commented on PR #5732:
URL: https://github.com/apache/hudi/pull/5732#issuecomment-1145363692

   
   ## CI report:
   
   * 189a1f85cd52515367e121ac0abbca2e4010bccb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9040)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-4181) Async indexer fails for simple index creation

2022-06-02 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-4181.
-
Resolution: Duplicate

> Async indexer fails for simple index creation
> -
>
> Key: HUDI-4181
> URL: https://issues.apache.org/jira/browse/HUDI-4181
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>
> I tried to run async indexing to build out col stats and ran into issues. I had 
> to make the following fix to make progress, but this needs a proper fix. 
>  
> Local fix that worked for me.
> {code:java}
> diff --git 
> a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
>  
> b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
> index f5a96fb676..1e67020810 100644
> --- 
> a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
> +++ 
> b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
> @@ -955,7 +955,6 @@ public abstract class HoodieBackedTableMetadataWriter 
> implements HoodieTableMeta
>  HoodieTableFileSystemView fsView = 
> HoodieTableMetadataUtil.getFileSystemView(metadataMetaClient);
>  for (Map.Entry> entry : 
> partitionRecordsMap.entrySet()) {
>final String partitionName = entry.getKey().getPartitionPath();
> -  final int fileGroupCount = entry.getKey().getFileGroupCount();
>HoodieData records = entry.getValue();
>  
>List fileSlices =
> @@ -965,9 +964,10 @@ public abstract class HoodieBackedTableMetadataWriter 
> implements HoodieTableMeta
>  // so if there are no committed file slices, look for inflight slices
>  fileSlices = 
> HoodieTableMetadataUtil.getPartitionLatestFileSlicesIncludingInflight(metadataMetaClient,
>  Option.ofNullable(fsView), partitionName);
>}
> -  ValidationUtils.checkArgument(fileSlices.size() == fileGroupCount,
> +  final int fileGroupCount = fileSlices.size();
> +  /*ValidationUtils.checkArgument(fileSlices.size() == fileGroupCount,
>String.format("Invalid number of file groups for partition:%s, 
> found=%d, required=%d",
> -  partitionName, fileSlices.size(), fileGroupCount));
> +  partitionName, fileSlices.size(), fileGroupCount));*/
>  
>List finalFileSlices = fileSlices;
>HoodieData rddSinglePartitionRecords = records.map(r -> 
> { {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default

2022-06-02 Thread GitBox


hudi-bot commented on PR #5402:
URL: https://github.com/apache/hudi/pull/5402#issuecomment-1145363222

   
   ## CI report:
   
   * 8c6f6e19940ce7ac04dfcfce52da3ccdaf3a8b0f UNKNOWN
   * c4799803cff8adffef56e889a5cd4d52599fcf73 UNKNOWN
   * c5616888bb267cb505a12b88cad3e99f9dd18d9b UNKNOWN
   * 2b24187e10b42883b0c82fabbf5ad8684f566d80 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9054)
 
   * c02afe06f4b0d02291112351f62b1f4046faccc1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5732: [HUDI-4175] Fix hive sync options

2022-06-02 Thread GitBox


hudi-bot commented on PR #5732:
URL: https://github.com/apache/hudi/pull/5732#issuecomment-1145360310

   
   ## CI report:
   
   * 189a1f85cd52515367e121ac0abbca2e4010bccb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9040)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default

2022-06-02 Thread GitBox


hudi-bot commented on PR #5402:
URL: https://github.com/apache/hudi/pull/5402#issuecomment-1145359873

   
   ## CI report:
   
   * 8c6f6e19940ce7ac04dfcfce52da3ccdaf3a8b0f UNKNOWN
   * c4799803cff8adffef56e889a5cd4d52599fcf73 UNKNOWN
   * c5616888bb267cb505a12b88cad3e99f9dd18d9b UNKNOWN
   *  Unknown: [CANCELED](TBD) 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-4156) AsyncIndexer fails for column stats partition

2022-06-02 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545663#comment-17545663
 ] 

sivabalan narayanan commented on HUDI-4156:
---

Local fix to unblock myself for now:
{code:java}
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index f5a96fb676..1e67020810 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -955,7 +955,6 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableMeta
 HoodieTableFileSystemView fsView = 
HoodieTableMetadataUtil.getFileSystemView(metadataMetaClient);
 for (Map.Entry> entry : 
partitionRecordsMap.entrySet()) {
   final String partitionName = entry.getKey().getPartitionPath();
-  final int fileGroupCount = entry.getKey().getFileGroupCount();
   HoodieData records = entry.getValue();
 
   List fileSlices =
@@ -965,9 +964,10 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableMeta
 // so if there are no committed file slices, look for inflight slices
 fileSlices = 
HoodieTableMetadataUtil.getPartitionLatestFileSlicesIncludingInflight(metadataMetaClient,
 Option.ofNullable(fsView), partitionName);
   }
-  ValidationUtils.checkArgument(fileSlices.size() == fileGroupCount,
+  final int fileGroupCount = fileSlices.size();
+  /*ValidationUtils.checkArgument(fileSlices.size() == fileGroupCount,
   String.format("Invalid number of file groups for partition:%s, 
found=%d, required=%d",
-  partitionName, fileSlices.size(), fileGroupCount));
+  partitionName, fileSlices.size(), fileGroupCount));*/
 
   List finalFileSlices = fileSlices;
   HoodieData rddSinglePartitionRecords = records.map(r -> { 
{code}

> AsyncIndexer fails for column stats partition 
> --
>
> Key: HUDI-4156
> URL: https://issues.apache.org/jira/browse/HUDI-4156
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.11.1
>
>
> Tried to build col stats for a Hudi table with the async indexer and ran into the 
> below exception
>  
> Configs I had set are 
> {code:java}
> hoodie.metadata.enable=true
> hoodie.metadata.index.async=true
> hoodie.metadata.index.column.stats.enable=true
> hoodie.write.concurrency.mode=optimistic_concurrency_control
> hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
>  {code}
> command
> {code:java}
> ./bin/spark-submit --class org.apache.hudi.utilities.HoodieIndexer 
> /home/hadoop/hudi-utilities-bundle_2.12-0.12.0-SNAPSHOT.jar --props 
> file:///home/hadoop/indexer.properties --mode scheduleandexecute --base-path 
> TBL_PATH --table-name call_center --index-types COLUMN_STATS --parallelism 1 
> --spark-memory 10g {code}
>  
>  
> {code:java}
> 2022-05-26 00:14:27,936 INFO util.ClusteringUtils: Found 0 files in pending 
> clustering operations
> 2022-05-26 00:14:27,937 INFO client.BaseHoodieClient: Stopping Timeline 
> service !!
> 2022-05-26 00:14:27,937 INFO embedded.EmbeddedTimelineService: Closing 
> Timeline server
> 2022-05-26 00:14:27,937 INFO service.TimelineService: Closing Timeline Service
> 2022-05-26 00:14:27,937 INFO javalin.Javalin: Stopping Javalin ...
> 2022-05-26 00:14:27,945 INFO javalin.Javalin: Javalin has stopped
> 2022-05-26 00:14:27,945 INFO service.TimelineService: Closed Timeline Service
> 2022-05-26 00:14:27,945 INFO embedded.EmbeddedTimelineService: Closed 
> Timeline server
> 2022-05-26 00:14:27,945 INFO transaction.TransactionManager: Transaction 
> manager closed
> 2022-05-26 00:14:27,946 ERROR utilities.UtilHelpers: Indexer failed
> java.lang.IllegalArgumentException: Invalid number of file groups for 
> partition:column_stats, found=2, required=1
>   at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
>   at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.prepRecords(HoodieBackedTableMetadataWriter.java:968)
>   at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:132)
>   at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initialCommit(HoodieBackedTableMetadataWriter.java:1087)
>   at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.buildMetadataPartitions(HoodieBackedTableMetadataWriter.java:858)
>   
