[GitHub] [hudi] codope commented on pull request #4591: Revert "[HUDI-3233] Make metadata commit synchronous for flink batch"

2022-01-14 Thread GitBox


codope commented on pull request #4591:
URL: https://github.com/apache/hudi/pull/4591#issuecomment-1013631866


   > @codope : was this an attempt to fix the flakiness ?
   
   Yeah, closing this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codope closed pull request #4591: Revert "[HUDI-3233] Make metadata commit synchronous for flink batch"

2022-01-14 Thread GitBox


codope closed pull request #4591:
URL: https://github.com/apache/hudi/pull/4591


   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4607: [HUDI-3161][RFC-46] Add Call Produce Command for Spark SQL

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4607:
URL: https://github.com/apache/hudi/pull/4607#issuecomment-1013621545


   
   ## CI report:
   
   * 067b1741e59b23260e711e8e7275430c59552459 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5265)
 
   * 9ddbc330d21f82188865a3a76af2b79a98101d3b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5266)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4607: [HUDI-3161][RFC-46] Add Call Produce Command for Spark SQL

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4607:
URL: https://github.com/apache/hudi/pull/4607#issuecomment-1013628091


   
   ## CI report:
   
   * 9ddbc330d21f82188865a3a76af2b79a98101d3b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5266)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[jira] [Updated] (HUDI-2597) Improve code quality around Generics with Java 8

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2597:

Priority: Blocker  (was: Major)

> Improve code quality around Generics with Java 8
> 
>
> Key: HUDI-2597
> URL: https://issues.apache.org/jira/browse/HUDI-2597
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2596) Make class names consistent in hudi-client

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2596:

Priority: Blocker  (was: Major)

> Make class names consistent in hudi-client
> --
>
> Key: HUDI-2596
> URL: https://issues.apache.org/jira/browse/HUDI-2596
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2598) Redesign record payload class to decouple HoodieRecordPayload from Avro

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2598:

Priority: Blocker  (was: Critical)

> Redesign record payload class to decouple HoodieRecordPayload from Avro
> ---
>
> Key: HUDI-2598
> URL: https://issues.apache.org/jira/browse/HUDI-2598
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> We need to redesign the HoodieRecordPayload interface so that it does not 
> depend on Avro, paving the way for the Spark Row writer work. Ideally, the new 
> abstraction has separate implementations for Avro-based records and Spark 
> Rows, while the abstraction itself exposes higher-level operations such as 
> preCombine and getValue that are not tied to IndexedRecord. This should also 
> make the write client easier to extend to Spark Datasets of Rows.
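
One possible shape for such an abstraction is sketched below. This is only an illustration under stated assumptions, not the design adopted in Hudi: the names `RecordPayloadSketch`, `MapBackedPayload`, `ArrayBackedPayload`, and `orderingValue` are hypothetical, a `Map` stands in for an Avro record, and an `Object[]` stands in for a Spark Row so that the example runs on the JDK alone. Only the operation names `preCombine` and `getValue` come from the ticket.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical engine-neutral payload; T is the engine-specific record type.
interface RecordPayloadSketch<T> {
    // Pick the payload to keep when two payloads share a record key.
    RecordPayloadSketch<T> preCombine(RecordPayloadSketch<T> other);

    // Return the wrapped value, or empty when there is no value (e.g. a delete).
    Optional<T> getValue();

    // Value compared by preCombine, for example an event timestamp (assumption).
    long orderingValue();
}

// Stand-in for an Avro-backed implementation; a Map replaces IndexedRecord.
final class MapBackedPayload implements RecordPayloadSketch<Map<String, Object>> {
    private final Map<String, Object> fields;
    private final long ordering;

    MapBackedPayload(Map<String, Object> fields, long ordering) {
        this.fields = fields;
        this.ordering = ordering;
    }

    @Override
    public RecordPayloadSketch<Map<String, Object>> preCombine(
            RecordPayloadSketch<Map<String, Object>> other) {
        // Keep the payload with the larger ordering value; ties keep this one.
        return other.orderingValue() > ordering ? other : this;
    }

    @Override
    public Optional<Map<String, Object>> getValue() {
        return Optional.ofNullable(fields);
    }

    @Override
    public long orderingValue() {
        return ordering;
    }
}

// Stand-in for a Row-backed implementation; an Object[] replaces Spark's Row.
final class ArrayBackedPayload implements RecordPayloadSketch<Object[]> {
    private final Object[] row;
    private final long ordering;

    ArrayBackedPayload(Object[] row, long ordering) {
        this.row = row;
        this.ordering = ordering;
    }

    @Override
    public RecordPayloadSketch<Object[]> preCombine(RecordPayloadSketch<Object[]> other) {
        return other.orderingValue() > ordering ? other : this;
    }

    @Override
    public Optional<Object[]> getValue() {
        return Optional.ofNullable(row);
    }

    @Override
    public long orderingValue() {
        return ordering;
    }
}

final class PayloadSketchDemo {
    public static void main(String[] args) {
        Map<String, Object> older = new HashMap<>();
        older.put("fare", 10);
        Map<String, Object> newer = new HashMap<>();
        newer.put("fare", 12);
        // The payload with the larger ordering value wins the merge.
        RecordPayloadSketch<Map<String, Object>> winner =
                new MapBackedPayload(older, 1L).preCombine(new MapBackedPayload(newer, 2L));
        System.out.println(winner.getValue().get().get("fare"));
    }
}
```

In this sketch the write path would depend only on `RecordPayloadSketch<T>`, so adding a new engine would mean adding a new implementation instead of touching callers.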





[jira] [Updated] (HUDI-3042) Refactor clustering action in hudi-client module to use HoodieData abstraction

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3042:

Priority: Blocker  (was: Major)

> Refactor clustering action in hudi-client module to use HoodieData abstraction
> --
>
> Key: HUDI-3042
> URL: https://issues.apache.org/jira/browse/HUDI-3042
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Ethan Guo
>Priority: Blocker
>






[jira] [Updated] (HUDI-2638) Rewrite tests around Hudi index

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2638:

Priority: Blocker  (was: Major)

> Rewrite tests around Hudi index
> ---
>
> Key: HUDI-2638
> URL: https://issues.apache.org/jira/browse/HUDI-2638
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Priority: Blocker
>
> There is duplicate code between `TestFlinkHoodieBloomIndex` and 
> `TestHoodieBloomIndex`, among other test classes.  We should do one pass to 
> clean up the test code once the refactoring is done.





[jira] [Updated] (HUDI-2439) Refactor table.action.commit package (CommitActionExecutors) in hudi-client module

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2439:

Priority: Blocker  (was: Major)

> Refactor table.action.commit package (CommitActionExecutors) in hudi-client 
> module
> --
>
> Key: HUDI-2439
> URL: https://issues.apache.org/jira/browse/HUDI-2439
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2656) Generalize HoodieIndex for flexible record data type

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2656:

Priority: Blocker  (was: Major)

> Generalize HoodieIndex for flexible record data type
> 
>
> Key: HUDI-2656
> URL: https://issues.apache.org/jira/browse/HUDI-2656
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Resolved] (HUDI-752) Make CompactionAdminClient spark-free

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo resolved HUDI-752.


> Make CompactionAdminClient spark-free
> -
>
> Key: HUDI-752
> URL: https://issues.apache.org/jira/browse/HUDI-752
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently we always pass jsc around, yet there can be only one SparkContext 
> per JVM. We can therefore store it in a factory class and retrieve it 
> anywhere, which lets us make many classes spark-free.





[jira] [Commented] (HUDI-752) Make CompactionAdminClient spark-free

2022-01-14 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476551#comment-17476551
 ] 

Ethan Guo commented on HUDI-752:


This is resolved by using HoodieEngineContext.

> Make CompactionAdminClient spark-free
> -
>
> Key: HUDI-752
> URL: https://issues.apache.org/jira/browse/HUDI-752
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently we always pass jsc around, yet there can be only one SparkContext 
> per JVM. We can therefore store it in a factory class and retrieve it 
> anywhere, which lets us make many classes spark-free.





[jira] [Commented] (HUDI-750) Make AbstractHoodieClient spark-free

2022-01-14 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476550#comment-17476550
 ] 

Ethan Guo commented on HUDI-750:


This is resolved in [https://github.com/apache/hudi/pull/1827].

> Make AbstractHoodieClient spark-free
> 
>
> Key: HUDI-750
> URL: https://issues.apache.org/jira/browse/HUDI-750
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>






[jira] [Resolved] (HUDI-750) Make AbstractHoodieClient spark-free

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo resolved HUDI-750.


> Make AbstractHoodieClient spark-free
> 
>
> Key: HUDI-750
> URL: https://issues.apache.org/jira/browse/HUDI-750
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>






[jira] [Resolved] (HUDI-729) Replace JavaSparkContext/SQLContext with SparkSession

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo resolved HUDI-729.


> Replace JavaSparkContext/SQLContext with SparkSession
> -
>
> Key: HUDI-729
> URL: https://issues.apache.org/jira/browse/HUDI-729
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Replace JavaSparkContext/SQLContext with SparkSession.





[jira] [Commented] (HUDI-729) Replace JavaSparkContext/SQLContext with SparkSession

2022-01-14 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476549#comment-17476549
 ] 

Ethan Guo commented on HUDI-729:


It looks like this is resolved on latest master.  [~lamber-ken] Please reopen 
if there is more work to do.

> Replace JavaSparkContext/SQLContext with SparkSession
> -
>
> Key: HUDI-729
> URL: https://issues.apache.org/jira/browse/HUDI-729
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Replace JavaSparkContext/SQLContext with SparkSession.





[jira] [Resolved] (HUDI-682) Move HoodieReadClient into hudi-spark module

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo resolved HUDI-682.


> Move HoodieReadClient into hudi-spark module
> 
>
> Key: HUDI-682
> URL: https://issues.apache.org/jira/browse/HUDI-682
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>






[jira] [Commented] (HUDI-682) Move HoodieReadClient into hudi-spark module

2022-01-14 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476548#comment-17476548
 ] 

Ethan Guo commented on HUDI-682:


This is done in [https://github.com/apache/hudi/pull/1827].

> Move HoodieReadClient into hudi-spark module
> 
>
> Key: HUDI-682
> URL: https://issues.apache.org/jira/browse/HUDI-682
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>






[jira] [Commented] (HUDI-661) Make EmbeddedTimelineService spark free

2022-01-14 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476547#comment-17476547
 ] 

Ethan Guo commented on HUDI-661:


EmbeddedTimelineService is engine-agnostic after the refactoring.  Closing this.

> Make EmbeddedTimelineService spark free
> ---
>
> Key: HUDI-661
> URL: https://issues.apache.org/jira/browse/HUDI-661
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Currently, {{EmbeddedTimelineService}} owns {{SparkConf}} to get 
> {{spark.driver.host}}. The value is a string. We can pass it from the outside 
> instead of depending on {{SparkConf}}.
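
The idea in the description, handing the host in as a plain string so the service no longer reads `SparkConf` itself, can be sketched roughly as follows. The class names `TimelineServiceSketch` and `SparkSideCaller` are hypothetical and are not Hudi classes, and a `Map` stands in for `SparkConf` so the example needs no Spark dependency; only the key `spark.driver.host` comes from the ticket.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical engine-neutral service: it receives the host as a String
// and therefore has no compile-time dependency on any Spark class.
final class TimelineServiceSketch {
    private final String hostAddress;

    TimelineServiceSketch(String hostAddress) {
        if (hostAddress == null || hostAddress.isEmpty()) {
            throw new IllegalArgumentException("hostAddress must be provided by the caller");
        }
        this.hostAddress = hostAddress;
    }

    // Formats the address the service would be reachable at.
    String describe(int port) {
        return hostAddress + ":" + port;
    }
}

// Engine-specific caller: only this side knows where the host comes from.
final class SparkSideCaller {
    // The Map is a stand-in for SparkConf (assumption made for this sketch).
    static TimelineServiceSketch start(Map<String, String> sparkConfStandIn) {
        String host = sparkConfStandIn.get("spark.driver.host");
        return new TimelineServiceSketch(host);
    }
}

final class TimelineServiceSketchDemo {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.driver.host", "10.0.0.5"); // example value only
        System.out.println(SparkSideCaller.start(conf).describe(26754));
    }
}
```

A caller for another engine would resolve the host in its own way and pass it to the same constructor, which is what keeps the service itself engine-agnostic.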





[jira] [Resolved] (HUDI-661) Make EmbeddedTimelineService spark free

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo resolved HUDI-661.


> Make EmbeddedTimelineService spark free
> ---
>
> Key: HUDI-661
> URL: https://issues.apache.org/jira/browse/HUDI-661
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Currently, {{EmbeddedTimelineService}} owns {{SparkConf}} to get 
> {{spark.driver.host}}. The value is a string. We can pass it from the outside 
> instead of depending on {{SparkConf}}.





[jira] [Resolved] (HUDI-659) Make HoodieCommitArchiveLog spark free

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo resolved HUDI-659.


> Make HoodieCommitArchiveLog spark free
> --
>
> Key: HUDI-659
> URL: https://issues.apache.org/jira/browse/HUDI-659
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Currently, {{HoodieCommitArchiveLog}} depends on {{JavaSparkContext}} in its 
> two methods: {{archiveIfRequired}} and {{getInstantsToArchive}}. These two 
> methods pass {{JavaSparkContext}} to get the {{HoodieTable}} object. After diving 
> into the call chain, I found we can replace {{JavaSparkContext}} with 
> {{Configuration}} plus some other cleanup (e.g. HUDI-658). After that, we can 
> make {{HoodieCommitArchiveLog}} spark free.





[jira] [Commented] (HUDI-659) Make HoodieCommitArchiveLog spark free

2022-01-14 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476546#comment-17476546
 ] 

Ethan Guo commented on HUDI-659:


HoodieTimelineArchiveLog (new class name for HoodieCommitArchiveLog) is 
engine-agnostic now.  Closing this.

> Make HoodieCommitArchiveLog spark free
> --
>
> Key: HUDI-659
> URL: https://issues.apache.org/jira/browse/HUDI-659
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Currently, {{HoodieCommitArchiveLog}} depends on {{JavaSparkContext}} in its 
> two methods: {{archiveIfRequired}} and {{getInstantsToArchive}}. These two 
> methods pass {{JavaSparkContext}} to get the {{HoodieTable}} object. After diving 
> into the call chain, I found we can replace {{JavaSparkContext}} with 
> {{Configuration}} plus some other cleanup (e.g. HUDI-658). After that, we can 
> make {{HoodieCommitArchiveLog}} spark free.





[jira] [Resolved] (HUDI-658) Make ClientUtils spark-free

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo resolved HUDI-658.


> Make ClientUtils spark-free
> ---
>
> Key: HUDI-658
> URL: https://issues.apache.org/jira/browse/HUDI-658
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> {{ClientUtils#createMetaClient}} requires {{JavaSparkContext}} only for 
> getting the Hadoop configuration object. We can pass the {{Configuration}} 
> object directly so that we can make {{ClientUtils}} spark-free.





[jira] [Commented] (HUDI-658) Make ClientUtils spark-free

2022-01-14 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476545#comment-17476545
 ] 

Ethan Guo commented on HUDI-658:


This has already been resolved on latest master (ClientUtils no longer exists 
and AbstractHoodieClient::createMetaClient is engine-agnostic).  Closing the 
ticket.  [~yanghua] Please reopen if there is more work to do.

> Make ClientUtils spark-free
> ---
>
> Key: HUDI-658
> URL: https://issues.apache.org/jira/browse/HUDI-658
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> {{ClientUtils#createMetaClient}} requires {{JavaSparkContext}} only for 
> getting the Hadoop configuration object. We can pass the {{Configuration}} 
> object directly so that we can make {{ClientUtils}} spark-free.





[GitHub] [hudi] xushiyan commented on issue #4597: [SUPPORT] - Hudi Upserts Not working

2022-01-14 Thread GitBox


xushiyan commented on issue #4597:
URL: https://github.com/apache/hudi/issues/4597#issuecomment-1013623902


   @harishraju-govindaraju as suggested above, this looks like an unintended use 
of the index type. Can you use the GLOBAL_BLOOM setting instead? We should be good 
to close this as soon as you confirm.
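
For reference, the index type is selected through the `hoodie.index.type` write option. The sketch below only builds an option map with that setting; the class name `IndexOptionsSketch`, the table name, and the way the map reaches the writer are assumptions, since the thread does not show the original job.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that collects Hudi write options in one place.
final class IndexOptionsSketch {
    // Builds writer options that select the global Bloom index.
    static Map<String, String> globalBloomOptions(String tableName) {
        Map<String, String> options = new LinkedHashMap<>();
        // Table name is supplied by the caller; any value here is illustrative.
        options.put("hoodie.table.name", tableName);
        // Index type suggested in the comment above.
        options.put("hoodie.index.type", "GLOBAL_BLOOM");
        return options;
    }

    public static void main(String[] args) {
        System.out.println(globalBloomOptions("example_table"));
    }
}
```

With Spark, a map like this would typically be passed alongside the job's other Hudi options through `options(...)` on the DataFrame writer for the `hudi` format.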






[jira] [Updated] (HUDI-538) [UMBRELLA] Restructuring hudi client module for multi engine support

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-538:
---
Priority: Blocker  (was: Major)

> [UMBRELLA] Restructuring hudi client module for multi engine support
> 
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Blocker
>  Labels: hudi-umbrellas
> Fix For: 0.11.0
>
>
> Hudi is currently tightly coupled with the Spark framework, which makes 
> integration with other computing engines more difficult. We plan to decouple 
> it from Spark. This umbrella issue is used to track this work.





[jira] [Updated] (HUDI-538) [UMBRELLA] Restructuring hudi client module for multi engine support

2022-01-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-538:
---
Fix Version/s: 0.11.0

> [UMBRELLA] Restructuring hudi client module for multi engine support
> 
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 0.11.0
>
>
> Hudi is currently tightly coupled with the Spark framework, which makes 
> integration with other computing engines more difficult. We plan to decouple 
> it from Spark. This umbrella issue is used to track this work.





[GitHub] [hudi] xushiyan commented on issue #4600: [SUPPORT]When hive queries Hudi data, the query path is wrong

2022-01-14 Thread GitBox


xushiyan commented on issue #4600:
URL: https://github.com/apache/hudi/issues/4600#issuecomment-1013622766


   @gubinjie we need more info about your environment setup to analyze and 
reproduce the issue. Can you add details such as the Hudi version, environment, 
other software versions, how the table was prepared, and the steps to reproduce?






[GitHub] [hudi] hudi-bot commented on pull request #4607: [HUDI-3161][RFC-46] Add Call Produce Command for Spark SQL

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4607:
URL: https://github.com/apache/hudi/pull/4607#issuecomment-1013621545


   
   ## CI report:
   
   * 067b1741e59b23260e711e8e7275430c59552459 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5265)
 
   * 9ddbc330d21f82188865a3a76af2b79a98101d3b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5266)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4607: [HUDI-3161][RFC-46] Add Call Produce Command for Spark SQL

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4607:
URL: https://github.com/apache/hudi/pull/4607#issuecomment-1013614454


   
   ## CI report:
   
   * 067b1741e59b23260e711e8e7275430c59552459 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5265)
 
   * 9ddbc330d21f82188865a3a76af2b79a98101d3b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] XuQianJin-Stars commented on issue #3984: [SUPPORT] Upgrade from 0.8.0 to 0.9.0 removes functionality and decreases performance

2022-01-14 Thread GitBox


XuQianJin-Stars commented on issue #3984:
URL: https://github.com/apache/hudi/issues/3984#issuecomment-1013615169


   > > hi @cb149 @nsivabalan @xushiyan I have found the problem: you just need to `set hoodie.file.index.enable=false` for it to work
   > > ```
   > > val tripsSnapshotDF = spark.read.format("hudi")
   > >   .option("hoodie.file.index.enable", "false")
   > >   .load(basePath)
   > > ```
   > 
   > Hi @XuQianJin-Stars, that solves the problem but degrades performance severely, since it takes a very long time before the stage becomes visible in Spark.
   > 
   > E.g. as a workaround I am using `where("_partition like 'year=2021/month=6/%'").count` (depending on which column contains the partition path), which takes about 5 seconds in total, while setting `hoodie.file.index.enable` to false takes multiple minutes
   
   Regarding this, we plan to solve the problem completely in three steps: 
[HUDI-3200](https://issues.apache.org/jira/browse/HUDI-3200), [HUDI-3201](https://issues.apache.org/jira/browse/HUDI-3201), and [HUDI-3202](https://issues.apache.org/jira/browse/HUDI-3202).






[GitHub] [hudi] hudi-bot commented on pull request #4607: [HUDI-3161][RFC-46] Add Call Produce Command for Spark SQL

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4607:
URL: https://github.com/apache/hudi/pull/4607#issuecomment-1013614454


   
   ## CI report:
   
   * 067b1741e59b23260e711e8e7275430c59552459 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5265)
 
   * 9ddbc330d21f82188865a3a76af2b79a98101d3b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4607: [HUDI-3161][RFC-46] Add Call Produce Command for Spark SQL

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4607:
URL: https://github.com/apache/hudi/pull/4607#issuecomment-1013614115


   
   ## CI report:
   
   * 067b1741e59b23260e711e8e7275430c59552459 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5265)
 
   * 9ddbc330d21f82188865a3a76af2b79a98101d3b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4607: [HUDI-3161][RFC-46] Add Call Produce Command for Spark SQL

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4607:
URL: https://github.com/apache/hudi/pull/4607#issuecomment-1013614115


   
   ## CI report:
   
   * 067b1741e59b23260e711e8e7275430c59552459 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5265)
 
   * 9ddbc330d21f82188865a3a76af2b79a98101d3b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4607: [HUDI-3161][RFC-46] Add Call Produce Command for Spark SQL

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4607:
URL: https://github.com/apache/hudi/pull/4607#issuecomment-1013613746


   
   ## CI report:
   
   * 067b1741e59b23260e711e8e7275430c59552459 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5265)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4607: [HUDI-3161][RFC-46] Add Call Produce Command for Spark SQL

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4607:
URL: https://github.com/apache/hudi/pull/4607#issuecomment-1013613746


   
   ## CI report:
   
   * 067b1741e59b23260e711e8e7275430c59552459 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5265)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4607: [HUDI-3161][RFC-46] Add Call Produce Command for Spark SQL

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4607:
URL: https://github.com/apache/hudi/pull/4607#issuecomment-1013613381


   
   ## CI report:
   
   * 067b1741e59b23260e711e8e7275430c59552459 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4607: [HUDI-3161][RFC-46] Add Call Produce Command for Spark SQL

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4607:
URL: https://github.com/apache/hudi/pull/4607#issuecomment-1013613381


   
   ## CI report:
   
   * 067b1741e59b23260e711e8e7275430c59552459 UNKNOWN
   
   




[GitHub] [hudi] XuQianJin-Stars opened a new pull request #4607: [HUDI-3161][RFC-46] Add Call Produce Command for Spark SQL

2022-01-14 Thread GitBox


XuQianJin-Stars opened a new pull request #4607:
URL: https://github.com/apache/hudi/pull/4607


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[jira] [Updated] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-14 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3242:

Priority: Critical  (was: Blocker)

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Priority: Critical
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png
>
>
> Hi, I am testing release branch 0.10.1 because I need a few bug fixes from it.
> However, for a certain table I see only partial discovery of files 
> after the initial commit of the table.
> But if the second partition is given as input for the first commit, all the 
> files are discovered.
> First partition : 2021/01 has 744 files and all of them are discovered
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> Checkpoint is set to 0. 
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02 \
> --hoodie-conf hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector \
> --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, to_timestamp(timestamp) as timestamp, sid, date_format(to_timestamp(timestamp), '/MM/dd') AS date FROM  a \"" \
> --hoodie-conf hoodie.file.listing.parallelism=256 \
> --hoodie-conf hoodie.finalize.write.parallelism=256 \
> 

[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013607661


   
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264)
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013598133


   
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264)
   
   




[GitHub] [hudi] alexeykudinkin edited a comment on pull request #4551: [HUDI-3010] Unbundle parquet-avro and shade hbase in presto-bundle

2022-01-14 Thread GitBox


alexeykudinkin edited a comment on pull request #4551:
URL: https://github.com/apache/hudi/pull/4551#issuecomment-1013598939


   That makes total sense to me. But for that we have to update the Docker 
images we're using in ITs, right? If I'd revert those changes, my PR would have 
ITs failing b/c of missing classes. 
   
   Let me know when you'll be able to update the Docker images and I'll revert the 
POM changes.
   
   **EDIT** 
   
   Please keep in mind that Hive flows on the current master don't involve the 
Metadata table (which uses HFile), and therefore we'd need to validate that it 
works, either by triggering that flow manually or by basing it on top of #4556, which 
does trigger Metadata table usage in Hive flows.






[jira] [Created] (HUDI-3250) Upgrade Presto version in docker setup and integ test

2022-01-14 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-3250:
-

 Summary: Upgrade Presto version in docker setup and integ test
 Key: HUDI-3250
 URL: https://issues.apache.org/jira/browse/HUDI-3250
 Project: Apache Hudi
  Issue Type: Test
Reporter: Sagar Sumit
Assignee: Sagar Sumit
 Fix For: 0.11.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] alexeykudinkin commented on pull request #4551: [HUDI-3010] Unbundle parquet-avro and shade hbase in presto-bundle

2022-01-14 Thread GitBox


alexeykudinkin commented on pull request #4551:
URL: https://github.com/apache/hudi/pull/4551#issuecomment-1013598939


   That makes total sense to me. But for that we have to update the Docker 
images we're using in ITs, right? If I'd revert those changes, my PR would have 
ITs failing b/c of missing classes. 
   
   Let me know when you'll be able to update the Docker images and I'll revert the 
POM changes.






[jira] [Updated] (HUDI-2872) Enable data skipping index even for sort based clustering

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-2872:
--
Reviewers: Y Ethan Guo

> Enable data skipping index even for sort based clustering
> -
>
> Key: HUDI-2872
> URL: https://issues.apache.org/jira/browse/HUDI-2872
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2646) Unify configurations for clustering execution strategy and layout optimization

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-2646:
--
Reviewers: Y Ethan Guo

> Unify configurations for clustering execution strategy and layout optimization
> --
>
> Key: HUDI-2646
> URL: https://issues.apache.org/jira/browse/HUDI-2646
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> IIUC in the current implementation, if user turns on both clustering and 
> layout optimization, the layout optimization/space filling curve based 
> sorting is what will take effect. This may be confusing for users. 





[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013598133


   
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264)
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013597581


   
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 UNKNOWN
   
   




[jira] [Updated] (HUDI-2646) Unify configurations for clustering execution strategy and layout optimization

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-2646:
--
Sprint: Hudi-Sprint-Jan-10

> Unify configurations for clustering execution strategy and layout optimization
> --
>
> Key: HUDI-2646
> URL: https://issues.apache.org/jira/browse/HUDI-2646
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> IIUC in the current implementation, if user turns on both clustering and 
> layout optimization, the layout optimization/space filling curve based 
> sorting is what will take effect. This may be confusing for users. 





[jira] [Updated] (HUDI-2646) Unify configurations for clustering execution strategy and layout optimization

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-2646:
--
Fix Version/s: 0.11.0

> Unify configurations for clustering execution strategy and layout optimization
> --
>
> Key: HUDI-2646
> URL: https://issues.apache.org/jira/browse/HUDI-2646
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> IIUC in the current implementation, if user turns on both clustering and 
> layout optimization, the layout optimization/space filling curve based 
> sorting is what will take effect. This may be confusing for users. 





[jira] [Updated] (HUDI-2646) Unify configurations for clustering execution strategy and layout optimization

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-2646:
--
Priority: Blocker  (was: Major)

> Unify configurations for clustering execution strategy and layout optimization
> --
>
> Key: HUDI-2646
> URL: https://issues.apache.org/jira/browse/HUDI-2646
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
>
> IIUC in the current implementation, if user turns on both clustering and 
> layout optimization, the layout optimization/space filling curve based 
> sorting is what will take effect. This may be confusing for users. 





[jira] [Updated] (HUDI-2872) Enable data skipping index even for sort based clustering

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-2872:
--
Sprint: Hudi-Sprint-Jan-10

> Enable data skipping index even for sort based clustering
> -
>
> Key: HUDI-2872
> URL: https://issues.apache.org/jira/browse/HUDI-2872
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013597581


   
   ## CI report:
   
   * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 UNKNOWN
   
   




[jira] [Updated] (HUDI-2872) Enable data skipping index even for sort based clustering

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-2872:
--
Status: Patch Available  (was: In Progress)

> Enable data skipping index even for sort based clustering
> -
>
> Key: HUDI-2872
> URL: https://issues.apache.org/jira/browse/HUDI-2872
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2646) Unify configurations for clustering execution strategy and layout optimization

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-2646:
--
Status: In Progress  (was: Open)

> Unify configurations for clustering execution strategy and layout optimization
> --
>
> Key: HUDI-2646
> URL: https://issues.apache.org/jira/browse/HUDI-2646
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Major
>
> IIUC in the current implementation, if user turns on both clustering and 
> layout optimization, the layout optimization/space filling curve based 
> sorting is what will take effect. This may be confusing for users. 





[jira] [Assigned] (HUDI-2646) Unify configurations for clustering execution strategy and layout optimization

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-2646:
-

Assignee: Alexey Kudinkin

> Unify configurations for clustering execution strategy and layout optimization
> --
>
> Key: HUDI-2646
> URL: https://issues.apache.org/jira/browse/HUDI-2646
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Major
>
> IIUC in the current implementation, if user turns on both clustering and 
> layout optimization, the layout optimization/space filling curve based 
> sorting is what will take effect. This may be confusing for users. 





[jira] [Updated] (HUDI-2646) Unify configurations for clustering execution strategy and layout optimization

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-2646:
--
Status: Patch Available  (was: In Progress)

> Unify configurations for clustering execution strategy and layout optimization
> --
>
> Key: HUDI-2646
> URL: https://issues.apache.org/jira/browse/HUDI-2646
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Major
>
> IIUC in the current implementation, if user turns on both clustering and 
> layout optimization, the layout optimization/space filling curve based 
> sorting is what will take effect. This may be confusing for users. 





[jira] [Updated] (HUDI-2872) Enable data skipping index even for sort based clustering

2022-01-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-2872:
-
Labels: pull-request-available  (was: )

> Enable data skipping index even for sort based clustering
> -
>
> Key: HUDI-2872
> URL: https://issues.apache.org/jira/browse/HUDI-2872
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[GitHub] [hudi] alexeykudinkin opened a new pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

2022-01-14 Thread GitBox


alexeykudinkin opened a new pull request #4606:
URL: https://github.com/apache/hudi/pull/4606


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   Refactoring layout optimization (clustering) flow to
- Enable support for linear (lexicographic) ordering as one of the ordering 
strategies (along w/ Z-order, Hilbert)
- Reconcile Layout Optimization and Clustering configuration to be more 
congruent
   
   ## Brief change log
   
- Refactored layout optimization flow to enable support for linear 
(lexicographic) ordering in column-stats indexes
- Reconcile Layout Optimization and Clustering configuration to be more 
congruent
- Refactored tests to validate full matrix of all optimization strategies, 
spatial curve composition strategies
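
To make the difference between linear (lexicographic) ordering and a space-filling curve such as Z-order concrete, here is a minimal sketch on two integer columns. It is illustrative only: `OrderingSketch`, `zOrderKey`, `LINEAR`, and `Z_ORDER` are hypothetical names, not classes or methods from this PR, and the bit interleaving shown is the textbook Morton encoding rather than Hudi's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: hypothetical names, not part of Hudi. */
final class OrderingSketch {

  /**
   * Interleaves the low 16 bits of x and y into a Z-order (Morton) key.
   * Bits of x land on even positions, bits of y on odd positions.
   */
  static long zOrderKey(int x, int y) {
    long key = 0L;
    for (int bit = 0; bit < 16; bit++) {
      key |= (long) ((x >> bit) & 1) << (2 * bit);
      key |= (long) ((y >> bit) & 1) << (2 * bit + 1);
    }
    return key;
  }

  /** Linear (lexicographic) ordering: compare by x first, then by y. */
  static final Comparator<int[]> LINEAR =
      Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]);

  /** Z-order: compare by the interleaved key, so neither column dominates. */
  static final Comparator<int[]> Z_ORDER =
      Comparator.comparingLong(p -> zOrderKey(p[0], p[1]));

  /** Returns a sorted copy of the points, leaving the input untouched. */
  static List<int[]> sorted(List<int[]> points, Comparator<int[]> order) {
    List<int[]> copy = new ArrayList<>(points);
    copy.sort(order);
    return copy;
  }
}
```

With the four points (0,0), (0,1), (1,0), (1,1), `LINEAR` keeps all rows with x = 0 together, while `Z_ORDER` alternates between the two columns, which is why the two strategies produce different file layouts for the same data.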
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[GitHub] [hudi] codope commented on pull request #4551: [HUDI-3010] Unbundle parquet-avro and shade hbase in presto-bundle

2022-01-14 Thread GitBox


codope commented on pull request #4551:
URL: https://github.com/apache/hudi/pull/4551#issuecomment-1013594138


   > @codope i had to revert these changes in my PR, since Presto queries are 
failing after rebase:
   > 
   > ```
   > 2022-01-14T20:45:04.265Z   WARN   hive-hive-0   com.facebook.presto.hive.util.ResumableTasks   ResumableTask completed exceptionally
   > java.lang.NoClassDefFoundError: org/apache/avro/message/BinaryMessageEncoder
   >    at org.apache.hudi.avro.model.HoodieMetadataRecord.(HoodieMetadataRecord.java:23)
   > ```
   
   
   @alexeykudinkin Let's not revert this. Instead, we should upgrade the Presto 
version in the Hudi integ tests. Currently it is 0.217, which is over 3 years old and 
does not package avro.message. We want our bundles to be as lightweight as possible, 
so we rely on deps provided by Presto as much as possible. Moreover, 0.217 is 
far removed from reality: it does not contain the Hudi-specific changes 
that we made in Presto. Also, most Hudi users that I have interacted with are on 
0.246 or later.
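
   One quick way to confirm whether a given runtime actually provides a class such as `org.apache.avro.message.BinaryMessageEncoder` is to probe the classpath directly. The following is only a sketch: `ClasspathCheck` and `isPresent` are hypothetical names, and it uses nothing beyond the standard `Class.forName` call.

```java
/** Illustrative helper; the class and method names are hypothetical. */
final class ClasspathCheck {

  /**
   * Returns true if the named class can be located by this class's loader.
   * The class is looked up without being initialized.
   */
  static boolean isPresent(String className) {
    try {
      Class.forName(className, false, ClasspathCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      // Missing class, or a class whose own dependencies cannot be linked.
      return false;
    }
  }

  public static void main(String[] args) {
    String name = args.length > 0
        ? args[0]
        : "org.apache.avro.message.BinaryMessageEncoder";
    System.out.println(name + " present: " + isPresent(name));
  }
}
```

   Running this with the same classpath as the failing query engine would show whether the class is supplied by the platform or has to come from the bundle.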






[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve

2022-01-14 Thread GitBox


scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r785259770



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
 .getLatestFileSlices(partitionPath)
 .filter(slice -> 
!fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
 .map(s -> {
+  // We can think that the latest data is in the latest delta log 
file, so we sort it from large

Review comment:
   In addition, I changed the reading order of the delta log files to avoid data 
rewriting to the greatest extent. HoodieRecordPayload#preCombine will still 
execute and select the correct data.
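
   The argument above can be sketched with a simplified model: if the merge step always keeps the record with the larger ordering value, the result does not depend on the order in which log files are read. This is an assumption-laden illustration, not Hudi code: `PreCombineSketch`, `Rec`, and `merge` are hypothetical, and the tie-breaking rule is a choice made for this sketch only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of ordering-based merging; not Hudi's actual API. */
final class PreCombineSketch {

  /** A record with a key, an ordering value, and a payload. */
  static final class Rec {
    final String key;
    final long orderingVal;
    final String value;

    Rec(String key, long orderingVal, String value) {
      this.key = key;
      this.orderingVal = orderingVal;
      this.value = value;
    }
  }

  /** Keeps whichever record has the larger ordering value; on a tie, keeps the one seen first. */
  static Rec preCombine(Rec seenFirst, Rec seenLater) {
    return seenLater.orderingVal > seenFirst.orderingVal ? seenLater : seenFirst;
  }

  /** Merges all records from the log files in the given read order. */
  static Map<String, Rec> merge(List<List<Rec>> logFiles) {
    Map<String, Rec> merged = new HashMap<>();
    for (List<Rec> file : logFiles) {
      for (Rec rec : file) {
        merged.merge(rec.key, rec, PreCombineSketch::preCombine);
      }
    }
    return merged;
  }
}
```

   Note the caveat this model exposes: when two records share the same ordering value, the winner does depend on read order, so a payload implementation that is not strictly ordering-based could behave differently once the read order changes.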








[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve

2022-01-14 Thread GitBox


scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r785259289



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
 .getLatestFileSlices(partitionPath)
 .filter(slice -> 
!fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
 .map(s -> {
+  // We can think that the latest data is in the latest delta log 
file, so we sort it from large

Review comment:
   > I think you are assuming the later writes in the log always overwrites 
the earlier ones? this is not true always.
   
   In the compaction plan generation phase, I only changed the order in which 
delta log files are read. I have used this method in our internal production 
environment for a month, and no data exceptions have occurred. I am not sure how 
to test this part. Can you give me some suggestions?








[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve

2022-01-14 Thread GitBox


scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r785259224



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
 .getLatestFileSlices(partitionPath)
 .filter(slice -> 
!fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
 .map(s -> {
+  // We can think that the latest data is in the latest delta log 
file, so we sort it from large

Review comment:
   > I think you are assuming the later writes in the log always overwrites 
the earlier ones? this is not true always.
   
   In the compaction plan generation phase, I only changed the order in which 
delta log files are read. I have used this method in our internal production 
environment for a month, and no data exceptions have occurred (cluster, clean, and 
compact all inline). I am not sure how to test this part. Can you give me 
some suggestions?
   








[GitHub] [hudi] hudi-bot commented on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4449:
URL: https://github.com/apache/hudi/pull/4449#issuecomment-1013578806


   
   ## CI report:
   
   * 3d6e3e70a7c3c1bb0b1b9d9e2945bc1dcdc1da5a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5263)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4449:
URL: https://github.com/apache/hudi/pull/4449#issuecomment-1013560358


   
   ## CI report:
   
   * ce8a8d9547819b23368115ba640caed1cb385213 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5039)
 
   * 3d6e3e70a7c3c1bb0b1b9d9e2945bc1dcdc1da5a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5263)
 
   
   




[jira] [Commented] (HUDI-1576) Add ability to perform archival synchronously

2022-01-14 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476506#comment-17476506
 ] 

Nishith Agarwal commented on HUDI-1576:
---

[~guoyihua] Yes, the idea was to move archiving from inline to async. 
Even today, though, archiving happens after the "COMMIT" has completed 
successfully and the file has been created on disk, so introducing a new 
action is not needed. I think archival can just run async and keep archiving 
contents without creating any action, since that may be overkill. One side 
effect I see is that we still need a way to track the progress and activity 
of archiving on a table. Since the .archive folder has this history, it 
should be fine. That's my opinion.

> Add ability to perform archival synchronously
> -
>
> Key: HUDI-1576
> URL: https://issues.apache.org/jira/browse/HUDI-1576
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, archival runs inline. We want to move archival to a table service 
> like cleaning, compaction, etc., and treat it like that. Of course, no new 
> action will be introduced. 
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot commented on pull request #4559: [HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4559:
URL: https://github.com/apache/hudi/pull/4559#issuecomment-1013576099


   
   ## CI report:
   
   * 47970bd3a9cbbf2eb85b0a87f899256487efdffa UNKNOWN
   * 7798caf61854f0789bfdae4fa542ef1b0b6008a6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5262)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4559: [HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4559:
URL: https://github.com/apache/hudi/pull/4559#issuecomment-1013557149


   
   ## CI report:
   
   * 47970bd3a9cbbf2eb85b0a87f899256487efdffa UNKNOWN
   * a70ea22d06f2b91b0c7e005e3db3c4d3faaf1d75 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5258)
 
   * 7798caf61854f0789bfdae4fa542ef1b0b6008a6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5262)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191][Stacked on 4531] Removing duplicating file-listing process w/in Hive's MOR `FIleInputFormat`s

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1013554187


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 0538475fe488089c4cb3bc0afb63c2f518329969 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5257)
 
   * 9f4b334e340fb26004e1b07fc8fbaaa047c76775 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5261)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191][Stacked on 4531] Removing duplicating file-listing process w/in Hive's MOR `FIleInputFormat`s

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1013567395


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 9f4b334e340fb26004e1b07fc8fbaaa047c76775 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5261)
 
   
   




[GitHub] [hudi] manojpec commented on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field

2022-01-14 Thread GitBox


manojpec commented on pull request #4449:
URL: https://github.com/apache/hudi/pull/4449#issuecomment-1013560479


   @prashantwason @vinothchandar 
   
   After discussions, I made HoodieHFileReader the single source of truth for 
all HFile schema-related fields. HFileReader already tracks the other fields, 
so it makes sense to move the key field here as well. MetadataPayload and 
HFileDataBlock will refer to HFileReader for the key field. Passing the 
properties to HFileDataBlock can come in a separate PR, as suggested. 






[GitHub] [hudi] hudi-bot commented on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4449:
URL: https://github.com/apache/hudi/pull/4449#issuecomment-1013560358


   
   ## CI report:
   
   * ce8a8d9547819b23368115ba640caed1cb385213 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5039)
 
   * 3d6e3e70a7c3c1bb0b1b9d9e2945bc1dcdc1da5a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5263)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4449:
URL: https://github.com/apache/hudi/pull/4449#issuecomment-1013559504


   
   ## CI report:
   
   * ce8a8d9547819b23368115ba640caed1cb385213 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5039)
 
   * 3d6e3e70a7c3c1bb0b1b9d9e2945bc1dcdc1da5a UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4449:
URL: https://github.com/apache/hudi/pull/4449#issuecomment-1013559504


   
   ## CI report:
   
   * ce8a8d9547819b23368115ba640caed1cb385213 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5039)
 
   * 3d6e3e70a7c3c1bb0b1b9d9e2945bc1dcdc1da5a UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4449:
URL: https://github.com/apache/hudi/pull/4449#issuecomment-1008518397


   
   ## CI report:
   
   * ce8a8d9547819b23368115ba640caed1cb385213 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5039)
 
   
   




[GitHub] [hudi] manojpec commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field

2022-01-14 Thread GitBox


manojpec commented on a change in pull request #4449:
URL: https://github.com/apache/hudi/pull/4449#discussion_r785240782



##
File path: 
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
##
@@ -62,6 +64,7 @@
   // Scanner used to read individual keys. This is cached to prevent the 
overhead of opening the scanner for each
   // key retrieval.
   private HFileScanner keyScanner;
+  private final String keyField = HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY;

Review comment:
   fixed.








[GitHub] [hudi] manojpec commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field

2022-01-14 Thread GitBox


manojpec commented on a change in pull request #4449:
URL: https://github.com/apache/hudi/pull/4449#discussion_r785240751



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
##
@@ -122,9 +118,9 @@ public HoodieLogBlockType getBlockType() {
   if (useIntegerKey) {
 recordKey = String.format("%" + keySize + "s", key++);
   } else {
-recordKey = record.get(keyField.pos()).toString();
+recordKey = record.get(schemaKeyField.pos()).toString();

Review comment:
   IndexedRecord supports lookup by index only, not by name. 
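
   As a rough illustration of the point above, the pattern in the diff is to resolve the field name to a position through the schema once, then read the value by that position. The sketch below uses small hypothetical stand-in classes (`SketchSchema`, `SketchRecord`, `PositionalLookupSketch`) rather than the Avro types, so it only mirrors the shape of `getSchema().getField(name).pos()` followed by `record.get(pos)`; it is not the code in this PR.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a schema: it only maps field names to positions.
final class SketchSchema {
  private final List<String> fieldNames;

  SketchSchema(String... fieldNames) {
    this.fieldNames = Arrays.asList(fieldNames);
  }

  // Returns the position of the named field, or -1 when the schema lacks it.
  int positionOf(String name) {
    return fieldNames.indexOf(name);
  }
}

// Hypothetical stand-in for a record that can only be read by position.
final class SketchRecord {
  private final SketchSchema schema;
  private final Object[] values;

  SketchRecord(SketchSchema schema, Object... values) {
    this.schema = schema;
    this.values = values;
  }

  SketchSchema getSchema() {
    return schema;
  }

  Object get(int pos) {
    return values[pos];
  }
}

final class PositionalLookupSketch {
  // Resolve the name to a position via the schema, then read by position.
  static String recordKey(SketchRecord record, String keyFieldName) {
    int pos = record.getSchema().positionOf(keyFieldName);
    if (pos < 0) {
      throw new IllegalArgumentException("No such field: " + keyFieldName);
    }
    return record.get(pos).toString();
  }

  public static void main(String[] args) {
    SketchSchema schema = new SketchSchema("key", "value");
    SketchRecord record = new SketchRecord(schema, "k1", 42);
    System.out.println(recordKey(record, "key"));
  }
}
```

   Resolving the position once per batch, as the diff does with `schemaKeyField`, avoids repeating the name lookup for every record.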








[GitHub] [hudi] manojpec commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field

2022-01-14 Thread GitBox


manojpec commented on a change in pull request #4449:
URL: https://github.com/apache/hudi/pull/4449#discussion_r785240599



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
##
@@ -83,6 +84,11 @@
   .withDocumentation("Lower values increase the size of metadata tracked 
within HFile, but can offer potentially "
   + "faster lookup times.");
 
+  public static final ConfigProperty HFILE_SCHEMA_KEY_FIELD_ID = 
ConfigProperty
+  .key("hoodie.hfile.schema.key.field.id")
+  .defaultValue(HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY)

Review comment:
   fixed.

##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
##
@@ -83,6 +84,11 @@
   .withDocumentation("Lower values increase the size of metadata tracked 
within HFile, but can offer potentially "
   + "faster lookup times.");
 
+  public static final ConfigProperty HFILE_SCHEMA_KEY_FIELD_ID = 
ConfigProperty
+  .key("hoodie.hfile.schema.key.field.id")
+  .defaultValue(HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY)
+  .withDocumentation("Key field name for the HFile schema. This key field 
is used for on-disk storage optimization");

Review comment:
   fixed.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
##
@@ -110,8 +106,8 @@ public HoodieLogBlockType getBlockType() {
 boolean useIntegerKey = false;
 int key = 0;
 int keySize = 0;
-Field keyField = records.get(0).getSchema().getField(this.keyField);
-if (keyField == null) {
+final Field schemaKeyField = 
records.get(0).getSchema().getField(HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY);

Review comment:
   fixed. 








[GitHub] [hudi] manojpec commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field

2022-01-14 Thread GitBox


manojpec commented on a change in pull request #4449:
URL: https://github.com/apache/hudi/pull/4449#discussion_r785240580



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
##
@@ -83,6 +84,11 @@
   .withDocumentation("Lower values increase the size of metadata tracked 
within HFile, but can offer potentially "
   + "faster lookup times.");
 
+  public static final ConfigProperty HFILE_SCHEMA_KEY_FIELD_ID = 
ConfigProperty
+  .key("hoodie.hfile.schema.key.field.id")

Review comment:
   fixed.








[GitHub] [hudi] hudi-bot commented on pull request #4559: [HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4559:
URL: https://github.com/apache/hudi/pull/4559#issuecomment-1013557149


   
   ## CI report:
   
   * 47970bd3a9cbbf2eb85b0a87f899256487efdffa UNKNOWN
   * a70ea22d06f2b91b0c7e005e3db3c4d3faaf1d75 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5258)
 
   * 7798caf61854f0789bfdae4fa542ef1b0b6008a6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5262)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4559: [HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4559:
URL: https://github.com/apache/hudi/pull/4559#issuecomment-1013537449


   
   ## CI report:
   
   * 47970bd3a9cbbf2eb85b0a87f899256487efdffa UNKNOWN
   * a70ea22d06f2b91b0c7e005e3db3c4d3faaf1d75 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5258)
 
   * 7798caf61854f0789bfdae4fa542ef1b0b6008a6 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191][Stacked on 4531] Removing duplicating file-listing process w/in Hive's MOR `FIleInputFormat`s

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1013537426


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 0538475fe488089c4cb3bc0afb63c2f518329969 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5257)
 
   * 9f4b334e340fb26004e1b07fc8fbaaa047c76775 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191][Stacked on 4531] Removing duplicating file-listing process w/in Hive's MOR `FIleInputFormat`s

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1013554187


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 0538475fe488089c4cb3bc0afb63c2f518329969 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5257)
 
   * 9f4b334e340fb26004e1b07fc8fbaaa047c76775 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5261)
 
   
   




[jira] [Updated] (HUDI-3179) Extract common Hudi Table File Index implementation

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3179:
--
Reviewers: Vinoth Chandar, Y Ethan Guo

> Extract common Hudi Table File Index implementation 
> 
>
> Key: HUDI-3179
> URL: https://issues.apache.org/jira/browse/HUDI-3179
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Extract common Hudi Table File Index implementation from Spark's 
> `HoodieFileIndex`, to leverage common file indexing functionality across 
> Spark/Hive





[jira] [Updated] (HUDI-3206) Unify Hive's MOR `InputFormat` implementations (Parquet, HFile)

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3206:
--
Reviewers: Vinoth Chandar, Y Ethan Guo

> Unify Hive's MOR `InputFormat` implementations (Parquet, HFile)
> ---
>
> Key: HUDI-3206
> URL: https://issues.apache.org/jira/browse/HUDI-3206
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Essentially, Hive's different MOR implementations should only differ in the 
> file-format of the actual base-files written. 
>  
> Today, that's not the case: currently Hive's MOR `InputFormat` 
> implementations inherit from their respective COW file-format counterparts 
> (`HoodieParquetInputFormat`, `HoodieHFileInputFormat`).
>  
> Instead, we should unify both MOR impls under a common base class, which 
> separate file-format specific impls (Parquet, HFile) would extend, only 
> overriding the `getRecordReader` method.
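
The structure proposed above can be sketched roughly as follows. All class and method names other than `getRecordReader` are hypothetical, and the types are simplified stand-ins rather than the Hadoop or Hudi interfaces; the point is only that shared logic sits in one base class while each file format overrides a single hook.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified reader type (not Hadoop's RecordReader).
interface SketchRecordReader {
  String describe();
}

// Hypothetical common base class holding everything that is format-agnostic.
abstract class MergeOnReadInputFormatSketch {
  // Shared logic: pair each base file with its log files (greatly simplified).
  final List<String> listSplits(List<String> baseFiles) {
    List<String> splits = new ArrayList<>();
    for (String baseFile : baseFiles) {
      splits.add(baseFile + "+logs");
    }
    return splits;
  }

  // The only format-specific hook that subclasses override.
  abstract SketchRecordReader getRecordReader(String split);
}

final class ParquetMorSketch extends MergeOnReadInputFormatSketch {
  @Override
  SketchRecordReader getRecordReader(String split) {
    return () -> "parquet reader for " + split;
  }
}

final class HFileMorSketch extends MergeOnReadInputFormatSketch {
  @Override
  SketchRecordReader getRecordReader(String split) {
    return () -> "hfile reader for " + split;
  }
}
```

With this layout, adding another base-file format would mean one more small subclass, and the split-listing code would not need to be duplicated.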





[jira] [Updated] (HUDI-3191) Rebase Hive's FileInputFormat onto AbstractHoodieTableFileIndex

2022-01-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3191:
--
Reviewers: Vinoth Chandar, Y Ethan Guo

> Rebase Hive's FileInputFormat onto AbstractHoodieTableFileIndex
> ---
>
> Key: HUDI-3191
> URL: https://issues.apache.org/jira/browse/HUDI-3191
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> There are multiple control flows that would require accurate re-mapping to 
> start leveraging `AbstractHoodieTableFileIndex`
>  # Snapshot Query mode
>  # Incremental Query mode
> This task would focus mostly on rebasing Snapshot Mode





[GitHub] [hudi] hudi-bot removed a comment on pull request #4559: [HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4559:
URL: https://github.com/apache/hudi/pull/4559#issuecomment-1013471849


   
   ## CI report:
   
   * 47970bd3a9cbbf2eb85b0a87f899256487efdffa UNKNOWN
   * a70ea22d06f2b91b0c7e005e3db3c4d3faaf1d75 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5258)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191][Stacked on 4531] Removing duplicating file-listing process w/in Hive's MOR `FIleInputFormat`s

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1013537426


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 0538475fe488089c4cb3bc0afb63c2f518329969 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5257)
 
   * 9f4b334e340fb26004e1b07fc8fbaaa047c76775 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4559: [HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4559:
URL: https://github.com/apache/hudi/pull/4559#issuecomment-1013537449


   
   ## CI report:
   
   * 47970bd3a9cbbf2eb85b0a87f899256487efdffa UNKNOWN
   * a70ea22d06f2b91b0c7e005e3db3c4d3faaf1d75 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5258)
 
   * 7798caf61854f0789bfdae4fa542ef1b0b6008a6 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191][Stacked on 4531] Removing duplicating file-listing process w/in Hive's MOR `FIleInputFormat`s

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1013451270


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 0538475fe488089c4cb3bc0afb63c2f518329969 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5257)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4531: [HUDI-3191][Stacked on 4520] Rebasing Hive's FileInputFormat onto `AbstractHoodieTableFileIndex`

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4531:
URL: https://github.com/apache/hudi/pull/4531#issuecomment-1013535231


   
   ## CI report:
   
   * d8313ea2d2d4e98e35214573022aa7562936a166 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5260)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4531: [HUDI-3191][Stacked on 4520] Rebasing Hive's FileInputFormat onto `AbstractHoodieTableFileIndex`

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4531:
URL: https://github.com/apache/hudi/pull/4531#issuecomment-1013510325


   
   ## CI report:
   
   * 29076ce4ae979c5452de3af38d4535fb3768471d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5256)
   * d8313ea2d2d4e98e35214573022aa7562936a166 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5260)
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1013531885


   
   ## CI report:
   
   * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
   * 9743a6f98e62888c5e9cb575dd8cf0d38ade0319 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5259)
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1013505801


   
   ## CI report:
   
   * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
   * 46d45878e535f9517421a49958ebf13c54c6cca3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5240)
   * 9743a6f98e62888c5e9cb575dd8cf0d38ade0319 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5259)
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4516: [HUDI-3181][HUDI-1295] Enabling metadata table based index by default for tests

2022-01-14 Thread GitBox


hudi-bot commented on pull request #4516:
URL: https://github.com/apache/hudi/pull/4516#issuecomment-1013529426


   
   ## CI report:
   
   * 7e2ec46af829fabeb506d639c54057d32f3c89fa UNKNOWN
   * a35de627f6cfdd75200371d41960901d7bbfefb1 UNKNOWN
   * 1f4165928c3204c00f69ac90617c92184d991cdf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5071)
   * a412e6dc8a65e53ac74defe160f4846bfc2bf977 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4516: [HUDI-3181][HUDI-1295] Enabling metadata table based index by default for tests

2022-01-14 Thread GitBox


hudi-bot removed a comment on pull request #4516:
URL: https://github.com/apache/hudi/pull/4516#issuecomment-1009494630


   
   ## CI report:
   
   * 7e2ec46af829fabeb506d639c54057d32f3c89fa UNKNOWN
   * a35de627f6cfdd75200371d41960901d7bbfefb1 UNKNOWN
   * 1f4165928c3204c00f69ac90617c92184d991cdf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5071)
   
   




[GitHub] [hudi] umehrot2 commented on pull request #1946: [HUDI-1176]Upgrade tp log4j2

2022-01-14 Thread GitBox


umehrot2 commented on pull request #1946:
URL: https://github.com/apache/hudi/pull/1946#issuecomment-1013527978


   @hddong are you still going to work on this?
   
   If you don't have the bandwidth, I would be happy to drive this to 
completion and upgrade it further to Log4j 2.17.1, which fixes several of 
the major CVEs recently found in Log4j2. Please let me know.






[jira] [Updated] (HUDI-1629) Change partitioner abstraction to implement multiple strategies

2022-01-14 Thread Ethan Guo (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-1629:

Status: In Progress  (was: Open)

> Change partitioner abstraction to implement multiple strategies
> ---
>
> Key: HUDI-1629
> URL: https://issues.apache.org/jira/browse/HUDI-1629
> Project: Apache Hudi
> Issue Type: Task
> Reporter: satish
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.11.0
>
>
> The existing UpsertPartitioner considers only file sizing when assigning 
> inserts/updates. We also want to consider data locality and other factors, so 
> the partitioner abstraction should change to make it easy to implement and 
> plug in other strategies.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

