[GitHub] [hudi] schlichtanders commented on issue #6808: [SUPPORT] Cannot sync to spark embedded derby hive meta store (the default one)

2022-11-09 Thread GitBox


schlichtanders commented on issue #6808:
URL: https://github.com/apache/hudi/issues/6808#issuecomment-1309901120

   Thank you for the updated links. 
   
   They show how to connect Hudi to a Derby metastore via 
`org.apache.derby.jdbc.ClientDriver`, where the Derby metastore runs on a 
dedicated port on localhost. That is not what this ticket is about.
   
   This ticket is about connecting to an embedded in-memory Derby metastore, 
which you connect to via the `org.apache.derby.jdbc.EmbeddedDriver` driver. 
Please refer to my cleaned-up example 
[above](https://github.com/apache/hudi/issues/6808#issuecomment-1308766125), 
which works for PostgreSQL but not for Derby; a minimal sketch of such a setup 
is included below.
   
   I think this is a Hudi bug: no error is thrown, yet no sync happens when 
using `org.apache.derby.jdbc.EmbeddedDriver`. How can this support ticket be 
escalated to a bug?
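
   For reference, here is a minimal sketch of the embedded-Derby setup in question. It is 
an assumed reconstruction, not the exact reproducer from the linked comment: the base 
path, table name, and record fields are placeholders, while the `javax.jdo.option.*` 
metastore settings and `hoodie.datasource.hive_sync.*` options are the standard ones.

```scala
// Sketch only: a local Spark session backed by the embedded Derby metastore,
// with Hudi hive sync enabled against that same metastore.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Embedded Derby metastore (the Spark/Hive default), reached via EmbeddedDriver.
  .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.apache.derby.jdbc.EmbeddedDriver")
  .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:derby:memory:metastore_db;create=true")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

df.write.format("hudi")
  .option("hoodie.table.name", "derby_sync_test")                 // placeholder table name
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "id")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.mode", "hms")              // sync through the metastore client
  .option("hoodie.datasource.hive_sync.database", "default")
  .option("hoodie.datasource.hive_sync.table", "derby_sync_test")
  .mode(SaveMode.Overwrite)
  .save("/tmp/derby_sync_test")                                   // placeholder base path
```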


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7080: [RFC-64] New APIs to facilitate faster Query Engine integrations

2022-11-09 Thread GitBox


xiarixiaoyao commented on code in PR #7080:
URL: https://github.com/apache/hudi/pull/7080#discussion_r1018743271


##
rfc/rfc-64/rfc-64.md:
##
@@ -0,0 +1,509 @@
+
+
+# RFC-64: New Hudi Table Spec API for Query Integrations
+
+## Proposers
+
+- @codope
+- @alexeykudinkin
+
+## Approvers
+
+- @xiarixiao
+- @danny0405
+- @vinothchandar
+- @prasanna
+- @xushiyan
+
+## Status
+
+JIRA: [HUDI-4141](https://issues.apache.org/jira/browse/HUDI-4141)
+
+## Abstract
+
+In this RFC we propose a new set of higher-level Table Spec APIs that would allow
+us to up-level our current integration model with new Query Engines, enabling
+faster turnaround for such integrations.
+
+## Background
+
+[WIP] Plan
+
+- [x] High-level capabilities the API will be supporting
+- [ ] High-level overview of integration points of the query engines
+- [ ] Deep-dive into existing Spark/Flink/Trino integration points
+
+The figure below shows the read path in Spark and Presto/Trino query engines. At
+a high level, integration with these engines requires providing the following capabilities:
+
+1. **Listing**: this stage enumerates all of the data files w/in a table representing a particular
+snapshot of its state, which will subsequently be scanned to fetch the data requested by the target query. All query engines
+expect the output of this stage in the form of balanced file "splits", which allows evening out any skew in
+file sizes. This stage is where various _pruning_ techniques, such as partition pruning and file-level column
+statistics filtering, are applied.
+
+2. **Reading**: this stage transforms every file-split (identified in the previous stage) into an iterator over
+the records persisted in it. This stage is where the actual data shaping to suit the needs of a particular query takes
+place: records are projected into the schema expected by the query, corresponding filters are pushed down to
+reduce the amount of data fetched from storage, schema evolution is handled, and delayed operations are reconciled (merging/deleting).
+
+![](./read_path.png)
+
+## Implementation
+
+[WIP] Plan
+
+- [x] Components & API pseudo-code
+- [x] Diagram show-casing data flow and integration points
+- [ ] Example integration walk-through
+
+With this RFC, we aim to achieve the following:
+
+ - Up-level our current integration model with Query Engines, abstracting away Hudi's lower-level components,
+ instead providing simple, eloquent abstractions that provide the necessary capabilities (for listing, reading)
+
+ - Make sure Hudi's APIs are high-level enough to be useful in providing programmatic access to the data
+ for users who want to access it directly
+
+To achieve that, we propose to introduce two tiers of APIs:
+
+1. **Mid-level** (engine-friendly): these APIs will abstract away Hudi's internals behind
+simple and familiar concepts and components such as file splits, iterators,
+statistics, etc.
+
+2. **High-level** (user-friendly): these APIs will provide programmatic access to Hudi's capabilities,
+bypassing query engine interfaces (like SQL, Spark DS, etc.)
+
+Following a classic [layer-cake architecture](https://cs.uwaterloo.ca/~m2nagapp/courses/CS446/1195/Arch_Design_Activity/Layered.pdf)
+would allow us to abstract away the complexity of the lower-level components and APIs from
+Query Engines (on the *querying* side) as well as from end users:
+
+ - Mid-level components will internally leverage Hudi's core building blocks (such as
+`FileSystemView`, `FileIndex`, `MergedLogRecordScanner`, `FileReader`, etc.), while exposing APIs
+that provide the capabilities expected by the Query Engines (listed above)
+
+ - High-level components will be stacked on top of the mid-level ones and can be
+used directly by users to read from and write to tables.
+
+These APIs will be built bottom-up, initially focusing on the mid-level
+components, then the higher-level ones. Once the mid-level components are ready, we could
+demonstrate their utility by integrating with one of the query engines.
+
+In the initial phase of this project we will be focusing the effort on the 
read-side of the integration,
+shifting focus to the write-side in the subsequent phase.
+
+### Components & API
+
+Open Qs
+
+1. ~~Some of the models defined here already exist and might be representing
+   lower-level components. Do we expose them in higher-level APIs or do we
+   bifurcate and hide them as internal impl details (for ex, `HoodieFileGroup`
+   vs `HoodieInternalFileGroup`)?~~
+   1. *We'll be re-using existing models, abstracting them behind projected interfaces
+   that expose only the necessary functionality, hiding away the complexity of the whole impl*
+2. ~~What about Expressions? Wrapping is going to be hard because we need to
+   analyze expression structure which is not going to be possible if we wrap.~~
+   1. We will need to introduce our own Expression hierarchy supporting a 
*superset*
+   of the commonly used expressions. While this requir
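
To make the two proposed tiers concrete, here is a hypothetical sketch of what an engine-facing "mid-level" surface for the Listing and Reading capabilities could look like. The `HoodieTableSpec`, `FileSplit`, and `RecordIterator` names below are illustrative only and are not the API actually proposed in the RFC text.

```scala
import java.io.Closeable

// A balanced unit of work produced by the Listing stage after partition pruning
// and file-level column-statistics filtering.
final case class FileSplit(path: String, start: Long, length: Long)

// Iterator over one split's records after projection, filter push-down, schema
// evolution handling, and reconciliation of delayed operations (merge/delete).
trait RecordIterator[T] extends Iterator[T] with Closeable

// Hypothetical mid-level, engine-friendly entry point.
trait HoodieTableSpec {
  // Listing: enumerate the splits of a given snapshot, applying partition pruning.
  def listSplits(snapshotTime: String, partitionFilter: String => Boolean): Seq[FileSplit]

  // Reading: turn a split into an iterator over its records, projected to the
  // requested columns.
  def read[T](split: FileSplit, requestedColumns: Seq[String]): RecordIterator[T]
}
```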

[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7080: [RFC-64] New APIs to facilitate faster Query Engine integrations

2022-11-09 Thread GitBox


xiarixiaoyao commented on code in PR #7080:
URL: https://github.com/apache/hudi/pull/7080#discussion_r1018750344



[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7080: [RFC-64] New APIs to facilitate faster Query Engine integrations

2022-11-09 Thread GitBox


xiarixiaoyao commented on code in PR #7080:
URL: https://github.com/apache/hudi/pull/7080#discussion_r1018748548



[GitHub] [hudi] hudi-bot commented on pull request #7151: [MINOR] Performance improvement of flink ITs with reused miniCluster

2022-11-09 Thread GitBox


hudi-bot commented on PR #7151:
URL: https://github.com/apache/hudi/pull/7151#issuecomment-1309888204

   
   ## CI report:
   
   * db207361ce3e01c7b39153f3156a6e19d8075212 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12918)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7080: [RFC-64] New APIs to facilitate faster Query Engine integrations

2022-11-09 Thread GitBox


xiarixiaoyao commented on code in PR #7080:
URL: https://github.com/apache/hudi/pull/7080#discussion_r1018740543


##
rfc/rfc-64/rfc-64.md:
##
@@ -0,0 +1,509 @@
+
+
+# RFC-64: New Hudi Table Spec API for Query Integrations
+
+## Proposers
+
+- @codope
+- @alexeykudinkin
+
+## Approvers
+
+- @xiarixiao
+- @danny0405
+- @vinothchandar
+- @prasanna
+- @xushiyan
+
+## Status
+
+JIRA: [HUDI-4141](https://issues.apache.org/jira/browse/HUDI-4141)
+
+## Abstract
+
+In this RFC we propose new set of higher-level Table Spec APIs that would allow
+us to up-level our current integration model with new Query Engines, enabling
+faster turnaround for such integrations.
+
+## Background
+
+[WIP] Plan
+
+- [x] High-level capabilities API will be supporting
+- [ ] High-level overview of integration points of the query engines
+- [ ] Deep-dive into existing Spark/Flink/Trino integration points
+
+The figure below shows the read path in Spark and Presto/Trino query engines. 
At
+a high level, integration with these engines require providing of the 
following capabilities:
+
+1. **Listing**: this stage requires enumerating of all of the data files w/in 
a table representing particular 
+snapshot of its state, that will be subsequently scanned to fetch the date 
requested by the target query. All query engines
+expect output of this stage in the form of balanced out file "splits" that way 
allowing to even out any skew in 
+file sizes. This stage is where various _pruning_ techniques such as 
partition-pruning, file-level column 
+statistics filtering are applied.
+
+2. **Reading**: this stage transforms every file-split (identified in a 
previous stage) into an iterator over 
+the records persisted in it. This stage is where actual data shaping to suit 
the needs of particular query takes 
+place: records are projected into a schema expected by the query, 
corresponding filters are pushed down to 
+reduce amount of data fetched from storage, schema evolution is handled, 
delayed operations are reconciled (merging/deleting) 
+
+![](./read_path.png)
+
+## Implementation
+
+[WIP] Plan
+
+- [x] Components & API pseudo-code
+- [x] Diagram show-casing data flow and integration points
+- [ ] Example integration walk-through
+
+With this RFC, we aim to achieve following:
+
+ - Up-level our current integration model with Query Engines, abstracting away 
Hudi's lower-level components,
+ instead providing simple and eloquent abstractions providing necessary 
capabilities (for listing, reading)
+
+ - Make sure Hudi's APIs are high-level enough to be useful in providing 
programmatic access to the data 
+ for users willing to access it directly
+
+To achieve that, we propose to introduce two tiers of APIs:
+
+1. **Mid-level** (engine-friendly): these APIs will be abstracting away Hudi's 
internals, behind
+simple and familiar concepts and components such as file splits, iterators,
+statistics, etc.
+
+2. **High-level** (user-friendly): these APIs will provide programmatic access 
to Hudi's capabilities
+bypassing query engine interfaces (like SQL, Spark DS, etc)
+
+Following classic [layer-cake 
architecture](https://cs.uwaterloo.ca/~m2nagapp/courses/CS446/1195/Arch_Design_Activity/Layered.pdf)
 
+would allow us to abstract away complexity of the lower levels components and 
APIs from
+Query Engines (on the *querying* side) as well as from the end users:
+
+ - Mid-level components will internally leverage Hudi's core building blocks 
(such as
+`FileSystemView`, `FileIndex`, `MergedLogRecordScanner`, `FileReader` etc), 
while exposing APIs
+providing capabilities expected by the Query Engines (listed above)
+
+ - High-level components will be stacked on top of mid-level ones and can be 
used directly by users to
+read and write to tables.
+
+These APIs will be built bottoms-up, initially focusing on the mid-level 
+components, then higher-level ones. Once the mid-level components are ready, 
we could
+demonstrate their utility by integrating with one of the query engines.
+
+In the initial phase of this project we will be focusing the effort on the 
read-side of the integration,
+shifting focus to the write-side in the subsequent phase.
+
+### Components & API
+
+Open Qs
+
+1. ~~Some of the models defined here do already exist and might be representing
+   lower-level components. Do we expose them in a higher level APIs or do we
+   bifurcate and hide them as internal impl details (for ex, `HoodieFileGroup`
+   vs `HoodieInternalFileGroup`)?~~
+   1. *We'll be re-using existing models abstracting them behind projected 
interfaces 
+   exposing only necessary functionality, hiding away complexity of the whole 
impl*
+2. ~~What about Expressions? Wrapping is going to be hard because we need to
+   analyze expression structure which is not going to be possible if we wrap.~~
+   1. We will need to introduce our own Expression hierarchy supporting a 
*superset*
+   of the commonly used expressions. While this requir

[GitHub] [hudi] albericgenius commented on pull request #7096: [MINOR] Fix OverwriteWithLatestAvroPayload full class name

2022-11-09 Thread GitBox


albericgenius commented on PR #7096:
URL: https://github.com/apache/hudi/pull/7096#issuecomment-1309840920

   @xushiyan sorry to bother you, could you help to assign the JIRA contributor 
permission to me (cnuliuwei)?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7003: [minor] add more test for rfc46

2022-11-09 Thread GitBox


hudi-bot commented on PR #7003:
URL: https://github.com/apache/hudi/pull/7003#issuecomment-1309827449

   
   ## CI report:
   
   * b8ad950666cb87456151c70688b5eb7ad423955f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12917)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7167: [HUDI-5094] Remove partition fields before transform bytes to avro,if enable DROP_PARTITION_COLUMNS.

2022-11-09 Thread GitBox


hudi-bot commented on PR #7167:
URL: https://github.com/apache/hudi/pull/7167#issuecomment-1309827733

   
   ## CI report:
   
   * 6b165aec634812ba8d6f4a55d0dfb8578031d25c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12916)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] eric9204 commented on issue #6966: [SUPPORT]HoodieWriteHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=id308723 partitionPath=202210141643}, currentLocation='null',

2022-11-09 Thread GitBox


eric9204 commented on issue #6966:
URL: https://github.com/apache/hudi/issues/6966#issuecomment-1309826887

   #7167 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4812) Lazy partition listing and file groups fetching in Spark Query

2022-11-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4812:
--
Story Points: 16  (was: 2)

> Lazy partition listing and file groups fetching in Spark Query
> --
>
> Key: HUDI-4812
> URL: https://issues.apache.org/jira/browse/HUDI-4812
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yuwei Xiao
>Assignee: Yuwei Xiao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> In the current Spark query implementation, the FileIndex will refresh and load 
> all file groups into its cache in order to serve subsequent queries.
>  
> For a large table with many partitions, this may introduce significant overhead 
> during initialization. Meanwhile, the query itself may come with a partition 
> filter, so loading all file groups is unnecessary.
>  
> So to optimize, the whole refresh logic will become lazy, and the actual work 
> will be carried out only after the partition filter has been applied.
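
As an illustration of the intended laziness, here is a hedged sketch (not Hudi's actual FileIndex API; the `LazyFileIndex`, `listPartitions`, and `listFileGroups` names are placeholders) of partition-filtered, on-demand listing with per-partition caching:
{code:scala}
// Sketch: partitions are filtered first, and file groups are listed and cached
// only for the partitions that survive the filter.
class LazyFileIndex(listPartitions: () => Seq[String],
                    listFileGroups: String => Seq[String]) {

  private val cache = scala.collection.mutable.Map.empty[String, Seq[String]]

  def filesFor(partitionFilter: String => Boolean): Seq[String] =
    listPartitions()
      .filter(partitionFilter)                                        // prune partitions first
      .flatMap(p => cache.getOrElseUpdate(p, listFileGroups(p)))      // list lazily, on demand
}
{code}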



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] eric9204 commented on pull request #7063: [HUDI-5094] Remove partition fields before transform bytes to avro,if enable DROP_PARTITION_COLUMNS

2022-11-09 Thread GitBox


eric9204 commented on PR #7063:
URL: https://github.com/apache/hudi/pull/7063#issuecomment-1309807382

   @YannByron I have removed the unnecessary changes and redundant judgment conditions 
and added a UT for the logic changes.
   If you are available, please review this PR instead of 
https://github.com/apache/hudi/pull/7167.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5188) ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo

2022-11-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5188:
--
Description: 
Getting the following exception when running trivial Spark DS workloads:
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceWriteOptions.{PRECOMBINE_FIELD, 
RECORDKEY_FIELD}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark: SparkSession

import spark.implicits._

val basePath = "/tmp/test"
val writerOpts: Map[String, String] = Map(
  "hoodie.table.name" -> "test",
  "hoodie.table.type" -> "COPY_ON_WRITE",
  PRECOMBINE_FIELD.key() -> "id",
  RECORDKEY_FIELD.key() -> "id",
  DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "dt"
)

val firstBatchDF = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 1,
  s"2021/03/0${i % 2 + 1}", s"${i % 3}")).toDF("id", "name", "price", 
"version", "dt", "hh")

firstBatchDF.write.
  options(writerOpts).
  option("hoodie.parquet.compression.codec", "gzip").
  format("hudi").
  mode(SaveMode.Overwrite).
  save(basePath) {code}
{code:java}
java.lang.ClassCastException: class 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo cannot 
be cast to class 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo 
(org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo is in 
unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader 
@3b1895e; 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo is in 
unnamed module of loader 
scala.reflect.internal.util.ScalaClassLoader$URLClassLoader @58882a93)   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.listAllPartitions(HoodieBackedTableMetadataWriter.java:645)
   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initialCommit(HoodieBackedTableMetadataWriter.java:1070)
   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeFromFilesystem(HoodieBackedTableMetadataWriter.java:560)
   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeIfNeeded(HoodieBackedTableMetadataWriter.java:393)
   at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.initialize(SparkHoodieBackedTableMetadataWriter.java:120)
   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.(HoodieBackedTableMetadataWriter.java:173)
   at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.(SparkHoodieBackedTableMetadataWriter.java:89)
   at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:75)
   at 
org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:514)
   at 
org.apache.hudi.client.SparkRDDWriteClient.doInitTable(SparkRDDWriteClient.java:499)
   at 
org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1548)
   at 
org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1580)
   at 
org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156) 
  at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206) 
  at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:331)   
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:148)   at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
   at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)   at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
   at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
   at 
org.apache.spark.sql.ca
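
The trace above is truncated; the root cause is already visible in the exception message: the same class name has been loaded by two different classloaders (ChildFirstURLClassLoader vs. the Scala REPL's URLClassLoader), and JVM class identity is (name, defining loader). A hedged, standalone sketch of that effect (the jar path and class name are placeholders, unrelated to Hudi code):
{code:scala}
import java.net.{URL, URLClassLoader}

// Loading the same jar through two independent classloaders yields two distinct
// Class objects with the same fully-qualified name; instances of one are not
// assignable to the other, producing "class X cannot be cast to class X".
val jar = new URL("file:/path/to/some.jar")          // placeholder jar containing com.example.Foo
val loaderA = new URLClassLoader(Array(jar), null)   // null parent: no delegation to the app loader
val loaderB = new URLClassLoader(Array(jar), null)

val fooA = loaderA.loadClass("com.example.Foo")      // hypothetical class name
val fooB = loaderB.loadClass("com.example.Foo")

fooA.getName == fooB.getName                         // true
fooA == fooB                                         // false
fooB.isAssignableFrom(fooA)                          // false -> a cast between them throws ClassCastException
{code}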

[jira] [Updated] (HUDI-5188) ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo

2022-11-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5188:
--
Priority: Blocker  (was: Major)

> ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo 
> cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo
> -
>
> Key: HUDI-5188
> URL: https://issues.apache.org/jira/browse/HUDI-5188
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> After [#7036|https://github.com/apache/hudi/pull/7036/files], we started to get 
> the following exception when running trivial Spark DS workloads:
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.DataSourceWriteOptions.{PRECOMBINE_FIELD, 
> RECORDKEY_FIELD}
> import org.apache.spark.sql.{SaveMode, SparkSession}
> val spark: SparkSession
> import spark.implicits._
> val basePath = "/tmp/test"
> val writerOpts: Map[String, String] = Map(
>   "hoodie.table.name" -> "test",
>   "hoodie.table.type" -> "COPY_ON_WRITE",
>   PRECOMBINE_FIELD.key() -> "id",
>   RECORDKEY_FIELD.key() -> "id",
>   DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "dt"
> )
> val firstBatchDF = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 1,
>   s"2021/03/0${i % 2 + 1}", s"${i % 3}")).toDF("id", "name", "price", 
> "version", "dt", "hh")
> firstBatchDF.write.
>   options(writerOpts).
>   option("hoodie.parquet.compression.codec", "gzip").
>   format("hudi").
>   mode(SaveMode.Overwrite).
>   save(basePath) {code}
> {code:java}
> java.lang.ClassCastException: class 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo cannot 
> be cast to class 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo 
> (org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo is in 
> unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader 
> @3b1895e; 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo is in 
> unnamed module of loader 
> scala.reflect.internal.util.ScalaClassLoader$URLClassLoader @58882a93)   at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.listAllPartitions(HoodieBackedTableMetadataWriter.java:645)
>    at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initialCommit(HoodieBackedTableMetadataWriter.java:1070)
>    at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeFromFilesystem(HoodieBackedTableMetadataWriter.java:560)
>    at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeIfNeeded(HoodieBackedTableMetadataWriter.java:393)
>    at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.initialize(SparkHoodieBackedTableMetadataWriter.java:120)
>    at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.(HoodieBackedTableMetadataWriter.java:173)
>    at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.(SparkHoodieBackedTableMetadataWriter.java:89)
>    at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:75)
>    at 
> org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:514)
>    at 
> org.apache.hudi.client.SparkRDDWriteClient.doInitTable(SparkRDDWriteClient.java:499)
>    at 
> org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1548)
>    at 
> org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1580)
>    at 
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156)
>    at 
> org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)   
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:331)   
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:148)   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>    at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>    at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>    at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>    at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>    at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>    at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>    at 
> org.apache.spark.sql.execut

[jira] [Updated] (HUDI-5188) ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo

2022-11-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5188:
--
Description: 
After [#7036|https://github.com/apache/hudi/pull/7036/files], we started to get 
the following exception when running trivial Spark DS workloads:
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceWriteOptions.{PRECOMBINE_FIELD, 
RECORDKEY_FIELD}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark: SparkSession

import spark.implicits._

val basePath = "/tmp/test"
val writerOpts: Map[String, String] = Map(
  "hoodie.table.name" -> "test",
  "hoodie.table.type" -> "COPY_ON_WRITE",
  PRECOMBINE_FIELD.key() -> "id",
  RECORDKEY_FIELD.key() -> "id",
  DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "dt"
)

val firstBatchDF = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 1,
  s"2021/03/0${i % 2 + 1}", s"${i % 3}")).toDF("id", "name", "price", 
"version", "dt", "hh")

firstBatchDF.write.
  options(writerOpts).
  option("hoodie.parquet.compression.codec", "gzip").
  format("hudi").
  mode(SaveMode.Overwrite).
  save(basePath) {code}
{code:java}
java.lang.ClassCastException: class 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo cannot 
be cast to class 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo 
(org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo is in 
unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader 
@3b1895e; 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo is in 
unnamed module of loader 
scala.reflect.internal.util.ScalaClassLoader$URLClassLoader @58882a93)   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.listAllPartitions(HoodieBackedTableMetadataWriter.java:645)
   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initialCommit(HoodieBackedTableMetadataWriter.java:1070)
   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeFromFilesystem(HoodieBackedTableMetadataWriter.java:560)
   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeIfNeeded(HoodieBackedTableMetadataWriter.java:393)
   at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.initialize(SparkHoodieBackedTableMetadataWriter.java:120)
   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.(HoodieBackedTableMetadataWriter.java:173)
   at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.(SparkHoodieBackedTableMetadataWriter.java:89)
   at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:75)
   at 
org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:514)
   at 
org.apache.hudi.client.SparkRDDWriteClient.doInitTable(SparkRDDWriteClient.java:499)
   at 
org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1548)
   at 
org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1580)
   at 
org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156) 
  at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206) 
  at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:331)   
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:148)   at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
   at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)   at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
   at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
   at 
org.apache.spark.sql.catalyst.trees.TreeNode.trans

[jira] [Created] (HUDI-5188) ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo

2022-11-09 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5188:
-

 Summary: ClassCastException: class 
HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class 
HoodieBackedTableMetadataWriter$DirectoryInfo
 Key: HUDI-5188
 URL: https://issues.apache.org/jira/browse/HUDI-5188
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin


After XXX, we started to get the following exception when running trivial Spark DS 
workloads:
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceWriteOptions.{PRECOMBINE_FIELD, 
RECORDKEY_FIELD}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark: SparkSession

import spark.implicits._

val basePath = "/tmp/test"
val writerOpts: Map[String, String] = Map(
  "hoodie.table.name" -> "test",
  "hoodie.table.type" -> "COPY_ON_WRITE",
  PRECOMBINE_FIELD.key() -> "id",
  RECORDKEY_FIELD.key() -> "id",
  DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "dt"
)

val firstBatchDF = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 1,
  s"2021/03/0${i % 2 + 1}", s"${i % 3}")).toDF("id", "name", "price", 
"version", "dt", "hh")

firstBatchDF.write.
  options(writerOpts).
  option("hoodie.parquet.compression.codec", "gzip").
  format("hudi").
  mode(SaveMode.Overwrite).
  save(basePath) {code}
{code:java}
java.lang.ClassCastException: class 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo cannot 
be cast to class 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo 
(org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo is in 
unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader 
@3b1895e; 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo is in 
unnamed module of loader 
scala.reflect.internal.util.ScalaClassLoader$URLClassLoader @58882a93)   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.listAllPartitions(HoodieBackedTableMetadataWriter.java:645)
   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initialCommit(HoodieBackedTableMetadataWriter.java:1070)
   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeFromFilesystem(HoodieBackedTableMetadataWriter.java:560)
   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeIfNeeded(HoodieBackedTableMetadataWriter.java:393)
   at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.initialize(SparkHoodieBackedTableMetadataWriter.java:120)
   at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.(HoodieBackedTableMetadataWriter.java:173)
   at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.(SparkHoodieBackedTableMetadataWriter.java:89)
   at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:75)
   at 
org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:514)
   at 
org.apache.hudi.client.SparkRDDWriteClient.doInitTable(SparkRDDWriteClient.java:499)
   at 
org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1548)
   at 
org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1580)
   at 
org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156) 
  at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206) 
  at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:331)   
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:148)   at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
   at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)   at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
   at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWith

[jira] [Updated] (HUDI-5188) ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo

2022-11-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5188:
--
Fix Version/s: 0.13.0

> ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo 
> cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo
> -
>
> Key: HUDI-5188
> URL: https://issues.apache.org/jira/browse/HUDI-5188
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Priority: Major
> Fix For: 0.13.0
>
>
> After XXX, we started to get the following exception when running trivial Spark DS 
> workloads:
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.DataSourceWriteOptions.{PRECOMBINE_FIELD, 
> RECORDKEY_FIELD}
> import org.apache.spark.sql.{SaveMode, SparkSession}
> val spark: SparkSession
> import spark.implicits._
> val basePath = "/tmp/test"
> val writerOpts: Map[String, String] = Map(
>   "hoodie.table.name" -> "test",
>   "hoodie.table.type" -> "COPY_ON_WRITE",
>   PRECOMBINE_FIELD.key() -> "id",
>   RECORDKEY_FIELD.key() -> "id",
>   DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "dt"
> )
> val firstBatchDF = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 1,
>   s"2021/03/0${i % 2 + 1}", s"${i % 3}")).toDF("id", "name", "price", 
> "version", "dt", "hh")
> firstBatchDF.write.
>   options(writerOpts).
>   option("hoodie.parquet.compression.codec", "gzip").
>   format("hudi").
>   mode(SaveMode.Overwrite).
>   save(basePath) {code}
> {code:java}
> java.lang.ClassCastException: class 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo cannot 
> be cast to class 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo 
> (org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo is in 
> unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader 
> @3b1895e; 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo is in 
> unnamed module of loader 
> scala.reflect.internal.util.ScalaClassLoader$URLClassLoader @58882a93)   at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.listAllPartitions(HoodieBackedTableMetadataWriter.java:645)
>    at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initialCommit(HoodieBackedTableMetadataWriter.java:1070)
>    at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeFromFilesystem(HoodieBackedTableMetadataWriter.java:560)
>    at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeIfNeeded(HoodieBackedTableMetadataWriter.java:393)
>    at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.initialize(SparkHoodieBackedTableMetadataWriter.java:120)
>    at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.(HoodieBackedTableMetadataWriter.java:173)
>    at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.(SparkHoodieBackedTableMetadataWriter.java:89)
>    at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:75)
>    at 
> org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:514)
>    at 
> org.apache.hudi.client.SparkRDDWriteClient.doInitTable(SparkRDDWriteClient.java:499)
>    at 
> org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1548)
>    at 
> org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1580)
>    at 
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156)
>    at 
> org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)   
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:331)   
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:148)   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>    at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>    at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>    at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>    at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>    at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>    at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>    at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.sc
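
For reference, the failure above is a classloader-duplication problem: JVM class identity is the pair (class name, defining classloader), so the two copies of `HoodieBackedTableMetadataWriter$DirectoryInfo` loaded by `ChildFirstURLClassLoader` and the Scala REPL's classloader are distinct classes that cannot be cast to one another. A minimal, self-contained sketch of the same failure mode (the jar path and class name below are placeholders, not taken from this report):

{code:java}
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;

public class ClassLoaderCastDemo {
  public static void main(String[] args) throws Exception {
    // Placeholder jar containing a class named "demo.Widget"; both values are hypothetical.
    URL jar = Paths.get("/tmp/widget.jar").toUri().toURL();

    // Two isolated classloaders, each defining its own copy of demo.Widget.
    try (URLClassLoader loaderA = new URLClassLoader(new URL[]{jar}, null);
         URLClassLoader loaderB = new URLClassLoader(new URL[]{jar}, null)) {
      Class<?> widgetA = loaderA.loadClass("demo.Widget");
      Class<?> widgetB = loaderB.loadClass("demo.Widget");

      System.out.println(widgetA == widgetB); // false: same name, different defining loaders

      Object instanceA = widgetA.getDeclaredConstructor().newInstance();
      widgetB.cast(instanceA);                // throws ClassCastException, as in the report above
    }
  }
}
{code}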

[GitHub] [hudi] zhangyue19921010 commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

2022-11-09 Thread GitBox


zhangyue19921010 commented on PR #5416:
URL: https://github.com/apache/hudi/pull/5416#issuecomment-1309800894

   > Great job buddy for patiently addressing all comments !
   
   Thanks a lot for your help! @alexeykudinkin and @nsivabalan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-5145) Remove HDFS from DeltaStreamer UT/FT

2022-11-09 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-5145.

Resolution: Done

> Remove HDFS from DeltaStreamer UT/FT
> 
>
> Key: HUDI-5145
> URL: https://issues.apache.org/jira/browse/HUDI-5145
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7039: [HUDI-5080] Fixing unpersist to consider only rdds pertaining to current write operation

2022-11-09 Thread GitBox


hudi-bot commented on PR #7039:
URL: https://github.com/apache/hudi/pull/7039#issuecomment-1309782856

   
   ## CI report:
   
   * dc7c2bdefdb2a84b27b40751714a31604e2931eb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12911)
 
   * fe186d03969c6d1b4a62d9c506585a8b2ea05dd0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12920)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7039: [HUDI-5080] Fixing unpersist to consider only rdds pertaining to current write operation

2022-11-09 Thread GitBox


hudi-bot commented on PR #7039:
URL: https://github.com/apache/hudi/pull/7039#issuecomment-1309778505

   
   ## CI report:
   
   * dc7c2bdefdb2a84b27b40751714a31604e2931eb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12911)
 
   * fe186d03969c6d1b4a62d9c506585a8b2ea05dd0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7169: [HUDI-5186] Parallelism does not take effect when hoodie.combine.before.upsert/insert false

2022-11-09 Thread GitBox


hudi-bot commented on PR #7169:
URL: https://github.com/apache/hudi/pull/7169#issuecomment-1309774543

   
   ## CI report:
   
   * f9e08f5d14106a18fe59bf752077ab1043595e03 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12899)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12905)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12915)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests (#7171)

2022-11-09 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 7fe35fb895 [HUDI-5145] Avoid starting HDFS in hudi-utilities tests 
(#7171)
7fe35fb895 is described below

commit 7fe35fb895fb58f0ea599974defd7ffe310a964e
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Thu Nov 10 12:23:24 2022 +0800

[HUDI-5145] Avoid starting HDFS in hudi-utilities tests (#7171)
---
 .../testutils/minicluster/HdfsTestService.java |   8 +-
 .../TestDFSHoodieTestSuiteWriterAdapter.java   |   8 +-
 .../integ/testsuite/TestFileDeltaInputWriter.java  |  10 +-
 .../testsuite/job/TestHoodieTestSuiteJob.java  |  74 ++---
 .../reader/TestDFSAvroDeltaInputReader.java|  10 +-
 .../reader/TestDFSHoodieDatasetInputReader.java|   6 +-
 .../functional/HoodieDeltaStreamerTestBase.java|  10 +-
 .../functional/TestHoodieDeltaStreamer.java| 314 ++---
 .../TestHoodieMultiTableDeltaStreamer.java |  40 +--
 .../hudi/utilities/sources/TestAvroDFSSource.java  |   4 +-
 .../hudi/utilities/sources/TestCsvDFSSource.java   |   4 +-
 .../utilities/sources/TestGcsEventsSource.java |   2 +-
 .../hudi/utilities/sources/TestJsonDFSSource.java  |   4 +-
 .../utilities/sources/TestParquetDFSSource.java|   2 +-
 .../hudi/utilities/sources/TestS3EventsSource.java |   4 +-
 .../hudi/utilities/sources/TestSqlSource.java  |  11 +-
 .../debezium/TestAbstractDebeziumSource.java   |   2 +-
 .../utilities/testutils/UtilitiesTestBase.java |  61 ++--
 .../AbstractCloudObjectsSourceTestBase.java|   2 +-
 .../sources/AbstractDFSSourceTestBase.java |   6 +-
 .../transform/TestSqlFileBasedTransformer.java |  23 +-
 21 files changed, 313 insertions(+), 292 deletions(-)

diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java
 
b/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java
index 0766c61c67..727e1e4db6 100644
--- 
a/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java
@@ -53,8 +53,12 @@ public class HdfsTestService {
   private MiniDFSCluster miniDfsCluster;
 
   public HdfsTestService() throws IOException {
-hadoopConf = new Configuration();
-workDir = Files.createTempDirectory("temp").toAbsolutePath().toString();
+this(new Configuration());
+  }
+
+  public HdfsTestService(Configuration hadoopConf) throws IOException {
+this.hadoopConf = hadoopConf;
+this.workDir = 
Files.createTempDirectory("temp").toAbsolutePath().toString();
   }
 
   public Configuration getHadoopConf() {
diff --git 
a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java
 
b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java
index 2b69a319a5..9c21ee6bd4 100644
--- 
a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java
+++ 
b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java
@@ -65,7 +65,7 @@ public class TestDFSHoodieTestSuiteWriterAdapter extends 
UtilitiesTestBase {
 
   @BeforeAll
   public static void initClass() throws Exception {
-UtilitiesTestBase.initTestServices(false, false);
+UtilitiesTestBase.initTestServices(true, false, false);
   }
 
   @AfterAll
@@ -131,15 +131,15 @@ public class TestDFSHoodieTestSuiteWriterAdapter extends 
UtilitiesTestBase {
   // TODO(HUDI-3668): Fix this test
   public void testDFSWorkloadSinkWithMultipleFilesFunctional() throws 
IOException {
 DeltaConfig dfsSinkConfig = new DFSDeltaConfig(DeltaOutputMode.DFS, 
DeltaInputType.AVRO,
-new SerializableConfiguration(jsc.hadoopConfiguration()), dfsBasePath, 
dfsBasePath,
+new SerializableConfiguration(jsc.hadoopConfiguration()), basePath, 
basePath,
 schemaProvider.getSourceSchema().toString(), 10240L, 
jsc.defaultParallelism(), false, false);
 DeltaWriterAdapter dfsDeltaWriterAdapter = 
DeltaWriterFactory
 .getDeltaWriterAdapter(dfsSinkConfig, 1);
 FlexibleSchemaRecordGenerationIterator itr = new 
FlexibleSchemaRecordGenerationIterator(1000,
 schemaProvider.getSourceSchema().toString());
 dfsDeltaWriterAdapter.write(itr);
-FileSystem fs = FSUtils.getFs(dfsBasePath, jsc.hadoopConfiguration());
-FileStatus[] fileStatuses = fs.listStatus(new Path(dfsBasePath));
+FileSystem fs = FSUtils.getFs(basePath, jsc.hadoopConfiguration());
+FileStatus[] fileStatuses = fs.listStatus(new Path(basePath));
 // Since maxFileSize was 10240L and we produced 1K records each close to 
1K size, we 

[GitHub] [hudi] xushiyan merged pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests

2022-11-09 Thread GitBox


xushiyan merged PR #7171:
URL: https://github.com/apache/hudi/pull/7171


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests

2022-11-09 Thread GitBox


xushiyan commented on PR #7171:
URL: https://github.com/apache/hudi/pull/7171#issuecomment-1309749162

   The test run time shows a good improvement.
   
   **In the master version**
   
   UT
   ```
   [INFO] hudi-utilities_2.11  SUCCESS [08:32 min]
   ```
   FT
   ```
   [INFO] hudi-utilities_2.11  SUCCESS [59:44 min]
   ```
   
   **With this patch**
   
   UT
   ```
   [INFO] hudi-utilities_2.11  SUCCESS [07:14 min]
   ```
   FT
   ```
   [INFO] hudi-utilities_2.11  SUCCESS [38:23 min]
   ```
   
   Overall, that is roughly a 30% drop in total test time (about 68 minutes of UT + FT down to about 46 minutes).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution

2022-11-09 Thread GitBox


danny0405 commented on code in PR #5830:
URL: https://github.com/apache/hudi/pull/5830#discussion_r1018637325


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -817,6 +805,8 @@ public abstract static class Builder {
 
 public abstract Builder withReaderSchema(Schema schema);
 
+public abstract Builder withReaderSchema(InternalSchema internalSchema);
+

Review Comment:
   We should not keep two `withReaderSchema` methods here; keep only one of them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution

2022-11-09 Thread GitBox


danny0405 commented on code in PR #5830:
URL: https://github.com/apache/hudi/pull/5830#discussion_r1018636538


##
hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java:
##
@@ -91,6 +91,11 @@ public static InternalSchema convert(Schema schema) {
 return new InternalSchema(fields);
   }
 
+  /** Convert an avro schema into internalSchema with given versionId. */
+  public static InternalSchema convertToEmpty(Schema schema) {
+return new InternalSchema(InternalSchema.EMPTY_SCHEMA_VERSION_ID, schema);

Review Comment:
   This is also confusing: an internal schema with an 'empty' version id that still holds an Avro schema internally. Please clarify it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution

2022-11-09 Thread GitBox


danny0405 commented on code in PR #5830:
URL: https://github.com/apache/hudi/pull/5830#discussion_r1018635942


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java:
##
@@ -141,4 +139,8 @@ public HoodieLogBlock prev() throws IOException {
 return this.currentReader.prev();
   }
 
+  private Schema getReaderSchema() {
+boolean useWriterSchema = !readerSchema.isEmptySchema();
+return useWriterSchema ? null : readerSchema.getAvroSchema();

Review Comment:
   This is so confusing: why do we use the writer schema if the reader schema is not empty? And what does an empty internal schema mean?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution

2022-11-09 Thread GitBox


danny0405 commented on PR #5830:
URL: https://github.com/apache/hudi/pull/5830#issuecomment-1309745375

   [3981.patch.zip](https://github.com/apache/hudi/files/9977268/3981.patch.zip)
   Thanks for the contribution. I have reviewed part of it and left a local patch here along with some comments ~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests

2022-11-09 Thread GitBox


xushiyan commented on code in PR #7171:
URL: https://github.com/apache/hudi/pull/7171#discussion_r1018635258


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java:
##
@@ -250,7 +250,7 @@ static HoodieDeltaStreamer.Config makeConfig(String 
basePath, WriteOperationType
   cfg.operation = op;
   cfg.enableHiveSync = enableHiveSync;
   cfg.sourceOrderingField = sourceOrderingField;
-  cfg.propsFilePath = dfsBasePath + "/" + propsFilename;
+  cfg.propsFilePath = UtilitiesTestBase.basePath + "/" + propsFilename;

Review Comment:
   Because there is another local variable with the same name; I don't want to rename more variables.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7151: [MINOR] Performance improvement of flink ITs with reused miniCluster

2022-11-09 Thread GitBox


hudi-bot commented on PR #7151:
URL: https://github.com/apache/hudi/pull/7151#issuecomment-1309731298

   
   ## CI report:
   
   * fb4b5616278cdd662ccee6add7bbb7b0684554ac Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12815)
 
   * db207361ce3e01c7b39153f3156a6e19d8075212 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12918)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7003: [minor] add more test for rfc46

2022-11-09 Thread GitBox


hudi-bot commented on PR #7003:
URL: https://github.com/apache/hudi/pull/7003#issuecomment-1309731121

   
   ## CI report:
   
   * da89f1a57bae167a6474092cc07ca0880e4028b8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12910)
 
   * b8ad950666cb87456151c70688b5eb7ad423955f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12917)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests

2022-11-09 Thread GitBox


nsivabalan commented on code in PR #7171:
URL: https://github.com/apache/hudi/pull/7171#discussion_r1018623450


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java:
##
@@ -250,7 +250,7 @@ static HoodieDeltaStreamer.Config makeConfig(String 
basePath, WriteOperationType
   cfg.operation = op;
   cfg.enableHiveSync = enableHiveSync;
   cfg.sourceOrderingField = sourceOrderingField;
-  cfg.propsFilePath = dfsBasePath + "/" + propsFilename;
+  cfg.propsFilePath = UtilitiesTestBase.basePath + "/" + propsFilename;

Review Comment:
   `basePath` is protected in UtilitiesTestBase and TestHoodieDeltaStreamer extends from it, so why prefix it with the class name?



##
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java:
##
@@ -272,7 +272,7 @@ static HoodieDeltaStreamer.Config 
makeConfigForHudiIncrSrc(String srcBasePath, S
   cfg.sourceClassName = HoodieIncrSource.class.getName();
   cfg.operation = op;
   cfg.sourceOrderingField = "timestamp";
-  cfg.propsFilePath = dfsBasePath + "/test-downstream-source.properties";
+  cfg.propsFilePath = UtilitiesTestBase.basePath + 
"/test-downstream-source.properties";

Review Comment:
   same here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7151: [MINOR] Performance improvement of flink ITs with reused miniCluster

2022-11-09 Thread GitBox


hudi-bot commented on PR #7151:
URL: https://github.com/apache/hudi/pull/7151#issuecomment-1309727775

   
   ## CI report:
   
   * fb4b5616278cdd662ccee6add7bbb7b0684554ac Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12815)
 
   * db207361ce3e01c7b39153f3156a6e19d8075212 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7003: [minor] add more test for rfc46

2022-11-09 Thread GitBox


hudi-bot commented on PR #7003:
URL: https://github.com/apache/hudi/pull/7003#issuecomment-1309727612

   
   ## CI report:
   
   * da89f1a57bae167a6474092cc07ca0880e4028b8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12910)
 
   * b8ad950666cb87456151c70688b5eb7ad423955f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] TengHuo commented on issue #7106: [PROPOSE] Add column prune support for other payload class

2022-11-09 Thread GitBox


TengHuo commented on issue #7106:
URL: https://github.com/apache/hudi/issues/7106#issuecomment-1309697353

   @alexeykudinkin 
   
   Got it, thanks a lot. We saw the code in the `release-feature-rfc46` branch, in particular the `HoodieRecordMerger` interface. We will keep an eye on it and migrate our code based on it. Looking forward to seeing RFC-46 in master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] trushev commented on a diff in pull request #7151: [MINOR] Performance improvement of flink ITs with reused miniCluster

2022-11-09 Thread GitBox


trushev commented on code in PR #7151:
URL: https://github.com/apache/hudi/pull/7151#discussion_r1018601277


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/AbstractHoodieTestBase.java:
##
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utils;
+
+import org.apache.hudi.exception.HoodieException;
+
+import org.apache.flink.test.util.AbstractTestBase;
+import org.apache.flink.test.util.MiniClusterWithClientResource;
+
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeAll;
+
+import java.lang.reflect.Field;
+import java.util.Arrays;
+
+import static java.lang.reflect.Modifier.isPublic;
+import static java.lang.reflect.Modifier.isStatic;
+
+/**
+ * Hoodie base class for tests that run multiple tests and want to reuse the 
same Flink cluster.
+ * Unlike {@link AbstractTestBase}, this class is designed to run with JUnit 5.
+ */
+public abstract class AbstractHoodieTestBase extends AbstractTestBase {
+
+  private static final MiniClusterWithClientResource MINI_CLUSTER_RESOURCE = 
getMiniClusterFromParentClass();
+
+  @BeforeAll

Review Comment:
   Implemented as a JUnit 5 extension.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] KnightChess commented on issue #7147: [SUPPORT] Some classes lacked and couldn't be imported in the module of hudi-spark3.2.x_2.12 and hudi-spark3.3.x_2.12

2022-11-09 Thread GitBox


KnightChess commented on issue #7147:
URL: https://github.com/apache/hudi/issues/7147#issuecomment-1309691965

   @GoodJeek `HoodieSqlBaseParser` is generated by the `antlr4-maven-plugin` when you compile the project.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7167: [HUDI-5094] Remove partition fields before transform bytes to avro,if enable DROP_PARTITION_COLUMNS.

2022-11-09 Thread GitBox


hudi-bot commented on PR #7167:
URL: https://github.com/apache/hudi/pull/7167#issuecomment-1309685076

   
   ## CI report:
   
   * 8881615927848e76214065870119e910e41e9c35 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12904)
 
   * 6b165aec634812ba8d6f4a55d0dfb8578031d25c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12916)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4142) RFC for new Table APIs proposal

2022-11-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4142:
--
Reviewers: Alexey Kudinkin

> RFC for new Table APIs proposal
> ---
>
> Key: HUDI-4142
> URL: https://issues.apache.org/jira/browse/HUDI-4142
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Document all APIs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4142) RFC for new Table APIs proposal

2022-11-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4142:
--
Priority: Blocker  (was: Major)

> RFC for new Table APIs proposal
> ---
>
> Key: HUDI-4142
> URL: https://issues.apache.org/jira/browse/HUDI-4142
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Document all APIs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4142) RFC for new Table APIs proposal

2022-11-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4142:
--
Fix Version/s: 0.13.0
   (was: 1.0.0)

> RFC for new Table APIs proposal
> ---
>
> Key: HUDI-4142
> URL: https://issues.apache.org/jira/browse/HUDI-4142
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Document all APIs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4141) RFC-54 Table Format APIs

2022-11-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4141:
--
Description: RFC: [https://github.com/apache/hudi/pull/7080]  (was: New 
Table APIs to create/update table programmatically. For example,

HudiTable.create(tableConfigs)

HudiTable.forPath(basePath)

HudiTable.delete()

HudiTable.merge())

> RFC-54 Table Format APIs
> 
>
> Key: HUDI-4141
> URL: https://issues.apache.org/jira/browse/HUDI-4141
> Project: Apache Hudi
>  Issue Type: Epic
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>
> RFC: [https://github.com/apache/hudi/pull/7080]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7167: [HUDI-5094] Remove partition fields before transform bytes to avro,if enable DROP_PARTITION_COLUMNS.

2022-11-09 Thread GitBox


hudi-bot commented on PR #7167:
URL: https://github.com/apache/hudi/pull/7167#issuecomment-1309681703

   
   ## CI report:
   
   * 8881615927848e76214065870119e910e41e9c35 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12904)
 
   * 6b165aec634812ba8d6f4a55d0dfb8578031d25c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7169: [HUDI-5186] Parallelism does not take effect when hoodie.combine.before.upsert/insert false

2022-11-09 Thread GitBox


hudi-bot commented on PR #7169:
URL: https://github.com/apache/hudi/pull/7169#issuecomment-1309677939

   
   ## CI report:
   
   * f9e08f5d14106a18fe59bf752077ab1043595e03 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12899)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12905)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12915)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7172: [HUDI-4888] throw exception if COW table and consistent hashing bucket index

2022-11-09 Thread GitBox


hudi-bot commented on PR #7172:
URL: https://github.com/apache/hudi/pull/7172#issuecomment-1309677986

   
   ## CI report:
   
   * 959da168ba88673c2b2b4eb5d42cf6aac62d3808 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12913)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #7035: [HUDI-5075] Adding support to rollback residual clustering after disabling clustering

2022-11-09 Thread GitBox


SteNicholas commented on code in PR #7035:
URL: https://github.com/apache/hudi/pull/7035#discussion_r1018585393


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##
@@ -310,6 +310,14 @@ public class HoodieClusteringConfig extends HoodieConfig {
   + "Please exercise caution while setting this config, especially 
when clustering is done very frequently. This could lead to race condition in "
   + "rare scenarios, for example, when the clustering completes after 
instants are fetched but before rollback completed.");
 
+  public static final ConfigProperty 
ROLLBACK_PENDING_CLUSTERING_WHEN_DISABLED = ConfigProperty
+  .key("hoodie.rollback.pending.clustering.when.disabled")

Review Comment:
   Could we replace the key `hoodie.rollback.pending.clustering.when.disabled` with `hoodie.clustering.rollback.pending.replacecommit`, to match the naming of the interface method `withRollbackPendingClustering`? The current name `hoodie.rollback.pending.clustering.when.disabled` leaves users unsure whether it is clustering or the rollback that gets disabled.
   cc @yihua 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #7035: [HUDI-5075] Adding support to rollback residual clustering after disabling clustering

2022-11-09 Thread GitBox


SteNicholas commented on code in PR #7035:
URL: https://github.com/apache/hudi/pull/7035#discussion_r1018585393


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##
@@ -310,6 +310,14 @@ public class HoodieClusteringConfig extends HoodieConfig {
   + "Please exercise caution while setting this config, especially 
when clustering is done very frequently. This could lead to race condition in "
   + "rare scenarios, for example, when the clustering completes after 
instants are fetched but before rollback completed.");
 
+  public static final ConfigProperty 
ROLLBACK_PENDING_CLUSTERING_WHEN_DISABLED = ConfigProperty
+  .key("hoodie.rollback.pending.clustering.when.disabled")

Review Comment:
   Could we replace the key `hoodie.rollback.pending.clustering.when.disabled` with `hoodie.clustering.rollback.pending.replacecommit.when.disabled`? The current name leaves users unsure whether it is clustering or the rollback that gets disabled.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Zouxxyy commented on pull request #7140: [HUDI-5163] Fixing failure handling with spark datasource write

2022-11-09 Thread GitBox


Zouxxyy commented on PR #7140:
URL: https://github.com/apache/hudi/pull/7140#issuecomment-1309666320

   > > @YannByron : I might need some help to take this patch and make valid 
fixes for spark-sql classes. Also, we might need to write tests. if you can 
loop in someone, would be nice.
   > 
   > @Zouxxyy would you like to take this up to enrich more cases to validate? 
If yes, you can open another pr based on this pr and comments for your 
convenience.
   
   @YannByron Ok, I'll fix it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (HUDI-5187) Remove the preCondition check of BucketAssigner assign state

2022-11-09 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen resolved HUDI-5187.
--

> Remove the preCondition check of BucketAssigner assign state
> 
>
> Key: HUDI-5187
> URL: https://issues.apache.org/jira/browse/HUDI-5187
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-5187) Remove the preCondition check of BucketAssigner assign state

2022-11-09 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-5187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631382#comment-17631382
 ] 

Danny Chen commented on HUDI-5187:
--

Fixed via master branch: df9100b0cb496bb78b1173ae25f9140c1b155bc4

> Remove the preCondition check of BucketAssigner assign state
> 
>
> Key: HUDI-5187
> URL: https://issues.apache.org/jira/browse/HUDI-5187
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] trushev commented on pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution

2022-11-09 Thread GitBox


trushev commented on PR #5830:
URL: https://github.com/apache/hudi/pull/5830#issuecomment-1309655985

   @danny0405 rebased


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (1469469d6e -> df9100b0cb)

2022-11-09 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 1469469d6e [HUDI-5111] Improve integration test coverage (#7092)
 add df9100b0cb [HUDI-5187] Remove the preCondition check of BucketAssigner 
assign state (#7170)

No new revisions were added by this update.

Summary of changes:
 .../main/java/org/apache/hudi/sink/partitioner/BucketAssigner.java | 7 ---
 1 file changed, 7 deletions(-)



[GitHub] [hudi] danny0405 merged pull request #7170: [HUDI-5187] Remove the preCondition check of BucketAssigner assign state

2022-11-09 Thread GitBox


danny0405 merged PR #7170:
URL: https://github.com/apache/hudi/pull/7170


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] trushev commented on pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution

2022-11-09 Thread GitBox


trushev commented on PR #5830:
URL: https://github.com/apache/hudi/pull/5830#issuecomment-1309651755

   @danny0405  it's rebased


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhangyue19921010 commented on pull request #7169: [HUDI-5186] Parallelism does not take effect when hoodie.combine.before.upsert/insert false

2022-11-09 Thread GitBox


zhangyue19921010 commented on PR #7169:
URL: https://github.com/apache/hudi/pull/7169#issuecomment-1309642342

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7039: [HUDI-5080] Fixing unpersist to consider only rdds pertaining to current write operation

2022-11-09 Thread GitBox


hudi-bot commented on PR #7039:
URL: https://github.com/apache/hudi/pull/7039#issuecomment-1309617147

   
   ## CI report:
   
   * dc7c2bdefdb2a84b27b40751714a31604e2931eb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12911)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6680: [HUDI-4812] Lazy fetching partition path & file slice for HoodieFileIndex

2022-11-09 Thread GitBox


hudi-bot commented on PR #6680:
URL: https://github.com/apache/hudi/pull/6680#issuecomment-1309616517

   
   ## CI report:
   
   * 59cdd09e3190c3646e1e3ea6ca3f076526ec0473 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12912)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7172: [HUDI-4888] throw exception if COW table and consistent hashing bucket index

2022-11-09 Thread GitBox


hudi-bot commented on PR #7172:
URL: https://github.com/apache/hudi/pull/7172#issuecomment-1309553268

   
   ## CI report:
   
   * 959da168ba88673c2b2b4eb5d42cf6aac62d3808 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12913)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7172: [HUDI-4888] throw exception if COW table and consistent hashing bucket index

2022-11-09 Thread GitBox


hudi-bot commented on PR #7172:
URL: https://github.com/apache/hudi/pull/7172#issuecomment-1309549269

   
   ## CI report:
   
   * 959da168ba88673c2b2b4eb5d42cf6aac62d3808 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4888) Add validation to block COW table to use consistent hashing bucket index

2022-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4888:
-
Labels: pull-request-available  (was: )

> Add validation to block COW table to use consistent hashing bucket index
> 
>
> Key: HUDI-4888
> URL: https://issues.apache.org/jira/browse/HUDI-4888
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yuwei Xiao
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> Consistent hashing bucket index resizing relies on the log feature of MOR
> tables, so with a COW table the consistent hashing bucket index currently cannot
> be resized.
> We should block the user from using it at the very beginning (i.e., at table
> creation) and suggest that they use a MOR table or the Simple Bucket Index instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] jonvex opened a new pull request, #7172: [HUDI-4888] throw exception if COW table and consistent hashing bucket index

2022-11-09 Thread GitBox


jonvex opened a new pull request, #7172:
URL: https://github.com/apache/hudi/pull/7172

   ### Change Logs
   
   Consistent hashing bucket index resizing does not work on COW tables because it relies on writing to log files while the resizing is taking place. Instead of letting Hudi fail when resizing eventually happens, it will now fail on the first write with this configuration.
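   
   A rough sketch of the kind of fail-fast validation described above; the class, method, and string constants are illustrative, not the actual Hudi implementation:
   
   ```java
   // Sketch only: reject the unsupported combination up front, at configuration time,
   // instead of failing later when bucket resizing kicks in.
   class BucketIndexValidationSketch {
     static void validateBucketIndexEngine(String tableType, String indexType, String bucketEngine) {
       if ("COPY_ON_WRITE".equals(tableType)
           && "BUCKET".equals(indexType)
           && "CONSISTENT_HASHING".equals(bucketEngine)) {
         throw new IllegalArgumentException(
             "Consistent hashing bucket index requires a MERGE_ON_READ table; "
                 + "use MOR or the SIMPLE bucket index engine with COW tables.");
       }
     }
   }
   ```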
   
   ### Impact
   
   Fails faster and lets the user know what is going wrong.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4888) Add validation to block COW table to use consistent hashing bucket index

2022-11-09 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-4888:
--
Status: Patch Available  (was: In Progress)

> Add validation to block COW table to use consistent hashing bucket index
> 
>
> Key: HUDI-4888
> URL: https://issues.apache.org/jira/browse/HUDI-4888
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yuwei Xiao
>Assignee: Jonathan Vexler
>Priority: Major
> Fix For: 0.12.2
>
>
> Consistent hashing bucket index resizing relies on the log feature of MOR
> tables, so with a COW table the consistent hashing bucket index currently cannot
> be resized.
> We should block the user from using it at the very beginning (i.e., at table
> creation) and suggest that they use a MOR table or the Simple Bucket Index instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] alexeykudinkin commented on issue #7106: [PROPOSE] Add column prune support for other payload class

2022-11-09 Thread GitBox


alexeykudinkin commented on issue #7106:
URL: https://github.com/apache/hudi/issues/7106#issuecomment-1309506033

   @TengHuo correct, it's slated for 0.13. We're currently in the final innings 
of merging Phase 1 to master and hoping to do this in the coming weeks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on pull request #7069: [HUDI-5097] Fix partition reading without partition fields table config

2022-11-09 Thread GitBox


alexeykudinkin commented on PR #7069:
URL: https://github.com/apache/hudi/pull/7069#issuecomment-1309504006

   > not landing this to master. only meant for 0.12.2 release patch
   
   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6680: [HUDI-4812] Lazy fetching partition path & file slice for HoodieFileIndex

2022-11-09 Thread GitBox


hudi-bot commented on PR #6680:
URL: https://github.com/apache/hudi/pull/6680#issuecomment-1309472827

   
   ## CI report:
   
   * c3aba0dc3e2f7c2c6240d3aa5bc279cf8f359153 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12822)
 
   * 59cdd09e3190c3646e1e3ea6ca3f076526ec0473 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12912)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7039: [HUDI-5080] Fixing unpersist to consider only rdds pertaining to current write operation

2022-11-09 Thread GitBox


hudi-bot commented on PR #7039:
URL: https://github.com/apache/hudi/pull/7039#issuecomment-1309466898

   
   ## CI report:
   
   * 5ff96812e74f348af76c942f58e67445afbb765e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12767)
 
   * dc7c2bdefdb2a84b27b40751714a31604e2931eb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12911)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6680: [HUDI-4812] Lazy fetching partition path & file slice for HoodieFileIndex

2022-11-09 Thread GitBox


hudi-bot commented on PR #6680:
URL: https://github.com/apache/hudi/pull/6680#issuecomment-1309466349

   
   ## CI report:
   
   * c3aba0dc3e2f7c2c6240d3aa5bc279cf8f359153 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12822)
 
   * 59cdd09e3190c3646e1e3ea6ca3f076526ec0473 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests

2022-11-09 Thread GitBox


hudi-bot commented on PR #7171:
URL: https://github.com/apache/hudi/pull/7171#issuecomment-1309459685

   
   ## CI report:
   
   * 2a1766f041bf9b8bac2927b1aa076916361a00b6 UNKNOWN
   * bba82cddcfad42d2e2fe698d8a27077536257bfc Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12909)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7039: [HUDI-5080] Fixing unpersist to consider only rdds pertaining to current write operation

2022-11-09 Thread GitBox


hudi-bot commented on PR #7039:
URL: https://github.com/apache/hudi/pull/7039#issuecomment-1309459075

   
   ## CI report:
   
   * 5ff96812e74f348af76c942f58e67445afbb765e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12767)
 
   * dc7c2bdefdb2a84b27b40751714a31604e2931eb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7003: [minor] add more test for rfc46

2022-11-09 Thread GitBox


hudi-bot commented on PR #7003:
URL: https://github.com/apache/hudi/pull/7003#issuecomment-1309458829

   
   ## CI report:
   
   * da89f1a57bae167a6474092cc07ca0880e4028b8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12910)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #7039: [HUDI-5080] Fixing unpersist to consider only rdds pertaining to current write operation

2022-11-09 Thread GitBox


nsivabalan commented on code in PR #7039:
URL: https://github.com/apache/hudi/pull/7039#discussion_r1018436316


##
hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java:
##
@@ -44,6 +46,28 @@ public class CommitUtils {
 
   private static final Logger LOG = LogManager.getLogger(CommitUtils.class);
   private static final String NULL_SCHEMA_STR = 
Schema.create(Schema.Type.NULL).toString();
+  public static transient ConcurrentHashMap> 
PERSISTED_RDD_IDS = new ConcurrentHashMap();

Review Comment:
   Sounds fair. I have an idea on how to go about this, but let's jam and reach a 
consensus before I go ahead. 
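   To make the discussion concrete, here is one possible shape of the idea (a sketch, not the code in this PR): track what was persisted per write operation, keyed by instant time (an assumption), and unpersist only that set when the operation finishes. The class and helper names are invented, and the sketch tracks the RDD references rather than their ids.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

public class PerWritePersistTracker {
  // One entry per in-flight write operation (keyed here by instant time), so that
  // unpersist only touches the RDDs cached for that operation.
  private static final ConcurrentHashMap<String, List<JavaRDD<?>>> PERSISTED_RDDS =
      new ConcurrentHashMap<>();

  public static <T> JavaRDD<T> persistForWrite(String instantTime, JavaRDD<T> rdd) {
    rdd.persist(StorageLevel.MEMORY_AND_DISK());
    PERSISTED_RDDS
        .computeIfAbsent(instantTime, k -> Collections.synchronizedList(new ArrayList<>()))
        .add(rdd);
    return rdd;
  }

  public static void unpersistForWrite(String instantTime) {
    List<JavaRDD<?>> rdds = PERSISTED_RDDS.remove(instantTime);
    if (rdds != null) {
      // RDDs persisted by concurrent writers (other instant times) are left alone.
      rdds.forEach(rdd -> rdd.unpersist());
    }
  }
}
```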



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rtdt99 commented on issue #7158: [SUPPORT] Request for support for Azure Blob Storage Events Source and Events Hoodie Incr Source in DeltaStreamer

2022-11-09 Thread GitBox


rtdt99 commented on issue #7158:
URL: https://github.com/apache/hudi/issues/7158#issuecomment-1309389618

   @nsivabalan, could you please provide a timeline for when we can expect 
this feature in DeltaStreamer?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7142: [HUDI-5056] Allow wildcards in partition paths for DELETE_PARTITIONS

2022-11-09 Thread GitBox


hudi-bot commented on PR #7142:
URL: https://github.com/apache/hudi/pull/7142#issuecomment-1309365677

   
   ## CI report:
   
   * f839fdda3077916eea26ed14e85aa01fa657f3e6 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12908)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator

2022-11-09 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4967:

Description: 
Related fix: HUDI-4966

We need to add docs on how to properly set the meta sync configuration, 
especially the hoodie.datasource.hive_sync.partition_value_extractor, in 
[https://hudi.apache.org/docs/key_generation] (for different Hudi versions, the 
config can be different).  Check the ticket above and PR description of 
[https://github.com/apache/hudi/pull/6851] for more details.

We should also add the migration setup on the key generation page as well: 
[https://hudi.apache.org/releases/release-0.12.0/#configuration-updates]
 * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config 
is used to extract and transform partition value during Hive sync. Its default 
value has been changed from {{SlashEncodedDayPartitionValueExtractor}} to 
{{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default 
value (i.e., have not set it explicitly), you are required to set the config to 
{{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From this 
release, if this config is not set and Hive sync is enabled, then partition 
value extractor class will be *automatically inferred* on the basis of number 
of partition fields and whether or not hive style partitioning is enabled.
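
For illustration, a minimal Spark datasource write that pins the extractor explicitly could look like the sketch below. Everything other than the config keys (the table name, record key and partition field, and the paths) is invented for this example and is not taken from the ticket.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HiveSyncExtractorExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hive-sync-extractor-example").getOrCreate();
    // Hypothetical input batch; replace with your own source.
    Dataset<Row> df = spark.read().parquet("/tmp/source_batch");

    df.write().format("hudi")
        .option("hoodie.table.name", "hudi_trips")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.partitionpath.field", "createdDate")
        .option("hoodie.datasource.hive_sync.enable", "true")
        // If you relied on the previous default, set it back explicitly rather than
        // inheriting the new MultiPartKeysValueExtractor default.
        .option("hoodie.datasource.hive_sync.partition_value_extractor",
            "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor")
        .mode(SaveMode.Append)
        .save("/tmp/hudi_trips");
  }
}
{code}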

  was:
Related fix: HUDI-4966

We need to add docs on how to properly set the meta sync configuration, 
especially the hoodie.datasource.hive_sync.partition_value_extractor, in 
[https://hudi.apache.org/docs/key_generation] (for different Hudi versions, the 
config can be different).

We should also add the migration setup on the key generation page as well: 
[https://hudi.apache.org/releases/release-0.12.0/#configuration-updates]
 * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config 
is used to extract and transform partition value during Hive sync. Its default 
value has been changed from {{SlashEncodedDayPartitionValueExtractor}} to 
{{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default 
value (i.e., have not set it explicitly), you are required to set the config to 
{{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From this 
release, if this config is not set and Hive sync is enabled, then partition 
value extractor class will be *automatically inferred* on the basis of number 
of partition fields and whether or not hive style partitioning is enabled.


> Improve docs for meta sync with TimestampBasedKeyGenerator
> --
>
> Key: HUDI-4967
> URL: https://issues.apache.org/jira/browse/HUDI-4967
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Major
> Fix For: 0.12.2
>
>
> Related fix: HUDI-4966
> We need to add docs on how to properly set the meta sync configuration, 
> especially the hoodie.datasource.hive_sync.partition_value_extractor, in 
> [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, 
> the config can be different).  Check the ticket above and PR description of 
> [https://github.com/apache/hudi/pull/6851] for more details.
> We should also add the migration setup on the key generation page as well: 
> [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates]
>  * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config 
> is used to extract and transform partition value during Hive sync. Its 
> default value has been changed from 
> {{SlashEncodedDayPartitionValueExtractor}} to 
> {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default 
> value (i.e., have not set it explicitly), you are required to set the config 
> to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From 
> this release, if this config is not set and Hive sync is enabled, then 
> partition value extractor class will be *automatically inferred* on the basis 
> of number of partition fields and whether or not hive style partitioning is 
> enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4966) Meta sync throws exception if TimestampBasedKeyGenerator is used to generate partition path containing slashes

2022-11-09 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4966:

Fix Version/s: 0.12.2
   (was: 0.12.1)

> Meta sync throws exception if TimestampBasedKeyGenerator is used to generate 
> partition path containing slashes
> --
>
> Key: HUDI-4966
> URL: https://issues.apache.org/jira/browse/HUDI-4966
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> For Deltastreamer, when using TimestampBasedKeyGenerator with the output 
> format of partition path containing slashes, e.g., "yyyy/MM/dd", and 
> hive-style partitioning disabled (by default), the meta sync fails.
> {code:java}
> --hoodie-conf hoodie.datasource.write.partitionpath.field=createdDate
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.timezone=GMT
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS {code}
> Hive Sync exception:
> {code:java}
> Exception in thread "main" org.apache.hudi.exception.HoodieException: Could 
> not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
>     at 
> org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:58)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.runMetaSync(DeltaSync.java:719)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:637)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:337)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:204)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:202)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:571)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>     at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>     at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>     at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception 
> when hive syncing test_table
>     at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
>     at 
> org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:56)
>     ... 19 more
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
> partitions for table test_table
>     at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:341)
>     at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:232)
>     at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:154)
>     at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:142)
>     ... 20 more
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: default.test_table 
> add partition failed
>     at 
> org.apache.hudi.hive.ddl.HMSDDLExecutor.addPartitionsToTable(HMSDDLExecutor.java:217)
>     at 
> org.apache.hudi.hive.HoodieHiveSyncClient.addPartitionsToTable(HoodieHiveSyncClient.java:107)
>     at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:324)
>     ... 23 more
> Caused by: MetaException(message:Invalid partition key & values; keys 
> [createddate, ], values [2022, 10, 02, ])
>     at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$add_partitions_req_result$add_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
>     at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$add_partitions_req_result$add

[jira] [Updated] (HUDI-4966) Meta sync throws exception if TimestampBasedKeyGenerator is used to generate partition path containing slashes

2022-11-09 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4966:

Fix Version/s: 0.12.1
   (was: 0.12.2)

> Meta sync throws exception if TimestampBasedKeyGenerator is used to generate 
> partition path containing slashes
> --
>
> Key: HUDI-4966
> URL: https://issues.apache.org/jira/browse/HUDI-4966
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> For Deltastreamer, when using TimestampBasedKeyGenerator with the output 
> format of partition path containing slashes, e.g., "yyyy/MM/dd", and 
> hive-style partitioning disabled (by default), the meta sync fails.
> {code:java}
> --hoodie-conf hoodie.datasource.write.partitionpath.field=createdDate
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.timezone=GMT
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS {code}
> Hive Sync exception:
> {code:java}
> Exception in thread "main" org.apache.hudi.exception.HoodieException: Could 
> not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
>     at 
> org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:58)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.runMetaSync(DeltaSync.java:719)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:637)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:337)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:204)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:202)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:571)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>     at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>     at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>     at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception 
> when hive syncing test_table
>     at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
>     at 
> org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:56)
>     ... 19 more
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
> partitions for table test_table
>     at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:341)
>     at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:232)
>     at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:154)
>     at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:142)
>     ... 20 more
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: default.test_table 
> add partition failed
>     at 
> org.apache.hudi.hive.ddl.HMSDDLExecutor.addPartitionsToTable(HMSDDLExecutor.java:217)
>     at 
> org.apache.hudi.hive.HoodieHiveSyncClient.addPartitionsToTable(HoodieHiveSyncClient.java:107)
>     at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:324)
>     ... 23 more
> Caused by: MetaException(message:Invalid partition key & values; keys 
> [createddate, ], values [2022, 10, 02, ])
>     at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$add_partitions_req_result$add_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
>     at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$add_partitions_req_result$add

[jira] [Updated] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator

2022-11-09 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4967:

Description: 
Related fix: HUDI-4966

We need to add docs on how to properly set the meta sync configuration, 
especially the hoodie.datasource.hive_sync.partition_value_extractor, in 
[https://hudi.apache.org/docs/key_generation] (for different Hudi versions, the 
config can be different).

We should also add the migration setup on the key generation page as well: 
[https://hudi.apache.org/releases/release-0.12.0/#configuration-updates]
 * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config 
is used to extract and transform partition value during Hive sync. Its default 
value has been changed from {{SlashEncodedDayPartitionValueExtractor}} to 
{{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default 
value (i.e., have not set it explicitly), you are required to set the config to 
{{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From this 
release, if this config is not set and Hive sync is enabled, then partition 
value extractor class will be *automatically inferred* on the basis of number 
of partition fields and whether or not hive style partitioning is enabled.

  was:
Related fix: HUDI-4966

 


> Improve docs for meta sync with TimestampBasedKeyGenerator
> --
>
> Key: HUDI-4967
> URL: https://issues.apache.org/jira/browse/HUDI-4967
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Major
> Fix For: 0.12.2
>
>
> Related fix: HUDI-4966
> We need to add docs on how to properly set the meta sync configuration, 
> especially the hoodie.datasource.hive_sync.partition_value_extractor, in 
> [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, 
> the config can be different).
> We should also add the migration setup on the key generation page as well: 
> [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates]
>  * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config 
> is used to extract and transform partition value during Hive sync. Its 
> default value has been changed from 
> {{SlashEncodedDayPartitionValueExtractor}} to 
> {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default 
> value (i.e., have not set it explicitly), you are required to set the config 
> to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From 
> this release, if this config is not set and Hive sync is enabled, then 
> partition value extractor class will be *automatically inferred* on the basis 
> of number of partition fields and whether or not hive style partitioning is 
> enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-4966) Meta sync throws exception if TimestampBasedKeyGenerator is used to generate partition path containing slashes

2022-11-09 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-4966.
---
Resolution: Fixed

> Meta sync throws exception if TimestampBasedKeyGenerator is used to generate 
> partition path containing slashes
> --
>
> Key: HUDI-4966
> URL: https://issues.apache.org/jira/browse/HUDI-4966
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> For Deltastreamer, when using TimestampBasedKeyGenerator with the output 
> format of partition path containing slashes, e.g., "yyyy/MM/dd", and 
> hive-style partitioning disabled (by default), the meta sync fails.
> {code:java}
> --hoodie-conf hoodie.datasource.write.partitionpath.field=createdDate
> --hoodie-conf 
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.timezone=GMT
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
> --hoodie-conf 
> hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS {code}
> Hive Sync exception:
> {code:java}
> Exception in thread "main" org.apache.hudi.exception.HoodieException: Could 
> not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
>     at 
> org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:58)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.runMetaSync(DeltaSync.java:719)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:637)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:337)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:204)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:202)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:571)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>     at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>     at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>     at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception 
> when hive syncing test_table
>     at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
>     at 
> org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:56)
>     ... 19 more
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
> partitions for table test_table
>     at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:341)
>     at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:232)
>     at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:154)
>     at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:142)
>     ... 20 more
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: default.test_table 
> add partition failed
>     at 
> org.apache.hudi.hive.ddl.HMSDDLExecutor.addPartitionsToTable(HMSDDLExecutor.java:217)
>     at 
> org.apache.hudi.hive.HoodieHiveSyncClient.addPartitionsToTable(HoodieHiveSyncClient.java:107)
>     at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:324)
>     ... 23 more
> Caused by: MetaException(message:Invalid partition key & values; keys 
> [createddate, ], values [2022, 10, 02, ])
>     at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$add_partitions_req_result$add_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
>     at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$add_partitions_req_result$add_partitions_req_resultStandardScheme.read(T

[GitHub] [hudi] jonvex commented on issue #6808: [SUPPORT] Cannot sync to spark embedded derby hive meta store (the default one)

2022-11-09 Thread GitBox


jonvex commented on issue #6808:
URL: https://github.com/apache/hudi/issues/6808#issuecomment-1309323342

   The updated links are:
   - https://github.com/apache/hudi/blob/master/packaging/bundle-validation/Dockerfile
   - https://github.com/apache/hudi/blob/master/packaging/bundle-validation/validate.sh (see the test_spark_hadoop_mr_bundles function)
   - https://github.com/apache/hudi/tree/master/packaging/bundle-validation/conf contains some configuration files


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7003: [minor] add more test for rfc46

2022-11-09 Thread GitBox


hudi-bot commented on PR #7003:
URL: https://github.com/apache/hudi/pull/7003#issuecomment-1309277206

   
   ## CI report:
   
   * 6efb6b43bec7d58883b7120f23c06d4ae927a528 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12573)
 
   * da89f1a57bae167a6474092cc07ca0880e4028b8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12910)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests

2022-11-09 Thread GitBox


hudi-bot commented on PR #7171:
URL: https://github.com/apache/hudi/pull/7171#issuecomment-1309272029

   
   ## CI report:
   
   * 2a1766f041bf9b8bac2927b1aa076916361a00b6 UNKNOWN
   * bba82cddcfad42d2e2fe698d8a27077536257bfc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12909)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7003: [minor] add more test for rfc46

2022-11-09 Thread GitBox


hudi-bot commented on PR #7003:
URL: https://github.com/apache/hudi/pull/7003#issuecomment-1309271518

   
   ## CI report:
   
   * 6efb6b43bec7d58883b7120f23c06d4ae927a528 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12573)
 
   * da89f1a57bae167a6474092cc07ca0880e4028b8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7169: [HUDI-5186] Parallelism does not take effect when hoodie.combine.before.upsert/insert false

2022-11-09 Thread GitBox


hudi-bot commented on PR #7169:
URL: https://github.com/apache/hudi/pull/7169#issuecomment-1309266022

   
   ## CI report:
   
   * f9e08f5d14106a18fe59bf752077ab1043595e03 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12899)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12905)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests

2022-11-09 Thread GitBox


hudi-bot commented on PR #7171:
URL: https://github.com/apache/hudi/pull/7171#issuecomment-1309266099

   
   ## CI report:
   
   * 2a1766f041bf9b8bac2927b1aa076916361a00b6 UNKNOWN
   * bba82cddcfad42d2e2fe698d8a27077536257bfc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus

2022-11-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-2613.
-
Resolution: Done

> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
> 
>
> Key: HUDI-2613
> URL: https://issues.apache.org/jira/browse/HUDI-2613
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Critical
> Fix For: 0.13.0
>
>
> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of 
> getDeltalogs()



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7167: [HUDI-5094] Remove partition fields before transform bytes to avro,if enable DROP_PARTITION_COLUMNS.

2022-11-09 Thread GitBox


hudi-bot commented on PR #7167:
URL: https://github.com/apache/hudi/pull/7167#issuecomment-1309201490

   
   ## CI report:
   
   * 8881615927848e76214065870119e910e41e9c35 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12904)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus

2022-11-09 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631217#comment-17631217
 ] 

sivabalan narayanan commented on HUDI-2613:
---

Looks like we got this covered already. Closing it as done. 

 

> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
> 
>
> Key: HUDI-2613
> URL: https://issues.apache.org/jira/browse/HUDI-2613
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Critical
> Fix For: 0.13.0
>
>
> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of 
> getDeltalogs()



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5171) Ensure validateTableConfig also checks for partition path field value switch

2022-11-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5171:
--
Story Points: 1

> Ensure validateTableConfig also checks for partition path field value switch
> 
>
> Key: HUDI-5171
> URL: https://issues.apache.org/jira/browse/HUDI-5171
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.12.1
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> As of now, validateTableConfig does not consider a switch of the partition path 
> field value. We need to consider that as well.
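
A purely illustrative sketch of the kind of check being asked for is below; the method and parameter names and the exception type are assumptions, not the actual validateTableConfig implementation:

{code:java}
public class PartitionPathFieldCheckExample {
  // Fail fast if the incoming write tries to switch the partition path field(s)
  // that the table was originally created with.
  static void checkPartitionPathFieldsUnchanged(String fieldsInTableConfig, String fieldsInWriteConfig) {
    if (fieldsInTableConfig != null && fieldsInWriteConfig != null
        && !fieldsInTableConfig.equals(fieldsInWriteConfig)) {
      throw new IllegalArgumentException(String.format(
          "Config conflict: partition path fields already set to '%s' in table config, "
              + "but the incoming write uses '%s'", fieldsInTableConfig, fieldsInWriteConfig));
    }
  }
}
{code}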



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-4990) Parallelize deduplication in CLI tool

2022-11-09 Thread Jonathan Vexler (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631215#comment-17631215
 ] 

Jonathan Vexler commented on HUDI-4990:
---

We are stuck on this for now because we can't run the integration tests locally.

> Parallelize deduplication in CLI tool
> -
>
> Key: HUDI-4990
> URL: https://issues.apache.org/jira/browse/HUDI-4990
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Major
> Fix For: 0.12.2
>
>
> The CLI tool command `repair deduplicate` repairs one partition at a time.  To 
> repair hundreds of partitions, this takes time.  We should add a mode that takes 
> multiple partition paths for the CLI and runs the dedup job for multiple 
> partitions at the same time.
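
A hedged sketch of what such a parallel mode could look like is below; repairPartition(...) is a hypothetical stand-in for the existing single-partition dedup job, and the thread-pool approach is just one option:

{code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelDedupExample {
  static void repairAll(List<String> partitionPaths, int parallelism) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      // Submit one dedup job per partition so hundreds of partitions are repaired concurrently.
      for (String partitionPath : partitionPaths) {
        pool.submit(() -> repairPartition(partitionPath));
      }
    } finally {
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.HOURS);
    }
  }

  private static void repairPartition(String partitionPath) {
    // Placeholder for the existing single-partition "repair deduplicate" logic.
  }
}
{code}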



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7142: [HUDI-5056] Allow wildcards in partition paths for DELETE_PARTITIONS

2022-11-09 Thread GitBox


hudi-bot commented on PR #7142:
URL: https://github.com/apache/hudi/pull/7142#issuecomment-1309195747

   
   ## CI report:
   
   * 50ab8834e6dad68dfb679216e7b3096254588dfc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12860)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12906)
 
   * f839fdda3077916eea26ed14e85aa01fa657f3e6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12908)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7142: [HUDI-5056] Allow wildcards in partition paths for DELETE_PARTITIONS

2022-11-09 Thread GitBox


hudi-bot commented on PR #7142:
URL: https://github.com/apache/hudi/pull/7142#issuecomment-1309189923

   
   ## CI report:
   
   * 50ab8834e6dad68dfb679216e7b3096254588dfc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12860)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12906)
 
   * f839fdda3077916eea26ed14e85aa01fa657f3e6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests

2022-11-09 Thread GitBox


hudi-bot commented on PR #7171:
URL: https://github.com/apache/hudi/pull/7171#issuecomment-1309183756

   
   ## CI report:
   
   * 2a1766f041bf9b8bac2927b1aa076916361a00b6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6838: [MINOR] Update azure image and balance CI jobs

2022-11-09 Thread GitBox


hudi-bot commented on PR #6838:
URL: https://github.com/apache/hudi/pull/6838#issuecomment-1309182962

   
   ## CI report:
   
   * cb0d3a60736adeac074af479eabdc844793ea067 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12903)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan closed pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests

2022-11-09 Thread GitBox


xushiyan closed pull request #7171: [HUDI-5145] Avoid starting HDFS in 
hudi-utilities tests
URL: https://github.com/apache/hudi/pull/7171


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #7132: [HUDI-51577] Adding capability to remove all meta fields from source hudi table with Hudi incr source

2022-11-09 Thread GitBox


nsivabalan commented on code in PR #7132:
URL: https://github.com/apache/hudi/pull/7132#discussion_r1018258230


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java:
##
@@ -172,10 +178,13 @@ public Pair<Option<Dataset<Row>>, String> 
fetchNextBatch(Option lastCkpt
  *
  * log.info("Validated Source Schema :" + validated.schema());
  */
+boolean dropAllMetaFields = 
props.getBoolean(Config.HOODIE_DROP_ALL_META_FIELDS_FROM_SOURCE,
+Config.DEFAULT_HOODIE_DROP_ALL_META_FIELDS_FROM_SOURCE);
 
 // Remove Hoodie meta columns except partition path from input source
-final Dataset<Row> src = 
source.drop(HoodieRecord.HOODIE_META_COLUMNS.stream()
-.filter(x -> 
!x.equals(HoodieRecord.PARTITION_PATH_METADATA_FIELD)).toArray(String[]::new));
+String[] colsToDrop = dropAllMetaFields ? 
HoodieRecord.HOODIE_META_COLUMNS.stream().toArray(String[]::new) :
+HoodieRecord.HOODIE_META_COLUMNS.stream().filter(x -> 
!x.equals(HoodieRecord.PARTITION_PATH_METADATA_FIELD)).toArray(String[]::new);

Review Comment:
   I could not decode that. I don't see a need unless we want to carry over the 
partitioning from tableA to tableB. 
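
   For reference, a self-contained sketch of the meta-column dropping being discussed, mirroring the names in the diff (dropAllMetaFields, HOODIE_META_COLUMNS) but not the merged Hudi code, could look like this; the wrapper class is invented for illustration:

```java
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

final class MetaColumnDropExample {
  static Dataset<Row> dropMetaColumns(Dataset<Row> source, boolean dropAllMetaFields) {
    String[] colsToDrop = dropAllMetaFields
        // Drop every Hudi meta column, including _hoodie_partition_path.
        ? HoodieRecord.HOODIE_META_COLUMNS.toArray(new String[0])
        // Keep _hoodie_partition_path so the target table can reuse the source partitioning.
        : HoodieRecord.HOODIE_META_COLUMNS.stream()
            .filter(c -> !c.equals(HoodieRecord.PARTITION_PATH_METADATA_FIELD))
            .toArray(String[]::new);
    return source.drop(colsToDrop);
  }
}
```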
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-5111] Improve integration test coverage (#7092)

2022-11-09 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 1469469d6e [HUDI-5111] Improve integration test coverage (#7092)
1469469d6e is described below

commit 1469469d6ecaeaa7d960676c305383e7e31fcd43
Author: Sivabalan Narayanan 
AuthorDate: Wed Nov 9 09:56:14 2022 -0800

[HUDI-5111] Improve integration test coverage (#7092)



Co-authored-by: Raymond Xu <2701446+xushi...@users.noreply.github.com>
---
 .../deltastreamer-immutable-dataset.yaml   | 14 ++-
 ...treamer-long-running-multi-partitions-hive.yaml |  6 +--
 ...mer-long-running-multi-partitions-metadata.yaml | 18 +++--
 ...eltastreamer-long-running-multi-partitions.yaml | 18 +++--
 .../deltastreamer-medium-clustering.yaml   | 19 --
 ...ltastreamer-medium-full-dataset-validation.yaml | 18 +++--
 .../test-suite/deltastreamer-non-partitioned.yaml  | 14 ++-
 .../detlastreamer-long-running-example.yaml| 18 +++--
 .../demo/config/test-suite/simple-clustering.yaml  | 16 +++-
 .../config/test-suite/simple-deltastreamer.yaml| 12 ++
 .../config/test-suite/spark-immutable-dataset.yaml | 14 ++-
 .../spark-long-running-non-partitioned.yaml| 12 ++
 .../demo/config/test-suite/spark-long-running.yaml | 16 +++-
 .../config/test-suite/spark-medium-clustering.yaml | 16 +++-
 docker/demo/config/test-suite/spark-simple.yaml| 14 ++-
 .../hudi/integ/testsuite/HoodieTestSuiteJob.java   |  7 
 .../integ/testsuite/dag/nodes/PrestoQueryNode.java | 44 +-
 .../testsuite/dag/nodes/ValidateDatasetNode.java   | 13 ---
 .../testsuite/dag/nodes/SparkInsertNode.scala  |  1 +
 19 files changed, 238 insertions(+), 52 deletions(-)

diff --git a/docker/demo/config/test-suite/deltastreamer-immutable-dataset.yaml 
b/docker/demo/config/test-suite/deltastreamer-immutable-dataset.yaml
index 4903e3650c..a19617ef13 100644
--- a/docker/demo/config/test-suite/deltastreamer-immutable-dataset.yaml
+++ b/docker/demo/config/test-suite/deltastreamer-immutable-dataset.yaml
@@ -45,9 +45,21 @@ dag_content:
   delete_input_data: false
 type: ValidateDatasetNode
 deps: first_insert
+  first_presto_query:
+config:
+  execute_itr_count: 5
+  presto_props:
+prop1: "SET SESSION hive.parquet_use_column_names = true"
+  presto_queries:
+query1: "select count(*) from testdb.table1"
+result1: 3
+query2: "select count(*) from testdb.table1 group by _row_key having 
count(*) > 1"
+result2: 0
+type: PrestoQueryNode
+deps: second_validate
   last_validate:
 config:
   execute_itr_count: 5
   delete_input_data: true
 type: ValidateAsyncOperations
-deps: second_validate
\ No newline at end of file
+deps: first_presto_query
\ No newline at end of file
diff --git 
a/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-hive.yaml
 
b/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-hive.yaml
index 8b82415982..6e94b05a69 100644
--- 
a/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-hive.yaml
+++ 
b/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-hive.yaml
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 dag_name: deltastreamer-long-running-multi-partitions.yaml
-dag_rounds: 50
+dag_rounds: 20
 dag_intermittent_delay_mins: 1
 dag_content:
   first_insert:
@@ -71,7 +71,7 @@ dag_content:
 deps: first_delete
   second_validate:
 config:
-  validate_once_every_itr : 5
+  execute_itr_count: 20
   validate_hive: true
   delete_input_data: true
   max_wait_time_for_deltastreamer_catch_up_ms: 60
@@ -79,7 +79,7 @@ dag_content:
 deps: second_hive_sync
   last_validate:
 config:
-  execute_itr_count: 50
+  execute_itr_count: 20
   max_wait_time_for_deltastreamer_catch_up_ms: 60
 type: ValidateAsyncOperations
 deps: second_validate
diff --git 
a/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-metadata.yaml
 
b/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-metadata.yaml
index 031664cd15..9ba6993e1d 100644
--- 
a/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-metadata.yaml
+++ 
b/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-metadata.yaml
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 dag_name: deltastreamer-long-running-multi-partitions.yaml
-dag_rounds: 30
+dag_rounds: 20
 dag_intermittent_delay_mins: 1
 dag_content:
   first_insert:
@@ -65,9 +65,21 @@ dag_content:
   max_wait_t

[GitHub] [hudi] xushiyan merged pull request #7092: [HUDI-5111] Improve integration test coverage

2022-11-09 Thread GitBox


xushiyan merged PR #7092:
URL: https://github.com/apache/hudi/pull/7092


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


