[GitHub] [hudi] schlichtanders commented on issue #6808: [SUPPORT] Cannot sync to spark embedded derby hive meta store (the default one)
schlichtanders commented on issue #6808: URL: https://github.com/apache/hudi/issues/6808#issuecomment-1309901120 Thank you for the updated links. They show how to connect Hudi to a Derby metastore via `org.apache.derby.jdbc.ClientDriver`, where the Derby metastore is running on a dedicated port on localhost. This is not what this ticket is about. This ticket is about connecting to an embedded in-memory Derby metastore, which you connect to via the `org.apache.derby.jdbc.EmbeddedDriver` driver. Please refer to my cleaned-up example [above](https://github.com/apache/hudi/issues/6808#issuecomment-1308766125), which works for PostgreSQL but not for Derby. I think this is a Hudi bug: no error is thrown, but no sync happens either when `org.apache.derby.jdbc.EmbeddedDriver` is used. How can this support ticket be elevated to a bug report? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
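The distinction the commenter draws between the two Derby setups can be sketched as follows. This is a hedged illustration only: the class name `DerbySyncConfigs` is hypothetical, the property keys are the standard Hive metastore JDO settings, and the JDBC URLs use the standard Derby client and in-memory forms; the exact keys the original reproducer passes to Hudi's hive-sync are not shown here.

```java
import java.util.Map;

// Hypothetical sketch contrasting the two Derby metastore configurations.
public class DerbySyncConfigs {

    // Client/server Derby: the metastore DB listens on a dedicated port
    // (the setup the linked docs describe).
    static Map<String, String> clientMode() {
        return Map.of(
            "javax.jdo.option.ConnectionDriverName", "org.apache.derby.jdbc.ClientDriver",
            "javax.jdo.option.ConnectionURL", "jdbc:derby://localhost:1527/metastore_db;create=true");
    }

    // Embedded Derby: the metastore DB runs in-process, no port involved
    // (the setup this ticket is about).
    static Map<String, String> embeddedMode() {
        return Map.of(
            "javax.jdo.option.ConnectionDriverName", "org.apache.derby.jdbc.EmbeddedDriver",
            "javax.jdo.option.ConnectionURL", "jdbc:derby:memory:metastore_db;create=true");
    }

    public static void main(String[] args) {
        System.out.println(clientMode().get("javax.jdo.option.ConnectionDriverName"));
        System.out.println(embeddedMode().get("javax.jdo.option.ConnectionDriverName"));
    }
}
```

The key difference is that the embedded URL names no host or port, so nothing outside the Spark driver process can observe the metastore, which is why a silent sync failure is hard to diagnose from the outside.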
[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7080: [RFC-64] New APIs to facilitate faster Query Engine integrations
xiarixiaoyao commented on code in PR #7080: URL: https://github.com/apache/hudi/pull/7080#discussion_r1018743271 ## rfc/rfc-64/rfc-64.md: ## @@ -0,0 +1,509 @@ + + +# RFC-64: New Hudi Table Spec API for Query Integrations + +## Proposers + +- @codope +- @alexeykudinkin + +## Approvers + +- @xiarixiao +- @danny0405 +- @vinothchandar +- @prasanna +- @xushiyan + +## Status + +JIRA: [HUDI-4141](https://issues.apache.org/jira/browse/HUDI-4141) + +## Abstract + +In this RFC we propose a new set of higher-level Table Spec APIs that would allow +us to up-level our current integration model with new Query Engines, enabling +faster turnaround for such integrations. + +## Background + +[WIP] Plan + +- [x] High-level capabilities the API will be supporting +- [ ] High-level overview of integration points of the query engines +- [ ] Deep-dive into existing Spark/Flink/Trino integration points + +The figure below shows the read path in Spark and Presto/Trino query engines. At +a high level, integration with these engines requires providing the following capabilities: + +1. **Listing**: this stage requires enumerating all of the data files within a table representing a particular +snapshot of its state, which will subsequently be scanned to fetch the data requested by the target query. All query engines +expect the output of this stage in the form of balanced file "splits", which evens out any skew in +file sizes. This stage is where various _pruning_ techniques such as partition pruning and file-level column +statistics filtering are applied. + +2. **Reading**: this stage transforms every file split (identified in the previous stage) into an iterator over +the records persisted in it.
This stage is where the actual data shaping to suit the needs of a particular query takes +place: records are projected into the schema expected by the query, corresponding filters are pushed down to +reduce the amount of data fetched from storage, schema evolution is handled, and delayed operations are reconciled (merging/deleting). + +![](./read_path.png) + +## Implementation + +[WIP] Plan + +- [x] Components & API pseudo-code +- [x] Diagram show-casing data flow and integration points +- [ ] Example integration walk-through + +With this RFC, we aim to achieve the following: + + - Up-level our current integration model with Query Engines, abstracting away Hudi's lower-level components, + instead providing simple and eloquent abstractions supplying the necessary capabilities (for listing, reading) + + - Make sure Hudi's APIs are high-level enough to be useful in providing programmatic access to the data + for users willing to access it directly + +To achieve that, we propose to introduce two tiers of APIs: + +1. **Mid-level** (engine-friendly): these APIs will abstract away Hudi's internals behind +simple and familiar concepts and components such as file splits, iterators, +statistics, etc. + +2.
**High-level** (user-friendly): these APIs will provide programmatic access to Hudi's capabilities, +bypassing query engine interfaces (like SQL, Spark DS, etc.) + +Following the classic [layer-cake architecture](https://cs.uwaterloo.ca/~m2nagapp/courses/CS446/1195/Arch_Design_Activity/Layered.pdf) +would allow us to abstract away the complexity of the lower-level components and APIs from +Query Engines (on the *querying* side) as well as from the end users: + + - Mid-level components will internally leverage Hudi's core building blocks (such as +`FileSystemView`, `FileIndex`, `MergedLogRecordScanner`, `FileReader`, etc.), while exposing APIs +providing the capabilities expected by the Query Engines (listed above) + + - High-level components will be stacked on top of mid-level ones and can be used directly by users to +read and write to tables. + +These APIs will be built bottom-up, initially focusing on the mid-level +components, then the higher-level ones. Once the mid-level components are ready, we could +demonstrate their utility by integrating with one of the query engines. + +In the initial phase of this project we will focus the effort on the read side of the integration, +shifting focus to the write side in the subsequent phase. + +### Components & API + +Open Qs + +1. ~~Some of the models defined here already exist and might represent + lower-level components. Do we expose them in higher-level APIs or do we + bifurcate and hide them as internal impl details (for ex, `HoodieFileGroup` + vs `HoodieInternalFileGroup`)?~~ + 1. *We'll be re-using existing models, abstracting them behind projected interfaces + exposing only the necessary functionality, hiding away the complexity of the whole impl* +2. ~~What about Expressions? Wrapping is going to be hard because we need to + analyze expression structure, which is not going to be possible if we wrap.~~ + 1. We will need to introduce our own Expression hierarchy supporting a *superset* + of the commonly used expressions.
While this requir
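The Expression-hierarchy answer quoted above (cut off in this message) can be illustrated with a minimal sketch. The class names here are hypothetical, not Hudi's actual API; the point is that each engine translates its native predicates into one tree Hudi can analyze (e.g., for partition pruning), which wrapping opaque engine types would not allow:

```java
import java.util.Map;

// Hypothetical sketch of an engine-agnostic Expression hierarchy.
public class ExpressionSketch {

    interface Expression {
        boolean eval(Map<String, Object> row);
    }

    // Leaf and composite nodes: an analyzable tree, unlike a wrapped engine predicate.
    record EqualTo(String field, Object value) implements Expression {
        public boolean eval(Map<String, Object> row) { return value.equals(row.get(field)); }
    }

    record GreaterThan(String field, long value) implements Expression {
        public boolean eval(Map<String, Object> row) { return ((Number) row.get(field)).longValue() > value; }
    }

    record And(Expression left, Expression right) implements Expression {
        public boolean eval(Map<String, Object> row) { return left.eval(row) && right.eval(row); }
    }

    // Example predicate an engine might hand to Hudi after translation.
    static Expression partitionFilter() {
        return new And(new EqualTo("dt", "2021/03/01"), new GreaterThan("price", 12));
    }

    public static void main(String[] args) {
        System.out.println(partitionFilter().eval(Map.of("dt", "2021/03/01", "price", 15L)));
    }
}
```

Because the tree is made of known node types, a planner can walk it, pick out the partition-column conjuncts, and discard the rest, which is exactly what wrapping prevents.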
[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7080: [RFC-64] New APIs to facilitate faster Query Engine integrations
xiarixiaoyao commented on code in PR #7080: URL: https://github.com/apache/hudi/pull/7080#discussion_r1018750344 ## rfc/rfc-64/rfc-64.md
[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7080: [RFC-64] New APIs to facilitate faster Query Engine integrations
xiarixiaoyao commented on code in PR #7080: URL: https://github.com/apache/hudi/pull/7080#discussion_r1018748548 ## rfc/rfc-64/rfc-64.md
[GitHub] [hudi] hudi-bot commented on pull request #7151: [MINOR] Performance improvement of flink ITs with reused miniCluster
hudi-bot commented on PR #7151: URL: https://github.com/apache/hudi/pull/7151#issuecomment-1309888204 ## CI report: * db207361ce3e01c7b39153f3156a6e19d8075212 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12918) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7080: [RFC-64] New APIs to facilitate faster Query Engine integrations
xiarixiaoyao commented on code in PR #7080: URL: https://github.com/apache/hudi/pull/7080#discussion_r1018740543 ## rfc/rfc-64/rfc-64.md
[GitHub] [hudi] albericgenius commented on pull request #7096: [MINOR] Fix OverwriteWithLatestAvroPayload full class name
albericgenius commented on PR #7096: URL: https://github.com/apache/hudi/pull/7096#issuecomment-1309840920 @xushiyan sorry to bother you, could you help to assign JIRA contributor permission to me (cnuliuwei)?
[GitHub] [hudi] hudi-bot commented on pull request #7003: [minor] add more test for rfc46
hudi-bot commented on PR #7003: URL: https://github.com/apache/hudi/pull/7003#issuecomment-1309827449 ## CI report: * b8ad950666cb87456151c70688b5eb7ad423955f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12917)
[GitHub] [hudi] hudi-bot commented on pull request #7167: [HUDI-5094] Remove partition fields before transform bytes to avro,if enable DROP_PARTITION_COLUMNS.
hudi-bot commented on PR #7167: URL: https://github.com/apache/hudi/pull/7167#issuecomment-1309827733 ## CI report: * 6b165aec634812ba8d6f4a55d0dfb8578031d25c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12916)
[GitHub] [hudi] eric9204 commented on issue #6966: [SUPPORT]HoodieWriteHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=id308723 partitionPath=202210141643}, currentLocation='null',
eric9204 commented on issue #6966: URL: https://github.com/apache/hudi/issues/6966#issuecomment-1309826887 #7167
[jira] [Updated] (HUDI-4812) Lazy partition listing and file groups fetching in Spark Query
[ https://issues.apache.org/jira/browse/HUDI-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4812: -- Story Points: 16 (was: 2) > Lazy partition listing and file groups fetching in Spark Query > -- > > Key: HUDI-4812 > URL: https://issues.apache.org/jira/browse/HUDI-4812 > Project: Apache Hudi > Issue Type: Improvement > Components: spark > Reporter: Yuwei Xiao > Assignee: Yuwei Xiao > Priority: Blocker > Labels: pull-request-available > Fix For: 0.13.0 > > > In the current Spark query implementation, the FileIndex will refresh and load > all file groups into cache in order to serve subsequent queries. > > For a large table with many partitions, this may introduce much overhead in > initialization. Meanwhile, the query itself may come with a partition filter, > so the loading of file groups will be unnecessary. > > So to optimize, the whole refresh logic will become lazy, where the actual work > will be carried out only after the partition filter is applied. -- This message was sent by Atlassian Jira (v8.20.10#820010)
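The lazy refresh described in the ticket can be sketched as memoized, per-partition listing. The names below are illustrative, not Hudi's actual `FileIndex` API: only the partitions that survive the partition filter are ever listed, and each is listed at most once.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch of lazy per-partition file-group listing (HUDI-4812).
public class LazyFileIndex {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> lister; // stands in for a storage listing call
    private int listingCalls = 0;

    LazyFileIndex(Function<String, List<String>> lister) { this.lister = lister; }

    // Listing happens on first access after filtering, and the result is memoized.
    List<String> filesIn(String partition) {
        return cache.computeIfAbsent(partition, p -> {
            listingCalls++;
            return lister.apply(p);
        });
    }

    int listingCalls() { return listingCalls; }

    public static void main(String[] args) {
        LazyFileIndex idx = new LazyFileIndex(p -> List.of(p + "/f1.parquet", p + "/f2.parquet"));
        idx.filesIn("dt=2021-03-01"); // triggers one listing
        idx.filesIn("dt=2021-03-01"); // served from cache
        System.out.println(idx.listingCalls());
    }
}
```

The eager variant would enumerate every partition up front during refresh; deferring the work until after the filter means a query touching one partition of a thousand pays for one listing call instead of a thousand.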
[GitHub] [hudi] eric9204 commented on pull request #7063: [HUDI-5094] Remove partition fields before transform bytes to avro,if enable DROP_PARTITION_COLUMNS
eric9204 commented on PR #7063: URL: https://github.com/apache/hudi/pull/7063#issuecomment-1309807382 @YannByron I have removed the unnecessary changes and redundant judgment conditions, and added a UT for the logic changes. If you are available, please review this PR instead: https://github.com/apache/hudi/pull/7167
[jira] [Updated] (HUDI-5188) ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo
[ https://issues.apache.org/jira/browse/HUDI-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-5188:
----------------------------------
    Description:
Getting the following exception when running trivial Spark DS workloads:

{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceWriteOptions.{PRECOMBINE_FIELD, RECORDKEY_FIELD}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark: SparkSession
import spark.implicits._

val basePath = "/tmp/test"
val writerOpts: Map[String, String] = Map(
  "hoodie.table.name" -> "test",
  "hoodie.table.type" -> "COPY_ON_WRITE",
  PRECOMBINE_FIELD.key() -> "id",
  RECORDKEY_FIELD.key() -> "id",
  DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "dt"
)

val firstBatchDF = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 1, s"2021/03/0${i % 2 + 1}", s"${i % 3}"))
  .toDF("id", "name", "price", "version", "dt", "hh")

firstBatchDF.write.
  options(writerOpts).
  option("hoodie.parquet.compression.codec", "gzip").
  format("hudi").
  mode(SaveMode.Overwrite).
  save(basePath)
{code}

{code:java}
java.lang.ClassCastException: class org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo (org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @3b1895e; org.apache.hudi.metadata.HoodieBackedTableMetadataWriter$DirectoryInfo is in unnamed module of loader scala.reflect.internal.util.ScalaClassLoader$URLClassLoader @58882a93)
  at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.listAllPartitions(HoodieBackedTableMetadataWriter.java:645)
  at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initialCommit(HoodieBackedTableMetadataWriter.java:1070)
  at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeFromFilesystem(HoodieBackedTableMetadataWriter.java:560)
  at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeIfNeeded(HoodieBackedTableMetadataWriter.java:393)
  at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.initialize(SparkHoodieBackedTableMetadataWriter.java:120)
  at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.<init>(HoodieBackedTableMetadataWriter.java:173)
  at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.<init>(SparkHoodieBackedTableMetadataWriter.java:89)
  at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:75)
  at org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:514)
  at org.apache.hudi.client.SparkRDDWriteClient.doInitTable(SparkRDDWriteClient.java:499)
  at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1548)
  at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1580)
  at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:156)
  at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:331)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:148)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
  at org.apache.spark.sql.ca
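The exception message above is the classic symptom of the same class name being loaded by two different classloaders (here Spark's `ChildFirstURLClassLoader` and the Scala REPL's loader): the two runtime classes share a name but are not cast-compatible. The following standalone demo reproduces the phenomenon with a child-first loader; all names (`LoaderDemo`, `Payload`) are illustrative and unrelated to Hudi's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Demonstrates that a class loaded by two different classloaders yields two
// distinct, cast-incompatible runtime classes, even though the names match.
public class LoaderDemo {

    public static class Payload {}  // the class we will load twice

    // Child-first: defines Payload itself instead of delegating to the parent,
    // the same delegation model as Spark's ChildFirstURLClassLoader.
    static class ChildFirstLoader extends ClassLoader {
        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (name.equals("LoaderDemo$Payload")) {
                byte[] bytes = readClassBytes(name);
                return defineClass(name, bytes, 0, bytes.length);
            }
            return super.loadClass(name, resolve);
        }

        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            try (InputStream in = getResourceAsStream(name + ".class");
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Class<?> a = new ChildFirstLoader().loadClass("LoaderDemo$Payload");
        Class<?> b = new ChildFirstLoader().loadClass("LoaderDemo$Payload");
        System.out.println(a.getName().equals(b.getName()));  // true: identical names
        System.out.println(a == b);                           // false: distinct runtime classes
        Object payload = a.getDeclaredConstructor().newInstance();
        System.out.println(b.isInstance(payload));            // false: a cast would throw ClassCastException
    }
}
```

This is why the message lists the same fully-qualified name twice with two different loaders: class identity in the JVM is the pair (name, defining loader), not the name alone.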
[jira] [Updated] (HUDI-5188) ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo
[ https://issues.apache.org/jira/browse/HUDI-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-5188:
----------------------------------
    Priority: Blocker  (was: Major)

> ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo
>
>                 Key: HUDI-5188
>                 URL: https://issues.apache.org/jira/browse/HUDI-5188
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.13.0
[jira] [Updated] (HUDI-5188) ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo
[ https://issues.apache.org/jira/browse/HUDI-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-5188:
----------------------------------
    Description: After [#7036|https://github.com/apache/hudi/pull/7036/files], started to get the following exception when running trivial Spark DS workloads.
[jira] [Created] (HUDI-5188) ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo
Alexey Kudinkin created HUDI-5188:
----------------------------------
             Summary: ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo
                 Key: HUDI-5188
                 URL: https://issues.apache.org/jira/browse/HUDI-5188
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Alexey Kudinkin

After XXX, started to get the following exception when running trivial Spark DS workloads.
[jira] [Updated] (HUDI-5188) ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo
[ https://issues.apache.org/jira/browse/HUDI-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-5188:
----------------------------------
    Fix Version/s: 0.13.0

> ClassCastException: class HoodieBackedTableMetadataWriter$DirectoryInfo cannot be cast to class HoodieBackedTableMetadataWriter$DirectoryInfo
>
>                 Key: HUDI-5188
>                 URL: https://issues.apache.org/jira/browse/HUDI-5188
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Priority: Major
>             Fix For: 0.13.0
[GitHub] [hudi] zhangyue19921010 commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency
zhangyue19921010 commented on PR #5416: URL: https://github.com/apache/hudi/pull/5416#issuecomment-1309800894

> Great job buddy for patiently addressing all comments !

Thanks a lot for your help! @alexeykudinkin and @nsivabalan
[jira] [Closed] (HUDI-5145) Remove HDFS from DeltaStreamer UT/FT
[ https://issues.apache.org/jira/browse/HUDI-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu closed HUDI-5145.
----------------------------
    Resolution: Done

> Remove HDFS from DeltaStreamer UT/FT
>
>                 Key: HUDI-5145
>                 URL: https://issues.apache.org/jira/browse/HUDI-5145
>             Project: Apache Hudi
>          Issue Type: Test
>            Reporter: Raymond Xu
>            Assignee: Raymond Xu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.12.2
[GitHub] [hudi] hudi-bot commented on pull request #7039: [HUDI-5080] Fixing unpersist to consider only rdds pertaining to current write operation
hudi-bot commented on PR #7039: URL: https://github.com/apache/hudi/pull/7039#issuecomment-1309782856

## CI report:

* dc7c2bdefdb2a84b27b40751714a31604e2931eb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12911)
* fe186d03969c6d1b4a62d9c506585a8b2ea05dd0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12920)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #7039: [HUDI-5080] Fixing unpersist to consider only rdds pertaining to current write operation
hudi-bot commented on PR #7039: URL: https://github.com/apache/hudi/pull/7039#issuecomment-1309778505

## CI report:

* dc7c2bdefdb2a84b27b40751714a31604e2931eb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12911)
* fe186d03969c6d1b4a62d9c506585a8b2ea05dd0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #7169: [HUDI-5186] Parallelism does not take effect when hoodie.combine.before.upsert/insert false
hudi-bot commented on PR #7169: URL: https://github.com/apache/hudi/pull/7169#issuecomment-1309774543

## CI report:

* f9e08f5d14106a18fe59bf752077ab1043595e03 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12899) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12905) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12915)
[hudi] branch master updated: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests (#7171)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 7fe35fb895 [HUDI-5145] Avoid starting HDFS in hudi-utilities tests (#7171) 7fe35fb895 is described below commit 7fe35fb895fb58f0ea599974defd7ffe310a964e Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com> AuthorDate: Thu Nov 10 12:23:24 2022 +0800 [HUDI-5145] Avoid starting HDFS in hudi-utilities tests (#7171) --- .../testutils/minicluster/HdfsTestService.java | 8 +- .../TestDFSHoodieTestSuiteWriterAdapter.java | 8 +- .../integ/testsuite/TestFileDeltaInputWriter.java | 10 +- .../testsuite/job/TestHoodieTestSuiteJob.java | 74 ++--- .../reader/TestDFSAvroDeltaInputReader.java| 10 +- .../reader/TestDFSHoodieDatasetInputReader.java| 6 +- .../functional/HoodieDeltaStreamerTestBase.java| 10 +- .../functional/TestHoodieDeltaStreamer.java| 314 ++--- .../TestHoodieMultiTableDeltaStreamer.java | 40 +-- .../hudi/utilities/sources/TestAvroDFSSource.java | 4 +- .../hudi/utilities/sources/TestCsvDFSSource.java | 4 +- .../utilities/sources/TestGcsEventsSource.java | 2 +- .../hudi/utilities/sources/TestJsonDFSSource.java | 4 +- .../utilities/sources/TestParquetDFSSource.java| 2 +- .../hudi/utilities/sources/TestS3EventsSource.java | 4 +- .../hudi/utilities/sources/TestSqlSource.java | 11 +- .../debezium/TestAbstractDebeziumSource.java | 2 +- .../utilities/testutils/UtilitiesTestBase.java | 61 ++-- .../AbstractCloudObjectsSourceTestBase.java| 2 +- .../sources/AbstractDFSSourceTestBase.java | 6 +- .../transform/TestSqlFileBasedTransformer.java | 23 +- 21 files changed, 313 insertions(+), 292 deletions(-) diff --git a/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java b/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java index 0766c61c67..727e1e4db6 100644 
--- a/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java +++ b/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java @@ -53,8 +53,12 @@ public class HdfsTestService { private MiniDFSCluster miniDfsCluster; public HdfsTestService() throws IOException { -hadoopConf = new Configuration(); -workDir = Files.createTempDirectory("temp").toAbsolutePath().toString(); +this(new Configuration()); + } + + public HdfsTestService(Configuration hadoopConf) throws IOException { +this.hadoopConf = hadoopConf; +this.workDir = Files.createTempDirectory("temp").toAbsolutePath().toString(); } public Configuration getHadoopConf() { diff --git a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java index 2b69a319a5..9c21ee6bd4 100644 --- a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java +++ b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java @@ -65,7 +65,7 @@ public class TestDFSHoodieTestSuiteWriterAdapter extends UtilitiesTestBase { @BeforeAll public static void initClass() throws Exception { -UtilitiesTestBase.initTestServices(false, false); +UtilitiesTestBase.initTestServices(true, false, false); } @AfterAll @@ -131,15 +131,15 @@ public class TestDFSHoodieTestSuiteWriterAdapter extends UtilitiesTestBase { // TODO(HUDI-3668): Fix this test public void testDFSWorkloadSinkWithMultipleFilesFunctional() throws IOException { DeltaConfig dfsSinkConfig = new DFSDeltaConfig(DeltaOutputMode.DFS, DeltaInputType.AVRO, -new SerializableConfiguration(jsc.hadoopConfiguration()), dfsBasePath, dfsBasePath, +new SerializableConfiguration(jsc.hadoopConfiguration()), basePath, basePath, schemaProvider.getSourceSchema().toString(), 10240L, jsc.defaultParallelism(), 
false, false); DeltaWriterAdapter dfsDeltaWriterAdapter = DeltaWriterFactory .getDeltaWriterAdapter(dfsSinkConfig, 1); FlexibleSchemaRecordGenerationIterator itr = new FlexibleSchemaRecordGenerationIterator(1000, schemaProvider.getSourceSchema().toString()); dfsDeltaWriterAdapter.write(itr); -FileSystem fs = FSUtils.getFs(dfsBasePath, jsc.hadoopConfiguration()); -FileStatus[] fileStatuses = fs.listStatus(new Path(dfsBasePath)); +FileSystem fs = FSUtils.getFs(basePath, jsc.hadoopConfiguration()); +FileStatus[] fileStatuses = fs.listStatus(new Path(basePath)); // Since maxFileSize was 10240L and we produced 1K records each close to 1K size, we
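The HdfsTestService change above is a standard constructor-delegation refactor: the no-arg constructor now forwards to a new overload that accepts an externally supplied Hadoop Configuration, so tests can inject a shared one instead of each service building its own. A minimal self-contained sketch of that pattern; here `Configuration` is a hypothetical stand-in for Hadoop's `org.apache.hadoop.conf.Configuration`, so the snippet compiles without Hadoop on the classpath:

```java
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical stand-in for org.apache.hadoop.conf.Configuration.
class Configuration {}

public class HdfsTestServiceSketch {
  private final Configuration hadoopConf;
  private final String workDir;

  // Before the patch this constructor built its own Configuration;
  // now it delegates to the injectable overload below.
  public HdfsTestServiceSketch() throws IOException {
    this(new Configuration());
  }

  public HdfsTestServiceSketch(Configuration hadoopConf) throws IOException {
    this.hadoopConf = hadoopConf;
    this.workDir = Files.createTempDirectory("temp").toAbsolutePath().toString();
  }

  public Configuration getHadoopConf() {
    return hadoopConf;
  }

  public String getWorkDir() {
    return workDir;
  }
}
```

Callers that relied on the default constructor keep working unchanged, while tests that want to share one Configuration across services can pass it in explicitly.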
[GitHub] [hudi] xushiyan merged pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests
xushiyan merged PR #7171: URL: https://github.com/apache/hudi/pull/7171 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests
xushiyan commented on PR #7171: URL: https://github.com/apache/hudi/pull/7171#issuecomment-1309749162 The test run time shows a good improvement. **In the master version** UT ``` [INFO] hudi-utilities_2.11 SUCCESS [08:32 min] ``` FT ``` [INFO] hudi-utilities_2.11 SUCCESS [59:44 min] ``` **With this patch** UT ``` [INFO] hudi-utilities_2.11 SUCCESS [07:14 min] ``` FT ``` [INFO] hudi-utilities_2.11 SUCCESS [38:23 min] ``` Overall, total run time dropped by about 30%!
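For the record, the arithmetic behind the quoted figure checks out; a quick sanity check using the times from the log excerpts above:

```java
public class RunTimeDrop {
  public static void main(String[] args) {
    // master: UT 08:32 + FT 59:44, converted to seconds
    int before = (8 * 60 + 32) + (59 * 60 + 44);
    // with the patch: UT 07:14 + FT 38:23, converted to seconds
    int after = (7 * 60 + 14) + (38 * 60 + 23);
    double dropPct = 100.0 * (before - after) / before;
    // roughly 33%, consistent with the reported "about 30%" drop
    System.out.printf("%.1f%%%n", dropPct);
  }
}
```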
[GitHub] [hudi] danny0405 commented on a diff in pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution
danny0405 commented on code in PR #5830: URL: https://github.com/apache/hudi/pull/5830#discussion_r1018637325 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java: ## @@ -817,6 +805,8 @@ public abstract static class Builder { public abstract Builder withReaderSchema(Schema schema); +public abstract Builder withReaderSchema(InternalSchema internalSchema); + Review Comment: We should not keep two `withReaderSchema` overloads here; keep only one of them.
[GitHub] [hudi] danny0405 commented on a diff in pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution
danny0405 commented on code in PR #5830: URL: https://github.com/apache/hudi/pull/5830#discussion_r1018636538 ## hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java: ## @@ -91,6 +91,11 @@ public static InternalSchema convert(Schema schema) { return new InternalSchema(fields); } + /** Convert an avro schema into internalSchema with given versionId. */ + public static InternalSchema convertToEmpty(Schema schema) { +return new InternalSchema(InternalSchema.EMPTY_SCHEMA_VERSION_ID, schema); Review Comment: This is also confusing: an internal schema with an 'empty' version id that still holds an Avro schema internally. Please clarify it.
[GitHub] [hudi] danny0405 commented on a diff in pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution
danny0405 commented on code in PR #5830: URL: https://github.com/apache/hudi/pull/5830#discussion_r1018635942 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java: ## @@ -141,4 +139,8 @@ public HoodieLogBlock prev() throws IOException { return this.currentReader.prev(); } + private Schema getReaderSchema() { +boolean useWriterSchema = !readerSchema.isEmptySchema(); +return useWriterSchema ? null : readerSchema.getAvroSchema(); Review Comment: This is confusing: why do we use the writer schema when the reader schema is not empty? And what does an empty internal schema mean?
[GitHub] [hudi] danny0405 commented on pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution
danny0405 commented on PR #5830: URL: https://github.com/apache/hudi/pull/5830#issuecomment-1309745375 [3981.patch.zip](https://github.com/apache/hudi/files/9977268/3981.patch.zip) Thanks for the contribution; I have reviewed part of it and left a local patch here along with some comments ~
[GitHub] [hudi] xushiyan commented on a diff in pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests
xushiyan commented on code in PR #7171: URL: https://github.com/apache/hudi/pull/7171#discussion_r1018635258 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java: ## @@ -250,7 +250,7 @@ static HoodieDeltaStreamer.Config makeConfig(String basePath, WriteOperationType cfg.operation = op; cfg.enableHiveSync = enableHiveSync; cfg.sourceOrderingField = sourceOrderingField; - cfg.propsFilePath = dfsBasePath + "/" + propsFilename; + cfg.propsFilePath = UtilitiesTestBase.basePath + "/" + propsFilename; Review Comment: Because there is another local variable with the same name; I don't want to rename any more variables.
[GitHub] [hudi] hudi-bot commented on pull request #7151: [MINOR] Performance improvement of flink ITs with reused miniCluster
hudi-bot commented on PR #7151: URL: https://github.com/apache/hudi/pull/7151#issuecomment-1309731298 ## CI report: * fb4b5616278cdd662ccee6add7bbb7b0684554ac Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12815) * db207361ce3e01c7b39153f3156a6e19d8075212 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12918) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #7003: [minor] add more test for rfc46
hudi-bot commented on PR #7003: URL: https://github.com/apache/hudi/pull/7003#issuecomment-1309731121 ## CI report: * da89f1a57bae167a6474092cc07ca0880e4028b8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12910) * b8ad950666cb87456151c70688b5eb7ad423955f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12917)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests
nsivabalan commented on code in PR #7171: URL: https://github.com/apache/hudi/pull/7171#discussion_r1018623450 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java: ## @@ -250,7 +250,7 @@ static HoodieDeltaStreamer.Config makeConfig(String basePath, WriteOperationType cfg.operation = op; cfg.enableHiveSync = enableHiveSync; cfg.sourceOrderingField = sourceOrderingField; - cfg.propsFilePath = dfsBasePath + "/" + propsFilename; + cfg.propsFilePath = UtilitiesTestBase.basePath + "/" + propsFilename; Review Comment: `basePath` is protected in UtilitiesTestBase, and TestHoodieDeltaStreamer extends from it, so why prefix it with the class name? ## hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java: ## @@ -272,7 +272,7 @@ static HoodieDeltaStreamer.Config makeConfigForHudiIncrSrc(String srcBasePath, S cfg.sourceClassName = HoodieIncrSource.class.getName(); cfg.operation = op; cfg.sourceOrderingField = "timestamp"; - cfg.propsFilePath = dfsBasePath + "/test-downstream-source.properties"; + cfg.propsFilePath = UtilitiesTestBase.basePath + "/test-downstream-source.properties"; Review Comment: Same here.
[GitHub] [hudi] hudi-bot commented on pull request #7151: [MINOR] Performance improvement of flink ITs with reused miniCluster
hudi-bot commented on PR #7151: URL: https://github.com/apache/hudi/pull/7151#issuecomment-1309727775 ## CI report: * fb4b5616278cdd662ccee6add7bbb7b0684554ac Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12815) * db207361ce3e01c7b39153f3156a6e19d8075212 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #7003: [minor] add more test for rfc46
hudi-bot commented on PR #7003: URL: https://github.com/apache/hudi/pull/7003#issuecomment-1309727612 ## CI report: * da89f1a57bae167a6474092cc07ca0880e4028b8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12910) * b8ad950666cb87456151c70688b5eb7ad423955f UNKNOWN
[GitHub] [hudi] TengHuo commented on issue #7106: [PROPOSE] Add column prune support for other payload class
TengHuo commented on issue #7106: URL: https://github.com/apache/hudi/issues/7106#issuecomment-1309697353 @alexeykudinkin Got it, thanks a lot. I saw the code in branch `release-feature-rfc46`, specifically the interface `HoodieRecordMerger`. We will keep an eye on it and migrate our code base onto it. Looking forward to seeing RFC-46 in master.
[GitHub] [hudi] trushev commented on a diff in pull request #7151: [MINOR] Performance improvement of flink ITs with reused miniCluster
trushev commented on code in PR #7151: URL: https://github.com/apache/hudi/pull/7151#discussion_r1018601277 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/AbstractHoodieTestBase.java: ## @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.utils; + +import org.apache.hudi.exception.HoodieException; + +import org.apache.flink.test.util.AbstractTestBase; +import org.apache.flink.test.util.MiniClusterWithClientResource; + +import org.junit.jupiter.api.AfterAll; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeAll; + +import java.lang.reflect.Field; +import java.util.Arrays; + +import static java.lang.reflect.Modifier.isPublic; +import static java.lang.reflect.Modifier.isStatic; + +/** + * Hoodie base class for tests that run multiple tests and want to reuse the same Flink cluster. + * Unlike {@link AbstractTestBase}, this class is designed to run with JUnit 5. 
+ */ +public abstract class AbstractHoodieTestBase extends AbstractTestBase { + + private static final MiniClusterWithClientResource MINI_CLUSTER_RESOURCE = getMiniClusterFromParentClass(); + + @BeforeAll Review Comment: Implemented as a JUnit 5 extension.
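For readers following along, the point of the JUnit 5 extension mentioned here is to start the expensive Flink mini-cluster once per JVM and reuse it across test classes, rather than per class. The sketch below shows just that sharing pattern in plain Java; `MiniClusterHolder` is a hypothetical stand-in for Flink's `MiniClusterWithClientResource`, and the actual JUnit 5 `Extension` wiring (e.g. storing the resource in the root `ExtensionContext` store) is omitted:

```java
// Lazily-initialized, JVM-wide shared resource: the essence of reusing one
// Flink mini-cluster across many test classes.
final class MiniClusterHolder implements AutoCloseable {
  private static volatile MiniClusterHolder instance;

  private boolean running;

  private MiniClusterHolder() {
    // Stand-in for starting the real mini-cluster.
    this.running = true;
  }

  static MiniClusterHolder getOrCreate() {
    if (instance == null) {
      synchronized (MiniClusterHolder.class) {
        if (instance == null) {
          MiniClusterHolder created = new MiniClusterHolder();
          // Tear down once, when the JVM (i.e. the whole test run) exits.
          Runtime.getRuntime().addShutdownHook(new Thread(created::close));
          instance = created;
        }
      }
    }
    return instance;
  }

  boolean isRunning() {
    return running;
  }

  @Override
  public void close() {
    // Stand-in for stopping the real mini-cluster.
    running = false;
  }
}
```

In JUnit 5 the same effect is typically achieved by registering the resource as an `ExtensionContext.Store.CloseableResource` in the root context store, which JUnit closes after the last test finishes.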
[GitHub] [hudi] KnightChess commented on issue #7147: [SUPPORT] Some classes lacked and couldn't be imported in the module of hudi-spark3.2.x_2.12 and hudi-spark3.3.x_2.12
KnightChess commented on issue #7147: URL: https://github.com/apache/hudi/issues/7147#issuecomment-1309691965 @GoodJeek `HoodieSqlBaseParser` is generated by the plugin `genantlr4-maven-plugin` when you compile.
[GitHub] [hudi] hudi-bot commented on pull request #7167: [HUDI-5094] Remove partition fields before transform bytes to avro,if enable DROP_PARTITION_COLUMNS.
hudi-bot commented on PR #7167: URL: https://github.com/apache/hudi/pull/7167#issuecomment-1309685076 ## CI report: * 8881615927848e76214065870119e910e41e9c35 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12904) * 6b165aec634812ba8d6f4a55d0dfb8578031d25c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12916)
[jira] [Updated] (HUDI-4142) RFC for new Table APIs proposal
[ https://issues.apache.org/jira/browse/HUDI-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4142: -- Reviewers: Alexey Kudinkin > RFC for new Table APIs proposal > --- > > Key: HUDI-4142 > URL: https://issues.apache.org/jira/browse/HUDI-4142 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available > Fix For: 0.13.0 > > > Document all APIs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4142) RFC for new Table APIs proposal
[ https://issues.apache.org/jira/browse/HUDI-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4142: -- Priority: Blocker (was: Major) > RFC for new Table APIs proposal > --- > > Key: HUDI-4142 > URL: https://issues.apache.org/jira/browse/HUDI-4142 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available > Fix For: 0.13.0 > > > Document all APIs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4142) RFC for new Table APIs proposal
[ https://issues.apache.org/jira/browse/HUDI-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4142: -- Fix Version/s: 0.13.0 (was: 1.0.0) > RFC for new Table APIs proposal > --- > > Key: HUDI-4142 > URL: https://issues.apache.org/jira/browse/HUDI-4142 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > > Document all APIs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4141) RFC-54 Table Format APIs
[ https://issues.apache.org/jira/browse/HUDI-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4141: -- Description: RFC: [https://github.com/apache/hudi/pull/7080] (was: New Table APIs to create/update table programmatically. For example, HudiTable.create(tableConfigs) HudiTable.forPath(basePath) HudiTable.delete() HudiTable.merge()) > RFC-54 Table Format APIs > > > Key: HUDI-4141 > URL: https://issues.apache.org/jira/browse/HUDI-4141 > Project: Apache Hudi > Issue Type: Epic >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Fix For: 1.0.0 > > > RFC: [https://github.com/apache/hudi/pull/7080] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #7167: [HUDI-5094] Remove partition fields before transform bytes to avro,if enable DROP_PARTITION_COLUMNS.
hudi-bot commented on PR #7167: URL: https://github.com/apache/hudi/pull/7167#issuecomment-1309681703 ## CI report: * 8881615927848e76214065870119e910e41e9c35 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12904) * 6b165aec634812ba8d6f4a55d0dfb8578031d25c UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #7169: [HUDI-5186] Parallelism does not take effect when hoodie.combine.before.upsert/insert false
hudi-bot commented on PR #7169: URL: https://github.com/apache/hudi/pull/7169#issuecomment-1309677939 ## CI report: * f9e08f5d14106a18fe59bf752077ab1043595e03 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12899) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12905) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12915)
[GitHub] [hudi] hudi-bot commented on pull request #7172: [HUDI-4888] throw exception if COW table and consistent hashing bucket index
hudi-bot commented on PR #7172: URL: https://github.com/apache/hudi/pull/7172#issuecomment-1309677986 ## CI report: * 959da168ba88673c2b2b4eb5d42cf6aac62d3808 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12913)
[GitHub] [hudi] SteNicholas commented on a diff in pull request #7035: [HUDI-5075] Adding support to rollback residual clustering after disabling clustering
SteNicholas commented on code in PR #7035: URL: https://github.com/apache/hudi/pull/7035#discussion_r1018585393 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java: ## @@ -310,6 +310,14 @@ public class HoodieClusteringConfig extends HoodieConfig { + "Please exercise caution while setting this config, especially when clustering is done very frequently. This could lead to race condition in " + "rare scenarios, for example, when the clustering completes after instants are fetched but before rollback completed."); + public static final ConfigProperty ROLLBACK_PENDING_CLUSTERING_WHEN_DISABLED = ConfigProperty + .key("hoodie.rollback.pending.clustering.when.disabled") Review Comment: Could we replace the key `hoodie.rollback.pending.clustering.when.disabled` with `hoodie.clustering.rollback.pending.replacecommit`, to match the naming of the interface `withRollbackPendingClustering`? The key `hoodie.rollback.pending.clustering.when.disabled` leaves users unsure whether it disables clustering or rollback. cc @yihua
[GitHub] [hudi] SteNicholas commented on a diff in pull request #7035: [HUDI-5075] Adding support to rollback residual clustering after disabling clustering
SteNicholas commented on code in PR #7035: URL: https://github.com/apache/hudi/pull/7035#discussion_r1018585393 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java: ## @@ -310,6 +310,14 @@ public class HoodieClusteringConfig extends HoodieConfig { + "Please exercise caution while setting this config, especially when clustering is done very frequently. This could lead to race condition in " + "rare scenarios, for example, when the clustering completes after instants are fetched but before rollback completed."); + public static final ConfigProperty ROLLBACK_PENDING_CLUSTERING_WHEN_DISABLED = ConfigProperty + .key("hoodie.rollback.pending.clustering.when.disabled") Review Comment: Could we replace the key `hoodie.rollback.pending.clustering.when.disabled` with `hoodie.clustering.rollback.pending.replacecommit.when.disabled`? The key `hoodie.rollback.pending.clustering.when.disabled` leaves users unsure whether it disables clustering or rollback.
[GitHub] [hudi] Zouxxyy commented on pull request #7140: [HUDI-5163] Fixing failure handling with spark datasource write
Zouxxyy commented on PR #7140: URL: https://github.com/apache/hudi/pull/7140#issuecomment-1309666320 > > @YannByron : I might need some help to take this patch and make valid fixes for spark-sql classes. Also, we might need to write tests. if you can loop in someone, would be nice. > > @Zouxxyy would you like to take this up to enrich more cases to validate? If yes, you can open another pr based on this pr and comments for your convenience. @YannByron Ok, I'll fix it
[jira] [Resolved] (HUDI-5187) Remove the preCondition check of BucketAssigner assign state
[ https://issues.apache.org/jira/browse/HUDI-5187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen resolved HUDI-5187. -- > Remove the preCondition check of BucketAssigner assign state > > > Key: HUDI-5187 > URL: https://issues.apache.org/jira/browse/HUDI-5187 > Project: Apache Hudi > Issue Type: Task > Components: flink >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-5187) Remove the preCondition check of BucketAssigner assign state
[ https://issues.apache.org/jira/browse/HUDI-5187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631382#comment-17631382 ] Danny Chen commented on HUDI-5187: -- Fixed via master branch: df9100b0cb496bb78b1173ae25f9140c1b155bc4 > Remove the preCondition check of BucketAssigner assign state > > > Key: HUDI-5187 > URL: https://issues.apache.org/jira/browse/HUDI-5187 > Project: Apache Hudi > Issue Type: Task > Components: flink >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] trushev commented on pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution
trushev commented on PR #5830: URL: https://github.com/apache/hudi/pull/5830#issuecomment-1309655985 @danny0405 rebased
[hudi] branch master updated (1469469d6e -> df9100b0cb)
danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 1469469d6e [HUDI-5111] Improve integration test coverage (#7092) add df9100b0cb [HUDI-5187] Remove the preCondition check of BucketAssigner assign state (#7170) No new revisions were added by this update. Summary of changes: .../main/java/org/apache/hudi/sink/partitioner/BucketAssigner.java | 7 --- 1 file changed, 7 deletions(-)
[GitHub] [hudi] danny0405 merged pull request #7170: [HUDI-5187] Remove the preCondition check of BucketAssigner assign state
danny0405 merged PR #7170: URL: https://github.com/apache/hudi/pull/7170
[GitHub] [hudi] zhangyue19921010 commented on pull request #7169: [HUDI-5186] Parallelism does not take effect when hoodie.combine.before.upsert/insert false
zhangyue19921010 commented on PR #7169: URL: https://github.com/apache/hudi/pull/7169#issuecomment-1309642342 @hudi-bot run azure
[GitHub] [hudi] hudi-bot commented on pull request #7039: [HUDI-5080] Fixing unpersist to consider only rdds pertaining to current write operation
hudi-bot commented on PR #7039: URL: https://github.com/apache/hudi/pull/7039#issuecomment-1309617147 ## CI report: * dc7c2bdefdb2a84b27b40751714a31604e2931eb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12911) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6680: [HUDI-4812] Lazy fetching partition path & file slice for HoodieFileIndex
hudi-bot commented on PR #6680: URL: https://github.com/apache/hudi/pull/6680#issuecomment-1309616517 ## CI report: * 59cdd09e3190c3646e1e3ea6ca3f076526ec0473 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12912)
[GitHub] [hudi] hudi-bot commented on pull request #7172: [HUDI-4888] throw exception if COW table and consistent hashing bucket index
hudi-bot commented on PR #7172: URL: https://github.com/apache/hudi/pull/7172#issuecomment-1309553268 ## CI report: * 959da168ba88673c2b2b4eb5d42cf6aac62d3808 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12913)
[GitHub] [hudi] hudi-bot commented on pull request #7172: [HUDI-4888] throw exception if COW table and consistent hashing bucket index
hudi-bot commented on PR #7172: URL: https://github.com/apache/hudi/pull/7172#issuecomment-1309549269 ## CI report: * 959da168ba88673c2b2b4eb5d42cf6aac62d3808 UNKNOWN
[jira] [Updated] (HUDI-4888) Add validation to block COW table to use consistent hashing bucket index
[ https://issues.apache.org/jira/browse/HUDI-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4888: - Labels: pull-request-available (was: ) > Add validation to block COW table to use consistent hashing bucket index > > > Key: HUDI-4888 > URL: https://issues.apache.org/jira/browse/HUDI-4888 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Yuwei Xiao >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.12.2 > > > The consistent hashing bucket index's resizing relies on the log feature of MOR > tables, so with a COW table the consistent hashing bucket index cannot > currently support resizing. > We should block the user from using it at the very beginning (i.e., table > creation), and suggest they use a MOR table or the Simple Bucket Index. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] jonvex opened a new pull request, #7172: [HUDI-4888] throw exception if COW table and consistent hashing bucket index
jonvex opened a new pull request, #7172: URL: https://github.com/apache/hudi/pull/7172 ### Change Logs Consistent hashing bucket index resizing does not work on COW tables because it relies on writing to log files while the resizing takes place. Instead of failing later, during the resizing itself, Hudi will now fail on the first write with this configuration. ### Impact Fails faster and lets the user know what is going wrong. ### Risk level (write none, low medium or high below) low ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
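The fail-fast validation described in the PR can be sketched as below. This is a hypothetical shape, not the actual Hudi code: the class, enum, and method names are illustrative, but the rule it enforces — reject COW plus consistent hashing bucket index on the first write, since resizing needs MOR log files — is the one the PR and HUDI-4888 describe.

```java
// Hypothetical sketch (not Hudi source) of the write-time validation:
// reject the COW + consistent-hashing-bucket-index combination up front
// instead of failing later, when bucket resizing is attempted.
public class BucketIndexValidation {
  enum TableType { COPY_ON_WRITE, MERGE_ON_READ }
  enum BucketIndexEngine { SIMPLE, CONSISTENT_HASHING }

  static void validate(TableType tableType, BucketIndexEngine engine) {
    // Consistent hashing resizing relies on MOR log files, so it cannot
    // work on a COW table; surface that as an early, explicit error.
    if (tableType == TableType.COPY_ON_WRITE
        && engine == BucketIndexEngine.CONSISTENT_HASHING) {
      throw new IllegalArgumentException(
          "Consistent hashing bucket index requires a MERGE_ON_READ table; "
              + "use MERGE_ON_READ or the SIMPLE bucket index instead");
    }
  }

  public static void main(String[] args) {
    // MOR + consistent hashing is allowed.
    validate(TableType.MERGE_ON_READ, BucketIndexEngine.CONSISTENT_HASHING);
    boolean rejected = false;
    try {
      validate(TableType.COPY_ON_WRITE, BucketIndexEngine.CONSISTENT_HASHING);
    } catch (IllegalArgumentException e) {
      rejected = true;
    }
    System.out.println("COW + consistent hashing rejected: " + rejected);
  }
}
```

Checking the combination once, at the start of the first write, is what makes the failure cheap: the user gets a clear configuration error before any data is written.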
[jira] [Updated] (HUDI-4888) Add validation to block COW table to use consistent hashing bucket index
[ https://issues.apache.org/jira/browse/HUDI-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-4888: -- Status: Patch Available (was: In Progress)
[GitHub] [hudi] alexeykudinkin commented on issue #7106: [PROPOSE] Add column prune support for other payload class
alexeykudinkin commented on issue #7106: URL: https://github.com/apache/hudi/issues/7106#issuecomment-1309506033 @TengHuo correct, it's slated for 0.13. We're currently in the final innings of merging Phase 1 to master and hoping to do this in the coming weeks.
[GitHub] [hudi] alexeykudinkin commented on pull request #7069: [HUDI-5097] Fix partition reading without partition fields table config
alexeykudinkin commented on PR #7069: URL: https://github.com/apache/hudi/pull/7069#issuecomment-1309504006 > not landing this to master. only meant for 0.12.2 release patch LGTM
[GitHub] [hudi] hudi-bot commented on pull request #6680: [HUDI-4812] Lazy fetching partition path & file slice for HoodieFileIndex
hudi-bot commented on PR #6680: URL: https://github.com/apache/hudi/pull/6680#issuecomment-1309472827 ## CI report: * c3aba0dc3e2f7c2c6240d3aa5bc279cf8f359153 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12822) * 59cdd09e3190c3646e1e3ea6ca3f076526ec0473 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12912)
[GitHub] [hudi] hudi-bot commented on pull request #7039: [HUDI-5080] Fixing unpersist to consider only rdds pertaining to current write operation
hudi-bot commented on PR #7039: URL: https://github.com/apache/hudi/pull/7039#issuecomment-1309466898 ## CI report: * 5ff96812e74f348af76c942f58e67445afbb765e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12767) * dc7c2bdefdb2a84b27b40751714a31604e2931eb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12911)
[GitHub] [hudi] hudi-bot commented on pull request #6680: [HUDI-4812] Lazy fetching partition path & file slice for HoodieFileIndex
hudi-bot commented on PR #6680: URL: https://github.com/apache/hudi/pull/6680#issuecomment-1309466349 ## CI report: * c3aba0dc3e2f7c2c6240d3aa5bc279cf8f359153 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12822) * 59cdd09e3190c3646e1e3ea6ca3f076526ec0473 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests
hudi-bot commented on PR #7171: URL: https://github.com/apache/hudi/pull/7171#issuecomment-1309459685 ## CI report: * 2a1766f041bf9b8bac2927b1aa076916361a00b6 UNKNOWN * bba82cddcfad42d2e2fe698d8a27077536257bfc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12909)
[GitHub] [hudi] hudi-bot commented on pull request #7039: [HUDI-5080] Fixing unpersist to consider only rdds pertaining to current write operation
hudi-bot commented on PR #7039: URL: https://github.com/apache/hudi/pull/7039#issuecomment-1309459075 ## CI report: * 5ff96812e74f348af76c942f58e67445afbb765e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12767) * dc7c2bdefdb2a84b27b40751714a31604e2931eb UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #7003: [minor] add more test for rfc46
hudi-bot commented on PR #7003: URL: https://github.com/apache/hudi/pull/7003#issuecomment-1309458829 ## CI report: * da89f1a57bae167a6474092cc07ca0880e4028b8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12910)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #7039: [HUDI-5080] Fixing unpersist to consider only rdds pertaining to current write operation
nsivabalan commented on code in PR #7039: URL: https://github.com/apache/hudi/pull/7039#discussion_r1018436316 ## hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java: ## @@ -44,6 +46,28 @@ public class CommitUtils { private static final Logger LOG = LogManager.getLogger(CommitUtils.class); private static final String NULL_SCHEMA_STR = Schema.create(Schema.Type.NULL).toString(); + public static transient ConcurrentHashMap> PERSISTED_RDD_IDS = new ConcurrentHashMap(); Review Comment: sounds fair. I have an idea on how to go about this. but lets jam and have a consensus before I go ahead.
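The idea under review — keeping a map from write operation to the RDD ids it persisted, so that unpersist touches only the current write's RDDs — can be sketched without Spark. All names and the overall shape here are hypothetical; the thread explicitly defers the final design:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of the bookkeeping being discussed in the review:
// track persisted RDD ids per write operation so that cleanup unpersists
// only the RDDs belonging to the current write, not every cached RDD.
public class PersistedRddTracker {
  private final Map<String, List<Integer>> persistedRddIds = new ConcurrentHashMap<>();

  // Record that the given write operation persisted the given RDD.
  void trackPersisted(String writeOperationId, int rddId) {
    persistedRddIds
        .computeIfAbsent(writeOperationId, k -> new CopyOnWriteArrayList<>())
        .add(rddId);
  }

  // Remove and return the ids to unpersist for this write,
  // leaving other concurrent writes' RDDs untouched.
  List<Integer> drainForWrite(String writeOperationId) {
    List<Integer> ids = persistedRddIds.remove(writeOperationId);
    return ids == null ? List.of() : ids;
  }
}
```

In a real integration, the caller would iterate the drained ids and call `unpersist` on the matching RDDs; scoping the map by write operation is what prevents one commit's cleanup from evicting another commit's cached data.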
[GitHub] [hudi] rtdt99 commented on issue #7158: [SUPPORT] Request for support for Azure Blob Storage Events Source and Events Hoodie Incr Source in DeltaStreamer
rtdt99 commented on issue #7158: URL: https://github.com/apache/hudi/issues/7158#issuecomment-1309389618 @nsivabalan, could you please provide any timeline on when we can expect this feature in DeltaStreamer?
[GitHub] [hudi] hudi-bot commented on pull request #7142: [HUDI-5056] Allow wildcards in partition paths for DELETE_PARTITIONS
hudi-bot commented on PR #7142: URL: https://github.com/apache/hudi/pull/7142#issuecomment-1309365677 ## CI report: * f839fdda3077916eea26ed14e85aa01fa657f3e6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12908)
[jira] [Updated] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4967: Description: Related fix: HUDI-4966 We need to add docs on how to properly set the meta sync configuration, especially the hoodie.datasource.hive_sync.partition_value_extractor, in [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, the config can be different). Check the ticket above and PR description of [https://github.com/apache/hudi/pull/6851] for more details. We should also add the migration setup on the key generation page as well: [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates] * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config is used to extract and transform partition value during Hive sync. Its default value has been changed from {{SlashEncodedDayPartitionValueExtractor}} to {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default value (i.e., have not set it explicitly), you are required to set the config to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From this release, if this config is not set and Hive sync is enabled, then partition value extractor class will be *automatically inferred* on the basis of number of partition fields and whether or not hive style partitioning is enabled. was: Related fix: HUDI-4966 We need to add docs on how to properly set the meta sync configuration, especially the hoodie.datasource.hive_sync.partition_value_extractor, in [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, the config can be different). We should also add the migration setup on the key generation page as well: [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates] * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config is used to extract and transform partition value during Hive sync. 
Its default value has been changed from {{SlashEncodedDayPartitionValueExtractor}} to {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default value (i.e., have not set it explicitly), you are required to set the config to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From this release, if this config is not set and Hive sync is enabled, then partition value extractor class will be *automatically inferred* on the basis of number of partition fields and whether or not hive style partitioning is enabled. > Improve docs for meta sync with TimestampBasedKeyGenerator > -- > > Key: HUDI-4967 > URL: https://issues.apache.org/jira/browse/HUDI-4967 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Jonathan Vexler >Priority: Major > Fix For: 0.12.2 > > > Related fix: HUDI-4966 > We need to add docs on how to properly set the meta sync configuration, > especially the hoodie.datasource.hive_sync.partition_value_extractor, in > [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, > the config can be different). Check the ticket above and PR description of > [https://github.com/apache/hudi/pull/6851] for more details. > We should also add the migration setup on the key generation page as well: > [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates] > * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config > is used to extract and transform partition value during Hive sync. Its > default value has been changed from > {{SlashEncodedDayPartitionValueExtractor}} to > {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default > value (i.e., have not set it explicitly), you are required to set the config > to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. 
From > this release, if this config is not set and Hive sync is enabled, then > partition value extractor class will be *automatically inferred* on the basis > of number of partition fields and whether or not hive style partitioning is > enabled.
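The migration note above amounts to setting one writer-side option explicitly. The option key and class name come from the quoted HUDI-4967 description; the surrounding code shape is an illustrative sketch, not Hudi API usage:

```java
import java.util.Map;

// Sketch of pinning the pre-0.12 default extractor explicitly, per the
// migration note: the default changed in 0.12.0 from
// SlashEncodedDayPartitionValueExtractor to MultiPartKeysValueExtractor,
// so writers that relied on the old default must now set it themselves.
public class HiveSyncOptions {
  static Map<String, String> legacyExtractorOptions() {
    return Map.of(
        "hoodie.datasource.hive_sync.enable", "true",
        // Restore the pre-0.12.0 behavior for slash-encoded day partitions.
        "hoodie.datasource.hive_sync.partition_value_extractor",
        "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor");
  }
}
```

These options would be passed to the writer (for example as `--hoodie-conf` pairs in DeltaStreamer, matching the style of the configs quoted elsewhere in this digest).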
[jira] [Updated] (HUDI-4966) Meta sync throws exception if TimestampBasedKeyGenerator is used to generate partition path containing slashes
[ https://issues.apache.org/jira/browse/HUDI-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4966: Fix Version/s: 0.12.2 (was: 0.12.1) > Meta sync throws exception if TimestampBasedKeyGenerator is used to generate > partition path containing slashes > -- > > Key: HUDI-4966 > URL: https://issues.apache.org/jira/browse/HUDI-4966 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.2 > > > For Deltastreamer, when using TimestampBasedKeyGenerator with the output > format of partition path containing slashes, e.g., "yyyy/MM/dd", and > hive-style partitioning disabled (by default), the meta sync fails. > {code:java} > --hoodie-conf hoodie.datasource.write.partitionpath.field=createdDate > --hoodie-conf > hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator > --hoodie-conf hoodie.deltastreamer.keygen.timebased.timezone=GMT > --hoodie-conf > hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd > --hoodie-conf > hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS {code} > Hive Sync exception: > {code:java} > Exception in thread "main" org.apache.hudi.exception.HoodieException: Could > not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool > at > org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:58) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.runMetaSync(DeltaSync.java:719) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:637) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:337) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:204) > at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) > at > 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:202) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:571) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) > at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception > when hive syncing test_table > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145) > at > org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:56) > ... 19 more > Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync > partitions for table test_table > at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:341) > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:232) > at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:154) > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:142) > ... 
20 more > Caused by: org.apache.hudi.hive.HoodieHiveSyncException: default.test_table > add partition failed > at > org.apache.hudi.hive.ddl.HMSDDLExecutor.addPartitionsToTable(HMSDDLExecutor.java:217) > at > org.apache.hudi.hive.HoodieHiveSyncClient.addPartitionsToTable(HoodieHiveSyncClient.java:107) > at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:324) > ... 23 more > Caused by: MetaException(message:Invalid partition key & values; keys > [createddate, ], values [2022, 10, 02, ]) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$add_partitions_req_result$add_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$add_partitions_req_result$add
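The root cause shown in the MetaException above — keys [createddate], values [2022, 10, 02] — can be reproduced in miniature: when the partition value is extracted by splitting the storage path on "/", a slash-separated date path yields more values than the table has declared partition keys. This is a simplified illustration, not Hudi's actual extraction code:

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch (not Hudi source) of why a slash-formatted partition path
// breaks Hive sync when hive-style partitioning is disabled: each path
// segment becomes one partition value, so a date path like "2022/10/02"
// produces three values for a table declaring a single partition key,
// and the metastore rejects the mismatch.
public class PartitionValueMismatch {
  static List<String> extractPartitionValues(String partitionPath) {
    // Each "/"-separated segment is treated as one partition value.
    return Arrays.asList(partitionPath.split("/"));
  }

  public static void main(String[] args) {
    List<String> partitionKeys = List.of("createddate"); // one declared field
    List<String> values = extractPartitionValues("2022/10/02");
    System.out.println(partitionKeys.size() + " key(s) vs " + values.size() + " value(s)");
  }
}
```

This is why the fix involves the partition value extractor configuration: the extractor must collapse the slash-encoded date back into a single partition value for the single declared key.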
[jira] [Updated] (HUDI-4966) Meta sync throws exception if TimestampBasedKeyGenerator is used to generate partition path containing slashes
[ https://issues.apache.org/jira/browse/HUDI-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4966: Fix Version/s: 0.12.1 (was: 0.12.2) > Meta sync throws exception if TimestampBasedKeyGenerator is used to generate > partition path containing slashes
[jira] [Updated] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4967: Description: Related fix: HUDI-4966 We need to add docs on how to properly set the meta sync configuration, especially the hoodie.datasource.hive_sync.partition_value_extractor, in [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, the config can be different). We should also add the migration setup on the key generation page as well: [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates] * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config is used to extract and transform partition value during Hive sync. Its default value has been changed from {{SlashEncodedDayPartitionValueExtractor}} to {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default value (i.e., have not set it explicitly), you are required to set the config to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From this release, if this config is not set and Hive sync is enabled, then partition value extractor class will be *automatically inferred* on the basis of number of partition fields and whether or not hive style partitioning is enabled. was: Related fix: HUDI-4966
[jira] [Closed] (HUDI-4966) Meta sync throws exception if TimestampBasedKeyGenerator is used to generate partition path containing slashes
[ https://issues.apache.org/jira/browse/HUDI-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo closed HUDI-4966.
---------------------------
    Resolution: Fixed

> Meta sync throws exception if TimestampBasedKeyGenerator is used to generate partition path containing slashes
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-4966
>                 URL: https://issues.apache.org/jira/browse/HUDI-4966
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.12.2
>
> For Deltastreamer, when using TimestampBasedKeyGenerator with an output format of the partition path containing slashes, e.g., "yyyy/MM/dd", and hive-style partitioning disabled (the default), the meta sync fails.
> {code:java}
> --hoodie-conf hoodie.datasource.write.partitionpath.field=createdDate
> --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.timezone=GMT
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS {code}
> Hive Sync exception:
> {code:java}
> Exception in thread "main" org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
>   at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:58)
>   at org.apache.hudi.utilities.deltastreamer.DeltaSync.runMetaSync(DeltaSync.java:719)
>   at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:637)
>   at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:337)
>   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:204)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:202)
>   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:571)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>   at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>   at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing test_table
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
>   at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:56)
>   ... 19 more
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table test_table
>   at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:341)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:232)
>   at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:154)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:142)
>   ... 20 more
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: default.test_table add partition failed
>   at org.apache.hudi.hive.ddl.HMSDDLExecutor.addPartitionsToTable(HMSDDLExecutor.java:217)
>   at org.apache.hudi.hive.HoodieHiveSyncClient.addPartitionsToTable(HoodieHiveSyncClient.java:107)
>   at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:324)
>   ... 23 more
> Caused by: MetaException(message:Invalid partition key & values; keys [createddate, ], values [2022, 10, 02, ])
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$add_partitions_req_result$add_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$add_partitions_req_result$add_partitions_req_resultStandardScheme.read(T
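The root cause is visible in the final MetaException of the trace above: a single partition key (createddate) received three values ([2022, 10, 02]) because the slash-containing date format produced a multi-level path. Until running a release with the fix (0.12.2, per this issue), two configuration-level workarounds seem plausible from the configs involved — these are my assumptions, not guidance from the ticket:

```properties
# Assumption A: emit a single-level partition value by avoiding slashes
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd

# Assumption B: keep slash-encoded day partitions, but point Hive sync at the
# extractor that decodes them (see the HUDI-4967 migration note above)
hoodie.datasource.hive_sync.partition_value_extractor=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor
```

Whether either resolves the specific failure depends on the Hudi version in use.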
[GitHub] [hudi] jonvex commented on issue #6808: [SUPPORT] Cannot sync to spark embedded derby hive meta store (the default one)
jonvex commented on issue #6808: URL: https://github.com/apache/hudi/issues/6808#issuecomment-1309323342

The updated links are:
- https://github.com/apache/hudi/blob/master/packaging/bundle-validation/Dockerfile
- https://github.com/apache/hudi/blob/master/packaging/bundle-validation/validate.sh (you can look at the test_spark_hadoop_mr_bundles function)
- https://github.com/apache/hudi/tree/master/packaging/bundle-validation/conf contains some configuration files

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7003: [minor] add more test for rfc46
hudi-bot commented on PR #7003: URL: https://github.com/apache/hudi/pull/7003#issuecomment-1309277206

## CI report:

* 6efb6b43bec7d58883b7120f23c06d4ae927a528 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12573)
* da89f1a57bae167a6474092cc07ca0880e4028b8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12910)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests
hudi-bot commented on PR #7171: URL: https://github.com/apache/hudi/pull/7171#issuecomment-1309272029

## CI report:

* 2a1766f041bf9b8bac2927b1aa076916361a00b6 UNKNOWN
* bba82cddcfad42d2e2fe698d8a27077536257bfc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12909)
[GitHub] [hudi] hudi-bot commented on pull request #7003: [minor] add more test for rfc46
hudi-bot commented on PR #7003: URL: https://github.com/apache/hudi/pull/7003#issuecomment-1309271518

## CI report:

* 6efb6b43bec7d58883b7120f23c06d4ae927a528 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12573)
* da89f1a57bae167a6474092cc07ca0880e4028b8 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #7169: [HUDI-5186] Parallelism does not take effect when hoodie.combine.before.upsert/insert false
hudi-bot commented on PR #7169: URL: https://github.com/apache/hudi/pull/7169#issuecomment-1309266022

## CI report:

* f9e08f5d14106a18fe59bf752077ab1043595e03 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12899) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12905)
[GitHub] [hudi] hudi-bot commented on pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests
hudi-bot commented on PR #7171: URL: https://github.com/apache/hudi/pull/7171#issuecomment-1309266099

## CI report:

* 2a1766f041bf9b8bac2927b1aa076916361a00b6 UNKNOWN
* bba82cddcfad42d2e2fe698d8a27077536257bfc UNKNOWN
[jira] [Closed] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
[ https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan closed HUDI-2613.
-------------------------------------
    Resolution: Done

> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
> ----------------------------------------------------------------
>
>                 Key: HUDI-2613
>                 URL: https://issues.apache.org/jira/browse/HUDI-2613
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: code-quality
>            Reporter: sivabalan narayanan
>            Assignee: Jonathan Vexler
>            Priority: Critical
>             Fix For: 0.13.0
>
> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of getDeltalogs()
[GitHub] [hudi] hudi-bot commented on pull request #7167: [HUDI-5094] Remove partition fields before transform bytes to avro,if enable DROP_PARTITION_COLUMNS.
hudi-bot commented on PR #7167: URL: https://github.com/apache/hudi/pull/7167#issuecomment-1309201490

## CI report:

* 8881615927848e76214065870119e910e41e9c35 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12904)
[jira] [Commented] (HUDI-2613) Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
[ https://issues.apache.org/jira/browse/HUDI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631217#comment-17631217 ]

sivabalan narayanan commented on HUDI-2613:
-------------------------------------------

Looks like we got this covered already. Closing it as done.

> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus
> ----------------------------------------------------------------
>
>                 Key: HUDI-2613
>                 URL: https://issues.apache.org/jira/browse/HUDI-2613
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: code-quality
>            Reporter: sivabalan narayanan
>            Assignee: Jonathan Vexler
>            Priority: Critical
>             Fix For: 0.13.0
>
> Fix usages of RealtimeSplit to use the new getDeltaLogFileStatus instead of getDeltalogs()
[jira] [Updated] (HUDI-5171) Ensure validateTableConfig also checks for partition path field value switch
[ https://issues.apache.org/jira/browse/HUDI-5171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-5171:
--------------------------------------
    Story Points: 1

> Ensure validateTableConfig also checks for partition path field value switch
> ----------------------------------------------------------------------------
>
>                 Key: HUDI-5171
>                 URL: https://issues.apache.org/jira/browse/HUDI-5171
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>    Affects Versions: 0.12.1
>            Reporter: sivabalan narayanan
>            Assignee: Jonathan Vexler
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.12.2
>
> As of now, validateTableConfig does not consider a switch in the partition path field value; we need to consider that as well.
[jira] [Commented] (HUDI-4990) Parallelize deduplication in CLI tool
[ https://issues.apache.org/jira/browse/HUDI-4990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631215#comment-17631215 ]

Jonathan Vexler commented on HUDI-4990:
---------------------------------------

We are stuck on this for now because we can't run the integration tests locally.

> Parallelize deduplication in CLI tool
> -------------------------------------
>
>                 Key: HUDI-4990
>                 URL: https://issues.apache.org/jira/browse/HUDI-4990
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Jonathan Vexler
>            Priority: Major
>             Fix For: 0.12.2
>
> The CLI tool command `repair deduplicate` repairs one partition at a time. To repair hundreds of partitions, this takes time. We should add a mode to take multiple partition paths for the CLI and run the dedup job for multiple partitions at the same time.
[GitHub] [hudi] hudi-bot commented on pull request #7142: [HUDI-5056] Allow wildcards in partition paths for DELETE_PARTITIONS
hudi-bot commented on PR #7142: URL: https://github.com/apache/hudi/pull/7142#issuecomment-1309195747

## CI report:

* 50ab8834e6dad68dfb679216e7b3096254588dfc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12860) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12906)
* f839fdda3077916eea26ed14e85aa01fa657f3e6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12908)
[GitHub] [hudi] hudi-bot commented on pull request #7142: [HUDI-5056] Allow wildcards in partition paths for DELETE_PARTITIONS
hudi-bot commented on PR #7142: URL: https://github.com/apache/hudi/pull/7142#issuecomment-1309189923

## CI report:

* 50ab8834e6dad68dfb679216e7b3096254588dfc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12860) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12906)
* f839fdda3077916eea26ed14e85aa01fa657f3e6 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests
hudi-bot commented on PR #7171: URL: https://github.com/apache/hudi/pull/7171#issuecomment-1309183756

## CI report:

* 2a1766f041bf9b8bac2927b1aa076916361a00b6 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6838: [MINOR] Update azure image and balance CI jobs
hudi-bot commented on PR #6838: URL: https://github.com/apache/hudi/pull/6838#issuecomment-1309182962

## CI report:

* cb0d3a60736adeac074af479eabdc844793ea067 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12903)
[GitHub] [hudi] xushiyan closed pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests
xushiyan closed pull request #7171: [HUDI-5145] Avoid starting HDFS in hudi-utilities tests URL: https://github.com/apache/hudi/pull/7171
[GitHub] [hudi] nsivabalan commented on a diff in pull request #7132: [HUDI-51577] Adding capability to remove all meta fields from source hudi table with Hudi incr source
nsivabalan commented on code in PR #7132: URL: https://github.com/apache/hudi/pull/7132#discussion_r1018258230

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java:

@@ -172,10 +178,13 @@ public Pair<Option<Dataset<Row>>, String> fetchNextBatch(Option<String> lastCkpt
      *
      * log.info("Validated Source Schema :" + validated.schema());
      */
+    boolean dropAllMetaFields = props.getBoolean(Config.HOODIE_DROP_ALL_META_FIELDS_FROM_SOURCE,
+        Config.DEFAULT_HOODIE_DROP_ALL_META_FIELDS_FROM_SOURCE);
     // Remove Hoodie meta columns except partition path from input source
-    final Dataset<Row> src = source.drop(HoodieRecord.HOODIE_META_COLUMNS.stream()
-        .filter(x -> !x.equals(HoodieRecord.PARTITION_PATH_METADATA_FIELD)).toArray(String[]::new));
+    String[] colsToDrop = dropAllMetaFields ? HoodieRecord.HOODIE_META_COLUMNS.stream().toArray(String[]::new) :
+        HoodieRecord.HOODIE_META_COLUMNS.stream().filter(x -> !x.equals(HoodieRecord.PARTITION_PATH_METADATA_FIELD)).toArray(String[]::new);

Review Comment: I could not decode that. I don't see a need unless we want to carry over the partitioning from tableA to tableB.
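The review hunk above gates which meta columns get dropped on a new flag. A minimal, self-contained sketch of that selection logic (the class name and `colsToDrop` helper are my own stand-ins, and the column list is assumed to match `HoodieRecord.HOODIE_META_COLUMNS` — this is not the actual HoodieIncrSource code):

```java
import java.util.Arrays;

public class DropMetaColsSketch {
    // Hudi's standard meta columns (assumption: matches HoodieRecord.HOODIE_META_COLUMNS)
    static final String[] HOODIE_META_COLUMNS = {
        "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
        "_hoodie_partition_path", "_hoodie_file_name"
    };
    static final String PARTITION_PATH_FIELD = "_hoodie_partition_path";

    // Mirrors the patched ternary: drop every meta column when the flag is set,
    // otherwise drop all of them except the partition path
    static String[] colsToDrop(boolean dropAllMetaFields) {
        return dropAllMetaFields
            ? HOODIE_META_COLUMNS.clone()
            : Arrays.stream(HOODIE_META_COLUMNS)
                .filter(x -> !x.equals(PARTITION_PATH_FIELD))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(colsToDrop(true).length);  // 5
        // With the flag off, the partition path survives the drop list:
        System.out.println(Arrays.asList(colsToDrop(false)).contains(PARTITION_PATH_FIELD));
    }
}
```

The resulting array would then be passed to `Dataset.drop(...)` as in the diff; the flag simply decides whether `_hoodie_partition_path` is carried over to the target table.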
[hudi] branch master updated: [HUDI-5111] Improve integration test coverage (#7092)
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 1469469d6e [HUDI-5111] Improve integration test coverage (#7092)
1469469d6e is described below

commit 1469469d6ecaeaa7d960676c305383e7e31fcd43
Author: Sivabalan Narayanan
AuthorDate: Wed Nov 9 09:56:14 2022 -0800

    [HUDI-5111] Improve integration test coverage (#7092)

    Co-authored-by: Raymond Xu <2701446+xushi...@users.noreply.github.com>
---
 .../deltastreamer-immutable-dataset.yaml           | 14 ++-
 ...treamer-long-running-multi-partitions-hive.yaml |  6 +--
 ...mer-long-running-multi-partitions-metadata.yaml | 18 +++--
 ...eltastreamer-long-running-multi-partitions.yaml | 18 +++--
 .../deltastreamer-medium-clustering.yaml           | 19 --
 ...ltastreamer-medium-full-dataset-validation.yaml | 18 +++--
 .../test-suite/deltastreamer-non-partitioned.yaml  | 14 ++-
 .../detlastreamer-long-running-example.yaml        | 18 +++--
 .../demo/config/test-suite/simple-clustering.yaml  | 16 +++-
 .../config/test-suite/simple-deltastreamer.yaml    | 12 ++
 .../config/test-suite/spark-immutable-dataset.yaml | 14 ++-
 .../spark-long-running-non-partitioned.yaml        | 12 ++
 .../demo/config/test-suite/spark-long-running.yaml | 16 +++-
 .../config/test-suite/spark-medium-clustering.yaml | 16 +++-
 docker/demo/config/test-suite/spark-simple.yaml    | 14 ++-
 .../hudi/integ/testsuite/HoodieTestSuiteJob.java   |  7
 .../integ/testsuite/dag/nodes/PrestoQueryNode.java | 44 +-
 .../testsuite/dag/nodes/ValidateDatasetNode.java   | 13 ---
 .../testsuite/dag/nodes/SparkInsertNode.scala      |  1 +
 19 files changed, 238 insertions(+), 52 deletions(-)

diff --git a/docker/demo/config/test-suite/deltastreamer-immutable-dataset.yaml b/docker/demo/config/test-suite/deltastreamer-immutable-dataset.yaml
index 4903e3650c..a19617ef13 100644
--- a/docker/demo/config/test-suite/deltastreamer-immutable-dataset.yaml
+++ 
b/docker/demo/config/test-suite/deltastreamer-immutable-dataset.yaml
@@ -45,9 +45,21 @@ dag_content:
 delete_input_data: false
 type: ValidateDatasetNode
 deps: first_insert
+ first_presto_query:
+config:
+ execute_itr_count: 5
+ presto_props:
+prop1: "SET SESSION hive.parquet_use_column_names = true"
+ presto_queries:
+query1: "select count(*) from testdb.table1"
+result1: 3
+query2: "select count(*) from testdb.table1 group by _row_key having count(*) > 1"
+result2: 0
+type: PrestoQueryNode
+deps: second_validate
 last_validate:
 config:
 execute_itr_count: 5
 delete_input_data: true
 type: ValidateAsyncOperations
-deps: second_validate
\ No newline at end of file
+deps: first_presto_query
\ No newline at end of file
diff --git a/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-hive.yaml b/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-hive.yaml
index 8b82415982..6e94b05a69 100644
--- a/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-hive.yaml
+++ b/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-hive.yaml
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 dag_name: deltastreamer-long-running-multi-partitions.yaml
-dag_rounds: 50
+dag_rounds: 20
 dag_intermittent_delay_mins: 1
 dag_content:
 first_insert:
@@ -71,7 +71,7 @@ dag_content:
 deps: first_delete
 second_validate:
 config:
- validate_once_every_itr : 5
+ execute_itr_count: 20
 validate_hive: true
 delete_input_data: true
 max_wait_time_for_deltastreamer_catch_up_ms: 60
 type: ValidateDatasetNode
 deps: second_hive_sync
 last_validate:
 config:
- execute_itr_count: 50
+ execute_itr_count: 20
 max_wait_time_for_deltastreamer_catch_up_ms: 60
 type: ValidateAsyncOperations
 deps: second_validate
diff --git a/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-metadata.yaml b/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-metadata.yaml
index 031664cd15..9ba6993e1d 100644
--- a/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-metadata.yaml
+++ b/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions-metadata.yaml
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 dag_name: deltastreamer-long-running-multi-partitions.yaml
-dag_rounds: 30
+dag_rounds: 20
 dag_intermittent_delay_mins: 1
 dag_content:
 first_insert:
@@ -65,9 +65,21 @@ dag_content:
 max_wait_t
[GitHub] [hudi] xushiyan merged pull request #7092: [HUDI-5111] Improve integration test coverage
xushiyan merged PR #7092: URL: https://github.com/apache/hudi/pull/7092