[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer
yihua commented on code in PR #6003:
URL: https://github.com/apache/hudi/pull/6003#discussion_r993980641

## rfc/rfc-56/rfc-56.md: ##
@@ -0,0 +1,238 @@
+
+
+# RFC-56: Early Conflict Detection For Multi-writer
+
+## Proposers
+
+- @zhangyue19921010
+
+## Approvers
+
+- @yihua
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-1575
+
+## Abstract
+
+At present, Hudi implements timeline-based OCC (Optimistic Concurrency Control) to ensure data consistency,
+integrity and correctness between multiple writers. OCC detects conflicts at Hudi's file group level, i.e., two
+concurrent writers updating the same file group are detected as a conflict. Currently, conflict detection is
+performed after the data writing is completed and before the metadata is committed. If any conflict is detected, cluster
+resources have been wasted because computing and writing were already finished.
+
+To solve this problem, this RFC proposes an early conflict detection mechanism to detect the conflict during the data
+writing phase and abort the writing early if a conflict is detected, using Hudi's marker mechanism. Before writing each
+data file, the writer creates a corresponding marker to mark that the file is created, so that the writer can use the
+markers to automatically clean up uncommitted data in failure and rollback scenarios. We propose to use the markers
+to identify conflicts at the file group level while data is being written. There are some subtle differences in the early
+conflict detection workflow between different types of marker maintainers. For direct markers, Hudi lists the necessary
+marker files directly and does conflict checking before the writers create markers and before they start to write the
+corresponding data files. For timeline-server-based markers, Hudi just gets the result of marker conflict checking
+before the writers create markers and before they start to write the corresponding data files. The conflicts are
+asynchronously and periodically checked so that writing conflicts can be detected as early as possible. Both writers may
+still write the data files of the same file slice until the conflict is detected in the next round of checking.
+
+In addition, with early conflict detection Hudi can stop writing earlier and release the resources to the cluster,
+improving resource utilization.
+
+Note that the early conflict detection proposed by this RFC operates within OCC. Any conflict detection outside the
+scope of OCC is not handled. For example, current OCC for multiple writers cannot detect the conflict if two concurrent
+writers perform INSERT operations for the same set of record keys, because the writers write to different file groups.
+This RFC does not intend to address this problem.
+
+## Background
+
+Transactions and multi-writer support in data lakes are becoming key characteristics of building a Lakehouse
+these days. Quoting the blog post "Lakehouse Concurrency Control: Are we too optimistic?" directly:
+https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic/
+
+> "Hudi implements a file level, log based concurrency control protocol on the Hudi timeline, which in-turn relies
+> on bare minimum atomic puts to cloud storage. By building on an event log as the central piece for inter process
+> coordination, Hudi is able to offer a few flexible deployment models that offer greater concurrency over pure OCC
+> approaches that just track table snapshots."
+
+In the multi-writer scenario, Hudi's existing conflict detection occurs after the writer finishes writing the data and
+before it commits the metadata. In other words, the writer only detects the conflict when it starts to
+commit, after all computation and data writing have been completed, which wastes resources.
+
+For example:
+
+There are two writing jobs: job1 writes 10M data to the Hudi table, including updates to file group 1. Another job, job2,
+writes 100G to the Hudi table, and also updates the same file group 1.
+
+Job1 finishes and commits to Hudi successfully. After a few hours, job2 finishes writing data files (100G) and starts to
+commit metadata. At this time, a conflict with job1 is found, and job2 has to be aborted and re-run after the failure.
+A lot of computing resources and time are wasted for job2.
+
+Hudi currently has two important mechanisms, the marker mechanism and the heartbeat mechanism:
+
+1. The marker mechanism can track all the files that are part of an active write.
+2. The heartbeat mechanism can track all active writers to a Hudi table.
+
+Based on markers and heartbeats, this RFC proposes a new conflict detection: Early Conflict Detection. Before the writer
+creates the marker and before it starts to write the file, Hudi performs this new conflict detection, trying to detect
+the writing conflict directly (for direct markers) or get the async conflict check
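
The file-group-level check that the quoted abstract describes for direct markers can be illustrated with a minimal sketch. This is not Hudi's implementation or API: `Marker`, `conflicting_writers`, `create_marker`, and the `active_writers` set (standing in for writers with a live heartbeat) are hypothetical names chosen only for illustration, and the markers are kept in an in-memory list rather than on storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    """Hypothetical in-memory stand-in for a marker file."""
    writer: str         # hypothetical writer/job identifier
    file_group_id: str  # file group that the marked data file belongs to


def conflicting_writers(my_writer, file_group_id, markers, active_writers):
    """Return other active writers that already hold a marker on this file group."""
    return sorted({
        m.writer
        for m in markers
        if m.file_group_id == file_group_id
        and m.writer != my_writer
        and m.writer in active_writers  # assumption: liveness comes from heartbeats
    })


def create_marker(my_writer, file_group_id, markers, active_writers):
    """Check for a conflict first; only record the marker if none is found."""
    others = conflicting_writers(my_writer, file_group_id, markers, active_writers)
    if others:
        # Abort before any data for this file group is written.
        raise RuntimeError(f"early conflict on {file_group_id} with {others}")
    markers.append(Marker(my_writer, file_group_id))


markers = []
active = {"job1", "job2"}

create_marker("job1", "fg-1", markers, active)  # job1 starts updating fg-1
create_marker("job2", "fg-2", markers, active)  # job2 writes a different file group

try:
    create_marker("job2", "fg-1", markers, active)  # job2 now targets fg-1 as well
    aborted = False
except RuntimeError:
    aborted = True
```

In this sketch job2 is stopped when it first touches the contended file group, instead of discovering the conflict at commit time after all of its data has been written.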
yihua commented on code in PR #6003: URL: https://github.com/apache/hudi/pull/6003#discussion_r989750821 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,238 @@
yihua commented on code in PR #6003: URL: https://github.com/apache/hudi/pull/6003#discussion_r989728519 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,238 @@
yihua commented on code in PR #6003: URL: https://github.com/apache/hudi/pull/6003#discussion_r989728329 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,238 @@
yihua commented on code in PR #6003: URL: https://github.com/apache/hudi/pull/6003#discussion_r988457829 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,238 @@
yihua commented on code in PR #6003: URL: https://github.com/apache/hudi/pull/6003#discussion_r988454552 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,238 @@
yihua commented on code in PR #6003: URL: https://github.com/apache/hudi/pull/6003#discussion_r984000514 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,238 @@
[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer
yihua commented on code in PR #6003: URL: https://github.com/apache/hudi/pull/6003#discussion_r983928225 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,235 @@ + +# RFC-56: Early Conflict Detection For Multi-writer + + +## Proposers + +- @zhangyue19921010 + +## Approvers + - @yihua + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-1575 + + +## Abstract + +At present, Hudi implements an OCC (Optimistic Concurrency Control) based on timeline to ensure data +consistency, integrity and correctness between multi-writers. However, the related conflict detection is performed +before commit metadata and after the data writing is completed. If any conflict is detected, it leads to a waste +of cluster resources because computing and writing were finished already. To solve this problem, this RFC proposes an +early conflict detection mechanism based on the existing Hudi marker mechanism. There are some subtle differences in +early conflict detection work flow between different types of marker maintainers. + + +For direct markers, hoodie lists necessary marker files directly and do conflict checking before the writers creating +markers and before starting to write corresponding data file. For the timeline-server based markers, hoodie just gets the +the result of marker conflict checking before the writers creating markers and before starting to write corresponding +data files. The conflicts are asynchronously and periodically checked so that the writing conflicts can be detected as +early as possible. Both writers may still write the data files of the same file slice, until the conflict is detected +in the next round of checking. + +What's more? Hoodie can stop writing earlier because of early conflict detection and release the resources to cluster, +improving resource utilization. +

Review Comment: We can add a section to clarify the scope in detail.

## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,235 @@ + +# RFC-56: Early Conflict Detection For Multi-writer + + +## Proposers + +- @zhangyue19921010 + +## Approvers + - @yihua + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-1575 + + +## Abstract + +At present, Hudi implements an OCC (Optimistic Concurrency Control) based on timeline to ensure data +consistency, integrity and correctness between multi-writers. However, the related conflict detection is performed +before commit metadata and after the data writing is completed. If any conflict is detected, it leads to a waste +of cluster resources because computing and writing were finished already. To solve this problem, this RFC proposes an +early conflict detection mechanism based on the existing Hudi marker mechanism. There are some subtle differences in +early conflict detection work flow between different types of marker maintainers. + + +For direct markers, hoodie lists necessary marker files directly and do conflict checking before the writers creating

Review Comment: nit: `do` -> `does`

## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,235 @@ + +# RFC-56: Early Conflict Detection For Multi-writer + + +## Proposers + +- @zhangyue19921010 + +## Approvers + - @yihua + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-1575 + + +## Abstract + +At present, Hudi implements an OCC (Optimistic Concurrency Control) based on timeline to ensure data +consistency, integrity and correctness between multi-writers. However, the related conflict detection is performed +before commit metadata and after the data writing is completed. If any conflict is detected, it leads to a waste +of cluster resources because computing and writing were finished already. To solve this problem, this RFC proposes an +early conflict detection mechanism based on the existing Hudi marker mechanism. There are some subtle differences in +early conflict detection work flow between different types of marker maintainers. 
+ + +For direct markers, hoodie lists necessary marker files directly and do conflict checking before the writers creating +markers and before starting to write corresponding data file. For the timeline-server based markers, hoodie just gets the +the result of marker conflict checking before the writers creating markers and before starting to write corresponding +data files. The conflicts are asynchronously and periodically checked so that the writing conflicts can be detected as +early as possible. Both writers may still write the data files of the same file slice, until the conflict is detected +in the next round of checking. + +What's more? Hoodie can stop writing earlier because of early conflict detection and release the resources to cluster, +improving resource utilization. + +## Background +As we know, transactions and multi-writers of data lakes are becoming the key characteristics of building Lakehouse +these days. Quoting this inspiring blog Lakehouse Concurrency Control: Are we too optimistic? directly: +https://hudi
[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer
yihua commented on code in PR #6003: URL: https://github.com/apache/hudi/pull/6003#discussion_r915474413 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,231 @@ + +# RFC-56: Early Conflict Detection For Multi-writer + + + +## Proposers + +- @zhangyue19921010 + +## Approvers + - @yihua + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-1575 + + +## Abstract + +At present, Hudi implements an optimized occ mechanism based on timeline to ensure data consistency, integrity and +correctness between multi-writers. However, the related conflict detection is performed before commit metadata and +after the data writing is completed. If this detection was failed, it would lead to a waste of cluster resources +because computing and writing were finished already. To solve this problem, this RFC design an early conflict detection +mechanism based on the existing Hudi marker mechanism. This new mechanism will do conflict checking before the writers +creating markers and before starting to write corresponding data file. So that the writing conflicts can be detected as +early as possible. What's more? We can stop writing earlier because of early conflict detection and release the +resources to cluster, improving resource utilization. + +## Background +As we know, Transactions and multi-writers of data lakes are becoming the key characteristics of building LakeHouse +these days. Quoting this inspiring blog Lakehouse Concurrency Control: Are we too optimistic? directly: +https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic/ + +>"Hudi implements a file level, log based concurrency control protocol on the Hudi timeline, which in-turn relies +> on bare minimum atomic puts to cloud storage. By building on an event log as the central piece for inter process +> coordination, Hudi is able to offer a few flexible deployment models that offer greater concurrency over pure OCC +> approaches that just track table snapshots." 
+ +In the multi-writer scenario, Hudi's existing conflict detection occurs after the writer finished writing the data +and before committing the metadata. In other words, the writer will only detect the occurrence of the conflict +when it starts to commit, although all calculations and data writing have been completed, which will cause a waste +of resources. + +For example: + +Now there are two writing jobs: job1 will write 10M data to the Hudi table, including update file group 1. +Another job2 will write 100G to the Hudi table, and will also update the same file group 1. + +Job1 finished and committed to Hudi successfully. After a few hours, job2 finished writing data files(100G) and start +to commit metadata. At this time, a conflict compared with job1 was found, and the job2 had to be aborted and re-run +after failure. Obviously, a lot of computing resources and time are wasted for job2. + + +Hudi currently has two important mechanisms, marker mechanism and heartbeat mechanism: +1. Marker mechanism can track all the files that are part of an active write. +2. Heartbeat mechanism that can track all active writers to a Hudi table. + + +Based on marker and heartbeat, this RFC design a new conflict detection: Early Conflict Detection. +Before the writer creates the marker and before it starts to write the file, Hudi will perform this new conflict

Review Comment: Similar here for inaccurate description.
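
For readers following the thread, here is a minimal sketch of the direct-marker flavor of the check under discussion. It is not Hudi's actual implementation. Before a writer creates a marker, it inspects markers left by other writers and aborts if one of them targets the same file group. The `Marker` record, `EarlyConflictError`, `check_before_create_marker`, and the rule that writers without a live heartbeat are skipped are all assumptions made for illustration. The timeline-server flavor, which reads an asynchronously computed result, is not modeled, and neither are conflicts with already committed instants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    """Hypothetical marker record: one per data file a writer intends to create."""
    instant: str          # instant time of the writer that owns the marker
    partition_path: str   # partition the data file belongs to
    file_id: str          # file group ID inside that partition


class EarlyConflictError(Exception):
    """Raised when another active writer already targets the same file group."""


def check_before_create_marker(candidate, existing_markers, active_instants):
    """Return the candidate marker if it is safe to create, else raise.

    existing_markers: markers currently visible for the table (all writers).
    active_instants:  instants whose writers still have a live heartbeat
                      (assumption: markers of other writers are treated as stale).
    """
    target = (candidate.partition_path, candidate.file_id)
    for marker in existing_markers:
        if marker.instant == candidate.instant:
            continue  # the writer's own markers never conflict with itself
        if marker.instant not in active_instants:
            continue  # owner is no longer heartbeating; skip its leftovers
        if (marker.partition_path, marker.file_id) == target:
            raise EarlyConflictError(
                f"file group {candidate.file_id} in {candidate.partition_path} "
                f"is already being written by instant {marker.instant}"
            )
    return candidate


if __name__ == "__main__":
    visible = [Marker("001", "2022/10/01", "fg-1")]
    try:
        check_before_create_marker(Marker("002", "2022/10/01", "fg-1"), visible, {"001"})
    except EarlyConflictError as err:
        print("abort early:", err)
```

In the job1/job2 example quoted above, a check of this kind would let job2 fail when it first tries to touch file group 1 while job1 is still in flight, rather than hours later at commit time.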

## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,231 @@ + +# RFC-56: Early Conflict Detection For Multi-writer + + + +## Proposers + +- @zhangyue19921010 + +## Approvers + - @yihua + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-1575 + + +## Abstract + +At present, Hudi implements an optimized occ mechanism based on timeline to ensure data consistency, integrity and

Review Comment: Let's spell out OCC: `optimized occ mechanism` -> `OCC (Optimistic Concurrency Control)`

## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,231 @@ + +# RFC-56: Early Conflict Detection For Multi-writer + + + +## Proposers + +- @zhangyue19921010 + +## Approvers + - @yihua + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-1575 + + +## Abstract + +At present, Hudi implements an optimized occ mechanism based on timeline to ensure data consistency, integrity and +correctness between multi-writers. However, the related conflict detection is performed before commit metadata and +after the data writing is completed. If this detection was failed, it would lead to a waste of cluster resources +because computing and writing were finished already. To solve this problem, this RFC design an early conflict detection +mechanism based on the existing Hudi marker mechanism. This new mechanism will do conflict checking before the writers +creating markers and before starting to write corresponding data file. So that the writing conflicts can be detected as +early as possible. What's