[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer

2022-10-12 Thread GitBox


yihua commented on code in PR #6003:
URL: https://github.com/apache/hudi/pull/6003#discussion_r993980641


##
rfc/rfc-56/rfc-56.md:
##
@@ -0,0 +1,238 @@
+
+
+# RFC-56: Early Conflict Detection For Multi-writer
+
+## Proposers
+
+- @zhangyue19921010
+
+## Approvers
+
+- @yihua
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-1575
+
+## Abstract
+
+At present, Hudi implements OCC (Optimistic Concurrency Control) based on the timeline to ensure data consistency,
+integrity and correctness between multiple writers. OCC detects conflicts at Hudi's file group level, i.e., two
+concurrent writers updating the same file group are detected as a conflict. Currently, conflict detection is
+performed after data writing is completed and before the commit metadata is written. If any conflict is detected,
+cluster resources are wasted because computing and writing have already finished.
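
As a rough illustration of the commit-time check described above, the sketch below intersects the file groups touched by a pending write with those of commits that completed while it was running. The `CommitMetadata` class and `find_conflicting_file_groups` function are hypothetical names introduced for this example; they are not Hudi's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitMetadata:
    """Hypothetical summary of a write: its instant time and the file groups it touched."""
    instant_time: str
    file_group_ids: frozenset


def find_conflicting_file_groups(pending, completed_since_start):
    """Return file groups touched by both the pending write and any concurrently completed commit."""
    conflicts = set()
    for other in completed_since_start:
        conflicts |= pending.file_group_ids & other.file_group_ids
    return conflicts


# job1 committed first; job2 finishes writing and only now checks for conflicts.
job1 = CommitMetadata("001", frozenset({"fg-1", "fg-2"}))
job2 = CommitMetadata("002", frozenset({"fg-1", "fg-3"}))
print(find_conflicting_file_groups(job2, [job1]))
```

In this model the check can only run once `job2` knows its full set of file groups, which is why the conflict surfaces after all data has been written.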
+
+To solve this problem, this RFC proposes an early conflict detection mechanism that detects conflicts during the data
+writing phase and aborts the write early if a conflict is detected, using Hudi's marker mechanism. Before writing each
+data file, the writer creates a corresponding marker to mark that the file is created, so that the writer can use the
+markers to automatically clean up uncommitted data in failure and rollback scenarios. We propose to use the markers
+to identify conflicts at the file group level while data is being written. There are some subtle differences in the
+early conflict detection workflow between different types of marker maintainers. For direct markers, Hudi lists the
+necessary marker files directly and checks for conflicts before the writer creates markers and before it starts to
+write the corresponding data file. For timeline-server-based markers, Hudi just gets the result of marker conflict
+checking before the writer creates markers and before it starts to write the corresponding data files. The conflicts
+are checked asynchronously and periodically so that write conflicts can be detected as early as possible. Both writers
+may still write the data files of the same file slice until the conflict is detected in the next round of checking.
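
The direct-marker flow described above can be sketched as a check that runs before a marker is created. This is a minimal in-memory model, assuming markers can be represented as `(partition, file_id)` pairs grouped by writer instant. `EarlyConflictError` and `check_then_create_marker` are hypothetical names for illustration, and the dictionary stands in for listing marker files on storage.

```python
class EarlyConflictError(Exception):
    """Raised when another writer already holds a marker for the same file group."""


def check_then_create_marker(markers_by_instant, my_instant, partition, file_id):
    """Scan other writers' markers for the target file group, then register our own marker.

    markers_by_instant maps an instant time to the set of (partition, file_id)
    pairs that writer has created markers for (a hypothetical stand-in for
    marker files listed from storage).
    """
    target = (partition, file_id)
    for instant, markers in markers_by_instant.items():
        if instant != my_instant and target in markers:
            raise EarlyConflictError(
                f"file group {file_id} in {partition} is being written by instant {instant}"
            )
    markers_by_instant.setdefault(my_instant, set()).add(target)


markers = {"001": {("2022/10/01", "fg-1")}}
check_then_create_marker(markers, "002", "2022/10/01", "fg-2")  # no overlap, marker created
try:
    check_then_create_marker(markers, "002", "2022/10/01", "fg-1")
except EarlyConflictError as err:
    print("aborting early:", err)
```

The check and the marker creation are not atomic in this sketch, so two writers that check at the same moment can both proceed.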
+
+In addition, early conflict detection lets Hudi stop writing earlier and release resources to the cluster,
+improving resource utilization.
+
+Note that the early conflict detection proposed by this RFC operates within OCC. Any conflict detection outside the
+scope of OCC is not handled. For example, the current OCC for multiple writers cannot detect the conflict if two
+concurrent writers perform INSERT operations for the same set of record keys, because the writers write to different
+file groups. This RFC does not intend to address this problem.
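
To make the limitation concrete, the following sketch models each writer as a mapping from file group ID to the record keys written there. The names `overlapping_file_groups` and `overlapping_record_keys` and the file group IDs are made up for this example. A file-group-level check finds nothing even though both writers inserted the same keys.

```python
def overlapping_file_groups(writer_a, writer_b):
    """File-group-level check: compares only the file group IDs each writer touched."""
    return set(writer_a) & set(writer_b)


def overlapping_record_keys(writer_a, writer_b):
    """Record-key-level comparison, which OCC at file group level does not perform."""
    keys_a = set().union(*writer_a.values())
    keys_b = set().union(*writer_b.values())
    return keys_a & keys_b


# Both writers INSERT the same keys, but each insert lands in a new file group of its own.
writer_a = {"fg-new-a": {"key1", "key2"}}
writer_b = {"fg-new-b": {"key1", "key2"}}

print(overlapping_file_groups(writer_a, writer_b))  # empty: no conflict reported
print(overlapping_record_keys(writer_a, writer_b))  # the duplicated keys go unnoticed
```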
+
+## Background
+
+As we know, transactions and multi-writer support are becoming key characteristics of data lakes for building a
+Lakehouse these days. Quoting the inspiring blog post "Lakehouse Concurrency Control: Are we too optimistic?" directly:
+https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic/
+
+> "Hudi implements a file level, log based concurrency control protocol on the Hudi timeline, which in-turn relies
+> on bare minimum atomic puts to cloud storage. By building on an event log as the central piece for inter process
+> coordination, Hudi is able to offer a few flexible deployment models that offer greater concurrency over pure OCC
+> approaches that just track table snapshots."
+
+In the multi-writer scenario, Hudi's existing conflict detection occurs after the writer finishes writing the data and
+before it commits the metadata. In other words, the writer detects the conflict only when it starts to
+commit, after all computation and data writing have been completed, which wastes resources.
+
+For example:
+
+Suppose there are two writing jobs: job1 writes 10M of data to the Hudi table, including updates to file group 1.
+Another job, job2, writes 100G to the Hudi table and also updates the same file group 1.
+
+Job1 finishes and commits to Hudi successfully. After a few hours, job2 finishes writing data files (100G) and starts
+to commit metadata. At this time, a conflict with job1 is found, and job2 has to be aborted and re-run after the
+failure. A lot of computing resources and time are wasted for job2.
+
+Hudi currently has two important mechanisms, the marker mechanism and the heartbeat mechanism:
+
+1. The marker mechanism tracks all the files that are part of an active write.
+2. The heartbeat mechanism tracks all active writers to a Hudi table.
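
One plausible way to combine the two mechanisms is sketched below: heartbeats decide which writers are still alive, and only the markers of those writers are treated as part of an active write. The function names, the millisecond timestamps and the timeout parameter are assumptions made for this example, not Hudi's actual interfaces.

```python
def active_writers(last_heartbeat_ms, now_ms, timeout_ms):
    """Return the instants whose most recent heartbeat is within the timeout."""
    return {
        instant
        for instant, beat_ms in last_heartbeat_ms.items()
        if now_ms - beat_ms <= timeout_ms
    }


def files_in_active_writes(markers_by_instant, last_heartbeat_ms, now_ms, timeout_ms):
    """Collect marker entries that belong to writers still considered alive."""
    alive = active_writers(last_heartbeat_ms, now_ms, timeout_ms)
    tracked = set()
    for instant, markers in markers_by_instant.items():
        if instant in alive:
            tracked |= markers
    return tracked


heartbeats = {"001": 1_000, "002": 9_500}           # last heartbeat per writer instant
markers = {"001": {"fg-1"}, "002": {"fg-2"}}        # file groups each writer has markers for
print(files_in_active_writes(markers, heartbeats, now_ms=10_000, timeout_ms=2_000))
```

Here the markers left behind by instant `001` are ignored because its heartbeat has expired, so a stale writer does not block new writes in this model.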
+
+Based on markers and heartbeats, this RFC proposes a new conflict detection: Early Conflict Detection. Before the writer
+creates the marker and before it starts to write the file, Hudi performs this new conflict detection, trying to detect
+the writing conflict directly (for direct markers) or get the async conflict check

[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer

2022-10-07 Thread GitBox


yihua commented on code in PR #6003:
URL: https://github.com/apache/hudi/pull/6003#discussion_r989750821



[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer

2022-10-06 Thread GitBox


yihua commented on code in PR #6003:
URL: https://github.com/apache/hudi/pull/6003#discussion_r989728519



[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer

2022-10-06 Thread GitBox


yihua commented on code in PR #6003:
URL: https://github.com/apache/hudi/pull/6003#discussion_r989728329



[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer

2022-10-06 Thread GitBox


yihua commented on code in PR #6003:
URL: https://github.com/apache/hudi/pull/6003#discussion_r988457829



[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer

2022-10-05 Thread GitBox


yihua commented on code in PR #6003:
URL: https://github.com/apache/hudi/pull/6003#discussion_r988454552



[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer

2022-09-29 Thread GitBox


yihua commented on code in PR #6003:
URL: https://github.com/apache/hudi/pull/6003#discussion_r984000514


##
rfc/rfc-56/rfc-56.md:
##
@@ -0,0 +1,238 @@
+
+
+# RFC-56: Early Conflict Detection For Multi-writer
+
+## Proposers
+
+- @zhangyue19921010
+
+## Approvers
+
+- @yihua
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-1575
+
+## Abstract
+
+At present, Hudi implements OCC (Optimistic Concurrency Control) based on the timeline to ensure data consistency,
+integrity and correctness between multiple writers. OCC detects conflicts at Hudi's file group level, i.e., two
+concurrent writers updating the same file group are detected as a conflict. Currently, conflict detection is
+performed after data writing is completed and before the commit metadata is written. If any conflict is detected,
+cluster resources are wasted because computing and writing have already finished.
+
+To solve this problem, this RFC proposes an early conflict detection mechanism 
to detect the conflict during the data
+writing phase and abort the writing early if conflict is detected, using 
Hudi's marker mechanism. Before writing each
+data file, the writer creates a corresponding marker to mark that the file is 
created, so that the writer can use the
+markers to automatically clean up uncommitted data in failure and rollback 
scenarios. We propose to use the markers
+to identify the conflict at the file group level while writing data. There are 
some subtle differences in early conflict
+detection workflow between different types of marker maintainers. For direct 
markers, hoodie lists necessary marker
+files directly and does conflict checking before the writers create markers 
and before starting to write corresponding
+data file. For the timeline-server based markers, hoodie just gets the result 
of marker conflict checking before the
+writers create markers and before starting to write corresponding data 
files. The conflicts are asynchronously and
+periodically checked so that the writing conflicts can be detected as early as 
possible. Both writers may still write
+the data files of the same file slice, until the conflict is detected in the 
next round of checking.
+
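
The direct-marker flow described in the paragraph above can be sketched as follows. This is a minimal in-memory model, not Hudi's implementation: the class `EarlyConflictSketch`, the method `tryCreateMarker`, and the writer and file group identifiers are all hypothetical names introduced for illustration. It also assumes the check and the marker creation happen atomically (here via `synchronized`), which a listing of marker files on storage would not guarantee by itself.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of marker-based early conflict detection.
// None of these names are Hudi classes or APIs.
class EarlyConflictSketch {
    // writer id -> file group ids for which that writer has already created markers
    private final Map<String, Set<String>> markersByWriter = new HashMap<>();

    /**
     * Runs the conflict check before creating the marker.
     * Returns false if another writer already holds a marker for the same
     * file group, meaning the caller should abort before writing the data file.
     */
    synchronized boolean tryCreateMarker(String writerId, String fileGroupId) {
        for (Map.Entry<String, Set<String>> entry : markersByWriter.entrySet()) {
            boolean otherWriter = !entry.getKey().equals(writerId);
            if (otherWriter && entry.getValue().contains(fileGroupId)) {
                return false; // conflict found early; no data file has been written yet
            }
        }
        // No conflicting marker: record this writer's marker and let the write proceed.
        markersByWriter.computeIfAbsent(writerId, k -> new HashSet<>()).add(fileGroupId);
        return true;
    }
}
```

In this model, a second writer asking for a marker on a file group that the first writer already marked is rejected immediately, while a request for a different file group succeeds.
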
+What's more, Hoodie can stop writing earlier because of early conflict 
detection and release the resources to the cluster,
+improving resource utilization.
+
+Note that the early conflict detection proposed by this RFC operates within 
OCC. Any conflict detection outside the
+scope of OCC is not handled. For example, current OCC for multiple writers 
cannot detect the conflict if two concurrent
+writers perform INSERT operations for the same set of record keys, because the 
writers write to different file groups.
+This RFC does not intend to address this problem.
+
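
The scope limitation stated above can be illustrated with a small sketch, assuming a conflict is modeled simply as two writers touching at least one common file group. The class `FileGroupConflictSketch`, the method `conflicts`, and the file group ids are hypothetical and only serve the illustration.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical model: a conflict exists only when two writers touch a common file group.
class FileGroupConflictSketch {
    static boolean conflicts(Set<String> fileGroupsOfWriterA, Set<String> fileGroupsOfWriterB) {
        // Collections.disjoint returns true when the two sets share no element.
        return !Collections.disjoint(fileGroupsOfWriterA, fileGroupsOfWriterB);
    }
}
```

Two writers that both update `fg-1` are reported as conflicting. Two writers that INSERT the same record keys but land in different new file groups, say `fg-7` and `fg-8`, are not, which is the case this RFC leaves out of scope.
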
+## Background
+
+As we know, transactions and multi-writers of data lakes are becoming the key 
characteristics of building Lakehouse
+these days. Quoting this inspiring blog Lakehouse Concurrency Control: 
Are we too optimistic? directly:
+https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic/
+
+> "Hudi implements a file level, log based concurrency control protocol on the 
Hudi timeline, which in-turn relies
+> on bare minimum atomic puts to cloud storage. By building on an event log as 
the central piece for inter process
+> coordination, Hudi is able to offer a few flexible deployment models that 
offer greater concurrency over pure OCC
+> approaches that just track table snapshots."
+
+In the multi-writer scenario, Hudi's existing conflict detection occurs after 
the writer finishes writing the data and
+before committing the metadata. In other words, the writer just detects the 
occurrence of the conflict when it starts to
+commit, although all calculations and data writing have been completed, which 
causes a waste of resources.
+
+For example:
+
+Now there are two writing jobs: job1 writes 10M data to the Hudi table, 
including updates to file group 1. Another job2
+writes 100G to the Hudi table, and also updates the same file group 1.
+
+Job1 finishes and commits to Hudi successfully. After a few hours, job2 
finishes writing data files (100G) and starts to
+commit metadata. At this time, a conflict with job1 is found, and job2 has 
to be aborted and re-run after failure.
+Obviously, a lot of computing resources and time are wasted for job2.
+
+Hudi currently has two important mechanisms, marker mechanism and heartbeat 
mechanism:
+
+1. The marker mechanism tracks all the files that are part of an active write.
+2. The heartbeat mechanism tracks all active writers to a Hudi table.
+
+Based on marker and heartbeat, this RFC proposes a new conflict detection: 
Early Conflict Detection. Before the writer
+creates the marker and before it starts to write the file, Hudi performs this 
new conflict detection, trying to detect
+the writing conflict directly (for direct markers) or get the async conflict 
check

[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer

2022-09-29 Thread GitBox


yihua commented on code in PR #6003:
URL: https://github.com/apache/hudi/pull/6003#discussion_r983928225


##
rfc/rfc-56/rfc-56.md:
##
@@ -0,0 +1,235 @@
+
+# RFC-56: Early Conflict Detection For Multi-writer
+
+
+## Proposers
+
+- @zhangyue19921010
+
+## Approvers
+ - @yihua
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-1575
+
+
+## Abstract
+
+At present, Hudi implements an OCC (Optimistic Concurrency Control) based on 
timeline to ensure data
+consistency, integrity and correctness between multi-writers. However, the 
related conflict detection is performed
+before commit metadata and after the data writing is completed. If any 
conflict is detected, it leads to a waste
+of cluster resources because computing and writing were finished already. To 
solve this problem, this RFC proposes an
+early conflict detection mechanism based on the existing Hudi marker 
mechanism. There are some subtle differences in
+early conflict detection work flow between different types of marker 
maintainers.
+
+
+For direct markers, hoodie lists necessary marker files directly and do 
conflict checking before the writers creating
+markers and before starting to write corresponding data file. For the 
timeline-server based markers, hoodie just gets the
+result of marker conflict checking before the writers creating markers and 
before starting to write corresponding
+data files. The conflicts are asynchronously and periodically checked so that 
the writing conflicts can be detected as
+early as possible. Both writers may still write the data files of the same 
file slice, until the conflict is detected
+in the next round of checking.
+
+What's more? Hoodie can stop writing earlier because of early conflict 
detection and release the resources to cluster,
+improving resource utilization.
+

Review Comment:
   We can add a section to clarify the scope in detail.



##
rfc/rfc-56/rfc-56.md:
##
@@ -0,0 +1,235 @@
+
+# RFC-56: Early Conflict Detection For Multi-writer
+
+
+## Proposers
+
+- @zhangyue19921010
+
+## Approvers
+ - @yihua
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-1575
+
+
+## Abstract
+
+At present, Hudi implements an OCC (Optimistic Concurrency Control) based on 
timeline to ensure data
+consistency, integrity and correctness between multi-writers. However, the 
related conflict detection is performed
+before commit metadata and after the data writing is completed. If any 
conflict is detected, it leads to a waste
+of cluster resources because computing and writing were finished already. To 
solve this problem, this RFC proposes an
+early conflict detection mechanism based on the existing Hudi marker 
mechanism. There are some subtle differences in
+early conflict detection work flow between different types of marker 
maintainers.
+
+
+For direct markers, hoodie lists necessary marker files directly and do 
conflict checking before the writers creating

Review Comment:
   nit: `do` -> `does`



##
rfc/rfc-56/rfc-56.md:
##
@@ -0,0 +1,235 @@
+
+# RFC-56: Early Conflict Detection For Multi-writer
+
+
+## Proposers
+
+- @zhangyue19921010
+
+## Approvers
+ - @yihua
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-1575
+
+
+## Abstract
+
+At present, Hudi implements an OCC (Optimistic Concurrency Control) based on 
timeline to ensure data
+consistency, integrity and correctness between multi-writers. However, the 
related conflict detection is performed
+before commit metadata and after the data writing is completed. If any 
conflict is detected, it leads to a waste
+of cluster resources because computing and writing were finished already. To 
solve this problem, this RFC proposes an
+early conflict detection mechanism based on the existing Hudi marker 
mechanism. There are some subtle differences in
+early conflict detection work flow between different types of marker 
maintainers.
+
+
+For direct markers, hoodie lists necessary marker files directly and do 
conflict checking before the writers creating
+markers and before starting to write corresponding data file. For the 
timeline-server based markers, hoodie just gets the
+result of marker conflict checking before the writers creating markers and 
before starting to write corresponding
+data files. The conflicts are asynchronously and periodically checked so that 
the writing conflicts can be detected as
+early as possible. Both writers may still write the data files of the same 
file slice, until the conflict is detected
+in the next round of checking.
+
+What's more? Hoodie can stop writing earlier because of early conflict 
detection and release the resources to cluster,
+improving resource utilization.
+
+## Background
+As we know, transactions and multi-writers of data lakes are becoming the key 
characteristics of building Lakehouse
+these days. Quoting this inspiring blog Lakehouse Concurrency Control: 
Are we too optimistic? directly: 
+https://hudi

[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer

2022-07-06 Thread GitBox


yihua commented on code in PR #6003:
URL: https://github.com/apache/hudi/pull/6003#discussion_r915474413


##
rfc/rfc-56/rfc-56.md:
##
@@ -0,0 +1,231 @@
+
+# RFC-56: Early Conflict Detection For Multi-writer
+
+
+
+## Proposers
+
+- @zhangyue19921010
+
+## Approvers
+ - @yihua
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-1575
+
+
+## Abstract
+
+At present, Hudi implements an optimized occ mechanism based on timeline to 
ensure data consistency, integrity and 
+correctness between multi-writers. However, the related conflict detection is 
performed before commit metadata and 
+after the data writing is completed. If this detection was failed, it would 
lead to a waste of cluster resources 
+because computing and writing were finished already. To solve this problem, 
this RFC design an early conflict detection 
+mechanism based on the existing Hudi marker mechanism. This new mechanism will 
do conflict checking before the writers 
+creating markers and before starting to write corresponding data file. So that 
the writing conflicts can be detected as 
+early as possible. What's more? We can stop writing earlier because of early 
conflict detection and release the 
+resources to cluster, improving resource utilization.
+
+## Background
+As we know, Transactions and multi-writers of data lakes are becoming the key 
characteristics of building LakeHouse 
+these days. Quoting this inspiring blog Lakehouse Concurrency Control: 
Are we too optimistic? directly: 
+https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic/
+
+>"Hudi implements a file level, log based concurrency control protocol on the 
Hudi timeline, which in-turn relies 
+> on bare minimum atomic puts to cloud storage. By building on an event log as 
the central piece for inter process 
+> coordination, Hudi is able to offer a few flexible deployment models that 
offer greater concurrency over pure OCC 
+> approaches that just track table snapshots."
+  
+In the multi-writer scenario, Hudi's existing conflict detection occurs after 
the writer finished writing the data 
+and before committing the metadata. In other words, the writer will only 
detect the occurrence of the conflict
+when it starts to commit, although all calculations and data writing have been 
completed, which will cause a waste 
+of resources.
+
+For example:
+
+Now there are two writing jobs: job1 will write 10M data to the Hudi table, 
including update file group 1. 
+Another job2 will write 100G to the Hudi table, and will also update the same 
file group 1. 
+
+Job1 finished and committed to Hudi successfully. After a few hours, job2 
finished writing data files(100G) and start 
+to commit metadata. At this time, a conflict compared with job1 was found, and 
the job2 had to be aborted and re-run 
+after failure. Obviously, a lot of computing resources and time are wasted for 
job2.
+
+
+Hudi currently has two important mechanisms, marker mechanism and heartbeat 
mechanism:
+1. Marker mechanism can track all the files that are part of an active write.
+2. Heartbeat mechanism that can track all active writers to a Hudi table.
+
+
+Based on marker and heartbeat, this RFC design a new conflict detection: Early 
Conflict Detection. 
+Before the writer creates the marker and before it starts to write the file, 
Hudi will perform this new conflict 

Review Comment:
   Similar here for inaccurate description.



##
rfc/rfc-56/rfc-56.md:
##
@@ -0,0 +1,231 @@
+
+# RFC-56: Early Conflict Detection For Multi-writer
+
+
+
+## Proposers
+
+- @zhangyue19921010
+
+## Approvers
+ - @yihua
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-1575
+
+
+## Abstract
+
+At present, Hudi implements an optimized occ mechanism based on timeline to 
ensure data consistency, integrity and 

Review Comment:
   Let's spell out OCC: `optimized occ mechanism` -> `OCC (Optimistic Concurrency Control)`



##
rfc/rfc-56/rfc-56.md:
##
@@ -0,0 +1,231 @@
+
+# RFC-56: Early Conflict Detection For Multi-writer
+
+
+
+## Proposers
+
+- @zhangyue19921010
+
+## Approvers
+ - @yihua
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-1575
+
+
+## Abstract
+
+At present, Hudi implements an optimized occ mechanism based on timeline to 
ensure data consistency, integrity and 
+correctness between multi-writers. However, the related conflict detection is 
performed before commit metadata and 
+after the data writing is completed. If this detection was failed, it would 
lead to a waste of cluster resources 
+because computing and writing were finished already. To solve this problem, 
this RFC design an early conflict detection 
+mechanism based on the existing Hudi marker mechanism. This new mechanism will 
do conflict checking before the writers 
+creating markers and before starting to write corresponding data file. So that 
the writing conflicts can be detected as 
+early as possible. What's