hudi-bot opened a new issue, #14611:
URL: https://github.com/apache/hudi/issues/14611
h4. We need to allow a writer W writing to file groups f1, f2, f3
concurrently, while a clustering service C reclusters them into f4, f5.
h4. Goals
* Writes can be updates, deletes, or inserts.
* Either the clustering service C or the writer W can finish first.
* Both W and C need to be able to complete their actions without much
redoing of work.
* The number of output file groups for C can be higher or lower than the
number of input file groups.
* Needs to work regardless of whether the writers are operating in OCC or
NBCC mode.
* Needs to interplay well with the cleaning and compaction services.
h4. Non-goals
* Strictly preserving the sort order achieved by clustering in the face of
updates (e.g. updates that change clustering field values can leave the
output clustering file groups not fully sorted by those fields).
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-1045
- Type: Task
- Epic: https://issues.apache.org/jira/browse/HUDI-6640
- Fix version(s):
- 1.2.0
---
## Comments
26/Apr/24 22:00;vinoth;At first it may seem trivial to have clustering fail
all the time, ceding preference to the incoming writes. But, aside from wasting
resources, clustering can finish before the writes, and we cannot atomically
both roll back clustering (note that restoring a completed action is
considered/recommended as offline maintenance) and finish the write.;;;
---
30/Apr/24 17:55;vinoth;h2. [WIP] Approach 1 : Redistribute records from
the conflicting file groups
Within the finalize section (done under a table-level distributed lock), we
could have either W or C perform the following.
{code:java}
W {
- identify the file groups that have been clustered concurrently by C
- read out all records written by W into these conflicting file groups
- redistribute those records based on the new record distribution produced by C
- finalize W
}
{code}
{code:java}
C {
- identify the file groups that have been written to concurrently by W
- read out all records written by W into the conflicting file groups
- redistribute those records based on the new record distribution produced by C
- finalize C
}
{code}
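The "identify the conflicting file groups" step in both sketches amounts to a
set intersection between the file groups W wrote to and the input file groups
of C's clustering plan. A minimal sketch, assuming the two sets of file group
IDs are already known (hypothetical class and method names, not Hudi APIs):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the conflict set between a writer W and a
// clustering plan C is the intersection of the file groups W touched
// and the file groups C took as input.
public class ConflictDetector {
    public static Set<String> conflictingFileGroups(Set<String> writtenByW,
                                                    Set<String> clusteredByC) {
        Set<String> conflicts = new HashSet<>(writtenByW);
        conflicts.retainAll(clusteredByC); // keep only file groups touched by both
        return conflicts;
    }
}
```

When the intersection is empty, neither side needs to redo any work, which is
why this approach is attractive when the overlap between C and W is small.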
h3. Pros:
# Simple to understand/debug; no storage format changes.
# Could work well for cases where the overlap between C and W is rather
small.
# No extra read amplification for queries; W/C absorbs the cost.
h3. Cons:
# Can be pretty wasteful with continuous writers or with high overlap between
C and W, effectively forcing the entire write to be redone (the same as the
writer failing and retrying today).
# Particularly expensive for CoW, where W has already paid the cost of
merging columnar base files with incoming records. ;;;
---
30/Apr/24 17:55;vinoth;h2. [WIP] Approach 2 : Introduce pointer data blocks
into storage format
If we truly wish to achieve independent operation of C and W, while
minimizing the amount of work redone on the writer side, we need to introduce
a notion of "pointer data blocks" (name TBD) and a new "retain" command block
in Hudi's log format.
*Pointer data blocks*
A pointer data block just keeps pointers to other blocks in different file
groups.
{code:java}
pointer data block {
[
{fg1, logfile X, ..},
{fg2, logfile Y, ..},
{fg3, logfile Z, ..}
]
}
{code}
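The layout above could be modeled as a block that carries only references,
never records. A minimal sketch with hypothetical class names (the actual
Hudi log block format would differ):

```java
import java.util.List;

// Hypothetical sketch of a "pointer data block": instead of carrying
// records, the block carries references to log files that were written
// into other file groups (e.g. f4's block pointing back at f1/f2/f3).
public class PointerDataBlock {
    // One pointer: which file group the referenced log file lives in,
    // and which log file it is.
    public record LogFilePointer(String fileGroupId, String logFileName) {}

    private final List<LogFilePointer> pointers;

    public PointerDataBlock(List<LogFilePointer> pointers) {
        this.pointers = List.copyOf(pointers); // immutable once written
    }

    public List<LogFilePointer> getPointers() {
        return pointers;
    }
}
```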
In this approach, instead of redistributing the records from file groups
f1, f2, f3 to f4, f5 - we will
* log pointer data blocks to f4, f5, pointing back to the new log files W
wrote to file groups f1, f2, f3 (X, Y, Z in the example above).
* log retain command blocks to f1 (with logfile X), f2 (with logfile Y), f3
(with logfile Z) to indicate to the cleaner service that these log files are
supposed to be retained for later, so they can be skipped.
* When f4, f5 are compacted, these log files will be reconciled into new
file slices in f4, f5.
* When the file slices in f4 and f5 are cleaned, the pointers are followed
and the retained log files in f1, f2, f3 can also be cleaned. (note: this
needs some reference counting in case f4 is cleaned while f5 is not)
Note that there is a 1:N relationship between pointer data blocks and log
files. E.g. both f4's and f5's pointer data blocks will point to all three
files X, Y, Z.
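The reference counting hinted at above could work as follows: each pointer
data block referencing a retained log file bumps its count, and cleaning a
referencing file slice decrements it; the retained log file is deletable only
at zero. A minimal sketch under that assumption (hypothetical names, not the
actual Hudi cleaner API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of reference counting for retained log files:
// every pointer data block that references a log file increments its
// count; cleaning the file slice holding the pointer block decrements
// it. The retained log file is safe to delete only at count zero.
public class RetainedLogFileRefCounter {
    private final Map<String, Integer> refCounts = new HashMap<>();

    // Called when a pointer data block referencing this log file is logged.
    public void addReference(String logFile) {
        refCounts.merge(logFile, 1, Integer::sum);
    }

    // Called when a referencing file slice is cleaned; returns true when
    // the retained log file itself can now be cleaned.
    public boolean dropReference(String logFile) {
        int remaining = refCounts.merge(logFile, -1, Integer::sum);
        return remaining <= 0;
    }
}
```

This handles the 1:N case directly: since both f4 and f5 reference X, Y, Z,
each of those log files starts at count 2 and survives until both output
file groups have been cleaned.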
A snapshot read of the file groups f4, f5 needs to carefully filter records
to avoid exposing duplicate records to the reader.
* It merges the log files pointed to by pointer data blocks based on keys,
i.e. only deletes/updates records in the base file for keys that match
records in the pointed-to log files.
* Inserts need to be handled with care and need to be distinguishable from
updates; e.g. a record in the pointed-to log files for f4 can be either an
insert or an update to a record that is now clustered into f5.
* The storage format also needs to store a bitmap indicating which records
in the pointed-to log files are inserts vs. updates. Once such a list is
available, the readers of f4/f5 can split the inserts amongst themselves
using a hash mod of the insert key, e.g. f4 gets inserts 0, 2, 4, 6, ... in
log files X, Y, Z, while f5 gets inserts 1, 3, 5, 7, ... in log files X, Y, Z.
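The hash-mod split of inserts could be sketched as follows: the reader of
output file group i keeps only the inserts whose key hashes to i, so each
insert is exposed by exactly one output file group. A minimal sketch,
assuming a plain hash of the record key (the real routing would need to
match whatever key hashing Hudi settles on):

```java
// Hypothetical sketch of splitting inserts among N output file groups:
// the reader of output file group i keeps only the inserts whose record
// key hashes to i, so every insert is exposed by exactly one file group.
public class InsertSplitter {
    public static int ownerIndex(String recordKey, int numOutputFileGroups) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(recordKey.hashCode(), numOutputFileGroups);
    }

    public static boolean ownedBy(String recordKey, int readerIndex,
                                  int numOutputFileGroups) {
        return ownerIndex(recordKey, numOutputFileGroups) == readerIndex;
    }
}
```

The split is deterministic and needs no coordination: every reader applies
the same function to the same keys, so no insert is dropped or duplicated.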
h3. Pros:
* Truly non-blocking: C and W can go at their own pace and complete without
much additional overhead, since the pointer blocks are pretty light to add.
* Works even for high-throughput streaming scenarios where there is not
much time for the writer to be reconciling (e.g. Flink).
h3. Cons:
* More merge cost on the read/query side (although it can be very tolerable
for the same cases approach 1 works well for).
* More complexity and new storage format changes. ;;;
---
21/Nov/24 21:09;yihua;Related ticket: HUDI-8464.
When we support NBCC for concurrent updates and clustering (HUDI-1045), we
also need to keep this in mind for clustering, since there is a time gap
between generating the clustering request time and adding the clustering
requested file to the timeline after clustering planning, i.e., clustering
planning needs to consider the requested time of the clustering too when
generating the list of files to cluster.;;;
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]