[ https://issues.apache.org/jira/browse/KUDU-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
shenxingwuying updated KUDU-3446:
---------------------------------
    Summary: I think we should talk about CommitMsg's order in WAL  (was: I think we should talk about CommitMsg's order)

> I think we should talk about CommitMsg's order in WAL
> -----------------------------------------------------
>
>                 Key: KUDU-3446
>                 URL: https://issues.apache.org/jira/browse/KUDU-3446
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: shenxingwuying
>            Assignee: shenxingwuying
>            Priority: Major
>
> h1. Background
> In Kudu, WAL records have two types: 'replicate' and 'commit'. The
> 'replicate' records are the Raft log entries; the 'commit' records make
> durable which op IDs have been applied to the Kudu storage engine.
> Currently, ops are applied using 'apply_pool_->Submit()' (i.e. a concurrent
> thread pool).
> The apply task mainly runs the following statements:
>
> {code:java}
> // op_driver.cc
> apply_pool_->Submit([this]() { this->ApplyTask(); });
> OpDriver::ApplyTask() {
>   CommitMsg* commit_msg;
>   Status s = op_->Apply(&commit_msg);
>   log_->AsyncAppendCommit(*commit_msg, ...
> }
> {code}
> apply_pool_ is a concurrent thread pool, so ApplyTask invocations run
> concurrently. Even when Raft log entries satisfy a happens-before
> relationship, they may not be applied to the Kudu storage engine in that
> order.
> For example, with 4 log records for 2 ops, we expect either:
> replicate 1.1
> commit 1.1
> replicate 1.2
> commit 1.2
> or
> replicate 1.1
> replicate 1.2
> commit 1.1
> commit 1.2
> An incorrect order (IMO) is:
> replicate 1.1
> replicate 1.2
> commit 1.2
> commit 1.1
> Currently, this order is valid in Kudu: the system allows it, and some
> test cases and the bootstrap processing reflect this.
> But it means that 1.2 can, with very high probability, take effect in the
> Kudu engine before 1.1, which may not be expected.
>
>
> It's simple to reproduce the scenario if there are enough WriteRequests. I
> will write a test for this.
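
The acceptable and questionable orders listed above can be told apart mechanically. The following is a minimal sketch, not Kudu code: `WalRecord`, `RecordType`, and `CommitsFollowReplicateOrder` are hypothetical names introduced for illustration, and each op is identified by a bare non-negative index instead of a full term.index op ID.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified WAL record: only the record type and the op index.
enum class RecordType { kReplicate, kCommit };

struct WalRecord {
  RecordType type;
  int64_t index;  // assumed to be non-negative
};

// Returns true if the sequence satisfies all of the following:
//   1. REPLICATE indexes strictly increase,
//   2. every COMMIT appears after the REPLICATE with the same index,
//   3. COMMIT indexes strictly increase.
bool CommitsFollowReplicateOrder(const std::vector<WalRecord>& records) {
  std::unordered_set<int64_t> replicated;
  int64_t last_replicate = -1;
  int64_t last_commit = -1;
  for (const WalRecord& r : records) {
    if (r.type == RecordType::kReplicate) {
      if (r.index <= last_replicate) return false;  // violates 1
      last_replicate = r.index;
      replicated.insert(r.index);
    } else {
      if (replicated.count(r.index) == 0) return false;  // violates 2
      if (r.index <= last_commit) return false;          // violates 3
      last_commit = r.index;
    }
  }
  return true;
}
```

Under these assumptions, the two expected orders pass the check, while the order `replicate 1.1, replicate 1.2, commit 1.2, commit 1.1` fails on the third condition.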
> I obtained a case like this:
> ./bin/kudu wal dump $wal_file | egrep "REPLICATE|COMMIT" | less
> 1.75939@6812005919066001408 REPLICATE WRITE_OP
> 1.75940@6812005919066857472 REPLICATE WRITE_OP
> 1.75941@6812005919067430912 REPLICATE WRITE_OP
> COMMIT 1.75939
> COMMIT 1.75941
> COMMIT 1.75940
> 1.75942@6812005919193690112 REPLICATE WRITE_OP
> COMMIT 1.75942
> 1.75943@6812005919311241216 REPLICATE WRITE_OP
> 1.75944@6812005919312207872 REPLICATE WRITE_OP
> 1.75945@6812005919312932864 REPLICATE WRITE_OP
> 1.75946@6812005919313645568 REPLICATE WRITE_OP
> COMMIT 1.75943
> COMMIT 1.75945
> COMMIT 1.75944
> COMMIT 1.75946
> 1.75947@6812005919354585088 REPLICATE WRITE_OP
> COMMIT 1.75947
> 1.75948@6812005919430410240 REPLICATE WRITE_OP
> 1.75949@6812005919431192576 REPLICATE WRITE_OP
> 1.75950@6812005919431778304 REPLICATE WRITE_OP
> COMMIT 1.75948
> COMMIT 1.75950
> COMMIT 1.75949
> We can see these out-of-order COMMIT sequences:
> COMMIT 1.75939
> COMMIT 1.75941
> COMMIT 1.75940
> and
> COMMIT 1.75943
> COMMIT 1.75945
> COMMIT 1.75944
> and
> COMMIT 1.75948
> COMMIT 1.75950
> COMMIT 1.75949
> h1. Motivation
> I think the correct order should satisfy the following invariants:
> r: replicate
> c: commit
> e[i]: the pair of replicate and commit records for the op at index i.
> # r(e[i]) < r(e[i+1]): this is Raft's requirement
> # r(e[i]) < c(e[i]): this is obvious
> # c(e[i]) < c(e[i+1]): this should hold for the same reason as 1.
> The Raft log is a total order on the server side. The Kudu storage engine is
> the state machine, and the order in which ops are applied should be the same
> as the order of the Raft log.
> h1. Solution
> I think we should use an 'apply_pool_token_' with SERIAL_MODE,
> created by apply_pool_, instead of 'apply_pool_' itself. If we do this, some
> cases will need to be fixed at the same time.
>
> We should first discuss whether what I described above is correct.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
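
To illustrate the serial-apply idea from the Solution section, here is a minimal sketch. It is not Kudu's thread pool or token implementation: `SerialExecutor` and `ApplySerially` are hypothetical names, and the apply-and-commit step is reduced to appending the op index to a vector. The only property it demonstrates is that tasks submitted to a serial executor run one at a time, in submission order.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a serial token: tasks run one at a time,
// in submission order, on a single worker thread.
class SerialExecutor {
 public:
  SerialExecutor() : worker_([this] { Run(); }) {}

  // Lets the worker drain any queued tasks, then joins it.
  ~SerialExecutor() {
    {
      std::lock_guard<std::mutex> l(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  void Submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> l(mu_);
      queue_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> l(mu_);
    while (true) {
      cv_.wait(l, [this] { return done_ || !queue_.empty(); });
      if (queue_.empty()) return;  // done_ is set and nothing is left to run
      std::function<void()> task = std::move(queue_.front());
      queue_.pop_front();
      l.unlock();
      task();  // run the task without holding the lock
      l.lock();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool done_ = false;
  std::thread worker_;  // declared last so the other members exist first
};

// Submits ops first..last to one serial executor and returns the order in
// which their simulated commit step ran.
std::vector<int> ApplySerially(int first, int last) {
  std::vector<int> commit_order;
  {
    SerialExecutor token;
    for (int i = first; i <= last; ++i) {
      token.Submit([&commit_order, i] { commit_order.push_back(i); });
    }
  }  // the destructor drains the queue and joins the worker
  return commit_order;
}
```

In this sketch, `ApplySerially(1, 5)` yields the indexes in submission order, which is the property invariant 3 asks for. The trade-off in such a design is that ops submitted through the same serial executor no longer apply in parallel.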