[ 
https://issues.apache.org/jira/browse/KUDU-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenxingwuying updated KUDU-3446:
---------------------------------
    Summary: I think we should talk about CommitMsg's order in WAL  (was: I 
think we should talk about CommitMsg's order)

> I think we should talk about CommitMsg's order in WAL
> -----------------------------------------------------
>
>                 Key: KUDU-3446
>                 URL: https://issues.apache.org/jira/browse/KUDU-3446
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: shenxingwuying
>            Assignee: shenxingwuying
>            Priority: Major
>
> h1. Background
> In Kudu, WAL records come in two types: 'replicate' and 'commit'. The 
> 'replicate' records are the Raft log entries; the 'commit' records make 
> durable the opid of each op that has been applied to the Kudu storage engine.
> Currently, ops are applied via 'apply_pool_->Submit()' (i.e. a concurrent 
> thread pool),
> and the apply task mainly runs the following statements:
>  
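> {code:java}
> // op_driver.cc (abridged)
> // Each apply is submitted to the shared, concurrent apply pool.
> apply_pool_->Submit([this]() { this->ApplyTask(); });
>
> void OpDriver::ApplyTask() {
>     CommitMsg* commit_msg;
>     // Apply the op to the storage engine, then append its COMMIT record.
>     Status s = op_->Apply(&commit_msg);
>     log_->AsyncAppendCommit(*commit_msg, ...
> } {code}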
> apply_pool_ is a concurrent thread pool, so ApplyTask invocations run 
> concurrently. As a result, even when Raft log entries satisfy a happens-before 
> relationship, the order in which they are applied to the Kudu storage engine 
> may not satisfy it.
> For example, with 4 log records for 2 ops, we would expect:
> replicate 1.1
> commit 1.1
> replicate 1.2
> commit 1.2
> or
> replicate 1.1
> replicate 1.2
> commit 1.1
> commit 1.2
> An incorrect order (IMO) is:
> replicate 1.1
> replicate 1.2
> commit 1.2
> commit 1.1
> Currently this order is valid in Kudu: the system allows it, as some test 
> cases and the bootstrap processing reflect.
> But it means that, with very high probability, 1.2 becomes valid in the Kudu 
> engine before 1.1, which may not be expected.
>  
>  
> The scenario is simple to reproduce if there are enough WriteRequests. I 
> will write a test for this.
> I obtained a case like this:
> ./bin/kudu wal dump $wal_file | egrep "REPLICATE|COMMIT" | less
> 1.75939@6812005919066001408 REPLICATE WRITE_OP
> 1.75940@6812005919066857472 REPLICATE WRITE_OP
> 1.75941@6812005919067430912 REPLICATE WRITE_OP
> COMMIT 1.75939
> COMMIT 1.75941
> COMMIT 1.75940
> 1.75942@6812005919193690112 REPLICATE WRITE_OP
> COMMIT 1.75942
> 1.75943@6812005919311241216 REPLICATE WRITE_OP
> 1.75944@6812005919312207872 REPLICATE WRITE_OP
> 1.75945@6812005919312932864 REPLICATE WRITE_OP
> 1.75946@6812005919313645568 REPLICATE WRITE_OP
> COMMIT 1.75943
> COMMIT 1.75945
> COMMIT 1.75944
> COMMIT 1.75946
> 1.75947@6812005919354585088 REPLICATE WRITE_OP
> COMMIT 1.75947
> 1.75948@6812005919430410240 REPLICATE WRITE_OP
> 1.75949@6812005919431192576 REPLICATE WRITE_OP
> 1.75950@6812005919431778304 REPLICATE WRITE_OP
> COMMIT 1.75948
> COMMIT 1.75950
> COMMIT 1.75949
> We can see that these COMMIT records are out of order:
> COMMIT 1.75939
> COMMIT 1.75941
> COMMIT 1.75940
> and
> COMMIT 1.75943
> COMMIT 1.75945
> COMMIT 1.75944
> and
> COMMIT 1.75948
> COMMIT 1.75950
> COMMIT 1.75949
> h1. Motivation
> I think the correct order should satisfy the following invariants, where 
> a < b means that a precedes b in the WAL:
> r: replicate
> c: commit
> e[i]: the pair of replicate and commit records for the op at index i.
>  # r(e[i]) < r(e[i+1]): this is Raft's requirement.
>  # r(e[i]) < c(e[i]): this is obvious.
>  # c(e[i]) < c(e[i+1]): this should hold for the same reason as 1.
> The Raft log is a total order on the server side, and the Kudu storage engine 
> is the state machine, so the apply order should be the same as the Raft log 
> order.
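> The three invariants above can be checked mechanically. The following is a 
> minimal sketch, not Kudu code: WalRecord, RecordType and 
> FirstViolatedInvariant are hypothetical names, and the model assumes a single 
> term, so a record is identified by its op index alone.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical, simplified model of a WAL record (not Kudu's LogEntryPB).
enum class RecordType { kReplicate, kCommit };

struct WalRecord {
  RecordType type;
  int64_t index;  // op index within one term, e.g. 2 for opid 1.2
};

// Scans 'records' in WAL order and returns 0 if all three invariants hold,
// otherwise the number (1, 2 or 3) of the first invariant that is violated.
int FirstViolatedInvariant(const std::vector<WalRecord>& records) {
  int64_t last_replicate = -1;   // assumption: real indexes are >= 0
  int64_t last_commit = -1;
  std::set<int64_t> replicated;  // indexes whose REPLICATE was already seen

  for (const WalRecord& r : records) {
    assert(r.index >= 0);
    if (r.type == RecordType::kReplicate) {
      // Invariant 1: r(e[i]) < r(e[i+1]).
      if (r.index <= last_replicate) return 1;
      last_replicate = r.index;
      replicated.insert(r.index);
    } else {
      // Invariant 2: r(e[i]) < c(e[i]).
      if (replicated.count(r.index) == 0) return 2;
      // Invariant 3: c(e[i]) < c(e[i+1]).
      if (r.index <= last_commit) return 3;
      last_commit = r.index;
    }
  }
  return 0;
}
```

> For the orders listed in the Background section, this returns 0 for both 
> expected sequences and 3 for replicate 1.1, replicate 1.2, commit 1.2, 
> commit 1.1.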
> h1. Solution
> I think we should use an 'apply_pool_token_' with SERIAL_MODE, 
> created from apply_pool_, instead of using 'apply_pool_' directly. If we do 
> this, some test cases will have to be fixed at the same time.
>  
> First, we should discuss what I have described above and whether it is 
> correct.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
