[ 
https://issues.apache.org/jira/browse/KUDU-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenxingwuying updated KUDU-3446:
---------------------------------
    Description: 
h1. Background

In Kudu, WAL records come in two types: 'replicate' and 'commit'. A 
'replicate' record is the Raft log entry; a 'commit' record durably records 
that the op with that opid has been applied to the Kudu storage engine.

Currently, ops are applied through 'apply_pool_->Submit()' (i.e. a concurrent 
thread pool), and the apply task mainly runs the following statements:

 
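{code:java}
// op_driver.cc
// Each op's apply step is submitted to the shared, concurrent apply pool.
apply_pool_->Submit([this]() { this->ApplyTask(); });

void OpDriver::ApplyTask() {
    CommitMsg* commit_msg;
    // Apply the op to the storage engine and obtain its commit message.
    Status s = op_->Apply(&commit_msg);
    // Append the COMMIT record to the WAL asynchronously.
    log_->AsyncAppendCommit(*commit_msg, ...
}
{code}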
apply_pool_ is a concurrent thread pool, so ApplyTask invocations run 
concurrently. Even when Raft log entries satisfy a happens-before 
relationship, applying them to the Kudu storage engine may not preserve that 
order.

For example, for 4 log records of 2 ops, we expect:
replicate 1.1
commit 1.1
replicate 1.2
commit 1.2
or
replicate 1.1
replicate 1.2
commit 1.1
commit 1.2

An incorrect order (IMO) is:
replicate 1.1
replicate 1.2
commit 1.2
commit 1.1
Currently, this order is valid in Kudu. But it means that 1.2 would, with 
very high probability, become valid in the Kudu engine before 1.1, which is 
not expected.

The scenario is simple to reproduce if there are enough WriteRequests. I will 
write a test for this.

I obtained a case like this:

./bin/kudu wal dump $wal_file | egrep "REPLICATE|COMMIT" | less

1.75939@6812005919066001408 REPLICATE WRITE_OP
1.75940@6812005919066857472 REPLICATE WRITE_OP
1.75941@6812005919067430912 REPLICATE WRITE_OP
COMMIT 1.75939
COMMIT 1.75941
COMMIT 1.75940
1.75942@6812005919193690112 REPLICATE WRITE_OP
COMMIT 1.75942
1.75943@6812005919311241216 REPLICATE WRITE_OP
1.75944@6812005919312207872 REPLICATE WRITE_OP
1.75945@6812005919312932864 REPLICATE WRITE_OP
1.75946@6812005919313645568 REPLICATE WRITE_OP
COMMIT 1.75943
COMMIT 1.75945
COMMIT 1.75944
COMMIT 1.75946
1.75947@6812005919354585088 REPLICATE WRITE_OP
COMMIT 1.75947
1.75948@6812005919430410240 REPLICATE WRITE_OP
1.75949@6812005919431192576 REPLICATE WRITE_OP
1.75950@6812005919431778304 REPLICATE WRITE_OP
COMMIT 1.75948
COMMIT 1.75950
COMMIT 1.75949

We can see the out-of-order COMMIT records:

COMMIT 1.75939
COMMIT 1.75941
COMMIT 1.75940
and
COMMIT 1.75943
COMMIT 1.75945
COMMIT 1.75944
and
COMMIT 1.75948
COMMIT 1.75950
COMMIT 1.75949
h1. Motivation

I think the correct order should satisfy the following invariants, where:
r: replicate
c: commit
e[i]: the pair of replicate and commit records for the op at index i.
 # r(e[i]) < r(e[i+1]): this is Raft's requirement.
 # r(e[i]) < c(e[i]): this is obvious.
 # c(e[i]) < c(e[i+1]): this should hold for the same reason as 1.

The Raft log is a total order on the server side. The Kudu storage engine is 
the state machine, so the apply order should be the same as the Raft log order.
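
To make the three invariants concrete, here is a minimal sketch of a checker that walks a sequence of WAL records in append order and reports the first violation. The names Entry, EntryType and CheckOrder are hypothetical and are not Kudu types; the sketch also assumes all ops belong to a single term, so only the index is compared.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record type for this sketch; not Kudu's log entry type.
enum class EntryType { kReplicate, kCommit };

struct Entry {
  EntryType type;
  int64_t index;  // op index; a single term is assumed
};

// Walks the records in append order. Returns an empty string if all three
// invariants hold, otherwise a description of the first violation.
std::string CheckOrder(const std::vector<Entry>& log) {
  int64_t last_replicate = -1;
  int64_t last_commit = -1;
  for (const Entry& e : log) {
    const std::string idx = std::to_string(e.index);
    if (e.type == EntryType::kReplicate) {
      // Invariant 1: replicate records appear in increasing index order.
      if (e.index <= last_replicate) {
        return "replicate " + idx + " follows replicate " +
               std::to_string(last_replicate);
      }
      last_replicate = e.index;
    } else {
      // Invariant 2: a commit appears after its own replicate. Because
      // replicates are increasing, it is enough to compare with the latest.
      if (e.index > last_replicate) {
        return "commit " + idx + " precedes its replicate";
      }
      // Invariant 3: commit records appear in increasing index order.
      if (e.index <= last_commit) {
        return "commit " + idx + " follows commit " +
               std::to_string(last_commit);
      }
      last_commit = e.index;
    }
  }
  return "";
}
```

Fed the first six records of the dump above (REPLICATE 75939 to 75941, then COMMIT 75939, 75941, 75940), the checker flags the third invariant at COMMIT 75940.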
h1. Solution

I think we should use an 'apply_pool_token_' in SERIAL mode, created from 
apply_pool_, instead of submitting to 'apply_pool_' directly.
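
As a rough illustration of what serial execution buys, the sketch below uses a hypothetical SerialExecutor (one worker draining a FIFO queue) as a stand-in for a serial token. It is not Kudu's ThreadPool or ThreadPoolToken API, and ApplyInOrder is an invented helper; the point is only that tasks submitted in index order also complete in index order.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a serial token: a single worker runs tasks one
// at a time, in submission order.
class SerialExecutor {
 public:
  SerialExecutor() : worker_([this] { Run(); }) {}

  // Drains any queued tasks, then stops the worker.
  ~SerialExecutor() {
    {
      std::lock_guard<std::mutex> l(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  void Submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> l(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // shutting down and queue is drained
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // run outside the lock
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool done_ = false;
  std::thread worker_;  // declared last so the other members exist first
};

// Submits one "apply" task per index in [first, last] and returns the order
// in which the tasks actually ran (standing in for the COMMIT order).
std::vector<int> ApplyInOrder(int first, int last) {
  std::vector<int> commit_order;
  {
    SerialExecutor exec;
    for (int i = first; i <= last; ++i) {
      exec.Submit([&commit_order, i] { commit_order.push_back(i); });
    }
  }  // destructor joins the worker, so commit_order is complete here
  return commit_order;
}
```

The trade-off is that serial execution gives up parallelism between apply tasks, which is part of what should be discussed.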

We should discuss the problem I described above.

 



> I think we should talk about CommitMsg's order
> ----------------------------------------------
>
>                 Key: KUDU-3446
>                 URL: https://issues.apache.org/jira/browse/KUDU-3446
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: shenxingwuying
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
