[jira] [Commented] (KUDU-2958) ClientTest.TestReplicatedTabletWritesWithLeaderElection is flaky

2019-09-25 Thread Adar Dembo (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938266#comment-16938266
 ] 

Adar Dembo commented on KUDU-2958:
--

Good eye; I agree that the second SleepFor was probably keeping the test 
passing.

That said, I'm not so sure why. With one tserver down, the second round of row 
insertion must by necessity update both remaining replicas, so by the time we 
call CountRowsFromClient, all live replicas should have all 200 rows. Maybe 
it's sufficient for the FOLLOWER replica to make the rows durable in its WAL 
and not necessarily apply them to the MRS, and then we choose to scan the 
FOLLOWER? That'd explain the failure.

Anyway, the purpose of the test appears to be to verify that we can write after 
killing the leader, so a LEADER_ONLY scan following the second batch of inserts 
should be fine.
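
To make the hypothesis above concrete, here is a minimal toy model in plain C++. It is not Kudu code: Replica, Selection, CountRows, and MakeLaggingTablet are hypothetical names invented for illustration, and the model assumes that a scan only sees rows a replica has applied, and that a FIRST_REPLICA-style selection may land on a follower. Under those assumptions the follower answers with 100 rows while the leader answers with 200.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a single tablet replica. Hypothetical, not Kudu code.
struct Replica {
  bool is_leader = false;
  std::size_t wal_rows = 0;      // rows made durable in the write-ahead log
  std::size_t applied_rows = 0;  // rows applied in memory, visible to scans

  // Accept replicated rows: they become durable but are not yet applied.
  void Replicate(std::size_t n) { wal_rows += n; }
  // Apply everything that is currently durable.
  void ApplyAll() { applied_rows = wal_rows; }
  // Assumption: a scan of this replica sees only what it has applied.
  std::size_t Scan() const { return applied_rows; }
};

// Hypothetical stand-in for the client's replica selection modes.
enum class Selection { kLeaderOnly, kFirstReplica };

// Counts rows on the replica picked by the selection mode.
// kFirstReplica takes whichever replica is listed first, leader or not.
std::size_t CountRows(const std::vector<Replica>& replicas, Selection sel) {
  assert(!replicas.empty());
  if (sel == Selection::kFirstReplica) {
    return replicas.front().Scan();
  }
  for (const Replica& r : replicas) {
    if (r.is_leader) return r.Scan();
  }
  assert(false && "no leader in the replica list");
  return 0;
}

// Builds the hypothesized state after two batches of 100 rows: the follower
// (listed first) has both batches in its WAL but has applied only the first;
// the leader has applied both.
std::vector<Replica> MakeLaggingTablet() {
  Replica follower;
  follower.Replicate(100);
  follower.ApplyAll();
  follower.Replicate(100);  // durable, not yet applied

  Replica leader;
  leader.is_leader = true;
  leader.Replicate(200);
  leader.ApplyAll();

  return {follower, leader};
}
```

In this model, CountRows(MakeLaggingTablet(), Selection::kFirstReplica) reads the lagging follower, while Selection::kLeaderOnly reads the leader, which is why a leader-only scan would not depend on how quickly followers apply.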

> ClientTest.TestReplicatedTabletWritesWithLeaderElection is flaky
> ----------------------------------------------------------------
>
> Key: KUDU-2958
> URL: https://issues.apache.org/jira/browse/KUDU-2958
> Project: Kudu
> Issue Type: Bug
> Affects Versions: 1.11.0
> Reporter: Alexey Serbin
> Priority: Major
> Attachments: client-test.5.txt.xz
>
>
> The {{TestReplicatedTabletWritesWithLeaderElection}} scenario of 
> {{client-test}} is flaky.  From time to time, in the ASAN build 
> configuration, it fails with the following error:
> {noformat}
> I0924 20:26:19.869351 14037 client-test.cc:4304] Counting rows...
> src/kudu/client/client-test.cc:4308: Failure
>       Expected: 2 * kNumRowsToWrite
>       Which is: 200
> To be equal to: CountRowsFromClient(table.get(), KuduClient::FIRST_REPLICA, KuduScanner::READ_LATEST, kNoBound, kNoBound)
>       Which is: 100
> {noformat}
> It seems there is an implicit assumption in the test about fast propagation 
> of Raft transactions to follower replicas.
> I attached the full log of the failed test scenario.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2958) ClientTest.TestReplicatedTabletWritesWithLeaderElection is flaky

2019-09-25 Thread ZhangYao (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938199#comment-16938199
 ] 

ZhangYao commented on KUDU-2958:


I hit this failure yesterday and looked at the log. I think it is possible 
that the leader hadn't propagated the writes to the followers. What's more, I 
found that the original test had 
{{SleepFor(MonoDelta::FromMilliseconds(1500));}} to wait for the leader and 
followers to sync, and it was removed in 
[https://gerrit.cloudera.org/#/c/14016/]. Do we have another method to 
guarantee that, or was it a mistaken deletion?
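
One generic alternative to a fixed sleep is to poll for the expected condition until a deadline. The sketch below is plain standard C++ and only illustrates the idea: WaitFor is a hypothetical helper, not something taken from the Kudu test code, and the timeout and polling interval are arbitrary choices.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: polls cond() until it returns true or the timeout
// elapses. Returns true if the condition was observed, false on timeout.
bool WaitFor(const std::function<bool()>& cond,
             std::chrono::milliseconds timeout,
             std::chrono::milliseconds interval = std::chrono::milliseconds(10)) {
  assert(cond && "cond must be callable");
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    // Check first, so a condition that already holds returns immediately.
    if (cond()) return true;
    if (std::chrono::steady_clock::now() >= deadline) return false;
    std::this_thread::sleep_for(interval);
  }
}
```

A test could then wait on something like "the scan returns the expected row count" instead of sleeping for a fixed 1500 ms, which returns as soon as the condition holds and fails only if it never does within the timeout.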



