Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/5294 to look at the new patch set (#20).

Change subject: KUDU-798 (part 3) Replica transactions must start/abort on the consensus update thread
......................................................................

KUDU-798 (part 3) Replica transactions must start/abort on the consensus update thread

In order for a consensus replica to safely move "safe" time, it needs to know that all transactions that come before it have at least started.

One way to do this is to call transaction_->Start() in transaction_driver_->Init(). We know that, for non-leader transactions, Init() is called by the thread updating consensus. This should be ok as the leader has already correctly serialized the transactions.

However, if transactions are now started on Init() then, in the case they abort (i.e. when transaction_driver_->ReplicationFinished() is called with a non-ok status, again done by the thread updating consensus), we must make sure that the transaction is actually removed from mvcc before that method returns. Otherwise consensus might call Start() on another transaction with the same timestamp before the failed transaction is removed from mvcc.

To do this, this patch adds a way to release only the mvcc txn, since it's the only thing we care about and Prepare() might still be acquiring row locks.

I ran the new raft_consensus-itest that includes unique/duplicate key workloads and exactly once semantics in dist-test with asan, slow mode and 1 stress thread. 3/1000 tests failed. This is in line with the baseline from the patch that changed the itest.

Results of 1000 loop runs of raft_consensus-itest on dist-test (slow mode, 1 stress thread):

TSAN-prev (16/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480814757.31031
TSAN-this (10/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480812541.27015

I inspected the TSAN failures and they were mostly ~DnsResolver() races in the test code itself.
The tests themselves passed for the most part. Some ended up timing out with archives that are too big to download. This also happens in the previous patch.

ASAN-prev (02/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480817418.1140
ASAN-this (03/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480811163.22558

I inspected the ASAN failures. Two of them are test-only flakes. One of them is worrying as the consensus queue remains full, suggesting a deadlock, but this also happens on the previous patch. Example:
https://kudu-test-results.s3.amazonaws.com/david.alves.1480817418.1140.36139843995986a54aaf9f533a00111f384e85a6.145.0-artifacts.zip?Signature=PjjGYVvmWns4t0M7j6%2FAZQYVj34%3D&Expires=1480904781&AWSAccessKeyId=AKIAJ2NR2VXMAHTVLMRA

Change-Id: Ie360e597eea86551c453717d7a1a000848027f4c
---
M src/kudu/tablet/transactions/transaction.h
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
5 files changed, 60 insertions(+), 40 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/94/5294/20
--
To view, visit http://gerrit.cloudera.org:8080/5294
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie360e597eea86551c453717d7a1a000848027f4c
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
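
Illustrative note on the ordering argument in the commit message: the following is a minimal, single-threaded C++ sketch, not the code in this patch. ToyMvcc, ToyReplicaDriver and RetryWithSameTimestampSucceeds are hypothetical names invented for the illustration and are not Kudu classes. The sketch only models the one property the message describes: if a replica transaction is registered with mvcc in Init(), then a failed ReplicationFinished() has to remove it from mvcc before returning, so that another transaction with the same timestamp can be started afterwards.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-in for an MVCC manager: it only tracks the
// timestamps of transactions that have started and not yet finished.
class ToyMvcc {
 public:
  // Returns false if a transaction with this timestamp is already in flight.
  bool Start(int64_t ts) { return in_flight_.insert(ts).second; }
  // Removes the transaction with this timestamp, if present.
  void Abort(int64_t ts) { in_flight_.erase(ts); }
  bool IsInFlight(int64_t ts) const { return in_flight_.count(ts) > 0; }

 private:
  std::set<int64_t> in_flight_;
};

// Hypothetical stand-in for a replica transaction driver. Both methods
// are assumed to be called by the thread that updates consensus.
class ToyReplicaDriver {
 public:
  ToyReplicaDriver(ToyMvcc* mvcc, int64_t ts) : mvcc_(mvcc), ts_(ts) {}

  // Registers the transaction with mvcc as soon as the driver is set up.
  bool Init() {
    started_ = mvcc_->Start(ts_);
    return started_;
  }

  // On failure, releases only the mvcc registration before returning.
  // Anything else (e.g. row locks) is assumed to be cleaned up later.
  void ReplicationFinished(bool ok) {
    if (!ok && started_) {
      mvcc_->Abort(ts_);
      started_ = false;
    }
  }

 private:
  ToyMvcc* mvcc_;
  int64_t ts_;
  bool started_ = false;
};

// Scenario: a transaction fails replication, then a second transaction
// with the same timestamp is initialized on the same thread.
bool RetryWithSameTimestampSucceeds() {
  ToyMvcc mvcc;
  ToyReplicaDriver first(&mvcc, 10);
  if (!first.Init()) return false;
  first.ReplicationFinished(/*ok=*/false);
  if (mvcc.IsInFlight(10)) return false;
  ToyReplicaDriver second(&mvcc, 10);
  return second.Init();
}
```

In this toy model, dropping the Abort() call from ReplicationFinished() would make the second Init() fail, which corresponds to the conflict the commit message warns about. Real concurrency with the Prepare() thread is deliberately left out.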