On 6/28/24 11:43, Ilya Maximets wrote: > On 6/28/24 09:20, Dumitru Ceara wrote: >> On 6/27/24 00:02, Ilya Maximets wrote: >>> Every transaction has RAFT log prerequisites. Even if transactions >>> are not related (because RAFT doesn't actually know what data it is >>> handling). When leader writes a new record to a RAFT storage, it is >>> getting appended to the log right away and changes current 'eid', >>> i.e., changes prerequisites. The leader will not try to write new >>> records until the current one is committed, because until then the >>> pre-check will be failing. >>> >>> However, that is different for the follower. Followers do not add >>> records to the RAFT log until the leader sends an append request back. >>> So, if there are multiple transactions pending on a follower, it will >>> create a command for each of them and prerequisites will be set to the >>> same values. All these commands will be sent to the leader, but only >>> one can succeed at a time, because accepting one command immediately >>> changes prerequisites and all other commands become non-applicable. >>> So, out of N commands, 1 will succeed and N - 1 will fail. The cluster >>> failure is a transient failure, so the follower will re-process all the >>> failed transactions and send them again. 1 will succeed and N - 2 will >>> fail. And so on, until there are no more transactions. In the end, >>> instead of processing N transactions, the follower is performing >>> N * (N - 1) / 2 transaction processing iterations. That is consuming >>> a huge amount of CPU resources completely unnecessarily. >>> >>> Since there is no real chance for multiple transactions from the same >>> follower to succeed, it's better to not send them in the first place. >>> This also eliminates prerequisite mismatch messages on a leader in >>> this particular case. >>> >> >> Makes sense! >> >>> In a test with 30 parallel shell threads executing 12K transactions >>> total with separate ovsdb-client calls through the same follower there >>> is about 60% performance improvement. The test takes ~100 seconds to >>> complete without this change and ~40 seconds with this change applied. >>> The new time is very close to what it takes to execute the same test >>> through the cluster leader. >>> >> >> Maybe a public link to the test (if possible) would be a nice reference >> to have in the future? > > I'm not sure where to post it, so I'll just leave it here, and I can > add a link to this post into the commit message. > > The script is adding 12K ports across 120 logical switches in a style > of ovn-kubernetes (sets up port security, some external IDs and adds > to an address set): > > --- > $ cat ../transaction-benchmark-parallel.sh > #set -x > set -o errexit > > # Creating 120 logical switches > for i in $(seq 120); do > ovn-nbctl ls-add ip-10-10-177-$i.us-west-2.compute.internal > done > IFS=$'\r\n' GLOBIGNORE='*' \ > command eval 'uuids=($(ovn-nbctl --bare --columns _uuid list > logical_switch | sed "/^$/d"))' > > for i in $(seq 120); do > echo "uuid #${i}: ${uuids[$(($i - 1))]}" > done > > # Creating one addres set > as_name="a0123456789012345678" > as_uuid=$(ovn-nbctl create address_set name=${as_name}) > > # Adding ports > function add_400() { > rm -rf txn_results$1.txt > for i in $(seq $1 $(($1 + 399)) ); do > echo $i > > port_name="node-density-00000000-0000-0000-0000-000000000000_node-density-$i" > namespace="node-density-00000000-0000-0000-0000-000000000000" > ls_name="ip-10-10-177-$((${i} % 120 + 1)).us-west-2.compute.internal" > uuid_name="row00000000_0000_0000_0000_000000000000" > mac=$(printf '0a:58:0a:9b:%02x:%02x' $((${i} >> 8)) $((${i} & 255))) > ip=$(printf "10.11.%d.%d\n" $((${i} / 255)) $((${i} % 255 + 1))) > > time ovsdb-client transact unix:$(pwd)/sandbox/nb2.ovsdb > "[\"OVN_Northbound\", > { > \"uuid-name\":\"${uuid_name}\", > \"row\":{ > \"name\":\"${port_name}\", > \"options\":[\"map\",[[\"requested-chassis\",\"${ls_name}\"]]], > \"addresses\":[\"set\",[\"${mac} ${ip}\"]], > > \"external_ids\":[\"map\",[[\"pod\",\"true\"],[\"namespace\",\"${namespace}\"]]], > \"port_security\":[\"set\",[\"${mac} ${ip}\"]] > }, > \"op\":\"insert\", \"table\":\"Logical_Switch_Port\" > }, > { > \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${uuids[$(($i % > 120))]}\"]]], > > \"mutations\":[[\"ports\",\"insert\",[\"set\",[[\"named-uuid\",\"${uuid_name}\"]]]]], > \"op\":\"mutate\", \"table\":\"Logical_Switch\" > }, > { > \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${as_uuid}\"]]], > \"mutations\":[[\"addresses\",\"insert\",[\"set\",[\"${ip}\"]]]], > \"op\":\"mutate\", \"table\":\"Address_Set\" > }]" >> txn_results$1.txt > done > echo "done" >> txn_results$1.txt > } > > rm -f txn_results*.txt > > n=0 > for i in $(seq 0 400 11600); do > n=$(($n + 1)) > add_400 $i & > done > > while [ $(grep 'done' txn_results*.txt | wc -l) != $n ]; do > sleep 1 > done > --- > > It is not the most generic script and it is designed for sandbox, > where nb1 is a leader by default: > > # make sandbox SANDBOXFLAGS="--nbdb-model=clustered" > # pkill ovn-northd # to remove the noise > # time bash ../transaction-benchmark-parallel.sh > > Note that the script doesn't perform a lot of correctness checks, so > it's better to examine the txn_results*.txt files, or check the > database content afterwards to be sure. > >> >>> Note: prerequisite failures on a leader are still possible, but mostly >>> in a case of simultaneous transactions from different followers. It's >>> a normal thing for a distributed database due to its nature. >>> >>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org> >>> --- >>> ovsdb/raft.c | 45 ++++++++++++++++++++++++++++++++++++++++++++- >>> ovsdb/raft.h | 2 +- >>> ovsdb/storage.c | 9 +++++---- >>> ovsdb/storage.h | 5 ++++- >>> ovsdb/transaction.c | 6 +----- >>> 5 files changed, 55 insertions(+), 12 deletions(-) >>> >>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c >>> index ac3d37ac4..116924058 100644 >>> --- a/ovsdb/raft.c >>> +++ b/ovsdb/raft.c >>> @@ -2307,12 +2307,55 @@ raft_get_eid(const struct raft *raft, uint64_t >>> index) >>> return &raft->snap.eid; >>> } >>> >>> -const struct uuid * >>> +static const struct uuid * >>> raft_current_eid(const struct raft *raft) >>> { >>> return raft_get_eid(raft, raft->log_end - 1); >>> } >>> >>> +bool >>> +raft_precheck_prereq(const struct raft *raft, const struct uuid *prereq) >>> +{ >>> + if (!uuid_equals(raft_current_eid(raft), prereq)) { >>> + VLOG_DBG("%s: prerequisites (" UUID_FMT ") " >>> + "do not match current eid (" UUID_FMT ")", >>> + __func__, UUID_ARGS(prereq), >>> + UUID_ARGS(raft_current_eid(raft))); >>> + return false; >>> + } >>> + >>> + /* Having incomplete commands on a follower means that the leader has >>> + * these commands and they will change the prerequisites once added to >>> + * the leader's log. >>> + * >>> + * There is a chance that all these commands will actually fail and the >>> + * record with current prerequisites will in fact succeed, but, since >>> + * these are our own commands, the chances are low. >>> + * >>> + * Incomplete commands on a leader will not change the leader's current >>> + * 'eid' on commit as they are already part of the leader's log. */ >>> + if (raft->role != RAFT_LEADER && hmap_count(&raft->commands)) { >>> + struct raft_command *cmd; >>> + >>> + HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) { >> >> Nit: I'd rewrite this as: >> >> if (raft->role == RAFT_LEADER) { >> return true; >> } >> >> HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) { >> ... >> } > > Makes sense, I can do that con commit, unless there will be further comments. > >> >>> + /* Skip commands that are already part of the log (have >>> non-zero >>> + * index) and ones that do not carry any data (have zero >>> 'eid'), >>> + * as they can't change prerequisites. >>> + * >>> + * Database will not re-run triggers unless the data changes or >>> + * one of the data-carrying triggers completes. So, pre-check >>> must >>> + * not fail if there are no outstanding data-carrying >>> commands. */ >>> + if (!cmd->index && !uuid_is_zero(&cmd->eid)) { >>> + VLOG_DBG("%s: follower still has an incomplete command " >>> + UUID_FMT, __func__, UUID_ARGS(&cmd->eid)); >>> + return false; >>> + } >>> + } >>> + } >>> + >>> + return true; >>> +} >>> + >>> static struct raft_command * >>> raft_command_create_completed(enum raft_command_status status) >>> { >>> diff --git a/ovsdb/raft.h b/ovsdb/raft.h >>> index a5b55d9bf..5833aaf23 100644 >>> --- a/ovsdb/raft.h >>> +++ b/ovsdb/raft.h >>> @@ -189,5 +189,5 @@ struct ovsdb_error *raft_store_snapshot(struct raft *, >>> void raft_take_leadership(struct raft *); >>> void raft_transfer_leadership(struct raft *, const char *reason); >>> >>> -const struct uuid *raft_current_eid(const struct raft *); >>> +bool raft_precheck_prereq(const struct raft *, const struct uuid *prereq); >>> #endif /* lib/raft.h */ >>> diff --git a/ovsdb/storage.c b/ovsdb/storage.c >>> index 6c395106c..c5aec5459 100644 >>> --- a/ovsdb/storage.c >>> +++ b/ovsdb/storage.c >>> @@ -661,11 +661,12 @@ ovsdb_storage_write_schema_change(struct >>> ovsdb_storage *storage, >>> return w; >>> } >>> >>> -const struct uuid * >>> -ovsdb_storage_peek_last_eid(struct ovsdb_storage *storage) >>> +bool >>> +ovsdb_storage_precheck_prereq(const struct ovsdb_storage *storage, >>> + const struct uuid *prereq) >>> { >>> if (!storage->raft) { >>> - return NULL; >>> + return true; >>> } >>> - return raft_current_eid(storage->raft); >>> + return raft_precheck_prereq(storage->raft, prereq); >>> } >>> diff --git a/ovsdb/storage.h b/ovsdb/storage.h >>> index 05f40ce93..7079ea261 100644 >>> --- a/ovsdb/storage.h >>> +++ b/ovsdb/storage.h >>> @@ -96,6 +96,9 @@ struct ovsdb_storage *ovsdb_storage_open_standalone(const >>> char *filename, >>> bool rw); >>> struct ovsdb_schema *ovsdb_storage_read_schema(struct ovsdb_storage *); >>> >>> -const struct uuid *ovsdb_storage_peek_last_eid(struct ovsdb_storage *); >>> +/* Checks that there is a chance for a record with specified prerequisites >>> + * to be successfully written to the storage. */ >>> +bool ovsdb_storage_precheck_prereq(const struct ovsdb_storage *, >>> + const struct uuid *prereq); >>> >>> #endif /* ovsdb/storage.h */ >>> diff --git a/ovsdb/transaction.c b/ovsdb/transaction.c >>> index 484a88e1c..65eca6478 100644 >>> --- a/ovsdb/transaction.c >>> +++ b/ovsdb/transaction.c >>> @@ -1277,11 +1277,7 @@ struct ovsdb_txn_progress { >>> bool >>> ovsdb_txn_precheck_prereq(const struct ovsdb *db) >>> { >>> - const struct uuid *eid = ovsdb_storage_peek_last_eid(db->storage); >>> - if (!eid) { >>> - return true; >>> - } >>> - return uuid_equals(&db->prereq, eid); >>> + return ovsdb_storage_precheck_prereq(db->storage, &db->prereq); >>> } >> >> Aside from the comment above, the patch looks good to me. If you >> address my comment.. or if you decide to dismiss it too :) then: >> >> Acked-by: Dumitru Ceara <dce...@redhat.com>
I made the change above and applied the patch. Thanks! Best regards, Ilya Maximets. _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev