Re: [ovs-dev] [PATCH] ovsdb: raft: Don't forward more than one command to the leader.

Ilya Maximets Mon, 08 Jul 2024 14:53:07 -0700

On 6/28/24 11:43, Ilya Maximets wrote:
> On 6/28/24 09:20, Dumitru Ceara wrote:
>> On 6/27/24 00:02, Ilya Maximets wrote:
>>> Every transaction has RAFT log prerequisites.  Even if transactions
>>> are not related (because RAFT doesn't actually know what data it is
>>> handling).  When leader writes a new record to a RAFT storage, it is
>>> getting appended to the log right away and changes current 'eid',
>>> i.e., changes prerequisites.  The leader will not try to write new
>>> records until the current one is committed, because until then the
>>> pre-check will be failing.
>>>
>>> However, that is different for the follower.  Followers do not add
>>> records to the RAFT log until the leader sends an append request back.
>>> So, if there are multiple transactions pending on a follower, it will
>>> create a command for each of them and prerequisites will be set to the
>>> same values.  All these commands will be sent to the leader, but only
>>> one can succeed at a time, because accepting one command immediately
>>> changes prerequisites and all other commands become non-applicable.
>>> So, out of N commands, 1 will succeed and N - 1 will fail.  The cluster
>>> failure is a transient failure, so the follower will re-process all the
>>> failed transactions and send them again.  1 will succeed and N - 2 will
>>> fail.  And so on, until there are no more transactions.  In the end,
>>> instead of processing N transactions, the follower is performing
>>> N * (N - 1) / 2 transaction processing iterations.  That is consuming
>>> a huge amount of CPU resources completely unnecessarily.
>>>
>>> Since there is no real chance for multiple transactions from the same
>>> follower to succeed, it's better to not send them in the first place.
>>> This also eliminates prerequisite mismatch messages on a leader in
>>> this particular case.
>>>
>>
>> Makes sense!
>>
>>> In a test with 30 parallel shell threads executing 12K transactions
>>> total with separate ovsdb-client calls through the same follower there
>>> is about 60% performance improvement.  The test takes ~100 seconds to
>>> complete without this change and ~40 seconds with this change applied.
>>> The new time is very close to what it takes to execute the same test
>>> through the cluster leader.
>>>
>>
>> Maybe a public link to the test (if possible) would be a nice reference
>> to have in the future?
> 
> I'm not sure where to post it, so I'll just leave it here, and I can
> add a link to this post into the commit message.
> 
> The script is adding 12K ports across 120 logical switches in a style
> of ovn-kubernetes (sets up port security, some external IDs and adds
> to an address set):
> 
> ---
> $ cat ../transaction-benchmark-parallel.sh
> #set -x
> set -o errexit
> 
> # Creating 120 logical switches
> for i in $(seq 120); do
>     ovn-nbctl ls-add ip-10-10-177-$i.us-west-2.compute.internal
> done
> IFS=$'\r\n' GLOBIGNORE='*' \
>     command eval  'uuids=($(ovn-nbctl --bare --columns _uuid list 
> logical_switch | sed "/^$/d"))'
> 
> for i in $(seq 120); do
>     echo "uuid #${i}: ${uuids[$(($i - 1))]}"
> done
> 
> # Creating one addres set
> as_name="a0123456789012345678"
> as_uuid=$(ovn-nbctl create address_set name=${as_name})
> 
> # Adding ports
> function add_400() {
>     rm -rf txn_results$1.txt
>     for i in $(seq $1 $(($1 + 399)) ); do
>         echo $i
>         
> port_name="node-density-00000000-0000-0000-0000-000000000000_node-density-$i"
>         namespace="node-density-00000000-0000-0000-0000-000000000000"
>         ls_name="ip-10-10-177-$((${i} % 120 + 1)).us-west-2.compute.internal"
>         uuid_name="row00000000_0000_0000_0000_000000000000"
>         mac=$(printf '0a:58:0a:9b:%02x:%02x' $((${i} >> 8)) $((${i} & 255)))
>         ip=$(printf "10.11.%d.%d\n" $((${i} / 255)) $((${i} % 255 + 1)))
> 
>         time ovsdb-client transact unix:$(pwd)/sandbox/nb2.ovsdb 
> "[\"OVN_Northbound\",
>         {
>           \"uuid-name\":\"${uuid_name}\",
>           \"row\":{
>               \"name\":\"${port_name}\",
>               \"options\":[\"map\",[[\"requested-chassis\",\"${ls_name}\"]]],
>               \"addresses\":[\"set\",[\"${mac} ${ip}\"]],
>               
> \"external_ids\":[\"map\",[[\"pod\",\"true\"],[\"namespace\",\"${namespace}\"]]],
>               \"port_security\":[\"set\",[\"${mac} ${ip}\"]]
>             },
>           \"op\":\"insert\", \"table\":\"Logical_Switch_Port\"
>         },
>         {
>           \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${uuids[$(($i % 
> 120))]}\"]]],
>           
> \"mutations\":[[\"ports\",\"insert\",[\"set\",[[\"named-uuid\",\"${uuid_name}\"]]]]],
>           \"op\":\"mutate\", \"table\":\"Logical_Switch\"
>         },
>         {
>           \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${as_uuid}\"]]],
>           \"mutations\":[[\"addresses\",\"insert\",[\"set\",[\"${ip}\"]]]],
>           \"op\":\"mutate\", \"table\":\"Address_Set\"
>         }]" >> txn_results$1.txt
>     done
>     echo "done" >> txn_results$1.txt
> }
> 
> rm -f txn_results*.txt
> 
> n=0
> for i in $(seq 0 400 11600); do
>     n=$(($n + 1))
>     add_400 $i &
> done
> 
> while [ $(grep 'done' txn_results*.txt | wc -l) != $n ]; do
>     sleep 1
> done
> ---
> 
> It is not the most generic script and it is designed for sandbox,
> where nb1 is a leader by default:
> 
> # make sandbox SANDBOXFLAGS="--nbdb-model=clustered"
> # pkill ovn-northd # to remove the noise
> # time bash ../transaction-benchmark-parallel.sh
> 
> Note that the script doesn't perform a lot of correctness checks, so
> it's better to examine the txn_results*.txt files, or check the
> database content afterwards to be sure.
> 
>>
>>> Note: prerequisite failures on a leader are still possible, but mostly
>>> in a case of simultaneous transactions from different followers.  It's
>>> a normal thing for a distributed database due to its nature.
>>>
>>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>>> ---
>>>  ovsdb/raft.c        | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>>>  ovsdb/raft.h        |  2 +-
>>>  ovsdb/storage.c     |  9 +++++----
>>>  ovsdb/storage.h     |  5 ++++-
>>>  ovsdb/transaction.c |  6 +-----
>>>  5 files changed, 55 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>>> index ac3d37ac4..116924058 100644
>>> --- a/ovsdb/raft.c
>>> +++ b/ovsdb/raft.c
>>> @@ -2307,12 +2307,55 @@ raft_get_eid(const struct raft *raft, uint64_t 
>>> index)
>>>      return &raft->snap.eid;
>>>  }
>>>  
>>> -const struct uuid *
>>> +static const struct uuid *
>>>  raft_current_eid(const struct raft *raft)
>>>  {
>>>      return raft_get_eid(raft, raft->log_end - 1);
>>>  }
>>>  
>>> +bool
>>> +raft_precheck_prereq(const struct raft *raft, const struct uuid *prereq)
>>> +{
>>> +    if (!uuid_equals(raft_current_eid(raft), prereq)) {
>>> +        VLOG_DBG("%s: prerequisites (" UUID_FMT ") "
>>> +                 "do not match current eid (" UUID_FMT ")",
>>> +                 __func__, UUID_ARGS(prereq),
>>> +                 UUID_ARGS(raft_current_eid(raft)));
>>> +        return false;
>>> +    }
>>> +
>>> +    /* Having incomplete commands on a follower means that the leader has
>>> +     * these commands and they will change the prerequisites once added to
>>> +     * the leader's log.
>>> +     *
>>> +     * There is a chance that all these commands will actually fail and the
>>> +     * record with current prerequisites will in fact succeed, but, since
>>> +     * these are our own commands, the chances are low.
>>> +     *
>>> +     * Incomplete commands on a leader will not change the leader's current
>>> +     * 'eid' on commit as they are already part of the leader's log. */
>>> +    if (raft->role != RAFT_LEADER && hmap_count(&raft->commands)) {
>>> +        struct raft_command *cmd;
>>> +
>>> +        HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
>>
>> Nit: I'd rewrite this as:
>>
>> if (raft->role == RAFT_LEADER) {
>>     return true;
>> }
>>
>> HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
>>     ...
>> }
> 
> Makes sense, I can do that con commit, unless there will be further comments.
> 
>>
>>> +            /* Skip commands that are already part of the log (have 
>>> non-zero
>>> +             * index) and ones that do not carry any data (have zero 
>>> 'eid'),
>>> +             * as they can't change prerequisites.
>>> +             *
>>> +             * Database will not re-run triggers unless the data changes or
>>> +             * one of the data-carrying triggers completes.  So, pre-check 
>>> must
>>> +             * not fail if there are no outstanding data-carrying 
>>> commands. */
>>> +            if (!cmd->index && !uuid_is_zero(&cmd->eid)) {
>>> +                VLOG_DBG("%s: follower still has an incomplete command "
>>> +                         UUID_FMT, __func__, UUID_ARGS(&cmd->eid));
>>> +                return false;
>>> +            }
>>> +        }
>>> +    }
>>> +
>>> +    return true;
>>> +}
>>> +
>>>  static struct raft_command *
>>>  raft_command_create_completed(enum raft_command_status status)
>>>  {
>>> diff --git a/ovsdb/raft.h b/ovsdb/raft.h
>>> index a5b55d9bf..5833aaf23 100644
>>> --- a/ovsdb/raft.h
>>> +++ b/ovsdb/raft.h
>>> @@ -189,5 +189,5 @@ struct ovsdb_error *raft_store_snapshot(struct raft *,
>>>  void raft_take_leadership(struct raft *);
>>>  void raft_transfer_leadership(struct raft *, const char *reason);
>>>  
>>> -const struct uuid *raft_current_eid(const struct raft *);
>>> +bool raft_precheck_prereq(const struct raft *, const struct uuid *prereq);
>>>  #endif /* lib/raft.h */
>>> diff --git a/ovsdb/storage.c b/ovsdb/storage.c
>>> index 6c395106c..c5aec5459 100644
>>> --- a/ovsdb/storage.c
>>> +++ b/ovsdb/storage.c
>>> @@ -661,11 +661,12 @@ ovsdb_storage_write_schema_change(struct 
>>> ovsdb_storage *storage,
>>>      return w;
>>>  }
>>>  
>>> -const struct uuid *
>>> -ovsdb_storage_peek_last_eid(struct ovsdb_storage *storage)
>>> +bool
>>> +ovsdb_storage_precheck_prereq(const struct ovsdb_storage *storage,
>>> +                              const struct uuid *prereq)
>>>  {
>>>      if (!storage->raft) {
>>> -        return NULL;
>>> +        return true;
>>>      }
>>> -    return raft_current_eid(storage->raft);
>>> +    return raft_precheck_prereq(storage->raft, prereq);
>>>  }
>>> diff --git a/ovsdb/storage.h b/ovsdb/storage.h
>>> index 05f40ce93..7079ea261 100644
>>> --- a/ovsdb/storage.h
>>> +++ b/ovsdb/storage.h
>>> @@ -96,6 +96,9 @@ struct ovsdb_storage *ovsdb_storage_open_standalone(const 
>>> char *filename,
>>>                                                      bool rw);
>>>  struct ovsdb_schema *ovsdb_storage_read_schema(struct ovsdb_storage *);
>>>  
>>> -const struct uuid *ovsdb_storage_peek_last_eid(struct ovsdb_storage *);
>>> +/* Checks that there is a chance for a record with specified prerequisites
>>> + * to be successfully written to the storage. */
>>> +bool ovsdb_storage_precheck_prereq(const struct ovsdb_storage *,
>>> +                                   const struct uuid *prereq);
>>>  
>>>  #endif /* ovsdb/storage.h */
>>> diff --git a/ovsdb/transaction.c b/ovsdb/transaction.c
>>> index 484a88e1c..65eca6478 100644
>>> --- a/ovsdb/transaction.c
>>> +++ b/ovsdb/transaction.c
>>> @@ -1277,11 +1277,7 @@ struct ovsdb_txn_progress {
>>>  bool
>>>  ovsdb_txn_precheck_prereq(const struct ovsdb *db)
>>>  {
>>> -    const struct uuid *eid = ovsdb_storage_peek_last_eid(db->storage);
>>> -    if (!eid) {
>>> -        return true;
>>> -    }
>>> -    return uuid_equals(&db->prereq, eid);
>>> +    return ovsdb_storage_precheck_prereq(db->storage, &db->prereq);
>>>  }
>>
>> Aside from the comment above, the patch looks good to me.  If you
>> address my comment.. or if you decide to dismiss it too :) then:
>>
>> Acked-by: Dumitru Ceara <dce...@redhat.com>


I made the change above and applied the patch.  Thanks!

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH] ovsdb: raft: Don't forward more than one command to the leader.

Reply via email to