zenichev left a comment (kamailio/kamailio#4539)
@miconda
Sorry for the late reply, I had to reconsider a few things.
> The switch to hash table is not related to the reload issue, just considered
> a better design to store the items?
The hash table abstraction, together with the locks, fulfills two goals:
- a design improvement
- a robust and convenient way to introduce locks (we control them at the
upper level)
I could have introduced the locks on the linked lists directly, but that would
have been much more cumbersome.
The hash-table-based model is already common; it is used in several other
modules.
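For illustration, a minimal sketch of what I mean by the abstraction, assuming
a lock owned by the table itself (names are hypothetical and the lock
granularity in the PR may differ):

```c
#include <string.h>
#include "../../core/mem/shm_mem.h"
#include "../../core/locking.h"

typedef struct trusted_item {
	struct trusted_item *next;
	/* ... matching data: src_ip, proto, pattern, ... */
} trusted_item_t;

typedef struct trusted_htable {
	gen_lock_t lock;          /* guards all buckets of this table */
	unsigned int size;        /* number of buckets */
	trusted_item_t **buckets; /* bucket array, allocated with the table */
} trusted_htable_t;

static trusted_htable_t *htable_new(unsigned int size)
{
	trusted_htable_t *ht;

	ht = shm_malloc(sizeof(*ht) + size * sizeof(trusted_item_t *));
	if(ht == NULL)
		return NULL;
	memset(ht, 0, sizeof(*ht) + size * sizeof(trusted_item_t *));
	ht->size = size;
	ht->buckets = (trusted_item_t **)(ht + 1);
	lock_init(&ht->lock);
	return ht;
}

/* all traversal/modification goes through the table's own lock, so
 * callers never have to manage locks on the linked lists directly */
static void htable_walk(trusted_htable_t *ht, unsigned int slot)
{
	trusted_item_t *it;

	lock_get(&ht->lock);
	for(it = ht->buckets[slot]; it != NULL; it = it->next) {
		/* ... match against the item ... */
	}
	lock_release(&ht->lock);
}
```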
> Although not the main author of the module, I think the secondary table was
> kept (at least for a while) to ensure that SIP worker processes finish
> matching on the old active list when a reload is completed.
Yes, that is an important point. I assume the concept of a cursor (a pointer
to the currently active table) and the two table versions, latest (newest)
and outdated, were introduced so that one table can be written to while the
other is used by SIP workers; the tables are then swapped. It feels a little
dated, but it works.
**The biggest drawback of that is**: it does not account for a multi-process
system (e.g. RPC handlers). Multiple reloads can hit the system at once, and
nothing stops those processes from racing.
> could it happen that the SIP worker is still walking the old-old-data, which
> has to be destroyed by the 2nd reloading process more or less at the same
> time?
**Short answer**: the permissions module already has an issue when two or
more processes start the reload simultaneously.
**Longer answer**:
Let me try to explain how it works now (even without the locks patch).
The case:
- the DB backend gets updated with newly provisioned trusted sources
- P1 fires the RPC reload, which is now ongoing
- the DB backend receives further updates (more newly provisioned trusted
sources)
- P2 fires the RPC reload and collides with P1
What is going to happen:
- before any update starts: table 1 is the newest version (the global pointer
references it), table 2 is the oldest version
- now the updates start (P1 first)
- P1 sees that table 1 is the active one, takes table 2 (for updates), with no
locks, and starts the update (the global pointer is not touched yet)
- P2 arrives with a newer reload and sees the same situation as P1; with no
locks, nothing prevents it from emptying the table (cleaning the list) while
P1 has not yet finished its update
This is a segfault, either because of the changed list (iterator
invalidation) or because of hitting a corrupted pointer (rewritten by the
parallel process) while performing read/write operations through it. We have
experienced this case quite regularly. (To reproduce it, one needs fairly
large trusted tables, so that each reload takes a noticeable amount of time.)
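The problematic interleaving, condensed (illustrative pseudo-calls in
comments, not the actual module functions):

```c
/*
 *  P1: idx = select_inactive();  // cursor unchanged -> picks table 2
 *  P1: empty_table(2);           // starts rebuilding table 2
 *  P2: idx = select_inactive();  // cursor still unchanged -> table 2 again!
 *  P2: empty_table(2);           // frees list nodes P1 is still using
 *  P1: append_item(2, ...);      // read/write through a freed/stale
 *                                // pointer -> segfault
 */
```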
**What this PR does not do:**
This is still not solved by the current PR, because the PR only introduces
locks at the scope of the hash tables themselves and does not cover the
reload as a whole (the upper level). It only solves the case where two
processes simultaneously try to iterate the content (within one table).
**What we should do better to improve this change:**
Now that I understand the solution I uploaded is not ideal, I will have to
extend it a little.
I will additionally have to introduce **an exclusive lock** for the reload
function (`reload_trusted_table()`).
Its purpose is to prevent P2 from selecting the very same table while P1 is
still doing updates.
P2 will then always select the older table, and never unexpectedly clean the
latest (active) one.
What I mean:
- P1 acquires the locks of table 1, cleans its content and re-initializes the
buckets
- P1 starts filling the table with new content
- P2 arrives and waits for the lock
- P1 releases the lock once the table is filled and points the global pointer
to table 1
- P2 acquires the lock and cleans the content (<- here, one step before P2
selects the table for updates, is where we need the exclusive lock, which
serves only `reload_trusted_table()`)
FYI: `reload_trusted_table()` is the only multi-process weak point of the
module, as far as I can see.
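A minimal sketch of that exclusive lock, using the generic locking API (the
wrapper name is hypothetical; the lock would be allocated with `lock_alloc()`
and initialized at module init):

```c
#include "../../core/locking.h"

int reload_trusted_table(void); /* the existing reload function */

static gen_lock_t *reload_lock; /* lock_alloc() + lock_init() at mod_init */

static int reload_trusted_table_exclusive(void)
{
	int ret;

	lock_get(reload_lock);        /* only one reloader proceeds at a time */
	ret = reload_trusted_table(); /* table selection + refill, serialized */
	lock_release(reload_lock);

	return ret;
}
```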
Additionally, it is worth having an **atomic bool** for the case when P3, P4,
etc. arrive and also try to acquire the exclusive lock.
If a process is already waiting to acquire the lock and provide fresher
updates, there is no need to let P3, P4, ... do the same. The **atomic bool**
is therefore not obligatory, but it will make the reload process more
efficient, because there is less duplicated work.
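A sketch of that guard, here with C11 atomics for illustration (Kamailio's
own atomic ops could be used instead; names are hypothetical and the flag
must live in shared memory):

```c
#include <stdatomic.h>
#include "../../core/locking.h"

int reload_trusted_table(void); /* the existing reload function */

static gen_lock_t *reload_lock;     /* the exclusive lock from above */
static atomic_bool *reload_pending; /* "a reloader is already queued" */

static int rpc_reload_sketch(void)
{
	_Bool expected = 0;
	int ret;

	/* if a process is already queued behind the lock, it will load the
	 * newest DB state anyway - P3, P4, ... can simply return early */
	if(!atomic_compare_exchange_strong(reload_pending, &expected, 1))
		return 0;

	lock_get(reload_lock);
	atomic_store(reload_pending, 0); /* we are the active reloader now */
	ret = reload_trusted_table();
	lock_release(reload_lock);

	return ret;
}
```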
> Also, I haven't gotten (yet) the idea with the dummy item. Normally the
> bucket is its own structure
I should have clarified this better. In short, I wanted to stay with the
dummy-head design (each bucket always owns a permanent, pre-initialized head
node, and list items always come after it) for the sake of stable head
pointers in the buckets, which gives a simpler iteration implementation and
fewer re-allocations. So it is only about speed, and it is definitely not
obligatory to keep it.
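What I meant, as a sketch (illustrative names):

```c
/* dummy-head design: each bucket permanently owns a head node, real
 * items always come after it, so the head pointer of a bucket never
 * changes and insertion/iteration need no empty-bucket special case */
typedef struct litem {
	struct litem *next;
	/* payload for real items; unused in the dummy head */
} litem_t;

typedef struct lbucket {
	litem_t head; /* permanent, pre-initialized dummy node */
} lbucket_t;

static void insert_front(lbucket_t *b, litem_t *it)
{
	it->next = b->head.next; /* no NULL-head special case */
	b->head.next = it;
}

static void iterate(lbucket_t *b)
{
	litem_t *it;

	for(it = b->head.next; it != NULL; it = it->next) {
		/* ... match against the payload ... */
	}
}
```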
I get your idea: you are talking rather about semantic clarity, i.e. a bucket
item must always hold data, never be a dummy. That is possible; I can rework
it. It will take some time, but I do not see it as a big deal.
---
**My current proposal for further improvements is:**
- introduce an exclusive lock only for the scope of `reload_trusted_table()`
(solves the sudden table clean)
- introduce an atomic bool (improves efficiency: no duplicated reloads)
- introduce the same concept for the address table (so that the trusted and
address table implementations are unified)
Please let me know what you think about my proposal, and let's settle on what
we should do about this rework.
Then I can invest my time in finalizing this PR.