zenichev left a comment (kamailio/kamailio#4539)

@miconda 

Sorry for the late reply, I had to re-consider certain things.

> The switch to hash table is not related to the reload issue, just considered 
> a better design to store the items?

The hash table abstraction, combined with the locks, fulfills two goals:
- a design improvement
- a robust and convenient way to introduce locks (we control them at the upper 
level)

I could have introduced the locks on the linked lists directly, but that would 
have been much more cumbersome.
The hash table model is already common and is used in several other modules.

> Although not the main author of the module, I think the secondary table was 
> kept (at least for a while) to ensure that SIP worker processes finish 
> matching on the old active list when a reload is completed.

Yes, that is an important point. I assume the concept of a cursor (a pointer 
to the currently active table), with two possible versions of the tables, the 
latest (newest) and the outdated one, was introduced so that one table can be 
written to while the other is used by SIP workers. The tables are then 
swapped. It feels a little old-fashioned, but it works.

**The biggest drawback of that is**: it does not account for multiple 
processes (e.g. RPC handlers) acting concurrently.
Multiple reloads can hit the system at once, and nothing stops those processes 
from racing.

> could it happen that the SIP worker is still walking the old-old-data, which 
> has to be destroyed by the 2nd reloading process more or less at the same 
> time?

**Short answer**: the permissions module already has an issue when two or more 
processes start a reload simultaneously.

**Longer answer**:
Let me try to explain how it works now (even without the locks patch).

The case:
- the DB backend gets updated with newly provisioned trusted sources
- P1 fires the RPC reload, which is ongoing
- the DB backend now gets more updates, i.e. other newly provisioned trusted 
sources
- P2 fires the RPC reload and collides with P1

What is going to happen:
- before any update starts: table 1 is the newest version (the global pointer 
refers to it), table 2 is the oldest version
- now the updates start (first P1)
- P1 sees that table 1 is the active one and takes table 2 (for updates); no 
locks, it starts the update (the global pointer is not touched yet)
- P2 arrives with a newer reload and sees the same situation as P1: no locks, 
and nothing prevents it from emptying the table (i.e. cleaning the list) while 
P1 has not finished its update yet

This results in a segfault, either because of the modified list (iterator 
invalidation) or because a read/write operation goes through a corrupted 
pointer (rewritten by the parallel process). So far, we have experienced this 
case quite regularly. (To reproduce it, one needs fairly large trusted tables, 
so that each reload takes a noticeable amount of time.)

**What this PR does not do:**

This issue is still not solved by the current PR, because the PR only 
introduces locks for the scope of the hash tables themselves and does not 
cover the reload as a whole (the upper level). So it only solves the case 
where two processes simultaneously try to iterate the content of one table.

**What we should better do to improve this change:**

Now that I understand the solution I uploaded is not ideal, I will have to 
extend it a little.

I will additionally have to introduce **an exclusive lock** for the reload 
function (`reload_trusted_table()`).
Its purpose is to prevent P2 from selecting the very same table that P1 is 
already updating.
P2 will then always select the older table, instead of suddenly cleaning the 
latest (active) one.

What I mean:
- P1 acquires the locks of table 1, cleans its content and re-initializes the 
buckets
- P1 starts filling the table with the new content
- P2 arrives and waits to acquire the lock
- P1 releases the lock after the table is filled and points the global pointer 
at table 1
- P2 acquires the lock and cleans the content (<- here, one step before P2 
selects the table for updates, is where we need the exclusive lock, which 
serves only `reload_trusted_table()`)

F.y.i.: `reload_trusted_table()` is the only multi-process weak point of the 
module, as far as I can see.

Additionally, it is worth having an **atomic bool** for the case where P3, P4, 
etc. arrive and also try to acquire the exclusive lock.
If a process is already waiting to acquire the lock and provide fresher 
updates, there is no need to let P3, P4, ... do the same. So the **atomic 
bool** is not obligatory, but it will make the reload process more efficient 
by avoiding duplicated work.

> Also, I haven't gotten (yet) the idea with the dummy item. Normally the 
> bucket is its own structure

I should have clarified this better. In short, I wanted to keep the dummy-head 
design
(where each bucket always owns a permanent, pre-initialized head node, and 
list items always come after it)
for the sake of stable head pointers in the buckets, which gives a simpler 
iteration implementation and fewer re-allocations. So it is only about speed, 
and it is definitely not obligatory.

I get your idea; you are talking rather about semantic clarity:
a bucket item must always hold data, never be a dummy, etc.

This is possible; I can rework it. It will take some time, but I don't see it 
as a big deal.

---

**My current proposal for the further improvements is:**
- introduce an exclusive lock scoped only to `reload_trusted_table()` 
(prevents a sudden table clean)
- introduce an atomic bool (avoids duplicated reloads)
- apply the same concept to the address table (so that the trusted and address 
table implementations are unified)

Please let me know what you think about this proposal, and let's decide what 
we should do about this rework.
Then I can invest the time to finalize this PR.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/pull/4539#issuecomment-3785133654