Hi,

We recently hit an issue that corrupted the dialog table of our servers after 
an OpenSIPS instance ran out of shared memory.

The following memory errors appeared in the logs:

```
xxxxxxx[6439]: WARNING:core:fm_malloc: Not enough free memory, will attempt defragmentation.
xxxxxxx[6439]: ERROR:tm:sip_msg_cloner: no more share memory
xxxxxxx[6439]: ERROR:tm:new_t: out of mem
xxxxxxx[6439]: ERROR:tm:t_newtran: new_t failed

xxxxxxx[6450]: WARNING:core:fm_malloc: Not enough free memory, will attempt defragmentation.
xxxxxxx[6450]: ERROR:tm:build_local: no more share memory
xxxxxxx[6450]: ERROR:tm:send_ack: failed to build ACK
xxxxxxx[6450]: ERROR:tm:reply_received: failed to send ACK (local=no)
xxxxxxx[6450]: ERROR:dialog:push_reply_in_dialog: missing TAG param in TO hdr :-/
```

After that, the dialog table contained a lot of duplicated dialogs. I am sure 
these duplicates were added after the memory errors occurred. Analyzing the 
content of the dialog table, I found that most calls had duplicated dialogs as 
follows: one initial dialog created from the script (with a timestamp matching 
the time it was created) and multiple duplicates of this dialog (with 
timestamps later than the first error). The duplicates have the same data as 
the initial dialog except for the id (auto increment), timeout and timestamp 
columns. Please note that the dlg_id column of the duplicated dialogs was 
identical to that of the initial dialog.

We had only just started applying the schema change that adds the new dlg_id 
column, so the id column was still present and defined as the primary key. 
Since the dlg_id column wasn't defined as the primary key, inserting duplicated 
dialogs didn't generate any error on the database side.
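
To illustrate the pattern described above, here is a minimal sketch of how the 
duplicates could be separated from the initial rows once the (id, dlg_id) pairs 
have been read from the table. The function name and the sample values are 
hypothetical, and it assumes the initial dialog always has the lowest 
auto-increment id in its group, which matches what I saw but is not guaranteed.

```python
from collections import defaultdict


def split_duplicates(rows):
    """Split (id, dlg_id) rows into ids to keep and ids that are duplicates.

    Assumption: within each dlg_id group, the row with the lowest
    auto-increment id is the initial dialog and the rest are duplicates.
    """
    ids_by_dlg = defaultdict(list)
    for row_id, dlg_id in rows:
        ids_by_dlg[dlg_id].append(row_id)

    keep, duplicates = [], []
    for ids in ids_by_dlg.values():
        ids.sort()
        keep.append(ids[0])          # initial dialog
        duplicates.extend(ids[1:])   # rows inserted later with the same dlg_id
    return sorted(keep), sorted(duplicates)


# Made-up sample: dlg_id 4294967297 appears three times, 8589934594 once.
rows = [(1, 4294967297), (2, 8589934594), (7, 4294967297), (9, 4294967297)]
keep, duplicates = split_duplicates(rows)
print(keep, duplicates)
```

With a unique constraint on dlg_id, the database itself would have rejected 
the rows that end up in the duplicates list.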

At first I thought the duplicated dialogs were actually created in memory, but 
if that were the case they would have different dlg_id values (the hash_entry 
would be the same because the Call-ID is the same, but the hash_id would differ).
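
The reasoning above can be sketched as follows, under the assumption that 
dlg_id packs hash_entry and hash_id into one 64-bit integer with hash_entry in 
the upper 32 bits. The exact layout should be checked against the dialog module 
source; the helper names here are hypothetical.

```python
def make_dlg_id(hash_entry, hash_id):
    # Assumed layout: hash_entry in the upper 32 bits, hash_id in the lower 32.
    return (hash_entry << 32) | (hash_id & 0xFFFFFFFF)


def split_dlg_id(dlg_id):
    # Inverse of make_dlg_id under the same assumed layout.
    return dlg_id >> 32, dlg_id & 0xFFFFFFFF


# Two dialogs created in memory for the same Call-ID share the hash_entry
# but get distinct hash_id values, so their dlg_id values differ.
first = make_dlg_id(5, 10)
second = make_dlg_id(5, 11)
print(first != second, split_dlg_id(first), split_dlg_id(second))
```

Under that assumption, rows sharing one dlg_id cannot come from separate 
in-memory dialogs, which points at the same dialog being written several times.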

This scenario is very bad: our monitoring system detected that opensips was 
not responding and therefore tried to restart it. However, there were more 
than 300K dialogs in the table (only around 5K were good dialogs), so the 
load_dialog_from_db function that is executed at startup took too much time and 
memory. During that time opensips wasn't able to respond to incoming requests, 
so the monitoring system kept restarting it again and again.

I examined the code to understand what could have caused the duplication but 
didn't find anything. I am sure the timer process added each one of the 
duplicated dialogs, since the auto_increment primary key is different for each 
of them.

Regards,
Mickael

---
Reply to this email directly or view it on GitHub:
https://github.com/OpenSIPS/opensips/issues/311