Hi,
We recently hit a weird issue that corrupted the dialog table of our servers
after an instance ran out of shared memory.
The following memory errors appeared in the logs:
```
xxxxxxx[6439]: WARNING:core:fm_malloc: Not enough free memory, will attempt defragmentation.
xxxxxxx[6439]: ERROR:tm:sip_msg_cloner: no more share memory
xxxxxxx[6439]: ERROR:tm:new_t: out of mem
xxxxxxx[6439]: ERROR:tm:t_newtran: new_t failed
xxxxxxx[6450]: WARNING:core:fm_malloc: Not enough free memory, will attempt defragmentation.
xxxxxxx[6450]: ERROR:tm:build_local: no more share memory
xxxxxxx[6450]: ERROR:tm:send_ack: failed to build ACK
xxxxxxx[6450]: ERROR:tm:reply_received: failed to send ACK (local=no)
xxxxxxx[6450]: ERROR:dialog:push_reply_in_dialog: missing TAG param in TO hdr :-/
```
Afterwards, a lot of duplicated dialogs were present in the dialog table. I am
sure these duplicates were added after the memory errors occurred. After
analyzing the content of the dialog table, I found that most calls had
duplicated dialogs as follows: one initial dialog created from the script (with
a timestamp matching its creation time) and multiple duplicates of this dialog
(with timestamps later than the first error). The duplicates hold the same data
as the initial dialog except for the id (auto increment), timeout and timestamp
columns. Please note that the dlg_id column of the duplicated dialogs was
identical to that of the initial dialog.
We had only just started applying the change that adds the new dlg_id column,
so the id column was still present and defined as the primary key. The dlg_id
column wasn't defined as a primary key, so inserting duplicated dialogs didn't
generate any error on the database side.
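For anyone who wants to check their own table, here is a minimal sketch of the
check I mean. It uses an in-memory SQLite table as a stand-in for the real
dialog table, reduced to a few columns; the sample rows, the `created` column
name and the helper name `find_duplicate_dlg_ids` are made up for illustration.
It also shows that a uniqueness constraint on dlg_id would have rejected the
duplicates instead of storing them silently.

```python
import sqlite3


def find_duplicate_dlg_ids(conn):
    """Return (dlg_id, row_count) for every dlg_id stored more than once."""
    cur = conn.execute(
        "SELECT dlg_id, COUNT(*) FROM dialog "
        "GROUP BY dlg_id HAVING COUNT(*) > 1 ORDER BY dlg_id"
    )
    return cur.fetchall()


# Stand-in for our table during the migration: id is still the primary key,
# dlg_id has no uniqueness constraint (hypothetical, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dialog ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " dlg_id INTEGER NOT NULL,"
    " callid TEXT NOT NULL,"
    " created INTEGER NOT NULL)"
)
rows = [
    (1001, "call-a", 100),  # initial dialog
    (1001, "call-a", 250),  # duplicate written after the first error
    (1001, "call-a", 260),  # another duplicate
    (2002, "call-b", 120),  # healthy dialog
]
conn.executemany(
    "INSERT INTO dialog (dlg_id, callid, created) VALUES (?, ?, ?)", rows
)
duplicates = find_duplicate_dlg_ids(conn)

# Same data model, but with dlg_id as the primary key: the second insert fails.
strict = sqlite3.connect(":memory:")
strict.execute(
    "CREATE TABLE dialog (dlg_id INTEGER PRIMARY KEY, callid TEXT NOT NULL)"
)
strict.execute("INSERT INTO dialog VALUES (1001, 'call-a')")
try:
    strict.execute("INSERT INTO dialog VALUES (1001, 'call-a')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

With dlg_id as the key, the database would have raised an error on the first
duplicate, which would at least have made the problem visible earlier.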
At first I thought the duplicated dialogs had really been created in memory,
but if that were the case they would have different dlg_id values (the
hash_entry would be the same because the CallID is the same, but the hash_id
would be different).
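To illustrate that reasoning, here is a small sketch. It assumes dlg_id packs
hash_entry into the upper 32 bits and hash_id into the lower 32 bits; that
exact layout is my assumption for the example, and the helper names are
hypothetical. The point only needs dlg_id to be derived from both values.

```python
def make_dlg_id(hash_entry: int, hash_id: int) -> int:
    """Combine both values into one 64-bit id (assumed layout)."""
    return ((hash_entry & 0xFFFFFFFF) << 32) | (hash_id & 0xFFFFFFFF)


def split_dlg_id(dlg_id: int) -> tuple:
    """Recover (hash_entry, hash_id) from a packed dlg_id."""
    return dlg_id >> 32, dlg_id & 0xFFFFFFFF


# Two dialogs created in memory for the same CallID: same hash_entry,
# but each new dialog in that entry gets its own hash_id.
first = make_dlg_id(hash_entry=1234, hash_id=1)
second = make_dlg_id(hash_entry=1234, hash_id=2)

same_entry = split_dlg_id(first)[0] == split_dlg_id(second)[0]
same_dlg_id = first == second
```

Here `same_entry` is true while `same_dlg_id` is false, which is why rows
sharing one dlg_id look like the same in-memory dialog written several times.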
This scenario is very bad: our monitoring system detected that opensips was not
responding and therefore tried to restart it. However, there were more than
300K dialogs in the table (only around 5K were good dialogs), and the
load_dialog_from_db function that is executed at startup took too much time and
memory. During this time opensips wasn't able to respond to incoming requests,
so the monitoring system kept restarting it again and again.
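As a stopgap before a restart, the duplicates could be purged so that startup
only has to load one row per dialog. The sketch below is an assumption on my
side, not something opensips provides: it keeps the row with the lowest id for
each dlg_id (the initial dialog) and deletes the rest. It is written against
an in-memory SQLite stand-in with a simplified schema; another database engine
may need the DELETE formulated differently.

```python
import sqlite3


def drop_duplicate_dialogs(conn):
    """Keep the lowest id per dlg_id, delete the rest, return rows deleted."""
    cur = conn.execute(
        "DELETE FROM dialog WHERE id NOT IN "
        "(SELECT MIN(id) FROM dialog GROUP BY dlg_id)"
    )
    conn.commit()
    return cur.rowcount


# Hypothetical, simplified dialog table with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dialog ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " dlg_id INTEGER NOT NULL,"
    " callid TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO dialog (dlg_id, callid) VALUES (?, ?)",
    [(1001, "call-a"), (1001, "call-a"), (1001, "call-a"), (2002, "call-b")],
)

deleted = drop_duplicate_dialogs(conn)
remaining = conn.execute(
    "SELECT id, dlg_id FROM dialog ORDER BY id"
).fetchall()
```

This only removes the symptom; it does not explain why the duplicates were
written in the first place.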
I examined the code to understand what may have caused the duplication, but I
didn't find anything. I am sure the timer process inserted each of the
duplicated dialogs separately, since the auto_increment primary key is
different for each one.
Regards,
Mickael
---
Reply to this email directly or view it on GitHub:
https://github.com/OpenSIPS/opensips/issues/311