Giulio Harding wrote:
Further to this, we've had a few more incidents this past week which
shed a little more light on the problem:
Sometimes, a transmit/receive bind to a carrier (this has happened
with 2 different carriers now) goes strange: MTs will start queuing on
Kannel, seemingly unable to be delivered, but MOs are processed fine.
However, my initial investigation with ethereal doesn't show any
submit_sm attempts for the queued MTs. Also, I don't see any problems
with enquire_links or any (failed) attempts to rebind (e.g. the
transmit bind having problems while the receive bind is ok) - Kannel
indicates that the link is online, and there are no errors specific to
that link in bearerbox.log
Another update:
This problem of binds entering a partially failed state, where Kannel
will not deliver MTs through them, is now occurring quite regularly,
across all our carrier links. Any time we have some sort of link
instability, or temporary connectivity outage, our binds will come up
again, and receive queued MOs from the carriers, but Kannel will not
deliver MTs down those binds. The only solution is to restart Kannel (at
the moment, I'm using the 'restart' HTTP command, and I haven't had
Kannel crash with it, which is good)
To reiterate: there are no errors in bearer.log, and running a tcpdump
for that bind's traffic shows NO submit_sms at all (also, as far as I
can tell, there are no bind enquiries for the transmit bind). Kannel is
not even trying to deliver MTs, it's just queuing the MTs immediately.
It's as if the receive bind is up and running fine, but Kannel hasn't
brought the transmit bind up. (Well, not quite - if a bind is stopped,
and thus in a 'dead' state, Kannel will discard MTs destined for it,
complaining about there not being any valid destination for the MT - in
this case, the MTs aren't discarded, they're queued)
One of the links that's had this problem occur a few times this past
week is configured like so:
# Optus Free MT
group = smsc
smsc = smpp
smsc-id = optusfrmt
host = xxxx
port = 2775
receive-port = 2775
transceiver-mode = false
smsc-username = xxxx
smsc-password = xxxx
system-type = smpp
address-range = 0
source-addr-ton = 0
source-addr-npi = 4
dest-addr-ton = 1
dest-addr-npi = 1
alt-charset="ASCII"
enquire-link-interval = 30
allowed-smsc-id = "optusfrmt"
This seems to be a relatively recent occurance, possibly coinciding with
our use of Kannel CVS (I don't remember experiencing any problems like
this with 1.4.0) and it's quite worrying, and recovering from this state
will affect every other bind we have (interrupting any traffic they are
experiencing)
Summary of related issues:
- After interrupted connectivity or possibly some other network event
causes a re-bind of a particular link, Kannel will indicate that the
link is back online (and will receive MOs), but will not attempt to
deliver MTs - only solution is to restart Kannel.
- Restarting just the affected bind using the stop/start-smsc HTTP
commands causes Kannel to hang.
- Trying to restart Kannel after it enters this hung state will cause a
crash.
At this point, I'd like to find out more about exactly what Kannel is
doing when these events happen - i.e. which of the two binds (transmit +
receive) for a link is actually online? How can I get detailed
information about the links, and/or more verbose logging about
bind/unbind operations for a link? If I can get more information about
what Kannel is doing, then I can investigate further myself...
Any suggestions, or tips?
Thanks,
If I try to use the HTTP stop-smsc command to stop the individual
bind, Kannel appears to hang - the admin webpage doesn't respond, and
Kannel seems to stop processing MOs and MTs. At this point, the only
thing I can do is restart Kannel via its init script. Restarting
Kannel fixes the original problem, and queued MTs are delivered.
However, shutting down Kannel in this 'hung' state seems to trigger a
crash - here are some excerpts from bearerbox.log from the last few
incidents:
...
2006-08-26 01:17:18 [14077] [3] INFO: HTTP: Shutting down smsc-id
`hutchthree'
2006-08-26 01:19:57 [14077] [0] WARNING: Killing signal or HTTP admin
command received, shutting down...
2006-08-26 01:19:57 [14077] [29] INFO: Connection closed by the box
<127.0.0.1>
2006-08-26 01:19:57 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [4] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [4] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [4] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [4] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:19:58 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:19:59 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:20:00 [14077] [15] WARNING: smsbox_list empty!
2006-08-26 01:20:01 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:20:02 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:20:02 [14077] [26] PANIC: gwlib/gw-rwlock.c:137:
gw_rwlock_rdlock: Assertion `lock != NULL' failed.
2006-08-26 01:20:02 [14077] [26] PANIC:
/opt/kannel/sbin/bearerbox(gw_panic+0xa3) [0x80b134b]
2006-08-26 01:20:02 [14077] [26] PANIC:
/opt/kannel/sbin/bearerbox(gw_rwlock_rdlock+0x63) [0x80a8a6b]
2006-08-26 01:20:02 [14077] [26] PANIC:
/opt/kannel/sbin/bearerbox(route_incoming_to_boxc+0x5c) [0x8055208]
2006-08-26 01:20:02 [14077] [26] PANIC:
/opt/kannel/sbin/bearerbox(bb_smscconn_receive+0xe5) [0x80577b1]
2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox
[0x807a44c]
2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox
[0x807b10b]
2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox
[0x80a906e]
2006-08-26 01:20:02 [14077] [26] PANIC: /lib/tls/libpthread.so.0
[0xa1b1d5]
2006-08-26 01:20:02 [14077] [26] PANIC:
/lib/tls/libc.so.6(__clone+0x5a) [0x7782da]
2006-08-26 01:20:18 [11540] [0] INFO: Added logfile
`/home/kannel/log/bearer.log' with level `1'.
2006-08-26 01:20:18 [11540] [0] INFO: Started access logfile
`/home/kannel/log/kannel_sms_traffic.log'.
2006-08-26 01:20:18 [11540] [0] INFO: SSL not supported, no SSL
initialization done.
2006-08-26 01:20:18 [11540] [0] INFO: HTTP: Opening server at port 13000.
2006-08-26 01:20:18 [11540] [0] INFO: BOXC: 'smsbox-max-pending' not
set, using default (100).
...
...
2006-08-26 01:21:37 [11540] [0] WARNING: Killing signal or HTTP admin
command received, shutting down...
2006-08-26 01:21:39 [11540] [4] WARNING: smsbox_list empty!
2006-08-26 01:21:39 [11540] [0] PANIC: Cannot die by its own will
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox(gw_panic+0xa3)
[0x80b134b]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox [0x8050b4e]
2006-08-26 01:21:39 [11540] [0] PANIC: /lib/tls/libpthread.so.0
[0xa218a8]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox [0x807bbd8]
2006-08-26 01:21:39 [11540] [0] PANIC:
./sbin/bearerbox(smsc2_shutdown+0x41) [0x8056e0d]
2006-08-26 01:21:39 [11540] [0] PANIC:
./sbin/bearerbox(bb_shutdown+0x9e) [0x8050e56]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox(main+0x892)
[0x805175a]
2006-08-26 01:21:39 [11540] [0] PANIC:
/lib/tls/libc.so.6(__libc_start_main+0xe3) [0x6c7e33]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox [0x8050a3d]
2006-08-26 01:21:51 [12104] [0] INFO: Added logfile
`/home/kannel/log/bearer.log' with level `1'.
2006-08-26 01:21:51 [12104] [0] INFO: Started access logfile
`/home/kannel/log/kannel_sms_traffic.log'.
2006-08-26 01:21:51 [12104] [0] INFO: SSL not supported, no SSL
initialization done.
2006-08-26 01:21:51 [12104] [0] INFO: HTTP: Opening server at port 13000.
2006-08-26 01:21:51 [12104] [0] INFO: BOXC: 'smsbox-max-pending' not
set, using default (100).
...
...
2006-08-28 14:02:58 [19145] [3] INFO: HTTP: Shutting down smsc-id
`vodafone000'
2006-08-28 14:03:34 [19145] [29] INFO: Connection closed by the box
<127.0.0.1>
2006-08-28 14:03:34 [19145] [0] WARNING: Killing signal or HTTP admin
command received, shutting down...
2006-08-28 14:03:35 [19145] [13] WARNING: smsbox_list empty!
2006-08-28 14:03:35 [19145] [4] WARNING: smsbox_list empty!
2006-08-28 14:03:35 [19145] [26] PANIC: gwlib/gw-rwlock.c:137:
gw_rwlock_rdlock: Assertion `lock != NULL' failed.
2006-08-28 14:03:35 [19145] [26] PANIC:
/opt/kannel/sbin/bearerbox(gw_panic+0xa3) [0x80b134b]
2006-08-28 14:03:35 [19145] [26] PANIC:
/opt/kannel/sbin/bearerbox(gw_rwlock_rdlock+0x63) [0x80a8a6b]
2006-08-28 14:03:35 [19145] [26] PANIC:
/opt/kannel/sbin/bearerbox(route_incoming_to_boxc+0x5c) [0x8055208]
2006-08-28 14:03:35 [19145] [26] PANIC:
/opt/kannel/sbin/bearerbox(bb_smscconn_receive+0xe5) [0x80577b1]
2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox
[0x807a44c]
2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox
[0x807b10b]
2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox
[0x80a906e]
2006-08-28 14:03:35 [19145] [26] PANIC: /lib/tls/libpthread.so.0
[0xa1b1d5]
2006-08-28 14:03:35 [19145] [26] PANIC:
/lib/tls/libc.so.6(__clone+0x5a) [0x7782da]
2006-08-28 14:03:40 [6758] [0] INFO: Added logfile
`/home/kannel/log/bearer.log' with level `1'.
2006-08-28 14:03:40 [6758] [0] INFO: Started access logfile
`/home/kannel/log/kannel_sms_traffic.log'.
2006-08-28 14:03:40 [6758] [0] INFO: SSL not supported, no SSL
initialization done.
2006-08-28 14:03:40 [6758] [0] INFO: HTTP: Opening server at port 13000.
2006-08-28 14:03:40 [6758] [0] INFO: BOXC: 'smsbox-max-pending' not
set, using default (100).
...
This is new behaviour - we haven't seen anything like this since we
started using Kannel (~2 years ago).
The only thing that has changed recently (and even then, it changed a
couple of months ago, and this problem only started a couple of weeks
ago at most) is that we're now using Kannel 'cvs-20060727' instead of
1.4.0 with a few patches (such as improved SMPP throttling, customised
smsbox 'success' HTTP response, etc.) and are now running on Fedora
Core 3 instead of Red Hat Linux 8.0. Also, it's *possible* that the
carriers may have made some changes on their end - one of them made
some mention about upgrading to support SMPP 3.4?
1.) Does anyone have any further ideas what the problem might be?
2.) Can anyone suggest the best way to get more information about
what's happened to the SMPP bind, and what Kannel is actually doing
just before, and during the time when it 'hangs'? I'm not overly
familiar with Kannel internals, or how to go about debugging it, so
any assistance would be greatly appreciated :)
Thanks,
On 24/08/2006, at 5:28 PM, Giulio Harding wrote:
Stuart's just left for the day, but I've picked out line 1811 in
function io_thread in gw/smsc/smsc_smpp.c from the source code we're
using:
...
/* unbind
* Read so long as unbind_resp received or timeout passed.
Otherwise
we have
* double delivered messages.
*/
if (smpp->quitting) {
send_unbind(smpp, conn);
last_response = time(NULL);
while(conn_wait(conn, 1.00) != -1 &&
difftime(time(NULL), last_response) <
SMPP_DEFAULT_SHUTDOW
N_TIMEOUT &&
smpp->conn->status != SMSCCONN_DISCONNECTED) {
if (read_pdu(smpp, conn, &len, &pdu) == 1) {
dump_pdu("Got PDU:", smpp->conn->id, pdu);
handle_pdu(smpp, conn, pdu, &pending_submits);
smpp_pdu_destroy(pdu);
}
}
debug("bb.sms.smpp", 0, "SMPP[%s]: %s: break and
shutting down",
octstr_get_cstr(smpp->conn->id), __func__);
<<<<================== Line1811
break;
}
send_enquire_link(smpp, conn, &last_enquire_sent);
...
Hope this helps...
(just on a side-note, and this may be related: on the day this crash
occured, later at night, we had an incident with the same carrier
where one of our transmit binds to them died - they were rejecting
our bind enquiries and then bind requests claiming we were already
bound - when I tried to restart the individual link using the
stop-smsc and start-smsc HTTP commands, Kannel hung, no longer
processing MOs/MTs, and the admin page not responding. I tried to
stop kannel, in order to restart it, but stopping it caused ~30
bearerbox processes to appear, each of which had to be kill -9'd
before I could restart kannel. Not sure if this would related or not,
but I haven't had a chance to investigate further... just thought I'd
mention it in case...)
Thanks,
Andreas Fink wrote:
I was quickly looking through the CVS code of today and wonder if
this was maybe created by the fact that login is being rejected but
enquire link tries to send its message anyway and then crashes.
Could you tell me exactly on your CVS 20060727 version where is line
1811 in that version?
On 24.08.2006, at 06:03, Mi Reflejo wrote:
Hi,
Which version do you use? If you are not using CVS HEAD please try it.
Regards,
Martin Conte.
On 8/23/06, Stuart Beck <[EMAIL PROTECTED]
<mailto:[EMAIL PROTECTED]>> wrote:
Hi All,
We recently had a situation where one of the carriers we are
connecting to (2 separate SMSC's) had some unspecified issue and
reconnected, this however caused the following panic
as one of the SMSC's had been turned off (upon communicating with
the carrier it's been switched off for about a week, they are no
longer supporting that service)
I would like to know if anyone can give me any indication as to
why the failure occurred and what can be done about it.
The Kannel/host details are as follows
Kannel bearerbox version `cvs-20060727'.
Build `Aug 2 2006 12:31:01', compiler `3.2 20020903 (Red
Hat Linux 8.0 3.2-7)'.
System Linux, release 2.4.20-19.8smp, version #1 SMP Tue
Jul 15 15:01:43 EDT 2003, machine i686.
Libxml version 2.6.9. Using OpenSSL 0.9.6b [engine] 9 Jul
2001. Using native malloc.
> 2006-08-23 14:39:16 [32374] [38] ERROR: SMPP[hutchthree]: I/O
error or other error. Re-connecting.
> 2006-08-23 14:39:16 [32374] [38] ERROR: SMPP[hutchthree]:
Couldn't connect to SMS center (retrying in 10 seconds).
> 2006-08-23 14:39:16 [32448] [40] ERROR: SMPP[hutchorange]: I/O
error or other error. Re-connecting.
> 2006-08-23 14:39:16 [32448] [40] ERROR: SMPP[hutchorange]:
Couldn't connect to SMS center (retrying in 10 seconds).
> 2006-08-23 14:39:16 [32447] [39] ERROR: SMPP[hutchorange]: I/O
error or other error. Re-connecting.
> 2006-08-23 14:39:16 [32447] [39] ERROR: SMPP[hutchorange]:
Couldn't connect to SMS center (retrying in 10 seconds).
> 2006-08-23 14:39:16 [32373] [37] ERROR: SMPP[hutchthree]: I/O
error or other error. Re-connecting.
> 2006-08-23 14:39:16 [32373] [37] ERROR: SMPP[hutchthree]:
Couldn't connect to SMS center (retrying in 10 seconds).
> 2006-08-23 14:39:26 [32448] [40] WARNING: SMPP: PDU NULL
terminated string has no NULL.
> 2006-08-23 14:39:26 [32448] [40] ERROR: SMPP[hutchorange]: SMSC
rejected login to receive, code 0x0000000f (Invalid System ID).
> 2006-08-23 14:39:26 [32448] [40] ERROR: SMPP[hutchorange]: I/O
error or other error. Re-connecting.
> 2006-08-23 14:39:26 [32447] [39] WARNING: SMPP: PDU NULL
terminated string has no NULL.
> 2006-08-23 14:39:26 [32447] [39] ERROR: SMPP[hutchorange]: SMSC
rejected login to transmit, code 0x0000000f (Invalid System ID).
> 2006-08-23 16:24:12 [4905] [3] INFO: HTTP: Re-starting smsc-id
`hutchorange'
> 2006-08-23 16:24:12 [4905] [3] INFO: Set throughput to 13.000
for smsc id <hutchorange>
> 2006-08-23 16:24:12 [4905] [3] INFO: DLR rerouting for smsc id
<hutchorange> disabled.
> 2006-08-23 16:24:13 [16991] [42] WARNING: SMPP: PDU NULL
terminated string has no NULL.
> 2006-08-23 16:24:13 [16991] [42] ERROR: SMPP[hutchorange]: SMSC
rejected login to receive, code 0x0000000f (Invalid System ID).
> 2006-08-23 16:24:13 [16991] [42] ERROR: SMPP[hutchorange]: I/O
error or other error. Re-connecting.
> 2006-08-23 16:24:13 [16990] [41] WARNING: SMPP: PDU NULL
terminated string has no NULL.
> 2006-08-23 16:24:13 [16990] [41] ERROR: SMPP[hutchorange]: SMSC
rejected login to transmit, code 0x0000000f (Invalid System ID).
> 2006-08-23 16:24:38 [4905] [3] INFO: HTTP: Re-starting smsc-id
`hutchorange'
> 2006-08-23 16:24:38 [4905] [3] INFO: Set throughput to 13.000
for smsc id <hutchorange>
> 2006-08-23 16:24:38 [4905] [3] INFO: DLR rerouting for smsc id
<hutchorange> disabled.
> 2006-08-23 16:24:38 [17116] [44] WARNING: SMPP: PDU NULL
terminated string has no NULL.
> 2006-08-23 16:24:38 [17116] [44] ERROR: SMPP[hutchorange]: SMSC
rejected login to receive, code 0x0000000f (Invalid System ID).
> 2006-08-23 16:24:38 [17116] [44] ERROR: SMPP[hutchorange]: I/O
error or other error. Re-connecting.
> 2006-08-23 16:24:38 [17115] [43] WARNING: SMPP: PDU NULL
terminated string has no NULL.
> 2006-08-23 16:24:38 [17115] [43] ERROR: SMPP[hutchorange]: SMSC
rejected login to transmit, code 0x0000000f (Invalid System ID).
> 2006-08-23 16:24:43 [16990] [41] PANIC: gwlib/octstr.c:2461:
seems_valid_real: Assertion `ostr->len + 1 <= ostr->size' failed.
(Called from gw/smsc/smsc_smpp.c:1811:io_thread.)
> 2006-08-23 16:24:43 [16990] [41] PANIC:
/opt/kannel/sbin/bearerbox(gw_panic+0xfd) [0x80b794d]
> 2006-08-23 16:24:43 [16990] [41] PANIC:
/opt/kannel/sbin/bearerbox [0x80bcddc]
> 2006-08-23 16:24:43 [16990] [41] PANIC:
/opt/kannel/sbin/bearerbox(octstr_get_cstr_real+0x20) [0x80b8bf0]
> 2006-08-23 16:24:43 [16990] [41] PANIC:
/opt/kannel/sbin/bearerbox [0x8080b5e]
> 2006-08-23 16:24:43 [16990] [41] PANIC:
/opt/kannel/sbin/bearerbox [0x80af177]
> 2006-08-23 16:24:43 [16990] [41] PANIC:
/lib/i686/libpthread.so.0 [0x400b1881]
> 2006-08-23 16:24:43 [16990] [41] PANIC:
/lib/i686/libc.so.6(__clone+0x57) [0x420e40c7]
--
Stuart Beck
Systems Administrator
m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia
--Giulio Harding
Systems Administrator
m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia
Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]
http://www.mnetcorporation.com
--
Giulio Harding
Systems Administrator
m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia
Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]
http://www.mnetcorporation.com