Giulio Harding wrote:
Further to this, we've had a few more incidents this past week which shed a little more light on the problem:

Sometimes, a transmit/receive bind to a carrier (this has happened with 2 different carriers now) goes strange: MTs will start queuing on Kannel, seemingly unable to be delivered, but MOs are processed fine. However, my initial investigation with ethereal doesn't show any submit_sm attempts for the queued MTs. Also, I don't see any problems with enquire_links or any (failed) attempts to rebind (e.g. the transmit bind having problems while the receive bind is ok) - Kannel indicates that the link is online, and there are no errors specific to that link in bearerbox.log

Another update:

This problem of binds entering a partially failed state, where Kannel will not deliver MTs through them, is now occurring quite regularly, across all our carrier links. Any time we have some sort of link instability, or temporary connectivity outage, our binds will come up again, and receive queued MOs from the carriers, but Kannel will not deliver MTs down those binds. The only solution is to restart Kannel (at the moment, I'm using the 'restart' HTTP command, and I haven't had Kannel crash with it, which is good)

To reiterate: there are no errors in bearer.log, and running a tcpdump for that bind's traffic shows NO submit_sms at all (also, as far as I can tell, there are no bind enquiries for the transmit bind). Kannel is not even trying to deliver MTs, it's just queuing the MTs immediately. It's as if the receive bind is up and running fine, but Kannel hasn't brought the transmit bind up. (Well, not quite - if a bind is stopped, and thus in a 'dead' state, Kannel will discard MTs destined for it, complaining about there not being any valid destination for the MT - in this case, the MTs aren't discarded, they're queued)

One of the links that's had this problem occur a few times this past week is configured like so:

# Optus Free MT
group = smsc
smsc = smpp
smsc-id = optusfrmt
host = xxxx
port = 2775
receive-port = 2775
transceiver-mode = false
smsc-username = xxxx
smsc-password = xxxx
system-type = smpp
address-range = 0
source-addr-ton = 0
source-addr-npi = 4
dest-addr-ton = 1
dest-addr-npi = 1
alt-charset="ASCII"
enquire-link-interval = 30
allowed-smsc-id = "optusfrmt"

This seems to be a relatively recent occurance, possibly coinciding with our use of Kannel CVS (I don't remember experiencing any problems like this with 1.4.0) and it's quite worrying, and recovering from this state will affect every other bind we have (interrupting any traffic they are experiencing)

Summary of related issues:

- After interrupted connectivity or possibly some other network event causes a re-bind of a particular link, Kannel will indicate that the link is back online (and will receive MOs), but will not attempt to deliver MTs - only solution is to restart Kannel. - Restarting just the affected bind using the stop/start-smsc HTTP commands causes Kannel to hang. - Trying to restart Kannel after it enters this hung state will cause a crash.

At this point, I'd like to find out more about exactly what Kannel is doing when these events happen - i.e. which of the two binds (transmit + receive) for a link is actually online? How can I get detailed information about the links, and/or more verbose logging about bind/unbind operations for a link? If I can get more information about what Kannel is doing, then I can investigate further myself...

Any suggestions, or tips?

Thanks,
If I try to use the HTTP stop-smsc command to stop the individual bind, Kannel appears to hang - the admin webpage doesn't respond, and Kannel seems to stop processing MOs and MTs. At this point, the only thing I can do is restart Kannel via its init script. Restarting Kannel fixes the original problem, and queued MTs are delivered. However, shutting down Kannel in this 'hung' state seems to trigger a crash - here are some excerpts from bearerbox.log from the last few incidents:

...
2006-08-26 01:17:18 [14077] [3] INFO: HTTP: Shutting down smsc-id `hutchthree' 2006-08-26 01:19:57 [14077] [0] WARNING: Killing signal or HTTP admin command received, shutting down... 2006-08-26 01:19:57 [14077] [29] INFO: Connection closed by the box <127.0.0.1>
2006-08-26 01:19:57 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [4] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [4] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [4] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [4] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:19:58 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:19:59 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:20:00 [14077] [15] WARNING: smsbox_list empty!
2006-08-26 01:20:01 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:20:02 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:20:02 [14077] [26] PANIC: gwlib/gw-rwlock.c:137: gw_rwlock_rdlock: Assertion `lock != NULL' failed. 2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox(gw_panic+0xa3) [0x80b134b] 2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox(gw_rwlock_rdlock+0x63) [0x80a8a6b] 2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox(route_incoming_to_boxc+0x5c) [0x8055208] 2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox(bb_smscconn_receive+0xe5) [0x80577b1] 2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox [0x807a44c] 2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox [0x807b10b] 2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox [0x80a906e] 2006-08-26 01:20:02 [14077] [26] PANIC: /lib/tls/libpthread.so.0 [0xa1b1d5] 2006-08-26 01:20:02 [14077] [26] PANIC: /lib/tls/libc.so.6(__clone+0x5a) [0x7782da] 2006-08-26 01:20:18 [11540] [0] INFO: Added logfile `/home/kannel/log/bearer.log' with level `1'. 2006-08-26 01:20:18 [11540] [0] INFO: Started access logfile `/home/kannel/log/kannel_sms_traffic.log'. 2006-08-26 01:20:18 [11540] [0] INFO: SSL not supported, no SSL initialization done.
2006-08-26 01:20:18 [11540] [0] INFO: HTTP: Opening server at port 13000.
2006-08-26 01:20:18 [11540] [0] INFO: BOXC: 'smsbox-max-pending' not set, using default (100).
...

...
2006-08-26 01:21:37 [11540] [0] WARNING: Killing signal or HTTP admin command received, shutting down...
2006-08-26 01:21:39 [11540] [4] WARNING: smsbox_list empty!
2006-08-26 01:21:39 [11540] [0] PANIC: Cannot die by its own will
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox(gw_panic+0xa3) [0x80b134b]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox [0x8050b4e]
2006-08-26 01:21:39 [11540] [0] PANIC: /lib/tls/libpthread.so.0 [0xa218a8]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox [0x807bbd8]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox(smsc2_shutdown+0x41) [0x8056e0d] 2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox(bb_shutdown+0x9e) [0x8050e56] 2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox(main+0x892) [0x805175a] 2006-08-26 01:21:39 [11540] [0] PANIC: /lib/tls/libc.so.6(__libc_start_main+0xe3) [0x6c7e33]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox [0x8050a3d]
2006-08-26 01:21:51 [12104] [0] INFO: Added logfile `/home/kannel/log/bearer.log' with level `1'. 2006-08-26 01:21:51 [12104] [0] INFO: Started access logfile `/home/kannel/log/kannel_sms_traffic.log'. 2006-08-26 01:21:51 [12104] [0] INFO: SSL not supported, no SSL initialization done.
2006-08-26 01:21:51 [12104] [0] INFO: HTTP: Opening server at port 13000.
2006-08-26 01:21:51 [12104] [0] INFO: BOXC: 'smsbox-max-pending' not set, using default (100).
...

...
2006-08-28 14:02:58 [19145] [3] INFO: HTTP: Shutting down smsc-id `vodafone000' 2006-08-28 14:03:34 [19145] [29] INFO: Connection closed by the box <127.0.0.1> 2006-08-28 14:03:34 [19145] [0] WARNING: Killing signal or HTTP admin command received, shutting down...
2006-08-28 14:03:35 [19145] [13] WARNING: smsbox_list empty!
2006-08-28 14:03:35 [19145] [4] WARNING: smsbox_list empty!
2006-08-28 14:03:35 [19145] [26] PANIC: gwlib/gw-rwlock.c:137: gw_rwlock_rdlock: Assertion `lock != NULL' failed. 2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox(gw_panic+0xa3) [0x80b134b] 2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox(gw_rwlock_rdlock+0x63) [0x80a8a6b] 2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox(route_incoming_to_boxc+0x5c) [0x8055208] 2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox(bb_smscconn_receive+0xe5) [0x80577b1] 2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox [0x807a44c] 2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox [0x807b10b] 2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox [0x80a906e] 2006-08-28 14:03:35 [19145] [26] PANIC: /lib/tls/libpthread.so.0 [0xa1b1d5] 2006-08-28 14:03:35 [19145] [26] PANIC: /lib/tls/libc.so.6(__clone+0x5a) [0x7782da] 2006-08-28 14:03:40 [6758] [0] INFO: Added logfile `/home/kannel/log/bearer.log' with level `1'. 2006-08-28 14:03:40 [6758] [0] INFO: Started access logfile `/home/kannel/log/kannel_sms_traffic.log'. 2006-08-28 14:03:40 [6758] [0] INFO: SSL not supported, no SSL initialization done.
2006-08-28 14:03:40 [6758] [0] INFO: HTTP: Opening server at port 13000.
2006-08-28 14:03:40 [6758] [0] INFO: BOXC: 'smsbox-max-pending' not set, using default (100).
...

This is new behaviour - we haven't seen anything like this since we started using Kannel (~2 years ago).

The only thing that has changed recently (and even then, it changed a couple of months ago, and this problem only started a couple of weeks ago at most) is that we're now using Kannel 'cvs-20060727' instead of 1.4.0 with a few patches (such as improved SMPP throttling, customised smsbox 'success' HTTP response, etc.) and are now running on Fedora Core 3 instead of Red Hat Linux 8.0. Also, it's *possible* that the carriers may have made some changes on their end - one of them made some mention about upgrading to support SMPP 3.4?

1.) Does anyone have any further ideas what the problem might be?

2.) Can anyone suggest the best way to get more information about what's happened to the SMPP bind, and what Kannel is actually doing just before, and during the time when it 'hangs'? I'm not overly familiar with Kannel internals, or how to go about debugging it, so any assistance would be greatly appreciated :)

Thanks,


On 24/08/2006, at 5:28 PM, Giulio Harding wrote:

Stuart's just left for the day, but I've picked out line 1811 in function io_thread in gw/smsc/smsc_smpp.c from the source code we're using:

...

           /* unbind
* Read so long as unbind_resp received or timeout passed. Otherwise
we have
            * double delivered messages.
            */
           if (smpp->quitting) {
               send_unbind(smpp, conn);
               last_response = time(NULL);
               while(conn_wait(conn, 1.00) != -1 &&
difftime(time(NULL), last_response) < SMPP_DEFAULT_SHUTDOW
N_TIMEOUT &&
                     smpp->conn->status != SMSCCONN_DISCONNECTED) {
                   if (read_pdu(smpp, conn, &len, &pdu) == 1) {
                       dump_pdu("Got PDU:", smpp->conn->id, pdu);
                       handle_pdu(smpp, conn, pdu, &pending_submits);
                       smpp_pdu_destroy(pdu);
                   }
               }
debug("bb.sms.smpp", 0, "SMPP[%s]: %s: break and shutting down", octstr_get_cstr(smpp->conn->id), __func__); <<<<================== Line1811

               break;
           }

           send_enquire_link(smpp, conn, &last_enquire_sent);

...

Hope this helps...

(just on a side-note, and this may be related: on the day this crash occured, later at night, we had an incident with the same carrier where one of our transmit binds to them died - they were rejecting our bind enquiries and then bind requests claiming we were already bound - when I tried to restart the individual link using the stop-smsc and start-smsc HTTP commands, Kannel hung, no longer processing MOs/MTs, and the admin page not responding. I tried to stop kannel, in order to restart it, but stopping it caused ~30 bearerbox processes to appear, each of which had to be kill -9'd before I could restart kannel. Not sure if this would related or not, but I haven't had a chance to investigate further... just thought I'd mention it in case...)

Thanks,

Andreas Fink wrote:


I was quickly looking through the CVS code of today and wonder if this was maybe created by the fact that login is being rejected but enquire link tries to send its message anyway and then crashes.


Could you tell me exactly on your CVS 20060727 version where is line 1811 in that version?



On 24.08.2006, at 06:03, Mi Reflejo wrote:


Hi,

Which version do you use? If you are not using CVS HEAD please try it.


Regards,

Martin Conte.


On 8/23/06, Stuart Beck <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:

Hi All,


We recently had a situation where one of the carriers we are connecting to (2 separate SMSC's) had some unspecified issue and reconnected, this however caused the following panic

as one of the SMSC's had been turned off (upon communicating with the carrier it's been switched off for about a week, they are no longer supporting that service)


I would like to know if anyone can give me any indication as to why the failure occurred and what can be done about it.


The Kannel/host details are as follows

        Kannel bearerbox version `cvs-20060727'.

Build `Aug 2 2006 12:31:01', compiler `3.2 20020903 (Red Hat Linux 8.0 3.2-7)'.

System Linux, release 2.4.20-19.8smp, version #1 SMP Tue Jul 15 15:01:43 EDT 2003, machine i686.

Libxml version 2.6.9. Using OpenSSL 0.9.6b [engine] 9 Jul 2001. Using native malloc.


> 2006-08-23 14:39:16 [32374] [38] ERROR: SMPP[hutchthree]: I/O error or other error. Re-connecting.

> 2006-08-23 14:39:16 [32374] [38] ERROR: SMPP[hutchthree]: Couldn't connect to SMS center (retrying in 10 seconds).

> 2006-08-23 14:39:16 [32448] [40] ERROR: SMPP[hutchorange]: I/O error or other error. Re-connecting.

> 2006-08-23 14:39:16 [32448] [40] ERROR: SMPP[hutchorange]: Couldn't connect to SMS center (retrying in 10 seconds).

> 2006-08-23 14:39:16 [32447] [39] ERROR: SMPP[hutchorange]: I/O error or other error. Re-connecting.

> 2006-08-23 14:39:16 [32447] [39] ERROR: SMPP[hutchorange]: Couldn't connect to SMS center (retrying in 10 seconds).

> 2006-08-23 14:39:16 [32373] [37] ERROR: SMPP[hutchthree]: I/O error or other error. Re-connecting.

> 2006-08-23 14:39:16 [32373] [37] ERROR: SMPP[hutchthree]: Couldn't connect to SMS center (retrying in 10 seconds).

> 2006-08-23 14:39:26 [32448] [40] WARNING: SMPP: PDU NULL terminated string has no NULL.

> 2006-08-23 14:39:26 [32448] [40] ERROR: SMPP[hutchorange]: SMSC rejected login to receive, code 0x0000000f (Invalid System ID).

> 2006-08-23 14:39:26 [32448] [40] ERROR: SMPP[hutchorange]: I/O error or other error. Re-connecting.

> 2006-08-23 14:39:26 [32447] [39] WARNING: SMPP: PDU NULL terminated string has no NULL.

> 2006-08-23 14:39:26 [32447] [39] ERROR: SMPP[hutchorange]: SMSC rejected login to transmit, code 0x0000000f (Invalid System ID).

> 2006-08-23 16:24:12 [4905] [3] INFO: HTTP: Re-starting smsc-id `hutchorange'

> 2006-08-23 16:24:12 [4905] [3] INFO: Set throughput to 13.000 for smsc id <hutchorange>

> 2006-08-23 16:24:12 [4905] [3] INFO: DLR rerouting for smsc id <hutchorange> disabled.

> 2006-08-23 16:24:13 [16991] [42] WARNING: SMPP: PDU NULL terminated string has no NULL.

> 2006-08-23 16:24:13 [16991] [42] ERROR: SMPP[hutchorange]: SMSC rejected login to receive, code 0x0000000f (Invalid System ID).

> 2006-08-23 16:24:13 [16991] [42] ERROR: SMPP[hutchorange]: I/O error or other error. Re-connecting.

> 2006-08-23 16:24:13 [16990] [41] WARNING: SMPP: PDU NULL terminated string has no NULL.

> 2006-08-23 16:24:13 [16990] [41] ERROR: SMPP[hutchorange]: SMSC rejected login to transmit, code 0x0000000f (Invalid System ID).

> 2006-08-23 16:24:38 [4905] [3] INFO: HTTP: Re-starting smsc-id `hutchorange'

> 2006-08-23 16:24:38 [4905] [3] INFO: Set throughput to 13.000 for smsc id <hutchorange>

> 2006-08-23 16:24:38 [4905] [3] INFO: DLR rerouting for smsc id <hutchorange> disabled.

> 2006-08-23 16:24:38 [17116] [44] WARNING: SMPP: PDU NULL terminated string has no NULL.

> 2006-08-23 16:24:38 [17116] [44] ERROR: SMPP[hutchorange]: SMSC rejected login to receive, code 0x0000000f (Invalid System ID).

> 2006-08-23 16:24:38 [17116] [44] ERROR: SMPP[hutchorange]: I/O error or other error. Re-connecting.

> 2006-08-23 16:24:38 [17115] [43] WARNING: SMPP: PDU NULL terminated string has no NULL.

> 2006-08-23 16:24:38 [17115] [43] ERROR: SMPP[hutchorange]: SMSC rejected login to transmit, code 0x0000000f (Invalid System ID).

> 2006-08-23 16:24:43 [16990] [41] PANIC: gwlib/octstr.c:2461: seems_valid_real: Assertion `ostr->len + 1 <= ostr->size' failed. (Called from gw/smsc/smsc_smpp.c:1811:io_thread.)

> 2006-08-23 16:24:43 [16990] [41] PANIC: /opt/kannel/sbin/bearerbox(gw_panic+0xfd) [0x80b794d]

> 2006-08-23 16:24:43 [16990] [41] PANIC: /opt/kannel/sbin/bearerbox [0x80bcddc]

> 2006-08-23 16:24:43 [16990] [41] PANIC: /opt/kannel/sbin/bearerbox(octstr_get_cstr_real+0x20) [0x80b8bf0]

> 2006-08-23 16:24:43 [16990] [41] PANIC: /opt/kannel/sbin/bearerbox [0x8080b5e]

> 2006-08-23 16:24:43 [16990] [41] PANIC: /opt/kannel/sbin/bearerbox [0x80af177]

> 2006-08-23 16:24:43 [16990] [41] PANIC: /lib/i686/libpthread.so.0 [0x400b1881]

> 2006-08-23 16:24:43 [16990] [41] PANIC: /lib/i686/libc.so.6(__clone+0x57) [0x420e40c7]


--

Stuart Beck

Systems Administrator


m.Net Corporation

Level 13, 99 Gawler Place

Adelaide SA 5000, Australia




--Giulio Harding
Systems Administrator

m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia

Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]

http://www.mnetcorporation.com


--
Giulio Harding
Systems Administrator

m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia

Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]

http://www.mnetcorporation.com







Reply via email to