This is a moderately urgent problem, since it's recently cropped up on a bind to carrier that carries a reasonable amount of traffic, and after some investigation, we have NO idea what's going wrong. This has happened 3 or 4 times over the past month - the bind had been performing flawlessly up until this point. In previous occurrences of this problem, the problem has gone away after some time (e.g. a few hours, or a day or two), but we're worried the problem will reappear just when we're due to handle large peaks in traffic. (We've been lucky so far)

The problem is as follows: at some point, we noticed the bind had one MT in the store, and no MOs were arriving on that bind. However, the bind was still listed as 'online'. We tried restarting Kannel, and that bind would come up again, and the queued MT would be sent to the destination handset, but wouldn't leave the queue, and wouldn't be marked as sent by Kannel. Again, no MOs would arrive on that bind, even though it seemed to be up. Sometimes, one or two MOs might arrive, and one or two MTs might be successfully sent, but then that bind would seize up again, with an MT in the store, and no further MOs or MTs would be

A packet capture during a Kannel restart then shutdown showed that after the initial successful bind, there would be one or two successful submit_sms, but then nothing - no enquire_links, no responses to deliver_sms sent by the SMSC, etc. - just occasional TCP keep-alives. Each time Kannel is restarted, the same queued messages would be successfully submitted, but not removed from the queue.

Here's a sample packet capture summary:

No. Time Source Destination Protocol Info 26 15.747608 10.110.123.31 xxx.xx.xx.xx TCP 35085 > copysnap [SYN] Seq=0 Len=0 MSS=1360 TSV=2576408299 TSER=0 WS=2 30 15.788987 xxx.xx.xx.xx 10.110.123.31 TCP copysnap > 35085 [SYN, ACK] Seq=0 Ack=1 Win=61440 Len=0 MSS=1436 WS=0 31 15.788998 10.110.123.31 xxx.xx.xx.xx TCP 35085 > copysnap [ACK] Seq=1 Ack=1 Win=5440 Len=0 32 15.789142 10.110.123.31 xxx.xx.xx.xx SMPP SMPP Bind_transceiver 42 15.830586 xxx.xx.xx.xx 10.110.123.31 SMPP SMPP Bind_transceiver - resp: "Ok" 43 15.830595 10.110.123.31 xxx.xx.xx.xx TCP 35085 > copysnap [ACK] Seq=45 Ack=28 Win=5440 Len=0 44 15.831265 10.110.123.31 xxx.xx.xx.xx SMPP SMPP Submit_sm 51 15.939520 xxx.xx.xx.xx 10.110.123.31 TCP copysnap > 35085 [ACK] Seq=28 Ack=175 Win=61266 Len=0 54 16.016723 xxx.xx.xx.xx 10.110.123.31 SMPP SMPP Submit_sm - resp: "Ok" 55 16.056163 10.110.123.31 xxx.xx.xx.xx TCP 35085 > copysnap [ACK] Seq=175 Ack=73 Win=5440 Len=0 60 20.344091 xxx.xx.xx.xx 10.110.123.31 SMPP SMPP Deliver_sm 61 20.344105 10.110.123.31 xxx.xx.xx.xx TCP 35085 > copysnap [ACK] Seq=175 Ack=145 Win=5440 Len=0 80 80.089288 xxx.xx.xx.xx 10.110.123.31 TCP [TCP Keep-Alive] copysnap > 35085 [ACK] Seq=144 Ack=174 Win=32768 Len=1 81 80.089299 10.110.123.31 xxx.xx.xx.xx TCP [TCP Keep-Alive ACK] 35085 > copysnap [ACK] Seq=175 Ack=145 Win=5440 Len=0 107 139.857528 xxx.xx.xx.xx 10.110.123.31 TCP [TCP Keep-Alive] copysnap > 35085 [ACK] Seq=144 Ack=174 Win=32768 Len=1 108 139.857543 10.110.123.31 xxx.xx.xx.xx TCP [TCP Keep-Alive ACK] 35085 > copysnap [ACK] Seq=175 Ack=145 Win=5440 Len=0 109 153.240014 xxx.xx.xx.xx 10.110.123.31 SMPP SMPP Deliver_sm 110 153.240029 10.110.123.31 xxx.xx.xx.xx TCP 35085 > copysnap [ACK] Seq=175 Ack=315 Win=6512 Len=0 129 212.573429 xxx.xx.xx.xx 10.110.123.31 TCP [TCP Keep-Alive] copysnap > 35085 [ACK] Seq=314 Ack=174 Win=32768 Len=1 130 212.573442 10.110.123.31 xxx.xx.xx.xx TCP [TCP Keep-Alive ACK] 35085 > copysnap [ACK] Seq=175 Ack=315 Win=6512 Len=0 156 272.340457 xxx.xx.xx.xx 10.110.123.31 TCP [TCP Keep-Alive] copysnap > 35085 [ACK] Seq=314 Ack=174 Win=32768 Len=1 157 272.340467 10.110.123.31 xxx.xx.xx.xx TCP [TCP Keep-Alive ACK] 35085 > copysnap [ACK] Seq=175 Ack=315 Win=6512 Len=0 167 314.727055 xxx.xx.xx.xx 10.110.123.31 SMPP SMPP Deliver_sm 168 314.727069 10.110.123.31 xxx.xx.xx.xx TCP 35085 > copysnap [ACK] Seq=175 Ack=383 Win=6512 Len=0 178 337.665939 10.110.123.31 xxx.xx.xx.xx TCP 35085 > copysnap [RST, ACK] Seq=175 Ack=383 Win=6512 Len=0

You can see the successful transciever bind (packet #'s 32, 42), the successful submit_sm (44, 54), the ignored deliver_sm (60, and again at 109 and 167), all the TCP keep-alives, no enquire-links, and a final RST as kannel is restarted (no unbind?)

The TCP stream immediately after the restart, for the same bind is pretty much identical. Bind_transciever, submit_sm, then nothing.

The message in the queue is:

SMS ID                                  Type            Time                    
Sender          Receiver        SMSC ID BOX ID  UDH     Message
----
aa37aeef-1db0-494b-bfdc-7f574f5a9263 MT-REPLY 2007-01-29 16:37:11 19xxxxxx 614xxxxxxxx xxxx <79 character text message that is being sent fine across all other binds, and was sent fine through this bind before the problem>

There are no messages in the bearer.log related to the problem bind.


So:

Firstly, are there any debug options or advanced logging I can turn on to try and find out what Kannel is doing with this bind? Secondly, does anyone have any ideas what the issue might be?

We're running Kannel version `cvs-20060727' - Build `Sep 21 2006 14:42:06', compiler `3.4.6 20060404 (Red Hat 3.4.6-3)'. System Linux, release 2.6.9-42.0.2.ELsmp, version #1 SMP Wed Aug 23 13:38:27 BST 2006, machine x86_64. Hostname smsgw2.appgw.mnetcorporation.com, IP 10.110.123.31. Libxml version 2.6.16. Using native malloc.

We haven't yet had a chance to move to 1.4.1 (our test/build servers are being serviced) but from keeping an eye on the dev mailing list, I don't think we've missed any critical patches...? In any case, since this CVS build, we've had no problems until now.

Any assistance would be greatly appreciated!

--
Giulio Harding
Systems Administrator

m.Net Corporation
Level 2, 16 Leigh Street
Adelaide SA 5000, Australia

Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]

http://www.mnetcorporation.com





Reply via email to