Fwd: Weird binding issue that causes queues to build up.
Hi guys,

We have just hit this issue AGAIN this morning. Can ANYONE please give some guidance here? I have had zero response on this critical issue for over two weeks now. Can someone please help? This issue is becoming increasingly urgent.

Regards,
Brett

-- Forwarded message --
From: brett skinner tatty.dishcl...@gmail.com
Date: Fri, Sep 17, 2010 at 2:16 PM
Subject: Re: Weird binding issue that causes queues to build up.
To: Users users@kannel.org

Hi,

We have experienced this problem again. A couple of our binds to one particular SMSC (the rest were okay) had connectivity issues last night at 12 AM. The binds were re-established and reported as online on the status pages. However, a queue for one of the binds built up on the bearerbox. Only after I ran stop-smsc and start-smsc for that bind did its queue start processing again.

In the logs at 12 AM we have a bunch of errors:

2010-09-16 23:59:35 [32641] [44] ERROR: Error reading from fd 57:
2010-09-16 23:59:35 [32641] [44] ERROR: System error 104: Connection reset by peer
2010-09-16 23:59:35 [32641] [44] ERROR: SMPP[XXX]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-16 23:59:38 [32641] [38] ERROR: Error reading from fd 52:
2010-09-16 23:59:38 [32641] [38] ERROR: System error 104: Connection reset by peer
2010-09-16 23:59:38 [32641] [38] ERROR: SMPP[YYY]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-16 23:59:39 [32641] [46] ERROR: Error reading from fd 50:
2010-09-16 23:59:39 [32641] [46] ERROR: System error 104: Connection reset by peer
2010-09-16 23:59:39 [32641] [46] ERROR: SMPP[AAA]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-16 23:59:47 [32641] [39] ERROR: Error reading from fd 49:
2010-09-16 23:59:47 [32641] [39] ERROR: System error 104: Connection reset by peer
2010-09-16 23:59:47 [32641] [39] ERROR: SMPP[YYY]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-16 23:59:51 [32641] [48] ERROR: Error reading from fd 61:
2010-09-16 23:59:51 [32641] [48] ERROR: System error 104: Connection reset by peer
2010-09-16 23:59:51 [32641] [48] ERROR: SMPP[BBB]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-17 00:00:00 [32641] [47] ERROR: Error reading from fd 40:
2010-09-17 00:00:00 [32641] [47] ERROR: System error 104: Connection reset by peer
2010-09-17 00:00:00 [32641] [47] ERROR: SMPP[XXX]: Couldn't connect to SMS center (retrying in 10 seconds).

I am not sure how Kannel works internally, but it is almost as if, when the bind is re-established, the old one is disposed of and a new one is created, while the queue and its pointers are still tied to the old one and have not been updated. As a result, messages sit in the queue and are not routed to the bind that reports as online.

I see that there may have been similar issues in the past: http://www.kannel.org/pipermail/users/2009-May/007166.html. It might be related, maybe not. We have already set our binds up as a separate transmitter and receiver; we are not running transceiver.

Regards,

On Thu, Sep 9, 2010 at 3:42 PM, brett skinner tatty.dishcl...@gmail.com wrote:

Thanks Alvaro for your response. I am running a build from SVN from about 2 weeks ago. I am a bit wary of turning the loggers to debug mode: we handle a lot of traffic and debug mode is very verbose, so we would eat through our disk in no time. It would be different if the problem were reproducible or if we could anticipate it, because then we could turn on debug logging at the right time. It happens so sporadically that we would have to leave the loggers in debug mode. The last time this happened was last week. I will go check out that tool you mentioned. I am not that interested in the extra TLVs; they were just making a bit of noise in our logs :) Thanks again for your help.
On Thu, Sep 9, 2010 at 3:35 PM, Alvaro Cornejo cornejo.alv...@gmail.com wrote:

Have you checked what the system logs in debug mode? Regarding the queue, there is a tool created by Alejandro Guerreri that lets you view the queue contents and delete messages. Kannel does have several queues, though, so I don't know whether it works for the one you mention. I don't remember the details, but you can check his blog: http://www.blogalex.com/archives/72

About the TLVs you are receiving, you should ask your provider what they mean and what information they are sending. If it is of interest to you, you can configure meta-data to capture that information; otherwise you can safely ignore them. As the PDU type is deliver_sm, I suspect it might be the DLR status... and that is why you have that queue. Also, if you upgrade to a recent version, the status page has been improved and now shows separate counters for MT and DLRs; in older versions the MT and DLR counters were mixed.

Hope this helps,
Alvaro
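Brett's difficulty in this thread is knowing which bind to restart. Without turning on debug logging, one cheap way to narrow the candidates is to scan the existing bearerbox log for the reconnect errors quoted above. The Python sketch below assumes the line layout visible in these excerpts (timestamp, [pid], [thread id], level, SMPP[smsc-id]); the regex and the `reconnects_by_smsc` helper are illustrative names, not part of Kannel.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the bearerbox lines quoted in this thread:
# 2010-09-16 23:59:35 [32641] [44] ERROR: SMPP[XXX]: Couldn't connect to SMS center (retrying in 10 seconds).
RECONNECT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[\d+\] \[(?P<thread>\d+)\] "
    r"ERROR: SMPP\[(?P<smsc>[^\]]+)\]: Couldn't connect to SMS center"
)


def reconnects_by_smsc(lines):
    """Group reconnect events by smsc-id (the name inside SMPP[...]).

    Returns {smsc_id: [(timestamp, thread_id), ...]} in log order.
    """
    events = defaultdict(list)
    for line in lines:
        m = RECONNECT_RE.match(line)
        if m:
            events[m.group("smsc")].append((m.group("ts"), int(m.group("thread"))))
    return dict(events)


if __name__ == "__main__":
    excerpt = """\
2010-09-16 23:59:35 [32641] [44] ERROR: Error reading from fd 57:
2010-09-16 23:59:35 [32641] [44] ERROR: System error 104: Connection reset by peer
2010-09-16 23:59:35 [32641] [44] ERROR: SMPP[XXX]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-16 23:59:38 [32641] [38] ERROR: SMPP[YYY]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-16 23:59:47 [32641] [39] ERROR: SMPP[YYY]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-17 00:00:00 [32641] [47] ERROR: SMPP[XXX]: Couldn't connect to SMS center (retrying in 10 seconds).
"""
    # Print the thread ids on which each smsc-id had to reconnect.
    for smsc, hits in sorted(reconnects_by_smsc(excerpt.splitlines()).items()):
        print(smsc, [thread for _, thread in hits])
    # XXX [44, 47]
    # YYY [38, 39]
```

In the Sep 17 log excerpt, XXX and YYY each show up on two different threads (44/47 and 38/39), possibly their separate transmitter and receiver connections, which is why the sketch keeps the thread id. It only lists binds that dropped; it cannot tell which of them ended up with the stuck queue.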
Weird binding issue that causes queues to build up.
Hi,

We are using transmitter and receiver because apparently there are performance issues with transceiver, since both transmitting and receiving are handled on a single thread. We were advised to use Tx and Rx.

We have contacted the provider, and while they acknowledge that they had an outage lasting a couple of seconds, everyone else was able to reconnect without an issue. It was just us. But this is not limited to them: it seems that whenever a bind dies and comes back, there is a chance that bearerbox will start queuing. Do you have any extra information on the workaround?

Regards,

On Thu, Sep 23, 2010 at 2:24 PM, Benaiad bena...@gmail.com wrote:

Hi Brett,

Which type of connection are you using? If it's not transceiver, I suggest you use it if your provider supports it. There is a known bug regarding separate Tx and Rx connections. I believe there is also another workaround for this: defining two smsc groups, one for Tx and the other for Rx.

Regards
--
Benaiad

On Thu, Sep 23, 2010 at 11:21 AM, brett skinner tatty.dishcl...@gmail.com wrote:

Hi guys,

We have just hit this issue AGAIN this morning. Can ANYONE please give some guidance here? I have had zero response on this critical issue for over two weeks now. Can someone please help? This issue is becoming increasingly urgent.

Regards,
Brett
Re: Weird binding issue that causes queues to build up.
Hi Brett,

I've found that workaround, which was offered by Mr. Donald Jackson. You can find it at http://www.mail-archive.com/users@kannel.org/msg15958.html

Regards
--
Abdulmnem Benaiad
Almontaha CTO
www.almontaha.ly
Tripoli-Libya

On Thu, Sep 23, 2010 at 2:58 PM, brett skinner tatty.dishcl...@gmail.com wrote:

Hi,

We are using transmitter and receiver because apparently there are performance issues with transceiver, since both transmitting and receiving are handled on a single thread. We were advised to use Tx and Rx.

We have contacted the provider, and while they acknowledge that they had an outage lasting a couple of seconds, everyone else was able to reconnect without an issue. It was just us. But this is not limited to them: it seems that whenever a bind dies and comes back, there is a chance that bearerbox will start queuing. Do you have any extra information on the workaround?

Regards,
Re: Weird binding issue that causes queues to build up.
Thank you VERY MUCH for your help. We will give that a try.

On Thu, Sep 23, 2010 at 3:13 PM, Benaiad bena...@gmail.com wrote:

Hi Brett,

I've found that workaround, which was offered by Mr. Donald Jackson. You can find it at http://www.mail-archive.com/users@kannel.org/msg15958.html

Regards
--
Abdulmnem Benaiad
Almontaha CTO
www.almontaha.ly
Tripoli-Libya
Fwd: Weird binding issue that causes queues to build up.
Hi everyone,

Just wondering if anyone has had a chance to look at this yet? Thanks, and I appreciate any help.

-- Forwarded message --
From: brett skinner tatty.dishcl...@gmail.com
Date: Tue, Sep 7, 2010 at 10:47 AM
Subject: Weird binding issue that causes queues to build up.
To: Users users@kannel.org

Hi,

We are experiencing a rather weird, occasional issue with Kannel. We have two different boxes, each with a Kannel installation. Every now and then one of the boxes stops processing SMS queues and the queues just build up. This happens on both boxes (just not at the same time). When we look at the status page we can see the queue, and there are SMS queued to the bearerbox. I assume it is the bearerbox queue. It looks as follows (from the status page):

SMS: received 123 (0 queued), sent 123 (456 queued), store size -1

It is the "456 queued" part that concerns us. All the binds report as online with 0 in their queues, but that 456 queue does not disappear. If I sit restarting bind after bind, one of them usually does the trick and the queue disappears. The problem is that we usually have no idea which bind it is, since they all report as online.

Looking through the logs of our upstream applications, I have noticed that there appears to have been a network outage at around the same time. I have not yet confirmed this with the hosting company. This is also what appears in the syslog:

Sep 6 23:02:46 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 53756 on interface 'eth0.0'
Sep 6 23:17:01 123-123-123-123 CRON[16934]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Sep 6 23:32:46 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 33895 on interface 'eth0.0'
Sep 7 00:02:45 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 55945 on interface 'eth0.0'
Sep 7 00:17:01 123-123-123-123 CRON[17231]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Sep 7 00:32:45 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 45291 on interface 'eth0.0'
Sep 7 01:02:45 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 39067 on interface 'eth0.0'
Sep 7 01:17:01 123-123-123-123 CRON[17479]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)

That IP address is not in our kannel.conf file. I am not sure what these errors are about; I might need to investigate this further. I am not a security expert, so I have no idea whether this is malicious.

This is what appears in the bearerbox logs at about the same time as the outage:

2010-09-06 23:02:46 [32641] [12] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032503580) for PDU type (deliver_sm) received!
2010-09-06 23:03:07 [32641] [12] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032605180) for PDU type (deliver_sm) received!
2010-09-06 23:08:04 [32641] [10] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032113180) for PDU type (deliver_sm) received!
2010-09-06 23:14:32 [32641] [9] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032711480) for PDU type (deliver_sm) received!
2010-09-06 23:26:12 [32641] [6] ERROR: SMPP[AAA]: I/O error or other error. Re-connecting.
2010-09-06 23:26:12 [32641] [6] ERROR: SMPP[AAA]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-06 23:26:12 [32641] [18] WARNING: SMPP[BBB]: Not ACKED message found, will retransmit. SENT94sec. ago, SEQ423861, DST+x
2010-09-06 23:26:12 [32641] [18] WARNING: SMPP[BBB]: Not ACKED message found, will retransmit. SENT94sec. ago, SEQ423862, DST+x
2010-09-06 23:26:12 [32641] [18] ERROR: SMPP[BBB]: I/O error or other error. Re-connecting.
2010-09-06 23:26:12 [32641] [18] ERROR: SMPP[BBB]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-06 23:26:12 [32641] [24] WARNING: Time-out waiting for concatenated message ''+x' '+yyy' 'EEE' '10' '2' '''. Send message parts as is.
2010-09-06 23:26:12 [32641] [24] WARNING: Time-out waiting for concatenated message ''+x' '+yyy' 'EEE' '85' '2' '''. Send message parts as is.
2010-09-06 23:26:12 [32641] [24] WARNING: Time-out waiting for concatenated message ''+x' '+yyy' 'EEE' '152' '2' '''. Send message parts as is.
2010-09-06 23:26:42 [32641] [17] ERROR: SMPP[CCC]: I/O error or other error. Re-connecting.
2010-09-06 23:26:42 [32641] [17] ERROR: SMPP[CCC]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-06 23:27:08 [32641] [13] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032035180) for PDU type (deliver_sm) received!
2010-09-06 23:27:12 [32641] [19] ERROR: SMPP[BBB]: I/O error or other error. Re-connecting.
2010-09-06 23:27:12 [32641] [19] ERROR: SMPP[BBB]: Couldn't connect to SMS center (retrying in 10 seconds).
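The symptom here is a global "sent ... (N queued)" counter that stays non-zero while every bind reports online. A minimal monitoring sketch in Python, assuming the status page text is polled periodically by some external job (not shown): it parses the SMS line quoted above and flags the queue as possibly stuck when the outbound queue stays non-zero and the sent counter does not change over a few polls. The helper names, the admin host, port 13000 and the password are placeholders; stop-smsc and start-smsc are the admin commands used manually in this thread, and passing the smsc-id in an `smsc` query parameter is an assumption to check against the Kannel user guide. As Alvaro notes in his reply, older versions mix MT and DLR counters on the status page, so this is a heuristic, not a diagnosis.

```python
import re
from urllib.parse import urlencode

# Matches the global SMS line of the bearerbox status page, e.g.
# "SMS: received 123 (0 queued), sent 123 (456 queued), store size -1"
STATUS_RE = re.compile(
    r"SMS: received (?P<rx>\d+) \((?P<rx_q>\d+) queued\), "
    r"sent (?P<tx>\d+) \((?P<tx_q>\d+) queued\), store size (?P<store>-?\d+)"
)


def parse_sms_status(text):
    """Return the SMS counters from status-page text as a dict of ints."""
    m = STATUS_RE.search(text)
    if m is None:
        raise ValueError("no 'SMS: received ...' line found")
    return {name: int(value) for name, value in m.groupdict().items()}


def queue_looks_stuck(samples, window=3):
    """Heuristic: the outbound queue stayed non-zero and the 'sent' counter
    did not move across the last `window` polls. Not a diagnosis."""
    recent = samples[-window:]
    if len(recent) < window:
        return False
    return all(s["tx_q"] > 0 for s in recent) and len({s["tx"] for s in recent}) == 1


def admin_url(command, smsc_id, host="localhost", port=13000, password="CHANGE-ME"):
    """Build an admin URL such as stop-smsc or start-smsc for one smsc-id.
    host, port and password are placeholders for your own core settings."""
    query = urlencode({"password": password, "smsc": smsc_id})
    return f"http://{host}:{port}/{command}?{query}"


if __name__ == "__main__":
    status = parse_sms_status("SMS: received 123 (0 queued), sent 123 (456 queued), store size -1")
    print(status)  # {'rx': 123, 'rx_q': 0, 'tx': 123, 'tx_q': 456, 'store': -1}
    if queue_looks_stuck([status, status, status]):
        # Only print the URLs that would bounce a suspect bind; nothing is requested.
        print(admin_url("stop-smsc", "XXX"))   # http://localhost:13000/stop-smsc?password=CHANGE-ME&smsc=XXX
        print(admin_url("start-smsc", "XXX"))  # http://localhost:13000/start-smsc?password=CHANGE-ME&smsc=XXX
```

The sketch only builds the URLs; it deliberately does not pick a bind to bounce, since the core problem in this thread is that the status page does not reveal which one is stuck.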
Re: Weird binding issue that causes queues to build up.
Have you checked what does the system logs in debug mode? Regarding the queue, there is a tool created by Alejandro Guerreri that allows you to view the queue content and delete messages... well kannel does have several queues, so I don't know if it does wirk for the one you mention. I don't remember the details but you can check his blog. http://www.blogalex.com/archives/72 About the TLV's you are receiving, you should ask yoru provider to see what does they mean and what info are they sending. If its of your interest, you can configure meta-data so you can capture that info; otherwise you can safely ignore. As the PDU type is deliver_sm, I suspect that it might be the dlr status... and that is why you have that queue. Also if you upgrade to a recent version, the status page was improved and it shows now separate counters for MT and dlrs. in older versions MT/dlr counters were mixed Hope helps Alvaro |-| Envíe y Reciba Datos y mensajes de Texto (SMS) hacia y desde cualquier celular y Nextel en el Perú, México y en mas de 180 paises. Use aplicaciones 2 vias via SMS y GPRS online Visitenos en www.perusms.NET www.smsglobal.com.mx y www.pravcom.com On Thu, Sep 9, 2010 at 2:42 AM, brett skinner tatty.dishcl...@gmail.com wrote: Hi everyone Just wondering if anyone has had a chance to look at this yet? Thanks and appreciate any help. -- Forwarded message -- From: brett skinner tatty.dishcl...@gmail.com Date: Tue, Sep 7, 2010 at 10:47 AM Subject: Weird binding issue that causes queues to build up. To: Users users@kannel.org Hi We are experiencing a rather weird occasional issue with Kannel. We have two different boxes each with a Kannel installation. Every now and then one of the boxes stops processing SMS queues and the queues just build up. This happens to both boxes (just not at the same time) When we have a look at the status page we can see the queue and there are sms queued to the bearerbox. I assume that it is the bearerbox queue. 
It looks as followed (from the status page) SMS: received 123 (0 queued), sent 123 (456 queued), store size -1 It is the 456 queued part that we are concerned about. All the binds report as being online with 0 in the queues but that 456 queue does not disappear. If I sit trying to restart bind after bind one of them usually does the trick and queue disappears. The problem is we usually have no idea which bind it is and they are all reporting as being online. I have noticed looking through our logs from upstream applications that it appears that there was a network outage at round about the same time. I have not yet confirmed this with the hosting company. Also this is what appears in the syslog. Sep 6 23:02:46 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 53756 on interface 'eth0.0' Sep 6 23:17:01 123-123-123-123 CRON[16934]: (root) CMD ( cd / run-parts --report /etc/cron.hourly) Sep 6 23:32:46 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 33895 on interface 'eth0.0' Sep 7 00:02:45 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 55945 on interface 'eth0.0' Sep 7 00:17:01 123-123-123-123 CRON[17231]: (root) CMD ( cd / run-parts --report /etc/cron.hourly) Sep 7 00:32:45 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 45291 on interface 'eth0.0' Sep 7 01:02:45 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 39067 on interface 'eth0.0' Sep 7 01:17:01 123-123-123-123 CRON[17479]: (root) CMD ( cd / run-parts --report /etc/cron.hourly) That IP address is not in our kannel.conf file. I am not sure what these errors are about. I might need to investigate this further. I am not security expert so I have no idea if this is malicious or not. 
This is what appears in the bearerbox logs at about the same time as the outage:

2010-09-06 23:02:46 [32641] [12] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032503580) for PDU type (deliver_sm) received!
2010-09-06 23:03:07 [32641] [12] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032605180) for PDU type (deliver_sm) received!
2010-09-06 23:08:04 [32641] [10] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032113180) for PDU type (deliver_sm) received!
2010-09-06 23:14:32 [32641] [9] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032711480) for PDU type (deliver_sm) received!
2010-09-06 23:26:12 [32641] [6] ERROR: SMPP[AAA]: I/O error or other error. Re-connecting.
2010-09-06 23:26:12 [32641] [6] ERROR: SMPP[AAA]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-06 23:26:12 [32641] [18] WARNING: SMPP[BBB]: Not ACKED message
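The "Unknown TLV" warnings can be decoded offline to see what the SMSC is attaching to its deliver_sm PDUs. This minimal Python sketch assumes the third field in the warning is the raw value printed as hex digits (the 14 hex digits match the declared length 0x0007, i.e. 7 bytes). It only checks that the tag falls in the range SMPP v3.4 reserves for SMSC vendor-specific optional parameters (0x1400 to 0x3FFF); it does not know what tag 0x1406 means, which is still a question for the provider. The function name is hypothetical.

```python
import re

# Matches bearerbox warnings in the form quoted in this thread, e.g.
# "WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032503580) for PDU type (deliver_sm) received!"
TLV_WARNING = re.compile(
    r"Unknown TLV\(0x(?P<tag>[0-9A-Fa-f]+),0x(?P<length>[0-9A-Fa-f]+),"
    r"(?P<value>[0-9A-Fa-f]*)\) for PDU type \((?P<pdu>\w+)\)"
)

# SMPP v3.4 reserves tags 0x1400-0x3FFF for SMSC vendor-specific optional parameters.
VENDOR_SPECIFIC_TAGS = range(0x1400, 0x3FFF + 1)


def decode_unknown_tlv(line):
    """Split an 'Unknown TLV' warning into tag, declared length and value bytes.

    Assumes the third field is the value printed as hex digits; returns None
    for lines that do not match or whose value is not valid hex.
    """
    m = TLV_WARNING.search(line)
    if m is None:
        return None
    try:
        value = bytes.fromhex(m.group("value"))
    except ValueError:
        return None
    tag = int(m.group("tag"), 16)
    length = int(m.group("length"), 16)
    return {
        "pdu": m.group("pdu"),
        "tag": tag,
        "length": length,
        "value": value,
        "length_matches": len(value) == length,
        "vendor_specific": tag in VENDOR_SPECIFIC_TAGS,
    }
```

If the provider confirms what the tag carries, the same values could then be captured through Kannel's meta-data support, as Alvaro suggests, instead of being logged as warnings.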
Weird binding issue that causes queues to build up.
Hi,

We are experiencing a rather weird, occasional issue with Kannel. We have two different boxes, each with a Kannel installation. Every now and then one of the boxes stops processing SMS queues and the queues just build up. This happens on both boxes (just not at the same time). When we look at the status page we can see the queue, and there are SMS queued to the bearerbox. I assume that it is the bearerbox queue. It looks as follows (from the status page):

SMS: received 123 (0 queued), sent 123 (456 queued), store size -1

It is the 456 queued part that we are concerned about. All the binds report as being online with 0 in their queues, but that 456 queue does not disappear. If I go through restarting the binds one after another, one of them usually does the trick and the queue disappears. The problem is that we usually have no idea which bind it is, since they all report as being online.

Looking through the logs from our upstream applications, I have noticed that there appears to have been a network outage at around the same time. I have not yet confirmed this with the hosting company. This is also what appears in the syslog:
Sep 6 23:02:46 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 53756 on interface 'eth0.0'
Sep 6 23:17:01 123-123-123-123 CRON[16934]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Sep 6 23:32:46 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 33895 on interface 'eth0.0'
Sep 7 00:02:45 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 55945 on interface 'eth0.0'
Sep 7 00:17:01 123-123-123-123 CRON[17231]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Sep 7 00:32:45 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 45291 on interface 'eth0.0'
Sep 7 01:02:45 123-123-123-123 avahi-daemon[17943]: Received response from host 64.150.181.120 with invalid source port 39067 on interface 'eth0.0'
Sep 7 01:17:01 123-123-123-123 CRON[17479]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)

That IP address is not in our kannel.conf file. I am not sure what these errors are about and might need to investigate further. I am not a security expert, so I have no idea whether this is malicious or not.

This is what appears in the bearerbox logs at about the same time as the outage:

2010-09-06 23:02:46 [32641] [12] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032503580) for PDU type (deliver_sm) received!
2010-09-06 23:03:07 [32641] [12] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032605180) for PDU type (deliver_sm) received!
2010-09-06 23:08:04 [32641] [10] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032113180) for PDU type (deliver_sm) received!
2010-09-06 23:14:32 [32641] [9] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032711480) for PDU type (deliver_sm) received!
2010-09-06 23:26:12 [32641] [6] ERROR: SMPP[AAA]: I/O error or other error. Re-connecting.
2010-09-06 23:26:12 [32641] [6] ERROR: SMPP[AAA]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-06 23:26:12 [32641] [18] WARNING: SMPP[BBB]: Not ACKED message found, will retransmit. SENT94sec. ago, SEQ423861, DST+x
2010-09-06 23:26:12 [32641] [18] WARNING: SMPP[BBB]: Not ACKED message found, will retransmit. SENT94sec. ago, SEQ423862, DST+x
2010-09-06 23:26:12 [32641] [18] ERROR: SMPP[BBB]: I/O error or other error. Re-connecting.
2010-09-06 23:26:12 [32641] [18] ERROR: SMPP[BBB]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-06 23:26:12 [32641] [24] WARNING: Time-out waiting for concatenated message ''+x' '+yyy' 'EEE' '10' '2' '''. Send message parts as is.
2010-09-06 23:26:12 [32641] [24] WARNING: Time-out waiting for concatenated message ''+x' '+yyy' 'EEE' '85' '2' '''. Send message parts as is.
2010-09-06 23:26:12 [32641] [24] WARNING: Time-out waiting for concatenated message ''+x' '+yyy' 'EEE' '152' '2' '''. Send message parts as is.
2010-09-06 23:26:42 [32641] [17] ERROR: SMPP[CCC]: I/O error or other error. Re-connecting.
2010-09-06 23:26:42 [32641] [17] ERROR: SMPP[CCC]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-06 23:27:08 [32641] [13] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032035180) for PDU type (deliver_sm) received!
2010-09-06 23:27:12 [32641] [19] ERROR: SMPP[BBB]: I/O error or other error. Re-connecting.
2010-09-06 23:27:12 [32641] [19] ERROR: SMPP[BBB]: Couldn't connect to SMS center (retrying in 10 seconds).
2010-09-06 23:27:14 [32641] [12] WARNING: SMPP: Unknown TLV(0x1406,0x0007,01906032031280) for PDU type (deliver_sm) received!
2010-09-06 23:27:25 [32641] [16] ERROR: SMPP[CCC]: I/O error or other error. Re-connecting.
2010-09-06 23:27:25 [32641] [16] ERROR: SMPP[CCC]: Couldn't connect to SMS center (retrying in 10 seconds).
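Since the hard part is knowing which bind to restart, a rough first step is to count the reconnect errors per SMSC id in the bearerbox log around the time of the outage. The sketch below is a hypothetical Python helper that assumes the log line format shown above; it narrows down the candidates for a stop-smsc/start-smsc cycle, but it cannot prove which bind is actually holding the queue.

```python
import re
from collections import Counter

# Matches bearerbox lines that carry an SMSC id, e.g.
# "2010-09-06 23:26:12 [32641] [18] ERROR: SMPP[BBB]: Couldn't connect to SMS center (retrying in 10 seconds)."
# Lines without an id, such as "SMPP: Unknown TLV(...)", do not match.
SMSC_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[\d+\] \[\d+\] "
    r"(?P<level>[A-Z]+): SMPP\[(?P<smsc>[^\]]+)\]: (?P<msg>.*)$"
)
RECONNECT_PREFIX = "Couldn't connect to SMS center"


def reconnect_summary(lines):
    """Count reconnect errors per SMSC id and remember when each was last seen."""
    counts = Counter()
    last_seen = {}
    for line in lines:
        m = SMSC_LINE.match(line.strip())
        if m is None or not m.group("msg").startswith(RECONNECT_PREFIX):
            continue
        smsc = m.group("smsc")
        counts[smsc] += 1
        # Later lines overwrite earlier ones: this is the last occurrence in the input.
        last_seen[smsc] = m.group("ts")
    return counts, last_seen
```

For the log excerpt above, this counts two reconnect attempts each for BBB and CCC and one for AAA. Passing an open log file works too, since iterating over a file yields its lines.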