[OpenSIPS-Devel] SF.net SVN: opensips:[5847] trunk/modules/nat_traversal/nat_traversal.c

2009-07-08 Thread Dan Pascu
Revision: 5847
  http://opensips.svn.sourceforge.net/opensips/?rev=5847&view=rev
Author:   dan_pascu
Date: 2009-07-09 04:11:28 + (Thu, 09 Jul 2009)

Log Message:
---
Rollback changes made in revisions #5783 and #5784

Those changes do not solve any problem, they only hide the error by not
sending anymore keepalive messages to the affected endpoints. In reality
contact->uri should always be sip:IP:port so no extra checks are required.
The port is always present even when default and the uri is built from the
msg->rcv structure so it doesn't depend on any user setting from the script.

Modified Paths:
--
trunk/modules/nat_traversal/nat_traversal.c


This was sent by the SourceForge.net collaborative development platform, the 
world's largest Open Source development site.

___
Devel mailing list
Devel@lists.opensips.org
http://lists.opensips.org/cgi-bin/mailman/listinfo/devel


Re: [OpenSIPS-Devel] SF.net SVN: opensips:[5847] trunk/modules/nat_traversal/nat_traversal.c

2009-07-09 Thread Bogdan-Andrei Iancu
Hi Dan,

I agree that the fix I made was fixing an effect and not the real cause,
but it was preventing the crashes until the real cause is found and
fixed. The crashes were reported by Thomas Gelf (he got 4G of core
files in one night), so I'm not sure it is safe to remove this fix
without having the real fix in place - if it is not, we will simply
expose the users to more crashes.

Regards,
Bogdan

Dan Pascu wrote:
> Revision: 5847
>   http://opensips.svn.sourceforge.net/opensips/?rev=5847&view=rev
> Author:   dan_pascu
> Date: 2009-07-09 04:11:28 + (Thu, 09 Jul 2009)
>
> Log Message:
> ---
> Rollback changes made in revisions #5783 and #5784
>
> Those changes do not solve any problem, they only hide the error by not
> sending anymore keepalive messages to the affected endpoints. In reality
> contact->uri should always be sip:IP:port so no extra checks are required.
> The port is always present even when default and the uri is built from the
> msg->rcv structure so it doesn't depend on any user setting from the script.
>
> Modified Paths:
> --
> trunk/modules/nat_traversal/nat_traversal.c

Re: [OpenSIPS-Devel] SF.net SVN: opensips:[5847] trunk/modules/nat_traversal/nat_traversal.c

2009-07-09 Thread Dan Pascu

On 9 Jul 2009, at 10:45, Bogdan-Andrei Iancu wrote:

> Hi Dan,
>
> I agree that the fix I made was fixing an effect and not the real
> cause, but it was preventing the crashes until the real cause is
> found and fixed. The crashes were reported by Thomas Gelf (he got
> 4G of core files in one night), so I'm not sure it is safe to
> remove this fix without having the real fix in place - if it is not,
> we will simply expose the users to more crashes.


I disagree. Nobody except Thomas has reported this, and in the more than
a year that nat_traversal has been available nobody has reported crashes
with it. Thus I suspect that there is something else in his case that
needs to be better investigated. If we keep such a workaround in place,
the result will be that it will not send keepalive messages to the
affected endpoints. This both hides the cause of the segfault and
creates a new problem: Thomas will start reporting that his endpoints
are not kept alive anymore instead. Besides, if the workaround is in
place, people will become complacent and will not attempt to find the
real cause anymore.

Just to get an idea why the case Thomas gets is so unexpected,
contact->uri is built using this code:

static char*
get_source_uri(struct sip_msg *msg)
{
    static char uri[64];
    snprintf(uri, 64, "sip:%s:%d", ip_addr2a(&msg->rcv.src_ip), msg->rcv.src_port);
    return uri;
}

and then duplicated in shared memory. There is no way for contact->uri  
to end up NULL or not to contain the IP and port, no matter what  
actions the user does in the script.
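
To illustrate why the URI has to be duplicated after it is built, here
is a minimal self-contained sketch. It is not the module code:
struct recv_info, source_uri() and dup_uri() are hypothetical stand-ins,
and plain malloc() takes the place of the shared memory allocator. The
point is only that the static buffer is overwritten by the next call,
while the copy keeps the sip:IP:port value it was given.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the receive information kept in msg->rcv. */
struct recv_info {
    const char *src_ip;      /* textual source IP, e.g. "192.0.2.10" */
    unsigned short src_port; /* source port in host byte order */
};

/* Format "sip:IP:port" into a static buffer. The port is always written,
 * even when it is the default 5060. */
static char *source_uri(const struct recv_info *rcv)
{
    static char uri[64];
    int n;

    assert(rcv != NULL && rcv->src_ip != NULL);
    n = snprintf(uri, sizeof(uri), "sip:%s:%d", rcv->src_ip, rcv->src_port);
    /* snprintf always NUL-terminates; n >= sizeof(uri) would mean truncation. */
    assert(n > 0 && (size_t)n < sizeof(uri));
    return uri;
}

/* Copy the static buffer so the value survives the next source_uri() call.
 * malloc() is an assumption standing in for the shared memory allocator. */
static char *dup_uri(const char *uri)
{
    size_t len = strlen(uri) + 1;
    char *copy = malloc(len);

    if (copy != NULL)
        memcpy(copy, uri, len);
    return copy;
}
```

Calling source_uri() for a second endpoint changes the static buffer,
but a copy taken earlier with dup_uri() still holds the first endpoint.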

Right now I suspect that Thomas suffers from some sort of memory  
corruption that happens to affect the nat_traversal module internal  
data somehow.

Thomas, can you please compile opensips to use the system malloc  
instead of pkg_malloc and see if the problem persists? I had suffered  
similar weird memory corruption issues in the past, that could not be  
identified but were cured by using the system malloc. In my case the  
segfaults happened in t_relay or sl_send_reply, but the memory was  
similarly corrupted in unexpected places.
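
As a rough illustration of what such a build switch does, the sketch
below routes the same caller code either to a private pool or to the
system malloc, depending on a compile-time define. Everything in it is
an assumption made for the example: USE_POOL_ALLOC plays the role of
-DPKG_MALLOC, and my_pkg_malloc/my_pkg_free and the bump allocator are
hypothetical names, not the opensips memory manager.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* USE_POOL_ALLOC is a hypothetical switch playing the role of -DPKG_MALLOC. */
#ifdef USE_POOL_ALLOC
static unsigned char pool[4096];
static size_t pool_used;

/* Trivial bump allocator standing in for a private memory manager. */
static void *pool_alloc(size_t size)
{
    void *p;

    size = (size + 15u) & ~(size_t)15u;  /* round sizes up to a multiple of 16 */
    if (size > sizeof(pool) - pool_used)
        return NULL;
    p = pool + pool_used;
    pool_used += size;
    return p;
}
#define my_pkg_malloc(s) pool_alloc(s)
#define my_pkg_free(p)   ((void)(p))     /* the bump allocator never frees */
#else
/* Without the switch, requests go straight to the system allocator. */
#define my_pkg_malloc(s) malloc(s)
#define my_pkg_free(p)   free(p)
#endif

/* Caller code stays identical whichever allocator is selected. */
static char *copy_string(const char *s)
{
    size_t len;
    char *copy;

    assert(s != NULL);
    len = strlen(s) + 1;
    copy = my_pkg_malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}
```

Building without the define is the equivalent of the experiment asked
for above: the callers do not change, only the allocator behind them.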

--
Dan



Re: [OpenSIPS-Devel] SF.net SVN: opensips:[5847] trunk/modules/nat_traversal/nat_traversal.c

2009-07-14 Thread Thomas Gelf
Dan Pascu wrote:
> Just to get an idea why the case Thomas gets is so unexpected,
> contact->uri is built using this code:
>
> static char*
> get_source_uri(struct sip_msg *msg)
> {
>     static char uri[64];
>     snprintf(uri, 64, "sip:%s:%d", ip_addr2a(&msg->rcv.src_ip), msg->rcv.src_port);
>     return uri;
> }
> 
> and then duplicated in shared memory. There is no way for contact->uri  
> to end up NULL or not to contain the IP and port, no matter what  
> actions the user does in the script.

Is this also true if I'm doing AVP_RECEIVED = $source_uri ?
(that's what my config looks like)

> Right now I suspect that Thomas suffers from some sort of memory  
> corruption that happens to affect the nat_traversal module internal  
> data somehow.
> 
> Thomas, can you please compile opensips to use the system malloc  
> instead of pkg_malloc and see if the problem persists? I had suffered  
> similar weird memory corruption issues in the past, that could not be  
> identified but were cured by using the system malloc. In my case the  
> segfaults happened in t_relay or sl_send_reply, but the memory was  
> similarly corrupted in unexpected places.

I did so - or better, I tried my best to do so. Changes in revision
5653 didn't allow me to compile with system malloc. At least that's
my assumption. As I never wrote a C program that's just a wild guess.
After reverting some changes (r5653-5655) and disabling STATISTICS
I have finally been able to compile without PKG_MALLOC (see other
thread).

I do not really like the idea of reproducing the nat_helper crash, as
I have a (very) few customers already using this proxy. And I really
have no idea how to do so. It was a really strange effect - after
a restart it kept crashing and crashing. After a while it appeared
to be stable again - but after the next call (not sure if it was
really the next one) - it crashed and crashed once again. Usually
it did so shortly after a new dialog started (at least it seemed
to be so). If you'd like to have a look at the core files I could
try to find the corresponding binary.

However, I discovered other ways to crash OpenSIPS - and they still
work even with my somehow-fiddled-system-malloc-version. I'll send
another post with related information.

There is one thing I became aware of: shortly after being started it's
easy to produce crashes - if it runs undisturbed for a while, chances
are good that it will keep running. I know that this is not a good
diagnosis - but that's how it seemed to behave. OpenSIPS probably needs
a lot of love and care ;-)

Best regards,
Thomas Gelf



Re: [OpenSIPS-Devel] SF.net SVN: opensips:[5847] trunk/modules/nat_traversal/nat_traversal.c

2009-07-14 Thread Dan Pascu


On 15 Jul 2009, at 01:32, Thomas Gelf wrote:


> Dan Pascu wrote:
>> Just to get an idea why the case Thomas gets is so unexpected,
>> contact->uri is built using this code:
>>
>> static char*
>> get_source_uri(struct sip_msg *msg)
>> {
>>     static char uri[64];
>>     snprintf(uri, 64, "sip:%s:%d", ip_addr2a(&msg->rcv.src_ip), msg->rcv.src_port);
>>     return uri;
>> }
>>
>> and then duplicated in shared memory. There is no way for contact->uri
>> to end up NULL or not to contain the IP and port, no matter what
>> actions the user does in the script.
>
> Is this also true if I'm doing AVP_RECEIVED = $source_uri ?
> (that's what my config looks like)



Yes. As I said, it doesn't matter what you do in the script. The values
are read from internal opensips structures that reflect the kernel
structures containing the source IP/port and destination IP/port.
There is no way in which those are affected by script actions, nor can
they be absent. This is what makes me believe that it's not a
problem in the nat_traversal code, but some form of memory corruption.
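
To make that concrete, here is a small sketch of how a source address
handed back by the kernel (for example the peer address filled in by
recvfrom()) turns into the sip:IP:port string. The helper name
peer_to_uri() is hypothetical and the sketch handles IPv4 only; it is
not the opensips code, just the same idea expressed with standard
socket calls.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: format the peer address of a received packet as
 * "sip:IP:port". Returns 0 on success, -1 on failure. IPv4 only. */
static int peer_to_uri(const struct sockaddr_in *peer, char *out, size_t outlen)
{
    char ip[INET_ADDRSTRLEN];
    int n;

    assert(peer != NULL && out != NULL);
    if (peer->sin_family != AF_INET)
        return -1;
    /* The address and port always exist in the structure; nothing in a
     * routing script can remove them. */
    if (inet_ntop(AF_INET, &peer->sin_addr, ip, sizeof(ip)) == NULL)
        return -1;
    n = snprintf(out, outlen, "sip:%s:%d", ip, ntohs(peer->sin_port));
    return (n > 0 && (size_t)n < outlen) ? 0 : -1;
}
```

Both fields are fixed-size members of the address structure, which is
why a well-formed result never lacks the IP or the port.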



>> Right now I suspect that Thomas suffers from some sort of memory
>> corruption that happens to affect the nat_traversal module internal
>> data somehow.
>>
>> Thomas, can you please compile opensips to use the system malloc
>> instead of pkg_malloc and see if the problem persists? I had suffered
>> similar weird memory corruption issues in the past, that could not be
>> identified but were cured by using the system malloc. In my case the
>> segfaults happened in t_relay or sl_send_reply, but the memory was
>> similarly corrupted in unexpected places.
>
> I did so - or better, I tried my best to do so. Changes in revision
> 5653 didn't allow me to compile with system malloc. At least that's
> my assumption. As I never wrote a C program that's just a wild guess.
> After reverting some changes (r5653-5655) and disabling STATISTICS
> I have finally been able to compile without PKG_MALLOC (see other
> thread).


I do not know about that as I've never tried opensips-1.5 or newer.
However, here is a patch that I use (debian dpatch, but it can be used
as a standard patch) for disabling pkg_malloc and using the system
malloc instead (see attachment).




> I do not really like the idea to reproduce the nat_helper crash, as
> I have (very) few customers already using this proxy. And I really
> have no idea how to do so. It was a really strange effect - after
> a restart it kept crashing and crashing. After a while it appeared
> to be stable again - but after the next call (not sure if it was
> really the next one) - it crashed and crashed once again. Usually
> it did so shortly after a new dialog started (at least it seemed
> to be so). If you'd like to have a look at the core files I could
> try to find the corresponding binary.



For the moment it's enough if you post the output of bt full in gdb.


> However, I discovered other ways to crash OpenSIPS - and they still
> work even with my somehow-fiddled-system-malloc-version. I'll send
> another post with related information.


Are you saying that with the system malloc you do not see the
nat_traversal related crashes anymore, and that you now only see
crashes related to other parts of the code?



> There is one thing I got aware of: shortly after being started it's
> easy to produce crashes - if running without being disturbed for a
> while chances are good that it will keep running. I know that this
> is not a good diagnose - but that's how it seemed to behave. OpenSIPS
> probably needs a lot of love and care ;-)



You are the first one to report such an issue in nat_traversal. If it
were a bug in the nat_traversal code, it would become obvious on
inspection (especially after a backtrace) and many more people would
be affected in a systematic manner. I have it running on dozens of
systems without any problem. The scarce and random nature of it makes
me believe it's something else, related to memory corruption.

I've seen similarly strange issues in the past: unexpected crashes in
parts of the code that had no obvious problems at the place the
backtrace reported, which were magically cured by using the system
malloc. As in your case, nobody else experienced the crashes I did. So
I believe it's a memory corruption issue triggered by a combination of
factors that is unique to each installation. It results in crashes in
different areas of the code, not because those areas contain
programming errors, but because they are hit as a side effect of the
memory being corrupted.
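
A toy model of that effect, as a sketch only: two adjacent blocks carved
out of one private pool, and a writer that does not check the block
size. The names (pool, block(), careless_store()) are hypothetical and
this is not a claim about what actually happens in your installation.
The copy is clamped to the pool so the demonstration itself stays well
defined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 32

/* Hypothetical private pool: two adjacent fixed-size blocks in one array. */
static unsigned char pool[2 * BLOCK_SIZE];

/* Return the start of block 0 or block 1 inside the pool. */
static char *block(int index)
{
    assert(index == 0 || index == 1);
    return (char *)(pool + (size_t)index * BLOCK_SIZE);
}

/* Buggy writer: it never checks BLOCK_SIZE, so a long string spills into
 * the neighbouring block. The copy is only limited to the end of the pool. */
static void careless_store(char *dst, const char *src)
{
    size_t len = strlen(src) + 1;
    size_t room = sizeof(pool) - (size_t)((unsigned char *)dst - pool);

    if (len > room)
        len = room;
    memcpy(dst, src, len);
}
```

If block 1 holds a URI and a 40 character string is stored into block 0,
the write itself succeeds silently and the URI in block 1 is damaged.
The failure then shows up later, in whatever code reads block 1, which
is why the backtrace points at code that looks correct.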


Can you tell me if you see the issue anymore after switching to the  
system malloc?



--
Dan




12_use_system_malloc.dpatch
Description: Binary data

Re: [OpenSIPS-Devel] SF.net SVN: opensips:[5847] trunk/modules/nat_traversal/nat_traversal.c

2009-07-15 Thread Thomas Gelf
Dan Pascu wrote:
> I do not know about that as I've never tried opensips-1.5 or newer.
...
> You are the first one to report such an issue in nat_traversal. If it
> would be a bug in the nat_traversal code, it would become obvious on
> inspection (especially after a backtrace) and many more people would be
> affected in a systematic manner. I have it running on dozens of system
> without any problem...

I'm running OpenSIPS dialog-stateful with restart-persistent dialogs
and dialog flags/profiles etc. Could this be somehow related to my
problem?

Please also note: there have been very few such crashes among the other
MySQL-related ones; my nightmare with some gigs of core files started
with revision 5783 and disappeared with 5784, at the cost of losing a
functional keep_alive mechanism ;-)

I'll send you a full backtrace from revision 5780 off-list...

Regards,
Thomas Gelf



Re: [OpenSIPS-Devel] SF.net SVN: opensips:[5847] trunk/modules/nat_traversal/nat_traversal.c

2009-07-15 Thread Dan Pascu

On 15 Jul 2009, at 13:00, Thomas Gelf wrote:

> Dan Pascu wrote:
>> I do not know about that as I've never tried opensips-1.5 or newer.
> ...
>> You are the first one to report such an issue in nat_traversal. If it
>> would be a bug in the nat_traversal code, it would become obvious on
>> inspection (especially after a backtrace) and many more people would
>> be affected in a systematic manner. I have it running on dozens of
>> system without any problem...
>
> I'm running OpenSIPS dialog-stateful with restart-persistent dialogs
> and dialog flags/profiles etc. Could this be somehow related to my
> problem?
>

I don't know. I use restart-persistent dialogs, but I don't use
dialog flags or profiles. I only use dialogs indirectly
(triggered by nat_traversal and mediaproxy).

> Please also note: there have been very few such crashes among the other
> MySQL-related ones; my nightmare with some gigs of core files started
> with revision 5783 and disappeared with 5784, at the cost of losing a
> functional keep_alive mechanism ;-)
>

I'm a bit confused. You say segfaults started with 5783, but you have  
a backtrace from 5780?

> I'll send you a full backtrace from revision 5780 off-list...

Can you confirm if using system malloc made the problem disappear?

--
Dan





Re: [OpenSIPS-Devel] SF.net SVN: opensips:[5847] trunk/modules/nat_traversal/nat_traversal.c

2009-07-15 Thread Thomas Gelf
Dan Pascu wrote:
> I'm a bit confused. You say segfaults started with 5783, but you have  
> a backtrace from 5780?

I showed Bogdan the (few, infrequent) segfaults that I noticed with
5780, and that resulted in 5783 (causing MANY more segfaults) and
finally 5784 (avoiding this particular segfault at the cost of
keepalives no longer working correctly).

> Can you confirm if using system malloc made the problem disappear?

No, I can't. I'm pretty sure that with 5783 I'd immediately get such a
segfault once again - but as far as I understand, that is to be
expected and not related to the earlier problem (even if the segfaults
look very similar).

Yesterday I tried latest SVN with both system malloc and pkg_malloc,
and I was able to crash both of them. But that was for other reasons
(see thread "Some ways to crash OpenSIPS with current SVN"), NOT
related to nat_helper.

Now I'd better wait for a response to my thread "Unable to disable
PKG_MALLOC", as I don't feel quite comfortable with a 1.4.4 patch for
current trunk ;-) Yesterday I removed -DPKG_MALLOC, reverted some
statistics-related changes (r5653-5655) and finally also removed
-DSTATISTICS.

That way I was able to compile it (not sure if it was done correctly,
but it looked good to me). Please note once again that I have never
written C code - it's just thanks to OpenSIPS that I had to dig a
little bit into it and teach myself some basics. Therefore I'm not sure
whether my changes were fine ;-)

I absolutely need to keep this system up and running, at least
during business hours (occasional crashes and restarts by monit are
something I can live with - just r5783 was a little bit too heavy).
If you have an idea how to easily reproduce such a crash that would
be great - but if it is really related to memory corruption I doubt
that this would be an easy task :-/

Cheers,
Thomas



Re: [OpenSIPS-Devel] SF.net SVN: opensips:[5847] trunk/modules/nat_traversal/nat_traversal.c

2009-07-15 Thread Dan Pascu

On 15 Jul 2009, at 14:32, Thomas Gelf wrote:

> Dan Pascu wrote:
>> I'm a bit confused. You say segfaults started with 5783, but you have
>> a backtrace from 5780?
>
> I showed Bogdan my (few, seldom) segfaults that I noticed with 5780,
> and that resulted in 5783 (causing MANY more segfaults)

That is no wonder. 5783 contained a bug that made segfaults certain
(it operated on a different variable than intended). This is why 5784
was made immediately after it. You should not use 5783 at all. The fix
Bogdan made is the combination of 5783+5784.

What you should use is current trunk with the system malloc, so we can
see if it fixes your problem. Alternatively you can use anything before
5783, also with the system malloc.

--
Dan





Re: [OpenSIPS-Devel] SF.net SVN: opensips:[5847] trunk/modules/nat_traversal/nat_traversal.c

2009-07-20 Thread Thomas Gelf
Dan Pascu wrote:
> That is no wonder. 5783 contained a bug that made segfaults certain  
> (it operated on a different variable than intended). This is why 5784  
> was made immediately after it. You should not use 5783 at all. The fix  
> Bogdan made, is given by the combination of 5783+5784.
> 
> What you should use is current trunk with system malloc so we can see  
> if it fixes your problem. Alternatively you can use anything before  
> 5783 also with the system malloc.

Right now I'm pretty sure these crashes have been somehow related to the
crashes mentioned in my thread "Some ways to crash OpenSIPS with current
SVN" - so I think we can forget about this one...

...at least unless I manage to discover something new ;-)


Cheers,
Thomas


