Re: Asterisk deadlocks since Kernel 4.1

2015-12-06 Thread Philipp Hahn
Hello Stefan,

Am 06.12.2015 um 21:56 schrieb Stefan Priebe:
> i think i found the issue in 4.1 with netlink. Somebody made a mistake
> while backporting or cherry-picking your patch "netlink: Fix autobind
> race condition that leads to zero port ID" to 4.1.
> 
> It misses a goto in 4.1.
> 
> This goto is missing in 4.1:
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 4017e12..f15c001 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32
> portid)
> if (err == -EEXIST)
> err = -EADDRINUSE;
> sock_put(sk);
> -   goto err;
> }
> 
> /* We need to ensure that the socket is hashed and visible. */
> 
> Can you please confirm, that this is not correct and might cause those
> issues.

I just tested that patch and it seems to fix our hang. Thank you for
your good work.

Philipp

PS: I guess I can skip testing your other test request as this simple
patch is part of your other hiuq4bsW patch. If I should still test it,
just send a note.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-06 Thread Stefan Priebe - Profihost AG
Hi Herbert,

Am 07.12.2015 um 02:20 schrieb Herbert Xu:
> On Sun, Dec 06, 2015 at 09:56:34PM +0100, Stefan Priebe wrote:
>> Hi Herbert,
>>
>> i think i found the issue in 4.1 with netlink. Somebody made a
>> mistake while backporting or cherry-picking your patch "netlink: Fix
>> autobind race condition that leads to zero port ID" to 4.1.
>>
>> It misses a goto in 4.1.
>>
>> This goto is missing in 4.1:
>>
>> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
>> index 4017e12..f15c001 100644
>> --- a/net/netlink/af_netlink.c
>> +++ b/net/netlink/af_netlink.c
>> @@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
>> if (err == -EEXIST)
>> err = -EADDRINUSE;
>> sock_put(sk);
>> -   goto err;
>> }
>>
>> /* We need to ensure that the socket is hashed and visible. */
>>
>> Can you please confirm, that this is not correct and might cause
>> those issues.
> 
> Well spotted! Yes this would be a fatal error and can cause the
> problems you guys are seeing.

Thanks, good. Can you help me to get this fix upstream into the stable
lines?

Stefan

> 
> Thanks,
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-06 Thread Herbert Xu
On Sun, Dec 06, 2015 at 09:56:34PM +0100, Stefan Priebe wrote:
> Hi Herbert,
> 
> i think i found the issue in 4.1 with netlink. Somebody made a
> mistake while backporting or cherry-picking your patch "netlink: Fix
> autobind race condition that leads to zero port ID" to 4.1.
> 
> It misses a goto in 4.1.
> 
> This goto is missing in 4.1:
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 4017e12..f15c001 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
> if (err == -EEXIST)
> err = -EADDRINUSE;
> sock_put(sk);
> -   goto err;
> }
> 
> /* We need to ensure that the socket is hashed and visible. */
> 
> Can you please confirm, that this is not correct and might cause
> those issues.

Well spotted! Yes this would be a fatal error and can cause the
problems you guys are seeing.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-06 Thread Stefan Priebe

Hi Herbert,

i think i found the issue in 4.1 with netlink. Somebody made a mistake 
while backporting or cherry-picking your patch "netlink: Fix autobind 
race condition that leads to zero port ID" to 4.1.


It misses a goto in 4.1.

This goto is missing in 4.1:

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 4017e12..f15c001 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
if (err == -EEXIST)
err = -EADDRINUSE;
sock_put(sk);
-   goto err;
}

/* We need to ensure that the socket is hashed and visible. */

Can you please confirm, that this is not correct and might cause those 
issues.


Stefan

Am 05.12.2015 um 02:08 schrieb Herbert Xu:

On Fri, Dec 04, 2015 at 07:26:12PM +0100, Stefan Priebe wrote:


* 9f87e0c - (2 months ago) netlink: Replace rhash_portid with bound
- Herbert Xu
* 35e9890 - (3 months ago) netlink: Fix autobind race condition that
leads to zero port ID - Herbert Xu
* 30c6472 - (7 months ago) netlink: Use random autobind rover - Herbert Xu


These three patches are absolutely required in any kernel where the
netlink insertion is lockless.  So yes they should be applied to
4.1.

Thanks,


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-06 Thread Stefan Priebe

Hi Herbert,

i think i found the issue in 4.1 with netlink. Somebody made a mistake 
while backporting or cherry-picking your patch "netlink: Fix autobind 
race condition that leads to zero port ID" to 4.1.


It misses a goto in 4.1.

This goto is missing in 4.1:

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 4017e12..f15c001 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
if (err == -EEXIST)
err = -EADDRINUSE;
sock_put(sk);
-   goto err;
}

/* We need to ensure that the socket is hashed and visible. */

Can you please confirm, that this is not correct and might cause those 
issues.


Stefan

Am 05.12.2015 um 02:08 schrieb Herbert Xu:

On Fri, Dec 04, 2015 at 07:26:12PM +0100, Stefan Priebe wrote:


* 9f87e0c - (2 months ago) netlink: Replace rhash_portid with bound
- Herbert Xu
* 35e9890 - (3 months ago) netlink: Fix autobind race condition that
leads to zero port ID - Herbert Xu
* 30c6472 - (7 months ago) netlink: Use random autobind rover - Herbert Xu


These three patches are absolutely required in any kernel where the
netlink insertion is lockless.  So yes they should be applied to
4.1.

Thanks,


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-06 Thread Herbert Xu
On Sun, Dec 06, 2015 at 09:56:34PM +0100, Stefan Priebe wrote:
> Hi Herbert,
> 
> i think i found the issue in 4.1 with netlink. Somebody made a
> mistake while backporting or cherry-picking your patch "netlink: Fix
> autobind race condition that leads to zero port ID" to 4.1.
> 
> It misses a goto in 4.1.
> 
> This goto is missing in 4.1:
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 4017e12..f15c001 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
> if (err == -EEXIST)
> err = -EADDRINUSE;
> sock_put(sk);
> -   goto err;
> }
> 
> /* We need to ensure that the socket is hashed and visible. */
> 
> Can you please confirm, that this is not correct and might cause
> those issues.

Well spotted! Yes this would be a fatal error and can cause the
problems you guys are seeing.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-06 Thread Philipp Hahn
Hello Stefan,

Am 06.12.2015 um 21:56 schrieb Stefan Priebe:
> i think i found the issue in 4.1 with netlink. Somebody made a mistake
> while backporting or cherry-picking your patch "netlink: Fix autobind
> race condition that leads to zero port ID" to 4.1.
> 
> It misses a goto in 4.1.
> 
> This goto is missing in 4.1:
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 4017e12..f15c001 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32
> portid)
> if (err == -EEXIST)
> err = -EADDRINUSE;
> sock_put(sk);
> -   goto err;
> }
> 
> /* We need to ensure that the socket is hashed and visible. */
> 
> Can you please confirm, that this is not correct and might cause those
> issues.

I just tested that patch and it seems to fix our hang. Thank you for
your good work.

Philipp

PS: I guess I can skip testing your other test request as this simple
patch is part of your other hiuq4bsW patch. If I should still test it,
just send a note.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-06 Thread Stefan Priebe - Profihost AG
Hi Herbert,

Am 07.12.2015 um 02:20 schrieb Herbert Xu:
> On Sun, Dec 06, 2015 at 09:56:34PM +0100, Stefan Priebe wrote:
>> Hi Herbert,
>>
>> i think i found the issue in 4.1 with netlink. Somebody made a
>> mistake while backporting or cherry-picking your patch "netlink: Fix
>> autobind race condition that leads to zero port ID" to 4.1.
>>
>> It misses a goto in 4.1.
>>
>> This goto is missing in 4.1:
>>
>> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
>> index 4017e12..f15c001 100644
>> --- a/net/netlink/af_netlink.c
>> +++ b/net/netlink/af_netlink.c
>> @@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
>> if (err == -EEXIST)
>> err = -EADDRINUSE;
>> sock_put(sk);
>> -   goto err;
>> }
>>
>> /* We need to ensure that the socket is hashed and visible. */
>>
>> Can you please confirm, that this is not correct and might cause
>> those issues.
> 
> Well spotted! Yes this would be a fatal error and can cause the
> problems you guys are seeing.

Thanks, good. Can you help me to get this fix upstream into the stable
lines?

Stefan

> 
> Thanks,
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-05 Thread Stefan Priebe

Hello Philipp,

Am 05.12.2015 um 15:19 schrieb Philipp Matthias Hahn:

Hello Hannes,

On Wed, Dec 02, 2015 at 12:40:32PM +0100, Hannes Frederic Sowa wrote:

git bisect tells me it stopped working after those two commits were applied:

commit d48623677191e0f035d7afd344f92cf880b01f8e
Author: Herbert Xu 
Date:   Tue Sep 22 11:38:56 2015 +0800

 netlink: Replace rhash_portid with bound

commit 4e27762417669cb459971635be550eb7b5598286
Author: Herbert Xu 
Date:   Fri Sep 18 19:16:50 2015 +0800

 netlink: Fix autobind race condition that leads to zero port ID


Cool, thanks a lot. Does this patch make a difference?

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 59651af..278e94c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)

/* We need to ensure that the socket is hashed and visible. */
smp_wmb();
-   nlk_sk(sk)->bound = portid;
+   nlk_sk(sk)->bound = true;

  err:
release_sock(sk);


Didn't help, test program still gets stuck; sorry :-(


Can you please try this patch on top of 4.1.13 (which helps me):
http://pastebin.com/raw.php?i=hiuq4bsW

Stefan



Philipp


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-05 Thread Philipp Matthias Hahn
Hello Hannes,

On Wed, Dec 02, 2015 at 12:40:32PM +0100, Hannes Frederic Sowa wrote:
> > git bisect tells me it stopped working after those two commits were applied:
> >
> > commit d48623677191e0f035d7afd344f92cf880b01f8e
> > Author: Herbert Xu 
> > Date:   Tue Sep 22 11:38:56 2015 +0800
> >
> > netlink: Replace rhash_portid with bound
> >
> > commit 4e27762417669cb459971635be550eb7b5598286
> > Author: Herbert Xu 
> > Date:   Fri Sep 18 19:16:50 2015 +0800
> >
> > netlink: Fix autobind race condition that leads to zero port ID
> 
> Cool, thanks a lot. Does this patch make a difference?
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>  
>   /* We need to ensure that the socket is hashed and visible. */
>   smp_wmb();
> - nlk_sk(sk)->bound = portid;
> + nlk_sk(sk)->bound = true;
>  
>  err:
>   release_sock(sk);

Didn't help, test program still gets stuck; sorry :-(

Philipp

-- 
  / /  (_)__  __   __ Philipp Hahn
 / /__/ / _ \/ // /\ \/ /
//_/_//_/\_,_/ /_/\_\ pmh...@pmhahn.de
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-05 Thread Stefan Priebe

Hello Philipp,

Am 05.12.2015 um 15:19 schrieb Philipp Matthias Hahn:

Hello Hannes,

On Wed, Dec 02, 2015 at 12:40:32PM +0100, Hannes Frederic Sowa wrote:

git bisect tells me it stopped working after those two commits were applied:

commit d48623677191e0f035d7afd344f92cf880b01f8e
Author: Herbert Xu 
Date:   Tue Sep 22 11:38:56 2015 +0800

 netlink: Replace rhash_portid with bound

commit 4e27762417669cb459971635be550eb7b5598286
Author: Herbert Xu 
Date:   Fri Sep 18 19:16:50 2015 +0800

 netlink: Fix autobind race condition that leads to zero port ID


Cool, thanks a lot. Does this patch make a difference?

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 59651af..278e94c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)

/* We need to ensure that the socket is hashed and visible. */
smp_wmb();
-   nlk_sk(sk)->bound = portid;
+   nlk_sk(sk)->bound = true;

  err:
release_sock(sk);


Didn't help, test program still gets stuck; sorry :-(


Can you please try this patch on top of 4.1.13 (which helps me):
http://pastebin.com/raw.php?i=hiuq4bsW

Stefan



Philipp


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-05 Thread Philipp Matthias Hahn
Hello Hannes,

On Wed, Dec 02, 2015 at 12:40:32PM +0100, Hannes Frederic Sowa wrote:
> > git bisect tells me it stopped working after those two commits were applied:
> >
> > commit d48623677191e0f035d7afd344f92cf880b01f8e
> > Author: Herbert Xu 
> > Date:   Tue Sep 22 11:38:56 2015 +0800
> >
> > netlink: Replace rhash_portid with bound
> >
> > commit 4e27762417669cb459971635be550eb7b5598286
> > Author: Herbert Xu 
> > Date:   Fri Sep 18 19:16:50 2015 +0800
> >
> > netlink: Fix autobind race condition that leads to zero port ID
> 
> Cool, thanks a lot. Does this patch make a difference?
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>  
>   /* We need to ensure that the socket is hashed and visible. */
>   smp_wmb();
> - nlk_sk(sk)->bound = portid;
> + nlk_sk(sk)->bound = true;
>  
>  err:
>   release_sock(sk);

Didn't help, test program still gets stuck; sorry :-(

Philipp

-- 
  / /  (_)__  __   __ Philipp Hahn
 / /__/ / _ \/ // /\ \/ /
//_/_//_/\_,_/ /_/\_\ pmh...@pmhahn.de
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-04 Thread Herbert Xu
On Fri, Dec 04, 2015 at 07:26:12PM +0100, Stefan Priebe wrote:
>
> * 9f87e0c - (2 months ago) netlink: Replace rhash_portid with bound
> - Herbert Xu
> * 35e9890 - (3 months ago) netlink: Fix autobind race condition that
> leads to zero port ID - Herbert Xu
> * 30c6472 - (7 months ago) netlink: Use random autobind rover - Herbert Xu

These three patches are absolutely required in any kernel where the
netlink insertion is lockless.  So yes they should be applied to
4.1.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-04 Thread Stefan Priebe

Hi,

I got it fixed / at least not live / deadlocking by doing applying the 
following patch - which is the diff of the commits below on top of 4.1.13.


patch:
http://pastebin.com/raw.php?i=hiuq4bsW

all commits / changes in reverse order:

* 0ceb380 - (6 weeks ago) netlink: fix locking around 
NETLINK_LIST_MEMBERSHIPS - David Herrmann (HEAD)
* c3f272b - (7 weeks ago) netlink: Trim skb to alloc size to avoid 
MSG_TRUNC - Arad, Ronen
* 9f87e0c - (2 months ago) netlink: Replace rhash_portid with bound - 
Herbert Xu
* 35e9890 - (3 months ago) netlink: Fix autobind race condition that 
leads to zero port ID - Herbert Xu
* f1d1215 - (3 months ago) netlink, mmap: transform mmap skb into full 
skb on taps - Daniel Borkmann
* faad871 - (3 months ago) netlink, mmap: fix edge-case leakages in nf 
queue zero-copy - Daniel Borkmann
* fb18c94 - (3 months ago) netlink, mmap: don't walk rx ring on poll if 
receive queue non-empty - Daniel Borkmann
* da13789 - (3 months ago) netlink: rx mmap: fix POLLIN condition - 
Ken-ichirou MATSUZAWA
* 808071f - (3 months ago) netlink: mmap: fix lookup frame position - 
Ken-ichirou MATSUZAWA
* 589bfd5 - (3 months ago) netlink: add NETLINK_CAP_ACK socket option - 
Christophe Ricard
* d23c4eb - (4 months ago) netlink: mmap: fix tx type check - 
Ken-ichirou MATSUZAWA
* 5dcc50a - (4 months ago) netlink: make sure -EBUSY won't escape from 
netlink_insert - Daniel Borkmann
* ada2b3e - (5 months ago) netlink: don't hold mutex in rcu callback 
when releasing mmapd ring - Florian Westphal
* e0f54a3 - (5 months ago) netlink: Delete an unnecessary check before 
the function call "module_put" - Markus Elfring
* 0a5bdaf - (6 months ago) netlink: add API to retrieve all group 
memberships - David Herrmann

* 30c6472 - (7 months ago) netlink: Use random autobind rover - Herbert Xu
* 021a670 - (7 months ago) netlink: Create kernel netlink sockets in the 
proper network namespace - Eric W. Biederman
* e1b01b4 - (7 months ago) net: Pass kern from net_proto_family.create 
to sk_alloc - Eric W. Biederman
* dd4b3c9 - (7 months ago) netlink: rename private flags and states - 
Nicolas Dichtel
* 0356126 - (2 days ago) Revert "netlink: don't hold mutex in rcu 
callback when releasing mmapd ring" - Stefan Priebe
* 231d0da - (2 days ago) Revert "netlink: make sure -EBUSY won't escape 
from netlink_insert" - Stefan Priebe
* e0f56af1 - (2 days ago) Revert "netlink, mmap: transform mmap skb into 
full skb on taps" - Stefan Priebe
* 23a0326 - (2 days ago) Revert "netlink: Fix autobind race condition 
that leads to zero port ID" - Stefan Priebe
* 97f4677 - (2 days ago) Revert "netlink: Replace rhash_portid with 
bound" - Stefan Priebe
* 40c851fe - (2 days ago) Revert "netlink: Trim skb to alloc size to 
avoid MSG_TRUNC" - Stefan Priebe
* 1f2ce4a - (4 weeks ago) Linux 4.1.13 - Greg Kroah-Hartman (v4.1.13, 
origin/linux-4.1.y)


So the netlink code is in line with 4.3.

Stefan

Am 02.12.2015 um 12:40 schrieb Hannes Frederic Sowa:

Hello Stefan,

Stefan Priebe - Profihost AG  writes:



here are the results.

It works with 4.1.
It works with 4.2.
It does not work with 4.1.13.

git bisect tells me it stopped working after those two commits were applied:

commit d48623677191e0f035d7afd344f92cf880b01f8e
Author: Herbert Xu 
Date:   Tue Sep 22 11:38:56 2015 +0800

 netlink: Replace rhash_portid with bound

commit 4e27762417669cb459971635be550eb7b5598286
Author: Herbert Xu 
Date:   Fri Sep 18 19:16:50 2015 +0800

 netlink: Fix autobind race condition that leads to zero port ID


Cool, thanks a lot. Does this patch make a difference?

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 59651af..278e94c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)

/* We need to ensure that the socket is hashed and visible. */
smp_wmb();
-   nlk_sk(sk)->bound = portid;
+   nlk_sk(sk)->bound = true;

  err:
release_sock(sk);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-04 Thread Herbert Xu
On Fri, Dec 04, 2015 at 07:26:12PM +0100, Stefan Priebe wrote:
>
> * 9f87e0c - (2 months ago) netlink: Replace rhash_portid with bound
> - Herbert Xu
> * 35e9890 - (3 months ago) netlink: Fix autobind race condition that
> leads to zero port ID - Herbert Xu
> * 30c6472 - (7 months ago) netlink: Use random autobind rover - Herbert Xu

These three patches are absolutely required in any kernel where the
netlink insertion is lockless.  So yes they should be applied to
4.1.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-04 Thread Stefan Priebe

Hi,

I got it fixed / at least not live / deadlocking by doing applying the 
following patch - which is the diff of the commits below on top of 4.1.13.


patch:
http://pastebin.com/raw.php?i=hiuq4bsW

all commits / changes in reverse order:

* 0ceb380 - (6 weeks ago) netlink: fix locking around 
NETLINK_LIST_MEMBERSHIPS - David Herrmann (HEAD)
* c3f272b - (7 weeks ago) netlink: Trim skb to alloc size to avoid 
MSG_TRUNC - Arad, Ronen
* 9f87e0c - (2 months ago) netlink: Replace rhash_portid with bound - 
Herbert Xu
* 35e9890 - (3 months ago) netlink: Fix autobind race condition that 
leads to zero port ID - Herbert Xu
* f1d1215 - (3 months ago) netlink, mmap: transform mmap skb into full 
skb on taps - Daniel Borkmann
* faad871 - (3 months ago) netlink, mmap: fix edge-case leakages in nf 
queue zero-copy - Daniel Borkmann
* fb18c94 - (3 months ago) netlink, mmap: don't walk rx ring on poll if 
receive queue non-empty - Daniel Borkmann
* da13789 - (3 months ago) netlink: rx mmap: fix POLLIN condition - 
Ken-ichirou MATSUZAWA
* 808071f - (3 months ago) netlink: mmap: fix lookup frame position - 
Ken-ichirou MATSUZAWA
* 589bfd5 - (3 months ago) netlink: add NETLINK_CAP_ACK socket option - 
Christophe Ricard
* d23c4eb - (4 months ago) netlink: mmap: fix tx type check - 
Ken-ichirou MATSUZAWA
* 5dcc50a - (4 months ago) netlink: make sure -EBUSY won't escape from 
netlink_insert - Daniel Borkmann
* ada2b3e - (5 months ago) netlink: don't hold mutex in rcu callback 
when releasing mmapd ring - Florian Westphal
* e0f54a3 - (5 months ago) netlink: Delete an unnecessary check before 
the function call "module_put" - Markus Elfring
* 0a5bdaf - (6 months ago) netlink: add API to retrieve all group 
memberships - David Herrmann

* 30c6472 - (7 months ago) netlink: Use random autobind rover - Herbert Xu
* 021a670 - (7 months ago) netlink: Create kernel netlink sockets in the 
proper network namespace - Eric W. Biederman
* e1b01b4 - (7 months ago) net: Pass kern from net_proto_family.create 
to sk_alloc - Eric W. Biederman
* dd4b3c9 - (7 months ago) netlink: rename private flags and states - 
Nicolas Dichtel
* 0356126 - (2 days ago) Revert "netlink: don't hold mutex in rcu 
callback when releasing mmapd ring" - Stefan Priebe
* 231d0da - (2 days ago) Revert "netlink: make sure -EBUSY won't escape 
from netlink_insert" - Stefan Priebe
* e0f56af1 - (2 days ago) Revert "netlink, mmap: transform mmap skb into 
full skb on taps" - Stefan Priebe
* 23a0326 - (2 days ago) Revert "netlink: Fix autobind race condition 
that leads to zero port ID" - Stefan Priebe
* 97f4677 - (2 days ago) Revert "netlink: Replace rhash_portid with 
bound" - Stefan Priebe
* 40c851fe - (2 days ago) Revert "netlink: Trim skb to alloc size to 
avoid MSG_TRUNC" - Stefan Priebe
* 1f2ce4a - (4 weeks ago) Linux 4.1.13 - Greg Kroah-Hartman (v4.1.13, 
origin/linux-4.1.y)


So the netlink code is in line with 4.3.

Stefan

Am 02.12.2015 um 12:40 schrieb Hannes Frederic Sowa:

Hello Stefan,

Stefan Priebe - Profihost AG  writes:



here are the results.

It works with 4.1.
It works with 4.2.
It does not work with 4.1.13.

git bisect tells me it stopped working after those two commits were applied:

commit d48623677191e0f035d7afd344f92cf880b01f8e
Author: Herbert Xu 
Date:   Tue Sep 22 11:38:56 2015 +0800

 netlink: Replace rhash_portid with bound

commit 4e27762417669cb459971635be550eb7b5598286
Author: Herbert Xu 
Date:   Fri Sep 18 19:16:50 2015 +0800

 netlink: Fix autobind race condition that leads to zero port ID


Cool, thanks a lot. Does this patch make a difference?

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 59651af..278e94c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)

/* We need to ensure that the socket is hashed and visible. */
smp_wmb();
-   nlk_sk(sk)->bound = portid;
+   nlk_sk(sk)->bound = true;

  err:
release_sock(sk);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-03 Thread Stefan Priebe - Profihost AG
> Am 02.12.2015 um 12:40 schrieb Hannes Frederic Sowa 
> :
> 
> Hello Stefan,
> 
> Stefan Priebe - Profihost AG  writes:
> 
> 
>> here are the results.
>> 
>> It works with 4.1.
>> It works with 4.2.
>> It does not work with 4.1.13.
>> 
>> git bisect tells me it stopped working after those two commits were applied:
>> 
>> commit d48623677191e0f035d7afd344f92cf880b01f8e
>> Author: Herbert Xu 
>> Date:   Tue Sep 22 11:38:56 2015 +0800
>> 
>>netlink: Replace rhash_portid with bound
>> 
>> commit 4e27762417669cb459971635be550eb7b5598286
>> Author: Herbert Xu 
>> Date:   Fri Sep 18 19:16:50 2015 +0800
>> 
>>netlink: Fix autobind race condition that leads to zero port ID
> 
> Cool, thanks a lot. Does this patch make a difference?
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
> 
>/* We need to ensure that the socket is hashed and visible. */
>smp_wmb();
> -nlk_sk(sk)->bound = portid;
> +nlk_sk(sk)->bound = true;
> 
> err:
>release_sock(sk);

Will try that soon but already started another test. I reverted all netlink 
patches from 4.1.13 back to 4.1 and applied all netlink patches from 4.1 up to 
4.3.

Greets,
Stefan--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-03 Thread Stefan Priebe - Profihost AG
> Am 02.12.2015 um 12:40 schrieb Hannes Frederic Sowa 
> :
> 
> Hello Stefan,
> 
> Stefan Priebe - Profihost AG  writes:
> 
> 
>> here are the results.
>> 
>> It works with 4.1.
>> It works with 4.2.
>> It does not work with 4.1.13.
>> 
>> git bisect tells me it stopped working after those two commits were applied:
>> 
>> commit d48623677191e0f035d7afd344f92cf880b01f8e
>> Author: Herbert Xu 
>> Date:   Tue Sep 22 11:38:56 2015 +0800
>> 
>>netlink: Replace rhash_portid with bound
>> 
>> commit 4e27762417669cb459971635be550eb7b5598286
>> Author: Herbert Xu 
>> Date:   Fri Sep 18 19:16:50 2015 +0800
>> 
>>netlink: Fix autobind race condition that leads to zero port ID
> 
> Cool, thanks a lot. Does this patch make a difference?
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
> 
>/* We need to ensure that the socket is hashed and visible. */
>smp_wmb();
> -nlk_sk(sk)->bound = portid;
> +nlk_sk(sk)->bound = true;
> 
> err:
>release_sock(sk);

Will try that soon but already started another test. I reverted all netlink 
patches from 4.1.13 back to 4.1 and applied all netlink patches from 4.1 up to 
4.3.

Greets,
Stefan--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-02 Thread Hannes Frederic Sowa
Hello,

On Wed, Dec 2, 2015, at 18:15, Philipp Hahn wrote:
> Hi,
> 
> Am 02.12.2015 um 10:45 schrieb Stefan Priebe - Profihost AG:
> > here are the results.
> > 
> > It works with 4.1.
> > It works with 4.2.
> > It does not work with 4.1.13.
> 
> the patches were first commitet in v4.3-rc3 and appear as backports only
> since v4.2.3 and v4.1.10
> 
> > git bisect tells me it stopped working after those two commits were applied:
> > 
> > commit d48623677191e0f035d7afd344f92cf880b01f8e
> > Author: Herbert Xu 
> > Date:   Tue Sep 22 11:38:56 2015 +0800
> > 
> > netlink: Replace rhash_portid with bound
> > 
> > commit 4e27762417669cb459971635be550eb7b5598286
> > Author: Herbert Xu 
> > Date:   Fri Sep 18 19:16:50 2015 +0800
> > 
> > netlink: Fix autobind race condition that leads to zero port ID
> 
> I identified the same two patches for our prpblem; see mail from
> 2015-11-17 16:53 with subject "Strange PF_NETLINK NETLINK_ROUTE stall:
> netlink: Fix autobind race condition that leads to zero port ID"
> 
> 

Thanks, I missed this mail. I have a further look. I couldn't reproduce
it myself but will check if your test program does it.

Thanks a lot,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-02 Thread Philipp Hahn
Hi,

Am 02.12.2015 um 10:45 schrieb Stefan Priebe - Profihost AG:
> here are the results.
> 
> It works with 4.1.
> It works with 4.2.
> It does not work with 4.1.13.

the patches were first commitet in v4.3-rc3 and appear as backports only
since v4.2.3 and v4.1.10

> git bisect tells me it stopped working after those two commits were applied:
> 
> commit d48623677191e0f035d7afd344f92cf880b01f8e
> Author: Herbert Xu 
> Date:   Tue Sep 22 11:38:56 2015 +0800
> 
> netlink: Replace rhash_portid with bound
> 
> commit 4e27762417669cb459971635be550eb7b5598286
> Author: Herbert Xu 
> Date:   Fri Sep 18 19:16:50 2015 +0800
> 
> netlink: Fix autobind race condition that leads to zero port ID

I identified the same two patches for our prpblem; see mail from
2015-11-17 16:53 with subject "Strange PF_NETLINK NETLINK_ROUTE stall:
netlink: Fix autobind race condition that leads to zero port ID"



Philipp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-02 Thread Philipp Hahn
Hi,

Am 02.12.2015 um 12:40 schrieb Hannes Frederic Sowa:
> Cool, thanks a lot. Does this patch make a difference?
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>  
>   /* We need to ensure that the socket is hashed and visible. */
>   smp_wmb();
> - nlk_sk(sk)->bound = portid;
> + nlk_sk(sk)->bound = true;
>  
>  err:
>   release_sock(sk);

Will test that tomorrow.

What confuss me is that it works with 4.3. I compared
net/netlink/af_netlink.[ch] for differences between v4.1.10 and v4.3 and
didn't find anything which would explain only 4,1 getting stuck.

Philipp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-02 Thread Hannes Frederic Sowa
Hello Stefan,

Stefan Priebe - Profihost AG  writes:


> here are the results.
>
> It works with 4.1.
> It works with 4.2.
> It does not work with 4.1.13.
>
> git bisect tells me it stopped working after those two commits were applied:
>
> commit d48623677191e0f035d7afd344f92cf880b01f8e
> Author: Herbert Xu 
> Date:   Tue Sep 22 11:38:56 2015 +0800
>
> netlink: Replace rhash_portid with bound
>
> commit 4e27762417669cb459971635be550eb7b5598286
> Author: Herbert Xu 
> Date:   Fri Sep 18 19:16:50 2015 +0800
>
> netlink: Fix autobind race condition that leads to zero port ID

Cool, thanks a lot. Does this patch make a difference?

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 59651af..278e94c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 
/* We need to ensure that the socket is hashed and visible. */
smp_wmb();
-   nlk_sk(sk)->bound = portid;
+   nlk_sk(sk)->bound = true;
 
 err:
release_sock(sk);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-02 Thread Stefan Priebe - Profihost AG
Hi,

here are the results.

It works with 4.1.
It works with 4.2.
It does not work with 4.1.13.

git bisect tells me it stopped working after those two commits were applied:

commit d48623677191e0f035d7afd344f92cf880b01f8e
Author: Herbert Xu 
Date:   Tue Sep 22 11:38:56 2015 +0800

netlink: Replace rhash_portid with bound

commit 4e27762417669cb459971635be550eb7b5598286
Author: Herbert Xu 
Date:   Fri Sep 18 19:16:50 2015 +0800

netlink: Fix autobind race condition that leads to zero port ID

Stefan
Am 23.11.2015 um 13:57 schrieb Hannes Frederic Sowa:
> On Mon, Nov 23, 2015, at 13:44, Stefan Priebe - Profihost AG wrote:
>> Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
>>>
>>> Am 19.11.2015 um 14:19 schrieb Florian Weimer:
 On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:

> I can try Kernel 4.4-rc1 next week. Or something else?

 I found this bug report which indicates that 4.1.10 works:



 But in your original report, you said that 4.1.13 is broken.
>>>
>>> That's correct i'm running 4.1.13.
>>>
 This backtrace:



 shows a lot of waiting on quite different netlink sockets.  So if this
 is due to a race in Asterisk, it must have happened several times in a
 row.
>>
>> Kernel 4.4-rc2 works fine. How can we grab / get an idea which is
>> causing the isse in 4.1? It's an LTE kernel so it should be fixed!
> 
> Thanks for testing. I was not able to reproduce it at all, with as much
> parallelism and threads as possible on any kernel. Could you try to do a
> git bisect?
> 
> Thanks,
> Hannes
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-02 Thread Hannes Frederic Sowa
Hello,

On Wed, Dec 2, 2015, at 18:15, Philipp Hahn wrote:
> Hi,
> 
> Am 02.12.2015 um 10:45 schrieb Stefan Priebe - Profihost AG:
> > here are the results.
> > 
> > It works with 4.1.
> > It works with 4.2.
> > It does not work with 4.1.13.
> 
> the patches were first commitet in v4.3-rc3 and appear as backports only
> since v4.2.3 and v4.1.10
> 
> > git bisect tells me it stopped working after those two commits were applied:
> > 
> > commit d48623677191e0f035d7afd344f92cf880b01f8e
> > Author: Herbert Xu 
> > Date:   Tue Sep 22 11:38:56 2015 +0800
> > 
> > netlink: Replace rhash_portid with bound
> > 
> > commit 4e27762417669cb459971635be550eb7b5598286
> > Author: Herbert Xu 
> > Date:   Fri Sep 18 19:16:50 2015 +0800
> > 
> > netlink: Fix autobind race condition that leads to zero port ID
> 
> I identified the same two patches for our prpblem; see mail from
> 2015-11-17 16:53 with subject "Strange PF_NETLINK NETLINK_ROUTE stall:
> netlink: Fix autobind race condition that leads to zero port ID"
> 
> 

Thanks, I missed this mail. I have a further look. I couldn't reproduce
it myself but will check if your test program does it.

Thanks a lot,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-02 Thread Philipp Hahn
Hi,

Am 02.12.2015 um 10:45 schrieb Stefan Priebe - Profihost AG:
> here are the results.
> 
> It works with 4.1.
> It works with 4.2.
> It does not work with 4.1.13.

the patches were first commitet in v4.3-rc3 and appear as backports only
since v4.2.3 and v4.1.10

> git bisect tells me it stopped working after those two commits were applied:
> 
> commit d48623677191e0f035d7afd344f92cf880b01f8e
> Author: Herbert Xu 
> Date:   Tue Sep 22 11:38:56 2015 +0800
> 
> netlink: Replace rhash_portid with bound
> 
> commit 4e27762417669cb459971635be550eb7b5598286
> Author: Herbert Xu 
> Date:   Fri Sep 18 19:16:50 2015 +0800
> 
> netlink: Fix autobind race condition that leads to zero port ID

I identified the same two patches for our prpblem; see mail from
2015-11-17 16:53 with subject "Strange PF_NETLINK NETLINK_ROUTE stall:
netlink: Fix autobind race condition that leads to zero port ID"



Philipp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-02 Thread Philipp Hahn
Hi,

Am 02.12.2015 um 12:40 schrieb Hannes Frederic Sowa:
> Cool, thanks a lot. Does this patch make a difference?
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>  
>   /* We need to ensure that the socket is hashed and visible. */
>   smp_wmb();
> - nlk_sk(sk)->bound = portid;
> + nlk_sk(sk)->bound = true;
>  
>  err:
>   release_sock(sk);

Will test that tomorrow.

What confuss me is that it works with 4.3. I compared
net/netlink/af_netlink.[ch] for differences between v4.1.10 and v4.3 and
didn't find anything which would explain only 4,1 getting stuck.

Philipp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-02 Thread Stefan Priebe - Profihost AG
Hi,

here are the results.

It works with 4.1.
It works with 4.2.
It does not work with 4.1.13.

git bisect tells me it stopped working after those two commits were applied:

commit d48623677191e0f035d7afd344f92cf880b01f8e
Author: Herbert Xu 
Date:   Tue Sep 22 11:38:56 2015 +0800

netlink: Replace rhash_portid with bound

commit 4e27762417669cb459971635be550eb7b5598286
Author: Herbert Xu 
Date:   Fri Sep 18 19:16:50 2015 +0800

netlink: Fix autobind race condition that leads to zero port ID

Stefan
Am 23.11.2015 um 13:57 schrieb Hannes Frederic Sowa:
> On Mon, Nov 23, 2015, at 13:44, Stefan Priebe - Profihost AG wrote:
>> Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
>>>
>>> Am 19.11.2015 um 14:19 schrieb Florian Weimer:
 On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:

> I can try Kernel 4.4-rc1 next week. Or something else?

 I found this bug report which indicates that 4.1.10 works:



 But in your original report, you said that 4.1.13 is broken.
>>>
>>> That's correct i'm running 4.1.13.
>>>
 This backtrace:



 shows a lot of waiting on quite different netlink sockets.  So if this
 is due to a race in Asterisk, it must have happened several times in a
 row.
>>
>> Kernel 4.4-rc2 works fine. How can we grab / get an idea which is
>> causing the isse in 4.1? It's an LTE kernel so it should be fixed!
> 
> Thanks for testing. I was not able to reproduce it at all, with as much
> parallelism and threads as possible on any kernel. Could you try to do a
> git bisect?
> 
> Thanks,
> Hannes
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-12-02 Thread Hannes Frederic Sowa
Hello Stefan,

Stefan Priebe - Profihost AG  writes:


> here are the results.
>
> It works with 4.1.
> It works with 4.2.
> It does not work with 4.1.13.
>
> git bisect tells me it stopped working after those two commits were applied:
>
> commit d48623677191e0f035d7afd344f92cf880b01f8e
> Author: Herbert Xu 
> Date:   Tue Sep 22 11:38:56 2015 +0800
>
> netlink: Replace rhash_portid with bound
>
> commit 4e27762417669cb459971635be550eb7b5598286
> Author: Herbert Xu 
> Date:   Fri Sep 18 19:16:50 2015 +0800
>
> netlink: Fix autobind race condition that leads to zero port ID

Cool, thanks a lot. Does this patch make a difference?

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 59651af..278e94c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 
/* We need to ensure that the socket is hashed and visible. */
smp_wmb();
-   nlk_sk(sk)->bound = portid;
+   nlk_sk(sk)->bound = true;
 
 err:
release_sock(sk);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-24 Thread Stefan Priebe - Profihost AG

Am 23.11.2015 um 13:57 schrieb Hannes Frederic Sowa:
> On Mon, Nov 23, 2015, at 13:44, Stefan Priebe - Profihost AG wrote:
>> Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
>>>
>>> Am 19.11.2015 um 14:19 schrieb Florian Weimer:
 On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:

> I can try Kernel 4.4-rc1 next week. Or something else?

 I found this bug report which indicates that 4.1.10 works:



 But in your original report, you said that 4.1.13 is broken.
>>>
>>> That's correct i'm running 4.1.13.
>>>
 This backtrace:



 shows a lot of waiting on quite different netlink sockets.  So if this
 is due to a race in Asterisk, it must have happened several times in a
 row.
>>
>> Kernel 4.4-rc2 works fine. How can we grab / get an idea which is
>> causing the isse in 4.1? It's an LTE kernel so it should be fixed!
> 
> Thanks for testing. I was not able to reproduce it at all, with as much
> parallelism and threads as possible on any kernel. Could you try to do a
> git bisect?

OK i'll start digging around with major versions first to have a working
and a non working version.

I'll try 4.3 and 4.2 first to get an idea where it got fixed. Sadly i
can't bisect from 4.1.13 to 4.X so i have to verify first it also
happens with 4.1.

OK?

Stefan

> 
> Thanks,
> Hannes
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-24 Thread Stefan Priebe - Profihost AG

Am 23.11.2015 um 13:57 schrieb Hannes Frederic Sowa:
> On Mon, Nov 23, 2015, at 13:44, Stefan Priebe - Profihost AG wrote:
>> Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
>>>
>>> Am 19.11.2015 um 14:19 schrieb Florian Weimer:
 On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:

> I can try Kernel 4.4-rc1 next week. Or something else?

 I found this bug report which indicates that 4.1.10 works:



 But in your original report, you said that 4.1.13 is broken.
>>>
>>> That's correct i'm running 4.1.13.
>>>
 This backtrace:



 shows a lot of waiting on quite different netlink sockets.  So if this
 is due to a race in Asterisk, it must have happened several times in a
 row.
>>
>> Kernel 4.4-rc2 works fine. How can we grab / get an idea which is
>> causing the isse in 4.1? It's an LTE kernel so it should be fixed!
> 
> Thanks for testing. I was not able to reproduce it at all, with as much
> parallelism and threads as possible on any kernel. Could you try to do a
> git bisect?

OK i'll start digging around with major versions first to have a working
and a non working version.

I'll try 4.3 and 4.2 first to get an idea where it got fixed. Sadly i
can't bisect from 4.1.13 to 4.X so i have to verify first it also
happens with 4.1.

OK?

Stefan

> 
> Thanks,
> Hannes
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-23 Thread Hannes Frederic Sowa
On Mon, Nov 23, 2015, at 13:44, Stefan Priebe - Profihost AG wrote:
> Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
> > 
> > Am 19.11.2015 um 14:19 schrieb Florian Weimer:
> >> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
> >>
> >>> I can try Kernel 4.4-rc1 next week. Or something else?
> >>
> >> I found this bug report which indicates that 4.1.10 works:
> >>
> >>
> >>
> >> But in your original report, you said that 4.1.13 is broken.
> > 
> > That's correct i'm running 4.1.13.
> > 
> >> This backtrace:
> >>
> >>
> >>
> >> shows a lot of waiting on quite different netlink sockets.  So if this
> >> is due to a race in Asterisk, it must have happened several times in a
> >> row.
> 
> Kernel 4.4-rc2 works fine. How can we grab / get an idea which is
> causing the isse in 4.1? It's an LTE kernel so it should be fixed!

Thanks for testing. I was not able to reproduce it at all, with as much
parallelism and threads as possible on any kernel. Could you try to do a
git bisect?

Thanks,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-23 Thread Stefan Priebe - Profihost AG
Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
> 
> Am 19.11.2015 um 14:19 schrieb Florian Weimer:
>> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
>>
>>> I can try Kernel 4.4-rc1 next week. Or something else?
>>
>> I found this bug report which indicates that 4.1.10 works:
>>
>>
>>
>> But in your original report, you said that 4.1.13 is broken.
> 
> That's correct i'm running 4.1.13.
> 
>> This backtrace:
>>
>>
>>
>> shows a lot of waiting on quite different netlink sockets.  So if this
>> is due to a race in Asterisk, it must have happened several times in a
>> row.

Kernel 4.4-rc2 works fine. How can we grab / get an idea which is
causing the isse in 4.1? It's an LTE kernel so it should be fixed!

Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-23 Thread Stefan Priebe - Profihost AG
Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
> 
> Am 19.11.2015 um 14:19 schrieb Florian Weimer:
>> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
>>
>>> I can try Kernel 4.4-rc1 next week. Or something else?
>>
>> I found this bug report which indicates that 4.1.10 works:
>>
>>
>>
>> But in your original report, you said that 4.1.13 is broken.
> 
> That's correct i'm running 4.1.13.
> 
>> This backtrace:
>>
>>
>>
>> shows a lot of waiting on quite different netlink sockets.  So if this
>> is due to a race in Asterisk, it must have happened several times in a
>> row.

Kernel 4.4-rc2 works fine. How can we grab / get an idea which is
causing the isse in 4.1? It's an LTE kernel so it should be fixed!

Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-23 Thread Hannes Frederic Sowa
On Mon, Nov 23, 2015, at 13:44, Stefan Priebe - Profihost AG wrote:
> Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
> > 
> > Am 19.11.2015 um 14:19 schrieb Florian Weimer:
> >> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
> >>
> >>> I can try Kernel 4.4-rc1 next week. Or something else?
> >>
> >> I found this bug report which indicates that 4.1.10 works:
> >>
> >>
> >>
> >> But in your original report, you said that 4.1.13 is broken.
> > 
> > That's correct i'm running 4.1.13.
> > 
> >> This backtrace:
> >>
> >>
> >>
> >> shows a lot of waiting on quite different netlink sockets.  So if this
> >> is due to a race in Asterisk, it must have happened several times in a
> >> row.
> 
> Kernel 4.4-rc2 works fine. How can we grab / get an idea which is
> causing the isse in 4.1? It's an LTE kernel so it should be fixed!

Thanks for testing. I was not able to reproduce it at all, with as much
parallelism and threads as possible on any kernel. Could you try to do a
git bisect?

Thanks,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Stefan Priebe


Am 19.11.2015 um 14:19 schrieb Florian Weimer:

On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:


I can try Kernel 4.4-rc1 next week. Or something else?


I found this bug report which indicates that 4.1.10 works:

   

But in your original report, you said that 4.1.13 is broken.


That's correct i'm running 4.1.13.


This backtrace:

   

shows a lot of waiting on quite different netlink sockets.  So if this
is due to a race in Asterisk, it must have happened several times in a row.


Might be I've always only one waiting.

Stefan



Thanks,
Florian


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Florian Weimer
On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:

> I can try Kernel 4.4-rc1 next week. Or something else?

I found this bug report which indicates that 4.1.10 works:

  

But in your original report, you said that 4.1.13 is broken.

This backtrace:

  

shows a lot of waiting on quite different netlink sockets.  So if this
is due to a race in Asterisk, it must have happened several times in a row.

Thanks,
Florian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Stefan Priebe - Profihost AG

Am 19.11.2015 um 13:41 schrieb Hannes Frederic Sowa:
> On Thu, Nov 19, 2015, at 12:43, Stefan Priebe - Profihost AG wrote:
>>
>> Am 19.11.2015 um 12:41 schrieb Hannes Frederic Sowa:
>>> On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
 OK it had a livelock again. It just took more time.

 So here is the data:
>>>
>>> Thanks, I couldn't reproduce it so far with simple threaded resolver
>>> loop on your kernel. :/
>>>
>>> Your data is useless if you don't also provide the file descriptor which
>>> you are blocking on right now. ;)
>>>
>>> Thanks,
>>> Hannes
>>
>> ah sorry. So we need the gdb backtrace with the rcvmsg and then the fd
>> list + netlink list?
> 
> Yes, albeit the probability we find something new is minimal. We already
> know it blocks deep down in netlink code. I tried to reproduce it in
> multithreaded environment with no results so far. Can you even try a
> newer kernel?

I can try Kernel 4.4-rc1 next week. Or something else?

Stefan

> 
> Bye,
> Hannes
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Hannes Frederic Sowa
On Thu, Nov 19, 2015, at 12:43, Stefan Priebe - Profihost AG wrote:
> 
> Am 19.11.2015 um 12:41 schrieb Hannes Frederic Sowa:
> > On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
> >> OK it had a livelock again. It just took more time.
> >>
> >> So here is the data:
> > 
> > Thanks, I couldn't reproduce it so far with simple threaded resolver
> > loop on your kernel. :/
> > 
> > Your data is useless if you don't also provide the file descriptor which
> > you are blocking on right now. ;)
> > 
> > Thanks,
> > Hannes
> 
> ah sorry. So we need the gdb backtrace with the rcvmsg and then the fd
> list + netlink list?

Yes, albeit the probability we find something new is minimal. We already
know it blocks deep down in netlink code. I tried to reproduce it in
multithreaded environment with no results so far. Can you even try a
newer kernel?

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Stefan Priebe - Profihost AG

Am 19.11.2015 um 12:41 schrieb Hannes Frederic Sowa:
> On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
>> OK it had a livelock again. It just took more time.
>>
>> So here is the data:
> 
> Thanks, I couldn't reproduce it so far with simple threaded resolver
> loop on your kernel. :/
> 
> Your data is useless if you don't also provide the file descriptor which
> you are blocking on right now. ;)
> 
> Thanks,
> Hannes

ah sorry. So we need the gdb backtrace with the rcvmsg and then the fd
list + netlink list?

Stefan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Hannes Frederic Sowa
On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
> OK it had a livelock again. It just took more time.
> 
> So here is the data:

Thanks, I couldn't reproduce it so far with simple threaded resolver
loop on your kernel. :/

Your data is useless if you don't also provide the file descriptor which
you are blocking on right now. ;)

Thanks,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Stefan Priebe - Profihost AG
OK it had a livelock again. It just took more time.

So here is the data:

# la /proc/2598/fd
total 0
dr-x-- 2 root   root0 Nov 19 06:53 .
dr-xr-xr-x 7 callweaver callweaver  0 Nov 18 22:38 ..
lrwx-- 1 root   root   64 Nov 19 06:54 0 -> /dev/null
lrwx-- 1 root   root   64 Nov 19 06:54 1 -> /dev/null
lrwx-- 1 root   root   64 Nov 19 06:54 10 -> socket:[12066]
lr-x-- 1 root   root   64 Nov 19 06:54 11 -> anon_inode:inotify
lr-x-- 1 root   root   64 Nov 19 06:54 12 -> pipe:[12181]
l-wx-- 1 root   root   64 Nov 19 06:54 13 -> pipe:[12181]
lrwx-- 1 root   root   64 Nov 19 10:56 14 -> socket:[510853]
lrwx-- 1 root   root   64 Nov 19 10:56 15 -> socket:[510854]
lrwx-- 1 root   root   64 Nov 19 10:56 16 ->
anon_inode:[timerfd]
lr-x-- 1 root   root   64 Nov 19 10:56 17 -> pipe:[510856]
l-wx-- 1 root   root   64 Nov 19 10:56 18 -> pipe:[510856]
lrwx-- 1 root   root   64 Nov 19 10:56 19 -> socket:[208723]
lrwx-- 1 root   root   64 Nov 19 06:54 2 -> /dev/null
l-wx-- 1 root   root   64 Nov 19 10:56 20 ->
/var/log/asterisk/queue_log
lrwx-- 1 root   root   64 Nov 19 10:56 21 -> socket:[199595]
lrwx-- 1 root   root   64 Nov 19 10:56 22 -> socket:[510873]
lr-x-- 1 root   root   64 Nov 19 10:56 23 -> anon_inode:inotify
lrwx-- 1 root   root   64 Nov 19 10:56 24 -> socket:[525349]
lrwx-- 1 root   root   64 Nov 19 10:56 25 -> socket:[525350]
lrwx-- 1 root   root   64 Nov 19 10:56 26 -> socket:[510874]
lrwx-- 1 root   root   64 Nov 19 10:56 27 ->
anon_inode:[timerfd]
lr-x-- 1 root   root   64 Nov 19 10:56 28 -> pipe:[510876]
l-wx-- 1 root   root   64 Nov 19 10:56 29 -> pipe:[510876]
lr-x-- 1 root   root   64 Nov 19 06:54 3 -> /dev/urandom
lrwx-- 1 root   root   64 Nov 19 10:56 30 -> socket:[527569]
lrwx-- 1 root   root   64 Nov 19 10:56 31 -> socket:[527570]
lrwx-- 1 root   root   64 Nov 19 10:56 32 -> socket:[528123]
lrwx-- 1 root   root   64 Nov 19 10:56 33 -> socket:[528124]
lrwx-- 1 root   root   64 Nov 19 10:56 34 -> socket:[530711]
lrwx-- 1 root   root   64 Nov 19 10:56 35 -> socket:[530712]
lrwx-- 1 root   root   64 Nov 19 10:56 36 -> socket:[533366]
lrwx-- 1 root   root   64 Nov 19 10:56 37 -> socket:[533367]
lrwx-- 1 root   root   64 Nov 19 10:56 38 -> socket:[535390]
lrwx-- 1 root   root   64 Nov 19 10:56 39 -> socket:[531056]
lrwx-- 1 root   root   64 Nov 19 06:54 4 -> socket:[11726]
lrwx-- 1 root   root   64 Nov 19 10:56 40 -> socket:[531057]
lrwx-- 1 root   root   64 Nov 19 10:56 41 -> socket:[535391]
lrwx-- 1 root   root   64 Nov 19 10:56 42 -> socket:[537751]
lrwx-- 1 root   root   64 Nov 19 10:56 43 -> socket:[533468]
lrwx-- 1 root   root   64 Nov 19 10:56 44 -> socket:[531154]
lrwx-- 1 root   root   64 Nov 19 10:56 45 -> socket:[531155]
lrwx-- 1 root   root   64 Nov 19 10:56 46 -> socket:[533469]
lrwx-- 1 root   root   64 Nov 19 10:56 47 -> socket:[536172]
lrwx-- 1 root   root   64 Nov 19 10:56 48 -> socket:[536173]
lrwx-- 1 root   root   64 Nov 19 10:56 49 -> socket:[537877]
l-wx-- 1 root   root   64 Nov 19 06:54 5 ->
/var/log/asterisk/messages
lrwx-- 1 root   root   64 Nov 19 10:56 50 -> socket:[537752]
lrwx-- 1 root   root   64 Nov 19 10:56 51 -> socket:[539817]
lrwx-- 1 root   root   64 Nov 19 10:56 52 -> socket:[537878]
lrwx-- 1 root   root   64 Nov 19 10:56 53 -> socket:[539818]
lrwx-- 1 root   root   64 Nov 19 10:56 54 -> socket:[541781]
lrwx-- 1 root   root   64 Nov 19 10:56 55 -> socket:[541782]
lrwx-- 1 root   root   64 Nov 19 10:56 56 -> socket:[543462]
lrwx-- 1 root   root   64 Nov 19 10:56 57 -> socket:[545171]
lrwx-- 1 root   root   64 Nov 19 10:56 58 -> socket:[537432]
lrwx-- 1 root   root   64 Nov 19 10:56 59 -> socket:[537433]
l-wx-- 1 root   root   64 Nov 19 06:54 6 ->
/var/log/asterisk/debug.log
lrwx-- 1 root   root   64 Nov 19 10:56 60 -> socket:[545172]
lrwx-- 1 root   root   64 Nov 19 10:56 61 ->
anon_inode:[timerfd]
lrwx-- 1 root   root   64 Nov 19 10:56 62 -> socket:[541196]
lrwx-- 1 root   root   64 Nov 19 10:56 63 -> socket:[538319]
lrwx-- 1 root   root   64 Nov 19 10:56 64 -> socket:[538320]
lrwx-- 1 root   root   64 Nov 19 10:56 65 -> socket:[474586]
lrwx-- 1 root   root   64 Nov 19 10:56 66 -> socket:[541197]
lrwx-- 1 root   root   64 Nov 19 10:56 67 -> socket:[542437]
lrwx-- 1 root   root 

Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Stefan Priebe - Profihost AG

Am 19.11.2015 um 10:44 schrieb Florian Weimer:
> On 11/18/2015 10:36 PM, Stefan Priebe wrote:
> 
>>> please try to get a backtrace with debugging information.  It is likely
>>> that this is the make_request/__check_pf functionality in glibc, but it
>>> would be nice to get some certainty.
>>
>> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
>> have ipv6 except for link local.
> 
> glibc needs to know if the system has global unicast addresses if it
> receives  records.
> 
> It's curious that net.ipv6.conf.all.disable_ipv6=1 makes the problem go
> away.  Even with that setting, the kernel seems to send two Netlink
> responses.  So either this is enough to narrow the window for the race
> so that no longer triggers, or there is a genuine kernel issue with
> supplying the requested IPv6 Netlink response.

No idea it also goes away by downgrading to 3.18 again.

Stefan

>> Could it be this one?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
> 
> No, that's on the DNS/UDP side, not in the Netlink code.
> 
> Florian
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Florian Weimer
On 11/18/2015 10:36 PM, Stefan Priebe wrote:

>> please try to get a backtrace with debugging information.  It is likely
>> that this is the make_request/__check_pf functionality in glibc, but it
>> would be nice to get some certainty.
> 
> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
> have ipv6 except for link local.

glibc needs to know if the system has global unicast addresses if it
receives  records.

It's curious that net.ipv6.conf.all.disable_ipv6=1 makes the problem go
away.  Even with that setting, the kernel seems to send two Netlink
responses.  So either this is enough to narrow the window for the race
so that no longer triggers, or there is a genuine kernel issue with
supplying the requested IPv6 Netlink response.

> Could it be this one?
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79

No, that's on the DNS/UDP side, not in the Netlink code.

Florian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Florian Weimer
On 11/18/2015 10:23 PM, Stefan Priebe wrote:

>> please try to get a backtrace with debugging information.  It is likely
>> that this is the make_request/__check_pf functionality in glibc, but it
>> would be nice to get some certainty.
>>
>> Which glibc version do you use?  Has it got a fix for CVE-2013-7423?
> 
> It's Debians 2.13-38+deb7u8 Debians issue tracker says it is fixed:
> https://security-tracker.debian.org/tracker/CVE-2013-7423

I checked, and the patch is there and applied.

>> Theoretically, recvmsg could also hang if the Netlink query was dropped
>> by the kernel, or the final packet in the response was dropped.  We
>> never saw that happen, even under extreme load, but I didn't test with
>> recent kernels.
> 
> The load is very low in this system. Just 30 phones and only 1-6 calling.

This is rather odd.

I'm not sure if we can do anything simple on the glibc to help to debug
this.  If something in the process consumes the Netlink message from the
kernel (incorrectly reading from the internal glibc file descriptor),
this is fairly difficult detect inside glibc (it would require rather
large changes which do not currently exist at all).

Florian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Stefan Priebe - Profihost AG

Am 18.11.2015 um 22:22 schrieb Hannes Frederic Sowa:
> On Wed, Nov 18, 2015, at 22:20, Stefan Priebe wrote:
>> you mean just:
>> la /proc/$pid/fd
> 
> ls -l /proc/pid/fd/
> 
> the numbers in brackets in return from readlink are the inode numbers.
> 
>> and
>>
>> cat /proc/net/netlink
> 
> Exactly, last row is the inode number.

It has stopped locking since i've set:

net.ipv6.conf.all.disable_ipv6=1

Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Florian Weimer
On 11/18/2015 10:23 PM, Stefan Priebe wrote:

>> please try to get a backtrace with debugging information.  It is likely
>> that this is the make_request/__check_pf functionality in glibc, but it
>> would be nice to get some certainty.
>>
>> Which glibc version do you use?  Has it got a fix for CVE-2013-7423?
> 
> It's Debians 2.13-38+deb7u8 Debians issue tracker says it is fixed:
> https://security-tracker.debian.org/tracker/CVE-2013-7423

I checked, and the patch is there and applied.

>> Theoretically, recvmsg could also hang if the Netlink query was dropped
>> by the kernel, or the final packet in the response was dropped.  We
>> never saw that happen, even under extreme load, but I didn't test with
>> recent kernels.
> 
> The load is very low in this system. Just 30 phones and only 1-6 calling.

This is rather odd.

I'm not sure if we can do anything simple on the glibc to help to debug
this.  If something in the process consumes the Netlink message from the
kernel (incorrectly reading from the internal glibc file descriptor),
this is fairly difficult detect inside glibc (it would require rather
large changes which do not currently exist at all).

Florian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Stefan Priebe - Profihost AG

Am 19.11.2015 um 10:44 schrieb Florian Weimer:
> On 11/18/2015 10:36 PM, Stefan Priebe wrote:
> 
>>> please try to get a backtrace with debugging information.  It is likely
>>> that this is the make_request/__check_pf functionality in glibc, but it
>>> would be nice to get some certainty.
>>
>> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
>> have ipv6 except for link local.
> 
> glibc needs to know if the system has global unicast addresses if it
> receives  records.
> 
> It's curious that net.ipv6.conf.all.disable_ipv6=1 makes the problem go
> away.  Even with that setting, the kernel seems to send two Netlink
> responses.  So either this is enough to narrow the window for the race
> so that no longer triggers, or there is a genuine kernel issue with
> supplying the requested IPv6 Netlink response.

No idea it also goes away by downgrading to 3.18 again.

Stefan

>> Could it be this one?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
> 
> No, that's on the DNS/UDP side, not in the Netlink code.
> 
> Florian
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Stefan Priebe - Profihost AG

Am 18.11.2015 um 22:22 schrieb Hannes Frederic Sowa:
> On Wed, Nov 18, 2015, at 22:20, Stefan Priebe wrote:
>> you mean just:
>> la /proc/$pid/fd
> 
> ls -l /proc/pid/fd/
> 
> the numbers in brackets in return from readlink are the inode numbers.
> 
>> and
>>
>> cat /proc/net/netlink
> 
> Exactly, last row is the inode number.

It has stopped locking since i've set:

net.ipv6.conf.all.disable_ipv6=1

Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Florian Weimer
On 11/18/2015 10:36 PM, Stefan Priebe wrote:

>> please try to get a backtrace with debugging information.  It is likely
>> that this is the make_request/__check_pf functionality in glibc, but it
>> would be nice to get some certainty.
> 
> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
> have ipv6 except for link local.

glibc needs to know if the system has global unicast addresses if it
receives  records.

It's curious that net.ipv6.conf.all.disable_ipv6=1 makes the problem go
away.  Even with that setting, the kernel seems to send two Netlink
responses.  So either this is enough to narrow the window for the race
so that no longer triggers, or there is a genuine kernel issue with
supplying the requested IPv6 Netlink response.

> Could it be this one?
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79

No, that's on the DNS/UDP side, not in the Netlink code.

Florian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Stefan Priebe - Profihost AG
OK it had a livelock again. It just took more time.

So here is the data:

# la /proc/2598/fd
total 0
dr-x-- 2 root   root0 Nov 19 06:53 .
dr-xr-xr-x 7 callweaver callweaver  0 Nov 18 22:38 ..
lrwx-- 1 root   root   64 Nov 19 06:54 0 -> /dev/null
lrwx-- 1 root   root   64 Nov 19 06:54 1 -> /dev/null
lrwx-- 1 root   root   64 Nov 19 06:54 10 -> socket:[12066]
lr-x-- 1 root   root   64 Nov 19 06:54 11 -> anon_inode:inotify
lr-x-- 1 root   root   64 Nov 19 06:54 12 -> pipe:[12181]
l-wx-- 1 root   root   64 Nov 19 06:54 13 -> pipe:[12181]
lrwx-- 1 root   root   64 Nov 19 10:56 14 -> socket:[510853]
lrwx-- 1 root   root   64 Nov 19 10:56 15 -> socket:[510854]
lrwx-- 1 root   root   64 Nov 19 10:56 16 ->
anon_inode:[timerfd]
lr-x-- 1 root   root   64 Nov 19 10:56 17 -> pipe:[510856]
l-wx-- 1 root   root   64 Nov 19 10:56 18 -> pipe:[510856]
lrwx-- 1 root   root   64 Nov 19 10:56 19 -> socket:[208723]
lrwx-- 1 root   root   64 Nov 19 06:54 2 -> /dev/null
l-wx-- 1 root   root   64 Nov 19 10:56 20 ->
/var/log/asterisk/queue_log
lrwx-- 1 root   root   64 Nov 19 10:56 21 -> socket:[199595]
lrwx-- 1 root   root   64 Nov 19 10:56 22 -> socket:[510873]
lr-x-- 1 root   root   64 Nov 19 10:56 23 -> anon_inode:inotify
lrwx-- 1 root   root   64 Nov 19 10:56 24 -> socket:[525349]
lrwx-- 1 root   root   64 Nov 19 10:56 25 -> socket:[525350]
lrwx-- 1 root   root   64 Nov 19 10:56 26 -> socket:[510874]
lrwx-- 1 root   root   64 Nov 19 10:56 27 ->
anon_inode:[timerfd]
lr-x-- 1 root   root   64 Nov 19 10:56 28 -> pipe:[510876]
l-wx-- 1 root   root   64 Nov 19 10:56 29 -> pipe:[510876]
lr-x-- 1 root   root   64 Nov 19 06:54 3 -> /dev/urandom
lrwx-- 1 root   root   64 Nov 19 10:56 30 -> socket:[527569]
lrwx-- 1 root   root   64 Nov 19 10:56 31 -> socket:[527570]
lrwx-- 1 root   root   64 Nov 19 10:56 32 -> socket:[528123]
lrwx-- 1 root   root   64 Nov 19 10:56 33 -> socket:[528124]
lrwx-- 1 root   root   64 Nov 19 10:56 34 -> socket:[530711]
lrwx-- 1 root   root   64 Nov 19 10:56 35 -> socket:[530712]
lrwx-- 1 root   root   64 Nov 19 10:56 36 -> socket:[533366]
lrwx-- 1 root   root   64 Nov 19 10:56 37 -> socket:[533367]
lrwx-- 1 root   root   64 Nov 19 10:56 38 -> socket:[535390]
lrwx-- 1 root   root   64 Nov 19 10:56 39 -> socket:[531056]
lrwx-- 1 root   root   64 Nov 19 06:54 4 -> socket:[11726]
lrwx-- 1 root   root   64 Nov 19 10:56 40 -> socket:[531057]
lrwx-- 1 root   root   64 Nov 19 10:56 41 -> socket:[535391]
lrwx-- 1 root   root   64 Nov 19 10:56 42 -> socket:[537751]
lrwx-- 1 root   root   64 Nov 19 10:56 43 -> socket:[533468]
lrwx-- 1 root   root   64 Nov 19 10:56 44 -> socket:[531154]
lrwx-- 1 root   root   64 Nov 19 10:56 45 -> socket:[531155]
lrwx-- 1 root   root   64 Nov 19 10:56 46 -> socket:[533469]
lrwx-- 1 root   root   64 Nov 19 10:56 47 -> socket:[536172]
lrwx-- 1 root   root   64 Nov 19 10:56 48 -> socket:[536173]
lrwx-- 1 root   root   64 Nov 19 10:56 49 -> socket:[537877]
l-wx-- 1 root   root   64 Nov 19 06:54 5 ->
/var/log/asterisk/messages
lrwx-- 1 root   root   64 Nov 19 10:56 50 -> socket:[537752]
lrwx-- 1 root   root   64 Nov 19 10:56 51 -> socket:[539817]
lrwx-- 1 root   root   64 Nov 19 10:56 52 -> socket:[537878]
lrwx-- 1 root   root   64 Nov 19 10:56 53 -> socket:[539818]
lrwx-- 1 root   root   64 Nov 19 10:56 54 -> socket:[541781]
lrwx-- 1 root   root   64 Nov 19 10:56 55 -> socket:[541782]
lrwx-- 1 root   root   64 Nov 19 10:56 56 -> socket:[543462]
lrwx-- 1 root   root   64 Nov 19 10:56 57 -> socket:[545171]
lrwx-- 1 root   root   64 Nov 19 10:56 58 -> socket:[537432]
lrwx-- 1 root   root   64 Nov 19 10:56 59 -> socket:[537433]
l-wx-- 1 root   root   64 Nov 19 06:54 6 ->
/var/log/asterisk/debug.log
lrwx-- 1 root   root   64 Nov 19 10:56 60 -> socket:[545172]
lrwx-- 1 root   root   64 Nov 19 10:56 61 ->
anon_inode:[timerfd]
lrwx-- 1 root   root   64 Nov 19 10:56 62 -> socket:[541196]
lrwx-- 1 root   root   64 Nov 19 10:56 63 -> socket:[538319]
lrwx-- 1 root   root   64 Nov 19 10:56 64 -> socket:[538320]
lrwx-- 1 root   root   64 Nov 19 10:56 65 -> socket:[474586]
lrwx-- 1 root   root   64 Nov 19 10:56 66 -> socket:[541197]
lrwx-- 1 root   root   64 Nov 19 10:56 67 -> socket:[542437]
lrwx-- 1 root   root 

Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Stefan Priebe


Am 19.11.2015 um 14:19 schrieb Florian Weimer:

On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:


I can try Kernel 4.4-rc1 next week. Or something else?


I found this bug report which indicates that 4.1.10 works:

   

But in your original report, you said that 4.1.13 is broken.


That's correct i'm running 4.1.13.


This backtrace:

   

shows a lot of waiting on quite different netlink sockets.  So if this
is due to a race in Asterisk, it must have happened several times in a row.


Might be I've always only one waiting.

Stefan



Thanks,
Florian


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Hannes Frederic Sowa
On Thu, Nov 19, 2015, at 12:43, Stefan Priebe - Profihost AG wrote:
> 
> Am 19.11.2015 um 12:41 schrieb Hannes Frederic Sowa:
> > On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
> >> OK it had a livelock again. It just took more time.
> >>
> >> So here is the data:
> > 
> > Thanks, I couldn't reproduce it so far with simple threaded resolver
> > loop on your kernel. :/
> > 
> > Your data is useless if you don't also provide the file descriptor which
> > you are blocking on right now. ;)
> > 
> > Thanks,
> > Hannes
> 
> ah sorry. So we need the gdb backtrace with the rcvmsg and then the fd
> list + netlink list?

Yes, albeit the probability we find something new is minimal. We already
know it blocks deep down in netlink code. I tried to reproduce it in
multithreaded environment with no results so far. Can you even try a
newer kernel?

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Stefan Priebe - Profihost AG

Am 19.11.2015 um 13:41 schrieb Hannes Frederic Sowa:
> On Thu, Nov 19, 2015, at 12:43, Stefan Priebe - Profihost AG wrote:
>>
>> Am 19.11.2015 um 12:41 schrieb Hannes Frederic Sowa:
>>> On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
 OK it had a livelock again. It just took more time.

 So here is the data:
>>>
>>> Thanks, I couldn't reproduce it so far with simple threaded resolver
>>> loop on your kernel. :/
>>>
>>> Your data is useless if you don't also provide the file descriptor which
>>> you are blocking on right now. ;)
>>>
>>> Thanks,
>>> Hannes
>>
>> ah sorry. So we need the gdb backtrace with the rcvmsg and then the fd
>> list + netlink list?
> 
> Yes, albeit the probability we find something new is minimal. We already
> know it blocks deep down in netlink code. I tried to reproduce it in
> multithreaded environment with no results so far. Can you even try a
> newer kernel?

I can try Kernel 4.4-rc1 next week. Or something else?

Stefan

> 
> Bye,
> Hannes
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Stefan Priebe - Profihost AG

Am 19.11.2015 um 12:41 schrieb Hannes Frederic Sowa:
> On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
>> OK it had a livelock again. It just took more time.
>>
>> So here is the data:
> 
> Thanks, I couldn't reproduce it so far with simple threaded resolver
> loop on your kernel. :/
> 
> Your data is useless if you don't also provide the file descriptor which
> you are blocking on right now. ;)
> 
> Thanks,
> Hannes

ah sorry. So we need the gdb backtrace with the rcvmsg and then the fd
list + netlink list?

Stefan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Hannes Frederic Sowa
On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
> OK it had a livelock again. It just took more time.
> 
> So here is the data:

Thanks, I couldn't reproduce it so far with simple threaded resolver
loop on your kernel. :/

Your data is useless if you don't also provide the file descriptor which
you are blocking on right now. ;)

Thanks,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-19 Thread Florian Weimer
On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:

> I can try Kernel 4.4-rc1 next week. Or something else?

I found this bug report which indicates that 4.1.10 works:

  

But in your original report, you said that 4.1.13 is broken.

This backtrace:

  

shows a lot of waiting on quite different netlink sockets.  So if this
is due to a race in Asterisk, it must have happened several times in a row.

Thanks,
Florian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Hannes Frederic Sowa
On Wed, Nov 18, 2015, at 22:42, Stefan Priebe wrote:
> 
> Am 18.11.2015 um 22:40 schrieb Hannes Frederic Sowa:
> > On Wed, Nov 18, 2015, at 22:36, Stefan Priebe wrote:
> >> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
> >> have ipv6 except for link local. Could it be this one?
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
> >>
> >> Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
> >> #0  0x7f295de3287d in recvmsg () at
> >> ../sysdeps/unix/syscall-template.S:82
> >> #1  0x7f295de52fcc in make_request (fd=35, pid=26631,
> >> seen_ipv4=, seen_ipv6=,
> >>   in6ai=, in6ailen=) at
> >> ../sysdeps/unix/sysv/linux/check_pf.c:119
> >> #2  0x7f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f,
> >> seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
> >>   in6ailen=0x7f295c00e838) at
> >>   ../sysdeps/unix/sysv/linux/check_pf.c:271
> >> #3  0x7f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0
> >> "10.12.12.55", service=0x7f295c00e8bc "2135",
> >>   hints=0x7f295c00e910, pai=0x7f295c00e908) at
> >> ../sysdeps/posix/getaddrinfo.c:2389
> >
> > Can you also get a /proc/pid/stack stacktrace?
> 
> Sure:
> 
> For the thread:
> # cat /proc/26631/task/26654/stack
> [] __skb_recv_datagram+0x52b/0x5d0
> [] skb_recv_datagram+0x32/0x40
> [] netlink_recvmsg+0x4c/0x300
> [] sock_recvmsg+0x13/0x20
> [] ___sys_recvmsg+0xee/0x230
> [] __sys_recvmsg+0x49/0x90
> [] SyS_recvmsg+0x12/0x20
> [] system_call_fastpath+0x12/0x71
> [] 0x

Ok, we are definitely in netlink code and thus have a correct(? at least
a netlink) netlink fd which is blocking and waiting for an answer. I
research the glibc code a little bit but I think that the occurrences of
ipv6 parameters are just generic glibc code.

Let me try to reproduce that.

Thanks,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Stefan Priebe


Am 18.11.2015 um 22:40 schrieb Hannes Frederic Sowa:

On Wed, Nov 18, 2015, at 22:36, Stefan Priebe wrote:

sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
have ipv6 except for link local. Could it be this one?

https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79

Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
#0  0x7f295de3287d in recvmsg () at
../sysdeps/unix/syscall-template.S:82
#1  0x7f295de52fcc in make_request (fd=35, pid=26631,
seen_ipv4=, seen_ipv6=,
  in6ai=, in6ailen=) at
../sysdeps/unix/sysv/linux/check_pf.c:119
#2  0x7f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f,
seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
  in6ailen=0x7f295c00e838) at
  ../sysdeps/unix/sysv/linux/check_pf.c:271
#3  0x7f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0
"10.12.12.55", service=0x7f295c00e8bc "2135",
  hints=0x7f295c00e910, pai=0x7f295c00e908) at
../sysdeps/posix/getaddrinfo.c:2389


Can you also get a /proc/pid/stack stacktrace?


Sure:

For the thread:
# cat /proc/26631/task/26654/stack
[] __skb_recv_datagram+0x52b/0x5d0
[] skb_recv_datagram+0x32/0x40
[] netlink_recvmsg+0x4c/0x300
[] sock_recvmsg+0x13/0x20
[] ___sys_recvmsg+0xee/0x230
[] __sys_recvmsg+0x49/0x90
[] SyS_recvmsg+0x12/0x20
[] system_call_fastpath+0x12/0x71
[] 0x

Stefan



Thanks,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Hannes Frederic Sowa
On Wed, Nov 18, 2015, at 22:36, Stefan Priebe wrote:
> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't 
> have ipv6 except for link local. Could it be this one?
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
> 
> Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
> #0  0x7f295de3287d in recvmsg () at 
> ../sysdeps/unix/syscall-template.S:82
> #1  0x7f295de52fcc in make_request (fd=35, pid=26631, 
> seen_ipv4=, seen_ipv6=,
>  in6ai=, in6ailen=) at 
> ../sysdeps/unix/sysv/linux/check_pf.c:119
> #2  0x7f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f, 
> seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
>  in6ailen=0x7f295c00e838) at
>  ../sysdeps/unix/sysv/linux/check_pf.c:271
> #3  0x7f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0 
> "10.12.12.55", service=0x7f295c00e8bc "2135",
>  hints=0x7f295c00e910, pai=0x7f295c00e908) at 
> ../sysdeps/posix/getaddrinfo.c:2389

Can you also get a /proc/pid/stack stacktrace?

Thanks,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Stefan Priebe

Am 18.11.2015 um 22:18 schrieb Florian Weimer:

On 11/18/2015 09:23 PM, Stefan Priebe wrote:


Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:

On Tue, 17 Nov 2015, Stefan Priebe wrote:

I've now also two gdb backtraces from two crashes:
http://pastebin.com/raw.php?i=yih5jNt8

http://pastebin.com/raw.php?i=kGEcvH4T


They don't tell me anything as I have no idea of the inner workings of
asterisk. You might be better of to talk to the asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well defined area.


The asterisk guys told me it's a livelock asterisk is waiting for
getaddrinfo / recvmsg.

Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
#0  0x7fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7fbeb9e92007 in getaddrinfo () from
/lib/x86_64-linux-gnu/libc.so.6


Stefan,

please try to get a backtrace with debugging information.  It is likely
that this is the make_request/__check_pf functionality in glibc, but it
would be nice to get some certainty.


sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't 
have ipv6 except for link local. Could it be this one?


https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79

Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
#0  0x7f295de3287d in recvmsg () at 
../sysdeps/unix/syscall-template.S:82
#1  0x7f295de52fcc in make_request (fd=35, pid=26631, 
seen_ipv4=, seen_ipv6=,
in6ai=, in6ailen=) at 
../sysdeps/unix/sysv/linux/check_pf.c:119
#2  0x7f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f, 
seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,

in6ailen=0x7f295c00e838) at ../sysdeps/unix/sysv/linux/check_pf.c:271
#3  0x7f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0 
"10.12.12.55", service=0x7f295c00e8bc "2135",
hints=0x7f295c00e910, pai=0x7f295c00e908) at 
../sysdeps/posix/getaddrinfo.c:2389
#4  0x0050287e in ast_sockaddr_resolve (addrs=0x7f295c00e9d0, 
str=0x7f295c00ea30 "10.12.12.55:2135", flags=0, family=2)

at netsock2.c:268
#5  0x7f2958963ba2 in ast_sockaddr_resolve_first_af 
(addr=0x7f29300591d8, name=0x7f295c00ea30 "10.12.12.55:2135", flag=0,

family=2) at chan_sip.c:30689
#6  0x7f2958963cb5 in ast_sockaddr_resolve_first_transport 
(addr=0x7f29300591d8, name=0x7f295c00ea30 "10.12.12.55:2135",

flag=0, transport=1) at chan_sip.c:30720
#7  0x7f29588fd3cc in set_destination (p=0x7f2930058cc8, 
uri=0x7f29300576e8 "sip:9052@10.12.12.55:2135;line=to7a729l")

at chan_sip.c:10455
#8  0x7f29588fe6e0 in reqprep (req=0x7f295c00fee0, p=0x7f2930058cc8, 
sipmethod=4, seqno=287, newbranch=1) at chan_sip.c:10778
#9  0x7f295890a201 in transmit_state_notify (p=0x7f2930058cc8, 
state=1, full=1, timeout=0) at chan_sip.c:13259
#10 0x7f29589141bb in cb_extensionstate (context=0x7f295c010cd0 
"hints", exten=0x7f295c010c80 "9052QS", state=1,

data=0x7f2930058cc8) at chan_sip.c:15117
#11 0x0050ebf6 in handle_statechange (datap=0x7f293acef830) at 
pbx.c:4972
#12 0x00555f8e in tps_processing_function (data=0x1f24f28) at 
taskprocessor.c:327

#13 0x00569280 in dummy_start (data=0x1ed76f0) at utils.c:1173
#14 0x7f295d5dcb50 in start_thread (arg=) at 
pthread_create.c:304
#15 0x7f295de3195d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:112

#16 0x in ?? ()



Which glibc version do you use?  Has it got a fix for CVE-2013-7423?

So far, the only known cause for a hang in this place (that is, lack of
return from recvmsg) is incorrect file descriptor use.  (CVE-2013-7423
is such an issue in glibc itself.)  The kernel upgrade could change
scheduling behavior, and the actual bug might have been latent before.

Theoretically, recvmsg could also hang if the Netlink query was dropped
by the kernel, or the final packet in the response was dropped.  We
never saw that happen, even under extreme load, but I didn't test with
recent kernels.

The glibc change Hannes mentioned won't detect the hang, but if there is
incorrect file descriptor reuse going on, it is possible that the new
assert catches it.

Florian


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Stefan Priebe


Am 18.11.2015 um 22:18 schrieb Florian Weimer:

On 11/18/2015 09:23 PM, Stefan Priebe wrote:


Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:

On Tue, 17 Nov 2015, Stefan Priebe wrote:

I've now also two gdb backtraces from two crashes:
http://pastebin.com/raw.php?i=yih5jNt8

http://pastebin.com/raw.php?i=kGEcvH4T


They don't tell me anything as I have no idea of the inner workings of
asterisk. You might be better of to talk to the asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well defined area.


The asterisk guys told me it's a livelock asterisk is waiting for
getaddrinfo / recvmsg.

Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
#0  0x7fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7fbeb9e92007 in getaddrinfo () from
/lib/x86_64-linux-gnu/libc.so.6


Stefan,

please try to get a backtrace with debugging information.  It is likely
that this is the make_request/__check_pf functionality in glibc, but it
would be nice to get some certainty.

Which glibc version do you use?  Has it got a fix for CVE-2013-7423?


It's Debians 2.13-38+deb7u8 Debians issue tracker says it is fixed: 
https://security-tracker.debian.org/tracker/CVE-2013-7423



So far, the only known cause for a hang in this place (that is, lack of
return from recvmsg) is incorrect file descriptor use.  (CVE-2013-7423
is such an issue in glibc itself.)  The kernel upgrade could change
scheduling behavior, and the actual bug might have been latent before.

Theoretically, recvmsg could also hang if the Netlink query was dropped
by the kernel, or the final packet in the response was dropped.  We
never saw that happen, even under extreme load, but I didn't test with
recent kernels.


The load is very low in this system. Just 30 phones and only 1-6 calling.


The glibc change Hannes mentioned won't detect the hang, but if there is
incorrect file descriptor reuse going on, it is possible that the new
assert catches it.

Florian




Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Hannes Frederic Sowa
On Wed, Nov 18, 2015, at 22:20, Stefan Priebe wrote:
> you mean just:
> la /proc/$pid/fd

ls -l /proc/pid/fd/

the numbers in brackets in return from readlink are the inode numbers.

> and
> 
> cat /proc/net/netlink

Exactly, last row is the inode number.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Stefan Priebe

Am 18.11.2015 um 22:00 schrieb Hannes Frederic Sowa:



On Wed, Nov 18, 2015, at 21:23, Stefan Priebe wrote:


Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:

On Tue, 17 Nov 2015, Stefan Priebe wrote:

I've now also two gdb backtraces from two crashes:
http://pastebin.com/raw.php?i=yih5jNt8

http://pastebin.com/raw.php?i=kGEcvH4T


They don't tell me anything as I have no idea of the inner workings of
asterisk. You might be better of to talk to the asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well defined area.


The asterisk guys told me it's a livelock asterisk is waiting for
getaddrinfo / recvmsg.

Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
#0  0x7fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7fbeb9e92007 in getaddrinfo () from
/lib/x86_64-linux-gnu/libc.so.6


Florian Weimer recently patched glibc to detect filedescriptors which
have been reused incorrectly in the glibc name resolving routines:


Could you determine the type of filedescriptor in /proc/pid/fd and
/proc/net/netlink by matching inodes?


you mean just:
la /proc/$pid/fd

and

cat /proc/net/netlink

Stefan


Thanks,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Florian Weimer
On 11/18/2015 09:23 PM, Stefan Priebe wrote:
> 
> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
>> On Tue, 17 Nov 2015, Stefan Priebe wrote:
>>> I've now also two gdb backtraces from two crashes:
>>> http://pastebin.com/raw.php?i=yih5jNt8
>>>
>>> http://pastebin.com/raw.php?i=kGEcvH4T
>>
>> They don't tell me anything as I have no idea of the inner workings of
>> asterisk. You might be better of to talk to the asterisk folks to help
>> you track down what that thing is waiting for, so we can actually look
>> at a well defined area.
> 
> The asterisk guys told me it's a livelock asterisk is waiting for
> getaddrinfo / recvmsg.
> 
> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
> #0  0x7fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7fbeb9e92007 in getaddrinfo () from
> /lib/x86_64-linux-gnu/libc.so.6

Stefan,

please try to get a backtrace with debugging information.  It is likely
that this is the make_request/__check_pf functionality in glibc, but it
would be nice to get some certainty.

Which glibc version do you use?  Has it got a fix for CVE-2013-7423?

So far, the only known cause for a hang in this place (that is, lack of
return from recvmsg) is incorrect file descriptor use.  (CVE-2013-7423
is such an issue in glibc itself.)  The kernel upgrade could change
scheduling behavior, and the actual bug might have been latent before.

Theoretically, recvmsg could also hang if the Netlink query was dropped
by the kernel, or the final packet in the response was dropped.  We
never saw that happen, even under extreme load, but I didn't test with
recent kernels.

The glibc change Hannes mentioned won't detect the hang, but if there is
incorrect file descriptor reuse going on, it is possible that the new
assert catches it.

Florian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Hannes Frederic Sowa


On Wed, Nov 18, 2015, at 21:23, Stefan Priebe wrote:
> 
> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
> > On Tue, 17 Nov 2015, Stefan Priebe wrote:
> >> I've now also two gdb backtraces from two crashes:
> >> http://pastebin.com/raw.php?i=yih5jNt8
> >>
> >> http://pastebin.com/raw.php?i=kGEcvH4T
> >
> > They don't tell me anything as I have no idea of the inner workings of
> > asterisk. You might be better of to talk to the asterisk folks to help
> > you track down what that thing is waiting for, so we can actually look
> > at a well defined area.
> 
> The asterisk guys told me it's a livelock asterisk is waiting for 
> getaddrinfo / recvmsg.
> 
> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
> #0  0x7fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7fbeb9e92007 in getaddrinfo () from 
> /lib/x86_64-linux-gnu/libc.so.6

Florian Weimer recently patched glibc to detect filedescriptors which
have been reused incorrectly in the glibc name resolving routines:


Could you determine the type of filedescriptor in /proc/pid/fd and
/proc/net/netlink by matching inodes?

Thanks,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Stefan Priebe


Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:

On Tue, 17 Nov 2015, Stefan Priebe wrote:

I've now also two gdb backtraces from two crashes:
http://pastebin.com/raw.php?i=yih5jNt8

http://pastebin.com/raw.php?i=kGEcvH4T


They don't tell me anything as I have no idea of the inner workings of
asterisk. You might be better of to talk to the asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well defined area.


The asterisk guys told me it's a livelock asterisk is waiting for 
getaddrinfo / recvmsg.


Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
#0  0x7fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7fbeb9e92007 in getaddrinfo () from 
/lib/x86_64-linux-gnu/libc.so.6
#4  0x004f2e81 in ast_sockaddr_resolve 
(addrs=addrs@entry=0x7fbe989c48d8, str=,
str@entry=0x7fbe989c4a80 "10.12.12.92:2052", flags=flags@entry=0, 
family=2) at netsock2.c:268
#5  0x7fbeb095fb8b in ast_sockaddr_resolve_first_af 
(family=, flag=0, name=0x7fbe989c4a80 "10.12.12.92:2052",

addr=0x7fbeb4114948) at chan_sip.c:30797
#6  ast_sockaddr_resolve_first_transport (transport=, 
flag=0, name=0x7fbe989c4a80 "10.12.12.92:2052",

addr=0x7fbeb4114948) at chan_sip.c:30828
#7  set_destination (uri=, p=0x7fbeb4114438) at 
chan_sip.c:10602
#8  reqprep (req=req@entry=0x7fbe989c4fc0, p=p@entry=0x7fbeb4114438, 
sipmethod=sipmethod@entry=8, seqno=,

seqno@entry=0, newbranch=newbranch@entry=1) at chan_sip.c:10925
---Type  to continue, or q  to quit---
#9  0x7fbeb0969968 in transmit_request_with_auth 
(p=p@entry=0x7fbeb4114438, newbranch=1, reliable=XMIT_RELIABLE, seqno=0,

sipmethod=8) at chan_sip.c:14296
#10 0x7fbeb098b704 in sip_hangup (ast=) at 
chan_sip.c:6822
#11 0x00477b55 in ast_hangup (chan=chan@entry=0x7fbeb40ef018) at 
channel.c:2887
#12 0x0050ce1b in __ast_pbx_run (c=c@entry=0x7fbeb40ef018, 
args=args@entry=0x0) at pbx.c:5729
#13 0x0050e426 in pbx_thread (data=data@entry=0x7fbeb40ef018) at 
pbx.c:5821

#14 0x00548aaa in dummy_start (data=) at utils.c:1151
#15 0x7fbeb965eb50 in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0

#16 0x7fbeb9eb395d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x in ?? ()

In both cases, the thread is waiting in recvmsg.

Stefan



Thanks,

tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Stefan Priebe


Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:

On Tue, 17 Nov 2015, Stefan Priebe wrote:

I've now also two gdb backtraces from two crashes:
http://pastebin.com/raw.php?i=yih5jNt8

http://pastebin.com/raw.php?i=kGEcvH4T


They don't tell me anything as I have no idea of the inner workings of
asterisk. You might be better of to talk to the asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well defined area.


The asterisk guys told me it's a livelock asterisk is waiting for 
getaddrinfo / recvmsg.


Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
#0  0x7fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7fbeb9e92007 in getaddrinfo () from 
/lib/x86_64-linux-gnu/libc.so.6
#4  0x004f2e81 in ast_sockaddr_resolve 
(addrs=addrs@entry=0x7fbe989c48d8, str=,
str@entry=0x7fbe989c4a80 "10.12.12.92:2052", flags=flags@entry=0, 
family=2) at netsock2.c:268
#5  0x7fbeb095fb8b in ast_sockaddr_resolve_first_af 
(family=, flag=0, name=0x7fbe989c4a80 "10.12.12.92:2052",

addr=0x7fbeb4114948) at chan_sip.c:30797
#6  ast_sockaddr_resolve_first_transport (transport=, 
flag=0, name=0x7fbe989c4a80 "10.12.12.92:2052",

addr=0x7fbeb4114948) at chan_sip.c:30828
#7  set_destination (uri=, p=0x7fbeb4114438) at 
chan_sip.c:10602
#8  reqprep (req=req@entry=0x7fbe989c4fc0, p=p@entry=0x7fbeb4114438, 
sipmethod=sipmethod@entry=8, seqno=,

seqno@entry=0, newbranch=newbranch@entry=1) at chan_sip.c:10925
---Type  to continue, or q  to quit---
#9  0x7fbeb0969968 in transmit_request_with_auth 
(p=p@entry=0x7fbeb4114438, newbranch=1, reliable=XMIT_RELIABLE, seqno=0,

sipmethod=8) at chan_sip.c:14296
#10 0x7fbeb098b704 in sip_hangup (ast=) at 
chan_sip.c:6822
#11 0x00477b55 in ast_hangup (chan=chan@entry=0x7fbeb40ef018) at 
channel.c:2887
#12 0x0050ce1b in __ast_pbx_run (c=c@entry=0x7fbeb40ef018, 
args=args@entry=0x0) at pbx.c:5729
#13 0x0050e426 in pbx_thread (data=data@entry=0x7fbeb40ef018) at 
pbx.c:5821

#14 0x00548aaa in dummy_start (data=) at utils.c:1151
#15 0x7fbeb965eb50 in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0

#16 0x7fbeb9eb395d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x in ?? ()

In both cases, the thread is waiting in recvmsg.

Stefan



Thanks,

tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Hannes Frederic Sowa


On Wed, Nov 18, 2015, at 21:23, Stefan Priebe wrote:
> 
> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
> > On Tue, 17 Nov 2015, Stefan Priebe wrote:
> >> I've now also two gdb backtraces from two crashes:
> >> http://pastebin.com/raw.php?i=yih5jNt8
> >>
> >> http://pastebin.com/raw.php?i=kGEcvH4T
> >
> > They don't tell me anything as I have no idea of the inner workings of
> > asterisk. You might be better of to talk to the asterisk folks to help
> > you track down what that thing is waiting for, so we can actually look
> > at a well defined area.
> 
> The asterisk guys told me it's a livelock asterisk is waiting for 
> getaddrinfo / recvmsg.
> 
> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
> #0  0x7fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7fbeb9e92007 in getaddrinfo () from 
> /lib/x86_64-linux-gnu/libc.so.6

Florian Weimer recently patched glibc to detect filedescriptors which
have been reused incorrectly in the glibc name resolving routines:


Could you determine the type of filedescriptor in /proc/pid/fd and
/proc/net/netlink by matching inodes?

Thanks,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Hannes Frederic Sowa
On Wed, Nov 18, 2015, at 22:42, Stefan Priebe wrote:
> 
> Am 18.11.2015 um 22:40 schrieb Hannes Frederic Sowa:
> > On Wed, Nov 18, 2015, at 22:36, Stefan Priebe wrote:
> >> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
> >> have ipv6 except for link local. Could it be this one?
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
> >>
> >> Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
> >> #0  0x7f295de3287d in recvmsg () at
> >> ../sysdeps/unix/syscall-template.S:82
> >> #1  0x7f295de52fcc in make_request (fd=35, pid=26631,
> >> seen_ipv4=, seen_ipv6=,
> >>   in6ai=, in6ailen=) at
> >> ../sysdeps/unix/sysv/linux/check_pf.c:119
> >> #2  0x7f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f,
> >> seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
> >>   in6ailen=0x7f295c00e838) at
> >>   ../sysdeps/unix/sysv/linux/check_pf.c:271
> >> #3  0x7f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0
> >> "10.12.12.55", service=0x7f295c00e8bc "2135",
> >>   hints=0x7f295c00e910, pai=0x7f295c00e908) at
> >> ../sysdeps/posix/getaddrinfo.c:2389
> >
> > Can you also get a /proc/pid/stack stacktrace?
> 
> Sure:
> 
> For the thread:
> # cat /proc/26631/task/26654/stack
> [] __skb_recv_datagram+0x52b/0x5d0
> [] skb_recv_datagram+0x32/0x40
> [] netlink_recvmsg+0x4c/0x300
> [] sock_recvmsg+0x13/0x20
> [] ___sys_recvmsg+0xee/0x230
> [] __sys_recvmsg+0x49/0x90
> [] SyS_recvmsg+0x12/0x20
> [] system_call_fastpath+0x12/0x71
> [] 0x

Ok, we are definitely in netlink code and thus have a correct(? at least
a netlink) netlink fd which is blocking and waiting for an answer. I
research the glibc code a little bit but I think that the occurrences of
ipv6 parameters are just generic glibc code.

Let me try to reproduce that.

Thanks,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Stefan Priebe


Am 18.11.2015 um 22:40 schrieb Hannes Frederic Sowa:

On Wed, Nov 18, 2015, at 22:36, Stefan Priebe wrote:

sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't
have ipv6 except for link local. Could it be this one?

https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79

Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
#0  0x7f295de3287d in recvmsg () at
../sysdeps/unix/syscall-template.S:82
#1  0x7f295de52fcc in make_request (fd=35, pid=26631,
seen_ipv4=, seen_ipv6=,
  in6ai=, in6ailen=) at
../sysdeps/unix/sysv/linux/check_pf.c:119
#2  0x7f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f,
seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
  in6ailen=0x7f295c00e838) at
  ../sysdeps/unix/sysv/linux/check_pf.c:271
#3  0x7f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0
"10.12.12.55", service=0x7f295c00e8bc "2135",
  hints=0x7f295c00e910, pai=0x7f295c00e908) at
../sysdeps/posix/getaddrinfo.c:2389


Can you also get a /proc/pid/stack stacktrace?


Sure:

For the thread:
# cat /proc/26631/task/26654/stack
[] __skb_recv_datagram+0x52b/0x5d0
[] skb_recv_datagram+0x32/0x40
[] netlink_recvmsg+0x4c/0x300
[] sock_recvmsg+0x13/0x20
[] ___sys_recvmsg+0xee/0x230
[] __sys_recvmsg+0x49/0x90
[] SyS_recvmsg+0x12/0x20
[] system_call_fastpath+0x12/0x71
[] 0x

Stefan



Thanks,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Stefan Priebe

Am 18.11.2015 um 22:18 schrieb Florian Weimer:

On 11/18/2015 09:23 PM, Stefan Priebe wrote:


Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:

On Tue, 17 Nov 2015, Stefan Priebe wrote:

I've now also two gdb backtraces from two crashes:
http://pastebin.com/raw.php?i=yih5jNt8

http://pastebin.com/raw.php?i=kGEcvH4T


They don't tell me anything as I have no idea of the inner workings of
asterisk. You might be better of to talk to the asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well defined area.


The asterisk guys told me it's a livelock asterisk is waiting for
getaddrinfo / recvmsg.

Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
#0  0x7fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7fbeb9e92007 in getaddrinfo () from
/lib/x86_64-linux-gnu/libc.so.6


Stefan,

please try to get a backtrace with debugging information.  It is likely
that this is the make_request/__check_pf functionality in glibc, but it
would be nice to get some certainty.


sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't 
have ipv6 except for link local. Could it be this one?


https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79

Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
#0  0x7f295de3287d in recvmsg () at 
../sysdeps/unix/syscall-template.S:82
#1  0x7f295de52fcc in make_request (fd=35, pid=26631, 
seen_ipv4=, seen_ipv6=,
in6ai=, in6ailen=) at 
../sysdeps/unix/sysv/linux/check_pf.c:119
#2  0x7f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f, 
seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,

in6ailen=0x7f295c00e838) at ../sysdeps/unix/sysv/linux/check_pf.c:271
#3  0x7f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0 
"10.12.12.55", service=0x7f295c00e8bc "2135",
hints=0x7f295c00e910, pai=0x7f295c00e908) at 
../sysdeps/posix/getaddrinfo.c:2389
#4  0x0050287e in ast_sockaddr_resolve (addrs=0x7f295c00e9d0, 
str=0x7f295c00ea30 "10.12.12.55:2135", flags=0, family=2)

at netsock2.c:268
#5  0x7f2958963ba2 in ast_sockaddr_resolve_first_af 
(addr=0x7f29300591d8, name=0x7f295c00ea30 "10.12.12.55:2135", flag=0,

family=2) at chan_sip.c:30689
#6  0x7f2958963cb5 in ast_sockaddr_resolve_first_transport 
(addr=0x7f29300591d8, name=0x7f295c00ea30 "10.12.12.55:2135",

flag=0, transport=1) at chan_sip.c:30720
#7  0x7f29588fd3cc in set_destination (p=0x7f2930058cc8, 
uri=0x7f29300576e8 "sip:9052@10.12.12.55:2135;line=to7a729l")

at chan_sip.c:10455
#8  0x7f29588fe6e0 in reqprep (req=0x7f295c00fee0, p=0x7f2930058cc8, 
sipmethod=4, seqno=287, newbranch=1) at chan_sip.c:10778
#9  0x7f295890a201 in transmit_state_notify (p=0x7f2930058cc8, 
state=1, full=1, timeout=0) at chan_sip.c:13259
#10 0x7f29589141bb in cb_extensionstate (context=0x7f295c010cd0 
"hints", exten=0x7f295c010c80 "9052QS", state=1,

data=0x7f2930058cc8) at chan_sip.c:15117
#11 0x0050ebf6 in handle_statechange (datap=0x7f293acef830) at 
pbx.c:4972
#12 0x00555f8e in tps_processing_function (data=0x1f24f28) at 
taskprocessor.c:327

#13 0x00569280 in dummy_start (data=0x1ed76f0) at utils.c:1173
#14 0x7f295d5dcb50 in start_thread (arg=) at 
pthread_create.c:304
#15 0x7f295de3195d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:112

#16 0x in ?? ()



Which glibc version do you use?  Has it got a fix for CVE-2013-7423?

So far, the only known cause for a hang in this place (that is, lack of
return from recvmsg) is incorrect file descriptor use.  (CVE-2013-7423
is such an issue in glibc itself.)  The kernel upgrade could change
scheduling behavior, and the actual bug might have been latent before.

Theoretically, recvmsg could also hang if the Netlink query was dropped
by the kernel, or the final packet in the response was dropped.  We
never saw that happen, even under extreme load, but I didn't test with
recent kernels.

The glibc change Hannes mentioned won't detect the hang, but if there is
incorrect file descriptor reuse going on, it is possible that the new
assert catches it.

Florian


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Hannes Frederic Sowa
On Wed, Nov 18, 2015, at 22:36, Stefan Priebe wrote:
> sorry here it is. What I'm wondering is why is there ipv6 stuff? I don't 
> have ipv6 except for link local. Could it be this one?
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
> 
> Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
> #0  0x7f295de3287d in recvmsg () at 
> ../sysdeps/unix/syscall-template.S:82
> #1  0x7f295de52fcc in make_request (fd=35, pid=26631, 
> seen_ipv4=, seen_ipv6=,
>  in6ai=, in6ailen=) at 
> ../sysdeps/unix/sysv/linux/check_pf.c:119
> #2  0x7f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f, 
> seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
>  in6ailen=0x7f295c00e838) at
>  ../sysdeps/unix/sysv/linux/check_pf.c:271
> #3  0x7f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0 
> "10.12.12.55", service=0x7f295c00e8bc "2135",
>  hints=0x7f295c00e910, pai=0x7f295c00e908) at 
> ../sysdeps/posix/getaddrinfo.c:2389

Can you also get a /proc/pid/stack stacktrace?

Thanks,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Florian Weimer
On 11/18/2015 09:23 PM, Stefan Priebe wrote:
> 
> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
>> On Tue, 17 Nov 2015, Stefan Priebe wrote:
>>> I've now also two gdb backtraces from two crashes:
>>> http://pastebin.com/raw.php?i=yih5jNt8
>>>
>>> http://pastebin.com/raw.php?i=kGEcvH4T
>>
>> They don't tell me anything as I have no idea of the inner workings of
>> asterisk. You might be better of to talk to the asterisk folks to help
>> you track down what that thing is waiting for, so we can actually look
>> at a well defined area.
> 
> The asterisk guys told me it's a livelock asterisk is waiting for
> getaddrinfo / recvmsg.
> 
> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
> #0  0x7fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7fbeb9e92007 in getaddrinfo () from
> /lib/x86_64-linux-gnu/libc.so.6

Stefan,

please try to get a backtrace with debugging information.  It is likely
that this is the make_request/__check_pf functionality in glibc, but it
would be nice to get some certainty.

Which glibc version do you use?  Has it got a fix for CVE-2013-7423?

So far, the only known cause for a hang in this place (that is, lack of
return from recvmsg) is incorrect file descriptor use.  (CVE-2013-7423
is such an issue in glibc itself.)  The kernel upgrade could change
scheduling behavior, and the actual bug might have been latent before.

Theoretically, recvmsg could also hang if the Netlink query was dropped
by the kernel, or the final packet in the response was dropped.  We
never saw that happen, even under extreme load, but I didn't test with
recent kernels.

The glibc change Hannes mentioned won't detect the hang, but if there is
incorrect file descriptor reuse going on, it is possible that the new
assert catches it.

Florian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Stefan Priebe

Am 18.11.2015 um 22:00 schrieb Hannes Frederic Sowa:



On Wed, Nov 18, 2015, at 21:23, Stefan Priebe wrote:


Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:

On Tue, 17 Nov 2015, Stefan Priebe wrote:

I've now also two gdb backtraces from two crashes:
http://pastebin.com/raw.php?i=yih5jNt8

http://pastebin.com/raw.php?i=kGEcvH4T


They don't tell me anything as I have no idea of the inner workings of
asterisk. You might be better of to talk to the asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well defined area.


The asterisk guys told me it's a livelock asterisk is waiting for
getaddrinfo / recvmsg.

Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
#0  0x7fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7fbeb9e92007 in getaddrinfo () from
/lib/x86_64-linux-gnu/libc.so.6


Florian Weimer recently patched glibc to detect filedescriptors which
have been reused incorrectly in the glibc name resolving routines:


Could you determine the type of filedescriptor in /proc/pid/fd and
/proc/net/netlink by matching inodes?


you mean just:
la /proc/$pid/fd

and

cat /proc/net/netlink

Stefan


Thanks,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Hannes Frederic Sowa
On Wed, Nov 18, 2015, at 22:20, Stefan Priebe wrote:
> you mean just:
> la /proc/$pid/fd

ls -l /proc/pid/fd/

the numbers in brackets in return from readlink are the inode numbers.

> and
> 
> cat /proc/net/netlink

Exactly, last row is the inode number.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-18 Thread Stefan Priebe


Am 18.11.2015 um 22:18 schrieb Florian Weimer:

On 11/18/2015 09:23 PM, Stefan Priebe wrote:


Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:

On Tue, 17 Nov 2015, Stefan Priebe wrote:

I've now also two gdb backtraces from two crashes:
http://pastebin.com/raw.php?i=yih5jNt8

http://pastebin.com/raw.php?i=kGEcvH4T


They don't tell me anything as I have no idea of the inner workings of
asterisk. You might be better of to talk to the asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well defined area.


The asterisk guys told me it's a livelock asterisk is waiting for
getaddrinfo / recvmsg.

Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
#0  0x7fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7fbeb9e92007 in getaddrinfo () from
/lib/x86_64-linux-gnu/libc.so.6


Stefan,

please try to get a backtrace with debugging information.  It is likely
that this is the make_request/__check_pf functionality in glibc, but it
would be nice to get some certainty.

Which glibc version do you use?  Has it got a fix for CVE-2013-7423?


It's Debians 2.13-38+deb7u8 Debians issue tracker says it is fixed: 
https://security-tracker.debian.org/tracker/CVE-2013-7423



So far, the only known cause for a hang in this place (that is, lack of
return from recvmsg) is incorrect file descriptor use.  (CVE-2013-7423
is such an issue in glibc itself.)  The kernel upgrade could change
scheduling behavior, and the actual bug might have been latent before.

Theoretically, recvmsg could also hang if the Netlink query was dropped
by the kernel, or the final packet in the response was dropped.  We
never saw that happen, even under extreme load, but I didn't test with
recent kernels.


The load is very low in this system. Just 30 phones and only 1-6 calling.


The glibc change Hannes mentioned won't detect the hang, but if there is
incorrect file descriptor reuse going on, it is possible that the new
assert catches it.

Florian




Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-17 Thread Thomas Gleixner
On Tue, 17 Nov 2015, Stefan Priebe wrote:
> I've now also two gdb backtraces from two crashes:
> http://pastebin.com/raw.php?i=yih5jNt8
> 
> http://pastebin.com/raw.php?i=kGEcvH4T

They don't tell me anything as I have no idea of the inner workings of
asterisk. You might be better of to talk to the asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well defined area.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-17 Thread Stefan Priebe

Am 17.11.2015 um 20:15 schrieb Thomas Gleixner:

On Tue, 17 Nov 2015, Stefan Priebe - Profihost AG wrote:

since Upgrading our Asterisk System from Kernel 3.18.17 to 4.1.13 it
deadlocks every few hours (kill -9 is the only thing working). Booting
with 3.18 again let it run smooth again.

An strace shows asterisk is looping like this:

[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8

fd 8 is:

lrwx-- 1 root root 64 Nov 17 15:27 /proc/6025/fd/8 ->
anon_inode:[timerfd]


# cat /proc/6025/stack
[] poll_schedule_timeout+0x49/0x70
[] do_sys_poll+0x3d7/0x590
[] do_restart_poll+0x3c/0x70
[] sys_restart_syscall+0x1f/0x30
[] system_call_fastpath+0x12/0x71
[] 0x

Any ideas how to debug this?


fd 8 is probably not really interesting. That looks like a interval
timer firing periodically. So it probably waits for fd 9 ...


I've now also two gdb backtraces from two crashes:
http://pastebin.com/raw.php?i=yih5jNt8

http://pastebin.com/raw.php?i=kGEcvH4T

Stefan



Thanks,

tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-17 Thread Thomas Gleixner
On Tue, 17 Nov 2015, Stefan Priebe - Profihost AG wrote:
> since Upgrading our Asterisk System from Kernel 3.18.17 to 4.1.13 it
> deadlocks every few hours (kill -9 is the only thing working). Booting
> with 3.18 again let it run smooth again.
> 
> An strace shows asterisk is looping like this:
> 
> [pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
> it_value={0, 140592906050976}}) = 0
> [pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
> 2, 1000) = 1 ([{fd=8, revents=POLLIN}])
> [pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
> it_value={0, 140592906050976}}) = 0
> [pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
> 
> fd 8 is:
> 
> lrwx-- 1 root root 64 Nov 17 15:27 /proc/6025/fd/8 ->
> anon_inode:[timerfd]
> 
> 
> # cat /proc/6025/stack
> [] poll_schedule_timeout+0x49/0x70
> [] do_sys_poll+0x3d7/0x590
> [] do_restart_poll+0x3c/0x70
> [] sys_restart_syscall+0x1f/0x30
> [] system_call_fastpath+0x12/0x71
> [] 0x
> 
> Any ideas how to debug this?

fd 8 is probably not really interesting. That looks like a interval
timer firing periodically. So it probably waits for fd 9 ...

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Asterisk deadlocks since Kernel 4.1

2015-11-17 Thread Stefan Priebe - Profihost AG
Hello,

since Upgrading our Asterisk System from Kernel 3.18.17 to 4.1.13 it
deadlocks every few hours (kill -9 is the only thing working). Booting
with 3.18 again let it run smooth again.

An strace shows asterisk is looping like this:

[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0

fd 8 is:

lrwx-- 1 root root 64 Nov 17 15:27 /proc/6025/fd/8 ->
anon_inode:[timerfd]


# cat /proc/6025/stack
[] poll_schedule_timeout+0x49/0x70
[] do_sys_poll+0x3d7/0x590
[] do_restart_poll+0x3c/0x70
[] sys_restart_syscall+0x1f/0x30
[] system_call_fastpath+0x12/0x71
[] 0x

Any ideas how to debug this?

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Asterisk deadlocks since Kernel 4.1

2015-11-17 Thread Stefan Priebe - Profihost AG
Hello,

since Upgrading our Asterisk System from Kernel 3.18.17 to 4.1.13 it
deadlocks every few hours (kill -9 is the only thing working). Booting
with 3.18 again let it run smooth again.

An strace shows asterisk is looping like this:

[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0

fd 8 is:

lrwx-- 1 root root 64 Nov 17 15:27 /proc/6025/fd/8 ->
anon_inode:[timerfd]


# cat /proc/6025/stack
[] poll_schedule_timeout+0x49/0x70
[] do_sys_poll+0x3d7/0x590
[] do_restart_poll+0x3c/0x70
[] sys_restart_syscall+0x1f/0x30
[] system_call_fastpath+0x12/0x71
[] 0x

Any ideas how to debug this?

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-17 Thread Stefan Priebe

Am 17.11.2015 um 20:15 schrieb Thomas Gleixner:

On Tue, 17 Nov 2015, Stefan Priebe - Profihost AG wrote:

since Upgrading our Asterisk System from Kernel 3.18.17 to 4.1.13 it
deadlocks every few hours (kill -9 is the only thing working). Booting
with 3.18 again let it run smooth again.

An strace shows asterisk is looping like this:

[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
it_value={0, 140592906050976}}) = 0
[pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8

fd 8 is:

lrwx-- 1 root root 64 Nov 17 15:27 /proc/6025/fd/8 ->
anon_inode:[timerfd]


# cat /proc/6025/stack
[] poll_schedule_timeout+0x49/0x70
[] do_sys_poll+0x3d7/0x590
[] do_restart_poll+0x3c/0x70
[] sys_restart_syscall+0x1f/0x30
[] system_call_fastpath+0x12/0x71
[] 0x

Any ideas how to debug this?


fd 8 is probably not really interesting. That looks like a interval
timer firing periodically. So it probably waits for fd 9 ...


I've now also two gdb backtraces from two crashes:
http://pastebin.com/raw.php?i=yih5jNt8

http://pastebin.com/raw.php?i=kGEcvH4T

Stefan



Thanks,

tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-17 Thread Thomas Gleixner
On Tue, 17 Nov 2015, Stefan Priebe wrote:
> I've now also two gdb backtraces from two crashes:
> http://pastebin.com/raw.php?i=yih5jNt8
> 
> http://pastebin.com/raw.php?i=kGEcvH4T

They don't tell me anything as I have no idea of the inner workings of
asterisk. You might be better of to talk to the asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well defined area.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Asterisk deadlocks since Kernel 4.1

2015-11-17 Thread Thomas Gleixner
On Tue, 17 Nov 2015, Stefan Priebe - Profihost AG wrote:
> since Upgrading our Asterisk System from Kernel 3.18.17 to 4.1.13 it
> deadlocks every few hours (kill -9 is the only thing working). Booting
> with 3.18 again let it run smooth again.
> 
> An strace shows asterisk is looping like this:
> 
> [pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
> it_value={0, 140592906050976}}) = 0
> [pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid  6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
> 2, 1000) = 1 ([{fd=8, revents=POLLIN}])
> [pid  6068] timerfd_gettime(8, , {it_interval={0, 2000},
> it_value={0, 140592906050976}}) = 0
> [pid  6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
> 
> fd 8 is:
> 
> lrwx-- 1 root root 64 Nov 17 15:27 /proc/6025/fd/8 ->
> anon_inode:[timerfd]
> 
> 
> # cat /proc/6025/stack
> [] poll_schedule_timeout+0x49/0x70
> [] do_sys_poll+0x3d7/0x590
> [] do_restart_poll+0x3c/0x70
> [] sys_restart_syscall+0x1f/0x30
> [] system_call_fastpath+0x12/0x71
> [] 0x
> 
> Any ideas how to debug this?

fd 8 is probably not really interesting. That looks like a interval
timer firing periodically. So it probably waits for fd 9 ...

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/