Re: Latest net-next from GIT panic

2017-09-21 Thread Eric Dumazet
On Thu, 2017-09-21 at 15:18 +0200, Paweł Staszewski wrote:

> OK, after adding the patch everything has been working for about 1 hour of normal
> traffic, with all BGP sessions connected and about 600k prefixes in the kernel.


Great, I am going to submit an official patch, uniting skb_dst_force()
and skb_dst_force_safe() into a single helper.

Thanks.
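
For readers following the refcount argument, here is a minimal user-space sketch of why the "safe" variant matters once nothing else keeps a dst alive at refcount 0. It is not the kernel code: struct toy_dst and the toy_* helpers are hypothetical names, and C11 atomics stand in for the kernel's atomic_t. The conditional hold follows the idea of atomic_inc_not_zero(): it refuses to take a reference once the count has reached 0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dst_entry; only the refcount matters. */
struct toy_dst {
	atomic_int refcnt;
};

/* Unconditional hold: would "resurrect" an object whose count is already 0. */
static void toy_hold(struct toy_dst *dst)
{
	atomic_fetch_add(&dst->refcnt, 1);
}

/*
 * Conditional hold, in the spirit of atomic_inc_not_zero(): take a
 * reference only while the count is still non-zero.
 */
static bool toy_hold_safe(struct toy_dst *dst)
{
	int old = atomic_load(&dst->refcnt);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&dst->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if the caller released the last one. */
static bool toy_release(struct toy_dst *dst)
{
	return atomic_fetch_sub(&dst->refcnt, 1) == 1;
}
```

With a garbage collector that delayed the actual free, a 0 -> 1 transition through toy_hold() was tolerable; when the object is destroyed as soon as the count drops to 0, only the toy_hold_safe() style is sound.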





Re: Latest net-next from GIT panic

2017-09-21 Thread Paweł Staszewski



On 2017-09-21 at 13:31, Paweł Staszewski wrote:

On 2017-09-21 at 13:03, Eric Dumazet wrote:

OK we have two problems here

1) We need to unify skb_dst_force()  ( for net tree )

2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from
lower device. This will considerably help your performance.


For 1), this is what I had in mind, can you try it ?

Thanks a lot !

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
 {
 	if (dst)
-		atomic_inc(&dst->__refcnt);
+		dst_hold(dst);
 	return dst;
 }
 
@@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb
 	__skb_dst_copy(nskb, oskb->_skb_refdst);
 }
 
-/**
- * skb_dst_force - makes sure skb dst is refcounted
- * @skb: buffer
- *
- * If dst is not yet refcounted, let's do it
- */
-static inline void skb_dst_force(struct sk_buff *skb)
-{
-	if (skb_dst_is_noref(skb)) {
-		WARN_ON(!rcu_read_lock_held());
-		skb->_skb_refdst &= ~SKB_DST_NOREF;
-		dst_clone(skb_dst(skb));
-	}
-}
-
 /**
  * dst_hold_safe - Take a reference on a dst if possible
  * @dst: pointer to dst entry
@@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb)
 	}
 }
 
+/**
+ * skb_dst_force - makes sure skb dst is refcounted
+ * @skb: buffer
+ *
+ * If dst is not yet refcounted, let's do it
+ */
+static inline void skb_dst_force(struct sk_buff *skb)
+{
+	if (skb_dst_is_noref(skb)) {
+		struct dst_entry *dst = skb_dst(skb);
+
+		WARN_ON(!rcu_read_lock_held());
+		if (!dst_hold_safe(dst))
+			dst = NULL;
+		skb->_skb_refdst = (unsigned long)dst;
+	}
+}
 
 /**
  * __skb_tunnel_rx - prepare skb for rx reinsert




Patch applied - so far no problems, and no warnings in dmesg.


OK, after adding the patch everything has been working for about 1 hour of normal
traffic, with all BGP sessions connected and about 600k prefixes in the kernel.




Re: Latest net-next from GIT panic

2017-09-21 Thread Paweł Staszewski



On 2017-09-21 at 13:03, Eric Dumazet wrote:

OK we have two problems here

1) We need to unify skb_dst_force()  ( for net tree )

2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from
lower device. This will considerably help your performance.


For 1), this is what I had in mind, can you try it ?

Thanks a lot !

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
 {
 	if (dst)
-		atomic_inc(&dst->__refcnt);
+		dst_hold(dst);
 	return dst;
 }
 
@@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb
 	__skb_dst_copy(nskb, oskb->_skb_refdst);
 }
 
-/**
- * skb_dst_force - makes sure skb dst is refcounted
- * @skb: buffer
- *
- * If dst is not yet refcounted, let's do it
- */
-static inline void skb_dst_force(struct sk_buff *skb)
-{
-	if (skb_dst_is_noref(skb)) {
-		WARN_ON(!rcu_read_lock_held());
-		skb->_skb_refdst &= ~SKB_DST_NOREF;
-		dst_clone(skb_dst(skb));
-	}
-}
-
 /**
  * dst_hold_safe - Take a reference on a dst if possible
  * @dst: pointer to dst entry
@@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb)
 	}
 }
 
+/**
+ * skb_dst_force - makes sure skb dst is refcounted
+ * @skb: buffer
+ *
+ * If dst is not yet refcounted, let's do it
+ */
+static inline void skb_dst_force(struct sk_buff *skb)
+{
+	if (skb_dst_is_noref(skb)) {
+		struct dst_entry *dst = skb_dst(skb);
+
+		WARN_ON(!rcu_read_lock_held());
+		if (!dst_hold_safe(dst))
+			dst = NULL;
+		skb->_skb_refdst = (unsigned long)dst;
+	}
+}
 
 /**
  * __skb_tunnel_rx - prepare skb for rx reinsert




Patch applied - so far no problems, and no warnings in dmesg.



Re: Latest net-next from GIT panic

2017-09-21 Thread Paweł Staszewski



On 2017-09-21 at 13:12, Paweł Staszewski wrote:

On 2017-09-21 at 13:03, Eric Dumazet wrote:

On Thu, 2017-09-21 at 11:06 +0200, Paweł Staszewski wrote:

On 2017-09-21 at 03:17, Eric Dumazet wrote:

On Wed, 2017-09-20 at 18:09 -0700, Wei Wang wrote:

Thanks very much Pawel for the feedback.

I was looking into the code (specifically IPv4 part) and found that in
free_fib_info_rcu(), we call free_nh_exceptions() without holding the
fnhe_lock. I am wondering if that could cause some race condition on
fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the
same dst could be happening.

But as we call free_fib_info_rcu() only after the grace period, and
the lookup code which could potentially modify
fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems
fine...


Hi Pawel,

Could you try the following debug patch on top of the net-next branch,
reproduce the issue, and check whether any warning messages show up?

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..82aff41c6f63 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
 {
 	if (dst)
-		atomic_inc(&dst->__refcnt);
+		dst_hold(dst);
 	return dst;
 }

Thanks.
Wei


Yes, we believe skb_dst_force() and skb_dst_force_safe() should be
unified  (to the 'safe' version)

We no longer have gc to protect from 0 -> 1 transition of dst refcount.






After adding patch from Wei
https://bugzilla.kernel.org/show_bug.cgi?id=197005#c14


OK we have two problems here

1) We need to unify skb_dst_force()  ( for net tree )

2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from
lower device. This will considerably help your performance.


For 1), this is what I had in mind, can you try it ?

Thanks a lot !

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
 {
 	if (dst)
-		atomic_inc(&dst->__refcnt);
+		dst_hold(dst);
 	return dst;
 }
 
@@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb
 	__skb_dst_copy(nskb, oskb->_skb_refdst);
 }
 
-/**
- * skb_dst_force - makes sure skb dst is refcounted
- * @skb: buffer
- *
- * If dst is not yet refcounted, let's do it
- */
-static inline void skb_dst_force(struct sk_buff *skb)
-{
-	if (skb_dst_is_noref(skb)) {
-		WARN_ON(!rcu_read_lock_held());
-		skb->_skb_refdst &= ~SKB_DST_NOREF;
-		dst_clone(skb_dst(skb));
-	}
-}
-
 /**
  * dst_hold_safe - Take a reference on a dst if possible
  * @dst: pointer to dst entry
@@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb)
 	}
 }
 
+/**
+ * skb_dst_force - makes sure skb dst is refcounted
+ * @skb: buffer
+ *
+ * If dst is not yet refcounted, let's do it
+ */
+static inline void skb_dst_force(struct sk_buff *skb)
+{
+	if (skb_dst_is_noref(skb)) {
+		struct dst_entry *dst = skb_dst(skb);
+
+		WARN_ON(!rcu_read_lock_held());
+		if (!dst_hold_safe(dst))
+			dst = NULL;
+		skb->_skb_refdst = (unsigned long)dst;
+	}
+}
 
 /**
  * __skb_tunnel_rx - prepare skb for rx reinsert




Thanks

What is weird is that I have this part in my net-next from git:
/**
 * skb_dst_force_safe - makes sure skb dst is refcounted
 * @skb: buffer
 *
 * If dst is not yet refcounted and not destroyed, grab a ref on it.
 */
static inline void skb_dst_force_safe(struct sk_buff *skb)
{
	if (skb_dst_is_noref(skb)) {
		struct dst_entry *dst = skb_dst(skb);

		if (!dst_hold_safe(dst))
			dst = NULL;

		skb->_skb_refdst = (unsigned long)dst;
	}
}




OK, the difference is that this is skb_dst_force_safe(), not skb_dst_force().
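
To make the distinction concrete, here is a small user-space sketch of the mechanism both helpers rely on: the skb stores the dst pointer with a low "noref" bit, and "forcing" the dst converts that borrowed pointer into a counted reference. This is an illustration under assumptions, not the kernel code: the toy_* names and TOY_DST_NOREF are hypothetical, the refcount is a plain int (single-threaded), and the behaviour shown is the "safe" one, which drops the pointer if the dst has already been released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag: low pointer bit set means "no reference held". */
#define TOY_DST_NOREF ((uintptr_t)1)

struct toy_dst { int refcnt; };        /* plain int: single-threaded sketch */
struct toy_skb { uintptr_t refdst; };  /* pointer value plus the NOREF bit */

/* Strip the flag bit to recover the pointer. */
static struct toy_dst *toy_skb_dst(const struct toy_skb *skb)
{
	return (struct toy_dst *)(skb->refdst & ~TOY_DST_NOREF);
}

static bool toy_skb_dst_is_noref(const struct toy_skb *skb)
{
	return (skb->refdst & TOY_DST_NOREF) && toy_skb_dst(skb);
}

/* Attach a dst without taking a reference (borrowed pointer). */
static void toy_skb_dst_set_noref(struct toy_skb *skb, struct toy_dst *dst)
{
	skb->refdst = (uintptr_t)dst | TOY_DST_NOREF;
}

/* "Safe" force: take a reference only if the dst is still alive. */
static void toy_skb_dst_force(struct toy_skb *skb)
{
	if (toy_skb_dst_is_noref(skb)) {
		struct toy_dst *dst = toy_skb_dst(skb);

		if (dst->refcnt == 0)
			dst = NULL;     /* already released: drop the pointer */
		else
			dst->refcnt++;  /* now a counted reference */
		skb->refdst = (uintptr_t)dst;  /* NOREF bit is cleared */
	}
}
```

The old skb_dst_force() shown in the removed hunk corresponds to clearing the bit and incrementing unconditionally, which is only correct if something else guarantees the count cannot already be 0.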




Re: Latest net-next from GIT panic

2017-09-21 Thread Paweł Staszewski



On 2017-09-21 at 13:03, Eric Dumazet wrote:

On Thu, 2017-09-21 at 11:06 +0200, Paweł Staszewski wrote:

On 2017-09-21 at 03:17, Eric Dumazet wrote:

On Wed, 2017-09-20 at 18:09 -0700, Wei Wang wrote:

Thanks very much Pawel for the feedback.

I was looking into the code (specifically IPv4 part) and found that in
free_fib_info_rcu(), we call free_nh_exceptions() without holding the
fnhe_lock. I am wondering if that could cause some race condition on
fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the
same dst could be happening.

But as we call free_fib_info_rcu() only after the grace period, and
the lookup code which could potentially modify
fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems
fine...


Hi Pawel,

Could you try the following debug patch on top of the net-next branch,
reproduce the issue, and check whether any warning messages show up?

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..82aff41c6f63 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
 {
 	if (dst)
-		atomic_inc(&dst->__refcnt);
+		dst_hold(dst);
 	return dst;
 }

Thanks.
Wei


Yes, we believe skb_dst_force() and skb_dst_force_safe() should be
unified  (to the 'safe' version)

We no longer have gc to protect from 0 -> 1 transition of dst refcount.





After adding patch from Wei
https://bugzilla.kernel.org/show_bug.cgi?id=197005#c14


OK we have two problems here

1) We need to unify skb_dst_force()  ( for net tree )

2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from
lower device. This will considerably help your performance.


For 1), this is what I had in mind, can you try it ?

Thanks a lot !

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
 {
 	if (dst)
-		atomic_inc(&dst->__refcnt);
+		dst_hold(dst);
 	return dst;
 }
 
@@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb
 	__skb_dst_copy(nskb, oskb->_skb_refdst);
 }
 
-/**
- * skb_dst_force - makes sure skb dst is refcounted
- * @skb: buffer
- *
- * If dst is not yet refcounted, let's do it
- */
-static inline void skb_dst_force(struct sk_buff *skb)
-{
-	if (skb_dst_is_noref(skb)) {
-		WARN_ON(!rcu_read_lock_held());
-		skb->_skb_refdst &= ~SKB_DST_NOREF;
-		dst_clone(skb_dst(skb));
-	}
-}
-
 /**
  * dst_hold_safe - Take a reference on a dst if possible
  * @dst: pointer to dst entry
@@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb)
 	}
 }
 
+/**
+ * skb_dst_force - makes sure skb dst is refcounted
+ * @skb: buffer
+ *
+ * If dst is not yet refcounted, let's do it
+ */
+static inline void skb_dst_force(struct sk_buff *skb)
+{
+	if (skb_dst_is_noref(skb)) {
+		struct dst_entry *dst = skb_dst(skb);
+
+		WARN_ON(!rcu_read_lock_held());
+		if (!dst_hold_safe(dst))
+			dst = NULL;
+		skb->_skb_refdst = (unsigned long)dst;
+	}
+}
 
 /**
  * __skb_tunnel_rx - prepare skb for rx reinsert




Thanks

What is weird is that I have this part in my net-next from git:
/**
 * skb_dst_force_safe - makes sure skb dst is refcounted
 * @skb: buffer
 *
 * If dst is not yet refcounted and not destroyed, grab a ref on it.
 */
static inline void skb_dst_force_safe(struct sk_buff *skb)
{
	if (skb_dst_is_noref(skb)) {
		struct dst_entry *dst = skb_dst(skb);

		if (!dst_hold_safe(dst))
			dst = NULL;

		skb->_skb_refdst = (unsigned long)dst;
	}
}




Re: Latest net-next from GIT panic

2017-09-21 Thread Eric Dumazet
On Thu, 2017-09-21 at 11:06 +0200, Paweł Staszewski wrote:
> 
> On 2017-09-21 at 03:17, Eric Dumazet wrote:
> > On Wed, 2017-09-20 at 18:09 -0700, Wei Wang wrote:
> >>> Thanks very much Pawel for the feedback.
> >>>
> >>> I was looking into the code (specifically IPv4 part) and found that in
> >>> free_fib_info_rcu(), we call free_nh_exceptions() without holding the
> >>> fnhe_lock. I am wondering if that could cause some race condition on
> >>> fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the
> >>> same dst could be happening.
> >>>
> >>> But as we call free_fib_info_rcu() only after the grace period, and
> >>> the lookup code which could potentially modify
> >>> fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems
> >>> fine...
> >>>
> >> Hi Pawel,
> >>
> >> Could you try the following debug patch on top of the net-next branch,
> >> reproduce the issue, and check whether any warning messages show up?
> >>
> >> diff --git a/include/net/dst.h b/include/net/dst.h
> >> index 93568bd0a352..82aff41c6f63 100644
> >> --- a/include/net/dst.h
> >> +++ b/include/net/dst.h
> >> @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
> >>  static inline struct dst_entry *dst_clone(struct dst_entry *dst)
> >>  {
> >>  	if (dst)
> >> -		atomic_inc(&dst->__refcnt);
> >> +		dst_hold(dst);
> >>  	return dst;
> >>  }
> >>
> >> Thanks.
> >> Wei
> >>
> >
> > Yes, we believe skb_dst_force() and skb_dst_force_safe() should be
> > unified  (to the 'safe' version)
> >
> > We no longer have gc to protect from 0 -> 1 transition of dst refcount.
> >
> >
> >
> >
> 
> After adding patch from Wei
> https://bugzilla.kernel.org/show_bug.cgi?id=197005#c14
> 

OK we have two problems here 

1) We need to unify skb_dst_force()  ( for net tree )

2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from
lower device. This will considerably help your performance.


For 1), this is what I had in mind, can you try it ?

Thanks a lot !

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
 {
 	if (dst)
-		atomic_inc(&dst->__refcnt);
+		dst_hold(dst);
 	return dst;
 }
 
@@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb
 	__skb_dst_copy(nskb, oskb->_skb_refdst);
 }
 
-/**
- * skb_dst_force - makes sure skb dst is refcounted
- * @skb: buffer
- *
- * If dst is not yet refcounted, let's do it
- */
-static inline void skb_dst_force(struct sk_buff *skb)
-{
-	if (skb_dst_is_noref(skb)) {
-		WARN_ON(!rcu_read_lock_held());
-		skb->_skb_refdst &= ~SKB_DST_NOREF;
-		dst_clone(skb_dst(skb));
-	}
-}
-
 /**
  * dst_hold_safe - Take a reference on a dst if possible
  * @dst: pointer to dst entry
@@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb)
 	}
 }
 
+/**
+ * skb_dst_force - makes sure skb dst is refcounted
+ * @skb: buffer
+ *
+ * If dst is not yet refcounted, let's do it
+ */
+static inline void skb_dst_force(struct sk_buff *skb)
+{
+	if (skb_dst_is_noref(skb)) {
+		struct dst_entry *dst = skb_dst(skb);
+
+		WARN_ON(!rcu_read_lock_held());
+		if (!dst_hold_safe(dst))
+			dst = NULL;
+		skb->_skb_refdst = (unsigned long)dst;
+	}
+}
 
 /**
  * __skb_tunnel_rx - prepare skb for rx reinsert




Re: Latest net-next from GIT panic

2017-09-21 Thread Paweł Staszewski



On 2017-09-21 at 03:17, Eric Dumazet wrote:

On Wed, 2017-09-20 at 18:09 -0700, Wei Wang wrote:

Thanks very much Pawel for the feedback.

I was looking into the code (specifically IPv4 part) and found that in
free_fib_info_rcu(), we call free_nh_exceptions() without holding the
fnhe_lock. I am wondering if that could cause some race condition on
fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the
same dst could be happening.

But as we call free_fib_info_rcu() only after the grace period, and
the lookup code which could potentially modify
fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems
fine...


Hi Pawel,

Could you try the following debug patch on top of the net-next branch,
reproduce the issue, and check whether any warning messages show up?

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..82aff41c6f63 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
 {
 	if (dst)
-		atomic_inc(&dst->__refcnt);
+		dst_hold(dst);
 	return dst;
 }

Thanks.
Wei



Yes, we believe skb_dst_force() and skb_dst_force_safe() should be
unified  (to the 'safe' version)

We no longer have gc to protect from 0 -> 1 transition of dst refcount.






After adding patch from Wei
https://bugzilla.kernel.org/show_bug.cgi?id=197005#c14





Re: Latest net-next from GIT panic

2017-09-20 Thread Eric Dumazet
On Wed, 2017-09-20 at 18:09 -0700, Wei Wang wrote:
> > Thanks very much Pawel for the feedback.
> >
> > I was looking into the code (specifically IPv4 part) and found that in
> > free_fib_info_rcu(), we call free_nh_exceptions() without holding the
> > fnhe_lock. I am wondering if that could cause some race condition on
> > fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the
> > same dst could be happening.
> >
> > But as we call free_fib_info_rcu() only after the grace period, and
> > the lookup code which could potentially modify
> > fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems
> > fine...
> >
> 
> Hi Pawel,
> 
> Could you try the following debug patch on top of the net-next branch,
> reproduce the issue, and check whether any warning messages show up?
> 
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 93568bd0a352..82aff41c6f63 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
>  static inline struct dst_entry *dst_clone(struct dst_entry *dst)
>  {
>  	if (dst)
> -		atomic_inc(&dst->__refcnt);
> +		dst_hold(dst);
>  	return dst;
>  }
> 
> Thanks.
> Wei
> 


Yes, we believe skb_dst_force() and skb_dst_force_safe() should be
unified  (to the 'safe' version)

We no longer have gc to protect from 0 -> 1 transition of dst refcount.





Re: Latest net-next from GIT panic

2017-09-20 Thread Wei Wang
> Thanks very much Pawel for the feedback.
>
> I was looking into the code (specifically IPv4 part) and found that in
> free_fib_info_rcu(), we call free_nh_exceptions() without holding the
> fnhe_lock. I am wondering if that could cause some race condition on
> fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the
> same dst could be happening.
>
> But as we call free_fib_info_rcu() only after the grace period, and
> the lookup code which could potentially modify
> fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems
> fine...
>

Hi Pawel,

Could you try the following debug patch on top of the net-next branch,
reproduce the issue, and check whether any warning messages show up?

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..82aff41c6f63 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
 {
 	if (dst)
-		atomic_inc(&dst->__refcnt);
+		dst_hold(dst);
 	return dst;
 }

Thanks.
Wei


On Wed, Sep 20, 2017 at 3:09 PM, Wei Wang  wrote:
 bisected again and same result:
 b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit
 commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
 Author: Wei Wang 
 Date:   Sat Jun 17 10:42:32 2017 -0700

 ipv4: mark DST_NOGC and remove the operation of dst_free()

 With the previous preparation patches, we are ready to get rid of the
 dst gc operation in ipv4 code and release dst based on refcnt only.
 So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls
 to dst_free().
 At this point, all dst created in ipv4 code do not use the dst gc
 anymore and will be destroyed at the point when refcnt drops to 0.

 Signed-off-by: Wei Wang 
 Acked-by: Martin KaFai Lau 
 Signed-off-by: David S. Miller 

 :040000 040000 9b7e7fb641de6531fc7887473ca47ef7cb6a11da 831a73b71d3df1755f3e24c0d3c86d7a93fd55e2 M  net

 I will now add version 2 of the patch from Eric and we will see.


>>> after adding patch
>>> perf top catch
>>>    PerfTop:   77159 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz cycles], (all, 40 CPUs)
>>>
>>> ---
>>>
>>>     60.95%  [kernel]  [k] dev_put.part.6
>>>      4.00%  [kernel]  [k] ixgbe_poll
>>>      3.63%  [kernel]  [k] irq_entries_start
>>>      1.22%  [kernel]  [k] fib_table_lookup
>>>      1.15%  [kernel]  [k] do_raw_spin_lock
>>>      1.05%  [kernel]  [k] ixgbe_xmit_frame_ring
>>>      1.04%  [kernel]  [k] lookup
>>>      0.87%  [kernel]  [k] eth_type_trans
>>>
>>>
>>> no panic on console - rebooting to check logs
>>>
>>>
>> Nothing logged
>>
>
> Thanks very much Pawel for the feedback.
>
> I was looking into the code (specifically IPv4 part) and found that in
> free_fib_info_rcu(), we call free_nh_exceptions() without holding the
> fnhe_lock. I am wondering if that could cause some race condition on
> fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the
> same dst could be happening.
>
> But as we call free_fib_info_rcu() only after the grace period, and
> the lookup code which could potentially modify
> fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems
> fine...
>
>
> On Wed, Sep 20, 2017 at 2:25 PM, Paweł Staszewski wrote:
>>
>>
>> On 2017-09-20 at 23:24, Paweł Staszewski wrote:
>>
>>>
>>>
>>> On 2017-09-20 at 23:10, Paweł Staszewski wrote:



 On 2017-09-20 at 21:23, Paweł Staszewski wrote:
>
>
>
> On 2017-09-20 at 21:13, Paweł Staszewski wrote:
>>
>>
>>
>> On 2017-09-20 at 20:36, Cong Wang wrote:
>>>
>>> On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet
>>>  wrote:

 On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>
> but dmesg at this time shows nothing about interfaces or flaps.
>
> This is very odd.
>
> We only free netdevice in free_netdev() and it is only called when
> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
> to be NULL.

 If there is a missing dev_hold() or one dev_put() in excess,
 this would allow the netdev to be freed too soon.

 -> Use after free.
 memory holding netdev could be reallocated-cleared by some other
 kernel
 user.

>>> Sure, but only unregister 

Re: Latest net-next from GIT panic

2017-09-20 Thread Wei Wang
>>> bisected again and same result:
>>> b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit
>>> commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
>>> Author: Wei Wang 
>>> Date:   Sat Jun 17 10:42:32 2017 -0700
>>>
>>> ipv4: mark DST_NOGC and remove the operation of dst_free()
>>>
>>> With the previous preparation patches, we are ready to get rid of the
>>> dst gc operation in ipv4 code and release dst based on refcnt only.
>>> So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls
>>> to dst_free().
>>> At this point, all dst created in ipv4 code do not use the dst gc
>>> anymore and will be destroyed at the point when refcnt drops to 0.
>>>
>>> Signed-off-by: Wei Wang 
>>> Acked-by: Martin KaFai Lau 
>>> Signed-off-by: David S. Miller 
>>>
>>> :040000 040000 9b7e7fb641de6531fc7887473ca47ef7cb6a11da 831a73b71d3df1755f3e24c0d3c86d7a93fd55e2 M  net
>>>
>>> I will now add version 2 of the patch from Eric and we will see.
>>>
>>>
>> after adding patch
>> perf top catch
>>    PerfTop:   77159 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz cycles], (all, 40 CPUs)
>>
>> ---
>>
>>     60.95%  [kernel]  [k] dev_put.part.6
>>      4.00%  [kernel]  [k] ixgbe_poll
>>      3.63%  [kernel]  [k] irq_entries_start
>>      1.22%  [kernel]  [k] fib_table_lookup
>>      1.15%  [kernel]  [k] do_raw_spin_lock
>>      1.05%  [kernel]  [k] ixgbe_xmit_frame_ring
>>      1.04%  [kernel]  [k] lookup
>>      0.87%  [kernel]  [k] eth_type_trans
>>
>>
>> no panic on console - rebooting to check logs
>>
>>
> Nothing logged
>

Thanks very much Pawel for the feedback.

I was looking into the code (specifically IPv4 part) and found that in
free_fib_info_rcu(), we call free_nh_exceptions() without holding the
fnhe_lock. I am wondering if that could cause some race condition on
fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the
same dst could be happening.

But as we call free_fib_info_rcu() only after the grace period, and
the lookup code which could potentially modify
fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems
fine...


On Wed, Sep 20, 2017 at 2:25 PM, Paweł Staszewski  wrote:
>
>
> On 2017-09-20 at 23:24, Paweł Staszewski wrote:
>
>>
>>
>> On 2017-09-20 at 23:10, Paweł Staszewski wrote:
>>>
>>>
>>>
>>> On 2017-09-20 at 21:23, Paweł Staszewski wrote:



 On 2017-09-20 at 21:13, Paweł Staszewski wrote:
>
>
>
> On 2017-09-20 at 20:36, Cong Wang wrote:
>>
>> On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet
>>  wrote:
>>>
>>> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:

 but dmesg at this time shows nothing about interfaces or flaps.

 This is very odd.

 We only free netdevice in free_netdev() and it is only called when
 we unregister a netdevice. Otherwise pcpu_refcnt is impossible
 to be NULL.
>>>
>>> If there is a missing dev_hold() or one dev_put() in excess,
>>> this would allow the netdev to be freed too soon.
>>>
>>> -> Use after free.
>>> memory holding netdev could be reallocated-cleared by some other
>>> kernel
>>> user.
>>>
>> Sure, but only unregister could trigger a free. If there is no
>> unregister,
>> like what Pawel claims, then there is no free, the refcnt just goes to
>> 0 but the memory is still there.
>>
> About a possible mistake on my side with the bisect: I may have judged too early
> that some bisect step was good.
> The path was:
> git bisect start
> # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag
> 'pinctrl-v4.13-1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
> git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
> # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next'
> of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
> git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
> # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using
> stack larger than 1024.
> git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
> # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch
> 'udp-reduce-cache-pressure'
> git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
> # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch
> 's390-net-updates-part-2'
> git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230
> # good: 

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski



On 2017-09-20 at 23:25, Paweł Staszewski wrote:

On 2017-09-20 at 23:24, Paweł Staszewski wrote:

On 2017-09-20 at 23:10, Paweł Staszewski wrote:

On 2017-09-20 at 21:23, Paweł Staszewski wrote:

On 2017-09-20 at 21:13, Paweł Staszewski wrote:

On 2017-09-20 at 20:36, Cong Wang wrote:

On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet wrote:

On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:

but dmesg at this time shows nothing about interfaces or flaps.

This is very odd.

We only free netdevice in free_netdev() and it is only called when
we unregister a netdevice. Otherwise pcpu_refcnt is impossible
to be NULL.

If there is a missing dev_hold() or one dev_put() in excess,
this would allow the netdev to be freed too soon.

-> Use after free.
memory holding netdev could be reallocated-cleared by some other kernel
user.

Sure, but only unregister could trigger a free. If there is no unregister,
like what Pawel claims, then there is no free, the refcnt just goes to
0 but the memory is still there.

About a possible mistake on my side with the bisect: I may have judged too
early that some bisect step was good.

The path was:
git bisect start
# bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024.
git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
# good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure'
git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
# bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2'
git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230
# good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow'
git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70
# good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing
git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2
# bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix
git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d
# good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put()

I currently have this running for about 4 hours without problems.

git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
# bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag

Here it panics for sure.

git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2
# bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly
git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112
# good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
# bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree

I'm not 100% sure about the last two.
I will test them again starting from
[95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly

git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855
# bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()
git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
# first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()





What more I can say:
I can reproduce this on any server with a similar configuration.
The differences can be teamd instead of bonding,
and ixgbe or i40e and mlx5.
Same problems.

VLANs, with more or fewer prefixes learned via bgp -> zebra -> netlink -> kernel.
But in the lab, with only plain routing, no bgpd, and about 128 VLANs with
128 routes, I can't reproduce this. It appears only with BGP. The minimum
where I could reproduce it was about 130k prefixes with about 286 nexthops.






bisected again and same result:
b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit
commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
Author: Wei Wang 
Date:   Sat Jun 17 10:42:32 2017 -0700

    ipv4: mark DST_NOGC and remove the operation of dst_free()

    With the previous preparation patches, we are ready to get rid 
of the

    dst gc operation in ipv4 code and release dst based on refcnt only.
    So this patch adds DST_NOGC flag for all IPv4 dst and remove the 
calls

    to dst_free().
    At this point, all dst created in ipv4 code do not use the dst gc
    anymore and will be destroyed at the point when refcnt drops to 0.

    Signed-off-by: Wei Wang 

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski



On 2017-09-20 at 23:24, Paweł Staszewski wrote:




Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski



On 2017-09-20 at 23:10, Paweł Staszewski wrote:




Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski



On 2017-09-20 at 21:23, Paweł Staszewski wrote:









Bisected again, with the same result:
b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit
commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
Author: Wei Wang 
Date:   Sat Jun 17 10:42:32 2017 -0700

    ipv4: mark DST_NOGC and remove the operation of dst_free()

    With the previous preparation patches, we are ready to get rid of the
    dst gc operation in ipv4 code and release dst based on refcnt only.
    So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls
    to dst_free().
    At this point, all dst created in ipv4 code do not use the dst gc
    anymore and will be destroyed at the point when refcnt drops to 0.

    Signed-off-by: Wei Wang 
    Acked-by: Martin KaFai Lau 
    Signed-off-by: David S. Miller 

:04 04 

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski



On 2017-09-20 at 21:13, Paweł Staszewski wrote:








What more I can say:
I can reproduce this on any server with a similar configuration.
The differences can be teamd instead of bonding,
and ixgbe, i40e or mlx5 NICs.
Same problems.

The setups use vlans, with more or fewer prefixes learned from
bgp -> zebra -> netlink -> kernel.
But in the lab, with only plain routing, no bgpd, and about 128 vlans with
128 routes, I can't reproduce this. It appears only with bgp; the minimum
at which I could reproduce it was about 130k prefixes with about 286
nexthops.






Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski



On 2017-09-20 at 20:36, Cong Wang wrote:
> On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet wrote:
>> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>>> but dmesg at this time shows nothing about interfaces or flaps.
>>>
>>> This is very odd.
>>>
>>> We only free netdevice in free_netdev() and it is only called when
>>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>>> to be NULL.
>>
>> If there is a missing dev_hold() or one dev_put() in excess,
>> this would allow the netdev to be freed too soon.
>>
>> -> Use after free.
>> memory holding netdev could be reallocated-cleared by some other kernel
>> user.
>
> Sure, but only unregister could trigger a free. If there is no unregister,
> like what Pawel claims, then there is no free, the refcnt just goes to
> 0 but the memory is still there.

About a possible mistake on my side with the bisect: I may have judged too
early that some bisect step was good.

The path was:
git bisect start
# bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024.
git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
# good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure'
git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
# bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2'
git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230
# good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow'
git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70
# good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing
git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2
# bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix
git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d
# good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put()

I currently have this one running for about 4 hours without problems.

git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
# bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag

Here, a panic for sure.

git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2
# bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly
git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112
# good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
# bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree

I'm not 100% sure about the last two.
I will test them again, starting from
[95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly

git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855
# bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()
git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
# first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()





Re: Latest net-next from GIT panic

2017-09-20 Thread Cong Wang
On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet  wrote:
> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>> but dmesg at this time shows nothing about interfaces or flaps.
>>
>> This is very odd.
>>
>> We only free netdevice in free_netdev() and it is only called when
>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>> to be NULL.
>
> If there is a missing dev_hold() or one dev_put() in excess,
> this would allow the netdev to be freed too soon.
>
> -> Use after free.
> memory holding netdev could be reallocated-cleared by some other kernel
> user.
>

Sure, but only unregister could trigger a free. If there is no unregister,
like what Pawel claims, then there is no free, the refcnt just goes to
0 but the memory is still there.


Re: Latest net-next from GIT panic

2017-09-20 Thread Eric Dumazet
On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
> but dmesg at this time shows nothing about interfaces or flaps.
> 
> This is very odd.
> 
> We only free netdevice in free_netdev() and it is only called when
> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
> to be NULL.

If there is a missing dev_hold() or one dev_put() in excess,
this would allow the netdev to be freed too soon.

-> Use after free.
memory holding netdev could be reallocated-cleared by some other kernel
user.
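
A minimal userspace sketch of the imbalance described above, assuming a plain integer counter. `struct obj` and the `obj_*` helpers are hypothetical names for illustration, not kernel APIs, and as noted elsewhere in the thread a real netdev is only freed on unregister, so this models the generic refcount pattern rather than struct net_device:

```c
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted kernel object (not struct net_device). */
struct obj {
	int refcnt;	/* outstanding references */
	bool freed;	/* set when the count reaches zero */
};

void obj_init(struct obj *o)
{
	o->refcnt = 1;	/* the creator owns the first reference */
	o->freed = false;
}

void obj_hold(struct obj *o)
{
	o->refcnt++;
}

void obj_put(struct obj *o)
{
	if (--o->refcnt == 0)
		o->freed = true;	/* real code would release the memory here */
}

/*
 * Two users share the object. With extra_put set, one put too many is
 * issued while the second user still holds its reference.
 * Returns true if the object is already freed at that point.
 */
bool freed_while_still_in_use(bool extra_put)
{
	struct obj o;

	obj_init(&o);		/* user A: refcnt 1 */
	obj_hold(&o);		/* user B: refcnt 2 */
	obj_put(&o);		/* user A is done: refcnt 1 */
	if (extra_put)
		obj_put(&o);	/* bug: unbalanced put, refcnt 0 */
	return o.freed;		/* user B has not dropped its reference yet */
}
```

With balanced calls the object is still alive while user B holds it; a single extra put (or, equivalently, a missing hold) is enough to free it underneath that user.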




Re: Latest net-next from GIT panic

2017-09-20 Thread Cong Wang
On Wed, Sep 20, 2017 at 10:55 AM, Paweł Staszewski wrote:
> On 2017-09-20 at 19:50, Cong Wang wrote:
>> On Wed, Sep 20, 2017 at 6:11 AM, Eric Dumazet wrote:
>>> Sorry for top-posting, but this is to give context to Wei, since Pawel
>>> used a top posting way to report his bisection.
>>>
>>> Wei, can you take a look at Pawel report ?
>>>
>>> Crash happens in dst_destroy() at following :
>>>
>>> if (dst->dev)
>>>  dev_put(dst->dev); <>
>>>
>>> dst->dev is not NULL, but netdev->pcpu_refcnt is NULL
>>>
>>> 65 ff 08    decl   %gs:(%rax)   // CRASH since rax = NULL
>>>
>>> Pawel, please share your netdevices and routing setup  ?
>>
>> Looks like a double dev_put() on some dev...
>>
>> Pawel, do you have any idea how this is triggered? Does your
>> test try to remove some network device? If so which one?
>> I noticed you have at least multiple vlan, bond and ixgbe
>> devices.
>
> Just after I start the bgp sessions.
> When the host starts, all bgp sessions to the upstreams are shut down.
>
> To trigger the panic I just enable all 6 bgp sessions to the upstreams at
> once, and zebra starts to pull prefixes and push them to the kernel.
>
> Then some traffic is generated from test hosts through this backup router,
> and the panic happens every time, 10 to 15 seconds after the bgp sessions
> are connected.
>
> I'm not removing any interface at this time or doing anything with
> interfaces, just waiting.
>
> And yes, there are vlans attached to the bond devices,
> but dmesg at this time shows nothing about interfaces or flaps.

This is very odd.

We only free netdevice in free_netdev() and it is only called when
we unregister a netdevice. Otherwise pcpu_refcnt is impossible
to be NULL.
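
The debug patch quoted in this thread makes dev_put() check the counter pointer before touching it. A userspace sketch of that check, under stated assumptions: `fake_dev` and `fake_dev_put` are hypothetical names, a plain int stands in for the per-cpu counter, and unlike the patch it returns an error instead of spinning:

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace model: the counter lives behind a pointer that
 * is NULL once the device's counter storage has been released.
 */
struct fake_dev {
	const char *name;
	int *refcnt;
};

/* Returns 0 on success, -1 if the counter pointer is already gone. */
int fake_dev_put(struct fake_dev *dev)
{
	int *ref = dev->refcnt;	/* read the pointer once, then test it */

	if (!ref) {
		fprintf(stderr, "no refcnt on dev %s\n", dev->name);
		return -1;	/* report instead of dereferencing NULL */
	}
	--*ref;
	return 0;
}
```

The point of the check is diagnostic: it turns a NULL dereference into a message naming the device, which is what the thread needs in order to tell which device saw the unbalanced put.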


Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski



On 2017-09-20 at 19:46, Wei Wang wrote:
>>> This is why I suggested to replace the BUG() in another mail
>>>
>>> So :
>>>
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>>   */
>>>  static inline void dev_put(struct net_device *dev)
>>>  {
>>> -       this_cpu_dec(*dev->pcpu_refcnt);
>>> +       int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>>> +
>>> +       if (!pref) {
>>> +               pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
>>> +                      dev, dev->name, dev->reg_state, dev->dismantle);
>>> +               for (;;)
>>> +                       cpu_relax();
>>> +       }
>>> +       this_cpu_dec(*pref);
>>>  }
>>>
>>>  /**
>
> Thanks a lot Eric for the debug patch.
>
> Pawel,
>
> I want to confirm with you about the last good commit when you did bisection.
> You mentioned:
>
>> And the last one
>>
>> git bisect good
>> Bisecting: 1 revision left to test after this (roughly 1 step)
>> [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for
>> insertion into fib6 tree
>>
>> With this have kernel panic same as always
>>
>> git bisect bad
>> Bisecting: 0 revisions left to test after this (roughly 0 steps)
>> [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
>> remove the operation of dst_free()
>
> So it breaks right at:
> [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
> remove the operation of dst_free()
> Right?
> If you sync the image to one commit before the above one:
> [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
> Does it crash?

Later today I will repeat the last three steps, in about 3 hours, after the
rush hours of internet traffic. Right now I can't touch the backup router :)

> And could you confirm that your config does not have any IPv6
> addresses or routes configured?

IPv6 is enabled, and yes, there are some IPv6 addresses.
One interface has IPv6 enabled with one static route,
but there are no IPv6 bgp sessions, so there are not many IPv6 prefixes and
the IPv6 fib is almost empty:

ip -6 r ls | wc -l
57

> Thanks.
> Wei


> 6:03 +0200, Paweł Staszewski wrote:
>>>>
>>>> Not much more after adding this patch
>>>>
>>>> https://bugzilla.kernel.org/attachment.cgi?id=258529
>>>>
>>> This is why I suggested to replace the BUG() in another mail
>>>
>>> So :
>>>
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>>   */
>>>  static inline void dev_put(struct net_device *dev)
>>>  {
>>> -       this_cpu_dec(*dev->pcpu_refcnt);
>>> +       int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>>> +
>>> +       if (!pref) {
>>> +               pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
>>> +                      dev, dev->name, dev->reg_state, dev->dismantle);
>>> +               for (;;)
>>> +                       cpu_relax();
>>> +       }
>>> +       this_cpu_dec(*pref);
>>>  }
>>>
>>>  /**
>>
>> Full panic
>>
>> https://bugzilla.kernel.org/attachment.cgi?id=258531
>>
>> I will change the patch and apply it, but later today, because right now
>> I can't use the backup router as a test lab: it is Internet rush hour, and
>> if something happens it would be bad for the second router to be running
>> a buggy kernel :)






Re: Latest net-next from GIT panic

2017-09-20 Thread Eric Dumazet
On Wed, 2017-09-20 at 10:50 -0700, Cong Wang wrote:
> On Wed, Sep 20, 2017 at 6:11 AM, Eric Dumazet  wrote:
> > Sorry for top-posting, but this is to give context to Wei, since Pawel
> > used a top posting way to report his bisection.
> >
> > Wei, can you take a look at Pawel report ?
> >
> > Crash happens in dst_destroy() at following :
> >
> > if (dst->dev)
> >  dev_put(dst->dev); <>
> >
> >
> > dst->dev is not NULL, but netdev->pcpu_refcnt is NULL
> >
> > 65 ff 08    decl   %gs:(%rax)   // CRASH since rax = NULL
> >
> >
> >
> > Pawel, please share your netdevices and routing setup  ?
> 
> Looks like a double dev_put() on some dev...
> 
> Pawel, do you have any idea how this is triggered? Does your
> test try to remove some network device? If so which one?
> I noticed you have at least multiple vlan, bond and ixgbe
> devices.

Or a missing dev_hold() somewhere.





Re: Latest net-next from GIT panic

2017-09-20 Thread Cong Wang
On Wed, Sep 20, 2017 at 6:11 AM, Eric Dumazet  wrote:
> Sorry for top-posting, but this is to give context to Wei, since Pawel
> used a top posting way to report his bisection.
>
> Wei, can you take a look at Pawel report ?
>
> Crash happens in dst_destroy() at following :
>
> if (dst->dev)
>  dev_put(dst->dev); <>
>
>
> dst->dev is not NULL, but netdev->pcpu_refcnt is NULL
>
> 65 ff 08    decl   %gs:(%rax)   // CRASH since rax = NULL
>
>
>
> Pawel, please share your netdevices and routing setup  ?

Looks like a double dev_put() on some dev...

Pawel, do you have any idea how this is triggered? Does your
test try to remove some network device? If so which one?
I noticed you have at least multiple vlan, bond and ixgbe
devices.


Re: Latest net-next from GIT panic

2017-09-20 Thread Wei Wang
>> This is why I suggested to replace the BUG() in another mail
>>
>> So :
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>*/
>>   static inline void dev_put(struct net_device *dev)
>>   {
>> -   this_cpu_dec(*dev->pcpu_refcnt);
>> +   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>> +
>> +   if (!pref) {
>> +   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
>> +  dev, dev->name, dev->reg_state, dev->dismantle);
>> +   for (;;)
>> +   cpu_relax();
>> +   }
>> +   this_cpu_dec(*pref);
>>   }
>> /**
>>

Thanks a lot Eric for the debug patch.

Pawel,

I want to confirm the last good commit from your bisection.
You mentioned:

> And the last one
>
> git bisect good
> Bisecting: 1 revision left to test after this (roughly 1 step)
> [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for
> insertion into fib6 tree
>
> With this have kernel panic same as always
>
> git bisect bad
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
> remove the operation of dst_free()


So it breaks right at:
[b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
remove the operation of dst_free()
Right?
If you sync the image to one commit before the above one:
[9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
Does it crash?

And could you confirm that your config does not have any IPv6
addresses or routes configured?

Thanks.
Wei


6:03 +0200, Paweł Staszewski wrote:
>>>
>>> Not much more after adding this patch
>>>
>>> https://bugzilla.kernel.org/attachment.cgi?id=258529
>>>
>> This is why I suggested replacing the BUG() in another mail
>>
>> So :
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>*/
>>   static inline void dev_put(struct net_device *dev)
>>   {
>> -   this_cpu_dec(*dev->pcpu_refcnt);
>> +   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>> +
>> +   if (!pref) {
>> +   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
>> +  dev, dev->name, dev->reg_state, dev->dismantle);
>> +   for (;;)
>> +   cpu_relax();
>> +   }
>> +   this_cpu_dec(*pref);
>>   }
>> /**
>>
>>
>>
>
> Full panic
>
> https://bugzilla.kernel.org/attachment.cgi?id=258531
>
>
> I will change the patch and apply it later today, because right now I can't
> use the backup router as a test lab: it is Internet rush hour, and if something
> happens it would be bad to have the second router running a buggy kernel :)
>
>


Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski

On 2017-09-20 at 16:40, Eric Dumazet wrote:

On Wed, 2017-09-20 at 16:03 +0200, Paweł Staszewski wrote:

Not much more after adding this patch

https://bugzilla.kernel.org/attachment.cgi?id=258529


This is why I suggested replacing the BUG() in another mail

So :

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
   */
  static inline void dev_put(struct net_device *dev)
  {
-   this_cpu_dec(*dev->pcpu_refcnt);
+   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
+
+   if (!pref) {
+   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
+  dev, dev->name, dev->reg_state, dev->dismantle);
+   for (;;)
+   cpu_relax();
+   }
+   this_cpu_dec(*pref);
  }
  
  /**






Full panic

https://bugzilla.kernel.org/attachment.cgi?id=258531


I will change the patch and apply it later today, because right now I can't
use the backup router as a test lab: it is Internet rush hour, and if something
happens it would be bad to have the second router running a buggy kernel :)





Re: Latest net-next from GIT panic

2017-09-20 Thread Eric Dumazet
On Wed, 2017-09-20 at 16:03 +0200, Paweł Staszewski wrote:
> Not much more after adding this patch
> 
> https://bugzilla.kernel.org/attachment.cgi?id=258529
> 

This is why I suggested replacing the BUG() in another mail

So :

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
  */
 static inline void dev_put(struct net_device *dev)
 {
-	this_cpu_dec(*dev->pcpu_refcnt);
+	int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
+
+	if (!pref) {
+		pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
+		       dev, dev->name, dev->reg_state, dev->dismantle);
+		for (;;)
+			cpu_relax();
+	}
+	this_cpu_dec(*pref);
 }
 
 /**




Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski

Not much more after adding this patch

https://bugzilla.kernel.org/attachment.cgi?id=258529



On 2017-09-20 at 15:44, Eric Dumazet wrote:

On Wed, 2017-09-20 at 15:39 +0200, Paweł Staszewski wrote:

On 2017-09-20 at 15:34, Eric Dumazet wrote:

Could you try this debug patch ?

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..1eaa3553a724dc8c048f67b556337072d5addc82 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3331,7 +3331,14 @@ void netdev_run_todo(void);
*/
   static inline void dev_put(struct net_device *dev)
   {
-   this_cpu_dec(*dev->pcpu_refcnt);
+   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
+
+   if (!pref) {
+   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
+  dev, dev->name, dev->reg_state, dev->dismantle);
+   BUG();
+   }
+   this_cpu_dec(*pref);
   }
   
   /**





Which kernel version do you want me to apply this patch to?
I have just run git bisect reset, so I am currently on mainline stable.


Simply use the latest net-next as mentioned in the thread title, thanks.







Re: Latest net-next from GIT panic

2017-09-20 Thread Eric Dumazet
On Wed, 2017-09-20 at 15:39 +0200, Paweł Staszewski wrote:
> 
> On 2017-09-20 at 15:34, Eric Dumazet wrote:
> > Could you try this debug patch ?
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..1eaa3553a724dc8c048f67b556337072d5addc82 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3331,7 +3331,14 @@ void netdev_run_todo(void);
> >*/
> >   static inline void dev_put(struct net_device *dev)
> >   {
> > -   this_cpu_dec(*dev->pcpu_refcnt);
> > +   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
> > +
> > +   if (!pref) {
> > +   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
> > +  dev, dev->name, dev->reg_state, dev->dismantle);
> > +   BUG();
> > +   }
> > +   this_cpu_dec(*pref);
> >   }
> >   
> >   /**
> >
> >
> >
> 
> Which kernel version do you want me to apply this patch to?
> I have just run git bisect reset, so I am currently on mainline stable.
> 

Simply use the latest net-next as mentioned in the thread title, thanks.




Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski



On 2017-09-20 at 15:34, Eric Dumazet wrote:

Could you try this debug patch ?

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..1eaa3553a724dc8c048f67b556337072d5addc82 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3331,7 +3331,14 @@ void netdev_run_todo(void);
   */
  static inline void dev_put(struct net_device *dev)
  {
-   this_cpu_dec(*dev->pcpu_refcnt);
+   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
+
+   if (!pref) {
+   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
+  dev, dev->name, dev->reg_state, dev->dismantle);
+   BUG();
+   }
+   this_cpu_dec(*pref);
  }
  
  /**






Which kernel version do you want me to apply this patch to?
I have just run git bisect reset, so I am currently on mainline stable.



Re: Latest net-next from GIT panic

2017-09-20 Thread Eric Dumazet
On Wed, 2017-09-20 at 06:34 -0700, Eric Dumazet wrote:
> Could you try this debug patch ?
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..1eaa3553a724dc8c048f67b556337072d5addc82 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3331,7 +3331,14 @@ void netdev_run_todo(void);
>   */
>  static inline void dev_put(struct net_device *dev)
>  {
> - this_cpu_dec(*dev->pcpu_refcnt);
> + int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
> +
> + if (!pref) {
> + pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
> +dev, dev->name, dev->reg_state, dev->dismantle);
> + BUG();
> + }
> + this_cpu_dec(*pref);
>  }
>  
>  /**
> 

And since the stack trace will fill the console, maybe use an infinite
loop instead of BUG()?

for (;;)
	cpu_relax();





Re: Latest net-next from GIT panic

2017-09-20 Thread Eric Dumazet
Could you try this debug patch ?

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..1eaa3553a724dc8c048f67b556337072d5addc82 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3331,7 +3331,14 @@ void netdev_run_todo(void);
  */
 static inline void dev_put(struct net_device *dev)
 {
-	this_cpu_dec(*dev->pcpu_refcnt);
+	int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
+
+	if (!pref) {
+		pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
+		       dev, dev->name, dev->reg_state, dev->dismantle);
+		BUG();
+	}
+	this_cpu_dec(*pref);
 }
 
 /**




Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski

Yes, sorry for top-posting as well.

Configuration:

Ethernet devices:

lspci | grep Etherne
02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
04:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
04:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
07:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
07:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
81:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
81:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
83:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
83:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)



ip l
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp2s0f0:  mtu 1500 qdisc mq state DOWN mode DEFAULT qlen 8192
    link/ether 00:25:90:e4:97:9a brd ff:ff:ff:ff:ff:ff
3: enp2s0f1:  mtu 1500 qdisc mq state DOWN mode DEFAULT qlen 8192
    link/ether 00:25:90:e4:97:9b brd ff:ff:ff:ff:ff:ff
4: enp4s0f0:  mtu 1500 qdisc mq master bond1 state UP mode DEFAULT qlen 8192
    link/ether 0c:c4:7a:bc:b8:68 brd ff:ff:ff:ff:ff:ff
5: enp4s0f1:  mtu 1500 qdisc mq master bond0 state UP mode DEFAULT qlen 8192
    link/ether 0c:c4:7a:bc:b8:69 brd ff:ff:ff:ff:ff:ff
6: enp7s0f0:  mtu 1500 qdisc mq master bond1 state UP mode DEFAULT qlen 8192
    link/ether 0c:c4:7a:bc:b8:68 brd ff:ff:ff:ff:ff:ff
7: enp7s0f1:  mtu 1500 qdisc mq master bond0 state UP mode DEFAULT qlen 8192
    link/ether 0c:c4:7a:bc:b8:69 brd ff:ff:ff:ff:ff:ff
8: enp129s0f0:  mtu 1500 qdisc mq master bond1 state UP mode DEFAULT qlen 8192
    link/ether 0c:c4:7a:bc:b8:68 brd ff:ff:ff:ff:ff:ff
9: enp129s0f1:  mtu 1500 qdisc mq master bond0 state UP mode DEFAULT qlen 8192
    link/ether 0c:c4:7a:bc:b8:69 brd ff:ff:ff:ff:ff:ff
10: enp131s0f0:  mtu 1500 qdisc mq master bond1 state UP mode DEFAULT qlen 8192
    link/ether 0c:c4:7a:bc:b8:68 brd ff:ff:ff:ff:ff:ff
11: enp131s0f1:  mtu 1500 qdisc mq master bond0 state UP mode DEFAULT qlen 8192
    link/ether 0c:c4:7a:bc:b8:69 brd ff:ff:ff:ff:ff:ff
12: sit0@NONE:  mtu 1480 qdisc noop state DOWN mode DEFAULT qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
13: bond0:  mtu 1500 qdisc noqueue state UP mode DEFAULT qlen 1000
    link/ether 0c:c4:7a:bc:b8:69 brd ff:ff:ff:ff:ff:ff
14: bond1:  mtu 1500 qdisc noqueue state UP mode DEFAULT qlen 1000
    link/ether 0c:c4:7a:bc:b8:68 brd ff:ff:ff:ff:ff:ff
15: vlan4091@bond0:  mtu 1500 qdisc noqueue state UP mode DEFAULT qlen 1000
    link/ether 0c:c4:7a:bc:b8:69 brd ff:ff:ff:ff:ff:ff
16: vlan4032@bond0:  mtu 1500 qdisc noqueue state UP mode DEFAULT qlen 1000
    link/ether 0c:c4:7a:bc:b8:69 brd ff:ff:ff:ff:ff:ff
17: vlan514@bond0:  mtu 1500 qdisc noqueue state UP mode DEFAULT qlen 1000
    link/ether 0c:c4:7a:bc:b8:69 brd ff:ff:ff:ff:ff:ff
18: vlan87@bond0:  mtu 1500 qdisc noqueue state UP mode DEFAULT qlen 1000
    link/ether 0c:c4:7a:bc:b8:69 brd ff:ff:ff:ff:ff:ff
19: vlan518@bond1:  mtu 1500 qdisc noqueue state UP mode DEFAULT qlen 1000
    link/ether 0c:c4:7a:bc:b8:68 brd ff:ff:ff:ff:ff:ff
20: vlan646@bond1:  mtu 1500 qdisc noqueue state UP mode DEFAULT qlen 1000
    link/ether 0c:c4:7a:bc:b8:68 brd ff:ff:ff:ff:ff:ff
21: vlan370@bond0:  mtu 1500 qdisc noqueue state UP mode DEFAULT qlen 1000
    link/ether 0c:c4:7a:bc:b8:69 brd ff:ff:ff:ff:ff:ff
22: vlan3212@bond0:  mtu 1500 qdisc noqueue state UP mode DEFAULT qlen 1000
    link/ether 0c:c4:7a:bc:b8:69 brd ff:ff:ff:ff:ff:ff
23: vlan746@bond0:  mtu 1500 qdisc noqueue state UP mode DEFAULT qlen 1000
    link/ether 0c:c4:7a:bc:b8:69 brd ff:ff:ff:ff:ff:ff


There are bonds:

cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)


Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski

So far the bisect path was:

git bisect start
# bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024.
git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
# good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure'
git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
# bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2'
git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230
# good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow'
git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70
# good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing
git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2
# bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix
git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d
# good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put()
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
# bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag
git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2
# bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly
git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112

No PANIC

# good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f

PANIC

# bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree
git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855

PANIC

# bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()
git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
# first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()





On 2017-09-20 at 15:05, Paweł Staszewski wrote:

hmm

But after

b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit
commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
Author: Wei Wang 
Date:   Sat Jun 17 10:42:32 2017 -0700

    ipv4: mark DST_NOGC and remove the operation of dst_free()

    With the previous preparation patches, we are ready to get rid of the
    dst gc operation in ipv4 code and release dst based on refcnt only.
    So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls
    to dst_free().
    At this point, all dst created in ipv4 code do not use the dst gc
    anymore and will be destroyed at the point when refcnt drops to 0.

    Signed-off-by: Wei Wang 
    Acked-by: Martin KaFai Lau 
    Signed-off-by: David S. Miller 

:040000 040000 9b7e7fb641de6531fc7887473ca47ef7cb6a11da 831a73b71d3df1755f3e24c0d3c86d7a93fd55e2 M	net



Still panics, so I will go back 3 steps and try to bisect again
from a point without the panic.




On 2017-09-20 at 14:49, Paweł Staszewski wrote:

And the last one

git bisect good
Bisecting: 1 revision left to test after this (roughly 1 step)
[1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree


With this one I get the same kernel panic as always

git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()




On 2017-09-20 at 14:23, Paweł Staszewski wrote:

Almost there

Bisecting: 6 revisions left to test after this (roughly 3 steps)
[ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly




On 2017-09-20 at 13:02, Paweł Staszewski wrote:

OK, resumed, and so far:

Panic:

# bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024.

git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f

No panic:

# good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure'

git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20


On 2017-09-20 at 12:22, Paweł Staszewski wrote:

So far bisected and marked:

git bisect start
# bad: [07dd6cc1fff160143e82cf5df78c1db0b6e03355] Linux 4.13.2
git bisect bad 07dd6cc1fff160143e82cf5df78c1db0b6e03355
# good: [5d7d2e03e0f01a992e3521b180c3d3e67905f269] Linux 4.12.13
git bisect good 5d7d2e03e0f01a992e3521b180c3d3e67905f269
# good: [6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c] Linux 4.12

Re: Latest net-next from GIT panic

2017-09-20 Thread Eric Dumazet
Sorry for top-posting, but this is to give context to Wei, since Pawel
used a top posting way to report his bisection.

Wei, can you take a look at Pawel report ?

Crash happens in dst_destroy() at following :

if (dst->dev)
 dev_put(dst->dev); <>


dst->dev is not NULL, but netdev->pcpu_refcnt is NULL

65 ff 08    decl   %gs:(%rax)   // CRASH since rax = NULL



Pawel, please share your netdevices and routing setup  ?

Thanks !
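
For readers following along, the failure mode described above (a valid device pointer whose counter pointer is already NULL) can be sketched in user space as follows. This is an illustration only: struct toy_netdev and toy_dev_put_checked() are hypothetical names, and a plain int pointer stands in for the per-CPU dev->pcpu_refcnt.

```c
#include <stddef.h>

/* Hypothetical stand-in for a net_device; refcnt plays the role of
 * dev->pcpu_refcnt and is NULL once the counter has been freed. */
struct toy_netdev {
	int *refcnt;
};

/* Returns 0 after dropping one reference, or -1 if the counter is
 * already gone. An unchecked decrement would dereference NULL at that
 * point, which is the kind of fault the crash report shows. */
static int toy_dev_put_checked(struct toy_netdev *dev)
{
	int *ref = dev->refcnt;	/* read the pointer once */

	if (ref == NULL)
		return -1;
	(*ref)--;
	return 0;
}
```

A check like this does not fix the imbalance; it only turns the fault into something that can be reported along with the device state.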

On Wed, 2017-09-20 at 14:49 +0200, Paweł Staszewski wrote:
> And the last one
> 
> git bisect good
> Bisecting: 1 revision left to test after this (roughly 1 step)
> [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree
> 
> With this one I get the same kernel panic as always
> 
> git bisect bad
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()
> 
> 
> 
> On 2017-09-20 at 14:23, Paweł Staszewski wrote:
> > Almost there
> >
> > Bisecting: 6 revisions left to test after this (roughly 3 steps)
> > [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly
> >
> >
> >
> > On 2017-09-20 at 13:02, Paweł Staszewski wrote:
> >> OK, resumed, and so far:
> >>
> >> Panic:
> >>
> >> # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024.
> >> git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
> >>
> >> No panic:
> >>
> >> # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure'
> >> git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
> >>
> >>
> >> On 2017-09-20 at 12:22, Paweł Staszewski wrote:
> >>> So far bisected and marked:
> >>>
> >>> git bisect start
> >>> # bad: [07dd6cc1fff160143e82cf5df78c1db0b6e03355] Linux 4.13.2
> >>> git bisect bad 07dd6cc1fff160143e82cf5df78c1db0b6e03355
> >>> # good: [5d7d2e03e0f01a992e3521b180c3d3e67905f269] Linux 4.12.13
> >>> git bisect good 5d7d2e03e0f01a992e3521b180c3d3e67905f269
> >>> # good: [6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c] Linux 4.12
> >>> git bisect good 6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c
> >>> # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 
> >>> 'pinctrl-v4.13-1' of 
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
> >>> git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
> >>> # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
> >>> 'next' of 
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
> >>> git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
> >>> # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
> >>> 'next' of 
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
> >>> git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
> >>> # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
> >>> 'next' of 
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
> >>> git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
> >>>
> >>>
> >>>
> >>> On 2017-09-20 at 12:21, Paweł Staszewski wrote:
>  OK, the kernel crashed with a different panic that I didn't catch while
>  I was bisecting, and now my bisection is broken :)
> 
>  git bisect good
>  Bisecting: 1787 revisions left to test after this (roughly 11 steps)
>  error: Your local changes to the following files would be 
>  overwritten by checkout:
>  Documentation/00-INDEX
>  Documentation/ABI/stable/sysfs-class-udc
>  Documentation/ABI/testing/configfs-usb-gadget-uac1
>  Documentation/ABI/testing/ima_policy
>  Documentation/ABI/testing/sysfs-bus-iio
>  Documentation/ABI/testing/sysfs-bus-iio-meas-spec
>  Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
>  Documentation/ABI/testing/sysfs-class-net
>  Documentation/ABI/testing/sysfs-class-power-twl4030
>  Documentation/ABI/testing/sysfs-class-typec
>  Documentation/DMA-API.txt
>  Documentation/IRQ-domain.txt
>  Documentation/Makefile
>  Documentation/PCI/MSI-HOWTO.txt
>  Documentation/RCU/00-INDEX
>  Documentation/RCU/Design/Requirements/Requirements.html
>  Documentation/RCU/checklist.txt
>  Documentation/admin-guide/README.rst
>  Documentation/admin-guide/devices.txt
>  Documentation/admin-guide/index.rst
>  Documentation/admin-guide/kernel-parameters.txt
>  Documentation/admin-guide/pm/cpufreq.rst
>  Documentation/admin-guide/pm/intel_pstate.rst
>  Documentation/admin-guide/ras.rst
>  Documentation/arm/Atmel/README
>  Documentation/block/biodoc.txt
>  Documentation/conf.py
>  Documentation/core-api/assoc_array.rst
>    

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski

hmm

But after

b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit
commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
Author: Wei Wang 
Date:   Sat Jun 17 10:42:32 2017 -0700

    ipv4: mark DST_NOGC and remove the operation of dst_free()

    With the previous preparation patches, we are ready to get rid of the
    dst gc operation in ipv4 code and release dst based on refcnt only.
    So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls
    to dst_free().
    At this point, all dst created in ipv4 code do not use the dst gc
    anymore and will be destroyed at the point when refcnt drops to 0.

    Signed-off-by: Wei Wang 
    Acked-by: Martin KaFai Lau 
    Signed-off-by: David S. Miller 

:040000 040000 9b7e7fb641de6531fc7887473ca47ef7cb6a11da 831a73b71d3df1755f3e24c0d3c86d7a93fd55e2 M	net



Still panics, so I will go back 3 steps and try to bisect again
from a point without the panic.




On 2017-09-20 at 14:49, Paweł Staszewski wrote:

And the last one

git bisect good
Bisecting: 1 revision left to test after this (roughly 1 step)
[1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree


With this one I get the same kernel panic as always

git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()




On 2017-09-20 at 14:23, Paweł Staszewski wrote:

Almost there

Bisecting: 6 revisions left to test after this (roughly 3 steps)
[ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly




On 2017-09-20 at 13:02, Paweł Staszewski wrote:

OK, resumed, and so far:

Panic:

# bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024.

git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f

No panic:

# good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure'

git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20


On 2017-09-20 at 12:22, Paweł Staszewski wrote:

So far bisected and marked:

git bisect start
# bad: [07dd6cc1fff160143e82cf5df78c1db0b6e03355] Linux 4.13.2
git bisect bad 07dd6cc1fff160143e82cf5df78c1db0b6e03355
# good: [5d7d2e03e0f01a992e3521b180c3d3e67905f269] Linux 4.12.13
git bisect good 5d7d2e03e0f01a992e3521b180c3d3e67905f269
# good: [6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c] Linux 4.12
git bisect good 6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c
# bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 
'pinctrl-v4.13-1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
'next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
'next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
'next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31



On 2017-09-20 at 12:21, Paweł Staszewski wrote:
OK, the kernel crashed with a different panic that I didn't catch while
I was bisecting, and now my bisection is broken :)


git bisect good
Bisecting: 1787 revisions left to test after this (roughly 11 steps)
error: Your local changes to the following files would be 
overwritten by checkout:

    Documentation/00-INDEX
    Documentation/ABI/stable/sysfs-class-udc
    Documentation/ABI/testing/configfs-usb-gadget-uac1
    Documentation/ABI/testing/ima_policy
    Documentation/ABI/testing/sysfs-bus-iio
    Documentation/ABI/testing/sysfs-bus-iio-meas-spec
Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
    Documentation/ABI/testing/sysfs-class-net
Documentation/ABI/testing/sysfs-class-power-twl4030
    Documentation/ABI/testing/sysfs-class-typec
    Documentation/DMA-API.txt
    Documentation/IRQ-domain.txt
    Documentation/Makefile
    Documentation/PCI/MSI-HOWTO.txt
    Documentation/RCU/00-INDEX
Documentation/RCU/Design/Requirements/Requirements.html
    Documentation/RCU/checklist.txt
    Documentation/admin-guide/README.rst
    Documentation/admin-guide/devices.txt
    Documentation/admin-guide/index.rst
    Documentation/admin-guide/kernel-parameters.txt
    Documentation/admin-guide/pm/cpufreq.rst
    Documentation/admin-guide/pm/intel_pstate.rst
    Documentation/admin-guide/ras.rst
    Documentation/arm/Atmel/README
    Documentation/block/biodoc.txt
    Documentation/conf.py
    Documentation/core-api/assoc_array.rst
    

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski

And the last one

git bisect good
Bisecting: 1 revision left to test after this (roughly 1 step)
[1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree


With this one I get the same kernel panic as always

git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()




On 2017-09-20 at 14:23, Paweł Staszewski wrote:

Almost there

Bisecting: 6 revisions left to test after this (roughly 3 steps)
[ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly




On 2017-09-20 at 13:02, Paweł Staszewski wrote:

OK, resumed, and so far:

Panic:

# bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024.

git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f

No panic:

# good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure'

git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20


On 2017-09-20 at 12:22, Paweł Staszewski wrote:

So far bisected and marked:

git bisect start
# bad: [07dd6cc1fff160143e82cf5df78c1db0b6e03355] Linux 4.13.2
git bisect bad 07dd6cc1fff160143e82cf5df78c1db0b6e03355
# good: [5d7d2e03e0f01a992e3521b180c3d3e67905f269] Linux 4.12.13
git bisect good 5d7d2e03e0f01a992e3521b180c3d3e67905f269
# good: [6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c] Linux 4.12
git bisect good 6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c
# bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 
'pinctrl-v4.13-1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
'next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
'next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
'next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31



On 2017-09-20 at 12:21, Paweł Staszewski wrote:
OK, the kernel crashed with a different panic that I didn't catch while
I was bisecting, and now my bisection is broken :)


git bisect good
Bisecting: 1787 revisions left to test after this (roughly 11 steps)
error: Your local changes to the following files would be 
overwritten by checkout:

    Documentation/00-INDEX
    Documentation/ABI/stable/sysfs-class-udc
    Documentation/ABI/testing/configfs-usb-gadget-uac1
    Documentation/ABI/testing/ima_policy
    Documentation/ABI/testing/sysfs-bus-iio
    Documentation/ABI/testing/sysfs-bus-iio-meas-spec
    Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
    Documentation/ABI/testing/sysfs-class-net
    Documentation/ABI/testing/sysfs-class-power-twl4030
    Documentation/ABI/testing/sysfs-class-typec
    Documentation/DMA-API.txt
    Documentation/IRQ-domain.txt
    Documentation/Makefile
    Documentation/PCI/MSI-HOWTO.txt
    Documentation/RCU/00-INDEX
Documentation/RCU/Design/Requirements/Requirements.html
    Documentation/RCU/checklist.txt
    Documentation/admin-guide/README.rst
    Documentation/admin-guide/devices.txt
    Documentation/admin-guide/index.rst
    Documentation/admin-guide/kernel-parameters.txt
    Documentation/admin-guide/pm/cpufreq.rst
    Documentation/admin-guide/pm/intel_pstate.rst
    Documentation/admin-guide/ras.rst
    Documentation/arm/Atmel/README
    Documentation/block/biodoc.txt
    Documentation/conf.py
    Documentation/core-api/assoc_array.rst
    Documentation/core-api/atomic_ops.rst
    Documentation/core-api/index.rst
    Documentation/crypto/asymmetric-keys.txt
    Documentation/dev-tools/index.rst
    Documentation/dev-tools/sparse.rst
    Documentation/devicetree/bindings/arm/amlogic.txt
    Documentation/devicetree/bindings/arm/atmel-at91.txt
    Documentation/devicetree/bindings/arm/ccn.txt
    Documentation/devicetree/bindings/arm/cpus.txt
    Documentation/devicetree/bindings/arm/gemini.txt
Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt
Documentation/devicetree/bindings/arm/keystone/keystone.txt
    Documentation/devicetree/bindings/arm/mediatek.txt
    Documentation/devicetree/bindings/arm/rockchip.txt
    Documentation/devicetree/bindings/arm/shmobile.txt
    Documentation/devicetree/bindings/arm/tegra.txt
Documentation/devicetree/bindings/ata/ahci-fsl-qoriq.txt
Documentation/devicetree/bindings/bus/brcm,gisb-arb.txt
Documentation/devicetree/bindings/clock/brcm,iproc-clocks.txt

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski

Almost there

Bisecting: 6 revisions left to test after this (roughly 3 steps)
[ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly
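
For readers unfamiliar with how bisection narrows the range, here is a minimal sketch of the idea in Python. It assumes a simplified linear history in which every commit after the first bad one is also bad; real kernel history is a graph with merges, and git handles that with more machinery. The commit names, the culprit index, and the `find_first_bad` helper are hypothetical, for illustration only.

```python
from typing import Callable, Sequence, Tuple


def find_first_bad(commits: Sequence[str],
                   is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return the first bad commit and the number of tests that were run.

    Assumes commits are ordered oldest to newest, the newest commit is
    known to be bad, and badness never disappears once introduced.
    """
    lo, hi = 0, len(commits) - 1  # hi always points at a known-bad commit
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid          # culprit is mid or something older
        else:
            lo = mid + 1      # culprit is strictly newer than mid
    return commits[hi], tests


# Hypothetical linear history of 1787 commits with a made-up culprit.
commits = [f"c{i:04d}" for i in range(1787)]
culprit_index = 1234
first_bad, tests = find_first_bad(
    commits, lambda name: int(name[1:]) >= culprit_index)
print(first_bad, tests)
```

Each test halves the remaining range, which is why a single wrong "good" or "bad" mark sends the search into the wrong half and cannot be recovered without redoing that step.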






Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski

OK, resumed, and so far:

Panic:

# bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024.
git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f

No panic:

# good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure'
git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20



Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski

So far bisected and marked:

git bisect start
# bad: [07dd6cc1fff160143e82cf5df78c1db0b6e03355] Linux 4.13.2
git bisect bad 07dd6cc1fff160143e82cf5df78c1db0b6e03355
# good: [5d7d2e03e0f01a992e3521b180c3d3e67905f269] Linux 4.12.13
git bisect good 5d7d2e03e0f01a992e3521b180c3d3e67905f269
# good: [6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c] Linux 4.12
git bisect good 6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c
# bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31




Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski
OK, the kernel crashed with a different panic that I didn't catch while 
I was bisecting, and now my bisection is broken :)


git bisect good
Bisecting: 1787 revisions left to test after this (roughly 11 steps)
error: Your local changes to the following files would be overwritten by checkout:

    Documentation/00-INDEX
    Documentation/ABI/stable/sysfs-class-udc
    Documentation/ABI/testing/configfs-usb-gadget-uac1
    Documentation/ABI/testing/ima_policy
    Documentation/ABI/testing/sysfs-bus-iio
    Documentation/ABI/testing/sysfs-bus-iio-meas-spec
    Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
    Documentation/ABI/testing/sysfs-class-net
    Documentation/ABI/testing/sysfs-class-power-twl4030
    Documentation/ABI/testing/sysfs-class-typec
    Documentation/DMA-API.txt
    Documentation/IRQ-domain.txt
    Documentation/Makefile
    Documentation/PCI/MSI-HOWTO.txt
    Documentation/RCU/00-INDEX
    Documentation/RCU/Design/Requirements/Requirements.html
    Documentation/RCU/checklist.txt
    Documentation/admin-guide/README.rst
    Documentation/admin-guide/devices.txt
    Documentation/admin-guide/index.rst
    Documentation/admin-guide/kernel-parameters.txt
    Documentation/admin-guide/pm/cpufreq.rst
    Documentation/admin-guide/pm/intel_pstate.rst
    Documentation/admin-guide/ras.rst
    Documentation/arm/Atmel/README
    Documentation/block/biodoc.txt
    Documentation/conf.py
    Documentation/core-api/assoc_array.rst
    Documentation/core-api/atomic_ops.rst
    Documentation/core-api/index.rst
    Documentation/crypto/asymmetric-keys.txt
    Documentation/dev-tools/index.rst
    Documentation/dev-tools/sparse.rst
    Documentation/devicetree/bindings/arm/amlogic.txt
    Documentation/devicetree/bindings/arm/atmel-at91.txt
    Documentation/devicetree/bindings/arm/ccn.txt
    Documentation/devicetree/bindings/arm/cpus.txt
    Documentation/devicetree/bindings/arm/gemini.txt
    Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt
    Documentation/devicetree/bindings/arm/keystone/keystone.txt
    Documentation/devicetree/bindings/arm/mediatek.txt
    Documentation/devicetree/bindings/arm/rockchip.txt
    Documentation/devicetree/bindings/arm/shmobile.txt
    Documentation/devicetree/bindings/arm/tegra.txt
    Documentation/devicetree/bindings/ata/ahci-fsl-qoriq.txt
    Documentation/devicetree/bindings/bus/brcm,gisb-arb.txt
    Documentation/devicetree/bindings/clock/brcm,iproc-clocks.txt
    Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt
    Documentation/devicetree/bindings/gpio/gpio_atmel.txt
    Documentation/devicetree/bindings/iio/adc/amlogic,meson-saradc.txt
    Documentation/devicetree/bindings/iio/adc/renesas,gyroadc.txt
    Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt
    Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt
    Documentation/devicetree/bindings/interrupt-controller/allwinner,sunxi-nmi.txt
    Documentation/devicetree/bindings/interrupt-controller/aspeed,ast2400-vic.txt
    Documentation/devicetree/bindings/interrupt-controller/mediatek,sysirq.txt
    Documentation/devicetree/bindings/leds/common.txt
    Documentation/devicetree/bindings/mfd/hi6421.txt
    Documentation/devicetree/bindings/mfd/tps65910.txt
    Documentation/devicetree/bindings/mmc/fsl-esdhc.txt
    Documentation/devicetree/bindings/mmc/k3-dw-mshc.txt
    Documentation/devicetree/bindings/mmc/rockchip-dw-mshc.txt
    Documentation/devicetree/bindings/mmc/ti-omap-hsmmc.txt
    Documentation/devicetree/bindings/mtd/atmel-nand.txt
    Documentation/devicetree/bindings/net/dsa/b53.txt
    Documentation/devicetree/bindings/net/ethernet.txt
    Documentation/devicetree/bindings/net/macb.txt
    Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
    Documentation/devicetree/bindings/net/ti,wilink-st.txt
    Documentation/devicetree/bindings/net/wireless/ti,wlcore.txt
    Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt
    Documentation/devicetree/bindings/opp/opp.txt
    Documentation/devicetree/bindings/phy/bcm-ns-usb3-phy.txt
    Documentation/devicetree/bindings/phy/brcm-sata-phy.txt
    Documentation/devicetree/bindings/phy/meson8b-usb2-phy.txt
    Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt
    Documentation/devicetree/bindings/power/rockchip-io-domain.txt
    Documentation/devicetree/bindings/power/supply/bq27xxx.txt
    Documentation/devicetree/bindings/property-units.txt
    Documentation/devicetree/bindings/regulator/regulator.txt
    Documentation/devicetree/bindings/serial/8
error: The following untracked working tree files would be overwritten by checkout:

    Documentation/ABI/testing/sysfs-class-net-phydev
    Documentation/DocBook/.gitignore
    

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski

OK, looks like the bisection is ending.


The latest bisected kernel, 4.12.0+ (from next), shows no kernel panic, 
only this warning:


[  309.030019] NETDEV WATCHDOG: enp4s0f0 (ixgbe): transmit queue 0 timed out
[  309.030034] ------------[ cut here ]------------
[  309.030040] WARNING: CPU: 35 PID: 0 at dev_watchdog+0xcf/0x139
[  309.030041] Modules linked in: bonding ipmi_si x86_pkg_temp_thermal
[  309.030045] CPU: 35 PID: 0 Comm: swapper/35 Not tainted 4.12.0+ #5
[  309.030046] task: 88086d98a000 task.stack: c90003378000
[  309.030048] RIP: 0010:dev_watchdog+0xcf/0x139
[  309.030049] RSP: 0018:88087fbc3ea8 EFLAGS: 00010246
[  309.030050] RAX: 003d RBX: 88046b68 RCX:
[  309.030050] RDX: 88087fbd2f01 RSI:  RDI: 88087fbcda08
[  309.030051] RBP: 88087fbc3eb8 R08:  R09: 88087ff80a04
[  309.030051] R10:  R11: 88086d98a001 R12:
[  309.030052] R13: 88087fbc3ef8 R14: 88086d98a000 R15: 81c06008
[  309.030053] FS:  () GS:88087fbc() knlGS:
[  309.030054] CS:  0010 DS:  ES:  CR0: 80050033
[  309.030054] CR2: 7fba600f6098 CR3: 00086b955000 CR4: 001406e0
[  309.030055] Call Trace:
[  309.030057]  <IRQ>
[  309.030059]  ? netif_tx_lock+0x79/0x79
[  309.030062]  call_timer_fn.isra.24+0x17/0x77
[  309.030063]  run_timer_softirq+0x118/0x161
[  309.030065]  ? netif_tx_lock+0x79/0x79
[  309.030066]  ? ktime_get+0x2b/0x42
[  309.030070]  ? lapic_next_deadline+0x21/0x27
[  309.030073]  ? clockevents_program_event+0xa8/0xc5
[  309.030076]  __do_softirq+0xa8/0x19d
[  309.030078]  irq_exit+0x5d/0x6b
[  309.030079]  smp_apic_timer_interrupt+0x2a/0x36
[  309.030082]  apic_timer_interrupt+0x89/0x90
[  309.030085] RIP: 0010:mwait_idle+0x4e/0x6a
[  309.030086] RSP: 0018:c9000337be98 EFLAGS: 0246 ORIG_RAX: ff10
[  309.030087] RAX:  RBX:  RCX:
[  309.030087] RDX:  RSI:  RDI: 88086d98a000
[  309.030088] RBP: c9000337be98 R08: 88046f8279a0 R09: 88046f827040
[  309.030089] R10: 88086d98a000 R11: 88086d98a000 R12:
[  309.030089] R13: 88086d98a000 R14: 88086d98a000 R15: 88086d98a000
[  309.030090]  </IRQ>
[  309.030094]  arch_cpu_idle+0xa/0xc
[  309.030095]  default_idle_call+0x19/0x1b
[  309.030102]  do_idle+0xbc/0x196
[  309.030104]  cpu_startup_entry+0x1d/0x20
[  309.030105]  start_secondary+0xd8/0xdc
[  309.030108]  secondary_startup_64+0x9f/0x9f
[  309.030109] Code: cc 75 bd eb 35 48 89 df c6 05 c3 dc 74 00 01 e8 3a 62 fe ff 44 89 e1 48 89 de 48 89 c2 48 c7 c7 0f 65 a4 81 31 c0 e8 3d 4c b5 ff <0f> ff 48 8b 83 e0 01 00 00 48 89 df ff 50 78 48 8b 05 a0 bc 6a
[  309.030128] ---[ end trace 9102cb25703ae2d9 ]---


I just marked it as good, because the problem above is a different one, 
and I'm going on with:


git bisect good
Bisecting: 1787 revisions left to test after this (roughly 11 steps)
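
As a rough sanity check on the "roughly 11 steps" figure: each bisect step about halves the remaining range, so the number of further tests is close to the base-2 logarithm of the revisions left. The sketch below is only an approximation in Python; git computes its own estimate with a slightly different heuristic, and the function name is made up for illustration.

```python
import math


def estimated_bisect_steps(revisions_left: int) -> int:
    """Approximate how many more tests isolate a single commit.

    Assumes every test splits the remaining range roughly in half.
    """
    if revisions_left < 1:
        return 0
    return math.ceil(math.log2(revisions_left))


# The two counts reported by git in this thread.
print(estimated_bisect_steps(1787))  # 11
print(estimated_bisect_steps(6))     # 3
```

Both values agree with what git printed here (1787 revisions, roughly 11 steps; 6 revisions, roughly 3 steps).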





Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski

Trying to make a video from IPMI :)

with these results:

https://bugzilla.kernel.org/attachment.cgi?id=258521

Caught two more lines where it starts: the panic from 4.13.2.


Now I will try to do some bisection




Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski

Hi


Will try bisecting tonight




Re: Latest net-next from GIT panic

2017-09-19 Thread Eric Dumazet
On Wed, 2017-09-20 at 02:06 +0200, Paweł Staszewski wrote:
> Just checked kernel 4.13.2: same problem.
> 
> Right after all 6 BGP sessions start and the kernel begins learning 
> routes, it panics.
> 
> https://bugzilla.kernel.org/attachment.cgi?id=258509
> 


Unfortunately, these traces do not give us enough information.

Can you get a full stack trace ?

Alternatively, can you bisect ?

Thanks.




Re: Latest net-next from GIT panic

2017-09-19 Thread Paweł Staszewski

The latest working kernel, with the same configuration and kernel config, is 4.12.13.

There is no panic after the routes from all 6 BGP sessions are learned.

ip r | wc -l
653112





Re: Latest net-next from GIT panic

2017-09-19 Thread Paweł Staszewski

Just checked kernel 4.13.2: same problem.

Right after all 6 BGP sessions start and the kernel begins learning 
routes, it panics.


https://bugzilla.kernel.org/attachment.cgi?id=258509




Re: Latest net-next from GIT panic

2017-09-19 Thread Paweł Staszewski

Some information about the environment:
The server is acting as an IP router with BGP.
There are 6 BGP sessions, each with a full BGP table of ~600k prefixes.

It looks like the panic appears after the BGP sessions are connected, 
not because of traffic: at the time the panic occurred there was almost 
no traffic.


Also, when I run this server without turning on BGP and push traffic 
through it with pktgen, there is no panic.


It panics just after it learns the routes.








Re: Latest net-next from GIT panic

2017-09-19 Thread Paweł Staszewski
Added a few more screenshots from kernels 4.14-rc1 (net-next) and 
4.14-rc1 (linux-next)


https://bugzilla.kernel.org/show_bug.cgi?id=197005


On 2017-09-20 at 00:35, Paweł Staszewski wrote:

Just tried the latest net-next git and found a kernel panic.

Link to bugzilla below.

https://bugzilla.kernel.org/attachment.cgi?id=258499