Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))
On 2018/06/03 18:20, 6b...@6bone.informatik.uni-leipzig.de wrote:
> Hello,
>
> I have applied
>
>   http://www.netbsd.org/~msaitoh/ixgbe-eitr-20180522-0.dif
>
> and
>
>   http://www.netbsd.org/~msaitoh/ixgbe-norearm-20180530-0.dif
>
> to netbsd-8 RC1. With these patches the problem seems to be solved.

Thanks. I've committed the latest patch now!

> Thank you for your efforts
>
> Regards
> Uwe
>
> On Fri, 1 Jun 2018, Masanobu SAITOH wrote:
>
>> Date: Fri, 1 Jun 2018 12:47:32 +0900
>> From: Masanobu SAITOH
>> To: 6b...@6bone.informatik.uni-leipzig.de
>> Cc: msai...@execsw.org, Martin Husemann, current-users@netbsd.org
>> Subject: Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))
>>
>> Updated patch (Fix compile error and ixv patch):
>> [...]

--
SAITOH Masanobu (msai...@execsw.org msai...@netbsd.org)
Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))
Hello,

I have applied

  http://www.netbsd.org/~msaitoh/ixgbe-eitr-20180522-0.dif

and

  http://www.netbsd.org/~msaitoh/ixgbe-norearm-20180530-0.dif

to netbsd-8 RC1. With these patches the problem seems to be solved.

Thank you for your efforts

Regards
Uwe

On Fri, 1 Jun 2018, Masanobu SAITOH wrote:

> Date: Fri, 1 Jun 2018 12:47:32 +0900
> From: Masanobu SAITOH
> To: 6b...@6bone.informatik.uni-leipzig.de
> Cc: msai...@execsw.org, Martin Husemann, current-users@netbsd.org
> Subject: Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))
>
> Updated patch (Fix compile error and ixv patch):
> [...]
Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))
> The same diff is at:
>
>   http://www.netbsd.org/~msaitoh/ixgbe-norearm-20180530-0.dif

Updated patch (Fix compile error and ixv patch):

----------------------------------------
Don't call ixgbe_rearm_queues() in ixgbe_local_timer1().

ixgbe_enable_queue() and ixgbe_disable_queue() try to enable/disable the queue interrupt safely. They keep an internal counter.

When a queue's MSI-X interrupt is received, ixgbe_msix_que() is called (IPL_NET). This function disables the queue's interrupt with ixgbe_disable_queue() and schedules a softint. ixgbe_handle_que() is called by the softint (IPL_SOFTNET), processes TX and RX, and calls ixgbe_enable_queue() at the end.

ixgbe_local_timer1() is a callout and always runs on CPU 0 (IPL_SOFTCLOCK). When ixgbe_rearm_queues() is called, an MSI-X interrupt is issued for a specific queue. It may not be delivered to CPU 0. If this interrupt's ixgbe_msix_que() is called and softint_schedule() is called before the previous softint's softint_execute() has been called, the softint_schedule() fails because of SOFTINT_PENDING. This breaks the internal counter of ixgbe_{enable,disable}_queue().

ixgbe_local_timer1() is written not to call ixgbe_rearm_queues() if the interrupt is disabled, but it is called anyway because of an unknown bug or a race.

One solution is to not use the internal counter, but that is a little difficult. Another solution is to stop using ixgbe_rearm_queues() altogether. Essentially, ixgbe_rearm_queues() is not required (it was added in ixgbe.c rev. 1.43 (2016/12/01)). ixgbe_rearm_queues() helps with the lost interrupt problem, but I've never seen a lost interrupt other than the one caused by ixgbe_rearm_queues() itself.

Index: ixgbe.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/ixgbe/ixgbe.c,v
retrieving revision 1.158
diff -u -p -r1.158 ixgbe.c
--- ixgbe.c	30 May 2018 09:17:17 -0000	1.158
+++ ixgbe.c	1 Jun 2018 03:22:05 -0000
@@ -4411,6 +4411,7 @@ ixgbe_local_timer1(void *arg)
 	/* Only truely watchdog if all queues show hung */
 	if (hung == adapter->num_queues)
 		goto watchdog;
+#if 0 /* XXX Avoid unexpectedly disabling interrupt forever (PR#53294) */
 	else if (queues != 0) { /* Force an IRQ on queues with work */
 		que = adapter->queues;
 		for (i = 0; i < adapter->num_queues; i++, que++) {
@@ -4421,6 +4422,7 @@ ixgbe_local_timer1(void *arg)
 			mutex_exit(&que->dc_mtx);
 		}
 	}
+#endif
 
 out:
 	callout_reset(&adapter->timer, hz, ixgbe_local_timer, adapter);
@@ -6643,7 +6645,7 @@ ixgbe_handle_link(void *context)
 /************************************************************************
  * ixgbe_rearm_queues
  ************************************************************************/
-static void
+static __inline void
 ixgbe_rearm_queues(struct adapter *adapter, u64 queues)
 {
 	u32 mask;
Index: ixv.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/ixgbe/ixv.c,v
retrieving revision 1.102
diff -u -p -r1.102 ixv.c
--- ixv.c	30 May 2018 08:35:26 -0000	1.102
+++ ixv.c	1 Jun 2018 03:22:05 -0000
@@ -1266,9 +1266,11 @@ ixv_local_timer_locked(void *arg)
 	/* Only truly watchdog if all queues show hung */
 	if (hung == adapter->num_queues)
 		goto watchdog;
+#if 0
 	else if (queues != 0) { /* Force an IRQ on queues with work */
 		ixv_rearm_queues(adapter, queues);
 	}
+#endif
 
 	callout_reset(&adapter->timer, hz, ixv_local_timer, adapter);
----------------------------------------

The same diff is at:

  http://www.netbsd.org/~msaitoh/ixgbe-norearm-20180531-0.dif

--
SAITOH Masanobu (msai...@execsw.org msai...@netbsd.org)
Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))
Hi, all.

New patch:

----------------------------------------
Don't call ixgbe_rearm_queues() in ixgbe_local_timer1().

ixgbe_enable_queue() and ixgbe_disable_queue() try to enable/disable the queue interrupt safely. They keep an internal counter.

When a queue's MSI-X interrupt is received, ixgbe_msix_que() is called (IPL_NET). This function disables the queue's interrupt with ixgbe_disable_queue() and schedules a softint. ixgbe_handle_que() is called by the softint (IPL_SOFTNET), processes TX and RX, and calls ixgbe_enable_queue() at the end.

ixgbe_local_timer1() is a callout and always runs on CPU 0 (IPL_SOFTCLOCK). When ixgbe_rearm_queues() is called, an MSI-X interrupt is issued for a specific queue. It may not be delivered to CPU 0. If this interrupt's ixgbe_msix_que() is called and softint_schedule() is called before the previous softint's softint_execute() has been called, the softint_schedule() fails because of SOFTINT_PENDING. This breaks the internal counter of ixgbe_{enable,disable}_queue().

ixgbe_local_timer1() is written not to call ixgbe_rearm_queues() if the interrupt is disabled, but it is called anyway because of an unknown bug or a race.

One solution is to not use the internal counter, but that is a little difficult. Another solution is to stop using ixgbe_rearm_queues() altogether. Essentially, ixgbe_rearm_queues() is not required (it was added in ixgbe.c rev. 1.43 (2016/12/01)). ixgbe_rearm_queues() helps with the lost interrupt problem, but I've never seen a lost interrupt other than the one caused by ixgbe_rearm_queues() itself.

Index: ixgbe.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/ixgbe/ixgbe.c,v
retrieving revision 1.158
diff -u -p -r1.158 ixgbe.c
--- ixgbe.c	30 May 2018 09:17:17 -0000	1.158
+++ ixgbe.c	31 May 2018 09:51:19 -0000
@@ -4411,6 +4411,7 @@ ixgbe_local_timer1(void *arg)
 	/* Only truely watchdog if all queues show hung */
 	if (hung == adapter->num_queues)
 		goto watchdog;
+#if 0 /* XXX Avoid unexpectedly disabling interrupt forever (PR#53294) */
 	else if (queues != 0) { /* Force an IRQ on queues with work */
 		que = adapter->queues;
 		for (i = 0; i < adapter->num_queues; i++, que++) {
@@ -4421,6 +4422,7 @@ ixgbe_local_timer1(void *arg)
 			mutex_exit(&que->dc_mtx);
 		}
 	}
+#endif
 
 out:
 	callout_reset(&adapter->timer, hz, ixgbe_local_timer, adapter);
----------------------------------------

The same diff is at:

  http://www.netbsd.org/~msaitoh/ixgbe-norearm-20180530-0.dif

--
SAITOH Masanobu (msai...@execsw.org msai...@netbsd.org)
Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))
Hello, Uwe.

On 2018/05/29 14:59, 6b...@6bone.informatik.uni-leipzig.de wrote:
> Hello,
>
> I have tested the patch with netbsd-8. The problem is not solved.

Thanks.

It seems that whether this problem occurs depends on the hardware configuration. I've never seen this problem on some machines. Today, I was able to set up a system on which this RX stall problem occurs quickly (within a few minutes).

I don't know if I can fix this problem soon.

Thanks.

> Regards
> Uwe
>
> On Mon, 28 May 2018, Masanobu SAITOH wrote:
>
>> Date: Mon, 28 May 2018 17:10:02 +0900
>> From: Masanobu SAITOH
>> To: Martin Husemann, 6b...@6bone.informatik.uni-leipzig.de, current-users@netbsd.org
>> Cc: msai...@execsw.org
>> Subject: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))
>>
>> [...]

--
SAITOH Masanobu (msai...@execsw.org msai...@netbsd.org)
Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))
Hello,

I have tested the patch with netbsd-8. The problem is not solved.

Regards
Uwe

On Mon, 28 May 2018, Masanobu SAITOH wrote:

> Date: Mon, 28 May 2018 17:10:02 +0900
> From: Masanobu SAITOH
> To: Martin Husemann, 6b...@6bone.informatik.uni-leipzig.de, current-users@netbsd.org
> Cc: msai...@execsw.org
> Subject: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))
>
> [...]
ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))
On 2018/05/28 16:51, Martin Husemann wrote:
> On Mon, May 28, 2018 at 09:46:21AM +0200, 6b...@6bone.informatik.uni-leipzig.de wrote:
>> Hello,
>>
>> At the weekend I tried to update to a current version of netbsd-8 rc1. After the restart, the kernel will work for a few hours. After that, no packets will arrive at the network card.

If you are using ixg(4) on netbsd-8 or -current, please try the following patch:

  http://www.netbsd.org/~msaitoh/ixgbe-eitr-20180522-0.dif

This change might fix the RX stall problem. If you get a TX device timeout or an RX stall, please report it with the output of:

  sysctl hw | grep ixg

Regards.

>> The server is running normally. No hints in dmesg. Some network programs report issues:
>>
>> zebra[371]: rtm_write: write : No buffer space available (55)
>> syslogd[541]: recvfrom() unix `/var/run/log': No buffer space available
>> gate zebra[1423]: routing socket error: No buffer space available
>
> You are seeing two different issues here.
>
> The "No buffer space" is considered harmless (it used to be silent, but the lossage should be the same).
>
> The issue where ixg(4) stops receiving packets is under investigation; RC2 is waiting for a proposed patch to be tested.
>
> Martin

--
SAITOH Masanobu (msai...@execsw.org msai...@netbsd.org)