Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))

2018-06-03 Thread Masanobu SAITOH

On 2018/06/03 18:20, 6b...@6bone.informatik.uni-leipzig.de wrote:

Hello,

I have applied

 http://www.netbsd.org/~msaitoh/ixgbe-eitr-20180522-0.dif
and
 http://www.netbsd.org/~msaitoh/ixgbe-norearm-20180530-0.dif

to netbsd-8 RC1. With these patches the problem seems to be solved.


 Thanks. I've committed the latest patch now!




Thank you for your efforts

Regards
Uwe


On Fri, 1 Jun 2018, Masanobu SAITOH wrote:


Date: Fri, 1 Jun 2018 12:47:32 +0900
From: Masanobu SAITOH 
To: 6b...@6bone.informatik.uni-leipzig.de
Cc: msai...@execsw.org, Martin Husemann ,
    current-users@netbsd.org
Subject: Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg
    drivers (?))




  The same diff is at:

 http://www.netbsd.org/~msaitoh/ixgbe-norearm-20180530-0.dif



Updated patch (Fix compile error and ixv patch):

--
Don't call ixgbe_rearm_queues() in ixgbe_local_timer1(). ixgbe_enable_queue()
and ixgbe_disable_queue() try to enable/disable the queue interrupt safely,
using an internal counter. When a queue's MSI-X interrupt is received,
ixgbe_msix_que() is called (IPL_NET). This function disables the queue's
interrupt with ixgbe_disable_queue() and schedules a softint. The softint
(IPL_SOFTNET) then runs the ixgbe_handle() queue routine, which processes TX
and RX and calls ixgbe_enable_queue() at the end.

ixgbe_local_timer1() is a callout and always runs on CPU 0 (IPL_SOFTCLOCK).
When ixgbe_rearm_queues() is called, an MSI-X interrupt is issued for a
specific queue. It may not be on CPU 0. If this interrupt's ixgbe_msix_que()
runs and calls softint_schedule() before the previous softint's
softint_execute() has been called, the softint_schedule() fails because of
SOFTINT_PENDING. This results in breaking ixgbe_{enable,disable}_queue()'s
internal counter.

ixgbe_local_timer1() is written not to call ixgbe_rearm_queues() if
the interrupt is disabled, but it is called anyway because of an unknown bug
or a race.

One solution is not to use the internal counter, but that is a little
difficult. Another solution is to stop using ixgbe_rearm_queues() altogether.
Essentially, ixgbe_rearm_queues() is not required (it was added in ixgbe.c
rev. 1.43 (2016/12/01)). ixgbe_rearm_queues() helps with the lost-interrupt
problem, but I've never seen a lost interrupt other than the ones caused by
the ixgbe_rearm_queues() problem.


Index: ixgbe.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/ixgbe/ixgbe.c,v
retrieving revision 1.158
diff -u -p -r1.158 ixgbe.c
--- ixgbe.c	30 May 2018 09:17:17 -0000	1.158
+++ ixgbe.c	1 Jun 2018 03:22:05 -0000
@@ -4411,6 +4411,7 @@ ixgbe_local_timer1(void *arg)
 	/* Only truely watchdog if all queues show hung */
 	if (hung == adapter->num_queues)
 		goto watchdog;
+#if 0 /* XXX Avoid unexpectedly disabling interrupt forever (PR#53294) */
 	else if (queues != 0) { /* Force an IRQ on queues with work */
 		que = adapter->queues;
 		for (i = 0; i < adapter->num_queues; i++, que++) {
@@ -4421,6 +4422,7 @@ ixgbe_local_timer1(void *arg)
 			mutex_exit(&que->dc_mtx);
 		}
 	}
+#endif
 
 out:
 	callout_reset(&adapter->timer, hz, ixgbe_local_timer, adapter);
@@ -6643,7 +6645,7 @@ ixgbe_handle_link(void *context)
 /************************************************************************
  * ixgbe_rearm_queues
  ************************************************************************/
-static void
+static __inline void
 ixgbe_rearm_queues(struct adapter *adapter, u64 queues)
 {
 	u32 mask;
Index: ixv.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/ixgbe/ixv.c,v
retrieving revision 1.102
diff -u -p -r1.102 ixv.c
--- ixv.c	30 May 2018 08:35:26 -0000	1.102
+++ ixv.c	1 Jun 2018 03:22:05 -0000
@@ -1266,9 +1266,11 @@ ixv_local_timer_locked(void *arg)
 	/* Only truly watchdog if all queues show hung */
 	if (hung == adapter->num_queues)
 		goto watchdog;
+#if 0
 	else if (queues != 0) { /* Force an IRQ on queues with work */
 		ixv_rearm_queues(adapter, queues);
 	}
+#endif
 
 	callout_reset(&adapter->timer, hz, ixv_local_timer, adapter);
 
--

The same diff is at:

http://www.netbsd.org/~msaitoh/ixgbe-norearm-20180531-0.dif

--
---
   SAITOH Masanobu (msai...@execsw.org
    msai...@netbsd.org)




--
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)


Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))

2018-06-03 Thread 6bone

Hello,

I have applied

http://www.netbsd.org/~msaitoh/ixgbe-eitr-20180522-0.dif
and
http://www.netbsd.org/~msaitoh/ixgbe-norearm-20180530-0.dif

to netbsd-8 RC1. With these patches the problem seems to be solved.


Thank you for your efforts

Regards
Uwe



Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))

2018-05-31 Thread Masanobu SAITOH





  The same diff is at:

 http://www.netbsd.org/~msaitoh/ixgbe-norearm-20180530-0.dif



Updated patch (Fix compile error and ixv patch):

--
 Don't call ixgbe_rearm_queues() in ixgbe_local_timer1(). ixgbe_enable_queue()
and ixgbe_disable_queue() try to enable/disable the queue interrupt safely,
using an internal counter. When a queue's MSI-X interrupt is received,
ixgbe_msix_que() is called (IPL_NET). This function disables the queue's
interrupt with ixgbe_disable_queue() and schedules a softint. The softint
(IPL_SOFTNET) then runs the ixgbe_handle() queue routine, which processes TX
and RX and calls ixgbe_enable_queue() at the end.

 ixgbe_local_timer1() is a callout and always runs on CPU 0 (IPL_SOFTCLOCK).
When ixgbe_rearm_queues() is called, an MSI-X interrupt is issued for a
specific queue. It may not be on CPU 0. If this interrupt's ixgbe_msix_que()
runs and calls softint_schedule() before the previous softint's
softint_execute() has been called, the softint_schedule() fails because of
SOFTINT_PENDING. This results in breaking ixgbe_{enable,disable}_queue()'s
internal counter.

 ixgbe_local_timer1() is written not to call ixgbe_rearm_queues() if
the interrupt is disabled, but it is called anyway because of an unknown bug
or a race.

 One solution is not to use the internal counter, but that is a little
difficult. Another solution is to stop using ixgbe_rearm_queues() altogether.
Essentially, ixgbe_rearm_queues() is not required (it was added in ixgbe.c
rev. 1.43 (2016/12/01)). ixgbe_rearm_queues() helps with the lost-interrupt
problem, but I've never seen a lost interrupt other than the ones caused by
the ixgbe_rearm_queues() problem.


Index: ixgbe.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/ixgbe/ixgbe.c,v
retrieving revision 1.158
diff -u -p -r1.158 ixgbe.c
--- ixgbe.c	30 May 2018 09:17:17 -0000	1.158
+++ ixgbe.c	1 Jun 2018 03:22:05 -0000
@@ -4411,6 +4411,7 @@ ixgbe_local_timer1(void *arg)
 	/* Only truely watchdog if all queues show hung */
 	if (hung == adapter->num_queues)
 		goto watchdog;
+#if 0 /* XXX Avoid unexpectedly disabling interrupt forever (PR#53294) */
 	else if (queues != 0) { /* Force an IRQ on queues with work */
 		que = adapter->queues;
 		for (i = 0; i < adapter->num_queues; i++, que++) {
@@ -4421,6 +4422,7 @@ ixgbe_local_timer1(void *arg)
 			mutex_exit(&que->dc_mtx);
 		}
 	}
+#endif
 
 out:
 	callout_reset(&adapter->timer, hz, ixgbe_local_timer, adapter);
@@ -6643,7 +6645,7 @@ ixgbe_handle_link(void *context)
 /************************************************************************
  * ixgbe_rearm_queues
  ************************************************************************/
-static void
+static __inline void
 ixgbe_rearm_queues(struct adapter *adapter, u64 queues)
 {
 	u32 mask;
Index: ixv.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/ixgbe/ixv.c,v
retrieving revision 1.102
diff -u -p -r1.102 ixv.c
--- ixv.c	30 May 2018 08:35:26 -0000	1.102
+++ ixv.c	1 Jun 2018 03:22:05 -0000
@@ -1266,9 +1266,11 @@ ixv_local_timer_locked(void *arg)
 	/* Only truly watchdog if all queues show hung */
 	if (hung == adapter->num_queues)
 		goto watchdog;
+#if 0
 	else if (queues != 0) { /* Force an IRQ on queues with work */
 		ixv_rearm_queues(adapter, queues);
 	}
+#endif
 
 	callout_reset(&adapter->timer, hz, ixv_local_timer, adapter);
 
--


The same diff is at:

http://www.netbsd.org/~msaitoh/ixgbe-norearm-20180531-0.dif

--
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)


Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))

2018-05-31 Thread Masanobu SAITOH

 Hi, all.

 New patch:

---
 Don't call ixgbe_rearm_queues() in ixgbe_local_timer1(). ixgbe_enable_queue()
and ixgbe_disable_queue() try to enable/disable the queue interrupt safely,
using an internal counter. When a queue's MSI-X interrupt is received,
ixgbe_msix_que() is called (IPL_NET). This function disables the queue's
interrupt with ixgbe_disable_queue() and schedules a softint. The softint
(IPL_SOFTNET) then runs the ixgbe_handle() queue routine, which processes TX
and RX and calls ixgbe_enable_queue() at the end.

 ixgbe_local_timer1() is a callout and always runs on CPU 0 (IPL_SOFTCLOCK).
When ixgbe_rearm_queues() is called, an MSI-X interrupt is issued for a
specific queue. It may not be on CPU 0. If this interrupt's ixgbe_msix_que()
runs and calls softint_schedule() before the previous softint's
softint_execute() has been called, the softint_schedule() fails because of
SOFTINT_PENDING. This results in breaking ixgbe_{enable,disable}_queue()'s
internal counter.

 ixgbe_local_timer1() is written not to call ixgbe_rearm_queues() if
the interrupt is disabled, but it is called anyway because of an unknown bug
or a race.

 One solution is not to use the internal counter, but that is a little
difficult. Another solution is to stop using ixgbe_rearm_queues() altogether.
Essentially, ixgbe_rearm_queues() is not required (it was added in ixgbe.c
rev. 1.43 (2016/12/01)). ixgbe_rearm_queues() helps with the lost-interrupt
problem, but I've never seen a lost interrupt other than the ones caused by
the ixgbe_rearm_queues() problem.


Index: ixgbe.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/ixgbe/ixgbe.c,v
retrieving revision 1.158
diff -u -p -r1.158 ixgbe.c
--- ixgbe.c	30 May 2018 09:17:17 -0000	1.158
+++ ixgbe.c	31 May 2018 09:51:19 -0000
@@ -4411,6 +4411,7 @@ ixgbe_local_timer1(void *arg)
 	/* Only truely watchdog if all queues show hung */
 	if (hung == adapter->num_queues)
 		goto watchdog;
+#if 0 /* XXX Avoid unexpectedly disabling interrupt forever (PR#53294) */
 	else if (queues != 0) { /* Force an IRQ on queues with work */
 		que = adapter->queues;
 		for (i = 0; i < adapter->num_queues; i++, que++) {
@@ -4421,6 +4422,7 @@ ixgbe_local_timer1(void *arg)
 			mutex_exit(&que->dc_mtx);
 		}
 	}
+#endif
 
 out:
 	callout_reset(&adapter->timer, hz, ixgbe_local_timer, adapter);
---

 The same diff is at:

http://www.netbsd.org/~msaitoh/ixgbe-norearm-20180530-0.dif


--
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)


Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))

2018-05-28 Thread Masanobu SAITOH

Hello, Uwe.

On 2018/05/29 14:59, 6b...@6bone.informatik.uni-leipzig.de wrote:

Hello,

I have tested the patch with netbsd-8. The problem is not solved.


Thanks.

 It seems that whether this problem occurs depends on the
hardware configuration. I've never seen it on some
machines.

 Today, I was able to set up a system on which this RX stall problem occurs
quickly (within a few minutes). I don't know whether I can fix it soon.


 Thanks.





Regards
Uwe


On Mon, 28 May 2018, Masanobu SAITOH wrote:


Date: Mon, 28 May 2018 17:10:02 +0900
From: Masanobu SAITOH 
To: Martin Husemann ,
    6b...@6bone.informatik.uni-leipzig.de, current-users@netbsd.org
Cc: msai...@execsw.org
Subject: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers
 (?))

On 2018/05/28 16:51, Martin Husemann wrote:

On Mon, May 28, 2018 at 09:46:21AM +0200, 6b...@6bone.informatik.uni-leipzig.de 
wrote:

Hello,

At the weekend I tried to update to a current version of netbsd-8 rc1.

After the restart, the kernel works for a few hours. After that, no
packets arrive at the network card.



If you are using ixg(4) on netbsd-8 or -current, please try the following patch:

http://www.netbsd.org/~msaitoh/ixgbe-eitr-20180522-0.dif

This change might fix the RX stall problem. If you get a TX device timeout or
an RX stall, please report it with the output of:

sysctl hw | grep ixg


Regards.




The server is running normally. There are no hints in dmesg.

Some network programs report issues:

zebra[371]: rtm_write: write : No buffer space available (55)

syslogd[541]: recvfrom() unix `/var/run/log': No buffer space available

gate zebra[1423]: routing socket error: No buffer space available


You are seeing two different issues here. The "No buffer space" errors are
considered harmless (they used to be silent, but the lossage should be the same).

The problem of ixg(4) no longer receiving packets is under investigation; RC2
is waiting for a proposed patch that is being tested.

Martin




--
---
   SAITOH Masanobu (msai...@execsw.org
    msai...@netbsd.org)




--
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)


Re: ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))

2018-05-28 Thread 6bone

Hello,

I have tested the patch with netbsd-8. The problem is not solved.


Regards
Uwe




ixg tester needed (was Re: Problems with netbsd-8 RC1 and ixg drivers (?))

2018-05-28 Thread Masanobu SAITOH

On 2018/05/28 16:51, Martin Husemann wrote:

On Mon, May 28, 2018 at 09:46:21AM +0200, 6b...@6bone.informatik.uni-leipzig.de 
wrote:

Hello,

At the weekend I tried to update to a current version of netbsd-8 rc1.

After the restart, the kernel works for a few hours. After that, no
packets arrive at the network card.



 If you are using ixg(4) on netbsd-8 or -current, please try the following patch:

http://www.netbsd.org/~msaitoh/ixgbe-eitr-20180522-0.dif

This change might fix the RX stall problem. If you get a TX device timeout or
an RX stall, please report it with the output of:

sysctl hw | grep ixg


 Regards.




The server is running normally. There are no hints in dmesg.

Some network programs report issues:

zebra[371]: rtm_write: write : No buffer space available (55)

syslogd[541]: recvfrom() unix `/var/run/log': No buffer space available

gate zebra[1423]: routing socket error: No buffer space available


You are seeing two different issues here. The "No buffer space" errors are
considered harmless (they used to be silent, but the lossage should be the same).

The problem of ixg(4) no longer receiving packets is under investigation; RC2
is waiting for a proposed patch that is being tested.

Martin




--
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)