Re: kernel 4.18.5 Realtek 8111G network adapter stops responding under high system load

2018-09-27 Thread David Arendt
Hi,

Heiner Kallweit's patch seems to resolve the problem. The machine was
under heavy disk and network I/O load today and networking was
perfectly stable.

Bye,
David Arendt

On 9/25/18 11:03 PM, Heiner Kallweit wrote:
> On 19.09.2018 06:12, David Arendt wrote:
>> Hi,
>>
>> Thanks for the patch.
>>
>> I just applied it and the TxConfig register now contains 0x4f000f80.
>> The next day will show if it really solves the problem.
>>
>> Thanks in advance,
>> David Arendt
>>
>> On 9/19/18 12:30 AM, Maciej S. Szmigiero wrote:
>>> Hi,
>>>
>>> On 18.09.2018 12:23, David Arendt wrote:
>>>> Hi,
>>>>
>>>> Today I had the network adapter problems again.
>>>> So the patch doesn't seem to change anything regarding this problem.
>>>> This week my time is unfortunately very limited, but I will try to
>>>> find some time next weekend to look a bit more into the issue.
>>> If the problem is caused by missing TXCFG_AUTO_FIFO bit in TxConfig,
>>> as the register difference would suggest, then you can try applying
>>> the following patch (hack) on top of 4.18.8 that is already patched
>>> with commit f74dd480cf4e:
>>> --- a/drivers/net/ethernet/realtek/r8169.c
>>> +++ b/drivers/net/ethernet/realtek/r8169.c
>>> @@ -5043,7 +5043,8 @@
>>>  {
>>>  	/* Set DMA burst size and Interframe Gap Time */
>>>  	RTL_W32(tp, TxConfig, (TX_DMA_BURST << TxDMAShift) |
>>> -		(InterFrameGap << TxInterFrameGapShift));
>>> +		(InterFrameGap << TxInterFrameGapShift)
>>> +		| TXCFG_AUTO_FIFO);
>>>  }
>>>  
>>>  static void rtl_set_rx_max_size(struct rtl8169_private *tp)
>>>
>>> This hack will probably only work properly on RTL_GIGA_MAC_VER_40 or
>>> later NICs.
>>>
>>> Before running any tests please verify with "ethtool -d enp3s0" that
>>> TxConfig register now contains 0x4f000f80, as it did in the old,
>>> working driver version.
>>>
>>> If this does not help then a bisection will most likely be needed.
>>>
>>>> Thanks in advance,
>>>> David Arendt
>>> Maciej
>>
>>
> @Gabriel:
> Thanks for the hint, I wasn't fully aware of this thread.
> @Maciej:
> Thanks for the analysis.
>
> It seems that all chip versions from 34 (= RTL8168E-VL) with the
> exception of version 39 (= RTL8106E, first sub-version) need
> bit TXCFG_AUTO_FIFO.
>
> And indeed, due to reordering of calls this bit is overwritten.
> Following patch moves setting the bit from the chip-specific
> hw_start function to rtl_set_tx_config_registers().
>
> Whoever is hit by the issue and has the option to build a kernel,
> could you please test whether the patch fixes the issue for you?
>
> Thanks, Heiner
>
> ---
>  drivers/net/ethernet/realtek/r8169.c | 20 
>  1 file changed, 8 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index f882be49f..ae8abe900 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -4514,9 +4514,14 @@ static void rtl8169_hw_reset(struct rtl8169_private *tp)
>  
>  static void rtl_set_tx_config_registers(struct rtl8169_private *tp)
>  {
> -	/* Set DMA burst size and Interframe Gap Time */
> -	RTL_W32(tp, TxConfig, (TX_DMA_BURST << TxDMAShift) |
> -		(InterFrameGap << TxInterFrameGapShift));
> +	u32 val = TX_DMA_BURST << TxDMAShift |
> +		  InterFrameGap << TxInterFrameGapShift;
> +
> +	if (tp->mac_version >= RTL_GIGA_MAC_VER_34 &&
> +	    tp->mac_version != RTL_GIGA_MAC_VER_39)
> +		val |= TXCFG_AUTO_FIFO;
> +
> +	RTL_W32(tp, TxConfig, val);
>  }
>  
>  static void rtl_set_rx_max_size(struct rtl8169_private *tp)
> @@ -5011,7 +5016,6 @@ static void rtl_hw_start_8168e_2(struct rtl8169_private *tp)
>  
>   rtl_disable_clock_request(tp);
>  
> - RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
>   RTL_W8(tp, MCU, RTL_R8(tp, MCU) & ~NOW_IS_OOB);
>  
>   /* Adjust EEE LED frequency */
> @@ -5045,7 +5049,6 @@ static void rtl_hw_start_8168f(struct rtl8169_private *tp)
>  
>   rtl_disable_clock_request(tp);
>  
> - RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
>   RTL_W8(tp, MCU, RTL_R8(tp, MCU) & ~NOW_IS_OOB);
>   RTL_W8(tp, DLLPR, RTL_R8(tp, DLLPR) 
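
The register discussion above comes down to a little bit arithmetic. The
following is a minimal C sketch of it, not the driver code itself. The
constant values are assumptions recalled from r8169.c of that period
(burst value 7 at shift 8, inter-frame gap 0x03 at shift 24,
TXCFG_AUTO_FIFO as bit 7), the helper names are hypothetical, and chip
versions are plain integers rather than the RTL_GIGA_MAC_VER_* enum.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, recalled from r8169.c; verify against your tree. */
#define TX_DMA_BURST     7u         /* DMA burst size field */
#define INTER_FRAME_GAP  0x03u      /* inter-frame gap field */
#define TX_DMA_SHIFT     8
#define TX_IFG_SHIFT     24
#define TXCFG_AUTO_FIFO  (1u << 7)

/* Same condition as in Heiner's patch: version 34 and later, except 39. */
static bool needs_auto_fifo(int mac_version)
{
    return mac_version >= 34 && mac_version != 39;
}

/* Bits that rtl_set_tx_config_registers() would write after the patch. */
static uint32_t tx_config_value(int mac_version)
{
    uint32_t val = (TX_DMA_BURST << TX_DMA_SHIFT) |
                   (INTER_FRAME_GAP << TX_IFG_SHIFT);

    if (needs_auto_fifo(mac_version))
        val |= TXCFG_AUTO_FIFO;
    return val;
}

/* Check a value read back from the register (e.g. via "ethtool -d"). */
static bool has_auto_fifo(uint32_t txconfig)
{
    return (txconfig & TXCFG_AUTO_FIFO) != 0;
}
```

Under these assumptions has_auto_fifo(0x4f000f80) holds for the value
David reported. The value read back from the hardware contains further
bits that this sketch does not model, so only the single bit is compared
rather than the whole register.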

Re: kernel 4.18.5 Realtek 8111G network adapter stops responding under high system load

2018-09-15 Thread David Arendt
Hi,

Just a follow-up:

In kernel 4.18.8 the behaviour is different.

The network becomes unreachable a number of times and recovers by itself
each time, before it finally stays unreachable.

Here is the logging output:

Sep 15 17:44:43 server kernel: NETDEV WATCHDOG: enp3s0 (r8169): transmit
queue 0 timed out
Sep 15 17:44:43 server kernel: r8169 :03:00.0 enp3s0: link up
Sep 15 18:10:26 server kernel: r8169 :03:00.0 enp3s0: link up
Sep 15 18:12:24 server kernel: r8169 :03:00.0 enp3s0: link up
Sep 15 18:13:19 server kernel: r8169 :03:00.0 enp3s0: link up
Sep 15 18:14:48 server kernel: r8169 :03:00.0 enp3s0: link up
Sep 15 18:20:24 server kernel: r8169 :03:00.0 enp3s0: link up
Sep 15 18:34:19 server kernel: r8169 :03:00.0 enp3s0: link up
Sep 15 18:43:43 server kernel: r8169 :03:00.0 enp3s0: link up
Sep 15 18:46:26 server kernel: r8169 :03:00.0 enp3s0: link up
Sep 15 19:00:24 server kernel: r8169 :03:00.0 enp3s0: link up

From 17:44 to 18:46 the network recovers automatically. After the
link up at 19:00, the network is no longer reachable, without any
additional message.

Looking at ifconfig, the counter for TX packets keeps incrementing while
the counter for RX packets does not.
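
A symptom like this (TX advancing, RX frozen) can be watched for
automatically. The following is a minimal C sketch, assuming the
standard Linux sysfs counters under /sys/class/net/<iface>/statistics/.
The struct and function names are hypothetical and not part of any
existing tool.

```c
#include <stdbool.h>
#include <stdio.h>

/* One snapshot of the interface packet counters. */
struct pkt_sample {
    unsigned long long rx;
    unsigned long long tx;
};

/* True if TX advanced between two samples while RX did not move. */
static bool rx_stalled(struct pkt_sample prev, struct pkt_sample cur)
{
    return cur.tx > prev.tx && cur.rx == prev.rx;
}

/* Read one counter such as "rx_packets" from sysfs; 0 on success. */
static int read_counter(const char *ifname, const char *name,
                        unsigned long long *out)
{
    char path[256];
    FILE *f;
    int ok;

    snprintf(path, sizeof(path), "/sys/class/net/%s/statistics/%s",
             ifname, name);
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = fscanf(f, "%llu", out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Fill a sample for the given interface; 0 on success. */
static int sample_iface(const char *ifname, struct pkt_sample *s)
{
    if (read_counter(ifname, "rx_packets", &s->rx) != 0)
        return -1;
    return read_counter(ifname, "tx_packets", &s->tx);
}
```

A caller would take two samples of enp3s0 some seconds apart and pass
them to rx_stalled(). On a quiet link RX can legitimately stand still,
so a single positive result is only a hint, not proof of the hang.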

Here again the driver from 4.17.14 is working flawlessly.

Thanks in advance,
David Arendt


On 9/4/18 8:19 AM, David Arendt wrote:
> Hi,
>
> When using kernel 4.18.5 the Realtek 8111G network adapter stops
> responding under high system load.
>
> Dmesg is showing no errors.
>
> Sometimes an ifconfig enp3s0 down followed by an ifconfig enp3s0 up is
> enough for the network adapter to restart responding. Sometimes a reboot
> is necessary.
>
> When copying r8169.c from 4.17.14 to the 4.18.5 kernel, networking works
> perfectly stable on 4.18.5 so the problem seems r8169.c related.
>
> Here the output from lshw:
>
>     *-pci:2
>  description: PCI bridge
>  product: 8 Series/C220 Series Chipset Family PCI Express
> Root Port #3
>  vendor: Intel Corporation
>  physical id: 1c.2
>  bus info: pci@:00:1c.2
>  version: d5
>  width: 32 bits
>  clock: 33MHz
>  capabilities: pci pciexpress msi pm normal_decode
> bus_master cap_list
>  configuration: driver=pcieport
>  resources: irq:18 ioport:d000(size=4096)
> memory:f730-f73f ioport:f210(size=1048576)
>    *-network
>     description: Ethernet interface
>     product: RTL8111/8168/8411 PCI Express Gigabit Ethernet
> Controller
>     vendor: Realtek Semiconductor Co., Ltd.
>     physical id: 0
>     bus info: pci@:03:00.0
>     logical name: enp3s0
>     version: 0c
>     serial: 
>     size: 1Gbit/s
>     capacity: 1Gbit/s
>     width: 64 bits
>     clock: 33MHz
>     capabilities: pm msi pciexpress msix vpd bus_master
> cap_list ethernet physical tp mii 10bt 10bt-fd 100bt 100bt-fd 1000bt
> 1000bt-fd autonegotiation
>     configuration: autonegotiation=on broadcast=yes
> driver=r8169 driverversion=2.3LK-NAPI duplex=full
> firmware=rtl8168g-2_0.0.1 02/06/13 latency=0 link=yes multicast=yes
> port=MII speed=1Gbit/s
>     resources: irq:18 ioport:d000(size=256)
> memory:f730-f7300fff memory:f210-f2103fff
>
> Thanks in advance for looking into this,
>
> David Arendt
>
>



kernel 4.18.5 Realtek 8111G network adapter stops responding under high system load

2018-09-04 Thread David Arendt
Hi,

When using kernel 4.18.5 the Realtek 8111G network adapter stops
responding under high system load.

Dmesg is showing no errors.

Sometimes an ifconfig enp3s0 down followed by an ifconfig enp3s0 up is
enough for the network adapter to start responding again. Sometimes a
reboot is necessary.

When copying r8169.c from 4.17.14 into the 4.18.5 kernel, networking is
perfectly stable on 4.18.5, so the problem seems to be related to r8169.c.

Here is the output from lshw:

    *-pci:2
         description: PCI bridge
         product: 8 Series/C220 Series Chipset Family PCI Express Root Port #3
         vendor: Intel Corporation
         physical id: 1c.2
         bus info: pci@:00:1c.2
         version: d5
         width: 32 bits
         clock: 33MHz
         capabilities: pci pciexpress msi pm normal_decode bus_master cap_list
         configuration: driver=pcieport
         resources: irq:18 ioport:d000(size=4096) memory:f730-f73f ioport:f210(size=1048576)
       *-network
            description: Ethernet interface
            product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
            vendor: Realtek Semiconductor Co., Ltd.
            physical id: 0
            bus info: pci@:03:00.0
            logical name: enp3s0
            version: 0c
            serial: 
            size: 1Gbit/s
            capacity: 1Gbit/s
            width: 64 bits
            clock: 33MHz
            capabilities: pm msi pciexpress msix vpd bus_master cap_list ethernet physical tp mii 10bt 10bt-fd 100bt 100bt-fd 1000bt 1000bt-fd autonegotiation
            configuration: autonegotiation=on broadcast=yes driver=r8169 driverversion=2.3LK-NAPI duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
            resources: irq:18 ioport:d000(size=256) memory:f730-f7300fff memory:f210-f2103fff

Thanks in advance for looking into this,

David Arendt




Re: page allocation stall in kernel 4.9 when copying files from one btrfs hdd to another

2016-12-13 Thread David Arendt
Hi,

Unfortunately I did not dump meminfo before the crash.

Here is the current meminfo, with the copy having been running for about
3 hours.

MemTotal:       32806572 kB
MemFree:          197336 kB
MemAvailable:   31226888 kB
Buffers:              52 kB
Cached:         30603160 kB
SwapCached:        11880 kB
Active:         29015008 kB
Inactive:        2017292 kB
Active(anon):     162124 kB
Inactive(anon):   285104 kB
Active(file):   28852884 kB
Inactive(file):  1732188 kB
Unevictable:        7092 kB
Mlocked:            7092 kB
SwapTotal:      62522692 kB
SwapFree:       62460464 kB
Dirty:            231944 kB
Writeback:             0 kB
AnonPages:        425160 kB
Mapped:           227656 kB
Shmem:             12160 kB
Slab:            1380280 kB
SReclaimable:     774584 kB
SUnreclaim:       605696 kB
KernelStack:        7840 kB
PageTables:        12800 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    78925976 kB
Committed_AS:    1883256 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:    20220592 kB
DirectMap2M:    13238272 kB
DirectMap1G:     1048576 kB

I will write a cronjob that dumps meminfo every 5 minutes to a file, so
I will have more info on the next crash.
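
Such a periodic dump could look roughly like the following minimal C
sketch. It assumes the usual "Key:   value kB" layout of /proc/meminfo,
and the function names and the output file are hypothetical. A small
binary calling dump_meminfo() from cron every 5 minutes would produce
the history mentioned above.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Find "key:" at the start of a line in a meminfo-style buffer and
 * return its numeric value, or -1 if the key is not present. */
static long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long val;

            if (sscanf(line + klen + 1, "%ld", &val) == 1)
                return val;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;     /* step past the newline to the next line */
    }
    return -1;
}

/* Append a timestamped copy of /proc/meminfo to path; 0 on success. */
static int dump_meminfo(const char *path)
{
    char buf[8192];
    size_t n;
    FILE *out;
    FILE *in = fopen("/proc/meminfo", "r");

    if (!in)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, in);
    fclose(in);
    buf[n] = '\0';

    out = fopen(path, "a");
    if (!out)
        return -1;
    fprintf(out, "=== %ld ===\n%s\n", (long)time(NULL), buf);
    return fclose(out) == 0 ? 0 : -1;
}
```

meminfo_kb() is included so that single fields such as MemFree or Dirty
can be pulled out of a saved snapshot later. The timestamp is written as
seconds since the epoch to keep the sketch short.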

The crash is not an isolated one: I have already hit it multiple times
with -rc7 and -rc8. It only seems to occur when copying from 7200rpm
hard disks to 5600rpm ones, and never when copying between two 7200rpm
or two 5400rpm disks.

Thanks,
David Arendt

On 12/13/2016 08:55 PM, Xin Zhou wrote:
> Hi David,
>
> It has GFP_NOFS flags, according to definition,
> the issue might have happened during initial DISK/IO.
>
> By the way, did you get a chance to dump the meminfo and run "top" before the 
> system hang?
> It seems more info about the system running state needed to know the issue. 
> Thanks.
>
> Xin
>
>  
>
> Sent: Tuesday, December 13, 2016 at 9:11 AM
> From: "David Arendt" <ad...@prnet.org>
> To: linux-bt...@vger.kernel.org, linux-kernel@vger.kernel.org
> Subject: page allocation stall in kernel 4.9 when copying files from one 
> btrfs hdd to another
> Hi,
>
> I receive the following page allocation stall while copying lots of
> large files from one btrfs hdd to another.
>
> Dec 13 13:04:29 server kernel: kworker/u16:8: page allocation stalls for
> 12260ms, order:0, mode:0x2400840(GFP_NOFS|__GFP_NOFAIL)
> Dec 13 13:04:29 server kernel: CPU: 0 PID: 24959 Comm: kworker/u16:8
> Tainted: P O 4.9.0 #1
> Dec 13 13:04:29 server kernel: Hardware name: ASUS All Series/H87M-PRO,
> BIOS 2102 10/28/2014
> Dec 13 13:04:29 server kernel: Workqueue: btrfs-extent-refs
> btrfs_extent_refs_helper
> Dec 13 13:04:29 server kernel:  813f3a59
> 81976b28 c90011093750
> Dec 13 13:04:29 server kernel: 81114fc1 02400840f39b6bc0
> 81976b28 c900110936f8
> Dec 13 13:04:29 server kernel: 88070010 c90011093760
> c90011093710 02400840
> Dec 13 13:04:29 server kernel: Call Trace:
> Dec 13 13:04:29 server kernel: [] ? dump_stack+0x46/0x5d
> Dec 13 13:04:29 server kernel: [] ?
> warn_alloc+0x111/0x130
> Dec 13 13:04:33 server kernel: [] ?
> __alloc_pages_nodemask+0xbe8/0xd30
> Dec 13 13:04:33 server kernel: [] ?
> pagecache_get_page+0xe4/0x230
> Dec 13 13:04:33 server kernel: [] ?
> alloc_extent_buffer+0x10b/0x400
> Dec 13 13:04:33 server kernel: [] ?
> btrfs_alloc_tree_block+0x125/0x560
> Dec 13 13:04:33 server kernel: [] ?
> read_extent_buffer_pages+0x21f/0x280
> Dec 13 13:04:33 server kernel: [] ?
> __btrfs_cow_block+0x141/0x580
> Dec 13 13:04:33 server kernel: [] ?
> btrfs_cow_block+0x100/0x150
> Dec 13 13:04:33 server kernel: [] ?
> btrfs_search_slot+0x1e9/0x9c0
> Dec 13 13:04:33 server kernel: [] ?
> __set_extent_bit+0x512/0x550
> Dec 13 13:04:33 server kernel: [] ?
> lookup_inline_extent_backref+0xf5/0x5e0
> Dec 13 13:04:34 server kernel: [] ?
> set_extent_bit+0x24/0x30
> Dec 13 13:04:34 server kernel: [] ?
> update_block_group.isra.34+0x114/0x380
> Dec 13 13:04:34 server kernel: [] ?
> __btrfs_free_extent.isra.35+0xf4/0xd20
> Dec 13 13:04:34 server kernel: [] ?
> btrfs_merge_delayed_refs+0x61/0x5d0
> Dec 13 13:04:34 server kernel: [] ?
> __btrfs_run_delayed_refs+0x902/0x10a0
> Dec 13 13:04:34 server kernel: [] ?
> btrfs_run_delayed_refs+0x90/0x2a0
> Dec 13 13:04:34 server kernel: [] ?
> delayed_ref_async_start+0x84/0xa0
> Dec 13 13:04:34 server kernel: [] ?
> process_one_work+0x11d/0x3b0
> Dec 13 13:04:34 server kernel: [] ?
> worker_thread+0x42/0x4b0
> Dec 13 13:04:34 se

page allocation stall in kernel 4.9 when copying files from one btrfs hdd to another

2016-12-13 Thread David Arendt
Dec 13 13:04:34 server kernel: Swap cache stats: add 60282, delete
60213, find 249865/258319
Dec 13 13:04:34 server kernel: Free swap  = 62482976kB
Dec 13 13:04:34 server kernel: Total swap = 62522692kB
Dec 13 13:04:34 server kernel: 8364614 pages RAM
Dec 13 13:04:34 server kernel: 0 pages HighMem/MovableOnly
Dec 13 13:04:34 server kernel: 162971 pages reserved

Does anyone have an idea what could be going wrong here?

Thanks in advance,

David Arendt


