Re: [PATCH] net: bonding: Fix transmit load balancing in balance-alb mode if specified by sysfs

2017-09-09 Thread Nikolay Aleksandrov
On 09/09/17 02:54, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Fri, Sep 8, 2017 at 7:30 AM, Nikolay Aleksandrov
> <niko...@cumulusnetworks.com> wrote:
>> On 08/09/17 17:17, Kosuke Tatsukawa wrote:
>>> Hi,
>>>
>>>> On 08/09/17 13:10, Nikolay Aleksandrov wrote:
>>>>> On 08/09/17 05:06, Kosuke Tatsukawa wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> On  7.09.2017 01:47, Kosuke Tatsukawa wrote:
>>>>>>>> Commit cbf5ecb30560 ("net: bonding: Fix transmit load balancing in
>>>>>>>> balance-alb mode") tried to fix transmit dynamic load balancing in
>>>>>>>> balance-alb mode, which wasn't working after commit 8b426dc54cf4
>>>>>>>> ("bonding: remove hardcoded value").
>>>>>>>>
>>>>>>>> It turned out that my previous patch only fixed the case when
>>>>>>>> balance-alb was specified as a bonding module parameter, and not when
>>>>>>>> balance-alb mode was set using /sys/class/net/*/bonding/mode (the most
>>>>>>>> common usage).  In the latter case, tlb_dynamic_lb was set up according
>>>>>>>> to the default mode of the bonding interface, which happens to be
>>>>>>>> balance-rr.
>>>>>>>>
>>>>>>>> This additional patch addresses this issue by setting up tlb_dynamic_lb
>>>>>>>> to 1 if "mode" is set to balance-alb through the sysfs interface.
>>>>>>>>
>>>>>>>> I didn't add code to change tlb_dynamic_lb back to the default value for
>>>>>>>> other modes, because "mode" is usually set up only once during
>>>>>>>> initialization, and it's not worthwhile to change the static variable
>>>>>>>> bonding_defaults in bond_main.c to a global variable just for this
>>>>>>>> purpose.
>>>>>>>>
>>>>>>>> Commit 8b426dc54cf4 also changes the value of tlb_dynamic_lb for
>>>>>>>> balance-tlb mode if it is set up using the sysfs interface.  I didn't
>>>>>>>> change that behavior, because the value of tlb_dynamic_lb can be changed
>>>>>>>> using the sysfs interface for balance-tlb, and I didn't like changing
>>>>>>>> the default value back and forth for balance-tlb.
>>>>>>>>
>>>>>>>> As for balance-alb, /sys/class/net/*/bonding/tlb_dynamic_lb cannot be
>>>>>>>> written to.  However, I think balance-alb with tlb_dynamic_lb set to 0
>>>>>>>> is not an intended usage, so there is little use making it writable at
>>>>>>>> this moment.
>>>>>>>>
>>>>>>>> Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
>>>>>>>> Reported-by: Reinis Rozitis <r...@roze.lv>
>>>>>>>> Signed-off-by: Kosuke Tatsukawa <ta...@ab.jp.nec.com>
>>>>>>>> Cc: sta...@vger.kernel.org  # v4.12+
>>>>>>>> ---
>>>>>>>>  drivers/net/bonding/bond_options.c |3 +++
>>>>>>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>>>>>>
>>>>>>>
>>>>>>> I don't believe this to be the right solution; hardcoding it like this
>>>>>>> changes user-visible behaviour. The issue is that if someone configured
>>>>>>> it to be 0 in tlb mode, suddenly it will become 1 and will silently
>>>>>>> override their config if they switch the mode to alb. Also it robs users
>>>>>>> of their choice.
>>>>>>>
>>>>>>> If you think this should be settable in ALB mode, then IMO you should
>>>>>>> edit tlb_dynamic_lb option's unsuppmodes and allow it to be set in ALB.
>>>>>>> That would also be consistent with how it's handled in TLB mode.
>>>>>>
>>>>>> No, I don't think tlb_dynamic_lb should be settable in balance-alb at
>>>>>> this point.  All the current commits regarding tlb_dynamic_lb are for
>>>>>> balance-tlb mode, so I don't think balance-alb with tlb_dynamic_lb set
>>>>>> to 0 is an intended usage.
>>>>>>
>
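
For reference, a minimal sketch of the kind of three-line bond_options.c
change the description above implies. The hook name and layout are assumed
from drivers/net/bonding/bond_options.c of that era; this is illustrative,
not the submitted patch.

        static int bond_option_mode_set(struct bonding *bond,
                                        const struct bond_opt_value *newval)
        {
                ...
                /* Assumed fix: when balance-alb is selected at runtime,
                 * restore the pre-8b426dc54cf4 default instead of keeping
                 * whatever the balance-rr defaults left behind. */
                if (newval->value == BOND_MODE_ALB)
                        bond->params.tlb_dynamic_lb = 1;

                bond->params.mode = newval->value;
                return 0;
        }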

Re: [PATCH] net: bonding: Fix transmit load balancing in balance-alb mode if specified by sysfs

2017-09-08 Thread Nikolay Aleksandrov
On 08/09/17 17:17, Kosuke Tatsukawa wrote:
> Hi,
> 
>> On 08/09/17 13:10, Nikolay Aleksandrov wrote:
>>> On 08/09/17 05:06, Kosuke Tatsukawa wrote:
>>>> Hi,
>>>>
>>>>> On  7.09.2017 01:47, Kosuke Tatsukawa wrote:
>>>>>> Commit cbf5ecb30560 ("net: bonding: Fix transmit load balancing in
>>>>>> balance-alb mode") tried to fix transmit dynamic load balancing in
>>>>>> balance-alb mode, which wasn't working after commit 8b426dc54cf4
>>>>>> ("bonding: remove hardcoded value").
>>>>>>
>>>>>> It turned out that my previous patch only fixed the case when
>>>>>> balance-alb was specified as a bonding module parameter, and not when
>>>>>> balance-alb mode was set using /sys/class/net/*/bonding/mode (the most
>>>>>> common usage).  In the latter case, tlb_dynamic_lb was set up according
>>>>>> to the default mode of the bonding interface, which happens to be
>>>>>> balance-rr.
>>>>>>
>>>>>> This additional patch addresses this issue by setting up tlb_dynamic_lb
>>>>>> to 1 if "mode" is set to balance-alb through the sysfs interface.
>>>>>>
>>>>>> I didn't add code to change tlb_dynamic_lb back to the default value for
>>>>>> other modes, because "mode" is usually set up only once during
>>>>>> initialization, and it's not worthwhile to change the static variable
>>>>>> bonding_defaults in bond_main.c to a global variable just for this
>>>>>> purpose.
>>>>>>
>>>>>> Commit 8b426dc54cf4 also changes the value of tlb_dynamic_lb for
>>>>>> balance-tlb mode if it is set up using the sysfs interface.  I didn't
>>>>>> change that behavior, because the value of tlb_dynamic_lb can be changed
>>>>>> using the sysfs interface for balance-tlb, and I didn't like changing
>>>>>> the default value back and forth for balance-tlb.
>>>>>>
>>>>>> As for balance-alb, /sys/class/net/*/bonding/tlb_dynamic_lb cannot be
>>>>>> written to.  However, I think balance-alb with tlb_dynamic_lb set to 0
>>>>>> is not an intended usage, so there is little use making it writable at
>>>>>> this moment.
>>>>>>
>>>>>> Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
>>>>>> Reported-by: Reinis Rozitis <r...@roze.lv>
>>>>>> Signed-off-by: Kosuke Tatsukawa <ta...@ab.jp.nec.com>
>>>>>> Cc: sta...@vger.kernel.org  # v4.12+
>>>>>> ---
>>>>>>  drivers/net/bonding/bond_options.c |3 +++
>>>>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>>>>
>>>>>
>>>>> I don't believe this to be the right solution; hardcoding it like this
>>>>> changes user-visible behaviour. The issue is that if someone configured
>>>>> it to be 0 in tlb mode, suddenly it will become 1 and will silently
>>>>> override their config if they switch the mode to alb. Also it robs users
>>>>> of their choice.
>>>>>
>>>>> If you think this should be settable in ALB mode, then IMO you should
>>>>> edit tlb_dynamic_lb option's unsuppmodes and allow it to be set in ALB.
>>>>> That would also be consistent with how it's handled in TLB mode.
>>>>
>>>> No, I don't think tlb_dynamic_lb should be settable in balance-alb at
>>>> this point.  All the current commits regarding tlb_dynamic_lb are for
>>>> balance-tlb mode, so I don't think balance-alb with tlb_dynamic_lb set
>>>> to 0 is an intended usage.
>>>>
>>>>
>>>>> Going back and looking at your previous fix, I'd argue that it is
>>>>> also wrong; you should've removed the mode check altogether to return
>>>>> the original behaviour, where dynamic_lb is set to 1 (enabled) by
>>>>> default, so that ALB mode would've had it. Of course, that would've
>>>>> left the case of setting it to 0 in TLB mode and switching to ALB,
>>>>> but that is a different issue.
>>>>
>>>> Maybe balance-alb shouldn't be dependent on tlb_dynamic_lb.
>>>> tlb_dynamic_lb is referenced in the following functions.
>>>>
>>>>  + bond_do_alb_xmit()  -- Used by b

Re: [PATCH] net: bonding: Fix transmit load balancing in balance-alb mode if specified by sysfs

2017-09-08 Thread Nikolay Aleksandrov
On 08/09/17 13:10, Nikolay Aleksandrov wrote:
> On 08/09/17 05:06, Kosuke Tatsukawa wrote:
>> Hi,
>>
>>> On  7.09.2017 01:47, Kosuke Tatsukawa wrote:
>>>> Commit cbf5ecb30560 ("net: bonding: Fix transmit load balancing in
>>>> balance-alb mode") tried to fix transmit dynamic load balancing in
>>>> balance-alb mode, which wasn't working after commit 8b426dc54cf4
>>>> ("bonding: remove hardcoded value").
>>>>
>>>> It turned out that my previous patch only fixed the case when
>>>> balance-alb was specified as a bonding module parameter, and not when
>>>> balance-alb mode was set using /sys/class/net/*/bonding/mode (the most
>>>> common usage).  In the latter case, tlb_dynamic_lb was set up according
>>>> to the default mode of the bonding interface, which happens to be
>>>> balance-rr.
>>>>
>>>> This additional patch addresses this issue by setting up tlb_dynamic_lb
>>>> to 1 if "mode" is set to balance-alb through the sysfs interface.
>>>>
>>>> I didn't add code to change tlb_dynamic_lb back to the default value for
>>>> other modes, because "mode" is usually set up only once during
>>>> initialization, and it's not worthwhile to change the static variable
>>>> bonding_defaults in bond_main.c to a global variable just for this
>>>> purpose.
>>>>
>>>> Commit 8b426dc54cf4 also changes the value of tlb_dynamic_lb for
>>>> balance-tlb mode if it is set up using the sysfs interface.  I didn't
>>>> change that behavior, because the value of tlb_dynamic_lb can be changed
>>>> using the sysfs interface for balance-tlb, and I didn't like changing
>>>> the default value back and forth for balance-tlb.
>>>>
>>>> As for balance-alb, /sys/class/net/*/bonding/tlb_dynamic_lb cannot be
>>>> written to.  However, I think balance-alb with tlb_dynamic_lb set to 0
>>>> is not an intended usage, so there is little use making it writable at
>>>> this moment.
>>>>
>>>> Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
>>>> Reported-by: Reinis Rozitis <r...@roze.lv>
>>>> Signed-off-by: Kosuke Tatsukawa <ta...@ab.jp.nec.com>
>>>> Cc: sta...@vger.kernel.org  # v4.12+
>>>> ---
>>>>  drivers/net/bonding/bond_options.c |3 +++
>>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>>
>>>
>>> I don't believe this to be the right solution; hardcoding it like this
>>> changes user-visible behaviour. The issue is that if someone configured
>>> it to be 0 in tlb mode, suddenly it will become 1 and will silently
>>> override their config if they switch the mode to alb. Also it robs users
>>> of their choice.
>>>
>>> If you think this should be settable in ALB mode, then IMO you should
>>> edit tlb_dynamic_lb option's unsuppmodes and allow it to be set in ALB.
>>> That would also be consistent with how it's handled in TLB mode.
>>
>> No, I don't think tlb_dynamic_lb should be settable in balance-alb at
>> this point.  All the current commits regarding tlb_dynamic_lb are for
>> balance-tlb mode, so I don't think balance-alb with tlb_dynamic_lb set
>> to 0 is an intended usage.
>>
>>
>>> Going back and looking at your previous fix, I'd argue that it is also
>>> wrong; you should've removed the mode check altogether to return the
>>> original behaviour, where dynamic_lb is set to 1 (enabled) by default,
>>> so that ALB mode would've had it. Of course, that would've left the
>>> case of setting it to 0 in TLB mode and switching to ALB, but that is
>>> a different issue.
>>
>> Maybe balance-alb shouldn't be dependent on tlb_dynamic_lb.
>> tlb_dynamic_lb is referenced in the following functions.
>>
>>  + bond_do_alb_xmit()  -- Used by both balance-tlb and balance-alb
>>  + bond_tlb_xmit()  -- Only used by balance-tlb
>>  + bond_open()  -- Used by both balance-tlb and balance-alb
>>  + bond_check_params()  -- Used during module initialization
>>  + bond_fill_info()  -- Used to get/set value
>>  + bond_option_tlb_dynamic_lb_set()  -- Used to get/set value
>>  + bonding_show_tlb_dynamic_lb()  -- Used to get/set value
>>  + bond_is_nondyn_tlb()  -- Only referenced if balance-tlb mode
>>
>> The following untested patch adds code to make balance-alb work as if
>> tlb_dynamic_lb==1 for the functions which affect balance-alb mode.  It
>> also reverts my previous patch.
>>
>> What do you think about this approach?
>> ---
>> Kosuke TATSUKAWA  | 1st Platform Software Division
>>   | NEC Solution Innovators
>>   | ta...@ab.jp.nec.com
>>
> 
> Logically the approach looks good; that said, it adds unnecessary tests
> in the fast path, so why not just something like the patch below? That
> only leaves the problem of it being zeroed in TLB mode and then switched
> to ALB, and that is a one-line fix to unsuppmodes to allow it to be set
> for that specific case. The patch below returns the default behaviour
> from before the commit in your Fixes tag.
> 
> 

Actually I'm fine with your approach, too. It will fix this regardless of
the value of tlb_dynamic_lb, which sounds good to me, at the price of a
test in the fast path.
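
The "test in the fast path" being traded here would look roughly like the
helper below, consulted per packet by bond_do_alb_xmit() from the list
above. The helper name is made up for illustration; BOND_MODE() is the
real mode accessor from include/net/bonding.h.

        /* Sketch: treat balance-alb as if tlb_dynamic_lb == 1, at the
         * cost of an extra mode check on every transmitted packet. */
        static bool bond_dynamic_lb_enabled(struct bonding *bond)
        {
                return BOND_MODE(bond) == BOND_MODE_ALB ||
                       bond->params.tlb_dynamic_lb;
        }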






Re: [PATCH] net: bonding: Fix transmit load balancing in balance-alb mode if specified by sysfs

2017-09-08 Thread Nikolay Aleksandrov
On 08/09/17 05:06, Kosuke Tatsukawa wrote:
> Hi,
> 
>> On  7.09.2017 01:47, Kosuke Tatsukawa wrote:
>>> Commit cbf5ecb30560 ("net: bonding: Fix transmit load balancing in
>>> balance-alb mode") tried to fix transmit dynamic load balancing in
>>> balance-alb mode, which wasn't working after commit 8b426dc54cf4
>>> ("bonding: remove hardcoded value").
>>>
>>> It turned out that my previous patch only fixed the case when
>>> balance-alb was specified as a bonding module parameter, and not when
>>> balance-alb mode was set using /sys/class/net/*/bonding/mode (the most
>>> common usage).  In the latter case, tlb_dynamic_lb was set up according
>>> to the default mode of the bonding interface, which happens to be
>>> balance-rr.
>>>
>>> This additional patch addresses this issue by setting up tlb_dynamic_lb
>>> to 1 if "mode" is set to balance-alb through the sysfs interface.
>>>
>>> I didn't add code to change tlb_dynamic_lb back to the default value for
>>> other modes, because "mode" is usually set up only once during
>>> initialization, and it's not worthwhile to change the static variable
>>> bonding_defaults in bond_main.c to a global variable just for this
>>> purpose.
>>>
>>> Commit 8b426dc54cf4 also changes the value of tlb_dynamic_lb for
>>> balance-tlb mode if it is set up using the sysfs interface.  I didn't
>>> change that behavior, because the value of tlb_dynamic_lb can be changed
>>> using the sysfs interface for balance-tlb, and I didn't like changing
>>> the default value back and forth for balance-tlb.
>>>
>>> As for balance-alb, /sys/class/net/*/bonding/tlb_dynamic_lb cannot be
>>> written to.  However, I think balance-alb with tlb_dynamic_lb set to 0
>>> is not an intended usage, so there is little use making it writable at
>>> this moment.
>>>
>>> Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
>>> Reported-by: Reinis Rozitis 
>>> Signed-off-by: Kosuke Tatsukawa 
>>> Cc: sta...@vger.kernel.org  # v4.12+
>>> ---
>>>  drivers/net/bonding/bond_options.c |3 +++
>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>
>>
>> I don't believe this to be the right solution; hardcoding it like this
>> changes user-visible behaviour. The issue is that if someone configured
>> it to be 0 in tlb mode, suddenly it will become 1 and will silently
>> override their config if they switch the mode to alb. Also it robs users
>> of their choice.
>>
>> If you think this should be settable in ALB mode, then IMO you should
>> edit tlb_dynamic_lb option's unsuppmodes and allow it to be set in ALB.
>> That would also be consistent with how it's handled in TLB mode.
> 
> No, I don't think tlb_dynamic_lb should be settable in balance-alb at
> this point.  All the current commits regarding tlb_dynamic_lb are for
> balance-tlb mode, so I don't think balance-alb with tlb_dynamic_lb set
> to 0 is an intended usage.
> 
> 
>> Going back and looking at your previous fix, I'd argue that it is also
>> wrong; you should've removed the mode check altogether to return the
>> original behaviour, where dynamic_lb is set to 1 (enabled) by default,
>> so that ALB mode would've had it. Of course, that would've left the
>> case of setting it to 0 in TLB mode and switching to ALB, but that is
>> a different issue.
> 
> Maybe balance-alb shouldn't be dependent on tlb_dynamic_lb.
> tlb_dynamic_lb is referenced in the following functions.
> 
>  + bond_do_alb_xmit()  -- Used by both balance-tlb and balance-alb
>  + bond_tlb_xmit()  -- Only used by balance-tlb
>  + bond_open()  -- Used by both balance-tlb and balance-alb
>  + bond_check_params()  -- Used during module initialization
>  + bond_fill_info()  -- Used to get/set value
>  + bond_option_tlb_dynamic_lb_set()  -- Used to get/set value
>  + bonding_show_tlb_dynamic_lb()  -- Used to get/set value
>  + bond_is_nondyn_tlb()  -- Only referenced if balance-tlb mode
> 
> The following untested patch adds code to make balance-alb work as if
> tlb_dynamic_lb==1 for the functions which affect balance-alb mode.  It
> also reverts my previous patch.
> 
> What do you think about this approach?
> ---
> Kosuke TATSUKAWA  | 1st Platform Software Division
>   | NEC Solution Innovators
>   | ta...@ab.jp.nec.com
> 

Logically the approach looks good; that said, it adds unnecessary tests in
the fast path, so why not just something like the patch below? That only
leaves the problem of it being zeroed in TLB mode and then switched to
ALB, and that is a one-line fix to unsuppmodes to allow it to be set for
that specific case. The patch below returns the default behaviour from
before the commit in your Fixes tag.


diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index fc63992ab0e0..c99dc59d729b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4289,7 +4289,7 @@ static int bond_check_params(struct bond_params *params)
int bond_mode   = BOND_MODE_ROUNDROBIN;

Re: [PATCH] net: bonding: Fix transmit load balancing in balance-alb mode if specified by sysfs

2017-09-07 Thread Nikolay Aleksandrov
On  7.09.2017 01:47, Kosuke Tatsukawa wrote:
> Commit cbf5ecb30560 ("net: bonding: Fix transmit load balancing in
> balance-alb mode") tried to fix transmit dynamic load balancing in
> balance-alb mode, which wasn't working after commit 8b426dc54cf4
> ("bonding: remove hardcoded value").
> 
> It turned out that my previous patch only fixed the case when
> balance-alb was specified as a bonding module parameter, and not when
> balance-alb mode was set using /sys/class/net/*/bonding/mode (the most
> common usage).  In the latter case, tlb_dynamic_lb was set up according
> to the default mode of the bonding interface, which happens to be
> balance-rr.
> 
> This additional patch addresses this issue by setting up tlb_dynamic_lb
> to 1 if "mode" is set to balance-alb through the sysfs interface.
> 
> I didn't add code to change tlb_dynamic_lb back to the default value for
> other modes, because "mode" is usually set up only once during
> initialization, and it's not worthwhile to change the static variable
> bonding_defaults in bond_main.c to a global variable just for this
> purpose.
> 
> Commit 8b426dc54cf4 also changes the value of tlb_dynamic_lb for
> balance-tlb mode if it is set up using the sysfs interface.  I didn't
> change that behavior, because the value of tlb_dynamic_lb can be changed
> using the sysfs interface for balance-tlb, and I didn't like changing
> the default value back and forth for balance-tlb.
> 
> As for balance-alb, /sys/class/net/*/bonding/tlb_dynamic_lb cannot be
> written to.  However, I think balance-alb with tlb_dynamic_lb set to 0
> is not an intended usage, so there is little use making it writable at
> this moment.
> 
> Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
> Reported-by: Reinis Rozitis 
> Signed-off-by: Kosuke Tatsukawa 
> Cc: sta...@vger.kernel.org  # v4.12+
> ---
>  drivers/net/bonding/bond_options.c |3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 

I don't believe this to be the right solution; hardcoding it like this
changes user-visible behaviour. The issue is that if someone configured
it to be 0 in tlb mode, suddenly it will become 1 and will silently
override their config if they switch the mode to alb. Also it robs users
of their choice.

If you think this should be settable in ALB mode, then IMO you should
edit tlb_dynamic_lb option's unsuppmodes and allow it to be set in ALB.
That would also be consistent with how it's handled in TLB mode.

Going back and looking at your previous fix, I'd argue that it is also
wrong; you should've removed the mode check altogether to return the
original behaviour, where dynamic_lb is set to 1 (enabled) by default,
so that ALB mode would've had it. Of course, that would've left the case
of setting it to 0 in TLB mode and switching to ALB, but that is a
different issue.

Cheers,
 Nik
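
The unsuppmodes route suggested above amounts to widening one initializer
in the bond_opts[] table. A sketch, assuming the option-table layout of
drivers/net/bonding/bond_options.c; BOND_MODE_ALL_EX() is the real mask
helper, and the entry is abridged:

        [BOND_OPT_TLB_DYNAMIC_LB] = {
                .id = BOND_OPT_TLB_DYNAMIC_LB,
                .name = "tlb_dynamic_lb",
                /* was: BOND_MODE_ALL_EX(BIT(BOND_MODE_TLB)), i.e. only
                 * settable in balance-tlb; adding ALB lifts that. */
                .unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_TLB) |
                                                BIT(BOND_MODE_ALB)),
                ...
        },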



workqueue threads ->journal_info buggery

2017-09-05 Thread Nikolay Borisov
Hello Tejun, 

I've hit the following problems under memory-heavy workload conditions: 

First is a BUG_ON: J_ASSERT(journal_current_handle() == handle);


[   64.261793] kernel BUG at fs/jbd2/transaction.c:1644!
[   64.263894] invalid opcode:  [#1] SMP
[   64.266187] Modules linked in:
[   64.267472] CPU: 1 PID: 542 Comm: kworker/u12:6 Not tainted 4.12.0-nbor #135
[   64.269941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   64.272374] Workqueue: writeback wb_workfn (flush-254:0)
[   64.273862] task: 88001c37b880 task.stack: 880018ac8000
[   64.275580] RIP: 0010:jbd2_journal_stop+0x375/0x4d0
[   64.276704] RSP: :880018acb990 EFLAGS: 00010286
[   64.278708] RAX: 88001c37b880 RBX: 88001e83c000 RCX: 88001c4f8800
[   64.280499] RDX: 88001e83c000 RSI: 0b26 RDI: 88001e83c000
[   64.282262] RBP: 880018acba10 R08: 880019ec5888 R09: 
[   64.284111] R10:  R11: 81283f8f R12: 880018a1a140
[   64.285553] R13: 88001c4f8800 R14: 88001c47d000 R15: 880018aa01f0
[   64.286337] FS:  () GS:88001fc4() 
knlGS:
[   64.287671] CS:  0010 DS:  ES:  CR0: 80050033
[   64.288568] CR2: 00421ac0 CR3: 1ae83000 CR4: 06a0
[   64.289468] Call Trace:
[   64.289748]  ? __ext4_journal_get_write_access+0x67/0xc0
[   64.290330]  ? ext4_writepages+0xec6/0x1200
[   64.290786]  __ext4_journal_stop+0x3c/0xa0
[   64.291233]  ext4_writepages+0x8b2/0x1200
[   64.291682]  ? writeback_sb_inodes+0x11f/0x5c0
[   64.292174]  do_writepages+0x1c/0x80
[   64.292572]  ? do_writepages+0x1c/0x80
[   64.292985]  __writeback_single_inode+0x61/0x760
[   64.293575]  writeback_sb_inodes+0x28d/0x5c0
[   64.294192]  __writeback_inodes_wb+0x92/0xc0
[   64.294777]  wb_writeback+0x3e9/0x560
[   64.295241]  wb_workfn+0x9a/0x5d0
[   64.295977]  ? wb_workfn+0x9a/0x5d0
[   64.296788]  ? process_one_work+0x15c/0x620
[   64.297971]  process_one_work+0x1d9/0x620
[   64.298969]  worker_thread+0x4e/0x3b0
[   64.299684]  kthread+0x113/0x150
[   64.300287]  ? process_one_work+0x620/0x620
[   64.301145]  ? kthread_create_on_node+0x40/0x40
[   64.301953]  ret_from_fork+0x2a/0x40
[   64.302572] Code: dd ff 41 8b 45 60 85 c0 0f 84 29 fe ff ff 49 8d bd 00 01 
00 00 31 c9 ba 01 00 00 00 be 03 00 00 00 e8 90 c1 dd ff e9 0c fe ff ff <0f> 0b 
44 89 fe 4c 89 ef e8 ce 83 00 00 89 45 c4 e9 18 fe ff ff 
[   64.305997] RIP: jbd2_journal_stop+0x375/0x4d0 RSP: 880018acb990
[   64.307037] ---[ end trace ec3f7cbd6e733faf ]---

I consulted with Jan; his opinion is that this is due to ->journal_info
in workqueue threads getting modified while the work is running.

I've also hit the lockdep warning below; Jan likewise said it's due to
->journal_info being modified, which forces a new handle to be started
and thus causes the splat.

[   64.153143] 
[   64.154787] WARNING: possible recursive locking detected
[   64.156540] 4.12.0-nbor #135 Not tainted
[   64.157704] 
[   64.159787] kworker/u12:6/542 is trying to acquire lock:
[   64.160964]  (jbd2_handle){-.}, at: [] 
start_this_handle+0x104/0x440
[   64.163360] 
[   64.163360] but task is already holding lock:
[   64.165240]  (jbd2_handle){-.}, at: [] 
start_this_handle+0x104/0x440
[   64.168034] 
[   64.168034] other info that might help us debug this:
[   64.169969]  Possible unsafe locking scenario:
[   64.169969] 
[   64.172198]CPU0
[   64.173047]
[   64.173768]   lock(jbd2_handle);
[   64.174554]   lock(jbd2_handle);
[   64.175255] 
[   64.175255]  *** DEADLOCK ***
[   64.175255] 
[   64.176860]  May be due to missing lock nesting notation
[   64.176860] 
[   64.177932] 6 locks held by kworker/u12:6/542:
[   64.179133]  #0:  ("writeback"){.+.+.+}, at: [] 
process_one_work+0x15c/0x620
[   64.181395]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [] 
process_one_work+0x15c/0x620
[   64.184053]  #2:  (&type->s_umount_key#27){.+}, at: [] 
trylock_super+0x1b/0x50
[   64.186218]  #3:  (&sbi->s_journal_flag_rwsem){.+.+.+}, at: 
[] do_writepages+0x1c/0x80
[   64.188299]  #4:  (jbd2_handle){-.}, at: [] 
start_this_handle+0x104/0x440
[   64.191933]  #5:  (&ei->i_data_sem){..}, at: [] 
ext4_map_blocks+0x130/0x5c0
[   64.193975] 
[   64.193975] stack backtrace:
[   64.194935] CPU: 1 PID: 542 Comm: kworker/u12:6 Not tainted 4.12.0-nbor #135
[   64.196599] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   64.198790] Workqueue: writeback wb_workfn (flush-254:0)
[   64.200148] Call Trace:
[   64.201020]  dump_stack+0x85/0xc7
[   64.202497]  __lock_acquire+0x14b3/0x1790
[   64.203815]  lock_acquire+0xac/0x1e0
[   64.204999]  ? start_this_handle+0x134/0x440
[   64.206552]  ? lock_acquire+0xac/0x1e0
[   64.208104]  ? start_this_handle+0x104/0x440
[   
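
For context on the assertion that fires in the first trace: jbd2 tracks
the currently running handle in a per-task pointer, so anything that
changes a worker thread's ->journal_info while a work item runs breaks
the start/stop pairing. A simplified sketch of the accessor, modelled on
include/linux/jbd2.h:

        /* jbd2 keeps the active handle in task_struct::journal_info. */
        static inline handle_t *journal_current_handle(void)
        {
                return current->journal_info;
        }

jbd2_journal_stop() asserts it is stopping the handle this task started,
J_ASSERT(journal_current_handle() == handle), which is the BUG above; a
clobbered ->journal_info likewise makes start_this_handle() begin a
second, nested handle, which is what the recursive jbd2_handle lockdep
report shows.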

Re: [PATCH 10/16] btrfs: switch write_buf to kernel_write

2017-08-30 Thread Nikolay Borisov


On 30.08.2017 18:00, Christoph Hellwig wrote:
> Instead of playing with the addressing limits.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>
> ---

Reviewed-by: Nikolay Borisov <nbori...@suse.com>

>  fs/btrfs/send.c | 18 --
>  1 file changed, 4 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index b082210df9c8..24b989fd130c 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -539,33 +539,23 @@ static struct btrfs_path *alloc_path_for_send(void)
>  static int write_buf(struct file *filp, const void *buf, u32 len, loff_t *off)
>  {
>   int ret;
> - mm_segment_t old_fs;
>   u32 pos = 0;
>  
> - old_fs = get_fs();
> - set_fs(KERNEL_DS);
> -
>   while (pos < len) {
> - ret = vfs_write(filp, (__force const char __user *)buf + pos,
> - len - pos, off);
> + ret = kernel_write(filp, buf + pos, len - pos, off);
>   /* TODO handle that correctly */
>   /*if (ret == -ERESTARTSYS) {
>   continue;
>   }*/
>   if (ret < 0)
> - goto out;
> + return ret;
>   if (ret == 0) {
> - ret = -EIO;
> - goto out;
> + return -EIO;
>   }
>   pos += ret;
>   }
>  
> - ret = 0;
> -
> -out:
> - set_fs(old_fs);
> - return ret;
> + return 0;
>  }
>  
>  static int tlv_put(struct send_ctx *sctx, u16 attr, const void *data, int len)
> 
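
The conversion leans on kernel_write() doing the kernel-buffer handling
internally. Its signature after this series is assumed here to be:

        ssize_t kernel_write(struct file *file, const void *buf,
                             size_t count, loff_t *pos);

so write_buf() can pass its kernel buffer and offset pointer straight
through, with no set_fs() juggling left at the call site.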


[PATCH] swap: Remove obsolete sentence

2017-08-25 Thread Nikolay Borisov
Currently there are no ->swap_{in,out} methods in the address_space_operations
structure definition, so the statement that anything is going to be proxied
through them is wrong.

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---
 Documentation/filesystems/vfs.txt | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index 73e7d91f03dc..405a3df759b3 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -829,9 +829,7 @@ struct address_space_operations {
   swap_activate: Called when swapon is used on a file to allocate
space if necessary and pin the block lookup information in
memory. A return value of zero indicates success,
-   in which case this file can be used to back swapspace. The
-   swapspace operations will be proxied to this address space's
-   ->swap_{out,in} methods.
+   in which case this file can be used to back swapspace.
 
   swap_deactivate: Called during swapoff on files where swap_activate
was successful.
-- 
2.7.4
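
For reference, the two swap methods that do remain, and which the
corrected paragraph still documents, have roughly these signatures
(sketched from include/linux/fs.h of that era):

        struct address_space_operations {
                ...
                /* pin the block mapping at swapon; returning 0 means
                 * this file can back swapspace */
                int (*swap_activate)(struct swap_info_struct *sis,
                                     struct file *file, sector_t *span);
                /* undo swap_activate at swapoff */
                void (*swap_deactivate)(struct file *file);
        };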



Re: [PATCH][V2][netdev-next] gre: remove duplicated assignment of iph

2017-08-23 Thread Nikolay Aleksandrov
On 23/08/17 14:59, Colin King wrote:
> From: Colin Ian King <colin.k...@canonical.com>
> 
> iph is being assigned the same value twice; remove the redundant
> first assignment. (Thanks to Nikolay Aleksandrov for pointing out
> that the first assignment should be removed and not the second)
> 
> Fixes warning:
> net/ipv4/ip_gre.c:265:2: warning: Value stored to 'iph' is never read
> 
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>
> ---
>  net/ipv4/ip_gre.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index 6e8a62289e03..161326f7f10b 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -262,7 +262,6 @@ static int erspan_rcv(struct sk_buff *skb, struct 
> tnl_ptk_info *tpi,
>   int len;
>  
>   itn = net_generic(net, erspan_net_id);
> - iph = ip_hdr(skb);
>   len = gre_hdr_len + sizeof(*ershdr);
>  
>   if (unlikely(!pskb_may_pull(skb, len)))
> 

LGTM,

Reviewed-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>



Re: [PATCH][netdev-next] gre: remove duplicated assignment of iph

2017-08-23 Thread Nikolay Aleksandrov
On 23/08/17 14:13, Colin King wrote:
> From: Colin Ian King 
> 
> iph is being assigned the same value twice; remove the redundant
> second assignment.
> 
> Fixes warning:
> net/ipv4/ip_gre.c:265:2: warning: Value stored to 'iph' is never read
> 
> Signed-off-by: Colin Ian King 
> ---
>  net/ipv4/ip_gre.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index 6e8a62289e03..6b3e7c99a3b6 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -268,7 +268,6 @@ static int erspan_rcv(struct sk_buff *skb, struct 
> tnl_ptk_info *tpi,
>   if (unlikely(!pskb_may_pull(skb, len)))
>   return -ENOMEM;
>  
> - iph = ip_hdr(skb);
>   ershdr = (struct erspanhdr *)(skb->data + gre_hdr_len);
>  
>   /* The original GRE header does not have key field,
> 

This one looks like a correct assignment; I'd remove the previous one instead,
because pskb_may_pull may change the header pointers, and the previously
assigned iph might become invalid.

Also add the author of the code to the CC list.
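
A minimal sketch of the safe ordering (illustrative only, mirroring the
erspan_rcv() hunk above): any pointer obtained via ip_hdr() before
pskb_may_pull() may be invalidated by a header reallocation, so the header
must be fetched only after the pull succeeds.

	if (unlikely(!pskb_may_pull(skb, len)))
		return -ENOMEM;

	/* safe: skb->data is stable now, so ip_hdr() returns a valid pointer */
	iph = ip_hdr(skb);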





Re: [PATCH] exec: Check stack space more strictly

2017-08-16 Thread Nikolay Borisov
minor nit below

On 18.07.2017 01:22, Andy Lutomirski wrote:
> We can currently blow past the stack rlimit and cause odd behavior
> if there are accounting bugs, rounding issues, or races.  It's not
> clear that the odd behavior is actually a problem, but it's nicer to
> fail the exec instead of getting out of sync with stack limits.
> 
> Improve the code to more carefully check for space and to abort if
> our stack mm gets too large in setup_arg_pages().
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  fs/exec.c | 44 ++--
>  1 file changed, 34 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 62175cbcc801..0c60c0495269 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -764,23 +764,47 @@ int setup_arg_pages(struct linux_binprm *bprm,
>   /* mprotect_fixup is overkill to remove the temporary stack flags */
>   vma->vm_flags &= ~VM_STACK_INCOMPLETE_SETUP;
>  
> - stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages */
> - stack_size = vma->vm_end - vma->vm_start;
>   /*
>* Align this down to a page boundary as expand_stack
>* will align it up.
>*/
>   rlim_stack = rlimit(RLIMIT_STACK) & PAGE_MASK;
> + stack_size = vma->vm_end - vma->vm_start;
> +
> + if (stack_size > rlim_stack) {
> + /*
> +  * If we've already used too much space (due to accounting
> +  * bugs, alignment, races, or any other cause), bail.
> +  */
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> +
> + /*
> +  * stack_expand is the amount of space beyond the space already used
> +  * that we're going to pre-allocate in our stack.  For historical
> +  * reasons, it's 128kB, unless we have less space than that available
> +  * in our rlimit.
> +  *
> +  * This particular historical wart is wrong-headed, though, since
> +  * we haven't finished binfmt-specific setup, and the binfmt code
> +  * is going to eat up some or all of this space.
> +  */
> + stack_expand = min(rlim_stack - stack_size, 131072UL);

nit: Use SZ_128K from sizes.h.
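
For illustration, a hypothetical respin of that line using the constant from
include/linux/sizes.h (min_t() avoids the type mismatch between the unsigned
long rlimit arithmetic and the plain integer constant):

	stack_expand = min_t(unsigned long, rlim_stack - stack_size, SZ_128K);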

> +
>  #ifdef CONFIG_STACK_GROWSUP
> - if (stack_size + stack_expand > rlim_stack)
> - stack_base = vma->vm_start + rlim_stack;
> - else
> - stack_base = vma->vm_end + stack_expand;
> + if (TASK_SIZE_MAX - vma->vm_end < stack_expand) {
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> + stack_base = vma->vm_end + stack_expand;
>  #else
> - if (stack_size + stack_expand > rlim_stack)
> - stack_base = vma->vm_end - rlim_stack;
> - else
> - stack_base = vma->vm_start - stack_expand;
> + if (vma->vm_start < mmap_min_addr ||
> + vma->vm_start - mmap_min_addr < stack_expand) {
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> + stack_base = vma->vm_start - stack_expand;
>  #endif
>   current->mm->start_stack = bprm->p;
>   ret = expand_stack(vma, stack_base);
> 


Re: [RESEND PATCH] bcache: Don't reinvent the wheel but use existing llist API

2017-08-09 Thread Nikolay Borisov


On  8.08.2017 09:00, Byungchul Park wrote:
> On Tue, Aug 08, 2017 at 01:28:39PM +0800, Coly Li wrote:
> + llist_for_each_entry_safe(cl, t, reverse, list) {

 Just wondering why not using llist_for_each_entry(), or you use the
 _safe version on purpose ?
>>>
>>> If I use llist_for_each_entry(), then it would change the original
>>> behavior. Is it ok?

Generally, the _safe versions of the list primitives are used when you are
going to remove entries during the iteration. I haven't looked at the code in
bcache, but if it's removing entries from the list then the _safe version is
required. If you are only iterating, then the non-safe version is fine.
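
For illustration, a minimal sketch of the difference (the item type and the
process() helper are hypothetical, not from the bcache code; assumes a
populated struct llist_head head):

	struct item {
		struct llist_node node;
	};
	struct item *pos, *t;
	struct llist_node *first = llist_del_all(&head);

	/* read-only walk: the non-safe variant is fine */
	llist_for_each_entry(pos, first, node)
		process(pos);

	/* freeing as we go: without the lookahead cursor 't', the next
	 * pointer would be read from an already-freed node, so the _safe
	 * variant is required */
	llist_for_each_entry_safe(pos, t, first, node)
		kfree(pos);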

>>>
>>
>> I feel llist_for_each_entry() keeps the original behavior, and variable
> 
> Ah.. I see. Then.. Can I change it into non-safe version? Is it still ok
> with non-safe one? I will change it at the next spin, if yes.
> 
>> 't' can be removed. Anyway, either llist_for_each_entry() or
>> llist_for_each_entry_safe() works correctly and well here. Any one you
>> use is OK to me, thanks for your informative reply :-)
> 
> I rather appreciate it.
> 
> Thank you,
> Byungchul
> 


[PATCH] direct-io: Minor cleanups in do_blockdev_direct_IO

2017-08-02 Thread Nikolay Borisov
We already get the block counts and calculate the end block at the
beginning of the function. Let's use the local variables for consistency and
readability. No functional changes.

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---
 fs/direct-io.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 08cf27811e5a..987bc17a5f5e 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1139,7 +1139,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode 
*inode,
}
 
/* watch out for a 0 len io from a tricksy fs */
-   if (iov_iter_rw(iter) == READ && !iov_iter_count(iter))
+   if (iov_iter_rw(iter) == READ && !count)
return 0;
 
dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
@@ -1248,8 +1248,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode 
*inode,
 
dio->should_dirty = (iter->type == ITER_IOVEC);
sdio.iter = iter;
-   sdio.final_block_in_request =
-   (offset + iov_iter_count(iter)) >> blkbits;
+   sdio.final_block_in_request = end >> blkbits;
 
/*
 * In case of non-aligned buffers, we may need 2 more
-- 
2.7.4



Re: [PATCH] btrfs: resume qgroup rescan on rw remount

2017-07-10 Thread Nikolay Borisov


On 10.07.2017 16:12, Nikolay Borisov wrote:
> 
> 
> On  4.07.2017 14:49, Aleksa Sarai wrote:
>> Several distributions mount the "proper root" as ro during initrd and
>> then remount it as rw before pivot_root(2). Thus, if a rescan had been
>> aborted by a previous shutdown, the rescan would never be resumed.
>>
>> This issue would manifest itself as several btrfs ioctl(2)s causing the
>> entire machine to hang when btrfs_qgroup_wait_for_completion was hit
>> (due to the fs_info->qgroup_rescan_running flag being set but the rescan
>> itself not being resumed). Notably, Docker's btrfs storage driver makes
>> regular use of BTRFS_QUOTA_CTL_DISABLE and BTRFS_IOC_QUOTA_RESCAN_WAIT
>> (causing this problem to be manifested on boot for some machines).
>>
>> Cc: <sta...@vger.kernel.org> # v3.11+
>> Cc: Jeff Mahoney <je...@suse.com>
>> Fixes: b382a324b60f ("Btrfs: fix qgroup rescan resume on mount")
>> Signed-off-by: Aleksa Sarai <asa...@suse.de>
> 
> Indeed, looking at the code it seems that b382a324b60f ("Btrfs: fix
> qgroup rescan resume on mount") missed adding the qgroup_rescan_resume
> in the remount path. One thing which I couldn't verify though is whether
> reading fs_info->qgroup_flags without any locking is safe from remount
> context.
> 
> During remount I don't see any locks taken that prevent operations which
> can modify qgroup_flags.
> 
> 

Further inspection reveals that the access rules for qgroup_flags are
somewhat broken, so this patch doesn't really make things any worse than
they are. As such:

Reviewed-by: Nikolay Borisov <nbori...@suse.com>
Tested-by: Nikolay Borisov <nbori...@suse.com>

> 
>> ---
>>  fs/btrfs/super.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 6346876c97ea..ff6690389343 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -1821,6 +1821,8 @@ static int btrfs_remount(struct super_block *sb, int 
>> *flags, char *data)
>>  goto restore;
>>  }
>>  
>> +btrfs_qgroup_rescan_resume(fs_info);
>> +
>>  if (!fs_info->uuid_root) {
>>  btrfs_info(fs_info, "creating UUID tree");
>>  ret = btrfs_create_uuid_tree(fs_info);
>>


Re: [PATCH] btrfs: resume qgroup rescan on rw remount

2017-07-10 Thread Nikolay Borisov


On  4.07.2017 14:49, Aleksa Sarai wrote:
> Several distributions mount the "proper root" as ro during initrd and
> then remount it as rw before pivot_root(2). Thus, if a rescan had been
> aborted by a previous shutdown, the rescan would never be resumed.
> 
> This issue would manifest itself as several btrfs ioctl(2)s causing the
> entire machine to hang when btrfs_qgroup_wait_for_completion was hit
> (due to the fs_info->qgroup_rescan_running flag being set but the rescan
> itself not being resumed). Notably, Docker's btrfs storage driver makes
> regular use of BTRFS_QUOTA_CTL_DISABLE and BTRFS_IOC_QUOTA_RESCAN_WAIT
> (causing this problem to be manifested on boot for some machines).
> 
> Cc:  # v3.11+
> Cc: Jeff Mahoney 
> Fixes: b382a324b60f ("Btrfs: fix qgroup rescan resume on mount")
> Signed-off-by: Aleksa Sarai 

Indeed, looking at the code it seems that b382a324b60f ("Btrfs: fix
qgroup rescan resume on mount") missed adding the qgroup_rescan_resume
in the remount path. One thing which I couldn't verify though is whether
reading fs_info->qgroup_flags without any locking is safe from remount
context.

During remount I don't see any locks taken that prevent operations which
can modify qgroup_flags.



> ---
>  fs/btrfs/super.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 6346876c97ea..ff6690389343 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1821,6 +1821,8 @@ static int btrfs_remount(struct super_block *sb, int 
> *flags, char *data)
>   goto restore;
>   }
>  
> + btrfs_qgroup_rescan_resume(fs_info);
> +
>   if (!fs_info->uuid_root) {
>   btrfs_info(fs_info, "creating UUID tree");
>   ret = btrfs_create_uuid_tree(fs_info);
> 


Re: [PATCH] writeback: Simplify wb_stat_sum

2017-06-26 Thread Nikolay Borisov
[CC'ing Andrew since he seems to be taking those patches through -mm ]

On 23.06.2017 18:11, Nikolay Borisov wrote:
> wb_stat_sum disables interrupts and calls __wb_stat_sum which eventually calls
> __percpu_counter_sum. However, the percpu routine is already irq-safe. Simplify
> the code a bit by making wb_stat_sum directly call percpu_counter_sum_positive
> and not disable interrupts. Also remove the now-unneeded __wb_stat_sum, which
> was just a wrapper over percpu_counter_sum_positive.
> 
> Signed-off-by: Nikolay Borisov <nbori...@suse.com>
> ---
>  include/linux/backing-dev.h | 15 +--
>  1 file changed, 1 insertion(+), 14 deletions(-)
> 
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index e9c967b86054..854e1bdd0b2a 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -84,22 +84,9 @@ static inline s64 wb_stat(struct bdi_writeback *wb, enum 
> wb_stat_item item)
>   return percpu_counter_read_positive(&wb->stat[item]);
>  }
>  
> -static inline s64 __wb_stat_sum(struct bdi_writeback *wb,
> - enum wb_stat_item item)
> -{
> - return percpu_counter_sum_positive(&wb->stat[item]);
> -}
> -
>  static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item 
> item)
>  {
> - s64 sum;
> - unsigned long flags;
> -
> - local_irq_save(flags);
> - sum = __wb_stat_sum(wb, item);
> - local_irq_restore(flags);
> -
> - return sum;
> + return percpu_counter_sum_positive(&wb->stat[item]);
>  }
>  
>  extern void wb_writeout_inc(struct bdi_writeback *wb);
> 


[PATCH] writeback: Simplify wb_stat_sum

2017-06-23 Thread Nikolay Borisov
wb_stat_sum disables interrupts and calls __wb_stat_sum which eventually calls
__percpu_counter_sum. However, the percpu routine is already irq-safe. Simplify
the code a bit by making wb_stat_sum directly call percpu_counter_sum_positive
and not disable interrupts. Also remove the now-unneeded __wb_stat_sum, which was
just a wrapper over percpu_counter_sum_positive.
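
For context, the percpu summation already runs with IRQs disabled internally,
which is why the extra local_irq_save() buys nothing. Paraphrased from
lib/percpu_counter.c of this era:

	s64 __percpu_counter_sum(struct percpu_counter *fbc)
	{
		s64 ret;
		int cpu;
		unsigned long flags;

		/* the irqsave spinlock is what makes the caller's
		 * local_irq_save() redundant */
		raw_spin_lock_irqsave(&fbc->lock, flags);
		ret = fbc->count;
		for_each_online_cpu(cpu)
			ret += *per_cpu_ptr(fbc->counters, cpu);
		raw_spin_unlock_irqrestore(&fbc->lock, flags);
		return ret;
	}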

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---
 include/linux/backing-dev.h | 15 +--
 1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index e9c967b86054..854e1bdd0b2a 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -84,22 +84,9 @@ static inline s64 wb_stat(struct bdi_writeback *wb, enum 
wb_stat_item item)
	return percpu_counter_read_positive(&wb->stat[item]);
 }
 
-static inline s64 __wb_stat_sum(struct bdi_writeback *wb,
-   enum wb_stat_item item)
-{
-   return percpu_counter_sum_positive(&wb->stat[item]);
-}
-
 static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item)
 {
-   s64 sum;
-   unsigned long flags;
-
-   local_irq_save(flags);
-   sum = __wb_stat_sum(wb, item);
-   local_irq_restore(flags);
-
-   return sum;
+   return percpu_counter_sum_positive(&wb->stat[item]);
 }
 
 extern void wb_writeout_inc(struct bdi_writeback *wb);
-- 
2.7.4



[PATCH 1/4] remove mapping from balance_dirty_pages*()

2017-06-22 Thread Nikolay Borisov
From: Josef Bacik <jba...@fb.com>

The only reason we pass in the mapping is to get the inode in order to see if
cgroup writeback is enabled, and even then it only checks the bdi and a super
block flag.  balance_dirty_pages() doesn't even use the mapping.  Since
balance_dirty_pages*() works on a bdi level, just pass in the bdi and super
block directly so we can avoid using the mapping.  This will allow us to still
use balance_dirty_pages for dirty metadata pages that are not backed by an
address_space.
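
Caller-side the conversion is mechanical; e.g. the block2mtd hunk below
becomes:

	struct inode *inode = dev->blkdev->bd_inode;

	/* before: balance_dirty_pages_ratelimited(inode->i_mapping); */
	balance_dirty_pages_ratelimited(inode_to_bdi(inode), inode->i_sb);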

Signed-off-by: Josef Bacik <jba...@fb.com>
Reviewed-by: Jan Kara <j...@suse.cz>
Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---

Changes since previous posting [1]:

 - No functional/logic changes, just forwarded-ported to 4.12-rc6, as such 
 I've retained the acked-by and reviewed-by tags. 

 [1] https://patchwork.kernel.org/patch/9395201/

 drivers/mtd/devices/block2mtd.c | 12 
 fs/btrfs/disk-io.c  |  6 +++---
 fs/btrfs/file.c |  3 ++-
 fs/btrfs/ioctl.c|  3 ++-
 fs/btrfs/relocation.c   |  3 ++-
 fs/buffer.c |  3 ++-
 fs/iomap.c  |  6 --
 fs/ntfs/attrib.c| 10 +++---
 fs/ntfs/file.c  |  4 ++--
 include/linux/backing-dev.h | 29 +++--
 include/linux/writeback.h   |  3 ++-
 mm/filemap.c|  4 +++-
 mm/memory.c |  5 -
 mm/page-writeback.c | 15 +++
 14 files changed, 71 insertions(+), 35 deletions(-)

diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c
index 7c887f111a7d..7892d0b9fcb0 100644
--- a/drivers/mtd/devices/block2mtd.c
+++ b/drivers/mtd/devices/block2mtd.c
@@ -52,7 +52,8 @@ static struct page *page_read(struct address_space *mapping, 
int index)
 /* erase a specified part of the device */
 static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
 {
-   struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+   struct inode *inode = dev->blkdev->bd_inode;
+   struct address_space *mapping = inode->i_mapping;
struct page *page;
int index = to >> PAGE_SHIFT;   // page index
int pages = len >> PAGE_SHIFT;
@@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t 
to, size_t len)
memset(page_address(page), 0xff, PAGE_SIZE);
set_page_dirty(page);
unlock_page(page);
-   balance_dirty_pages_ratelimited(mapping);
+   
balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+   inode->i_sb);
break;
}
 
@@ -141,7 +143,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, 
const u_char *buf,
loff_t to, size_t len, size_t *retlen)
 {
struct page *page;
-   struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+   struct inode *inode = dev->blkdev->bd_inode;
+   struct address_space *mapping = inode->i_mapping;
int index = to >> PAGE_SHIFT;   // page index
int offset = to & ~PAGE_MASK;   // page offset
int cpylen;
@@ -162,7 +165,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, 
const u_char *buf,
memcpy(page_address(page) + offset, buf, cpylen);
set_page_dirty(page);
unlock_page(page);
-   balance_dirty_pages_ratelimited(mapping);
+   balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+   inode->i_sb);
}
put_page(page);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0ebd44135f1f..c6c6c498df73 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4077,9 +4077,9 @@ static void __btrfs_btree_balance_dirty(struct 
btrfs_fs_info *fs_info,
 
	ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
 BTRFS_DIRTY_METADATA_THRESH);
-   if (ret > 0) {
-   
balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
-   }
+   if (ret > 0)
+   balance_dirty_pages_ratelimited(fs_info->sb->s_bdi,
+   fs_info->sb);
 }
 
 void btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index da1096eb1a40..34ea85a81084 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1780,7 +1780,8 @@ static noinline ssize_t __btrfs_buffered_write(struct 
file *file,
 
cond_resched()


[PATCH 3/4] writeback: add counters for metadata usage

2017-06-22 Thread Nikolay Borisov
From: Josef Bacik <jba...@fb.com>

Btrfs has no bounds except memory on the amount of dirty memory that we have in
use for metadata.  Historically we have used a special inode so we could take
advantage of the balance_dirty_pages throttling that comes with using pagecache.
However as we'd like to support different blocksizes it would be nice to not
have to rely on pagecache, but still get the balance_dirty_pages throttling
without having to do it ourselves.

So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
zone and bdi_writeback counters to keep track of how many bytes we have in
flight for METADATA.  We need to count in bytes as blocksizes could be
fractions of the page size.  We simply convert the bytes to a number of pages
where it is needed for the throttling.

Also introduce NR_METADATA_BYTES so we can keep track of the total amount of
memory used for metadata on the system.  This is also needed so things like
dirty throttling know that this is dirtyable memory as well and easily
reclaimed.

This patch doesn't introduce any functional changes.
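
The byte conversions are simple shifts, as introduced in the hunks below:

	#define BtoK(x) ((x) >> 10)          /* bytes -> KiB, for meminfo output */
	#define BtoP(x) ((x) >> PAGE_SHIFT)  /* bytes -> pages, for throttling */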

Signed-off-by: Josef Bacik <jba...@fb.com>
Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---

Changes since previous posting [1]: 
 
 - Forward ported to 4.12-rc6

 - Factored the __add_wb_stat calls out of the irq-disabled region,
 since the helper is already irq-safe. This was commented on by Tejun in the
 previous posting. 

 This patch had a Reviewed-by: Jan Kara <j...@suse.cz> tag but I've omitted
 it due to my changes. 

[1] https://patchwork.kernel.org/patch/9395205/

 drivers/base/node.c  |   8 ++
 fs/fs-writeback.c|   2 +
 fs/proc/meminfo.c|   6 ++
 include/linux/backing-dev-defs.h |   2 +
 include/linux/backing-dev.h  |   2 +
 include/linux/mm.h   |   9 +++
 include/linux/mmzone.h   |   3 +
 include/trace/events/writeback.h |  13 +++-
 mm/backing-dev.c |   4 +
 mm/page-writeback.c  | 157 +++
 mm/page_alloc.c  |  21 +-
 mm/util.c|   2 +
 mm/vmscan.c  |  19 -
 mm/vmstat.c  |   3 +
 14 files changed, 229 insertions(+), 22 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index b6f563a3a3a9..65deb8ece4b9 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -50,6 +50,8 @@ static inline ssize_t node_read_cpulist(struct device *dev,
 static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
+#define BtoK(x) ((x) >> 10)
+
 static ssize_t node_read_meminfo(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -98,7 +100,10 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
n += sprintf(buf + n,
   "Node %d Dirty:  %8lu kB\n"
+  "Node %d MetadataDirty:  %8lu kB\n"
   "Node %d Writeback:  %8lu kB\n"
+  "Node %d MetaWriteback:  %8lu kB\n"
+  "Node %d Metadata:   %8lu kB\n"
   "Node %d FilePages:  %8lu kB\n"
   "Node %d Mapped: %8lu kB\n"
   "Node %d AnonPages:  %8lu kB\n"
@@ -118,7 +123,10 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
,
   nid, PtoK(node_page_state(pgdat, NR_FILE_DIRTY)),
+  nid, BtoK(node_page_state(pgdat, 
NR_METADATA_DIRTY_BYTES)),
   nid, PtoK(node_page_state(pgdat, NR_WRITEBACK)),
+  nid, BtoK(node_page_state(pgdat, 
NR_METADATA_WRITEBACK_BYTES)),
+  nid, BtoK(node_page_state(pgdat, NR_METADATA_BYTES)),
   nid, PtoK(node_page_state(pgdat, NR_FILE_PAGES)),
   nid, PtoK(node_page_state(pgdat, NR_FILE_MAPPED)),
   nid, PtoK(node_page_state(pgdat, NR_ANON_MAPPED)),
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 309364aab2a5..c7b33d124f3d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1814,6 +1814,7 @@ static struct wb_writeback_work 
*get_next_work_item(struct bdi_writeback *wb)
return work;
 }
 
+#define BtoP(x) ((x) >> PAGE_SHIFT)
 /*
  * Add in the number of potentially dirty inodes, because each inode
  * write can dirty pagecache in the underlying blockdev.
@@ -1822,6 +1823,7 @@ static unsigned long get_nr_dirty_pages(void)
 {
return global_node_page_state(NR_FILE_DIRTY) +
global_node_page_state(NR_UNSTABLE_NFS) +
+   BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)) +
get_nr_dirty_inodes();
 }
 
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c


[RFC PATCH 0/4] Support for metadata specific accounting

2017-06-22 Thread Nikolay Borisov
Hello, 

This series is a repost of Josef's original posting [1]. I've included a
fine-grained changelog in each patch with my changes. Basically, I've
forward-ported it to 4.12-rc6 and tried to incorporate the feedback which was
given to every individual patch (I've included a link with that information
in each individual patch).

The main rationale for pushing this is to enable btrfs' subpage-blocksize
patches to eventually be merged.

This patchset depends on patches (in the listed order) which have already
been submitted [2] [3] [4], but overall they don't hamper review. 


[1] https://www.spinics.net/lists/linux-btrfs/msg59976.html
[2] https://patchwork.kernel.org/patch/9800129/
[3] https://patchwork.kernel.org/patch/9800985/
[4] https://patchwork.kernel.org/patch/9799735/

Josef Bacik (4):
  remove mapping from balance_dirty_pages*()
  writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes
  writeback: add counters for metadata usage
  writeback: introduce super_operations->write_metadata

 drivers/base/node.c  |   8 ++
 drivers/mtd/devices/block2mtd.c  |  12 ++-
 fs/btrfs/disk-io.c   |   6 +-
 fs/btrfs/file.c  |   3 +-
 fs/btrfs/ioctl.c |   3 +-
 fs/btrfs/relocation.c|   3 +-
 fs/buffer.c  |   3 +-
 fs/fs-writeback.c|  74 +--
 fs/fuse/file.c   |   4 +-
 fs/iomap.c   |   6 +-
 fs/ntfs/attrib.c |  10 +-
 fs/ntfs/file.c   |   4 +-
 fs/proc/meminfo.c|   6 ++
 fs/super.c   |   7 ++
 include/linux/backing-dev-defs.h |   8 +-
 include/linux/backing-dev.h  |  51 +--
 include/linux/fs.h   |   4 +
 include/linux/mm.h   |   9 ++
 include/linux/mmzone.h   |   3 +
 include/linux/writeback.h|   3 +-
 include/trace/events/writeback.h |  13 ++-
 mm/backing-dev.c |  15 ++-
 mm/filemap.c |   4 +-
 mm/memory.c  |   5 +-
 mm/page-writeback.c  | 192 ---
 mm/page_alloc.c  |  21 -
 mm/util.c|   2 +
 mm/vmscan.c  |  19 +++-
 mm/vmstat.c  |   3 +
 29 files changed, 418 insertions(+), 83 deletions(-)

-- 
2.7.4



[PATCH 4/4] writeback: introduce super_operations->write_metadata

2017-06-22 Thread Nikolay Borisov
From: Josef Bacik <jba...@fb.com>

Now that we have metadata counters in the VM, we need to provide a way to kick
writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
allows file systems to deal with writing back any dirty metadata we need based
on the writeback needs of the system.  Since there is no inode to key off of we
need a list in the bdi for dirty super blocks to be added.  From there we can
find any dirty sb's on the bdi we are currently doing writeback on and call into
their ->write_metadata callback.
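
A sketch of how a filesystem might wire up the new hook (hypothetical myfs_*
names; the signature and the contract of consuming wbc->nr_to_write are
inferred from the writeback_sb_metadata() call site below):

	static void myfs_write_metadata(struct super_block *sb,
					struct writeback_control *wbc)
	{
		/* write back dirty metadata, decrementing wbc->nr_to_write
		 * by one for each page-sized chunk written */
	}

	static const struct super_operations myfs_super_ops = {
		/* ... other ops ... */
		.write_metadata	= myfs_write_metadata,
	};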

Signed-off-by: Josef Bacik <jba...@fb.com>
Reviewed-by: Jan Kara <j...@suse.cz>
Reviewed-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---

Changes since previous posting [1] :

 - Forward ported to 4.12-rc6 kernel

 I've retained the review-by tags since I didn't introduce any changes. 

[1] https://patchwork.kernel.org/patch/9395213/
 fs/fs-writeback.c| 72 
 fs/super.c   |  7 
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/fs.h   |  4 +++
 mm/backing-dev.c |  2 ++
 5 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c7b33d124f3d..9fa2b6cfaf5b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback 
*wb,
return pages;
 }
 
+static long writeback_sb_metadata(struct super_block *sb,
+ struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
+{
+   struct writeback_control wbc = {
+   .sync_mode  = work->sync_mode,
+   .tagged_writepages  = work->tagged_writepages,
+   .for_kupdate= work->for_kupdate,
+   .for_background = work->for_background,
+   .for_sync   = work->for_sync,
+   .range_cyclic   = work->range_cyclic,
+   .range_start= 0,
+   .range_end  = LLONG_MAX,
+   };
+   long write_chunk;
+
+   write_chunk = writeback_chunk_size(wb, work);
+   wbc.nr_to_write = write_chunk;
	sb->s_op->write_metadata(sb, &wbc);
+   work->nr_pages -= write_chunk - wbc.nr_to_write;
+
+   return write_chunk - wbc.nr_to_write;
+}
+
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -1505,6 +1530,7 @@ static long writeback_sb_inodes(struct super_block *sb,
unsigned long start_time = jiffies;
long write_chunk;
long wrote = 0;  /* count both pages and inodes */
+   bool done = false;
 
	while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1621,12 +1647,18 @@ static long writeback_sb_inodes(struct super_block *sb,
 * background threshold and other termination conditions.
 */
if (wrote) {
-   if (time_is_before_jiffies(start_time + HZ / 10UL))
-   break;
-   if (work->nr_pages <= 0)
+   if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+   work->nr_pages <= 0) {
+   done = true;
break;
+   }
}
}
+   if (!done && sb->s_op->write_metadata) {
+   spin_unlock(&wb->list_lock);
+   wrote += writeback_sb_metadata(sb, wb, work);
+   spin_lock(&wb->list_lock);
+   }
return wrote;
 }
 
@@ -1635,6 +1667,7 @@ static long __writeback_inodes_wb(struct bdi_writeback 
*wb,
 {
unsigned long start_time = jiffies;
long wrote = 0;
+   bool done = false;
 
	while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1654,12 +1687,39 @@ static long __writeback_inodes_wb(struct bdi_writeback 
*wb,
 
/* refer to the same tests at the end of writeback_sb_inodes */
if (wrote) {
-   if (time_is_before_jiffies(start_time + HZ / 10UL))
-   break;
-   if (work->nr_pages <= 0)
+   if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+   work->nr_pages <= 0) {
+   done = true;
break;
+   }
}
}
+
+   if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) {
+   LIST_HEAD(list);
+
+   spin_unlock(&wb->list_lock);
+   spin_lock(&wb->bdi->sb_list_lock);
+   list_splice_init(>

[PATCH 2/4] writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes

2017-06-22 Thread Nikolay Borisov
From: Josef Bacik <jba...@fb.com>

These are counters that constantly go up in order to do bandwidth calculations.
It isn't important what the units are in, as long as they are consistent between
the two of them, so convert them to count bytes written/dirtied, and allow the
metadata accounting stuff to change the counters as well. Additionally, scale
WB_STAT_BATCH based on whether we are incrementing byte-based or page-based
counters.
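
To make the batch scaling concrete (a sketch assuming 4 KiB pages, i.e.
PAGE_SHIFT == 12): percpu_counter_add_batch() folds a CPU's local delta into
the global count once it reaches the batch, so byte-sized increments need a
page-sized multiple of the batch to fold at the same rate as page counters:

	s32 page_batch = WB_STAT_BATCH;               /* folds every ~batch pages */
	s32 byte_batch = WB_STAT_BATCH << PAGE_SHIFT; /* same fold rate in bytes */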

Signed-off-by: Josef Bacik <jba...@fb.com>
Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---

Changes since previous posting [1]:

 - Incorporated Jan Kara's feedback to rename __wb_writeout_inc to 
 __wb_writeout_add. 

 - Eliminated IRQ clustering in account_page_dirtied() by  converting
 inc_wb_stat to __add_wb_stat since the latter is already irq-safe. This was 
 requested by Tejun. 

 - After talking privately with Jan, he mentioned that the way the percpu
 counters were used to account bytes could lead to constantly hitting the slow
 path. This is because WB_STAT_BATCH is not scaled for bytes. I've
 implemented very simple logic to do that in __add_wb_stat.

 - Forward ported to 4.12-rc6

One thing which will likely have to change with this patch is that currently
the wb_completion counts assume that each completion has done 4k worth of
pages. With subpage blocksizes, however, a completion needn't have written 4k.
As such, he suggested converting the accounting in wb_domain to explicitly
track the number of bytes written in a period rather than completions per
period, as he was afraid of skewing otherwise. Tejun, what's your take on that?

This patch had an ack from Tejun previously but due to my changes I haven't 
added it. 

[1] https://patchwork.kernel.org/patch/9395219/

 fs/fuse/file.c   |  4 ++--
 include/linux/backing-dev-defs.h |  4 ++--
 include/linux/backing-dev.h  | 20 ++--
 mm/backing-dev.c |  9 +
 mm/page-writeback.c  | 20 ++--
 5 files changed, 37 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3ee4fdc3da9e..2521f70ab8a6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1463,7 +1463,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, 
struct fuse_req *req)
for (i = 0; i < req->num_pages; i++) {
		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP);
-   wb_writeout_inc(&bdi->wb);
+   wb_writeout_add(&bdi->wb, PAGE_SIZE);
}
	wake_up(&fi->page_waitq);
 }
@@ -1767,7 +1767,7 @@ static bool fuse_writepage_in_flight(struct fuse_req 
*new_req,
 
	dec_wb_stat(&bdi->wb, WB_WRITEBACK);
	dec_node_page_state(page, NR_WRITEBACK_TEMP);
-   wb_writeout_inc(&bdi->wb);
+   wb_writeout_add(&bdi->wb, PAGE_SIZE);
fuse_writepage_free(fc, new_req);
fuse_request_free(new_req);
goto out;
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 866c433e7d32..ded45ac2cec7 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -36,8 +36,8 @@ typedef int (congested_fn)(void *, int);
 enum wb_stat_item {
WB_RECLAIMABLE,
WB_WRITEBACK,
-   WB_DIRTIED,
-   WB_WRITTEN,
+   WB_DIRTIED_BYTES,
+   WB_WRITTEN_BYTES,
NR_WB_STAT_ITEMS
 };
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index bdedea9be0a6..8b5a2e98b779 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -66,7 +66,23 @@ static inline bool bdi_has_dirty_io(struct backing_dev_info 
*bdi)
 static inline void __add_wb_stat(struct bdi_writeback *wb,
 enum wb_stat_item item, s64 amount)
 {
-   percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
+   s32 batch_size;
+
+   /*
+* When working with bytes, scale the batch size to reduce hitting
+* the slow path in the percpu counter
+*/
+   switch (item) {
+   case WB_DIRTIED_BYTES:
+   case WB_WRITTEN_BYTES:
+   batch_size = WB_STAT_BATCH << PAGE_SHIFT;
+   break;
+   default:
+   batch_size = WB_STAT_BATCH;
+   break;
+
+   }
+   percpu_counter_add_batch(&wb->stat[item], amount, batch_size);
 }
 
 static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item 
item)
@@ -102,7 +118,7 @@ static inline s64 wb_stat_sum(struct bdi_writeback *wb, 
enum wb_stat_item item)
return sum;
 }
 
-extern void wb_writeout_inc(struct bdi_writeback *wb);
+extern void wb_writeout_add(struct bdi_writeback *wb, long bytes);
 
 /*
  * maximal error of a stat counter.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0c09dd103109..2eef55428654 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -67,14 +67,15 @@

[RFC PATCH 0/4] Support for metadata specific accounting

2017-06-22 Thread Nikolay Borisov
Hello, 

This series is a report of Josef's original posting [1]. I've included 
fine-grained changelog in each patch with my changes. Basically, I've forward
ported it to 4.12-rc6 and tried incorporating the feedback which was given to 
every individual patch (I've included link with that information in each 
individual patch). 

The main rationale of pushing this is to enable btrfs' subpage-blocksizes
patches to eventually be merged.

This patchset depends on patches (in listed order) which have already
been submitted [2] [3] [4]. But overall they don't hamper review. 


[1] https://www.spinics.net/lists/linux-btrfs/msg59976.html
[2] https://patchwork.kernel.org/patch/9800129/
[3] https://patchwork.kernel.org/patch/9800985/
[4] https://patchwork.kernel.org/patch/9799735/

Josef Bacik (4):
  remove mapping from balance_dirty_pages*()
  writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes
  writeback: add counters for metadata usage
  writeback: introduce super_operations->write_metadata

 drivers/base/node.c  |   8 ++
 drivers/mtd/devices/block2mtd.c  |  12 ++-
 fs/btrfs/disk-io.c   |   6 +-
 fs/btrfs/file.c  |   3 +-
 fs/btrfs/ioctl.c |   3 +-
 fs/btrfs/relocation.c|   3 +-
 fs/buffer.c  |   3 +-
 fs/fs-writeback.c|  74 +--
 fs/fuse/file.c   |   4 +-
 fs/iomap.c   |   6 +-
 fs/ntfs/attrib.c |  10 +-
 fs/ntfs/file.c   |   4 +-
 fs/proc/meminfo.c|   6 ++
 fs/super.c   |   7 ++
 include/linux/backing-dev-defs.h |   8 +-
 include/linux/backing-dev.h  |  51 +--
 include/linux/fs.h   |   4 +
 include/linux/mm.h   |   9 ++
 include/linux/mmzone.h   |   3 +
 include/linux/writeback.h|   3 +-
 include/trace/events/writeback.h |  13 ++-
 mm/backing-dev.c |  15 ++-
 mm/filemap.c |   4 +-
 mm/memory.c  |   5 +-
 mm/page-writeback.c  | 192 ---
 mm/page_alloc.c  |  21 -
 mm/util.c|   2 +
 mm/vmscan.c  |  19 +++-
 mm/vmstat.c  |   3 +
 29 files changed, 418 insertions(+), 83 deletions(-)

-- 
2.7.4



[PATCH 4/4] writeback: introduce super_operations->write_metadata

2017-06-22 Thread Nikolay Borisov
From: Josef Bacik 

Now that we have metadata counters in the VM, we need to provide a way to kick
writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
allows file systems to deal with writing back any dirty metadata we need based
on the writeback needs of the system.  Since there is no inode to key off of we
need a list in the bdi for dirty super blocks to be added.  From there we can
find any dirty sb's on the bdi we are currently doing writeback on and call into
their ->write_metadata callback.
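
As a rough illustration of how a filesystem might hook into this (this sketch
is not part of the patch; foo_sb_info, foo_has_dirty_meta and
foo_write_dirty_node are invented names):

static int foo_write_metadata(struct super_block *sb,
			      struct writeback_control *wbc)
{
	/* all foo_* symbols are placeholders for illustration only */
	struct foo_sb_info *sbi = sb->s_fs_info;

	while (wbc->nr_to_write > 0 && foo_has_dirty_meta(sbi)) {
		/* write back one dirty metadata node; returns pages written */
		long written = foo_write_dirty_node(sbi);

		if (written <= 0)
			break;
		wbc->nr_to_write -= written;
	}
	return 0;
}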

Signed-off-by: Josef Bacik 
Reviewed-by: Jan Kara 
Reviewed-by: Tejun Heo 
Signed-off-by: Nikolay Borisov 
---

Changes since previous posting [1] :

 - Forward ported to 4.12-rc6 kernel

 I've retained the Reviewed-by tags since I didn't introduce any changes. 

[1] https://patchwork.kernel.org/patch/9395213/
 fs/fs-writeback.c| 72 
 fs/super.c   |  7 
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/fs.h   |  4 +++
 mm/backing-dev.c |  2 ++
 5 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c7b33d124f3d..9fa2b6cfaf5b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback 
*wb,
return pages;
 }
 
+static long writeback_sb_metadata(struct super_block *sb,
+ struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
+{
+   struct writeback_control wbc = {
+   .sync_mode  = work->sync_mode,
+   .tagged_writepages  = work->tagged_writepages,
+   .for_kupdate= work->for_kupdate,
+   .for_background = work->for_background,
+   .for_sync   = work->for_sync,
+   .range_cyclic   = work->range_cyclic,
+   .range_start= 0,
+   .range_end  = LLONG_MAX,
+   };
+   long write_chunk;
+
+   write_chunk = writeback_chunk_size(wb, work);
+   wbc.nr_to_write = write_chunk;
+   sb->s_op->write_metadata(sb, &wbc);
+   work->nr_pages -= write_chunk - wbc.nr_to_write;
+
+   return write_chunk - wbc.nr_to_write;
+}
+
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -1505,6 +1530,7 @@ static long writeback_sb_inodes(struct super_block *sb,
unsigned long start_time = jiffies;
long write_chunk;
long wrote = 0;  /* count both pages and inodes */
+   bool done = false;
 
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1621,12 +1647,18 @@ static long writeback_sb_inodes(struct super_block *sb,
 * background threshold and other termination conditions.
 */
if (wrote) {
-   if (time_is_before_jiffies(start_time + HZ / 10UL))
-   break;
-   if (work->nr_pages <= 0)
+   if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+   work->nr_pages <= 0) {
+   done = true;
break;
+   }
}
}
+   if (!done && sb->s_op->write_metadata) {
+   spin_unlock(&wb->list_lock);
+   wrote += writeback_sb_metadata(sb, wb, work);
+   spin_lock(&wb->list_lock);
+   }
return wrote;
 }
 
@@ -1635,6 +1667,7 @@ static long __writeback_inodes_wb(struct bdi_writeback 
*wb,
 {
unsigned long start_time = jiffies;
long wrote = 0;
+   bool done = false;
 
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1654,12 +1687,39 @@ static long __writeback_inodes_wb(struct bdi_writeback 
*wb,
 
/* refer to the same tests at the end of writeback_sb_inodes */
if (wrote) {
-   if (time_is_before_jiffies(start_time + HZ / 10UL))
-   break;
-   if (work->nr_pages <= 0)
+   if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+   work->nr_pages <= 0) {
+   done = true;
break;
+   }
}
}
+
+   if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) {
+   LIST_HEAD(list);
+
+   spin_unlock(&wb->list_lock);
+   spin_lock(&wb->bdi->sb_list_lock);
+   list_splice_init(&wb->bdi->dirty_sb_list, &list);
+   while (!list_empty(&list)) {
+   

[PATCH 2/4] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes

2017-06-22 Thread Nikolay Borisov
From: Josef Bacik 

These are counters that constantly go up in order to do bandwidth calculations.
It isn't important what units they are in, as long as they are consistent between
the two of them, so convert them to count bytes written/dirtied, and allow the
metadata accounting stuff to change the counters as well. Additionally, scale
WB_STAT_BATCH based on whether we are incrementing byte-based or page-based
counters.

Signed-off-by: Josef Bacik 
Signed-off-by: Nikolay Borisov 
---

Changes since previous posting [1]:

 - Incorporated Jan Kara's feedback to rename __wb_writeout_inc to 
 __wb_writeout_add. 

 - Eliminated IRQ clustering in account_page_dirtied() by  converting
 inc_wb_stat to __add_wb_stat since the latter is already irq-safe. This was 
 requested by Tejun. 

 - After talking privately with Jan he mentioned that the way the percpu 
 counters were used to account bytes could lead to constant hit of the slow 
 path. This is due to the WB_STAT_BATCH not being scaled for bytes. I've 
 implemented a very simple logic to do that in __add_wb_stat

 - Forward ported to 4.12-rc6

One thing which will likely have to change with this patch is that the
wb_completion count currently assumes each completion has written 4k worth of
pages. With subpage blocksize, however, a completion needn't have written 4k.
As such, Jan suggested converting the accounting in wb_domain to explicitly
track the number of bytes written in a period rather than completions
per period, since he was afraid of skewing. Tejun, what's your take on that? 

This patch had an ack from Tejun previously but due to my changes I haven't 
added it. 

[1] https://patchwork.kernel.org/patch/9395219/
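
To see why the scaling matters, recall that WB_STAT_BATCH is sized for
page-granular increments (it is defined in backing-dev.h as
8*(1+ilog2(nr_cpu_ids))). A back-of-the-envelope example, assuming 4 CPUs and
4K pages (not part of the patch):

/*
 * WB_STAT_BATCH = 8 * (1 + ilog2(4)) = 24
 *
 * Byte-based counters now receive increments of PAGE_SIZE (4096), so an
 * unscaled batch of 24 would be exceeded on every add and each call would
 * take percpu_counter's spinlock-protected slow path. Scaling the batch:
 *
 *	batch_size = WB_STAT_BATCH << PAGE_SHIFT = 24 * 4096 = 98304
 *
 * restores roughly 24 page-sized fast-path adds per CPU between
 * slow-path folds, matching the page-based behaviour.
 */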

 fs/fuse/file.c   |  4 ++--
 include/linux/backing-dev-defs.h |  4 ++--
 include/linux/backing-dev.h  | 20 ++--
 mm/backing-dev.c |  9 +
 mm/page-writeback.c  | 20 ++--
 5 files changed, 37 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3ee4fdc3da9e..2521f70ab8a6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1463,7 +1463,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, 
struct fuse_req *req)
for (i = 0; i < req->num_pages; i++) {
dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP);
-   wb_writeout_inc(&bdi->wb);
+   wb_writeout_add(&bdi->wb, PAGE_SIZE);
}
wake_up(&fi->page_waitq);
 }
@@ -1767,7 +1767,7 @@ static bool fuse_writepage_in_flight(struct fuse_req 
*new_req,
 
dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_node_page_state(page, NR_WRITEBACK_TEMP);
-   wb_writeout_inc(&bdi->wb);
+   wb_writeout_add(&bdi->wb, PAGE_SIZE);
fuse_writepage_free(fc, new_req);
fuse_request_free(new_req);
goto out;
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 866c433e7d32..ded45ac2cec7 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -36,8 +36,8 @@ typedef int (congested_fn)(void *, int);
 enum wb_stat_item {
WB_RECLAIMABLE,
WB_WRITEBACK,
-   WB_DIRTIED,
-   WB_WRITTEN,
+   WB_DIRTIED_BYTES,
+   WB_WRITTEN_BYTES,
NR_WB_STAT_ITEMS
 };
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index bdedea9be0a6..8b5a2e98b779 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -66,7 +66,23 @@ static inline bool bdi_has_dirty_io(struct backing_dev_info 
*bdi)
 static inline void __add_wb_stat(struct bdi_writeback *wb,
 enum wb_stat_item item, s64 amount)
 {
-   percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
+   s32 batch_size;
+
+   /*
+* When working with bytes, scale the batch size to reduce hitting
+* the slow path in the percpu counter
+*/
+   switch (item) {
+   case WB_DIRTIED_BYTES:
+   case WB_WRITTEN_BYTES:
+   batch_size = WB_STAT_BATCH << PAGE_SHIFT;
+   break;
+   default:
+   batch_size = WB_STAT_BATCH;
+   break;
+   }
+   percpu_counter_add_batch(&wb->stat[item], amount, batch_size);
 }
 
 static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item 
item)
@@ -102,7 +118,7 @@ static inline s64 wb_stat_sum(struct bdi_writeback *wb, 
enum wb_stat_item item)
return sum;
 }
 
-extern void wb_writeout_inc(struct bdi_writeback *wb);
+extern void wb_writeout_add(struct bdi_writeback *wb, long bytes);
 
 /*
  * maximal error of a stat counter.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0c09dd103109..2eef55428654 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -67,14 +67,15 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
  

Re: [PATCH] mm: Refactor conversion of pages to bytes macro definitions

2017-06-22 Thread Nikolay Borisov


On 22.06.2017 09:44, Michal Hocko wrote:
> On Tue 20-06-17 18:14:28, Nikolay Borisov wrote:
>> Currently there are a multiple files with the following code:
>>  #define K(x) ((x) << (PAGE_SHIFT - 10))
>>  ... some code..
>>  #undef K
>>
>> This is mainly used to print out some memory-related statistics, where X is
>> given in pages and the macro just converts it to kilobytes. In the future
>> there is going to be more macros since there are intention to introduce
>> byte-based memory counters [1]. This could lead to proliferation of
>> multiple duplicated definition of various macros used to convert a quantity
>> from one unit to another. Let's try and consolidate such definition in the
>> mm.h header since currently it's being included in all files which exhibit
>> this pattern. Also let's rename it to something a bit more verbose.
>>
>> This patch doesn't introduce any functional changes
>>
>> [1] https://patchwork.kernel.org/patch/9395205/
>>
>> Signed-off-by: Nikolay Borisov <nbori...@suse.com>
>> ---
>>  arch/tile/mm/pgtable.c  |  2 --
>>  drivers/base/node.c | 66 ++---
>>  include/linux/mm.h  |  2 ++
>>  kernel/debug/kdb/kdb_main.c |  3 +-
>>  mm/backing-dev.c| 22 +
>>  mm/memcontrol.c | 17 +-
>>  mm/oom_kill.c   | 19 +--
>>  mm/page_alloc.c | 80 
>> ++---
>>  8 files changed, 100 insertions(+), 111 deletions(-)
> 
> Those macros are quite trivial and we do not really save much code while
> this touches a lot of code potentially causing some conflicts. So do we
> really need this? I am usually very keen on removing duplication but
> this doesn't seem to be worth all the troubles IMHO.
> 

There are 2 problems I see: 

1. K is in fact used for macros other than the pages-to-kbytes conversion. 
Simple grep before my patch is applied yields the following: 

arch/tile/mm/pgtable.c:#define K(x) ((x) << (PAGE_SHIFT-10))
arch/x86/crypto/serpent-sse2-i586-asm_32.S:#define K(x0, x1, x2, x3, x4, i) \
crypto/serpent_generic.c:#define K(x0, x1, x2, x3, i) ({
\
drivers/base/node.c:#define K(x) ((x) << (PAGE_SHIFT - 10))
drivers/net/hamradio/scc.c:#define K(x) kiss->x
include/uapi/linux/keyboard.h:#define K(t,v)(((t)<<8)|(v))
kernel/debug/kdb/kdb_main.c:#define K(x) ((x) << (PAGE_SHIFT - 10))
mm/backing-dev.c:#define K(x) ((x) << (PAGE_SHIFT - 10))
mm/backing-dev.c:#define K(pages) ((pages) << (PAGE_SHIFT - 10))
mm/memcontrol.c:#define K(x) ((x) << (PAGE_SHIFT-10))
mm/oom_kill.c:#define K(x) ((x) << (PAGE_SHIFT-10))
mm/page_alloc.c:#define K(x) ((x) << (PAGE_SHIFT-10))


Furthermore, I intend to send another patchset which introduces 2 more 
macros:
drivers/base/node.c:#define BtoK(x) ((x) >> 10)
drivers/video/fbdev/intelfb/intelfb.h:#define BtoKB(x)  ((x) / 1024)
mm/backing-dev.c:#define BtoK(x) ((x) >> 10)
mm/page_alloc.c:#define BtoK(x) ((x) >> 10)

fs/fs-writeback.c:#define BtoP(x) ((x) >> PAGE_SHIFT)
include/trace/events/writeback.h:#define BtoP(x) ((x) >> PAGE_SHIFT)
mm/page_alloc.c:#define BtoP(x) ((x) >> PAGE_SHIFT)

As you can see this ends up spreading those macros. Ideally 
they should be in a header which is shared among all affected 
files. This was inspired by the feedback that Tejun has given 
here: https://patchwork.kernel.org/patch/9395205/ and I believe
he is right. 








[PATCH v3] writeback: Rework wb_[dec|inc]_stat family of functions

2017-06-21 Thread Nikolay Borisov
Currently the writeback statistics code uses percpu counters to hold
various statistics. Furthermore we have 2 families of functions - those which
disable local irq and those which don't and whose names begin with a
double underscore. However, they both end up calling __add_wb_stat, which in
turn calls percpu_counter_add_batch, which is already irq-safe.

Exploiting this fact allows us to eliminate the __wb_* functions since they don't
add any further protection than we already have. Furthermore, refactor
the wb_* functions to call __add_wb_stat directly without the irq-disabling
dance. This will likely result in better runtime of code which deals with
modifying the stat counters.

While at it also document why percpu_counter_add_batch is in fact preempt and
irq-safe since at least 3 people got confused.
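
In effect the conversion collapses the call sites like this (a minimal
illustration, not taken from the diff below):

	/* before: explicit irq dance around an already irq-safe primitive */
	unsigned long flags;

	local_irq_save(flags);
	__inc_wb_stat(wb, WB_WRITEBACK);	/* -> __add_wb_stat(wb, item, 1) */
	local_irq_restore(flags);

	/* after: a single call; percpu_counter_add_batch handles irq safety */
	inc_wb_stat(wb, WB_WRITEBACK);		/* -> __add_wb_stat(wb, item, 1) */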

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---

Changes since v2: 
* Fixed build failure reported by kbuild test robot
* Explicitly document that percpu_counter_add_batch is preempt/irq safe
 fs/fs-writeback.c   |  8 
 include/linux/backing-dev.h | 24 ++--
 lib/percpu_counter.c|  7 +++
 mm/page-writeback.c | 10 +-
 4 files changed, 18 insertions(+), 31 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 63ee2940775c..309364aab2a5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -380,8 +380,8 @@ static void inode_switch_wbs_work_fn(struct work_struct 
*work)
struct page *page = radix_tree_deref_slot_protected(slot,
&mapping->tree_lock);
if (likely(page) && PageDirty(page)) {
-   __dec_wb_stat(old_wb, WB_RECLAIMABLE);
-   __inc_wb_stat(new_wb, WB_RECLAIMABLE);
+   dec_wb_stat(old_wb, WB_RECLAIMABLE);
+   inc_wb_stat(new_wb, WB_RECLAIMABLE);
}
}
 
@@ -391,8 +391,8 @@ static void inode_switch_wbs_work_fn(struct work_struct 
*work)
&mapping->tree_lock);
if (likely(page)) {
WARN_ON_ONCE(!PageWriteback(page));
-   __dec_wb_stat(old_wb, WB_WRITEBACK);
-   __inc_wb_stat(new_wb, WB_WRITEBACK);
+   dec_wb_stat(old_wb, WB_WRITEBACK);
+   inc_wb_stat(new_wb, WB_WRITEBACK);
}
}
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ace73f96eb1e..e9c967b86054 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -69,34 +69,14 @@ static inline void __add_wb_stat(struct bdi_writeback *wb,
percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
 }
 
-static inline void __inc_wb_stat(struct bdi_writeback *wb,
-enum wb_stat_item item)
-{
-   __add_wb_stat(wb, item, 1);
-}
-
 static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item 
item)
 {
-   unsigned long flags;
-
-   local_irq_save(flags);
-   __inc_wb_stat(wb, item);
-   local_irq_restore(flags);
-}
-
-static inline void __dec_wb_stat(struct bdi_writeback *wb,
-enum wb_stat_item item)
-{
-   __add_wb_stat(wb, item, -1);
+   __add_wb_stat(wb, item, 1);
 }
 
 static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item 
item)
 {
-   unsigned long flags;
-
-   local_irq_save(flags);
-   __dec_wb_stat(wb, item);
-   local_irq_restore(flags);
+   __add_wb_stat(wb, item, -1);
 }
 
 static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 8ee7e5ec21be..3bf4a9984f4c 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -72,6 +72,13 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 
amount)
 }
 EXPORT_SYMBOL(percpu_counter_set);
 
+/**
+ * This function is both preempt and irq safe. The former is due to explicit
+ * preemption disable. The latter is guaranteed by the fact that the slow path
+ * is explicitly protected by an irq-safe spinlock whereas the fast path uses
+ * this_cpu_add which is irq-safe by definition. Hence there is no need to muck
+ * with irq state before calling this one
+ */
 void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 
batch)
 {
s64 count;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 143c1c25d680..b7451891959a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -601,7 +601,7 @@ static inline void __wb_writeout_inc(struct bdi_writeback 
*wb)
 {
struct wb_domain *cgdom;
 
-   __inc_wb_stat(wb, WB_WRITTEN);
+   inc_wb_stat(wb, WB_WRITTEN);
wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
   wb->bdi->max_prop_frac);
 
@@ -2437,8 +2437,8 @@ void account_page_dirtied


Re: [PATCH net-next v3 4/4] ip6mr: add netlink notifications on mrt6msg cache reports

2017-06-20 Thread Nikolay Aleksandrov
On 20/06/17 23:54, Julien Gomes wrote:
> Add Netlink notifications on cache reports in ip6mr, in addition to the
> existing mrt6msg sent to mroute6_sk.
> Send RTM_NEWCACHEREPORT notifications to RTNLGRP_IPV6_MROUTE_R.
> 
> MSGTYPE, MIF_ID, SRC_ADDR and DST_ADDR Netlink attributes contain the
> same data as their equivalent fields in the mrt6msg header.
> PKT attribute is the packet sent to mroute6_sk, without the added
> mrt6msg header.
> 
> Suggested-by: Ryan Halbrook <halbr...@arista.com>
> Signed-off-by: Julien Gomes <jul...@arista.com>
> ---
>  include/uapi/linux/mroute6.h | 12 
>  net/ipv6/ip6mr.c | 71 
> ++--
>  2 files changed, 81 insertions(+), 2 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
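
For context, a userspace listener would consume these reports roughly as in
the sketch below. It assumes the RTNLGRP_IPV6_MROUTE_R group and
RTM_NEWCACHEREPORT message type introduced by this series, and elides all
error handling:

#include <stdio.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int main(void)
{
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	int grp = RTNLGRP_IPV6_MROUTE_R;	/* added by this series */
	char buf[8192];

	setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, &grp, sizeof(grp));
	for (;;) {
		ssize_t len = recv(fd, buf, sizeof(buf), 0);
		struct nlmsghdr *nlh;

		for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, len);
		     nlh = NLMSG_NEXT(nlh, len))
			if (nlh->nlmsg_type == RTM_NEWCACHEREPORT)
				printf("mrt6msg cache report received\n");
	}
	return 0;
}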





Re: [PATCH net-next v3 3/4] ipmr: add netlink notifications on igmpmsg cache reports

2017-06-20 Thread Nikolay Aleksandrov
On 20/06/17 23:54, Julien Gomes wrote:
> Add Netlink notifications on cache reports in ipmr, in addition to the
> existing igmpmsg sent to mroute_sk.
> Send RTM_NEWCACHEREPORT notifications to RTNLGRP_IPV4_MROUTE_R.
> 
> MSGTYPE, VIF_ID, SRC_ADDR and DST_ADDR Netlink attributes contain the
> same data as their equivalent fields in the igmpmsg header.
> PKT attribute is the packet sent to mroute_sk, without the added igmpmsg
> header.
> 
> Suggested-by: Ryan Halbrook <halbr...@arista.com>
> Signed-off-by: Julien Gomes <jul...@arista.com>
> ---
>  include/uapi/linux/mroute.h | 12 
>  net/ipv4/ipmr.c | 69 
> +++--
>  2 files changed, 79 insertions(+), 2 deletions(-)
> 

Thanks,

Reviewed-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>





Re: [PATCH v2 2/2] writeback: Rework wb_[dec|inc]_stat family of functions

2017-06-20 Thread Nikolay Borisov


On 20.06.2017 23:30, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jun 20, 2017 at 11:28:30PM +0300, Nikolay Borisov wrote:
>>> Heh, looks like I was confused.  __percpu_counter_add() is not
>>> irq-safe.  It disables preemption and uses __this_cpu_read(), so
>>> there's no protection against irq.  If writeback statistics want
>>> irq-safe operations and it does, it would need these separate
>>> operations.  Am I missing something?
>>
>> So looking at the history of the commit initially there was
>> preempt_disable + this_cpu_ptr which was later changed in:
>>
>> 819a72af8d66 ("percpucounter: Optimize __percpu_counter_add a bit
>> through the use of this_cpu() options.")
>>
>> I believe that having __this_cpu_read ensures that we get an atomic
>> snapshot of the variable but when we are doing the actual write e.g. the
>> else {} branch we actually use this_cpu_add which ought to be preempt +
>> irq safe, meaning we won't get torn write. In essence we have atomic
>> reads by merit of __this_cpu_read + atomic writes by merit of using
>> raw_spin_lock_irqsave in the if() branch and this_cpu_add in the else {}
>> branch.
> 
> Ah, you're right.  The initial read is speculative.  The slow path is
> protected with irq spinlock.  The fast path is this_cpu_add() which is
> irq-safe.  We really need to document these functions.
> 
> Can I bother you with adding documentation to them while you're at it?

Sure, I will likely resend with a fresh head on my shoulders.

> 
> Thanks.
> 



Re: [PATCH v2 2/2] writeback: Rework wb_[dec|inc]_stat family of functions

2017-06-20 Thread Nikolay Borisov


On 20.06.2017 22:37, Tejun Heo wrote:
> Hello, Nikolay.
> 
> On Tue, Jun 20, 2017 at 09:02:00PM +0300, Nikolay Borisov wrote:
>> Currently the writeback statistics code uses a percpu counters to hold
>> various statistics. Furthermore we have 2 families of functions - those which
>> disable local irq and those which doesn't and whose names begin with
>> double underscore. However, they both end up calling __add_wb_stats which in
>> turn calls percpu_counter_add_batch which is already irq-safe.
> 
> Heh, looks like I was confused.  __percpu_counter_add() is not
> irq-safe.  It disables preemption and uses __this_cpu_read(), so
> there's no protection against irq.  If writeback statistics want
> irq-safe operations and it does, it would need these separate
> operations.  Am I missing something?

So looking at the history of the commit initially there was
preempt_disable + this_cpu_ptr which was later changed in:

819a72af8d66 ("percpucounter: Optimize __percpu_counter_add a bit
through the use of this_cpu() options.")


I believe that having __this_cpu_read ensures that we get an atomic
snapshot of the variable, while for the actual write, e.g. the else {}
branch, we use this_cpu_add, which ought to be preempt and irq safe,
meaning we won't get a torn write. In essence we have atomic reads by
merit of __this_cpu_read and atomic writes by merit of using
raw_spin_lock_irqsave in the if () branch and this_cpu_add in the else {}
branch.
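
For reference, the function under discussion looks roughly like this in
lib/percpu_counter.c at the time (a paraphrased sketch, not quoted from the
tree): the speculative __this_cpu_read() feeds the batch check, the slow path
takes the irq-safe spinlock, and the fast path is a single this_cpu_add():

void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
{
	s64 count;

	preempt_disable();
	/* speculative snapshot of this CPU's local delta */
	count = __this_cpu_read(*fbc->counters) + amount;
	if (count >= batch || count <= -batch) {
		unsigned long flags;

		/* slow path: fold into the shared count under an irq-safe lock */
		raw_spin_lock_irqsave(&fbc->lock, flags);
		fbc->count += count;
		__this_cpu_sub(*fbc->counters, count - amount);
		raw_spin_unlock_irqrestore(&fbc->lock, flags);
	} else {
		/* fast path: this_cpu_add() is irq-safe on its own */
		this_cpu_add(*fbc->counters, amount);
	}
	preempt_enable();
}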

> 
> Thanks.
> 



[PATCH v2 2/2] writeback: Rework wb_[dec|inc]_stat family of functions

2017-06-20 Thread Nikolay Borisov
Currently the writeback statistics code uses percpu counters to hold
various statistics. Furthermore we have 2 families of functions - those which
disable local irq and those which don't and whose names begin with a
double underscore. However, they both end up calling __add_wb_stat, which in
turn calls percpu_counter_add_batch, which is already irq-safe.

Exploiting this fact allows us to eliminate the __wb_* functions since they don't
add any further protection than we already have. Furthermore, refactor
the wb_* functions to call __add_wb_stat directly without the irq-disabling
dance. This will likely result in better runtime of code which deals with
modifying the stat counters.

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---
 include/linux/backing-dev.h | 24 ++--
 mm/page-writeback.c | 10 +-
 2 files changed, 7 insertions(+), 27 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ace73f96eb1e..e9c967b86054 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -69,34 +69,14 @@ static inline void __add_wb_stat(struct bdi_writeback *wb,
percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
 }
 
-static inline void __inc_wb_stat(struct bdi_writeback *wb,
-enum wb_stat_item item)
-{
-   __add_wb_stat(wb, item, 1);
-}
-
 static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item 
item)
 {
-   unsigned long flags;
-
-   local_irq_save(flags);
-   __inc_wb_stat(wb, item);
-   local_irq_restore(flags);
-}
-
-static inline void __dec_wb_stat(struct bdi_writeback *wb,
-enum wb_stat_item item)
-{
-   __add_wb_stat(wb, item, -1);
+   __add_wb_stat(wb, item, 1);
 }
 
 static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item 
item)
 {
-   unsigned long flags;
-
-   local_irq_save(flags);
-   __dec_wb_stat(wb, item);
-   local_irq_restore(flags);
+   __add_wb_stat(wb, item, -1);
 }
 
 static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 143c1c25d680..b7451891959a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -601,7 +601,7 @@ static inline void __wb_writeout_inc(struct bdi_writeback 
*wb)
 {
struct wb_domain *cgdom;
 
-   __inc_wb_stat(wb, WB_WRITTEN);
+   inc_wb_stat(wb, WB_WRITTEN);
wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
   wb->bdi->max_prop_frac);
 
@@ -2437,8 +2437,8 @@ void account_page_dirtied(struct page *page, struct 
address_space *mapping)
__inc_node_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
__inc_node_page_state(page, NR_DIRTIED);
-   __inc_wb_stat(wb, WB_RECLAIMABLE);
-   __inc_wb_stat(wb, WB_DIRTIED);
+   inc_wb_stat(wb, WB_RECLAIMABLE);
+   inc_wb_stat(wb, WB_DIRTIED);
task_io_account_write(PAGE_SIZE);
current->nr_dirtied++;
this_cpu_inc(bdp_ratelimits);
@@ -2745,7 +2745,7 @@ int test_clear_page_writeback(struct page *page)
if (bdi_cap_account_writeback(bdi)) {
struct bdi_writeback *wb = inode_to_wb(inode);
 
-   __dec_wb_stat(wb, WB_WRITEBACK);
+   dec_wb_stat(wb, WB_WRITEBACK);
__wb_writeout_inc(wb);
}
}
@@ -2791,7 +2791,7 @@ int __test_set_page_writeback(struct page *page, bool 
keep_write)
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
-   __inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
+   inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
 
/*
 * We can come through here when swapping anonymous
-- 
2.7.4





[PATCH v2 1/2] percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch

2017-06-20 Thread Nikolay Borisov
Currently, in both !SMP and SMP configs percpu_counter_add calls
__percpu_counter_add which is preempt safe due to explicit calls to
preempt_disable. This state of play creates the false sense that
__percpu_counter_add is less SMP-safe than percpu_counter_add. They are both
identical irrespective of CONFIG_SMP. The only difference is that the
__ version takes a batch parameter.

Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---
 fs/btrfs/disk-io.c | 4 ++--
 fs/btrfs/extent_io.c   | 2 +-
 fs/btrfs/inode.c   | 4 ++--
 fs/xfs/xfs_mount.c | 4 ++--
 include/linux/backing-dev.h| 2 +-
 include/linux/blk-cgroup.h | 6 +++---
 include/linux/mman.h   | 2 +-
 include/linux/percpu_counter.h | 7 ---
 include/net/inet_frag.h| 4 ++--
 lib/flex_proportions.c | 6 +++---
 lib/percpu_counter.c   | 4 ++--
 11 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5f678dcb20e6..0ebd44135f1f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1255,7 +1255,7 @@ void clean_tree_block(struct btrfs_fs_info *fs_info,
btrfs_assert_tree_locked(buf);
 
if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
-   __percpu_counter_add(&fs_info->dirty_metadata_bytes,
+   percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
 -buf->len,
 fs_info->dirty_metadata_batch);
/* ugh, clear_extent_buffer_dirty needs to lock the 
page */
@@ -4049,7 +4049,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
buf->start, transid, fs_info->generation);
was_dirty = set_extent_buffer_dirty(buf);
if (!was_dirty)
-   __percpu_counter_add(&fs_info->dirty_metadata_bytes,
+   percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
 buf->len,
 fs_info->dirty_metadata_batch);
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e010005..a1c303f27699 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3597,7 +3597,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
spin_unlock(&eb->refs_lock);
btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
-   __percpu_counter_add(&fs_info->dirty_metadata_bytes,
+   percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
 -eb->len,
 fs_info->dirty_metadata_batch);
ret = 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ef3c98c527c1..fa138217219e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1766,7 +1766,7 @@ static void btrfs_set_bit_hook(struct inode *inode,
if (btrfs_is_testing(fs_info))
return;
 
-   __percpu_counter_add(&fs_info->delalloc_bytes, len,
+   percpu_counter_add_batch(&fs_info->delalloc_bytes, len,
 fs_info->delalloc_batch);
spin_lock(&BTRFS_I(inode)->lock);
BTRFS_I(inode)->delalloc_bytes += len;
@@ -1840,7 +1840,7 @@ static void btrfs_clear_bit_hook(struct btrfs_inode 
*inode,
&inode->vfs_inode,
state->start, len);
 
-   __percpu_counter_add(&fs_info->delalloc_bytes, -len,
+   percpu_counter_add_batch(&fs_info->delalloc_bytes, -len,
 fs_info->delalloc_batch);
spin_lock(&inode->lock);
inode->delalloc_bytes -= len;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 2eaf81859166..7147d4a8d207 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1209,7 +1209,7 @@ xfs_mod_icount(
struct xfs_mount*mp,
int64_t delta)
 {
-   __percpu_counter_add(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
+   percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
ASSERT(0);
percpu_counter_add(&mp->m_icount, -delta);
@@ -1288,7 +1288,7 @@ xfs_mod_fdblocks(
else
batch = XFS_FDBLOCKS_BATCH;
 
-   __percpu_counter_add(&mp->m_fdblocks, delta, batch);
+   percpu_counter_add_batch(&mp->m_fdblocks, delta, batch);
if (__percpu_counter_compare(&mp->m_fdblocks, mp->m_alloc_set_aside,
 XFS_FDBLOCKS_BATCH) >= 


[PATCH] mm: Refactor conversion of pages to bytes macro definitions

2017-06-20 Thread Nikolay Borisov
Currently there are multiple files with the following code:
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 ... some code..
 #undef K

This is mainly used to print out some memory-related statistics, where x is
given in pages and the macro just converts it to kilobytes. In the future
there are going to be more macros, since there is an intention to introduce
byte-based memory counters [1]. This could lead to a proliferation of
multiple duplicated definitions of various macros used to convert a quantity
from one unit to another. Let's try and consolidate such definitions in the
mm.h header, since it is currently included in all files which exhibit
this pattern. Also let's rename the macro to something a bit more verbose.

This patch doesn't introduce any functional changes

[1] https://patchwork.kernel.org/patch/9395205/
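
The consolidated helper itself is not visible in the hunks quoted below, but
from the call sites its definition in include/linux/mm.h is presumably the
straight rename of K():

/* pages -> kilobytes; replaces the per-file K(x) definitions */
#define PtoK(x) ((x) << (PAGE_SHIFT - 10))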

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---
 arch/tile/mm/pgtable.c  |  2 --
 drivers/base/node.c | 66 ++---
 include/linux/mm.h  |  2 ++
 kernel/debug/kdb/kdb_main.c |  3 +-
 mm/backing-dev.c| 22 +
 mm/memcontrol.c | 17 +-
 mm/oom_kill.c   | 19 +--
 mm/page_alloc.c | 80 ++---
 8 files changed, 100 insertions(+), 111 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 492a7361e58e..f04af570c1c2 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -34,8 +34,6 @@
 #include 
 #include 
 
-#define K(x) ((x) << (PAGE_SHIFT-10))
-
 /**
  * shatter_huge_page() - ensure a given address is mapped by a small page.
  *
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f9686016..b6f563a3a3a9 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -50,7 +50,6 @@ static inline ssize_t node_read_cpulist(struct device *dev,
 static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
-#define K(x) ((x) << (PAGE_SHIFT - 10))
 static ssize_t node_read_meminfo(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -72,19 +71,19 @@ static ssize_t node_read_meminfo(struct device *dev,
   "Node %d Inactive(file): %8lu kB\n"
   "Node %d Unevictable:%8lu kB\n"
   "Node %d Mlocked:%8lu kB\n",
-  nid, K(i.totalram),
-  nid, K(i.freeram),
-  nid, K(i.totalram - i.freeram),
-  nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
+  nid, PtoK(i.totalram),
+  nid, PtoK(i.freeram),
+  nid, PtoK(i.totalram - i.freeram),
+  nid, PtoK(node_page_state(pgdat, NR_ACTIVE_ANON) +
node_page_state(pgdat, NR_ACTIVE_FILE)),
-  nid, K(node_page_state(pgdat, NR_INACTIVE_ANON) +
+  nid, PtoK(node_page_state(pgdat, NR_INACTIVE_ANON) +
node_page_state(pgdat, NR_INACTIVE_FILE)),
-  nid, K(node_page_state(pgdat, NR_ACTIVE_ANON)),
-  nid, K(node_page_state(pgdat, NR_INACTIVE_ANON)),
-  nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
-  nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
-  nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
-  nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
+  nid, PtoK(node_page_state(pgdat, NR_ACTIVE_ANON)),
+  nid, PtoK(node_page_state(pgdat, NR_INACTIVE_ANON)),
+  nid, PtoK(node_page_state(pgdat, NR_ACTIVE_FILE)),
+  nid, PtoK(node_page_state(pgdat, NR_INACTIVE_FILE)),
+  nid, PtoK(node_page_state(pgdat, NR_UNEVICTABLE)),
+  nid, PtoK(sum_zone_node_page_state(nid, NR_MLOCK)));
 
 #ifdef CONFIG_HIGHMEM
n += sprintf(buf + n,
@@ -92,10 +91,10 @@ static ssize_t node_read_meminfo(struct device *dev,
   "Node %d HighFree:   %8lu kB\n"
   "Node %d LowTotal:   %8lu kB\n"
   "Node %d LowFree:%8lu kB\n",
-  nid, K(i.totalhigh),
-  nid, K(i.freehigh),
-  nid, K(i.totalram - i.totalhigh),
-  nid, K(i.freeram - i.freehigh));
+  nid, PtoK(i.totalhigh),
+  nid, PtoK(i.freehigh),
+  nid, PtoK(i.totalram - i.totalhigh),
+  nid, PtoK(i.freeram - i.freehigh));
 #endif
n += sprintf(buf + n,
   "Node %d Dirty:  %8lu kB\n"
@@ -118,36 +117,35 @@ static ssize_t node_read_


[PATCH 1/2] percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch

2017-06-20 Thread Nikolay Borisov
252e0ba6b77d ("lib: percpu_counter variable batch") added a batched version
of percpu_counter_add. However, one problem with this patch is the fact that it
overloads the meaning of double underscore, which in kernel-land are taken
to implicitly mean there is no preempt protection for the API. Currently, in
both !SMP and SMP configs percpu_counter_add calls __percpu_counter_add which
is preempt safe due to explicit calls to preempt_disable. This state of play
creates the false sense that __percpu_counter_add is less SMP-safe than
percpu_counter_add. They are both identical irrespective of CONFIG_SNMP value.
The only difference is that the __ version takes a batch parameter.

Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.
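
To illustrate the end state (a sketch of include/linux/percpu_counter.h
after the rename; not taken from this diff's context lines): the
un-prefixed helper simply forwards to the explicitly named batched
primitive with the default batch:

	static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
	{
		/* same behaviour as before, minus the misleading underscores */
		percpu_counter_add_batch(fbc, amount, percpu_counter_batch);
	}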

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---
 fs/btrfs/disk-io.c | 4 ++--
 fs/btrfs/extent_io.c   | 2 +-
 fs/btrfs/inode.c   | 4 ++--
 fs/xfs/xfs_mount.c | 4 ++--
 include/linux/backing-dev.h| 2 +-
 include/linux/blk-cgroup.h | 6 +++---
 include/linux/mman.h   | 2 +-
 include/linux/percpu_counter.h | 7 ---
 include/net/inet_frag.h| 4 ++--
 lib/flex_proportions.c | 6 +++---
 lib/percpu_counter.c   | 4 ++--
 11 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2eaa1b1db08d..9fc37d6641cd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1255,7 +1255,7 @@ void clean_tree_block(struct btrfs_fs_info *fs_info,
btrfs_assert_tree_locked(buf);
 
if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
-   __percpu_counter_add(&fs_info->dirty_metadata_bytes,
+   percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
 -buf->len,
 fs_info->dirty_metadata_batch);
/* ugh, clear_extent_buffer_dirty needs to lock the 
page */
@@ -4048,7 +4048,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
buf->start, transid, fs_info->generation);
was_dirty = set_extent_buffer_dirty(buf);
if (!was_dirty)
-   __percpu_counter_add(&fs_info->dirty_metadata_bytes,
+   percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
 buf->len,
 fs_info->dirty_metadata_batch);
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e010005..a1c303f27699 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3597,7 +3597,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
spin_unlock(&eb->refs_lock);
btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
-   __percpu_counter_add(&fs_info->dirty_metadata_bytes,
+   percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
 -eb->len,
 fs_info->dirty_metadata_batch);
ret = 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ef3c98c527c1..fa138217219e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1766,7 +1766,7 @@ static void btrfs_set_bit_hook(struct inode *inode,
if (btrfs_is_testing(fs_info))
return;
 
-   __percpu_counter_add(&fs_info->delalloc_bytes, len,
+   percpu_counter_add_batch(&fs_info->delalloc_bytes, len,
 fs_info->delalloc_batch);
spin_lock(&BTRFS_I(inode)->lock);
BTRFS_I(inode)->delalloc_bytes += len;
@@ -1840,7 +1840,7 @@ static void btrfs_clear_bit_hook(struct btrfs_inode 
*inode,
&inode->vfs_inode,
state->start, len);
 
-   __percpu_counter_add(&fs_info->delalloc_bytes, -len,
+   percpu_counter_add_batch(&fs_info->delalloc_bytes, -len,
 fs_info->delalloc_batch);
spin_lock(&inode->lock);
inode->delalloc_bytes -= len;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 2eaf81859166..7147d4a8d207 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1209,7 +1209,7 @@ xfs_mod_icount(
struct xfs_mount*mp,
int64_t delta)
 {
-   __percpu_counter_add(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
+   percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
ASSERT(0);
percpu_counter_add(&mp->m_icount, -delta);
@@ -1288,7 +1288,7 @@ xfs_mod_fdblocks(


[PATCH 2/2] writeback: Rework wb_[dec|inc]_stat family of functions

2017-06-20 Thread Nikolay Borisov
Currently the writeback statistics code uses percpu counters to hold
various statistics. As such we have two families of functions - those which
disable local irqs and those which don't, whose names begin with a double
underscore. However, both end up calling __add_wb_stat, which in turn calls
percpu_counter_add_batch, which is already SMP-safe.

Exploiting this fact allows us to eliminate the __wb_* functions, since they
in fact call an SMP-safe primitive. Furthermore, refactor the wb_* functions
to call __add_wb_stat directly, without the irq-disabling dance. This will
likely result in better runtime for code which modifies the stat counters.
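
For reference, the reason the primitive is already SMP-safe can be seen in
its body - a sketch of lib/percpu_counter.c's implementation (details may
differ slightly from the tree this applies to):

	void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
	{
		s64 count;

		preempt_disable();
		count = __this_cpu_read(*fbc->counters) + amount;
		if (count >= batch || count <= -batch) {
			unsigned long flags;

			/* fold the local delta into the global count under the lock */
			raw_spin_lock_irqsave(&fbc->lock, flags);
			fbc->count += count;
			__this_cpu_sub(*fbc->counters, count - amount);
			raw_spin_unlock_irqrestore(&fbc->lock, flags);
		} else {
			this_cpu_add(*fbc->counters, amount);
		}
		preempt_enable();
	}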

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---
Hello Tejun, 

This patch resulted from me reading your feedback on Josef's memory
throttling prep patch https://patchwork.kernel.org/patch/9395219/ . If these
changes are merged then his series can eliminate the irq clustering and use
a straight __add_wb_stat call. I'd like to see his series merged sooner
rather than later, hence this cleanup.

 include/linux/backing-dev.h | 24 ++--
 mm/page-writeback.c | 10 +-
 2 files changed, 7 insertions(+), 27 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ace73f96eb1e..e9c967b86054 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -69,34 +69,14 @@ static inline void __add_wb_stat(struct bdi_writeback *wb,
percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
 }
 
-static inline void __inc_wb_stat(struct bdi_writeback *wb,
-enum wb_stat_item item)
-{
-   __add_wb_stat(wb, item, 1);
-}
-
 static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item 
item)
 {
-   unsigned long flags;
-
-   local_irq_save(flags);
-   __inc_wb_stat(wb, item);
-   local_irq_restore(flags);
-}
-
-static inline void __dec_wb_stat(struct bdi_writeback *wb,
-enum wb_stat_item item)
-{
-   __add_wb_stat(wb, item, -1);
+   __add_wb_stat(wb, item, 1);
 }
 
 static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item 
item)
 {
-   unsigned long flags;
-
-   local_irq_save(flags);
-   __dec_wb_stat(wb, item);
-   local_irq_restore(flags);
+   __add_wb_stat(wb, item, -1);
 }
 
 static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 143c1c25d680..b7451891959a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -601,7 +601,7 @@ static inline void __wb_writeout_inc(struct bdi_writeback 
*wb)
 {
struct wb_domain *cgdom;
 
-   __inc_wb_stat(wb, WB_WRITTEN);
+   inc_wb_stat(wb, WB_WRITTEN);
wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
   wb->bdi->max_prop_frac);
 
@@ -2437,8 +2437,8 @@ void account_page_dirtied(struct page *page, struct 
address_space *mapping)
__inc_node_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
__inc_node_page_state(page, NR_DIRTIED);
-   __inc_wb_stat(wb, WB_RECLAIMABLE);
-   __inc_wb_stat(wb, WB_DIRTIED);
+   inc_wb_stat(wb, WB_RECLAIMABLE);
+   inc_wb_stat(wb, WB_DIRTIED);
task_io_account_write(PAGE_SIZE);
current->nr_dirtied++;
this_cpu_inc(bdp_ratelimits);
@@ -2745,7 +2745,7 @@ int test_clear_page_writeback(struct page *page)
if (bdi_cap_account_writeback(bdi)) {
struct bdi_writeback *wb = inode_to_wb(inode);
 
-   __dec_wb_stat(wb, WB_WRITEBACK);
+   dec_wb_stat(wb, WB_WRITEBACK);
__wb_writeout_inc(wb);
}
}
@@ -2791,7 +2791,7 @@ int __test_set_page_writeback(struct page *page, bool 
keep_write)
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
-   __inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
+   inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
 
/*
 * We can come through here when swapping anonymous
-- 
2.7.4



Re: [PATCH net-next v2 4/4] ip6mr: add netlink notifications on mrt6msg cache reports

2017-06-20 Thread Nikolay Aleksandrov
On 19/06/17 23:44, Julien Gomes wrote:
> Add Netlink notifications on cache reports in ip6mr, in addition to the
> existing mrt6msg sent to mroute6_sk.
> Send RTM_NEWCACHEREPORT notifications to RTNLGRP_IPV6_MROUTE_R.
> 
> MSGTYPE, MIF_ID, SRC_ADDR and DST_ADDR Netlink attributes contain the
> same data as their equivalent fields in the mrt6msg header.
> PKT attribute is the packet sent to mroute6_sk, without the added
> mrt6msg header.
> 
> Suggested-by: Ryan Halbrook 
> Signed-off-by: Julien Gomes 
> ---
>  include/uapi/linux/mroute6.h | 12 
>  net/ipv6/ip6mr.c | 67 
> ++--
>  2 files changed, 77 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/mroute6.h b/include/uapi/linux/mroute6.h
> index ed5721148768..e4746816c855 100644
> --- a/include/uapi/linux/mroute6.h
> +++ b/include/uapi/linux/mroute6.h
> @@ -133,4 +133,16 @@ struct mrt6msg {
>   struct in6_addr im6_src, im6_dst;
>  };
>  
> +/* ip6mr netlink cache report attributes */
> +enum {
> + IP6MRA_CREPORT_UNSPEC,
> + IP6MRA_CREPORT_MSGTYPE,
> + IP6MRA_CREPORT_MIF_ID,
> + IP6MRA_CREPORT_SRC_ADDR,
> + IP6MRA_CREPORT_DST_ADDR,
> + IP6MRA_CREPORT_PKT,
> + __IP6MRA_CREPORT_MAX
> +};
> +#define IP6MRA_CREPORT_MAX (__IP6MRA_CREPORT_MAX - 1)
> +
>  #endif /* _UAPI__LINUX_MROUTE6_H */
> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
> index b0e2bf1f4212..28a1fb49f12e 100644
> --- a/net/ipv6/ip6mr.c
> +++ b/net/ipv6/ip6mr.c
> @@ -116,6 +116,7 @@ static int __ip6mr_fill_mroute(struct mr6_table *mrt, 
> struct sk_buff *skb,
>  struct mfc6_cache *c, struct rtmsg *rtm);
>  static void mr6_netlink_event(struct mr6_table *mrt, struct mfc6_cache *mfc,
> int cmd);
> +static void mrt6msg_netlink_event(struct mr6_table *mrt, struct sk_buff 
> *pkt);
>  static int ip6mr_rtm_dumproute(struct sk_buff *skb,
>  struct netlink_callback *cb);
>  static void mroute_clean_tables(struct mr6_table *mrt, bool all);
> @@ -1125,8 +1126,7 @@ static void ip6mr_cache_resolve(struct net *net, struct 
> mr6_table *mrt,
>  }
>  
>  /*
> - *   Bounce a cache query up to pim6sd. We could use netlink for this but 
> pim6sd
> - *   expects the following bizarre scheme.
> + *   Bounce a cache query up to pim6sd and netlink.
>   *
>   *   Called under mrt_lock.
>   */
> @@ -1208,6 +1208,8 @@ static int ip6mr_cache_report(struct mr6_table *mrt, 
> struct sk_buff *pkt,
>   return -EINVAL;
>   }
>  
> + mrt6msg_netlink_event(mrt, skb);
> +
>   /*
>*  Deliver to user space multicast routing algorithms
>*/
> @@ -2457,6 +2459,67 @@ static void mr6_netlink_event(struct mr6_table *mrt, 
> struct mfc6_cache *mfc,
>   rtnl_set_sk_err(net, RTNLGRP_IPV6_MROUTE, err);
>  }
>  
> +static void mrt6msg_netlink_event(struct mr6_table *mrt, struct sk_buff *pkt)
> +{
> + struct net *net = read_pnet(&mrt->net);
> + struct nlmsghdr *nlh;
> + struct rtgenmsg *rtgenm;
> + struct mrt6msg *msg;
> + struct sk_buff *skb;
> + struct nlattr *nla;
> + int payloadlen;
> + int msgsize;
> +
> + payloadlen = pkt->len - sizeof(struct mrt6msg);
> + msg = (struct mrt6msg *)skb_transport_header(pkt);
> + msgsize = NLMSG_ALIGN(sizeof(struct rtgenmsg))
> + + nla_total_size(1)
> + /* IP6MRA_CREPORT_MSGTYPE */
> + + nla_total_size(2)
> + /* IP6MRA_CREPORT_MIF_ID */
> + + nla_total_size(sizeof(struct in6_addr))
> + /* IP6MRA_CREPORT_SRC_ADDR */
> + + nla_total_size(sizeof(struct in6_addr))
> + /* IP6MRA_CREPORT_DST_ADDR */
> + + nla_total_size(payloadlen)
> + /* IP6MRA_CREPORT_PKT */
> + ;

Same as patch 03, this calculation could be in a separate function.
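
I.e. roughly (a sketch; the helper name mrt6msg_netlink_msgsize is
hypothetical, mirroring the shape of mroute_msgsize()):

	static size_t mrt6msg_netlink_msgsize(size_t payloadlen)
	{
		return NLMSG_ALIGN(sizeof(struct rtgenmsg))
			+ nla_total_size(1)	/* IP6MRA_CREPORT_MSGTYPE */
			+ nla_total_size(2)	/* IP6MRA_CREPORT_MIF_ID */
			+ nla_total_size(sizeof(struct in6_addr))
						/* IP6MRA_CREPORT_SRC_ADDR */
			+ nla_total_size(sizeof(struct in6_addr))
						/* IP6MRA_CREPORT_DST_ADDR */
			+ nla_total_size(payloadlen);
						/* IP6MRA_CREPORT_PKT */
	}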

> +
> + skb = nlmsg_new(msgsize, GFP_ATOMIC);
> + if (!skb)
> + goto errout;
> +
> + nlh = nlmsg_put(skb, 0, 0, RTM_NEWCACHEREPORT,
> + sizeof(struct rtgenmsg), 0);
> + if (!nlh)
> + goto errout;
> + rtgenm = nlmsg_data(nlh);
> + rtgenm->rtgen_family = RTNL_FAMILY_IP6MR;
> + if (nla_put_u8(skb, IP6MRA_CREPORT_MSGTYPE, msg->im6_msgtype) ||
> + nla_put_u16(skb, IP6MRA_CREPORT_MIF_ID, msg->im6_mif) ||
> + nla_put_in6_addr(skb, IP6MRA_CREPORT_SRC_ADDR,
> +  &msg->im6_src) ||
> + nla_put_in6_addr(skb, IP6MRA_CREPORT_DST_ADDR,
> +  &msg->im6_dst))
> + goto nla_put_failure;
> +
> + nla = nla_reserve(skb, IP6MRA_CREPORT_PKT, payloadlen);
> + if (!nla || skb_copy_bits(pkt, sizeof(struct mrt6msg),
> +   nla_data(nla), payloadlen))
> + 

Re: [PATCH net-next v2 3/4] ipmr: add netlink notifications on igmpmsg cache reports

2017-06-20 Thread Nikolay Aleksandrov
On 19/06/17 23:44, Julien Gomes wrote:
> Add Netlink notifications on cache reports in ipmr, in addition to the
> existing igmpmsg sent to mroute_sk.
> Send RTM_NEWCACHEREPORT notifications to RTNLGRP_IPV4_MROUTE_R.
> 
> MSGTYPE, VIF_ID, SRC_ADDR and DST_ADDR Netlink attributes contain the
> same data as their equivalent fields in the igmpmsg header.
> PKT attribute is the packet sent to mroute_sk, without the added igmpmsg
> header.
> 
> Suggested-by: Ryan Halbrook 
> Signed-off-by: Julien Gomes 
> ---
>  include/uapi/linux/mroute.h | 12 
>  net/ipv4/ipmr.c | 67 
> +++--
>  2 files changed, 77 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/mroute.h b/include/uapi/linux/mroute.h
> index f904367c0cee..e8e5041dea8e 100644
> --- a/include/uapi/linux/mroute.h
> +++ b/include/uapi/linux/mroute.h
> @@ -152,6 +152,18 @@ enum {
>  };
>  #define IPMRA_VIFA_MAX (__IPMRA_VIFA_MAX - 1)
>  
> +/* ipmr netlink cache report attributes */
> +enum {
> + IPMRA_CREPORT_UNSPEC,
> + IPMRA_CREPORT_MSGTYPE,
> + IPMRA_CREPORT_VIF_ID,
> + IPMRA_CREPORT_SRC_ADDR,
> + IPMRA_CREPORT_DST_ADDR,
> + IPMRA_CREPORT_PKT,
> + __IPMRA_CREPORT_MAX
> +};
> +#define IPMRA_CREPORT_MAX (__IPMRA_CREPORT_MAX - 1)
> +
>  /* That's all usermode folks */
>  
>  #define MFC_ASSERT_THRESH (3*HZ) /* Maximal freq. of asserts */
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index 3e7454aa49e8..1e591bcaad6d 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -109,6 +109,7 @@ static int __ipmr_fill_mroute(struct mr_table *mrt, 
> struct sk_buff *skb,
> struct mfc_cache *c, struct rtmsg *rtm);
>  static void mroute_netlink_event(struct mr_table *mrt, struct mfc_cache *mfc,
>int cmd);
> +static void igmpmsg_netlink_event(struct mr_table *mrt, struct sk_buff *pkt);
>  static void mroute_clean_tables(struct mr_table *mrt, bool all);
>  static void ipmr_expire_process(unsigned long arg);
>  
> @@ -995,8 +996,7 @@ static void ipmr_cache_resolve(struct net *net, struct 
> mr_table *mrt,
>   }
>  }
>  
> -/* Bounce a cache query up to mrouted. We could use netlink for this but 
> mrouted
> - * expects the following bizarre scheme.
> +/* Bounce a cache query up to mrouted and netlink.
>   *
>   * Called under mrt_lock.
>   */
> @@ -1062,6 +1062,8 @@ static int ipmr_cache_report(struct mr_table *mrt,
>   return -EINVAL;
>   }
>  
> + igmpmsg_netlink_event(mrt, skb);
> +
>   /* Deliver to mrouted */
>   ret = sock_queue_rcv_skb(mroute_sk, skb);
>   rcu_read_unlock();
> @@ -2341,6 +2343,67 @@ static void mroute_netlink_event(struct mr_table *mrt, 
> struct mfc_cache *mfc,
>   rtnl_set_sk_err(net, RTNLGRP_IPV4_MROUTE, err);
>  }
>  
> +static void igmpmsg_netlink_event(struct mr_table *mrt, struct sk_buff *pkt)
> +{
> + struct net *net = read_pnet(&mrt->net);
> + struct nlmsghdr *nlh;
> + struct rtgenmsg *rtgenm;
> + struct igmpmsg *msg;
> + struct sk_buff *skb;
> + struct nlattr *nla;
> + int payloadlen;
> + int msgsize;
> +
> + payloadlen = pkt->len - sizeof(struct igmpmsg);
> + msg = (struct igmpmsg *)skb_network_header(pkt);
> + msgsize = NLMSG_ALIGN(sizeof(struct rtgenmsg))
> + + nla_total_size(1)
> + /* IPMRA_CREPORT_MSGTYPE */
> + + nla_total_size(1)
> + /* IPMRA_CREPORT_VIF_ID */
> + + nla_total_size(4)
> + /* IPMRA_CREPORT_SRC_ADDR */
> + + nla_total_size(4)
> + /* IPMRA_CREPORT_DST_ADDR */
> + + nla_total_size(payloadlen)
> + /* IPMRA_CREPORT_PKT */
> + ;

If this ends up having another version you could pull this size
calculation into a separate function. E.g. see mroute_msgsize
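
A sketch of what that could look like (the name igmpmsg_netlink_msgsize is
hypothetical):

	static size_t igmpmsg_netlink_msgsize(size_t payloadlen)
	{
		return NLMSG_ALIGN(sizeof(struct rtgenmsg))
			+ nla_total_size(1)		/* IPMRA_CREPORT_MSGTYPE */
			+ nla_total_size(1)		/* IPMRA_CREPORT_VIF_ID */
			+ nla_total_size(4)		/* IPMRA_CREPORT_SRC_ADDR */
			+ nla_total_size(4)		/* IPMRA_CREPORT_DST_ADDR */
			+ nla_total_size(payloadlen);	/* IPMRA_CREPORT_PKT */
	}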

> +
> + skb = nlmsg_new(msgsize, GFP_ATOMIC);
> + if (!skb)
> + goto errout;
> +
> + nlh = nlmsg_put(skb, 0, 0, RTM_NEWCACHEREPORT,
> + sizeof(struct rtgenmsg), 0);
> + if (!nlh)
> + goto errout;
> + rtgenm = nlmsg_data(nlh);
> + rtgenm->rtgen_family = RTNL_FAMILY_IPMR;
> + if (nla_put_u8(skb, IPMRA_CREPORT_MSGTYPE, msg->im_msgtype) ||
> + nla_put_u8(skb, IPMRA_CREPORT_VIF_ID, msg->im_vif) ||

This would effectively limit the new call to handling 255 ifaces. I used a
u32 for the getlink interface's vif id; it'd be nice to be consistent and
also allow for more interfaces in the future (u32 might be too big, but
we're not pressed for space here).
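
I.e. something like (sketch):

	nla_put_u32(skb, IPMRA_CREPORT_VIF_ID, msg->im_vif) ||

with the vif id's size accounted as nla_total_size(4) instead of
nla_total_size(1) in the message size calculation.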

> + nla_put_in_addr(skb, IPMRA_CREPORT_SRC_ADDR,
> + msg->im_src.s_addr) ||
> + nla_put_in_addr(skb, IPMRA_CREPORT_DST_ADDR,
> + 

Re: [PATCH net-next v2 2/4] rtnetlink: add restricted rtnl groups for ipv4 and ipv6 mroute

2017-06-20 Thread Nikolay Aleksandrov
On 19/06/17 23:44, Julien Gomes wrote:
> Add RTNLGRP_{IPV4,IPV6}_MROUTE_R as two new restricted groups for the
> NETLINK_ROUTE family.
> Binding to these groups specifically requires CAP_NET_ADMIN to allow
> multicast of sensitive messages (e.g. mroute cache reports).
> 
> Signed-off-by: Julien Gomes <jul...@arista.com>
> ---
>  include/uapi/linux/rtnetlink.h |  4 
>  net/core/rtnetlink.c   | 13 +
>  2 files changed, 17 insertions(+)

Thanks!

Suggested-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>




Re: [PATCH net-next v2 1/4] rtnetlink: add NEWCACHEREPORT message type

2017-06-20 Thread Nikolay Aleksandrov
On 19/06/17 23:44, Julien Gomes wrote:
> New NEWCACHEREPORT message type to be used for cache reports sent
> via Netlink, effectively allowing cache report reception to be split from
> mroute programming.
> 
> Suggested-by: Ryan Halbrook <halbr...@arista.com>
> Signed-off-by: Julien Gomes <jul...@arista.com>
> ---
>  include/uapi/linux/rtnetlink.h | 3 +++
>  security/selinux/nlmsgtab.c| 3 ++-
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 

Reviewed-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>





Re: [PATCH] Add printk for bonding module packets_per_slave parameter

2017-06-13 Thread Nikolay Aleksandrov
On 13/06/17 20:00, Joe Perches wrote:
> On Tue, 2017-06-13 at 12:42 -0400, Jonathan Toppins wrote:
>> On 06/13/2017 12:21 PM, Joe Perches wrote:
>>> On Tue, 2017-06-13 at 11:34 -0400, David Miller wrote:
>>>> From: Michael Dilmore <michael.j.dilm...@gmail.com>
>>>> Date: Tue, 13 Jun 2017 14:42:46 +0100
>>>>
>>>>> The packets per slave parameter used by round robin mode does not have a 
>>>>> printk debug
>>>>> message in its set function in bond_options.c. Adding such a function 
>>>>> would aid debugging
>>>>> of round-robin mode and allow the user to more easily verify that the 
>>>>> parameter has been
>>>>> set correctly. I should add that I'm motivated by my own experience here 
>>>>> - it's not
>>>>> obvious from output of tools such as wireshark and ifstat that the 
>>>>> parameter is working
>>>>> correctly, and with the differences in bonding configuration across 
>>>>> different distributions,
>>>>> it would have been comforting to see this output.
> []
>>>> You can verify things by simply reading the value back.
>>>>
>>>> If every parameter emitted a kernel log message, it would be
>>>> unreadable.
>>>>
>>>> I'm not applying this, sorry.
>>>
>>> I agree.  Noisy logging output is not good.
>>>
>>> Perhaps a general conversion of the dozens
>>> of existing netdev_info uses in this file to
>>> netdev_dbg and adding this at netdev_dbg is
>>> appropriate.
>>
>> In general I agree. The few times I have debugged bonds, I always ended
>> up enabling debug printks anyway. I don't see a problem moving these to
>> debug as well.
>>
>> Adding Nik, who converted a lot of this code to common paths, for input.
> 
> If Nikolay agrees with the conversion, it's trivial.
> Please submit it.  I did it just for reference.
> 
> Stylistic nits about the existing file:
> 
> There are some inconsistencies in pr_info/pr_err uses
> with invalid inputs.
> 
> It would also be nicer if the forward static declarations
> were removed and the static definitions reordered.
> 

Agreed, there are many ways to extract the values.



Re: [PATCH] memcg: refactor mem_cgroup_resize_limit()

2017-06-02 Thread Nikolay Borisov


On  2.06.2017 02:02, Yu Zhao wrote:
> mem_cgroup_resize_limit() and mem_cgroup_resize_memsw_limit() have
> identical logic. Refactor the code so we don't need to keep two pieces
> of code that do the same thing.
> 
> Signed-off-by: Yu Zhao 
> ---
>  mm/memcontrol.c | 71 
> +
>  1 file changed, 11 insertions(+), 60 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 94172089f52f..a4f0daaff704 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2422,13 +2422,14 @@ static inline int 
> mem_cgroup_move_swap_account(swp_entry_t entry,
>  static DEFINE_MUTEX(memcg_limit_mutex);
>  
>  static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> -unsigned long limit)
> +unsigned long limit, bool memsw)
>  {
>   unsigned long curusage;
>   unsigned long oldusage;
>   bool enlarge = false;
>   int retry_count;
>   int ret;
> + struct page_counter *counter = memsw ? &memcg->memsw : &memcg->memory;
>  
>   /*
>* For keeping hierarchical_reclaim simple, how long we should retry
> @@ -2438,58 +2439,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup 
> *memcg,
>   retry_count = MEM_CGROUP_RECLAIM_RETRIES *
> mem_cgroup_count_children(memcg);
>  
> - oldusage = page_counter_read(&memcg->memory);
> -
> - do {
> - if (signal_pending(current)) {
> - ret = -EINTR;
> - break;
> - }
> -
> - mutex_lock(&memcg_limit_mutex);
> - if (limit > memcg->memsw.limit) {
> - mutex_unlock(&memcg_limit_mutex);
> - ret = -EINVAL;
> - break;
> - }
> - if (limit > memcg->memory.limit)
> - enlarge = true;
> - ret = page_counter_limit(&memcg->memory, limit);
> - mutex_unlock(&memcg_limit_mutex);
> -
> - if (!ret)
> - break;
> -
> - try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
> -
> - curusage = page_counter_read(&memcg->memory);
> - /* Usage is reduced ? */
> - if (curusage >= oldusage)
> - retry_count--;
> - else
> - oldusage = curusage;
> - } while (retry_count);
> -
> - if (!ret && enlarge)
> - memcg_oom_recover(memcg);
> -
> - return ret;
> -}
> -
> -static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> -  unsigned long limit)
> -{
> - unsigned long curusage;
> - unsigned long oldusage;
> - bool enlarge = false;
> - int retry_count;
> - int ret;
> -
> - /* see mem_cgroup_resize_res_limit */
> - retry_count = MEM_CGROUP_RECLAIM_RETRIES *
> -   mem_cgroup_count_children(memcg);
> -
> - oldusage = page_counter_read(&memcg->memsw);
> + oldusage = page_counter_read(counter);
>  
>   do {
>   if (signal_pending(current)) {
> @@ -2498,22 +2448,23 @@ static int mem_cgroup_resize_memsw_limit(struct 
> mem_cgroup *memcg,
>   }
>  
>   mutex_lock(&memcg_limit_mutex);
> - if (limit < memcg->memory.limit) {
> + if (memsw ? limit < memcg->memory.limit :
> + limit > memcg->memsw.limit) {

No, just no. Please create a local variable and use that. Using the
ternary operator in an 'if' statement is just ugly!
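
Something along these lines, keeping the check under the mutex (a sketch;
the variable name is just an example):

	bool invalid;

	mutex_lock(&memcg_limit_mutex);
	/* enlarging one counter must not cross the other counter's limit */
	invalid = memsw ? limit < memcg->memory.limit :
			  limit > memcg->memsw.limit;
	if (invalid) {
		mutex_unlock(&memcg_limit_mutex);
		ret = -EINVAL;
		break;
	}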

>   mutex_unlock(&memcg_limit_mutex);
>   ret = -EINVAL;
>   break;
>   }
> - if (limit > memcg->memsw.limit)
> + if (limit > counter->limit)
>   enlarge = true;
> - ret = page_counter_limit(&memcg->memsw, limit);
> + ret = page_counter_limit(counter, limit);
>   mutex_unlock(&memcg_limit_mutex);
>  
>   if (!ret)
>   break;
>  
> - try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
> + try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, !memsw);
>  
> - curusage = page_counter_read(&memcg->memsw);
> + curusage = page_counter_read(counter);
>   /* Usage is reduced ? */
>   if (curusage >= oldusage)
>   retry_count--;
> @@ -2975,10 +2926,10 @@ static ssize_t mem_cgroup_write(struct 
> kernfs_open_file *of,
>   }
>   switch (MEMFILE_TYPE(of_cft(of)->private)) {
>   case _MEM:
> - ret = mem_cgroup_resize_limit(memcg, nr_pages);
> + ret = mem_cgroup_resize_limit(memcg, nr_pages, false);
>   break;
>   case _MEMSWAP:
> - ret = mem_cgroup_resize_memsw_limit(memcg, nr_pages);
> + ret = mem_cgroup_resize_limit(memcg, nr_pages, true);
>

Re: Oops with commit 6d18c73 bridge: start hello_timer when enabling KERNEL_STP in br_stp_start

2017-06-01 Thread Nikolay Aleksandrov
On 01/06/17 17:16, Nikolay Aleksandrov wrote:
> On 01/06/17 17:00, Nikolay Aleksandrov wrote:
>> On 01/06/17 15:34, Sebastian Ott wrote:
>>> On Thu, 1 Jun 2017, Xin Long wrote:
>>>> On Thu, Jun 1, 2017 at 12:32 AM, Sebastian Ott
>>>> <seb...@linux.vnet.ibm.com> wrote:
>>>>> [...]
>>>> I couldn't see any bridge-related thing here, and it couldn't be reproduced
>>>> with virbr0 (stp=1) on my box (on both s390x and x86_64), I guess there
>>>> is something else in your machine.
>>>>
>>>> With the latest upstream kernel, can you remove libvirt (virbr0) and boot 
>>>> your
>>>> machine normally, then:
>>>> # brctl addbr br0
>>>> # ip link set br0 up
>>>> # brctl stp br0 on
>>>>
>>>> to check if it will still hang.
>>>
>>> Nope. That doesn't hang.
>>>
>>>
>>>> If it can't be reproduced in this way, pls add this on your kernel:
>>>>
>>>> --- a/net/bridge/br_stp_if.c
>>>> +++ b/net/bridge/br_stp_if.c
>>>> @@ -178,9 +178,11 @@ static void br_stp_start(struct net_bridge *br)
>>>> br->stp_enabled = BR_KERNEL_STP;
>>>> br_debug(br, "using kernel STP\n");
>>>>
>>>> +   WARN_ON(1);
>>>> /* To start timers on any ports left in blocking */
>>>> mod_timer(&br->hello_timer, jiffies + br->hello_time);
>>>> br_port_state_selection(br);
>>>> +   pr_warn("hello timer start done\n");
>>>> }
>>>>
>>>> spin_unlock_bh(&br->lock);
>>>> diff --git a/net/bridge/br_stp_timer.c b/net/bridge/br_stp_timer.c
>>>> index 60b6fe2..c98b3e5 100644
>>>> --- a/net/bridge/br_stp_timer.c
>>>> +++ b/net/bridge/br_stp_timer.c
>>>> @@ -40,7 +40,7 @@ static void br_hello_timer_expired(unsigned long arg)
>>>> if (br->dev->flags & IFF_UP) {
>>>> br_config_bpdu_generation(br);
>>>>
>>>> -   if (br->stp_enabled == BR_KERNEL_STP)
>>>> +   if (br->stp_enabled != BR_USER_STP)
>>>> mod_timer(&br->hello_timer,
>>>>   round_jiffies(jiffies + br->hello_time));
>>>>
>>>>
>>>> let's see if it hangs when starting the timer. Thanks.
>>>
>>> No hang either:
>>>
>> [snip]
>> Could you please try the patch below ?
>>
>> ---
>>
>> diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
>> index 4efd5d54498a..89110319ef0f 100644
>> --- a/net/bridge/br_stp_if.c
>> +++ b/net/bridge/br_stp_if.c
>> @@ -173,7 +173,8 @@ static void br_stp_start(struct net_bridge *br)
>>  br_debug(br, "using kernel STP\n");
>>  
>>  /* To start timers on any ports left in blocking */
>> -mod_timer(&br->hello_timer, jiffies + br->hello_time);
>> +if (br->dev->flags & IFF_UP)
>> +mod_timer(&br->hello_timer, jiffies + br->hello_time);
>>  br_port_state_selection(br);
>>  }
>>  
>>
> 
> Ah nevermind, this patch reverts it back to the previous state.
> 

Okay, I saw the problem and can reliably reproduce it. I will send a fix for 
testing
in a few minutes. I think the issue is that the timer can be started before the 
bridge
even goes up, i.e. create bridge -> brctl stp br0 on -> ip l del br0
so the del_timer_sync() doesn't get executed and thus it's still armed.

$ while :; do ip l add br0 type bridge hello_time 100; brctl stp br0 on; ip l 
del br0; done;





Re: Oops with commit 6d18c73 bridge: start hello_timer when enabling KERNEL_STP in br_stp_start

2017-06-01 Thread Nikolay Aleksandrov
On 01/06/17 17:00, Nikolay Aleksandrov wrote:
> On 01/06/17 15:34, Sebastian Ott wrote:
>> On Thu, 1 Jun 2017, Xin Long wrote:
>>> On Thu, Jun 1, 2017 at 12:32 AM, Sebastian Ott
>>> <seb...@linux.vnet.ibm.com> wrote:
>>>> [...]
>>> I couldn't see any bridge-related thing here, and it couldn't be reproduced
>>> with virbr0 (stp=1) on my box (on both s390x and x86_64), I guess there
>>> is something else in your machine.
>>>
>>> With the latest upstream kernel, can you remove libvirt (virbr0) and boot 
>>> your
>>> machine normally, then:
>>> # brctl addbr br0
>>> # ip link set br0 up
>>> # brctl stp br0 on
>>>
>>> to check if it will still hang.
>>
>> Nope. That doesn't hang.
>>
>>
>>> If it can't be reproduced in this way, pls add this on your kernel:
>>>
>>> --- a/net/bridge/br_stp_if.c
>>> +++ b/net/bridge/br_stp_if.c
>>> @@ -178,9 +178,11 @@ static void br_stp_start(struct net_bridge *br)
>>> br->stp_enabled = BR_KERNEL_STP;
>>> br_debug(br, "using kernel STP\n");
>>>
>>> +   WARN_ON(1);
>>> /* To start timers on any ports left in blocking */
>>> mod_timer(&br->hello_timer, jiffies + br->hello_time);
>>> br_port_state_selection(br);
>>> +   pr_warn("hello timer start done\n");
>>> }
>>>
>>> spin_unlock_bh(&br->lock);
>>> diff --git a/net/bridge/br_stp_timer.c b/net/bridge/br_stp_timer.c
>>> index 60b6fe2..c98b3e5 100644
>>> --- a/net/bridge/br_stp_timer.c
>>> +++ b/net/bridge/br_stp_timer.c
>>> @@ -40,7 +40,7 @@ static void br_hello_timer_expired(unsigned long arg)
>>> if (br->dev->flags & IFF_UP) {
>>> br_config_bpdu_generation(br);
>>>
>>> -   if (br->stp_enabled == BR_KERNEL_STP)
>>> +   if (br->stp_enabled != BR_USER_STP)
>>> mod_timer(&br->hello_timer,
>>>   round_jiffies(jiffies + br->hello_time));
>>>
>>>
>>> let's see if it hangs when starting the timer. Thanks.
>>
>> No hang either:
>>
> [snip]
> Could you please try the patch below ?
> 
> ---
> 
> diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
> index 4efd5d54498a..89110319ef0f 100644
> --- a/net/bridge/br_stp_if.c
> +++ b/net/bridge/br_stp_if.c
> @@ -173,7 +173,8 @@ static void br_stp_start(struct net_bridge *br)
>   br_debug(br, "using kernel STP\n");
>  
>   /* To start timers on any ports left in blocking */
> - mod_timer(&br->hello_timer, jiffies + br->hello_time);
> + if (br->dev->flags & IFF_UP)
> + mod_timer(&br->hello_timer, jiffies + br->hello_time);
>   br_port_state_selection(br);
>   }
>  
> 

Ah nevermind, this patch reverts it back to the previous state.





Re: Oops with commit 6d18c73 bridge: start hello_timer when enabling KERNEL_STP in br_stp_start

2017-06-01 Thread Nikolay Aleksandrov
On 01/06/17 15:34, Sebastian Ott wrote:
> On Thu, 1 Jun 2017, Xin Long wrote:
>> On Thu, Jun 1, 2017 at 12:32 AM, Sebastian Ott
>>  wrote:
>>> [...]
>> I couldn't see any bridge-related thing here, and it couldn't be reproduced
>> with virbr0 (stp=1) on my box (on both s390x and x86_64), so I guess there
>> is something else on your machine.
>>
>> With the latest upstream kernel, can you remove libvirt (virbr0) and boot
>> your machine normally, then:
>> # brctl addbr br0
>> # ip link set br0 up
>> # brctl stp br0 on
>>
>> to check if it will still hang.
> 
> Nope. That doesn't hang.
> 
> 
>> If it can't be reproduced in this way, please apply this to your kernel:
>>
>> --- a/net/bridge/br_stp_if.c
>> +++ b/net/bridge/br_stp_if.c
>> @@ -178,9 +178,11 @@ static void br_stp_start(struct net_bridge *br)
>> br->stp_enabled = BR_KERNEL_STP;
>> br_debug(br, "using kernel STP\n");
>>
>> +   WARN_ON(1);
>> /* To start timers on any ports left in blocking */
>> mod_timer(&br->hello_timer, jiffies + br->hello_time);
>> br_port_state_selection(br);
>> +   pr_warn("hello timer start done\n");
>> }
>>
>> spin_unlock_bh(&br->lock);
>> diff --git a/net/bridge/br_stp_timer.c b/net/bridge/br_stp_timer.c
>> index 60b6fe2..c98b3e5 100644
>> --- a/net/bridge/br_stp_timer.c
>> +++ b/net/bridge/br_stp_timer.c
>> @@ -40,7 +40,7 @@ static void br_hello_timer_expired(unsigned long arg)
>> if (br->dev->flags & IFF_UP) {
>> br_config_bpdu_generation(br);
>>
>> -   if (br->stp_enabled == BR_KERNEL_STP)
>> +   if (br->stp_enabled != BR_USER_STP)
>> mod_timer(&br->hello_timer,
>>   round_jiffies(jiffies + br->hello_time));
>>
>>
>> let's see if it hangs when starting the timer. Thanks.
> 
> No hang either:
> 
[snip]
Could you please try the patch below?

---

diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
index 4efd5d54498a..89110319ef0f 100644
--- a/net/bridge/br_stp_if.c
+++ b/net/bridge/br_stp_if.c
@@ -173,7 +173,8 @@ static void br_stp_start(struct net_bridge *br)
br_debug(br, "using kernel STP\n");
 
/* To start timers on any ports left in blocking */
-	mod_timer(&br->hello_timer, jiffies + br->hello_time);
+	if (br->dev->flags & IFF_UP)
+		mod_timer(&br->hello_timer, jiffies + br->hello_time);
br_port_state_selection(br);
}
 



Re: [PATCH net-next 5/6] net: bridge: get msgtype from nlmsghdr in mdb ops

2017-05-18 Thread Nikolay Aleksandrov

On 5/18/17 12:27 AM, Vivien Didelot wrote:

Retrieve the message type from the nlmsghdr structure instead of
hardcoding it in both br_mdb_add and br_mdb_del.

Signed-off-by: Vivien Didelot 
---
  net/bridge/br_mdb.c | 10 ++++++----
  1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index a72d5e6f339f..d280b20587cb 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -569,6 +569,7 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh,
struct net_bridge_port *p;
struct net_bridge_vlan *v;
struct net_bridge *br;
+   int msgtype = nlh->nlmsg_type;


minor nits:
nlmsg_type is a u16; also please keep the ordering and arrange these from
longest to shortest
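
A rough sketch of the requested layout, with msgtype widened to u16 and the
declarations ordered longest to shortest (illustrative only, not the final
patch):

	u16 msgtype = nlh->nlmsg_type;
	struct net_bridge_port *p;
	struct net_bridge_vlan *v;
	struct net_bridge *br;
	int err;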



int err;
  
  	err = br_mdb_parse(skb, nlh, &dev, &entry);

@@ -595,12 +596,12 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh,
if (br_vlan_enabled(br) && vg && entry->vid == 0) {
list_for_each_entry(v, &vg->vlan_list, vlist) {
entry->vid = v->vid;
-   err = __br_mdb_do(p, entry, RTM_NEWMDB);
+   err = __br_mdb_do(p, entry, msgtype);
if (err)
break;
}
} else {
-   err = __br_mdb_do(p, entry, RTM_NEWMDB);
+   err = __br_mdb_do(p, entry, msgtype);
}
  
  	return err;

@@ -677,6 +678,7 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
struct net_bridge_port *p;
struct net_bridge_vlan *v;
struct net_bridge *br;
+   int msgtype = nlh->nlmsg_type;


same here


int err;
  
  	err = br_mdb_parse(skb, nlh, &dev, &entry);

@@ -703,12 +705,12 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
if (br_vlan_enabled(br) && vg && entry->vid == 0) {
list_for_each_entry(v, &vg->vlan_list, vlist) {
entry->vid = v->vid;
-   err = __br_mdb_do(p, entry, RTM_DELMDB);
+   err = __br_mdb_do(p, entry, msgtype);
if (err)
break;
}
} else {
-   err = __br_mdb_do(p, entry, RTM_DELMDB);
+   err = __br_mdb_do(p, entry, msgtype);
}
  
  	return err;






Re: [PATCH net-next 3/6] net: bridge: break if __br_mdb_del fails

2017-05-18 Thread Nikolay Aleksandrov

On 5/18/17 6:08 PM, Vivien Didelot wrote:

Hi Nikolay,

Nikolay Aleksandrov <niko...@cumulusnetworks.com> writes:


err = __br_mdb_del(br, entry);
-   if (!err)
-   __br_mdb_notify(dev, p, entry, RTM_DELMDB);
+   if (err)
+   break;
+   __br_mdb_notify(dev, p, entry, RTM_DELMDB);
}
} else {
err = __br_mdb_del(br, entry);



This can potentially break user-space scripts that rely on the best-effort
behaviour; this is the normal "delete without vid & enabled vlan filtering"
case. You can check the fdb delete code, which does the same; this was intentional.

You can add an mdb entry without a vid to all vlans, add a vlan and then try
to remove it from all vlans where it is present - with this patch obviously
that will fail at the new vlan.


OK, good to know. That intention wasn't obvious. Should I make __br_mdb_del
return void instead? And what about the rest of the patchset if I do so?

Thanks,

 Vivien



If you make it return void we will not be able to return a proper error value
when doing a single operation (the else case). About the rest I see only some
minor style issues; I'll comment on the respective patches. Another minor nit
is using switch() instead of if/else for the message types, but that is really
up to you, I don't mind either way. :-)
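
For illustration, keeping the best-effort semantics while still returning a
meaningful error for the single-entry case could look roughly like the
current shape of br_mdb_del (a sketch, not a proposed patch; returning 0 for
the whole batch is an assumption here):

	if (br_vlan_enabled(br) && vg && entry->vid == 0) {
		/* best effort across all vlans: ignore per-vlan failures
		 * so vid-0 deletes keep working after new vlans appear
		 */
		list_for_each_entry(v, &vg->vlan_list, vlist) {
			entry->vid = v->vid;
			if (!__br_mdb_del(br, entry))
				__br_mdb_notify(dev, p, entry, RTM_DELMDB);
		}
		err = 0;
	} else {
		/* single operation: propagate the real error to user space */
		err = __br_mdb_del(br, entry);
		if (!err)
			__br_mdb_notify(dev, p, entry, RTM_DELMDB);
	}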


Cheers,
 Nik




Re: [PATCH net-next 3/6] net: bridge: break if __br_mdb_del fails

2017-05-18 Thread Nikolay Aleksandrov

On 5/18/17 12:27 AM, Vivien Didelot wrote:

Be symmetric with br_mdb_add and break if __br_mdb_del returns an error.

Signed-off-by: Vivien Didelot 
---
  net/bridge/br_mdb.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index d20a01622b20..24fb4179 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -688,8 +688,9 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
list_for_each_entry(v, &vg->vlan_list, vlist) {
entry->vid = v->vid;
err = __br_mdb_del(br, entry);
-   if (!err)
-   __br_mdb_notify(dev, p, entry, RTM_DELMDB);
+   if (err)
+   break;
+   __br_mdb_notify(dev, p, entry, RTM_DELMDB);
}
} else {
err = __br_mdb_del(br, entry);



This can potentially break user-space scripts that rely on the best-effort
behaviour; this is the normal "delete without vid & enabled vlan filtering"
case. You can check the fdb delete code, which does the same; this was intentional.

You can add an mdb entry without a vid to all vlans, add a vlan and then try
to remove it from all vlans where it is present - with this patch obviously
that will fail at the new vlan.
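
In iproute2 terms the scenario would read something like this (the command
forms are my illustration, not taken from the thread):

$ bridge mdb add dev br0 port eth0 grp 239.1.1.1  # vid 0: added to all current vlans
$ bridge vlan add dev eth0 vid 100                # new vlan, has no such mdb entry
$ bridge mdb del dev br0 port eth0 grp 239.1.1.1  # vid 0: breaking on error would stop at vid 100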




Re: [GIT PULL] Please pull NFS client fixes for 4.12

2017-05-11 Thread Nikolay Borisov


On 10.05.2017 19:47, Trond Myklebust wrote:
> Hi Linus,
> 
> The following changes since commit 4f7d029b9bf009fbee76bb10c0c4351a1870d2f3:
> 
>   Linux 4.11-rc7 (2017-04-16 13:00:18 -0700)
> 
> are available in the git repository at:
> 
>   git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-4.12-1
> 
> for you to fetch changes up to 76b2a303384e1d6299c3a0249f0f0ce2f8f96017:
> 
>   pNFS/flexfiles: Always attempt to call layoutstats when flexfiles is 
> enabled (2017-05-09 16:02:57 -0400)
> 
> 
> NFS client updates for Linux 4.12
> 
> Highlights include:
> 
> Stable bugfixes:
> - Fix use after free in write error path
> - Use GFP_NOIO for two allocations in writeback
> - Fix a hang in OPEN related to server reboot
> - Check the result of nfs4_pnfs_ds_connect
> - Fix an rcu lock leak
> 
> Features:
> - Removal of the unmaintained and unused OSD pNFS layout
> - Cleanup and removal of lots of unnecessary dprintk()s
> - Cleanup and removal of some memory failure paths now that
>   GFP_NOFS is guaranteed to never fail.

What guarantees that? If this is the case then it opens up a lot of
opportunities for cleanup across the whole kernel tree. After discussing
with mhocko (cc'ed) it seems that in practice every allocation below
COSTLY_ORDER which is not GFP_NORETRY will never fail. But this semantic
is not the same as GFP_NOFAIL, i.e. nothing guarantees that it will stay
like that in the future.
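
To make the distinction concrete, a minimal sketch (hypothetical caller; the
struct name is made up):

	struct foo *p;	/* hypothetical */

	/* relies on today's behaviour that small allocations without
	 * GFP_NORETRY practically never fail -- an implementation
	 * detail, not an API promise
	 */
	p = kmalloc(sizeof(*p), GFP_NOFS);
	if (!p)		/* so arguably this check should stay */
		return -ENOMEM;

	/* the only documented no-fail form is an explicit __GFP_NOFAIL */
	p = kmalloc(sizeof(*p), GFP_NOFS | __GFP_NOFAIL);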



[omitted for brevity]


Re: [PATCH net] ip6mr: fix notification device destruction

2017-04-21 Thread Nikolay Aleksandrov
On 21/04/17 22:50, Nikolay Aleksandrov wrote:
> On 21/04/17 22:36, David Miller wrote:
>> From: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
>> Date: Fri, 21 Apr 2017 21:30:42 +0300
>>
>>> On 21/04/17 20:42, Nikolay Aleksandrov wrote:
>>>> Andrey Konovalov reported a BUG in the ip6mr code, caused
>>>> because we call unregister_netdevice_many for a device that is already
>>>> being destroyed. In IPv4's ipmr that was resolved by two commits
>>>> a long time ago by introducing the "notify" parameter to the delete
>>>> function and avoiding the unregister when called from a notifier, so
>>>> let's do the same for ip6mr.
>>  ...
>>> +CC LKML and Linus
>>
>> Applied, thanks Nikolay and thanks Andrey for the report and testing.
>>
>> Nikolay, how far does this bug go back?
>>
> 
> Good question. AFAICS it goes back to when ip6mr was introduced, since it was copied from ipmr:
> commit 7bc570c8b4f7
> Author: YOSHIFUJI Hideaki <yoshf...@linux-ipv6.org>
> Date:   Thu Apr 3 09:22:53 2008 +0900
> 
> [IPV6] MROUTE: Support multicast forwarding.
> 
> 

Oops no, my bad. That wouldn't cause it to BUG because it was already removed
by mif6_delete earlier. So, since it can be destroyed by a netns exiting,
currently I don't see any other way outside of ip6mr to destroy that device.

That should be:
commit 8229efdaef1e
Author: Benjamin Thery <benjamin.th...@bull.net>
Date:   Wed Dec 10 16:30:15 2008 -0800

netns: ip6mr: enable namespace support in ipv6 multicast forwarding code


Which allowed the notifier to be executed for pimreg devices in other network 
namespaces.





Re: [PATCH net] ip6mr: fix notification device destruction

2017-04-21 Thread Nikolay Aleksandrov
On 21/04/17 22:36, David Miller wrote:
> From: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
> Date: Fri, 21 Apr 2017 21:30:42 +0300
> 
>> On 21/04/17 20:42, Nikolay Aleksandrov wrote:
>>> Andrey Konovalov reported a BUG in the ip6mr code, caused
>>> because we call unregister_netdevice_many for a device that is already
>>> being destroyed. In IPv4's ipmr that was resolved by two commits
>>> a long time ago by introducing the "notify" parameter to the delete
>>> function and avoiding the unregister when called from a notifier, so
>>> let's do the same for ip6mr.
>  ...
>> +CC LKML and Linus
> 
> Applied, thanks Nikolay and thanks Andrey for the report and testing.
> 
> Nikolay, how far does this bug go back?
> 

Good question. AFAICS it goes back to when ip6mr was introduced, since it was copied from ipmr:
commit 7bc570c8b4f7
Author: YOSHIFUJI Hideaki <yoshf...@linux-ipv6.org>
Date:   Thu Apr 3 09:22:53 2008 +0900

[IPV6] MROUTE: Support multicast forwarding.




Re: [PATCH net] ip6mr: fix notification device destruction

2017-04-21 Thread Nikolay Aleksandrov
On 21/04/17 20:42, Nikolay Aleksandrov wrote:
> Andrey Konovalov reported a BUG in the ip6mr code, caused
> because we call unregister_netdevice_many for a device that is already
> being destroyed. In IPv4's ipmr that was resolved by two commits
> a long time ago by introducing the "notify" parameter to the delete
> function and avoiding the unregister when called from a notifier, so
> let's do the same for ip6mr.
> 
> The trace from Andrey:
> [ cut here ]
> kernel BUG at net/core/dev.c:6813!
> invalid opcode:  [#1] SMP KASAN
> Modules linked in:
> CPU: 1 PID: 1165 Comm: kworker/u4:3 Not tainted 4.11.0-rc7+ #251
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
> 01/01/2011
> Workqueue: netns cleanup_net
> task: 880069208000 task.stack: 8800692d8000
> RIP: 0010:rollback_registered_many+0x348/0xeb0 net/core/dev.c:6813
> RSP: 0018:8800692de7f0 EFLAGS: 00010297
> RAX: 880069208000 RBX: 0002 RCX: 0001
> RDX:  RSI:  RDI: 88006af90569
> RBP: 8800692de9f0 R08: 8800692dec60 R09: 
> R10: 0006 R11:  R12: 88006af90070
> R13: 8800692debf0 R14: dc00 R15: 88006af9
> FS:  () GS:88006cb0()
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7fe7e897d870 CR3: 657e7000 CR4: 06e0
> Call Trace:
>  unregister_netdevice_many.part.105+0x87/0x440 net/core/dev.c:7881
>  unregister_netdevice_many+0xc8/0x120 net/core/dev.c:7880
>  ip6mr_device_event+0x362/0x3f0 net/ipv6/ip6mr.c:1346
>  notifier_call_chain+0x145/0x2f0 kernel/notifier.c:93
>  __raw_notifier_call_chain kernel/notifier.c:394
>  raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
>  call_netdevice_notifiers_info+0x51/0x90 net/core/dev.c:1647
>  call_netdevice_notifiers net/core/dev.c:1663
>  rollback_registered_many+0x919/0xeb0 net/core/dev.c:6841
>  unregister_netdevice_many.part.105+0x87/0x440 net/core/dev.c:7881
>  unregister_netdevice_many net/core/dev.c:7880
>  default_device_exit_batch+0x4fa/0x640 net/core/dev.c:8333
>  ops_exit_list.isra.4+0x100/0x150 net/core/net_namespace.c:144
>  cleanup_net+0x5a8/0xb40 net/core/net_namespace.c:463
>  process_one_work+0xc04/0x1c10 kernel/workqueue.c:2097
>  worker_thread+0x223/0x19c0 kernel/workqueue.c:2231
>  kthread+0x35e/0x430 kernel/kthread.c:231
>  ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
> Code: 3c 32 00 0f 85 70 0b 00 00 48 b8 00 02 00 00 00 00 ad de 49 89
> 47 78 e9 93 fe ff ff 49 8d 57 70 49 8d 5f 78 eb 9e e8 88 7a 14 fe <0f>
> 0b 48 8b 9d 28 fe ff ff e8 7a 7a 14 fe 48 b8 00 00 00 00 00
> RIP: rollback_registered_many+0x348/0xeb0 RSP: 8800692de7f0
> ---[ end trace e0b29c57e9b3292c ]---
> 
> Reported-by: Andrey Konovalov <andreyk...@google.com>
> Signed-off-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
> ---

+CC LKML and Linus

> Andrey, could you please test with this patch applied?
> I have run the reproducer locally and can no longer trigger the bug.
> I've made "notify" an int instead of a bool only to be closer to the ipmr
> code for easier future patches.
> 
>  net/ipv6/ip6mr.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
> index fb4546e80c82..374997d26488 100644
> --- a/net/ipv6/ip6mr.c
> +++ b/net/ipv6/ip6mr.c
> @@ -774,7 +774,8 @@ static struct net_device *ip6mr_reg_vif(struct net *net, struct mr6_table *mrt)
>   *	Delete a VIF entry
>   */
>  
> -static int mif6_delete(struct mr6_table *mrt, int vifi, struct list_head *head)
> +static int mif6_delete(struct mr6_table *mrt, int vifi, int notify,
> +		       struct list_head *head)
>  {
>  	struct mif_device *v;
>  	struct net_device *dev;
> @@ -820,7 +821,7 @@ static int mif6_delete(struct mr6_table *mrt, int vifi, struct list_head *head)
>  			     dev->ifindex, &in6_dev->cnf);
>  	}
>  
> -	if (v->flags & MIFF_REGISTER)
> +	if ((v->flags & MIFF_REGISTER) && !notify)
>  		unregister_netdevice_queue(dev, head);
>  
>  	dev_put(dev);
> @@ -1331,7 +1332,6 @@ static int ip6mr_device_event(struct notifier_block *this,
>  	struct mr6_table *mrt;
>  	struct mif_device *v;
>  	int ct;
> -	LIST_HEAD(list);
>  
>  	if (event != NETDEV_UNREGISTER)
>  		return NOTIFY_DONE;
> @@ -1340,10 +1340,9 @@ static int ip6mr_device_event(struct notifier_block *this,
>  	v 
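
The archived copy of the patch is truncated above; the gist of the remaining
notifier-side hunk (a paraphrased sketch, not the verbatim diff) is to pass
notify=1 so that mif6_delete() skips the unregister for a device the core is
already tearing down:

	/* in ip6mr_device_event(), for NETDEV_UNREGISTER: */
	for (ct = 0; ct < mrt->maxvif; ct++, v++) {
		if (v->dev == dev)
			mif6_delete(mrt, ct, 1, NULL);
	}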


Re: net/core: BUG in unregister_netdevice_many

2017-04-21 Thread Nikolay Aleksandrov
On 21/04/17 20:42, Linus Torvalds wrote:
> On Fri, Apr 21, 2017 at 10:25 AM, Linus Torvalds
>  wrote:
>>
>> I'm assuming that the real cause is simply that "dev->reg_state" ends
>> up being NETREG_UNREGISTERING or something. Maybe the BUG_ON() could
>> be just removed, and replaced by the previous warning about
>> NETREG_UNINITIALIZED.
>>
>> Something like the attached (TOTALLY UNTESTED) patch.
> 
> .. might as well test it.
> 
> That patch doesn't fix the problem, but it does show that yes, it was
> NETREG_UNREGISTERING:
> 
>   unregister_netdevice: device pim6reg/962dc4606000 was not registered (2)
> 
> but then immediately afterwards we get
> 
>   general protection fault:  [#1] SMP
>   Workqueue: netns cleanup_net
>   RIP: 0010:dev_shutdown+0xe/0xc0
>   Call Trace:
>  rollback_registered_many+0x2a5/0x440
>  unregister_netdevice_many+0x1e/0xb0
>  default_device_exit_batch+0x145/0x170
> 
> which is due to a
> 
> mov    0x388(%rdi),%eax
> 
> where %rdi is 0xdead0090. That is at the very beginning of
> dev_shutdown, it's "dev" itself that has that value, so it comes from
> (_another_) invocation of rollback_registered_many(), when it does
> that
> 
> list_for_each_entry(dev, head, unreg_list) {
> 
> so it seems to be a case of another "list_del() leaves list in bad
> state", and it was the added test for "dev->reg_state !=
> NETREG_REGISTERED" that did that
> 
> list_del(&dev->unreg_list);
> 
> and left random contents in the unreg_list.
> 
> So that "handle error case" was almost certainly just buggy too.
> 
> And the bug seems to be that we're trying to unregister a netdevice
> that has already been unregistered.
> 
> Over to Eric and networking people. This oops is user-triggerable, and
> leaves the machine in a bad state (the original BUG_ON() and the new
> GP fault both happen while holding the RTNL, so networking is not
> healthy afterwards.
> 
>   Linus
> 

Right, I've already posted a patch for ip6mr that should fix the issue.
CCed you and LKML just now.

Thanks,
 Nik
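
In list terms, the failure mode Linus describes boils down to deleting an
entry and then letting the iterator advance through poisoned pointers
(a simplified sketch, not the actual net/core code):

	list_for_each_entry(dev, head, unreg_list) {
		if (dev->reg_state != NETREG_REGISTERED) {
			/* list_del() poisons unreg_list.next/prev ... */
			list_del(&dev->unreg_list);
			/* ... and the iterator then follows LIST_POISON to
			 * compute the next entry, hence the GP fault on a
			 * 0xdead... address; list_for_each_entry_safe()
			 * would avoid this.
			 */
		}
	}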


