Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread David Ahern
On 12/19/16 6:56 PM, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern  wrote:
>> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>>> net.socket_create_filter = "none": no filter
>>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>>> net.socket_create_filter = "disallow": no sockets created period
>>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>>
>> Such a scheme works for the socket create filter b/c it is a very simple use 
>> case. It does not work for the ingress and egress which allow generic bpf 
>> filters.
> 
> Can you elaborate on what goes wrong?  (Obviously the
> "address_family_list" example makes no sense in that context.)

Being able to dump a filter or see that one exists would be a great add-on, but 
I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for 
loading generic BPF filters. Simple cases like "disallow" are easy -- just 
return 0 in the filter, no complicated BPF code needed. The rest are specific 
cases of the moment which goes against the intent of ebpf and generic 
programmability. 
 
>>
>> ...
>>
 you're ignoring use cases I described earlier.
 In vrf case there is only one ifindex it needs to bind to.
>>>
>>> I'm totally lost.  Can you explain what this has to do with the cgroup
>>> hierarchy?
>>
>> I think the point is that a group hierarchy makes no sense for the VRF use 
>> case. What I put into iproute2 is
>>
>> cgrp2/vrf/NAME
>>
>> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 
>> sockets to a specific device index. cgrp2/vrf is the "default" vrf and does 
>> not have a filter. A user can certainly add another layer 
>> cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not 
>> make sense.
> 
> I tend to agree.  I still think that the mechanism as it stands is
> broken in other respects and should be fixed before it goes live.  I
> have no desire to cause problems for the vrf use case.
> 
> But keep in mind that the vrf use case is, in Linus' tree, a bit
> broken right now in its interactions with other users of the same
> mechanism.  Suppose I create a container and want to trace all of its
> created sockets.  I'll set up cgrp2/container and load my tracer as a
> socket creation hook.  Then a container sets up
> cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding
> filter.  Now the tracing stops working -- oops.

There are other ways to achieve socket tracing, but I get your point -- nested 
cases do not work as users may want.


> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
> have to do it now (or disable the feature for 4.10).  This is why I'm
> bringing this whole thing up now.

 We don't have to touch user visible api here, so extensions are fine.
>>>
>>> Huh?  My example in the original email attaches a program in a
>>> sub-hierarchy.  Are you saying that 4.11 could make that example stop
>>> working?
>>
>> Are you suggesting sub-cgroups should not be allowed to override the filter 
>> of a parent cgroup?
> 
> Yes, exactly.  I think there are two sensible behaviors:
> 
> a) sub-cgroups cannot have a filter at all of the parent has a filter.
> (This is the "punt" approach -- it lets different semantics be
> assigned later without breaking userspace.)
> 
> b) sub-cgroups can have a filter if a parent does, too.  The semantics
> are that the sub-cgroup filter runs first and all side-effects occur.
> If that filter says "reject" then ancestor filters are skipped.  If
> that filter says "accept", then the ancestor filter is run and its
> side-effects happen as well.  (And so on, all the way up to the root.)

That comes with a big performance hit for skb / data path cases.

I'm riding my use case on Daniel's work, and as I understand it the nesting 
case has been discussed. I'll defer to Daniel and Alexei on this part.



Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread David Ahern
On 12/19/16 6:56 PM, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern  wrote:
>> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>>> net.socket_create_filter = "none": no filter
>>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>>> net.socket_create_filter = "disallow": no sockets created period
>>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>>
>> Such a scheme works for the socket create filter b/c it is a very simple use 
>> case. It does not work for the ingress and egress which allow generic bpf 
>> filters.
> 
> Can you elaborate on what goes wrong?  (Obviously the
> "address_family_list" example makes no sense in that context.)

Being able to dump a filter or see that one exists would be a great add-on, but 
I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for 
loading generic BPF filters. Simple cases like "disallow" are easy -- just 
return 0 in the filter, no complicated BPF code needed. The rest are specific 
cases of the moment which goes against the intent of ebpf and generic 
programmability. 
 
>>
>> ...
>>
 you're ignoring use cases I described earlier.
 In vrf case there is only one ifindex it needs to bind to.
>>>
>>> I'm totally lost.  Can you explain what this has to do with the cgroup
>>> hierarchy?
>>
>> I think the point is that a group hierarchy makes no sense for the VRF use 
>> case. What I put into iproute2 is
>>
>> cgrp2/vrf/NAME
>>
>> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 
>> sockets to a specific device index. cgrp2/vrf is the "default" vrf and does 
>> not have a filter. A user can certainly add another layer 
>> cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not 
>> make sense.
> 
> I tend to agree.  I still think that the mechanism as it stands is
> broken in other respects and should be fixed before it goes live.  I
> have no desire to cause problems for the vrf use case.
> 
> But keep in mind that the vrf use case is, in Linus' tree, a bit
> broken right now in its interactions with other users of the same
> mechanism.  Suppose I create a container and want to trace all of its
> created sockets.  I'll set up cgrp2/container and load my tracer as a
> socket creation hook.  Then a container sets up
> cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding
> filter.  Now the tracing stops working -- oops.

There are other ways to achieve socket tracing, but I get your point -- nested 
cases do not work as users may want.


> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
> have to do it now (or disable the feature for 4.10).  This is why I'm
> bringing this whole thing up now.

 We don't have to touch user visible api here, so extensions are fine.
>>>
>>> Huh?  My example in the original email attaches a program in a
>>> sub-hierarchy.  Are you saying that 4.11 could make that example stop
>>> working?
>>
>> Are you suggesting sub-cgroups should not be allowed to override the filter 
>> of a parent cgroup?
> 
> Yes, exactly.  I think there are two sensible behaviors:
> 
> a) sub-cgroups cannot have a filter at all of the parent has a filter.
> (This is the "punt" approach -- it lets different semantics be
> assigned later without breaking userspace.)
> 
> b) sub-cgroups can have a filter if a parent does, too.  The semantics
> are that the sub-cgroup filter runs first and all side-effects occur.
> If that filter says "reject" then ancestor filters are skipped.  If
> that filter says "accept", then the ancestor filter is run and its
> side-effects happen as well.  (And so on, all the way up to the root.)

That comes with a big performance hit for skb / data path cases.

I'm riding my use case on Daniel's work, and as I understand it the nesting 
case has been discussed. I'll defer to Daniel and Alexei on this part.



Re: [PATCH 2/7] PM / devfreq: exynos-ppmu: Use the regmap interface to handle the registers

2016-12-19 Thread Chanwoo Choi
Hi,

On 2016년 12월 20일 04:47, Tobias Jakobi wrote:
> Hello,
> 
> I was just wondering what is improved by moving to regmap. For me this
> looks like it only complicates the code. Lots of regmap_{read,write}()
> and for each one of these we need to check the return code.

It is correct to check the return value. It cover all of exception.

> 
> Also when exactly did __raw_writel() and friends become legacy?

Also, I want to use the regmap interface (regmap-io) instead of 
__raw_readl/__raw_writel using the base address 
directly.

Regards,
Chanwoo Choi

> 
> With best wishes,
> Tobias
> 
> 
> Chanwoo Choi wrote:
>> This patch uses the regmap interface to read and write the registers for 
>> exynos
>> PPMU device instead of the legacy memory map functions.
>>
>> Cc: Kukjin Kim 
>> Cc: Krzysztof Kozlowski 
>> Cc: Javier Martinez Canillas 
>> Cc: linux-samsung-...@vger.kernel.org
>> Signed-off-by: Chanwoo Choi 
>> ---
>>  drivers/devfreq/event/exynos-ppmu.c | 326 
>> ++--
>>  1 file changed, 237 insertions(+), 89 deletions(-)
>>
>> diff --git a/drivers/devfreq/event/exynos-ppmu.c 
>> b/drivers/devfreq/event/exynos-ppmu.c
>> index 107eb91a9415..fb3706faf5bd 100644
>> --- a/drivers/devfreq/event/exynos-ppmu.c
>> +++ b/drivers/devfreq/event/exynos-ppmu.c
>> @@ -17,13 +17,13 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  
>>  #include "exynos-ppmu.h"
>>  
>>  struct exynos_ppmu_data {
>> -void __iomem *base;
>>  struct clk *clk;
>>  };
>>  
>> @@ -33,6 +33,7 @@ struct exynos_ppmu {
>>  unsigned int num_events;
>>  
>>  struct device *dev;
>> +struct regmap *regmap;
>>  
>>  struct exynos_ppmu_data ppmu;
>>  };
>> @@ -107,20 +108,28 @@ static int exynos_ppmu_find_ppmu_id(struct 
>> devfreq_event_dev *edev)
>>  static int exynos_ppmu_disable(struct devfreq_event_dev *edev)
>>  {
>>  struct exynos_ppmu *info = devfreq_event_get_drvdata(edev);
>> +int ret;
>>  u32 pmnc;
>>  
>>  /* Disable all counters */
>> -__raw_writel(PPMU_CCNT_MASK |
>> - PPMU_PMCNT0_MASK |
>> - PPMU_PMCNT1_MASK |
>> - PPMU_PMCNT2_MASK |
>> - PPMU_PMCNT3_MASK,
>> - info->ppmu.base + PPMU_CNTENC);
>> +ret = regmap_write(info->regmap, PPMU_CNTENC,
>> +PPMU_CCNT_MASK |
>> +PPMU_PMCNT0_MASK |
>> +PPMU_PMCNT1_MASK |
>> +PPMU_PMCNT2_MASK |
>> +PPMU_PMCNT3_MASK);
>> +if (ret < 0)
>> +return ret;
>>  
>>  /* Disable PPMU */
>> -pmnc = __raw_readl(info->ppmu.base + PPMU_PMNC);
>> +ret = regmap_read(info->regmap, PPMU_PMNC, );
>> +if (ret < 0)
>> +return ret;
>> +
>>  pmnc &= ~PPMU_PMNC_ENABLE_MASK;
>> -__raw_writel(pmnc, info->ppmu.base + PPMU_PMNC);
>> +ret = regmap_write(info->regmap, PPMU_PMNC, pmnc);
>> +if (ret < 0)
>> +return ret;
>>  
>>  return 0;
>>  }
>> @@ -129,29 +138,42 @@ static int exynos_ppmu_set_event(struct 
>> devfreq_event_dev *edev)
>>  {
>>  struct exynos_ppmu *info = devfreq_event_get_drvdata(edev);
>>  int id = exynos_ppmu_find_ppmu_id(edev);
>> +int ret;
>>  u32 pmnc, cntens;
>>  
>>  if (id < 0)
>>  return id;
>>  
>>  /* Enable specific counter */
>> -cntens = __raw_readl(info->ppmu.base + PPMU_CNTENS);
>> +ret = regmap_read(info->regmap, PPMU_CNTENS, );
>> +if (ret < 0)
>> +return ret;
>> +
>>  cntens |= (PPMU_CCNT_MASK | (PPMU_ENABLE << id));
>> -__raw_writel(cntens, info->ppmu.base + PPMU_CNTENS);
>> +ret = regmap_write(info->regmap, PPMU_CNTENS, cntens);
>> +if (ret < 0)
>> +return ret;
>>  
>>  /* Set the event of Read/Write data count  */
>> -__raw_writel(PPMU_RO_DATA_CNT | PPMU_WO_DATA_CNT,
>> -info->ppmu.base + PPMU_BEVTxSEL(id));
>> +ret = regmap_write(info->regmap, PPMU_BEVTxSEL(id),
>> +PPMU_RO_DATA_CNT | PPMU_WO_DATA_CNT);
>> +if (ret < 0)
>> +return ret;
>>  
>>  /* Reset cycle counter/performance counter and enable PPMU */
>> -pmnc = __raw_readl(info->ppmu.base + PPMU_PMNC);
>> +ret = regmap_read(info->regmap, PPMU_PMNC, );
>> +if (ret < 0)
>> +return ret;
>> +
>>  pmnc &= ~(PPMU_PMNC_ENABLE_MASK
>>  | PPMU_PMNC_COUNTER_RESET_MASK
>>  | PPMU_PMNC_CC_RESET_MASK);
>>  pmnc |= (PPMU_ENABLE << PPMU_PMNC_ENABLE_SHIFT);
>>  pmnc |= (PPMU_ENABLE << PPMU_PMNC_COUNTER_RESET_SHIFT);
>>  pmnc |= (PPMU_ENABLE << PPMU_PMNC_CC_RESET_SHIFT);
>> -__raw_writel(pmnc, info->ppmu.base + PPMU_PMNC);
>> +ret = regmap_write(info->regmap, PPMU_PMNC, pmnc);
>> +if (ret < 0)
>> +

Re: [PATCH 2/7] PM / devfreq: exynos-ppmu: Use the regmap interface to handle the registers

2016-12-19 Thread Chanwoo Choi
Hi,

On 2016년 12월 20일 04:47, Tobias Jakobi wrote:
> Hello,
> 
> I was just wondering what is improved by moving to regmap. For me this
> looks like it only complicates the code. Lots of regmap_{read,write}()
> and for each one of these we need to check the return code.

It is correct to check the return value. It cover all of exception.

> 
> Also when exactly did __raw_writel() and friends become legacy?

Also, I want to use the regmap interface (regmap-io) instead of 
__raw_readl/__raw_writel using the base address 
directly.

Regards,
Chanwoo Choi

> 
> With best wishes,
> Tobias
> 
> 
> Chanwoo Choi wrote:
>> This patch uses the regmap interface to read and write the registers for 
>> exynos
>> PPMU device instead of the legacy memory map functions.
>>
>> Cc: Kukjin Kim 
>> Cc: Krzysztof Kozlowski 
>> Cc: Javier Martinez Canillas 
>> Cc: linux-samsung-...@vger.kernel.org
>> Signed-off-by: Chanwoo Choi 
>> ---
>>  drivers/devfreq/event/exynos-ppmu.c | 326 
>> ++--
>>  1 file changed, 237 insertions(+), 89 deletions(-)
>>
>> diff --git a/drivers/devfreq/event/exynos-ppmu.c 
>> b/drivers/devfreq/event/exynos-ppmu.c
>> index 107eb91a9415..fb3706faf5bd 100644
>> --- a/drivers/devfreq/event/exynos-ppmu.c
>> +++ b/drivers/devfreq/event/exynos-ppmu.c
>> @@ -17,13 +17,13 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  
>>  #include "exynos-ppmu.h"
>>  
>>  struct exynos_ppmu_data {
>> -void __iomem *base;
>>  struct clk *clk;
>>  };
>>  
>> @@ -33,6 +33,7 @@ struct exynos_ppmu {
>>  unsigned int num_events;
>>  
>>  struct device *dev;
>> +struct regmap *regmap;
>>  
>>  struct exynos_ppmu_data ppmu;
>>  };
>> @@ -107,20 +108,28 @@ static int exynos_ppmu_find_ppmu_id(struct 
>> devfreq_event_dev *edev)
>>  static int exynos_ppmu_disable(struct devfreq_event_dev *edev)
>>  {
>>  struct exynos_ppmu *info = devfreq_event_get_drvdata(edev);
>> +int ret;
>>  u32 pmnc;
>>  
>>  /* Disable all counters */
>> -__raw_writel(PPMU_CCNT_MASK |
>> - PPMU_PMCNT0_MASK |
>> - PPMU_PMCNT1_MASK |
>> - PPMU_PMCNT2_MASK |
>> - PPMU_PMCNT3_MASK,
>> - info->ppmu.base + PPMU_CNTENC);
>> +ret = regmap_write(info->regmap, PPMU_CNTENC,
>> +PPMU_CCNT_MASK |
>> +PPMU_PMCNT0_MASK |
>> +PPMU_PMCNT1_MASK |
>> +PPMU_PMCNT2_MASK |
>> +PPMU_PMCNT3_MASK);
>> +if (ret < 0)
>> +return ret;
>>  
>>  /* Disable PPMU */
>> -pmnc = __raw_readl(info->ppmu.base + PPMU_PMNC);
>> +ret = regmap_read(info->regmap, PPMU_PMNC, );
>> +if (ret < 0)
>> +return ret;
>> +
>>  pmnc &= ~PPMU_PMNC_ENABLE_MASK;
>> -__raw_writel(pmnc, info->ppmu.base + PPMU_PMNC);
>> +ret = regmap_write(info->regmap, PPMU_PMNC, pmnc);
>> +if (ret < 0)
>> +return ret;
>>  
>>  return 0;
>>  }
>> @@ -129,29 +138,42 @@ static int exynos_ppmu_set_event(struct 
>> devfreq_event_dev *edev)
>>  {
>>  struct exynos_ppmu *info = devfreq_event_get_drvdata(edev);
>>  int id = exynos_ppmu_find_ppmu_id(edev);
>> +int ret;
>>  u32 pmnc, cntens;
>>  
>>  if (id < 0)
>>  return id;
>>  
>>  /* Enable specific counter */
>> -cntens = __raw_readl(info->ppmu.base + PPMU_CNTENS);
>> +ret = regmap_read(info->regmap, PPMU_CNTENS, );
>> +if (ret < 0)
>> +return ret;
>> +
>>  cntens |= (PPMU_CCNT_MASK | (PPMU_ENABLE << id));
>> -__raw_writel(cntens, info->ppmu.base + PPMU_CNTENS);
>> +ret = regmap_write(info->regmap, PPMU_CNTENS, cntens);
>> +if (ret < 0)
>> +return ret;
>>  
>>  /* Set the event of Read/Write data count  */
>> -__raw_writel(PPMU_RO_DATA_CNT | PPMU_WO_DATA_CNT,
>> -info->ppmu.base + PPMU_BEVTxSEL(id));
>> +ret = regmap_write(info->regmap, PPMU_BEVTxSEL(id),
>> +PPMU_RO_DATA_CNT | PPMU_WO_DATA_CNT);
>> +if (ret < 0)
>> +return ret;
>>  
>>  /* Reset cycle counter/performance counter and enable PPMU */
>> -pmnc = __raw_readl(info->ppmu.base + PPMU_PMNC);
>> +ret = regmap_read(info->regmap, PPMU_PMNC, );
>> +if (ret < 0)
>> +return ret;
>> +
>>  pmnc &= ~(PPMU_PMNC_ENABLE_MASK
>>  | PPMU_PMNC_COUNTER_RESET_MASK
>>  | PPMU_PMNC_CC_RESET_MASK);
>>  pmnc |= (PPMU_ENABLE << PPMU_PMNC_ENABLE_SHIFT);
>>  pmnc |= (PPMU_ENABLE << PPMU_PMNC_COUNTER_RESET_SHIFT);
>>  pmnc |= (PPMU_ENABLE << PPMU_PMNC_CC_RESET_SHIFT);
>> -__raw_writel(pmnc, info->ppmu.base + PPMU_PMNC);
>> +ret = regmap_write(info->regmap, PPMU_PMNC, pmnc);
>> +if (ret < 0)
>> +return ret;
>>  
>>  return 0;
>>  }
>> @@ -161,40 +183,64 @@ static int 

[PATCH] rtlwifi: Fix kernel oops introduced with commit e49656147359

2016-12-19 Thread Larry Finger
With commit e49656147359 {"rtlwifi: Use dev_kfree_skb_irq instead of
kfree_skb"), the method used to free an skb was changed because the
kfree_skb() was inside a spinlock. What was forgotten is that kfree_skb()
guards against a NULL value for the argument. Routine dev_kfree_skb_irq()
does not, and a test is needed to prevent kernel panics.

Fixes: commit e49656147359 {"rtlwifi: Use dev_kfree_skb_irq instead of 
kfree_skb")
Signed-off-by: Larry Finger 
Cc: Stable  (4.9+)
Cc: Wei Yongjun 
---
Kalle,

This change should be sent to mainline during the 4.10 merge period,
or as soon as possible.

Thanks,

Larry
---
 drivers/net/wireless/realtek/rtlwifi/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/realtek/rtlwifi/core.c 
b/drivers/net/wireless/realtek/rtlwifi/core.c
index 955055b..df8b977 100644
--- a/drivers/net/wireless/realtek/rtlwifi/core.c
+++ b/drivers/net/wireless/realtek/rtlwifi/core.c
@@ -1823,7 +1823,8 @@ bool rtl_cmd_send_packet(struct ieee80211_hw *hw, struct 
sk_buff *skb)
 
spin_lock_irqsave(>locks.irq_th_lock, flags);
pskb = __skb_dequeue(>queue);
-   dev_kfree_skb_irq(pskb);
+   if (pskb)
+   dev_kfree_skb_irq(pskb);
 
/*this is wrong, fill_tx_cmddesc needs update*/
pdesc = >desc[0];
-- 
2.10.2



[PATCH] rtlwifi: Fix kernel oops introduced with commit e49656147359

2016-12-19 Thread Larry Finger
With commit e49656147359 {"rtlwifi: Use dev_kfree_skb_irq instead of
kfree_skb"), the method used to free an skb was changed because the
kfree_skb() was inside a spinlock. What was forgotten is that kfree_skb()
guards against a NULL value for the argument. Routine dev_kfree_skb_irq()
does not, and a test is needed to prevent kernel panics.

Fixes: commit e49656147359 {"rtlwifi: Use dev_kfree_skb_irq instead of 
kfree_skb")
Signed-off-by: Larry Finger 
Cc: Stable  (4.9+)
Cc: Wei Yongjun 
---
Kalle,

This change should be sent to mainline during the 4.10 merge period,
or as soon as possible.

Thanks,

Larry
---
 drivers/net/wireless/realtek/rtlwifi/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/realtek/rtlwifi/core.c 
b/drivers/net/wireless/realtek/rtlwifi/core.c
index 955055b..df8b977 100644
--- a/drivers/net/wireless/realtek/rtlwifi/core.c
+++ b/drivers/net/wireless/realtek/rtlwifi/core.c
@@ -1823,7 +1823,8 @@ bool rtl_cmd_send_packet(struct ieee80211_hw *hw, struct 
sk_buff *skb)
 
spin_lock_irqsave(>locks.irq_th_lock, flags);
pskb = __skb_dequeue(>queue);
-   dev_kfree_skb_irq(pskb);
+   if (pskb)
+   dev_kfree_skb_irq(pskb);
 
/*this is wrong, fill_tx_cmddesc needs update*/
pdesc = >desc[0];
-- 
2.10.2



Re: [PATCH] block: loose check on sg gap

2016-12-19 Thread Jens Axboe
On 12/19/2016 07:07 PM, Ming Lei wrote:
> On Sun, Dec 18, 2016 at 12:49 AM, Jens Axboe  wrote:
>> On 12/17/2016 03:49 AM, Ming Lei wrote:
>>> If the last bvec of the 1st bio and the 1st bvec of the next
>>> bio are contineous physically, and the latter can be merged
>>> to last segment of the 1st bio, we should think they don't
>>> violate sg gap(or virt boundary) limit.
>>>
>>> Both Vitaly and Dexuan reported lots of unmergeable small bios
>>> are observed when running mkfs on Hyper-V virtual storage, and
>>> performance becomes quite low, so this patch is figured out for
>>> fixing the performance issue.
>>>
>>> The same issue should exist on NVMe too sine it sets virt boundary too.
>>
>> It looks pretty reasonable to me. I'll queue it up for some testing,
>> changes like this always make me a little nervous.
> 
> Understood.
> 
> But given it is still in early stage of 4.10 cycle, seems fine to expose
> it now, and we should have enough time to fix it if there might be
> regressions.
> 
> BTW, it passes my xfstest(ext4) over sata/NVMe.

It's been fine here in testing, too. I'm not worried about performance
regressions, those we can always fix. Merging makes me worried about
corruption, and those regressions are much worse.

Any reason we need to rush this? I'd be more comfortable pushing this to
4.11, unless there are strong reasons this should make 4.10.

-- 
Jens Axboe



Re: [PATCH] block: loose check on sg gap

2016-12-19 Thread Jens Axboe
On 12/19/2016 07:07 PM, Ming Lei wrote:
> On Sun, Dec 18, 2016 at 12:49 AM, Jens Axboe  wrote:
>> On 12/17/2016 03:49 AM, Ming Lei wrote:
>>> If the last bvec of the 1st bio and the 1st bvec of the next
>>> bio are contineous physically, and the latter can be merged
>>> to last segment of the 1st bio, we should think they don't
>>> violate sg gap(or virt boundary) limit.
>>>
>>> Both Vitaly and Dexuan reported lots of unmergeable small bios
>>> are observed when running mkfs on Hyper-V virtual storage, and
>>> performance becomes quite low, so this patch is figured out for
>>> fixing the performance issue.
>>>
>>> The same issue should exist on NVMe too sine it sets virt boundary too.
>>
>> It looks pretty reasonable to me. I'll queue it up for some testing,
>> changes like this always make me a little nervous.
> 
> Understood.
> 
> But given it is still in early stage of 4.10 cycle, seems fine to expose
> it now, and we should have enough time to fix it if there might be
> regressions.
> 
> BTW, it passes my xfstest(ext4) over sata/NVMe.

It's been fine here in testing, too. I'm not worried about performance
regressions, those we can always fix. Merging makes me worried about
corruption, and those regressions are much worse.

Any reason we need to rush this? I'd be more comfortable pushing this to
4.11, unless there are strong reasons this should make 4.10.

-- 
Jens Axboe



Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-19 Thread Nicholas Piggin
On Mon, 19 Dec 2016 16:20:05 -0800
Dave Hansen  wrote:

> On 12/19/2016 03:07 PM, Linus Torvalds wrote:
> > +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> > +{
> > +   const int __maybe_unused nid = page_to_nid(virt_to_page(word));
> > +
> > +   return __bit_waitqueue(word, bit, nid);
> > 
> > No can do. Part of the problem with the old coffee was that it did that
> > virt_to_page() crud. That doesn't work with the virtually mapped stack.   
> 
> Ahhh, got it.
> 
> So, what did you have in mind?  Just redirect bit_waitqueue() to the
> "first_online_node" waitqueues?
> 
> wait_queue_head_t *bit_waitqueue(void *word, int bit)
> {
> return __bit_waitqueue(word, bit, first_online_node);
> }
> 
> We could do some fancy stuff like only do virt_to_page() for things in
> the linear map, but I'm not sure we'll see much of a gain for it.  None
> of the other waitqueue users look as pathological as the 'struct page'
> ones.  Maybe:
> 
> wait_queue_head_t *bit_waitqueue(void *word, int bit)
> {
>   int nid
>   if (word >= VMALLOC_START) /* all addrs not in linear map */
>   nid = first_online_node;
>   else
>   nid = page_to_nid(virt_to_page(word));
> return __bit_waitqueue(word, bit, nid);
> }

I think he meant just make the page_waitqueue do the per-node thing
and leave bit_waitqueue as the global bit.

It would be cool if CPUs had an instruction that translates an address
though. You could avoid all that lookup and just do it with the TLB :)

Thanks,
Nick




Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-19 Thread Nicholas Piggin
On Mon, 19 Dec 2016 16:20:05 -0800
Dave Hansen  wrote:

> On 12/19/2016 03:07 PM, Linus Torvalds wrote:
> > +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> > +{
> > +   const int __maybe_unused nid = page_to_nid(virt_to_page(word));
> > +
> > +   return __bit_waitqueue(word, bit, nid);
> > 
> > No can do. Part of the problem with the old coffee was that it did that
> > virt_to_page() crud. That doesn't work with the virtually mapped stack.   
> 
> Ahhh, got it.
> 
> So, what did you have in mind?  Just redirect bit_waitqueue() to the
> "first_online_node" waitqueues?
> 
> wait_queue_head_t *bit_waitqueue(void *word, int bit)
> {
> return __bit_waitqueue(word, bit, first_online_node);
> }
> 
> We could do some fancy stuff like only do virt_to_page() for things in
> the linear map, but I'm not sure we'll see much of a gain for it.  None
> of the other waitqueue users look as pathological as the 'struct page'
> ones.  Maybe:
> 
> wait_queue_head_t *bit_waitqueue(void *word, int bit)
> {
>   int nid
>   if (word >= VMALLOC_START) /* all addrs not in linear map */
>   nid = first_online_node;
>   else
>   nid = page_to_nid(virt_to_page(word));
> return __bit_waitqueue(word, bit, nid);
> }

I think he meant just make the page_waitqueue do the per-node thing
and leave bit_waitqueue as the global bit.

It would be cool if CPUs had an instruction that translates an address
though. You could avoid all that lookup and just do it with the TLB :)

Thanks,
Nick




Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-19 Thread Nicholas Piggin
On Mon, 19 Dec 2016 14:58:26 -0800
Dave Hansen  wrote:

> I saw a 4.8->4.9 regression (details below) that I attributed to:
> 
>   9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues
> 
> That commit took the bitlock waitqueues from being dynamically-allocated
> per-zone to being statically allocated and global.  As suggested by
> Linus, this makes them per-node, but keeps them statically-allocated.
> 
> It leaves us with more waitqueues than the global approach, inherently
> scales it up as we gain nodes, and avoids generating code for
> page_zone() which was evidently quite ugly.  The patch is pretty darn
> tiny too.
> 
> This turns what was a ~40% 4.8->4.9 regression into a 17% gain over
> what on 4.8 did.  That gain is a _bit_ surprising, but not entirely
> unexpected since we now get much simpler code from no page_zone() and a
> fixed-size array for which we don't have to follow a pointer (and get to
> do power-of-2 math).

I'll have to respin the PageWaiters patch and resend it. There were
just a couple of small issues picked up in review. I've just got side
tracked with getting a few other things done and haven't had time to
benchmark it properly.

I'd still like to see what per-node waitqueues does on top of that. If
it's significant for realistic workloads then it could be done for the
page waitqueues as Linus said.

Thanks,
Nick


Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-19 Thread Nicholas Piggin
On Mon, 19 Dec 2016 14:58:26 -0800
Dave Hansen  wrote:

> I saw a 4.8->4.9 regression (details below) that I attributed to:
> 
>   9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues
> 
> That commit took the bitlock waitqueues from being dynamically-allocated
> per-zone to being statically allocated and global.  As suggested by
> Linus, this makes them per-node, but keeps them statically-allocated.
> 
> It leaves us with more waitqueues than the global approach, inherently
> scales it up as we gain nodes, and avoids generating code for
> page_zone() which was evidently quite ugly.  The patch is pretty darn
> tiny too.
> 
> This turns what was a ~40% 4.8->4.9 regression into a 17% gain over
> what on 4.8 did.  That gain is a _bit_ surprising, but not entirely
> unexpected since we now get much simpler code from no page_zone() and a
> fixed-size array for which we don't have to follow a pointer (and get to
> do power-of-2 math).

I'll have to respin the PageWaiters patch and resend it. There were
just a couple of small issues picked up in review. I've just got side
tracked with getting a few other things done and haven't had time to
benchmark it properly.

I'd still like to see what per-node waitqueues does on top of that. If
it's significant for realistic workloads then it could be done for the
page waitqueues as Linus said.

Thanks,
Nick


Re: [PATCH 4/4] oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA

2016-12-19 Thread kbuild test robot
Hi Kirill,

[auto build test WARNING on mmotm/master]
[also build test WARNING on v4.9 next-20161219]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/mm-drop-zap_details-ignore_dirty/20161220-092938
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-x004-201651 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/asm-generic/bug.h:4:0,
from arch/x86/include/asm/bug.h:35,
from include/linux/bug.h:4,
from include/linux/mmdebug.h:4,
from include/linux/mm.h:8,
from include/linux/mman.h:4,
from mm/madvise.c:8:
   mm/madvise.c: In function 'madvise_dontneed':
   mm/madvise.c:476:7: error: implicit declaration of function 
'can_madv_dontneed_vma' [-Werror=implicit-function-declaration]
 if (!can_madv_dontneed_vma(vma))
  ^
   include/linux/compiler.h:149:30: note: in definition of macro '__trace_if'
 if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
 ^~~~
>> mm/madvise.c:476:2: note: in expansion of macro 'if'
 if (!can_madv_dontneed_vma(vma))
 ^~
   cc1: some warnings being treated as errors

vim +/if +476 mm/madvise.c

   460   *
   461   * NB: This interface discards data rather than pushes it out to swap,
   462   * as some implementations do.  This has performance implications for
   463   * applications like large transactional databases which want to discard
   464   * pages in anonymous maps after committing to backing store the data
   465   * that was kept in them.  There is no reason to write this data out to
   466   * the swap area if the application is discarding it.
   467   *
   468   * An interface that causes the system to free clean pages and flush
   469   * dirty pages is already available as msync(MS_INVALIDATE).
   470   */
   471  static long madvise_dontneed(struct vm_area_struct *vma,
   472   struct vm_area_struct **prev,
   473   unsigned long start, unsigned long end)
   474  {
   475  *prev = vma;
 > 476  if (!can_madv_dontneed_vma(vma))
   477  return -EINVAL;
   478  
   479  zap_page_range(vma, start, end - start);
   480  return 0;
   481  }
   482  
   483  /*
   484   * Application wants to free up the pages and associated backing store.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 4/4] oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA

2016-12-19 Thread kbuild test robot
Hi Kirill,

[auto build test WARNING on mmotm/master]
[also build test WARNING on v4.9 next-20161219]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/mm-drop-zap_details-ignore_dirty/20161220-092938
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-x004-201651 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/asm-generic/bug.h:4:0,
from arch/x86/include/asm/bug.h:35,
from include/linux/bug.h:4,
from include/linux/mmdebug.h:4,
from include/linux/mm.h:8,
from include/linux/mman.h:4,
from mm/madvise.c:8:
   mm/madvise.c: In function 'madvise_dontneed':
   mm/madvise.c:476:7: error: implicit declaration of function 
'can_madv_dontneed_vma' [-Werror=implicit-function-declaration]
 if (!can_madv_dontneed_vma(vma))
  ^
   include/linux/compiler.h:149:30: note: in definition of macro '__trace_if'
 if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
 ^~~~
>> mm/madvise.c:476:2: note: in expansion of macro 'if'
 if (!can_madv_dontneed_vma(vma))
 ^~
   cc1: some warnings being treated as errors

vim +/if +476 mm/madvise.c

   460   *
   461   * NB: This interface discards data rather than pushes it out to swap,
   462   * as some implementations do.  This has performance implications for
   463   * applications like large transactional databases which want to discard
   464   * pages in anonymous maps after committing to backing store the data
   465   * that was kept in them.  There is no reason to write this data out to
   466   * the swap area if the application is discarding it.
   467   *
   468   * An interface that causes the system to free clean pages and flush
   469   * dirty pages is already available as msync(MS_INVALIDATE).
   470   */
   471  static long madvise_dontneed(struct vm_area_struct *vma,
   472   struct vm_area_struct **prev,
   473   unsigned long start, unsigned long end)
   474  {
   475  *prev = vma;
 > 476  if (!can_madv_dontneed_vma(vma))
   477  return -EINVAL;
   478  
   479  zap_page_range(vma, start, end - start);
   480  return 0;
   481  }
   482  
   483  /*
   484   * Application wants to free up the pages and associated backing store.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[PATCH 1/2] net: hix5hd2_gmac: fix compatible strings name

2016-12-19 Thread Dongpo Li
The SoC hix5hd2 compatible string has the suffix "-gmac" and
we should not change its compatible string.
So we should name all the compatible string with the suffix "-gmac".
Creating a new name suffix "-gemac" is unnecessary.

We also add another SoC compatible string in dt binding documentation
and describe which generic version the SoC belongs to.

Fixes: d0fb6ba75dc0 ("net: hix5hd2_gmac: add generic compatible string")
Signed-off-by: Dongpo Li 
---
 .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt  | 13 -
 drivers/net/ethernet/hisilicon/hix5hd2_gmac.c   | 13 +++--
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt 
b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
index 063c02d..eea73ad 100644
--- a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
+++ b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
@@ -2,11 +2,14 @@ Hisilicon hix5hd2 gmac controller
 
 Required properties:
 - compatible: should contain one of the following SoC strings:
-   * "hisilicon,hix5hd2-gemac"
-   * "hisilicon,hi3798cv200-gemac"
+   * "hisilicon,hix5hd2-gmac"
+   * "hisilicon,hi3798cv200-gmac"
+   * "hisilicon,hi3516a-gmac"
and one of the following version string:
-   * "hisilicon,hisi-gemac-v1"
-   * "hisilicon,hisi-gemac-v2"
+   * "hisilicon,hisi-gmac-v1"
+   * "hisilicon,hisi-gmac-v2"
+  The version v1 includes SoCs hix5hd2.
+  The version v2 includes SoCs hi3798cv200, hi3516a.
 - reg: specifies base physical address(s) and size of the device registers.
   The first region is the MAC register base and size.
   The second region is external interface control register.
@@ -35,7 +38,7 @@ Required properties:
 
 Example:
gmac0: ethernet@f984 {
-   compatible = "hisilicon,hi3798cv200-gemac", 
"hisilicon,hisi-gemac-v2";
+   compatible = "hisilicon,hi3798cv200-gmac", 
"hisilicon,hisi-gmac-v2";
reg = <0xf984 0x1000>,<0xf984300c 0x4>;
interrupts = <0 71 4>;
#address-cells = <1>;
diff --git a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c 
b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
index ee7e9ce..418ca1f3 100644
--- a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
+++ b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
@@ -1316,10 +1316,11 @@ static int hix5hd2_dev_remove(struct platform_device 
*pdev)
 }
 
 static const struct of_device_id hix5hd2_of_match[] = {
-   { .compatible = "hisilicon,hisi-gemac-v1", .data = (void *)GEMAC_V1 },
-   { .compatible = "hisilicon,hisi-gemac-v2", .data = (void *)GEMAC_V2 },
-   { .compatible = "hisilicon,hix5hd2-gemac", .data = (void *)GEMAC_V1 },
-   { .compatible = "hisilicon,hi3798cv200-gemac", .data = (void *)GEMAC_V2 
},
+   { .compatible = "hisilicon,hisi-gmac-v1", .data = (void *)GEMAC_V1 },
+   { .compatible = "hisilicon,hisi-gmac-v2", .data = (void *)GEMAC_V2 },
+   { .compatible = "hisilicon,hix5hd2-gmac", .data = (void *)GEMAC_V1 },
+   { .compatible = "hisilicon,hi3798cv200-gmac", .data = (void *)GEMAC_V2 
},
+   { .compatible = "hisilicon,hi3516a-gmac", .data = (void *)GEMAC_V2 },
{},
 };
 
@@ -1327,7 +1328,7 @@ MODULE_DEVICE_TABLE(of, hix5hd2_of_match);
 
 static struct platform_driver hix5hd2_dev_driver = {
.driver = {
-   .name = "hisi-gemac",
+   .name = "hisi-gmac",
.of_match_table = hix5hd2_of_match,
},
.probe = hix5hd2_dev_probe,
@@ -1338,4 +1339,4 @@ module_platform_driver(hix5hd2_dev_driver);
 
 MODULE_DESCRIPTION("HISILICON Gigabit Ethernet MAC driver");
 MODULE_LICENSE("GPL v2");
-MODULE_ALIAS("platform:hisi-gemac");
+MODULE_ALIAS("platform:hisi-gmac");
-- 
2.8.2



[PATCH 2/2] ARM: dts: hix5hd2: don't change the existing compatible string

2016-12-19 Thread Dongpo Li
The SoC hix5hd2 compatible string has the suffix "-gmac" and
we should not change it.
We should only add the generic compatible string "hisi-gmac-v1".

Fixes: 0855950ba580 ("ARM: dts: hix5hd2: add gmac generic compatible and clock 
names")
Signed-off-by: Dongpo Li 
---
 arch/arm/boot/dts/hisi-x5hd2.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/hisi-x5hd2.dtsi 
b/arch/arm/boot/dts/hisi-x5hd2.dtsi
index 0da76c5..2d06f4c 100644
--- a/arch/arm/boot/dts/hisi-x5hd2.dtsi
+++ b/arch/arm/boot/dts/hisi-x5hd2.dtsi
@@ -436,7 +436,7 @@
};
 
gmac0: ethernet@184 {
-   compatible = "hisilicon,hix5hd2-gemac", 
"hisilicon,hisi-gemac-v1";
+   compatible = "hisilicon,hix5hd2-gmac", 
"hisilicon,hisi-gmac-v1";
reg = <0x184 0x1000>,<0x184300c 0x4>;
interrupts = <0 71 4>;
clocks = < HIX5HD2_MAC0_CLK>;
@@ -445,7 +445,7 @@
};
 
gmac1: ethernet@1841000 {
-   compatible = "hisilicon,hix5hd2-gemac", 
"hisilicon,hisi-gemac-v1";
+   compatible = "hisilicon,hix5hd2-gmac", 
"hisilicon,hisi-gmac-v1";
reg = <0x1841000 0x1000>,<0x1843010 0x4>;
interrupts = <0 72 4>;
clocks = < HIX5HD2_MAC1_CLK>;
-- 
2.8.2



[PATCH 1/2] net: hix5hd2_gmac: fix compatible strings name

2016-12-19 Thread Dongpo Li
The SoC hix5hd2 compatible string has the suffix "-gmac" and
we should not change its compatible string.
So we should name all the compatible string with the suffix "-gmac".
Creating a new name suffix "-gemac" is unnecessary.

We also add another SoC compatible string in dt binding documentation
and describe which generic version the SoC belongs to.

Fixes: d0fb6ba75dc0 ("net: hix5hd2_gmac: add generic compatible string")
Signed-off-by: Dongpo Li 
---
 .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt  | 13 -
 drivers/net/ethernet/hisilicon/hix5hd2_gmac.c   | 13 +++--
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt 
b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
index 063c02d..eea73ad 100644
--- a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
+++ b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
@@ -2,11 +2,14 @@ Hisilicon hix5hd2 gmac controller
 
 Required properties:
 - compatible: should contain one of the following SoC strings:
-   * "hisilicon,hix5hd2-gemac"
-   * "hisilicon,hi3798cv200-gemac"
+   * "hisilicon,hix5hd2-gmac"
+   * "hisilicon,hi3798cv200-gmac"
+   * "hisilicon,hi3516a-gmac"
and one of the following version string:
-   * "hisilicon,hisi-gemac-v1"
-   * "hisilicon,hisi-gemac-v2"
+   * "hisilicon,hisi-gmac-v1"
+   * "hisilicon,hisi-gmac-v2"
+  The version v1 includes SoCs hix5hd2.
+  The version v2 includes SoCs hi3798cv200, hi3516a.
 - reg: specifies base physical address(s) and size of the device registers.
   The first region is the MAC register base and size.
   The second region is external interface control register.
@@ -35,7 +38,7 @@ Required properties:
 
 Example:
gmac0: ethernet@f984 {
-   compatible = "hisilicon,hi3798cv200-gemac", 
"hisilicon,hisi-gemac-v2";
+   compatible = "hisilicon,hi3798cv200-gmac", 
"hisilicon,hisi-gmac-v2";
reg = <0xf984 0x1000>,<0xf984300c 0x4>;
interrupts = <0 71 4>;
#address-cells = <1>;
diff --git a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c 
b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
index ee7e9ce..418ca1f3 100644
--- a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
+++ b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
@@ -1316,10 +1316,11 @@ static int hix5hd2_dev_remove(struct platform_device 
*pdev)
 }
 
 static const struct of_device_id hix5hd2_of_match[] = {
-   { .compatible = "hisilicon,hisi-gemac-v1", .data = (void *)GEMAC_V1 },
-   { .compatible = "hisilicon,hisi-gemac-v2", .data = (void *)GEMAC_V2 },
-   { .compatible = "hisilicon,hix5hd2-gemac", .data = (void *)GEMAC_V1 },
-   { .compatible = "hisilicon,hi3798cv200-gemac", .data = (void *)GEMAC_V2 
},
+   { .compatible = "hisilicon,hisi-gmac-v1", .data = (void *)GEMAC_V1 },
+   { .compatible = "hisilicon,hisi-gmac-v2", .data = (void *)GEMAC_V2 },
+   { .compatible = "hisilicon,hix5hd2-gmac", .data = (void *)GEMAC_V1 },
+   { .compatible = "hisilicon,hi3798cv200-gmac", .data = (void *)GEMAC_V2 
},
+   { .compatible = "hisilicon,hi3516a-gmac", .data = (void *)GEMAC_V2 },
{},
 };
 
@@ -1327,7 +1328,7 @@ MODULE_DEVICE_TABLE(of, hix5hd2_of_match);
 
 static struct platform_driver hix5hd2_dev_driver = {
.driver = {
-   .name = "hisi-gemac",
+   .name = "hisi-gmac",
.of_match_table = hix5hd2_of_match,
},
.probe = hix5hd2_dev_probe,
@@ -1338,4 +1339,4 @@ module_platform_driver(hix5hd2_dev_driver);
 
 MODULE_DESCRIPTION("HISILICON Gigabit Ethernet MAC driver");
 MODULE_LICENSE("GPL v2");
-MODULE_ALIAS("platform:hisi-gemac");
+MODULE_ALIAS("platform:hisi-gmac");
-- 
2.8.2



[PATCH 2/2] ARM: dts: hix5hd2: don't change the existing compatible string

2016-12-19 Thread Dongpo Li
The SoC hix5hd2 compatible string has the suffix "-gmac" and
we should not change it.
We should only add the generic compatible string "hisi-gmac-v1".

Fixes: 0855950ba580 ("ARM: dts: hix5hd2: add gmac generic compatible and clock 
names")
Signed-off-by: Dongpo Li 
---
 arch/arm/boot/dts/hisi-x5hd2.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/hisi-x5hd2.dtsi 
b/arch/arm/boot/dts/hisi-x5hd2.dtsi
index 0da76c5..2d06f4c 100644
--- a/arch/arm/boot/dts/hisi-x5hd2.dtsi
+++ b/arch/arm/boot/dts/hisi-x5hd2.dtsi
@@ -436,7 +436,7 @@
};
 
gmac0: ethernet@184 {
-   compatible = "hisilicon,hix5hd2-gemac", 
"hisilicon,hisi-gemac-v1";
+   compatible = "hisilicon,hix5hd2-gmac", 
"hisilicon,hisi-gmac-v1";
reg = <0x184 0x1000>,<0x184300c 0x4>;
interrupts = <0 71 4>;
clocks = < HIX5HD2_MAC0_CLK>;
@@ -445,7 +445,7 @@
};
 
gmac1: ethernet@1841000 {
-   compatible = "hisilicon,hix5hd2-gemac", 
"hisilicon,hisi-gemac-v1";
+   compatible = "hisilicon,hix5hd2-gmac", 
"hisilicon,hisi-gmac-v1";
reg = <0x1841000 0x1000>,<0x1843010 0x4>;
interrupts = <0 72 4>;
clocks = < HIX5HD2_MAC1_CLK>;
-- 
2.8.2



[PATCH 0/2] net: hix5hd2_gmac: keep the compatible string not changed

2016-12-19 Thread Dongpo Li
This patch series fix the patch:
d0fb6ba75dc0 ("net: hix5hd2_gmac: add generic compatible string")

The SoC hix5hd2 compatible string has the suffix "-gmac" and
we should not change its compatible string.
So we should name all the compatible string with the suffix "-gmac".
Creating a new name suffix "-gemac" is unnecessary.

Dongpo Li (2):
  net: hix5hd2_gmac: fix compatible strings name
  ARM: dts: hix5hd2: don't change the existing compatible string

 .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt  | 13 -
 arch/arm/boot/dts/hisi-x5hd2.dtsi   |  4 ++--
 drivers/net/ethernet/hisilicon/hix5hd2_gmac.c   | 13 +++--
 3 files changed, 17 insertions(+), 13 deletions(-)

-- 
2.8.2



[PATCH 0/2] net: hix5hd2_gmac: keep the compatible string not changed

2016-12-19 Thread Dongpo Li
This patch series fix the patch:
d0fb6ba75dc0 ("net: hix5hd2_gmac: add generic compatible string")

The SoC hix5hd2 compatible string has the suffix "-gmac" and
we should not change its compatible string.
So we should name all the compatible string with the suffix "-gmac".
Creating a new name suffix "-gemac" is unnecessary.

Dongpo Li (2):
  net: hix5hd2_gmac: fix compatible strings name
  ARM: dts: hix5hd2: don't change the existing compatible string

 .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt  | 13 -
 arch/arm/boot/dts/hisi-x5hd2.dtsi   |  4 ++--
 drivers/net/ethernet/hisilicon/hix5hd2_gmac.c   | 13 +++--
 3 files changed, 17 insertions(+), 13 deletions(-)

-- 
2.8.2



Re: OOM: Better, but still there on

2016-12-19 Thread Nils Holland
On Mon, Dec 19, 2016 at 02:45:34PM +0100, Michal Hocko wrote:

> Unfortunatelly shrink_active_list doesn't have any tracepoint so we do
> not know whether we managed to rotate those pages. If they are referenced
> quickly enough we might just keep refaulting them... Could you try to apply
> the followin diff on top what you have currently. It should add some more
> tracepoint data which might tell us more. We can reduce the amount of
> tracing data by enabling only mm_vmscan_lru_isolate,
> mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active.

So, the results are in! I applied your patch and rebuild the kernel,
then I rebooted the machine, set up tracing so that only the three
events you mentioned were being traced, and captured the output over
the network.

Things went a bit different this time: The trace events started to
appear after a while and a whole lot of them were generated, but
suddenly they stopped. A short while later, we get

[ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, order:0, 
mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)

along with a backtrace and memory information, and then there was
silence. When I walked up to the machine, it had completely died; it
wouldn't turn on its screen on key press any more, blindly trying to
reboot via SysRequest had no effect, but the caps lock LED also wasn't
blinking, like it normally does when a kernel panic occurs. Good
question what state it was in. The OOM reaper didn't really seem to
kick in and kill processes this time, it seems.

The complete capture is up at:

http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz

Greetings
Nils


Re: OOM: Better, but still there on

2016-12-19 Thread Nils Holland
On Mon, Dec 19, 2016 at 02:45:34PM +0100, Michal Hocko wrote:

> Unfortunatelly shrink_active_list doesn't have any tracepoint so we do
> not know whether we managed to rotate those pages. If they are referenced
> quickly enough we might just keep refaulting them... Could you try to apply
> the followin diff on top what you have currently. It should add some more
> tracepoint data which might tell us more. We can reduce the amount of
> tracing data by enabling only mm_vmscan_lru_isolate,
> mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active.

So, the results are in! I applied your patch and rebuild the kernel,
then I rebooted the machine, set up tracing so that only the three
events you mentioned were being traced, and captured the output over
the network.

Things went a bit different this time: The trace events started to
appear after a while and a whole lot of them were generated, but
suddenly they stopped. A short while later, we get

[ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, order:0, 
mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)

along with a backtrace and memory information, and then there was
silence. When I walked up to the machine, it had completely died; it
wouldn't turn on its screen on key press any more, blindly trying to
reboot via SysRequest had no effect, but the caps lock LED also wasn't
blinking, like it normally does when a kernel panic occurs. Good
question what state it was in. The OOM reaper didn't really seem to
kick in and kill processes this time, it seems.

The complete capture is up at:

http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz

Greetings
Nils


Re: [PATCH] block: loose check on sg gap

2016-12-19 Thread Ming Lei
On Sun, Dec 18, 2016 at 12:49 AM, Jens Axboe  wrote:
> On 12/17/2016 03:49 AM, Ming Lei wrote:
>> If the last bvec of the 1st bio and the 1st bvec of the next
>> bio are contineous physically, and the latter can be merged
>> to last segment of the 1st bio, we should think they don't
>> violate sg gap(or virt boundary) limit.
>>
>> Both Vitaly and Dexuan reported lots of unmergeable small bios
>> are observed when running mkfs on Hyper-V virtual storage, and
>> performance becomes quite low, so this patch is figured out for
>> fixing the performance issue.
>>
>> The same issue should exist on NVMe too sine it sets virt boundary too.
>
> It looks pretty reasonable to me. I'll queue it up for some testing,
> changes like this always make me a little nervous.

Understood.

But given it is still in early stage of 4.10 cycle, seems fine to expose
it now, and we should have enough time to fix it if there might be
regressions.

BTW, it passes my xfstest(ext4) over sata/NVMe.

Thanks,
Ming


Re: [PATCH] block: loose check on sg gap

2016-12-19 Thread Ming Lei
On Sun, Dec 18, 2016 at 12:49 AM, Jens Axboe  wrote:
> On 12/17/2016 03:49 AM, Ming Lei wrote:
>> If the last bvec of the 1st bio and the 1st bvec of the next
>> bio are contineous physically, and the latter can be merged
>> to last segment of the 1st bio, we should think they don't
>> violate sg gap(or virt boundary) limit.
>>
>> Both Vitaly and Dexuan reported lots of unmergeable small bios
>> are observed when running mkfs on Hyper-V virtual storage, and
>> performance becomes quite low, so this patch is figured out for
>> fixing the performance issue.
>>
>> The same issue should exist on NVMe too sine it sets virt boundary too.
>
> It looks pretty reasonable to me. I'll queue it up for some testing,
> changes like this always make me a little nervous.

Understood.

But given it is still in early stage of 4.10 cycle, seems fine to expose
it now, and we should have enough time to fix it if there might be
regressions.

BTW, it passes my xfstest(ext4) over sata/NVMe.

Thanks,
Ming


[PATCH] scsi: do not requeue requests unaligned with device sector size

2016-12-19 Thread Mauricio Faria de Oliveira
When a SCSI command (e.g., read operation) is partially completed
with good status and residual bytes (i.e., not all the bytes from
the specified transfer length were transferred) the SCSI midlayer
will update the request/bios with the completed bytes and requeue
the request in order to complete the remainder/pending bytes.

However, when the device sector size is greater than the 512-byte
default/kernel sector size, alignment restrictions and validation
apply (both to the starting logical block address, and the number
of logical blocks to transfer) -- values must be multiples of the
device sector size, otherwise the kernel fails the request in the
preparation stage (e.g., sd_setup_read_write_cmnd() at sd.c file):

  [...] sd 0:0:0:0: [sda] Bad block number requested

Hence, this error message (and the respective failed request) can
be observed on devices with larger sector sizes which may respond
the SCSI command with a SCSI residual size that is unaligned with
the device sector size -- because the requeued request's starting
logical block and number of logical blocks are based on the value
of the remainder/pending bytes.

In order to address this problem, introduce a check for this case
in scsi_io_completion() (before it calls scsi_end_request() which
in turn calls blk_update_request() which is the site that changes
the request's __sector and __data_len fields, which are validated
by sd_setup_read_write_cmnd()).

This check verifies whether there is any residual/remainder bytes
in the (potentially partially) completed requested and calculates
the correctly aligned values for the number of completed bytes to
pass up to scsi_end_request()/blk_update_request() that guarantee
that the requeued request is aligned with the device sector size.

The corner case is when one sector is requested and the response
is partially complete, for which the remainder/pending bytes are
unaligned and no further request would be valid.  On such a case,
the original request is retried after a delay (in case the error
is hopefully due to a temporary condition in the device), but up
to the retry limit (in case the condition is permanent, e.g. bad
sector in the medium), after which the request is finally failed.

In order to reproduce and verify this problem, the virtio_scsi.c
file can be modified to respond to 3 'special' SCSI commands, on
which partial completions are introduced (described in the patch).

This is the guest's disk test image and libvirt XML snippets for
the 4k-sector disk using the virtio scsi driver:

# qemu-img create -f qcow2 /var/lib/libvirt/images/test.qcow2 128G


  
  
  
  
  



  


And the verification in the guest:

# cat /sys/block/sda/queue/physical_block_size
4096

# cat /sys/block/sda/queue/hw_sector_size
4096

This is the patch to virtio_scsi.c (lines prefixed with ' ___ '):

 ___ --- a/drivers/scsi/virtio_scsi.c
 ___ +++ b/drivers/scsi/virtio_scsi.c
 ___ @@ -153,11 +153,45 @@
 ___struct virtio_scsi_cmd_resp *resp = >resp.cmd;
 ___struct virtio_scsi_target_state *tgt =
 ___scsi_target(sc->device)->hostdata;
 ___ +  static int debug_failures = 0;
 ___
 ___dev_dbg(>device->sdev_gendev,
 ___"cmd %p response %u status %#02x sense_len %u\n",
 ___sc, resp->response, resp->status, resp->sense_len);
 ___
 ___ +  // DEBUG: filter this CDB for testing purposes.
 ___ +  // CDB: Read(10) 28 00 01 02 03 xx 00 00 yy 00
 ___ +  // (xx: LSB of the LBA, and yy: LSB of the LEN)
 ___ +  if ((sc->cmnd[0] == 0x28) && (sc->cmnd[1] == 0x00)
 ___ +  &&  (sc->cmnd[2] == 0x01) && (sc->cmnd[3] == 0x02)
 ___ +  &&  (sc->cmnd[4] == 0x03) && (sc->cmnd[6] == 0x00)
 ___ +  &&  (sc->cmnd[7] == 0x00) && (sc->cmnd[9] == 0x00)) {
 ___ +
 ___ +  // Test 1:
 ___ +  // - LBA: 01 02 03 _04_
 ___ +  // - LEN: two sectors (2 * 4k = 8k)
 ___ +  // - Action: complete 5k out of 8k (3k residual)
 ___ +  if ((sc->cmnd[5] == 0x04) && (sc->cmnd[8] == 0x02))
 ___ +  resp->resid = 6 * 512;
 ___ +
 ___ +  // Test 2:
 ___ +  // - LBA: 01 02 03 _04_
 ___ +  // - LEN: one sector (1 * 4k = 4k)
 ___ +  // - Action: complete 3k out of 4k (1k residual)
 ___ +  //   always.
 ___ +  if ((sc->cmnd[5] == 0x04) && (sc->cmnd[8] == 0x01))
 ___ +  resp->resid = 2 * 512;
 ___ +
 ___ +  // Test 3:
 ___ +  // - LBA: 01 02 03 _08_
 ___ +  // - LEN: one sector (1 * 4k = 4k)
 ___ +  // - Action: complete 3k out of 4k (1k residual)
 ___ +  //   but on every 4th attempt (complete 4k)
 ___ +  if ((sc->cmnd[5] == 0x08) &&  (sc->cmnd[8] == 0x01)
 ___ +  &&  (++debug_failures % 4 != 0))
 ___ +  resp->resid = 2 * 512;
 ___ +  }
 ___ +
 ___sc->result = resp->status;
 ___virtscsi_compute_resid(sc, virtio32_to_cpu(vscsi->vdev,
 ___ 

[PATCH] scsi: do not requeue requests unaligned with device sector size

2016-12-19 Thread Mauricio Faria de Oliveira
When a SCSI command (e.g., read operation) is partially completed
with good status and residual bytes (i.e., not all the bytes from
the specified transfer length were transferred) the SCSI midlayer
will update the request/bios with the completed bytes and requeue
the request in order to complete the remainder/pending bytes.

However, when the device sector size is greater than the 512-byte
default/kernel sector size, alignment restrictions and validation
apply (both to the starting logical block address, and the number
of logical blocks to transfer) -- values must be multiples of the
device sector size, otherwise the kernel fails the request in the
preparation stage (e.g., sd_setup_read_write_cmnd() at sd.c file):

  [...] sd 0:0:0:0: [sda] Bad block number requested

Hence, this error message (and the respective failed request) can
be observed on devices with larger sector sizes which may respond
the SCSI command with a SCSI residual size that is unaligned with
the device sector size -- because the requeued request's starting
logical block and number of logical blocks are based on the value
of the remainder/pending bytes.

In order to address this problem, introduce a check for this case
in scsi_io_completion() (before it calls scsi_end_request() which
in turn calls blk_update_request() which is the site that changes
the request's __sector and __data_len fields, which are validated
by sd_setup_read_write_cmnd()).

This check verifies whether there is any residual/remainder bytes
in the (potentially partially) completed requested and calculates
the correctly aligned values for the number of completed bytes to
pass up to scsi_end_request()/blk_update_request() that guarantee
that the requeued request is aligned with the device sector size.

The corner case is when one sector is requested and the response
is partially complete, for which the remainder/pending bytes are
unaligned and no further request would be valid.  On such a case,
the original request is retried after a delay (in case the error
is hopefully due to a temporary condition in the device), but up
to the retry limit (in case the condition is permanent, e.g. bad
sector in the medium), after which the request is finally failed.

In order to reproduce and verify this problem, the virtio_scsi.c
file can be modified to respond to 3 'special' SCSI commands, on
which partial completions are introduced (described in the patch).

This is the guest's disk test image and libvirt XML snippets for
the 4k-sector disk using the virtio scsi driver:

# qemu-img create -f qcow2 /var/lib/libvirt/images/test.qcow2 128G


  
  
  
  
  



  


And the verification in the guest:

# cat /sys/block/sda/queue/physical_block_size
4096

# cat /sys/block/sda/queue/hw_sector_size
4096

This is the patch to virtio_scsi.c (lines prefixed with ' ___ '):

 ___ --- a/drivers/scsi/virtio_scsi.c
 ___ +++ b/drivers/scsi/virtio_scsi.c
 ___ @@ -153,11 +153,45 @@
 ___struct virtio_scsi_cmd_resp *resp = >resp.cmd;
 ___struct virtio_scsi_target_state *tgt =
 ___scsi_target(sc->device)->hostdata;
 ___ +  static int debug_failures = 0;
 ___
 ___dev_dbg(>device->sdev_gendev,
 ___"cmd %p response %u status %#02x sense_len %u\n",
 ___sc, resp->response, resp->status, resp->sense_len);
 ___
 ___ +  // DEBUG: filter this CDB for testing purposes.
 ___ +  // CDB: Read(10) 28 00 01 02 03 xx 00 00 yy 00
 ___ +  // (xx: LSB of the LBA, and yy: LSB of the LEN)
 ___ +  if ((sc->cmnd[0] == 0x28) && (sc->cmnd[1] == 0x00)
 ___ +  &&  (sc->cmnd[2] == 0x01) && (sc->cmnd[3] == 0x02)
 ___ +  &&  (sc->cmnd[4] == 0x03) && (sc->cmnd[6] == 0x00)
 ___ +  &&  (sc->cmnd[7] == 0x00) && (sc->cmnd[9] == 0x00)) {
 ___ +
 ___ +  // Test 1:
 ___ +  // - LBA: 01 02 03 _04_
 ___ +  // - LEN: two sectors (2 * 4k = 8k)
 ___ +  // - Action: complete 5k out of 8k (3k residual)
 ___ +  if ((sc->cmnd[5] == 0x04) && (sc->cmnd[8] == 0x02))
 ___ +  resp->resid = 6 * 512;
 ___ +
 ___ +  // Test 2:
 ___ +  // - LBA: 01 02 03 _04_
 ___ +  // - LEN: one sector (1 * 4k = 4k)
 ___ +  // - Action: complete 3k out of 4k (1k residual)
 ___ +  //   always.
 ___ +  if ((sc->cmnd[5] == 0x04) && (sc->cmnd[8] == 0x01))
 ___ +  resp->resid = 2 * 512;
 ___ +
 ___ +  // Test 3:
 ___ +  // - LBA: 01 02 03 _08_
 ___ +  // - LEN: one sector (1 * 4k = 4k)
 ___ +  // - Action: complete 3k out of 4k (1k residual)
 ___ +  //   but on every 4th attempt (complete 4k)
 ___ +  if ((sc->cmnd[5] == 0x08) &&  (sc->cmnd[8] == 0x01)
 ___ +  &&  (++debug_failures % 4 != 0))
 ___ +  resp->resid = 2 * 512;
 ___ +  }
 ___ +
 ___sc->result = resp->status;
 ___virtscsi_compute_resid(sc, virtio32_to_cpu(vscsi->vdev,
 ___ 

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski
On Mon, Dec 19, 2016 at 5:44 PM, David Ahern  wrote:
> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>> net.socket_create_filter = "none": no filter
>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>> net.socket_create_filter = "disallow": no sockets created period
>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>
> Such a scheme works for the socket create filter b/c it is a very simple use 
> case. It does not work for the ingress and egress which allow generic bpf 
> filters.

Can you elaborate on what goes wrong?  (Obviously the
"address_family_list" example makes no sense in that context.)

>
> ...
>
>>> you're ignoring use cases I described earlier.
>>> In vrf case there is only one ifindex it needs to bind to.
>>
>> I'm totally lost.  Can you explain what this has to do with the cgroup
>> hierarchy?
>
> I think the point is that a group hierarchy makes no sense for the VRF use 
> case. What I put into iproute2 is
>
> cgrp2/vrf/NAME
>
> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 
> sockets to a specific device index. cgrp2/vrf is the "default" vrf and does 
> not have a filter. A user can certainly add another layer 
> cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not 
> make sense.

I tend to agree.  I still think that the mechanism as it stands is
broken in other respects and should be fixed before it goes live.  I
have no desire to cause problems for the vrf use case.

But keep in mind that the vrf use case is, in Linus' tree, a bit
broken right now in its interactions with other users of the same
mechanism.  Suppose I create a container and want to trace all of its
created sockets.  I'll set up cgrp2/container and load my tracer as a
socket creation hook.  Then a container sets up
cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding
filter.  Now the tracing stops working -- oops.

>
> ...
>
 I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
 have to do it now (or disable the feature for 4.10).  This is why I'm
 bringing this whole thing up now.
>>>
>>> We don't have to touch user visible api here, so extensions are fine.
>>
>> Huh?  My example in the original email attaches a program in a
>> sub-hierarchy.  Are you saying that 4.11 could make that example stop
>> working?
>
> Are you suggesting sub-cgroups should not be allowed to override the filter 
> of a parent cgroup?

Yes, exactly.  I think there are two sensible behaviors:

a) sub-cgroups cannot have a filter at all of the parent has a filter.
(This is the "punt" approach -- it lets different semantics be
assigned later without breaking userspace.)

b) sub-cgroups can have a filter if a parent does, too.  The semantics
are that the sub-cgroup filter runs first and all side-effects occur.
If that filter says "reject" then ancestor filters are skipped.  If
that filter says "accept", then the ancestor filter is run and its
side-effects happen as well.  (And so on, all the way up to the root.)

--Andy


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski
On Mon, Dec 19, 2016 at 5:44 PM, David Ahern  wrote:
> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>> net.socket_create_filter = "none": no filter
>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>> net.socket_create_filter = "disallow": no sockets created period
>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>
> Such a scheme works for the socket create filter b/c it is a very simple use 
> case. It does not work for the ingress and egress which allow generic bpf 
> filters.

Can you elaborate on what goes wrong?  (Obviously the
"address_family_list" example makes no sense in that context.)

>
> ...
>
>>> you're ignoring use cases I described earlier.
>>> In vrf case there is only one ifindex it needs to bind to.
>>
>> I'm totally lost.  Can you explain what this has to do with the cgroup
>> hierarchy?
>
> I think the point is that a group hierarchy makes no sense for the VRF use 
> case. What I put into iproute2 is
>
> cgrp2/vrf/NAME
>
> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 
> sockets to a specific device index. cgrp2/vrf is the "default" vrf and does 
> not have a filter. A user can certainly add another layer 
> cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not 
> make sense.

I tend to agree.  I still think that the mechanism as it stands is
broken in other respects and should be fixed before it goes live.  I
have no desire to cause problems for the vrf use case.

But keep in mind that the vrf use case is, in Linus' tree, a bit
broken right now in its interactions with other users of the same
mechanism.  Suppose I create a container and want to trace all of its
created sockets.  I'll set up cgrp2/container and load my tracer as a
socket creation hook.  Then a container sets up
cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding
filter.  Now the tracing stops working -- oops.

>
> ...
>
 I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
 have to do it now (or disable the feature for 4.10).  This is why I'm
 bringing this whole thing up now.
>>>
>>> We don't have to touch user visible api here, so extensions are fine.
>>
>> Huh?  My example in the original email attaches a program in a
>> sub-hierarchy.  Are you saying that 4.11 could make that example stop
>> working?
>
> Are you suggesting sub-cgroups should not be allowed to override the filter 
> of a parent cgroup?

Yes, exactly.  I think there are two sensible behaviors:

a) sub-cgroups cannot have a filter at all of the parent has a filter.
(This is the "punt" approach -- it lets different semantics be
assigned later without breaking userspace.)

b) sub-cgroups can have a filter if a parent does, too.  The semantics
are that the sub-cgroup filter runs first and all side-effects occur.
If that filter says "reject" then ancestor filters are skipped.  If
that filter says "accept", then the ancestor filter is run and its
side-effects happen as well.  (And so on, all the way up to the root.)

--Andy


linux-next: Tree for Dec 20

2016-12-19 Thread Stephen Rothwell
Hi all,

Please do not add any material for v4.11 to your linux-next included
branches until after v4.10-rc1 has been released.

Changes since 20161219:

The kvm tree lost its build failure.

Non-merge commits (relative to Linus' tree): 566
 1073 files changed, 26213 insertions(+), 8676 deletions(-)



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log
files in the Next directory.  Between each merge, the tree was built
with a ppc64_defconfig for powerpc and an allmodconfig (with
CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a
native build of tools/perf. After the final fixups (if any), I do an
x86_64 modules_install followed by builds for x86_64 allnoconfig,
powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig
(with KALLSYMS_EXTRA_PASS=1) and pseries_le_defconfig and i386, sparc
and sparc64 defconfig.

Below is a summary of the state of the merge.

I am currently merging 246 trees (counting Linus' and 35 trees of bug
fix patches pending for the current merge release).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (e93b1cc8a896 Merge branch 'for_linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs)
Merging fixes/master (30066ce675d3 Merge branch 'linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6)
Merging kbuild-current/rc-fixes (152b695d7437 builddeb: fix cross-building to 
arm64 producing host-arch debs)
Merging arc-current/for-curr (08fe007968b2 ARC: mm: arc700: Don't assume 2 
colours for aliasing VIPT dcache)
Merging arm-current/fixes (8478132a8784 Revert "arm: move exports to 
definitions")
Merging m68k-current/for-linus (7e251bb21ae0 m68k: Fix ndelay() macro)
Merging metag-fixes/fixes (35d04077ad96 metag: Only define 
atomic_dec_if_positive conditionally)
Merging powerpc-fixes/fixes (69973b830859 Linux 4.9)
Merging sparc/master (e93b1cc8a896 Merge branch 'for_linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs)
Merging net/master (ad688cdbb076 stmmac: fix memory barriers)
Merging ipsec/master (bc3913a5378c Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc)
Merging netfilter/master (a220871be66f virtio-net: correctly enable multiqueue)
Merging ipvs/master (045169816b31 Merge branch 'linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6)
Merging wireless-drivers/master (fcd2042e8d36 mwifiex: printk() overflow with 
32-byte SSIDs)
Merging mac80211/master (a17d93ff3a95 mac80211: fix legacy and invalid rx-rate 
report)
Merging sound-current/for-linus (995c6a7fd9b9 ALSA: hiface: Fix M2Tech hiFace 
driver sampling rate change)
Merging pci-current/for-linus (e42010d8207f PCI: Set Read Completion Boundary 
to 128 iff Root Port supports it (_HPX))
Merging driver-core.current/driver-core-linus (cdb98c2698b4 Revert "nvme: add 
support for the Write Zeroes command")
Merging tty.current/tty-linus (cdb98c2698b4 Revert "nvme: add support for the 
Write Zeroes command")
Merging usb.current/usb-linus (cdb98c2698b4 Revert "nvme: add support for the 
Write Zeroes command")
Merging usb-gadget-fixes/fixes (05e78c6933d6 usb: gadget: f_fs: fix wrong 
parenthesis in ffs_func_req_match())
Merging usb-serial-fixes/usb-linus (46490c347df4 USB: serial: option: add dlink 
dwm-158)
Merging usb-chipidea-fixes/ci-for-usb-stable (c7fbb09b2ea1 usb: chipidea: move 
the lock initialization to core file)
Merging phy/fixes (4320f9d4c183 phy: sun4i: check PMU presence when poking 
unknown bit of pmu)
Merging staging.current/staging-linus (cdb98c2698b4 Revert "nvme: add support 
for the Write Zeroes command")
Merging char-misc.current/char-misc-linus (cdb98c2698b4 Revert "nvme: add 
support for the Write Zeroes command")
Merging input-current/for-linus (67626c932302 Input: synaptics_i2c - change 
msleep to usleep_range for small msecs)
Merging crypto-current/master (8759fec4af22 crypto: marvell - Copy IVDIG before 
lau

linux-next: Tree for Dec 20

2016-12-19 Thread Stephen Rothwell
Hi all,

Please do not add any material for v4.11 to your linux-next included
branches until after v4.10-rc1 has been released.

Changes since 20161219:

The kvm tree lost its build failure.

Non-merge commits (relative to Linus' tree): 566
 1073 files changed, 26213 insertions(+), 8676 deletions(-)



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log
files in the Next directory.  Between each merge, the tree was built
with a ppc64_defconfig for powerpc and an allmodconfig (with
CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a
native build of tools/perf. After the final fixups (if any), I do an
x86_64 modules_install followed by builds for x86_64 allnoconfig,
powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig
(with KALLSYMS_EXTRA_PASS=1) and pseries_le_defconfig and i386, sparc
and sparc64 defconfig.

Below is a summary of the state of the merge.

I am currently merging 246 trees (counting Linus' and 35 trees of bug
fix patches pending for the current merge release).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (e93b1cc8a896 Merge branch 'for_linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs)
Merging fixes/master (30066ce675d3 Merge branch 'linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6)
Merging kbuild-current/rc-fixes (152b695d7437 builddeb: fix cross-building to 
arm64 producing host-arch debs)
Merging arc-current/for-curr (08fe007968b2 ARC: mm: arc700: Don't assume 2 
colours for aliasing VIPT dcache)
Merging arm-current/fixes (8478132a8784 Revert "arm: move exports to 
definitions")
Merging m68k-current/for-linus (7e251bb21ae0 m68k: Fix ndelay() macro)
Merging metag-fixes/fixes (35d04077ad96 metag: Only define 
atomic_dec_if_positive conditionally)
Merging powerpc-fixes/fixes (69973b830859 Linux 4.9)
Merging sparc/master (e93b1cc8a896 Merge branch 'for_linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs)
Merging net/master (ad688cdbb076 stmmac: fix memory barriers)
Merging ipsec/master (bc3913a5378c Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc)
Merging netfilter/master (a220871be66f virtio-net: correctly enable multiqueue)
Merging ipvs/master (045169816b31 Merge branch 'linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6)
Merging wireless-drivers/master (fcd2042e8d36 mwifiex: printk() overflow with 
32-byte SSIDs)
Merging mac80211/master (a17d93ff3a95 mac80211: fix legacy and invalid rx-rate 
report)
Merging sound-current/for-linus (995c6a7fd9b9 ALSA: hiface: Fix M2Tech hiFace 
driver sampling rate change)
Merging pci-current/for-linus (e42010d8207f PCI: Set Read Completion Boundary 
to 128 iff Root Port supports it (_HPX))
Merging driver-core.current/driver-core-linus (cdb98c2698b4 Revert "nvme: add 
support for the Write Zeroes command")
Merging tty.current/tty-linus (cdb98c2698b4 Revert "nvme: add support for the 
Write Zeroes command")
Merging usb.current/usb-linus (cdb98c2698b4 Revert "nvme: add support for the 
Write Zeroes command")
Merging usb-gadget-fixes/fixes (05e78c6933d6 usb: gadget: f_fs: fix wrong 
parenthesis in ffs_func_req_match())
Merging usb-serial-fixes/usb-linus (46490c347df4 USB: serial: option: add dlink 
dwm-158)
Merging usb-chipidea-fixes/ci-for-usb-stable (c7fbb09b2ea1 usb: chipidea: move 
the lock initialization to core file)
Merging phy/fixes (4320f9d4c183 phy: sun4i: check PMU presence when poking 
unknown bit of pmu)
Merging staging.current/staging-linus (cdb98c2698b4 Revert "nvme: add support 
for the Write Zeroes command")
Merging char-misc.current/char-misc-linus (cdb98c2698b4 Revert "nvme: add 
support for the Write Zeroes command")
Merging input-current/for-linus (67626c932302 Input: synaptics_i2c - change 
msleep to usleep_range for small msecs)
Merging crypto-current/master (8759fec4af22 crypto: marvell - Copy IVDIG before 
lau

RE: [PATCH V4 08/11] megaraid_sas: Enable or Disable Fast path based on the PCI Threshold Bandwidth

2016-12-19 Thread Sasikumar PC
Hi Tomas,

Please see my response inline

Thanks
sasi

-Original Message-
From: Tomas Henzl [mailto:the...@redhat.com]
Sent: Thursday, December 15, 2016 10:10 AM
To: Sasikumar PC; j...@kernel.org; h...@infradead.org
Cc: linux-s...@vger.kernel.org; Sathya Prakash Veerichetty;
linux-kernel@vger.kernel.org; Christopher Owens; Kiran Kumar Kasturi
Subject: Re: [PATCH V4 08/11] megaraid_sas: Enable or Disable Fast path
based on the PCI Threshold Bandwidth

On 14.12.2016 22:54, Sasikumar PC wrote:
> Hi Tomas,
>
> Please see my response inline
>
> Thanks
> sasi
>
> -Original Message-
> From: Tomas Henzl [mailto:the...@redhat.com]
> Sent: Friday, December 09, 2016 8:59 AM
> To: Sasikumar Chandrasekaran; j...@kernel.org; h...@infradead.org
> Cc: linux-s...@vger.kernel.org; sathya.prak...@broadcom.com;
> linux-kernel@vger.kernel.org; christopher.ow...@broadcom.com;
> kiran-kumar.kast...@broadcom.com
> Subject: Re: [PATCH V4 08/11] megaraid_sas: Enable or Disable Fast
> path based on the PCI Threshold Bandwidth
>
> On 7.12.2016 00:00, Sasikumar Chandrasekaran wrote:
>> Large SEQ IO workload should sent as non fast path commands
>>
>> This patch is depending on patch 7
>>
>> Signed-off-by: Sasikumar Chandrasekaran 
>> ---
>>  drivers/scsi/megaraid/megaraid_sas.h|  8 +
>>  drivers/scsi/megaraid/megaraid_sas_base.c   | 48
> +
>>  drivers/scsi/megaraid/megaraid_sas_fp.c | 11 +--
>>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 20 +++-
>> drivers/scsi/megaraid/megaraid_sas_fusion.h |  2 +-
>>  5 files changed, 78 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/scsi/megaraid/megaraid_sas.h
>> b/drivers/scsi/megaraid/megaraid_sas.h
>> index 3e087ab..eb5be2b 100644
>> --- a/drivers/scsi/megaraid/megaraid_sas.h
>> +++ b/drivers/scsi/megaraid/megaraid_sas.h
>> @@ -1429,6 +1429,8 @@ enum FW_BOOT_CONTEXT {
>>  #define MFI_1068_FW_HANDSHAKE_OFFSET0x64
>>  #define MFI_1068_FW_READY   0x
>>
>> +#define MEGASAS_RAID1_FAST_PATH_STATUS_CHECK_INTERVAL HZ
>> +
>>  #define MR_MAX_REPLY_QUEUES_OFFSET  0X001F
>>  #define MR_MAX_REPLY_QUEUES_EXT_OFFSET  0X003FC000
>>  #define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT14
>> @@ -2101,6 +2103,10 @@ struct megasas_instance {
>>  atomic_t ldio_outstanding;
>>  atomic_t fw_reset_no_pci_access;
>>
>> +atomic64_t bytes_wrote; /* used for raid1 fast path enable or
> disable */
>> +atomic_t r1_write_fp_capable;
> Is a an atomic variable needed for a just boolean variable?
> Sasi - As we need to synchronize timer thread and IO issue threads,
> With atomic, at any point of time the value will be definitive.
> With boolean, it depends, the value could be in transit while one
> thread is reading and other is writing.

This explanation is I think not good enough, as a write of a char value is
atomic on all architectures. If you need synchronisation you probably need a
spinlock.
Tomash

boolean may not be a char in all architectures/implementations. It could be
implementation specific isn't it ?
Spin_Lock is heavier as the check is in IO path.We need it to be consistent
non-transient value, not an exact synchronization.
There could be more values that we may set this variable to, to make
different decisions and value can be set in more places in future.
Atomic will help it keep consistent and extensible.

sasi

>
>> +
>> +
>>  struct megasas_instance_template *instancet;
>>  struct tasklet_struct isr_tasklet;
>>  struct work_struct work_init;
>> @@ -2143,6 +2149,7 @@ struct megasas_instance {
>>  long reset_flags;
>>  struct mutex reset_mutex;
>>  struct timer_list sriov_heartbeat_timer;
>> +struct timer_list r1_fp_hold_timer;
>>  char skip_heartbeat_timer_del;
>>  u8 requestorId;
>>  char PlasmaFW111;
>> @@ -2159,6 +2166,7 @@ struct megasas_instance {
>>  bool is_ventura;
>>  bool msix_combined;
>>  u16 max_raid_mapsize;
>> +u64 pci_threshold_bandwidth; /* used to control the fp writes */
>>  };
>>  struct MR_LD_VF_MAP {
>>  u32 size;
>> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
>> b/drivers/scsi/megaraid/megaraid_sas_base.c
>> index 8aafb59..899d25c 100644
>> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
>> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
>> @@ -1940,6 +1940,9 @@ void megaraid_sas_kill_hba(struct
>> megasas_instance
> *instance)
>>  }
>>  /* Complete outstanding ioctls when adapter is killed */
>>  megasas_complete_outstanding_ioctls(instance);
>> +if (instance->is_ventura)
>> +del_timer_sync(>r1_fp_hold_timer);
>> +
>>  }
>>
>>   /**
>> @@ -2438,6 +2441,24 @@ void megasas_sriov_heartbeat_handler(unsigned
> long instance_addr)
>>  }
>>  }
>>
>> +/*Handler for disabling/enabling raid 1 fast paths*/ void
>> +megasas_change_r1_fp_status(unsigned long instance_addr) {
>> +struct megasas_instance 

RE: [PATCH V4 08/11] megaraid_sas: Enable or Disable Fast path based on the PCI Threshold Bandwidth

2016-12-19 Thread Sasikumar PC
Hi Tomas,

Please see my response inline

Thanks
sasi

-Original Message-
From: Tomas Henzl [mailto:the...@redhat.com]
Sent: Thursday, December 15, 2016 10:10 AM
To: Sasikumar PC; j...@kernel.org; h...@infradead.org
Cc: linux-s...@vger.kernel.org; Sathya Prakash Veerichetty;
linux-kernel@vger.kernel.org; Christopher Owens; Kiran Kumar Kasturi
Subject: Re: [PATCH V4 08/11] megaraid_sas: Enable or Disable Fast path
based on the PCI Threshold Bandwidth

On 14.12.2016 22:54, Sasikumar PC wrote:
> Hi Tomas,
>
> Please see my response inline
>
> Thanks
> sasi
>
> -Original Message-
> From: Tomas Henzl [mailto:the...@redhat.com]
> Sent: Friday, December 09, 2016 8:59 AM
> To: Sasikumar Chandrasekaran; j...@kernel.org; h...@infradead.org
> Cc: linux-s...@vger.kernel.org; sathya.prak...@broadcom.com;
> linux-kernel@vger.kernel.org; christopher.ow...@broadcom.com;
> kiran-kumar.kast...@broadcom.com
> Subject: Re: [PATCH V4 08/11] megaraid_sas: Enable or Disable Fast
> path based on the PCI Threshold Bandwidth
>
> On 7.12.2016 00:00, Sasikumar Chandrasekaran wrote:
>> Large SEQ IO workload should sent as non fast path commands
>>
>> This patch is depending on patch 7
>>
>> Signed-off-by: Sasikumar Chandrasekaran 
>> ---
>>  drivers/scsi/megaraid/megaraid_sas.h|  8 +
>>  drivers/scsi/megaraid/megaraid_sas_base.c   | 48
> +
>>  drivers/scsi/megaraid/megaraid_sas_fp.c | 11 +--
>>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 20 +++-
>> drivers/scsi/megaraid/megaraid_sas_fusion.h |  2 +-
>>  5 files changed, 78 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/scsi/megaraid/megaraid_sas.h
>> b/drivers/scsi/megaraid/megaraid_sas.h
>> index 3e087ab..eb5be2b 100644
>> --- a/drivers/scsi/megaraid/megaraid_sas.h
>> +++ b/drivers/scsi/megaraid/megaraid_sas.h
>> @@ -1429,6 +1429,8 @@ enum FW_BOOT_CONTEXT {
>>  #define MFI_1068_FW_HANDSHAKE_OFFSET0x64
>>  #define MFI_1068_FW_READY   0x
>>
>> +#define MEGASAS_RAID1_FAST_PATH_STATUS_CHECK_INTERVAL HZ
>> +
>>  #define MR_MAX_REPLY_QUEUES_OFFSET  0X001F
>>  #define MR_MAX_REPLY_QUEUES_EXT_OFFSET  0X003FC000
>>  #define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT14
>> @@ -2101,6 +2103,10 @@ struct megasas_instance {
>>  atomic_t ldio_outstanding;
>>  atomic_t fw_reset_no_pci_access;
>>
>> +atomic64_t bytes_wrote; /* used for raid1 fast path enable or
> disable */
>> +atomic_t r1_write_fp_capable;
> Is a an atomic variable needed for a just boolean variable?
> Sasi - As we need to synchronize timer thread and IO issue threads,
> With atomic, at any point of time the value will be definitive.
> With boolean, it depends, the value could be in transit while one
> thread is reading and other is writing.

This explanation is I think not good enough, as a write of a char value is
atomic on all architectures. If you need synchronisation you probably need a
spinlock.
Tomash

boolean may not be a char in all architectures/implementations. It could be
implementation specific isn't it ?
Spin_Lock is heavier as the check is in IO path.We need it to be consistent
non-transient value, not an exact synchronization.
There could be more values that we may set this variable to, to make
different decisions and value can be set in more places in future.
Atomic will help it keep consistent and extensible.

sasi

>
>> +
>> +
>>  struct megasas_instance_template *instancet;
>>  struct tasklet_struct isr_tasklet;
>>  struct work_struct work_init;
>> @@ -2143,6 +2149,7 @@ struct megasas_instance {
>>  long reset_flags;
>>  struct mutex reset_mutex;
>>  struct timer_list sriov_heartbeat_timer;
>> +struct timer_list r1_fp_hold_timer;
>>  char skip_heartbeat_timer_del;
>>  u8 requestorId;
>>  char PlasmaFW111;
>> @@ -2159,6 +2166,7 @@ struct megasas_instance {
>>  bool is_ventura;
>>  bool msix_combined;
>>  u16 max_raid_mapsize;
>> +u64 pci_threshold_bandwidth; /* used to control the fp writes */
>>  };
>>  struct MR_LD_VF_MAP {
>>  u32 size;
>> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
>> b/drivers/scsi/megaraid/megaraid_sas_base.c
>> index 8aafb59..899d25c 100644
>> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
>> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
>> @@ -1940,6 +1940,9 @@ void megaraid_sas_kill_hba(struct
>> megasas_instance
> *instance)
>>  }
>>  /* Complete outstanding ioctls when adapter is killed */
>>  megasas_complete_outstanding_ioctls(instance);
>> +if (instance->is_ventura)
>> +del_timer_sync(>r1_fp_hold_timer);
>> +
>>  }
>>
>>   /**
>> @@ -2438,6 +2441,24 @@ void megasas_sriov_heartbeat_handler(unsigned
> long instance_addr)
>>  }
>>  }
>>
>> +/*Handler for disabling/enabling raid 1 fast paths*/ void
>> +megasas_change_r1_fp_status(unsigned long instance_addr) {
>> +struct megasas_instance *instance =
>> +   

RE: [PATCH V4 06/11] megaraid_sas: Dynamic Raid Map Changes for SAS3.5 Generic Megaraid Controllers

2016-12-19 Thread Sasikumar PC
Hi Tomas,

Please see my response inline

Thanks
sasi

-Original Message-
From: Sasikumar PC [mailto:sasikumar...@broadcom.com]
Sent: Wednesday, December 14, 2016 4:49 PM
To: 'Tomas Henzl'; 'j...@kernel.org'; 'h...@infradead.org'
Cc: 'linux-s...@vger.kernel.org'; Sathya Prakash Veerichetty;
'linux-kernel@vger.kernel.org'; Christopher Owens; Kiran Kumar Kasturi
Subject: RE: [PATCH V4 06/11] megaraid_sas: Dynamic Raid Map Changes for
SAS3.5 Generic Megaraid Controllers

Hi Tomas,

Please see my response inline

Thanks
sasi

-Original Message-
From: Tomas Henzl [mailto:the...@redhat.com]
Sent: Friday, December 09, 2016 7:55 AM
To: Sasikumar Chandrasekaran; j...@kernel.org; h...@infradead.org
Cc: linux-s...@vger.kernel.org; sathya.prak...@broadcom.com;
linux-kernel@vger.kernel.org; christopher.ow...@broadcom.com;
kiran-kumar.kast...@broadcom.com
Subject: Re: [PATCH V4 06/11] megaraid_sas: Dynamic Raid Map Changes for
SAS3.5 Generic Megaraid Controllers

On 7.12.2016 00:00, Sasikumar Chandrasekaran wrote:
> SAS3.5 Generic Megaraid Controllers FW will support new dynamic RaidMap
to have different
> sizes for different number of supported VDs.
>
> This patch is depending on patch 5
>
> Signed-off-by: Sasikumar Chandrasekaran 
> ---
>  drivers/scsi/megaraid/megaraid_sas.h|   7 +
>  drivers/scsi/megaraid/megaraid_sas_base.c   |  61 --
>  drivers/scsi/megaraid/megaraid_sas_fp.c | 303

>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 223 
>  drivers/scsi/megaraid/megaraid_sas_fusion.h | 240
++
>  5 files changed, 699 insertions(+), 135 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas.h
b/drivers/scsi/megaraid/megaraid_sas.h
> index f4d6a94..3e087ab 100644
> --- a/drivers/scsi/megaraid/megaraid_sas.h
> +++ b/drivers/scsi/megaraid/megaraid_sas.h
> @@ -1434,6 +1434,12 @@ enum FW_BOOT_CONTEXT {
>  #define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT14
>  #define MR_MAX_MSIX_REG_ARRAY   16
>  #define MR_RDPQ_MODE_OFFSET  0X0080
> +
> +#define MR_MAX_RAID_MAP_SIZE_OFFSET_SHIFT16
> +#define MR_MAX_RAID_MAP_SIZE_MASK0x1FF
> +#define MR_MIN_MAP_SIZE  0x1
> +/* 64k */
> +
>  #define MR_CAN_HANDLE_SYNC_CACHE_OFFSET  0X0100
>
>  /*
> @@ -2152,6 +2158,7 @@ struct megasas_instance {
>   bool fw_sync_cache_support;
>   bool is_ventura;
>   bool msix_combined;
> + u16 max_raid_mapsize;
>  };
>  struct MR_LD_VF_MAP {
>   u32 size;
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
b/drivers/scsi/megaraid/megaraid_sas_base.c
> index c52f7be..3f06b57 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -4424,8 +4424,7 @@ int megasas_alloc_cmds(struct megasas_instance
*instance)
>  static void megasas_update_ext_vd_details(struct megasas_instance
*instance)
>  {
>   struct fusion_context *fusion;
> - u32 old_map_sz;
> - u32 new_map_sz;
> + u32 ventura_map_sz = 0;
>
>   fusion = instance->ctrl_context;
>   /* For MFI based controllers return dummy success */
> @@ -4455,21 +4454,39 @@ static void megasas_update_ext_vd_details(struct
megasas_instance *instance)
>   instance->supportmax256vd ? "Extended VD(240 VD)firmware"
:
>   "Legacy(64 VD) firmware");
>
> - old_map_sz = sizeof(struct MR_FW_RAID_MAP) +
> - (sizeof(struct MR_LD_SPAN_MAP) *
> - (instance->fw_supported_vd_count - 1));
> - new_map_sz = sizeof(struct MR_FW_RAID_MAP_EXT);
> - fusion->drv_map_sz = sizeof(struct MR_DRV_RAID_MAP) +
> - (sizeof(struct MR_LD_SPAN_MAP) *
> - (instance->drv_supported_vd_count - 1));
> -
> - fusion->max_map_sz = max(old_map_sz, new_map_sz);
> + if (instance->max_raid_mapsize) {
> + ventura_map_sz = instance->max_raid_mapsize *
> + MR_MIN_MAP_SIZE; /* 64k */
> + fusion->current_map_sz = ventura_map_sz;
> + fusion->max_map_sz = ventura_map_sz;
> + } else {
> + fusion->old_map_sz =  sizeof(struct MR_FW_RAID_MAP) +
> + (sizeof(struct MR_LD_SPAN_MAP) *
> + (instance->fw_supported_vd_count -
1));
> + fusion->new_map_sz =  sizeof(struct MR_FW_RAID_MAP_EXT);
>
> + fusion->max_map_sz =
> + max(fusion->old_map_sz, fusion->new_map_sz);
>
> - if (instance->supportmax256vd)
> - fusion->current_map_sz = new_map_sz;
> - else
> - fusion->current_map_sz = old_map_sz;
> + if (instance->supportmax256vd)
> + fusion->current_map_sz = fusion->new_map_sz;
> + else
> + 

RE: [PATCH V4 06/11] megaraid_sas: Dynamic Raid Map Changes for SAS3.5 Generic Megaraid Controllers

2016-12-19 Thread Sasikumar PC
Hi Tomas,

Please see my response inline

Thanks
sasi

-Original Message-
From: Sasikumar PC [mailto:sasikumar...@broadcom.com]
Sent: Wednesday, December 14, 2016 4:49 PM
To: 'Tomas Henzl'; 'j...@kernel.org'; 'h...@infradead.org'
Cc: 'linux-s...@vger.kernel.org'; Sathya Prakash Veerichetty;
'linux-kernel@vger.kernel.org'; Christopher Owens; Kiran Kumar Kasturi
Subject: RE: [PATCH V4 06/11] megaraid_sas: Dynamic Raid Map Changes for
SAS3.5 Generic Megaraid Controllers

Hi Tomas,

Please see my response inline

Thanks
sasi

-Original Message-
From: Tomas Henzl [mailto:the...@redhat.com]
Sent: Friday, December 09, 2016 7:55 AM
To: Sasikumar Chandrasekaran; j...@kernel.org; h...@infradead.org
Cc: linux-s...@vger.kernel.org; sathya.prak...@broadcom.com;
linux-kernel@vger.kernel.org; christopher.ow...@broadcom.com;
kiran-kumar.kast...@broadcom.com
Subject: Re: [PATCH V4 06/11] megaraid_sas: Dynamic Raid Map Changes for
SAS3.5 Generic Megaraid Controllers

On 7.12.2016 00:00, Sasikumar Chandrasekaran wrote:
> SAS3.5 Generic Megaraid Controllers FW will support new dynamic RaidMap
to have different
> sizes for different number of supported VDs.
>
> This patch is depending on patch 5
>
> Signed-off-by: Sasikumar Chandrasekaran 
> ---
>  drivers/scsi/megaraid/megaraid_sas.h|   7 +
>  drivers/scsi/megaraid/megaraid_sas_base.c   |  61 --
>  drivers/scsi/megaraid/megaraid_sas_fp.c | 303

>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 223 
>  drivers/scsi/megaraid/megaraid_sas_fusion.h | 240
++
>  5 files changed, 699 insertions(+), 135 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas.h
b/drivers/scsi/megaraid/megaraid_sas.h
> index f4d6a94..3e087ab 100644
> --- a/drivers/scsi/megaraid/megaraid_sas.h
> +++ b/drivers/scsi/megaraid/megaraid_sas.h
> @@ -1434,6 +1434,12 @@ enum FW_BOOT_CONTEXT {
>  #define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT14
>  #define MR_MAX_MSIX_REG_ARRAY   16
>  #define MR_RDPQ_MODE_OFFSET  0X0080
> +
> +#define MR_MAX_RAID_MAP_SIZE_OFFSET_SHIFT16
> +#define MR_MAX_RAID_MAP_SIZE_MASK0x1FF
> +#define MR_MIN_MAP_SIZE  0x1
> +/* 64k */
> +
>  #define MR_CAN_HANDLE_SYNC_CACHE_OFFSET  0X0100
>
>  /*
> @@ -2152,6 +2158,7 @@ struct megasas_instance {
>   bool fw_sync_cache_support;
>   bool is_ventura;
>   bool msix_combined;
> + u16 max_raid_mapsize;
>  };
>  struct MR_LD_VF_MAP {
>   u32 size;
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
b/drivers/scsi/megaraid/megaraid_sas_base.c
> index c52f7be..3f06b57 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -4424,8 +4424,7 @@ int megasas_alloc_cmds(struct megasas_instance
*instance)
>  static void megasas_update_ext_vd_details(struct megasas_instance
*instance)
>  {
>   struct fusion_context *fusion;
> - u32 old_map_sz;
> - u32 new_map_sz;
> + u32 ventura_map_sz = 0;
>
>   fusion = instance->ctrl_context;
>   /* For MFI based controllers return dummy success */
> @@ -4455,21 +4454,39 @@ static void megasas_update_ext_vd_details(struct
megasas_instance *instance)
>   instance->supportmax256vd ? "Extended VD(240 VD)firmware"
:
>   "Legacy(64 VD) firmware");
>
> - old_map_sz = sizeof(struct MR_FW_RAID_MAP) +
> - (sizeof(struct MR_LD_SPAN_MAP) *
> - (instance->fw_supported_vd_count - 1));
> - new_map_sz = sizeof(struct MR_FW_RAID_MAP_EXT);
> - fusion->drv_map_sz = sizeof(struct MR_DRV_RAID_MAP) +
> - (sizeof(struct MR_LD_SPAN_MAP) *
> - (instance->drv_supported_vd_count - 1));
> -
> - fusion->max_map_sz = max(old_map_sz, new_map_sz);
> + if (instance->max_raid_mapsize) {
> + ventura_map_sz = instance->max_raid_mapsize *
> + MR_MIN_MAP_SIZE; /* 64k */
> + fusion->current_map_sz = ventura_map_sz;
> + fusion->max_map_sz = ventura_map_sz;
> + } else {
> + fusion->old_map_sz =  sizeof(struct MR_FW_RAID_MAP) +
> + (sizeof(struct MR_LD_SPAN_MAP) *
> + (instance->fw_supported_vd_count -
1));
> + fusion->new_map_sz =  sizeof(struct MR_FW_RAID_MAP_EXT);
>
> + fusion->max_map_sz =
> + max(fusion->old_map_sz, fusion->new_map_sz);
>
> - if (instance->supportmax256vd)
> - fusion->current_map_sz = new_map_sz;
> - else
> - fusion->current_map_sz = old_map_sz;
> + if (instance->supportmax256vd)
> + fusion->current_map_sz = fusion->new_map_sz;
> + else
> + fusion->current_map_sz = 

RE: [PATCH V4 02/11] megaraid_sas: 128 MSIX Support

2016-12-19 Thread Sasikumar PC
Hi Tomas,

Please see my response inline

Thanks
sasi

-Original Message-
From: Sasikumar PC [mailto:sasikumar...@broadcom.com]
Sent: Wednesday, December 14, 2016 4:43 PM
To: 'Tomas Henzl'; 'j...@kernel.org'; 'h...@infradead.org'
Cc: 'linux-s...@vger.kernel.org'; Sathya Prakash Veerichetty;
'linux-kernel@vger.kernel.org'; Christopher Owens; Kiran Kumar Kasturi
Subject: RE: [PATCH V4 02/11] megaraid_sas: 128 MSIX Support

Hi Tomas,

Please see my response inline

Thanks
sasi

-Original Message-
From: Tomas Henzl [mailto:the...@redhat.com]
Sent: Thursday, December 08, 2016 10:35 AM
To: Sasikumar Chandrasekaran; j...@kernel.org; h...@infradead.org
Cc: linux-s...@vger.kernel.org; sathya.prak...@broadcom.com;
linux-kernel@vger.kernel.org; christopher.ow...@broadcom.com;
kiran-kumar.kast...@broadcom.com
Subject: Re: [PATCH V4 02/11] megaraid_sas: 128 MSIX Support

On 7.12.2016 00:00, Sasikumar Chandrasekaran wrote:
> SAS3.5 Generic Megaraid based Controllers will have the support for
> 128 MSI-X vectors, resulting in the need to support 128 reply queues
>
> This patch is depending on patch 1
>
> Signed-off-by: Sasikumar Chandrasekaran 
> ---
>  drivers/scsi/megaraid/megaraid_sas.h|  1 +
>  drivers/scsi/megaraid/megaraid_sas_base.c   | 24
+---
>  drivers/scsi/megaraid/megaraid_sas_fusion.c |  4 ++--
>  3 files changed, 20 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas.h
> b/drivers/scsi/megaraid/megaraid_sas.h
> index 72e16c2..9d4ca8d 100644
> --- a/drivers/scsi/megaraid/megaraid_sas.h
> +++ b/drivers/scsi/megaraid/megaraid_sas.h
> @@ -2149,6 +2149,7 @@ struct megasas_instance {
>   bool dev_handle;
>   bool fw_sync_cache_support;
>   bool is_ventura;
> + bool msix_combined;
>  };
>  struct MR_LD_VF_MAP {
>   u32 size;
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index efccf98..c583e0b 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -5086,13 +5086,7 @@ static int megasas_init_fw(struct
megasas_instance *instance)
>   goto fail_ready_state;
>   }
>
> - /*
> -  * MSI-X host index 0 is common for all adapter.
> -  * It is used for all MPT based Adapters.
> -  */
> - instance->reply_post_host_index_addr[0] =
> - (u32 __iomem *)((u8 __iomem *)instance->reg_set +
> - MPI2_REPLY_POST_HOST_INDEX_OFFSET);
> +
>
>   /* Check if MSI-X is supported while in ready state */
>   msix_enable = (instance->instancet->read_fw_status_reg(reg_set) &
@@
> -5110,6 +5104,9 @@ static int megasas_init_fw(struct megasas_instance
*instance)
>   instance->msix_vectors = ((scratch_pad_2
>   & MR_MAX_REPLY_QUEUES_EXT_OFFSET)
>   >>
MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT) + 1;
> + if (instance->msix_vectors > 16)
> + instance->msix_combined = true;
> +
>   if (rdpq_enable)
>   instance->is_rdpq = (scratch_pad_2
& MR_RDPQ_MODE_OFFSET) ?
>   1 : 0;
> @@ -5143,6 +5140,19 @@ static int megasas_init_fw(struct
megasas_instance *instance)
>   else
>   instance->msix_vectors = 0;
>   }

Have you tested this patch with the pci=nomsi kernel option?
Sasi - Driver is tested with pci=nomsi option and looking good

is it safe when msix_combined is true and pci_enable_msix_range fails so
instance->msix_vectors is zero?

msix_combined mode is dependent on how many vectors adapter supports, not
the actual vectors used.
It is correct to be in combined mode even if actual number of msix_vectors
used are fewer than 16, if hardware is in combined mode

Sasi


tomash

> + /*
> +  * MSI-X host index 0 is common for all adapter.
> +  * It is used for all MPT based Adapters.
> +  */
> + if (instance->msix_combined) {
> + instance->reply_post_host_index_addr[0] =
> + (u32 *)((u8 *)instance->reg_set +
> + MPI2_SUP_REPLY_POST_HOST_INDEX_OFFSET);
> + } else {
> + instance->reply_post_host_index_addr[0] =
> + (u32 *)((u8 *)instance->reg_set +
> + MPI2_REPLY_POST_HOST_INDEX_OFFSET);
> + }
>
>   dev_info(>pdev->dev,
>   "firmware supports msix\t: (%d)", fw_msix_count); diff
--git
> a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> index 8d7a397..413e2030 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -2391,7 +2391,7 @@ static void 

RE: [PATCH V4 02/11] megaraid_sas: 128 MSIX Support

2016-12-19 Thread Sasikumar PC
Hi Tomas,

Please see my response inline

Thanks
sasi

-Original Message-
From: Sasikumar PC [mailto:sasikumar...@broadcom.com]
Sent: Wednesday, December 14, 2016 4:43 PM
To: 'Tomas Henzl'; 'j...@kernel.org'; 'h...@infradead.org'
Cc: 'linux-s...@vger.kernel.org'; Sathya Prakash Veerichetty;
'linux-kernel@vger.kernel.org'; Christopher Owens; Kiran Kumar Kasturi
Subject: RE: [PATCH V4 02/11] megaraid_sas: 128 MSIX Support

Hi Tomas,

Please see my response inline

Thanks
sasi

-Original Message-
From: Tomas Henzl [mailto:the...@redhat.com]
Sent: Thursday, December 08, 2016 10:35 AM
To: Sasikumar Chandrasekaran; j...@kernel.org; h...@infradead.org
Cc: linux-s...@vger.kernel.org; sathya.prak...@broadcom.com;
linux-kernel@vger.kernel.org; christopher.ow...@broadcom.com;
kiran-kumar.kast...@broadcom.com
Subject: Re: [PATCH V4 02/11] megaraid_sas: 128 MSIX Support

On 7.12.2016 00:00, Sasikumar Chandrasekaran wrote:
> SAS3.5 Generic Megaraid based Controllers will have the support for
> 128 MSI-X vectors, resulting in the need to support 128 reply queues
>
> This patch is depending on patch 1
>
> Signed-off-by: Sasikumar Chandrasekaran 
> ---
>  drivers/scsi/megaraid/megaraid_sas.h|  1 +
>  drivers/scsi/megaraid/megaraid_sas_base.c   | 24
+---
>  drivers/scsi/megaraid/megaraid_sas_fusion.c |  4 ++--
>  3 files changed, 20 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas.h
> b/drivers/scsi/megaraid/megaraid_sas.h
> index 72e16c2..9d4ca8d 100644
> --- a/drivers/scsi/megaraid/megaraid_sas.h
> +++ b/drivers/scsi/megaraid/megaraid_sas.h
> @@ -2149,6 +2149,7 @@ struct megasas_instance {
>   bool dev_handle;
>   bool fw_sync_cache_support;
>   bool is_ventura;
> + bool msix_combined;
>  };
>  struct MR_LD_VF_MAP {
>   u32 size;
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index efccf98..c583e0b 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -5086,13 +5086,7 @@ static int megasas_init_fw(struct
megasas_instance *instance)
>   goto fail_ready_state;
>   }
>
> - /*
> -  * MSI-X host index 0 is common for all adapter.
> -  * It is used for all MPT based Adapters.
> -  */
> - instance->reply_post_host_index_addr[0] =
> - (u32 __iomem *)((u8 __iomem *)instance->reg_set +
> - MPI2_REPLY_POST_HOST_INDEX_OFFSET);
> +
>
>   /* Check if MSI-X is supported while in ready state */
>   msix_enable = (instance->instancet->read_fw_status_reg(reg_set) &
@@
> -5110,6 +5104,9 @@ static int megasas_init_fw(struct megasas_instance
*instance)
>   instance->msix_vectors = ((scratch_pad_2
>   & MR_MAX_REPLY_QUEUES_EXT_OFFSET)
>   >>
MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT) + 1;
> + if (instance->msix_vectors > 16)
> + instance->msix_combined = true;
> +
>   if (rdpq_enable)
>   instance->is_rdpq = (scratch_pad_2
& MR_RDPQ_MODE_OFFSET) ?
>   1 : 0;
> @@ -5143,6 +5140,19 @@ static int megasas_init_fw(struct
megasas_instance *instance)
>   else
>   instance->msix_vectors = 0;
>   }

Have you tested this patch with the pci=nomsi kernel option?
Sasi - Driver is tested with pci=nomsi option and looking good

is it safe when msix_combined is true and pci_enable_msix_range fails so
instance->msix_vectors is zero?

msix_combined mode is dependent on how many vectors adapter supports, not
the actual vectors used.
It is correct to be in combined mode even if actual number of msix_vectors
used are fewer than 16, if hardware is in combined mode

Sasi


tomash

> + /*
> +  * MSI-X host index 0 is common for all adapter.
> +  * It is used for all MPT based Adapters.
> +  */
> + if (instance->msix_combined) {
> + instance->reply_post_host_index_addr[0] =
> + (u32 *)((u8 *)instance->reg_set +
> + MPI2_SUP_REPLY_POST_HOST_INDEX_OFFSET);
> + } else {
> + instance->reply_post_host_index_addr[0] =
> + (u32 *)((u8 *)instance->reg_set +
> + MPI2_REPLY_POST_HOST_INDEX_OFFSET);
> + }
>
>   dev_info(>pdev->dev,
>   "firmware supports msix\t: (%d)", fw_msix_count); diff
--git
> a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> index 8d7a397..413e2030 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -2391,7 +2391,7 @@ static void megasas_build_ld_nonrw_fusion(struct
megasas_instance 

Re: [kbuild-all] 答复: Re: [PATCH 1/2] ocfs2: add kobject for online file check

2016-12-19 Thread Fengguang Wu

Hi Gang,

On Mon, Dec 19, 2016 at 06:43:48PM -0700, Gang He wrote:

Hello Kbuild,
Could you build my whole patch set (2 patch)? I think that the code is OK.


We test your whole patch as well as first-N patches, and noticed that
the first-1 patch breaks bisectibility:


Note: the 
linux-review/Gang-He/ocfs2-add-kobject-for-online-file-check/20161219-181858 
HEAD 6ef9256cd25ef72a5e69490cc3dacde95b8e2ac4 builds fine.
 It only hurts bisectibility.


Thanks,
Fengguang


Re: [PATCH v2 2/3] Bluetooth: btusb: Add out-of-band wakeup support

2016-12-19 Thread Rajat Jain
Hi Brian,


On Mon, Dec 19, 2016 at 3:10 PM, Brian Norris  wrote:
> Hi Rajat,
>
> On Fri, Dec 16, 2016 at 11:30:03AM -0800, Rajat Jain wrote:
>> Some onboard BT chips (e.g. Marvell 8997) contain a wakeup pin that
>> can be connected to a gpio on the CPU side, and can be used to wakeup
>> the host out-of-band. This can be useful in situations where the
>> in-band wakeup is not possible or not preferable (e.g. the in-band
>> wakeup may require the USB host controller to remain active, and
>> hence consuming more system power during system sleep).
>>
>> The oob gpio interrupt to be used for wakeup on the CPU side, is
>> read from the device tree node, (using standard interrupt descriptors).
>> A devcie tree binding document is also added for the driver. The
>> compatible string is in compliance with
>> Documentation/devicetree/bindings/usb/usb-device.txt
>>
>> Signed-off-by: Rajat Jain 
>
> A few small comments below, but other than those, for the whole series:
>
> Reviewed-by: Brian Norris 
>
>> ---
>> v2: * Use interrupt-names ("wakeup") instead of assuming first interrupt.
>> * Leave it on device tree to specify IRQ flags (level /edge triggered)
>> * Mark the device as non wakeable on exit.
>>
>>  Documentation/devicetree/bindings/net/btusb.txt | 40 
>>  drivers/bluetooth/btusb.c   | 84 
>> +
>>  2 files changed, 124 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/net/btusb.txt
>>
>> diff --git a/Documentation/devicetree/bindings/net/btusb.txt 
>> b/Documentation/devicetree/bindings/net/btusb.txt
>> new file mode 100644
>> index 000..2c0355c
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/btusb.txt
>> @@ -0,0 +1,40 @@
>> +Generic Bluetooth controller over USB (btusb driver)
>> +---
>> +
>> +Required properties:
>> +
>> +  - compatible : should comply with the format "usbVID,PID" specified in
>> +  Documentation/devicetree/bindings/usb/usb-device.txt
>> +  At the time of writing, the only OF supported devices
>> +  (more may be added later) are:
>> +
>> +   "usb1286,204e" (Marvell 8997)
>
> I don't know if anyone cares about these sort of organizational aspects,
> but in patch 3 you're extending this binding with device-specific
> Marvell bindings. Seems like it'd be good to have some clear way to
> correlate those 2. e.g., add a reference to marvell-bt-sd8xxx.txt (or
> marvell-bt-8xxx.txt, once you rename it) in patch 3.

Good idea, I'll do it. (I'll add a reference to marvell-bt-8xxx.txt in
btusb.txt).

>
>> +
>> +Optional properties:
>> +
>> +  - interrupt-parent: phandle of the parent interrupt controller
>> +  - interrupt-names: (see below)
>> +  - interrupts : The interrupt specified by the name "wakeup" is the 
>> interrupt
>> +  that shall be used for out-of-band wake-on-bt. Driver will
>> +  request this interrupt for wakeup. During system suspend, the
>> +  irq will be enabled so that the bluetooth chip can wakeup host
>> +  platform out of band. During system resume, the irq will be
>> +  disabled to make sure unnecessary interrupt is not received.
>> +
>> +Example:
>> +
>> +Following example uses irq pin number 3 of gpio0 for out of band wake-on-bt:
>> +
>> +_host1_ehci {
>> +status = "okay";
>> +#address-cells = <1>;
>> +#size-cells = <0>;
>> +
>> +mvl_bt1: bt@1 {
>> + compatible = "usb1286,204e";
>> + reg = <1>;
>> + interrupt-parent = <>;
>> + interrupt-name = "wakeup";
>> + interrupts = <3 IRQ_TYPE_LEVEL_LOW>;
>> +};
>> +};
>> diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
>> index ce22cef..beca4e9 100644
>> --- a/drivers/bluetooth/btusb.c
>> +++ b/drivers/bluetooth/btusb.c
>> @@ -24,6 +24,8 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>> +#include 
>>  #include 
>>
>>  #include 
>> @@ -369,6 +371,7 @@ static const struct usb_device_id blacklist_table[] = {
>>  #define BTUSB_BOOTING9
>>  #define BTUSB_RESET_RESUME   10
>>  #define BTUSB_DIAG_RUNNING   11
>> +#define BTUSB_OOB_WAKE_DISABLED  12
>>
>>  struct btusb_data {
>>   struct hci_dev   *hdev;
>> @@ -416,6 +419,8 @@ struct btusb_data {
>>   int (*recv_bulk)(struct btusb_data *data, void *buffer, int count);
>>
>>   int (*setup_on_usb)(struct hci_dev *hdev);
>> +
>> + int oob_wake_irq;   /* irq for out-of-band wake-on-bt */
>>  };
>>
>>  static inline void btusb_free_frags(struct btusb_data *data)
>> @@ -2728,6 +2733,65 @@ static int btusb_bcm_set_diag(struct hci_dev *hdev, 
>> bool enable)
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_PM
>> +static irqreturn_t btusb_oob_wake_handler(int irq, void *priv)
>> +{
>> + struct btusb_data *data = priv;
>> +
>> + /* Disable only if not already disabled 

Re: [kbuild-all] 答复: Re: [PATCH 1/2] ocfs2: add kobject for online file check

2016-12-19 Thread Fengguang Wu

Hi Gang,

On Mon, Dec 19, 2016 at 06:43:48PM -0700, Gang He wrote:

Hello Kbuild,
Could you build my whole patch set (2 patch)? I think that the code is OK.


We test your whole patch as well as first-N patches, and noticed that
the first-1 patch breaks bisectibility:


Note: the 
linux-review/Gang-He/ocfs2-add-kobject-for-online-file-check/20161219-181858 
HEAD 6ef9256cd25ef72a5e69490cc3dacde95b8e2ac4 builds fine.
 It only hurts bisectibility.


Thanks,
Fengguang


Re: [PATCH v2 2/3] Bluetooth: btusb: Add out-of-band wakeup support

2016-12-19 Thread Rajat Jain
Hi Brian,


On Mon, Dec 19, 2016 at 3:10 PM, Brian Norris  wrote:
> Hi Rajat,
>
> On Fri, Dec 16, 2016 at 11:30:03AM -0800, Rajat Jain wrote:
>> Some onboard BT chips (e.g. Marvell 8997) contain a wakeup pin that
>> can be connected to a gpio on the CPU side, and can be used to wakeup
>> the host out-of-band. This can be useful in situations where the
>> in-band wakeup is not possible or not preferable (e.g. the in-band
>> wakeup may require the USB host controller to remain active, and
>> hence consuming more system power during system sleep).
>>
>> The oob gpio interrupt to be used for wakeup on the CPU side, is
>> read from the device tree node, (using standard interrupt descriptors).
>> A devcie tree binding document is also added for the driver. The
>> compatible string is in compliance with
>> Documentation/devicetree/bindings/usb/usb-device.txt
>>
>> Signed-off-by: Rajat Jain 
>
> A few small comments below, but other than those, for the whole series:
>
> Reviewed-by: Brian Norris 
>
>> ---
>> v2: * Use interrupt-names ("wakeup") instead of assuming first interrupt.
>> * Leave it on device tree to specify IRQ flags (level /edge triggered)
>> * Mark the device as non wakeable on exit.
>>
>>  Documentation/devicetree/bindings/net/btusb.txt | 40 
>>  drivers/bluetooth/btusb.c   | 84 
>> +
>>  2 files changed, 124 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/net/btusb.txt
>>
>> diff --git a/Documentation/devicetree/bindings/net/btusb.txt 
>> b/Documentation/devicetree/bindings/net/btusb.txt
>> new file mode 100644
>> index 000..2c0355c
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/btusb.txt
>> @@ -0,0 +1,40 @@
>> +Generic Bluetooth controller over USB (btusb driver)
>> +---
>> +
>> +Required properties:
>> +
>> +  - compatible : should comply with the format "usbVID,PID" specified in
>> +  Documentation/devicetree/bindings/usb/usb-device.txt
>> +  At the time of writing, the only OF supported devices
>> +  (more may be added later) are:
>> +
>> +   "usb1286,204e" (Marvell 8997)
>
> I don't know if anyone cares about these sort of organizational aspects,
> but in patch 3 you're extending this binding with device-specific
> Marvell bindings. Seems like it'd be good to have some clear way to
> correlate those 2. e.g., add a reference to marvell-bt-sd8xxx.txt (or
> marvell-bt-8xxx.txt, once you rename it) in patch 3.

Good idea, I'll do it. (I'll add a reference to marvell-bt-8xxx.txt in
btusb.txt).

>
>> +
>> +Optional properties:
>> +
>> +  - interrupt-parent: phandle of the parent interrupt controller
>> +  - interrupt-names: (see below)
>> +  - interrupts : The interrupt specified by the name "wakeup" is the 
>> interrupt
>> +  that shall be used for out-of-band wake-on-bt. Driver will
>> +  request this interrupt for wakeup. During system suspend, the
>> +  irq will be enabled so that the bluetooth chip can wakeup host
>> +  platform out of band. During system resume, the irq will be
>> +  disabled to make sure unnecessary interrupt is not received.
>> +
>> +Example:
>> +
>> +Following example uses irq pin number 3 of gpio0 for out of band wake-on-bt:
>> +
>> +_host1_ehci {
>> +status = "okay";
>> +#address-cells = <1>;
>> +#size-cells = <0>;
>> +
>> +mvl_bt1: bt@1 {
>> + compatible = "usb1286,204e";
>> + reg = <1>;
>> + interrupt-parent = <>;
>> + interrupt-name = "wakeup";
>> + interrupts = <3 IRQ_TYPE_LEVEL_LOW>;
>> +};
>> +};
>> diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
>> index ce22cef..beca4e9 100644
>> --- a/drivers/bluetooth/btusb.c
>> +++ b/drivers/bluetooth/btusb.c
>> @@ -24,6 +24,8 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>> +#include 
>>  #include 
>>
>>  #include 
>> @@ -369,6 +371,7 @@ static const struct usb_device_id blacklist_table[] = {
>>  #define BTUSB_BOOTING9
>>  #define BTUSB_RESET_RESUME   10
>>  #define BTUSB_DIAG_RUNNING   11
>> +#define BTUSB_OOB_WAKE_DISABLED  12
>>
>>  struct btusb_data {
>>   struct hci_dev   *hdev;
>> @@ -416,6 +419,8 @@ struct btusb_data {
>>   int (*recv_bulk)(struct btusb_data *data, void *buffer, int count);
>>
>>   int (*setup_on_usb)(struct hci_dev *hdev);
>> +
>> + int oob_wake_irq;   /* irq for out-of-band wake-on-bt */
>>  };
>>
>>  static inline void btusb_free_frags(struct btusb_data *data)
>> @@ -2728,6 +2733,65 @@ static int btusb_bcm_set_diag(struct hci_dev *hdev, 
>> bool enable)
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_PM
>> +static irqreturn_t btusb_oob_wake_handler(int irq, void *priv)
>> +{
>> + struct btusb_data *data = priv;
>> +
>> + /* Disable only if not already disabled (keep it balanced) */
>> + if 

Re: [PATCH 4/4] oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA

2016-12-19 Thread kbuild test robot
Hi Kirill,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.9 next-20161219]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/mm-drop-zap_details-ignore_dirty/20161220-092938
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x003-201651 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   mm/madvise.c: In function 'madvise_dontneed':
>> mm/madvise.c:476:7: error: implicit declaration of function 
>> 'can_madv_dontneed_vma' [-Werror=implicit-function-declaration]
 if (!can_madv_dontneed_vma(vma))
  ^
   cc1: some warnings being treated as errors

vim +/can_madv_dontneed_vma +476 mm/madvise.c

   470   */
   471  static long madvise_dontneed(struct vm_area_struct *vma,
   472   struct vm_area_struct **prev,
   473   unsigned long start, unsigned long end)
   474  {
   475  *prev = vma;
 > 476  if (!can_madv_dontneed_vma(vma))
   477  return -EINVAL;
   478  
   479  zap_page_range(vma, start, end - start);

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski
On Mon, Dec 19, 2016 at 4:25 PM, Andy Lutomirski  wrote:
> On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
>  wrote:
>> you're ignoring use cases I described earlier.
>> In vrf case there is only one ifindex it needs to bind to.
>
> I'm totally lost.  Can you explain what this has to do with the cgroup
> hierarchy?
>

Okay, I figured out what you mean, I think.  You have a handful of vrf
devices.  Let's say they have ifindexes 1 and 2 (and maybe more).

The interesting case happens when you set up /cgroup/a with a bpf
program that binds new sockets to ifindex 1 and /cgroup/a/b with a bpf
program that binds new sockets to ifindex 2.  The question is: what
should happen if you're in /cgroup/a/b?  Presumably, if you do this,
you wanted to end up with ifindex 2.

I think the way it should actually work is that the kernel evaluates
/cgroup/a/b's hook and then /cgroup/a's hook.  Then /cgroup/a (which
is the more privileged hook) gets to make the choice.  If it wants
ifindex 2 to win, it can do (pseudocode):

if (!sk->sk_bound_if)
  sk->sk_bound_if = 1;


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread David Ahern
On 12/19/16 5:25 PM, Andy Lutomirski wrote:
> net.socket_create_filter = "none": no filter
> net.socket_create_filter = "bpf:baadf00d": bpf filter
> net.socket_create_filter = "disallow": no sockets created period
> net.socket_create_filter = "iptables:foobar": some iptables thingy
> net.socket_create_filter = "nft:blahblahblah": some nft thingy
> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

Such a scheme works for the socket create filter b/c it is a very simple use 
case. It does not work for the ingress and egress which allow generic bpf 
filters.

...

>> you're ignoring use cases I described earlier.
>> In vrf case there is only one ifindex it needs to bind to.
> 
> I'm totally lost.  Can you explain what this has to do with the cgroup
> hierarchy?

I think the point is that a group hierarchy makes no sense for the VRF use 
case. What I put into iproute2 is

cgrp2/vrf/NAME

where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets 
to a specific device index. cgrp2/vrf is the "default" vrf and does not have a 
filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it 
provides no value since VRF in a VRF does not make sense. 

...

>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
>>> have to do it now (or disable the feature for 4.10).  This is why I'm
>>> bringing this whole thing up now.
>>
>> We don't have to touch user visible api here, so extensions are fine.
> 
> Huh?  My example in the original email attaches a program in a
> sub-hierarchy.  Are you saying that 4.11 could make that example stop
> working?

Are you suggesting sub-cgroups should not be allowed to override the filter of 
a parent cgroup?




Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski
On Mon, Dec 19, 2016 at 4:25 PM, Andy Lutomirski  wrote:
> On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
>  wrote:
>> you're ignoring use cases I described earlier.
>> In vrf case there is only one ifindex it needs to bind to.
>
> I'm totally lost.  Can you explain what this has to do with the cgroup
> hierarchy?
>

Okay, I figured out what you mean, I think.  You have a handful of vrf
devices.  Let's say they have ifindexes 1 and 2 (and maybe more).

The interesting case happens when you set up /cgroup/a with a bpf
program that binds new sockets to ifindex 1 and /cgroup/a/b with a bpf
program that binds new sockets to ifindex 2.  The question is: what
should happen if you're in /cgroup/a/b?  Presumably, if you do this,
you wanted to end up with ifindex 2.

I think the way it should actually work is that the kernel evaluates
/cgroup/a/b's hook and then /cgroup/a's hook.  Then /cgroup/a (which
is the more privileged hook) gets to make the choice.  If it wants
ifindex 2 to win, it can do (pseudocode):

if (!sk->sk_bound_if)
  sk->sk_bound_if = 1;


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread David Ahern
On 12/19/16 5:25 PM, Andy Lutomirski wrote:
> net.socket_create_filter = "none": no filter
> net.socket_create_filter = "bpf:baadf00d": bpf filter
> net.socket_create_filter = "disallow": no sockets created period
> net.socket_create_filter = "iptables:foobar": some iptables thingy
> net.socket_create_filter = "nft:blahblahblah": some nft thingy
> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

Such a scheme works for the socket create filter b/c it is a very simple use 
case. It does not work for the ingress and egress which allow generic bpf 
filters.

...

>> you're ignoring use cases I described earlier.
>> In vrf case there is only one ifindex it needs to bind to.
> 
> I'm totally lost.  Can you explain what this has to do with the cgroup
> hierarchy?

I think the point is that a group hierarchy makes no sense for the VRF use 
case. What I put into iproute2 is

cgrp2/vrf/NAME

where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets 
to a specific device index. cgrp2/vrf is the "default" vrf and does not have a 
filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it 
provides no value since VRF in a VRF does not make sense. 

...

>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
>>> have to do it now (or disable the feature for 4.10).  This is why I'm
>>> bringing this whole thing up now.
>>
>> We don't have to touch user visible api here, so extensions are fine.
> 
> Huh?  My example in the original email attaches a program in a
> sub-hierarchy.  Are you saying that 4.11 could make that example stop
> working?

Are you suggesting sub-cgroups should not be allowed to override the filter of 
a parent cgroup?




Re: [PATCH 4/4] oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA

2016-12-19 Thread kbuild test robot
Hi Kirill,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.9 next-20161219]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/mm-drop-zap_details-ignore_dirty/20161220-092938
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x003-201651 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   mm/madvise.c: In function 'madvise_dontneed':
>> mm/madvise.c:476:7: error: implicit declaration of function 
>> 'can_madv_dontneed_vma' [-Werror=implicit-function-declaration]
 if (!can_madv_dontneed_vma(vma))
  ^
   cc1: some warnings being treated as errors

vim +/can_madv_dontneed_vma +476 mm/madvise.c

   470   */
   471  static long madvise_dontneed(struct vm_area_struct *vma,
   472   struct vm_area_struct **prev,
   473   unsigned long start, unsigned long end)
   474  {
   475  *prev = vma;
 > 476  if (!can_madv_dontneed_vma(vma))
   477  return -EINVAL;
   478  
   479  zap_page_range(vma, start, end - start);

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


答复: Re: [PATCH 1/2] ocfs2: add kobject for online file check

2016-12-19 Thread Gang He
Hello Kbuild,
Could you build my whole patch set (2 patch)? I think that the code is OK.

Thanks
Gang


>>> kbuild test robot <l...@intel.com> 2016-12-19 下午 18:56 >>>
Hi Gang,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.9 next-20161219]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Gang-He/ocfs2-add-kobject-for-online-file-check/20161219-181858
config: x86_64-randconfig-x000-201651 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

Note: the 
linux-review/Gang-He/ocfs2-add-kobject-for-online-file-check/20161219-181858 
HEAD 6ef9256cd25ef72a5e69490cc3dacde95b8e2ac4 builds fine.
  It only hurts bisectibility.

All error/warnings (new ones prefixed by >>):

   fs/ocfs2/super.c: In function 'ocfs2_fill_super':
>> fs/ocfs2/super.c:1204:31: error: passing argument 1 of 
>> 'ocfs2_filecheck_create_sysfs' from incompatible pointer type 
>> [-Werror=incompatible-pointer-types]
 ocfs2_filecheck_create_sysfs(sb);
  ^~
   In file included from fs/ocfs2/super.c:77:0:
   fs/ocfs2/filecheck.h:71:5: note: expected 'struct ocfs2_super *' but 
argument is of type 'struct super_block *'
int ocfs2_filecheck_create_sysfs(struct ocfs2_super *osb);
^~~~
   fs/ocfs2/super.c: In function 'ocfs2_put_super':
>> fs/ocfs2/super.c:1656:31: error: passing argument 1 of 
>> 'ocfs2_filecheck_remove_sysfs' from incompatible pointer type 
>> [-Werror=incompatible-pointer-types]
 ocfs2_filecheck_remove_sysfs(sb);
  ^~
   In file included from fs/ocfs2/super.c:77:0:
   fs/ocfs2/filecheck.h:72:6: note: expected 'struct ocfs2_super *' but 
argument is of type 'struct super_block *'
void ocfs2_filecheck_remove_sysfs(struct ocfs2_super *osb);
 ^~~~
   cc1: some warnings being treated as errors
--
   fs/ocfs2/filecheck.c: In function 'ocfs2_filecheck_create_sysfs':
>> fs/ocfs2/filecheck.c:179:50: error: 'struct ocfs2_super' has no member named 
>> 'osb_fc_ent'; did you mean 'osb_ctxt'?
 struct ocfs2_filecheck_sysfs_entry *entry = >osb_fc_ent;
 ^~
>> fs/ocfs2/filecheck.c:191:27: error: 'struct ocfs2_super' has no member named 
>> 'osb_dev_kset'; did you mean 'osb_dx_seed'?
 entry->fs_kobj.kset = osb->osb_dev_kset;
  ^~
   fs/ocfs2/filecheck.c: In function 'ocfs2_filecheck_remove_sysfs':
   fs/ocfs2/filecheck.c:206:10: error: 'struct ocfs2_super' has no member named 
'osb_fc_ent'; did you mean 'osb_ctxt'?
 if (!osb->osb_fc_ent.fs_fcheck)
 ^~
   fs/ocfs2/filecheck.c:209:18: error: 'struct ocfs2_super' has no member named 
'osb_fc_ent'; did you mean 'osb_ctxt'?
 kobject_del(>osb_fc_ent.fs_kobj);
 ^~
   fs/ocfs2/filecheck.c:210:18: error: 'struct ocfs2_super' has no member named 
'osb_fc_ent'; did you mean 'osb_ctxt'?
 kobject_put(>osb_fc_ent.fs_kobj);
 ^~
   fs/ocfs2/filecheck.c:211:26: error: 'struct ocfs2_super' has no member named 
'osb_fc_ent'; did you mean 'osb_ctxt'?
 wait_for_completion(>osb_fc_ent.fs_kobj_unregister);
 ^~
   fs/ocfs2/filecheck.c:212:33: error: 'struct ocfs2_super' has no member named 
'osb_fc_ent'; did you mean 'osb_ctxt'?
 ocfs2_filecheck_sysfs_free(>osb_fc_ent);
^~
   In file included from include/linux/list.h:8:0,
from fs/ocfs2/filecheck.c:20:
   fs/ocfs2/filecheck.c: In function 'ocfs2_filecheck_handle_entry':
>> include/linux/kernel.h:850:27: error: 'struct ocfs2_super' has no member 
>> named 'osb_fc_ent'; did you mean 'osb_ctxt'?
 const typeof( ((type *)0)->member ) *__mptr = (ptr); \
  ^
>> fs/ocfs2/filecheck.c:444: struct ocfs2_super *osb = container_of(ent, 
>> struct ocfs2_super,
   ^~~~
   include/linux/kernel.h:850:48: error: initialization from incompatible 
pointer type [-Werror=incompatible-pointer-types]
 const typeof( ((type *)0)->member ) *__mptr = (ptr); \
   ^
>> fs/ocfs2/filecheck.c:444:28: note: in expansion of macro 'container_of'
 struct ocfs2_super *osb = container_of(ent, struct ocfs2_super,
   ^~~~
   include/linux/kernel.h:850:48: note: (near initialization for 'osb')
 const typeof( ((type *)0)->member ) *__mptr = (ptr); \
   ^
>> fs/ocfs2/filecheck.c:444:28: note: in expansion of macro 'container_of'
   

答复: Re: [PATCH 1/2] ocfs2: add kobject for online file check

2016-12-19 Thread Gang He
Hello Kbuild,
Could you build my whole patch set (2 patch)? I think that the code is OK.

Thanks
Gang


>>> kbuild test robot  2016-12-19 下午 18:56 >>>
Hi Gang,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.9 next-20161219]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Gang-He/ocfs2-add-kobject-for-online-file-check/20161219-181858
config: x86_64-randconfig-x000-201651 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

Note: the 
linux-review/Gang-He/ocfs2-add-kobject-for-online-file-check/20161219-181858 
HEAD 6ef9256cd25ef72a5e69490cc3dacde95b8e2ac4 builds fine.
  It only hurts bisectibility.

All error/warnings (new ones prefixed by >>):

   fs/ocfs2/super.c: In function 'ocfs2_fill_super':
>> fs/ocfs2/super.c:1204:31: error: passing argument 1 of 
>> 'ocfs2_filecheck_create_sysfs' from incompatible pointer type 
>> [-Werror=incompatible-pointer-types]
 ocfs2_filecheck_create_sysfs(sb);
  ^~
   In file included from fs/ocfs2/super.c:77:0:
   fs/ocfs2/filecheck.h:71:5: note: expected 'struct ocfs2_super *' but 
argument is of type 'struct super_block *'
int ocfs2_filecheck_create_sysfs(struct ocfs2_super *osb);
^~~~
   fs/ocfs2/super.c: In function 'ocfs2_put_super':
>> fs/ocfs2/super.c:1656:31: error: passing argument 1 of 
>> 'ocfs2_filecheck_remove_sysfs' from incompatible pointer type 
>> [-Werror=incompatible-pointer-types]
 ocfs2_filecheck_remove_sysfs(sb);
  ^~
   In file included from fs/ocfs2/super.c:77:0:
   fs/ocfs2/filecheck.h:72:6: note: expected 'struct ocfs2_super *' but 
argument is of type 'struct super_block *'
void ocfs2_filecheck_remove_sysfs(struct ocfs2_super *osb);
 ^~~~
   cc1: some warnings being treated as errors
--
   fs/ocfs2/filecheck.c: In function 'ocfs2_filecheck_create_sysfs':
>> fs/ocfs2/filecheck.c:179:50: error: 'struct ocfs2_super' has no member named 
>> 'osb_fc_ent'; did you mean 'osb_ctxt'?
 struct ocfs2_filecheck_sysfs_entry *entry = >osb_fc_ent;
 ^~
>> fs/ocfs2/filecheck.c:191:27: error: 'struct ocfs2_super' has no member named 
>> 'osb_dev_kset'; did you mean 'osb_dx_seed'?
 entry->fs_kobj.kset = osb->osb_dev_kset;
  ^~
   fs/ocfs2/filecheck.c: In function 'ocfs2_filecheck_remove_sysfs':
   fs/ocfs2/filecheck.c:206:10: error: 'struct ocfs2_super' has no member named 
'osb_fc_ent'; did you mean 'osb_ctxt'?
 if (!osb->osb_fc_ent.fs_fcheck)
 ^~
   fs/ocfs2/filecheck.c:209:18: error: 'struct ocfs2_super' has no member named 
'osb_fc_ent'; did you mean 'osb_ctxt'?
 kobject_del(>osb_fc_ent.fs_kobj);
 ^~
   fs/ocfs2/filecheck.c:210:18: error: 'struct ocfs2_super' has no member named 
'osb_fc_ent'; did you mean 'osb_ctxt'?
 kobject_put(>osb_fc_ent.fs_kobj);
 ^~
   fs/ocfs2/filecheck.c:211:26: error: 'struct ocfs2_super' has no member named 
'osb_fc_ent'; did you mean 'osb_ctxt'?
 wait_for_completion(>osb_fc_ent.fs_kobj_unregister);
 ^~
   fs/ocfs2/filecheck.c:212:33: error: 'struct ocfs2_super' has no member named 
'osb_fc_ent'; did you mean 'osb_ctxt'?
 ocfs2_filecheck_sysfs_free(>osb_fc_ent);
^~
   In file included from include/linux/list.h:8:0,
from fs/ocfs2/filecheck.c:20:
   fs/ocfs2/filecheck.c: In function 'ocfs2_filecheck_handle_entry':
>> include/linux/kernel.h:850:27: error: 'struct ocfs2_super' has no member 
>> named 'osb_fc_ent'; did you mean 'osb_ctxt'?
 const typeof( ((type *)0)->member ) *__mptr = (ptr); \
  ^
>> fs/ocfs2/filecheck.c:444: struct ocfs2_super *osb = container_of(ent, 
>> struct ocfs2_super,
   ^~~~
   include/linux/kernel.h:850:48: error: initialization from incompatible 
pointer type [-Werror=incompatible-pointer-types]
 const typeof( ((type *)0)->member ) *__mptr = (ptr); \
   ^
>> fs/ocfs2/filecheck.c:444:28: note: in expansion of macro 'container_of'
 struct ocfs2_super *osb = container_of(ent, struct ocfs2_super,
   ^~~~
   include/linux/kernel.h:850:48: note: (near initialization for 'osb')
 const typeof( ((type *)0)->member ) *__mptr = (ptr); \
   ^
>> fs/ocfs2/filecheck.c:444:28: note: in expansion of macro 'container_of'
 struct oc

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski
On Mon, Dec 19, 2016 at 5:34 PM, David Miller  wrote:
> From: Alexei Starovoitov 
> Date: Mon, 19 Dec 2016 16:02:56 -0800
>
>> huh? 'not right api' because it's using bpf syscall instead
>> of cgroup control-file? I think the opposite is the truth.
>
> I completely agree with Alexei on this.

So what happens when someone adds another type of filter?  Let's say
there's a simple, no-privilege-required list of allowed address
families that can hook up to the socket creation hook for a cgroup.
Does BPF_PROG_DETACH still detach it?  Or would both the bpf *and* the
list of allowed address families be in force?  If the latter, why
wouldn't two BPF programs on the same hook be allowed?

Concretely:

# mkdir /cgroup/a
# set_up_bpf_socket_rule /cgroup/a
# set_up_list_of_address_families /cgroup/a
# cat /cgroup/a/some_new_file [what gets displayed?]
# BPF_PROG_DETACH: what happens

By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
even take a reference to a BPF program as an argument.  What is it
supposed to do if this mechanism ever gets extended?

--Andy


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski
On Mon, Dec 19, 2016 at 5:34 PM, David Miller  wrote:
> From: Alexei Starovoitov 
> Date: Mon, 19 Dec 2016 16:02:56 -0800
>
>> huh? 'not right api' because it's using bpf syscall instead
>> of cgroup control-file? I think the opposite is the truth.
>
> I completely agree with Alexei on this.

So what happens when someone adds another type of filter?  Let's say
there's a simple, no-privilege-required list of allowed address
families that can hook up to the socket creation hook for a cgroup.
Does BPF_PROG_DETACH still detach it?  Or would both the bpf *and* the
list of allowed address families be in force?  If the latter, why
wouldn't two BPF programs on the same hook be allowed?

Concretely:

# mkdir /cgroup/a
# set_up_bpf_socket_rule /cgroup/a
# set_up_list_of_address_families /cgroup/a
# cat /cgroup/a/some_new_file [what gets displayed?]
# BPF_PROG_DETACH: what happens

By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
even take a reference to a BPF program as an argument.  What is it
supposed to do if this mechanism ever gets extended?

--Andy


Re: [PATCH] x86/microcode: Adjust ramdisk address when accessing by virtual address

2016-12-19 Thread Boris Ostrovsky



On 12/19/2016 08:27 PM, Boris Ostrovsky wrote:



On 12/19/2016 06:32 PM, Borislav Petkov wrote:

On Mon, Dec 19, 2016 at 07:43:40PM +0100, Borislav Petkov wrote:

On Mon, Dec 19, 2016 at 01:12:25PM -0500, Boris Ostrovsky wrote:

IIUIC find_microcode_in_initrd() is called with paging on only on Intel
(which is where I observed it).


Ah, that was an important fact. Yes, I can repro it now.


Ok, questions:

* does your guest relocate the ramdisk?



This is not a guest. I crashed with baremetal kernel.




I.e., do you see something like this in dmesg before the splat:

[0.00] RAMDISK: [mem 0x7f84c000-0x7ffc]
[0.00] Allocated new RAMDISK: [mem 0x3647a000-0x36bfd9e6]
[0.00] Move RAMDISK from [mem 0x7f84c000-0x7ffcf9e6] to [mem
0x3647a000-0x36bfd9e6]
   ^^



Sorry, forgot about this: I see the first line but not the other two (so 
the relocation did not occur).



-boris



If not, then I know what happens.

Also, does it work if you change these lines:

if (!use_pa && relocated_ramdisk)
start = initrd_start;

to:

if (!use_pa)
start = initrd_start;


Yes, it does.

I also thought it might be better but I haven't gone through the code to
make sure this would always work.

I can run more tests tomorrow if you want.

-boris




Because if that works, I can actually simplify that function radically.

But more tomorrow.

Thanks.



Re: [PATCH] x86/microcode: Adjust ramdisk address when accessing by virtual address

2016-12-19 Thread Boris Ostrovsky



On 12/19/2016 08:27 PM, Boris Ostrovsky wrote:



On 12/19/2016 06:32 PM, Borislav Petkov wrote:

On Mon, Dec 19, 2016 at 07:43:40PM +0100, Borislav Petkov wrote:

On Mon, Dec 19, 2016 at 01:12:25PM -0500, Boris Ostrovsky wrote:

IIUIC find_microcode_in_initrd() is called with paging on only on Intel
(which is where I observed it).


Ah, that was an important fact. Yes, I can repro it now.


Ok, questions:

* does your guest relocate the ramdisk?



This is not a guest. I crashed with baremetal kernel.




I.e., do you see something like this in dmesg before the splat:

[0.00] RAMDISK: [mem 0x7f84c000-0x7ffc]
[0.00] Allocated new RAMDISK: [mem 0x3647a000-0x36bfd9e6]
[0.00] Move RAMDISK from [mem 0x7f84c000-0x7ffcf9e6] to [mem
0x3647a000-0x36bfd9e6]
   ^^



Sorry, forgot about this: I see the first line but not the other two (so 
the relocation did not occur).



-boris



If not, then I know what happens.

Also, does it work if you change these lines:

if (!use_pa && relocated_ramdisk)
start = initrd_start;

to:

if (!use_pa)
start = initrd_start;


Yes, it does.

I also thought it might be better but I haven't gone through the code to
make sure this would always work.

I can run more tests tomorrow if you want.

-boris




Because if that works, I can actually simplify that function radically.

But more tomorrow.

Thanks.



Re: [PATCH v2 1/4] net: hix5hd2_gmac: add generic compatible string

2016-12-19 Thread Dongpo Li


On 2016/12/20 0:04, Rob Herring wrote:
> On Mon, Dec 19, 2016 at 2:14 AM, Dongpo Li  wrote:
>> Hi Rob and David,
>>
>> On 2016/12/12 22:21, Rob Herring wrote:
>>> On Mon, Dec 12, 2016 at 5:16 AM, Dongpo Li  wrote:
 Hi Rob,

 On 2016/12/10 6:35, Rob Herring wrote:
> On Mon, Dec 05, 2016 at 09:27:58PM +0800, Dongpo Li wrote:
> 
> [...]
> 
>> @@ -20,7 +25,7 @@ Required properties:
>>
>>  Example:
>>  gmac0: ethernet@f984 {
>> -compatible = "hisilicon,hix5hd2-gmac";
>> +compatible = "hisilicon,hix5hd2-gemac", 
>> "hisilicon,hisi-gemac-v1";
>
> You can't just change compatible strings.
>
 Okay, maybe I should name all the compatible string with the suffix 
 "-gmac" instead of
 "-gemac". This can keep the compatible strings with the same suffix. Is 
 this okay?
 Can I just add the generic compatible string without changing the SoCs 
 compatible string?
 Like following:
 gmac0: ethernet@f984 {
  -  compatible = "hisilicon,hix5hd2-gmac";
  +  compatible = "hisilicon,hix5hd2-gmac", 
 "hisilicon,hisi-gmac-v1";
>>>
>>> Yes, this is fine.
>>
>> As the patch series have been applied to net-next branch,
>> in which way should I commit this compatible fix?
>> Should I send a new patch fixing this compatible string error with "Fixes: 
>> xxx"?
>> Looking forward to your reply!
> 
> Yes to both.

Okay, thanks for your reply!
I will send the fix patch series as soon as possible.


Regards,
Dongpo

.



Re: [PATCH v2 1/4] net: hix5hd2_gmac: add generic compatible string

2016-12-19 Thread Dongpo Li


On 2016/12/20 0:04, Rob Herring wrote:
> On Mon, Dec 19, 2016 at 2:14 AM, Dongpo Li  wrote:
>> Hi Rob and David,
>>
>> On 2016/12/12 22:21, Rob Herring wrote:
>>> On Mon, Dec 12, 2016 at 5:16 AM, Dongpo Li  wrote:
 Hi Rob,

 On 2016/12/10 6:35, Rob Herring wrote:
> On Mon, Dec 05, 2016 at 09:27:58PM +0800, Dongpo Li wrote:
> 
> [...]
> 
>> @@ -20,7 +25,7 @@ Required properties:
>>
>>  Example:
>>  gmac0: ethernet@f984 {
>> -compatible = "hisilicon,hix5hd2-gmac";
>> +compatible = "hisilicon,hix5hd2-gemac", 
>> "hisilicon,hisi-gemac-v1";
>
> You can't just change compatible strings.
>
 Okay, maybe I should name all the compatible string with the suffix 
 "-gmac" instead of
 "-gemac". This can keep the compatible strings with the same suffix. Is 
 this okay?
 Can I just add the generic compatible string without changing the SoCs 
 compatible string?
 Like following:
 gmac0: ethernet@f984 {
  -  compatible = "hisilicon,hix5hd2-gmac";
  +  compatible = "hisilicon,hix5hd2-gmac", 
 "hisilicon,hisi-gmac-v1";
>>>
>>> Yes, this is fine.
>>
>> As the patch series have been applied to net-next branch,
>> in which way should I commit this compatible fix?
>> Should I send a new patch fixing this compatible string error with "Fixes: 
>> xxx"?
>> Looking forward to your reply!
> 
> Yes to both.

Okay, thanks for your reply!
I will send the fix patch series as soon as possible.


Regards,
Dongpo

.



Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread David Miller
From: Alexei Starovoitov 
Date: Mon, 19 Dec 2016 16:02:56 -0800

> huh? 'not right api' because it's using bpf syscall instead
> of cgroup control-file? I think the opposite is the truth.

I completely agree with Alexei on this.


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread David Miller
From: Alexei Starovoitov 
Date: Mon, 19 Dec 2016 16:02:56 -0800

> huh? 'not right api' because it's using bpf syscall instead
> of cgroup control-file? I think the opposite is the truth.

I completely agree with Alexei on this.


Re: [PATCH] x86/microcode: Adjust ramdisk address when accessing by virtual address

2016-12-19 Thread Boris Ostrovsky



On 12/19/2016 06:32 PM, Borislav Petkov wrote:

On Mon, Dec 19, 2016 at 07:43:40PM +0100, Borislav Petkov wrote:

On Mon, Dec 19, 2016 at 01:12:25PM -0500, Boris Ostrovsky wrote:

IIUIC find_microcode_in_initrd() is called with paging on only on Intel
(which is where I observed it).


Ah, that was an important fact. Yes, I can repro it now.


Ok, questions:

* does your guest relocate the ramdisk?



This is not a guest. I crashed with baremetal kernel.




I.e., do you see something like this in dmesg before the splat:

[0.00] RAMDISK: [mem 0x7f84c000-0x7ffc]
[0.00] Allocated new RAMDISK: [mem 0x3647a000-0x36bfd9e6]
[0.00] Move RAMDISK from [mem 0x7f84c000-0x7ffcf9e6] to [mem 
0x3647a000-0x36bfd9e6]
   ^^

If not, then I know what happens.

Also, does it work if you change these lines:

if (!use_pa && relocated_ramdisk)
start = initrd_start;

to:

if (!use_pa)
start = initrd_start;


Yes, it does.

I also thought it might be better but I haven't gone through the code to 
make sure this would always work.


I can run more tests tomorrow if you want.

-boris




Because if that works, I can actually simplify that function radically.

But more tomorrow.

Thanks.



Re: [PATCH] x86/microcode: Adjust ramdisk address when accessing by virtual address

2016-12-19 Thread Boris Ostrovsky



On 12/19/2016 06:32 PM, Borislav Petkov wrote:

On Mon, Dec 19, 2016 at 07:43:40PM +0100, Borislav Petkov wrote:

On Mon, Dec 19, 2016 at 01:12:25PM -0500, Boris Ostrovsky wrote:

IIUIC find_microcode_in_initrd() is called with paging on only on Intel
(which is where I observed it).


Ah, that was an important fact. Yes, I can repro it now.


Ok, questions:

* does your guest relocate the ramdisk?



This is not a guest. I crashed with baremetal kernel.




I.e., do you see something like this in dmesg before the splat:

[0.00] RAMDISK: [mem 0x7f84c000-0x7ffc]
[0.00] Allocated new RAMDISK: [mem 0x3647a000-0x36bfd9e6]
[0.00] Move RAMDISK from [mem 0x7f84c000-0x7ffcf9e6] to [mem 
0x3647a000-0x36bfd9e6]
   ^^

If not, then I know what happens.

Also, does it work if you change these lines:

if (!use_pa && relocated_ramdisk)
start = initrd_start;

to:

if (!use_pa)
start = initrd_start;


Yes, it does.

I also thought it might be better but I haven't gone through the code to 
make sure this would always work.


I can run more tests tomorrow if you want.

-boris




Because if that works, I can actually simplify that function radically.

But more tomorrow.

Thanks.



Re: [PATCH] ALSA: snd-usb: fix IRQ triggered NULL pointer dereference

2016-12-19 Thread Takashi Sakamoto
---

Hi,

> Commit 16200948d835 ("ALSA: usb-audio: Fix race at stopping the stream")
> fixes a race-codition but it also introduces another really nasty data race
> regression which makes my usb sound card [1] completely useless, throwing
> the kernel into a panic if anything from userspace tries to start playback.
> 
> The problem is this: ep->data_subs is now set to NULL every time inside
> wait_clear_urbs(). ep->data_subs is initalized only in one place in
> start_endpoints(), then it is immediately wiped by the pre-existing call to
> wait_clear_urbs() inside snd_usb_endpoint_start().
> 
> To ilustrate, this is what happens in the non-irq part of the code:
> 
> Step 1 (inside start_endpoints): ep->data_subs = subs;
> Step 2 (inside start_endpoints): snd_usb_endpoint_start(ep, can_sleep);
> Step 3 (inside snd_usb_endpoint_start): wait_clear_urbs(ep);
> Step 4 (inside wait_clear_urbs): ep->data_subs = NULL;
> 
> Here's a simplified call stack for the IRQ part (full one at the end):
> 
> (NULL dereference, param "subs" is passed NULL value of ep->data_subs)
> retire_playback_urb
> retire_outbound_urb
> snd_complete_urb
> (...)
> xhci_irq
> 
> Looking at the git log there seems to be quite a lot of history in this
> part of the codebase, dating back to 2012 or earlier. My fix is based on
> 015618b90 ("ALSA: snd-usb: Fix URB cancellation at stream start") and
> e9ba389c5 ("ALSA: usb-audio: Fix scheduling-while-atomic bug in PCM capture
> stream") with a few modifications of my own. My idea is to do the
> ep->data_subs wiping before endpoint initialization in a manner similar
> to these older commits by using stop_endpoints() in snd_usb_pcm_prepare()
> and at the same time keep the ep->data_subs = NULL in wait_clear_urbs() to
> not trigger the recently fixed stream stopping race again.
> 
> Full call stack:
> 
> [  131.093240] BUG: unable to handle kernel NULL pointer dereference at 
> 0010
> [  131.094313] IP: retire_playback_urb+0x1b/0x160 [snd_usb_audio]
> [  131.095046] PGD 0
> [  131.095047]
> [  131.096509] Oops:  [#1] PREEMPT SMP
> [  131.097255] Modules linked in: fuse snd_usb_audio snd_usbmidi_lib 
> snd_rawmidi snd_seq_device ctr ccm arc4 ath9k intel_rapl ath9k_common 
> x86_pkg_temp_thermal ath9k_hw intel_powerclamp coretemp joydev mousedev ath 
> kvm_intel mac80211 kvm
>  input_leds hid_generic psmouse irqbypass usbhid hid crct10dif_pclmul 
> crc32_pclmul serio_raw crc32c_intel atkbd ghash_clmulni_intel 
> snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel 
> pcbc aesni_intel libps2 cfg8
> 0211 aes_x86_64 crypto_simd dell_laptop glue_helper r8169 dell_smbios 
> snd_hda_codec sr_mod cryptd led_class mii rfkill snd_hwdep cdrom snd_hda_core 
> ac dcdbas i8042 xhci_pci serio xhci_hcd snd_pcm tpm_tis battery tpm_tis_core 
> lpc_ich tpm
> snd_timer evdev shpchp i2c_i801 sch_fq_codel
> [  131.105551] CPU: 2 PID: 165 Comm: irq/29-xhci_hcd Not tainted 
> 4.9.0-gd824cdc58ba0 #10
> [  131.107516] Hardware name: Dell Inc. Inspiron 3521/018DYG, BIOS A14 
> 07/31/2015
> [  131.109592] task: 880154a7 task.stack: c9f48000
> [  131.111746] RIP: 0010:retire_playback_urb+0x1b/0x160 [snd_usb_audio]
> [  131.113899] RSP: 0018:c9f4bc10 EFLAGS: 00010082
> [  131.116080] RAX: a04cabe0 RBX:  RCX: 
> 
> [  131.118284] RDX:  RSI: 8801435a4c00 RDI: 
> 
> [  131.120505] RBP: c9f4bc40 R08: 0001 R09: 
> 0001
> [  131.122807] R10: 0001 R11: 82f60d6d R12: 
> 8801432a0238
> [  131.125265] R13: 8801435a4c00 R14:  R15: 
> 8801535c83b8
> [  131.127723] FS:  () GS:88015a00() 
> knlGS:
> [  131.130217] CS:  0010 DS:  ES:  CR0: 80050033
> [  131.132568] CR2: 0010 CR3: 01e0b000 CR4: 
> 001406e0
> [  131.135007] Call Trace:
> [  131.137329]  snd_complete_urb+0x190/0x2b0 [snd_usb_audio]
> [  131.139526]  __usb_hcd_giveback_urb+0x8e/0x160
> [  131.141948]  usb_hcd_giveback_urb+0x43/0x110
> [  131.144104]  xhci_giveback_urb_in_irq.isra.22+0x7d/0xb0 [xhci_hcd]
> [  131.146521]  finish_td.constprop.38+0x1de/0x2f0 [xhci_hcd]
> [  131.148736]  xhci_irq+0x13a2/0x1ca0 [xhci_hcd]
> [  131.150972]  ? trace_hardirqs_on+0xd/0x10
> [  131.153209]  ? _raw_spin_unlock_irq+0x2c/0x50
> [  131.155390]  ? irq_thread+0xb5/0x1d0
> [  131.157771]  xhci_msi_irq+0x11/0x20 [xhci_hcd]
> [  131.159938]  irq_forced_thread_fn+0x2f/0x70
> [  131.162072]  ? irq_thread+0xb5/0x1d0
> [  131.164141]  irq_thread+0x149/0x1d0
> [  131.166132]  ? irq_finalize_oneshot.part.2+0xe0/0xe0
> [  131.168142]  ? wake_threads_waitq+0x30/0x30
> [  131.170149]  kthread+0x10f/0x150
> [  131.172144]  ? irq_thread_dtor+0xc0/0xc0
> [  131.174139]  ? kthread_create_on_node+0x40/0x40
> [  131.176116]  ret_from_fork+0x2a/0x40
> [  131.178074] Code: 8b 77 64 4c 89 e7 e8 e5 fe ff ff eb c3 

Re: [PATCH] ALSA: snd-usb: fix IRQ triggered NULL pointer dereference

2016-12-19 Thread Takashi Sakamoto
---

Hi,

> Commit 16200948d835 ("ALSA: usb-audio: Fix race at stopping the stream")
> fixes a race-codition but it also introduces another really nasty data race
> regression which makes my usb sound card [1] completely useless, throwing
> the kernel into a panic if anything from userspace tries to start playback.
> 
> The problem is this: ep->data_subs is now set to NULL every time inside
> wait_clear_urbs(). ep->data_subs is initalized only in one place in
> start_endpoints(), then it is immediately wiped by the pre-existing call to
> wait_clear_urbs() inside snd_usb_endpoint_start().
> 
> To ilustrate, this is what happens in the non-irq part of the code:
> 
> Step 1 (inside start_endpoints): ep->data_subs = subs;
> Step 2 (inside start_endpoints): snd_usb_endpoint_start(ep, can_sleep);
> Step 3 (inside snd_usb_endpoint_start): wait_clear_urbs(ep);
> Step 4 (inside wait_clear_urbs): ep->data_subs = NULL;
> 
> Here's a simplified call stack for the IRQ part (full one at the end):
> 
> (NULL dereference, param "subs" is passed NULL value of ep->data_subs)
> retire_playback_urb
> retire_outbound_urb
> snd_complete_urb
> (...)
> xhci_irq
> 
> Looking at the git log there seems to be quite a lot of history in this
> part of the codebase, dating back to 2012 or earlier. My fix is based on
> 015618b90 ("ALSA: snd-usb: Fix URB cancellation at stream start") and
> e9ba389c5 ("ALSA: usb-audio: Fix scheduling-while-atomic bug in PCM capture
> stream") with a few modifications of my own. My idea is to do the
> ep->data_subs wiping before endpoint initialization in a manner similar
> to these older commits by using stop_endpoints() in snd_usb_pcm_prepare()
> and at the same time keep the ep->data_subs = NULL in wait_clear_urbs() to
> not trigger the recently fixed stream stopping race again.
> 
> Full call stack:
> 
> [  131.093240] BUG: unable to handle kernel NULL pointer dereference at 
> 0010
> [  131.094313] IP: retire_playback_urb+0x1b/0x160 [snd_usb_audio]
> [  131.095046] PGD 0
> [  131.095047]
> [  131.096509] Oops:  [#1] PREEMPT SMP
> [  131.097255] Modules linked in: fuse snd_usb_audio snd_usbmidi_lib 
> snd_rawmidi snd_seq_device ctr ccm arc4 ath9k intel_rapl ath9k_common 
> x86_pkg_temp_thermal ath9k_hw intel_powerclamp coretemp joydev mousedev ath 
> kvm_intel mac80211 kvm
>  input_leds hid_generic psmouse irqbypass usbhid hid crct10dif_pclmul 
> crc32_pclmul serio_raw crc32c_intel atkbd ghash_clmulni_intel 
> snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel 
> pcbc aesni_intel libps2 cfg8
> 0211 aes_x86_64 crypto_simd dell_laptop glue_helper r8169 dell_smbios 
> snd_hda_codec sr_mod cryptd led_class mii rfkill snd_hwdep cdrom snd_hda_core 
> ac dcdbas i8042 xhci_pci serio xhci_hcd snd_pcm tpm_tis battery tpm_tis_core 
> lpc_ich tpm
> snd_timer evdev shpchp i2c_i801 sch_fq_codel
> [  131.105551] CPU: 2 PID: 165 Comm: irq/29-xhci_hcd Not tainted 
> 4.9.0-gd824cdc58ba0 #10
> [  131.107516] Hardware name: Dell Inc. Inspiron 3521/018DYG, BIOS A14 
> 07/31/2015
> [  131.109592] task: 880154a7 task.stack: c9f48000
> [  131.111746] RIP: 0010:retire_playback_urb+0x1b/0x160 [snd_usb_audio]
> [  131.113899] RSP: 0018:c9f4bc10 EFLAGS: 00010082
> [  131.116080] RAX: a04cabe0 RBX:  RCX: 
> 
> [  131.118284] RDX:  RSI: 8801435a4c00 RDI: 
> 
> [  131.120505] RBP: c9f4bc40 R08: 0001 R09: 
> 0001
> [  131.122807] R10: 0001 R11: 82f60d6d R12: 
> 8801432a0238
> [  131.125265] R13: 8801435a4c00 R14:  R15: 
> 8801535c83b8
> [  131.127723] FS:  () GS:88015a00() 
> knlGS:
> [  131.130217] CS:  0010 DS:  ES:  CR0: 80050033
> [  131.132568] CR2: 0010 CR3: 01e0b000 CR4: 
> 001406e0
> [  131.135007] Call Trace:
> [  131.137329]  snd_complete_urb+0x190/0x2b0 [snd_usb_audio]
> [  131.139526]  __usb_hcd_giveback_urb+0x8e/0x160
> [  131.141948]  usb_hcd_giveback_urb+0x43/0x110
> [  131.144104]  xhci_giveback_urb_in_irq.isra.22+0x7d/0xb0 [xhci_hcd]
> [  131.146521]  finish_td.constprop.38+0x1de/0x2f0 [xhci_hcd]
> [  131.148736]  xhci_irq+0x13a2/0x1ca0 [xhci_hcd]
> [  131.150972]  ? trace_hardirqs_on+0xd/0x10
> [  131.153209]  ? _raw_spin_unlock_irq+0x2c/0x50
> [  131.155390]  ? irq_thread+0xb5/0x1d0
> [  131.157771]  xhci_msi_irq+0x11/0x20 [xhci_hcd]
> [  131.159938]  irq_forced_thread_fn+0x2f/0x70
> [  131.162072]  ? irq_thread+0xb5/0x1d0
> [  131.164141]  irq_thread+0x149/0x1d0
> [  131.166132]  ? irq_finalize_oneshot.part.2+0xe0/0xe0
> [  131.168142]  ? wake_threads_waitq+0x30/0x30
> [  131.170149]  kthread+0x10f/0x150
> [  131.172144]  ? irq_thread_dtor+0xc0/0xc0
> [  131.174139]  ? kthread_create_on_node+0x40/0x40
> [  131.176116]  ret_from_fork+0x2a/0x40
> [  131.178074] Code: 8b 77 64 4c 89 e7 e8 e5 fe ff ff eb c3 

Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-12-19 Thread Dan Williams
On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong
 wrote:
> On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
>> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
>> <>
>> > Definitely the first step would be your simple preallocated per
>> > inode approach until it is shown to be insufficient.
>>
>> Reviving this thread a few months later...
>>
>> Dave, we're interested in taking a serious look at what it would take to get
>> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
>> become, with some amount of work) a workable solution?
>>
>> We're happy to do the grunt work for this feature, but we will probably need
>> guidance from someone with more XFS experience.  With you out on extended 
>> leave
>> the first half of 2017, who would be the best person to ask for this 
>> guidance?
>> Darrick?
>
> Yes, probably. :)
>
> I think where we left off with this (on the XFS side) is some sort of
> fallocate mode that would allocate blocks, zero them, and then set the
> DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
> file and thereby gain the ability to control write persistents behavior
> without having to worry about fs metadata updates.  As an added plus, I
> think zeroing the pmem also clears media errors, or something like that.
>
>  Is that a reasonable starting point?  My memory is a little foggy.
>
> Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
> read that.

That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE
via a character device interface. It's useful for cases where you want
an entire nvdimm namespace/volume in "no fs-metadata to worry about"
mode.  But, for sub-allocations of a namespace and support for
existing tooling, PMEM_IMMUTABLE is much more usable.


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-12-19 Thread Dan Williams
On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong
 wrote:
> On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
>> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
>> <>
>> > Definitely the first step would be your simple preallocated per
>> > inode approach until it is shown to be insufficient.
>>
>> Reviving this thread a few months later...
>>
>> Dave, we're interested in taking a serious look at what it would take to get
>> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
>> become, with some amount of work) a workable solution?
>>
>> We're happy to do the grunt work for this feature, but we will probably need
>> guidance from someone with more XFS experience.  With you out on extended 
>> leave
>> the first half of 2017, who would be the best person to ask for this 
>> guidance?
>> Darrick?
>
> Yes, probably. :)
>
> I think where we left off with this (on the XFS side) is some sort of
> fallocate mode that would allocate blocks, zero them, and then set the
> DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
> file and thereby gain the ability to control write persistents behavior
> without having to worry about fs metadata updates.  As an added plus, I
> think zeroing the pmem also clears media errors, or something like that.
>
>  Is that a reasonable starting point?  My memory is a little foggy.
>
> Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
> read that.

That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE
via a character device interface. It's useful for cases where you want
an entire nvdimm namespace/volume in "no fs-metadata to worry about"
mode.  But, for sub-allocations of a namespace and support for
existing tooling, PMEM_IMMUTABLE is much more usable.


Re: [PATCH] fbdev: remove current maintainer

2016-12-19 Thread Laurent Pinchart
Hi Daniel,

On Thursday 08 Dec 2016 11:04:47 Daniel Vetter wrote:
> On Thu, Dec 08, 2016 at 10:34:12AM +0200, Tomi Valkeinen wrote:
> > Remove Tomi Valkeinen from fbdev maintainer and mark fbdev as orphan.
> > 
> > Signed-off-by: Tomi Valkeinen 
> > ---
> > 
> > I just don't have time to even pretend I maintain fbdev, so mark it as
> > Orphan.
> > 
> > Anyone want to pick fbdev maintainership?
> 
> I think for fbdev stuff that we need for the fbdev emulation in drm we
> could merge it through drm-misc.

Or we could instead invest time on getting rid of that last dependency on 
fbdev ? :-)

> And I'd very much welcome if you can
> still review those patches (e.g. the nommu enabling that's floating right
> now). If you want can also give you commit rights to drm-misc for that
> area. Or drm-misc maintainers can apply those patches.
> 
> Anything else would be still out of scope ofc.
> 
> What we might want to do is add dri-devel to the mailing lists, to make
> sure patch submissions get redicted to the right subsystem. Mail volume
> seems pretty low compared to dri-devel, I think we can absorb that.
> -Daniel
> 
> >  MAINTAINERS | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 851b89b9edcb..fc2790eb4e84 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -4968,10 +4968,9 @@ F:   drivers/net/wan/dlci.c
> > 
> >  F: drivers/net/wan/sdla.c
> >  
> >  FRAMEBUFFER LAYER
> > 
> > -M: Tomi Valkeinen 
> > 
> >  L: linux-fb...@vger.kernel.org
> >  Q: http://patchwork.kernel.org/project/linux-fbdev/list/
> > 
> > -S: Maintained
> > +S: Orphan
> > 
> >  F: Documentation/fb/
> >  F: drivers/video/
> >  F: include/video/

-- 
Regards,

Laurent Pinchart



Re: [PATCH] fbdev: remove current maintainer

2016-12-19 Thread Laurent Pinchart
Hi Daniel,

On Thursday 08 Dec 2016 11:04:47 Daniel Vetter wrote:
> On Thu, Dec 08, 2016 at 10:34:12AM +0200, Tomi Valkeinen wrote:
> > Remove Tomi Valkeinen from fbdev maintainer and mark fbdev as orphan.
> > 
> > Signed-off-by: Tomi Valkeinen 
> > ---
> > 
> > I just don't have time to even pretend I maintain fbdev, so mark it as
> > Orphan.
> > 
> > Anyone want to pick fbdev maintainership?
> 
> I think for fbdev stuff that we need for the fbdev emulation in drm we
> could merge it through drm-misc.

Or we could instead invest time on getting rid of that last dependency on 
fbdev ? :-)

> And I'd very much welcome if you can
> still review those patches (e.g. the nommu enabling that's floating right
> now). If you want can also give you commit rights to drm-misc for that
> area. Or drm-misc maintainers can apply those patches.
> 
> Anything else would be still out of scope ofc.
> 
> What we might want to do is add dri-devel to the mailing lists, to make
> sure patch submissions get redicted to the right subsystem. Mail volume
> seems pretty low compared to dri-devel, I think we can absorb that.
> -Daniel
> 
> >  MAINTAINERS | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 851b89b9edcb..fc2790eb4e84 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -4968,10 +4968,9 @@ F:   drivers/net/wan/dlci.c
> > 
> >  F: drivers/net/wan/sdla.c
> >  
> >  FRAMEBUFFER LAYER
> > 
> > -M: Tomi Valkeinen 
> > 
> >  L: linux-fb...@vger.kernel.org
> >  Q: http://patchwork.kernel.org/project/linux-fbdev/list/
> > 
> > -S: Maintained
> > +S: Orphan
> > 
> >  F: Documentation/fb/
> >  F: drivers/video/
> >  F: include/video/

-- 
Regards,

Laurent Pinchart



Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-12-19 Thread Darrick J. Wong
On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
> <>
> > Definitely the first step would be your simple preallocated per
> > inode approach until it is shown to be insufficient.
> 
> Reviving this thread a few months later...
> 
> Dave, we're interested in taking a serious look at what it would take to get
> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
> become, with some amount of work) a workable solution?
> 
> We're happy to do the grunt work for this feature, but we will probably need
> guidance from someone with more XFS experience.  With you out on extended 
> leave
> the first half of 2017, who would be the best person to ask for this guidance?
> Darrick?

Yes, probably. :)

I think where we left off with this (on the XFS side) is some sort of
fallocate mode that would allocate blocks, zero them, and then set the
DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
file and thereby gain the ability to control write persistents behavior
without having to worry about fs metadata updates.  As an added plus, I
think zeroing the pmem also clears media errors, or something like that.

 Is that a reasonable starting point?  My memory is a little foggy.

Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
read that.

--D

> 
> Thanks,
> - Ross


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-12-19 Thread Darrick J. Wong
On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote:
> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote:
> <>
> > Definitely the first step would be your simple preallocated per
> > inode approach until it is shown to be insufficient.
> 
> Reviving this thread a few months later...
> 
> Dave, we're interested in taking a serious look at what it would take to get
> PMEM_IMMUTABLE working.  Do you still hold the opinion that this is (or could
> become, with some amount of work) a workable solution?
> 
> We're happy to do the grunt work for this feature, but we will probably need
> guidance from someone with more XFS experience.  With you out on extended 
> leave
> the first half of 2017, who would be the best person to ask for this guidance?
> Darrick?

Yes, probably. :)

I think where we left off with this (on the XFS side) is some sort of
fallocate mode that would allocate blocks, zero them, and then set the
DAX and PMEM_IMMUTABLE on-disk inode flags.  After that, you'd mmap the
file and thereby gain the ability to control write persistents behavior
without having to worry about fs metadata updates.  As an added plus, I
think zeroing the pmem also clears media errors, or something like that.

 Is that a reasonable starting point?  My memory is a little foggy.

Hmm, I see Dan just posted something about blockdev fallocate.  I'll go
read that.

--D

> 
> Thanks,
> - Ross


Re: [PATCH] nvme-fabrics: remove some logically dead code performing redundant ret checks

2016-12-19 Thread James Smart

Looks good.

-- james

Signed-off-by: James Smart 



On 12/9/2016 6:59 AM, Colin King wrote:

From: Colin Ian King 

The check to see if ret is non-zero and return this rather than count
is redundant in two occassions.  It is redundant because prior to this
check, the return code ret is already checked for a non-zero error
return value and we return from the function at that point.

Signed-off-by: Colin Ian King 
---
  drivers/nvme/target/fcloop.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/fcloop.c b/drivers/nvme/target/fcloop.c
index bcb8ebe..4e8e6a2 100644
--- a/drivers/nvme/target/fcloop.c
+++ b/drivers/nvme/target/fcloop.c
@@ -845,7 +845,7 @@ fcloop_create_remote_port(struct device *dev, struct 
device_attribute *attr,
rport->lport = nport->lport;
nport->rport = rport;
  
-	return ret ? ret : count;

+   return count;
  }
  
  
@@ -952,7 +952,7 @@ fcloop_create_target_port(struct device *dev, struct device_attribute *attr,

tport->lport = nport->lport;
nport->tport = tport;
  
-	return ret ? ret : count;

+   return count;
  }
  
  




Re: [PATCH] nvme-fabrics: remove some logically dead code performing redundant ret checks

2016-12-19 Thread James Smart

Looks good.

-- james

Signed-off-by: James Smart 



On 12/9/2016 6:59 AM, Colin King wrote:

From: Colin Ian King 

The check to see if ret is non-zero and return this rather than count
is redundant in two occassions.  It is redundant because prior to this
check, the return code ret is already checked for a non-zero error
return value and we return from the function at that point.

Signed-off-by: Colin Ian King 
---
  drivers/nvme/target/fcloop.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/fcloop.c b/drivers/nvme/target/fcloop.c
index bcb8ebe..4e8e6a2 100644
--- a/drivers/nvme/target/fcloop.c
+++ b/drivers/nvme/target/fcloop.c
@@ -845,7 +845,7 @@ fcloop_create_remote_port(struct device *dev, struct 
device_attribute *attr,
rport->lport = nport->lport;
nport->rport = rport;
  
-	return ret ? ret : count;

+   return count;
  }
  
  
@@ -952,7 +952,7 @@ fcloop_create_target_port(struct device *dev, struct device_attribute *attr,

tport->lport = nport->lport;
nport->tport = tport;
  
-	return ret ? ret : count;

+   return count;
  }
  
  




Re: [patch] nvme-fabrics: correct some printk information

2016-12-19 Thread James Smart

Dan,

Mind if I solve this a different way ?  I really don't know why knowing 
the ptr value is even meaningful


-- james


On 12/10/2016 1:06 AM, Dan Carpenter wrote:

We really don't care where "ctrl" is on the stack since we're just
returning soon what we want is the actual ctrl pointer itself.

Signed-off-by: Dan Carpenter 

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 771e2e761872..e6395ed2f562 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2402,7 +2402,7 @@ enum blk_eh_timer_return
  
  	dev_info(ctrl->ctrl.device,

"NVME-FC{%d}: new ctrl: NQN \"%s\" (%p)\n",
-   ctrl->cnum, ctrl->ctrl.opts->subsysnqn, );
+   ctrl->cnum, ctrl->ctrl.opts->subsysnqn, ctrl);
  
  	kref_get(>ctrl.kref);
  




Re: [patch] nvme-fabrics: correct some printk information

2016-12-19 Thread James Smart

Dan,

Mind if I solve this a different way ?  I really don't know why knowing 
the ptr value is even meaningful


-- james


On 12/10/2016 1:06 AM, Dan Carpenter wrote:

We really don't care where "ctrl" is on the stack since we're just
returning soon what we want is the actual ctrl pointer itself.

Signed-off-by: Dan Carpenter 

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 771e2e761872..e6395ed2f562 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2402,7 +2402,7 @@ enum blk_eh_timer_return
  
  	dev_info(ctrl->ctrl.device,

"NVME-FC{%d}: new ctrl: NQN \"%s\" (%p)\n",
-   ctrl->cnum, ctrl->ctrl.opts->subsysnqn, );
+   ctrl->cnum, ctrl->ctrl.opts->subsysnqn, ctrl);
  
  	kref_get(>ctrl.kref);
  




[PULL] One last documentation fix

2016-12-19 Thread Jonathan Corbet
The following changes since commit
3fa71d0f58a9b9df84e8e79196f961bcfbf01b2e:

  crypto: doc - optimize compilation (2016-12-13 16:38:07 -0700)

are available in the git repository at:

  git://git.lwn.net/linux.git tags/doc-4.10-3

for you to fetch changes up to 217e2bfab22e740227df09f22165e834cddd8a3b:

  docs: sphinx-extensions: make rstFlatTable work with docutils 0.13 
(2016-12-18 13:30:29 -0700)


A single fix for the build system.  It would appear that the docutils
developers, in their wisdom, broke the API in the 0.13 release.  This fix
detects the breakage and allows the docs to be built with both the old and
new versions.


Dmitry Shachnev (1):
  docs: sphinx-extensions: make rstFlatTable work with docutils 0.13

 Documentation/sphinx/rstFlatTable.py | 5 +
 1 file changed, 5 insertions(+)


[PULL] One last documentation fix

2016-12-19 Thread Jonathan Corbet
The following changes since commit
3fa71d0f58a9b9df84e8e79196f961bcfbf01b2e:

  crypto: doc - optimize compilation (2016-12-13 16:38:07 -0700)

are available in the git repository at:

  git://git.lwn.net/linux.git tags/doc-4.10-3

for you to fetch changes up to 217e2bfab22e740227df09f22165e834cddd8a3b:

  docs: sphinx-extensions: make rstFlatTable work with docutils 0.13 
(2016-12-18 13:30:29 -0700)


A single fix for the build system.  It would appear that the docutils
developers, in their wisdom, broke the API in the 0.13 release.  This fix
detects the breakage and allows the docs to be built with both the old and
new versions.


Dmitry Shachnev (1):
  docs: sphinx-extensions: make rstFlatTable work with docutils 0.13

 Documentation/sphinx/rstFlatTable.py | 5 +
 1 file changed, 5 insertions(+)


mmotm 2016-12-19-16-31 uploaded

2016-12-19 Thread akpm
The mm-of-the-moment snapshot 2016-12-19-16-31 has been uploaded to

   http://www.ozlabs.org/~akpm/mmotm/

mmotm-readme.txt says

README for mm-of-the-moment:

http://www.ozlabs.org/~akpm/mmotm/

This is a snapshot of my -mm patch queue.  Uploaded at random hopefully
more than once a week.

You will need quilt to apply these patches to the latest Linus release (4.x
or 4.x-rcY).  The series file is in broken-out.tar.gz and is duplicated in
http://ozlabs.org/~akpm/mmotm/series

The file broken-out.tar.gz contains two datestamp files: .DATE and
.DATE--mm-dd-hh-mm-ss.  Both contain the string -mm-dd-hh-mm-ss,
followed by the base kernel version against which this patch series is to
be applied.

This tree is partially included in linux-next.  To see which patches are
included in linux-next, consult the `series' file.  Only the patches
within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
linux-next.

A git tree which contains the memory management portion of this tree is
maintained at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
by Michal Hocko.  It contains the patches which are between the
"#NEXT_PATCHES_START mm" and "#NEXT_PATCHES_END" markers, from the series
file, http://www.ozlabs.org/~akpm/mmotm/series.


A full copy of the full kernel tree with the linux-next and mmotm patches
already applied is available through git within an hour of the mmotm
release.  Individual mmotm releases are tagged.  The master branch always
points to the latest release, so it's constantly rebasing.

http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/

To develop on top of mmotm git:

  $ git remote add mmotm 
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
  $ git remote update mmotm
  $ git checkout -b topic mmotm/master
  
  $ git send-email mmotm/master.. [...]

To rebase a branch with older patches to a new mmotm release:

  $ git remote update mmotm
  $ git rebase --onto mmotm/master  topic




The directory http://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second)
contains daily snapshots of the -mm tree.  It is updated more frequently
than mmotm, and is untested.

A git copy of this tree is available at

http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/

and use of this tree is similar to
http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/, described above.


This mmotm tree contains the following patches against 4.9:
(patches marked "*" will be included in linux-next)

  origin.patch
* powerpc-ima-get-the-kexec-buffer-passed-by-the-previous-kernel.patch
* ima-on-soft-reboot-restore-the-measurement-list.patch
* ima-permit-duplicate-measurement-list-entries.patch
* ima-maintain-memory-size-needed-for-serializing-the-measurement-list.patch
* powerpc-ima-send-the-kexec-buffer-to-the-next-kernel.patch
* ima-on-soft-reboot-save-the-measurement-list.patch
* ima-store-the-builtin-custom-template-definitions-in-a-list.patch
* ima-support-restoring-multiple-template-formats.patch
* ima-define-a-canonical-binary_runtime_measurements-list-format.patch
* ima-platform-independent-hash-value.patch
* mm-fadvise-avoid-expensive-remote-lru-cache-draining-after-fadv_dontneed.patch
* arm64-setup-introduce-kaslr_offset.patch
* kcov-make-kcov-work-properly-with-kaslr-enabled.patch
* ratelimit-fix-warn_on_ratelimit-return-value.patch
* printk-fix-typo-in-console_loglevel_default-help-text.patch
  i-need-old-gcc.patch
* mm-page_alloc-fix-incorrect-zone_statistics-data.patch
* mm-thp-pagecache-only-withdraw-page-table-after-a-successful-deposit.patch
* 
mm-thp-pagecache-collapse-free-the-pte-page-table-on-collapse-for-thp-page-cache.patch
* arm-arch-arm-include-asm-pageh-needs-personalityh.patch
* scripts-spellingtxt-add-several-more-common-spelling-mistakes.patch
* 
ocfs2-old-mle-put-and-release-after-the-function-dlm_add_migration_mle-called.patch
* 
ocfs2-old-mle-put-and-release-after-the-function-dlm_add_migration_mle-called-fix.patch
* ocfs2-fix-crash-caused-by-stale-lvb-with-fsdlm-plugin.patch
* 
block-restore-proc-partitions-to-not-display-non-partitionable-removable-devices.patch
* kernel-watchdog-prevent-false-hardlockup-on-overloaded-system.patch
* kernel-watchdog-prevent-false-hardlockup-on-overloaded-system-fix.patch
  mm.patch
* tmpfs-change-shmem_mapping-to-test-shmem_aops.patch
* mm-throttle-show_mem-from-warn_alloc.patch
* mm-throttle-show_mem-from-warn_alloc-fix.patch
* mm-page_alloc-dont-convert-pfn-to-idx-when-merging.patch
* mm-page_alloc-avoid-page_to_pfn-when-merging-buddies.patch
* z3fold-limit-first_num-to-the-actual-range-of-possible-buddy-indexes.patch
* z3fold-make-pages_nr-atomic.patch
* z3fold-extend-compaction-function.patch
* z3fold-use-per-page-spinlock.patch
* z3fold-discourage-use-of-pages-that-werent-compacted.patch
* z3fold-fix-header-size-related-issues.patch
* z3fold-fix-locking-issues.patch
* mm-page_owner-align-with-pageblock_nr-pages.patch
* mm-walk-the-zone-in-pageblock_nr_pages-steps.patch
* lib-add-crc64-ecma-module.patch
* 

mmotm 2016-12-19-16-31 uploaded

2016-12-19 Thread akpm
The mm-of-the-moment snapshot 2016-12-19-16-31 has been uploaded to

   http://www.ozlabs.org/~akpm/mmotm/

mmotm-readme.txt says

README for mm-of-the-moment:

http://www.ozlabs.org/~akpm/mmotm/

This is a snapshot of my -mm patch queue.  Uploaded at random hopefully
more than once a week.

You will need quilt to apply these patches to the latest Linus release (4.x
or 4.x-rcY).  The series file is in broken-out.tar.gz and is duplicated in
http://ozlabs.org/~akpm/mmotm/series

The file broken-out.tar.gz contains two datestamp files: .DATE and
.DATE--mm-dd-hh-mm-ss.  Both contain the string -mm-dd-hh-mm-ss,
followed by the base kernel version against which this patch series is to
be applied.

This tree is partially included in linux-next.  To see which patches are
included in linux-next, consult the `series' file.  Only the patches
within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
linux-next.

A git tree which contains the memory management portion of this tree is
maintained at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
by Michal Hocko.  It contains the patches which are between the
"#NEXT_PATCHES_START mm" and "#NEXT_PATCHES_END" markers, from the series
file, http://www.ozlabs.org/~akpm/mmotm/series.


A full copy of the full kernel tree with the linux-next and mmotm patches
already applied is available through git within an hour of the mmotm
release.  Individual mmotm releases are tagged.  The master branch always
points to the latest release, so it's constantly rebasing.

http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/

To develop on top of mmotm git:

  $ git remote add mmotm 
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
  $ git remote update mmotm
  $ git checkout -b topic mmotm/master
  
  $ git send-email mmotm/master.. [...]

To rebase a branch with older patches to a new mmotm release:

  $ git remote update mmotm
  $ git rebase --onto mmotm/master  topic




The directory http://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second)
contains daily snapshots of the -mm tree.  It is updated more frequently
than mmotm, and is untested.

A git copy of this tree is available at

http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/

and use of this tree is similar to
http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/, described above.


This mmotm tree contains the following patches against 4.9:
(patches marked "*" will be included in linux-next)

  origin.patch
* powerpc-ima-get-the-kexec-buffer-passed-by-the-previous-kernel.patch
* ima-on-soft-reboot-restore-the-measurement-list.patch
* ima-permit-duplicate-measurement-list-entries.patch
* ima-maintain-memory-size-needed-for-serializing-the-measurement-list.patch
* powerpc-ima-send-the-kexec-buffer-to-the-next-kernel.patch
* ima-on-soft-reboot-save-the-measurement-list.patch
* ima-store-the-builtin-custom-template-definitions-in-a-list.patch
* ima-support-restoring-multiple-template-formats.patch
* ima-define-a-canonical-binary_runtime_measurements-list-format.patch
* ima-platform-independent-hash-value.patch
* mm-fadvise-avoid-expensive-remote-lru-cache-draining-after-fadv_dontneed.patch
* arm64-setup-introduce-kaslr_offset.patch
* kcov-make-kcov-work-properly-with-kaslr-enabled.patch
* ratelimit-fix-warn_on_ratelimit-return-value.patch
* printk-fix-typo-in-console_loglevel_default-help-text.patch
  i-need-old-gcc.patch
* mm-page_alloc-fix-incorrect-zone_statistics-data.patch
* mm-thp-pagecache-only-withdraw-page-table-after-a-successful-deposit.patch
* 
mm-thp-pagecache-collapse-free-the-pte-page-table-on-collapse-for-thp-page-cache.patch
* arm-arch-arm-include-asm-pageh-needs-personalityh.patch
* scripts-spellingtxt-add-several-more-common-spelling-mistakes.patch
* 
ocfs2-old-mle-put-and-release-after-the-function-dlm_add_migration_mle-called.patch
* 
ocfs2-old-mle-put-and-release-after-the-function-dlm_add_migration_mle-called-fix.patch
* ocfs2-fix-crash-caused-by-stale-lvb-with-fsdlm-plugin.patch
* 
block-restore-proc-partitions-to-not-display-non-partitionable-removable-devices.patch
* kernel-watchdog-prevent-false-hardlockup-on-overloaded-system.patch
* kernel-watchdog-prevent-false-hardlockup-on-overloaded-system-fix.patch
  mm.patch
* tmpfs-change-shmem_mapping-to-test-shmem_aops.patch
* mm-throttle-show_mem-from-warn_alloc.patch
* mm-throttle-show_mem-from-warn_alloc-fix.patch
* mm-page_alloc-dont-convert-pfn-to-idx-when-merging.patch
* mm-page_alloc-avoid-page_to_pfn-when-merging-buddies.patch
* z3fold-limit-first_num-to-the-actual-range-of-possible-buddy-indexes.patch
* z3fold-make-pages_nr-atomic.patch
* z3fold-extend-compaction-function.patch
* z3fold-use-per-page-spinlock.patch
* z3fold-discourage-use-of-pages-that-werent-compacted.patch
* z3fold-fix-header-size-related-issues.patch
* z3fold-fix-locking-issues.patch
* mm-page_owner-align-with-pageblock_nr-pages.patch
* mm-walk-the-zone-in-pageblock_nr_pages-steps.patch
* lib-add-crc64-ecma-module.patch
* 

Re: [PATCH v2] fix code alignment with open parenthesis

2016-12-19 Thread Scott Matheina



On 12/19/2016 12:35 AM, Greg KH wrote:

On Sun, Dec 18, 2016 at 11:47:30AM -0600, Scott Matheina wrote:

These changes where identified by checkpatch.pl as needed changes to
align the code with the linux development coding style. The several
lines of text where aligned with the precending parenthesis.

Signed-off-by: Scott Matheina 

  Changes to be committed:
modified:   drivers/staging/fbtft/fb_agm1264k-fl.c
I'll remove the above text after signed-off-by and resubmit, honestly 
it's been
over a year since I submitted a patch so I have to relearn how you want 
them

formatted. Thanks for working with me.

Why are these lines in the changelog text?




Re: [PATCH v2] fix code alignment with open parenthesis

2016-12-19 Thread Scott Matheina



On 12/19/2016 12:35 AM, Greg KH wrote:

On Sun, Dec 18, 2016 at 11:47:30AM -0600, Scott Matheina wrote:

These changes where identified by checkpatch.pl as needed changes to
align the code with the linux development coding style. The several
lines of text where aligned with the precending parenthesis.

Signed-off-by: Scott Matheina 

  Changes to be committed:
modified:   drivers/staging/fbtft/fb_agm1264k-fl.c
I'll remove the above text after signed-off-by and resubmit, honestly 
it's been
over a year since I submitted a patch so I have to relearn how you want 
them

formatted. Thanks for working with me.

Why are these lines in the changelog text?




Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski
On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
 wrote:
> On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
>>  wrote:
>> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
>> >> Hi all-
>> >>
>> >> I apologize for being rather late with this.  I didn't realize that
>> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
>> >> it on the linux-api list, so I missed the discussion.
>> >>
>> >> I think that the inet ingress, egress etc filters are a neat feature,
>> >> but I think the API has some issues that will bite us down the road
>> >> if it becomes stable in its current form.
>> >>
>> >> Most of the problems I see are summarized in this transcript:
>> >>
>> >> # mkdir cg2
>> >> # mount -t cgroup2 none cg2
>> >> # mkdir cg2/nosockets
>> >> # strace cgrp_socket_rule cg2/nosockets/ 0
>> >> ...
>> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>> >>
>> >>  You can modify a cgroup after opening it O_RDONLY?
>> >>
>> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
>> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
>> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
>> >>
>> >>  This is fine.  The bpf() syscall manipulates bpf objects.
>> >>
>> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>> >>
>> >>  This is not so good:
>> >> 
>> >>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
>> >> is manipulating a cgroup.  There's no reason that a socket 
>> >> creation
>> >> filter couldn't be written in a different language (new iptables
>> >> table?  Simple list of address families?), but if that happened,
>> >> then using bpf() to install it would be entirely nonsensical.
>> >
>> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
>> > network stack is using cgroup as an application group that should
>> > invoke bpf program at the certain point in the stack.
>> > imo cgroup management is orthogonal.
>>
>> It is literally modifying the struct cgroup, and, as a practical
>> matter, it's causing membership in the cgroup to have a certain
>> effect.  But rather than pointless arguing, let me propose an
>> alternative API that I think solves most of the problems here.
>>
>> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
>> Instead, the cgroup gets three new control files:
>> "net.ingress_filter", "net.egress_filter", and
>> "net.socket_create_filter".  Initially, if you read these files, you
>> see "none\n".
>>
>> To attach a bpf filter, you open the file for write and do an ioctl on
>> it.  After doing the ioctl, if you read the file, you'll see
>> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
>> the bpf program.
>>
>> To detach any type of filter, bpf or otherwise, you open the file for
>> write and write "none\n" (or just "none").
>>
>> If you write anything else to the file, you get -EINVAL.  But, if
>> someone writes a new type of filter (perhaps a simple list of address
>> families), maybe you can enable the filter by writing something
>> appropriate to the file.
>
> I see no difference in what you're proposing vs what is already implemented
> from feature set point of view, but the file approach is very ugly, since
> it's a mismatch to FD style access that bpf is using everywhere.
> In your proposal you'd also need to add bpf prefix everywhere.
> So the control file names should be bpf_inet_ingress, bpf_inet_egress
> and bpf_socket_create.

I think we're still talking past each other.  A big part of the point
of changing it is that none of this is specific to bpf.  You could (in
theory -- I'm not proposing implementing these until there's demand)
have:

net.socket_create_filter = "none": no filter
net.socket_create_filter = "bpf:baadf00d": bpf filter
net.socket_create_filter = "disallow": no sockets created period
net.socket_create_filter = "iptables:foobar": some iptables thingy
net.socket_create_filter = "nft:blahblahblah": some nft thingy
net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

See?  This API is not bpf-specific.  It's an API for filtering.  The
fact that struct cgroup currently contains a member called "bpf" is
purely an artifact of the fact that it currently only supports bpf.
Someone will want to rename it to "filters" some day, and
BPF_PROG_DETACH makes no sense whatsoever as a generic API to detach a
filter.

> If you want to prepare such patch for them, I don't mind,
> but you cannot kill syscall command, since it's more flexible
> and your control-file approach _will_ be obsolete pretty quickly.

BPF_PROG_ATTACH and BPF_PROC_DETACH should be removed regardless IMO.
If you really really want a syscall, make it a new syscall.

>
>> Now the API matches the effect.  A cgroup with something other than
>> "none" in one 

[PATCH] tty: hvc_dcc: Add basic early_con support

2016-12-19 Thread Nishanth Menon
In some cases, earlycon can help catch errors with kernel boot prior to
standard console is available.

Example bootargs: console=hvc0 earlycon=hvcdcc

Signed-off-by: Nishanth Menon <n...@ti.com>
---
Based on: v4.9 tag
Also applies on: next-20161219
Tested on Simulation environment (which did not have serial console option)

 drivers/tty/hvc/hvc_dcc.c | 44 
 1 file changed, 44 insertions(+)

diff --git a/drivers/tty/hvc/hvc_dcc.c b/drivers/tty/hvc/hvc_dcc.c
index 82f240fb98f0..1ec06e550656 100644
--- a/drivers/tty/hvc/hvc_dcc.c
+++ b/drivers/tty/hvc/hvc_dcc.c
@@ -10,7 +10,9 @@
  * GNU General Public License for more details.
  */
 
+#include 
 #include 
+#include 
 
 #include 
 #include 
@@ -94,3 +96,45 @@ static int __init hvc_dcc_init(void)
return PTR_ERR_OR_ZERO(p);
 }
 device_initcall(hvc_dcc_init);
+
+static int hvc_dcc_earlyputc(int c)
+{
+   unsigned long count = 0x;
+   static bool dead_dcc_earlycon;
+
+   if (dead_dcc_earlycon)
+   return -EBUSY;
+
+   while (count--) {
+   if (!(__dcc_getstatus() & DCC_STATUS_TX))
+   break;
+   }
+   if (!count) {
+   dead_dcc_earlycon = true;
+   return -EBUSY;
+   }
+   __dcc_putchar(c);
+   return 0;
+}
+
+static void hvc_dcc_earlywrite(struct console *con, const char *s,
+  unsigned int n)
+{
+   int r;
+
+   while (n--) {
+   r = hvc_dcc_earlyputc(*s);
+   if (r)
+   break;
+   s++;
+   }
+}
+
+static int
+__init early_hvc_dcc_setup(struct earlycon_device *device, const char *opt)
+{
+   device->con->write = hvc_dcc_earlywrite;
+   return 0;
+}
+
+EARLYCON_DECLARE(hvcdcc, early_hvc_dcc_setup);
-- 
2.11.0.65.geff96d7



[PATCH] tty: hvc_dcc: Add basic early_con support

2016-12-19 Thread Nishanth Menon
In some cases, earlycon can help catch errors with kernel boot prior to
standard console is available.

Example bootargs: console=hvc0 earlycon=hvcdcc

Signed-off-by: Nishanth Menon 
---
Based on: v4.9 tag
Also applies on: next-20161219
Tested on Simulation environment (which did not have serial console option)

 drivers/tty/hvc/hvc_dcc.c | 44 
 1 file changed, 44 insertions(+)

diff --git a/drivers/tty/hvc/hvc_dcc.c b/drivers/tty/hvc/hvc_dcc.c
index 82f240fb98f0..1ec06e550656 100644
--- a/drivers/tty/hvc/hvc_dcc.c
+++ b/drivers/tty/hvc/hvc_dcc.c
@@ -10,7 +10,9 @@
  * GNU General Public License for more details.
  */
 
+#include 
 #include 
+#include 
 
 #include 
 #include 
@@ -94,3 +96,45 @@ static int __init hvc_dcc_init(void)
return PTR_ERR_OR_ZERO(p);
 }
 device_initcall(hvc_dcc_init);
+
+static int hvc_dcc_earlyputc(int c)
+{
+   unsigned long count = 0x;
+   static bool dead_dcc_earlycon;
+
+   if (dead_dcc_earlycon)
+   return -EBUSY;
+
+   while (count--) {
+   if (!(__dcc_getstatus() & DCC_STATUS_TX))
+   break;
+   }
+   if (!count) {
+   dead_dcc_earlycon = true;
+   return -EBUSY;
+   }
+   __dcc_putchar(c);
+   return 0;
+}
+
+static void hvc_dcc_earlywrite(struct console *con, const char *s,
+  unsigned int n)
+{
+   int r;
+
+   while (n--) {
+   r = hvc_dcc_earlyputc(*s);
+   if (r)
+   break;
+   s++;
+   }
+}
+
+static int
+__init early_hvc_dcc_setup(struct earlycon_device *device, const char *opt)
+{
+   device->con->write = hvc_dcc_earlywrite;
+   return 0;
+}
+
+EARLYCON_DECLARE(hvcdcc, early_hvc_dcc_setup);
-- 
2.11.0.65.geff96d7



Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Andy Lutomirski
On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
 wrote:
> On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
>>  wrote:
>> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
>> >> Hi all-
>> >>
>> >> I apologize for being rather late with this.  I didn't realize that
>> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
>> >> it on the linux-api list, so I missed the discussion.
>> >>
>> >> I think that the inet ingress, egress etc filters are a neat feature,
>> >> but I think the API has some issues that will bite us down the road
>> >> if it becomes stable in its current form.
>> >>
>> >> Most of the problems I see are summarized in this transcript:
>> >>
>> >> # mkdir cg2
>> >> # mount -t cgroup2 none cg2
>> >> # mkdir cg2/nosockets
>> >> # strace cgrp_socket_rule cg2/nosockets/ 0
>> >> ...
>> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>> >>
>> >>  You can modify a cgroup after opening it O_RDONLY?
>> >>
>> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
>> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
>> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
>> >>
>> >>  This is fine.  The bpf() syscall manipulates bpf objects.
>> >>
>> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>> >>
>> >>  This is not so good:
>> >> 
>> >>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
>> >> is manipulating a cgroup.  There's no reason that a socket 
>> >> creation
>> >> filter couldn't be written in a different language (new iptables
>> >> table?  Simple list of address families?), but if that happened,
>> >> then using bpf() to install it would be entirely nonsensical.
>> >
>> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
>> > network stack is using cgroup as an application group that should
>> > invoke bpf program at the certain point in the stack.
>> > imo cgroup management is orthogonal.
>>
>> It is literally modifying the struct cgroup, and, as a practical
>> matter, it's causing membership in the cgroup to have a certain
>> effect.  But rather than pointless arguing, let me propose an
>> alternative API that I think solves most of the problems here.
>>
>> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
>> Instead, the cgroup gets three new control files:
>> "net.ingress_filter", "net.egress_filter", and
>> "net.socket_create_filter".  Initially, if you read these files, you
>> see "none\n".
>>
>> To attach a bpf filter, you open the file for write and do an ioctl on
>> it.  After doing the ioctl, if you read the file, you'll see
>> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
>> the bpf program.
>>
>> To detach any type of filter, bpf or otherwise, you open the file for
>> write and write "none\n" (or just "none").
>>
>> If you write anything else to the file, you get -EINVAL.  But, if
>> someone writes a new type of filter (perhaps a simple list of address
>> families), maybe you can enable the filter by writing something
>> appropriate to the file.
>
> I see no difference in what you're proposing vs what is already implemented
> from feature set point of view, but the file approach is very ugly, since
> it's a mismatch to FD style access that bpf is using everywhere.
> In your proposal you'd also need to add bpf prefix everywhere.
> So the control file names should be bpf_inet_ingress, bpf_inet_egress
> and bpf_socket_create.

I think we're still talking past each other.  A big part of the point
of changing it is that none of this is specific to bpf.  You could (in
theory -- I'm not proposing implementing these until there's demand)
have:

net.socket_create_filter = "none": no filter
net.socket_create_filter = "bpf:baadf00d": bpf filter
net.socket_create_filter = "disallow": no sockets created period
net.socket_create_filter = "iptables:foobar": some iptables thingy
net.socket_create_filter = "nft:blahblahblah": some nft thingy
net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

See?  This API is not bpf-specific.  It's an API for filtering.  The
fact that struct cgroup currently contains a member called "bpf" is
purely an artifact of the fact that it currently only supports bpf.
Someone will want to rename it to "filters" some day, and
BPF_PROG_DETACH makes no sense whatsoever as a generic API to detach a
filter.

> If you want to prepare such patch for them, I don't mind,
> but you cannot kill syscall command, since it's more flexible
> and your control-file approach _will_ be obsolete pretty quickly.

BPF_PROG_ATTACH and BPF_PROC_DETACH should be removed regardless IMO.
If you really really want a syscall, make it a new syscall.

>
>> Now the API matches the effect.  A cgroup with something other than
>> "none" in one of its net.*_filter files is a cgroup that filters
>> 

Re: Documentation/unaligned-memory-access.txt: fix incorrect comparison operator

2016-12-19 Thread Cihangir Akturk
On Sun, Dec 18, 2016 at 12:52:12AM +0200, Ozgur Karatas wrote:
> 17.12.2016, 19:43, "Cihangir Akturk" :
> > In the actual implementation ether_addr_equal function tests for equality 
> > to 0
> > when returning. It seems in commit 0d74c4 it is somehow overlooked to change
> > this operator to reflect the actual function.
> 
> why this "return" function need to be ==0? I think, u16 functions read memory 
> but "0" is should not be equalty.

XOR is true only when inputs differ. That means if inputs are the
same, then it outputs false (0) or whatever you call it. Then we
perform OR operation between those outputs. So if the result is 0 then
addr1 and addr2 is equal.

> This way, -for the code to work- memory should be everytime unaligned !=0.

Sorry I didn't quite get the point.


Re: Documentation/unaligned-memory-access.txt: fix incorrect comparison operator

2016-12-19 Thread Cihangir Akturk
On Sun, Dec 18, 2016 at 12:52:12AM +0200, Ozgur Karatas wrote:
> 17.12.2016, 19:43, "Cihangir Akturk" :
> > In the actual implementation ether_addr_equal function tests for equality 
> > to 0
> > when returning. It seems in commit 0d74c4 it is somehow overlooked to change
> > this operator to reflect the actual function.
> 
> why this "return" function need to be ==0? I think, u16 functions read memory 
> but "0" is should not be equalty.

XOR is true only when inputs differ. That means if inputs are the
same, then it outputs false (0) or whatever you call it. Then we
perform OR operation between those outputs. So if the result is 0 then
addr1 and addr2 is equal.

> This way, -for the code to work- memory should be everytime unaligned !=0.

Sorry I didn't quite get the point.


Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-19 Thread Dave Hansen
On 12/19/2016 03:07 PM, Linus Torvalds wrote:
> +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> +{
> +   const int __maybe_unused nid = page_to_nid(virt_to_page(word));
> +
> +   return __bit_waitqueue(word, bit, nid);
> 
> No can do. Part of the problem with the old coffee was that it did that
> virt_to_page() crud. That doesn't work with the virtually mapped stack. 

Ahhh, got it.

So, what did you have in mind?  Just redirect bit_waitqueue() to the
"first_online_node" waitqueues?

wait_queue_head_t *bit_waitqueue(void *word, int bit)
{
return __bit_waitqueue(word, bit, first_online_node);
}

We could do some fancy stuff like only do virt_to_page() for things in
the linear map, but I'm not sure we'll see much of a gain for it.  None
of the other waitqueue users look as pathological as the 'struct page'
ones.  Maybe:

wait_queue_head_t *bit_waitqueue(void *word, int bit)
{
int nid
if (word >= VMALLOC_START) /* all addrs not in linear map */
nid = first_online_node;
else
nid = page_to_nid(virt_to_page(word));
return __bit_waitqueue(word, bit, nid);
}


Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-19 Thread Dave Hansen
On 12/19/2016 03:07 PM, Linus Torvalds wrote:
> +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> +{
> +   const int __maybe_unused nid = page_to_nid(virt_to_page(word));
> +
> +   return __bit_waitqueue(word, bit, nid);
> 
> No can do. Part of the problem with the old coffee was that it did that
> virt_to_page() crud. That doesn't work with the virtually mapped stack. 

Ahhh, got it.

So, what did you have in mind?  Just redirect bit_waitqueue() to the
"first_online_node" waitqueues?

wait_queue_head_t *bit_waitqueue(void *word, int bit)
{
return __bit_waitqueue(word, bit, first_online_node);
}

We could do some fancy stuff like only do virt_to_page() for things in
the linear map, but I'm not sure we'll see much of a gain for it.  None
of the other waitqueue users look as pathological as the 'struct page'
ones.  Maybe:

wait_queue_head_t *bit_waitqueue(void *word, int bit)
{
int nid
if (word >= VMALLOC_START) /* all addrs not in linear map */
nid = first_online_node;
else
nid = page_to_nid(virt_to_page(word));
return __bit_waitqueue(word, bit, nid);
}


Re: [PATCH 1/1] x86/platform/intel/quark: add printf attribute to imr_self_test_result()

2016-12-19 Thread Bryan O'Donoghue



On 19/12/16 13:21, Nicolas Iooss wrote:

__printf attributes help detecting issues in printf format strings at
compile time.

Even though imr_selftest.c is only compiled with
CONFIG_DEBUG_IMR_SELFTEST, gcc complains about a missing format
attribute when compiling allmodconfig with -Wmissing-format-attribute.
Silent this warning by adding the attribute.

Signed-off-by: Nicolas Iooss 
---
 arch/x86/platform/intel-quark/imr_selftest.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/platform/intel-quark/imr_selftest.c 
b/arch/x86/platform/intel-quark/imr_selftest.c
index f5bad40936ac..b8f562049cad 100644
--- a/arch/x86/platform/intel-quark/imr_selftest.c
+++ b/arch/x86/platform/intel-quark/imr_selftest.c
@@ -25,7 +25,8 @@
  * @fmt:   format string.
  * ... variadic argument list.
  */
-static void __init imr_self_test_result(int res, const char *fmt, ...)
+static __printf(2, 3)
+void __init imr_self_test_result(int res, const char *fmt, ...)
 {
va_list vlist;




since I wrote this code.

Acked-by: Bryan O'Donoghue 

cheers

---
bod


Re: [PATCH 1/1] x86/platform/intel/quark: add printf attribute to imr_self_test_result()

2016-12-19 Thread Bryan O'Donoghue



On 19/12/16 13:21, Nicolas Iooss wrote:

__printf attributes help detecting issues in printf format strings at
compile time.

Even though imr_selftest.c is only compiled with
CONFIG_DEBUG_IMR_SELFTEST, gcc complains about a missing format
attribute when compiling allmodconfig with -Wmissing-format-attribute.
Silent this warning by adding the attribute.

Signed-off-by: Nicolas Iooss 
---
 arch/x86/platform/intel-quark/imr_selftest.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/platform/intel-quark/imr_selftest.c 
b/arch/x86/platform/intel-quark/imr_selftest.c
index f5bad40936ac..b8f562049cad 100644
--- a/arch/x86/platform/intel-quark/imr_selftest.c
+++ b/arch/x86/platform/intel-quark/imr_selftest.c
@@ -25,7 +25,8 @@
  * @fmt:   format string.
  * ... variadic argument list.
  */
-static void __init imr_self_test_result(int res, const char *fmt, ...)
+static __printf(2, 3)
+void __init imr_self_test_result(int res, const char *fmt, ...)
 {
va_list vlist;




since I wrote this code.

Acked-by: Bryan O'Donoghue 

cheers

---
bod


[PATCH v2 0/1] perf: check that objdump correctly works

2016-12-19 Thread Alexis Berlemont
Hi Arnaldo,

Here is a patch which checks that objdump works without calling it
first with the "-v" option.

Thanks to the shell pipefail option (which returns the rightmost error
status in the shell pipe) and the grep subshell which prevents
"no-match" errors, we are able the use waitpid on the whole pipeline
and get the appropriate return code.

This is the most simple and straightforward solution I found. The
downside might be that it relies on the availability of "pipefail"
shell option. However, I checked that even busybox's shell implemented
it.

Alexis.

Alexis Berlemont (1):
  perf annotate: check that objdump correctly works

 tools/perf/util/annotate.c | 69 +++---
 tools/perf/util/annotate.h |  3 ++
 2 files changed, 68 insertions(+), 4 deletions(-)

-- 
2.11.0



[PATCH v2 1/1] perf annotate: check that objdump correctly works

2016-12-19 Thread Alexis Berlemont
Before disassembling, the tool objdump is called just to be sure:
* objdump is available in the path;
* objdump is an executable binary;
* objdump has no dependency issue or anything else.

This objdump "pre-"command is only necessary because the real objdump
command is followed by some " | grep ..."; this prevents the shell
from returning the exit code of objdump execution.

Signed-off-by: Alexis Berlemont 
---
 tools/perf/util/annotate.c | 69 +++---
 tools/perf/util/annotate.h |  3 ++
 2 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index ea7e0de4b9c1..6fbaabbc9d2a 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -20,9 +20,12 @@
 #include "block-range.h"
 #include "arch/common.h"
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
+#include 
 
 const char *disassembler_style;
 const char *objdump_path;
@@ -1286,6 +1289,21 @@ int symbol__strerror_disassemble(struct symbol *sym 
__maybe_unused, struct map *
  "  --vmlinux vmlinux\n", build_id_msg ?: "");
}
break;
+
+   case SYMBOL_ANNOTATE_ERRNO__NO_OBJDUMP:
+   scnprintf(buf, buflen, "No objdump tool available in $PATH\n");
+   break;
+
+   case SYMBOL_ANNOTATE_ERRNO__NO_EXEC_OBJDUMP:
+   scnprintf(buf, buflen,
+   "The objdump tool found in $PATH cannot be executed\n");
+   break;
+
+   case SYMBOL_ANNOTATE_ERRNO__NO_OUTPUT:
+   scnprintf(buf, buflen,
+   "The objdump tool returned no disassembled code\n");
+   break;
+
default:
scnprintf(buf, buflen, "Internal error: Invalid %d error 
code\n", errnum);
break;
@@ -1329,6 +1347,44 @@ static int dso__disassemble_filename(struct dso *dso, 
char *filename, size_t fil
return 0;
 }
 
+static int annotate__check_exec(int pid, const char *command)
+{
+   int wstatus, err;
+
+   err = waitpid(pid, , 0);
+   if (err < 0) {
+   pr_err("Failure calling waitpid: %s: (%s)\n",
+   strerror(errno), command);
+   return -1;
+   }
+
+   pr_debug("%s: %d %d\n", command, pid, WEXITSTATUS(wstatus));
+
+   switch (WEXITSTATUS(wstatus)) {
+   case 0:
+   /* Success */
+   err = 0;
+   break;
+   case 127:
+   /* The shell did not find objdump in the path */
+   err = SYMBOL_ANNOTATE_ERRNO__NO_OBJDUMP;
+   break;
+   default:
+   /*
+* In the default case, we consider that objdump
+* cannot be executed; so it gathers many fault
+* scenarii:
+* - objdump is not an executable (126);
+* - objdump has some dependency issue;
+* - ...
+*/
+   err = SYMBOL_ANNOTATE_ERRNO__NO_EXEC_OBJDUMP;
+   break;
+   }
+
+   return err;
+}
+
 static const char *annotate__norm_arch(const char *arch_name)
 {
struct utsname uts;
@@ -1426,7 +1482,8 @@ int symbol__disassemble(struct symbol *sym, struct map 
*map, const char *arch_na
snprintf(command, sizeof(command),
 "%s %s%s --start-address=0x%016" PRIx64
 " --stop-address=0x%016" PRIx64
-" -l -d %s %s -C %s 2>/dev/null|grep -v %s|expand",
+" -l -d %s %s -C %s 2>/dev/null | "
+"{ grep -v %s || true; } | expand",
 objdump_path ? objdump_path : "objdump",
 disassembler_style ? "-M " : "",
 disassembler_style ? disassembler_style : "",
@@ -1454,7 +1511,7 @@ int symbol__disassemble(struct symbol *sym, struct map 
*map, const char *arch_na
close(stdout_fd[0]);
dup2(stdout_fd[1], 1);
close(stdout_fd[1]);
-   execl("/bin/sh", "sh", "-c", command, NULL);
+   execl("/bin/sh", "sh", "-o", "pipefail", "-c", command, NULL);
perror(command);
exit(-1);
}
@@ -1479,8 +1536,12 @@ int symbol__disassemble(struct symbol *sym, struct map 
*map, const char *arch_na
nline++;
}
 
-   if (nline == 0)
+   err = annotate__check_exec(pid, command);
+
+   if (err == 0 && nline == 0) {
pr_err("No output from %s\n", command);
+   err = SYMBOL_ANNOTATE_ERRNO__NO_OUTPUT;
+   }
 
/*
 * kallsyms does not have symbol sizes so there may a nop at the end.
@@ -1490,7 +1551,7 @@ int symbol__disassemble(struct symbol *sym, struct map 
*map, const char *arch_na
delete_last_nop(sym);
 
fclose(file);
-   err = 0;
+
 out_remove_tmp:
close(stdout_fd[0]);
 
diff --git 

[PATCH v2 0/1] perf: check that objdump correctly works

2016-12-19 Thread Alexis Berlemont
Hi Arnaldo,

Here is a patch which checks that objdump works without calling it
first with the "-v" option.

Thanks to the shell pipefail option (which returns the rightmost error
status in the shell pipe) and the grep subshell which prevents
"no-match" errors, we are able the use waitpid on the whole pipeline
and get the appropriate return code.

This is the most simple and straightforward solution I found. The
downside might be that it relies on the availability of "pipefail"
shell option. However, I checked that even busybox's shell implemented
it.

Alexis.

Alexis Berlemont (1):
  perf annotate: check that objdump correctly works

 tools/perf/util/annotate.c | 69 +++---
 tools/perf/util/annotate.h |  3 ++
 2 files changed, 68 insertions(+), 4 deletions(-)

-- 
2.11.0



[PATCH v2 1/1] perf annotate: check that objdump correctly works

2016-12-19 Thread Alexis Berlemont
Before disassembling, the tool objdump is called just to be sure:
* objdump is available in the path;
* objdump is an executable binary;
* objdump has no dependency issue or anything else.

This objdump "pre-"command is only necessary because the real objdump
command is followed by some " | grep ..."; this prevents the shell
from returning the exit code of objdump execution.

Signed-off-by: Alexis Berlemont 
---
 tools/perf/util/annotate.c | 69 +++---
 tools/perf/util/annotate.h |  3 ++
 2 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index ea7e0de4b9c1..6fbaabbc9d2a 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -20,9 +20,12 @@
 #include "block-range.h"
 #include "arch/common.h"
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
+#include 
 
 const char *disassembler_style;
 const char *objdump_path;
@@ -1286,6 +1289,21 @@ int symbol__strerror_disassemble(struct symbol *sym 
__maybe_unused, struct map *
  "  --vmlinux vmlinux\n", build_id_msg ?: "");
}
break;
+
+   case SYMBOL_ANNOTATE_ERRNO__NO_OBJDUMP:
+   scnprintf(buf, buflen, "No objdump tool available in $PATH\n");
+   break;
+
+   case SYMBOL_ANNOTATE_ERRNO__NO_EXEC_OBJDUMP:
+   scnprintf(buf, buflen,
+   "The objdump tool found in $PATH cannot be executed\n");
+   break;
+
+   case SYMBOL_ANNOTATE_ERRNO__NO_OUTPUT:
+   scnprintf(buf, buflen,
+   "The objdump tool returned no disassembled code\n");
+   break;
+
default:
scnprintf(buf, buflen, "Internal error: Invalid %d error 
code\n", errnum);
break;
@@ -1329,6 +1347,44 @@ static int dso__disassemble_filename(struct dso *dso, 
char *filename, size_t fil
return 0;
 }
 
+static int annotate__check_exec(int pid, const char *command)
+{
+   int wstatus, err;
+
+   err = waitpid(pid, , 0);
+   if (err < 0) {
+   pr_err("Failure calling waitpid: %s: (%s)\n",
+   strerror(errno), command);
+   return -1;
+   }
+
+   pr_debug("%s: %d %d\n", command, pid, WEXITSTATUS(wstatus));
+
+   switch (WEXITSTATUS(wstatus)) {
+   case 0:
+   /* Success */
+   err = 0;
+   break;
+   case 127:
+   /* The shell did not find objdump in the path */
+   err = SYMBOL_ANNOTATE_ERRNO__NO_OBJDUMP;
+   break;
+   default:
+   /*
+* In the default case, we consider that objdump
+* cannot be executed; so it gathers many fault
+* scenarii:
+* - objdump is not an executable (126);
+* - objdump has some dependency issue;
+* - ...
+*/
+   err = SYMBOL_ANNOTATE_ERRNO__NO_EXEC_OBJDUMP;
+   break;
+   }
+
+   return err;
+}
+
 static const char *annotate__norm_arch(const char *arch_name)
 {
struct utsname uts;
@@ -1426,7 +1482,8 @@ int symbol__disassemble(struct symbol *sym, struct map 
*map, const char *arch_na
snprintf(command, sizeof(command),
 "%s %s%s --start-address=0x%016" PRIx64
 " --stop-address=0x%016" PRIx64
-" -l -d %s %s -C %s 2>/dev/null|grep -v %s|expand",
+" -l -d %s %s -C %s 2>/dev/null | "
+"{ grep -v %s || true; } | expand",
 objdump_path ? objdump_path : "objdump",
 disassembler_style ? "-M " : "",
 disassembler_style ? disassembler_style : "",
@@ -1454,7 +1511,7 @@ int symbol__disassemble(struct symbol *sym, struct map 
*map, const char *arch_na
close(stdout_fd[0]);
dup2(stdout_fd[1], 1);
close(stdout_fd[1]);
-   execl("/bin/sh", "sh", "-c", command, NULL);
+   execl("/bin/sh", "sh", "-o", "pipefail", "-c", command, NULL);
perror(command);
exit(-1);
}
@@ -1479,8 +1536,12 @@ int symbol__disassemble(struct symbol *sym, struct map 
*map, const char *arch_na
nline++;
}
 
-   if (nline == 0)
+   err = annotate__check_exec(pid, command);
+
+   if (err == 0 && nline == 0) {
pr_err("No output from %s\n", command);
+   err = SYMBOL_ANNOTATE_ERRNO__NO_OUTPUT;
+   }
 
/*
 * kallsyms does not have symbol sizes so there may a nop at the end.
@@ -1490,7 +1551,7 @@ int symbol__disassemble(struct symbol *sym, struct map 
*map, const char *arch_na
delete_last_nop(sym);
 
fclose(file);
-   err = 0;
+
 out_remove_tmp:
close(stdout_fd[0]);
 
diff --git a/tools/perf/util/annotate.h 

Re: [PATCH v2] vfs: fix isize/pos/len checks for reflink & dedupe

2016-12-19 Thread Al Viro
On Mon, Dec 19, 2016 at 03:13:26PM -0800, Darrick J. Wong wrote:
> Strengthen the checking of pos/len vs. i_size, clarify the return values
> for the clone prep function, and remove pointless code.

Applied.


Re: [PATCH v2] vfs: fix isize/pos/len checks for reflink & dedupe

2016-12-19 Thread Al Viro
On Mon, Dec 19, 2016 at 03:13:26PM -0800, Darrick J. Wong wrote:
> Strengthen the checking of pos/len vs. i_size, clarify the return values
> for the clone prep function, and remove pointless code.

Applied.


Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock

2016-12-19 Thread Francois Romieu
Pavel Machek  :
[...]
> Considering the memory barriers... is something like this neccessary
> in the via-rhine ?

Yes.

> AFAICT... we need a barrier after making sure that descriptor is no
> longer owned by DMA (to make sure we don't get stale data in rest of
> descriptor)... and we need a barrier before giving the descriptor to
> the dma, to make sure DMA engine sees the complete update?

I would not expect stale data while processing a single transmit
descriptor as the transmit completion does not use the rest of the
descriptor at all in the via-rhine driver. However I agree that transmit
descriptors should be read by the cpu with adequate ordering so the
dma_rmb() should stay.

Same kind of narrative for dma_wmb rhine_rx (s/read/written/ and
s/cpu/device/).

> diff --git a/drivers/net/ethernet/via/via-rhine.c 
> b/drivers/net/ethernet/via/via-rhine.c
> index ba5c542..3806e72 100644
> --- a/drivers/net/ethernet/via/via-rhine.c
> +++ b/drivers/net/ethernet/via/via-rhine.c
[...]
> @@ -2061,6 +2062,7 @@ static int rhine_rx(struct net_device *dev, int limit)
>  
>   if (desc_status & DescOwn)
>   break;
> + dma_rmb();
>  

I agree with your explanation for this one (late vlan processing in a 
different word from the same descriptor).

-- 
Ueimor


Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock

2016-12-19 Thread Francois Romieu
Pavel Machek  :
[...]
> Considering the memory barriers... is something like this neccessary
> in the via-rhine ?

Yes.

> AFAICT... we need a barrier after making sure that descriptor is no
> longer owned by DMA (to make sure we don't get stale data in rest of
> descriptor)... and we need a barrier before giving the descriptor to
> the dma, to make sure DMA engine sees the complete update?

I would not expect stale data while processing a single transmit
descriptor as the transmit completion does not use the rest of the
descriptor at all in the via-rhine driver. However I agree that transmit
descriptors should be read by the cpu with adequate ordering so the
dma_rmb() should stay.

Same kind of narrative for dma_wmb rhine_rx (s/read/written/ and
s/cpu/device/).

> diff --git a/drivers/net/ethernet/via/via-rhine.c 
> b/drivers/net/ethernet/via/via-rhine.c
> index ba5c542..3806e72 100644
> --- a/drivers/net/ethernet/via/via-rhine.c
> +++ b/drivers/net/ethernet/via/via-rhine.c
[...]
> @@ -2061,6 +2062,7 @@ static int rhine_rx(struct net_device *dev, int limit)
>  
>   if (desc_status & DescOwn)
>   break;
> + dma_rmb();
>  

I agree with your explanation for this one (late vlan processing in a 
different word from the same descriptor).

-- 
Ueimor


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Alexei Starovoitov
On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
>  wrote:
> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
> >> Hi all-
> >>
> >> I apologize for being rather late with this.  I didn't realize that
> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> >> it on the linux-api list, so I missed the discussion.
> >>
> >> I think that the inet ingress, egress etc filters are a neat feature,
> >> but I think the API has some issues that will bite us down the road
> >> if it becomes stable in its current form.
> >>
> >> Most of the problems I see are summarized in this transcript:
> >>
> >> # mkdir cg2
> >> # mount -t cgroup2 none cg2
> >> # mkdir cg2/nosockets
> >> # strace cgrp_socket_rule cg2/nosockets/ 0
> >> ...
> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> >>
> >>  You can modify a cgroup after opening it O_RDONLY?
> >>
> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
> >>
> >>  This is fine.  The bpf() syscall manipulates bpf objects.
> >>
> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> >>
> >>  This is not so good:
> >> 
> >>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
> >> is manipulating a cgroup.  There's no reason that a socket creation
> >> filter couldn't be written in a different language (new iptables
> >> table?  Simple list of address families?), but if that happened,
> >> then using bpf() to install it would be entirely nonsensical.
> >
> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
> > network stack is using cgroup as an application group that should
> > invoke bpf program at the certain point in the stack.
> > imo cgroup management is orthogonal.
> 
> It is literally modifying the struct cgroup, and, as a practical
> matter, it's causing membership in the cgroup to have a certain
> effect.  But rather than pointless arguing, let me propose an
> alternative API that I think solves most of the problems here.
> 
> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
> Instead, the cgroup gets three new control files:
> "net.ingress_filter", "net.egress_filter", and
> "net.socket_create_filter".  Initially, if you read these files, you
> see "none\n".
> 
> To attach a bpf filter, you open the file for write and do an ioctl on
> it.  After doing the ioctl, if you read the file, you'll see
> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
> the bpf program.
> 
> To detach any type of filter, bpf or otherwise, you open the file for
> write and write "none\n" (or just "none").
> 
> If you write anything else to the file, you get -EINVAL.  But, if
> someone writes a new type of filter (perhaps a simple list of address
> families), maybe you can enable the filter by writing something
> appropriate to the file.

I see no difference in what you're proposing vs what is already implemented
from feature set point of view, but the file approach is very ugly, since
it's a mismatch to FD style access that bpf is using everywhere.
In your proposal you'd also need to add bpf prefix everywhere.
So the control file names should be bpf_inet_ingress, bpf_inet_egress
and bpf_socket_create.
If you want to prepare such patch for them, I don't mind,
but you cannot kill syscall command, since it's more flexible
and your control-file approach _will_ be obsolete pretty quickly.

> Now the API matches the effect.  A cgroup with something other than
> "none" in one of its net.*_filter files is a cgroup that filters
> network activity.  And you get CRIU compatibility for free: CRIU can
> read the control file and use whatever mechanism is uses for BPF in
> general to restore the cgroup filter state.   As an added bonus, you
> get ACLs for free and the ugly multiplexer goes away.

extended bpf is not supported by criu. only classic, so having
control_file-style attachment doesn't buy us anything.

> >> # mkdir cg2/nosockets/sockets
> >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule 
> >> cg2/nosockets/sockets/ 1
> >>
> >>  This succeeded, which means that, if this feature is enabled in 4.10,
> >>  then we're stuck with its semantics.  If it returned -EINVAL instead,
> >>  there would be a chance to refine it.
> >
> > i don't see the problem with this behavior. bpf and cgroup are indepedent.
> > Imange that socket accounting program is attached to cg2/nosockets.
> > The program is readonly and carry no security meaning.
> > Why cgroup should prevent creation of cg2/nosockets/foo directory ?
> 
> I think you're misunderstanding me.  What I'm saying is that, if you
> allow a cgroup and one of its descendents to both enable the same type
> of filter, you have just committed to 

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-19 Thread Alexei Starovoitov
On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
>  wrote:
> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
> >> Hi all-
> >>
> >> I apologize for being rather late with this.  I didn't realize that
> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> >> it on the linux-api list, so I missed the discussion.
> >>
> >> I think that the inet ingress, egress etc filters are a neat feature,
> >> but I think the API has some issues that will bite us down the road
> >> if it becomes stable in its current form.
> >>
> >> Most of the problems I see are summarized in this transcript:
> >>
> >> # mkdir cg2
> >> # mount -t cgroup2 none cg2
> >> # mkdir cg2/nosockets
> >> # strace cgrp_socket_rule cg2/nosockets/ 0
> >> ...
> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> >>
> >>  You can modify a cgroup after opening it O_RDONLY?
> >>
> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
> >>
> >>  This is fine.  The bpf() syscall manipulates bpf objects.
> >>
> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> >>
> >>  This is not so good:
> >> 
> >>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
> >> is manipulating a cgroup.  There's no reason that a socket creation
> >> filter couldn't be written in a different language (new iptables
> >> table?  Simple list of address families?), but if that happened,
> >> then using bpf() to install it would be entirely nonsensical.
> >
> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
> > network stack is using cgroup as an application group that should
> > invoke bpf program at the certain point in the stack.
> > imo cgroup management is orthogonal.
> 
> It is literally modifying the struct cgroup, and, as a practical
> matter, it's causing membership in the cgroup to have a certain
> effect.  But rather than pointless arguing, let me propose an
> alternative API that I think solves most of the problems here.
> 
> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
> Instead, the cgroup gets three new control files:
> "net.ingress_filter", "net.egress_filter", and
> "net.socket_create_filter".  Initially, if you read these files, you
> see "none\n".
> 
> To attach a bpf filter, you open the file for write and do an ioctl on
> it.  After doing the ioctl, if you read the file, you'll see
> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
> the bpf program.
> 
> To detach any type of filter, bpf or otherwise, you open the file for
> write and write "none\n" (or just "none").
> 
> If you write anything else to the file, you get -EINVAL.  But, if
> someone writes a new type of filter (perhaps a simple list of address
> families), maybe you can enable the filter by writing something
> appropriate to the file.

I see no difference in what you're proposing vs what is already implemented
from feature set point of view, but the file approach is very ugly, since
it's a mismatch to FD style access that bpf is using everywhere.
In your proposal you'd also need to add bpf prefix everywhere.
So the control file names should be bpf_inet_ingress, bpf_inet_egress
and bpf_socket_create.
If you want to prepare such patch for them, I don't mind,
but you cannot kill syscall command, since it's more flexible
and your control-file approach _will_ be obsolete pretty quickly.

> Now the API matches the effect.  A cgroup with something other than
> "none" in one of its net.*_filter files is a cgroup that filters
> network activity.  And you get CRIU compatibility for free: CRIU can
> read the control file and use whatever mechanism is uses for BPF in
> general to restore the cgroup filter state.   As an added bonus, you
> get ACLs for free and the ugly multiplexer goes away.

extended bpf is not supported by criu. only classic, so having
control_file-style attachment doesn't buy us anything.

> >> # mkdir cg2/nosockets/sockets
> >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule 
> >> cg2/nosockets/sockets/ 1
> >>
> >>  This succeeded, which means that, if this feature is enabled in 4.10,
> >>  then we're stuck with its semantics.  If it returned -EINVAL instead,
> >>  there would be a chance to refine it.
> >
> > i don't see the problem with this behavior. bpf and cgroup are indepedent.
> > Imange that socket accounting program is attached to cg2/nosockets.
> > The program is readonly and carry no security meaning.
> > Why cgroup should prevent creation of cg2/nosockets/foo directory ?
> 
> I think you're misunderstanding me.  What I'm saying is that, if you
> allow a cgroup and one of its descendents to both enable the same type
> of filter, you have just committed to some particular semantics for

Re: [PATCH] Documentation/unaligned-memory-access.txt: fix incorrect comparison operator

2016-12-19 Thread Cihangir Akturk
On Mon, Dec 19, 2016 at 04:13:13PM -0700, Jonathan Corbet wrote:
> On Mon, 19 Dec 2016 23:53:40 +0200
> Cihangir Akturk  wrote:
> 
> > In the actual implementation ether_addr_equal function tests for equality 
> > to 0
> > when returning. It seems in commit 0d74c4 it is somehow overlooked to change
> > this operator to reflect the actual function.

I realized that I generated the patch with the -k flag to git
format-patch. And think that it'd be better to resend it without this
flag. Besides that nothing is changed in the patch itself. Sorry for
the noise.

> 
> I received this patch two days ago; has something changed that you're
> sending it again?
> 
> Meanwhile, there was a question from Ozgur Karatas on the patch, but I've
> not yet seen your response.

Ok, I'll try to answer his question.

> 
> Thanks,
> 
> jon

Cihangir


Re: [PATCH] Documentation/unaligned-memory-access.txt: fix incorrect comparison operator

2016-12-19 Thread Cihangir Akturk
On Mon, Dec 19, 2016 at 04:13:13PM -0700, Jonathan Corbet wrote:
> On Mon, 19 Dec 2016 23:53:40 +0200
> Cihangir Akturk  wrote:
> 
> > In the actual implementation ether_addr_equal function tests for equality 
> > to 0
> > when returning. It seems in commit 0d74c4 it is somehow overlooked to change
> > this operator to reflect the actual function.

I realized that I generated the patch with the -k flag to git
format-patch. And think that it'd be better to resend it without this
flag. Besides that nothing is changed in the patch itself. Sorry for
the noise.

> 
> I received this patch two days ago; has something changed that you're
> sending it again?
> 
> Meanwhile, there was a question from Ozgur Karatas on the patch, but I've
> not yet seen your response.

Ok, I'll try to answer his question.

> 
> Thanks,
> 
> jon

Cihangir


Re: [PATCH] sched/fair: fix calc_cfs_shares fixed point arithmetics

2016-12-19 Thread Samuel Thibault
Paul Turner, on Mon 19 Dec 2016 15:32:15 -0800, wrote:
> On Mon, Dec 19, 2016 at 3:29 PM, Samuel Thibault
>  wrote:
> > Paul Turner, on Mon 19 Dec 2016 15:26:19 -0800, wrote:
> >> >> > -   if (shares < MIN_SHARES)
> >> >> > -   shares = MIN_SHARES;
> >> > ...
> >> >> > return shares;
> >> >
> >> > This will only make sure that the returned shares is 2, not 2048.
> >>
> >> This is intentional.  The MIN_SHARES you are seeing here is overloaded.
> >> Every "1" unit of share is "SCHED_LOAD_RESOLUTION" bits internally.
> >
> > I'm not talking about the SCHED_LOAD_RESOLUTION scaling, but about the
> > SCHED_FIXEDPOINT_SHIFT scaling, which is what
> > 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit 
> > kernels")
> > modified on 64bit platforms.
> 
>  From that commit:
> 
> """
> -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power
> usage under light load  */
> +#ifdef CONFIG_64BIT
>  # define SCHED_LOAD_RESOLUTION 10
>  # define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION)
>  # define scale_load_down(w)((w) >> SCHED_LOAD_RESOLUTION)

Errgl, sorry, I was referring to the old naming.  This stuff has seen
so much patching over and over in the past revisions...  It though you
were referring to SCHED_CAPACITY_SCALE.  The code I was reading now uses
SCHED_LOAD_RESOLUTION, so that's why I read your "SCHED_LOAD_RESOLUTION"
as "the other scaling".

> The MIN_SHARES you are seeing here is overloaded.
> In the unscaled case this needs to be MIN_SHARES, and in the scaled
> case, the subdivision of the scaled values must still be >=2.

Ok, now I understand.  I have to say this overloading is confusing.

Samuel


Re: [PATCH] sched/fair: fix calc_cfs_shares fixed point arithmetics

2016-12-19 Thread Samuel Thibault
Paul Turner, on Mon 19 Dec 2016 15:32:15 -0800, wrote:
> On Mon, Dec 19, 2016 at 3:29 PM, Samuel Thibault
>  wrote:
> > Paul Turner, on Mon 19 Dec 2016 15:26:19 -0800, wrote:
> >> >> > -   if (shares < MIN_SHARES)
> >> >> > -   shares = MIN_SHARES;
> >> > ...
> >> >> > return shares;
> >> >
> >> > This will only make sure that the returned shares is 2, not 2048.
> >>
> >> This is intentional.  The MIN_SHARES you are seeing here is overloaded.
> >> Every "1" unit of share is "SCHED_LOAD_RESOLUTION" bits internally.
> >
> > I'm not talking about the SCHED_LOAD_RESOLUTION scaling, but about the
> > SCHED_FIXEDPOINT_SHIFT scaling, which is what
> > 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit 
> > kernels")
> > modified on 64bit platforms.
> 
>  From that commit:
> 
> """
> -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power
> usage under light load  */
> +#ifdef CONFIG_64BIT
>  # define SCHED_LOAD_RESOLUTION 10
>  # define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION)
>  # define scale_load_down(w)((w) >> SCHED_LOAD_RESOLUTION)

Errgl, sorry, I was referring to the old naming.  This stuff has seen
so much patching over and over in the past revisions...  It though you
were referring to SCHED_CAPACITY_SCALE.  The code I was reading now uses
SCHED_LOAD_RESOLUTION, so that's why I read your "SCHED_LOAD_RESOLUTION"
as "the other scaling".

> The MIN_SHARES you are seeing here is overloaded.
> In the unscaled case this needs to be MIN_SHARES, and in the scaled
> case, the subdivision of the scaled values must still be >=2.

Ok, now I understand.  I have to say this overloading is confusing.

Samuel


<    1   2   3   4   5   6   7   8   9   10   >