Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On 12/19/16 6:56 PM, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 5:44 PM, David Ahernwrote: >> On 12/19/16 5:25 PM, Andy Lutomirski wrote: >>> net.socket_create_filter = "none": no filter >>> net.socket_create_filter = "bpf:baadf00d": bpf filter >>> net.socket_create_filter = "disallow": no sockets created period >>> net.socket_create_filter = "iptables:foobar": some iptables thingy >>> net.socket_create_filter = "nft:blahblahblah": some nft thingy >>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 >> >> Such a scheme works for the socket create filter b/c it is a very simple use >> case. It does not work for the ingress and egress which allow generic bpf >> filters. > > Can you elaborate on what goes wrong? (Obviously the > "address_family_list" example makes no sense in that context.) Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are specific cases of the moment which goes against the intent of ebpf and generic programmability. >> >> ... >> you're ignoring use cases I described earlier. In vrf case there is only one ifindex it needs to bind to. >>> >>> I'm totally lost. Can you explain what this has to do with the cgroup >>> hierarchy? >> >> I think the point is that a group hierarchy makes no sense for the VRF use >> case. What I put into iproute2 is >> >> cgrp2/vrf/NAME >> >> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 >> sockets to a specific device index. cgrp2/vrf is the "default" vrf and does >> not have a filter. A user can certainly add another layer >> cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not >> make sense. > > I tend to agree. I still think that the mechanism as it stands is > broken in other respects and should be fixed before it goes live. I > have no desire to cause problems for the vrf use case. > > But keep in mind that the vrf use case is, in Linus' tree, a bit > broken right now in its interactions with other users of the same > mechanism. Suppose I create a container and want to trace all of its > created sockets. I'll set up cgrp2/container and load my tracer as a > socket creation hook. Then a container sets up > cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding > filter. Now the tracing stops working -- oops. There are other ways to achieve socket tracing, but I get your point -- nested cases do not work as users may want. > I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You > have to do it now (or disable the feature for 4.10). This is why I'm > bringing this whole thing up now. We don't have to touch user visible api here, so extensions are fine. >>> >>> Huh? My example in the original email attaches a program in a >>> sub-hierarchy. Are you saying that 4.11 could make that example stop >>> working? >> >> Are you suggesting sub-cgroups should not be allowed to override the filter >> of a parent cgroup? > > Yes, exactly. I think there are two sensible behaviors: > > a) sub-cgroups cannot have a filter at all of the parent has a filter. > (This is the "punt" approach -- it lets different semantics be > assigned later without breaking userspace.) > > b) sub-cgroups can have a filter if a parent does, too. The semantics > are that the sub-cgroup filter runs first and all side-effects occur. > If that filter says "reject" then ancestor filters are skipped. If > that filter says "accept", then the ancestor filter is run and its > side-effects happen as well. (And so on, all the way up to the root.) That comes with a big performance hit for skb / data path cases. I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On 12/19/16 6:56 PM, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 5:44 PM, David Ahern wrote: >> On 12/19/16 5:25 PM, Andy Lutomirski wrote: >>> net.socket_create_filter = "none": no filter >>> net.socket_create_filter = "bpf:baadf00d": bpf filter >>> net.socket_create_filter = "disallow": no sockets created period >>> net.socket_create_filter = "iptables:foobar": some iptables thingy >>> net.socket_create_filter = "nft:blahblahblah": some nft thingy >>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 >> >> Such a scheme works for the socket create filter b/c it is a very simple use >> case. It does not work for the ingress and egress which allow generic bpf >> filters. > > Can you elaborate on what goes wrong? (Obviously the > "address_family_list" example makes no sense in that context.) Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are specific cases of the moment which goes against the intent of ebpf and generic programmability. >> >> ... >> you're ignoring use cases I described earlier. In vrf case there is only one ifindex it needs to bind to. >>> >>> I'm totally lost. Can you explain what this has to do with the cgroup >>> hierarchy? >> >> I think the point is that a group hierarchy makes no sense for the VRF use >> case. What I put into iproute2 is >> >> cgrp2/vrf/NAME >> >> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 >> sockets to a specific device index. cgrp2/vrf is the "default" vrf and does >> not have a filter. A user can certainly add another layer >> cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not >> make sense. > > I tend to agree. I still think that the mechanism as it stands is > broken in other respects and should be fixed before it goes live. I > have no desire to cause problems for the vrf use case. > > But keep in mind that the vrf use case is, in Linus' tree, a bit > broken right now in its interactions with other users of the same > mechanism. Suppose I create a container and want to trace all of its > created sockets. I'll set up cgrp2/container and load my tracer as a > socket creation hook. Then a container sets up > cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding > filter. Now the tracing stops working -- oops. There are other ways to achieve socket tracing, but I get your point -- nested cases do not work as users may want. > I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You > have to do it now (or disable the feature for 4.10). This is why I'm > bringing this whole thing up now. We don't have to touch user visible api here, so extensions are fine. >>> >>> Huh? My example in the original email attaches a program in a >>> sub-hierarchy. Are you saying that 4.11 could make that example stop >>> working? >> >> Are you suggesting sub-cgroups should not be allowed to override the filter >> of a parent cgroup? > > Yes, exactly. I think there are two sensible behaviors: > > a) sub-cgroups cannot have a filter at all of the parent has a filter. > (This is the "punt" approach -- it lets different semantics be > assigned later without breaking userspace.) > > b) sub-cgroups can have a filter if a parent does, too. The semantics > are that the sub-cgroup filter runs first and all side-effects occur. > If that filter says "reject" then ancestor filters are skipped. If > that filter says "accept", then the ancestor filter is run and its > side-effects happen as well. (And so on, all the way up to the root.) That comes with a big performance hit for skb / data path cases. I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part.
Re: [PATCH 2/7] PM / devfreq: exynos-ppmu: Use the regmap interface to handle the registers
Hi, On 2016년 12월 20일 04:47, Tobias Jakobi wrote: > Hello, > > I was just wondering what is improved by moving to regmap. For me this > looks like it only complicates the code. Lots of regmap_{read,write}() > and for each one of these we need to check the return code. It is correct to check the return value. It cover all of exception. > > Also when exactly did __raw_writel() and friends become legacy? Also, I want to use the regmap interface (regmap-io) instead of __raw_readl/__raw_writel using the base address directly. Regards, Chanwoo Choi > > With best wishes, > Tobias > > > Chanwoo Choi wrote: >> This patch uses the regmap interface to read and write the registers for >> exynos >> PPMU device instead of the legacy memory map functions. >> >> Cc: Kukjin Kim>> Cc: Krzysztof Kozlowski >> Cc: Javier Martinez Canillas >> Cc: linux-samsung-...@vger.kernel.org >> Signed-off-by: Chanwoo Choi >> --- >> drivers/devfreq/event/exynos-ppmu.c | 326 >> ++-- >> 1 file changed, 237 insertions(+), 89 deletions(-) >> >> diff --git a/drivers/devfreq/event/exynos-ppmu.c >> b/drivers/devfreq/event/exynos-ppmu.c >> index 107eb91a9415..fb3706faf5bd 100644 >> --- a/drivers/devfreq/event/exynos-ppmu.c >> +++ b/drivers/devfreq/event/exynos-ppmu.c >> @@ -17,13 +17,13 @@ >> #include >> #include >> #include >> +#include >> #include >> #include >> >> #include "exynos-ppmu.h" >> >> struct exynos_ppmu_data { >> -void __iomem *base; >> struct clk *clk; >> }; >> >> @@ -33,6 +33,7 @@ struct exynos_ppmu { >> unsigned int num_events; >> >> struct device *dev; >> +struct regmap *regmap; >> >> struct exynos_ppmu_data ppmu; >> }; >> @@ -107,20 +108,28 @@ static int exynos_ppmu_find_ppmu_id(struct >> devfreq_event_dev *edev) >> static int exynos_ppmu_disable(struct devfreq_event_dev *edev) >> { >> struct exynos_ppmu *info = devfreq_event_get_drvdata(edev); >> +int ret; >> u32 pmnc; >> >> /* Disable all counters */ >> -__raw_writel(PPMU_CCNT_MASK | >> - PPMU_PMCNT0_MASK | >> - PPMU_PMCNT1_MASK | >> - PPMU_PMCNT2_MASK | >> - PPMU_PMCNT3_MASK, >> - info->ppmu.base + PPMU_CNTENC); >> +ret = regmap_write(info->regmap, PPMU_CNTENC, >> +PPMU_CCNT_MASK | >> +PPMU_PMCNT0_MASK | >> +PPMU_PMCNT1_MASK | >> +PPMU_PMCNT2_MASK | >> +PPMU_PMCNT3_MASK); >> +if (ret < 0) >> +return ret; >> >> /* Disable PPMU */ >> -pmnc = __raw_readl(info->ppmu.base + PPMU_PMNC); >> +ret = regmap_read(info->regmap, PPMU_PMNC, ); >> +if (ret < 0) >> +return ret; >> + >> pmnc &= ~PPMU_PMNC_ENABLE_MASK; >> -__raw_writel(pmnc, info->ppmu.base + PPMU_PMNC); >> +ret = regmap_write(info->regmap, PPMU_PMNC, pmnc); >> +if (ret < 0) >> +return ret; >> >> return 0; >> } >> @@ -129,29 +138,42 @@ static int exynos_ppmu_set_event(struct >> devfreq_event_dev *edev) >> { >> struct exynos_ppmu *info = devfreq_event_get_drvdata(edev); >> int id = exynos_ppmu_find_ppmu_id(edev); >> +int ret; >> u32 pmnc, cntens; >> >> if (id < 0) >> return id; >> >> /* Enable specific counter */ >> -cntens = __raw_readl(info->ppmu.base + PPMU_CNTENS); >> +ret = regmap_read(info->regmap, PPMU_CNTENS, ); >> +if (ret < 0) >> +return ret; >> + >> cntens |= (PPMU_CCNT_MASK | (PPMU_ENABLE << id)); >> -__raw_writel(cntens, info->ppmu.base + PPMU_CNTENS); >> +ret = regmap_write(info->regmap, PPMU_CNTENS, cntens); >> +if (ret < 0) >> +return ret; >> >> /* Set the event of Read/Write data count */ >> -__raw_writel(PPMU_RO_DATA_CNT | PPMU_WO_DATA_CNT, >> -info->ppmu.base + PPMU_BEVTxSEL(id)); >> +ret = regmap_write(info->regmap, PPMU_BEVTxSEL(id), >> +PPMU_RO_DATA_CNT | PPMU_WO_DATA_CNT); >> +if (ret < 0) >> +return ret; >> >> /* Reset cycle counter/performance counter and enable PPMU */ >> -pmnc = __raw_readl(info->ppmu.base + PPMU_PMNC); >> +ret = regmap_read(info->regmap, PPMU_PMNC, ); >> +if (ret < 0) >> +return ret; >> + >> pmnc &= ~(PPMU_PMNC_ENABLE_MASK >> | PPMU_PMNC_COUNTER_RESET_MASK >> | PPMU_PMNC_CC_RESET_MASK); >> pmnc |= (PPMU_ENABLE << PPMU_PMNC_ENABLE_SHIFT); >> pmnc |= (PPMU_ENABLE << PPMU_PMNC_COUNTER_RESET_SHIFT); >> pmnc |= (PPMU_ENABLE << PPMU_PMNC_CC_RESET_SHIFT); >> -__raw_writel(pmnc, info->ppmu.base + PPMU_PMNC); >> +ret = regmap_write(info->regmap, PPMU_PMNC, pmnc); >> +if (ret < 0) >> +
Re: [PATCH 2/7] PM / devfreq: exynos-ppmu: Use the regmap interface to handle the registers
Hi, On 2016년 12월 20일 04:47, Tobias Jakobi wrote: > Hello, > > I was just wondering what is improved by moving to regmap. For me this > looks like it only complicates the code. Lots of regmap_{read,write}() > and for each one of these we need to check the return code. It is correct to check the return value. It cover all of exception. > > Also when exactly did __raw_writel() and friends become legacy? Also, I want to use the regmap interface (regmap-io) instead of __raw_readl/__raw_writel using the base address directly. Regards, Chanwoo Choi > > With best wishes, > Tobias > > > Chanwoo Choi wrote: >> This patch uses the regmap interface to read and write the registers for >> exynos >> PPMU device instead of the legacy memory map functions. >> >> Cc: Kukjin Kim >> Cc: Krzysztof Kozlowski >> Cc: Javier Martinez Canillas >> Cc: linux-samsung-...@vger.kernel.org >> Signed-off-by: Chanwoo Choi >> --- >> drivers/devfreq/event/exynos-ppmu.c | 326 >> ++-- >> 1 file changed, 237 insertions(+), 89 deletions(-) >> >> diff --git a/drivers/devfreq/event/exynos-ppmu.c >> b/drivers/devfreq/event/exynos-ppmu.c >> index 107eb91a9415..fb3706faf5bd 100644 >> --- a/drivers/devfreq/event/exynos-ppmu.c >> +++ b/drivers/devfreq/event/exynos-ppmu.c >> @@ -17,13 +17,13 @@ >> #include >> #include >> #include >> +#include >> #include >> #include >> >> #include "exynos-ppmu.h" >> >> struct exynos_ppmu_data { >> -void __iomem *base; >> struct clk *clk; >> }; >> >> @@ -33,6 +33,7 @@ struct exynos_ppmu { >> unsigned int num_events; >> >> struct device *dev; >> +struct regmap *regmap; >> >> struct exynos_ppmu_data ppmu; >> }; >> @@ -107,20 +108,28 @@ static int exynos_ppmu_find_ppmu_id(struct >> devfreq_event_dev *edev) >> static int exynos_ppmu_disable(struct devfreq_event_dev *edev) >> { >> struct exynos_ppmu *info = devfreq_event_get_drvdata(edev); >> +int ret; >> u32 pmnc; >> >> /* Disable all counters */ >> -__raw_writel(PPMU_CCNT_MASK | >> - PPMU_PMCNT0_MASK | >> - PPMU_PMCNT1_MASK | >> - PPMU_PMCNT2_MASK | >> - PPMU_PMCNT3_MASK, >> - info->ppmu.base + PPMU_CNTENC); >> +ret = regmap_write(info->regmap, PPMU_CNTENC, >> +PPMU_CCNT_MASK | >> +PPMU_PMCNT0_MASK | >> +PPMU_PMCNT1_MASK | >> +PPMU_PMCNT2_MASK | >> +PPMU_PMCNT3_MASK); >> +if (ret < 0) >> +return ret; >> >> /* Disable PPMU */ >> -pmnc = __raw_readl(info->ppmu.base + PPMU_PMNC); >> +ret = regmap_read(info->regmap, PPMU_PMNC, ); >> +if (ret < 0) >> +return ret; >> + >> pmnc &= ~PPMU_PMNC_ENABLE_MASK; >> -__raw_writel(pmnc, info->ppmu.base + PPMU_PMNC); >> +ret = regmap_write(info->regmap, PPMU_PMNC, pmnc); >> +if (ret < 0) >> +return ret; >> >> return 0; >> } >> @@ -129,29 +138,42 @@ static int exynos_ppmu_set_event(struct >> devfreq_event_dev *edev) >> { >> struct exynos_ppmu *info = devfreq_event_get_drvdata(edev); >> int id = exynos_ppmu_find_ppmu_id(edev); >> +int ret; >> u32 pmnc, cntens; >> >> if (id < 0) >> return id; >> >> /* Enable specific counter */ >> -cntens = __raw_readl(info->ppmu.base + PPMU_CNTENS); >> +ret = regmap_read(info->regmap, PPMU_CNTENS, ); >> +if (ret < 0) >> +return ret; >> + >> cntens |= (PPMU_CCNT_MASK | (PPMU_ENABLE << id)); >> -__raw_writel(cntens, info->ppmu.base + PPMU_CNTENS); >> +ret = regmap_write(info->regmap, PPMU_CNTENS, cntens); >> +if (ret < 0) >> +return ret; >> >> /* Set the event of Read/Write data count */ >> -__raw_writel(PPMU_RO_DATA_CNT | PPMU_WO_DATA_CNT, >> -info->ppmu.base + PPMU_BEVTxSEL(id)); >> +ret = regmap_write(info->regmap, PPMU_BEVTxSEL(id), >> +PPMU_RO_DATA_CNT | PPMU_WO_DATA_CNT); >> +if (ret < 0) >> +return ret; >> >> /* Reset cycle counter/performance counter and enable PPMU */ >> -pmnc = __raw_readl(info->ppmu.base + PPMU_PMNC); >> +ret = regmap_read(info->regmap, PPMU_PMNC, ); >> +if (ret < 0) >> +return ret; >> + >> pmnc &= ~(PPMU_PMNC_ENABLE_MASK >> | PPMU_PMNC_COUNTER_RESET_MASK >> | PPMU_PMNC_CC_RESET_MASK); >> pmnc |= (PPMU_ENABLE << PPMU_PMNC_ENABLE_SHIFT); >> pmnc |= (PPMU_ENABLE << PPMU_PMNC_COUNTER_RESET_SHIFT); >> pmnc |= (PPMU_ENABLE << PPMU_PMNC_CC_RESET_SHIFT); >> -__raw_writel(pmnc, info->ppmu.base + PPMU_PMNC); >> +ret = regmap_write(info->regmap, PPMU_PMNC, pmnc); >> +if (ret < 0) >> +return ret; >> >> return 0; >> } >> @@ -161,40 +183,64 @@ static int
[PATCH] rtlwifi: Fix kernel oops introduced with commit e49656147359
With commit e49656147359 {"rtlwifi: Use dev_kfree_skb_irq instead of kfree_skb"), the method used to free an skb was changed because the kfree_skb() was inside a spinlock. What was forgotten is that kfree_skb() guards against a NULL value for the argument. Routine dev_kfree_skb_irq() does not, and a test is needed to prevent kernel panics. Fixes: commit e49656147359 {"rtlwifi: Use dev_kfree_skb_irq instead of kfree_skb") Signed-off-by: Larry FingerCc: Stable (4.9+) Cc: Wei Yongjun --- Kalle, This change should be sent to mainline during the 4.10 merge period, or as soon as possible. Thanks, Larry --- drivers/net/wireless/realtek/rtlwifi/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/wireless/realtek/rtlwifi/core.c b/drivers/net/wireless/realtek/rtlwifi/core.c index 955055b..df8b977 100644 --- a/drivers/net/wireless/realtek/rtlwifi/core.c +++ b/drivers/net/wireless/realtek/rtlwifi/core.c @@ -1823,7 +1823,8 @@ bool rtl_cmd_send_packet(struct ieee80211_hw *hw, struct sk_buff *skb) spin_lock_irqsave(>locks.irq_th_lock, flags); pskb = __skb_dequeue(>queue); - dev_kfree_skb_irq(pskb); + if (pskb) + dev_kfree_skb_irq(pskb); /*this is wrong, fill_tx_cmddesc needs update*/ pdesc = >desc[0]; -- 2.10.2
[PATCH] rtlwifi: Fix kernel oops introduced with commit e49656147359
With commit e49656147359 {"rtlwifi: Use dev_kfree_skb_irq instead of kfree_skb"), the method used to free an skb was changed because the kfree_skb() was inside a spinlock. What was forgotten is that kfree_skb() guards against a NULL value for the argument. Routine dev_kfree_skb_irq() does not, and a test is needed to prevent kernel panics. Fixes: commit e49656147359 {"rtlwifi: Use dev_kfree_skb_irq instead of kfree_skb") Signed-off-by: Larry Finger Cc: Stable (4.9+) Cc: Wei Yongjun --- Kalle, This change should be sent to mainline during the 4.10 merge period, or as soon as possible. Thanks, Larry --- drivers/net/wireless/realtek/rtlwifi/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/wireless/realtek/rtlwifi/core.c b/drivers/net/wireless/realtek/rtlwifi/core.c index 955055b..df8b977 100644 --- a/drivers/net/wireless/realtek/rtlwifi/core.c +++ b/drivers/net/wireless/realtek/rtlwifi/core.c @@ -1823,7 +1823,8 @@ bool rtl_cmd_send_packet(struct ieee80211_hw *hw, struct sk_buff *skb) spin_lock_irqsave(>locks.irq_th_lock, flags); pskb = __skb_dequeue(>queue); - dev_kfree_skb_irq(pskb); + if (pskb) + dev_kfree_skb_irq(pskb); /*this is wrong, fill_tx_cmddesc needs update*/ pdesc = >desc[0]; -- 2.10.2
Re: [PATCH] block: loose check on sg gap
On 12/19/2016 07:07 PM, Ming Lei wrote: > On Sun, Dec 18, 2016 at 12:49 AM, Jens Axboewrote: >> On 12/17/2016 03:49 AM, Ming Lei wrote: >>> If the last bvec of the 1st bio and the 1st bvec of the next >>> bio are contineous physically, and the latter can be merged >>> to last segment of the 1st bio, we should think they don't >>> violate sg gap(or virt boundary) limit. >>> >>> Both Vitaly and Dexuan reported lots of unmergeable small bios >>> are observed when running mkfs on Hyper-V virtual storage, and >>> performance becomes quite low, so this patch is figured out for >>> fixing the performance issue. >>> >>> The same issue should exist on NVMe too sine it sets virt boundary too. >> >> It looks pretty reasonable to me. I'll queue it up for some testing, >> changes like this always make me a little nervous. > > Understood. > > But given it is still in early stage of 4.10 cycle, seems fine to expose > it now, and we should have enough time to fix it if there might be > regressions. > > BTW, it passes my xfstest(ext4) over sata/NVMe. It's been fine here in testing, too. I'm not worried about performance regressions, those we can always fix. Merging makes me worried about corruption, and those regressions are much worse. Any reason we need to rush this? I'd be more comfortable pushing this to 4.11, unless there are strong reasons this should make 4.10. -- Jens Axboe
Re: [PATCH] block: loose check on sg gap
On 12/19/2016 07:07 PM, Ming Lei wrote: > On Sun, Dec 18, 2016 at 12:49 AM, Jens Axboe wrote: >> On 12/17/2016 03:49 AM, Ming Lei wrote: >>> If the last bvec of the 1st bio and the 1st bvec of the next >>> bio are contineous physically, and the latter can be merged >>> to last segment of the 1st bio, we should think they don't >>> violate sg gap(or virt boundary) limit. >>> >>> Both Vitaly and Dexuan reported lots of unmergeable small bios >>> are observed when running mkfs on Hyper-V virtual storage, and >>> performance becomes quite low, so this patch is figured out for >>> fixing the performance issue. >>> >>> The same issue should exist on NVMe too sine it sets virt boundary too. >> >> It looks pretty reasonable to me. I'll queue it up for some testing, >> changes like this always make me a little nervous. > > Understood. > > But given it is still in early stage of 4.10 cycle, seems fine to expose > it now, and we should have enough time to fix it if there might be > regressions. > > BTW, it passes my xfstest(ext4) over sata/NVMe. It's been fine here in testing, too. I'm not worried about performance regressions, those we can always fix. Merging makes me worried about corruption, and those regressions are much worse. Any reason we need to rush this? I'd be more comfortable pushing this to 4.11, unless there are strong reasons this should make 4.10. -- Jens Axboe
Re: [RFC][PATCH] make global bitlock waitqueues per-node
On Mon, 19 Dec 2016 16:20:05 -0800 Dave Hansenwrote: > On 12/19/2016 03:07 PM, Linus Torvalds wrote: > > +wait_queue_head_t *bit_waitqueue(void *word, int bit) > > +{ > > + const int __maybe_unused nid = page_to_nid(virt_to_page(word)); > > + > > + return __bit_waitqueue(word, bit, nid); > > > > No can do. Part of the problem with the old coffee was that it did that > > virt_to_page() crud. That doesn't work with the virtually mapped stack. > > Ahhh, got it. > > So, what did you have in mind? Just redirect bit_waitqueue() to the > "first_online_node" waitqueues? > > wait_queue_head_t *bit_waitqueue(void *word, int bit) > { > return __bit_waitqueue(word, bit, first_online_node); > } > > We could do some fancy stuff like only do virt_to_page() for things in > the linear map, but I'm not sure we'll see much of a gain for it. None > of the other waitqueue users look as pathological as the 'struct page' > ones. Maybe: > > wait_queue_head_t *bit_waitqueue(void *word, int bit) > { > int nid > if (word >= VMALLOC_START) /* all addrs not in linear map */ > nid = first_online_node; > else > nid = page_to_nid(virt_to_page(word)); > return __bit_waitqueue(word, bit, nid); > } I think he meant just make the page_waitqueue do the per-node thing and leave bit_waitqueue as the global bit. It would be cool if CPUs had an instruction that translates an address though. You could avoid all that lookup and just do it with the TLB :) Thanks, Nick
Re: [RFC][PATCH] make global bitlock waitqueues per-node
On Mon, 19 Dec 2016 16:20:05 -0800 Dave Hansen wrote: > On 12/19/2016 03:07 PM, Linus Torvalds wrote: > > +wait_queue_head_t *bit_waitqueue(void *word, int bit) > > +{ > > + const int __maybe_unused nid = page_to_nid(virt_to_page(word)); > > + > > + return __bit_waitqueue(word, bit, nid); > > > > No can do. Part of the problem with the old coffee was that it did that > > virt_to_page() crud. That doesn't work with the virtually mapped stack. > > Ahhh, got it. > > So, what did you have in mind? Just redirect bit_waitqueue() to the > "first_online_node" waitqueues? > > wait_queue_head_t *bit_waitqueue(void *word, int bit) > { > return __bit_waitqueue(word, bit, first_online_node); > } > > We could do some fancy stuff like only do virt_to_page() for things in > the linear map, but I'm not sure we'll see much of a gain for it. None > of the other waitqueue users look as pathological as the 'struct page' > ones. Maybe: > > wait_queue_head_t *bit_waitqueue(void *word, int bit) > { > int nid > if (word >= VMALLOC_START) /* all addrs not in linear map */ > nid = first_online_node; > else > nid = page_to_nid(virt_to_page(word)); > return __bit_waitqueue(word, bit, nid); > } I think he meant just make the page_waitqueue do the per-node thing and leave bit_waitqueue as the global bit. It would be cool if CPUs had an instruction that translates an address though. You could avoid all that lookup and just do it with the TLB :) Thanks, Nick
Re: [RFC][PATCH] make global bitlock waitqueues per-node
On Mon, 19 Dec 2016 14:58:26 -0800 Dave Hansenwrote: > I saw a 4.8->4.9 regression (details below) that I attributed to: > > 9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues > > That commit took the bitlock waitqueues from being dynamically-allocated > per-zone to being statically allocated and global. As suggested by > Linus, this makes them per-node, but keeps them statically-allocated. > > It leaves us with more waitqueues than the global approach, inherently > scales it up as we gain nodes, and avoids generating code for > page_zone() which was evidently quite ugly. The patch is pretty darn > tiny too. > > This turns what was a ~40% 4.8->4.9 regression into a 17% gain over > what on 4.8 did. That gain is a _bit_ surprising, but not entirely > unexpected since we now get much simpler code from no page_zone() and a > fixed-size array for which we don't have to follow a pointer (and get to > do power-of-2 math). I'll have to respin the PageWaiters patch and resend it. There were just a couple of small issues picked up in review. I've just got side tracked with getting a few other things done and haven't had time to benchmark it properly. I'd still like to see what per-node waitqueues does on top of that. If it's significant for realistic workloads then it could be done for the page waitqueues as Linus said. Thanks, Nick
Re: [RFC][PATCH] make global bitlock waitqueues per-node
On Mon, 19 Dec 2016 14:58:26 -0800 Dave Hansen wrote: > I saw a 4.8->4.9 regression (details below) that I attributed to: > > 9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues > > That commit took the bitlock waitqueues from being dynamically-allocated > per-zone to being statically allocated and global. As suggested by > Linus, this makes them per-node, but keeps them statically-allocated. > > It leaves us with more waitqueues than the global approach, inherently > scales it up as we gain nodes, and avoids generating code for > page_zone() which was evidently quite ugly. The patch is pretty darn > tiny too. > > This turns what was a ~40% 4.8->4.9 regression into a 17% gain over > what on 4.8 did. That gain is a _bit_ surprising, but not entirely > unexpected since we now get much simpler code from no page_zone() and a > fixed-size array for which we don't have to follow a pointer (and get to > do power-of-2 math). I'll have to respin the PageWaiters patch and resend it. There were just a couple of small issues picked up in review. I've just got side tracked with getting a few other things done and haven't had time to benchmark it properly. I'd still like to see what per-node waitqueues does on top of that. If it's significant for realistic workloads then it could be done for the page waitqueues as Linus said. Thanks, Nick
Re: [PATCH 4/4] oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA
Hi Kirill, [auto build test WARNING on mmotm/master] [also build test WARNING on v4.9 next-20161219] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/mm-drop-zap_details-ignore_dirty/20161220-092938 base: git://git.cmpxchg.org/linux-mmotm.git master config: x86_64-randconfig-x004-201651 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All warnings (new ones prefixed by >>): In file included from include/asm-generic/bug.h:4:0, from arch/x86/include/asm/bug.h:35, from include/linux/bug.h:4, from include/linux/mmdebug.h:4, from include/linux/mm.h:8, from include/linux/mman.h:4, from mm/madvise.c:8: mm/madvise.c: In function 'madvise_dontneed': mm/madvise.c:476:7: error: implicit declaration of function 'can_madv_dontneed_vma' [-Werror=implicit-function-declaration] if (!can_madv_dontneed_vma(vma)) ^ include/linux/compiler.h:149:30: note: in definition of macro '__trace_if' if (__builtin_constant_p(!!(cond)) ? !!(cond) : \ ^~~~ >> mm/madvise.c:476:2: note: in expansion of macro 'if' if (!can_madv_dontneed_vma(vma)) ^~ cc1: some warnings being treated as errors vim +/if +476 mm/madvise.c 460 * 461 * NB: This interface discards data rather than pushes it out to swap, 462 * as some implementations do. This has performance implications for 463 * applications like large transactional databases which want to discard 464 * pages in anonymous maps after committing to backing store the data 465 * that was kept in them. There is no reason to write this data out to 466 * the swap area if the application is discarding it. 467 * 468 * An interface that causes the system to free clean pages and flush 469 * dirty pages is already available as msync(MS_INVALIDATE). 470 */ 471 static long madvise_dontneed(struct vm_area_struct *vma, 472 struct vm_area_struct **prev, 473 unsigned long start, unsigned long end) 474 { 475 *prev = vma; > 476 if (!can_madv_dontneed_vma(vma)) 477 return -EINVAL; 478 479 zap_page_range(vma, start, end - start); 480 return 0; 481 } 482 483 /* 484 * Application wants to free up the pages and associated backing store. --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [PATCH 4/4] oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA
Hi Kirill, [auto build test WARNING on mmotm/master] [also build test WARNING on v4.9 next-20161219] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/mm-drop-zap_details-ignore_dirty/20161220-092938 base: git://git.cmpxchg.org/linux-mmotm.git master config: x86_64-randconfig-x004-201651 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All warnings (new ones prefixed by >>): In file included from include/asm-generic/bug.h:4:0, from arch/x86/include/asm/bug.h:35, from include/linux/bug.h:4, from include/linux/mmdebug.h:4, from include/linux/mm.h:8, from include/linux/mman.h:4, from mm/madvise.c:8: mm/madvise.c: In function 'madvise_dontneed': mm/madvise.c:476:7: error: implicit declaration of function 'can_madv_dontneed_vma' [-Werror=implicit-function-declaration] if (!can_madv_dontneed_vma(vma)) ^ include/linux/compiler.h:149:30: note: in definition of macro '__trace_if' if (__builtin_constant_p(!!(cond)) ? !!(cond) : \ ^~~~ >> mm/madvise.c:476:2: note: in expansion of macro 'if' if (!can_madv_dontneed_vma(vma)) ^~ cc1: some warnings being treated as errors vim +/if +476 mm/madvise.c 460 * 461 * NB: This interface discards data rather than pushes it out to swap, 462 * as some implementations do. This has performance implications for 463 * applications like large transactional databases which want to discard 464 * pages in anonymous maps after committing to backing store the data 465 * that was kept in them. There is no reason to write this data out to 466 * the swap area if the application is discarding it. 467 * 468 * An interface that causes the system to free clean pages and flush 469 * dirty pages is already available as msync(MS_INVALIDATE). 470 */ 471 static long madvise_dontneed(struct vm_area_struct *vma, 472 struct vm_area_struct **prev, 473 unsigned long start, unsigned long end) 474 { 475 *prev = vma; > 476 if (!can_madv_dontneed_vma(vma)) 477 return -EINVAL; 478 479 zap_page_range(vma, start, end - start); 480 return 0; 481 } 482 483 /* 484 * Application wants to free up the pages and associated backing store. --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
[PATCH 1/2] net: hix5hd2_gmac: fix compatible strings name
The SoC hix5hd2 compatible string has the suffix "-gmac" and we should not change its compatible string. So we should name all the compatible string with the suffix "-gmac". Creating a new name suffix "-gemac" is unnecessary. We also add another SoC compatible string in dt binding documentation and describe which generic version the SoC belongs to. Fixes: d0fb6ba75dc0 ("net: hix5hd2_gmac: add generic compatible string") Signed-off-by: Dongpo Li--- .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt | 13 - drivers/net/ethernet/hisilicon/hix5hd2_gmac.c | 13 +++-- 2 files changed, 15 insertions(+), 11 deletions(-) diff --git a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt index 063c02d..eea73ad 100644 --- a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt +++ b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt @@ -2,11 +2,14 @@ Hisilicon hix5hd2 gmac controller Required properties: - compatible: should contain one of the following SoC strings: - * "hisilicon,hix5hd2-gemac" - * "hisilicon,hi3798cv200-gemac" + * "hisilicon,hix5hd2-gmac" + * "hisilicon,hi3798cv200-gmac" + * "hisilicon,hi3516a-gmac" and one of the following version string: - * "hisilicon,hisi-gemac-v1" - * "hisilicon,hisi-gemac-v2" + * "hisilicon,hisi-gmac-v1" + * "hisilicon,hisi-gmac-v2" + The version v1 includes SoCs hix5hd2. + The version v2 includes SoCs hi3798cv200, hi3516a. - reg: specifies base physical address(s) and size of the device registers. The first region is the MAC register base and size. The second region is external interface control register. @@ -35,7 +38,7 @@ Required properties: Example: gmac0: ethernet@f984 { - compatible = "hisilicon,hi3798cv200-gemac", "hisilicon,hisi-gemac-v2"; + compatible = "hisilicon,hi3798cv200-gmac", "hisilicon,hisi-gmac-v2"; reg = <0xf984 0x1000>,<0xf984300c 0x4>; interrupts = <0 71 4>; #address-cells = <1>; diff --git a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c index ee7e9ce..418ca1f3 100644 --- a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c +++ b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c @@ -1316,10 +1316,11 @@ static int hix5hd2_dev_remove(struct platform_device *pdev) } static const struct of_device_id hix5hd2_of_match[] = { - { .compatible = "hisilicon,hisi-gemac-v1", .data = (void *)GEMAC_V1 }, - { .compatible = "hisilicon,hisi-gemac-v2", .data = (void *)GEMAC_V2 }, - { .compatible = "hisilicon,hix5hd2-gemac", .data = (void *)GEMAC_V1 }, - { .compatible = "hisilicon,hi3798cv200-gemac", .data = (void *)GEMAC_V2 }, + { .compatible = "hisilicon,hisi-gmac-v1", .data = (void *)GEMAC_V1 }, + { .compatible = "hisilicon,hisi-gmac-v2", .data = (void *)GEMAC_V2 }, + { .compatible = "hisilicon,hix5hd2-gmac", .data = (void *)GEMAC_V1 }, + { .compatible = "hisilicon,hi3798cv200-gmac", .data = (void *)GEMAC_V2 }, + { .compatible = "hisilicon,hi3516a-gmac", .data = (void *)GEMAC_V2 }, {}, }; @@ -1327,7 +1328,7 @@ MODULE_DEVICE_TABLE(of, hix5hd2_of_match); static struct platform_driver hix5hd2_dev_driver = { .driver = { - .name = "hisi-gemac", + .name = "hisi-gmac", .of_match_table = hix5hd2_of_match, }, .probe = hix5hd2_dev_probe, @@ -1338,4 +1339,4 @@ module_platform_driver(hix5hd2_dev_driver); MODULE_DESCRIPTION("HISILICON Gigabit Ethernet MAC driver"); MODULE_LICENSE("GPL v2"); -MODULE_ALIAS("platform:hisi-gemac"); +MODULE_ALIAS("platform:hisi-gmac"); -- 2.8.2
[PATCH 2/2] ARM: dts: hix5hd2: don't change the existing compatible string
The SoC hix5hd2 compatible string has the suffix "-gmac" and we should not change it. We should only add the generic compatible string "hisi-gmac-v1". Fixes: 0855950ba580 ("ARM: dts: hix5hd2: add gmac generic compatible and clock names") Signed-off-by: Dongpo Li--- arch/arm/boot/dts/hisi-x5hd2.dtsi | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arm/boot/dts/hisi-x5hd2.dtsi b/arch/arm/boot/dts/hisi-x5hd2.dtsi index 0da76c5..2d06f4c 100644 --- a/arch/arm/boot/dts/hisi-x5hd2.dtsi +++ b/arch/arm/boot/dts/hisi-x5hd2.dtsi @@ -436,7 +436,7 @@ }; gmac0: ethernet@184 { - compatible = "hisilicon,hix5hd2-gemac", "hisilicon,hisi-gemac-v1"; + compatible = "hisilicon,hix5hd2-gmac", "hisilicon,hisi-gmac-v1"; reg = <0x184 0x1000>,<0x184300c 0x4>; interrupts = <0 71 4>; clocks = < HIX5HD2_MAC0_CLK>; @@ -445,7 +445,7 @@ }; gmac1: ethernet@1841000 { - compatible = "hisilicon,hix5hd2-gemac", "hisilicon,hisi-gemac-v1"; + compatible = "hisilicon,hix5hd2-gmac", "hisilicon,hisi-gmac-v1"; reg = <0x1841000 0x1000>,<0x1843010 0x4>; interrupts = <0 72 4>; clocks = < HIX5HD2_MAC1_CLK>; -- 2.8.2
[PATCH 1/2] net: hix5hd2_gmac: fix compatible strings name
The SoC hix5hd2 compatible string has the suffix "-gmac" and we should not change its compatible string. So we should name all the compatible string with the suffix "-gmac". Creating a new name suffix "-gemac" is unnecessary. We also add another SoC compatible string in dt binding documentation and describe which generic version the SoC belongs to. Fixes: d0fb6ba75dc0 ("net: hix5hd2_gmac: add generic compatible string") Signed-off-by: Dongpo Li --- .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt | 13 - drivers/net/ethernet/hisilicon/hix5hd2_gmac.c | 13 +++-- 2 files changed, 15 insertions(+), 11 deletions(-) diff --git a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt index 063c02d..eea73ad 100644 --- a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt +++ b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt @@ -2,11 +2,14 @@ Hisilicon hix5hd2 gmac controller Required properties: - compatible: should contain one of the following SoC strings: - * "hisilicon,hix5hd2-gemac" - * "hisilicon,hi3798cv200-gemac" + * "hisilicon,hix5hd2-gmac" + * "hisilicon,hi3798cv200-gmac" + * "hisilicon,hi3516a-gmac" and one of the following version string: - * "hisilicon,hisi-gemac-v1" - * "hisilicon,hisi-gemac-v2" + * "hisilicon,hisi-gmac-v1" + * "hisilicon,hisi-gmac-v2" + The version v1 includes SoCs hix5hd2. + The version v2 includes SoCs hi3798cv200, hi3516a. - reg: specifies base physical address(s) and size of the device registers. The first region is the MAC register base and size. The second region is external interface control register. @@ -35,7 +38,7 @@ Required properties: Example: gmac0: ethernet@f984 { - compatible = "hisilicon,hi3798cv200-gemac", "hisilicon,hisi-gemac-v2"; + compatible = "hisilicon,hi3798cv200-gmac", "hisilicon,hisi-gmac-v2"; reg = <0xf984 0x1000>,<0xf984300c 0x4>; interrupts = <0 71 4>; #address-cells = <1>; diff --git a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c index ee7e9ce..418ca1f3 100644 --- a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c +++ b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c @@ -1316,10 +1316,11 @@ static int hix5hd2_dev_remove(struct platform_device *pdev) } static const struct of_device_id hix5hd2_of_match[] = { - { .compatible = "hisilicon,hisi-gemac-v1", .data = (void *)GEMAC_V1 }, - { .compatible = "hisilicon,hisi-gemac-v2", .data = (void *)GEMAC_V2 }, - { .compatible = "hisilicon,hix5hd2-gemac", .data = (void *)GEMAC_V1 }, - { .compatible = "hisilicon,hi3798cv200-gemac", .data = (void *)GEMAC_V2 }, + { .compatible = "hisilicon,hisi-gmac-v1", .data = (void *)GEMAC_V1 }, + { .compatible = "hisilicon,hisi-gmac-v2", .data = (void *)GEMAC_V2 }, + { .compatible = "hisilicon,hix5hd2-gmac", .data = (void *)GEMAC_V1 }, + { .compatible = "hisilicon,hi3798cv200-gmac", .data = (void *)GEMAC_V2 }, + { .compatible = "hisilicon,hi3516a-gmac", .data = (void *)GEMAC_V2 }, {}, }; @@ -1327,7 +1328,7 @@ MODULE_DEVICE_TABLE(of, hix5hd2_of_match); static struct platform_driver hix5hd2_dev_driver = { .driver = { - .name = "hisi-gemac", + .name = "hisi-gmac", .of_match_table = hix5hd2_of_match, }, .probe = hix5hd2_dev_probe, @@ -1338,4 +1339,4 @@ module_platform_driver(hix5hd2_dev_driver); MODULE_DESCRIPTION("HISILICON Gigabit Ethernet MAC driver"); MODULE_LICENSE("GPL v2"); -MODULE_ALIAS("platform:hisi-gemac"); +MODULE_ALIAS("platform:hisi-gmac"); -- 2.8.2
[PATCH 2/2] ARM: dts: hix5hd2: don't change the existing compatible string
The SoC hix5hd2 compatible string has the suffix "-gmac" and we should not change it. We should only add the generic compatible string "hisi-gmac-v1". Fixes: 0855950ba580 ("ARM: dts: hix5hd2: add gmac generic compatible and clock names") Signed-off-by: Dongpo Li --- arch/arm/boot/dts/hisi-x5hd2.dtsi | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arm/boot/dts/hisi-x5hd2.dtsi b/arch/arm/boot/dts/hisi-x5hd2.dtsi index 0da76c5..2d06f4c 100644 --- a/arch/arm/boot/dts/hisi-x5hd2.dtsi +++ b/arch/arm/boot/dts/hisi-x5hd2.dtsi @@ -436,7 +436,7 @@ }; gmac0: ethernet@184 { - compatible = "hisilicon,hix5hd2-gemac", "hisilicon,hisi-gemac-v1"; + compatible = "hisilicon,hix5hd2-gmac", "hisilicon,hisi-gmac-v1"; reg = <0x184 0x1000>,<0x184300c 0x4>; interrupts = <0 71 4>; clocks = < HIX5HD2_MAC0_CLK>; @@ -445,7 +445,7 @@ }; gmac1: ethernet@1841000 { - compatible = "hisilicon,hix5hd2-gemac", "hisilicon,hisi-gemac-v1"; + compatible = "hisilicon,hix5hd2-gmac", "hisilicon,hisi-gmac-v1"; reg = <0x1841000 0x1000>,<0x1843010 0x4>; interrupts = <0 72 4>; clocks = < HIX5HD2_MAC1_CLK>; -- 2.8.2
[PATCH 0/2] net: hix5hd2_gmac: keep the compatible string not changed
This patch series fix the patch: d0fb6ba75dc0 ("net: hix5hd2_gmac: add generic compatible string") The SoC hix5hd2 compatible string has the suffix "-gmac" and we should not change its compatible string. So we should name all the compatible string with the suffix "-gmac". Creating a new name suffix "-gemac" is unnecessary. Dongpo Li (2): net: hix5hd2_gmac: fix compatible strings name ARM: dts: hix5hd2: don't change the existing compatible string .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt | 13 - arch/arm/boot/dts/hisi-x5hd2.dtsi | 4 ++-- drivers/net/ethernet/hisilicon/hix5hd2_gmac.c | 13 +++-- 3 files changed, 17 insertions(+), 13 deletions(-) -- 2.8.2
[PATCH 0/2] net: hix5hd2_gmac: keep the compatible string not changed
This patch series fix the patch: d0fb6ba75dc0 ("net: hix5hd2_gmac: add generic compatible string") The SoC hix5hd2 compatible string has the suffix "-gmac" and we should not change its compatible string. So we should name all the compatible string with the suffix "-gmac". Creating a new name suffix "-gemac" is unnecessary. Dongpo Li (2): net: hix5hd2_gmac: fix compatible strings name ARM: dts: hix5hd2: don't change the existing compatible string .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt | 13 - arch/arm/boot/dts/hisi-x5hd2.dtsi | 4 ++-- drivers/net/ethernet/hisilicon/hix5hd2_gmac.c | 13 +++-- 3 files changed, 17 insertions(+), 13 deletions(-) -- 2.8.2
Re: OOM: Better, but still there on
On Mon, Dec 19, 2016 at 02:45:34PM +0100, Michal Hocko wrote: > Unfortunatelly shrink_active_list doesn't have any tracepoint so we do > not know whether we managed to rotate those pages. If they are referenced > quickly enough we might just keep refaulting them... Could you try to apply > the followin diff on top what you have currently. It should add some more > tracepoint data which might tell us more. We can reduce the amount of > tracing data by enabling only mm_vmscan_lru_isolate, > mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active. So, the results are in! I applied your patch and rebuild the kernel, then I rebooted the machine, set up tracing so that only the three events you mentioned were being traced, and captured the output over the network. Things went a bit different this time: The trace events started to appear after a while and a whole lot of them were generated, but suddenly they stopped. A short while later, we get [ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE) along with a backtrace and memory information, and then there was silence. When I walked up to the machine, it had completely died; it wouldn't turn on its screen on key press any more, blindly trying to reboot via SysRequest had no effect, but the caps lock LED also wasn't blinking, like it normally does when a kernel panic occurs. Good question what state it was in. The OOM reaper didn't really seem to kick in and kill processes this time, it seems. The complete capture is up at: http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz Greetings Nils
Re: OOM: Better, but still there on
On Mon, Dec 19, 2016 at 02:45:34PM +0100, Michal Hocko wrote: > Unfortunatelly shrink_active_list doesn't have any tracepoint so we do > not know whether we managed to rotate those pages. If they are referenced > quickly enough we might just keep refaulting them... Could you try to apply > the followin diff on top what you have currently. It should add some more > tracepoint data which might tell us more. We can reduce the amount of > tracing data by enabling only mm_vmscan_lru_isolate, > mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active. So, the results are in! I applied your patch and rebuild the kernel, then I rebooted the machine, set up tracing so that only the three events you mentioned were being traced, and captured the output over the network. Things went a bit different this time: The trace events started to appear after a while and a whole lot of them were generated, but suddenly they stopped. A short while later, we get [ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE) along with a backtrace and memory information, and then there was silence. When I walked up to the machine, it had completely died; it wouldn't turn on its screen on key press any more, blindly trying to reboot via SysRequest had no effect, but the caps lock LED also wasn't blinking, like it normally does when a kernel panic occurs. Good question what state it was in. The OOM reaper didn't really seem to kick in and kill processes this time, it seems. The complete capture is up at: http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz Greetings Nils
Re: [PATCH] block: loose check on sg gap
On Sun, Dec 18, 2016 at 12:49 AM, Jens Axboewrote: > On 12/17/2016 03:49 AM, Ming Lei wrote: >> If the last bvec of the 1st bio and the 1st bvec of the next >> bio are contineous physically, and the latter can be merged >> to last segment of the 1st bio, we should think they don't >> violate sg gap(or virt boundary) limit. >> >> Both Vitaly and Dexuan reported lots of unmergeable small bios >> are observed when running mkfs on Hyper-V virtual storage, and >> performance becomes quite low, so this patch is figured out for >> fixing the performance issue. >> >> The same issue should exist on NVMe too sine it sets virt boundary too. > > It looks pretty reasonable to me. I'll queue it up for some testing, > changes like this always make me a little nervous. Understood. But given it is still in early stage of 4.10 cycle, seems fine to expose it now, and we should have enough time to fix it if there might be regressions. BTW, it passes my xfstest(ext4) over sata/NVMe. Thanks, Ming
Re: [PATCH] block: loose check on sg gap
On Sun, Dec 18, 2016 at 12:49 AM, Jens Axboe wrote: > On 12/17/2016 03:49 AM, Ming Lei wrote: >> If the last bvec of the 1st bio and the 1st bvec of the next >> bio are contineous physically, and the latter can be merged >> to last segment of the 1st bio, we should think they don't >> violate sg gap(or virt boundary) limit. >> >> Both Vitaly and Dexuan reported lots of unmergeable small bios >> are observed when running mkfs on Hyper-V virtual storage, and >> performance becomes quite low, so this patch is figured out for >> fixing the performance issue. >> >> The same issue should exist on NVMe too sine it sets virt boundary too. > > It looks pretty reasonable to me. I'll queue it up for some testing, > changes like this always make me a little nervous. Understood. But given it is still in early stage of 4.10 cycle, seems fine to expose it now, and we should have enough time to fix it if there might be regressions. BTW, it passes my xfstest(ext4) over sata/NVMe. Thanks, Ming
[PATCH] scsi: do not requeue requests unaligned with device sector size
When a SCSI command (e.g., read operation) is partially completed with good status and residual bytes (i.e., not all the bytes from the specified transfer length were transferred) the SCSI midlayer will update the request/bios with the completed bytes and requeue the request in order to complete the remainder/pending bytes. However, when the device sector size is greater than the 512-byte default/kernel sector size, alignment restrictions and validation apply (both to the starting logical block address, and the number of logical blocks to transfer) -- values must be multiples of the device sector size, otherwise the kernel fails the request in the preparation stage (e.g., sd_setup_read_write_cmnd() at sd.c file): [...] sd 0:0:0:0: [sda] Bad block number requested Hence, this error message (and the respective failed request) can be observed on devices with larger sector sizes which may respond the SCSI command with a SCSI residual size that is unaligned with the device sector size -- because the requeued request's starting logical block and number of logical blocks are based on the value of the remainder/pending bytes. In order to address this problem, introduce a check for this case in scsi_io_completion() (before it calls scsi_end_request() which in turn calls blk_update_request() which is the site that changes the request's __sector and __data_len fields, which are validated by sd_setup_read_write_cmnd()). This check verifies whether there is any residual/remainder bytes in the (potentially partially) completed requested and calculates the correctly aligned values for the number of completed bytes to pass up to scsi_end_request()/blk_update_request() that guarantee that the requeued request is aligned with the device sector size. The corner case is when one sector is requested and the response is partially complete, for which the remainder/pending bytes are unaligned and no further request would be valid. On such a case, the original request is retried after a delay (in case the error is hopefully due to a temporary condition in the device), but up to the retry limit (in case the condition is permanent, e.g. bad sector in the medium), after which the request is finally failed. In order to reproduce and verify this problem, the virtio_scsi.c file can be modified to respond to 3 'special' SCSI commands, on which partial completions are introduced (described in the patch). This is the guest's disk test image and libvirt XML snippets for the 4k-sector disk using the virtio scsi driver: # qemu-img create -f qcow2 /var/lib/libvirt/images/test.qcow2 128G And the verification in the guest: # cat /sys/block/sda/queue/physical_block_size 4096 # cat /sys/block/sda/queue/hw_sector_size 4096 This is the patch to virtio_scsi.c (lines prefixed with ' ___ '): ___ --- a/drivers/scsi/virtio_scsi.c ___ +++ b/drivers/scsi/virtio_scsi.c ___ @@ -153,11 +153,45 @@ ___struct virtio_scsi_cmd_resp *resp = >resp.cmd; ___struct virtio_scsi_target_state *tgt = ___scsi_target(sc->device)->hostdata; ___ + static int debug_failures = 0; ___ ___dev_dbg(>device->sdev_gendev, ___"cmd %p response %u status %#02x sense_len %u\n", ___sc, resp->response, resp->status, resp->sense_len); ___ ___ + // DEBUG: filter this CDB for testing purposes. ___ + // CDB: Read(10) 28 00 01 02 03 xx 00 00 yy 00 ___ + // (xx: LSB of the LBA, and yy: LSB of the LEN) ___ + if ((sc->cmnd[0] == 0x28) && (sc->cmnd[1] == 0x00) ___ + && (sc->cmnd[2] == 0x01) && (sc->cmnd[3] == 0x02) ___ + && (sc->cmnd[4] == 0x03) && (sc->cmnd[6] == 0x00) ___ + && (sc->cmnd[7] == 0x00) && (sc->cmnd[9] == 0x00)) { ___ + ___ + // Test 1: ___ + // - LBA: 01 02 03 _04_ ___ + // - LEN: two sectors (2 * 4k = 8k) ___ + // - Action: complete 5k out of 8k (3k residual) ___ + if ((sc->cmnd[5] == 0x04) && (sc->cmnd[8] == 0x02)) ___ + resp->resid = 6 * 512; ___ + ___ + // Test 2: ___ + // - LBA: 01 02 03 _04_ ___ + // - LEN: one sector (1 * 4k = 4k) ___ + // - Action: complete 3k out of 4k (1k residual) ___ + // always. ___ + if ((sc->cmnd[5] == 0x04) && (sc->cmnd[8] == 0x01)) ___ + resp->resid = 2 * 512; ___ + ___ + // Test 3: ___ + // - LBA: 01 02 03 _08_ ___ + // - LEN: one sector (1 * 4k = 4k) ___ + // - Action: complete 3k out of 4k (1k residual) ___ + // but on every 4th attempt (complete 4k) ___ + if ((sc->cmnd[5] == 0x08) && (sc->cmnd[8] == 0x01) ___ + && (++debug_failures % 4 != 0)) ___ + resp->resid = 2 * 512; ___ + } ___ + ___sc->result = resp->status; ___virtscsi_compute_resid(sc, virtio32_to_cpu(vscsi->vdev, ___
[PATCH] scsi: do not requeue requests unaligned with device sector size
When a SCSI command (e.g., read operation) is partially completed with good status and residual bytes (i.e., not all the bytes from the specified transfer length were transferred) the SCSI midlayer will update the request/bios with the completed bytes and requeue the request in order to complete the remainder/pending bytes. However, when the device sector size is greater than the 512-byte default/kernel sector size, alignment restrictions and validation apply (both to the starting logical block address, and the number of logical blocks to transfer) -- values must be multiples of the device sector size, otherwise the kernel fails the request in the preparation stage (e.g., sd_setup_read_write_cmnd() at sd.c file): [...] sd 0:0:0:0: [sda] Bad block number requested Hence, this error message (and the respective failed request) can be observed on devices with larger sector sizes which may respond the SCSI command with a SCSI residual size that is unaligned with the device sector size -- because the requeued request's starting logical block and number of logical blocks are based on the value of the remainder/pending bytes. In order to address this problem, introduce a check for this case in scsi_io_completion() (before it calls scsi_end_request() which in turn calls blk_update_request() which is the site that changes the request's __sector and __data_len fields, which are validated by sd_setup_read_write_cmnd()). This check verifies whether there is any residual/remainder bytes in the (potentially partially) completed requested and calculates the correctly aligned values for the number of completed bytes to pass up to scsi_end_request()/blk_update_request() that guarantee that the requeued request is aligned with the device sector size. The corner case is when one sector is requested and the response is partially complete, for which the remainder/pending bytes are unaligned and no further request would be valid. On such a case, the original request is retried after a delay (in case the error is hopefully due to a temporary condition in the device), but up to the retry limit (in case the condition is permanent, e.g. bad sector in the medium), after which the request is finally failed. In order to reproduce and verify this problem, the virtio_scsi.c file can be modified to respond to 3 'special' SCSI commands, on which partial completions are introduced (described in the patch). This is the guest's disk test image and libvirt XML snippets for the 4k-sector disk using the virtio scsi driver: # qemu-img create -f qcow2 /var/lib/libvirt/images/test.qcow2 128G And the verification in the guest: # cat /sys/block/sda/queue/physical_block_size 4096 # cat /sys/block/sda/queue/hw_sector_size 4096 This is the patch to virtio_scsi.c (lines prefixed with ' ___ '): ___ --- a/drivers/scsi/virtio_scsi.c ___ +++ b/drivers/scsi/virtio_scsi.c ___ @@ -153,11 +153,45 @@ ___struct virtio_scsi_cmd_resp *resp = >resp.cmd; ___struct virtio_scsi_target_state *tgt = ___scsi_target(sc->device)->hostdata; ___ + static int debug_failures = 0; ___ ___dev_dbg(>device->sdev_gendev, ___"cmd %p response %u status %#02x sense_len %u\n", ___sc, resp->response, resp->status, resp->sense_len); ___ ___ + // DEBUG: filter this CDB for testing purposes. ___ + // CDB: Read(10) 28 00 01 02 03 xx 00 00 yy 00 ___ + // (xx: LSB of the LBA, and yy: LSB of the LEN) ___ + if ((sc->cmnd[0] == 0x28) && (sc->cmnd[1] == 0x00) ___ + && (sc->cmnd[2] == 0x01) && (sc->cmnd[3] == 0x02) ___ + && (sc->cmnd[4] == 0x03) && (sc->cmnd[6] == 0x00) ___ + && (sc->cmnd[7] == 0x00) && (sc->cmnd[9] == 0x00)) { ___ + ___ + // Test 1: ___ + // - LBA: 01 02 03 _04_ ___ + // - LEN: two sectors (2 * 4k = 8k) ___ + // - Action: complete 5k out of 8k (3k residual) ___ + if ((sc->cmnd[5] == 0x04) && (sc->cmnd[8] == 0x02)) ___ + resp->resid = 6 * 512; ___ + ___ + // Test 2: ___ + // - LBA: 01 02 03 _04_ ___ + // - LEN: one sector (1 * 4k = 4k) ___ + // - Action: complete 3k out of 4k (1k residual) ___ + // always. ___ + if ((sc->cmnd[5] == 0x04) && (sc->cmnd[8] == 0x01)) ___ + resp->resid = 2 * 512; ___ + ___ + // Test 3: ___ + // - LBA: 01 02 03 _08_ ___ + // - LEN: one sector (1 * 4k = 4k) ___ + // - Action: complete 3k out of 4k (1k residual) ___ + // but on every 4th attempt (complete 4k) ___ + if ((sc->cmnd[5] == 0x08) && (sc->cmnd[8] == 0x01) ___ + && (++debug_failures % 4 != 0)) ___ + resp->resid = 2 * 512; ___ + } ___ + ___sc->result = resp->status; ___virtscsi_compute_resid(sc, virtio32_to_cpu(vscsi->vdev, ___
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 5:44 PM, David Ahernwrote: > On 12/19/16 5:25 PM, Andy Lutomirski wrote: >> net.socket_create_filter = "none": no filter >> net.socket_create_filter = "bpf:baadf00d": bpf filter >> net.socket_create_filter = "disallow": no sockets created period >> net.socket_create_filter = "iptables:foobar": some iptables thingy >> net.socket_create_filter = "nft:blahblahblah": some nft thingy >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 > > Such a scheme works for the socket create filter b/c it is a very simple use > case. It does not work for the ingress and egress which allow generic bpf > filters. Can you elaborate on what goes wrong? (Obviously the "address_family_list" example makes no sense in that context.) > > ... > >>> you're ignoring use cases I described earlier. >>> In vrf case there is only one ifindex it needs to bind to. >> >> I'm totally lost. Can you explain what this has to do with the cgroup >> hierarchy? > > I think the point is that a group hierarchy makes no sense for the VRF use > case. What I put into iproute2 is > > cgrp2/vrf/NAME > > where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 > sockets to a specific device index. cgrp2/vrf is the "default" vrf and does > not have a filter. A user can certainly add another layer > cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not > make sense. I tend to agree. I still think that the mechanism as it stands is broken in other respects and should be fixed before it goes live. I have no desire to cause problems for the vrf use case. But keep in mind that the vrf use case is, in Linus' tree, a bit broken right now in its interactions with other users of the same mechanism. Suppose I create a container and want to trace all of its created sockets. I'll set up cgrp2/container and load my tracer as a socket creation hook. Then a container sets up cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding filter. Now the tracing stops working -- oops. > > ... > I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You have to do it now (or disable the feature for 4.10). This is why I'm bringing this whole thing up now. >>> >>> We don't have to touch user visible api here, so extensions are fine. >> >> Huh? My example in the original email attaches a program in a >> sub-hierarchy. Are you saying that 4.11 could make that example stop >> working? > > Are you suggesting sub-cgroups should not be allowed to override the filter > of a parent cgroup? Yes, exactly. I think there are two sensible behaviors: a) sub-cgroups cannot have a filter at all of the parent has a filter. (This is the "punt" approach -- it lets different semantics be assigned later without breaking userspace.) b) sub-cgroups can have a filter if a parent does, too. The semantics are that the sub-cgroup filter runs first and all side-effects occur. If that filter says "reject" then ancestor filters are skipped. If that filter says "accept", then the ancestor filter is run and its side-effects happen as well. (And so on, all the way up to the root.) --Andy
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 5:44 PM, David Ahern wrote: > On 12/19/16 5:25 PM, Andy Lutomirski wrote: >> net.socket_create_filter = "none": no filter >> net.socket_create_filter = "bpf:baadf00d": bpf filter >> net.socket_create_filter = "disallow": no sockets created period >> net.socket_create_filter = "iptables:foobar": some iptables thingy >> net.socket_create_filter = "nft:blahblahblah": some nft thingy >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 > > Such a scheme works for the socket create filter b/c it is a very simple use > case. It does not work for the ingress and egress which allow generic bpf > filters. Can you elaborate on what goes wrong? (Obviously the "address_family_list" example makes no sense in that context.) > > ... > >>> you're ignoring use cases I described earlier. >>> In vrf case there is only one ifindex it needs to bind to. >> >> I'm totally lost. Can you explain what this has to do with the cgroup >> hierarchy? > > I think the point is that a group hierarchy makes no sense for the VRF use > case. What I put into iproute2 is > > cgrp2/vrf/NAME > > where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 > sockets to a specific device index. cgrp2/vrf is the "default" vrf and does > not have a filter. A user can certainly add another layer > cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not > make sense. I tend to agree. I still think that the mechanism as it stands is broken in other respects and should be fixed before it goes live. I have no desire to cause problems for the vrf use case. But keep in mind that the vrf use case is, in Linus' tree, a bit broken right now in its interactions with other users of the same mechanism. Suppose I create a container and want to trace all of its created sockets. I'll set up cgrp2/container and load my tracer as a socket creation hook. Then a container sets up cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding filter. Now the tracing stops working -- oops. > > ... > I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You have to do it now (or disable the feature for 4.10). This is why I'm bringing this whole thing up now. >>> >>> We don't have to touch user visible api here, so extensions are fine. >> >> Huh? My example in the original email attaches a program in a >> sub-hierarchy. Are you saying that 4.11 could make that example stop >> working? > > Are you suggesting sub-cgroups should not be allowed to override the filter > of a parent cgroup? Yes, exactly. I think there are two sensible behaviors: a) sub-cgroups cannot have a filter at all of the parent has a filter. (This is the "punt" approach -- it lets different semantics be assigned later without breaking userspace.) b) sub-cgroups can have a filter if a parent does, too. The semantics are that the sub-cgroup filter runs first and all side-effects occur. If that filter says "reject" then ancestor filters are skipped. If that filter says "accept", then the ancestor filter is run and its side-effects happen as well. (And so on, all the way up to the root.) --Andy
linux-next: Tree for Dec 20
Hi all, Please do not add any material for v4.11 to your linux-next included branches until after v4.10-rc1 has been released. Changes since 20161219: The kvm tree lost its build failure. Non-merge commits (relative to Linus' tree): 566 1073 files changed, 26213 insertions(+), 8676 deletions(-) I have created today's linux-next tree at git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you are tracking the linux-next tree using git, you should not use "git pull" to do so as that will try to merge the new linux-next release with the old one. You should use "git fetch" and checkout or reset to the new master. You can see which trees have been included by looking in the Next/Trees file in the source. There are also quilt-import.log and merge.log files in the Next directory. Between each merge, the tree was built with a ppc64_defconfig for powerpc and an allmodconfig (with CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a native build of tools/perf. After the final fixups (if any), I do an x86_64 modules_install followed by builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig (with KALLSYMS_EXTRA_PASS=1) and pseries_le_defconfig and i386, sparc and sparc64 defconfig. Below is a summary of the state of the merge. I am currently merging 246 trees (counting Linus' and 35 trees of bug fix patches pending for the current merge release). Stats about the size of the tree over time can be seen at http://neuling.org/linux-next-size.html . Status of my local build tests will be at http://kisskb.ellerman.id.au/linux-next . If maintainers want to give advice about cross compilers/configs that work, we are always open to add more builds. Thanks to Randy Dunlap for doing many randconfig builds. And to Paul Gortmaker for triage and bug fixes. -- Cheers, Stephen Rothwell $ git checkout master $ git reset --hard stable Merging origin/master (e93b1cc8a896 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs) Merging fixes/master (30066ce675d3 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6) Merging kbuild-current/rc-fixes (152b695d7437 builddeb: fix cross-building to arm64 producing host-arch debs) Merging arc-current/for-curr (08fe007968b2 ARC: mm: arc700: Don't assume 2 colours for aliasing VIPT dcache) Merging arm-current/fixes (8478132a8784 Revert "arm: move exports to definitions") Merging m68k-current/for-linus (7e251bb21ae0 m68k: Fix ndelay() macro) Merging metag-fixes/fixes (35d04077ad96 metag: Only define atomic_dec_if_positive conditionally) Merging powerpc-fixes/fixes (69973b830859 Linux 4.9) Merging sparc/master (e93b1cc8a896 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs) Merging net/master (ad688cdbb076 stmmac: fix memory barriers) Merging ipsec/master (bc3913a5378c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc) Merging netfilter/master (a220871be66f virtio-net: correctly enable multiqueue) Merging ipvs/master (045169816b31 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6) Merging wireless-drivers/master (fcd2042e8d36 mwifiex: printk() overflow with 32-byte SSIDs) Merging mac80211/master (a17d93ff3a95 mac80211: fix legacy and invalid rx-rate report) Merging sound-current/for-linus (995c6a7fd9b9 ALSA: hiface: Fix M2Tech hiFace driver sampling rate change) Merging pci-current/for-linus (e42010d8207f PCI: Set Read Completion Boundary to 128 iff Root Port supports it (_HPX)) Merging driver-core.current/driver-core-linus (cdb98c2698b4 Revert "nvme: add support for the Write Zeroes command") Merging tty.current/tty-linus (cdb98c2698b4 Revert "nvme: add support for the Write Zeroes command") Merging usb.current/usb-linus (cdb98c2698b4 Revert "nvme: add support for the Write Zeroes command") Merging usb-gadget-fixes/fixes (05e78c6933d6 usb: gadget: f_fs: fix wrong parenthesis in ffs_func_req_match()) Merging usb-serial-fixes/usb-linus (46490c347df4 USB: serial: option: add dlink dwm-158) Merging usb-chipidea-fixes/ci-for-usb-stable (c7fbb09b2ea1 usb: chipidea: move the lock initialization to core file) Merging phy/fixes (4320f9d4c183 phy: sun4i: check PMU presence when poking unknown bit of pmu) Merging staging.current/staging-linus (cdb98c2698b4 Revert "nvme: add support for the Write Zeroes command") Merging char-misc.current/char-misc-linus (cdb98c2698b4 Revert "nvme: add support for the Write Zeroes command") Merging input-current/for-linus (67626c932302 Input: synaptics_i2c - change msleep to usleep_range for small msecs) Merging crypto-current/master (8759fec4af22 crypto: marvell - Copy IVDIG before lau
linux-next: Tree for Dec 20
Hi all, Please do not add any material for v4.11 to your linux-next included branches until after v4.10-rc1 has been released. Changes since 20161219: The kvm tree lost its build failure. Non-merge commits (relative to Linus' tree): 566 1073 files changed, 26213 insertions(+), 8676 deletions(-) I have created today's linux-next tree at git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you are tracking the linux-next tree using git, you should not use "git pull" to do so as that will try to merge the new linux-next release with the old one. You should use "git fetch" and checkout or reset to the new master. You can see which trees have been included by looking in the Next/Trees file in the source. There are also quilt-import.log and merge.log files in the Next directory. Between each merge, the tree was built with a ppc64_defconfig for powerpc and an allmodconfig (with CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a native build of tools/perf. After the final fixups (if any), I do an x86_64 modules_install followed by builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig (with KALLSYMS_EXTRA_PASS=1) and pseries_le_defconfig and i386, sparc and sparc64 defconfig. Below is a summary of the state of the merge. I am currently merging 246 trees (counting Linus' and 35 trees of bug fix patches pending for the current merge release). Stats about the size of the tree over time can be seen at http://neuling.org/linux-next-size.html . Status of my local build tests will be at http://kisskb.ellerman.id.au/linux-next . If maintainers want to give advice about cross compilers/configs that work, we are always open to add more builds. Thanks to Randy Dunlap for doing many randconfig builds. And to Paul Gortmaker for triage and bug fixes. -- Cheers, Stephen Rothwell $ git checkout master $ git reset --hard stable Merging origin/master (e93b1cc8a896 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs) Merging fixes/master (30066ce675d3 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6) Merging kbuild-current/rc-fixes (152b695d7437 builddeb: fix cross-building to arm64 producing host-arch debs) Merging arc-current/for-curr (08fe007968b2 ARC: mm: arc700: Don't assume 2 colours for aliasing VIPT dcache) Merging arm-current/fixes (8478132a8784 Revert "arm: move exports to definitions") Merging m68k-current/for-linus (7e251bb21ae0 m68k: Fix ndelay() macro) Merging metag-fixes/fixes (35d04077ad96 metag: Only define atomic_dec_if_positive conditionally) Merging powerpc-fixes/fixes (69973b830859 Linux 4.9) Merging sparc/master (e93b1cc8a896 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs) Merging net/master (ad688cdbb076 stmmac: fix memory barriers) Merging ipsec/master (bc3913a5378c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc) Merging netfilter/master (a220871be66f virtio-net: correctly enable multiqueue) Merging ipvs/master (045169816b31 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6) Merging wireless-drivers/master (fcd2042e8d36 mwifiex: printk() overflow with 32-byte SSIDs) Merging mac80211/master (a17d93ff3a95 mac80211: fix legacy and invalid rx-rate report) Merging sound-current/for-linus (995c6a7fd9b9 ALSA: hiface: Fix M2Tech hiFace driver sampling rate change) Merging pci-current/for-linus (e42010d8207f PCI: Set Read Completion Boundary to 128 iff Root Port supports it (_HPX)) Merging driver-core.current/driver-core-linus (cdb98c2698b4 Revert "nvme: add support for the Write Zeroes command") Merging tty.current/tty-linus (cdb98c2698b4 Revert "nvme: add support for the Write Zeroes command") Merging usb.current/usb-linus (cdb98c2698b4 Revert "nvme: add support for the Write Zeroes command") Merging usb-gadget-fixes/fixes (05e78c6933d6 usb: gadget: f_fs: fix wrong parenthesis in ffs_func_req_match()) Merging usb-serial-fixes/usb-linus (46490c347df4 USB: serial: option: add dlink dwm-158) Merging usb-chipidea-fixes/ci-for-usb-stable (c7fbb09b2ea1 usb: chipidea: move the lock initialization to core file) Merging phy/fixes (4320f9d4c183 phy: sun4i: check PMU presence when poking unknown bit of pmu) Merging staging.current/staging-linus (cdb98c2698b4 Revert "nvme: add support for the Write Zeroes command") Merging char-misc.current/char-misc-linus (cdb98c2698b4 Revert "nvme: add support for the Write Zeroes command") Merging input-current/for-linus (67626c932302 Input: synaptics_i2c - change msleep to usleep_range for small msecs) Merging crypto-current/master (8759fec4af22 crypto: marvell - Copy IVDIG before lau
RE: [PATCH V4 08/11] megaraid_sas: Enable or Disable Fast path based on the PCI Threshold Bandwidth
Hi Tomas, Please see my response inline Thanks sasi -Original Message- From: Tomas Henzl [mailto:the...@redhat.com] Sent: Thursday, December 15, 2016 10:10 AM To: Sasikumar PC; j...@kernel.org; h...@infradead.org Cc: linux-s...@vger.kernel.org; Sathya Prakash Veerichetty; linux-kernel@vger.kernel.org; Christopher Owens; Kiran Kumar Kasturi Subject: Re: [PATCH V4 08/11] megaraid_sas: Enable or Disable Fast path based on the PCI Threshold Bandwidth On 14.12.2016 22:54, Sasikumar PC wrote: > Hi Tomas, > > Please see my response inline > > Thanks > sasi > > -Original Message- > From: Tomas Henzl [mailto:the...@redhat.com] > Sent: Friday, December 09, 2016 8:59 AM > To: Sasikumar Chandrasekaran; j...@kernel.org; h...@infradead.org > Cc: linux-s...@vger.kernel.org; sathya.prak...@broadcom.com; > linux-kernel@vger.kernel.org; christopher.ow...@broadcom.com; > kiran-kumar.kast...@broadcom.com > Subject: Re: [PATCH V4 08/11] megaraid_sas: Enable or Disable Fast > path based on the PCI Threshold Bandwidth > > On 7.12.2016 00:00, Sasikumar Chandrasekaran wrote: >> Large SEQ IO workload should sent as non fast path commands >> >> This patch is depending on patch 7 >> >> Signed-off-by: Sasikumar Chandrasekaran>> --- >> drivers/scsi/megaraid/megaraid_sas.h| 8 + >> drivers/scsi/megaraid/megaraid_sas_base.c | 48 > + >> drivers/scsi/megaraid/megaraid_sas_fp.c | 11 +-- >> drivers/scsi/megaraid/megaraid_sas_fusion.c | 20 +++- >> drivers/scsi/megaraid/megaraid_sas_fusion.h | 2 +- >> 5 files changed, 78 insertions(+), 11 deletions(-) >> >> diff --git a/drivers/scsi/megaraid/megaraid_sas.h >> b/drivers/scsi/megaraid/megaraid_sas.h >> index 3e087ab..eb5be2b 100644 >> --- a/drivers/scsi/megaraid/megaraid_sas.h >> +++ b/drivers/scsi/megaraid/megaraid_sas.h >> @@ -1429,6 +1429,8 @@ enum FW_BOOT_CONTEXT { >> #define MFI_1068_FW_HANDSHAKE_OFFSET0x64 >> #define MFI_1068_FW_READY 0x >> >> +#define MEGASAS_RAID1_FAST_PATH_STATUS_CHECK_INTERVAL HZ >> + >> #define MR_MAX_REPLY_QUEUES_OFFSET 0X001F >> #define MR_MAX_REPLY_QUEUES_EXT_OFFSET 0X003FC000 >> #define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT14 >> @@ -2101,6 +2103,10 @@ struct megasas_instance { >> atomic_t ldio_outstanding; >> atomic_t fw_reset_no_pci_access; >> >> +atomic64_t bytes_wrote; /* used for raid1 fast path enable or > disable */ >> +atomic_t r1_write_fp_capable; > Is a an atomic variable needed for a just boolean variable? > Sasi - As we need to synchronize timer thread and IO issue threads, > With atomic, at any point of time the value will be definitive. > With boolean, it depends, the value could be in transit while one > thread is reading and other is writing. This explanation is I think not good enough, as a write of a char value is atomic on all architectures. If you need synchronisation you probably need a spinlock. Tomash boolean may not be a char in all architectures/implementations. It could be implementation specific isn't it ? Spin_Lock is heavier as the check is in IO path.We need it to be consistent non-transient value, not an exact synchronization. There could be more values that we may set this variable to, to make different decisions and value can be set in more places in future. Atomic will help it keep consistent and extensible. sasi > >> + >> + >> struct megasas_instance_template *instancet; >> struct tasklet_struct isr_tasklet; >> struct work_struct work_init; >> @@ -2143,6 +2149,7 @@ struct megasas_instance { >> long reset_flags; >> struct mutex reset_mutex; >> struct timer_list sriov_heartbeat_timer; >> +struct timer_list r1_fp_hold_timer; >> char skip_heartbeat_timer_del; >> u8 requestorId; >> char PlasmaFW111; >> @@ -2159,6 +2166,7 @@ struct megasas_instance { >> bool is_ventura; >> bool msix_combined; >> u16 max_raid_mapsize; >> +u64 pci_threshold_bandwidth; /* used to control the fp writes */ >> }; >> struct MR_LD_VF_MAP { >> u32 size; >> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c >> b/drivers/scsi/megaraid/megaraid_sas_base.c >> index 8aafb59..899d25c 100644 >> --- a/drivers/scsi/megaraid/megaraid_sas_base.c >> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c >> @@ -1940,6 +1940,9 @@ void megaraid_sas_kill_hba(struct >> megasas_instance > *instance) >> } >> /* Complete outstanding ioctls when adapter is killed */ >> megasas_complete_outstanding_ioctls(instance); >> +if (instance->is_ventura) >> +del_timer_sync(>r1_fp_hold_timer); >> + >> } >> >> /** >> @@ -2438,6 +2441,24 @@ void megasas_sriov_heartbeat_handler(unsigned > long instance_addr) >> } >> } >> >> +/*Handler for disabling/enabling raid 1 fast paths*/ void >> +megasas_change_r1_fp_status(unsigned long instance_addr) { >> +struct megasas_instance
RE: [PATCH V4 08/11] megaraid_sas: Enable or Disable Fast path based on the PCI Threshold Bandwidth
Hi Tomas, Please see my response inline Thanks sasi -Original Message- From: Tomas Henzl [mailto:the...@redhat.com] Sent: Thursday, December 15, 2016 10:10 AM To: Sasikumar PC; j...@kernel.org; h...@infradead.org Cc: linux-s...@vger.kernel.org; Sathya Prakash Veerichetty; linux-kernel@vger.kernel.org; Christopher Owens; Kiran Kumar Kasturi Subject: Re: [PATCH V4 08/11] megaraid_sas: Enable or Disable Fast path based on the PCI Threshold Bandwidth On 14.12.2016 22:54, Sasikumar PC wrote: > Hi Tomas, > > Please see my response inline > > Thanks > sasi > > -Original Message- > From: Tomas Henzl [mailto:the...@redhat.com] > Sent: Friday, December 09, 2016 8:59 AM > To: Sasikumar Chandrasekaran; j...@kernel.org; h...@infradead.org > Cc: linux-s...@vger.kernel.org; sathya.prak...@broadcom.com; > linux-kernel@vger.kernel.org; christopher.ow...@broadcom.com; > kiran-kumar.kast...@broadcom.com > Subject: Re: [PATCH V4 08/11] megaraid_sas: Enable or Disable Fast > path based on the PCI Threshold Bandwidth > > On 7.12.2016 00:00, Sasikumar Chandrasekaran wrote: >> Large SEQ IO workload should sent as non fast path commands >> >> This patch is depending on patch 7 >> >> Signed-off-by: Sasikumar Chandrasekaran >> --- >> drivers/scsi/megaraid/megaraid_sas.h| 8 + >> drivers/scsi/megaraid/megaraid_sas_base.c | 48 > + >> drivers/scsi/megaraid/megaraid_sas_fp.c | 11 +-- >> drivers/scsi/megaraid/megaraid_sas_fusion.c | 20 +++- >> drivers/scsi/megaraid/megaraid_sas_fusion.h | 2 +- >> 5 files changed, 78 insertions(+), 11 deletions(-) >> >> diff --git a/drivers/scsi/megaraid/megaraid_sas.h >> b/drivers/scsi/megaraid/megaraid_sas.h >> index 3e087ab..eb5be2b 100644 >> --- a/drivers/scsi/megaraid/megaraid_sas.h >> +++ b/drivers/scsi/megaraid/megaraid_sas.h >> @@ -1429,6 +1429,8 @@ enum FW_BOOT_CONTEXT { >> #define MFI_1068_FW_HANDSHAKE_OFFSET0x64 >> #define MFI_1068_FW_READY 0x >> >> +#define MEGASAS_RAID1_FAST_PATH_STATUS_CHECK_INTERVAL HZ >> + >> #define MR_MAX_REPLY_QUEUES_OFFSET 0X001F >> #define MR_MAX_REPLY_QUEUES_EXT_OFFSET 0X003FC000 >> #define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT14 >> @@ -2101,6 +2103,10 @@ struct megasas_instance { >> atomic_t ldio_outstanding; >> atomic_t fw_reset_no_pci_access; >> >> +atomic64_t bytes_wrote; /* used for raid1 fast path enable or > disable */ >> +atomic_t r1_write_fp_capable; > Is a an atomic variable needed for a just boolean variable? > Sasi - As we need to synchronize timer thread and IO issue threads, > With atomic, at any point of time the value will be definitive. > With boolean, it depends, the value could be in transit while one > thread is reading and other is writing. This explanation is I think not good enough, as a write of a char value is atomic on all architectures. If you need synchronisation you probably need a spinlock. Tomash boolean may not be a char in all architectures/implementations. It could be implementation specific isn't it ? Spin_Lock is heavier as the check is in IO path.We need it to be consistent non-transient value, not an exact synchronization. There could be more values that we may set this variable to, to make different decisions and value can be set in more places in future. Atomic will help it keep consistent and extensible. sasi > >> + >> + >> struct megasas_instance_template *instancet; >> struct tasklet_struct isr_tasklet; >> struct work_struct work_init; >> @@ -2143,6 +2149,7 @@ struct megasas_instance { >> long reset_flags; >> struct mutex reset_mutex; >> struct timer_list sriov_heartbeat_timer; >> +struct timer_list r1_fp_hold_timer; >> char skip_heartbeat_timer_del; >> u8 requestorId; >> char PlasmaFW111; >> @@ -2159,6 +2166,7 @@ struct megasas_instance { >> bool is_ventura; >> bool msix_combined; >> u16 max_raid_mapsize; >> +u64 pci_threshold_bandwidth; /* used to control the fp writes */ >> }; >> struct MR_LD_VF_MAP { >> u32 size; >> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c >> b/drivers/scsi/megaraid/megaraid_sas_base.c >> index 8aafb59..899d25c 100644 >> --- a/drivers/scsi/megaraid/megaraid_sas_base.c >> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c >> @@ -1940,6 +1940,9 @@ void megaraid_sas_kill_hba(struct >> megasas_instance > *instance) >> } >> /* Complete outstanding ioctls when adapter is killed */ >> megasas_complete_outstanding_ioctls(instance); >> +if (instance->is_ventura) >> +del_timer_sync(>r1_fp_hold_timer); >> + >> } >> >> /** >> @@ -2438,6 +2441,24 @@ void megasas_sriov_heartbeat_handler(unsigned > long instance_addr) >> } >> } >> >> +/*Handler for disabling/enabling raid 1 fast paths*/ void >> +megasas_change_r1_fp_status(unsigned long instance_addr) { >> +struct megasas_instance *instance = >> +
RE: [PATCH V4 06/11] megaraid_sas: Dynamic Raid Map Changes for SAS3.5 Generic Megaraid Controllers
Hi Tomas, Please see my response inline Thanks sasi -Original Message- From: Sasikumar PC [mailto:sasikumar...@broadcom.com] Sent: Wednesday, December 14, 2016 4:49 PM To: 'Tomas Henzl'; 'j...@kernel.org'; 'h...@infradead.org' Cc: 'linux-s...@vger.kernel.org'; Sathya Prakash Veerichetty; 'linux-kernel@vger.kernel.org'; Christopher Owens; Kiran Kumar Kasturi Subject: RE: [PATCH V4 06/11] megaraid_sas: Dynamic Raid Map Changes for SAS3.5 Generic Megaraid Controllers Hi Tomas, Please see my response inline Thanks sasi -Original Message- From: Tomas Henzl [mailto:the...@redhat.com] Sent: Friday, December 09, 2016 7:55 AM To: Sasikumar Chandrasekaran; j...@kernel.org; h...@infradead.org Cc: linux-s...@vger.kernel.org; sathya.prak...@broadcom.com; linux-kernel@vger.kernel.org; christopher.ow...@broadcom.com; kiran-kumar.kast...@broadcom.com Subject: Re: [PATCH V4 06/11] megaraid_sas: Dynamic Raid Map Changes for SAS3.5 Generic Megaraid Controllers On 7.12.2016 00:00, Sasikumar Chandrasekaran wrote: > SAS3.5 Generic Megaraid Controllers FW will support new dynamic RaidMap to have different > sizes for different number of supported VDs. > > This patch is depending on patch 5 > > Signed-off-by: Sasikumar Chandrasekaran> --- > drivers/scsi/megaraid/megaraid_sas.h| 7 + > drivers/scsi/megaraid/megaraid_sas_base.c | 61 -- > drivers/scsi/megaraid/megaraid_sas_fp.c | 303 > drivers/scsi/megaraid/megaraid_sas_fusion.c | 223 > drivers/scsi/megaraid/megaraid_sas_fusion.h | 240 ++ > 5 files changed, 699 insertions(+), 135 deletions(-) > > diff --git a/drivers/scsi/megaraid/megaraid_sas.h b/drivers/scsi/megaraid/megaraid_sas.h > index f4d6a94..3e087ab 100644 > --- a/drivers/scsi/megaraid/megaraid_sas.h > +++ b/drivers/scsi/megaraid/megaraid_sas.h > @@ -1434,6 +1434,12 @@ enum FW_BOOT_CONTEXT { > #define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT14 > #define MR_MAX_MSIX_REG_ARRAY 16 > #define MR_RDPQ_MODE_OFFSET 0X0080 > + > +#define MR_MAX_RAID_MAP_SIZE_OFFSET_SHIFT16 > +#define MR_MAX_RAID_MAP_SIZE_MASK0x1FF > +#define MR_MIN_MAP_SIZE 0x1 > +/* 64k */ > + > #define MR_CAN_HANDLE_SYNC_CACHE_OFFSET 0X0100 > > /* > @@ -2152,6 +2158,7 @@ struct megasas_instance { > bool fw_sync_cache_support; > bool is_ventura; > bool msix_combined; > + u16 max_raid_mapsize; > }; > struct MR_LD_VF_MAP { > u32 size; > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c > index c52f7be..3f06b57 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > @@ -4424,8 +4424,7 @@ int megasas_alloc_cmds(struct megasas_instance *instance) > static void megasas_update_ext_vd_details(struct megasas_instance *instance) > { > struct fusion_context *fusion; > - u32 old_map_sz; > - u32 new_map_sz; > + u32 ventura_map_sz = 0; > > fusion = instance->ctrl_context; > /* For MFI based controllers return dummy success */ > @@ -4455,21 +4454,39 @@ static void megasas_update_ext_vd_details(struct megasas_instance *instance) > instance->supportmax256vd ? "Extended VD(240 VD)firmware" : > "Legacy(64 VD) firmware"); > > - old_map_sz = sizeof(struct MR_FW_RAID_MAP) + > - (sizeof(struct MR_LD_SPAN_MAP) * > - (instance->fw_supported_vd_count - 1)); > - new_map_sz = sizeof(struct MR_FW_RAID_MAP_EXT); > - fusion->drv_map_sz = sizeof(struct MR_DRV_RAID_MAP) + > - (sizeof(struct MR_LD_SPAN_MAP) * > - (instance->drv_supported_vd_count - 1)); > - > - fusion->max_map_sz = max(old_map_sz, new_map_sz); > + if (instance->max_raid_mapsize) { > + ventura_map_sz = instance->max_raid_mapsize * > + MR_MIN_MAP_SIZE; /* 64k */ > + fusion->current_map_sz = ventura_map_sz; > + fusion->max_map_sz = ventura_map_sz; > + } else { > + fusion->old_map_sz = sizeof(struct MR_FW_RAID_MAP) + > + (sizeof(struct MR_LD_SPAN_MAP) * > + (instance->fw_supported_vd_count - 1)); > + fusion->new_map_sz = sizeof(struct MR_FW_RAID_MAP_EXT); > > + fusion->max_map_sz = > + max(fusion->old_map_sz, fusion->new_map_sz); > > - if (instance->supportmax256vd) > - fusion->current_map_sz = new_map_sz; > - else > - fusion->current_map_sz = old_map_sz; > + if (instance->supportmax256vd) > + fusion->current_map_sz = fusion->new_map_sz; > + else > +
RE: [PATCH V4 06/11] megaraid_sas: Dynamic Raid Map Changes for SAS3.5 Generic Megaraid Controllers
Hi Tomas, Please see my response inline Thanks sasi -Original Message- From: Sasikumar PC [mailto:sasikumar...@broadcom.com] Sent: Wednesday, December 14, 2016 4:49 PM To: 'Tomas Henzl'; 'j...@kernel.org'; 'h...@infradead.org' Cc: 'linux-s...@vger.kernel.org'; Sathya Prakash Veerichetty; 'linux-kernel@vger.kernel.org'; Christopher Owens; Kiran Kumar Kasturi Subject: RE: [PATCH V4 06/11] megaraid_sas: Dynamic Raid Map Changes for SAS3.5 Generic Megaraid Controllers Hi Tomas, Please see my response inline Thanks sasi -Original Message- From: Tomas Henzl [mailto:the...@redhat.com] Sent: Friday, December 09, 2016 7:55 AM To: Sasikumar Chandrasekaran; j...@kernel.org; h...@infradead.org Cc: linux-s...@vger.kernel.org; sathya.prak...@broadcom.com; linux-kernel@vger.kernel.org; christopher.ow...@broadcom.com; kiran-kumar.kast...@broadcom.com Subject: Re: [PATCH V4 06/11] megaraid_sas: Dynamic Raid Map Changes for SAS3.5 Generic Megaraid Controllers On 7.12.2016 00:00, Sasikumar Chandrasekaran wrote: > SAS3.5 Generic Megaraid Controllers FW will support new dynamic RaidMap to have different > sizes for different number of supported VDs. > > This patch is depending on patch 5 > > Signed-off-by: Sasikumar Chandrasekaran > --- > drivers/scsi/megaraid/megaraid_sas.h| 7 + > drivers/scsi/megaraid/megaraid_sas_base.c | 61 -- > drivers/scsi/megaraid/megaraid_sas_fp.c | 303 > drivers/scsi/megaraid/megaraid_sas_fusion.c | 223 > drivers/scsi/megaraid/megaraid_sas_fusion.h | 240 ++ > 5 files changed, 699 insertions(+), 135 deletions(-) > > diff --git a/drivers/scsi/megaraid/megaraid_sas.h b/drivers/scsi/megaraid/megaraid_sas.h > index f4d6a94..3e087ab 100644 > --- a/drivers/scsi/megaraid/megaraid_sas.h > +++ b/drivers/scsi/megaraid/megaraid_sas.h > @@ -1434,6 +1434,12 @@ enum FW_BOOT_CONTEXT { > #define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT14 > #define MR_MAX_MSIX_REG_ARRAY 16 > #define MR_RDPQ_MODE_OFFSET 0X0080 > + > +#define MR_MAX_RAID_MAP_SIZE_OFFSET_SHIFT16 > +#define MR_MAX_RAID_MAP_SIZE_MASK0x1FF > +#define MR_MIN_MAP_SIZE 0x1 > +/* 64k */ > + > #define MR_CAN_HANDLE_SYNC_CACHE_OFFSET 0X0100 > > /* > @@ -2152,6 +2158,7 @@ struct megasas_instance { > bool fw_sync_cache_support; > bool is_ventura; > bool msix_combined; > + u16 max_raid_mapsize; > }; > struct MR_LD_VF_MAP { > u32 size; > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c > index c52f7be..3f06b57 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > @@ -4424,8 +4424,7 @@ int megasas_alloc_cmds(struct megasas_instance *instance) > static void megasas_update_ext_vd_details(struct megasas_instance *instance) > { > struct fusion_context *fusion; > - u32 old_map_sz; > - u32 new_map_sz; > + u32 ventura_map_sz = 0; > > fusion = instance->ctrl_context; > /* For MFI based controllers return dummy success */ > @@ -4455,21 +4454,39 @@ static void megasas_update_ext_vd_details(struct megasas_instance *instance) > instance->supportmax256vd ? "Extended VD(240 VD)firmware" : > "Legacy(64 VD) firmware"); > > - old_map_sz = sizeof(struct MR_FW_RAID_MAP) + > - (sizeof(struct MR_LD_SPAN_MAP) * > - (instance->fw_supported_vd_count - 1)); > - new_map_sz = sizeof(struct MR_FW_RAID_MAP_EXT); > - fusion->drv_map_sz = sizeof(struct MR_DRV_RAID_MAP) + > - (sizeof(struct MR_LD_SPAN_MAP) * > - (instance->drv_supported_vd_count - 1)); > - > - fusion->max_map_sz = max(old_map_sz, new_map_sz); > + if (instance->max_raid_mapsize) { > + ventura_map_sz = instance->max_raid_mapsize * > + MR_MIN_MAP_SIZE; /* 64k */ > + fusion->current_map_sz = ventura_map_sz; > + fusion->max_map_sz = ventura_map_sz; > + } else { > + fusion->old_map_sz = sizeof(struct MR_FW_RAID_MAP) + > + (sizeof(struct MR_LD_SPAN_MAP) * > + (instance->fw_supported_vd_count - 1)); > + fusion->new_map_sz = sizeof(struct MR_FW_RAID_MAP_EXT); > > + fusion->max_map_sz = > + max(fusion->old_map_sz, fusion->new_map_sz); > > - if (instance->supportmax256vd) > - fusion->current_map_sz = new_map_sz; > - else > - fusion->current_map_sz = old_map_sz; > + if (instance->supportmax256vd) > + fusion->current_map_sz = fusion->new_map_sz; > + else > + fusion->current_map_sz =
RE: [PATCH V4 02/11] megaraid_sas: 128 MSIX Support
Hi Tomas, Please see my response inline Thanks sasi -Original Message- From: Sasikumar PC [mailto:sasikumar...@broadcom.com] Sent: Wednesday, December 14, 2016 4:43 PM To: 'Tomas Henzl'; 'j...@kernel.org'; 'h...@infradead.org' Cc: 'linux-s...@vger.kernel.org'; Sathya Prakash Veerichetty; 'linux-kernel@vger.kernel.org'; Christopher Owens; Kiran Kumar Kasturi Subject: RE: [PATCH V4 02/11] megaraid_sas: 128 MSIX Support Hi Tomas, Please see my response inline Thanks sasi -Original Message- From: Tomas Henzl [mailto:the...@redhat.com] Sent: Thursday, December 08, 2016 10:35 AM To: Sasikumar Chandrasekaran; j...@kernel.org; h...@infradead.org Cc: linux-s...@vger.kernel.org; sathya.prak...@broadcom.com; linux-kernel@vger.kernel.org; christopher.ow...@broadcom.com; kiran-kumar.kast...@broadcom.com Subject: Re: [PATCH V4 02/11] megaraid_sas: 128 MSIX Support On 7.12.2016 00:00, Sasikumar Chandrasekaran wrote: > SAS3.5 Generic Megaraid based Controllers will have the support for > 128 MSI-X vectors, resulting in the need to support 128 reply queues > > This patch is depending on patch 1 > > Signed-off-by: Sasikumar Chandrasekaran> --- > drivers/scsi/megaraid/megaraid_sas.h| 1 + > drivers/scsi/megaraid/megaraid_sas_base.c | 24 +--- > drivers/scsi/megaraid/megaraid_sas_fusion.c | 4 ++-- > 3 files changed, 20 insertions(+), 9 deletions(-) > > diff --git a/drivers/scsi/megaraid/megaraid_sas.h > b/drivers/scsi/megaraid/megaraid_sas.h > index 72e16c2..9d4ca8d 100644 > --- a/drivers/scsi/megaraid/megaraid_sas.h > +++ b/drivers/scsi/megaraid/megaraid_sas.h > @@ -2149,6 +2149,7 @@ struct megasas_instance { > bool dev_handle; > bool fw_sync_cache_support; > bool is_ventura; > + bool msix_combined; > }; > struct MR_LD_VF_MAP { > u32 size; > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > b/drivers/scsi/megaraid/megaraid_sas_base.c > index efccf98..c583e0b 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > @@ -5086,13 +5086,7 @@ static int megasas_init_fw(struct megasas_instance *instance) > goto fail_ready_state; > } > > - /* > - * MSI-X host index 0 is common for all adapter. > - * It is used for all MPT based Adapters. > - */ > - instance->reply_post_host_index_addr[0] = > - (u32 __iomem *)((u8 __iomem *)instance->reg_set + > - MPI2_REPLY_POST_HOST_INDEX_OFFSET); > + > > /* Check if MSI-X is supported while in ready state */ > msix_enable = (instance->instancet->read_fw_status_reg(reg_set) & @@ > -5110,6 +5104,9 @@ static int megasas_init_fw(struct megasas_instance *instance) > instance->msix_vectors = ((scratch_pad_2 > & MR_MAX_REPLY_QUEUES_EXT_OFFSET) > >> MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT) + 1; > + if (instance->msix_vectors > 16) > + instance->msix_combined = true; > + > if (rdpq_enable) > instance->is_rdpq = (scratch_pad_2 & MR_RDPQ_MODE_OFFSET) ? > 1 : 0; > @@ -5143,6 +5140,19 @@ static int megasas_init_fw(struct megasas_instance *instance) > else > instance->msix_vectors = 0; > } Have you tested this patch with the pci=nomsi kernel option? Sasi - Driver is tested with pci=nomsi option and looking good is it safe when msix_combined is true and pci_enable_msix_range fails so instance->msix_vectors is zero? msix_combined mode is dependent on how many vectors adapter supports, not the actual vectors used. It is correct to be in combined mode even if actual number of msix_vectors used are fewer than 16, if hardware is in combined mode Sasi tomash > + /* > + * MSI-X host index 0 is common for all adapter. > + * It is used for all MPT based Adapters. > + */ > + if (instance->msix_combined) { > + instance->reply_post_host_index_addr[0] = > + (u32 *)((u8 *)instance->reg_set + > + MPI2_SUP_REPLY_POST_HOST_INDEX_OFFSET); > + } else { > + instance->reply_post_host_index_addr[0] = > + (u32 *)((u8 *)instance->reg_set + > + MPI2_REPLY_POST_HOST_INDEX_OFFSET); > + } > > dev_info(>pdev->dev, > "firmware supports msix\t: (%d)", fw_msix_count); diff --git > a/drivers/scsi/megaraid/megaraid_sas_fusion.c > b/drivers/scsi/megaraid/megaraid_sas_fusion.c > index 8d7a397..413e2030 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c > +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c > @@ -2391,7 +2391,7 @@ static void
RE: [PATCH V4 02/11] megaraid_sas: 128 MSIX Support
Hi Tomas, Please see my response inline Thanks sasi -Original Message- From: Sasikumar PC [mailto:sasikumar...@broadcom.com] Sent: Wednesday, December 14, 2016 4:43 PM To: 'Tomas Henzl'; 'j...@kernel.org'; 'h...@infradead.org' Cc: 'linux-s...@vger.kernel.org'; Sathya Prakash Veerichetty; 'linux-kernel@vger.kernel.org'; Christopher Owens; Kiran Kumar Kasturi Subject: RE: [PATCH V4 02/11] megaraid_sas: 128 MSIX Support Hi Tomas, Please see my response inline Thanks sasi -Original Message- From: Tomas Henzl [mailto:the...@redhat.com] Sent: Thursday, December 08, 2016 10:35 AM To: Sasikumar Chandrasekaran; j...@kernel.org; h...@infradead.org Cc: linux-s...@vger.kernel.org; sathya.prak...@broadcom.com; linux-kernel@vger.kernel.org; christopher.ow...@broadcom.com; kiran-kumar.kast...@broadcom.com Subject: Re: [PATCH V4 02/11] megaraid_sas: 128 MSIX Support On 7.12.2016 00:00, Sasikumar Chandrasekaran wrote: > SAS3.5 Generic Megaraid based Controllers will have the support for > 128 MSI-X vectors, resulting in the need to support 128 reply queues > > This patch is depending on patch 1 > > Signed-off-by: Sasikumar Chandrasekaran > --- > drivers/scsi/megaraid/megaraid_sas.h| 1 + > drivers/scsi/megaraid/megaraid_sas_base.c | 24 +--- > drivers/scsi/megaraid/megaraid_sas_fusion.c | 4 ++-- > 3 files changed, 20 insertions(+), 9 deletions(-) > > diff --git a/drivers/scsi/megaraid/megaraid_sas.h > b/drivers/scsi/megaraid/megaraid_sas.h > index 72e16c2..9d4ca8d 100644 > --- a/drivers/scsi/megaraid/megaraid_sas.h > +++ b/drivers/scsi/megaraid/megaraid_sas.h > @@ -2149,6 +2149,7 @@ struct megasas_instance { > bool dev_handle; > bool fw_sync_cache_support; > bool is_ventura; > + bool msix_combined; > }; > struct MR_LD_VF_MAP { > u32 size; > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > b/drivers/scsi/megaraid/megaraid_sas_base.c > index efccf98..c583e0b 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > @@ -5086,13 +5086,7 @@ static int megasas_init_fw(struct megasas_instance *instance) > goto fail_ready_state; > } > > - /* > - * MSI-X host index 0 is common for all adapter. > - * It is used for all MPT based Adapters. > - */ > - instance->reply_post_host_index_addr[0] = > - (u32 __iomem *)((u8 __iomem *)instance->reg_set + > - MPI2_REPLY_POST_HOST_INDEX_OFFSET); > + > > /* Check if MSI-X is supported while in ready state */ > msix_enable = (instance->instancet->read_fw_status_reg(reg_set) & @@ > -5110,6 +5104,9 @@ static int megasas_init_fw(struct megasas_instance *instance) > instance->msix_vectors = ((scratch_pad_2 > & MR_MAX_REPLY_QUEUES_EXT_OFFSET) > >> MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT) + 1; > + if (instance->msix_vectors > 16) > + instance->msix_combined = true; > + > if (rdpq_enable) > instance->is_rdpq = (scratch_pad_2 & MR_RDPQ_MODE_OFFSET) ? > 1 : 0; > @@ -5143,6 +5140,19 @@ static int megasas_init_fw(struct megasas_instance *instance) > else > instance->msix_vectors = 0; > } Have you tested this patch with the pci=nomsi kernel option? Sasi - Driver is tested with pci=nomsi option and looking good is it safe when msix_combined is true and pci_enable_msix_range fails so instance->msix_vectors is zero? msix_combined mode is dependent on how many vectors adapter supports, not the actual vectors used. It is correct to be in combined mode even if actual number of msix_vectors used are fewer than 16, if hardware is in combined mode Sasi tomash > + /* > + * MSI-X host index 0 is common for all adapter. > + * It is used for all MPT based Adapters. > + */ > + if (instance->msix_combined) { > + instance->reply_post_host_index_addr[0] = > + (u32 *)((u8 *)instance->reg_set + > + MPI2_SUP_REPLY_POST_HOST_INDEX_OFFSET); > + } else { > + instance->reply_post_host_index_addr[0] = > + (u32 *)((u8 *)instance->reg_set + > + MPI2_REPLY_POST_HOST_INDEX_OFFSET); > + } > > dev_info(>pdev->dev, > "firmware supports msix\t: (%d)", fw_msix_count); diff --git > a/drivers/scsi/megaraid/megaraid_sas_fusion.c > b/drivers/scsi/megaraid/megaraid_sas_fusion.c > index 8d7a397..413e2030 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c > +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c > @@ -2391,7 +2391,7 @@ static void megasas_build_ld_nonrw_fusion(struct megasas_instance
Re: [kbuild-all] 答复: Re: [PATCH 1/2] ocfs2: add kobject for online file check
Hi Gang, On Mon, Dec 19, 2016 at 06:43:48PM -0700, Gang He wrote: Hello Kbuild, Could you build my whole patch set (2 patch)? I think that the code is OK. We test your whole patch as well as first-N patches, and noticed that the first-1 patch breaks bisectibility: Note: the linux-review/Gang-He/ocfs2-add-kobject-for-online-file-check/20161219-181858 HEAD 6ef9256cd25ef72a5e69490cc3dacde95b8e2ac4 builds fine. It only hurts bisectibility. Thanks, Fengguang
Re: [PATCH v2 2/3] Bluetooth: btusb: Add out-of-band wakeup support
Hi Brian, On Mon, Dec 19, 2016 at 3:10 PM, Brian Norriswrote: > Hi Rajat, > > On Fri, Dec 16, 2016 at 11:30:03AM -0800, Rajat Jain wrote: >> Some onboard BT chips (e.g. Marvell 8997) contain a wakeup pin that >> can be connected to a gpio on the CPU side, and can be used to wakeup >> the host out-of-band. This can be useful in situations where the >> in-band wakeup is not possible or not preferable (e.g. the in-band >> wakeup may require the USB host controller to remain active, and >> hence consuming more system power during system sleep). >> >> The oob gpio interrupt to be used for wakeup on the CPU side, is >> read from the device tree node, (using standard interrupt descriptors). >> A devcie tree binding document is also added for the driver. The >> compatible string is in compliance with >> Documentation/devicetree/bindings/usb/usb-device.txt >> >> Signed-off-by: Rajat Jain > > A few small comments below, but other than those, for the whole series: > > Reviewed-by: Brian Norris > >> --- >> v2: * Use interrupt-names ("wakeup") instead of assuming first interrupt. >> * Leave it on device tree to specify IRQ flags (level /edge triggered) >> * Mark the device as non wakeable on exit. >> >> Documentation/devicetree/bindings/net/btusb.txt | 40 >> drivers/bluetooth/btusb.c | 84 >> + >> 2 files changed, 124 insertions(+) >> create mode 100644 Documentation/devicetree/bindings/net/btusb.txt >> >> diff --git a/Documentation/devicetree/bindings/net/btusb.txt >> b/Documentation/devicetree/bindings/net/btusb.txt >> new file mode 100644 >> index 000..2c0355c >> --- /dev/null >> +++ b/Documentation/devicetree/bindings/net/btusb.txt >> @@ -0,0 +1,40 @@ >> +Generic Bluetooth controller over USB (btusb driver) >> +--- >> + >> +Required properties: >> + >> + - compatible : should comply with the format "usbVID,PID" specified in >> + Documentation/devicetree/bindings/usb/usb-device.txt >> + At the time of writing, the only OF supported devices >> + (more may be added later) are: >> + >> + "usb1286,204e" (Marvell 8997) > > I don't know if anyone cares about these sort of organizational aspects, > but in patch 3 you're extending this binding with device-specific > Marvell bindings. Seems like it'd be good to have some clear way to > correlate those 2. e.g., add a reference to marvell-bt-sd8xxx.txt (or > marvell-bt-8xxx.txt, once you rename it) in patch 3. Good idea, I'll do it. (I'll add a reference to marvell-bt-8xxx.txt in btusb.txt). > >> + >> +Optional properties: >> + >> + - interrupt-parent: phandle of the parent interrupt controller >> + - interrupt-names: (see below) >> + - interrupts : The interrupt specified by the name "wakeup" is the >> interrupt >> + that shall be used for out-of-band wake-on-bt. Driver will >> + request this interrupt for wakeup. During system suspend, the >> + irq will be enabled so that the bluetooth chip can wakeup host >> + platform out of band. During system resume, the irq will be >> + disabled to make sure unnecessary interrupt is not received. >> + >> +Example: >> + >> +Following example uses irq pin number 3 of gpio0 for out of band wake-on-bt: >> + >> +_host1_ehci { >> +status = "okay"; >> +#address-cells = <1>; >> +#size-cells = <0>; >> + >> +mvl_bt1: bt@1 { >> + compatible = "usb1286,204e"; >> + reg = <1>; >> + interrupt-parent = <>; >> + interrupt-name = "wakeup"; >> + interrupts = <3 IRQ_TYPE_LEVEL_LOW>; >> +}; >> +}; >> diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c >> index ce22cef..beca4e9 100644 >> --- a/drivers/bluetooth/btusb.c >> +++ b/drivers/bluetooth/btusb.c >> @@ -24,6 +24,8 @@ >> #include >> #include >> #include >> +#include >> +#include >> #include >> >> #include >> @@ -369,6 +371,7 @@ static const struct usb_device_id blacklist_table[] = { >> #define BTUSB_BOOTING9 >> #define BTUSB_RESET_RESUME 10 >> #define BTUSB_DIAG_RUNNING 11 >> +#define BTUSB_OOB_WAKE_DISABLED 12 >> >> struct btusb_data { >> struct hci_dev *hdev; >> @@ -416,6 +419,8 @@ struct btusb_data { >> int (*recv_bulk)(struct btusb_data *data, void *buffer, int count); >> >> int (*setup_on_usb)(struct hci_dev *hdev); >> + >> + int oob_wake_irq; /* irq for out-of-band wake-on-bt */ >> }; >> >> static inline void btusb_free_frags(struct btusb_data *data) >> @@ -2728,6 +2733,65 @@ static int btusb_bcm_set_diag(struct hci_dev *hdev, >> bool enable) >> } >> #endif >> >> +#ifdef CONFIG_PM >> +static irqreturn_t btusb_oob_wake_handler(int irq, void *priv) >> +{ >> + struct btusb_data *data = priv; >> + >> + /* Disable only if not already disabled
Re: [kbuild-all] 答复: Re: [PATCH 1/2] ocfs2: add kobject for online file check
Hi Gang, On Mon, Dec 19, 2016 at 06:43:48PM -0700, Gang He wrote: Hello Kbuild, Could you build my whole patch set (2 patch)? I think that the code is OK. We test your whole patch as well as first-N patches, and noticed that the first-1 patch breaks bisectibility: Note: the linux-review/Gang-He/ocfs2-add-kobject-for-online-file-check/20161219-181858 HEAD 6ef9256cd25ef72a5e69490cc3dacde95b8e2ac4 builds fine. It only hurts bisectibility. Thanks, Fengguang
Re: [PATCH v2 2/3] Bluetooth: btusb: Add out-of-band wakeup support
Hi Brian, On Mon, Dec 19, 2016 at 3:10 PM, Brian Norris wrote: > Hi Rajat, > > On Fri, Dec 16, 2016 at 11:30:03AM -0800, Rajat Jain wrote: >> Some onboard BT chips (e.g. Marvell 8997) contain a wakeup pin that >> can be connected to a gpio on the CPU side, and can be used to wakeup >> the host out-of-band. This can be useful in situations where the >> in-band wakeup is not possible or not preferable (e.g. the in-band >> wakeup may require the USB host controller to remain active, and >> hence consuming more system power during system sleep). >> >> The oob gpio interrupt to be used for wakeup on the CPU side, is >> read from the device tree node, (using standard interrupt descriptors). >> A devcie tree binding document is also added for the driver. The >> compatible string is in compliance with >> Documentation/devicetree/bindings/usb/usb-device.txt >> >> Signed-off-by: Rajat Jain > > A few small comments below, but other than those, for the whole series: > > Reviewed-by: Brian Norris > >> --- >> v2: * Use interrupt-names ("wakeup") instead of assuming first interrupt. >> * Leave it on device tree to specify IRQ flags (level /edge triggered) >> * Mark the device as non wakeable on exit. >> >> Documentation/devicetree/bindings/net/btusb.txt | 40 >> drivers/bluetooth/btusb.c | 84 >> + >> 2 files changed, 124 insertions(+) >> create mode 100644 Documentation/devicetree/bindings/net/btusb.txt >> >> diff --git a/Documentation/devicetree/bindings/net/btusb.txt >> b/Documentation/devicetree/bindings/net/btusb.txt >> new file mode 100644 >> index 000..2c0355c >> --- /dev/null >> +++ b/Documentation/devicetree/bindings/net/btusb.txt >> @@ -0,0 +1,40 @@ >> +Generic Bluetooth controller over USB (btusb driver) >> +--- >> + >> +Required properties: >> + >> + - compatible : should comply with the format "usbVID,PID" specified in >> + Documentation/devicetree/bindings/usb/usb-device.txt >> + At the time of writing, the only OF supported devices >> + (more may be added later) are: >> + >> + "usb1286,204e" (Marvell 8997) > > I don't know if anyone cares about these sort of organizational aspects, > but in patch 3 you're extending this binding with device-specific > Marvell bindings. Seems like it'd be good to have some clear way to > correlate those 2. e.g., add a reference to marvell-bt-sd8xxx.txt (or > marvell-bt-8xxx.txt, once you rename it) in patch 3. Good idea, I'll do it. (I'll add a reference to marvell-bt-8xxx.txt in btusb.txt). > >> + >> +Optional properties: >> + >> + - interrupt-parent: phandle of the parent interrupt controller >> + - interrupt-names: (see below) >> + - interrupts : The interrupt specified by the name "wakeup" is the >> interrupt >> + that shall be used for out-of-band wake-on-bt. Driver will >> + request this interrupt for wakeup. During system suspend, the >> + irq will be enabled so that the bluetooth chip can wakeup host >> + platform out of band. During system resume, the irq will be >> + disabled to make sure unnecessary interrupt is not received. >> + >> +Example: >> + >> +Following example uses irq pin number 3 of gpio0 for out of band wake-on-bt: >> + >> +_host1_ehci { >> +status = "okay"; >> +#address-cells = <1>; >> +#size-cells = <0>; >> + >> +mvl_bt1: bt@1 { >> + compatible = "usb1286,204e"; >> + reg = <1>; >> + interrupt-parent = <>; >> + interrupt-name = "wakeup"; >> + interrupts = <3 IRQ_TYPE_LEVEL_LOW>; >> +}; >> +}; >> diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c >> index ce22cef..beca4e9 100644 >> --- a/drivers/bluetooth/btusb.c >> +++ b/drivers/bluetooth/btusb.c >> @@ -24,6 +24,8 @@ >> #include >> #include >> #include >> +#include >> +#include >> #include >> >> #include >> @@ -369,6 +371,7 @@ static const struct usb_device_id blacklist_table[] = { >> #define BTUSB_BOOTING9 >> #define BTUSB_RESET_RESUME 10 >> #define BTUSB_DIAG_RUNNING 11 >> +#define BTUSB_OOB_WAKE_DISABLED 12 >> >> struct btusb_data { >> struct hci_dev *hdev; >> @@ -416,6 +419,8 @@ struct btusb_data { >> int (*recv_bulk)(struct btusb_data *data, void *buffer, int count); >> >> int (*setup_on_usb)(struct hci_dev *hdev); >> + >> + int oob_wake_irq; /* irq for out-of-band wake-on-bt */ >> }; >> >> static inline void btusb_free_frags(struct btusb_data *data) >> @@ -2728,6 +2733,65 @@ static int btusb_bcm_set_diag(struct hci_dev *hdev, >> bool enable) >> } >> #endif >> >> +#ifdef CONFIG_PM >> +static irqreturn_t btusb_oob_wake_handler(int irq, void *priv) >> +{ >> + struct btusb_data *data = priv; >> + >> + /* Disable only if not already disabled (keep it balanced) */ >> + if
Re: [PATCH 4/4] oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA
Hi Kirill, [auto build test ERROR on mmotm/master] [also build test ERROR on v4.9 next-20161219] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/mm-drop-zap_details-ignore_dirty/20161220-092938 base: git://git.cmpxchg.org/linux-mmotm.git master config: i386-randconfig-x003-201651 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): mm/madvise.c: In function 'madvise_dontneed': >> mm/madvise.c:476:7: error: implicit declaration of function >> 'can_madv_dontneed_vma' [-Werror=implicit-function-declaration] if (!can_madv_dontneed_vma(vma)) ^ cc1: some warnings being treated as errors vim +/can_madv_dontneed_vma +476 mm/madvise.c 470 */ 471 static long madvise_dontneed(struct vm_area_struct *vma, 472 struct vm_area_struct **prev, 473 unsigned long start, unsigned long end) 474 { 475 *prev = vma; > 476 if (!can_madv_dontneed_vma(vma)) 477 return -EINVAL; 478 479 zap_page_range(vma, start, end - start); --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 4:25 PM, Andy Lutomirskiwrote: > On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov > wrote: >> you're ignoring use cases I described earlier. >> In vrf case there is only one ifindex it needs to bind to. > > I'm totally lost. Can you explain what this has to do with the cgroup > hierarchy? > Okay, I figured out what you mean, I think. You have a handful of vrf devices. Let's say they have ifindexes 1 and 2 (and maybe more). The interesting case happens when you set up /cgroup/a with a bpf program that binds new sockets to ifindex 1 and /cgroup/a/b with a bpf program that binds new sockets to ifindex 2. The question is: what should happen if you're in /cgroup/a/b? Presumably, if you do this, you wanted to end up with ifindex 2. I think the way it should actually work is that the kernel evaluates /cgroup/a/b's hook and then /cgroup/a's hook. Then /cgroup/a (which is the more privileged hook) gets to make the choice. If it wants ifindex 2 to win, it can do (pseudocode): if (!sk->sk_bound_if) sk->sk_bound_if = 1;
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On 12/19/16 5:25 PM, Andy Lutomirski wrote: > net.socket_create_filter = "none": no filter > net.socket_create_filter = "bpf:baadf00d": bpf filter > net.socket_create_filter = "disallow": no sockets created period > net.socket_create_filter = "iptables:foobar": some iptables thingy > net.socket_create_filter = "nft:blahblahblah": some nft thingy > net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters. ... >> you're ignoring use cases I described earlier. >> In vrf case there is only one ifindex it needs to bind to. > > I'm totally lost. Can you explain what this has to do with the cgroup > hierarchy? I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is cgrp2/vrf/NAME where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense. ... >>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You >>> have to do it now (or disable the feature for 4.10). This is why I'm >>> bringing this whole thing up now. >> >> We don't have to touch user visible api here, so extensions are fine. > > Huh? My example in the original email attaches a program in a > sub-hierarchy. Are you saying that 4.11 could make that example stop > working? Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup?
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 4:25 PM, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov > wrote: >> you're ignoring use cases I described earlier. >> In vrf case there is only one ifindex it needs to bind to. > > I'm totally lost. Can you explain what this has to do with the cgroup > hierarchy? > Okay, I figured out what you mean, I think. You have a handful of vrf devices. Let's say they have ifindexes 1 and 2 (and maybe more). The interesting case happens when you set up /cgroup/a with a bpf program that binds new sockets to ifindex 1 and /cgroup/a/b with a bpf program that binds new sockets to ifindex 2. The question is: what should happen if you're in /cgroup/a/b? Presumably, if you do this, you wanted to end up with ifindex 2. I think the way it should actually work is that the kernel evaluates /cgroup/a/b's hook and then /cgroup/a's hook. Then /cgroup/a (which is the more privileged hook) gets to make the choice. If it wants ifindex 2 to win, it can do (pseudocode): if (!sk->sk_bound_if) sk->sk_bound_if = 1;
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On 12/19/16 5:25 PM, Andy Lutomirski wrote: > net.socket_create_filter = "none": no filter > net.socket_create_filter = "bpf:baadf00d": bpf filter > net.socket_create_filter = "disallow": no sockets created period > net.socket_create_filter = "iptables:foobar": some iptables thingy > net.socket_create_filter = "nft:blahblahblah": some nft thingy > net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters. ... >> you're ignoring use cases I described earlier. >> In vrf case there is only one ifindex it needs to bind to. > > I'm totally lost. Can you explain what this has to do with the cgroup > hierarchy? I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is cgrp2/vrf/NAME where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense. ... >>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You >>> have to do it now (or disable the feature for 4.10). This is why I'm >>> bringing this whole thing up now. >> >> We don't have to touch user visible api here, so extensions are fine. > > Huh? My example in the original email attaches a program in a > sub-hierarchy. Are you saying that 4.11 could make that example stop > working? Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup?
Re: [PATCH 4/4] oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA
Hi Kirill, [auto build test ERROR on mmotm/master] [also build test ERROR on v4.9 next-20161219] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/mm-drop-zap_details-ignore_dirty/20161220-092938 base: git://git.cmpxchg.org/linux-mmotm.git master config: i386-randconfig-x003-201651 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): mm/madvise.c: In function 'madvise_dontneed': >> mm/madvise.c:476:7: error: implicit declaration of function >> 'can_madv_dontneed_vma' [-Werror=implicit-function-declaration] if (!can_madv_dontneed_vma(vma)) ^ cc1: some warnings being treated as errors vim +/can_madv_dontneed_vma +476 mm/madvise.c 470 */ 471 static long madvise_dontneed(struct vm_area_struct *vma, 472 struct vm_area_struct **prev, 473 unsigned long start, unsigned long end) 474 { 475 *prev = vma; > 476 if (!can_madv_dontneed_vma(vma)) 477 return -EINVAL; 478 479 zap_page_range(vma, start, end - start); --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
答复: Re: [PATCH 1/2] ocfs2: add kobject for online file check
Hello Kbuild, Could you build my whole patch set (2 patch)? I think that the code is OK. Thanks Gang >>> kbuild test robot <l...@intel.com> 2016-12-19 下午 18:56 >>> Hi Gang, [auto build test ERROR on linus/master] [also build test ERROR on v4.9 next-20161219] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Gang-He/ocfs2-add-kobject-for-online-file-check/20161219-181858 config: x86_64-randconfig-x000-201651 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 Note: the linux-review/Gang-He/ocfs2-add-kobject-for-online-file-check/20161219-181858 HEAD 6ef9256cd25ef72a5e69490cc3dacde95b8e2ac4 builds fine. It only hurts bisectibility. All error/warnings (new ones prefixed by >>): fs/ocfs2/super.c: In function 'ocfs2_fill_super': >> fs/ocfs2/super.c:1204:31: error: passing argument 1 of >> 'ocfs2_filecheck_create_sysfs' from incompatible pointer type >> [-Werror=incompatible-pointer-types] ocfs2_filecheck_create_sysfs(sb); ^~ In file included from fs/ocfs2/super.c:77:0: fs/ocfs2/filecheck.h:71:5: note: expected 'struct ocfs2_super *' but argument is of type 'struct super_block *' int ocfs2_filecheck_create_sysfs(struct ocfs2_super *osb); ^~~~ fs/ocfs2/super.c: In function 'ocfs2_put_super': >> fs/ocfs2/super.c:1656:31: error: passing argument 1 of >> 'ocfs2_filecheck_remove_sysfs' from incompatible pointer type >> [-Werror=incompatible-pointer-types] ocfs2_filecheck_remove_sysfs(sb); ^~ In file included from fs/ocfs2/super.c:77:0: fs/ocfs2/filecheck.h:72:6: note: expected 'struct ocfs2_super *' but argument is of type 'struct super_block *' void ocfs2_filecheck_remove_sysfs(struct ocfs2_super *osb); ^~~~ cc1: some warnings being treated as errors -- fs/ocfs2/filecheck.c: In function 'ocfs2_filecheck_create_sysfs': >> fs/ocfs2/filecheck.c:179:50: error: 'struct ocfs2_super' has no member named >> 'osb_fc_ent'; did you mean 'osb_ctxt'? struct ocfs2_filecheck_sysfs_entry *entry = >osb_fc_ent; ^~ >> fs/ocfs2/filecheck.c:191:27: error: 'struct ocfs2_super' has no member named >> 'osb_dev_kset'; did you mean 'osb_dx_seed'? entry->fs_kobj.kset = osb->osb_dev_kset; ^~ fs/ocfs2/filecheck.c: In function 'ocfs2_filecheck_remove_sysfs': fs/ocfs2/filecheck.c:206:10: error: 'struct ocfs2_super' has no member named 'osb_fc_ent'; did you mean 'osb_ctxt'? if (!osb->osb_fc_ent.fs_fcheck) ^~ fs/ocfs2/filecheck.c:209:18: error: 'struct ocfs2_super' has no member named 'osb_fc_ent'; did you mean 'osb_ctxt'? kobject_del(>osb_fc_ent.fs_kobj); ^~ fs/ocfs2/filecheck.c:210:18: error: 'struct ocfs2_super' has no member named 'osb_fc_ent'; did you mean 'osb_ctxt'? kobject_put(>osb_fc_ent.fs_kobj); ^~ fs/ocfs2/filecheck.c:211:26: error: 'struct ocfs2_super' has no member named 'osb_fc_ent'; did you mean 'osb_ctxt'? wait_for_completion(>osb_fc_ent.fs_kobj_unregister); ^~ fs/ocfs2/filecheck.c:212:33: error: 'struct ocfs2_super' has no member named 'osb_fc_ent'; did you mean 'osb_ctxt'? ocfs2_filecheck_sysfs_free(>osb_fc_ent); ^~ In file included from include/linux/list.h:8:0, from fs/ocfs2/filecheck.c:20: fs/ocfs2/filecheck.c: In function 'ocfs2_filecheck_handle_entry': >> include/linux/kernel.h:850:27: error: 'struct ocfs2_super' has no member >> named 'osb_fc_ent'; did you mean 'osb_ctxt'? const typeof( ((type *)0)->member ) *__mptr = (ptr); \ ^ >> fs/ocfs2/filecheck.c:444: struct ocfs2_super *osb = container_of(ent, >> struct ocfs2_super, ^~~~ include/linux/kernel.h:850:48: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types] const typeof( ((type *)0)->member ) *__mptr = (ptr); \ ^ >> fs/ocfs2/filecheck.c:444:28: note: in expansion of macro 'container_of' struct ocfs2_super *osb = container_of(ent, struct ocfs2_super, ^~~~ include/linux/kernel.h:850:48: note: (near initialization for 'osb') const typeof( ((type *)0)->member ) *__mptr = (ptr); \ ^ >> fs/ocfs2/filecheck.c:444:28: note: in expansion of macro 'container_of'
答复: Re: [PATCH 1/2] ocfs2: add kobject for online file check
Hello Kbuild, Could you build my whole patch set (2 patch)? I think that the code is OK. Thanks Gang >>> kbuild test robot 2016-12-19 下午 18:56 >>> Hi Gang, [auto build test ERROR on linus/master] [also build test ERROR on v4.9 next-20161219] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Gang-He/ocfs2-add-kobject-for-online-file-check/20161219-181858 config: x86_64-randconfig-x000-201651 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 Note: the linux-review/Gang-He/ocfs2-add-kobject-for-online-file-check/20161219-181858 HEAD 6ef9256cd25ef72a5e69490cc3dacde95b8e2ac4 builds fine. It only hurts bisectibility. All error/warnings (new ones prefixed by >>): fs/ocfs2/super.c: In function 'ocfs2_fill_super': >> fs/ocfs2/super.c:1204:31: error: passing argument 1 of >> 'ocfs2_filecheck_create_sysfs' from incompatible pointer type >> [-Werror=incompatible-pointer-types] ocfs2_filecheck_create_sysfs(sb); ^~ In file included from fs/ocfs2/super.c:77:0: fs/ocfs2/filecheck.h:71:5: note: expected 'struct ocfs2_super *' but argument is of type 'struct super_block *' int ocfs2_filecheck_create_sysfs(struct ocfs2_super *osb); ^~~~ fs/ocfs2/super.c: In function 'ocfs2_put_super': >> fs/ocfs2/super.c:1656:31: error: passing argument 1 of >> 'ocfs2_filecheck_remove_sysfs' from incompatible pointer type >> [-Werror=incompatible-pointer-types] ocfs2_filecheck_remove_sysfs(sb); ^~ In file included from fs/ocfs2/super.c:77:0: fs/ocfs2/filecheck.h:72:6: note: expected 'struct ocfs2_super *' but argument is of type 'struct super_block *' void ocfs2_filecheck_remove_sysfs(struct ocfs2_super *osb); ^~~~ cc1: some warnings being treated as errors -- fs/ocfs2/filecheck.c: In function 'ocfs2_filecheck_create_sysfs': >> fs/ocfs2/filecheck.c:179:50: error: 'struct ocfs2_super' has no member named >> 'osb_fc_ent'; did you mean 'osb_ctxt'? struct ocfs2_filecheck_sysfs_entry *entry = >osb_fc_ent; ^~ >> fs/ocfs2/filecheck.c:191:27: error: 'struct ocfs2_super' has no member named >> 'osb_dev_kset'; did you mean 'osb_dx_seed'? entry->fs_kobj.kset = osb->osb_dev_kset; ^~ fs/ocfs2/filecheck.c: In function 'ocfs2_filecheck_remove_sysfs': fs/ocfs2/filecheck.c:206:10: error: 'struct ocfs2_super' has no member named 'osb_fc_ent'; did you mean 'osb_ctxt'? if (!osb->osb_fc_ent.fs_fcheck) ^~ fs/ocfs2/filecheck.c:209:18: error: 'struct ocfs2_super' has no member named 'osb_fc_ent'; did you mean 'osb_ctxt'? kobject_del(>osb_fc_ent.fs_kobj); ^~ fs/ocfs2/filecheck.c:210:18: error: 'struct ocfs2_super' has no member named 'osb_fc_ent'; did you mean 'osb_ctxt'? kobject_put(>osb_fc_ent.fs_kobj); ^~ fs/ocfs2/filecheck.c:211:26: error: 'struct ocfs2_super' has no member named 'osb_fc_ent'; did you mean 'osb_ctxt'? wait_for_completion(>osb_fc_ent.fs_kobj_unregister); ^~ fs/ocfs2/filecheck.c:212:33: error: 'struct ocfs2_super' has no member named 'osb_fc_ent'; did you mean 'osb_ctxt'? ocfs2_filecheck_sysfs_free(>osb_fc_ent); ^~ In file included from include/linux/list.h:8:0, from fs/ocfs2/filecheck.c:20: fs/ocfs2/filecheck.c: In function 'ocfs2_filecheck_handle_entry': >> include/linux/kernel.h:850:27: error: 'struct ocfs2_super' has no member >> named 'osb_fc_ent'; did you mean 'osb_ctxt'? const typeof( ((type *)0)->member ) *__mptr = (ptr); \ ^ >> fs/ocfs2/filecheck.c:444: struct ocfs2_super *osb = container_of(ent, >> struct ocfs2_super, ^~~~ include/linux/kernel.h:850:48: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types] const typeof( ((type *)0)->member ) *__mptr = (ptr); \ ^ >> fs/ocfs2/filecheck.c:444:28: note: in expansion of macro 'container_of' struct ocfs2_super *osb = container_of(ent, struct ocfs2_super, ^~~~ include/linux/kernel.h:850:48: note: (near initialization for 'osb') const typeof( ((type *)0)->member ) *__mptr = (ptr); \ ^ >> fs/ocfs2/filecheck.c:444:28: note: in expansion of macro 'container_of' struct oc
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 5:34 PM, David Millerwrote: > From: Alexei Starovoitov > Date: Mon, 19 Dec 2016 16:02:56 -0800 > >> huh? 'not right api' because it's using bpf syscall instead >> of cgroup control-file? I think the opposite is the truth. > > I completely agree with Alexei on this. So what happens when someone adds another type of filter? Let's say there's a simple, no-privilege-required list of allowed address families that can hook up to the socket creation hook for a cgroup. Does BPF_PROG_DETACH still detach it? Or would both the bpf *and* the list of allowed address families be in force? If the latter, why wouldn't two BPF programs on the same hook be allowed? Concretely: # mkdir /cgroup/a # set_up_bpf_socket_rule /cgroup/a # set_up_list_of_address_families /cgroup/a # cat /cgroup/a/some_new_file [what gets displayed?] # BPF_PROG_DETACH: what happens By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't even take a reference to a BPF program as an argument. What is it supposed to do if this mechanism ever gets extended? --Andy
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 5:34 PM, David Miller wrote: > From: Alexei Starovoitov > Date: Mon, 19 Dec 2016 16:02:56 -0800 > >> huh? 'not right api' because it's using bpf syscall instead >> of cgroup control-file? I think the opposite is the truth. > > I completely agree with Alexei on this. So what happens when someone adds another type of filter? Let's say there's a simple, no-privilege-required list of allowed address families that can hook up to the socket creation hook for a cgroup. Does BPF_PROG_DETACH still detach it? Or would both the bpf *and* the list of allowed address families be in force? If the latter, why wouldn't two BPF programs on the same hook be allowed? Concretely: # mkdir /cgroup/a # set_up_bpf_socket_rule /cgroup/a # set_up_list_of_address_families /cgroup/a # cat /cgroup/a/some_new_file [what gets displayed?] # BPF_PROG_DETACH: what happens By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't even take a reference to a BPF program as an argument. What is it supposed to do if this mechanism ever gets extended? --Andy
Re: [PATCH] x86/microcode: Adjust ramdisk address when accessing by virtual address
On 12/19/2016 08:27 PM, Boris Ostrovsky wrote: On 12/19/2016 06:32 PM, Borislav Petkov wrote: On Mon, Dec 19, 2016 at 07:43:40PM +0100, Borislav Petkov wrote: On Mon, Dec 19, 2016 at 01:12:25PM -0500, Boris Ostrovsky wrote: IIUIC find_microcode_in_initrd() is called with paging on only on Intel (which is where I observed it). Ah, that was an important fact. Yes, I can repro it now. Ok, questions: * does your guest relocate the ramdisk? This is not a guest. I crashed with baremetal kernel. I.e., do you see something like this in dmesg before the splat: [0.00] RAMDISK: [mem 0x7f84c000-0x7ffc] [0.00] Allocated new RAMDISK: [mem 0x3647a000-0x36bfd9e6] [0.00] Move RAMDISK from [mem 0x7f84c000-0x7ffcf9e6] to [mem 0x3647a000-0x36bfd9e6] ^^ Sorry, forgot about this: I see the first line but not the other two (so the relocation did not occur). -boris If not, then I know what happens. Also, does it work if you change these lines: if (!use_pa && relocated_ramdisk) start = initrd_start; to: if (!use_pa) start = initrd_start; Yes, it does. I also thought it might be better but I haven't gone through the code to make sure this would always work. I can run more tests tomorrow if you want. -boris Because if that works, I can actually simplify that function radically. But more tomorrow. Thanks.
Re: [PATCH] x86/microcode: Adjust ramdisk address when accessing by virtual address
On 12/19/2016 08:27 PM, Boris Ostrovsky wrote: On 12/19/2016 06:32 PM, Borislav Petkov wrote: On Mon, Dec 19, 2016 at 07:43:40PM +0100, Borislav Petkov wrote: On Mon, Dec 19, 2016 at 01:12:25PM -0500, Boris Ostrovsky wrote: IIUIC find_microcode_in_initrd() is called with paging on only on Intel (which is where I observed it). Ah, that was an important fact. Yes, I can repro it now. Ok, questions: * does your guest relocate the ramdisk? This is not a guest. I crashed with baremetal kernel. I.e., do you see something like this in dmesg before the splat: [0.00] RAMDISK: [mem 0x7f84c000-0x7ffc] [0.00] Allocated new RAMDISK: [mem 0x3647a000-0x36bfd9e6] [0.00] Move RAMDISK from [mem 0x7f84c000-0x7ffcf9e6] to [mem 0x3647a000-0x36bfd9e6] ^^ Sorry, forgot about this: I see the first line but not the other two (so the relocation did not occur). -boris If not, then I know what happens. Also, does it work if you change these lines: if (!use_pa && relocated_ramdisk) start = initrd_start; to: if (!use_pa) start = initrd_start; Yes, it does. I also thought it might be better but I haven't gone through the code to make sure this would always work. I can run more tests tomorrow if you want. -boris Because if that works, I can actually simplify that function radically. But more tomorrow. Thanks.
Re: [PATCH v2 1/4] net: hix5hd2_gmac: add generic compatible string
On 2016/12/20 0:04, Rob Herring wrote: > On Mon, Dec 19, 2016 at 2:14 AM, Dongpo Liwrote: >> Hi Rob and David, >> >> On 2016/12/12 22:21, Rob Herring wrote: >>> On Mon, Dec 12, 2016 at 5:16 AM, Dongpo Li wrote: Hi Rob, On 2016/12/10 6:35, Rob Herring wrote: > On Mon, Dec 05, 2016 at 09:27:58PM +0800, Dongpo Li wrote: > > [...] > >> @@ -20,7 +25,7 @@ Required properties: >> >> Example: >> gmac0: ethernet@f984 { >> -compatible = "hisilicon,hix5hd2-gmac"; >> +compatible = "hisilicon,hix5hd2-gemac", >> "hisilicon,hisi-gemac-v1"; > > You can't just change compatible strings. > Okay, maybe I should name all the compatible string with the suffix "-gmac" instead of "-gemac". This can keep the compatible strings with the same suffix. Is this okay? Can I just add the generic compatible string without changing the SoCs compatible string? Like following: gmac0: ethernet@f984 { - compatible = "hisilicon,hix5hd2-gmac"; + compatible = "hisilicon,hix5hd2-gmac", "hisilicon,hisi-gmac-v1"; >>> >>> Yes, this is fine. >> >> As the patch series have been applied to net-next branch, >> in which way should I commit this compatible fix? >> Should I send a new patch fixing this compatible string error with "Fixes: >> xxx"? >> Looking forward to your reply! > > Yes to both. Okay, thanks for your reply! I will send the fix patch series as soon as possible. Regards, Dongpo .
Re: [PATCH v2 1/4] net: hix5hd2_gmac: add generic compatible string
On 2016/12/20 0:04, Rob Herring wrote: > On Mon, Dec 19, 2016 at 2:14 AM, Dongpo Li wrote: >> Hi Rob and David, >> >> On 2016/12/12 22:21, Rob Herring wrote: >>> On Mon, Dec 12, 2016 at 5:16 AM, Dongpo Li wrote: Hi Rob, On 2016/12/10 6:35, Rob Herring wrote: > On Mon, Dec 05, 2016 at 09:27:58PM +0800, Dongpo Li wrote: > > [...] > >> @@ -20,7 +25,7 @@ Required properties: >> >> Example: >> gmac0: ethernet@f984 { >> -compatible = "hisilicon,hix5hd2-gmac"; >> +compatible = "hisilicon,hix5hd2-gemac", >> "hisilicon,hisi-gemac-v1"; > > You can't just change compatible strings. > Okay, maybe I should name all the compatible string with the suffix "-gmac" instead of "-gemac". This can keep the compatible strings with the same suffix. Is this okay? Can I just add the generic compatible string without changing the SoCs compatible string? Like following: gmac0: ethernet@f984 { - compatible = "hisilicon,hix5hd2-gmac"; + compatible = "hisilicon,hix5hd2-gmac", "hisilicon,hisi-gmac-v1"; >>> >>> Yes, this is fine. >> >> As the patch series have been applied to net-next branch, >> in which way should I commit this compatible fix? >> Should I send a new patch fixing this compatible string error with "Fixes: >> xxx"? >> Looking forward to your reply! > > Yes to both. Okay, thanks for your reply! I will send the fix patch series as soon as possible. Regards, Dongpo .
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei StarovoitovDate: Mon, 19 Dec 2016 16:02:56 -0800 > huh? 'not right api' because it's using bpf syscall instead > of cgroup control-file? I think the opposite is the truth. I completely agree with Alexei on this.
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov Date: Mon, 19 Dec 2016 16:02:56 -0800 > huh? 'not right api' because it's using bpf syscall instead > of cgroup control-file? I think the opposite is the truth. I completely agree with Alexei on this.
Re: [PATCH] x86/microcode: Adjust ramdisk address when accessing by virtual address
On 12/19/2016 06:32 PM, Borislav Petkov wrote: On Mon, Dec 19, 2016 at 07:43:40PM +0100, Borislav Petkov wrote: On Mon, Dec 19, 2016 at 01:12:25PM -0500, Boris Ostrovsky wrote: IIUIC find_microcode_in_initrd() is called with paging on only on Intel (which is where I observed it). Ah, that was an important fact. Yes, I can repro it now. Ok, questions: * does your guest relocate the ramdisk? This is not a guest. I crashed with baremetal kernel. I.e., do you see something like this in dmesg before the splat: [0.00] RAMDISK: [mem 0x7f84c000-0x7ffc] [0.00] Allocated new RAMDISK: [mem 0x3647a000-0x36bfd9e6] [0.00] Move RAMDISK from [mem 0x7f84c000-0x7ffcf9e6] to [mem 0x3647a000-0x36bfd9e6] ^^ If not, then I know what happens. Also, does it work if you change these lines: if (!use_pa && relocated_ramdisk) start = initrd_start; to: if (!use_pa) start = initrd_start; Yes, it does. I also thought it might be better but I haven't gone through the code to make sure this would always work. I can run more tests tomorrow if you want. -boris Because if that works, I can actually simplify that function radically. But more tomorrow. Thanks.
Re: [PATCH] x86/microcode: Adjust ramdisk address when accessing by virtual address
On 12/19/2016 06:32 PM, Borislav Petkov wrote: On Mon, Dec 19, 2016 at 07:43:40PM +0100, Borislav Petkov wrote: On Mon, Dec 19, 2016 at 01:12:25PM -0500, Boris Ostrovsky wrote: IIUIC find_microcode_in_initrd() is called with paging on only on Intel (which is where I observed it). Ah, that was an important fact. Yes, I can repro it now. Ok, questions: * does your guest relocate the ramdisk? This is not a guest. I crashed with baremetal kernel. I.e., do you see something like this in dmesg before the splat: [0.00] RAMDISK: [mem 0x7f84c000-0x7ffc] [0.00] Allocated new RAMDISK: [mem 0x3647a000-0x36bfd9e6] [0.00] Move RAMDISK from [mem 0x7f84c000-0x7ffcf9e6] to [mem 0x3647a000-0x36bfd9e6] ^^ If not, then I know what happens. Also, does it work if you change these lines: if (!use_pa && relocated_ramdisk) start = initrd_start; to: if (!use_pa) start = initrd_start; Yes, it does. I also thought it might be better but I haven't gone through the code to make sure this would always work. I can run more tests tomorrow if you want. -boris Because if that works, I can actually simplify that function radically. But more tomorrow. Thanks.
Re: [PATCH] ALSA: snd-usb: fix IRQ triggered NULL pointer dereference
--- Hi, > Commit 16200948d835 ("ALSA: usb-audio: Fix race at stopping the stream") > fixes a race-codition but it also introduces another really nasty data race > regression which makes my usb sound card [1] completely useless, throwing > the kernel into a panic if anything from userspace tries to start playback. > > The problem is this: ep->data_subs is now set to NULL every time inside > wait_clear_urbs(). ep->data_subs is initalized only in one place in > start_endpoints(), then it is immediately wiped by the pre-existing call to > wait_clear_urbs() inside snd_usb_endpoint_start(). > > To ilustrate, this is what happens in the non-irq part of the code: > > Step 1 (inside start_endpoints): ep->data_subs = subs; > Step 2 (inside start_endpoints): snd_usb_endpoint_start(ep, can_sleep); > Step 3 (inside snd_usb_endpoint_start): wait_clear_urbs(ep); > Step 4 (inside wait_clear_urbs): ep->data_subs = NULL; > > Here's a simplified call stack for the IRQ part (full one at the end): > > (NULL dereference, param "subs" is passed NULL value of ep->data_subs) > retire_playback_urb > retire_outbound_urb > snd_complete_urb > (...) > xhci_irq > > Looking at the git log there seems to be quite a lot of history in this > part of the codebase, dating back to 2012 or earlier. My fix is based on > 015618b90 ("ALSA: snd-usb: Fix URB cancellation at stream start") and > e9ba389c5 ("ALSA: usb-audio: Fix scheduling-while-atomic bug in PCM capture > stream") with a few modifications of my own. My idea is to do the > ep->data_subs wiping before endpoint initialization in a manner similar > to these older commits by using stop_endpoints() in snd_usb_pcm_prepare() > and at the same time keep the ep->data_subs = NULL in wait_clear_urbs() to > not trigger the recently fixed stream stopping race again. > > Full call stack: > > [ 131.093240] BUG: unable to handle kernel NULL pointer dereference at > 0010 > [ 131.094313] IP: retire_playback_urb+0x1b/0x160 [snd_usb_audio] > [ 131.095046] PGD 0 > [ 131.095047] > [ 131.096509] Oops: [#1] PREEMPT SMP > [ 131.097255] Modules linked in: fuse snd_usb_audio snd_usbmidi_lib > snd_rawmidi snd_seq_device ctr ccm arc4 ath9k intel_rapl ath9k_common > x86_pkg_temp_thermal ath9k_hw intel_powerclamp coretemp joydev mousedev ath > kvm_intel mac80211 kvm > input_leds hid_generic psmouse irqbypass usbhid hid crct10dif_pclmul > crc32_pclmul serio_raw crc32c_intel atkbd ghash_clmulni_intel > snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel > pcbc aesni_intel libps2 cfg8 > 0211 aes_x86_64 crypto_simd dell_laptop glue_helper r8169 dell_smbios > snd_hda_codec sr_mod cryptd led_class mii rfkill snd_hwdep cdrom snd_hda_core > ac dcdbas i8042 xhci_pci serio xhci_hcd snd_pcm tpm_tis battery tpm_tis_core > lpc_ich tpm > snd_timer evdev shpchp i2c_i801 sch_fq_codel > [ 131.105551] CPU: 2 PID: 165 Comm: irq/29-xhci_hcd Not tainted > 4.9.0-gd824cdc58ba0 #10 > [ 131.107516] Hardware name: Dell Inc. Inspiron 3521/018DYG, BIOS A14 > 07/31/2015 > [ 131.109592] task: 880154a7 task.stack: c9f48000 > [ 131.111746] RIP: 0010:retire_playback_urb+0x1b/0x160 [snd_usb_audio] > [ 131.113899] RSP: 0018:c9f4bc10 EFLAGS: 00010082 > [ 131.116080] RAX: a04cabe0 RBX: RCX: > > [ 131.118284] RDX: RSI: 8801435a4c00 RDI: > > [ 131.120505] RBP: c9f4bc40 R08: 0001 R09: > 0001 > [ 131.122807] R10: 0001 R11: 82f60d6d R12: > 8801432a0238 > [ 131.125265] R13: 8801435a4c00 R14: R15: > 8801535c83b8 > [ 131.127723] FS: () GS:88015a00() > knlGS: > [ 131.130217] CS: 0010 DS: ES: CR0: 80050033 > [ 131.132568] CR2: 0010 CR3: 01e0b000 CR4: > 001406e0 > [ 131.135007] Call Trace: > [ 131.137329] snd_complete_urb+0x190/0x2b0 [snd_usb_audio] > [ 131.139526] __usb_hcd_giveback_urb+0x8e/0x160 > [ 131.141948] usb_hcd_giveback_urb+0x43/0x110 > [ 131.144104] xhci_giveback_urb_in_irq.isra.22+0x7d/0xb0 [xhci_hcd] > [ 131.146521] finish_td.constprop.38+0x1de/0x2f0 [xhci_hcd] > [ 131.148736] xhci_irq+0x13a2/0x1ca0 [xhci_hcd] > [ 131.150972] ? trace_hardirqs_on+0xd/0x10 > [ 131.153209] ? _raw_spin_unlock_irq+0x2c/0x50 > [ 131.155390] ? irq_thread+0xb5/0x1d0 > [ 131.157771] xhci_msi_irq+0x11/0x20 [xhci_hcd] > [ 131.159938] irq_forced_thread_fn+0x2f/0x70 > [ 131.162072] ? irq_thread+0xb5/0x1d0 > [ 131.164141] irq_thread+0x149/0x1d0 > [ 131.166132] ? irq_finalize_oneshot.part.2+0xe0/0xe0 > [ 131.168142] ? wake_threads_waitq+0x30/0x30 > [ 131.170149] kthread+0x10f/0x150 > [ 131.172144] ? irq_thread_dtor+0xc0/0xc0 > [ 131.174139] ? kthread_create_on_node+0x40/0x40 > [ 131.176116] ret_from_fork+0x2a/0x40 > [ 131.178074] Code: 8b 77 64 4c 89 e7 e8 e5 fe ff ff eb c3
Re: [PATCH] ALSA: snd-usb: fix IRQ triggered NULL pointer dereference
--- Hi, > Commit 16200948d835 ("ALSA: usb-audio: Fix race at stopping the stream") > fixes a race-codition but it also introduces another really nasty data race > regression which makes my usb sound card [1] completely useless, throwing > the kernel into a panic if anything from userspace tries to start playback. > > The problem is this: ep->data_subs is now set to NULL every time inside > wait_clear_urbs(). ep->data_subs is initalized only in one place in > start_endpoints(), then it is immediately wiped by the pre-existing call to > wait_clear_urbs() inside snd_usb_endpoint_start(). > > To ilustrate, this is what happens in the non-irq part of the code: > > Step 1 (inside start_endpoints): ep->data_subs = subs; > Step 2 (inside start_endpoints): snd_usb_endpoint_start(ep, can_sleep); > Step 3 (inside snd_usb_endpoint_start): wait_clear_urbs(ep); > Step 4 (inside wait_clear_urbs): ep->data_subs = NULL; > > Here's a simplified call stack for the IRQ part (full one at the end): > > (NULL dereference, param "subs" is passed NULL value of ep->data_subs) > retire_playback_urb > retire_outbound_urb > snd_complete_urb > (...) > xhci_irq > > Looking at the git log there seems to be quite a lot of history in this > part of the codebase, dating back to 2012 or earlier. My fix is based on > 015618b90 ("ALSA: snd-usb: Fix URB cancellation at stream start") and > e9ba389c5 ("ALSA: usb-audio: Fix scheduling-while-atomic bug in PCM capture > stream") with a few modifications of my own. My idea is to do the > ep->data_subs wiping before endpoint initialization in a manner similar > to these older commits by using stop_endpoints() in snd_usb_pcm_prepare() > and at the same time keep the ep->data_subs = NULL in wait_clear_urbs() to > not trigger the recently fixed stream stopping race again. > > Full call stack: > > [ 131.093240] BUG: unable to handle kernel NULL pointer dereference at > 0010 > [ 131.094313] IP: retire_playback_urb+0x1b/0x160 [snd_usb_audio] > [ 131.095046] PGD 0 > [ 131.095047] > [ 131.096509] Oops: [#1] PREEMPT SMP > [ 131.097255] Modules linked in: fuse snd_usb_audio snd_usbmidi_lib > snd_rawmidi snd_seq_device ctr ccm arc4 ath9k intel_rapl ath9k_common > x86_pkg_temp_thermal ath9k_hw intel_powerclamp coretemp joydev mousedev ath > kvm_intel mac80211 kvm > input_leds hid_generic psmouse irqbypass usbhid hid crct10dif_pclmul > crc32_pclmul serio_raw crc32c_intel atkbd ghash_clmulni_intel > snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel > pcbc aesni_intel libps2 cfg8 > 0211 aes_x86_64 crypto_simd dell_laptop glue_helper r8169 dell_smbios > snd_hda_codec sr_mod cryptd led_class mii rfkill snd_hwdep cdrom snd_hda_core > ac dcdbas i8042 xhci_pci serio xhci_hcd snd_pcm tpm_tis battery tpm_tis_core > lpc_ich tpm > snd_timer evdev shpchp i2c_i801 sch_fq_codel > [ 131.105551] CPU: 2 PID: 165 Comm: irq/29-xhci_hcd Not tainted > 4.9.0-gd824cdc58ba0 #10 > [ 131.107516] Hardware name: Dell Inc. Inspiron 3521/018DYG, BIOS A14 > 07/31/2015 > [ 131.109592] task: 880154a7 task.stack: c9f48000 > [ 131.111746] RIP: 0010:retire_playback_urb+0x1b/0x160 [snd_usb_audio] > [ 131.113899] RSP: 0018:c9f4bc10 EFLAGS: 00010082 > [ 131.116080] RAX: a04cabe0 RBX: RCX: > > [ 131.118284] RDX: RSI: 8801435a4c00 RDI: > > [ 131.120505] RBP: c9f4bc40 R08: 0001 R09: > 0001 > [ 131.122807] R10: 0001 R11: 82f60d6d R12: > 8801432a0238 > [ 131.125265] R13: 8801435a4c00 R14: R15: > 8801535c83b8 > [ 131.127723] FS: () GS:88015a00() > knlGS: > [ 131.130217] CS: 0010 DS: ES: CR0: 80050033 > [ 131.132568] CR2: 0010 CR3: 01e0b000 CR4: > 001406e0 > [ 131.135007] Call Trace: > [ 131.137329] snd_complete_urb+0x190/0x2b0 [snd_usb_audio] > [ 131.139526] __usb_hcd_giveback_urb+0x8e/0x160 > [ 131.141948] usb_hcd_giveback_urb+0x43/0x110 > [ 131.144104] xhci_giveback_urb_in_irq.isra.22+0x7d/0xb0 [xhci_hcd] > [ 131.146521] finish_td.constprop.38+0x1de/0x2f0 [xhci_hcd] > [ 131.148736] xhci_irq+0x13a2/0x1ca0 [xhci_hcd] > [ 131.150972] ? trace_hardirqs_on+0xd/0x10 > [ 131.153209] ? _raw_spin_unlock_irq+0x2c/0x50 > [ 131.155390] ? irq_thread+0xb5/0x1d0 > [ 131.157771] xhci_msi_irq+0x11/0x20 [xhci_hcd] > [ 131.159938] irq_forced_thread_fn+0x2f/0x70 > [ 131.162072] ? irq_thread+0xb5/0x1d0 > [ 131.164141] irq_thread+0x149/0x1d0 > [ 131.166132] ? irq_finalize_oneshot.part.2+0xe0/0xe0 > [ 131.168142] ? wake_threads_waitq+0x30/0x30 > [ 131.170149] kthread+0x10f/0x150 > [ 131.172144] ? irq_thread_dtor+0xc0/0xc0 > [ 131.174139] ? kthread_create_on_node+0x40/0x40 > [ 131.176116] ret_from_fork+0x2a/0x40 > [ 131.178074] Code: 8b 77 64 4c 89 e7 e8 e5 fe ff ff eb c3
Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wongwrote: > On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote: >> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote: >> <> >> > Definitely the first step would be your simple preallocated per >> > inode approach until it is shown to be insufficient. >> >> Reviving this thread a few months later... >> >> Dave, we're interested in taking a serious look at what it would take to get >> PMEM_IMMUTABLE working. Do you still hold the opinion that this is (or could >> become, with some amount of work) a workable solution? >> >> We're happy to do the grunt work for this feature, but we will probably need >> guidance from someone with more XFS experience. With you out on extended >> leave >> the first half of 2017, who would be the best person to ask for this >> guidance? >> Darrick? > > Yes, probably. :) > > I think where we left off with this (on the XFS side) is some sort of > fallocate mode that would allocate blocks, zero them, and then set the > DAX and PMEM_IMMUTABLE on-disk inode flags. After that, you'd mmap the > file and thereby gain the ability to control write persistents behavior > without having to worry about fs metadata updates. As an added plus, I > think zeroing the pmem also clears media errors, or something like that. > > Is that a reasonable starting point? My memory is a little foggy. > > Hmm, I see Dan just posted something about blockdev fallocate. I'll go > read that. That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE via a character device interface. It's useful for cases where you want an entire nvdimm namespace/volume in "no fs-metadata to worry about" mode. But, for sub-allocations of a namespace and support for existing tooling, PMEM_IMMUTABLE is much more usable.
Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
On Mon, Dec 19, 2016 at 5:09 PM, Darrick J. Wong wrote: > On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote: >> On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote: >> <> >> > Definitely the first step would be your simple preallocated per >> > inode approach until it is shown to be insufficient. >> >> Reviving this thread a few months later... >> >> Dave, we're interested in taking a serious look at what it would take to get >> PMEM_IMMUTABLE working. Do you still hold the opinion that this is (or could >> become, with some amount of work) a workable solution? >> >> We're happy to do the grunt work for this feature, but we will probably need >> guidance from someone with more XFS experience. With you out on extended >> leave >> the first half of 2017, who would be the best person to ask for this >> guidance? >> Darrick? > > Yes, probably. :) > > I think where we left off with this (on the XFS side) is some sort of > fallocate mode that would allocate blocks, zero them, and then set the > DAX and PMEM_IMMUTABLE on-disk inode flags. After that, you'd mmap the > file and thereby gain the ability to control write persistents behavior > without having to worry about fs metadata updates. As an added plus, I > think zeroing the pmem also clears media errors, or something like that. > > Is that a reasonable starting point? My memory is a little foggy. > > Hmm, I see Dan just posted something about blockdev fallocate. I'll go > read that. That's for device-dax, which is basically a poor man's PMEM_IMMUTABLE via a character device interface. It's useful for cases where you want an entire nvdimm namespace/volume in "no fs-metadata to worry about" mode. But, for sub-allocations of a namespace and support for existing tooling, PMEM_IMMUTABLE is much more usable.
Re: [PATCH] fbdev: remove current maintainer
Hi Daniel, On Thursday 08 Dec 2016 11:04:47 Daniel Vetter wrote: > On Thu, Dec 08, 2016 at 10:34:12AM +0200, Tomi Valkeinen wrote: > > Remove Tomi Valkeinen from fbdev maintainer and mark fbdev as orphan. > > > > Signed-off-by: Tomi Valkeinen> > --- > > > > I just don't have time to even pretend I maintain fbdev, so mark it as > > Orphan. > > > > Anyone want to pick fbdev maintainership? > > I think for fbdev stuff that we need for the fbdev emulation in drm we > could merge it through drm-misc. Or we could instead invest time on getting rid of that last dependency on fbdev ? :-) > And I'd very much welcome if you can > still review those patches (e.g. the nommu enabling that's floating right > now). If you want can also give you commit rights to drm-misc for that > area. Or drm-misc maintainers can apply those patches. > > Anything else would be still out of scope ofc. > > What we might want to do is add dri-devel to the mailing lists, to make > sure patch submissions get redicted to the right subsystem. Mail volume > seems pretty low compared to dri-devel, I think we can absorb that. > -Daniel > > > MAINTAINERS | 3 +-- > > 1 file changed, 1 insertion(+), 2 deletions(-) > > > > diff --git a/MAINTAINERS b/MAINTAINERS > > index 851b89b9edcb..fc2790eb4e84 100644 > > --- a/MAINTAINERS > > +++ b/MAINTAINERS > > @@ -4968,10 +4968,9 @@ F: drivers/net/wan/dlci.c > > > > F: drivers/net/wan/sdla.c > > > > FRAMEBUFFER LAYER > > > > -M: Tomi Valkeinen > > > > L: linux-fb...@vger.kernel.org > > Q: http://patchwork.kernel.org/project/linux-fbdev/list/ > > > > -S: Maintained > > +S: Orphan > > > > F: Documentation/fb/ > > F: drivers/video/ > > F: include/video/ -- Regards, Laurent Pinchart
Re: [PATCH] fbdev: remove current maintainer
Hi Daniel, On Thursday 08 Dec 2016 11:04:47 Daniel Vetter wrote: > On Thu, Dec 08, 2016 at 10:34:12AM +0200, Tomi Valkeinen wrote: > > Remove Tomi Valkeinen from fbdev maintainer and mark fbdev as orphan. > > > > Signed-off-by: Tomi Valkeinen > > --- > > > > I just don't have time to even pretend I maintain fbdev, so mark it as > > Orphan. > > > > Anyone want to pick fbdev maintainership? > > I think for fbdev stuff that we need for the fbdev emulation in drm we > could merge it through drm-misc. Or we could instead invest time on getting rid of that last dependency on fbdev ? :-) > And I'd very much welcome if you can > still review those patches (e.g. the nommu enabling that's floating right > now). If you want can also give you commit rights to drm-misc for that > area. Or drm-misc maintainers can apply those patches. > > Anything else would be still out of scope ofc. > > What we might want to do is add dri-devel to the mailing lists, to make > sure patch submissions get redicted to the right subsystem. Mail volume > seems pretty low compared to dri-devel, I think we can absorb that. > -Daniel > > > MAINTAINERS | 3 +-- > > 1 file changed, 1 insertion(+), 2 deletions(-) > > > > diff --git a/MAINTAINERS b/MAINTAINERS > > index 851b89b9edcb..fc2790eb4e84 100644 > > --- a/MAINTAINERS > > +++ b/MAINTAINERS > > @@ -4968,10 +4968,9 @@ F: drivers/net/wan/dlci.c > > > > F: drivers/net/wan/sdla.c > > > > FRAMEBUFFER LAYER > > > > -M: Tomi Valkeinen > > > > L: linux-fb...@vger.kernel.org > > Q: http://patchwork.kernel.org/project/linux-fbdev/list/ > > > > -S: Maintained > > +S: Orphan > > > > F: Documentation/fb/ > > F: drivers/video/ > > F: include/video/ -- Regards, Laurent Pinchart
Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote: > On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote: > <> > > Definitely the first step would be your simple preallocated per > > inode approach until it is shown to be insufficient. > > Reviving this thread a few months later... > > Dave, we're interested in taking a serious look at what it would take to get > PMEM_IMMUTABLE working. Do you still hold the opinion that this is (or could > become, with some amount of work) a workable solution? > > We're happy to do the grunt work for this feature, but we will probably need > guidance from someone with more XFS experience. With you out on extended > leave > the first half of 2017, who would be the best person to ask for this guidance? > Darrick? Yes, probably. :) I think where we left off with this (on the XFS side) is some sort of fallocate mode that would allocate blocks, zero them, and then set the DAX and PMEM_IMMUTABLE on-disk inode flags. After that, you'd mmap the file and thereby gain the ability to control write persistents behavior without having to worry about fs metadata updates. As an added plus, I think zeroing the pmem also clears media errors, or something like that. Is that a reasonable starting point? My memory is a little foggy. Hmm, I see Dan just posted something about blockdev fallocate. I'll go read that. --D > > Thanks, > - Ross
Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
On Mon, Dec 19, 2016 at 02:11:49PM -0700, Ross Zwisler wrote: > On Fri, Sep 16, 2016 at 03:54:05PM +1000, Nicholas Piggin wrote: > <> > > Definitely the first step would be your simple preallocated per > > inode approach until it is shown to be insufficient. > > Reviving this thread a few months later... > > Dave, we're interested in taking a serious look at what it would take to get > PMEM_IMMUTABLE working. Do you still hold the opinion that this is (or could > become, with some amount of work) a workable solution? > > We're happy to do the grunt work for this feature, but we will probably need > guidance from someone with more XFS experience. With you out on extended > leave > the first half of 2017, who would be the best person to ask for this guidance? > Darrick? Yes, probably. :) I think where we left off with this (on the XFS side) is some sort of fallocate mode that would allocate blocks, zero them, and then set the DAX and PMEM_IMMUTABLE on-disk inode flags. After that, you'd mmap the file and thereby gain the ability to control write persistents behavior without having to worry about fs metadata updates. As an added plus, I think zeroing the pmem also clears media errors, or something like that. Is that a reasonable starting point? My memory is a little foggy. Hmm, I see Dan just posted something about blockdev fallocate. I'll go read that. --D > > Thanks, > - Ross
Re: [PATCH] nvme-fabrics: remove some logically dead code performing redundant ret checks
Looks good. -- james Signed-off-by: James SmartOn 12/9/2016 6:59 AM, Colin King wrote: From: Colin Ian King The check to see if ret is non-zero and return this rather than count is redundant in two occassions. It is redundant because prior to this check, the return code ret is already checked for a non-zero error return value and we return from the function at that point. Signed-off-by: Colin Ian King --- drivers/nvme/target/fcloop.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/target/fcloop.c b/drivers/nvme/target/fcloop.c index bcb8ebe..4e8e6a2 100644 --- a/drivers/nvme/target/fcloop.c +++ b/drivers/nvme/target/fcloop.c @@ -845,7 +845,7 @@ fcloop_create_remote_port(struct device *dev, struct device_attribute *attr, rport->lport = nport->lport; nport->rport = rport; - return ret ? ret : count; + return count; } @@ -952,7 +952,7 @@ fcloop_create_target_port(struct device *dev, struct device_attribute *attr, tport->lport = nport->lport; nport->tport = tport; - return ret ? ret : count; + return count; }
Re: [PATCH] nvme-fabrics: remove some logically dead code performing redundant ret checks
Looks good. -- james Signed-off-by: James Smart On 12/9/2016 6:59 AM, Colin King wrote: From: Colin Ian King The check to see if ret is non-zero and return this rather than count is redundant in two occassions. It is redundant because prior to this check, the return code ret is already checked for a non-zero error return value and we return from the function at that point. Signed-off-by: Colin Ian King --- drivers/nvme/target/fcloop.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/target/fcloop.c b/drivers/nvme/target/fcloop.c index bcb8ebe..4e8e6a2 100644 --- a/drivers/nvme/target/fcloop.c +++ b/drivers/nvme/target/fcloop.c @@ -845,7 +845,7 @@ fcloop_create_remote_port(struct device *dev, struct device_attribute *attr, rport->lport = nport->lport; nport->rport = rport; - return ret ? ret : count; + return count; } @@ -952,7 +952,7 @@ fcloop_create_target_port(struct device *dev, struct device_attribute *attr, tport->lport = nport->lport; nport->tport = tport; - return ret ? ret : count; + return count; }
Re: [patch] nvme-fabrics: correct some printk information
Dan, Mind if I solve this a different way ? I really don't know why knowing the ptr value is even meaningful -- james On 12/10/2016 1:06 AM, Dan Carpenter wrote: We really don't care where "ctrl" is on the stack since we're just returning soon what we want is the actual ctrl pointer itself. Signed-off-by: Dan Carpenterdiff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index 771e2e761872..e6395ed2f562 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -2402,7 +2402,7 @@ enum blk_eh_timer_return dev_info(ctrl->ctrl.device, "NVME-FC{%d}: new ctrl: NQN \"%s\" (%p)\n", - ctrl->cnum, ctrl->ctrl.opts->subsysnqn, ); + ctrl->cnum, ctrl->ctrl.opts->subsysnqn, ctrl); kref_get(>ctrl.kref);
Re: [patch] nvme-fabrics: correct some printk information
Dan, Mind if I solve this a different way ? I really don't know why knowing the ptr value is even meaningful -- james On 12/10/2016 1:06 AM, Dan Carpenter wrote: We really don't care where "ctrl" is on the stack since we're just returning soon what we want is the actual ctrl pointer itself. Signed-off-by: Dan Carpenter diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index 771e2e761872..e6395ed2f562 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -2402,7 +2402,7 @@ enum blk_eh_timer_return dev_info(ctrl->ctrl.device, "NVME-FC{%d}: new ctrl: NQN \"%s\" (%p)\n", - ctrl->cnum, ctrl->ctrl.opts->subsysnqn, ); + ctrl->cnum, ctrl->ctrl.opts->subsysnqn, ctrl); kref_get(>ctrl.kref);
[PULL] One last documentation fix
The following changes since commit 3fa71d0f58a9b9df84e8e79196f961bcfbf01b2e: crypto: doc - optimize compilation (2016-12-13 16:38:07 -0700) are available in the git repository at: git://git.lwn.net/linux.git tags/doc-4.10-3 for you to fetch changes up to 217e2bfab22e740227df09f22165e834cddd8a3b: docs: sphinx-extensions: make rstFlatTable work with docutils 0.13 (2016-12-18 13:30:29 -0700) A single fix for the build system. It would appear that the docutils developers, in their wisdom, broke the API in the 0.13 release. This fix detects the breakage and allows the docs to be built with both the old and new versions. Dmitry Shachnev (1): docs: sphinx-extensions: make rstFlatTable work with docutils 0.13 Documentation/sphinx/rstFlatTable.py | 5 + 1 file changed, 5 insertions(+)
[PULL] One last documentation fix
The following changes since commit 3fa71d0f58a9b9df84e8e79196f961bcfbf01b2e: crypto: doc - optimize compilation (2016-12-13 16:38:07 -0700) are available in the git repository at: git://git.lwn.net/linux.git tags/doc-4.10-3 for you to fetch changes up to 217e2bfab22e740227df09f22165e834cddd8a3b: docs: sphinx-extensions: make rstFlatTable work with docutils 0.13 (2016-12-18 13:30:29 -0700) A single fix for the build system. It would appear that the docutils developers, in their wisdom, broke the API in the 0.13 release. This fix detects the breakage and allows the docs to be built with both the old and new versions. Dmitry Shachnev (1): docs: sphinx-extensions: make rstFlatTable work with docutils 0.13 Documentation/sphinx/rstFlatTable.py | 5 + 1 file changed, 5 insertions(+)
mmotm 2016-12-19-16-31 uploaded
The mm-of-the-moment snapshot 2016-12-19-16-31 has been uploaded to http://www.ozlabs.org/~akpm/mmotm/ mmotm-readme.txt says README for mm-of-the-moment: http://www.ozlabs.org/~akpm/mmotm/ This is a snapshot of my -mm patch queue. Uploaded at random hopefully more than once a week. You will need quilt to apply these patches to the latest Linus release (4.x or 4.x-rcY). The series file is in broken-out.tar.gz and is duplicated in http://ozlabs.org/~akpm/mmotm/series The file broken-out.tar.gz contains two datestamp files: .DATE and .DATE--mm-dd-hh-mm-ss. Both contain the string -mm-dd-hh-mm-ss, followed by the base kernel version against which this patch series is to be applied. This tree is partially included in linux-next. To see which patches are included in linux-next, consult the `series' file. Only the patches within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in linux-next. A git tree which contains the memory management portion of this tree is maintained at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git by Michal Hocko. It contains the patches which are between the "#NEXT_PATCHES_START mm" and "#NEXT_PATCHES_END" markers, from the series file, http://www.ozlabs.org/~akpm/mmotm/series. A full copy of the full kernel tree with the linux-next and mmotm patches already applied is available through git within an hour of the mmotm release. Individual mmotm releases are tagged. The master branch always points to the latest release, so it's constantly rebasing. http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/ To develop on top of mmotm git: $ git remote add mmotm git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git $ git remote update mmotm $ git checkout -b topic mmotm/master $ git send-email mmotm/master.. [...] To rebase a branch with older patches to a new mmotm release: $ git remote update mmotm $ git rebase --onto mmotm/master topic The directory http://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second) contains daily snapshots of the -mm tree. It is updated more frequently than mmotm, and is untested. A git copy of this tree is available at http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/ and use of this tree is similar to http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/, described above. This mmotm tree contains the following patches against 4.9: (patches marked "*" will be included in linux-next) origin.patch * powerpc-ima-get-the-kexec-buffer-passed-by-the-previous-kernel.patch * ima-on-soft-reboot-restore-the-measurement-list.patch * ima-permit-duplicate-measurement-list-entries.patch * ima-maintain-memory-size-needed-for-serializing-the-measurement-list.patch * powerpc-ima-send-the-kexec-buffer-to-the-next-kernel.patch * ima-on-soft-reboot-save-the-measurement-list.patch * ima-store-the-builtin-custom-template-definitions-in-a-list.patch * ima-support-restoring-multiple-template-formats.patch * ima-define-a-canonical-binary_runtime_measurements-list-format.patch * ima-platform-independent-hash-value.patch * mm-fadvise-avoid-expensive-remote-lru-cache-draining-after-fadv_dontneed.patch * arm64-setup-introduce-kaslr_offset.patch * kcov-make-kcov-work-properly-with-kaslr-enabled.patch * ratelimit-fix-warn_on_ratelimit-return-value.patch * printk-fix-typo-in-console_loglevel_default-help-text.patch i-need-old-gcc.patch * mm-page_alloc-fix-incorrect-zone_statistics-data.patch * mm-thp-pagecache-only-withdraw-page-table-after-a-successful-deposit.patch * mm-thp-pagecache-collapse-free-the-pte-page-table-on-collapse-for-thp-page-cache.patch * arm-arch-arm-include-asm-pageh-needs-personalityh.patch * scripts-spellingtxt-add-several-more-common-spelling-mistakes.patch * ocfs2-old-mle-put-and-release-after-the-function-dlm_add_migration_mle-called.patch * ocfs2-old-mle-put-and-release-after-the-function-dlm_add_migration_mle-called-fix.patch * ocfs2-fix-crash-caused-by-stale-lvb-with-fsdlm-plugin.patch * block-restore-proc-partitions-to-not-display-non-partitionable-removable-devices.patch * kernel-watchdog-prevent-false-hardlockup-on-overloaded-system.patch * kernel-watchdog-prevent-false-hardlockup-on-overloaded-system-fix.patch mm.patch * tmpfs-change-shmem_mapping-to-test-shmem_aops.patch * mm-throttle-show_mem-from-warn_alloc.patch * mm-throttle-show_mem-from-warn_alloc-fix.patch * mm-page_alloc-dont-convert-pfn-to-idx-when-merging.patch * mm-page_alloc-avoid-page_to_pfn-when-merging-buddies.patch * z3fold-limit-first_num-to-the-actual-range-of-possible-buddy-indexes.patch * z3fold-make-pages_nr-atomic.patch * z3fold-extend-compaction-function.patch * z3fold-use-per-page-spinlock.patch * z3fold-discourage-use-of-pages-that-werent-compacted.patch * z3fold-fix-header-size-related-issues.patch * z3fold-fix-locking-issues.patch * mm-page_owner-align-with-pageblock_nr-pages.patch * mm-walk-the-zone-in-pageblock_nr_pages-steps.patch * lib-add-crc64-ecma-module.patch *
mmotm 2016-12-19-16-31 uploaded
The mm-of-the-moment snapshot 2016-12-19-16-31 has been uploaded to http://www.ozlabs.org/~akpm/mmotm/ mmotm-readme.txt says README for mm-of-the-moment: http://www.ozlabs.org/~akpm/mmotm/ This is a snapshot of my -mm patch queue. Uploaded at random hopefully more than once a week. You will need quilt to apply these patches to the latest Linus release (4.x or 4.x-rcY). The series file is in broken-out.tar.gz and is duplicated in http://ozlabs.org/~akpm/mmotm/series The file broken-out.tar.gz contains two datestamp files: .DATE and .DATE--mm-dd-hh-mm-ss. Both contain the string -mm-dd-hh-mm-ss, followed by the base kernel version against which this patch series is to be applied. This tree is partially included in linux-next. To see which patches are included in linux-next, consult the `series' file. Only the patches within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in linux-next. A git tree which contains the memory management portion of this tree is maintained at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git by Michal Hocko. It contains the patches which are between the "#NEXT_PATCHES_START mm" and "#NEXT_PATCHES_END" markers, from the series file, http://www.ozlabs.org/~akpm/mmotm/series. A full copy of the full kernel tree with the linux-next and mmotm patches already applied is available through git within an hour of the mmotm release. Individual mmotm releases are tagged. The master branch always points to the latest release, so it's constantly rebasing. http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/ To develop on top of mmotm git: $ git remote add mmotm git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git $ git remote update mmotm $ git checkout -b topic mmotm/master $ git send-email mmotm/master.. [...] To rebase a branch with older patches to a new mmotm release: $ git remote update mmotm $ git rebase --onto mmotm/master topic The directory http://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second) contains daily snapshots of the -mm tree. It is updated more frequently than mmotm, and is untested. A git copy of this tree is available at http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/ and use of this tree is similar to http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/, described above. This mmotm tree contains the following patches against 4.9: (patches marked "*" will be included in linux-next) origin.patch * powerpc-ima-get-the-kexec-buffer-passed-by-the-previous-kernel.patch * ima-on-soft-reboot-restore-the-measurement-list.patch * ima-permit-duplicate-measurement-list-entries.patch * ima-maintain-memory-size-needed-for-serializing-the-measurement-list.patch * powerpc-ima-send-the-kexec-buffer-to-the-next-kernel.patch * ima-on-soft-reboot-save-the-measurement-list.patch * ima-store-the-builtin-custom-template-definitions-in-a-list.patch * ima-support-restoring-multiple-template-formats.patch * ima-define-a-canonical-binary_runtime_measurements-list-format.patch * ima-platform-independent-hash-value.patch * mm-fadvise-avoid-expensive-remote-lru-cache-draining-after-fadv_dontneed.patch * arm64-setup-introduce-kaslr_offset.patch * kcov-make-kcov-work-properly-with-kaslr-enabled.patch * ratelimit-fix-warn_on_ratelimit-return-value.patch * printk-fix-typo-in-console_loglevel_default-help-text.patch i-need-old-gcc.patch * mm-page_alloc-fix-incorrect-zone_statistics-data.patch * mm-thp-pagecache-only-withdraw-page-table-after-a-successful-deposit.patch * mm-thp-pagecache-collapse-free-the-pte-page-table-on-collapse-for-thp-page-cache.patch * arm-arch-arm-include-asm-pageh-needs-personalityh.patch * scripts-spellingtxt-add-several-more-common-spelling-mistakes.patch * ocfs2-old-mle-put-and-release-after-the-function-dlm_add_migration_mle-called.patch * ocfs2-old-mle-put-and-release-after-the-function-dlm_add_migration_mle-called-fix.patch * ocfs2-fix-crash-caused-by-stale-lvb-with-fsdlm-plugin.patch * block-restore-proc-partitions-to-not-display-non-partitionable-removable-devices.patch * kernel-watchdog-prevent-false-hardlockup-on-overloaded-system.patch * kernel-watchdog-prevent-false-hardlockup-on-overloaded-system-fix.patch mm.patch * tmpfs-change-shmem_mapping-to-test-shmem_aops.patch * mm-throttle-show_mem-from-warn_alloc.patch * mm-throttle-show_mem-from-warn_alloc-fix.patch * mm-page_alloc-dont-convert-pfn-to-idx-when-merging.patch * mm-page_alloc-avoid-page_to_pfn-when-merging-buddies.patch * z3fold-limit-first_num-to-the-actual-range-of-possible-buddy-indexes.patch * z3fold-make-pages_nr-atomic.patch * z3fold-extend-compaction-function.patch * z3fold-use-per-page-spinlock.patch * z3fold-discourage-use-of-pages-that-werent-compacted.patch * z3fold-fix-header-size-related-issues.patch * z3fold-fix-locking-issues.patch * mm-page_owner-align-with-pageblock_nr-pages.patch * mm-walk-the-zone-in-pageblock_nr_pages-steps.patch * lib-add-crc64-ecma-module.patch *
Re: [PATCH v2] fix code alignment with open parenthesis
On 12/19/2016 12:35 AM, Greg KH wrote: On Sun, Dec 18, 2016 at 11:47:30AM -0600, Scott Matheina wrote: These changes where identified by checkpatch.pl as needed changes to align the code with the linux development coding style. The several lines of text where aligned with the precending parenthesis. Signed-off-by: Scott MatheinaChanges to be committed: modified: drivers/staging/fbtft/fb_agm1264k-fl.c I'll remove the above text after signed-off-by and resubmit, honestly it's been over a year since I submitted a patch so I have to relearn how you want them formatted. Thanks for working with me. Why are these lines in the changelog text?
Re: [PATCH v2] fix code alignment with open parenthesis
On 12/19/2016 12:35 AM, Greg KH wrote: On Sun, Dec 18, 2016 at 11:47:30AM -0600, Scott Matheina wrote: These changes where identified by checkpatch.pl as needed changes to align the code with the linux development coding style. The several lines of text where aligned with the precending parenthesis. Signed-off-by: Scott Matheina Changes to be committed: modified: drivers/staging/fbtft/fb_agm1264k-fl.c I'll remove the above text after signed-off-by and resubmit, honestly it's been over a year since I submitted a patch so I have to relearn how you want them formatted. Thanks for working with me. Why are these lines in the changelog text?
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitovwrote: > On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote: >> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov >> wrote: >> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: >> >> Hi all- >> >> >> >> I apologize for being rather late with this. I didn't realize that >> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see >> >> it on the linux-api list, so I missed the discussion. >> >> >> >> I think that the inet ingress, egress etc filters are a neat feature, >> >> but I think the API has some issues that will bite us down the road >> >> if it becomes stable in its current form. >> >> >> >> Most of the problems I see are summarized in this transcript: >> >> >> >> # mkdir cg2 >> >> # mount -t cgroup2 none cg2 >> >> # mkdir cg2/nosockets >> >> # strace cgrp_socket_rule cg2/nosockets/ 0 >> >> ... >> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 >> >> >> >> You can modify a cgroup after opening it O_RDONLY? >> >> >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, >> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, >> >> log_buf=0x6020c0, kern_version=0}, 48) = 4 >> >> >> >> This is fine. The bpf() syscall manipulates bpf objects. >> >> >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 >> >> >> >> This is not so good: >> >> >> >> a) The bpf() syscall is supposed to manipulate bpf objects. This >> >> is manipulating a cgroup. There's no reason that a socket >> >> creation >> >> filter couldn't be written in a different language (new iptables >> >> table? Simple list of address families?), but if that happened, >> >> then using bpf() to install it would be entirely nonsensical. >> > >> > I don't see why it's _modifing_ the cgroup. I'm looking at it as >> > network stack is using cgroup as an application group that should >> > invoke bpf program at the certain point in the stack. >> > imo cgroup management is orthogonal. >> >> It is literally modifying the struct cgroup, and, as a practical >> matter, it's causing membership in the cgroup to have a certain >> effect. But rather than pointless arguing, let me propose an >> alternative API that I think solves most of the problems here. >> >> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely. >> Instead, the cgroup gets three new control files: >> "net.ingress_filter", "net.egress_filter", and >> "net.socket_create_filter". Initially, if you read these files, you >> see "none\n". >> >> To attach a bpf filter, you open the file for write and do an ioctl on >> it. After doing the ioctl, if you read the file, you'll see >> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for >> the bpf program. >> >> To detach any type of filter, bpf or otherwise, you open the file for >> write and write "none\n" (or just "none"). >> >> If you write anything else to the file, you get -EINVAL. But, if >> someone writes a new type of filter (perhaps a simple list of address >> families), maybe you can enable the filter by writing something >> appropriate to the file. > > I see no difference in what you're proposing vs what is already implemented > from feature set point of view, but the file approach is very ugly, since > it's a mismatch to FD style access that bpf is using everywhere. > In your proposal you'd also need to add bpf prefix everywhere. > So the control file names should be bpf_inet_ingress, bpf_inet_egress > and bpf_socket_create. I think we're still talking past each other. A big part of the point of changing it is that none of this is specific to bpf. You could (in theory -- I'm not proposing implementing these until there's demand) have: net.socket_create_filter = "none": no filter net.socket_create_filter = "bpf:baadf00d": bpf filter net.socket_create_filter = "disallow": no sockets created period net.socket_create_filter = "iptables:foobar": some iptables thingy net.socket_create_filter = "nft:blahblahblah": some nft thingy net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 See? This API is not bpf-specific. It's an API for filtering. The fact that struct cgroup currently contains a member called "bpf" is purely an artifact of the fact that it currently only supports bpf. Someone will want to rename it to "filters" some day, and BPF_PROG_DETACH makes no sense whatsoever as a generic API to detach a filter. > If you want to prepare such patch for them, I don't mind, > but you cannot kill syscall command, since it's more flexible > and your control-file approach _will_ be obsolete pretty quickly. BPF_PROG_ATTACH and BPF_PROC_DETACH should be removed regardless IMO. If you really really want a syscall, make it a new syscall. > >> Now the API matches the effect. A cgroup with something other than >> "none" in one
[PATCH] tty: hvc_dcc: Add basic early_con support
In some cases, earlycon can help catch errors with kernel boot prior to standard console is available. Example bootargs: console=hvc0 earlycon=hvcdcc Signed-off-by: Nishanth Menon <n...@ti.com> --- Based on: v4.9 tag Also applies on: next-20161219 Tested on Simulation environment (which did not have serial console option) drivers/tty/hvc/hvc_dcc.c | 44 1 file changed, 44 insertions(+) diff --git a/drivers/tty/hvc/hvc_dcc.c b/drivers/tty/hvc/hvc_dcc.c index 82f240fb98f0..1ec06e550656 100644 --- a/drivers/tty/hvc/hvc_dcc.c +++ b/drivers/tty/hvc/hvc_dcc.c @@ -10,7 +10,9 @@ * GNU General Public License for more details. */ +#include #include +#include #include #include @@ -94,3 +96,45 @@ static int __init hvc_dcc_init(void) return PTR_ERR_OR_ZERO(p); } device_initcall(hvc_dcc_init); + +static int hvc_dcc_earlyputc(int c) +{ + unsigned long count = 0x; + static bool dead_dcc_earlycon; + + if (dead_dcc_earlycon) + return -EBUSY; + + while (count--) { + if (!(__dcc_getstatus() & DCC_STATUS_TX)) + break; + } + if (!count) { + dead_dcc_earlycon = true; + return -EBUSY; + } + __dcc_putchar(c); + return 0; +} + +static void hvc_dcc_earlywrite(struct console *con, const char *s, + unsigned int n) +{ + int r; + + while (n--) { + r = hvc_dcc_earlyputc(*s); + if (r) + break; + s++; + } +} + +static int +__init early_hvc_dcc_setup(struct earlycon_device *device, const char *opt) +{ + device->con->write = hvc_dcc_earlywrite; + return 0; +} + +EARLYCON_DECLARE(hvcdcc, early_hvc_dcc_setup); -- 2.11.0.65.geff96d7
[PATCH] tty: hvc_dcc: Add basic early_con support
In some cases, earlycon can help catch errors with kernel boot prior to standard console is available. Example bootargs: console=hvc0 earlycon=hvcdcc Signed-off-by: Nishanth Menon --- Based on: v4.9 tag Also applies on: next-20161219 Tested on Simulation environment (which did not have serial console option) drivers/tty/hvc/hvc_dcc.c | 44 1 file changed, 44 insertions(+) diff --git a/drivers/tty/hvc/hvc_dcc.c b/drivers/tty/hvc/hvc_dcc.c index 82f240fb98f0..1ec06e550656 100644 --- a/drivers/tty/hvc/hvc_dcc.c +++ b/drivers/tty/hvc/hvc_dcc.c @@ -10,7 +10,9 @@ * GNU General Public License for more details. */ +#include #include +#include #include #include @@ -94,3 +96,45 @@ static int __init hvc_dcc_init(void) return PTR_ERR_OR_ZERO(p); } device_initcall(hvc_dcc_init); + +static int hvc_dcc_earlyputc(int c) +{ + unsigned long count = 0x; + static bool dead_dcc_earlycon; + + if (dead_dcc_earlycon) + return -EBUSY; + + while (count--) { + if (!(__dcc_getstatus() & DCC_STATUS_TX)) + break; + } + if (!count) { + dead_dcc_earlycon = true; + return -EBUSY; + } + __dcc_putchar(c); + return 0; +} + +static void hvc_dcc_earlywrite(struct console *con, const char *s, + unsigned int n) +{ + int r; + + while (n--) { + r = hvc_dcc_earlyputc(*s); + if (r) + break; + s++; + } +} + +static int +__init early_hvc_dcc_setup(struct earlycon_device *device, const char *opt) +{ + device->con->write = hvc_dcc_earlywrite; + return 0; +} + +EARLYCON_DECLARE(hvcdcc, early_hvc_dcc_setup); -- 2.11.0.65.geff96d7
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov wrote: > On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote: >> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov >> wrote: >> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: >> >> Hi all- >> >> >> >> I apologize for being rather late with this. I didn't realize that >> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see >> >> it on the linux-api list, so I missed the discussion. >> >> >> >> I think that the inet ingress, egress etc filters are a neat feature, >> >> but I think the API has some issues that will bite us down the road >> >> if it becomes stable in its current form. >> >> >> >> Most of the problems I see are summarized in this transcript: >> >> >> >> # mkdir cg2 >> >> # mount -t cgroup2 none cg2 >> >> # mkdir cg2/nosockets >> >> # strace cgrp_socket_rule cg2/nosockets/ 0 >> >> ... >> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 >> >> >> >> You can modify a cgroup after opening it O_RDONLY? >> >> >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, >> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, >> >> log_buf=0x6020c0, kern_version=0}, 48) = 4 >> >> >> >> This is fine. The bpf() syscall manipulates bpf objects. >> >> >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 >> >> >> >> This is not so good: >> >> >> >> a) The bpf() syscall is supposed to manipulate bpf objects. This >> >> is manipulating a cgroup. There's no reason that a socket >> >> creation >> >> filter couldn't be written in a different language (new iptables >> >> table? Simple list of address families?), but if that happened, >> >> then using bpf() to install it would be entirely nonsensical. >> > >> > I don't see why it's _modifing_ the cgroup. I'm looking at it as >> > network stack is using cgroup as an application group that should >> > invoke bpf program at the certain point in the stack. >> > imo cgroup management is orthogonal. >> >> It is literally modifying the struct cgroup, and, as a practical >> matter, it's causing membership in the cgroup to have a certain >> effect. But rather than pointless arguing, let me propose an >> alternative API that I think solves most of the problems here. >> >> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely. >> Instead, the cgroup gets three new control files: >> "net.ingress_filter", "net.egress_filter", and >> "net.socket_create_filter". Initially, if you read these files, you >> see "none\n". >> >> To attach a bpf filter, you open the file for write and do an ioctl on >> it. After doing the ioctl, if you read the file, you'll see >> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for >> the bpf program. >> >> To detach any type of filter, bpf or otherwise, you open the file for >> write and write "none\n" (or just "none"). >> >> If you write anything else to the file, you get -EINVAL. But, if >> someone writes a new type of filter (perhaps a simple list of address >> families), maybe you can enable the filter by writing something >> appropriate to the file. > > I see no difference in what you're proposing vs what is already implemented > from feature set point of view, but the file approach is very ugly, since > it's a mismatch to FD style access that bpf is using everywhere. > In your proposal you'd also need to add bpf prefix everywhere. > So the control file names should be bpf_inet_ingress, bpf_inet_egress > and bpf_socket_create. I think we're still talking past each other. A big part of the point of changing it is that none of this is specific to bpf. You could (in theory -- I'm not proposing implementing these until there's demand) have: net.socket_create_filter = "none": no filter net.socket_create_filter = "bpf:baadf00d": bpf filter net.socket_create_filter = "disallow": no sockets created period net.socket_create_filter = "iptables:foobar": some iptables thingy net.socket_create_filter = "nft:blahblahblah": some nft thingy net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 See? This API is not bpf-specific. It's an API for filtering. The fact that struct cgroup currently contains a member called "bpf" is purely an artifact of the fact that it currently only supports bpf. Someone will want to rename it to "filters" some day, and BPF_PROG_DETACH makes no sense whatsoever as a generic API to detach a filter. > If you want to prepare such patch for them, I don't mind, > but you cannot kill syscall command, since it's more flexible > and your control-file approach _will_ be obsolete pretty quickly. BPF_PROG_ATTACH and BPF_PROC_DETACH should be removed regardless IMO. If you really really want a syscall, make it a new syscall. > >> Now the API matches the effect. A cgroup with something other than >> "none" in one of its net.*_filter files is a cgroup that filters >>
Re: Documentation/unaligned-memory-access.txt: fix incorrect comparison operator
On Sun, Dec 18, 2016 at 12:52:12AM +0200, Ozgur Karatas wrote: > 17.12.2016, 19:43, "Cihangir Akturk": > > In the actual implementation ether_addr_equal function tests for equality > > to 0 > > when returning. It seems in commit 0d74c4 it is somehow overlooked to change > > this operator to reflect the actual function. > > why this "return" function need to be ==0? I think, u16 functions read memory > but "0" is should not be equalty. XOR is true only when inputs differ. That means if inputs are the same, then it outputs false (0) or whatever you call it. Then we perform OR operation between those outputs. So if the result is 0 then addr1 and addr2 is equal. > This way, -for the code to work- memory should be everytime unaligned !=0. Sorry I didn't quite get the point.
Re: Documentation/unaligned-memory-access.txt: fix incorrect comparison operator
On Sun, Dec 18, 2016 at 12:52:12AM +0200, Ozgur Karatas wrote: > 17.12.2016, 19:43, "Cihangir Akturk" : > > In the actual implementation ether_addr_equal function tests for equality > > to 0 > > when returning. It seems in commit 0d74c4 it is somehow overlooked to change > > this operator to reflect the actual function. > > why this "return" function need to be ==0? I think, u16 functions read memory > but "0" is should not be equalty. XOR is true only when inputs differ. That means if inputs are the same, then it outputs false (0) or whatever you call it. Then we perform OR operation between those outputs. So if the result is 0 then addr1 and addr2 is equal. > This way, -for the code to work- memory should be everytime unaligned !=0. Sorry I didn't quite get the point.
Re: [RFC][PATCH] make global bitlock waitqueues per-node
On 12/19/2016 03:07 PM, Linus Torvalds wrote: > +wait_queue_head_t *bit_waitqueue(void *word, int bit) > +{ > + const int __maybe_unused nid = page_to_nid(virt_to_page(word)); > + > + return __bit_waitqueue(word, bit, nid); > > No can do. Part of the problem with the old coffee was that it did that > virt_to_page() crud. That doesn't work with the virtually mapped stack. Ahhh, got it. So, what did you have in mind? Just redirect bit_waitqueue() to the "first_online_node" waitqueues? wait_queue_head_t *bit_waitqueue(void *word, int bit) { return __bit_waitqueue(word, bit, first_online_node); } We could do some fancy stuff like only do virt_to_page() for things in the linear map, but I'm not sure we'll see much of a gain for it. None of the other waitqueue users look as pathological as the 'struct page' ones. Maybe: wait_queue_head_t *bit_waitqueue(void *word, int bit) { int nid if (word >= VMALLOC_START) /* all addrs not in linear map */ nid = first_online_node; else nid = page_to_nid(virt_to_page(word)); return __bit_waitqueue(word, bit, nid); }
Re: [RFC][PATCH] make global bitlock waitqueues per-node
On 12/19/2016 03:07 PM, Linus Torvalds wrote: > +wait_queue_head_t *bit_waitqueue(void *word, int bit) > +{ > + const int __maybe_unused nid = page_to_nid(virt_to_page(word)); > + > + return __bit_waitqueue(word, bit, nid); > > No can do. Part of the problem with the old coffee was that it did that > virt_to_page() crud. That doesn't work with the virtually mapped stack. Ahhh, got it. So, what did you have in mind? Just redirect bit_waitqueue() to the "first_online_node" waitqueues? wait_queue_head_t *bit_waitqueue(void *word, int bit) { return __bit_waitqueue(word, bit, first_online_node); } We could do some fancy stuff like only do virt_to_page() for things in the linear map, but I'm not sure we'll see much of a gain for it. None of the other waitqueue users look as pathological as the 'struct page' ones. Maybe: wait_queue_head_t *bit_waitqueue(void *word, int bit) { int nid if (word >= VMALLOC_START) /* all addrs not in linear map */ nid = first_online_node; else nid = page_to_nid(virt_to_page(word)); return __bit_waitqueue(word, bit, nid); }
Re: [PATCH 1/1] x86/platform/intel/quark: add printf attribute to imr_self_test_result()
On 19/12/16 13:21, Nicolas Iooss wrote: __printf attributes help detecting issues in printf format strings at compile time. Even though imr_selftest.c is only compiled with CONFIG_DEBUG_IMR_SELFTEST, gcc complains about a missing format attribute when compiling allmodconfig with -Wmissing-format-attribute. Silent this warning by adding the attribute. Signed-off-by: Nicolas Iooss--- arch/x86/platform/intel-quark/imr_selftest.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/platform/intel-quark/imr_selftest.c b/arch/x86/platform/intel-quark/imr_selftest.c index f5bad40936ac..b8f562049cad 100644 --- a/arch/x86/platform/intel-quark/imr_selftest.c +++ b/arch/x86/platform/intel-quark/imr_selftest.c @@ -25,7 +25,8 @@ * @fmt: format string. * ... variadic argument list. */ -static void __init imr_self_test_result(int res, const char *fmt, ...) +static __printf(2, 3) +void __init imr_self_test_result(int res, const char *fmt, ...) { va_list vlist; since I wrote this code. Acked-by: Bryan O'Donoghue cheers --- bod
Re: [PATCH 1/1] x86/platform/intel/quark: add printf attribute to imr_self_test_result()
On 19/12/16 13:21, Nicolas Iooss wrote: __printf attributes help detecting issues in printf format strings at compile time. Even though imr_selftest.c is only compiled with CONFIG_DEBUG_IMR_SELFTEST, gcc complains about a missing format attribute when compiling allmodconfig with -Wmissing-format-attribute. Silent this warning by adding the attribute. Signed-off-by: Nicolas Iooss --- arch/x86/platform/intel-quark/imr_selftest.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/platform/intel-quark/imr_selftest.c b/arch/x86/platform/intel-quark/imr_selftest.c index f5bad40936ac..b8f562049cad 100644 --- a/arch/x86/platform/intel-quark/imr_selftest.c +++ b/arch/x86/platform/intel-quark/imr_selftest.c @@ -25,7 +25,8 @@ * @fmt: format string. * ... variadic argument list. */ -static void __init imr_self_test_result(int res, const char *fmt, ...) +static __printf(2, 3) +void __init imr_self_test_result(int res, const char *fmt, ...) { va_list vlist; since I wrote this code. Acked-by: Bryan O'Donoghue cheers --- bod
[PATCH v2 0/1] perf: check that objdump correctly works
Hi Arnaldo, Here is a patch which checks that objdump works without calling it first with the "-v" option. Thanks to the shell pipefail option (which returns the rightmost error status in the shell pipe) and the grep subshell which prevents "no-match" errors, we are able the use waitpid on the whole pipeline and get the appropriate return code. This is the most simple and straightforward solution I found. The downside might be that it relies on the availability of "pipefail" shell option. However, I checked that even busybox's shell implemented it. Alexis. Alexis Berlemont (1): perf annotate: check that objdump correctly works tools/perf/util/annotate.c | 69 +++--- tools/perf/util/annotate.h | 3 ++ 2 files changed, 68 insertions(+), 4 deletions(-) -- 2.11.0
[PATCH v2 1/1] perf annotate: check that objdump correctly works
Before disassembling, the tool objdump is called just to be sure: * objdump is available in the path; * objdump is an executable binary; * objdump has no dependency issue or anything else. This objdump "pre-"command is only necessary because the real objdump command is followed by some " | grep ..."; this prevents the shell from returning the exit code of objdump execution. Signed-off-by: Alexis Berlemont--- tools/perf/util/annotate.c | 69 +++--- tools/perf/util/annotate.h | 3 ++ 2 files changed, 68 insertions(+), 4 deletions(-) diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c index ea7e0de4b9c1..6fbaabbc9d2a 100644 --- a/tools/perf/util/annotate.c +++ b/tools/perf/util/annotate.c @@ -20,9 +20,12 @@ #include "block-range.h" #include "arch/common.h" #include +#include #include #include #include +#include +#include const char *disassembler_style; const char *objdump_path; @@ -1286,6 +1289,21 @@ int symbol__strerror_disassemble(struct symbol *sym __maybe_unused, struct map * " --vmlinux vmlinux\n", build_id_msg ?: ""); } break; + + case SYMBOL_ANNOTATE_ERRNO__NO_OBJDUMP: + scnprintf(buf, buflen, "No objdump tool available in $PATH\n"); + break; + + case SYMBOL_ANNOTATE_ERRNO__NO_EXEC_OBJDUMP: + scnprintf(buf, buflen, + "The objdump tool found in $PATH cannot be executed\n"); + break; + + case SYMBOL_ANNOTATE_ERRNO__NO_OUTPUT: + scnprintf(buf, buflen, + "The objdump tool returned no disassembled code\n"); + break; + default: scnprintf(buf, buflen, "Internal error: Invalid %d error code\n", errnum); break; @@ -1329,6 +1347,44 @@ static int dso__disassemble_filename(struct dso *dso, char *filename, size_t fil return 0; } +static int annotate__check_exec(int pid, const char *command) +{ + int wstatus, err; + + err = waitpid(pid, , 0); + if (err < 0) { + pr_err("Failure calling waitpid: %s: (%s)\n", + strerror(errno), command); + return -1; + } + + pr_debug("%s: %d %d\n", command, pid, WEXITSTATUS(wstatus)); + + switch (WEXITSTATUS(wstatus)) { + case 0: + /* Success */ + err = 0; + break; + case 127: + /* The shell did not find objdump in the path */ + err = SYMBOL_ANNOTATE_ERRNO__NO_OBJDUMP; + break; + default: + /* +* In the default case, we consider that objdump +* cannot be executed; so it gathers many fault +* scenarii: +* - objdump is not an executable (126); +* - objdump has some dependency issue; +* - ... +*/ + err = SYMBOL_ANNOTATE_ERRNO__NO_EXEC_OBJDUMP; + break; + } + + return err; +} + static const char *annotate__norm_arch(const char *arch_name) { struct utsname uts; @@ -1426,7 +1482,8 @@ int symbol__disassemble(struct symbol *sym, struct map *map, const char *arch_na snprintf(command, sizeof(command), "%s %s%s --start-address=0x%016" PRIx64 " --stop-address=0x%016" PRIx64 -" -l -d %s %s -C %s 2>/dev/null|grep -v %s|expand", +" -l -d %s %s -C %s 2>/dev/null | " +"{ grep -v %s || true; } | expand", objdump_path ? objdump_path : "objdump", disassembler_style ? "-M " : "", disassembler_style ? disassembler_style : "", @@ -1454,7 +1511,7 @@ int symbol__disassemble(struct symbol *sym, struct map *map, const char *arch_na close(stdout_fd[0]); dup2(stdout_fd[1], 1); close(stdout_fd[1]); - execl("/bin/sh", "sh", "-c", command, NULL); + execl("/bin/sh", "sh", "-o", "pipefail", "-c", command, NULL); perror(command); exit(-1); } @@ -1479,8 +1536,12 @@ int symbol__disassemble(struct symbol *sym, struct map *map, const char *arch_na nline++; } - if (nline == 0) + err = annotate__check_exec(pid, command); + + if (err == 0 && nline == 0) { pr_err("No output from %s\n", command); + err = SYMBOL_ANNOTATE_ERRNO__NO_OUTPUT; + } /* * kallsyms does not have symbol sizes so there may a nop at the end. @@ -1490,7 +1551,7 @@ int symbol__disassemble(struct symbol *sym, struct map *map, const char *arch_na delete_last_nop(sym); fclose(file); - err = 0; + out_remove_tmp: close(stdout_fd[0]); diff --git
[PATCH v2 0/1] perf: check that objdump correctly works
Hi Arnaldo, Here is a patch which checks that objdump works without calling it first with the "-v" option. Thanks to the shell pipefail option (which returns the rightmost error status in the shell pipe) and the grep subshell which prevents "no-match" errors, we are able the use waitpid on the whole pipeline and get the appropriate return code. This is the most simple and straightforward solution I found. The downside might be that it relies on the availability of "pipefail" shell option. However, I checked that even busybox's shell implemented it. Alexis. Alexis Berlemont (1): perf annotate: check that objdump correctly works tools/perf/util/annotate.c | 69 +++--- tools/perf/util/annotate.h | 3 ++ 2 files changed, 68 insertions(+), 4 deletions(-) -- 2.11.0
[PATCH v2 1/1] perf annotate: check that objdump correctly works
Before disassembling, the tool objdump is called just to be sure: * objdump is available in the path; * objdump is an executable binary; * objdump has no dependency issue or anything else. This objdump "pre-"command is only necessary because the real objdump command is followed by some " | grep ..."; this prevents the shell from returning the exit code of objdump execution. Signed-off-by: Alexis Berlemont --- tools/perf/util/annotate.c | 69 +++--- tools/perf/util/annotate.h | 3 ++ 2 files changed, 68 insertions(+), 4 deletions(-) diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c index ea7e0de4b9c1..6fbaabbc9d2a 100644 --- a/tools/perf/util/annotate.c +++ b/tools/perf/util/annotate.c @@ -20,9 +20,12 @@ #include "block-range.h" #include "arch/common.h" #include +#include #include #include #include +#include +#include const char *disassembler_style; const char *objdump_path; @@ -1286,6 +1289,21 @@ int symbol__strerror_disassemble(struct symbol *sym __maybe_unused, struct map * " --vmlinux vmlinux\n", build_id_msg ?: ""); } break; + + case SYMBOL_ANNOTATE_ERRNO__NO_OBJDUMP: + scnprintf(buf, buflen, "No objdump tool available in $PATH\n"); + break; + + case SYMBOL_ANNOTATE_ERRNO__NO_EXEC_OBJDUMP: + scnprintf(buf, buflen, + "The objdump tool found in $PATH cannot be executed\n"); + break; + + case SYMBOL_ANNOTATE_ERRNO__NO_OUTPUT: + scnprintf(buf, buflen, + "The objdump tool returned no disassembled code\n"); + break; + default: scnprintf(buf, buflen, "Internal error: Invalid %d error code\n", errnum); break; @@ -1329,6 +1347,44 @@ static int dso__disassemble_filename(struct dso *dso, char *filename, size_t fil return 0; } +static int annotate__check_exec(int pid, const char *command) +{ + int wstatus, err; + + err = waitpid(pid, , 0); + if (err < 0) { + pr_err("Failure calling waitpid: %s: (%s)\n", + strerror(errno), command); + return -1; + } + + pr_debug("%s: %d %d\n", command, pid, WEXITSTATUS(wstatus)); + + switch (WEXITSTATUS(wstatus)) { + case 0: + /* Success */ + err = 0; + break; + case 127: + /* The shell did not find objdump in the path */ + err = SYMBOL_ANNOTATE_ERRNO__NO_OBJDUMP; + break; + default: + /* +* In the default case, we consider that objdump +* cannot be executed; so it gathers many fault +* scenarii: +* - objdump is not an executable (126); +* - objdump has some dependency issue; +* - ... +*/ + err = SYMBOL_ANNOTATE_ERRNO__NO_EXEC_OBJDUMP; + break; + } + + return err; +} + static const char *annotate__norm_arch(const char *arch_name) { struct utsname uts; @@ -1426,7 +1482,8 @@ int symbol__disassemble(struct symbol *sym, struct map *map, const char *arch_na snprintf(command, sizeof(command), "%s %s%s --start-address=0x%016" PRIx64 " --stop-address=0x%016" PRIx64 -" -l -d %s %s -C %s 2>/dev/null|grep -v %s|expand", +" -l -d %s %s -C %s 2>/dev/null | " +"{ grep -v %s || true; } | expand", objdump_path ? objdump_path : "objdump", disassembler_style ? "-M " : "", disassembler_style ? disassembler_style : "", @@ -1454,7 +1511,7 @@ int symbol__disassemble(struct symbol *sym, struct map *map, const char *arch_na close(stdout_fd[0]); dup2(stdout_fd[1], 1); close(stdout_fd[1]); - execl("/bin/sh", "sh", "-c", command, NULL); + execl("/bin/sh", "sh", "-o", "pipefail", "-c", command, NULL); perror(command); exit(-1); } @@ -1479,8 +1536,12 @@ int symbol__disassemble(struct symbol *sym, struct map *map, const char *arch_na nline++; } - if (nline == 0) + err = annotate__check_exec(pid, command); + + if (err == 0 && nline == 0) { pr_err("No output from %s\n", command); + err = SYMBOL_ANNOTATE_ERRNO__NO_OUTPUT; + } /* * kallsyms does not have symbol sizes so there may a nop at the end. @@ -1490,7 +1551,7 @@ int symbol__disassemble(struct symbol *sym, struct map *map, const char *arch_na delete_last_nop(sym); fclose(file); - err = 0; + out_remove_tmp: close(stdout_fd[0]); diff --git a/tools/perf/util/annotate.h
Re: [PATCH v2] vfs: fix isize/pos/len checks for reflink & dedupe
On Mon, Dec 19, 2016 at 03:13:26PM -0800, Darrick J. Wong wrote: > Strengthen the checking of pos/len vs. i_size, clarify the return values > for the clone prep function, and remove pointless code. Applied.
Re: [PATCH v2] vfs: fix isize/pos/len checks for reflink & dedupe
On Mon, Dec 19, 2016 at 03:13:26PM -0800, Darrick J. Wong wrote: > Strengthen the checking of pos/len vs. i_size, clarify the return values > for the clone prep function, and remove pointless code. Applied.
Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
Pavel Machek: [...] > Considering the memory barriers... is something like this neccessary > in the via-rhine ? Yes. > AFAICT... we need a barrier after making sure that descriptor is no > longer owned by DMA (to make sure we don't get stale data in rest of > descriptor)... and we need a barrier before giving the descriptor to > the dma, to make sure DMA engine sees the complete update? I would not expect stale data while processing a single transmit descriptor as the transmit completion does not use the rest of the descriptor at all in the via-rhine driver. However I agree that transmit descriptors should be read by the cpu with adequate ordering so the dma_rmb() should stay. Same kind of narrative for dma_wmb rhine_rx (s/read/written/ and s/cpu/device/). > diff --git a/drivers/net/ethernet/via/via-rhine.c > b/drivers/net/ethernet/via/via-rhine.c > index ba5c542..3806e72 100644 > --- a/drivers/net/ethernet/via/via-rhine.c > +++ b/drivers/net/ethernet/via/via-rhine.c [...] > @@ -2061,6 +2062,7 @@ static int rhine_rx(struct net_device *dev, int limit) > > if (desc_status & DescOwn) > break; > + dma_rmb(); > I agree with your explanation for this one (late vlan processing in a different word from the same descriptor). -- Ueimor
Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
Pavel Machek : [...] > Considering the memory barriers... is something like this neccessary > in the via-rhine ? Yes. > AFAICT... we need a barrier after making sure that descriptor is no > longer owned by DMA (to make sure we don't get stale data in rest of > descriptor)... and we need a barrier before giving the descriptor to > the dma, to make sure DMA engine sees the complete update? I would not expect stale data while processing a single transmit descriptor as the transmit completion does not use the rest of the descriptor at all in the via-rhine driver. However I agree that transmit descriptors should be read by the cpu with adequate ordering so the dma_rmb() should stay. Same kind of narrative for dma_wmb rhine_rx (s/read/written/ and s/cpu/device/). > diff --git a/drivers/net/ethernet/via/via-rhine.c > b/drivers/net/ethernet/via/via-rhine.c > index ba5c542..3806e72 100644 > --- a/drivers/net/ethernet/via/via-rhine.c > +++ b/drivers/net/ethernet/via/via-rhine.c [...] > @@ -2061,6 +2062,7 @@ static int rhine_rx(struct net_device *dev, int limit) > > if (desc_status & DescOwn) > break; > + dma_rmb(); > I agree with your explanation for this one (late vlan processing in a different word from the same descriptor). -- Ueimor
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov >wrote: > > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: > >> Hi all- > >> > >> I apologize for being rather late with this. I didn't realize that > >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see > >> it on the linux-api list, so I missed the discussion. > >> > >> I think that the inet ingress, egress etc filters are a neat feature, > >> but I think the API has some issues that will bite us down the road > >> if it becomes stable in its current form. > >> > >> Most of the problems I see are summarized in this transcript: > >> > >> # mkdir cg2 > >> # mount -t cgroup2 none cg2 > >> # mkdir cg2/nosockets > >> # strace cgrp_socket_rule cg2/nosockets/ 0 > >> ... > >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 > >> > >> You can modify a cgroup after opening it O_RDONLY? > >> > >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, > >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, > >> log_buf=0x6020c0, kern_version=0}, 48) = 4 > >> > >> This is fine. The bpf() syscall manipulates bpf objects. > >> > >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 > >> > >> This is not so good: > >> > >> a) The bpf() syscall is supposed to manipulate bpf objects. This > >> is manipulating a cgroup. There's no reason that a socket creation > >> filter couldn't be written in a different language (new iptables > >> table? Simple list of address families?), but if that happened, > >> then using bpf() to install it would be entirely nonsensical. > > > > I don't see why it's _modifing_ the cgroup. I'm looking at it as > > network stack is using cgroup as an application group that should > > invoke bpf program at the certain point in the stack. > > imo cgroup management is orthogonal. > > It is literally modifying the struct cgroup, and, as a practical > matter, it's causing membership in the cgroup to have a certain > effect. But rather than pointless arguing, let me propose an > alternative API that I think solves most of the problems here. > > In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely. > Instead, the cgroup gets three new control files: > "net.ingress_filter", "net.egress_filter", and > "net.socket_create_filter". Initially, if you read these files, you > see "none\n". > > To attach a bpf filter, you open the file for write and do an ioctl on > it. After doing the ioctl, if you read the file, you'll see > "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for > the bpf program. > > To detach any type of filter, bpf or otherwise, you open the file for > write and write "none\n" (or just "none"). > > If you write anything else to the file, you get -EINVAL. But, if > someone writes a new type of filter (perhaps a simple list of address > families), maybe you can enable the filter by writing something > appropriate to the file. I see no difference in what you're proposing vs what is already implemented from feature set point of view, but the file approach is very ugly, since it's a mismatch to FD style access that bpf is using everywhere. In your proposal you'd also need to add bpf prefix everywhere. So the control file names should be bpf_inet_ingress, bpf_inet_egress and bpf_socket_create. If you want to prepare such patch for them, I don't mind, but you cannot kill syscall command, since it's more flexible and your control-file approach _will_ be obsolete pretty quickly. > Now the API matches the effect. A cgroup with something other than > "none" in one of its net.*_filter files is a cgroup that filters > network activity. And you get CRIU compatibility for free: CRIU can > read the control file and use whatever mechanism is uses for BPF in > general to restore the cgroup filter state. As an added bonus, you > get ACLs for free and the ugly multiplexer goes away. extended bpf is not supported by criu. only classic, so having control_file-style attachment doesn't buy us anything. > >> # mkdir cg2/nosockets/sockets > >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule > >> cg2/nosockets/sockets/ 1 > >> > >> This succeeded, which means that, if this feature is enabled in 4.10, > >> then we're stuck with its semantics. If it returned -EINVAL instead, > >> there would be a chance to refine it. > > > > i don't see the problem with this behavior. bpf and cgroup are indepedent. > > Imange that socket accounting program is attached to cg2/nosockets. > > The program is readonly and carry no security meaning. > > Why cgroup should prevent creation of cg2/nosockets/foo directory ? > > I think you're misunderstanding me. What I'm saying is that, if you > allow a cgroup and one of its descendents to both enable the same type > of filter, you have just committed to
Re: Potential issues (security and otherwise) with the current cgroup-bpf API
On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov > wrote: > > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: > >> Hi all- > >> > >> I apologize for being rather late with this. I didn't realize that > >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see > >> it on the linux-api list, so I missed the discussion. > >> > >> I think that the inet ingress, egress etc filters are a neat feature, > >> but I think the API has some issues that will bite us down the road > >> if it becomes stable in its current form. > >> > >> Most of the problems I see are summarized in this transcript: > >> > >> # mkdir cg2 > >> # mount -t cgroup2 none cg2 > >> # mkdir cg2/nosockets > >> # strace cgrp_socket_rule cg2/nosockets/ 0 > >> ... > >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 > >> > >> You can modify a cgroup after opening it O_RDONLY? > >> > >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, > >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, > >> log_buf=0x6020c0, kern_version=0}, 48) = 4 > >> > >> This is fine. The bpf() syscall manipulates bpf objects. > >> > >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 > >> > >> This is not so good: > >> > >> a) The bpf() syscall is supposed to manipulate bpf objects. This > >> is manipulating a cgroup. There's no reason that a socket creation > >> filter couldn't be written in a different language (new iptables > >> table? Simple list of address families?), but if that happened, > >> then using bpf() to install it would be entirely nonsensical. > > > > I don't see why it's _modifing_ the cgroup. I'm looking at it as > > network stack is using cgroup as an application group that should > > invoke bpf program at the certain point in the stack. > > imo cgroup management is orthogonal. > > It is literally modifying the struct cgroup, and, as a practical > matter, it's causing membership in the cgroup to have a certain > effect. But rather than pointless arguing, let me propose an > alternative API that I think solves most of the problems here. > > In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely. > Instead, the cgroup gets three new control files: > "net.ingress_filter", "net.egress_filter", and > "net.socket_create_filter". Initially, if you read these files, you > see "none\n". > > To attach a bpf filter, you open the file for write and do an ioctl on > it. After doing the ioctl, if you read the file, you'll see > "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for > the bpf program. > > To detach any type of filter, bpf or otherwise, you open the file for > write and write "none\n" (or just "none"). > > If you write anything else to the file, you get -EINVAL. But, if > someone writes a new type of filter (perhaps a simple list of address > families), maybe you can enable the filter by writing something > appropriate to the file. I see no difference in what you're proposing vs what is already implemented from feature set point of view, but the file approach is very ugly, since it's a mismatch to FD style access that bpf is using everywhere. In your proposal you'd also need to add bpf prefix everywhere. So the control file names should be bpf_inet_ingress, bpf_inet_egress and bpf_socket_create. If you want to prepare such patch for them, I don't mind, but you cannot kill syscall command, since it's more flexible and your control-file approach _will_ be obsolete pretty quickly. > Now the API matches the effect. A cgroup with something other than > "none" in one of its net.*_filter files is a cgroup that filters > network activity. And you get CRIU compatibility for free: CRIU can > read the control file and use whatever mechanism is uses for BPF in > general to restore the cgroup filter state. As an added bonus, you > get ACLs for free and the ugly multiplexer goes away. extended bpf is not supported by criu. only classic, so having control_file-style attachment doesn't buy us anything. > >> # mkdir cg2/nosockets/sockets > >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule > >> cg2/nosockets/sockets/ 1 > >> > >> This succeeded, which means that, if this feature is enabled in 4.10, > >> then we're stuck with its semantics. If it returned -EINVAL instead, > >> there would be a chance to refine it. > > > > i don't see the problem with this behavior. bpf and cgroup are indepedent. > > Imange that socket accounting program is attached to cg2/nosockets. > > The program is readonly and carry no security meaning. > > Why cgroup should prevent creation of cg2/nosockets/foo directory ? > > I think you're misunderstanding me. What I'm saying is that, if you > allow a cgroup and one of its descendents to both enable the same type > of filter, you have just committed to some particular semantics for
Re: [PATCH] Documentation/unaligned-memory-access.txt: fix incorrect comparison operator
On Mon, Dec 19, 2016 at 04:13:13PM -0700, Jonathan Corbet wrote: > On Mon, 19 Dec 2016 23:53:40 +0200 > Cihangir Akturkwrote: > > > In the actual implementation ether_addr_equal function tests for equality > > to 0 > > when returning. It seems in commit 0d74c4 it is somehow overlooked to change > > this operator to reflect the actual function. I realized that I generated the patch with the -k flag to git format-patch. And think that it'd be better to resend it without this flag. Besides that nothing is changed in the patch itself. Sorry for the noise. > > I received this patch two days ago; has something changed that you're > sending it again? > > Meanwhile, there was a question from Ozgur Karatas on the patch, but I've > not yet seen your response. Ok, I'll try to answer his question. > > Thanks, > > jon Cihangir
Re: [PATCH] Documentation/unaligned-memory-access.txt: fix incorrect comparison operator
On Mon, Dec 19, 2016 at 04:13:13PM -0700, Jonathan Corbet wrote: > On Mon, 19 Dec 2016 23:53:40 +0200 > Cihangir Akturk wrote: > > > In the actual implementation ether_addr_equal function tests for equality > > to 0 > > when returning. It seems in commit 0d74c4 it is somehow overlooked to change > > this operator to reflect the actual function. I realized that I generated the patch with the -k flag to git format-patch. And think that it'd be better to resend it without this flag. Besides that nothing is changed in the patch itself. Sorry for the noise. > > I received this patch two days ago; has something changed that you're > sending it again? > > Meanwhile, there was a question from Ozgur Karatas on the patch, but I've > not yet seen your response. Ok, I'll try to answer his question. > > Thanks, > > jon Cihangir
Re: [PATCH] sched/fair: fix calc_cfs_shares fixed point arithmetics
Paul Turner, on Mon 19 Dec 2016 15:32:15 -0800, wrote: > On Mon, Dec 19, 2016 at 3:29 PM, Samuel Thibault >wrote: > > Paul Turner, on Mon 19 Dec 2016 15:26:19 -0800, wrote: > >> >> > - if (shares < MIN_SHARES) > >> >> > - shares = MIN_SHARES; > >> > ... > >> >> > return shares; > >> > > >> > This will only make sure that the returned shares is 2, not 2048. > >> > >> This is intentional. The MIN_SHARES you are seeing here is overloaded. > >> Every "1" unit of share is "SCHED_LOAD_RESOLUTION" bits internally. > > > > I'm not talking about the SCHED_LOAD_RESOLUTION scaling, but about the > > SCHED_FIXEDPOINT_SHIFT scaling, which is what > > 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit > > kernels") > > modified on 64bit platforms. > > From that commit: > > """ > -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power > usage under light load */ > +#ifdef CONFIG_64BIT > # define SCHED_LOAD_RESOLUTION 10 > # define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION) > # define scale_load_down(w)((w) >> SCHED_LOAD_RESOLUTION) Errgl, sorry, I was referring to the old naming. This stuff has seen so much patching over and over in the past revisions... It though you were referring to SCHED_CAPACITY_SCALE. The code I was reading now uses SCHED_LOAD_RESOLUTION, so that's why I read your "SCHED_LOAD_RESOLUTION" as "the other scaling". > The MIN_SHARES you are seeing here is overloaded. > In the unscaled case this needs to be MIN_SHARES, and in the scaled > case, the subdivision of the scaled values must still be >=2. Ok, now I understand. I have to say this overloading is confusing. Samuel
Re: [PATCH] sched/fair: fix calc_cfs_shares fixed point arithmetics
Paul Turner, on Mon 19 Dec 2016 15:32:15 -0800, wrote: > On Mon, Dec 19, 2016 at 3:29 PM, Samuel Thibault > wrote: > > Paul Turner, on Mon 19 Dec 2016 15:26:19 -0800, wrote: > >> >> > - if (shares < MIN_SHARES) > >> >> > - shares = MIN_SHARES; > >> > ... > >> >> > return shares; > >> > > >> > This will only make sure that the returned shares is 2, not 2048. > >> > >> This is intentional. The MIN_SHARES you are seeing here is overloaded. > >> Every "1" unit of share is "SCHED_LOAD_RESOLUTION" bits internally. > > > > I'm not talking about the SCHED_LOAD_RESOLUTION scaling, but about the > > SCHED_FIXEDPOINT_SHIFT scaling, which is what > > 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit > > kernels") > > modified on 64bit platforms. > > From that commit: > > """ > -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power > usage under light load */ > +#ifdef CONFIG_64BIT > # define SCHED_LOAD_RESOLUTION 10 > # define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION) > # define scale_load_down(w)((w) >> SCHED_LOAD_RESOLUTION) Errgl, sorry, I was referring to the old naming. This stuff has seen so much patching over and over in the past revisions... It though you were referring to SCHED_CAPACITY_SCALE. The code I was reading now uses SCHED_LOAD_RESOLUTION, so that's why I read your "SCHED_LOAD_RESOLUTION" as "the other scaling". > The MIN_SHARES you are seeing here is overloaded. > In the unscaled case this needs to be MIN_SHARES, and in the scaled > case, the subdivision of the scaled values must still be >=2. Ok, now I understand. I have to say this overloading is confusing. Samuel