Re: [PATCH v8 05/10] powerpc/perf: IMC pmu cpumask and cpuhotplug support
Stewart Smith writes: > Madhavan Srinivasan writes: >>> * in patch 9 should opal_imc_counters_init return something other >>>than OPAL_SUCCESS in the case of invalid arguments? Maybe >>>OPAL_PARAMETER? (I think you fix this in a later patch anyway?) >> >> So, the init call will return OPAL_PARAMETER for the unsupported >> domains (core and nest are supported). And if the init operation >> fails for any reason, it would return OPAL_HARDWARE. And this is >> documented. > > (I'll comment on the skiboot one too), but I think that if the class > exists but init is a no-op, then OPAL_IMC_COUNTERS_INIT should return > OPAL_SUCCESS and just do nothing. This future-proofs everything, and the > API is that one *must* call _INIT before start. Yes, 100%. That's what I described in my replies to a previous version; if it doesn't do that we need to fix it. cheers
[PATCH v2 3/3] sched/deadline: Add statistics to track runtime underruns
Add accounting to track cases where the runtime doesn't run out, and export the information in "/proc/<pid>/sched". Specifically, the patch adds three members "nr_underrun_sched", "nr_underrun_block", and "nr_underrun_yield" to sched_dl_entity: -@nr_underrun_sched hints at some scheduling issue. -@nr_underrun_block hints at some blocking reason, e.g. a long sleep. -@nr_underrun_yield hints at a yield reason. This is helpful to spot/debug deadline issues. For example, I launched three 50% dl tasks on my dual-core machine, plus several buggy constrained dl tasks that Daniel is trying to address in "sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks", and then observed one 50% deadline task's proc sched output: $ cat /proc/3389/sched |grep underrun dl.nr_underrun_sched:981 dl.nr_underrun_block: 0 dl.nr_underrun_yield: 0 A very large "dl.nr_underrun_sched" hints that there is very likely some underlying scheduling issue. Note that we don't gate this behind CONFIG_SCHED_DEBUG, as the added accounting has little overhead (and also happens infrequently). Suggested-by: Steven Rostedt Signed-off-by: Xunlei Pang --- include/linux/sched.h | 10 ++ kernel/sched/core.c | 3 +++ kernel/sched/deadline.c | 12 +--- kernel/sched/debug.c | 3 +++ 4 files changed, 25 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index ba080e5..e17928f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -455,6 +455,16 @@ struct sched_dl_entity { * own bandwidth to be enforced, thus we need one timer per task. */ struct hrtimer dl_timer; + + /* +* Accounting for periods that run less than @dl_runtime: +* @nr_underrun_sched hints some scheduling issue. +* @nr_underrun_block hints some block reason. E.g. long sleep. +* @nr_underrun_yield hints the yield reason. 
+*/ + u64 nr_underrun_sched; + u64 nr_underrun_block; + u64 nr_underrun_yield; }; union rcu_special { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index bccd819..6214ada 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4004,6 +4004,9 @@ static struct task_struct *find_process_by_pid(pid_t pid) dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline; dl_se->flags = attr->sched_flags; dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime); + dl_se->nr_underrun_sched = 0; + dl_se->nr_underrun_block = 0; + dl_se->nr_underrun_yield = 0; /* * Changing the parameters of a task is 'tricky' and we're not doing diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 5691149..a7ddc03 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -394,8 +394,10 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se, dl_se->runtime = pi_se->dl_runtime; } - if (dl_se->dl_yielded && dl_se->runtime > 0) + if (dl_se->dl_yielded && dl_se->runtime > 0) { dl_se->runtime = 0; + ++dl_se->nr_underrun_yield; + } /* * We keep moving the deadline away until we get some @@ -723,8 +725,10 @@ static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se) if (unlikely(dl_se->dl_boosted || !start_dl_timer(p))) return; dl_se->dl_throttled = 1; - if (dl_se->runtime > 0) + if (dl_se->runtime > 0) { dl_se->runtime = 0; + ++dl_se->nr_underrun_block; + } } } @@ -733,8 +737,10 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se) { bool dmiss = dl_time_before(dl_se->deadline, rq_clock(rq)); - if (dmiss && dl_se->runtime > 0) + if (dmiss && dl_se->runtime > 0) { dl_se->runtime = 0; + ++dl_se->nr_underrun_sched; + } return (dl_se->runtime <= 0); } diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 38f0193..904b43f 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -957,6 +957,9 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m) if (p->policy == SCHED_DEADLINE) { 
P(dl.runtime); P(dl.deadline); + P(dl.nr_underrun_sched); + P(dl.nr_underrun_block); + P(dl.nr_underrun_yield); } #undef PN_SCHEDSTAT #undef PN -- 1.8.3.1
[PATCH v2 2/3] sched/deadline: Throttle the task when missing its deadline
dl_runtime_exceeded() only checks for negative runtime; actually, when the current deadline has passed, we should start a new period and zero out the remaining runtime as well. This patch improves dl_runtime_exceeded() to achieve that. Fixes: 269ad8015a6b ("sched/deadline: Avoid double-accounting in case of missed deadlines") Cc: Luca Abeni Signed-off-by: Xunlei Pang --- kernel/sched/deadline.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index d3d291e..5691149 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -729,8 +729,13 @@ static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se) } static -int dl_runtime_exceeded(struct sched_dl_entity *dl_se) +int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se) { + bool dmiss = dl_time_before(dl_se->deadline, rq_clock(rq)); + + if (dmiss && dl_se->runtime > 0) + dl_se->runtime = 0; + return (dl_se->runtime <= 0); } @@ -781,7 +786,7 @@ static void update_curr_dl(struct rq *rq) dl_se->runtime -= delta_exec; throttle: - if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) { + if (dl_runtime_exceeded(rq, dl_se) || dl_se->dl_yielded) { dl_se->dl_throttled = 1; __dequeue_task_dl(rq, curr, 0); if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr))) -- 1.8.3.1
[PATCH v2 1/3] sched/deadline: Zero out positive runtime after throttling constrained tasks
When a constrained task is throttled by dl_check_constrained_dl(), it may carry remaining positive runtime; as a result, when dl_task_timer() fires and calls replenish_dl_entity(), it will not be replenished correctly due to the positive dl_se->runtime. This patch sets its runtime to 0 if it is positive after throttling. Fixes: df8eac8cafce ("sched/deadline: Throttle a constrained deadline task activated after the deadline") Acked-by: Daniel Bristot de Oliveira Signed-off-by: Xunlei Pang --- kernel/sched/deadline.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index a2ce590..d3d291e 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -723,6 +723,8 @@ static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se) if (unlikely(dl_se->dl_boosted || !start_dl_timer(p))) return; dl_se->dl_throttled = 1; + if (dl_se->runtime > 0) + dl_se->runtime = 0; } } -- 1.8.3.1
Re: [PATCH 0/2] net: Set maximum receive packet size on veth interfaces
On Tue, May 9, 2017 at 5:42 PM, Fredrik Markström wrote: > > Maybe I was unclear, the veth implementation drops all packets larger than the > configured MTU (on the receiving interface). > Most ethernet drivers accept packets up to the ethernet MTU no matter the > configured MTU. As far as I can tell from the RFCs that is ok. This is because the IP layer does the fragmentation for you. But some drivers, for example tg3, drop packets larger than their dev->mtu very early too. > > A simpler solution would be to only drop packets larger than the ethernet MTU > (like > most network drivers), but I guess that will break existing stuff > already out there. I wonder why we introduced that MTU check for veth when the IP layer could either fragment or reject with ICMP?
[PATCH v3 2/3] arm64: dts: mt8173: Fix mdp device tree
From: Daniel Kurtz If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with old device trees, so a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai --- arch/arm64/boot/dts/mediatek/mt8173.dtsi | 126 +++ 1 file changed, 60 insertions(+), 66 deletions(-) diff --git a/arch/arm64/boot/dts/mediatek/mt8173.dtsi b/arch/arm64/boot/dts/mediatek/mt8173.dtsi index 6922252..d28a363 100644 --- a/arch/arm64/boot/dts/mediatek/mt8173.dtsi +++ b/arch/arm64/boot/dts/mediatek/mt8173.dtsi @@ -792,80 +792,74 @@ #clock-cells = <1>; }; - mdp { - compatible = "mediatek,mt8173-mdp"; - #address-cells = <2>; - #size-cells = <2>; - ranges; + mdp_rdma0: rdma@14001000 { + compatible = "mediatek,mt8173-mdp-rdma", +"mediatek,mt8173-mdp"; + reg = <0 0x14001000 0 0x1000>; + clocks = < CLK_MM_MDP_RDMA0>, +< CLK_MM_MUTEX_32K>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + iommus = < M4U_PORT_MDP_RDMA0>; + mediatek,larb = <>; mediatek,vpu = <>; + }; - mdp_rdma0: rdma@14001000 { - compatible = "mediatek,mt8173-mdp-rdma"; - reg = <0 0x14001000 0 0x1000>; - clocks = < CLK_MM_MDP_RDMA0>, -< CLK_MM_MUTEX_32K>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - iommus = < M4U_PORT_MDP_RDMA0>; - mediatek,larb = <>; - }; - - mdp_rdma1: rdma@14002000 { - compatible = "mediatek,mt8173-mdp-rdma"; - reg = <0 0x14002000 0 0x1000>; - clocks = < CLK_MM_MDP_RDMA1>, -< CLK_MM_MUTEX_32K>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - iommus = < M4U_PORT_MDP_RDMA1>; - mediatek,larb = <>; - }; + mdp_rdma1: rdma@14002000 { + compatible = "mediatek,mt8173-mdp-rdma"; + reg = <0 0x14002000 0 
0x1000>; + clocks = < CLK_MM_MDP_RDMA1>, +< CLK_MM_MUTEX_32K>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + iommus = < M4U_PORT_MDP_RDMA1>; + mediatek,larb = <>; + }; - mdp_rsz0: rsz@14003000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14003000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ0>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz0: rsz@14003000 { + compatible = "mediatek,mt8173-mdp-rsz"; + reg = <0 0x14003000 0 0x1000>; + clocks = < CLK_MM_MDP_RSZ0>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + }; - mdp_rsz1: rsz@14004000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14004000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ1>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz1: rsz@14004000 { + compatible = "mediatek,mt8173-mdp-rsz"; + reg = <0 0x14004000 0 0x1000>; + clocks = < CLK_MM_MDP_RSZ1>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + }; - mdp_rsz2: rsz@14005000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14005000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ2>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz2:
[PATCH v3 0/3] Fix mdp device tree
Changes in v3: - Upload patches again because I forgot to add v2 to the titles Changes in v2: - Update commit message If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with old device trees, so a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Daniel Kurtz (2): arm64: dts: mt8173: Fix mdp device tree media: mtk-mdp: Fix mdp device tree Minghsiu Tsai (1): dt-bindings: mt8173: Fix mdp device tree .../devicetree/bindings/media/mediatek-mdp.txt | 12 +- arch/arm64/boot/dts/mediatek/mt8173.dtsi | 126 ++--- drivers/media/platform/mtk-mdp/mtk_mdp_core.c | 2 +- 3 files changed, 64 insertions(+), 76 deletions(-) -- 1.9.1
[PATCH v3 3/3] media: mtk-mdp: Fix mdp device tree
From: Daniel Kurtz If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with old device trees, so a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai --- drivers/media/platform/mtk-mdp/mtk_mdp_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c index 9e4eb7d..a5ad586 100644 --- a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c +++ b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c @@ -118,7 +118,7 @@ static int mtk_mdp_probe(struct platform_device *pdev) mutex_init(>vpulock); /* Iterate over sibling MDP function blocks */ - for_each_child_of_node(dev->of_node, node) { + for_each_child_of_node(dev->of_node->parent, node) { const struct of_device_id *of_id; enum mtk_mdp_comp_type comp_type; int comp_id; -- 1.9.1
[PATCH v3 1/3] dt-bindings: mt8173: Fix mdp device tree
If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with old device trees, so a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Minghsiu Tsai --- Documentation/devicetree/bindings/media/mediatek-mdp.txt | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/Documentation/devicetree/bindings/media/mediatek-mdp.txt b/Documentation/devicetree/bindings/media/mediatek-mdp.txt index 4182063..0d03e3a 100644 --- a/Documentation/devicetree/bindings/media/mediatek-mdp.txt +++ b/Documentation/devicetree/bindings/media/mediatek-mdp.txt @@ -2,7 +2,7 @@ Media Data Path is used for scaling and color space conversion. -Required properties (controller (parent) node): +Required properties (controller node): - compatible: "mediatek,mt8173-mdp" - mediatek,vpu: the node of video processor unit, see Documentation/devicetree/bindings/media/mediatek-vpu.txt for details. @@ -32,21 +32,16 @@ Required properties (DMA function blocks, child node): for details. Example: -mdp { - compatible = "mediatek,mt8173-mdp"; - #address-cells = <2>; - #size-cells = <2>; - ranges; - mediatek,vpu = <>; - mdp_rdma0: rdma@14001000 { compatible = "mediatek,mt8173-mdp-rdma"; +"mediatek,mt8173-mdp"; reg = <0 0x14001000 0 0x1000>; clocks = < CLK_MM_MDP_RDMA0>, < CLK_MM_MUTEX_32K>; power-domains = < MT8173_POWER_DOMAIN_MM>; iommus = < M4U_PORT_MDP_RDMA0>; mediatek,larb = <>; + mediatek,vpu = <>; }; mdp_rdma1: rdma@14002000 { @@ -106,4 +101,3 @@ mdp { iommus = < M4U_PORT_MDP_WROT1>; mediatek,larb = <>; }; -}; -- 1.9.1
Re: [PATCH] PCI: Make SR-IOV capable GPU working on the SR-IOV incapable platform
On Fri, 12 May 2017 02:50:32 + "Cheng, Collins" wrote: > Hi Helgaas, > > Some AMD GPUs have hardware support for graphics SR-IOV. > If an SR-IOV capable GPU is plugged into an SR-IOV incapable > platform, it would cause a problem with PCI resource allocation in > the current Linux kernel. > > Therefore, in order to allow the PF (Physical Function) device of > an SR-IOV capable GPU to work on an SR-IOV incapable platform, > it is required to verify conditions for initializing BAR resources > on AMD SR-IOV capable GPUs. > > If the device is an AMD graphics device and it supports > SR-IOV it will require a large amount of resources. > Before calling sriov_init() we must ensure that the system > BIOS also supports SR-IOV and that the system BIOS has been > able to allocate enough resources. > If the VF BARs are zero then the system BIOS does not > support SR-IOV or it could not allocate the resources > and this platform will not support AMD graphics SR-IOV. > Therefore do not call sriov_init(). > If the system BIOS does support SR-IOV then the VF BARs > will be properly initialized to non-zero values. > > Below is the patch against Kernel 4.8 & 4.9. Please review. > > I checked drivers/pci/quirks.c; it looks like the workarounds/fixes in > quirks.c are for specific devices, with one or more device IDs defined > for the specific devices. However my patch is for all AMD SR-IOV > capable GPUs, which includes all existing and future AMD server GPUs. > So it doesn't seem like a good fit to put the fix in quirks.c. Why is an AMD graphics card unique here? Doesn't sriov_init() always need to be able to deal with devices of any type where the BIOS hasn't initialized the SR-IOV capability? Some SR-IOV devices can fit their VFs within a minimum bridge aperture, most cannot. I don't understand why the VF resource requirements being exceptionally large dictates that they receive special handling. 
Thanks, Alex > Signed-off-by: Collins Cheng > --- > drivers/pci/iov.c | 63 > --- > 1 file changed, 60 insertions(+), 3 deletions(-) > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > index e30f05c..e4f1405 100644 > --- a/drivers/pci/iov.c > +++ b/drivers/pci/iov.c > @@ -523,6 +523,45 @@ static void sriov_restore_state(struct pci_dev *dev) > msleep(100); > } > > +/* > + * pci_vf_bar_valid - check if VF BARs have resource allocated > + * @dev: the PCI device > + * @pos: register offset of SR-IOV capability in PCI config space > + * Returns true if any VF BAR has a resource allocated, false > + * if all VF BARs are empty. > + */ > +static bool pci_vf_bar_valid(struct pci_dev *dev, int pos) > +{ > + int i; > + u32 bar_value; > + u32 bar_size_mask = ~(PCI_BASE_ADDRESS_SPACE | > + PCI_BASE_ADDRESS_MEM_TYPE_64 | > + PCI_BASE_ADDRESS_MEM_PREFETCH); > + > + for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) { > + pci_read_config_dword(dev, pos + PCI_SRIOV_BAR + i * 4, > &bar_value); > + if (bar_value & bar_size_mask) > + return true; > + } > + > + return false; > +} > + > +/* > + * is_amd_display_adapter - check if it is an AMD/ATI GPU device > + * @dev: the PCI device > + * > + * Returns true if the device is an AMD/ATI display adapter, > + * otherwise returns false. > + */ > + > +static bool is_amd_display_adapter(struct pci_dev *dev) > +{ > + return (((dev->class >> 16) == PCI_BASE_CLASS_DISPLAY) && > + (dev->vendor == PCI_VENDOR_ID_ATI || > + (dev->vendor == PCI_VENDOR_ID_AMD)); > +} > + > /** > * pci_iov_init - initialize the IOV capability > * @dev: the PCI device > @@ -537,9 +576,27 @@ int pci_iov_init(struct pci_dev *dev) > return -ENODEV; > > pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV); > - if (pos) > - return sriov_init(dev, pos); > - > + if (pos) { > + /* > + * If the device is an AMD graphics device and it supports > + * SR-IOV it will require a large amount of resources. 
> + * Before calling sriov_init() we must ensure that the system > + * BIOS also supports SR-IOV and that the system BIOS has been > + * able to allocate enough resources. > + * If the VF BARs are zero then the system BIOS does not > + * support SR-IOV or it could not allocate the resources > + * and this platform will not support AMD graphics SR-IOV. > + * Therefore do not call sriov_init(). > + * If the system BIOS does support SR-IOV then the VF BARs > + * will be properly initialized to non-zero values. > + */ > + if (is_amd_display_adapter(dev)) { > + if (pci_vf_bar_valid(dev, pos)) > + return sriov_init(dev, pos); > + } else { > + return sriov_init(dev, pos); > + } > + } > return -ENODEV; > } >
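The VF BAR probe discussed above can be modeled in userspace. The following Python sketch is illustrative only (the fake config space, helper names, and values are assumptions, not kernel code): it walks a simulated PCIe extended-capability list starting at offset 0x100 to find the SR-IOV capability (ID 0x0010), then applies the same flag-bit mask as the patch's `pci_vf_bar_valid()` to the six VF BARs at `PCI_SRIOV_BAR` (0x24):

```python
import struct

PCI_EXT_CAP_ID_SRIOV = 0x0010
PCI_SRIOV_BAR = 0x24          # offset of VF BAR0 within the SR-IOV capability
PCI_SRIOV_NUM_BARS = 6
# Flag bits that survive even in an unprogrammed BAR, as in the patch's mask:
# I/O-space bit, 64-bit type bit, prefetchable bit
BAR_FLAG_BITS = 0x1 | 0x4 | 0x8

def read_dword(cfg, off):
    return struct.unpack_from("<I", cfg, off)[0]

def find_ext_cap(cfg, cap_id):
    """Walk the PCIe extended capability list (header: ID in bits 15:0,
    next capability offset in bits 31:20)."""
    pos = 0x100
    while pos:
        header = read_dword(cfg, pos)
        if header == 0:
            return 0
        if (header & 0xFFFF) == cap_id:
            return pos
        pos = (header >> 20) & 0xFFC
    return 0

def vf_bars_valid(cfg, pos):
    """Model of pci_vf_bar_valid(): true if any VF BAR holds an address
    beyond the read-only flag bits."""
    for i in range(PCI_SRIOV_NUM_BARS):
        bar = read_dword(cfg, pos + PCI_SRIOV_BAR + i * 4)
        if bar & ~BAR_FLAG_BITS & 0xFFFFFFFF:
            return True
    return False

# Build a fake 4 KiB config space with one SR-IOV capability at 0x100
cfg = bytearray(4096)
struct.pack_into("<I", cfg, 0x100, (0 << 20) | (1 << 16) | PCI_EXT_CAP_ID_SRIOV)

pos = find_ext_cap(cfg, PCI_EXT_CAP_ID_SRIOV)
print(pos, vf_bars_valid(cfg, pos))   # BARs all zero: BIOS did not program them

# Now pretend the BIOS assigned a 64-bit prefetchable VF BAR0
struct.pack_into("<I", cfg, pos + PCI_SRIOV_BAR, 0xE0000000 | 0x8 | 0x4)
print(vf_bars_valid(cfg, pos))        # a programmed BAR is detected
```

The key point the sketch shows is Alex's question in reverse: nothing in the mask logic is AMD-specific, so the same check would work for any device whose VF BARs were left unprogrammed.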
Re: [PATCH 1/3] f2fs: split wio_mutex
On 2017/5/12 2:37, Jaegeuk Kim wrote: > On 05/09, Chao Yu wrote: >> From: Chao Yu >> >> Split wio_mutex to adjust different temperature bio cache. > > This can be rephrased like: Yup, let me resend this patch. Thanks, > > Signed-off-by: Chao Yu > Signed-off-by: Jaegeuk Kim > --- > fs/f2fs/f2fs.h | 3 ++- > fs/f2fs/segment.c | 4 ++-- > fs/f2fs/super.c | 7 --- > 3 files changed, 8 insertions(+), 6 deletions(-) > > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > index c6643783adff..fb710bfbe215 100644 > --- a/fs/f2fs/f2fs.h > +++ b/fs/f2fs/f2fs.h > @@ -888,7 +888,8 @@ struct f2fs_sb_info { > > /* for bio operations */ > struct f2fs_bio_info *write_io[NR_PAGE_TYPE]; /* for write bios */ > - struct mutex wio_mutex[NODE + 1]; /* bio ordering for NODE/DATA */ > + struct mutex wio_mutex[NR_PAGE_TYPE - 1][NR_TEMP_TYPE]; > + /* bio ordering for NODE/DATA */ > int write_io_size_bits; /* Write IO size bits */ > mempool_t *write_io_dummy; /* Dummy pages */ > > diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c > index da4fd2c29e86..cc2edaf36e55 100644 > --- a/fs/f2fs/segment.c > +++ b/fs/f2fs/segment.c > @@ -2156,7 +2156,7 @@ static void do_write_page(struct f2fs_summary *sum, > struct f2fs_io_info *fio) > int err; > > if (fio->type == NODE || fio->type == DATA) > - mutex_lock(&fio->sbi->wio_mutex[fio->type]); > + mutex_lock(&fio->sbi->wio_mutex[fio->type][fio->temp]); > reallocate: > allocate_data_block(fio->sbi, fio->page, fio->old_blkaddr, > &fio->new_blkaddr, sum, type); > @@ -2169,7 +2169,7 @@ static void do_write_page(struct f2fs_summary *sum, > struct f2fs_io_info *fio) > } > > if (fio->type == NODE || fio->type == DATA) > - mutex_unlock(&fio->sbi->wio_mutex[fio->type]); > + mutex_unlock(&fio->sbi->wio_mutex[fio->type][fio->temp]); > } > > void write_meta_page(struct f2fs_sb_info *sbi, struct page *page) > diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c > index fea563bff0c1..df19902282f2 100644 > --- a/fs/f2fs/super.c > +++ b/fs/f2fs/super.c > @@ -1555,7 +1555,7 @@ int sanity_check_ckpt(struct 
f2fs_sb_info *sbi) > static void init_sb_info(struct f2fs_sb_info *sbi) > { > struct f2fs_super_block *raw_super = sbi->raw_super; > - int i; > + int i, j; > > sbi->log_sectors_per_block = > le32_to_cpu(raw_super->log_sectors_per_block); > @@ -1587,8 +1587,9 @@ static void init_sb_info(struct f2fs_sb_info *sbi) > > INIT_LIST_HEAD(&sbi->s_list); > mutex_init(&sbi->umount_mutex); > - mutex_init(&sbi->wio_mutex[NODE]); > - mutex_init(&sbi->wio_mutex[DATA]); > + for (i = 0; i < NR_PAGE_TYPE - 1; i++) > + for (j = HOT; j < NR_TEMP_TYPE; j++) > + mutex_init(&sbi->wio_mutex[i][j]); > spin_lock_init(&sbi->cp_lock); > } > >
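To see what the two-dimensional `wio_mutex[NR_PAGE_TYPE - 1][NR_TEMP_TYPE]` buys, here is a hedged Python model of the lock choice. The enum orderings mirror f2fs's `page_type` (DATA, NODE, META) and `temp_type` (HOT, WARM, COLD); everything else is illustrative, not kernel code:

```python
import threading
from enum import IntEnum

class PageType(IntEnum):   # mirrors f2fs's enum page_type ordering
    DATA = 0
    NODE = 1
    META = 2               # META takes no wio_mutex, as in do_write_page()

class Temp(IntEnum):       # mirrors f2fs's enum temp_type
    HOT = 0
    WARM = 1
    COLD = 2

# One mutex per (DATA/NODE, temperature) pair -- the shape the patch
# changes wio_mutex to: [NR_PAGE_TYPE - 1][NR_TEMP_TYPE]
wio_mutex = [[threading.Lock() for _ in Temp]
             for _ in (PageType.DATA, PageType.NODE)]

def lock_for(ptype, temp):
    """Model of the lock selection in do_write_page()."""
    if ptype in (PageType.DATA, PageType.NODE):
        return wio_mutex[ptype][temp]
    return None            # META writes are not ordered by wio_mutex

# Before the split, HOT and COLD data writes shared one lock; now they
# take independent locks and no longer serialize against each other:
hot = lock_for(PageType.DATA, Temp.HOT)
cold = lock_for(PageType.DATA, Temp.COLD)
print(hot is cold)
```

The design point is simply contention reduction: ordering is still enforced within each (type, temperature) bio cache, but writers to different temperature logs of the same page type no longer block one another.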
Re: [PATCH] f2fs: split bio cache
On 2017/5/11 10:35, Jaegeuk Kim wrote: > On 05/11, Chao Yu wrote: >> On 2017/5/11 7:50, Jaegeuk Kim wrote: >>> On 05/09, Chao Yu wrote: Hi Jaegeuk, On 2017/5/9 5:23, Jaegeuk Kim wrote: > Hi Chao, > > I can't see a strong reason to split meta from data/node and rename the > existing > function names. Instead, how about keeping the existing one while adding > some > page types to deal with log types? Hmm.. before write this patch, I have thought about this implementation which adds HOT_DATA/WARM_DATA/.. into page_type enum, but the final target is making different temperature log-header being with independent bio cache, io lock, and io list, eliminating interaction of different temperature IOs, also expanding f2fs' scalability when adding more log-headers. If we don't split meta from data/node, it's a little bit hard to approach this. What do you think? >>> >>> I submitted clean-up patches. How about this on top of them? >> >> After splitting, bio caches is connected to log-header, so why not moving bio >> cache into log-header (SM_I(sbi)->curseg_array)? after the merging: >> - we could avoid redundant enum. e.g. temp_type, page_type::{DATA/NODE}, >> since >> we have seg_type enum instead. >> - once we add special log-header like journal log or random IO log making >> DATA/NODE and HOT/WARM/COLD non-orthogonalization, we just need to change few >> codes to adjust it. > > Well, I prefer to keep block allocation and IO control in a separate way as > is. > Moreover, I don't think performance gain would be large enough compared to > what > we need to change both of whole flows. Alright, let's use your implementation. ;) Could you resend this patch? and then I will rebase my patch on this one. Thanks, > > Thanks, > >> >> How do you think? 
>> >> Thanks, >> >>> >>> --- >>> fs/f2fs/data.c | 51 >>> + >>> fs/f2fs/f2fs.h | 10 - >>> fs/f2fs/gc.c| 2 ++ >>> fs/f2fs/segment.c | 24 +++-- >>> fs/f2fs/segment.h | 4 >>> fs/f2fs/super.c | 21 --- >>> include/trace/events/f2fs.h | 14 - >>> 7 files changed, 102 insertions(+), 24 deletions(-) >>> >>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c >>> index 06bb2042385e..49b7e3918484 100644 >>> --- a/fs/f2fs/data.c >>> +++ b/fs/f2fs/data.c >>> @@ -282,21 +282,30 @@ static bool has_merged_page(struct f2fs_sb_info *sbi, >>> struct inode *inode, >>> nid_t ino, pgoff_t idx, enum page_type type) >>> { >>> enum page_type btype = PAGE_TYPE_OF_BIO(type); >>> - struct f2fs_bio_info *io = &sbi->write_io[btype]; >>> - bool ret; >>> + enum temp_type temp; >>> + struct f2fs_bio_info *io; >>> + bool ret = false; >>> >>> - down_read(&io->io_rwsem); >>> - ret = __has_merged_page(io, inode, ino, idx); >>> - up_read(&io->io_rwsem); >>> + for (temp = HOT; temp < NR_TEMP_TYPE; temp++) { >>> + io = sbi->write_io[btype] + temp; >>> + >>> + down_read(&io->io_rwsem); >>> + ret = __has_merged_page(io, inode, ino, idx); >>> + up_read(&io->io_rwsem); >>> + >>> + /* TODO: use HOT temp only for meta pages now. 
*/ >>> + if (ret || btype == META) >>> + break; >>> + } >>> return ret; >>> } >>> >>> static void __f2fs_submit_merged_write(struct f2fs_sb_info *sbi, >>> struct inode *inode, nid_t ino, pgoff_t idx, >>> - enum page_type type) >>> + enum page_type type, enum temp_type temp) >>> { >>> enum page_type btype = PAGE_TYPE_OF_BIO(type); >>> - struct f2fs_bio_info *io = &sbi->write_io[btype]; >>> + struct f2fs_bio_info *io = sbi->write_io[btype] + temp; >>> >>> down_write(&io->io_rwsem); >>> >>> @@ -316,17 +325,34 @@ static void __f2fs_submit_merged_write(struct >>> f2fs_sb_info *sbi, >>> up_write(&io->io_rwsem); >>> } >>> >>> +static void __submit_merged_write_cond(struct f2fs_sb_info *sbi, >>> + struct inode *inode, nid_t ino, pgoff_t idx, >>> + enum page_type type, bool force) >>> +{ >>> + enum temp_type temp; >>> + >>> + if (!force && !has_merged_page(sbi, inode, ino, idx, type)) >>> + return; >>> + >>> + for (temp = HOT; temp < NR_TEMP_TYPE; temp++) { >>> + __f2fs_submit_merged_write(sbi, inode, ino, idx, type, temp); >>> + >>> + /* TODO: use HOT temp only for meta pages now. */ >>> + if (type >= META) >>> + break; >>> + } >>> +} >>> + >>> void f2fs_submit_merged_write(struct f2fs_sb_info *sbi, enum page_type >>> type) >>> { >>> - __f2fs_submit_merged_write(sbi, NULL, 0, 0, type); >>> + __submit_merged_write_cond(sbi, NULL, 0, 0, type, true); >>> } >>> >>> void
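The temperature loop added by `__submit_merged_write_cond()` can be modeled in a few lines. This is a control-flow sketch only (names mirror the patch; it performs no I/O): DATA and NODE flush one bio cache per temperature, while META, per the patch's TODO comment, uses only the HOT slot:

```python
from enum import IntEnum

class PageType(IntEnum):   # mirrors f2fs's enum page_type ordering
    DATA = 0
    NODE = 1
    META = 2

class Temp(IntEnum):       # mirrors f2fs's enum temp_type
    HOT = 0
    WARM = 1
    COLD = 2

def temps_to_flush(ptype):
    """Model of the loop in __submit_merged_write_cond(): iterate
    HOT..COLD, but stop after HOT for META and beyond."""
    flushed = []
    for temp in Temp:
        flushed.append(temp)
        if ptype >= PageType.META:  # the patch's "META is HOT-only" TODO
            break
    return flushed

print(temps_to_flush(PageType.DATA))  # flushes HOT, WARM and COLD caches
print(temps_to_flush(PageType.META))  # flushes only the HOT cache
```

This is the structural consequence of the split that the thread debates: per-temperature caches mean a flush of one page type may now touch up to `NR_TEMP_TYPE` bio caches instead of one.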
[PATCH 1/3] dt-bindings: mt8173: Fix mdp device tree
If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with the old device tree. So, a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Minghsiu Tsai --- Documentation/devicetree/bindings/media/mediatek-mdp.txt | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/Documentation/devicetree/bindings/media/mediatek-mdp.txt b/Documentation/devicetree/bindings/media/mediatek-mdp.txt index 4182063..0d03e3a 100644 --- a/Documentation/devicetree/bindings/media/mediatek-mdp.txt +++ b/Documentation/devicetree/bindings/media/mediatek-mdp.txt @@ -2,7 +2,7 @@ Media Data Path is used for scaling and color space conversion. -Required properties (controller (parent) node): +Required properties (controller node): - compatible: "mediatek,mt8173-mdp" - mediatek,vpu: the node of video processor unit, see Documentation/devicetree/bindings/media/mediatek-vpu.txt for details. @@ -32,21 +32,16 @@ Required properties (DMA function blocks, child node): for details. Example: -mdp { - compatible = "mediatek,mt8173-mdp"; - #address-cells = <2>; - #size-cells = <2>; - ranges; - mediatek,vpu = <>; - mdp_rdma0: rdma@14001000 { compatible = "mediatek,mt8173-mdp-rdma"; +"mediatek,mt8173-mdp"; reg = <0 0x14001000 0 0x1000>; clocks = < CLK_MM_MDP_RDMA0>, < CLK_MM_MUTEX_32K>; power-domains = < MT8173_POWER_DOMAIN_MM>; iommus = < M4U_PORT_MDP_RDMA0>; mediatek,larb = <>; + mediatek,vpu = <>; }; mdp_rdma1: rdma@14002000 { @@ -106,4 +101,3 @@ mdp { iommus = < M4U_PORT_MDP_WROT1>; mediatek,larb = <>; }; -}; -- 1.9.1
[PATCH v2 0/3] Fix mdp device tree
If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with the old device tree. So, a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Daniel Kurtz (2): arm64: dts: mt8173: Fix mdp device tree media: mtk-mdp: Fix mdp device tree Minghsiu Tsai (1): dt-bindings: mt8173: Fix mdp device tree .../devicetree/bindings/media/mediatek-mdp.txt | 12 +- arch/arm64/boot/dts/mediatek/mt8173.dtsi | 126 ++--- drivers/media/platform/mtk-mdp/mtk_mdp_core.c | 2 +- 3 files changed, 64 insertions(+), 76 deletions(-) -- 1.9.1
[PATCH 2/3] arm64: dts: mt8173: Fix mdp device tree
From: Daniel Kurtz If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with the old device tree. So, a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai --- arch/arm64/boot/dts/mediatek/mt8173.dtsi | 126 +++ 1 file changed, 60 insertions(+), 66 deletions(-) diff --git a/arch/arm64/boot/dts/mediatek/mt8173.dtsi b/arch/arm64/boot/dts/mediatek/mt8173.dtsi index 6922252..d28a363 100644 --- a/arch/arm64/boot/dts/mediatek/mt8173.dtsi +++ b/arch/arm64/boot/dts/mediatek/mt8173.dtsi @@ -792,80 +792,74 @@ #clock-cells = <1>; }; - mdp { - compatible = "mediatek,mt8173-mdp"; - #address-cells = <2>; - #size-cells = <2>; - ranges; + mdp_rdma0: rdma@14001000 { + compatible = "mediatek,mt8173-mdp-rdma", +"mediatek,mt8173-mdp"; + reg = <0 0x14001000 0 0x1000>; + clocks = < CLK_MM_MDP_RDMA0>, +< CLK_MM_MUTEX_32K>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + iommus = < M4U_PORT_MDP_RDMA0>; + mediatek,larb = <>; mediatek,vpu = <>; + }; - mdp_rdma0: rdma@14001000 { - compatible = "mediatek,mt8173-mdp-rdma"; - reg = <0 0x14001000 0 0x1000>; - clocks = < CLK_MM_MDP_RDMA0>, -< CLK_MM_MUTEX_32K>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - iommus = < M4U_PORT_MDP_RDMA0>; - mediatek,larb = <>; - }; - - mdp_rdma1: rdma@14002000 { - compatible = "mediatek,mt8173-mdp-rdma"; - reg = <0 0x14002000 0 0x1000>; - clocks = < CLK_MM_MDP_RDMA1>, -< CLK_MM_MUTEX_32K>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - iommus = < M4U_PORT_MDP_RDMA1>; - mediatek,larb = <>; - }; + mdp_rdma1: rdma@14002000 { + compatible = "mediatek,mt8173-mdp-rdma"; + reg = <0 0x14002000 0 
0x1000>; + clocks = < CLK_MM_MDP_RDMA1>, +< CLK_MM_MUTEX_32K>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + iommus = < M4U_PORT_MDP_RDMA1>; + mediatek,larb = <>; + }; - mdp_rsz0: rsz@14003000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14003000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ0>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz0: rsz@14003000 { + compatible = "mediatek,mt8173-mdp-rsz"; + reg = <0 0x14003000 0 0x1000>; + clocks = < CLK_MM_MDP_RSZ0>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + }; - mdp_rsz1: rsz@14004000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14004000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ1>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz1: rsz@14004000 { + compatible = "mediatek,mt8173-mdp-rsz"; + reg = <0 0x14004000 0 0x1000>; + clocks = < CLK_MM_MDP_RSZ1>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + }; - mdp_rsz2: rsz@14005000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14005000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ2>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz2:
[PATCH 3/3] media: mtk-mdp: Fix mdp device tree
From: Daniel Kurtz If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with the old device tree. So, a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai --- drivers/media/platform/mtk-mdp/mtk_mdp_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c index 9e4eb7d..a5ad586 100644 --- a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c +++ b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c @@ -118,7 +118,7 @@ static int mtk_mdp_probe(struct platform_device *pdev) mutex_init(&mdp->vpulock); /* Iterate over sibling MDP function blocks */ - for_each_child_of_node(dev->of_node, node) { + for_each_child_of_node(dev->of_node->parent, node) { const struct of_device_id *of_id; enum mtk_mdp_comp_type comp_type; int comp_id; -- 1.9.1
[PATCH 2/3] arm64: dts: mt8173: Fix mdp device tree
From: Daniel Kurtz If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes iommu assignment issue, it also break compatibility with old device tree. So, the patch in driver is needed to iterate over sibling mdp device nodes, not child ones, to keep driver work properly. Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai --- arch/arm64/boot/dts/mediatek/mt8173.dtsi | 126 +++ 1 file changed, 60 insertions(+), 66 deletions(-) diff --git a/arch/arm64/boot/dts/mediatek/mt8173.dtsi b/arch/arm64/boot/dts/mediatek/mt8173.dtsi index 6922252..d28a363 100644 --- a/arch/arm64/boot/dts/mediatek/mt8173.dtsi +++ b/arch/arm64/boot/dts/mediatek/mt8173.dtsi @@ -792,80 +792,74 @@ #clock-cells = <1>; }; - mdp { - compatible = "mediatek,mt8173-mdp"; - #address-cells = <2>; - #size-cells = <2>; - ranges; + mdp_rdma0: rdma@14001000 { + compatible = "mediatek,mt8173-mdp-rdma", +"mediatek,mt8173-mdp"; + reg = <0 0x14001000 0 0x1000>; + clocks = < CLK_MM_MDP_RDMA0>, +< CLK_MM_MUTEX_32K>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + iommus = < M4U_PORT_MDP_RDMA0>; + mediatek,larb = <>; mediatek,vpu = <>; + }; - mdp_rdma0: rdma@14001000 { - compatible = "mediatek,mt8173-mdp-rdma"; - reg = <0 0x14001000 0 0x1000>; - clocks = < CLK_MM_MDP_RDMA0>, -< CLK_MM_MUTEX_32K>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - iommus = < M4U_PORT_MDP_RDMA0>; - mediatek,larb = <>; - }; - - mdp_rdma1: rdma@14002000 { - compatible = "mediatek,mt8173-mdp-rdma"; - reg = <0 0x14002000 0 0x1000>; - clocks = < CLK_MM_MDP_RDMA1>, -< CLK_MM_MUTEX_32K>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - iommus = < M4U_PORT_MDP_RDMA1>; - mediatek,larb = <>; - }; + mdp_rdma1: rdma@14002000 { + compatible = "mediatek,mt8173-mdp-rdma"; + reg = <0 0x14002000 0 
0x1000>; + clocks = < CLK_MM_MDP_RDMA1>, +< CLK_MM_MUTEX_32K>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + iommus = < M4U_PORT_MDP_RDMA1>; + mediatek,larb = <>; + }; - mdp_rsz0: rsz@14003000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14003000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ0>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz0: rsz@14003000 { + compatible = "mediatek,mt8173-mdp-rsz"; + reg = <0 0x14003000 0 0x1000>; + clocks = < CLK_MM_MDP_RSZ0>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + }; - mdp_rsz1: rsz@14004000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14004000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ1>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz1: rsz@14004000 { + compatible = "mediatek,mt8173-mdp-rsz"; + reg = <0 0x14004000 0 0x1000>; + clocks = < CLK_MM_MDP_RSZ1>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + }; - mdp_rsz2: rsz@14005000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14005000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ2>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz2:
[PATCH 3/3] media: mtk-mdp: Fix mdp device tree
From: Daniel Kurtz If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with the old device tree. So, a patch in the driver is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai --- drivers/media/platform/mtk-mdp/mtk_mdp_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c index 9e4eb7d..a5ad586 100644 --- a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c +++ b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c @@ -118,7 +118,7 @@ static int mtk_mdp_probe(struct platform_device *pdev) mutex_init(>vpulock); /* Iterate over sibling MDP function blocks */ - for_each_child_of_node(dev->of_node, node) { + for_each_child_of_node(dev->of_node->parent, node) { const struct of_device_id *of_id; enum mtk_mdp_comp_type comp_type; int comp_id; -- 1.9.1
Re: [PATCH v2] scsi: sg: don't return bogus Sg_requests
Johannes, > If the list search in sg_get_rq_mark() fails to find a valid request, > we return a bogus element. This then can later lead to a GPF in > sg_remove_scat(). > > So don't return bogus Sg_requests in sg_get_rq_mark() but NULL in case > the list search doesn't find a valid request. Applied to 4.12/scsi-fixes, thanks! -- Martin K. Petersen Oracle Linux Engineering
Re: linux-next: Tree for May 12
Hi all, On Fri, 12 May 2017 13:12:16 +1000 Stephen Rothwellwrote: > > Below is a summary of the state of the merge. -- Cheers, Stephen Rothwell $ git checkout master $ git reset --hard stable Merging origin/master (09d79d103371 Merge tag 'docs-4.12-2' of git://git.lwn.net/linux) Merging fixes/master (97da3854c526 Linux 4.11-rc3) Merging kbuild-current/fixes (9be3213b14d4 gconfig: remove misleading parentheses around a condition) Merging arc-current/for-curr (cf4100d1cddc Revert "ARCv2: Allow enabling PAE40 w/o HIGHMEM") Merging arm-current/fixes (6d8059493691 ARM: 8670/1: V7M: Do not corrupt vector table around v7m_invalidate_l1 call) Merging m68k-current/for-linus (f6ab4d59a5fe nubus: Add MVC and VSC video card definitions) Merging metag-fixes/fixes (b884a190afce metag/usercopy: Add missing fixups) Merging powerpc-fixes/fixes (be5c5e843c4a powerpc/64: Fix HMI exception on LE with CONFIG_RELOCATABLE=y) Merging sparc/master (3c7f62212018 sparc64: fix fault handling in NGbzero.S and GENbzero.S) Merging fscrypt-current/for-stable (42d97eb0ade3 fscrypt: fix renaming and linking special files) Merging net/master (228b0324fe61 Merge branch 'bpf-pkt-ptr-align') Merging ipsec/master (2c1497bbc8fd xfrm: Fix NETDEV_DOWN with IPSec offload) Merging netfilter/master (f411af682218 Merge branch 'ibmvnic-Updated-reset-handler-andcode-fixes') Merging ipvs/master (3c5ab3f395d6 ipvs: SNAT packet replies only for NATed connections) Merging wireless-drivers/master (d77facb88448 brcmfmac: use local iftype avoiding use-after-free of virtual interface) Merging mac80211/master (29cee56c0be4 Merge tag 'mac80211-for-davem-2017-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211) Merging sound-current/for-linus (3488df131df9 sound: Disable the build of OSS drivers) Merging pci-current/for-linus (b9c1153f7a9c PCI: hisi: Fix DT binding (hisi-pcie-almost-ecam)) Merging driver-core.current/driver-core-linus (af82455f7dbd Merge tag 'char-misc-4.12-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc) Merging tty.current/tty-linus (4f7d029b9bf0 Linux 4.11-rc7) Merging usb.current/usb-linus (a71c9a1c779f Linux 4.11-rc5) Merging usb-gadget-fixes/fixes (a351e9b9fc24 Linux 4.11) Merging usb-serial-fixes/usb-linus (c02ed2e75ef4 Linux 4.11-rc4) Merging usb-chipidea-fixes/ci-for-usb-stable (c7fbb09b2ea1 usb: chipidea: move the lock initialization to core file) Merging phy/fixes (1a09b6a7c10e phy: qcom-usb-hs: Add depends on EXTCON) Merging staging.current/staging-linus (56868a460b83 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide) Merging char-misc.current/char-misc-linus (af82455f7dbd Merge tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc) Merging input-current/for-linus (8be193c7b1f4 Input: add support for PlayStation 1/2 joypads connected via SPI) Merging crypto-current/master (929562b14478 crypto: stm32 - Fix OF module alias information) Merging ide/master (acfead32f3f9 ide: don't call memcpy with the same source and destination) Merging vfio-fixes/for-linus (39da7c509acf Linux 4.11-rc6) Merging kselftest-fixes/fixes (c1ae3cfa0e89 Linux 4.11-rc1) Merging backlight-fixes/for-backlight-fixes (68feaca0b13e backlight: pwm: Handle EPROBE_DEFER while requesting the PWM) Merging ftrace-fixes/for-next-urgent (6224beb12e19 tracing: Have branch tracer use recursive field of task struct) Merging mfd-fixes/for-mfd-fixes (b2376407f989 mfd: cros-ec: Fix host command buffer size) Merging v4l-dvb-fixes/fixes (24a47426066c [media] exynos-gsc: Do not swap cb/cr for semi planar formats) Merging drm-intel-fixes/for-linux-next-fixes (a351e9b9fc24 Linux 4.11) Merging drm-misc-fixes/for-linux-next-fixes (0c45b36f8acc drm/udl: Fix unaligned memory access in udl_render_hline) Merging kbuild/for-next (791a9a666d1a Merge tag 'kbuild-uapi-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild) Merging asm-generic/master (de4be6b87b6b asm-generic: 
page.h: fix comment typo) CONFLICT (content): Merge conflict in include/asm-generic/percpu.h Merging arc/for-next (d5adbfcd5f7b Linux 4.10-rc7) Merging arm/for-next (c92a90a5060a Merge branches 'fixes' and 'misc' into for-next) Merging arm-perf/for-next/perf (f00fa5f4163b arm64: pmuv3: use arm_pmu ACPI framework) Merging arm-soc/for-next (e3ec06d4bc1e arm-soc: document merges) CONFLICT (content): Merge conflict in drivers/tee/tee_shm.c Merging alpine/alpine/for-next (a1144b2b1ec4 ARM: dts: alpine: add valid clock-frequency values) Merging amlogic/for-next (715dcd206041 Merge branch 'v4.12/drivers' into tmp/aml-rebuild) Merging aspeed/for-next (4944e5dbb215 Merge branches 'dt-for-v4.12' and 'defconfig-for-v4.12' into for-next) Merging at91/at91-next (ce60fdaa7e9e Merge remote-tracking branch 'alex_korg/at91-dt' into at91-next) Merging bcm2835/for-next (7ea6e490ba7f Merge branch anholt/bcm2835-defconfig-64-next into for-next) Merging
linux-next: Tree for May 12
Hi all, Please do not add any v4.13 destined material in your linux-next included branches until after v4.12-rc1 has been released. Changes since 20170511: Non-merge commits (relative to Linus' tree): 520 583 files changed, 13977 insertions(+), 14719 deletions(-) I have created today's linux-next tree at git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you are tracking the linux-next tree using git, you should not use "git pull" to do so as that will try to merge the new linux-next release with the old one. You should use "git fetch" and checkout or reset to the new master. You can see which trees have been included by looking in the Next/Trees file in the source. There are also quilt-import.log and merge.log files in the Next directory. Between each merge, the tree was built with a ppc64_defconfig for powerpc and an allmodconfig (with CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a native build of tools/perf. After the final fixups (if any), I do an x86_64 modules_install followed by builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc and sparc64 defconfig. Below is a summary of the state of the merge. I am currently merging 258 trees (counting Linus' and 37 trees of bug fix patches pending for the current merge release). Stats about the size of the tree over time can be seen at http://neuling.org/linux-next-size.html . Status of my local build tests will be at http://kisskb.ellerman.id.au/linux-next . If maintainers want to give advice about cross compilers/configs that work, we are always open to add more builds. Thanks to Randy Dunlap for doing many randconfig builds. And to Paul Gortmaker for triage and bug fixes. -- Cheers, Stephen Rothwell
[PATCH] net: dsa: mv88e6xxx: add default case to switch
Add a default case to the switch in order to avoid any chance of using an uninitialized variable _low_, in case s->type does not match any of the listed case values. Addresses-Coverity-ID: 1398130 Suggested-by: Andrew Lunn Signed-off-by: Gustavo A. R. Silva --- drivers/net/dsa/mv88e6xxx/chip.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index 03dc886..d39e210 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -879,6 +879,9 @@ static uint64_t _mv88e6xxx_get_ethtool_stat(struct mv88e6xxx_chip *chip, mv88e6xxx_g1_stats_read(chip, reg, ); if (s->sizeof_stat == 8) mv88e6xxx_g1_stats_read(chip, reg + 1, ); + break; + default: + return UINT64_MAX; } value = (((u64)high) << 16) | low; return value; -- 2.5.0
Re: Implementing Dynamic Rerouting in Kernel
Hi, We do not want to add routes at run time; rather, I would prefer to have a NetFilter driver which can intercept the packet and forward it to the desired interface. On Thu, May 11, 2017 at 9:52 PM, Florian Fainelli wrote: > On 05/11/2017 02:59 AM, Ravish Kumar wrote: >> Hi Experts, >> >> Need expert advice for the one of the requirement Where in VPN >> solution we want to dynaically route the packets to different adapter. >> We will manage our own DNS cache and , based on DNS to IP lookup, we >> can redirect the packet either to Tun device or to a physical adapter. >> >> Please suggest some design what i need to do. > > What you are looking for can be done using ipset-dns from Jason: > > https://git.zx2c4.com/ipset-dns/about/ > -- > Florian
[GIT PULL] arch/nios2 update for v4.12
Hi Linus Here is nios2 update for v4.12. This including nios2 fixes/enhancements and adding nios2 R2 support. Regards Ley Foon The following changes since commit 13e0988140374123bead1dd27c287354cb95108e: docs: complete bumping minimal GNU Make version to 3.81 (2017-05-06 18:49:09 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2.git tags/nios2-v4.12-rc1 for you to fetch changes up to e118c3fec9c0d8d2a96462c4c035305dc952e402: nios2: remove custom early console implementation (2017-05-11 17:44:21 +0800) nios2 update for v4.12-rc1 nios2: remove custom early console implementation nios2: use generic strncpy_from_user() and strnlen_user() nios2: Add CDX support nios2: Add BMX support nios2: Add NIOS2_ARCH_REVISION to select between R1 and R2 nios2: implement flush_dcache_mmap_lock/unlock nios2: enable earlycon support nios2: constify irq_domain_ops nios2: remove wrapper header for cmpxchg.h nios2: add .gitignore entries for auto-generated files Julien Beraud (1): nios2: implement flush_dcache_mmap_lock/unlock Ley Foon Tan (1): nios2: use generic strncpy_from_user() and strnlen_user() Marek Vasut (3): nios2: Add NIOS2_ARCH_REVISION to select between R1 and R2 nios2: Add BMX support nios2: Add CDX support Tobias Klauser (5): nios2: add .gitignore entries for auto-generated files nios2: remove wrapper header for cmpxchg.h nios2: constify irq_domain_ops nios2: enable earlycon support nios2: remove custom early console implementation arch/nios2/Kconfig | 2 + arch/nios2/Kconfig.debug | 1 - arch/nios2/Makefile| 5 ++ arch/nios2/boot/.gitignore | 2 + arch/nios2/boot/dts/10m50_devboard.dts | 3 +- arch/nios2/include/asm/Kbuild | 1 + arch/nios2/include/asm/cacheflush.h| 6 +- arch/nios2/include/asm/cmpxchg.h | 14 arch/nios2/include/asm/cpuinfo.h | 2 + arch/nios2/include/asm/prom.h | 22 -- arch/nios2/include/asm/setup.h | 2 - arch/nios2/include/asm/uaccess.h | 7 +- arch/nios2/kernel/.gitignore | 1 + arch/nios2/kernel/Makefile | 1 
- arch/nios2/kernel/cpuinfo.c| 18 - arch/nios2/kernel/early_printk.c | 118 - arch/nios2/kernel/irq.c| 2 +- arch/nios2/kernel/prom.c | 49 -- arch/nios2/kernel/setup.c | 6 +- arch/nios2/mm/uaccess.c| 33 - arch/nios2/platform/Kconfig.platform | 26 21 files changed, 69 insertions(+), 252 deletions(-) create mode 100644 arch/nios2/boot/.gitignore delete mode 100644 arch/nios2/include/asm/cmpxchg.h delete mode 100644 arch/nios2/include/asm/prom.h create mode 100644 arch/nios2/kernel/.gitignore delete mode 100644 arch/nios2/kernel/early_printk.c
Re: [PATCH net] net: phy: Call bus->reset() after releasing PHYs from reset
From: Florian Fainelli Date: Thu, 11 May 2017 11:24:16 -0700 > The API convention makes it that a given MDIO bus reset should be able > to access PHY devices in its reset() callback and perform additional > MDIO accesses in order to bring the bus and PHYs in a working state. > > Commit 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.") broke that > contract by first calling bus->reset() and then release all PHYs from > reset using their shared GPIO line, so restore the expected > functionality here. > > Fixes: 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.") > Signed-off-by: Florian Fainelli Applied, thanks Florian.
Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf
On Fri, 12 May 2017 02:12:10 + "Chen, Xiaoguang"wrote: > Hi Alex and Gerd, > > >-Original Message- > >From: intel-gvt-dev [mailto:intel-gvt-dev-boun...@lists.freedesktop.org] On > >Behalf Of Alex Williamson > >Sent: Thursday, May 11, 2017 11:45 PM > >To: Gerd Hoffmann > >Cc: Tian, Kevin ; intel-...@lists.freedesktop.org; > >linux- > >ker...@vger.kernel.org; zhen...@linux.intel.com; Lv, Zhiyuan > > ; Chen, Xiaoguang ; intel- > >gvt-...@lists.freedesktop.org; Wang, Zhi A > >Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf > > > >On Thu, 11 May 2017 15:27:53 +0200 > >Gerd Hoffmann wrote: > > > >> Hi, > >> > >> > While read the framebuffer region we have to tell the vendor driver > >> > which > >framebuffer we want to read? There are two framebuffers now in KVMGT that is > >primary and cursor. > >> > There are two methods to implement this: > >> > 1) write the plane id first and then read the framebuffer. > >> > 2) create 2 vfio regions one for primary and one for cursor. > >> > >> (3) Place information for both planes into one vfio region. > >> Which allows to fetch both with a single read() syscall. > >> > >> The question is how you'll get the file descriptor then. If the ioctl > >> returns the dma-buf fd only you have a racy interface: Things can > >> change between read(vfio-region) and ioctl(need-dmabuf-fd). > >> > >> ioctl(need-dma-buf) could return both dmabuf fd and plane info to fix > >> the race, but then it is easier to go with ioctl only interface > >> (simliar to the orginal one from dec last year) I think. > > > >If the dmabuf fd is provided by a separate mdev vendor driver specific > >ioctl, I > >don't see how vfio regions should be involved. Selecting which framebuffer > >should be an ioctl parameter. > Based on your last mail. I think the implementation looks like this: > 1) user query the framebuffer information by reading the vfio region. 
> 2) if the framebuffer changed(such as framebuffer's graphics address changed, > size changed etc) we will need to create a new dmabuf fd. > 3) create a new dmabuf fd using vfio device specific ioctl. > > >What sort of information needs to be conveyed > >about each plane? > Only plane id is needed. > > >Is it static information or something that needs to be read > >repeatedly? > It is static information. For our case plane id 1 represent primary plane and > 3 for cursor plane. 2 means sprite plane which will not be used in our case. > > >Do we need it before we get the dmabuf fd or can it be an ioctl on > >the dmabuf fd? > We need it while query the framebuffer. In kernel we need the plane id to > decide which plane we should decode. > Below is my current implementation: > 1) user first query the framebuffer(primary or cursor) and kernel decode the > framebuffer and return the framebuffer information to user and also save a > copy in kernel. > 2) user compared the framebuffer and if the framebuffer changed creating a > new dmabuf fd. If the contents of the framebuffer change or if the parameters of the framebuffer change? I can't imagine that creating a new dmabuf fd for every visual change within the framebuffer would be efficient, but I don't have any concept of what a dmabuf actually does. > 3) kernel create a new dmabuf fd based on saved framebuffer information. > > So we need plane id in step 1. > In step 3 we create a dmabuf fd only using saved framebuffer information(no > other information is needed). What changes to the framebuffer require a new dmabuf fd? Shouldn't the user query the parameters of the framebuffer through a dmabuf fd and shouldn't the dmabuf fd have some signaling mechanism to the user (eventfd perhaps) to notify the user to re-evaluate the parameters? Otherwise are you imagining that the user polls the vfio region? Why can a dmabuf fd not persist across changes to the framebuffer?
Can someone explain what a dmabuf is and how it works in terms that a non-graphics person can understand? Thanks, Alex
Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf
On Fri, 12 May 2017 02:12:10 + "Chen, Xiaoguang" wrote: > Hi Alex and Gerd, > > >-Original Message- > >From: intel-gvt-dev [mailto:intel-gvt-dev-boun...@lists.freedesktop.org] On > >Behalf Of Alex Williamson > >Sent: Thursday, May 11, 2017 11:45 PM > >To: Gerd Hoffmann > >Cc: Tian, Kevin ; intel-...@lists.freedesktop.org; > >linux- > >ker...@vger.kernel.org; zhen...@linux.intel.com; Lv, Zhiyuan > >; Chen, Xiaoguang ; intel- > >gvt-...@lists.freedesktop.org; Wang, Zhi A > >Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf > > > >On Thu, 11 May 2017 15:27:53 +0200 > >Gerd Hoffmann wrote: > > > >> Hi, > >> > >> > While read the framebuffer region we have to tell the vendor driver > >> > which > >framebuffer we want to read? There are two framebuffers now in KVMGT that is > >primary and cursor. > >> > There are two methods to implement this: > >> > 1) write the plane id first and then read the framebuffer. > >> > 2) create 2 vfio regions one for primary and one for cursor. > >> > >> (3) Place information for both planes into one vfio region. > >> Which allows to fetch both with a single read() syscall. > >> > >> The question is how you'll get the file descriptor then. If the ioctl > >> returns the dma-buf fd only you have a racy interface: Things can > >> change between read(vfio-region) and ioctl(need-dmabuf-fd). > >> > >> ioctl(need-dma-buf) could return both dmabuf fd and plane info to fix > >> the race, but then it is easier to go with ioctl only interface > >> (simliar to the orginal one from dec last year) I think. > > > >If the dmabuf fd is provided by a separate mdev vendor driver specific > >ioctl, I > >don't see how vfio regions should be involved. Selecting which framebuffer > >should be an ioctl parameter. > Based on your last mail. I think the implementation looks like this: > 1) user query the framebuffer information by reading the vfio region. 
> 2) if the framebuffer changed (such as framebuffer's graphics address changed, > size changed etc) we will need to create a new dmabuf fd. > 3) create a new dmabuf fd using vfio device specific ioctl. > > >What sort of information needs to be conveyed > >about each plane? > Only plane id is needed. > > >Is it static information or something that needs to be read > >repeatedly? > It is static information. For our case plane id 1 represent primary plane and > 3 for cursor plane. 2 means sprite plane which will not be used in our case. > > >Do we need it before we get the dmabuf fd or can it be an ioctl on > >the dmabuf fd? > We need it while query the framebuffer. In kernel we need the plane id to > decide which plane we should decode. > Below is my current implementation: > 1) user first query the framebuffer (primary or cursor) and kernel decode the > framebuffer and return the framebuffer information to user and also save a > copy in kernel. > 2) user compared the framebuffer and if the framebuffer changed creating a > new dmabuf fd. If the contents of the framebuffer change or if the parameters of the framebuffer change? I can't imagine that creating a new dmabuf fd for every visual change within the framebuffer would be efficient, but I don't have any concept of what a dmabuf actually does. > 3) kernel create a new dmabuf fd based on saved framebuffer information. > > So we need plane id in step 1. > In step 3 we create a dmabuf fd only using saved framebuffer information (no > other information is needed). What changes to the framebuffer require a new dmabuf fd? Shouldn't the user query the parameters of the framebuffer through a dmabuf fd and shouldn't the dmabuf fd have some signaling mechanism to the user (eventfd perhaps) to notify the user to re-evaluate the parameters? Otherwise are you imagining that the user polls the vfio region? Why can a dmabuf fd not persist across changes to the framebuffer?
Can someone explain what a dmabuf is and how it works in terms that a non-graphics person can understand? Thanks, Alex
[PATCH] PCI: Make SR-IOV capable GPU working on the SR-IOV incapable platform
Hi Helgaas, Some AMD GPUs have hardware support for graphics SR-IOV. If an SR-IOV capable GPU is plugged into an SR-IOV incapable platform, it causes a problem with PCI resource allocation in the current Linux kernel. Therefore, in order to allow the PF (Physical Function) device of an SR-IOV capable GPU to work on an SR-IOV incapable platform, it is required to verify conditions for initializing BAR resources on AMD SR-IOV capable GPUs. If the device is an AMD graphics device and it supports SR-IOV, it will require a large amount of resources. Before calling sriov_init() we must ensure that the system BIOS also supports SR-IOV and that the system BIOS has been able to allocate enough resources. If the VF BARs are zero then the system BIOS does not support SR-IOV or it could not allocate the resources, and this platform will not support AMD graphics SR-IOV; therefore do not call sriov_init(). If the system BIOS does support SR-IOV then the VF BARs will be properly initialized to non-zero values. Below is the patch against Kernel 4.8 & 4.9. Please review. I checked drivers/pci/quirks.c; it looks like the workarounds/fixes in quirks.c are for specific devices, with one or more device IDs defined for each specific device. However my patch is for all AMD SR-IOV capable GPUs, which includes all existing and future AMD server GPUs. So it doesn't seem like a good fit to put the fix in quirks.c. Signed-off-by: Collins Cheng --- drivers/pci/iov.c | 63 --- 1 file changed, 60 insertions(+), 3 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index e30f05c..e4f1405 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -523,6 +523,45 @@ static void sriov_restore_state(struct pci_dev *dev) msleep(100); } +/* + * pci_vf_bar_valid - check if VF BARs have resource allocated + * @dev: the PCI device + * @pos: register offset of SR-IOV capability in PCI config space + * Returns true if any VF BAR has resource allocated, false + * if all VF BARs are empty.
+ */ +static bool pci_vf_bar_valid(struct pci_dev *dev, int pos) +{ + int i; + u32 bar_value; + u32 bar_size_mask = ~(PCI_BASE_ADDRESS_SPACE | + PCI_BASE_ADDRESS_MEM_TYPE_64 | + PCI_BASE_ADDRESS_MEM_PREFETCH); + + for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) { + pci_read_config_dword(dev, pos + PCI_SRIOV_BAR + i * 4, &bar_value); + if (bar_value & bar_size_mask) + return true; + } + + return false; +} + +/* + * is_amd_display_adapter - check if it is an AMD/ATI GPU device + * @dev: the PCI device + * + * Returns true if device is an AMD/ATI display adapter, + * otherwise return false. + */ + +static bool is_amd_display_adapter(struct pci_dev *dev) +{ + return (((dev->class >> 16) == PCI_BASE_CLASS_DISPLAY) && + (dev->vendor == PCI_VENDOR_ID_ATI || + dev->vendor == PCI_VENDOR_ID_AMD)); +} + /** * pci_iov_init - initialize the IOV capability * @dev: the PCI device @@ -537,9 +576,27 @@ int pci_iov_init(struct pci_dev *dev) return -ENODEV; pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV); - if (pos) - return sriov_init(dev, pos); - + if (pos) { + /* +* If the device is an AMD graphics device and it supports +* SR-IOV it will require a large amount of resources. +* Before calling sriov_init() must ensure that the system +* BIOS also supports SR-IOV and that system BIOS has been +* able to allocate enough resources. +* If the VF BARs are zero then the system BIOS does not +* support SR-IOV or it could not allocate the resources +* and this platform will not support AMD graphics SR-IOV. +* Therefore do not call sriov_init(). +* If the system BIOS does support SR-IOV then the VF BARs +* will be properly initialized to non-zero values. +*/ + if (is_amd_display_adapter(dev)) { + if (pci_vf_bar_valid(dev, pos)) + return sriov_init(dev, pos); + } else { + return sriov_init(dev, pos); + } + } return -ENODEV; } -- 1.9.1 -Collins Cheng
Re: [net-dsa-mv88e6xxx] question about potential use of uninitialized variable
Hi Andrew, Quoting Andrew Lunn: On Thu, May 11, 2017 at 04:35:37PM -0500, Gustavo A. R. Silva wrote: Hello everybody, While looking into Coverity ID 1398130 I ran into the following piece of code at drivers/net/dsa/mv88e6xxx/chip.c:849: 849 static uint64_t _mv88e6xxx_get_ethtool_stat(struct mv88e6xxx_chip *chip, 850 struct mv88e6xxx_hw_stat *s, 851 int port, u16 bank1_select, 852 u16 histogram) 853 { 854 u32 low; 855 u32 high = 0; 856 u16 reg = 0; 857 int err; 858 u64 value; 859 860 switch (s->type) { 861 case STATS_TYPE_PORT: 862 err = mv88e6xxx_port_read(chip, port, s->reg, &reg); 863 if (err) 864 return UINT64_MAX; 865 866 low = reg; 867 if (s->sizeof_stat == 4) { 868 err = mv88e6xxx_port_read(chip, port, s->reg + 1, &reg); 869 if (err) 870 return UINT64_MAX; 871 high = reg; 872 } 873 break; 874 case STATS_TYPE_BANK1: 875 reg = bank1_select; 876 /* fall through */ 877 case STATS_TYPE_BANK0: 878 reg |= s->reg | histogram; 879 mv88e6xxx_g1_stats_read(chip, reg, &low); 880 if (s->sizeof_stat == 8) 881 mv88e6xxx_g1_stats_read(chip, reg + 1, &high); 882 } 883 value = (((u64)high) << 16) | low; 884 return value; 885 } My question here is if there is any chance for the execution path to directly jump from line 860 to line 883, hence ending up using the uninitialized variable _low_? Hi Gustavo It would require that s->type not have one of the listed case values. Currently all members of mv88e6xxx_hw_stats do use expected values. However, it would not hurt to add a default: return UINT64_MAX; Do you want to submit a patch? Sure, I'll send it shortly. Thanks for clarifying! -- Gustavo A. R. Silva
Re: [PATCH v5 15/17] dt-bindings: qca7000: append UART interface to binding
On Thu, 11 May 2017 21:12:22 +0200, Michael Heimpold wrote: > Am Mittwoch, 10. Mai 2017, 10:53:26 CEST schrieb Stefan Wahren: > > This merges the serdev binding for the QCA7000 UART driver (Ethernet over > > UART) into the existing document. > > > > Signed-off-by: Stefan Wahren> > --- > > .../devicetree/bindings/net/qca-qca7000.txt| 32 > > ++ 1 file changed, 32 insertions(+) > > > > diff --git a/Documentation/devicetree/bindings/net/qca-qca7000.txt > > b/Documentation/devicetree/bindings/net/qca-qca7000.txt index > > a37f656..08364c3 100644 > > --- a/Documentation/devicetree/bindings/net/qca-qca7000.txt > > +++ b/Documentation/devicetree/bindings/net/qca-qca7000.txt > > @@ -54,3 +54,35 @@ ssp2: spi@80014000 { > > local-mac-address = [ A0 B0 C0 D0 E0 F0 ]; > > }; > > }; > > + > > +(b) Ethernet over UART > > + > > +In order to use the QCA7000 as UART slave it must be defined as a child of > > a +UART master in the device tree. It is possible to preconfigure the UART > > +settings of the QCA7000 firmware, but it's not possible to change them > > during +runtime. > > + > > +Required properties: > > +- compatible: Should be "qca,qca7000-uart" > > I already discussed this with Stefan off-list a little bit, but I would like > to bring this to a broader audience: I'm not sure whether the compatible > should contain the "-uart" suffix, because the hardware chip is the very same > QCA7000 chip which can also be used with SPI protocol. > The only difference is the loaded firmware within the chip which can either > speak SPI or UART protocol (but not both at the same time - due to shared > pins). So the hardware design decides which interface type is used. > > At the moment, this patch series adds a dedicated driver for the UART > protocol, in parallel to the existing SPI driver. So a different compatible > string is needed here to match against the new driver. 
> > An alternative approach would be to re-use the existing compatible string > "qca,qca7000" for both, the SPI and UART protocol, because a "smarter" > (combined) driver would detect which protocol to use. For example the driver > could check for spi-cpha and/or spi-cpol which are required for SPI protocol: > if these exists the driver could assume that SPI must be used, if both are > missing then UART protocol should be used. > (This way it would not be necessary to check whether the node is a child of > a SPI or UART master node - but maybe this is even easier - I don't know) > > Or in shorter words: my concern is that while "qca7000-uart" describes the > hardware, it's too closely coupled to the driver implementation. Having > some feedback of the experts would be nice :-) I'm no expert, but devices which can do both I2C and SPI are quite common, and they usually have the same compatible string for both buses.
[PATCH] nvmem: rockchip-efuse: add support for rk322x-efuse
This adds the necessary data for handling eFuse on the rk322x. Signed-off-by: Finley Xiao --- Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt | 1 + drivers/nvmem/rockchip-efuse.c | 4 2 files changed, 5 insertions(+) diff --git a/Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt b/Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt index 94aeeea..194926f 100644 --- a/Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt +++ b/Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt @@ -4,6 +4,7 @@ Required properties: - compatible: Should be one of the following. - "rockchip,rk3066a-efuse" - for RK3066a SoCs. - "rockchip,rk3188-efuse" - for RK3188 SoCs. + - "rockchip,rk322x-efuse" - for RK322x SoCs. - "rockchip,rk3288-efuse" - for RK3288 SoCs. - "rockchip,rk3399-efuse" - for RK3399 SoCs. - reg: Should contain the registers location and exact eFuse size diff --git a/drivers/nvmem/rockchip-efuse.c b/drivers/nvmem/rockchip-efuse.c index 423907b..a0d4ede 100644 --- a/drivers/nvmem/rockchip-efuse.c +++ b/drivers/nvmem/rockchip-efuse.c @@ -170,6 +170,10 @@ static const struct of_device_id rockchip_efuse_match[] = { .data = (void *)&rockchip_rk3288_efuse_read, }, { + .compatible = "rockchip,rk322x-efuse", + .data = (void *)&rockchip_rk3288_efuse_read, + }, + { .compatible = "rockchip,rk3288-efuse", .data = (void *)&rockchip_rk3288_efuse_read, }, -- 2.7.4
[PATCH v2] [media] mtk-mdp: Fix g_/s_selection capture/compose logic
From: Daniel Kurtz Experiments show that: (1) mtk-mdp uses the _MPLANE form of CAPTURE/OUTPUT (2) CAPTURE types use CROP targets, and OUTPUT types use COMPOSE targets Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai Signed-off-by: Houlong Wei --- Changes in v2: . Can not use *_MPLANE type in g_/s_selection --- drivers/media/platform/mtk-mdp/mtk_mdp_m2m.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/media/platform/mtk-mdp/mtk_mdp_m2m.c b/drivers/media/platform/mtk-mdp/mtk_mdp_m2m.c index 13afe48..e18ac626 100644 --- a/drivers/media/platform/mtk-mdp/mtk_mdp_m2m.c +++ b/drivers/media/platform/mtk-mdp/mtk_mdp_m2m.c @@ -838,10 +838,10 @@ static int mtk_mdp_m2m_g_selection(struct file *file, void *fh, bool valid = false; if (s->type == V4L2_BUF_TYPE_VIDEO_CAPTURE) { - if (mtk_mdp_is_target_compose(s->target)) + if (mtk_mdp_is_target_crop(s->target)) valid = true; } else if (s->type == V4L2_BUF_TYPE_VIDEO_OUTPUT) { - if (mtk_mdp_is_target_crop(s->target)) + if (mtk_mdp_is_target_compose(s->target)) valid = true; } if (!valid) { @@ -908,10 +908,10 @@ static int mtk_mdp_m2m_s_selection(struct file *file, void *fh, bool valid = false; if (s->type == V4L2_BUF_TYPE_VIDEO_CAPTURE) { - if (s->target == V4L2_SEL_TGT_COMPOSE) + if (s->target == V4L2_SEL_TGT_CROP) valid = true; } else if (s->type == V4L2_BUF_TYPE_VIDEO_OUTPUT) { - if (s->target == V4L2_SEL_TGT_CROP) + if (s->target == V4L2_SEL_TGT_COMPOSE) valid = true; } if (!valid) { @@ -925,7 +925,7 @@ static int mtk_mdp_m2m_s_selection(struct file *file, void *fh, if (ret) return ret; - if (mtk_mdp_is_target_crop(s->target)) + if (mtk_mdp_is_target_compose(s->target)) frame = &ctx->s_frame; else frame = &ctx->d_frame; -- 1.9.1
RE: [PATCH 2/2] Revert "ACPI / button: Change default behavior to lid_init_state=open"
Hi, > From: Benjamin Tissoires [mailto:benjamin.tissoi...@redhat.com] > Subject: Re: [PATCH 2/2] Revert "ACPI / button: Change default behavior to > lid_init_state=open" > > On May 11 2017 or thereabouts, Zheng, Lv wrote: > > Hi, > > > > > From: Benjamin Tissoires [mailto:benjamin.tissoi...@redhat.com] > > > Subject: [PATCH 2/2] Revert "ACPI / button: Change default behavior to > > > lid_init_state=open" > > > > > > This reverts commit 77e9a4aa9de10cc1418bf9a892366988802a8025. > > > > > > Even if the method implementation can be buggy on some platform, > > > the "open" choice is worse. It breaks docking stations basically > > > and there is no way to have a user-space hwdb to fix that. > > > > > > On the contrary, it's rather easy in user-space to have a hwdb > > > with the problematic platforms. Then, libinput (1.7.0+) can fix > > > the state of the LID switch for us: you need to set the udev > > > property LIBINPUT_ATTR_LID_SWITCH_RELIABILITY to 'write_open'. > > > > > > When libinput detects internal keyboard events, it will > > > overwrite the state of the switch to open, making it reliable > > > again. Given that logind only checks the LID switch value after > > > a timeout, we can assume the user will use the internal keyboard > > > before this timeout expires. > > > > > > For example, such a hwdb entry is: > > > > > > libinput:name:*Lid Switch*:dmi:*svnMicrosoftCorporation:pnSurface3:* > > > LIBINPUT_ATTR_LID_SWITCH_RELIABILITY=write_open > > > > For the reason mentioned previously and proofs here (see patch > > descriptions): > > https://patchwork.kernel.org/patch/9717111/ > > I am not sure you can call this proofs. The "close" state is IMO the > exact same as the "method" one, it's just that the intel driver > triggers the evaluation of the _LID method, not acpi/button.c. I should correct you that: 1. intel_lvds driver only evaluates _LID method for its own purpose, See my reply to PATCH 4-5; 2. 
acpi/button.c is still responsible for generating the lid input event (to input layer and thus the userspace) And the input event generated by button.c differs for the 2 modes. See my another reply to PATCH 02. > > And remember that this new default prevents userspace to fix it because > it's rather easy to detect that the device is actually opened (input > events coming from internal keyboard, touchscreen, or touchpad), while > reporting the lid switch as open means we can't know if the user is > simply not using the internal devices, or if we have just a wrong state. That depends on what the meaning of "fix" is IMO. I saw you wrote a lot in another message, let's discuss this there. > > Given that this patch breaks all Lenovos with a dock (I can guess that 6 > lines this year are using docks, and within each line you have 2-5 > variants), I'd suggest we do not break those existing laptops and just > revert this patch. > > I would think other OEMs have a docking station with an actual power > button but I can't be sure by looking at the product pages. I'm not sure if it breaks the external monitor usage model. Why don't the userspace just 1. always light on the external display and keep the internal display not lit on after boot/resume, or 2. don't do anything and let the BIOS to light on the right display, or 3. don't do anything and let the graphics drivers to light on the right display (using saved state when suspends and resumes to that saved state). Why such decision have to be made based on ACPI control method lid device? I cannot see a strong reason that ACPI control method lid device must participate in this usage model. See drivers/gpu/drm/nouveau/nouveau_connector.c, the ignorelid parameter proves that the graphics drivers can do this on their own and do this perfectly. > > > We shouldn't do this. > > I strongly disagree.
> > I am fine if you don't want to have a blacklist in the kernel for the > devices that are problematic (the ones that require open), but please > don't prevent user space to have this blacklist and please do not make > false assumptions in the kernel on the state of a switch. This is worse > than it helps and in the end, user space won't be able to solve this if > you change the default behavior at each release. We are actually also fine with any default value. But be aware of that, ACPI subsystem just provides a button driver: 1. Allows users to suspend the system upon receiving "close" lid event; 2. Stops to support any other usage models but is willing to provide tunable behavior for the other system participants to support their special needs. Such participants include but are not limited to the user space tools and the other kernel modules. But the default behavior and the timing of tuning button driver behaviors should be determined by the other system participants. And as a consequence, the other system participants who change that must be responsible for fixing regressions related to conflicts of the needs. As
RE: [PATCH 2/2] Revert "ACPI / button: Change default behavior to lid_init_state=open"
Hi, > From: Benjamin Tissoires [mailto:benjamin.tissoi...@redhat.com] > Subject: Re: [PATCH 2/2] Revert "ACPI / button: Change default behavior to > lid_init_state=open" > > On May 11 2017 or thereabouts, Zheng, Lv wrote: > > Hi, > > > > > From: Benjamin Tissoires [mailto:benjamin.tissoi...@redhat.com] > > > Subject: [PATCH 2/2] Revert "ACPI / button: Change default behavior to > > > lid_init_state=open" > > > > > > This reverts commit 77e9a4aa9de10cc1418bf9a892366988802a8025. > > > > > > Even if the method implementation can be buggy on some platform, > > > the "open" choice is worse. It breaks docking stations basically > > > and there is no way to have a user-space hwdb to fix that. > > > > > > On the contrary, it's rather easy in user-space to have a hwdb > > > with the problematic platforms. Then, libinput (1.7.0+) can fix > > > the state of the LID switch for us: you need to set the udev > > > property LIBINPUT_ATTR_LID_SWITCH_RELIABILITY to 'write_open'. > > > > > > When libinput detects internal keyboard events, it will > > > overwrite the state of the switch to open, making it reliable > > > again. Given that logind only checks the LID switch value after > > > a timeout, we can assume the user will use the internal keyboard > > > before this timeout expires. > > > > > > For example, such a hwdb entry is: > > > > > > libinput:name:*Lid Switch*:dmi:*svnMicrosoftCorporation:pnSurface3:* > > > LIBINPUT_ATTR_LID_SWITCH_RELIABILITY=write_open > > > > For the reason mentioned previously and proofs here (see patch > > descriptions): > > https://patchwork.kernel.org/patch/9717111/ > > I am not sure you can call this proofs. The "close" state is IMO the > exact same as the "method" one, it's just that the intel driver > triggers the evaluation of the _LID method, not acpi/button.c. I should correct you that: 1. intel_lvds driver only evaluates _LID method for its own purpose, See my reply to PATCH 4-5; 2. 
acpi/button.c is still responsible for generating the lid input event
(to the input layer and thus userspace). And the input event generated
by button.c differs for the 2 modes. See my other reply to PATCH 02.

> And remember that this new default prevents userspace to fix it because
> it's rather easy to detect that the device is actually opened (input
> events coming from internal keyboard, touchscreen, or touchpad), while
> reporting the lid switch as open means we can't know if the user is
> simply not using the internal devices, or if we have just a wrong state.

That depends on what the meaning of "fix" is, IMO. I saw you wrote a lot
in another message; let's discuss this there.

> Given that this patch breaks all Lenovos with a dock (I can guess that 6
> lines this year are using docks, and within each line you have 2-5
> variants), I'd suggest we do not break those existing laptops and just
> revert this patch.
>
> I would think other OEMs have a docking station with an actual power
> button but I can't be sure by looking at the product pages.

I'm not sure it breaks the external monitor usage model. Why doesn't
userspace just:
1. always light up the external display and keep the internal display
   off after boot/resume, or
2. do nothing and let the BIOS light up the right display, or
3. do nothing and let the graphics drivers light up the right display
   (using state saved on suspend, and resuming to that saved state)?
Why does such a decision have to be made based on the ACPI control
method lid device? I cannot see a strong reason that the ACPI control
method lid device must participate in this usage model. See
drivers/gpu/drm/nouveau/nouveau_connector.c: the ignorelid parameter
proves that the graphics drivers can do this on their own, and do it
perfectly.

> > We shouldn't do this.
>
> I strongly disagree.
>
> I am fine if you don't want to have a blacklist in the kernel for the
> devices that are problematic (the ones that require open), but please
> don't prevent user space to have this blacklist and please do not make
> false assumptions in the kernel on the state of a switch. This is worse
> than it helps and in the end, user space won't be able to solve this if
> you change the default behavior at each release.

We are actually also fine with any default value. But be aware that the
ACPI subsystem just provides a button driver which:
1. Allows users to suspend the system upon receiving a "close" lid event;
2. Stops supporting any other usage models, but is willing to provide
   tunable behavior for the other system participants to support their
   special needs. Such participants include but are not limited to the
   user space tools and the other kernel modules.
But the default behavior and the timing of tuning the button driver's
behavior should be determined by the other system participants. And as a
consequence, the other system participants who change that must be
responsible for fixing regressions related to conflicting needs. As
Re: [net-dsa-mv88e6xxx] question about potential use of uninitialized variable
On Thu, May 11, 2017 at 04:35:37PM -0500, Gustavo A. R. Silva wrote:
>
> Hello everybody,
>
> While looking into Coverity ID 1398130 I ran into the following
> piece of code at drivers/net/dsa/mv88e6xxx/chip.c:849:
>
> 849  static uint64_t _mv88e6xxx_get_ethtool_stat(struct mv88e6xxx_chip *chip,
> 850                                              struct mv88e6xxx_hw_stat *s,
> 851                                              int port, u16 bank1_select,
> 852                                              u16 histogram)
> 853  {
> 854          u32 low;
> 855          u32 high = 0;
> 856          u16 reg = 0;
> 857          int err;
> 858          u64 value;
> 859
> 860          switch (s->type) {
> 861          case STATS_TYPE_PORT:
> 862                  err = mv88e6xxx_port_read(chip, port, s->reg, &reg);
> 863                  if (err)
> 864                          return UINT64_MAX;
> 865
> 866                  low = reg;
> 867                  if (s->sizeof_stat == 4) {
> 868                          err = mv88e6xxx_port_read(chip, port,
>                                                        s->reg + 1, &reg);
> 869                          if (err)
> 870                                  return UINT64_MAX;
> 871                          high = reg;
> 872                  }
> 873                  break;
> 874          case STATS_TYPE_BANK1:
> 875                  reg = bank1_select;
> 876                  /* fall through */
> 877          case STATS_TYPE_BANK0:
> 878                  reg |= s->reg | histogram;
> 879                  mv88e6xxx_g1_stats_read(chip, reg, &low);
> 880                  if (s->sizeof_stat == 8)
> 881                          mv88e6xxx_g1_stats_read(chip, reg + 1, &high);
> 882          }
> 883          value = (((u64)high) << 16) | low;
> 884          return value;
> 885  }
>
> My question here is if there is any chance for the execution path to
> directly jump from line 860 to line 883, hence ending up using the
> uninitialized variable _low_?

Hi Gustavo

It would require that s->type not have one of the listed case values.
Currently all members of mv88e6xxx_hw_stats do use expected values.
However, it would not hurt to add a

	default:
		return UINT64_MAX;

Do you want to submit a patch?

Thanks
	Andrew
[PATCH v5 3/4] zram: make deduplication feature optional
From: Joonsoo Kim

Benefit of deduplication is dependent on the workload, so it's not preferable to always enable it. Therefore, make it optional via Kconfig and a device param. The default is 'off'. This option will be beneficial for users who use zram as a blockdev and store build output on it.

Reviewed-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Signed-off-by: Joonsoo Kim
---
 Documentation/ABI/testing/sysfs-block-zram | 10
 Documentation/blockdev/zram.txt            |  1 +
 drivers/block/zram/Kconfig                 | 14 ++
 drivers/block/zram/Makefile                |  3 +-
 drivers/block/zram/zram_dedup.c            | 15 ++
 drivers/block/zram/zram_dedup.h            | 23 +
 drivers/block/zram/zram_drv.c              | 81 ++
 drivers/block/zram/zram_drv.h              |  9
 8 files changed, 144 insertions(+), 12 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 451b6d8..3c1945f 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -90,3 +90,13 @@ Description:
 		device's debugging info useful for kernel developers. Its
 		format is not documented intentionally and may change
 		anytime without any notice.
+
+What:		/sys/block/zram/use_dedup
+Date:		March 2017
+Contact:	Joonsoo Kim
+Description:
+		The use_dedup file is read-write and specifies deduplication
+		feature is used or not. If enabled, duplicated data is
+		managed by reference count and will not be stored in memory
+		twice. Benefit of this feature largely depends on the workload
+		so keep attention when use.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 2cdc303..cbbe39b 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -168,6 +168,7 @@ max_comp_streams RWthe number of possible concurrent compress operations comp_algorithmRWshow and change the compression algorithm compact WOtrigger memory compaction debug_statROthis file is used for zram debugging purposes +use_dedupRWshow and set deduplication feature User space is advised to use the following files to read the device statistics. diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index b8ecba6..2f3dd1f 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -13,3 +13,17 @@ config ZRAM disks and maybe many more. See zram.txt for more information. + +config ZRAM_DEDUP + bool "Deduplication support for ZRAM data" + depends on ZRAM + default n + help + Deduplicate ZRAM data to reduce amount of memory consumption. + Advantage largely depends on the workload. In some cases, this + option reduces memory usage to the half. However, if there is no + duplicated data, the amount of memory consumption would be + increased due to additional metadata usage. And, there is + computation time trade-off. Please check the benefit before + enabling this option. Experiment shows the positive effect when + the zram is used as blockdev and is used to store build output. 
diff --git a/drivers/block/zram/Makefile b/drivers/block/zram/Makefile index 29cb008..d7204ef 100644 --- a/drivers/block/zram/Makefile +++ b/drivers/block/zram/Makefile @@ -1,3 +1,4 @@ -zram-y := zcomp.o zram_drv.o zram_dedup.o +zram-y := zcomp.o zram_drv.o +zram-$(CONFIG_ZRAM_DEDUP) += zram_dedup.o obj-$(CONFIG_ZRAM) += zram.o diff --git a/drivers/block/zram/zram_dedup.c b/drivers/block/zram/zram_dedup.c index a8427f7..560b1f5 100644 --- a/drivers/block/zram/zram_dedup.c +++ b/drivers/block/zram/zram_dedup.c @@ -41,6 +41,9 @@ void zram_dedup_insert(struct zram *zram, struct zram_entry *new, struct rb_node **rb_node, *parent = NULL; struct zram_entry *entry; + if (!zram_dedup_enabled(zram)) + return; + new->checksum = checksum; hash = >hash[checksum % zram->hash_size]; rb_root = >rb_root; @@ -148,6 +151,9 @@ struct zram_entry *zram_dedup_find(struct zram *zram, struct page *page, void *mem; struct zram_entry *entry; + if (!zram_dedup_enabled(zram)) + return NULL; + mem = kmap_atomic(page); *checksum = zram_dedup_checksum(mem); @@ -160,6 +166,9 @@ struct zram_entry *zram_dedup_find(struct zram *zram, struct page *page, void zram_dedup_init_entry(struct zram *zram, struct zram_entry *entry, unsigned long handle, unsigned int len) { +
[PATCH v5 2/4] zram: implement deduplication in zram
From: Joonsoo Kim

This patch implements the deduplication feature in zram. The purpose of this work is naturally to save the amount of memory used by zram. Android is one of the biggest users of zram as swap, and it's really important to save memory there. There is a paper reporting that the duplication ratio of Android's memory content is rather high [1]. And there is similar work on zswap that also reports that experiments have shown around 10-15% of pages stored in zswap are duplicates, and deduplicating them provides some benefit [2].

Also, there is a different kind of workload that uses zram as a blockdev and stores build outputs on it to reduce the wear-out problem of a real blockdev. In this workload, the deduplication hit rate is very high due to temporary files and intermediate object files. Detailed analysis is at the bottom of this description.

Anyway, if we can detect duplicated content and avoid storing it at different memory locations, we can save memory. This patch tries to do that.

The implementation is almost simple and intuitive, but I should note one implementation detail. To check for duplication, this patch uses a checksum of the page, and collisions of this checksum are possible. There would be many ways to handle this situation, but this patch chooses to allow entries with a duplicated checksum to be added to the hash, but not to compare all entries with a duplicated checksum when checking for duplication. I guess that a checksum collision is quite a rare event and we don't need to pay much attention to such a case. Therefore, I decided on the simplest way to implement the feature. If there is a different opinion, I can accept it and go that way.

Following is the result of this patch.
Test result #1 (Swap): Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18 orig_data_size: 145297408 compr_data_size: 32408125 mem_used_total: 32276480 dup_data_size: 3188134 meta_data_size: 1444272 Last two metrics added to mm_stat are related to this work. First one, dup_data_size, is amount of saved memory by avoiding to store duplicated page. Later one, meta_data_size, is the amount of data structure to support deduplication. If dup > meta, we can judge that the patch improves memory usage. In Adnroid, we can save 5% of memory usage by this work. Test result #2 (Blockdev): build the kernel and store output to ext4 FS on zram Elapsed time: 249 s mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0 Elapsed time: 250 s mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038 3945792 There is no performance degradation and save 23% memory. Test result #3 (Blockdev): copy android build output dir(out/host) to ext4 FS on zram Elapsed time: out/host: 88 s mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0 Elapsed time: out/host: 100 s mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336 It shows performance degradation roughly 13% and save 24% memory. Maybe, it is due to overhead of calculating checksum and comparison. Test result #4 (Blockdev): copy android build output dir(out/target/common) to ext4 FS on zram Elapsed time: out/host: 203 s mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0 Elapsed time: out/host: 201 s mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336 Memory is saved by 42% and performance is the same. Even if there is overhead of calculating checksum and comparison, large hit ratio compensate it since hit leads to less compression attempt. I checked the detailed reason of savings on kernel build workload and there are some cases that deduplication happens. 
1) *.cmd Build command is usually similar in one directory so content of these file are very similar. In my system, more than 789 lines in fs/ext4/.namei.o.cmd and fs/ext4/.inode.o.cmd are the same in 944 and 938 lines of the file, respectively. 2) intermediate object files built-in.o and temporary object file have the similar contents. More than 50% of fs/ext4/ext4.o is the same with fs/ext4/built-in.o. 3) vmlinux .tmp_vmlinux1 and .tmp_vmlinux2 and arch/x86/boo/compressed/vmlinux.bin have the similar contents. Android test has similar case that some of object files(.class and .so) are similar with another ones. (./host/linux-x86/lib/libartd.so and ./host/linux-x86-lib/libartd-comiler.so) Anyway, benefit seems to be largely dependent on the workload so following patch will make this feature optional. However, this feature can help some usecases so is deserved to be merged. [1]: MemScope: Analyzing Memory Duplication on Android Systems, dl.acm.org/citation.cfm?id=2797023 [2]: zswap: Optimize compressed pool memory utilization, lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2 Reviewed-by: Sergey Senozhatsky Acked-by: Minchan Kim Signed-off-by: Joonsoo Kim
[PATCH v5 0/4] zram: implement deduplication in zram
From: Joonsoo Kim <iamjoonsoo@lge.com>

Changes from v4
o rebase on next-20170511
o collect Reviewed-by from Sergey

Changes from v3
o fix module build problem
o make zram_dedup_enabled() statically return false if CONFIG_ZRAM_DEDUP=n

Changes from v2
o rebase to latest zram code
o manage alloc/free of the zram_entry in zram_drv.c
o remove useless RB_CLEAR_NODE
o set RO permission for the use_dedup sysfs entry if CONFIG_ZRAM_DEDUP=n

Changes from v1
o reorganize dedup specific functions
o support Kconfig on/off
o update zram documents
o compare all the entries with same checksum (patch #4)

This patchset implements the deduplication feature in zram. The motivation is to save memory usage by zram. There is a detailed description of the motivation and experimental results in patch #2, so please refer to it.

Thanks.

Joonsoo Kim (4):
  zram: introduce zram_entry to prepare dedup functionality
  zram: implement deduplication in zram
  zram: make deduplication feature optional
  zram: compare all the entries with same checksum for deduplication

 Documentation/ABI/testing/sysfs-block-zram |  10 ++
 Documentation/blockdev/zram.txt            |   3 +
 drivers/block/zram/Kconfig                 |  14 ++
 drivers/block/zram/Makefile                |   3 +-
 drivers/block/zram/zram_dedup.c            | 254 +
 drivers/block/zram/zram_dedup.h            |  45 +
 drivers/block/zram/zram_drv.c              | 184 +
 drivers/block/zram/zram_drv.h              |  32 +++-
 8 files changed, 507 insertions(+), 38 deletions(-)
 create mode 100644 drivers/block/zram/zram_dedup.c
 create mode 100644 drivers/block/zram/zram_dedup.h

-- 
2.7.4
[PATCH v5 1/4] zram: introduce zram_entry to prepare dedup functionality
From: Joonsoo KimFollowing patch will implement deduplication functionality in the zram and it requires an indirection layer to manage the life cycle of zsmalloc handle. To prepare that, this patch introduces zram_entry which can be used to manage the life-cycle of zsmalloc handle. Many lines are changed due to rename but core change is just simple introduction about newly data structure. Reviewed-by: Sergey Senozhatsky Acked-by: Minchan Kim Signed-off-by: Joonsoo Kim --- drivers/block/zram/zram_drv.c | 95 +++ drivers/block/zram/zram_drv.h | 6 ++- 2 files changed, 66 insertions(+), 35 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index debee95..26dc4e5 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -57,14 +57,15 @@ static inline struct zram *dev_to_zram(struct device *dev) return (struct zram *)dev_to_disk(dev)->private_data; } -static unsigned long zram_get_handle(struct zram *zram, u32 index) +static struct zram_entry *zram_get_entry(struct zram *zram, u32 index) { - return zram->table[index].handle; + return zram->table[index].entry; } -static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle) +static void zram_set_entry(struct zram *zram, u32 index, + struct zram_entry *entry) { - zram->table[index].handle = handle; + zram->table[index].entry = entry; } /* flag operations require table entry bit_spin_lock() being held */ @@ -437,7 +438,7 @@ static bool zram_same_page_read(struct zram *zram, u32 index, unsigned int offset, unsigned int len) { zram_slot_lock(zram, index); - if (unlikely(!zram_get_handle(zram, index) || + if (unlikely(!zram_get_entry(zram, index) || zram_test_flag(zram, index, ZRAM_SAME))) { void *mem; @@ -476,6 +477,32 @@ static bool zram_same_page_write(struct zram *zram, u32 index, return false; } +static struct zram_entry *zram_entry_alloc(struct zram *zram, + unsigned int len, gfp_t flags) +{ + struct zram_entry *entry; + + entry = 
kzalloc(sizeof(*entry), + flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE)); + if (!entry) + return NULL; + + entry->handle = zs_malloc(zram->mem_pool, len, flags); + if (!entry->handle) { + kfree(entry); + return NULL; + } + + return entry; +} + +static inline void zram_entry_free(struct zram *zram, + struct zram_entry *entry) +{ + zs_free(zram->mem_pool, entry->handle); + kfree(entry); +} + static void zram_meta_free(struct zram *zram, u64 disksize) { size_t num_pages = disksize >> PAGE_SHIFT; @@ -514,7 +541,7 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize) */ static void zram_free_page(struct zram *zram, size_t index) { - unsigned long handle = zram_get_handle(zram, index); + struct zram_entry *entry = zram_get_entry(zram, index); /* * No memory is allocated for same element filled pages. @@ -527,23 +554,23 @@ static void zram_free_page(struct zram *zram, size_t index) return; } - if (!handle) + if (!entry) return; - zs_free(zram->mem_pool, handle); + zram_entry_free(zram, entry); atomic64_sub(zram_get_obj_size(zram, index), >stats.compr_data_size); atomic64_dec(>stats.pages_stored); - zram_set_handle(zram, index, 0); + zram_set_entry(zram, index, NULL); zram_set_obj_size(zram, index, 0); } static int zram_decompress_page(struct zram *zram, struct page *page, u32 index) { int ret; - unsigned long handle; + struct zram_entry *entry; unsigned int size; void *src, *dst; @@ -551,10 +578,10 @@ static int zram_decompress_page(struct zram *zram, struct page *page, u32 index) return 0; zram_slot_lock(zram, index); - handle = zram_get_handle(zram, index); + entry = zram_get_entry(zram, index); size = zram_get_obj_size(zram, index); - src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO); + src = zs_map_object(zram->mem_pool, entry->handle, ZS_MM_RO); if (size == PAGE_SIZE) { dst = kmap_atomic(page); memcpy(dst, src, PAGE_SIZE); @@ -568,7 +595,7 @@ static int zram_decompress_page(struct zram *zram, struct page *page, u32 index) kunmap_atomic(dst); 
zcomp_stream_put(zram->comp); } - zs_unmap_object(zram->mem_pool, handle); + zs_unmap_object(zram->mem_pool, entry->handle);
[PATCH v5 4/4] zram: compare all the entries with same checksum for deduplication
From: Joonsoo Kim

Until now, we compare just one entry with the same checksum when checking duplication, since it is the simplest way to implement. However, for completeness, checking all the entries is better, so this patch implements comparing all the entries with the same checksum. Since this event would be rare, there would be no performance loss.

Reviewed-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Signed-off-by: Joonsoo Kim
---
 drivers/block/zram/zram_dedup.c | 59 -
 1 file changed, 47 insertions(+), 12 deletions(-)

diff --git a/drivers/block/zram/zram_dedup.c b/drivers/block/zram/zram_dedup.c
index 560b1f5..14c4988 100644
--- a/drivers/block/zram/zram_dedup.c
+++ b/drivers/block/zram/zram_dedup.c
@@ -109,6 +109,51 @@ static unsigned long zram_dedup_put(struct zram *zram,
 	return entry->refcount;
 }
 
+static struct zram_entry *__zram_dedup_get(struct zram *zram,
+		struct zram_hash *hash, unsigned char *mem,
+		struct zram_entry *entry)
+{
+	struct zram_entry *tmp, *prev = NULL;
+	struct rb_node *rb_node;
+
+	/* find left-most entry with same checksum */
+	while ((rb_node = rb_prev(&entry->rb_node))) {
+		tmp = rb_entry(rb_node, struct zram_entry, rb_node);
+		if (tmp->checksum != entry->checksum)
+			break;
+
+		entry = tmp;
+	}
+
+again:
+	entry->refcount++;
+	atomic64_add(entry->len, &zram->stats.dup_data_size);
+	spin_unlock(&hash->lock);
+
+	if (prev)
+		zram_entry_free(zram, prev);
+
+	if (zram_dedup_match(zram, entry, mem))
+		return entry;
+
+	spin_lock(&hash->lock);
+	tmp = NULL;
+	rb_node = rb_next(&entry->rb_node);
+	if (rb_node)
+		tmp = rb_entry(rb_node, struct zram_entry, rb_node);
+
+	if (tmp && (tmp->checksum == entry->checksum)) {
+		prev = entry;
+		entry = tmp;
+		goto again;
+	}
+
+	spin_unlock(&hash->lock);
+	zram_entry_free(zram, entry);
+
+	return NULL;
+}
+
 static struct zram_entry *zram_dedup_get(struct zram *zram,
 		unsigned char *mem, u32 checksum)
 {
@@ -122,18 +167,8 @@ static struct zram_entry *zram_dedup_get(struct zram *zram,
 	rb_node = hash->rb_root.rb_node;
 	while (rb_node) {
 		entry = 
rb_entry(rb_node, struct zram_entry, rb_node);
-		if (checksum == entry->checksum) {
-			entry->refcount++;
-			atomic64_add(entry->len, &zram->stats.dup_data_size);
-			spin_unlock(&hash->lock);
-
-			if (zram_dedup_match(zram, entry, mem))
-				return entry;
-
-			zram_entry_free(zram, entry);
-
-			return NULL;
-		}
+		if (checksum == entry->checksum)
+			return __zram_dedup_get(zram, hash, mem, entry);
 
 		if (checksum < entry->checksum)
 			rb_node = rb_node->rb_left;
-- 
2.7.4
[PATCH v5 2/4] zram: implement deduplication in zram
From: Joonsoo Kim This patch implements deduplication feature in zram. The purpose of this work is naturally to save amount of memory usage by zram. Android is one of the biggest users to use zram as swap and it's really important to save amount of memory usage. There is a paper that reports that duplication ratio of Android's memory content is rather high [1]. And, there is a similar work on zswap that also reports that experiments has shown that around 10-15% of pages stored in zswp are duplicates and deduplicate them provides some benefits [2]. Also, there is a different kind of workload that uses zram as blockdev and store build outputs into it to reduce wear-out problem of real blockdev. In this workload, deduplication hit is very high due to temporary files and intermediate object files. Detailed analysis is on the bottom of this description. Anyway, if we can detect duplicated content and avoid to store duplicated content at different memory space, we can save memory. This patch tries to do that. Implementation is almost simple and intuitive but I should note one thing about implementation detail. To check duplication, this patch uses checksum of the page and collision of this checksum could be possible. There would be many choices to handle this situation but this patch chooses to allow entry with duplicated checksum to be added to the hash, but, not to compare all entries with duplicated checksum when checking duplication. I guess that checksum collision is quite rare event and we don't need to pay any attention to such a case. Therefore, I decided the most simplest way to implement the feature. If there is a different opinion, I can accept and go that way. Following is the result of this patch. 
Test result #1 (Swap): Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18 orig_data_size: 145297408 compr_data_size: 32408125 mem_used_total: 32276480 dup_data_size: 3188134 meta_data_size: 1444272 Last two metrics added to mm_stat are related to this work. First one, dup_data_size, is amount of saved memory by avoiding to store duplicated page. Later one, meta_data_size, is the amount of data structure to support deduplication. If dup > meta, we can judge that the patch improves memory usage. In Adnroid, we can save 5% of memory usage by this work. Test result #2 (Blockdev): build the kernel and store output to ext4 FS on zram Elapsed time: 249 s mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0 Elapsed time: 250 s mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038 3945792 There is no performance degradation and save 23% memory. Test result #3 (Blockdev): copy android build output dir(out/host) to ext4 FS on zram Elapsed time: out/host: 88 s mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0 Elapsed time: out/host: 100 s mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336 It shows performance degradation roughly 13% and save 24% memory. Maybe, it is due to overhead of calculating checksum and comparison. Test result #4 (Blockdev): copy android build output dir(out/target/common) to ext4 FS on zram Elapsed time: out/host: 203 s mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0 Elapsed time: out/host: 201 s mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336 Memory is saved by 42% and performance is the same. Even if there is overhead of calculating checksum and comparison, large hit ratio compensate it since hit leads to less compression attempt. I checked the detailed reason of savings on kernel build workload and there are some cases that deduplication happens. 
1) *.cmd Build command is usually similar in one directory so content of these file are very similar. In my system, more than 789 lines in fs/ext4/.namei.o.cmd and fs/ext4/.inode.o.cmd are the same in 944 and 938 lines of the file, respectively. 2) intermediate object files built-in.o and temporary object file have the similar contents. More than 50% of fs/ext4/ext4.o is the same with fs/ext4/built-in.o. 3) vmlinux .tmp_vmlinux1 and .tmp_vmlinux2 and arch/x86/boo/compressed/vmlinux.bin have the similar contents. Android test has similar case that some of object files(.class and .so) are similar with another ones. (./host/linux-x86/lib/libartd.so and ./host/linux-x86-lib/libartd-comiler.so) Anyway, benefit seems to be largely dependent on the workload so following patch will make this feature optional. However, this feature can help some usecases so is deserved to be merged. [1]: MemScope: Analyzing Memory Duplication on Android Systems, dl.acm.org/citation.cfm?id=2797023 [2]: zswap: Optimize compressed pool memory utilization, lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2 Reviewed-by: Sergey Senozhatsky Acked-by: Minchan Kim Signed-off-by: Joonsoo Kim --- Documentation/blockdev/zram.txt | 2 + drivers/block/zram/Makefile | 2 +-
[PATCH v5 0/4] zram: implement deduplication in zram
From: Joonsoo Kim

Changes from v4
o rebase on next-20170511
o collect Reviewed-by from Sergey

Changes from v3
o fix module build problem
o make zram_dedup_enabled() statically return false if CONFIG_ZRAM_DEDUP=n

Changes from v2
o rebase to latest zram code
o manage alloc/free of the zram_entry in zram_drv.c
o remove useless RB_CLEAR_NODE
o set RO permission for the use_dedup sysfs entry if CONFIG_ZRAM_DEDUP=n

Changes from v1
o reorganize dedup-specific functions
o support Kconfig on/off
o update zram documents
o compare all the entries with same checksum (patch #4)

This patchset implements a deduplication feature in zram. The motivation is to save memory used by zram. There is a detailed description of the motivation and experimental results in patch #2, so please refer to it.

Thanks.

Joonsoo Kim (4):
zram: introduce zram_entry to prepare dedup functionality
zram: implement deduplication in zram
zram: make deduplication feature optional
zram: compare all the entries with same checksum for deduplication

Documentation/ABI/testing/sysfs-block-zram | 10 ++
Documentation/blockdev/zram.txt| 3 +
drivers/block/zram/Kconfig | 14 ++
drivers/block/zram/Makefile| 3 +-
drivers/block/zram/zram_dedup.c| 254 +
drivers/block/zram/zram_dedup.h| 45 +
drivers/block/zram/zram_drv.c | 184 +
drivers/block/zram/zram_drv.h | 32 +++-
8 files changed, 507 insertions(+), 38 deletions(-)
create mode 100644 drivers/block/zram/zram_dedup.c
create mode 100644 drivers/block/zram/zram_dedup.h
--
2.7.4
[PATCH v5 1/4] zram: introduce zram_entry to prepare dedup functionality
From: Joonsoo Kim

The following patch will implement deduplication functionality in zram, and it requires an indirection layer to manage the life cycle of zsmalloc handles. To prepare for that, this patch introduces zram_entry, which can be used to manage the life cycle of a zsmalloc handle. Many lines are changed due to the rename, but the core change is just the introduction of the new data structure.

Reviewed-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Signed-off-by: Joonsoo Kim
---
drivers/block/zram/zram_drv.c | 95 +++
drivers/block/zram/zram_drv.h | 6 ++-
2 files changed, 66 insertions(+), 35 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index debee95..26dc4e5 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -57,14 +57,15 @@ static inline struct zram *dev_to_zram(struct device *dev)
 return (struct zram *)dev_to_disk(dev)->private_data;
 }
-static unsigned long zram_get_handle(struct zram *zram, u32 index)
+static struct zram_entry *zram_get_entry(struct zram *zram, u32 index)
 {
- return zram->table[index].handle;
+ return zram->table[index].entry;
 }
-static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle)
+static void zram_set_entry(struct zram *zram, u32 index,
+ struct zram_entry *entry)
 {
- zram->table[index].handle = handle;
+ zram->table[index].entry = entry;
 }
 /* flag operations require table entry bit_spin_lock() being held */
@@ -437,7 +438,7 @@ static bool zram_same_page_read(struct zram *zram, u32 index,
 unsigned int offset, unsigned int len)
 {
 zram_slot_lock(zram, index);
- if (unlikely(!zram_get_handle(zram, index) ||
+ if (unlikely(!zram_get_entry(zram, index) ||
 zram_test_flag(zram, index, ZRAM_SAME))) {
 void *mem;
@@ -476,6 +477,32 @@ static bool zram_same_page_write(struct zram *zram, u32 index,
 return false;
 }
+static struct zram_entry *zram_entry_alloc(struct zram *zram,
+ unsigned int len, gfp_t flags)
+{
+ struct zram_entry *entry;
+
+ entry = kzalloc(sizeof(*entry),
+ flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+ if (!entry)
+ return NULL;
+
+ entry->handle = zs_malloc(zram->mem_pool, len, flags);
+ if (!entry->handle) {
+ kfree(entry);
+ return NULL;
+ }
+
+ return entry;
+}
+
+static inline void zram_entry_free(struct zram *zram,
+ struct zram_entry *entry)
+{
+ zs_free(zram->mem_pool, entry->handle);
+ kfree(entry);
+}
+
 static void zram_meta_free(struct zram *zram, u64 disksize)
 {
 size_t num_pages = disksize >> PAGE_SHIFT;
@@ -514,7 +541,7 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 */
 static void zram_free_page(struct zram *zram, size_t index)
 {
- unsigned long handle = zram_get_handle(zram, index);
+ struct zram_entry *entry = zram_get_entry(zram, index);
 /*
 * No memory is allocated for same element filled pages.
@@ -527,23 +554,23 @@ static void zram_free_page(struct zram *zram, size_t index)
 return;
 }
- if (!handle)
+ if (!entry)
 return;
- zs_free(zram->mem_pool, handle);
+ zram_entry_free(zram, entry);
 atomic64_sub(zram_get_obj_size(zram, index),
 &zram->stats.compr_data_size);
 atomic64_dec(&zram->stats.pages_stored);
- zram_set_handle(zram, index, 0);
+ zram_set_entry(zram, index, NULL);
 zram_set_obj_size(zram, index, 0);
 }
 static int zram_decompress_page(struct zram *zram, struct page *page, u32 index)
 {
 int ret;
- unsigned long handle;
+ struct zram_entry *entry;
 unsigned int size;
 void *src, *dst;
@@ -551,10 +578,10 @@ static int zram_decompress_page(struct zram *zram, struct page *page, u32 index)
 return 0;
 zram_slot_lock(zram, index);
- handle = zram_get_handle(zram, index);
+ entry = zram_get_entry(zram, index);
 size = zram_get_obj_size(zram, index);
- src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
+ src = zs_map_object(zram->mem_pool, entry->handle, ZS_MM_RO);
 if (size == PAGE_SIZE) {
 dst = kmap_atomic(page);
 memcpy(dst, src, PAGE_SIZE);
@@ -568,7 +595,7 @@ static int zram_decompress_page(struct zram *zram, struct page *page, u32 index)
 kunmap_atomic(dst);
 zcomp_stream_put(zram->comp);
 }
- zs_unmap_object(zram->mem_pool, handle);
+ zs_unmap_object(zram->mem_pool, entry->handle);
 zram_slot_unlock(zram, index);
 /* Should NEVER happen. Return bio error if it does. */
@@ -612,14 +639,14 @@
[PATCH v5 4/4] zram: compare all the entries with same checksum for deduplication
From: Joonsoo Kim

Until now, we compare just one entry with the same checksum when checking for duplication, since that is the simplest way to implement it. However, for completeness, checking all the entries is better, so this patch implements comparing all the entries with the same checksum. Since this event is rare, there should be no performance loss.

Reviewed-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Signed-off-by: Joonsoo Kim
---
drivers/block/zram/zram_dedup.c | 59 -
1 file changed, 47 insertions(+), 12 deletions(-)

diff --git a/drivers/block/zram/zram_dedup.c b/drivers/block/zram/zram_dedup.c
index 560b1f5..14c4988 100644
--- a/drivers/block/zram/zram_dedup.c
+++ b/drivers/block/zram/zram_dedup.c
@@ -109,6 +109,51 @@ static unsigned long zram_dedup_put(struct zram *zram,
 return entry->refcount;
 }
+static struct zram_entry *__zram_dedup_get(struct zram *zram,
+ struct zram_hash *hash, unsigned char *mem,
+ struct zram_entry *entry)
+{
+ struct zram_entry *tmp, *prev = NULL;
+ struct rb_node *rb_node;
+
+ /* find left-most entry with same checksum */
+ while ((rb_node = rb_prev(&entry->rb_node))) {
+ tmp = rb_entry(rb_node, struct zram_entry, rb_node);
+ if (tmp->checksum != entry->checksum)
+ break;
+
+ entry = tmp;
+ }
+
+again:
+ entry->refcount++;
+ atomic64_add(entry->len, &zram->stats.dup_data_size);
+ spin_unlock(&hash->lock);
+
+ if (prev)
+ zram_entry_free(zram, prev);
+
+ if (zram_dedup_match(zram, entry, mem))
+ return entry;
+
+ spin_lock(&hash->lock);
+ tmp = NULL;
+ rb_node = rb_next(&entry->rb_node);
+ if (rb_node)
+ tmp = rb_entry(rb_node, struct zram_entry, rb_node);
+
+ if (tmp && (tmp->checksum == entry->checksum)) {
+ prev = entry;
+ entry = tmp;
+ goto again;
+ }
+
+ spin_unlock(&hash->lock);
+ zram_entry_free(zram, entry);
+
+ return NULL;
+}
+
 static struct zram_entry *zram_dedup_get(struct zram *zram,
 unsigned char *mem, u32 checksum)
 {
@@ -122,18 +167,8 @@ static struct zram_entry *zram_dedup_get(struct zram *zram,
 rb_node = hash->rb_root.rb_node;
 while (rb_node) {
 entry = rb_entry(rb_node, struct zram_entry, rb_node);
- if (checksum == entry->checksum) {
- entry->refcount++;
- atomic64_add(entry->len, &zram->stats.dup_data_size);
- spin_unlock(&hash->lock);
-
- if (zram_dedup_match(zram, entry, mem))
- return entry;
-
- zram_entry_free(zram, entry);
-
- return NULL;
- }
+ if (checksum == entry->checksum)
+ return __zram_dedup_get(zram, hash, mem, entry);
 if (checksum < entry->checksum)
 rb_node = rb_node->rb_left;
--
2.7.4
Re: [PATCH] mm: per-cgroup memory reclaim stats
On Thu, 2017-05-11 at 20:16 +0100, Roman Gushchin wrote: > Track the following reclaim counters for every memory cgroup: > PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and > PGLAZYFREED. > > These values are exposed using the memory.stats interface of cgroup v2. The changelog could also describe what is currently there and what is missing. Is there a user space program that uses the newly exported values? Or is this just for consistency of stats? > > The meaning of each value is the same as for global counters, > available using /proc/vmstat. > > Also, for consistency, rename mem_cgroup_count_vm_event() to > count_memcg_event_mm(). > I still prefer the mem_cgroup_count_vm_event() name, or memcg_count_vm_event(), the namespace upfront makes it easier to parse where to look for the the implementation and also grep. In any case the rename should be independent patch, but I don't like the name you've proposed. > Signed-off-by: Roman Gushchin> Suggested-by: Johannes Weiner > Cc: Johannes Weiner > Cc: Tejun Heo > Cc: Li Zefan > Cc: Michal Hocko > Cc: Vladimir Davydov > Cc: cgro...@vger.kernel.org > Cc: linux-...@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: linux...@kvack.org > --- > Documentation/cgroup-v2.txt | 28 ++ > fs/dax.c| 2 +- > fs/ncpfs/mmap.c | 2 +- > include/linux/memcontrol.h | 48 > ++--- > mm/filemap.c| 2 +- > mm/memcontrol.c | 10 ++ > mm/memory.c | 4 ++-- > mm/shmem.c | 3 +-- > mm/swap.c | 1 + > mm/vmscan.c | 30 +--- > 10 files changed, 113 insertions(+), 17 deletions(-) > > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt > index e50b95c..5804355 100644 > --- a/Documentation/cgroup-v2.txt > +++ b/Documentation/cgroup-v2.txt > @@ -918,6 +918,34 @@ PAGE_SIZE multiple when read back. 
> > Number of major page faults incurred > > + pgrefill > + > + Amount of scanned pages (in an active LRU list) > + > + pgscan > + > + Amount of scanned pages (in an inactive LRU list) > + > + pgsteal > + > + Amount of reclaimed pages > + > + pgactivate > + > + Amount of pages moved to the active LRU list > + > + pgdeactivate > + > + Amount of pages moved to the inactive LRU lis > + > + pglazyfree > + > + Amount of pages postponed to be freed under memory pressure > + > + pglazyfreed > + > + Amount of reclaimed lazyfree pages > + >memory.swap.current > > A read-only single value file which exists on non-root > diff --git a/fs/dax.c b/fs/dax.c > index 66d7906..9aac521d 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -1230,7 +1230,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, > case IOMAP_MAPPED: > if (iomap.flags & IOMAP_F_NEW) { > count_vm_event(PGMAJFAULT); > - mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT); > + count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); > major = VM_FAULT_MAJOR; > } > error = dax_insert_mapping(mapping, iomap.bdev, iomap.dax_dev, > diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c > index 0c3905e..6719c0b 100644 > --- a/fs/ncpfs/mmap.c > +++ b/fs/ncpfs/mmap.c > @@ -89,7 +89,7 @@ static int ncp_file_mmap_fault(struct vm_fault *vmf) >* -- nyc >*/ > count_vm_event(PGMAJFAULT); > - mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT); > + count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); > return VM_FAULT_MAJOR; > } > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 899949b..b2a5b1c 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -357,6 +357,17 @@ static inline unsigned short mem_cgroup_id(struct > mem_cgroup *memcg) > } > struct mem_cgroup *mem_cgroup_from_id(unsigned short id); > > +static inline struct mem_cgroup *lruvec_memcg(struct lruvec *lruvec) mem_cgroup_from_lruvec()? 
> +{ > + struct mem_cgroup_per_node *mz; > + > + if (mem_cgroup_disabled()) > + return NULL; > + > + mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); > + return mz->memcg; > +} > + > /** > * parent_mem_cgroup - find the accounting parent of a memcg > * @memcg: memcg whose parent to find > @@ -546,8 +557,23 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t > *pgdat, int order, > gfp_t gfp_mask, > unsigned long *total_scanned); > >
[PATCH 2/2] mm: swap: move anonymous THP split logic to vmscan
add_to_swap() aims to allocate swap space (ie, a swap slot and swapcache), so if it fails due to lack of space (e.g. for a THP, or an hdd swap device asked to do a THP swapout), the *caller*, rather than add_to_swap() itself, should split the THP page and retry with base pages, which is more natural.

Cc: Johannes Weiner
Signed-off-by: Minchan Kim
---
include/linux/swap.h | 4 ++--
mm/swap_state.c | 23 ++-
mm/vmscan.c | 17 -
3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8f12f67e869f..87cca2169d44 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -359,7 +359,7 @@ extern struct address_space *swapper_spaces[];
 >> SWAP_ADDRESS_SPACE_SHIFT])
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
-extern int add_to_swap(struct page *, struct list_head *list);
+extern int add_to_swap(struct page *);
 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
 extern void __delete_from_swap_cache(struct page *);
@@ -479,7 +479,7 @@ static inline struct page *lookup_swap_cache(swp_entry_t swp)
 return NULL;
 }
-static inline int add_to_swap(struct page *page, struct list_head *list)
+static inline int add_to_swap(struct page *page)
 {
 return 0;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0ad214d7a7ad..9c71b6b2562f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -184,7 +184,7 @@ void __delete_from_swap_cache(struct page *page)
 * Allocate swap space for the page and add the page to the
 * swap cache. Caller needs to hold the page lock.
 */
-int add_to_swap(struct page *page, struct list_head *list)
+int add_to_swap(struct page *page)
 {
 swp_entry_t entry;
 int err;
@@ -192,12 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 VM_BUG_ON_PAGE(!PageLocked(page), page);
 VM_BUG_ON_PAGE(!PageUptodate(page), page);
-retry:
 entry = get_swap_page(page);
 if (!entry.val)
- goto fail;
+ return 0;
+
 if (mem_cgroup_try_charge_swap(page, entry))
- goto fail_free;
+ goto fail;
 /*
 * Radix-tree node allocations from PF_MEMALLOC contexts could
@@ -218,23 +218,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 * clear SWAP_HAS_CACHE flag.
 */
- goto fail_free;
-
- if (PageTransHuge(page)) {
- err = split_huge_page_to_list(page, list);
- if (err) {
- delete_from_swap_cache(page);
- return 0;
- }
- }
+ goto fail;
 return 1;
-fail_free:
- put_swap_page(page, entry);
 fail:
- if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
- goto retry;
+ put_swap_page(page, entry);
 return 0;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 57268c0c8fcf..767e20856080 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1125,8 +1125,23 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 !PageSwapCache(page)) {
 if (!(sc->gfp_mask & __GFP_IO))
 goto keep_locked;
- if (!add_to_swap(page, page_list))
+ if (!add_to_swap(page)) {
+ if (!PageTransHuge(page))
+ goto activate_locked;
+ /* Split THP and swap individual base pages */
+ if (split_huge_page_to_list(page, page_list))
+ goto activate_locked;
+ if (!add_to_swap(page))
+ goto activate_locked;
+ }
+
+ /* XXX: We don't support THP writes */
+ if (PageTransHuge(page) &&
+ split_huge_page_to_list(page, page_list)) {
+ delete_from_swap_cache(page);
 goto activate_locked;
+ }
+
 may_enter_fs = 1;
 /* Adding to swap updated mapping */
--
2.7.4
[PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page
Now get_swap_page() takes a struct page and allocates swap space according to the page size (ie, normal or THP), so it is cleaner to introduce put_swap_page() as its counterpart. It then calls the right swap slot free function depending on the page's size.

Cc: Johannes Weiner
Signed-off-by: Minchan Kim
---
include/linux/swap.h | 4 ++--
mm/shmem.c | 2 +-
mm/swap_state.c | 13 +++--
mm/swapfile.c| 8
mm/vmscan.c | 2 +-
5 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b60fea3748f8..8f12f67e869f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -393,6 +393,7 @@ static inline long get_nr_swap_pages(void)
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(struct page *page);
+extern void put_swap_page(struct page *page, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int get_swap_pages(int n, bool cluster, swp_entry_t swp_entries[]);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
@@ -400,7 +401,6 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
@@ -459,7 +459,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
-static inline void swapcache_free(swp_entry_t swp)
+static inline void put_swap_page(struct page *page, swp_entry_t swp)
 {
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 29948d7da172..82158edaefdb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1326,7 +1326,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 mutex_unlock(&shmem_swaplist_mutex);
 free_swap:
- swapcache_free(swap);
+ put_swap_page(page, swap);
 redirty:
 set_page_dirty(page);
 if (wbc->for_reclaim)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 16ff89d058f4..0ad214d7a7ad 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -231,10 +231,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 return 1;
 fail_free:
- if (PageTransHuge(page))
- swapcache_free_cluster(entry);
- else
- swapcache_free(entry);
+ put_swap_page(page, entry);
 fail:
 if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
 goto retry;
@@ -259,11 +256,7 @@ void delete_from_swap_cache(struct page *page)
 __delete_from_swap_cache(page);
 spin_unlock_irq(&address_space->tree_lock);
- if (PageTransHuge(page))
- swapcache_free_cluster(entry);
- else
- swapcache_free(entry);
-
+ put_swap_page(page, entry);
 page_ref_sub(page, hpage_nr_pages(page));
 }
@@ -415,7 +408,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 * clear SWAP_HAS_CACHE flag.
 */
- swapcache_free(entry);
+ put_swap_page(new_page, entry);
 } while (err != -ENOMEM);
 if (new_page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 596306272059..b65e49428090 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
 }
 #endif /* CONFIG_THP_SWAP */
+void put_swap_page(struct page *page, swp_entry_t entry)
+{
+ if (!PageTransHuge(page))
+ swapcache_free(entry);
+ else
+ swapcache_free_cluster(entry);
+}
+
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 const swp_entry_t *e1 = ent1, *e2 = ent2;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ebf468c5429..57268c0c8fcf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -708,7 +708,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 mem_cgroup_swapout(page, swap);
 __delete_from_swap_cache(page);
 spin_unlock_irqrestore(&mapping->tree_lock, flags);
- swapcache_free(swap);
+ put_swap_page(page, swap);
 } else {
 void (*freepage)(struct page *);
 void *shadow = NULL;
--
2.7.4
Re: [PATCH v8 05/10] powerpc/perf: IMC pmu cpumask and cpuhotplug support
Madhavan Srinivasan writes: >> * in patch 9 should opal_imc_counters_init return something other >>than OPAL_SUCCESS in the case of invalid arguments? Maybe >>OPAL_PARAMETER? (I think you fix this in a later patch anyway?) > > So, init call will return OPAL_PARAMETER for the unsupported > domains (core and nest are supported). And if the init operation > fails for any reason, it would return OPAL_HARDWARE. And this is > documented. (I'll comment on the skiboot one too), but I think that if the class exists but init is a no-op, then OPAL_IMC_COUNTERS_INIT should return OPAL_SUCCESS and just do nothing. This future-proofs everything, and the API is that one *must* call _INIT before start. -- Stewart Smith OPAL Architect, IBM.
Re: [v6 PATCH 07/21] x86/insn-eval: Add utility function to get segment descriptor
On Thu, 2017-05-04 at 13:02 +0200, Borislav Petkov wrote: > On Wed, Apr 26, 2017 at 02:51:56PM -0700, Ricardo Neri wrote: > > > > +seg >= > > > > current->active_mm->context.ldt->size)) { > > > > > > ldt->size is the size of the descriptor table but you've shifted seg by > > > 3. That selector index is shifted by 3 (to the left) to form an offset > > > into the descriptor table because the entries there are 8 bytes. > > > > I double-checked the ldt code and it seems to me that size refers to the > > number of entries in the table; it is always multiplied by > > LDT_ENTRY_SIZE [1], [2]. Am I missing something? > > No, you're not. I fell into that wrongly named struct member trap. > > So ldt_struct.size should actually be called ldt_struct.n_entries or > similar. Because what's in there is now is not "size". > > And then code like > > new_ldt->size * LDT_ENTRY_SIZE > > would make much more sense if written like this: > > new_ldt->n_entries * LDT_ENTRY_SIZE > > Would you fix that in a prepatch pls? > Sure I can. Would this trigger a v8 of my series? I was hoping the v7 series could be merged and that I could then do incremental work on top of it. Does that make sense? Thanks and BR, Ricardo
RE: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf
Hi Alex and Gerd, >-Original Message- >From: intel-gvt-dev [mailto:intel-gvt-dev-boun...@lists.freedesktop.org] On >Behalf Of Alex Williamson >Sent: Thursday, May 11, 2017 11:45 PM >To: Gerd Hoffmann >Cc: Tian, Kevin ; intel-...@lists.freedesktop.org; linux- >ker...@vger.kernel.org; zhen...@linux.intel.com; Lv, Zhiyuan > ; Chen, Xiaoguang ; intel- >gvt-...@lists.freedesktop.org; Wang, Zhi A >Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf > >On Thu, 11 May 2017 15:27:53 +0200 >Gerd Hoffmann wrote: > >> Hi, >> >> > While read the framebuffer region we have to tell the vendor driver which >framebuffer we want to read? There are two framebuffers now in KVMGT that is >primary and cursor. >> > There are two methods to implement this: >> > 1) write the plane id first and then read the framebuffer. >> > 2) create 2 vfio regions one for primary and one for cursor. >> >> (3) Place information for both planes into one vfio region. >> Which allows to fetch both with a single read() syscall. >> >> The question is how you'll get the file descriptor then. If the ioctl >> returns the dma-buf fd only you have a racy interface: Things can >> change between read(vfio-region) and ioctl(need-dmabuf-fd). >> >> ioctl(need-dma-buf) could return both dmabuf fd and plane info to fix >> the race, but then it is easier to go with ioctl only interface >> (simliar to the orginal one from dec last year) I think. > >If the dmabuf fd is provided by a separate mdev vendor driver specific ioctl, I >don't see how vfio regions should be involved. Selecting which framebuffer >should be an ioctl parameter. Based on your last mail, I think the implementation looks like this: 1) the user queries the framebuffer information by reading the vfio region. 2) if the framebuffer changed (such as the framebuffer's graphics address changed, size changed, etc.) we will need to create a new dmabuf fd. 3) create a new dmabuf fd using a vfio device specific ioctl. 
>What sort of information needs to be conveyed >about each plane? Only the plane id is needed. >Is it static information or something that needs to be read >repeatedly? It is static information. For our case plane id 1 represents the primary plane and 3 the cursor plane. 2 means the sprite plane, which will not be used in our case. >Do we need it before we get the dmabuf fd or can it be an ioctl on >the dmabuf fd? We need it while querying the framebuffer. In the kernel we need the plane id to decide which plane we should decode. Below is my current implementation: 1) the user first queries the framebuffer (primary or cursor); the kernel decodes the framebuffer, returns the framebuffer information to the user and also saves a copy in the kernel. 2) the user compares the framebuffer and, if the framebuffer changed, creates a new dmabuf fd. 3) the kernel creates a new dmabuf fd based on the saved framebuffer information. So we need the plane id in step 1. In step 3 we create a dmabuf fd using only the saved framebuffer information (no other information is needed). Chenxg. >Thanks, > >Alex >___ >intel-gvt-dev mailing list >intel-gvt-...@lists.freedesktop.org >https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
Re: [v6 PATCH 08/21] x86/insn-eval: Add utility function to get segment descriptor base address
On Fri, 2017-05-05 at 19:19 +0200, Borislav Petkov wrote: > On Wed, Apr 26, 2017 at 03:37:44PM -0700, Ricardo Neri wrote: > > I need a human-readable way of identifying what segment selector (in > > pt_regs, vm86regs or directly reading the segment registers) to use. > > Since there is a segment override prefix for all of them, I thought I > > could use them. > > Yes, you should... > > > Perhaps I can rename enum segment to enum segment_selector and comment > > that the values in the enum are those of the override prefixes. Would > > that be reasonable? > > ... but you should call them what they are: "enum seg_override_pfxs" or > "enum seg_ovr_pfx" or... > > Or somesuch. I suck at naming stuff. In my v7, I simply named my enumeration enum segment_register, which is what they are. Some of its entries happen to have the value of the segment override prefixes, but it also has special entries such as SEG_REG_INVAL for errors and SEG_REG_IGNORE for long mode [1]. Thanks and BR, Ricardo [1]. https://lkml.org/lkml/2017/5/5/405
Re: [v6 PATCH 08/21] x86/insn-eval: Add utility function to get segment descriptor base address
On Fri, 2017-05-05 at 19:28 +0200, Borislav Petkov wrote: > On Wed, Apr 26, 2017 at 03:52:41PM -0700, Ricardo Neri wrote: > > Probably insn_get_seg_base() itself can verify if there are segment > > override prefixes in the struct insn. If yes, use them except for > > specific cases such as CS. > > ... and depending on whether in long mode or not. Yes, in my v7 I ignore the segment register if we are in long mode [1]. > > > On an unrelated note, I still have the problem of using DS vs ES for > > string instructions. Perhaps instead of a use_default_seg flag, a > > string_instruction flag that indicates how to determine the default > > segment. > > ... or you can look at the insn opcode directly. AFAICT, you need > to check whether the opcode is 0xa4 or 0xa5 and that the insn is a > single-byte opcode, i.e., not from the secondary map escaped with 0xf or > some of the other multi-byte opcode maps. In my v7, I have added a section to my function resolve_seg_register() that ignores segment overrides when it sees a string instruction with the register EDI, and defaults to ES. If the register is EIP, it defaults to CS. To determine if an instruction is a string instruction I check the size of the opcode and the opcodes that you mention, plus others, based on the Intel Software Developer's Manual [2]. [1]. https://lkml.org/lkml/2017/5/5/405 [2]. https://lkml.org/lkml/2017/5/5/410 Thanks and BR, Ricardo > > -- > Regards/Gruss, > Boris. > > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB > 21284 (AG Nürnberg)
[PATCH 3/5] x86: msr-index.h: fix shifts to ULL results in HWP macros.
From: Len Brown x = 1 ulong_long = x << 32; results in: warning: left shift count >= width of type x = 8 ulong_long = x << 24; results in a sign-extended ulong_long Cast x to unsigned long long in these macros to prevent these errors. Signed-off-by: Len Brown --- arch/x86/include/asm/msr-index.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 50c0c3204a92..6da2b30781ff 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -238,13 +238,13 @@ #define HWP_MIN_PERF(x) (x & 0xff) #define HWP_MAX_PERF(x) ((x & 0xff) << 8) #define HWP_DESIRED_PERF(x) ((x & 0xff) << 16) -#define HWP_ENERGY_PERF_PREFERENCE(x) ((x & 0xff) << 24) +#define HWP_ENERGY_PERF_PREFERENCE(x) (((unsigned long long) x & 0xff) << 24) #define HWP_EPP_PERFORMANCE 0x00 #define HWP_EPP_BALANCE_PERFORMANCE 0x80 #define HWP_EPP_BALANCE_POWERSAVE 0xC0 #define HWP_EPP_POWERSAVE 0xFF -#define HWP_ACTIVITY_WINDOW(x) ((x & 0xff3) << 32) -#define HWP_PACKAGE_CONTROL(x) ((x & 0x1) << 42) +#define HWP_ACTIVITY_WINDOW(x) ((unsigned long long)(x & 0xff3) << 32) +#define HWP_PACKAGE_CONTROL(x) ((unsigned long long)(x & 0x1) << 42) /* IA32_HWP_STATUS */ #define HWP_GUARANTEED_CHANGE(x) (x & 0x1) -- 2.11.0.161.g6610af872
[PATCH 5/5] intel_pstate: use updated msr-index.h HWP.EPP values
From: Len Brown intel_pstate exports sysfs attributes for setting and observing HWP.EPP. These attributes use strings to describe 4 operating states, and inside the driver, these strings are mapped to numerical register values. The authoritative mapping between the strings and numerical HWP.EPP values is now globally defined in msr-index.h, replacing the outdated mappings that were open-coded into intel_pstate.c new old string --- --- -- 0 0 performance 128 64 balance_performance 192 128 balance_power 255 192 power Note that the HW and BIOS default value on most systems is 128, which intel_pstate will now call "balance_performance" while it used to call it "balance_power". Signed-off-by: Len Brown --- drivers/cpufreq/intel_pstate.c | 34 +++--- 1 file changed, 15 insertions(+), 19 deletions(-) diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index 50bd6d987fc3..ab8ebaeb3621 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -716,6 +716,12 @@ static const char * const energy_perf_strings[] = { "power", NULL }; +static const unsigned int epp_values[] = { + HWP_EPP_PERFORMANCE, + HWP_EPP_BALANCE_PERFORMANCE, + HWP_EPP_BALANCE_POWERSAVE, + HWP_EPP_POWERSAVE +}; static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data) { @@ -727,17 +733,14 @@ static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data) return epp; if (static_cpu_has(X86_FEATURE_HWP_EPP)) { - /* -* Range: -* 0x00-0x3F : Performance -* 0x40-0x7F : Balance performance -* 0x80-0xBF : Balance power -* 0xC0-0xFF : Power -* The EPP is a 8 bit value, but our ranges restrict the -* value which can be set. Here only using top two bits -* effectively. 
-*/ - index = (epp >> 6) + 1; + if (epp == HWP_EPP_PERFORMANCE) + return 1; + if (epp <= HWP_EPP_BALANCE_PERFORMANCE) + return 2; + if (epp <= HWP_EPP_BALANCE_POWERSAVE) + return 3; + else + return 4; } else if (static_cpu_has(X86_FEATURE_EPB)) { /* * Range: @@ -775,15 +778,8 @@ static int intel_pstate_set_energy_pref_index(struct cpudata *cpu_data, value &= ~GENMASK_ULL(31, 24); - /* -* If epp is not default, convert from index into -* energy_perf_strings to epp value, by shifting 6 -* bits left to use only top two bits in epp. -* The resultant epp need to shifted by 24 bits to -* epp position in MSR_HWP_REQUEST. -*/ if (epp == -EINVAL) - epp = (pref_index - 1) << 6; + epp = epp_values[pref_index - 1]; value |= (u64)epp << 24; ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value); -- 2.11.0.161.g6610af872
[PATCH 1/5] x86: msr-index.h: define EPB mid-points
From: Len Brown These are currently open-coded into intel_pstate.c Signed-off-by: Len Brown --- arch/x86/include/asm/msr-index.h | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 710273c617b8..a92d9bd154f6 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -462,9 +462,11 @@ #define MSR_MISC_PWR_MGMT 0x01aa #define MSR_IA32_ENERGY_PERF_BIAS 0x01b0 -#define ENERGY_PERF_BIAS_PERFORMANCE 0 -#define ENERGY_PERF_BIAS_NORMAL 6 -#define ENERGY_PERF_BIAS_POWERSAVE 15 +#define ENERGY_PERF_BIAS_PERFORMANCE 0 +#define ENERGY_PERF_BIAS_BALANCE_PERFORMANCE 4 +#define ENERGY_PERF_BIAS_NORMAL 6 +#define ENERGY_PERF_BIAS_BALANCE_POWERSAVE 8 +#define ENERGY_PERF_BIAS_POWERSAVE 15 #define MSR_IA32_PACKAGE_THERM_STATUS 0x01b1 -- 2.11.0.161.g6610af872
Re: [PATCH v7 0/7] Introduce ZONE_CMA
On Thu, May 11, 2017 at 11:13:04AM +0200, Michal Hocko wrote: > On Thu 11-05-17 11:12:43, Joonsoo Kim wrote: > > Sorry for the late response. I was on a vacation. > > > > On Tue, May 02, 2017 at 03:32:29PM +0200, Michal Hocko wrote: > > > On Tue 02-05-17 13:01:32, Joonsoo Kim wrote: > > > > On Thu, Apr 27, 2017 at 05:06:36PM +0200, Michal Hocko wrote: > > > [...] > > > > > I see this point and I agree that using a specific zone might be a > > > > > _nicer_ solution in the end but you have to consider another aspects > > > > > as > > > > > well. The main one I am worried about is a long term maintainability. > > > > > We are really out of page flags and consuming one for a rather > > > > > specific > > > > > usecase is not good. Look at ZONE_DMA. I am pretty sure that almost > > > > > no sane HW needs 16MB zone anymore, yet we have hard time to get rid > > > > > of it and so we have that memory laying around unused all the time > > > > > and blocking one page flag bit. CMA falls into a similar category > > > > > AFAIU. I wouldn't be all that surprised if a future HW will not need > > > > > CMA > > > > > allocations in few years, yet we will have to fight to get rid of it > > > > > like we do with ZONE_DMA. And not only that. We will also have to > > > > > fight > > > > > finding page flags for other more general usecases in the meantime. > > > > > > > > This maintenance problem is inherent. This problem exists even if we > > > > uses MIGRATETYPE approach. We cannot remove many hooks for CMA if a > > > > future HW will not need CMA allocation in few years. The only > > > > difference is that one takes single zone bit only for CMA user and the > > > > other approach takes many hooks that we need to take care about it all > > > > the time. > > > > > > And I consider this a big difference. Because while hooks are not nice > > > they will affect CMA users (in a sense of bugs/performance etc.). 
While > > > an additional bit consumed will affect potential future and more generic > > > features. > > > > Because these hooks are so tricky and are spread on many places, > > bugs about these hooks can be made by *non-CMA* user and they hurt > > *CMA* user. These hooks could also delay non-CMA user's development speed > > since there are many hooks about CMA and understanding how CMA is managed > > is rather difficult. > > Than make those hooks easier to maintain. Seriously this is a > non-argument. I can't understand what you said here. With zone approach, someone who isn't related to CMA don't need to understand how CMA is managed. > > [...] > > > > And all this can be isolated to CMA specific hooks with mostly minimum > > > impact to most users. I hear you saying that zone approach is more natural > > > and I would agree if we wouldn't have to care about the number of zones. > > > > I attach a solution about one more bit in page flags although I don't > > agree with your opinion that additional bit is no-go approach. Just > > note that we have already used three bits for zone encoding in some > > configuration due to ZONE_DEVICE. > > I am absolutely not happy about ZONE_DEVICE but there is _no_ other > viable solution right now. I know that people behind this are still > considering struct page over direct pfn usage but they are not in the > same situation as CMA which _can_ work without additional zone. IIUC, ZONE_DEVICE can reuse the other zone and migratetype. What they need is just struct page and separate zone is not necessarily needed. The other thing that they want is to distinguish if the page is for the ZONE_DEVICE memory or not so it can use similar approach with CMA. IMHO, there is almost nothing that _cannot_ work in S/W world. What we need to consider is just trade-off. So, please don't say impossibility again. > > If you _really_ insist on using zone for CMA then reuse ZONE_MOVABLE. > I absolutely miss why do you push a specialized zone so hard. 
As I said before, there is no fundamental issue with reusing ZONE_MOVABLE. I just don't want to reuse it because they have different characteristics. In the MM subsystem context, their characteristics are the same. However, CMA memory must be usable by the device at runtime, so a stronger allocation guarantee is needed. See offline_pages() in mm/memory_hotplug.c. It can wait up to 120 seconds to offline the memory, but CMA memory can't. And this is a design issue. I don't want to argue about why we should pursue good design. Originally, zones exist to manage different types of memory. Migratetype is not for that purpose. Using a separate zone fits the original purpose. Mixing them would be a bad design and we would easily encounter unexpected problems in the future. > > [...] > > > No, but I haven't heard any single argument that those bugs are > > > impossible to fix with the current approach. They might be harder to fix > > > but if I can chose between harder for CMA and harder for other more > > > generic HW independent features I will go for the
[PATCH 1/5] x86: msr-index.h: define EPB mid-points
From: Len Brown These are currently open-coded into intel_pstate.c Signed-off-by: Len Brown --- arch/x86/include/asm/msr-index.h | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 710273c617b8..a92d9bd154f6 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -462,9 +462,11 @@ #define MSR_MISC_PWR_MGMT 0x01aa #define MSR_IA32_ENERGY_PERF_BIAS 0x01b0 -#define ENERGY_PERF_BIAS_PERFORMANCE 0 -#define ENERGY_PERF_BIAS_NORMAL6 -#define ENERGY_PERF_BIAS_POWERSAVE 15 +#define ENERGY_PERF_BIAS_PERFORMANCE 0 +#define ENERGY_PERF_BIAS_BALANCE_PERFORMANCE 4 +#define ENERGY_PERF_BIAS_NORMAL6 +#define ENERGY_PERF_BIAS_BALANCE_POWERSAVE 8 +#define ENERGY_PERF_BIAS_POWERSAVE 15 #define MSR_IA32_PACKAGE_THERM_STATUS 0x01b1 -- 2.11.0.161.g6610af872
Re: [PATCH v7 0/7] Introduce ZONE_CMA
On Thu, May 11, 2017 at 11:13:04AM +0200, Michal Hocko wrote:
> On Thu 11-05-17 11:12:43, Joonsoo Kim wrote:
> > Sorry for the late response. I was on a vacation.
> >
> > On Tue, May 02, 2017 at 03:32:29PM +0200, Michal Hocko wrote:
> > > On Tue 02-05-17 13:01:32, Joonsoo Kim wrote:
> > > > On Thu, Apr 27, 2017 at 05:06:36PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > I see this point and I agree that using a specific zone might be a
> > > > > _nicer_ solution in the end but you have to consider another aspects as
> > > > > well. The main one I am worried about is a long term maintainability.
> > > > > We are really out of page flags and consuming one for a rather specific
> > > > > usecase is not good. Look at ZONE_DMA. I am pretty sure that almost
> > > > > no sane HW needs 16MB zone anymore, yet we have hard time to get rid
> > > > > of it and so we have that memory laying around unused all the time
> > > > > and blocking one page flag bit. CMA falls into a similar category
> > > > > AFAIU. I wouldn't be all that surprised if a future HW will not need CMA
> > > > > allocations in few years, yet we will have to fight to get rid of it
> > > > > like we do with ZONE_DMA. And not only that. We will also have to fight
> > > > > finding page flags for other more general usecases in the meantime.
> > > >
> > > > This maintenance problem is inherent. This problem exists even if we
> > > > uses MIGRATETYPE approach. We cannot remove many hooks for CMA if a
> > > > future HW will not need CMA allocation in few years. The only
> > > > difference is that one takes single zone bit only for CMA user and the
> > > > other approach takes many hooks that we need to take care about it all
> > > > the time.
> > >
> > > And I consider this a big difference. Because while hooks are not nice
> > > they will affect CMA users (in a sense of bugs/performance etc.). While
> > > an additional bit consumed will affect potential future and more generic
> > > features.
> >
> > Because these hooks are so tricky and are spread on many places,
> > bugs about these hooks can be made by *non-CMA* user and they hurt
> > *CMA* user. These hooks could also delay non-CMA user's development speed
> > since there are many hooks about CMA and understanding how CMA is managed
> > is rather difficult.
>
> Than make those hooks easier to maintain. Seriously this is a
> non-argument.

I can't understand what you said here. With the zone approach, someone who
isn't involved with CMA doesn't need to understand how CMA is managed.

> > [...]
> > > And all this can be isolated to CMA specific hooks with mostly minimum
> > > impact to most users. I hear you saying that zone approach is more natural
> > > and I would agree if we wouldn't have to care about the number of zones.
> >
> > I attach a solution about one more bit in page flags although I don't
> > agree with your opinion that additional bit is no-go approach. Just
> > note that we have already used three bits for zone encoding in some
> > configuration due to ZONE_DEVICE.
>
> I am absolutely not happy about ZONE_DEVICE but there is _no_ other
> viable solution right now. I know that people behind this are still
> considering struct page over direct pfn usage but they are not in the
> same situation as CMA which _can_ work without additional zone.

IIUC, ZONE_DEVICE could reuse another zone and migratetype. What they need
is just the struct page; a separate zone is not strictly necessary. The
other thing they want is to distinguish whether a page belongs to
ZONE_DEVICE memory or not, so they could use a similar approach to CMA's.

IMHO, there is almost nothing that _cannot_ work in the S/W world. What we
need to consider is just the trade-off. So, please don't claim
impossibility again.

> If you _really_ insist on using zone for CMA then reuse ZONE_MOVABLE.
> I absolutely miss why do you push a specialized zone so hard.

As I said before, there is no fundamental issue in reusing ZONE_MOVABLE. I
just don't want to reuse it because the two have different
characteristics. In the MM-subsystem context their characteristics are the
same, but CMA memory must be available to the device at runtime, so a
stronger allocation guarantee is needed. See offline_pages() in
mm/memory_hotplug.c: it can tolerate up to 120 seconds to offline memory,
but CMA memory can't.

And, this is a design issue. I don't want to argue about why we should
pursue good design. Originally, zones exist to manage different types of
memory, and migratetype is not for that purpose. Using a separate zone
fits the original purpose. Mixing them would be a bad design and we would
easily run into unexpected problems in the future.

> > [...]
> > > No, but I haven't heard any single argument that those bugs are
> > > impossible to fix with the current approach. They might be harder to fix
> > > but if I can chose between harder for CMA and harder for other more
> > > generic HW independent features I will go for the
[GIT PULL] power utilities update
Hi Rafael,

Please pull these power utilities patches.

The x86_energy_perf_policy(8) utility grows the ability to manage
HWP.EPP (and Hardware P-states in general), on top of its previous
ability to manage EPB.

Linux-4.11 grew the ability to disable the cpufreq sub-system entirely
with "cpufreq.off=1". On such a system, this utility is a convenient
way to enable HWP with no kernel cpufreq overhead:

sudo x86_energy_perf_policy --cpu all normal

During this effort, we noticed the EPP defaults hard-coded into the
intel_pstate driver were not optimal, so those are updated too.

thanks!
Len Brown, Intel Open Source Technology Center

The following changes since commit c470abd4fde40ea6a0846a2beab642a578c0b8cd:

  Linux 4.10 (2017-02-19 14:34:00 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git utilities

for you to fetch changes up to 3cedbc5a6d7f7c5539e139f89ec9f6e1ed668418:

  intel_pstate: use updated msr-index.h HWP.EPP values (2017-05-11 21:27:53 -0400)

Len Brown (5):
      x86: msr-index.h: define EPB mid-points
      x86: msr-index.h: define HWP.EPP values
      x86: msr-index.h: fix shifts to ULL results in HWP macros.
      tools/power x86_energy_perf_policy: support HWP.EPP
      intel_pstate: use updated msr-index.h HWP.EPP values

 arch/x86/include/asm/msr-index.h                |   18 +-
 drivers/cpufreq/intel_pstate.c                  |   34 +-
 tools/power/x86/x86_energy_perf_policy/Makefile |   27 +-
 .../x86_energy_perf_policy.8                    |  241 +++-
 .../x86_energy_perf_policy.c                    | 1504 +---
 5 files changed, 1527 insertions(+), 297 deletions(-)
[PATCH 2/5] x86: msr-index.h: define HWP.EPP values
From: Len Brown

The Hardware Performance State request MSR has a field to express the
"Energy Performance Preference" (HWP.EPP).

Decode that field so the definition may be shared by the intel_pstate
driver and any utilities that decode the same register.

Signed-off-by: Len Brown
---
 arch/x86/include/asm/msr-index.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a92d9bd154f6..50c0c3204a92 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -239,6 +239,10 @@
 #define HWP_MIN_PERF(x)			(x & 0xff)
 #define HWP_MAX_PERF(x)			((x & 0xff) << 8)
 #define HWP_DESIRED_PERF(x)		((x & 0xff) << 16)
 #define HWP_ENERGY_PERF_PREFERENCE(x)	((x & 0xff) << 24)
+#define HWP_EPP_PERFORMANCE		0x00
+#define HWP_EPP_BALANCE_PERFORMANCE	0x80
+#define HWP_EPP_BALANCE_POWERSAVE	0xC0
+#define HWP_EPP_POWERSAVE		0xFF
 #define HWP_ACTIVITY_WINDOW(x)		((x & 0xff3) << 32)
 #define HWP_PACKAGE_CONTROL(x)		((x & 0x1) << 42)
--
2.11.0.161.g6610af872
[PATCH 4/5] tools/power x86_energy_perf_policy: support HWP.EPP
From: Len Brown

x86_energy_perf_policy(8) was created as an example of how the user, or
upper-level OS, can manage MSR_IA32_ENERGY_PERF_BIAS (EPB). Hardware
consults EPB when it makes internal decisions balancing energy-saving vs
performance. For example, should HW quickly or slowly transition into and
out of power-saving idle states? Should HW quickly or slowly ramp frequency
up or down in response to demand in the turbo-frequency range?

Depending on the processor, EPB may have package, core, or CPU-thread
scope. As such, the only general policy is to write the same value to EPB
on every CPU in the system.

Recent platforms add support for Hardware Performance States (HWP). HWP
effectively extends hardware frequency control from the opportunistic
turbo-frequency range to control the entire range of available processor
frequencies.

Just as turbo-mode used EPB, HWP can use EPB to help decide how quickly to
ramp frequency and voltage up and down in response to changing demand.
Indeed, BDX and BDX-DE, the first processors to support HWP, use EPB for
this purpose.

Starting in SKL, HWP no longer looks to EPB for influence. Instead, it
looks in a new MSR specifically for this purpose:
IA32_HWP_REQUEST.Energy_Performance_Preference (HWP.EPP).

HWP.EPP is like EPB, except that it is specific to HWP-mode frequency
selection. Also, HWP.EPP is defined to have per CPU-thread scope.

Starting in SKX, IA32_HWP_REQUEST is augmented by IA32_HWP_REQUEST_PKG --
which has the same function, but is defined to have package-wide scope. A
new bit in IA32_HWP_REQUEST determines whether it over-rides
IA32_HWP_REQUEST_PKG or not.

Note that HWP-mode can be enabled in several ways. The "in-band" method is
for HWP to be exposed in CPUID, and for the Linux intel_pstate driver to
recognize that, and thus enable HWP. In this case, starting in Linux 4.10,
intel_pstate exports the cpufreq sysfs attribute
"energy_performance_preference", which can be used to manage HWP.EPP.
This interface can be used to set HWP.EPP to these values:

	0	performance
	128	balance_performance (default)
	192	balance_power
	255	power

Here, x86_energy_perf_policy is updated to use identical strings and
values as intel_pstate.

But HWP-mode may also be enabled by firmware before the OS boots, and the
OS may not be aware of HWP. In this case, intel_pstate is not available to
provide sysfs attributes, and x86_energy_perf_policy or a similar utility
is invaluable for managing HWP.EPP, for this utility works the same
whether cpufreq is enabled or not.

Signed-off-by: Len Brown
---
 tools/power/x86/x86_energy_perf_policy/Makefile |   37 +-
 .../x86_energy_perf_policy.8                    |  317 ++--
 .../x86_energy_perf_policy.c                    | 1750
 3 files changed, 1666 insertions(+), 438 deletions(-)
 rewrite tools/power/x86/x86_energy_perf_policy/Makefile (72%)
 rewrite tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8 (80%)
 rewrite tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c (78%)

diff --git a/tools/power/x86/x86_energy_perf_policy/Makefile b/tools/power/x86/x86_energy_perf_policy/Makefile
dissimilarity index 72%
index 971c9ffdcb50..a711eec0c895 100644
--- a/tools/power/x86/x86_energy_perf_policy/Makefile
+++ b/tools/power/x86/x86_energy_perf_policy/Makefile
@@ -1,10 +1,27 @@
-DESTDIR ?=
-
-x86_energy_perf_policy : x86_energy_perf_policy.c
-
-clean :
-	rm -f x86_energy_perf_policy
-
-install :
-	install x86_energy_perf_policy ${DESTDIR}/usr/bin/
-	install x86_energy_perf_policy.8 ${DESTDIR}/usr/share/man/man8/
+CC		= $(CROSS_COMPILE)gcc
+BUILD_OUTPUT	:= $(CURDIR)
+PREFIX		:= /usr
+DESTDIR		:=
+
+ifeq ("$(origin O)", "command line")
+	BUILD_OUTPUT := $(O)
+endif
+
+x86_energy_perf_policy : x86_energy_perf_policy.c
+CFLAGS +=	-Wall
+CFLAGS +=	-DMSRHEADER='"../../../../arch/x86/include/asm/msr-index.h"'
+
+%: %.c
+	@mkdir -p $(BUILD_OUTPUT)
+	$(CC) $(CFLAGS) $< -o $(BUILD_OUTPUT)/$@
+
+.PHONY : clean
+clean :
+	@rm -f $(BUILD_OUTPUT)/x86_energy_perf_policy
+
+install : x86_energy_perf_policy
+	install -d $(DESTDIR)$(PREFIX)/bin
+	install $(BUILD_OUTPUT)/x86_energy_perf_policy $(DESTDIR)$(PREFIX)/bin/x86_energy_perf_policy
+	install -d $(DESTDIR)$(PREFIX)/share/man/man8
+	install x86_energy_perf_policy.8 $(DESTDIR)$(PREFIX)/share/man/man8
+
diff --git a/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8 b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8
dissimilarity index 80%
index 8eaaad648cdb..17db1c3af4d0 100644
--- a/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8
+++ b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8
@@ -1,104 +1,213 @@
-.\" This page Copyright (C) 2010 Len Brown
-.\" Distributed under the GPL, Copyleft 1994.
-.TH