Re: [PATCH v8 05/10] powerpc/perf: IMC pmu cpumask and cpuhotplug support
Stewart Smith writes: > Madhavan Srinivasan writes: >>> * in patch 9 should opal_imc_counters_init return something other >>>than OPAL_SUCCESS in the case of invalid arguments? Maybe >>>OPAL_PARAMETER? (I think you fix this in a later patch anyway?) >> >> So, the init call will return OPAL_PARAMETER for the unsupported >> domains (core and nest are supported). And if the init operation >> fails for any reason, it would return OPAL_HARDWARE. And this is >> documented. > > (I'll comment on the skiboot one too), but I think that if the class > exists but init is a no-op, then OPAL_IMC_COUNTERS_INIT should return > OPAL_SUCCESS and just do nothing. This future-proofs everything, and the > API is that one *must* call _INIT before start. Yes, 100%. That's what I described in my replies to a previous version; if it doesn't do that we need to fix it. cheers
[PATCH v2 3/3] sched/deadline: Add statistics to track runtime underruns
Add accounting to track cases where the runtime doesn't run out, and export the information in "/proc/<pid>/sched". Specifically, the patch adds three members "nr_underrun_sched", "nr_underrun_block", and "nr_underrun_yield" to sched_dl_entity: -@nr_underrun_sched hints at some scheduling issue. -@nr_underrun_block hints at some blocking reason, e.g. a long sleep. -@nr_underrun_yield hints at a yield reason. This is helpful to spot/debug deadline issues. For example, I launched three 50% dl tasks on my dual-core machine, plus several buggy constrained dl tasks that Daniel is trying to address in "sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks", and then observed one 50% deadline task's proc sched output: $ cat /proc/3389/sched |grep underrun dl.nr_underrun_sched:981 dl.nr_underrun_block: 0 dl.nr_underrun_yield: 0 A very large "dl.nr_underrun_sched" hints that there is very likely some underlying scheduling issue. Note that we don't gate this behind CONFIG_SCHED_DEBUG, as the added accounting has little overhead (and also happens infrequently). Suggested-by: Steven Rostedt Signed-off-by: Xunlei Pang --- include/linux/sched.h | 10 ++ kernel/sched/core.c | 3 +++ kernel/sched/deadline.c | 12 +--- kernel/sched/debug.c | 3 +++ 4 files changed, 25 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index ba080e5..e17928f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -455,6 +455,16 @@ struct sched_dl_entity { * own bandwidth to be enforced, thus we need one timer per task. */ struct hrtimer dl_timer; + + /* +* Accounting for periods that run less than @dl_runtime: +* @nr_underrun_sched hints some scheduling issue. +* @nr_underrun_block hints some block reason. E.g. long sleep. +* @nr_underrun_yield hints the yield reason. 
+*/ + u64 nr_underrun_sched; + u64 nr_underrun_block; + u64 nr_underrun_yield; }; union rcu_special { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index bccd819..6214ada 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4004,6 +4004,9 @@ static struct task_struct *find_process_by_pid(pid_t pid) dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline; dl_se->flags = attr->sched_flags; dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime); + dl_se->nr_underrun_sched = 0; + dl_se->nr_underrun_block = 0; + dl_se->nr_underrun_yield = 0; /* * Changing the parameters of a task is 'tricky' and we're not doing diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 5691149..a7ddc03 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -394,8 +394,10 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se, dl_se->runtime = pi_se->dl_runtime; } - if (dl_se->dl_yielded && dl_se->runtime > 0) + if (dl_se->dl_yielded && dl_se->runtime > 0) { dl_se->runtime = 0; + ++dl_se->nr_underrun_yield; + } /* * We keep moving the deadline away until we get some @@ -723,8 +725,10 @@ static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se) if (unlikely(dl_se->dl_boosted || !start_dl_timer(p))) return; dl_se->dl_throttled = 1; - if (dl_se->runtime > 0) + if (dl_se->runtime > 0) { dl_se->runtime = 0; + ++dl_se->nr_underrun_block; + } } } @@ -733,8 +737,10 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se) { bool dmiss = dl_time_before(dl_se->deadline, rq_clock(rq)); - if (dmiss && dl_se->runtime > 0) + if (dmiss && dl_se->runtime > 0) { dl_se->runtime = 0; + ++dl_se->nr_underrun_sched; + } return (dl_se->runtime <= 0); } diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 38f0193..904b43f 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -957,6 +957,9 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m) if (p->policy == SCHED_DEADLINE) { 
P(dl.runtime); P(dl.deadline); + P(dl.nr_underrun_sched); + P(dl.nr_underrun_block); + P(dl.nr_underrun_yield); } #undef PN_SCHEDSTAT #undef PN -- 1.8.3.1
[PATCH v2 2/3] sched/deadline: Throttle the task when missing its deadline
dl_runtime_exceeded() only checks for negative runtime; actually, when the current deadline has passed, we should start a new period and zero out the remaining runtime as well. This patch improves dl_runtime_exceeded() to achieve that. Fixes: 269ad8015a6b ("sched/deadline: Avoid double-accounting in case of missed deadlines") Cc: Luca Abeni Signed-off-by: Xunlei Pang --- kernel/sched/deadline.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index d3d291e..5691149 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -729,8 +729,13 @@ static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se) } static -int dl_runtime_exceeded(struct sched_dl_entity *dl_se) +int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se) { + bool dmiss = dl_time_before(dl_se->deadline, rq_clock(rq)); + + if (dmiss && dl_se->runtime > 0) + dl_se->runtime = 0; + return (dl_se->runtime <= 0); } @@ -781,7 +786,7 @@ static void update_curr_dl(struct rq *rq) dl_se->runtime -= delta_exec; throttle: - if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) { + if (dl_runtime_exceeded(rq, dl_se) || dl_se->dl_yielded) { dl_se->dl_throttled = 1; __dequeue_task_dl(rq, curr, 0); if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr))) -- 1.8.3.1
[PATCH v2 1/3] sched/deadline: Zero out positive runtime after throttling constrained tasks
When a constrained task is throttled by dl_check_constrained_dl(), it may carry remaining positive runtime; as a result, when dl_task_timer() fires and calls replenish_dl_entity(), it will not be replenished correctly due to the positive dl_se->runtime. This patch sets its runtime to 0 if it is positive after throttling. Fixes: df8eac8cafce ("sched/deadline: Throttle a constrained deadline task activated after the deadline") Acked-by: Daniel Bristot de Oliveira Signed-off-by: Xunlei Pang --- kernel/sched/deadline.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index a2ce590..d3d291e 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -723,6 +723,8 @@ static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se) if (unlikely(dl_se->dl_boosted || !start_dl_timer(p))) return; dl_se->dl_throttled = 1; + if (dl_se->runtime > 0) + dl_se->runtime = 0; } } -- 1.8.3.1
Re: [PATCH 0/2] net: Set maximum receive packet size on veth interfaces
On Tue, May 9, 2017 at 5:42 PM, Fredrik Markström wrote: > > Maybe I was unclear, the veth implementation drops all packets larger than the > configured MTU (on the receiving interface). > Most ethernet drivers accept packets up to the ethernet MTU no matter the > configured MTU. As far as I can tell from the RFCs that is ok. This is because the IP layer does the fragmentation for you. But some drivers, for example tg3, drop packets larger than their dev->mtu very early too. > > A simpler solution would be to only drop packets larger than the ethernet MTU > (like > most network drivers), but I guess that will break existing stuff > already out there. I wonder why we introduced that MTU check for veth when the IP layer could either fragment or reject with ICMP?
[PATCH v3 2/3] arm64: dts: mt8173: Fix mdp device tree
From: Daniel Kurtz If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with old device trees, so a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai --- arch/arm64/boot/dts/mediatek/mt8173.dtsi | 126 +++ 1 file changed, 60 insertions(+), 66 deletions(-) diff --git a/arch/arm64/boot/dts/mediatek/mt8173.dtsi b/arch/arm64/boot/dts/mediatek/mt8173.dtsi index 6922252..d28a363 100644 --- a/arch/arm64/boot/dts/mediatek/mt8173.dtsi +++ b/arch/arm64/boot/dts/mediatek/mt8173.dtsi @@ -792,80 +792,74 @@ #clock-cells = <1>; }; - mdp { - compatible = "mediatek,mt8173-mdp"; - #address-cells = <2>; - #size-cells = <2>; - ranges; + mdp_rdma0: rdma@14001000 { + compatible = "mediatek,mt8173-mdp-rdma", +"mediatek,mt8173-mdp"; + reg = <0 0x14001000 0 0x1000>; + clocks = < CLK_MM_MDP_RDMA0>, +< CLK_MM_MUTEX_32K>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + iommus = < M4U_PORT_MDP_RDMA0>; + mediatek,larb = <>; mediatek,vpu = <>; + }; - mdp_rdma0: rdma@14001000 { - compatible = "mediatek,mt8173-mdp-rdma"; - reg = <0 0x14001000 0 0x1000>; - clocks = < CLK_MM_MDP_RDMA0>, -< CLK_MM_MUTEX_32K>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - iommus = < M4U_PORT_MDP_RDMA0>; - mediatek,larb = <>; - }; - - mdp_rdma1: rdma@14002000 { - compatible = "mediatek,mt8173-mdp-rdma"; - reg = <0 0x14002000 0 0x1000>; - clocks = < CLK_MM_MDP_RDMA1>, -< CLK_MM_MUTEX_32K>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - iommus = < M4U_PORT_MDP_RDMA1>; - mediatek,larb = <>; - }; + mdp_rdma1: rdma@14002000 { + compatible = "mediatek,mt8173-mdp-rdma"; + reg = <0 0x14002000 0 
0x1000>; + clocks = < CLK_MM_MDP_RDMA1>, +< CLK_MM_MUTEX_32K>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + iommus = < M4U_PORT_MDP_RDMA1>; + mediatek,larb = <>; + }; - mdp_rsz0: rsz@14003000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14003000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ0>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz0: rsz@14003000 { + compatible = "mediatek,mt8173-mdp-rsz"; + reg = <0 0x14003000 0 0x1000>; + clocks = < CLK_MM_MDP_RSZ0>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + }; - mdp_rsz1: rsz@14004000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14004000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ1>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz1: rsz@14004000 { + compatible = "mediatek,mt8173-mdp-rsz"; + reg = <0 0x14004000 0 0x1000>; + clocks = < CLK_MM_MDP_RSZ1>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + }; - mdp_rsz2: rsz@14005000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14005000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ2>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz2:
[PATCH v3 0/3] Fix mdp device tree
Changes in v3: - Upload patches again because I forgot to add v2 to the titles Changes in v2: - Update commit message If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with old device trees, so a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Daniel Kurtz (2): arm64: dts: mt8173: Fix mdp device tree media: mtk-mdp: Fix mdp device tree Minghsiu Tsai (1): dt-bindings: mt8173: Fix mdp device tree .../devicetree/bindings/media/mediatek-mdp.txt | 12 +- arch/arm64/boot/dts/mediatek/mt8173.dtsi | 126 ++--- drivers/media/platform/mtk-mdp/mtk_mdp_core.c | 2 +- 3 files changed, 64 insertions(+), 76 deletions(-) -- 1.9.1
[PATCH v3 3/3] media: mtk-mdp: Fix mdp device tree
From: Daniel Kurtz If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with old device trees, so a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai --- drivers/media/platform/mtk-mdp/mtk_mdp_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c index 9e4eb7d..a5ad586 100644 --- a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c +++ b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c @@ -118,7 +118,7 @@ static int mtk_mdp_probe(struct platform_device *pdev) mutex_init(>vpulock); /* Iterate over sibling MDP function blocks */ - for_each_child_of_node(dev->of_node, node) { + for_each_child_of_node(dev->of_node->parent, node) { const struct of_device_id *of_id; enum mtk_mdp_comp_type comp_type; int comp_id; -- 1.9.1
[PATCH v3 1/3] dt-bindings: mt8173: Fix mdp device tree
If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with old device trees, so a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Minghsiu Tsai --- Documentation/devicetree/bindings/media/mediatek-mdp.txt | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/Documentation/devicetree/bindings/media/mediatek-mdp.txt b/Documentation/devicetree/bindings/media/mediatek-mdp.txt index 4182063..0d03e3a 100644 --- a/Documentation/devicetree/bindings/media/mediatek-mdp.txt +++ b/Documentation/devicetree/bindings/media/mediatek-mdp.txt @@ -2,7 +2,7 @@ Media Data Path is used for scaling and color space conversion. -Required properties (controller (parent) node): +Required properties (controller node): - compatible: "mediatek,mt8173-mdp" - mediatek,vpu: the node of video processor unit, see Documentation/devicetree/bindings/media/mediatek-vpu.txt for details. @@ -32,21 +32,16 @@ Required properties (DMA function blocks, child node): for details. Example: -mdp { - compatible = "mediatek,mt8173-mdp"; - #address-cells = <2>; - #size-cells = <2>; - ranges; - mediatek,vpu = <>; - mdp_rdma0: rdma@14001000 { compatible = "mediatek,mt8173-mdp-rdma"; +"mediatek,mt8173-mdp"; reg = <0 0x14001000 0 0x1000>; clocks = < CLK_MM_MDP_RDMA0>, < CLK_MM_MUTEX_32K>; power-domains = < MT8173_POWER_DOMAIN_MM>; iommus = < M4U_PORT_MDP_RDMA0>; mediatek,larb = <>; + mediatek,vpu = <>; }; mdp_rdma1: rdma@14002000 { @@ -106,4 +101,3 @@ mdp { iommus = < M4U_PORT_MDP_WROT1>; mediatek,larb = <>; }; -}; -- 1.9.1
Re: [PATCH] PCI: Make SR-IOV capable GPU working on the SR-IOV incapable platform
On Fri, 12 May 2017 02:50:32 + "Cheng, Collins" wrote: > Hi Helgaas, > > Some AMD GPUs have hardware support for graphics SR-IOV. > If an SR-IOV capable GPU is plugged into an SR-IOV incapable > platform, it would cause a problem with PCI resource allocation in > the current Linux kernel. > > Therefore, in order to allow the PF (Physical Function) device of > an SR-IOV capable GPU to work on an SR-IOV incapable platform, > it is required to verify conditions for initializing BAR resources > on AMD SR-IOV capable GPUs. > > If the device is an AMD graphics device and it supports > SR-IOV it will require a large amount of resources. > Before calling sriov_init() we must ensure that the system > BIOS also supports SR-IOV and that the system BIOS has been > able to allocate enough resources. > If the VF BARs are zero then the system BIOS does not > support SR-IOV or it could not allocate the resources > and this platform will not support AMD graphics SR-IOV. > Therefore do not call sriov_init(). > If the system BIOS does support SR-IOV then the VF BARs > will be properly initialized to non-zero values. > > Below is the patch against Kernel 4.8 & 4.9. Please review. > > I checked drivers/pci/quirks.c; it looks like the workarounds/fixes in > quirks.c are for specific devices, with one or more device IDs defined > for the specific devices. However my patch is for all AMD SR-IOV > capable GPUs, which includes all existing and future AMD server GPUs. > So it doesn't seem like a good fit to put the fix in quirks.c. Why is an AMD graphics card unique here? Doesn't sriov_init() always need to be able to deal with devices of any type where the BIOS hasn't initialized the SR-IOV capability? Some SR-IOV devices can fit their VFs within a minimum bridge aperture, most cannot. I don't understand why the VF resource requirements being exceptionally large dictates that they receive special handling. 
Thanks, Alex > Signed-off-by: Collins Cheng > --- > drivers/pci/iov.c | 63 > --- > 1 file changed, 60 insertions(+), 3 deletions(-) > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > index e30f05c..e4f1405 100644 > --- a/drivers/pci/iov.c > +++ b/drivers/pci/iov.c > @@ -523,6 +523,45 @@ static void sriov_restore_state(struct pci_dev *dev) > msleep(100); > } > > +/* > + * pci_vf_bar_valid - check if VF BARs have resource allocated > + * @dev: the PCI device > + * @pos: register offset of SR-IOV capability in PCI config space > + * Returns true if any VF BAR has a resource allocated, false > + * if all VF BARs are empty. > + */ > +static bool pci_vf_bar_valid(struct pci_dev *dev, int pos) > +{ > + int i; > + u32 bar_value; > + u32 bar_size_mask = ~(PCI_BASE_ADDRESS_SPACE | > + PCI_BASE_ADDRESS_MEM_TYPE_64 | > + PCI_BASE_ADDRESS_MEM_PREFETCH); > + > + for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) { > + pci_read_config_dword(dev, pos + PCI_SRIOV_BAR + i * 4, > &bar_value); > + if (bar_value & bar_size_mask) > + return true; > + } > + > + return false; > +} > + > +/* > + * is_amd_display_adapter - check if it is an AMD/ATI GPU device > + * @dev: the PCI device > + * > + * Returns true if the device is an AMD/ATI display adapter, > + * otherwise returns false. > + */ > + > +static bool is_amd_display_adapter(struct pci_dev *dev) > +{ > + return (((dev->class >> 16) == PCI_BASE_CLASS_DISPLAY) && > + (dev->vendor == PCI_VENDOR_ID_ATI || > + (dev->vendor == PCI_VENDOR_ID_AMD)); > +} > + > /** > * pci_iov_init - initialize the IOV capability > * @dev: the PCI device > @@ -537,9 +576,27 @@ int pci_iov_init(struct pci_dev *dev) > return -ENODEV; > > pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV); > - if (pos) > - return sriov_init(dev, pos); > - > + if (pos) { > + /* > + * If the device is an AMD graphics device and it supports > + * SR-IOV it will require a large amount of resources. 
> + * Before calling sriov_init() we must ensure that the system > + * BIOS also supports SR-IOV and that the system BIOS has been > + * able to allocate enough resources. > + * If the VF BARs are zero then the system BIOS does not > + * support SR-IOV or it could not allocate the resources > + * and this platform will not support AMD graphics SR-IOV. > + * Therefore do not call sriov_init(). > + * If the system BIOS does support SR-IOV then the VF BARs > + * will be properly initialized to non-zero values. > + */ > + if (is_amd_display_adapter(dev)) { > + if (pci_vf_bar_valid(dev, pos)) > + return sriov_init(dev, pos); > + } else { > + return sriov_init(dev, pos); > + } > + } > return -ENODEV; > } >
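The VF BAR probe discussed above can be modeled in userspace. The following Python sketch is illustrative only (the fake config space, helper names, and values are assumptions, not kernel code): it walks a simulated PCIe extended-capability list starting at offset 0x100 to find the SR-IOV capability (ID 0x0010), then applies the same flag-bit mask as the patch's `pci_vf_bar_valid()` to the six VF BARs at `PCI_SRIOV_BAR` (0x24):

```python
import struct

PCI_EXT_CAP_ID_SRIOV = 0x0010
PCI_SRIOV_BAR = 0x24          # offset of VF BAR0 within the SR-IOV capability
PCI_SRIOV_NUM_BARS = 6
# Flag bits that survive even in an unprogrammed BAR, as in the patch's mask:
# I/O-space bit, 64-bit type bit, prefetchable bit
BAR_FLAG_BITS = 0x1 | 0x4 | 0x8

def read_dword(cfg, off):
    return struct.unpack_from("<I", cfg, off)[0]

def find_ext_cap(cfg, cap_id):
    """Walk the PCIe extended capability list (header: ID in bits 15:0,
    next capability offset in bits 31:20)."""
    pos = 0x100
    while pos:
        header = read_dword(cfg, pos)
        if header == 0:
            return 0
        if (header & 0xFFFF) == cap_id:
            return pos
        pos = (header >> 20) & 0xFFC
    return 0

def vf_bars_valid(cfg, pos):
    """Model of pci_vf_bar_valid(): true if any VF BAR holds an address
    beyond the read-only flag bits."""
    for i in range(PCI_SRIOV_NUM_BARS):
        bar = read_dword(cfg, pos + PCI_SRIOV_BAR + i * 4)
        if bar & ~BAR_FLAG_BITS & 0xFFFFFFFF:
            return True
    return False

# Build a fake 4 KiB config space with one SR-IOV capability at 0x100
cfg = bytearray(4096)
struct.pack_into("<I", cfg, 0x100, (0 << 20) | (1 << 16) | PCI_EXT_CAP_ID_SRIOV)

pos = find_ext_cap(cfg, PCI_EXT_CAP_ID_SRIOV)
print(pos, vf_bars_valid(cfg, pos))   # BARs all zero: BIOS did not program them

# Now pretend the BIOS assigned a 64-bit prefetchable VF BAR0
struct.pack_into("<I", cfg, pos + PCI_SRIOV_BAR, 0xE0000000 | 0x8 | 0x4)
print(vf_bars_valid(cfg, pos))        # a programmed BAR is detected
```

The key point the sketch shows is Alex's question in reverse: nothing in the mask logic is AMD-specific, so the same check would work for any device whose VF BARs were left unprogrammed.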
Re: [PATCH 1/3] f2fs: split wio_mutex
On 2017/5/12 2:37, Jaegeuk Kim wrote: > On 05/09, Chao Yu wrote: >> From: Chao Yu >> >> Split wio_mutex to adjust different temperature bio cache. > > This can be rephrased like: Yup, let me resend this patch. Thanks, > > Signed-off-by: Chao Yu > Signed-off-by: Jaegeuk Kim > --- > fs/f2fs/f2fs.h | 3 ++- > fs/f2fs/segment.c | 4 ++-- > fs/f2fs/super.c | 7 --- > 3 files changed, 8 insertions(+), 6 deletions(-) > > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > index c6643783adff..fb710bfbe215 100644 > --- a/fs/f2fs/f2fs.h > +++ b/fs/f2fs/f2fs.h > @@ -888,7 +888,8 @@ struct f2fs_sb_info { > > /* for bio operations */ > struct f2fs_bio_info *write_io[NR_PAGE_TYPE]; /* for write bios */ > - struct mutex wio_mutex[NODE + 1]; /* bio ordering for NODE/DATA */ > + struct mutex wio_mutex[NR_PAGE_TYPE - 1][NR_TEMP_TYPE]; > + /* bio ordering for NODE/DATA */ > int write_io_size_bits; /* Write IO size bits */ > mempool_t *write_io_dummy; /* Dummy pages */ > > diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c > index da4fd2c29e86..cc2edaf36e55 100644 > --- a/fs/f2fs/segment.c > +++ b/fs/f2fs/segment.c > @@ -2156,7 +2156,7 @@ static void do_write_page(struct f2fs_summary *sum, > struct f2fs_io_info *fio) > int err; > > if (fio->type == NODE || fio->type == DATA) > - mutex_lock(&fio->sbi->wio_mutex[fio->type]); > + mutex_lock(&fio->sbi->wio_mutex[fio->type][fio->temp]); > reallocate: > allocate_data_block(fio->sbi, fio->page, fio->old_blkaddr, > &fio->new_blkaddr, sum, type); > @@ -2169,7 +2169,7 @@ static void do_write_page(struct f2fs_summary *sum, > struct f2fs_io_info *fio) > } > > if (fio->type == NODE || fio->type == DATA) > - mutex_unlock(&fio->sbi->wio_mutex[fio->type]); > + mutex_unlock(&fio->sbi->wio_mutex[fio->type][fio->temp]); > } > > void write_meta_page(struct f2fs_sb_info *sbi, struct page *page) > diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c > index fea563bff0c1..df19902282f2 100644 > --- a/fs/f2fs/super.c > +++ b/fs/f2fs/super.c > @@ -1555,7 +1555,7 @@ int sanity_check_ckpt(struct 
f2fs_sb_info *sbi) > static void init_sb_info(struct f2fs_sb_info *sbi) > { > struct f2fs_super_block *raw_super = sbi->raw_super; > - int i; > + int i, j; > > sbi->log_sectors_per_block = > le32_to_cpu(raw_super->log_sectors_per_block); > @@ -1587,8 +1587,9 @@ static void init_sb_info(struct f2fs_sb_info *sbi) > > INIT_LIST_HEAD(&sbi->s_list); > mutex_init(&sbi->umount_mutex); > - mutex_init(&sbi->wio_mutex[NODE]); > - mutex_init(&sbi->wio_mutex[DATA]); > + for (i = 0; i < NR_PAGE_TYPE - 1; i++) > + for (j = HOT; j < NR_TEMP_TYPE; j++) > + mutex_init(&sbi->wio_mutex[i][j]); > spin_lock_init(&sbi->cp_lock); > } > >
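To see what the two-dimensional `wio_mutex[NR_PAGE_TYPE - 1][NR_TEMP_TYPE]` buys, here is a hedged Python model of the lock choice. The enum orderings mirror f2fs's `page_type` (DATA, NODE, META) and `temp_type` (HOT, WARM, COLD); everything else is illustrative, not kernel code:

```python
import threading
from enum import IntEnum

class PageType(IntEnum):   # mirrors f2fs's enum page_type ordering
    DATA = 0
    NODE = 1
    META = 2               # META takes no wio_mutex, as in do_write_page()

class Temp(IntEnum):       # mirrors f2fs's enum temp_type
    HOT = 0
    WARM = 1
    COLD = 2

# One mutex per (DATA/NODE, temperature) pair -- the shape the patch
# changes wio_mutex to: [NR_PAGE_TYPE - 1][NR_TEMP_TYPE]
wio_mutex = [[threading.Lock() for _ in Temp]
             for _ in (PageType.DATA, PageType.NODE)]

def lock_for(ptype, temp):
    """Model of the lock selection in do_write_page()."""
    if ptype in (PageType.DATA, PageType.NODE):
        return wio_mutex[ptype][temp]
    return None            # META writes are not ordered by wio_mutex

# Before the split, HOT and COLD data writes shared one lock; now they
# take independent locks and no longer serialize against each other:
hot = lock_for(PageType.DATA, Temp.HOT)
cold = lock_for(PageType.DATA, Temp.COLD)
print(hot is cold)
```

The design point is simply contention reduction: ordering is still enforced within each (type, temperature) bio cache, but writers to different temperature logs of the same page type no longer block one another.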
Re: [PATCH] f2fs: split bio cache
On 2017/5/11 10:35, Jaegeuk Kim wrote: > On 05/11, Chao Yu wrote: >> On 2017/5/11 7:50, Jaegeuk Kim wrote: >>> On 05/09, Chao Yu wrote: Hi Jaegeuk, On 2017/5/9 5:23, Jaegeuk Kim wrote: > Hi Chao, > > I can't see a strong reason to split meta from data/node and rename the > existing > function names. Instead, how about keeping the existing one while adding > some > page types to deal with log types? Hmm.. before write this patch, I have thought about this implementation which adds HOT_DATA/WARM_DATA/.. into page_type enum, but the final target is making different temperature log-header being with independent bio cache, io lock, and io list, eliminating interaction of different temperature IOs, also expanding f2fs' scalability when adding more log-headers. If we don't split meta from data/node, it's a little bit hard to approach this. What do you think? >>> >>> I submitted clean-up patches. How about this on top of them? >> >> After splitting, bio caches is connected to log-header, so why not moving bio >> cache into log-header (SM_I(sbi)->curseg_array)? after the merging: >> - we could avoid redundant enum. e.g. temp_type, page_type::{DATA/NODE}, >> since >> we have seg_type enum instead. >> - once we add special log-header like journal log or random IO log making >> DATA/NODE and HOT/WARM/COLD non-orthogonalization, we just need to change few >> codes to adjust it. > > Well, I prefer to keep block allocation and IO control in a separate way as > is. > Moreover, I don't think performance gain would be large enough compared to > what > we need to change both of whole flows. Alright, let's use your implementation. ;) Could you resend this patch? and then I will rebase my patch on this one. Thanks, > > Thanks, > >> >> How do you think? 
>> >> Thanks, >> >>> >>> --- >>> fs/f2fs/data.c | 51 >>> + >>> fs/f2fs/f2fs.h | 10 - >>> fs/f2fs/gc.c| 2 ++ >>> fs/f2fs/segment.c | 24 +++-- >>> fs/f2fs/segment.h | 4 >>> fs/f2fs/super.c | 21 --- >>> include/trace/events/f2fs.h | 14 - >>> 7 files changed, 102 insertions(+), 24 deletions(-) >>> >>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c >>> index 06bb2042385e..49b7e3918484 100644 >>> --- a/fs/f2fs/data.c >>> +++ b/fs/f2fs/data.c >>> @@ -282,21 +282,30 @@ static bool has_merged_page(struct f2fs_sb_info *sbi, >>> struct inode *inode, >>> nid_t ino, pgoff_t idx, enum page_type type) >>> { >>> enum page_type btype = PAGE_TYPE_OF_BIO(type); >>> - struct f2fs_bio_info *io = &sbi->write_io[btype]; >>> - bool ret; >>> + enum temp_type temp; >>> + struct f2fs_bio_info *io; >>> + bool ret = false; >>> >>> - down_read(&io->io_rwsem); >>> - ret = __has_merged_page(io, inode, ino, idx); >>> - up_read(&io->io_rwsem); >>> + for (temp = HOT; temp < NR_TEMP_TYPE; temp++) { >>> + io = sbi->write_io[btype] + temp; >>> + >>> + down_read(&io->io_rwsem); >>> + ret = __has_merged_page(io, inode, ino, idx); >>> + up_read(&io->io_rwsem); >>> + >>> + /* TODO: use HOT temp only for meta pages now. 
*/ >>> + if (ret || btype == META) >>> + break; >>> + } >>> return ret; >>> } >>> >>> static void __f2fs_submit_merged_write(struct f2fs_sb_info *sbi, >>> struct inode *inode, nid_t ino, pgoff_t idx, >>> - enum page_type type) >>> + enum page_type type, enum temp_type temp) >>> { >>> enum page_type btype = PAGE_TYPE_OF_BIO(type); >>> - struct f2fs_bio_info *io = &sbi->write_io[btype]; >>> + struct f2fs_bio_info *io = sbi->write_io[btype] + temp; >>> >>> down_write(&io->io_rwsem); >>> >>> @@ -316,17 +325,34 @@ static void __f2fs_submit_merged_write(struct >>> f2fs_sb_info *sbi, >>> up_write(&io->io_rwsem); >>> } >>> >>> +static void __submit_merged_write_cond(struct f2fs_sb_info *sbi, >>> + struct inode *inode, nid_t ino, pgoff_t idx, >>> + enum page_type type, bool force) >>> +{ >>> + enum temp_type temp; >>> + >>> + if (!force && !has_merged_page(sbi, inode, ino, idx, type)) >>> + return; >>> + >>> + for (temp = HOT; temp < NR_TEMP_TYPE; temp++) { >>> + __f2fs_submit_merged_write(sbi, inode, ino, idx, type, temp); >>> + >>> + /* TODO: use HOT temp only for meta pages now. */ >>> + if (type >= META) >>> + break; >>> + } >>> +} >>> + >>> void f2fs_submit_merged_write(struct f2fs_sb_info *sbi, enum page_type >>> type) >>> { >>> - __f2fs_submit_merged_write(sbi, NULL, 0, 0, type); >>> + __submit_merged_write_cond(sbi, NULL, 0, 0, type, true); >>> } >>> >>> void
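The temperature loop added by `__submit_merged_write_cond()` can be modeled in a few lines. This is a control-flow sketch only (names mirror the patch; it performs no I/O): DATA and NODE flush one bio cache per temperature, while META, per the patch's TODO comment, uses only the HOT slot:

```python
from enum import IntEnum

class PageType(IntEnum):   # mirrors f2fs's enum page_type ordering
    DATA = 0
    NODE = 1
    META = 2

class Temp(IntEnum):       # mirrors f2fs's enum temp_type
    HOT = 0
    WARM = 1
    COLD = 2

def temps_to_flush(ptype):
    """Model of the loop in __submit_merged_write_cond(): iterate
    HOT..COLD, but stop after HOT for META and beyond."""
    flushed = []
    for temp in Temp:
        flushed.append(temp)
        if ptype >= PageType.META:  # the patch's "META is HOT-only" TODO
            break
    return flushed

print(temps_to_flush(PageType.DATA))  # flushes HOT, WARM and COLD caches
print(temps_to_flush(PageType.META))  # flushes only the HOT cache
```

This is the structural consequence of the split that the thread debates: per-temperature caches mean a flush of one page type may now touch up to `NR_TEMP_TYPE` bio caches instead of one.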
[PATCH 1/3] dt-bindings: mt8173: Fix mdp device tree
If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with the old device tree. So, a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Minghsiu Tsai --- Documentation/devicetree/bindings/media/mediatek-mdp.txt | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/Documentation/devicetree/bindings/media/mediatek-mdp.txt b/Documentation/devicetree/bindings/media/mediatek-mdp.txt index 4182063..0d03e3a 100644 --- a/Documentation/devicetree/bindings/media/mediatek-mdp.txt +++ b/Documentation/devicetree/bindings/media/mediatek-mdp.txt @@ -2,7 +2,7 @@ Media Data Path is used for scaling and color space conversion. -Required properties (controller (parent) node): +Required properties (controller node): - compatible: "mediatek,mt8173-mdp" - mediatek,vpu: the node of video processor unit, see Documentation/devicetree/bindings/media/mediatek-vpu.txt for details. @@ -32,21 +32,16 @@ Required properties (DMA function blocks, child node): for details. Example: -mdp { - compatible = "mediatek,mt8173-mdp"; - #address-cells = <2>; - #size-cells = <2>; - ranges; - mediatek,vpu = <>; - mdp_rdma0: rdma@14001000 { compatible = "mediatek,mt8173-mdp-rdma"; +"mediatek,mt8173-mdp"; reg = <0 0x14001000 0 0x1000>; clocks = < CLK_MM_MDP_RDMA0>, < CLK_MM_MUTEX_32K>; power-domains = < MT8173_POWER_DOMAIN_MM>; iommus = < M4U_PORT_MDP_RDMA0>; mediatek,larb = <>; + mediatek,vpu = <>; }; mdp_rdma1: rdma@14002000 { @@ -106,4 +101,3 @@ mdp { iommus = < M4U_PORT_MDP_WROT1>; mediatek,larb = <>; }; -}; -- 1.9.1
[PATCH v2 0/3] Fix mdp device tree
If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with the old device tree. So, a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Daniel Kurtz (2): arm64: dts: mt8173: Fix mdp device tree media: mtk-mdp: Fix mdp device tree Minghsiu Tsai (1): dt-bindings: mt8173: Fix mdp device tree .../devicetree/bindings/media/mediatek-mdp.txt | 12 +- arch/arm64/boot/dts/mediatek/mt8173.dtsi | 126 ++--- drivers/media/platform/mtk-mdp/mtk_mdp_core.c | 2 +- 3 files changed, 64 insertions(+), 76 deletions(-) -- 1.9.1
[PATCH 2/3] arm64: dts: mt8173: Fix mdp device tree
From: Daniel Kurtz If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with the old device tree. So, a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai --- arch/arm64/boot/dts/mediatek/mt8173.dtsi | 126 +++ 1 file changed, 60 insertions(+), 66 deletions(-) diff --git a/arch/arm64/boot/dts/mediatek/mt8173.dtsi b/arch/arm64/boot/dts/mediatek/mt8173.dtsi index 6922252..d28a363 100644 --- a/arch/arm64/boot/dts/mediatek/mt8173.dtsi +++ b/arch/arm64/boot/dts/mediatek/mt8173.dtsi @@ -792,80 +792,74 @@ #clock-cells = <1>; }; - mdp { - compatible = "mediatek,mt8173-mdp"; - #address-cells = <2>; - #size-cells = <2>; - ranges; + mdp_rdma0: rdma@14001000 { + compatible = "mediatek,mt8173-mdp-rdma", +"mediatek,mt8173-mdp"; + reg = <0 0x14001000 0 0x1000>; + clocks = < CLK_MM_MDP_RDMA0>, +< CLK_MM_MUTEX_32K>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + iommus = < M4U_PORT_MDP_RDMA0>; + mediatek,larb = <>; mediatek,vpu = <>; + }; - mdp_rdma0: rdma@14001000 { - compatible = "mediatek,mt8173-mdp-rdma"; - reg = <0 0x14001000 0 0x1000>; - clocks = < CLK_MM_MDP_RDMA0>, -< CLK_MM_MUTEX_32K>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - iommus = < M4U_PORT_MDP_RDMA0>; - mediatek,larb = <>; - }; - - mdp_rdma1: rdma@14002000 { - compatible = "mediatek,mt8173-mdp-rdma"; - reg = <0 0x14002000 0 0x1000>; - clocks = < CLK_MM_MDP_RDMA1>, -< CLK_MM_MUTEX_32K>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - iommus = < M4U_PORT_MDP_RDMA1>; - mediatek,larb = <>; - }; + mdp_rdma1: rdma@14002000 { + compatible = "mediatek,mt8173-mdp-rdma"; + reg = <0 0x14002000 0 
0x1000>; + clocks = < CLK_MM_MDP_RDMA1>, +< CLK_MM_MUTEX_32K>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + iommus = < M4U_PORT_MDP_RDMA1>; + mediatek,larb = <>; + }; - mdp_rsz0: rsz@14003000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14003000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ0>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz0: rsz@14003000 { + compatible = "mediatek,mt8173-mdp-rsz"; + reg = <0 0x14003000 0 0x1000>; + clocks = < CLK_MM_MDP_RSZ0>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + }; - mdp_rsz1: rsz@14004000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14004000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ1>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz1: rsz@14004000 { + compatible = "mediatek,mt8173-mdp-rsz"; + reg = <0 0x14004000 0 0x1000>; + clocks = < CLK_MM_MDP_RSZ1>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + }; - mdp_rsz2: rsz@14005000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14005000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ2>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz2:
[PATCH 3/3] media: mtk-mdp: Fix mdp device tree
From: Daniel Kurtz If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with the old device tree. So, a driver patch is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai --- drivers/media/platform/mtk-mdp/mtk_mdp_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c index 9e4eb7d..a5ad586 100644 --- a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c +++ b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c @@ -118,7 +118,7 @@ static int mtk_mdp_probe(struct platform_device *pdev) mutex_init(&mdp->vpulock); /* Iterate over sibling MDP function blocks */ - for_each_child_of_node(dev->of_node, node) { + for_each_child_of_node(dev->of_node->parent, node) { const struct of_device_id *of_id; enum mtk_mdp_comp_type comp_type; int comp_id; -- 1.9.1
[PATCH 2/3] arm64: dts: mt8173: Fix mdp device tree
From: Daniel Kurtz If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes iommu assignment issue, it also break compatibility with old device tree. So, the patch in driver is needed to iterate over sibling mdp device nodes, not child ones, to keep driver work properly. Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai --- arch/arm64/boot/dts/mediatek/mt8173.dtsi | 126 +++ 1 file changed, 60 insertions(+), 66 deletions(-) diff --git a/arch/arm64/boot/dts/mediatek/mt8173.dtsi b/arch/arm64/boot/dts/mediatek/mt8173.dtsi index 6922252..d28a363 100644 --- a/arch/arm64/boot/dts/mediatek/mt8173.dtsi +++ b/arch/arm64/boot/dts/mediatek/mt8173.dtsi @@ -792,80 +792,74 @@ #clock-cells = <1>; }; - mdp { - compatible = "mediatek,mt8173-mdp"; - #address-cells = <2>; - #size-cells = <2>; - ranges; + mdp_rdma0: rdma@14001000 { + compatible = "mediatek,mt8173-mdp-rdma", +"mediatek,mt8173-mdp"; + reg = <0 0x14001000 0 0x1000>; + clocks = < CLK_MM_MDP_RDMA0>, +< CLK_MM_MUTEX_32K>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + iommus = < M4U_PORT_MDP_RDMA0>; + mediatek,larb = <>; mediatek,vpu = <>; + }; - mdp_rdma0: rdma@14001000 { - compatible = "mediatek,mt8173-mdp-rdma"; - reg = <0 0x14001000 0 0x1000>; - clocks = < CLK_MM_MDP_RDMA0>, -< CLK_MM_MUTEX_32K>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - iommus = < M4U_PORT_MDP_RDMA0>; - mediatek,larb = <>; - }; - - mdp_rdma1: rdma@14002000 { - compatible = "mediatek,mt8173-mdp-rdma"; - reg = <0 0x14002000 0 0x1000>; - clocks = < CLK_MM_MDP_RDMA1>, -< CLK_MM_MUTEX_32K>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - iommus = < M4U_PORT_MDP_RDMA1>; - mediatek,larb = <>; - }; + mdp_rdma1: rdma@14002000 { + compatible = "mediatek,mt8173-mdp-rdma"; + reg = <0 0x14002000 0 
0x1000>; + clocks = < CLK_MM_MDP_RDMA1>, +< CLK_MM_MUTEX_32K>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + iommus = < M4U_PORT_MDP_RDMA1>; + mediatek,larb = <>; + }; - mdp_rsz0: rsz@14003000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14003000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ0>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz0: rsz@14003000 { + compatible = "mediatek,mt8173-mdp-rsz"; + reg = <0 0x14003000 0 0x1000>; + clocks = < CLK_MM_MDP_RSZ0>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + }; - mdp_rsz1: rsz@14004000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14004000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ1>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz1: rsz@14004000 { + compatible = "mediatek,mt8173-mdp-rsz"; + reg = <0 0x14004000 0 0x1000>; + clocks = < CLK_MM_MDP_RSZ1>; + power-domains = < MT8173_POWER_DOMAIN_MM>; + }; - mdp_rsz2: rsz@14005000 { - compatible = "mediatek,mt8173-mdp-rsz"; - reg = <0 0x14005000 0 0x1000>; - clocks = < CLK_MM_MDP_RSZ2>; - power-domains = < MT8173_POWER_DOMAIN_MM>; - }; + mdp_rsz2:
[PATCH 3/3] media: mtk-mdp: Fix mdp device tree
From: Daniel Kurtz If the mdp_* nodes are under an mdp sub-node, their corresponding platform device does not automatically get its iommu assigned properly. Fix this by moving the mdp component nodes up a level such that they are siblings of mdp and all other SoC subsystems. This also simplifies the device tree. Although it fixes the iommu assignment issue, it also breaks compatibility with the old device tree. So, a patch in the driver is needed to iterate over sibling mdp device nodes, not child ones, to keep the driver working properly. Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai --- drivers/media/platform/mtk-mdp/mtk_mdp_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c index 9e4eb7d..a5ad586 100644 --- a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c +++ b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c @@ -118,7 +118,7 @@ static int mtk_mdp_probe(struct platform_device *pdev) mutex_init(>vpulock); /* Iterate over sibling MDP function blocks */ - for_each_child_of_node(dev->of_node, node) { + for_each_child_of_node(dev->of_node->parent, node) { const struct of_device_id *of_id; enum mtk_mdp_comp_type comp_type; int comp_id; -- 1.9.1
Re: [PATCH v2] scsi: sg: don't return bogus Sg_requests
Johannes, > If the list search in sg_get_rq_mark() fails to find a valid request, > we return a bogus element. This then can later lead to a GPF in > sg_remove_scat(). > > So don't return bogus Sg_requests in sg_get_rq_mark() but NULL in case > the list search doesn't find a valid request. Applied to 4.12/scsi-fixes, thanks! -- Martin K. Petersen Oracle Linux Engineering
Re: linux-next: Tree for May 12
Hi all, On Fri, 12 May 2017 13:12:16 +1000 Stephen Rothwellwrote: > > Below is a summary of the state of the merge. -- Cheers, Stephen Rothwell $ git checkout master $ git reset --hard stable Merging origin/master (09d79d103371 Merge tag 'docs-4.12-2' of git://git.lwn.net/linux) Merging fixes/master (97da3854c526 Linux 4.11-rc3) Merging kbuild-current/fixes (9be3213b14d4 gconfig: remove misleading parentheses around a condition) Merging arc-current/for-curr (cf4100d1cddc Revert "ARCv2: Allow enabling PAE40 w/o HIGHMEM") Merging arm-current/fixes (6d8059493691 ARM: 8670/1: V7M: Do not corrupt vector table around v7m_invalidate_l1 call) Merging m68k-current/for-linus (f6ab4d59a5fe nubus: Add MVC and VSC video card definitions) Merging metag-fixes/fixes (b884a190afce metag/usercopy: Add missing fixups) Merging powerpc-fixes/fixes (be5c5e843c4a powerpc/64: Fix HMI exception on LE with CONFIG_RELOCATABLE=y) Merging sparc/master (3c7f62212018 sparc64: fix fault handling in NGbzero.S and GENbzero.S) Merging fscrypt-current/for-stable (42d97eb0ade3 fscrypt: fix renaming and linking special files) Merging net/master (228b0324fe61 Merge branch 'bpf-pkt-ptr-align') Merging ipsec/master (2c1497bbc8fd xfrm: Fix NETDEV_DOWN with IPSec offload) Merging netfilter/master (f411af682218 Merge branch 'ibmvnic-Updated-reset-handler-andcode-fixes') Merging ipvs/master (3c5ab3f395d6 ipvs: SNAT packet replies only for NATed connections) Merging wireless-drivers/master (d77facb88448 brcmfmac: use local iftype avoiding use-after-free of virtual interface) Merging mac80211/master (29cee56c0be4 Merge tag 'mac80211-for-davem-2017-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211) Merging sound-current/for-linus (3488df131df9 sound: Disable the build of OSS drivers) Merging pci-current/for-linus (b9c1153f7a9c PCI: hisi: Fix DT binding (hisi-pcie-almost-ecam)) Merging driver-core.current/driver-core-linus (af82455f7dbd Merge tag 'char-misc-4.12-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc) Merging tty.current/tty-linus (4f7d029b9bf0 Linux 4.11-rc7) Merging usb.current/usb-linus (a71c9a1c779f Linux 4.11-rc5) Merging usb-gadget-fixes/fixes (a351e9b9fc24 Linux 4.11) Merging usb-serial-fixes/usb-linus (c02ed2e75ef4 Linux 4.11-rc4) Merging usb-chipidea-fixes/ci-for-usb-stable (c7fbb09b2ea1 usb: chipidea: move the lock initialization to core file) Merging phy/fixes (1a09b6a7c10e phy: qcom-usb-hs: Add depends on EXTCON) Merging staging.current/staging-linus (56868a460b83 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide) Merging char-misc.current/char-misc-linus (af82455f7dbd Merge tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc) Merging input-current/for-linus (8be193c7b1f4 Input: add support for PlayStation 1/2 joypads connected via SPI) Merging crypto-current/master (929562b14478 crypto: stm32 - Fix OF module alias information) Merging ide/master (acfead32f3f9 ide: don't call memcpy with the same source and destination) Merging vfio-fixes/for-linus (39da7c509acf Linux 4.11-rc6) Merging kselftest-fixes/fixes (c1ae3cfa0e89 Linux 4.11-rc1) Merging backlight-fixes/for-backlight-fixes (68feaca0b13e backlight: pwm: Handle EPROBE_DEFER while requesting the PWM) Merging ftrace-fixes/for-next-urgent (6224beb12e19 tracing: Have branch tracer use recursive field of task struct) Merging mfd-fixes/for-mfd-fixes (b2376407f989 mfd: cros-ec: Fix host command buffer size) Merging v4l-dvb-fixes/fixes (24a47426066c [media] exynos-gsc: Do not swap cb/cr for semi planar formats) Merging drm-intel-fixes/for-linux-next-fixes (a351e9b9fc24 Linux 4.11) Merging drm-misc-fixes/for-linux-next-fixes (0c45b36f8acc drm/udl: Fix unaligned memory access in udl_render_hline) Merging kbuild/for-next (791a9a666d1a Merge tag 'kbuild-uapi-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild) Merging asm-generic/master (de4be6b87b6b asm-generic: 
page.h: fix comment typo) CONFLICT (content): Merge conflict in include/asm-generic/percpu.h Merging arc/for-next (d5adbfcd5f7b Linux 4.10-rc7) Merging arm/for-next (c92a90a5060a Merge branches 'fixes' and 'misc' into for-next) Merging arm-perf/for-next/perf (f00fa5f4163b arm64: pmuv3: use arm_pmu ACPI framework) Merging arm-soc/for-next (e3ec06d4bc1e arm-soc: document merges) CONFLICT (content): Merge conflict in drivers/tee/tee_shm.c Merging alpine/alpine/for-next (a1144b2b1ec4 ARM: dts: alpine: add valid clock-frequency values) Merging amlogic/for-next (715dcd206041 Merge branch 'v4.12/drivers' into tmp/aml-rebuild) Merging aspeed/for-next (4944e5dbb215 Merge branches 'dt-for-v4.12' and 'defconfig-for-v4.12' into for-next) Merging at91/at91-next (ce60fdaa7e9e Merge remote-tracking branch 'alex_korg/at91-dt' into at91-next) Merging bcm2835/for-next (7ea6e490ba7f Merge branch anholt/bcm2835-defconfig-64-next into for-next) Merging
linux-next: Tree for May 12
Hi all, Please do not add any v4.13 destined material in your linux-next included branches until after v4.12-rc1 has been released. Changes since 20170511: Non-merge commits (relative to Linus' tree): 520 583 files changed, 13977 insertions(+), 14719 deletions(-) I have created today's linux-next tree at git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you are tracking the linux-next tree using git, you should not use "git pull" to do so as that will try to merge the new linux-next release with the old one. You should use "git fetch" and checkout or reset to the new master. You can see which trees have been included by looking in the Next/Trees file in the source. There are also quilt-import.log and merge.log files in the Next directory. Between each merge, the tree was built with a ppc64_defconfig for powerpc and an allmodconfig (with CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a native build of tools/perf. After the final fixups (if any), I do an x86_64 modules_install followed by builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc and sparc64 defconfig. Below is a summary of the state of the merge. I am currently merging 258 trees (counting Linus' and 37 trees of bug fix patches pending for the current merge release). Stats about the size of the tree over time can be seen at http://neuling.org/linux-next-size.html . Status of my local build tests will be at http://kisskb.ellerman.id.au/linux-next . If maintainers want to give advice about cross compilers/configs that work, we are always open to add more builds. Thanks to Randy Dunlap for doing many randconfig builds. And to Paul Gortmaker for triage and bug fixes. -- Cheers, Stephen Rothwell
[PATCH] net: dsa: mv88e6xxx: add default case to switch
Add a default case to the switch in order to avoid any chance of using an uninitialized variable _low_, in case s->type does not match any of the listed case values. Addresses-Coverity-ID: 1398130 Suggested-by: Andrew Lunn Signed-off-by: Gustavo A. R. Silva --- drivers/net/dsa/mv88e6xxx/chip.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index 03dc886..d39e210 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -879,6 +879,9 @@ static uint64_t _mv88e6xxx_get_ethtool_stat(struct mv88e6xxx_chip *chip, mv88e6xxx_g1_stats_read(chip, reg, ); if (s->sizeof_stat == 8) mv88e6xxx_g1_stats_read(chip, reg + 1, ); + break; + default: + return UINT64_MAX; } value = (((u64)high) << 16) | low; return value; -- 2.5.0
Re: Implementing Dynamic Rerouting in Kernel
Hi, We do not want to add routes at run time; rather, I would prefer to have a NetFilter driver which can intercept the packet and forward it to the desired interface. On Thu, May 11, 2017 at 9:52 PM, Florian Fainelli wrote: > On 05/11/2017 02:59 AM, Ravish Kumar wrote: >> Hi Experts, >> >> Need expert advice for the one of the requirement Where in VPN >> solution we want to dynaically route the packets to different adapter. >> We will manage our own DNS cache and , based on DNS to IP lookup, we >> can redirect the packet either to Tun device or to a physical adapter. >> >> Please suggest some design what i need to do. > > What you are looking for can be done using ipset-dns from Jason: > > https://git.zx2c4.com/ipset-dns/about/ > -- > Florian
[GIT PULL] arch/nios2 update for v4.12
Hi Linus Here is nios2 update for v4.12. This including nios2 fixes/enhancements and adding nios2 R2 support. Regards Ley Foon The following changes since commit 13e0988140374123bead1dd27c287354cb95108e: docs: complete bumping minimal GNU Make version to 3.81 (2017-05-06 18:49:09 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2.git tags/nios2-v4.12-rc1 for you to fetch changes up to e118c3fec9c0d8d2a96462c4c035305dc952e402: nios2: remove custom early console implementation (2017-05-11 17:44:21 +0800) nios2 update for v4.12-rc1 nios2: remove custom early console implementation nios2: use generic strncpy_from_user() and strnlen_user() nios2: Add CDX support nios2: Add BMX support nios2: Add NIOS2_ARCH_REVISION to select between R1 and R2 nios2: implement flush_dcache_mmap_lock/unlock nios2: enable earlycon support nios2: constify irq_domain_ops nios2: remove wrapper header for cmpxchg.h nios2: add .gitignore entries for auto-generated files Julien Beraud (1): nios2: implement flush_dcache_mmap_lock/unlock Ley Foon Tan (1): nios2: use generic strncpy_from_user() and strnlen_user() Marek Vasut (3): nios2: Add NIOS2_ARCH_REVISION to select between R1 and R2 nios2: Add BMX support nios2: Add CDX support Tobias Klauser (5): nios2: add .gitignore entries for auto-generated files nios2: remove wrapper header for cmpxchg.h nios2: constify irq_domain_ops nios2: enable earlycon support nios2: remove custom early console implementation arch/nios2/Kconfig | 2 + arch/nios2/Kconfig.debug | 1 - arch/nios2/Makefile| 5 ++ arch/nios2/boot/.gitignore | 2 + arch/nios2/boot/dts/10m50_devboard.dts | 3 +- arch/nios2/include/asm/Kbuild | 1 + arch/nios2/include/asm/cacheflush.h| 6 +- arch/nios2/include/asm/cmpxchg.h | 14 arch/nios2/include/asm/cpuinfo.h | 2 + arch/nios2/include/asm/prom.h | 22 -- arch/nios2/include/asm/setup.h | 2 - arch/nios2/include/asm/uaccess.h | 7 +- arch/nios2/kernel/.gitignore | 1 + arch/nios2/kernel/Makefile | 1 
- arch/nios2/kernel/cpuinfo.c| 18 - arch/nios2/kernel/early_printk.c | 118 - arch/nios2/kernel/irq.c| 2 +- arch/nios2/kernel/prom.c | 49 -- arch/nios2/kernel/setup.c | 6 +- arch/nios2/mm/uaccess.c| 33 - arch/nios2/platform/Kconfig.platform | 26 21 files changed, 69 insertions(+), 252 deletions(-) create mode 100644 arch/nios2/boot/.gitignore delete mode 100644 arch/nios2/include/asm/cmpxchg.h delete mode 100644 arch/nios2/include/asm/prom.h create mode 100644 arch/nios2/kernel/.gitignore delete mode 100644 arch/nios2/kernel/early_printk.c
Re: [PATCH net] net: phy: Call bus->reset() after releasing PHYs from reset
From: Florian Fainelli Date: Thu, 11 May 2017 11:24:16 -0700 > The API convention makes it that a given MDIO bus reset should be able > to access PHY devices in its reset() callback and perform additional > MDIO accesses in order to bring the bus and PHYs in a working state. > > Commit 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.") broke that > contract by first calling bus->reset() and then release all PHYs from > reset using their shared GPIO line, so restore the expected > functionality here. > > Fixes: 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.") > Signed-off-by: Florian Fainelli Applied, thanks Florian.
Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf
On Fri, 12 May 2017 02:12:10 + "Chen, Xiaoguang"wrote: > Hi Alex and Gerd, > > >-Original Message- > >From: intel-gvt-dev [mailto:intel-gvt-dev-boun...@lists.freedesktop.org] On > >Behalf Of Alex Williamson > >Sent: Thursday, May 11, 2017 11:45 PM > >To: Gerd Hoffmann > >Cc: Tian, Kevin ; intel-...@lists.freedesktop.org; > >linux- > >ker...@vger.kernel.org; zhen...@linux.intel.com; Lv, Zhiyuan > > ; Chen, Xiaoguang ; intel- > >gvt-...@lists.freedesktop.org; Wang, Zhi A > >Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf > > > >On Thu, 11 May 2017 15:27:53 +0200 > >Gerd Hoffmann wrote: > > > >> Hi, > >> > >> > While read the framebuffer region we have to tell the vendor driver > >> > which > >framebuffer we want to read? There are two framebuffers now in KVMGT that is > >primary and cursor. > >> > There are two methods to implement this: > >> > 1) write the plane id first and then read the framebuffer. > >> > 2) create 2 vfio regions one for primary and one for cursor. > >> > >> (3) Place information for both planes into one vfio region. > >> Which allows to fetch both with a single read() syscall. > >> > >> The question is how you'll get the file descriptor then. If the ioctl > >> returns the dma-buf fd only you have a racy interface: Things can > >> change between read(vfio-region) and ioctl(need-dmabuf-fd). > >> > >> ioctl(need-dma-buf) could return both dmabuf fd and plane info to fix > >> the race, but then it is easier to go with ioctl only interface > >> (simliar to the orginal one from dec last year) I think. > > > >If the dmabuf fd is provided by a separate mdev vendor driver specific > >ioctl, I > >don't see how vfio regions should be involved. Selecting which framebuffer > >should be an ioctl parameter. > Based on your last mail. I think the implementation looks like this: > 1) user query the framebuffer information by reading the vfio region. 
> 2) if the framebuffer changed(such as framebuffer's graphics address changed, > size changed etc) we will need to create a new dmabuf fd. > 3) create a new dmabuf fd using vfio device specific ioctl. > > >What sort of information needs to be conveyed > >about each plane? > Only plane id is needed. > > >Is it static information or something that needs to be read > >repeatedly? > It is static information. For our case plane id 1 represent primary plane and > 3 for cursor plane. 2 means sprite plane which will not be used in our case. > > >Do we need it before we get the dmabuf fd or can it be an ioctl on > >the dmabuf fd? > We need it while query the framebuffer. In kernel we need the plane id to > decide which plane we should decode. > Below is my current implementation: > 1) user first query the framebuffer(primary or cursor) and kernel decode the > framebuffer and return the framebuffer information to user and also save a > copy in kernel. > 2) user compared the framebuffer and if the framebuffer changed creating a > new dmabuf fd. If the contents of the framebuffer change or if the parameters of the framebuffer change? I can't imagine that creating a new dmabuf fd for every visual change within the framebuffer would be efficient, but I don't have any concept of what a dmabuf actually does. > 3) kernel create a new dmabuf fd based on saved framebuffer information. > > So we need plane id in step 1. > In step 3 we create a dmabuf fd only using saved framebuffer information(no > other information is needed). What changes to the framebuffer require a new dmabuf fd? Shouldn't the user query the parameters of the framebuffer through a dmabuf fd and shouldn't the dmabuf fd have some signaling mechanism to the user (eventfd perhaps) to notify the user to re-evaluate the parameters? Otherwise are you imagining that the user polls the vfio region? Why can a dmabuf fd not persist across changes to the framebuffer?
Can someone explain what a dmabuf is and how it works in terms that a non-graphics person can understand? Thanks, Alex
Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf
On Fri, 12 May 2017 02:12:10 + "Chen, Xiaoguang" wrote: > Hi Alex and Gerd, > > >-Original Message- > >From: intel-gvt-dev [mailto:intel-gvt-dev-boun...@lists.freedesktop.org] On > >Behalf Of Alex Williamson > >Sent: Thursday, May 11, 2017 11:45 PM > >To: Gerd Hoffmann > >Cc: Tian, Kevin ; intel-...@lists.freedesktop.org; > >linux- > >ker...@vger.kernel.org; zhen...@linux.intel.com; Lv, Zhiyuan > >; Chen, Xiaoguang ; intel- > >gvt-...@lists.freedesktop.org; Wang, Zhi A > >Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf > > > >On Thu, 11 May 2017 15:27:53 +0200 > >Gerd Hoffmann wrote: > > > >> Hi, > >> > >> > While read the framebuffer region we have to tell the vendor driver > >> > which > >framebuffer we want to read? There are two framebuffers now in KVMGT that is > >primary and cursor. > >> > There are two methods to implement this: > >> > 1) write the plane id first and then read the framebuffer. > >> > 2) create 2 vfio regions one for primary and one for cursor. > >> > >> (3) Place information for both planes into one vfio region. > >> Which allows to fetch both with a single read() syscall. > >> > >> The question is how you'll get the file descriptor then. If the ioctl > >> returns the dma-buf fd only you have a racy interface: Things can > >> change between read(vfio-region) and ioctl(need-dmabuf-fd). > >> > >> ioctl(need-dma-buf) could return both dmabuf fd and plane info to fix > >> the race, but then it is easier to go with ioctl only interface > >> (simliar to the orginal one from dec last year) I think. > > > >If the dmabuf fd is provided by a separate mdev vendor driver specific > >ioctl, I > >don't see how vfio regions should be involved. Selecting which framebuffer > >should be an ioctl parameter. > Based on your last mail. I think the implementation looks like this: > 1) user query the framebuffer information by reading the vfio region. 
> 2) if the framebuffer changed (such as framebuffer's graphics address changed, > size changed etc) we will need to create a new dmabuf fd. > 3) create a new dmabuf fd using vfio device specific ioctl. > > >What sort of information needs to be conveyed > >about each plane? > Only plane id is needed. > > >Is it static information or something that needs to be read > >repeatedly? > It is static information. For our case plane id 1 represent primary plane and > 3 for cursor plane. 2 means sprite plane which will not be used in our case. > > >Do we need it before we get the dmabuf fd or can it be an ioctl on > >the dmabuf fd? > We need it while query the framebuffer. In kernel we need the plane id to > decide which plane we should decode. > Below is my current implementation: > 1) user first query the framebuffer (primary or cursor) and kernel decode the > framebuffer and return the framebuffer information to user and also save a > copy in kernel. > 2) user compared the framebuffer and if the framebuffer changed creating a > new dmabuf fd. If the contents of the framebuffer change or if the parameters of the framebuffer change? I can't imagine that creating a new dmabuf fd for every visual change within the framebuffer would be efficient, but I don't have any concept of what a dmabuf actually does. > 3) kernel create a new dmabuf fd based on saved framebuffer information. > > So we need plane id in step 1. > In step 3 we create a dmabuf fd only using saved framebuffer information (no > other information is needed). What changes to the framebuffer require a new dmabuf fd? Shouldn't the user query the parameters of the framebuffer through a dmabuf fd and shouldn't the dmabuf fd have some signaling mechanism to the user (eventfd perhaps) to notify the user to re-evaluate the parameters? Otherwise are you imagining that the user polls the vfio region? Why can a dmabuf fd not persist across changes to the framebuffer?
Can someone explain what a dmabuf is and how it works in terms that a non-graphics person can understand? Thanks, Alex
[PATCH] PCI: Make SR-IOV capable GPU working on the SR-IOV incapable platform
Hi Helgaas, Some AMD GPUs have hardware support for graphics SR-IOV. If an SR-IOV capable GPU is plugged into an SR-IOV incapable platform, it causes a problem with PCI resource allocation in the current Linux kernel. Therefore, in order to allow the PF (Physical Function) device of an SR-IOV capable GPU to work on an SR-IOV incapable platform, it is required to verify conditions for initializing BAR resources on AMD SR-IOV capable GPUs. If the device is an AMD graphics device and it supports SR-IOV, it will require a large amount of resources. Before calling sriov_init() we must ensure that the system BIOS also supports SR-IOV and that the system BIOS has been able to allocate enough resources. If the VF BARs are zero then the system BIOS does not support SR-IOV or it could not allocate the resources, and this platform will not support AMD graphics SR-IOV; therefore do not call sriov_init(). If the system BIOS does support SR-IOV then the VF BARs will be properly initialized to non-zero values. Below is the patch against Kernel 4.8 & 4.9. Please review. I checked drivers/pci/quirks.c; it looks like the workarounds/fixes in quirks.c are for specific devices, with one or more device IDs defined for each specific device. However my patch is for all AMD SR-IOV capable GPUs, which includes all existing and future AMD server GPUs. So it doesn't seem like a good fit to put the fix in quirks.c. Signed-off-by: Collins Cheng --- drivers/pci/iov.c | 63 --- 1 file changed, 60 insertions(+), 3 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index e30f05c..e4f1405 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -523,6 +523,45 @@ static void sriov_restore_state(struct pci_dev *dev) msleep(100); } +/* + * pci_vf_bar_valid - check if VF BARs have resource allocated + * @dev: the PCI device + * @pos: register offset of SR-IOV capability in PCI config space + * Returns true if any VF BAR has resource allocated, false + * if all VF BARs are empty.
+ */ +static bool pci_vf_bar_valid(struct pci_dev *dev, int pos) +{ + int i; + u32 bar_value; + u32 bar_size_mask = ~(PCI_BASE_ADDRESS_SPACE | + PCI_BASE_ADDRESS_MEM_TYPE_64 | + PCI_BASE_ADDRESS_MEM_PREFETCH); + + for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) { + pci_read_config_dword(dev, pos + PCI_SRIOV_BAR + i * 4, &bar_value); + if (bar_value & bar_size_mask) + return true; + } + + return false; +} + +/* + * is_amd_display_adapter - check if it is an AMD/ATI GPU device + * @dev: the PCI device + * + * Returns true if device is an AMD/ATI display adapter, + * otherwise return false. + */ + +static bool is_amd_display_adapter(struct pci_dev *dev) +{ + return (((dev->class >> 16) == PCI_BASE_CLASS_DISPLAY) && + (dev->vendor == PCI_VENDOR_ID_ATI || + dev->vendor == PCI_VENDOR_ID_AMD)); +} + /** * pci_iov_init - initialize the IOV capability * @dev: the PCI device @@ -537,9 +576,27 @@ int pci_iov_init(struct pci_dev *dev) return -ENODEV; pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV); - if (pos) - return sriov_init(dev, pos); - + if (pos) { + /* +* If the device is an AMD graphics device and it supports +* SR-IOV it will require a large amount of resources. +* Before calling sriov_init() must ensure that the system +* BIOS also supports SR-IOV and that system BIOS has been +* able to allocate enough resources. +* If the VF BARs are zero then the system BIOS does not +* support SR-IOV or it could not allocate the resources +* and this platform will not support AMD graphics SR-IOV. +* Therefore do not call sriov_init(). +* If the system BIOS does support SR-IOV then the VF BARs +* will be properly initialized to non-zero values. +*/ + if (is_amd_display_adapter(dev)) { + if (pci_vf_bar_valid(dev, pos)) + return sriov_init(dev, pos); + } else { + return sriov_init(dev, pos); + } + } return -ENODEV; } -- 1.9.1 -Collins Cheng
Re: [net-dsa-mv88e6xxx] question about potential use of uninitialized variable
Hi Andrew, Quoting Andrew Lunn: On Thu, May 11, 2017 at 04:35:37PM -0500, Gustavo A. R. Silva wrote: Hello everybody, While looking into Coverity ID 1398130 I ran into the following piece of code at drivers/net/dsa/mv88e6xxx/chip.c:849: 849 static uint64_t _mv88e6xxx_get_ethtool_stat(struct mv88e6xxx_chip *chip, 850 struct mv88e6xxx_hw_stat *s, 851 int port, u16 bank1_select, 852 u16 histogram) 853 { 854 u32 low; 855 u32 high = 0; 856 u16 reg = 0; 857 int err; 858 u64 value; 859 860 switch (s->type) { 861 case STATS_TYPE_PORT: 862 err = mv88e6xxx_port_read(chip, port, s->reg, &reg); 863 if (err) 864 return UINT64_MAX; 865 866 low = reg; 867 if (s->sizeof_stat == 4) { 868 err = mv88e6xxx_port_read(chip, port, s->reg + 1, &reg); 869 if (err) 870 return UINT64_MAX; 871 high = reg; 872 } 873 break; 874 case STATS_TYPE_BANK1: 875 reg = bank1_select; 876 /* fall through */ 877 case STATS_TYPE_BANK0: 878 reg |= s->reg | histogram; 879 mv88e6xxx_g1_stats_read(chip, reg, &low); 880 if (s->sizeof_stat == 8) 881 mv88e6xxx_g1_stats_read(chip, reg + 1, &high); 882 } 883 value = (((u64)high) << 16) | low; 884 return value; 885 } My question here is if there is any chance for the execution path to directly jump from line 860 to line 883, hence ending up using the uninitialized variable _low_? Hi Gustavo It would require that s->type not have one of the listed case values. Currently all members of mv88e6xxx_hw_stats do use expected values. However, it would not hurt to add a default: return UINT64_MAX; Do you want to submit a patch? Sure, I'll send it shortly. Thanks for clarifying! -- Gustavo A. R. Silva
Re: [PATCH v5 15/17] dt-bindings: qca7000: append UART interface to binding
On Thu, 11 May 2017 21:12:22 +0200, Michael Heimpold wrote: > Am Mittwoch, 10. Mai 2017, 10:53:26 CEST schrieb Stefan Wahren: > > This merges the serdev binding for the QCA7000 UART driver (Ethernet over > > UART) into the existing document. > > > > Signed-off-by: Stefan Wahren> > --- > > .../devicetree/bindings/net/qca-qca7000.txt| 32 > > ++ 1 file changed, 32 insertions(+) > > > > diff --git a/Documentation/devicetree/bindings/net/qca-qca7000.txt > > b/Documentation/devicetree/bindings/net/qca-qca7000.txt index > > a37f656..08364c3 100644 > > --- a/Documentation/devicetree/bindings/net/qca-qca7000.txt > > +++ b/Documentation/devicetree/bindings/net/qca-qca7000.txt > > @@ -54,3 +54,35 @@ ssp2: spi@80014000 { > > local-mac-address = [ A0 B0 C0 D0 E0 F0 ]; > > }; > > }; > > + > > +(b) Ethernet over UART > > + > > +In order to use the QCA7000 as UART slave it must be defined as a child of > > a +UART master in the device tree. It is possible to preconfigure the UART > > +settings of the QCA7000 firmware, but it's not possible to change them > > during +runtime. > > + > > +Required properties: > > +- compatible: Should be "qca,qca7000-uart" > > I already discussed this with Stefan off-list a little bit, but I would like > to bring this to a broader audience: I'm not sure whether the compatible > should contain the "-uart" suffix, because the hardware chip is the very same > QCA7000 chip which can also be used with SPI protocol. > The only difference is the loaded firmware within the chip which can either > speak SPI or UART protocol (but not both at the same time - due to shared > pins). So the hardware design decides which interface type is used. > > At the moment, this patch series adds a dedicated driver for the UART > protocol, in parallel to the existing SPI driver. So a different compatible > string is needed here to match against the new driver. 
> > An alternative approach would be to re-use the existing compatible string > "qca,qca7000" for both, the SPI and UART protocol, because a "smarter" > (combined) driver would detect which protocol to use. For example the driver > could check for spi-cpha and/or spi-cpol which are required for SPI protocol: > if these exists the driver could assume that SPI must be used, if both are > missing then UART protocol should be used. > (This way it would not be necessary to check whether the node is a child of > a SPI or UART master node - but maybe this is even easier - I don't know) > > Or in shorter words: my concern is that while "qca7000-uart" describes the > hardware, it's too closely coupled to the driver implementation. Having > some feedback of the experts would be nice :-) I'm no expert, but devices which can do both I2C and SPI are quite common, and they usually have the same compatible string for both buses.
[PATCH] nvmem: rockchip-efuse: add support for rk322x-efuse
This adds the necessary data for handling eFuse on the rk322x. Signed-off-by: Finley Xiao --- Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt | 1 + drivers/nvmem/rockchip-efuse.c | 4 2 files changed, 5 insertions(+) diff --git a/Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt b/Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt index 94aeeea..194926f 100644 --- a/Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt +++ b/Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt @@ -4,6 +4,7 @@ Required properties: - compatible: Should be one of the following. - "rockchip,rk3066a-efuse" - for RK3066a SoCs. - "rockchip,rk3188-efuse" - for RK3188 SoCs. + - "rockchip,rk322x-efuse" - for RK322x SoCs. - "rockchip,rk3288-efuse" - for RK3288 SoCs. - "rockchip,rk3399-efuse" - for RK3399 SoCs. - reg: Should contain the registers location and exact eFuse size diff --git a/drivers/nvmem/rockchip-efuse.c b/drivers/nvmem/rockchip-efuse.c index 423907b..a0d4ede 100644 --- a/drivers/nvmem/rockchip-efuse.c +++ b/drivers/nvmem/rockchip-efuse.c @@ -170,6 +170,10 @@ static const struct of_device_id rockchip_efuse_match[] = { .data = (void *)&rockchip_rk3288_efuse_read, }, { + .compatible = "rockchip,rk322x-efuse", + .data = (void *)&rockchip_rk3288_efuse_read, + }, + { .compatible = "rockchip,rk3288-efuse", .data = (void *)&rockchip_rk3288_efuse_read, }, -- 2.7.4
[PATCH v2] [media] mtk-mdp: Fix g_/s_selection capture/compose logic
From: Daniel Kurtz Experiments show that: (1) mtk-mdp uses the _MPLANE form of CAPTURE/OUTPUT (2) CAPTURE types use CROP targets, and OUTPUT types use COMPOSE targets Signed-off-by: Daniel Kurtz Signed-off-by: Minghsiu Tsai Signed-off-by: Houlong Wei --- Changes in v2: . Can not use *_MPLANE type in g_/s_selection --- drivers/media/platform/mtk-mdp/mtk_mdp_m2m.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/media/platform/mtk-mdp/mtk_mdp_m2m.c b/drivers/media/platform/mtk-mdp/mtk_mdp_m2m.c index 13afe48..e18ac626 100644 --- a/drivers/media/platform/mtk-mdp/mtk_mdp_m2m.c +++ b/drivers/media/platform/mtk-mdp/mtk_mdp_m2m.c @@ -838,10 +838,10 @@ static int mtk_mdp_m2m_g_selection(struct file *file, void *fh, bool valid = false; if (s->type == V4L2_BUF_TYPE_VIDEO_CAPTURE) { - if (mtk_mdp_is_target_compose(s->target)) + if (mtk_mdp_is_target_crop(s->target)) valid = true; } else if (s->type == V4L2_BUF_TYPE_VIDEO_OUTPUT) { - if (mtk_mdp_is_target_crop(s->target)) + if (mtk_mdp_is_target_compose(s->target)) valid = true; } if (!valid) { @@ -908,10 +908,10 @@ static int mtk_mdp_m2m_s_selection(struct file *file, void *fh, bool valid = false; if (s->type == V4L2_BUF_TYPE_VIDEO_CAPTURE) { - if (s->target == V4L2_SEL_TGT_COMPOSE) + if (s->target == V4L2_SEL_TGT_CROP) valid = true; } else if (s->type == V4L2_BUF_TYPE_VIDEO_OUTPUT) { - if (s->target == V4L2_SEL_TGT_CROP) + if (s->target == V4L2_SEL_TGT_COMPOSE) valid = true; } if (!valid) { @@ -925,7 +925,7 @@ static int mtk_mdp_m2m_s_selection(struct file *file, void *fh, if (ret) return ret; - if (mtk_mdp_is_target_crop(s->target)) + if (mtk_mdp_is_target_compose(s->target)) frame = &ctx->s_frame; else frame = &ctx->d_frame; -- 1.9.1
RE: [PATCH 2/2] Revert "ACPI / button: Change default behavior to lid_init_state=open"
Hi, > From: Benjamin Tissoires [mailto:benjamin.tissoi...@redhat.com] > Subject: Re: [PATCH 2/2] Revert "ACPI / button: Change default behavior to > lid_init_state=open" > > On May 11 2017 or thereabouts, Zheng, Lv wrote: > > Hi, > > > > > From: Benjamin Tissoires [mailto:benjamin.tissoi...@redhat.com] > > > Subject: [PATCH 2/2] Revert "ACPI / button: Change default behavior to > > > lid_init_state=open" > > > > > > This reverts commit 77e9a4aa9de10cc1418bf9a892366988802a8025. > > > > > > Even if the method implementation can be buggy on some platform, > > > the "open" choice is worse. It breaks docking stations basically > > > and there is no way to have a user-space hwdb to fix that. > > > > > > On the contrary, it's rather easy in user-space to have a hwdb > > > with the problematic platforms. Then, libinput (1.7.0+) can fix > > > the state of the LID switch for us: you need to set the udev > > > property LIBINPUT_ATTR_LID_SWITCH_RELIABILITY to 'write_open'. > > > > > > When libinput detects internal keyboard events, it will > > > overwrite the state of the switch to open, making it reliable > > > again. Given that logind only checks the LID switch value after > > > a timeout, we can assume the user will use the internal keyboard > > > before this timeout expires. > > > > > > For example, such a hwdb entry is: > > > > > > libinput:name:*Lid Switch*:dmi:*svnMicrosoftCorporation:pnSurface3:* > > > LIBINPUT_ATTR_LID_SWITCH_RELIABILITY=write_open > > > > For the reason mentioned previously and proofs here (see patch > > descriptions): > > https://patchwork.kernel.org/patch/9717111/ > > I am not sure you can call this proofs. The "close" state is IMO the > exact same as the "method" one, it's just that the intel driver > triggers the evaluation of the _LID method, not acpi/button.c. I should correct you that: 1. intel_lvds driver only evaluates _LID method for its own purpose, See my reply to PATCH 4-5; 2. 
acpi/button.c is still responsible for generating the lid input event (to input layer and thus the userspace) And the input event generated by button.c differs for the 2 modes. See my another reply to PATCH 02. > > And remember that this new default prevents userspace to fix it because > it's rather easy to detect that the device is actually opened (input > events coming from internal keyboard, touchscreen, or touchpad), while > reporting the lid switch as open means we can't know if the user is > simply not using the internal devices, or if we have just a wrong state. That depends on what the meaning of "fix" is IMO. I saw you wrote a lot in another message, let's discuss this there. > > Given that this patch breaks all Lenovos with a dock (I can guess that 6 > lines this year are using docks, and within each line you have 2-5 > variants), I'd suggest we do not break those existing laptops and just > revert this patch. > > I would think other OEMs have a docking station with an actual power > button but I can't be sure by looking at the product pages. I'm not sure if it breaks the external monitor usage model. Why don't the userspace just 1. always light on the external display and keep the internal display not lit on after boot/resume, or 2. don't do anything and let the BIOS to light on the right display, or 3. don't do anything and let the graphics drivers to light on the right display (using saved state when suspends and resumes to that saved state). Why such decision have to be made based on ACPI control method lid device? I cannot see a strong reason that ACPI control method lid device must participate in this usage model. See drivers/gpu/drm/nouveau/nouveau_connector.c, the ignorelid parameter proves that the graphics drivers can do this on their own and do this perfectly. > > > We shouldn't do this. > > I strongly disagree.
> > I am fine if you don't want to have a blacklist in the kernel for the > devices that are problematic (the ones that require open), but please > don't prevent user space to have this blacklist and please do not make > false assumptions in the kernel on the state of a switch. This is worse > than it helps and in the end, user space won't be able to solve this if > you change the default behavior at each release. We are actually also fine with any default value. But be aware of that, ACPI subsystem just provides a button driver: 1. Allows users to suspend the system upon receiving "close" lid event; 2. Stops to support any other usage models but is willing to provide tunable behavior for the other system participants to support their special needs. Such participants include but are not limited to the user space tools and the other kernel modules. But the default behavior and the timing of tuning button driver behaviors should be determined by the other system participants. And as a consequence, the other system participants who change that must be responsible for fixing regressions related to conflicts of the needs. As
RE: [PATCH 2/2] Revert "ACPI / button: Change default behavior to lid_init_state=open"
Hi, > From: Benjamin Tissoires [mailto:benjamin.tissoi...@redhat.com] > Subject: Re: [PATCH 2/2] Revert "ACPI / button: Change default behavior to > lid_init_state=open" > > On May 11 2017 or thereabouts, Zheng, Lv wrote: > > Hi, > > > > > From: Benjamin Tissoires [mailto:benjamin.tissoi...@redhat.com] > > > Subject: [PATCH 2/2] Revert "ACPI / button: Change default behavior to > > > lid_init_state=open" > > > > > > This reverts commit 77e9a4aa9de10cc1418bf9a892366988802a8025. > > > > > > Even if the method implementation can be buggy on some platform, > > > the "open" choice is worse. It breaks docking stations basically > > > and there is no way to have a user-space hwdb to fix that. > > > > > > On the contrary, it's rather easy in user-space to have a hwdb > > > with the problematic platforms. Then, libinput (1.7.0+) can fix > > > the state of the LID switch for us: you need to set the udev > > > property LIBINPUT_ATTR_LID_SWITCH_RELIABILITY to 'write_open'. > > > > > > When libinput detects internal keyboard events, it will > > > overwrite the state of the switch to open, making it reliable > > > again. Given that logind only checks the LID switch value after > > > a timeout, we can assume the user will use the internal keyboard > > > before this timeout expires. > > > > > > For example, such a hwdb entry is: > > > > > > libinput:name:*Lid Switch*:dmi:*svnMicrosoftCorporation:pnSurface3:* > > > LIBINPUT_ATTR_LID_SWITCH_RELIABILITY=write_open > > > > For the reason mentioned previously and proofs here (see patch > > descriptions): > > https://patchwork.kernel.org/patch/9717111/ > > I am not sure you can call this proofs. The "close" state is IMO the > exact same as the "method" one, it's just that the intel driver > triggers the evaluation of the _LID method, not acpi/button.c. I should correct you that: 1. intel_lvds driver only evaluates _LID method for its own purpose, See my reply to PATCH 4-5; 2. 
acpi/button.c is still responsible for generating the lid input event
(to the input layer and thus userspace). And the input event generated
by button.c differs for the 2 modes. See my other reply to PATCH 02.

> And remember that this new default prevents userspace to fix it because
> it's rather easy to detect that the device is actually opened (input
> events coming from internal keyboard, touchscreen, or touchpad), while
> reporting the lid switch as open means we can't know if the user is
> simply not using the internal devices, or if we have just a wrong state.

That depends on what the meaning of "fix" is, IMO. I saw you wrote a lot
in another message; let's discuss this there.

> Given that this patch breaks all Lenovos with a dock (I can guess that 6
> lines this year are using docks, and within each line you have 2-5
> variants), I'd suggest we do not break those existing laptops and just
> revert this patch.
>
> I would think other OEMs have a docking station with an actual power
> button but I can't be sure by looking at the product pages.

I'm not sure it breaks the external monitor usage model. Why doesn't
userspace just:
1. always light up the external display and keep the internal display
   off after boot/resume, or
2. do nothing and let the BIOS light up the right display, or
3. do nothing and let the graphics drivers light up the right display
   (using state saved on suspend, and resuming to that saved state)?
Why does such a decision have to be made based on the ACPI control
method lid device? I cannot see a strong reason that the ACPI control
method lid device must participate in this usage model. See
drivers/gpu/drm/nouveau/nouveau_connector.c: the ignorelid parameter
proves that the graphics drivers can do this on their own, and do it
perfectly.

> > We shouldn't do this.
>
> I strongly disagree.
>
> I am fine if you don't want to have a blacklist in the kernel for the
> devices that are problematic (the ones that require open), but please
> don't prevent user space to have this blacklist and please do not make
> false assumptions in the kernel on the state of a switch. This is worse
> than it helps and in the end, user space won't be able to solve this if
> you change the default behavior at each release.

We are actually also fine with any default value. But be aware that the
ACPI subsystem just provides a button driver which:
1. Allows users to suspend the system upon receiving a "close" lid event;
2. Stops supporting any other usage models, but is willing to provide
   tunable behavior for the other system participants to support their
   special needs. Such participants include but are not limited to the
   user space tools and the other kernel modules.
But the default behavior and the timing of tuning the button driver's
behavior should be determined by the other system participants. And as a
consequence, the other system participants who change that must be
responsible for fixing regressions related to conflicting needs. As
Re: [net-dsa-mv88e6xxx] question about potential use of uninitialized variable
On Thu, May 11, 2017 at 04:35:37PM -0500, Gustavo A. R. Silva wrote:
>
> Hello everybody,
>
> While looking into Coverity ID 1398130 I ran into the following
> piece of code at drivers/net/dsa/mv88e6xxx/chip.c:849:
>
> 849  static uint64_t _mv88e6xxx_get_ethtool_stat(struct mv88e6xxx_chip *chip,
> 850                                              struct mv88e6xxx_hw_stat *s,
> 851                                              int port, u16 bank1_select,
> 852                                              u16 histogram)
> 853  {
> 854          u32 low;
> 855          u32 high = 0;
> 856          u16 reg = 0;
> 857          int err;
> 858          u64 value;
> 859
> 860          switch (s->type) {
> 861          case STATS_TYPE_PORT:
> 862                  err = mv88e6xxx_port_read(chip, port, s->reg, &reg);
> 863                  if (err)
> 864                          return UINT64_MAX;
> 865
> 866                  low = reg;
> 867                  if (s->sizeof_stat == 4) {
> 868                          err = mv88e6xxx_port_read(chip, port,
>                                                        s->reg + 1, &reg);
> 869                          if (err)
> 870                                  return UINT64_MAX;
> 871                          high = reg;
> 872                  }
> 873                  break;
> 874          case STATS_TYPE_BANK1:
> 875                  reg = bank1_select;
> 876                  /* fall through */
> 877          case STATS_TYPE_BANK0:
> 878                  reg |= s->reg | histogram;
> 879                  mv88e6xxx_g1_stats_read(chip, reg, &low);
> 880                  if (s->sizeof_stat == 8)
> 881                          mv88e6xxx_g1_stats_read(chip, reg + 1, &high);
> 882          }
> 883          value = (((u64)high) << 16) | low;
> 884          return value;
> 885  }
>
> My question here is if there is any chance for the execution path to
> directly jump from line 860 to line 883, hence ending up using the
> uninitialized variable _low_?

Hi Gustavo

It would require that s->type not have one of the listed case values.
Currently all members of mv88e6xxx_hw_stats do use expected values.
However, it would not hurt to add a

	default:
		return UINT64_MAX;

Do you want to submit a patch?

Thanks
	Andrew
[PATCH v5 3/4] zram: make deduplication feature optional
From: Joonsoo Kim

Benefit of deduplication is dependent on the workload, so it's not preferable to always enable it. Therefore, make it optional via Kconfig and a device param. The default is 'off'. This option will be beneficial for users who use zram as a blockdev and store build output on it.

Reviewed-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Signed-off-by: Joonsoo Kim
---
 Documentation/ABI/testing/sysfs-block-zram | 10
 Documentation/blockdev/zram.txt            |  1 +
 drivers/block/zram/Kconfig                 | 14 ++
 drivers/block/zram/Makefile                |  3 +-
 drivers/block/zram/zram_dedup.c            | 15 ++
 drivers/block/zram/zram_dedup.h            | 23 +
 drivers/block/zram/zram_drv.c              | 81 ++
 drivers/block/zram/zram_drv.h              |  9
 8 files changed, 144 insertions(+), 12 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 451b6d8..3c1945f 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -90,3 +90,13 @@ Description:
 		device's debugging info useful for kernel developers. Its
 		format is not documented intentionally and may change
 		anytime without any notice.
+
+What:		/sys/block/zram/use_dedup
+Date:		March 2017
+Contact:	Joonsoo Kim
+Description:
+		The use_dedup file is read-write and specifies deduplication
+		feature is used or not. If enabled, duplicated data is
+		managed by reference count and will not be stored in memory
+		twice. Benefit of this feature largely depends on the workload
+		so keep attention when use.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 2cdc303..cbbe39b 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -168,6 +168,7 @@ max_comp_streams RWthe number of possible concurrent compress operations comp_algorithmRWshow and change the compression algorithm compact WOtrigger memory compaction debug_statROthis file is used for zram debugging purposes +use_dedupRWshow and set deduplication feature User space is advised to use the following files to read the device statistics. diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index b8ecba6..2f3dd1f 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -13,3 +13,17 @@ config ZRAM disks and maybe many more. See zram.txt for more information. + +config ZRAM_DEDUP + bool "Deduplication support for ZRAM data" + depends on ZRAM + default n + help + Deduplicate ZRAM data to reduce amount of memory consumption. + Advantage largely depends on the workload. In some cases, this + option reduces memory usage to the half. However, if there is no + duplicated data, the amount of memory consumption would be + increased due to additional metadata usage. And, there is + computation time trade-off. Please check the benefit before + enabling this option. Experiment shows the positive effect when + the zram is used as blockdev and is used to store build output. 
diff --git a/drivers/block/zram/Makefile b/drivers/block/zram/Makefile index 29cb008..d7204ef 100644 --- a/drivers/block/zram/Makefile +++ b/drivers/block/zram/Makefile @@ -1,3 +1,4 @@ -zram-y := zcomp.o zram_drv.o zram_dedup.o +zram-y := zcomp.o zram_drv.o +zram-$(CONFIG_ZRAM_DEDUP) += zram_dedup.o obj-$(CONFIG_ZRAM) += zram.o diff --git a/drivers/block/zram/zram_dedup.c b/drivers/block/zram/zram_dedup.c index a8427f7..560b1f5 100644 --- a/drivers/block/zram/zram_dedup.c +++ b/drivers/block/zram/zram_dedup.c @@ -41,6 +41,9 @@ void zram_dedup_insert(struct zram *zram, struct zram_entry *new, struct rb_node **rb_node, *parent = NULL; struct zram_entry *entry; + if (!zram_dedup_enabled(zram)) + return; + new->checksum = checksum; hash = >hash[checksum % zram->hash_size]; rb_root = >rb_root; @@ -148,6 +151,9 @@ struct zram_entry *zram_dedup_find(struct zram *zram, struct page *page, void *mem; struct zram_entry *entry; + if (!zram_dedup_enabled(zram)) + return NULL; + mem = kmap_atomic(page); *checksum = zram_dedup_checksum(mem); @@ -160,6 +166,9 @@ struct zram_entry *zram_dedup_find(struct zram *zram, struct page *page, void zram_dedup_init_entry(struct zram *zram, struct zram_entry *entry, unsigned long handle, unsigned int len) { +
[PATCH v5 2/4] zram: implement deduplication in zram
From: Joonsoo Kim

This patch implements the deduplication feature in zram. The purpose of this work is naturally to save the amount of memory used by zram. Android is one of the biggest users of zram as swap, and it's really important to save memory there. There is a paper reporting that the duplication ratio of Android's memory content is rather high [1]. And there is similar work on zswap that also reports that experiments have shown around 10-15% of pages stored in zswap are duplicates, and deduplicating them provides some benefit [2].

Also, there is a different kind of workload that uses zram as a blockdev and stores build outputs on it to reduce the wear-out problem of a real blockdev. In this workload, the deduplication hit rate is very high due to temporary files and intermediate object files. Detailed analysis is at the bottom of this description.

Anyway, if we can detect duplicated content and avoid storing it at different memory locations, we can save memory. This patch tries to do that.

The implementation is almost simple and intuitive, but I should note one implementation detail. To check for duplication, this patch uses a checksum of the page, and collisions of this checksum are possible. There would be many ways to handle this situation, but this patch chooses to allow entries with a duplicated checksum to be added to the hash, but not to compare all entries with a duplicated checksum when checking for duplication. I guess that a checksum collision is quite a rare event and we don't need to pay much attention to such a case. Therefore, I decided on the simplest way to implement the feature. If there is a different opinion, I can accept it and go that way.

Following is the result of this patch.
Test result #1 (Swap): Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18 orig_data_size: 145297408 compr_data_size: 32408125 mem_used_total: 32276480 dup_data_size: 3188134 meta_data_size: 1444272 Last two metrics added to mm_stat are related to this work. First one, dup_data_size, is amount of saved memory by avoiding to store duplicated page. Later one, meta_data_size, is the amount of data structure to support deduplication. If dup > meta, we can judge that the patch improves memory usage. In Adnroid, we can save 5% of memory usage by this work. Test result #2 (Blockdev): build the kernel and store output to ext4 FS on zram Elapsed time: 249 s mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0 Elapsed time: 250 s mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038 3945792 There is no performance degradation and save 23% memory. Test result #3 (Blockdev): copy android build output dir(out/host) to ext4 FS on zram Elapsed time: out/host: 88 s mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0 Elapsed time: out/host: 100 s mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336 It shows performance degradation roughly 13% and save 24% memory. Maybe, it is due to overhead of calculating checksum and comparison. Test result #4 (Blockdev): copy android build output dir(out/target/common) to ext4 FS on zram Elapsed time: out/host: 203 s mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0 Elapsed time: out/host: 201 s mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336 Memory is saved by 42% and performance is the same. Even if there is overhead of calculating checksum and comparison, large hit ratio compensate it since hit leads to less compression attempt. I checked the detailed reason of savings on kernel build workload and there are some cases that deduplication happens. 
1) *.cmd Build command is usually similar in one directory so content of these file are very similar. In my system, more than 789 lines in fs/ext4/.namei.o.cmd and fs/ext4/.inode.o.cmd are the same in 944 and 938 lines of the file, respectively. 2) intermediate object files built-in.o and temporary object file have the similar contents. More than 50% of fs/ext4/ext4.o is the same with fs/ext4/built-in.o. 3) vmlinux .tmp_vmlinux1 and .tmp_vmlinux2 and arch/x86/boo/compressed/vmlinux.bin have the similar contents. Android test has similar case that some of object files(.class and .so) are similar with another ones. (./host/linux-x86/lib/libartd.so and ./host/linux-x86-lib/libartd-comiler.so) Anyway, benefit seems to be largely dependent on the workload so following patch will make this feature optional. However, this feature can help some usecases so is deserved to be merged. [1]: MemScope: Analyzing Memory Duplication on Android Systems, dl.acm.org/citation.cfm?id=2797023 [2]: zswap: Optimize compressed pool memory utilization, lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2 Reviewed-by: Sergey Senozhatsky Acked-by: Minchan Kim Signed-off-by: Joonsoo Kim
[PATCH v5 0/4] zram: implement deduplication in zram
From: Joonsoo Kim <iamjoonsoo@lge.com>

Changes from v4
o rebase on next-20170511
o collect Reviewed-by from Sergey

Changes from v3
o fix module build problem
o make zram_dedup_enabled() statically return false if CONFIG_ZRAM_DEDUP=n

Changes from v2
o rebase to latest zram code
o manage alloc/free of the zram_entry in zram_drv.c
o remove useless RB_CLEAR_NODE
o set RO permission for the use_dedup sysfs entry if CONFIG_ZRAM_DEDUP=n

Changes from v1
o reorganize dedup specific functions
o support Kconfig on/off
o update zram documents
o compare all the entries with same checksum (patch #4)

This patchset implements the deduplication feature in zram. The motivation is to save memory usage by zram. There is a detailed description of the motivation and experimental results in patch #2, so please refer to it.

Thanks.

Joonsoo Kim (4):
  zram: introduce zram_entry to prepare dedup functionality
  zram: implement deduplication in zram
  zram: make deduplication feature optional
  zram: compare all the entries with same checksum for deduplication

 Documentation/ABI/testing/sysfs-block-zram |  10 ++
 Documentation/blockdev/zram.txt            |   3 +
 drivers/block/zram/Kconfig                 |  14 ++
 drivers/block/zram/Makefile                |   3 +-
 drivers/block/zram/zram_dedup.c            | 254 +
 drivers/block/zram/zram_dedup.h            |  45 +
 drivers/block/zram/zram_drv.c              | 184 +
 drivers/block/zram/zram_drv.h              |  32 +++-
 8 files changed, 507 insertions(+), 38 deletions(-)
 create mode 100644 drivers/block/zram/zram_dedup.c
 create mode 100644 drivers/block/zram/zram_dedup.h

-- 
2.7.4
[PATCH v5 1/4] zram: introduce zram_entry to prepare dedup functionality
From: Joonsoo KimFollowing patch will implement deduplication functionality in the zram and it requires an indirection layer to manage the life cycle of zsmalloc handle. To prepare that, this patch introduces zram_entry which can be used to manage the life-cycle of zsmalloc handle. Many lines are changed due to rename but core change is just simple introduction about newly data structure. Reviewed-by: Sergey Senozhatsky Acked-by: Minchan Kim Signed-off-by: Joonsoo Kim --- drivers/block/zram/zram_drv.c | 95 +++ drivers/block/zram/zram_drv.h | 6 ++- 2 files changed, 66 insertions(+), 35 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index debee95..26dc4e5 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -57,14 +57,15 @@ static inline struct zram *dev_to_zram(struct device *dev) return (struct zram *)dev_to_disk(dev)->private_data; } -static unsigned long zram_get_handle(struct zram *zram, u32 index) +static struct zram_entry *zram_get_entry(struct zram *zram, u32 index) { - return zram->table[index].handle; + return zram->table[index].entry; } -static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle) +static void zram_set_entry(struct zram *zram, u32 index, + struct zram_entry *entry) { - zram->table[index].handle = handle; + zram->table[index].entry = entry; } /* flag operations require table entry bit_spin_lock() being held */ @@ -437,7 +438,7 @@ static bool zram_same_page_read(struct zram *zram, u32 index, unsigned int offset, unsigned int len) { zram_slot_lock(zram, index); - if (unlikely(!zram_get_handle(zram, index) || + if (unlikely(!zram_get_entry(zram, index) || zram_test_flag(zram, index, ZRAM_SAME))) { void *mem; @@ -476,6 +477,32 @@ static bool zram_same_page_write(struct zram *zram, u32 index, return false; } +static struct zram_entry *zram_entry_alloc(struct zram *zram, + unsigned int len, gfp_t flags) +{ + struct zram_entry *entry; + + entry = 
kzalloc(sizeof(*entry), + flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE)); + if (!entry) + return NULL; + + entry->handle = zs_malloc(zram->mem_pool, len, flags); + if (!entry->handle) { + kfree(entry); + return NULL; + } + + return entry; +} + +static inline void zram_entry_free(struct zram *zram, + struct zram_entry *entry) +{ + zs_free(zram->mem_pool, entry->handle); + kfree(entry); +} + static void zram_meta_free(struct zram *zram, u64 disksize) { size_t num_pages = disksize >> PAGE_SHIFT; @@ -514,7 +541,7 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize) */ static void zram_free_page(struct zram *zram, size_t index) { - unsigned long handle = zram_get_handle(zram, index); + struct zram_entry *entry = zram_get_entry(zram, index); /* * No memory is allocated for same element filled pages. @@ -527,23 +554,23 @@ static void zram_free_page(struct zram *zram, size_t index) return; } - if (!handle) + if (!entry) return; - zs_free(zram->mem_pool, handle); + zram_entry_free(zram, entry); atomic64_sub(zram_get_obj_size(zram, index), >stats.compr_data_size); atomic64_dec(>stats.pages_stored); - zram_set_handle(zram, index, 0); + zram_set_entry(zram, index, NULL); zram_set_obj_size(zram, index, 0); } static int zram_decompress_page(struct zram *zram, struct page *page, u32 index) { int ret; - unsigned long handle; + struct zram_entry *entry; unsigned int size; void *src, *dst; @@ -551,10 +578,10 @@ static int zram_decompress_page(struct zram *zram, struct page *page, u32 index) return 0; zram_slot_lock(zram, index); - handle = zram_get_handle(zram, index); + entry = zram_get_entry(zram, index); size = zram_get_obj_size(zram, index); - src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO); + src = zs_map_object(zram->mem_pool, entry->handle, ZS_MM_RO); if (size == PAGE_SIZE) { dst = kmap_atomic(page); memcpy(dst, src, PAGE_SIZE); @@ -568,7 +595,7 @@ static int zram_decompress_page(struct zram *zram, struct page *page, u32 index) kunmap_atomic(dst); 
zcomp_stream_put(zram->comp); } - zs_unmap_object(zram->mem_pool, handle); + zs_unmap_object(zram->mem_pool, entry->handle);
[PATCH v5 4/4] zram: compare all the entries with same checksum for deduplication
From: Joonsoo Kim

Until now, we compare just one entry with the same checksum when checking duplication, since it is the simplest way to implement. However, for completeness, checking all the entries is better, so this patch implements comparing all the entries with the same checksum. Since this event would be rare, there would be no performance loss.

Reviewed-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Signed-off-by: Joonsoo Kim
---
 drivers/block/zram/zram_dedup.c | 59 -
 1 file changed, 47 insertions(+), 12 deletions(-)

diff --git a/drivers/block/zram/zram_dedup.c b/drivers/block/zram/zram_dedup.c
index 560b1f5..14c4988 100644
--- a/drivers/block/zram/zram_dedup.c
+++ b/drivers/block/zram/zram_dedup.c
@@ -109,6 +109,51 @@ static unsigned long zram_dedup_put(struct zram *zram,
 	return entry->refcount;
 }
 
+static struct zram_entry *__zram_dedup_get(struct zram *zram,
+		struct zram_hash *hash, unsigned char *mem,
+		struct zram_entry *entry)
+{
+	struct zram_entry *tmp, *prev = NULL;
+	struct rb_node *rb_node;
+
+	/* find left-most entry with same checksum */
+	while ((rb_node = rb_prev(&entry->rb_node))) {
+		tmp = rb_entry(rb_node, struct zram_entry, rb_node);
+		if (tmp->checksum != entry->checksum)
+			break;
+
+		entry = tmp;
+	}
+
+again:
+	entry->refcount++;
+	atomic64_add(entry->len, &zram->stats.dup_data_size);
+	spin_unlock(&hash->lock);
+
+	if (prev)
+		zram_entry_free(zram, prev);
+
+	if (zram_dedup_match(zram, entry, mem))
+		return entry;
+
+	spin_lock(&hash->lock);
+	tmp = NULL;
+	rb_node = rb_next(&entry->rb_node);
+	if (rb_node)
+		tmp = rb_entry(rb_node, struct zram_entry, rb_node);
+
+	if (tmp && (tmp->checksum == entry->checksum)) {
+		prev = entry;
+		entry = tmp;
+		goto again;
+	}
+
+	spin_unlock(&hash->lock);
+	zram_entry_free(zram, entry);
+
+	return NULL;
+}
+
 static struct zram_entry *zram_dedup_get(struct zram *zram,
 		unsigned char *mem, u32 checksum)
 {
@@ -122,18 +167,8 @@ static struct zram_entry *zram_dedup_get(struct zram *zram,
 	rb_node = hash->rb_root.rb_node;
 	while (rb_node) {
 		entry = 
rb_entry(rb_node, struct zram_entry, rb_node);
-		if (checksum == entry->checksum) {
-			entry->refcount++;
-			atomic64_add(entry->len, &zram->stats.dup_data_size);
-			spin_unlock(&hash->lock);
-
-			if (zram_dedup_match(zram, entry, mem))
-				return entry;
-
-			zram_entry_free(zram, entry);
-
-			return NULL;
-		}
+		if (checksum == entry->checksum)
+			return __zram_dedup_get(zram, hash, mem, entry);
 
 		if (checksum < entry->checksum)
 			rb_node = rb_node->rb_left;
-- 
2.7.4
[PATCH v5 2/4] zram: implement deduplication in zram
From: Joonsoo Kim This patch implements deduplication feature in zram. The purpose of this work is naturally to save amount of memory usage by zram. Android is one of the biggest users to use zram as swap and it's really important to save amount of memory usage. There is a paper that reports that duplication ratio of Android's memory content is rather high [1]. And, there is a similar work on zswap that also reports that experiments has shown that around 10-15% of pages stored in zswp are duplicates and deduplicate them provides some benefits [2]. Also, there is a different kind of workload that uses zram as blockdev and store build outputs into it to reduce wear-out problem of real blockdev. In this workload, deduplication hit is very high due to temporary files and intermediate object files. Detailed analysis is on the bottom of this description. Anyway, if we can detect duplicated content and avoid to store duplicated content at different memory space, we can save memory. This patch tries to do that. Implementation is almost simple and intuitive but I should note one thing about implementation detail. To check duplication, this patch uses checksum of the page and collision of this checksum could be possible. There would be many choices to handle this situation but this patch chooses to allow entry with duplicated checksum to be added to the hash, but, not to compare all entries with duplicated checksum when checking duplication. I guess that checksum collision is quite rare event and we don't need to pay any attention to such a case. Therefore, I decided the most simplest way to implement the feature. If there is a different opinion, I can accept and go that way. Following is the result of this patch. 
Test result #1 (Swap): Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18 orig_data_size: 145297408 compr_data_size: 32408125 mem_used_total: 32276480 dup_data_size: 3188134 meta_data_size: 1444272 Last two metrics added to mm_stat are related to this work. First one, dup_data_size, is amount of saved memory by avoiding to store duplicated page. Later one, meta_data_size, is the amount of data structure to support deduplication. If dup > meta, we can judge that the patch improves memory usage. In Adnroid, we can save 5% of memory usage by this work. Test result #2 (Blockdev): build the kernel and store output to ext4 FS on zram Elapsed time: 249 s mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0 Elapsed time: 250 s mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038 3945792 There is no performance degradation and save 23% memory. Test result #3 (Blockdev): copy android build output dir(out/host) to ext4 FS on zram Elapsed time: out/host: 88 s mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0 Elapsed time: out/host: 100 s mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336 It shows performance degradation roughly 13% and save 24% memory. Maybe, it is due to overhead of calculating checksum and comparison. Test result #4 (Blockdev): copy android build output dir(out/target/common) to ext4 FS on zram Elapsed time: out/host: 203 s mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0 Elapsed time: out/host: 201 s mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336 Memory is saved by 42% and performance is the same. Even if there is overhead of calculating checksum and comparison, large hit ratio compensate it since hit leads to less compression attempt. I checked the detailed reason of savings on kernel build workload and there are some cases that deduplication happens. 
1) *.cmd Build command is usually similar in one directory so content of these file are very similar. In my system, more than 789 lines in fs/ext4/.namei.o.cmd and fs/ext4/.inode.o.cmd are the same in 944 and 938 lines of the file, respectively. 2) intermediate object files built-in.o and temporary object file have the similar contents. More than 50% of fs/ext4/ext4.o is the same with fs/ext4/built-in.o. 3) vmlinux .tmp_vmlinux1 and .tmp_vmlinux2 and arch/x86/boo/compressed/vmlinux.bin have the similar contents. Android test has similar case that some of object files(.class and .so) are similar with another ones. (./host/linux-x86/lib/libartd.so and ./host/linux-x86-lib/libartd-comiler.so) Anyway, benefit seems to be largely dependent on the workload so following patch will make this feature optional. However, this feature can help some usecases so is deserved to be merged. [1]: MemScope: Analyzing Memory Duplication on Android Systems, dl.acm.org/citation.cfm?id=2797023 [2]: zswap: Optimize compressed pool memory utilization, lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2 Reviewed-by: Sergey Senozhatsky Acked-by: Minchan Kim Signed-off-by: Joonsoo Kim --- Documentation/blockdev/zram.txt | 2 + drivers/block/zram/Makefile | 2 +-
[PATCH v5 0/4] zram: implement deduplication in zram
From: Joonsoo Kim

Changes from v4
o rebase on next-20170511
o collect Reviewed-by from Sergey

Changes from v3
o fix module build problem
o make zram_dedup_enabled() statically return false if CONFIG_ZRAM_DEDUP=n

Changes from v2
o rebase to latest zram code
o manage alloc/free of the zram_entry in zram_drv.c
o remove useless RB_CLEAR_NODE
o set RO permission for the use_dedup sysfs entry if CONFIG_ZRAM_DEDUP=n

Changes from v1
o reorganize dedup-specific functions
o support Kconfig on/off
o update zram documents
o compare all the entries with same checksum (patch #4)

This patchset implements a deduplication feature in zram. The motivation is to save memory used by zram. There is a detailed description of the motivation and experimental results in patch #2, so please refer to it.

Thanks.

Joonsoo Kim (4):
zram: introduce zram_entry to prepare dedup functionality
zram: implement deduplication in zram
zram: make deduplication feature optional
zram: compare all the entries with same checksum for deduplication

Documentation/ABI/testing/sysfs-block-zram | 10 ++
Documentation/blockdev/zram.txt| 3 +
drivers/block/zram/Kconfig | 14 ++
drivers/block/zram/Makefile| 3 +-
drivers/block/zram/zram_dedup.c| 254 +
drivers/block/zram/zram_dedup.h| 45 +
drivers/block/zram/zram_drv.c | 184 +
drivers/block/zram/zram_drv.h | 32 +++-
8 files changed, 507 insertions(+), 38 deletions(-)
create mode 100644 drivers/block/zram/zram_dedup.c
create mode 100644 drivers/block/zram/zram_dedup.h
--
2.7.4
[PATCH v5 1/4] zram: introduce zram_entry to prepare dedup functionality
From: Joonsoo Kim

The following patch will implement deduplication functionality in zram, and it requires an indirection layer to manage the life cycle of zsmalloc handles. To prepare for that, this patch introduces zram_entry, which can be used to manage the life cycle of a zsmalloc handle. Many lines are changed due to the rename, but the core change is just the introduction of the new data structure.

Reviewed-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Signed-off-by: Joonsoo Kim
---
drivers/block/zram/zram_drv.c | 95 +++
drivers/block/zram/zram_drv.h | 6 ++-
2 files changed, 66 insertions(+), 35 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index debee95..26dc4e5 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -57,14 +57,15 @@ static inline struct zram *dev_to_zram(struct device *dev)
 return (struct zram *)dev_to_disk(dev)->private_data;
 }
-static unsigned long zram_get_handle(struct zram *zram, u32 index)
+static struct zram_entry *zram_get_entry(struct zram *zram, u32 index)
 {
- return zram->table[index].handle;
+ return zram->table[index].entry;
 }
-static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle)
+static void zram_set_entry(struct zram *zram, u32 index,
+ struct zram_entry *entry)
 {
- zram->table[index].handle = handle;
+ zram->table[index].entry = entry;
 }
 /* flag operations require table entry bit_spin_lock() being held */
@@ -437,7 +438,7 @@ static bool zram_same_page_read(struct zram *zram, u32 index,
 unsigned int offset, unsigned int len)
 {
 zram_slot_lock(zram, index);
- if (unlikely(!zram_get_handle(zram, index) ||
+ if (unlikely(!zram_get_entry(zram, index) ||
 zram_test_flag(zram, index, ZRAM_SAME))) {
 void *mem;
@@ -476,6 +477,32 @@ static bool zram_same_page_write(struct zram *zram, u32 index,
 return false;
 }
+static struct zram_entry *zram_entry_alloc(struct zram *zram,
+ unsigned int len, gfp_t flags)
+{
+ struct zram_entry *entry;
+
+ entry = kzalloc(sizeof(*entry),
+ flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+ if (!entry)
+ return NULL;
+
+ entry->handle = zs_malloc(zram->mem_pool, len, flags);
+ if (!entry->handle) {
+ kfree(entry);
+ return NULL;
+ }
+
+ return entry;
+}
+
+static inline void zram_entry_free(struct zram *zram,
+ struct zram_entry *entry)
+{
+ zs_free(zram->mem_pool, entry->handle);
+ kfree(entry);
+}
+
 static void zram_meta_free(struct zram *zram, u64 disksize)
 {
 size_t num_pages = disksize >> PAGE_SHIFT;
@@ -514,7 +541,7 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 */
 static void zram_free_page(struct zram *zram, size_t index)
 {
- unsigned long handle = zram_get_handle(zram, index);
+ struct zram_entry *entry = zram_get_entry(zram, index);
 /*
 * No memory is allocated for same element filled pages.
@@ -527,23 +554,23 @@ static void zram_free_page(struct zram *zram, size_t index)
 return;
 }
- if (!handle)
+ if (!entry)
 return;
- zs_free(zram->mem_pool, handle);
+ zram_entry_free(zram, entry);
 atomic64_sub(zram_get_obj_size(zram, index),
 &zram->stats.compr_data_size);
 atomic64_dec(&zram->stats.pages_stored);
- zram_set_handle(zram, index, 0);
+ zram_set_entry(zram, index, NULL);
 zram_set_obj_size(zram, index, 0);
 }
 static int zram_decompress_page(struct zram *zram, struct page *page, u32 index)
 {
 int ret;
- unsigned long handle;
+ struct zram_entry *entry;
 unsigned int size;
 void *src, *dst;
@@ -551,10 +578,10 @@ static int zram_decompress_page(struct zram *zram, struct page *page, u32 index)
 return 0;
 zram_slot_lock(zram, index);
- handle = zram_get_handle(zram, index);
+ entry = zram_get_entry(zram, index);
 size = zram_get_obj_size(zram, index);
- src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
+ src = zs_map_object(zram->mem_pool, entry->handle, ZS_MM_RO);
 if (size == PAGE_SIZE) {
 dst = kmap_atomic(page);
 memcpy(dst, src, PAGE_SIZE);
@@ -568,7 +595,7 @@ static int zram_decompress_page(struct zram *zram, struct page *page, u32 index)
 kunmap_atomic(dst);
 zcomp_stream_put(zram->comp);
 }
- zs_unmap_object(zram->mem_pool, handle);
+ zs_unmap_object(zram->mem_pool, entry->handle);
 zram_slot_unlock(zram, index);
 /* Should NEVER happen. Return bio error if it does. */
@@ -612,14 +639,14 @@
[PATCH v5 4/4] zram: compare all the entries with same checksum for deduplication
From: Joonsoo Kim

Until now, we compare just one entry with the same checksum when checking for duplication, since that is the simplest way to implement it. However, for completeness, checking all the entries is better, so this patch implements comparing all the entries with the same checksum. Since this event is rare, there should be no performance loss.

Reviewed-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Signed-off-by: Joonsoo Kim
---
drivers/block/zram/zram_dedup.c | 59 -
1 file changed, 47 insertions(+), 12 deletions(-)

diff --git a/drivers/block/zram/zram_dedup.c b/drivers/block/zram/zram_dedup.c
index 560b1f5..14c4988 100644
--- a/drivers/block/zram/zram_dedup.c
+++ b/drivers/block/zram/zram_dedup.c
@@ -109,6 +109,51 @@ static unsigned long zram_dedup_put(struct zram *zram,
 return entry->refcount;
 }
+static struct zram_entry *__zram_dedup_get(struct zram *zram,
+ struct zram_hash *hash, unsigned char *mem,
+ struct zram_entry *entry)
+{
+ struct zram_entry *tmp, *prev = NULL;
+ struct rb_node *rb_node;
+
+ /* find left-most entry with same checksum */
+ while ((rb_node = rb_prev(&entry->rb_node))) {
+ tmp = rb_entry(rb_node, struct zram_entry, rb_node);
+ if (tmp->checksum != entry->checksum)
+ break;
+
+ entry = tmp;
+ }
+
+again:
+ entry->refcount++;
+ atomic64_add(entry->len, &zram->stats.dup_data_size);
+ spin_unlock(&hash->lock);
+
+ if (prev)
+ zram_entry_free(zram, prev);
+
+ if (zram_dedup_match(zram, entry, mem))
+ return entry;
+
+ spin_lock(&hash->lock);
+ tmp = NULL;
+ rb_node = rb_next(&entry->rb_node);
+ if (rb_node)
+ tmp = rb_entry(rb_node, struct zram_entry, rb_node);
+
+ if (tmp && (tmp->checksum == entry->checksum)) {
+ prev = entry;
+ entry = tmp;
+ goto again;
+ }
+
+ spin_unlock(&hash->lock);
+ zram_entry_free(zram, entry);
+
+ return NULL;
+}
+
 static struct zram_entry *zram_dedup_get(struct zram *zram,
 unsigned char *mem, u32 checksum)
 {
@@ -122,18 +167,8 @@ static struct zram_entry *zram_dedup_get(struct zram *zram,
 rb_node = hash->rb_root.rb_node;
 while (rb_node) {
 entry = rb_entry(rb_node, struct zram_entry, rb_node);
- if (checksum == entry->checksum) {
- entry->refcount++;
- atomic64_add(entry->len, &zram->stats.dup_data_size);
- spin_unlock(&hash->lock);
-
- if (zram_dedup_match(zram, entry, mem))
- return entry;
-
- zram_entry_free(zram, entry);
-
- return NULL;
- }
+ if (checksum == entry->checksum)
+ return __zram_dedup_get(zram, hash, mem, entry);
 if (checksum < entry->checksum)
 rb_node = rb_node->rb_left;
--
2.7.4
Re: [PATCH] mm: per-cgroup memory reclaim stats
On Thu, 2017-05-11 at 20:16 +0100, Roman Gushchin wrote: > Track the following reclaim counters for every memory cgroup: > PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and > PGLAZYFREED. > > These values are exposed using the memory.stats interface of cgroup v2. The changelog could also describe what is currently there and what is missing. Is there a user space program that uses the newly exported values? Or is this just for consistency of stats? > > The meaning of each value is the same as for global counters, > available using /proc/vmstat. > > Also, for consistency, rename mem_cgroup_count_vm_event() to > count_memcg_event_mm(). > I still prefer the mem_cgroup_count_vm_event() name, or memcg_count_vm_event(), the namespace upfront makes it easier to parse where to look for the the implementation and also grep. In any case the rename should be independent patch, but I don't like the name you've proposed. > Signed-off-by: Roman Gushchin> Suggested-by: Johannes Weiner > Cc: Johannes Weiner > Cc: Tejun Heo > Cc: Li Zefan > Cc: Michal Hocko > Cc: Vladimir Davydov > Cc: cgro...@vger.kernel.org > Cc: linux-...@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: linux...@kvack.org > --- > Documentation/cgroup-v2.txt | 28 ++ > fs/dax.c| 2 +- > fs/ncpfs/mmap.c | 2 +- > include/linux/memcontrol.h | 48 > ++--- > mm/filemap.c| 2 +- > mm/memcontrol.c | 10 ++ > mm/memory.c | 4 ++-- > mm/shmem.c | 3 +-- > mm/swap.c | 1 + > mm/vmscan.c | 30 +--- > 10 files changed, 113 insertions(+), 17 deletions(-) > > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt > index e50b95c..5804355 100644 > --- a/Documentation/cgroup-v2.txt > +++ b/Documentation/cgroup-v2.txt > @@ -918,6 +918,34 @@ PAGE_SIZE multiple when read back. 
> > Number of major page faults incurred > > + pgrefill > + > + Amount of scanned pages (in an active LRU list) > + > + pgscan > + > + Amount of scanned pages (in an inactive LRU list) > + > + pgsteal > + > + Amount of reclaimed pages > + > + pgactivate > + > + Amount of pages moved to the active LRU list > + > + pgdeactivate > + > + Amount of pages moved to the inactive LRU lis > + > + pglazyfree > + > + Amount of pages postponed to be freed under memory pressure > + > + pglazyfreed > + > + Amount of reclaimed lazyfree pages > + >memory.swap.current > > A read-only single value file which exists on non-root > diff --git a/fs/dax.c b/fs/dax.c > index 66d7906..9aac521d 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -1230,7 +1230,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, > case IOMAP_MAPPED: > if (iomap.flags & IOMAP_F_NEW) { > count_vm_event(PGMAJFAULT); > - mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT); > + count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); > major = VM_FAULT_MAJOR; > } > error = dax_insert_mapping(mapping, iomap.bdev, iomap.dax_dev, > diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c > index 0c3905e..6719c0b 100644 > --- a/fs/ncpfs/mmap.c > +++ b/fs/ncpfs/mmap.c > @@ -89,7 +89,7 @@ static int ncp_file_mmap_fault(struct vm_fault *vmf) >* -- nyc >*/ > count_vm_event(PGMAJFAULT); > - mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT); > + count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); > return VM_FAULT_MAJOR; > } > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 899949b..b2a5b1c 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -357,6 +357,17 @@ static inline unsigned short mem_cgroup_id(struct > mem_cgroup *memcg) > } > struct mem_cgroup *mem_cgroup_from_id(unsigned short id); > > +static inline struct mem_cgroup *lruvec_memcg(struct lruvec *lruvec) mem_cgroup_from_lruvec()? 
> +{ > + struct mem_cgroup_per_node *mz; > + > + if (mem_cgroup_disabled()) > + return NULL; > + > + mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); > + return mz->memcg; > +} > + > /** > * parent_mem_cgroup - find the accounting parent of a memcg > * @memcg: memcg whose parent to find > @@ -546,8 +557,23 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t > *pgdat, int order, > gfp_t gfp_mask, > unsigned long *total_scanned); > >
[PATCH 2/2] mm: swap: move anonymous THP split logic to vmscan
add_to_swap() aims to allocate swap space (ie, a swap slot and swapcache), so if it fails due to lack of space (e.g. for a THP, or an hdd swap device asked to do a THP swapout), the *caller*, rather than add_to_swap() itself, should split the THP page and retry with base pages, which is more natural.

Cc: Johannes Weiner
Signed-off-by: Minchan Kim
---
include/linux/swap.h | 4 ++--
mm/swap_state.c | 23 ++-
mm/vmscan.c | 17 -
3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8f12f67e869f..87cca2169d44 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -359,7 +359,7 @@ extern struct address_space *swapper_spaces[];
 >> SWAP_ADDRESS_SPACE_SHIFT])
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
-extern int add_to_swap(struct page *, struct list_head *list);
+extern int add_to_swap(struct page *);
 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
 extern void __delete_from_swap_cache(struct page *);
@@ -479,7 +479,7 @@ static inline struct page *lookup_swap_cache(swp_entry_t swp)
 return NULL;
 }
-static inline int add_to_swap(struct page *page, struct list_head *list)
+static inline int add_to_swap(struct page *page)
 {
 return 0;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0ad214d7a7ad..9c71b6b2562f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -184,7 +184,7 @@ void __delete_from_swap_cache(struct page *page)
 * Allocate swap space for the page and add the page to the
 * swap cache. Caller needs to hold the page lock.
 */
-int add_to_swap(struct page *page, struct list_head *list)
+int add_to_swap(struct page *page)
 {
 swp_entry_t entry;
 int err;
@@ -192,12 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 VM_BUG_ON_PAGE(!PageLocked(page), page);
 VM_BUG_ON_PAGE(!PageUptodate(page), page);
-retry:
 entry = get_swap_page(page);
 if (!entry.val)
- goto fail;
+ return 0;
+
 if (mem_cgroup_try_charge_swap(page, entry))
- goto fail_free;
+ goto fail;
 /*
 * Radix-tree node allocations from PF_MEMALLOC contexts could
@@ -218,23 +218,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 * clear SWAP_HAS_CACHE flag.
 */
- goto fail_free;
-
- if (PageTransHuge(page)) {
- err = split_huge_page_to_list(page, list);
- if (err) {
- delete_from_swap_cache(page);
- return 0;
- }
- }
+ goto fail;
 return 1;
-fail_free:
- put_swap_page(page, entry);
 fail:
- if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
- goto retry;
+ put_swap_page(page, entry);
 return 0;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 57268c0c8fcf..767e20856080 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1125,8 +1125,23 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 !PageSwapCache(page)) {
 if (!(sc->gfp_mask & __GFP_IO))
 goto keep_locked;
- if (!add_to_swap(page, page_list))
+ if (!add_to_swap(page)) {
+ if (!PageTransHuge(page))
+ goto activate_locked;
+ /* Split THP and swap individual base pages */
+ if (split_huge_page_to_list(page, page_list))
+ goto activate_locked;
+ if (!add_to_swap(page))
+ goto activate_locked;
+ }
+
+ /* XXX: We don't support THP writes */
+ if (PageTransHuge(page) &&
+ split_huge_page_to_list(page, page_list)) {
+ delete_from_swap_cache(page);
 goto activate_locked;
+ }
+
 may_enter_fs = 1;
 /* Adding to swap updated mapping */
--
2.7.4
[PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page
Now get_swap_page() takes a struct page and allocates swap space according to the page size (ie, normal or THP), so it is cleaner to introduce put_swap_page() as its counterpart. It then calls the right swap slot free function depending on the page's size.

Cc: Johannes Weiner
Signed-off-by: Minchan Kim
---
include/linux/swap.h | 4 ++--
mm/shmem.c | 2 +-
mm/swap_state.c | 13 +++--
mm/swapfile.c| 8
mm/vmscan.c | 2 +-
5 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b60fea3748f8..8f12f67e869f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -393,6 +393,7 @@ static inline long get_nr_swap_pages(void)
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(struct page *page);
+extern void put_swap_page(struct page *page, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int get_swap_pages(int n, bool cluster, swp_entry_t swp_entries[]);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
@@ -400,7 +401,6 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
@@ -459,7 +459,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
-static inline void swapcache_free(swp_entry_t swp)
+static inline void put_swap_page(struct page *page, swp_entry_t swp)
 {
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 29948d7da172..82158edaefdb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1326,7 +1326,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 mutex_unlock(&shmem_swaplist_mutex);
 free_swap:
- swapcache_free(swap);
+ put_swap_page(page, swap);
 redirty:
 set_page_dirty(page);
 if (wbc->for_reclaim)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 16ff89d058f4..0ad214d7a7ad 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -231,10 +231,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 return 1;
 fail_free:
- if (PageTransHuge(page))
- swapcache_free_cluster(entry);
- else
- swapcache_free(entry);
+ put_swap_page(page, entry);
 fail:
 if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
 goto retry;
@@ -259,11 +256,7 @@ void delete_from_swap_cache(struct page *page)
 __delete_from_swap_cache(page);
 spin_unlock_irq(&address_space->tree_lock);
- if (PageTransHuge(page))
- swapcache_free_cluster(entry);
- else
- swapcache_free(entry);
-
+ put_swap_page(page, entry);
 page_ref_sub(page, hpage_nr_pages(page));
 }
@@ -415,7 +408,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 * clear SWAP_HAS_CACHE flag.
 */
- swapcache_free(entry);
+ put_swap_page(new_page, entry);
 } while (err != -ENOMEM);
 if (new_page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 596306272059..b65e49428090 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
 }
 #endif /* CONFIG_THP_SWAP */
+void put_swap_page(struct page *page, swp_entry_t entry)
+{
+ if (!PageTransHuge(page))
+ swapcache_free(entry);
+ else
+ swapcache_free_cluster(entry);
+}
+
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 const swp_entry_t *e1 = ent1, *e2 = ent2;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ebf468c5429..57268c0c8fcf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -708,7 +708,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 mem_cgroup_swapout(page, swap);
 __delete_from_swap_cache(page);
 spin_unlock_irqrestore(&mapping->tree_lock, flags);
- swapcache_free(swap);
+ put_swap_page(page, swap);
 } else {
 void (*freepage)(struct page *);
 void *shadow = NULL;
--
2.7.4
Re: [PATCH v8 05/10] powerpc/perf: IMC pmu cpumask and cpuhotplug support
Madhavan Srinivasan writes: >> * in patch 9 should opal_imc_counters_init return something other >>than OPAL_SUCCESS in the case of invalid arguments? Maybe >>OPAL_PARAMETER? (I think you fix this in a later patch anyway?) > > So, init call will return OPAL_PARAMETER for the unsupported > domains (core and nest are supported). And if the init operation > fails for any reason, it would return OPAL_HARDWARE. And this is > documented. (I'll comment on the skiboot one too), but I think that if the class exists but init is a no-op, then OPAL_IMC_COUNTERS_INIT should return OPAL_SUCCESS and just do nothing. This future-proofs everything, and the API is that one *must* call _INIT before start. -- Stewart Smith OPAL Architect, IBM.
Re: [v6 PATCH 07/21] x86/insn-eval: Add utility function to get segment descriptor
On Thu, 2017-05-04 at 13:02 +0200, Borislav Petkov wrote: > On Wed, Apr 26, 2017 at 02:51:56PM -0700, Ricardo Neri wrote: > > > > +seg >= > > > > current->active_mm->context.ldt->size)) { > > > > > > ldt->size is the size of the descriptor table but you've shifted seg by > > > 3. That selector index is shifted by 3 (to the left) to form an offset > > > into the descriptor table because the entries there are 8 bytes. > > > > I double-checked the ldt code and it seems to me that size refers to the > > number of entries in the table; it is always multiplied by > > LDT_ENTRY_SIZE [1], [2]. Am I missing something? > > No, you're not. I fell into that wrongly named struct member trap. > > So ldt_struct.size should actually be called ldt_struct.n_entries or > similar. Because what's in there is now is not "size". > > And then code like > > new_ldt->size * LDT_ENTRY_SIZE > > would make much more sense if written like this: > > new_ldt->n_entries * LDT_ENTRY_SIZE > > Would you fix that in a prepatch pls? > Sure I can. Would this trigger a v8 of my series? I was hoping the v7 series could be merged and that I could then do incremental work on top of it. Does that make sense? Thanks and BR, Ricardo
RE: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf
Hi Alex and Gerd, >-Original Message- >From: intel-gvt-dev [mailto:intel-gvt-dev-boun...@lists.freedesktop.org] On >Behalf Of Alex Williamson >Sent: Thursday, May 11, 2017 11:45 PM >To: Gerd Hoffmann >Cc: Tian, Kevin ; intel-...@lists.freedesktop.org; linux- >ker...@vger.kernel.org; zhen...@linux.intel.com; Lv, Zhiyuan > ; Chen, Xiaoguang ; intel- >gvt-...@lists.freedesktop.org; Wang, Zhi A >Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf > >On Thu, 11 May 2017 15:27:53 +0200 >Gerd Hoffmann wrote: > >> Hi, >> >> > While read the framebuffer region we have to tell the vendor driver which >framebuffer we want to read? There are two framebuffers now in KVMGT that is >primary and cursor. >> > There are two methods to implement this: >> > 1) write the plane id first and then read the framebuffer. >> > 2) create 2 vfio regions one for primary and one for cursor. >> >> (3) Place information for both planes into one vfio region. >> Which allows to fetch both with a single read() syscall. >> >> The question is how you'll get the file descriptor then. If the ioctl >> returns the dma-buf fd only you have a racy interface: Things can >> change between read(vfio-region) and ioctl(need-dmabuf-fd). >> >> ioctl(need-dma-buf) could return both dmabuf fd and plane info to fix >> the race, but then it is easier to go with ioctl only interface >> (simliar to the orginal one from dec last year) I think. > >If the dmabuf fd is provided by a separate mdev vendor driver specific ioctl, I >don't see how vfio regions should be involved. Selecting which framebuffer >should be an ioctl parameter. Based on your last mail, I think the implementation looks like this: 1) the user queries the framebuffer information by reading the vfio region. 2) if the framebuffer changed (such as the framebuffer's graphics address changed, size changed, etc.) we will need to create a new dmabuf fd. 3) create a new dmabuf fd using a vfio device specific ioctl. 
>What sort of information needs to be conveyed >about each plane? Only the plane id is needed. >Is it static information or something that needs to be read >repeatedly? It is static information. For our case plane id 1 represents the primary plane and 3 the cursor plane. 2 means the sprite plane, which will not be used in our case. >Do we need it before we get the dmabuf fd or can it be an ioctl on >the dmabuf fd? We need it while querying the framebuffer. In the kernel we need the plane id to decide which plane we should decode. Below is my current implementation: 1) the user first queries the framebuffer (primary or cursor); the kernel decodes the framebuffer, returns the framebuffer information to the user and also saves a copy in the kernel. 2) the user compares the framebuffer and, if the framebuffer changed, creates a new dmabuf fd. 3) the kernel creates a new dmabuf fd based on the saved framebuffer information. So we need the plane id in step 1. In step 3 we create a dmabuf fd using only the saved framebuffer information (no other information is needed). Chenxg. >Thanks, > >Alex >___ >intel-gvt-dev mailing list >intel-gvt-...@lists.freedesktop.org >https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
Re: [v6 PATCH 08/21] x86/insn-eval: Add utility function to get segment descriptor base address
On Fri, 2017-05-05 at 19:19 +0200, Borislav Petkov wrote: > On Wed, Apr 26, 2017 at 03:37:44PM -0700, Ricardo Neri wrote: > > I need a human-readable way of identifying what segment selector (in > > pt_regs, vm86regs or directly reading the segment registers) to use. > > Since there is a segment override prefix for all of them, I thought I > > could use them. > > Yes, you should... > > > Perhaps I can rename enum segment to enum segment_selector and comment > > that the values in the enum are those of the override prefixes. Would > > that be reasonable? > > ... but you should call them what they are: "enum seg_override_pfxs" or > "enum seg_ovr_pfx" or... > > Or somesuch. I suck at naming stuff. In my v7, I simply named my enumeration enum segment_register, which is what they are. Some of its entries happen to have the value of the segment override prefixes, but it also has special entries such as SEG_REG_INVAL for errors and SEG_REG_IGNORE for long mode [1]. Thanks and BR, Ricardo [1]. https://lkml.org/lkml/2017/5/5/405
Re: [v6 PATCH 08/21] x86/insn-eval: Add utility function to get segment descriptor base address
On Fri, 2017-05-05 at 19:28 +0200, Borislav Petkov wrote: > On Wed, Apr 26, 2017 at 03:52:41PM -0700, Ricardo Neri wrote: > > Probably insn_get_seg_base() itself can verify if there are segment > > override prefixes in the struct insn. If yes, use them except for > > specific cases such as CS. > > ... and depending on whether in long mode or not. Yes, in my v7 I ignore the segment register if we are in long mode [1]. > > > On an unrelated note, I still have the problem of using DS vs ES for > > string instructions. Perhaps instead of a use_default_seg flag, a > > string_instruction flag that indicates how to determine the default > > segment. > > ... or you can look at the insn opcode directly. AFAICT, you need > to check whether the opcode is 0xa4 or 0xa5 and that the insn is a > single-byte opcode, i.e., not from the secondary map escaped with 0xf or > some of the other multi-byte opcode maps. In my v7, I have added a section to my function resolve_seg_register() that ignores segment overrides when it sees a string instruction with the register EDI, and defaults to ES. If the register is EIP, it defaults to CS. To determine if an instruction is a string instruction I check the size of the opcode and the opcodes that you mention, plus others, based on the Intel Software Developer's Manual [2]. [1]. https://lkml.org/lkml/2017/5/5/405 [2]. https://lkml.org/lkml/2017/5/5/410 Thanks and BR, Ricardo > > -- > Regards/Gruss, > Boris. > > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB > 21284 (AG Nürnberg)
[PATCH 3/5] x86: msr-index.h: fix shifts to ULL results in HWP macros.
From: Len Brown x = 1 ulong_long = x << 32; results in: warning: left shift count >= width of type x = 8 ulong_long = x << 24; results in a sign-extended ulong_long Cast x to unsigned long long in these macros to prevent these errors. Signed-off-by: Len Brown --- arch/x86/include/asm/msr-index.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 50c0c3204a92..6da2b30781ff 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -238,13 +238,13 @@ #define HWP_MIN_PERF(x) (x & 0xff) #define HWP_MAX_PERF(x) ((x & 0xff) << 8) #define HWP_DESIRED_PERF(x) ((x & 0xff) << 16) -#define HWP_ENERGY_PERF_PREFERENCE(x) ((x & 0xff) << 24) +#define HWP_ENERGY_PERF_PREFERENCE(x) (((unsigned long long) x & 0xff) << 24) #define HWP_EPP_PERFORMANCE 0x00 #define HWP_EPP_BALANCE_PERFORMANCE 0x80 #define HWP_EPP_BALANCE_POWERSAVE 0xC0 #define HWP_EPP_POWERSAVE 0xFF -#define HWP_ACTIVITY_WINDOW(x) ((x & 0xff3) << 32) -#define HWP_PACKAGE_CONTROL(x) ((x & 0x1) << 42) +#define HWP_ACTIVITY_WINDOW(x) ((unsigned long long)(x & 0xff3) << 32) +#define HWP_PACKAGE_CONTROL(x) ((unsigned long long)(x & 0x1) << 42) /* IA32_HWP_STATUS */ #define HWP_GUARANTEED_CHANGE(x) (x & 0x1) -- 2.11.0.161.g6610af872
[PATCH 5/5] intel_pstate: use updated msr-index.h HWP.EPP values
From: Len Brown intel_pstate exports sysfs attributes for setting and observing HWP.EPP. These attributes use strings to describe 4 operating states, and inside the driver, these strings are mapped to numerical register values. The authoritative mapping between the strings and numerical HWP.EPP values is now globally defined in msr-index.h, replacing the outdated mappings that were open-coded into intel_pstate.c new old string --- --- -- 0 0 performance 128 64 balance_performance 192 128 balance_power 255 192 power Note that the HW and BIOS default value on most systems is 128, which intel_pstate will now call "balance_performance" while it used to call it "balance_power". Signed-off-by: Len Brown --- drivers/cpufreq/intel_pstate.c | 34 +++--- 1 file changed, 15 insertions(+), 19 deletions(-) diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index 50bd6d987fc3..ab8ebaeb3621 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -716,6 +716,12 @@ static const char * const energy_perf_strings[] = { "power", NULL }; +static const unsigned int epp_values[] = { + HWP_EPP_PERFORMANCE, + HWP_EPP_BALANCE_PERFORMANCE, + HWP_EPP_BALANCE_POWERSAVE, + HWP_EPP_POWERSAVE +}; static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data) { @@ -727,17 +733,14 @@ static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data) return epp; if (static_cpu_has(X86_FEATURE_HWP_EPP)) { - /* -* Range: -* 0x00-0x3F : Performance -* 0x40-0x7F : Balance performance -* 0x80-0xBF : Balance power -* 0xC0-0xFF : Power -* The EPP is a 8 bit value, but our ranges restrict the -* value which can be set. Here only using top two bits -* effectively. 
-*/ - index = (epp >> 6) + 1; + if (epp == HWP_EPP_PERFORMANCE) + return 1; + if (epp <= HWP_EPP_BALANCE_PERFORMANCE) + return 2; + if (epp <= HWP_EPP_BALANCE_POWERSAVE) + return 3; + else + return 4; } else if (static_cpu_has(X86_FEATURE_EPB)) { /* * Range: @@ -775,15 +778,8 @@ static int intel_pstate_set_energy_pref_index(struct cpudata *cpu_data, value &= ~GENMASK_ULL(31, 24); - /* -* If epp is not default, convert from index into -* energy_perf_strings to epp value, by shifting 6 -* bits left to use only top two bits in epp. -* The resultant epp need to shifted by 24 bits to -* epp position in MSR_HWP_REQUEST. -*/ if (epp == -EINVAL) - epp = (pref_index - 1) << 6; + epp = epp_values[pref_index - 1]; value |= (u64)epp << 24; ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value); -- 2.11.0.161.g6610af872
[PATCH 1/5] x86: msr-index.h: define EPB mid-points
From: Len Brown These are currently open-coded into intel_pstate.c Signed-off-by: Len Brown --- arch/x86/include/asm/msr-index.h | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 710273c617b8..a92d9bd154f6 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -462,9 +462,11 @@ #define MSR_MISC_PWR_MGMT 0x01aa #define MSR_IA32_ENERGY_PERF_BIAS 0x01b0 -#define ENERGY_PERF_BIAS_PERFORMANCE 0 -#define ENERGY_PERF_BIAS_NORMAL 6 -#define ENERGY_PERF_BIAS_POWERSAVE 15 +#define ENERGY_PERF_BIAS_PERFORMANCE 0 +#define ENERGY_PERF_BIAS_BALANCE_PERFORMANCE 4 +#define ENERGY_PERF_BIAS_NORMAL 6 +#define ENERGY_PERF_BIAS_BALANCE_POWERSAVE 8 +#define ENERGY_PERF_BIAS_POWERSAVE 15 #define MSR_IA32_PACKAGE_THERM_STATUS 0x01b1 -- 2.11.0.161.g6610af872
Re: [PATCH v7 0/7] Introduce ZONE_CMA
On Thu, May 11, 2017 at 11:13:04AM +0200, Michal Hocko wrote: > On Thu 11-05-17 11:12:43, Joonsoo Kim wrote: > > Sorry for the late response. I was on a vacation. > > > > On Tue, May 02, 2017 at 03:32:29PM +0200, Michal Hocko wrote: > > > On Tue 02-05-17 13:01:32, Joonsoo Kim wrote: > > > > On Thu, Apr 27, 2017 at 05:06:36PM +0200, Michal Hocko wrote: > > > [...] > > > > > I see this point and I agree that using a specific zone might be a > > > > > _nicer_ solution in the end but you have to consider another aspects > > > > > as > > > > > well. The main one I am worried about is a long term maintainability. > > > > > We are really out of page flags and consuming one for a rather > > > > > specific > > > > > usecase is not good. Look at ZONE_DMA. I am pretty sure that almost > > > > > no sane HW needs 16MB zone anymore, yet we have hard time to get rid > > > > > of it and so we have that memory laying around unused all the time > > > > > and blocking one page flag bit. CMA falls into a similar category > > > > > AFAIU. I wouldn't be all that surprised if a future HW will not need > > > > > CMA > > > > > allocations in few years, yet we will have to fight to get rid of it > > > > > like we do with ZONE_DMA. And not only that. We will also have to > > > > > fight > > > > > finding page flags for other more general usecases in the meantime. > > > > > > > > This maintenance problem is inherent. This problem exists even if we > > > > uses MIGRATETYPE approach. We cannot remove many hooks for CMA if a > > > > future HW will not need CMA allocation in few years. The only > > > > difference is that one takes single zone bit only for CMA user and the > > > > other approach takes many hooks that we need to take care about it all > > > > the time. > > > > > > And I consider this a big difference. Because while hooks are not nice > > > they will affect CMA users (in a sense of bugs/performance etc.). 
While > > > an additional bit consumed will affect potential future and more generic > > > features. > > > > Because these hooks are so tricky and are spread on many places, > > bugs about these hooks can be made by *non-CMA* user and they hurt > > *CMA* user. These hooks could also delay non-CMA user's development speed > > since there are many hooks about CMA and understanding how CMA is managed > > is rather difficult. > > Than make those hooks easier to maintain. Seriously this is a > non-argument. I can't understand what you said here. With zone approach, someone who isn't related to CMA don't need to understand how CMA is managed. > > [...] > > > > And all this can be isolated to CMA specific hooks with mostly minimum > > > impact to most users. I hear you saying that zone approach is more natural > > > and I would agree if we wouldn't have to care about the number of zones. > > > > I attach a solution about one more bit in page flags although I don't > > agree with your opinion that additional bit is no-go approach. Just > > note that we have already used three bits for zone encoding in some > > configuration due to ZONE_DEVICE. > > I am absolutely not happy about ZONE_DEVICE but there is _no_ other > viable solution right now. I know that people behind this are still > considering struct page over direct pfn usage but they are not in the > same situation as CMA which _can_ work without additional zone. IIUC, ZONE_DEVICE can reuse the other zone and migratetype. What they need is just struct page and separate zone is not necessarily needed. The other thing that they want is to distinguish if the page is for the ZONE_DEVICE memory or not so it can use similar approach with CMA. IMHO, there is almost nothing that _cannot_ work in S/W world. What we need to consider is just trade-off. So, please don't say impossibility again. > > If you _really_ insist on using zone for CMA then reuse ZONE_MOVABLE. > I absolutely miss why do you push a specialized zone so hard. 
As I said before, there is no fundamental issue with reusing ZONE_MOVABLE. I just don't want to reuse it because they have different characteristics. In the MM subsystem context, their characteristics are the same. However, CMA memory must be usable by the device at runtime, so a stronger allocation guarantee is needed. See offline_pages() in mm/memory_hotplug.c. It can wait up to 120 seconds to offline the memory, but CMA memory can't. And this is a design issue. I don't want to argue about why we should pursue good design. Originally, zones exist to manage different types of memory. Migratetype is not for that purpose. Using a separate zone fits the original purpose. Mixing them would be a bad design and we would easily encounter unexpected problems in the future. > > [...] > > > No, but I haven't heard any single argument that those bugs are > > > impossible to fix with the current approach. They might be harder to fix > > > but if I can chose between harder for CMA and harder for other more > > > generic HW independent features I will go for the
[PATCH 1/5] x86: msr-index.h: define EPB mid-points
From: Len Brown These are currently open-coded into intel_pstate.c Signed-off-by: Len Brown --- arch/x86/include/asm/msr-index.h | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 710273c617b8..a92d9bd154f6 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -462,9 +462,11 @@ #define MSR_MISC_PWR_MGMT 0x01aa #define MSR_IA32_ENERGY_PERF_BIAS 0x01b0 -#define ENERGY_PERF_BIAS_PERFORMANCE 0 -#define ENERGY_PERF_BIAS_NORMAL6 -#define ENERGY_PERF_BIAS_POWERSAVE 15 +#define ENERGY_PERF_BIAS_PERFORMANCE 0 +#define ENERGY_PERF_BIAS_BALANCE_PERFORMANCE 4 +#define ENERGY_PERF_BIAS_NORMAL6 +#define ENERGY_PERF_BIAS_BALANCE_POWERSAVE 8 +#define ENERGY_PERF_BIAS_POWERSAVE 15 #define MSR_IA32_PACKAGE_THERM_STATUS 0x01b1 -- 2.11.0.161.g6610af872
Re: [PATCH v7 0/7] Introduce ZONE_CMA
On Thu, May 11, 2017 at 11:13:04AM +0200, Michal Hocko wrote:
> On Thu 11-05-17 11:12:43, Joonsoo Kim wrote:
> > Sorry for the late response. I was on a vacation.
> >
> > On Tue, May 02, 2017 at 03:32:29PM +0200, Michal Hocko wrote:
> > > On Tue 02-05-17 13:01:32, Joonsoo Kim wrote:
> > > > On Thu, Apr 27, 2017 at 05:06:36PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > I see this point and I agree that using a specific zone might be a
> > > > > _nicer_ solution in the end but you have to consider another aspects as
> > > > > well. The main one I am worried about is a long term maintainability.
> > > > > We are really out of page flags and consuming one for a rather specific
> > > > > usecase is not good. Look at ZONE_DMA. I am pretty sure that almost
> > > > > no sane HW needs 16MB zone anymore, yet we have hard time to get rid
> > > > > of it and so we have that memory laying around unused all the time
> > > > > and blocking one page flag bit. CMA falls into a similar category
> > > > > AFAIU. I wouldn't be all that surprised if a future HW will not need CMA
> > > > > allocations in few years, yet we will have to fight to get rid of it
> > > > > like we do with ZONE_DMA. And not only that. We will also have to fight
> > > > > finding page flags for other more general usecases in the meantime.
> > > >
> > > > This maintenance problem is inherent. This problem exists even if we
> > > > uses MIGRATETYPE approach. We cannot remove many hooks for CMA if a
> > > > future HW will not need CMA allocation in few years. The only
> > > > difference is that one takes single zone bit only for CMA user and the
> > > > other approach takes many hooks that we need to take care about it all
> > > > the time.
> > >
> > > And I consider this a big difference. Because while hooks are not nice
> > > they will affect CMA users (in a sense of bugs/performance etc.). While
> > > an additional bit consumed will affect potential future and more generic
> > > features.
> >
> > Because these hooks are so tricky and are spread on many places,
> > bugs about these hooks can be made by *non-CMA* user and they hurt
> > *CMA* user. These hooks could also delay non-CMA user's development speed
> > since there are many hooks about CMA and understanding how CMA is managed
> > is rather difficult.
>
> Than make those hooks easier to maintain. Seriously this is a
> non-argument.

I can't understand what you said here. With the zone approach, someone who
isn't involved with CMA doesn't need to understand how CMA is managed.

> > [...]
> > > And all this can be isolated to CMA specific hooks with mostly minimum
> > > impact to most users. I hear you saying that zone approach is more natural
> > > and I would agree if we wouldn't have to care about the number of zones.
> >
> > I attach a solution about one more bit in page flags although I don't
> > agree with your opinion that additional bit is no-go approach. Just
> > note that we have already used three bits for zone encoding in some
> > configuration due to ZONE_DEVICE.
>
> I am absolutely not happy about ZONE_DEVICE but there is _no_ other
> viable solution right now. I know that people behind this are still
> considering struct page over direct pfn usage but they are not in the
> same situation as CMA which _can_ work without additional zone.

IIUC, ZONE_DEVICE could reuse another zone and migratetype. What they need
is just the struct page; a separate zone is not strictly necessary. The
other thing they want is to distinguish whether a page belongs to
ZONE_DEVICE memory or not, so they could use a similar approach to CMA's.

IMHO, there is almost nothing that _cannot_ work in the S/W world. What we
need to consider is just the trade-off. So, please don't claim
impossibility again.

> If you _really_ insist on using zone for CMA then reuse ZONE_MOVABLE.
> I absolutely miss why do you push a specialized zone so hard.

As I said before, there is no fundamental issue in reusing ZONE_MOVABLE. I
just don't want to reuse it because the two have different
characteristics. In the MM-subsystem context their characteristics are the
same, but CMA memory must be available to the device at runtime, so a
stronger allocation guarantee is needed. See offline_pages() in
mm/memory_hotplug.c: it can tolerate up to 120 seconds to offline memory,
but CMA memory can't.

And, this is a design issue. I don't want to argue about why we should
pursue good design. Originally, zones exist to manage different types of
memory, and migratetype is not for that purpose. Using a separate zone
fits the original purpose. Mixing them would be a bad design and we would
easily run into unexpected problems in the future.

> > [...]
> > > No, but I haven't heard any single argument that those bugs are
> > > impossible to fix with the current approach. They might be harder to fix
> > > but if I can chose between harder for CMA and harder for other more
> > > generic HW independent features I will go for the
[GIT PULL] power utilities update
Hi Rafael,

Please pull these power utilities patches.

The x86_energy_perf_policy(8) utility grows the ability to manage
HWP.EPP (and Hardware P-states in general), on top of its previous
ability to manage EPB.

Linux-4.11 grew the ability to disable the cpufreq sub-system entirely
with "cpufreq.off=1". On such a system, this utility is a convenient
way to enable HWP with no kernel cpufreq overhead:

sudo x86_energy_perf_policy --cpu all normal

During this effort, we noticed the EPP defaults hard-coded into the
intel_pstate driver were not optimal, so those are updated too.

thanks!
Len Brown, Intel Open Source Technology Center

The following changes since commit c470abd4fde40ea6a0846a2beab642a578c0b8cd:

  Linux 4.10 (2017-02-19 14:34:00 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git utilities

for you to fetch changes up to 3cedbc5a6d7f7c5539e139f89ec9f6e1ed668418:

  intel_pstate: use updated msr-index.h HWP.EPP values (2017-05-11 21:27:53 -0400)

Len Brown (5):
      x86: msr-index.h: define EPB mid-points
      x86: msr-index.h: define HWP.EPP values
      x86: msr-index.h: fix shifts to ULL results in HWP macros.
      tools/power x86_energy_perf_policy: support HWP.EPP
      intel_pstate: use updated msr-index.h HWP.EPP values

 arch/x86/include/asm/msr-index.h                |   18 +-
 drivers/cpufreq/intel_pstate.c                  |   34 +-
 tools/power/x86/x86_energy_perf_policy/Makefile |   27 +-
 .../x86_energy_perf_policy.8                    |  241 +++-
 .../x86_energy_perf_policy.c                    | 1504 +---
 5 files changed, 1527 insertions(+), 297 deletions(-)
[PATCH 2/5] x86: msr-index.h: define HWP.EPP values
From: Len Brown

The Hardware Performance State request MSR has a field to express the
"Energy Performance Preference" (HWP.EPP).

Decode that field so the definition may be shared by the intel_pstate
driver and any utilities that decode the same register.

Signed-off-by: Len Brown
---
 arch/x86/include/asm/msr-index.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a92d9bd154f6..50c0c3204a92 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -239,6 +239,10 @@
 #define HWP_MIN_PERF(x)			(x & 0xff)
 #define HWP_MAX_PERF(x)			((x & 0xff) << 8)
 #define HWP_DESIRED_PERF(x)		((x & 0xff) << 16)
 #define HWP_ENERGY_PERF_PREFERENCE(x)	((x & 0xff) << 24)
+#define HWP_EPP_PERFORMANCE		0x00
+#define HWP_EPP_BALANCE_PERFORMANCE	0x80
+#define HWP_EPP_BALANCE_POWERSAVE	0xC0
+#define HWP_EPP_POWERSAVE		0xFF
 #define HWP_ACTIVITY_WINDOW(x)		((x & 0xff3) << 32)
 #define HWP_PACKAGE_CONTROL(x)		((x & 0x1) << 42)
--
2.11.0.161.g6610af872
[PATCH 4/5] tools/power x86_energy_perf_policy: support HWP.EPP
From: Len Brown

x86_energy_perf_policy(8) was created as an example of how the user, or
upper-level OS, can manage MSR_IA32_ENERGY_PERF_BIAS (EPB). Hardware
consults EPB when it makes internal decisions balancing energy-saving vs
performance. For example, should HW quickly or slowly transition into and
out of power-saving idle states? Should HW quickly or slowly ramp frequency
up or down in response to demand in the turbo-frequency range?

Depending on the processor, EPB may have package, core, or CPU-thread
scope. As such, the only general policy is to write the same value to EPB
on every CPU in the system.

Recent platforms add support for Hardware Performance States (HWP). HWP
effectively extends hardware frequency control from the opportunistic
turbo-frequency range to control the entire range of available processor
frequencies.

Just as turbo-mode used EPB, HWP can use EPB to help decide how quickly to
ramp frequency and voltage up and down in response to changing demand.
Indeed, BDX and BDX-DE, the first processors to support HWP, use EPB for
this purpose.

Starting in SKL, HWP no longer looks to EPB for influence. Instead, it
looks in a new MSR specifically for this purpose:
IA32_HWP_REQUEST.Energy_Performance_Preference (HWP.EPP).

HWP.EPP is like EPB, except that it is specific to HWP-mode frequency
selection. Also, HWP.EPP is defined to have per CPU-thread scope.

Starting in SKX, IA32_HWP_REQUEST is augmented by IA32_HWP_REQUEST_PKG --
which has the same function, but is defined to have package-wide scope. A
new bit in IA32_HWP_REQUEST determines whether it over-rides
IA32_HWP_REQUEST_PKG or not.

Note that HWP-mode can be enabled in several ways. The "in-band" method is
for HWP to be exposed in CPUID, and for the Linux intel_pstate driver to
recognize that, and thus enable HWP. In this case, starting in Linux 4.10,
intel_pstate exports the cpufreq sysfs attribute
"energy_performance_preference", which can be used to manage HWP.EPP.
This interface can be used to set HWP.EPP to these values:

	0	performance
	128	balance_performance (default)
	192	balance_power
	255	power

Here, x86_energy_perf_policy is updated to use identical strings and
values as intel_pstate.

But HWP-mode may also be enabled by firmware before the OS boots, and the
OS may not be aware of HWP. In this case, intel_pstate is not available to
provide sysfs attributes, and x86_energy_perf_policy or a similar utility
is invaluable for managing HWP.EPP, for this utility works the same
whether cpufreq is enabled or not.

Signed-off-by: Len Brown
---
 tools/power/x86/x86_energy_perf_policy/Makefile |   37 +-
 .../x86_energy_perf_policy.8                    |  317 ++--
 .../x86_energy_perf_policy.c                    | 1750
 3 files changed, 1666 insertions(+), 438 deletions(-)
 rewrite tools/power/x86/x86_energy_perf_policy/Makefile (72%)
 rewrite tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8 (80%)
 rewrite tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.c (78%)

diff --git a/tools/power/x86/x86_energy_perf_policy/Makefile b/tools/power/x86/x86_energy_perf_policy/Makefile
dissimilarity index 72%
index 971c9ffdcb50..a711eec0c895 100644
--- a/tools/power/x86/x86_energy_perf_policy/Makefile
+++ b/tools/power/x86/x86_energy_perf_policy/Makefile
@@ -1,10 +1,27 @@
-DESTDIR ?=
-
-x86_energy_perf_policy : x86_energy_perf_policy.c
-
-clean :
-	rm -f x86_energy_perf_policy
-
-install :
-	install x86_energy_perf_policy ${DESTDIR}/usr/bin/
-	install x86_energy_perf_policy.8 ${DESTDIR}/usr/share/man/man8/
+CC		= $(CROSS_COMPILE)gcc
+BUILD_OUTPUT	:= $(CURDIR)
+PREFIX		:= /usr
+DESTDIR		:=
+
+ifeq ("$(origin O)", "command line")
+	BUILD_OUTPUT := $(O)
+endif
+
+x86_energy_perf_policy : x86_energy_perf_policy.c
+CFLAGS +=	-Wall
+CFLAGS +=	-DMSRHEADER='"../../../../arch/x86/include/asm/msr-index.h"'
+
+%: %.c
+	@mkdir -p $(BUILD_OUTPUT)
+	$(CC) $(CFLAGS) $< -o $(BUILD_OUTPUT)/$@
+
+.PHONY : clean
+clean :
+	@rm -f $(BUILD_OUTPUT)/x86_energy_perf_policy
+
+install : x86_energy_perf_policy
+	install -d $(DESTDIR)$(PREFIX)/bin
+	install $(BUILD_OUTPUT)/x86_energy_perf_policy $(DESTDIR)$(PREFIX)/bin/x86_energy_perf_policy
+	install -d $(DESTDIR)$(PREFIX)/share/man/man8
+	install x86_energy_perf_policy.8 $(DESTDIR)$(PREFIX)/share/man/man8
+
diff --git a/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8 b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8
dissimilarity index 80%
index 8eaaad648cdb..17db1c3af4d0 100644
--- a/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8
+++ b/tools/power/x86/x86_energy_perf_policy/x86_energy_perf_policy.8
@@ -1,104 +1,213 @@
-.\" This page Copyright (C) 2010 Len Brown
-.\" Distributed under the GPL, Copyleft 1994.
-.TH