Re: [RFC PATCH 3/3] sched: introduce tunables to control soft affinity

2019-07-19 Thread Subhra Mazumdar



On 7/18/19 3:38 PM, Srikar Dronamraju wrote:

* subhra mazumdar  [2019-06-26 15:47:18]:


For different workloads the optimal "softness" of soft affinity can be
different. Introduce tunables sched_allowed and sched_preferred that can
be tuned via /proc. This allows choosing at what utilization difference
the scheduler will choose cpus_allowed over cpus_preferred in the first
level of search. Depending on the extent of data sharing, cache coherency
overhead of the system etc., the optimal point may vary.

Signed-off-by: subhra mazumdar 
---

Correct me if I'm wrong, but this patchset seems to concentrate only on the
wakeup path; I don't see any changes in the regular load balancer or the
NUMA balancer. If the system is loaded or the tasks are CPU-intensive,
wouldn't these tasks be moved to cpus_allowed instead of cpus_preferred,
thereby breaking this soft affinity?


The new-idle balancing path is purposely unchanged; if threads get stolen
from the preferred set to the allowed set, that is intended. Together with
the enqueue side, it achieves the softness of the affinity.


Re: [RFC PATCH 3/3] sched: introduce tunables to control soft affinity

2019-07-18 Thread Srikar Dronamraju
* subhra mazumdar  [2019-06-26 15:47:18]:

> For different workloads the optimal "softness" of soft affinity can be
> different. Introduce tunables sched_allowed and sched_preferred that can
> be tuned via /proc. This allows choosing at what utilization difference
> the scheduler will choose cpus_allowed over cpus_preferred in the first
> level of search. Depending on the extent of data sharing, cache coherency
> overhead of the system etc., the optimal point may vary.
> 
> Signed-off-by: subhra mazumdar 
> ---

Correct me if I'm wrong, but this patchset seems to concentrate only on the
wakeup path; I don't see any changes in the regular load balancer or the
NUMA balancer. If the system is loaded or the tasks are CPU-intensive,
wouldn't these tasks be moved to cpus_allowed instead of cpus_preferred,
thereby breaking this soft affinity?

-- 
Thanks and Regards
Srikar Dronamraju



[RFC PATCH 3/3] sched: introduce tunables to control soft affinity

2019-06-26 Thread subhra mazumdar
For different workloads the optimal "softness" of soft affinity can be
different. Introduce tunables sched_allowed and sched_preferred that can
be tuned via /proc. This allows choosing at what utilization difference
the scheduler will choose cpus_allowed over cpus_preferred in the first
level of search. Depending on the extent of data sharing, cache coherency
overhead of the system etc., the optimal point may vary.

Signed-off-by: subhra mazumdar 
---
 include/linux/sched/sysctl.h |  2 ++
 kernel/sched/fair.c  | 19 ++-
 kernel/sched/sched.h |  2 ++
 kernel/sysctl.c  | 14 ++
 4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 99ce6d7..0e75602 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -41,6 +41,8 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
+extern __read_mostly unsigned int sysctl_sched_preferred;
+extern __read_mostly unsigned int sysctl_sched_allowed;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53aa7f2..d222d78 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -85,6 +85,8 @@ unsigned int sysctl_sched_wakeup_granularity = 1000000UL;
 static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost   = 500000UL;
+const_debug unsigned int sysctl_sched_preferred= 1UL;
+const_debug unsigned int sysctl_sched_allowed  = 100UL;
 
 #ifdef CONFIG_SMP
 /*
@@ -6739,7 +6741,22 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int new_cpu = prev_cpu;
int want_affine = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
-   struct cpumask *cpus = &p->cpus_preferred;
+   int cpux, cpuy;
+   struct cpumask *cpus;
+
+   if (!p->affinity_unequal) {
+   cpus = &p->cpus_allowed;
+   } else {
+   cpux = cpumask_any(&p->cpus_preferred);
+   cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+   cpumask_andnot(cpus, &p->cpus_allowed, &p->cpus_preferred);
+   cpuy = cpumask_any(cpus);
+   if (sysctl_sched_preferred * cpu_rq(cpux)->cfs.avg.util_avg >
+   sysctl_sched_allowed * cpu_rq(cpuy)->cfs.avg.util_avg)
+   cpus = &p->cpus_allowed;
+   else
+   cpus = &p->cpus_preferred;
+   }
 
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..f856bdb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1863,6 +1863,8 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
 
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
+extern const_debug unsigned int sysctl_sched_preferred;
+extern const_debug unsigned int sysctl_sched_allowed;
 
 #ifdef CONFIG_SCHED_HRTICK
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7d1008b..bdffb48 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -383,6 +383,20 @@ static struct ctl_table kern_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec,
},
+   {
+   .procname   = "sched_preferred",
+   .data   = &sysctl_sched_preferred,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .procname   = "sched_allowed",
+   .data   = &sysctl_sched_allowed,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
 #ifdef CONFIG_SCHEDSTATS
{
.procname   = "sched_schedstats",
-- 
2.9.3



Re: [PATCH v2] cpufreq: dt-platdev: allow RK3399 to have separate tunables per cluster

2018-10-16 Thread Rafael J. Wysocki
On Monday, October 8, 2018 7:55:47 AM CEST Viresh Kumar wrote:
> On 05-10-18, 12:00, Dmitry Torokhov wrote:
> > RK3399 has one cluster with 4 small cores and another with 2 big
> > cores; cores in different clusters have different OPPs and thus need
> > a separate set of tunables. Let's enable this via the
> > "have_governor_per_policy" platform data.
> > 
> > Signed-off-by: Dmitry Torokhov 
> > ---
> > 
> > v2 changes: commit message updated.
> > 
> > Not tested, but we had a patch unconditionally enabling the
> > CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag in the tree we used to ship
> > devices based on the RK3399 platform.
> > 
> >  drivers/cpufreq/cpufreq-dt-platdev.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/cpufreq/cpufreq-dt-platdev.c b/drivers/cpufreq/cpufreq-dt-platdev.c
> > index fe14c57de6ca..040ec0f711f9 100644
> > --- a/drivers/cpufreq/cpufreq-dt-platdev.c
> > +++ b/drivers/cpufreq/cpufreq-dt-platdev.c
> > @@ -78,7 +78,10 @@ static const struct of_device_id whitelist[] __initconst = {
> > { .compatible = "rockchip,rk3328", },
> > { .compatible = "rockchip,rk3366", },
> > { .compatible = "rockchip,rk3368", },
> > -   { .compatible = "rockchip,rk3399", },
> > +   { .compatible = "rockchip,rk3399",
> > + .data = &(struct cpufreq_dt_platform_data)
> > +   { .have_governor_per_policy = true, },
> > +   },
> >  
> > { .compatible = "st-ericsson,u8500", },
> > { .compatible = "st-ericsson,u8540", },
> 
> Acked-by: Viresh Kumar 

Patch applied, thanks!



Re: [PATCH v2] cpufreq: dt-platdev: allow RK3399 to have separate tunables per cluster

2018-10-07 Thread Viresh Kumar
On 05-10-18, 12:00, Dmitry Torokhov wrote:
> RK3399 has one cluster with 4 small cores and another with 2 big
> cores; cores in different clusters have different OPPs and thus need
> a separate set of tunables. Let's enable this via the
> "have_governor_per_policy" platform data.
> 
> Signed-off-by: Dmitry Torokhov 
> ---
> 
> v2 changes: commit message updated.
> 
> Not tested, but we had a patch unconditionally enabling the
> CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag in the tree we used to ship
> devices based on the RK3399 platform.
> 
>  drivers/cpufreq/cpufreq-dt-platdev.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cpufreq/cpufreq-dt-platdev.c b/drivers/cpufreq/cpufreq-dt-platdev.c
> index fe14c57de6ca..040ec0f711f9 100644
> --- a/drivers/cpufreq/cpufreq-dt-platdev.c
> +++ b/drivers/cpufreq/cpufreq-dt-platdev.c
> @@ -78,7 +78,10 @@ static const struct of_device_id whitelist[] __initconst = {
>   { .compatible = "rockchip,rk3328", },
>   { .compatible = "rockchip,rk3366", },
>   { .compatible = "rockchip,rk3368", },
> - { .compatible = "rockchip,rk3399", },
> + { .compatible = "rockchip,rk3399",
> +   .data = &(struct cpufreq_dt_platform_data)
> + { .have_governor_per_policy = true, },
> + },
>  
>   { .compatible = "st-ericsson,u8500", },
>   { .compatible = "st-ericsson,u8540", },

Acked-by: Viresh Kumar 

-- 
viresh


[PATCH v2] cpufreq: dt-platdev: allow RK3399 to have separate tunables per cluster

2018-10-05 Thread Dmitry Torokhov
RK3399 has one cluster with 4 small cores and another with 2 big
cores; cores in different clusters have different OPPs and thus need
a separate set of tunables. Let's enable this via the
"have_governor_per_policy" platform data.

Signed-off-by: Dmitry Torokhov 
---

v2 changes: commit message updated.

Not tested, but we had a patch unconditionally enabling the
CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag in the tree we used to ship
devices based on the RK3399 platform.

 drivers/cpufreq/cpufreq-dt-platdev.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/cpufreq-dt-platdev.c b/drivers/cpufreq/cpufreq-dt-platdev.c
index fe14c57de6ca..040ec0f711f9 100644
--- a/drivers/cpufreq/cpufreq-dt-platdev.c
+++ b/drivers/cpufreq/cpufreq-dt-platdev.c
@@ -78,7 +78,10 @@ static const struct of_device_id whitelist[] __initconst = {
{ .compatible = "rockchip,rk3328", },
{ .compatible = "rockchip,rk3366", },
{ .compatible = "rockchip,rk3368", },
-   { .compatible = "rockchip,rk3399", },
+   { .compatible = "rockchip,rk3399",
+ .data = &(struct cpufreq_dt_platform_data)
+   { .have_governor_per_policy = true, },
+   },
 
{ .compatible = "st-ericsson,u8500", },
{ .compatible = "st-ericsson,u8540", },
-- 
2.19.0.605.g01d371f741-goog


-- 
Dmitry


[PATCH 3/3] cpufreq: dt: Support governor tunables per policy

2016-09-09 Thread Viresh Kumar
The cpufreq-dt driver is also used for systems with multiple
clock/voltage domains for CPUs, i.e. multiple cpufreq policies in a
system.

And in such cases the platform users may want to enable "governor
tunables per policy". Support that via platform data, as not all users
of the driver would want that behavior.

Reported-by: Juri Lelli 
Signed-off-by: Viresh Kumar 
---
 drivers/cpufreq/cpufreq-dt-platdev.c |  7 +--
 drivers/cpufreq/cpufreq-dt.c |  6 ++
 drivers/cpufreq/cpufreq-dt.h | 19 +++
 3 files changed, 30 insertions(+), 2 deletions(-)
 create mode 100644 drivers/cpufreq/cpufreq-dt.h

diff --git a/drivers/cpufreq/cpufreq-dt-platdev.c b/drivers/cpufreq/cpufreq-dt-platdev.c
index 285ed3e6494e..da2fa27b5b30 100644
--- a/drivers/cpufreq/cpufreq-dt-platdev.c
+++ b/drivers/cpufreq/cpufreq-dt-platdev.c
@@ -11,6 +11,8 @@
 #include 
 #include 
 
+#include "cpufreq-dt.h"
+
 static const struct of_device_id machines[] __initconst = {
{ .compatible = "allwinner,sun4i-a10", },
{ .compatible = "allwinner,sun5i-a10s", },
@@ -92,7 +94,8 @@ static int __init cpufreq_dt_platdev_init(void)
if (!match)
return -ENODEV;
 
-   return PTR_ERR_OR_ZERO(platform_device_register_simple("cpufreq-dt", -1,
-  NULL, 0));
+   return PTR_ERR_OR_ZERO(platform_device_register_data(NULL, "cpufreq-dt",
+  -1, match->data,
+  sizeof(struct cpufreq_dt_platform_data)));
 }
 device_initcall(cpufreq_dt_platdev_init);
diff --git a/drivers/cpufreq/cpufreq-dt.c b/drivers/cpufreq/cpufreq-dt.c
index 2bd20534155d..5c07ae05d69a 100644
--- a/drivers/cpufreq/cpufreq-dt.c
+++ b/drivers/cpufreq/cpufreq-dt.c
@@ -25,6 +25,8 @@
 #include 
 #include 
 
+#include "cpufreq-dt.h"
+
 struct private_data {
struct device *cpu_dev;
struct thermal_cooling_device *cdev;
@@ -353,6 +355,7 @@ static struct cpufreq_driver dt_cpufreq_driver = {
 
 static int dt_cpufreq_probe(struct platform_device *pdev)
 {
+   struct cpufreq_dt_platform_data *data = dev_get_platdata(&pdev->dev);
int ret;
 
/*
@@ -366,6 +369,9 @@ static int dt_cpufreq_probe(struct platform_device *pdev)
if (ret)
return ret;
 
+   if (data && data->have_governor_per_policy)
+   dt_cpufreq_driver.flags |= CPUFREQ_HAVE_GOVERNOR_PER_POLICY;
+
ret = cpufreq_register_driver(&dt_cpufreq_driver);
if (ret)
dev_err(&pdev->dev, "failed register driver: %d\n", ret);
diff --git a/drivers/cpufreq/cpufreq-dt.h b/drivers/cpufreq/cpufreq-dt.h
new file mode 100644
index ..54d774e46c43
--- /dev/null
+++ b/drivers/cpufreq/cpufreq-dt.h
@@ -0,0 +1,19 @@
+/*
+ * Copyright (C) 2016 Linaro
+ * Viresh Kumar 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef __CPUFREQ_DT_H__
+#define __CPUFREQ_DT_H__
+
+#include 
+
+struct cpufreq_dt_platform_data {
+   bool have_governor_per_policy;
+};
+
+#endif /* __CPUFREQ_DT_H__ */
-- 
2.7.1.410.g6faf27b



[PATCH V2 08/22] block, cfq: get rid of latency tunables

2016-08-08 Thread Paolo Valente
BFQ guarantees low latency for interactive applications in a
completely different way than CFQ does. On the other hand, in terms of
interface, and exactly as CFQ does, BFQ exports a boolean low_latency
tunable to switch low-latency heuristics on (in BFQ, these heuristics
lower latency for interactive and soft real-time applications).
Finally, unlike CFQ, BFQ has no other latency tunables.

Accordingly, this commit temporarily turns all latency tunables into
fake tunables, by turning the functions for reading and writing these
tunables into functions that just generate warnings. The commit
introducing low-latency heuristics in BFQ then restores only the
boolean low_latency tunable.

Signed-off-by: Paolo Valente 
---
 block/cfq-iosched.c | 36 
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 329ed2b..69c7c75 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -30,7 +30,6 @@ static const u64 cfq_slice_sync = NSEC_PER_SEC / 10;
 static u64 cfq_slice_async = NSEC_PER_SEC / 25;
 static const int cfq_slice_async_rq = 2;
 static u64 cfq_slice_idle = NSEC_PER_SEC / 125;
-static const u64 cfq_target_latency = (u64)NSEC_PER_SEC * 3/10; /* 300 ms */
 static const int cfq_hist_divisor = 4;
 
 /*
@@ -224,12 +223,9 @@ struct cfq_data {
unsigned int cfq_back_penalty;
unsigned int cfq_back_max;
unsigned int cfq_slice_async_rq;
-   unsigned int cfq_latency;
u64 cfq_fifo_expire[2];
u64 cfq_slice[2];
u64 cfq_slice_idle;
-   u64 cfq_group_idle;
-   u64 cfq_target_latency;
 
/*
 * Fallback dummy cfqq for extreme OOM conditions
@@ -1485,7 +1481,7 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 * We also ramp up the dispatch depth gradually for async IO,
 * based on the last sync IO we serviced
 */
-   if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
+   if (!cfq_cfqq_sync(cfqq)) {
u64 last_sync = ktime_get_ns() - cfqd->last_delayed_sync;
unsigned int depth;
 
@@ -2323,10 +2319,8 @@ static int cfq_init_queue(struct request_queue *q, struct elevator_type *e)
cfqd->cfq_back_penalty = cfq_back_penalty;
cfqd->cfq_slice[0] = cfq_slice_async;
cfqd->cfq_slice[1] = cfq_slice_sync;
-   cfqd->cfq_target_latency = cfq_target_latency;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
-   cfqd->cfq_latency = 1;
cfqd->hw_tag = -1;
/*
 * we optimistically start assuming sync ops weren't delayed in last
@@ -2384,8 +2378,6 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
-SHOW_FUNCTION(cfq_low_latency_show, cfqd->cfq_latency, 0);
-SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1);
 #undef SHOW_FUNCTION
 
 #define USEC_SHOW_FUNCTION(__FUNC, __VAR)  \
@@ -2399,7 +2391,6 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)   \
 USEC_SHOW_FUNCTION(cfq_slice_idle_us_show, cfqd->cfq_slice_idle);
 USEC_SHOW_FUNCTION(cfq_slice_sync_us_show, cfqd->cfq_slice[1]);
 USEC_SHOW_FUNCTION(cfq_slice_async_us_show, cfqd->cfq_slice[0]);
-USEC_SHOW_FUNCTION(cfq_target_latency_us_show, cfqd->cfq_target_latency);
 #undef USEC_SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)   \
@@ -2431,8 +2422,6 @@ STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
-STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
-STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, UINT_MAX, 1);
 #undef STORE_FUNCTION
 
 #define USEC_STORE_FUNCTION(__FUNC, __PTR, MIN, MAX)   \
@@ -2451,12 +2440,27 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
 USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
 USEC_STORE_FUNCTION(cfq_slice_sync_us_store, &cfqd->cfq_slice[1], 1, UINT_MAX);
 USEC_STORE_FUNCTION(cfq_slice_async_us_store, &cfqd->cfq_slice[0], 1, UINT_MAX);
-USEC_STORE_FUNCTION(cfq_target_latency_us_store, &cfqd->cfq_target_latency, 1, UINT_MAX);
 #undef USEC_STORE_FUNCTION
 
+static ssize_t cfq_fake_lat_show(struct elevator_queue *e, char *page)
+{
+   pr_warn_once("CFQ I/O SCHED: tried to read removed latency tunable");
+   return sprintf(page, "0\n");
+}

[PATCH RFC V8 08/22] block, cfq: get rid of latency tunables

2016-07-27 Thread Paolo Valente
BFQ guarantees low latency for interactive applications in a
completely different way than CFQ does. On the other hand, in terms of
interface, and exactly as CFQ does, BFQ exports a boolean low_latency
tunable to switch low-latency heuristics on (in BFQ, these heuristics
lower latency for interactive and soft real-time applications).
Finally, unlike CFQ, BFQ has no other latency tunables.

Accordingly, this commit temporarily turns all latency tunables into
fake tunables, by turning the functions for reading and writing these
tunables into functions that just generate warnings. The commit
introducing low-latency heuristics in BFQ then restores only the
boolean low_latency tunable.

Signed-off-by: Paolo Valente 
---
 block/cfq-iosched.c | 31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index df8fb826..47e23e5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -30,7 +30,6 @@ static const int cfq_slice_sync = HZ / 10;
 static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
-static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
 static const int cfq_hist_divisor = 4;
 
 /*
@@ -227,8 +226,6 @@ struct cfq_data {
unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
-   unsigned int cfq_latency;
-   unsigned int cfq_target_latency;
 
/*
 * Fallback dummy cfqq for extreme OOM conditions
@@ -1463,7 +1460,7 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 * We also ramp up the dispatch depth gradually for async IO,
 * based on the last sync IO we serviced
 */
-   if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
+   if (!cfq_cfqq_sync(cfqq)) {
unsigned long last_sync = jiffies - cfqd->last_delayed_sync;
unsigned int depth;
 
@@ -2269,10 +2266,8 @@ static int cfq_init_queue(struct request_queue *q, struct elevator_type *e)
cfqd->cfq_back_penalty = cfq_back_penalty;
cfqd->cfq_slice[0] = cfq_slice_async;
cfqd->cfq_slice[1] = cfq_slice_sync;
-   cfqd->cfq_target_latency = cfq_target_latency;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
-   cfqd->cfq_latency = 1;
cfqd->hw_tag = -1;
/*
 * we optimistically start assuming sync ops weren't delayed in last
@@ -2330,8 +2325,6 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
-SHOW_FUNCTION(cfq_low_latency_show, cfqd->cfq_latency, 0);
-SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)   \
@@ -2363,13 +2356,27 @@ STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
-STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
-STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, UINT_MAX, 1);
 #undef STORE_FUNCTION
 
+static ssize_t cfq_fake_lat_show(struct elevator_queue *e, char *page)
+{
+   pr_warn_once("CFQ I/O SCHED: tried to read removed latency tunable");
+   return sprintf(page, "0\n");
+}
+
+static ssize_t
+cfq_fake_lat_store(struct elevator_queue *e, const char *page, size_t count)
+{
+   pr_warn_once("CFQ I/O SCHED: tried to write removed latency tunable");
+   return count;
+}
+
 #define CFQ_ATTR(name) \
__ATTR(name, S_IRUGO|S_IWUSR, cfq_##name##_show, cfq_##name##_store)
 
+#define CFQ_FAKE_LAT_ATTR(name) \
+   __ATTR(name, S_IRUGO|S_IWUSR, cfq_fake_lat_show, cfq_fake_lat_store)
+
 static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(quantum),
CFQ_ATTR(fifo_expire_sync),
@@ -2380,8 +2387,8 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
-   CFQ_ATTR(low_latency),
-   CFQ_ATTR(target_latency),
+   CFQ_FAKE_LAT_ATTR(low_latency),
+   CFQ_FAKE_LAT_ATTR(target_latency),
__ATTR_NULL
 };
 
-- 
1.9.1



Re: [PATCH RFC 08/22] block, cfq: get rid of latency tunables

2016-02-10 Thread Tejun Heo
For 1-8, except for one minor nit,

  Acked-by: Tejun Heo 

Thanks!

-- 
tejun


Re: [PATCH V4 4/7] cpufreq: governor: Move common sysfs tunables to cpufreq_governor.c

2016-02-09 Thread Rafael J. Wysocki
On Tue, Feb 9, 2016 at 4:46 AM, Viresh Kumar  wrote:
> We have got five common sysfs tunables between ondemand and conservative
> governors, move their callbacks to cpufreq_governor.c to get rid of
> redundant code.
>
> Because of minor differences in the implementation of the callbacks,
> some more per-governor callbacks are introduced in order to not
> introduce any more "governor == ONDEMAND/CONSERVATIVE" like checks.
>
> Signed-off-by: Viresh Kumar 
> Tested-by: Juri Lelli 
> Tested-by: Shilpasri G Bhat 

To me, the benefit from this patch is marginal and the cost is quite
substantial.

The code is really only duplicated if both governors are built in, or if
both modules are loaded at the same time, and IMO that is not worth
adding the new governor callbacks for.

If the implementation of the given show/store pair is different enough
that you need an extra callback to move them to _governor.c, I won't
bother doing that.

Thanks,
Rafael


Re: [PATCH V3 03/13] cpufreq: governor: New sysfs show/store callbacks for governor tunables

2016-02-09 Thread Rafael J. Wysocki
On Tuesday, February 09, 2016 08:51:26 AM Viresh Kumar wrote:
> On 08-02-16, 22:36, Rafael J. Wysocki wrote:
> > On Mon, Feb 8, 2016 at 12:39 PM, Viresh Kumar  
> > wrote:
> > > +   ret = kobject_init_and_add(&dbs_data->kobj, &gov->kobj_type,
> > > +  get_governor_parent_kobj(policy),
> > > +  gov->kobj_name);
> > 
> > Besides, you forgot about the format argument for kobject_init_and_add().
> 
> What about that? Why is it required here? We don't have to modify the
> gov->gov.name string at all, and that string can be used here without
> adding any more format arguments.

But that's because the governor name is a static string and we can safely pass
it as a format (because we know that it doesn't contain any output field 
specifiers
in particular).

So either there should be a comment to that effect in the code, or the format
argument should be present.

Thanks,
Rafael



[PATCH V4 4/7] cpufreq: governor: Move common sysfs tunables to cpufreq_governor.c

2016-02-08 Thread Viresh Kumar
We have got five common sysfs tunables between ondemand and conservative
governors, move their callbacks to cpufreq_governor.c to get rid of
redundant code.

Because of minor differences in the implementation of the callbacks,
some more per-governor callbacks are introduced in order to not
introduce any more "governor == ONDEMAND/CONSERVATIVE" like checks.

Signed-off-by: Viresh Kumar 
Tested-by: Juri Lelli 
Tested-by: Shilpasri G Bhat 
---
 drivers/cpufreq/cpufreq_conservative.c |  80 +-
 drivers/cpufreq/cpufreq_governor.c | 100 
 drivers/cpufreq/cpufreq_governor.h |  16 +-
 drivers/cpufreq/cpufreq_ondemand.c | 102 -
 4 files changed, 151 insertions(+), 147 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index ed081dbce00c..5c54041015d4 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -122,47 +122,17 @@ static struct notifier_block cs_cpufreq_notifier_block = {
 /** sysfs interface /
 static struct dbs_governor cs_dbs_gov;
 
-static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
-   const char *buf, size_t count)
+static bool invalid_up_threshold(struct dbs_data *dbs_data,
+unsigned int threshold)
 {
-   unsigned int input;
-   int ret;
-   ret = sscanf(buf, "%u", &input);
-
-   if (ret != 1 || input > MAX_SAMPLING_DOWN_FACTOR || input < 1)
-   return -EINVAL;
+   struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
-   dbs_data->sampling_down_factor = input;
-   return count;
+   return threshold > 100 || threshold <= cs_tuners->down_threshold;
 }
 
-static ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
-   size_t count)
+static bool invalid_sampling_down_factor(unsigned int factor)
 {
-   unsigned int input;
-   int ret;
-   ret = sscanf(buf, "%u", &input);
-
-   if (ret != 1)
-   return -EINVAL;
-
-   dbs_data->sampling_rate = max(input, dbs_data->min_sampling_rate);
-   return count;
-}
-
-static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
-   size_t count)
-{
-   struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
-   unsigned int input;
-   int ret;
-   ret = sscanf(buf, "%u", &input);
-
-   if (ret != 1 || input > 100 || input <= cs_tuners->down_threshold)
-   return -EINVAL;
-
-   dbs_data->up_threshold = input;
-   return count;
+   return factor > MAX_SAMPLING_DOWN_FACTOR;
 }
 
 static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
@@ -182,27 +152,13 @@ static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
return count;
 }
 
-static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
-   const char *buf, size_t count)
+static void update_ignore_nice_load(struct dbs_data *dbs_data)
 {
-   unsigned int input, j;
-   int ret;
-
-   ret = sscanf(buf, "%u", &input);
-   if (ret != 1)
-   return -EINVAL;
-
-   if (input > 1)
-   input = 1;
-
-   if (input == dbs_data->ignore_nice_load) /* nothing to do */
-   return count;
-
-   dbs_data->ignore_nice_load = input;
+   struct cs_cpu_dbs_info_s *dbs_info;
+   unsigned int j;
 
/* we need to re-evaluate prev_cpu_idle */
for_each_online_cpu(j) {
-   struct cs_cpu_dbs_info_s *dbs_info;
dbs_info = &per_cpu(cs_cpu_dbs_info, j);
dbs_info->cdbs.prev_cpu_idle = get_cpu_idle_time(j,
&dbs_info->cdbs.prev_cpu_wall, 0);
@@ -210,7 +166,6 @@ static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
dbs_info->cdbs.prev_cpu_nice =
kcpustat_cpu(j).cpustat[CPUTIME_NICE];
}
-   return count;
 }
 
 static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
@@ -235,21 +190,11 @@ static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
return count;
 }
 
-gov_show_one_common(sampling_rate);
-gov_show_one_common(sampling_down_factor);
-gov_show_one_common(up_threshold);
-gov_show_one_common(ignore_nice_load);
-gov_show_one_common(min_sampling_rate);
 gov_show_one(cs, down_threshold);
 gov_show_one(cs, freq_step);
 
-gov_attr_rw(sampling_rate);
-gov_attr_rw(sampling_down_factor);
-gov_attr_rw(up_threshold);
-gov_attr_rw(ignore_nice_load);
-gov_attr_ro(min_sampling_rate);
-gov_attr_rw(down_threshold);
-gov_attr_rw(freq_step);
+static gov_attr_rw(down_threshold);
+s

[PATCH V4 3/6] cpufreq: governor: New sysfs show/store callbacks for governor tunables

2016-02-08 Thread Viresh Kumar
The ondemand and conservative governors use the global-attr or freq-attr
structures to represent sysfs attributes corresponding to their tunables
(which of them is actually used depends on whether or not different
policy objects can use the same governor with different tunables at the
same time and, consequently, on where those attributes are located in
sysfs).

Unfortunately, in the freq-attr case, the standard cpufreq show/store
sysfs attribute callbacks are applied to the governor tunable attributes
and they always acquire the policy->rwsem lock before carrying out the
operation.  That may lead to an ABBA deadlock if governor tunable
attributes are removed under policy->rwsem while one of them is being
accessed concurrently (if sysfs attributes removal wins the race, it
will wait for the access to complete with policy->rwsem held while the
attribute callback will block on policy->rwsem indefinitely).

We attempted to address this issue by dropping policy->rwsem around
governor tunable attributes removal (that is, around invocations of the
->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT)
in cpufreq_set_policy(), but that opened up race conditions that had not
been possible with policy->rwsem held all the time.  Therefore
policy->rwsem cannot be dropped in cpufreq_set_policy() at any point,
but the deadlock situation described above must be avoided too.

To that end, use the observation that in principle governor tunables may
be represented by the same data type regardless of whether the governor
is system-wide or per-policy and introduce a new structure, struct
governor_attr, for representing them and new corresponding macros for
creating show/store sysfs callbacks for them.  Also make their parent
kobject use a new kobject type whose default show/store callbacks are
not related to the standard core cpufreq ones in any way (and they don't
acquire policy->rwsem in particular).

[ Rafael: Written changelog ]
Signed-off-by: Viresh Kumar 
Tested-by: Juri Lelli 
Tested-by: Shilpasri G Bhat 
---
 drivers/cpufreq/cpufreq_conservative.c | 72 --
 drivers/cpufreq/cpufreq_governor.c | 68 
 drivers/cpufreq/cpufreq_governor.h | 39 +-
 drivers/cpufreq/cpufreq_ondemand.c | 72 --
 4 files changed, 147 insertions(+), 104 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 4f640b028c94..ed081dbce00c 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -235,54 +235,33 @@ static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
return count;
 }
 
-show_store_one(cs, down_threshold);
-show_store_one(cs, freq_step);
-show_store_one_common(cs, sampling_rate);
-show_store_one_common(cs, sampling_down_factor);
-show_store_one_common(cs, up_threshold);
-show_store_one_common(cs, ignore_nice_load);
-show_one_common(cs, min_sampling_rate);
-
-gov_sys_pol_attr_rw(sampling_rate);
-gov_sys_pol_attr_rw(sampling_down_factor);
-gov_sys_pol_attr_rw(up_threshold);
-gov_sys_pol_attr_rw(down_threshold);
-gov_sys_pol_attr_rw(ignore_nice_load);
-gov_sys_pol_attr_rw(freq_step);
-gov_sys_pol_attr_ro(min_sampling_rate);
-
-static struct attribute *dbs_attributes_gov_sys[] = {
-   &min_sampling_rate_gov_sys.attr,
-   &sampling_rate_gov_sys.attr,
-   &sampling_down_factor_gov_sys.attr,
-   &up_threshold_gov_sys.attr,
-   &down_threshold_gov_sys.attr,
-   &ignore_nice_load_gov_sys.attr,
-   &freq_step_gov_sys.attr,
+gov_show_one_common(sampling_rate);
+gov_show_one_common(sampling_down_factor);
+gov_show_one_common(up_threshold);
+gov_show_one_common(ignore_nice_load);
+gov_show_one_common(min_sampling_rate);
+gov_show_one(cs, down_threshold);
+gov_show_one(cs, freq_step);
+
+gov_attr_rw(sampling_rate);
+gov_attr_rw(sampling_down_factor);
+gov_attr_rw(up_threshold);
+gov_attr_rw(ignore_nice_load);
+gov_attr_ro(min_sampling_rate);
+gov_attr_rw(down_threshold);
+gov_attr_rw(freq_step);
+
+static struct attribute *cs_attributes[] = {
+   &min_sampling_rate.attr,
+   &sampling_rate.attr,
+   &sampling_down_factor.attr,
+   &up_threshold.attr,
+   &down_threshold.attr,
+   &ignore_nice_load.attr,
+   &freq_step.attr,
NULL
 };
 
-static struct attribute_group cs_attr_group_gov_sys = {
-   .attrs = dbs_attributes_gov_sys,
-   .name = "conservative",
-};
-
-static struct attribute *dbs_attributes_gov_pol[] = {
-   &min_sampling_rate_gov_pol.attr,
-   &sampling_rate_gov_pol.attr,
-   &sampling_down_factor_gov_pol.attr,
-   &up_threshold_gov_pol.attr,
-   &down_threshold_gov_pol.attr,
-   &ignore_nice_load_gov_pol.attr,
-   &freq_step_gov_pol.attr,
-   

[PATCH V4 2/6] cpufreq: governor: Move common tunables to 'struct dbs_data'

2016-02-08 Thread Viresh Kumar
There are a few more common tunables shared across the ondemand and
conservative governors. Move them to 'struct dbs_data' to simplify code.

Signed-off-by: Viresh Kumar 
Tested-by: Juri Lelli 
Tested-by: Shilpasri G Bhat 
---
 drivers/cpufreq/cpufreq_conservative.c | 38 ++-
 drivers/cpufreq/cpufreq_governor.c | 37 ++
 drivers/cpufreq/cpufreq_governor.h | 14 +---
 drivers/cpufreq/cpufreq_ondemand.c | 41 +++---
 4 files changed, 47 insertions(+), 83 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index a69eb7eae7ec..4f640b028c94 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -60,7 +60,7 @@ static void cs_check_cpu(int cpu, unsigned int load)
return;
 
/* Check for frequency increase */
-   if (load > cs_tuners->up_threshold) {
+   if (load > dbs_data->up_threshold) {
dbs_info->down_skip = 0;
 
/* if we are already at full speed then break out early */
@@ -78,7 +78,7 @@ static void cs_check_cpu(int cpu, unsigned int load)
}
 
/* if sampling_down_factor is active break out early */
-   if (++dbs_info->down_skip < cs_tuners->sampling_down_factor)
+   if (++dbs_info->down_skip < dbs_data->sampling_down_factor)
return;
dbs_info->down_skip = 0;
 
@@ -107,10 +107,9 @@ static unsigned int cs_dbs_timer(struct cpufreq_policy *policy)
 {
struct policy_dbs_info *policy_dbs = policy->governor_data;
struct dbs_data *dbs_data = policy_dbs->dbs_data;
-   struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
dbs_check_cpu(policy);
-   return delay_for_sampling_rate(cs_tuners->sampling_rate);
+   return delay_for_sampling_rate(dbs_data->sampling_rate);
 }
 
 static int dbs_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
@@ -126,7 +125,6 @@ static struct dbs_governor cs_dbs_gov;
 static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
const char *buf, size_t count)
 {
-   struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -134,14 +132,13 @@ static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
if (ret != 1 || input > MAX_SAMPLING_DOWN_FACTOR || input < 1)
return -EINVAL;
 
-   cs_tuners->sampling_down_factor = input;
+   dbs_data->sampling_down_factor = input;
return count;
 }
 
 static ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
size_t count)
 {
-   struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -149,7 +146,7 @@ static ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
if (ret != 1)
return -EINVAL;
 
-   cs_tuners->sampling_rate = max(input, dbs_data->min_sampling_rate);
+   dbs_data->sampling_rate = max(input, dbs_data->min_sampling_rate);
return count;
 }
 
@@ -164,7 +161,7 @@ static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
if (ret != 1 || input > 100 || input <= cs_tuners->down_threshold)
return -EINVAL;
 
-   cs_tuners->up_threshold = input;
+   dbs_data->up_threshold = input;
return count;
 }
 
@@ -178,7 +175,7 @@ static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
 
/* cannot be lower than 11 otherwise freq will not fall */
if (ret != 1 || input < 11 || input > 100 ||
-   input >= cs_tuners->up_threshold)
+   input >= dbs_data->up_threshold)
return -EINVAL;
 
cs_tuners->down_threshold = input;
@@ -188,7 +185,6 @@ static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
 static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
const char *buf, size_t count)
 {
-   struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input, j;
int ret;
 
@@ -199,10 +195,10 @@ static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
if (input > 1)
input = 1;
 
-   if (input == cs_tuners->ignore_nice_load) /* nothing to do */
+   if (input == dbs_data->ignore_nice_load) /* nothing to do */
return count;
 
-   cs_tuners->ignore_nice_load = input;
+   dbs_data->ignore_nice_load = input;
 
/* we need to re-evaluate prev_cpu_idle */
for_each_online_cpu(j) {
@@ -210,7 +206,7 @@ static ssize_t s

Re: [PATCH V3 03/13] cpufreq: governor: New sysfs show/store callbacks for governor tunables

2016-02-08 Thread Viresh Kumar
On 08-02-16, 22:36, Rafael J. Wysocki wrote:
> On Mon, Feb 8, 2016 at 12:39 PM, Viresh Kumar  wrote:
> > +   ret = kobject_init_and_add(&dbs_data->kobj, &gov->kobj_type,
> > +  get_governor_parent_kobj(policy),
> > +  gov->kobj_name);
> 
> Besides, you forgot about the format argument for kobject_init_and_add().

What about that? Why is it required here? We don't have to modify the
gov->gov.name string at all, and that string can be used here without adding
any more format arguments.

-- 
viresh


Re: [PATCH V3 03/13] cpufreq: governor: New sysfs show/store callbacks for governor tunables

2016-02-08 Thread Rafael J. Wysocki
On Mon, Feb 8, 2016 at 12:39 PM, Viresh Kumar  wrote:

[cut]

> @@ -331,8 +310,8 @@ static struct dbs_governor cs_dbs_gov = {
> .owner = THIS_MODULE,
> },
> .governor = GOV_CONSERVATIVE,
> -   .attr_group_gov_sys = &cs_attr_group_gov_sys,
> -   .attr_group_gov_pol = &cs_attr_group_gov_pol,
> +   .kobj_name = "conservative",

I don't think you need this.

> +   .kobj_type = { .default_attrs = cs_attributes },
> .get_cpu_cdbs = get_cpu_cdbs,
> .get_cpu_dbs_info_s = get_cpu_dbs_info_s,
> .gov_dbs_timer = cs_dbs_timer,

[cut]

> @@ -373,10 +420,15 @@ static int cpufreq_governor_init(struct cpufreq_policy *policy)
> policy_dbs->dbs_data = dbs_data;
> policy->governor_data = policy_dbs;
>
> -   ret = sysfs_create_group(get_governor_parent_kobj(policy),
> -get_sysfs_attr(gov));
> -   if (ret)
> +   gov->kobj_type.sysfs_ops = &governor_sysfs_ops;
> +   ret = kobject_init_and_add(&dbs_data->kobj, &gov->kobj_type,
> +  get_governor_parent_kobj(policy),
> +  gov->kobj_name);

gov->gov.name can be used here instead of the new kobj_name thing.

Besides, you forgot about the format argument for kobject_init_and_add().

> +   if (ret) {
> +   pr_err("cpufreq: Governor initialization failed (dbs_data kobject initialization error %d)\n",
> +  ret);
> goto reset_gdbs_data;
> +   }
>
> return 0;
>

[cut]

> diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h
> index 5c5d7936087c..a3afac5d8ab2 100644
> --- a/drivers/cpufreq/cpufreq_governor.h
> +++ b/drivers/cpufreq/cpufreq_governor.h
> @@ -160,8 +160,44 @@ struct dbs_data {
> unsigned int sampling_rate;
> unsigned int sampling_down_factor;
> unsigned int up_threshold;
> +
> +   struct kobject kobj;
> +   /* Protect concurrent updates to governor tunables from sysfs */
> +   struct mutex mutex;
> +};
> +
> +/* Governor's specific attributes */
> +struct dbs_data;
> +struct governor_attr {
> +   struct attribute attr;
> +   ssize_t (*show)(struct dbs_data *dbs_data, char *buf);
> +   ssize_t (*store)(struct dbs_data *dbs_data, const char *buf,
> +size_t count);
>  };
>
> +#define gov_show_one_tunable(_gov, file_name)  \
> +static ssize_t show_##file_name \
> +(struct dbs_data *dbs_data, char *buf) \
> +{  \
> +   struct _gov##_dbs_tuners *tuners = dbs_data->tuners;\
> +   return sprintf(buf, "%u\n", tuners->file_name); \
> +}
> +
> +#define gov_show_one(file_name) \
> +static ssize_t show_##file_name \
> +(struct dbs_data *dbs_data, char *buf) \
> +{  \
> +   return sprintf(buf, "%u\n", dbs_data->file_name);   \
> +}
> +
> +#define gov_attr_ro(_name) \
> +static struct governor_attr _name =\
> +__ATTR(_name, 0444, show_##_name, NULL)
> +
> +#define gov_attr_rw(_name) \
> +static struct governor_attr _name =\
> +__ATTR(_name, 0644, show_##_name, store_##_name)
> +
>  /* Common to all CPUs of a policy */
>  struct policy_dbs_info {
> struct cpufreq_policy *policy;
> @@ -236,8 +272,8 @@ struct dbs_governor {
> #define GOV_ONDEMAND0
> #define GOV_CONSERVATIVE1
> int governor;
> -   struct attribute_group *attr_group_gov_sys; /* one governor - system */
> -   struct attribute_group *attr_group_gov_pol; /* one governor - policy */
> +   const char *kobj_name;

So this isn't really necessary.

> +   struct kobj_type kobj_type;
>
> /*
>  * Common data for platforms that don't set
> diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
> index cb0d6ff1ced5..bf570800fa78 100644
> --- a/drivers/cpufreq/cpufreq_ondemand.c
> +++ b/drivers/cpufreq/cpufreq_ondemand.c

[cut]

> @@ -5

Re: [PATCH V3 03/13] cpufreq: governor: New sysfs show/store callbacks for governor tunables

2016-02-08 Thread Rafael J. Wysocki
On Mon, Feb 8, 2016 at 6:07 PM, Viresh Kumar  wrote:
> On 08-02-16, 17:09, Viresh Kumar wrote:
>> +gov_show_one(sampling_rate);
>> +gov_show_one(sampling_down_factor);
>> +gov_show_one(up_threshold);
>> +gov_show_one(ignore_nice_load);
>> +gov_show_one(min_sampling_rate);
>> +gov_show_one_tunable(cs, down_threshold);
>> +gov_show_one_tunable(cs, freq_step);
>
> Based on the review comments on 1/13, I will do:
> - s/gov_show_one/gov_show_one_common
> - s/gov_show_one_tunable/gov_show_one

OK


Re: [PATCH V3 03/13] cpufreq: governor: New sysfs show/store callbacks for governor tunables

2016-02-08 Thread Viresh Kumar
On 08-02-16, 17:09, Viresh Kumar wrote:
> +gov_show_one(sampling_rate);
> +gov_show_one(sampling_down_factor);
> +gov_show_one(up_threshold);
> +gov_show_one(ignore_nice_load);
> +gov_show_one(min_sampling_rate);
> +gov_show_one_tunable(cs, down_threshold);
> +gov_show_one_tunable(cs, freq_step);

Based on the review comments on 1/13, I will do:
- s/gov_show_one/gov_show_one_common
- s/gov_show_one_tunable/gov_show_one

-- 
viresh


Re: [PATCH V3 09/13] cpufreq: governor: Move common sysfs tunables to cpufreq_governor.c

2016-02-08 Thread Rafael J. Wysocki
On Mon, Feb 8, 2016 at 2:03 PM, Viresh Kumar  wrote:
> On 08-02-16, 13:58, Rafael J. Wysocki wrote:
>> My most fundamental concern here is that attributes that don't apply
>> to a particular governor should not appear in sysfs at all when that
>> governor is in use (instead of appearing and always returning -EINVAL
>
> s/is in use/is not in use/ ??
>
>> which is sort of silly).
>
> But who said that I have made them available always? Sorry, I didn't
> understand your input.
>
> I have just moved the show/store callbacks and the struct
> governor_attr definition to cpufreq_governor.c. And sysfs files are
> created only for the ones that are valid for a governor.

OK, I need to look at it more carefully then.

>> That doesn't mean the common code cannot access them, though.  They
>> still can be present in the data structure, but it may be a good idea
>> to set them to special values clearly meaning "invalid" then.
>
> Or are you saying that we should move all the tunables to dbs_data ?

Well, maybe.  I'm not sure, but that may be done later in any case.

Thanks,
Rafael


Re: [PATCH V3 09/13] cpufreq: governor: Move common sysfs tunables to cpufreq_governor.c

2016-02-08 Thread Viresh Kumar
On 08-02-16, 13:58, Rafael J. Wysocki wrote:
> My most fundamental concern here is that attributes that don't apply
> to a particular governor should not appear in sysfs at all when that
> governor is in use (instead of appearing and always returning -EINVAL

s/is in use/is not in use/ ??

> which is sort of silly).

But who said that I have made them available always? Sorry, I didn't
understand your input.

I have just moved the show/store callbacks and the struct
governor_attr definition to cpufreq_governor.c. And sysfs files are
created only for the ones that are valid for a governor.

> That doesn't mean the common code cannot access them, though.  They
> still can be present in the data structure, but it may be a good idea
> to set them to special values clearly meaning "invalid" then.

Or are you saying that we should move all the tunables to dbs_data ?

-- 
viresh


Re: [PATCH V3 09/13] cpufreq: governor: Move common sysfs tunables to cpufreq_governor.c

2016-02-08 Thread Rafael J. Wysocki
On Mon, Feb 8, 2016 at 12:39 PM, Viresh Kumar  wrote:
> We have got five common sysfs tunables between ondemand and conservative
> governors, move their callbacks to cpufreq_governor.c to get rid of
> redundant code.
>
> Because of minor differences in the implementation of the callbacks,
> some more per-governor callbacks are introduced in order to not
> introduce any more "governor == ONDEMAND/CONSERVATIVE" like checks.

My most fundamental concern here is that attributes that don't apply
to a particular governor should not appear in sysfs at all when that
governor is in use (instead of appearing and always returning -EINVAL
which is sort of silly).

That doesn't mean the common code cannot access them, though.  They
still can be present in the data structure, but it may be a good idea
to set them to special values clearly meaning "invalid" then.

Thanks,
Rafael


[PATCH V3 03/13] cpufreq: governor: New sysfs show/store callbacks for governor tunables

2016-02-08 Thread Viresh Kumar
The ondemand and conservative governors use the global-attr or freq-attr
structures to represent sysfs attributes corresponding to their tunables
(which of them is actually used depends on whether or not different
policy objects can use the same governor with different tunables at the
same time and, consequently, on where those attributes are located in
sysfs).

Unfortunately, in the freq-attr case, the standard cpufreq show/store
sysfs attribute callbacks are applied to the governor tunable attributes
and they always acquire the policy->rwsem lock before carrying out the
operation.  That may lead to an ABBA deadlock if governor tunable
attributes are removed under policy->rwsem while one of them is being
accessed concurrently (if sysfs attributes removal wins the race, it
will wait for the access to complete with policy->rwsem held while the
attribute callback will block on policy->rwsem indefinitely).

We attempted to address this issue by dropping policy->rwsem around
governor tunable attributes removal (that is, around invocations of the
->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT)
in cpufreq_set_policy(), but that opened up race conditions that had not
been possible with policy->rwsem held all the time.  Therefore
policy->rwsem cannot be dropped in cpufreq_set_policy() at any point,
but the deadlock situation described above must be avoided too.

To that end, use the observation that in principle governor tunables may
be represented by the same data type regardless of whether the governor
is system-wide or per-policy and introduce a new structure, struct
governor_attr, for representing them and new corresponding macros for
creating show/store sysfs callbacks for them.  Also make their parent
kobject use a new kobject type whose default show/store callbacks are
not related to the standard core cpufreq ones in any way (and they don't
acquire policy->rwsem in particular).

[ Rafael: Written changelog ]
Signed-off-by: Viresh Kumar 
---
 drivers/cpufreq/cpufreq_conservative.c | 73 --
 drivers/cpufreq/cpufreq_governor.c | 68 +++
 drivers/cpufreq/cpufreq_governor.h | 40 ++-
 drivers/cpufreq/cpufreq_ondemand.c | 73 --
 4 files changed, 150 insertions(+), 104 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index ee4937ab6a8b..6d45b7e6b43f 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -235,54 +235,33 @@ static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
return count;
 }
 
-show_store_one(cs, down_threshold);
-show_store_one(cs, freq_step);
-show_store_one_global(cs, sampling_rate);
-show_store_one_global(cs, sampling_down_factor);
-show_store_one_global(cs, up_threshold);
-show_store_one_global(cs, ignore_nice_load);
-show_one_global(cs, min_sampling_rate);
-
-gov_sys_pol_attr_rw(sampling_rate);
-gov_sys_pol_attr_rw(sampling_down_factor);
-gov_sys_pol_attr_rw(up_threshold);
-gov_sys_pol_attr_rw(down_threshold);
-gov_sys_pol_attr_rw(ignore_nice_load);
-gov_sys_pol_attr_rw(freq_step);
-gov_sys_pol_attr_ro(min_sampling_rate);
-
-static struct attribute *dbs_attributes_gov_sys[] = {
-   &min_sampling_rate_gov_sys.attr,
-   &sampling_rate_gov_sys.attr,
-   &sampling_down_factor_gov_sys.attr,
-   &up_threshold_gov_sys.attr,
-   &down_threshold_gov_sys.attr,
-   &ignore_nice_load_gov_sys.attr,
-   &freq_step_gov_sys.attr,
+gov_show_one(sampling_rate);
+gov_show_one(sampling_down_factor);
+gov_show_one(up_threshold);
+gov_show_one(ignore_nice_load);
+gov_show_one(min_sampling_rate);
+gov_show_one_tunable(cs, down_threshold);
+gov_show_one_tunable(cs, freq_step);
+
+gov_attr_rw(sampling_rate);
+gov_attr_rw(sampling_down_factor);
+gov_attr_rw(up_threshold);
+gov_attr_rw(ignore_nice_load);
+gov_attr_ro(min_sampling_rate);
+gov_attr_rw(down_threshold);
+gov_attr_rw(freq_step);
+
+static struct attribute *cs_attributes[] = {
+   &min_sampling_rate.attr,
+   &sampling_rate.attr,
+   &sampling_down_factor.attr,
+   &up_threshold.attr,
+   &down_threshold.attr,
+   &ignore_nice_load.attr,
+   &freq_step.attr,
NULL
 };
 
-static struct attribute_group cs_attr_group_gov_sys = {
-   .attrs = dbs_attributes_gov_sys,
-   .name = "conservative",
-};
-
-static struct attribute *dbs_attributes_gov_pol[] = {
-   &min_sampling_rate_gov_pol.attr,
-   &sampling_rate_gov_pol.attr,
-   &sampling_down_factor_gov_pol.attr,
-   &up_threshold_gov_pol.attr,
-   &down_threshold_gov_pol.attr,
-   &ignore_nice_load_gov_pol.attr,
-   &freq_step_gov_pol.attr,
-   NULL
-};
-
-static struct attribute_group cs_attr_group_gov

[PATCH V3 09/13] cpufreq: governor: Move common sysfs tunables to cpufreq_governor.c

2016-02-08 Thread Viresh Kumar
We have got five common sysfs tunables between ondemand and conservative
governors, move their callbacks to cpufreq_governor.c to get rid of
redundant code.

Because of minor differences in the implementation of the callbacks,
some more per-governor callbacks are introduced in order to not
introduce any more "governor == ONDEMAND/CONSERVATIVE" like checks.

Signed-off-by: Viresh Kumar 
---
 drivers/cpufreq/cpufreq_conservative.c |  80 +-
 drivers/cpufreq/cpufreq_governor.c | 100 +
 drivers/cpufreq/cpufreq_governor.h |  16 +-
 drivers/cpufreq/cpufreq_ondemand.c | 100 -
 4 files changed, 151 insertions(+), 145 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 6d45b7e6b43f..f96770dab788 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -122,47 +122,17 @@ static struct notifier_block cs_cpufreq_notifier_block = {
 /*********** sysfs interface ***********/
 static struct dbs_governor cs_dbs_gov;
 
-static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
-   const char *buf, size_t count)
+static bool invalid_up_threshold(struct dbs_data *dbs_data,
+unsigned int threshold)
 {
-   unsigned int input;
-   int ret;
-   ret = sscanf(buf, "%u", &input);
-
-   if (ret != 1 || input > MAX_SAMPLING_DOWN_FACTOR || input < 1)
-   return -EINVAL;
+   struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
-   dbs_data->sampling_down_factor = input;
-   return count;
+   return threshold > 100 || threshold <= cs_tuners->down_threshold;
 }
 
-static ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
-   size_t count)
+static bool invalid_sampling_down_factor(unsigned int factor)
 {
-   unsigned int input;
-   int ret;
-   ret = sscanf(buf, "%u", &input);
-
-   if (ret != 1)
-   return -EINVAL;
-
-   dbs_data->sampling_rate = max(input, dbs_data->min_sampling_rate);
-   return count;
-}
-
-static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
-   size_t count)
-{
-   struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
-   unsigned int input;
-   int ret;
-   ret = sscanf(buf, "%u", &input);
-
-   if (ret != 1 || input > 100 || input <= cs_tuners->down_threshold)
-   return -EINVAL;
-
-   dbs_data->up_threshold = input;
-   return count;
+   return factor > MAX_SAMPLING_DOWN_FACTOR;
 }
 
 static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
@@ -182,27 +152,13 @@ static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
return count;
 }
 
-static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
-   const char *buf, size_t count)
+static void update_ignore_nice_load(struct dbs_data *dbs_data)
 {
-   unsigned int input, j;
-   int ret;
-
-   ret = sscanf(buf, "%u", &input);
-   if (ret != 1)
-   return -EINVAL;
-
-   if (input > 1)
-   input = 1;
-
-   if (input == dbs_data->ignore_nice_load) /* nothing to do */
-   return count;
-
-   dbs_data->ignore_nice_load = input;
+   struct cs_cpu_dbs_info_s *dbs_info;
+   unsigned int j;
 
/* we need to re-evaluate prev_cpu_idle */
for_each_online_cpu(j) {
-   struct cs_cpu_dbs_info_s *dbs_info;
dbs_info = &per_cpu(cs_cpu_dbs_info, j);
dbs_info->cdbs.prev_cpu_idle = get_cpu_idle_time(j,
&dbs_info->cdbs.prev_cpu_wall, 0);
@@ -210,7 +166,6 @@ static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
dbs_info->cdbs.prev_cpu_nice =
kcpustat_cpu(j).cpustat[CPUTIME_NICE];
}
-   return count;
 }
 
 static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
@@ -235,21 +190,11 @@ static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
return count;
 }
 
-gov_show_one(sampling_rate);
-gov_show_one(sampling_down_factor);
-gov_show_one(up_threshold);
-gov_show_one(ignore_nice_load);
-gov_show_one(min_sampling_rate);
 gov_show_one_tunable(cs, down_threshold);
 gov_show_one_tunable(cs, freq_step);
 
-gov_attr_rw(sampling_rate);
-gov_attr_rw(sampling_down_factor);
-gov_attr_rw(up_threshold);
-gov_attr_rw(ignore_nice_load);
-gov_attr_ro(min_sampling_rate);
-gov_attr_rw(down_threshold);
-gov_attr_rw(freq_step);
+static gov_attr_rw(down_threshold);
+static gov_attr_rw(freq_step);
 
 static struct attribute *cs_at

[PATCH V3 02/13] cpufreq: governor: Move common tunables to 'struct dbs_data'

2016-02-08 Thread Viresh Kumar
There are a few more common tunables shared across the ondemand and
conservative governors. Move them to 'struct dbs_data' to simplify code.

Signed-off-by: Viresh Kumar 
---
 drivers/cpufreq/cpufreq_conservative.c | 38 ++-
 drivers/cpufreq/cpufreq_governor.c | 37 ++
 drivers/cpufreq/cpufreq_governor.h | 14 +---
 drivers/cpufreq/cpufreq_ondemand.c | 41 +++---
 4 files changed, 47 insertions(+), 83 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 8aaa8a4c2fca..ee4937ab6a8b 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -60,7 +60,7 @@ static void cs_check_cpu(int cpu, unsigned int load)
return;
 
/* Check for frequency increase */
-   if (load > cs_tuners->up_threshold) {
+   if (load > dbs_data->up_threshold) {
dbs_info->down_skip = 0;
 
/* if we are already at full speed then break out early */
@@ -78,7 +78,7 @@ static void cs_check_cpu(int cpu, unsigned int load)
}
 
/* if sampling_down_factor is active break out early */
-   if (++dbs_info->down_skip < cs_tuners->sampling_down_factor)
+   if (++dbs_info->down_skip < dbs_data->sampling_down_factor)
return;
dbs_info->down_skip = 0;
 
@@ -107,10 +107,9 @@ static unsigned int cs_dbs_timer(struct cpufreq_policy *policy)
 {
struct policy_dbs_info *policy_dbs = policy->governor_data;
struct dbs_data *dbs_data = policy_dbs->dbs_data;
-   struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
dbs_check_cpu(policy);
-   return delay_for_sampling_rate(cs_tuners->sampling_rate);
+   return delay_for_sampling_rate(dbs_data->sampling_rate);
 }
 
 static int dbs_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
@@ -126,7 +125,6 @@ static struct dbs_governor cs_dbs_gov;
 static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
const char *buf, size_t count)
 {
-   struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -134,14 +132,13 @@ static ssize_t store_sampling_down_factor(struct dbs_data *dbs_data,
if (ret != 1 || input > MAX_SAMPLING_DOWN_FACTOR || input < 1)
return -EINVAL;
 
-   cs_tuners->sampling_down_factor = input;
+   dbs_data->sampling_down_factor = input;
return count;
 }
 
 static ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
size_t count)
 {
-   struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
@@ -149,7 +146,7 @@ static ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
if (ret != 1)
return -EINVAL;
 
-   cs_tuners->sampling_rate = max(input, dbs_data->min_sampling_rate);
+   dbs_data->sampling_rate = max(input, dbs_data->min_sampling_rate);
return count;
 }
 
@@ -164,7 +161,7 @@ static ssize_t store_up_threshold(struct dbs_data *dbs_data, const char *buf,
if (ret != 1 || input > 100 || input <= cs_tuners->down_threshold)
return -EINVAL;
 
-   cs_tuners->up_threshold = input;
+   dbs_data->up_threshold = input;
return count;
 }
 
@@ -178,7 +175,7 @@ static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
 
/* cannot be lower than 11 otherwise freq will not fall */
if (ret != 1 || input < 11 || input > 100 ||
-   input >= cs_tuners->up_threshold)
+   input >= dbs_data->up_threshold)
return -EINVAL;
 
cs_tuners->down_threshold = input;
@@ -188,7 +185,6 @@ static ssize_t store_down_threshold(struct dbs_data *dbs_data, const char *buf,
 static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
const char *buf, size_t count)
 {
-   struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
unsigned int input, j;
int ret;
 
@@ -199,10 +195,10 @@ static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
if (input > 1)
input = 1;
 
-   if (input == cs_tuners->ignore_nice_load) /* nothing to do */
+   if (input == dbs_data->ignore_nice_load) /* nothing to do */
return count;
 
-   cs_tuners->ignore_nice_load = input;
+   dbs_data->ignore_nice_load = input;
 
/* we need to re-evaluate prev_cpu_idle */
for_each_online_cpu(j) {
@@ -210,7 +206,7 @@ static ssize_t store_ignore_nice_load(struct dbs_data *dbs_data,
 

Re: [PATCH V2 2/7] cpufreq: governor: New sysfs show/store callbacks for governor tunables

2016-02-03 Thread Viresh Kumar
On 03-02-16, 19:32, Viresh Kumar wrote:

Build bot reported a minor fix here for compiling governors as
modules:

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index e7f79d2477fa..f76a83a99ca4 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -73,6 +73,7 @@ const struct sysfs_ops governor_sysfs_ops = {
.show   = governor_show,
.store  = governor_store,
 };
+EXPORT_SYMBOL_GPL(governor_sysfs_ops);
 
 void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
 {


Full patch pasted below.

-8<-

From: Viresh Kumar 
Date: Tue, 2 Feb 2016 12:35:01 +0530
Subject: [PATCH] cpufreq: governor: New sysfs show/store callbacks for
 governor tunables

The ondemand and conservative governors use the global-attr or freq-attr
structures to represent sysfs attributes corresponding to their tunables
(which of them is actually used depends on whether or not different
policy objects can use the same governor with different tunables at the
same time and, consequently, on where those attributes are located in
sysfs).

Unfortunately, in the freq-attr case, the standard cpufreq show/store
sysfs attribute callbacks are applied to the governor tunable attributes
and they always acquire the policy->rwsem lock before carrying out the
operation.  That may lead to an ABBA deadlock if governor tunable
attributes are removed under policy->rwsem while one of them is being
accessed concurrently (if sysfs attributes removal wins the race, it
will wait for the access to complete with policy->rwsem held while the
attribute callback will block on policy->rwsem indefinitely).

We attempted to address this issue by dropping policy->rwsem around
governor tunable attributes removal (that is, around invocations of the
->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT)
in cpufreq_set_policy(), but that opened up race conditions that had not
been possible with policy->rwsem held all the time.  Therefore
policy->rwsem cannot be dropped in cpufreq_set_policy() at any point,
but the deadlock situation described above must be avoided too.

To that end, use the observation that in principle governor tunables may
be represented by the same data type regardless of whether the governor
is system-wide or per-policy and introduce a new structure, struct
governor_attr, for representing them and new corresponding macros for
creating show/store sysfs callbacks for them.  Also make their parent
kobject use a new kobject type whose default show/store callbacks are
not related to the standard core cpufreq ones in any way (and they don't
acquire policy->rwsem in particular).

[ Rafael: Written changelog ]
Signed-off-by: Viresh Kumar 
---
 drivers/cpufreq/cpufreq_conservative.c | 73 --
 drivers/cpufreq/cpufreq_governor.c | 70 +++-
 drivers/cpufreq/cpufreq_governor.h | 34 ++--
 drivers/cpufreq/cpufreq_ondemand.c | 73 --
 4 files changed, 143 insertions(+), 107 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 57750367bd26..c749fb4fe5d2 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -275,54 +275,33 @@ static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
return count;
 }
 
-show_store_one(cs, sampling_rate);
-show_store_one(cs, sampling_down_factor);
-show_store_one(cs, up_threshold);
-show_store_one(cs, down_threshold);
-show_store_one(cs, ignore_nice_load);
-show_store_one(cs, freq_step);
-show_one(cs, min_sampling_rate);
-
-gov_sys_pol_attr_rw(sampling_rate);
-gov_sys_pol_attr_rw(sampling_down_factor);
-gov_sys_pol_attr_rw(up_threshold);
-gov_sys_pol_attr_rw(down_threshold);
-gov_sys_pol_attr_rw(ignore_nice_load);
-gov_sys_pol_attr_rw(freq_step);
-gov_sys_pol_attr_ro(min_sampling_rate);
-
-static struct attribute *dbs_attributes_gov_sys[] = {
-   &min_sampling_rate_gov_sys.attr,
-   &sampling_rate_gov_sys.attr,
-   &sampling_down_factor_gov_sys.attr,
-   &up_threshold_gov_sys.attr,
-   &down_threshold_gov_sys.attr,
-   &ignore_nice_load_gov_sys.attr,
-   &freq_step_gov_sys.attr,
+gov_show_one(cs, sampling_rate);
+gov_show_one(cs, sampling_down_factor);
+gov_show_one(cs, up_threshold);
+gov_show_one(cs, down_threshold);
+gov_show_one(cs, ignore_nice_load);
+gov_show_one(cs, freq_step);
+gov_show_one(cs, min_sampling_rate);
+
+gov_attr_rw(sampling_rate);
+gov_attr_rw(sampling_down_factor);
+gov_attr_rw(up_threshold);
+gov_attr_rw(down_threshold);
+gov_attr_rw(ignore_nice_load);
+gov_attr_rw(freq_step);
+gov_attr_ro(min_sampling_rate);
+
+static struct attribute *cs_attributes[] = {
+   &min_sampling_rate.attr,
+   &sampling_ra

[PATCH V2 2/7] cpufreq: governor: New sysfs show/store callbacks for governor tunables

2016-02-03 Thread Viresh Kumar
The ondemand and conservative governors use the global-attr or freq-attr
structures to represent sysfs attributes corresponding to their tunables
(which of them is actually used depends on whether or not different
policy objects can use the same governor with different tunables at the
same time and, consequently, on where those attributes are located in
sysfs).

Unfortunately, in the freq-attr case, the standard cpufreq show/store
sysfs attribute callbacks are applied to the governor tunable attributes
and they always acquire the policy->rwsem lock before carrying out the
operation.  That may lead to an ABBA deadlock if governor tunable
attributes are removed under policy->rwsem while one of them is being
accessed concurrently (if sysfs attributes removal wins the race, it
will wait for the access to complete with policy->rwsem held while the
attribute callback will block on policy->rwsem indefinitely).

We attempted to address this issue by dropping policy->rwsem around
governor tunable attributes removal (that is, around invocations of the
->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT)
in cpufreq_set_policy(), but that opened up race conditions that had not
been possible with policy->rwsem held all the time.  Therefore
policy->rwsem cannot be dropped in cpufreq_set_policy() at any point,
but the deadlock situation described above must be avoided too.

To that end, use the observation that in principle governor tunables may
be represented by the same data type regardless of whether the governor
is system-wide or per-policy and introduce a new structure, struct
governor_attr, for representing them and new corresponding macros for
creating show/store sysfs callbacks for them.  Also make their parent
kobject use a new kobject type whose default show/store callbacks are
not related to the standard core cpufreq ones in any way (and they don't
acquire policy->rwsem in particular).

[ Rafael: Written changelog ]
Signed-off-by: Viresh Kumar 
---
 drivers/cpufreq/cpufreq_conservative.c | 73 --
 drivers/cpufreq/cpufreq_governor.c | 69 +++-
 drivers/cpufreq/cpufreq_governor.h | 34 ++--
 drivers/cpufreq/cpufreq_ondemand.c | 73 --
 4 files changed, 142 insertions(+), 107 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_conservative.c 
b/drivers/cpufreq/cpufreq_conservative.c
index 57750367bd26..c749fb4fe5d2 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -275,54 +275,33 @@ static ssize_t store_freq_step(struct dbs_data *dbs_data, 
const char *buf,
return count;
 }
 
-show_store_one(cs, sampling_rate);
-show_store_one(cs, sampling_down_factor);
-show_store_one(cs, up_threshold);
-show_store_one(cs, down_threshold);
-show_store_one(cs, ignore_nice_load);
-show_store_one(cs, freq_step);
-show_one(cs, min_sampling_rate);
-
-gov_sys_pol_attr_rw(sampling_rate);
-gov_sys_pol_attr_rw(sampling_down_factor);
-gov_sys_pol_attr_rw(up_threshold);
-gov_sys_pol_attr_rw(down_threshold);
-gov_sys_pol_attr_rw(ignore_nice_load);
-gov_sys_pol_attr_rw(freq_step);
-gov_sys_pol_attr_ro(min_sampling_rate);
-
-static struct attribute *dbs_attributes_gov_sys[] = {
-   &min_sampling_rate_gov_sys.attr,
-   &sampling_rate_gov_sys.attr,
-   &sampling_down_factor_gov_sys.attr,
-   &up_threshold_gov_sys.attr,
-   &down_threshold_gov_sys.attr,
-   &ignore_nice_load_gov_sys.attr,
-   &freq_step_gov_sys.attr,
+gov_show_one(cs, sampling_rate);
+gov_show_one(cs, sampling_down_factor);
+gov_show_one(cs, up_threshold);
+gov_show_one(cs, down_threshold);
+gov_show_one(cs, ignore_nice_load);
+gov_show_one(cs, freq_step);
+gov_show_one(cs, min_sampling_rate);
+
+gov_attr_rw(sampling_rate);
+gov_attr_rw(sampling_down_factor);
+gov_attr_rw(up_threshold);
+gov_attr_rw(down_threshold);
+gov_attr_rw(ignore_nice_load);
+gov_attr_rw(freq_step);
+gov_attr_ro(min_sampling_rate);
+
+static struct attribute *cs_attributes[] = {
+   &min_sampling_rate.attr,
+   &sampling_rate.attr,
+   &sampling_down_factor.attr,
+   &up_threshold.attr,
+   &down_threshold.attr,
+   &ignore_nice_load.attr,
+   &freq_step.attr,
NULL
 };
 
-static struct attribute_group cs_attr_group_gov_sys = {
-   .attrs = dbs_attributes_gov_sys,
-   .name = "conservative",
-};
-
-static struct attribute *dbs_attributes_gov_pol[] = {
-   &min_sampling_rate_gov_pol.attr,
-   &sampling_rate_gov_pol.attr,
-   &sampling_down_factor_gov_pol.attr,
-   &up_threshold_gov_pol.attr,
-   &down_threshold_gov_pol.attr,
-   &ignore_nice_load_gov_pol.attr,
-   &freq_step_gov_pol.attr,
-   NULL
-};
-
-static struct attribute_group cs_attr_group_gov_pol = {
-   .attrs = dbs_attributes

[PATCH RFC 08/22] block, cfq: get rid of latency tunables

2016-02-01 Thread Paolo Valente
BFQ guarantees low latency for interactive applications in a
completely different way from CFQ. In terms of interface, however, and
exactly as CFQ does, BFQ exports a boolean low_latency tunable to
switch its low-latency heuristics on (in BFQ, these heuristics lower
latency for interactive and soft real-time applications). Finally,
unlike CFQ, BFQ has no other latency tunables.

Accordingly, this commit temporarily turns all latency tunables into
fake tunables, by turning the functions for reading and writing these
tunables into functions that just generate warnings. The commit
introducing low-latency heuristics in BFQ then restores only the
boolean low_latency tunable.

Signed-off-by: Paolo Valente 
---
 block/cfq-iosched.c | 31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 15ee70d..136ed5b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -30,7 +30,6 @@ static const int cfq_slice_sync = HZ / 10;
 static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
-static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
 static const int cfq_hist_divisor = 4;
 
 /*
@@ -227,8 +226,6 @@ struct cfq_data {
unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
-   unsigned int cfq_latency;
-   unsigned int cfq_target_latency;
 
/*
 * Fallback dummy cfqq for extreme OOM conditions
@@ -1463,7 +1460,7 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, 
struct cfq_queue *cfqq)
 * We also ramp up the dispatch depth gradually for async IO,
 * based on the last sync IO we serviced
 */
-   if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
+   if (!cfq_cfqq_sync(cfqq)) {
unsigned long last_sync = jiffies - cfqd->last_delayed_sync;
unsigned int depth;
 
@@ -2269,10 +2266,8 @@ static int cfq_init_queue(struct request_queue *q, 
struct elevator_type *e)
cfqd->cfq_back_penalty = cfq_back_penalty;
cfqd->cfq_slice[0] = cfq_slice_async;
cfqd->cfq_slice[1] = cfq_slice_sync;
-   cfqd->cfq_target_latency = cfq_target_latency;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
-   cfqd->cfq_latency = 1;
cfqd->hw_tag = -1;
/*
 * we optimistically start assuming sync ops weren't delayed in last
@@ -2330,8 +2325,6 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 
1);
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
-SHOW_FUNCTION(cfq_low_latency_show, cfqd->cfq_latency, 0);
-SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)
\
@@ -2363,13 +2356,27 @@ STORE_FUNCTION(cfq_slice_sync_store, 
&cfqd->cfq_slice[1], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
-STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
-STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, 
UINT_MAX, 1);
 #undef STORE_FUNCTION
 
+static ssize_t cfq_fake_lat_show(struct elevator_queue *e, char *page)
+{
+   pr_warn_once("CFQ I/O SCHED: tried to read removed latency tunable");
+   return sprintf(page, "0\n");
+}
+
+static ssize_t
+cfq_fake_lat_store(struct elevator_queue *e, const char *page, size_t count)
+{
+   pr_warn_once("CFQ I/O SCHED: tried to write removed latency tunable");
+   return count;
+}
+
 #define CFQ_ATTR(name) \
__ATTR(name, S_IRUGO|S_IWUSR, cfq_##name##_show, cfq_##name##_store)
 
+#define CFQ_FAKE_LAT_ATTR(name) \
+   __ATTR(name, S_IRUGO|S_IWUSR, cfq_fake_lat_show, cfq_fake_lat_store)
+
 static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(quantum),
CFQ_ATTR(fifo_expire_sync),
@@ -2380,8 +2387,8 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
-   CFQ_ATTR(low_latency),
-   CFQ_ATTR(target_latency),
+   CFQ_FAKE_LAT_ATTR(low_latency),
+   CFQ_FAKE_LAT_ATTR(target_latency),
__ATTR_NULL
 };
 
-- 
1.9.1



Re: [PATCH] cpufreq: governors: Reset tunables only for cpufreq_unregister_governor()

2013-01-31 Thread Viresh Kumar
On 1 February 2013 11:12, Viresh Kumar  wrote:
> Currently, whenever governor->governor() is called for the CPUFREQ_GOV_START
> event, we reset a few of the governor's tunables. That isn't correct, as this
> routine is called for every CPU hot-[un]plug event. We should actually reset
> these only when the governor module is removed and re-installed.
>
> Signed-off-by: Viresh Kumar 

ARM mails are broken; please apply the attached patch.


0001-cpufreq-governors-Reset-tunables-only-for-cpufreq_un.patch
Description: Binary data


[PATCH] cpufreq: governors: Reset tunables only for cpufreq_unregister_governor()

2013-01-31 Thread Viresh Kumar
Currently, whenever governor->governor() is called for the CPUFREQ_GOV_START
event, we reset a few of the governor's tunables. That isn't correct, as this
routine is called for every CPU hot-[un]plug event. We should actually reset
these only when the governor module is removed and re-installed.

Signed-off-by: Viresh Kumar 
---
 drivers/cpufreq/cpufreq.c  |  4 
 drivers/cpufreq/cpufreq_governor.c | 24 
 include/linux/cpufreq.h|  1 +
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 8d521422..9656420 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1561,6 +1561,9 @@ static int __cpufreq_governor(struct cpufreq_policy 
*policy,
policy->cpu, event);
ret = policy->governor->governor(policy, event);
 
+   if (!policy->governor->initialized && (event == CPUFREQ_GOV_START))
+   policy->governor->initialized = 1;
+
/* we keep one module reference alive for
each CPU governed by this CPU */
if ((event != CPUFREQ_GOV_START) || ret)
@@ -1584,6 +1587,7 @@ int cpufreq_register_governor(struct cpufreq_governor 
*governor)
 
mutex_lock(&cpufreq_governor_mutex);
 
+   governor->initialized = 0;
err = -EBUSY;
if (__find_governor(governor->name) == NULL) {
err = 0;
diff --git a/drivers/cpufreq/cpufreq_governor.c 
b/drivers/cpufreq/cpufreq_governor.c
index 7aaa9b1..79795c4 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -254,11 +254,6 @@ int cpufreq_governor_dbs(struct dbs_data *dbs_data,
return rc;
}
 
-   /* policy latency is in nS. Convert it to uS first */
-   latency = policy->cpuinfo.transition_latency / 1000;
-   if (latency == 0)
-   latency = 1;
-
/*
 * conservative does not implement micro like ondemand
 * governor, thus we are bound to jiffes/HZ
@@ -270,20 +265,33 @@ int cpufreq_governor_dbs(struct dbs_data *dbs_data,
cpufreq_register_notifier(cs_ops->notifier_block,
CPUFREQ_TRANSITION_NOTIFIER);
 
-   dbs_data->min_sampling_rate = MIN_SAMPLING_RATE_RATIO *
-   jiffies_to_usecs(10);
+   if (!policy->governor->initialized)
+   dbs_data->min_sampling_rate =
+   MIN_SAMPLING_RATE_RATIO *
+   jiffies_to_usecs(10);
} else {
od_dbs_info->rate_mult = 1;
od_dbs_info->sample_type = OD_NORMAL_SAMPLE;
od_ops->powersave_bias_init_cpu(cpu);
-   od_tuners->io_is_busy = od_ops->io_busy();
+
+   if (!policy->governor->initialized)
+   od_tuners->io_is_busy = od_ops->io_busy();
}
 
+   if (policy->governor->initialized)
+   goto unlock;
+
+   /* policy latency is in nS. Convert it to uS first */
+   latency = policy->cpuinfo.transition_latency / 1000;
+   if (latency == 0)
+   latency = 1;
+
/* Bring kernel and HW constraints together */
dbs_data->min_sampling_rate = max(dbs_data->min_sampling_rate,
MIN_LATENCY_MULTIPLIER * latency);
*sampling_rate = max(dbs_data->min_sampling_rate, latency *
LATENCY_MULTIPLIER);
+unlock:
mutex_unlock(&dbs_data->mutex);
 
/* Initiate timer time stamp */
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index feb360c..6bf3f2d 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -183,6 +183,7 @@ static inline unsigned long cpufreq_scale(unsigned long 
old, u_int div, u_int mu
 
 struct cpufreq_governor {
charname[CPUFREQ_NAME_LEN];
+   int initialized;
int (*governor) (struct cpufreq_policy *policy,
 unsigned int event);
ssize_t (*show_setspeed)(struct cpufreq_policy *policy,
-- 
1.7.12.rc2.18.g61b472e


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v6 27/29] slab: propagate tunables values

2012-11-01 Thread Glauber Costa
SLAB allows us to tune a particular cache behavior with tunables.
When creating a new memcg cache copy, we'd like to preserve any tunables
the parent cache already had.

This could be done by an explicit call to do_tune_cpucache() after the
cache is created. But this is not very convenient now that the caches are
created from common code, since this function is SLAB-specific.

Another method of doing that is taking advantage of the fact that
do_tune_cpucache() is always called from enable_cpucache(), which is
called at cache initialization. We can just preset the values, and
then things work as expected.

It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.

This change requires us to move the assignment of root_cache in
memcg_params a bit earlier: it must already be set (which
memcg_kmem_register_cache will do) by the time we reach
__kmem_cache_create().

Signed-off-by: Glauber Costa 
CC: Christoph Lameter 
CC: Pekka Enberg 
CC: Michal Hocko 
CC: Kamezawa Hiroyuki 
CC: Johannes Weiner 
CC: Suleiman Souhlal 
CC: Tejun Heo 
---
 include/linux/memcontrol.h |  8 +---
 include/linux/slab.h   |  2 +-
 mm/memcontrol.c| 10 ++
 mm/slab.c  | 44 +---
 mm/slab.h  | 12 
 mm/slab_common.c   |  7 ---
 6 files changed, 69 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c780dd6..c91e3c1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -441,7 +441,8 @@ void __memcg_kmem_commit_charge(struct page *page,
 void __memcg_kmem_uncharge_pages(struct page *page, int order);
 
 int memcg_cache_id(struct mem_cgroup *memcg);
-int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s);
+int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
+struct kmem_cache *root_cache);
 void memcg_release_cache(struct kmem_cache *cachep);
 void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep);
 
@@ -583,8 +584,9 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
return -1;
 }
 
-static inline int memcg_register_cache(struct mem_cgroup *memcg,
-  struct kmem_cache *s)
+static inline int
+memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
+struct kmem_cache *root_cache)
 {
return 0;
 }
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 1232c7f..81ee767 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -128,7 +128,7 @@ struct kmem_cache *kmem_cache_create(const char *, size_t, 
size_t,
void (*)(void *));
 struct kmem_cache *
 kmem_cache_create_memcg(struct mem_cgroup *, const char *, size_t, size_t,
-   unsigned long, void (*)(void *));
+   unsigned long, void (*)(void *), struct kmem_cache *);
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 void kmem_cache_free(struct kmem_cache *, void *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 35f5cb3..7d14fbd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2981,7 +2981,8 @@ int memcg_update_cache_size(struct kmem_cache *s, int 
num_groups)
return 0;
 }
 
-int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s)
+int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
+struct kmem_cache *root_cache)
 {
size_t size = sizeof(struct memcg_cache_params);
 
@@ -2995,8 +2996,10 @@ int memcg_register_cache(struct mem_cgroup *memcg, 
struct kmem_cache *s)
if (!s->memcg_params)
return -ENOMEM;
 
-   if (memcg)
+   if (memcg) {
s->memcg_params->memcg = memcg;
+   s->memcg_params->root_cache = root_cache;
+   }
return 0;
 }
 
@@ -3162,7 +3165,7 @@ static struct kmem_cache *kmem_cache_dup(struct 
mem_cgroup *memcg,
return NULL;
 
new = kmem_cache_create_memcg(memcg, name, s->object_size, s->align,
- (s->flags & ~SLAB_PANIC), s->ctor);
+ (s->flags & ~SLAB_PANIC), s->ctor, s);
 
if (new)
new->allocflags |= __GFP_KMEMCG;
@@ -3206,7 +3209,6 @@ static struct kmem_cache *memcg_create_kmem_cache(struct 
mem_cgroup *memcg,
}
 
mem_cgroup_get(memcg);
-   new_cachep->memcg_params->root_cache = cachep;
atomic_set(&new_cachep->memcg_params->nr_pages , 0);
 
cachep->memcg_params->memcg_caches[idx] = new_cachep;
diff --git a/mm/slab.c b/mm/slab.c
index 15bb502..628a88e 100644
--- a/mm/slab.c
+++ b/m

Re: [PATCH v5 16/18] slab: propagate tunables values

2012-10-23 Thread Christoph Lameter
On Mon, 22 Oct 2012, Glauber Costa wrote:

> On 10/19/2012 11:51 PM, Christoph Lameter wrote:
> > On Fri, 19 Oct 2012, Glauber Costa wrote:
> >
> >> SLAB allows us to tune a particular cache behavior with tunables.
> >> When creating a new memcg cache copy, we'd like to preserve any tunables
> >> the parent cache already had.
> >
> > SLAB and SLUB allow tuning. Could you come up with some way to put these
> > things into slab common and make it flexible so that the tuning could be
> > used for future allocators (like SLAM etc)?
> >
> They do, but they also do it very differently; slub, for example, uses
> sysfs, while slab doesn't.

Well yes that is something that I also want to make more general so that
all allocators support sysfs style display of status and tuning.

> I of course fully support the integration, I just don't think this
> should be a blocker for all kinds of work in the allocators. Converting
> slab to sysfs seems to be major work, which you are already tackling.
> Were it simple, I believe it would have been done already. Without it,
> this is pretty much a fake integration...

Well there is quite a bit of infrastructure that needs to be common in
order to get this done properly. I hope we will get around to that
someday.



Re: [PATCH v5 16/18] slab: propagate tunables values

2012-10-22 Thread Glauber Costa
On 10/19/2012 11:51 PM, Christoph Lameter wrote:
> On Fri, 19 Oct 2012, Glauber Costa wrote:
> 
>> SLAB allows us to tune a particular cache behavior with tunables.
>> When creating a new memcg cache copy, we'd like to preserve any tunables
>> the parent cache already had.
> 
> SLAB and SLUB allow tuning. Could you come up with some way to put these
> things into slab common and make it flexible so that the tuning could be
> used for future allocators (like SLAM etc)?
> 
They do, but they also do it very differently; slub, for example, uses
sysfs, while slab doesn't.

I of course fully support the integration, I just don't think this
should be a blocker for all kinds of work in the allocators. Converting
slab to sysfs seems to be major work, which you are already tackling.
Were it simple, I believe it would have been done already. Without it,
this is pretty much a fake integration...

In summary, adding this doesn't make the integration work any harder in
the future, and blocking this particular thing on sysfs integration is
unreasonable.

This being by far not central to the patchset, if this is an absolute
requirement, maybe I should just drop it for the time being so it
doesn't stall the rest of the development.





Re: [PATCH v5 16/18] slab: propagate tunables values

2012-10-19 Thread Christoph Lameter
On Fri, 19 Oct 2012, Glauber Costa wrote:

> SLAB allows us to tune a particular cache behavior with tunables.
> When creating a new memcg cache copy, we'd like to preserve any tunables
> the parent cache already had.

SLAB and SLUB allow tuning. Could you come up with some way to put these
things into slab common and make it flexible so that the tuning could be
used for future allocators (like SLAM etc)?


[PATCH v5 16/18] slab: propagate tunables values

2012-10-19 Thread Glauber Costa
SLAB allows us to tune a particular cache behavior with tunables.
When creating a new memcg cache copy, we'd like to preserve any tunables
the parent cache already had.

This could be done by an explicit call to do_tune_cpucache() after the
cache is created. But this is not very convenient now that the caches are
created from common code, since this function is SLAB-specific.

Another method of doing that is taking advantage of the fact that
do_tune_cpucache() is always called from enable_cpucache(), which is
called at cache initialization. We can just preset the values, and
then things work as expected.

It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.

This change requires us to move the assignment of root_cache in
memcg_params a bit earlier: it must already be set (which
memcg_kmem_register_cache will do) by the time we reach
__kmem_cache_create().

Signed-off-by: Glauber Costa 
CC: Christoph Lameter 
CC: Pekka Enberg 
CC: Michal Hocko 
CC: Kamezawa Hiroyuki 
CC: Johannes Weiner 
CC: Suleiman Souhlal 
CC: Tejun Heo 
---
 include/linux/memcontrol.h |  8 +---
 include/linux/slab.h   |  2 +-
 mm/memcontrol.c| 10 ++
 mm/slab.c  | 44 +---
 mm/slab.h  | 12 
 mm/slab_common.c   |  7 ---
 6 files changed, 69 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 14def0b..9da87ff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -421,7 +421,8 @@ void __memcg_kmem_commit_charge(struct page *page,
 void __memcg_kmem_uncharge_pages(struct page *page, int order);
 
 int memcg_css_id(struct mem_cgroup *memcg);
-int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s);
+int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
+struct kmem_cache *root_cache);
 void memcg_release_cache(struct kmem_cache *cachep);
 void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep);
 
@@ -564,8 +565,9 @@ memcg_kmem_commit_charge(struct page *page, struct 
mem_cgroup *memcg, int order)
 {
 }
 
-static inline int memcg_register_cache(struct mem_cgroup *memcg,
-  struct kmem_cache *s)
+static inline int
+memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
+struct kmem_cache *root_cache)
 {
return 0;
 }
diff --git a/include/linux/slab.h b/include/linux/slab.h
index b521426..f8db4e1 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -128,7 +128,7 @@ struct kmem_cache *kmem_cache_create(const char *, size_t, 
size_t,
void (*)(void *));
 struct kmem_cache *
 kmem_cache_create_memcg(struct mem_cgroup *, const char *, size_t, size_t,
-   unsigned long, void (*)(void *));
+   unsigned long, void (*)(void *), struct kmem_cache *);
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 void kmem_cache_free(struct kmem_cache *, void *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e7f3458..960d758 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2939,7 +2939,8 @@ int memcg_update_cache_size(struct kmem_cache *s, int 
num_groups)
return 0;
 }
 
-int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s)
+int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
+struct kmem_cache *root_cache)
 {
size_t size = sizeof(struct memcg_cache_params);
 
@@ -2953,8 +2954,10 @@ int memcg_register_cache(struct mem_cgroup *memcg, 
struct kmem_cache *s)
if (!s->memcg_params)
return -ENOMEM;
 
-   if (memcg)
+   if (memcg) {
s->memcg_params->memcg = memcg;
+   s->memcg_params->root_cache = root_cache;
+   }
return 0;
 }
 
@@ -3098,7 +3101,7 @@ static struct kmem_cache *kmem_cache_dup(struct 
mem_cgroup *memcg,
return NULL;
 
new = kmem_cache_create_memcg(memcg, name, s->object_size, s->align,
- (s->flags & ~SLAB_PANIC), s->ctor);
+ (s->flags & ~SLAB_PANIC), s->ctor, s);
 
if (new)
new->allocflags |= __GFP_KMEMCG;
@@ -3146,7 +3149,6 @@ static struct kmem_cache *memcg_create_kmem_cache(struct 
mem_cgroup *memcg,
cachep->memcg_params->memcg_caches[idx] = new_cachep;
wmb(); /* the readers won't lock, make sure everybody sees it */
new_cachep->memcg_params->memcg = memcg;
-   new_cachep->memcg_params->root_cache = cachep;
atomic_set(&new_cachep->memcg_params->nr_pages , 0);
 out:

Re: [PATCH v3 08/16] slab: allow enable_cpu_cache to use preset values for its tunables

2012-09-21 Thread Pekka Enberg
On Tue, Sep 18, 2012 at 5:12 PM, Glauber Costa  wrote:
> diff --git a/mm/slab.c b/mm/slab.c
> index e2cf984..f2d760c 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -4141,8 +4141,19 @@ static int do_tune_cpucache(struct kmem_cache *cachep, 
> int limit,
>  static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
>  {
> int err;
> -   int limit, shared;
> -
> +   int limit = 0;
> +   int shared = 0;
> +   int batchcount = 0;
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +   if (cachep->memcg_params.parent) {
> +   limit = cachep->memcg_params.parent->limit;
> +   shared = cachep->memcg_params.parent->shared;
> +   batchcount = cachep->memcg_params.parent->batchcount;

Style nit: please introduce a variable for
"cachep->memcg_params.parent" to make this human-readable.


Re: [PATCH v3 08/16] slab: allow enable_cpu_cache to use preset values for its tunables

2012-09-19 Thread Glauber Costa
On 09/18/2012 07:25 PM, Christoph Lameter wrote:
> On Tue, 18 Sep 2012, Glauber Costa wrote:
> 
>> SLAB allows us to tune a particular cache behavior with tunables.
>> When creating a new memcg cache copy, we'd like to preserve any tunables
>> the parent cache already had.
> 
> Again the same is true for SLUB. Some generic way of preserving tuning
> parameters would be appreciated.

So you would like me to extend "slub: slub-specific propagation changes"
to also allow for pre-set values, right?



Re: [PATCH v3 08/16] slab: allow enable_cpu_cache to use preset values for its tunables

2012-09-18 Thread Christoph Lameter
On Tue, 18 Sep 2012, Glauber Costa wrote:

> SLAB allows us to tune a particular cache behavior with tunables.
> When creating a new memcg cache copy, we'd like to preserve any tunables
> the parent cache already had.

Again the same is true for SLUB. Some generic way of preserving tuning
parameters would be appreciated.


[PATCH v3 08/16] slab: allow enable_cpu_cache to use preset values for its tunables

2012-09-18 Thread Glauber Costa
SLAB allows us to tune a particular cache behavior with tunables.
When creating a new memcg cache copy, we'd like to preserve any tunables
the parent cache already had.

This could be done by an explicit call to do_tune_cpucache() after the
cache is created. But this is not very convenient now that the caches are
created from common code, since this function is SLAB-specific.

Another method of doing that is taking advantage of the fact that
do_tune_cpucache() is always called from enable_cpucache(), which is
called at cache initialization. We can just preset the values, and
then things work as expected.

Signed-off-by: Glauber Costa 
CC: Christoph Lameter 
CC: Pekka Enberg 
CC: Michal Hocko 
CC: Kamezawa Hiroyuki 
CC: Johannes Weiner 
CC: Suleiman Souhlal 
---
 include/linux/slab.h |  3 ++-
 mm/memcontrol.c  |  2 +-
 mm/slab.c| 19 ---
 mm/slab_common.c |  6 --
 4 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index dc6daac..9d298db 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -128,7 +128,7 @@ struct kmem_cache *kmem_cache_create(const char *, size_t, 
size_t,
void (*)(void *));
 struct kmem_cache *
 kmem_cache_create_memcg(struct mem_cgroup *, const char *, size_t, size_t,
-   unsigned long, void (*)(void *));
+   unsigned long, void (*)(void *), struct kmem_cache *);
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 void kmem_cache_free(struct kmem_cache *, void *);
@@ -184,6 +184,7 @@ unsigned int kmem_cache_size(struct kmem_cache *);
 #ifdef CONFIG_MEMCG_KMEM
 struct mem_cgroup_cache_params {
struct mem_cgroup *memcg;
+   struct kmem_cache *parent;
int id;
 };
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 54247ec..ee982aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -588,7 +588,7 @@ static struct kmem_cache *kmem_cache_dup(struct mem_cgroup 
*memcg,
return NULL;
 
new = kmem_cache_create_memcg(memcg, name, s->object_size, s->align,
- (s->flags & ~SLAB_PANIC), s->ctor);
+ (s->flags & ~SLAB_PANIC), s->ctor, s);
 
kfree(name);
return new;
diff --git a/mm/slab.c b/mm/slab.c
index e2cf984..f2d760c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -4141,8 +4141,19 @@ static int do_tune_cpucache(struct kmem_cache *cachep, 
int limit,
 static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
 {
int err;
-   int limit, shared;
-
+   int limit = 0;
+   int shared = 0;
+   int batchcount = 0;
+
+#ifdef CONFIG_MEMCG_KMEM
+   if (cachep->memcg_params.parent) {
+   limit = cachep->memcg_params.parent->limit;
+   shared = cachep->memcg_params.parent->shared;
+   batchcount = cachep->memcg_params.parent->batchcount;
+   }
+#endif
+   if (limit && shared && batchcount)
+   goto skip_setup;
/*
 * The head array serves three purposes:
 * - create a LIFO ordering, i.e. return objects that are cache-warm
@@ -4184,7 +4195,9 @@ static int enable_cpucache(struct kmem_cache *cachep, 
gfp_t gfp)
if (limit > 32)
limit = 32;
 #endif
-   err = do_tune_cpucache(cachep, limit, (limit + 1) / 2, shared, gfp);
+   batchcount = (limit + 1) / 2;
+skip_setup:
+   err = do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
if (err)
printk(KERN_ERR "enable_cpucache failed for %s, error %d.\n",
   cachep->name, -err);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8f06849..6829aa4 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -100,7 +100,8 @@ static inline int kmem_cache_sanity_check(struct mem_cgroup 
*memcg,
 
 struct kmem_cache *
 kmem_cache_create_memcg(struct mem_cgroup *memcg, const char *name, size_t 
size,
-   size_t align, unsigned long flags, void (*ctor)(void *))
+   size_t align, unsigned long flags, void (*ctor)(void *),
+   struct kmem_cache *parent_cache)
 {
struct kmem_cache *s = NULL;
int err = 0;
@@ -122,6 +123,7 @@ kmem_cache_create_memcg(struct mem_cgroup *memcg, const 
char *name, size_t size,
s->ctor = ctor;
 #ifdef CONFIG_MEMCG_KMEM
s->memcg_params.memcg = memcg;
+   s->memcg_params.parent = parent_cache;
 #endif
s->name = kstrdup(name, GFP_KERNEL);
if (!s->name) {
@@ -168,7 +170,7 @@ struct kmem_cache *
 kmem_cache_create(const char *name, size_t size, size_t align,
  unsigned long flags, void (*ctor)(void *))
 {
-   return kmem_cache_create_memcg(NULL, name, size, alig

Re: [PATCH] [mcelog] Start using the new sysfs tunables location

2012-09-06 Thread Andi Kleen
> Even if we decide not to remove these tunables from under their current 
> per-cpu location, I still think it is much cleaner to have them 
> available under /sys/devices/system/machinecheck.

"much cleaner" is not sufficient justification to break an ABI.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] [mcelog] Start using the new sysfs tunables location

2012-09-06 Thread Alan Cox
On Thu, 06 Sep 2012 18:04:27 +0530
"Naveen N. Rao"  wrote:

> On 09/06/2012 05:58 PM, Andi Kleen wrote:
> >> The change is still under discussion. Stage one is to add the new global
> >> pathnames in addition to keeping the old per-cpu ones. Also fix all 
> >> utilities
> >> (just mcelog(8) as far as we know) to prefer the new paths.
> >
> > But why do you even want to change it?  Does it fix anything?
> > AFAIK the old setup -- while not being pretty -- works just fine.
> 
> The reason for this was explained in this thread:
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg298302.html
> 
> Even if we decide not to remove these tunables from under their current 
> per-cpu location, I still think it is much cleaner to have them 
> available under /sys/devices/system/machinecheck.

That to me seems a ridiculous proposal. What are you going to do if in
future they ceased to be system wide ? Move them back ?

The threshold for playing musical chairs with sysfs nodes is a lot higher
than "I think it's much cleaner"

The current approach is a lot more futureproof even if a spot more ugly.

Alan


Re: [PATCH] [mcelog] Start using the new sysfs tunables location

2012-09-06 Thread Naveen N. Rao

On 09/06/2012 05:58 PM, Andi Kleen wrote:

The change is still under discussion. Stage one is to add the new global
pathnames in addition to keeping the old per-cpu ones. Also fix all utilities
(just mcelog(8) as far as we know) to prefer the new paths.


But why do you even want to change it?  Does it fix anything?
AFAIK the old setup -- while not being pretty -- works just fine.


The reason for this was explained in this thread:
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg298302.html

Even if we decide not to remove these tunables from under their current 
per-cpu location, I still think it is much cleaner to have them 
available under /sys/devices/system/machinecheck.



- Naveen



Re: [PATCH] [mcelog] Start using the new sysfs tunables location

2012-09-06 Thread Andi Kleen
> The change is still under discussion. Stage one is to add the new global
> pathnames in addition to keeping the old per-cpu ones. Also fix all utilities
> (just mcelog(8) as far as we know) to prefer the new paths.

But why do you even want to change it?  Does it fix anything?
AFAIK the old setup -- while not being pretty -- works just fine.

-Andi


Re: [PATCH] [mcelog] Start using the new sysfs tunables location

2012-09-05 Thread Naveen N. Rao

On 09/06/2012 12:39 AM, Tony Luck wrote:

On Wed, Sep 5, 2012 at 11:47 AM, Andi Kleen  wrote:

On Wed, Sep 05, 2012 at 04:02:37PM +0530, Naveen N. Rao wrote:

All the current mce tunables are now available under
/sys/devices/system/machinecheck. Start using this new location, but fall back
to the older per-cpu location so that we continue working with older kernels.


Who did that change in the kernel?

That breaks Linus rule that the kernel should not break userland.
Kernel needs to fix that.


The change is still under discussion. Stage one is to add the new global
pathnames in addition to keeping the old per-cpu ones. Also fix all utilities
(just mcelog(8) as far as we know) to prefer the new paths.

After some time[1] ... delete the old paths. This is allowable under Linus'
modified edict that you can change ABI "if nobody complains". If we wait
long enough that the new mcelog is widely deployed, then nobody should
complain.

-Tony

[1] several years - not just a kernel release or two.



Tony,
Thanks for clarifying. I should have mentioned in the patch description 
that this is indeed subject to the original patch making it into the kernel.


On a related topic. I recently noticed that we don't have an entry for 
machinecheck in Documentation/ABI/. Should we add an entry in there? We 
could perhaps add the existing entries under obsolete/ and the new 
location under testing/?



- Naveen



Re: [PATCH] [mcelog] Start using the new sysfs tunables location

2012-09-05 Thread Tony Luck
On Wed, Sep 5, 2012 at 11:47 AM, Andi Kleen  wrote:
> On Wed, Sep 05, 2012 at 04:02:37PM +0530, Naveen N. Rao wrote:
>> All the current mce tunables are now available under
>> /sys/devices/system/machinecheck. Start using this new location, but fall 
>> back
>> to the older per-cpu location so that we continue working with older kernels.
>
> Who did that change in the kernel?
>
> That breaks Linus rule that the kernel should not break userland.
> Kernel needs to fix that.

The change is still under discussion. Stage one is to add the new global
pathnames in addition to keeping the old per-cpu ones. Also fix all utilities
(just mcelog(8) as far as we know) to prefer the new paths.

After some time[1] ... delete the old paths. This is allowable under Linus'
modified edict that you can change ABI "if nobody complains". If we wait
long enough that the new mcelog is widely deployed, then nobody should
complain.

-Tony

[1] several years - not just a kernel release or two.


Re: [PATCH] [mcelog] Start using the new sysfs tunables location

2012-09-05 Thread Andi Kleen
On Wed, Sep 05, 2012 at 04:02:37PM +0530, Naveen N. Rao wrote:
> All the current mce tunables are now available under
> /sys/devices/system/machinecheck. Start using this new location, but fall back
> to the older per-cpu location so that we continue working with older kernels.

Who did that change in the kernel?

That breaks Linus rule that the kernel should not break userland.
Kernel needs to fix that.

-Andi


[PATCH] [mcelog] Start using the new sysfs tunables location

2012-09-05 Thread Naveen N. Rao
All the current mce tunables are now available under
/sys/devices/system/machinecheck. Start using this new location, but fall back
to the older per-cpu location so that we continue working with older kernels.

Signed-off-by: Naveen N. Rao 
---
 README  |2 +-
 mcelog.init |5 -
 tests/test  |7 ++-
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/README b/README
index 08184ed..0426460 100644
--- a/README
+++ b/README
@@ -18,7 +18,7 @@ significantly (upto 10 minutes) and does not allow mcelog to 
keep extended state
 
 trigger is a newer method where the kernel runs mcelog on a error.
 This is configured with 
-echo /usr/sbin/mcelog > /sys/devices/system/machinecheck/machinecheck0/trigger
+echo /usr/sbin/mcelog > /sys/devices/system/machinecheck/trigger
 This is faster, but still doesn't allow mcelog to keep state,
 and has relatively high overhead for each error because a program has
 to be initialized from scratch.
diff --git a/mcelog.init b/mcelog.init
index 0abe786..5f32ba7 100755
--- a/mcelog.init
+++ b/mcelog.init
@@ -31,7 +31,10 @@ MCELOG_OPTIONS=""
 
 # private settings
 MCELOG=${MCELOG:-/usr/sbin/mcelog}
-TRIGGER=/sys/devices/system/machinecheck/machinecheck0/trigger
+TRIGGER=/sys/devices/system/machinecheck/trigger
+if [ ! -e $TRIGGER ] ; then
+   TRIGGER=/sys/devices/system/machinecheck/machinecheck0/trigger
+fi
 [ ! -x $MCELOG ] && ( echo "mcelog not found" ; exit 1 )
 [ ! -r /dev/mcelog ] && ( echo "/dev/mcelog not active" ; exit 0 )
 
diff --git a/tests/test b/tests/test
index c673eb2..52daf01 100755
--- a/tests/test
+++ b/tests/test
@@ -17,10 +17,15 @@ if [ "$(whoami)" != "root" ] ; then
exit 1
 fi
 
+TRIGGER=/sys/devices/system/machinecheck/trigger
+if [ ! -e $TRIGGER ] ; then
+   TRIGGER=/sys/devices/system/machinecheck/machinecheck0/trigger
+fi
+
 echo " running $1 test +++"
 
 # disable trigger
-echo -n "" > /sys/devices/system/machinecheck/machinecheck0/trigger
+echo -n "" > $TRIGGER
 killall mcelog || true
 
 #killwatchdog() { 



[PATCH 1/3] x86/mce: Make sysfs tunables available globally across all cpus

2012-09-05 Thread Naveen N. Rao
All the MCE attributes currently exported via sysfs appear under
/sys/devices/system/machinecheck/machinecheck&lt;N&gt;/. Pretty much all of
these are global in nature and not specific to a processor. So, make these
available under /sys/devices/system/machinecheck/ where they rightly belong.
Update documentation to also point to the new location so that user-space
tools can pick up on the new location. We would eventually want to remove
these from the per-cpu location.

Signed-off-by: Naveen N. Rao 
---
 Documentation/x86/x86_64/machinecheck |4 ++--
 arch/x86/kernel/cpu/mcheck/mce.c  |   24 +++-
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/Documentation/x86/x86_64/machinecheck 
b/Documentation/x86/x86_64/machinecheck
index b1fb302..02b84a6 100644
--- a/Documentation/x86/x86_64/machinecheck
+++ b/Documentation/x86/x86_64/machinecheck
@@ -31,8 +31,8 @@ bankNctl
Note that BIOS maintain another mask to disable specific events
per bank.  This is not visible here
 
-The following entries appear for each CPU, but they are truly shared
-between all CPUs.
+The following entries are shared between all CPUs and appear under
+/sys/devices/system/machinecheck:
 
 check_interval
How often to poll for corrected machine check errors, in seconds
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c311122..bf276eb 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2205,6 +2205,7 @@ static struct dev_ext_attribute dev_attr_cmci_disabled = {
&mce_cmci_disabled
 };
 
+/* Use this _only_ for per-cpu attributes */
 static struct device_attribute *mce_device_attrs[] = {
&dev_attr_tolerant.attr,
&dev_attr_check_interval.attr,
@@ -2216,6 +2217,27 @@ static struct device_attribute *mce_device_attrs[] = {
NULL
 };
 
+/* All new global attributes go here */
+static struct attribute *mce_device_global_attrs[] = {
+   &dev_attr_tolerant.attr.attr,
+   &dev_attr_check_interval.attr.attr,
+   &dev_attr_trigger.attr,
+   &dev_attr_monarch_timeout.attr.attr,
+   &dev_attr_dont_log_ce.attr.attr,
+   &dev_attr_ignore_ce.attr.attr,
+   &dev_attr_cmci_disabled.attr.attr,
+   NULL
+};
+
+static struct attribute_group mce_device_attr_group = {
+   .attrs = mce_device_global_attrs,
+};
+
+static const struct attribute_group *mce_device_attr_groups[] = {
+   &mce_device_attr_group,
+   NULL,
+};
+
 static cpumask_var_t mce_device_initialized;
 
 static void mce_device_release(struct device *dev)
@@ -2397,7 +2419,7 @@ static __init int mcheck_init_device(void)
 
mce_init_banks();
 
-   err = subsys_system_register(&mce_subsys, NULL);
+   err = subsys_system_register(&mce_subsys, mce_device_attr_groups);
if (err)
return err;
 



Re: [PATCH net-next 0/7] sctp: network namespace support Part 2: per net tunables

2012-08-14 Thread David Miller
From: ebied...@xmission.com (Eric W. Biederman)
Date: Tue, 07 Aug 2012 10:17:02 -0700

> 
> Since I am motivated to get things done, and since there has been much
> grumbling about my patches not implementing tunables, I have added
> tunable support on top of my last patchset.
> 
> I have performed basic testing on these patches and nothing
> appears amiss.
> 
> The sm state machine is a major tease as it has all of these association
> and endpoint pointers in the common set of function parameters that turn
> out to be NULL at the most inconvenient times.  So I added to the common
> parameter list a struct net pointer, that is never NULL.

Now that I have the ACKs from Vlad, I'm applying all of your work,
thanks Eric.


Re: [PATCH net-next 0/7] sctp: network namespace support Part 2: per net tunables

2012-08-14 Thread Vlad Yasevich

On 08/07/2012 01:17 PM, Eric W. Biederman wrote:


Since I am motivated to get things done, and since there has been much
grumbling about my patches not implementing tunables, I have added
tunable support on top of my last patchset.

I have performed basic testing on these patches and nothing
appears amiss.

The sm state machine is a major tease as it has all of these association
and endpoint pointers in the common set of function parameters that turn
out to be NULL at the most inconvenient times.  So I added to the common
parameter list a struct net pointer, that is never NULL.

  include/net/netns/sctp.h   |   96 +++-
  include/net/sctp/sctp.h|   16 +-
  include/net/sctp/sm.h  |8 +-
  include/net/sctp/structs.h |  126 +-
  net/sctp/associola.c   |   18 +-
  net/sctp/auth.c|   20 ++-
  net/sctp/bind_addr.c   |6 +-
  net/sctp/endpointola.c |   13 +-
  net/sctp/input.c   |6 +-
  net/sctp/primitive.c   |4 +-
  net/sctp/protocol.c|  137 +-
  net/sctp/sm_make_chunk.c   |   61 +++--
  net/sctp/sm_sideeffect.c   |   26 ++-
  net/sctp/sm_statefuns.c|  631 
  net/sctp/sm_statetable.c   |   17 +-
  net/sctp/socket.c  |   92 ---
  net/sctp/sysctl.c  |  200 --
  net/sctp/transport.c   |   23 +-
  18 files changed, 817 insertions(+), 683 deletions(-)

Eric W. Biederman (7):
   sctp: Add infrastructure for per net sysctls
   sctp: Push struct net down to sctp_chunk_event_lookup
   sctp: Push struct net down into sctp_transport_init
   sctp: Push struct net down into sctp_in_scope
   sctp: Push struct net down into all of the state machine functions
   sctp: Push struct net down into sctp_verify_ext_param
   sctp: Making sysctl tunables per net

Eric




Acked-by: Vlad Yasevich 

To this entire follow-on series.  This is much better.



Re: [PATCH net-next 0/7] sctp: network namespace support Part 2: per net tunables

2012-08-14 Thread Vlad Yasevich

On 08/14/2012 05:14 PM, David Miller wrote:


Come on Vlad, please review this stuff some time this century.  If you
want inclusion to be dependent upon your review, then the onus is on
you to review it in a timely manner.  And you are not doing so here.

I'm not letting Eric's patches rot in patchwork for more than a week,
this is completely unacceptable.




I swear I sent an ACK 2 days ago, but I now see it sitting in my draft 
folder.  My bad.  I'll go now and dust off the ACK...


-vlad


Re: [PATCH net-next 0/7] sctp: network namespace support Part 2: per net tunables

2012-08-14 Thread David Miller

Come on Vlad, please review this stuff some time this century.  If you
want inclusion to be dependent upon your review, then the onus is on
you to review it in a timely manner.  And you are not doing so here.

I'm not letting Eric's patches rot in patchwork for more than a week,
this is completely unacceptable.


Re: [PATCH net-next 0/7] sctp: network namespace support Part 2: per net tunables

2012-08-09 Thread Vlad Yasevich

On 08/09/2012 02:20 AM, David Miller wrote:

From: ebied...@xmission.com (Eric W. Biederman)
Date: Tue, 07 Aug 2012 10:17:02 -0700


Since I am motivated to get things done, and since there has been much
grumbling about my patches not implementing tunables, I have added
tunable support on top of my last patchset.

I have performed basic testing on these patches and nothing
appears amiss.

The sm state machine is a major tease as it has all of these association
and endpoint pointers in the common set of function parameters that turn
out to be NULL at the most inconvenient times.  So I added to the common
parameter list a struct net pointer, that is never NULL.


I like Eric's patch set and I'd like to apply it to net-next.

Vlad?




I like these patches much more as well, but not done reviewing yet. 
I'll try to finish the review tonight.


-vlad


Re: [PATCH net-next 0/7] sctp: network namespace support Part 2: per net tunables

2012-08-08 Thread David Miller
From: ebied...@xmission.com (Eric W. Biederman)
Date: Tue, 07 Aug 2012 10:17:02 -0700

> Since I am motivated to get things done, and since there has been much
> grumbling about my patches not implementing tunables, I have added
> tunable support on top of my last patchset.
> 
> I have performed basic testing on these patches and nothing
> appears amiss.
> 
> The sm state machine is a major tease as it has all of these association
> and endpoint pointers in the common set of function parameters that turn
> out to be NULL at the most inconvenient times.  So I added to the common
> parameter list a struct net pointer, that is never NULL.

I like Eric's patch set and I'd like to apply it to net-next.

Vlad?


[PATCH net-next 7/7] sctp: Make sysctl tunables per net

2012-08-07 Thread Eric W. Biederman

Signed-off-by: "Eric W. Biederman" 
---
 include/net/netns/sctp.h   |   90 +
 include/net/sctp/structs.h |  116 ---
 net/sctp/associola.c   |   10 ++-
 net/sctp/auth.c|   20 -
 net/sctp/bind_addr.c   |2 +-
 net/sctp/endpointola.c |9 +-
 net/sctp/input.c   |2 +-
 net/sctp/protocol.c|  128 +++---
 net/sctp/sm_make_chunk.c   |   47 ++-
 net/sctp/sm_statefuns.c|4 +-
 net/sctp/sm_statetable.c   |6 +-
 net/sctp/socket.c  |   65 +--
 net/sctp/sysctl.c  |  185 ++-
 net/sctp/transport.c   |   15 ++--
 14 files changed, 355 insertions(+), 344 deletions(-)

diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h
index 9576b60..f15a5df 100644
--- a/include/net/netns/sctp.h
+++ b/include/net/netns/sctp.h
@@ -36,6 +36,96 @@ struct netns_sctp {
/* Lock that protects the local_addr_list writers */
spinlock_t local_addr_lock;
 
+   /* RFC2960 Section 14. Suggested SCTP Protocol Parameter Values
+*
+* The following protocol parameters are RECOMMENDED:
+*
+* RTO.Initial  - 3  seconds
+* RTO.Min  - 1  second
+* RTO.Max -  60 seconds
+* RTO.Alpha- 1/8  (3 when converted to right shifts.)
+* RTO.Beta - 1/4  (2 when converted to right shifts.)
+*/
+   unsigned int rto_initial;
+   unsigned int rto_min;
+   unsigned int rto_max;
+
+   /* Note: rto_alpha and rto_beta are really defined as inverse
+* powers of two to facilitate integer operations.
+*/
+   int rto_alpha;
+   int rto_beta;
+
+   /* Max.Burst- 4 */
+   int max_burst;
+
+   /* Whether Cookie Preservative is enabled(1) or not(0) */
+   int cookie_preserve_enable;
+
+   /* Valid.Cookie.Life- 60  seconds  */
+   unsigned int valid_cookie_life;
+
+   /* Delayed SACK timeout  200ms default*/
+   unsigned int sack_timeout;
+
+   /* HB.interval  - 30 seconds  */
+   unsigned int hb_interval;
+
+   /* Association.Max.Retrans  - 10 attempts
+* Path.Max.Retrans - 5  attempts (per destination address)
+* Max.Init.Retransmits - 8  attempts
+*/
+   int max_retrans_association;
+   int max_retrans_path;
+   int max_retrans_init;
+   /* Potentially-Failed.Max.Retrans sysctl value
+* taken from:
+* http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05
+*/
+   int pf_retrans;
+
+   /*
+* Policy for performing sctp/socket accounting
+* 0   - do socket level accounting, all assocs share sk_sndbuf
+* 1   - do sctp accounting, each asoc may use sk_sndbuf bytes
+*/
+   int sndbuf_policy;
+
+   /*
+* Policy for performing sctp/socket accounting
+* 0   - do socket level accounting, all assocs share sk_rcvbuf
+* 1   - do sctp accounting, each asoc may use sk_rcvbuf bytes
+*/
+   int rcvbuf_policy;
+
+   int default_auto_asconf;
+   
+   /* Flag to indicate if addip is enabled. */
+   int addip_enable;
+   int addip_noauth;
+
+   /* Flag to indicate if PR-SCTP is enabled. */
+   int prsctp_enable;
+
+   /* Flag to indicate if SCTP-AUTH is enabled */
+   int auth_enable;
+
+   /*
+* Policy to control SCTP IPv4 address scoping
+* 0   - Disable IPv4 address scoping
+* 1   - Enable IPv4 address scoping
+* 2   - Selectively allow only IPv4 private addresses
+* 3   - Selectively allow only IPv4 link local address
+*/
+   int scope_policy;
+
+   /* Threshold for rwnd update SACKs.  Receive buffer shifted this many
+* bits is an indicator of when to send a window update SACK.
+*/
+   int rwnd_upd_shift;
+
+   /* Threshold for autoclose timeout, in seconds. */
+   unsigned long max_autoclose;
 };
 
 #endif /* __NETNS_SCTP_H__ */
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 18052b4..0fef00f 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -119,69 +119,6 @@ struct sctp_hashbucket {
 
 /* The SCTP globals structure. */
 extern struct sctp_globals {
-   /* RFC2960 Section 14. Suggested SCTP Protocol Parameter Values
-*
-* The following protocol parameters are RECOMMENDED:
-*
-* RTO.Initial  - 3  seconds
-* RTO.Min  - 1  second
-* RTO.Max -  60 seconds
-* RTO.Alpha- 1/8  (3 when converted to right shifts.)
-* RTO.Beta - 1/4  (2 when converted to right shifts.)
-*/
-   unsigned int rto_initial;
-   unsi

Re: [PATCH net-next 0/7] sctp: network namespace support Part 2: per net tunables

2012-08-07 Thread Eric W. Biederman

Since I am motivated to get things done, and since there has been much
grumbling about my patches not implementing tunables, I have added
tunable support on top of my last patchset.

I have performed basic testing on these patches and nothing
appears amiss.

The sm state machine is a major tease as it has all of these association
and endpoint pointers in the common set of function parameters that turn
out to be NULL at the most inconvenient times.  So I added to the common
parameter list a struct net pointer, that is never NULL.

 include/net/netns/sctp.h   |   96 +++-
 include/net/sctp/sctp.h|   16 +-
 include/net/sctp/sm.h  |8 +-
 include/net/sctp/structs.h |  126 +-
 net/sctp/associola.c   |   18 +-
 net/sctp/auth.c|   20 ++-
 net/sctp/bind_addr.c   |6 +-
 net/sctp/endpointola.c |   13 +-
 net/sctp/input.c   |6 +-
 net/sctp/primitive.c   |4 +-
 net/sctp/protocol.c|  137 +-
 net/sctp/sm_make_chunk.c   |   61 +++--
 net/sctp/sm_sideeffect.c   |   26 ++-
 net/sctp/sm_statefuns.c|  631 
 net/sctp/sm_statetable.c   |   17 +-
 net/sctp/socket.c  |   92 ---
 net/sctp/sysctl.c  |  200 --
 net/sctp/transport.c   |   23 +-
 18 files changed, 817 insertions(+), 683 deletions(-)

Eric W. Biederman (7):
  sctp: Add infrastructure for per net sysctls
  sctp: Push struct net down to sctp_chunk_event_lookup
  sctp: Push struct net down into sctp_transport_init
  sctp: Push struct net down into sctp_in_scope
  sctp: Push struct net down into all of the state machine functions
  sctp: Push struct net down into sctp_verify_ext_param
  sctp: Making sysctl tunables per net

Eric




Re: [PATCH 05/10] slab: allow enable_cpu_cache to use preset values for its tunables

2012-07-26 Thread Glauber Costa
On 07/25/2012 10:33 PM, Christoph Lameter wrote:
> On Wed, 25 Jul 2012, Glauber Costa wrote:
> 
>> It certainly does not use the same method as SLAB, right?
>> Writing to /proc/slabinfo gives me an I/O error
>> I assume it is something through sysfs, but skimming through the code
>> now, I can't find any per-cache tunables. Would you mind pointing me to
>> them?
> 
> The slab attributes in /sys/kernel/slab/&lt;cache&gt;/ can be modified
> for some values. I think that could be the default method for the future
> since it allows easy addition of new tunables as needed.
> 

Christoph, would the following PoC patch be enough?


From 7c582c5c6321cbde93c5e73c6c2096b4432a2a04 Mon Sep 17 00:00:00 2001
From: Glauber Costa 
Date: Thu, 26 Jul 2012 15:19:08 +0400
Subject: [PATCH] slub propagation

---
 mm/slub.c |   16 
 1 file changed, 16 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 55946c3..a136a75 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5130,6 +5130,10 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 	struct slab_attribute *attribute;
 	struct kmem_cache *s;
 	int err;
+#ifdef CONFIG_MEMCG_KMEM
+	struct kmem_cache *c;
+	struct mem_cgroup_cache_params *p;
+#endif
 
 	attribute = to_slab_attr(attr);
 	s = to_slab(kobj);
@@ -5138,7 +5142,19 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 		return -EIO;
 
 	err = attribute->store(s, buf, len);
+#ifdef CONFIG_MEMCG_KMEM
+	if (slab_state < FULL)
+		return err;
 
+	if ((err < 0) || (s->memcg_params.id == -1))
+		return err;
+
+	list_for_each_entry(p, &s->memcg_params.sibling_list, sibling_list) {
+		c = container_of(p, struct kmem_cache, memcg_params);
+		/* return value determined by the parent cache only */
+		attribute->store(c, buf, len);
+	}
+#endif
 	return err;
 }
 
-- 
1.7.10.4



Re: [PATCH 05/10] slab: allow enable_cpu_cache to use preset values for its tunables

2012-07-25 Thread Christoph Lameter
On Wed, 25 Jul 2012, Glauber Costa wrote:

> It certainly does not use the same method as SLAB, right?
> Writing to /proc/slabinfo gives me an I/O error
> I assume it is something through sysfs, but skimming through the code
> now, I can't find any per-cache tunables. Would you mind pointing me to
> them?

The slab attributes in /sys/kernel/slab/&lt;cache&gt;/ can be modified
for some values. I think that could be the default method for the future
since it allows easy addition of new tunables as needed.



Re: [PATCH 05/10] slab: allow enable_cpu_cache to use preset values for its tunables

2012-07-25 Thread Glauber Costa
On 07/25/2012 09:05 PM, Christoph Lameter wrote:
> On Wed, 25 Jul 2012, Glauber Costa wrote:
> 
>> SLAB allows us to tune a particular cache behavior with tunables.
>> When creating a new memcg cache copy, we'd like to preserve any tunables
>> the parent cache already had.
> 
> So does SLUB but I do not see a patch for that allocator.
> 
It certainly doesn't use the same method as SLAB, right?
Writing to /proc/slabinfo gives me an I/O error.
I assume it is something through sysfs, but skimming through the code
now, I can't find any per-cache tunables. Would you mind pointing me to
them?

In any case, are you happy with the SLAB one, and how they are propagated?



Re: [PATCH 05/10] slab: allow enable_cpu_cache to use preset values for its tunables

2012-07-25 Thread Christoph Lameter
On Wed, 25 Jul 2012, Glauber Costa wrote:

> SLAB allows us to tune a particular cache behavior with tunables.
> When creating a new memcg cache copy, we'd like to preserve any tunables
> the parent cache already had.

So does SLUB but I do not see a patch for that allocator.


[PATCH 05/10] slab: allow enable_cpu_cache to use preset values for its tunables

2012-07-25 Thread Glauber Costa
SLAB allows us to tune a particular cache behavior with tunables.
When creating a new memcg cache copy, we'd like to preserve any tunables
the parent cache already had.

This could be done by an explicit call to do_tune_cpucache() after the
cache is created. But this is not very convenient now that the caches are
created from common code, since this function is SLAB-specific.

Another method of doing that is taking advantage of the fact that
do_tune_cpucache() is always called from enable_cpucache(), which is
called at cache initialization. We can just preset the values, and
then things work as expected.
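The preset-or-compute flow described above can be sketched as a small userspace C fragment (hypothetical field names and a crude size heuristic; SLAB's real defaults depend on object size and more): if the parent cache already carries all three tunables, they are reused verbatim and the size-based setup is skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for kmem_cache: just the three tunables and an
 * optional parent cache (hypothetical field names for illustration). */
struct cache {
    size_t object_size;
    int limit, shared, batchcount;
    const struct cache *parent;
};

/* Mirror of the patched enable_cpucache(): if the parent already has
 * all three tunables set, reuse them and skip the size-based setup;
 * otherwise derive them from the object size. */
static void enable_cache(struct cache *c)
{
    int limit = 0, shared = 0, batchcount = 0;

    if (c->parent) {
        limit = c->parent->limit;
        shared = c->parent->shared;
        batchcount = c->parent->batchcount;
    }
    if (!(limit && shared && batchcount)) {
        /* crude stand-in for SLAB's size-based defaults */
        limit = c->object_size > 1024 ? 24 : 120;
        shared = 8;
        batchcount = (limit + 1) / 2;
    }
    c->limit = limit;
    c->shared = shared;
    c->batchcount = batchcount;
}
```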

Signed-off-by: Glauber Costa 
CC: Christoph Lameter 
CC: Pekka Enberg 
CC: Michal Hocko 
CC: Kamezawa Hiroyuki 
CC: Johannes Weiner 
CC: Suleiman Souhlal 
---
 include/linux/slab.h |3 ++-
 mm/memcontrol.c  |2 +-
 mm/slab.c|   19 ---
 mm/slab_common.c |7 ---
 4 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9d3fd56..249a0d3 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -128,7 +128,7 @@ struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
void (*)(void *));
 struct kmem_cache *
 kmem_cache_create_memcg(struct mem_cgroup *, const char *, size_t, size_t,
-   unsigned long, void (*)(void *));
+   unsigned long, void (*)(void *), struct kmem_cache *);
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 void kmem_cache_free(struct kmem_cache *, void *);
@@ -184,6 +184,7 @@ unsigned int kmem_cache_size(struct kmem_cache *);
 #ifdef CONFIG_MEMCG_KMEM
 struct mem_cgroup_cache_params {
struct mem_cgroup *memcg;
+   struct kmem_cache *parent;
int id;
 };
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 854f6cc..b933474 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -585,7 +585,7 @@ static struct kmem_cache *kmem_cache_dup(struct mem_cgroup *memcg,
return NULL;
 
new = kmem_cache_create_memcg(memcg, name, s->object_size, s->align,
- (s->flags & ~SLAB_PANIC), s->ctor);
+ (s->flags & ~SLAB_PANIC), s->ctor, s);
 
kfree(name);
return new;
diff --git a/mm/slab.c b/mm/slab.c
index 7e7ec59..76bc98f 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3916,8 +3916,19 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
 static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
 {
int err;
-   int limit, shared;
-
+   int limit = 0;
+   int shared = 0;
+   int batchcount = 0;
+
+#ifdef CONFIG_MEMCG_KMEM
+   if (cachep->memcg_params.parent) {
+       limit = cachep->memcg_params.parent->limit;
+       shared = cachep->memcg_params.parent->shared;
+       batchcount = cachep->memcg_params.parent->batchcount;
+   }
+#endif
+   if (limit && shared && batchcount)
+   goto skip_setup;
/*
 * The head array serves three purposes:
 * - create a LIFO ordering, i.e. return objects that are cache-warm
@@ -3959,7 +3970,9 @@ static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
if (limit > 32)
limit = 32;
 #endif
-   err = do_tune_cpucache(cachep, limit, (limit + 1) / 2, shared, gfp);
+   batchcount = (limit + 1) / 2;
+skip_setup:
+   err = do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
if (err)
printk(KERN_ERR "enable_cpucache failed for %s, error %d.\n",
   cachep->name, -err);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 1080ef2..562146b 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -80,7 +80,8 @@ unsigned long calculate_alignment(unsigned long flags,
 
 struct kmem_cache *
kmem_cache_create_memcg(struct mem_cgroup *memcg, const char *name, size_t size,
-   size_t align, unsigned long flags, void (*ctor)(void *))
+   size_t align, unsigned long flags, void (*ctor)(void *),
+   struct kmem_cache *parent_cache)
 {
struct kmem_cache *s = NULL;
char *n;
@@ -149,9 +150,9 @@ kmem_cache_create_memcg(struct mem_cgroup *memcg, const char *name, size_t size,
s->ctor = ctor;
s->flags = flags;
s->align = calculate_alignment(flags, align, size);
-
 #ifdef CONFIG_MEMCG_KMEM
s->memcg_params.memcg = memcg;
+   s->memcg_params.parent = parent_cache;
 #endif
 
r = __kmem_cache_create(s);
@@ -184,7 +185,7 @@ out:
struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align,
unsigned long flags, void (*ctor)(void *))
 {
-   return kmem_cache_create_memcg(NULL, nam

[PATCH 5/6] AKT - per namespace tunables

2007-01-30 Thread Nadia . Derbey
[PATCH 05/06]


This patch introduces all that is needed to process per namespace tunables.


Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>


---
 include/linux/akt.h   |   12 ++
 kernel/autotune/akt.c |   94 +-
 2 files changed, 83 insertions(+), 23 deletions(-)

Index: linux-2.6.20-rc4/include/linux/akt.h
===
--- linux-2.6.20-rc4.orig/include/linux/akt.h   2007-01-29 15:47:40.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt.h2007-01-29 15:57:59.0 +0100
@@ -126,6 +126,7 @@ struct auto_tune {
  */
 #define AUTO_TUNE_ENABLE  0x01
 #define TUNABLE_REGISTERED  0x02
+#define TUNABLE_IPC_NS  0x04
 
 
 /*
@@ -171,6 +172,8 @@ static inline int is_tunable_registered(
}
 
 
+#define DECLARE_TUNABLE(s) struct auto_tune s;
+
 #define DEFINE_TUNABLE(s, thr, min, max, tun, chk, type)   \
struct auto_tune s = TUNABLE_INIT(#s, thr, min, max, tun, chk, type)
 
@@ -182,6 +185,13 @@ static inline int is_tunable_registered(
(s).max.abs_value = _max;   \
} while (0)
 
+#define init_tunable_ipcns(ns, s, thr, min, max, tun, chk, type)   \
+   do {\
+   DEFINE_TUNABLE(s, thr, min, max, tun, chk, type);   \
+   s.flags |= TUNABLE_IPC_NS;  \
+   ns->s = s;  \
+   } while (0)
+
 
 static inline void set_autotuning_routine(struct auto_tune *tunable,
auto_tune_fn fn)
@@ -236,7 +246,9 @@ extern ssize_t store_tunable_max(struct 
 #else  /* CONFIG_AKT */
 
 
+#define DECLARE_TUNABLE(s)
 #define DEFINE_TUNABLE(s, thresh, min, max, tun, chk, type)
+#define init_tunable_ipcns(ns, s, th, m, M, tun, chk, type)  do { } while (0)
 #define set_tunable_min_max(s, min, max) do { } while (0)
 #define set_autotuning_routine(s, fn)do { } while (0)
 
Index: linux-2.6.20-rc4/kernel/autotune/akt.c
===
--- linux-2.6.20-rc4.orig/kernel/autotune/akt.c 2007-01-29 15:50:31.0 +0100
+++ linux-2.6.20-rc4/kernel/autotune/akt.c  2007-01-29 16:02:10.0 +0100
@@ -32,6 +32,7 @@
  *  store_tunable_min  (exported)
  *  show_tunable_max   (exported)
  *  store_tunable_max  (exported)
+ *  get_ns_tunable (static)
  */
 
 #include 
@@ -43,6 +44,10 @@
 #define AKT_AUTO   1
 #define AKT_MANUAL 0
 
+static struct auto_tune *get_ns_tunable(struct auto_tune *);
+
+
+
 /**
  * register_tunable - Inserts a tunable structure into sysfs
  * @tun:   tunable structure to be registered
@@ -149,17 +154,20 @@ EXPORT_SYMBOL_GPL(unregister_tunable);
 ssize_t show_tuning_mode(struct auto_tune *tun_addr, char *buf)
 {
int valid;
+   struct auto_tune *which;
 
if (tun_addr == NULL) {
printk(KERN_ERR "AKT: tunable address is invalid\n");
return -EINVAL;
}
 
-   spin_lock(&tun_addr->tunable_lck);
+   which = get_ns_tunable(tun_addr);
 
-   valid = is_auto_tune_enabled(tun_addr);
+   spin_lock(&which->tunable_lck);
 
-   spin_unlock(&tun_addr->tunable_lck);
+   valid = is_auto_tune_enabled(which);
+
+   spin_unlock(&which->tunable_lck);
 
return snprintf(buf, PAGE_SIZE, "%d\n", valid);
 }
@@ -183,6 +191,7 @@ ssize_t store_tuning_mode(struct auto_tu
size_t count)
 {
int new_value;
+   struct auto_tune *which;
 
if (sscanf(buffer, "%d", &new_value) != 1)
return -EINVAL;
@@ -195,18 +204,20 @@ ssize_t store_tuning_mode(struct auto_tu
return -EINVAL;
}
 
-   spin_lock(&tun_addr->tunable_lck);
+   which = get_ns_tunable(tun_addr);
+
+   spin_lock(&which->tunable_lck);
 
switch (new_value) {
case AKT_AUTO:
-   tun_addr->flags |= AUTO_TUNE_ENABLE;
+   which->flags |= AUTO_TUNE_ENABLE;
break;
case AKT_MANUAL:
-   tun_addr->flags &= ~AUTO_TUNE_ENABLE;
+   which->flags &= ~AUTO_TUNE_ENABLE;
break;
}
 
-   spin_unlock(&tun_addr->tunable_lck);
+   spin_unlock(&which->tunable_lck);
 
return strnlen(buffer, PAGE_SIZE);
 }
@@ -227,17 +238,20 @@ ssize_t store_tuning_mode(struct auto_tu
 ssize_t show_tunable_min(struct auto_tune *tun_addr, char *buf)
 {
ssize_t rc;
+   struct auto_tune *which;
 
if (tun_addr == NULL) {
printk(KERN_ERR "AKT: tunable address is invalid\n");
return -EINVAL;
   

[PATCH 3/6] AKT - tunables associated kobjects

2007-01-30 Thread Nadia . Derbey
e structure into sysfs
  * @tun:   tunable structure to be registered
@@ -50,6 +55,8 @@
  */
 int register_tunable(struct auto_tune *tun)
 {
+   int rc = 0;
+
if (tun == NULL) {
printk(KERN_ERR
"AKT: Bad tunable structure pointer (NULL)\n");
@@ -80,7 +87,10 @@ int register_tunable(struct auto_tune *t
return -EINVAL;
}
 
-   return 0;
+   if (!(rc = tunable_sysfs_setup(tun)))
+   tun->flags |= TUNABLE_REGISTERED;
+
+   return rc;
 }
 
 EXPORT_SYMBOL_GPL(register_tunable);
@@ -117,3 +127,82 @@ int unregister_tunable(struct auto_tune 
 }
 
 EXPORT_SYMBOL_GPL(unregister_tunable);
+
+
+/**
+ * show_tuning_mode - Outputs the tuning mode of a given tunable
+ * @tun_addr:  registered tunable structure to check
+ * @buf:   output buffer
+ *
+ * This is the get operation called by tunable_attr_show (i.e. when the file
+ * /sys/tunables/<tunable>/autotune is displayed).
+ * Outputs "1" if the corresponding tunable is automatically adjustable,
+ * "0" else
+ *
+ * Returns:>0 - output string length (including the '\0')
+ * <0 - failure
+ */
+ssize_t show_tuning_mode(struct auto_tune *tun_addr, char *buf)
+{
+   int valid;
+
+   if (tun_addr == NULL) {
+   printk(KERN_ERR "AKT: tunable address is invalid\n");
+   return -EINVAL;
+   }
+
+   spin_lock(&tun_addr->tunable_lck);
+
+   valid = is_auto_tune_enabled(tun_addr);
+
+   spin_unlock(&tun_addr->tunable_lck);
+
+   return snprintf(buf, PAGE_SIZE, "%d\n", valid);
+}
+
+
+/**
+ * store_tuning_mode - Sets the tuning mode of a given tunable
+ * @tun_addr:  registered tunable structure to set
+ * @buf:   input buffer
+ * @count: input buffer length (including the '\0')
+ *
+ * This is the set operation called by tunable_attr_store (i.e. when a string
+ * is stored into /sys/tunables/<tunable>/autotune).
+ * "1" makes the corresponding tunable automatically adjustable
+ * "0" makes the corresponding tunable manually adjustable
+ *
+ * Returns:>0 - number of characters used from the input buffer
+ * <0 - failure
+ */
+ssize_t store_tuning_mode(struct auto_tune *tun_addr, const char *buffer,
+   size_t count)
+{
+   int new_value;
+
+   if (sscanf(buffer, "%d", &new_value) != 1)
+   return -EINVAL;
+
+   if (new_value != AKT_AUTO && new_value != AKT_MANUAL)
+   return -EINVAL;
+
+   if (tun_addr == NULL) {
+   printk(KERN_ERR "AKT: NULL pointer  passed in\n");
+   return -EINVAL;
+   }
+
+   spin_lock(&tun_addr->tunable_lck);
+
+   switch (new_value) {
+   case AKT_AUTO:
+   tun_addr->flags |= AUTO_TUNE_ENABLE;
+   break;
+   case AKT_MANUAL:
+   tun_addr->flags &= ~AUTO_TUNE_ENABLE;
+   break;
+   }
+
+   spin_unlock(&tun_addr->tunable_lck);
+
+   return strnlen(buffer, PAGE_SIZE);
+}
Index: linux-2.6.20-rc4/kernel/autotune/akt_sysfs.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.20-rc4/kernel/autotune/akt_sysfs.c2007-01-29 15:39:05.0 +0100
@@ -0,0 +1,214 @@
+/*
+ * linux/kernel/autotune/akt_sysfs.c
+ *
+ * Automatic Kernel Tunables for Linux
+ * sysfs bindings for AKT
+ *
+ * Copyright (C) 2006 Bull S.A.S
+ *
+ * Author: Nadia Derbey <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+/*
+ * FUNCTIONS:
+ *tunable_attr_show  (static)
+ *tunable_attr_store (static)
+ *tunable_sysfs_setup
+ *add_tunable_attrs  (static)
+ *init_auto_tuning
+ */
+
+
+#include 
+#include 
+#include 
+#include 
+
+
+
+
+struct tunable_attribute {
+   struct attribute attr;
+   ssize_t (*show)(struct auto_tune *, char *);
+   ssize_t (*store)(struct auto_tune *, const char *, size_t);
+};
+
+#define TUNABLE_ATTR(_name, _mode, _show, _store)  \
+struct tunable_attribute tun_attr_##_name = __ATTR(

[PATCH 0/6] AKT - Automatic Kernel Tunables

2007-01-30 Thread Nadia . Derbey
Re-sending the series of patches for the automatic kernel tunables feature:
I have done some fixes following the remarks sent back by Andrew and Randy.

1) All the type independent macros have been removed, except for the automatic
tuning routine: it manages pointers to the tunable and to the value to be
checked against that tunable, so it should remain type independent IMHO.
Now, I only left the auto-tuning routines for types int and size_t since these
are the types of the tunables the framework is applied to.
It will be easy to add the other types as needed in the future.
This makes the code much lighter.

2) CONFIG_AKT has been moved from the FS menu to the "general setup" one.

+ all the other minor changes.


--- Reminder

This is a series of patches that introduces a feature that makes the kernel
automatically adjust tunable values as it sees resources running out.

The AKT framework is made of 2 parts:

1) Kernel part:
Interfaces are provided to the kernel subsystems, to (un)register the
tunables that might be automatically tuned in the future.

Registering a tunable consists of the following steps:
- a structure is declared and filled by the kernel subsystem for the
registered tunable
- that tunable structure is registered into sysfs

Registration should be done during the kernel subsystem initialization step.


Another interface is provided to the kernel subsystems, to activate the
automatic tuning for a registered tunable. It can be called during resource
allocation to tune up, and during resource freeing to tune down the registered
tunable. The automatic tuning routine is called only if the tunable has
been enabled for automatic tuning in sysfs.

2) User part:

AKT uses sysfs to expose tunables management to user space (mainly
switching them between automatic and manual mode).

akt uses sysfs in the following way:
- a tunables subsystem (tunables_subsys) is declared and registered during akt
initialization.
- registering a tunable is equivalent to registering the corresponding kobject
within that subsystem.
- each tunable kobject has 3 associated attributes, all with a RW mode (i.e.
the show() and store() methods are provided for them):
. autotune: activates or deactivates automatic tuning for the tunable
. max: sets a new maximum value for the tunable
. min: sets a new minimum value for the tunable

The only way to activate automatic tuning is from user side:
- the directory /sys/tunables is created during the init phase.
- each time a tunable is registered by a kernel subsystem, a directory is
created for it under /sys/tunables.
- This directory contains 1 file for each tunable kobject attribute
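The autotune attribute described above boils down to flipping one flag bit under a lock; here is a minimal userspace sketch of the show/store pair (locking omitted, names mirror the patches):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define AUTO_TUNE_ENABLE 0x01

/* Minimal stand-in for struct auto_tune: only the flags word. */
struct tunable { unsigned int flags; };

/* Mirror of store_tuning_mode(): parse "0"/"1" and set or clear the
 * enable bit; anything else is rejected. */
static int store_autotune(struct tunable *t, const char *buf)
{
    int v;

    if (sscanf(buf, "%d", &v) != 1 || (v != 0 && v != 1))
        return -1;              /* -EINVAL in the kernel */
    if (v)
        t->flags |= AUTO_TUNE_ENABLE;
    else
        t->flags &= ~AUTO_TUNE_ENABLE;
    return 0;
}

/* Mirror of show_tuning_mode(): report the bit as "0" or "1". */
static int show_autotune(const struct tunable *t, char *buf, size_t len)
{
    return snprintf(buf, len, "%d\n", !!(t->flags & AUTO_TUNE_ENABLE));
}
```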



These patches should be applied to 2.6.20-rc4, in the following order:

[PATCH 1/6]: tunables_registration.patch
[PATCH 2/6]: auto_tuning_activation.patch
[PATCH 3/6]: auto_tuning_kobjects.patch
[PATCH 4/6]: tunable_min_max_kobjects.patch
[PATCH 5/6]: per_namespace_tunables.patch
[PATCH 6/6]: auto_tune_applied.patch




Re: [RFC][PATCH 5/6] per namespace tunables

2007-01-24 Thread Randy Dunlap
On Tue, 16 Jan 2007 07:15:21 +0100 [EMAIL PROTECTED] wrote:

> [PATCH 05/06]
> 
> This patch introduces all that is needed to process per namespace tunables.
> 
> ---
>  include/linux/akt.h   |   12 +++
>  kernel/autotune/akt.c |   80 
> ++
>  2 files changed, 73 insertions(+), 19 deletions(-)
> 
> +/*
> + * FUNCTION:This routine gets the actual auto_tune structure for the
> + *  tunables that are per namespace (presently only ipc ones).
> + *
> + * RETURN VALUE: pointer to the tunable structure for the current namespace
> + */

Please use kernel-doc format for function comment blocks.
(see Documentation/kernel-doc-nano-HOWTO.txt)

> +static struct auto_tune *get_ns_tunable(struct auto_tune *p)
> +{
> + if (p->flags & TUNABLE_IPC_NS) {
> + char *shift = (char *) p;
> + struct ipc_namespace *ns = current->nsproxy->ipc_ns;
> +
> + shift = (shift - (char *) &init_ipc_ns) + (char *) ns;
> +
> + return (struct auto_tune *) shift;
> + }
> +
> + return p;
> +}
> +
> +
>  EXPORT_SYMBOL_GPL(register_tunable);
>  EXPORT_SYMBOL_GPL(unregister_tunable);

and put EXPORT_SYMBOL/_GPL() immediately after each function
that is being exported.

---
~Randy
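For readers puzzled by the pointer arithmetic in the quoted get_ns_tunable(), the same offset trick can be demonstrated in userspace C (hypothetical namespace layout): a pointer into a global template structure is relocated into the current namespace's copy by preserving its byte offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature: a namespace embedding two tunables, plus a
 * global template instance (stand-ins for ipc_namespace/init_ipc_ns). */
struct tunable { int value; };
struct ns {
    struct tunable msg_max;
    struct tunable sem_max;
};

static struct ns init_ns;   /* the template the offsets are taken from */

/* Same trick as get_ns_tunable(): the caller passes a pointer into the
 * template; its byte offset within the template is reapplied to the
 * current namespace's base address to find the per-namespace copy. */
static struct tunable *ns_tunable(struct ns *cur, struct tunable *p)
{
    ptrdiff_t off = (char *)p - (char *)&init_ns;

    return (struct tunable *)((char *)cur + off);
}
```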


[RFC][PATCH 5/6] per namespace tunables

2007-01-15 Thread Nadia . Derbey
[PATCH 05/06]


This patch introduces all that is needed to process per namespace tunables.


Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>


---
 include/linux/akt.h   |   12 +++
 kernel/autotune/akt.c |   80 ++
 2 files changed, 73 insertions(+), 19 deletions(-)

Index: linux-2.6.20-rc4/include/linux/akt.h
===
--- linux-2.6.20-rc4.orig/include/linux/akt.h   2007-01-15 15:21:47.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt.h2007-01-15 15:31:44.0 +0100
@@ -154,6 +154,7 @@ struct auto_tune {
  */
 #define AUTO_TUNE_ENABLE  0x01
 #define TUNABLE_REGISTERED  0x02
+#define TUNABLE_IPC_NS  0x04
 
 
 /*
@@ -204,6 +205,8 @@ static inline int is_tunable_registered(
}
 
 
+#define DECLARE_TUNABLE(s) struct auto_tune s;
+
 #define DEFINE_TUNABLE(s, thr, min, max, tun, chk, type)   \
struct auto_tune s = TUNABLE_INIT(#s, thr, min, max, tun, chk, type)
 
@@ -215,6 +218,13 @@ static inline int is_tunable_registered(
(s).max.abs_value.val_##type = _max;\
} while (0)
 
+#define init_tunable_ipcns(ns, s, thr, min, max, tun, chk, type)   \
+   do {\
+   DEFINE_TUNABLE(s, thr, min, max, tun, chk, type);   \
+   s.flags |= TUNABLE_IPC_NS;  \
+   ns->s = s;  \
+   } while (0)
+
 
 static inline void set_autotuning_routine(struct auto_tune *tunable,
auto_tune_fn fn)
@@ -269,7 +279,9 @@ extern ssize_t store_tunable_max(struct 
 #else  /* CONFIG_AKT */
 
 
+#define DECLARE_TUNABLE(s)
 #define DEFINE_TUNABLE(s, thresh, min, max, tun, chk, type)
+#define init_tunable_ipcns(ns, s, th, m, M, tun, chk, type)  do { } while (0)
 #define set_tunable_min_max(s, min, max, type)   do { } while (0)
 #define set_autotuning_routine(s, fn)do { } while (0)
 
Index: linux-2.6.20-rc4/kernel/autotune/akt.c
===
--- linux-2.6.20-rc4.orig/kernel/autotune/akt.c 2007-01-15 15:25:35.0 +0100
+++ linux-2.6.20-rc4/kernel/autotune/akt.c  2007-01-15 15:37:16.0 +0100
@@ -32,6 +32,7 @@
  *  store_tunable_min  (exported)
  *  show_tunable_max   (exported)
  *  store_tunable_max  (exported)
+ *  get_ns_tunable (static)
  */
 
 #include 
@@ -45,6 +46,8 @@
 #define AKT_AUTO   1
 #define AKT_MANUAL 0
 
+static struct auto_tune *get_ns_tunable(struct auto_tune *);
+
 
 
 /*
@@ -142,6 +145,7 @@ int unregister_tunable(struct auto_tune 
 ssize_t show_tuning_mode(struct auto_tune *tun_addr, char *buf)
 {
int valid;
+   struct auto_tune *which;
 
if (tun_addr == NULL) {
printk(KERN_ERR
@@ -149,11 +153,13 @@ ssize_t show_tuning_mode(struct auto_tun
return -EINVAL;
}
 
-   spin_lock(&tun_addr->tunable_lck);
+   which = get_ns_tunable(tun_addr);
+
+   spin_lock(&which->tunable_lck);
 
-   valid = is_auto_tune_enabled(tun_addr);
+   valid = is_auto_tune_enabled(which);
 
-   spin_unlock(&tun_addr->tunable_lck);
+   spin_unlock(&which->tunable_lck);
 
return snprintf(buf, PAGE_SIZE, "%d\n", valid);
 }
@@ -176,6 +182,7 @@ ssize_t store_tuning_mode(struct auto_tu
size_t count)
 {
int new_value;
+   struct auto_tune *which;
int rc;
 
if ((rc = sscanf(buffer, "%d", &new_value)) != 1)
@@ -190,18 +197,20 @@ ssize_t store_tuning_mode(struct auto_tu
return -EINVAL;
}
 
-   spin_lock(&tun_addr->tunable_lck);
+   which = get_ns_tunable(tun_addr);
+
+   spin_lock(&which->tunable_lck);
 
switch (new_value) {
case AKT_AUTO:
-   tun_addr->flags |= AUTO_TUNE_ENABLE;
+   which->flags |= AUTO_TUNE_ENABLE;
break;
case AKT_MANUAL:
-   tun_addr->flags &= ~AUTO_TUNE_ENABLE;
+   which->flags &= ~AUTO_TUNE_ENABLE;
break;
}
 
-   spin_unlock(&tun_addr->tunable_lck);
+   spin_unlock(&which->tunable_lck);
 
return strnlen(buffer, PAGE_SIZE);
 }
@@ -218,6 +227,7 @@ ssize_t store_tuning_mode(struct auto_tu
 ssize_t show_tunable_min(struct auto_tune *tun_addr, char *buf)
 {
ssize_t rc;
+   struct auto_tune *which;
 
if (tun_addr == NULL) {
printk(KERN_ERR
@@ -225,11 +235,13 @@ ssize_t show_tunable_min(struct auto_tun
return -EINVAL;
}
 
-   spin_lock(&tun_addr->tunable_lck);
+   which = get_ns_tunable(tun_ad

[RFC][PATCH 3/6] tunables associated kobjects

2007-01-15 Thread Nadia . Derbey
 1
+#define AKT_MANUAL 0
 
 
 
@@ -54,6 +58,8 @@
  */
 int register_tunable(struct auto_tune *tun)
 {
+   int rc = 0;
+
if (tun == NULL) {
printk(KERN_ERR "\tBad tunable structure pointer (NULL)\n");
return -EINVAL;
@@ -84,7 +90,10 @@ int register_tunable(struct auto_tune *t
return -EINVAL;
}
 
-   return 0;
+   if (!(rc = tunable_sysfs_setup(tun)))
+   tun->flags |= TUNABLE_REGISTERED;
+
+   return rc;
 }
 
 
@@ -117,6 +126,81 @@ int unregister_tunable(struct auto_tune 
 }
 
 
+/*
+ * FUNCTION:Get operation called by tunable_attr_show (i.e. when the file
+ *  /sys/tunables/<tunable>/autotune is displayed).
+ *  Outputs "1" if the corresponding tunable is automatically
+ *  adjustable, "0" else
+ *
+ * RETURN VALUE: >0 : output string length (including the '\0')
+ *   <0 : failure
+ */
+ssize_t show_tuning_mode(struct auto_tune *tun_addr, char *buf)
+{
+   int valid;
+
+   if (tun_addr == NULL) {
+   printk(KERN_ERR
+   " show_tuning_mode(): tunable address is invalid\n");
+   return -EINVAL;
+   }
+
+   spin_lock(&tun_addr->tunable_lck);
+
+   valid = is_auto_tune_enabled(tun_addr);
+
+   spin_unlock(&tun_addr->tunable_lck);
+
+   return snprintf(buf, PAGE_SIZE, "%d\n", valid);
+}
+
+
+/*
+ * NAME:store_tuning_mode
+ *
+ * FUNCTION:Set operation called by tunable_attr_store (i.e. when a
+ *  string is stored into /sys/tunables/<tunable>/autotune).
+ *  "1" makes the corresponding tunable automatically adjustable
+ *  "0" makes the corresponding tunable manually adjustable
+ *
+ * PARAMETERS: count: input buffer size (including the '\0')
+ *
+ * RETURN VALUE: >0: number of characters used from the input buffer
+ *   <= 0: failure
+ */
+ssize_t store_tuning_mode(struct auto_tune *tun_addr, const char *buffer,
+   size_t count)
+{
+   int new_value;
+   int rc;
+
+   if ((rc = sscanf(buffer, "%d", &new_value)) != 1)
+   return -EINVAL;
+
+   if (new_value != AKT_AUTO && new_value != AKT_MANUAL)
+   return -EINVAL;
+
+   if (tun_addr == NULL) {
+   printk(KERN_ERR
+   " store_tuning_mode(): NULL pointer  passed in\n");
+   return -EINVAL;
+   }
+
+   spin_lock(&tun_addr->tunable_lck);
+
+   switch (new_value) {
+   case AKT_AUTO:
+   tun_addr->flags |= AUTO_TUNE_ENABLE;
+   break;
+   case AKT_MANUAL:
+   tun_addr->flags &= ~AUTO_TUNE_ENABLE;
+   break;
+   }
+
+   spin_unlock(&tun_addr->tunable_lck);
+
+   return strnlen(buffer, PAGE_SIZE);
+}
 
 
 EXPORT_SYMBOL_GPL(register_tunable);
Index: linux-2.6.20-rc4/kernel/autotune/akt_sysfs.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.20-rc4/kernel/autotune/akt_sysfs.c2007-01-15 15:14:55.0 +0100
@@ -0,0 +1,214 @@
+/*
+ * linux/kernel/autotune/akt_sysfs.c
+ *
+ * Automatic Kernel Tunables for Linux
+ * sysfs bindings for AKT
+ *
+ * Copyright (C) 2006 Bull S.A.S
+ *
+ * Author: Nadia Derbey <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+/*
+ * FUNCTIONS:
+ *tunable_attr_show  (static)
+ *tunable_attr_store (static)
+ *tunable_sysfs_setup
+ *add_tunable_attrs  (static)
+ *init_auto_tuning
+ */
+
+
+#include 
+#include 
+#include 
+#include 
+
+
+
+
+struct tunable_attribute {
+   struct attribute attr;
+   ssize_t (*show)(struct auto_tune *, char *);
+   ssize_t (*store)(struct auto_tune *, const char *, size_t);
+};
+
+#define TUNABLE_ATTR(_name, _mode, _show, _store)  \
+struct tunable_attribute tun_attr_##_name = __ATTR(_name, _mode, _show, _store)
+
+
+static TUNABLE_ATTR(autotune, S_IWUSR | S_IRUGO, show_tuning_mode,
+   store_tuning_mode);
+
+stat

[RFC][PATCH 0/6] Automatic kernel tunables (AKT)

2007-01-15 Thread Nadia . Derbey
This is a series of patches that introduces a feature that makes the kernel
automatically adjust tunable values as it sees resources running out.

The AKT framework is made of 2 parts:

1) Kernel part:
Interfaces are provided to the kernel subsystems, to (un)register the
tunables that might be automatically tuned in the future.

Registering a tunable consists of the following steps:
- a structure is declared and filled by the kernel subsystem for the
registered tunable
- that tunable structure is registered into sysfs

Registration should be done during the kernel subsystem initialization step.


Another interface is provided to the kernel subsystems, to activate the
automatic tuning for a registered tunable. It can be called during resource
allocation to tune up, and during resource freeing to tune down the registered
tunable. The automatic tuning routine is called only if the tunable has
been enabled for automatic tuning in sysfs.

2) User part:

AKT uses sysfs to expose tunables management to user space (mainly
switching them between automatic and manual mode).

akt uses sysfs in the following way:
- a tunables subsystem (tunables_subsys) is declared and registered during akt
initialization.
- registering a tunable is equivalent to registering the corresponding kobject
within that subsystem.
- each tunable kobject has 3 associated attributes, all with a RW mode (i.e.
the show() and store() methods are provided for them):
. autotune: activates or deactivates automatic tuning for the tunable
. max: sets a new maximum value for the tunable
. min: sets a new minimum value for the tunable

The only way to activate automatic tuning is from user side:
- the directory /sys/tunables is created during the init phase.
- each time a tunable is registered by a kernel subsystem, a directory is
created for it under /sys/tunables.
- This directory contains 1 file for each tunable kobject attribute



These patches should be applied to 2.6.20-rc4, in the following order:

[PATCH 1/6]: tunables_registration.patch
[PATCH 2/6]: auto_tuning_activation.patch
[PATCH 3/6]: auto_tuning_kobjects.patch
[PATCH 4/6]: tunable_min_max_kobjects.patch
[PATCH 5/6]: per_namespace_tunables.patch
[PATCH 6/6]: auto_tune_applied.patch



Re: tunables??

2001-05-09 Thread Dan Mann

Take a look at this link for some information:

http://kalamazoolinux.org/presentations/19991221/Performance

Dan

- Original Message -
From: "SRIKANTH CHOWDARY M. K. G" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 09, 2001 1:57 AM
Subject: tunables??


>
> Hi All,
>
>   1.  Is there a file which contains all the tunables??
>
>   2. Are the variables free_pages_high & free_pages_low (which
> kswapd checks after the timer expires) tunable parameters??
>
>   3. What  range of addresses  separates the Normal, High Memory
>   & DMA zones??
>
>  Thanks & Regards,
>  Srikanth
>
>



tunables??

2001-05-08 Thread SRIKANTH CHOWDARY M. K. G


Hi All,

  1.  Is there a file which contains all the tunables??

  2. Are the variables free_pages_high & free_pages_low (which
kswapd checks after the timer expires) tunable parameters??
  
  3. What  range of addresses  separates the Normal, High Memory
  & DMA zones??

 Thanks & Regards,
 Srikanth

