[PATCH 4/8] bfq: keep the minimum bandwidth for CLASS_BE

2021-04-20 Thread brookxu
From: Chunguang Xu 

CLASS_RT will preempt the other classes, which may then starve.
At present, the starvation of CLASS_IDLE is alleviated through a
minimum bandwidth mechanism. We should do the same for CLASS_BE.

Signed-off-by: Chunguang Xu 
---
 block/bfq-iosched.c |  6 --
 block/bfq-iosched.h | 11 ++
 block/bfq-wf2q.c| 59 ++---
 3 files changed, 53 insertions(+), 23 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 29940ec..89d4646 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6537,9 +6537,11 @@ static void bfq_init_root_group(struct bfq_group 
*root_group,
root_group->bfqd = bfqd;
 #endif
root_group->rq_pos_tree = RB_ROOT;
-   for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+   for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
-   root_group->sched_data.bfq_class_idle_last_service = jiffies;
+   root_group->sched_data.bfq_class_last_service[i] = jiffies;
+   }
+   root_group->sched_data.class_timeout_last_check = jiffies;
 }
 
 static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 28d8590..da636a8 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -13,7 +13,7 @@
 #include "blk-cgroup-rwstat.h"
 
 #define BFQ_IOPRIO_CLASSES 3
-#define BFQ_CL_IDLE_TIMEOUT  (HZ/5)
+#define BFQ_CLASS_TIMEOUT  (HZ/5)
 
 #define BFQ_MIN_WEIGHT 1
 #define BFQ_MAX_WEIGHT 1000
@@ -97,9 +97,12 @@ struct bfq_sched_data {
struct bfq_entity *next_in_service;
/* array of service trees, one per ioprio_class */
struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
-   /* last time CLASS_IDLE was served */
-   unsigned long bfq_class_idle_last_service;
-
+   /* last time the class was served */
+   unsigned long bfq_class_last_service[BFQ_IOPRIO_CLASSES];
+   /* last time class timeout was checked */
+   unsigned long class_timeout_last_check;
+   /* next index to check class timeout */
+   unsigned int next_class_index;
 };
 
 /**
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 276f225..619ed21 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1168,6 +1168,7 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, 
bool ins_into_idle_tree)
 {
struct bfq_sched_data *sd = entity->sched_data;
struct bfq_service_tree *st;
+   int idx = bfq_class_idx(entity);
bool is_in_service;
 
if (!entity->on_st_or_in_serv) /*
@@ -1207,6 +1208,7 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, 
bool ins_into_idle_tree)
else
bfq_idle_insert(st, entity);
 
+   sd->bfq_class_last_service[idx] = jiffies;
return true;
 }
 
@@ -1435,6 +1437,45 @@ static struct bfq_entity *bfq_first_active_entity(struct 
bfq_service_tree *st,
return entity;
 }
 
+static int bfq_select_next_class(struct bfq_sched_data *sd)
+{
+   struct bfq_service_tree *st = sd->service_tree;
+   unsigned long last_check, last_serve;
+   int i, class_idx, next_class = 0;
+   bool found = false;
+
+   /*
+    * We need to guarantee a minimum bandwidth for each class (if
+    * there is some active entity in this class). This should also
+    * mitigate priority-inversion problems in case a low-priority
+    * task is holding file system resources.
+    */
+   last_check = sd->class_timeout_last_check;
+   if (time_is_after_jiffies(last_check + BFQ_CLASS_TIMEOUT))
+   return next_class;
+
+   sd->class_timeout_last_check = jiffies;
+   for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
+   class_idx = (sd->next_class_index + i) % BFQ_IOPRIO_CLASSES;
+   last_serve = sd->bfq_class_last_service[class_idx];
+
+   if (time_is_after_jiffies(last_serve + BFQ_CLASS_TIMEOUT))
+   continue;
+
+   if (!RB_EMPTY_ROOT(&(st + class_idx)->active)) {
+   if (found)
+   continue;
+
+   next_class = class_idx++;
+   class_idx %= BFQ_IOPRIO_CLASSES;
+   sd->next_class_index = class_idx;
+   found = true;
+   }
+   sd->bfq_class_last_service[class_idx] = jiffies;
+   }
+   return next_class;
+}
+
 /**
  * bfq_lookup_next_entity - return the first eligible entity in @sd.
  * @sd: the sched_data.
@@ -1448,24 +1489,8 @@ static struct bfq_entity *bfq_lookup_next_entity(struct 
bfq_sched_data *sd,
 bool expiration)
 {
struct bfq_service_tree *st = sd->service_tree;
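(Aside for readers tracing bfq_select_next_class() above: the
time_is_after_jiffies() tests read as double negatives. A minimal standalone
sketch of the same check, written outside the patch with BFQ_CLASS_TIMEOUT
spelled out as HZ/5, is shown below; it is an illustration, not part of the
submission.)

#include <linux/jiffies.h>

/*
 * Sketch only: time_is_after_jiffies(x) expands to time_before(jiffies, x),
 * so it is true while x is still in the future. A class is therefore
 * skipped as long as its timeout has NOT yet expired.
 */
static bool bfq_class_timed_out(unsigned long last_service)
{
	return !time_is_after_jiffies(last_service + HZ / 5);
}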

Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-04-19 Thread changhuaixin



> On Mar 18, 2021, at 11:10 PM, Peter Zijlstra  wrote:
> 
> On Thu, Mar 18, 2021 at 08:59:44AM -0400, Phil Auld wrote:
>> I admit to not having followed all the history of this patch set. That
>> said, when I see the above I just think your quota is too low for your
>> workload.
> 
> This.
> 
>> The burst (mis?)feature seems to be a way to bypass the quota.  And it
>> sort of assumes cooperative containers that will only burst when they
>> need it and then go back to normal. 
> 
> Its not entirely unreasonable or unheard of. There's soft realtime
> systems that use this to increase the utilization with the trade-off
> that you're going to miss deadlines once every so often.
> 
> If you do it right, you can calculate the probabilities. Or usually the
> other way around, you calculate the allowed variance/burst given a P
> value for making the deadline.
> 
> Input then isn't the WCET for each task, but a runtime distribution as
> measured for your workload on your system etc..
> 
> I used to have papers on this, but I can't seem to find them in a hurry.
> 

Hi, I have done some reading on queueing theory and written down a problem
definition.

Divide real time into discrete periods, as cfs_b does. Assume there are m
cgroups using CFS Bandwidth Control. During each period, the i-th cgroup
demands u_i CPU time, where we assume u_i follows some distribution
(exponential, Poisson or another). At the end of a period, if the sum of the
u_i is less than or equal to 100%, we call it an "idle" state. The number of
periods between two "idle" states stands for the WCET of the tasks run during
those periods.

Originally, using quota alone, an "idle" state is guaranteed at the end of
each period, so the WCET is the length of one period. When CPU burst is
enabled, the sum of the u_i may exceed 100%, and the excess workload is
handled in the following periods. The WCET is then the number of periods
between two "idle" states.

Next, we are going to calculate the probability that the WCET is longer than
a period, and the average WCET for a given burst under some runtime
distribution.

Basically, these are based on previous mails. I am sending this email to see
if there is anything wrong with the problem definition.
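For what it is worth, the quantities above can be written out explicitly.
The notation and the independence assumption across periods are added here
for illustration and are not from the original mail:

  Let u_1, ..., u_m be the per-period demands, and let
      p = P( \sum_{i=1}^{m} u_i > 1 )
  be the probability that a period is not "idle". If periods are assumed
  independent, the number of periods N between two "idle" states is
  geometric:
      P(N = n) = p^{n-1} (1 - p),        E[N] = 1 / (1 - p),
  so the average WCET above is E[N] times the period length, and
  P(N > 1) = p is the probability that the WCET exceeds one period.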


[PATCH net-next 08/10] net: sparx5: add calendar bandwidth allocation support

2021-04-16 Thread Steen Hegelund
This configures the Sparx5 calendars according to the bandwidth
requested in the Device Tree nodes.
It also checks that the total requested bandwidth is within the
limits of the detected Sparx5 model.

Signed-off-by: Steen Hegelund 
Signed-off-by: Bjarni Jonasson 
Signed-off-by: Lars Povlsen 
---
 .../net/ethernet/microchip/sparx5/Makefile|   2 +-
 .../microchip/sparx5/sparx5_calendar.c| 596 ++
 .../ethernet/microchip/sparx5/sparx5_main.c   |   9 +-
 .../ethernet/microchip/sparx5/sparx5_main.h   |   4 +
 4 files changed, 609 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/microchip/sparx5/sparx5_calendar.c

diff --git a/drivers/net/ethernet/microchip/sparx5/Makefile 
b/drivers/net/ethernet/microchip/sparx5/Makefile
index d2788e8b7798..e7dea25eb479 100644
--- a/drivers/net/ethernet/microchip/sparx5/Makefile
+++ b/drivers/net/ethernet/microchip/sparx5/Makefile
@@ -7,4 +7,4 @@ obj-$(CONFIG_SPARX5_SWITCH) += sparx5-switch.o
 
 sparx5-switch-objs  := sparx5_main.o sparx5_packet.o \
  sparx5_netdev.o sparx5_port.o sparx5_phylink.o sparx5_mactable.o 
sparx5_vlan.o \
- sparx5_switchdev.o
+ sparx5_switchdev.o sparx5_calendar.o
diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_calendar.c 
b/drivers/net/ethernet/microchip/sparx5/sparx5_calendar.c
new file mode 100644
index ..76a8bb596aec
--- /dev/null
+++ b/drivers/net/ethernet/microchip/sparx5/sparx5_calendar.c
@@ -0,0 +1,596 @@
+// SPDX-License-Identifier: GPL-2.0+
+/* Microchip Sparx5 Switch driver
+ *
+ * Copyright (c) 2021 Microchip Technology Inc. and its subsidiaries.
+ */
+
+#include 
+#include 
+
+#include "sparx5_main_regs.h"
+#include "sparx5_main.h"
+
+/* QSYS calendar information */
+#define SPX5_PORTS_PER_CALREG  10  /* Ports mapped in a calendar 
register */
+#define SPX5_CALBITS_PER_PORT  3   /* Bit per port in calendar 
register */
+
+/* DSM calendar information */
+#define SPX5_DSM_CAL_LEN   64
+#define SPX5_DSM_CAL_EMPTY 0x
+#define SPX5_DSM_CAL_MAX_DEVS_PER_TAXI 13
+#define SPX5_DSM_CAL_TAXIS 8
+#define SPX5_DSM_CAL_BW_LOSS   553
+
+#define SPX5_TAXI_PORT_MAX 70
+
+#define SPEED_12500    12500
+
+/* Maps from taxis to port numbers */
+static u32 
sparx5_taxi_ports[SPX5_DSM_CAL_TAXIS][SPX5_DSM_CAL_MAX_DEVS_PER_TAXI] = {
+   {57, 12, 0, 1, 2, 16, 17, 18, 19, 20, 21, 22, 23},
+   {58, 13, 3, 4, 5, 24, 25, 26, 27, 28, 29, 30, 31},
+   {59, 14, 6, 7, 8, 32, 33, 34, 35, 36, 37, 38, 39},
+   {60, 15, 9, 10, 11, 40, 41, 42, 43, 44, 45, 46, 47},
+   {61, 48, 49, 50, 99, 99, 99, 99, 99, 99, 99, 99, 99},
+   {62, 51, 52, 53, 99, 99, 99, 99, 99, 99, 99, 99, 99},
+   {56, 63, 54, 55, 99, 99, 99, 99, 99, 99, 99, 99, 99},
+   {64, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99},
+};
+
+struct sparx5_calendar_data {
+   u32 schedule[SPX5_DSM_CAL_LEN];
+   u32 avg_dist[SPX5_DSM_CAL_MAX_DEVS_PER_TAXI];
+   u32 taxi_ports[SPX5_DSM_CAL_MAX_DEVS_PER_TAXI];
+   u32 taxi_speeds[SPX5_DSM_CAL_MAX_DEVS_PER_TAXI];
+   u32 dev_slots[SPX5_DSM_CAL_MAX_DEVS_PER_TAXI];
+   u32 new_slots[SPX5_DSM_CAL_LEN];
+   u32 temp_sched[SPX5_DSM_CAL_LEN];
+   u32 indices[SPX5_DSM_CAL_LEN];
+   u32 short_list[SPX5_DSM_CAL_LEN];
+   u32 long_list[SPX5_DSM_CAL_LEN];
+};
+
+static u32 sparx5_target_bandwidth(struct sparx5 *sparx5)
+{
+   switch (sparx5->target_ct) {
+   case SPX5_TARGET_CT_7546:
+   case SPX5_TARGET_CT_7546TSN:
+   return 65000;
+   case SPX5_TARGET_CT_7549:
+   case SPX5_TARGET_CT_7549TSN:
+   return 91000;
+   case SPX5_TARGET_CT_7552:
+   case SPX5_TARGET_CT_7552TSN:
+   return 129000;
+   case SPX5_TARGET_CT_7556:
+   case SPX5_TARGET_CT_7556TSN:
+   return 161000;
+   case SPX5_TARGET_CT_7558:
+   case SPX5_TARGET_CT_7558TSN:
+   return 201000;
+   default:
+   return 0;
+   }
+}
+
+/* This is used in calendar configuration */
+enum sparx5_cal_bw {
+   SPX5_CAL_SPEED_NONE = 0,
+   SPX5_CAL_SPEED_1G   = 1,
+   SPX5_CAL_SPEED_2G5  = 2,
+   SPX5_CAL_SPEED_5G   = 3,
+   SPX5_CAL_SPEED_10G  = 4,
+   SPX5_CAL_SPEED_25G  = 5,
+   SPX5_CAL_SPEED_0G5  = 6,
+   SPX5_CAL_SPEED_12G5 = 7
+};
+
+static u32 sparx5_clk_to_bandwidth(enum sparx5_core_clockfreq cclock)
+{
+   switch (cclock) {
+   case SPX5_CORE_CLOCK_250MHZ: return 83000; /* 250000 / 3 */
+   case SPX5_CORE_CLOCK_500MHZ: return 166000; /* 500000 / 3 */
+   case SPX5_CORE_CLOCK_625MHZ: return  208000; /* 625000 / 3 */
+   default: return 0;
+   }
+   return 0;
+}
+
+static u32 sparx5_cal_speed_to_value(enum sparx5_cal_bw speed)
+{
+   switch (speed) {
+   case SPX5_CAL_SPEED_1G:   return 1000;
+   case SPX5_CAL_SPEED_2G5:  return 2500;
+   case SPX5_CAL_SPEED_5G:   re
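(The listing above is truncated before the code that performs the bandwidth
check mentioned in the changelog. As a rough illustration only, a validation
of the following shape would compare the summed requested port speeds against
sparx5_target_bandwidth(); the function name and parameters below are
assumptions for the sketch, not the driver's actual code.)

/* Illustrative sketch only: the limit check, not the real implementation. */
static int sparx5_validate_requested_bandwidth(struct sparx5 *sparx5,
					       const u32 *port_speeds_mbps,
					       unsigned int num_ports)
{
	u32 total = 0, limit = sparx5_target_bandwidth(sparx5); /* Mbps */
	unsigned int idx;

	for (idx = 0; idx < num_ports; idx++)
		total += port_speeds_mbps[idx];

	/* Reject configurations beyond what the detected model supports. */
	if (limit == 0 || total > limit)
		return -EINVAL;

	return 0;
}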

[PATCH] venus: helpers: keep max bandwidth when mbps exceeds the supported range

2021-03-31 Thread Vikash Garodia
When the video use case has more macroblocks per second than is
supported, keep the required bus bandwidth at the maximum supported value.

Signed-off-by: Vikash Garodia 
---
 drivers/media/platform/qcom/venus/pm_helpers.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/media/platform/qcom/venus/pm_helpers.c 
b/drivers/media/platform/qcom/venus/pm_helpers.c
index e349d01422c5..ebd7e42e31c1 100644
--- a/drivers/media/platform/qcom/venus/pm_helpers.c
+++ b/drivers/media/platform/qcom/venus/pm_helpers.c
@@ -186,7 +186,7 @@ static void mbs_to_bw(struct venus_inst *inst, u32 mbs, u32 
*avg, u32 *peak)
return;
 
for (i = 0; i < num_rows; i++) {
-   if (mbs > bw_tbl[i].mbs_per_sec)
+   if (i != 0 && mbs > bw_tbl[i].mbs_per_sec)
break;
 
if (inst->dpb_fmt & HFI_COLOR_FORMAT_10_BIT_BASE) {
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
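(The effect of the one-line change above is easier to see with the table
walk written out. Below is a small user-space model of the lookup; the table
values are invented for illustration and are not the real venus bandwidth
tables.)

#include <stdio.h>

struct bw_row { unsigned int mbs_per_sec, avg, peak; };

/* Rows are ordered from highest load to lowest, as in the venus tables. */
static const struct bw_row bw_tbl[] = {
	{ 972000, 1044000, 1044000 },	/* invented numbers */
	{ 489600,  724000,  724000 },
	{ 244800,  416000,  416000 },
};

static void mbs_to_bw(unsigned int mbs, unsigned int *avg, unsigned int *peak)
{
	unsigned int i;

	*avg = *peak = 0;
	for (i = 0; i < sizeof(bw_tbl) / sizeof(bw_tbl[0]); i++) {
		/*
		 * Without the "i != 0" guard, an mbs above the first row made
		 * the loop break at i == 0 and left avg/peak at zero; with
		 * it, the first (maximum) row is always taken before any
		 * break can happen.
		 */
		if (i != 0 && mbs > bw_tbl[i].mbs_per_sec)
			break;
		*avg = bw_tbl[i].avg;
		*peak = bw_tbl[i].peak;
	}
}

int main(void)
{
	unsigned int avg, peak;

	mbs_to_bw(1100000, &avg, &peak);	/* exceeds the largest row */
	printf("avg=%u peak=%u\n", avg, peak);	/* the maximum row is used */
	return 0;
}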



Re: [PATCH 2/2] usb: xhci-mtk: relax periodic TT bandwidth checking

2021-03-31 Thread Chunfeng Yun
On Tue, 2021-03-30 at 16:06 +0800, Ikjoon Jang wrote:
> Software bandwidth checking logics used by xhci-mtk puts
> a quite heavy constraints to TT periodic endpoint allocations.
> 
> This patch provides a relaxed bandwidth calculation by
> - Allowing multiple periodic transactions in a same microframe
>   for a device with multiple interrupt endpoints.
> - Using best case budget instead of maximum number of
>   complete-split when calculating byte budgets on lower speed bus
> 
> Without this patch, a typical full speed audio headset with
> 3 periodic endpoints (audio isoc-in/out, input int-in) cannot be
> configured with xhci-mtk.
> 
> Signed-off-by: Ikjoon Jang 
> ---
cc Yaqii Wu 

I'll test it, thanks

> 
>  drivers/usb/host/xhci-mtk-sch.c | 68 ++---
>  drivers/usb/host/xhci-mtk.h |  2 -
>  2 files changed, 20 insertions(+), 50 deletions(-)
> 
> diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
> index 0cb41007ec65..76827e48049a 100644
> --- a/drivers/usb/host/xhci-mtk-sch.c
> +++ b/drivers/usb/host/xhci-mtk-sch.c
> @@ -388,13 +388,17 @@ static void setup_sch_info(struct xhci_ep_ctx *ep_ctx,
>   } else { /* INT_IN_EP or ISOC_IN_EP */
>   bwb_table[0] = 0; /* start split */
>   bwb_table[1] = 0; /* idle */
> +
> + sch_ep->num_budget_microframes += 2;
> + if (sch_ep->num_budget_microframes > sch_ep->esit)
> + sch_ep->num_budget_microframes = sch_ep->esit;
>   /*
>* due to cs_count will be updated according to cs
>* position, assign all remainder budget array
>* elements as @bw_cost_per_microframe, but only first
>* @num_budget_microframes elements will be used later
>*/
> - for (i = 2; i < TT_MICROFRAMES_MAX; i++)
> + for (i = 2; i < sch_ep->num_budget_microframes; i++)
>   bwb_table[i] =  sch_ep->bw_cost_per_microframe;
>   }
>   }
> @@ -449,20 +453,17 @@ static void update_bus_bw(struct mu3h_sch_bw_info 
> *sch_bw,
>  static int check_fs_bus_bw(struct mu3h_sch_ep_info *sch_ep, int offset)
>  {
>   struct mu3h_sch_tt *tt = sch_ep->sch_tt;
> - u32 num_esit, tmp;
> - int base;
>   int i, j;
> + const int nr_lower_uframes =
> + DIV_ROUND_UP(sch_ep->maxpkt, FS_PAYLOAD_MAX);
>  
> - num_esit = XHCI_MTK_MAX_ESIT / sch_ep->esit;
> - for (i = 0; i < num_esit; i++) {
> - base = offset + i * sch_ep->esit;
> -
> + for (i = offset; i < XHCI_MTK_MAX_ESIT; i += sch_ep->esit) {
>   /*
>* Compared with hs bus, no matter what ep type,
>* the hub will always delay one uframe to send data
>*/
> - for (j = 0; j < sch_ep->cs_count; j++) {
> - tmp = tt->fs_bus_bw[base + j] + 
> sch_ep->bw_cost_per_microframe;
> + for (j = 0; j < nr_lower_uframes; j++) {
> + u32 tmp = tt->fs_bus_bw[i + j + 1] + 
> sch_ep->bw_cost_per_microframe;
>   if (tmp > FS_PAYLOAD_MAX)
>   return -ESCH_BW_OVERFLOW;
>   }
> @@ -473,11 +474,9 @@ static int check_fs_bus_bw(struct mu3h_sch_ep_info 
> *sch_ep, int offset)
>  
>  static int check_sch_tt(struct mu3h_sch_ep_info *sch_ep, u32 offset)
>  {
> - struct mu3h_sch_tt *tt = sch_ep->sch_tt;
>   u32 extra_cs_count;
>   u32 start_ss, last_ss;
>   u32 start_cs, last_cs;
> - int i;
>  
>   if (!sch_ep->sch_tt)
>   return 0;
> @@ -494,10 +493,6 @@ static int check_sch_tt(struct mu3h_sch_ep_info *sch_ep, 
> u32 offset)
>   if (!(start_ss == 7 || last_ss < 6))
>   return -ESCH_SS_Y6;
>  
> - for (i = 0; i < sch_ep->cs_count; i++)
> - if (test_bit(offset + i, tt->ss_bit_map))
> - return -ESCH_SS_OVERLAP;
> -
>   } else {
>   u32 cs_count = DIV_ROUND_UP(sch_ep->maxpkt, FS_PAYLOAD_MAX);
>  
> @@ -524,19 +519,7 @@ static int check_sch_tt(struct mu3h_sch_ep_info *sch_ep, 
> u32 offset)
>   if (cs_count > 7)
>   cs_count = 7; /* HW limit */
>  
> - if (test_bit(offset, tt->ss_bit_map))
> - return -ESCH_SS_OVERLAP;
> -
>   sch_ep->cs_count = cs_count;

[PATCH next 1/2] usb: xhci-mtk: fix wrong remainder of bandwidth budget

2021-03-31 Thread Chunfeng Yun
The remainder of the last bandwidth budget is wrong;
it is the value allocated in the last budget, not the unused value.

Reported-by: Yaqii Wu 
Signed-off-by: Chunfeng Yun 
---
 drivers/usb/host/xhci-mtk-sch.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index a59d1f6d4744..7ac76ae28998 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -341,7 +341,6 @@ static void setup_sch_info(struct xhci_ep_ctx *ep_ctx,
}
 
if (ep_type == ISOC_IN_EP || ep_type == ISOC_OUT_EP) {
-   u32 remainder;
 
if (sch_ep->esit == 1)
sch_ep->pkts = esit_pkts;
@@ -357,14 +356,12 @@ static void setup_sch_info(struct xhci_ep_ctx *ep_ctx,
sch_ep->repeat = !!(sch_ep->num_budget_microframes > 1);
sch_ep->bw_cost_per_microframe = maxpkt * sch_ep->pkts;
 
-   remainder = sch_ep->bw_cost_per_microframe;
-   remainder *= sch_ep->num_budget_microframes;
-   remainder -= (maxpkt * esit_pkts);
for (i = 0; i < sch_ep->num_budget_microframes - 1; i++)
bwb_table[i] = sch_ep->bw_cost_per_microframe;
 
/* last one <= bw_cost_per_microframe */
-   bwb_table[i] = remainder;
+   bwb_table[i] = maxpkt * esit_pkts
+  - i * sch_ep->bw_cost_per_microframe;
}
} else if (is_fs_or_ls(sch_ep->speed)) {
sch_ep->pkts = 1; /* at most one packet for each microframe */
-- 
2.18.0
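(A worked example of the two formulas, with numbers invented for
illustration: suppose maxpkt = 1024 bytes, esit_pkts = 7 and
sch_ep->pkts = 3, so bw_cost_per_microframe = 3072 and
num_budget_microframes = 3.)

  last entry, new formula:  maxpkt * esit_pkts - i * bw_cost_per_microframe
                          = 7 * 1024 - 2 * 3072 = 1024   bytes actually sent
                                                         in the last uframe

  last entry, old formula:  cost * num_budget_microframes - maxpkt * esit_pkts
                          = 3 * 3072 - 7 * 1024 = 2048   bytes, i.e. the
                                                         unused part

which is exactly the distinction the changelog describes.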



[PATCH 2/2] usb: xhci-mtk: relax periodic TT bandwidth checking

2021-03-30 Thread Ikjoon Jang
The software bandwidth checking logic used by xhci-mtk puts
quite heavy constraints on TT periodic endpoint allocations.

This patch provides a relaxed bandwidth calculation by
- Allowing multiple periodic transactions in the same microframe
  for a device with multiple interrupt endpoints.
- Using the best-case budget instead of the maximum number of
  complete-splits when calculating byte budgets on the lower-speed bus

Without this patch, a typical full speed audio headset with
3 periodic endpoints (audio isoc-in/out, input int-in) cannot be
configured with xhci-mtk.

Signed-off-by: Ikjoon Jang 
---

 drivers/usb/host/xhci-mtk-sch.c | 68 ++---
 drivers/usb/host/xhci-mtk.h |  2 -
 2 files changed, 20 insertions(+), 50 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index 0cb41007ec65..76827e48049a 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -388,13 +388,17 @@ static void setup_sch_info(struct xhci_ep_ctx *ep_ctx,
} else { /* INT_IN_EP or ISOC_IN_EP */
bwb_table[0] = 0; /* start split */
bwb_table[1] = 0; /* idle */
+
+   sch_ep->num_budget_microframes += 2;
+   if (sch_ep->num_budget_microframes > sch_ep->esit)
+   sch_ep->num_budget_microframes = sch_ep->esit;
/*
 * due to cs_count will be updated according to cs
 * position, assign all remainder budget array
 * elements as @bw_cost_per_microframe, but only first
 * @num_budget_microframes elements will be used later
 */
-   for (i = 2; i < TT_MICROFRAMES_MAX; i++)
+   for (i = 2; i < sch_ep->num_budget_microframes; i++)
bwb_table[i] =  sch_ep->bw_cost_per_microframe;
}
}
@@ -449,20 +453,17 @@ static void update_bus_bw(struct mu3h_sch_bw_info *sch_bw,
 static int check_fs_bus_bw(struct mu3h_sch_ep_info *sch_ep, int offset)
 {
struct mu3h_sch_tt *tt = sch_ep->sch_tt;
-   u32 num_esit, tmp;
-   int base;
int i, j;
+   const int nr_lower_uframes =
+   DIV_ROUND_UP(sch_ep->maxpkt, FS_PAYLOAD_MAX);
 
-   num_esit = XHCI_MTK_MAX_ESIT / sch_ep->esit;
-   for (i = 0; i < num_esit; i++) {
-   base = offset + i * sch_ep->esit;
-
+   for (i = offset; i < XHCI_MTK_MAX_ESIT; i += sch_ep->esit) {
/*
 * Compared with hs bus, no matter what ep type,
 * the hub will always delay one uframe to send data
 */
-   for (j = 0; j < sch_ep->cs_count; j++) {
-   tmp = tt->fs_bus_bw[base + j] + 
sch_ep->bw_cost_per_microframe;
+   for (j = 0; j < nr_lower_uframes; j++) {
+   u32 tmp = tt->fs_bus_bw[i + j + 1] + 
sch_ep->bw_cost_per_microframe;
if (tmp > FS_PAYLOAD_MAX)
return -ESCH_BW_OVERFLOW;
}
@@ -473,11 +474,9 @@ static int check_fs_bus_bw(struct mu3h_sch_ep_info 
*sch_ep, int offset)
 
 static int check_sch_tt(struct mu3h_sch_ep_info *sch_ep, u32 offset)
 {
-   struct mu3h_sch_tt *tt = sch_ep->sch_tt;
u32 extra_cs_count;
u32 start_ss, last_ss;
u32 start_cs, last_cs;
-   int i;
 
if (!sch_ep->sch_tt)
return 0;
@@ -494,10 +493,6 @@ static int check_sch_tt(struct mu3h_sch_ep_info *sch_ep, 
u32 offset)
if (!(start_ss == 7 || last_ss < 6))
return -ESCH_SS_Y6;
 
-   for (i = 0; i < sch_ep->cs_count; i++)
-   if (test_bit(offset + i, tt->ss_bit_map))
-   return -ESCH_SS_OVERLAP;
-
} else {
u32 cs_count = DIV_ROUND_UP(sch_ep->maxpkt, FS_PAYLOAD_MAX);
 
@@ -524,19 +519,7 @@ static int check_sch_tt(struct mu3h_sch_ep_info *sch_ep, 
u32 offset)
if (cs_count > 7)
cs_count = 7; /* HW limit */
 
-   if (test_bit(offset, tt->ss_bit_map))
-   return -ESCH_SS_OVERLAP;
-
sch_ep->cs_count = cs_count;
-   /* one for ss, the other for idle */
-   sch_ep->num_budget_microframes = cs_count + 2;
-
-   /*
-* if interval=1, maxp >752, num_budge_micoframe is larger
-* than sch_ep->esit, will overstep boundary
-*/
-   if (sch_ep->num_budget_microframes > sch_ep->esit)
-   sch_ep->num_budget_microframes = sch_ep->esit;
}
 
return check_fs_bus_bw(sch_ep
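(A worked example of the "best case budget" change above, assuming
FS_PAYLOAD_MAX is the 188-byte per-microframe limit this driver uses
elsewhere: a full-speed isochronous endpoint with maxpkt = 192 gives

  nr_lower_uframes = DIV_ROUND_UP(192, 188) = 2

so only two lower-speed microframes of fs_bus_bw headroom are required,
whereas the old loop demanded headroom across cs_count microframes, up to
the 7-microframe hardware limit. That difference is one reason the
three-endpoint headset described above previously failed to schedule.)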

[PATCH 0/2] usb: xhci-mtk: relax periodic TT bandwidth checking

2021-03-30 Thread Ikjoon Jang
This series is for supporting typical full speed USB audio headsets
with speaker, microphone, and control knobs together.

With the current implementation, such a device cannot be configured
due to xhci-mtk's bandwidth allocation failure even when there's
enough bandwidth available.

Ikjoon Jang (2):
  usb: xhci-mtk: remove unnecessary assignments in periodic TT scheduler
  usb: xhci-mtk: relax periodic TT bandwidth checking

 drivers/usb/host/xhci-mtk-sch.c | 120 +++-
 drivers/usb/host/xhci-mtk.h |   2 -
 2 files changed, 41 insertions(+), 81 deletions(-)

-- 
2.31.0.291.g576ba9dcdaf-goog



[PATCH v3 06/14] bfq: keep the minimum bandwidth for CLASS_BE

2021-03-25 Thread brookxu
From: Chunguang Xu 

CLASS_RT will preempt the other classes, which may then starve.
At present, the starvation of CLASS_IDLE is alleviated through a
minimum bandwidth mechanism. We should do the same for CLASS_BE.

Signed-off-by: Chunguang Xu 
---
 block/bfq-iosched.c |  6 --
 block/bfq-iosched.h | 11 ++
 block/bfq-wf2q.c| 59 ++---
 3 files changed, 53 insertions(+), 23 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 8eaf0eb..ee8c457 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6560,9 +6560,11 @@ static void bfq_init_root_group(struct bfq_group 
*root_group,
root_group->bfqd = bfqd;
 #endif
root_group->rq_pos_tree = RB_ROOT;
-   for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+   for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
-   root_group->sched_data.bfq_class_idle_last_service = jiffies;
+   root_group->sched_data.bfq_class_last_service[i] = jiffies;
+   }
+   root_group->sched_data.class_timeout_last_check = jiffies;
 }
 
 static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 29a56b8..f9ed1da 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -13,7 +13,7 @@
 #include "blk-cgroup-rwstat.h"
 
 #define BFQ_IOPRIO_CLASSES 3
-#define BFQ_CL_IDLE_TIMEOUT  (HZ/5)
+#define BFQ_CLASS_TIMEOUT  (HZ/5)
 
 #define BFQ_MIN_WEIGHT 1
 #define BFQ_MAX_WEIGHT 1000
@@ -97,9 +97,12 @@ struct bfq_sched_data {
struct bfq_entity *next_in_service;
/* array of service trees, one per ioprio_class */
struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
-   /* last time CLASS_IDLE was served */
-   unsigned long bfq_class_idle_last_service;
-
+   /* last time the class was served */
+   unsigned long bfq_class_last_service[BFQ_IOPRIO_CLASSES];
+   /* last time class timeout was checked */
+   unsigned long class_timeout_last_check;
+   /* next index to check class timeout */
+   unsigned int next_class_index;
 };
 
 /**
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index c91109e..1f8f3c5 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1188,6 +1188,7 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, 
bool ins_into_idle_tree)
 {
struct bfq_sched_data *sd = entity->sched_data;
struct bfq_service_tree *st;
+   int idx = bfq_class_idx(entity);
bool is_in_service;
 
if (!entity->on_st_or_in_serv) /*
@@ -1227,6 +1228,7 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, 
bool ins_into_idle_tree)
else
bfq_idle_insert(st, entity);
 
+   sd->bfq_class_last_service[idx] = jiffies;
return true;
 }
 
@@ -1455,6 +1457,45 @@ static struct bfq_entity *bfq_first_active_entity(struct 
bfq_service_tree *st,
return entity;
 }
 
+static int bfq_select_next_class(struct bfq_sched_data *sd)
+{
+   struct bfq_service_tree *st = sd->service_tree;
+   unsigned long last_check, last_serve;
+   int i, class_idx, next_class = 0;
+   bool found = false;
+
+   /*
+    * We need to guarantee a minimum bandwidth for each class (if
+    * there is some active entity in this class). This should also
+    * mitigate priority-inversion problems in case a low-priority
+    * task is holding file system resources.
+    */
+   last_check = sd->class_timeout_last_check;
+   if (time_is_after_jiffies(last_check + BFQ_CLASS_TIMEOUT))
+   return next_class;
+
+   sd->class_timeout_last_check = jiffies;
+   for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
+   class_idx = (sd->next_class_index + i) % BFQ_IOPRIO_CLASSES;
+   last_serve = sd->bfq_class_last_service[class_idx];
+
+   if (time_is_after_jiffies(last_serve + BFQ_CLASS_TIMEOUT))
+   continue;
+
+   if (!RB_EMPTY_ROOT(&(st + class_idx)->active)) {
+   if (found)
+   continue;
+
+   next_class = class_idx++;
+   class_idx %= BFQ_IOPRIO_CLASSES;
+   sd->next_class_index = class_idx;
+   found = true;
+   }
+   sd->bfq_class_last_service[class_idx] = jiffies;
+   }
+   return next_class;
+}
+
 /**
  * bfq_lookup_next_entity - return the first eligible entity in @sd.
  * @sd: the sched_data.
@@ -1468,24 +1509,8 @@ static struct bfq_entity *bfq_lookup_next_entity(struct 
bfq_sched_data *sd,
 bool expiration)
 {
struct bfq_service_tree *st = sd->service_tree;

Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-19 Thread changhuaixin



> On Mar 19, 2021, at 8:39 PM, changhuaixin  
> wrote:
> 
> 
> 
>> On Mar 18, 2021, at 11:05 PM, Peter Zijlstra  wrote:
>> 
>> On Thu, Mar 18, 2021 at 09:26:58AM +0800, changhuaixin wrote:
>>>> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra  wrote:
>> 
>>>> So what is the typical avg,stdev,max and mode for the workloads where you 
>>>> find
>>>> you need this?
>>>> 
>>>> I would really like to put a limit on the burst. IMO a workload that has
>>>> a burst many times longer than the quota is plain broken.
>>> 
>>> I see. Then the problem comes down to how large the limit on burst shall be.
>>> 
>>> I have sampled the CPU usage of a bursty container in 100ms periods. The 
>>> statistics are:
>> 
>> So CPU usage isn't exactly what is required, job execution time is what
>> you're after. Assuming there is a relation...
>> 
> 
> Yes, job execution time is important. To be specific, it is to improve the 
> CPU usage of the whole
> system to reduce the total cost of ownership, while not damaging job 
> execution time. This
> requires lower the average CPU resource of underutilized cgroups, and 
> allowing their bursts
> at the same time.
> 
>>> average : 42.2%
>>> stddev  : 81.5%
>>> max : 844.5%
>>> P95 : 183.3%
>>> P99 : 437.0%
>> 
>> Then your WCET is 844% of 100ms ? , which is .84s.
>> 
>> But you forgot your mode; what is the most common duration, given P95 is
>> so high, I doubt that avg is representative of the most common duration.
>> 
> 
> It is true.
> 
>>> If quota is 10ms, burst buffer needs to be 8 times more in order
>>> for this workload not to be throttled.
>> 
>> Where does that 100s come from? And an 800s burst is bizarre.
>> 
>> Did you typo [us] as [ms] ?
>> 
> 
> Sorry, it should be 10us.
> 
>>> I can't say this is typical, but these workloads exist. On a machine
>>> running Kubernetes containers, where there is often room for such
>>> burst and the interference is hard to notice, users would prefer
>>> allowing such burst to being throttled occasionally.
>> 
>> Users also want ponies. I've no idea what kubernetes actually is or what
>> it has to do with containers. That's all just word salad.
>> 
>>> In this sense, I suggest limit burst buffer to 16 times of quota or
>>> around. That should be enough for users to improve tail latency caused
>>> by throttling. And users might choose a smaller one or even none, if
>>> the interference is unacceptable. What do you think?
>> 
>> Well, normal RT theory would suggest you pick your runtime around 200%
>> to get that P95 and then allow a full period burst to get your P99, but
>> that same RT theory would also have you calculate the resulting
>> interference and see if that works with the rest of the system...
>> 
> 
> I am sorry that I don't know much about the RT theory you mentioned, and 
> can't provide
> the desired calculation now. But I'd like to try and do some reading if that 
> is needed.
> 
>> 16 times is horrific.
> 
> So can we decide on a more relative value now? Or is the interference 
> probabilities still the
> missing piece?

A more [realistic] value, I mean.

> 
> Is the paper you mentioned about called "Insensitivity results in statistical 
> bandwidth sharing",
> or some related ones on statistical bandwidth results under some kind of 
> fairness?



Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-19 Thread changhuaixin



> On Mar 18, 2021, at 8:59 PM, Phil Auld  wrote:
> 
> On Thu, Mar 18, 2021 at 09:26:58AM +0800 changhuaixin wrote:
>> 
>> 
>>> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra  wrote:
>>> 
>>> On Wed, Mar 17, 2021 at 03:16:18PM +0800, changhuaixin wrote:
>>> 
>>>>> Why do you allow such a large burst? I would expect something like:
>>>>> 
>>>>>   if (burst > quote)
>>>>>   return -EINVAL;
>>>>> 
>>>>> That limits the variance in the system. Allowing super long bursts seems
>>>>> to defeat the entire purpose of bandwidth control.
>>>> 
>>>> I understand your concern. Surely large burst value might allow super
>>>> long bursts thus preventing bandwidth control entirely for a long
>>>> time.
>>>> 
>>>> However, I am afraid it is hard to decide what the maximum burst
>>>> should be from the bandwidth control mechanism itself. Allowing some
>>>> burst to the maximum of quota is helpful, but not enough. There are
>>>> cases where workloads are bursty that they need many times more than
>>>> quota in a single period. In such cases, limiting burst to the maximum
>>>> of quota fails to meet the needs.
>>>> 
>>>> Thus, I wonder whether is it acceptable to leave the maximum burst to
>>>> users. If the desired behavior is to allow some burst, configure burst
>>>> accordingly. If that is causing variance, use share or other fairness
>>>> mechanism. And if fairness mechanism still fails to coordinate, do not
>>>> use burst maybe.
>>> 
>>> It's not fairness, bandwidth control is about isolation, and burst
>>> introduces interference.
>>> 
>>>> In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is
>>>> still needed maybe.
>>> 
>>> So what is the typical avg,stdev,max and mode for the workloads where you 
>>> find
>>> you need this?
>>> 
>>> I would really like to put a limit on the burst. IMO a workload that has
>>> a burst many times longer than the quota is plain broken.
>> 
>> I see. Then the problem comes down to how large the limit on burst shall be.
>> 
>> I have sampled the CPU usage of a bursty container in 100ms periods. The 
>> statistics are:
>> average  : 42.2%
>> stddev   : 81.5%
>> max  : 844.5%
>> P95  : 183.3%
>> P99  : 437.0%
>> 
>> If quota is 10ms, burst buffer needs to be 8 times more in order for 
>> this workload not to be throttled.
>> I can't say this is typical, but these workloads exist. On a machine running 
>> Kubernetes containers,
>> where there is often room for such burst and the interference is hard to 
>> notice, users would prefer
>> allowing such burst to being throttled occasionally.
>> 
> 
> I admit to not having followed all the history of this patch set. That said, 
> when I see the above I just
> think your quota is too low for your workload.
> 

Yeah, more quota is helpful for this workload. But that usually prevents us 
from improving the total CPU
usage by putting more work onto a single machine.

> The burst (mis?)feature seems to be a way to bypass the quota.  And it sort 
> of assumes cooperative
> containers that will only burst when they need it and then go back to normal. 
> 
>> In this sense, I suggest limit burst buffer to 16 times of quota or around. 
>> That should be enough for users to
>> improve tail latency caused by throttling. And users might choose a smaller 
>> one or even none, if the interference
>> is unacceptable. What do you think?
>> 
> 
> Having quotas that can regularly be exceeded by 16 times seems to make the 
> concept of a quota
> meaningless.  I'd have thought a burst would be some small percentage.
> 
> What if several such containers burst at the same time? Can't that lead to 
> overcommit that can effect
> other well-behaved containers?
> 

I see. Maybe there should be some calculation on the probabilities of that, as 
Peter has replied.

> 
> Cheers,
> Phil
> 
> -- 



Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-19 Thread changhuaixin



> On Mar 18, 2021, at 11:05 PM, Peter Zijlstra  wrote:
> 
> On Thu, Mar 18, 2021 at 09:26:58AM +0800, changhuaixin wrote:
>>> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra  wrote:
> 
>>> So what is the typical avg,stdev,max and mode for the workloads where you 
>>> find
>>> you need this?
>>> 
>>> I would really like to put a limit on the burst. IMO a workload that has
>>> a burst many times longer than the quota is plain broken.
>> 
>> I see. Then the problem comes down to how large the limit on burst shall be.
>> 
>> I have sampled the CPU usage of a bursty container in 100ms periods. The 
>> statistics are:
> 
> So CPU usage isn't exactly what is required, job execution time is what
> you're after. Assuming there is a relation...
> 

Yes, job execution time is important. To be specific, the goal is to improve
the CPU usage of the whole system to reduce the total cost of ownership,
while not damaging job execution time. This requires lowering the average
CPU resources of underutilized cgroups while allowing their bursts at the
same time.

>> average  : 42.2%
>> stddev   : 81.5%
>> max  : 844.5%
>> P95  : 183.3%
>> P99  : 437.0%
> 
> Then your WCET is 844% of 100ms ? , which is .84s.
> 
> But you forgot your mode; what is the most common duration, given P95 is
> so high, I doubt that avg is representative of the most common duration.
> 

It is true.

>> If quota is 10ms, burst buffer needs to be 8 times more in order
>> for this workload not to be throttled.
> 
> Where does that 100s come from? And an 800s burst is bizarre.
> 
> Did you typo [us] as [ms] ?
> 

Sorry, it should be 10us.

>> I can't say this is typical, but these workloads exist. On a machine
>> running Kubernetes containers, where there is often room for such
>> burst and the interference is hard to notice, users would prefer
>> allowing such burst to being throttled occasionally.
> 
> Users also want ponies. I've no idea what kubernetes actually is or what
> it has to do with containers. That's all just word salad.
> 
>> In this sense, I suggest limit burst buffer to 16 times of quota or
>> around. That should be enough for users to improve tail latency caused
>> by throttling. And users might choose a smaller one or even none, if
>> the interference is unacceptable. What do you think?
> 
> Well, normal RT theory would suggest you pick your runtime around 200%
> to get that P95 and then allow a full period burst to get your P99, but
> that same RT theory would also have you calculate the resulting
> interference and see if that works with the rest of the system...
> 

I am sorry that I don't know much about the RT theory you mentioned, and can't 
provide
the desired calculation now. But I'd like to try and do some reading if that is 
needed.

> 16 times is horrific.

So can we decide on a more relative value now? Or is the interference 
probabilities still the
missing piece?

Is the paper you mentioned called "Insensitivity results in statistical
bandwidth sharing", or one of the related ones on statistical bandwidth
results under some kind of fairness?



Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-18 Thread Peter Zijlstra
On Thu, Mar 18, 2021 at 08:59:44AM -0400, Phil Auld wrote:
> I admit to not having followed all the history of this patch set. That
> said, when I see the above I just think your quota is too low for your
> workload.

This.

> The burst (mis?)feature seems to be a way to bypass the quota.  And it
> sort of assumes cooperative containers that will only burst when they
> need it and then go back to normal. 

It's not entirely unreasonable or unheard of. There are soft realtime
systems that use this to increase the utilization with the trade-off
that you're going to miss deadlines once every so often.

If you do it right, you can calculate the probabilities. Or usually the
other way around, you calculate the allowed variance/burst given a P
value for making the deadline.

Input then isn't the WCET for each task, but a runtime distribution as
measured for your workload on your system etc..

I used to have papers on this, but I can't seem to find them in a hurry.


Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-18 Thread Peter Zijlstra
On Thu, Mar 18, 2021 at 09:26:58AM +0800, changhuaixin wrote:
> > On Mar 17, 2021, at 4:06 PM, Peter Zijlstra  wrote:

> > So what is the typical avg,stdev,max and mode for the workloads where you 
> > find
> > you need this?
> > 
> > I would really like to put a limit on the burst. IMO a workload that has
> > a burst many times longer than the quota is plain broken.
> 
> I see. Then the problem comes down to how large the limit on burst shall be.
> 
> I have sampled the CPU usage of a bursty container in 100ms periods. The 
> statistics are:

So CPU usage isn't exactly what is required, job execution time is what
you're after. Assuming there is a relation...

> average   : 42.2%
> stddev: 81.5%
> max   : 844.5%
> P95   : 183.3%
> P99   : 437.0%

Then your WCET is 844% of 100ms ? , which is .84s.

But you forgot your mode; what is the most common duration, given P95 is
so high, I doubt that avg is representative of the most common duration.

> If quota is 10ms, burst buffer needs to be 8 times more in order
> for this workload not to be throttled.

Where does that 100s come from? And an 800s burst is bizarre.

Did you typo [us] as [ms] ?

> I can't say this is typical, but these workloads exist. On a machine
> running Kubernetes containers, where there is often room for such
> burst and the interference is hard to notice, users would prefer
> allowing such burst to being throttled occasionally.

Users also want ponies. I've no idea what kubernetes actually is or what
it has to do with containers. That's all just word salad.

> In this sense, I suggest limit burst buffer to 16 times of quota or
> around. That should be enough for users to improve tail latency caused
> by throttling. And users might choose a smaller one or even none, if
> the interference is unacceptable. What do you think?

Well, normal RT theory would suggest you pick your runtime around 200%
to get that P95 and then allow a full period burst to get your P99, but
that same RT theory would also have you calculate the resulting
interference and see if that works with the rest of the system...

16 times is horrific.
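Spelled out, the durations behind these percentages (period = 100 ms, usage
given as a percentage of one CPU over that period) are:

  max : 844.5% x 100 ms ~= 0.84 s   (the WCET figure above)
  P95 : 183.3% x 100 ms ~= 183 ms   (covered by a runtime of roughly 200%
                                     of the period, i.e. "pick your runtime
                                     around 200%")
  P99 : 437.0% x 100 ms  = 437 ms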



Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-18 Thread Phil Auld
On Thu, Mar 18, 2021 at 09:26:58AM +0800 changhuaixin wrote:
> 
> 
> > On Mar 17, 2021, at 4:06 PM, Peter Zijlstra  wrote:
> > 
> > On Wed, Mar 17, 2021 at 03:16:18PM +0800, changhuaixin wrote:
> > 
> >>> Why do you allow such a large burst? I would expect something like:
> >>> 
> >>>   if (burst > quote)
> >>>   return -EINVAL;
> >>> 
> >>> That limits the variance in the system. Allowing super long bursts seems
> >>> to defeat the entire purpose of bandwidth control.
> >> 
> >> I understand your concern. Surely large burst value might allow super
> >> long bursts thus preventing bandwidth control entirely for a long
> >> time.
> >> 
> >> However, I am afraid it is hard to decide what the maximum burst
> >> should be from the bandwidth control mechanism itself. Allowing some
> >> burst to the maximum of quota is helpful, but not enough. There are
> >> cases where workloads are bursty that they need many times more than
> >> quota in a single period. In such cases, limiting burst to the maximum
> >> of quota fails to meet the needs.
> >> 
> >> Thus, I wonder whether is it acceptable to leave the maximum burst to
> >> users. If the desired behavior is to allow some burst, configure burst
> >> accordingly. If that is causing variance, use share or other fairness
> >> mechanism. And if fairness mechanism still fails to coordinate, do not
> >> use burst maybe.
> > 
> > It's not fairness, bandwidth control is about isolation, and burst
> > introduces interference.
> > 
> >> In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is
> >> still needed maybe.
> > 
> > So what is the typical avg,stdev,max and mode for the workloads where you 
> > find
> > you need this?
> > 
> > I would really like to put a limit on the burst. IMO a workload that has
> > a burst many times longer than the quota is plain broken.
> 
> I see. Then the problem comes down to how large the limit on burst shall be.
> 
> I have sampled the CPU usage of a bursty container in 100ms periods. The 
> statistics are:
> average   : 42.2%
> stddev: 81.5%
> max   : 844.5%
> P95   : 183.3%
> P99   : 437.0%
> 
> If quota is 10ms, burst buffer needs to be 8 times more in order for this 
> workload not to be throttled.
> I can't say this is typical, but these workloads exist. On a machine running 
> Kubernetes containers,
> where there is often room for such burst and the interference is hard to 
> notice, users would prefer
> allowing such burst to being throttled occasionally.
>

I admit to not having followed all the history of this patch set. That said, 
when I see the above I just
think your quota is too low for your workload.

The burst (mis?)feature seems to be a way to bypass the quota.  And it sort of 
assumes cooperative
containers that will only burst when they need it and then go back to normal. 

> In this sense, I suggest limit burst buffer to 16 times of quota or around. 
> That should be enough for users to
> improve tail latency caused by throttling. And users might choose a smaller 
> one or even none, if the interference
> is unacceptable. What do you think?
> 

Having quotas that can regularly be exceeded by 16 times seems to make the 
concept of a quota
meaningless.  I'd have thought a burst would be some small percentage.

What if several such containers burst at the same time? Can't that lead to 
overcommit that can affect
other well-behaved containers?


Cheers,
Phil

-- 



Re: [PATCH v16 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-18 Thread Dmitry Osipenko
18.03.2021 12:31, Michał Mirosław wrote:
>>  static const struct tegra_windowgroup_soc tegra194_dc_wgrps[] = {
>> @@ -2430,6 +2781,7 @@ static const struct tegra_dc_soc_info 
>> tegra194_dc_soc_info = {
>>  .has_nvdisplay = true,
>>  .wgrps = tegra194_dc_wgrps,
>>  .num_wgrps = ARRAY_SIZE(tegra194_dc_wgrps),
>> +.plane_tiled_memory_bandwidth_x2 = false,
>>  };
> For globals you will have .x = false by default; I'm not sure those entries
> add much value.
> 
> Reviewed-by: Michał Mirosław 

IIRC, in the past Thierry preferred to add the defaults to this driver
in order to ease reading/understanding of the code. So I added them for
consistency.

Thank you very much for helping with the review!


Re: [PATCH v16 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-18 Thread Michał Mirosław
te = to_const_dc_state(crtc->state);
> +
> + if (!crtc->state->active) {
> + if (!old_crtc_state->active)
> + return;
> +
> + /*
> +  * When CRTC is disabled on DPMS, the state of attached planes
> +  * is kept unchanged. Hence we need to enforce removal of the
> +  * bandwidths from the ICC paths.
> +  */
> + drm_atomic_crtc_for_each_plane(plane, crtc) {
> + tegra = to_tegra_plane(plane);
> +
> + icc_set_bw(tegra->icc_mem, 0, 0);
> + icc_set_bw(tegra->icc_mem_vfilter, 0, 0);
> + }
> +
> + return;
> + }
> +
> + for_each_old_plane_in_state(old_crtc_state->state, plane,
> + old_plane_state, i) {
> + old_tegra_state = to_const_tegra_plane_state(old_plane_state);
> + new_tegra_state = to_const_tegra_plane_state(plane->state);
> + tegra = to_tegra_plane(plane);
> +
> + /*
> +  * We're iterating over the global atomic state and it contains
> +  * planes from another CRTC, hence we need to filter out the
> +  * planes unrelated to this CRTC.
> +  */
> + if (tegra->dc != dc)
> + continue;
> +
> + new_avg_bw = new_tegra_state->avg_memory_bandwidth;
> + old_avg_bw = old_tegra_state->avg_memory_bandwidth;
> +
> + new_peak_bw = new_dc_state->plane_peak_bw[tegra->index];
> + old_peak_bw = old_dc_state->plane_peak_bw[tegra->index];
> +
> + /*
> +  * See the comment related to !crtc->state->active above,
> +  * which explains why bandwidths need to be updated when
> +  * CRTC is turning ON.
> +  */
> + if (new_avg_bw == old_avg_bw && new_peak_bw == old_peak_bw &&
> + old_crtc_state->active)
> + continue;
> +
> + window.src.h = drm_rect_height(&plane->state->src) >> 16;
> + window.dst.h = drm_rect_height(&plane->state->dst);
> +
> + old_window.src.h = drm_rect_height(&old_plane_state->src) >> 16;
> + old_window.dst.h = drm_rect_height(&old_plane_state->dst);
> +
> + /*
> +  * During the preparation phase (atomic_begin), the memory
> +  * freq should go high before the DC changes are committed
> +  * if bandwidth requirement goes up, otherwise memory freq
> +  * should to stay high if BW requirement goes down.  The
> +  * opposite applies to the completion phase (post_commit).
> +  */
> + if (prepare_bandwidth_transition) {
> + new_avg_bw = max(old_avg_bw, new_avg_bw);
> + new_peak_bw = max(old_peak_bw, new_peak_bw);
> +
> + if (tegra_plane_use_vertical_filtering(tegra, 
> &old_window))
> + window = old_window;
> + }
> +
> + icc_set_bw(tegra->icc_mem, new_avg_bw, new_peak_bw);
> +
> + if (tegra_plane_use_vertical_filtering(tegra, &window))
> + icc_set_bw(tegra->icc_mem_vfilter, new_avg_bw, 
> new_peak_bw);
> + else
> + icc_set_bw(tegra->icc_mem_vfilter, 0, 0);
> + }
> +}
> +
>  static void tegra_crtc_atomic_disable(struct drm_crtc *crtc,
> struct drm_atomic_state *state)
>  {
> @@ -1934,6 +2064,8 @@ static void tegra_crtc_atomic_begin(struct drm_crtc 
> *crtc,
>  {
>   unsigned long flags;
>  
> + tegra_crtc_update_memory_bandwidth(crtc, state, true);
> +
>   if (crtc->state->event) {
>   spin_lock_irqsave(>dev->event_lock, flags);
>  
> @@ -1966,7 +2098,215 @@ static void tegra_crtc_atomic_flush(struct drm_crtc 
> *crtc,
>   value = tegra_dc_readl(dc, DC_CMD_STATE_CONTROL);
>  }
>  
> +static bool tegra_plane_is_cursor(const struct drm_plane_state *state)
> +{
> + const struct tegra_dc_soc_info *soc = to_tegra_dc(state->crtc)->soc;
> + const struct drm_format_info *fmt = state->fb->format;
> + unsigned int src_w = drm_rect_width(&state->src) >> 16;
> + unsigned int dst_w = drm_rect_width(&state->dst);
> +
> + if (state->plane->type != DRM_PLANE_TYPE_CURSOR)
> + return false;
> +
> + if (soc->supports_cursor)
> + return true;
> +
> + if (src_w 

Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-17 Thread changhuaixin



> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra  wrote:
> 
> On Wed, Mar 17, 2021 at 03:16:18PM +0800, changhuaixin wrote:
> 
>>> Why do you allow such a large burst? I would expect something like:
>>> 
>>> if (burst > quote)
>>> return -EINVAL;
>>> 
>>> That limits the variance in the system. Allowing super long bursts seems
>>> to defeat the entire purpose of bandwidth control.
>> 
>> I understand your concern. Surely large burst value might allow super
>> long bursts thus preventing bandwidth control entirely for a long
>> time.
>> 
>> However, I am afraid it is hard to decide what the maximum burst
>> should be from the bandwidth control mechanism itself. Allowing some
>> burst to the maximum of quota is helpful, but not enough. There are
>> cases where workloads are bursty that they need many times more than
>> quota in a single period. In such cases, limiting burst to the maximum
>> of quota fails to meet the needs.
>> 
>> Thus, I wonder whether is it acceptable to leave the maximum burst to
>> users. If the desired behavior is to allow some burst, configure burst
>> accordingly. If that is causing variance, use share or other fairness
>> mechanism. And if fairness mechanism still fails to coordinate, do not
>> use burst maybe.
> 
> It's not fairness, bandwidth control is about isolation, and burst
> introduces interference.
> 
>> In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is
>> still needed maybe.
> 
> So what is the typical avg,stdev,max and mode for the workloads where you find
> you need this?
> 
> I would really like to put a limit on the burst. IMO a workload that has
> a burst many times longer than the quota is plain broken.

I see. Then the problem comes down to how large the limit on burst shall be.

I have sampled the CPU usage of a bursty container in 100ms periods. The 
statistics are:
average : 42.2%
stddev  : 81.5%
max : 844.5%
P95 : 183.3%
P99 : 437.0%

If quota is 10ms, burst buffer needs to be 8 times more in order for this 
workload not to be throttled.
I can't say this is typical, but these workloads exist. On a machine running
Kubernetes containers, where there is often room for such bursts and the
interference is hard to notice, users would prefer allowing such bursts over
being throttled occasionally.

In this sense, I suggest limiting the burst buffer to 16 times the quota or
thereabouts. That should be enough for users to improve tail latency caused
by throttling. And users might choose a smaller value, or even none, if the
interference is unacceptable. What do you think?


[PATCH v16 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-17 Thread Dmitry Osipenko
The display controller (DC) performs isochronous memory transfers and thus
has a requirement for a minimum memory bandwidth that shall be fulfilled,
otherwise framebuffer data can't be fetched fast enough, which results
in a DC data-FIFO underflow followed by visual corruption.

The Memory Controller drivers provide a facility for memory bandwidth
management via the interconnect API. Let's wire up interconnect API
support in the DC driver in order to fix the distorted display output
on the T30 Ouya, T124 TK1 and other Tegra devices.

Tested-by: Peter Geis  # Ouya T30
Tested-by: Matt Merhar  # Ouya T30
Tested-by: Nicolas Chauvet  # PAZ00 T20 and TK1 T124
Signed-off-by: Dmitry Osipenko 
---
 drivers/gpu/drm/tegra/Kconfig |   1 +
 drivers/gpu/drm/tegra/dc.c| 352 ++
 drivers/gpu/drm/tegra/dc.h|  14 ++
 drivers/gpu/drm/tegra/drm.c   |  14 ++
 drivers/gpu/drm/tegra/hub.c   |   3 +
 drivers/gpu/drm/tegra/plane.c | 116 +++
 drivers/gpu/drm/tegra/plane.h |  15 ++
 7 files changed, 515 insertions(+)

diff --git a/drivers/gpu/drm/tegra/Kconfig b/drivers/gpu/drm/tegra/Kconfig
index 5043dcaf1cf9..1650a448eabd 100644
--- a/drivers/gpu/drm/tegra/Kconfig
+++ b/drivers/gpu/drm/tegra/Kconfig
@@ -9,6 +9,7 @@ config DRM_TEGRA
select DRM_MIPI_DSI
select DRM_PANEL
select TEGRA_HOST1X
+   select INTERCONNECT
select IOMMU_IOVA
select CEC_CORE if CEC_NOTIFIER
help
diff --git a/drivers/gpu/drm/tegra/dc.c b/drivers/gpu/drm/tegra/dc.c
index 3fe5630dfbe1..96e3a27dc98d 100644
--- a/drivers/gpu/drm/tegra/dc.c
+++ b/drivers/gpu/drm/tegra/dc.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <linux/interconnect.h>
 #include 
 #include 
 #include 
@@ -618,6 +619,9 @@ static int tegra_plane_atomic_check(struct drm_plane *plane,
struct tegra_dc *dc = to_tegra_dc(new_plane_state->crtc);
int err;
 
+   plane_state->peak_memory_bandwidth = 0;
+   plane_state->avg_memory_bandwidth = 0;
+
/* no need for further checks if the plane is being disabled */
if (!new_plane_state->crtc)
return 0;
@@ -808,6 +812,12 @@ static struct drm_plane *tegra_primary_plane_create(struct 
drm_device *drm,
formats = dc->soc->primary_formats;
modifiers = dc->soc->modifiers;
 
+   err = tegra_plane_interconnect_init(plane);
+   if (err) {
+   kfree(plane);
+   return ERR_PTR(err);
+   }
+
err = drm_universal_plane_init(drm, &plane->base, possible_crtcs,
   &tegra_plane_funcs, formats,
   num_formats, modifiers, type, NULL);
@@ -841,9 +851,13 @@ static int tegra_cursor_atomic_check(struct drm_plane 
*plane,
 {
struct drm_plane_state *new_plane_state = 
drm_atomic_get_new_plane_state(state,

 plane);
+   struct tegra_plane_state *plane_state = 
to_tegra_plane_state(new_plane_state);
struct tegra_plane *tegra = to_tegra_plane(plane);
int err;
 
+   plane_state->peak_memory_bandwidth = 0;
+   plane_state->avg_memory_bandwidth = 0;
+
/* no need for further checks if the plane is being disabled */
if (!new_plane_state->crtc)
return 0;
@@ -985,6 +999,12 @@ static struct drm_plane 
*tegra_dc_cursor_plane_create(struct drm_device *drm,
num_formats = ARRAY_SIZE(tegra_cursor_plane_formats);
formats = tegra_cursor_plane_formats;
 
+   err = tegra_plane_interconnect_init(plane);
+   if (err) {
+   kfree(plane);
+   return ERR_PTR(err);
+   }
+
err = drm_universal_plane_init(drm, &plane->base, possible_crtcs,
   &tegra_plane_funcs, formats,
   num_formats, NULL,
@@ -1099,6 +1119,12 @@ static struct drm_plane 
*tegra_dc_overlay_plane_create(struct drm_device *drm,
num_formats = dc->soc->num_overlay_formats;
formats = dc->soc->overlay_formats;
 
+   err = tegra_plane_interconnect_init(plane);
+   if (err) {
+   kfree(plane);
+   return ERR_PTR(err);
+   }
+
if (!cursor)
type = DRM_PLANE_TYPE_OVERLAY;
else
@@ -1216,6 +1242,7 @@ tegra_crtc_atomic_duplicate_state(struct drm_crtc *crtc)
 {
struct tegra_dc_state *state = to_dc_state(crtc->state);
struct tegra_dc_state *copy;
+   unsigned int i;
 
copy = kmalloc(sizeof(*copy), GFP_KERNEL);
if (!copy)
@@ -1227,6 +1254,9 @@ tegra_crtc_atomic_duplicate_state(struct drm_crtc *crtc)
copy->div = state->div;
copy->planes = state->planes;
 
+   for (i = 0; i < ARRAY_SIZE(state->plane_peak_bw); i++)
+   copy->plane_peak_bw[i] = state->plane_peak_bw[i];
+
return &copy->base;
 }
 
@@ -1753,6 +1783
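(The diff is truncated above before the plane code that actually requests
bandwidth, but the calls it relies on are the generic kernel interconnect
API. A minimal sketch of that pattern is shown below; the device, the
"dma-mem" path name and the bandwidth values are placeholders, not
necessarily what this driver registers.)

#include <linux/interconnect.h>

/* Sketch of the ICC usage pattern; bandwidth is requested in kbytes/s. */
static int example_request_display_bandwidth(struct device *dev,
					     u32 avg_kbps, u32 peak_kbps)
{
	struct icc_path *path;
	int err;

	path = devm_of_icc_get(dev, "dma-mem");	/* path described in DT */
	if (IS_ERR(path))
		return PTR_ERR(path);

	/* Request average and peak bandwidth; 0/0 releases the request. */
	err = icc_set_bw(path, avg_kbps, peak_kbps);
	if (err)
		return err;

	return 0;
}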

[PATCH v16 0/2] Add memory bandwidth management to NVIDIA Tegra DRM driver

2021-03-17 Thread Dmitry Osipenko
This series adds memory bandwidth management to the NVIDIA Tegra DRM driver,
which is done using interconnect framework. It fixes display corruption that
happens due to insufficient memory bandwidth.

Changelog:

v16: - Implemented suggestions that were given by Michał Mirosław to v15.

 - Added r-b from Michał Mirosław to the debug-stats patch.

 - Rebased on top of a recent linux-next.

 - Removed bandwidth scaling based on the width difference of src/dst
   windows since it is no longer relevant. Apparently the recent memory
   driver changes fixed the problems that I witnessed before.

 - Average bandwidth calculation now won't overflow for 4k resolutions.

 - Average bandwidth calculation now uses the size of the visible
   area instead of the src area since debug stats of the memory
   controller clearly show that downscaled window takes less bandwidth,
   proportionally to the scaled size.

 - Bandwidth calculation now uses "adjusted mode" of the CRTC, which
   is what used for h/w programming, instead of the mode that was
   requested by userspace, although the two usually match in practice.

v15: - Corrected tegra_plane_icc_names[] NULL-check that was partially lost
   by accident in v14 after unsuccessful rebase.

v14: - Made improvements that were suggested by Michał Mirosław to v13:

   - Changed 'unsigned int' to 'bool'.
   - Renamed functions which calculate bandwidth state.
   - Reworked the comment in the code that explains why a downscaled plane
     requires higher bandwidth.
   - Added round-up to bandwidth calculation.
   - Added sanity checks of the plane index and fixed out-of-bounds
 access which happened on T124 due to the cursor plane index.

v13: - No code changes. Patches missed v5.12, re-sending them for v5.13.

Dmitry Osipenko (2):
  drm/tegra: dc: Support memory bandwidth management
  drm/tegra: dc: Extend debug stats with total number of events

 drivers/gpu/drm/tegra/Kconfig |   1 +
 drivers/gpu/drm/tegra/dc.c| 362 ++
 drivers/gpu/drm/tegra/dc.h|  19 ++
 drivers/gpu/drm/tegra/drm.c   |  14 ++
 drivers/gpu/drm/tegra/hub.c   |   3 +
 drivers/gpu/drm/tegra/plane.c | 116 +++
 drivers/gpu/drm/tegra/plane.h |  15 ++
 7 files changed, 530 insertions(+)

-- 
2.30.2



Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-17 Thread Peter Zijlstra
On Wed, Mar 17, 2021 at 03:16:18PM +0800, changhuaixin wrote:

> > Why do you allow such a large burst? I would expect something like:
> > 
> > if (burst > quota)
> > return -EINVAL;
> > 
> > That limits the variance in the system. Allowing super long bursts seems
> > to defeat the entire purpose of bandwidth control.
> 
> I understand your concern. Surely a large burst value might allow super
> long bursts, thus preventing bandwidth control entirely for a long
> time.
> 
> However, I am afraid it is hard to decide what the maximum burst
> should be from the bandwidth control mechanism itself. Allowing a
> burst up to the maximum of quota is helpful, but not enough. There are
> cases where workloads are so bursty that they need many times more than
> quota in a single period. In such cases, limiting burst to the maximum
> of quota fails to meet the needs.
> 
> Thus, I wonder whether it is acceptable to leave the maximum burst to
> users. If the desired behavior is to allow some burst, configure burst
> accordingly. If that is causing variance, use shares or another fairness
> mechanism. And if the fairness mechanism still fails to coordinate,
> perhaps do not use burst at all.

It's not fairness, bandwidth control is about isolation, and burst
introduces interference.

> In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is
> still needed maybe.

So what are the typical avg, stdev, max and mode for the workloads where you
find you need this?

I would really like to put a limit on the burst. IMO a workload that has
a burst many times longer than the quota is plain broken.
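
As a rough worked example of the concern (numbers invented for illustration,
not taken from the thread): with period = 100ms and quota = 20ms, a group is
limited to 20% of one CPU on average. If burst = 10 * quota = 200ms were
allowed, a group that idles for ten periods could then consume
200ms + 20ms = 220ms of CPU time inside a single 100ms period, i.e. fully
occupy more than two CPUs for that period. Bounding burst by quota caps such
a spike at twice the configured rate.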


Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-17 Thread changhuaixin



> On Mar 16, 2021, at 5:54 PM, Peter Zijlstra  wrote:
> 
> On Tue, Mar 16, 2021 at 12:49:28PM +0800, Huaixin Chang wrote:
>> @@ -8982,6 +8983,12 @@ static int tg_set_cfs_bandwidth(struct task_group 
>> *tg, u64 period, u64 quota)
>>  if (quota != RUNTIME_INF && quota > max_cfs_runtime)
>>  return -EINVAL;
>> 
>> +/*
>> + * Bound burst to defend burst against overflow during bandwidth shift.
>> + */
>> +if (burst > max_cfs_runtime)
>> +return -EINVAL;
> 
> Why do you allow such a large burst? I would expect something like:
> 
>   if (burst > quota)
>   return -EINVAL;
> 
> That limits the variance in the system. Allowing super long bursts seems
> to defeat the entire purpose of bandwidth control.

I understand your concern. Surely a large burst value might allow super long
bursts, thus preventing bandwidth control entirely for a long time.

However, I am afraid it is hard to decide what the maximum burst should be from
the bandwidth control mechanism itself. Allowing a burst up to the maximum of
quota is helpful, but not enough. There are cases where workloads are so bursty
that they need many times more than quota in a single period. In such cases,
limiting burst to the maximum of quota fails to meet the needs.

Thus, I wonder whether it is acceptable to leave the maximum burst to users. If
the desired behavior is to allow some burst, configure burst accordingly. If
that is causing variance, use shares or another fairness mechanism. And if the
fairness mechanism still fails to coordinate, perhaps do not use burst at all.

In this way, cfs_b->buffer can be removed, while cfs_b->max_overrun may still
be needed.


Re: [PATCH v15 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-16 Thread Dmitry Osipenko
15.03.2021 21:39, Dmitry Osipenko wrote:
>>> +   /*
>>> +* Horizontal downscale needs a lower memory latency, which roughly
>>> +* depends on the scaled width.  Trying to tune latency of a memory
>>> +* client alone will likely result in a strong negative impact on
>>> +* other memory clients, hence we will request a higher bandwidth
>>> +* since latency depends on bandwidth.  This allows to prevent memory
>>> +* FIFO underflows for a large plane downscales, meanwhile allowing
>>> +* display to share bandwidth fairly with other memory clients.
>>> +*/
>>> +   if (src_w > dst_w)
>>> +   mul = (src_w - dst_w) * bpp / 2048 + 1;
>>> +   else
>>> +   mul = 1;
>> [...]
>>
>> One point is unexplained yet: why is the multiplier proportional to a
>> *difference* between src and dst widths? Also, I would expect max (worst
>> case) is pixclock * read_size when src_w/dst_w >= read_size.
> IIRC, the difference gives a more adequate/practical result than the
> proportion. Although, downstream driver uses proportion. I'll try to
> revisit this for the next version of the patch.

I tried to re-test everything and can't reproduce problems that existed
previously. We didn't have finished memory drivers back then and I
think that Tegra30 latency tuning support and various Tegra20 changes
fixed those problems. I'll remove this hunk in the next version.


Re: [PATCH v3 2/4] sched/fair: Make CFS bandwidth controller burstable

2021-03-16 Thread Peter Zijlstra
On Fri, Mar 12, 2021 at 09:54:33PM +0800, changhuaixin wrote:
> > On Mar 10, 2021, at 9:04 PM, Peter Zijlstra  wrote:

> > There's already an #ifdef block that contains that bandwidth_slice
> > thing, see the previous hunk, so why create a new #ifdef here?
> > 
> > Also, personally I think percentages are over-represented as members of
> > Q.
> > 
> Sorry, I don't quite understand the "members of Q". Is this saying that the 
> percentages
> are over-designed here?

You know the number groups (in order): N, Z, Q, R, C, H, O.

Percent being 1/100 is a fraction and thus part of Q (and anything
higher of course).

Some people seem to think percent is magical and special. It's just a
fraction like the infinitely many others in Q. It's also a very crappy
one when we consider computers.

Basically I hate percentages, they're nothing special and often employed
where they should not be.


Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-16 Thread Peter Zijlstra
On Tue, Mar 16, 2021 at 12:49:28PM +0800, Huaixin Chang wrote:
> And the maximum amount of CPU a group can consume in
> a given period is "buffer", which is equivalent to "quota" + "burst" in
> case this group has done enough accumulation.

I'm confused as heck about cfs_b->buffer. Why do you need that? What's
wrong with cfs_b->runtime ?

Currently, by being strict, we ignore any remaining runtime and the
period timer resets runtime to quota and life goes on. What you want to
do is instead of resetting runtime, add quota and limit the total.

That is, currently it does:

runtime = min(runtime + quota, quota);

which, by virtue of runtime not being allowed to go negative, is the exact same
as:

runtime = quota;

Which is what we have in refill.

Fix that to be something like:

runtime = min(runtime + quota, quota + burst)

and you're basically done. And that seems *much* simpler.
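
A tiny standalone user-space simulation of that refill rule (a sketch with
made-up numbers, not kernel code) shows how accumulation and the cap interact:

/* burst_refill_sim.c - sketch of runtime = min(runtime + quota, quota + burst) */
#include <stdio.h>

static unsigned long long min_ull(unsigned long long a, unsigned long long b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned long long quota = 20, burst = 10;	/* ms per period */
	unsigned long long runtime = quota;		/* first period starts full */
	unsigned long long usage[] = { 5, 0, 28, 30 };	/* ms requested per period */
	unsigned int i;

	for (i = 0; i < sizeof(usage) / sizeof(usage[0]); i++) {
		unsigned long long used = min_ull(usage[i], runtime);

		runtime -= used;
		printf("period %u: ran %llu ms, %llu ms left%s\n",
		       i, used, runtime, used < usage[i] ? " (throttled)" : "");

		/* period boundary: top up, capped at quota + burst */
		runtime = min_ull(runtime + quota, quota + burst);
	}
	return 0;
}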

What am I missing, why have you made it so complicated?


/me looks again..

Oooh, I think I see, all this is because you don't do your constraints
right. Removing static from max_cfs_runtime is a big clue you did that
wrong.

Something like this on top of the first two. Much simpler!

Now I just need to figure out wth you mucked about with where we call
refill.

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8954,7 +8954,7 @@ static DEFINE_MUTEX(cfs_constraints_mute
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 /* More than 203 days if BW_SHIFT equals 20. */
-const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
+static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
@@ -8989,10 +8989,10 @@ static int tg_set_cfs_bandwidth(struct t
if (quota != RUNTIME_INF && quota > max_cfs_runtime)
return -EINVAL;
 
-   /*
-* Bound burst to defend burst against overflow during bandwidth shift.
-*/
-   if (burst > max_cfs_runtime)
+   if (burst > quota)
+   return -EINVAL;
+
+   if (quota + burst > max_cfs_runtime)
return -EINVAL;
 
/*
@@ -9019,8 +9019,6 @@ static int tg_set_cfs_bandwidth(struct t
cfs_b->burst = burst;
 
if (runtime_enabled) {
-   cfs_b->buffer = min(max_cfs_runtime, quota + burst);
-   cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
cfs_b->runtime = cfs_b->quota;
 
/* Restart the period timer (if active) to handle new period 
expiry: */
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4621,23 +4621,22 @@ static inline u64 sched_cfs_bandwidth_sl
  *
  * requires cfs_b->lock
  */
-static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
-  u64 overrun)
+static void
+__refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
 {
-   u64 refill;
-
-   if (cfs_b->quota != RUNTIME_INF) {
-
-   if (!sysctl_sched_cfs_bw_burst_enabled) {
-   cfs_b->runtime = cfs_b->quota;
-   return;
-   }
+   if (unlikely(cfs_b->quota == RUNTIME_INF))
+   return;
 
-   overrun = min(overrun, cfs_b->max_overrun);
-   refill = cfs_b->quota * overrun;
-   cfs_b->runtime += refill;
-   cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+   if (!sysctl_sched_cfs_bw_burst_enabled) {
+   cfs_b->runtime = cfs_b->quota;
+   return;
}
+
+   /*
+* Ignore @overrun since burst <= quota.
+*/
+   cfs_b->runtime += cfs_b->quota;
+   cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -5226,7 +5225,6 @@ static enum hrtimer_restart sched_cfs_sl
 }
 
 extern const u64 max_cfs_quota_period;
-extern const u64 max_cfs_runtime;
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -5256,18 +5254,7 @@ static enum hrtimer_restart sched_cfs_pe
new = old * 2;
if (new < max_cfs_quota_period) {
cfs_b->period = ns_to_ktime(new);
-   cfs_b->quota = min(cfs_b->quota * 2,
-  max_cfs_runtime);
-
-   cfs_b->buffer = min(cfs_b->quota + cfs_b->burst,
-   max_cfs_runtime);
-   /*
-* Add 1 in case max_overrun becomes 0

Re: [PATCH 5.10 154/290] PCI/LINK: Remove bandwidth notification

2021-03-16 Thread Pavel Machek
Hi!

> From: Greg Kroah-Hartman 
> 
> From: Bjorn Helgaas 

Dup.

> Remove the bandwidth change notifications for now.  Hopefully we can add
> this back when we have a better understanding of why this happens and how
> we can make the messages useful instead of overwhelming.

This is stable, and even for mainline, I'd expect "depends on BROKEN"
in Kconfig, or something like that, so people can still work on fixing
it and so that we don't have huge changes floating around.

Best regards,
Pavel

> diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
> index 3946555a6042..45a2ef702b45 100644
> --- a/drivers/pci/pcie/Kconfig
> +++ b/drivers/pci/pcie/Kconfig
> @@ -133,14 +133,6 @@ config PCIE_PTM
> This is only useful if you have devices that support PTM, but it
> is safe to enable even if you don't.
>  
> -config PCIE_BW
> - bool "PCI Express Bandwidth Change Notification"
> - depends on PCIEPORTBUS
> - help
> -   This enables PCI Express Bandwidth Change Notification.  If
> -   you know link width or rate changes occur only to correct
> -   unreliable links, you may answer Y.

-- 
DENX Software Engineering GmbH,  Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany




Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-16 Thread Peter Zijlstra
On Tue, Mar 16, 2021 at 12:49:28PM +0800, Huaixin Chang wrote:
> @@ -8982,6 +8983,12 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
> u64 period, u64 quota)
>   if (quota != RUNTIME_INF && quota > max_cfs_runtime)
>   return -EINVAL;
>  
> + /*
> +  * Bound burst to defend burst against overflow during bandwidth shift.
> +  */
> + if (burst > max_cfs_runtime)
> + return -EINVAL;

Why do you allow such a large burst? I would expect something like:

if (burst > quota)
return -EINVAL;

That limits the variance in the system. Allowing super long bursts seems
to defeat the entire purpose of bandwidth control.


Re: [PATCH v4 2/4] sched/fair: Make CFS bandwidth controller burstable

2021-03-16 Thread Peter Zijlstra



I can't make sense of patch 1 and 2 independent of one another. Why the
split?



Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-16 Thread Peter Zijlstra
On Tue, Mar 16, 2021 at 12:49:28PM +0800, Huaixin Chang wrote:
> In this patch, we introduce the notion of CFS bandwidth burst. Unused
> "quota" from previous "periods" might be accumulated and used in the
> following "periods". The maximum amount of accumulated bandwidth is
> bounded by "burst". And the maximum amount of CPU a group can consume in
> a given period is "buffer", which is equivalent to "quota" + "burst" in
> case this group has done enough accumulation.

Complete lack of why though. Why am I going to spend time looking at
this?


Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-16 Thread Peter Zijlstra
On Tue, Mar 16, 2021 at 12:49:28PM +0800, Huaixin Chang wrote:
> In this patch, we introduce the notion of CFS bandwidth burst. Unused

Documentation/process/submitting-patches.rst:instead of "[This patch] makes 
xyzzy do frotz" or "[I] changed xyzzy


[PATCH v4 2/4] sched/fair: Make CFS bandwidth controller burstable

2021-03-15 Thread Huaixin Chang
Accumulate unused quota from previous periods, so that the accumulated
bandwidth runtime can be used in the following periods. During
accumulation, take care of runtime overflow. The previous non-burstable
CFS bandwidth controller only assigned quota to runtime, which avoided a
lot of this handling.

A sysctl parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
switch for burst. It is enabled by default.
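
As a rough worked example of the refill added below (numbers invented for
illustration): if the period timer fires after three missed periods
(overrun = 3) while max_overrun is 2, the refill is quota * min(3, 2) =
2 * quota, and the resulting runtime is then clamped to the buffer
(quota + burst), so missed periods cannot inflate the runtime beyond the
configured burst.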

Co-developed-by: Shanpei Chen 
Signed-off-by: Shanpei Chen 
Signed-off-by: Huaixin Chang 
---
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/core.c  |  8 +++---
 kernel/sched/fair.c  | 58 ++--
 kernel/sched/sched.h |  4 +--
 kernel/sysctl.c  |  9 +++
 5 files changed, 66 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 3c31ba88aca5..3cce25485c69 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -72,6 +72,7 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
 #endif
 
 #ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 708c31e6ce1f..16e23a2499ef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8948,7 +8948,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 /* More than 203 days if BW_SHIFT equals 20. */
-static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
+const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
@@ -9012,13 +9012,13 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota,
cfs_b->quota = quota;
cfs_b->burst = burst;
 
-   __refill_cfs_bandwidth_runtime(cfs_b);
-
if (runtime_enabled) {
cfs_b->buffer = min(max_cfs_runtime, quota + burst);
+   cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
+   cfs_b->runtime = cfs_b->quota;
 
/* Restart the period timer (if active) to handle new period 
expiry: */
-   start_cfs_bandwidth(cfs_b);
+   start_cfs_bandwidth(cfs_b, 1);
}
 
	raw_spin_unlock_irq(&cfs_b->lock);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59d816a365f3..c981d4845c96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -127,6 +127,13 @@ int __weak arch_asym_cpu_priority(int cpu)
  * (default: 5 msec, units: microseconds)
  */
 unsigned int sysctl_sched_cfs_bandwidth_slice  = 5000UL;
+
+/*
+ * A switch for cfs bandwidth burst.
+ *
+ * (default: 1, enabled)
+ */
+unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
 #endif
 
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
@@ -4602,10 +4609,23 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  *
  * requires cfs_b->lock
  */
-void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
+  u64 overrun)
 {
-   if (cfs_b->quota != RUNTIME_INF)
-   cfs_b->runtime = cfs_b->quota;
+   u64 refill;
+
+   if (cfs_b->quota != RUNTIME_INF) {
+
+   if (!sysctl_sched_cfs_bw_burst_enabled) {
+   cfs_b->runtime = cfs_b->quota;
+   return;
+   }
+
+   overrun = min(overrun, cfs_b->max_overrun);
+   refill = cfs_b->quota * overrun;
+   cfs_b->runtime += refill;
+   cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+   }
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4627,7 +4647,7 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth 
*cfs_b,
if (cfs_b->quota == RUNTIME_INF)
amount = min_amount;
else {
-   start_cfs_bandwidth(cfs_b);
+   start_cfs_bandwidth(cfs_b, 0);
 
if (cfs_b->runtime > 0) {
amount = min(cfs_b->runtime, min_amount);
@@ -4973,7 +4993,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth 
*cfs_b, int overrun, u
if (cfs_b->idle && !throttled)
goto out_deactivate;
 
-   __refill_cfs_bandwidth_runtime(cfs_b);
+   __refill_cfs_bandwidth_runtime(cfs_b, overrun);
 
if (!throttled) {
/* mark as potentially idle for the upcoming period */
@@ -5194,6 +5214,7 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct 
hrtimer *timer)
 }
 
 extern const u64 max_cfs_quota_period;
+extern const u64 max_cfs_runtime;
 
 stat

[PATCH v4 3/4] sched/fair: Add cfs bandwidth burst statistics

2021-03-15 Thread Huaixin Chang
When using cfs_b and encountering some throttled periods, users can use
the burst buffer to allow bursty workloads. Apart from configuring a
burst buffer and watching whether the throttled periods disappear, some
statistics on burst buffer usage are also helpful. Thus expose the
following statistics in the cpu.stat file:

nr_burst:   number of periods in which a bandwidth burst occurs
burst_time: cumulative wall-time that any CPUs have
used above quota in the respective periods
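
With this change, the bandwidth-related part of a cgroup-v2 cpu.stat for a
limited group would carry the two new keys next to the existing ones, e.g.
(illustrative values only):

nr_periods 1000
nr_throttled 4
throttled_usec 14320
nr_burst 27
burst_usec 98500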

Co-developed-by: Shanpei Chen 
Signed-off-by: Shanpei Chen 
Signed-off-by: Huaixin Chang 
---
 kernel/sched/core.c  | 14 +++---
 kernel/sched/fair.c  | 13 -
 kernel/sched/sched.h |  3 +++
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 16e23a2499ef..f60232862300 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9016,6 +9016,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota,
cfs_b->buffer = min(max_cfs_runtime, quota + burst);
cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
cfs_b->runtime = cfs_b->quota;
+   cfs_b->runtime_at_period_start = cfs_b->runtime;
 
/* Restart the period timer (if active) to handle new period 
expiry: */
start_cfs_bandwidth(cfs_b, 1);
@@ -9265,6 +9266,9 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
seq_printf(sf, "wait_sum %llu\n", ws);
}
 
+   seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
+   seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
+
return 0;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
@@ -9361,16 +9365,20 @@ static int cpu_extra_stat_show(struct seq_file *sf,
{
struct task_group *tg = css_tg(css);
	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
-   u64 throttled_usec;
+   u64 throttled_usec, burst_usec;
 
throttled_usec = cfs_b->throttled_time;
do_div(throttled_usec, NSEC_PER_USEC);
+   burst_usec = cfs_b->burst_time;
+   do_div(burst_usec, NSEC_PER_USEC);
 
seq_printf(sf, "nr_periods %d\n"
   "nr_throttled %d\n"
-  "throttled_usec %llu\n",
+  "throttled_usec %llu\n"
+  "nr_burst %d\n"
+  "burst_usec %llu\n",
   cfs_b->nr_periods, cfs_b->nr_throttled,
-  throttled_usec);
+  throttled_usec, cfs_b->nr_burst, burst_usec);
}
 #endif
return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c981d4845c96..e7574d8bc11a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4612,7 +4612,7 @@ static inline u64 sched_cfs_bandwidth_slice(void)
 static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
   u64 overrun)
 {
-   u64 refill;
+   u64 refill, runtime;
 
if (cfs_b->quota != RUNTIME_INF) {
 
@@ -4621,10 +4621,21 @@ static void __refill_cfs_bandwidth_runtime(struct 
cfs_bandwidth *cfs_b,
return;
}
 
+   if (cfs_b->runtime_at_period_start > cfs_b->runtime) {
+   runtime = cfs_b->runtime_at_period_start
+   - cfs_b->runtime;
+   if (runtime > cfs_b->quota) {
+   cfs_b->burst_time += runtime - cfs_b->quota;
+   cfs_b->nr_burst++;
+   }
+   }
+
overrun = min(overrun, cfs_b->max_overrun);
refill = cfs_b->quota * overrun;
cfs_b->runtime += refill;
cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+
+   cfs_b->runtime_at_period_start = cfs_b->runtime;
}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efcbbfc31619..7ef8d4733791 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -360,6 +360,7 @@ struct cfs_bandwidth {
u64 burst;
u64 buffer;
u64 max_overrun;
+   u64 runtime_at_period_start;
s64 hierarchical_quota;
 
u8  idle;
@@ -372,7 +373,9 @@ struct cfs_bandwidth {
/* Statistics: */
int nr_periods;
int nr_throttled;
+   int nr_burst;
u64 throttled_time;
+   u64 burst_time;
 #endif
 };
 
-- 
2.14.4.44.g2045bb6



[PATCH v4 4/4] sched/fair: Add document for burstable CFS bandwidth control

2021-03-15 Thread Huaixin Chang
Basic description of usage and effect for CFS Bandwidth Control Burst.

Co-developed-by: Shanpei Chen 
Signed-off-by: Shanpei Chen 
Signed-off-by: Huaixin Chang 
---
 Documentation/admin-guide/cgroup-v2.rst | 16 +
 Documentation/scheduler/sched-bwc.rst   | 64 ++---
 2 files changed, 69 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 64c62b979f2f..17ec571ab4a8 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -997,6 +997,8 @@ All time durations are in microseconds.
- nr_periods
- nr_throttled
- throttled_usec
+   - nr_burst
+   - burst_usec
 
   cpu.weight
A read-write single value file which exists on non-root
@@ -1017,16 +1019,18 @@ All time durations are in microseconds.
the closest approximation of the current weight.
 
   cpu.max
-   A read-write two value file which exists on non-root cgroups.
-   The default is "max 10".
+   A read-write three value file which exists on non-root cgroups.
+   The default is "max 10 0".
 
The maximum bandwidth limit.  It's in the following format::
 
- $MAX $PERIOD
+ $MAX $PERIOD $BURST
 
-   which indicates that the group may consume upto $MAX in each
-   $PERIOD duration.  "max" for $MAX indicates no limit.  If only
-   one number is written, $MAX is updated.
+   which indicates that the group may consume upto $MAX from this
+   period plus $BURST carried over from previous periods in each
+   $PERIOD duration.  "max" for $MAX indicates no limit. "0" for
+   $BURST indicates no bandwidth can be carried over. On partial
+   writing, values are updated accordingly.
 
   cpu.pressure
A read-write nested-keyed file.
diff --git a/Documentation/scheduler/sched-bwc.rst 
b/Documentation/scheduler/sched-bwc.rst
index 845eee659199..42e0773c0eed 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -22,24 +22,51 @@ cfs_quota units at each period boundary. As threads consume 
this bandwidth it
 is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
+By default, CPU bandwidth consumption is strictly limited to quota within each
+given period. For the sequence of CPU usage u_i served under CFS bandwidth
+control, if for any j <= k N(j,k) is the number of periods from u_j to u_k:
+
+u_j+...+u_k <= quota * N(j,k)
+
+For a bursty sequence among which interval u_j...u_k are at the peak, CPU
+requests might have to wait for more periods to replenish enough quota.
+Otherwise, larger quota is required.
+
+With "burst" buffer, CPU requests might be served as long as:
+
+u_j+...+u_k <= B_j + quota * N(j,k)
+
+if for any j <= k N(j,k) is the number of periods from u_j to u_k and B_j is
+the accumulated quota from previous periods in burst buffer serving u_j.
+Burst buffer helps in that serving whole bursty CPU requests without throttling
+them can be done with moderate quota setting and accumulated quota in burst
+buffer, if:
+
+u_0+...+u_n <= B_0 + quota * N(0,n)
+
+where B_0 is the initial state of burst buffer. The maximum accumulated quota 
in
+the burst buffer is capped by burst. With proper burst setting, the available
+bandwidth is still determined by quota and period on the long run.
+
 Management
 --
-Quota and period are managed within the cpu subsystem via cgroupfs.
+Quota, period and burst are managed within the cpu subsystem via cgroupfs.
 
 .. note::
The cgroupfs files described in this section are only applicable
to cgroup v1. For cgroup v2, see
:ref:`Documentation/admin-guide/cgroupv2.rst `.
 
-- cpu.cfs_quota_us: the total available run-time within a period (in
-  microseconds)
+- cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
 - cpu.cfs_period_us: the length of a period (in microseconds)
 - cpu.stat: exports throttling statistics [explained further below]
+- cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
 
 The default values are::
 
cpu.cfs_period_us=100ms
-   cpu.cfs_quota=-1
+   cpu.cfs_quota_us=-1
+   cpu.cfs_burst_us=0
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
@@ -55,6 +82,11 @@ more detail below.
 Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
 and return the group to an unconstrained state once more.
 
+A value of 0 for cpu.cfs_burst_us indicates that the group can not accumulate
+any unused bandwidth. It makes the traditional bandwidth control behavior for
+CFS uncha

[PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-15 Thread Huaixin Chang
In this patch, we introduce the notion of CFS bandwidth burst. Unused
"quota" from previous "periods" might be accumulated and used in the
following "periods". The maximum amount of accumulated bandwidth is
bounded by "burst". And the maximum amount of CPU a group can consume in
a given period is "buffer", which is equivalent to "quota" + "burst" in
case this group has done enough accumulation.
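
For example (illustrative values, not a recommendation): with
cpu.cfs_period_us = 100000, cpu.cfs_quota_us = 400000 and
cpu.cfs_burst_us = 200000, the long-run average stays bounded by 4 CPUs,
while a group that has accumulated enough unused quota may consume up to
buffer = quota + burst = 600000us of CPU time within a single period.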

Co-developed-by: Shanpei Chen 
Signed-off-by: Shanpei Chen 
Signed-off-by: Huaixin Chang 
---
 kernel/sched/core.c  | 97 +++-
 kernel/sched/fair.c  |  2 ++
 kernel/sched/sched.h |  2 ++
 3 files changed, 84 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98191218d891..708c31e6ce1f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8952,7 +8952,8 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
+   u64 burst)
 {
int i, ret = 0, runtime_enabled, runtime_was_enabled;
	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -8982,6 +8983,12 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota)
if (quota != RUNTIME_INF && quota > max_cfs_runtime)
return -EINVAL;
 
+   /*
+* Bound burst to defend burst against overflow during bandwidth shift.
+*/
+   if (burst > max_cfs_runtime)
+   return -EINVAL;
+
/*
 * Prevent race between setting of cfs_rq->runtime_enabled and
 * unthrottle_offline_cfs_rqs().
@@ -9003,12 +9010,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota)
	raw_spin_lock_irq(&cfs_b->lock);
cfs_b->period = ns_to_ktime(period);
cfs_b->quota = quota;
+   cfs_b->burst = burst;
 
__refill_cfs_bandwidth_runtime(cfs_b);
 
-   /* Restart the period timer (if active) to handle new period expiry: */
-   if (runtime_enabled)
+   if (runtime_enabled) {
+   cfs_b->buffer = min(max_cfs_runtime, quota + burst);
+
+   /* Restart the period timer (if active) to handle new period 
expiry: */
start_cfs_bandwidth(cfs_b);
+   }
 
	raw_spin_unlock_irq(&cfs_b->lock);
 
@@ -9036,9 +9047,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota)
 
 static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 {
-   u64 quota, period;
+   u64 quota, period, burst;
 
period = ktime_to_ns(tg->cfs_bandwidth.period);
+   burst = tg->cfs_bandwidth.burst;
if (cfs_quota_us < 0)
quota = RUNTIME_INF;
else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
@@ -9046,7 +9058,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long 
cfs_quota_us)
else
return -EINVAL;
 
-   return tg_set_cfs_bandwidth(tg, period, quota);
+   return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_quota(struct task_group *tg)
@@ -9064,15 +9076,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
 
 static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 {
-   u64 quota, period;
+   u64 quota, period, burst;
 
if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
return -EINVAL;
 
period = (u64)cfs_period_us * NSEC_PER_USEC;
quota = tg->cfs_bandwidth.quota;
+   burst = tg->cfs_bandwidth.burst;
 
-   return tg_set_cfs_bandwidth(tg, period, quota);
+   return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_period(struct task_group *tg)
@@ -9085,6 +9098,35 @@ static long tg_get_cfs_period(struct task_group *tg)
return cfs_period_us;
 }
 
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
+{
+   u64 quota, period, burst;
+
+   period = ktime_to_ns(tg->cfs_bandwidth.period);
+   quota = tg->cfs_bandwidth.quota;
+   if (cfs_burst_us < 0)
+   burst = RUNTIME_INF;
+   else if ((u64)cfs_burst_us <= U64_MAX / NSEC_PER_USEC)
+   burst = (u64)cfs_burst_us * NSEC_PER_USEC;
+   else
+   return -EINVAL;
+
+   return tg_set_cfs_bandwidth(tg, period, quota, burst);
+}
+
+static long tg_get_cfs_burst(struct task_group *tg)
+{
+   u64 burst_us;
+
+   if (tg->cfs_bandwidth.burst == RUNTIME_INF)
+   return -1;
+
+   burst_us = tg->cfs_bandwidth.burst;
+   do_div(burst_us, NSEC_PER_USEC);
+
+   return burst_us;
+}
+
 static s64 cpu_

[PATCH v4 0/4] sched/fair: Burstable CFS bandwidth controller

2021-03-15 Thread Huaixin Chang
Changelog:

v4:
- Adjust assignments in tg_set_cfs_bandwidth(), saving unnecessary
  assignments when quota == RUNTIME_INF.
- Get rid of sysctl_sched_cfs_bw_burst_onset_percent, as there seems to be
  no justification either for controlling the start bandwidth or for
  expressing it as a percentage.
- Improve the comment on the sched_cfs_period_timer() shifts, explaining
  why max_overrun shifting to 0 is a problem.
- Rename previous_runtime to runtime_at_period_start.
- Add cgroup2 interface and documentation.
- Stop exposing current_bw, as there is not enough justification for it
  and it has an updating problem.
- Add justification on cpu.stat change in the changelog.
- Rebase upon v5.12-rc3.
- Correct SoB chain.
- Several indentation fixes.
- Adjust quota in schbench test from 70 to 60.

v3:
- Fix another issue reported by test robot.
- Update docs as Randy Dunlap suggested.
Link:
https://lore.kernel.org/lkml/20210120122715.29493-1-changhuai...@linux.alibaba.com/

v2:
- Fix an issue reported by test robot.
- Rewriting docs. Appreciate any further suggestions or help.
Link:
https://lore.kernel.org/lkml/20210121110453.18899-1-changhuai...@linux.alibaba.com/

v1 Link:
https://lore.kernel.org/lkml/20201217074620.58338-1-changhuai...@linux.alibaba.com/

The CFS bandwidth controller limits CPU requests of a task group to
quota during each period. However, parallel workloads might be bursty,
so that they get throttled. And they are latency sensitive at the same
time, so that throttling them is undesired.

Scaling up period and quota allows greater burst capacity, but it might
cause longer stalls until the next refill. We introduce "burst" to allow
accumulating unused quota from previous periods, to be used when
a task group requests more CPU than quota during a specific period. This
allows CPU time requests as long as the average requested CPU time is
below quota in the long run. The maximum accumulation is capped by burst
and is set to 0 by default, thus the traditional behaviour remains.
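
A rough worked example (numbers invented for illustration): with
period = 100ms, quota = 400ms (4 CPUs on average) and burst = 400ms, a
group that consumes only 200ms in one period carries the unused 200ms
forward (capped at burst), so in the next period it may consume up to
600ms of CPU time without being throttled, while its long-run average
remains bounded by quota.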

A huge drop of the 99th percentile tail latency, from more than 500ms to 27ms,
is seen for real Java workloads when using burst. Similar drops are seen when
testing with schbench too:

echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 60 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 10 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
echo 40 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

# The average CPU usage is around 500%, which is 200ms CPU time
# every 40ms.
./schbench -m 1 -t 30 -r 10 -c 1 -R 500

Without burst:

Latency percentiles (usec)
50.th: 7
75.th: 8
90.th: 9
95.th: 10
*99.th: 933
99.5000th: 981
99.9000th: 3068
min=0, max=20054
rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 
9.33%

With burst:

Latency percentiles (usec)
50.th: 7
75.th: 8
90.th: 9
95.th: 9
*99.th: 12
99.5000th: 13
99.9000th: 19
min=0, max=406
rps: 498.36 p95 (usec) 9 p99 (usec) 12 p95/cputime 0.09% p99/cputime 
0.12%

How much a workload will benefit from burstable CFS bandwidth control
depends on how bursty and how latency sensitive it is.

Previously, Cong Wang and Konstantin Khlebnikov proposed similar
feature:
https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangc...@gmail.com/
https://lore.kernel.org/lkml/157476581065.5793.4518979877345136813.stgit@buzz/

This time we present more latency statistics and handle overflow while
accumulating.

Huaixin Chang (4):
  sched/fair: Introduce primitives for CFS bandwidth burst
  sched/fair: Make CFS bandwidth controller burstable
  sched/fair: Add cfs bandwidth burst statistics
  sched/fair: Add document for burstable CFS bandwidth control

 Documentation/scheduler/sched-bwc.rst |  49 +++--
 include/linux/sched/sysctl.h  |   2 +
 kernel/sched/core.c   | 126 +-
 kernel/sched/fair.c   |  58 +---
 kernel/sched/sched.h  |   9 ++-
 kernel/sysctl.c   |  18 +
 6 files changed, 232 insertions(+), 30 deletions(-)

-- 
2.14.4.44.g2045bb6



Re: [PATCH v15 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-15 Thread Dmitry Osipenko
15.03.2021 01:31, Michał Mirosław wrote:
> On Thu, Mar 11, 2021 at 08:22:54PM +0300, Dmitry Osipenko wrote:
>> Display controller (DC) performs isochronous memory transfers, and thus,
>> has a requirement for a minimum memory bandwidth that shall be fulfilled,
>> otherwise framebuffer data can't be fetched fast enough and this results
>> in a DC's data-FIFO underflow that follows by a visual corruption.
> [...]
>> +static unsigned long
>> +tegra_plane_overlap_mask(struct drm_crtc_state *state,
>> + const struct drm_plane_state *plane_state)
>> +{
>> +const struct drm_plane_state *other_state;
>> +const struct tegra_plane *tegra;
>> +unsigned long overlap_mask = 0;
>> +struct drm_plane *plane;
>> +struct drm_rect rect;
>> +
>> +if (!plane_state->visible || !plane_state->fb)
>> +return 0;
>> +
>> +drm_atomic_crtc_state_for_each_plane_state(plane, other_state, state) {
> [...]
>> +/*
>> + * Data-prefetch FIFO will easily help to overcome temporal memory
>> + * pressure if other plane overlaps with the cursor plane.
>> + */
>> +if (tegra_plane_is_cursor(plane_state) && overlap_mask)
>> +return 0;
>> +
>> +return overlap_mask;
>> +}
> 
> Since for cursor plane this always returns 0, you could test
> tegra_plane_is_cursor() at the start of the function.

Yes, thanks.

>> +static int tegra_crtc_calculate_memory_bandwidth(struct drm_crtc *crtc,
>> + struct drm_atomic_state *state)
> [...]
>> +/*
>> +     * For overlapping planes pixel's data is fetched for each plane at
>> + * the same time, hence bandwidths are accumulated in this case.
>> + * This needs to be taken into account for calculating total bandwidth
>> + * consumed by all planes.
>> + *
>> + * Here we get the overlapping state of each plane, which is a
>> + * bitmask of plane indices telling with what planes there is an
>> + * overlap. Note that bitmask[plane] includes BIT(plane) in order
>> + * to make further code nicer and simpler.
>> + */
>> +drm_atomic_crtc_state_for_each_plane_state(plane, plane_state, 
>> new_state) {
>> +tegra_state = to_const_tegra_plane_state(plane_state);
>> +tegra = to_tegra_plane(plane);
>> +
>> +if (WARN_ON_ONCE(tegra->index >= TEGRA_DC_LEGACY_PLANES_NUM))
>> +return -EINVAL;
>> +
>> +plane_peak_bw[tegra->index] = 
>> tegra_state->peak_memory_bandwidth;
>> +mask = tegra_plane_overlap_mask(new_state, plane_state);
>> +overlap_mask[tegra->index] = mask;
>> +
>> +if (hweight_long(mask) != 3)
>> +all_planes_overlap_simultaneously = false;
>> +}
>> +
>> +old_state = drm_atomic_get_old_crtc_state(state, crtc);
>> +old_dc_state = to_const_dc_state(old_state);
>> +
>> +/*
>> + * Then we calculate maximum bandwidth of each plane state.
>> + * The bandwidth includes the plane BW + BW of the "simultaneously"
>> + * overlapping planes, where "simultaneously" means areas where DC
>> + * fetches from the planes simultaneously during of scan-out process.
>> + *
>> + * For example, if plane A overlaps with planes B and C, but B and C
>> + * don't overlap, then the peak bandwidth will be either in area where
>> + * A-and-B or A-and-C planes overlap.
>> + *
>> + * The plane_peak_bw[] contains peak memory bandwidth values of
>> + * each plane, this information is needed by interconnect provider
>> + * in order to set up latency allowness based on the peak BW, see
>> + * tegra_crtc_update_memory_bandwidth().
>> + */
>> +for (i = 0; i < ARRAY_SIZE(plane_peak_bw); i++) {
>> +overlap_bw = 0;
>> +
>> +		for_each_set_bit(k, &overlap_mask[i], 3) {
>> +if (k == i)
>> +continue;
>> +
>> +    if (all_planes_overlap_simultaneously)
>> +overlap_bw += plane_peak_bw[k];
>> +else
>> +overlap_bw = max(overlap_bw, plane_peak_bw[k]);
>> +}
>> +
>> +new_dc_state->plane_peak_bw[i] = plane_peak_bw[i] + overlap_bw;
>> +
>> +/*
>> + * If plane's peak bandwi
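
As a rough standalone illustration of the per-plane peak bandwidth
accumulation quoted above (a sketch with invented numbers, not the driver
code): each plane's peak bandwidth is its own bandwidth plus either the sum
of the overlapping planes' bandwidths (when all planes overlap in the same
area) or the maximum among them (when only pairwise overlaps exist).

/* overlap_bw_sketch.c - not the driver code, example numbers only */
#include <stdio.h>

#define NPLANES 3

int main(void)
{
	unsigned long plane_peak_bw[NPLANES] = { 300, 200, 100 }; /* MB/s */
	/* bitmask of planes each plane overlaps with (includes itself) */
	unsigned long overlap_mask[NPLANES] = { 0x7, 0x3, 0x5 };
	int all_overlap_simultaneously = 0;	/* planes 1 and 2 don't overlap */
	int i, k;

	for (i = 0; i < NPLANES; i++) {
		unsigned long overlap_bw = 0;

		for (k = 0; k < NPLANES; k++) {
			if (k == i || !(overlap_mask[i] & (1UL << k)))
				continue;

			if (all_overlap_simultaneously)
				overlap_bw += plane_peak_bw[k];
			else if (plane_peak_bw[k] > overlap_bw)
				overlap_bw = plane_peak_bw[k];
		}

		printf("plane %d: peak %lu MB/s\n", i,
		       plane_peak_bw[i] + overlap_bw);
	}
	return 0;
}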

[PATCH 5.11 150/306] PCI/LINK: Remove bandwidth notification

2021-03-15 Thread gregkh
From: Greg Kroah-Hartman 

From: Bjorn Helgaas 

[ Upstream commit b4c7d2076b4e767dd2e075a2b3a9e57753fc67f5 ]

The PCIe Bandwidth Change Notification feature logs messages when the link
bandwidth changes.  Some users have reported that these messages occur
often enough to significantly reduce NVMe performance.  GPUs also seem to
generate these messages.

We don't know why the link bandwidth changes, but in the reported cases
there's no indication that it's caused by hardware failures.

Remove the bandwidth change notifications for now.  Hopefully we can add
this back when we have a better understanding of why this happens and how
we can make the messages useful instead of overwhelming.

Link: https://lore.kernel.org/r/20200115221008.ga191...@google.com/
Link: 
https://lore.kernel.org/r/155605909349.3575.13433421148215616375.st...@gimli.home/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206197
Signed-off-by: Bjorn Helgaas 
Signed-off-by: Sasha Levin 
---
 drivers/pci/pcie/Kconfig   |   8 --
 drivers/pci/pcie/Makefile  |   1 -
 drivers/pci/pcie/bw_notification.c | 138 -
 drivers/pci/pcie/portdrv.h |   6 --
 drivers/pci/pcie/portdrv_pci.c |   1 -
 5 files changed, 154 deletions(-)
 delete mode 100644 drivers/pci/pcie/bw_notification.c

diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index 3946555a6042..45a2ef702b45 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -133,14 +133,6 @@ config PCIE_PTM
  This is only useful if you have devices that support PTM, but it
  is safe to enable even if you don't.
 
-config PCIE_BW
-   bool "PCI Express Bandwidth Change Notification"
-   depends on PCIEPORTBUS
-   help
- This enables PCI Express Bandwidth Change Notification.  If
- you know link width or rate changes occur only to correct
- unreliable links, you may answer Y.
-
 config PCIE_EDR
bool "PCI Express Error Disconnect Recover support"
depends on PCIE_DPC && ACPI
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index d9697892fa3e..b2980db88cc0 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -12,5 +12,4 @@ obj-$(CONFIG_PCIEAER_INJECT)  += aer_inject.o
 obj-$(CONFIG_PCIE_PME) += pme.o
 obj-$(CONFIG_PCIE_DPC) += dpc.o
 obj-$(CONFIG_PCIE_PTM) += ptm.o
-obj-$(CONFIG_PCIE_BW)  += bw_notification.o
 obj-$(CONFIG_PCIE_EDR) += edr.o
diff --git a/drivers/pci/pcie/bw_notification.c 
b/drivers/pci/pcie/bw_notification.c
deleted file mode 100644
index 565d23cccb8b..
--- a/drivers/pci/pcie/bw_notification.c
+++ /dev/null
@@ -1,138 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0+
-/*
- * PCI Express Link Bandwidth Notification services driver
- * Author: Alexandru Gagniuc 
- *
- * Copyright (C) 2019, Dell Inc
- *
- * The PCIe Link Bandwidth Notification provides a way to notify the
- * operating system when the link width or data rate changes.  This
- * capability is required for all root ports and downstream ports
- * supporting links wider than x1 and/or multiple link speeds.
- *
- * This service port driver hooks into the bandwidth notification interrupt
- * and warns when links become degraded in operation.
- */
-
-#define dev_fmt(fmt) "bw_notification: " fmt
-
-#include "../pci.h"
-#include "portdrv.h"
-
-static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
-{
-   int ret;
-   u32 lnk_cap;
-
-	ret = pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnk_cap);
-   return (ret == PCIBIOS_SUCCESSFUL) && (lnk_cap & PCI_EXP_LNKCAP_LBNC);
-}
-
-static void pcie_enable_link_bandwidth_notification(struct pci_dev *dev)
-{
-   u16 lnk_ctl;
-
-   pcie_capability_write_word(dev, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
-
-	pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
-   lnk_ctl |= PCI_EXP_LNKCTL_LBMIE;
-   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
-}
-
-static void pcie_disable_link_bandwidth_notification(struct pci_dev *dev)
-{
-   u16 lnk_ctl;
-
-	pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
-   lnk_ctl &= ~PCI_EXP_LNKCTL_LBMIE;
-   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
-}
-
-static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
-{
-   struct pcie_device *srv = context;
-   struct pci_dev *port = srv->port;
-   u16 link_status, events;
-   int ret;
-
-	ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
-   events = link_status & PCI_EXP_LNKSTA_LBMS;
-
-   if (ret != PCIBIOS_SUCCESSFUL || !events)
-   return IRQ_NONE;
-
-   pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
-   pcie_update_link_speed(port->subordinate, link_status);
-   return IRQ_WAKE_THREAD;
-}
-
-stati

[PATCH 5.10 154/290] PCI/LINK: Remove bandwidth notification

2021-03-15 Thread gregkh
From: Greg Kroah-Hartman 

From: Bjorn Helgaas 

[ Upstream commit b4c7d2076b4e767dd2e075a2b3a9e57753fc67f5 ]

The PCIe Bandwidth Change Notification feature logs messages when the link
bandwidth changes.  Some users have reported that these messages occur
often enough to significantly reduce NVMe performance.  GPUs also seem to
generate these messages.

We don't know why the link bandwidth changes, but in the reported cases
there's no indication that it's caused by hardware failures.

Remove the bandwidth change notifications for now.  Hopefully we can add
this back when we have a better understanding of why this happens and how
we can make the messages useful instead of overwhelming.

Link: https://lore.kernel.org/r/20200115221008.ga191...@google.com/
Link: 
https://lore.kernel.org/r/155605909349.3575.13433421148215616375.st...@gimli.home/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206197
Signed-off-by: Bjorn Helgaas 
Signed-off-by: Sasha Levin 
---
 drivers/pci/pcie/Kconfig   |   8 --
 drivers/pci/pcie/Makefile  |   1 -
 drivers/pci/pcie/bw_notification.c | 138 -
 drivers/pci/pcie/portdrv.h |   6 --
 drivers/pci/pcie/portdrv_pci.c |   1 -
 5 files changed, 154 deletions(-)
 delete mode 100644 drivers/pci/pcie/bw_notification.c

diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index 3946555a6042..45a2ef702b45 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -133,14 +133,6 @@ config PCIE_PTM
  This is only useful if you have devices that support PTM, but it
  is safe to enable even if you don't.
 
-config PCIE_BW
-   bool "PCI Express Bandwidth Change Notification"
-   depends on PCIEPORTBUS
-   help
- This enables PCI Express Bandwidth Change Notification.  If
- you know link width or rate changes occur only to correct
- unreliable links, you may answer Y.
-
 config PCIE_EDR
bool "PCI Express Error Disconnect Recover support"
depends on PCIE_DPC && ACPI
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index 68da9280ff11..9a7085668466 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -12,5 +12,4 @@ obj-$(CONFIG_PCIEAER_INJECT)  += aer_inject.o
 obj-$(CONFIG_PCIE_PME) += pme.o
 obj-$(CONFIG_PCIE_DPC) += dpc.o
 obj-$(CONFIG_PCIE_PTM) += ptm.o
-obj-$(CONFIG_PCIE_BW)  += bw_notification.o
 obj-$(CONFIG_PCIE_EDR) += edr.o
diff --git a/drivers/pci/pcie/bw_notification.c 
b/drivers/pci/pcie/bw_notification.c
deleted file mode 100644
index 565d23cccb8b..
--- a/drivers/pci/pcie/bw_notification.c
+++ /dev/null
@@ -1,138 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0+
-/*
- * PCI Express Link Bandwidth Notification services driver
- * Author: Alexandru Gagniuc 
- *
- * Copyright (C) 2019, Dell Inc
- *
- * The PCIe Link Bandwidth Notification provides a way to notify the
- * operating system when the link width or data rate changes.  This
- * capability is required for all root ports and downstream ports
- * supporting links wider than x1 and/or multiple link speeds.
- *
- * This service port driver hooks into the bandwidth notification interrupt
- * and warns when links become degraded in operation.
- */
-
-#define dev_fmt(fmt) "bw_notification: " fmt
-
-#include "../pci.h"
-#include "portdrv.h"
-
-static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
-{
-   int ret;
-   u32 lnk_cap;
-
-	ret = pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnk_cap);
-   return (ret == PCIBIOS_SUCCESSFUL) && (lnk_cap & PCI_EXP_LNKCAP_LBNC);
-}
-
-static void pcie_enable_link_bandwidth_notification(struct pci_dev *dev)
-{
-   u16 lnk_ctl;
-
-   pcie_capability_write_word(dev, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
-
-	pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
-   lnk_ctl |= PCI_EXP_LNKCTL_LBMIE;
-   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
-}
-
-static void pcie_disable_link_bandwidth_notification(struct pci_dev *dev)
-{
-   u16 lnk_ctl;
-
-	pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
-   lnk_ctl &= ~PCI_EXP_LNKCTL_LBMIE;
-   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
-}
-
-static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
-{
-   struct pcie_device *srv = context;
-   struct pci_dev *port = srv->port;
-   u16 link_status, events;
-   int ret;
-
-	ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
-   events = link_status & PCI_EXP_LNKSTA_LBMS;
-
-   if (ret != PCIBIOS_SUCCESSFUL || !events)
-   return IRQ_NONE;
-
-   pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
-   pcie_update_link_speed(port->subordinate, link_status);
-   return IRQ_WAKE_THREAD;
-}
-
-stati

Re: [PATCH v15 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-14 Thread Michał Mirosław
On Thu, Mar 11, 2021 at 08:22:54PM +0300, Dmitry Osipenko wrote:
> Display controller (DC) performs isochronous memory transfers, and thus,
> has a requirement for a minimum memory bandwidth that shall be fulfilled,
> otherwise framebuffer data can't be fetched fast enough and this results
> in a DC's data-FIFO underflow that follows by a visual corruption.
[...]
> +static unsigned long
> +tegra_plane_overlap_mask(struct drm_crtc_state *state,
> +  const struct drm_plane_state *plane_state)
> +{
> + const struct drm_plane_state *other_state;
> + const struct tegra_plane *tegra;
> + unsigned long overlap_mask = 0;
> + struct drm_plane *plane;
> + struct drm_rect rect;
> +
> + if (!plane_state->visible || !plane_state->fb)
> + return 0;
> +
> + drm_atomic_crtc_state_for_each_plane_state(plane, other_state, state) {
[...]
> + /*
> +  * Data-prefetch FIFO will easily help to overcome temporal memory
> +  * pressure if other plane overlaps with the cursor plane.
> +  */
> + if (tegra_plane_is_cursor(plane_state) && overlap_mask)
> + return 0;
> +
> + return overlap_mask;
> +}

Since for cursor plane this always returns 0, you could test
tegra_plane_is_cursor() at the start of the function.

> +static int tegra_crtc_calculate_memory_bandwidth(struct drm_crtc *crtc,
> +  struct drm_atomic_state *state)
[...]
> + /*
> +  * For overlapping planes pixel's data is fetched for each plane at
> +  * the same time, hence bandwidths are accumulated in this case.
> +  * This needs to be taken into account for calculating total bandwidth
> +  * consumed by all planes.
> +  *
> +  * Here we get the overlapping state of each plane, which is a
> +  * bitmask of plane indices telling with what planes there is an
> +  * overlap. Note that bitmask[plane] includes BIT(plane) in order
> +  * to make further code nicer and simpler.
> +  */
> + drm_atomic_crtc_state_for_each_plane_state(plane, plane_state, 
> new_state) {
> + tegra_state = to_const_tegra_plane_state(plane_state);
> + tegra = to_tegra_plane(plane);
> +
> + if (WARN_ON_ONCE(tegra->index >= TEGRA_DC_LEGACY_PLANES_NUM))
> + return -EINVAL;
> +
> + plane_peak_bw[tegra->index] = 
> tegra_state->peak_memory_bandwidth;
> + mask = tegra_plane_overlap_mask(new_state, plane_state);
> + overlap_mask[tegra->index] = mask;
> +
> + if (hweight_long(mask) != 3)
> + all_planes_overlap_simultaneously = false;
> + }
> +
> + old_state = drm_atomic_get_old_crtc_state(state, crtc);
> + old_dc_state = to_const_dc_state(old_state);
> +
> + /*
> +  * Then we calculate maximum bandwidth of each plane state.
> +  * The bandwidth includes the plane BW + BW of the "simultaneously"
> +  * overlapping planes, where "simultaneously" means areas where DC
> +  * fetches from the planes simultaneously during of scan-out process.
> +  *
> +  * For example, if plane A overlaps with planes B and C, but B and C
> +  * don't overlap, then the peak bandwidth will be either in area where
> +  * A-and-B or A-and-C planes overlap.
> +  *
> +  * The plane_peak_bw[] contains peak memory bandwidth values of
> +  * each plane, this information is needed by interconnect provider
> +  * in order to set up latency allowness based on the peak BW, see
> +  * tegra_crtc_update_memory_bandwidth().
> +  */
> + for (i = 0; i < ARRAY_SIZE(plane_peak_bw); i++) {
> + overlap_bw = 0;
> +
> +	for_each_set_bit(k, &overlap_mask[i], 3) {
> + if (k == i)
> + continue;
> +
> + if (all_planes_overlap_simultaneously)
> +     overlap_bw += plane_peak_bw[k];
> + else
> + overlap_bw = max(overlap_bw, plane_peak_bw[k]);
> + }
> +
> +     new_dc_state->plane_peak_bw[i] = plane_peak_bw[i] + overlap_bw;
> +
> + /*
> +  * If plane's peak bandwidth changed (for example plane isn't
> +  * overlapped anymore) and plane isn't in the atomic state,
> +  * then add plane to the state in order to have the bandwidth
> +  * updated.
> +  */
> + if (old_dc_state->plane_peak_bw[i] !=
> + new_dc_state->plane_peak_bw[i]) {
> + p

Re: [PATCH v3 2/4] sched/fair: Make CFS bandwidth controller burstable

2021-03-12 Thread changhuaixin



> On Mar 10, 2021, at 9:04 PM, Peter Zijlstra  wrote:
> 
> On Thu, Jan 21, 2021 at 07:04:51PM +0800, Huaixin Chang wrote:
>> Accumulate unused quota from previous periods, thus accumulated
>> bandwidth runtime can be used in the following periods. During
>> accumulation, take care of runtime overflow. Previous non-burstable
>> CFS bandwidth controller only assign quota to runtime, that saves a lot.
>> 
>> A sysctl parameter sysctl_sched_cfs_bw_burst_onset_percent is introduced to
>> denote what percentage of burst is granted on setting cfs bandwidth. By
>> default it is 0, which means no burst is allowed unless accumulated.
>> 
>> Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
>> switch for burst. It is enabled by default.
>> 
>> Signed-off-by: Huaixin Chang 
>> Signed-off-by: Shanpei Chen 
> 
> Identical invalid SoB chain.
> 
>> Reported-by: kernel test robot 
> 
> What exactly did the robot report; the whole patch?

A warning was reported by the robot, and I have fixed it in this series. I'll
remove this line, since it seems unnecessary.

> 
>> ---
>> include/linux/sched/sysctl.h |  2 ++
>> kernel/sched/core.c  | 31 +
>> kernel/sched/fair.c  | 47 
>> 
>> kernel/sched/sched.h |  4 ++--
>> kernel/sysctl.c  | 18 +
>> 5 files changed, 88 insertions(+), 14 deletions(-)
>> 
>> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>> index 3c31ba88aca5..3400828eaf2d 100644
>> --- a/include/linux/sched/sysctl.h
>> +++ b/include/linux/sched/sysctl.h
>> @@ -72,6 +72,8 @@ extern unsigned int 
>> sysctl_sched_uclamp_util_min_rt_default;
>> 
>> #ifdef CONFIG_CFS_BANDWIDTH
>> extern unsigned int sysctl_sched_cfs_bandwidth_slice;
>> +extern unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
>> +extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
>> #endif
>> 
>> #ifdef CONFIG_SCHED_AUTOGROUP
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 48d3bad12be2..fecf0f05ef0c 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -66,6 +66,16 @@ const_debug unsigned int sysctl_sched_features =
>>  */
>> const_debug unsigned int sysctl_sched_nr_migrate = 32;
>> 
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +/*
>> + * Percent of burst assigned to cfs_b->runtime on tg_set_cfs_bandwidth,
>> + * 0 by default.
>> + */
>> +unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
>> +
>> +unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
>> +#endif
> 
> There's already an #ifdef block that contains that bandwidth_slice
> thing, see the previous hunk, so why create a new #ifdef here?
> 
> Also, personally I think percentages are over-represented as members of
> Q.
> 
Sorry, I don't quite understand the "members of Q". Is this saying that the 
percentages
are over-designed here?

>> @@ -7891,7 +7901,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
>> const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
>> static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
>> /* More than 203 days if BW_SHIFT equals 20. */
>> -static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>> +const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>> 
>> static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>> 
>> @@ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
>> u64 period, u64 quota,
>> {
>>  int i, ret = 0, runtime_enabled, runtime_was_enabled;
>> 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>> -u64 buffer;
>> +u64 buffer, burst_onset;
>> 
>>  if (tg == _task_group)
>>  return -EINVAL;
>> @@ -7961,11 +7971,24 @@ static int tg_set_cfs_bandwidth(struct task_group 
>> *tg, u64 period, u64 quota,
>>  cfs_b->burst = burst;
>>  cfs_b->buffer = buffer;
>> 
>> -__refill_cfs_bandwidth_runtime(cfs_b);
>> +cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
>> +cfs_b->runtime = cfs_b->quota;
>> +
>> +/* burst_onset needed */
>> +if (cfs_b->quota != RUNTIME_INF &&
>> +sysctl_sched_cfs_bw_burst_enabled &&
>> +sysctl_sched_cfs_bw_burst_onset_percent > 0) {
> 
> 'creative' indentation again...
> 
> Also, this gives rise to the question as to why onset_percent is
> sep

Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller

2021-03-12 Thread changhuaixin



> On Mar 10, 2021, at 7:11 PM, Odin Ugedal  wrote:
> 
> Hi,
> 
>> If there are cases where the "start bandwidth" matters, I think there is a
>> need to expose the
>> "start bandwidth" explicitly too. However, I doubt the existence of such 
>> cases from my view
>> and the two examples above.
> 
> Yeah, I don't think there will be any cases where users will be
> "depending" on having burst available,
> so I agree in that sense.
> 
>> In my thoughts, this patchset keeps cgroup usage within the quota in the 
>> longer term, and allows
>> cgroup to respond to a burst of work with the help of a reasonable burst 
>> buffer. If quota is set correctly
>> above average usage, and enough burst buffer is set to meet the needs of
>> bursty work, then it makes no difference whether this cgroup runs with 0 start bandwidth
>> or all of it.
>> Thus I used sysctl_sched_cfs_bw_burst_onset_percent to decided the start 
>> bandwidth
>> to leave some convenience here. If this sysctl interface is confusing, I 
>> wonder whether it
>> is a good idea not to expose this interface.
>> 
>> For the first case mentioned above, if Kubernet users care the "start 
>> bandwidth" for process startup,
>> maybe it is better to give all of it rather than a part?
> 
> Yeah, I am a bit afraid there will be some confusion, so not sure if
> the sysctl is the best way to do it.
> 
> But I would like feedback from others to highlight the problem as
> well, that would be helpful. I think a simple "API"
> where you get 0 burst or full burst on "set" (the one we decide on)
> would be best to avoid unnecessary complexity.
> 
> Start burst when starting up a new process in a new cgroup might be
> helpful, so maybe that is a vote for
> full burst? However, in long term that doesn't matter, so 0 burst on
> start would work as well.
> 
>> For the second case with quota changes over time, I think it is important 
>> making sure each change works
>> long enough to enforce average quota limit. Does it really matter to control 
>> "start burst" on each change?
> 
> No, I don't think so. Doing so would be another thing to set per
> cgroup, and that would just clutter the api
> more than necessary imo., since we cannot come up with any real use cases.
> 
>> It is a copy of runtime at period start, used to calculate burst time
>> during a period.
>> Not quite remaining_runtime_prev_period.
> 
> Ahh, I see, I misunderstood the code. So in essence it is
> "runtime_at_period_start"?
> 

Yes, it is "runtime_at_period_start".

>> Yeah, there is the updating problem. It is okay not to expose cfs_b->runtime
>> then.
> 
> Yeah, I think dropping it all together is the best solution.
> 
> 
>> This comment does not mean any loss or unnecessary throttle for the present
>> cfs_b.
>> All this means is that all quota refilling that is not done during timer 
>> stop should be
>> refilled on timer start, for the burstable cfsb.
>> 
>> Maybe I shall change this comment in some way if it is misleading?
> 
> I think I formulated my question badly. The comment makes sense, I am
> just trying to compare how "start_cfs_bandwidth"
> works after your patch compared to how it works currently. As I
> understand, without this patch "start_cfs_bandwidth" will
> never refill runtime, while with your patch, it will refill even when
> overrun=0 with burst disabled. Is that an intended change in
> behavior, or am I not understanding the patch?
> 

Good point. The way "start_cfs_bandwidth" works is indeed changed. The present
cfs_b doesn't have to refill bandwidth, because quota is not consumed during the
period before the timer stops. With this patch, runtime is refilled whether or
not burst is enabled. Do you suggest refilling runtime here only when burst is
enabled?
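
To make this concrete, here is a minimal userspace sketch of the accumulation
rule, assuming buffer = quota + burst as in the patch; the helper and its name
are only for illustration and are not the kernel code:

#include <stdint.h>

#define RUNTIME_INF ((uint64_t)~0ULL)

/* Accumulate unused quota across 'overrun' elapsed periods, capped at buffer. */
static uint64_t refill_runtime(uint64_t runtime, uint64_t quota,
			       uint64_t buffer, uint64_t overrun)
{
	if (quota == RUNTIME_INF)	/* unconstrained group: nothing to refill */
		return runtime;

	runtime += overrun * quota;	/* one quota's worth per elapsed period */
	if (runtime > buffer)		/* never accumulate past quota + burst */
		runtime = buffer;
	return runtime;
}

Under this sketch a timer start with overrun == 0 changes nothing, which is
exactly the behavioural question raised above.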

> 
> On another note, I have also been testing this patch, and I am not
> able to reproduce your schbench results. Both with and without burst,
> it gets the same result, and no nr_throttled stays at 0 (tested on a
> 32-core system). Can you try to rerun your tests with the mainline
> to see if you still get the same results? (Also, I see you are running
> with 30 threads. How many cores do your test setup have?). To actually
> say that the result is real, all cores used should maybe be
> exclusively reserved as well, to avoid issues where other processes
> cause a
> spike in latency.
> 

Spikes indeed cause trouble. If nr_throttled stays at 0, I suggest changing quota
from 70 to 60,
which is 

[RFC PATCH v2 05/11] bfq: keep the minimum bandwidth for be_class

2021-03-12 Thread brookxu
From: Chunguang Xu 

rt_class will preempt other classes, which may cause other
classes to starve to death. At present, idle_class has
alleviated the starvation problem through the minimum
bandwidth mechanism. Similarly, we should do the same for
be_class.

Signed-off-by: Chunguang Xu 
---
 block/bfq-iosched.c |  6 +++--
 block/bfq-iosched.h | 11 ++---
 block/bfq-wf2q.c| 59 -
 3 files changed, 53 insertions(+), 23 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 91e903f1e550..ab00b664348c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6542,9 +6542,11 @@ static void bfq_init_root_group(struct bfq_group 
*root_group,
root_group->bfqd = bfqd;
 #endif
root_group->rq_pos_tree = RB_ROOT;
-   for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+   for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
-   root_group->sched_data.bfq_class_idle_last_service = jiffies;
+   root_group->sched_data.bfq_class_last_service[i] = jiffies;
+   }
+   root_group->sched_data.class_timeout_last_check = jiffies;
 }
 
 static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 3416a75f47da..de7301664ad3 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -13,7 +13,7 @@
 #include "blk-cgroup-rwstat.h"
 
 #define BFQ_IOPRIO_CLASSES 3
-#define BFQ_CL_IDLE_TIMEOUT(HZ/5)
+#define BFQ_CLASS_TIMEOUT  (HZ/5)
 
 #define BFQ_MIN_WEIGHT 1
 #define BFQ_MAX_WEIGHT 1000
@@ -97,9 +97,12 @@ struct bfq_sched_data {
struct bfq_entity *next_in_service;
/* array of service trees, one per ioprio_class */
struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
-   /* last time CLASS_IDLE was served */
-   unsigned long bfq_class_idle_last_service;
-
+   /* last time the class was served */
+   unsigned long bfq_class_last_service[BFQ_IOPRIO_CLASSES];
+   /* last time class timeout was checked */
+   unsigned long class_timeout_last_check;
+   /* next index to check class timeout */
+   unsigned int next_class_index;
 };
 
 /**
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 7405be960a92..0ac35fd4f2ab 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1188,6 +1188,7 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, 
bool ins_into_idle_tree)
 {
struct bfq_sched_data *sd = entity->sched_data;
struct bfq_service_tree *st;
+   int idx = bfq_class_idx(entity);
bool is_in_service;
 
if (!entity->on_st_or_in_serv) /*
@@ -1227,6 +1228,7 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, 
bool ins_into_idle_tree)
else
bfq_idle_insert(st, entity);
 
+   sd->bfq_class_last_service[idx] = jiffies;
return true;
 }
 
@@ -1455,6 +1457,45 @@ __bfq_lookup_next_entity(struct bfq_service_tree *st, 
bool in_service)
return entity;
 }
 
+static int bfq_select_next_class(struct bfq_sched_data *sd)
+{
+   struct bfq_service_tree *st = sd->service_tree;
+   unsigned long last_check, last_serve;
+   int i, class_idx, next_class = 0;
+   bool found = false;
+
+   /*
+    * we needed to guarantee a minimum bandwidth for each class (if
+* there is some active entity in this class). This should also
+* mitigate priority-inversion problems in case a low priority
+* task is holding file system resources.
+*/
+   last_check = sd->class_timeout_last_check;
+   if (time_is_after_jiffies(last_check + BFQ_CLASS_TIMEOUT))
+   return next_class;
+
+   sd->class_timeout_last_check = jiffies;
+   for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
+   class_idx = (sd->next_class_index + i) % BFQ_IOPRIO_CLASSES;
+   last_serve = sd->bfq_class_last_service[class_idx];
+
+   if (time_is_after_jiffies(last_serve + BFQ_CLASS_TIMEOUT))
+   continue;
+
+   if (!RB_EMPTY_ROOT(&(st + class_idx)->active)) {
+   if (found)
+   continue;
+
+   next_class = class_idx++;
+   class_idx %= BFQ_IOPRIO_CLASSES;
+   sd->next_class_index = class_idx;
+   found = true;
+   }
+   sd->bfq_class_last_service[class_idx] = jiffies;
+   }
+   return next_class;
+}
+
 /**
  * bfq_lookup_next_entity - return the first eligible entity in @sd.
  * @sd: the sched_data.
@@ -1468,24 +1509,8 @@ static struct bfq_entity *bfq_lookup_next_entity(struct 
bfq_sched_data *sd,
 bool expiration)
 {
struct bfq_service_t
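
As a reading aid for the timeout test in bfq_select_next_class() above (a
sketch only, assuming the BFQ_CLASS_TIMEOUT definition from the bfq-iosched.h
hunk): the jiffies helpers compare against the current time, so the early
return fires while the deadline still lies in the future.

#include <linux/jiffies.h>

/* true once BFQ_CLASS_TIMEOUT jiffies have passed since last_check */
static bool class_check_due(unsigned long last_check)
{
	/* time_is_after_jiffies(t) holds while t is still in the future */
	return !time_is_after_jiffies(last_check + BFQ_CLASS_TIMEOUT);
}

The earlier RFC version of this patch expressed the same condition with
time_is_before_jiffies() inside the loop instead of an early return.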

[PATCH v15 0/2] Add memory bandwidth management to NVIDIA Tegra DRM driver

2021-03-11 Thread Dmitry Osipenko
This series adds memory bandwidth management to the NVIDIA Tegra DRM driver,
which is done using interconnect framework. It fixes display corruption that
happens due to insufficient memory bandwidth.

Changelog:

v15: - Corrected tegra_plane_icc_names[] NULL-check that was partially lost
   by accident in v14 after unsuccessful rebase.

v14: - Made improvements that were suggested by Michał Mirosław to v13:

   - Changed 'unsigned int' to 'bool'.
   - Renamed functions which calculate bandwidth state.
   - Reworked comment in the code that explains why downscaled planes
 require higher bandwidth.
   - Added round-up to bandwidth calculation.
   - Added sanity checks of the plane index and fixed out-of-bounds
 access which happened on T124 due to the cursor plane index.

v13: - No code changes. Patches missed v5.12, re-sending them for v5.13.

Dmitry Osipenko (2):
  drm/tegra: dc: Support memory bandwidth management
  drm/tegra: dc: Extend debug stats with total number of events

 drivers/gpu/drm/tegra/Kconfig |   1 +
 drivers/gpu/drm/tegra/dc.c| 362 ++
 drivers/gpu/drm/tegra/dc.h|  19 ++
 drivers/gpu/drm/tegra/drm.c   |  14 ++
 drivers/gpu/drm/tegra/hub.c   |   3 +
 drivers/gpu/drm/tegra/plane.c | 127 
 drivers/gpu/drm/tegra/plane.h |  15 ++
 7 files changed, 541 insertions(+)

-- 
2.29.2



[PATCH v15 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-11 Thread Dmitry Osipenko
Display controller (DC) performs isochronous memory transfers, and thus,
has a requirement for a minimum memory bandwidth that shall be fulfilled,
otherwise framebuffer data can't be fetched fast enough and this results
in a DC data-FIFO underflow, which is followed by visual corruption.

The Memory Controller drivers provide facility for memory bandwidth
management via interconnect API. Let's wire up the interconnect API
support to the DC driver in order to fix the distorted display output
on T30 Ouya, T124 TK1 and other Tegra devices.

Tested-by: Peter Geis  # Ouya T30
Tested-by: Matt Merhar  # Ouya T30
Tested-by: Nicolas Chauvet  # PAZ00 T20 and TK1 T124
Signed-off-by: Dmitry Osipenko 
---
 drivers/gpu/drm/tegra/Kconfig |   1 +
 drivers/gpu/drm/tegra/dc.c| 352 ++
 drivers/gpu/drm/tegra/dc.h|  14 ++
 drivers/gpu/drm/tegra/drm.c   |  14 ++
 drivers/gpu/drm/tegra/hub.c   |   3 +
 drivers/gpu/drm/tegra/plane.c | 127 
 drivers/gpu/drm/tegra/plane.h |  15 ++
 7 files changed, 526 insertions(+)

diff --git a/drivers/gpu/drm/tegra/Kconfig b/drivers/gpu/drm/tegra/Kconfig
index 5043dcaf1cf9..1650a448eabd 100644
--- a/drivers/gpu/drm/tegra/Kconfig
+++ b/drivers/gpu/drm/tegra/Kconfig
@@ -9,6 +9,7 @@ config DRM_TEGRA
select DRM_MIPI_DSI
select DRM_PANEL
select TEGRA_HOST1X
+   select INTERCONNECT
select IOMMU_IOVA
select CEC_CORE if CEC_NOTIFIER
help
diff --git a/drivers/gpu/drm/tegra/dc.c b/drivers/gpu/drm/tegra/dc.c
index 0ae3a025efe9..49fa488cf930 100644
--- a/drivers/gpu/drm/tegra/dc.c
+++ b/drivers/gpu/drm/tegra/dc.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -616,6 +617,9 @@ static int tegra_plane_atomic_check(struct drm_plane *plane,
struct tegra_dc *dc = to_tegra_dc(state->crtc);
int err;
 
+   plane_state->peak_memory_bandwidth = 0;
+   plane_state->avg_memory_bandwidth = 0;
+
/* no need for further checks if the plane is being disabled */
if (!state->crtc)
return 0;
@@ -802,6 +806,12 @@ static struct drm_plane *tegra_primary_plane_create(struct 
drm_device *drm,
formats = dc->soc->primary_formats;
modifiers = dc->soc->modifiers;
 
+   err = tegra_plane_interconnect_init(plane);
+   if (err) {
+   kfree(plane);
+   return ERR_PTR(err);
+   }
+
err = drm_universal_plane_init(drm, >base, possible_crtcs,
   _plane_funcs, formats,
   num_formats, modifiers, type, NULL);
@@ -833,9 +843,13 @@ static const u32 tegra_cursor_plane_formats[] = {
 static int tegra_cursor_atomic_check(struct drm_plane *plane,
 struct drm_plane_state *state)
 {
+   struct tegra_plane_state *plane_state = to_tegra_plane_state(state);
struct tegra_plane *tegra = to_tegra_plane(plane);
int err;
 
+   plane_state->peak_memory_bandwidth = 0;
+   plane_state->avg_memory_bandwidth = 0;
+
/* no need for further checks if the plane is being disabled */
if (!state->crtc)
return 0;
@@ -973,6 +987,12 @@ static struct drm_plane 
*tegra_dc_cursor_plane_create(struct drm_device *drm,
num_formats = ARRAY_SIZE(tegra_cursor_plane_formats);
formats = tegra_cursor_plane_formats;
 
+   err = tegra_plane_interconnect_init(plane);
+   if (err) {
+   kfree(plane);
+   return ERR_PTR(err);
+   }
+
err = drm_universal_plane_init(drm, >base, possible_crtcs,
   _plane_funcs, formats,
   num_formats, NULL,
@@ -1087,6 +1107,12 @@ static struct drm_plane 
*tegra_dc_overlay_plane_create(struct drm_device *drm,
num_formats = dc->soc->num_overlay_formats;
formats = dc->soc->overlay_formats;
 
+   err = tegra_plane_interconnect_init(plane);
+   if (err) {
+   kfree(plane);
+   return ERR_PTR(err);
+   }
+
if (!cursor)
type = DRM_PLANE_TYPE_OVERLAY;
else
@@ -1204,6 +1230,7 @@ tegra_crtc_atomic_duplicate_state(struct drm_crtc *crtc)
 {
struct tegra_dc_state *state = to_dc_state(crtc->state);
struct tegra_dc_state *copy;
+   unsigned int i;
 
copy = kmalloc(sizeof(*copy), GFP_KERNEL);
if (!copy)
@@ -1215,6 +1242,9 @@ tegra_crtc_atomic_duplicate_state(struct drm_crtc *crtc)
copy->div = state->div;
copy->planes = state->planes;
 
+   for (i = 0; i < ARRAY_SIZE(state->plane_peak_bw); i++)
+   copy->plane_peak_bw[i] = state->plane_peak_bw[i];
+
	return &copy->base;
 }
 
@@ -1741,6 +1771,106 @@ static int tegra_dc_wait_idle(struct tegra_dc *dc, 
unsigned long timeout)
return 
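
The rest of the hunk is cut off above, so only as a rough orientation (an
assumed back-of-the-envelope helper, not the driver's actual formula, which
also has to account for overlapping planes, cursor, scaling and rounding): the
minimum bandwidth an isochronous scanout client needs is on the order of the
pixel rate times the bytes fetched per pixel.

#include <stdint.h>

/* kHz times bytes/pixel yields kB/s, the unit the interconnect API expects */
static uint32_t plane_bandwidth_kbps(uint32_t pixel_clock_khz,
				     uint32_t bytes_per_pixel)
{
	return pixel_clock_khz * bytes_per_pixel;
}

For a 1080p@60 mode (~148500 kHz pixel clock) and a 4-byte format this gives
roughly 594000 kB/s for one full-screen plane.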

Re: [PATCH v14 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-11 Thread Dmitry Osipenko
11.03.2021 20:06, Dmitry Osipenko пишет:
> +static const char * const tegra_plane_icc_names[TEGRA_DC_LEGACY_PLANES_NUM] 
> = {
> + "wina", "winb", "winc", "", "", "", "cursor",
> +};
> +
> +int tegra_plane_interconnect_init(struct tegra_plane *plane)
> +{
> + const char *icc_name = tegra_plane_icc_names[plane->index];
> + struct device *dev = plane->dc->dev;
> + struct tegra_dc *dc = plane->dc;
> + int err;
> +
> + if (WARN_ON(plane->index >= TEGRA_DC_LEGACY_PLANES_NUM) ||
> + WARN_ON(!tegra_plane_icc_names[plane->index]))
> + return -EINVAL;

It just occurred to me that I added the NULL-check here, but missed to
change "" to NULLs. I'll make a v15 shortly.


Re: [PATCH v14 0/2] Add memory bandwidth management to NVIDIA Tegra DRM driver

2021-03-11 Thread Dmitry Osipenko
11.03.2021 20:06, Dmitry Osipenko пишет:
> This series adds memory bandwidth management to the NVIDIA Tegra DRM driver,
> which is done using interconnect framework. It fixes display corruption that
> happens due to insufficient memory bandwidth.
> 
> Changelog:
> 
> v14: - Made improvements that were suggested by Michał Mirosław to v13:
> 
>- Changed 'unsigned int' to 'bool'.
>- Renamed functions which calculate bandwidth state.
>- Reworked comment in the code that explains why downscaled planes
>  require higher bandwidth.
>- Added round-up to bandwidth calculation.
>- Added sanity checks of the plane index and fixed out-of-bounds
>  access which happened on T124 due to the cursor plane index.
> 
> v13: - No code changes. Patches missed v5.12, re-sending them for v5.13.
> 
> Dmitry Osipenko (2):
>   drm/tegra: dc: Support memory bandwidth management
>   drm/tegra: dc: Extend debug stats with total number of events
> 
>  drivers/gpu/drm/tegra/Kconfig |   1 +
>  drivers/gpu/drm/tegra/dc.c| 362 ++
>  drivers/gpu/drm/tegra/dc.h|  19 ++
>  drivers/gpu/drm/tegra/drm.c   |  14 ++
>  drivers/gpu/drm/tegra/hub.c   |   3 +
>  drivers/gpu/drm/tegra/plane.c | 127 
>  drivers/gpu/drm/tegra/plane.h |  15 ++
>  7 files changed, 541 insertions(+)
> 

Michał, please let me know if v14 looks good to you. I'll appreciate
yours r-b, thanks in advance.


[PATCH v14 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-11 Thread Dmitry Osipenko
Display controller (DC) performs isochronous memory transfers, and thus,
has a requirement for a minimum memory bandwidth that shall be fulfilled,
otherwise framebuffer data can't be fetched fast enough and this results
in a DC data-FIFO underflow, which is followed by visual corruption.

The Memory Controller drivers provide facility for memory bandwidth
management via interconnect API. Let's wire up the interconnect API
support to the DC driver in order to fix the distorted display output
on T30 Ouya, T124 TK1 and other Tegra devices.

Tested-by: Peter Geis  # Ouya T30
Tested-by: Matt Merhar  # Ouya T30
Tested-by: Nicolas Chauvet  # PAZ00 T20 and TK1 T124
Signed-off-by: Dmitry Osipenko 
---
 drivers/gpu/drm/tegra/Kconfig |   1 +
 drivers/gpu/drm/tegra/dc.c| 352 ++
 drivers/gpu/drm/tegra/dc.h|  14 ++
 drivers/gpu/drm/tegra/drm.c   |  14 ++
 drivers/gpu/drm/tegra/hub.c   |   3 +
 drivers/gpu/drm/tegra/plane.c | 127 
 drivers/gpu/drm/tegra/plane.h |  15 ++
 7 files changed, 526 insertions(+)

diff --git a/drivers/gpu/drm/tegra/Kconfig b/drivers/gpu/drm/tegra/Kconfig
index 5043dcaf1cf9..1650a448eabd 100644
--- a/drivers/gpu/drm/tegra/Kconfig
+++ b/drivers/gpu/drm/tegra/Kconfig
@@ -9,6 +9,7 @@ config DRM_TEGRA
select DRM_MIPI_DSI
select DRM_PANEL
select TEGRA_HOST1X
+   select INTERCONNECT
select IOMMU_IOVA
select CEC_CORE if CEC_NOTIFIER
help
diff --git a/drivers/gpu/drm/tegra/dc.c b/drivers/gpu/drm/tegra/dc.c
index 0ae3a025efe9..49fa488cf930 100644
--- a/drivers/gpu/drm/tegra/dc.c
+++ b/drivers/gpu/drm/tegra/dc.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -616,6 +617,9 @@ static int tegra_plane_atomic_check(struct drm_plane *plane,
struct tegra_dc *dc = to_tegra_dc(state->crtc);
int err;
 
+   plane_state->peak_memory_bandwidth = 0;
+   plane_state->avg_memory_bandwidth = 0;
+
/* no need for further checks if the plane is being disabled */
if (!state->crtc)
return 0;
@@ -802,6 +806,12 @@ static struct drm_plane *tegra_primary_plane_create(struct 
drm_device *drm,
formats = dc->soc->primary_formats;
modifiers = dc->soc->modifiers;
 
+   err = tegra_plane_interconnect_init(plane);
+   if (err) {
+   kfree(plane);
+   return ERR_PTR(err);
+   }
+
err = drm_universal_plane_init(drm, >base, possible_crtcs,
   _plane_funcs, formats,
   num_formats, modifiers, type, NULL);
@@ -833,9 +843,13 @@ static const u32 tegra_cursor_plane_formats[] = {
 static int tegra_cursor_atomic_check(struct drm_plane *plane,
 struct drm_plane_state *state)
 {
+   struct tegra_plane_state *plane_state = to_tegra_plane_state(state);
struct tegra_plane *tegra = to_tegra_plane(plane);
int err;
 
+   plane_state->peak_memory_bandwidth = 0;
+   plane_state->avg_memory_bandwidth = 0;
+
/* no need for further checks if the plane is being disabled */
if (!state->crtc)
return 0;
@@ -973,6 +987,12 @@ static struct drm_plane 
*tegra_dc_cursor_plane_create(struct drm_device *drm,
num_formats = ARRAY_SIZE(tegra_cursor_plane_formats);
formats = tegra_cursor_plane_formats;
 
+   err = tegra_plane_interconnect_init(plane);
+   if (err) {
+   kfree(plane);
+   return ERR_PTR(err);
+   }
+
err = drm_universal_plane_init(drm, >base, possible_crtcs,
   _plane_funcs, formats,
   num_formats, NULL,
@@ -1087,6 +1107,12 @@ static struct drm_plane 
*tegra_dc_overlay_plane_create(struct drm_device *drm,
num_formats = dc->soc->num_overlay_formats;
formats = dc->soc->overlay_formats;
 
+   err = tegra_plane_interconnect_init(plane);
+   if (err) {
+   kfree(plane);
+   return ERR_PTR(err);
+   }
+
if (!cursor)
type = DRM_PLANE_TYPE_OVERLAY;
else
@@ -1204,6 +1230,7 @@ tegra_crtc_atomic_duplicate_state(struct drm_crtc *crtc)
 {
struct tegra_dc_state *state = to_dc_state(crtc->state);
struct tegra_dc_state *copy;
+   unsigned int i;
 
copy = kmalloc(sizeof(*copy), GFP_KERNEL);
if (!copy)
@@ -1215,6 +1242,9 @@ tegra_crtc_atomic_duplicate_state(struct drm_crtc *crtc)
copy->div = state->div;
copy->planes = state->planes;
 
+   for (i = 0; i < ARRAY_SIZE(state->plane_peak_bw); i++)
+   copy->plane_peak_bw[i] = state->plane_peak_bw[i];
+
	return &copy->base;
 }
 
@@ -1741,6 +1771,106 @@ static int tegra_dc_wait_idle(struct tegra_dc *dc, 
unsigned long timeout)
return 

[PATCH v14 0/2] Add memory bandwidth management to NVIDIA Tegra DRM driver

2021-03-11 Thread Dmitry Osipenko
This series adds memory bandwidth management to the NVIDIA Tegra DRM driver,
which is done using interconnect framework. It fixes display corruption that
happens due to insufficient memory bandwidth.

Changelog:

v14: - Made improvements that were suggested by Michał Mirosław to v13:

   - Changed 'unsigned int' to 'bool'.
   - Renamed functions which calculate bandwidth state.
   - Reworked comment in the code that explains why downscaled planes
 require higher bandwidth.
   - Added round-up to bandwidth calculation.
   - Added sanity checks of the plane index and fixed out-of-bounds
 access which happened on T124 due to the cursor plane index.

v13: - No code changes. Patches missed v5.12, re-sending them for v5.13.

Dmitry Osipenko (2):
  drm/tegra: dc: Support memory bandwidth management
  drm/tegra: dc: Extend debug stats with total number of events

 drivers/gpu/drm/tegra/Kconfig |   1 +
 drivers/gpu/drm/tegra/dc.c| 362 ++
 drivers/gpu/drm/tegra/dc.h|  19 ++
 drivers/gpu/drm/tegra/drm.c   |  14 ++
 drivers/gpu/drm/tegra/hub.c   |   3 +
 drivers/gpu/drm/tegra/plane.c | 127 
 drivers/gpu/drm/tegra/plane.h |  15 ++
 7 files changed, 541 insertions(+)

-- 
2.29.2



Re: [PATCH v3 4/4] sched/fair: Add document for burstable CFS bandwidth control

2021-03-10 Thread Peter Zijlstra
On Thu, Jan 21, 2021 at 07:04:53PM +0800, Huaixin Chang wrote:
> Basic description of usage and effect for CFS Bandwidth Control Burst.
> 
> Signed-off-by: Huaixin Chang 
> Signed-off-by: Shanpei Chen 

Guess :-)

> +Sometimes users might want a group to burst without accumulation. This is
> +tunable via::
> +   /proc/sys/kernel/sched_cfs_bw_burst_onset_percent (default=0)
> +
> +Up to 100% runtime of cpu.cfs_burst_us might be given on setting bandwidth.

Sometimes is a very crap reason for code to exist. Also, everything is
in _us, why do we have this one thing as a percent?


Re: [PATCH v3 3/4] sched/fair: Add cfs bandwidth burst statistics

2021-03-10 Thread Peter Zijlstra
On Thu, Jan 21, 2021 at 07:04:52PM +0800, Huaixin Chang wrote:
> Introduce statistics exports for the burstable cfs bandwidth
> controller.
> 
> The following exports are included:
> 
> current_bw: current runtime in global pool
> nr_burst:   number of periods bandwidth burst occurs
> burst_time: cumulative wall-time that any cpus has
>   used above quota in respective periods
> 
> Signed-off-by: Huaixin Chang 
> Signed-off-by: Shanpei Chen 

Consistently fail.

> ---
>  kernel/sched/core.c  |  6 ++
>  kernel/sched/fair.c  | 12 +++-
>  kernel/sched/sched.h |  3 +++
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fecf0f05ef0c..80ca763ca492 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7986,6 +7986,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
> u64 period, u64 quota,
>   cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
>   }
>  
> + cfs_b->previous_runtime = cfs_b->runtime;
> +
>   /* Restart the period timer (if active) to handle new period expiry: */
>   if (runtime_enabled)
>   start_cfs_bandwidth(cfs_b, 1);
> @@ -8234,6 +8236,10 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void 
> *v)
>   seq_printf(sf, "wait_sum %llu\n", ws);
>   }
>  
> + seq_printf(sf, "current_bw %llu\n", cfs_b->runtime);
> + seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
> + seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
> +
>   return 0;
>  }
>  #endif /* CONFIG_CFS_BANDWIDTH */

This is ABI; and the Changelog has no justification whatsoever...



Re: [PATCH v3 2/4] sched/fair: Make CFS bandwidth controller burstable

2021-03-10 Thread Peter Zijlstra
On Thu, Jan 21, 2021 at 07:04:51PM +0800, Huaixin Chang wrote:
> Accumulate unused quota from previous periods, thus accumulated
> bandwidth runtime can be used in the following periods. During
> accumulation, take care of runtime overflow. Previous non-burstable
> CFS bandwidth controller only assigns quota to runtime, which saves a lot.
> 
> A sysctl parameter sysctl_sched_cfs_bw_burst_onset_percent is introduced to
> denote how many percent of burst is given on setting cfs bandwidth. By
> default it is 0, which means no burst is allowed unless accumulated.
> 
> Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
> switch for burst. It is enabled by default.
> 
> Signed-off-by: Huaixin Chang 
> Signed-off-by: Shanpei Chen 

Identical invalid SoB chain.

> Reported-by: kernel test robot 

What exactly did the robot report; the whole patch?

> ---
>  include/linux/sched/sysctl.h |  2 ++
>  kernel/sched/core.c  | 31 +
>  kernel/sched/fair.c  | 47 
> 
>  kernel/sched/sched.h |  4 ++--
>  kernel/sysctl.c  | 18 +
>  5 files changed, 88 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 3c31ba88aca5..3400828eaf2d 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -72,6 +72,8 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
>  extern unsigned int sysctl_sched_cfs_bandwidth_slice;
> +extern unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
> +extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
>  #endif
>  
>  #ifdef CONFIG_SCHED_AUTOGROUP
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 48d3bad12be2..fecf0f05ef0c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -66,6 +66,16 @@ const_debug unsigned int sysctl_sched_features =
>   */
>  const_debug unsigned int sysctl_sched_nr_migrate = 32;
>  
> +#ifdef CONFIG_CFS_BANDWIDTH
> +/*
> + * Percent of burst assigned to cfs_b->runtime on tg_set_cfs_bandwidth,
> + * 0 by default.
> + */
> +unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
> +
> +unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
> +#endif

There's already an #ifdef block that contains that bandwidth_slice
thing, see the previous hunk, so why create a new #ifdef here?

Also, personally I think percentages are over-represented as members of
Q.

> @@ -7891,7 +7901,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
>  const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
>  static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
>  /* More than 203 days if BW_SHIFT equals 20. */
> -static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
> +const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>  
>  static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>  
> @@ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
> u64 period, u64 quota,
>  {
>   int i, ret = 0, runtime_enabled, runtime_was_enabled;
>   struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> - u64 buffer;
> + u64 buffer, burst_onset;
>  
>   if (tg == &root_task_group)
>   return -EINVAL;
> @@ -7961,11 +7971,24 @@ static int tg_set_cfs_bandwidth(struct task_group 
> *tg, u64 period, u64 quota,
>   cfs_b->burst = burst;
>   cfs_b->buffer = buffer;
>  
> - __refill_cfs_bandwidth_runtime(cfs_b);
> + cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
> + cfs_b->runtime = cfs_b->quota;
> +
> + /* burst_onset needed */
> + if (cfs_b->quota != RUNTIME_INF &&
> + sysctl_sched_cfs_bw_burst_enabled &&
> + sysctl_sched_cfs_bw_burst_onset_percent > 0) {

'creative' indentation again...

Also, this gives rise to the question as to why onset_percent is
separate from enabled.

> +
> + burst_onset = do_div(burst, 100) *
> + sysctl_sched_cfs_bw_burst_onset_percent;

and again..

> +
> + cfs_b->runtime += burst_onset;
> + cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
> + }
>  
>   /* Restart the period timer (if active) to handle new period expiry: */
>   if (runtime_enabled)
> - start_cfs_bandwidth(cfs_b);
> + start_cfs_bandwidth(cfs_b, 1);
>  
>   raw_spin_unlock_irq(&cfs_b->lock);
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6bb4f89259fd..abe6eb05fe09 100644
> --- a/kernel/sched/
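
A side note on the onset computation quoted above: do_div(burst, 100) divides
'burst' in place and evaluates to the remainder, so the expression appears to
multiply the remainder rather than the quotient by the percentage. What the
changelog describes would presumably look more like this standalone sketch (an
illustrative helper, not the patch's code):

#include <stdint.h>

static uint64_t burst_onset(uint64_t burst, unsigned int onset_percent)
{
	if (onset_percent > 100)
		onset_percent = 100;
	/* divide first to avoid 64-bit overflow for very large bursts */
	return burst / 100 * onset_percent;
}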

Re: [PATCH v3 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-10 Thread Peter Zijlstra
On Thu, Jan 21, 2021 at 07:04:50PM +0800, Huaixin Chang wrote:
> In this patch, we introduce the notion of CFS bandwidth burst. Unused
> "quota" from pervious "periods" might be accumulated and used in the
> following "periods". The maximum amount of accumulated bandwidth is
> bounded by "burst". And the maximun amount of CPU a group can consume in
> a given period is "buffer" which is equivalent to "quota" + "burst in
> case that this group has done enough accumulation.
> 
> Signed-off-by: Huaixin Chang 
> Signed-off-by: Shanpei Chen 

Not a valid SoB chain.

> ---
>  kernel/sched/core.c  | 91 
> 
>  kernel/sched/fair.c  |  2 ++
>  kernel/sched/sched.h |  2 ++
>  3 files changed, 82 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e7e453492cff..48d3bad12be2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7895,10 +7895,12 @@ static const u64 max_cfs_runtime = MAX_BW * 
> NSEC_PER_USEC;
>  
>  static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>  
> -static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
> + u64 burst)

Non standard indentation.

>  {
>   int i, ret = 0, runtime_enabled, runtime_was_enabled;
>   struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> + u64 buffer;
>  
>   if (tg == &root_task_group)
>   return -EINVAL;
> @@ -7925,6 +7927,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
> u64 period, u64 quota)
>   if (quota != RUNTIME_INF && quota > max_cfs_runtime)
>   return -EINVAL;
>  
> + /*
> +  * Bound burst to defend burst against overflow during bandwidth shift.
> +  */
> + if (burst > max_cfs_runtime)
> + return -EINVAL;
> +
> + if (quota == RUNTIME_INF)
> + buffer = RUNTIME_INF;
> + else
> + buffer = min(max_cfs_runtime, quota + burst);

Why do we care about buffer when RUNTIME_INF?
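
To make the quota/burst/buffer relation above concrete, here is a small
standalone illustration with assumed numbers (100ms quota and 50ms burst per
period; not taken from the patch):

#include <stdint.h>
#include <stdio.h>

#define RUNTIME_INF ((uint64_t)~0ULL)

static uint64_t cfs_buffer(uint64_t quota, uint64_t burst, uint64_t max_runtime)
{
	if (quota == RUNTIME_INF)
		return RUNTIME_INF;	/* unlimited group: the cap is meaningless */
	return quota + burst < max_runtime ? quota + burst : max_runtime;
}

int main(void)
{
	const uint64_t ms = 1000000ULL;	/* 1 ms in nanoseconds */

	/* quota 100ms, burst 50ms: at most 150ms of CPU in one period,
	 * while the long-term average stays bounded by the 100ms quota. */
	printf("buffer = %llu ns\n",
	       (unsigned long long)cfs_buffer(100 * ms, 50 * ms, RUNTIME_INF - 1));
	return 0;
}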


Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller

2021-03-10 Thread Peter Zijlstra
On Tue, Jan 26, 2021 at 06:18:59PM +0800, changhuaixin wrote:
> Ping, any new comments on this patchset? If there're no other
> concerns, I think it's ready to be merged?

I only found this by accident...

Thread got lost because you're posting new series as replies to older
series. Please don't do that, it's crap.

I'll go have a look.


Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller

2021-03-10 Thread Odin Ugedal
Hi,

> If there are cases where the "start bandwidth" matters, I think there is a need
> to expose the
> "start bandwidth" explicitly too. However, I doubt the existence of such 
> cases from my view
> and the two examples above.

Yeah, I don't think there will be any cases where users will be
"depending" on having burst available,
so I agree in that sense.

> In my thoughts, this patchset keeps cgroup usage within the quota in the 
> longer term, and allows
> cgroup to respond to a burst of work with the help of a reasonable burst 
> buffer. If quota is set correctly
> above average usage, and enough burst buffer is set to meet the needs of
> bursty work, then it makes no difference whether this cgroup runs with 0 start bandwidth
> or all of it.
> Thus I used sysctl_sched_cfs_bw_burst_onset_percent to decided the start 
> bandwidth
> to leave some convenience here. If this sysctl interface is confusing, I 
> wonder whether it
> is a good idea not to expose this interface.
>
> For the first case mentioned above, if Kubernet users care the "start 
> bandwidth" for process startup,
> maybe it is better to give all of it rather than a part?

Yeah, I am a bit afraid there will be some confusion, so not sure if
the sysctl is the best way to do it.

But I would like feedback from others to highlight the problem as
well, that would be helpful. I think a simple "API"
where you get 0 burst or full burst on "set" (the one we decide on)
would be best to avoid unnecessary complexity.

Start burst when starting up a new process in a new cgroup might be
helpful, so maybe that is a vote for
full burst? However, in long term that doesn't matter, so 0 burst on
start would work as well.

> For the second case with quota changes over time, I think it is important 
> making sure each change works
> long enough to enforce average quota limit. Does it really matter to control 
> "start burst" on each change?

No, I don't think so. Doing so would be another thing to set per
cgroup, and that would just clutter the api
more than necessary imo., since we cannot come up with any real use cases.

> It is a copy of runtime at period start, used to calculate burst time
> during a period.
> Not quite remaining_runtime_prev_period.

Ahh, I see, I misunderstood the code. So in essence it is
"runtime_at_period_start"?

> Yeah, there is the updating problem. It is okay not to expose cfs_b->runtime
> then.

Yeah, I think dropping it all together is the best solution.


> This comment does not mean any loss or unnecessary throttle for the present cfs_b.
> All this means is that all quota refilling that is not done during timer stop 
> should be
> refilled on timer start, for the burstable cfsb.
>
> Maybe I shall change this comment in some way if it is misleading?

I think I formulated my question badly. The comment makes sense, I am
just trying to compare how "start_cfs_bandwidth"
works after your patch compared to how it works currently. As I
understand, without this patch "start_cfs_bandwidth" will
never refill runtime, while with your patch, it will refill even when
overrun=0 with burst disabled. Is that an intended change in
behavior, or am I not understanding the patch?


On another note, I have also been testing this patch, and I am not
able to reproduce your schbench results. Both with and without burst,
it gets the same result, and no nr_throttled stays at 0 (tested on a
32-core system). Can you try to rerun your tests with the mainline
to see if you still get the same results? (Also, I see you are running
with 30 threads. How many cores do your test setup have?). To actually
say that the result is real, all cores used should maybe be
exclusively reserved as well, to avoid issues where other processes
cause a
spike in latency.


Odin


Re: [PATCH v13 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-08 Thread Dmitry Osipenko
06.03.2021 02:02, Michał Mirosław пишет:
> On Fri, Mar 05, 2021 at 12:45:51AM +0300, Dmitry Osipenko wrote:
>> 04.03.2021 02:08, Michał Mirosław пишет:
>>> On Tue, Mar 02, 2021 at 03:44:44PM +0300, Dmitry Osipenko wrote:
>>>> Display controller (DC) performs isochronous memory transfers, and thus,
>>>> has a requirement for a minimum memory bandwidth that shall be fulfilled,
>>>> otherwise framebuffer data can't be fetched fast enough and this results
>>>> in a DC's data-FIFO underflow that follows by a visual corruption.
> [...]
>>>> +  /*
>>>> +   * Horizontal downscale takes extra bandwidth which roughly depends
>>>> +   * on the scaled width.
>>>> +   */
>>>> +  if (src_w > dst_w)
>>>> +  mul = (src_w - dst_w) * bpp / 2048 + 1;
>>>> +  else
>>>> +  mul = 1;
>>>
>>> Does it really need more bandwidth to scale down? Does it read the same
>>> data multiple times just to throw it away?
>> The hardware isn't optimized for downscale, it indeed takes more
>> bandwidth. You'll witness a severe underflow of plane's memory FIFO
>> buffer on trying to downscale 1080p plane to 50x50.
> [...]
> 
> In your example, does it really need 16x the bandwidth compared to
> no scaling case?  The naive way to implement downscaling would be to read
> all the pixels and only take every N-th.  Maybe the problem is that in
> downscaling mode the latency requirements are tighter?  Why would bandwidth
> required be proportional to a difference between the widths (instead e.g.
> to src/dst or dst*cacheline_size)?

Seems you're right, it's actually not the bandwidth. Recently I added
memory client statistics gathering support to grate-kernel for Tegra20,
and it shows that the consumed bandwidth is actually lower when the plane
is downscaled.

So it should be the latency, which depends on memory frequency and,
thus, on bandwidth. I'll try to improve the comment in the code in the
next version, thanks.
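
For reference, the extra-bandwidth heuristic quoted earlier in this thread can
be evaluated numerically; the snippet below just reproduces that arithmetic,
assuming bpp means bytes per pixel (my assumption), for the 1080p-to-50px case
mentioned above:

#include <stdio.h>

static unsigned int downscale_mul(unsigned int src_w, unsigned int dst_w,
				  unsigned int bpp)
{
	return (src_w > dst_w) ? (src_w - dst_w) * bpp / 2048 + 1 : 1;
}

int main(void)
{
	/* 1920 -> 50 pixels wide at 4 bytes per pixel gives a multiplier of 4 */
	printf("mul = %u\n", downscale_mul(1920, 50, 4));
	return 0;
}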


[RFC PATCH 3/8] bfq: keep the minimum bandwidth for be_class

2021-03-08 Thread brookxu
From: Chunguang Xu 

rt_class will preempt other classes, which may cause other
classes to starve to death. At present, idle_class has
alleviated the starvation problem through the minimum
bandwidth mechanism. Similarly, we should do the same for
be_class.

Signed-off-by: Chunguang Xu 
---
 block/bfq-iosched.c |  2 +-
 block/bfq-iosched.h |  9 +
 block/bfq-wf2q.c| 46 -
 3 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ea9d7f6f4e3d..b639cdbb4192 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6539,7 +6539,7 @@ static void bfq_init_root_group(struct bfq_group 
*root_group,
root_group->rq_pos_tree = RB_ROOT;
for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
-   root_group->sched_data.bfq_class_idle_last_service = jiffies;
+   root_group->sched_data.class_timeout_last_check = jiffies;
 }
 
 static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index a6f98e9e14b5..2fe7456aa7bc 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -13,7 +13,7 @@
 #include "blk-cgroup-rwstat.h"
 
 #define BFQ_IOPRIO_CLASSES 3
-#define BFQ_CL_IDLE_TIMEOUT(HZ/5)
+#define BFQ_CLASS_TIMEOUT  (HZ/5)
 
 #define BFQ_MIN_WEIGHT 1
 #define BFQ_MAX_WEIGHT 1000
@@ -97,9 +97,10 @@ struct bfq_sched_data {
struct bfq_entity *next_in_service;
/* array of service trees, one per ioprio_class */
struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
-   /* last time CLASS_IDLE was served */
-   unsigned long bfq_class_idle_last_service;
-
+   /* last time the class timeout was checked */
+   unsigned long class_timeout_last_check;
+   /* the position to start class timeout check next time */
+   unsigned int next_class_index;
 };
 
 /**
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 5ff0028920a2..0d10eb3ed555 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1435,6 +1435,34 @@ __bfq_lookup_next_entity(struct bfq_service_tree *st, 
bool in_service)
return entity;
 }
 
+static int bfq_select_next_class(struct bfq_sched_data *sd)
+{
+   struct bfq_service_tree *st = sd->service_tree;
+   int i, class_idx, next_class = 0;
+   unsigned long last_check;
+
+   /*
+* we needed to guarantee a minimum bandwidth for each class (if
+* there is some active entity in this class). This should also
+* mitigate priority-inversion problems in case a low priority
+* task is holding file system resources.
+*/
+   for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
+   class_idx = (sd->next_class_index + i) % BFQ_IOPRIO_CLASSES;
+   last_check = sd->class_timeout_last_check;
+   if (time_is_before_jiffies(last_check + BFQ_CLASS_TIMEOUT)) {
+   sd->class_timeout_last_check = jiffies;
+   if (!RB_EMPTY_ROOT(&(st + class_idx)->active)) {
+   next_class = class_idx++;
+   class_idx %= BFQ_IOPRIO_CLASSES;
+   sd->next_class_index = class_idx;
+   break;
+   }
+   }
+   }
+   return next_class;
+}
+
 /**
  * bfq_lookup_next_entity - return the first eligible entity in @sd.
  * @sd: the sched_data.
@@ -1448,24 +1476,8 @@ static struct bfq_entity *bfq_lookup_next_entity(struct 
bfq_sched_data *sd,
 bool expiration)
 {
struct bfq_service_tree *st = sd->service_tree;
-   struct bfq_service_tree *idle_class_st = st + (BFQ_IOPRIO_CLASSES - 1);
struct bfq_entity *entity = NULL;
-   int class_idx = 0;
-
-   /*
-* Choose from idle class, if needed to guarantee a minimum
-* bandwidth to this class (and if there is some active entity
-* in idle class). This should also mitigate
-* priority-inversion problems in case a low priority task is
-* holding file system resources.
-*/
-   if (time_is_before_jiffies(sd->bfq_class_idle_last_service +
-  BFQ_CL_IDLE_TIMEOUT)) {
-   if (!RB_EMPTY_ROOT(&idle_class_st->active))
-   class_idx = BFQ_IOPRIO_CLASSES - 1;
-   /* About to be served if backlogged, or not yet backlogged */
-   sd->bfq_class_idle_last_service = jiffies;
-   }
+   int class_idx = bfq_select_next_class(sd);
 
/*
 * Find the next entity to serve for the highest-priority
-- 
2.30.0



[PATCH v2 06/18] usb: xhci-mtk: add a function to (un)load bandwidth info

2021-03-07 Thread Chunfeng Yun
Extract a function to load/unload bandwidth info, and remove
a dummy check of TT offset.

Signed-off-by: Chunfeng Yun 
---
v2: no changes
---
 drivers/usb/host/xhci-mtk-sch.c | 37 ++---
 1 file changed, 16 insertions(+), 21 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index b1da3cb077c9..9a9685f74940 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -375,7 +375,6 @@ static void update_bus_bw(struct mu3h_sch_bw_info *sch_bw,
sch_ep->bw_budget_table[j];
}
}
-   sch_ep->allocated = used;
 }
 
 static int check_fs_bus_bw(struct mu3h_sch_ep_info *sch_ep, int offset)
@@ -509,6 +508,19 @@ static void update_sch_tt(struct usb_device *udev,
list_del(_ep->tt_endpoint);
 }
 
+static int load_ep_bw(struct usb_device *udev, struct mu3h_sch_bw_info *sch_bw,
+ struct mu3h_sch_ep_info *sch_ep, bool loaded)
+{
+   if (sch_ep->sch_tt)
+   update_sch_tt(udev, sch_ep, loaded);
+
+   /* update bus bandwidth info */
+   update_bus_bw(sch_bw, sch_ep, loaded);
+   sch_ep->allocated = loaded;
+
+   return 0;
+}
+
 static u32 get_esit_boundary(struct mu3h_sch_ep_info *sch_ep)
 {
u32 boundary = sch_ep->esit;
@@ -535,7 +547,6 @@ static int check_sch_bw(struct usb_device *udev,
u32 esit_boundary;
u32 min_num_budget;
u32 min_cs_count;
-   bool tt_offset_ok = false;
int ret;
 
/*
@@ -552,8 +563,6 @@ static int check_sch_bw(struct usb_device *udev,
ret = check_sch_tt(udev, sch_ep, offset);
if (ret)
continue;
-   else
-   tt_offset_ok = true;
}
 
if ((offset + sch_ep->num_budget_microframes) > esit_boundary)
@@ -585,29 +594,15 @@ static int check_sch_bw(struct usb_device *udev,
sch_ep->cs_count = min_cs_count;
sch_ep->num_budget_microframes = min_num_budget;
 
-   if (sch_ep->sch_tt) {
-   /* all offset for tt is not ok*/
-   if (!tt_offset_ok)
-   return -ERANGE;
-
-   update_sch_tt(udev, sch_ep, 1);
-   }
-
-   /* update bus bandwidth info */
-   update_bus_bw(sch_bw, sch_ep, 1);
-
-   return 0;
+   return load_ep_bw(udev, sch_bw, sch_ep, true);
 }
 
 static void destroy_sch_ep(struct usb_device *udev,
struct mu3h_sch_bw_info *sch_bw, struct mu3h_sch_ep_info *sch_ep)
 {
/* only release ep bw check passed by check_sch_bw() */
-   if (sch_ep->allocated) {
-   update_bus_bw(sch_bw, sch_ep, 0);
-   if (sch_ep->sch_tt)
-   update_sch_tt(udev, sch_ep, 0);
-   }
+   if (sch_ep->allocated)
+   load_ep_bw(udev, sch_bw, sch_ep, false);
 
if (sch_ep->sch_tt)
drop_tt(udev);
-- 
2.18.0



[PATCH v2 12/18] usb: xhci-mtk: rebuild the way to get bandwidth domain

2021-03-07 Thread Chunfeng Yun
Rebuild the function get_bw_index() to return the bandwidth domain
directly instead of its index in the domain array.

Signed-off-by: Chunfeng Yun 
---
v2: no changes
---
 drivers/usb/host/xhci-mtk-sch.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index bad99580fb68..9e77bbd8e7f7 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -57,7 +57,7 @@ static u32 get_bw_boundary(enum usb_device_speed speed)
 }
 
 /*
-* get the index of bandwidth domains array which @ep belongs to.
+* get the bandwidth domain which @ep belongs to.
 *
 * the bandwidth domain array is saved to @sch_array of struct xhci_hcd_mtk,
 * each HS root port is treated as a single bandwidth domain,
@@ -68,9 +68,11 @@ static u32 get_bw_boundary(enum usb_device_speed speed)
 * so the bandwidth domain array is organized as follow for simplification:
 * SSport0-OUT, SSport0-IN, ..., SSportX-OUT, SSportX-IN, HSport0, ..., HSportY
 */
-static int get_bw_index(struct xhci_hcd *xhci, struct usb_device *udev,
-   struct usb_host_endpoint *ep)
+static struct mu3h_sch_bw_info *
+get_bw_info(struct xhci_hcd_mtk *mtk, struct usb_device *udev,
+   struct usb_host_endpoint *ep)
 {
+   struct xhci_hcd *xhci = hcd_to_xhci(mtk->hcd);
struct xhci_virt_device *virt_dev;
int bw_index;
 
@@ -86,7 +88,7 @@ static int get_bw_index(struct xhci_hcd *xhci, struct 
usb_device *udev,
bw_index = virt_dev->real_port + xhci->usb3_rhub.num_ports - 1;
}
 
-   return bw_index;
+   return >sch_array[bw_index];
 }
 
 static u32 get_esit(struct xhci_ep_ctx *ep_ctx)
@@ -722,14 +724,11 @@ void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct 
usb_device *udev,
struct xhci_hcd_mtk *mtk = hcd_to_mtk(hcd);
struct xhci_hcd *xhci;
struct xhci_virt_device *virt_dev;
-   struct mu3h_sch_bw_info *sch_array;
struct mu3h_sch_bw_info *sch_bw;
struct mu3h_sch_ep_info *sch_ep, *tmp;
-   int bw_index;
 
xhci = hcd_to_xhci(hcd);
virt_dev = xhci->devs[udev->slot_id];
-   sch_array = mtk->sch_array;
 
xhci_dbg(xhci, "%s() type:%d, speed:%d, mpks:%d, dir:%d, ep:%p\n",
__func__, usb_endpoint_type(>desc), udev->speed,
@@ -739,8 +738,7 @@ void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct 
usb_device *udev,
if (!need_bw_sch(ep, udev->speed, !!virt_dev->tt_info))
return;
 
-   bw_index = get_bw_index(xhci, udev, ep);
-   sch_bw = _array[bw_index];
+   sch_bw = get_bw_info(mtk, udev, ep);
 
list_for_each_entry_safe(sch_ep, tmp, _bw->bw_ep_list, endpoint) {
if (sch_ep->ep == ep) {
@@ -758,13 +756,12 @@ int xhci_mtk_check_bandwidth(struct usb_hcd *hcd, struct 
usb_device *udev)
struct xhci_virt_device *virt_dev = xhci->devs[udev->slot_id];
struct mu3h_sch_bw_info *sch_bw;
struct mu3h_sch_ep_info *sch_ep, *tmp;
-   int bw_index, ret;
+   int ret;
 
xhci_dbg(xhci, "%s() udev %s\n", __func__, dev_name(>dev));
 
list_for_each_entry(sch_ep, >bw_ep_chk_list, endpoint) {
-   bw_index = get_bw_index(xhci, udev, sch_ep->ep);
-   sch_bw = >sch_array[bw_index];
+   sch_bw = get_bw_info(mtk, udev, sch_ep->ep);
 
ret = check_sch_bw(sch_bw, sch_ep);
if (ret) {
@@ -778,9 +775,7 @@ int xhci_mtk_check_bandwidth(struct usb_hcd *hcd, struct 
usb_device *udev)
struct usb_host_endpoint *ep = sch_ep->ep;
unsigned int ep_index = xhci_get_endpoint_index(>desc);
 
-   bw_index = get_bw_index(xhci, udev, ep);
-   sch_bw = >sch_array[bw_index];
-
+   sch_bw = get_bw_info(mtk, udev, ep);
list_move_tail(_ep->endpoint, _bw->bw_ep_list);
 
ep_ctx = xhci_get_ep_ctx(xhci, virt_dev->in_ctx, ep_index);
@@ -805,13 +800,11 @@ void xhci_mtk_reset_bandwidth(struct usb_hcd *hcd, struct 
usb_device *udev)
struct xhci_hcd *xhci = hcd_to_xhci(hcd);
struct mu3h_sch_bw_info *sch_bw;
struct mu3h_sch_ep_info *sch_ep, *tmp;
-   int bw_index;
 
xhci_dbg(xhci, "%s() udev %s\n", __func__, dev_name(>dev));
 
list_for_each_entry_safe(sch_ep, tmp, >bw_ep_chk_list, endpoint) {
-   bw_index = get_bw_index(xhci, udev, sch_ep->ep);
-   sch_bw = >sch_array[bw_index];
+   sch_bw = get_bw_info(mtk, udev, sch_ep->ep);
destroy_sch_ep(udev, sch_bw, sch_ep);
}
 
-- 
2.18.0



[PATCH v2 02/18] usb: xhci-mtk: improve bandwidth scheduling with TT

2021-03-07 Thread Chunfeng Yun
When a USB headset is plugged into an external hub, the configuration
sometimes can't be set due to insufficient bandwidth, so the LS/FS
INT/ISOC bandwidth scheduling with TT needs to be improved.

Fixes: 54f6a8af3722 ("usb: xhci-mtk: skip dropping bandwidth of unchecked 
endpoints")
Cc: stable 
Signed-off-by: Yaqii Wu 
Signed-off-by: Chunfeng Yun 
---
v2: no changes
---
 drivers/usb/host/xhci-mtk-sch.c | 74 ++---
 drivers/usb/host/xhci-mtk.h |  6 ++-
 2 files changed, 64 insertions(+), 16 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index 5891f56c64da..8950d1f10a7f 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -378,6 +378,31 @@ static void update_bus_bw(struct mu3h_sch_bw_info *sch_bw,
sch_ep->allocated = used;
 }
 
+static int check_fs_bus_bw(struct mu3h_sch_ep_info *sch_ep, int offset)
+{
+   struct mu3h_sch_tt *tt = sch_ep->sch_tt;
+   u32 num_esit, tmp;
+   int base;
+   int i, j;
+
+   num_esit = XHCI_MTK_MAX_ESIT / sch_ep->esit;
+   for (i = 0; i < num_esit; i++) {
+   base = offset + i * sch_ep->esit;
+
+   /*
+* Compared with hs bus, no matter what ep type,
+* the hub will always delay one uframe to send data
+*/
+   for (j = 0; j < sch_ep->cs_count; j++) {
+   tmp = tt->fs_bus_bw[base + j] + 
sch_ep->bw_cost_per_microframe;
+   if (tmp > FS_PAYLOAD_MAX)
+   return -ERANGE;
+   }
+   }
+
+   return 0;
+}
+
 static int check_sch_tt(struct usb_device *udev,
struct mu3h_sch_ep_info *sch_ep, u32 offset)
 {
@@ -402,7 +427,7 @@ static int check_sch_tt(struct usb_device *udev,
return -ERANGE;
 
for (i = 0; i < sch_ep->cs_count; i++)
-   if (test_bit(offset + i, tt->split_bit_map))
+   if (test_bit(offset + i, tt->ss_bit_map))
return -ERANGE;
 
} else {
@@ -432,7 +457,7 @@ static int check_sch_tt(struct usb_device *udev,
cs_count = 7; /* HW limit */
 
for (i = 0; i < cs_count + 2; i++) {
-   if (test_bit(offset + i, tt->split_bit_map))
+   if (test_bit(offset + i, tt->ss_bit_map))
return -ERANGE;
}
 
@@ -448,24 +473,44 @@ static int check_sch_tt(struct usb_device *udev,
sch_ep->num_budget_microframes = sch_ep->esit;
}
 
-   return 0;
+   return check_fs_bus_bw(sch_ep, offset);
 }
 
 static void update_sch_tt(struct usb_device *udev,
-   struct mu3h_sch_ep_info *sch_ep)
+   struct mu3h_sch_ep_info *sch_ep, bool used)
 {
struct mu3h_sch_tt *tt = sch_ep->sch_tt;
u32 base, num_esit;
+   int bw_updated;
+   int bits;
int i, j;
 
num_esit = XHCI_MTK_MAX_ESIT / sch_ep->esit;
+   bits = (sch_ep->ep_type == ISOC_OUT_EP) ? sch_ep->cs_count : 1;
+
+   if (used)
+   bw_updated = sch_ep->bw_cost_per_microframe;
+   else
+   bw_updated = -sch_ep->bw_cost_per_microframe;
+
for (i = 0; i < num_esit; i++) {
base = sch_ep->offset + i * sch_ep->esit;
-   for (j = 0; j < sch_ep->num_budget_microframes; j++)
-   set_bit(base + j, tt->split_bit_map);
+
+   for (j = 0; j < bits; j++) {
+   if (used)
+   set_bit(base + j, tt->ss_bit_map);
+   else
+   clear_bit(base + j, tt->ss_bit_map);
+   }
+
+   for (j = 0; j < sch_ep->cs_count; j++)
+   tt->fs_bus_bw[base + j] += bw_updated;
}
 
-   list_add_tail(_ep->tt_endpoint, >ep_list);
+   if (used)
+   list_add_tail(_ep->tt_endpoint, >ep_list);
+   else
+   list_del(_ep->tt_endpoint);
 }
 
 static int check_sch_bw(struct usb_device *udev,
@@ -535,7 +580,7 @@ static int check_sch_bw(struct usb_device *udev,
if (!tt_offset_ok)
return -ERANGE;
 
-   update_sch_tt(udev, sch_ep);
+   update_sch_tt(udev, sch_ep, 1);
}
 
/* update bus bandwidth info */
@@ -548,15 +593,16 @@ static void destroy_sch_ep(struct usb_device *udev,
struct mu3h_sch_bw_info *sch_bw, struct mu3h_sch_ep_info *sch_ep)
 {
/* only release ep bw check passed by check_sch_bw() */
-   if (sch_ep->allocated)
+   if (sch_ep->allocated) {
update_bus_bw(sch_bw, sch_ep, 0);
+   if (sch_ep->sch_tt)
+   update_sch_tt(udev, sch_ep

[PATCH v2 07/18] usb: xhci-mtk: add a function to get bandwidth boundary

2021-03-07 Thread Chunfeng Yun
This is used to simplify unit testing.

Signed-off-by: Chunfeng Yun 
---
v2: no changes
---
 drivers/usb/host/xhci-mtk-sch.c | 27 ---
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index 9a9685f74940..8fe4481eb43d 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -37,6 +37,25 @@ static int is_fs_or_ls(enum usb_device_speed speed)
return speed == USB_SPEED_FULL || speed == USB_SPEED_LOW;
 }
 
+static u32 get_bw_boundary(enum usb_device_speed speed)
+{
+   u32 boundary;
+
+   switch (speed) {
+   case USB_SPEED_SUPER_PLUS:
+   boundary = SSP_BW_BOUNDARY;
+   break;
+   case USB_SPEED_SUPER:
+   boundary = SS_BW_BOUNDARY;
+   break;
+   default:
+   boundary = HS_BW_BOUNDARY;
+   break;
+   }
+
+   return boundary;
+}
+
 /*
 * get the index of bandwidth domains array which @ep belongs to.
 *
@@ -579,13 +598,7 @@ static int check_sch_bw(struct usb_device *udev,
break;
}
 
-   if (udev->speed == USB_SPEED_SUPER_PLUS)
-   bw_boundary = SSP_BW_BOUNDARY;
-   else if (udev->speed == USB_SPEED_SUPER)
-   bw_boundary = SS_BW_BOUNDARY;
-   else
-   bw_boundary = HS_BW_BOUNDARY;
-
+   bw_boundary = get_bw_boundary(udev->speed);
/* check bandwidth */
if (min_bw > bw_boundary)
return -ERANGE;
-- 
2.18.0



Re: [PATCH v13 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-05 Thread Michał Mirosław
On Fri, Mar 05, 2021 at 12:45:51AM +0300, Dmitry Osipenko wrote:
> 04.03.2021 02:08, Michał Mirosław пишет:
> > On Tue, Mar 02, 2021 at 03:44:44PM +0300, Dmitry Osipenko wrote:
> >> Display controller (DC) performs isochronous memory transfers, and thus,
> >> has a requirement for a minimum memory bandwidth that shall be fulfilled,
> >> otherwise framebuffer data can't be fetched fast enough and this results
> >> in a DC's data-FIFO underflow that follows by a visual corruption.
[...]
> >> +  /*
> >> +   * Horizontal downscale takes extra bandwidth which roughly depends
> >> +   * on the scaled width.
> >> +   */
> >> +  if (src_w > dst_w)
> >> +  mul = (src_w - dst_w) * bpp / 2048 + 1;
> >> +  else
> >> +  mul = 1;
> > 
> > Does it really need more bandwidth to scale down? Does it read the same
> > data multiple times just to throw it away?
> The hardware isn't optimized for downscale, it indeed takes more
> bandwidth. You'll witness a severe underflow of plane's memory FIFO
> buffer on trying to downscale 1080p plane to 50x50.
[...]

In your example, does it really need 16x the bandwidth compared to
no scaling case?  The naive way to implement downscaling would be to read
all the pixels and only take every N-th.  Maybe the problem is that in
downscaling mode the latency requirements are tighter?  Why would bandwidth
required be proportional to a difference between the widths (instead e.g.
to src/dst or dst*cacheline_size)?

Best Regards
Michał Mirosław


[PATCH 12/17] usb: xhci-mtk: rebuild the way to get bandwidth domain

2021-03-05 Thread Chunfeng Yun
Rebuild the function get_bw_index() to return the bandwidth domain
directly instead of its index in the domain array.

Signed-off-by: Chunfeng Yun 
---
 drivers/usb/host/xhci-mtk-sch.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index 1562875c04ab..d39545ade9a1 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -57,7 +57,7 @@ static u32 get_bw_boundary(enum usb_device_speed speed)
 }
 
 /*
-* get the index of bandwidth domains array which @ep belongs to.
+* get the bandwidth domain which @ep belongs to.
 *
 * the bandwidth domain array is saved to @sch_array of struct xhci_hcd_mtk,
 * each HS root port is treated as a single bandwidth domain,
@@ -68,9 +68,11 @@ static u32 get_bw_boundary(enum usb_device_speed speed)
 * so the bandwidth domain array is organized as follow for simplification:
 * SSport0-OUT, SSport0-IN, ..., SSportX-OUT, SSportX-IN, HSport0, ..., HSportY
 */
-static int get_bw_index(struct xhci_hcd *xhci, struct usb_device *udev,
-   struct usb_host_endpoint *ep)
+static struct mu3h_sch_bw_info *
+get_bw_info(struct xhci_hcd_mtk *mtk, struct usb_device *udev,
+   struct usb_host_endpoint *ep)
 {
+   struct xhci_hcd *xhci = hcd_to_xhci(mtk->hcd);
struct xhci_virt_device *virt_dev;
int bw_index;
 
@@ -86,7 +88,7 @@ static int get_bw_index(struct xhci_hcd *xhci, struct 
usb_device *udev,
bw_index = virt_dev->real_port + xhci->usb3_rhub.num_ports - 1;
}
 
-   return bw_index;
+   return &mtk->sch_array[bw_index];
 }
 
 static u32 get_esit(struct xhci_ep_ctx *ep_ctx)
@@ -722,14 +724,11 @@ void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct 
usb_device *udev,
struct xhci_hcd_mtk *mtk = hcd_to_mtk(hcd);
struct xhci_hcd *xhci;
struct xhci_virt_device *virt_dev;
-   struct mu3h_sch_bw_info *sch_array;
struct mu3h_sch_bw_info *sch_bw;
struct mu3h_sch_ep_info *sch_ep, *tmp;
-   int bw_index;
 
xhci = hcd_to_xhci(hcd);
virt_dev = xhci->devs[udev->slot_id];
-   sch_array = mtk->sch_array;
 
xhci_dbg(xhci, "%s() type:%d, speed:%d, mpks:%d, dir:%d, ep:%p\n",
__func__, usb_endpoint_type(&ep->desc), udev->speed,
@@ -739,8 +738,7 @@ void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct 
usb_device *udev,
if (!need_bw_sch(ep, udev->speed, !!virt_dev->tt_info))
return;
 
-   bw_index = get_bw_index(xhci, udev, ep);
-   sch_bw = &sch_array[bw_index];
+   sch_bw = get_bw_info(mtk, udev, ep);
 
list_for_each_entry_safe(sch_ep, tmp, &sch_bw->bw_ep_list, endpoint) {
if (sch_ep->ep == ep) {
@@ -758,13 +756,12 @@ int xhci_mtk_check_bandwidth(struct usb_hcd *hcd, struct 
usb_device *udev)
struct xhci_virt_device *virt_dev = xhci->devs[udev->slot_id];
struct mu3h_sch_bw_info *sch_bw;
struct mu3h_sch_ep_info *sch_ep, *tmp;
-   int bw_index, ret;
+   int ret;
 
xhci_dbg(xhci, "%s() udev %s\n", __func__, dev_name(>dev));
 
list_for_each_entry(sch_ep, &mtk->bw_ep_chk_list, endpoint) {
-   bw_index = get_bw_index(xhci, udev, sch_ep->ep);
-   sch_bw = &mtk->sch_array[bw_index];
+   sch_bw = get_bw_info(mtk, udev, sch_ep->ep);
 
ret = check_sch_bw(sch_bw, sch_ep);
if (ret) {
@@ -778,9 +775,7 @@ int xhci_mtk_check_bandwidth(struct usb_hcd *hcd, struct 
usb_device *udev)
struct usb_host_endpoint *ep = sch_ep->ep;
unsigned int ep_index = xhci_get_endpoint_index(&ep->desc);
 
-   bw_index = get_bw_index(xhci, udev, ep);
-   sch_bw = &mtk->sch_array[bw_index];
-
+   sch_bw = get_bw_info(mtk, udev, ep);
list_move_tail(&sch_ep->endpoint, &sch_bw->bw_ep_list);
 
ep_ctx = xhci_get_ep_ctx(xhci, virt_dev->in_ctx, ep_index);
@@ -805,13 +800,11 @@ void xhci_mtk_reset_bandwidth(struct usb_hcd *hcd, struct 
usb_device *udev)
struct xhci_hcd *xhci = hcd_to_xhci(hcd);
struct mu3h_sch_bw_info *sch_bw;
struct mu3h_sch_ep_info *sch_ep, *tmp;
-   int bw_index;
 
xhci_dbg(xhci, "%s() udev %s\n", __func__, dev_name(>dev));
 
list_for_each_entry_safe(sch_ep, tmp, &mtk->bw_ep_chk_list, endpoint) {
-   bw_index = get_bw_index(xhci, udev, sch_ep->ep);
-   sch_bw = &mtk->sch_array[bw_index];
+   sch_bw = get_bw_info(mtk, udev, sch_ep->ep);
destroy_sch_ep(udev, sch_bw, sch_ep);
}
 
-- 
2.18.0



[PATCH 06/17] usb: xhci-mtk: add a function to (un)load bandwidth info

2021-03-05 Thread Chunfeng Yun
Extract a function to load/unload bandwidth info, and remove
a dummy check of TT offset.

Signed-off-by: Chunfeng Yun 
---
 drivers/usb/host/xhci-mtk-sch.c | 37 ++---
 1 file changed, 16 insertions(+), 21 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index 9016188eee97..bef82c1f909d 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -375,7 +375,6 @@ static void update_bus_bw(struct mu3h_sch_bw_info *sch_bw,
sch_ep->bw_budget_table[j];
}
}
-   sch_ep->allocated = used;
 }
 
 static int check_fs_bus_bw(struct mu3h_sch_ep_info *sch_ep, int offset)
@@ -509,6 +508,19 @@ static void update_sch_tt(struct usb_device *udev,
list_del(&sch_ep->tt_endpoint);
 }
 
+static int load_ep_bw(struct usb_device *udev, struct mu3h_sch_bw_info *sch_bw,
+ struct mu3h_sch_ep_info *sch_ep, bool loaded)
+{
+   if (sch_ep->sch_tt)
+   update_sch_tt(udev, sch_ep, loaded);
+
+   /* update bus bandwidth info */
+   update_bus_bw(sch_bw, sch_ep, loaded);
+   sch_ep->allocated = loaded;
+
+   return 0;
+}
+
 static u32 get_esit_boundary(struct mu3h_sch_ep_info *sch_ep)
 {
u32 boundary = sch_ep->esit;
@@ -535,7 +547,6 @@ static int check_sch_bw(struct usb_device *udev,
u32 esit_boundary;
u32 min_num_budget;
u32 min_cs_count;
-   bool tt_offset_ok = false;
int ret;
 
/*
@@ -552,8 +563,6 @@ static int check_sch_bw(struct usb_device *udev,
ret = check_sch_tt(udev, sch_ep, offset);
if (ret)
continue;
-   else
-   tt_offset_ok = true;
}
 
if ((offset + sch_ep->num_budget_microframes) > esit_boundary)
@@ -585,29 +594,15 @@ static int check_sch_bw(struct usb_device *udev,
sch_ep->cs_count = min_cs_count;
sch_ep->num_budget_microframes = min_num_budget;
 
-   if (sch_ep->sch_tt) {
-   /* all offset for tt is not ok*/
-   if (!tt_offset_ok)
-   return -ERANGE;
-
-   update_sch_tt(udev, sch_ep, 1);
-   }
-
-   /* update bus bandwidth info */
-   update_bus_bw(sch_bw, sch_ep, 1);
-
-   return 0;
+   return load_ep_bw(udev, sch_bw, sch_ep, true);
 }
 
 static void destroy_sch_ep(struct usb_device *udev,
struct mu3h_sch_bw_info *sch_bw, struct mu3h_sch_ep_info *sch_ep)
 {
/* only release ep bw check passed by check_sch_bw() */
-   if (sch_ep->allocated) {
-   update_bus_bw(sch_bw, sch_ep, 0);
-   if (sch_ep->sch_tt)
-   update_sch_tt(udev, sch_ep, 0);
-   }
+   if (sch_ep->allocated)
+   load_ep_bw(udev, sch_bw, sch_ep, false);
 
if (sch_ep->sch_tt)
drop_tt(udev);
-- 
2.18.0



[PATCH 02/17] usb: xhci-mtk: improve bandwidth scheduling with TT

2021-03-05 Thread Chunfeng Yun
When a USB headset is plugged into an external hub, it sometimes
can't set its config due to insufficient bandwidth, so LS/FS INT/ISOC
bandwidth scheduling with TT needs to be improved.

Fixes: 54f6a8af3722 ("usb: xhci-mtk: skip dropping bandwidth of unchecked 
endpoints")
Cc: stable 
Signed-off-by: Yaqii Wu 
Signed-off-by: Chunfeng Yun 
---
 drivers/usb/host/xhci-mtk-sch.c | 74 ++---
 drivers/usb/host/xhci-mtk.h |  6 ++-
 2 files changed, 64 insertions(+), 16 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index 5891f56c64da..8950d1f10a7f 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -378,6 +378,31 @@ static void update_bus_bw(struct mu3h_sch_bw_info *sch_bw,
sch_ep->allocated = used;
 }
 
+static int check_fs_bus_bw(struct mu3h_sch_ep_info *sch_ep, int offset)
+{
+   struct mu3h_sch_tt *tt = sch_ep->sch_tt;
+   u32 num_esit, tmp;
+   int base;
+   int i, j;
+
+   num_esit = XHCI_MTK_MAX_ESIT / sch_ep->esit;
+   for (i = 0; i < num_esit; i++) {
+   base = offset + i * sch_ep->esit;
+
+   /*
+* Compared with hs bus, no matter what ep type,
+* the hub will always delay one uframe to send data
+*/
+   for (j = 0; j < sch_ep->cs_count; j++) {
+   tmp = tt->fs_bus_bw[base + j] + sch_ep->bw_cost_per_microframe;
+   if (tmp > FS_PAYLOAD_MAX)
+   return -ERANGE;
+   }
+   }
+
+   return 0;
+}
+
 static int check_sch_tt(struct usb_device *udev,
struct mu3h_sch_ep_info *sch_ep, u32 offset)
 {
@@ -402,7 +427,7 @@ static int check_sch_tt(struct usb_device *udev,
return -ERANGE;
 
for (i = 0; i < sch_ep->cs_count; i++)
-   if (test_bit(offset + i, tt->split_bit_map))
+   if (test_bit(offset + i, tt->ss_bit_map))
return -ERANGE;
 
} else {
@@ -432,7 +457,7 @@ static int check_sch_tt(struct usb_device *udev,
cs_count = 7; /* HW limit */
 
for (i = 0; i < cs_count + 2; i++) {
-   if (test_bit(offset + i, tt->split_bit_map))
+   if (test_bit(offset + i, tt->ss_bit_map))
return -ERANGE;
}
 
@@ -448,24 +473,44 @@ static int check_sch_tt(struct usb_device *udev,
sch_ep->num_budget_microframes = sch_ep->esit;
}
 
-   return 0;
+   return check_fs_bus_bw(sch_ep, offset);
 }
 
 static void update_sch_tt(struct usb_device *udev,
-   struct mu3h_sch_ep_info *sch_ep)
+   struct mu3h_sch_ep_info *sch_ep, bool used)
 {
struct mu3h_sch_tt *tt = sch_ep->sch_tt;
u32 base, num_esit;
+   int bw_updated;
+   int bits;
int i, j;
 
num_esit = XHCI_MTK_MAX_ESIT / sch_ep->esit;
+   bits = (sch_ep->ep_type == ISOC_OUT_EP) ? sch_ep->cs_count : 1;
+
+   if (used)
+   bw_updated = sch_ep->bw_cost_per_microframe;
+   else
+   bw_updated = -sch_ep->bw_cost_per_microframe;
+
for (i = 0; i < num_esit; i++) {
base = sch_ep->offset + i * sch_ep->esit;
-   for (j = 0; j < sch_ep->num_budget_microframes; j++)
-   set_bit(base + j, tt->split_bit_map);
+
+   for (j = 0; j < bits; j++) {
+   if (used)
+   set_bit(base + j, tt->ss_bit_map);
+   else
+   clear_bit(base + j, tt->ss_bit_map);
+   }
+
+   for (j = 0; j < sch_ep->cs_count; j++)
+   tt->fs_bus_bw[base + j] += bw_updated;
}
 
-   list_add_tail(&sch_ep->tt_endpoint, &tt->ep_list);
+   if (used)
+   list_add_tail(&sch_ep->tt_endpoint, &tt->ep_list);
+   else
+   list_del(&sch_ep->tt_endpoint);
 }
 
 static int check_sch_bw(struct usb_device *udev,
@@ -535,7 +580,7 @@ static int check_sch_bw(struct usb_device *udev,
if (!tt_offset_ok)
return -ERANGE;
 
-   update_sch_tt(udev, sch_ep);
+   update_sch_tt(udev, sch_ep, 1);
}
 
/* update bus bandwidth info */
@@ -548,15 +593,16 @@ static void destroy_sch_ep(struct usb_device *udev,
struct mu3h_sch_bw_info *sch_bw, struct mu3h_sch_ep_info *sch_ep)
 {
/* only release ep bw check passed by check_sch_bw() */
-   if (sch_ep->allocated)
+   if (sch_ep->allocated) {
update_bus_bw(sch_bw, sch_ep, 0);
+   if (sch_ep->sch_tt)
+   update_sch_tt(udev, sch_ep, 0);
+ 

[PATCH 07/17] usb: xhci-mtk: add a function to get bandwidth boundary

2021-03-05 Thread Chunfeng Yun
This is used to simplify unit testing.

Signed-off-by: Chunfeng Yun 
---
 drivers/usb/host/xhci-mtk-sch.c | 27 ---
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index bef82c1f909d..cb597357f134 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -37,6 +37,25 @@ static int is_fs_or_ls(enum usb_device_speed speed)
return speed == USB_SPEED_FULL || speed == USB_SPEED_LOW;
 }
 
+static u32 get_bw_boundary(enum usb_device_speed speed)
+{
+   u32 boundary;
+
+   switch (speed) {
+   case USB_SPEED_SUPER_PLUS:
+   boundary = SSP_BW_BOUNDARY;
+   break;
+   case USB_SPEED_SUPER:
+   boundary = SS_BW_BOUNDARY;
+   break;
+   default:
+   boundary = HS_BW_BOUNDARY;
+   break;
+   }
+
+   return boundary;
+}
+
 /*
 * get the index of bandwidth domains array which @ep belongs to.
 *
@@ -579,13 +598,7 @@ static int check_sch_bw(struct usb_device *udev,
break;
}
 
-   if (udev->speed == USB_SPEED_SUPER_PLUS)
-   bw_boundary = SSP_BW_BOUNDARY;
-   else if (udev->speed == USB_SPEED_SUPER)
-   bw_boundary = SS_BW_BOUNDARY;
-   else
-   bw_boundary = HS_BW_BOUNDARY;
-
+   bw_boundary = get_bw_boundary(udev->speed);
/* check bandwidth */
if (min_bw > bw_boundary)
return -ERANGE;
-- 
2.18.0



Re: [PATCH v13 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-04 Thread Dmitry Osipenko
04.03.2021 02:08, Michał Mirosław пишет:
> On Tue, Mar 02, 2021 at 03:44:44PM +0300, Dmitry Osipenko wrote:
>> Display controller (DC) performs isochronous memory transfers, and thus,
>> has a requirement for a minimum memory bandwidth that shall be fulfilled,
>> otherwise framebuffer data can't be fetched fast enough and this results
>> in a DC's data-FIFO underflow that follows by a visual corruption.
>>
>> The Memory Controller drivers provide facility for memory bandwidth
>> management via interconnect API. Let's wire up the interconnect API
>> support to the DC driver in order to fix the distorted display output
>> on T30 Ouya, T124 TK1 and other Tegra devices.
> 
> I did a read on the code. I have put some thoughts and nits inbetween the
> diff, but here are more general questions about the patch:

Hello, Michał! Thank you very much for taking a look at the patch!

> Is there a reason why the bandwidth is allocated per plane instead of just
> using one peak and average for the whole configuration? Or is there a need
> to expose all the planes as interconnect channels and allocate their
> bandwidth individually?

Each display plane has an individual memory client on Tegra, and memory
ICC paths are specified per memory client. This is how memory ICCs are
defined in the DT binding and that's what the memory drivers expect to
deal with. It is also nice to see in the ICC debugfs how much memory
bandwidth is consumed by each individual memory client.

> From algorithmic part I see that the plane overlaps are calculated twice
> (A vs B and later B vs A). The cursor plane is ignored, but nevertheless
> its overlap mask is calculated before being thrown away.

The algorithm assumes that we have a fixed number of planes to care
about and it's not executed in a hot code path, hence it's better to
have simpler and smaller code rather than trying to optimize it
without gaining any benefit, IMO.

> The bandwidths
> are also calculated twice: once for pre-commit and then again for
> post-commit.  Is setting bandwidth for an interconnect expensive when
> re-setting a value that was already set? The code seems to avoid this
> case, but maybe unnecessarily?

The CCF discards dummy rate changes at the end of the ICC code path.
Besides, we're not performing it in a hot code path, hence performance
isn't a concern in this patch.

tegra_crtc_update_memory_bandwidth() relies on being called before
and after the atomic commit. For example, the CRTC's "active" state is
updated only after the completion of the commit-tail phase.

Earlier versions of this patch had checks that tried to avoid setting
bandwidth in both the 'begin' and 'end' phases of a commit, but then I
found that it was buggy with regard to DPMS handling, and it was much
simpler to remove the unnecessary "optimizations" since the code was
blowing up in complexity when I tried to fix it.

tegra_crtc_update_memory_bandwidth() still checks whether the old BW
equals the new BW, hence in practice actual ICC changes are only
performed when a plane is turned on/off.
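
Roughly, that bail-out amounts to something like the following before
touching the ICC (a simplified sketch only; the variable and field names
here are illustrative, not the exact ones from the patch):

	/* skip the ICC request if the plane's bandwidth didn't change */
	if (new_avg == old_avg && new_peak == old_peak)
		return;

	err = icc_set_bw(plane->icc_mem, new_avg, new_peak);
	if (err)
		dev_err(dc->dev, "failed to set memory bandwidth: %d\n", err);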

> [...cut the big and interesting part...]
> 
> [...]
>> @@ -65,7 +75,9 @@ struct tegra_dc_soc_info {
>>  unsigned int num_overlay_formats;
>>  const u64 *modifiers;
>>  bool has_win_a_without_filters;
>> +bool has_win_b_vfilter_mem_client;
>>  bool has_win_c_without_vert_filter;
>> +unsigned int plane_tiled_memory_bandwidth_x2;
> 
> This looks more like bool in the code using it.

Good catch, thank you!

> [...]
>> --- a/drivers/gpu/drm/tegra/plane.c
>> +++ b/drivers/gpu/drm/tegra/plane.c
> [...]
>> +static int tegra_plane_check_memory_bandwidth(struct drm_plane_state *state)
> 
> The function's body looks more like 'update' or 'recalculate' rather
> than 'check' the memory bandwidth.

The 'check' in the name is intended to show that the function belongs
to atomic-state checking.

But tegra_plane_calculate_memory_bandwidth_state() is also a good
variant. I'll consider renaming it, thanks!

>> +struct tegra_plane_state *tegra_state = to_tegra_plane_state(state);
>> +unsigned int i, bpp, bpp_plane, dst_w, src_w, src_h, mul;
>> +const struct tegra_dc_soc_info *soc;
>> +const struct drm_format_info *fmt;
>> +struct drm_crtc_state *crtc_state;
>> +u32 avg_bandwidth, peak_bandwidth;
>> +
>> +if (!state->visible)
>> +return 0;
>> +
>> +crtc_state = drm_atomic_get_new_crtc_state(state->state, state->crtc);
>> +if (!crtc_state)
>> +    return -EINVAL;
>> +
>> +src_w = drm_rect_width(&state->src) >> 16;
>> +src_h = drm_rect_height(&state->src) >> 16;
>> +dst_w = drm_rect_width(&state->dst);
>> +
>> +fmt = state-&

Re: [PATCH v13 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-03 Thread Michał Mirosław
On Tue, Mar 02, 2021 at 03:44:44PM +0300, Dmitry Osipenko wrote:
> Display controller (DC) performs isochronous memory transfers, and thus,
> has a requirement for a minimum memory bandwidth that shall be fulfilled,
> otherwise framebuffer data can't be fetched fast enough and this results
> in a DC's data-FIFO underflow that follows by a visual corruption.
> 
> The Memory Controller drivers provide facility for memory bandwidth
> management via interconnect API. Let's wire up the interconnect API
> support to the DC driver in order to fix the distorted display output
> on T30 Ouya, T124 TK1 and other Tegra devices.

I did a read of the code. I have put some thoughts and nits in between the
diff, but here are more general questions about the patch:

Is there a reason why the bandwidth is allocated per plane instead of just
using one peak and average for the whole configuration? Or is there a need
to expose all the planes as interconnect channels and allocate their
bandwidth individually?

From the algorithmic part I see that the plane overlaps are calculated twice
(A vs B and later B vs A). The cursor plane is ignored, but nevertheless
its overlap mask is calculated before being thrown away. The bandwidths
are also calculated twice: once for pre-commit and then again for
post-commit.  Is setting bandwidth for an interconnect expensive when
re-setting a value that was already set? The code seems to avoid this
case, but maybe unnecessarily?

[...cut the big and interesting part...]

[...]
> @@ -65,7 +75,9 @@ struct tegra_dc_soc_info {
>   unsigned int num_overlay_formats;
>   const u64 *modifiers;
>   bool has_win_a_without_filters;
> + bool has_win_b_vfilter_mem_client;
>   bool has_win_c_without_vert_filter;
> + unsigned int plane_tiled_memory_bandwidth_x2;

This looks more like bool in the code using it.

[...]
> --- a/drivers/gpu/drm/tegra/plane.c
> +++ b/drivers/gpu/drm/tegra/plane.c
[...]
> +static int tegra_plane_check_memory_bandwidth(struct drm_plane_state *state)

The function's body looks more like 'update' or 'recalculate' rather
than 'check' the memory bandwidth.

> + struct tegra_plane_state *tegra_state = to_tegra_plane_state(state);
> + unsigned int i, bpp, bpp_plane, dst_w, src_w, src_h, mul;
> + const struct tegra_dc_soc_info *soc;
> + const struct drm_format_info *fmt;
> + struct drm_crtc_state *crtc_state;
> + u32 avg_bandwidth, peak_bandwidth;
> +
> + if (!state->visible)
> + return 0;
> +
> + crtc_state = drm_atomic_get_new_crtc_state(state->state, state->crtc);
> + if (!crtc_state)
> + return -EINVAL;
> +
> + src_w = drm_rect_width(&state->src) >> 16;
> + src_h = drm_rect_height(&state->src) >> 16;
> + dst_w = drm_rect_width(&state->dst);
> +
> + fmt = state->fb->format;
> + soc = to_tegra_dc(state->crtc)->soc;
> +
> + /*
> +  * Note that real memory bandwidth vary depending on format and
> +  * memory layout, we are not taking that into account because small
> +  * estimation error isn't important since bandwidth is rounded up
> +  * anyway.
> +  */
> + for (i = 0, bpp = 0; i < fmt->num_planes; i++) {
> + bpp_plane = fmt->cpp[i] * 8;

Nit: you could declare the bpp_plane here.

> + /*
> +  * Sub-sampling is relevant for chroma planes only and vertical
> +  * readouts are not cached, hence only horizontal sub-sampling
> +  * matters.
> +  */
> + if (i > 0)
> + bpp_plane /= fmt->hsub;
> +
> + bpp += bpp_plane;
> + }
> +
> + /*
> +  * Horizontal downscale takes extra bandwidth which roughly depends
> +  * on the scaled width.
> +  */
> +     if (src_w > dst_w)
> + mul = (src_w - dst_w) * bpp / 2048 + 1;
> + else
> + mul = 1;

Does it really need more bandwidth to scale down? Does it read the same
data multiple times just to throw it away?

> + /* average bandwidth in bytes/s */
> + avg_bandwidth  = src_w * src_h * bpp / 8 * mul;
> + avg_bandwidth *= drm_mode_vrefresh(&crtc_state->mode);
> +
> + /* mode.clock in kHz, peak bandwidth in kbit/s */
> + peak_bandwidth = crtc_state->mode.clock * bpp * mul;

I would guess that (src_w * bpp / 8) needs rounding up unless the plane
is continuous. Or you could just add the max rounding error here and
get a safe overestimated value. Maybe this would be better done when
summing per-plane widths.
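
As a rough sanity check of the magnitudes (assuming a 1920x1080 ARGB8888
plane at 60 Hz with mode.clock ~= 148500 kHz and no downscaling, so mul = 1):

	avg_bandwidth  = 1920 * 1080 * (32 / 8) * 60 ~= 498 MB/s
	peak_bandwidth = 148500 kHz * 32 bit / 8     ~= 594 MB/s

so the peak value ends up above the average, presumably because mode.clock
also covers the blanking intervals.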

> + /* ICC bandwidth in kbyte/s */
> + peak_bandwidth = kbps_to_icc(peak_bandwidth);
> + avg_bandwidth  = Bps_to_icc(avg_bandwidth);

This could be merged wit

[PATCH v13 0/2] Add memory bandwidth management to NVIDIA Tegra DRM driver

2021-03-02 Thread Dmitry Osipenko
This series adds memory bandwidth management to the NVIDIA Tegra DRM driver,
which is done using interconnect framework. It fixes display corruption that
happens due to insufficient memory bandwidth.

Tegra memory drivers already got the interconnect API support and DRM patches
were a part of the series that added ICC support to the memory drivers, but
the DRM patches were left out unreviewed and unapplied. Hence I'm re-sending
the DRM changes in this new standalone series.

Changelog:

v13: - No code changes. Patches missed v5.12, re-sending them for v5.13.

Dmitry Osipenko (2):
  drm/tegra: dc: Support memory bandwidth management
  drm/tegra: dc: Extend debug stats with total number of events

 drivers/gpu/drm/tegra/Kconfig |   1 +
 drivers/gpu/drm/tegra/dc.c| 359 ++
 drivers/gpu/drm/tegra/dc.h|  19 ++
 drivers/gpu/drm/tegra/drm.c   |  14 ++
 drivers/gpu/drm/tegra/hub.c   |   3 +
 drivers/gpu/drm/tegra/plane.c | 121 
 drivers/gpu/drm/tegra/plane.h |  15 ++
 7 files changed, 532 insertions(+)

-- 
2.29.2



[PATCH v13 1/2] drm/tegra: dc: Support memory bandwidth management

2021-03-02 Thread Dmitry Osipenko
Display controller (DC) performs isochronous memory transfers, and thus,
has a requirement for a minimum memory bandwidth that shall be fulfilled,
otherwise framebuffer data can't be fetched fast enough and this results
in a DC data-FIFO underflow which is followed by visual corruption.

The Memory Controller drivers provide facility for memory bandwidth
management via interconnect API. Let's wire up the interconnect API
support to the DC driver in order to fix the distorted display output
on T30 Ouya, T124 TK1 and other Tegra devices.

Tested-by: Peter Geis  # Ouya T30
Tested-by: Matt Merhar  # Ouya T30
Tested-by: Nicolas Chauvet  # PAZ00 T20 and TK1 T124
Signed-off-by: Dmitry Osipenko 
---
 drivers/gpu/drm/tegra/Kconfig |   1 +
 drivers/gpu/drm/tegra/dc.c| 349 ++
 drivers/gpu/drm/tegra/dc.h|  14 ++
 drivers/gpu/drm/tegra/drm.c   |  14 ++
 drivers/gpu/drm/tegra/hub.c   |   3 +
 drivers/gpu/drm/tegra/plane.c | 121 
 drivers/gpu/drm/tegra/plane.h |  15 ++
 7 files changed, 517 insertions(+)

diff --git a/drivers/gpu/drm/tegra/Kconfig b/drivers/gpu/drm/tegra/Kconfig
index 5043dcaf1cf9..1650a448eabd 100644
--- a/drivers/gpu/drm/tegra/Kconfig
+++ b/drivers/gpu/drm/tegra/Kconfig
@@ -9,6 +9,7 @@ config DRM_TEGRA
select DRM_MIPI_DSI
select DRM_PANEL
select TEGRA_HOST1X
+   select INTERCONNECT
select IOMMU_IOVA
select CEC_CORE if CEC_NOTIFIER
help
diff --git a/drivers/gpu/drm/tegra/dc.c b/drivers/gpu/drm/tegra/dc.c
index 0ae3a025efe9..7c6079984906 100644
--- a/drivers/gpu/drm/tegra/dc.c
+++ b/drivers/gpu/drm/tegra/dc.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <linux/interconnect.h>
 #include 
 #include 
 #include 
@@ -616,6 +617,9 @@ static int tegra_plane_atomic_check(struct drm_plane *plane,
struct tegra_dc *dc = to_tegra_dc(state->crtc);
int err;
 
+   plane_state->peak_memory_bandwidth = 0;
+   plane_state->avg_memory_bandwidth = 0;
+
/* no need for further checks if the plane is being disabled */
if (!state->crtc)
return 0;
@@ -802,6 +806,12 @@ static struct drm_plane *tegra_primary_plane_create(struct 
drm_device *drm,
formats = dc->soc->primary_formats;
modifiers = dc->soc->modifiers;
 
+   err = tegra_plane_interconnect_init(plane);
+   if (err) {
+   kfree(plane);
+   return ERR_PTR(err);
+   }
+
err = drm_universal_plane_init(drm, &plane->base, possible_crtcs,
   &tegra_plane_funcs, formats,
   num_formats, modifiers, type, NULL);
@@ -833,9 +843,13 @@ static const u32 tegra_cursor_plane_formats[] = {
 static int tegra_cursor_atomic_check(struct drm_plane *plane,
 struct drm_plane_state *state)
 {
+   struct tegra_plane_state *plane_state = to_tegra_plane_state(state);
struct tegra_plane *tegra = to_tegra_plane(plane);
int err;
 
+   plane_state->peak_memory_bandwidth = 0;
+   plane_state->avg_memory_bandwidth = 0;
+
/* no need for further checks if the plane is being disabled */
if (!state->crtc)
return 0;
@@ -973,6 +987,12 @@ static struct drm_plane 
*tegra_dc_cursor_plane_create(struct drm_device *drm,
num_formats = ARRAY_SIZE(tegra_cursor_plane_formats);
formats = tegra_cursor_plane_formats;
 
+   err = tegra_plane_interconnect_init(plane);
+   if (err) {
+   kfree(plane);
+   return ERR_PTR(err);
+   }
+
err = drm_universal_plane_init(drm, &plane->base, possible_crtcs,
   &tegra_plane_funcs, formats,
   num_formats, NULL,
@@ -1087,6 +1107,12 @@ static struct drm_plane 
*tegra_dc_overlay_plane_create(struct drm_device *drm,
num_formats = dc->soc->num_overlay_formats;
formats = dc->soc->overlay_formats;
 
+   err = tegra_plane_interconnect_init(plane);
+   if (err) {
+   kfree(plane);
+   return ERR_PTR(err);
+   }
+
if (!cursor)
type = DRM_PLANE_TYPE_OVERLAY;
else
@@ -1204,6 +1230,7 @@ tegra_crtc_atomic_duplicate_state(struct drm_crtc *crtc)
 {
struct tegra_dc_state *state = to_dc_state(crtc->state);
struct tegra_dc_state *copy;
+   unsigned int i;
 
copy = kmalloc(sizeof(*copy), GFP_KERNEL);
if (!copy)
@@ -1215,6 +1242,9 @@ tegra_crtc_atomic_duplicate_state(struct drm_crtc *crtc)
copy->div = state->div;
copy->planes = state->planes;
 
+   for (i = 0; i < ARRAY_SIZE(state->plane_peak_bw); i++)
+   copy->plane_peak_bw[i] = state->plane_peak_bw[i];
+
return &copy->base;
 }
 
@@ -1741,6 +1771,106 @@ static int tegra_dc_wait_idle(struct tegra_dc *dc, 
unsigned long timeout)
return 

[PATCH AUTOSEL 5.10 33/47] PCI/LINK: Remove bandwidth notification

2021-03-02 Thread Sasha Levin
From: Bjorn Helgaas 

[ Upstream commit b4c7d2076b4e767dd2e075a2b3a9e57753fc67f5 ]

The PCIe Bandwidth Change Notification feature logs messages when the link
bandwidth changes.  Some users have reported that these messages occur
often enough to significantly reduce NVMe performance.  GPUs also seem to
generate these messages.

We don't know why the link bandwidth changes, but in the reported cases
there's no indication that it's caused by hardware failures.

Remove the bandwidth change notifications for now.  Hopefully we can add
this back when we have a better understanding of why this happens and how
we can make the messages useful instead of overwhelming.

Link: https://lore.kernel.org/r/20200115221008.ga191...@google.com/
Link: 
https://lore.kernel.org/r/155605909349.3575.13433421148215616375.st...@gimli.home/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206197
Signed-off-by: Bjorn Helgaas 
Signed-off-by: Sasha Levin 
---
 drivers/pci/pcie/Kconfig   |   8 --
 drivers/pci/pcie/Makefile  |   1 -
 drivers/pci/pcie/bw_notification.c | 138 -
 drivers/pci/pcie/portdrv.h |   6 --
 drivers/pci/pcie/portdrv_pci.c |   1 -
 5 files changed, 154 deletions(-)
 delete mode 100644 drivers/pci/pcie/bw_notification.c

diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index 3946555a6042..45a2ef702b45 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -133,14 +133,6 @@ config PCIE_PTM
  This is only useful if you have devices that support PTM, but it
  is safe to enable even if you don't.
 
-config PCIE_BW
-   bool "PCI Express Bandwidth Change Notification"
-   depends on PCIEPORTBUS
-   help
- This enables PCI Express Bandwidth Change Notification.  If
- you know link width or rate changes occur only to correct
- unreliable links, you may answer Y.
-
 config PCIE_EDR
bool "PCI Express Error Disconnect Recover support"
depends on PCIE_DPC && ACPI
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index 68da9280ff11..9a7085668466 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -12,5 +12,4 @@ obj-$(CONFIG_PCIEAER_INJECT)  += aer_inject.o
 obj-$(CONFIG_PCIE_PME) += pme.o
 obj-$(CONFIG_PCIE_DPC) += dpc.o
 obj-$(CONFIG_PCIE_PTM) += ptm.o
-obj-$(CONFIG_PCIE_BW)  += bw_notification.o
 obj-$(CONFIG_PCIE_EDR) += edr.o
diff --git a/drivers/pci/pcie/bw_notification.c 
b/drivers/pci/pcie/bw_notification.c
deleted file mode 100644
index 565d23cccb8b..
--- a/drivers/pci/pcie/bw_notification.c
+++ /dev/null
@@ -1,138 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0+
-/*
- * PCI Express Link Bandwidth Notification services driver
- * Author: Alexandru Gagniuc 
- *
- * Copyright (C) 2019, Dell Inc
- *
- * The PCIe Link Bandwidth Notification provides a way to notify the
- * operating system when the link width or data rate changes.  This
- * capability is required for all root ports and downstream ports
- * supporting links wider than x1 and/or multiple link speeds.
- *
- * This service port driver hooks into the bandwidth notification interrupt
- * and warns when links become degraded in operation.
- */
-
-#define dev_fmt(fmt) "bw_notification: " fmt
-
-#include "../pci.h"
-#include "portdrv.h"
-
-static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
-{
-   int ret;
-   u32 lnk_cap;
-
-   ret = pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnk_cap);
-   return (ret == PCIBIOS_SUCCESSFUL) && (lnk_cap & PCI_EXP_LNKCAP_LBNC);
-}
-
-static void pcie_enable_link_bandwidth_notification(struct pci_dev *dev)
-{
-   u16 lnk_ctl;
-
-   pcie_capability_write_word(dev, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
-
-   pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
-   lnk_ctl |= PCI_EXP_LNKCTL_LBMIE;
-   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
-}
-
-static void pcie_disable_link_bandwidth_notification(struct pci_dev *dev)
-{
-   u16 lnk_ctl;
-
-   pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
-   lnk_ctl &= ~PCI_EXP_LNKCTL_LBMIE;
-   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
-}
-
-static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
-{
-   struct pcie_device *srv = context;
-   struct pci_dev *port = srv->port;
-   u16 link_status, events;
-   int ret;
-
-   ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
-   events = link_status & PCI_EXP_LNKSTA_LBMS;
-
-   if (ret != PCIBIOS_SUCCESSFUL || !events)
-   return IRQ_NONE;
-
-   pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
-   pcie_update_link_speed(port->subordinate, link_status);
-   return IRQ_WAKE_THREAD;
-}
-
-static irqreturn_t pcie_bw_notif

[PATCH AUTOSEL 5.11 37/52] PCI/LINK: Remove bandwidth notification

2021-03-02 Thread Sasha Levin
From: Bjorn Helgaas 

[ Upstream commit b4c7d2076b4e767dd2e075a2b3a9e57753fc67f5 ]

The PCIe Bandwidth Change Notification feature logs messages when the link
bandwidth changes.  Some users have reported that these messages occur
often enough to significantly reduce NVMe performance.  GPUs also seem to
generate these messages.

We don't know why the link bandwidth changes, but in the reported cases
there's no indication that it's caused by hardware failures.

Remove the bandwidth change notifications for now.  Hopefully we can add
this back when we have a better understanding of why this happens and how
we can make the messages useful instead of overwhelming.

Link: https://lore.kernel.org/r/20200115221008.ga191...@google.com/
Link: 
https://lore.kernel.org/r/155605909349.3575.13433421148215616375.st...@gimli.home/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206197
Signed-off-by: Bjorn Helgaas 
Signed-off-by: Sasha Levin 
---
 drivers/pci/pcie/Kconfig   |   8 --
 drivers/pci/pcie/Makefile  |   1 -
 drivers/pci/pcie/bw_notification.c | 138 -
 drivers/pci/pcie/portdrv.h |   6 --
 drivers/pci/pcie/portdrv_pci.c |   1 -
 5 files changed, 154 deletions(-)
 delete mode 100644 drivers/pci/pcie/bw_notification.c

diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index 3946555a6042..45a2ef702b45 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -133,14 +133,6 @@ config PCIE_PTM
  This is only useful if you have devices that support PTM, but it
  is safe to enable even if you don't.
 
-config PCIE_BW
-   bool "PCI Express Bandwidth Change Notification"
-   depends on PCIEPORTBUS
-   help
- This enables PCI Express Bandwidth Change Notification.  If
- you know link width or rate changes occur only to correct
- unreliable links, you may answer Y.
-
 config PCIE_EDR
bool "PCI Express Error Disconnect Recover support"
depends on PCIE_DPC && ACPI
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index d9697892fa3e..b2980db88cc0 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -12,5 +12,4 @@ obj-$(CONFIG_PCIEAER_INJECT)  += aer_inject.o
 obj-$(CONFIG_PCIE_PME) += pme.o
 obj-$(CONFIG_PCIE_DPC) += dpc.o
 obj-$(CONFIG_PCIE_PTM) += ptm.o
-obj-$(CONFIG_PCIE_BW)  += bw_notification.o
 obj-$(CONFIG_PCIE_EDR) += edr.o
diff --git a/drivers/pci/pcie/bw_notification.c 
b/drivers/pci/pcie/bw_notification.c
deleted file mode 100644
index 565d23cccb8b..
--- a/drivers/pci/pcie/bw_notification.c
+++ /dev/null
@@ -1,138 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0+
-/*
- * PCI Express Link Bandwidth Notification services driver
- * Author: Alexandru Gagniuc 
- *
- * Copyright (C) 2019, Dell Inc
- *
- * The PCIe Link Bandwidth Notification provides a way to notify the
- * operating system when the link width or data rate changes.  This
- * capability is required for all root ports and downstream ports
- * supporting links wider than x1 and/or multiple link speeds.
- *
- * This service port driver hooks into the bandwidth notification interrupt
- * and warns when links become degraded in operation.
- */
-
-#define dev_fmt(fmt) "bw_notification: " fmt
-
-#include "../pci.h"
-#include "portdrv.h"
-
-static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
-{
-   int ret;
-   u32 lnk_cap;
-
-   ret = pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnk_cap);
-   return (ret == PCIBIOS_SUCCESSFUL) && (lnk_cap & PCI_EXP_LNKCAP_LBNC);
-}
-
-static void pcie_enable_link_bandwidth_notification(struct pci_dev *dev)
-{
-   u16 lnk_ctl;
-
-   pcie_capability_write_word(dev, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
-
-   pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
-   lnk_ctl |= PCI_EXP_LNKCTL_LBMIE;
-   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
-}
-
-static void pcie_disable_link_bandwidth_notification(struct pci_dev *dev)
-{
-   u16 lnk_ctl;
-
-   pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
-   lnk_ctl &= ~PCI_EXP_LNKCTL_LBMIE;
-   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
-}
-
-static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
-{
-   struct pcie_device *srv = context;
-   struct pci_dev *port = srv->port;
-   u16 link_status, events;
-   int ret;
-
-   ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
-   events = link_status & PCI_EXP_LNKSTA_LBMS;
-
-   if (ret != PCIBIOS_SUCCESSFUL || !events)
-   return IRQ_NONE;
-
-   pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
-   pcie_update_link_speed(port->subordinate, link_status);
-   return IRQ_WAKE_THREAD;
-}
-
-static irqreturn_t pcie_bw_notif

Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller

2021-02-27 Thread changhuaixin
Hi,

Sorry for my late reply.

> On Feb 9, 2021, at 9:17 PM, Odin Ugedal  wrote:
> 
> 
> Hi! This looks quite useful, but I have a few quick thoughts. :)
> 
> I know of a lot of people who would love this (especially some
> Kubernetes users)! I really like how this allow users to use cfs
> in a more dynamic and flexible way, without interfering with those
> who like the enforce strict quotas.
> 
> 
>> +++ b/kernel/sched/core.c
>> @ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
>> u64 period, u64
>> [...]
>> +/* burst_onset needed */
>> +if (cfs_b->quota != RUNTIME_INF &&
>> +sysctl_sched_cfs_bw_burst_enabled &&
>> +sysctl_sched_cfs_bw_burst_onset_percent > 0) {
>> +
>> +burst_onset = do_div(burst, 100) *
>> +sysctl_sched_cfs_bw_burst_onset_percent;
>> +
>> +cfs_b->runtime += burst_onset;
>> +cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
>> +}
> 
> I saw a comment about this behavior, but I think this can lead to a bit of
> confusion. If sysctl_sched_cfs_bw_burst_onset_percent=0, the amount of
> bandwidth when the first process starts up will depend on the time between
> the quota was set and the startup of the process, and that feel a bit like
> a "timing" race that end user application then will have to think about.
> 
> I suspect container runtimes and/or tools like Kubernetes will then have
> to tell users to set the value to a certain amount in order to make it
> work as expected.
> 
> Another thing is that when a cgroup has saved some time into the
> "burst quota", updating the quota, period or burst will then reset the
> "burst quota", even though eg. only the burst was changed. Some tools
> use dynamic quotas, resulting in multiple changes in the quota over time,
> and I am a bit scared that not allowing them to control "start burst"
> on a write can be limiting.
> 
> Maybe we can allow people to set the "start bandwidth" explicitly when setting
> cfs_burst if they want to do that? (edit: that might be hard for cgroup v1,
> but I think that would be a good solution on cgroup v2).
> 
> This is however just my thoughts, and I am not 100% sure about what the
> best solution is, but if we merge a certain behavior, we have no real
> chance of changing it later.
> 

If there are cases where the "start bandwidth" matters, I think there is a
need to expose the "start bandwidth" explicitly too. However, from my view
and the two examples above, I doubt the existence of such cases.

In my view, this patchset keeps cgroup usage within the quota in the longer
term, and allows a cgroup to respond to a burst of work with the help of a
reasonable burst buffer, provided that the quota is set correctly above the
average usage and enough burst buffer is set to meet the needs of the bursty
work. In that case, it makes no difference whether this cgroup runs with 0
start bandwidth or all of it. Thus I used
sysctl_sched_cfs_bw_burst_onset_percent to decide the start bandwidth, to
leave some convenience here. If this sysctl interface is confusing, I wonder
whether it is a good idea not to expose this interface at all.
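
To make the intended semantics concrete, the refill at each period boundary
conceptually looks like this (a minimal sketch of the idea only, with names
simplified; not the actual code in this patchset):

	static void refill_runtime_with_burst(struct cfs_bandwidth *cfs_b)
	{
		/* unused runtime may carry over, but never beyond quota + burst */
		u64 buffer = cfs_b->quota + cfs_b->burst;

		cfs_b->runtime = min(cfs_b->runtime + cfs_b->quota, buffer);
	}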

For the first case mentioned above, if Kubernetes users care about the
"start bandwidth" for process startup, maybe it is better to give all of it
rather than a part?

For the second case with quota changes over time, I think it is important to
make sure each change works long enough to enforce the average quota limit.
Does it really matter to control the "start burst" on each change?



> 
>> +++ b/kernel/sched/sched.h
>> @@ -367,6 +367,7 @@ struct cfs_bandwidth {
>>  u64 burst;
>>  u64 buffer;
>>  u64 max_overrun;
>> +u64 previous_runtime;
>>  s64 hierarchical_quota;
> 
> Maybe indicate that this was the remaining runtime _after_ the previous
> period ended? Not 100% sure, but maybe sometihing like
> 'remaining_runtime_prev_period' or 'end_runtime_prev_period'(as inspiration). 
>   
> 

It is an copy of runtime at period start, and used to calculate burst time 
during a period.
Not quite remaining_runtime_prev_period.

> 
>> +++ b/kernel/sched/core.c
>> @@ -8234,6 +8236,10 @@ static int cpu_cfs_stat_show(struct seq_file *sf, 
>> void *v)
>>  seq_printf(sf, "wait_sum %llu\n", ws);
>>  }
>> 
>> +seq_printf(sf, "current_bw %llu\n", cfs_b->runtime);
>> +seq_printf(sf, "nr_burst %d\n", 

Re: [RFC v4 PATCH] usb: xhci-mtk: improve bandwidth scheduling with TT

2021-02-21 Thread Ikjoon Jang
On Mon, Feb 8, 2021 at 11:27 AM Chunfeng Yun  wrote:
>
> When the USB headset is plug into an external hub, sometimes
> can't set config due to not enough bandwidth, so need improve
> LS/FS INT/ISOC bandwidth scheduling with TT.
>
> Fixes: 08e469de87a2 ("usb: xhci-mtk: supports bandwidth scheduling with 
> multi-TT")
> Signed-off-by: Yaqii Wu 
> Signed-off-by: Chunfeng Yun 

Tested-by: Ikjoon Jang 

> ---
>  drivers/usb/host/xhci-mtk-sch.c | 270 +++-
>  drivers/usb/host/xhci-mtk.h |   8 +-
>  2 files changed, 201 insertions(+), 77 deletions(-)
>
> diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
> index b45e5bf08997..f3cdfcf4e5bf 100644
> --- a/drivers/usb/host/xhci-mtk-sch.c
> +++ b/drivers/usb/host/xhci-mtk-sch.c
> @@ -32,6 +32,35 @@
>  #define EP_BOFFSET(p)  ((p) & 0x3fff)
>  #define EP_BREPEAT(p)  (((p) & 0x7fff) << 16)
>
> +enum mtk_sch_err_type {
> +   SCH_SUCCESS = 0,
> +   SCH_ERR_Y6,
> +   SCH_SS_OVERLAP,
> +   SCH_CS_OVERFLOW,
> +   SCH_BW_OVERFLOW,
> +   SCH_FIXME,
> +};
> +
> +static char *sch_error_string(enum mtk_sch_err_type error)
> +{
> +   switch (error) {
> +   case SCH_SUCCESS:
> +   return "Success";
> +   case SCH_ERR_Y6:
> +   return "Can't schedule Start-Split in Y6";
> +   case SCH_SS_OVERLAP:
> +   return "Can't find a suitable Start-Split location";
> +   case SCH_CS_OVERFLOW:
> +   return "The last Complete-Split is greater than 7";
> +   case SCH_BW_OVERFLOW:
> +   return "Bandwidth exceeds the max limit";
> +   case SCH_FIXME:
> +   return "FIXME, to be resolved";
> +   default:
> +   return "Unknown error type";
> +   }
> +}
> +
>  static int is_fs_or_ls(enum usb_device_speed speed)
>  {
> return speed == USB_SPEED_FULL || speed == USB_SPEED_LOW;
> @@ -81,11 +110,22 @@ static u32 get_esit(struct xhci_ep_ctx *ep_ctx)
> return esit;
>  }
>
> +static u32 get_bw_boundary(enum usb_device_speed speed)
> +{
> +   switch (speed) {
> +   case USB_SPEED_SUPER_PLUS:
> +   return SSP_BW_BOUNDARY;
> +   case USB_SPEED_SUPER:
> +   return SS_BW_BOUNDARY;
> +   default:
> +   return HS_BW_BOUNDARY;
> +   }
> +}
> +
>  static struct mu3h_sch_tt *find_tt(struct usb_device *udev)
>  {
> struct usb_tt *utt = udev->tt;
> struct mu3h_sch_tt *tt, **tt_index, **ptt;
> -   unsigned int port;
> bool allocated_index = false;
>
> if (!utt)
> @@ -107,10 +147,9 @@ static struct mu3h_sch_tt *find_tt(struct usb_device 
> *udev)
> utt->hcpriv = tt_index;
> allocated_index = true;
> }
> -   port = udev->ttport - 1;
> -   ptt = &tt_index[port];
> +
> +   ptt = &tt_index[udev->ttport - 1];
> } else {
> -   port = 0;
> ptt = (struct mu3h_sch_tt **) &utt->hcpriv;
> }
>
> @@ -125,8 +164,7 @@ static struct mu3h_sch_tt *find_tt(struct usb_device 
> *udev)
> return ERR_PTR(-ENOMEM);
> }
> INIT_LIST_HEAD(&tt->ep_list);
> -   tt->usb_tt = utt;
> -   tt->tt_port = port;
> +
> *ptt = tt;
> }
>
> @@ -206,6 +244,15 @@ static struct mu3h_sch_ep_info *create_sch_ep(struct 
> usb_device *udev,
> return sch_ep;
>  }
>
> +static void delete_sch_ep(struct usb_device *udev, struct mu3h_sch_ep_info 
> *sch_ep)
> +{
> +   if (sch_ep->sch_tt)
> +   drop_tt(udev);
> +
> +   list_del(&sch_ep->endpoint);
> +   kfree(sch_ep);
> +}
> +
>  static void setup_sch_info(struct usb_device *udev,
> struct xhci_ep_ctx *ep_ctx, struct mu3h_sch_ep_info *sch_ep)
>  {
> @@ -375,21 +422,55 @@ static void update_bus_bw(struct mu3h_sch_bw_info 
> *sch_bw,
> sch_ep->bw_budget_table[j];
> }
> }
> -   sch_ep->allocated = used;
>  }
>
> -static int check_sch_tt(struct usb_device *udev,
> -   struct mu3h_sch_ep_info *sch_ep, u32 offset)
> +static int check_fs_bus_bw(struct mu3h_sch_ep_info *sch_ep, int offset)
> +{
> +   struct mu3h_sch_tt *tt = sch_ep->sch_tt;
> +   u32 num_esit, base;
> +   u32 i, j;
> +   u32 tmp;
> +
> +   num_esit = XHCI

Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller

2021-02-09 Thread Tejun Heo
Hello,

On Tue, Feb 09, 2021 at 02:17:19PM +0100, Odin Ugedal wrote:
> I am not that familiar with how cross-subsystem patches like these are handled,
> but I am still adding Tejun Heo (the cgroup maintainer) as a CC. Should I maybe
> CC cgroup@ as well?

Yeah, that'd be great. Given that it's a mostly straightforward extension of
an existing interface, things look fine from the cgroup side; however, please
do add the cgroup2 interface and documentation. One thing which has been
bothersome about the bandwidth interface is that we're exposing
implementation details (window size and now burst size) instead of
usage-centric requirements, but that boat has already sailed, so...

Thanks.

-- 
tejun


Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller

2021-02-09 Thread Odin Ugedal


Hi! This looks quite useful, but I have a few quick thoughts. :)

I know of a lot of people who would love this (especially some
Kubernetes users)! I really like how this allow users to use cfs
in a more dynamic and flexible way, without interfering with those
who like the enforce strict quotas.


> +++ b/kernel/sched/core.c
> @ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
> u64 period, u64
> [...]
> + /* burst_onset needed */
> + if (cfs_b->quota != RUNTIME_INF &&
> + sysctl_sched_cfs_bw_burst_enabled &&
> + sysctl_sched_cfs_bw_burst_onset_percent > 0) {
> +
> + burst_onset = do_div(burst, 100) *
> + sysctl_sched_cfs_bw_burst_onset_percent;
> +
> + cfs_b->runtime += burst_onset;
> + cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
> + }

I saw a comment about this behavior, but I think this can lead to a bit of
confusion. If sysctl_sched_cfs_bw_burst_onset_percent=0, the amount of
bandwidth when the first process starts up will depend on the time between
when the quota was set and the startup of the process, and that feels a bit
like a "timing" race that end-user applications will then have to think about.

I suspect container runtimes and/or tools like Kubernetes will then have
to tell users to set the value to a certain amount in order to make it
work as expected.

Another thing is that when a cgroup has saved some time into the
"burst quota", updating the quota, period or burst will then reset the
"burst quota", even though eg. only the burst was changed. Some tools
use dynamic quotas, resulting in multiple changes in the quota over time,
and I am a bit scared that not allowing them to control "start burst"
on a write can be limiting.

Maybe we can allow people to set the "start bandwidth" explicitly when setting
cfs_burst if they want to do that? (edit: that might be hard for cgroup v1, but
I think that would be a good solution on cgroup v2).

This is however just my thoughts, and I am not 100% sure about what the
best solution is, but if we merge a certain behavior, we have no real
chance of changing it later.


> +++ b/kernel/sched/sched.h
> @@ -367,6 +367,7 @@ struct cfs_bandwidth {
>   u64 burst;
>   u64 buffer;
>   u64 max_overrun;
> + u64 previous_runtime;
>   s64 hierarchical_quota;

Maybe indicate that this was the remaining runtime _after_ the previous
period ended? Not 100% sure, but maybe sometihing like
'remaining_runtime_prev_period' or 'end_runtime_prev_period'(as inspiration).   


> +++ b/kernel/sched/core.c
> @@ -8234,6 +8236,10 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void 
> *v)
>   seq_printf(sf, "wait_sum %llu\n", ws);
>   }
>  
> + seq_printf(sf, "current_bw %llu\n", cfs_b->runtime);
> + seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
> + seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
> +
>   return 0;
>  }

Looks like these metrics are missing from the cgroup v2 stats.

Are we sure it is smart to start exposing cfs_b->runtime, since it makes it
harder to change the implementation at a later time? I don't think it is that
useful, and if it is only exposed for debugging purposes people can probably
use kprobes instead? Also, it would not be useful unless you know how much
wall time is left in the current period. In that sense,
cfs_b->previous_runtime would probably be more useful, but I am still not sure if
it deserves to be exposed to end users like this.

Also, will "cfs_b->runtime" keep updating if no processes are running, or
will it be the same here, but update (with burst via timer overrun)
when a process starts again? If so, the runtime available when a process
starts on cgroup init can be hard to communicate if the value here doesn't
update.


> +++ b/kernel/sched/fair.c
> +void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init)
> [...]
> + /*
> +  * When period timer stops, quota for the following period is not
> +  * refilled, however period timer is already forwarded. We should
> +  * accumulate quota once more than overrun here.
> +  */


Trying to wrap my head around this one... Is not refilling here, as was the
behavior before your patch, causing a "loss" in runtime and possibly causing
an unnecessary cgroup throttle?


I am not that familiar with how cross-subsystem patches like these are handled,
but I am still adding Tejun Heo (the cgroup maintainer) as a CC. Should I maybe
CC cgroup@ as well?

Sorry for the long mail; in retrospect it should have been one reply per patch...


[PATCH 5.10 011/120] usb: xhci-mtk: fix unreleased bandwidth data

2021-02-08 Thread Greg Kroah-Hartman
From: Ikjoon Jang 

commit 1d69f9d901ef14d81c3b004e3282b8cc7b456280 upstream.

xhci-mtk needs XHCI_MTK_HOST quirk functions in add_endpoint() and
drop_endpoint() to handle its own sw bandwidth management.

It stores bandwidth data into an internal table every time
add_endpoint() is called, and drops those in drop_endpoint().
But when bandwidth allocation fails at one endpoint, all earlier
allocations from the same interface could still remain in the table.

This patch moves the bandwidth management code to the check_bandwidth() and
reset_bandwidth() paths. To do so, this patch also adds those functions
to xhci_driver_overrides and lets xhci-mtk release all failed
endpoints in the reset_bandwidth() path.
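
The override hook-up described above would look roughly as follows on the
xhci-mtk side (a sketch of the idea only; the exact initializer in the patch
may differ):

	static const struct xhci_driver_overrides xhci_mtk_overrides = {
		.reset = xhci_mtk_setup,
		.check_bandwidth = xhci_mtk_check_bandwidth,
		.reset_bandwidth = xhci_mtk_reset_bandwidth,
	};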

Fixes: 08e469de87a2 ("usb: xhci-mtk: supports bandwidth scheduling with 
multi-TT")
Signed-off-by: Ikjoon Jang 
Link: 
https://lore.kernel.org/r/20210113180444.v6.1.Id0d31b5f3ddf5e734d2ab11161ac5821921b1e1e@changeid
Cc: stable 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/usb/host/xhci-mtk-sch.c |  123 +++-
 drivers/usb/host/xhci-mtk.c |2 
 drivers/usb/host/xhci-mtk.h |   13 
 drivers/usb/host/xhci.c |8 +-
 drivers/usb/host/xhci.h |4 +
 5 files changed, 111 insertions(+), 39 deletions(-)

--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -200,6 +200,7 @@ static struct mu3h_sch_ep_info *create_s
 
sch_ep->sch_tt = tt;
sch_ep->ep = ep;
+   INIT_LIST_HEAD(&sch_ep->tt_endpoint);
 
return sch_ep;
 }
@@ -583,6 +584,8 @@ int xhci_mtk_sch_init(struct xhci_hcd_mt
 
mtk->sch_array = sch_array;
 
+   INIT_LIST_HEAD(&mtk->bw_ep_list_new);
+
return 0;
 }
 EXPORT_SYMBOL_GPL(xhci_mtk_sch_init);
@@ -601,19 +604,14 @@ int xhci_mtk_add_ep_quirk(struct usb_hcd
struct xhci_ep_ctx *ep_ctx;
struct xhci_slot_ctx *slot_ctx;
struct xhci_virt_device *virt_dev;
-   struct mu3h_sch_bw_info *sch_bw;
struct mu3h_sch_ep_info *sch_ep;
-   struct mu3h_sch_bw_info *sch_array;
unsigned int ep_index;
-   int bw_index;
-   int ret = 0;
 
xhci = hcd_to_xhci(hcd);
virt_dev = xhci->devs[udev->slot_id];
ep_index = xhci_get_endpoint_index(&ep->desc);
slot_ctx = xhci_get_slot_ctx(xhci, virt_dev->in_ctx);
ep_ctx = xhci_get_ep_ctx(xhci, virt_dev->in_ctx, ep_index);
-   sch_array = mtk->sch_array;
 
xhci_dbg(xhci, "%s() type:%d, speed:%d, mpkt:%d, dir:%d, ep:%p\n",
__func__, usb_endpoint_type(&ep->desc), udev->speed,
@@ -632,39 +630,34 @@ int xhci_mtk_add_ep_quirk(struct usb_hcd
return 0;
}
 
-   bw_index = get_bw_index(xhci, udev, ep);
-   sch_bw = &sch_array[bw_index];
-
sch_ep = create_sch_ep(udev, ep, ep_ctx);
if (IS_ERR_OR_NULL(sch_ep))
return -ENOMEM;
 
setup_sch_info(udev, ep_ctx, sch_ep);
 
-   ret = check_sch_bw(udev, sch_bw, sch_ep);
-   if (ret) {
-   xhci_err(xhci, "Not enough bandwidth!\n");
-   if (is_fs_or_ls(udev->speed))
-   drop_tt(udev);
-
-   kfree(sch_ep);
-   return -ENOSPC;
-   }
+   list_add_tail(&sch_ep->endpoint, &mtk->bw_ep_list_new);
 
-   list_add_tail(&sch_ep->endpoint, &sch_bw->bw_ep_list);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(xhci_mtk_add_ep_quirk);
 
-   ep_ctx->reserved[0] |= cpu_to_le32(EP_BPKTS(sch_ep->pkts)
-   | EP_BCSCOUNT(sch_ep->cs_count) | EP_BBM(sch_ep->burst_mode));
-   ep_ctx->reserved[1] |= cpu_to_le32(EP_BOFFSET(sch_ep->offset)
-   | EP_BREPEAT(sch_ep->repeat));
+static void xhci_mtk_drop_ep(struct xhci_hcd_mtk *mtk, struct usb_device *udev,
+struct mu3h_sch_ep_info *sch_ep)
+{
+   struct xhci_hcd *xhci = hcd_to_xhci(mtk->hcd);
+   int bw_index = get_bw_index(xhci, udev, sch_ep->ep);
+   struct mu3h_sch_bw_info *sch_bw = &mtk->sch_array[bw_index];
 
-   xhci_dbg(xhci, " PKTS:%x, CSCOUNT:%x, BM:%x, OFFSET:%x, REPEAT:%x\n",
-   sch_ep->pkts, sch_ep->cs_count, sch_ep->burst_mode,
-   sch_ep->offset, sch_ep->repeat);
+   update_bus_bw(sch_bw, sch_ep, 0);
+   list_del(&sch_ep->endpoint);
 
-   return 0;
+   if (sch_ep->sch_tt) {
+   list_del(&sch_ep->tt_endpoint);
+   drop_tt(udev);
+   }
+   kfree(sch_ep);
 }
-EXPORT_SYMBOL_GPL(xhci_mtk_add_ep_quirk);
 
 void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct usb_device *udev,
struct usb_host_endpoint *ep)
@@ -675,7 +668,7 @@ void xhci_mtk_drop_ep_quirk(struct usb_h
struct xhci_virt_device *virt_dev;
struct mu3h_sch_bw_info *sch_array;
struct mu3h_sch_bw_info *sch_bw;
-   struct mu3h_sch_ep_info *sch_ep;
+   struct mu3

[PATCH 5.10 012/120] usb: xhci-mtk: skip dropping bandwidth of unchecked endpoints

2021-02-08 Thread Greg Kroah-Hartman
From: Chunfeng Yun 

commit 54f6a8af372213a254af6609758d99f7c0b6b5ad upstream.

For those unchecked endpoints, we don't allocate bandwidth for
them, so there is no need to free the bandwidth; otherwise the
allocated bandwidth will be decreased.
Meanwhile, use xhci_dbg() instead of dev_dbg() to print logs and
rename bw_ep_list_new to bw_ep_chk_list.

Fixes: 1d69f9d901ef ("usb: xhci-mtk: fix unreleased bandwidth data")
Cc: stable 
Reviewed-and-tested-by: Ikjoon Jang 
Signed-off-by: Chunfeng Yun 
Link: 
https://lore.kernel.org/r/1612159064-28413-1-git-send-email-chunfeng@mediatek.com
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/usb/host/xhci-mtk-sch.c |   61 +---
 drivers/usb/host/xhci-mtk.h |4 +-
 2 files changed, 36 insertions(+), 29 deletions(-)

--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -200,6 +200,7 @@ static struct mu3h_sch_ep_info *create_s
 
sch_ep->sch_tt = tt;
sch_ep->ep = ep;
+   INIT_LIST_HEAD(&sch_ep->endpoint);
INIT_LIST_HEAD(&sch_ep->tt_endpoint);
 
return sch_ep;
@@ -374,6 +375,7 @@ static void update_bus_bw(struct mu3h_sc
sch_ep->bw_budget_table[j];
}
}
+   sch_ep->allocated = used;
 }
 
 static int check_sch_tt(struct usb_device *udev,
@@ -542,6 +544,22 @@ static int check_sch_bw(struct usb_devic
return 0;
 }
 
+static void destroy_sch_ep(struct usb_device *udev,
+   struct mu3h_sch_bw_info *sch_bw, struct mu3h_sch_ep_info *sch_ep)
+{
+   /* only release ep bw check passed by check_sch_bw() */
+   if (sch_ep->allocated)
+   update_bus_bw(sch_bw, sch_ep, 0);
+
+   list_del(&sch_ep->endpoint);
+
+   if (sch_ep->sch_tt) {
+   list_del(&sch_ep->tt_endpoint);
+   drop_tt(udev);
+   }
+   kfree(sch_ep);
+}
+
 static bool need_bw_sch(struct usb_host_endpoint *ep,
enum usb_device_speed speed, int has_tt)
 {
@@ -584,7 +602,7 @@ int xhci_mtk_sch_init(struct xhci_hcd_mt
 
mtk->sch_array = sch_array;
 
-   INIT_LIST_HEAD(&mtk->bw_ep_list_new);
+   INIT_LIST_HEAD(&mtk->bw_ep_chk_list);
 
return 0;
 }
@@ -636,29 +654,12 @@ int xhci_mtk_add_ep_quirk(struct usb_hcd
 
setup_sch_info(udev, ep_ctx, sch_ep);
 
-   list_add_tail(&sch_ep->endpoint, &mtk->bw_ep_list_new);
+   list_add_tail(&sch_ep->endpoint, &mtk->bw_ep_chk_list);
 
return 0;
 }
 EXPORT_SYMBOL_GPL(xhci_mtk_add_ep_quirk);
 
-static void xhci_mtk_drop_ep(struct xhci_hcd_mtk *mtk, struct usb_device *udev,
-struct mu3h_sch_ep_info *sch_ep)
-{
-   struct xhci_hcd *xhci = hcd_to_xhci(mtk->hcd);
-   int bw_index = get_bw_index(xhci, udev, sch_ep->ep);
-   struct mu3h_sch_bw_info *sch_bw = &mtk->sch_array[bw_index];
-
-   update_bus_bw(sch_bw, sch_ep, 0);
-   list_del(&sch_ep->endpoint);
-
-   if (sch_ep->sch_tt) {
-   list_del(&sch_ep->tt_endpoint);
-   drop_tt(udev);
-   }
-   kfree(sch_ep);
-}
-
 void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct usb_device *udev,
struct usb_host_endpoint *ep)
 {
@@ -688,9 +689,8 @@ void xhci_mtk_drop_ep_quirk(struct usb_h
sch_bw = &sch_array[bw_index];
 
list_for_each_entry_safe(sch_ep, tmp, &sch_bw->bw_ep_list, endpoint) {
-   if (sch_ep->ep == ep) {
-   xhci_mtk_drop_ep(mtk, udev, sch_ep);
-   }
+   if (sch_ep->ep == ep)
+   destroy_sch_ep(udev, sch_bw, sch_ep);
}
 }
 EXPORT_SYMBOL_GPL(xhci_mtk_drop_ep_quirk);
@@ -704,9 +704,9 @@ int xhci_mtk_check_bandwidth(struct usb_
struct mu3h_sch_ep_info *sch_ep, *tmp;
int bw_index, ret;
 
-   dev_dbg(>dev, "%s\n", __func__);
+   xhci_dbg(xhci, "%s() udev %s\n", __func__, dev_name(>dev));
 
-   list_for_each_entry(sch_ep, &mtk->bw_ep_list_new, endpoint) {
+   list_for_each_entry(sch_ep, &mtk->bw_ep_chk_list, endpoint) {
bw_index = get_bw_index(xhci, udev, sch_ep->ep);
sch_bw = &mtk->sch_array[bw_index];
 
@@ -717,7 +717,7 @@ int xhci_mtk_check_bandwidth(struct usb_
}
}
 
-   list_for_each_entry_safe(sch_ep, tmp, &mtk->bw_ep_list_new, endpoint) {
+   list_for_each_entry_safe(sch_ep, tmp, &mtk->bw_ep_chk_list, endpoint) {
struct xhci_ep_ctx *ep_ctx;
struct usb_host_endpoint *ep = sch_ep->ep;
unsigned int ep_index = xhci_get_endpoint_index(>desc);
@@ -746,12 +746,17 @@ EXPORT_SYMBOL_GPL(xhci_mtk_check_bandwid
 void xhci_mtk_reset_bandwidth(struct usb_hcd *hcd, struct usb_device *udev)
 {
struct xhci_hcd_mtk *mtk = hcd_to_mtk(hcd);
+   struct xhci_hcd *xhci = hcd_to_xhci(hcd);
+   struct mu3h_sch_bw_info *sch_bw;
struct mu3h_sch_ep_info *sch_e

[PATCH 5.4 29/65] usb: xhci-mtk: skip dropping bandwidth of unchecked endpoints

2021-02-08 Thread Greg Kroah-Hartman
From: Chunfeng Yun 

commit 54f6a8af372213a254af6609758d99f7c0b6b5ad upstream.

For those unchecked endpoints, we don't allocate bandwidth for
them, so there is no need to free their bandwidth; otherwise the
allocated bandwidth will be decreased.
Meanwhile, use xhci_dbg() instead of dev_dbg() to print logs, and
rename bw_ep_list_new to bw_ep_chk_list.

Fixes: 1d69f9d901ef ("usb: xhci-mtk: fix unreleased bandwidth data")
Cc: stable 
Reviewed-and-tested-by: Ikjoon Jang 
Signed-off-by: Chunfeng Yun 
Link: 
https://lore.kernel.org/r/1612159064-28413-1-git-send-email-chunfeng@mediatek.com
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/usb/host/xhci-mtk-sch.c |   61 +---
 drivers/usb/host/xhci-mtk.h |4 +-
 2 files changed, 36 insertions(+), 29 deletions(-)

--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -200,6 +200,7 @@ static struct mu3h_sch_ep_info *create_s
 
sch_ep->sch_tt = tt;
sch_ep->ep = ep;
+   INIT_LIST_HEAD(&sch_ep->endpoint);
	INIT_LIST_HEAD(&sch_ep->tt_endpoint);
 
return sch_ep;
@@ -374,6 +375,7 @@ static void update_bus_bw(struct mu3h_sc
sch_ep->bw_budget_table[j];
}
}
+   sch_ep->allocated = used;
 }
 
 static int check_sch_tt(struct usb_device *udev,
@@ -542,6 +544,22 @@ static int check_sch_bw(struct usb_devic
return 0;
 }
 
+static void destroy_sch_ep(struct usb_device *udev,
+   struct mu3h_sch_bw_info *sch_bw, struct mu3h_sch_ep_info *sch_ep)
+{
+   /* only release ep bw check passed by check_sch_bw() */
+   if (sch_ep->allocated)
+   update_bus_bw(sch_bw, sch_ep, 0);
+
+   list_del(&sch_ep->endpoint);
+
+   if (sch_ep->sch_tt) {
+   list_del(&sch_ep->tt_endpoint);
+   drop_tt(udev);
+   }
+   kfree(sch_ep);
+}
+
 static bool need_bw_sch(struct usb_host_endpoint *ep,
enum usb_device_speed speed, int has_tt)
 {
@@ -584,7 +602,7 @@ int xhci_mtk_sch_init(struct xhci_hcd_mt
 
mtk->sch_array = sch_array;
 
-   INIT_LIST_HEAD(&mtk->bw_ep_list_new);
+   INIT_LIST_HEAD(&mtk->bw_ep_chk_list);
 
return 0;
 }
@@ -636,29 +654,12 @@ int xhci_mtk_add_ep_quirk(struct usb_hcd
 
setup_sch_info(udev, ep_ctx, sch_ep);
 
-   list_add_tail(&sch_ep->endpoint, &mtk->bw_ep_list_new);
+   list_add_tail(&sch_ep->endpoint, &mtk->bw_ep_chk_list);
 
return 0;
 }
 EXPORT_SYMBOL_GPL(xhci_mtk_add_ep_quirk);
 
-static void xhci_mtk_drop_ep(struct xhci_hcd_mtk *mtk, struct usb_device *udev,
-struct mu3h_sch_ep_info *sch_ep)
-{
-   struct xhci_hcd *xhci = hcd_to_xhci(mtk->hcd);
-   int bw_index = get_bw_index(xhci, udev, sch_ep->ep);
-   struct mu3h_sch_bw_info *sch_bw = &mtk->sch_array[bw_index];
-
-   update_bus_bw(sch_bw, sch_ep, 0);
-   list_del(&sch_ep->endpoint);
-
-   if (sch_ep->sch_tt) {
-   list_del(&sch_ep->tt_endpoint);
-   drop_tt(udev);
-   }
-   kfree(sch_ep);
-}
-
 void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct usb_device *udev,
struct usb_host_endpoint *ep)
 {
@@ -688,9 +689,8 @@ void xhci_mtk_drop_ep_quirk(struct usb_h
sch_bw = &sch_array[bw_index];
 
list_for_each_entry_safe(sch_ep, tmp, &sch_bw->bw_ep_list, endpoint) {
-   if (sch_ep->ep == ep) {
-   xhci_mtk_drop_ep(mtk, udev, sch_ep);
-   }
+   if (sch_ep->ep == ep)
+   destroy_sch_ep(udev, sch_bw, sch_ep);
}
 }
 EXPORT_SYMBOL_GPL(xhci_mtk_drop_ep_quirk);
@@ -704,9 +704,9 @@ int xhci_mtk_check_bandwidth(struct usb_
struct mu3h_sch_ep_info *sch_ep, *tmp;
int bw_index, ret;
 
-   dev_dbg(&udev->dev, "%s\n", __func__);
+   xhci_dbg(xhci, "%s() udev %s\n", __func__, dev_name(&udev->dev));
 
-   list_for_each_entry(sch_ep, &mtk->bw_ep_list_new, endpoint) {
+   list_for_each_entry(sch_ep, &mtk->bw_ep_chk_list, endpoint) {
bw_index = get_bw_index(xhci, udev, sch_ep->ep);
sch_bw = &mtk->sch_array[bw_index];
 
@@ -717,7 +717,7 @@ int xhci_mtk_check_bandwidth(struct usb_
}
}
 
-   list_for_each_entry_safe(sch_ep, tmp, &mtk->bw_ep_list_new, endpoint) {
+   list_for_each_entry_safe(sch_ep, tmp, &mtk->bw_ep_chk_list, endpoint) {
struct xhci_ep_ctx *ep_ctx;
struct usb_host_endpoint *ep = sch_ep->ep;
unsigned int ep_index = xhci_get_endpoint_index(&ep->desc);
@@ -746,12 +746,17 @@ EXPORT_SYMBOL_GPL(xhci_mtk_check_bandwid
 void xhci_mtk_reset_bandwidth(struct usb_hcd *hcd, struct usb_device *udev)
 {
struct xhci_hcd_mtk *mtk = hcd_to_mtk(hcd);
+   struct xhci_hcd *xhci = hcd_to_xhci(hcd);
+   struct mu3h_sch_bw_info *sch_bw;
struct mu3h_sch_ep_info *sch_e

[PATCH 5.4 28/65] usb: xhci-mtk: fix unreleased bandwidth data

2021-02-08 Thread Greg Kroah-Hartman
From: Ikjoon Jang 

commit 1d69f9d901ef14d81c3b004e3282b8cc7b456280 upstream.

xhci-mtk needs XHCI_MTK_HOST quirk functions in add_endpoint() and
drop_endpoint() to handle its own software bandwidth management.

It stores bandwidth data in an internal table every time
add_endpoint() is called, and drops the data in drop_endpoint().
But when bandwidth allocation fails at one endpoint, all earlier
allocations from the same interface can still remain in the table.

This patch moves the bandwidth management code to the check_bandwidth()
and reset_bandwidth() paths. To do so, it also adds those hooks to
xhci_driver_overrides and lets xhci-mtk release all failed
endpoints in the reset_bandwidth() path.

Fixes: 08e469de87a2 ("usb: xhci-mtk: supports bandwidth scheduling with 
multi-TT")
Signed-off-by: Ikjoon Jang 
Link: 
https://lore.kernel.org/r/20210113180444.v6.1.Id0d31b5f3ddf5e734d2ab11161ac5821921b1e1e@changeid
Cc: stable 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/usb/host/xhci-mtk-sch.c |  123 +++-
 drivers/usb/host/xhci-mtk.c |2 
 drivers/usb/host/xhci-mtk.h |   13 
 drivers/usb/host/xhci.c |8 +-
 drivers/usb/host/xhci.h |4 +
 5 files changed, 111 insertions(+), 39 deletions(-)

--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -200,6 +200,7 @@ static struct mu3h_sch_ep_info *create_s
 
sch_ep->sch_tt = tt;
sch_ep->ep = ep;
+   INIT_LIST_HEAD(&sch_ep->tt_endpoint);
 
return sch_ep;
 }
@@ -583,6 +584,8 @@ int xhci_mtk_sch_init(struct xhci_hcd_mt
 
mtk->sch_array = sch_array;
 
+   INIT_LIST_HEAD(&mtk->bw_ep_list_new);
+
return 0;
 }
 EXPORT_SYMBOL_GPL(xhci_mtk_sch_init);
@@ -601,19 +604,14 @@ int xhci_mtk_add_ep_quirk(struct usb_hcd
struct xhci_ep_ctx *ep_ctx;
struct xhci_slot_ctx *slot_ctx;
struct xhci_virt_device *virt_dev;
-   struct mu3h_sch_bw_info *sch_bw;
struct mu3h_sch_ep_info *sch_ep;
-   struct mu3h_sch_bw_info *sch_array;
unsigned int ep_index;
-   int bw_index;
-   int ret = 0;
 
xhci = hcd_to_xhci(hcd);
virt_dev = xhci->devs[udev->slot_id];
ep_index = xhci_get_endpoint_index(&ep->desc);
slot_ctx = xhci_get_slot_ctx(xhci, virt_dev->in_ctx);
ep_ctx = xhci_get_ep_ctx(xhci, virt_dev->in_ctx, ep_index);
-   sch_array = mtk->sch_array;
 
xhci_dbg(xhci, "%s() type:%d, speed:%d, mpkt:%d, dir:%d, ep:%p\n",
__func__, usb_endpoint_type(>desc), udev->speed,
@@ -632,39 +630,34 @@ int xhci_mtk_add_ep_quirk(struct usb_hcd
return 0;
}
 
-   bw_index = get_bw_index(xhci, udev, ep);
-   sch_bw = &sch_array[bw_index];
-
sch_ep = create_sch_ep(udev, ep, ep_ctx);
if (IS_ERR_OR_NULL(sch_ep))
return -ENOMEM;
 
setup_sch_info(udev, ep_ctx, sch_ep);
 
-   ret = check_sch_bw(udev, sch_bw, sch_ep);
-   if (ret) {
-   xhci_err(xhci, "Not enough bandwidth!\n");
-   if (is_fs_or_ls(udev->speed))
-   drop_tt(udev);
-
-   kfree(sch_ep);
-   return -ENOSPC;
-   }
+   list_add_tail(&sch_ep->endpoint, &mtk->bw_ep_list_new);
 
-   list_add_tail(&sch_ep->endpoint, &sch_bw->bw_ep_list);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(xhci_mtk_add_ep_quirk);
 
-   ep_ctx->reserved[0] |= cpu_to_le32(EP_BPKTS(sch_ep->pkts)
-   | EP_BCSCOUNT(sch_ep->cs_count) | EP_BBM(sch_ep->burst_mode));
-   ep_ctx->reserved[1] |= cpu_to_le32(EP_BOFFSET(sch_ep->offset)
-   | EP_BREPEAT(sch_ep->repeat));
+static void xhci_mtk_drop_ep(struct xhci_hcd_mtk *mtk, struct usb_device *udev,
+struct mu3h_sch_ep_info *sch_ep)
+{
+   struct xhci_hcd *xhci = hcd_to_xhci(mtk->hcd);
+   int bw_index = get_bw_index(xhci, udev, sch_ep->ep);
+   struct mu3h_sch_bw_info *sch_bw = &mtk->sch_array[bw_index];
 
-   xhci_dbg(xhci, " PKTS:%x, CSCOUNT:%x, BM:%x, OFFSET:%x, REPEAT:%x\n",
-   sch_ep->pkts, sch_ep->cs_count, sch_ep->burst_mode,
-   sch_ep->offset, sch_ep->repeat);
+   update_bus_bw(sch_bw, sch_ep, 0);
+   list_del(&sch_ep->endpoint);
 
-   return 0;
+   if (sch_ep->sch_tt) {
+   list_del(&sch_ep->tt_endpoint);
+   drop_tt(udev);
+   }
+   kfree(sch_ep);
 }
-EXPORT_SYMBOL_GPL(xhci_mtk_add_ep_quirk);
 
 void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct usb_device *udev,
struct usb_host_endpoint *ep)
@@ -675,7 +668,7 @@ void xhci_mtk_drop_ep_quirk(struct usb_h
struct xhci_virt_device *virt_dev;
struct mu3h_sch_bw_info *sch_array;
struct mu3h_sch_bw_info *sch_bw;
-   struct mu3h_sch_ep_info *sch_ep;
+   struct mu3

[RFC v4 PATCH] usb: xhci-mtk: improve bandwidth scheduling with TT

2021-02-07 Thread Chunfeng Yun
When a USB headset is plugged into an external hub, setting the
configuration sometimes fails due to insufficient bandwidth, so the
LS/FS INT/ISOC bandwidth scheduling with TT needs to be improved.

Fixes: 08e469de87a2 ("usb: xhci-mtk: supports bandwidth scheduling with 
multi-TT")
Signed-off-by: Yaqii Wu 
Signed-off-by: Chunfeng Yun 
---
 drivers/usb/host/xhci-mtk-sch.c | 270 +++-
 drivers/usb/host/xhci-mtk.h |   8 +-
 2 files changed, 201 insertions(+), 77 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index b45e5bf08997..f3cdfcf4e5bf 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -32,6 +32,35 @@
 #define EP_BOFFSET(p)  ((p) & 0x3fff)
 #define EP_BREPEAT(p)  (((p) & 0x7fff) << 16)
 
+enum mtk_sch_err_type {
+   SCH_SUCCESS = 0,
+   SCH_ERR_Y6,
+   SCH_SS_OVERLAP,
+   SCH_CS_OVERFLOW,
+   SCH_BW_OVERFLOW,
+   SCH_FIXME,
+};
+
+static char *sch_error_string(enum mtk_sch_err_type error)
+{
+   switch (error) {
+   case SCH_SUCCESS:
+   return "Success";
+   case SCH_ERR_Y6:
+   return "Can't schedule Start-Split in Y6";
+   case SCH_SS_OVERLAP:
+   return "Can't find a suitable Start-Split location";
+   case SCH_CS_OVERFLOW:
+   return "The last Complete-Split is greater than 7";
+   case SCH_BW_OVERFLOW:
+   return "Bandwidth exceeds the max limit";
+   case SCH_FIXME:
+   return "FIXME, to be resolved";
+   default:
+   return "Unknown error type";
+   }
+}
+
 static int is_fs_or_ls(enum usb_device_speed speed)
 {
return speed == USB_SPEED_FULL || speed == USB_SPEED_LOW;
@@ -81,11 +110,22 @@ static u32 get_esit(struct xhci_ep_ctx *ep_ctx)
return esit;
 }
 
+static u32 get_bw_boundary(enum usb_device_speed speed)
+{
+   switch (speed) {
+   case USB_SPEED_SUPER_PLUS:
+   return SSP_BW_BOUNDARY;
+   case USB_SPEED_SUPER:
+   return SS_BW_BOUNDARY;
+   default:
+   return HS_BW_BOUNDARY;
+   }
+}
+
 static struct mu3h_sch_tt *find_tt(struct usb_device *udev)
 {
struct usb_tt *utt = udev->tt;
struct mu3h_sch_tt *tt, **tt_index, **ptt;
-   unsigned int port;
bool allocated_index = false;
 
if (!utt)
@@ -107,10 +147,9 @@ static struct mu3h_sch_tt *find_tt(struct usb_device *udev)
utt->hcpriv = tt_index;
allocated_index = true;
}
-   port = udev->ttport - 1;
-   ptt = &tt_index[port];
+
+   ptt = &tt_index[udev->ttport - 1];
} else {
-   port = 0;
ptt = (struct mu3h_sch_tt **) &utt->hcpriv;
}
 
@@ -125,8 +164,7 @@ static struct mu3h_sch_tt *find_tt(struct usb_device *udev)
return ERR_PTR(-ENOMEM);
}
INIT_LIST_HEAD(&tt->ep_list);
-   tt->usb_tt = utt;
-   tt->tt_port = port;
+
*ptt = tt;
}
 
@@ -206,6 +244,15 @@ static struct mu3h_sch_ep_info *create_sch_ep(struct 
usb_device *udev,
return sch_ep;
 }
 
+static void delete_sch_ep(struct usb_device *udev, struct mu3h_sch_ep_info 
*sch_ep)
+{
+   if (sch_ep->sch_tt)
+   drop_tt(udev);
+
+   list_del(&sch_ep->endpoint);
+   kfree(sch_ep);
+}
+
 static void setup_sch_info(struct usb_device *udev,
struct xhci_ep_ctx *ep_ctx, struct mu3h_sch_ep_info *sch_ep)
 {
@@ -375,21 +422,55 @@ static void update_bus_bw(struct mu3h_sch_bw_info *sch_bw,
sch_ep->bw_budget_table[j];
}
}
-   sch_ep->allocated = used;
 }
 
-static int check_sch_tt(struct usb_device *udev,
-   struct mu3h_sch_ep_info *sch_ep, u32 offset)
+static int check_fs_bus_bw(struct mu3h_sch_ep_info *sch_ep, int offset)
+{
+   struct mu3h_sch_tt *tt = sch_ep->sch_tt;
+   u32 num_esit, base;
+   u32 i, j;
+   u32 tmp;
+
+   num_esit = XHCI_MTK_MAX_ESIT / sch_ep->esit;
+
+   for (i = 0; i < num_esit; i++) {
+   base = offset + i * sch_ep->esit;
+
+   /*
+* Compared with hs bus, no matter what ep type
+* The hub will always delay one uframe to send
+* data for us. As described in the figure below.
+*/
+   if (sch_ep->ep_type == ISOC_OUT_EP) {
+   for (j = 0; j < sch_ep->num_budget_microframes; j++) {
+   tmp = tt->fs_bus_bw[base + 1 + j]
+   + sch_ep->bw_cost_per_microframe;
+
+   if (tmp > FS_PAYLOAD_MAX)
+ 

Re: [PATCH] perf test: Add parse-metric memory bandwidth testcase

2021-02-02 Thread Namhyung Kim
Hello,

On Mon, Jan 25, 2021 at 9:52 PM John Garry  wrote:
>
> Event duration_time in a metric expression requires special handling.
>
> Improve test coverage by including a metric whose expression includes
> duration_time. The actual metric is copied from the L1D_Cache_Fill_BW
> metric on my Broadwell machine.
>
> Signed-off-by: John Garry 

Acked-by: Namhyung Kim 

Thanks,
Namhyung


> ---
> Based on acme perf/core + "perf metricgroup: Fix for metrics containing 
> duration_time"
>
> diff --git a/tools/perf/tests/parse-metric.c b/tools/perf/tests/parse-metric.c
> index ce7be37f0d88..6dc1db1626ad 100644
> --- a/tools/perf/tests/parse-metric.c
> +++ b/tools/perf/tests/parse-metric.c
> @@ -69,6 +69,10 @@ static struct pmu_event pme_test[] = {
> .metric_expr= "1/m3",
> .metric_name= "M3",
>  },
> +{
> +   .metric_expr= "64 * l1d.replacement / 10 / duration_time",
> +   .metric_name= "L1D_Cache_Fill_BW",
> +},
>  {
> .name   = NULL,
>  }
> @@ -107,6 +111,8 @@ static void load_runtime_stat(struct runtime_stat *st, 
> struct evlist *evlist,
> evlist__for_each_entry(evlist, evsel) {
> count = find_value(evsel->name, vals);
> perf_stat__update_shadow_stats(evsel, count, 0, st);
> +   if (!strcmp(evsel->name, "duration_time"))
> +   update_stats(&walltime_nsecs_stats, count);
> }
>  }
>
> @@ -321,6 +327,23 @@ static int test_recursion_fail(void)
> return 0;
>  }
>
> +static int test_memory_bandwidth(void)
> +{
> +   double ratio;
> +   struct value vals[] = {
> +   { .event = "l1d.replacement", .val = 400 },
> +   { .event = "duration_time",  .val = 2 },
> +   { .event = NULL, },
> +   };
> +
> +   TEST_ASSERT_VAL("failed to compute metric",
> +   compute_metric("L1D_Cache_Fill_BW", vals, ) == 
> 0);
> +   TEST_ASSERT_VAL("L1D_Cache_Fill_BW, wrong ratio",
> +   1.28 == ratio);
> +
> +   return 0;
> +}
> +
>  static int test_metric_group(void)
>  {
> double ratio1, ratio2;
> @@ -353,5 +376,6 @@ int test__parse_metric(struct test *test __maybe_unused, 
> int subtest __maybe_unu
> TEST_ASSERT_VAL("DCache_L2 failed", test_dcache_l2() == 0);
> TEST_ASSERT_VAL("recursion fail failed", test_recursion_fail() == 0);
> TEST_ASSERT_VAL("test metric group", test_metric_group() == 0);
> +   TEST_ASSERT_VAL("Memory bandwidth", test_memory_bandwidth() == 0);
> return 0;
>  }
> --
> 2.26.2
>


[PATCH] PCI/LINK: Remove bandwidth notification

2021-02-02 Thread Bjorn Helgaas
From: Bjorn Helgaas 

The PCIe Bandwidth Change Notification feature logs messages when the link
bandwidth changes.  Some users have reported that these messages occur
often enough to significantly reduce NVMe performance.  GPUs also seem to
generate these messages.

We don't know why the link bandwidth changes, but in the reported cases
there's no indication that it's caused by hardware failures.

Remove the bandwidth change notifications for now.  Hopefully we can add
this back when we have a better understanding of why this happens and how
we can make the messages useful instead of overwhelming.

Link: https://lore.kernel.org/r/20200115221008.ga191...@google.com/
Link: 
https://lore.kernel.org/r/155605909349.3575.13433421148215616375.st...@gimli.home/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206197
Signed-off-by: Bjorn Helgaas 
---
 drivers/pci/pcie/Kconfig   |   8 --
 drivers/pci/pcie/Makefile  |   1 -
 drivers/pci/pcie/bw_notification.c | 138 -
 drivers/pci/pcie/portdrv.h |   6 --
 drivers/pci/pcie/portdrv_pci.c |   1 -
 5 files changed, 154 deletions(-)
 delete mode 100644 drivers/pci/pcie/bw_notification.c

diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index 3946555a6042..45a2ef702b45 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -133,14 +133,6 @@ config PCIE_PTM
  This is only useful if you have devices that support PTM, but it
  is safe to enable even if you don't.
 
-config PCIE_BW
-   bool "PCI Express Bandwidth Change Notification"
-   depends on PCIEPORTBUS
-   help
- This enables PCI Express Bandwidth Change Notification.  If
- you know link width or rate changes occur only to correct
- unreliable links, you may answer Y.
-
 config PCIE_EDR
bool "PCI Express Error Disconnect Recover support"
depends on PCIE_DPC && ACPI
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index d9697892fa3e..b2980db88cc0 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -12,5 +12,4 @@ obj-$(CONFIG_PCIEAER_INJECT)  += aer_inject.o
 obj-$(CONFIG_PCIE_PME) += pme.o
 obj-$(CONFIG_PCIE_DPC) += dpc.o
 obj-$(CONFIG_PCIE_PTM) += ptm.o
-obj-$(CONFIG_PCIE_BW)  += bw_notification.o
 obj-$(CONFIG_PCIE_EDR) += edr.o
diff --git a/drivers/pci/pcie/bw_notification.c 
b/drivers/pci/pcie/bw_notification.c
deleted file mode 100644
index 565d23cccb8b..
--- a/drivers/pci/pcie/bw_notification.c
+++ /dev/null
@@ -1,138 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0+
-/*
- * PCI Express Link Bandwidth Notification services driver
- * Author: Alexandru Gagniuc 
- *
- * Copyright (C) 2019, Dell Inc
- *
- * The PCIe Link Bandwidth Notification provides a way to notify the
- * operating system when the link width or data rate changes.  This
- * capability is required for all root ports and downstream ports
- * supporting links wider than x1 and/or multiple link speeds.
- *
- * This service port driver hooks into the bandwidth notification interrupt
- * and warns when links become degraded in operation.
- */
-
-#define dev_fmt(fmt) "bw_notification: " fmt
-
-#include "../pci.h"
-#include "portdrv.h"
-
-static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
-{
-   int ret;
-   u32 lnk_cap;
-
-   ret = pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnk_cap);
-   return (ret == PCIBIOS_SUCCESSFUL) && (lnk_cap & PCI_EXP_LNKCAP_LBNC);
-}
-
-static void pcie_enable_link_bandwidth_notification(struct pci_dev *dev)
-{
-   u16 lnk_ctl;
-
-   pcie_capability_write_word(dev, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
-
-   pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
-   lnk_ctl |= PCI_EXP_LNKCTL_LBMIE;
-   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
-}
-
-static void pcie_disable_link_bandwidth_notification(struct pci_dev *dev)
-{
-   u16 lnk_ctl;
-
-   pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctl);
-   lnk_ctl &= ~PCI_EXP_LNKCTL_LBMIE;
-   pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctl);
-}
-
-static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
-{
-   struct pcie_device *srv = context;
-   struct pci_dev *port = srv->port;
-   u16 link_status, events;
-   int ret;
-
-   ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
-   events = link_status & PCI_EXP_LNKSTA_LBMS;
-
-   if (ret != PCIBIOS_SUCCESSFUL || !events)
-   return IRQ_NONE;
-
-   pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
-   pcie_update_link_speed(port->subordinate, link_status);
-   return IRQ_WAKE_THREAD;
-}
-
-static irqreturn_t pcie_bw_notification_handler(int irq, void *context)
-{
-   struct pcie_device *

Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"

2021-02-02 Thread Alex G.

On 2/2/21 2:16 PM, Bjorn Helgaas wrote:

On Tue, Feb 02, 2021 at 01:50:20PM -0600, Alex G. wrote:

On 1/29/21 3:56 PM, Bjorn Helgaas wrote:

On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:

On 1/28/21 5:51 PM, Sinan Kaya wrote:

On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:

AFAICT, this thread petered out with no resolution.

If the bandwidth change notifications are important to somebody,
please speak up, preferably with a patch that makes the notifications
disabled by default and adds a parameter to enable them (or some other
strategy that makes sense).

I think these are potentially useful, so I don't really want to just
revert them, but if nobody thinks these are important enough to fix,
that's a possibility.


Hide behind debug or expert option by default? or even mark it as BROKEN
until someone fixes it?


Instead of making it a config option, wouldn't it be better as a kernel
parameter? People encountering this seem quite competent in passing kernel
arguments, so having a "pcie_bw_notification=off" would solve their
problems.


I don't want people to have to discover a parameter to solve issues.
If there's a parameter, notification should default to off, and people
who want notification should supply a parameter to enable it.  Same
thing for the sysfs idea.


I can imagine cases where a per-port flag would be useful. For example, a
machine with a NIC and a couple of PCIe storage drives. In this example, the
PCIe drives downtrain willy-nilly, so it's useful to turn off their
notifications, but the NIC absolutely must not downtrain. It's debatable
whether it should be default on or default off.


I think we really just need to figure out what's going on.  Then it
should be clearer how to handle it.  I'm not really in a position to
debug the root cause since I don't have the hardware or the time.


I wonder
(a) if some PCIe devices are downtraining willy-nilly to save power
(b) if this willy-nilly downtraining somehow violates PCIe spec
(c) what is the official behavior when downtraining is intentional

My theory is: YES, YES, ASPM. But I don't know how to figure this out
without having the problem hardware in hand.


If nobody can figure out what's going on, I think we'll have to make it
disabled by default.


I think most distros do "CONFIG_PCIE_BW is not set". Is that not true?


I think it *is* true that distros do not enable CONFIG_PCIE_BW.

But it's perfectly reasonable for people building their own kernels to
enable it.  It should be safe to enable all config options.  If they
do enable CONFIG_PCIE_BW, I don't want them to waste time debugging
messages they don't expect.

If we understood why these happen and could filter out the expected
ones, that would be great.  But we don't.  We've already wasted quite
a bit of Jan's and Atanas' time, and no doubt others who haven't
bothered to file bug reports.

So I think I'll queue up a patch to remove the functionality for now.
It's easily restored if somebody debugs the problem or adds a
command-line switch or something.


I think it's best we make it a module (or kernel) parameter, default=off 
for the time being.


Alex


Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"

2021-02-02 Thread Bjorn Helgaas
On Tue, Feb 02, 2021 at 01:50:20PM -0600, Alex G. wrote:
> On 1/29/21 3:56 PM, Bjorn Helgaas wrote:
> > On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:
> > > On 1/28/21 5:51 PM, Sinan Kaya wrote:
> > > > On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:
> > > > > AFAICT, this thread petered out with no resolution.
> > > > > 
> > > > > If the bandwidth change notifications are important to somebody,
> > > > > please speak up, preferably with a patch that makes the notifications
> > > > > disabled by default and adds a parameter to enable them (or some other
> > > > > strategy that makes sense).
> > > > > 
> > > > > I think these are potentially useful, so I don't really want to just
> > > > > revert them, but if nobody thinks these are important enough to fix,
> > > > > that's a possibility.
> > > > 
> > > > Hide behind debug or expert option by default? or even mark it as BROKEN
> > > > until someone fixes it?
> > > > 
> > > Instead of making it a config option, wouldn't it be better as a kernel
> > > parameter? People encountering this seem quite competent in passing kernel
> > > arguments, so having a "pcie_bw_notification=off" would solve their
> > > problems.
> > 
> > I don't want people to have to discover a parameter to solve issues.
> > If there's a parameter, notification should default to off, and people
> > who want notification should supply a parameter to enable it.  Same
> > thing for the sysfs idea.
> 
> I can imagine cases where a per-port flag would be useful. For example, a
> machine with a NIC and a couple of PCIe storage drives. In this example, the
> PCIe drives downtrain willie-nillie, so it's useful to turn off their
> notifications, but the NIC absolutely must not downtrain. It's debatable
> whether it should be default on or default off.
>
> > I think we really just need to figure out what's going on.  Then it
> > should be clearer how to handle it.  I'm not really in a position to
> > debug the root cause since I don't have the hardware or the time.
> 
> I wonder
> (a) if some PCIe devices are downtraining willie-nillie to save power
> (b) if this willie-nillie downtraining somehow violates PCIe spec
> (c) what is the official behavior when downtraining is intentional
> 
> My theory is: YES, YES, ASPM. But I don't know how to figure this out
> without having the problem hardware in hand.
> 
> > If nobody can figure out what's going on, I think we'll have to make it
> > disabled by default.
> 
> I think most distros do "CONFIG_PCIE_BW is not set". Is that not true?

I think it *is* true that distros do not enable CONFIG_PCIE_BW.

But it's perfectly reasonable for people building their own kernels to
enable it.  It should be safe to enable all config options.  If they
do enable CONFIG_PCIE_BW, I don't want them to waste time debugging
messages they don't expect.

If we understood why these happen and could filter out the expected
ones, that would be great.  But we don't.  We've already wasted quite
a bit of Jan's and Atanas' time, and no doubt others who haven't
bothered to file bug reports.

So I think I'll queue up a patch to remove the functionality for now.
It's easily restored if somebody debugs the problem or adds a
command-line switch or something.

Bjorn


Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"

2021-02-02 Thread Alex G.

On 1/29/21 3:56 PM, Bjorn Helgaas wrote:

On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:

On 1/28/21 5:51 PM, Sinan Kaya wrote:

On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:

AFAICT, this thread petered out with no resolution.

If the bandwidth change notifications are important to somebody,
please speak up, preferably with a patch that makes the notifications
disabled by default and adds a parameter to enable them (or some other
strategy that makes sense).

I think these are potentially useful, so I don't really want to just
revert them, but if nobody thinks these are important enough to fix,
that's a possibility.


Hide behind debug or expert option by default? or even mark it as BROKEN
until someone fixes it?


Instead of making it a config option, wouldn't it be better as a kernel
parameter? People encountering this seem quite competent in passing kernel
arguments, so having a "pcie_bw_notification=off" would solve their
problems.


I don't want people to have to discover a parameter to solve issues.
If there's a parameter, notification should default to off, and people
who want notification should supply a parameter to enable it.  Same
thing for the sysfs idea.


I can imagine cases where a per-port flag would be useful. For example, 
a machine with a NIC and a couple of PCIe storage drives. In this 
example, the PCIe drives downtrain willy-nilly, so it's useful to turn 
off their notifications, but the NIC absolutely must not downtrain. It's 
debatable whether it should be default on or default off.



I think we really just need to figure out what's going on.  Then it
should be clearer how to handle it.  I'm not really in a position to
debug the root cause since I don't have the hardware or the time.


I wonder
(a) if some PCIe devices are downtraining willy-nilly to save power
(b) if this willy-nilly downtraining somehow violates PCIe spec
(c) what is the official behavior when downtraining is intentional

My theory is: YES, YES, ASPM. But I don't know how to figure this out 
without having the problem hardware in hand.




If nobody can figure out what's going on, I think we'll have to make it
disabled by default.


I think most distros do "CONFIG_PCIE_BW is not set". Is that not true?

Alex


[PATCH 4/4] sched/fair: Add document for burstable CFS bandwidth control

2021-02-02 Thread Huaixin Chang
Basic description of usage and effect for CFS Bandwidth Control Burst.

Signed-off-by: Huaixin Chang 
Signed-off-by: Shanpei Chen 
---
 Documentation/scheduler/sched-bwc.rst | 70 +--
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.rst 
b/Documentation/scheduler/sched-bwc.rst
index 9801d6b284b1..0933c66cc68b 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -21,18 +21,46 @@ cfs_quota units at each period boundary. As threads consume 
this bandwidth it
 is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
+By default, CPU bandwidth consumption is strictly limited to quota within each
+given period. For the sequence of CPU usage u_i served under CFS bandwidth
+control, if for any j <= k N(j,k) is the number of periods from u_j to u_k:
+
+u_j+...+u_k <= quota * N(j,k)
+
+For a bursty sequence among which interval u_j...u_k are at the peak, CPU
+requests might have to wait for more periods to replenish enough quota.
+Otherwise, larger quota is required.
+
+With "burst" buffer, CPU requests might be served as long as:
+
+u_j+...+u_k <= B_j + quota * N(j,k)
+
+if for any j <= k N(j,k) is the number of periods from u_j to u_k and B_j is
+the accumulated quota from previous periods in burst buffer serving u_j.
+Burst buffer helps in that serving whole bursty CPU requests without throttling
+them can be done with moderate quota setting and accumulated quota in burst
+buffer, if:
+
+u_0+...+u_n <= B_0 + quota * N(0,n)
+
+where B_0 is the initial state of burst buffer. The maximum accumulated quota 
in
+the burst buffer is capped by burst. With proper burst setting, the available
+bandwidth is still determined by quota and period on the long run.
+
 Management
 --
-Quota and period are managed within the cpu subsystem via cgroupfs.
+Quota, period and burst are managed within the cpu subsystem via cgroupfs.
 
-cpu.cfs_quota_us: the total available run-time within a period (in 
microseconds)
+cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
 cpu.cfs_period_us: the length of a period (in microseconds)
+cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
 cpu.stat: exports throttling statistics [explained further below]
 
 The default values are::
 
cpu.cfs_period_us=100ms
-   cpu.cfs_quota=-1
+   cpu.cfs_quota_us=-1
+   cpu.cfs_burst_us=0
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
@@ -48,6 +76,11 @@ more detail below.
 Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
 and return the group to an unconstrained state once more.
 
+A value of 0 for cpu.cfs_burst_us indicates that the group can not accumulate
+any unused bandwidth. It makes the traditional bandwidth control behavior for
+CFS unchanged. Writing any (valid) positive value(s) into cpu.cfs_burst_us
+will enact the cap on unused bandwidth accumulation.
+
 Any updates to a group's bandwidth specification will result in it becoming
 unthrottled if it is in a constrained state.
 
@@ -65,9 +98,21 @@ This is tunable via procfs::
 Larger slice values will reduce transfer overheads, while smaller values allow
 for more fine-grained consumption.
 
+There is also a global switch to turn off burst for all groups::
+   /proc/sys/kernel/sched_cfs_bw_burst_enabled (default=1)
+
+By default it is enabled. Writing a 0 value means no accumulated CPU time can 
be
+used for any group, even if cpu.cfs_burst_us is configured.
+
+Sometimes users might want a group to burst without accumulation. This is
+tunable via::
+   /proc/sys/kernel/sched_cfs_bw_burst_onset_percent (default=0)
+
+Up to 100% runtime of cpu.cfs_burst_us might be given on setting bandwidth.
+
 Statistics
 --
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+A group's bandwidth statistics are exported via 6 fields in cpu.stat.
 
 cpu.stat:
 
@@ -75,6 +120,11 @@ cpu.stat:
 - nr_throttled: Number of times the group has been throttled/limited.
 - throttled_time: The total time duration (in nanoseconds) for which entities
   of the group have been throttled.
+- current_bw: Current runtime in global pool.
+- nr_burst: Number of periods burst occurs.
+- burst_time: Cumulative wall-time that any CPUs has used above quota in
+  respective periods
+
 
 This interface is read-only.
 
@@ -172,3 +222,15 @@ Examples
 
By using a small period here we are ensuring a consistent latency
response at the expense of burst capacity.
+
+4. Limit a group to 20% of 1 CPU, and allow accumulate up to 60% of 1 CPU
+   additionally, in case accumulation has been done.
+
+   With 50ms period, 

[PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-02-02 Thread Huaixin Chang
In this patch, we introduce the notion of CFS bandwidth burst. Unused
"quota" from previous "periods" may be accumulated and used in the
following "periods". The maximum amount of accumulated bandwidth is
bounded by "burst", and the maximum amount of CPU a group can consume in
a given period is "buffer", which is equivalent to "quota" + "burst" in
case the group has done enough accumulation.
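
A minimal sketch of the accumulation rule described above (plain C for
illustration only, not the kernel code): each period the runtime pool is
refilled by quota and then clamped to buffer = quota + burst.

	static unsigned long long refill_runtime(unsigned long long runtime,
						 unsigned long long quota,
						 unsigned long long burst)
	{
		unsigned long long buffer = quota + burst;

		runtime += quota;	/* unused runtime carries over */
		return runtime < buffer ? runtime : buffer;
	}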

Signed-off-by: Huaixin Chang 
Signed-off-by: Shanpei Chen 
---
 kernel/sched/core.c  | 91 
 kernel/sched/fair.c  |  2 ++
 kernel/sched/sched.h |  2 ++
 3 files changed, 82 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ff74fca39ed2..28e3165c685b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8590,10 +8590,12 @@ static const u64 max_cfs_runtime = MAX_BW * 
NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
+   u64 burst)
 {
int i, ret = 0, runtime_enabled, runtime_was_enabled;
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+   u64 buffer;
 
if (tg == &root_task_group)
return -EINVAL;
@@ -8620,6 +8622,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota)
if (quota != RUNTIME_INF && quota > max_cfs_runtime)
return -EINVAL;
 
+   /*
+* Bound burst to defend burst against overflow during bandwidth shift.
+*/
+   if (burst > max_cfs_runtime)
+   return -EINVAL;
+
+   if (quota == RUNTIME_INF)
+   buffer = RUNTIME_INF;
+   else
+   buffer = min(max_cfs_runtime, quota + burst);
/*
 * Prevent race between setting of cfs_rq->runtime_enabled and
 * unthrottle_offline_cfs_rqs().
@@ -8641,6 +8653,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota)
raw_spin_lock_irq(&cfs_b->lock);
cfs_b->period = ns_to_ktime(period);
cfs_b->quota = quota;
+   cfs_b->burst = burst;
+   cfs_b->buffer = buffer;
 
__refill_cfs_bandwidth_runtime(cfs_b);
 
@@ -8674,9 +8688,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota)
 
 static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 {
-   u64 quota, period;
+   u64 quota, period, burst;
 
period = ktime_to_ns(tg->cfs_bandwidth.period);
+   burst = tg->cfs_bandwidth.burst;
if (cfs_quota_us < 0)
quota = RUNTIME_INF;
else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
@@ -8684,7 +8699,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long 
cfs_quota_us)
else
return -EINVAL;
 
-   return tg_set_cfs_bandwidth(tg, period, quota);
+   return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_quota(struct task_group *tg)
@@ -8702,15 +8717,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
 
 static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 {
-   u64 quota, period;
+   u64 quota, period, burst;
 
if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
return -EINVAL;
 
period = (u64)cfs_period_us * NSEC_PER_USEC;
quota = tg->cfs_bandwidth.quota;
+   burst = tg->cfs_bandwidth.burst;
 
-   return tg_set_cfs_bandwidth(tg, period, quota);
+   return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_period(struct task_group *tg)
@@ -8723,6 +8739,35 @@ static long tg_get_cfs_period(struct task_group *tg)
return cfs_period_us;
 }
 
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
+{
+   u64 quota, period, burst;
+
+   period = ktime_to_ns(tg->cfs_bandwidth.period);
+   quota = tg->cfs_bandwidth.quota;
+   if (cfs_burst_us < 0)
+   burst = RUNTIME_INF;
+   else if ((u64)cfs_burst_us <= U64_MAX / NSEC_PER_USEC)
+   burst = (u64)cfs_burst_us * NSEC_PER_USEC;
+   else
+   return -EINVAL;
+
+   return tg_set_cfs_bandwidth(tg, period, quota, burst);
+}
+
+static long tg_get_cfs_burst(struct task_group *tg)
+{
+   u64 burst_us;
+
+   if (tg->cfs_bandwidth.burst == RUNTIME_INF)
+   return -1;
+
+   burst_us = tg->cfs_bandwidth.burst;
+   do_div(burst_us, NSEC_PER_USEC);
+
+   return burst_us;
+}
+
 static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
  struct cftype *cft)
 {
@@ -8747,6 +8792,18 @@ static int cpu_cfs_period_write_

[PATCH 3/4] sched/fair: Add cfs bandwidth burst statistics

2021-02-02 Thread Huaixin Chang
Introduce statistics exports for the burstable cfs bandwidth
controller.

The following exports are included:

current_bw: current runtime in global pool
nr_burst:   number of periods bandwidth burst occurs
burst_time: cumulative wall-time that any cpus has
used above quota in respective periods
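
For reference, with this patch a group's cpu.stat would look roughly like
the following (field names from this patch, numbers made up):

	nr_periods 120
	nr_throttled 3
	throttled_time 5000000
	current_bw 20000000
	nr_burst 2
	burst_time 1500000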

Signed-off-by: Huaixin Chang 
Signed-off-by: Shanpei Chen 
---
 kernel/sched/core.c  |  6 ++
 kernel/sched/fair.c  | 12 +++-
 kernel/sched/sched.h |  3 +++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9f1b05ad0411..d253903dbb4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8681,6 +8681,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota,
cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
}
 
+   cfs_b->previous_runtime = cfs_b->runtime;
+
/* Restart the period timer (if active) to handle new period expiry: */
if (runtime_enabled)
start_cfs_bandwidth(cfs_b, 1);
@@ -8929,6 +8931,10 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void 
*v)
seq_printf(sf, "wait_sum %llu\n", ws);
}
 
+   seq_printf(sf, "current_bw %llu\n", cfs_b->runtime);
+   seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
+   seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
+
return 0;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a7c261d206a..4cd3dc16659c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4612,7 +4612,7 @@ static inline u64 sched_cfs_bandwidth_slice(void)
 static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
u64 overrun)
 {
-   u64 refill;
+   u64 refill, runtime;
 
if (cfs_b->quota != RUNTIME_INF) {
 
@@ -4621,10 +4621,20 @@ static void __refill_cfs_bandwidth_runtime(struct 
cfs_bandwidth *cfs_b,
return;
}
 
+   if (cfs_b->previous_runtime > cfs_b->runtime) {
+   runtime = cfs_b->previous_runtime - cfs_b->runtime;
+   if (runtime > cfs_b->quota) {
+   cfs_b->burst_time += runtime - cfs_b->quota;
+   cfs_b->nr_burst++;
+   }
+   }
+
overrun = min(overrun, cfs_b->max_overrun);
refill = cfs_b->quota * overrun;
cfs_b->runtime += refill;
cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+
+   cfs_b->previous_runtime = cfs_b->runtime;
}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4f53ea8e92ce..04b0a1ce4c89 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -360,6 +360,7 @@ struct cfs_bandwidth {
u64 burst;
u64 buffer;
u64 max_overrun;
+   u64 previous_runtime;
s64 hierarchical_quota;
 
u8  idle;
@@ -372,7 +373,9 @@ struct cfs_bandwidth {
/* Statistics: */
int nr_periods;
int nr_throttled;
+   int nr_burst;
u64 throttled_time;
+   u64 burst_time;
 #endif
 };
 
-- 
2.14.4.44.g2045bb6



[PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable

2021-02-02 Thread Huaixin Chang
Accumulate unused quota from previous periods, so that the accumulated
bandwidth runtime can be used in the following periods. During
accumulation, take care of runtime overflow. The previous non-burstable
CFS bandwidth controller only assigned quota to runtime, which kept the
bookkeeping simple.

A sysctl parameter sysctl_sched_cfs_bw_burst_onset_percent is introduced to
denote what percentage of burst is granted when cfs bandwidth is set. By
default it is 0, which means no burst is allowed unless accumulated.

Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
switch for burst. It is enabled by default.

Signed-off-by: Huaixin Chang 
Signed-off-by: Shanpei Chen 
Reported-by: kernel test robot 
---
 include/linux/sched/sysctl.h |  2 ++
 kernel/sched/core.c  | 31 +
 kernel/sched/fair.c  | 47 
 kernel/sched/sched.h |  4 ++--
 kernel/sysctl.c  | 18 +
 5 files changed, 88 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 3c31ba88aca5..3400828eaf2d 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -72,6 +72,8 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
+extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
 #endif
 
 #ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 28e3165c685b..9f1b05ad0411 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -66,6 +66,16 @@ const_debug unsigned int sysctl_sched_features =
  */
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Percent of burst assigned to cfs_b->runtime on tg_set_cfs_bandwidth,
+ * 0 by default.
+ */
+unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
+
+unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
+#endif
+
 /*
  * period over which we measure -rt task CPU usage in us.
  * default: 1s
@@ -8586,7 +8596,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 /* More than 203 days if BW_SHIFT equals 20. */
-static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
+const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
@@ -8595,7 +8605,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota,
 {
int i, ret = 0, runtime_enabled, runtime_was_enabled;
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
-   u64 buffer;
+   u64 buffer, burst_onset;
 
if (tg == _task_group)
return -EINVAL;
@@ -8656,11 +8666,24 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota,
cfs_b->burst = burst;
cfs_b->buffer = buffer;
 
-   __refill_cfs_bandwidth_runtime(cfs_b);
+   cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
+   cfs_b->runtime = cfs_b->quota;
+
+   /* burst_onset needed */
+   if (cfs_b->quota != RUNTIME_INF &&
+   sysctl_sched_cfs_bw_burst_enabled &&
+   sysctl_sched_cfs_bw_burst_onset_percent > 0) {
+
+   burst_onset = div_u64(burst, 100) *
+   sysctl_sched_cfs_bw_burst_onset_percent;
+
+   cfs_b->runtime += burst_onset;
+   cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
+   }
 
/* Restart the period timer (if active) to handle new period expiry: */
if (runtime_enabled)
-   start_cfs_bandwidth(cfs_b);
+   start_cfs_bandwidth(cfs_b, 1);
 
raw_spin_unlock_irq(&cfs_b->lock);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46945349f209..6a7c261d206a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4609,10 +4609,23 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  *
  * requires cfs_b->lock
  */
-void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
+   u64 overrun)
 {
-   if (cfs_b->quota != RUNTIME_INF)
-   cfs_b->runtime = cfs_b->quota;
+   u64 refill;
+
+   if (cfs_b->quota != RUNTIME_INF) {
+
+   if (!sysctl_sched_cfs_bw_burst_enabled) {
+   cfs_b->runtime = cfs_b->quota;
+   return;
+   }
+
+   overrun = min(overrun, cfs_b->max_overrun);
+   refill = cfs_b->quota * overrun;
+   cfs_b->runtime += refill;
+   cfs_b->runtime = min(cfs_b->runtime, cfs_b->b

[PATCH 0/4] sched/fair: Burstable CFS bandwidth controller

2021-02-02 Thread Huaixin Chang
This patchset is rebased upon v5.11-rc6.
Previous archive:
https://lore.kernel.org/lkml/20201217074620.58338-1-changhuai...@linux.alibaba.com/

The CFS bandwidth controller limits the CPU requests of a task group to
quota during each period. However, parallel workloads can be bursty
enough to get throttled, and at the same time they are latency sensitive,
so throttling them is undesirable.

Scaling up period and quota allows greater burst capacity, but it might
leave the group stuck for longer until the next refill. We introduce
"burst" to allow accumulating unused quota from previous periods, to be
assigned when a task group requests more CPU than quota during a specific
period. This allows CPU time requests to be served as long as the average
requested CPU time stays below quota over the long run. The maximum
accumulation is capped by burst and is set to 0 by default, so the
traditional behaviour remains.
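
As a toy walk-through of the rule above (all numbers are made-up examples,
not taken from the patches), a group with quota = 20 ms per period and
burst = 60 ms can serve an 80 ms spike without throttling after three idle
periods:

	#include <stdio.h>

	int main(void)
	{
		const unsigned int quota = 20, burst = 60;	/* ms per period */
		const unsigned int usage[] = { 0, 0, 0, 80, 20 };	/* ms requested */
		unsigned int runtime = quota;	/* pool at start of first period */

		for (unsigned int i = 0; i < 5; i++) {
			unsigned int used = usage[i] < runtime ? usage[i] : runtime;

			printf("period %u: requested %u, served %u%s\n",
			       i, usage[i], used,
			       usage[i] > used ? " (throttled)" : "");
			runtime -= used;
			runtime += quota;		/* refill by quota */
			if (runtime > quota + burst)	/* cap at buffer */
				runtime = quota + burst;
		}
		return 0;
	}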

A huge drop of 99th tail latency from more than 500ms to 27ms is seen for
real java workloads when using burst. Similar drops are seen when
testing with schbench too:

echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 700000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

# The average CPU usage is around 500%, which is 200ms CPU time
# every 40ms.
./schbench -m 1 -t 30 -r 60 -c 10000 -R 500

Without burst:

Latency percentiles (usec)
50.0000th: 7
75.0000th: 8
90.0000th: 9
95.0000th: 10
*99.0000th: 933
99.5000th: 981
99.9000th: 3068
min=0, max=20054
rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 
9.33%

With burst:

Latency percentiles (usec)
50.0000th: 7
75.0000th: 8
90.0000th: 9
95.0000th: 9
*99.0000th: 12
99.5000th: 13
99.9000th: 19
min=0, max=406
rps: 498.36 p95 (usec) 9 p99 (usec) 12 p95/cputime 0.09% p99/cputime 
0.12%

How much a workload will benefit from burstable CFS bandwidth control
depends on how bursty and how latency sensitive it is.

Previously, Cong Wang and Konstantin Khlebnikov proposed similar
feature:
https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangc...@gmail.com/
https://lore.kernel.org/lkml/157476581065.5793.4518979877345136813.stgit@buzz/

This time we present more latency statistics and handle overflow while
accumulating.

Huaixin Chang (4):
  sched/fair: Introduce primitives for CFS bandwidth burst
  sched/fair: Make CFS bandwidth controller burstable
  sched/fair: Add cfs bandwidth burst statistics
  sched/fair: Add document for burstable CFS bandwidth control

 Documentation/scheduler/sched-bwc.rst |  49 +++--
 include/linux/sched/sysctl.h  |   2 +
 kernel/sched/core.c   | 126 +-
 kernel/sched/fair.c   |  58 +---
 kernel/sched/sched.h  |   9 ++-
 kernel/sysctl.c   |  18 +
 6 files changed, 232 insertions(+), 30 deletions(-)

-- 
2.14.4.44.g2045bb6



Re: [next v2 PATCH] usb: xhci-mtk: skip dropping bandwidth of unchecked endpoints

2021-02-01 Thread Greg Kroah-Hartman
On Tue, Feb 02, 2021 at 02:28:18PM +0800, Chunfeng Yun wrote:
> For those unchecked endpoints, we don't allocate bandwidth for
> them, so no need free the bandwidth, otherwise will decrease
> the allocated bandwidth.
> Meanwhile use xhci_dbg() instead of dev_dbg() to print logs and
> rename bw_ep_list_new as bw_ep_chk_list.
> 
> Fixes: 1d69f9d901ef ("usb: xhci-mtk: fix unreleased bandwidth data")
> Cc: stable 
> Signed-off-by: Chunfeng Yun 
> Tested-by: Ikjoon Jang 
> Reviewed-by: Ikjoon Jang 
> ---
> v2: add 'break' when find the ep that will be dropped suggested by Ikjoon
> add Tested-by and Reviewed-by Ikjoon

As v1 is already in my public tree, please send a follow-on patch that
fixes it instead.

thanks,

greg k-h


[next v2 PATCH] usb: xhci-mtk: skip dropping bandwidth of unchecked endpoints

2021-02-01 Thread Chunfeng Yun
For those unchecked endpoints, we don't allocate bandwidth for
them, so there is no need to free their bandwidth; otherwise the
allocated bandwidth will be decreased.
Meanwhile, use xhci_dbg() instead of dev_dbg() to print logs, and
rename bw_ep_list_new to bw_ep_chk_list.

Fixes: 1d69f9d901ef ("usb: xhci-mtk: fix unreleased bandwidth data")
Cc: stable 
Signed-off-by: Chunfeng Yun 
Tested-by: Ikjoon Jang 
Reviewed-by: Ikjoon Jang 
---
v2: add 'break' when find the ep that will be dropped suggested by Ikjoon
add Tested-by and Reviewed-by Ikjoon
---
 drivers/usb/host/xhci-mtk-sch.c | 59 ++---
 drivers/usb/host/xhci-mtk.h |  4 ++-
 2 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index a313e75ff1c6..b45e5bf08997 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -200,6 +200,7 @@ static struct mu3h_sch_ep_info *create_sch_ep(struct 
usb_device *udev,
 
sch_ep->sch_tt = tt;
sch_ep->ep = ep;
+   INIT_LIST_HEAD(&sch_ep->endpoint);
	INIT_LIST_HEAD(&sch_ep->tt_endpoint);
 
return sch_ep;
@@ -374,6 +375,7 @@ static void update_bus_bw(struct mu3h_sch_bw_info *sch_bw,
sch_ep->bw_budget_table[j];
}
}
+   sch_ep->allocated = used;
 }
 
 static int check_sch_tt(struct usb_device *udev,
@@ -542,6 +544,22 @@ static int check_sch_bw(struct usb_device *udev,
return 0;
 }
 
+static void destroy_sch_ep(struct usb_device *udev,
+   struct mu3h_sch_bw_info *sch_bw, struct mu3h_sch_ep_info *sch_ep)
+{
+   /* only release ep bw check passed by check_sch_bw() */
+   if (sch_ep->allocated)
+   update_bus_bw(sch_bw, sch_ep, 0);
+
+   list_del(&sch_ep->endpoint);
+
+   if (sch_ep->sch_tt) {
+   list_del(&sch_ep->tt_endpoint);
+   drop_tt(udev);
+   }
+   kfree(sch_ep);
+}
+
 static bool need_bw_sch(struct usb_host_endpoint *ep,
enum usb_device_speed speed, int has_tt)
 {
@@ -584,7 +602,7 @@ int xhci_mtk_sch_init(struct xhci_hcd_mtk *mtk)
 
mtk->sch_array = sch_array;
 
-   INIT_LIST_HEAD(&mtk->bw_ep_list_new);
+   INIT_LIST_HEAD(&mtk->bw_ep_chk_list);
 
return 0;
 }
@@ -636,29 +654,12 @@ int xhci_mtk_add_ep_quirk(struct usb_hcd *hcd, struct 
usb_device *udev,
 
setup_sch_info(udev, ep_ctx, sch_ep);
 
-   list_add_tail(&sch_ep->endpoint, &mtk->bw_ep_list_new);
+   list_add_tail(&sch_ep->endpoint, &mtk->bw_ep_chk_list);
 
return 0;
 }
 EXPORT_SYMBOL_GPL(xhci_mtk_add_ep_quirk);
 
-static void xhci_mtk_drop_ep(struct xhci_hcd_mtk *mtk, struct usb_device *udev,
-struct mu3h_sch_ep_info *sch_ep)
-{
-   struct xhci_hcd *xhci = hcd_to_xhci(mtk->hcd);
-   int bw_index = get_bw_index(xhci, udev, sch_ep->ep);
-   struct mu3h_sch_bw_info *sch_bw = &mtk->sch_array[bw_index];
-
-   update_bus_bw(sch_bw, sch_ep, 0);
-   list_del(&sch_ep->endpoint);
-
-   if (sch_ep->sch_tt) {
-   list_del(&sch_ep->tt_endpoint);
-   drop_tt(udev);
-   }
-   kfree(sch_ep);
-}
-
 void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct usb_device *udev,
struct usb_host_endpoint *ep)
 {
@@ -689,7 +690,8 @@ void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct 
usb_device *udev,
 
list_for_each_entry_safe(sch_ep, tmp, &sch_bw->bw_ep_list, endpoint) {
if (sch_ep->ep == ep) {
-   xhci_mtk_drop_ep(mtk, udev, sch_ep);
+   destroy_sch_ep(udev, sch_bw, sch_ep);
+   break;
}
}
 }
@@ -704,9 +706,9 @@ int xhci_mtk_check_bandwidth(struct usb_hcd *hcd, struct 
usb_device *udev)
struct mu3h_sch_ep_info *sch_ep, *tmp;
int bw_index, ret;
 
-   dev_dbg(&udev->dev, "%s\n", __func__);
+   xhci_dbg(xhci, "%s() udev %s\n", __func__, dev_name(&udev->dev));
 
-   list_for_each_entry(sch_ep, &mtk->bw_ep_list_new, endpoint) {
+   list_for_each_entry(sch_ep, &mtk->bw_ep_chk_list, endpoint) {
bw_index = get_bw_index(xhci, udev, sch_ep->ep);
sch_bw = &mtk->sch_array[bw_index];
 
@@ -717,7 +719,7 @@ int xhci_mtk_check_bandwidth(struct usb_hcd *hcd, struct 
usb_device *udev)
}
}
 
-   list_for_each_entry_safe(sch_ep, tmp, &mtk->bw_ep_list_new, endpoint) {
+   list_for_each_entry_safe(sch_ep, tmp, &mtk->bw_ep_chk_list, endpoint) {
struct xhci_ep_ctx *ep_ctx;
struct usb_host_endpoint *ep = sch_ep->ep;
unsigned int ep_index = xhci_get_endpoint_index(&ep->desc);
@@ -746,12 +748,17 @@ EXPORT_SYMBOL_GPL(xhci_mtk_check_bandwidth);
 void xhci_mtk_reset_bandwidth(struct usb_hcd *hcd, struct usb_device *udev)
 {
struct xhci_hcd_mtk *mtk = hcd_to_mtk(hcd);
+

Re: [next PATCH] usb: xhci-mtk: skip dropping bandwidth of unchecked endpoints

2021-02-01 Thread Chunfeng Yun
On Mon, 2021-02-01 at 17:20 +0800, Ikjoon Jang wrote:
> HI Chunfeng,
> 
> On Mon, Feb 1, 2021 at 1:58 PM Chunfeng Yun  wrote:
> >
> > For those unchecked endpoints, we don't allocate bandwidth for
> > them, so no need free the bandwidth, otherwise will decrease
> > the allocated bandwidth.
> > Meanwhile use xhci_dbg() instead of dev_dbg() to print logs and
> > rename bw_ep_list_new as bw_ep_chk_list.
> >
> > Fixes: 1d69f9d901ef ("usb: xhci-mtk: fix unreleased bandwidth data")
> > Cc: stable 
> > Signed-off-by: Chunfeng Yun 
> 
> Reviewed-and-tested-by: Ikjoon Jang 
> 
> > ---
> >  drivers/usb/host/xhci-mtk-sch.c | 61 ++---
> >  drivers/usb/host/xhci-mtk.h |  4 ++-
> >  2 files changed, 36 insertions(+), 29 deletions(-)
> >
> > diff --git a/drivers/usb/host/xhci-mtk-sch.c 
> > b/drivers/usb/host/xhci-mtk-sch.c
> > index a313e75ff1c6..dee8a329076d 100644
> > --- a/drivers/usb/host/xhci-mtk-sch.c
> > +++ b/drivers/usb/host/xhci-mtk-sch.c
> > @@ -200,6 +200,7 @@ static struct mu3h_sch_ep_info *create_sch_ep(struct 
> > usb_device *udev,
> >
> > sch_ep->sch_tt = tt;
> > sch_ep->ep = ep;
> > +   INIT_LIST_HEAD(_ep->endpoint);
> > INIT_LIST_HEAD(_ep->tt_endpoint);
> >
> > return sch_ep;
> > @@ -374,6 +375,7 @@ static void update_bus_bw(struct mu3h_sch_bw_info 
> > *sch_bw,
> > sch_ep->bw_budget_table[j];
> > }
> > }
> > +   sch_ep->allocated = used;
> 
> Yes, this is really needed!
> 
> >  }
> >
> >  static int check_sch_tt(struct usb_device *udev,
> > @@ -542,6 +544,22 @@ static int check_sch_bw(struct usb_device *udev,
> > return 0;
> >  }
> >
> > +static void destroy_sch_ep(struct usb_device *udev,
> > +   struct mu3h_sch_bw_info *sch_bw, struct mu3h_sch_ep_info *sch_ep)
> > +{
> > +   /* only release ep bw check passed by check_sch_bw() */
> > +   if (sch_ep->allocated)
> > +   update_bus_bw(sch_bw, sch_ep, 0);
> 
> So only these two lines really matter.
> 
> > +
> > +   list_del(_ep->endpoint);
> > +
> > +   if (sch_ep->sch_tt) {
> > +   list_del(_ep->tt_endpoint);
> > +   drop_tt(udev);
> > +   }
> > +   kfree(sch_ep);
> > +}
> > +
> >  static bool need_bw_sch(struct usb_host_endpoint *ep,
> > enum usb_device_speed speed, int has_tt)
> >  {
> > @@ -584,7 +602,7 @@ int xhci_mtk_sch_init(struct xhci_hcd_mtk *mtk)
> >
> > mtk->sch_array = sch_array;
> >
> > -   INIT_LIST_HEAD(>bw_ep_list_new);
> > +   INIT_LIST_HEAD(>bw_ep_chk_list);
> >
> > return 0;
> >  }
> > @@ -636,29 +654,12 @@ int xhci_mtk_add_ep_quirk(struct usb_hcd *hcd, struct 
> > usb_device *udev,
> >
> > setup_sch_info(udev, ep_ctx, sch_ep);
> >
> > -   list_add_tail(_ep->endpoint, >bw_ep_list_new);
> > +   list_add_tail(_ep->endpoint, >bw_ep_chk_list);
> >
> > return 0;
> >  }
> >  EXPORT_SYMBOL_GPL(xhci_mtk_add_ep_quirk);
> >
> > -static void xhci_mtk_drop_ep(struct xhci_hcd_mtk *mtk, struct usb_device 
> > *udev,
> > -struct mu3h_sch_ep_info *sch_ep)
> > -{
> > -   struct xhci_hcd *xhci = hcd_to_xhci(mtk->hcd);
> > -   int bw_index = get_bw_index(xhci, udev, sch_ep->ep);
> > -   struct mu3h_sch_bw_info *sch_bw = &mtk->sch_array[bw_index];
> > -
> > -   update_bus_bw(sch_bw, sch_ep, 0);
> > -   list_del(&sch_ep->endpoint);
> > -
> > -   if (sch_ep->sch_tt) {
> > -   list_del(&sch_ep->tt_endpoint);
> > -   drop_tt(udev);
> > -   }
> > -   kfree(sch_ep);
> > -}
> > -
> >  void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct usb_device *udev,
> > struct usb_host_endpoint *ep)
> >  {
> > @@ -688,9 +689,8 @@ void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct 
> > usb_device *udev,
> > sch_bw = &sch_array[bw_index];
> >
> > list_for_each_entry_safe(sch_ep, tmp, &sch_bw->bw_ep_list, 
> > endpoint) {
> > -   if (sch_ep->ep == ep) {
> > -   xhci_mtk_drop_ep(mtk, udev, sch_ep);
> > -   }
> > +   if (sch_ep->ep == ep)
> > +   destroy_sch_ep(udev, sch_bw, sch_ep);

[PATCH] media: uvc: limit max bandwidth for HDMI capture

2021-02-01 Thread Mauro Carvalho Chehab
This device:
534d:2109 MacroSilicon

Announces that it supports several frame intervals for
its resolutions with MJPEG compression:

VideoStreaming Interface Descriptor:
bLength                            46
bDescriptorType                    36
bDescriptorSubtype                  7 (FRAME_MJPEG)
bFrameIndex                         1
bmCapabilities                   0x00
  Still image unsupported
wWidth                           1920
wHeight                          1080
dwMinBitRate                   768000
dwMaxBitRate                196608000
dwMaxVideoFrameBufferSize     4147200
dwDefaultFrameInterval         166666
bFrameIntervalType                  5
dwFrameInterval( 0)            166666
dwFrameInterval( 1)            333333
dwFrameInterval( 2)            400000
dwFrameInterval( 3)            500000
dwFrameInterval( 4)           1000000

However, the shortest frame interval (166666, i.e. 60 fps) is not
actually supported. For this resolution, the shortest usable
interval is instead 333333 (30 fps).

The largest frame size that really supports such a frame interval
is 1280x720.

Add a quirk to estimate a raw bandwidth, computed as:
width * height * framerate
E.g.:
1920 * 1080 * 30 = 62208000

If the estimated bandwidth is greater than this threshold, move
on to the next value in the dwFrameInterval list.
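As a side note, the estimate boils down to a tiny helper like the one
below (a sketch only, not code from this patch; it assumes the UVC
convention that dwFrameInterval is expressed in 100 ns units, and the
helper name is made up for illustration):

#include <linux/types.h>

/* Raw bandwidth estimate: width * height * frames-per-second, where
 * the frame rate is derived from a 100 ns frame interval
 * (fps = 10000000 / interval).
 */
static u32 estimate_raw_bandwidth(u32 width, u32 height, u32 interval_100ns)
{
	return width * height * (10000000 / interval_100ns);
}

With the descriptor above, 1920 x 1080 at interval 166666 (60 fps) yields
124416000, above the 62208000 cap, so the quirk steps down to 333333
(30 fps), which matches the cap exactly.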

Signed-off-by: Mauro Carvalho Chehab 
---
 drivers/media/usb/uvc/uvc_driver.c | 15 +++
 drivers/media/usb/uvc/uvc_video.c  | 26 +++---
 drivers/media/usb/uvc/uvcvideo.h   |  2 ++
 3 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_driver.c 
b/drivers/media/usb/uvc/uvc_driver.c
index 1abc122a0977..c83a329f6527 100644
--- a/drivers/media/usb/uvc/uvc_driver.c
+++ b/drivers/media/usb/uvc/uvc_driver.c
@@ -2339,6 +2339,7 @@ static int uvc_probe(struct usb_interface *intf,
dev->info = info ? info : &uvc_quirk_none;
dev->quirks = uvc_quirks_param == -1
? dev->info->quirks : uvc_quirks_param;
+   dev->max_bandwidth = dev->info->max_bandwidth;
 
if (id->idVendor && id->idProduct)
uvc_dbg(dev, PROBE, "Probing known UVC device %s (%04x:%04x)\n",
@@ -2615,6 +2616,11 @@ static const struct uvc_device_info 
uvc_quirk_fix_bandwidth = {
.quirks = UVC_QUIRK_FIX_BANDWIDTH,
 };
 
+static const struct uvc_device_info uvc_quirk_fix_bw_622 = {
+   .quirks = UVC_QUIRK_FIX_BANDWIDTH,
+   .max_bandwidth = 62208000,
+};
+
 static const struct uvc_device_info uvc_quirk_probe_def = {
.quirks = UVC_QUIRK_PROBE_DEF,
 };
@@ -2830,6 +2836,15 @@ static const struct usb_device_id uvc_ids[] = {
  .bInterfaceSubClass   = 1,
  .bInterfaceProtocol   = 0,
  .driver_info  = (kernel_ulong_t)&uvc_quirk_fix_bandwidth },
+   /* MacroSilicon HDMI capture */
+   { .match_flags  = USB_DEVICE_ID_MATCH_DEVICE
+   | USB_DEVICE_ID_MATCH_INT_INFO,
+ .idVendor = 0x534d,
+ .idProduct= 0x2109,
+ .bInterfaceClass  = USB_CLASS_VIDEO,
+ .bInterfaceSubClass   = 1,
+ .bInterfaceProtocol   = 0,
+ .driver_info  = (kernel_ulong_t)&uvc_quirk_fix_bw_622 },
/* Genesys Logic USB 2.0 PC Camera */
{ .match_flags  = USB_DEVICE_ID_MATCH_DEVICE
| USB_DEVICE_ID_MATCH_INT_INFO,
diff --git a/drivers/media/usb/uvc/uvc_video.c 
b/drivers/media/usb/uvc/uvc_video.c
index f2f565281e63..4afc1fbe0801 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -162,9 +162,29 @@ static void uvc_fixup_video_ctrl(struct uvc_streaming 
*stream,
if ((ctrl->dwMaxPayloadTransferSize & 0x) == 0x)
ctrl->dwMaxPayloadTransferSize &= ~0x;
 
-   if (!(format->flags & UVC_FMT_FLAG_COMPRESSED) &&
-   stream->dev->quirks & UVC_QUIRK_FIX_BANDWIDTH &&
-   stream->intf->num_altsetting > 1) {
+
+   if (!(stream->dev->quirks & UVC_QUIRK_FIX_BANDWIDTH))
+   return;
+
+   /* Handle UVC_QUIRK_FIX_BANDWIDTH */
+
+   if (format->flags & UVC_FMT_FLAG_COMPRESSED &&
+   stream->dev->max_bandwidth && frame->bFrameIntervalType) {
+   u32 bandwidth;
+
+   for (i = 0; i < frame->bFrameIntervalType - 1; ++i) {
+   bandwidth = frame->wWidth * frame->wHeight;
+   bandwidth *= 10000000 / frame->dwFrameInterval[i];
+
+   if (bandwidth <= stream->dev->max_bandwidth)
+   break;
+   }
+
+   ctrl->dwFrameI

Re: [next PATCH] usb: xhci-mtk: skip dropping bandwidth of unchecked endpoints

2021-02-01 Thread Ikjoon Jang
HI Chunfeng,

On Mon, Feb 1, 2021 at 1:58 PM Chunfeng Yun  wrote:
>
> For those unchecked endpoints, we don't allocate bandwidth for
> them, so there is no need to free the bandwidth; otherwise the
> allocated bandwidth will be wrongly decreased.
> Meanwhile, use xhci_dbg() instead of dev_dbg() to print logs, and
> rename bw_ep_list_new to bw_ep_chk_list.
>
> Fixes: 1d69f9d901ef ("usb: xhci-mtk: fix unreleased bandwidth data")
> Cc: stable 
> Signed-off-by: Chunfeng Yun 

Reviewed-and-tested-by: Ikjoon Jang 

> ---
>  drivers/usb/host/xhci-mtk-sch.c | 61 ++---
>  drivers/usb/host/xhci-mtk.h |  4 ++-
>  2 files changed, 36 insertions(+), 29 deletions(-)
>
> diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
> index a313e75ff1c6..dee8a329076d 100644
> --- a/drivers/usb/host/xhci-mtk-sch.c
> +++ b/drivers/usb/host/xhci-mtk-sch.c
> @@ -200,6 +200,7 @@ static struct mu3h_sch_ep_info *create_sch_ep(struct 
> usb_device *udev,
>
> sch_ep->sch_tt = tt;
> sch_ep->ep = ep;
> +   INIT_LIST_HEAD(&sch_ep->endpoint);
> INIT_LIST_HEAD(&sch_ep->tt_endpoint);
>
> return sch_ep;
> @@ -374,6 +375,7 @@ static void update_bus_bw(struct mu3h_sch_bw_info *sch_bw,
> sch_ep->bw_budget_table[j];
> }
> }
> +   sch_ep->allocated = used;

Yes, this is really needed!

>  }
>
>  static int check_sch_tt(struct usb_device *udev,
> @@ -542,6 +544,22 @@ static int check_sch_bw(struct usb_device *udev,
> return 0;
>  }
>
> +static void destroy_sch_ep(struct usb_device *udev,
> +   struct mu3h_sch_bw_info *sch_bw, struct mu3h_sch_ep_info *sch_ep)
> +{
> +   /* only release ep bw check passed by check_sch_bw() */
> +   if (sch_ep->allocated)
> +   update_bus_bw(sch_bw, sch_ep, 0);

So only these two lines really matter.

> +
> +   list_del(&sch_ep->endpoint);
> +
> +   if (sch_ep->sch_tt) {
> +   list_del(&sch_ep->tt_endpoint);
> +   drop_tt(udev);
> +   }
> +   kfree(sch_ep);
> +}
> +
>  static bool need_bw_sch(struct usb_host_endpoint *ep,
> enum usb_device_speed speed, int has_tt)
>  {
> @@ -584,7 +602,7 @@ int xhci_mtk_sch_init(struct xhci_hcd_mtk *mtk)
>
> mtk->sch_array = sch_array;
>
> -   INIT_LIST_HEAD(&mtk->bw_ep_list_new);
> +   INIT_LIST_HEAD(&mtk->bw_ep_chk_list);
>
> return 0;
>  }
> @@ -636,29 +654,12 @@ int xhci_mtk_add_ep_quirk(struct usb_hcd *hcd, struct 
> usb_device *udev,
>
> setup_sch_info(udev, ep_ctx, sch_ep);
>
> -   list_add_tail(&sch_ep->endpoint, &mtk->bw_ep_list_new);
> +   list_add_tail(&sch_ep->endpoint, &mtk->bw_ep_chk_list);
>
> return 0;
>  }
>  EXPORT_SYMBOL_GPL(xhci_mtk_add_ep_quirk);
>
> -static void xhci_mtk_drop_ep(struct xhci_hcd_mtk *mtk, struct usb_device 
> *udev,
> -struct mu3h_sch_ep_info *sch_ep)
> -{
> -   struct xhci_hcd *xhci = hcd_to_xhci(mtk->hcd);
> -   int bw_index = get_bw_index(xhci, udev, sch_ep->ep);
> -   struct mu3h_sch_bw_info *sch_bw = &mtk->sch_array[bw_index];
> -
> -   update_bus_bw(sch_bw, sch_ep, 0);
> -   list_del(&sch_ep->endpoint);
> -
> -   if (sch_ep->sch_tt) {
> -   list_del(&sch_ep->tt_endpoint);
> -   drop_tt(udev);
> -   }
> -   kfree(sch_ep);
> -}
> -
>  void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct usb_device *udev,
> struct usb_host_endpoint *ep)
>  {
> @@ -688,9 +689,8 @@ void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct 
> usb_device *udev,
> sch_bw = &sch_array[bw_index];
>
> list_for_each_entry_safe(sch_ep, tmp, &sch_bw->bw_ep_list, endpoint) {
> -   if (sch_ep->ep == ep) {
> -   xhci_mtk_drop_ep(mtk, udev, sch_ep);
> -   }
> +   if (sch_ep->ep == ep)
> +   destroy_sch_ep(udev, sch_bw, sch_ep);

not so critical but I've also missed 'break' here.
Can you please add a break statement here?

> }
>  }
>  EXPORT_SYMBOL_GPL(xhci_mtk_drop_ep_quirk);
> @@ -704,9 +704,9 @@ int xhci_mtk_check_bandwidth(struct usb_hcd *hcd, struct 
> usb_device *udev)
> struct mu3h_sch_ep_info *sch_ep, *tmp;
> int bw_index, ret;
>
> -   dev_dbg(&udev->dev, "%s\n", __func__);
> +   xhci_dbg(xhci, "%s() udev %s\n", __func__, dev_name(&udev->dev));
>
> -   list_for_each_entry(sch_ep, &mtk->bw_ep_list_new, endpoint) {
> +   list_for_each_entry(sch_ep, &mtk->bw_ep_chk_list, endpoint) {

[next PATCH] usb: xhci-mtk: skip dropping bandwidth of unchecked endpoints

2021-01-31 Thread Chunfeng Yun
For those unchecked endpoints, we don't allocate bandwidth for
them, so there is no need to free the bandwidth; otherwise the
allocated bandwidth will be wrongly decreased.
Meanwhile, use xhci_dbg() instead of dev_dbg() to print logs, and
rename bw_ep_list_new to bw_ep_chk_list.

Fixes: 1d69f9d901ef ("usb: xhci-mtk: fix unreleased bandwidth data")
Cc: stable 
Signed-off-by: Chunfeng Yun 
---
 drivers/usb/host/xhci-mtk-sch.c | 61 ++---
 drivers/usb/host/xhci-mtk.h |  4 ++-
 2 files changed, 36 insertions(+), 29 deletions(-)

diff --git a/drivers/usb/host/xhci-mtk-sch.c b/drivers/usb/host/xhci-mtk-sch.c
index a313e75ff1c6..dee8a329076d 100644
--- a/drivers/usb/host/xhci-mtk-sch.c
+++ b/drivers/usb/host/xhci-mtk-sch.c
@@ -200,6 +200,7 @@ static struct mu3h_sch_ep_info *create_sch_ep(struct 
usb_device *udev,
 
sch_ep->sch_tt = tt;
sch_ep->ep = ep;
+   INIT_LIST_HEAD(&sch_ep->endpoint);
INIT_LIST_HEAD(&sch_ep->tt_endpoint);
 
return sch_ep;
@@ -374,6 +375,7 @@ static void update_bus_bw(struct mu3h_sch_bw_info *sch_bw,
sch_ep->bw_budget_table[j];
}
}
+   sch_ep->allocated = used;
 }
 
 static int check_sch_tt(struct usb_device *udev,
@@ -542,6 +544,22 @@ static int check_sch_bw(struct usb_device *udev,
return 0;
 }
 
+static void destroy_sch_ep(struct usb_device *udev,
+   struct mu3h_sch_bw_info *sch_bw, struct mu3h_sch_ep_info *sch_ep)
+{
+   /* only release ep bw check passed by check_sch_bw() */
+   if (sch_ep->allocated)
+   update_bus_bw(sch_bw, sch_ep, 0);
+
+   list_del(&sch_ep->endpoint);
+
+   if (sch_ep->sch_tt) {
+   list_del(&sch_ep->tt_endpoint);
+   drop_tt(udev);
+   }
+   kfree(sch_ep);
+}
+
 static bool need_bw_sch(struct usb_host_endpoint *ep,
enum usb_device_speed speed, int has_tt)
 {
@@ -584,7 +602,7 @@ int xhci_mtk_sch_init(struct xhci_hcd_mtk *mtk)
 
mtk->sch_array = sch_array;
 
-   INIT_LIST_HEAD(&mtk->bw_ep_list_new);
+   INIT_LIST_HEAD(&mtk->bw_ep_chk_list);
 
return 0;
 }
@@ -636,29 +654,12 @@ int xhci_mtk_add_ep_quirk(struct usb_hcd *hcd, struct 
usb_device *udev,
 
setup_sch_info(udev, ep_ctx, sch_ep);
 
-   list_add_tail(&sch_ep->endpoint, &mtk->bw_ep_list_new);
+   list_add_tail(&sch_ep->endpoint, &mtk->bw_ep_chk_list);
 
return 0;
 }
 EXPORT_SYMBOL_GPL(xhci_mtk_add_ep_quirk);
 
-static void xhci_mtk_drop_ep(struct xhci_hcd_mtk *mtk, struct usb_device *udev,
-struct mu3h_sch_ep_info *sch_ep)
-{
-   struct xhci_hcd *xhci = hcd_to_xhci(mtk->hcd);
-   int bw_index = get_bw_index(xhci, udev, sch_ep->ep);
-   struct mu3h_sch_bw_info *sch_bw = &mtk->sch_array[bw_index];
-
-   update_bus_bw(sch_bw, sch_ep, 0);
-   list_del(&sch_ep->endpoint);
-
-   if (sch_ep->sch_tt) {
-   list_del(&sch_ep->tt_endpoint);
-   drop_tt(udev);
-   }
-   kfree(sch_ep);
-}
-
 void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct usb_device *udev,
struct usb_host_endpoint *ep)
 {
@@ -688,9 +689,8 @@ void xhci_mtk_drop_ep_quirk(struct usb_hcd *hcd, struct 
usb_device *udev,
sch_bw = &sch_array[bw_index];
 
list_for_each_entry_safe(sch_ep, tmp, &sch_bw->bw_ep_list, endpoint) {
-   if (sch_ep->ep == ep) {
-   xhci_mtk_drop_ep(mtk, udev, sch_ep);
-   }
+   if (sch_ep->ep == ep)
+   destroy_sch_ep(udev, sch_bw, sch_ep);
}
 }
 EXPORT_SYMBOL_GPL(xhci_mtk_drop_ep_quirk);
@@ -704,9 +704,9 @@ int xhci_mtk_check_bandwidth(struct usb_hcd *hcd, struct 
usb_device *udev)
struct mu3h_sch_ep_info *sch_ep, *tmp;
int bw_index, ret;
 
-   dev_dbg(&udev->dev, "%s\n", __func__);
+   xhci_dbg(xhci, "%s() udev %s\n", __func__, dev_name(&udev->dev));
 
-   list_for_each_entry(sch_ep, &mtk->bw_ep_list_new, endpoint) {
+   list_for_each_entry(sch_ep, &mtk->bw_ep_chk_list, endpoint) {
bw_index = get_bw_index(xhci, udev, sch_ep->ep);
sch_bw = &mtk->sch_array[bw_index];
 
@@ -717,7 +717,7 @@ int xhci_mtk_check_bandwidth(struct usb_hcd *hcd, struct 
usb_device *udev)
}
}
 
-   list_for_each_entry_safe(sch_ep, tmp, &mtk->bw_ep_list_new, endpoint) {
+   list_for_each_entry_safe(sch_ep, tmp, &mtk->bw_ep_chk_list, endpoint) {
struct xhci_ep_ctx *ep_ctx;
struct usb_host_endpoint *ep = sch_ep->ep;
unsigned int ep_index = xhci_get_endpoint_index(&ep->desc);
@@ -746,12 +746,17 @@ EXPORT_SYMBOL_GPL(xhci_mtk_check_bandwidth);
 void xhci_mtk_reset_bandwidth(struct usb_hcd *hcd, struct usb_device *udev)
 {
struct xhci_hcd_mtk *mtk = hcd_to_mtk(hcd);
+   struct xhci_hcd *xhci = hcd_to_xhci(hcd);
+   struct mu3h_sch_bw_info *sch

Re: [PATCH v7] usb: xhci-mtk: fix unreleased bandwidth data

2021-01-31 Thread Chunfeng Yun
On Fri, 2021-01-29 at 11:27 +0100, Greg Kroah-Hartman wrote:
> On Fri, Jan 29, 2021 at 05:38:19PM +0800, Chunfeng Yun wrote:
> > From: Ikjoon Jang 
> > 
> > xhci-mtk needs XHCI_MTK_HOST quirk functions in add_endpoint() and
> > drop_endpoint() to handle its own sw bandwidth management.
> > 
> > It stores bandwidth data into an internal table every time
> > add_endpoint() is called, and drops those in drop_endpoint().
> > But when bandwidth allocation fails at one endpoint, all earlier
> > allocation from the same interface could still remain at the table.
> > 
> > This patch moves bandwidth management codes to check_bandwidth() and
> > reset_bandwidth() path. To do so, this patch also adds those functions
> > to xhci_driver_overrides and lets mtk-xhci to release all failed
> > endpoints in reset_bandwidth() path.
> > 
> > Fixes: 08e469de87a2 ("usb: xhci-mtk: supports bandwidth scheduling with 
> > multi-TT")
> > Signed-off-by: Ikjoon Jang 
> > Signed-off-by: Chunfeng Yun 
> > ---
> > Changes in v7 from Chunfeng:
> > - rename xhci_mtk_drop_ep() as destroy_sch_ep(), and adjust its parameters
> > - add member @allocated in mu3h_sch_ep_info struct,
> >   used to skip endpoints that were not allocated bandwidth
> > - use xhci_dbg() instead of dev_dbg()
> > - rename bw_ep_list_new as bw_ep_chk_list
> 
> As a previous version of this patch is already in my public tree, just
> send a follow-on patch that resolves the issues in the previous one, as
> I can not apply this one.  Bonus is that you get the credit for fixing
> these issues :)
Ok, thanks

> 
> thanks,
> 
> greg k-h



Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"

2021-01-29 Thread Bjorn Helgaas
On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:
> On 1/28/21 5:51 PM, Sinan Kaya wrote:
> > On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:
> > > AFAICT, this thread petered out with no resolution.
> > > 
> > > If the bandwidth change notifications are important to somebody,
> > > please speak up, preferably with a patch that makes the notifications
> > > disabled by default and adds a parameter to enable them (or some other
> > > strategy that makes sense).
> > > 
> > > I think these are potentially useful, so I don't really want to just
> > > revert them, but if nobody thinks these are important enough to fix,
> > > that's a possibility.
> > 
> > Hide behind debug or expert option by default? or even mark it as BROKEN
> > until someone fixes it?
> > 
> Instead of making it a config option, wouldn't it be better as a kernel
> parameter? People encountering this seem quite competent in passing kernel
> arguments, so having a "pcie_bw_notification=off" would solve their
> problems.

I don't want people to have to discover a parameter to solve issues.
If there's a parameter, notification should default to off, and people
who want notification should supply a parameter to enable it.  Same
thing for the sysfs idea.
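For illustration only, a default-off switch along those lines might look
like the sketch below (this parameter does not exist in the tree; the name
and wiring are hypothetical):

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Hypothetical opt-in switch for the bandwidth notification service,
 * defaulting to off as suggested above.
 */
static bool pcie_bw_notification;
module_param(pcie_bw_notification, bool, 0444);
MODULE_PARM_DESC(pcie_bw_notification,
		 "Enable PCIe link bandwidth change notification messages (default: off)");

The notification service's probe would then return early, never registering
its interrupt handler, when the flag is clear.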

I think we really just need to figure out what's going on.  Then it
should be clearer how to handle it.  I'm not really in a position to
debug the root cause since I don't have the hardware or the time.  If
nobody can figure out what's going on, I think we'll have to make it
disabled by default.

> As far as marking this as broken, I've seen no conclusive evidence to
> tell whether it's a sw bug or an actual hardware problem. Could we have a sysfs to
> disable this on a per-downstream-port basis?
> 
> e.g.
> echo 0 > /sys/bus/pci/devices/0000:00:04.0/bw_notification_enabled
> 
> This probably won't be ideal if there are many devices downtraining their
> links ad-hoc. At worst we'd have a way to silence those messages if we do
> encounter such devices.
> 
> Alex


Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"

2021-01-28 Thread Alex Deucher
On Thu, Jan 28, 2021 at 6:39 PM Bjorn Helgaas  wrote:
>
> [+cc Atanas -- thank you very much for the bug report!]
>
> On Sat, Feb 22, 2020 at 10:58:40AM -0600, Bjorn Helgaas wrote:
> > On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
> > > I think we have a problem with link bandwidth change notifications
> > > (see 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
> > >
> > > Here's a recent bug report where Jan reported "_tons_" of these
> > > notifications on an nvme device:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=206197
> >
> > AFAICT, this thread petered out with no resolution.
> >
> > If the bandwidth change notifications are important to somebody,
> > please speak up, preferably with a patch that makes the notifications
> > disabled by default and adds a parameter to enable them (or some other
> > strategy that makes sense).
> >
> > I think these are potentially useful, so I don't really want to just
> > revert them, but if nobody thinks these are important enough to fix,
> > that's a possibility.
>
> Atanas is also seeing this problem and went to the trouble of digging
> up this bug report:
> https://bugzilla.kernel.org/show_bug.cgi?id=206197#c8
>
> I'm actually a little surprised that we haven't seen more reports of
> this.  I don't think distros enable CONFIG_PCIE_BW, but even so, I
> would think more people running upstream kernels would trip over it.
> But maybe people just haven't turned CONFIG_PCIE_BW on.
>
> I don't have a suggestion; just adding Atanas to this old thread.
>
> > > There was similar discussion involving GPU drivers at
> > > https://lore.kernel.org/r/20190429185611.121751-2-helg...@kernel.org
> > >
> > > The current solution is the CONFIG_PCIE_BW config option, which
> > > disables the messages completely.  That option defaults to "off" (no
> > > messages), but even so, I think it's a little problematic.
> > >
> > > Users are not really in a position to figure out whether it's safe to
> > > enable.  All they can do is experiment and see whether it works with
> > > their current mix of devices and drivers.
> > >
> > > I don't think it's currently useful for distros because it's a
> > > compile-time switch, and distros cannot predict what system configs
> > > will be used, so I don't think they can enable it.
> > >
> > > Does anybody have proposals for making it smarter about distinguishing
> > > real problems from intentional power management, or maybe interfaces
> > > drivers could use to tell us when we should ignore bandwidth changes?

There's also this recently filed bug:
https://gitlab.freedesktop.org/drm/amd/-/issues/1447
The root cause of it appears to be related to ASPM.

Alex


Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"

2021-01-28 Thread Alex G.

On 1/28/21 5:51 PM, Sinan Kaya wrote:

On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:

AFAICT, this thread petered out with no resolution.

If the bandwidth change notifications are important to somebody,
please speak up, preferably with a patch that makes the notifications
disabled by default and adds a parameter to enable them (or some other
strategy that makes sense).

I think these are potentially useful, so I don't really want to just
revert them, but if nobody thinks these are important enough to fix,
that's a possibility.


Hide behind debug or expert option by default? or even mark it as BROKEN
until someone fixes it?

Instead of making it a config option, wouldn't it be better as a kernel 
parameter? People encountering this seem quite competent in passing 
kernel arguments, so having a "pcie_bw_notification=off" would solve 
their problems.


As far as marking this as broken, I've seen no conclusive evidence to 
tell whether it's a sw bug or an actual hardware problem. Could we have a sysfs 
to disable this on a per-downstream-port basis?


e.g.
echo 0 > /sys/bus/pci/devices/0000:00:04.0/bw_notification_enabled

This probably won't be ideal if there are many devices downtraining 
their links ad-hoc. At worst we'd have a way to silence those messages 
if we do encounter such devices.
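A rough sketch of such a per-port knob follows (purely illustrative: the
attribute does not exist today, and a real version would keep the flag in
the port's service data rather than a single global):

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>

/* Sketch only: one global flag; real code would be per downstream port. */
static bool bw_notification_enabled = true;

static ssize_t bw_notification_enabled_show(struct device *dev,
					    struct device_attribute *attr,
					    char *buf)
{
	return sysfs_emit(buf, "%d\n", bw_notification_enabled);
}

static ssize_t bw_notification_enabled_store(struct device *dev,
					     struct device_attribute *attr,
					     const char *buf, size_t count)
{
	bool enable;

	if (kstrtobool(buf, &enable))
		return -EINVAL;

	bw_notification_enabled = enable;
	return count;
}
static DEVICE_ATTR_RW(bw_notification_enabled);

The bandwidth notification interrupt handler would then check this flag
before logging, so the messages could be silenced per port at runtime.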


Alex

