Re: [PATCH v8] bus: mhi: host: Add tracing support

2024-01-04 Thread Baochen Qiang




On 1/4/2024 12:47 PM, Krishna Chaitanya Chundru wrote:

Hi Steven,

Can you please review this.

Thanks & Regards,

Krishna Chaitanya.

On 12/7/2023 10:00 AM, Krishna chaitanya chundru wrote:

This change adds ftrace support for the following functions, which
helps in debugging issues when there are channel state and MHI
state changes and when data and control events are received:
1. mhi_intvec_mhi_states
2. mhi_process_data_event_ring
3. mhi_process_ctrl_ev_ring
4. mhi_gen_tre
5. mhi_update_channel_state
6. mhi_tryset_pm_state
7. mhi_pm_st_worker

Wherever the trace events are added, debug messages are removed.
Is there a reason why the debug messages have to be removed? From the
MHI controller's point of view, we often need MHI logs to debug, so is
it possible to preserve those logs?
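
One possible middle ground, sketched below purely as an illustration
(not part of the posted patch), is to keep the dev_dbg() next to the
new tracepoint, e.g. in mhi_intvec_threaded_handler():

    /* Keep the dev_dbg() for MHI controller logs and add the
     * tracepoint beside it; tracepoints are nearly free when
     * disabled. */
    dev_dbg(dev, "local ee: %s state: %s device ee: %s state: %s\n",
            TO_MHI_EXEC_STR(mhi_cntrl->ee),
            mhi_state_str(mhi_cntrl->dev_state),
            TO_MHI_EXEC_STR(ee), mhi_state_str(state));
    trace_mhi_intvec_states(mhi_cntrl, ee, state);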


Signed-off-by: Krishna chaitanya chundru 
---
Changes in v8:
- Pass the structure and dereference the variables in TP_fast_assign as 
suggested by Steve
- Link to v7: 
https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com


Changes in v7:
- Change log format as pointed out by Mani.
- Link to v6: 
https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com


Changes in v6:
- Use 'rp' directly as suggested by Jeffrey.
- Link to v5: 
https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com


Changes in v5:
- Use DECLARE_EVENT_CLASS for multiple events as suggested by Steve.
- Instead of converting to u64 to print the address, use %px to print
  the address to avoid warnings on some platforms.
- Link to v4: 
https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com


Changes in v4:
- Fix compilation issues in the previous patch which happened due to
  rebasing.
- The FTRACE config is not enabled in the defconfig, which is why the
  compilation issue was not seen in my workspace.
- Link to v3: 
https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com


Changes in v3:
- Move the trace header file from include/trace/events to
  drivers/bus/mhi/host/ so that we can include driver header files.
- Use macros directly in the trace events as suggested by Jeffrey Hugo.
- Reorder the structure in the events as suggested by Steve to avoid
  holes in the buffer.
- Removed the mhi_to_physical function as it can cause security issues.
- Removed macros to define strings as we can get those from driver
  headers.
- Link to v2: 
https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com


Changes in v2:
- Passing the raw state into the trace event and using
  __print_symbolic() as suggested by Bjorn.
- Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by Bjorn.
- Fixed the kernel test robot issues.
- Link to v1: 
https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com

---
  drivers/bus/mhi/host/init.c  |   3 +
  drivers/bus/mhi/host/main.c  |  19 ++--
  drivers/bus/mhi/host/pm.c    |   7 +-
  drivers/bus/mhi/host/trace.h | 205 +++++++++++++++++++++++++++++++++++

  4 files changed, 221 insertions(+), 13 deletions(-)

diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
index f78aefd2d7a3..6acb85f4c5f8 100644
--- a/drivers/bus/mhi/host/init.c
+++ b/drivers/bus/mhi/host/init.c
@@ -20,6 +20,9 @@
  #include 
  #include "internal.h"
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
  static DEFINE_IDA(mhi_controller_ida);
  const char * const mhi_ee_str[MHI_EE_MAX] = {
diff --git a/drivers/bus/mhi/host/main.c b/drivers/bus/mhi/host/main.c
index dcf627b36e82..189f4786403e 100644
--- a/drivers/bus/mhi/host/main.c
+++ b/drivers/bus/mhi/host/main.c
@@ -15,6 +15,7 @@
  #include 
  #include 
  #include "internal.h"
+#include "trace.h"
  int __must_check mhi_read_reg(struct mhi_controller *mhi_cntrl,
    void __iomem *base, u32 offset, u32 *out)
@@ -491,11 +492,8 @@ irqreturn_t mhi_intvec_threaded_handler(int 
irq_number, void *priv)

  state = mhi_get_mhi_state(mhi_cntrl);
  ee = mhi_get_exec_env(mhi_cntrl);
-    dev_dbg(dev, "local ee: %s state: %s device ee: %s state: %s\n",
-    TO_MHI_EXEC_STR(mhi_cntrl->ee),
-    mhi_state_str(mhi_cntrl->dev_state),
-    TO_MHI_EXEC_STR(ee), mhi_state_str(state));
+    trace_mhi_intvec_states(mhi_cntrl, ee, state);
  if (state == MHI_STATE_SYS_ERR) {
  dev_dbg(dev, "System error detected\n");
  pm_state = mhi_tryset_pm_state(mhi_cntrl,
@@ -832,6 +830,8 @@ int mhi_process_ctrl_ev_ring(struct mhi_controller 
*mhi_cntrl,

  while (dev_rp != local_rp) {
  enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
+    trace_mhi_ctrl_event(mhi_cntrl, local_rp);
+
  switch (type) {
  case MHI_PKT_TYPE_BW_REQ_EVENT:
  {
@@ -997,6 +997,8 @@ int mhi_process_data_event_ring(struct 
mhi_controller *mhi_cntrl,

  while (dev_rp != local_rp && event_quota > 0) {
  enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
+    

Re: [PATCH v8] bus: mhi: host: Add tracing support

2024-01-04 Thread Krishna Chaitanya Chundru



On 1/4/2024 9:31 PM, Steven Rostedt wrote:

On Thu, 7 Dec 2023 10:00:47 +0530
Krishna chaitanya chundru  wrote:


This change adds ftrace support for the following functions, which
helps in debugging issues when there are channel state and MHI
state changes and when data and control events are received:
1. mhi_intvec_mhi_states
2. mhi_process_data_event_ring
3. mhi_process_ctrl_ev_ring
4. mhi_gen_tre
5. mhi_update_channel_state
6. mhi_tryset_pm_state
7. mhi_pm_st_worker

Wherever the trace events are added, debug messages are removed.

Signed-off-by: Krishna chaitanya chundru 
---
Changes in v8:
- Pass the structure and dereference the variables in TP_fast_assign as 
suggested by Steve
- Link to v7: 
https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com

So this looks good from a tracing POV.

Reviewed-by: Steven Rostedt (Google) 

But I do have some more comments that could be addressed by new patches.




+TRACE_EVENT(mhi_intvec_states,
+
+   TP_PROTO(struct mhi_controller *mhi_cntrl, int dev_ee, int dev_state),
+
+   TP_ARGS(mhi_cntrl, dev_ee, dev_state),
+
+   TP_STRUCT__entry(
+   __string(name, mhi_cntrl->mhi_dev->name)
+   __field(int, local_ee)
+   __field(int, state)
+   __field(int, dev_ee)
+   __field(int, dev_state)
+   ),
+
+   TP_fast_assign(
+   __assign_str(name, mhi_cntrl->mhi_dev->name);
+   __entry->local_ee = mhi_cntrl->ee;
+   __entry->state = mhi_cntrl->dev_state;
+   __entry->dev_ee = dev_ee;
+   __entry->dev_state = dev_state;
+   ),
+
+   TP_printk("%s: local ee: %s state: %s device ee: %s state: %s\n",
+ __get_str(name),
+ TO_MHI_EXEC_STR(__entry->local_ee),
+ mhi_state_str(__entry->state),
+ TO_MHI_EXEC_STR(__entry->dev_ee),
+ mhi_state_str(__entry->dev_state))

So the above may have issues with user space parsing.

For one, that mhi_state_str() is:

static inline const char *mhi_state_str(enum mhi_state state)
{
 switch (state) {
 case MHI_STATE_RESET:
 return "RESET";
 case MHI_STATE_READY:
 return "READY";
 case MHI_STATE_M0:
 return "M0";
 case MHI_STATE_M1:
 return "M1";
 case MHI_STATE_M2:
 return "M2";
 case MHI_STATE_M3:
 return "M3";
 case MHI_STATE_M3_FAST:
 return "M3 FAST";
 case MHI_STATE_BHI:
 return "BHI";
 case MHI_STATE_SYS_ERR:
 return "SYS ERROR";
 default:
 return "Unknown state";
 }
};
  
Which if this could be changed into:


#define MHI_STATE_LIST  \
EM(RESET,"RESET") \
EM(READY,"READY") \
EM(M0,   "M0")\
EM(M1,   "M1")\
EM(M2,   "M2")\
EM(M3,   "M3")\
EM(M3_FAST,  "M3_FAST")   \
EM(BHI,  "BHI")   \
EMe(SYS_ERR, "SYS ERROR")

#undef EM
#undef EMe

#define EM(a, b)  case MHI_STATE_##a: return b;
#define EMe(a, b) case MHI_STATE_##a: return b;

static inline const char *mhi_state_str(enum mhi_state state)
{
	switch (state) {
	MHI_STATE_LIST
	default:
		return "Unknown state";
	}
}

Then you could use that macro in the trace header:

#undef EM
#undef EMe

#define EM(a, b)	TRACE_DEFINE_ENUM(MHI_STATE_##a);
#define EMe(a, b)	TRACE_DEFINE_ENUM(MHI_STATE_##a);

MHI_STATE_LIST

And in the print fmts:

#undef EM
#undef EMe

#define EM(a, b)   { MHI_STATE_##a, b },
#define EMe(a, b)  { MHI_STATE_##a, b }


TP_printk("%s: local ee: %s state: %s device ee: %s state: %s\n",
	  __get_str(name),
	  TO_MHI_EXEC_STR(__entry->local_ee),
	  __print_symbolic(__entry->state, MHI_STATE_LIST),
	  TO_MHI_EXEC_STR(__entry->dev_ee),
	  __print_symbolic(__entry->dev_state, MHI_STATE_LIST))


And that will be exported to user space in the
/sys/kernel/tracing/events/*/*/format file, as:

__print_symbolic(REC->state, {
{ MHI_STATE_RESET, "RESET"},
{ MHI_STATE_READY, "READY"},
{ MHI_STATE_M0, "M0"},
{ MHI_STATE_M1, "M1"},
{ MHI_STATE_M2, "M2"},
{ MHI_STATE_M3, "M3"},
{ MHI_STATE_M3_FAST, "M3 FAST"},
{ MHI_STATE_BHI, "BHI"},
{ MHI_STATE_SYS_ERR, "SYS ERROR"} }

Which libtraceevent knows how to parse. Oh, it wouldn't even show the enum
names, as the TRACE_DEFINE_ENUM() calls above tell the kernel to replace
them with their actual numeric values. That way, when trace-cmd or perf
reads the raw data, it knows how to parse them.

Re: [PATCH v6 3/3] vduse: enable Virtio-net device type

2024-01-04 Thread Jason Wang
On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin
 wrote:
>
> This patch adds Virtio-net device type to the supported
> devices types.
>
> Initialization fails if the device does not support the
> VIRTIO_F_VERSION_1 feature, in order to guarantee that the
> configuration space is read-only. It also fails with
> -EPERM if CAP_NET_ADMIN is missing.
>
> Signed-off-by: Maxime Coquelin 
> ---

Acked-by: Jason Wang 

Thanks




Re: [PATCH v6 2/3] vduse: Temporarily fail if control queue features requested

2024-01-04 Thread Jason Wang
On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin
 wrote:
>
> Virtio-net driver control queue implementation is not safe
> when used with VDUSE. If the VDUSE application does not
> reply to control queue messages, it currently ends up
> hanging the kernel thread sending this command.
>
> Some work is on-going to make the control queue
> implementation robust with VDUSE. Until it is completed,
> let's fail features check if any control-queue related
> feature is requested.
>
> Signed-off-by: Maxime Coquelin 
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 0486ff672408..94f54ea2eb06 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  #include "iova_domain.h"
> @@ -46,6 +47,15 @@
>
>  #define IRQ_UNBOUND -1
>
> +#define VDUSE_NET_INVALID_FEATURES_MASK \
> +   (BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |\
> +BIT_ULL(VIRTIO_NET_F_CTRL_RX)   |  \
> +BIT_ULL(VIRTIO_NET_F_CTRL_VLAN) |  \
> +BIT_ULL(VIRTIO_NET_F_GUEST_ANNOUNCE) | \
> +BIT_ULL(VIRTIO_NET_F_MQ) | \
> +BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) |  \
> +BIT_ULL(VIRTIO_NET_F_RSS))

We need to make this as well:

VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
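
(Presumably this just means one more line in the mask above, i.e.
adding BIT_ULL(VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) to the
VDUSE_NET_INVALID_FEATURES_MASK definition; a sketch of the requested
change, not the actual follow-up patch.)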

Other than this,

Acked-by: Jason Wang 

Thanks

> +
>  struct vduse_virtqueue {
> u16 index;
> u16 num_max;
> @@ -1680,6 +1690,9 @@ static bool features_is_valid(struct vduse_dev_config 
> *config)
> if ((config->device_id == VIRTIO_ID_BLOCK) &&
> (config->features & (1ULL << 
> VIRTIO_BLK_F_CONFIG_WCE)))
> return false;
> +   else if ((config->device_id == VIRTIO_ID_NET) &&
> +   (config->features & VDUSE_NET_INVALID_FEATURES_MASK))
> +   return false;
>
> return true;
>  }
> --
> 2.43.0
>




Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test

2024-01-04 Thread Yunsheng Lin
On 2024/1/5 0:17, Eugenio Perez Martin wrote:
> On Wed, Jan 3, 2024 at 11:00 AM Yunsheng Lin  wrote:

...

>> +
>> +static void run_tx_test(struct vdev_info *dev, struct vq_info *vq,
>> +   bool delayed, int batch, int bufs)
>> +{
>> +   const bool random_batch = batch == RANDOM_BATCH;
>> +   long long spurious = 0;
>> +   struct scatterlist sl;
>> +   unsigned int len;
>> +   int r;
>> +
>> +   for (;;) {
>> +   long started_before = vq->started;
>> +   long completed_before = vq->completed;
>> +
>> +   virtqueue_disable_cb(vq->vq);
>> +   do {
>> +   if (random_batch)
>> +   batch = (random() % vq->vring.num) + 1;
>> +
>> +   while (vq->started < bufs &&
>> +  (vq->started - vq->completed) < batch) {
>> +   sg_init_one(&sl, dev->test_buf, HDR_LEN + TEST_BUF_LEN);
>> +   r = virtqueue_add_outbuf(vq->vq, &sl, 1,
>> +dev->test_buf + 
>> vq->started,
>> +GFP_ATOMIC);
>> +   if (unlikely(r != 0)) {
>> +   if (r == -ENOSPC &&
>> +   vq->started > started_before)
>> +   r = 0;
>> +   else
>> +   r = -1;
>> +   break;
>> +   }
>> +
>> +   ++vq->started;
>> +
>> +   if (unlikely(!virtqueue_kick(vq->vq))) {
>> +   r = -1;
>> +   break;
>> +   }
>> +   }
>> +
>> +   if (vq->started >= bufs)
>> +   r = -1;
>> +
>> +   /* Flush out completed bufs if any */
>> +   while (virtqueue_get_buf(vq->vq, &len)) {
>> +   int n, i;
>> +
>> +   n = recvfrom(dev->sock, dev->res_buf, 
>> TEST_BUF_LEN, 0, NULL, NULL);
>> +   assert(n == TEST_BUF_LEN);
>> +
>> +   for (i = ETHER_HDR_LEN; i < n; i++)
>> +   assert(dev->res_buf[i] == (char)i);
>> +
>> +   ++vq->completed;
>> +   r = 0;
>> +   }
>> +   } while (r == 0);
>> +
>> +   if (vq->completed == completed_before && vq->started == 
>> started_before)
>> +   ++spurious;
>> +
>> +   assert(vq->completed <= bufs);
>> +   assert(vq->started <= bufs);
>> +   if (vq->completed == bufs)
>> +   break;
>> +
>> +   if (delayed) {
>> +   if (virtqueue_enable_cb_delayed(vq->vq))
>> +   wait_for_interrupt(vq);
>> +   } else {
>> +   if (virtqueue_enable_cb(vq->vq))
>> +   wait_for_interrupt(vq);
>> +   }
>> +   }
>> +   printf("TX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
>> +  spurious, vq->started, vq->completed);
>> +}
>> +
>> +static void run_rx_test(struct vdev_info *dev, struct vq_info *vq,
>> +   bool delayed, int batch, int bufs)
>> +{
>> +   const bool random_batch = batch == RANDOM_BATCH;
>> +   long long spurious = 0;
>> +   struct scatterlist sl;
>> +   unsigned int len;
>> +   int r;
>> +
>> +   for (;;) {
>> +   long started_before = vq->started;
>> +   long completed_before = vq->completed;
>> +
>> +   do {
>> +   if (random_batch)
>> +   batch = (random() % vq->vring.num) + 1;
>> +
>> +   while (vq->started < bufs &&
>> +  (vq->started - vq->completed) < batch) {
>> +   sg_init_one(&sl, dev->res_buf, HDR_LEN + TEST_BUF_LEN);
>> +
>> +   r = virtqueue_add_inbuf(vq->vq, &sl, 1,
>> +   dev->res_buf + 
>> vq->started,
>> +   GFP_ATOMIC);
>> +   if (unlikely(r != 0)) {
>> +   if (r == -ENOSPC &&
>> +   vq->started > started_before)
>> +   r = 0;
>> +   else
>> + 

[PATCH -next v6 2/2] mm: vmscan: add new event to trace shrink lru

2024-01-04 Thread Bixuan Cui
From: cuibixuan 

Page reclaim is an important part of memory reclaim, including:
  * shrink_active_list(), which moves folios from the active LRU to the inactive LRU
  * shrink_inactive_list(), which reclaims folios from the inactive LRU list

Add new start/end events to measure the execution time of these
functions and better evaluate the overall memory reclamation process.

Example of output:
 kswapd0-103 [007] .  1098.353020: 
mm_vmscan_lru_shrink_active_start: nid=0
 kswapd0-103 [007] .  1098.353040: 
mm_vmscan_lru_shrink_active_end: nid=0 nr_taken=32 nr_active=0 
nr_deactivated=32 nr_referenced=0 priority=6 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
 kswapd0-103 [007] .  1098.353040: 
mm_vmscan_lru_shrink_inactive_start: nid=0
 kswapd0-103 [007] .  1098.353094: 
mm_vmscan_lru_shrink_inactive_end: nid=0 nr_scanned=32 nr_reclaimed=0 
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 
nr_activate_file=0 nr_ref_keep=32 nr_unmap_fail=0 priority=6 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
 kswapd0-103 [007] .  1098.353094: 
mm_vmscan_lru_shrink_inactive_start: nid=0
 kswapd0-103 [007] .  1098.353162: 
mm_vmscan_lru_shrink_inactive_end: nid=0 nr_scanned=32 nr_reclaimed=21 
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 
nr_activate_file=0 nr_ref_keep=11 nr_unmap_fail=0 priority=6 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
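
For example, pairing the start/end timestamps above gives the execution
time directly: the shrink_active_list() pass took 1098.353040 -
1098.353020 = ~20us, and the two shrink_inactive_list() passes took
~54us and ~68us respectively.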

Signed-off-by: Bixuan Cui 
Reviewed-by: Andrew Morton 
Reviewed-by: Steven Rostedt (Google) 
---
Changes:
v6: * Add Reviewed-by from Steven Rostedt.
v5: * Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to
replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
* Add the explanation for adding new shrink lru events into 'mm: vmscan: 
add new event to trace shrink lru'
v4: * Add Reviewed-by and Changelog to every patch.
v3: * Swap the positions of 'nid' and 'freeable' to prevent the hole in the 
trace event.
v2: * Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the 
same time to fix build error (Andrew pointed out).

 include/trace/events/vmscan.h | 31 +--
 mm/vmscan.c   | 11 ---
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index b99cd28c9815..4793d952c248 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -395,7 +395,34 @@ TRACE_EVENT(mm_vmscan_write_folio,
show_reclaim_flags(__entry->reclaim_flags))
 );
 
-TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
+DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template,
+
+   TP_PROTO(int nid),
+
+   TP_ARGS(nid),
+
+   TP_STRUCT__entry(
+   __field(int, nid)
+   ),
+
+   TP_fast_assign(
+   __entry->nid = nid;
+   ),
+
+   TP_printk("nid=%d", __entry->nid)
+);
+
+DEFINE_EVENT(mm_vmscan_lru_shrink_start_template, 
mm_vmscan_lru_shrink_inactive_start,
+   TP_PROTO(int nid),
+   TP_ARGS(nid)
+);
+
+DEFINE_EVENT(mm_vmscan_lru_shrink_start_template, 
mm_vmscan_lru_shrink_active_start,
+   TP_PROTO(int nid),
+   TP_ARGS(nid)
+);
+
+TRACE_EVENT(mm_vmscan_lru_shrink_inactive_end,
 
TP_PROTO(int nid,
unsigned long nr_scanned, unsigned long nr_reclaimed,
@@ -446,7 +473,7 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
show_reclaim_flags(__entry->reclaim_flags))
 );
 
-TRACE_EVENT(mm_vmscan_lru_shrink_active,
+TRACE_EVENT(mm_vmscan_lru_shrink_active_end,
 
TP_PROTO(int nid, unsigned long nr_taken,
unsigned long nr_active, unsigned long nr_deactivated,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4e3b835c6b4a..a44d9624d60f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1906,6 +1906,8 @@ static unsigned long shrink_inactive_list(unsigned long 
nr_to_scan,
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
bool stalled = false;
 
+   trace_mm_vmscan_lru_shrink_inactive_start(pgdat->node_id);
+
while (unlikely(too_many_isolated(pgdat, file, sc))) {
if (stalled)
return 0;
@@ -1990,7 +1992,7 @@ static unsigned long shrink_inactive_list(unsigned long 
nr_to_scan,
if (file)
sc->nr.file_taken += nr_taken;
 
-   trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
+   trace_mm_vmscan_lru_shrink_inactive_end(pgdat->node_id,
nr_scanned, nr_reclaimed, &stat, sc->priority, file);
return nr_reclaimed;
 }
@@ -2028,6 +2030,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+   trace_mm_vmscan_lru_shrink_active_start(pgdat->node_id);
+
lru_add_drain();
 
spin_lock_irq(&lruvec->lru_lock);
@@ -2107,7 +2111,7 @@ static void shrink_active_list(unsigned long nr_to_scan,

[PATCH -next v6 1/2] mm: shrinker: add new event to trace shrink count

2024-01-04 Thread Bixuan Cui
From: cuibixuan 

do_shrink_slab() calculates the freeable memory through 
shrinker->count_objects(),
and then reclaims the memory through shrinker->scan_objects(). When reclaiming
memory, shrinker->count_objects() takes a certain amount of time:

Function           Time spent (us)
ext4_es_count      4302
ext4_es_scan       12
super_cache_count  4195
super_cache_scan   2103

Therefore, adding trace events around count_objects() gives a more
accurate picture of the time taken for slab memory reclamation.

Example of output:
 kswapd0-103 [003] .  1098.317942: mm_shrink_count_start: 
kfree_rcu_shrink_count.cfi_jt+0x0/0x8 c540ff51: nid: 0
 kswapd0-103 [003] .  1098.317951: mm_shrink_count_end: 
kfree_rcu_shrink_count.cfi_jt+0x0/0x8 c540ff51: nid: 0 freeable:36
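
For example, from the start/end pair above, the count_objects()
callback (kfree_rcu_shrink_count) took 1098.317951 - 1098.317942 =
~9us to report 36 freeable objects.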

Signed-off-by: Bixuan Cui 
Reviewed-by: Steven Rostedt (Google) 
---
Changes:
v6: * Add Reviewed-by from Steven Rostedt.
v5: * Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to
replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
* Add the explanation for adding new shrink lru events into 'mm: vmscan: 
add new event to trace shrink lru'
v4: * Add Reviewed-by and Changelog to every patch.
v3: * Swap the positions of 'nid' and 'freeable' to prevent the hole in the 
trace event.
v2: * Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the 
same time to fix build error (Andrew pointed out).

 include/trace/events/vmscan.h | 49 +++
 mm/shrinker.c |  4 +++
 2 files changed, 53 insertions(+)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 1a488c30afa5..b99cd28c9815 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -196,6 +196,55 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, 
mm_vmscan_memcg_softlimit_re
 );
 #endif /* CONFIG_MEMCG */
 
+TRACE_EVENT(mm_shrink_count_start,
+   TP_PROTO(struct shrinker *shr, struct shrink_control *sc),
+
+   TP_ARGS(shr, sc),
+
+   TP_STRUCT__entry(
+   __field(struct shrinker *, shr)
+   __field(void *, shrink)
+   __field(int, nid)
+   ),
+
+   TP_fast_assign(
+   __entry->shr = shr;
+   __entry->shrink = shr->count_objects;
+   __entry->nid = sc->nid;
+   ),
+
+   TP_printk("%pS %p: nid: %d",
+   __entry->shrink,
+   __entry->shr,
+   __entry->nid)
+);
+
+TRACE_EVENT(mm_shrink_count_end,
+   TP_PROTO(struct shrinker *shr, struct shrink_control *sc, long 
freeable),
+
+   TP_ARGS(shr, sc, freeable),
+
+   TP_STRUCT__entry(
+   __field(struct shrinker *, shr)
+   __field(void *, shrink)
+   __field(long, freeable)
+   __field(int, nid)
+   ),
+
+   TP_fast_assign(
+   __entry->shr = shr;
+   __entry->shrink = shr->count_objects;
+   __entry->freeable = freeable;
+   __entry->nid = sc->nid;
+   ),
+
+   TP_printk("%pS %p: nid: %d freeable:%ld",
+   __entry->shrink,
+   __entry->shr,
+   __entry->nid,
+   __entry->freeable)
+);
+
 TRACE_EVENT(mm_shrink_slab_start,
TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
long nr_objects_to_shrink, unsigned long cache_items,
diff --git a/mm/shrinker.c b/mm/shrinker.c
index dd91eab43ed3..d0c7bf61db61 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -379,7 +379,11 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
  : SHRINK_BATCH;
long scanned = 0, next_deferred;
 
+   trace_mm_shrink_count_start(shrinker, shrinkctl);
+
freeable = shrinker->count_objects(shrinker, shrinkctl);
+
+   trace_mm_shrink_count_end(shrinker, shrinkctl, freeable);
if (freeable == 0 || freeable == SHRINK_EMPTY)
return freeable;
 
-- 
2.17.1




[PATCH -next v6 0/2] Make memory reclamation measurable

2024-01-04 Thread Bixuan Cui
When system memory is low, kswapd reclaims memory. The key steps
of memory reclamation include:
1. shrink_lruvec
  * shrink_active_list, which moves folios from the active LRU to the inactive LRU
  * shrink_inactive_list, which reclaims folios from the inactive LRU list
2. shrink_slab
  * shrinker->count_objects(), which calculates the freeable memory
  * shrinker->scan_objects(), which reclaims the slab memory

The existing tracers in the vmscan are as follows:

--do_try_to_free_pages
--shrink_zones
--trace_mm_vmscan_node_reclaim_begin (tracer)
--shrink_node
--shrink_node_memcgs
  --trace_mm_vmscan_memcg_shrink_begin (tracer)
  --shrink_lruvec
--shrink_list
  --shrink_active_list
  --trace_mm_vmscan_lru_shrink_active (tracer)
  --shrink_inactive_list
  --trace_mm_vmscan_lru_shrink_inactive (tracer)
--shrink_active_list
  --shrink_slab
--do_shrink_slab
--shrinker->count_objects()
--trace_mm_shrink_slab_start (tracer)
--shrinker->scan_objects()
--trace_mm_shrink_slab_end (tracer)
  --trace_mm_vmscan_memcg_shrink_end (tracer)
--trace_mm_vmscan_node_reclaim_end (tracer)

If we capture the duration and quantity of the LRU and slab shrinking,
then we can measure memory reclamation, as follows:

Measuring memory reclamation with bpf:
  LRU FILE:
CPU COMM    ShrinkActive(us) ShrinkInactive(us) Reclaim(page)
7   kswapd0 26               51                 32
7   kswapd0 52               47                 13
  SLAB:
CPU COMM    OBJ_NAME                Count_Dur(us) Freeable(page) Scan_Dur(us) Reclaim(page)
 1  kswapd0 super_cache_scan.cfi_jt 2             341            3225         128
 7  kswapd0 super_cache_scan.cfi_jt 0             2247           8524         1024
 7  kswapd0 super_cache_scan.cfi_jt 2             3670           0            0

For this, add the new tracer to shrink_active_list/shrink_inactive_list
and shrinker->count_objects().

Changes:
v6: * Add Reviewed-by from Steven Rostedt.
v5: * Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to
replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
* Add the explanation for adding new shrink lru events into 'mm: vmscan: 
add new event to trace shrink lru'
v4: Add Reviewed-by and Changelog to every patch.
v3: Swap the positions of 'nid' and 'freeable' to prevent the hole in the trace 
event.
v2: Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the same 
time to fix build error.

cuibixuan (2):
  mm: shrinker: add new event to trace shrink count
  mm: vmscan: add new event to trace shrink lru

 include/trace/events/vmscan.h | 80 ++-
 mm/shrinker.c |  4 ++
 mm/vmscan.c   | 11 +++--
 3 files changed, 90 insertions(+), 5 deletions(-)

-- 
2.17.1




Re: [PATCH net-next v3 2/3] net: introduce abstraction for network memory

2024-01-04 Thread Mina Almasry
On Thu, Jan 4, 2024 at 1:44 PM Jakub Kicinski  wrote:
>
> On Thu, 21 Dec 2023 15:44:22 -0800 Mina Almasry wrote:
> > The warning is like so:
> >
> > ./include/net/page_pool/helpers.h: In function ‘page_pool_alloc’:
> > ./include/linux/stddef.h:8:14: warning: returning ‘void *’ from a
> > function with return type ‘netmem_ref’ {aka ‘long unsigned int’} makes
> > integer from pointer without a cast [-Wint-conversion]
> > 8 | #define NULL ((void *)0)
> >   |  ^
> > ./include/net/page_pool/helpers.h:132:24: note: in expansion of macro
> > ‘NULL’
> >   132 | return NULL;
> >   |^~~~
> >
> > And happens in all the code where:
> >
> > netmem_ref func()
> > {
> > return NULL;
> > }
> >
> > It's fixable by changing the return to `return (netmem_ref)NULL;` or
> > `return 0;`, but I feel like netmem_ref should be some type which
> > allows a cast from NULL implicitly.
>
> Why do you think we should be able to cast NULL implicitly?
> netmem_ref is a handle, it could possibly be some form of
> an ID in the future, rather than a pointer. Or have more low
> bits stolen for specific use cases.
>
> unsigned long, and returning 0 as "no handle" makes perfect sense to me.
>
> Note that 0 is a special case, bitwise types are allowed to convert
> to 0/bool and 0 is implicitly allowed to become a bitwise type.
> This will pass without a warning:
>
> typedef unsigned long __bitwise netmem_ref;
>
> netmem_ref some_code(netmem_ref ref)
> {
> // direct test is fine
> if (!ref)
> // 0 "upgrades" without casts
> return 0;
> // 1 does not, we need __force
> return (__force netmem_ref)1 | ref;
> }
>
> The __bitwise annotation will make catching people trying
> to cast to struct page * trivial.
>
> You seem to be trying hard to make struct netmem a thing.
> Perhaps you have a reason I'm not getting?

There are a number of functions that return struct page* today that I
convert to return struct netmem* later in the child devmem series, one
example is something like:

struct page *page_pool_alloc(...); // returns NULL on failure.

becomes:

struct netmem *page_pool_alloc(...); // also returns NULL on failure.

rather than,

netmem_ref page_pool_alloc(...); // returns 0 on failure.

I guess in my mind having NULL be castable to the new type makes it so
that I can avoid the additional code churn of converting a bunch of
`return NULL;` to `return 0;`, and maybe the transition from page
pointers to netmem pointers can be more easily done if they're both
compatible pointer types.

But that is not any huge blocker or critical point in my mind, I just
thought this approach is preferred. If conversion to unsigned long
makes more sense to you, I'll respin this like that and do the `NULL
-> 0` conversion everywhere as needed.
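
As a minimal sketch of the 0-as-no-handle convention, reusing Jakub's
__bitwise typedef (netmem_from_page() here is a hypothetical helper for
illustration, not an existing API):

	typedef unsigned long __bitwise netmem_ref;

	static inline netmem_ref netmem_from_page(struct page *page)
	{
		/* 0 is the "no handle" value; returning NULL here would
		 * trigger the -Wint-conversion warning quoted above */
		if (!page)
			return 0;
		return (__force netmem_ref)(unsigned long)page;
	}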

-- 
Thanks,
Mina



[PATCH 3/4] eventfs: Read ei->entries before ei->children in eventfs_iterate()

2024-01-04 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In order to apply a shortcut that skips over the current ctx->pos
immediately by using the ei->entries array, the reading of that array
should come first. Moving the array reading before the linked-list
reading will make the shortcut change's diff nicer to read.

Link: 
https://lore.kernel.org/all/CAHk-=wiKwDUDv3+jCsv-uacDcHDVTYsXtBR9=6sgm5mqx+d...@mail.gmail.com/

Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 46 
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index c73fb1f7ddbc..a1934e0eea3b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -752,8 +752,8 @@ static int eventfs_iterate(struct file *file, struct 
dir_context *ctx)
 * Need to create the dentries and inodes to have a consistent
 * inode number.
 */
-   list_for_each_entry_srcu(ei_child, &ei->children, list,
-srcu_read_lock_held(&eventfs_srcu)) {
+   for (i = 0; i < ei->nr_entries; i++) {
+   void *cdata = ei->data;
 
if (c > 0) {
c--;
@@ -762,23 +762,32 @@ static int eventfs_iterate(struct file *file, struct 
dir_context *ctx)
 
ctx->pos++;
 
-   if (ei_child->is_freed)
-   continue;
+   entry = &ei->entries[i];
+   name = entry->name;
 
-   name = ei_child->name;
+   mutex_lock(&eventfs_mutex);
+   /* If ei->is_freed then just bail here, nothing more to do */
+   if (ei->is_freed) {
+   mutex_unlock(&eventfs_mutex);
+   goto out_dec;
+   }
+   r = entry->callback(name, &mode, &data, &fops);
+   mutex_unlock(&eventfs_mutex);
+   if (r <= 0)
+   continue;
 
-   dentry = create_dir_dentry(ei, ei_child, ei_dentry);
+   dentry = create_file_dentry(ei, i, ei_dentry, name, mode, 
cdata, fops);
if (!dentry)
goto out_dec;
ino = dentry->d_inode->i_ino;
dput(dentry);
 
-   if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
+   if (!dir_emit(ctx, name, strlen(name), ino, DT_REG))
goto out_dec;
}
 
-   for (i = 0; i < ei->nr_entries; i++) {
-   void *cdata = ei->data;
+   list_for_each_entry_srcu(ei_child, &ei->children, list,
+srcu_read_lock_held(&eventfs_srcu)) {
 
if (c > 0) {
c--;
@@ -787,27 +796,18 @@ static int eventfs_iterate(struct file *file, struct 
dir_context *ctx)
 
ctx->pos++;
 
-   entry = &ei->entries[i];
-   name = entry->name;
-
-   mutex_lock(&eventfs_mutex);
-   /* If ei->is_freed then just bail here, nothing more to do */
-   if (ei->is_freed) {
-   mutex_unlock(&eventfs_mutex);
-   goto out_dec;
-   }
-   r = entry->callback(name, &mode, &data, &fops);
-   mutex_unlock(&eventfs_mutex);
-   if (r <= 0)
+   if (ei_child->is_freed)
continue;
 
-   dentry = create_file_dentry(ei, i, ei_dentry, name, mode, 
cdata, fops);
+   name = ei_child->name;
+
+   dentry = create_dir_dentry(ei, ei_child, ei_dentry);
if (!dentry)
goto out_dec;
ino = dentry->d_inode->i_ino;
dput(dentry);
 
-   if (!dir_emit(ctx, name, strlen(name), ino, DT_REG))
+   if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
goto out_dec;
}
ret = 1;
-- 
2.42.0





[PATCH 0/4] eventfs: More updates to eventfs_iterate()

2024-01-04 Thread Steven Rostedt
With the ongoing discussion of the eventfs iterator, a few more changes
are required, and some changes are just enhancements.

- Stop immediately in the loop if the ei is found to be in the process
  of being freed.

- Make the ctx->pos update consistent with the skipped previous read index.
  This fixes a bug with duplicate files being shown by 'ls'.

- Swap reading ei->entries with ei->children to make the next change
  easier to read

- Add a "shortcut" in the ei->entries array to skip over already read
  entries.


Steven Rostedt (Google) (4):
  eventfs: Have eventfs_iterate() stop immediately if ei->is_freed is set
  eventfs: Do ctx->pos update for all iterations in  eventfs_iterate()
  eventfs: Read ei->entries before ei->children in eventfs_iterate()
  eventfs: Shortcut eventfs_iterate() by skipping entries already read


 fs/tracefs/event_inode.c | 67 ++--
 1 file changed, 36 insertions(+), 31 deletions(-)



[PATCH 4/4] eventfs: Shortcut eventfs_iterate() by skipping entries already read

2024-01-04 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

As the ei->entries array is fixed for the duration of the eventfs_inode,
it can be used to skip over already read entries in eventfs_iterate().

That is, if ctx->pos is greater than zero, there's no reason in doing the
loop across the ei->entries array for the entries less than ctx->pos.
Instead, start the lookup of the entries at the current ctx->pos.

Link: 
https://lore.kernel.org/all/CAHk-=wiKwDUDv3+jCsv-uacDcHDVTYsXtBR9=6sgm5mqx+d...@mail.gmail.com/

Suggested-by: Linus Torvalds 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a1934e0eea3b..fdff53d5a1f8 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -746,21 +746,15 @@ static int eventfs_iterate(struct file *file, struct 
dir_context *ctx)
if (!ei || !ei_dentry)
goto out;
 
-   ret = 0;
-
/*
 * Need to create the dentries and inodes to have a consistent
 * inode number.
 */
-   for (i = 0; i < ei->nr_entries; i++) {
-   void *cdata = ei->data;
-
-   if (c > 0) {
-   c--;
-   continue;
-   }
+   ret = 0;
 
-   ctx->pos++;
+   /* Start at 'c' to jump over already read entries */
+   for (i = c; i < ei->nr_entries; i++, ctx->pos++) {
+   void *cdata = ei->data;
 
entry = &ei->entries[i];
name = entry->name;
@@ -769,7 +763,7 @@ static int eventfs_iterate(struct file *file, struct 
dir_context *ctx)
/* If ei->is_freed then just bail here, nothing more to do */
if (ei->is_freed) {
mutex_unlock(&eventfs_mutex);
-   goto out_dec;
+   goto out;
}
r = entry->callback(name, &mode, &data, &fops);
mutex_unlock(&eventfs_mutex);
@@ -778,14 +772,17 @@ static int eventfs_iterate(struct file *file, struct 
dir_context *ctx)
 
dentry = create_file_dentry(ei, i, ei_dentry, name, mode, 
cdata, fops);
if (!dentry)
-   goto out_dec;
+   goto out;
ino = dentry->d_inode->i_ino;
dput(dentry);
 
if (!dir_emit(ctx, name, strlen(name), ino, DT_REG))
-   goto out_dec;
+   goto out;
}
 
+   /* Subtract the skipped entries above */
+   c -= min((unsigned int)c, (unsigned int)ei->nr_entries);
+
list_for_each_entry_srcu(ei_child, &ei->children, list,
 srcu_read_lock_held(&eventfs_srcu)) {
 
-- 
2.42.0





[PATCH 2/4] eventfs: Do ctx->pos update for all iterations in eventfs_iterate()

2024-01-04 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The ctx->pos was only updated when it added an entry, but the "skip to
current pos" check (c--) happened on every loop iteration regardless of
whether the entry was added or not. This inconsistency caused readdir to
be incorrect.

It was due to:

for (i = 0; i < ei->nr_entries; i++) {

if (c > 0) {
c--;
continue;
}

mutex_lock(&eventfs_mutex);
/* If ei->is_freed then just bail here, nothing more to do */
if (ei->is_freed) {
mutex_unlock(&eventfs_mutex);
goto out;
}
r = entry->callback(name, &mode, &data, &fops);
mutex_unlock(&eventfs_mutex);

[..]
ctx->pos++;
}

But this can cause the iterator to return a file that was already read.
That's because of the way the callback() works. Some events may not have
all files, and the callback can return 0 to tell eventfs to skip the file
for this directory.

for instance, we have:

 # ls /sys/kernel/tracing/events/ftrace/function
format  hist  hist_debug  id  inject

and

 # ls /sys/kernel/tracing/events/sched/sched_switch/
enable  filter  format  hist  hist_debug  id  inject  trigger

Where the function directory is missing "enable", "filter" and
"trigger". That's because the callback() for events has:

static int event_callback(const char *name, umode_t *mode, void **data,
  const struct file_operations **fops)
{
struct trace_event_file *file = *data;
struct trace_event_call *call = file->event_call;

[..]

/*
 * Only event directories that can be enabled should have
 * triggers or filters, with the exception of the "print"
 * event that can have a "trigger" file.
 */
if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)) {
if (call->class->reg && strcmp(name, "enable") == 0) {
*mode = TRACE_MODE_WRITE;
*fops = &ftrace_enable_fops;
return 1;
}

if (strcmp(name, "filter") == 0) {
*mode = TRACE_MODE_WRITE;
*fops = &ftrace_event_filter_fops;
return 1;
}
}

if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE) ||
strcmp(trace_event_name(call), "print") == 0) {
if (strcmp(name, "trigger") == 0) {
*mode = TRACE_MODE_WRITE;
*fops = &event_trigger_fops;
return 1;
}
}
[..]
return 0;
}

Where the function event has the TRACE_EVENT_FL_IGNORE_ENABLE set.

This means that the entries array elements for "enable", "filter" and
"trigger" when called on the function event will have the callback return
0 and not 1, to tell eventfs to skip these files for it.

Because the "skip to current ctx->pos" check happened for all entries,
but the ctx->pos++ only happened for entries that exist, it would
confuse the reading of a directory, causing:

 # ls /sys/kernel/tracing/events/ftrace/function/
format  hist  hist  hist_debug  hist_debug  id  inject  inject

The missing "enable", "filter" and "trigger" caused ls to show "hist",
"hist_debug" and "inject" twice.

Update the ctx->pos for every iteration to keep its update and the "skip"
update consistent. This also means that on error, the ctx->pos needs to be
decremented if it was incremented without adding something.

Link: https://lore.kernel.org/all/20240104150500.38b15...@gandalf.local.home/

Fixes: 493ec81a8fb8e ("eventfs: Stop using dcache_readdir() for getdents()")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 0aca6910efb3..c73fb1f7ddbc 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -760,6 +760,8 @@ static int eventfs_iterate(struct file *file, struct 
dir_context *ctx)
continue;
}
 
+   ctx->pos++;
+
if (ei_child->is_freed)
continue;
 
@@ -767,13 +769,12 @@ static int eventfs_iterate(struct file *file, struct 
dir_context *ctx)
 
dentry = create_dir_dentry(ei, ei_child, ei_dentry);
if (!dentry)
-   goto out;
+   goto out_dec;
ino = dentry->d_inode->i_ino;
dput(dentry);
 
if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
-   goto out;
-   ctx->pos++;
+   goto out_dec;
}
 
for (i = 0; i < ei->nr_entries; i++) {
@@ -784,6 +785,8 @@ static int eventfs_iterate(struct file *file, struct 
dir_context 

[PATCH 1/4] eventfs: Have eventfs_iterate() stop immediately if ei->is_freed is set

2024-01-04 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If ei->is_freed is set in eventfs_iterate(), it means that the directory
that is being iterated on is in the process of being freed. Just exit the
loop immediately when that is ever detected, and separate out the return
of the entry->callback() from ei->is_freed.

Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 72912b5f9a90..0aca6910efb3 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -788,11 +788,12 @@ static int eventfs_iterate(struct file *file, struct 
dir_context *ctx)
name = entry->name;
 
mutex_lock(&eventfs_mutex);
-   /* If ei->is_freed, then the event itself may be too */
-   if (!ei->is_freed)
-   r = entry->callback(name, &mode, &data, &fops);
-   else
-   r = -1;
+   /* If ei->is_freed then just bail here, nothing more to do */
+   if (ei->is_freed) {
+   mutex_unlock(&eventfs_mutex);
+   goto out;
+   }
+   r = entry->callback(name, &mode, &data, &fops);
+   mutex_unlock(&eventfs_mutex);
if (r <= 0)
continue;
-- 
2.42.0





Re: [PATCH net-next v3 2/3] net: introduce abstraction for network memory

2024-01-04 Thread Jakub Kicinski
On Thu, 21 Dec 2023 15:44:22 -0800 Mina Almasry wrote:
> The warning is like so:
> 
> ./include/net/page_pool/helpers.h: In function ‘page_pool_alloc’:
> ./include/linux/stddef.h:8:14: warning: returning ‘void *’ from a
> function with return type ‘netmem_ref’ {aka ‘long unsigned int’} makes
> integer from pointer without a cast [-Wint-conversion]
> 8 | #define NULL ((void *)0)
>   |  ^
> ./include/net/page_pool/helpers.h:132:24: note: in expansion of macro
> ‘NULL’
>   132 | return NULL;
>   |^~~~
> 
> And happens in all the code where:
> 
> netmem_ref func()
> {
> return NULL;
> }
> 
> It's fixable by changing the return to `return (netmem_ref)NULL;` or
> `return 0;`, but I feel like netmem_ref should be some type which
> allows a cast from NULL implicitly.

Why do you think we should be able to cast NULL implicitly?
netmem_ref is a handle, it could possibly be some form of 
an ID in the future, rather than a pointer. Or have more low
bits stolen for specific use cases.

unsigned long, and returning 0 as "no handle" makes perfect sense to me.

Note that 0 is a special case, bitwise types are allowed to convert
to 0/bool and 0 is implicitly allowed to become a bitwise type.
This will pass without a warning:

typedef unsigned long __bitwise netmem_ref;

netmem_ref some_code(netmem_ref ref)
{
// direct test is fine
if (!ref)
// 0 "upgrades" without casts
return 0;
// 1 does not, we need __force
return (__force netmem_ref)1 | ref;
}

The __bitwise annotation will make catching people trying
to cast to struct page * trivial.

You seem to be trying hard to make struct netmem a thing.
Perhaps you have a reason I'm not getting?



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Haitao Huang
On Thu, 04 Jan 2024 13:27:07 -0600, Dave Hansen   
wrote:



On 1/4/24 11:11, Haitao Huang wrote:
> If those are OK with users and also make it acceptable for merge
> quickly, I'm happy to do that :-)


How about we put some actual numbers behind this?  How much complexity
are we talking about here?  What's the diffstat for the utterly
bare-bones implementation and what does it cost on top of that to do the
per-cgroup reclaim?



For the bare-bones implementation:

 arch/x86/Kconfig                     |  13 +
 arch/x86/kernel/cpu/sgx/Makefile     |   1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 104 ++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  39 ++++++++
 arch/x86/kernel/cpu/sgx/main.c       |  41 ++++++++
 arch/x86/kernel/cpu/sgx/sgx.h        |   5 +
 include/linux/misc_cgroup.h          |  42 +++++++++
 kernel/cgroup/misc.c                 |  52 ++++++++---
 8 files changed, 281 insertions(+), 16 deletions(-)

Additional for per-cgroup reclaim:

 arch/x86/kernel/cpu/sgx/encl.c       |  41 ++++++
 arch/x86/kernel/cpu/sgx/encl.h       |   3 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 224 +++++++++++++++++++++--
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  18 ++--
 arch/x86/kernel/cpu/sgx/main.c       | 226 ++++++++++++++++++++++--
 arch/x86/kernel/cpu/sgx/sgx.h        |  85 ++++++++++---
 6 files changed, 480 insertions(+), 117 deletions(-)


Thanks
Haitao



Re: [syzbot] [virtualization?] KMSAN: uninit-value in virtqueue_add (4)

2024-01-04 Thread Stefan Hajnoczi
On Tue, Jan 02, 2024 at 08:03:46AM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 01, 2024 at 05:38:24AM -0800, syzbot wrote:
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:fbafc3e621c3 Merge tag 'for_linus' of git://git.kernel.org..
> > git tree:   upstream
> > console+strace: https://syzkaller.appspot.com/x/log.txt?x=173df3e9e8
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=e0c7078a6b901aa3
> > dashboard link: https://syzkaller.appspot.com/bug?extid=d7521c1e3841ed075a42
> > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > Debian) 2.40
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1300b4a1e8
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=130b0379e8
> > 
> > Downloadable assets:
> > disk image: 
> > https://storage.googleapis.com/syzbot-assets/1520f7b6daa4/disk-fbafc3e6.raw.xz
> > vmlinux: 
> > https://storage.googleapis.com/syzbot-assets/8b490af009d5/vmlinux-fbafc3e6.xz
> > kernel image: 
> > https://storage.googleapis.com/syzbot-assets/202ca200f4a4/bzImage-fbafc3e6.xz
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+d7521c1e3841ed075...@syzkaller.appspotmail.com
> > 
> > =

Hi Alexander,
Please take a look at this KMSAN failure. The uninitialized memory was
created for the purpose of writing a coredump. vring_map_one_sg() should
have direction=DMA_TO_DEVICE.

I can't easily tell whether this is a genuine bug or an issue with
commit 88938359e2df ("virtio: kmsan: check/unpoison scatterlist in
vring_map_one_sg()"). Maybe coredump.c is writing out pages that KMSAN
thinks are uninitialized?

Stefan

> > BUG: KMSAN: uninit-value in vring_map_one_sg 
> > drivers/virtio/virtio_ring.c:380 [inline]
> > BUG: KMSAN: uninit-value in virtqueue_add_split 
> > drivers/virtio/virtio_ring.c:614 [inline]
> > BUG: KMSAN: uninit-value in virtqueue_add+0x21c6/0x6530 
> > drivers/virtio/virtio_ring.c:2210
> >  vring_map_one_sg drivers/virtio/virtio_ring.c:380 [inline]
> >  virtqueue_add_split drivers/virtio/virtio_ring.c:614 [inline]
> >  virtqueue_add+0x21c6/0x6530 drivers/virtio/virtio_ring.c:2210
> >  virtqueue_add_sgs+0x186/0x1a0 drivers/virtio/virtio_ring.c:2244
> >  __virtscsi_add_cmd drivers/scsi/virtio_scsi.c:467 [inline]
> >  virtscsi_add_cmd+0x838/0xad0 drivers/scsi/virtio_scsi.c:501
> >  virtscsi_queuecommand+0x896/0xa60 drivers/scsi/virtio_scsi.c:598
> >  scsi_dispatch_cmd drivers/scsi/scsi_lib.c:1516 [inline]
> >  scsi_queue_rq+0x4874/0x5790 drivers/scsi/scsi_lib.c:1758
> >  blk_mq_dispatch_rq_list+0x13f8/0x3600 block/blk-mq.c:2049
> >  __blk_mq_do_dispatch_sched block/blk-mq-sched.c:170 [inline]
> >  blk_mq_do_dispatch_sched block/blk-mq-sched.c:184 [inline]
> >  __blk_mq_sched_dispatch_requests+0x10af/0x2500 block/blk-mq-sched.c:309
> >  blk_mq_sched_dispatch_requests+0x160/0x2d0 block/blk-mq-sched.c:333
> >  blk_mq_run_work_fn+0xd0/0x280 block/blk-mq.c:2434
> >  process_one_work kernel/workqueue.c:2627 [inline]
> >  process_scheduled_works+0x104e/0x1e70 kernel/workqueue.c:2700
> >  worker_thread+0xf45/0x1490 kernel/workqueue.c:2781
> >  kthread+0x3ed/0x540 kernel/kthread.c:388
> >  ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
> >  ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
> > 
> > Uninit was created at:
> >  __alloc_pages+0x9a4/0xe00 mm/page_alloc.c:4591
> >  alloc_pages_mpol+0x62b/0x9d0 mm/mempolicy.c:2133
> >  alloc_pages mm/mempolicy.c:2204 [inline]
> >  folio_alloc+0x1da/0x380 mm/mempolicy.c:2211
> >  filemap_alloc_folio+0xa5/0x430 mm/filemap.c:974
> >  __filemap_get_folio+0xa5a/0x1760 mm/filemap.c:1918
> >  ext4_da_write_begin+0x7f8/0xec0 fs/ext4/inode.c:2891
> >  generic_perform_write+0x3f5/0xc40 mm/filemap.c:3918
> >  ext4_buffered_write_iter+0x564/0xaa0 fs/ext4/file.c:299
> >  ext4_file_write_iter+0x20f/0x3460
> >  __kernel_write_iter+0x329/0x930 fs/read_write.c:517
> >  dump_emit_page fs/coredump.c:888 [inline]
> >  dump_user_range+0x593/0xcd0 fs/coredump.c:915
> >  elf_core_dump+0x528d/0x5a40 fs/binfmt_elf.c:2077
> >  do_coredump+0x32c9/0x4920 fs/coredump.c:764
> >  get_signal+0x2185/0x2d10 kernel/signal.c:2890
> >  arch_do_signal_or_restart+0x53/0xca0 arch/x86/kernel/signal.c:309
> >  exit_to_user_mode_loop+0xe8/0x320 kernel/entry/common.c:168
> >  exit_to_user_mode_prepare+0x163/0x220 kernel/entry/common.c:204
> >  irqentry_exit_to_user_mode+0xd/0x30 kernel/entry/common.c:309
> >  irqentry_exit+0x16/0x40 kernel/entry/common.c:412
> >  exc_page_fault+0x246/0x6f0 arch/x86/mm/fault.c:1564
> >  asm_exc_page_fault+0x2b/0x30 arch/x86/include/asm/idtentry.h:570
> > 
> > Bytes 0-4095 of 4096 are uninitialized
> > Memory access of size 4096 starts at 88812c79c000
> > 
> > CPU: 0 PID: 997 Comm: kworker/0:1H Not tainted 
> > 6.7.0-rc7-syzkaller-3-gfbafc3e621c3 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > 

Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Dave Hansen
On 1/4/24 11:11, Haitao Huang wrote:
> If those are OK with users and also make it acceptable for merge
> quickly, I'm happy to do that 

How about we put some actual numbers behind this?  How much complexity
are we talking about here?  What's the diffstat for the utterly
bare-bones implementation and what does it cost on top of that to do the
per-cgroup reclaim?



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Haitao Huang

Hi Michal,

On Thu, 04 Jan 2024 06:38:41 -0600, Michal Koutný  wrote:


Hello.

On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang  
 wrote:
Thanks for raising this. Actually, my understanding of the above
discussion was that it was mainly about not doing reclaiming by killing
enclaves, i.e., I assumed "reclaiming" within that context only meant
that particular kind.

As Mikko pointed out, without reclaiming per-cgroup, the max limit of
each cgroup needs to accommodate the peak usage of enclaves within that
cgroup. That may be inconvenient in practice, and limits could be forced
to be set larger than necessary for enclaves to run performantly. For
example, we can observe the following undesired consequences on a system
with the current kernel loaded with enclaves whose total peak usage is
greater than the EPC capacity.

1) If a user wants to load the exact same enclaves but in separate
cgroups, then the sum of cgroup limits must be higher than the capacity
and the system will end up doing the same old global reclaiming it is
currently doing. Cgroups are not useful at all for isolating EPC
consumption.


That is the use of limits to prevent a runaway cgroup from smothering
the system. Overcommitted values in such a config are fine because the
more simultaneous runaways there are, the less likely the scenario.
The peak consumption comes at the fair expense of others (some
efficiency) and the limit contains the runaway (hence the isolation).



This makes sense to me in theory. Mikko, Chris Y/Bo Z, your thoughts on  
whether this is good enough for your intended usages?


2) To isolate the impact of each cgroup's usage on other cgroups and
yet still be able to load each enclave, the user basically has to
carefully plan to ensure the sum of cgroup max limits, i.e., the sum of
peak enclave usage, does not reach over the capacity. That means no
over-committing is allowed, and the same system may not be able to load
as many enclaves as with the current kernel.


Another "config layout" of limits is to achieve partitioning (sum ==
capacity). That is perfect isolation but it naturally goes against
efficient utilization. The way other controllers approach this trade-off
is with weights (cpu, io) or protections (memory). I'm afraid misc
controller is not ready for this.

My opinion is to start with the simple limits (first patches) and think
of prioritization/guarantee mechanism based on real cases.



We moved away from using a mem-like custom controller with (low, high,
max) to the misc controller. But if we need to add those down the road,
then the interface needs to be changed. So my concern on this route
would be whether misc would allow any of those extensions. On the other
hand, it might turn out less complex to just do the reclamation per
cgroup.


Thanks a lot for your comments and they are really helpful!

Haitao




Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Jarkko Sakkinen
On Thu Jan 4, 2024 at 9:11 PM EET, Haitao Huang wrote:
> > The key question here is whether we want the SGX VM to be complex and
> > more like the real VM or simple when a cgroup hits its limit.  Right?
> >
>
> Although it's fair to say the majority of the complexity of this series is
> in the support for reclaiming per cgroup, I think it's manageable and much
> less than the real VM after we removed the enclave-killing parts: the only
> extra effort is to track pages in separate lists and reclaim them
> separately, as opposed to tracking them on one global list and reclaiming
> together. The main reclaiming loop code is still pretty much the same as
> before.

I'm not seeing any unmanageable complexity on SGX side, and also
cgroups specific changes are somewhat clean to me at least...

BR, Jarkko



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Haitao Huang

Hi Dave,

On Wed, 03 Jan 2024 10:37:35 -0600, Dave Hansen   
wrote:



On 12/18/23 13:24, Haitao Huang wrote:
> @Dave and @Michal, your thoughts? Or could you confirm we should not
> do reclaim per cgroup at all?

What's the benefit of doing reclaim per cgroup?  Is that worth the extra
complexity?



Without reclaiming per cgroup, we have to always set the limit to the
enclave's peak usage. This may not be efficient utilization, as in many
cases each enclave can perform fine with an EPC limit set lower than its
peak. Basically, each group cannot give up some pages for the greater
good without dying :-)


Also, with enclaves enabled with EDMM, the peak usage is not static, so
it is hard to determine upfront. Hence it might be an
operational/deployment inconvenience.


In case of over-committing (sum of limits > total capacity), one cgroup
at peak usage may require swapping pages out of a different cgroup if
the system is overloaded at that time.



The key question here is whether we want the SGX VM to be complex and
more like the real VM or simple when a cgroup hits its limit.  Right?



Although it's fair to say the majority of the complexity of this series
is in the support for reclaiming per cgroup, I think it's manageable and
much less than the real VM after we removed the enclave-killing parts:
the only extra effort is to track pages in separate lists and reclaim
them separately, as opposed to tracking them on one global list and
reclaiming together. The main reclaiming loop code is still pretty much
the same as before.




If stopping at patch 5 and having less code is even remotely an option,
why not do _that_?


I hope I described the limitations clearly enough above.
If those are OK with users and also make it acceptable to merge quickly,
I'm happy to do that :-)


Thanks
Haitao



Re: [PATCH v9 1/2] ring-buffer: Introducing ring-buffer mapping functions

2024-01-04 Thread Steven Rostedt
On Thu, 21 Dec 2023 12:58:13 -0500
Steven Rostedt  wrote:

> On Thu, 21 Dec 2023 17:35:22 +
> Vincent Donnefort  wrote:
> 
> > @@ -5999,6 +6078,307 @@ int ring_buffer_subbuf_order_set(struct 
> > trace_buffer *buffer, int order)
> >  }
> >  EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
> >
> 
> The kernel developers have agreed to allow loop variables to be declared in
> loops. This will simplify these macros:
> 
> 
> 
> > +#define subbuf_page(off, start) \
> > +   virt_to_page((void *)(start + (off << PAGE_SHIFT)))
> > +
> > +#define foreach_subbuf_page(off, sub_order, start, page)   \
> > +   for (off = 0, page = subbuf_page(0, start); \
> > +off < (1 << sub_order);\
> > +off++, page = subbuf_page(off, start))  
> 
> #define foreach_subbuf_page(sub_order, start, page)   \
>   for (int __off = 0, page = subbuf_page(0, (start)); \
>__off < (1 << (sub_order));\
>__off++, page = subbuf_page(__off, (start)))

So it seems that you can't declare "int __off" with page there, but we
could have:

#define foreach_subbuf_page(sub_order, start, page) \
page = subbuf_page(0, (start)); \
for (int __off = 0; __off < (1 << (sub_order)); \
 __off++, page = subbuf_page(__off, (start)))


And that would work.
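
A hypothetical usage sketch of that final form, where 'order' and
'start' stand in for values the ring-buffer mapping code would supply:

	struct page *page;

	foreach_subbuf_page(order, start, page) {
		/* body runs once per page backing the sub-buffer,
		 * i.e. 1 << order pages starting at 'start' */
		get_page(page);
	}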

-- Steve



Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test

2024-01-04 Thread Eugenio Perez Martin
On Wed, Jan 3, 2024 at 11:00 AM Yunsheng Lin  wrote:
>
> introduce vhost_net_test basing on virtio_test to test
> vhost_net changing in the kernel.
>
> Signed-off-by: Yunsheng Lin 
> ---
>  tools/virtio/Makefile |   8 +-
>  tools/virtio/vhost_net_test.c | 574 ++
>  2 files changed, 579 insertions(+), 3 deletions(-)
>  create mode 100644 tools/virtio/vhost_net_test.c
>
> diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
> index d128925980e0..e25e99c1c3b7 100644
> --- a/tools/virtio/Makefile
> +++ b/tools/virtio/Makefile
> @@ -1,8 +1,9 @@
>  # SPDX-License-Identifier: GPL-2.0
>  all: test mod
> -test: virtio_test vringh_test
> +test: virtio_test vringh_test vhost_net_test
>  virtio_test: virtio_ring.o virtio_test.o
>  vringh_test: vringh_test.o vringh.o virtio_ring.o
> +vhost_net_test: virtio_ring.o vhost_net_test.o
>
>  try-run = $(shell set -e;  \
> if ($(1)) >/dev/null 2>&1;  \
> @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
>
>  .PHONY: all test mod clean vhost oot oot-clean oot-build
>  clean:
> -   ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
> -  vhost_test/Module.symvers vhost_test/modules.order *.d
> +   ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
> +  vhost_test/.*.cmd vhost_test/Module.symvers \
> +  vhost_test/modules.order *.d
>  -include *.d
> diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
> new file mode 100644
> index ..cfffcef53d94
> --- /dev/null
> +++ b/tools/virtio/vhost_net_test.c
> @@ -0,0 +1,574 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define RANDOM_BATCH   -1
> +#define HDR_LEN12
> +#define TEST_BUF_LEN   256
> +#define TEST_PTYPE ETH_P_LOOPBACK
> +
> +/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */
> +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
> +
> +struct vq_info {
> +   int kick;
> +   int call;
> +   int idx;
> +   long started;
> +   long completed;
> +   struct pollfd fds;
> +   void *ring;
> +   /* copy used for control */
> +   struct vring vring;
> +   struct virtqueue *vq;
> +};
> +
> +struct vdev_info {
> +   struct virtio_device vdev;
> +   int control;
> +   struct vq_info vqs[2];
> +   int nvqs;
> +   void *buf;
> +   size_t buf_size;
> +   char *test_buf;
> +   char *res_buf;
> +   struct vhost_memory *mem;
> +   int sock;
> +   int ifindex;
> +   unsigned char mac[ETHER_ADDR_LEN];
> +};
> +
> +static int tun_alloc(struct vdev_info *dev)
> +{
> +   struct ifreq ifr;
> +   int len = HDR_LEN;
> +   int fd, e;
> +
> +   fd = open("/dev/net/tun", O_RDWR);
> +   if (fd < 0) {
> +   perror("Cannot open /dev/net/tun");
> +   return fd;
> +   }
> +
> +   memset(&ifr, 0, sizeof(ifr));
> +
> +   ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
> +   snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> +
> +   e = ioctl(fd, TUNSETIFF, &ifr);
> +   if (e < 0) {
> +   perror("ioctl[TUNSETIFF]");
> +   close(fd);
> +   return e;
> +   }
> +
> +   e = ioctl(fd, TUNSETVNETHDRSZ, &len);
> +   if (e < 0) {
> +   perror("ioctl[TUNSETVNETHDRSZ]");
> +   close(fd);
> +   return e;
> +   }
> +
> +   e = ioctl(fd, SIOCGIFHWADDR, &ifr);
> +   if (e < 0) {
> +   perror("ioctl[SIOCGIFHWADDR]");
> +   close(fd);
> +   return e;
> +   }
> +
> +   memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
> +   return fd;
> +}
> +
> +static void vdev_create_socket(struct vdev_info *dev)
> +{
> +   struct ifreq ifr;
> +
> +   dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
> +   assert(dev->sock != -1);
> +
> +   snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> +   assert(ioctl(dev->sock, SIOCGIFINDEX, &ifr) >= 0);
> +
> +   dev->ifindex = ifr.ifr_ifindex;
> +
> +   /* Set the flags that bring the device up */
> +   assert(ioctl(dev->sock, SIOCGIFFLAGS, &ifr) >= 0);
> +   ifr.ifr_flags |= (IFF_UP | IFF_RUNNING);
> +   assert(ioctl(dev->sock, SIOCSIFFLAGS, &ifr) >= 0);
> +}
> +
> +static void vdev_send_packet(struct vdev_info *dev)
> +{
> +   char *sendbuf = dev->test_buf + HDR_LEN;
> +   struct sockaddr_ll saddrll = {0};
> +   int sockfd = dev->sock;
> +   int ret;
> +
> +   memset(&saddrll, 0, sizeof(saddrll));
> +   saddrll.sll_family = PF_PACKET;
> +   saddrll.sll_ifindex = 

Re: [PATCH v8 3/3] remoteproc: zynqmp: parse TCM from device tree

2024-01-04 Thread Tanmay Shah


On 1/3/24 12:17 PM, Mathieu Poirier wrote:
> On Fri, Dec 15, 2023 at 03:57:25PM -0800, Tanmay Shah wrote:
> > ZynqMP TCM information is fixed in driver. Now ZynqMP TCM information
>
> s/"is fixed in driver"/"was fixed in driver"
>
> > is available in device-tree. Parse TCM information in driver
> > as per new bindings.
> > 
> > Signed-off-by: Tanmay Shah 
> > ---
> > 
> > Changes in v8:
> >   - parse power-domains property from device-tree and use EEMI calls
> > to power on/off TCM instead of using pm domains framework
> >   - Remove checking of pm_domain_id validation to power on/off tcm
> >   - Remove spurious change
> > 
> > Changes in v7:
> >   - move checking of pm_domain_id from previous patch
> >   - fix mem_bank_data memory allocation
> > 
> >  drivers/remoteproc/xlnx_r5_remoteproc.c | 154 +++-
> >  1 file changed, 148 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> > b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > index 4395edea9a64..36d73dcd93f0 100644
> > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > @@ -74,8 +74,8 @@ struct mbox_info {
> >  };
> >  
> >  /*
> > - * Hardcoded TCM bank values. This will be removed once TCM bindings are
> > - * accepted for system-dt specifications and upstreamed in linux kernel
> > + * Hardcoded TCM bank values. This will stay in driver to maintain backward
> > + * compatibility with device-tree that does not have TCM information.
> >   */
> >  static const struct mem_bank_data zynqmp_tcm_banks_split[] = {
> > {0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each 
> > */
> > @@ -878,6 +878,139 @@ static struct zynqmp_r5_core 
> > *zynqmp_r5_add_rproc_core(struct device *cdev)
> > return ERR_PTR(ret);
> >  }
> >  
> > +static int zynqmp_r5_get_tcm_node_from_dt(struct zynqmp_r5_cluster 
> > *cluster)
> > +{
> > +   struct of_phandle_args out_args;
> > +   int tcm_reg_per_r5, tcm_pd_idx;
> > +   struct zynqmp_r5_core *r5_core;
> > +   int i, j, tcm_bank_count, ret;
> > +   struct platform_device *cpdev;
> > +   struct mem_bank_data *tcm;
> > +   struct device_node *np;
> > +   struct resource *res;
> > +   u64 abs_addr, size;
> > +   struct device *dev;
> > +
> > +   for (i = 0; i < cluster->core_count; i++) {
> > +   r5_core = cluster->r5_cores[i];
> > +   dev = r5_core->dev;
> > +   np = of_node_get(dev_of_node(dev));
> > +   tcm_pd_idx = 1;
> > +
> > +   /* we have address cell 2 and size cell as 2 */
> > +   tcm_reg_per_r5 = of_property_count_elems_of_size(np, "reg",
> > +4 * 
> > sizeof(u32));
> > +   if (tcm_reg_per_r5 <= 0) {
> > +   dev_err(dev, "can't get reg property err %d\n", 
> > tcm_reg_per_r5);
> > +   return -EINVAL;
> > +   }
> > +
> > +   /*
> > +* In lockstep mode, r5 core 0 will use r5 core 1 TCM
> > +* power domains as well. so allocate twice of per core TCM
>
> Twice of what?  Please use proper english in your multi line comments, i.e a
> capital letter for the first word and a dot at the end.  
>
> > +*/
> > +   if (cluster->mode == LOCKSTEP_MODE)
> > +   tcm_bank_count = tcm_reg_per_r5 * 2;
> > +   else
> > +   tcm_bank_count = tcm_reg_per_r5;
> > +
> > +   r5_core->tcm_banks = devm_kcalloc(dev, tcm_bank_count,
> > + sizeof(struct mem_bank_data 
> > *),
> > + GFP_KERNEL);
> > +   if (!r5_core->tcm_banks)
> > +   ret = -ENOMEM;
> > +
> > +   r5_core->tcm_bank_count = tcm_bank_count;
> > +   for (j = 0; j < tcm_bank_count; j++) {
> > +   tcm = devm_kzalloc(dev, sizeof(struct mem_bank_data),
> > +  GFP_KERNEL);
> > +   if (!tcm)
> > +   return -ENOMEM;
> > +
> > +   r5_core->tcm_banks[j] = tcm;
> > +
> > +   /*
> > +* In lockstep mode, get second core's TCM power 
> > domains id
> > +* after first core TCM parsing is done as
>
> There seems to be words missing at the end of the sentence, and there is no 
> dot.
>
> > +*/
> > +   if (j == tcm_reg_per_r5) {
> > +   /* dec first core node */
> > +   of_node_put(np);
> > +
> > +   /* get second core node */
> > +   np = of_get_next_child(cluster->dev->of_node, 
> > np);
> > +
> > +   /*
> > +* reset index of power-domains property list
> > +* for second core
> > +*/
> > +   

Re: [PATCH v8] bus: mhi: host: Add tracing support

2024-01-04 Thread Steven Rostedt
On Thu, 7 Dec 2023 10:00:47 +0530
Krishna chaitanya chundru  wrote:

> This change adds ftrace support for following functions which
> helps in debugging the issues when there is Channel state & MHI
> state change and also when we receive data and control events:
> 1. mhi_intvec_mhi_states
> 2. mhi_process_data_event_ring
> 3. mhi_process_ctrl_ev_ring
> 4. mhi_gen_tre
> 5. mhi_update_channel_state
> 6. mhi_tryset_pm_state
> 7. mhi_pm_st_worker
> 
> Where ever the trace events are added, debug messages are removed.
> 
> Signed-off-by: Krishna chaitanya chundru 
> ---
> Changes in v8:
> - Pass the structure and dereference the variables in TP_fast_assign as 
> suggested by steve
> - Link to v7: 
> https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com

So this looks good from a tracing POV.

Reviewed-by: Steven Rostedt (Google) 

But I do have some more comments that could be done by new patches.



> +TRACE_EVENT(mhi_intvec_states,
> +
> + TP_PROTO(struct mhi_controller *mhi_cntrl, int dev_ee, int dev_state),
> +
> + TP_ARGS(mhi_cntrl, dev_ee, dev_state),
> +
> + TP_STRUCT__entry(
> + __string(name, mhi_cntrl->mhi_dev->name)
> + __field(int, local_ee)
> + __field(int, state)
> + __field(int, dev_ee)
> + __field(int, dev_state)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(name, mhi_cntrl->mhi_dev->name);
> + __entry->local_ee = mhi_cntrl->ee;
> + __entry->state = mhi_cntrl->dev_state;
> + __entry->dev_ee = dev_ee;
> + __entry->dev_state = dev_state;
> + ),
> +
> + TP_printk("%s: local ee: %s state: %s device ee: %s state: %s\n",
> +   __get_str(name),
> +   TO_MHI_EXEC_STR(__entry->local_ee),
> +   mhi_state_str(__entry->state),
> +   TO_MHI_EXEC_STR(__entry->dev_ee),
> +   mhi_state_str(__entry->dev_state))

So the above may have issues with user space parsing.

For one, that mhi_state_str() is:

static inline const char *mhi_state_str(enum mhi_state state)
{
switch (state) {
case MHI_STATE_RESET:
return "RESET";
case MHI_STATE_READY:
return "READY";
case MHI_STATE_M0:
return "M0";
case MHI_STATE_M1:
return "M1";
case MHI_STATE_M2:
return "M2";
case MHI_STATE_M3:
return "M3";
case MHI_STATE_M3_FAST:
return "M3 FAST";
case MHI_STATE_BHI:
return "BHI";
case MHI_STATE_SYS_ERR:
return "SYS ERROR";
default:
return "Unknown state";
}
};
 
Which if this could be changed into:

#define MHI_STATE_LIST  \
EM(RESET,"RESET")   \
EM(READY,"READY")   \
EM(M0,   "M0")  \
EM(M1,   "M1")  \
EM(M2,   "M2")  \
EM(M3,   "M3")  \
EM(M3_FAST,  "M3 FAST") \
EM(BHI,  "BHI") \
EMe(SYS_ERR, "SYS ERROR")

#undef EM
#undef EMe

#define EM(a, b)  case MHI_STATE_##a: return b;
#define EMe(a, b) case MHI_STATE_##a: return b;

static inline const char *mhi_state_str(enum mhi_state state)
{
switch (state) {
MHI_STATE_LIST
default:
return "Unknown state";
}
}

Then you could use that macro in the trace header:

#undef EM
#undef EMe

#define EM(a, b)TRACE_DEFINE_ENUM(MHI_STATE_##a);
#define EMe(a, b)   TRACE_DEFINE_ENUM(MHI_STATE_##a);

MHI_STATE_LIST

And in the print fmts:

#undef EM
#undef EMe

#define EM(a, b)   { MHI_STATE_##a, b },
#define EMe(a, b)  { MHI_STATE_##a, b }


TP_printk("%s: local ee: %s state: %s device ee: %s state: %s\n",
  __get_str(name),
  TO_MHI_EXEC_STR(__entry->local_ee),

  __print_symbolic(__entry->state), MHI_STATE_LIST),

  TO_MHI_EXEC_STR(__entry->dev_ee),

  __print_symbolic(__entry->dev_state, MHI_STATE_LIST))


And that will be exported to user space in the
/sys/kernel/tracing/events/*/*/format file, as:

__print_symbolic(REC->state, { 
{ MHI_STATE_RESET, "RESET"},
{ MHI_STATE_READY, "READY"},
{ MHI_STATE_M0, "M0"},
{ MHI_STATE_M1, "M1"},
{ MHI_STATE_M2, "M2"},
{ MHI_STATE_M3, "M3"},
{ MHI_STATE_M3_FAST, "M3 FAST"},
{ MHI_STATE_BHI, "BHI"},
{ MHI_STATE_SYS_ERR, "SYS ERROR"} }

Which libtraceevent knows how to parse. And it wouldn't even show the raw
enum names: the TRACE_DEFINE_ENUM() calls above tell the kernel to replace
them with their actual numeric values in the format file. That way, when
trace-cmd or perf reads the raw data, it knows how to print it.

Now 

[PATCH v6 3/3] vduse: enable Virtio-net device type

2024-01-04 Thread Maxime Coquelin
This patch adds the Virtio-net device type to the supported
device types.

Initialization fails if the device does not support the
VIRTIO_F_VERSION_1 feature, in order to guarantee that the
configuration space is read-only. It also fails with
-EPERM if CAP_NET_ADMIN is missing.

Signed-off-by: Maxime Coquelin 
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
index 94f54ea2eb06..4057b34ff995 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -151,6 +151,7 @@ static struct workqueue_struct *vduse_irq_bound_wq;
 
 static u32 allowed_device_id[] = {
VIRTIO_ID_BLOCK,
+   VIRTIO_ID_NET,
 };
 
 static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
@@ -1694,6 +1695,10 @@ static bool features_is_valid(struct vduse_dev_config 
*config)
(config->features & VDUSE_NET_INVALID_FEATURES_MASK))
return false;
 
+   if ((config->device_id == VIRTIO_ID_NET) &&
+   !(config->features & (1ULL << VIRTIO_F_VERSION_1)))
+   return false;
+
return true;
 }
 
@@ -1801,6 +1806,10 @@ static int vduse_create_dev(struct vduse_dev_config 
*config,
int ret;
struct vduse_dev *dev;
 
+   ret = -EPERM;
+   if ((config->device_id == VIRTIO_ID_NET) && !capable(CAP_NET_ADMIN))
+   goto err;
+
ret = -EEXIST;
if (vduse_find_dev(config->name))
goto err;
@@ -2044,6 +2053,7 @@ static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops 
= {
 
 static struct virtio_device_id id_table[] = {
{ VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
+   { VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID },
{ 0 },
 };
 
-- 
2.43.0




[PATCH v6 2/3] vduse: Temporarily fail if control queue features requested

2024-01-04 Thread Maxime Coquelin
The Virtio-net driver control queue implementation is not safe
when used with VDUSE. If the VDUSE application does not
reply to control queue messages, it currently ends up
hanging the kernel thread sending the command.

Some work is ongoing to make the control queue
implementation robust with VDUSE. Until it is completed,
let's fail the features check if any control-queue-related
feature is requested.

Signed-off-by: Maxime Coquelin 
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0486ff672408..94f54ea2eb06 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "iova_domain.h"
@@ -46,6 +47,15 @@
 
 #define IRQ_UNBOUND -1
 
+#define VDUSE_NET_INVALID_FEATURES_MASK \
+   (BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |\
+BIT_ULL(VIRTIO_NET_F_CTRL_RX)   |  \
+BIT_ULL(VIRTIO_NET_F_CTRL_VLAN) |  \
+BIT_ULL(VIRTIO_NET_F_GUEST_ANNOUNCE) | \
+BIT_ULL(VIRTIO_NET_F_MQ) | \
+BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) |  \
+BIT_ULL(VIRTIO_NET_F_RSS))
+
 struct vduse_virtqueue {
u16 index;
u16 num_max;
@@ -1680,6 +1690,9 @@ static bool features_is_valid(struct vduse_dev_config 
*config)
if ((config->device_id == VIRTIO_ID_BLOCK) &&
(config->features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE)))
return false;
+   else if ((config->device_id == VIRTIO_ID_NET) &&
+   (config->features & VDUSE_NET_INVALID_FEATURES_MASK))
+   return false;
 
return true;
 }
-- 
2.43.0




[PATCH v6 1/3] vduse: validate block features only with block devices

2024-01-04 Thread Maxime Coquelin
This patch is preliminary work to enable network device
type support in VDUSE.

As VIRTIO_BLK_F_CONFIG_WCE shares the same value as
VIRTIO_NET_F_HOST_TSO4, we need to restrict its check
to the Virtio-blk device type.
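
For reference, both features map to bit 11 in their respective uapi
headers:

	/* include/uapi/linux/virtio_blk.h */
	#define VIRTIO_BLK_F_CONFIG_WCE	11	/* Writeback mode available in config */

	/* include/uapi/linux/virtio_net.h */
	#define VIRTIO_NET_F_HOST_TSO4	11	/* Host can handle TSOv4 in. */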

Acked-by: Jason Wang 
Reviewed-by: Xie Yongji 
Signed-off-by: Maxime Coquelin 
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0ddd4b8abecb..0486ff672408 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -1671,13 +1671,14 @@ static bool device_is_allowed(u32 device_id)
return false;
 }
 
-static bool features_is_valid(u64 features)
+static bool features_is_valid(struct vduse_dev_config *config)
 {
-   if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
+   if (!(config->features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
return false;
 
/* Now we only support read-only configuration space */
-   if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
+   if ((config->device_id == VIRTIO_ID_BLOCK) &&
+   (config->features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE)))
return false;
 
return true;
@@ -1704,7 +1705,7 @@ static bool vduse_validate_config(struct vduse_dev_config 
*config)
if (!device_is_allowed(config->device_id))
return false;
 
-   if (!features_is_valid(config->features))
+   if (!features_is_valid(config))
return false;
 
return true;
-- 
2.43.0




[PATCH v6 0/3] vduse: add support for networking devices

2024-01-04 Thread Maxime Coquelin
This small series enables the virtio-net device type in VDUSE.
With it, basic operation has been tested, both with
virtio-vdpa and vhost-vdpa, using the DPDK Vhost library series
adding VDUSE support with the split rings layout (merged in
DPDK v23.07-rc1).

Control queue support (and so multiqueue) has also been
tested, but requires a kernel series from Jason Wang
relaxing control queue polling [1] to function reliably,
so until Jason's rework is done, a patch is added to disable
CVQ and the features that depend on it (tested also with DPDK
v23.07-rc1).

In v5, LSM hooks introduced in the previous revision were
unified into a single hook that covers the operations below:
- VDUSE_CREATE_DEV ioctl on VDUSE control file,
- VDUSE_DESTROY_DEV ioctl on VDUSE control file,
- open() on VDUSE device file.

In combination with the operations permission, a device type
permission has to be associated:
- block: Virtio block device type,
- net: Virtio networking device type.

Changes in v6:
==
- Remove SELinux support from the series, will be handled
  in a dedicated one.
- Require CAP_NET_ADMIN for Net devices creation (Jason).
- Fail init if control queue features are requested for
  Net device type (Jason).
- Rebased on latest master.

Changes in v5:
==
- Move control queue disablement patch before Net
  devices enablement (Jason).
- Unify operations LSM hooks into a single hook.
- Rebase on latest master.

Maxime Coquelin (3):
  vduse: validate block features only with block devices
  vduse: Temporarily fail if control queue features requested
  vduse: enable Virtio-net device type

 drivers/vdpa/vdpa_user/vduse_dev.c | 32 ++
 1 file changed, 28 insertions(+), 4 deletions(-)

-- 
2.43.0




Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Michal Koutný
Hello.

On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang 
 wrote:
> Thanks for raising this. Actually my understanding of the above discussion
> was mainly about not doing reclaiming by killing enclaves, i.e., I assumed
> "reclaiming" within that context only meant that particular kind.
> 
> As Mikko pointed out, without reclaiming per-cgroup, the max limit of each
> cgroup needs to accommodate the peak usage of enclaves within that cgroup.
> That may be inconvenient for practical usage and limits could be forced to
> be set larger than necessary to run enclaves performantly. For example, we
> can observe the following undesired consequences compared to a system with
> the current kernel loaded with enclaves whose total peak usage is greater
> than the EPC capacity.
> 
> 1) If a user wants to load the same exact enclaves but in separate cgroups,
> then the sum of cgroup limits must be higher than the capacity and the
> system will end up doing the same old global reclaiming as it is currently
> doing. Cgroup is not useful at all for isolating EPC consumption.

That is the use of limits to prevent a runaway cgroup from smothering the
system. Overcommitted values in such a config are fine because the more
simultaneous runaways there are, the less likely they become.
The peak consumption comes at the fair expense of others (some efficiency)
and the limit contains the runaway (hence the isolation).
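
(For concreteness, such a limit would be set through the misc
controller's per-resource cap; on cgroup v2, something like:

	# cap this cgroup at 256 MiB of EPC
	echo "sgx_epc 268435456" > /sys/fs/cgroup/<group>/misc.max

with the resource name taken from misc.capacity.)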

> 2) To isolate the impact of usage of each cgroup on other cgroups and yet
> still be able to load each enclave, the user basically has to carefully plan
> to ensure the sum of cgroup max limits, i.e., the sum of peak usage of
> enclaves, does not exceed the capacity. That means no over-committing is
> allowed and the same system may not be able to load as many enclaves as with
> the current kernel.

Another "config layout" of limits is to achieve partitioning (sum ==
capacity). That is perfect isolation but it naturally goes against
efficient utilization. The way other controllers approach this trade-off
is with weights (cpu, io) or protections (memory). I'm afraid misc
controller is not ready for this.

My opinion is to start with the simple limits (first patches) and think
of prioritization/guarantee mechanism based on real cases.

HTH,
Michal




Re: [PATCH v3 28/34] s390/mm: Define KMSAN metadata for vmalloc and modules

2024-01-04 Thread Heiko Carstens
On Thu, Jan 04, 2024 at 11:03:42AM +0100, Alexander Gordeev wrote:
> On Tue, Jan 02, 2024 at 04:05:31PM +0100, Heiko Carstens wrote:
> Hi Heiko,
> ...
> > > @@ -253,9 +253,17 @@ static unsigned long setup_kernel_memory_layout(void)
> > >   MODULES_END = round_down(__abs_lowcore, _SEGMENT_SIZE);
> > >   MODULES_VADDR = MODULES_END - MODULES_LEN;
> > >   VMALLOC_END = MODULES_VADDR;
> > > +#ifdef CONFIG_KMSAN
> > > + VMALLOC_END -= MODULES_LEN * 2;
> > > +#endif
> > >  
> > >   /* allow vmalloc area to occupy up to about 1/2 of the rest virtual 
> > > space left */
> > >   vmalloc_size = min(vmalloc_size, round_down(VMALLOC_END / 2, 
> > > _REGION3_SIZE));
> > > +#ifdef CONFIG_KMSAN
> > > + /* take 2/3 of vmalloc area for KMSAN shadow and origins */
> > > + vmalloc_size = round_down(vmalloc_size / 3, _REGION3_SIZE);
> > > + VMALLOC_END -= vmalloc_size * 2;
> > > +#endif
> > 
> > Please use
> > 
> > if (IS_ENABLED(CONFIG_KMSAN))
> > 
> > above, since this way we get more compile time checks.
> 
> This way we will get a mixture of CONFIG_KASAN and CONFIG_KMSAN
> #ifdef vs IS_ENABLED() checks within one function. I guess, we
> would rather address it with a separate cleanup?

I don't think so, since you can't convert the CONFIG_KASAN ifdef to
IS_ENABLED() here: it won't compile.

But IS_ENABLED(CONFIG_KMSAN) should work. I highly prefer IS_ENABLED() over
ifdef since it allows for better compile time checks, and you won't be
surprised by code that doesn't compile if you just change a config option.
We've seen that way too often.
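
For illustration, the KMSAN hunk above converted to IS_ENABLED() would
look roughly like this (untested sketch of the suggestion):

	VMALLOC_END = MODULES_VADDR;
	if (IS_ENABLED(CONFIG_KMSAN))
		VMALLOC_END -= MODULES_LEN * 2;

	/* allow vmalloc area to occupy up to about 1/2 of the rest virtual space left */
	vmalloc_size = min(vmalloc_size, round_down(VMALLOC_END / 2, _REGION3_SIZE));
	if (IS_ENABLED(CONFIG_KMSAN)) {
		/* take 2/3 of vmalloc area for KMSAN shadow and origins */
		vmalloc_size = round_down(vmalloc_size / 3, _REGION3_SIZE);
		VMALLOC_END -= vmalloc_size * 2;
	}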



Re: [PATCH v5 4/4] vduse: Add LSM hook to check Virtio device type

2024-01-04 Thread Maxime Coquelin




On 12/18/23 18:33, Stephen Smalley wrote:

On Mon, Dec 18, 2023 at 12:21 PM Stephen Smalley
 wrote:


On Tue, Dec 12, 2023 at 8:17 AM Maxime Coquelin
 wrote:


This patch introduces an LSM hook for device creation,
destruction (ioctl()) and opening (open()) operations,
checking that the application is allowed to perform these
operations for the Virtio device type.


Can you explain why the existing LSM hooks and SELinux implementation
are not sufficient? We already control the ability to open device
nodes via selinux_inode_permission() and selinux_file_open(), and can
support fine-grained per-cmd ioctl checking via selinux_file_ioctl().
And it should already be possible to label these nodes distinctly
through existing mechanisms (file_contexts if udev-created/labeled,
genfs_contexts if kernel-created). What exactly can't you do today
that this hook enables?


(added Ondrej to the distribution; IMHO we should swap him into
MAINTAINERS in place of Eric Paris since Eric has long-since moved on
from SELinux and Ondrej serves in that capacity these days)

Other items to consider:
- If vduse devices are created using anonymous inodes, then SELinux
grew a general facility for labeling and controlling the creation of
those via selinux_inode_init_security_anon().
- You can encode information about the device into its SELinux type
that then allows you to distinguish things like net vs block based on
the device's SELinux security context rather than encoding that in the
permission bits.


Got it, that seems indeed more appropriate than using permission bits
for the device type.


- If you truly need new LSM hooks (which you need to prove first),
then you should pass some usable information about the object in
question to allow SELinux to find a security context for it. Like an
inode associated with the device, for example.


Ok.





Thanks for the insights, I'll try and see if I can follow your
recommendations in a dedicated series.

Maxime




[PATCH -next v5 2/2] mm: vmscan: add new event to trace shrink lru

2024-01-04 Thread Bixuan Cui
From: cuibixuan 

Page reclaim is an important part of memory reclaim, and includes:
  * shrink_active_list(), which moves folios from the active LRU to the inactive LRU
  * shrink_inactive_list(), which shrinks the inactive LRU list

Add the new events to measure the execution time of these steps, to
better evaluate overall memory reclaim behavior.

Example of output:
 kswapd0-103 [007] .  1098.353020: 
mm_vmscan_lru_shrink_active_start: nid=0
 kswapd0-103 [007] .  1098.353040: 
mm_vmscan_lru_shrink_active_end: nid=0 nr_taken=32 nr_active=0 
nr_deactivated=32 nr_referenced=0 priority=6 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
 kswapd0-103 [007] .  1098.353040: 
mm_vmscan_lru_shrink_inactive_start: nid=0
 kswapd0-103 [007] .  1098.353094: 
mm_vmscan_lru_shrink_inactive_end: nid=0 nr_scanned=32 nr_reclaimed=0 
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 
nr_activate_file=0 nr_ref_keep=32 nr_unmap_fail=0 priority=6 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
 kswapd0-103 [007] .  1098.353094: 
mm_vmscan_lru_shrink_inactive_start: nid=0
 kswapd0-103 [007] .  1098.353162: 
mm_vmscan_lru_shrink_inactive_end: nid=0 nr_scanned=32 nr_reclaimed=21 
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 
nr_activate_file=0 nr_ref_keep=11 nr_unmap_fail=0 priority=6 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

Signed-off-by: Bixuan Cui 
Reviewed-by: Andrew Morton 
---
Changes:
v5: * Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to
replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
* Add the explanation for adding new shrink lru events into 'mm: vmscan: 
add new event to trace shrink lru'
v4: * Add Reviewed-by and Changelog to every patch.
v3: * Swap the positions of 'nid' and 'freeable' to prevent the hole in the 
trace event.
v2: * Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the 
same time to fix build error (Andrew pointed out).

 include/trace/events/vmscan.h | 31 +--
 mm/vmscan.c   | 11 ---
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index b99cd28c9815..4793d952c248 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -395,7 +395,34 @@ TRACE_EVENT(mm_vmscan_write_folio,
show_reclaim_flags(__entry->reclaim_flags))
 );
 
-TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
+DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template,
+
+   TP_PROTO(int nid),
+
+   TP_ARGS(nid),
+
+   TP_STRUCT__entry(
+   __field(int, nid)
+   ),
+
+   TP_fast_assign(
+   __entry->nid = nid;
+   ),
+
+   TP_printk("nid=%d", __entry->nid)
+);
+
+DEFINE_EVENT(mm_vmscan_lru_shrink_start_template, 
mm_vmscan_lru_shrink_inactive_start,
+   TP_PROTO(int nid),
+   TP_ARGS(nid)
+);
+
+DEFINE_EVENT(mm_vmscan_lru_shrink_start_template, 
mm_vmscan_lru_shrink_active_start,
+   TP_PROTO(int nid),
+   TP_ARGS(nid)
+);
+
+TRACE_EVENT(mm_vmscan_lru_shrink_inactive_end,
 
TP_PROTO(int nid,
unsigned long nr_scanned, unsigned long nr_reclaimed,
@@ -446,7 +473,7 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
show_reclaim_flags(__entry->reclaim_flags))
 );
 
-TRACE_EVENT(mm_vmscan_lru_shrink_active,
+TRACE_EVENT(mm_vmscan_lru_shrink_active_end,
 
TP_PROTO(int nid, unsigned long nr_taken,
unsigned long nr_active, unsigned long nr_deactivated,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4e3b835c6b4a..a44d9624d60f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1906,6 +1906,8 @@ static unsigned long shrink_inactive_list(unsigned long 
nr_to_scan,
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
bool stalled = false;
 
+   trace_mm_vmscan_lru_shrink_inactive_start(pgdat->node_id);
+
while (unlikely(too_many_isolated(pgdat, file, sc))) {
if (stalled)
return 0;
@@ -1990,7 +1992,7 @@ static unsigned long shrink_inactive_list(unsigned long 
nr_to_scan,
if (file)
sc->nr.file_taken += nr_taken;
 
-   trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
+   trace_mm_vmscan_lru_shrink_inactive_end(pgdat->node_id,
	nr_scanned, nr_reclaimed, &stat, sc->priority, file);
return nr_reclaimed;
 }
@@ -2028,6 +2030,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+   trace_mm_vmscan_lru_shrink_active_start(pgdat->node_id);
+
lru_add_drain();
 
	spin_lock_irq(&lruvec->lru_lock);
@@ -2107,7 +2111,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
lru_note_cost(lruvec, file, 0, nr_rotated);
	mem_cgroup_uncharge_list(&l_active);

[PATCH -next v5 0/2] Make memory reclamation measurable

2024-01-04 Thread Bixuan Cui
When system memory is low, kswapd reclaims memory. The key steps
of memory reclamation include:
1. shrink_lruvec
  * shrink_active_list, which moves folios from the active LRU to the inactive LRU
  * shrink_inactive_list, which shrinks the inactive LRU list
2. shrink_slab
  * shrinker->count_objects(), which calculates the freeable memory
  * shrinker->scan_objects(), which reclaims the slab memory

The existing tracers in the vmscan are as follows:

--do_try_to_free_pages
--shrink_zones
--trace_mm_vmscan_node_reclaim_begin (tracer)
--shrink_node
--shrink_node_memcgs
  --trace_mm_vmscan_memcg_shrink_begin (tracer)
  --shrink_lruvec
--shrink_list
  --shrink_active_list
  --trace_mm_vmscan_lru_shrink_active (tracer)
  --shrink_inactive_list
  --trace_mm_vmscan_lru_shrink_inactive (tracer)
--shrink_active_list
  --shrink_slab
--do_shrink_slab
--shrinker->count_objects()
--trace_mm_shrink_slab_start (tracer)
--shrinker->scan_objects()
--trace_mm_shrink_slab_end (tracer)
  --trace_mm_vmscan_memcg_shrink_end (tracer)
--trace_mm_vmscan_node_reclaim_end (tracer)

If we get the duration and the amount reclaimed for both the LRU shrink
and the slab shrink, then we can measure memory reclaim behavior, as follows

Measuring memory reclamation with bpf:
  LRU FILE:
CPU COMM    ShrinkActive(us)  ShrinkInactive(us)  Reclaim(page)
7   kswapd0 26                51                  32
7   kswapd0 52                47                  13
  SLAB:
CPU COMM    OBJ_NAME                 Count_Dur(us)  Freeable(page)  Scan_Dur(us)  Reclaim(page)
1   kswapd0 super_cache_scan.cfi_jt  2              341             3225          128
7   kswapd0 super_cache_scan.cfi_jt  0              2247            8524          1024
7   kswapd0 super_cache_scan.cfi_jt  2              3670            0             0

For this, add new tracepoints to shrink_active_list()/shrink_inactive_list()
and around shrinker->count_objects().

Changes:
v5: * Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to
replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
* Add the explanation for adding new shrink lru events into 'mm: vmscan: 
add new event to trace shrink lru'
v4: Add Reviewed-by and Changelog to every patch.
v3: Swap the positions of 'nid' and 'freeable' to prevent the hole in the trace 
event.
v2: Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the same 
time to fix build error.

cuibixuan (2):
  mm: shrinker: add new event to trace shrink count
  mm: vmscan: add new event to trace shrink lru

 include/trace/events/vmscan.h | 80 ++-
 mm/shrinker.c |  4 ++
 mm/vmscan.c   | 11 +++--
 3 files changed, 90 insertions(+), 5 deletions(-)

-- 
2.17.1




[PATCH -next v5 1/2] mm: shrinker: add new event to trace shrink count

2024-01-04 Thread Bixuan Cui
From: cuibixuan 

do_shrink_slab() calculates the freeable memory through
shrinker->count_objects(), and then reclaims that memory through
shrinker->scan_objects(). When reclaiming memory,
shrinker->count_objects() itself takes a certain amount of time:

Function           spend(us)
ext4_es_count      4302
ext4_es_scan       12
super_cache_count  4195
super_cache_scan   2103

Therefore, adding trace events around count_objects() can more accurately
capture the time taken by slab memory reclaim.

Example of output:
 kswapd0-103 [003] .  1098.317942: mm_shrink_count_start: 
kfree_rcu_shrink_count.cfi_jt+0x0/0x8 c540ff51: nid: 0
 kswapd0-103 [003] .  1098.317951: mm_shrink_count_end: 
kfree_rcu_shrink_count.cfi_jt+0x0/0x8 c540ff51: nid: 0 freeable:36

Signed-off-by: Bixuan Cui 
Reviewed-by: Steven Rostedt 
---
Changes:
v5: * Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to
replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
* Add the explanation for adding new shrink lru events into 'mm: vmscan: 
add new event to trace shrink lru'
v4: * Add Reviewed-by and Changelog to every patch.
v3: * Swap the positions of 'nid' and 'freeable' to prevent the hole in the 
trace event.
v2: * Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the 
same time to fix build error (Andrew pointed out).

 include/trace/events/vmscan.h | 49 +++
 mm/shrinker.c |  4 +++
 2 files changed, 53 insertions(+)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 1a488c30afa5..b99cd28c9815 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -196,6 +196,55 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, 
mm_vmscan_memcg_softlimit_re
 );
 #endif /* CONFIG_MEMCG */
 
+TRACE_EVENT(mm_shrink_count_start,
+   TP_PROTO(struct shrinker *shr, struct shrink_control *sc),
+
+   TP_ARGS(shr, sc),
+
+   TP_STRUCT__entry(
+   __field(struct shrinker *, shr)
+   __field(void *, shrink)
+   __field(int, nid)
+   ),
+
+   TP_fast_assign(
+   __entry->shr = shr;
+   __entry->shrink = shr->count_objects;
+   __entry->nid = sc->nid;
+   ),
+
+   TP_printk("%pS %p: nid: %d",
+   __entry->shrink,
+   __entry->shr,
+   __entry->nid)
+);
+
+TRACE_EVENT(mm_shrink_count_end,
+   TP_PROTO(struct shrinker *shr, struct shrink_control *sc, long 
freeable),
+
+   TP_ARGS(shr, sc, freeable),
+
+   TP_STRUCT__entry(
+   __field(struct shrinker *, shr)
+   __field(void *, shrink)
+   __field(long, freeable)
+   __field(int, nid)
+   ),
+
+   TP_fast_assign(
+   __entry->shr = shr;
+   __entry->shrink = shr->count_objects;
+   __entry->freeable = freeable;
+   __entry->nid = sc->nid;
+   ),
+
+   TP_printk("%pS %p: nid: %d freeable:%ld",
+   __entry->shrink,
+   __entry->shr,
+   __entry->nid,
+   __entry->freeable)
+);
+
 TRACE_EVENT(mm_shrink_slab_start,
TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
long nr_objects_to_shrink, unsigned long cache_items,
diff --git a/mm/shrinker.c b/mm/shrinker.c
index dd91eab43ed3..d0c7bf61db61 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -379,7 +379,11 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
  : SHRINK_BATCH;
long scanned = 0, next_deferred;
 
+   trace_mm_shrink_count_start(shrinker, shrinkctl);
+
freeable = shrinker->count_objects(shrinker, shrinkctl);
+
+   trace_mm_shrink_count_end(shrinker, shrinkctl, freeable);
if (freeable == 0 || freeable == SHRINK_EMPTY)
return freeable;
 
-- 
2.17.1




Re: [PATCH v3 28/34] s390/mm: Define KMSAN metadata for vmalloc and modules

2024-01-04 Thread Alexander Gordeev
On Tue, Jan 02, 2024 at 04:05:31PM +0100, Heiko Carstens wrote:
Hi Heiko,
...
> > @@ -253,9 +253,17 @@ static unsigned long setup_kernel_memory_layout(void)
> > MODULES_END = round_down(__abs_lowcore, _SEGMENT_SIZE);
> > MODULES_VADDR = MODULES_END - MODULES_LEN;
> > VMALLOC_END = MODULES_VADDR;
> > +#ifdef CONFIG_KMSAN
> > +   VMALLOC_END -= MODULES_LEN * 2;
> > +#endif
> >  
> > /* allow vmalloc area to occupy up to about 1/2 of the rest virtual 
> > space left */
> > vmalloc_size = min(vmalloc_size, round_down(VMALLOC_END / 2, 
> > _REGION3_SIZE));
> > +#ifdef CONFIG_KMSAN
> > +   /* take 2/3 of vmalloc area for KMSAN shadow and origins */
> > +   vmalloc_size = round_down(vmalloc_size / 3, _REGION3_SIZE);
> > +   VMALLOC_END -= vmalloc_size * 2;
> > +#endif
> 
> Please use
> 
>   if (IS_ENABLED(CONFIG_KMSAN))
> 
> above, since this way we get more compile time checks.

This way we will get a mixture of CONFIG_KASAN and CONFIG_KMSAN
#ifdef vs IS_ENABLED() checks within one function. I guess, we
would rather address it with a separate cleanup?

Thanks!



Re: [RFC PATCH 3/5] dt-bindings: regulator: add documentation entry for 88pm88x-regulator

2024-01-04 Thread Krzysztof Kozlowski
On 04/01/2024 10:25, Krzysztof Kozlowski wrote:
> On 28/12/2023 10:39, Karel Balej wrote:
>> From: Karel Balej 
>>
> 
> A nit, subject: drop second/last, redundant "documentation entry". The
> "dt-bindings" prefix is already stating that these are bindings.
> 
>> The Marvell 88PM88X PMICs provide regulators among other things.
>> Document how to use them.
> 
> 
> You did not document them in the bindings. You just added example to
> make it complete. Where is the binding change?
> 
> What's more, I have doubts that you tested it.

I found your other patchset - now I am 100% sure you did not test it.

Best regards,
Krzysztof




Re: [RFC PATCH 3/5] dt-bindings: regulator: add documentation entry for 88pm88x-regulator

2024-01-04 Thread Krzysztof Kozlowski
On 28/12/2023 10:39, Karel Balej wrote:
> From: Karel Balej 
> 

A nit, subject: drop second/last, redundant "documentation entry". The
"dt-bindings" prefix is already stating that these are bindings.

> The Marvell 88PM88X PMICs provide regulators among other things.
> Document how to use them.


You did not document them in the bindings. You just added example to
make it complete. Where is the binding change?

What's more, I have doubts that you tested it.


> 
> Signed-off-by: Karel Balej 
> ---
>  .../bindings/mfd/marvell,88pm88x.yaml | 17 +++
>  .../regulator/marvell,88pm88x-regulator.yaml  | 28 +++
>  2 files changed, 45 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/regulator/marvell,88pm88x-regulator.yaml
> 
> diff --git a/Documentation/devicetree/bindings/mfd/marvell,88pm88x.yaml 
> b/Documentation/devicetree/bindings/mfd/marvell,88pm88x.yaml
> index 115b41c9f22c..e6944369fc5c 100644
> --- a/Documentation/devicetree/bindings/mfd/marvell,88pm88x.yaml
> +++ b/Documentation/devicetree/bindings/mfd/marvell,88pm88x.yaml
> @@ -54,6 +54,23 @@ examples:
>  onkey {
>compatible = "marvell,88pm88x-onkey";
>  };
> +
> +regulators {
> +  ldo2: ldo2 {
> +regulator-min-microvolt = <3100000>;
> +regulator-max-microvolt = <3300000>;
> +};
> +
> +  ldo15: ldo15 {
> +regulator-min-microvolt = <3300000>;
> +regulator-max-microvolt = <3300000>;
> +};
> +
> +  buck2: buck2 {
> +regulator-min-microvolt = <1800000>;
> +regulator-max-microvolt = <1800000>;
> +};
> +};
>};
>  };
>  ...
> diff --git 
> a/Documentation/devicetree/bindings/regulator/marvell,88pm88x-regulator.yaml 
> b/Documentation/devicetree/bindings/regulator/marvell,88pm88x-regulator.yaml
> new file mode 100644
> index ..c6ac17b113e7
> --- /dev/null
> +++ 
> b/Documentation/devicetree/bindings/regulator/marvell,88pm88x-regulator.yaml
> @@ -0,0 +1,28 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/regulator/marvell,88pm88x-regulator.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Marvell 88PM88X PMICs regulators
> +
> +maintainers:
> +  - Karel Balej 
> +
> +description: |
> +  This module is part of the Marvell 88PM88X MFD device. For more details
> +  see Documentation/devicetree/bindings/mfd/marvell,88pm88x.yaml.
> +
> +  The regulator controller is represented as a sub-node of the PMIC node
> +  in the device tree.
> +
> +  The valid names for 88PM886 regulator nodes are ldo[1-9], ldo1[0-6], 
> buck[1-5].

Don't repeat constraints in free form text.

> +
> +patternProperties:
> +  "^(ldo|buck)[0-9]+$":

You need to fix the pattern to be narrow. buck0 and buck6 are not correct.
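
Assuming the names listed in the description (ldo[1-9], ldo1[0-6],
buck[1-5]) are exhaustive, one narrower pattern could be:

  "^(ldo[1-9]|ldo1[0-6]|buck[1-5])$"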


> +type: object
> +description:
> +  Properties for single regulator.
> +$ref: regulator.yaml#

Not many benefits of this being in its own schema.

> +
> +additionalProperties: false

Best regards,
Krzysztof




Re: [PATCH V1] net: qrtr: ns: Ignore ENODEV failures in ns

2024-01-04 Thread Simon Horman
On Tue, Dec 26, 2023 at 04:20:03PM -0800, Chris Lew wrote:
> 
> 
> On 12/23/2023 5:56 AM, Simon Horman wrote:
> > [Dropped bjorn.anders...@kernel.org, as the correct address seems
> >   to be anders...@kernel.org, which is already in the CC list.
> >   kernel.org rejected sending this email without that update.]
> > 
> > On Thu, Dec 21, 2023 at 03:36:50PM +0530, Sarannya S wrote:
> > > From: Chris Lew 
> > > 
> > > Ignore the ENODEV failures returned by kernel_sendmsg(). These errors
> > > indicate that either the local port has been closed or the remote has
> > > gone down. Neither of these scenarios is fatal; both will eventually be
> > > handled through packets that are later queued on the control port.
> > > 
> > > Signed-off-by: Chris Lew 
> > > Signed-off-by: Sarannya Sasikumar 
> > > ---
> > >   net/qrtr/ns.c | 11 +++
> > >   1 file changed, 7 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/net/qrtr/ns.c b/net/qrtr/ns.c
> > > index abb0c70..8234339 100644
> > > --- a/net/qrtr/ns.c
> > > +++ b/net/qrtr/ns.c
> > > @@ -157,7 +157,7 @@ static int service_announce_del(struct sockaddr_qrtr 
> > > *dest,
> > >   msg.msg_namelen = sizeof(*dest);
> > >   ret = kernel_sendmsg(qrtr_ns.sock, &msg, &iv, 1, sizeof(pkt));
> > > - if (ret < 0)
> > > + if (ret < 0 && ret != -ENODEV)
> > >   pr_err("failed to announce del service\n");
> > >   return ret;
> > 
> > Hi,
> > 
> > The caller of service_announce_del() ignores its return value.
> > So the only action on error is the pr_err() call above, and so
> > with this patch -ENODEV is indeed ignored.
> > 
> > However, I wonder if it would make things clearer to the reader (me?)
> > if the return type of service_announce_del() was updated to void. Because
> > as things stand -ENODEV may be returned, which implies something might
> > handle that, even though it does not.
> > 
> > The above notwithstanding, this change looks good to me.
> > 
> > Reviewed-by: Simon Horman 
> > 
> > ...
> 
> Hi Simon, thanks for the review and suggestion. We weren't sure whether we
> should change the function prototype in these patches, on the chance that
> something might listen for and handle this error in the future. I think
> it's a good idea to change it to void and we can change it back if there is
> such a use case in the future.

Hi Chris,

yes, I think that would be a good approach.
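
For reference, the suggested change would look roughly like this (a
sketch based on the hunk quoted above; the rest of the body unchanged):

	static void service_announce_del(struct sockaddr_qrtr *dest,
					 struct qrtr_server *srv)
	{
		...
		ret = kernel_sendmsg(qrtr_ns.sock, &msg, &iv, 1, sizeof(pkt));
		if (ret < 0 && ret != -ENODEV)
			pr_err("failed to announce del service\n");
	}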