Re: [PATCH v8] bus: mhi: host: Add tracing support
On 1/4/2024 12:47 PM, Krishna Chaitanya Chundru wrote:

Hi Steven,

Can you please review this.

Thanks & Regards,
Krishna Chaitanya.

On 12/7/2023 10:00 AM, Krishna chaitanya chundru wrote:

This change adds ftrace support for the following functions, which helps in debugging issues when there is a channel state or MHI state change, and also when we receive data and control events:
1. mhi_intvec_mhi_states
2. mhi_process_data_event_ring
3. mhi_process_ctrl_ev_ring
4. mhi_gen_tre
5. mhi_update_channel_state
6. mhi_tryset_pm_state
7. mhi_pm_st_worker

Wherever the trace events are added, debug messages are removed.

Is there a reason why the debug messages have to be removed? From the view of the MHI controller, we often need MHI logs to debug, so is it possible to preserve those logs?

Signed-off-by: Krishna chaitanya chundru
---
Changes in v8:
- Pass the structure and dereference the variables in TP_fast_assign as suggested by Steve
- Link to v7: https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com

Changes in v7:
- Change the log format as pointed out by Mani.
- Link to v6: https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com

Changes in v6:
- Use 'rp' directly as suggested by Jeffrey.
- Link to v5: https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com

Changes in v5:
- Use DECLARE_EVENT_CLASS for multiple events as suggested by Steve.
- Instead of converting to u64 to print the address, use %px to print the address to avoid warnings on some platforms.
- Link to v4: https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com

Changes in v4:
- Fix compilation issues in the previous patch, which happened due to rebasing.
- In the defconfig, the FTRACE config is not enabled; due to that, the compilation issue was not seen in my workspace.
- Link to v3: https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com

Changes in v3:
- Move the trace header file from include/trace/events to drivers/bus/mhi/host/ so that we can include driver header files.
- Use macros directly in the trace events as suggested by Jeffrey Hugo.
- Reorder the structure in the events as suggested by Steve to avoid holes in the buffer.
- Removed the mhi_to_physical function as this can give security issues.
- Removed macros to define strings as we can get those from driver headers.
- Link to v2: https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com

Changes in v2:
- Pass the raw state into the trace event and use __print_symbolic() as suggested by Bjorn.
- Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by Bjorn.
- Fixed the kernel test robot issues.
- Link to v1: https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com
---
 drivers/bus/mhi/host/init.c  |   3 +
 drivers/bus/mhi/host/main.c  |  19 ++--
 drivers/bus/mhi/host/pm.c    |   7 +-
 drivers/bus/mhi/host/trace.h | 205 +++
 4 files changed, 221 insertions(+), 13 deletions(-)

diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
index f78aefd2d7a3..6acb85f4c5f8 100644
--- a/drivers/bus/mhi/host/init.c
+++ b/drivers/bus/mhi/host/init.c
@@ -20,6 +20,9 @@
 #include
 #include "internal.h"
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
 static DEFINE_IDA(mhi_controller_ida);
 const char * const mhi_ee_str[MHI_EE_MAX] = {
diff --git a/drivers/bus/mhi/host/main.c b/drivers/bus/mhi/host/main.c
index dcf627b36e82..189f4786403e 100644
--- a/drivers/bus/mhi/host/main.c
+++ b/drivers/bus/mhi/host/main.c
@@ -15,6 +15,7 @@
 #include
 #include
 #include "internal.h"
+#include "trace.h"
 int __must_check mhi_read_reg(struct mhi_controller *mhi_cntrl,
			      void __iomem *base, u32 offset, u32 *out)
@@ -491,11 +492,8 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
 	state = mhi_get_mhi_state(mhi_cntrl);
 	ee = mhi_get_exec_env(mhi_cntrl);
-	dev_dbg(dev, "local ee: %s state: %s device ee: %s state: %s\n",
-		TO_MHI_EXEC_STR(mhi_cntrl->ee),
-		mhi_state_str(mhi_cntrl->dev_state),
-		TO_MHI_EXEC_STR(ee), mhi_state_str(state));
+	trace_mhi_intvec_states(mhi_cntrl, ee, state);
 	if (state == MHI_STATE_SYS_ERR) {
 		dev_dbg(dev, "System error detected\n");
 		pm_state = mhi_tryset_pm_state(mhi_cntrl,
@@ -832,6 +830,8 @@ int mhi_process_ctrl_ev_ring(struct mhi_controller *mhi_cntrl,
 	while (dev_rp != local_rp) {
 		enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
+		trace_mhi_ctrl_event(mhi_cntrl, local_rp);
+
 		switch (type) {
 		case MHI_PKT_TYPE_BW_REQ_EVENT:
 		{
@@ -997,6 +997,8 @@ int mhi_process_data_event_ring(struct mhi_controller *mhi_cntrl,
 	while (dev_rp != local_rp && event_quota > 0) {
 		enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
+
Re: [PATCH v8] bus: mhi: host: Add tracing support
On 1/4/2024 9:31 PM, Steven Rostedt wrote:

On Thu, 7 Dec 2023 10:00:47 +0530, Krishna chaitanya chundru wrote:

This change adds ftrace support for the following functions, which helps in debugging issues when there is a channel state or MHI state change, and also when we receive data and control events:
1. mhi_intvec_mhi_states
2. mhi_process_data_event_ring
3. mhi_process_ctrl_ev_ring
4. mhi_gen_tre
5. mhi_update_channel_state
6. mhi_tryset_pm_state
7. mhi_pm_st_worker

Wherever the trace events are added, debug messages are removed.

Signed-off-by: Krishna chaitanya chundru
---
Changes in v8:
- Pass the structure and dereference the variables in TP_fast_assign as suggested by Steve
- Link to v7: https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com

So this looks good from a tracing POV.

Reviewed-by: Steven Rostedt (Google)

But I do have some more comments that could be done by new patches.

+TRACE_EVENT(mhi_intvec_states,
+
+	TP_PROTO(struct mhi_controller *mhi_cntrl, int dev_ee, int dev_state),
+
+	TP_ARGS(mhi_cntrl, dev_ee, dev_state),
+
+	TP_STRUCT__entry(
+		__string(name, mhi_cntrl->mhi_dev->name)
+		__field(int, local_ee)
+		__field(int, state)
+		__field(int, dev_ee)
+		__field(int, dev_state)
+	),
+
+	TP_fast_assign(
+		__assign_str(name, mhi_cntrl->mhi_dev->name);
+		__entry->local_ee = mhi_cntrl->ee;
+		__entry->state = mhi_cntrl->dev_state;
+		__entry->dev_ee = dev_ee;
+		__entry->dev_state = dev_state;
+	),
+
+	TP_printk("%s: local ee: %s state: %s device ee: %s state: %s\n",
+		  __get_str(name),
+		  TO_MHI_EXEC_STR(__entry->local_ee),
+		  mhi_state_str(__entry->state),
+		  TO_MHI_EXEC_STR(__entry->dev_ee),
+		  mhi_state_str(__entry->dev_state))

So the above may have issues with user space parsing.
For one, that mhi_state_str() is:

static inline const char *mhi_state_str(enum mhi_state state)
{
	switch (state) {
	case MHI_STATE_RESET:
		return "RESET";
	case MHI_STATE_READY:
		return "READY";
	case MHI_STATE_M0:
		return "M0";
	case MHI_STATE_M1:
		return "M1";
	case MHI_STATE_M2:
		return "M2";
	case MHI_STATE_M3:
		return "M3";
	case MHI_STATE_M3_FAST:
		return "M3 FAST";
	case MHI_STATE_BHI:
		return "BHI";
	case MHI_STATE_SYS_ERR:
		return "SYS ERROR";
	default:
		return "Unknown state";
	}
};

Which if this could be changed into:

#define MHI_STATE_LIST			\
	EM(RESET,	"RESET")	\
	EM(READY,	"READY")	\
	EM(M0,		"M0")		\
	EM(M1,		"M1")		\
	EM(M2,		"M2")		\
	EM(M3,		"M3")		\
	EM(M3_FAST,	"M3_FAST")	\
	EM(BHI,		"BHI")		\
	EMe(SYS_ERR,	"SYS ERROR")

#undef EM
#undef EMe
#define EM(a, b)	case MHI_STATE_##a: return b;
#define EMe(a, b)	case MHI_STATE_##a: return b;

static inline const char *mhi_state_str(enum mhi_state state)
{
	switch (state) {
	MHI_STATE_LIST
	default:
		return "Unknown state";
	}
}

Then you could use that macro in the trace header:

#undef EM
#undef EMe
#define EM(a, b)	TRACE_DEFINE_ENUM(MHI_STATE_##a);
#define EMe(a, b)	TRACE_DEFINE_ENUM(MHI_STATE_##a);

MHI_STATE_LIST

And in the print fmts:

#undef EM
#undef EMe
#define EM(a, b)	{ MHI_STATE_##a, b },
#define EMe(a, b)	{ MHI_STATE_##a, b }

	TP_printk("%s: local ee: %s state: %s device ee: %s state: %s\n",
		  __get_str(name),
		  TO_MHI_EXEC_STR(__entry->local_ee),
		  __print_symbolic(__entry->state, MHI_STATE_LIST),
		  TO_MHI_EXEC_STR(__entry->dev_ee),
		  __print_symbolic(__entry->dev_state, MHI_STATE_LIST))

And that will be exported to user space in the /sys/kernel/tracing/events/*/*/format file as:

	__print_symbolic(REC->state, {
		{ MHI_STATE_RESET,	"RESET" },
		{ MHI_STATE_READY,	"READY" },
		{ MHI_STATE_M0,		"M0" },
		{ MHI_STATE_M1,		"M1" },
		{ MHI_STATE_M2,		"M2" },
		{ MHI_STATE_M3,		"M3" },
		{ MHI_STATE_M3_FAST,	"M3 FAST" },
		{ MHI_STATE_BHI,	"BHI" },
		{ MHI_STATE_SYS_ERR,	"SYS ERROR" } })

Which libtracevent knows how to parse.
Oh, it wouldn't even show the enum names as the TRACE_DEFINE_ENUM() above, would tell the kernel to replace them with their actual numeric values. That way, when trace-cmd or perf reads the raw data, it knows how to
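The EM()/EMe() pattern Steven describes is a plain X-macro: one list expands into both the switch cases of the string helper and (in the kernel) the TRACE_DEFINE_ENUM()/__print_symbolic() tables. A minimal standalone sketch, with an illustrative subset of states rather than the real MHI enum:

```c
#include <stdio.h>

/* Illustrative subset of the MHI states -- not the real kernel enum. */
enum mhi_state { MHI_STATE_RESET, MHI_STATE_READY, MHI_STATE_M0 };

/* One list drives every expansion.  In the kernel the same list would
 * also feed TRACE_DEFINE_ENUM() and __print_symbolic(). */
#define MHI_STATE_LIST		\
	EM(RESET,  "RESET")	\
	EM(READY,  "READY")	\
	EMe(M0,    "M0")

/* Expansion: switch cases for the enum-to-string helper */
#define EM(a, b)  case MHI_STATE_##a: return b;
#define EMe(a, b) case MHI_STATE_##a: return b;

static const char *mhi_state_str(enum mhi_state state)
{
	switch (state) {
	MHI_STATE_LIST
	default:
		return "Unknown state";
	}
}
```

Calling mhi_state_str(MHI_STATE_READY) returns "READY"; redefining EM/EMe and re-expanding MHI_STATE_LIST is what keeps the helper and the exported format-file table in sync from a single source.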
Re: [PATCH v6 3/3] vduse: enable Virtio-net device type
On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin wrote:
>
> This patch adds Virtio-net device type to the supported
> devices types.
>
> Initialization fails if the device does not support
> VIRTIO_F_VERSION_1 feature, in order to guarantee the
> configuration space is read-only. It also fails with
> -EPERM if the CAP_NET_ADMIN is missing.
>
> Signed-off-by: Maxime Coquelin
> ---

Acked-by: Jason Wang

Thanks
Re: [PATCH v6 2/3] vduse: Temporarily fail if control queue features requested
On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin wrote: > > Virtio-net driver control queue implementation is not safe > when used with VDUSE. If the VDUSE application does not > reply to control queue messages, it currently ends up > hanging the kernel thread sending this command. > > Some work is on-going to make the control queue > implementation robust with VDUSE. Until it is completed, > let's fail features check if any control-queue related > feature is requested. > > Signed-off-by: Maxime Coquelin > --- > drivers/vdpa/vdpa_user/vduse_dev.c | 13 + > 1 file changed, 13 insertions(+) > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > b/drivers/vdpa/vdpa_user/vduse_dev.c > index 0486ff672408..94f54ea2eb06 100644 > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > @@ -28,6 +28,7 @@ > #include > #include > #include > +#include > #include > > #include "iova_domain.h" > @@ -46,6 +47,15 @@ > > #define IRQ_UNBOUND -1 > > +#define VDUSE_NET_INVALID_FEATURES_MASK \ > + (BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |\ > +BIT_ULL(VIRTIO_NET_F_CTRL_RX) | \ > +BIT_ULL(VIRTIO_NET_F_CTRL_VLAN) | \ > +BIT_ULL(VIRTIO_NET_F_GUEST_ANNOUNCE) | \ > +BIT_ULL(VIRTIO_NET_F_MQ) | \ > +BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) | \ > +BIT_ULL(VIRTIO_NET_F_RSS)) We need to make this as well: VIRTIO_NET_F_CTRL_GUEST_OFFLOADS Other than this, Acked-by: Jason Wang Thanks > + > struct vduse_virtqueue { > u16 index; > u16 num_max; > @@ -1680,6 +1690,9 @@ static bool features_is_valid(struct vduse_dev_config > *config) > if ((config->device_id == VIRTIO_ID_BLOCK) && > (config->features & (1ULL << > VIRTIO_BLK_F_CONFIG_WCE))) > return false; > + else if ((config->device_id == VIRTIO_ID_NET) && > + (config->features & VDUSE_NET_INVALID_FEATURES_MASK)) > + return false; > > return true; > } > -- > 2.43.0 >
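The rejection logic in this patch is a single mask test. A standalone sketch, with the feature-bit positions hardcoded here for illustration (the authoritative values live in include/uapi/linux/virtio_net.h):

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(nr) (1ULL << (nr))

/* Illustrative bit positions for the demo only. */
#define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2
#define VIRTIO_NET_F_CTRL_VQ             17
#define VIRTIO_NET_F_MQ                  22

/* Any control-queue related feature makes the config invalid for now. */
#define VDUSE_NET_INVALID_FEATURES_MASK			\
	(BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |		\
	 BIT_ULL(VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) |	\
	 BIT_ULL(VIRTIO_NET_F_MQ))

static bool vduse_net_features_are_valid(uint64_t features)
{
	return !(features & VDUSE_NET_INVALID_FEATURES_MASK);
}
```

A feature word is rejected as soon as any masked bit is set, which is why adding VIRTIO_NET_F_CTRL_GUEST_OFFLOADS to the mask (as Jason asks) is a one-line change.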
Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
On 2024/1/5 0:17, Eugenio Perez Martin wrote: > On Wed, Jan 3, 2024 at 11:00 AM Yunsheng Lin wrote: ... >> + >> +static void run_tx_test(struct vdev_info *dev, struct vq_info *vq, >> + bool delayed, int batch, int bufs) >> +{ >> + const bool random_batch = batch == RANDOM_BATCH; >> + long long spurious = 0; >> + struct scatterlist sl; >> + unsigned int len; >> + int r; >> + >> + for (;;) { >> + long started_before = vq->started; >> + long completed_before = vq->completed; >> + >> + virtqueue_disable_cb(vq->vq); >> + do { >> + if (random_batch) >> + batch = (random() % vq->vring.num) + 1; >> + >> + while (vq->started < bufs && >> + (vq->started - vq->completed) < batch) { >> + sg_init_one(, dev->test_buf, HDR_LEN + >> TEST_BUF_LEN); >> + r = virtqueue_add_outbuf(vq->vq, , 1, >> +dev->test_buf + >> vq->started, >> +GFP_ATOMIC); >> + if (unlikely(r != 0)) { >> + if (r == -ENOSPC && >> + vq->started > started_before) >> + r = 0; >> + else >> + r = -1; >> + break; >> + } >> + >> + ++vq->started; >> + >> + if (unlikely(!virtqueue_kick(vq->vq))) { >> + r = -1; >> + break; >> + } >> + } >> + >> + if (vq->started >= bufs) >> + r = -1; >> + >> + /* Flush out completed bufs if any */ >> + while (virtqueue_get_buf(vq->vq, )) { >> + int n, i; >> + >> + n = recvfrom(dev->sock, dev->res_buf, >> TEST_BUF_LEN, 0, NULL, NULL); >> + assert(n == TEST_BUF_LEN); >> + >> + for (i = ETHER_HDR_LEN; i < n; i++) >> + assert(dev->res_buf[i] == (char)i); >> + >> + ++vq->completed; >> + r = 0; >> + } >> + } while (r == 0); >> + >> + if (vq->completed == completed_before && vq->started == >> started_before) >> + ++spurious; >> + >> + assert(vq->completed <= bufs); >> + assert(vq->started <= bufs); >> + if (vq->completed == bufs) >> + break; >> + >> + if (delayed) { >> + if (virtqueue_enable_cb_delayed(vq->vq)) >> + wait_for_interrupt(vq); >> + } else { >> + if (virtqueue_enable_cb(vq->vq)) >> + wait_for_interrupt(vq); >> + } >> + } >> + printf("TX spurious wakeups: 0x%llx started=0x%lx 
completed=0x%lx\n", >> + spurious, vq->started, vq->completed); >> +} >> + >> +static void run_rx_test(struct vdev_info *dev, struct vq_info *vq, >> + bool delayed, int batch, int bufs) >> +{ >> + const bool random_batch = batch == RANDOM_BATCH; >> + long long spurious = 0; >> + struct scatterlist sl; >> + unsigned int len; >> + int r; >> + >> + for (;;) { >> + long started_before = vq->started; >> + long completed_before = vq->completed; >> + >> + do { >> + if (random_batch) >> + batch = (random() % vq->vring.num) + 1; >> + >> + while (vq->started < bufs && >> + (vq->started - vq->completed) < batch) { >> + sg_init_one(, dev->res_buf, HDR_LEN + >> TEST_BUF_LEN); >> + >> + r = virtqueue_add_inbuf(vq->vq, , 1, >> + dev->res_buf + >> vq->started, >> + GFP_ATOMIC); >> + if (unlikely(r != 0)) { >> + if (r == -ENOSPC && >> + vq->started > started_before) >> + r = 0; >> + else >> +
[PATCH -next v6 2/2] mm: vmscan: add new event to trace shrink lru
From: cuibixuan

Page reclaim is an important part of memory reclaim, including:
* shrink_active_list(), which moves folios from the active LRU to the inactive LRU
* shrink_inactive_list(), which shrinks the inactive LRU list

Add the new events to measure the execution time, to better evaluate the overall memory reclamation efficiency.

Example of output:
 kswapd0-103 [007] . 1098.353020: mm_vmscan_lru_shrink_active_start: nid=0
 kswapd0-103 [007] . 1098.353040: mm_vmscan_lru_shrink_active_end: nid=0 nr_taken=32 nr_active=0 nr_deactivated=32 nr_referenced=0 priority=6 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
 kswapd0-103 [007] . 1098.353040: mm_vmscan_lru_shrink_inactive_start: nid=0
 kswapd0-103 [007] . 1098.353094: mm_vmscan_lru_shrink_inactive_end: nid=0 nr_scanned=32 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 nr_ref_keep=32 nr_unmap_fail=0 priority=6 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
 kswapd0-103 [007] . 1098.353094: mm_vmscan_lru_shrink_inactive_start: nid=0
 kswapd0-103 [007] . 1098.353162: mm_vmscan_lru_shrink_inactive_end: nid=0 nr_scanned=32 nr_reclaimed=21 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 nr_ref_keep=11 nr_unmap_fail=0 priority=6 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

Signed-off-by: Bixuan Cui
Reviewed-by: Andrew Morton
Reviewed-by: Steven Rostedt (Google)
---
Changes:
v6:
* Add Reviewed-by from Steven Rostedt.
v5:
* Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
* Add the explanation for adding the new shrink lru events into 'mm: vmscan: add new event to trace shrink lru'
v4:
* Add Reviewed-by and Changelog to every patch.
v3:
* Swap the positions of 'nid' and 'freeable' to prevent a hole in the trace event.
v2:
* Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the same time to fix a build error (Andrew pointed out).
include/trace/events/vmscan.h | 31 +-- mm/vmscan.c | 11 --- 2 files changed, 37 insertions(+), 5 deletions(-) diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index b99cd28c9815..4793d952c248 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -395,7 +395,34 @@ TRACE_EVENT(mm_vmscan_write_folio, show_reclaim_flags(__entry->reclaim_flags)) ); -TRACE_EVENT(mm_vmscan_lru_shrink_inactive, +DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template, + + TP_PROTO(int nid), + + TP_ARGS(nid), + + TP_STRUCT__entry( + __field(int, nid) + ), + + TP_fast_assign( + __entry->nid = nid; + ), + + TP_printk("nid=%d", __entry->nid) +); + +DEFINE_EVENT(mm_vmscan_lru_shrink_start_template, mm_vmscan_lru_shrink_inactive_start, + TP_PROTO(int nid), + TP_ARGS(nid) +); + +DEFINE_EVENT(mm_vmscan_lru_shrink_start_template, mm_vmscan_lru_shrink_active_start, + TP_PROTO(int nid), + TP_ARGS(nid) +); + +TRACE_EVENT(mm_vmscan_lru_shrink_inactive_end, TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_reclaimed, @@ -446,7 +473,7 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive, show_reclaim_flags(__entry->reclaim_flags)) ); -TRACE_EVENT(mm_vmscan_lru_shrink_active, +TRACE_EVENT(mm_vmscan_lru_shrink_active_end, TP_PROTO(int nid, unsigned long nr_taken, unsigned long nr_active, unsigned long nr_deactivated, diff --git a/mm/vmscan.c b/mm/vmscan.c index 4e3b835c6b4a..a44d9624d60f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1906,6 +1906,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, struct pglist_data *pgdat = lruvec_pgdat(lruvec); bool stalled = false; + trace_mm_vmscan_lru_shrink_inactive_start(pgdat->node_id); + while (unlikely(too_many_isolated(pgdat, file, sc))) { if (stalled) return 0; @@ -1990,7 +1992,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, if (file) sc->nr.file_taken += nr_taken; - trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id, + 
trace_mm_vmscan_lru_shrink_inactive_end(pgdat->node_id, nr_scanned, nr_reclaimed, , sc->priority, file); return nr_reclaimed; } @@ -2028,6 +2030,8 @@ static void shrink_active_list(unsigned long nr_to_scan, int file = is_file_lru(lru); struct pglist_data *pgdat = lruvec_pgdat(lruvec); + trace_mm_vmscan_lru_shrink_active_start(pgdat->node_id); + lru_add_drain(); spin_lock_irq(>lru_lock); @@ -2107,7 +2111,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
[PATCH -next v6 1/2] mm: shrinker: add new event to trace shrink count
From: cuibixuan

do_shrink_slab() calculates the freeable memory through shrinker->count_objects(), and then reclaims the memory through shrinker->scan_objects(). When reclaiming memory, shrinker->count_objects() takes a certain amount of time:

  Fun                spend(us)
  ext4_es_count      4302
  ext4_es_scan       12
  super_cache_count  4195
  super_cache_scan   2103

Therefore, adding the trace event to count_objects() can more accurately obtain the time taken for slab memory reclamation.

Example of output:
 kswapd0-103 [003] . 1098.317942: mm_shrink_count_start: kfree_rcu_shrink_count.cfi_jt+0x0/0x8 c540ff51: nid: 0
 kswapd0-103 [003] . 1098.317951: mm_shrink_count_end: kfree_rcu_shrink_count.cfi_jt+0x0/0x8 c540ff51: nid: 0 freeable:36

Signed-off-by: Bixuan Cui
Reviewed-by: Steven Rostedt (Google)
---
Changes:
v6:
* Add Reviewed-by from Steven Rostedt.
v5:
* Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
* Add the explanation for adding the new shrink lru events into 'mm: vmscan: add new event to trace shrink lru'
v4:
* Add Reviewed-by and Changelog to every patch.
v3:
* Swap the positions of 'nid' and 'freeable' to prevent a hole in the trace event.
v2:
* Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the same time to fix a build error (Andrew pointed out).
include/trace/events/vmscan.h | 49 +++ mm/shrinker.c | 4 +++ 2 files changed, 53 insertions(+) diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index 1a488c30afa5..b99cd28c9815 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -196,6 +196,55 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re ); #endif /* CONFIG_MEMCG */ +TRACE_EVENT(mm_shrink_count_start, + TP_PROTO(struct shrinker *shr, struct shrink_control *sc), + + TP_ARGS(shr, sc), + + TP_STRUCT__entry( + __field(struct shrinker *, shr) + __field(void *, shrink) + __field(int, nid) + ), + + TP_fast_assign( + __entry->shr = shr; + __entry->shrink = shr->count_objects; + __entry->nid = sc->nid; + ), + + TP_printk("%pS %p: nid: %d", + __entry->shrink, + __entry->shr, + __entry->nid) +); + +TRACE_EVENT(mm_shrink_count_end, + TP_PROTO(struct shrinker *shr, struct shrink_control *sc, long freeable), + + TP_ARGS(shr, sc, freeable), + + TP_STRUCT__entry( + __field(struct shrinker *, shr) + __field(void *, shrink) + __field(long, freeable) + __field(int, nid) + ), + + TP_fast_assign( + __entry->shr = shr; + __entry->shrink = shr->count_objects; + __entry->freeable = freeable; + __entry->nid = sc->nid; + ), + + TP_printk("%pS %p: nid: %d freeable:%ld", + __entry->shrink, + __entry->shr, + __entry->nid, + __entry->freeable) +); + TRACE_EVENT(mm_shrink_slab_start, TP_PROTO(struct shrinker *shr, struct shrink_control *sc, long nr_objects_to_shrink, unsigned long cache_items, diff --git a/mm/shrinker.c b/mm/shrinker.c index dd91eab43ed3..d0c7bf61db61 100644 --- a/mm/shrinker.c +++ b/mm/shrinker.c @@ -379,7 +379,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, : SHRINK_BATCH; long scanned = 0, next_deferred; + trace_mm_shrink_count_start(shrinker, shrinkctl); + freeable = shrinker->count_objects(shrinker, shrinkctl); + + trace_mm_shrink_count_end(shrinker, shrinkctl, freeable); if (freeable == 0 || freeable 
== SHRINK_EMPTY) return freeable; -- 2.17.1
[PATCH -next v6 0/2] Make memory reclamation measurable
When the system memory is low, kswapd reclaims the memory. The key steps of memory reclamation include:

1. shrink_lruvec
   * shrink_active_list, moves folios from the active LRU to the inactive LRU
   * shrink_inactive_list, shrinks the inactive LRU list
2. shrink_slab
   * shrinker->count_objects(), calculates the freeable memory
   * shrinker->scan_objects(), reclaims the slab memory

The existing tracers in the vmscan are as follows:

--do_try_to_free_pages
  --shrink_zones
    --trace_mm_vmscan_node_reclaim_begin (tracer)
    --shrink_node
      --shrink_node_memcgs
        --trace_mm_vmscan_memcg_shrink_begin (tracer)
        --shrink_lruvec
          --shrink_list
            --shrink_active_list
              --trace_mm_vmscan_lru_shrink_active (tracer)
            --shrink_inactive_list
              --trace_mm_vmscan_lru_shrink_inactive (tracer)
            --shrink_active_list
        --shrink_slab
          --do_shrink_slab
            --shrinker->count_objects()
            --trace_mm_shrink_slab_start (tracer)
            --shrinker->scan_objects()
            --trace_mm_shrink_slab_end (tracer)
        --trace_mm_vmscan_memcg_shrink_end (tracer)
    --trace_mm_vmscan_node_reclaim_end (tracer)

If we get the duration and quantity of shrink lru and slab, then we can measure the memory reclamation, as follows.

Measuring memory reclamation with bpf:

LRU FILE:
  CPU  COMM     ShrinkActive(us)  ShrinkInactive(us)  Reclaim(page)
  7    kswapd0  26                51                  32
  7    kswapd0  52                47                  13

SLAB:
  CPU  COMM     OBJ_NAME                 Count_Dur(us)  Freeable(page)  Scan_Dur(us)  Reclaim(page)
  1    kswapd0  super_cache_scan.cfi_jt  2              341             3225          128
  7    kswapd0  super_cache_scan.cfi_jt  0              2247            8524          1024
  7    kswapd0  super_cache_scan.cfi_jt  2              3670            0             0

For this, add the new tracer to shrink_active_list/shrink_inactive_list and shrinker->count_objects().

Changes:
v6:
* Add Reviewed-by from Steven Rostedt.
v5:
* Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
* Add the explanation for adding the new shrink lru events into 'mm: vmscan: add new event to trace shrink lru'
v4:
* Add Reviewed-by and Changelog to every patch.
v3: Swap the positions of 'nid' and 'freeable' to prevent the hole in the trace event. v2: Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the same time to fix build error. cuibixuan (2): mm: shrinker: add new event to trace shrink count mm: vmscan: add new event to trace shrink lru include/trace/events/vmscan.h | 80 ++- mm/shrinker.c | 4 ++ mm/vmscan.c | 11 +++-- 3 files changed, 90 insertions(+), 5 deletions(-) -- 2.17.1
Re: [PATCH net-next v3 2/3] net: introduce abstraction for network memory
On Thu, Jan 4, 2024 at 1:44 PM Jakub Kicinski wrote: > > On Thu, 21 Dec 2023 15:44:22 -0800 Mina Almasry wrote: > > The warning is like so: > > > > ./include/net/page_pool/helpers.h: In function ‘page_pool_alloc’: > > ./include/linux/stddef.h:8:14: warning: returning ‘void *’ from a > > function with return type ‘netmem_ref’ {aka ‘long unsigned int’} makes > > integer from pointer without a cast [-Wint-conversion] > > 8 | #define NULL ((void *)0) > > | ^ > > ./include/net/page_pool/helpers.h:132:24: note: in expansion of macro > > ‘NULL’ > > 132 | return NULL; > > |^~~~ > > > > And happens in all the code where: > > > > netmem_ref func() > > { > > return NULL; > > } > > > > It's fixable by changing the return to `return (netmem_ref NULL);` or > > `return 0;`, but I feel like netmem_ref should be some type which > > allows a cast from NULL implicitly. > > Why do you think we should be able to cast NULL implicitly? > netmem_ref is a handle, it could possibly be some form of > an ID in the future, rather than a pointer. Or have more low > bits stolen for specific use cases. > > unsigned long, and returning 0 as "no handle" makes perfect sense to me. > > Note that 0 is a special case, bitwise types are allowed to convert > to 0/bool and 0 is implicitly allowed to become a bitwise type. > This will pass without a warning: > > typedef unsigned long __bitwise netmem_ref; > > netmem_ref some_code(netmem_ref ref) > { > // direct test is fine > if (!ref) > // 0 "upgrades" without casts > return 0; > // 1 does not, we need __force > return (__force netmem_ref)1 | ref; > } > > The __bitwise annotation will make catching people trying > to cast to struct page * trivial. > > You seem to be trying hard to make struct netmem a thing. > Perhaps you have a reason I'm not getting? 
There are a number of functions that return struct page* today that I convert to return struct netmem* later in the child devmem series, one example is something like: struct page *page_pool_alloc(...); // returns NULL on failure. becomes: struct netmem *page_pool_alloc(...); // also returns NULL on failure. rather than, netmem_ref page_pool_alloc(...); // returns 0 on failure. I guess in my mind having NULL be castable to the new type makes it so that I can avoid the additional code churn of converting a bunch of `return NULL;` to `return 0;`, and maybe the transition from page pointers to netmem pointers can be more easily done if they're both compatible pointer types. But that is not any huge blocker or critical point in my mind, I just thought this approach is preferred. If conversion to unsigned long makes more sense to you, I'll respin this like that and do the `NULL -> 0` conversion everywhere as needed. -- Thanks, Mina
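The handle-not-pointer design Jakub describes can be sketched in plain C (the kernel version would add __bitwise so sparse flags stray casts to struct page *; names here are illustrative):

```c
#include <stdbool.h>

/* netmem_ref as an opaque integer handle: 0 means "no handle", so
 * callers can test it like a pointer but never dereference it. */
typedef unsigned long netmem_ref;

/* Hypothetical constructor: returns 0 on failure instead of NULL. */
static netmem_ref netmem_from_cookie(unsigned long cookie)
{
	if (!cookie)
		return 0;	/* 0 "upgrades" without a cast */
	return (netmem_ref)cookie;	/* real code would need __force */
}

static bool netmem_is_set(netmem_ref ref)
{
	return ref != 0;
}
```

With this shape the `return NULL;` sites become `return 0;`, and because the handle is not a pointer type, low bits can later be stolen or the value replaced by an ID without touching callers.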
[PATCH 3/4] eventfs: Read ei->entries before ei->children in eventfs_iterate()
From: "Steven Rostedt (Google)" In order to apply a shortcut to skip over the current ctx->pos immediately, by using the ei->entries array, the reading of that array should be first. Moving the array reading before the linked list reading will make the shortcut change diff nicer to read. Link: https://lore.kernel.org/all/CAHk-=wiKwDUDv3+jCsv-uacDcHDVTYsXtBR9=6sgm5mqx+d...@mail.gmail.com/ Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 46 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index c73fb1f7ddbc..a1934e0eea3b 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -752,8 +752,8 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) * Need to create the dentries and inodes to have a consistent * inode number. */ - list_for_each_entry_srcu(ei_child, >children, list, -srcu_read_lock_held(_srcu)) { + for (i = 0; i < ei->nr_entries; i++) { + void *cdata = ei->data; if (c > 0) { c--; @@ -762,23 +762,32 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) ctx->pos++; - if (ei_child->is_freed) - continue; + entry = >entries[i]; + name = entry->name; - name = ei_child->name; + mutex_lock(_mutex); + /* If ei->is_freed then just bail here, nothing more to do */ + if (ei->is_freed) { + mutex_unlock(_mutex); + goto out_dec; + } + r = entry->callback(name, , , ); + mutex_unlock(_mutex); + if (r <= 0) + continue; - dentry = create_dir_dentry(ei, ei_child, ei_dentry); + dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops); if (!dentry) goto out_dec; ino = dentry->d_inode->i_ino; dput(dentry); - if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) + if (!dir_emit(ctx, name, strlen(name), ino, DT_REG)) goto out_dec; } - for (i = 0; i < ei->nr_entries; i++) { - void *cdata = ei->data; + list_for_each_entry_srcu(ei_child, >children, list, +srcu_read_lock_held(_srcu)) { if (c > 0) { c--; @@ -787,27 +796,18 @@ static 
int eventfs_iterate(struct file *file, struct dir_context *ctx) ctx->pos++; - entry = >entries[i]; - name = entry->name; - - mutex_lock(_mutex); - /* If ei->is_freed then just bail here, nothing more to do */ - if (ei->is_freed) { - mutex_unlock(_mutex); - goto out_dec; - } - r = entry->callback(name, , , ); - mutex_unlock(_mutex); - if (r <= 0) + if (ei_child->is_freed) continue; - dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops); + name = ei_child->name; + + dentry = create_dir_dentry(ei, ei_child, ei_dentry); if (!dentry) goto out_dec; ino = dentry->d_inode->i_ino; dput(dentry); - if (!dir_emit(ctx, name, strlen(name), ino, DT_REG)) + if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) goto out_dec; } ret = 1; -- 2.42.0
[PATCH 0/4] eventfs: More updates to eventfs_iterate()
With the ongoing discussion of the eventfs iterator, a few more changes are required, and some changes are just enhancements.

- Stop immediately in the loop if the ei is found to be in the process of being freed.
- Make the ctx->pos update consistent with the skipped previous read index. This fixes a bug with duplicate files being shown by 'ls'.
- Swap reading ei->entries with ei->children to make the next change easier to read.
- Add a "shortcut" in the ei->entries array to skip over already read entries.

Steven Rostedt (Google) (4):
  eventfs: Have eventfs_iterate() stop immediately if ei->is_freed is set
  eventfs: Do ctx->pos update for all iterations in eventfs_iterate()
  eventfs: Read ei->entries before ei->children in eventfs_iterate()
  eventfs: Shortcut eventfs_iterate() by skipping entries already read

 fs/tracefs/event_inode.c | 67 ++--
 1 file changed, 36 insertions(+), 31 deletions(-)
[PATCH 4/4] eventfs: Shortcut eventfs_iterate() by skipping entries already read
From: "Steven Rostedt (Google)" As the ei->entries array is fixed for the duration of the eventfs_inode, it can be used to skip over already read entries in eventfs_iterate(). That is, if ctx->pos is greater than zero, there's no reason in doing the loop across the ei->entries array for the entries less than ctx->pos. Instead, start the lookup of the entries at the current ctx->pos. Link: https://lore.kernel.org/all/CAHk-=wiKwDUDv3+jCsv-uacDcHDVTYsXtBR9=6sgm5mqx+d...@mail.gmail.com/ Suggested-by: Linus Torvalds Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 23 ++- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index a1934e0eea3b..fdff53d5a1f8 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -746,21 +746,15 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) if (!ei || !ei_dentry) goto out; - ret = 0; - /* * Need to create the dentries and inodes to have a consistent * inode number. 
*/ - for (i = 0; i < ei->nr_entries; i++) { - void *cdata = ei->data; - - if (c > 0) { - c--; - continue; - } + ret = 0; - ctx->pos++; + /* Start at 'c' to jump over already read entries */ + for (i = c; i < ei->nr_entries; i++, ctx->pos++) { + void *cdata = ei->data; entry = &ei->entries[i]; name = entry->name; @@ -769,7 +763,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) /* If ei->is_freed then just bail here, nothing more to do */ if (ei->is_freed) { mutex_unlock(&eventfs_mutex); - goto out_dec; + goto out; } r = entry->callback(name, &mode, &cdata, &fops); mutex_unlock(&eventfs_mutex); @@ -778,14 +772,17 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops); if (!dentry) - goto out_dec; + goto out; ino = dentry->d_inode->i_ino; dput(dentry); if (!dir_emit(ctx, name, strlen(name), ino, DT_REG)) - goto out_dec; + goto out; } + /* Subtract the skipped entries above */ + c -= min((unsigned int)c, (unsigned int)ei->nr_entries); + list_for_each_entry_srcu(ei_child, &ei->children, list, srcu_read_lock_held(&eventfs_srcu)) { -- 2.42.0
[PATCH 2/4] eventfs: Do ctx->pos update for all iterations in eventfs_iterate()
From: "Steven Rostedt (Google)" The ctx->pos was only updated when it added an entry, but the "skip to current pos" check (c--) happened for every loop regardless of whether the entry was added or not. This inconsistency caused readdir to be incorrect. It was due to: for (i = 0; i < ei->nr_entries; i++) { if (c > 0) { c--; continue; } mutex_lock(&eventfs_mutex); /* If ei->is_freed then just bail here, nothing more to do */ if (ei->is_freed) { mutex_unlock(&eventfs_mutex); goto out; } r = entry->callback(name, &mode, &cdata, &fops); mutex_unlock(&eventfs_mutex); [..] ctx->pos++; } But this can cause the iterator to return a file that was already read. That's because of the way the callback() works. Some events may not have all files, and the callback can return 0 to tell eventfs to skip the file for this directory. For instance, we have: # ls /sys/kernel/tracing/events/ftrace/function format hist hist_debug id inject and # ls /sys/kernel/tracing/events/sched/sched_switch/ enable filter format hist hist_debug id inject trigger Where the function directory is missing "enable", "filter" and "trigger". That's because the callback() for events has: static int event_callback(const char *name, umode_t *mode, void **data, const struct file_operations **fops) { struct trace_event_file *file = *data; struct trace_event_call *call = file->event_call; [..] /* * Only event directories that can be enabled should have * triggers or filters, with the exception of the "print" * event that can have a "trigger" file. */ if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)) { if (call->class->reg && strcmp(name, "enable") == 0) { *mode = TRACE_MODE_WRITE; *fops = &ftrace_enable_fops; return 1; } if (strcmp(name, "filter") == 0) { *mode = TRACE_MODE_WRITE; *fops = &ftrace_event_filter_fops; return 1; } } if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE) || strcmp(trace_event_name(call), "print") == 0) { if (strcmp(name, "trigger") == 0) { *mode = TRACE_MODE_WRITE; *fops = &event_trigger_fops; return 1; } } [..]
return 0; } Where the function event has the TRACE_EVENT_FL_IGNORE_ENABLE set. This means that the entries array elements for "enable", "filter" and "trigger" when called on the function event will have the callback return 0 and not 1, to tell eventfs to skip these files for it. Because the "skip to current ctx->pos" check happened for all entries, but the ctx->pos++ only happened to entries that exist, it would confuse the reading of a directory. Which would cause: # ls /sys/kernel/tracing/events/ftrace/function/ format hist hist hist_debug hist_debug id inject inject The missing "enable", "filter" and "trigger" caused ls to show "hist", "hist_debug" and "inject" twice. Update the ctx->pos for every iteration to keep its update and the "skip" update consistent. This also means that on error, the ctx->pos needs to be decremented if it was incremented without adding something. Link: https://lore.kernel.org/all/20240104150500.38b15...@gandalf.local.home/ Fixes: 493ec81a8fb8e ("eventfs: Stop using dcache_readdir() for getdents()") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 21 ++--- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 0aca6910efb3..c73fb1f7ddbc 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -760,6 +760,8 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) continue; } + ctx->pos++; + if (ei_child->is_freed) continue; @@ -767,13 +769,12 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) dentry = create_dir_dentry(ei, ei_child, ei_dentry); if (!dentry) - goto out; + goto out_dec; ino = dentry->d_inode->i_ino; dput(dentry); if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) - goto out; - ctx->pos++; + goto out_dec; } for (i = 0; i < ei->nr_entries; i++) { @@ -784,6 +785,8 @@ static int eventfs_iterate(struct file *file, struct dir_context
[PATCH 1/4] eventfs: Have eventfs_iterate() stop immediately if ei->is_freed is set
From: "Steven Rostedt (Google)" If ei->is_freed is set in eventfs_iterate(), it means that the directory that is being iterated on is in the process of being freed. Just exit the loop immediately when that is ever detected, and separate out the return of the entry->callback() from ei->is_freed. Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 72912b5f9a90..0aca6910efb3 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -788,11 +788,12 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) name = entry->name; mutex_lock(&eventfs_mutex); - /* If ei->is_freed, then the event itself may be too */ - if (!ei->is_freed) - r = entry->callback(name, &mode, &cdata, &fops); - else - r = -1; + /* If ei->is_freed then just bail here, nothing more to do */ + if (ei->is_freed) { + mutex_unlock(&eventfs_mutex); + goto out; + } + r = entry->callback(name, &mode, &cdata, &fops); mutex_unlock(&eventfs_mutex); if (r <= 0) continue; -- 2.42.0
Re: [PATCH net-next v3 2/3] net: introduce abstraction for network memory
On Thu, 21 Dec 2023 15:44:22 -0800 Mina Almasry wrote: > The warning is like so: > > ./include/net/page_pool/helpers.h: In function ‘page_pool_alloc’: > ./include/linux/stddef.h:8:14: warning: returning ‘void *’ from a > function with return type ‘netmem_ref’ {aka ‘long unsigned int’} makes > integer from pointer without a cast [-Wint-conversion] > 8 | #define NULL ((void *)0) > | ^ > ./include/net/page_pool/helpers.h:132:24: note: in expansion of macro > ‘NULL’ > 132 | return NULL; > | ^~~~ > > And happens in all the code where: > > netmem_ref func() > { > return NULL; > } > > It's fixable by changing the return to `return (netmem_ref)NULL;` or > `return 0;`, but I feel like netmem_ref should be some type which > allows a cast from NULL implicitly. Why do you think we should be able to cast NULL implicitly? netmem_ref is a handle, it could possibly be some form of an ID in the future, rather than a pointer. Or have more low bits stolen for specific use cases. unsigned long, and returning 0 as "no handle" makes perfect sense to me. Note that 0 is a special case, bitwise types are allowed to convert to 0/bool and 0 is implicitly allowed to become a bitwise type. This will pass without a warning: typedef unsigned long __bitwise netmem_ref; netmem_ref some_code(netmem_ref ref) { // direct test is fine if (!ref) // 0 "upgrades" without casts return 0; // 1 does not, we need __force return (__force netmem_ref)1 | ref; } The __bitwise annotation will make catching people trying to cast to struct page * trivial. You seem to be trying hard to make struct netmem a thing. Perhaps you have a reason I'm not getting?
Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function
On Thu, 04 Jan 2024 13:27:07 -0600, Dave Hansen wrote: On 1/4/24 11:11, Haitao Huang wrote: If those are OK with users and also make it acceptable for merge quickly, I'm happy to do that How about we put some actual numbers behind this? How much complexity are we talking about here? What's the diffstat for the utterly bare-bones implementation and what does it cost on top of that to do the per-cgroup reclaim? For bare-bone: arch/x86/Kconfig | 13 arch/x86/kernel/cpu/sgx/Makefile | 1 + arch/x86/kernel/cpu/sgx/epc_cgroup.c | 104 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 39 +++ arch/x86/kernel/cpu/sgx/main.c | 41 arch/x86/kernel/cpu/sgx/sgx.h| 5 + include/linux/misc_cgroup.h | 42 + kernel/cgroup/misc.c | 52 +++--- 8 files changed, 281 insertions(+), 16 deletions(-) Additional for per-cgroup reclaim: arch/x86/kernel/cpu/sgx/encl.c | 41 + arch/x86/kernel/cpu/sgx/encl.h | 3 +- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 224 +-- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 18 ++-- arch/x86/kernel/cpu/sgx/main.c | 226 ++-- arch/x86/kernel/cpu/sgx/sgx.h| 85 +-- 6 files changed, 480 insertions(+), 117 deletions(-) Thanks Haitao
Re: [syzbot] [virtualization?] KMSAN: uninit-value in virtqueue_add (4)
On Tue, Jan 02, 2024 at 08:03:46AM -0500, Michael S. Tsirkin wrote: > On Mon, Jan 01, 2024 at 05:38:24AM -0800, syzbot wrote: > > Hello, > > > > syzbot found the following issue on: > > > > HEAD commit:fbafc3e621c3 Merge tag 'for_linus' of git://git.kernel.org.. > > git tree: upstream > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=173df3e9e8 > > kernel config: https://syzkaller.appspot.com/x/.config?x=e0c7078a6b901aa3 > > dashboard link: https://syzkaller.appspot.com/bug?extid=d7521c1e3841ed075a42 > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for > > Debian) 2.40 > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1300b4a1e8 > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=130b0379e8 > > > > Downloadable assets: > > disk image: > > https://storage.googleapis.com/syzbot-assets/1520f7b6daa4/disk-fbafc3e6.raw.xz > > vmlinux: > > https://storage.googleapis.com/syzbot-assets/8b490af009d5/vmlinux-fbafc3e6.xz > > kernel image: > > https://storage.googleapis.com/syzbot-assets/202ca200f4a4/bzImage-fbafc3e6.xz > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+d7521c1e3841ed075...@syzkaller.appspotmail.com > > > > = Hi Alexander, Please take a look at this KMSAN failure. The uninitialized memory was created for the purpose of writing a coredump. vring_map_one_sg() should have direction=DMA_TO_DEVICE. I can't easily tell whether this is a genuine bug or an issue with commit 88938359e2df ("virtio: kmsan: check/unpoison scatterlist in vring_map_one_sg()"). Maybe coredump.c is writing out pages that KMSAN thinks are uninitialized? 
Stefan > > BUG: KMSAN: uninit-value in vring_map_one_sg > > drivers/virtio/virtio_ring.c:380 [inline] > > BUG: KMSAN: uninit-value in virtqueue_add_split > > drivers/virtio/virtio_ring.c:614 [inline] > > BUG: KMSAN: uninit-value in virtqueue_add+0x21c6/0x6530 > > drivers/virtio/virtio_ring.c:2210 > > vring_map_one_sg drivers/virtio/virtio_ring.c:380 [inline] > > virtqueue_add_split drivers/virtio/virtio_ring.c:614 [inline] > > virtqueue_add+0x21c6/0x6530 drivers/virtio/virtio_ring.c:2210 > > virtqueue_add_sgs+0x186/0x1a0 drivers/virtio/virtio_ring.c:2244 > > __virtscsi_add_cmd drivers/scsi/virtio_scsi.c:467 [inline] > > virtscsi_add_cmd+0x838/0xad0 drivers/scsi/virtio_scsi.c:501 > > virtscsi_queuecommand+0x896/0xa60 drivers/scsi/virtio_scsi.c:598 > > scsi_dispatch_cmd drivers/scsi/scsi_lib.c:1516 [inline] > > scsi_queue_rq+0x4874/0x5790 drivers/scsi/scsi_lib.c:1758 > > blk_mq_dispatch_rq_list+0x13f8/0x3600 block/blk-mq.c:2049 > > __blk_mq_do_dispatch_sched block/blk-mq-sched.c:170 [inline] > > blk_mq_do_dispatch_sched block/blk-mq-sched.c:184 [inline] > > __blk_mq_sched_dispatch_requests+0x10af/0x2500 block/blk-mq-sched.c:309 > > blk_mq_sched_dispatch_requests+0x160/0x2d0 block/blk-mq-sched.c:333 > > blk_mq_run_work_fn+0xd0/0x280 block/blk-mq.c:2434 > > process_one_work kernel/workqueue.c:2627 [inline] > > process_scheduled_works+0x104e/0x1e70 kernel/workqueue.c:2700 > > worker_thread+0xf45/0x1490 kernel/workqueue.c:2781 > > kthread+0x3ed/0x540 kernel/kthread.c:388 > > ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147 > > ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242 > > > > Uninit was created at: > > __alloc_pages+0x9a4/0xe00 mm/page_alloc.c:4591 > > alloc_pages_mpol+0x62b/0x9d0 mm/mempolicy.c:2133 > > alloc_pages mm/mempolicy.c:2204 [inline] > > folio_alloc+0x1da/0x380 mm/mempolicy.c:2211 > > filemap_alloc_folio+0xa5/0x430 mm/filemap.c:974 > > __filemap_get_folio+0xa5a/0x1760 mm/filemap.c:1918 > > ext4_da_write_begin+0x7f8/0xec0 
fs/ext4/inode.c:2891 > > generic_perform_write+0x3f5/0xc40 mm/filemap.c:3918 > > ext4_buffered_write_iter+0x564/0xaa0 fs/ext4/file.c:299 > > ext4_file_write_iter+0x20f/0x3460 > > __kernel_write_iter+0x329/0x930 fs/read_write.c:517 > > dump_emit_page fs/coredump.c:888 [inline] > > dump_user_range+0x593/0xcd0 fs/coredump.c:915 > > elf_core_dump+0x528d/0x5a40 fs/binfmt_elf.c:2077 > > do_coredump+0x32c9/0x4920 fs/coredump.c:764 > > get_signal+0x2185/0x2d10 kernel/signal.c:2890 > > arch_do_signal_or_restart+0x53/0xca0 arch/x86/kernel/signal.c:309 > > exit_to_user_mode_loop+0xe8/0x320 kernel/entry/common.c:168 > > exit_to_user_mode_prepare+0x163/0x220 kernel/entry/common.c:204 > > irqentry_exit_to_user_mode+0xd/0x30 kernel/entry/common.c:309 > > irqentry_exit+0x16/0x40 kernel/entry/common.c:412 > > exc_page_fault+0x246/0x6f0 arch/x86/mm/fault.c:1564 > > asm_exc_page_fault+0x2b/0x30 arch/x86/include/asm/idtentry.h:570 > > > > Bytes 0-4095 of 4096 are uninitialized > > Memory access of size 4096 starts at 88812c79c000 > > > > CPU: 0 PID: 997 Comm: kworker/0:1H Not tainted > > 6.7.0-rc7-syzkaller-3-gfbafc3e621c3 #0 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > >
Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function
On 1/4/24 11:11, Haitao Huang wrote: > If those are OK with users and also make it acceptable for merge > quickly, I'm happy to do that How about we put some actual numbers behind this? How much complexity are we talking about here? What's the diffstat for the utterly bare-bones implementation and what does it cost on top of that to do the per-cgroup reclaim?
Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function
Hi Michal, On Thu, 04 Jan 2024 06:38:41 -0600, Michal Koutný wrote: Hello. On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang wrote: Thanks for raising this. Actually my understanding of the above discussion was mainly about not doing reclaiming by killing enclaves, i.e., I assumed "reclaiming" within that context only meant for that particular kind. As Mikko pointed out, without reclaiming per cgroup, the max limit of each cgroup needs to accommodate the peak usage of enclaves within that cgroup. That may be inconvenient for practical usage and limits could be forced to be set larger than necessary to run enclaves performantly. For example, we can observe the following undesired consequences comparing a system with the current kernel loaded with enclaves whose total peak usage is greater than the EPC capacity. 1) If a user wants to load the same exact enclaves but in separate cgroups, then the sum of cgroup limits must be higher than the capacity and the system will end up doing the same old global reclaiming as it is currently doing. Cgroup is not useful at all for isolating EPC consumptions. That is the use of limits to prevent a runaway cgroup smothering the system. Overcommitted values in such a config are fine because the more simultaneous runaways, the less likely. The peak consumption is on the fair expense of others (some efficiency) and the limit contains the runaway (hence the isolation). This makes sense to me in theory. Mikko, Chris Y/Bo Z, your thoughts on whether this is good enough for your intended usages? 2) To isolate the impact of usage of each cgroup on other cgroups and yet still be able to load each enclave, the user basically has to carefully plan to ensure the sum of cgroup max limits, i.e., the sum of peak usage of enclaves, is not reaching over the capacity. That means no over-committing allowed and the same system may not be able to load as many enclaves as with the current kernel.
Another "config layout" of limits is to achieve partitioning (sum == capacity). That is perfect isolation but it naturally goes against efficient utilization. The way other controllers approach this trade-off is with weights (cpu, io) or protections (memory). I'm afraid the misc controller is not ready for this. My opinion is to start with the simple limits (first patches) and think of a prioritization/guarantee mechanism based on real cases. We moved away from using a mem-like custom controller with (low, high, max) to the misc controller. But if we need to add those down the road, then the interface needs to be changed. So my concern on this route would be whether misc would allow any of those extensions. On the other hand, it might turn out less complex to just do the reclamation per cgroup. Thanks a lot for your comments and they are really helpful! Haitao
Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function
On Thu Jan 4, 2024 at 9:11 PM EET, Haitao Huang wrote: > > The key question here is whether we want the SGX VM to be complex and > > more like the real VM or simple when a cgroup hits its limit. Right? > > > > Although it's fair to say the majority of complexity of this series is in > support for reclaiming per cgroup, I think it's manageable and much less > than the real VM after we removed the enclave-killing parts: the only extra > effort is to track pages in separate lists and reclaim them separately > as opposed to tracking them on one global list and reclaiming them together. The main > reclaiming loop code is still pretty much the same as before. I'm not seeing any unmanageable complexity on the SGX side, and also the cgroups-specific changes are somewhat clean to me at least... BR, Jarkko
Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function
Hi Dave, On Wed, 03 Jan 2024 10:37:35 -0600, Dave Hansen wrote: On 12/18/23 13:24, Haitao Huang wrote:> @Dave and @Michal, Your thoughts? Or could you confirm we should not do reclaim per cgroup at all? What's the benefit of doing reclaim per cgroup? Is that worth the extra complexity? Without reclaiming per cgroup, we have to always set the limit to the enclave's peak usage. This may not be efficient utilization as in many cases each enclave can perform fine with an EPC limit set less than peak. Basically each group can not give up some pages for the greater good without dying :-) Also with enclaves enabled with EDMM, the peak usage is not static so it is hard to determine upfront. Hence it might be an operation/deployment inconvenience. In case of over-committing (sum of limits > total capacity), one cgroup at peak usage may require swapping pages out in a different cgroup if the system is overloaded at that time. The key question here is whether we want the SGX VM to be complex and more like the real VM or simple when a cgroup hits its limit. Right? Although it's fair to say the majority of complexity of this series is in support for reclaiming per cgroup, I think it's manageable and much less than the real VM after we removed the enclave-killing parts: the only extra effort is to track pages in separate lists and reclaim them separately as opposed to tracking them on one global list and reclaiming them together. The main reclaiming loop code is still pretty much the same as before. If stopping at patch 5 and having less code is even remotely an option, why not do _that_? I hope I described the limitations clearly enough above. If those are OK with users and also make it acceptable for merge quickly, I'm happy to do that :-) Thanks Haitao
Re: [PATCH v9 1/2] ring-buffer: Introducing ring-buffer mapping functions
On Thu, 21 Dec 2023 12:58:13 -0500 Steven Rostedt wrote: > On Thu, 21 Dec 2023 17:35:22 + > Vincent Donnefort wrote: > > > @@ -5999,6 +6078,307 @@ int ring_buffer_subbuf_order_set(struct > > trace_buffer *buffer, int order) > > } > > EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set); > > > > The kernel developers have agreed to allow loop variables to be declared in > loops. This will simplify these macros: > > > > > +#define subbuf_page(off, start) \ > > + virt_to_page((void *)(start + (off << PAGE_SHIFT))) > > + > > +#define foreach_subbuf_page(off, sub_order, start, page) \ > > + for (off = 0, page = subbuf_page(0, start); \ > > +off < (1 << sub_order);\ > > +off++, page = subbuf_page(off, start)) > > #define foreach_subbuf_page(sub_order, start, page) \ > for (int __off = 0, page = subbuf_page(0, (start)); \ >__off < (1 << (sub_order));\ >__off++, page = subbuf_page(__off, (start))) So it seems that you can't declare "int __off" with page there, but we could have: #define foreach_subbuf_page(sub_order, start, page) \ page = subbuf_page(0, (start)); \ for (int __off = 0; __off < (1 << (sub_order)); \ __off++, page = subbuf_page(__off, (start))) And that would work. -- Steve
Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
On Wed, Jan 3, 2024 at 11:00 AM Yunsheng Lin wrote: > > introduce vhost_net_test basing on virtio_test to test > vhost_net changing in the kernel. > > Signed-off-by: Yunsheng Lin > --- > tools/virtio/Makefile | 8 +- > tools/virtio/vhost_net_test.c | 574 ++ > 2 files changed, 579 insertions(+), 3 deletions(-) > create mode 100644 tools/virtio/vhost_net_test.c > > diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile > index d128925980e0..e25e99c1c3b7 100644 > --- a/tools/virtio/Makefile > +++ b/tools/virtio/Makefile > @@ -1,8 +1,9 @@ > # SPDX-License-Identifier: GPL-2.0 > all: test mod > -test: virtio_test vringh_test > +test: virtio_test vringh_test vhost_net_test > virtio_test: virtio_ring.o virtio_test.o > vringh_test: vringh_test.o vringh.o virtio_ring.o > +vhost_net_test: virtio_ring.o vhost_net_test.o > > try-run = $(shell set -e; \ > if ($(1)) >/dev/null 2>&1; \ > @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean > > .PHONY: all test mod clean vhost oot oot-clean oot-build > clean: > - ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \ > - vhost_test/Module.symvers vhost_test/modules.order *.d > + ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \ > + vhost_test/.*.cmd vhost_test/Module.symvers \ > + vhost_test/modules.order *.d > -include *.d > diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c > new file mode 100644 > index ..cfffcef53d94 > --- /dev/null > +++ b/tools/virtio/vhost_net_test.c > @@ -0,0 +1,574 @@ > +// SPDX-License-Identifier: GPL-2.0 > +#define _GNU_SOURCE > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define RANDOM_BATCH -1 > +#define HDR_LEN12 > +#define TEST_BUF_LEN 256 > +#define TEST_PTYPE ETH_P_LOOPBACK > + > +/* Used by implementation of kmalloc() in 
tools/virtio/linux/kernel.h */ > +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end; > + > +struct vq_info { > + int kick; > + int call; > + int idx; > + long started; > + long completed; > + struct pollfd fds; > + void *ring; > + /* copy used for control */ > + struct vring vring; > + struct virtqueue *vq; > +}; > + > +struct vdev_info { > + struct virtio_device vdev; > + int control; > + struct vq_info vqs[2]; > + int nvqs; > + void *buf; > + size_t buf_size; > + char *test_buf; > + char *res_buf; > + struct vhost_memory *mem; > + int sock; > + int ifindex; > + unsigned char mac[ETHER_ADDR_LEN]; > +}; > + > +static int tun_alloc(struct vdev_info *dev) > +{ > + struct ifreq ifr; > + int len = HDR_LEN; > + int fd, e; > + > + fd = open("/dev/net/tun", O_RDWR); > + if (fd < 0) { > + perror("Cannot open /dev/net/tun"); > + return fd; > + } > + > + memset(&ifr, 0, sizeof(ifr)); > + > + ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR; > + snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid()); > + > + e = ioctl(fd, TUNSETIFF, &ifr); > + if (e < 0) { > + perror("ioctl[TUNSETIFF]"); > + close(fd); > + return e; > + } > + > + e = ioctl(fd, TUNSETVNETHDRSZ, &len); > + if (e < 0) { > + perror("ioctl[TUNSETVNETHDRSZ]"); > + close(fd); > + return e; > + } > + > + e = ioctl(fd, SIOCGIFHWADDR, &ifr); > + if (e < 0) { > + perror("ioctl[SIOCGIFHWADDR]"); > + close(fd); > + return e; > + } > + > + memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN); > + return fd; > +} > + > +static void vdev_create_socket(struct vdev_info *dev) > +{ > + struct ifreq ifr; > + > + dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE)); > + assert(dev->sock != -1); > + > + snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid()); > + assert(ioctl(dev->sock, SIOCGIFINDEX, &ifr) >= 0); > + > + dev->ifindex = ifr.ifr_ifindex; > + > + /* Set the flags that bring the device up */ > + assert(ioctl(dev->sock, SIOCGIFFLAGS, &ifr) >= 0); > + ifr.ifr_flags |= (IFF_UP | IFF_RUNNING); > + assert(ioctl(dev->sock,
SIOCSIFFLAGS, &ifr) >= 0); > +} > + > +static void vdev_send_packet(struct vdev_info *dev) > +{ > + char *sendbuf = dev->test_buf + HDR_LEN; > + struct sockaddr_ll saddrll = {0}; > + int sockfd = dev->sock; > + int ret; > + > + memset(&saddrll, 0, sizeof(saddrll)); > + saddrll.sll_family = PF_PACKET; > + saddrll.sll_ifindex =
Re: [PATCH v8 3/3] remoteproc: zynqmp: parse TCM from device tree
On 1/3/24 12:17 PM, Mathieu Poirier wrote: > On Fri, Dec 15, 2023 at 03:57:25PM -0800, Tanmay Shah wrote: > > ZynqMP TCM information is fixed in driver. Now ZynqMP TCM information > > s/"is fixed in driver"/"was fixed in driver" > > > is available in device-tree. Parse TCM information in driver > > as per new bindings. > > > > Signed-off-by: Tanmay Shah > > --- > > > > Changes in v8: > > - parse power-domains property from device-tree and use EEMI calls > > to power on/off TCM instead of using pm domains framework > > - Remove checking of pm_domain_id validation to power on/off tcm > > - Remove spurious change > > > > Changes in v7: > > - move checking of pm_domain_id from previous patch > > - fix mem_bank_data memory allocation > > > > drivers/remoteproc/xlnx_r5_remoteproc.c | 154 +++- > > 1 file changed, 148 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c > > b/drivers/remoteproc/xlnx_r5_remoteproc.c > > index 4395edea9a64..36d73dcd93f0 100644 > > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c > > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c > > @@ -74,8 +74,8 @@ struct mbox_info { > > }; > > > > /* > > - * Hardcoded TCM bank values. This will be removed once TCM bindings are > > - * accepted for system-dt specifications and upstreamed in linux kernel > > + * Hardcoded TCM bank values. This will stay in driver to maintain backward > > + * compatibility with device-tree that does not have TCM information. 
> > */ > > static const struct mem_bank_data zynqmp_tcm_banks_split[] = { > > {0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each > > */ > > @@ -878,6 +878,139 @@ static struct zynqmp_r5_core > > *zynqmp_r5_add_rproc_core(struct device *cdev) > > return ERR_PTR(ret); > > } > > > > +static int zynqmp_r5_get_tcm_node_from_dt(struct zynqmp_r5_cluster > > *cluster) > > +{ > > + struct of_phandle_args out_args; > > + int tcm_reg_per_r5, tcm_pd_idx; > > + struct zynqmp_r5_core *r5_core; > > + int i, j, tcm_bank_count, ret; > > + struct platform_device *cpdev; > > + struct mem_bank_data *tcm; > > + struct device_node *np; > > + struct resource *res; > > + u64 abs_addr, size; > > + struct device *dev; > > + > > + for (i = 0; i < cluster->core_count; i++) { > > + r5_core = cluster->r5_cores[i]; > > + dev = r5_core->dev; > > + np = of_node_get(dev_of_node(dev)); > > + tcm_pd_idx = 1; > > + > > + /* we have address cell 2 and size cell as 2 */ > > + tcm_reg_per_r5 = of_property_count_elems_of_size(np, "reg", > > +4 * > > sizeof(u32)); > > + if (tcm_reg_per_r5 <= 0) { > > + dev_err(dev, "can't get reg property err %d\n", > > tcm_reg_per_r5); > > + return -EINVAL; > > + } > > + > > + /* > > +* In lockstep mode, r5 core 0 will use r5 core 1 TCM > > +* power domains as well. so allocate twice of per core TCM > > Twice of what? Please use proper english in your multi line comments, i.e a > capital letter for the first word and a dot at the end. 
> > > +*/ > > + if (cluster->mode == LOCKSTEP_MODE) > > + tcm_bank_count = tcm_reg_per_r5 * 2; > > + else > > + tcm_bank_count = tcm_reg_per_r5; > > + > > + r5_core->tcm_banks = devm_kcalloc(dev, tcm_bank_count, > > + sizeof(struct mem_bank_data > > *), > > + GFP_KERNEL); > > + if (!r5_core->tcm_banks) > > + ret = -ENOMEM; > > + > > + r5_core->tcm_bank_count = tcm_bank_count; > > + for (j = 0; j < tcm_bank_count; j++) { > > + tcm = devm_kzalloc(dev, sizeof(struct mem_bank_data), > > + GFP_KERNEL); > > + if (!tcm) > > + return -ENOMEM; > > + > > + r5_core->tcm_banks[j] = tcm; > > + > > + /* > > +* In lockstep mode, get second core's TCM power > > domains id > > +* after first core TCM parsing is done as > > There seems to be words missing at the end of the sentence, and there is no > dot. > > > +*/ > > + if (j == tcm_reg_per_r5) { > > + /* dec first core node */ > > + of_node_put(np); > > + > > + /* get second core node */ > > + np = of_get_next_child(cluster->dev->of_node, > > np); > > + > > + /* > > +* reset index of power-domains property list > > +* for second core > > +*/ > > +
Re: [PATCH v8] bus: mhi: host: Add tracing support
On Thu, 7 Dec 2023 10:00:47 +0530 Krishna chaitanya chundru wrote: > This change adds ftrace support for following functions which > helps in debugging the issues when there is Channel state & MHI > state change and also when we receive data and control events: > 1. mhi_intvec_mhi_states > 2. mhi_process_data_event_ring > 3. mhi_process_ctrl_ev_ring > 4. mhi_gen_tre > 5. mhi_update_channel_state > 6. mhi_tryset_pm_state > 7. mhi_pm_st_worker > > Wherever the trace events are added, debug messages are removed. > > Signed-off-by: Krishna chaitanya chundru > --- > Changes in v8: > - Pass the structure and dereference the variables in TP_fast_assign as > suggested by Steve > - Link to v7: > https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com So this looks good from a tracing POV. Reviewed-by: Steven Rostedt (Google) But I do have some more comments that could be done by new patches. > +TRACE_EVENT(mhi_intvec_states, > + > + TP_PROTO(struct mhi_controller *mhi_cntrl, int dev_ee, int dev_state), > + > + TP_ARGS(mhi_cntrl, dev_ee, dev_state), > + > + TP_STRUCT__entry( > + __string(name, mhi_cntrl->mhi_dev->name) > + __field(int, local_ee) > + __field(int, state) > + __field(int, dev_ee) > + __field(int, dev_state) > + ), > + > + TP_fast_assign( > + __assign_str(name, mhi_cntrl->mhi_dev->name); > + __entry->local_ee = mhi_cntrl->ee; > + __entry->state = mhi_cntrl->dev_state; > + __entry->dev_ee = dev_ee; > + __entry->dev_state = dev_state; > + ), > + > + TP_printk("%s: local ee: %s state: %s device ee: %s state: %s\n", > + __get_str(name), > + TO_MHI_EXEC_STR(__entry->local_ee), > + mhi_state_str(__entry->state), > + TO_MHI_EXEC_STR(__entry->dev_ee), > + mhi_state_str(__entry->dev_state)) So the above may have issues with user space parsing.
For one, that mhi_state_str() is: static inline const char *mhi_state_str(enum mhi_state state) { switch (state) { case MHI_STATE_RESET: return "RESET"; case MHI_STATE_READY: return "READY"; case MHI_STATE_M0: return "M0"; case MHI_STATE_M1: return "M1"; case MHI_STATE_M2: return "M2"; case MHI_STATE_M3: return "M3"; case MHI_STATE_M3_FAST: return "M3 FAST"; case MHI_STATE_BHI: return "BHI"; case MHI_STATE_SYS_ERR: return "SYS ERROR"; default: return "Unknown state"; } }; Which if this could be changed into: #define MHI_STATE_LIST \ EM(RESET,"RESET") \ EM(READY,"READY") \ EM(M0, "M0") \ EM(M1, "M1") \ EM(M2, "M2") \ EM(M3, "M3") \ EM(M3_FAST, "M3_FAST") \ EM(BHI, "BHI") \ EMe(SYS_ERR, "SYS ERROR") #undef EM #undef EMe #define EM(a, b) case MHI_STATE_##a: return b; #define EMe(a, b) case MHI_STATE_##a: return b; static inline const char *mhi_state_str(enum mhi_state state) { switch (state) { MHI_STATE_LIST default: return "Unknown state"; } } Then you could use that macro in the trace header: #undef EM #undef EMe #define EM(a, b) TRACE_DEFINE_ENUM(MHI_STATE_##a); #define EMe(a, b) TRACE_DEFINE_ENUM(MHI_STATE_##a); MHI_STATE_LIST And in the print fmts: #undef EM #undef EMe #define EM(a, b) { MHI_STATE_##a, b }, #define EMe(a, b) { MHI_STATE_##a, b } TP_printk("%s: local ee: %s state: %s device ee: %s state: %s\n", __get_str(name), TO_MHI_EXEC_STR(__entry->local_ee), __print_symbolic(__entry->state, MHI_STATE_LIST), TO_MHI_EXEC_STR(__entry->dev_ee), __print_symbolic(__entry->dev_state, MHI_STATE_LIST)) And that will be exported to user space in the /sys/kernel/tracing/events/*/*/format file, as: __print_symbolic(REC->state, { { MHI_STATE_RESET, "RESET"}, { MHI_STATE_READY, "READY"}, { MHI_STATE_M0, "M0"}, { MHI_STATE_M1, "M1"}, { MHI_STATE_M2, "M2"}, { MHI_STATE_M3, "M3"}, { MHI_STATE_M3_FAST, "M3 FAST"}, { MHI_STATE_BHI, "BHI"}, { MHI_STATE_SYS_ERR, "SYS ERROR"} } Which libtracevent knows how to parse.
Oh, and it wouldn't even show the enum names, as the TRACE_DEFINE_ENUM() calls above would tell the kernel to replace them with their actual numeric values. That way, when trace-cmd or perf reads the raw data, it knows how to print it. Now
[PATCH v6 3/3] vduse: enable Virtio-net device type
This patch adds the Virtio-net device type to the supported device types. Initialization fails if the device does not support the VIRTIO_F_VERSION_1 feature, in order to guarantee that the configuration space is read-only. It also fails with -EPERM if CAP_NET_ADMIN is missing.

Signed-off-by: Maxime Coquelin
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 94f54ea2eb06..4057b34ff995 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -151,6 +151,7 @@ static struct workqueue_struct *vduse_irq_bound_wq;
 
 static u32 allowed_device_id[] = {
 	VIRTIO_ID_BLOCK,
+	VIRTIO_ID_NET,
 };
 
 static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
@@ -1694,6 +1695,10 @@ static bool features_is_valid(struct vduse_dev_config *config)
 	    (config->features & VDUSE_NET_INVALID_FEATURES_MASK))
 		return false;
 
+	if ((config->device_id == VIRTIO_ID_NET) &&
+	    !(config->features & (1ULL << VIRTIO_F_VERSION_1)))
+		return false;
+
 	return true;
 }
 
@@ -1801,6 +1806,10 @@ static int vduse_create_dev(struct vduse_dev_config *config,
 	int ret;
 	struct vduse_dev *dev;
 
+	ret = -EPERM;
+	if ((config->device_id == VIRTIO_ID_NET) && !capable(CAP_NET_ADMIN))
+		goto err;
+
 	ret = -EEXIST;
 	if (vduse_find_dev(config->name))
 		goto err;
@@ -2044,6 +2053,7 @@ static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
 
 static struct virtio_device_id id_table[] = {
 	{ VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
+	{ VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID },
 	{ 0 },
 };
-- 
2.43.0
[PATCH v6 2/3] vduse: Temporarily fail if control queue features requested
The Virtio-net driver control queue implementation is not safe when used with VDUSE. If the VDUSE application does not reply to control queue messages, it currently ends up hanging the kernel thread sending this command. Some work is ongoing to make the control queue implementation robust with VDUSE. Until it is completed, let's fail the features check if any control-queue related feature is requested.

Signed-off-by: Maxime Coquelin
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0486ff672408..94f54ea2eb06 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include "iova_domain.h"
@@ -46,6 +47,15 @@
 
 #define IRQ_UNBOUND -1
 
+#define VDUSE_NET_INVALID_FEATURES_MASK \
+	(BIT_ULL(VIRTIO_NET_F_CTRL_VQ) | \
+	 BIT_ULL(VIRTIO_NET_F_CTRL_RX) | \
+	 BIT_ULL(VIRTIO_NET_F_CTRL_VLAN) | \
+	 BIT_ULL(VIRTIO_NET_F_GUEST_ANNOUNCE) | \
+	 BIT_ULL(VIRTIO_NET_F_MQ) | \
+	 BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) | \
+	 BIT_ULL(VIRTIO_NET_F_RSS))
+
 struct vduse_virtqueue {
 	u16 index;
 	u16 num_max;
@@ -1680,6 +1690,9 @@ static bool features_is_valid(struct vduse_dev_config *config)
 	if ((config->device_id == VIRTIO_ID_BLOCK) &&
 	    (config->features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE)))
 		return false;
+	else if ((config->device_id == VIRTIO_ID_NET) &&
+		 (config->features & VDUSE_NET_INVALID_FEATURES_MASK))
+		return false;
 
 	return true;
 }
-- 
2.43.0
[PATCH v6 1/3] vduse: validate block features only with block devices
This patch is preliminary work to enable network device type support in VDUSE. As VIRTIO_BLK_F_CONFIG_WCE shares the same value as VIRTIO_NET_F_HOST_TSO4, we need to restrict its check to the Virtio-blk device type.

Acked-by: Jason Wang
Reviewed-by: Xie Yongji
Signed-off-by: Maxime Coquelin
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0ddd4b8abecb..0486ff672408 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -1671,13 +1671,14 @@ static bool device_is_allowed(u32 device_id)
 	return false;
 }
 
-static bool features_is_valid(u64 features)
+static bool features_is_valid(struct vduse_dev_config *config)
 {
-	if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
+	if (!(config->features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
 		return false;
 
 	/* Now we only support read-only configuration space */
-	if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
+	if ((config->device_id == VIRTIO_ID_BLOCK) &&
+	    (config->features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE)))
 		return false;
 
 	return true;
@@ -1704,7 +1705,7 @@ static bool vduse_validate_config(struct vduse_dev_config *config)
 	if (!device_is_allowed(config->device_id))
 		return false;
 
-	if (!features_is_valid(config->features))
+	if (!features_is_valid(config))
 		return false;
 
 	return true;
-- 
2.43.0
[PATCH v6 0/3] vduse: add support for networking devices
This small series enables the virtio-net device type in VDUSE. With it, basic operations have been tested, both with virtio-vdpa and vhost-vdpa, using the DPDK Vhost library series adding VDUSE support with the split rings layout (merged in DPDK v23.07-rc1).

Control queue support (and so multiqueue) has also been tested, but requires a kernel series from Jason Wang relaxing control queue polling [1] to function reliably, so until Jason's rework is done, a patch is added to disable CVQ and the features that depend on it (tested also with DPDK v23.07-rc1).

In this v5, the LSM hooks introduced in the previous revision are unified into a single hook that covers the operations below:
 - VDUSE_CREATE_DEV ioctl on VDUSE control file,
 - VDUSE_DESTROY_DEV ioctl on VDUSE control file,
 - open() on VDUSE device file.

In combination with the operations permission, a device type permission has to be associated:
 - block: Virtio block device type,
 - net: Virtio networking device type.

Changes in v6:
==============
- Remove SELinux support from the series, will be handled in a dedicated one.
- Require CAP_NET_ADMIN for Net devices creation (Jason).
- Fail init if control queue features are requested for Net device type (Jason).
- Rebased on latest master.

Changes in v5:
==============
- Move control queue disablement patch before Net devices enablement (Jason).
- Unify operations LSM hooks into a single hook.
- Rebase on latest master.

Maxime Coquelin (3):
  vduse: validate block features only with block devices
  vduse: Temporarily fail if control queue features requested
  vduse: enable Virtio-net device type

 drivers/vdpa/vdpa_user/vduse_dev.c | 32 ++
 1 file changed, 28 insertions(+), 4 deletions(-)
-- 
2.43.0
Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function
Hello.

On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang wrote:
> Thanks for raising this. Actually my understanding of the above discussion
> was mainly about not doing reclaiming by killing enclaves, i.e., I assumed
> "reclaiming" within that context only meant for that particular kind.
>
> As Mikko pointed out, without reclaiming per-cgroup, the max limit of each
> cgroup needs to accommodate the peak usage of enclaves within that cgroup.
> That may be inconvenient for practical usage and limits could be forced to
> be set larger than necessary to run enclaves performantly. For example, we
> can observe the following undesired consequences comparing a system with
> the current kernel loaded with enclaves whose total peak usage is greater
> than the EPC capacity.
>
> 1) If a user wants to load the same exact enclaves but in separate cgroups,
> then the sum of cgroup limits must be higher than the capacity and the
> system will end up doing the same old global reclaiming as it is currently
> doing. Cgroup is not useful at all for isolating EPC consumption.

That is the use of limits to prevent a runaway cgroup smothering the system. Overcommitted values in such a config are fine because the more simultaneous runaways, the less likely. The peak consumption is at the fair expense of others (some efficiency) and the limit contains the runaway (hence the isolation).

> 2) To isolate the impact of usage of each cgroup on other cgroups and yet
> still be able to load each enclave, the user basically has to carefully
> plan to ensure the sum of cgroup max limits, i.e., the sum of peak usage of
> enclaves, is not reaching over the capacity. That means no over-committing
> allowed and the same system may not be able to load as many enclaves as
> with the current kernel.

Another "config layout" of limits is to achieve partitioning (sum == capacity). That is perfect isolation but it naturally goes against efficient utilization.
The way other controllers approach this trade-off is with weights (cpu, io) or protections (memory). I'm afraid misc controller is not ready for this. My opinion is to start with the simple limits (first patches) and think of prioritization/guarantee mechanism based on real cases. HTH, Michal
Re: [PATCH v3 28/34] s390/mm: Define KMSAN metadata for vmalloc and modules
On Thu, Jan 04, 2024 at 11:03:42AM +0100, Alexander Gordeev wrote: > On Tue, Jan 02, 2024 at 04:05:31PM +0100, Heiko Carstens wrote: > Hi Heiko, > ... > > > @@ -253,9 +253,17 @@ static unsigned long setup_kernel_memory_layout(void) > > > MODULES_END = round_down(__abs_lowcore, _SEGMENT_SIZE); > > > MODULES_VADDR = MODULES_END - MODULES_LEN; > > > VMALLOC_END = MODULES_VADDR; > > > +#ifdef CONFIG_KMSAN > > > + VMALLOC_END -= MODULES_LEN * 2; > > > +#endif > > > > > > /* allow vmalloc area to occupy up to about 1/2 of the rest virtual > > > space left */ > > > vmalloc_size = min(vmalloc_size, round_down(VMALLOC_END / 2, > > > _REGION3_SIZE)); > > > +#ifdef CONFIG_KMSAN > > > + /* take 2/3 of vmalloc area for KMSAN shadow and origins */ > > > + vmalloc_size = round_down(vmalloc_size / 3, _REGION3_SIZE); > > > + VMALLOC_END -= vmalloc_size * 2; > > > +#endif > > > > Please use > > > > if (IS_ENABLED(CONFIG_KMSAN)) > > > > above, since this way we get more compile time checks. > > This way we will get a mixture of CONFIG_KASAN and CONFIG_KMSAN > #ifdef vs IS_ENABLED() checks within one function. I guess, we > would rather address it with a separate cleanup? I don't think so, since you can't convert the CONFIG_KASAN ifdef to IS_ENABLED() here: it won't compile. But IS_ENABLED(CONFIG_KMSAN) should work. I highly prefer IS_ENABLED() over ifdef since it allows for better compile time checks, and you won't be surprised by code that doesn't compile if you just change a config option. We've seen that way too often.
Re: [PATCH v5 4/4] vduse: Add LSM hook to check Virtio device type
On 12/18/23 18:33, Stephen Smalley wrote:

On Mon, Dec 18, 2023 at 12:21 PM Stephen Smalley wrote:

On Tue, Dec 12, 2023 at 8:17 AM Maxime Coquelin wrote:

This patch introduces a LSM hook for devices creation, destruction (ioctl()) and opening (open()) operations, checking the application is allowed to perform these operations for the Virtio device type.

Can you explain why the existing LSM hooks and SELinux implementation are not sufficient? We already control the ability to open device nodes via selinux_inode_permission() and selinux_file_open(), and can support fine-grained per-cmd ioctl checking via selinux_file_ioctl(). And it should already be possible to label these nodes distinctly through existing mechanisms (file_contexts if udev-created/labeled, genfs_contexts if kernel-created). What exactly can't you do today that this hook enables?

(added Ondrej to the distribution; IMHO we should swap him into MAINTAINERS in place of Eric Paris since Eric has long-since moved on from SELinux and Ondrej serves in that capacity these days)

Other items to consider:

- If vduse devices are created using anonymous inodes, then SELinux grew a general facility for labeling and controlling the creation of those via selinux_inode_init_security_anon().

- You can encode information about the device into its SELinux type that then allows you to distinguish things like net vs block based on the device's SELinux security context rather than encoding that in the permission bits.

Got it, that seems indeed more appropriate than using permission bits for the device type.

- If you truly need new LSM hooks (which you need to prove first), then you should pass some usable information about the object in question to allow SELinux to find a security context for it. Like an inode associated with the device, for example.

Ok. Thanks for the insights, I'll try and see if I can follow your recommendations in a dedicated series.

Maxime
[PATCH -next v5 2/2] mm: vmscan: add new event to trace shrink lru
From: cuibixuan

Page reclaim is an important part of memory reclaim, including:
 * shrink_active_list(), which moves folios from the active LRU to the inactive LRU
 * shrink_inactive_list(), which shrinks the inactive LRU list

Add the new events to calculate the execution time, to better evaluate overall memory reclaim.

Example of output:

 kswapd0-103 [007] . 1098.353020: mm_vmscan_lru_shrink_active_start: nid=0
 kswapd0-103 [007] . 1098.353040: mm_vmscan_lru_shrink_active_end: nid=0 nr_taken=32 nr_active=0 nr_deactivated=32 nr_referenced=0 priority=6 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
 kswapd0-103 [007] . 1098.353040: mm_vmscan_lru_shrink_inactive_start: nid=0
 kswapd0-103 [007] . 1098.353094: mm_vmscan_lru_shrink_inactive_end: nid=0 nr_scanned=32 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 nr_ref_keep=32 nr_unmap_fail=0 priority=6 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
 kswapd0-103 [007] . 1098.353094: mm_vmscan_lru_shrink_inactive_start: nid=0
 kswapd0-103 [007] . 1098.353162: mm_vmscan_lru_shrink_inactive_end: nid=0 nr_scanned=32 nr_reclaimed=21 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 nr_ref_keep=11 nr_unmap_fail=0 priority=6 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

Signed-off-by: Bixuan Cui
Reviewed-by: Andrew Morton
---
Changes:
v5:
* Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
* Add the explanation for adding new shrink lru events into 'mm: vmscan: add new event to trace shrink lru'
v4:
* Add Reviewed-by and Changelog to every patch.
v3:
* Swap the positions of 'nid' and 'freeable' to prevent the hole in the trace event.
v2:
* Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the same time to fix build error (Andrew pointed out).
include/trace/events/vmscan.h | 31 +-- mm/vmscan.c | 11 --- 2 files changed, 37 insertions(+), 5 deletions(-) diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index b99cd28c9815..4793d952c248 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -395,7 +395,34 @@ TRACE_EVENT(mm_vmscan_write_folio, show_reclaim_flags(__entry->reclaim_flags)) ); -TRACE_EVENT(mm_vmscan_lru_shrink_inactive, +DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template, + + TP_PROTO(int nid), + + TP_ARGS(nid), + + TP_STRUCT__entry( + __field(int, nid) + ), + + TP_fast_assign( + __entry->nid = nid; + ), + + TP_printk("nid=%d", __entry->nid) +); + +DEFINE_EVENT(mm_vmscan_lru_shrink_start_template, mm_vmscan_lru_shrink_inactive_start, + TP_PROTO(int nid), + TP_ARGS(nid) +); + +DEFINE_EVENT(mm_vmscan_lru_shrink_start_template, mm_vmscan_lru_shrink_active_start, + TP_PROTO(int nid), + TP_ARGS(nid) +); + +TRACE_EVENT(mm_vmscan_lru_shrink_inactive_end, TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_reclaimed, @@ -446,7 +473,7 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive, show_reclaim_flags(__entry->reclaim_flags)) ); -TRACE_EVENT(mm_vmscan_lru_shrink_active, +TRACE_EVENT(mm_vmscan_lru_shrink_active_end, TP_PROTO(int nid, unsigned long nr_taken, unsigned long nr_active, unsigned long nr_deactivated, diff --git a/mm/vmscan.c b/mm/vmscan.c index 4e3b835c6b4a..a44d9624d60f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1906,6 +1906,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, struct pglist_data *pgdat = lruvec_pgdat(lruvec); bool stalled = false; + trace_mm_vmscan_lru_shrink_inactive_start(pgdat->node_id); + while (unlikely(too_many_isolated(pgdat, file, sc))) { if (stalled) return 0; @@ -1990,7 +1992,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, if (file) sc->nr.file_taken += nr_taken; - trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id, + 
	trace_mm_vmscan_lru_shrink_inactive_end(pgdat->node_id,
 			nr_scanned, nr_reclaimed, &sc->nr, sc->priority, file);
 	return nr_reclaimed;
 }
@@ -2028,6 +2030,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+	trace_mm_vmscan_lru_shrink_active_start(pgdat->node_id);
+
 	lru_add_drain();
 
 	spin_lock_irq(&lruvec->lru_lock);
@@ -2107,7 +2111,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	lru_note_cost(lruvec, file, 0, nr_rotated);
 	mem_cgroup_uncharge_list(&l_active);
[PATCH -next v5 0/2] Make memory reclamation measurable
When the system memory is low, kswapd reclaims the memory. The key steps of memory reclamation include:

1. shrink_lruvec
   * shrink_active_list, moves folios from the active LRU to the inactive LRU
   * shrink_inactive_list, shrinks the inactive LRU list
2. shrink_slab
   * shrinker->count_objects(), calculates the freeable memory
   * shrinker->scan_objects(), reclaims the slab memory

The existing tracers in the vmscan are as follows:

--do_try_to_free_pages
  --shrink_zones
    --trace_mm_vmscan_node_reclaim_begin (tracer)
    --shrink_node
      --shrink_node_memcgs
        --trace_mm_vmscan_memcg_shrink_begin (tracer)
        --shrink_lruvec
          --shrink_list
            --shrink_active_list
              --trace_mm_vmscan_lru_shrink_active (tracer)
            --shrink_inactive_list
              --trace_mm_vmscan_lru_shrink_inactive (tracer)
          --shrink_active_list
        --shrink_slab
          --do_shrink_slab
            --shrinker->count_objects()
            --trace_mm_shrink_slab_start (tracer)
            --shrinker->scan_objects()
            --trace_mm_shrink_slab_end (tracer)
        --trace_mm_vmscan_memcg_shrink_end (tracer)
    --trace_mm_vmscan_node_reclaim_end (tracer)

If we get the duration and quantity of shrink lru and slab, then we can measure memory reclamation, as follows.

Measuring memory reclamation with bpf:

  LRU FILE:
  CPU  COMM     ShrinkActive(us)  ShrinkInactive(us)  Reclaim(page)
  7    kswapd0  26                51                  32
  7    kswapd0  52                47                  13

  SLAB:
  CPU  COMM     OBJ_NAME                 Count_Dur(us)  Freeable(page)  Scan_Dur(us)  Reclaim(page)
  1    kswapd0  super_cache_scan.cfi_jt  2              341             3225          128
  7    kswapd0  super_cache_scan.cfi_jt  0              2247            8524          1024
  7    kswapd0  super_cache_scan.cfi_jt  2              3670            0             0

For this, add the new tracers to shrink_active_list/shrink_inactive_list and shrinker->count_objects().

Changes:
v5:
* Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
* Add the explanation for adding new shrink lru events into 'mm: vmscan: add new event to trace shrink lru'
v4:
* Add Reviewed-by and Changelog to every patch.
v3:
* Swap the positions of 'nid' and 'freeable' to prevent the hole in the trace event.
v2:
* Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the same time to fix build error.

cuibixuan (2):
  mm: shrinker: add new event to trace shrink count
  mm: vmscan: add new event to trace shrink lru

 include/trace/events/vmscan.h | 80 ++-
 mm/shrinker.c                 |  4 ++
 mm/vmscan.c                   | 11 +++--
 3 files changed, 90 insertions(+), 5 deletions(-)

-- 
2.17.1
[PATCH -next v5 1/2] mm: shrinker: add new event to trace shrink count
From: cuibixuan

do_shrink_slab() calculates the freeable memory through shrinker->count_objects(), and then reclaims the memory through shrinker->scan_objects(). When reclaiming memory, shrinker->count_objects() takes a certain amount of time:

  Fun                spend(us)
  ext4_es_count      4302
  ext4_es_scan       12
  super_cache_count  4195
  super_cache_scan   2103

Therefore, adding the trace event to count_objects() can more accurately obtain the time taken for slab memory reclaim.

Example of output:

 kswapd0-103 [003] . 1098.317942: mm_shrink_count_start: kfree_rcu_shrink_count.cfi_jt+0x0/0x8 c540ff51: nid: 0
 kswapd0-103 [003] . 1098.317951: mm_shrink_count_end: kfree_rcu_shrink_count.cfi_jt+0x0/0x8 c540ff51: nid: 0 freeable:36

Signed-off-by: Bixuan Cui
Reviewed-by: Steven Rostedt
---
Changes:
v5:
* Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
* Add the explanation for adding new shrink lru events into 'mm: vmscan: add new event to trace shrink lru'
v4:
* Add Reviewed-by and Changelog to every patch.
v3:
* Swap the positions of 'nid' and 'freeable' to prevent the hole in the trace event.
v2:
* Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the same time to fix build error (Andrew pointed out).
include/trace/events/vmscan.h | 49 +++ mm/shrinker.c | 4 +++ 2 files changed, 53 insertions(+) diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index 1a488c30afa5..b99cd28c9815 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -196,6 +196,55 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re ); #endif /* CONFIG_MEMCG */ +TRACE_EVENT(mm_shrink_count_start, + TP_PROTO(struct shrinker *shr, struct shrink_control *sc), + + TP_ARGS(shr, sc), + + TP_STRUCT__entry( + __field(struct shrinker *, shr) + __field(void *, shrink) + __field(int, nid) + ), + + TP_fast_assign( + __entry->shr = shr; + __entry->shrink = shr->count_objects; + __entry->nid = sc->nid; + ), + + TP_printk("%pS %p: nid: %d", + __entry->shrink, + __entry->shr, + __entry->nid) +); + +TRACE_EVENT(mm_shrink_count_end, + TP_PROTO(struct shrinker *shr, struct shrink_control *sc, long freeable), + + TP_ARGS(shr, sc, freeable), + + TP_STRUCT__entry( + __field(struct shrinker *, shr) + __field(void *, shrink) + __field(long, freeable) + __field(int, nid) + ), + + TP_fast_assign( + __entry->shr = shr; + __entry->shrink = shr->count_objects; + __entry->freeable = freeable; + __entry->nid = sc->nid; + ), + + TP_printk("%pS %p: nid: %d freeable:%ld", + __entry->shrink, + __entry->shr, + __entry->nid, + __entry->freeable) +); + TRACE_EVENT(mm_shrink_slab_start, TP_PROTO(struct shrinker *shr, struct shrink_control *sc, long nr_objects_to_shrink, unsigned long cache_items, diff --git a/mm/shrinker.c b/mm/shrinker.c index dd91eab43ed3..d0c7bf61db61 100644 --- a/mm/shrinker.c +++ b/mm/shrinker.c @@ -379,7 +379,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, : SHRINK_BATCH; long scanned = 0, next_deferred; + trace_mm_shrink_count_start(shrinker, shrinkctl); + freeable = shrinker->count_objects(shrinker, shrinkctl); + + trace_mm_shrink_count_end(shrinker, shrinkctl, freeable); if (freeable == 0 || freeable 
== SHRINK_EMPTY) return freeable; -- 2.17.1
Re: [PATCH v3 28/34] s390/mm: Define KMSAN metadata for vmalloc and modules
On Tue, Jan 02, 2024 at 04:05:31PM +0100, Heiko Carstens wrote: Hi Heiko, ... > > @@ -253,9 +253,17 @@ static unsigned long setup_kernel_memory_layout(void) > > MODULES_END = round_down(__abs_lowcore, _SEGMENT_SIZE); > > MODULES_VADDR = MODULES_END - MODULES_LEN; > > VMALLOC_END = MODULES_VADDR; > > +#ifdef CONFIG_KMSAN > > + VMALLOC_END -= MODULES_LEN * 2; > > +#endif > > > > /* allow vmalloc area to occupy up to about 1/2 of the rest virtual > > space left */ > > vmalloc_size = min(vmalloc_size, round_down(VMALLOC_END / 2, > > _REGION3_SIZE)); > > +#ifdef CONFIG_KMSAN > > + /* take 2/3 of vmalloc area for KMSAN shadow and origins */ > > + vmalloc_size = round_down(vmalloc_size / 3, _REGION3_SIZE); > > + VMALLOC_END -= vmalloc_size * 2; > > +#endif > > Please use > > if (IS_ENABLED(CONFIG_KMSAN)) > > above, since this way we get more compile time checks. This way we will get a mixture of CONFIG_KASAN and CONFIG_KMSAN #ifdef vs IS_ENABLED() checks within one function. I guess, we would rather address it with a separate cleanup? Thanks!
Re: [RFC PATCH 3/5] dt-bindings: regulator: add documentation entry for 88pm88x-regulator
On 04/01/2024 10:25, Krzysztof Kozlowski wrote: > On 28/12/2023 10:39, Karel Balej wrote: >> From: Karel Balej >> > > A nit, subject: drop second/last, redundant "documentation entry". The > "dt-bindings" prefix is already stating that these are bindings. > >> The Marvell 88PM88X PMICs provide regulators among other things. >> Document how to use them. > > > You did not document them in the bindings. You just added example to > make it complete. Where is the binding change? > > What's more, I have doubts that you tested it. I found your other patchset - now I am 100% sure you did not test it. Best regards, Krzysztof
Re: [RFC PATCH 3/5] dt-bindings: regulator: add documentation entry for 88pm88x-regulator
On 28/12/2023 10:39, Karel Balej wrote: > From: Karel Balej > A nit, subject: drop second/last, redundant "documentation entry". The "dt-bindings" prefix is already stating that these are bindings. > The Marvell 88PM88X PMICs provide regulators among other things. > Document how to use them. You did not document them in the bindings. You just added example to make it complete. Where is the binding change? What's more, I have doubts that you tested it. > > Signed-off-by: Karel Balej > --- > .../bindings/mfd/marvell,88pm88x.yaml | 17 +++ > .../regulator/marvell,88pm88x-regulator.yaml | 28 +++ > 2 files changed, 45 insertions(+) > create mode 100644 > Documentation/devicetree/bindings/regulator/marvell,88pm88x-regulator.yaml > > diff --git a/Documentation/devicetree/bindings/mfd/marvell,88pm88x.yaml > b/Documentation/devicetree/bindings/mfd/marvell,88pm88x.yaml > index 115b41c9f22c..e6944369fc5c 100644 > --- a/Documentation/devicetree/bindings/mfd/marvell,88pm88x.yaml > +++ b/Documentation/devicetree/bindings/mfd/marvell,88pm88x.yaml > @@ -54,6 +54,23 @@ examples: > onkey { >compatible = "marvell,88pm88x-onkey"; > }; > + > +regulators { > + ldo2: ldo2 { > +regulator-min-microvolt = <310>; > +regulator-max-microvolt = <330>; > +}; > + > + ldo15: ldo15 { > +regulator-min-microvolt = <330>; > +regulator-max-microvolt = <330>; > +}; > + > + buck2: buck2 { > +regulator-min-microvolt = <180>; > +regulator-max-microvolt = <180>; > +}; > +}; >}; > }; > ... 
> diff --git > a/Documentation/devicetree/bindings/regulator/marvell,88pm88x-regulator.yaml > b/Documentation/devicetree/bindings/regulator/marvell,88pm88x-regulator.yaml > new file mode 100644 > index ..c6ac17b113e7 > --- /dev/null > +++ > b/Documentation/devicetree/bindings/regulator/marvell,88pm88x-regulator.yaml > @@ -0,0 +1,28 @@ > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause > +%YAML 1.2 > +--- > +$id: http://devicetree.org/schemas/regulator/marvell,88pm88x-regulator.yaml# > +$schema: http://devicetree.org/meta-schemas/core.yaml# > + > +title: Marvell 88PM88X PMICs regulators > + > +maintainers: > + - Karel Balej > + > +description: | > + This module is part of the Marvell 88PM88X MFD device. For more details > + see Documentation/devicetree/bindings/mfd/marvell,88pm88x.yaml. > + > + The regulator controller is represented as a sub-node of the PMIC node > + in the device tree. > + > + The valid names for 88PM886 regulator nodes are ldo[1-9], ldo1[0-6], > buck[1-5]. Don't repeat constraints in free form text. > + > +patternProperties: > + "^(ldo|buck)[0-9]+$": You need to fix the pattern to be narrow. buck0 and buck6 are not correct. > +type: object > +description: > + Properties for single regulator. > +$ref: regulator.yaml# Not many benefits of this being in its own schema. > + > +additionalProperties: false Best regards, Krzysztof
Re: [PATCH V1] net: qrtr: ns: Ignore ENODEV failures in ns
On Tue, Dec 26, 2023 at 04:20:03PM -0800, Chris Lew wrote:
>
> On 12/23/2023 5:56 AM, Simon Horman wrote:
> > [Dropped bjorn.anders...@kernel.org, as the correct address seems
> > to be anders...@kernel.org, which is already in the CC list.
> > kernel.org rejected sending this email without that update.]
> >
> > On Thu, Dec 21, 2023 at 03:36:50PM +0530, Sarannya S wrote:
> > > From: Chris Lew
> > >
> > > Ignore the ENODEV failures returned by kernel_sendmsg(). These errors
> > > indicate that either the local port has been closed or the remote has
> > > gone down. Neither of these scenarios is fatal and will eventually be
> > > handled through packets that are later queued on the control port.
> > >
> > > Signed-off-by: Chris Lew
> > > Signed-off-by: Sarannya Sasikumar
> > > ---
> > >  net/qrtr/ns.c | 11 +++
> > >  1 file changed, 7 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/net/qrtr/ns.c b/net/qrtr/ns.c
> > > index abb0c70..8234339 100644
> > > --- a/net/qrtr/ns.c
> > > +++ b/net/qrtr/ns.c
> > > @@ -157,7 +157,7 @@ static int service_announce_del(struct sockaddr_qrtr *dest,
> > >  	msg.msg_namelen = sizeof(*dest);
> > >  	ret = kernel_sendmsg(qrtr_ns.sock, &msg, &iv, 1, sizeof(pkt));
> > > -	if (ret < 0)
> > > +	if (ret < 0 && ret != -ENODEV)
> > >  		pr_err("failed to announce del service\n");
> > >  	return ret;
> >
> > Hi,
> >
> > The caller of service_announce_del() ignores its return value.
> > So the only action on error is the pr_err() call above, and so
> > with this patch -ENODEV is indeed ignored.
> >
> > However, I wonder if it would make things clearer to the reader (me?)
> > if the return type of service_announce_del was updated to void. Because
> > as things stand -ENODEV may be returned, which implies something might
> > handle that, even though it does not.
> >
> > The above notwithstanding, this change looks good to me.
> >
> > Reviewed-by: Simon Horman
> >
> > ...
>
> Hi Simon, thanks for the review and suggestion.
> We weren't sure whether we should change the function prototype in these
> patches on the chance that there will be something that listens and handles
> this in the future. I think it's a good idea to change it to void and we
> can change it back if there is such a use case in the future.

Hi Chris, yes, I think that would be a good approach.