Re: [PATCH v2 1/6] dt-bindings: iio: light: stk33xx: add vdd and leda regulators
On 24/04/2024 00:33, Aren Moynihan wrote:
> stk3310 and stk3311 are typically connected to power supplies for the
> chip (vdd) and the infrared LED (leda). Add properties so we can power
> these up / down appropriately.
>
> Signed-off-by: Aren Moynihan
> ---

Reviewed-by: Krzysztof Kozlowski

Best regards,
Krzysztof
Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection
On Wed, Apr 24, 2024 at 11:15 AM Cindy Lu wrote: > > On Wed, Apr 24, 2024 at 8:44 AM Jason Wang wrote: > > > > On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin wrote: > > > > > > On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote: > > > > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin > > > > wrote: > > > > > > > > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote: > > > > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin > > > > > > wrote: > > > > > > > > > > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote: > > > > > > > > Add the function vduse_alloc_reconnnect_info_mem > > > > > > > > and vduse_alloc_reconnnect_info_mem > > > > > > > > These functions allow vduse to allocate and free memory for > > > > > > > > reconnection > > > > > > > > information. The amount of memory allocated is vq_num pages. > > > > > > > > Each VQS will map its own page where the reconnection > > > > > > > > information will be saved > > > > > > > > > > > > > > > > Signed-off-by: Cindy Lu > > > > > > > > --- > > > > > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 40 > > > > > > > > ++ > > > > > > > > 1 file changed, 40 insertions(+) > > > > > > > > > > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > > index ef3c9681941e..2da659d5f4a8 100644 > > > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue { > > > > > > > > int irq_effective_cpu; > > > > > > > > struct cpumask irq_affinity; > > > > > > > > struct kobject kobj; > > > > > > > > + unsigned long vdpa_reconnect_vaddr; > > > > > > > > }; > > > > > > > > > > > > > > > > struct vduse_dev; > > > > > > > > @@ -1105,6 +1106,38 @@ static void > > > > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq) > > > > > > > > > > > > > > > > vq->irq_effective_cpu = curr_cpu; > > > > > > > > } > > > > > 
> > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev > > > > > > > > *dev) > > > > > > > > +{ > > > > > > > > + unsigned long vaddr = 0; > > > > > > > > + struct vduse_virtqueue *vq; > > > > > > > > + > > > > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > > > > + /*page 0~ vq_num save the reconnect info for vq*/ > > > > > > > > + vq = dev->vqs[i]; > > > > > > > > + vaddr = get_zeroed_page(GFP_KERNEL); > > > > > > > > > > > > > > > > > > > > > I don't get why you insist on stealing kernel memory for something > > > > > > > that is just used by userspace to store data for its own use. > > > > > > > Userspace does not lack ways to persist data, for example, > > > > > > > create a regular file anywhere in the filesystem. > > > > > > > > > > > > Good point. So the motivation here is to: > > > > > > > > > > > > 1) be self contained, no dependency for high speed persist data > > > > > > storage like tmpfs > > > > > > > > > > No idea what this means. > > > > > > > > I mean a regular file may slow down the datapath performance, so > > > > usually the application will try to use tmpfs and other which is a > > > > dependency for implementing the reconnection. > > > > > > Are we worried about systems without tmpfs now? > > > > Yes. > > > > > > > > > > > > > > > > > > > 2) standardize the format in uAPI which allows reconnection from > > > > > > arbitrary userspace, unfortunately, such effort was removed in new > > > > > > versions > > > > > > > > > > And I don't see why that has to live in the kernel tree either. > > > > > > > > I can't find a better place, any idea? > > > > > > > > Thanks > > > > > > > > > Well anywhere on github really. with libvhost-user maybe? > > > It's harmless enough in Documentation > > > if you like but ties you to the kernel release cycle in a way that > > > is completely unnecessary. > > > > Ok. > > > > Thanks > > > Sure, got it. Do we need to withdraw all the series? 
maybe we can keep > the patch for support ioctl to get status and configure? We can leave the mmap part for future, the rest is still useful unless I miss anything. Thanks > > thanks > cindy > > > > > > > > > > > > > > > If the above doesn't make sense, we don't need to offer those pages > > > > > > by VDUSE. > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + if (vaddr == 0) > > > > > > > > + return -ENOMEM; > > > > > > > > + > > > > > > > > + vq->vdpa_reconnect_vaddr = vaddr; > > > > > > > > + } > > > > > > > > + > > > > > > > > + return 0; > > > > > > > > +} > > > > > > > > + > > > > > > > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev > > > > > > > > *dev) > > > > > > > > +{ > > > > > > > > + struct vduse_virtqueue *vq; > > > > > > > > + > > > > > > > > + for (int i = 0; i <
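As an aside on the allocation loop quoted above: as posted, vduse_alloc_reconnnect_info_mem() returns -ENOMEM as soon as one page fails, leaving the caller to release any pages already allocated. A user-space model of the same pattern with explicit unwinding is sketched below; it is illustrative only (calloc stands in for get_zeroed_page, a fixed global array stands in for dev->vqs), not the kernel code.

```c
/*
 * User-space model (not the kernel patch) of the per-vq page allocation
 * above, with explicit unwinding: if page i fails to allocate, pages
 * 0..i-1 are released before returning, so the caller never sees a
 * half-populated array.
 */
#include <stdlib.h>

#define VQ_NUM 4  /* stand-in for dev->vq_num */

static void *vq_pages[VQ_NUM]; /* stand-in for vq->vdpa_reconnect_vaddr */

static void free_reconnect_pages(void)
{
	for (int i = 0; i < VQ_NUM; i++) {
		free(vq_pages[i]); /* free(NULL) is a no-op, like the NULL check */
		vq_pages[i] = NULL;
	}
}

static int alloc_reconnect_pages(size_t page_size)
{
	for (int i = 0; i < VQ_NUM; i++) {
		/* calloc models get_zeroed_page(GFP_KERNEL): zeroed memory */
		vq_pages[i] = calloc(1, page_size);
		if (!vq_pages[i]) {
			free_reconnect_pages(); /* unwind pages 0..i-1 */
			return -1;
		}
	}
	return 0;
}
```

The kernel version gets away without in-loop unwinding only because vduse_free_reconnnect_info_mem() checks each vaddr before freeing; the sketch makes the cleanup contract explicit.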
Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection
On Wed, Apr 24, 2024 at 8:44 AM Jason Wang wrote: > > On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin wrote: > > > > On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote: > > > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin > > > wrote: > > > > > > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote: > > > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin > > > > > wrote: > > > > > > > > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote: > > > > > > > Add the function vduse_alloc_reconnnect_info_mem > > > > > > > and vduse_alloc_reconnnect_info_mem > > > > > > > These functions allow vduse to allocate and free memory for > > > > > > > reconnection > > > > > > > information. The amount of memory allocated is vq_num pages. > > > > > > > Each VQS will map its own page where the reconnection information > > > > > > > will be saved > > > > > > > > > > > > > > Signed-off-by: Cindy Lu > > > > > > > --- > > > > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 40 > > > > > > > ++ > > > > > > > 1 file changed, 40 insertions(+) > > > > > > > > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > index ef3c9681941e..2da659d5f4a8 100644 > > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue { > > > > > > > int irq_effective_cpu; > > > > > > > struct cpumask irq_affinity; > > > > > > > struct kobject kobj; > > > > > > > + unsigned long vdpa_reconnect_vaddr; > > > > > > > }; > > > > > > > > > > > > > > struct vduse_dev; > > > > > > > @@ -1105,6 +1106,38 @@ static void > > > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq) > > > > > > > > > > > > > > vq->irq_effective_cpu = curr_cpu; > > > > > > > } > > > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev) > > > > > > > +{ > > > > > > > + unsigned long vaddr = 0; > > > > > > 
> + struct vduse_virtqueue *vq; > > > > > > > + > > > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > > > + /*page 0~ vq_num save the reconnect info for vq*/ > > > > > > > + vq = dev->vqs[i]; > > > > > > > + vaddr = get_zeroed_page(GFP_KERNEL); > > > > > > > > > > > > > > > > > > I don't get why you insist on stealing kernel memory for something > > > > > > that is just used by userspace to store data for its own use. > > > > > > Userspace does not lack ways to persist data, for example, > > > > > > create a regular file anywhere in the filesystem. > > > > > > > > > > Good point. So the motivation here is to: > > > > > > > > > > 1) be self contained, no dependency for high speed persist data > > > > > storage like tmpfs > > > > > > > > No idea what this means. > > > > > > I mean a regular file may slow down the datapath performance, so > > > usually the application will try to use tmpfs and other which is a > > > dependency for implementing the reconnection. > > > > Are we worried about systems without tmpfs now? > > Yes. > > > > > > > > > > > > > > 2) standardize the format in uAPI which allows reconnection from > > > > > arbitrary userspace, unfortunately, such effort was removed in new > > > > > versions > > > > > > > > And I don't see why that has to live in the kernel tree either. > > > > > > I can't find a better place, any idea? > > > > > > Thanks > > > > > > Well anywhere on github really. with libvhost-user maybe? > > It's harmless enough in Documentation > > if you like but ties you to the kernel release cycle in a way that > > is completely unnecessary. > > Ok. > > Thanks > Sure, got it. Do we need to withdraw all the series? maybe we can keep the patch for support ioctl to get status and configure? thanks cindy > > > > > > > > > > > If the above doesn't make sense, we don't need to offer those pages > > > > > by VDUSE. 
> > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + if (vaddr == 0) > > > > > > > + return -ENOMEM; > > > > > > > + > > > > > > > + vq->vdpa_reconnect_vaddr = vaddr; > > > > > > > + } > > > > > > > + > > > > > > > + return 0; > > > > > > > +} > > > > > > > + > > > > > > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev) > > > > > > > +{ > > > > > > > + struct vduse_virtqueue *vq; > > > > > > > + > > > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > > > + vq = dev->vqs[i]; > > > > > > > + > > > > > > > + if (vq->vdpa_reconnect_vaddr) > > > > > > > + free_page(vq->vdpa_reconnect_vaddr); > > > > > > > + vq->vdpa_reconnect_vaddr = 0; > > > > > > > + } > > > > > > > + > > > > > > > + return 0; > > > > > > > +} > > > > > > > > > > > > > > static long vduse_dev_ioctl(struct file *file, unsigned
[PATCH v3 3/4] remoteproc: mediatek: Support setting DRAM and IPI shared buffer sizes
The SCP on different chips will require different DRAM sizes and IPI shared buffer sizes based on varying requirements. Signed-off-by: Olivia Wen --- drivers/remoteproc/mtk_common.h | 11 -- drivers/remoteproc/mtk_scp.c | 84 +++- drivers/remoteproc/mtk_scp_ipi.c | 7 +++- 3 files changed, 79 insertions(+), 23 deletions(-) diff --git a/drivers/remoteproc/mtk_common.h b/drivers/remoteproc/mtk_common.h index 6d7736a..fd5c539 100644 --- a/drivers/remoteproc/mtk_common.h +++ b/drivers/remoteproc/mtk_common.h @@ -78,7 +78,6 @@ #define MT8195_L2TCM_OFFSET0x850d0 #define SCP_FW_VER_LEN 32 -#define SCP_SHARE_BUFFER_SIZE 288 struct scp_run { u32 signaled; @@ -97,6 +96,11 @@ struct scp_ipi_desc { struct mtk_scp; +struct mtk_scp_sizes_data { + size_t max_dram_size; + size_t ipi_share_buffer_size; +}; + struct mtk_scp_of_data { int (*scp_clk_get)(struct mtk_scp *scp); int (*scp_before_load)(struct mtk_scp *scp); @@ -110,6 +114,7 @@ struct mtk_scp_of_data { u32 host_to_scp_int_bit; size_t ipi_buf_offset; + const struct mtk_scp_sizes_data *scp_sizes; }; struct mtk_scp_of_cluster { @@ -141,10 +146,10 @@ struct mtk_scp { struct scp_ipi_desc ipi_desc[SCP_IPI_MAX]; bool ipi_id_ack[SCP_IPI_MAX]; wait_queue_head_t ack_wq; + u8 *share_buf; void *cpu_addr; dma_addr_t dma_addr; - size_t dram_size; struct rproc_subdev *rpmsg_subdev; @@ -162,7 +167,7 @@ struct mtk_scp { struct mtk_share_obj { u32 id; u32 len; - u8 share_buf[SCP_SHARE_BUFFER_SIZE]; + u8 *share_buf; }; void scp_memcpy_aligned(void __iomem *dst, const void *src, unsigned int len); diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c index 6295148..e281d28 100644 --- a/drivers/remoteproc/mtk_scp.c +++ b/drivers/remoteproc/mtk_scp.c @@ -20,7 +20,6 @@ #include "mtk_common.h" #include "remoteproc_internal.h" -#define MAX_CODE_SIZE 0x50 #define SECTION_NAME_IPI_BUFFER ".ipi_buffer" /** @@ -94,14 +93,15 @@ static void scp_ipi_handler(struct mtk_scp *scp) { struct mtk_share_obj __iomem *rcv_obj = scp->recv_buf; 
struct scp_ipi_desc *ipi_desc = scp->ipi_desc; - u8 tmp_data[SCP_SHARE_BUFFER_SIZE]; scp_ipi_handler_t handler; u32 id = readl(_obj->id); u32 len = readl(_obj->len); + const struct mtk_scp_sizes_data *scp_sizes; - if (len > SCP_SHARE_BUFFER_SIZE) { - dev_err(scp->dev, "ipi message too long (len %d, max %d)", len, - SCP_SHARE_BUFFER_SIZE); + scp_sizes = scp->data->scp_sizes; + if (len > scp_sizes->ipi_share_buffer_size) { + dev_err(scp->dev, "ipi message too long (len %d, max %zd)", len, + scp_sizes->ipi_share_buffer_size); return; } if (id >= SCP_IPI_MAX) { @@ -117,8 +117,9 @@ static void scp_ipi_handler(struct mtk_scp *scp) return; } - memcpy_fromio(tmp_data, _obj->share_buf, len); - handler(tmp_data, len, ipi_desc[id].priv); + memset(scp->share_buf, 0, scp_sizes->ipi_share_buffer_size); + memcpy_fromio(scp->share_buf, _obj->share_buf, len); + handler(scp->share_buf, len, ipi_desc[id].priv); scp_ipi_unlock(scp, id); scp->ipi_id_ack[id] = true; @@ -133,6 +134,8 @@ static int scp_ipi_init(struct mtk_scp *scp, const struct firmware *fw) { int ret; size_t buf_sz, offset; + size_t share_buf_offset; + const struct mtk_scp_sizes_data *scp_sizes; /* read the ipi buf addr from FW itself first */ ret = scp_elf_read_ipi_buf_addr(scp, fw, ); @@ -152,12 +155,15 @@ static int scp_ipi_init(struct mtk_scp *scp, const struct firmware *fw) return -EOVERFLOW; } + scp_sizes = scp->data->scp_sizes; scp->recv_buf = (struct mtk_share_obj __iomem *) (scp->sram_base + offset); + share_buf_offset = sizeof(scp->recv_buf->id) + + sizeof(scp->recv_buf->len) + scp_sizes->ipi_share_buffer_size; scp->send_buf = (struct mtk_share_obj __iomem *) - (scp->sram_base + offset + sizeof(*scp->recv_buf)); - memset_io(scp->recv_buf, 0, sizeof(*scp->recv_buf)); - memset_io(scp->send_buf, 0, sizeof(*scp->send_buf)); + (scp->sram_base + offset + share_buf_offset); + memset_io(scp->recv_buf, 0, share_buf_offset); + memset_io(scp->send_buf, 0, share_buf_offset); return 0; } @@ -741,14 +747,16 @@ static int 
scp_start(struct rproc *rproc) static void *mt8183_scp_da_to_va(struct mtk_scp *scp, u64 da, size_t len) { int offset; + const struct mtk_scp_sizes_data *scp_sizes; + scp_sizes = scp->data->scp_sizes; if (da <
[PATCH v3 0/4] Support MT8188 SCP core 1
Change in v3:
- split the modifications in version 2 into three separate commits

Change in v2:
- change the order of mt8188-scp-dual
- modify the struct mtk_scp_of_data on MT8188
- add MT8188-specific functions
- add structure mtk_scp_sizes_data to manage sizes
- change tmp_data allocation to share_buf

Olivia Wen (4):
  dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP
  remoteproc: mediatek: Support MT8188 SCP core 1
  remoteproc: mediatek: Support setting DRAM and IPI shared buffer sizes
  media: mediatek: imgsys: Support image processing

 .../devicetree/bindings/remoteproc/mtk,scp.yaml |   2 +
 drivers/remoteproc/mtk_common.h                 |  11 +-
 drivers/remoteproc/mtk_scp.c                    | 230 +++--
 drivers/remoteproc/mtk_scp_ipi.c                |   7 +-
 include/linux/remoteproc/mtk_scp.h              |   1 +
 5 files changed, 225 insertions(+), 26 deletions(-)

-- 
2.6.4
[PATCH v3 2/4] remoteproc: mediatek: Support MT8188 SCP core 1
MT8188 SCP has two RISC-V cores which is similar to MT8195 but without L1TCM. We've added MT8188-specific functions to configure L1TCM in multicore setups. Signed-off-by: Olivia Wen --- drivers/remoteproc/mtk_scp.c | 146 ++- 1 file changed, 143 insertions(+), 3 deletions(-) diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c index 6751829..6295148 100644 --- a/drivers/remoteproc/mtk_scp.c +++ b/drivers/remoteproc/mtk_scp.c @@ -471,6 +471,86 @@ static int mt8186_scp_before_load(struct mtk_scp *scp) return 0; } +static int mt8188_scp_l2tcm_on(struct mtk_scp *scp) +{ + struct mtk_scp_of_cluster *scp_cluster = scp->cluster; + + mutex_lock(_cluster->cluster_lock); + + if (scp_cluster->l2tcm_refcnt == 0) { + /* clear SPM interrupt, SCP2SPM_IPC_CLR */ + writel(0xff, scp->cluster->reg_base + MT8192_SCP2SPM_IPC_CLR); + + /* Power on L2TCM */ + scp_sram_power_on(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_0, 0); + scp_sram_power_on(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_1, 0); + scp_sram_power_on(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_2, 0); + scp_sram_power_on(scp->cluster->reg_base + MT8192_L1TCM_SRAM_PDN, 0); + } + + scp_cluster->l2tcm_refcnt += 1; + + mutex_unlock(_cluster->cluster_lock); + + return 0; +} + +static int mt8188_scp_before_load(struct mtk_scp *scp) +{ + writel(1, scp->cluster->reg_base + MT8192_CORE0_SW_RSTN_SET); + + mt8188_scp_l2tcm_on(scp); + + scp_sram_power_on(scp->cluster->reg_base + MT8192_CPU0_SRAM_PD, 0); + + /* enable MPU for all memory regions */ + writel(0xff, scp->cluster->reg_base + MT8192_CORE0_MEM_ATT_PREDEF); + + return 0; +} + +static int mt8188_scp_c1_before_load(struct mtk_scp *scp) +{ + u32 sec_ctrl; + struct mtk_scp *scp_c0; + struct mtk_scp_of_cluster *scp_cluster = scp->cluster; + + scp->data->scp_reset_assert(scp); + + mt8188_scp_l2tcm_on(scp); + + scp_sram_power_on(scp->cluster->reg_base + MT8195_CPU1_SRAM_PD, 0); + + /* enable MPU for all memory regions */ + writel(0xff, scp->cluster->reg_base 
+ MT8195_CORE1_MEM_ATT_PREDEF); + + /* +* The L2TCM_OFFSET_RANGE and L2TCM_OFFSET shift the destination address +* on SRAM when SCP core 1 accesses SRAM. +* +* This configuration solves booting the SCP core 0 and core 1 from +* different SRAM address because core 0 and core 1 both boot from +* the head of SRAM by default. this must be configured before boot SCP core 1. +* +* The value of L2TCM_OFFSET_RANGE is from the viewpoint of SCP core 1. +* When SCP core 1 issues address within the range (L2TCM_OFFSET_RANGE), +* the address will be added with a fixed offset (L2TCM_OFFSET) on the bus. +* The shift action is tranparent to software. +*/ + writel(0, scp->cluster->reg_base + MT8195_L2TCM_OFFSET_RANGE_0_LOW); + writel(scp->sram_size, scp->cluster->reg_base + MT8195_L2TCM_OFFSET_RANGE_0_HIGH); + + scp_c0 = list_first_entry(_cluster->mtk_scp_list, struct mtk_scp, elem); + writel(scp->sram_phys - scp_c0->sram_phys, scp->cluster->reg_base + MT8195_L2TCM_OFFSET); + + /* enable SRAM offset when fetching instruction and data */ + sec_ctrl = readl(scp->cluster->reg_base + MT8195_SEC_CTRL); + sec_ctrl |= MT8195_CORE_OFFSET_ENABLE_I | MT8195_CORE_OFFSET_ENABLE_D; + writel(sec_ctrl, scp->cluster->reg_base + MT8195_SEC_CTRL); + + return 0; +} + static int mt8192_scp_before_load(struct mtk_scp *scp) { /* clear SPM interrupt, SCP2SPM_IPC_CLR */ @@ -717,6 +797,47 @@ static void mt8183_scp_stop(struct mtk_scp *scp) writel(0, scp->cluster->reg_base + MT8183_WDT_CFG); } +static void mt8188_scp_l2tcm_off(struct mtk_scp *scp) +{ + struct mtk_scp_of_cluster *scp_cluster = scp->cluster; + + mutex_lock(_cluster->cluster_lock); + + if (scp_cluster->l2tcm_refcnt > 0) + scp_cluster->l2tcm_refcnt -= 1; + + if (scp_cluster->l2tcm_refcnt == 0) { + /* Power off L2TCM */ + scp_sram_power_off(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_0, 0); + scp_sram_power_off(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_1, 0); + scp_sram_power_off(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_2, 0); + 
scp_sram_power_off(scp->cluster->reg_base + MT8192_L1TCM_SRAM_PDN, 0); + } + + mutex_unlock(_cluster->cluster_lock); +} + +static void mt8188_scp_stop(struct mtk_scp *scp) +{ + mt8188_scp_l2tcm_off(scp); + + scp_sram_power_off(scp->cluster->reg_base + MT8192_CPU0_SRAM_PD, 0); + + /* Disable SCP watchdog */ + writel(0, scp->cluster->reg_base + MT8192_CORE0_WDT_CFG); +} + +static void
[PATCH v3 4/4] media: mediatek: imgsys: Support image processing
Integrate the imgsys core architecture driver for image processing
on the MT8188 platform.

Signed-off-by: Olivia Wen
---
 include/linux/remoteproc/mtk_scp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/remoteproc/mtk_scp.h b/include/linux/remoteproc/mtk_scp.h
index 7c2b7cc9..344ff41 100644
--- a/include/linux/remoteproc/mtk_scp.h
+++ b/include/linux/remoteproc/mtk_scp.h
@@ -43,6 +43,7 @@ enum scp_ipi_id {
 	SCP_IPI_CROS_HOST_CMD,
 	SCP_IPI_VDEC_LAT,
 	SCP_IPI_VDEC_CORE,
+	SCP_IPI_IMGSYS_CMD,
 	SCP_IPI_NS_SERVICE = 0xFF,
 	SCP_IPI_MAX = 0x100,
 };
-- 
2.6.4
[PATCH v3 1/4] dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP
Under different applications, the MT8188 SCP can be used as single-core
or dual-core.

Signed-off-by: Olivia Wen
Acked-by: Krzysztof Kozlowski
Reviewed-by: AngeloGioacchino Del Regno
---
 Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
index 507f98f..c5dc3c2 100644
--- a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
@@ -19,6 +19,7 @@ properties:
       - mediatek,mt8183-scp
       - mediatek,mt8186-scp
       - mediatek,mt8188-scp
+      - mediatek,mt8188-scp-dual
       - mediatek,mt8192-scp
       - mediatek,mt8195-scp
       - mediatek,mt8195-scp-dual
@@ -194,6 +195,7 @@ allOf:
         properties:
           compatible:
             enum:
+              - mediatek,mt8188-scp-dual
               - mediatek,mt8195-scp-dual
       then:
         properties:
-- 
2.6.4
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Tue, 2024-04-23 at 19:26 -0500, Haitao Huang wrote: > On Tue, 23 Apr 2024 17:13:15 -0500, Huang, Kai wrote: > > > On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote: > > > > > It's a workaround because you use the capacity==0 but it does not > > > really > > > > > mean to disable the misc cgroup for specific resource IIUC. > > > > > > > > Please read the comment around @misc_res_capacity again: > > > > > > > > * Miscellaneous resources capacity for the entire machine. 0 > > > capacity > > > > * means resource is not initialized or not present in the host. > > > > > > > > > > I mentioned this in earlier email. I think this means no SGX EPC. It > > > doesnot mean sgx epc cgroup not enabled. That's also consistent with the > > > behavior try_charge() fails if capacity is zero. > > > > OK. To me the "capacity" is purely the concept of cgroup, so it must be > > interpreted within the scope of "cgroup". If cgroup, in our case, SGX > > cgroup, is disabled, then whether "leaving the capacity to reflect the > > presence of hardware resource" doesn't really matter. > > So what you are saying is that, the kernel must initialize the capacity > > of > > some MISC resource once it is added as valid type. > > And you must initialize the "capacity" even MISC cgroup is disabled > > entirely by kernel commandline, in which case, IIUC, misc.capacity is not > > even going to show in the /fs. > > > > If this is your point, then your patch: > > > > cgroup/misc: Add SGX EPC resource type > > > > is already broken, because you added the new type w/o initializing the > > capacity. > > > > Please fix that up. > > > > > > > > > > > > > > > There is explicit way for user to disable misc without setting > > > capacity> > to > > > > > zero. > > > > > > > > Which way are you talking about? > > > > > > Echo "-misc" to cgroup.subtree_control at root level for example still > > > shows non-zero sgx_epc capacity. 
> > > > I guess "having to disable all MISC resources just in order to disable > > SGX > > EPC cgroup" is a brilliant idea. > > > > You can easily disable the entire MISC cgroup by commandline for that > > purpose if that's acceptable. > > > > And I have no idea why "still showing non-zero EPC capacity" is important > > if SGX cgroup cannot be supported at all. > > > > Okay, all I'm trying to say is we should care about consistency in code > and don't want SGX do something different. Mixing "disable" with > "capacity==0" causes inconsistencies AFAICS: > > 1) The try_charge() API currently returns error when capacity is zero. So > it appears not to mean that the cgroup is disabled otherwise it should > return success. I agree this isn't ideal. My view is we can fix it in MISC code if needed. > > 2) The current explicit way ("-misc") to disable misc still shows non-zero > entries in misc.capacity. (At least for v2 cgroup, it does when I tested). > Maybe this is not important but I just don't feel good about this > inconsistency. This belongs to "MISC resource cgroup was initially enabled by the kernel at boot time, but later was disabled *somewhere in hierarchy* by the user". In fact, if you only do "-misc" for "some subtree", it's quite reasonable to still report the resource in max.capacity. In the case above, the "some subtree" happens to be the root. So to me it's reasonable to still show max.capacity in this case. And you can actually argue that the kernel still supports the cgroup for the resource. E.g., you can at runtime do "+misc" to re-enable. However, if the kernel isn't able to support certain MISC resource cgroup at boot time, it's quite reasonable to just set the "capacity" to 0 so it isn't visible to userspace. Note: My key point is, when userspace sees 0 "capacity", it shouldn't need to care about whether it is because of "hardware resource is not available", or "hardware resource is available but kernel cannot support cgroup for it". 
The resource cgroup is simply unavailable. That means the kernel has full
right to just hide that resource from the cgroup at boot time. But this
should be just within "cgroup's scope", i.e., that resource can still be
available if kernel can provide it.

If some app wants to additionally check whether such resource is indeed
available but only cgroup is not available, it should check resource
specific interface but not take advantage of the MISC cgroup interface.

>
> For now I'll just do BUG_ON() unless there are more strong opinions one
> way or the other.
>

Fine to me.
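For reference, the "capacity == 0 means the resource is not initialized or not present" behavior debated above can be captured in a few lines. The model below is a deliberately simplified user-space sketch of the semantics under discussion, not the kernel's misc cgroup implementation; field names and the -1 error value are assumptions for the example.

```c
/*
 * Minimal model (not kernel code) of the MISC cgroup semantics discussed
 * above: a resource whose capacity was never set behaves as "not present",
 * so a charge attempt fails regardless of *why* the capacity is zero
 * (no hardware, or cgroup support for it disabled).
 */
struct misc_res {
	unsigned long capacity; /* 0 = not initialized / not present */
	unsigned long usage;
};

static int misc_try_charge(struct misc_res *res, unsigned long amount)
{
	/* With capacity 0, any charge exceeds capacity, so this always
	 * rejects -- the behavior Haitao points out in item 1). */
	if (res->usage + amount > res->capacity)
		return -1; /* charge rejected */
	res->usage += amount;
	return 0;
}
```

Under this model, userspace seeing 0 capacity cannot distinguish "no hardware" from "cgroup support disabled", which is exactly Kai's point: that distinction belongs to a resource-specific interface, not the MISC cgroup one.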
Re: [PATCH v2 2/2] remoteproc: mediatek: Support MT8188 SCP core 1
Hi Mathieu,

Thanks for the reviews. I will correct it in the next version.

Best regards,
Olivia

On Mon, 2024-04-22 at 09:33 -0600, Mathieu Poirier wrote:
>
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
> Hi Olivia,
>
> On Fri, Apr 19, 2024 at 04:42:11PM +0800, Olivia Wen wrote:
> > From: "olivia.wen"
> >
> > There are three primary modifications.
> >
> > 1. The struct mtk_scp_of_data usage on MT8188
> > MT8192 functions are unsuitable for the dual-core MT8188 SCP,
> > which has two RISC-V cores similar to MT8195 but without L1TCM.
> > We've added MT8188-specific functions to configure L1TCM
> > in multicore setups.
> >
> > 2. SCP_IPI_IMGSYS_CMD feature
> > This version also adds SCP_IPI_IMGSYS_CMD to facilitate
> > communication between the imgsys kernel and the backend driver.
> >
> > 3. Different code sizes and IPI share buffer sizes
> > Each SCP necessitates different code and IPI share buffer sizes.
> > Introducing a structure mtk_scp_sizes_data to handle them.
> >
>
> Please split in 3 different patches and in the changelog, concentrate
> on "why" you are making the changes rather than "what" changes are done.
> > Thanks, > Mathieu > > > Signed-off-by: olivia.wen > > --- > > drivers/remoteproc/mtk_common.h| 11 +- > > drivers/remoteproc/mtk_scp.c | 230 > + > > drivers/remoteproc/mtk_scp_ipi.c | 7 +- > > include/linux/remoteproc/mtk_scp.h | 1 + > > 4 files changed, 223 insertions(+), 26 deletions(-) > > > > diff --git a/drivers/remoteproc/mtk_common.h > b/drivers/remoteproc/mtk_common.h > > index 6d7736a..fd5c539 100644 > > --- a/drivers/remoteproc/mtk_common.h > > +++ b/drivers/remoteproc/mtk_common.h > > @@ -78,7 +78,6 @@ > > #define MT8195_L2TCM_OFFSET0x850d0 > > > > #define SCP_FW_VER_LEN32 > > -#define SCP_SHARE_BUFFER_SIZE288 > > > > struct scp_run { > > u32 signaled; > > @@ -97,6 +96,11 @@ struct scp_ipi_desc { > > > > struct mtk_scp; > > > > +struct mtk_scp_sizes_data { > > +size_t max_dram_size; > > +size_t ipi_share_buffer_size; > > +}; > > + > > struct mtk_scp_of_data { > > int (*scp_clk_get)(struct mtk_scp *scp); > > int (*scp_before_load)(struct mtk_scp *scp); > > @@ -110,6 +114,7 @@ struct mtk_scp_of_data { > > u32 host_to_scp_int_bit; > > > > size_t ipi_buf_offset; > > +const struct mtk_scp_sizes_data *scp_sizes; > > }; > > > > struct mtk_scp_of_cluster { > > @@ -141,10 +146,10 @@ struct mtk_scp { > > struct scp_ipi_desc ipi_desc[SCP_IPI_MAX]; > > bool ipi_id_ack[SCP_IPI_MAX]; > > wait_queue_head_t ack_wq; > > +u8 *share_buf; > > > > void *cpu_addr; > > dma_addr_t dma_addr; > > -size_t dram_size; > > > > struct rproc_subdev *rpmsg_subdev; > > > > @@ -162,7 +167,7 @@ struct mtk_scp { > > struct mtk_share_obj { > > u32 id; > > u32 len; > > -u8 share_buf[SCP_SHARE_BUFFER_SIZE]; > > +u8 *share_buf; > > }; > > > > void scp_memcpy_aligned(void __iomem *dst, const void *src, > unsigned int len); > > diff --git a/drivers/remoteproc/mtk_scp.c > b/drivers/remoteproc/mtk_scp.c > > index 6751829..e281d28 100644 > > --- a/drivers/remoteproc/mtk_scp.c > > +++ b/drivers/remoteproc/mtk_scp.c > > @@ -20,7 +20,6 @@ > > #include "mtk_common.h" > > #include 
"remoteproc_internal.h" > > > > -#define MAX_CODE_SIZE 0x50 > > #define SECTION_NAME_IPI_BUFFER ".ipi_buffer" > > > > /** > > @@ -94,14 +93,15 @@ static void scp_ipi_handler(struct mtk_scp > *scp) > > { > > struct mtk_share_obj __iomem *rcv_obj = scp->recv_buf; > > struct scp_ipi_desc *ipi_desc = scp->ipi_desc; > > -u8 tmp_data[SCP_SHARE_BUFFER_SIZE]; > > scp_ipi_handler_t handler; > > u32 id = readl(_obj->id); > > u32 len = readl(_obj->len); > > +const struct mtk_scp_sizes_data *scp_sizes; > > > > -if (len > SCP_SHARE_BUFFER_SIZE) { > > -dev_err(scp->dev, "ipi message too long (len %d, max %d)", len, > > -SCP_SHARE_BUFFER_SIZE); > > +scp_sizes = scp->data->scp_sizes; > > +if (len > scp_sizes->ipi_share_buffer_size) { > > +dev_err(scp->dev, "ipi message too long (len %d, max %zd)", len, > > +scp_sizes->ipi_share_buffer_size); > > return; > > } > > if (id >= SCP_IPI_MAX) { > > @@ -117,8 +117,9 @@ static void scp_ipi_handler(struct mtk_scp > *scp) > > return; > > } > > > > -memcpy_fromio(tmp_data, _obj->share_buf, len); > > -handler(tmp_data, len, ipi_desc[id].priv); > > +memset(scp->share_buf, 0, scp_sizes->ipi_share_buffer_size); > > +memcpy_fromio(scp->share_buf, _obj->share_buf, len); > > +handler(scp->share_buf, len, ipi_desc[id].priv); > > scp_ipi_unlock(scp, id); > > > > scp->ipi_id_ack[id] = true; > > @@ -133,6 +134,8 @@ static int scp_ipi_init(struct mtk_scp *scp, > const struct firmware *fw) > > { > > int ret; > > size_t buf_sz, offset; > > +size_t share_buf_offset; > > +const struct mtk_scp_sizes_data *scp_sizes; > > > > /* read the ipi buf addr from FW itself first */ > > ret = scp_elf_read_ipi_buf_addr(scp, fw,
Re: [PATCH 1/2] dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP
Hi AngeloGioacchino,

On Thu, 2024-04-11 at 09:39 +0200, AngeloGioacchino Del Regno wrote:
> Il 11/04/24 09:34, AngeloGioacchino Del Regno ha scritto:
> > Il 11/04/24 05:37, olivia.wen ha scritto:
> > > Under different applications, the MT8188 SCP can be used as
> > > single-core or dual-core.
> > >
> > > Signed-off-by: olivia.wen
> > > ---
> > > Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
> > > index 507f98f..7e7b567 100644
> > > --- a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
> > > +++ b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
> > > @@ -22,7 +22,7 @@ properties:
> > >       - mediatek,mt8192-scp
> > >       - mediatek,mt8195-scp
> > >       - mediatek,mt8195-scp-dual
> > > -
> >
> > Don't remove the blank line, it's there for readability.
> >
> > > +      - mediatek,mt8188-scp-dual
>
> Ah, sorry, one more comment. Please, keep the entries ordered by name.
> 8188 goes before 8195.
>
> >
> > After addressing that comment,
> >
> > Reviewed-by: AngeloGioacchino Del Regno
> > <angelogioacchino.delre...@collabora.com>
> >
> > > reg:
> > >   description:
> > >     Should contain the address ranges for memory regions
> > >     SRAM, CFG, and,
> > > @@ -195,6 +195,7 @@ allOf:
> > >       compatible:
> > >         enum:
> > >           - mediatek,mt8195-scp-dual
> > > +          - mediatek,mt8188-scp-dual
> >
> > same here.
> >
> > > then:
> > >   properties:
> > >     reg:
> >

Thanks for the reviews. It will be corrected in the next version.

Best regards,
Olivia
Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection
On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin wrote: > > On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote: > > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin wrote: > > > > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote: > > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin > > > > wrote: > > > > > > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote: > > > > > > Add the function vduse_alloc_reconnnect_info_mem > > > > > > and vduse_free_reconnnect_info_mem > > > > > > These functions allow vduse to allocate and free memory for > > > > > > reconnection > > > > > > information. The amount of memory allocated is vq_num pages. > > > > > > Each VQ will map its own page where the reconnection information > > > > > > will be saved > > > > > > > > > > > > Signed-off-by: Cindy Lu > > > > > > --- > > > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 40 > > > > > > ++ > > > > > > 1 file changed, 40 insertions(+) > > > > > > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > index ef3c9681941e..2da659d5f4a8 100644 > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue { > > > > > > int irq_effective_cpu; > > > > > > struct cpumask irq_affinity; > > > > > > struct kobject kobj; > > > > > > + unsigned long vdpa_reconnect_vaddr; > > > > > > }; > > > > > > > > > > > > struct vduse_dev; > > > > > > @@ -1105,6 +1106,38 @@ static void > > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq) > > > > > > > > > > > > vq->irq_effective_cpu = curr_cpu; > > > > > > } > > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev) > > > > > > +{ > > > > > > + unsigned long vaddr = 0; > > > > > > + struct vduse_virtqueue *vq; > > > > > > + > > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > > + /*page 0~ vq_num save the reconnect
info for vq*/ > > > > > > + vq = dev->vqs[i]; > > > > > > + vaddr = get_zeroed_page(GFP_KERNEL); > > > > > > > > > > > > > > > I don't get why you insist on stealing kernel memory for something > > > > > that is just used by userspace to store data for its own use. > > > > > Userspace does not lack ways to persist data, for example, > > > > > create a regular file anywhere in the filesystem. > > > > > > > > Good point. So the motivation here is to: > > > > > > > > 1) be self contained, no dependency for high speed persist data > > > > storage like tmpfs > > > > > > No idea what this means. > > > > I mean a regular file may slow down the datapath performance, so > > usually the application will try to use tmpfs and other which is a > > dependency for implementing the reconnection. > > Are we worried about systems without tmpfs now? Yes. > > > > > > > > > 2) standardize the format in uAPI which allows reconnection from > > > > arbitrary userspace, unfortunately, such effort was removed in new > > > > versions > > > > > > And I don't see why that has to live in the kernel tree either. > > > > I can't find a better place, any idea? > > > > Thanks > > > Well anywhere on github really. with libvhost-user maybe? > It's harmless enough in Documentation > if you like but ties you to the kernel release cycle in a way that > is completely unnecessary. Ok. Thanks > > > > > > > > If the above doesn't make sense, we don't need to offer those pages by > > > > VDUSE. 
> > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + if (vaddr == 0) > > > > > > + return -ENOMEM; > > > > > > + > > > > > > + vq->vdpa_reconnect_vaddr = vaddr; > > > > > > + } > > > > > > + > > > > > > + return 0; > > > > > > +} > > > > > > + > > > > > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev) > > > > > > +{ > > > > > > + struct vduse_virtqueue *vq; > > > > > > + > > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > > + vq = dev->vqs[i]; > > > > > > + > > > > > > + if (vq->vdpa_reconnect_vaddr) > > > > > > + free_page(vq->vdpa_reconnect_vaddr); > > > > > > + vq->vdpa_reconnect_vaddr = 0; > > > > > > + } > > > > > > + > > > > > > + return 0; > > > > > > +} > > > > > > > > > > > > static long vduse_dev_ioctl(struct file *file, unsigned int cmd, > > > > > > unsigned long arg) > > > > > > @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name) > > > > > > mutex_unlock(>lock); > > > > > > return -EBUSY; > > > > > > } > > > > > > + vduse_free_reconnnect_info_mem(dev); > > > > > > + > > > > > > dev->connected = true; > > > > > > mutex_unlock(>lock); > > > > > > > > > > > > @@ -1855,12 +1890,17 @@ static int
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Tue, 23 Apr 2024 17:13:15 -0500, Huang, Kai wrote: On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote: > > It's a workaround because you use the capacity==0 but it does not really > > mean to disable the misc cgroup for specific resource IIUC. > > Please read the comment around @misc_res_capacity again: > > * Miscellaneous resources capacity for the entire machine. 0 capacity > * means resource is not initialized or not present in the host. > I mentioned this in an earlier email. I think this means no SGX EPC. It does not mean the SGX EPC cgroup is not enabled. That's also consistent with the behavior that try_charge() fails if capacity is zero. OK. To me the "capacity" is purely a cgroup concept, so it must be interpreted within the scope of "cgroup". If the cgroup, in our case the SGX cgroup, is disabled, then "leaving the capacity to reflect the presence of the hardware resource" doesn't really matter. So what you are saying is that the kernel must initialize the capacity of a MISC resource once it is added as a valid type. And you must initialize the "capacity" even if the MISC cgroup is disabled entirely by kernel commandline, in which case, IIUC, misc.capacity is not even going to show up in the filesystem. If this is your point, then your patch "cgroup/misc: Add SGX EPC resource type" is already broken, because you added the new type w/o initializing the capacity. Please fix that up. > > > > There is explicit way for user to disable misc without setting capacity > > to zero. > > Which way are you talking about? Echoing "-misc" to cgroup.subtree_control at the root level, for example, still shows non-zero sgx_epc capacity. I guess "having to disable all MISC resources just in order to disable SGX EPC cgroup" is a brilliant idea. You can easily disable the entire MISC cgroup by commandline for that purpose if that's acceptable. And I have no idea why "still showing non-zero EPC capacity" is important if the SGX cgroup cannot be supported at all.
Okay, all I'm trying to say is we should care about consistency in code, and I don't want SGX to do something different. Mixing "disable" with "capacity==0" causes inconsistencies AFAICS: 1) The try_charge() API currently returns an error when capacity is zero. So it appears not to mean that the cgroup is disabled; otherwise it should return success. 2) The current explicit way ("-misc") to disable misc still shows non-zero entries in misc.capacity. (At least for cgroup v2, it did when I tested.) Maybe this is not important, but I just don't feel good about this inconsistency. For now I'll just do BUG_ON() unless there are stronger opinions one way or the other. BR Haitao
Re: [PATCH v3 0/4] MSM8976 MDSS/GPU/WCNSS support
On 13/04/2024 18:03, Adam Skladowski wrote: This patch series provides support for the display subsystem and GPU, and also adds wireless connectivity subsystem support. Changes since v2 1. Disabled mdss_dsi nodes by default 2. Changed reg size of mdss_dsi0 to be equal on both 3. Added operating points to second mdss_dsi 4. Brought back required opp-supported-hw on adreno 5. Moved status under operating points on adreno Changes since v1 1. Addressed feedback 2. Dropped already applied dt-bindings patches 3. Dropped sdc patch as it was submitted as part of other series 4. Dropped dt-bindings patch for Adreno, also separate now Adam Skladowski (4): arm64: dts: qcom: msm8976: Add IOMMU nodes arm64: dts: qcom: msm8976: Add MDSS nodes arm64: dts: qcom: msm8976: Add Adreno GPU arm64: dts: qcom: msm8976: Add WCNSS node arch/arm64/boot/dts/qcom/msm8976.dtsi | 536 +- 1 file changed, 532 insertions(+), 4 deletions(-) I remember you pinged me on IRC about my previous review feedback. I would appreciate it if you could add a response email to your cover letter explaining why that's not done. I'm sure you're probably right, but good practice is to state in your log what was and wasn't done and why. --- bod
[PATCH v21 4/5] Documentation: tracing: Add ring-buffer mapping
It is now possible to mmap() a ring-buffer to stream its content. Add some documentation and a code example. Signed-off-by: Vincent Donnefort diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst index 5092d6c13af5..0b300901fd75 100644 --- a/Documentation/trace/index.rst +++ b/Documentation/trace/index.rst @@ -29,6 +29,7 @@ Linux Tracing Technologies timerlat-tracer intel_th ring-buffer-design + ring-buffer-map stm sys-t coresight/index diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst new file mode 100644 index ..8e296bcc0d7f --- /dev/null +++ b/Documentation/trace/ring-buffer-map.rst @@ -0,0 +1,106 @@ +.. SPDX-License-Identifier: GPL-2.0 + +== +Tracefs ring-buffer memory mapping +== + +:Author: Vincent Donnefort + +Overview + +Tracefs ring-buffer memory map provides an efficient method to stream data +as no memory copy is necessary. The application mapping the ring-buffer becomes +then a consumer for that ring-buffer, in a similar fashion to trace_pipe. + +Memory mapping setup + +The mapping works with a mmap() of the trace_pipe_raw interface. + +The first system page of the mapping contains ring-buffer statistics and +description. It is referred to as the meta-page. One of the most important +fields of the meta-page is the reader. It contains the sub-buffer ID which can +be safely read by the mapper (see ring-buffer-design.rst). + +The meta-page is followed by all the sub-buffers, ordered by ascending ID. It is +therefore effortless to know where the reader starts in the mapping: + +.. code-block:: c + +reader_id = meta->reader->id; +reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; + +When the application is done with the current reader, it can get a new one using +the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates +the meta-page fields. 
+ +Limitations +=== +When a mapping is in place on a Tracefs ring-buffer, it is not possible to +either resize it (either by increasing the entire size of the ring-buffer or +each subbuf). It is also not possible to use snapshot and causes splice to copy +the ring buffer data instead of using the copyless swap from the ring buffer. + +Concurrent readers (either another application mapping that ring-buffer or the +kernel with trace_pipe) are allowed but not recommended. They will compete for +the ring-buffer and the output is unpredictable, just like concurrent readers on +trace_pipe would be. + +Example +=== + +.. code-block:: c + +#include +#include +#include +#include + +#include + +#include +#include + +#define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw" + +int main(void) +{ +int page_size = getpagesize(), fd, reader_id; +unsigned long meta_len, data_len; +struct trace_buffer_meta *meta; +void *map, *reader, *data; + +fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK); +if (fd < 0) +exit(EXIT_FAILURE); + +map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0); +if (map == MAP_FAILED) +exit(EXIT_FAILURE); + +meta = (struct trace_buffer_meta *)map; +meta_len = meta->meta_page_size; + +printf("entries:%llu\n", meta->entries); +printf("overrun:%llu\n", meta->overrun); +printf("read: %llu\n", meta->read); +printf("nr_subbufs: %u\n", meta->nr_subbufs); + +data_len = meta->subbuf_size * meta->nr_subbufs; +data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len); +if (data == MAP_FAILED) +exit(EXIT_FAILURE); + +if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0) +exit(EXIT_FAILURE); + +reader_id = meta->reader.id; +reader = data + meta->subbuf_size * reader_id; + +printf("Current reader address: %p\n", reader); + +munmap(data, data_len); +munmap(meta, meta_len); +close (fd); + +return 0; +} -- 2.44.0.769.g3c40516874-goog
[PATCH v21 3/5] tracing: Allow user-space mapping of the ring-buffer
Currently, user-space extracts data from the ring-buffer via splice, which is handy for storage or network sharing. However, due to splice limitations, it is impossible to do real-time analysis without a copy. A solution for that problem is to let the user-space map the ring-buffer directly. The mapping is exposed via the per-CPU file trace_pipe_raw. The first element of the mapping is the meta-page. It is followed by each subbuffer constituting the ring-buffer, ordered by their unique page ID: * Meta-page -- include/uapi/linux/trace_mmap.h for a description * Subbuf ID 0 * Subbuf ID 1 ... It is therefore easy to translate a subbuf ID into an offset in the mapping: reader_id = meta->reader->id; reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; When new data is available, the mapper must call a newly introduced ioctl: TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to point to the next reader containing unread data. Mapping will prevent snapshot and buffer size modifications.
CC: Signed-off-by: Vincent Donnefort diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h index b682e9925539..bd1066754220 100644 --- a/include/uapi/linux/trace_mmap.h +++ b/include/uapi/linux/trace_mmap.h @@ -43,4 +43,6 @@ struct trace_buffer_meta { __u64 Reserved2; }; +#define TRACE_MMAP_IOCTL_GET_READER _IO('T', 0x1) + #endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 233d1af39fff..a35e7f598233 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1191,6 +1191,12 @@ static void tracing_snapshot_instance_cond(struct trace_array *tr, return; } + if (tr->mapped) { + trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n"); + trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n"); + return; + } + local_irq_save(flags); update_max_tr(tr, current, smp_processor_id(), cond_data); local_irq_restore(flags); @@ -1323,7 +1329,7 @@ static int tracing_arm_snapshot_locked(struct trace_array *tr) lockdep_assert_held(&trace_types_lock); spin_lock(&tr->snapshot_trigger_lock); - if (tr->snapshot == UINT_MAX) { + if (tr->snapshot == UINT_MAX || tr->mapped) { spin_unlock(&tr->snapshot_trigger_lock); return -EBUSY; } @@ -6068,7 +6074,7 @@ static void tracing_set_nop(struct trace_array *tr) { if (tr->current_trace == &nop_trace) return; - + tr->current_trace->enabled--; if (tr->current_trace->reset) @@ -8194,15 +8200,32 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, return ret; } -/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct ftrace_buffer_info *info = file->private_data; struct trace_iterator *iter = &info->iter; + int err; + + if (cmd == TRACE_MMAP_IOCTL_GET_READER) { + if (!(file->f_flags & O_NONBLOCK)) { + err = ring_buffer_wait(iter->array_buffer->buffer, + iter->cpu_file, + iter->tr->buffer_percent, + NULL, NULL); + if (err) + return err; + } - if (cmd) - return
-ENOIOCTLCMD; + return ring_buffer_map_get_reader(iter->array_buffer->buffer, + iter->cpu_file); + } else if (cmd) { + return -ENOTTY; + } + /* +* An ioctl call with cmd 0 to the ring buffer file will wake up all +* waiters +*/ mutex_lock(&trace_types_lock); /* Make sure the waiters see the new wait_index */ @@ -8214,6 +8237,76 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned return 0; } +#ifdef CONFIG_TRACER_MAX_TRACE +static int get_snapshot_map(struct trace_array *tr) +{ + int err = 0; + + /* +* Called with mmap_lock held. lockdep would be unhappy if we would now +* take trace_types_lock. Instead use the specific +* snapshot_trigger_lock. +*/ + spin_lock(&tr->snapshot_trigger_lock); + + if (tr->snapshot || tr->mapped == UINT_MAX) + err = -EBUSY; + else + tr->mapped++; + + spin_unlock(&tr->snapshot_trigger_lock); + + /* Wait for update_max_tr() to observe iter->tr->mapped */ + if (tr->mapped == 1) + synchronize_rcu(); + + return err; + +} +static void put_snapshot_map(struct trace_array *tr) +{ + spin_lock(&tr->snapshot_trigger_lock); + if (!WARN_ON(!tr->mapped)) +
[PATCH v21 2/5] ring-buffer: Introducing ring-buffer mapping functions
In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..b682e9925539 --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size of this meta-page. + * @meta_struct_len: Size of this structure. 
+ * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Internal use only. + * @Reserved2: Internal use only. + */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..84f8744fa110 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; struct list_headnew_pages; /* new pages to add */ @@ -1599,6 +1608,7 @@ 
rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters); init_waitqueue_head(&cpu_buffer->irq_work.waiters); init_waitqueue_head(&cpu_buffer->irq_work.full_waiters); +
[PATCH v21 1/5] ring-buffer: Allocate sub-buffers with __GFP_COMP
In preparation for the ring-buffer memory mapping, allocate compound pages for the ring-buffer sub-buffers to enable us to map them to user-space with vm_insert_pages(). Signed-off-by: Vincent Donnefort diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 25476ead681b..cc9ebe593571 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1524,7 +1524,7 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, list_add(>list, pages); page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), - mflags | __GFP_ZERO, + mflags | __GFP_COMP | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) goto free_pages; @@ -1609,7 +1609,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) cpu_buffer->reader_page = bpage; - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO, + page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) goto fail_free_reader; @@ -5579,7 +5579,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu) goto out; page = alloc_pages_node(cpu_to_node(cpu), - GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO, + GFP_KERNEL | __GFP_NORETRY | __GFP_COMP | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) { kfree(bpage); -- 2.44.0.769.g3c40516874-goog
[PATCH v21 0/5] Introducing trace buffer mapping by user-space
The tracing ring-buffers can be stored on disk or sent over the network without any copy via splice. However, the latter doesn't allow real-time processing of the traces. A solution is to give userspace direct access to the ring-buffer pages via a mapping. An application can now become a consumer of the ring-buffer, in a similar fashion to what trace_pipe offers. Support for this new feature can already be found in libtracefs from version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE. Vincent v20 -> v21: * Collect Ack * Add .gitignore * Few nits * Remove meta-page padding (zero-page not supported by vm_insert_pages) * Remove single-usage macros * Move vma flags handling into ring-buffer.c v19 -> v20: * Fix typos in documentation. * Remove useless mmap open and fault callbacks. * add mm.h include for vm_insert_pages v18 -> v19: * Use VM_PFNMAP and vm_insert_pages * Allocate ring-buffer subbufs with __GFP_COMP * Pad the meta-page with the zero-page to align on the subbuf_order * Extend the ring-buffer test with mmap() dedicated suite v17 -> v18: * Fix lockdep_assert_held * Fix spin_lock_init typo * Fix CONFIG_TRACER_MAX_TRACE typo v16 -> v17: * Documentation and comments improvements. * Create get/put_snapshot_map() for clearer code. * Replace kzalloc with kcalloc. * Fix -ENOMEM handling in rb_alloc_meta_page(). * Move flush(cpu_buffer->reader_page) behind the reader lock. * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer mutex. (removes READ_ONCE/WRITE_ONCE accesses). v15 -> v16: * Add comment for the dcache flush. * Remove now unnecessary WRITE_ONCE for the meta-page. v14 -> v15: * Add meta-page and reader-page flush. Intends to fix the mapping for VIVT and aliasing-VIPT data caches. * -EPERM on VM_EXEC. * Fix build warning !CONFIG_TRACER_MAX_TRACE. v13 -> v14: * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu) * on unmap, sync meta-page teardown with the reader_lock instead of the synchronize_rcu.
* Add a dedicated spinlock for trace_array ->snapshot and ->mapped. (intends to fix a lockdep issue) * Add kerneldoc for flags and Reserved fields. * Add kselftest for snapshot/map mutual exclusion. v12 -> v13: * Swap subbufs_{touched,lost} for Reserved fields. * Add a flag field in the meta-page. * Fix CONFIG_TRACER_MAX_TRACE. * Rebase on top of trace/urgent. * Add a comment for try_unregister_trigger() v11 -> v12: * Fix code sample mmap bug. * Add logging in sample code. * Reset tracer in selftest. * Add a refcount for the snapshot users. * Prevent mapping when there are snapshot users and vice versa. * Refine the meta-page. * Fix types in the meta-page. * Collect Reviewed-by. v10 -> v11: * Add Documentation and code sample. * Add a selftest. * Move all the update to the meta-page into a single rb_update_meta_page(). * rb_update_meta_page() is now called from ring_buffer_map_get_reader() to fix NOBLOCK callers. * kerneldoc for struct trace_meta_page. * Add a patch to zero all the ring-buffer allocations. v9 -> v10: * Refactor rb_update_meta_page() * In-loop declaration for foreach_subbuf_page() * Check for cpu_buffer->mapped overflow v8 -> v9: * Fix the unlock path in ring_buffer_map() * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a) v7 -> v8: * Drop the subbufs renaming into bpages * Use subbuf as a name when relevant v6 -> v7: * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/ * Support for subbufs * Rename subbufs into bpages v5 -> v6: * Rebase on next-20230802. * (unsigned long) -> (void *) cast for virt_to_page(). * Add a wait for the GET_READER_PAGE ioctl. * Move writer fields update (overrun/pages_lost/entries/pages_touched) in the irq_work. * Rearrange id in struct buffer_page. * Rearrange the meta-page. * ring_buffer_meta_page -> trace_buffer_meta_page. * Add meta_struct_len into the meta-page. 
v4 -> v5: * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3) v3 -> v4: * Add to the meta-page: - pages_lost / pages_read (allow to compute how full is the ring-buffer) - read (allow to compute how many entries can be read) - A reader_page struct. * Rename ring_buffer_meta_header -> ring_buffer_meta * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page * Properly consume events on ring_buffer_map_get_reader_page() with rb_advance_reader(). v2 -> v3: * Remove data page list (for non-consuming read) ** Implies removing order > 0 meta-page * Add a new meta page field ->read * Rename ring_buffer_meta_page_header into ring_buffer_meta_header v1 -> v2: * Hide data_pages from the userspace struct * Fix META_PAGE_MAX_PAGES * Support for order > 0 meta-page * Add missing page->mapping. Vincent Donnefort (5):
Re: [PATCH v2 2/6] iio: light: stk3310: Implement vdd supply and power it off during suspend
On Wed, Apr 24, 2024 at 1:41 AM Aren Moynihan wrote: > > From: Ondrej Jirman > > VDD power input can be used to completely power off the chip during > system suspend. Do so if available. ... > ret = stk3310_init(indio_dev); > if (ret < 0) > - return ret; > + goto err_vdd_disable; This is wrong. You will have the regulator being disabled _before_ IRQ. Note that the original code likely has a bug which sets states before disabling IRQ and removing a handler. Side note, you may make the driver neater with help of struct device *dev = &client->dev; defined in this patch. ... > static int stk3310_suspend(struct device *dev) > { > struct stk3310_data *data; > data = iio_priv(i2c_get_clientdata(to_i2c_client(dev))); Side note: This may be updated (in a separate change) to use dev_get_drvdata() directly. Jonathan, do we have something like iio_priv_from_drvdata(struct device *dev)? Seems many drivers may utilise it. > } ... > static int stk3310_resume(struct device *dev) Ditto. -- With Best Regards, Andy Shevchenko
Re: [PATCH v2 3/6] iio: light: stk3310: Manage LED power supply
On Wed, Apr 24, 2024 at 1:41 AM Aren Moynihan wrote: > > The stk3310 and stk3310 chips have an input for power to the infrared > LED. Add support for managing it's state. its ... > if (IS_ERR(data->vdd_reg)) > return dev_err_probe(>dev, ret, "get regulator vdd > failed\n"); > > + data->led_reg = devm_regulator_get(>dev, "leda"); > + if (IS_ERR(data->led_reg)) > + return dev_err_probe(>dev, ret, "get regulator led > failed\n"); Can't you use a bulk regulator API instead? -- With Best Regards, Andy Shevchenko
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Wed, 2024-04-24 at 00:27 +0300, Jarkko Sakkinen wrote: > On Tue Apr 23, 2024 at 8:08 PM EEST, Reinette Chatre wrote: > > Hi Kai, > > > > On 4/23/2024 4:50 AM, Huang, Kai wrote: > > > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > > > > b/arch/x86/kernel/cpu/sgx/ioctl.c > > > > index b65ab214bdf5..2340a82fa796 100644 > > > > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > > > > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > > > > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl > > > > *encl, > > > > } > > > > > > > > mutex_unlock(&encl->lock); > > > > + > > > > + if (need_resched()) > > > > + cond_resched(); > > > > } > > > > > > > > ret = 0; > > > > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct > > > > sgx_encl *encl, > > > > entry->type = page_type; > > > > > > > > mutex_unlock(&encl->lock); > > > > + > > > > + if (need_resched()) > > > > + cond_resched(); > > > > } > > > > > > > > ret = 0; > > > > @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl > > > > *encl, > > > > kfree(entry); > > > > > > > > mutex_unlock(&encl->lock); > > > > + > > > > + if (need_resched()) > > > > + cond_resched(); > > > > } > > > > > > > > > > You can remove the need_resched() in all 3 places above but just call > > > cond_resched() directly. > > > > > > > This change will call cond_resched() after dealing with each page in a > > potentially large page range (cover mentions 30GB but we have also had to > > make optimizations for enclaves larger than this). Adding a cond_resched() > > here will surely placate the soft lockup detector, but we need to take care > > how changes like this impact the performance of the system and having > > actions > > on these page ranges take much longer than necessary. > > For reference, please see 7b72c823ddf8 ("x86/sgx: Reduce delay and > > interference > > of enclave release") that turned frequent cond_resched() into batches > > to address performance issues. Ah I didn't know this. Thanks for the info.
> > It looks to me like the need_resched() may be a quick check that can be used > > to improve performance? > > Perhaps? My assumption is that eventually cond_resched() will do a similar check to need_resched(), but I am not entirely sure about that. Reading the code, it seems cond_resched() eventually does should_resched(). The generic version indeed does a similar check to need_resched(), but it seems the x86 version has a different implementation. > > I am not familiar with all use cases that need to be > > considered to determine if a batching solution may be needed. It looks like at least the REMOVE_PAGES ioctl() could have the same performance impact as the problem mentioned in commit 7b72c823ddf8 ("x86/sgx: Reduce delay and interference of enclave release"), but I guess it's acceptable to fix the softlockup first and then improve performance if someone hits any real issue. > > Ya, well no matter it is the reasoning will need to be documented > because this should have symmetry with sgx_ioc_enclave_add_pages() > (see my response to Kai). Yeah, I was actually going to mention this, but somehow chose not to. > > Because this makes dealing with need_resched() a change in code > even if it is left out as a side-effect, I'd support not removing it, > which means adding need_resched() as a side-effect. I am fine with keeping the need_resched().
[PATCH v2 3/6] iio: light: stk3310: Manage LED power supply
The stk3310 and stk3311 chips have an input for power to the infrared LED. Add support for managing its state. Signed-off-by: Aren Moynihan --- drivers/iio/light/stk3310.c | 23 ++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/drivers/iio/light/stk3310.c b/drivers/iio/light/stk3310.c index a0547eeca3e3..ee1ac95dbc0e 100644 --- a/drivers/iio/light/stk3310.c +++ b/drivers/iio/light/stk3310.c @@ -120,6 +120,7 @@ struct stk3310_data { struct regmap_field *reg_flag_psint; struct regmap_field *reg_flag_nf; struct regulator *vdd_reg; + struct regulator *led_reg; }; static const struct iio_event_spec stk3310_events[] = { @@ -614,6 +615,10 @@ static int stk3310_probe(struct i2c_client *client) if (IS_ERR(data->vdd_reg)) return dev_err_probe(>dev, ret, "get regulator vdd failed\n"); + data->led_reg = devm_regulator_get(>dev, "leda"); + if (IS_ERR(data->led_reg)) + return dev_err_probe(>dev, ret, "get regulator led failed\n"); + ret = stk3310_regmap_init(data); if (ret < 0) return ret; @@ -629,12 +634,18 @@ static int stk3310_probe(struct i2c_client *client) return dev_err_probe(>dev, ret, "regulator vdd enable failed\n"); + ret = regulator_enable(data->led_reg); + if (ret) { + dev_err_probe(>dev, ret, "regulator led enable failed\n"); + goto err_vdd_disable; + } + /* we need a short delay to allow the chip time to power on */ fsleep(1000); ret = stk3310_init(indio_dev); if (ret < 0) - goto err_vdd_disable; + goto err_led_disable; if (client->irq > 0) { ret = devm_request_threaded_irq(>dev, client->irq, @@ -660,6 +671,8 @@ static int stk3310_probe(struct i2c_client *client) err_standby: stk3310_set_state(data, STK3310_STATE_STANDBY); +err_led_disable: + regulator_disable(data->led_reg); err_vdd_disable: regulator_disable(data->vdd_reg); return ret; @@ -672,6 +685,7 @@ static void stk3310_remove(struct i2c_client *client) iio_device_unregister(indio_dev); stk3310_set_state(iio_priv(indio_dev), STK3310_STATE_STANDBY); + regulator_disable(data->led_reg);
regulator_disable(data->vdd_reg); } @@ -687,6 +701,7 @@ static int stk3310_suspend(struct device *dev) return ret; regcache_mark_dirty(data->regmap); + regulator_disable(data->led_reg); regulator_disable(data->vdd_reg); return 0; @@ -706,6 +721,12 @@ static int stk3310_resume(struct device *dev) return ret; } + ret = regulator_enable(data->led_reg); + if (ret) { + dev_err(dev, "Failed to re-enable regulator led\n"); + return ret; + } + fsleep(1000); ret = regcache_sync(data->regmap); -- 2.44.0
[PATCH v2 2/6] iio: light: stk3310: Implement vdd supply and power it off during suspend
From: Ondrej Jirman VDD power input can be used to completely power off the chip during system suspend. Do so if available. Signed-off-by: Ondrej Jirman Signed-off-by: Aren Moynihan --- Notes: Changes in v2: - always enable / disable regulators and rely on a dummy regulator if one isn't specified - replace usleep_range with fsleep - reorder includes so iio headers are last - add missing error handling to resume drivers/iio/light/stk3310.c | 49 ++--- 1 file changed, 46 insertions(+), 3 deletions(-) diff --git a/drivers/iio/light/stk3310.c b/drivers/iio/light/stk3310.c index 7b71ad71d78d..a0547eeca3e3 100644 --- a/drivers/iio/light/stk3310.c +++ b/drivers/iio/light/stk3310.c @@ -13,6 +13,8 @@ #include #include #include +#include + #include #include #include @@ -117,6 +119,7 @@ struct stk3310_data { struct regmap_field *reg_int_ps; struct regmap_field *reg_flag_psint; struct regmap_field *reg_flag_nf; + struct regulator *vdd_reg; }; static const struct iio_event_spec stk3310_events[] = { @@ -607,6 +610,10 @@ static int stk3310_probe(struct i2c_client *client) mutex_init(>lock); + data->vdd_reg = devm_regulator_get(>dev, "vdd"); + if (IS_ERR(data->vdd_reg)) + return dev_err_probe(>dev, ret, "get regulator vdd failed\n"); + ret = stk3310_regmap_init(data); if (ret < 0) return ret; @@ -617,9 +624,17 @@ static int stk3310_probe(struct i2c_client *client) indio_dev->channels = stk3310_channels; indio_dev->num_channels = ARRAY_SIZE(stk3310_channels); + ret = regulator_enable(data->vdd_reg); + if (ret) + return dev_err_probe(>dev, ret, +"regulator vdd enable failed\n"); + + /* we need a short delay to allow the chip time to power on */ + fsleep(1000); + ret = stk3310_init(indio_dev); if (ret < 0) - return ret; + goto err_vdd_disable; if (client->irq > 0) { ret = devm_request_threaded_irq(>dev, client->irq, @@ -645,32 +660,60 @@ static int stk3310_probe(struct i2c_client *client) err_standby: stk3310_set_state(data, STK3310_STATE_STANDBY); +err_vdd_disable: + 
regulator_disable(data->vdd_reg); return ret; } static void stk3310_remove(struct i2c_client *client) { struct iio_dev *indio_dev = i2c_get_clientdata(client); + struct stk3310_data *data = iio_priv(indio_dev); iio_device_unregister(indio_dev); stk3310_set_state(iio_priv(indio_dev), STK3310_STATE_STANDBY); + regulator_disable(data->vdd_reg); } static int stk3310_suspend(struct device *dev) { struct stk3310_data *data; + int ret; data = iio_priv(i2c_get_clientdata(to_i2c_client(dev))); - return stk3310_set_state(data, STK3310_STATE_STANDBY); + ret = stk3310_set_state(data, STK3310_STATE_STANDBY); + if (ret) + return ret; + + regcache_mark_dirty(data->regmap); + regulator_disable(data->vdd_reg); + + return 0; } static int stk3310_resume(struct device *dev) { - u8 state = 0; struct stk3310_data *data; + u8 state = 0; + int ret; data = iio_priv(i2c_get_clientdata(to_i2c_client(dev))); + + ret = regulator_enable(data->vdd_reg); + if (ret) { + dev_err(dev, "Failed to re-enable regulator vdd\n"); + return ret; + } + + fsleep(1000); + + ret = regcache_sync(data->regmap); + if (ret) { + dev_err(dev, "Failed to restore registers: %d\n", ret); + return ret; + } + if (data->ps_enabled) state |= STK3310_STATE_EN_PS; if (data->als_enabled) -- 2.44.0
[PATCH v2 6/6] arm64: dts: allwinner: pinephone: Add power supplies to stk3311
From: Ondrej Jirman This makes the driver disable the supply during sleep. Signed-off-by: Ondrej Jirman Signed-off-by: Aren Moynihan --- Notes: Changes in v2: - add leda-supply arch/arm64/boot/dts/allwinner/sun50i-a64-pinephone.dtsi | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64-pinephone.dtsi b/arch/arm64/boot/dts/allwinner/sun50i-a64-pinephone.dtsi index b5a232209f2b..51ab1db95f81 100644 --- a/arch/arm64/boot/dts/allwinner/sun50i-a64-pinephone.dtsi +++ b/arch/arm64/boot/dts/allwinner/sun50i-a64-pinephone.dtsi @@ -250,6 +250,8 @@ light-sensor@48 { reg = <0x48>; interrupt-parent = <>; interrupts = <1 0 IRQ_TYPE_EDGE_FALLING>; /* PB0 */ + vdd-supply = <_ldo_io0>; + leda-supply = <_dldo1>; }; /* Accelerometer/gyroscope */ -- 2.44.0
[PATCH v2 5/6] iio: light: stk3310: log error if reading the chip id fails
If the chip isn't powered, this call is likely to return an error. Without a log here the driver will silently fail to probe. Common errors are ENXIO (when the chip isn't powered) and ETIMEDOUT (when the i2c bus isn't powered). Signed-off-by: Aren Moynihan --- Notes: Changes in v2: - use dev_err_probe drivers/iio/light/stk3310.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iio/light/stk3310.c b/drivers/iio/light/stk3310.c index c56c6298d292..6ee6f145a6d5 100644 --- a/drivers/iio/light/stk3310.c +++ b/drivers/iio/light/stk3310.c @@ -475,7 +475,7 @@ static int stk3310_init(struct iio_dev *indio_dev) ret = regmap_read(data->regmap, STK3310_REG_ID, ); if (ret < 0) - return ret; + return dev_err_probe(>dev, ret, "failed to read chip id\n"); if (chipid != STK3310_CHIP_ID_VAL && chipid != STK3311_CHIP_ID_VAL && -- 2.44.0
[PATCH v2 1/6] dt-bindings: iio: light: stk33xx: add vdd and leda regulators
stk3310 and stk3311 are typically connected to power supplies for the chip (vdd) and the infrared LED (leda). Add properties so we can power these up / down appropriately. Signed-off-by: Aren Moynihan --- Notes: Changes in v2: - add leda-supply - add supplies to examples Documentation/devicetree/bindings/iio/light/stk33xx.yaml | 4 1 file changed, 4 insertions(+) diff --git a/Documentation/devicetree/bindings/iio/light/stk33xx.yaml b/Documentation/devicetree/bindings/iio/light/stk33xx.yaml index f6e22dc9814a..43ead524cecb 100644 --- a/Documentation/devicetree/bindings/iio/light/stk33xx.yaml +++ b/Documentation/devicetree/bindings/iio/light/stk33xx.yaml @@ -29,6 +29,8 @@ properties: interrupts: maxItems: 1 + vdd-supply: true + leda-supply: true proximity-near-level: true required: @@ -52,6 +54,8 @@ examples: proximity-near-level = <25>; interrupt-parent = <>; interrupts = <5 IRQ_TYPE_LEVEL_LOW>; +vdd-supply = <_regulator>; +leda-supply = <_regulator>; }; }; ... -- 2.44.0
[PATCH v2 0/6] iio: light: stk3310: support powering off during suspend
In the Pine64 PinePhone, the stk3310 chip is powered by a regulator that is disabled at system boot and can be shut off during suspend. To ensure that the chip properly initializes, both after boot and suspend, we need to manage this regulator. Additionally if the chip is shut off in suspend, we need to make sure that it gets reinitialized with the same parameters after resume. Major changes in v2: - Add handling of the IR LED. I was hesitant to include this as it is the same as pull-up regulator for the i2c bus on the hardware I have, so I can't test it well. I think leaving it out is more likely to cause issues than including it. - Convert stk3310 to use dev_err_probe for errors. - Always enable / disable regulators and rely on dummy devices if they're not specified. - more listed in individual patches Aren Moynihan (4): dt-bindings: iio: light: stk33xx: add vdd and leda regulators iio: light: stk3310: Manage LED power supply iio: light: stk3310: use dev_err_probe where possible iio: light: stk3310: log error if reading the chip id fails Ondrej Jirman (2): iio: light: stk3310: Implement vdd supply and power it off during suspend arm64: dts: allwinner: pinephone: Add power supplies to stk3311 .../bindings/iio/light/stk33xx.yaml | 4 + .../dts/allwinner/sun50i-a64-pinephone.dtsi | 2 + drivers/iio/light/stk3310.c | 116 +- 3 files changed, 94 insertions(+), 28 deletions(-) -- 2.44.0
[PATCH v2 4/6] iio: light: stk3310: use dev_err_probe where possible
Using dev_err_probe instead of dev_err and return makes the errors easier to understand by including the error name, and saves a little code. Signed-off-by: Aren Moynihan --- Notes: Added in v2 drivers/iio/light/stk3310.c | 44 + 1 file changed, 20 insertions(+), 24 deletions(-) diff --git a/drivers/iio/light/stk3310.c b/drivers/iio/light/stk3310.c index ee1ac95dbc0e..c56c6298d292 100644 --- a/drivers/iio/light/stk3310.c +++ b/drivers/iio/light/stk3310.c @@ -60,10 +60,10 @@ data->reg_##name = \ devm_regmap_field_alloc(>dev, regmap, \ stk3310_reg_field_##name); \ - if (IS_ERR(data->reg_##name)) { \ - dev_err(>dev, "reg field alloc failed.\n"); \ - return PTR_ERR(data->reg_##name); \ - } \ + if (IS_ERR(data->reg_##name)) \ + return dev_err_probe(>dev, \ + PTR_ERR(data->reg_##name), \ + "reg field alloc failed.\n"); \ } while (0) static const struct reg_field stk3310_reg_field_state = @@ -480,22 +480,20 @@ static int stk3310_init(struct iio_dev *indio_dev) if (chipid != STK3310_CHIP_ID_VAL && chipid != STK3311_CHIP_ID_VAL && chipid != STK3311X_CHIP_ID_VAL && - chipid != STK3335_CHIP_ID_VAL) { - dev_err(>dev, "invalid chip id: 0x%x\n", chipid); - return -ENODEV; - } + chipid != STK3335_CHIP_ID_VAL) + return dev_err_probe(>dev, -ENODEV, +"invalid chip id: 0x%x\n", chipid); state = STK3310_STATE_EN_ALS | STK3310_STATE_EN_PS; ret = stk3310_set_state(data, state); - if (ret < 0) { - dev_err(>dev, "failed to enable sensor"); - return ret; - } + if (ret < 0) + return dev_err_probe(>dev, ret, "failed to enable sensor\n"); /* Enable PS interrupts */ ret = regmap_field_write(data->reg_int_ps, STK3310_PSINT_EN); if (ret < 0) - dev_err(>dev, "failed to enable interrupts!\n"); + return dev_err_probe(>dev, ret, +"failed to enable interrupts!\n"); return ret; } @@ -530,10 +528,10 @@ static int stk3310_regmap_init(struct stk3310_data *data) client = data->client; regmap = devm_regmap_init_i2c(client, _regmap_config); - if (IS_ERR(regmap)) { - dev_err(>dev, "regmap initialization 
failed.\n"); - return PTR_ERR(regmap); - } + if (IS_ERR(regmap)) + return dev_err_probe(>dev, PTR_ERR(regmap), +"regmap initialization failed.\n"); + data->regmap = regmap; STK3310_REGFIELD(state); @@ -597,10 +595,8 @@ static int stk3310_probe(struct i2c_client *client) struct stk3310_data *data; indio_dev = devm_iio_device_alloc(>dev, sizeof(*data)); - if (!indio_dev) { - dev_err(>dev, "iio allocation failed!\n"); - return -ENOMEM; - } + if (!indio_dev) + return dev_err_probe(>dev, -ENOMEM, "iio allocation failed!\n"); data = iio_priv(indio_dev); data->client = client; @@ -655,15 +651,15 @@ static int stk3310_probe(struct i2c_client *client) IRQF_ONESHOT, STK3310_EVENT, indio_dev); if (ret < 0) { - dev_err(>dev, "request irq %d failed\n", - client->irq); + dev_err_probe(>dev, ret, "request irq %d failed\n", + client->irq); goto err_standby; } } ret = iio_device_register(indio_dev); if (ret < 0) { - dev_err(>dev, "device_register failed\n"); + dev_err_probe(>dev, ret, "device_register failed\n"); goto err_standby; } -- 2.44.0
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote: > > > It's a workaround because you use the capacity==0 but it does not really > > > mean to disable the misc cgroup for specific resource IIUC. > > > > Please read the comment around @misc_res_capacity again: > > > > * Miscellaneous resources capacity for the entire machine. 0 capacity > > * means resource is not initialized or not present in the host. > > > > I mentioned this in earlier email. I think this means no SGX EPC. It does > not mean sgx epc cgroup not enabled. That's also consistent with the > behavior try_charge() fails if capacity is zero. OK. To me the "capacity" is purely the concept of cgroup, so it must be interpreted within the scope of "cgroup". If cgroup, in our case, SGX cgroup, is disabled, then whether "leaving the capacity to reflect the presence of hardware resource" doesn't really matter. So what you are saying is that, the kernel must initialize the capacity of some MISC resource once it is added as valid type. And you must initialize the "capacity" even MISC cgroup is disabled entirely by kernel commandline, in which case, IIUC, misc.capacity is not even going to show in the /fs. If this is your point, then your patch: cgroup/misc: Add SGX EPC resource type is already broken, because you added the new type w/o initializing the capacity. Please fix that up. > > > > > > > There is explicit way for user to disable misc without setting capacity > > > to > > > zero. > > > > Which way are you talking about? > > Echo "-misc" to cgroup.subtree_control at root level for example still > shows non-zero sgx_epc capacity. I guess "having to disable all MISC resources just in order to disable SGX EPC cgroup" is a brilliant idea. You can easily disable the entire MISC cgroup by commandline for that purpose if that's acceptable. And I have no idea why "still showing non-zero EPC capacity" is important if SGX cgroup cannot be supported at all.
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Tue Apr 23, 2024 at 8:08 PM EEST, Reinette Chatre wrote: > Hi Kai, > > On 4/23/2024 4:50 AM, Huang, Kai wrote: > >> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > >> b/arch/x86/kernel/cpu/sgx/ioctl.c > >> index b65ab214bdf5..2340a82fa796 100644 > >> --- a/arch/x86/kernel/cpu/sgx/ioctl.c > >> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > >> @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > >>} > >> > >>mutex_unlock(>lock); > >> + > >> + if (need_resched()) > >> + cond_resched(); > >>} > >> > >>ret = 0; > >> @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > >> *encl, > >>entry->type = page_type; > >> > >>mutex_unlock(>lock); > >> + > >> + if (need_resched()) > >> + cond_resched(); > >>} > >> > >>ret = 0; > >> @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl > >> *encl, > >>kfree(entry); > >> > >>mutex_unlock(>lock); > >> + > >> + if (need_resched()) > >> + cond_resched(); > >>} > >> > > > > You can remove the need_reshced() in all 3 places above but just call > > cond_resched() directly. > > > > This change will call cond_resched() after dealing with each page in a > potentially large page range (cover mentions 30GB but we have also had to > make optimizations for enclaves larger than this). Adding a cond_resched() > here will surely placate the soft lockup detector, but we need to take care > how changes like this impact the performance of the system and having actions > on these page ranges take much longer than necessary. > For reference, please see 7b72c823ddf8 ("x86/sgx: Reduce delay and > interference > of enclave release") that turned frequent cond_resched() into batches > to address performance issues. > > It looks to me like the need_resched() may be a quick check that can be used > to improve performance? I am not familiar with all use cases that need to be > considered to determine if a batching solution may be needed. 
Ya, well, no matter what, the reasoning will need to be documented because this should have symmetry with sgx_ioc_enclave_add_pages() (see my response to Kai). Because this makes dealing with need_resched() a change in code even if it is left out as a side-effect, I'd support not removing it, which means adding need_resched() as a side-effect. From this follows that *if* need_resched() needs to be removed then that is not really part of the bug fix, so in all cases the bug fix itself must include need_resched() :-) Phew, hope you got my logic here, I think it's reasonable... BR, Jarkko
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Tue Apr 23, 2024 at 2:50 PM EEST, Huang, Kai wrote: > On Tue, 2024-04-23 at 17:25 +0800, 朱伯君(杰铭) wrote: > > EDMM's ioctl()s support batch operations, which may be > > time-consuming. Try to explicitly give up the CPU at > > the every end of "for loop" in > > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > > to give other tasks a chance to run, and avoid softlockup warning. > > > > The following has been observed on Linux v6.9-rc5 with kernel > > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > > is requested to restrict page permissions of a large number of EPC pages. > > > > [ cut here ] > > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] > > ... > > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 > > ... > > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e > > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 > > 40 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef > > RSP: 0018:b55a6591fa80 EFLAGS: 0202 > > RAX: RBX: b55a6591fac0 RCX: b581e7384000 > > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 > > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 > > R10: 006e R11: 0002 R12: 00072052d000 > > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 > > FS: 7fe10dbda740() GS:92163e48() > > knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 > > DR0: DR1: DR2: > > DR3: DR6: fffe07f0 DR7: 0400 > > PKRU: 5554 > > Call Trace: > > > > ? show_regs+0x67/0x70 > > ? watchdog_timer_fn+0x1f3/0x280 > > ? __pfx_watchdog_timer_fn+0x10/0x10 > > ? __hrtimer_run_queues+0xc8/0x220 > > ? hrtimer_interrupt+0x10c/0x250 > > ? __sysvec_apic_timer_interrupt+0x53/0x130 > > ? sysvec_apic_timer_interrupt+0x7b/0x90 > > > > > > ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > > ? sgx_enclave_restrict_permissions+0xba/0x1f0 > > ? __pte_offset_map_lock+0x94/0x110 > > ? 
sgx_encl_test_and_clear_young_cb+0x40/0x60 > > sgx_ioctl+0x1ab/0x900 > > ? do_syscall_64+0x79/0x110 > > ? apply_to_page_range+0x14/0x20 > > ? sgx_encl_test_and_clear_young+0x6c/0x80 > > ? sgx_vma_fault+0x132/0x4f0 > > __x64_sys_ioctl+0x95/0xd0 > > x64_sys_call+0x1209/0x20c0 > > do_syscall_64+0x6d/0x110 > > ? do_syscall_64+0x79/0x110 > > ? do_pte_missing+0x2e8/0xcc0 > > ? __pte_offset_map+0x1c/0x190 > > ? __handle_mm_fault+0x7b9/0xe60 > > ? __count_memcg_events+0x70/0x100 > > ? handle_mm_fault+0x256/0x360 > > ? do_user_addr_fault+0x3c1/0x860 > > ? irqentry_exit_to_user_mode+0x67/0x190 > > ? irqentry_exit+0x3b/0x50 > > ? exc_page_fault+0x89/0x180 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > RIP: 0033:0x7fe10e2ee5cb > > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff > > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 > > ff ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 > > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 > > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb > > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 > > RBP: 0005 R08: R09: 7fffb2c75594 > > R10: 7fffb2c755c8 R11: 0246 R12: c028a405 > > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 > > > > [ end trace ] > > Could you trim down the trace to only include the relevant part? > > E.g., please at least remove the two register dumps at the beginning and > end of the trace. > > Please refer to "Backtraces in commit messages" section in > Documentation/process/submitting-patches.rst. 
> > > > > Signed-off-by: Bojun Zhu > > --- > > arch/x86/kernel/cpu/sgx/ioctl.c | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > > b/arch/x86/kernel/cpu/sgx/ioctl.c > > index b65ab214bdf5..2340a82fa796 100644 > > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > > } > > > > mutex_unlock(>lock); > > + > > + if (need_resched()) > > + cond_resched(); > > } > > > > ret = 0; > > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > > *encl, > >
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Wed Apr 24, 2024 at 12:10 AM EEST, Jarkko Sakkinen wrote: > On Tue Apr 23, 2024 at 12:25 PM EEST, =?UTF-8?B?5pyx5Lyv5ZCbKOadsOmTrSk=?= > wrote: > > EDMM's ioctl()s support batch operations, which may be > > time-consuming. Try to explicitly give up the CPU at > > the every end of "for loop" in > > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > > to give other tasks a chance to run, and avoid softlockup warning. > > > > The following has been observed on Linux v6.9-rc5 with kernel > > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > > is requested to restrict page permissions of a large number of EPC pages. > > > > [ cut here ] > > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] > > ... > > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 > > ... > > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e > > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 > > 40 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef > > RSP: 0018:b55a6591fa80 EFLAGS: 0202 > > RAX: RBX: b55a6591fac0 RCX: b581e7384000 > > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 > > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 > > R10: 006e R11: 0002 R12: 00072052d000 > > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 > > FS: 7fe10dbda740() GS:92163e48() > > knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 > > DR0: DR1: DR2: > > DR3: DR6: fffe07f0 DR7: 0400 > > PKRU: 5554 > > Call Trace: > > > > ? show_regs+0x67/0x70 > > ? watchdog_timer_fn+0x1f3/0x280 > > ? __pfx_watchdog_timer_fn+0x10/0x10 > > ? __hrtimer_run_queues+0xc8/0x220 > > ? hrtimer_interrupt+0x10c/0x250 > > ? __sysvec_apic_timer_interrupt+0x53/0x130 > > ? sysvec_apic_timer_interrupt+0x7b/0x90 > > > > > > ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > > ? sgx_enclave_restrict_permissions+0xba/0x1f0 > > ? 
__pte_offset_map_lock+0x94/0x110 > > ? sgx_encl_test_and_clear_young_cb+0x40/0x60 > > sgx_ioctl+0x1ab/0x900 > > ? do_syscall_64+0x79/0x110 > > ? apply_to_page_range+0x14/0x20 > > ? sgx_encl_test_and_clear_young+0x6c/0x80 > > ? sgx_vma_fault+0x132/0x4f0 > > __x64_sys_ioctl+0x95/0xd0 > > x64_sys_call+0x1209/0x20c0 > > do_syscall_64+0x6d/0x110 > > ? do_syscall_64+0x79/0x110 > > ? do_pte_missing+0x2e8/0xcc0 > > ? __pte_offset_map+0x1c/0x190 > > ? __handle_mm_fault+0x7b9/0xe60 > > ? __count_memcg_events+0x70/0x100 > > ? handle_mm_fault+0x256/0x360 > > ? do_user_addr_fault+0x3c1/0x860 > > ? irqentry_exit_to_user_mode+0x67/0x190 > > ? irqentry_exit+0x3b/0x50 > > ? exc_page_fault+0x89/0x180 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > RIP: 0033:0x7fe10e2ee5cb > > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff > > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 > > ff ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 > > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 > > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb > > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 > > RBP: 0005 R08: R09: 7fffb2c75594 > > R10: 7fffb2c755c8 R11: 0246 R12: c028a405 > > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 > > > > [ end trace ] > > > > Signed-off-by: Bojun Zhu > > Can you also fixup this as your "firstname lastname" in your emails > from field? This matters so that author field in git log matches your > sob. 
> > > --- > > arch/x86/kernel/cpu/sgx/ioctl.c | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > > b/arch/x86/kernel/cpu/sgx/ioctl.c > > index b65ab214bdf5..2340a82fa796 100644 > > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > > } > > > > mutex_unlock(>lock); > > + > > + if (need_resched()) > > + cond_resched(); > > } > > > > ret = 0; > > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > > *encl, > > entry->type = page_type; > > > > mutex_unlock(>lock); > > + > > + if
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Tue Apr 23, 2024 at 12:25 PM EEST, =?UTF-8?B?5pyx5Lyv5ZCbKOadsOmTrSk=?= wrote: > EDMM's ioctl()s support batch operations, which may be > time-consuming. Try to explicitly give up the CPU at > the every end of "for loop" in > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > to give other tasks a chance to run, and avoid softlockup warning. > > The following has been observed on Linux v6.9-rc5 with kernel > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > is requested to restrict page permissions of a large number of EPC pages. > > [ cut here ] > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] > ... > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 > ... > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 40 > 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef > RSP: 0018:b55a6591fa80 EFLAGS: 0202 > RAX: RBX: b55a6591fac0 RCX: b581e7384000 > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 > R10: 006e R11: 0002 R12: 00072052d000 > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 > FS: 7fe10dbda740() GS:92163e48() > knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 > DR0: DR1: DR2: > DR3: DR6: fffe07f0 DR7: 0400 > PKRU: 5554 > Call Trace: > > ? show_regs+0x67/0x70 > ? watchdog_timer_fn+0x1f3/0x280 > ? __pfx_watchdog_timer_fn+0x10/0x10 > ? __hrtimer_run_queues+0xc8/0x220 > ? hrtimer_interrupt+0x10c/0x250 > ? __sysvec_apic_timer_interrupt+0x53/0x130 > ? sysvec_apic_timer_interrupt+0x7b/0x90 > > > ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > ? sgx_enclave_restrict_permissions+0xba/0x1f0 > ? __pte_offset_map_lock+0x94/0x110 > ? sgx_encl_test_and_clear_young_cb+0x40/0x60 > sgx_ioctl+0x1ab/0x900 > ? do_syscall_64+0x79/0x110 > ? apply_to_page_range+0x14/0x20 > ? 
sgx_encl_test_and_clear_young+0x6c/0x80 > ? sgx_vma_fault+0x132/0x4f0 > __x64_sys_ioctl+0x95/0xd0 > x64_sys_call+0x1209/0x20c0 > do_syscall_64+0x6d/0x110 > ? do_syscall_64+0x79/0x110 > ? do_pte_missing+0x2e8/0xcc0 > ? __pte_offset_map+0x1c/0x190 > ? __handle_mm_fault+0x7b9/0xe60 > ? __count_memcg_events+0x70/0x100 > ? handle_mm_fault+0x256/0x360 > ? do_user_addr_fault+0x3c1/0x860 > ? irqentry_exit_to_user_mode+0x67/0x190 > ? irqentry_exit+0x3b/0x50 > ? exc_page_fault+0x89/0x180 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7fe10e2ee5cb > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff > ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 > RBP: 0005 R08: R09: 7fffb2c75594 > R10: 7fffb2c755c8 R11: 0246 R12: c028a405 > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 > > [ end trace ] > > Signed-off-by: Bojun Zhu Can you also fixup this as your "firstname lastname" in your emails from field? This matters so that author field in git log matches your sob. 
> --- > arch/x86/kernel/cpu/sgx/ioctl.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c > index b65ab214bdf5..2340a82fa796 100644 > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > } > > mutex_unlock(>lock); > + > + if (need_resched()) > + cond_resched(); > } > > ret = 0; > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > *encl, > entry->type = page_type; > > mutex_unlock(>lock); > + > + if (need_resched()) > + cond_resched(); > } > > ret = 0; > @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl, > kfree(entry); > > mutex_unlock(>lock); > + > + if
Re: [PATCH] drivers: remoteproc: xlnx: fix uninitialized variable use
On Tue, Apr 23, 2024 at 10:02:11AM -0700, Tanmay Shah wrote: > Fix following warning for clang compiler with W=1 option: > initialize the variable 'ret' to silence this warning > 907 | int ret, i; > |^ > | = 0 > > Fixes: a6b974b40f94 ("drivers: remoteproc: xlnx: Add Versal and Versal-NET > support") > Reported-by: kernel test robot > Closes: > https://lore.kernel.org/oe-kbuild-all/202404231839.ohiy9lw8-...@intel.com/ > Signed-off-by: Tanmay Shah > --- > drivers/remoteproc/xlnx_r5_remoteproc.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c > b/drivers/remoteproc/xlnx_r5_remoteproc.c > index a6d8ac7394e7..d98940d7ef8f 100644 > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c > @@ -904,7 +904,7 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster > *cluster, > { > struct device *dev = cluster->dev; > struct zynqmp_r5_core *r5_core; > - int ret, i; > + int ret = -EINVAL, i; > Applied - thanks, Mathieu > r5_core = cluster->r5_cores[0]; > > > base-commit: e99fcac055b3325283d6c5c61a117651fb147686 > -- > 2.25.1 >
Re: [PATCH v5 00/15] mm: jit/text allocator
On Mon, Apr 22, 2024 at 12:44:21PM +0300, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" > > (something went wrong with the previous posting, sorry for the noise) > > Hi, > > Since v3 I looked into making execmem more of a utility toolbox, as we > discussed at LPC with Mark Rutland, but it was getting hairier than > having a struct describing architecture constraints and a type identifying > the consumer of execmem. > > And I do think that having the description of architecture constraints for > allocations of executable memory in a single place is better than having it > spread all over the place. > > The patches are available via git: > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v5 Thanks! I've merged and pushed this onto modules-next in its entirety now for wider testing. Luis
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
Hi Kai, On 4/23/2024 4:50 AM, Huang, Kai wrote: >> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c >> b/arch/x86/kernel/cpu/sgx/ioctl.c >> index b65ab214bdf5..2340a82fa796 100644 >> --- a/arch/x86/kernel/cpu/sgx/ioctl.c >> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c >> @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, >> } >> >> mutex_unlock(>lock); >> + >> +if (need_resched()) >> +cond_resched(); >> } >> >> ret = 0; >> @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl >> *encl, >> entry->type = page_type; >> >> mutex_unlock(>lock); >> + >> +if (need_resched()) >> +cond_resched(); >> } >> >> ret = 0; >> @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl >> *encl, >> kfree(entry); >> >> mutex_unlock(>lock); >> + >> +if (need_resched()) >> +cond_resched(); >> } >> > > You can remove the need_reshced() in all 3 places above but just call > cond_resched() directly. > This change will call cond_resched() after dealing with each page in a potentially large page range (cover mentions 30GB but we have also had to make optimizations for enclaves larger than this). Adding a cond_resched() here will surely placate the soft lockup detector, but we need to take care how changes like this impact the performance of the system and having actions on these page ranges take much longer than necessary. For reference, please see 7b72c823ddf8 ("x86/sgx: Reduce delay and interference of enclave release") that turned frequent cond_resched() into batches to address performance issues. It looks to me like the need_resched() may be a quick check that can be used to improve performance? I am not familiar with all use cases that need to be considered to determine if a batching solution may be needed. Reinette
[PATCH] drivers: remoteproc: xlnx: fix uninitialized variable use
Fix following warning for clang compiler with W=1 option: initialize the variable 'ret' to silence this warning 907 | int ret, i; |^ | = 0 Fixes: a6b974b40f94 ("drivers: remoteproc: xlnx: Add Versal and Versal-NET support") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202404231839.ohiy9lw8-...@intel.com/ Signed-off-by: Tanmay Shah --- drivers/remoteproc/xlnx_r5_remoteproc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c index a6d8ac7394e7..d98940d7ef8f 100644 --- a/drivers/remoteproc/xlnx_r5_remoteproc.c +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c @@ -904,7 +904,7 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster *cluster, { struct device *dev = cluster->dev; struct zynqmp_r5_core *r5_core; - int ret, i; + int ret = -EINVAL, i; r5_core = cluster->r5_cores[0]; base-commit: e99fcac055b3325283d6c5c61a117651fb147686 -- 2.25.1
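For readers unfamiliar with this warning class: when an if/else-if chain has no final else, a local `ret` can reach the `return` without ever being assigned. The fragment below is a standalone illustration of the pattern and the chosen fix; the two flags and the `0` returns are hypothetical stand-ins for the of_find_property()/device_is_compatible() checks and the TCM-lookup helpers in the real driver.

```c
#include <errno.h>

/*
 * Illustrative stand-in for zynqmp_r5_core_init()'s branch structure.
 * has_reg / is_zynqmp model the property/compatible checks; the 0
 * returns model the TCM lookup helpers succeeding.
 */
static int core_init_sketch(int has_reg, int is_zynqmp)
{
	int ret = -EINVAL;	/* the fix: a defined value if no branch runs */

	if (has_reg)
		ret = 0;	/* ~ zynqmp_r5_get_tcm_node_from_dt() */
	else if (is_zynqmp)
		ret = 0;	/* ~ zynqmp_r5_get_tcm_node() */

	return ret;		/* without the initializer, this read is undefined */
}
```

Initializing to -EINVAL rather than clang's suggested `= 0` also gives the "neither branch applied" case a meaningful error code instead of silently reporting success.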
Re: [PATCH v3 4/4] arm64: dts: qcom: msm8976: Add WCNSS node
On 4/13/24 19:03, Adam Skladowski wrote: Add a node describing the wireless connectivity subsystem. Signed-off-by: Adam Skladowski --- [...] + wcnss_wifi: wifi { + compatible = "qcom,wcnss-wlan"; + + interrupts = , +; + interrupt-names = "tx", "rx"; + + qcom,smem-states = <_smsm 10>, <_smsm 9>; + qcom,smem-state-names = "tx-enable", + "tx-rings-empty"; Since you're already going to resend, please keep one entry per line; this fragment is very inconsistent. Looks good otherwise. Konrad
Re: [PATCH v5 04/15] sparc: simplify module_alloc()
Hi Mike, On Mon, Apr 22, 2024 at 12:44:25PM +0300, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" > > Define MODULES_VADDR and MODULES_END as VMALLOC_START and VMALLOC_END > for 32-bit and reduce module_alloc() to > > __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, ...) > > as with the new defines the allocations become identical for both 32 > and 64 bits. > > While at it, drop the unused include of > > Suggested-by: Sam Ravnborg > Signed-off-by: Mike Rapoport (IBM) Looks good. Reviewed-by: Sam Ravnborg
Re: [PATCH v3 3/4] arm64: dts: qcom: msm8976: Add Adreno GPU
On 4/13/24 19:03, Adam Skladowski wrote: Add Adreno GPU node. Signed-off-by: Adam Skladowski --- [...] + opp-2 { + opp-hz = /bits/ 64 <2>; + required-opps = <_opp_low_svs>; + opp-supported-hw = <0xff>; Doesn't look like this platform had any speed binning going on, can drop? Konrad
Re: [PATCH v3 2/4] arm64: dts: qcom: msm8976: Add MDSS nodes
On 4/13/24 19:03, Adam Skladowski wrote: Add MDSS nodes to support displays on the MSM8976 SoC. Signed-off-by: Adam Skladowski --- [...] + + mdp_opp_table: opp-table { + compatible = "operating-points-v2"; + + opp-17778 { + opp-hz = /bits/ 64 <17778>; + required-opps = <_opp_svs>; + }; + + opp-27000 { + opp-hz = /bits/ 64 <27000>; + required-opps = <_opp_svs_plus>; + }; + + opp-32000 { + opp-hz = /bits/ 64 <32000>; + required-opps = <_opp_nom>; + }; + opp-36000 { Missing a newline above [...] + dsi0_opp_table: opp-table { + compatible = "operating-points-v2"; + + opp-12500 { + opp-hz = /bits/ 64 <12500>; + required-opps = <_opp_svs>; + + }; You can borrow it from here. Looks reasonable otherwise. Konrad
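For what it's worth, the two whitespace nits combined — a blank line separating sibling opp nodes, and no stray blank line inside a node — would leave the tables looking roughly like this. This is a sketch only; the rpmpd phandle labels were stripped by the list archive, so the ones below are placeholders:

```dts
mdp_opp_table: opp-table {
	compatible = "operating-points-v2";

	opp-32000 {
		opp-hz = /bits/ 64 <32000>;
		required-opps = <&rpmpd_opp_nom>;	/* placeholder label */
	};

	opp-36000 {	/* blank line above the node, none inside it */
		opp-hz = /bits/ 64 <36000>;
		required-opps = <&rpmpd_opp_nom>;	/* placeholder label */
	};
};
```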
[PATCH v2 1/2] tracing/user_events: Fix non-spaced field matching
When the ABI was updated to prevent same name w/different args, it missed an important corner case when fields don't end with a space. Typically, space is used for fields to help separate them, like "u8 field1; u8 field2". If no spaces are used, like "u8 field1;u8 field2", then the parsing works for the first time. However, the match check fails on a subsequent register, leading to confusion. This is because the match check uses argv_split() and assumes that all fields will be split upon the space. When spaces are used, we get back { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }. This causes a mismatch, and the user program gets back -EADDRINUSE. Add a method to detect this case before calling argv_split(). If found force a space after the field separator character ';'. This ensures all cases work properly for matching. With this fix, the following are all treated as matching: u8 field1;u8 field2 u8 field1; u8 field2 u8 field1;\tu8 field2 u8 field1;\nu8 field2 Fixes: ba470eebc2f6 ("tracing/user_events: Prevent same name but different args event") Signed-off-by: Beau Belgrave --- kernel/trace/trace_events_user.c | 76 +++- 1 file changed, 75 insertions(+), 1 deletion(-) diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c index 70d428c394b6..82b191f33a28 100644 --- a/kernel/trace/trace_events_user.c +++ b/kernel/trace/trace_events_user.c @@ -1989,6 +1989,80 @@ static int user_event_set_tp_name(struct user_event *user) return 0; } +/* + * Counts how many ';' without a trailing space are in the args. + */ +static int count_semis_no_space(char *args) +{ + int count = 0; + + while ((args = strchr(args, ';'))) { + args++; + + if (!isspace(*args)) + count++; + } + + return count; +} + +/* + * Copies the arguments while ensuring all ';' have a trailing space. 
+ */ +static char *insert_space_after_semis(char *args, int count) +{ + char *fixed, *pos; + int len; + + len = strlen(args) + count; + fixed = kmalloc(len + 1, GFP_KERNEL); + + if (!fixed) + return NULL; + + pos = fixed; + + /* Insert a space after ';' if there is no trailing space. */ + while (*args) { + *pos = *args++; + + if (*pos++ == ';' && !isspace(*args)) + *pos++ = ' '; + } + + *pos = '\0'; + + return fixed; +} + +static char **user_event_argv_split(char *args, int *argc) +{ + char **split; + char *fixed; + int count; + + /* Count how many ';' without a trailing space */ + count = count_semis_no_space(args); + + /* No fixup is required */ + if (!count) + return argv_split(GFP_KERNEL, args, argc); + + /* We must fixup 'field;field' to 'field; field' */ + fixed = insert_space_after_semis(args, count); + + if (!fixed) + return NULL; + + /* We do a normal split afterwards */ + split = argv_split(GFP_KERNEL, fixed, argc); + + /* We can free since argv_split makes a copy */ + kfree(fixed); + + return split; +} + /* * Parses the event name, arguments and flags then registers if successful. * The name buffer lifetime is owned by this method for success cases only. @@ -2012,7 +2086,7 @@ static int user_event_parse(struct user_event_group *group, char *name, return -EPERM; if (args) { - argv = argv_split(GFP_KERNEL, args, ); + argv = user_event_argv_split(args, ); if (!argv) return -ENOMEM; -- 2.34.1
[PATCH v2 2/2] selftests/user_events: Add non-spacing separator check
The ABI documentation indicates that field separators do not need a space between them, only a ';'. When no spacing is used, the register must work. Any subsequent register, with or without spaces, must match and not return -EADDRINUSE. Add a non-spacing separator case to our self-test register case to ensure it works going forward. Signed-off-by: Beau Belgrave --- tools/testing/selftests/user_events/ftrace_test.c | 8 1 file changed, 8 insertions(+) diff --git a/tools/testing/selftests/user_events/ftrace_test.c b/tools/testing/selftests/user_events/ftrace_test.c index dcd7509fe2e0..0bb46793dcd4 100644 --- a/tools/testing/selftests/user_events/ftrace_test.c +++ b/tools/testing/selftests/user_events/ftrace_test.c @@ -261,6 +261,12 @@ TEST_F(user, register_events) { ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSREG, )); ASSERT_EQ(0, reg.write_index); + /* Register without separator spacing should still match */ + reg.enable_bit = 29; + reg.name_args = (__u64)"__test_event u32 field1;u32 field2"; + ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSREG, )); + ASSERT_EQ(0, reg.write_index); + /* Multiple registers to same name but different args should fail */ reg.enable_bit = 29; reg.name_args = (__u64)"__test_event u32 field1;"; @@ -288,6 +294,8 @@ TEST_F(user, register_events) { ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSUNREG, )); unreg.disable_bit = 30; ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSUNREG, )); + unreg.disable_bit = 29; + ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSUNREG, )); /* Delete should have been auto-done after close and unregister */ close(self->data_fd); -- 2.34.1
[PATCH v2 0/2] tracing/user_events: Fix non-spaced field matching
When the ABI was updated to prevent same name w/different args, it missed an important corner case when fields don't end with a space. Typically, space is used for fields to help separate them, like "u8 field1; u8 field2". If no spaces are used, like "u8 field1;u8 field2", then the parsing works for the first time. However, the match check fails on a subsequent register, leading to confusion. This is because the match check uses argv_split() and assumes that all fields will be split upon the space. When spaces are used, we get back { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }. This causes a mismatch, and the user program gets back -EADDRINUSE. Add a method to detect this case before calling argv_split(). If found force a space after the field separator character ';'. This ensures all cases work properly for matching. I could not find an existing function to accomplish this, so I had to hand code a copy with this logic. If there is a better way to achieve this, I'm all ears. This series also adds a selftest to ensure this doesn't break again. With this fix, the following are all treated as matching: u8 field1;u8 field2 u8 field1; u8 field2 u8 field1;\tu8 field2 u8 field1;\nu8 field2 V2 changes: Renamed fix_semis_no_space() to insert_space_after_semis(). Have user_event_argv_split() return fast in no-split case. Pulled in Masami's shorter loop in insert_space_after_semis(). Beau Belgrave (2): tracing/user_events: Fix non-spaced field matching selftests/user_events: Add non-spacing separator check kernel/trace/trace_events_user.c | 76 ++- .../selftests/user_events/ftrace_test.c | 8 ++ 2 files changed, 83 insertions(+), 1 deletion(-) base-commit: 0bbac3facb5d6cc0171c45c9873a2dc96bea9680 -- 2.34.1
Re: [PATCH] drivers: remoteproc: xlnx: Add Versal and Versal-NET support
On 4/23/24 11:06 AM, Nathan Chancellor wrote: > Hi Tanmay, > > On Thu, Apr 18, 2024 at 03:01:25PM -0700, Tanmay Shah wrote: >> AMD-Xilinx Versal platform is successor of ZynqMP platform. >> Real-time Processing Unit R5 cluster IP on Versal is same as >> of ZynqMP Platform. Power-domains ids for Versal platform is >> different than ZynqMP. >> >> AMD-Xilinx Versal-NET platform is successor of Versal platform. >> Versal-NET Real-Time Processing Unit has two clusters and each >> cluster contains dual core ARM Cortex-R52 processors. Each R52 >> core is assigned 128KB of TCM memory. >> >> Signed-off-by: Tanmay Shah >> --- >> drivers/remoteproc/xlnx_r5_remoteproc.c | 53 - >> 1 file changed, 17 insertions(+), 36 deletions(-) >> >> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c >> b/drivers/remoteproc/xlnx_r5_remoteproc.c >> index 7b1c12108bff..a6d8ac7394e7 100644 >> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c >> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c >> @@ -300,36 +300,6 @@ static void zynqmp_r5_rproc_kick(struct rproc *rproc, >> int vqid) >> dev_warn(dev, "failed to send message\n"); >> } >> >> -/* >> - * zynqmp_r5_set_mode() >> - * >> - * set RPU cluster and TCM operation mode >> - * >> - * @r5_core: pointer to zynqmp_r5_core type object >> - * @fw_reg_val: value expected by firmware to configure RPU cluster mode >> - * @tcm_mode: value expected by fw to configure TCM mode (lockstep or split) >> - * >> - * Return: 0 for success and < 0 for failure >> - */ >> -static int zynqmp_r5_set_mode(struct zynqmp_r5_core *r5_core, >> - enum rpu_oper_mode fw_reg_val, >> - enum rpu_tcm_comb tcm_mode) >> -{ >> -int ret; >> - >> -ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val); >> -if (ret < 0) { >> -dev_err(r5_core->dev, "failed to set RPU mode\n"); >> -return ret; >> -} >> - >> -ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode); >> -if (ret < 0) >> -dev_err(r5_core->dev, "failed to configure TCM\n"); >> - >> -return ret; >> -} >> - >> /* 
>> * zynqmp_r5_rproc_start() >> * @rproc: single R5 core's corresponding rproc instance >> @@ -941,7 +911,7 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster >> *cluster, >> /* Maintain backward compatibility for zynqmp by using hardcode TCM >> address. */ >> if (of_find_property(r5_core->np, "reg", NULL)) >> ret = zynqmp_r5_get_tcm_node_from_dt(cluster); >> -else >> +else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) >> ret = zynqmp_r5_get_tcm_node(cluster); > > This change breaks the build with clang: > > drivers/remoteproc/xlnx_r5_remoteproc.c:914:11: error: variable 'ret' is > used uninitialized whenever 'if' condition is false > [-Werror,-Wsometimes-uninitialized] > 914 | else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) > | ^~ > drivers/remoteproc/xlnx_r5_remoteproc.c:917:6: note: uninitialized use > occurs here > 917 | if (ret) { > | ^~~ > drivers/remoteproc/xlnx_r5_remoteproc.c:914:7: note: remove the 'if' if its > condition is always true > 914 | else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) > | ^~~ > 915 | ret = zynqmp_r5_get_tcm_node(cluster); > drivers/remoteproc/xlnx_r5_remoteproc.c:907:9: note: initialize the > variable 'ret' to silence this warning > 907 | int ret, i; > |^ > | = 0 > 1 error generated. > > Should ret be initialized to zero or should there be an else statement? Hello, Thanks for analysis. ret should be initialized with -EINVAL, so else can be avoided. I will send patch by EOD. 
Thanks, Tanmay > > Cheers, > Nathan > >> >> if (ret) { >> @@ -960,12 +930,21 @@ static int zynqmp_r5_core_init(struct >> zynqmp_r5_cluster *cluster, >> return ret; >> } >> >> -ret = zynqmp_r5_set_mode(r5_core, fw_reg_val, tcm_mode); >> -if (ret) { >> -dev_err(dev, "failed to set r5 cluster mode %d, err >> %d\n", >> -cluster->mode, ret); >> +ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val); >> +if (ret < 0) { >> +dev_err(r5_core->dev, "failed to set RPU mode\n"); >> return ret; >> } >> + >> +if (of_find_property(dev_of_node(dev), "xlnx,tcm-mode", NULL) || >> +device_is_compatible(dev, "xlnx,zynqmp-r5fss")) { >> +ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, >> + tcm_mode); >> +
Re: [PATCH] drivers: remoteproc: xlnx: Add Versal and Versal-NET support
Hi Tanmay, On Thu, Apr 18, 2024 at 03:01:25PM -0700, Tanmay Shah wrote: > AMD-Xilinx Versal platform is successor of ZynqMP platform. > Real-time Processing Unit R5 cluster IP on Versal is same as > of ZynqMP Platform. Power-domains ids for Versal platform is > different than ZynqMP. > > AMD-Xilinx Versal-NET platform is successor of Versal platform. > Versal-NET Real-Time Processing Unit has two clusters and each > cluster contains dual core ARM Cortex-R52 processors. Each R52 > core is assigned 128KB of TCM memory. > > Signed-off-by: Tanmay Shah > --- > drivers/remoteproc/xlnx_r5_remoteproc.c | 53 - > 1 file changed, 17 insertions(+), 36 deletions(-) > > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c > b/drivers/remoteproc/xlnx_r5_remoteproc.c > index 7b1c12108bff..a6d8ac7394e7 100644 > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c > @@ -300,36 +300,6 @@ static void zynqmp_r5_rproc_kick(struct rproc *rproc, > int vqid) > dev_warn(dev, "failed to send message\n"); > } > > -/* > - * zynqmp_r5_set_mode() > - * > - * set RPU cluster and TCM operation mode > - * > - * @r5_core: pointer to zynqmp_r5_core type object > - * @fw_reg_val: value expected by firmware to configure RPU cluster mode > - * @tcm_mode: value expected by fw to configure TCM mode (lockstep or split) > - * > - * Return: 0 for success and < 0 for failure > - */ > -static int zynqmp_r5_set_mode(struct zynqmp_r5_core *r5_core, > - enum rpu_oper_mode fw_reg_val, > - enum rpu_tcm_comb tcm_mode) > -{ > - int ret; > - > - ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val); > - if (ret < 0) { > - dev_err(r5_core->dev, "failed to set RPU mode\n"); > - return ret; > - } > - > - ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode); > - if (ret < 0) > - dev_err(r5_core->dev, "failed to configure TCM\n"); > - > - return ret; > -} > - > /* > * zynqmp_r5_rproc_start() > * @rproc: single R5 core's corresponding rproc instance > @@ -941,7 
+911,7 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster > *cluster, > /* Maintain backward compatibility for zynqmp by using hardcode TCM > address. */ > if (of_find_property(r5_core->np, "reg", NULL)) > ret = zynqmp_r5_get_tcm_node_from_dt(cluster); > - else > + else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) > ret = zynqmp_r5_get_tcm_node(cluster); This change breaks the build with clang: drivers/remoteproc/xlnx_r5_remoteproc.c:914:11: error: variable 'ret' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] 914 | else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) | ^~ drivers/remoteproc/xlnx_r5_remoteproc.c:917:6: note: uninitialized use occurs here 917 | if (ret) { | ^~~ drivers/remoteproc/xlnx_r5_remoteproc.c:914:7: note: remove the 'if' if its condition is always true 914 | else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) | ^~~ 915 | ret = zynqmp_r5_get_tcm_node(cluster); drivers/remoteproc/xlnx_r5_remoteproc.c:907:9: note: initialize the variable 'ret' to silence this warning 907 | int ret, i; |^ | = 0 1 error generated. Should ret be initialized to zero or should there be an else statement? 
Cheers, Nathan > > if (ret) { > @@ -960,12 +930,21 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster > *cluster, > return ret; > } > > - ret = zynqmp_r5_set_mode(r5_core, fw_reg_val, tcm_mode); > - if (ret) { > - dev_err(dev, "failed to set r5 cluster mode %d, err > %d\n", > - cluster->mode, ret); > + ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val); > + if (ret < 0) { > + dev_err(r5_core->dev, "failed to set RPU mode\n"); > return ret; > } > + > + if (of_find_property(dev_of_node(dev), "xlnx,tcm-mode", NULL) || > + device_is_compatible(dev, "xlnx,zynqmp-r5fss")) { > + ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, > +tcm_mode); > + if (ret < 0) { > + dev_err(r5_core->dev, "failed to configure > TCM\n"); > + return ret; > + } > + } > } > > return 0; > @@ -1022,7 +1001,7 @@ static int zynqmp_r5_cluster_init(struct >
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
* Steven Rostedt [240410 13:41]: > On Sat, 6 Apr 2024 18:36:46 +0100 > Vincent Donnefort wrote: > > > +int ring_buffer_map(struct trace_buffer *buffer, int cpu, > > + struct vm_area_struct *vma) > > +{ > > + struct ring_buffer_per_cpu *cpu_buffer; > > + unsigned long flags, *subbuf_ids; > > + int err = 0; > > + > > + if (!cpumask_test_cpu(cpu, buffer->cpumask)) > > + return -EINVAL; > > + > > + cpu_buffer = buffer->buffers[cpu]; > > + > > + mutex_lock(_buffer->mapping_lock); > > + > > + if (cpu_buffer->mapped) { > > + err = __rb_map_vma(cpu_buffer, vma); > > + if (!err) > > + err = __rb_inc_dec_mapped(cpu_buffer, true); > > + mutex_unlock(_buffer->mapping_lock); > > + return err; > > + } > > + > > + /* prevent another thread from changing buffer/sub-buffer sizes */ > > + mutex_lock(>mutex); > > + > > + err = rb_alloc_meta_page(cpu_buffer); > > + if (err) > > + goto unlock; > > + > > + /* subbuf_ids include the reader while nr_pages does not */ > > + subbuf_ids = kcalloc(cpu_buffer->nr_pages + 1, sizeof(*subbuf_ids), > > GFP_KERNEL); > > + if (!subbuf_ids) { > > + rb_free_meta_page(cpu_buffer); > > + err = -ENOMEM; > > + goto unlock; > > + } > > + > > + atomic_inc(_buffer->resize_disabled); > > + > > + /* > > +* Lock all readers to block any subbuf swap until the subbuf IDs are > > +* assigned. > > +*/ > > + raw_spin_lock_irqsave(_buffer->reader_lock, flags); > > + rb_setup_ids_meta_page(cpu_buffer, subbuf_ids); > > + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); > > + > > + err = __rb_map_vma(cpu_buffer, vma); > > + if (!err) { > > + raw_spin_lock_irqsave(_buffer->reader_lock, flags); > > + cpu_buffer->mapped = 1; > > + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); > > + } else { > > + kfree(cpu_buffer->subbuf_ids); > > + cpu_buffer->subbuf_ids = NULL; > > + rb_free_meta_page(cpu_buffer); > > + } > > +unlock: > > Nit: For all labels, please add a space before them. 
Otherwise, diffs will show "unlock" as the function and not "ring_buffer_map", making it harder to find where the change is. > Isn't the inclusion of a space before labels outdated, since 'git diff' supersedes 'diff' and fixes this issue? Thanks, Liam
Re: [PATCH v12 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup
On Wed, 17 Apr 2024 18:51:28 -0500, Huang, Kai wrote: On 16/04/2024 3:20 pm, Haitao Huang wrote: From: Kristen Carlson Accardi Currently in the EPC page allocation, the kernel simply fails the allocation when the current EPC cgroup fails to charge due to its usage reaching limit. This is not ideal. When that happens, a better way is to reclaim EPC page(s) from the current EPC cgroup (and/or its descendants) to reduce its usage so the new allocation can succeed. Add the basic building blocks to support per-cgroup reclamation. Currently the kernel only has one place to reclaim EPC pages: the global EPC LRU list. To support the "per-cgroup" EPC reclaim, maintain an LRU list for each EPC cgroup, and introduce a "cgroup" variant function to reclaim EPC pages from a given EPC cgroup and its descendants. Currently the kernel does the global EPC reclaim in sgx_reclaim_page(). It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16) pages. Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN pages from the global LRU, and then tries to reclaim these pages at once for better performance. Implement the "cgroup" variant EPC reclaim in a similar way, but keep the implementation simple: 1) change sgx_reclaim_pages() to take an LRU as input, and return the pages that are "scanned" and attempted for reclamation (but not necessarily reclaimed successfully); 2) loop the given EPC cgroup and its descendants and do the new sgx_reclaim_pages() until SGX_NR_TO_SCAN pages are "scanned". This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC cgroup, and only moves to its descendants when there's no enough reclaimable EPC pages to "scan" in its LRU. It should be enough for most cases. 
Note, this simple implementation doesn't _exactly_ mimic the current global EPC reclaim (which always tries to do the actual reclaim in batch of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN reclaimable pages, the actual reclaim of EPC pages will be split into smaller batches _across_ multiple LRUs with each being smaller than SGX_NR_TO_SCAN pages. A more precise way to mimic the current global EPC reclaim would be to have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages _across_ the given EPC cgroup _AND_ its descendants, and then do the actual reclaim in one batch. But this is unnecessarily complicated at this stage. Alternatively, the current sgx_reclaim_pages() could be changed to return the actual "reclaimed" pages, but not "scanned" pages. However, the reclamation is a lengthy process, forcing a successful reclamation of predetermined number of pages may block the caller for too long. And that may not be acceptable in some synchronous contexts, e.g., in serving an ioctl(). With this building block in place, add synchronous reclamation support in sgx_cgroup_try_charge(): trigger a call to sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the caller allows synchronous reclaim as indicated by s newly added parameter. A later patch will add support for asynchronous reclamation reusing sgx_cgroup_reclaim_pages(). Note all reclaimable EPC pages are still tracked in the global LRU thus no per-cgroup reclamation is actually active at the moment. Per-cgroup tracking and reclamation will be turned on in the end after all necessary infrastructure is in place. Nit: "all necessary infrastructures are in place", or, "all necessary building blocks are in place". ? Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Tested-by: Jarkko Sakkinen --- Reviewed-by: Kai Huang Thanks More nitpickings below: [...] 
-static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg) +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim) Let's still wrap the text on an 80-character basis. I guess most people are more used to that. [...] - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable, - struct sgx_epc_page, list); + epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list); Ditto. Actually I changed to 100-char width based on comments from Jarkko IIRC. I don't have a personal preference, but will not change back to 80 unless Jarkko also agrees. Thanks Haitao
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Tue, 23 Apr 2024 09:19:53 -0500, Huang, Kai wrote: On Tue, 2024-04-23 at 08:08 -0500, Haitao Huang wrote: On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai wrote: > On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote: > > On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai > > wrote: > > > > > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote: > > > > > > I think we can add support for "sgx_cgroup=disabled" in future > > if > > > > indeed > > > > > > needed. But just for init failure, no? > > > > > > > > > > > > > > > > It's not about the commandline, which we can add in the future > > when > > > > > needed. It's about we need to have a way to handle SGX cgroup > > being > > > > > disabled at boot time nicely, because we already have a case > > where we > > > > > need > > > > > to do so. > > > > > > > > > > Your approach looks half-way to me, and is not future > > extendible. If > > > > we > > > > > choose to do it, do it right -- that is, we need a way to disable > > it > > > > > completely in both kernel and userspace so that userspace won't be > > > > able> to > > > > > see it. > > > > > > > > That would need more changes in misc cgroup implementation to > > support > > > > sgx-disable. Right now misc does not have separate files for > > different > > > > resource types. So we can only block echo "sgx_epc..." to those > > > > interfacefiles, can't really make files not visible. > > > > > > "won't be able to see" I mean "only for SGX EPC resource", but not the > > > control files for the entire MISC cgroup. > > > > > > I replied at the beginning of the previous reply: > > > > > > " > > > Given SGX EPC is just one type of MISC cgroup resources, we cannot > > just > > > disable MISC cgroup as a whole. > > > " > > > > > Sorry I missed this point. below. > > > > > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC. > > See > > > the comment of @misc_res_capacity: > > > > > > * Miscellaneous resources capacity for the entire machine. 
0 capacity > > > * means resource is not initialized or not present in the host. > > > > > > > IIUC I don't think the situation we have is either of those cases. For > > our > > case, resource is inited and present on the host but we have allocation > > error for sgx cgroup infra. > > You have calculated the "capacity", but later you failed something and > then reset the "capacity" to 0, i.e., cleanup. What's wrong with that? > > > > > > And "blocking echo sgx_epc ... to those control files" is already > > > sufficient for the purpose of not exposing SGX EPC to userspace, > > correct? > > > > > > E.g., if SGX cgroup is enabled, you can see below when you read "max": > > > > > > # cat /sys/fs/cgroup/my_group/misc.max > > > # > > >sgx_epc ... > > >... > > > > > > Otherwise you won't be able to see "sgx_epc": > > > > > > # cat /sys/fs/cgroup/my_group/misc.max > > > # > > >... > > > > > > And when you try to write the "max" for "sgx_epc", you will hit error: > > > > > > # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max > > > # ... echo: write error: Invalid argument > > > > > > The above applies to all the control files. To me this is pretty much > > > means "SGX EPC is disabled" or "not supported" for userspace. > > > > > You are right, capacity == 0 does block echoing max and users see an > > error > > if they do that. But 1) doubt you literately wanted "SGX EPC is > > disabled" > > and make it unsupported in this case, > > I don't understand. Something failed during SGX cgroup initialization, > you _literally_ cannot continue to support it. > > Then we should just return -ENOMEM from sgx_init() when sgx cgroup initialization fails? I thought we only disable SGX cgroup support. SGX can still run. I am not sure how you got this conclusion. I specifically said something failed during SGX "cgroup" initialization, so only SGX "cgroup" needs to be disabled, not SGX as a whole. 
> > 2) even if we accept this is "sgx > > cgroup disabled" I don't see how it is much better user experience than > > current solution or really helps user better. > > In your way, the userspace is still able to see "sgx_epc" in control > files > and is able to update them. So from userspace's perspective SGX cgroup > is > enabled, but obviously updating to "max" doesn't have any impact. This > will confuse userspace. > > > Setting capacity to zero also confuses user space. Some application may rely on this file to know the capacity. Why?? Are you saying before this SGX cgroup patchset those applications cannot run? > > Also to implement this approach, as you mentioned, we need workaround > > the > > fact that misc_try_charge() fails when capacity set to zero, and adding > > code to return root always? > > Why this is a problem? > It changes/overrides the the original meaning of capacity==0: no one can allocate if capacity is zero. Why?? Are you saying before this series, no
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 22.04.24 22:31, Vincent Donnefort wrote: On Mon, Apr 22, 2024 at 08:27:17PM +0200, David Hildenbrand wrote: On 22.04.24 20:20, Vincent Donnefort wrote: Hi David, Thanks for having a look, very much appreciated! On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote: On 19.04.24 20:25, David Hildenbrand wrote: On 06.04.24 19:36, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. 
CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size of this meta-page. + * @meta_struct_len: Size of this structure. + * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. 
+ */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..793ecc454039 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Tue, 2024-04-23 at 08:08 -0500, Haitao Huang wrote: > On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai wrote: > > > On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote: > > > On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai > > > wrote: > > > > > > > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote: > > > > > > > I think we can add support for "sgx_cgroup=disabled" in future > > > if > > > > > indeed > > > > > > > needed. But just for init failure, no? > > > > > > > > > > > > > > > > > > > It's not about the commandline, which we can add in the future > > > when > > > > > > needed. It's about we need to have a way to handle SGX cgroup > > > being > > > > > > disabled at boot time nicely, because we already have a case > > > where we > > > > > > need > > > > > > to do so. > > > > > > > > > > > > Your approach looks half-way to me, and is not future > > > extendible. If > > > > > we > > > > > > choose to do it, do it right -- that is, we need a way to disable > > > it > > > > > > completely in both kernel and userspace so that userspace won't be > > > > > able> to > > > > > > see it. > > > > > > > > > > That would need more changes in misc cgroup implementation to > > > support > > > > > sgx-disable. Right now misc does not have separate files for > > > different > > > > > resource types. So we can only block echo "sgx_epc..." to those > > > > > interfacefiles, can't really make files not visible. > > > > > > > > "won't be able to see" I mean "only for SGX EPC resource", but not the > > > > control files for the entire MISC cgroup. > > > > > > > > I replied at the beginning of the previous reply: > > > > > > > > " > > > > Given SGX EPC is just one type of MISC cgroup resources, we cannot > > > just > > > > disable MISC cgroup as a whole. > > > > " > > > > > > > Sorry I missed this point. below. > > > > > > > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC. 
> > > See > > > > the comment of @misc_res_capacity: > > > > > > > > * Miscellaneous resources capacity for the entire machine. 0 capacity > > > > * means resource is not initialized or not present in the host. > > > > > > > > > > IIUC I don't think the situation we have is either of those cases. For > > > our > > > case, resource is inited and present on the host but we have allocation > > > error for sgx cgroup infra. > > > > You have calculated the "capacity", but later you failed something and > > then reset the "capacity" to 0, i.e., cleanup. What's wrong with that? > > > > > > > > > And "blocking echo sgx_epc ... to those control files" is already > > > > sufficient for the purpose of not exposing SGX EPC to userspace, > > > correct? > > > > > > > > E.g., if SGX cgroup is enabled, you can see below when you read "max": > > > > > > > > # cat /sys/fs/cgroup/my_group/misc.max > > > > # > > > >sgx_epc ... > > > >... > > > > > > > > Otherwise you won't be able to see "sgx_epc": > > > > > > > > # cat /sys/fs/cgroup/my_group/misc.max > > > > # > > > >... > > > > > > > > And when you try to write the "max" for "sgx_epc", you will hit error: > > > > > > > > # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max > > > > # ... echo: write error: Invalid argument > > > > > > > > The above applies to all the control files. To me this is pretty much > > > > means "SGX EPC is disabled" or "not supported" for userspace. > > > > > > > You are right, capacity == 0 does block echoing max and users see an > > > error > > > if they do that. But 1) doubt you literately wanted "SGX EPC is > > > disabled" > > > and make it unsupported in this case, > > > > I don't understand. Something failed during SGX cgroup initialization, > > you _literally_ cannot continue to support it. > > > > > > Then we should just return -ENOMEM from sgx_init() when sgx cgroup > initialization fails? > I thought we only disable SGX cgroup support. SGX can still run. 
I am not sure how you got this conclusion. I specifically said something failed during SGX "cgroup" initialization, so only SGX "cgroup" needs to be disabled, not SGX as a whole. > > > > 2) even if we accept this is "sgx > > > cgroup disabled" I don't see how it is much better user experience than > > > current solution or really helps user better. > > > > In your way, the userspace is still able to see "sgx_epc" in control > > files > > and is able to update them. So from userspace's perspective SGX cgroup > > is > > enabled, but obviously updating to "max" doesn't have any impact. This > > will confuse userspace. > > > > > > > Setting capacity to zero also confuses user space. Some application may > rely on this file to know the capacity. Why?? Are you saying before this SGX cgroup patchset those applications cannot run? > > > > Also to implement this approach, as you mentioned, we need workaround > > > the > > > fact that misc_try_charge() fails when capacity set to zero, and adding > > >
Re: [PATCH v2 0/6] virtiofs: fix the warning for ITER_KVEC dio
On 4/23/2024 4:06 AM, Michael S. Tsirkin wrote: > On Tue, Apr 09, 2024 at 09:48:08AM +0800, Hou Tao wrote: >> Hi, >> >> On 4/8/2024 3:45 PM, Michael S. Tsirkin wrote: >>> On Wed, Feb 28, 2024 at 10:41:20PM +0800, Hou Tao wrote: From: Hou Tao Hi, The patch set aims to fix the warning related to an abnormal size parameter of kmalloc() in virtiofs. The warning occurred when attempting to insert a 10MB sized kernel module kept in a virtiofs with cache disabled. As analyzed in patch #1, the root cause is that the length of the read buffer is no limited, and the read buffer is passed directly to virtiofs through out_args[0].value. Therefore patch #1 limits the length of the read buffer passed to virtiofs by using max_pages. However it is not enough, because now the maximal value of max_pages is 256. Consequently, when reading a 10MB-sized kernel module, the length of the bounce buffer in virtiofs will be 40 + (256 * 4096), and kmalloc will try to allocate 2MB from memory subsystem. The request for 2MB of physically contiguous memory significantly stress the memory subsystem and may fail indefinitely on hosts with fragmented memory. To address this, patch #2~#5 use scattered pages in a bio_vec to replace the kmalloc-allocated bounce buffer when the length of the bounce buffer for KVEC_ITER dio is larger than PAGE_SIZE. The final issue with the allocation of the bounce buffer and sg array in virtiofs is that GFP_ATOMIC is used even when the allocation occurs in a kworker context. Therefore the last patch uses GFP_NOFS for the allocation of both sg array and bounce buffer when initiated by the kworker. For more details, please check the individual patches. As usual, comments are always welcome. Change Log: >>> Bernd should I just merge the patchset as is? >>> It seems to fix a real problem and no one has the >>> time to work on a better fix WDYT? >> Sorry for the long delay. I am just start to prepare for v3. 
In v3, I >> plan to avoid the unnecessary memory copy between fuse args and bio_vec. >> Will post it before next week. > Didn't happen before this week apparently. Sorry for failing to make it this week. Being busy these two weeks. Hope to send v3 out before the end of April. > >>> v2: * limit the length of ITER_KVEC dio by max_pages instead of the newly-introduced max_nopage_rw. Using max_pages make the ITER_KVEC dio being consistent with other rw operations. * replace kmalloc-allocated bounce buffer by using a bounce buffer backed by scattered pages when the length of the bounce buffer for KVEC_ITER dio is larger than PAG_SIZE, so even on hosts with fragmented memory, the KVEC_ITER dio can be handled normally by virtiofs. (Bernd Schubert) * merge the GFP_NOFS patch [1] into this patch-set and use memalloc_nofs_{save|restore}+GFP_KERNEL instead of GFP_NOFS (Benjamin Coddington) v1: https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/ [1]: https://lore.kernel.org/linux-fsdevel/20240105105305.4052672-1-hou...@huaweicloud.com/ Hou Tao (6): fuse: limit the length of ITER_KVEC dio by max_pages virtiofs: move alloc/free of argbuf into separated helpers virtiofs: factor out more common methods for argbuf virtiofs: support bounce buffer backed by scattered pages virtiofs: use scattered bounce buffer for ITER_KVEC dio virtiofs: use GFP_NOFS when enqueuing request through kworker fs/fuse/file.c | 12 +- fs/fuse/virtio_fs.c | 336 +--- 2 files changed, 296 insertions(+), 52 deletions(-) -- 2.29.2
Re: [PATCH v2 3/3] arm64: dts: qcom: sm7225-fairphone-fp4: Enable USB role switching
On 3/29/24 13:26, Luca Weiss wrote:
> Configure the Type-C and VBUS regulator on PM7250B and wire it up to
> the USB PHY, so that USB role and orientation switching works.
>
> For now USB Power Delivery properties are skipped / disabled, so that
> the (presumably) bootloader-configured charger doesn't get messed with
> and we can charge the phone with at least some amount of power.
>
> Signed-off-by: Luca Weiss
> ---

Reviewed-by: Konrad Dybcio

nits:

> + usb-role-switch;

this is a common property of the dwc3 host

[...]

> vdda-phy-supply = <_l22a>;
> vdda-pll-supply = <_l16a>;
> + orientation-switch;

and this is a common property of the QMPPHY

Could you move them to the node definition?

Konrad
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai wrote: On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote: On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai wrote: > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote: > > > > I think we can add support for "sgx_cgroup=disabled" in future if > > indeed > > > > needed. But just for init failure, no? > > > > > > > > > > It's not about the commandline, which we can add in the future when > > > needed. It's about we need to have a way to handle SGX cgroup being > > > disabled at boot time nicely, because we already have a case where we > > > need > > > to do so. > > > > > > Your approach looks half-way to me, and is not future extendible. If > > we > > > choose to do it, do it right -- that is, we need a way to disable it > > > completely in both kernel and userspace so that userspace won't be > > able> to > > > see it. > > > > That would need more changes in misc cgroup implementation to support > > sgx-disable. Right now misc does not have separate files for different > > resource types. So we can only block echo "sgx_epc..." to those > > interfacefiles, can't really make files not visible. > > "won't be able to see" I mean "only for SGX EPC resource", but not the > control files for the entire MISC cgroup. > > I replied at the beginning of the previous reply: > > " > Given SGX EPC is just one type of MISC cgroup resources, we cannot just > disable MISC cgroup as a whole. > " > Sorry I missed this point. below. > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC. See > the comment of @misc_res_capacity: > > * Miscellaneous resources capacity for the entire machine. 0 capacity > * means resource is not initialized or not present in the host. > IIUC I don't think the situation we have is either of those cases. For our case, resource is inited and present on the host but we have allocation error for sgx cgroup infra. 
You have calculated the "capacity", but later you failed something and then reset the "capacity" to 0, i.e., cleanup. What's wrong with that? > And "blocking echo sgx_epc ... to those control files" is already > sufficient for the purpose of not exposing SGX EPC to userspace, correct? > > E.g., if SGX cgroup is enabled, you can see below when you read "max": > > # cat /sys/fs/cgroup/my_group/misc.max > # >sgx_epc ... >... > > Otherwise you won't be able to see "sgx_epc": > > # cat /sys/fs/cgroup/my_group/misc.max > # >... > > And when you try to write the "max" for "sgx_epc", you will hit error: > > # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max > # ... echo: write error: Invalid argument > > The above applies to all the control files. To me this is pretty much > means "SGX EPC is disabled" or "not supported" for userspace. > You are right, capacity == 0 does block echoing max and users see an error if they do that. But 1) doubt you literately wanted "SGX EPC is disabled" and make it unsupported in this case, I don't understand. Something failed during SGX cgroup initialization, you _literally_ cannot continue to support it. Then we should just return -ENOMEM from sgx_init() when sgx cgroup initialization fails? I thought we only disable SGX cgroup support. SGX can still run. 2) even if we accept this is "sgx cgroup disabled" I don't see how it is much better user experience than current solution or really helps user better. In your way, the userspace is still able to see "sgx_epc" in control files and is able to update them. So from userspace's perspective SGX cgroup is enabled, but obviously updating to "max" doesn't have any impact. This will confuse userspace. Setting capacity to zero also confuses user space. Some application may rely on this file to know the capacity. Also to implement this approach, as you mentioned, we need workaround the fact that misc_try_charge() fails when capacity set to zero, and adding code to return root always? 
Why this is a problem?

It changes/overrides the original meaning of capacity==0: no one can allocate if capacity is zero. So it seems like more workaround code to just make it work for a failing case no one really care much and end result is not really much better IMHO.

It's not workaround, it's the right thing to do. The result is userspace will see it being disabled when kernel disables it.

It's a workaround because you use the capacity==0 but it does not really mean to disable the misc cgroup for specific resource IIUC. There is explicit way for user to disable misc without setting capacity to zero. So in future if we want really disable sgx_epc cgroup specifically we should not use capacity. Therefore your approach is not extensible/reusable.

Given this is a rare corner case caused by configuration, we can only do as much as possible IMHO, not trying to implement a perfect solution at the moment. Maybe BUG_ON() is more
Re: [PATCH net-next v7 1/7] net: introduce rstreason to detect why the RST is sent
On Tue, Apr 23, 2024 at 7:57 PM Simon Horman wrote: > > On Tue, Apr 23, 2024 at 10:17:31AM +0800, Jason Xing wrote: > > On Tue, Apr 23, 2024 at 10:14 AM Jason Xing > > wrote: > > > > > > Hi Simon, > > > > > > On Tue, Apr 23, 2024 at 2:28 AM Simon Horman wrote: > > > > > > > > On Mon, Apr 22, 2024 at 11:01:03AM +0800, Jason Xing wrote: > > > > > > > > ... > > > > > > > > > diff --git a/include/net/rstreason.h b/include/net/rstreason.h > > > > > > > > ... > > > > > > > > > +/** > > > > > + * There are three parts in order: > > > > > + * 1) reset reason in MPTCP: only for MPTCP use > > > > > + * 2) skb drop reason: relying on drop reasons for such as passive > > > > > reset > > > > > + * 3) independent reset reason: such as active reset reasons > > > > > + */ > > > > > > > > Hi Jason, > > > > > > > > A minor nit from my side. > > > > > > > > '/**' denotes the beginning of a Kernel doc, > > > > but other than that, this comment is not a Kernel doc. > > > > > > > > FWIIW, I would suggest providing a proper Kernel doc for enum > > > > sk_rst_reason. > > > > But another option would be to simply make this a normal comment, > > > > starting with "/* There are" > > > > > > Thanks Simon. I'm trying to use the kdoc way to make it right :) > > > > > > How about this one: > > > /** > > > * enum sk_rst_reason - the reasons of socket reset > > > * > > > * The reason of skb drop, which is used in DCCP/TCP/MPTCP protocols. > > > > s/skb drop/sk reset/ > > > > Sorry, I cannot withdraw my previous email in time. > > > > > * > > > * There are three parts in order: > > > * 1) skb drop reasons: relying on drop reasons for such as passive > > > reset > > > * 2) independent reset reasons: such as active reset reasons > > > * 3) reset reasons in MPTCP: only for MPTCP use > > > */ > > > ? > > > > > > I chose to mimic what enum skb_drop_reason does in the > > > include/net/dropreason-core.h file. 
> > > > > > > +enum sk_rst_reason { > > > > + /** > > > > +* Copy from include/uapi/linux/mptcp.h. > > > > +* These reset fields will not be changed since they adhere to > > > > +* RFC 8684. So do not touch them. I'm going to list each > > > > definition > > > > +* of them respectively. > > > > +*/ > > > > > > Thanks to you, I found another similar point where I smell something > > > wrong as in the above code. I'm going to replace '/**' with '/*' since > > > it's only a comment, not a kdoc. > > Likewise, thanks Jason. > > I haven't had time to look at v8 properly, > but I see that kernel-doc is happy with the changed > you have made there as discussed above. > Thank you, Simon. I learned something new about the coding style. Besides, some other nit problems have been spotted by Matt. I will fix them if it's required to send a new version.
Re: [PATCH net-next v7 1/7] net: introduce rstreason to detect why the RST is sent
On Tue, Apr 23, 2024 at 10:17:31AM +0800, Jason Xing wrote: > On Tue, Apr 23, 2024 at 10:14 AM Jason Xing wrote: > > > > Hi Simon, > > > > On Tue, Apr 23, 2024 at 2:28 AM Simon Horman wrote: > > > > > > On Mon, Apr 22, 2024 at 11:01:03AM +0800, Jason Xing wrote: > > > > > > ... > > > > > > > diff --git a/include/net/rstreason.h b/include/net/rstreason.h > > > > > > ... > > > > > > > +/** > > > > + * There are three parts in order: > > > > + * 1) reset reason in MPTCP: only for MPTCP use > > > > + * 2) skb drop reason: relying on drop reasons for such as passive > > > > reset > > > > + * 3) independent reset reason: such as active reset reasons > > > > + */ > > > > > > Hi Jason, > > > > > > A minor nit from my side. > > > > > > '/**' denotes the beginning of a Kernel doc, > > > but other than that, this comment is not a Kernel doc. > > > > > > FWIIW, I would suggest providing a proper Kernel doc for enum > > > sk_rst_reason. > > > But another option would be to simply make this a normal comment, > > > starting with "/* There are" > > > > Thanks Simon. I'm trying to use the kdoc way to make it right :) > > > > How about this one: > > /** > > * enum sk_rst_reason - the reasons of socket reset > > * > > * The reason of skb drop, which is used in DCCP/TCP/MPTCP protocols. > > s/skb drop/sk reset/ > > Sorry, I cannot withdraw my previous email in time. > > > * > > * There are three parts in order: > > * 1) skb drop reasons: relying on drop reasons for such as passive > > reset > > * 2) independent reset reasons: such as active reset reasons > > * 3) reset reasons in MPTCP: only for MPTCP use > > */ > > ? > > > > I chose to mimic what enum skb_drop_reason does in the > > include/net/dropreason-core.h file. > > > > > +enum sk_rst_reason { > > > + /** > > > +* Copy from include/uapi/linux/mptcp.h. > > > +* These reset fields will not be changed since they adhere to > > > +* RFC 8684. So do not touch them. 
I'm going to list each > > > definition > > > +* of them respectively. > > > +*/ > > > > Thanks to you, I found another similar point where I smell something > > wrong as in the above code. I'm going to replace '/**' with '/*' since > > it's only a comment, not a kdoc. Likewise, thanks Jason. I haven't had time to look at v8 properly, but I see that kernel-doc is happy with the changed you have made there as discussed above.
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Tue, 2024-04-23 at 17:25 +0800, 朱伯君(杰铭) wrote: > EDMM's ioctl()s support batch operations, which may be > time-consuming. Try to explicitly give up the CPU at > the every end of "for loop" in > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > to give other tasks a chance to run, and avoid softlockup warning. > > The following has been observed on Linux v6.9-rc5 with kernel > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > is requested to restrict page permissions of a large number of EPC pages. > > [ cut here ] > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] > ... > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 > ... > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 40 > 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef > RSP: 0018:b55a6591fa80 EFLAGS: 0202 > RAX: RBX: b55a6591fac0 RCX: b581e7384000 > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 > R10: 006e R11: 0002 R12: 00072052d000 > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 > FS: 7fe10dbda740() GS:92163e48() > knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 > DR0: DR1: DR2: > DR3: DR6: fffe07f0 DR7: 0400 > PKRU: 5554 > Call Trace: > > ? show_regs+0x67/0x70 > ? watchdog_timer_fn+0x1f3/0x280 > ? __pfx_watchdog_timer_fn+0x10/0x10 > ? __hrtimer_run_queues+0xc8/0x220 > ? hrtimer_interrupt+0x10c/0x250 > ? __sysvec_apic_timer_interrupt+0x53/0x130 > ? sysvec_apic_timer_interrupt+0x7b/0x90 > > > ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > ? sgx_enclave_restrict_permissions+0xba/0x1f0 > ? __pte_offset_map_lock+0x94/0x110 > ? sgx_encl_test_and_clear_young_cb+0x40/0x60 > sgx_ioctl+0x1ab/0x900 > ? do_syscall_64+0x79/0x110 > ? apply_to_page_range+0x14/0x20 > ? 
sgx_encl_test_and_clear_young+0x6c/0x80 > ? sgx_vma_fault+0x132/0x4f0 > __x64_sys_ioctl+0x95/0xd0 > x64_sys_call+0x1209/0x20c0 > do_syscall_64+0x6d/0x110 > ? do_syscall_64+0x79/0x110 > ? do_pte_missing+0x2e8/0xcc0 > ? __pte_offset_map+0x1c/0x190 > ? __handle_mm_fault+0x7b9/0xe60 > ? __count_memcg_events+0x70/0x100 > ? handle_mm_fault+0x256/0x360 > ? do_user_addr_fault+0x3c1/0x860 > ? irqentry_exit_to_user_mode+0x67/0x190 > ? irqentry_exit+0x3b/0x50 > ? exc_page_fault+0x89/0x180 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7fe10e2ee5cb > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff > ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 > RBP: 0005 R08: R09: 7fffb2c75594 > R10: 7fffb2c755c8 R11: 0246 R12: c028a405 > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 > > [ end trace ] Could you trim down the trace to only include the relevant part? E.g., please at least remove the two register dumps at the beginning and end of the trace. Please refer to "Backtraces in commit messages" section in Documentation/process/submitting-patches.rst. 
> > Signed-off-by: Bojun Zhu > --- > arch/x86/kernel/cpu/sgx/ioctl.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c > index b65ab214bdf5..2340a82fa796 100644 > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > } > > mutex_unlock(>lock); > + > + if (need_resched()) > + cond_resched(); > } > > ret = 0; > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > *encl, > entry->type = page_type; > > mutex_unlock(>lock); > + > + if (need_resched()) > + cond_resched(); > } > > ret = 0; > @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl, >
Re: [PATCH net-next v8 5/7] mptcp: support rstreason for passive reset
Hello Matthieu, On Tue, Apr 23, 2024 at 6:02 PM Matthieu Baerts wrote: > > Hi Jason, > > On 23/04/2024 09:21, Jason Xing wrote: > > From: Jason Xing > > > > It relys on what reset options in the skb are as rfc8684 says. Reusing > > (if you have something else to fix, 'checkpatch.pl --codespell' reported > a warning here: s/relys/relies/) Thanks. Will fix it. > > > this logic can save us much energy. This patch replaces most of the prior > > NOT_SPECIFIED reasons. > > > > Signed-off-by: Jason Xing > > --- > > net/mptcp/protocol.h | 28 > > net/mptcp/subflow.c | 22 +- > > 2 files changed, 45 insertions(+), 5 deletions(-) > > > > diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h > > index fdfa843e2d88..bbcb8c068aae 100644 > > --- a/net/mptcp/protocol.h > > +++ b/net/mptcp/protocol.h > > @@ -581,6 +581,34 @@ mptcp_subflow_ctx_reset(struct mptcp_subflow_context > > *subflow) > > WRITE_ONCE(subflow->local_id, -1); > > } > > > > +/* Convert reset reasons in MPTCP to enum sk_rst_reason type */ > > +static inline enum sk_rst_reason > > +sk_rst_convert_mptcp_reason(u32 reason) > > +{ > > + switch (reason) { > > + case MPTCP_RST_EUNSPEC: > > + return SK_RST_REASON_MPTCP_RST_EUNSPEC; > > + case MPTCP_RST_EMPTCP: > > + return SK_RST_REASON_MPTCP_RST_EMPTCP; > > + case MPTCP_RST_ERESOURCE: > > + return SK_RST_REASON_MPTCP_RST_ERESOURCE; > > + case MPTCP_RST_EPROHIBIT: > > + return SK_RST_REASON_MPTCP_RST_EPROHIBIT; > > + case MPTCP_RST_EWQ2BIG: > > + return SK_RST_REASON_MPTCP_RST_EWQ2BIG; > > + case MPTCP_RST_EBADPERF: > > + return SK_RST_REASON_MPTCP_RST_EBADPERF; > > + case MPTCP_RST_EMIDDLEBOX: > > + return SK_RST_REASON_MPTCP_RST_EMIDDLEBOX; > > + default: > > + /** > > I guess here as well, it should be '/*' instead of '/**'. But I guess > that's fine, this file is probably not scanned. Anyway, if you have to > send a new version, please fix this as well. Thanks for your help. I will. 
> > (Also, this helper might require '#include ', but our > CI is fine with it, it is also added in the next commit, and probably > already included via include/net/request_sock.h. So I guess that's fine.) Yes, If I need to submit the V9 patch, I will move it. > > > Other than that, it looks good to me: > > Reviewed-by: Matthieu Baerts (NGI0) Thanks for all the reviews :) Thanks, Jason > > Cheers, > Matt > -- > Sponsored by the NGI0 Core fund. >
Re: [PATCH v3 1/4] virtio_balloon: separate vm events into a function
On 23.04.24 05:41, zhenwei pi wrote:
> All the VM events related statistics have dependence on
> 'CONFIG_VM_EVENT_COUNTERS', separate these events into a function to
> make code clean. Then we can remove 'CONFIG_VM_EVENT_COUNTERS' from
> 'update_balloon_stats'.
>
> Signed-off-by: zhenwei pi
> ---

Reviewed-by: David Hildenbrand

--
Cheers,

David / dhildenb
Re: [PATCH net-next v8 6/7] mptcp: introducing a helper into active reset logic
Hi Jason,

On 23/04/2024 09:21, Jason Xing wrote:
> From: Jason Xing
>
> Since we have mapped every mptcp reset reason definition in enum
> sk_rst_reason, introducing a new helper can cover some missing places
> where we have already set the subflow->reset_reason.
>
> Note: using SK_RST_REASON_NOT_SPECIFIED is the same as
> SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown. So we can convert
> it directly.

It looks good to me, thanks:

Reviewed-by: Matthieu Baerts (NGI0)

Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
Re: [PATCH net-next v8 5/7] mptcp: support rstreason for passive reset
Hi Jason, On 23/04/2024 09:21, Jason Xing wrote: > From: Jason Xing > > It relys on what reset options in the skb are as rfc8684 says. Reusing (if you have something else to fix, 'checkpatch.pl --codespell' reported a warning here: s/relys/relies/) > this logic can save us much energy. This patch replaces most of the prior > NOT_SPECIFIED reasons. > > Signed-off-by: Jason Xing > --- > net/mptcp/protocol.h | 28 > net/mptcp/subflow.c | 22 +- > 2 files changed, 45 insertions(+), 5 deletions(-) > > diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h > index fdfa843e2d88..bbcb8c068aae 100644 > --- a/net/mptcp/protocol.h > +++ b/net/mptcp/protocol.h > @@ -581,6 +581,34 @@ mptcp_subflow_ctx_reset(struct mptcp_subflow_context > *subflow) > WRITE_ONCE(subflow->local_id, -1); > } > > +/* Convert reset reasons in MPTCP to enum sk_rst_reason type */ > +static inline enum sk_rst_reason > +sk_rst_convert_mptcp_reason(u32 reason) > +{ > + switch (reason) { > + case MPTCP_RST_EUNSPEC: > + return SK_RST_REASON_MPTCP_RST_EUNSPEC; > + case MPTCP_RST_EMPTCP: > + return SK_RST_REASON_MPTCP_RST_EMPTCP; > + case MPTCP_RST_ERESOURCE: > + return SK_RST_REASON_MPTCP_RST_ERESOURCE; > + case MPTCP_RST_EPROHIBIT: > + return SK_RST_REASON_MPTCP_RST_EPROHIBIT; > + case MPTCP_RST_EWQ2BIG: > + return SK_RST_REASON_MPTCP_RST_EWQ2BIG; > + case MPTCP_RST_EBADPERF: > + return SK_RST_REASON_MPTCP_RST_EBADPERF; > + case MPTCP_RST_EMIDDLEBOX: > + return SK_RST_REASON_MPTCP_RST_EMIDDLEBOX; > + default: > + /** I guess here as well, it should be '/*' instead of '/**'. But I guess that's fine, this file is probably not scanned. Anyway, if you have to send a new version, please fix this as well. (Also, this helper might require '#include ', but our CI is fine with it, it is also added in the next commit, and probably already included via include/net/request_sock.h. So I guess that's fine.) 
Other than that, it looks good to me: Reviewed-by: Matthieu Baerts (NGI0) Cheers, Matt -- Sponsored by the NGI0 Core fund.
Re: [PATCH net-next v8 3/7] rstreason: prepare for active reset
Hi Jason, On 23/04/2024 09:21, Jason Xing wrote: > From: Jason Xing > > Like what we did to passive reset: > only passing possible reset reason in each active reset path. > > No functional changes. (...) > net/mptcp/protocol.c | 4 +++- > net/mptcp/subflow.c | 5 +++-- For the modifications in MPTCP: Acked-by: Matthieu Baerts (NGI0) Cheers, Matt -- Sponsored by the NGI0 Core fund.
Re: [PATCH net-next v8 2/7] rstreason: prepare for passive reset
Hi Jason, On 23/04/2024 09:21, Jason Xing wrote: > From: Jason Xing > > Adjust the parameter and support passing reason of reset which > is for now NOT_SPECIFIED. No functional changes. (...) > net/mptcp/subflow.c| 8 +--- For the modifications in MPTCP: Acked-by: Matthieu Baerts (NGI0) Cheers, Matt -- Sponsored by the NGI0 Core fund.
Re: [PATCH net-next v8 1/7] net: introduce rstreason to detect why the RST is sent
Hi Jason, On 23/04/2024 09:21, Jason Xing wrote: > From: Jason Xing > > Add a new standalone file for the easy future extension to support > both active reset and passive reset in the TCP/DCCP/MPTCP protocols. > > This patch only does the preparations for reset reason mechanism, > nothing else changes. > > The reset reasons are divided into three parts: > 1) reuse drop reasons for passive reset in TCP > 2) our own independent reasons which aren't relying on other reasons at all > 3) reuse MP_TCPRST option for MPTCP Thank you for the v8, it looks good to me regarding the modifications linked to MPTCP. Acked-by: Matthieu Baerts (NGI0) Cheers, Matt -- Sponsored by the NGI0 Core fund.
[RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
EDMM's ioctl()s support batch operations, which may be time-consuming. Try to explicitly give up the CPU at the very end of the "for" loop in sgx_enclave_{ modify_types | restrict_permissions | remove_pages } to give other tasks a chance to run and to avoid a softlockup warning. The following has been observed on Linux v6.9-rc5 with kernel preemption disabled (by configuring "PREEMPT_NONE=y"), when the kernel is requested to restrict page permissions of a large number of EPC pages. [ cut here ] watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] ... CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 ... RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 40 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef RSP: 0018:b55a6591fa80 EFLAGS: 0202 RAX: RBX: b55a6591fac0 RCX: b581e7384000 RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 R10: 006e R11: 0002 R12: 00072052d000 R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 FS: 7fe10dbda740() GS:92163e48() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 DR0: DR1: DR2: DR3: DR6: fffe07f0 DR7: 0400 PKRU: 5554 Call Trace: ? show_regs+0x67/0x70 ? watchdog_timer_fn+0x1f3/0x280 ? __pfx_watchdog_timer_fn+0x10/0x10 ? __hrtimer_run_queues+0xc8/0x220 ? hrtimer_interrupt+0x10c/0x250 ? __sysvec_apic_timer_interrupt+0x53/0x130 ? sysvec_apic_timer_interrupt+0x7b/0x90 ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 ? sgx_enclave_restrict_permissions+0xba/0x1f0 ? __pte_offset_map_lock+0x94/0x110 ? sgx_encl_test_and_clear_young_cb+0x40/0x60 sgx_ioctl+0x1ab/0x900 ? do_syscall_64+0x79/0x110 ? apply_to_page_range+0x14/0x20 ? sgx_encl_test_and_clear_young+0x6c/0x80 ? sgx_vma_fault+0x132/0x4f0 __x64_sys_ioctl+0x95/0xd0 x64_sys_call+0x1209/0x20c0 do_syscall_64+0x6d/0x110 ? do_syscall_64+0x79/0x110 ?
do_pte_missing+0x2e8/0xcc0 ? __pte_offset_map+0x1c/0x190 ? __handle_mm_fault+0x7b9/0xe60 ? __count_memcg_events+0x70/0x100 ? handle_mm_fault+0x256/0x360 ? do_user_addr_fault+0x3c1/0x860 ? irqentry_exit_to_user_mode+0x67/0x190 ? irqentry_exit+0x3b/0x50 ? exc_page_fault+0x89/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fe10e2ee5cb Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 RBP: 0005 R08: R09: 7fffb2c75594 R10: 7fffb2c755c8 R11: 0246 R12: c028a405 R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 [ end trace ]

Signed-off-by: Bojun Zhu
---
 arch/x86/kernel/cpu/sgx/ioctl.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..2340a82fa796 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl,
 		}
 		mutex_unlock(&encl->lock);
+
+		if (need_resched())
+			cond_resched();
 	}
 
 	ret = 0;
@@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 		entry->type = page_type;
 		mutex_unlock(&encl->lock);
+
+		if (need_resched())
+			cond_resched();
 	}
 
 	ret = 0;
@@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
 		kfree(entry);
 		mutex_unlock(&encl->lock);
+
+		if (need_resched())
+			cond_resched();
 	}
 
 	ret = 0;
-- 
2.25.1
[RFC PATCH 0/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
Hi folks, If we run an enclave equipped with a large EPC (30G or greater on my platform) on Linux with kernel preemption disabled (by configuring "CONFIG_PREEMPT_NONE=y"), we will get the following softlockup warning messages reported in the "dmesg" log. Is it a known issue? I can get the warning messages on Linux v6.9-rc5. The EDMM ioctl()s (sgx_ioc_enclave_{ modify_types | restrict_permissions | remove_pages }) interface provided by the kernel supports batch-changing attributes of an enclave's EPC pages. If a userspace app requests the kernel to handle too many EPC pages, the kernel may get stuck for a long time (with preemption disabled). The log is as follows: [ cut here ] [ 901.101294] watchdog: BUG: soft lockup - CPU#92 stuck for 23s! [occlum-run:4289] [ 901.109617] Modules linked in: veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter br_netfilter bridge stp llc overlay nls_iso8859_1 intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit binfmt_misc ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 pmt_telemetry sha1_ssse3 pmt_class joydev intel_sdsi input_leds aesni_intel crypto_simd cryptd dax_hmem cxl_acpi cmdlinepart rapl cxl_core ast spi_nor intel_cstate drm_shmem_helper einj mtd drm_kms_helper mei_me idxd isst_if_mmio isst_if_mbox_pci isst_if_common intel_vsec idxd_bus mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter mac_hid sch_fq_codel msr parport_pc ppdev lp parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore drm ip_tables x_tables [ 901.109670] autofs4 mlx5_ib ib_uverbs ib_core hid_generic usbhid hid ses enclosure scsi_transport_sas mlx5_core pci_hyperv_intf mlxfw igb ahci psample i2c_algo_bit i2c_i801 spi_intel_pci xhci_pci tls megaraid_sas dca
spi_intel crc32_pclmul i2c_smbus i2c_ismt libahci xhci_pci_renesas wmi pinctrl_emmitsburg [ 901.109691] CPU: 92 PID: 4289 Comm: occlum-run Not tainted 6.9.0-rc5 #3 [ 901.109693] Hardware name: Inspur NF5468-M7-A0-R0-00/NF5468-M7-A0-R0-00, BIOS 05.02.01 05/08/2023 [ 901.109695] RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 [ 901.109701] Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e 15 8b 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e 15 8b 0f 01 cf 00 00 00 40 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef [ 901.109702] RSP: 0018:ad0ae5d0f8c0 EFLAGS: 0202 [ 901.109704] RAX: RBX: ad0ae5d0f900 RCX: ad11dfc0e000 [ 901.109705] RDX: ad2adcff81c0 RSI: RDI: 9a12f5f4f000 [ 901.109706] RBP: ad0ae5d0f9b0 R08: 0002 R09: 9a1289f57520 [ 901.109707] R10: 005d R11: 0002 R12: 0006d8ff2000 [ 901.109708] R13: 9a12f5f4f000 R14: ad0ae5d0fa18 R15: 9a12f5f4f020 [ 901.109709] FS: 7fb20ad1d740() GS:9a317fe0() knlGS: [ 901.109710] CS: 0010 DS: ES: CR0: 80050033 [ 901.109711] CR2: 7f8041811000 CR3: 000118530006 CR4: 00770ef0 [ 901.109712] DR0: DR1: DR2: [ 901.109713] DR3: DR6: fffe07f0 DR7: 0400 [ 901.109714] PKRU: 5554 [ 901.109714] Call Trace: [ 901.109716] [ 901.109718] ? show_regs+0x67/0x70 [ 901.109722] ? watchdog_timer_fn+0x1f3/0x280 [ 901.109725] ? __pfx_watchdog_timer_fn+0x10/0x10 [ 901.109727] ? __hrtimer_run_queues+0xc8/0x220 [ 901.109731] ? hrtimer_interrupt+0x10c/0x250 [ 901.109733] ? __sysvec_apic_timer_interrupt+0x53/0x130 [ 901.109736] ? sysvec_apic_timer_interrupt+0x7b/0x90 [ 901.109739] [ 901.109740] [ 901.109740] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 [ 901.109745] ? sgx_enclave_restrict_permissions+0xba/0x1f0 [ 901.109747] ? aa_file_perm+0x145/0x550 [ 901.109750] sgx_ioctl+0x1ab/0x900 [ 901.109751] ? xas_find+0x84/0x200 [ 901.109754] ? sgx_enclave_etrack+0xbb/0x140 [ 901.109756] ? sgx_encl_may_map+0x19a/0x240 [ 901.109758] ? common_file_perm+0x8a/0x1b0 [ 901.109760] ? obj_cgroup_charge_pages+0xa2/0x100 [ 901.109763] ? 
tlb_flush_mmu+0x31/0x1c0 [ 901.109766] ? tlb_finish_mmu+0x42/0x80 [ 901.109767] ? do_mprotect_pkey+0x150/0x530 [ 901.109769] ? __fget_light+0xc0/0x100 [ 901.109772] __x64_sys_ioctl+0x95/0xd0 [ 901.109775] x64_sys_call+0x1209/0x20c0 [ 901.109777] do_syscall_64+0x6d/0x110 [ 901.109779] ? syscall_exit_to_user_mode+0x86/0x1c0 [ 901.109782] ? do_syscall_64+0x79/0x110 [ 901.109783] ? syscall_exit_to_user_mode+0x86/0x1c0 [ 901.109784] ? do_syscall_64+0x79/0x110 [ 901.109785] ? free_unref_page+0x10e/0x180 [ 901.109788] ? __do_fault+0x36/0x130 [ 901.109791] ? do_pte_missing+0x2e8/0xcc0 [
[PATCH net-next v6] net/ipv4: add tracepoint for icmp_send
From: Peilin He

Introduce a tracepoint for icmp_send, which can help users conveniently get more detailed information when abnormal ICMP events happen.

1. Giving a use-case example:
=
When an application experiences packet loss due to an unreachable UDP destination port, the kernel will send an exception message through the icmp_send function. By adding a trace point for icmp_send, developers or system administrators can obtain detailed information about the UDP packet loss, including the type, code, source address, destination address, source port, and destination port. This facilitates the troubleshooting of UDP packet loss issues, especially for those network-service applications.

2. Operation Instructions:
==
Switch to the tracing directory.
cd /sys/kernel/tracing
Filter for destination port unreachable.
echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable trace event.
echo 1 > events/icmp/icmp_send/enable

3. Result View:
udp_client_erro-11370 [002] ...s.12 124.728002: icmp_send: icmp_send: type=3, code=3. From 127.0.0.1:41895 to 127.0.0.1: ulen=23 skbaddr=589b167a

Change log
==
v5->v6: Some fixes according to https://lore.kernel.org/all/20240413161319.ga853...@kernel.org/
1. Resubmit patches based on the latest net-next code.

v4->v5: Some fixes according to https://lore.kernel.org/all/CAL+tcoDeXXh+zcRk4PHnUk8ELnx=ce2pccqs7sfm0y9ak-e...@mail.gmail.com/
1. Adjust the position of trace_icmp_send() to before icmp_push_reply().

v3->v4: Some fixes according to https://lore.kernel.org/all/CANn89i+EFEr7VHXNdOi59Ba_R1nFKSBJzBzkJFVgCTdXBx=y...@mail.gmail.com/
1. Add legality check for UDP header in SKB.
2. Target this patch for net-next.

v2->v3: Some fixes according to https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
1. Change the tracing directory to /sys/kernel/tracing.
2. Adjust the layout of the TP_STRUCT__entry parameter structure.
v1->v2: Some fixes according to https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/ 1. adjust the trace_icmp_send() to more protocols than UDP. 2. move the calling of trace_icmp_send after sanity checks in __icmp_send(). Signed-off-by: Peilin He Reviewed-by: Yunkai Zhang Cc: Yang Yang Cc: Liu Chun Cc: Xuexin Jiang Signed-off-by: xu xin --- include/trace/events/icmp.h | 65 + net/ipv4/icmp.c | 4 +++ 2 files changed, 69 insertions(+) create mode 100644 include/trace/events/icmp.h diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h new file mode 100644 index ..7d5190f48a28 --- /dev/null +++ b/include/trace/events/icmp.h @@ -0,0 +1,65 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM icmp + +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_ICMP_H + +#include +#include + +TRACE_EVENT(icmp_send, + + TP_PROTO(const struct sk_buff *skb, int type, int code), + + TP_ARGS(skb, type, code), + + TP_STRUCT__entry( + __field(const void *, skbaddr) + __field(int, type) + __field(int, code) + __array(__u8, saddr, 4) + __array(__u8, daddr, 4) + __field(__u16, sport) + __field(__u16, dport) + __field(unsigned short, ulen) + ), + + TP_fast_assign( + struct iphdr *iph = ip_hdr(skb); + int proto_4 = iph->protocol; + __be32 *p32; + + __entry->skbaddr = skb; + __entry->type = type; + __entry->code = code; + + struct udphdr *uh = udp_hdr(skb); + if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head || + (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) { + __entry->sport = 0; + __entry->dport = 0; + __entry->ulen = 0; + } else { + __entry->sport = ntohs(uh->source); + __entry->dport = ntohs(uh->dest); + __entry->ulen = ntohs(uh->len); + } + + p32 = (__be32 *) __entry->saddr; + *p32 = iph->saddr; + + p32 = (__be32 *) __entry->daddr; + *p32 = iph->daddr; + ), + + TP_printk("icmp_send: type=%d, code=%d. 
From %pI4:%u to %pI4:%u ulen=%d skbaddr=%p", + __entry->type, __entry->code, + __entry->saddr, __entry->sport,
回复: 回复: [PATCH v5] vp_vdpa: don't allocate unused msix vectors
> > When there is a ctlq and it doesn't require interrupt callbacks,the > > original method of calculating vectors wastes hardware msi or msix > > resources as well as system IRQ resources. > > > > When conducting performance testing using testpmd in the guest os, > > it was found that the performance was lower compared to directly > > using vfio-pci to passthrough the device > > > > In scenarios where the virtio device in the guest os does not > > utilize interrupts, the vdpa driver still configures the hardware's > > msix vector. Therefore, the hardware still sends interrupts to the host os. > > >I just have a question on this part. How come hardware sends interrupts does > >not guest driver disable them? > >1:Assuming the guest OS's Virtio device is using PMD mode, QEMU sets the > call fd to -1 >2:On the host side, the vhost_vdpa program will set > vp_vdpa->vring[i].cb.callback to invalid >3:Before the modification, the vp_vdpa_request_irq function does not check > whether > vp_vdpa->vring[i].cb.callback is valid. Instead, it enables the > hardware's MSIX > interrupts based on the number of queues of the device > > So MSIX is enabled but why would it trigger? virtio PMD in poll mode > presumably suppresses interrupts after all. I didn't express the test model clearly. The testing model is as follows:

    testpmd                    testpmd
       ^                          ^
       |                          |
       v                          v
    vfio pci                   vfio pci
   pci device                 pci device
  ---guest os---             ---guest os---
  virtio device               vfio device
  -----qemu-----             -----qemu-----
       ^                          ^
       |                          |
       v                          v
   vhost_vdpa                  vfio pci
  ----host os---             ----host os---
    hw: VF1                    hw: VF2

After the hardware completes DMA, it will still send an interrupt to the host OS. -Original Message- From: Angus Chen angus.c...@jaguarmicro.com Sent: April 23, 2024 16:43 To: Michael S.
Tsirkin m...@redhat.com; Gavin Liu gavin@jaguarmicro.com Cc: jasow...@redhat.com; virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com; linux-kernel@vger.kernel.org; Heng Qi hen...@linux.alibaba.com Subject: RE: Reply: [PATCH v5] vp_vdpa: don't allocate unused msix vectors Hi mst. > -Original Message- > From: Michael S. Tsirkin > Sent: Tuesday, April 23, 2024 4:35 PM > To: Gavin Liu > Cc: jasow...@redhat.com; Angus Chen ; > virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com; > linux-kernel@vger.kernel.org; Heng Qi > Subject: Re: 回复: [PATCH v5] vp_vdpa: don't allocate unused msix > vectors > > On Tue, Apr 23, 2024 at 01:39:17AM +, Gavin Liu wrote: > > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote: > > > From: Yuxue Liu > > > > > > When there is a ctlq and it doesn't require interrupt > > > callbacks,the original method of calculating vectors wastes > > > hardware msi or msix resources as well as system IRQ resources. > > > > > > When conducting performance testing using testpmd in the guest os, > > > it was found that the performance was lower compared to directly > > > using vfio-pci to passthrough the device > > > > > > In scenarios where the virtio device in the guest os does not > > > utilize interrupts, the vdpa driver still configures the > > > hardware's msix vector. Therefore, the hardware still sends interrupts to > > > the host os. > > > > >I just have a question on this part. How come hardware sends > > >interrupts does > not guest driver disable them? > > > >1:Assuming the guest OS's Virtio device is using PMD mode, QEMU > > sets > the call fd to -1 > >2:On the host side, the vhost_vdpa program will set > vp_vdpa->vring[i].cb.callback to invalid > >3:Before the modification, the vp_vdpa_request_irq function does > > not > check whether > > vp_vdpa->vring[i].cb.callback is valid. 
Instead, it enables > > the > hardware's MSIX > > interrupts based on the number of queues of the device > > > > So MSIX is enabled but why would it trigger? virtio PMD in poll mode > presumably suppresses interrupts after all. Virtio pmd is in the guest,but in host side,the msix is enabled,then the device will triger Interrupt normally. I analysed this bug before,and I think gavin is right. Did I make it clear? > > > > > > > - Original
Re: Re: [PATCH v3 2/4] virtio_balloon: introduce oom-kill invocations
On 4/23/24 17:13, Michael S. Tsirkin wrote: On Tue, Apr 23, 2024 at 11:41:07AM +0800, zhenwei pi wrote: [snip] #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ Looks like a useful extension. But any UAPI extension has to go to virtio spec first. Sure, I'll send related virtio spec changes once virtio comment mail list gets ready. @@ -83,7 +84,8 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \ VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ - VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \ + VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ + VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \ } #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("") -- 2.34.1 -- zhenwei pi
Re: [PATCH v3 2/4] virtio_balloon: introduce oom-kill invocations
On Tue, Apr 23, 2024 at 11:41:07AM +0800, zhenwei pi wrote: > When the guest OS runs under critical memory pressure, the guest > starts to kill processes. A guest monitor agent may scan 'oom_kill' > from /proc/vmstat, and reports the OOM KILL event. However, the agent > may be killed and we will loss this critical event(and the later > events). > > For now we can also grep for magic words in guest kernel log from host > side. Rather than this unstable way, virtio balloon reports OOM-KILL > invocations instead. > > Acked-by: David Hildenbrand > Signed-off-by: zhenwei pi > --- > drivers/virtio/virtio_balloon.c | 1 + > include/uapi/linux/virtio_balloon.h | 6 -- > 2 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index 1710e3098ecd..f7a47eaa0936 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -330,6 +330,7 @@ static inline unsigned int update_balloon_vm_stats(struct > virtio_balloon *vb) > pages_to_bytes(events[PSWPOUT])); > update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]); > update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]); > + update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]); > > #ifdef CONFIG_HUGETLB_PAGE > update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, > diff --git a/include/uapi/linux/virtio_balloon.h > b/include/uapi/linux/virtio_balloon.h > index ddaa45e723c4..b17bbe033697 100644 > --- a/include/uapi/linux/virtio_balloon.h > +++ b/include/uapi/linux/virtio_balloon.h > @@ -71,7 +71,8 @@ struct virtio_balloon_config { > #define VIRTIO_BALLOON_S_CACHES 7 /* Disk caches */ > #define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page allocations */ > #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation > failures */ > -#define VIRTIO_BALLOON_S_NR 10 > +#define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ > +#define VIRTIO_BALLOON_S_NR 11 > > #define 
VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ > VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ Looks like a useful extension. But any UAPI extension has to go to virtio spec first. > @@ -83,7 +84,8 @@ struct virtio_balloon_config { > VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \ > VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ > VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ > - VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \ > + VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ > + VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \ > } > > #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("") > -- > 2.34.1
RE: 回复: [PATCH v5] vp_vdpa: don't allocate unused msix vectors
Hi mst. > -Original Message- > From: Michael S. Tsirkin > Sent: Tuesday, April 23, 2024 4:35 PM > To: Gavin Liu > Cc: jasow...@redhat.com; Angus Chen ; > virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com; > linux-kernel@vger.kernel.org; Heng Qi > Subject: Re: 回复: [PATCH v5] vp_vdpa: don't allocate unused msix vectors > > On Tue, Apr 23, 2024 at 01:39:17AM +, Gavin Liu wrote: > > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote: > > > From: Yuxue Liu > > > > > > When there is a ctlq and it doesn't require interrupt > > > callbacks,the original method of calculating vectors wastes > > > hardware msi or msix resources as well as system IRQ resources. > > > > > > When conducting performance testing using testpmd in the guest os, it > > > was found that the performance was lower compared to directly using > > > vfio-pci to passthrough the device > > > > > > In scenarios where the virtio device in the guest os does not utilize > > > interrupts, the vdpa driver still configures the hardware's msix > > > vector. Therefore, the hardware still sends interrupts to the host os. > > > > >I just have a question on this part. How come hardware sends > > >interrupts does > not guest driver disable them? > > > >1:Assuming the guest OS's Virtio device is using PMD mode, QEMU sets > the call fd to -1 > >2:On the host side, the vhost_vdpa program will set > vp_vdpa->vring[i].cb.callback to invalid > >3:Before the modification, the vp_vdpa_request_irq function does > > not > check whether > > vp_vdpa->vring[i].cb.callback is valid. Instead, it enables the > hardware's MSIX > > interrupts based on the number of queues of the device > > > > So MSIX is enabled but why would it trigger? virtio PMD in poll mode > presumably suppresses interrupts after all. Virtio PMD is in the guest, but on the host side the MSI-X is enabled, so the device will trigger interrupts normally. I analysed this bug before, and I think Gavin is right. Did I make it clear?
> > > > > > > - Original Message - > > From: Michael S. Tsirkin m...@redhat.com > > Sent: April 22, 2024 20:09 > > To: Gavin Liu gavin@jaguarmicro.com > > Cc: jasow...@redhat.com; Angus Chen angus.c...@jaguarmicro.com; > virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com; > linux-kernel@vger.kernel.org; Heng Qi hen...@linux.alibaba.com > > Subject: Re: [PATCH v5] vp_vdpa: don't allocate unused msix vectors > > > > > > > > External Mail: This email originated from OUTSIDE of the organization! > > Do not click links, open attachments or provide ANY information unless you > recognize the sender and know the content is safe. > > > > > > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote: > > > From: Yuxue Liu > > > > > > When there is a ctlq and it doesn't require interrupt callbacks,the > > > original method of calculating vectors wastes hardware msi or msix > > > resources as well as system IRQ resources. > > > > > > When conducting performance testing using testpmd in the guest os, it > > > was found that the performance was lower compared to directly using > > > vfio-pci to passthrough the device > > > > > > In scenarios where the virtio device in the guest os does not utilize > > > interrupts, the vdpa driver still configures the hardware's msix > > > vector. Therefore, the hardware still sends interrupts to the host os. > > > > I just have a question on this part. How come hardware sends interrupts does > not guest driver disable them? > > > > > Because of this unnecessary > > > action by the hardware, hardware performance decreases, and it also > > > affects the performance of the host os. 
> > > > > > Before modification:(interrupt mode) > > > 32: 0 0 0 0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0 > > > 33: 0 0 0 0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1 > > > 34: 0 0 0 0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2 > > > 35: 0 0 0 0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config > > > > > > After modification:(interrupt mode) > > > 32: 0 0 1 7 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-0 > > > 33: 36 0 3 0 PCI-MSI 32769-edge vp-vdpa[:00:02.0]-1 > > > 34: 0 0 0 0 PCI-MSI 32770-edge vp-vdpa[:00:02.0]-config > > > > > > Before modification:(virtio pmd mode for guest os) > > > 32: 0 0 0 0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0 > > > 33: 0 0 0 0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1 > > > 34: 0 0 0 0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2 > > > 35: 0 0 0 0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config > > > > > > After modification:(virtio pmd mode for guest os) > > > 32: 0 0 0 0 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-config > > > > > > To verify the use of the virtio PMD mode in the guest operating > > > system, the following patch needs to be applied to QEMU: > > > https://lore.kernel.org/all/20240408073311.2049-1-yuxue.liu@jaguarmicr > > > o.com > > > > > > Signed-off-by:
Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection
On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote: > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin wrote: > > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote: > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin > > > wrote: > > > > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote: > > > > > Add the function vduse_alloc_reconnnect_info_mem > > > > > and vduse_alloc_reconnnect_info_mem > > > > > These functions allow vduse to allocate and free memory for > > > > > reconnection > > > > > information. The amount of memory allocated is vq_num pages. > > > > > Each VQS will map its own page where the reconnection information > > > > > will be saved > > > > > > > > > > Signed-off-by: Cindy Lu > > > > > --- > > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 40 > > > > > ++ > > > > > 1 file changed, 40 insertions(+) > > > > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > index ef3c9681941e..2da659d5f4a8 100644 > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue { > > > > > int irq_effective_cpu; > > > > > struct cpumask irq_affinity; > > > > > struct kobject kobj; > > > > > + unsigned long vdpa_reconnect_vaddr; > > > > > }; > > > > > > > > > > struct vduse_dev; > > > > > @@ -1105,6 +1106,38 @@ static void > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq) > > > > > > > > > > vq->irq_effective_cpu = curr_cpu; > > > > > } > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev) > > > > > +{ > > > > > + unsigned long vaddr = 0; > > > > > + struct vduse_virtqueue *vq; > > > > > + > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > + /*page 0~ vq_num save the reconnect info for vq*/ > > > > > + vq = dev->vqs[i]; > > > > > + vaddr = get_zeroed_page(GFP_KERNEL); > > > > > > > > > > > > I don't get why you insist on 
stealing kernel memory for something > > > > that is just used by userspace to store data for its own use. > > > > Userspace does not lack ways to persist data, for example, > > > > create a regular file anywhere in the filesystem. > > > > > > Good point. So the motivation here is to: > > > > > > 1) be self contained, no dependency for high speed persist data > > > storage like tmpfs > > > > No idea what this means. > > I mean a regular file may slow down the datapath performance, so > usually the application will try to use tmpfs and other which is a > dependency for implementing the reconnection. Are we worried about systems without tmpfs now? > > > > > 2) standardize the format in uAPI which allows reconnection from > > > arbitrary userspace, unfortunately, such effort was removed in new > > > versions > > > > And I don't see why that has to live in the kernel tree either. > > I can't find a better place, any idea? > > Thanks Well anywhere on github really. with libvhost-user maybe? It's harmless enough in Documentation if you like but ties you to the kernel release cycle in a way that is completely unnecessary. > > > > > If the above doesn't make sense, we don't need to offer those pages by > > > VDUSE. 
> > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > + if (vaddr == 0) > > > > > + return -ENOMEM; > > > > > + > > > > > + vq->vdpa_reconnect_vaddr = vaddr; > > > > > + } > > > > > + > > > > > + return 0; > > > > > +} > > > > > + > > > > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev) > > > > > +{ > > > > > + struct vduse_virtqueue *vq; > > > > > + > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > + vq = dev->vqs[i]; > > > > > + > > > > > + if (vq->vdpa_reconnect_vaddr) > > > > > + free_page(vq->vdpa_reconnect_vaddr); > > > > > + vq->vdpa_reconnect_vaddr = 0; > > > > > + } > > > > > + > > > > > + return 0; > > > > > +} > > > > > > > > > > static long vduse_dev_ioctl(struct file *file, unsigned int cmd, > > > > > unsigned long arg) > > > > > @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name) > > > > > mutex_unlock(>lock); > > > > > return -EBUSY; > > > > > } > > > > > + vduse_free_reconnnect_info_mem(dev); > > > > > + > > > > > dev->connected = true; > > > > > mutex_unlock(>lock); > > > > > > > > > > @@ -1855,12 +1890,17 @@ static int vduse_create_dev(struct > > > > > vduse_dev_config *config, > > > > > ret = vduse_dev_init_vqs(dev, config->vq_align, config->vq_num); > > > > > if (ret) > > > > > goto err_vqs; > > > > > + ret = vduse_alloc_reconnnect_info_mem(dev); > > > > > + if (ret < 0) > > > > > + goto err_mem; > > > > > >
Re: 回复: [PATCH v5] vp_vdpa: don't allocate unused msix vectors
On Tue, Apr 23, 2024 at 01:39:17AM +, Gavin Liu wrote: > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote: > > From: Yuxue Liu > > > > When there is a ctlq and it doesn't require interrupt callbacks,the > > original method of calculating vectors wastes hardware msi or msix > > resources as well as system IRQ resources. > > > > When conducting performance testing using testpmd in the guest os, it > > was found that the performance was lower compared to directly using > > vfio-pci to passthrough the device > > > > In scenarios where the virtio device in the guest os does not utilize > > interrupts, the vdpa driver still configures the hardware's msix > > vector. Therefore, the hardware still sends interrupts to the host os. > > >I just have a question on this part. How come hardware sends interrupts does > >not guest driver disable them? > >1:Assuming the guest OS's Virtio device is using PMD mode, QEMU sets the > call fd to -1 >2:On the host side, the vhost_vdpa program will set > vp_vdpa->vring[i].cb.callback to invalid >3:Before the modification, the vp_vdpa_request_irq function does not check > whether > vp_vdpa->vring[i].cb.callback is valid. Instead, it enables the > hardware's MSIX > interrupts based on the number of queues of the device > So MSIX is enabled but why would it trigger? virtio PMD in poll mode presumably suppresses interrupts after all. > > > - Original Message - > From: Michael S. Tsirkin m...@redhat.com > Sent: April 22, 2024 20:09 > To: Gavin Liu gavin@jaguarmicro.com > Cc: jasow...@redhat.com; Angus Chen angus.c...@jaguarmicro.com; > virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com; > linux-kernel@vger.kernel.org; Heng Qi hen...@linux.alibaba.com > Subject: Re: [PATCH v5] vp_vdpa: don't allocate unused msix vectors > > > > External Mail: This email originated from OUTSIDE of the organization! > Do not click links, open attachments or provide ANY information unless you > recognize the sender and know the content is safe. 
> > > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote: > > From: Yuxue Liu > > > > When there is a ctlq and it doesn't require interrupt callbacks, the > > original method of calculating vectors wastes hardware msi or msix > > resources as well as system IRQ resources. > > > > When conducting performance testing using testpmd in the guest os, it > > was found that the performance was lower compared to directly using > > vfio-pci to passthrough the device. > > > > In scenarios where the virtio device in the guest os does not utilize > > interrupts, the vdpa driver still configures the hardware's msix > > vector. Therefore, the hardware still sends interrupts to the host os. > > I just have a question on this part. If the hardware sends interrupts, why > does the guest driver not disable them? > > > Because of this unnecessary > > action by the hardware, hardware performance decreases, and it also > > affects the performance of the host os. > > > > Before modification (interrupt mode): > > 32: 0 0 0 0 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-0 > > 33: 0 0 0 0 PCI-MSI 32769-edge vp-vdpa[:00:02.0]-1 > > 34: 0 0 0 0 PCI-MSI 32770-edge vp-vdpa[:00:02.0]-2 > > 35: 0 0 0 0 PCI-MSI 32771-edge vp-vdpa[:00:02.0]-config > > > > After modification (interrupt mode): > > 32: 0 0 1 7 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-0 > > 33: 36 0 3 0 PCI-MSI 32769-edge vp-vdpa[:00:02.0]-1 > > 34: 0 0 0 0 PCI-MSI 32770-edge vp-vdpa[:00:02.0]-config > > > > Before modification (virtio pmd mode for guest os): > > 32: 0 0 0 0 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-0 > > 33: 0 0 0 0 PCI-MSI 32769-edge vp-vdpa[:00:02.0]-1 > > 34: 0 0 0 0 PCI-MSI 32770-edge vp-vdpa[:00:02.0]-2 > > 35: 0 0 0 0 PCI-MSI 32771-edge vp-vdpa[:00:02.0]-config > > > > After modification (virtio pmd mode for guest os): > > 32: 0 0 0 0 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-config > > > > To verify the use of the virtio PMD mode in the guest operating > > system, the following patch needs to be applied to QEMU: > > 
https://lore.kernel.org/all/20240408073311.2049-1-yuxue.liu@jaguarmicro.com > > > > Signed-off-by: Yuxue Liu > > Acked-by: Jason Wang > > Reviewed-by: Heng Qi > > --- > > V5: modify the description of the printout when an exception occurs > > V4: update the title and assign values to uninitialized variables > > V3: delete unused variables and add validation records > > V2: fix when allocating IRQs, scan all queues > > > > drivers/vdpa/virtio_pci/vp_vdpa.c | 22 ++++++++++++++++------ > > 1 file changed, 16 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c > > b/drivers/vdpa/virtio_pci/vp_vdpa.c > > index df5f4a3bccb5..8de0224e9ec2 100644 > > --- a/drivers/vdpa/virtio_pci/vp_vdpa.c > > +++
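The reasoning in the thread above boils down to a simple counting rule: a virtqueue whose callback has been cleared (call fd set to -1 by QEMU, guest running in PMD/poll mode) needs no MSI-X vector, so only queues with a valid callback plus the config-change interrupt should be counted. A minimal userspace sketch of that rule (hypothetical helper, not the vp_vdpa driver code):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the patch's idea: a virtqueue only needs its own MSI-X
 * vector when it has an interrupt callback; queues run in poll mode
 * (callback cleared) are skipped.  One extra vector is always reserved
 * for config-change interrupts. */
static int msix_vectors_needed(const int *has_callback, size_t nvqs)
{
	int vectors = 0;

	for (size_t i = 0; i < nvqs; i++)
		if (has_callback[i])
			vectors++;	/* one vector per interrupting queue */

	return vectors + 1;		/* plus the config vector */
}
```

In the "virtio pmd mode" case above, all data queues are polled, so only the single config vector remains, which matches the single `-config` line in the "After modification" /proc/interrupts output.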
[PATCH v2 1/1] genirq/cpuhotplug: retry with cpu_online_mask when irq_do_set_affinity return -ENOSPC
When a CPU goes offline, the interrupts pinned to that CPU are re-configured. Its managed interrupts undergo either migration to other CPUs or shutdown if all CPUs listed in the affinity are offline. This patch doesn't affect managed interrupts. Regular interrupts are migrated to other selected online CPUs. The target CPUs are chosen from either desc->pending_mask (with CONFIG_GENERIC_PENDING_IRQ) or d->common->affinity (with CONFIG_SMP). The cpu_online_mask is used as the target CPUs only when the CPUs in both desc->pending_mask and d->common->affinity are offline. However, there is a bad corner case: when desc->pending_mask or d->common->affinity is selected as the target cpumask but none of its CPUs has any available vectors, the migration fails and the device interrupt becomes stale. This is no different from the case where the affinity mask does not contain any online CPU, yet there is no fallback operation for it. Instead of giving up, retry the migration attempt with the online CPU mask if the interrupt is not managed, as managed interrupts cannot be affected by this problem. Cc: Joe Jin Signed-off-by: Dongli Zhang [tglx: massage some changelog] --- Changed since v1: - Re-work the commit message - Move pr_debug before setting affinity - Remove 'all' from pr_debug message kernel/irq/cpuhotplug.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c index 1ed2b1739363..19babb914949 100644 --- a/kernel/irq/cpuhotplug.c +++ b/kernel/irq/cpuhotplug.c @@ -130,6 +130,17 @@ static bool migrate_one_irq(struct irq_desc *desc) * CPU. 
*/ err = irq_do_set_affinity(d, affinity, false); + + if (err == -ENOSPC && !irqd_affinity_is_managed(d) && affinity != cpu_online_mask) { + pr_debug("IRQ%u: set affinity failed for %*pbl, re-try with online CPUs\n", +d->irq, cpumask_pr_args(affinity)); + + affinity = cpu_online_mask; + brokeaff = true; + + err = irq_do_set_affinity(d, affinity, false); + } + if (err) { pr_warn_ratelimited("IRQ%u: set affinity failed(%d).\n", d->irq, err); -- 2.34.1
Re: [PATCH v12 13/14] Docs/x86/sgx: Add description for cgroup support
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote: > From: Sean Christopherson > > Add initial documentation of how to regulate the distribution of > SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup > controller. > > Acked-by: Kai Huang
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
On 23.04.24 00:15, Mauro Carvalho Chehab wrote: >> sta...@kernel.org is there to route to /dev/null on purpose so that >> developers/maintainers who only want their patches to get picked up when >> they hit Linus's tree, will have that happen and not notify anyone else. >> This is especially good when dealing with security-related things as we >> have had MANY people accidentally leak patches way too early by having >> cc: sta...@vger.kernel.org in their signed-off-by areas, and forgetting >> to tell git send-email to suppress cc: when sending them out for >> internal review. > Nice! I didn't know about that. On a quick check, the only place in the > documentation mentioning it without vger is checkpatch.rst. > > Perhaps it would make sense to document that as well. Maybe something like the below? Will add that to my next patch set unless I hear complaints. Ciao, Thorsten --- diff --git a/Documentation/process/stable-kernel-rules.rst b/Documentation/process/stable-kernel-rules.rst index 727ad7f758e3e0..5a47ed06081e41 100644 --- a/Documentation/process/stable-kernel-rules.rst +++ b/Documentation/process/stable-kernel-rules.rst @@ -72,6 +72,10 @@ for stable trees, add this tag in the sign-off area:: Cc: sta...@vger.kernel.org +Use ``Cc: sta...@kernel.org`` instead when fixing an unpublished vulnerability: +it reduces the chance of someone exposing the fix to the public by way of +'git send-email', as mails sent to that address are not delivered anywhere. + Once the patch is mainlined it will be applied to the stable tree without anything else needing to be done by the author or subsystem maintainer.
[PATCH net-next v8 7/7] rstreason: make it work in trace world
From: Jason Xing At last, we should let it work by introducing this reset reason in trace world. One of the possible expected outputs is: ... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx state=TCP_ESTABLISHED reason=NOT_SPECIFIED Signed-off-by: Jason Xing Reviewed-by: Steven Rostedt (Google) --- include/trace/events/tcp.h | 26 ++ net/ipv4/tcp_ipv4.c| 2 +- net/ipv4/tcp_output.c | 2 +- net/ipv6/tcp_ipv6.c| 2 +- 4 files changed, 25 insertions(+), 7 deletions(-) diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h index 5c04a61a11c2..49b5ee091cf6 100644 --- a/include/trace/events/tcp.h +++ b/include/trace/events/tcp.h @@ -11,6 +11,7 @@ #include #include #include +#include /* * tcp event with arguments sk and skb @@ -74,20 +75,32 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb, TP_ARGS(sk, skb) ); +#undef FN +#define FN(reason) TRACE_DEFINE_ENUM(SK_RST_REASON_##reason); +DEFINE_RST_REASON(FN, FN) + +#undef FN +#undef FNe +#define FN(reason) { SK_RST_REASON_##reason, #reason }, +#define FNe(reason){ SK_RST_REASON_##reason, #reason } + /* * skb of trace_tcp_send_reset is the skb that caused RST. 
In case of * active reset, skb should be NULL */ TRACE_EVENT(tcp_send_reset, - TP_PROTO(const struct sock *sk, const struct sk_buff *skb), + TP_PROTO(const struct sock *sk, +const struct sk_buff *skb, +const enum sk_rst_reason reason), - TP_ARGS(sk, skb), + TP_ARGS(sk, skb, reason), TP_STRUCT__entry( __field(const void *, skbaddr) __field(const void *, skaddr) __field(int, state) + __field(enum sk_rst_reason, reason) __array(__u8, saddr, sizeof(struct sockaddr_in6)) __array(__u8, daddr, sizeof(struct sockaddr_in6)) ), @@ -113,14 +126,19 @@ TRACE_EVENT(tcp_send_reset, */ TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, entry->saddr); } + __entry->reason = reason; ), - TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s", + TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s reason=%s", __entry->skbaddr, __entry->skaddr, __entry->saddr, __entry->daddr, - __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN") + __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN", + __print_symbolic(__entry->reason, DEFINE_RST_REASON(FN, FNe))) ); +#undef FN +#undef FNe + /* * tcp event with arguments sk * diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 6bd3a0fb9439..6096ac7a3a02 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -871,7 +871,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb, if (sk) arg.bound_dev_if = sk->sk_bound_dev_if; - trace_tcp_send_reset(sk, skb); + trace_tcp_send_reset(sk, skb, reason); BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) != offsetof(struct inet_timewait_sock, tw_bound_dev_if)); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 276d9d541b01..b08ffb17d5a0 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3612,7 +3612,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority, /* skb of trace_tcp_send_reset() keeps the skb that caused RST, * skb here is different to the troublesome skb, so use NULL */ - 
trace_tcp_send_reset(sk, NULL); + trace_tcp_send_reset(sk, NULL, SK_RST_REASON_NOT_SPECIFIED); } /* Send a crossed SYN-ACK during socket establishment. diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 317d7a6e6b01..77958adf2e16 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1133,7 +1133,7 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb, label = ip6_flowlabel(ipv6h); } - trace_tcp_send_reset(sk, skb); + trace_tcp_send_reset(sk, skb, reason); tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1, ipv6_get_dsfield(ipv6h), label, priority, txhash, -- 2.37.3
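The FN()/FNe() macros in the tracepoint patch above are the usual X-macro pattern for trace events: one reason list is expanded once into TRACE_DEFINE_ENUM() calls and once into the { value, "name" } pairs that __print_symbolic() consumes, so the enum and its string table cannot drift apart. A minimal userspace sketch of the same pattern (demo names, not the kernel header):

```c
#include <assert.h>
#include <string.h>

/* One list, expanded twice: FN() for entries followed by a separator,
 * FNe() for the final entry without one. */
#define DEFINE_RST_DEMO(FN, FNe)	\
	FN(NOT_SPECIFIED)		\
	FN(NO_SOCKET)			\
	FNe(ERROR)

/* Expansion 1: the enum itself. */
#define ENUM_FN(reason)  DEMO_RST_##reason,
#define ENUM_FNe(reason) DEMO_RST_##reason
enum demo_rst_reason { DEFINE_RST_DEMO(ENUM_FN, ENUM_FNe) };

/* Expansion 2: the symbolic name table, shaped like what
 * __print_symbolic() expects. */
#define NAME_FN(reason)  { DEMO_RST_##reason, #reason },
#define NAME_FNe(reason) { DEMO_RST_##reason, #reason }
static const struct { int val; const char *name; } demo_rst_names[] = {
	DEFINE_RST_DEMO(NAME_FN, NAME_FNe)
};
```

Because both expansions come from the single DEFINE_RST_DEMO list, adding a reason in one place updates the enum and the printed name together.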
[PATCH net-next v8 6/7] mptcp: introducing a helper into active reset logic
From: Jason Xing Since we have mapped every mptcp reset reason definition in enum sk_rst_reason, introducing a new helper can cover some missing places where we have already set the subflow->reset_reason. Note: using SK_RST_REASON_NOT_SPECIFIED is the same as SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown. So we can convert it directly. Suggested-by: Paolo Abeni Signed-off-by: Jason Xing --- Link: https://lore.kernel.org/all/2d3ea199eef53cf6a0c48e21abdee0eefbdee927.ca...@redhat.com/ --- net/mptcp/protocol.c | 4 +--- net/mptcp/protocol.h | 11 +++ net/mptcp/subflow.c | 6 ++ 3 files changed, 14 insertions(+), 7 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 065967086492..4b13ca362efa 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -21,7 +21,6 @@ #endif #include #include -#include #include #include "protocol.h" #include "mib.h" @@ -2570,8 +2569,7 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk) slow = lock_sock_fast(tcp_sk); if (tcp_sk->sk_state != TCP_CLOSE) { - tcp_send_active_reset(tcp_sk, GFP_ATOMIC, - SK_RST_REASON_NOT_SPECIFIED); + mptcp_send_active_reset_reason(tcp_sk); tcp_set_state(tcp_sk, TCP_CLOSE); } unlock_sock_fast(tcp_sk, slow); diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index bbcb8c068aae..d40ad4a2f1b8 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -12,6 +12,7 @@ #include #include #include +#include #include "mptcp_pm_gen.h" @@ -609,6 +610,16 @@ sk_rst_convert_mptcp_reason(u32 reason) } } +static inline void +mptcp_send_active_reset_reason(struct sock *sk) +{ + struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk); + enum sk_rst_reason reason; + + reason = sk_rst_convert_mptcp_reason(subflow->reset_reason); + tcp_send_active_reset(sk, GFP_ATOMIC, reason); +} + static inline u64 mptcp_subflow_get_map_offset(const struct mptcp_subflow_context *subflow) { diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index fb7abf2d01ca..97ec44d1df30 100644 --- 
a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -20,7 +20,6 @@ #include #endif #include -#include #include "protocol.h" #include "mib.h" @@ -424,7 +423,7 @@ void mptcp_subflow_reset(struct sock *ssk) /* must hold: tcp_done() could drop last reference on parent */ sock_hold(sk); - tcp_send_active_reset(ssk, GFP_ATOMIC, SK_RST_REASON_NOT_SPECIFIED); + mptcp_send_active_reset_reason(ssk); tcp_done(ssk); if (!test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW, &mptcp_sk(sk)->flags)) mptcp_schedule_work(sk); @@ -1362,8 +1361,7 @@ static bool subflow_check_data_avail(struct sock *ssk) tcp_set_state(ssk, TCP_CLOSE); while ((skb = skb_peek(&ssk->sk_receive_queue))) sk_eat_skb(ssk, skb); - tcp_send_active_reset(ssk, GFP_ATOMIC, - SK_RST_REASON_NOT_SPECIFIED); + mptcp_send_active_reset_reason(ssk); WRITE_ONCE(subflow->data_avail, false); return false; } -- 2.37.3
[PATCH net-next v8 5/7] mptcp: support rstreason for passive reset
From: Jason Xing It relies on what the reset options in the skb are, as RFC 8684 says. Reusing this logic can save us much effort. This patch replaces most of the prior NOT_SPECIFIED reasons. Signed-off-by: Jason Xing --- net/mptcp/protocol.h | 28 ++++++++++++++++++++++++++++ net/mptcp/subflow.c | 22 +++++++++++++++++----- 2 files changed, 45 insertions(+), 5 deletions(-) diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index fdfa843e2d88..bbcb8c068aae 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -581,6 +581,34 @@ mptcp_subflow_ctx_reset(struct mptcp_subflow_context *subflow) WRITE_ONCE(subflow->local_id, -1); } +/* Convert reset reasons in MPTCP to enum sk_rst_reason type */ +static inline enum sk_rst_reason +sk_rst_convert_mptcp_reason(u32 reason) +{ + switch (reason) { + case MPTCP_RST_EUNSPEC: + return SK_RST_REASON_MPTCP_RST_EUNSPEC; + case MPTCP_RST_EMPTCP: + return SK_RST_REASON_MPTCP_RST_EMPTCP; + case MPTCP_RST_ERESOURCE: + return SK_RST_REASON_MPTCP_RST_ERESOURCE; + case MPTCP_RST_EPROHIBIT: + return SK_RST_REASON_MPTCP_RST_EPROHIBIT; + case MPTCP_RST_EWQ2BIG: + return SK_RST_REASON_MPTCP_RST_EWQ2BIG; + case MPTCP_RST_EBADPERF: + return SK_RST_REASON_MPTCP_RST_EBADPERF; + case MPTCP_RST_EMIDDLEBOX: + return SK_RST_REASON_MPTCP_RST_EMIDDLEBOX; + default: + /** +* It should not happen, or else errors may occur +* in MPTCP layer +*/ + return SK_RST_REASON_ERROR; + } +} + static inline u64 mptcp_subflow_get_map_offset(const struct mptcp_subflow_context *subflow) { diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index ac867d277860..fb7abf2d01ca 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -309,8 +309,13 @@ static struct dst_entry *subflow_v4_route_req(const struct sock *sk, return dst; dst_release(dst); - if (!req->syncookie) - tcp_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + if (!req->syncookie) { + struct mptcp_ext *mpext = mptcp_get_ext(skb); + enum sk_rst_reason reason; + + reason = sk_rst_convert_mptcp_reason(mpext->reset_reason); + 
tcp_request_sock_ops.send_reset(sk, skb, reason); + } return NULL; } @@ -377,8 +382,13 @@ static struct dst_entry *subflow_v6_route_req(const struct sock *sk, return dst; dst_release(dst); - if (!req->syncookie) - tcp6_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + if (!req->syncookie) { + struct mptcp_ext *mpext = mptcp_get_ext(skb); + enum sk_rst_reason reason; + + reason = sk_rst_convert_mptcp_reason(mpext->reset_reason); + tcp6_request_sock_ops.send_reset(sk, skb, reason); + } return NULL; } #endif @@ -783,6 +793,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, struct mptcp_subflow_request_sock *subflow_req; struct mptcp_options_received mp_opt; bool fallback, fallback_is_fatal; + enum sk_rst_reason reason; struct mptcp_sock *owner; struct sock *child; @@ -913,7 +924,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, tcp_rsk(req)->drop_req = true; inet_csk_prepare_for_destroy_sock(child); tcp_done(child); - req->rsk_ops->send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + reason = sk_rst_convert_mptcp_reason(mptcp_get_ext(skb)->reset_reason); + req->rsk_ops->send_reset(sk, skb, reason); /* The last child reference will be released by the caller */ return child; -- 2.37.3
[PATCH net-next v8 4/7] tcp: support rstreason for passive reset
From: Jason Xing Reuse the dropreason logic to show the exact reason of a TCP reset, so we can finally display the corresponding item in enum sk_rst_reason instead of reinventing new reset reasons. This patch replaces all the prior NOT_SPECIFIED reasons. Signed-off-by: Jason Xing --- include/net/rstreason.h | 15 +++ net/ipv4/tcp_ipv4.c | 11 +++ net/ipv6/tcp_ipv6.c | 11 +++ 3 files changed, 29 insertions(+), 8 deletions(-) diff --git a/include/net/rstreason.h b/include/net/rstreason.h index bc53b5a24505..df3b6ac0c9b3 100644 --- a/include/net/rstreason.h +++ b/include/net/rstreason.h @@ -103,4 +103,19 @@ enum sk_rst_reason { */ SK_RST_REASON_MAX, }; + +/* Convert skb drop reasons to enum sk_rst_reason type */ +static inline enum sk_rst_reason +sk_rst_convert_drop_reason(enum skb_drop_reason reason) +{ + switch (reason) { + case SKB_DROP_REASON_NOT_SPECIFIED: + return SK_RST_REASON_NOT_SPECIFIED; + case SKB_DROP_REASON_NO_SOCKET: + return SK_RST_REASON_NO_SOCKET; + default: + /* If we don't have our own corresponding reason */ + return SK_RST_REASON_NOT_SPECIFIED; + } +} #endif diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 418d11902fa7..6bd3a0fb9439 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1936,7 +1936,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) return 0; reset: - tcp_v4_send_reset(rsk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(rsk, skb, sk_rst_convert_drop_reason(reason)); discard: kfree_skb_reason(skb, reason); /* Be careful here. 
If this function gets more complicated and @@ -2278,7 +2278,10 @@ int tcp_v4_rcv(struct sk_buff *skb) } else { drop_reason = tcp_child_process(sk, nsk, skb); if (drop_reason) { - tcp_v4_send_reset(nsk, skb, SK_RST_REASON_NOT_SPECIFIED); + enum sk_rst_reason rst_reason; + + rst_reason = sk_rst_convert_drop_reason(drop_reason); + tcp_v4_send_reset(nsk, skb, rst_reason); goto discard_and_relse; } sock_put(sk); @@ -2357,7 +2360,7 @@ int tcp_v4_rcv(struct sk_buff *skb) bad_packet: __TCP_INC_STATS(net, TCP_MIB_INERRS); } else { - tcp_v4_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(NULL, skb, sk_rst_convert_drop_reason(drop_reason)); } discard_it: @@ -2409,7 +2412,7 @@ int tcp_v4_rcv(struct sk_buff *skb) tcp_v4_timewait_ack(sk, skb); break; case TCP_TW_RST: - tcp_v4_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(sk, skb, sk_rst_convert_drop_reason(drop_reason)); inet_twsk_deschedule_put(inet_twsk(sk)); goto discard_it; case TCP_TW_SUCCESS:; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 017f6293b5f4..317d7a6e6b01 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1680,7 +1680,7 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) return 0; reset: - tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(sk, skb, sk_rst_convert_drop_reason(reason)); discard: if (opt_skb) __kfree_skb(opt_skb); @@ -1865,7 +1865,10 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) } else { drop_reason = tcp_child_process(sk, nsk, skb); if (drop_reason) { - tcp_v6_send_reset(nsk, skb, SK_RST_REASON_NOT_SPECIFIED); + enum sk_rst_reason rst_reason; + + rst_reason = sk_rst_convert_drop_reason(drop_reason); + tcp_v6_send_reset(nsk, skb, rst_reason); goto discard_and_relse; } sock_put(sk); @@ -1942,7 +1945,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) bad_packet: __TCP_INC_STATS(net, TCP_MIB_INERRS); } else { - tcp_v6_send_reset(NULL, skb, 
SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(NULL, skb, sk_rst_convert_drop_reason(drop_reason)); } discard_it: @@ -1998,7 +2001,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) tcp_v6_timewait_ack(sk, skb); break; case TCP_TW_RST: - tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(sk, skb, sk_rst_convert_drop_reason(drop_reason));
[PATCH net-next v8 3/7] rstreason: prepare for active reset
From: Jason Xing Like what we did to passive reset: only passing possible reset reason in each active reset path. No functional changes. Signed-off-by: Jason Xing --- include/net/tcp.h | 3 ++- net/ipv4/tcp.c| 15 ++- net/ipv4/tcp_output.c | 3 ++- net/ipv4/tcp_timer.c | 9 ++--- net/mptcp/protocol.c | 4 +++- net/mptcp/subflow.c | 5 +++-- 6 files changed, 26 insertions(+), 13 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index b935e1ae4caf..adeacc9aa28a 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -670,7 +670,8 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue, void tcp_send_probe0(struct sock *); int tcp_write_wakeup(struct sock *, int mib); void tcp_send_fin(struct sock *sk); -void tcp_send_active_reset(struct sock *sk, gfp_t priority); +void tcp_send_active_reset(struct sock *sk, gfp_t priority, + enum sk_rst_reason reason); int tcp_send_synack(struct sock *); void tcp_push_one(struct sock *, unsigned int mss_now); void __tcp_send_ack(struct sock *sk, u32 rcv_nxt); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index f23b9ea5..4ec0f4feee00 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -275,6 +275,7 @@ #include #include #include +#include #include #include @@ -2811,7 +2812,8 @@ void __tcp_close(struct sock *sk, long timeout) /* Unread data was tossed, zap the connection. */ NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE); tcp_set_state(sk, TCP_CLOSE); - tcp_send_active_reset(sk, sk->sk_allocation); + tcp_send_active_reset(sk, sk->sk_allocation, + SK_RST_REASON_NOT_SPECIFIED); } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) { /* Check zero linger _after_ checking for unread data. 
*/ sk->sk_prot->disconnect(sk, 0); @@ -2885,7 +2887,8 @@ void __tcp_close(struct sock *sk, long timeout) struct tcp_sock *tp = tcp_sk(sk); if (READ_ONCE(tp->linger2) < 0) { tcp_set_state(sk, TCP_CLOSE); - tcp_send_active_reset(sk, GFP_ATOMIC); + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONLINGER); } else { @@ -2903,7 +2906,8 @@ void __tcp_close(struct sock *sk, long timeout) if (sk->sk_state != TCP_CLOSE) { if (tcp_check_oom(sk, 0)) { tcp_set_state(sk, TCP_CLOSE); - tcp_send_active_reset(sk, GFP_ATOMIC); + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY); } else if (!check_net(sock_net(sk))) { @@ -3007,7 +3011,7 @@ int tcp_disconnect(struct sock *sk, int flags) /* The last check adjusts for discrepancy of Linux wrt. RFC * states */ - tcp_send_active_reset(sk, gfp_any()); + tcp_send_active_reset(sk, gfp_any(), SK_RST_REASON_NOT_SPECIFIED); WRITE_ONCE(sk->sk_err, ECONNRESET); } else if (old_state == TCP_SYN_SENT) WRITE_ONCE(sk->sk_err, ECONNRESET); @@ -4564,7 +4568,8 @@ int tcp_abort(struct sock *sk, int err) smp_wmb(); sk_error_report(sk); if (tcp_need_reset(sk->sk_state)) - tcp_send_active_reset(sk, GFP_ATOMIC); + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); tcp_done(sk); } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 61119d42b0fd..276d9d541b01 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3586,7 +3586,8 @@ void tcp_send_fin(struct sock *sk) * was unread data in the receive queue. This behavior is recommended * by RFC 2525, section 2.17. 
-DaveM */ -void tcp_send_active_reset(struct sock *sk, gfp_t priority) +void tcp_send_active_reset(struct sock *sk, gfp_t priority, + enum sk_rst_reason reason) { struct sk_buff *skb; diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 976db57b95d4..83fe7f62f7f1 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -22,6 +22,7 @@ #include #include #include +#include static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk) { @@ -127,7 +128,8 @@ static int tcp_out_of_resources(struct sock *sk, bool do_reset) (!tp->snd_wnd && !tp->packets_out))
Re: [PATCH v12 10/14] x86/sgx: Charge mem_cgroup for per-cgroup reclamation
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote: > Enclave Page Cache(EPC) memory can be swapped out to regular system > memory, and the consumed memory should be charged to a proper > mem_cgroup. Currently the selection of mem_cgroup to charge is done in > sgx_encl_get_mem_cgroup(). But it considers all contexts other than the > ksgxd thread are user processes. With the new EPC cgroup implementation, > the swapping can also happen in EPC cgroup work-queue threads. In those > cases, it improperly selects the root mem_cgroup to charge for the RAM > usage. > > Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take > an additional argument to explicitly specify the mm struct to charge for > allocations. Callers from background kthreads not associated with a > charging mm struct would set it to NULL, while callers in user process > contexts set it to current->mm. > > Internally, it handles the case when the charging mm given is NULL, by > searching for an mm struct from enclave's mm_list. > > Signed-off-by: Haitao Huang > Reported-by: Mikko Ylinen > Tested-by: Mikko Ylinen > Tested-by: Jarkko Sakkinen > Reviewed-by: Kai Huang
[PATCH net-next v8 1/7] net: introduce rstreason to detect why the RST is sent
From: Jason Xing Add a new standalone file for easy future extension to support both active reset and passive reset in the TCP/DCCP/MPTCP protocols. This patch only does the preparations for the reset reason mechanism, nothing else changes. The reset reasons are divided into three parts: 1) reuse drop reasons for passive reset in TCP 2) our own independent reasons which aren't relying on other reasons at all 3) reuse MP_TCPRST option for MPTCP The benefits of a standalone reset reason are listed here: 1) it can cover more than one case, such as reset reasons in MPTCP, active reset reasons. 2) people can easily and quickly understand and maintain this mechanism. 3) we get a unified output format with the prefix stripped. 4) more new reset reasons are on the way ... I will implement the basic code of active/passive reset reasons in those three protocols, which are not complete at this moment. For the passive reset part in TCP, I only introduce the NO_SOCKET common case, which serves as an example. After this series is applied, it will have the ability to open a new gate to let other people contribute more reasons into it :) Signed-off-by: Jason Xing --- include/net/rstreason.h | 106 1 file changed, 106 insertions(+) create mode 100644 include/net/rstreason.h diff --git a/include/net/rstreason.h b/include/net/rstreason.h new file mode 100644 index ..bc53b5a24505 --- /dev/null +++ b/include/net/rstreason.h @@ -0,0 +1,106 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ + +#ifndef _LINUX_RSTREASON_H +#define _LINUX_RSTREASON_H +#include +#include + +#define DEFINE_RST_REASON(FN, FNe) \ + FN(NOT_SPECIFIED) \ + FN(NO_SOCKET) \ + FN(MPTCP_RST_EUNSPEC) \ + FN(MPTCP_RST_EMPTCP)\ + FN(MPTCP_RST_ERESOURCE) \ + FN(MPTCP_RST_EPROHIBIT) \ + FN(MPTCP_RST_EWQ2BIG) \ + FN(MPTCP_RST_EBADPERF) \ + FN(MPTCP_RST_EMIDDLEBOX)\ + FN(ERROR) \ + FNe(MAX) + +/** + * enum sk_rst_reason - the reasons of socket reset + * + * The reasons of sk reset, which are used in DCCP/TCP/MPTCP protocols. 
+ * + * There are three parts in order: + * 1) skb drop reasons: relying on drop reasons for such as passive reset + * 2) independent reset reasons: such as active reset reasons + * 3) reset reasons in MPTCP: only for MPTCP use + */ +enum sk_rst_reason { + /* Refer to include/net/dropreason-core.h +* Rely on skb drop reasons because it indicates exactly why RST +* could happen. +*/ + /** @SK_RST_REASON_NOT_SPECIFIED: reset reason is not specified */ + SK_RST_REASON_NOT_SPECIFIED, + /** @SK_RST_REASON_NO_SOCKET: no valid socket that can be used */ + SK_RST_REASON_NO_SOCKET, + + /* Copy from include/uapi/linux/mptcp.h. +* These reset fields will not be changed since they adhere to +* RFC 8684. So do not touch them. I'm going to list each definition +* of them respectively. +*/ + /** +* @SK_RST_REASON_MPTCP_RST_EUNSPEC: Unspecified error. +* This is the default error; it implies that the subflow is no +* longer available. The presence of this option shows that the +* RST was generated by an MPTCP-aware device. +*/ + SK_RST_REASON_MPTCP_RST_EUNSPEC, + /** +* @SK_RST_REASON_MPTCP_RST_EMPTCP: MPTCP-specific error. +* An error has been detected in the processing of MPTCP options. +* This is the usual reason code to return in the cases where a RST +* is being sent to close a subflow because of an invalid response. +*/ + SK_RST_REASON_MPTCP_RST_EMPTCP, + /** +* @SK_RST_REASON_MPTCP_RST_ERESOURCE: Lack of resources. +* This code indicates that the sending host does not have enough +* resources to support the terminated subflow. +*/ + SK_RST_REASON_MPTCP_RST_ERESOURCE, + /** +* @SK_RST_REASON_MPTCP_RST_EPROHIBIT: Administratively prohibited. +* This code indicates that the requested subflow is prohibited by +* the policies of the sending host. +*/ + SK_RST_REASON_MPTCP_RST_EPROHIBIT, + /** +* @SK_RST_REASON_MPTCP_RST_EWQ2BIG: Too much outstanding data. 
+* This code indicates that there is an excessive amount of data +* that needs to be transmitted over the terminated subflow while +* having already been acknowledged over one or more other subflows. +* This may occur if a path has been unavailable for a short period +* and it is more efficient to reset and start again than it is to +* retransmit the queued data. +*/ + SK_RST_REASON_MPTCP_RST_EWQ2BIG, + /** +* @SK_RST_REASON_MPTCP_RST_EBADPERF: Unacceptable performance. +* This code
[PATCH net-next v8 2/7] rstreason: prepare for passive reset
From: Jason Xing

Adjust the parameter and support passing reason of reset which
is for now NOT_SPECIFIED. No functional changes.

Signed-off-by: Jason Xing
---
 include/net/request_sock.h |  4 +++-
 net/dccp/ipv4.c            | 10 ++
 net/dccp/ipv6.c            | 10 ++
 net/dccp/minisocks.c       |  3 ++-
 net/ipv4/tcp_ipv4.c        | 12 +++-
 net/ipv4/tcp_minisocks.c   |  3 ++-
 net/ipv6/tcp_ipv6.c        | 15 +--
 net/mptcp/subflow.c        |  8 +---
 8 files changed, 40 insertions(+), 25 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 004e651e6067..bdc737832da6 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -18,6 +18,7 @@
 #include
 #include
+#include <net/rstreason.h>

 struct request_sock;
 struct sk_buff;
@@ -34,7 +35,8 @@ struct request_sock_ops {
	void		(*send_ack)(const struct sock *sk, struct sk_buff *skb,
				    struct request_sock *req);
	void		(*send_reset)(const struct sock *sk,
-				      struct sk_buff *skb);
+				      struct sk_buff *skb,
+				      enum sk_rst_reason reason);
	void		(*destructor)(struct request_sock *req);
	void		(*syn_ack_timeout)(const struct request_sock *req);
 };
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9fc9cea4c251..ff41bd6f99c3 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include <net/rstreason.h>

 #include "ackvec.h"
 #include "ccid.h"
@@ -521,7 +522,8 @@ static int dccp_v4_send_response(const struct sock *sk, struct request_sock *req
	return err;
 }

-static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb)
+static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb,
+				   enum sk_rst_reason reason)
 {
	int err;
	const struct iphdr *rxiph;
@@ -706,7 +708,7 @@ int dccp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
	return 0;

 reset:
-	dccp_v4_ctl_send_reset(sk, skb);
+	dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
	kfree_skb(skb);
	return 0;
 }
@@ -869,7 +871,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
	if (nsk == sk) {
		reqsk_put(req);
	} else if (dccp_child_process(sk, nsk, skb)) {
-		dccp_v4_ctl_send_reset(sk, skb);
+		dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
		goto discard_and_relse;
	} else {
		sock_put(sk);
@@ -909,7 +911,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
	if (dh->dccph_type != DCCP_PKT_RESET) {
		DCCP_SKB_CB(skb)->dccpd_reset_code =
					DCCP_RESET_CODE_NO_CONNECTION;
-		dccp_v4_ctl_send_reset(sk, skb);
+		dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
	}

 discard_it:
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index c8ca703dc331..85f4b8fdbe5e 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include <net/rstreason.h>

 #include "dccp.h"
 #include "ipv6.h"
@@ -256,7 +257,8 @@ static void dccp_v6_reqsk_destructor(struct request_sock *req)
	kfree_skb(inet_rsk(req)->pktopts);
 }

-static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb)
+static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb,
+				   enum sk_rst_reason reason)
 {
	const struct ipv6hdr *rxip6h;
	struct sk_buff *skb;
@@ -656,7 +658,7 @@ static int dccp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
	return 0;

 reset:
-	dccp_v6_ctl_send_reset(sk, skb);
+	dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
 discard:
	if (opt_skb != NULL)
		__kfree_skb(opt_skb);
@@ -762,7 +764,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
	if (nsk == sk) {
		reqsk_put(req);
	} else if (dccp_child_process(sk, nsk, skb)) {
-		dccp_v6_ctl_send_reset(sk, skb);
+		dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
		goto discard_and_relse;
	} else {
		sock_put(sk);
@@ -801,7 +803,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
	if (dh->dccph_type != DCCP_PKT_RESET) {
		DCCP_SKB_CB(skb)->dccpd_reset_code =
					DCCP_RESET_CODE_NO_CONNECTION;
-		dccp_v6_ctl_send_reset(sk, skb);
+		dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
	}

 discard_it:
diff --git a/net/dccp/minisocks.c
[PATCH net-next v8 0/7] Implement reset reason mechanism to detect
From: Jason Xing

In production, there are many cases where an RST skb is sent, but we
don't have a convenient/fast method to detect the exact underlying
reasons.

RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
and active kind (like tcp_send_active_reset()). The former can be traced
carefully 1) in TCP, with the help of drop reasons, which is based on
Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
RFC 8684. The latter is relatively independent and should be implemented
on our own, since active reset reasons cannot be replaced by skb drop
reasons or anything similar.

In this series, I focus on the fundamental implementation, mostly on how
the rstreason mechanism works, and give the detailed passive part as an
example, not including the active reset part. In the future, we can go
further and refine those NOT_SPECIFIED reasons.

Here are some examples when tracing:
-0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
-0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x skaddr=x src=x dest=x state=x reason=NO_SOCKET

[1]
Link: https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/

v8
Link: https://lore.kernel.org/all/20240422030109.12891-1-kerneljasonx...@gmail.com/
1. put sk reset reasons into more natural order (Matt)
2. adjust those helper positions (Matt)
3. rename two convert functions (Matt)
4. make the kdoc format correct (Simon)

v7
Link: https://lore.kernel.org/all/20240417085143.69578-1-kerneljasonx...@gmail.com/
1. get rid of enum casts which could bring potential issues (Eric)
2. use switch-case method to map between reset reason in MPTCP and sk
   reset reason (Steven)
3. use switch-case method to map between skb drop reason and sk reset
   reason

v6
1. add back casts, or else they are treated as errors.

v5
Link: https://lore.kernel.org/all/2024045630.38420-1-kerneljasonx...@gmail.com/
1. address format issue (like reverse xmas tree) (Eric, Paolo)
2. remove unnecessary casts. (Eric)
3. introduce a helper used in mptcp active reset. See patch 6. (Paolo)

v4
Link: https://lore.kernel.org/all/20240409100934.37725-1-kerneljasonx...@gmail.com/
1. passing 'enum sk_rst_reason' for readability when tracing (Antoine)

v3
Link: https://lore.kernel.org/all/20240404072047.11490-1-kerneljasonx...@gmail.com/
1. rebase (mptcp part) and address what Mat suggested.

v2
Link: https://lore.kernel.org/all/20240403185033.47ebc...@kernel.org/
1. rebase against the latest net-next tree

Jason Xing (7):
  net: introduce rstreason to detect why the RST is sent
  rstreason: prepare for passive reset
  rstreason: prepare for active reset
  tcp: support rstreason for passive reset
  mptcp: support rstreason for passive reset
  mptcp: introducing a helper into active reset logic
  rstreason: make it work in trace world

 include/net/request_sock.h |   4 +-
 include/net/rstreason.h    | 121 +
 include/net/tcp.h          |   3 +-
 include/trace/events/tcp.h |  26 ++--
 net/dccp/ipv4.c            |  10 +--
 net/dccp/ipv6.c            |  10 +--
 net/dccp/minisocks.c       |   3 +-
 net/ipv4/tcp.c             |  15 +++--
 net/ipv4/tcp_ipv4.c        |  17 --
 net/ipv4/tcp_minisocks.c   |   3 +-
 net/ipv4/tcp_output.c      |   5 +-
 net/ipv4/tcp_timer.c       |   9 ++-
 net/ipv6/tcp_ipv6.c        |  20 +++---
 net/mptcp/protocol.c       |   2 +-
 net/mptcp/protocol.h       |  39
 net/mptcp/subflow.c        |  27 ++---
 16 files changed, 267 insertions(+), 47 deletions(-)
 create mode 100644 include/net/rstreason.h
--
2.37.3