Re: [PATCH v2 1/6] dt-bindings: iio: light: stk33xx: add vdd and leda regulators
On 24/04/2024 00:33, Aren Moynihan wrote:
> stk3310 and stk3311 are typically connected to power supplies for the
> chip (vdd) and the infrared LED (leda). Add properties so we can power
> these up / down appropriately.
>
> Signed-off-by: Aren Moynihan
> ---

Reviewed-by: Krzysztof Kozlowski

Best regards,
Krzysztof
Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection
On Wed, Apr 24, 2024 at 11:15 AM Cindy Lu wrote: > > On Wed, Apr 24, 2024 at 8:44 AM Jason Wang wrote: > > > > On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin wrote: > > > > > > On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote: > > > > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin > > > > wrote: > > > > > > > > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote: > > > > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin > > > > > > wrote: > > > > > > > > > > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote: > > > > > > > > Add the function vduse_alloc_reconnnect_info_mem > > > > > > > > and vduse_alloc_reconnnect_info_mem > > > > > > > > These functions allow vduse to allocate and free memory for > > > > > > > > reconnection > > > > > > > > information. The amount of memory allocated is vq_num pages. > > > > > > > > Each VQS will map its own page where the reconnection > > > > > > > > information will be saved > > > > > > > > > > > > > > > > Signed-off-by: Cindy Lu > > > > > > > > --- > > > > > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 40 > > > > > > > > ++ > > > > > > > > 1 file changed, 40 insertions(+) > > > > > > > > > > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > > index ef3c9681941e..2da659d5f4a8 100644 > > > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue { > > > > > > > > int irq_effective_cpu; > > > > > > > > struct cpumask irq_affinity; > > > > > > > > struct kobject kobj; > > > > > > > > + unsigned long vdpa_reconnect_vaddr; > > > > > > > > }; > > > > > > > > > > > > > > > > struct vduse_dev; > > > > > > > > @@ -1105,6 +1106,38 @@ static void > > > > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq) > > > > > > > > > > > > > > > > vq->irq_effective_cpu = curr_cpu; > > > > > > > > } > > > > > 
> > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev > > > > > > > > *dev) > > > > > > > > +{ > > > > > > > > + unsigned long vaddr = 0; > > > > > > > > + struct vduse_virtqueue *vq; > > > > > > > > + > > > > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > > > > + /*page 0~ vq_num save the reconnect info for vq*/ > > > > > > > > + vq = dev->vqs[i]; > > > > > > > > + vaddr = get_zeroed_page(GFP_KERNEL); > > > > > > > > > > > > > > > > > > > > > I don't get why you insist on stealing kernel memory for something > > > > > > > that is just used by userspace to store data for its own use. > > > > > > > Userspace does not lack ways to persist data, for example, > > > > > > > create a regular file anywhere in the filesystem. > > > > > > > > > > > > Good point. So the motivation here is to: > > > > > > > > > > > > 1) be self contained, no dependency for high speed persist data > > > > > > storage like tmpfs > > > > > > > > > > No idea what this means. > > > > > > > > I mean a regular file may slow down the datapath performance, so > > > > usually the application will try to use tmpfs and other which is a > > > > dependency for implementing the reconnection. > > > > > > Are we worried about systems without tmpfs now? > > > > Yes. > > > > > > > > > > > > > > > > > > > 2) standardize the format in uAPI which allows reconnection from > > > > > > arbitrary userspace, unfortunately, such effort was removed in new > > > > > > versions > > > > > > > > > > And I don't see why that has to live in the kernel tree either. > > > > > > > > I can't find a better place, any idea? > > > > > > > > Thanks > > > > > > > > > Well anywhere on github really. with libvhost-user maybe? > > > It's harmless enough in Documentation > > > if you like but ties you to the kernel release cycle in a way that > > > is completely unnecessary. > > > > Ok. > > > > Thanks > > > Sure, got it. Do we need to withdraw all the series? 
maybe we can keep > the patch for support ioctl to get status and configure? We can leave the mmap part for future, the rest is still useful unless I miss anything. Thanks > > thanks > cindy > > > > > > > > > > > > > > > If the above doesn't make sense, we don't need to offer those pages > > > > > > by VDUSE. > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + if (vaddr == 0) > > > > > > > > + return -ENOMEM; > > > > > > > > + > > > > > > > > + vq->vdpa_reconnect_vaddr = vaddr; > > > > > > > > + } > > > > > > > > + > > > > > > > > + return 0; > > > > > > > > +} > > > > > > > > + > > > > > > > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev > > > > > > > > *dev) > > > > > > > > +{ > > > > > > > > + struct vduse_virtqueue *vq; > > > > > > > > + > > > > > > > > + for (int i = 0; i <
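As an aside on the allocation loop quoted above: as posted, vduse_alloc_reconnnect_info_mem() returns -ENOMEM as soon as one page fails, leaving the caller to release any pages already allocated. A user-space model of the same pattern with explicit unwinding is sketched below; it is illustrative only (calloc stands in for get_zeroed_page, a fixed global array stands in for dev->vqs), not the kernel code.

```c
/*
 * User-space model (not the kernel patch) of the per-vq page allocation
 * above, with explicit unwinding: if page i fails to allocate, pages
 * 0..i-1 are released before returning, so the caller never sees a
 * half-populated array.
 */
#include <stdlib.h>

#define VQ_NUM 4  /* stand-in for dev->vq_num */

static void *vq_pages[VQ_NUM]; /* stand-in for vq->vdpa_reconnect_vaddr */

static void free_reconnect_pages(void)
{
	for (int i = 0; i < VQ_NUM; i++) {
		free(vq_pages[i]); /* free(NULL) is a no-op, like the NULL check */
		vq_pages[i] = NULL;
	}
}

static int alloc_reconnect_pages(size_t page_size)
{
	for (int i = 0; i < VQ_NUM; i++) {
		/* calloc models get_zeroed_page(GFP_KERNEL): zeroed memory */
		vq_pages[i] = calloc(1, page_size);
		if (!vq_pages[i]) {
			free_reconnect_pages(); /* unwind pages 0..i-1 */
			return -1;
		}
	}
	return 0;
}
```

The kernel version gets away without in-loop unwinding only because vduse_free_reconnnect_info_mem() checks each vaddr before freeing; the sketch makes the cleanup contract explicit.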
Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection
On Wed, Apr 24, 2024 at 8:44 AM Jason Wang wrote: > > On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin wrote: > > > > On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote: > > > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin > > > wrote: > > > > > > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote: > > > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin > > > > > wrote: > > > > > > > > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote: > > > > > > > Add the function vduse_alloc_reconnnect_info_mem > > > > > > > and vduse_alloc_reconnnect_info_mem > > > > > > > These functions allow vduse to allocate and free memory for > > > > > > > reconnection > > > > > > > information. The amount of memory allocated is vq_num pages. > > > > > > > Each VQS will map its own page where the reconnection information > > > > > > > will be saved > > > > > > > > > > > > > > Signed-off-by: Cindy Lu > > > > > > > --- > > > > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 40 > > > > > > > ++ > > > > > > > 1 file changed, 40 insertions(+) > > > > > > > > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > index ef3c9681941e..2da659d5f4a8 100644 > > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue { > > > > > > > int irq_effective_cpu; > > > > > > > struct cpumask irq_affinity; > > > > > > > struct kobject kobj; > > > > > > > + unsigned long vdpa_reconnect_vaddr; > > > > > > > }; > > > > > > > > > > > > > > struct vduse_dev; > > > > > > > @@ -1105,6 +1106,38 @@ static void > > > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq) > > > > > > > > > > > > > > vq->irq_effective_cpu = curr_cpu; > > > > > > > } > > > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev) > > > > > > > +{ > > > > > > > + unsigned long vaddr = 0; > > > > > > 
> + struct vduse_virtqueue *vq; > > > > > > > + > > > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > > > + /*page 0~ vq_num save the reconnect info for vq*/ > > > > > > > + vq = dev->vqs[i]; > > > > > > > + vaddr = get_zeroed_page(GFP_KERNEL); > > > > > > > > > > > > > > > > > > I don't get why you insist on stealing kernel memory for something > > > > > > that is just used by userspace to store data for its own use. > > > > > > Userspace does not lack ways to persist data, for example, > > > > > > create a regular file anywhere in the filesystem. > > > > > > > > > > Good point. So the motivation here is to: > > > > > > > > > > 1) be self contained, no dependency for high speed persist data > > > > > storage like tmpfs > > > > > > > > No idea what this means. > > > > > > I mean a regular file may slow down the datapath performance, so > > > usually the application will try to use tmpfs and other which is a > > > dependency for implementing the reconnection. > > > > Are we worried about systems without tmpfs now? > > Yes. > > > > > > > > > > > > > > 2) standardize the format in uAPI which allows reconnection from > > > > > arbitrary userspace, unfortunately, such effort was removed in new > > > > > versions > > > > > > > > And I don't see why that has to live in the kernel tree either. > > > > > > I can't find a better place, any idea? > > > > > > Thanks > > > > > > Well anywhere on github really. with libvhost-user maybe? > > It's harmless enough in Documentation > > if you like but ties you to the kernel release cycle in a way that > > is completely unnecessary. > > Ok. > > Thanks > Sure, got it. Do we need to withdraw all the series? maybe we can keep the patch for support ioctl to get status and configure? thanks cindy > > > > > > > > > > > If the above doesn't make sense, we don't need to offer those pages > > > > > by VDUSE. 
> > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + if (vaddr == 0) > > > > > > > + return -ENOMEM; > > > > > > > + > > > > > > > + vq->vdpa_reconnect_vaddr = vaddr; > > > > > > > + } > > > > > > > + > > > > > > > + return 0; > > > > > > > +} > > > > > > > + > > > > > > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev) > > > > > > > +{ > > > > > > > + struct vduse_virtqueue *vq; > > > > > > > + > > > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > > > + vq = dev->vqs[i]; > > > > > > > + > > > > > > > + if (vq->vdpa_reconnect_vaddr) > > > > > > > + free_page(vq->vdpa_reconnect_vaddr); > > > > > > > + vq->vdpa_reconnect_vaddr = 0; > > > > > > > + } > > > > > > > + > > > > > > > + return 0; > > > > > > > +} > > > > > > > > > > > > > > static long vduse_dev_ioctl(struct file *file, unsigned
[PATCH v3 3/4] remoteproc: mediatek: Support setting DRAM and IPI shared buffer sizes
The SCP on different chips will require different DRAM sizes and IPI shared buffer sizes based on varying requirements. Signed-off-by: Olivia Wen --- drivers/remoteproc/mtk_common.h | 11 -- drivers/remoteproc/mtk_scp.c | 84 +++- drivers/remoteproc/mtk_scp_ipi.c | 7 +++- 3 files changed, 79 insertions(+), 23 deletions(-) diff --git a/drivers/remoteproc/mtk_common.h b/drivers/remoteproc/mtk_common.h index 6d7736a..fd5c539 100644 --- a/drivers/remoteproc/mtk_common.h +++ b/drivers/remoteproc/mtk_common.h @@ -78,7 +78,6 @@ #define MT8195_L2TCM_OFFSET0x850d0 #define SCP_FW_VER_LEN 32 -#define SCP_SHARE_BUFFER_SIZE 288 struct scp_run { u32 signaled; @@ -97,6 +96,11 @@ struct scp_ipi_desc { struct mtk_scp; +struct mtk_scp_sizes_data { + size_t max_dram_size; + size_t ipi_share_buffer_size; +}; + struct mtk_scp_of_data { int (*scp_clk_get)(struct mtk_scp *scp); int (*scp_before_load)(struct mtk_scp *scp); @@ -110,6 +114,7 @@ struct mtk_scp_of_data { u32 host_to_scp_int_bit; size_t ipi_buf_offset; + const struct mtk_scp_sizes_data *scp_sizes; }; struct mtk_scp_of_cluster { @@ -141,10 +146,10 @@ struct mtk_scp { struct scp_ipi_desc ipi_desc[SCP_IPI_MAX]; bool ipi_id_ack[SCP_IPI_MAX]; wait_queue_head_t ack_wq; + u8 *share_buf; void *cpu_addr; dma_addr_t dma_addr; - size_t dram_size; struct rproc_subdev *rpmsg_subdev; @@ -162,7 +167,7 @@ struct mtk_scp { struct mtk_share_obj { u32 id; u32 len; - u8 share_buf[SCP_SHARE_BUFFER_SIZE]; + u8 *share_buf; }; void scp_memcpy_aligned(void __iomem *dst, const void *src, unsigned int len); diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c index 6295148..e281d28 100644 --- a/drivers/remoteproc/mtk_scp.c +++ b/drivers/remoteproc/mtk_scp.c @@ -20,7 +20,6 @@ #include "mtk_common.h" #include "remoteproc_internal.h" -#define MAX_CODE_SIZE 0x50 #define SECTION_NAME_IPI_BUFFER ".ipi_buffer" /** @@ -94,14 +93,15 @@ static void scp_ipi_handler(struct mtk_scp *scp) { struct mtk_share_obj __iomem *rcv_obj = scp->recv_buf; 
struct scp_ipi_desc *ipi_desc = scp->ipi_desc; - u8 tmp_data[SCP_SHARE_BUFFER_SIZE]; scp_ipi_handler_t handler; u32 id = readl(_obj->id); u32 len = readl(_obj->len); + const struct mtk_scp_sizes_data *scp_sizes; - if (len > SCP_SHARE_BUFFER_SIZE) { - dev_err(scp->dev, "ipi message too long (len %d, max %d)", len, - SCP_SHARE_BUFFER_SIZE); + scp_sizes = scp->data->scp_sizes; + if (len > scp_sizes->ipi_share_buffer_size) { + dev_err(scp->dev, "ipi message too long (len %d, max %zd)", len, + scp_sizes->ipi_share_buffer_size); return; } if (id >= SCP_IPI_MAX) { @@ -117,8 +117,9 @@ static void scp_ipi_handler(struct mtk_scp *scp) return; } - memcpy_fromio(tmp_data, _obj->share_buf, len); - handler(tmp_data, len, ipi_desc[id].priv); + memset(scp->share_buf, 0, scp_sizes->ipi_share_buffer_size); + memcpy_fromio(scp->share_buf, _obj->share_buf, len); + handler(scp->share_buf, len, ipi_desc[id].priv); scp_ipi_unlock(scp, id); scp->ipi_id_ack[id] = true; @@ -133,6 +134,8 @@ static int scp_ipi_init(struct mtk_scp *scp, const struct firmware *fw) { int ret; size_t buf_sz, offset; + size_t share_buf_offset; + const struct mtk_scp_sizes_data *scp_sizes; /* read the ipi buf addr from FW itself first */ ret = scp_elf_read_ipi_buf_addr(scp, fw, ); @@ -152,12 +155,15 @@ static int scp_ipi_init(struct mtk_scp *scp, const struct firmware *fw) return -EOVERFLOW; } + scp_sizes = scp->data->scp_sizes; scp->recv_buf = (struct mtk_share_obj __iomem *) (scp->sram_base + offset); + share_buf_offset = sizeof(scp->recv_buf->id) + + sizeof(scp->recv_buf->len) + scp_sizes->ipi_share_buffer_size; scp->send_buf = (struct mtk_share_obj __iomem *) - (scp->sram_base + offset + sizeof(*scp->recv_buf)); - memset_io(scp->recv_buf, 0, sizeof(*scp->recv_buf)); - memset_io(scp->send_buf, 0, sizeof(*scp->send_buf)); + (scp->sram_base + offset + share_buf_offset); + memset_io(scp->recv_buf, 0, share_buf_offset); + memset_io(scp->send_buf, 0, share_buf_offset); return 0; } @@ -741,14 +747,16 @@ static int 
scp_start(struct rproc *rproc) static void *mt8183_scp_da_to_va(struct mtk_scp *scp, u64 da, size_t len) { int offset; + const struct mtk_scp_sizes_data *scp_sizes; + scp_sizes = scp->data->scp_sizes; if (da <
[PATCH v3 0/4] Support MT8188 SCP core 1
Change in v3:
- split the modifications in version 2 into three separate commits

Change in v2:
- change the order of mt8188-scp-dual
- modify the struct mtk_scp_of_data on MT8188
- add MT8188-specific functions
- add structure mtk_scp_sizes_data to manage sizes
- change tmp_data allocation to share_buf

Olivia Wen (4):
  dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP
  remoteproc: mediatek: Support MT8188 SCP core 1
  remoteproc: mediatek: Support setting DRAM and IPI shared buffer sizes
  media: mediatek: imgsys: Support image processing

 .../devicetree/bindings/remoteproc/mtk,scp.yaml |   2 +
 drivers/remoteproc/mtk_common.h                 |  11 +-
 drivers/remoteproc/mtk_scp.c                    | 230 +++--
 drivers/remoteproc/mtk_scp_ipi.c                |   7 +-
 include/linux/remoteproc/mtk_scp.h              |   1 +
 5 files changed, 225 insertions(+), 26 deletions(-)

-- 
2.6.4
[PATCH v3 2/4] remoteproc: mediatek: Support MT8188 SCP core 1
MT8188 SCP has two RISC-V cores which is similar to MT8195 but without L1TCM. We've added MT8188-specific functions to configure L1TCM in multicore setups. Signed-off-by: Olivia Wen --- drivers/remoteproc/mtk_scp.c | 146 ++- 1 file changed, 143 insertions(+), 3 deletions(-) diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c index 6751829..6295148 100644 --- a/drivers/remoteproc/mtk_scp.c +++ b/drivers/remoteproc/mtk_scp.c @@ -471,6 +471,86 @@ static int mt8186_scp_before_load(struct mtk_scp *scp) return 0; } +static int mt8188_scp_l2tcm_on(struct mtk_scp *scp) +{ + struct mtk_scp_of_cluster *scp_cluster = scp->cluster; + + mutex_lock(_cluster->cluster_lock); + + if (scp_cluster->l2tcm_refcnt == 0) { + /* clear SPM interrupt, SCP2SPM_IPC_CLR */ + writel(0xff, scp->cluster->reg_base + MT8192_SCP2SPM_IPC_CLR); + + /* Power on L2TCM */ + scp_sram_power_on(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_0, 0); + scp_sram_power_on(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_1, 0); + scp_sram_power_on(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_2, 0); + scp_sram_power_on(scp->cluster->reg_base + MT8192_L1TCM_SRAM_PDN, 0); + } + + scp_cluster->l2tcm_refcnt += 1; + + mutex_unlock(_cluster->cluster_lock); + + return 0; +} + +static int mt8188_scp_before_load(struct mtk_scp *scp) +{ + writel(1, scp->cluster->reg_base + MT8192_CORE0_SW_RSTN_SET); + + mt8188_scp_l2tcm_on(scp); + + scp_sram_power_on(scp->cluster->reg_base + MT8192_CPU0_SRAM_PD, 0); + + /* enable MPU for all memory regions */ + writel(0xff, scp->cluster->reg_base + MT8192_CORE0_MEM_ATT_PREDEF); + + return 0; +} + +static int mt8188_scp_c1_before_load(struct mtk_scp *scp) +{ + u32 sec_ctrl; + struct mtk_scp *scp_c0; + struct mtk_scp_of_cluster *scp_cluster = scp->cluster; + + scp->data->scp_reset_assert(scp); + + mt8188_scp_l2tcm_on(scp); + + scp_sram_power_on(scp->cluster->reg_base + MT8195_CPU1_SRAM_PD, 0); + + /* enable MPU for all memory regions */ + writel(0xff, scp->cluster->reg_base 
+ MT8195_CORE1_MEM_ATT_PREDEF); + + /* +* The L2TCM_OFFSET_RANGE and L2TCM_OFFSET shift the destination address +* on SRAM when SCP core 1 accesses SRAM. +* +* This configuration solves booting the SCP core 0 and core 1 from +* different SRAM address because core 0 and core 1 both boot from +* the head of SRAM by default. this must be configured before boot SCP core 1. +* +* The value of L2TCM_OFFSET_RANGE is from the viewpoint of SCP core 1. +* When SCP core 1 issues address within the range (L2TCM_OFFSET_RANGE), +* the address will be added with a fixed offset (L2TCM_OFFSET) on the bus. +* The shift action is tranparent to software. +*/ + writel(0, scp->cluster->reg_base + MT8195_L2TCM_OFFSET_RANGE_0_LOW); + writel(scp->sram_size, scp->cluster->reg_base + MT8195_L2TCM_OFFSET_RANGE_0_HIGH); + + scp_c0 = list_first_entry(_cluster->mtk_scp_list, struct mtk_scp, elem); + writel(scp->sram_phys - scp_c0->sram_phys, scp->cluster->reg_base + MT8195_L2TCM_OFFSET); + + /* enable SRAM offset when fetching instruction and data */ + sec_ctrl = readl(scp->cluster->reg_base + MT8195_SEC_CTRL); + sec_ctrl |= MT8195_CORE_OFFSET_ENABLE_I | MT8195_CORE_OFFSET_ENABLE_D; + writel(sec_ctrl, scp->cluster->reg_base + MT8195_SEC_CTRL); + + return 0; +} + static int mt8192_scp_before_load(struct mtk_scp *scp) { /* clear SPM interrupt, SCP2SPM_IPC_CLR */ @@ -717,6 +797,47 @@ static void mt8183_scp_stop(struct mtk_scp *scp) writel(0, scp->cluster->reg_base + MT8183_WDT_CFG); } +static void mt8188_scp_l2tcm_off(struct mtk_scp *scp) +{ + struct mtk_scp_of_cluster *scp_cluster = scp->cluster; + + mutex_lock(_cluster->cluster_lock); + + if (scp_cluster->l2tcm_refcnt > 0) + scp_cluster->l2tcm_refcnt -= 1; + + if (scp_cluster->l2tcm_refcnt == 0) { + /* Power off L2TCM */ + scp_sram_power_off(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_0, 0); + scp_sram_power_off(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_1, 0); + scp_sram_power_off(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_2, 0); + 
scp_sram_power_off(scp->cluster->reg_base + MT8192_L1TCM_SRAM_PDN, 0); + } + + mutex_unlock(_cluster->cluster_lock); +} + +static void mt8188_scp_stop(struct mtk_scp *scp) +{ + mt8188_scp_l2tcm_off(scp); + + scp_sram_power_off(scp->cluster->reg_base + MT8192_CPU0_SRAM_PD, 0); + + /* Disable SCP watchdog */ + writel(0, scp->cluster->reg_base + MT8192_CORE0_WDT_CFG); +} + +static void
[PATCH v3 4/4] media: mediatek: imgsys: Support image processing
Integrate the imgsys core architecture driver for image processing
on the MT8188 platform.

Signed-off-by: Olivia Wen
---
 include/linux/remoteproc/mtk_scp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/remoteproc/mtk_scp.h b/include/linux/remoteproc/mtk_scp.h
index 7c2b7cc9..344ff41 100644
--- a/include/linux/remoteproc/mtk_scp.h
+++ b/include/linux/remoteproc/mtk_scp.h
@@ -43,6 +43,7 @@ enum scp_ipi_id {
 	SCP_IPI_CROS_HOST_CMD,
 	SCP_IPI_VDEC_LAT,
 	SCP_IPI_VDEC_CORE,
+	SCP_IPI_IMGSYS_CMD,
 	SCP_IPI_NS_SERVICE = 0xFF,
 	SCP_IPI_MAX = 0x100,
 };
-- 
2.6.4
[PATCH v3 1/4] dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP
Under different applications, the MT8188 SCP can be used as single-core
or dual-core.

Signed-off-by: Olivia Wen
Acked-by: Krzysztof Kozlowski
Reviewed-by: AngeloGioacchino Del Regno
---
 Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
index 507f98f..c5dc3c2 100644
--- a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
@@ -19,6 +19,7 @@ properties:
       - mediatek,mt8183-scp
       - mediatek,mt8186-scp
       - mediatek,mt8188-scp
+      - mediatek,mt8188-scp-dual
       - mediatek,mt8192-scp
       - mediatek,mt8195-scp
       - mediatek,mt8195-scp-dual
@@ -194,6 +195,7 @@ allOf:
         properties:
           compatible:
             enum:
+              - mediatek,mt8188-scp-dual
               - mediatek,mt8195-scp-dual
       then:
         properties:
-- 
2.6.4
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Tue, 2024-04-23 at 19:26 -0500, Haitao Huang wrote: > On Tue, 23 Apr 2024 17:13:15 -0500, Huang, Kai wrote: > > > On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote: > > > > > It's a workaround because you use the capacity==0 but it does not > > > really > > > > > mean to disable the misc cgroup for specific resource IIUC. > > > > > > > > Please read the comment around @misc_res_capacity again: > > > > > > > > * Miscellaneous resources capacity for the entire machine. 0 > > > capacity > > > > * means resource is not initialized or not present in the host. > > > > > > > > > > I mentioned this in earlier email. I think this means no SGX EPC. It > > > doesnot mean sgx epc cgroup not enabled. That's also consistent with the > > > behavior try_charge() fails if capacity is zero. > > > > OK. To me the "capacity" is purely the concept of cgroup, so it must be > > interpreted within the scope of "cgroup". If cgroup, in our case, SGX > > cgroup, is disabled, then whether "leaving the capacity to reflect the > > presence of hardware resource" doesn't really matter. > > So what you are saying is that, the kernel must initialize the capacity > > of > > some MISC resource once it is added as valid type. > > And you must initialize the "capacity" even MISC cgroup is disabled > > entirely by kernel commandline, in which case, IIUC, misc.capacity is not > > even going to show in the /fs. > > > > If this is your point, then your patch: > > > > cgroup/misc: Add SGX EPC resource type > > > > is already broken, because you added the new type w/o initializing the > > capacity. > > > > Please fix that up. > > > > > > > > > > > > > > > There is explicit way for user to disable misc without setting > > > capacity> > to > > > > > zero. > > > > > > > > Which way are you talking about? > > > > > > Echo "-misc" to cgroup.subtree_control at root level for example still > > > shows non-zero sgx_epc capacity. 
> > > > I guess "having to disable all MISC resources just in order to disable > > SGX > > EPC cgroup" is a brilliant idea. > > > > You can easily disable the entire MISC cgroup by commandline for that > > purpose if that's acceptable. > > > > And I have no idea why "still showing non-zero EPC capacity" is important > > if SGX cgroup cannot be supported at all. > > > > Okay, all I'm trying to say is we should care about consistency in code > and don't want SGX do something different. Mixing "disable" with > "capacity==0" causes inconsistencies AFAICS: > > 1) The try_charge() API currently returns error when capacity is zero. So > it appears not to mean that the cgroup is disabled otherwise it should > return success. I agree this isn't ideal. My view is we can fix it in MISC code if needed. > > 2) The current explicit way ("-misc") to disable misc still shows non-zero > entries in misc.capacity. (At least for v2 cgroup, it does when I tested). > Maybe this is not important but I just don't feel good about this > inconsistency. This belongs to "MISC resource cgroup was initially enabled by the kernel at boot time, but later was disabled *somewhere in hierarchy* by the user". In fact, if you only do "-misc" for "some subtree", it's quite reasonable to still report the resource in max.capacity. In the case above, the "some subtree" happens to be the root. So to me it's reasonable to still show max.capacity in this case. And you can actually argue that the kernel still supports the cgroup for the resource. E.g., you can at runtime do "+misc" to re-enable. However, if the kernel isn't able to support certain MISC resource cgroup at boot time, it's quite reasonable to just set the "capacity" to 0 so it isn't visible to userspace. Note: My key point is, when userspace sees 0 "capacity", it shouldn't need to care about whether it is because of "hardware resource is not available", or "hardware resource is available but kernel cannot support cgroup for it". 
The resource cgroup is simply unavailable. That means the kernel has full
right to just hide that resource from the cgroup at boot time. But this
should be just within "cgroup's scope", i.e., that resource can still be
available if kernel can provide it.

If some app wants to additionally check whether such resource is indeed
available but only cgroup is not available, it should check resource
specific interface but not take advantage of the MISC cgroup interface.

>
> For now I'll just do BUG_ON() unless there are more strong opinions one
> way or the other.
>

Fine to me.
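For reference, the "capacity == 0 means the resource is not initialized or not present" behavior debated above can be captured in a few lines. The model below is a deliberately simplified user-space sketch of the semantics under discussion, not the kernel's misc cgroup implementation; field names and the -1 error value are assumptions for the example.

```c
/*
 * Minimal model (not kernel code) of the MISC cgroup semantics discussed
 * above: a resource whose capacity was never set behaves as "not present",
 * so a charge attempt fails regardless of *why* the capacity is zero
 * (no hardware, or cgroup support for it disabled).
 */
struct misc_res {
	unsigned long capacity; /* 0 = not initialized / not present */
	unsigned long usage;
};

static int misc_try_charge(struct misc_res *res, unsigned long amount)
{
	/* With capacity 0, any charge exceeds capacity, so this always
	 * rejects -- the behavior Haitao points out in item 1). */
	if (res->usage + amount > res->capacity)
		return -1; /* charge rejected */
	res->usage += amount;
	return 0;
}
```

Under this model, userspace seeing 0 capacity cannot distinguish "no hardware" from "cgroup support disabled", which is exactly Kai's point: that distinction belongs to a resource-specific interface, not the MISC cgroup one.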
Re: [PATCH v2 2/2] remoteproc: mediatek: Support MT8188 SCP core 1
Hi Mathieu,

Thanks for the reviews. I will correct it in the next version.

Best regards,
Olivia

On Mon, 2024-04-22 at 09:33 -0600, Mathieu Poirier wrote:
>
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
> Hi Olivia,
>
> On Fri, Apr 19, 2024 at 04:42:11PM +0800, Olivia Wen wrote:
> > From: "olivia.wen"
> >
> > There are three primary modifications.
> >
> > 1. The struct mtk_scp_of_data usage on MT8188
> > MT8192 functions are unsuitable for the dual-core MT8188 SCP,
> > which has two RISC-V cores similar to MT8195 but without L1TCM.
> > We've added MT8188-specific functions to configure L1TCM
> > in multicore setups.
> >
> > 2. SCP_IPI_IMGSYS_CMD feature
> > This version also adds SCP_IPI_IMGSYS_CMD to facilitate
> > communication between the imgsys kernel and the backend driver.
> >
> > 3. Different code sizes and IPI share buffer sizes
> > Each SCP necessitates different code and IPI share buffer sizes.
> > Introducing a structure mtk_scp_sizes_data to handle them.
> >
>
> Please split in 3 different patches and in the changelog, concentrate
> on "why" you are making the changes rather than "what" changes are done.
> > Thanks, > Mathieu > > > Signed-off-by: olivia.wen > > --- > > drivers/remoteproc/mtk_common.h| 11 +- > > drivers/remoteproc/mtk_scp.c | 230 > + > > drivers/remoteproc/mtk_scp_ipi.c | 7 +- > > include/linux/remoteproc/mtk_scp.h | 1 + > > 4 files changed, 223 insertions(+), 26 deletions(-) > > > > diff --git a/drivers/remoteproc/mtk_common.h > b/drivers/remoteproc/mtk_common.h > > index 6d7736a..fd5c539 100644 > > --- a/drivers/remoteproc/mtk_common.h > > +++ b/drivers/remoteproc/mtk_common.h > > @@ -78,7 +78,6 @@ > > #define MT8195_L2TCM_OFFSET0x850d0 > > > > #define SCP_FW_VER_LEN32 > > -#define SCP_SHARE_BUFFER_SIZE288 > > > > struct scp_run { > > u32 signaled; > > @@ -97,6 +96,11 @@ struct scp_ipi_desc { > > > > struct mtk_scp; > > > > +struct mtk_scp_sizes_data { > > +size_t max_dram_size; > > +size_t ipi_share_buffer_size; > > +}; > > + > > struct mtk_scp_of_data { > > int (*scp_clk_get)(struct mtk_scp *scp); > > int (*scp_before_load)(struct mtk_scp *scp); > > @@ -110,6 +114,7 @@ struct mtk_scp_of_data { > > u32 host_to_scp_int_bit; > > > > size_t ipi_buf_offset; > > +const struct mtk_scp_sizes_data *scp_sizes; > > }; > > > > struct mtk_scp_of_cluster { > > @@ -141,10 +146,10 @@ struct mtk_scp { > > struct scp_ipi_desc ipi_desc[SCP_IPI_MAX]; > > bool ipi_id_ack[SCP_IPI_MAX]; > > wait_queue_head_t ack_wq; > > +u8 *share_buf; > > > > void *cpu_addr; > > dma_addr_t dma_addr; > > -size_t dram_size; > > > > struct rproc_subdev *rpmsg_subdev; > > > > @@ -162,7 +167,7 @@ struct mtk_scp { > > struct mtk_share_obj { > > u32 id; > > u32 len; > > -u8 share_buf[SCP_SHARE_BUFFER_SIZE]; > > +u8 *share_buf; > > }; > > > > void scp_memcpy_aligned(void __iomem *dst, const void *src, > unsigned int len); > > diff --git a/drivers/remoteproc/mtk_scp.c > b/drivers/remoteproc/mtk_scp.c > > index 6751829..e281d28 100644 > > --- a/drivers/remoteproc/mtk_scp.c > > +++ b/drivers/remoteproc/mtk_scp.c > > @@ -20,7 +20,6 @@ > > #include "mtk_common.h" > > #include 
"remoteproc_internal.h" > > > > -#define MAX_CODE_SIZE 0x50 > > #define SECTION_NAME_IPI_BUFFER ".ipi_buffer" > > > > /** > > @@ -94,14 +93,15 @@ static void scp_ipi_handler(struct mtk_scp > *scp) > > { > > struct mtk_share_obj __iomem *rcv_obj = scp->recv_buf; > > struct scp_ipi_desc *ipi_desc = scp->ipi_desc; > > -u8 tmp_data[SCP_SHARE_BUFFER_SIZE]; > > scp_ipi_handler_t handler; > > u32 id = readl(_obj->id); > > u32 len = readl(_obj->len); > > +const struct mtk_scp_sizes_data *scp_sizes; > > > > -if (len > SCP_SHARE_BUFFER_SIZE) { > > -dev_err(scp->dev, "ipi message too long (len %d, max %d)", len, > > -SCP_SHARE_BUFFER_SIZE); > > +scp_sizes = scp->data->scp_sizes; > > +if (len > scp_sizes->ipi_share_buffer_size) { > > +dev_err(scp->dev, "ipi message too long (len %d, max %zd)", len, > > +scp_sizes->ipi_share_buffer_size); > > return; > > } > > if (id >= SCP_IPI_MAX) { > > @@ -117,8 +117,9 @@ static void scp_ipi_handler(struct mtk_scp > *scp) > > return; > > } > > > > -memcpy_fromio(tmp_data, _obj->share_buf, len); > > -handler(tmp_data, len, ipi_desc[id].priv); > > +memset(scp->share_buf, 0, scp_sizes->ipi_share_buffer_size); > > +memcpy_fromio(scp->share_buf, _obj->share_buf, len); > > +handler(scp->share_buf, len, ipi_desc[id].priv); > > scp_ipi_unlock(scp, id); > > > > scp->ipi_id_ack[id] = true; > > @@ -133,6 +134,8 @@ static int scp_ipi_init(struct mtk_scp *scp, > const struct firmware *fw) > > { > > int ret; > > size_t buf_sz, offset; > > +size_t share_buf_offset; > > +const struct mtk_scp_sizes_data *scp_sizes; > > > > /* read the ipi buf addr from FW itself first */ > > ret = scp_elf_read_ipi_buf_addr(scp, fw,
Re: [PATCH 1/2] dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP
Hi AngeloGioacchino,

On Thu, 2024-04-11 at 09:39 +0200, AngeloGioacchino Del Regno wrote:
> Il 11/04/24 09:34, AngeloGioacchino Del Regno ha scritto:
> > Il 11/04/24 05:37, olivia.wen ha scritto:
> > > Under different applications, the MT8188 SCP can be used as
> > > single-core or dual-core.
> > >
> > > Signed-off-by: olivia.wen
> > > ---
> > > Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
> > > index 507f98f..7e7b567 100644
> > > --- a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
> > > +++ b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
> > > @@ -22,7 +22,7 @@ properties:
> > >       - mediatek,mt8192-scp
> > >       - mediatek,mt8195-scp
> > >       - mediatek,mt8195-scp-dual
> > > -
> >
> > Don't remove the blank line, it's there for readability.
> >
> > > +      - mediatek,mt8188-scp-dual
>
> Ah, sorry, one more comment. Please, keep the entries ordered by name.
> 8188 goes before 8195.
>
> >
> > After addressing that comment,
> >
> > Reviewed-by: AngeloGioacchino Del Regno
> > <angelogioacchino.delre...@collabora.com>
> >
> > > reg:
> > >   description:
> > >     Should contain the address ranges for memory regions
> > >     SRAM, CFG, and,
> > > @@ -195,6 +195,7 @@ allOf:
> > >       compatible:
> > >         enum:
> > >           - mediatek,mt8195-scp-dual
> > > +          - mediatek,mt8188-scp-dual
> >
> > same here.
> >
> > > then:
> > >   properties:
> > >     reg:
> >

Thanks for the reviews. It will be corrected in the next version.

Best regards,
Olivia
Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection
On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin wrote: > > On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote: > > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin wrote: > > > > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote: > > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin > > > > wrote: > > > > > > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote: > > > > > > Add the function vduse_alloc_reconnnect_info_mem > > > > > > and vduse_free_reconnnect_info_mem > > > > > > These functions allow vduse to allocate and free memory for > > > > > > reconnection > > > > > > information. The amount of memory allocated is vq_num pages. > > > > > > Each VQ will map its own page where the reconnection information > > > > > > will be saved > > > > > > > > > > > > Signed-off-by: Cindy Lu > > > > > > --- > > > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 40 > > > > > > ++ > > > > > > 1 file changed, 40 insertions(+) > > > > > > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > index ef3c9681941e..2da659d5f4a8 100644 > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue { > > > > > > int irq_effective_cpu; > > > > > > struct cpumask irq_affinity; > > > > > > struct kobject kobj; > > > > > > + unsigned long vdpa_reconnect_vaddr; > > > > > > }; > > > > > > > > > > > > struct vduse_dev; > > > > > > @@ -1105,6 +1106,38 @@ static void > > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq) > > > > > > > > > > > > vq->irq_effective_cpu = curr_cpu; > > > > > > } > > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev) > > > > > > +{ > > > > > > + unsigned long vaddr = 0; > > > > > > + struct vduse_virtqueue *vq; > > > > > > + > > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > > + /*page 0~ vq_num save the reconnect
info for vq*/ > > > > > > + vq = dev->vqs[i]; > > > > > > + vaddr = get_zeroed_page(GFP_KERNEL); > > > > > > > > > > > > > > > I don't get why you insist on stealing kernel memory for something > > > > > that is just used by userspace to store data for its own use. > > > > > Userspace does not lack ways to persist data, for example, > > > > > create a regular file anywhere in the filesystem. > > > > > > > > Good point. So the motivation here is to: > > > > > > > > 1) be self contained, no dependency for high speed persist data > > > > storage like tmpfs > > > > > > No idea what this means. > > > > I mean a regular file may slow down the datapath performance, so > > usually the application will try to use tmpfs and other which is a > > dependency for implementing the reconnection. > > Are we worried about systems without tmpfs now? Yes. > > > > > > > > > 2) standardize the format in uAPI which allows reconnection from > > > > arbitrary userspace, unfortunately, such effort was removed in new > > > > versions > > > > > > And I don't see why that has to live in the kernel tree either. > > > > I can't find a better place, any idea? > > > > Thanks > > > Well anywhere on github really. with libvhost-user maybe? > It's harmless enough in Documentation > if you like but ties you to the kernel release cycle in a way that > is completely unnecessary. Ok. Thanks > > > > > > > > If the above doesn't make sense, we don't need to offer those pages by > > > > VDUSE. 
> > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + if (vaddr == 0) > > > > > > + return -ENOMEM; > > > > > > + > > > > > > + vq->vdpa_reconnect_vaddr = vaddr; > > > > > > + } > > > > > > + > > > > > > + return 0; > > > > > > +} > > > > > > + > > > > > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev) > > > > > > +{ > > > > > > + struct vduse_virtqueue *vq; > > > > > > + > > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > > + vq = dev->vqs[i]; > > > > > > + > > > > > > + if (vq->vdpa_reconnect_vaddr) > > > > > > + free_page(vq->vdpa_reconnect_vaddr); > > > > > > + vq->vdpa_reconnect_vaddr = 0; > > > > > > + } > > > > > > + > > > > > > + return 0; > > > > > > +} > > > > > > > > > > > > static long vduse_dev_ioctl(struct file *file, unsigned int cmd, > > > > > > unsigned long arg) > > > > > > @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name) > > > > > > mutex_unlock(>lock); > > > > > > return -EBUSY; > > > > > > } > > > > > > + vduse_free_reconnnect_info_mem(dev); > > > > > > + > > > > > > dev->connected = true; > > > > > > mutex_unlock(>lock); > > > > > > > > > > > > @@ -1855,12 +1890,17 @@ static int
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Tue, 23 Apr 2024 17:13:15 -0500, Huang, Kai wrote: On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote: > > It's a workaround because you use the capacity==0 but it does not really > > mean to disable the misc cgroup for specific resource IIUC. > > Please read the comment around @misc_res_capacity again: > > * Miscellaneous resources capacity for the entire machine. 0 capacity > * means resource is not initialized or not present in the host. > I mentioned this in an earlier email. I think this means no SGX EPC. It does not mean the SGX EPC cgroup is not enabled. That's also consistent with the behavior that try_charge() fails if capacity is zero. OK. To me the "capacity" is purely a cgroup concept, so it must be interpreted within the scope of "cgroup". If the cgroup, in our case the SGX cgroup, is disabled, then "leaving the capacity to reflect the presence of the hardware resource" doesn't really matter. So what you are saying is that the kernel must initialize the capacity of a MISC resource once it is added as a valid type. And you must initialize the "capacity" even if the MISC cgroup is disabled entirely by kernel commandline, in which case, IIUC, misc.capacity is not even going to show up in the filesystem. If this is your point, then your patch "cgroup/misc: Add SGX EPC resource type" is already broken, because you added the new type w/o initializing the capacity. Please fix that up. > > > > There is explicit way for user to disable misc without setting capacity > > to zero. > > Which way are you talking about? Echoing "-misc" to cgroup.subtree_control at the root level, for example, still shows non-zero sgx_epc capacity. I guess "having to disable all MISC resources just in order to disable SGX EPC cgroup" is a brilliant idea. You can easily disable the entire MISC cgroup by commandline for that purpose if that's acceptable. And I have no idea why "still showing non-zero EPC capacity" is important if the SGX cgroup cannot be supported at all.
Okay, all I'm trying to say is we should care about consistency in code, and I don't want SGX to do something different. Mixing "disable" with "capacity==0" causes inconsistencies AFAICS: 1) The try_charge() API currently returns an error when capacity is zero. So it appears not to mean that the cgroup is disabled; otherwise it should return success. 2) The current explicit way ("-misc") to disable misc still shows non-zero entries in misc.capacity. (At least for cgroup v2, it did when I tested.) Maybe this is not important, but I just don't feel good about this inconsistency. For now I'll just do BUG_ON() unless there are stronger opinions one way or the other. BR Haitao
Re: [PATCH v3 0/4] MSM8976 MDSS/GPU/WCNSS support
On 13/04/2024 18:03, Adam Skladowski wrote: This patch series provides support for the display subsystem and GPU, and also adds wireless connectivity subsystem support. Changes since v2 1. Disabled mdss_dsi nodes by default 2. Changed reg size of mdss_dsi0 to be equal on both 3. Added operating points to second mdss_dsi 4. Brought back required opp-supported-hw on adreno 5. Moved status under operating points on adreno Changes since v1 1. Addressed feedback 2. Dropped already applied dt-bindings patches 3. Dropped sdc patch as it was submitted as part of other series 4. Dropped dt-bindings patch for Adreno, also separate now Adam Skladowski (4): arm64: dts: qcom: msm8976: Add IOMMU nodes arm64: dts: qcom: msm8976: Add MDSS nodes arm64: dts: qcom: msm8976: Add Adreno GPU arm64: dts: qcom: msm8976: Add WCNSS node arch/arm64/boot/dts/qcom/msm8976.dtsi | 536 +- 1 file changed, 532 insertions(+), 4 deletions(-) I remember you pinged me on IRC about my previous review feedback. I would appreciate it if you could add a response email to your cover letter explaining why that's not done. I'm sure you're probably right, but good practice is to state in your log what was and wasn't done and why. --- bod
[PATCH v21 4/5] Documentation: tracing: Add ring-buffer mapping
It is now possible to mmap() a ring-buffer to stream its content. Add some documentation and a code example. Signed-off-by: Vincent Donnefort diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst index 5092d6c13af5..0b300901fd75 100644 --- a/Documentation/trace/index.rst +++ b/Documentation/trace/index.rst @@ -29,6 +29,7 @@ Linux Tracing Technologies timerlat-tracer intel_th ring-buffer-design + ring-buffer-map stm sys-t coresight/index diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst new file mode 100644 index ..8e296bcc0d7f --- /dev/null +++ b/Documentation/trace/ring-buffer-map.rst @@ -0,0 +1,106 @@ +.. SPDX-License-Identifier: GPL-2.0 + +== +Tracefs ring-buffer memory mapping +== + +:Author: Vincent Donnefort + +Overview + +Tracefs ring-buffer memory map provides an efficient method to stream data +as no memory copy is necessary. The application mapping the ring-buffer becomes +then a consumer for that ring-buffer, in a similar fashion to trace_pipe. + +Memory mapping setup + +The mapping works with a mmap() of the trace_pipe_raw interface. + +The first system page of the mapping contains ring-buffer statistics and +description. It is referred to as the meta-page. One of the most important +fields of the meta-page is the reader. It contains the sub-buffer ID which can +be safely read by the mapper (see ring-buffer-design.rst). + +The meta-page is followed by all the sub-buffers, ordered by ascending ID. It is +therefore effortless to know where the reader starts in the mapping: + +.. code-block:: c + +reader_id = meta->reader->id; +reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; + +When the application is done with the current reader, it can get a new one using +the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates +the meta-page fields. 
+ +Limitations +=== +When a mapping is in place on a Tracefs ring-buffer, it is not possible to +either resize it (either by increasing the entire size of the ring-buffer or +each subbuf). It is also not possible to use snapshot and causes splice to copy +the ring buffer data instead of using the copyless swap from the ring buffer. + +Concurrent readers (either another application mapping that ring-buffer or the +kernel with trace_pipe) are allowed but not recommended. They will compete for +the ring-buffer and the output is unpredictable, just like concurrent readers on +trace_pipe would be. + +Example +=== + +.. code-block:: c + +#include +#include +#include +#include + +#include + +#include +#include + +#define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw" + +int main(void) +{ +int page_size = getpagesize(), fd, reader_id; +unsigned long meta_len, data_len; +struct trace_buffer_meta *meta; +void *map, *reader, *data; + +fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK); +if (fd < 0) +exit(EXIT_FAILURE); + +map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0); +if (map == MAP_FAILED) +exit(EXIT_FAILURE); + +meta = (struct trace_buffer_meta *)map; +meta_len = meta->meta_page_size; + +printf("entries:%llu\n", meta->entries); +printf("overrun:%llu\n", meta->overrun); +printf("read: %llu\n", meta->read); +printf("nr_subbufs: %u\n", meta->nr_subbufs); + +data_len = meta->subbuf_size * meta->nr_subbufs; +data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len); +if (data == MAP_FAILED) +exit(EXIT_FAILURE); + +if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0) +exit(EXIT_FAILURE); + +reader_id = meta->reader.id; +reader = data + meta->subbuf_size * reader_id; + +printf("Current reader address: %p\n", reader); + +munmap(data, data_len); +munmap(meta, meta_len); +close (fd); + +return 0; +} -- 2.44.0.769.g3c40516874-goog
[PATCH v21 3/5] tracing: Allow user-space mapping of the ring-buffer
Currently, user-space extracts data from the ring-buffer via splice, which is handy for storage or network sharing. However, due to splice limitations, it is impossible to do real-time analysis without a copy. A solution for that problem is to let the user-space map the ring-buffer directly. The mapping is exposed via the per-CPU file trace_pipe_raw. The first element of the mapping is the meta-page. It is followed by each subbuffer constituting the ring-buffer, ordered by their unique page ID: * Meta-page -- include/uapi/linux/trace_mmap.h for a description * Subbuf ID 0 * Subbuf ID 1 ... It is therefore easy to translate a subbuf ID into an offset in the mapping: reader_id = meta->reader->id; reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; When new data is available, the mapper must call a newly introduced ioctl: TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to point to the next reader containing unread data. Mapping will prevent snapshot and buffer size modifications.
CC: Signed-off-by: Vincent Donnefort diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h index b682e9925539..bd1066754220 100644 --- a/include/uapi/linux/trace_mmap.h +++ b/include/uapi/linux/trace_mmap.h @@ -43,4 +43,6 @@ struct trace_buffer_meta { __u64 Reserved2; }; +#define TRACE_MMAP_IOCTL_GET_READER _IO('T', 0x1) + #endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 233d1af39fff..a35e7f598233 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1191,6 +1191,12 @@ static void tracing_snapshot_instance_cond(struct trace_array *tr, return; } + if (tr->mapped) { + trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n"); + trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n"); + return; + } + local_irq_save(flags); update_max_tr(tr, current, smp_processor_id(), cond_data); local_irq_restore(flags); @@ -1323,7 +1329,7 @@ static int tracing_arm_snapshot_locked(struct trace_array *tr) lockdep_assert_held(&trace_types_lock); spin_lock(&tr->snapshot_trigger_lock); - if (tr->snapshot == UINT_MAX) { + if (tr->snapshot == UINT_MAX || tr->mapped) { spin_unlock(&tr->snapshot_trigger_lock); return -EBUSY; } @@ -6068,7 +6074,7 @@ static void tracing_set_nop(struct trace_array *tr) { if (tr->current_trace == &nop_trace) return; - + tr->current_trace->enabled--; if (tr->current_trace->reset) @@ -8194,15 +8200,32 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, return ret; } -/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct ftrace_buffer_info *info = file->private_data; struct trace_iterator *iter = &info->iter; + int err; + + if (cmd == TRACE_MMAP_IOCTL_GET_READER) { + if (!(file->f_flags & O_NONBLOCK)) { + err = ring_buffer_wait(iter->array_buffer->buffer, + iter->cpu_file, + iter->tr->buffer_percent, + NULL, NULL); + if (err) + return err; + } - if (cmd) - return
-ENOIOCTLCMD; + return ring_buffer_map_get_reader(iter->array_buffer->buffer, + iter->cpu_file); + } else if (cmd) { + return -ENOTTY; + } + /* +* An ioctl call with cmd 0 to the ring buffer file will wake up all +* waiters +*/ mutex_lock(&trace_types_lock); /* Make sure the waiters see the new wait_index */ @@ -8214,6 +8237,76 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned return 0; } +#ifdef CONFIG_TRACER_MAX_TRACE +static int get_snapshot_map(struct trace_array *tr) +{ + int err = 0; + + /* +* Called with mmap_lock held. lockdep would be unhappy if we would now +* take trace_types_lock. Instead use the specific +* snapshot_trigger_lock. +*/ + spin_lock(&tr->snapshot_trigger_lock); + + if (tr->snapshot || tr->mapped == UINT_MAX) + err = -EBUSY; + else + tr->mapped++; + + spin_unlock(&tr->snapshot_trigger_lock); + + /* Wait for update_max_tr() to observe iter->tr->mapped */ + if (tr->mapped == 1) + synchronize_rcu(); + + return err; + +} +static void put_snapshot_map(struct trace_array *tr) +{ + spin_lock(&tr->snapshot_trigger_lock); + if (!WARN_ON(!tr->mapped)) +
[PATCH v21 2/5] ring-buffer: Introducing ring-buffer mapping functions
In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..b682e9925539 --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size of this meta-page. + * @meta_struct_len: Size of this structure. 
+ * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Internal use only. + * @Reserved2: Internal use only. + */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..84f8744fa110 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; struct list_headnew_pages; /* new pages to add */ @@ -1599,6 +1608,7 @@ 
rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters); init_waitqueue_head(&cpu_buffer->irq_work.waiters); init_waitqueue_head(&cpu_buffer->irq_work.full_waiters); +
[PATCH v21 1/5] ring-buffer: Allocate sub-buffers with __GFP_COMP
In preparation for the ring-buffer memory mapping, allocate compound pages for the ring-buffer sub-buffers to enable us to map them to user-space with vm_insert_pages(). Signed-off-by: Vincent Donnefort diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 25476ead681b..cc9ebe593571 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1524,7 +1524,7 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, list_add(>list, pages); page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), - mflags | __GFP_ZERO, + mflags | __GFP_COMP | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) goto free_pages; @@ -1609,7 +1609,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) cpu_buffer->reader_page = bpage; - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO, + page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) goto fail_free_reader; @@ -5579,7 +5579,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu) goto out; page = alloc_pages_node(cpu_to_node(cpu), - GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO, + GFP_KERNEL | __GFP_NORETRY | __GFP_COMP | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) { kfree(bpage); -- 2.44.0.769.g3c40516874-goog
[PATCH v21 0/5] Introducing trace buffer mapping by user-space
The tracing ring-buffers can be stored on disk or sent over the network without any copy via splice. However, the latter doesn't allow real-time processing of the traces. A solution is to give userspace direct access to the ring-buffer pages via a mapping. An application can now become a consumer of the ring-buffer, in a similar fashion to what trace_pipe offers. Support for this new feature can already be found in libtracefs from version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE. Vincent v20 -> v21: * Collect Ack * Add .gitignore * Few nits * Remove meta-page padding (zero-page not supported by vm_insert_pages) * Remove single-usage macros * Move vma flags handling into ring-buffer.c v19 -> v20: * Fix typos in documentation. * Remove useless mmap open and fault callbacks. * add mm.h include for vm_insert_pages v18 -> v19: * Use VM_PFNMAP and vm_insert_pages * Allocate ring-buffer subbufs with __GFP_COMP * Pad the meta-page with the zero-page to align on the subbuf_order * Extend the ring-buffer test with mmap() dedicated suite v17 -> v18: * Fix lockdep_assert_held * Fix spin_lock_init typo * Fix CONFIG_TRACER_MAX_TRACE typo v16 -> v17: * Documentation and comments improvements. * Create get/put_snapshot_map() for clearer code. * Replace kzalloc with kcalloc. * Fix -ENOMEM handling in rb_alloc_meta_page(). * Move flush(cpu_buffer->reader_page) behind the reader lock. * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer mutex. (removes READ_ONCE/WRITE_ONCE accesses). v15 -> v16: * Add comment for the dcache flush. * Remove now unnecessary WRITE_ONCE for the meta-page. v14 -> v15: * Add meta-page and reader-page flush. Intends to fix the mapping for VIVT and aliasing-VIPT data caches. * -EPERM on VM_EXEC. * Fix build warning !CONFIG_TRACER_MAX_TRACE. v13 -> v14: * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu) * on unmap, sync meta-page teardown with the reader_lock instead of the synchronize_rcu.
* Add a dedicated spinlock for trace_array ->snapshot and ->mapped. (intends to fix a lockdep issue) * Add kerneldoc for flags and Reserved fields. * Add kselftest for snapshot/map mutual exclusion. v12 -> v13: * Swap subbufs_{touched,lost} for Reserved fields. * Add a flag field in the meta-page. * Fix CONFIG_TRACER_MAX_TRACE. * Rebase on top of trace/urgent. * Add a comment for try_unregister_trigger() v11 -> v12: * Fix code sample mmap bug. * Add logging in sample code. * Reset tracer in selftest. * Add a refcount for the snapshot users. * Prevent mapping when there are snapshot users and vice versa. * Refine the meta-page. * Fix types in the meta-page. * Collect Reviewed-by. v10 -> v11: * Add Documentation and code sample. * Add a selftest. * Move all the update to the meta-page into a single rb_update_meta_page(). * rb_update_meta_page() is now called from ring_buffer_map_get_reader() to fix NOBLOCK callers. * kerneldoc for struct trace_meta_page. * Add a patch to zero all the ring-buffer allocations. v9 -> v10: * Refactor rb_update_meta_page() * In-loop declaration for foreach_subbuf_page() * Check for cpu_buffer->mapped overflow v8 -> v9: * Fix the unlock path in ring_buffer_map() * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a) v7 -> v8: * Drop the subbufs renaming into bpages * Use subbuf as a name when relevant v6 -> v7: * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/ * Support for subbufs * Rename subbufs into bpages v5 -> v6: * Rebase on next-20230802. * (unsigned long) -> (void *) cast for virt_to_page(). * Add a wait for the GET_READER_PAGE ioctl. * Move writer fields update (overrun/pages_lost/entries/pages_touched) in the irq_work. * Rearrange id in struct buffer_page. * Rearrange the meta-page. * ring_buffer_meta_page -> trace_buffer_meta_page. * Add meta_struct_len into the meta-page. 
v4 -> v5: * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3) v3 -> v4: * Add to the meta-page: - pages_lost / pages_read (allow to compute how full is the ring-buffer) - read (allow to compute how many entries can be read) - A reader_page struct. * Rename ring_buffer_meta_header -> ring_buffer_meta * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page * Properly consume events on ring_buffer_map_get_reader_page() with rb_advance_reader(). v2 -> v3: * Remove data page list (for non-consuming read) ** Implies removing order > 0 meta-page * Add a new meta page field ->read * Rename ring_buffer_meta_page_header into ring_buffer_meta_header v1 -> v2: * Hide data_pages from the userspace struct * Fix META_PAGE_MAX_PAGES * Support for order > 0 meta-page * Add missing page->mapping. Vincent Donnefort (5):
Re: [PATCH v2 2/6] iio: light: stk3310: Implement vdd supply and power it off during suspend
On Wed, Apr 24, 2024 at 1:41 AM Aren Moynihan wrote: > > From: Ondrej Jirman > > VDD power input can be used to completely power off the chip during > system suspend. Do so if available. ... > ret = stk3310_init(indio_dev); > if (ret < 0) > - return ret; > + goto err_vdd_disable; This is wrong. You will have the regulator being disabled _before_ IRQ. Note that the original code likely has a bug which sets states before disabling IRQ and removing a handler. Side note, you may make the driver neater with help of struct device *dev = &client->dev; defined in this patch. ... > static int stk3310_suspend(struct device *dev) > { > struct stk3310_data *data; > data = iio_priv(i2c_get_clientdata(to_i2c_client(dev))); Side note: This may be updated (in a separate change) to use dev_get_drvdata() directly. Jonathan, do we have something like iio_priv_from_drvdata(struct device *dev)? Seems many drivers may utilise it. > } ... > static int stk3310_resume(struct device *dev) Ditto. -- With Best Regards, Andy Shevchenko
Re: [PATCH v2 3/6] iio: light: stk3310: Manage LED power supply
On Wed, Apr 24, 2024 at 1:41 AM Aren Moynihan wrote: > > The stk3310 and stk3310 chips have an input for power to the infrared > LED. Add support for managing it's state. its ... > if (IS_ERR(data->vdd_reg)) > return dev_err_probe(>dev, ret, "get regulator vdd > failed\n"); > > + data->led_reg = devm_regulator_get(>dev, "leda"); > + if (IS_ERR(data->led_reg)) > + return dev_err_probe(>dev, ret, "get regulator led > failed\n"); Can't you use a bulk regulator API instead? -- With Best Regards, Andy Shevchenko
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Wed, 2024-04-24 at 00:27 +0300, Jarkko Sakkinen wrote: > On Tue Apr 23, 2024 at 8:08 PM EEST, Reinette Chatre wrote: > > Hi Kai, > > > > On 4/23/2024 4:50 AM, Huang, Kai wrote: > > > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > > > > b/arch/x86/kernel/cpu/sgx/ioctl.c > > > > index b65ab214bdf5..2340a82fa796 100644 > > > > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > > > > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > > > > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl > > > > *encl, > > > > } > > > > > > > > mutex_unlock(&encl->lock); > > > > + > > > > + if (need_resched()) > > > > + cond_resched(); > > > > } > > > > > > > > ret = 0; > > > > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct > > > > sgx_encl *encl, > > > > entry->type = page_type; > > > > > > > > mutex_unlock(&encl->lock); > > > > + > > > > + if (need_resched()) > > > > + cond_resched(); > > > > } > > > > > > > > ret = 0; > > > > @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl > > > > *encl, > > > > kfree(entry); > > > > > > > > mutex_unlock(&encl->lock); > > > > + > > > > + if (need_resched()) > > > > + cond_resched(); > > > > } > > > > > > > > > > You can remove the need_resched() in all 3 places above but just call > > > cond_resched() directly. > > > > > > > This change will call cond_resched() after dealing with each page in a > > potentially large page range (cover mentions 30GB but we have also had to > > make optimizations for enclaves larger than this). Adding a cond_resched() > > here will surely placate the soft lockup detector, but we need to take care > > how changes like this impact the performance of the system and having > > actions > > on these page ranges take much longer than necessary. > > For reference, please see 7b72c823ddf8 ("x86/sgx: Reduce delay and > > interference > > of enclave release") that turned frequent cond_resched() into batches > > to address performance issues. Ah I didn't know this. Thanks for the info.
> > It looks to me like the need_resched() may be a quick check that can be used > > to improve performance? > > Perhaps? My assumption is that eventually cond_resched() will do a similar check to need_resched(), but I am not entirely sure about that. Reading the code, it seems cond_resched() eventually does should_resched(). The generic version indeed does a similar check to need_resched(), but it seems the x86 version has a different implementation. > > I am not familiar with all use cases that need to be > > considered to determine if a batching solution may be needed. It looks like at least the REMOVE_PAGES ioctl() could have the same performance impact as the problem mentioned in commit 7b72c823ddf8 ("x86/sgx: Reduce delay and interference of enclave release"), but I guess it's acceptable to fix the softlockup first and then improve performance if someone hits any real issue. > > Ya, well no matter it is the reasoning will need to be documented > because this should have symmetry with sgx_ioc_enclave_add_pages() > (see my response to Kai). Yeah, I was actually going to mention this, but somehow chose not to. > > Because this makes dealing with need_resched() a change in code > even if it is left out as a side-effect, I'd support not removing it, > which means adding need_resched() as a side-effect. I am fine with keeping the need_resched().
[PATCH v2 3/6] iio: light: stk3310: Manage LED power supply
The stk3310 and stk3311 chips have an input for power to the infrared LED. Add support for managing its state. Signed-off-by: Aren Moynihan --- drivers/iio/light/stk3310.c | 23 ++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/drivers/iio/light/stk3310.c b/drivers/iio/light/stk3310.c index a0547eeca3e3..ee1ac95dbc0e 100644 --- a/drivers/iio/light/stk3310.c +++ b/drivers/iio/light/stk3310.c @@ -120,6 +120,7 @@ struct stk3310_data { struct regmap_field *reg_flag_psint; struct regmap_field *reg_flag_nf; struct regulator *vdd_reg; + struct regulator *led_reg; }; static const struct iio_event_spec stk3310_events[] = { @@ -614,6 +615,10 @@ static int stk3310_probe(struct i2c_client *client) if (IS_ERR(data->vdd_reg)) return dev_err_probe(>dev, ret, "get regulator vdd failed\n"); + data->led_reg = devm_regulator_get(>dev, "leda"); + if (IS_ERR(data->led_reg)) + return dev_err_probe(>dev, ret, "get regulator led failed\n"); + ret = stk3310_regmap_init(data); if (ret < 0) return ret; @@ -629,12 +634,18 @@ static int stk3310_probe(struct i2c_client *client) return dev_err_probe(>dev, ret, "regulator vdd enable failed\n"); + ret = regulator_enable(data->led_reg); + if (ret) { + dev_err_probe(>dev, ret, "regulator led enable failed\n"); + goto err_vdd_disable; + } + /* we need a short delay to allow the chip time to power on */ fsleep(1000); ret = stk3310_init(indio_dev); if (ret < 0) - goto err_vdd_disable; + goto err_led_disable; if (client->irq > 0) { ret = devm_request_threaded_irq(>dev, client->irq, @@ -660,6 +671,8 @@ static int stk3310_probe(struct i2c_client *client) err_standby: stk3310_set_state(data, STK3310_STATE_STANDBY); +err_led_disable: + regulator_disable(data->led_reg); err_vdd_disable: regulator_disable(data->vdd_reg); return ret; @@ -672,6 +685,7 @@ static void stk3310_remove(struct i2c_client *client) iio_device_unregister(indio_dev); stk3310_set_state(iio_priv(indio_dev), STK3310_STATE_STANDBY); + regulator_disable(data->led_reg);
regulator_disable(data->vdd_reg); } @@ -687,6 +701,7 @@ static int stk3310_suspend(struct device *dev) return ret; regcache_mark_dirty(data->regmap); + regulator_disable(data->led_reg); regulator_disable(data->vdd_reg); return 0; @@ -706,6 +721,12 @@ static int stk3310_resume(struct device *dev) return ret; } + ret = regulator_enable(data->led_reg); + if (ret) { + dev_err(dev, "Failed to re-enable regulator led\n"); + return ret; + } + fsleep(1000); ret = regcache_sync(data->regmap); -- 2.44.0
[PATCH v2 2/6] iio: light: stk3310: Implement vdd supply and power it off during suspend
From: Ondrej Jirman VDD power input can be used to completely power off the chip during system suspend. Do so if available. Signed-off-by: Ondrej Jirman Signed-off-by: Aren Moynihan --- Notes: Changes in v2: - always enable / disable regulators and rely on a dummy regulator if one isn't specified - replace usleep_range with fsleep - reorder includes so iio headers are last - add missing error handling to resume drivers/iio/light/stk3310.c | 49 ++--- 1 file changed, 46 insertions(+), 3 deletions(-) diff --git a/drivers/iio/light/stk3310.c b/drivers/iio/light/stk3310.c index 7b71ad71d78d..a0547eeca3e3 100644 --- a/drivers/iio/light/stk3310.c +++ b/drivers/iio/light/stk3310.c @@ -13,6 +13,8 @@ #include #include #include +#include + #include #include #include @@ -117,6 +119,7 @@ struct stk3310_data { struct regmap_field *reg_int_ps; struct regmap_field *reg_flag_psint; struct regmap_field *reg_flag_nf; + struct regulator *vdd_reg; }; static const struct iio_event_spec stk3310_events[] = { @@ -607,6 +610,10 @@ static int stk3310_probe(struct i2c_client *client) mutex_init(>lock); + data->vdd_reg = devm_regulator_get(>dev, "vdd"); + if (IS_ERR(data->vdd_reg)) + return dev_err_probe(>dev, ret, "get regulator vdd failed\n"); + ret = stk3310_regmap_init(data); if (ret < 0) return ret; @@ -617,9 +624,17 @@ static int stk3310_probe(struct i2c_client *client) indio_dev->channels = stk3310_channels; indio_dev->num_channels = ARRAY_SIZE(stk3310_channels); + ret = regulator_enable(data->vdd_reg); + if (ret) + return dev_err_probe(>dev, ret, +"regulator vdd enable failed\n"); + + /* we need a short delay to allow the chip time to power on */ + fsleep(1000); + ret = stk3310_init(indio_dev); if (ret < 0) - return ret; + goto err_vdd_disable; if (client->irq > 0) { ret = devm_request_threaded_irq(>dev, client->irq, @@ -645,32 +660,60 @@ static int stk3310_probe(struct i2c_client *client) err_standby: stk3310_set_state(data, STK3310_STATE_STANDBY); +err_vdd_disable: + 
regulator_disable(data->vdd_reg); return ret; } static void stk3310_remove(struct i2c_client *client) { struct iio_dev *indio_dev = i2c_get_clientdata(client); + struct stk3310_data *data = iio_priv(indio_dev); iio_device_unregister(indio_dev); stk3310_set_state(iio_priv(indio_dev), STK3310_STATE_STANDBY); + regulator_disable(data->vdd_reg); } static int stk3310_suspend(struct device *dev) { struct stk3310_data *data; + int ret; data = iio_priv(i2c_get_clientdata(to_i2c_client(dev))); - return stk3310_set_state(data, STK3310_STATE_STANDBY); + ret = stk3310_set_state(data, STK3310_STATE_STANDBY); + if (ret) + return ret; + + regcache_mark_dirty(data->regmap); + regulator_disable(data->vdd_reg); + + return 0; } static int stk3310_resume(struct device *dev) { - u8 state = 0; struct stk3310_data *data; + u8 state = 0; + int ret; data = iio_priv(i2c_get_clientdata(to_i2c_client(dev))); + + ret = regulator_enable(data->vdd_reg); + if (ret) { + dev_err(dev, "Failed to re-enable regulator vdd\n"); + return ret; + } + + fsleep(1000); + + ret = regcache_sync(data->regmap); + if (ret) { + dev_err(dev, "Failed to restore registers: %d\n", ret); + return ret; + } + if (data->ps_enabled) state |= STK3310_STATE_EN_PS; if (data->als_enabled) -- 2.44.0
[PATCH v2 6/6] arm64: dts: allwinner: pinephone: Add power supplies to stk3311
From: Ondrej Jirman This makes the driver disable the supply during sleep. Signed-off-by: Ondrej Jirman Signed-off-by: Aren Moynihan --- Notes: Changes in v2: - add leda-supply arch/arm64/boot/dts/allwinner/sun50i-a64-pinephone.dtsi | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64-pinephone.dtsi b/arch/arm64/boot/dts/allwinner/sun50i-a64-pinephone.dtsi index b5a232209f2b..51ab1db95f81 100644 --- a/arch/arm64/boot/dts/allwinner/sun50i-a64-pinephone.dtsi +++ b/arch/arm64/boot/dts/allwinner/sun50i-a64-pinephone.dtsi @@ -250,6 +250,8 @@ light-sensor@48 { reg = <0x48>; interrupt-parent = <>; interrupts = <1 0 IRQ_TYPE_EDGE_FALLING>; /* PB0 */ + vdd-supply = <_ldo_io0>; + leda-supply = <_dldo1>; }; /* Accelerometer/gyroscope */ -- 2.44.0
[PATCH v2 5/6] iio: light: stk3310: log error if reading the chip id fails
If the chip isn't powered, this call is likely to return an error. Without a log here the driver will silently fail to probe. Common errors are ENXIO (when the chip isn't powered) and ETIMEDOUT (when the i2c bus isn't powered). Signed-off-by: Aren Moynihan --- Notes: Changes in v2: - use dev_err_probe drivers/iio/light/stk3310.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iio/light/stk3310.c b/drivers/iio/light/stk3310.c index c56c6298d292..6ee6f145a6d5 100644 --- a/drivers/iio/light/stk3310.c +++ b/drivers/iio/light/stk3310.c @@ -475,7 +475,7 @@ static int stk3310_init(struct iio_dev *indio_dev) ret = regmap_read(data->regmap, STK3310_REG_ID, ); if (ret < 0) - return ret; + return dev_err_probe(>dev, ret, "failed to read chip id\n"); if (chipid != STK3310_CHIP_ID_VAL && chipid != STK3311_CHIP_ID_VAL && -- 2.44.0
[PATCH v2 1/6] dt-bindings: iio: light: stk33xx: add vdd and leda regulators
stk3310 and stk3311 are typically connected to power supplies for the chip (vdd) and the infrared LED (leda). Add properties so we can power these up / down appropriately. Signed-off-by: Aren Moynihan --- Notes: Changes in v2: - add leda-supply - add supplies to examples Documentation/devicetree/bindings/iio/light/stk33xx.yaml | 4 1 file changed, 4 insertions(+) diff --git a/Documentation/devicetree/bindings/iio/light/stk33xx.yaml b/Documentation/devicetree/bindings/iio/light/stk33xx.yaml index f6e22dc9814a..43ead524cecb 100644 --- a/Documentation/devicetree/bindings/iio/light/stk33xx.yaml +++ b/Documentation/devicetree/bindings/iio/light/stk33xx.yaml @@ -29,6 +29,8 @@ properties: interrupts: maxItems: 1 + vdd-supply: true + leda-supply: true proximity-near-level: true required: @@ -52,6 +54,8 @@ examples: proximity-near-level = <25>; interrupt-parent = <>; interrupts = <5 IRQ_TYPE_LEVEL_LOW>; +vdd-supply = <_regulator>; +leda-supply = <_regulator>; }; }; ... -- 2.44.0
[PATCH v2 0/6] iio: light: stk3310: support powering off during suspend
In the Pine64 PinePhone, the stk3310 chip is powered by a regulator that is disabled at system boot and can be shut off during suspend. To ensure that the chip properly initializes, both after boot and suspend, we need to manage this regulator. Additionally if the chip is shut off in suspend, we need to make sure that it gets reinitialized with the same parameters after resume. Major changes in v2: - Add handling of the IR LED. I was hesitant to include this as it is the same as pull-up regulator for the i2c bus on the hardware I have, so I can't test it well. I think leaving it out is more likely to cause issues than including it. - Convert stk3310 to use dev_err_probe for errors. - Always enable / disable regulators and rely on dummy devices if they're not specified. - more listed in individual patches Aren Moynihan (4): dt-bindings: iio: light: stk33xx: add vdd and leda regulators iio: light: stk3310: Manage LED power supply iio: light: stk3310: use dev_err_probe where possible iio: light: stk3310: log error if reading the chip id fails Ondrej Jirman (2): iio: light: stk3310: Implement vdd supply and power it off during suspend arm64: dts: allwinner: pinephone: Add power supplies to stk3311 .../bindings/iio/light/stk33xx.yaml | 4 + .../dts/allwinner/sun50i-a64-pinephone.dtsi | 2 + drivers/iio/light/stk3310.c | 116 +- 3 files changed, 94 insertions(+), 28 deletions(-) -- 2.44.0
[PATCH v2 4/6] iio: light: stk3310: use dev_err_probe where possible
Using dev_err_probe instead of dev_err and return makes the errors easier to understand by including the error name, and saves a little code. Signed-off-by: Aren Moynihan --- Notes: Added in v2 drivers/iio/light/stk3310.c | 44 + 1 file changed, 20 insertions(+), 24 deletions(-) diff --git a/drivers/iio/light/stk3310.c b/drivers/iio/light/stk3310.c index ee1ac95dbc0e..c56c6298d292 100644 --- a/drivers/iio/light/stk3310.c +++ b/drivers/iio/light/stk3310.c @@ -60,10 +60,10 @@ data->reg_##name = \ devm_regmap_field_alloc(>dev, regmap, \ stk3310_reg_field_##name); \ - if (IS_ERR(data->reg_##name)) { \ - dev_err(>dev, "reg field alloc failed.\n"); \ - return PTR_ERR(data->reg_##name); \ - } \ + if (IS_ERR(data->reg_##name)) \ + return dev_err_probe(>dev, \ + PTR_ERR(data->reg_##name), \ + "reg field alloc failed.\n"); \ } while (0) static const struct reg_field stk3310_reg_field_state = @@ -480,22 +480,20 @@ static int stk3310_init(struct iio_dev *indio_dev) if (chipid != STK3310_CHIP_ID_VAL && chipid != STK3311_CHIP_ID_VAL && chipid != STK3311X_CHIP_ID_VAL && - chipid != STK3335_CHIP_ID_VAL) { - dev_err(>dev, "invalid chip id: 0x%x\n", chipid); - return -ENODEV; - } + chipid != STK3335_CHIP_ID_VAL) + return dev_err_probe(>dev, -ENODEV, +"invalid chip id: 0x%x\n", chipid); state = STK3310_STATE_EN_ALS | STK3310_STATE_EN_PS; ret = stk3310_set_state(data, state); - if (ret < 0) { - dev_err(>dev, "failed to enable sensor"); - return ret; - } + if (ret < 0) + return dev_err_probe(>dev, ret, "failed to enable sensor\n"); /* Enable PS interrupts */ ret = regmap_field_write(data->reg_int_ps, STK3310_PSINT_EN); if (ret < 0) - dev_err(>dev, "failed to enable interrupts!\n"); + return dev_err_probe(>dev, ret, +"failed to enable interrupts!\n"); return ret; } @@ -530,10 +528,10 @@ static int stk3310_regmap_init(struct stk3310_data *data) client = data->client; regmap = devm_regmap_init_i2c(client, _regmap_config); - if (IS_ERR(regmap)) { - dev_err(>dev, "regmap initialization 
failed.\n"); - return PTR_ERR(regmap); - } + if (IS_ERR(regmap)) + return dev_err_probe(>dev, PTR_ERR(regmap), +"regmap initialization failed.\n"); + data->regmap = regmap; STK3310_REGFIELD(state); @@ -597,10 +595,8 @@ static int stk3310_probe(struct i2c_client *client) struct stk3310_data *data; indio_dev = devm_iio_device_alloc(>dev, sizeof(*data)); - if (!indio_dev) { - dev_err(>dev, "iio allocation failed!\n"); - return -ENOMEM; - } + if (!indio_dev) + return dev_err_probe(>dev, -ENOMEM, "iio allocation failed!\n"); data = iio_priv(indio_dev); data->client = client; @@ -655,15 +651,15 @@ static int stk3310_probe(struct i2c_client *client) IRQF_ONESHOT, STK3310_EVENT, indio_dev); if (ret < 0) { - dev_err(>dev, "request irq %d failed\n", - client->irq); + dev_err_probe(>dev, ret, "request irq %d failed\n", + client->irq); goto err_standby; } } ret = iio_device_register(indio_dev); if (ret < 0) { - dev_err(>dev, "device_register failed\n"); + dev_err_probe(>dev, ret, "device_register failed\n"); goto err_standby; } -- 2.44.0
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote: > > > It's a workaround because you use the capacity==0 but it does not really > > > mean to disable the misc cgroup for specific resource IIUC. > > > > Please read the comment around @misc_res_capacity again: > > > > * Miscellaneous resources capacity for the entire machine. 0 capacity > > * means resource is not initialized or not present in the host. > > > > I mentioned this in earlier email. I think this means no SGX EPC. It does > not mean sgx epc cgroup not enabled. That's also consistent with the > behavior try_charge() fails if capacity is zero. OK. To me the "capacity" is purely the concept of cgroup, so it must be interpreted within the scope of "cgroup". If cgroup, in our case, SGX cgroup, is disabled, then whether "leaving the capacity to reflect the presence of hardware resource" doesn't really matter. So what you are saying is that, the kernel must initialize the capacity of some MISC resource once it is added as valid type. And you must initialize the "capacity" even MISC cgroup is disabled entirely by kernel commandline, in which case, IIUC, misc.capacity is not even going to show in the /fs. If this is your point, then your patch: cgroup/misc: Add SGX EPC resource type is already broken, because you added the new type w/o initializing the capacity. Please fix that up. > > > > > > > There is explicit way for user to disable misc without setting capacity > > > to > > > zero. > > > > Which way are you talking about? > > Echo "-misc" to cgroup.subtree_control at root level for example still > shows non-zero sgx_epc capacity. I guess "having to disable all MISC resources just in order to disable SGX EPC cgroup" is a brilliant idea. You can easily disable the entire MISC cgroup by commandline for that purpose if that's acceptable. And I have no idea why "still showing non-zero EPC capacity" is important if SGX cgroup cannot be supported at all.
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Tue Apr 23, 2024 at 8:08 PM EEST, Reinette Chatre wrote: > Hi Kai, > > On 4/23/2024 4:50 AM, Huang, Kai wrote: > >> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > >> b/arch/x86/kernel/cpu/sgx/ioctl.c > >> index b65ab214bdf5..2340a82fa796 100644 > >> --- a/arch/x86/kernel/cpu/sgx/ioctl.c > >> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > >> @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > >>} > >> > >>mutex_unlock(>lock); > >> + > >> + if (need_resched()) > >> + cond_resched(); > >>} > >> > >>ret = 0; > >> @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > >> *encl, > >>entry->type = page_type; > >> > >>mutex_unlock(>lock); > >> + > >> + if (need_resched()) > >> + cond_resched(); > >>} > >> > >>ret = 0; > >> @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl > >> *encl, > >>kfree(entry); > >> > >>mutex_unlock(>lock); > >> + > >> + if (need_resched()) > >> + cond_resched(); > >>} > >> > > > > You can remove the need_reshced() in all 3 places above but just call > > cond_resched() directly. > > > > This change will call cond_resched() after dealing with each page in a > potentially large page range (cover mentions 30GB but we have also had to > make optimizations for enclaves larger than this). Adding a cond_resched() > here will surely placate the soft lockup detector, but we need to take care > how changes like this impact the performance of the system and having actions > on these page ranges take much longer than necessary. > For reference, please see 7b72c823ddf8 ("x86/sgx: Reduce delay and > interference > of enclave release") that turned frequent cond_resched() into batches > to address performance issues. > > It looks to me like the need_resched() may be a quick check that can be used > to improve performance? I am not familiar with all use cases that need to be > considered to determine if a batching solution may be needed. 
Ya, well, no matter what, the reasoning will need to be documented because this should have symmetry with sgx_ioc_enclave_add_pages() (see my response to Kai). Because this makes dealing with need_resched() a change in code even if it is left out as a side-effect, I'd support not removing it, which means adding need_resched() as a side-effect. From this follows that *if* need_resched() needs to be removed then that is not really part of the bug fix, so in all cases the bug fix itself must include need_resched() :-) Phew, hope you got my logic here, I think it's reasonable... BR, Jarkko
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Tue Apr 23, 2024 at 2:50 PM EEST, Huang, Kai wrote: > On Tue, 2024-04-23 at 17:25 +0800, 朱伯君(杰铭) wrote: > > EDMM's ioctl()s support batch operations, which may be > > time-consuming. Try to explicitly give up the CPU at > > the every end of "for loop" in > > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > > to give other tasks a chance to run, and avoid softlockup warning. > > > > The following has been observed on Linux v6.9-rc5 with kernel > > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > > is requested to restrict page permissions of a large number of EPC pages. > > > > [ cut here ] > > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] > > ... > > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 > > ... > > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e > > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 > > 40 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef > > RSP: 0018:b55a6591fa80 EFLAGS: 0202 > > RAX: RBX: b55a6591fac0 RCX: b581e7384000 > > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 > > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 > > R10: 006e R11: 0002 R12: 00072052d000 > > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 > > FS: 7fe10dbda740() GS:92163e48() > > knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 > > DR0: DR1: DR2: > > DR3: DR6: fffe07f0 DR7: 0400 > > PKRU: 5554 > > Call Trace: > > > > ? show_regs+0x67/0x70 > > ? watchdog_timer_fn+0x1f3/0x280 > > ? __pfx_watchdog_timer_fn+0x10/0x10 > > ? __hrtimer_run_queues+0xc8/0x220 > > ? hrtimer_interrupt+0x10c/0x250 > > ? __sysvec_apic_timer_interrupt+0x53/0x130 > > ? sysvec_apic_timer_interrupt+0x7b/0x90 > > > > > > ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > > ? sgx_enclave_restrict_permissions+0xba/0x1f0 > > ? __pte_offset_map_lock+0x94/0x110 > > ? 
sgx_encl_test_and_clear_young_cb+0x40/0x60 > > sgx_ioctl+0x1ab/0x900 > > ? do_syscall_64+0x79/0x110 > > ? apply_to_page_range+0x14/0x20 > > ? sgx_encl_test_and_clear_young+0x6c/0x80 > > ? sgx_vma_fault+0x132/0x4f0 > > __x64_sys_ioctl+0x95/0xd0 > > x64_sys_call+0x1209/0x20c0 > > do_syscall_64+0x6d/0x110 > > ? do_syscall_64+0x79/0x110 > > ? do_pte_missing+0x2e8/0xcc0 > > ? __pte_offset_map+0x1c/0x190 > > ? __handle_mm_fault+0x7b9/0xe60 > > ? __count_memcg_events+0x70/0x100 > > ? handle_mm_fault+0x256/0x360 > > ? do_user_addr_fault+0x3c1/0x860 > > ? irqentry_exit_to_user_mode+0x67/0x190 > > ? irqentry_exit+0x3b/0x50 > > ? exc_page_fault+0x89/0x180 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > RIP: 0033:0x7fe10e2ee5cb > > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff > > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 > > ff ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 > > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 > > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb > > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 > > RBP: 0005 R08: R09: 7fffb2c75594 > > R10: 7fffb2c755c8 R11: 0246 R12: c028a405 > > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 > > > > [ end trace ] > > Could you trim down the trace to only include the relevant part? > > E.g., please at least remove the two register dumps at the beginning and > end of the trace. > > Please refer to "Backtraces in commit messages" section in > Documentation/process/submitting-patches.rst. 
> > > > > Signed-off-by: Bojun Zhu > > --- > > arch/x86/kernel/cpu/sgx/ioctl.c | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > > b/arch/x86/kernel/cpu/sgx/ioctl.c > > index b65ab214bdf5..2340a82fa796 100644 > > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > > } > > > > mutex_unlock(>lock); > > + > > + if (need_resched()) > > + cond_resched(); > > } > > > > ret = 0; > > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > > *encl, > >
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Wed Apr 24, 2024 at 12:10 AM EEST, Jarkko Sakkinen wrote: > On Tue Apr 23, 2024 at 12:25 PM EEST, =?UTF-8?B?5pyx5Lyv5ZCbKOadsOmTrSk=?= > wrote: > > EDMM's ioctl()s support batch operations, which may be > > time-consuming. Try to explicitly give up the CPU at > > the every end of "for loop" in > > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > > to give other tasks a chance to run, and avoid softlockup warning. > > > > The following has been observed on Linux v6.9-rc5 with kernel > > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > > is requested to restrict page permissions of a large number of EPC pages. > > > > [ cut here ] > > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] > > ... > > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 > > ... > > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e > > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 > > 40 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef > > RSP: 0018:b55a6591fa80 EFLAGS: 0202 > > RAX: RBX: b55a6591fac0 RCX: b581e7384000 > > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 > > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 > > R10: 006e R11: 0002 R12: 00072052d000 > > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 > > FS: 7fe10dbda740() GS:92163e48() > > knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 > > DR0: DR1: DR2: > > DR3: DR6: fffe07f0 DR7: 0400 > > PKRU: 5554 > > Call Trace: > > > > ? show_regs+0x67/0x70 > > ? watchdog_timer_fn+0x1f3/0x280 > > ? __pfx_watchdog_timer_fn+0x10/0x10 > > ? __hrtimer_run_queues+0xc8/0x220 > > ? hrtimer_interrupt+0x10c/0x250 > > ? __sysvec_apic_timer_interrupt+0x53/0x130 > > ? sysvec_apic_timer_interrupt+0x7b/0x90 > > > > > > ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > > ? sgx_enclave_restrict_permissions+0xba/0x1f0 > > ? 
__pte_offset_map_lock+0x94/0x110 > > ? sgx_encl_test_and_clear_young_cb+0x40/0x60 > > sgx_ioctl+0x1ab/0x900 > > ? do_syscall_64+0x79/0x110 > > ? apply_to_page_range+0x14/0x20 > > ? sgx_encl_test_and_clear_young+0x6c/0x80 > > ? sgx_vma_fault+0x132/0x4f0 > > __x64_sys_ioctl+0x95/0xd0 > > x64_sys_call+0x1209/0x20c0 > > do_syscall_64+0x6d/0x110 > > ? do_syscall_64+0x79/0x110 > > ? do_pte_missing+0x2e8/0xcc0 > > ? __pte_offset_map+0x1c/0x190 > > ? __handle_mm_fault+0x7b9/0xe60 > > ? __count_memcg_events+0x70/0x100 > > ? handle_mm_fault+0x256/0x360 > > ? do_user_addr_fault+0x3c1/0x860 > > ? irqentry_exit_to_user_mode+0x67/0x190 > > ? irqentry_exit+0x3b/0x50 > > ? exc_page_fault+0x89/0x180 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > RIP: 0033:0x7fe10e2ee5cb > > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff > > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 > > ff ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 > > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 > > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb > > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 > > RBP: 0005 R08: R09: 7fffb2c75594 > > R10: 7fffb2c755c8 R11: 0246 R12: c028a405 > > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 > > > > [ end trace ] > > > > Signed-off-by: Bojun Zhu > > Can you also fixup this as your "firstname lastname" in your emails > from field? This matters so that author field in git log matches your > sob. 
> > > --- > > arch/x86/kernel/cpu/sgx/ioctl.c | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c > > b/arch/x86/kernel/cpu/sgx/ioctl.c > > index b65ab214bdf5..2340a82fa796 100644 > > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > > } > > > > mutex_unlock(>lock); > > + > > + if (need_resched()) > > + cond_resched(); > > } > > > > ret = 0; > > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > > *encl, > > entry->type = page_type; > > > > mutex_unlock(>lock); > > + > > + if
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Tue Apr 23, 2024 at 12:25 PM EEST, =?UTF-8?B?5pyx5Lyv5ZCbKOadsOmTrSk=?= wrote: > EDMM's ioctl()s support batch operations, which may be > time-consuming. Try to explicitly give up the CPU at > the every end of "for loop" in > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > to give other tasks a chance to run, and avoid softlockup warning. > > The following has been observed on Linux v6.9-rc5 with kernel > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > is requested to restrict page permissions of a large number of EPC pages. > > [ cut here ] > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] > ... > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 > ... > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 40 > 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef > RSP: 0018:b55a6591fa80 EFLAGS: 0202 > RAX: RBX: b55a6591fac0 RCX: b581e7384000 > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 > R10: 006e R11: 0002 R12: 00072052d000 > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 > FS: 7fe10dbda740() GS:92163e48() > knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 > DR0: DR1: DR2: > DR3: DR6: fffe07f0 DR7: 0400 > PKRU: 5554 > Call Trace: > > ? show_regs+0x67/0x70 > ? watchdog_timer_fn+0x1f3/0x280 > ? __pfx_watchdog_timer_fn+0x10/0x10 > ? __hrtimer_run_queues+0xc8/0x220 > ? hrtimer_interrupt+0x10c/0x250 > ? __sysvec_apic_timer_interrupt+0x53/0x130 > ? sysvec_apic_timer_interrupt+0x7b/0x90 > > > ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > ? sgx_enclave_restrict_permissions+0xba/0x1f0 > ? __pte_offset_map_lock+0x94/0x110 > ? sgx_encl_test_and_clear_young_cb+0x40/0x60 > sgx_ioctl+0x1ab/0x900 > ? do_syscall_64+0x79/0x110 > ? apply_to_page_range+0x14/0x20 > ? 
sgx_encl_test_and_clear_young+0x6c/0x80 > ? sgx_vma_fault+0x132/0x4f0 > __x64_sys_ioctl+0x95/0xd0 > x64_sys_call+0x1209/0x20c0 > do_syscall_64+0x6d/0x110 > ? do_syscall_64+0x79/0x110 > ? do_pte_missing+0x2e8/0xcc0 > ? __pte_offset_map+0x1c/0x190 > ? __handle_mm_fault+0x7b9/0xe60 > ? __count_memcg_events+0x70/0x100 > ? handle_mm_fault+0x256/0x360 > ? do_user_addr_fault+0x3c1/0x860 > ? irqentry_exit_to_user_mode+0x67/0x190 > ? irqentry_exit+0x3b/0x50 > ? exc_page_fault+0x89/0x180 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7fe10e2ee5cb > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff > ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 > RBP: 0005 R08: R09: 7fffb2c75594 > R10: 7fffb2c755c8 R11: 0246 R12: c028a405 > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 > > [ end trace ] > > Signed-off-by: Bojun Zhu Can you also fixup this as your "firstname lastname" in your emails from field? This matters so that author field in git log matches your sob. 
> --- > arch/x86/kernel/cpu/sgx/ioctl.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c > index b65ab214bdf5..2340a82fa796 100644 > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > } > > mutex_unlock(>lock); > + > + if (need_resched()) > + cond_resched(); > } > > ret = 0; > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > *encl, > entry->type = page_type; > > mutex_unlock(>lock); > + > + if (need_resched()) > + cond_resched(); > } > > ret = 0; > @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl, > kfree(entry); > > mutex_unlock(>lock); > + > + if
Re: [PATCH] drivers: remoteproc: xlnx: fix uninitialized variable use
On Tue, Apr 23, 2024 at 10:02:11AM -0700, Tanmay Shah wrote: > Fix following warning for clang compiler with W=1 option: > initialize the variable 'ret' to silence this warning > 907 | int ret, i; > |^ > | = 0 > > Fixes: a6b974b40f94 ("drivers: remoteproc: xlnx: Add Versal and Versal-NET > support") > Reported-by: kernel test robot > Closes: > https://lore.kernel.org/oe-kbuild-all/202404231839.ohiy9lw8-...@intel.com/ > Signed-off-by: Tanmay Shah > --- > drivers/remoteproc/xlnx_r5_remoteproc.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c > b/drivers/remoteproc/xlnx_r5_remoteproc.c > index a6d8ac7394e7..d98940d7ef8f 100644 > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c > @@ -904,7 +904,7 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster > *cluster, > { > struct device *dev = cluster->dev; > struct zynqmp_r5_core *r5_core; > - int ret, i; > + int ret = -EINVAL, i; > Applied - thanks, Mathieu > r5_core = cluster->r5_cores[0]; > > > base-commit: e99fcac055b3325283d6c5c61a117651fb147686 > -- > 2.25.1 >
Re: [PATCH v5 00/15] mm: jit/text allocator
On Mon, Apr 22, 2024 at 12:44:21PM +0300, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" > > (something went wrong with the previous posting, sorry for the noise) > > Hi, > > Since v3 I looked into making execmem more of a utility toolbox, as we > discussed at LPC with Mark Rutland, but it was getting hairier than > having a struct describing architecture constraints and a type identifying > the consumer of execmem. > > And I do think that having the description of architecture constraints for > allocations of executable memory in a single place is better than having it > spread all over the place. > > The patches are available via git: > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v5 Thanks! I've merged and pushed this onto modules-next in its entirety now for wider testing. Luis
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
Hi Kai, On 4/23/2024 4:50 AM, Huang, Kai wrote: >> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c >> b/arch/x86/kernel/cpu/sgx/ioctl.c >> index b65ab214bdf5..2340a82fa796 100644 >> --- a/arch/x86/kernel/cpu/sgx/ioctl.c >> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c >> @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, >> } >> >> mutex_unlock(>lock); >> + >> +if (need_resched()) >> +cond_resched(); >> } >> >> ret = 0; >> @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl >> *encl, >> entry->type = page_type; >> >> mutex_unlock(>lock); >> + >> +if (need_resched()) >> +cond_resched(); >> } >> >> ret = 0; >> @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl >> *encl, >> kfree(entry); >> >> mutex_unlock(>lock); >> + >> +if (need_resched()) >> +cond_resched(); >> } >> > > You can remove the need_reshced() in all 3 places above but just call > cond_resched() directly. > This change will call cond_resched() after dealing with each page in a potentially large page range (cover mentions 30GB but we have also had to make optimizations for enclaves larger than this). Adding a cond_resched() here will surely placate the soft lockup detector, but we need to take care how changes like this impact the performance of the system and having actions on these page ranges take much longer than necessary. For reference, please see 7b72c823ddf8 ("x86/sgx: Reduce delay and interference of enclave release") that turned frequent cond_resched() into batches to address performance issues. It looks to me like the need_resched() may be a quick check that can be used to improve performance? I am not familiar with all use cases that need to be considered to determine if a batching solution may be needed. Reinette
[PATCH] drivers: remoteproc: xlnx: fix uninitialized variable use
Fix following warning for clang compiler with W=1 option: initialize the variable 'ret' to silence this warning 907 | int ret, i; |^ | = 0 Fixes: a6b974b40f94 ("drivers: remoteproc: xlnx: Add Versal and Versal-NET support") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202404231839.ohiy9lw8-...@intel.com/ Signed-off-by: Tanmay Shah --- drivers/remoteproc/xlnx_r5_remoteproc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c index a6d8ac7394e7..d98940d7ef8f 100644 --- a/drivers/remoteproc/xlnx_r5_remoteproc.c +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c @@ -904,7 +904,7 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster *cluster, { struct device *dev = cluster->dev; struct zynqmp_r5_core *r5_core; - int ret, i; + int ret = -EINVAL, i; r5_core = cluster->r5_cores[0]; base-commit: e99fcac055b3325283d6c5c61a117651fb147686 -- 2.25.1
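For readers unfamiliar with this warning class: when an if/else-if chain has no final else, a local `ret` can reach the `return` without ever being assigned. The fragment below is a standalone illustration of the pattern and the chosen fix; the two flags and the `0` returns are hypothetical stand-ins for the of_find_property()/device_is_compatible() checks and the TCM-lookup helpers in the real driver.

```c
#include <errno.h>

/*
 * Illustrative stand-in for zynqmp_r5_core_init()'s branch structure.
 * has_reg / is_zynqmp model the property/compatible checks; the 0
 * returns model the TCM lookup helpers succeeding.
 */
static int core_init_sketch(int has_reg, int is_zynqmp)
{
	int ret = -EINVAL;	/* the fix: a defined value if no branch runs */

	if (has_reg)
		ret = 0;	/* ~ zynqmp_r5_get_tcm_node_from_dt() */
	else if (is_zynqmp)
		ret = 0;	/* ~ zynqmp_r5_get_tcm_node() */

	return ret;		/* without the initializer, this read is undefined */
}
```

Initializing to -EINVAL rather than clang's suggested `= 0` also gives the "neither branch applied" case a meaningful error code instead of silently reporting success.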
Re: [PATCH v3 4/4] arm64: dts: qcom: msm8976: Add WCNSS node
On 4/13/24 19:03, Adam Skladowski wrote: Add a node describing the wireless connectivity subsystem. Signed-off-by: Adam Skladowski --- [...] + wcnss_wifi: wifi { + compatible = "qcom,wcnss-wlan"; + + interrupts = , +; + interrupt-names = "tx", "rx"; + + qcom,smem-states = <_smsm 10>, <_smsm 9>; + qcom,smem-state-names = "tx-enable", + "tx-rings-empty"; Since you're already going to resend, please keep one entry per line; this fragment is very inconsistent. Looks good otherwise. Konrad
Re: [PATCH v5 04/15] sparc: simplify module_alloc()
Hi Mike, On Mon, Apr 22, 2024 at 12:44:25PM +0300, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" > > Define MODULES_VADDR and MODULES_END as VMALLOC_START and VMALLOC_END > for 32-bit and reduce module_alloc() to > > __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, ...) > > as with the new defines the allocations become identical for both 32 > and 64 bits. > > While at it, drop the unused include of > > Suggested-by: Sam Ravnborg > Signed-off-by: Mike Rapoport (IBM) Looks good. Reviewed-by: Sam Ravnborg
Re: [PATCH v3 3/4] arm64: dts: qcom: msm8976: Add Adreno GPU
On 4/13/24 19:03, Adam Skladowski wrote: Add Adreno GPU node. Signed-off-by: Adam Skladowski --- [...] + opp-2 { + opp-hz = /bits/ 64 <2>; + required-opps = <_opp_low_svs>; + opp-supported-hw = <0xff>; Doesn't look like this platform had any speed binning going on, can drop? Konrad
Re: [PATCH v3 2/4] arm64: dts: qcom: msm8976: Add MDSS nodes
On 4/13/24 19:03, Adam Skladowski wrote: Add MDSS nodes to support displays on the MSM8976 SoC. Signed-off-by: Adam Skladowski --- [...] + + mdp_opp_table: opp-table { + compatible = "operating-points-v2"; + + opp-17778 { + opp-hz = /bits/ 64 <17778>; + required-opps = <_opp_svs>; + }; + + opp-27000 { + opp-hz = /bits/ 64 <27000>; + required-opps = <_opp_svs_plus>; + }; + + opp-32000 { + opp-hz = /bits/ 64 <32000>; + required-opps = <_opp_nom>; + }; + opp-36000 { Missing a newline above [...] + dsi0_opp_table: opp-table { + compatible = "operating-points-v2"; + + opp-12500 { + opp-hz = /bits/ 64 <12500>; + required-opps = <_opp_svs>; + + }; You can borrow it from here. Looks reasonable otherwise. Konrad
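For what it's worth, the two whitespace nits combined — a blank line separating sibling opp nodes, and no stray blank line inside a node — would leave the tables looking roughly like this. This is a sketch only; the rpmpd phandle labels were stripped by the list archive, so the ones below are placeholders:

```dts
mdp_opp_table: opp-table {
	compatible = "operating-points-v2";

	opp-32000 {
		opp-hz = /bits/ 64 <32000>;
		required-opps = <&rpmpd_opp_nom>;	/* placeholder label */
	};

	opp-36000 {	/* blank line above the node, none inside it */
		opp-hz = /bits/ 64 <36000>;
		required-opps = <&rpmpd_opp_nom>;	/* placeholder label */
	};
};
```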
[PATCH v2 1/2] tracing/user_events: Fix non-spaced field matching
When the ABI was updated to prevent same name w/different args, it missed an important corner case when fields don't end with a space. Typically, space is used for fields to help separate them, like "u8 field1; u8 field2". If no spaces are used, like "u8 field1;u8 field2", then the parsing works for the first time. However, the match check fails on a subsequent register, leading to confusion. This is because the match check uses argv_split() and assumes that all fields will be split upon the space. When spaces are used, we get back { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }. This causes a mismatch, and the user program gets back -EADDRINUSE. Add a method to detect this case before calling argv_split(). If found force a space after the field separator character ';'. This ensures all cases work properly for matching. With this fix, the following are all treated as matching: u8 field1;u8 field2 u8 field1; u8 field2 u8 field1;\tu8 field2 u8 field1;\nu8 field2 Fixes: ba470eebc2f6 ("tracing/user_events: Prevent same name but different args event") Signed-off-by: Beau Belgrave --- kernel/trace/trace_events_user.c | 76 +++- 1 file changed, 75 insertions(+), 1 deletion(-) diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c index 70d428c394b6..82b191f33a28 100644 --- a/kernel/trace/trace_events_user.c +++ b/kernel/trace/trace_events_user.c @@ -1989,6 +1989,80 @@ static int user_event_set_tp_name(struct user_event *user) return 0; } +/* + * Counts how many ';' without a trailing space are in the args. + */ +static int count_semis_no_space(char *args) +{ + int count = 0; + + while ((args = strchr(args, ';'))) { + args++; + + if (!isspace(*args)) + count++; + } + + return count; +} + +/* + * Copies the arguments while ensuring all ';' have a trailing space. 
+ */ +static char *insert_space_after_semis(char *args, int count) +{ + char *fixed, *pos; + int len; + + len = strlen(args) + count; + fixed = kmalloc(len + 1, GFP_KERNEL); + + if (!fixed) + return NULL; + + pos = fixed; + + /* Insert a space after ';' if there is no trailing space. */ + while (*args) { + *pos = *args++; + + if (*pos++ == ';' && !isspace(*args)) + *pos++ = ' '; + } + + *pos = '\0'; + + return fixed; +} + +static char **user_event_argv_split(char *args, int *argc) +{ + char **split; + char *fixed; + int count; + + /* Count how many ';' without a trailing space */ + count = count_semis_no_space(args); + + /* No fixup is required */ + if (!count) + return argv_split(GFP_KERNEL, args, argc); + + /* We must fixup 'field;field' to 'field; field' */ + fixed = insert_space_after_semis(args, count); + + if (!fixed) + return NULL; + + /* We do a normal split afterwards */ + split = argv_split(GFP_KERNEL, fixed, argc); + + /* We can free since argv_split makes a copy */ + kfree(fixed); + + return split; +} + /* * Parses the event name, arguments and flags then registers if successful. * The name buffer lifetime is owned by this method for success cases only. @@ -2012,7 +2086,7 @@ static int user_event_parse(struct user_event_group *group, char *name, return -EPERM; if (args) { - argv = argv_split(GFP_KERNEL, args, ); + argv = user_event_argv_split(args, ); if (!argv) return -ENOMEM; -- 2.34.1
[PATCH v2 2/2] selftests/user_events: Add non-spacing separator check
The ABI documentation indicates that field separators do not need a space between them, only a ';'. When no spacing is used, the register must work. Any subsequent register, with or without spaces, must match and not return -EADDRINUSE. Add a non-spacing separator case to our self-test register case to ensure it works going forward. Signed-off-by: Beau Belgrave --- tools/testing/selftests/user_events/ftrace_test.c | 8 1 file changed, 8 insertions(+) diff --git a/tools/testing/selftests/user_events/ftrace_test.c b/tools/testing/selftests/user_events/ftrace_test.c index dcd7509fe2e0..0bb46793dcd4 100644 --- a/tools/testing/selftests/user_events/ftrace_test.c +++ b/tools/testing/selftests/user_events/ftrace_test.c @@ -261,6 +261,12 @@ TEST_F(user, register_events) { ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSREG, )); ASSERT_EQ(0, reg.write_index); + /* Register without separator spacing should still match */ + reg.enable_bit = 29; + reg.name_args = (__u64)"__test_event u32 field1;u32 field2"; + ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSREG, )); + ASSERT_EQ(0, reg.write_index); + /* Multiple registers to same name but different args should fail */ reg.enable_bit = 29; reg.name_args = (__u64)"__test_event u32 field1;"; @@ -288,6 +294,8 @@ TEST_F(user, register_events) { ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSUNREG, )); unreg.disable_bit = 30; ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSUNREG, )); + unreg.disable_bit = 29; + ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSUNREG, )); /* Delete should have been auto-done after close and unregister */ close(self->data_fd); -- 2.34.1
[PATCH v2 0/2] tracing/user_events: Fix non-spaced field matching
When the ABI was updated to prevent same name w/different args, it missed an important corner case when fields don't end with a space. Typically, space is used for fields to help separate them, like "u8 field1; u8 field2". If no spaces are used, like "u8 field1;u8 field2", then the parsing works for the first time. However, the match check fails on a subsequent register, leading to confusion. This is because the match check uses argv_split() and assumes that all fields will be split upon the space. When spaces are used, we get back { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }. This causes a mismatch, and the user program gets back -EADDRINUSE. Add a method to detect this case before calling argv_split(). If found force a space after the field separator character ';'. This ensures all cases work properly for matching. I could not find an existing function to accomplish this, so I had to hand code a copy with this logic. If there is a better way to achieve this, I'm all ears. This series also adds a selftest to ensure this doesn't break again. With this fix, the following are all treated as matching: u8 field1;u8 field2 u8 field1; u8 field2 u8 field1;\tu8 field2 u8 field1;\nu8 field2 V2 changes: Renamed fix_semis_no_space() to insert_space_after_semis(). Have user_event_argv_split() return fast in no-split case. Pulled in Masami's shorter loop in insert_space_after_semis(). Beau Belgrave (2): tracing/user_events: Fix non-spaced field matching selftests/user_events: Add non-spacing separator check kernel/trace/trace_events_user.c | 76 ++- .../selftests/user_events/ftrace_test.c | 8 ++ 2 files changed, 83 insertions(+), 1 deletion(-) base-commit: 0bbac3facb5d6cc0171c45c9873a2dc96bea9680 -- 2.34.1
Re: [PATCH] drivers: remoteproc: xlnx: Add Versal and Versal-NET support
On 4/23/24 11:06 AM, Nathan Chancellor wrote: > Hi Tanmay, > > On Thu, Apr 18, 2024 at 03:01:25PM -0700, Tanmay Shah wrote: >> AMD-Xilinx Versal platform is successor of ZynqMP platform. >> Real-time Processing Unit R5 cluster IP on Versal is same as >> of ZynqMP Platform. Power-domains ids for Versal platform is >> different than ZynqMP. >> >> AMD-Xilinx Versal-NET platform is successor of Versal platform. >> Versal-NET Real-Time Processing Unit has two clusters and each >> cluster contains dual core ARM Cortex-R52 processors. Each R52 >> core is assigned 128KB of TCM memory. >> >> Signed-off-by: Tanmay Shah >> --- >> drivers/remoteproc/xlnx_r5_remoteproc.c | 53 - >> 1 file changed, 17 insertions(+), 36 deletions(-) >> >> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c >> b/drivers/remoteproc/xlnx_r5_remoteproc.c >> index 7b1c12108bff..a6d8ac7394e7 100644 >> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c >> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c >> @@ -300,36 +300,6 @@ static void zynqmp_r5_rproc_kick(struct rproc *rproc, >> int vqid) >> dev_warn(dev, "failed to send message\n"); >> } >> >> -/* >> - * zynqmp_r5_set_mode() >> - * >> - * set RPU cluster and TCM operation mode >> - * >> - * @r5_core: pointer to zynqmp_r5_core type object >> - * @fw_reg_val: value expected by firmware to configure RPU cluster mode >> - * @tcm_mode: value expected by fw to configure TCM mode (lockstep or split) >> - * >> - * Return: 0 for success and < 0 for failure >> - */ >> -static int zynqmp_r5_set_mode(struct zynqmp_r5_core *r5_core, >> - enum rpu_oper_mode fw_reg_val, >> - enum rpu_tcm_comb tcm_mode) >> -{ >> -int ret; >> - >> -ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val); >> -if (ret < 0) { >> -dev_err(r5_core->dev, "failed to set RPU mode\n"); >> -return ret; >> -} >> - >> -ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode); >> -if (ret < 0) >> -dev_err(r5_core->dev, "failed to configure TCM\n"); >> - >> -return ret; >> -} >> - >> /* 
>> * zynqmp_r5_rproc_start() >> * @rproc: single R5 core's corresponding rproc instance >> @@ -941,7 +911,7 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster >> *cluster, >> /* Maintain backward compatibility for zynqmp by using hardcode TCM >> address. */ >> if (of_find_property(r5_core->np, "reg", NULL)) >> ret = zynqmp_r5_get_tcm_node_from_dt(cluster); >> -else >> +else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) >> ret = zynqmp_r5_get_tcm_node(cluster); > > This change breaks the build with clang: > > drivers/remoteproc/xlnx_r5_remoteproc.c:914:11: error: variable 'ret' is > used uninitialized whenever 'if' condition is false > [-Werror,-Wsometimes-uninitialized] > 914 | else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) > | ^~ > drivers/remoteproc/xlnx_r5_remoteproc.c:917:6: note: uninitialized use > occurs here > 917 | if (ret) { > | ^~~ > drivers/remoteproc/xlnx_r5_remoteproc.c:914:7: note: remove the 'if' if its > condition is always true > 914 | else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) > | ^~~ > 915 | ret = zynqmp_r5_get_tcm_node(cluster); > drivers/remoteproc/xlnx_r5_remoteproc.c:907:9: note: initialize the > variable 'ret' to silence this warning > 907 | int ret, i; > |^ > | = 0 > 1 error generated. > > Should ret be initialized to zero or should there be an else statement? Hello, Thanks for analysis. ret should be initialized with -EINVAL, so else can be avoided. I will send patch by EOD. 
Thanks, Tanmay > > Cheers, > Nathan > >> >> if (ret) { >> @@ -960,12 +930,21 @@ static int zynqmp_r5_core_init(struct >> zynqmp_r5_cluster *cluster, >> return ret; >> } >> >> -ret = zynqmp_r5_set_mode(r5_core, fw_reg_val, tcm_mode); >> -if (ret) { >> -dev_err(dev, "failed to set r5 cluster mode %d, err >> %d\n", >> -cluster->mode, ret); >> +ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val); >> +if (ret < 0) { >> +dev_err(r5_core->dev, "failed to set RPU mode\n"); >> return ret; >> } >> + >> +if (of_find_property(dev_of_node(dev), "xlnx,tcm-mode", NULL) || >> +device_is_compatible(dev, "xlnx,zynqmp-r5fss")) { >> +ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, >> + tcm_mode); >> +
Re: [PATCH] drivers: remoteproc: xlnx: Add Versal and Versal-NET support
Hi Tanmay, On Thu, Apr 18, 2024 at 03:01:25PM -0700, Tanmay Shah wrote: > AMD-Xilinx Versal platform is successor of ZynqMP platform. > Real-time Processing Unit R5 cluster IP on Versal is same as > of ZynqMP Platform. Power-domains ids for Versal platform is > different than ZynqMP. > > AMD-Xilinx Versal-NET platform is successor of Versal platform. > Versal-NET Real-Time Processing Unit has two clusters and each > cluster contains dual core ARM Cortex-R52 processors. Each R52 > core is assigned 128KB of TCM memory. > > Signed-off-by: Tanmay Shah > --- > drivers/remoteproc/xlnx_r5_remoteproc.c | 53 - > 1 file changed, 17 insertions(+), 36 deletions(-) > > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c > b/drivers/remoteproc/xlnx_r5_remoteproc.c > index 7b1c12108bff..a6d8ac7394e7 100644 > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c > @@ -300,36 +300,6 @@ static void zynqmp_r5_rproc_kick(struct rproc *rproc, > int vqid) > dev_warn(dev, "failed to send message\n"); > } > > -/* > - * zynqmp_r5_set_mode() > - * > - * set RPU cluster and TCM operation mode > - * > - * @r5_core: pointer to zynqmp_r5_core type object > - * @fw_reg_val: value expected by firmware to configure RPU cluster mode > - * @tcm_mode: value expected by fw to configure TCM mode (lockstep or split) > - * > - * Return: 0 for success and < 0 for failure > - */ > -static int zynqmp_r5_set_mode(struct zynqmp_r5_core *r5_core, > - enum rpu_oper_mode fw_reg_val, > - enum rpu_tcm_comb tcm_mode) > -{ > - int ret; > - > - ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val); > - if (ret < 0) { > - dev_err(r5_core->dev, "failed to set RPU mode\n"); > - return ret; > - } > - > - ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode); > - if (ret < 0) > - dev_err(r5_core->dev, "failed to configure TCM\n"); > - > - return ret; > -} > - > /* > * zynqmp_r5_rproc_start() > * @rproc: single R5 core's corresponding rproc instance > @@ -941,7 
+911,7 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster > *cluster, > /* Maintain backward compatibility for zynqmp by using hardcode TCM > address. */ > if (of_find_property(r5_core->np, "reg", NULL)) > ret = zynqmp_r5_get_tcm_node_from_dt(cluster); > - else > + else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) > ret = zynqmp_r5_get_tcm_node(cluster); This change breaks the build with clang: drivers/remoteproc/xlnx_r5_remoteproc.c:914:11: error: variable 'ret' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] 914 | else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) | ^~ drivers/remoteproc/xlnx_r5_remoteproc.c:917:6: note: uninitialized use occurs here 917 | if (ret) { | ^~~ drivers/remoteproc/xlnx_r5_remoteproc.c:914:7: note: remove the 'if' if its condition is always true 914 | else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) | ^~~ 915 | ret = zynqmp_r5_get_tcm_node(cluster); drivers/remoteproc/xlnx_r5_remoteproc.c:907:9: note: initialize the variable 'ret' to silence this warning 907 | int ret, i; |^ | = 0 1 error generated. Should ret be initialized to zero or should there be an else statement? 
Cheers, Nathan > > if (ret) { > @@ -960,12 +930,21 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster > *cluster, > return ret; > } > > - ret = zynqmp_r5_set_mode(r5_core, fw_reg_val, tcm_mode); > - if (ret) { > - dev_err(dev, "failed to set r5 cluster mode %d, err > %d\n", > - cluster->mode, ret); > + ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val); > + if (ret < 0) { > + dev_err(r5_core->dev, "failed to set RPU mode\n"); > return ret; > } > + > + if (of_find_property(dev_of_node(dev), "xlnx,tcm-mode", NULL) || > + device_is_compatible(dev, "xlnx,zynqmp-r5fss")) { > + ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, > +tcm_mode); > + if (ret < 0) { > + dev_err(r5_core->dev, "failed to configure > TCM\n"); > + return ret; > + } > + } > } > > return 0; > @@ -1022,7 +1001,7 @@ static int zynqmp_r5_cluster_init(struct >
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
* Steven Rostedt [240410 13:41]: > On Sat, 6 Apr 2024 18:36:46 +0100 > Vincent Donnefort wrote: > > > +int ring_buffer_map(struct trace_buffer *buffer, int cpu, > > + struct vm_area_struct *vma) > > +{ > > + struct ring_buffer_per_cpu *cpu_buffer; > > + unsigned long flags, *subbuf_ids; > > + int err = 0; > > + > > + if (!cpumask_test_cpu(cpu, buffer->cpumask)) > > + return -EINVAL; > > + > > + cpu_buffer = buffer->buffers[cpu]; > > + > > + mutex_lock(_buffer->mapping_lock); > > + > > + if (cpu_buffer->mapped) { > > + err = __rb_map_vma(cpu_buffer, vma); > > + if (!err) > > + err = __rb_inc_dec_mapped(cpu_buffer, true); > > + mutex_unlock(_buffer->mapping_lock); > > + return err; > > + } > > + > > + /* prevent another thread from changing buffer/sub-buffer sizes */ > > + mutex_lock(>mutex); > > + > > + err = rb_alloc_meta_page(cpu_buffer); > > + if (err) > > + goto unlock; > > + > > + /* subbuf_ids include the reader while nr_pages does not */ > > + subbuf_ids = kcalloc(cpu_buffer->nr_pages + 1, sizeof(*subbuf_ids), > > GFP_KERNEL); > > + if (!subbuf_ids) { > > + rb_free_meta_page(cpu_buffer); > > + err = -ENOMEM; > > + goto unlock; > > + } > > + > > + atomic_inc(_buffer->resize_disabled); > > + > > + /* > > +* Lock all readers to block any subbuf swap until the subbuf IDs are > > +* assigned. > > +*/ > > + raw_spin_lock_irqsave(_buffer->reader_lock, flags); > > + rb_setup_ids_meta_page(cpu_buffer, subbuf_ids); > > + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); > > + > > + err = __rb_map_vma(cpu_buffer, vma); > > + if (!err) { > > + raw_spin_lock_irqsave(_buffer->reader_lock, flags); > > + cpu_buffer->mapped = 1; > > + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); > > + } else { > > + kfree(cpu_buffer->subbuf_ids); > > + cpu_buffer->subbuf_ids = NULL; > > + rb_free_meta_page(cpu_buffer); > > + } > > +unlock: > > Nit: For all labels, please add a space before them. 
Otherwise, diffs will show "unlock" as the function and not "ring_buffer_map", making it harder to find where the change is. > Isn't the inclusion of a space before labels outdated, since 'git diff' supersedes 'diff' and fixes this issue? Thanks, Liam
Re: [PATCH v12 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup
On Wed, 17 Apr 2024 18:51:28 -0500, Huang, Kai wrote: On 16/04/2024 3:20 pm, Haitao Huang wrote: From: Kristen Carlson Accardi Currently in the EPC page allocation, the kernel simply fails the allocation when the current EPC cgroup fails to charge due to its usage reaching limit. This is not ideal. When that happens, a better way is to reclaim EPC page(s) from the current EPC cgroup (and/or its descendants) to reduce its usage so the new allocation can succeed. Add the basic building blocks to support per-cgroup reclamation. Currently the kernel only has one place to reclaim EPC pages: the global EPC LRU list. To support the "per-cgroup" EPC reclaim, maintain an LRU list for each EPC cgroup, and introduce a "cgroup" variant function to reclaim EPC pages from a given EPC cgroup and its descendants. Currently the kernel does the global EPC reclaim in sgx_reclaim_page(). It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16) pages. Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN pages from the global LRU, and then tries to reclaim these pages at once for better performance. Implement the "cgroup" variant EPC reclaim in a similar way, but keep the implementation simple: 1) change sgx_reclaim_pages() to take an LRU as input, and return the pages that are "scanned" and attempted for reclamation (but not necessarily reclaimed successfully); 2) loop the given EPC cgroup and its descendants and do the new sgx_reclaim_pages() until SGX_NR_TO_SCAN pages are "scanned". This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC cgroup, and only moves to its descendants when there's no enough reclaimable EPC pages to "scan" in its LRU. It should be enough for most cases. 
Note, this simple implementation doesn't _exactly_ mimic the current global EPC reclaim (which always tries to do the actual reclaim in batch of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN reclaimable pages, the actual reclaim of EPC pages will be split into smaller batches _across_ multiple LRUs with each being smaller than SGX_NR_TO_SCAN pages. A more precise way to mimic the current global EPC reclaim would be to have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages _across_ the given EPC cgroup _AND_ its descendants, and then do the actual reclaim in one batch. But this is unnecessarily complicated at this stage. Alternatively, the current sgx_reclaim_pages() could be changed to return the actual "reclaimed" pages, but not "scanned" pages. However, the reclamation is a lengthy process, forcing a successful reclamation of predetermined number of pages may block the caller for too long. And that may not be acceptable in some synchronous contexts, e.g., in serving an ioctl(). With this building block in place, add synchronous reclamation support in sgx_cgroup_try_charge(): trigger a call to sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the caller allows synchronous reclaim as indicated by s newly added parameter. A later patch will add support for asynchronous reclamation reusing sgx_cgroup_reclaim_pages(). Note all reclaimable EPC pages are still tracked in the global LRU thus no per-cgroup reclamation is actually active at the moment. Per-cgroup tracking and reclamation will be turned on in the end after all necessary infrastructure is in place. Nit: "all necessary infrastructures are in place", or, "all necessary building blocks are in place". ? Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Tested-by: Jarkko Sakkinen --- Reviewed-by: Kai Huang Thanks More nitpickings below: [...] 
-static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg) +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim) Let's still wrap the text on an 80-character basis. I guess most people are more used to that. [...] - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable, - struct sgx_epc_page, list); + epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list); Ditto. Actually I changed to 100-char width based on comments from Jarkko IIRC. I don't have a personal preference, but will not change back to 80 unless Jarkko also agrees. Thanks Haitao
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Tue, 23 Apr 2024 09:19:53 -0500, Huang, Kai wrote: On Tue, 2024-04-23 at 08:08 -0500, Haitao Huang wrote: On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai wrote: > On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote: > > On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai > > wrote: > > > > > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote: > > > > > > I think we can add support for "sgx_cgroup=disabled" in future > > if > > > > indeed > > > > > > needed. But just for init failure, no? > > > > > > > > > > > > > > > > It's not about the commandline, which we can add in the future > > when > > > > > needed. It's about we need to have a way to handle SGX cgroup > > being > > > > > disabled at boot time nicely, because we already have a case > > where we > > > > > need > > > > > to do so. > > > > > > > > > > Your approach looks half-way to me, and is not future > > extendible. If > > > > we > > > > > choose to do it, do it right -- that is, we need a way to disable > > it > > > > > completely in both kernel and userspace so that userspace won't be > > > > able> to > > > > > see it. > > > > > > > > That would need more changes in misc cgroup implementation to > > support > > > > sgx-disable. Right now misc does not have separate files for > > different > > > > resource types. So we can only block echo "sgx_epc..." to those > > > > interfacefiles, can't really make files not visible. > > > > > > "won't be able to see" I mean "only for SGX EPC resource", but not the > > > control files for the entire MISC cgroup. > > > > > > I replied at the beginning of the previous reply: > > > > > > " > > > Given SGX EPC is just one type of MISC cgroup resources, we cannot > > just > > > disable MISC cgroup as a whole. > > > " > > > > > Sorry I missed this point. below. > > > > > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC. > > See > > > the comment of @misc_res_capacity: > > > > > > * Miscellaneous resources capacity for the entire machine. 
0 capacity > > > * means resource is not initialized or not present in the host. > > > > > > > IIUC I don't think the situation we have is either of those cases. For > > our > > case, resource is inited and present on the host but we have allocation > > error for sgx cgroup infra. > > You have calculated the "capacity", but later you failed something and > then reset the "capacity" to 0, i.e., cleanup. What's wrong with that? > > > > > > And "blocking echo sgx_epc ... to those control files" is already > > > sufficient for the purpose of not exposing SGX EPC to userspace, > > correct? > > > > > > E.g., if SGX cgroup is enabled, you can see below when you read "max": > > > > > > # cat /sys/fs/cgroup/my_group/misc.max > > > # > > >sgx_epc ... > > >... > > > > > > Otherwise you won't be able to see "sgx_epc": > > > > > > # cat /sys/fs/cgroup/my_group/misc.max > > > # > > >... > > > > > > And when you try to write the "max" for "sgx_epc", you will hit error: > > > > > > # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max > > > # ... echo: write error: Invalid argument > > > > > > The above applies to all the control files. To me this is pretty much > > > means "SGX EPC is disabled" or "not supported" for userspace. > > > > > You are right, capacity == 0 does block echoing max and users see an > > error > > if they do that. But 1) doubt you literately wanted "SGX EPC is > > disabled" > > and make it unsupported in this case, > > I don't understand. Something failed during SGX cgroup initialization, > you _literally_ cannot continue to support it. > > Then we should just return -ENOMEM from sgx_init() when sgx cgroup initialization fails? I thought we only disable SGX cgroup support. SGX can still run. I am not sure how you got this conclusion. I specifically said something failed during SGX "cgroup" initialization, so only SGX "cgroup" needs to be disabled, not SGX as a whole. 
> > 2) even if we accept this is "sgx > > cgroup disabled" I don't see how it is much better user experience than > > current solution or really helps user better. > > In your way, the userspace is still able to see "sgx_epc" in control > files > and is able to update them. So from userspace's perspective SGX cgroup > is > enabled, but obviously updating to "max" doesn't have any impact. This > will confuse userspace. > > > Setting capacity to zero also confuses user space. Some application may rely on this file to know the capacity. Why?? Are you saying before this SGX cgroup patchset those applications cannot run? > > Also to implement this approach, as you mentioned, we need workaround > > the > > fact that misc_try_charge() fails when capacity set to zero, and adding > > code to return root always? > > Why this is a problem? > It changes/overrides the the original meaning of capacity==0: no one can allocate if capacity is zero. Why?? Are you saying before this series, no
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 22.04.24 22:31, Vincent Donnefort wrote: On Mon, Apr 22, 2024 at 08:27:17PM +0200, David Hildenbrand wrote: On 22.04.24 20:20, Vincent Donnefort wrote: Hi David, Thanks for having a look, very much appreciated! On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote: On 19.04.24 20:25, David Hildenbrand wrote: On 06.04.24 19:36, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. 
CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size of this meta-page. + * @meta_struct_len: Size of this structure. + * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. 
+ */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..793ecc454039 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Tue, 2024-04-23 at 08:08 -0500, Haitao Huang wrote: > On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai wrote: > > > On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote: > > > On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai > > > wrote: > > > > > > > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote: > > > > > > > I think we can add support for "sgx_cgroup=disabled" in future > > > if > > > > > indeed > > > > > > > needed. But just for init failure, no? > > > > > > > > > > > > > > > > > > > It's not about the commandline, which we can add in the future > > > when > > > > > > needed. It's about we need to have a way to handle SGX cgroup > > > being > > > > > > disabled at boot time nicely, because we already have a case > > > where we > > > > > > need > > > > > > to do so. > > > > > > > > > > > > Your approach looks half-way to me, and is not future > > > extendible. If > > > > > we > > > > > > choose to do it, do it right -- that is, we need a way to disable > > > it > > > > > > completely in both kernel and userspace so that userspace won't be > > > > > able> to > > > > > > see it. > > > > > > > > > > That would need more changes in misc cgroup implementation to > > > support > > > > > sgx-disable. Right now misc does not have separate files for > > > different > > > > > resource types. So we can only block echo "sgx_epc..." to those > > > > > interfacefiles, can't really make files not visible. > > > > > > > > "won't be able to see" I mean "only for SGX EPC resource", but not the > > > > control files for the entire MISC cgroup. > > > > > > > > I replied at the beginning of the previous reply: > > > > > > > > " > > > > Given SGX EPC is just one type of MISC cgroup resources, we cannot > > > just > > > > disable MISC cgroup as a whole. > > > > " > > > > > > > Sorry I missed this point. below. > > > > > > > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC. 
> > > See > > > > the comment of @misc_res_capacity: > > > > > > > > * Miscellaneous resources capacity for the entire machine. 0 capacity > > > > * means resource is not initialized or not present in the host. > > > > > > > > > > IIUC I don't think the situation we have is either of those cases. For > > > our > > > case, resource is inited and present on the host but we have allocation > > > error for sgx cgroup infra. > > > > You have calculated the "capacity", but later you failed something and > > then reset the "capacity" to 0, i.e., cleanup. What's wrong with that? > > > > > > > > > And "blocking echo sgx_epc ... to those control files" is already > > > > sufficient for the purpose of not exposing SGX EPC to userspace, > > > correct? > > > > > > > > E.g., if SGX cgroup is enabled, you can see below when you read "max": > > > > > > > > # cat /sys/fs/cgroup/my_group/misc.max > > > > # > > > >sgx_epc ... > > > >... > > > > > > > > Otherwise you won't be able to see "sgx_epc": > > > > > > > > # cat /sys/fs/cgroup/my_group/misc.max > > > > # > > > >... > > > > > > > > And when you try to write the "max" for "sgx_epc", you will hit error: > > > > > > > > # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max > > > > # ... echo: write error: Invalid argument > > > > > > > > The above applies to all the control files. To me this is pretty much > > > > means "SGX EPC is disabled" or "not supported" for userspace. > > > > > > > You are right, capacity == 0 does block echoing max and users see an > > > error > > > if they do that. But 1) doubt you literately wanted "SGX EPC is > > > disabled" > > > and make it unsupported in this case, > > > > I don't understand. Something failed during SGX cgroup initialization, > > you _literally_ cannot continue to support it. > > > > > > Then we should just return -ENOMEM from sgx_init() when sgx cgroup > initialization fails? > I thought we only disable SGX cgroup support. SGX can still run. 
I am not sure how you got this conclusion. I specifically said something failed during SGX "cgroup" initialization, so only SGX "cgroup" needs to be disabled, not SGX as a whole. > > > > 2) even if we accept this is "sgx > > > cgroup disabled" I don't see how it is much better user experience than > > > current solution or really helps user better. > > > > In your way, the userspace is still able to see "sgx_epc" in control > > files > > and is able to update them. So from userspace's perspective SGX cgroup > > is > > enabled, but obviously updating to "max" doesn't have any impact. This > > will confuse userspace. > > > > > > > Setting capacity to zero also confuses user space. Some application may > rely on this file to know the capacity. Why?? Are you saying before this SGX cgroup patchset those applications cannot run? > > > > Also to implement this approach, as you mentioned, we need workaround > > > the > > > fact that misc_try_charge() fails when capacity set to zero, and adding > > >
Re: [PATCH v2 0/6] virtiofs: fix the warning for ITER_KVEC dio
On 4/23/2024 4:06 AM, Michael S. Tsirkin wrote: > On Tue, Apr 09, 2024 at 09:48:08AM +0800, Hou Tao wrote: >> Hi, >> >> On 4/8/2024 3:45 PM, Michael S. Tsirkin wrote: >>> On Wed, Feb 28, 2024 at 10:41:20PM +0800, Hou Tao wrote: From: Hou Tao Hi, The patch set aims to fix the warning related to an abnormal size parameter of kmalloc() in virtiofs. The warning occurred when attempting to insert a 10MB sized kernel module kept in a virtiofs with cache disabled. As analyzed in patch #1, the root cause is that the length of the read buffer is no limited, and the read buffer is passed directly to virtiofs through out_args[0].value. Therefore patch #1 limits the length of the read buffer passed to virtiofs by using max_pages. However it is not enough, because now the maximal value of max_pages is 256. Consequently, when reading a 10MB-sized kernel module, the length of the bounce buffer in virtiofs will be 40 + (256 * 4096), and kmalloc will try to allocate 2MB from memory subsystem. The request for 2MB of physically contiguous memory significantly stress the memory subsystem and may fail indefinitely on hosts with fragmented memory. To address this, patch #2~#5 use scattered pages in a bio_vec to replace the kmalloc-allocated bounce buffer when the length of the bounce buffer for KVEC_ITER dio is larger than PAGE_SIZE. The final issue with the allocation of the bounce buffer and sg array in virtiofs is that GFP_ATOMIC is used even when the allocation occurs in a kworker context. Therefore the last patch uses GFP_NOFS for the allocation of both sg array and bounce buffer when initiated by the kworker. For more details, please check the individual patches. As usual, comments are always welcome. Change Log: >>> Bernd should I just merge the patchset as is? >>> It seems to fix a real problem and no one has the >>> time to work on a better fix WDYT? >> Sorry for the long delay. I am just start to prepare for v3. 
In v3, I >> plan to avoid the unnecessary memory copy between fuse args and bio_vec. >> Will post it before next week. > Didn't happen before this week apparently. Sorry for failing to make it this week. Being busy these two weeks. Hope to send v3 out before the end of April. > >>> v2: * limit the length of ITER_KVEC dio by max_pages instead of the newly-introduced max_nopage_rw. Using max_pages make the ITER_KVEC dio being consistent with other rw operations. * replace kmalloc-allocated bounce buffer by using a bounce buffer backed by scattered pages when the length of the bounce buffer for KVEC_ITER dio is larger than PAG_SIZE, so even on hosts with fragmented memory, the KVEC_ITER dio can be handled normally by virtiofs. (Bernd Schubert) * merge the GFP_NOFS patch [1] into this patch-set and use memalloc_nofs_{save|restore}+GFP_KERNEL instead of GFP_NOFS (Benjamin Coddington) v1: https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/ [1]: https://lore.kernel.org/linux-fsdevel/20240105105305.4052672-1-hou...@huaweicloud.com/ Hou Tao (6): fuse: limit the length of ITER_KVEC dio by max_pages virtiofs: move alloc/free of argbuf into separated helpers virtiofs: factor out more common methods for argbuf virtiofs: support bounce buffer backed by scattered pages virtiofs: use scattered bounce buffer for ITER_KVEC dio virtiofs: use GFP_NOFS when enqueuing request through kworker fs/fuse/file.c | 12 +- fs/fuse/virtio_fs.c | 336 +--- 2 files changed, 296 insertions(+), 52 deletions(-) -- 2.29.2
Re: [PATCH v2 3/3] arm64: dts: qcom: sm7225-fairphone-fp4: Enable USB role switching
On 3/29/24 13:26, Luca Weiss wrote:
> Configure the Type-C and VBUS regulator on PM7250B and wire it up to
> the USB PHY, so that USB role and orientation switching works.
>
> For now USB Power Delivery properties are skipped / disabled, so that
> the (presumably) bootloader-configured charger doesn't get messed with
> and we can charge the phone with at least some amount of power.
>
> Signed-off-by: Luca Weiss
> ---

Reviewed-by: Konrad Dybcio

nits:

> + usb-role-switch;

this is a common property of the dwc3 host

[...]

> vdda-phy-supply = <_l22a>;
> vdda-pll-supply = <_l16a>;
> + orientation-switch;

and this is a common property of the QMPPHY

Could you move them to the node definition?

Konrad
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai wrote: On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote: On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai wrote: > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote: > > > > I think we can add support for "sgx_cgroup=disabled" in future if > > indeed > > > > needed. But just for init failure, no? > > > > > > > > > > It's not about the commandline, which we can add in the future when > > > needed. It's about we need to have a way to handle SGX cgroup being > > > disabled at boot time nicely, because we already have a case where we > > > need > > > to do so. > > > > > > Your approach looks half-way to me, and is not future extendible. If > > we > > > choose to do it, do it right -- that is, we need a way to disable it > > > completely in both kernel and userspace so that userspace won't be > > able> to > > > see it. > > > > That would need more changes in misc cgroup implementation to support > > sgx-disable. Right now misc does not have separate files for different > > resource types. So we can only block echo "sgx_epc..." to those > > interfacefiles, can't really make files not visible. > > "won't be able to see" I mean "only for SGX EPC resource", but not the > control files for the entire MISC cgroup. > > I replied at the beginning of the previous reply: > > " > Given SGX EPC is just one type of MISC cgroup resources, we cannot just > disable MISC cgroup as a whole. > " > Sorry I missed this point. below. > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC. See > the comment of @misc_res_capacity: > > * Miscellaneous resources capacity for the entire machine. 0 capacity > * means resource is not initialized or not present in the host. > IIUC I don't think the situation we have is either of those cases. For our case, resource is inited and present on the host but we have allocation error for sgx cgroup infra. 
You have calculated the "capacity", but later you failed something and then reset the "capacity" to 0, i.e., cleanup. What's wrong with that? > And "blocking echo sgx_epc ... to those control files" is already > sufficient for the purpose of not exposing SGX EPC to userspace, correct? > > E.g., if SGX cgroup is enabled, you can see below when you read "max": > > # cat /sys/fs/cgroup/my_group/misc.max > # >sgx_epc ... >... > > Otherwise you won't be able to see "sgx_epc": > > # cat /sys/fs/cgroup/my_group/misc.max > # >... > > And when you try to write the "max" for "sgx_epc", you will hit error: > > # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max > # ... echo: write error: Invalid argument > > The above applies to all the control files. To me this is pretty much > means "SGX EPC is disabled" or "not supported" for userspace. > You are right, capacity == 0 does block echoing max and users see an error if they do that. But 1) doubt you literately wanted "SGX EPC is disabled" and make it unsupported in this case, I don't understand. Something failed during SGX cgroup initialization, you _literally_ cannot continue to support it. Then we should just return -ENOMEM from sgx_init() when sgx cgroup initialization fails? I thought we only disable SGX cgroup support. SGX can still run. 2) even if we accept this is "sgx cgroup disabled" I don't see how it is much better user experience than current solution or really helps user better. In your way, the userspace is still able to see "sgx_epc" in control files and is able to update them. So from userspace's perspective SGX cgroup is enabled, but obviously updating to "max" doesn't have any impact. This will confuse userspace. Setting capacity to zero also confuses user space. Some application may rely on this file to know the capacity. Also to implement this approach, as you mentioned, we need workaround the fact that misc_try_charge() fails when capacity set to zero, and adding code to return root always? 
Why this is a problem?

It changes/overrides the original meaning of capacity==0: no one can allocate if capacity is zero. So it seems like more workaround code to just make it work for a failing case no one really care much and end result is not really much better IMHO.

It's not workaround, it's the right thing to do. The result is userspace will see it being disabled when kernel disables it.

It's a workaround because you use the capacity==0 but it does not really mean to disable the misc cgroup for specific resource IIUC. There is explicit way for user to disable misc without setting capacity to zero. So in future if we want really disable sgx_epc cgroup specifically we should not use capacity. Therefore your approach is not extensible/reusable.

Given this is a rare corner case caused by configuration, we can only do as much as possible IMHO, not trying to implement a perfect solution at the moment. Maybe BUG_ON() is more
Re: [PATCH net-next v7 1/7] net: introduce rstreason to detect why the RST is sent
On Tue, Apr 23, 2024 at 7:57 PM Simon Horman wrote: > > On Tue, Apr 23, 2024 at 10:17:31AM +0800, Jason Xing wrote: > > On Tue, Apr 23, 2024 at 10:14 AM Jason Xing > > wrote: > > > > > > Hi Simon, > > > > > > On Tue, Apr 23, 2024 at 2:28 AM Simon Horman wrote: > > > > > > > > On Mon, Apr 22, 2024 at 11:01:03AM +0800, Jason Xing wrote: > > > > > > > > ... > > > > > > > > > diff --git a/include/net/rstreason.h b/include/net/rstreason.h > > > > > > > > ... > > > > > > > > > +/** > > > > > + * There are three parts in order: > > > > > + * 1) reset reason in MPTCP: only for MPTCP use > > > > > + * 2) skb drop reason: relying on drop reasons for such as passive > > > > > reset > > > > > + * 3) independent reset reason: such as active reset reasons > > > > > + */ > > > > > > > > Hi Jason, > > > > > > > > A minor nit from my side. > > > > > > > > '/**' denotes the beginning of a Kernel doc, > > > > but other than that, this comment is not a Kernel doc. > > > > > > > > FWIIW, I would suggest providing a proper Kernel doc for enum > > > > sk_rst_reason. > > > > But another option would be to simply make this a normal comment, > > > > starting with "/* There are" > > > > > > Thanks Simon. I'm trying to use the kdoc way to make it right :) > > > > > > How about this one: > > > /** > > > * enum sk_rst_reason - the reasons of socket reset > > > * > > > * The reason of skb drop, which is used in DCCP/TCP/MPTCP protocols. > > > > s/skb drop/sk reset/ > > > > Sorry, I cannot withdraw my previous email in time. > > > > > * > > > * There are three parts in order: > > > * 1) skb drop reasons: relying on drop reasons for such as passive > > > reset > > > * 2) independent reset reasons: such as active reset reasons > > > * 3) reset reasons in MPTCP: only for MPTCP use > > > */ > > > ? > > > > > > I chose to mimic what enum skb_drop_reason does in the > > > include/net/dropreason-core.h file. 
> > > > > > > +enum sk_rst_reason { > > > > + /** > > > > +* Copy from include/uapi/linux/mptcp.h. > > > > +* These reset fields will not be changed since they adhere to > > > > +* RFC 8684. So do not touch them. I'm going to list each > > > > definition > > > > +* of them respectively. > > > > +*/ > > > > > > Thanks to you, I found another similar point where I smell something > > > wrong as in the above code. I'm going to replace '/**' with '/*' since > > > it's only a comment, not a kdoc. > > Likewise, thanks Jason. > > I haven't had time to look at v8 properly, > but I see that kernel-doc is happy with the changed > you have made there as discussed above. > Thank you, Simon. I learned something new about the coding style. Besides, some other nit problems have been spotted by Matt. I will fix them if it's required to send a new version.
Re: [PATCH net-next v7 1/7] net: introduce rstreason to detect why the RST is sent
On Tue, Apr 23, 2024 at 10:17:31AM +0800, Jason Xing wrote: > On Tue, Apr 23, 2024 at 10:14 AM Jason Xing wrote: > > > > Hi Simon, > > > > On Tue, Apr 23, 2024 at 2:28 AM Simon Horman wrote: > > > > > > On Mon, Apr 22, 2024 at 11:01:03AM +0800, Jason Xing wrote: > > > > > > ... > > > > > > > diff --git a/include/net/rstreason.h b/include/net/rstreason.h > > > > > > ... > > > > > > > +/** > > > > + * There are three parts in order: > > > > + * 1) reset reason in MPTCP: only for MPTCP use > > > > + * 2) skb drop reason: relying on drop reasons for such as passive > > > > reset > > > > + * 3) independent reset reason: such as active reset reasons > > > > + */ > > > > > > Hi Jason, > > > > > > A minor nit from my side. > > > > > > '/**' denotes the beginning of a Kernel doc, > > > but other than that, this comment is not a Kernel doc. > > > > > > FWIIW, I would suggest providing a proper Kernel doc for enum > > > sk_rst_reason. > > > But another option would be to simply make this a normal comment, > > > starting with "/* There are" > > > > Thanks Simon. I'm trying to use the kdoc way to make it right :) > > > > How about this one: > > /** > > * enum sk_rst_reason - the reasons of socket reset > > * > > * The reason of skb drop, which is used in DCCP/TCP/MPTCP protocols. > > s/skb drop/sk reset/ > > Sorry, I cannot withdraw my previous email in time. > > > * > > * There are three parts in order: > > * 1) skb drop reasons: relying on drop reasons for such as passive > > reset > > * 2) independent reset reasons: such as active reset reasons > > * 3) reset reasons in MPTCP: only for MPTCP use > > */ > > ? > > > > I chose to mimic what enum skb_drop_reason does in the > > include/net/dropreason-core.h file. > > > > > +enum sk_rst_reason { > > > + /** > > > +* Copy from include/uapi/linux/mptcp.h. > > > +* These reset fields will not be changed since they adhere to > > > +* RFC 8684. So do not touch them. 
I'm going to list each > > > definition > > > +* of them respectively. > > > +*/ > > > > Thanks to you, I found another similar point where I smell something > > wrong as in the above code. I'm going to replace '/**' with '/*' since > > it's only a comment, not a kdoc. Likewise, thanks Jason. I haven't had time to look at v8 properly, but I see that kernel-doc is happy with the changed you have made there as discussed above.
Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
On Tue, 2024-04-23 at 17:25 +0800, 朱伯君(杰铭) wrote: > EDMM's ioctl()s support batch operations, which may be > time-consuming. Try to explicitly give up the CPU at > the every end of "for loop" in > sgx_enclave_{ modify_types | restrict_permissions | remove_pages} > to give other tasks a chance to run, and avoid softlockup warning. > > The following has been observed on Linux v6.9-rc5 with kernel > preemptions disabled(by configuring "PREEMPT_NONE=y"), when kernel > is requested to restrict page permissions of a large number of EPC pages. > > [ cut here ] > watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] > ... > CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 > ... > RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 > Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e > f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 40 > 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef > RSP: 0018:b55a6591fa80 EFLAGS: 0202 > RAX: RBX: b55a6591fac0 RCX: b581e7384000 > RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 > RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 > R10: 006e R11: 0002 R12: 00072052d000 > R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 > FS: 7fe10dbda740() GS:92163e48() > knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 > DR0: DR1: DR2: > DR3: DR6: fffe07f0 DR7: 0400 > PKRU: 5554 > Call Trace: > > ? show_regs+0x67/0x70 > ? watchdog_timer_fn+0x1f3/0x280 > ? __pfx_watchdog_timer_fn+0x10/0x10 > ? __hrtimer_run_queues+0xc8/0x220 > ? hrtimer_interrupt+0x10c/0x250 > ? __sysvec_apic_timer_interrupt+0x53/0x130 > ? sysvec_apic_timer_interrupt+0x7b/0x90 > > > ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > ? sgx_enclave_restrict_permissions+0xba/0x1f0 > ? __pte_offset_map_lock+0x94/0x110 > ? sgx_encl_test_and_clear_young_cb+0x40/0x60 > sgx_ioctl+0x1ab/0x900 > ? do_syscall_64+0x79/0x110 > ? apply_to_page_range+0x14/0x20 > ? 
sgx_encl_test_and_clear_young+0x6c/0x80 > ? sgx_vma_fault+0x132/0x4f0 > __x64_sys_ioctl+0x95/0xd0 > x64_sys_call+0x1209/0x20c0 > do_syscall_64+0x6d/0x110 > ? do_syscall_64+0x79/0x110 > ? do_pte_missing+0x2e8/0xcc0 > ? __pte_offset_map+0x1c/0x190 > ? __handle_mm_fault+0x7b9/0xe60 > ? __count_memcg_events+0x70/0x100 > ? handle_mm_fault+0x256/0x360 > ? do_user_addr_fault+0x3c1/0x860 > ? irqentry_exit_to_user_mode+0x67/0x190 > ? irqentry_exit+0x3b/0x50 > ? exc_page_fault+0x89/0x180 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7fe10e2ee5cb > Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff > ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff > ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 > RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 > RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb > RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 > RBP: 0005 R08: R09: 7fffb2c75594 > R10: 7fffb2c755c8 R11: 0246 R12: c028a405 > R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 > > [ end trace ] Could you trim down the trace to only include the relevant part? E.g., please at least remove the two register dumps at the beginning and end of the trace. Please refer to "Backtraces in commit messages" section in Documentation/process/submitting-patches.rst. 
> > Signed-off-by: Bojun Zhu > --- > arch/x86/kernel/cpu/sgx/ioctl.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c > index b65ab214bdf5..2340a82fa796 100644 > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl, > } > > mutex_unlock(>lock); > + > + if (need_resched()) > + cond_resched(); > } > > ret = 0; > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl > *encl, > entry->type = page_type; > > mutex_unlock(>lock); > + > + if (need_resched()) > + cond_resched(); > } > > ret = 0; > @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl, >
Re: [PATCH net-next v8 5/7] mptcp: support rstreason for passive reset
Hello Matthieu, On Tue, Apr 23, 2024 at 6:02 PM Matthieu Baerts wrote: > > Hi Jason, > > On 23/04/2024 09:21, Jason Xing wrote: > > From: Jason Xing > > > > It relys on what reset options in the skb are as rfc8684 says. Reusing > > (if you have something else to fix, 'checkpatch.pl --codespell' reported > a warning here: s/relys/relies/) Thanks. Will fix it. > > > this logic can save us much energy. This patch replaces most of the prior > > NOT_SPECIFIED reasons. > > > > Signed-off-by: Jason Xing > > --- > > net/mptcp/protocol.h | 28 > > net/mptcp/subflow.c | 22 +- > > 2 files changed, 45 insertions(+), 5 deletions(-) > > > > diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h > > index fdfa843e2d88..bbcb8c068aae 100644 > > --- a/net/mptcp/protocol.h > > +++ b/net/mptcp/protocol.h > > @@ -581,6 +581,34 @@ mptcp_subflow_ctx_reset(struct mptcp_subflow_context > > *subflow) > > WRITE_ONCE(subflow->local_id, -1); > > } > > > > +/* Convert reset reasons in MPTCP to enum sk_rst_reason type */ > > +static inline enum sk_rst_reason > > +sk_rst_convert_mptcp_reason(u32 reason) > > +{ > > + switch (reason) { > > + case MPTCP_RST_EUNSPEC: > > + return SK_RST_REASON_MPTCP_RST_EUNSPEC; > > + case MPTCP_RST_EMPTCP: > > + return SK_RST_REASON_MPTCP_RST_EMPTCP; > > + case MPTCP_RST_ERESOURCE: > > + return SK_RST_REASON_MPTCP_RST_ERESOURCE; > > + case MPTCP_RST_EPROHIBIT: > > + return SK_RST_REASON_MPTCP_RST_EPROHIBIT; > > + case MPTCP_RST_EWQ2BIG: > > + return SK_RST_REASON_MPTCP_RST_EWQ2BIG; > > + case MPTCP_RST_EBADPERF: > > + return SK_RST_REASON_MPTCP_RST_EBADPERF; > > + case MPTCP_RST_EMIDDLEBOX: > > + return SK_RST_REASON_MPTCP_RST_EMIDDLEBOX; > > + default: > > + /** > > I guess here as well, it should be '/*' instead of '/**'. But I guess > that's fine, this file is probably not scanned. Anyway, if you have to > send a new version, please fix this as well. Thanks for your help. I will. 
> > (Also, this helper might require '#include ', but our > CI is fine with it, it is also added in the next commit, and probably > already included via include/net/request_sock.h. So I guess that's fine.) Yes, If I need to submit the V9 patch, I will move it. > > > Other than that, it looks good to me: > > Reviewed-by: Matthieu Baerts (NGI0) Thanks for all the reviews :) Thanks, Jason > > Cheers, > Matt > -- > Sponsored by the NGI0 Core fund. >
Re: [PATCH v3 1/4] virtio_balloon: separate vm events into a function
On 23.04.24 05:41, zhenwei pi wrote:
> All the VM events related statistics have dependence on
> 'CONFIG_VM_EVENT_COUNTERS', separate these events into a function to
> make code clean. Then we can remove 'CONFIG_VM_EVENT_COUNTERS' from
> 'update_balloon_stats'.
>
> Signed-off-by: zhenwei pi
> ---

Reviewed-by: David Hildenbrand

--
Cheers,

David / dhildenb
Re: [PATCH net-next v8 6/7] mptcp: introducing a helper into active reset logic
Hi Jason,

On 23/04/2024 09:21, Jason Xing wrote:
> From: Jason Xing
>
> Since we have mapped every mptcp reset reason definition in enum
> sk_rst_reason, introducing a new helper can cover some missing places
> where we have already set the subflow->reset_reason.
>
> Note: using SK_RST_REASON_NOT_SPECIFIED is the same as
> SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown. So we can convert
> it directly.

It looks good to me, thanks:

Reviewed-by: Matthieu Baerts (NGI0)

Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
Re: [PATCH net-next v8 5/7] mptcp: support rstreason for passive reset
Hi Jason, On 23/04/2024 09:21, Jason Xing wrote: > From: Jason Xing > > It relys on what reset options in the skb are as rfc8684 says. Reusing (if you have something else to fix, 'checkpatch.pl --codespell' reported a warning here: s/relys/relies/) > this logic can save us much energy. This patch replaces most of the prior > NOT_SPECIFIED reasons. > > Signed-off-by: Jason Xing > --- > net/mptcp/protocol.h | 28 > net/mptcp/subflow.c | 22 +- > 2 files changed, 45 insertions(+), 5 deletions(-) > > diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h > index fdfa843e2d88..bbcb8c068aae 100644 > --- a/net/mptcp/protocol.h > +++ b/net/mptcp/protocol.h > @@ -581,6 +581,34 @@ mptcp_subflow_ctx_reset(struct mptcp_subflow_context > *subflow) > WRITE_ONCE(subflow->local_id, -1); > } > > +/* Convert reset reasons in MPTCP to enum sk_rst_reason type */ > +static inline enum sk_rst_reason > +sk_rst_convert_mptcp_reason(u32 reason) > +{ > + switch (reason) { > + case MPTCP_RST_EUNSPEC: > + return SK_RST_REASON_MPTCP_RST_EUNSPEC; > + case MPTCP_RST_EMPTCP: > + return SK_RST_REASON_MPTCP_RST_EMPTCP; > + case MPTCP_RST_ERESOURCE: > + return SK_RST_REASON_MPTCP_RST_ERESOURCE; > + case MPTCP_RST_EPROHIBIT: > + return SK_RST_REASON_MPTCP_RST_EPROHIBIT; > + case MPTCP_RST_EWQ2BIG: > + return SK_RST_REASON_MPTCP_RST_EWQ2BIG; > + case MPTCP_RST_EBADPERF: > + return SK_RST_REASON_MPTCP_RST_EBADPERF; > + case MPTCP_RST_EMIDDLEBOX: > + return SK_RST_REASON_MPTCP_RST_EMIDDLEBOX; > + default: > + /** I guess here as well, it should be '/*' instead of '/**'. But I guess that's fine, this file is probably not scanned. Anyway, if you have to send a new version, please fix this as well. (Also, this helper might require '#include ', but our CI is fine with it, it is also added in the next commit, and probably already included via include/net/request_sock.h. So I guess that's fine.) 
Other than that, it looks good to me: Reviewed-by: Matthieu Baerts (NGI0) Cheers, Matt -- Sponsored by the NGI0 Core fund.
Re: [PATCH net-next v8 3/7] rstreason: prepare for active reset
Hi Jason, On 23/04/2024 09:21, Jason Xing wrote: > From: Jason Xing > > Like what we did to passive reset: > only passing possible reset reason in each active reset path. > > No functional changes. (...) > net/mptcp/protocol.c | 4 +++- > net/mptcp/subflow.c | 5 +++-- For the modifications in MPTCP: Acked-by: Matthieu Baerts (NGI0) Cheers, Matt -- Sponsored by the NGI0 Core fund.
Re: [PATCH net-next v8 2/7] rstreason: prepare for passive reset
Hi Jason, On 23/04/2024 09:21, Jason Xing wrote: > From: Jason Xing > > Adjust the parameter and support passing reason of reset which > is for now NOT_SPECIFIED. No functional changes. (...) > net/mptcp/subflow.c| 8 +--- For the modifications in MPTCP: Acked-by: Matthieu Baerts (NGI0) Cheers, Matt -- Sponsored by the NGI0 Core fund.
Re: [PATCH net-next v8 1/7] net: introduce rstreason to detect why the RST is sent
Hi Jason, On 23/04/2024 09:21, Jason Xing wrote: > From: Jason Xing > > Add a new standalone file for the easy future extension to support > both active reset and passive reset in the TCP/DCCP/MPTCP protocols. > > This patch only does the preparations for reset reason mechanism, > nothing else changes. > > The reset reasons are divided into three parts: > 1) reuse drop reasons for passive reset in TCP > 2) our own independent reasons which aren't relying on other reasons at all > 3) reuse MP_TCPRST option for MPTCP Thank you for the v8, it looks good to me regarding the modifications linked to MPTCP. Acked-by: Matthieu Baerts (NGI0) Cheers, Matt -- Sponsored by the NGI0 Core fund.
[RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
EDMM's ioctl()s support batch operations, which may be time-consuming. Try to explicitly give up the CPU at the very end of the "for" loop in sgx_enclave_{ modify_types | restrict_permissions | remove_pages } to give other tasks a chance to run and to avoid a softlockup warning. The following has been observed on Linux v6.9-rc5 with kernel preemption disabled (by configuring "PREEMPT_NONE=y"), when the kernel is requested to restrict page permissions of a large number of EPC pages. [ cut here ] watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905] ... CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7 ... RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf 00 00 00 40 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef RSP: 0018:b55a6591fa80 EFLAGS: 0202 RAX: RBX: b55a6591fac0 RCX: b581e7384000 RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000 RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0 R10: 006e R11: 0002 R12: 00072052d000 R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020 FS: 7fe10dbda740() GS:92163e48() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0 DR0: DR1: DR2: DR3: DR6: fffe07f0 DR7: 0400 PKRU: 5554 Call Trace: ? show_regs+0x67/0x70 ? watchdog_timer_fn+0x1f3/0x280 ? __pfx_watchdog_timer_fn+0x10/0x10 ? __hrtimer_run_queues+0xc8/0x220 ? hrtimer_interrupt+0x10c/0x250 ? __sysvec_apic_timer_interrupt+0x53/0x130 ? sysvec_apic_timer_interrupt+0x7b/0x90 ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 ? sgx_enclave_restrict_permissions+0xba/0x1f0 ? __pte_offset_map_lock+0x94/0x110 ? sgx_encl_test_and_clear_young_cb+0x40/0x60 sgx_ioctl+0x1ab/0x900 ? do_syscall_64+0x79/0x110 ? apply_to_page_range+0x14/0x20 ? sgx_encl_test_and_clear_young+0x6c/0x80 ? sgx_vma_fault+0x132/0x4f0 __x64_sys_ioctl+0x95/0xd0 x64_sys_call+0x1209/0x20c0 do_syscall_64+0x6d/0x110 ? do_syscall_64+0x79/0x110 ?
do_pte_missing+0x2e8/0xcc0 ? __pte_offset_map+0x1c/0x190 ? __handle_mm_fault+0x7b9/0xe60 ? __count_memcg_events+0x70/0x100 ? handle_mm_fault+0x256/0x360 ? do_user_addr_fault+0x3c1/0x860 ? irqentry_exit_to_user_mode+0x67/0x190 ? irqentry_exit+0x3b/0x50 ? exc_page_fault+0x89/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fe10e2ee5cb Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48 RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010 RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005 RBP: 0005 R08: R09: 7fffb2c75594 R10: 7fffb2c755c8 R11: 0246 R12: c028a405 R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980 [ end trace ]

Signed-off-by: Bojun Zhu
---
 arch/x86/kernel/cpu/sgx/ioctl.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..2340a82fa796 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl,
 		}
 		mutex_unlock(&encl->lock);
+
+		if (need_resched())
+			cond_resched();
 	}
 
 	ret = 0;
@@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 		entry->type = page_type;
 		mutex_unlock(&encl->lock);
+
+		if (need_resched())
+			cond_resched();
 	}
 
 	ret = 0;
@@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
 		kfree(entry);
 		mutex_unlock(&encl->lock);
+
+		if (need_resched())
+			cond_resched();
 	}
 
 	ret = 0;
-- 
2.25.1
[RFC PATCH 0/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup
Hi folks, If we run an enclave equipped with a large EPC (30G or greater on my platform) on Linux with kernel preemption disabled (by configuring "CONFIG_PREEMPT_NONE=y"), we will get the following softlockup warning messages reported in the "dmesg" log. Is it a known issue? I can get the warning messages on Linux v6.9-rc5. The EDMM ioctl()s (sgx_ioc_enclave_{ modify_types | restrict_permissions | remove_pages }) interface provided by the kernel supports batch-changing attributes of an enclave's EPC pages. If a userspace app requests the kernel to handle too many EPC pages, the kernel may get stuck for a long time (with preemption disabled). The log is as follows: [ cut here ] [ 901.101294] watchdog: BUG: soft lockup - CPU#92 stuck for 23s! [occlum-run:4289] [ 901.109617] Modules linked in: veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter br_netfilter bridge stp llc overlay nls_iso8859_1 intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit binfmt_misc ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 pmt_telemetry sha1_ssse3 pmt_class joydev intel_sdsi input_leds aesni_intel crypto_simd cryptd dax_hmem cxl_acpi cmdlinepart rapl cxl_core ast spi_nor intel_cstate drm_shmem_helper einj mtd drm_kms_helper mei_me idxd isst_if_mmio isst_if_mbox_pci isst_if_common intel_vsec idxd_bus mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter mac_hid sch_fq_codel msr parport_pc ppdev lp parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore drm ip_tables x_tables [ 901.109670] autofs4 mlx5_ib ib_uverbs ib_core hid_generic usbhid hid ses enclosure scsi_transport_sas mlx5_core pci_hyperv_intf mlxfw igb ahci psample i2c_algo_bit i2c_i801 spi_intel_pci xhci_pci tls megaraid_sas dca
spi_intel crc32_pclmul i2c_smbus i2c_ismt libahci xhci_pci_renesas wmi pinctrl_emmitsburg [ 901.109691] CPU: 92 PID: 4289 Comm: occlum-run Not tainted 6.9.0-rc5 #3 [ 901.109693] Hardware name: Inspur NF5468-M7-A0-R0-00/NF5468-M7-A0-R0-00, BIOS 05.02.01 05/08/2023 [ 901.109695] RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0 [ 901.109701] Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e 15 8b 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e 15 8b 0f 01 cf 00 00 00 40 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef [ 901.109702] RSP: 0018:ad0ae5d0f8c0 EFLAGS: 0202 [ 901.109704] RAX: RBX: ad0ae5d0f900 RCX: ad11dfc0e000 [ 901.109705] RDX: ad2adcff81c0 RSI: RDI: 9a12f5f4f000 [ 901.109706] RBP: ad0ae5d0f9b0 R08: 0002 R09: 9a1289f57520 [ 901.109707] R10: 005d R11: 0002 R12: 0006d8ff2000 [ 901.109708] R13: 9a12f5f4f000 R14: ad0ae5d0fa18 R15: 9a12f5f4f020 [ 901.109709] FS: 7fb20ad1d740() GS:9a317fe0() knlGS: [ 901.109710] CS: 0010 DS: ES: CR0: 80050033 [ 901.109711] CR2: 7f8041811000 CR3: 000118530006 CR4: 00770ef0 [ 901.109712] DR0: DR1: DR2: [ 901.109713] DR3: DR6: fffe07f0 DR7: 0400 [ 901.109714] PKRU: 5554 [ 901.109714] Call Trace: [ 901.109716] [ 901.109718] ? show_regs+0x67/0x70 [ 901.109722] ? watchdog_timer_fn+0x1f3/0x280 [ 901.109725] ? __pfx_watchdog_timer_fn+0x10/0x10 [ 901.109727] ? __hrtimer_run_queues+0xc8/0x220 [ 901.109731] ? hrtimer_interrupt+0x10c/0x250 [ 901.109733] ? __sysvec_apic_timer_interrupt+0x53/0x130 [ 901.109736] ? sysvec_apic_timer_interrupt+0x7b/0x90 [ 901.109739] [ 901.109740] [ 901.109740] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 [ 901.109745] ? sgx_enclave_restrict_permissions+0xba/0x1f0 [ 901.109747] ? aa_file_perm+0x145/0x550 [ 901.109750] sgx_ioctl+0x1ab/0x900 [ 901.109751] ? xas_find+0x84/0x200 [ 901.109754] ? sgx_enclave_etrack+0xbb/0x140 [ 901.109756] ? sgx_encl_may_map+0x19a/0x240 [ 901.109758] ? common_file_perm+0x8a/0x1b0 [ 901.109760] ? obj_cgroup_charge_pages+0xa2/0x100 [ 901.109763] ? 
tlb_flush_mmu+0x31/0x1c0 [ 901.109766] ? tlb_finish_mmu+0x42/0x80 [ 901.109767] ? do_mprotect_pkey+0x150/0x530 [ 901.109769] ? __fget_light+0xc0/0x100 [ 901.109772] __x64_sys_ioctl+0x95/0xd0 [ 901.109775] x64_sys_call+0x1209/0x20c0 [ 901.109777] do_syscall_64+0x6d/0x110 [ 901.109779] ? syscall_exit_to_user_mode+0x86/0x1c0 [ 901.109782] ? do_syscall_64+0x79/0x110 [ 901.109783] ? syscall_exit_to_user_mode+0x86/0x1c0 [ 901.109784] ? do_syscall_64+0x79/0x110 [ 901.109785] ? free_unref_page+0x10e/0x180 [ 901.109788] ? __do_fault+0x36/0x130 [ 901.109791] ? do_pte_missing+0x2e8/0xcc0 [
[PATCH net-next v6] net/ipv4: add tracepoint for icmp_send
From: Peilin He

Introduce a tracepoint for icmp_send, which can help users conveniently get more detailed information when abnormal ICMP events happen.

1. Giving a use-case example:
=
When an application experiences packet loss due to an unreachable UDP destination port, the kernel will send an exception message through the icmp_send function. By adding a trace point for icmp_send, developers or system administrators can obtain detailed information about the UDP packet loss, including the type, code, source address, destination address, source port, and destination port. This facilitates the troubleshooting of UDP packet loss issues, especially for those network-service applications.

2. Operation Instructions:
==
Switch to the tracing directory.
cd /sys/kernel/tracing
Filter for destination port unreachable.
echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable trace event.
echo 1 > events/icmp/icmp_send/enable

3. Result View:
udp_client_erro-11370 [002] ...s.12 124.728002: icmp_send: icmp_send: type=3, code=3. From 127.0.0.1:41895 to 127.0.0.1: ulen=23 skbaddr=589b167a

Change log
==
v5->v6: Some fixes according to https://lore.kernel.org/all/20240413161319.ga853...@kernel.org/
1. Resubmit patches based on the latest net-next code.

v4->v5: Some fixes according to https://lore.kernel.org/all/CAL+tcoDeXXh+zcRk4PHnUk8ELnx=ce2pccqs7sfm0y9ak-e...@mail.gmail.com/
1. Adjust the position of trace_icmp_send() to before icmp_push_reply().

v3->v4: Some fixes according to https://lore.kernel.org/all/CANn89i+EFEr7VHXNdOi59Ba_R1nFKSBJzBzkJFVgCTdXBx=y...@mail.gmail.com/
1. Add legality check for UDP header in SKB.
2. Target this patch for net-next.

v2->v3: Some fixes according to https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
1. Change the tracing directory to /sys/kernel/tracing.
2. Adjust the layout of the TP_STRUCT__entry parameter structure.
v1->v2: Some fixes according to https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/ 1. adjust the trace_icmp_send() to more protocols than UDP. 2. move the calling of trace_icmp_send after sanity checks in __icmp_send(). Signed-off-by: Peilin He Reviewed-by: Yunkai Zhang Cc: Yang Yang Cc: Liu Chun Cc: Xuexin Jiang Signed-off-by: xu xin --- include/trace/events/icmp.h | 65 + net/ipv4/icmp.c | 4 +++ 2 files changed, 69 insertions(+) create mode 100644 include/trace/events/icmp.h diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h new file mode 100644 index ..7d5190f48a28 --- /dev/null +++ b/include/trace/events/icmp.h @@ -0,0 +1,65 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM icmp + +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_ICMP_H + +#include +#include + +TRACE_EVENT(icmp_send, + + TP_PROTO(const struct sk_buff *skb, int type, int code), + + TP_ARGS(skb, type, code), + + TP_STRUCT__entry( + __field(const void *, skbaddr) + __field(int, type) + __field(int, code) + __array(__u8, saddr, 4) + __array(__u8, daddr, 4) + __field(__u16, sport) + __field(__u16, dport) + __field(unsigned short, ulen) + ), + + TP_fast_assign( + struct iphdr *iph = ip_hdr(skb); + int proto_4 = iph->protocol; + __be32 *p32; + + __entry->skbaddr = skb; + __entry->type = type; + __entry->code = code; + + struct udphdr *uh = udp_hdr(skb); + if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head || + (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) { + __entry->sport = 0; + __entry->dport = 0; + __entry->ulen = 0; + } else { + __entry->sport = ntohs(uh->source); + __entry->dport = ntohs(uh->dest); + __entry->ulen = ntohs(uh->len); + } + + p32 = (__be32 *) __entry->saddr; + *p32 = iph->saddr; + + p32 = (__be32 *) __entry->daddr; + *p32 = iph->daddr; + ), + + TP_printk("icmp_send: type=%d, code=%d. 
From %pI4:%u to %pI4:%u ulen=%d skbaddr=%p", + __entry->type, __entry->code, + __entry->saddr, __entry->sport,
回复: 回复: [PATCH v5] vp_vdpa: don't allocate unused msix vectors
> > When there is a ctlq and it doesn't require interrupt callbacks,the > > original method of calculating vectors wastes hardware msi or msix > > resources as well as system IRQ resources. > > > > When conducting performance testing using testpmd in the guest os, > > it was found that the performance was lower compared to directly > > using vfio-pci to passthrough the device > > > > In scenarios where the virtio device in the guest os does not > > utilize interrupts, the vdpa driver still configures the hardware's > > msix vector. Therefore, the hardware still sends interrupts to the host os. > > >I just have a question on this part. How come hardware sends interrupts does > >not guest driver disable them? > >1:Assuming the guest OS's Virtio device is using PMD mode, QEMU sets the > call fd to -1 >2:On the host side, the vhost_vdpa program will set > vp_vdpa->vring[i].cb.callback to invalid >3:Before the modification, the vp_vdpa_request_irq function does not check > whether > vp_vdpa->vring[i].cb.callback is valid. Instead, it enables the > hardware's MSIX > interrupts based on the number of queues of the device > > So MSIX is enabled but why would it trigger? virtio PMD in poll mode > presumably suppresses interrupts after all. I didn't express the test model clearly. The testing model is as follows:

    testpmd                    testpmd
       ^                          ^
       |                          |
       v                          v
    vfio pci                   vfio pci
   pci device                 pci device
  ---guest os---             ---guest os---
  virtio device               vfio device
  -----qemu-----             -----qemu-----
       ^                          ^
       |                          |
       v                          v
   vhost_vdpa                  vfio pci
  ----host os---             ----host os---
    hw: VF1                    hw: VF2

After the hardware completes DMA, it will still send an interrupt to the host OS. -Original Message- From: Angus Chen angus.c...@jaguarmicro.com Sent: April 23, 2024 16:43 To: Michael S.
Tsirkin m...@redhat.com; Gavin Liu gavin@jaguarmicro.com Cc: jasow...@redhat.com; virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com; linux-kernel@vger.kernel.org; Heng Qi hen...@linux.alibaba.com Subject: RE: Reply: [PATCH v5] vp_vdpa: don't allocate unused msix vectors Hi mst. > -Original Message- > From: Michael S. Tsirkin > Sent: Tuesday, April 23, 2024 4:35 PM > To: Gavin Liu > Cc: jasow...@redhat.com; Angus Chen ; > virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com; > linux-kernel@vger.kernel.org; Heng Qi > Subject: Re: 回复: [PATCH v5] vp_vdpa: don't allocate unused msix > vectors > > On Tue, Apr 23, 2024 at 01:39:17AM +, Gavin Liu wrote: > > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote: > > > From: Yuxue Liu > > > > > > When there is a ctlq and it doesn't require interrupt > > > callbacks,the original method of calculating vectors wastes > > > hardware msi or msix resources as well as system IRQ resources. > > > > > > When conducting performance testing using testpmd in the guest os, > > > it was found that the performance was lower compared to directly > > > using vfio-pci to passthrough the device > > > > > > In scenarios where the virtio device in the guest os does not > > > utilize interrupts, the vdpa driver still configures the > > > hardware's msix vector. Therefore, the hardware still sends interrupts to > > > the host os. > > > > >I just have a question on this part. How come hardware sends > > >interrupts does > not guest driver disable them? > > > >1:Assuming the guest OS's Virtio device is using PMD mode, QEMU > > sets > the call fd to -1 > >2:On the host side, the vhost_vdpa program will set > vp_vdpa->vring[i].cb.callback to invalid > >3:Before the modification, the vp_vdpa_request_irq function does > > not > check whether > > vp_vdpa->vring[i].cb.callback is valid. 
Instead, it enables > > the > hardware's MSIX > > interrupts based on the number of queues of the device > > > > So MSIX is enabled but why would it trigger? virtio PMD in poll mode > presumably suppresses interrupts after all. Virtio pmd is in the guest,but in host side,the msix is enabled,then the device will triger Interrupt normally. I analysed this bug before,and I think gavin is right. Did I make it clear? > > > > > > > - Original
Re: Re: [PATCH v3 2/4] virtio_balloon: introduce oom-kill invocations
On 4/23/24 17:13, Michael S. Tsirkin wrote: On Tue, Apr 23, 2024 at 11:41:07AM +0800, zhenwei pi wrote: [snip] #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ Looks like a useful extension. But any UAPI extension has to go to virtio spec first. Sure, I'll send related virtio spec changes once virtio comment mail list gets ready. @@ -83,7 +84,8 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \ VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ - VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \ + VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ + VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \ } #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("") -- 2.34.1 -- zhenwei pi
Re: [PATCH v3 2/4] virtio_balloon: introduce oom-kill invocations
On Tue, Apr 23, 2024 at 11:41:07AM +0800, zhenwei pi wrote: > When the guest OS runs under critical memory pressure, the guest > starts to kill processes. A guest monitor agent may scan 'oom_kill' > from /proc/vmstat, and reports the OOM KILL event. However, the agent > may be killed and we will loss this critical event(and the later > events). > > For now we can also grep for magic words in guest kernel log from host > side. Rather than this unstable way, virtio balloon reports OOM-KILL > invocations instead. > > Acked-by: David Hildenbrand > Signed-off-by: zhenwei pi > --- > drivers/virtio/virtio_balloon.c | 1 + > include/uapi/linux/virtio_balloon.h | 6 -- > 2 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index 1710e3098ecd..f7a47eaa0936 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -330,6 +330,7 @@ static inline unsigned int update_balloon_vm_stats(struct > virtio_balloon *vb) > pages_to_bytes(events[PSWPOUT])); > update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]); > update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]); > + update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]); > > #ifdef CONFIG_HUGETLB_PAGE > update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, > diff --git a/include/uapi/linux/virtio_balloon.h > b/include/uapi/linux/virtio_balloon.h > index ddaa45e723c4..b17bbe033697 100644 > --- a/include/uapi/linux/virtio_balloon.h > +++ b/include/uapi/linux/virtio_balloon.h > @@ -71,7 +71,8 @@ struct virtio_balloon_config { > #define VIRTIO_BALLOON_S_CACHES 7 /* Disk caches */ > #define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page allocations */ > #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation > failures */ > -#define VIRTIO_BALLOON_S_NR 10 > +#define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ > +#define VIRTIO_BALLOON_S_NR 11 > > #define 
VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ > VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ Looks like a useful extension. But any UAPI extension has to go to virtio spec first. > @@ -83,7 +84,8 @@ struct virtio_balloon_config { > VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \ > VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ > VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ > - VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \ > + VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ > + VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \ > } > > #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("") > -- > 2.34.1
RE: 回复: [PATCH v5] vp_vdpa: don't allocate unused msix vectors
Hi mst. > -Original Message- > From: Michael S. Tsirkin > Sent: Tuesday, April 23, 2024 4:35 PM > To: Gavin Liu > Cc: jasow...@redhat.com; Angus Chen ; > virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com; > linux-kernel@vger.kernel.org; Heng Qi > Subject: Re: 回复: [PATCH v5] vp_vdpa: don't allocate unused msix vectors > > On Tue, Apr 23, 2024 at 01:39:17AM +, Gavin Liu wrote: > > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote: > > > From: Yuxue Liu > > > > > > When there is a ctlq and it doesn't require interrupt > > > callbacks,the original method of calculating vectors wastes > > > hardware msi or msix resources as well as system IRQ resources. > > > > > > When conducting performance testing using testpmd in the guest os, it > > > was found that the performance was lower compared to directly using > > > vfio-pci to passthrough the device > > > > > > In scenarios where the virtio device in the guest os does not utilize > > > interrupts, the vdpa driver still configures the hardware's msix > > > vector. Therefore, the hardware still sends interrupts to the host os. > > > > >I just have a question on this part. How come hardware sends > > >interrupts does > not guest driver disable them? > > > >1:Assuming the guest OS's Virtio device is using PMD mode, QEMU sets > the call fd to -1 > >2:On the host side, the vhost_vdpa program will set > vp_vdpa->vring[i].cb.callback to invalid > >3:Before the modification, the vp_vdpa_request_irq function does > > not > check whether > > vp_vdpa->vring[i].cb.callback is valid. Instead, it enables the > hardware's MSIX > > interrupts based on the number of queues of the device > > > > So MSIX is enabled but why would it trigger? virtio PMD in poll mode > presumably suppresses interrupts after all. Virtio PMD is in the guest, but on the host side the MSI-X is enabled, so the device will trigger interrupts normally. I analysed this bug before, and I think Gavin is right. Did I make it clear?
> > > > > > > - Original Message - > > From: Michael S. Tsirkin m...@redhat.com > > Sent: April 22, 2024 20:09 > > To: Gavin Liu gavin@jaguarmicro.com > > Cc: jasow...@redhat.com; Angus Chen angus.c...@jaguarmicro.com; > virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com; > linux-kernel@vger.kernel.org; Heng Qi hen...@linux.alibaba.com > > Subject: Re: [PATCH v5] vp_vdpa: don't allocate unused msix vectors > > > > > > > > External Mail: This email originated from OUTSIDE of the organization! > > Do not click links, open attachments or provide ANY information unless you > recognize the sender and know the content is safe. > > > > > > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote: > > > From: Yuxue Liu > > > > > > When there is a ctlq and it doesn't require interrupt callbacks,the > > > original method of calculating vectors wastes hardware msi or msix > > > resources as well as system IRQ resources. > > > > > > When conducting performance testing using testpmd in the guest os, it > > > was found that the performance was lower compared to directly using > > > vfio-pci to passthrough the device > > > > > > In scenarios where the virtio device in the guest os does not utilize > > > interrupts, the vdpa driver still configures the hardware's msix > > > vector. Therefore, the hardware still sends interrupts to the host os. > > > > I just have a question on this part. How come hardware sends interrupts does > not guest driver disable them? > > > > > Because of this unnecessary > > > action by the hardware, hardware performance decreases, and it also > > > affects the performance of the host os. 
> > > > > > Before modification:(interrupt mode) > > > 32: 0 0 0 0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0 > > > 33: 0 0 0 0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1 > > > 34: 0 0 0 0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2 > > > 35: 0 0 0 0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config > > > > > > After modification:(interrupt mode) > > > 32: 0 0 1 7 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-0 > > > 33: 36 0 3 0 PCI-MSI 32769-edge vp-vdpa[:00:02.0]-1 > > > 34: 0 0 0 0 PCI-MSI 32770-edge vp-vdpa[:00:02.0]-config > > > > > > Before modification:(virtio pmd mode for guest os) > > > 32: 0 0 0 0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0 > > > 33: 0 0 0 0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1 > > > 34: 0 0 0 0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2 > > > 35: 0 0 0 0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config > > > > > > After modification:(virtio pmd mode for guest os) > > > 32: 0 0 0 0 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-config > > > > > > To verify the use of the virtio PMD mode in the guest operating > > > system, the following patch needs to be applied to QEMU: > > > https://lore.kernel.org/all/20240408073311.2049-1-yuxue.liu@jaguarmicr > > > o.com > > > > > > Signed-off-by:
Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection
On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote: > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin wrote: > > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote: > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin > > > wrote: > > > > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote: > > > > > Add the function vduse_alloc_reconnnect_info_mem > > > > > and vduse_alloc_reconnnect_info_mem > > > > > These functions allow vduse to allocate and free memory for > > > > > reconnection > > > > > information. The amount of memory allocated is vq_num pages. > > > > > Each VQS will map its own page where the reconnection information > > > > > will be saved > > > > > > > > > > Signed-off-by: Cindy Lu > > > > > --- > > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 40 > > > > > ++ > > > > > 1 file changed, 40 insertions(+) > > > > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > index ef3c9681941e..2da659d5f4a8 100644 > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue { > > > > > int irq_effective_cpu; > > > > > struct cpumask irq_affinity; > > > > > struct kobject kobj; > > > > > + unsigned long vdpa_reconnect_vaddr; > > > > > }; > > > > > > > > > > struct vduse_dev; > > > > > @@ -1105,6 +1106,38 @@ static void > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq) > > > > > > > > > > vq->irq_effective_cpu = curr_cpu; > > > > > } > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev) > > > > > +{ > > > > > + unsigned long vaddr = 0; > > > > > + struct vduse_virtqueue *vq; > > > > > + > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > + /*page 0~ vq_num save the reconnect info for vq*/ > > > > > + vq = dev->vqs[i]; > > > > > + vaddr = get_zeroed_page(GFP_KERNEL); > > > > > > > > > > > > I don't get why you insist on 
stealing kernel memory for something > > > > that is just used by userspace to store data for its own use. > > > > Userspace does not lack ways to persist data, for example, > > > > create a regular file anywhere in the filesystem. > > > > > > Good point. So the motivation here is to: > > > > > > 1) be self contained, no dependency for high speed persist data > > > storage like tmpfs > > > > No idea what this means. > > I mean a regular file may slow down the datapath performance, so > usually the application will try to use tmpfs and other which is a > dependency for implementing the reconnection. Are we worried about systems without tmpfs now? > > > > > 2) standardize the format in uAPI which allows reconnection from > > > arbitrary userspace, unfortunately, such effort was removed in new > > > versions > > > > And I don't see why that has to live in the kernel tree either. > > I can't find a better place, any idea? > > Thanks Well anywhere on github really. with libvhost-user maybe? It's harmless enough in Documentation if you like but ties you to the kernel release cycle in a way that is completely unnecessary. > > > > > If the above doesn't make sense, we don't need to offer those pages by > > > VDUSE. 
> > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > + if (vaddr == 0) > > > > > + return -ENOMEM; > > > > > + > > > > > + vq->vdpa_reconnect_vaddr = vaddr; > > > > > + } > > > > > + > > > > > + return 0; > > > > > +} > > > > > + > > > > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev) > > > > > +{ > > > > > + struct vduse_virtqueue *vq; > > > > > + > > > > > + for (int i = 0; i < dev->vq_num; i++) { > > > > > + vq = dev->vqs[i]; > > > > > + > > > > > + if (vq->vdpa_reconnect_vaddr) > > > > > + free_page(vq->vdpa_reconnect_vaddr); > > > > > + vq->vdpa_reconnect_vaddr = 0; > > > > > + } > > > > > + > > > > > + return 0; > > > > > +} > > > > > > > > > > static long vduse_dev_ioctl(struct file *file, unsigned int cmd, > > > > > unsigned long arg) > > > > > @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name) > > > > > mutex_unlock(>lock); > > > > > return -EBUSY; > > > > > } > > > > > + vduse_free_reconnnect_info_mem(dev); > > > > > + > > > > > dev->connected = true; > > > > > mutex_unlock(>lock); > > > > > > > > > > @@ -1855,12 +1890,17 @@ static int vduse_create_dev(struct > > > > > vduse_dev_config *config, > > > > > ret = vduse_dev_init_vqs(dev, config->vq_align, config->vq_num); > > > > > if (ret) > > > > > goto err_vqs; > > > > > + ret = vduse_alloc_reconnnect_info_mem(dev); > > > > > + if (ret < 0) > > > > > + goto err_mem; > > > > > >
Re: 回复: [PATCH v5] vp_vdpa: don't allocate unused msix vectors
On Tue, Apr 23, 2024 at 01:39:17AM +, Gavin Liu wrote: > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote: > > From: Yuxue Liu > > > > When there is a ctlq and it doesn't require interrupt callbacks,the > > original method of calculating vectors wastes hardware msi or msix > > resources as well as system IRQ resources. > > > > When conducting performance testing using testpmd in the guest os, it > > was found that the performance was lower compared to directly using > > vfio-pci to passthrough the device > > > > In scenarios where the virtio device in the guest os does not utilize > > interrupts, the vdpa driver still configures the hardware's msix > > vector. Therefore, the hardware still sends interrupts to the host os. > > >I just have a question on this part. How come hardware sends interrupts does > >not guest driver disable them? > >1:Assuming the guest OS's Virtio device is using PMD mode, QEMU sets the > call fd to -1 >2:On the host side, the vhost_vdpa program will set > vp_vdpa->vring[i].cb.callback to invalid >3:Before the modification, the vp_vdpa_request_irq function does not check > whether > vp_vdpa->vring[i].cb.callback is valid. Instead, it enables the > hardware's MSIX > interrupts based on the number of queues of the device > So MSIX is enabled but why would it trigger? virtio PMD in poll mode presumably suppresses interrupts after all. > > > - Original Message - > From: Michael S. Tsirkin m...@redhat.com > Sent: April 22, 2024 20:09 > To: Gavin Liu gavin@jaguarmicro.com > Cc: jasow...@redhat.com; Angus Chen angus.c...@jaguarmicro.com; > virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com; > linux-kernel@vger.kernel.org; Heng Qi hen...@linux.alibaba.com > Subject: Re: [PATCH v5] vp_vdpa: don't allocate unused msix vectors > > > > External Mail: This email originated from OUTSIDE of the organization! > Do not click links, open attachments or provide ANY information unless you > recognize the sender and know the content is safe. 
> > > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote: > > From: Yuxue Liu > > > > When there is a ctlq and it doesn't require interrupt callbacks, the > > original method of calculating vectors wastes hardware msi or msix > > resources as well as system IRQ resources. > > > > When conducting performance testing using testpmd in the guest os, it > > was found that the performance was lower compared to directly using > > vfio-pci to passthrough the device. > > > > In scenarios where the virtio device in the guest os does not utilize > > interrupts, the vdpa driver still configures the hardware's msix > > vector. Therefore, the hardware still sends interrupts to the host os. > > I just have a question on this part. If the hardware sends interrupts, why > does the guest driver not disable them? > > > Because of this unnecessary > > action by the hardware, hardware performance decreases, and it also > > affects the performance of the host os. > > > > Before modification (interrupt mode): > > 32: 0 0 0 0 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-0 > > 33: 0 0 0 0 PCI-MSI 32769-edge vp-vdpa[:00:02.0]-1 > > 34: 0 0 0 0 PCI-MSI 32770-edge vp-vdpa[:00:02.0]-2 > > 35: 0 0 0 0 PCI-MSI 32771-edge vp-vdpa[:00:02.0]-config > > > > After modification (interrupt mode): > > 32: 0 0 1 7 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-0 > > 33: 36 0 3 0 PCI-MSI 32769-edge vp-vdpa[:00:02.0]-1 > > 34: 0 0 0 0 PCI-MSI 32770-edge vp-vdpa[:00:02.0]-config > > > > Before modification (virtio pmd mode for guest os): > > 32: 0 0 0 0 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-0 > > 33: 0 0 0 0 PCI-MSI 32769-edge vp-vdpa[:00:02.0]-1 > > 34: 0 0 0 0 PCI-MSI 32770-edge vp-vdpa[:00:02.0]-2 > > 35: 0 0 0 0 PCI-MSI 32771-edge vp-vdpa[:00:02.0]-config > > > > After modification (virtio pmd mode for guest os): > > 32: 0 0 0 0 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-config > > > > To verify the use of the virtio PMD mode in the guest operating > > system, the following patch needs to be applied to QEMU: > > 
https://lore.kernel.org/all/20240408073311.2049-1-yuxue.liu@jaguarmicro.com > > > > Signed-off-by: Yuxue Liu > > Acked-by: Jason Wang > > Reviewed-by: Heng Qi > > --- > > V5: modify the description of the printout when an exception occurs > > V4: update the title and assign values to uninitialized variables > > V3: delete unused variables and add validation records > > V2: fix when allocating IRQs, scan all queues > > > > drivers/vdpa/virtio_pci/vp_vdpa.c | 22 ++++++++++++++++------ > > 1 file changed, 16 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c > > b/drivers/vdpa/virtio_pci/vp_vdpa.c > > index df5f4a3bccb5..8de0224e9ec2 100644 > > --- a/drivers/vdpa/virtio_pci/vp_vdpa.c > > +++
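The reasoning in the thread above boils down to a simple counting rule: a virtqueue whose callback has been cleared (call fd set to -1 by QEMU, guest running in PMD/poll mode) needs no MSI-X vector, so only queues with a valid callback plus the config-change interrupt should be counted. A minimal userspace sketch of that rule (hypothetical helper, not the vp_vdpa driver code):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the patch's idea: a virtqueue only needs its own MSI-X
 * vector when it has an interrupt callback; queues run in poll mode
 * (callback cleared) are skipped.  One extra vector is always reserved
 * for config-change interrupts. */
static int msix_vectors_needed(const int *has_callback, size_t nvqs)
{
	int vectors = 0;

	for (size_t i = 0; i < nvqs; i++)
		if (has_callback[i])
			vectors++;	/* one vector per interrupting queue */

	return vectors + 1;		/* plus the config vector */
}
```

In the "virtio pmd mode" case above, all data queues are polled, so only the single config vector remains, which matches the single `-config` line in the "After modification" /proc/interrupts output.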
[PATCH v2 1/1] genirq/cpuhotplug: retry with cpu_online_mask when irq_do_set_affinity return -ENOSPC
When a CPU goes offline, the interrupts pinned to that CPU are re-configured. Its managed interrupts undergo either migration to other CPUs or shutdown if all CPUs listed in the affinity are offline. This patch doesn't affect managed interrupts. Regular interrupts are migrated to other selected online CPUs. The target CPUs are chosen from either desc->pending_mask (with CONFIG_GENERIC_PENDING_IRQ) or d->common->affinity (with CONFIG_SMP). The cpu_online_mask is used as the target CPUs only when the CPUs in both desc->pending_mask and d->common->affinity are offline. However, there is a bad corner case: when desc->pending_mask or d->common->affinity is selected as the target cpumask but none of its CPUs has any available vectors, the migration fails and the device interrupt becomes stale. This is no different from the case where the affinity mask does not contain any online CPU, yet there is no fallback operation for it. Instead of giving up, retry the migration attempt with the online CPU mask if the interrupt is not managed, as managed interrupts cannot be affected by this problem. Cc: Joe Jin Signed-off-by: Dongli Zhang [tglx: massage some changelog] --- Changed since v1: - Re-work the commit message - Move pr_debug before setting affinity - Remove 'all' from pr_debug message kernel/irq/cpuhotplug.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c index 1ed2b1739363..19babb914949 100644 --- a/kernel/irq/cpuhotplug.c +++ b/kernel/irq/cpuhotplug.c @@ -130,6 +130,17 @@ static bool migrate_one_irq(struct irq_desc *desc) * CPU. 
*/ err = irq_do_set_affinity(d, affinity, false); + + if (err == -ENOSPC && !irqd_affinity_is_managed(d) && affinity != cpu_online_mask) { + pr_debug("IRQ%u: set affinity failed for %*pbl, re-try with online CPUs\n", +d->irq, cpumask_pr_args(affinity)); + + affinity = cpu_online_mask; + brokeaff = true; + + err = irq_do_set_affinity(d, affinity, false); + } + if (err) { pr_warn_ratelimited("IRQ%u: set affinity failed(%d).\n", d->irq, err); -- 2.34.1
Re: [PATCH v12 13/14] Docs/x86/sgx: Add description for cgroup support
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote: > From: Sean Christopherson > > Add initial documentation of how to regulate the distribution of > SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup > controller. > > Acked-by: Kai Huang
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
On 23.04.24 00:15, Mauro Carvalho Chehab wrote: >> sta...@kernel.org is there to route to /dev/null on purpose so that >> developers/maintainers who only want their patches to get picked up when >> they hit Linus's tree, will have that happen and not notify anyone else. >> This is especially good when dealing with security-related things as we >> have had MANY people accidentally leak patches way too early by having >> cc: sta...@vger.kernel.org in their signed-off-by areas, and forgetting >> to tell git send-email to suppress cc: when sending them out for >> internal review. > Nice! I didn't know about that. On a quick check, the only place in the > documentation mentioning it without vger is checkpatch.rst. > > Perhaps it would make sense to document that as well. Maybe something like the below? Will add that to my next patch set unless I hear complaints. Ciao, Thorsten --- diff --git a/Documentation/process/stable-kernel-rules.rst b/Documentation/process/stable-kernel-rules.rst index 727ad7f758e3e0..5a47ed06081e41 100644 --- a/Documentation/process/stable-kernel-rules.rst +++ b/Documentation/process/stable-kernel-rules.rst @@ -72,6 +72,10 @@ for stable trees, add this tag in the sign-off area:: Cc: sta...@vger.kernel.org +Use ``Cc: sta...@kernel.org`` instead when fixing an unpublished vulnerability: +it reduces the chance of someone exposing the fix to the public by way of +'git send-email', as mails sent to that address are not delivered anywhere. + Once the patch is mainlined it will be applied to the stable tree without anything else needing to be done by the author or subsystem maintainer.
[PATCH net-next v8 7/7] rstreason: make it work in trace world
From: Jason Xing At last, we should let it work by introducing this reset reason in trace world. One of the possible expected outputs is: ... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx state=TCP_ESTABLISHED reason=NOT_SPECIFIED Signed-off-by: Jason Xing Reviewed-by: Steven Rostedt (Google) --- include/trace/events/tcp.h | 26 ++ net/ipv4/tcp_ipv4.c| 2 +- net/ipv4/tcp_output.c | 2 +- net/ipv6/tcp_ipv6.c| 2 +- 4 files changed, 25 insertions(+), 7 deletions(-) diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h index 5c04a61a11c2..49b5ee091cf6 100644 --- a/include/trace/events/tcp.h +++ b/include/trace/events/tcp.h @@ -11,6 +11,7 @@ #include #include #include +#include /* * tcp event with arguments sk and skb @@ -74,20 +75,32 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb, TP_ARGS(sk, skb) ); +#undef FN +#define FN(reason) TRACE_DEFINE_ENUM(SK_RST_REASON_##reason); +DEFINE_RST_REASON(FN, FN) + +#undef FN +#undef FNe +#define FN(reason) { SK_RST_REASON_##reason, #reason }, +#define FNe(reason){ SK_RST_REASON_##reason, #reason } + /* * skb of trace_tcp_send_reset is the skb that caused RST. 
In case of * active reset, skb should be NULL */ TRACE_EVENT(tcp_send_reset, - TP_PROTO(const struct sock *sk, const struct sk_buff *skb), + TP_PROTO(const struct sock *sk, +const struct sk_buff *skb, +const enum sk_rst_reason reason), - TP_ARGS(sk, skb), + TP_ARGS(sk, skb, reason), TP_STRUCT__entry( __field(const void *, skbaddr) __field(const void *, skaddr) __field(int, state) + __field(enum sk_rst_reason, reason) __array(__u8, saddr, sizeof(struct sockaddr_in6)) __array(__u8, daddr, sizeof(struct sockaddr_in6)) ), @@ -113,14 +126,19 @@ TRACE_EVENT(tcp_send_reset, */ TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, entry->saddr); } + __entry->reason = reason; ), - TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s", + TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s reason=%s", __entry->skbaddr, __entry->skaddr, __entry->saddr, __entry->daddr, - __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN") + __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN", + __print_symbolic(__entry->reason, DEFINE_RST_REASON(FN, FNe))) ); +#undef FN +#undef FNe + /* * tcp event with arguments sk * diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 6bd3a0fb9439..6096ac7a3a02 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -871,7 +871,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb, if (sk) arg.bound_dev_if = sk->sk_bound_dev_if; - trace_tcp_send_reset(sk, skb); + trace_tcp_send_reset(sk, skb, reason); BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) != offsetof(struct inet_timewait_sock, tw_bound_dev_if)); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 276d9d541b01..b08ffb17d5a0 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3612,7 +3612,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority, /* skb of trace_tcp_send_reset() keeps the skb that caused RST, * skb here is different to the troublesome skb, so use NULL */ - 
trace_tcp_send_reset(sk, NULL); + trace_tcp_send_reset(sk, NULL, SK_RST_REASON_NOT_SPECIFIED); } /* Send a crossed SYN-ACK during socket establishment. diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 317d7a6e6b01..77958adf2e16 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1133,7 +1133,7 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb, label = ip6_flowlabel(ipv6h); } - trace_tcp_send_reset(sk, skb); + trace_tcp_send_reset(sk, skb, reason); tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1, ipv6_get_dsfield(ipv6h), label, priority, txhash, -- 2.37.3
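The FN()/FNe() macros in the tracepoint patch above are the usual X-macro pattern for trace events: one reason list is expanded once into TRACE_DEFINE_ENUM() calls and once into the { value, "name" } pairs that __print_symbolic() consumes, so the enum and its string table cannot drift apart. A minimal userspace sketch of the same pattern (demo names, not the kernel header):

```c
#include <assert.h>
#include <string.h>

/* One list, expanded twice: FN() for entries followed by a separator,
 * FNe() for the final entry without one. */
#define DEFINE_RST_DEMO(FN, FNe)	\
	FN(NOT_SPECIFIED)		\
	FN(NO_SOCKET)			\
	FNe(ERROR)

/* Expansion 1: the enum itself. */
#define ENUM_FN(reason)  DEMO_RST_##reason,
#define ENUM_FNe(reason) DEMO_RST_##reason
enum demo_rst_reason { DEFINE_RST_DEMO(ENUM_FN, ENUM_FNe) };

/* Expansion 2: the symbolic name table, shaped like what
 * __print_symbolic() expects. */
#define NAME_FN(reason)  { DEMO_RST_##reason, #reason },
#define NAME_FNe(reason) { DEMO_RST_##reason, #reason }
static const struct { int val; const char *name; } demo_rst_names[] = {
	DEFINE_RST_DEMO(NAME_FN, NAME_FNe)
};
```

Because both expansions come from the single DEFINE_RST_DEMO list, adding a reason in one place updates the enum and the printed name together.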
[PATCH net-next v8 6/7] mptcp: introducing a helper into active reset logic
From: Jason Xing Since we have mapped every mptcp reset reason definition in enum sk_rst_reason, introducing a new helper can cover some missing places where we have already set the subflow->reset_reason. Note: using SK_RST_REASON_NOT_SPECIFIED is the same as SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown. So we can convert it directly. Suggested-by: Paolo Abeni Signed-off-by: Jason Xing --- Link: https://lore.kernel.org/all/2d3ea199eef53cf6a0c48e21abdee0eefbdee927.ca...@redhat.com/ --- net/mptcp/protocol.c | 4 +--- net/mptcp/protocol.h | 11 +++ net/mptcp/subflow.c | 6 ++ 3 files changed, 14 insertions(+), 7 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 065967086492..4b13ca362efa 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -21,7 +21,6 @@ #endif #include #include -#include #include #include "protocol.h" #include "mib.h" @@ -2570,8 +2569,7 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk) slow = lock_sock_fast(tcp_sk); if (tcp_sk->sk_state != TCP_CLOSE) { - tcp_send_active_reset(tcp_sk, GFP_ATOMIC, - SK_RST_REASON_NOT_SPECIFIED); + mptcp_send_active_reset_reason(tcp_sk); tcp_set_state(tcp_sk, TCP_CLOSE); } unlock_sock_fast(tcp_sk, slow); diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index bbcb8c068aae..d40ad4a2f1b8 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -12,6 +12,7 @@ #include #include #include +#include #include "mptcp_pm_gen.h" @@ -609,6 +610,16 @@ sk_rst_convert_mptcp_reason(u32 reason) } } +static inline void +mptcp_send_active_reset_reason(struct sock *sk) +{ + struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk); + enum sk_rst_reason reason; + + reason = sk_rst_convert_mptcp_reason(subflow->reset_reason); + tcp_send_active_reset(sk, GFP_ATOMIC, reason); +} + static inline u64 mptcp_subflow_get_map_offset(const struct mptcp_subflow_context *subflow) { diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index fb7abf2d01ca..97ec44d1df30 100644 --- 
a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -20,7 +20,6 @@ #include #endif #include -#include #include "protocol.h" #include "mib.h" @@ -424,7 +423,7 @@ void mptcp_subflow_reset(struct sock *ssk) /* must hold: tcp_done() could drop last reference on parent */ sock_hold(sk); - tcp_send_active_reset(ssk, GFP_ATOMIC, SK_RST_REASON_NOT_SPECIFIED); + mptcp_send_active_reset_reason(ssk); tcp_done(ssk); if (!test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW, &mptcp_sk(sk)->flags)) mptcp_schedule_work(sk); @@ -1362,8 +1361,7 @@ static bool subflow_check_data_avail(struct sock *ssk) tcp_set_state(ssk, TCP_CLOSE); while ((skb = skb_peek(&ssk->sk_receive_queue))) sk_eat_skb(ssk, skb); - tcp_send_active_reset(ssk, GFP_ATOMIC, - SK_RST_REASON_NOT_SPECIFIED); + mptcp_send_active_reset_reason(ssk); WRITE_ONCE(subflow->data_avail, false); return false; } -- 2.37.3
[PATCH net-next v8 5/7] mptcp: support rstreason for passive reset
From: Jason Xing It relies on what the reset options in the skb are, as RFC 8684 says. Reusing this logic can save us much effort. This patch replaces most of the prior NOT_SPECIFIED reasons. Signed-off-by: Jason Xing --- net/mptcp/protocol.h | 28 ++++++++++++++++++++++++++++ net/mptcp/subflow.c | 22 +++++++++++++++++----- 2 files changed, 45 insertions(+), 5 deletions(-) diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index fdfa843e2d88..bbcb8c068aae 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -581,6 +581,34 @@ mptcp_subflow_ctx_reset(struct mptcp_subflow_context *subflow) WRITE_ONCE(subflow->local_id, -1); } +/* Convert reset reasons in MPTCP to enum sk_rst_reason type */ +static inline enum sk_rst_reason +sk_rst_convert_mptcp_reason(u32 reason) +{ + switch (reason) { + case MPTCP_RST_EUNSPEC: + return SK_RST_REASON_MPTCP_RST_EUNSPEC; + case MPTCP_RST_EMPTCP: + return SK_RST_REASON_MPTCP_RST_EMPTCP; + case MPTCP_RST_ERESOURCE: + return SK_RST_REASON_MPTCP_RST_ERESOURCE; + case MPTCP_RST_EPROHIBIT: + return SK_RST_REASON_MPTCP_RST_EPROHIBIT; + case MPTCP_RST_EWQ2BIG: + return SK_RST_REASON_MPTCP_RST_EWQ2BIG; + case MPTCP_RST_EBADPERF: + return SK_RST_REASON_MPTCP_RST_EBADPERF; + case MPTCP_RST_EMIDDLEBOX: + return SK_RST_REASON_MPTCP_RST_EMIDDLEBOX; + default: + /** +* It should not happen, or else errors may occur +* in MPTCP layer +*/ + return SK_RST_REASON_ERROR; + } +} + static inline u64 mptcp_subflow_get_map_offset(const struct mptcp_subflow_context *subflow) { diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index ac867d277860..fb7abf2d01ca 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -309,8 +309,13 @@ static struct dst_entry *subflow_v4_route_req(const struct sock *sk, return dst; dst_release(dst); - if (!req->syncookie) - tcp_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + if (!req->syncookie) { + struct mptcp_ext *mpext = mptcp_get_ext(skb); + enum sk_rst_reason reason; + + reason = sk_rst_convert_mptcp_reason(mpext->reset_reason); + 
tcp_request_sock_ops.send_reset(sk, skb, reason); + } return NULL; } @@ -377,8 +382,13 @@ static struct dst_entry *subflow_v6_route_req(const struct sock *sk, return dst; dst_release(dst); - if (!req->syncookie) - tcp6_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + if (!req->syncookie) { + struct mptcp_ext *mpext = mptcp_get_ext(skb); + enum sk_rst_reason reason; + + reason = sk_rst_convert_mptcp_reason(mpext->reset_reason); + tcp6_request_sock_ops.send_reset(sk, skb, reason); + } return NULL; } #endif @@ -783,6 +793,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, struct mptcp_subflow_request_sock *subflow_req; struct mptcp_options_received mp_opt; bool fallback, fallback_is_fatal; + enum sk_rst_reason reason; struct mptcp_sock *owner; struct sock *child; @@ -913,7 +924,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, tcp_rsk(req)->drop_req = true; inet_csk_prepare_for_destroy_sock(child); tcp_done(child); - req->rsk_ops->send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + reason = sk_rst_convert_mptcp_reason(mptcp_get_ext(skb)->reset_reason); + req->rsk_ops->send_reset(sk, skb, reason); /* The last child reference will be released by the caller */ return child; -- 2.37.3
[PATCH net-next v8 4/7] tcp: support rstreason for passive reset
From: Jason Xing Reuse the dropreason logic to show the exact reason of a TCP reset, so we can finally display the corresponding item in enum sk_rst_reason instead of reinventing new reset reasons. This patch replaces all the prior NOT_SPECIFIED reasons. Signed-off-by: Jason Xing --- include/net/rstreason.h | 15 +++ net/ipv4/tcp_ipv4.c | 11 +++ net/ipv6/tcp_ipv6.c | 11 +++ 3 files changed, 29 insertions(+), 8 deletions(-) diff --git a/include/net/rstreason.h b/include/net/rstreason.h index bc53b5a24505..df3b6ac0c9b3 100644 --- a/include/net/rstreason.h +++ b/include/net/rstreason.h @@ -103,4 +103,19 @@ enum sk_rst_reason { */ SK_RST_REASON_MAX, }; + +/* Convert skb drop reasons to enum sk_rst_reason type */ +static inline enum sk_rst_reason +sk_rst_convert_drop_reason(enum skb_drop_reason reason) +{ + switch (reason) { + case SKB_DROP_REASON_NOT_SPECIFIED: + return SK_RST_REASON_NOT_SPECIFIED; + case SKB_DROP_REASON_NO_SOCKET: + return SK_RST_REASON_NO_SOCKET; + default: + /* If we don't have our own corresponding reason */ + return SK_RST_REASON_NOT_SPECIFIED; + } +} #endif diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 418d11902fa7..6bd3a0fb9439 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1936,7 +1936,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) return 0; reset: - tcp_v4_send_reset(rsk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(rsk, skb, sk_rst_convert_drop_reason(reason)); discard: kfree_skb_reason(skb, reason); /* Be careful here. 
If this function gets more complicated and @@ -2278,7 +2278,10 @@ int tcp_v4_rcv(struct sk_buff *skb) } else { drop_reason = tcp_child_process(sk, nsk, skb); if (drop_reason) { - tcp_v4_send_reset(nsk, skb, SK_RST_REASON_NOT_SPECIFIED); + enum sk_rst_reason rst_reason; + + rst_reason = sk_rst_convert_drop_reason(drop_reason); + tcp_v4_send_reset(nsk, skb, rst_reason); goto discard_and_relse; } sock_put(sk); @@ -2357,7 +2360,7 @@ int tcp_v4_rcv(struct sk_buff *skb) bad_packet: __TCP_INC_STATS(net, TCP_MIB_INERRS); } else { - tcp_v4_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(NULL, skb, sk_rst_convert_drop_reason(drop_reason)); } discard_it: @@ -2409,7 +2412,7 @@ int tcp_v4_rcv(struct sk_buff *skb) tcp_v4_timewait_ack(sk, skb); break; case TCP_TW_RST: - tcp_v4_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(sk, skb, sk_rst_convert_drop_reason(drop_reason)); inet_twsk_deschedule_put(inet_twsk(sk)); goto discard_it; case TCP_TW_SUCCESS:; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 017f6293b5f4..317d7a6e6b01 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1680,7 +1680,7 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) return 0; reset: - tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(sk, skb, sk_rst_convert_drop_reason(reason)); discard: if (opt_skb) __kfree_skb(opt_skb); @@ -1865,7 +1865,10 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) } else { drop_reason = tcp_child_process(sk, nsk, skb); if (drop_reason) { - tcp_v6_send_reset(nsk, skb, SK_RST_REASON_NOT_SPECIFIED); + enum sk_rst_reason rst_reason; + + rst_reason = sk_rst_convert_drop_reason(drop_reason); + tcp_v6_send_reset(nsk, skb, rst_reason); goto discard_and_relse; } sock_put(sk); @@ -1942,7 +1945,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) bad_packet: __TCP_INC_STATS(net, TCP_MIB_INERRS); } else { - tcp_v6_send_reset(NULL, skb, 
SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(NULL, skb, sk_rst_convert_drop_reason(drop_reason)); } discard_it: @@ -1998,7 +2001,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) tcp_v6_timewait_ack(sk, skb); break; case TCP_TW_RST: - tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(sk, skb, sk_rst_convert_drop_reason(drop_reason));
[PATCH net-next v8 3/7] rstreason: prepare for active reset
From: Jason Xing Like what we did to passive reset: only passing possible reset reason in each active reset path. No functional changes. Signed-off-by: Jason Xing --- include/net/tcp.h | 3 ++- net/ipv4/tcp.c| 15 ++- net/ipv4/tcp_output.c | 3 ++- net/ipv4/tcp_timer.c | 9 ++--- net/mptcp/protocol.c | 4 +++- net/mptcp/subflow.c | 5 +++-- 6 files changed, 26 insertions(+), 13 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index b935e1ae4caf..adeacc9aa28a 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -670,7 +670,8 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue, void tcp_send_probe0(struct sock *); int tcp_write_wakeup(struct sock *, int mib); void tcp_send_fin(struct sock *sk); -void tcp_send_active_reset(struct sock *sk, gfp_t priority); +void tcp_send_active_reset(struct sock *sk, gfp_t priority, + enum sk_rst_reason reason); int tcp_send_synack(struct sock *); void tcp_push_one(struct sock *, unsigned int mss_now); void __tcp_send_ack(struct sock *sk, u32 rcv_nxt); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index f23b9ea5..4ec0f4feee00 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -275,6 +275,7 @@ #include #include #include +#include #include #include @@ -2811,7 +2812,8 @@ void __tcp_close(struct sock *sk, long timeout) /* Unread data was tossed, zap the connection. */ NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE); tcp_set_state(sk, TCP_CLOSE); - tcp_send_active_reset(sk, sk->sk_allocation); + tcp_send_active_reset(sk, sk->sk_allocation, + SK_RST_REASON_NOT_SPECIFIED); } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) { /* Check zero linger _after_ checking for unread data. 
*/ sk->sk_prot->disconnect(sk, 0); @@ -2885,7 +2887,8 @@ void __tcp_close(struct sock *sk, long timeout) struct tcp_sock *tp = tcp_sk(sk); if (READ_ONCE(tp->linger2) < 0) { tcp_set_state(sk, TCP_CLOSE); - tcp_send_active_reset(sk, GFP_ATOMIC); + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONLINGER); } else { @@ -2903,7 +2906,8 @@ void __tcp_close(struct sock *sk, long timeout) if (sk->sk_state != TCP_CLOSE) { if (tcp_check_oom(sk, 0)) { tcp_set_state(sk, TCP_CLOSE); - tcp_send_active_reset(sk, GFP_ATOMIC); + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY); } else if (!check_net(sock_net(sk))) { @@ -3007,7 +3011,7 @@ int tcp_disconnect(struct sock *sk, int flags) /* The last check adjusts for discrepancy of Linux wrt. RFC * states */ - tcp_send_active_reset(sk, gfp_any()); + tcp_send_active_reset(sk, gfp_any(), SK_RST_REASON_NOT_SPECIFIED); WRITE_ONCE(sk->sk_err, ECONNRESET); } else if (old_state == TCP_SYN_SENT) WRITE_ONCE(sk->sk_err, ECONNRESET); @@ -4564,7 +4568,8 @@ int tcp_abort(struct sock *sk, int err) smp_wmb(); sk_error_report(sk); if (tcp_need_reset(sk->sk_state)) - tcp_send_active_reset(sk, GFP_ATOMIC); + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); tcp_done(sk); } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 61119d42b0fd..276d9d541b01 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3586,7 +3586,8 @@ void tcp_send_fin(struct sock *sk) * was unread data in the receive queue. This behavior is recommended * by RFC 2525, section 2.17. 
-DaveM */ -void tcp_send_active_reset(struct sock *sk, gfp_t priority) +void tcp_send_active_reset(struct sock *sk, gfp_t priority, + enum sk_rst_reason reason) { struct sk_buff *skb; diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 976db57b95d4..83fe7f62f7f1 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -22,6 +22,7 @@ #include #include #include +#include static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk) { @@ -127,7 +128,8 @@ static int tcp_out_of_resources(struct sock *sk, bool do_reset) (!tp->snd_wnd && !tp->packets_out))
Re: [PATCH v12 10/14] x86/sgx: Charge mem_cgroup for per-cgroup reclamation
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote: > Enclave Page Cache(EPC) memory can be swapped out to regular system > memory, and the consumed memory should be charged to a proper > mem_cgroup. Currently the selection of mem_cgroup to charge is done in > sgx_encl_get_mem_cgroup(). But it considers all contexts other than the > ksgxd thread are user processes. With the new EPC cgroup implementation, > the swapping can also happen in EPC cgroup work-queue threads. In those > cases, it improperly selects the root mem_cgroup to charge for the RAM > usage. > > Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take > an additional argument to explicitly specify the mm struct to charge for > allocations. Callers from background kthreads not associated with a > charging mm struct would set it to NULL, while callers in user process > contexts set it to current->mm. > > Internally, it handles the case when the charging mm given is NULL, by > searching for an mm struct from enclave's mm_list. > > Signed-off-by: Haitao Huang > Reported-by: Mikko Ylinen > Tested-by: Mikko Ylinen > Tested-by: Jarkko Sakkinen > Reviewed-by: Kai Huang
[PATCH net-next v8 1/7] net: introduce rstreason to detect why the RST is sent
From: Jason Xing Add a new standalone file for easy future extension to support both active reset and passive reset in the TCP/DCCP/MPTCP protocols. This patch only does the preparations for the reset reason mechanism, nothing else changes. The reset reasons are divided into three parts: 1) reuse drop reasons for passive reset in TCP 2) our own independent reasons which aren't relying on other reasons at all 3) reuse MP_TCPRST option for MPTCP The benefits of a standalone reset reason are listed here: 1) it can cover more than one case, such as reset reasons in MPTCP, active reset reasons. 2) people can easily and quickly understand and maintain this mechanism. 3) we get a unified output format with the prefix stripped. 4) more new reset reasons are on the way ... I will implement the basic code of active/passive reset reasons in those three protocols, which are not complete at this moment. For the passive reset part in TCP, I only introduce the NO_SOCKET common case, which serves as an example. After this series is applied, it will have the ability to open a new gate to let other people contribute more reasons into it :) Signed-off-by: Jason Xing --- include/net/rstreason.h | 106 1 file changed, 106 insertions(+) create mode 100644 include/net/rstreason.h diff --git a/include/net/rstreason.h b/include/net/rstreason.h new file mode 100644 index ..bc53b5a24505 --- /dev/null +++ b/include/net/rstreason.h @@ -0,0 +1,106 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ + +#ifndef _LINUX_RSTREASON_H +#define _LINUX_RSTREASON_H +#include +#include + +#define DEFINE_RST_REASON(FN, FNe) \ + FN(NOT_SPECIFIED) \ + FN(NO_SOCKET) \ + FN(MPTCP_RST_EUNSPEC) \ + FN(MPTCP_RST_EMPTCP)\ + FN(MPTCP_RST_ERESOURCE) \ + FN(MPTCP_RST_EPROHIBIT) \ + FN(MPTCP_RST_EWQ2BIG) \ + FN(MPTCP_RST_EBADPERF) \ + FN(MPTCP_RST_EMIDDLEBOX)\ + FN(ERROR) \ + FNe(MAX) + +/** + * enum sk_rst_reason - the reasons of socket reset + * + * The reasons of sk reset, which are used in DCCP/TCP/MPTCP protocols. 
+ * + * There are three parts in order: + * 1) skb drop reasons: relying on drop reasons for such as passive reset + * 2) independent reset reasons: such as active reset reasons + * 3) reset reasons in MPTCP: only for MPTCP use + */ +enum sk_rst_reason { + /* Refer to include/net/dropreason-core.h +* Rely on skb drop reasons because it indicates exactly why RST +* could happen. +*/ + /** @SK_RST_REASON_NOT_SPECIFIED: reset reason is not specified */ + SK_RST_REASON_NOT_SPECIFIED, + /** @SK_RST_REASON_NO_SOCKET: no valid socket that can be used */ + SK_RST_REASON_NO_SOCKET, + + /* Copy from include/uapi/linux/mptcp.h. +* These reset fields will not be changed since they adhere to +* RFC 8684. So do not touch them. I'm going to list each definition +* of them respectively. +*/ + /** +* @SK_RST_REASON_MPTCP_RST_EUNSPEC: Unspecified error. +* This is the default error; it implies that the subflow is no +* longer available. The presence of this option shows that the +* RST was generated by an MPTCP-aware device. +*/ + SK_RST_REASON_MPTCP_RST_EUNSPEC, + /** +* @SK_RST_REASON_MPTCP_RST_EMPTCP: MPTCP-specific error. +* An error has been detected in the processing of MPTCP options. +* This is the usual reason code to return in the cases where a RST +* is being sent to close a subflow because of an invalid response. +*/ + SK_RST_REASON_MPTCP_RST_EMPTCP, + /** +* @SK_RST_REASON_MPTCP_RST_ERESOURCE: Lack of resources. +* This code indicates that the sending host does not have enough +* resources to support the terminated subflow. +*/ + SK_RST_REASON_MPTCP_RST_ERESOURCE, + /** +* @SK_RST_REASON_MPTCP_RST_EPROHIBIT: Administratively prohibited. +* This code indicates that the requested subflow is prohibited by +* the policies of the sending host. +*/ + SK_RST_REASON_MPTCP_RST_EPROHIBIT, + /** +* @SK_RST_REASON_MPTCP_RST_EWQ2BIG: Too much outstanding data. 
+* This code indicates that there is an excessive amount of data +* that needs to be transmitted over the terminated subflow while +* having already been acknowledged over one or more other subflows. +* This may occur if a path has been unavailable for a short period +* and it is more efficient to reset and start again than it is to +* retransmit the queued data. +*/ + SK_RST_REASON_MPTCP_RST_EWQ2BIG, + /** +* @SK_RST_REASON_MPTCP_RST_EBADPERF: Unacceptable performance. +* This code
[PATCH net-next v8 2/7] rstreason: prepare for passive reset
From: Jason Xing

Adjust the parameter and support passing reason of reset which
is for now NOT_SPECIFIED. No functional changes.

Signed-off-by: Jason Xing
---
 include/net/request_sock.h |  4 +++-
 net/dccp/ipv4.c            | 10 ++
 net/dccp/ipv6.c            | 10 ++
 net/dccp/minisocks.c       |  3 ++-
 net/ipv4/tcp_ipv4.c        | 12 +++-
 net/ipv4/tcp_minisocks.c   |  3 ++-
 net/ipv6/tcp_ipv6.c        | 15 +--
 net/mptcp/subflow.c        |  8 +---
 8 files changed, 40 insertions(+), 25 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 004e651e6067..bdc737832da6 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -18,6 +18,7 @@
 #include
 #include
+#include <net/rstreason.h>

 struct request_sock;
 struct sk_buff;
@@ -34,7 +35,8 @@ struct request_sock_ops {
	void		(*send_ack)(const struct sock *sk, struct sk_buff *skb,
				    struct request_sock *req);
	void		(*send_reset)(const struct sock *sk,
-				      struct sk_buff *skb);
+				      struct sk_buff *skb,
+				      enum sk_rst_reason reason);
	void		(*destructor)(struct request_sock *req);
	void		(*syn_ack_timeout)(const struct request_sock *req);
 };
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9fc9cea4c251..ff41bd6f99c3 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include <net/rstreason.h>

 #include "ackvec.h"
 #include "ccid.h"
@@ -521,7 +522,8 @@ static int dccp_v4_send_response(const struct sock *sk, struct request_sock *req
	return err;
 }

-static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb)
+static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb,
+				   enum sk_rst_reason reason)
 {
	int err;
	const struct iphdr *rxiph;
@@ -706,7 +708,7 @@ int dccp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
	return 0;

 reset:
-	dccp_v4_ctl_send_reset(sk, skb);
+	dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
	kfree_skb(skb);
	return 0;
 }
@@ -869,7 +871,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
	if (nsk == sk) {
		reqsk_put(req);
	} else if (dccp_child_process(sk, nsk, skb)) {
-		dccp_v4_ctl_send_reset(sk, skb);
+		dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
		goto discard_and_relse;
	} else {
		sock_put(sk);
@@ -909,7 +911,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
	if (dh->dccph_type != DCCP_PKT_RESET) {
		DCCP_SKB_CB(skb)->dccpd_reset_code =
					DCCP_RESET_CODE_NO_CONNECTION;
-		dccp_v4_ctl_send_reset(sk, skb);
+		dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
	}

 discard_it:
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index c8ca703dc331..85f4b8fdbe5e 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include <net/rstreason.h>

 #include "dccp.h"
 #include "ipv6.h"
@@ -256,7 +257,8 @@ static void dccp_v6_reqsk_destructor(struct request_sock *req)
	kfree_skb(inet_rsk(req)->pktopts);
 }

-static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb)
+static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb,
+				   enum sk_rst_reason reason)
 {
	const struct ipv6hdr *rxip6h;
	struct sk_buff *skb;
@@ -656,7 +658,7 @@ static int dccp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
	return 0;

 reset:
-	dccp_v6_ctl_send_reset(sk, skb);
+	dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
 discard:
	if (opt_skb != NULL)
		__kfree_skb(opt_skb);
@@ -762,7 +764,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
	if (nsk == sk) {
		reqsk_put(req);
	} else if (dccp_child_process(sk, nsk, skb)) {
-		dccp_v6_ctl_send_reset(sk, skb);
+		dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
		goto discard_and_relse;
	} else {
		sock_put(sk);
@@ -801,7 +803,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
	if (dh->dccph_type != DCCP_PKT_RESET) {
		DCCP_SKB_CB(skb)->dccpd_reset_code =
					DCCP_RESET_CODE_NO_CONNECTION;
-		dccp_v6_ctl_send_reset(sk, skb);
+		dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
	}

 discard_it:
diff --git a/net/dccp/minisocks.c
[PATCH net-next v8 0/7] Implement reset reason mechanism to detect
From: Jason Xing

In production, there are many cases where an RST skb is sent, but we
don't have a convenient/fast method to detect the exact underlying
reasons.

RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
and active kind (like tcp_send_active_reset()). The former can be traced
carefully 1) in TCP, with the help of drop reasons, which is based on
Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
RFC 8684. The latter is relatively independent and should be implemented
on our own, since active reset reasons cannot be replaced by skb drop
reasons or anything similar.

In this series, I focus on the fundamental implementation, mostly on how
the rstreason mechanism works, and give the detailed passive part as an
example, not including the active reset part. In the future, we can go
further and refine those NOT_SPECIFIED reasons.

Here are some examples when tracing:
-0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
-0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x skaddr=x src=x dest=x state=x reason=NO_SOCKET

[1]
Link: https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/

v8
Link: https://lore.kernel.org/all/20240422030109.12891-1-kerneljasonx...@gmail.com/
1. put sk reset reasons into more natural order (Matt)
2. adjust those helper positions (Matt)
3. rename two convert functions (Matt)
4. make the kdoc format correct (Simon)

v7
Link: https://lore.kernel.org/all/20240417085143.69578-1-kerneljasonx...@gmail.com/
1. get rid of enum casts which could bring potential issues (Eric)
2. use switch-case method to map between reset reason in MPTCP and sk
   reset reason (Steven)
3. use switch-case method to map between skb drop reason and sk reset
   reason

v6
1. add back casts, or else they are treated as errors.

v5
Link: https://lore.kernel.org/all/2024045630.38420-1-kerneljasonx...@gmail.com/
1. address format issue (like reverse xmas tree) (Eric, Paolo)
2. remove unnecessary casts. (Eric)
3. introduce a helper used in mptcp active reset. See patch 6. (Paolo)

v4
Link: https://lore.kernel.org/all/20240409100934.37725-1-kerneljasonx...@gmail.com/
1. passing 'enum sk_rst_reason' for readability when tracing (Antoine)

v3
Link: https://lore.kernel.org/all/20240404072047.11490-1-kerneljasonx...@gmail.com/
1. rebase (mptcp part) and address what Mat suggested.

v2
Link: https://lore.kernel.org/all/20240403185033.47ebc...@kernel.org/
1. rebase against the latest net-next tree

Jason Xing (7):
  net: introduce rstreason to detect why the RST is sent
  rstreason: prepare for passive reset
  rstreason: prepare for active reset
  tcp: support rstreason for passive reset
  mptcp: support rstreason for passive reset
  mptcp: introducing a helper into active reset logic
  rstreason: make it work in trace world

 include/net/request_sock.h |   4 +-
 include/net/rstreason.h    | 121 +
 include/net/tcp.h          |   3 +-
 include/trace/events/tcp.h |  26 ++--
 net/dccp/ipv4.c            |  10 +--
 net/dccp/ipv6.c            |  10 +--
 net/dccp/minisocks.c       |   3 +-
 net/ipv4/tcp.c             |  15 +++--
 net/ipv4/tcp_ipv4.c        |  17 --
 net/ipv4/tcp_minisocks.c   |   3 +-
 net/ipv4/tcp_output.c      |   5 +-
 net/ipv4/tcp_timer.c       |   9 ++-
 net/ipv6/tcp_ipv6.c        |  20 +++---
 net/mptcp/protocol.c       |   2 +-
 net/mptcp/protocol.h       |  39
 net/mptcp/subflow.c        |  27 ++---
 16 files changed, 267 insertions(+), 47 deletions(-)
 create mode 100644 include/net/rstreason.h
--
2.37.3