Re: [PATCH v5 3/6] LoongArch: KVM: Add cpucfg area for kvm hypervisor

2024-02-26 Thread maobibo




On 2024/2/26 1:25 PM, WANG Xuerui wrote:

Hi,

On 2/26/24 10:04, maobibo wrote:

On 2024/2/24 5:13 PM, Huacai Chen wrote:

Hi, Bibo,

On Thu, Feb 22, 2024 at 11:28 AM Bibo Mao  wrote:


The cpucfg instruction can be used to get processor features, and a trap
exception is raised when it is executed in VM mode, so it can also be
used to provide CPU features to a VM. On real hardware, cpucfg areas
0 - 20 are used.  Here one dedicated area, 0x40000000 -- 0x400000ff, is
used for the KVM hypervisor to provide PV features, and the area can be
extended for other hypervisors in the future. This area will never be
used by real HW; it is only used by software.

After reading and thinking, I find that the hypercall method, which is
used in our production kernel, is better than this cpucfg method,
because hypercall is simpler and more straightforward, plus we don't
need to worry about conflicting with the real hardware.

No, I do not think so. cpucfg is simpler than hypercall; hypercall can
only take effect when the system runs in guest mode. In some scenarios,
like TCG mode, hypercall is an illegal instruction, whereas cpucfg still
works.
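
For concreteness, a minimal sketch of the guest-side detection the
proposed CPUCFG window enables (read_cpucfg() is the existing helper in
arch/loongarch/include/asm/loongarch.h; CPUCFG_KVM_SIG and KVM_SIGNATURE
come from the patch below):

	/* Sketch only: detect KVM through the proposed CPUCFG window. */
	static bool kvm_para_available(void)
	{
		return read_cpucfg(CPUCFG_KVM_SIG) ==
			*(unsigned int *)KVM_SIGNATURE;
	}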


While the CPUCFG instruction is universally available, it's also 
unprivileged, so any additional CPUCFG behavior also automatically 
becomes UAPI, which likely isn't what you expect. Hypervisor 
implementation details shouldn't be leaked to userland because it has no 
reason to care -- even though userland learns about the capabilities, it 
cannot actually access the resources, because relevant CSRs and/or 
instructions are privileged. Worse, the unnecessary exposure of 
information could be a problem security-wise.

cpucfg is read-only and used to represent the current hardware CPU
features, so why do you think there is a security issue?  Is there a
security issue with cpucfg2 and cpucfg6, since they can also be accessed
from user space?


The PMU feature is defined in cpucfg6, yet the PMU driver runs in kernel mode.
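
To make the exposure concrete: CPUCFG is executable at PLV3, so a plain
userspace program can probe any index, including the proposed hypervisor
window. A minimal sketch (the inline asm mirrors the kernel's
read_cpucfg() helper):

	#include <stdio.h>

	static unsigned int probe_cpucfg(unsigned int index)
	{
		unsigned int val;

		/* Unprivileged: no trap, no capability check at PLV3. */
		asm volatile("cpucfg %0, %1" : "=r"(val) : "r"(index));
		return val;
	}

	int main(void)
	{
		/* 0x40000000 is the KVM signature index proposed here. */
		printf("cpucfg[0x40000000] = 0x%x\n",
		       probe_cpucfg(0x40000000));
		return 0;
	}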


A possible way to preserve the unprivileged CPUCFG behavior would be 
acting differently based on guest CSR.CRMD.PLV: only returning data for 
the new configuration space when guest is not in PLV3. But this behavior 
isn't explicitly allowed nor disallowed in the LoongArch manuals, and is 
in my opinion unnecessarily complex.
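
A rough sketch of that idea on the emulation side (kvm_read_hw_gcsr(),
CSR_CRMD_PLV and PLV_USER are existing LoongArch definitions; gating the
window this way is only the suggestion above, not something the patch
does):

	case CPUCFG_KVM_SIG:
		/* Hide the hypervisor window from unprivileged guest code. */
		if ((kvm_read_hw_gcsr(LOONGARCH_CSR_CRMD) & CSR_CRMD_PLV)
		    == PLV_USER)
			vcpu->arch.gprs[rd] = 0;	/* as if undefined */
		else
			vcpu->arch.gprs[rd] = *(unsigned int *)KVM_SIGNATURE;
		break;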


And regarding the lack of hypercall support from QEMU system mode 
emulation on TCG, I'd argue it's simply a matter of adding support in 
target/loongarch64. This would be attractive because it will enable easy 
development and testing of hypervisor software with QEMU -- both locally 
and in CI.

Hypercall is part of the LVZ hardware-assisted virtualization extension;
do you think adding only the hypercall instruction, without LVZ, is
possible?



The extioi virtualization extension will be added later; cpucfg can be
used to get extioi features. It is unlikely that the extioi driver should
depend on a PARA_VIRT macro, which it would have to if hypercall were
used to get the features.
And the EXTIOI feature too isn't something usable from unprivileged 
code, so I don't think it will affect the conclusions above.

Sorry, I do not know what you mean.


Regards
Bibo Mao








Re: [PATCH v5 3/6] LoongArch: KVM: Add cpucfg area for kvm hypervisor

2024-02-26 Thread maobibo




On 2024/2/26 2:12 PM, Huacai Chen wrote:

On Mon, Feb 26, 2024 at 10:04 AM maobibo  wrote:




On 2024/2/24 5:13 PM, Huacai Chen wrote:

Hi, Bibo,

On Thu, Feb 22, 2024 at 11:28 AM Bibo Mao  wrote:


The cpucfg instruction can be used to get processor features, and a trap
exception is raised when it is executed in VM mode, so it can also be
used to provide CPU features to a VM. On real hardware, cpucfg areas
0 - 20 are used.  Here one dedicated area, 0x40000000 -- 0x400000ff, is
used for the KVM hypervisor to provide PV features, and the area can be
extended for other hypervisors in the future. This area will never be
used by real HW; it is only used by software.

After reading and thinking, I find that the hypercall method, which is
used in our production kernel, is better than this cpucfg method,
because hypercall is simpler and more straightforward, plus we don't
need to worry about conflicting with the real hardware.

No, I do not think so. cpucfg is simpler than hypercall; hypercall can
only take effect when the system runs in guest mode. In some scenarios,
like TCG mode, hypercall is an illegal instruction, whereas cpucfg still
works.

Nearly all architectures use hypercall except x86 for its historical
Only x86 supports multiple hypervisors, and multiple hypervisors exist 
only on x86. It is an advantage, not a historical accident.



reasons. If we use CPUCFG, then the hypervisor information is
unnecessarily leaked to userspace, and this may be a security issue.
Meanwhile, I don't think TCG mode needs PV features.
Besides PV features, there are other features that differ from real hw, 
such as virtio devices and the virtual interrupt controller.


Regards
Bibo Mao



I consulted with Jiaxun before, and maybe he can give some more comments.



The extioi virtualization extension will be added later; cpucfg can be
used to get extioi features. It is unlikely that the extioi driver should
depend on a PARA_VIRT macro, which it would have to if hypercall were
used to get the features.

CPUCFG is per-core information; if we really need something about
extioi, it should be in IOCSR (LOONGARCH_IOCSR_FEATURES).


Huacai
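
For reference, that is how per-node capabilities are already probed
today; a minimal sketch (iocsr_read64(), LOONGARCH_IOCSR_FEATURES and
IOCSRF_EXTIOI are existing kernel definitions; a bit for the EXTIOI
virtualization extension would be a new addition):

	static bool extioi_present(void)
	{
		return iocsr_read64(LOONGARCH_IOCSR_FEATURES) & IOCSRF_EXTIOI;
	}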



Regards
Bibo Mao



Huacai



Signed-off-by: Bibo Mao 
---
   arch/loongarch/include/asm/inst.h  |  1 +
   arch/loongarch/include/asm/loongarch.h | 10 ++
   arch/loongarch/kvm/exit.c  | 46 +-
   3 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/arch/loongarch/include/asm/inst.h b/arch/loongarch/include/asm/inst.h
index d8f637f9e400..ad120f924905 100644
--- a/arch/loongarch/include/asm/inst.h
+++ b/arch/loongarch/include/asm/inst.h
@@ -67,6 +67,7 @@ enum reg2_op {
  revhd_op= 0x11,
  extwh_op= 0x16,
  extwb_op= 0x17,
+   cpucfg_op   = 0x1b,
  iocsrrdb_op = 0x19200,
  iocsrrdh_op = 0x19201,
  iocsrrdw_op = 0x19202,
diff --git a/arch/loongarch/include/asm/loongarch.h b/arch/loongarch/include/asm/loongarch.h
index 46366e783c84..a1d22e8b6f94 100644
--- a/arch/loongarch/include/asm/loongarch.h
+++ b/arch/loongarch/include/asm/loongarch.h
@@ -158,6 +158,16 @@
   #define  CPUCFG48_VFPU_CG  BIT(2)
   #define  CPUCFG48_RAM_CG   BIT(3)

+/*
+ * cpucfg index area: 0x40000000 -- 0x400000ff
+ * SW emulation for KVM hypervisor
+ */
+#define CPUCFG_KVM_BASE			0x40000000UL
+#define CPUCFG_KVM_SIZE			0x100
+#define CPUCFG_KVM_SIG			CPUCFG_KVM_BASE
+#define  KVM_SIGNATURE			"KVM\0"
+#define CPUCFG_KVM_FEATURE		(CPUCFG_KVM_BASE + 4)
+
   #ifndef __ASSEMBLY__

   /* CSR */
diff --git a/arch/loongarch/kvm/exit.c b/arch/loongarch/kvm/exit.c
index 923bbca9bd22..6a38fd59d86d 100644
--- a/arch/loongarch/kvm/exit.c
+++ b/arch/loongarch/kvm/exit.c
@@ -206,10 +206,37 @@ int kvm_emu_idle(struct kvm_vcpu *vcpu)
  return EMULATE_DONE;
   }

-static int kvm_trap_handle_gspr(struct kvm_vcpu *vcpu)
+static int kvm_emu_cpucfg(struct kvm_vcpu *vcpu, larch_inst inst)
   {
  int rd, rj;
  unsigned int index;
+
+   rd = inst.reg2_format.rd;
+   rj = inst.reg2_format.rj;
+   ++vcpu->stat.cpucfg_exits;
+   index = vcpu->arch.gprs[rj];
+
+   /*
+* By LoongArch Reference Manual 2.2.10.5
+* Return value is 0 for undefined cpucfg index
+*/
+   switch (index) {
+   case 0 ... (KVM_MAX_CPUCFG_REGS - 1):
+   vcpu->arch.gprs[rd] = vcpu->arch.cpucfg[index];
+   break;
+   case CPUCFG_KVM_SIG:
+   vcpu->arch.gprs[rd] = *(unsigned int *)KVM_SIGNATURE;
+   break;
+   default:
+   vcpu->arch.gprs[rd] = 0;
+   break;
+   }
+
+   return EMULATE_DONE;
+}
+
+static int kvm_trap_handle_gspr(struct kvm_vcpu *vcpu)
+{
  unsigned long curr_pc;
  larch_inst inst;
  enum emulation_result er = EMULATE_DONE;
@@ -224,21 +251,8 @@ static int kvm_trap_handle_gspr(struct kvm_vcpu *vcpu)
  er = EMULATE_FAIL;
  switch (((i

Re: [PATCH v5 3/6] LoongArch: KVM: Add cpucfg area for kvm hypervisor

2024-02-26 Thread WANG Xuerui

On 2/26/24 16:00, maobibo wrote:



On 2024/2/26 1:25 PM, WANG Xuerui wrote:

Hi,

On 2/26/24 10:04, maobibo wrote:

On 2024/2/24 5:13 PM, Huacai Chen wrote:

Hi, Bibo,

On Thu, Feb 22, 2024 at 11:28 AM Bibo Mao  wrote:


The cpucfg instruction can be used to get processor features, and a trap
exception is raised when it is executed in VM mode, so it can also be
used to provide CPU features to a VM. On real hardware, cpucfg areas
0 - 20 are used.  Here one dedicated area, 0x40000000 -- 0x400000ff, is
used for the KVM hypervisor to provide PV features, and the area can be
extended for other hypervisors in the future. This area will never be
used by real HW; it is only used by software.

After reading and thinking, I find that the hypercall method, which is
used in our production kernel, is better than this cpucfg method,
because hypercall is simpler and more straightforward, plus we don't
need to worry about conflicting with the real hardware.

No, I do not think so. cpucfg is simpler than hypercall; hypercall can
only take effect when the system runs in guest mode. In some scenarios,
like TCG mode, hypercall is an illegal instruction, whereas cpucfg still
works.


While the CPUCFG instruction is universally available, it's also 
unprivileged, so any additional CPUCFG behavior also automatically 
becomes UAPI, which likely isn't what you expect. Hypervisor 
implementation details shouldn't be leaked to userland because it has 
no reason to care -- even though userland learns about the 
capabilities, it cannot actually access the resources, because 
relevant CSRs and/or instructions are privileged. Worse, the 
unnecessary exposure of information could be a problem security-wise.

cpucfg is read-only and used to represent the current hardware CPU
features, so why do you think there is a security issue?  Is there a
security issue with cpucfg2 and cpucfg6, since they can also be accessed
from user space?


The PMU feature is defined in cpucfg6, yet the PMU driver runs in kernel mode.


These CPUCFG leaves were added before the existence of LoongArch was 
publicized, without community review. If early drafts of the manual had 
been available to community reviewers, I at least would have strongly 
NAKed them.




A possible way to preserve the unprivileged CPUCFG behavior would be 
acting differently based on guest CSR.CRMD.PLV: only returning data 
for the new configuration space when guest is not in PLV3. But this 
behavior isn't explicitly allowed nor disallowed in the LoongArch 
manuals, and is in my opinion unnecessarily complex.


And regarding the lack of hypercall support from QEMU system mode 
emulation on TCG, I'd argue it's simply a matter of adding support in 
target/loongarch64. This would be attractive because it will enable 
easy development and testing of hypervisor software with QEMU -- both 
locally and in CI.

Hypercall is part of the LVZ hardware-assisted virtualization extension;
do you think adding only the hypercall instruction, without LVZ, is
possible?


I cannot comment on the actual feasibility of doing so, because I don't 
have access to the LVZ manual, which *still* isn't publicly available. 
But my intuition says it should be a more-or-less trivial processor mode 
transition, like with syscall -- though whether that's indeed the case I 
can't prove or disprove.


The extioi virtualization extension will be added later; cpucfg can be 
used to get extioi features. It is unlikely that the extioi driver 
should depend on a PARA_VIRT macro, which it would have to if hypercall 
were used to get the features.
And the EXTIOI feature too isn't something usable from unprivileged 
code, so I don't think it will affect the conclusions above.

Sorry, I do not know what you mean.


I was just saying this example provided no additional information, at 
least for me -- while it's appreciated that you informed the community 
of your intended future use case, as I stated in the first paragraph of 
my reply, it looks essentially the same, because both PV and EXTIOI are 
privileged things.


--
WANG "xen0n" Xuerui

Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/




Re: [PATCH net-next 1/2] net/vsockmon: Leverage core stats allocator

2024-02-26 Thread Stefano Garzarella

On Fri, Feb 23, 2024 at 03:58:37AM -0800, Breno Leitao wrote:

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation can be done in the net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure the free happens in
the right spot, etc.). This is the core's responsibility now.

Remove the allocation in the vsockmon driver and leverage the network
core allocation instead.

Signed-off-by: Breno Leitao 
---
drivers/net/vsockmon.c | 16 +---
1 file changed, 1 insertion(+), 15 deletions(-)


Thanks for this patch!

Reviewed-by: Stefano Garzarella 



diff --git a/drivers/net/vsockmon.c b/drivers/net/vsockmon.c
index b1bb1b04b664..a0b4dca36baf 100644
--- a/drivers/net/vsockmon.c
+++ b/drivers/net/vsockmon.c
@@ -13,19 +13,6 @@
#define DEFAULT_MTU (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE + \
 sizeof(struct af_vsockmon_hdr))

-static int vsockmon_dev_init(struct net_device *dev)
-{
-   dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
-   if (!dev->lstats)
-   return -ENOMEM;
-   return 0;
-}
-
-static void vsockmon_dev_uninit(struct net_device *dev)
-{
-   free_percpu(dev->lstats);
-}
-
struct vsockmon {
struct vsock_tap vt;
};
@@ -79,8 +66,6 @@ static int vsockmon_change_mtu(struct net_device *dev, int new_mtu)
}

static const struct net_device_ops vsockmon_ops = {
-   .ndo_init = vsockmon_dev_init,
-   .ndo_uninit = vsockmon_dev_uninit,
.ndo_open = vsockmon_open,
.ndo_stop = vsockmon_close,
.ndo_start_xmit = vsockmon_xmit,
@@ -112,6 +97,7 @@ static void vsockmon_setup(struct net_device *dev)
dev->flags = IFF_NOARP;

dev->mtu = DEFAULT_MTU;
+   dev->pcpu_stat_type = NETDEV_PCPU_STAT_LSTATS;
}

static struct rtnl_link_ops vsockmon_link_ops __read_mostly = {
--
2.39.3
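
For readers new to the scheme: since commit 34d21de99cea9 the core
allocates the per-cpu stats during netdevice registration, keyed off
dev->pcpu_stat_type. A simplified sketch of that core logic (the real
helper lives in net/core/dev.c; its exact shape may differ):

	static int netdev_do_alloc_pcpu_stats(struct net_device *dev)
	{
		switch (dev->pcpu_stat_type) {
		case NETDEV_PCPU_STAT_NONE:
			return 0;
		case NETDEV_PCPU_STAT_LSTATS:
			dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
			break;
		case NETDEV_PCPU_STAT_TSTATS:
			dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
			break;
		case NETDEV_PCPU_STAT_DSTATS:
			dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
			break;
		default:
			return -EINVAL;
		}

		/* simplified: the real code checks the member it just assigned */
		return dev->lstats ? 0 : -ENOMEM;
	}

The matching free happens in the core on unregister, which is why the
driver-side .ndo_init/.ndo_uninit pair can be dropped entirely.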






Re: [PATCH net-next 2/2] net/vsockmon: Do not set zeroed statistics

2024-02-26 Thread Stefano Garzarella

On Fri, Feb 23, 2024 at 03:58:38AM -0800, Breno Leitao wrote:

Do not set the rtnl_link_stats64 fields to zero, since they are zeroed
before ops->ndo_get_stats64 is called in the core dev_get_stats() function.

Signed-off-by: Breno Leitao 
---
drivers/net/vsockmon.c | 3 ---
1 file changed, 3 deletions(-)


Reviewed-by: Stefano Garzarella 



diff --git a/drivers/net/vsockmon.c b/drivers/net/vsockmon.c
index a0b4dca36baf..a1ba5169ed5d 100644
--- a/drivers/net/vsockmon.c
+++ b/drivers/net/vsockmon.c
@@ -46,9 +46,6 @@ static void
vsockmon_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
{
dev_lstats_read(dev, &stats->rx_packets, &stats->rx_bytes);
-
-   stats->tx_packets = 0;
-   stats->tx_bytes = 0;
}

static int vsockmon_is_valid_mtu(int new_mtu)
--
2.39.3
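
The relevant core logic is in dev_get_stats() (net/core/dev.c), roughly:

	if (ops->ndo_get_stats64) {
		memset(storage, 0, sizeof(*storage));	/* zeroes every field */
		ops->ndo_get_stats64(dev, storage);
	}

so any field the driver doesn't fill in, including tx_packets and
tx_bytes here, is already zero.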






Re: [PATCH net-next 2/2] net/vsockmon: Do not set zeroed statistics

2024-02-26 Thread Jason Xing
On Fri, Feb 23, 2024 at 7:59 PM Breno Leitao  wrote:
>
> Do not set the rtnl_link_stats64 fields to zero, since they are zeroed
> before ops->ndo_get_stats64 is called in the core dev_get_stats() function.
>
> Signed-off-by: Breno Leitao 

Reviewed-by: Jason Xing 

Another similar piece of code that can also be changed later looks like this:

diff --git a/drivers/net/nlmon.c b/drivers/net/nlmon.c
index 5e19a6839dea..9b205b152734 100644
--- a/drivers/net/nlmon.c
+++ b/drivers/net/nlmon.c
@@ -56,10 +56,8 @@ nlmon_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
dev_lstats_read(dev, &packets, &bytes);

stats->rx_packets = packets;
-   stats->tx_packets = 0;

stats->rx_bytes = bytes;
-   stats->tx_bytes = 0;
 }

 static u32 always_on(struct net_device *dev)

> ---
>  drivers/net/vsockmon.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/drivers/net/vsockmon.c b/drivers/net/vsockmon.c
> index a0b4dca36baf..a1ba5169ed5d 100644
> --- a/drivers/net/vsockmon.c
> +++ b/drivers/net/vsockmon.c
> @@ -46,9 +46,6 @@ static void
>  vsockmon_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
>  {
> dev_lstats_read(dev, &stats->rx_packets, &stats->rx_bytes);
> -
> -   stats->tx_packets = 0;
> -   stats->tx_bytes = 0;
>  }
>
>  static int vsockmon_is_valid_mtu(int new_mtu)
> --
> 2.39.3
>
>



[PATCH] mm: add alloc_contig_migrate_range allocation statistics

2024-02-26 Thread Richard Chang
alloc_contig_migrate_range has all the information needed to understand
the latency of big contiguous allocations: for example, how many pages
were migrated and how many times they needed to be unmapped from the
page tables.

This patch adds a trace event to collect those allocation statistics.
In the field, it has been quite useful for understanding CMA allocation
latency.

Signed-off-by: Richard Chang 
---
 include/trace/events/kmem.h | 39 +
 mm/internal.h   |  3 ++-
 mm/page_alloc.c | 30 +++-
 mm/page_isolation.c |  2 +-
 4 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 58688768ef0f..964704d76f9f 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -304,6 +304,45 @@ TRACE_EVENT(mm_page_alloc_extfrag,
__entry->change_ownership)
 );
 
+TRACE_EVENT(mm_alloc_contig_migrate_range_info,
+
+   TP_PROTO(unsigned long start,
+unsigned long end,
+int migratetype,
+unsigned long nr_migrated,
+unsigned long nr_reclaimed,
+unsigned long nr_mapped),
+
+   TP_ARGS(start, end, migratetype,
+   nr_migrated, nr_reclaimed, nr_mapped),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, start)
+   __field(unsigned long, end)
+   __field(int, migratetype)
+   __field(unsigned long, nr_migrated)
+   __field(unsigned long, nr_reclaimed)
+   __field(unsigned long, nr_mapped)
+   ),
+
+   TP_fast_assign(
+   __entry->start = start;
+   __entry->end = end;
+   __entry->migratetype = migratetype;
+   __entry->nr_migrated = nr_migrated;
+   __entry->nr_reclaimed = nr_reclaimed;
+   __entry->nr_mapped = nr_mapped;
+   ),
+
+   TP_printk("start=0x%lx end=0x%lx migratetype=%d nr_migrated=%lu nr_reclaimed=%lu nr_mapped=%lu",
+ __entry->start,
+ __entry->end,
+ __entry->migratetype,
+ __entry->nr_migrated,
+ __entry->nr_reclaimed,
+ __entry->nr_mapped)
+);
+
 /*
  * Required for uniquely and securely identifying mm in rss_stat tracepoint.
  */
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..e114c647e278 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -537,7 +537,8 @@ isolate_migratepages_range(struct compact_control *cc,
   unsigned long low_pfn, unsigned long end_pfn);
 
 int __alloc_contig_migrate_range(struct compact_control *cc,
-   unsigned long start, unsigned long end);
+   unsigned long start, unsigned long end,
+   int migratetype);
 
 /* Free whole pageblock and set its migration type to MIGRATE_CMA. */
 void init_cma_reserved_pageblock(struct page *page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 150d4f23b010..f840bc785afa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6219,9 +6219,14 @@ static void alloc_contig_dump_pages(struct list_head *page_list)
}
 }
 
-/* [start, end) must belong to a single zone. */
+/*
+ * [start, end) must belong to a single zone.
+ * @migratetype: using migratetype to filter the type of migration in
+ * trace_mm_alloc_contig_migrate_range_info.
+ */
 int __alloc_contig_migrate_range(struct compact_control *cc,
-   unsigned long start, unsigned long end)
+   unsigned long start, unsigned long end,
+   int migratetype)
 {
/* This function is based on compact_zone() from compaction.c. */
unsigned int nr_reclaimed;
@@ -6232,6 +6237,10 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
.nid = zone_to_nid(cc->zone),
.gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
};
+   struct page *page;
+   unsigned long total_mapped = 0;
+   unsigned long total_migrated = 0;
+   unsigned long total_reclaimed = 0;
 
lru_cache_disable();
 
@@ -6257,9 +6266,16 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
&cc->migratepages);
cc->nr_migratepages -= nr_reclaimed;
 
+   total_reclaimed += nr_reclaimed;
+   list_for_each_entry(page, &cc->migratepages, lru)
+   total_mapped += page_mapcount(page);
+
ret = migrate_pages(&cc->migratepages, alloc_migration_target,
NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE, NULL);
 
+   if (!ret)
+   total_migrated += cc->nr_migratepages;
+
   

Re: [PATCH 1/1] vhost: Added pad cleanup if vnet_hdr is not present.

2024-02-26 Thread Andrew Melnichenko
Hi all,
Ok, let me prepare a new patch v2, where I'll write a
description/analysis of the issue in the commit message.

On Thu, Feb 22, 2024 at 10:02 PM Michael S. Tsirkin  wrote:
>
> On Mon, Jan 15, 2024 at 05:32:25PM -0500, Michael S. Tsirkin wrote:
> > On Mon, Jan 15, 2024 at 09:48:40PM +0200, Andrew Melnychenko wrote:
> > > When Qemu is launched with vhost but without tap vnet_hdr, vhost
> > > tries to copy a vnet_hdr of size 0 from the socket iter onto a page
> > > that may contain some trash. That trash can be interpreted as
> > > unpredictable values for vnet_hdr, which leads to dropping some
> > > packets and, in some cases, to stalling the vhost routine when
> > > vhost_net tries to process packets and fails in a loop.
> > >
> > > Qemu options:
> > >   -netdev tap,vhost=on,vnet_hdr=off,...
> > >
> > > Signed-off-by: Andrew Melnychenko 
> > > ---
> > >  drivers/vhost/net.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index f2ed7167c848..57411ac2d08b 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -735,6 +735,9 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
> > > hdr = buf;
> > > gso = &hdr->gso;
> > >
> > > +   if (!sock_hlen)
> > > +   memset(buf, 0, pad);
> > > +
> > > if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> > > vhost16_to_cpu(vq, gso->csum_start) +
> > > vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> >
> >
> > Hmm, we need to analyse it to make sure there are no cases where we
> > leak some data to the guest here in the case where sock_hlen is set ...
>
>
> Could you post this analysis pls?
>
> > > --
> > > 2.43.0
>
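
For those following along, a simplified sketch of the failure mode under
discussion, condensed from the vhost_net_build_xdp() flow quoted above:

	/* With vnet_hdr=off, sock_hlen == 0, so this copies nothing ... */
	copied = copy_page_from_iter(alloc_frag->page,
				     alloc_frag->offset +
				     offsetof(struct tun_xdp_hdr, gso),
				     sock_hlen, from);
	hdr = buf;	/* ... and the header area keeps recycled page data */
	gso = &hdr->gso;
	/* gso->flags may now contain trash, e.g. a stale
	 * VIRTIO_NET_HDR_F_NEEDS_CSUM with bogus csum offsets, so the
	 * packet is mangled or dropped.  Hence the proposed fix: */
	if (!sock_hlen)
		memset(buf, 0, pad);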



Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Huang, Kai
On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote:
> On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai  wrote:
> 
> > 
> > 
> > On 24/02/2024 6:00 am, Haitao Huang wrote:
> > > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai   
> > > wrote:
> > > 
> > > > > > 
> > > > > Right. When code reaches to here, we already passed reclaim per  
> > > > > cgroup.
> > > > 
> > > > Yes if try_charge() failed we must do pre-cgroup reclaim.
> > > > 
> > > > > The cgroup may not at or reach limit but system has run out of  
> > > > > physical
> > > > > EPC.
> > > > > 
> > > > 
> > > > But after try_charge() we can still choose to reclaim from the current  
> > > > group,
> > > > but not necessarily have to be global, right?  I am not sure whether I  
> > > > am
> > > > missing something, but could you elaborate why we should choose to  
> > > > reclaim from
> > > > the global?
> > > > 
> > >  Once try_charge() is done and returns zero, that means the cgroup
> > > usage is charged and it's not over the usage limit. So you really
> > > can't reclaim from that cgroup if the allocation failed. The only
> > > thing you can do is reclaim globally.
> > 
> > Sorry I still cannot establish the logic here.
> > 
> > Let's say the sum of all cgroups' limits is greater than the physical
> > EPC, and enclave(s) in each cgroup could potentially fault without
> > reaching the cgroup's limit.
> > 
> > In this case, when enclave(s) in one cgroup fault, why can't we
> > reclaim from the current cgroup instead of having to reclaim globally?
> > 
> > Is there any real downside to the former, or do you just want to
> > follow the reclaim logic without cgroups at all?
> > 
> > IIUC, there's at least one advantage to reclaiming from the current
> > group: faults of enclave(s) in one group won't impact enclaves in
> > other cgroups.  E.g., this way other enclaves in other groups may
> > never need to trigger faults.
> > 
> > Or perhaps I am missing something?
> > 
> The use case here is that the user knows it's OK for group A to borrow
> some pages from group B for some time without impacting performance
> much, and vice versa.  That's why the user is overcommitting, so the
> system can run more enclaves/groups.  Otherwise, if she is concerned
> about the impact of A on B, she could lower the limit for A so that it
> never interferes, or interferes less, with B (assuming the lower limit
> is still high enough to run all enclaves in A), and sacrifice some of
> A's performance.  Or if she does not want any interference between
> groups, just don't over-commit.  So we don't really lose anything here.

But if we reclaim from the same group, it seems we could enable a use
case that allows the admin to ensure certain groups won't be impacted at
all, while allowing other groups to over-commit?

E.g., let's say we have 100M of physical EPC.  And let's say the admin
wants to run some performance-critical enclave(s) which cost 50M of EPC
without being impacted.  The admin also wants to run other enclaves which
could cost 100M of EPC in total, but EPC swapping among them is
acceptable.

If we choose to reclaim from the current EPC cgroup, then it seems the
admin can achieve the above by setting up 2 groups, with group1 having a
50M limit and group2 having a 100M limit, and then running the
performance-critical enclave(s) in group1 and the others in group2?  Or
am I missing anything?

If we choose to do global reclaim, then we cannot achieve that.
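
To spell out the two candidate behaviours in pseudo-code (the helper
names are illustrative, not taken from this series): on the allocation
path, after try_charge() succeeded but no free EPC page could be found:

	if (per_cgroup_reclaim)
		/* only this group's LRU is scanned; other groups'
		 * pages are never touched by this fault */
		sgx_cgroup_reclaim_pages(epc_cg);
	else
		/* global LRU: may evict pages charged to any group */
		sgx_reclaim_pages();
	/* then retry the allocation */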

> 
> In the case of overcommitting, even if we always reclaim from the same
> cgroup for each fault, one group may still interfere with the other:
> e.g., consider an extreme case in which group A has used up almost all
> EPC at the time group B takes a fault; B then has to fail the
> allocation and kill enclaves.

If the admin allows group A to use almost all EPC, to me it's fair to
say he/she doesn't want to run anything inside B at all, and it is
acceptable for enclaves in B to be killed.




[PATCH] dt-bindings: remoteproc: qcom,glink-rpm-edge: drop redundant type from label

2024-02-26 Thread Krzysztof Kozlowski
dtschema defines label as string, so $ref in other bindings is
redundant.

Signed-off-by: Krzysztof Kozlowski 
---
 .../devicetree/bindings/remoteproc/qcom,glink-rpm-edge.yaml  | 1 -
 1 file changed, 1 deletion(-)

diff --git 
a/Documentation/devicetree/bindings/remoteproc/qcom,glink-rpm-edge.yaml 
b/Documentation/devicetree/bindings/remoteproc/qcom,glink-rpm-edge.yaml
index 884158bccd50..3766d4513b37 100644
--- a/Documentation/devicetree/bindings/remoteproc/qcom,glink-rpm-edge.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/qcom,glink-rpm-edge.yaml
@@ -18,7 +18,6 @@ properties:
 const: qcom,glink-rpm
 
   label:
-$ref: /schemas/types.yaml#/definitions/string
 description:
   Name of the edge, used for debugging and identification purposes. The
   node name will be used if this is not present.
-- 
2.34.1




Re: [PATCH v8 31/35] fprobe: Rewrite fprobe on function-graph tracer

2024-02-26 Thread Jiri Olsa
On Mon, Feb 26, 2024 at 12:20:43AM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> Rewrite fprobe implementation on function-graph tracer.
> Major API changes are:
>  -  'nr_maxactive' field is deprecated.
>  -  This depends on CONFIG_DYNAMIC_FTRACE_WITH_ARGS or
> !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS, and
> CONFIG_HAVE_FUNCTION_GRAPH_FREGS. So it currently works only
> on x86_64.
> -  Currently the entry size is limited to 15 * sizeof(long).
> -  If there are too many fprobe exit handlers set on the same
> function, it will fail to probe.
> 
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  Changes in v8:
>   - Use trace_func_graph_ret/ent_t for fgraph_ops.
>   - Update CONFIG_FPROBE dependencies.
>   - Add ftrace_regs_get_return_address() for each arch.
>  Changes in v3:
>   - Update for new reserve_data/retrieve_data API.
>   - Fix internal push/pop on fgraph data logic so that it can
> correctly save/restore the returning fprobes.
>  Changes in v2:
>   - Add more lockdep_assert_held(fprobe_mutex)
>   - Use READ_ONCE() and WRITE_ONCE() for fprobe_hlist_node::fp.
>   - Add NOKPROBE_SYMBOL() for the functions which is called from
> entry/exit callback.
> ---
>  arch/arm64/include/asm/ftrace.h |6 
>  arch/loongarch/include/asm/ftrace.h |6 
>  arch/powerpc/include/asm/ftrace.h   |6 
>  arch/s390/include/asm/ftrace.h  |6 
>  arch/x86/include/asm/ftrace.h   |6 
>  include/linux/fprobe.h  |   54 ++-
>  include/linux/ftrace.h  |7 
>  kernel/trace/Kconfig|8 
>  kernel/trace/fprobe.c   |  638 +--
>  lib/test_fprobe.c   |   45 --
>  10 files changed, 536 insertions(+), 246 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
> index 95a8f349f871..800c75f46a13 100644
> --- a/arch/arm64/include/asm/ftrace.h
> +++ b/arch/arm64/include/asm/ftrace.h
> @@ -143,6 +143,12 @@ ftrace_regs_get_frame_pointer(const struct ftrace_regs *fregs)
>   return fregs->fp;
>  }
>  
> +static __always_inline unsigned long
> +ftrace_regs_get_return_address(const struct ftrace_regs *fregs)
> +{
> + return fregs->lr;
> +}
> +
>  static __always_inline struct pt_regs *
>  ftrace_partial_regs(const struct ftrace_regs *fregs, struct pt_regs *regs)
>  {
> diff --git a/arch/loongarch/include/asm/ftrace.h b/arch/loongarch/include/asm/ftrace.h
> index 14a1576bf948..b8432b7cc9d4 100644
> --- a/arch/loongarch/include/asm/ftrace.h
> +++ b/arch/loongarch/include/asm/ftrace.h
> @@ -81,6 +81,12 @@ ftrace_regs_set_instruction_pointer(struct ftrace_regs *fregs, unsigned long ip)
>  #define ftrace_regs_get_frame_pointer(fregs) \
>   ((fregs)->regs.regs[22])
>  
> +static __always_inline unsigned long
> +ftrace_regs_get_return_address(struct ftrace_regs *fregs)
> +{
> + return *(unsigned long *)(fregs->regs.regs[1]);
> +}
> +
>  #define ftrace_graph_func ftrace_graph_func
>  void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
>  struct ftrace_ops *op, struct ftrace_regs *fregs);
> diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h
> index 773e9011cff1..7ac1bc3e7196 100644
> --- a/arch/powerpc/include/asm/ftrace.h
> +++ b/arch/powerpc/include/asm/ftrace.h
> @@ -85,6 +85,12 @@ ftrace_regs_get_instruction_pointer(struct ftrace_regs *fregs)
>  #define ftrace_regs_query_register_offset(name) \
>   regs_query_register_offset(name)
>  
> +static __always_inline unsigned long
> +ftrace_regs_get_return_address(struct ftrace_regs *fregs)
> +{
> + return fregs->regs.link;
> +}
> +
>  struct ftrace_ops;
>  
>  #define ftrace_graph_func ftrace_graph_func
> diff --git a/arch/s390/include/asm/ftrace.h b/arch/s390/include/asm/ftrace.h
> index e910924eee59..d434a0497683 100644
> --- a/arch/s390/include/asm/ftrace.h
> +++ b/arch/s390/include/asm/ftrace.h
> @@ -89,6 +89,12 @@ ftrace_regs_get_frame_pointer(struct ftrace_regs *fregs)
>   return sp[0];   /* return backchain */
>  }
>  
> +static __always_inline unsigned long
> +ftrace_regs_get_return_address(const struct ftrace_regs *fregs)
> +{
> + return fregs->regs.gprs[14];
> +}
> +
>  #define arch_ftrace_fill_perf_regs(fregs, _regs)  do {   \
>   (_regs)->psw.addr = (fregs)->regs.psw.addr; \
>   (_regs)->gprs[15] = (fregs)->regs.gprs[15]; \
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index 7625887fc49b..979d3458a328 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h
> @@ -82,6 +82,12 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs)
>  #define ftrace_regs_get_frame_pointer(fregs) \
>   frame_pointer(&(fregs)->regs)
>  
> +static __always_inline unsigned long
> +ftrace_regs_get_return_address(struct ftrace_regs *fregs)
> +{
> + return *(unsig
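
As a usage sketch of the helper being added above: an ftrace callback
can now recover the traced function's return address from ftrace_regs
alone, without a full pt_regs (the callback signature below is the
existing ftrace_func_t):

	static void my_callback(unsigned long ip, unsigned long parent_ip,
				struct ftrace_ops *op, struct ftrace_regs *fregs)
	{
		unsigned long ret = ftrace_regs_get_return_address(fregs);

		pr_debug("%ps called from %ps, returns to %ps\n",
			 (void *)ip, (void *)parent_ip, (void *)ret);
	}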

Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Dave Hansen
On 2/26/24 03:36, Huang, Kai wrote:
>> In the case of overcommitting, even if we always reclaim from the same
>> cgroup for each fault, one group may still interfere with the other:
>> e.g., consider an extreme case in which group A has used up almost all
>> EPC at the time group B takes a fault; B then has to fail the
>> allocation and kill enclaves.
> If the admin allows group A to use almost all EPC, to me it's fair to
> say he/she doesn't want to run anything inside B at all and it is
> acceptable for enclaves in B to be killed.

Folks, I'm having a really hard time following this thread.  It sounds
like there's disagreement about when to do system-wide reclaim.  Could
someone remind me of the choices that we have?  (A proposed patch would
go a _long_ way to helping me understand)

Also, what does the core mm memcg code do?

Last, what is the simplest (least amount of code) thing that the SGX
cgroup controller could implement here?




[PATCH v2 1/7] mm/damon: refactor DAMOS_PAGEOUT with migration_mode

2024-02-26 Thread Honggyu Kim
This is a preparation patch that introduces migration modes.

damon_pa_pageout is renamed to damon_pa_migrate, and it receives an
extra argument for the migration_mode.

No functional changes applied.

Signed-off-by: Honggyu Kim 
---
 mm/damon/paddr.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 081e2a325778..277a1c4d833c 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -224,7 +224,12 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)
return false;
 }
 
-static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s)
+enum migration_mode {
+   MIG_PAGEOUT,
+};
+
+static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
+ enum migration_mode mm)
 {
unsigned long addr, applied;
LIST_HEAD(folio_list);
@@ -249,7 +254,14 @@ static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s)
 put_folio:
folio_put(folio);
}
-   applied = reclaim_pages(&folio_list);
+   switch (mm) {
+   case MIG_PAGEOUT:
+   applied = reclaim_pages(&folio_list);
+   break;
+   default:
+   /* Unexpected migration mode. */
+   return 0;
+   }
cond_resched();
return applied * PAGE_SIZE;
 }
@@ -297,7 +309,7 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
 {
switch (scheme->action) {
case DAMOS_PAGEOUT:
-   return damon_pa_pageout(r, scheme);
+   return damon_pa_migrate(r, scheme, MIG_PAGEOUT);
case DAMOS_LRU_PRIO:
return damon_pa_mark_accessed(r, scheme);
case DAMOS_LRU_DEPRIO:
-- 
2.34.1




[RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-02-26 Thread Honggyu Kim
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It said that no implementation of the demote/promote DAMOS actions had
been made.  This RFC provides their implementation for the physical
address space.


Introduction
============

With the advent of CXL/PCIe attached DRAM, which will be referred to
simply as CXL memory in this cover letter, some systems are becoming
more heterogeneous, having memory systems with different latency and
bandwidth characteristics.  These are usually handled as different NUMA
nodes in separate memory tiers, and CXL memory is used as a slow tier
because of its protocol overhead compared to local DRAM.

In this kind of system, we need to be careful to place memory pages on
the proper NUMA nodes based on memory access frequency.  Otherwise, some
frequently accessed pages might reside on slow tiers, which degrades
performance unexpectedly.  Moreover, the memory access patterns can
change at runtime.

To handle this problem, we need a way to monitor the memory access
patterns and migrate pages based on their access temperature.  The
DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based
Operation Schemes) are useful features for monitoring and migrating
pages.  DAMOS provides multiple actions based on DAMON monitoring
results; it can be used for proactive reclaim, which means swapping cold
pages out with the DAMOS_PAGEOUT action, but it doesn't support
migration actions such as demotion and promotion between tiered memory
nodes.

This series adds two new DAMOS actions: DAMOS_DEMOTE for demotion from
fast tiers and DAMOS_PROMOTE for promotion from slow tiers.  This
prevents hot pages from being stuck on slow tiers, which would degrade
performance, and cold pages can be proactively demoted to slow tiers so
that the system has a better chance of allocating more hot pages on fast
tiers.

DAMON provides various tuning knobs, but we found that proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 15~17% to 4~5% when the
system runs under high memory pressure on its fast tier DRAM nodes.


DAMON configuration
===

The specific DAMON configuration doesn't have to be in the scope of this
patch series, but a rough idea is worth sharing to explain the
evaluation result.

DAMON provides many knobs for fine tuning, and its configuration file is
generated by HMSDK[2].  HMSDK includes a gen_config.py script that
generates a json file with the full config of DAMON knobs, and it
creates multiple kdamonds, one for each NUMA node, when DAMON is enabled
so that it can run hot/cold based migration for tiered memory.


Evaluation Workload
===================

The performance evaluation is done with redis[3], which is a widely used
in-memory database, and the memory access patterns are generated via
YCSB[4].  We have measured two different workloads, with the zipfian and
latest distributions, but their configs are slightly modified to make
memory usage higher and execution time longer for better evaluation.

The idea of evaluating with these demote and promote actions covers
system-wide memory management rather than partitioning hot/cold pages of
a single workload.  The default memory allocation policy creates pages
on the fast tier DRAM node first, then allocates newly created pages on
the slow tier CXL node when the DRAM node has insufficient free space.
Once page allocation is done, those pages never move between NUMA nodes.
That is not true when NUMA balancing is used, but that is outside the
scope of this DAMON based 2-tier memory management support.

If the working set of redis fits fully into the DRAM node, then redis
will access the fast DRAM only.  Since DRAM-only access is faster than
partially accessing CXL memory in slow tiers, such an environment is not
useful for evaluating this patch series.

To make the pages of redis be distributed across the fast DRAM node and
the slow CXL node so we can evaluate our demote and promote actions, we
pre-allocate some cold memory externally using mmap and memset before
launching redis-server.  We assume that there is a large enough amount
of cold memory in datacenters, as the TMO[5] and TPP[6] papers mentioned.

The evaluation sequence is as follows.

1. Turn on DAMON with the DAMOS_DEMOTE action for the DRAM node and the
   DAMOS_PROMOTE action for the CXL node.  It demotes cold pages on the
   DRAM node and promotes hot pages on the CXL node at a regular interval.
2. Allocate a huge block of cold memory by calling mmap and memset at
   the fast tier DRAM node, then make the process sleep so that the fast
   tier has insufficient memory for redis-server (a sketch of this
   blocker follows at the end of this message).
3. Launch redis-server and load prebaked snapshot image, dump.rdb.  The
   redis-server consumes 52GB of anon pages and 33GB of file pages, but
   due to the cold memory allocated at 2, it fails 
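
A minimal sketch of the cold-memory blocker used in step 2 (the size is
illustrative; in practice it would be bound to the DRAM node, e.g. with
numactl):

	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t sz = 40UL << 30;		/* illustrative size */
		void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		memset(p, 0, sz);	/* fault the pages in on the fast tier */
		pause();		/* then leave them cold */
		return 0;
	}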

[PATCH v2 2/7] mm: make alloc_demote_folio externally invokable for migration

2024-02-26 Thread Honggyu Kim
alloc_demote_folio can be used outside of vmscan.c, so it'd be better to
remove the static keyword from it.

This function can also be used for both demotion and promotion, so it'd
be better to rename it from alloc_demote_folio to alloc_migrate_folio.

Signed-off-by: Honggyu Kim 
---
 mm/internal.h |  1 +
 mm/vmscan.c   | 10 +++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b61034bd50f5..61af6641235d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -866,6 +866,7 @@ extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
 unsigned long, unsigned long);
 
 extern void set_pageblock_order(void);
+struct folio *alloc_migrate_folio(struct folio *src, unsigned long private);
 unsigned long reclaim_pages(struct list_head *folio_list);
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
struct list_head *folio_list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bba207f41b14..b8a1a1599e3c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -910,8 +910,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct folio *alloc_demote_folio(struct folio *src,
-   unsigned long private)
+struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
 {
struct folio *dst;
nodemask_t *allowed_mask;
@@ -935,6 +934,11 @@ static struct folio *alloc_demote_folio(struct folio *src,
if (dst)
return dst;
 
+   /*
+* Allocation failed from the target node so try to allocate from
+* fallback nodes based on allowed_mask.
+* See fallback_alloc() at mm/slab.c.
+*/
mtc->gfp_mask &= ~__GFP_THISNODE;
mtc->nmask = allowed_mask;
 
@@ -973,7 +977,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
node_get_allowed_targets(pgdat, &allowed_mask);
 
/* Demotion ignores all cpuset and mempolicy settings */
-   migrate_pages(demote_folios, alloc_demote_folio, NULL,
+   migrate_pages(demote_folios, alloc_migrate_folio, NULL,
  (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
  &nr_succeeded);
 
-- 
2.34.1




[PATCH v2 3/7] mm/damon: introduce DAMOS_DEMOTE action for demotion

2024-02-26 Thread Honggyu Kim
This patch introduces the DAMOS_DEMOTE action, which is similar to
DAMOS_PAGEOUT, but demotes folios instead of swapping them out.

Since some routines are shared with pageout, many functions have similar
logic between pageout and demote.

damon_pa_migrate_folio_list() is a minimized version of
shrink_folio_list(), stripped down to handle only demotion.

Signed-off-by: Honggyu Kim 
Signed-off-by: Hyeongtak Ji 
---
 include/linux/damon.h|   2 +
 mm/damon/paddr.c | 222 ++-
 mm/damon/sysfs-schemes.c |   1 +
 3 files changed, 224 insertions(+), 1 deletion(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index e00ddf1ed39c..86e66772766b 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -105,6 +105,7 @@ struct damon_target {
  * @DAMOS_NOHUGEPAGE:  Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
  * @DAMOS_LRU_PRIO:Prioritize the region on its LRU lists.
  * @DAMOS_LRU_DEPRIO:  Deprioritize the region on its LRU lists.
+ * @DAMOS_DEMOTE:   Do demotion for the given region.
  * @DAMOS_STAT:Do nothing but count the stat.
  * @NR_DAMOS_ACTIONS:  Total number of DAMOS actions
  *
@@ -122,6 +123,7 @@ enum damos_action {
DAMOS_NOHUGEPAGE,
DAMOS_LRU_PRIO,
DAMOS_LRU_DEPRIO,
+   DAMOS_DEMOTE,
DAMOS_STAT, /* Do nothing but only record the stat */
NR_DAMOS_ACTIONS,
 };
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 277a1c4d833c..23e37ce57202 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -12,6 +12,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include "../internal.h"
 #include "ops-common.h"
@@ -226,8 +229,214 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)
 
 enum migration_mode {
MIG_PAGEOUT,
+   MIG_DEMOTE,
 };
 
+/*
+ * XXX: This is copied from demote_folio_list and renamed to migrate_folio_list.
+ * Take folios on @migrate_folios and attempt to migrate them to another node.
+ * Folios which are not migrated are left on @migrate_folios.
+ */
+static unsigned int migrate_folio_list(struct list_head *migrate_folios,
+  struct pglist_data *pgdat,
+  enum migration_mode mm)
+{
+   int target_nid = next_demotion_node(pgdat->node_id);
+   unsigned int nr_succeeded;
+   nodemask_t allowed_mask;
+
+   struct migration_target_control mtc = {
+   /*
+* Allocate from 'node', or fail quickly and quietly.
+* When this happens, 'page' will likely just be discarded
+* instead of migrated.
+*/
+   .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+   __GFP_NOMEMALLOC | GFP_NOWAIT,
+   .nid = target_nid,
+   .nmask = &allowed_mask
+   };
+
+   if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
+   return 0;
+
+   if (list_empty(migrate_folios))
+   return 0;
+
+   node_get_allowed_targets(pgdat, &allowed_mask);
+
+   /* Migration ignores all cpuset and mempolicy settings */
+   migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
+ (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+ &nr_succeeded);
+
+   __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
+
+   return nr_succeeded;
+}
+
+enum folio_references {
+   FOLIOREF_RECLAIM,
+   FOLIOREF_KEEP,
+   FOLIOREF_ACTIVATE,
+};
+
+/*
+ * XXX: This is just copied and simplified from folio_check_references at
+ *  mm/vmscan.c but without having scan_control.
+ */
+static enum folio_references folio_check_references(struct folio *folio)
+{
+   int referenced_ptes, referenced_folio;
+   unsigned long vm_flags;
+
+   referenced_ptes = folio_referenced(folio, 1, NULL, &vm_flags);
+   referenced_folio = folio_test_clear_referenced(folio);
+
+   /* rmap lock contention: rotate */
+   if (referenced_ptes == -1)
+   return FOLIOREF_KEEP;
+
+   if (referenced_ptes) {
+   /*
+* All mapped folios start out with page table
+* references from the instantiating fault, so we need
+* to look twice if a mapped file/anon folio is used more
+* than once.
+*
+* Mark it and spare it for another trip around the
+* inactive list.  Another page table reference will
+* lead to its activation.
+*
+* Note: the mark is set for activated folios as well
+* so that recently deactivated but used folios are
+* quickly recovered.
+*/
+   folio_set_referenced(folio);
+
+   if (referenced_folio || referenced_ptes > 1)
+  

[PATCH v2 4/7] mm/memory-tiers: add next_promotion_node to find promotion target

2024-02-26 Thread Honggyu Kim
From: Hyeongtak Ji 

This patch adds next_promotion_node that can be used to identify the
appropriate promotion target based on memory tiers.  When multiple
promotion target nodes are available, the nearest node is selected based
on numa distance.

Signed-off-by: Hyeongtak Ji 
---
 include/linux/memory-tiers.h | 11 +
 mm/memory-tiers.c| 43 
 2 files changed, 54 insertions(+)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 1e39d27bee41..0788e435fc50 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -50,6 +50,7 @@ int mt_set_default_dram_perf(int nid, struct node_hmem_attrs *perf,
 int mt_perf_to_adistance(struct node_hmem_attrs *perf, int *adist);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
+int next_promotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
 bool node_is_toptier(int node);
 #else
@@ -58,6 +59,11 @@ static inline int next_demotion_node(int node)
return NUMA_NO_NODE;
 }
 
+static inline int next_promotion_node(int node)
+{
+   return NUMA_NO_NODE;
+}
+
 static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 {
*targets = NODE_MASK_NONE;
@@ -101,6 +107,11 @@ static inline int next_demotion_node(int node)
return NUMA_NO_NODE;
 }
 
+static inline int next_promotion_node(int node)
+{
+   return NUMA_NO_NODE;
+}
+
 static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 {
*targets = NODE_MASK_NONE;
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 8d5291add2bc..0060ee571cf4 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -335,6 +335,49 @@ int next_demotion_node(int node)
return target;
 }
 
+/*
+ * Select a promotion target that is close to the from node among the given
+ * two nodes.
+ *
+ * TODO: consider other decision policy as node_distance may not be precise.
+ */
+static int select_promotion_target(int a, int b, int from)
+{
+   if (node_distance(from, a) < node_distance(from, b))
+   return a;
+   else
+   return b;
+}
+
+/**
+ * next_promotion_node() - Get the next node in the promotion path
+ * @node: The starting node to lookup the next node
+ *
+ * Return: node id for next memory node in the promotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is the toptier.
+ */
+int next_promotion_node(int node)
+{
+   int target = NUMA_NO_NODE;
+   int nid;
+
+   if (node_is_toptier(node))
+   return NUMA_NO_NODE;
+
+   rcu_read_lock();
+   for_each_node_state(nid, N_MEMORY) {
+   if (node_isset(node, node_demotion[nid].preferred)) {
+   if (target == NUMA_NO_NODE)
+   target = nid;
+   else
+   target = select_promotion_target(nid, target, node);
+   }
+   }
+   rcu_read_unlock();
+
+   return target;
+}
+
 static void disable_all_demotion_targets(void)
 {
struct memory_tier *memtier;
-- 
2.34.1




[PATCH v2 5/7] mm/damon: introduce DAMOS_PROMOTE action for promotion

2024-02-26 Thread Honggyu Kim
From: Hyeongtak Ji 

This patch introduces DAMOS_PROMOTE action for paddr mode.

It includes renaming alloc_demote_folio to alloc_migrate_folio to use it
for promotion as well.

Signed-off-by: Hyeongtak Ji 
Signed-off-by: Honggyu Kim 
---
 include/linux/damon.h  |  2 ++
 include/linux/migrate_mode.h   |  1 +
 include/linux/vm_event_item.h  |  1 +
 include/trace/events/migrate.h |  3 ++-
 mm/damon/paddr.c   | 45 --
 mm/damon/sysfs-schemes.c   |  1 +
 mm/vmstat.c|  1 +
 7 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 86e66772766b..d7e52d0228b4 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -105,6 +105,7 @@ struct damon_target {
  * @DAMOS_NOHUGEPAGE:  Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
  * @DAMOS_LRU_PRIO:Prioritize the region on its LRU lists.
  * @DAMOS_LRU_DEPRIO:  Deprioritize the region on its LRU lists.
+ * @DAMOS_PROMOTE:  Do promotion for the given region.
  * @DAMOS_DEMOTE:   Do demotion for the given region.
  * @DAMOS_STAT:Do nothing but count the stat.
  * @NR_DAMOS_ACTIONS:  Total number of DAMOS actions
@@ -123,6 +124,7 @@ enum damos_action {
DAMOS_NOHUGEPAGE,
DAMOS_LRU_PRIO,
DAMOS_LRU_DEPRIO,
+   DAMOS_PROMOTE,
DAMOS_DEMOTE,
DAMOS_STAT, /* Do nothing but only record the stat */
NR_DAMOS_ACTIONS,
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index f37cc03f9369..63f75eb9abf3 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -29,6 +29,7 @@ enum migrate_reason {
MR_CONTIG_RANGE,
MR_LONGTERM_PIN,
MR_DEMOTION,
+   MR_PROMOTION,
MR_TYPES
 };
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8abfa1240040..63cf920afeaa 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGDEMOTE_KSWAPD,
PGDEMOTE_DIRECT,
PGDEMOTE_KHUGEPAGED,
+   PGPROMOTE,
PGSCAN_KSWAPD,
PGSCAN_DIRECT,
PGSCAN_KHUGEPAGED,
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 0190ef725b43..f0dd569c1e62 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -22,7 +22,8 @@
EM( MR_NUMA_MISPLACED,  "numa_misplaced")   \
EM( MR_CONTIG_RANGE,"contig_range") \
EM( MR_LONGTERM_PIN,"longterm_pin") \
-   EMe(MR_DEMOTION,"demotion")
+   EM( MR_DEMOTION,"demotion") \
+   EMe(MR_PROMOTION,   "promotion")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 23e37ce57202..37a7b34a36dd 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -229,6 +229,7 @@ static bool damos_pa_filter_out(struct damos *scheme, struct folio *folio)
 
 enum migration_mode {
MIG_PAGEOUT,
+   MIG_PROMOTE,
MIG_DEMOTE,
 };
 
@@ -241,9 +242,26 @@ static unsigned int migrate_folio_list(struct list_head *migrate_folios,
   struct pglist_data *pgdat,
   enum migration_mode mm)
 {
-   int target_nid = next_demotion_node(pgdat->node_id);
+   int target_nid;
unsigned int nr_succeeded;
nodemask_t allowed_mask;
+   int reason;
+   enum vm_event_item vm_event;
+
+   switch (mm) {
+   case MIG_PROMOTE:
+   target_nid = next_promotion_node(pgdat->node_id);
+   reason = MR_PROMOTION;
+   vm_event = PGPROMOTE;
+   break;
+   case MIG_DEMOTE:
+   target_nid = next_demotion_node(pgdat->node_id);
+   reason = MR_DEMOTION;
+   vm_event = PGDEMOTE_DIRECT;
+   break;
+   default:
+   return 0;
+   }
 
struct migration_target_control mtc = {
/*
@@ -263,14 +281,19 @@ static unsigned int migrate_folio_list(struct list_head *migrate_folios,
if (list_empty(migrate_folios))
return 0;
 
-   node_get_allowed_targets(pgdat, &allowed_mask);
+   if (mm == MIG_DEMOTE) {
+   node_get_allowed_targets(pgdat, &allowed_mask);
+   } else if (mm == MIG_PROMOTE) {
+   /* TODO: Need to add upper_tier_mask at struct memory_tier. */
+   allowed_mask = NODE_MASK_NONE;
+   }
 
/* Migration ignores all cpuset and mempolicy settings */
migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
- (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+ (unsigned long

[PATCH v2 7/7] mm/damon/sysfs-schemes: apply target_nid for promote and demote actions

2024-02-26 Thread Honggyu Kim
From: Hyeongtak Ji 

This patch changes DAMOS_PROMOTE and DAMOS_DEMOTE to use the target_nid
from sysfs as the destination NUMA node of the migration.  This has been
tested on qemu as follows:

  $ cd /sys/kernel/mm/damon/admin/kdamonds/
  $ cat contexts//schemes//action
  promote
  $ echo 1 > contexts//schemes//target_nid
  $ echo commit > state
  $ numactl -p 2 ./hot_cold 500M 600M &
  $ numastat -c -p hot_cold

  Per-node process memory usage (in MBs)
  PID            Node 0 Node 1 Node 2 Total
  -------------- ------ ------ ------ -----
  701 (hot_cold)      0    501    601  1101

Signed-off-by: Hyeongtak Ji 
Signed-off-by: Honggyu Kim 
---
 mm/damon/paddr.c | 27 +--
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 37a7b34a36dd..5e057a69464f 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -240,9 +240,9 @@ enum migration_mode {
  */
 static unsigned int migrate_folio_list(struct list_head *migrate_folios,
   struct pglist_data *pgdat,
-  enum migration_mode mm)
+  enum migration_mode mm,
+  int target_nid)
 {
-   int target_nid;
unsigned int nr_succeeded;
nodemask_t allowed_mask;
int reason;
@@ -250,12 +250,14 @@ static unsigned int migrate_folio_list(struct list_head *migrate_folios,
 
switch (mm) {
case MIG_PROMOTE:
-   target_nid = next_promotion_node(pgdat->node_id);
+   if (target_nid == NUMA_NO_NODE)
+   target_nid = next_promotion_node(pgdat->node_id);
reason = MR_PROMOTION;
vm_event = PGPROMOTE;
break;
case MIG_DEMOTE:
-   target_nid = next_demotion_node(pgdat->node_id);
+   if (target_nid == NUMA_NO_NODE)
+   target_nid = next_demotion_node(pgdat->node_id);
reason = MR_DEMOTION;
vm_event = PGDEMOTE_DIRECT;
break;
@@ -358,7 +360,8 @@ static enum folio_references folio_check_references(struct folio *folio)
  */
 static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
struct pglist_data *pgdat,
-   enum migration_mode mm)
+   enum migration_mode mm,
+   int target_nid)
 {
unsigned int nr_migrated = 0;
struct folio *folio;
@@ -399,7 +402,7 @@ static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
/* 'folio_list' is always empty here */
 
/* Migrate folios selected for migration */
-   nr_migrated += migrate_folio_list(&migrate_folios, pgdat, mm);
+   nr_migrated += migrate_folio_list(&migrate_folios, pgdat, mm, target_nid);
/* Folios that could not be migrated are still in @migrate_folios */
if (!list_empty(&migrate_folios)) {
/* Folios which weren't migrated go back on @folio_list */
@@ -426,7 +429,8 @@ static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
  *  common function for both cases.
  */
 static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
-   enum migration_mode mm)
+   enum migration_mode mm,
+   int target_nid)
 {
int nid;
unsigned int nr_migrated = 0;
@@ -449,12 +453,14 @@ static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
}
 
nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
-  NODE_DATA(nid), mm);
+  NODE_DATA(nid), mm,
+  target_nid);
nid = folio_nid(lru_to_folio(folio_list));
} while (!list_empty(folio_list));
 
nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
-  NODE_DATA(nid), mm);
+  NODE_DATA(nid), mm,
+  target_nid);
 
memalloc_noreclaim_restore(noreclaim_flag);
 
@@ -499,7 +505,8 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
break;
case MIG_PROMOTE:
case MIG_DEMOTE:
-   applied = damon_pa_migrate_pages(&folio_list, mm);
+   applied = damon_pa_migrate_pages(&folio_list, mm,
+s->target_nid);
break;
default:
/* Unexpected migration mode. */
-- 
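
For readers skimming the diff, the net effect on node selection reduces to
roughly the following (a condensed sketch using the names from the patch;
do_migrate() is a placeholder for the actual migration call, and error
handling and statistics are omitted):

static unsigned int migrate_folio_list(struct list_head *migrate_folios,
				       struct pglist_data *pgdat,
				       enum migration_mode mm, int target_nid)
{
	/* An explicit target_nid from sysfs wins; NUMA_NO_NODE falls
	 * back to the kernel's default promotion/demotion path.
	 */
	if (target_nid == NUMA_NO_NODE)
		target_nid = (mm == MIG_PROMOTE) ?
			next_promotion_node(pgdat->node_id) :
			next_demotion_node(pgdat->node_id);

	if (target_nid == NUMA_NO_NODE)
		return 0;	/* nowhere to migrate to */

	return do_migrate(migrate_folios, target_nid);
}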

[PATCH v2 6/7] mm/damon/sysfs-schemes: add target_nid on sysfs-schemes

2024-02-26 Thread Honggyu Kim
From: Hyeongtak Ji 

This patch adds target_nid under
  /sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>

target_nid can be used as the destination node for DAMOS actions such as
DAMOS_DEMOTE or DAMOS_PROMOTE in the future.

Signed-off-by: Hyeongtak Ji 
Signed-off-by: Honggyu Kim 
---
 include/linux/damon.h| 11 ++-
 mm/damon/core.c  |  5 -
 mm/damon/dbgfs.c |  2 +-
 mm/damon/lru_sort.c  |  3 ++-
 mm/damon/reclaim.c   |  3 ++-
 mm/damon/sysfs-schemes.c | 37 -
 6 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index d7e52d0228b4..4d270956dbd0 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -321,6 +321,7 @@ struct damos_access_pattern {
  * @apply_interval_us: The time between applying the @action.
  * @quota: Control the aggressiveness of this scheme.
  * @wmarks:Watermarks for automated (in)activation of this scheme.
+ * @target_nid:	Destination node if @action is "promote" or "demote".
  * @filters:   Additional set of &struct damos_filter for &action.
  * @stat:  Statistics of this scheme.
  * @list:  List head for siblings.
@@ -336,6 +337,10 @@ struct damos_access_pattern {
  * monitoring context are inactive, DAMON stops monitoring either, and just
  * repeatedly checks the watermarks.
  *
+ * @target_nid is used to set the destination node for promote or demote
+ * actions, which means it's only meaningful when @action is either "promote" or
+ * "demote".
+ *
  * Before applying the &action to a memory region, &struct damon_operations
  * implementation could check pages of the region and skip &action to respect
  * &filters
@@ -357,6 +362,9 @@ struct damos {
 /* public: */
struct damos_quota quota;
struct damos_watermarks wmarks;
+   union {
+   int target_nid;
+   };
struct list_head filters;
struct damos_stat stat;
struct list_head list;
@@ -661,7 +669,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
enum damos_action action,
unsigned long apply_interval_us,
struct damos_quota *quota,
-   struct damos_watermarks *wmarks);
+   struct damos_watermarks *wmarks,
+   int target_nid);
 void damon_add_scheme(struct damon_ctx *ctx, struct damos *s);
 void damon_destroy_scheme(struct damos *s);
 
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 3a05e71509b9..0c2472818fb9 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -316,7 +316,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
enum damos_action action,
unsigned long apply_interval_us,
struct damos_quota *quota,
-   struct damos_watermarks *wmarks)
+   struct damos_watermarks *wmarks,
+   int target_nid)
 {
struct damos *scheme;
 
@@ -341,6 +342,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
scheme->wmarks = *wmarks;
scheme->wmarks.activated = true;
 
+   scheme->target_nid = target_nid;
+
return scheme;
 }
 
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c
index dc0ea1fc30ca..29b427dd1186 100644
--- a/mm/damon/dbgfs.c
+++ b/mm/damon/dbgfs.c
@@ -279,7 +279,7 @@ static struct damos **str_to_schemes(const char *str, ssize_t len,
 
pos += parsed;
scheme = damon_new_scheme(&pattern, action, 0, &quota,
-   &wmarks);
+   &wmarks, NUMA_NO_NODE);
if (!scheme)
goto fail;
 
diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c
index f2e5f9431892..fd0492637fce 100644
--- a/mm/damon/lru_sort.c
+++ b/mm/damon/lru_sort.c
@@ -163,7 +163,8 @@ static struct damos *damon_lru_sort_new_scheme(
/* under the quota. */
&quota,
/* (De)activate this according to the watermarks. */
-   &damon_lru_sort_wmarks);
+   &damon_lru_sort_wmarks,
+   NUMA_NO_NODE);
 }
 
 /* Create a DAMON-based operation scheme for hot memory regions */
diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c
index ab974e477d2f..973ac5df84eb 100644
--- a/mm/damon/reclaim.c
+++ b/mm/damon/reclaim.c
@@ -147,7 +147,8 @@ static struct damos *damon_reclaim_new_scheme(void)
/* under the quota. */
&damon_reclaim_quota,
/* (De)activate this according to the watermarks. */
-   &damon_reclaim_wmarks);
+   &damon_reclaim_wmarks,
+   NUMA_NO_NODE);
 }
 
 static int damon_recl
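
For illustration, a hypothetical caller passing an explicit destination node
through the new parameter (DAMOS_DEMOTE is the action added earlier in this
series; the in-tree callers above simply pass NUMA_NO_NODE to keep today's
behavior):

	struct damos *scheme;

	/* Demote cold regions to node 2 instead of letting the kernel
	 * pick next_demotion_node().
	 */
	scheme = damon_new_scheme(&pattern, DAMOS_DEMOTE, 0,
				  &quota, &wmarks, 2);
	if (!scheme)
		return -ENOMEM;
	damon_add_scheme(ctx, scheme);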

[RFC PATCH 66/73] x86/pvm: Use new cpu feature to describe XENPV and PVM

2024-02-26 Thread Lai Jiangshan
From: Hou Wenlong 

Some PVOPS are patched as the native version directly if the guest is
not a XENPV guest. However, this approach will not work after
introducing a PVM guest. To address this, use a new CPU feature to
describe XENPV and PVM, and ensure that those PVOPS are patched only
when it is not a paravirtual guest.

Signed-off-by: Hou Wenlong 
Signed-off-by: Lai Jiangshan 
---
 arch/x86/entry/entry_64.S  |  5 ++---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/paravirt.h| 14 +++---
 arch/x86/kernel/pvm.c  |  1 +
 arch/x86/xen/enlighten_pv.c|  1 +
 5 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index fe12605b3c05..6b41a1837698 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -127,9 +127,8 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 * In the PVM guest case we must use eretu synthetic instruction.
 */
 
-	ALTERNATIVE_2 "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode", \
-		"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV, \
-		"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_KVM_PVM_GUEST
+	ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode", \
+		"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_PV_GUEST
 
/*
 * We win! This label is here just for ease of understanding
diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index e17e72f13423..72ef58a2db19 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
 #define X86_FEATURE_VCPUPREEMPT	( 8*32+21) /* "" PV vcpu_is_preempted function */
 #define X86_FEATURE_TDX_GUEST		( 8*32+22) /* Intel Trust Domain Extensions Guest */
 #define X86_FEATURE_KVM_PVM_GUEST	( 8*32+23) /* KVM Pagetable-based Virtual Machine guest */
+#define X86_FEATURE_PV_GUEST   ( 8*32+24) /* "" Paravirtual guest */
 
 /* Intel-defined CPU features, CPUID level 0x0007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE	( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index deaee9ec575e..a864ee481ca2 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -143,7 +143,7 @@ static __always_inline unsigned long read_cr2(void)
 {
return PVOP_ALT_CALLEE0(unsigned long, mmu.read_cr2,
"mov %%cr2, %%rax;",
-   ALT_NOT(X86_FEATURE_XENPV));
+   ALT_NOT(X86_FEATURE_PV_GUEST));
 }
 
 static __always_inline void write_cr2(unsigned long x)
@@ -154,13 +154,13 @@ static __always_inline void write_cr2(unsigned long x)
 static inline unsigned long __read_cr3(void)
 {
return PVOP_ALT_CALL0(unsigned long, mmu.read_cr3,
- "mov %%cr3, %%rax;", ALT_NOT(X86_FEATURE_XENPV));
+			  "mov %%cr3, %%rax;", ALT_NOT(X86_FEATURE_PV_GUEST));
 }
 
 static inline void write_cr3(unsigned long x)
 {
PVOP_ALT_VCALL1(mmu.write_cr3, x,
-   "mov %%rdi, %%cr3", ALT_NOT(X86_FEATURE_XENPV));
+   "mov %%rdi, %%cr3", ALT_NOT(X86_FEATURE_PV_GUEST));
 }
 
 static inline void __write_cr4(unsigned long x)
@@ -694,17 +694,17 @@ bool __raw_callee_save___native_vcpu_is_preempted(long cpu);
 static __always_inline unsigned long arch_local_save_flags(void)
 {
return PVOP_ALT_CALLEE0(unsigned long, irq.save_fl, "pushf; pop %%rax;",
-   ALT_NOT(X86_FEATURE_XENPV));
+   ALT_NOT(X86_FEATURE_PV_GUEST));
 }
 
 static __always_inline void arch_local_irq_disable(void)
 {
-   PVOP_ALT_VCALLEE0(irq.irq_disable, "cli;", ALT_NOT(X86_FEATURE_XENPV));
+	PVOP_ALT_VCALLEE0(irq.irq_disable, "cli;", ALT_NOT(X86_FEATURE_PV_GUEST));
 }
 
 static __always_inline void arch_local_irq_enable(void)
 {
-   PVOP_ALT_VCALLEE0(irq.irq_enable, "sti;", ALT_NOT(X86_FEATURE_XENPV));
+	PVOP_ALT_VCALLEE0(irq.irq_enable, "sti;", ALT_NOT(X86_FEATURE_PV_GUEST));
 }
 
 static __always_inline unsigned long arch_local_irq_save(void)
@@ -776,7 +776,7 @@ void native_pv_lock_init(void) __init;
 .endm
 
 #define SAVE_FLAGS ALTERNATIVE "PARA_IRQ_save_fl;", "pushf; pop %rax;", \
-   ALT_NOT(X86_FEATURE_XENPV)
+   ALT_NOT(X86_FEATURE_PV_GUEST)
 #endif
 #endif /* CONFIG_PARAVIRT_XXL */
 #endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/pvm.c b/arch/x86/kernel/pvm.c
index c38e46a96ad3..d39550a8159f 100644
--- a/arch/x86/kernel/pvm.c
+++ b/arch/x86/kernel/pvm.c
@@ -300,6 +300,7 @@ void __init pvm_early_setup(void)
 

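
The idea is that both guest types force the shared synthetic bit at boot, so
a single ALTERNATIVE site covers them. A sketch of what the setup hunks
amount to (the pvm.c hunk is truncated above; setup_force_cpu_cap() is the
standard helper for synthetic CPU features):

	/* xen_start_kernel(), Xen PV guest */
	setup_force_cpu_cap(X86_FEATURE_XENPV);
	setup_force_cpu_cap(X86_FEATURE_PV_GUEST);

	/* pvm_early_setup(), PVM guest */
	setup_force_cpu_cap(X86_FEATURE_KVM_PVM_GUEST);
	setup_force_cpu_cap(X86_FEATURE_PV_GUEST);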
Re: [PATCH 3/3] dt-bindings: remoteproc: ti,davinci: remove unstable remark

2024-02-26 Thread Mathieu Poirier
On Sat, Feb 24, 2024 at 10:12:36AM +0100, Krzysztof Kozlowski wrote:
> TI Davinci remoteproc bindings were marked as work-in-progress /
> unstable in 2017 in commit ae67b8007816 ("dt-bindings: remoteproc: Add
> bindings for Davinci DSP processors"). Almost seven years is enough, so
> drop the "unstable" remark and expect usual ABI rules.
> 
> Signed-off-by: Krzysztof Kozlowski 
> ---
>  .../devicetree/bindings/remoteproc/ti,davinci-rproc.txt| 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git 
> a/Documentation/devicetree/bindings/remoteproc/ti,davinci-rproc.txt 
> b/Documentation/devicetree/bindings/remoteproc/ti,davinci-rproc.txt
> index 25f8658e216f..48a49c516b62 100644
> --- a/Documentation/devicetree/bindings/remoteproc/ti,davinci-rproc.txt
> +++ b/Documentation/devicetree/bindings/remoteproc/ti,davinci-rproc.txt
> @@ -1,9 +1,6 @@
>  TI Davinci DSP devices
>  ===
>  
> -Binding status: Unstable - Subject to changes for DT representation of clocks
> -and resets
> -

Reviewed-by: Mathieu Poirier 

>  The TI Davinci family of SoCs usually contains a TI DSP Core sub-system that
>  is used to offload some of the processor-intensive tasks or algorithms, for
>  achieving various system level goals.
> -- 
> 2.34.1
> 



Re: [PATCH] mm: add alloc_contig_migrate_range allocation statistics

2024-02-26 Thread Steven Rostedt
On Mon, 26 Feb 2024 10:00:15 +
Richard Chang  wrote:

> alloc_contig_migrate_range has all the information needed to understand
> big contiguous allocation latency. For example, how many pages were
> migrated, and how many times they needed to be unmapped from page
> tables.
> 
> This patch adds the trace event to collect the allocation statistics.
> In the field, it was quite useful to understand CMA allocation
> latency.
> 
> Signed-off-by: Richard Chang 
> ---
>  include/trace/events/kmem.h | 39 +
>  mm/internal.h   |  3 ++-
>  mm/page_alloc.c | 30 +++-
>  mm/page_isolation.c |  2 +-
>  4 files changed, 67 insertions(+), 7 deletions(-)
> 
> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> index 58688768ef0f..964704d76f9f 100644
> --- a/include/trace/events/kmem.h
> +++ b/include/trace/events/kmem.h
> @@ -304,6 +304,45 @@ TRACE_EVENT(mm_page_alloc_extfrag,
>   __entry->change_ownership)
>  );
>  
> +TRACE_EVENT(mm_alloc_contig_migrate_range_info,
> +
> + TP_PROTO(unsigned long start,
> +  unsigned long end,
> +  int migratetype,
> +  unsigned long nr_migrated,
> +  unsigned long nr_reclaimed,
> +  unsigned long nr_mapped),
> +
> + TP_ARGS(start, end, migratetype,
> + nr_migrated, nr_reclaimed, nr_mapped),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, start)
> + __field(unsigned long, end)
> + __field(int, migratetype)


Please move the int to the end of the longs, as it will otherwise cause a
4-byte hole on 64-bit machines.
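
To illustrate with stand-alone structs (illustrative only, not the generated
trace entry; offsets assume a 64-bit machine):

	struct bad_layout {
		unsigned long start;		/* bytes  0..7  */
		unsigned long end;		/* bytes  8..15 */
		int migratetype;		/* bytes 16..19 */
						/* bytes 20..23: hole */
		unsigned long nr_migrated;	/* bytes 24..31 */
	};

	struct good_layout {
		unsigned long start;		/* bytes  0..7  */
		unsigned long end;		/* bytes  8..15 */
		unsigned long nr_migrated;	/* bytes 16..23 */
		int migratetype;		/* bytes 24..27; at most
						 * tail padding follows */
	};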


> + __field(unsigned long, nr_migrated)
> + __field(unsigned long, nr_reclaimed)
> + __field(unsigned long, nr_mapped)
> + ),
> +
> + TP_fast_assign(
> + __entry->start = start;
> + __entry->end = end;
> + __entry->migratetype = migratetype;
> + __entry->nr_migrated = nr_migrated;
> + __entry->nr_reclaimed = nr_reclaimed;
> + __entry->nr_mapped = nr_mapped;
> + ),
> +
> + TP_printk("start=0x%lx end=0x%lx migratetype=%d nr_migrated=%lu 
> nr_reclaimed=%lu nr_mapped=%lu",
> +   __entry->start,
> +   __entry->end,
> +   __entry->migratetype,
> +   __entry->nr_migrated,
> +   __entry->nr_reclaimed,
> +   __entry->nr_mapped)
> +);
> +
>  /*
>   * Required for uniquely and securely identifying mm in rss_stat tracepoint.
>   */
> diff --git a/mm/internal.h b/mm/internal.h
> index f309a010d50f..e114c647e278 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -537,7 +537,8 @@ isolate_migratepages_range(struct compact_control *cc,
>  unsigned long low_pfn, unsigned long end_pfn);
>  
>  int __alloc_contig_migrate_range(struct compact_control *cc,
> - unsigned long start, unsigned long end);
> + unsigned long start, unsigned long end,
> + int migratetype);
>  
>  /* Free whole pageblock and set its migration type to MIGRATE_CMA. */
>  void init_cma_reserved_pageblock(struct page *page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 150d4f23b010..f840bc785afa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6219,9 +6219,14 @@ static void alloc_contig_dump_pages(struct list_head 
> *page_list)
>   }
>  }
>  
> -/* [start, end) must belong to a single zone. */
> +/*
> + * [start, end) must belong to a single zone.
> + * @migratetype: using migratetype to filter the type of migration in
> + *   trace_mm_alloc_contig_migrate_range_info.
> + */
>  int __alloc_contig_migrate_range(struct compact_control *cc,
> - unsigned long start, unsigned long end)
> + unsigned long start, unsigned long end,
> + int migratetype)
>  {
>   /* This function is based on compact_zone() from compaction.c. */
>   unsigned int nr_reclaimed;
> @@ -6232,6 +6237,10 @@ int __alloc_contig_migrate_range(struct 
> compact_control *cc,
>   .nid = zone_to_nid(cc->zone),
>   .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
>   };
> + struct page *page;
> + unsigned long total_mapped = 0;
> + unsigned long total_migrated = 0;
> + unsigned long total_reclaimed = 0;
>  
>   lru_cache_disable();
>  
> @@ -6257,9 +6266,16 @@ int __alloc_contig_migrate_range(struct 
> compact_control *cc,
>   &cc->migratepages);
>   cc->nr_migratepages -= nr_reclaimed;
>  
> + total_reclaimed += nr_reclaimed;
> + list_for_each_entry(page, &cc->migratepages, lru)
> + 

Re: [PATCH v2] tracing: Add warning if string in __assign_str() does not match __string()

2024-02-26 Thread Steven Rostedt
On Mon, 26 Feb 2024 09:33:28 +0900
Masami Hiramatsu (Google)  wrote:

> On Fri, 23 Feb 2024 16:13:56 -0500
> Steven Rostedt  wrote:
> 
> > From: "Steven Rostedt (Google)" 
> > 
> > In preparation to remove the second parameter of __assign_str(), make sure
> > it is really a duplicate of __string() by adding a WARN_ON_ONCE().
> >   
> 
> Looks good to me. So eventually this is removed when the second parameter
> is removed?

Yes, I applied this after my other patches.


> 
> Reviewed-by: Masami Hiramatsu (Google) 


Thanks!

-- Steve



Re: [PATCH] mm: add alloc_contig_migrate_range allocation statistics

2024-02-26 Thread Steven Rostedt
On Mon, 26 Feb 2024 12:06:29 -0500
Steven Rostedt  wrote:

> On Mon, 26 Feb 2024 10:00:15 +
> Richard Chang  wrote:
> 
> > alloc_contig_migrate_range has all the information needed to understand
> > big contiguous allocation latency. For example, how many pages were
> > migrated, and how many times they needed to be unmapped from page
> > tables.
> > 
> > This patch adds the trace event to collect the allocation statistics.
> > In the field, it was quite useful to understand CMA allocation
> > latency.
> > 
> > Signed-off-by: Richard Chang 
> > ---
> >  include/trace/events/kmem.h | 39 +
> >  mm/internal.h   |  3 ++-
> >  mm/page_alloc.c | 30 +++-
> >  mm/page_isolation.c |  2 +-
> >  4 files changed, 67 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> > index 58688768ef0f..964704d76f9f 100644
> > --- a/include/trace/events/kmem.h
> > +++ b/include/trace/events/kmem.h
> > @@ -304,6 +304,45 @@ TRACE_EVENT(mm_page_alloc_extfrag,
> > __entry->change_ownership)
> >  );
> >  
> > +TRACE_EVENT(mm_alloc_contig_migrate_range_info,
> > +
> > +   TP_PROTO(unsigned long start,
> > +unsigned long end,
> > +int migratetype,
> > +unsigned long nr_migrated,
> > +unsigned long nr_reclaimed,
> > +unsigned long nr_mapped),
> > +
> > +   TP_ARGS(start, end, migratetype,
> > +   nr_migrated, nr_reclaimed, nr_mapped),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(unsigned long, start)
> > +   __field(unsigned long, end)
> > +   __field(int, migratetype)  
> 
> 
> Please move the int to the end of the longs, as it will otherwise cause a
> 4-byte hole on 64-bit machines.
> 
> 
> > +   __field(unsigned long, nr_migrated)
> > +   __field(unsigned long, nr_reclaimed)
> > +   __field(unsigned long, nr_mapped)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->start = start;
> > +   __entry->end = end;
> > +   __entry->migratetype = migratetype;
> > +   __entry->nr_migrated = nr_migrated;
> > +   __entry->nr_reclaimed = nr_reclaimed;
> > +   __entry->nr_mapped = nr_mapped;
> > +   ),
> > +
> > +   TP_printk("start=0x%lx end=0x%lx migratetype=%d nr_migrated=%lu 
> > nr_reclaimed=%lu nr_mapped=%lu",
> > + __entry->start,
> > + __entry->end,
> > + __entry->migratetype,
> > + __entry->nr_migrated,
> > + __entry->nr_reclaimed,
> > + __entry->nr_mapped)
> > +);
> > +
> >  /*
> >   * Required for uniquely and securely identifying mm in rss_stat 
> > tracepoint.
> >   */
> > diff --git a/mm/internal.h b/mm/internal.h
> > index f309a010d50f..e114c647e278 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -537,7 +537,8 @@ isolate_migratepages_range(struct compact_control *cc,
> >unsigned long low_pfn, unsigned long end_pfn);
> >  
> >  int __alloc_contig_migrate_range(struct compact_control *cc,
> > -   unsigned long start, unsigned long end);
> > +   unsigned long start, unsigned long end,
> > +   int migratetype);
> >  
> >  /* Free whole pageblock and set its migration type to MIGRATE_CMA. */
> >  void init_cma_reserved_pageblock(struct page *page);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 150d4f23b010..f840bc785afa 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6219,9 +6219,14 @@ static void alloc_contig_dump_pages(struct list_head 
> > *page_list)
> > }
> >  }
> >  
> > -/* [start, end) must belong to a single zone. */
> > +/*
> > + * [start, end) must belong to a single zone.
> > + * @migratetype: using migratetype to filter the type of migration in
> > + * trace_mm_alloc_contig_migrate_range_info.
> > + */
> >  int __alloc_contig_migrate_range(struct compact_control *cc,
> > -   unsigned long start, unsigned long end)
> > +   unsigned long start, unsigned long end,
> > +   int migratetype)
> >  {
> > /* This function is based on compact_zone() from compaction.c. */
> > unsigned int nr_reclaimed;
> > @@ -6232,6 +6237,10 @@ int __alloc_contig_migrate_range(struct 
> > compact_control *cc,
> > .nid = zone_to_nid(cc->zone),
> > .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> > };
> > +   struct page *page;
> > +   unsigned long total_mapped = 0;
> > +   unsigned long total_migrated = 0;
> > +   unsigned long total_reclaimed = 0;
> >  
> > lru_cache_disable();
> >  
> > @@ -6257,9 +6266,16 @@ int __alloc_contig_migrate_range(struct 
> > compact_control *cc,
> >   

[PATCH v2 0/3] media: Fix warnings building with LLVM=1

2024-02-26 Thread Ricardo Ribalda
LLVM does check -Wcast-function-type-strict, which is triggered in a
couple of places in the media subsystem.

Signed-off-by: Ricardo Ribalda 
---
Changes in v2:
- Refactor media: mediatek patchset
- sta2x11: Fix Christmas tree order
- Link to v1: 
https://lore.kernel.org/r/20240128-fix-clang-warnings-v1-0-1d946013a...@chromium.org

---
Ricardo Ribalda (3):
  media: pci: sta2x11: Fix Wcast-function-type-strict warnings
  media: usb: pvrusb2: Fix Wcast-function-type-strict warnings
  media: mediatek: vcodec: Fix Wcast-function-type-strict warnings

 drivers/media/pci/sta2x11/sta2x11_vip.c  |  5 +++--
 drivers/media/platform/mediatek/mdp3/mtk-mdp3-vpu.c  | 12 ++--
 .../media/platform/mediatek/vcodec/common/mtk_vcodec_fw.h|  2 +-
 .../platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c  | 10 +-
 drivers/media/platform/mediatek/vcodec/decoder/vdec_vpu_if.c |  2 +-
 drivers/media/platform/mediatek/vcodec/encoder/venc_vpu_if.c |  2 +-
 drivers/media/usb/pvrusb2/pvrusb2-context.c  |  5 +++--
 drivers/media/usb/pvrusb2/pvrusb2-dvb.c  |  7 ---
 drivers/media/usb/pvrusb2/pvrusb2-v4l2.c |  7 ---
 drivers/remoteproc/mtk_scp.c |  4 ++--
 include/linux/remoteproc/mtk_scp.h   |  2 +-
 include/linux/rpmsg/mtk_rpmsg.h  |  2 +-
 12 files changed, 28 insertions(+), 32 deletions(-)
---
base-commit: d206a76d7d2726f3b096037f2079ce0bd3ba329b
change-id: 20240128-fix-clang-warnings-6c12920467cb

Best regards,
-- 
Ricardo Ribalda 




[PATCH v2 3/3] media: mediatek: vcodec: Fix Wcast-function-type-strict warnings

2024-02-26 Thread Ricardo Ribalda
Building with LLVM=1 throws the following warning:
drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c:38:32: 
warning: cast from 'mtk_vcodec_ipi_handler' (aka 'void (*)(void *, unsigned 
int, void *)') to 'ipi_handler_t' (aka 'void (*)(const void *, unsigned int, 
void *)') converts to incompatible function type [-Wcast-function-type-strict]

Constify the types to avoid the warning.

Signed-off-by: Ricardo Ribalda 
---
 drivers/media/platform/mediatek/mdp3/mtk-mdp3-vpu.c  | 12 ++--
 .../media/platform/mediatek/vcodec/common/mtk_vcodec_fw.h|  2 +-
 .../platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c  | 10 +-
 drivers/media/platform/mediatek/vcodec/decoder/vdec_vpu_if.c |  2 +-
 drivers/media/platform/mediatek/vcodec/encoder/venc_vpu_if.c |  2 +-
 drivers/remoteproc/mtk_scp.c |  4 ++--
 include/linux/remoteproc/mtk_scp.h   |  2 +-
 include/linux/rpmsg/mtk_rpmsg.h  |  2 +-
 8 files changed, 14 insertions(+), 22 deletions(-)

diff --git a/drivers/media/platform/mediatek/mdp3/mtk-mdp3-vpu.c 
b/drivers/media/platform/mediatek/mdp3/mtk-mdp3-vpu.c
index 49fc2e9d45dd5..c4f1c49b9d52a 100644
--- a/drivers/media/platform/mediatek/mdp3/mtk-mdp3-vpu.c
+++ b/drivers/media/platform/mediatek/mdp3/mtk-mdp3-vpu.c
@@ -77,10 +77,10 @@ void mdp_vpu_shared_mem_free(struct mdp_vpu_dev *vpu)
dma_free_wc(dev, vpu->config_size, vpu->config, 
vpu->config_addr);
 }
 
-static void mdp_vpu_ipi_handle_init_ack(void *data, unsigned int len,
+static void mdp_vpu_ipi_handle_init_ack(const void *data, unsigned int len,
void *priv)
 {
-   struct mdp_ipi_init_msg *msg = (struct mdp_ipi_init_msg *)data;
+   const struct mdp_ipi_init_msg *msg = data;
struct mdp_vpu_dev *vpu =
(struct mdp_vpu_dev *)(unsigned long)msg->drv_data;
 
@@ -91,10 +91,10 @@ static void mdp_vpu_ipi_handle_init_ack(void *data, 
unsigned int len,
complete(&vpu->ipi_acked);
 }
 
-static void mdp_vpu_ipi_handle_deinit_ack(void *data, unsigned int len,
+static void mdp_vpu_ipi_handle_deinit_ack(const void *data, unsigned int len,
  void *priv)
 {
-   struct mdp_ipi_deinit_msg *msg = (struct mdp_ipi_deinit_msg *)data;
+   const struct mdp_ipi_deinit_msg *msg = data;
struct mdp_vpu_dev *vpu =
(struct mdp_vpu_dev *)(unsigned long)msg->drv_data;
 
@@ -102,10 +102,10 @@ static void mdp_vpu_ipi_handle_deinit_ack(void *data, 
unsigned int len,
complete(&vpu->ipi_acked);
 }
 
-static void mdp_vpu_ipi_handle_frame_ack(void *data, unsigned int len,
+static void mdp_vpu_ipi_handle_frame_ack(const void *data, unsigned int len,
 void *priv)
 {
-   struct img_sw_addr *addr = (struct img_sw_addr *)data;
+   const struct img_sw_addr *addr = data;
struct img_ipi_frameparam *param =
(struct img_ipi_frameparam *)(unsigned long)addr->va;
struct mdp_vpu_dev *vpu =
diff --git a/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw.h 
b/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw.h
index 300363a40158c..2561b99c95871 100644
--- a/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw.h
+++ b/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw.h
@@ -23,7 +23,7 @@ enum mtk_vcodec_fw_use {
 
 struct mtk_vcodec_fw;
 
-typedef void (*mtk_vcodec_ipi_handler) (void *data,
+typedef void (*mtk_vcodec_ipi_handler) (const void *data,
unsigned int len, void *priv);
 
 struct mtk_vcodec_fw *mtk_vcodec_fw_select(void *priv, enum mtk_vcodec_fw_type 
type,
diff --git a/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c 
b/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c
index 9f6e4b59455da..4c34344dc7dcb 100644
--- a/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c
+++ b/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c
@@ -29,15 +29,7 @@ static int mtk_vcodec_vpu_set_ipi_register(struct 
mtk_vcodec_fw *fw, int id,
   mtk_vcodec_ipi_handler handler,
   const char *name, void *priv)
 {
-   /*
-* The handler we receive takes a void * as its first argument. We
-* cannot change this because it needs to be passed down to the rproc
-* subsystem when SCP is used. VPU takes a const argument, which is
-* more constrained, so the conversion below is safe.
-*/
-   ipi_handler_t handler_const = (ipi_handler_t)handler;
-
-   return vpu_ipi_register(fw->pdev, id, handler_const, name, priv);
+   return vpu_ipi_register(fw->pdev, id, handler, name, priv);
 }
 
 static int mtk_vcodec_vpu_ipi_send(struct mtk_vcodec_fw *fw, int id, void *buf,
diff --git a/drivers/media/platform/mediatek/vcodec/decoder/vdec_vpu_i
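
For reference, a minimal stand-alone reproducer of the warning class these
three patches fix: with clang's -Wcast-function-type-strict, casting between
function pointer types that differ only in a parameter's const-ness warns,
so the fix is to give the handler the exact type the registration expects.

typedef void (*ipi_handler_t)(const void *data, unsigned int len, void *priv);

static void handler(void *data, unsigned int len, void *priv) { }

static void handler_fixed(const void *data, unsigned int len, void *priv) { }

void example(void)
{
	ipi_handler_t bad  = (ipi_handler_t)handler;	/* warns */
	ipi_handler_t good = handler_fixed;		/* no cast needed */
	(void)bad; (void)good;
}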

[PATCH v2 1/3] media: pci: sta2x11: Fix Wcast-function-type-strict warnings

2024-02-26 Thread Ricardo Ribalda
Building with LLVM=1 throws the following warning:
drivers/media/pci/sta2x11/sta2x11_vip.c:1057:6: warning: cast from 'irqreturn_t 
(*)(int, struct sta2x11_vip *)' (aka 'enum irqreturn (*)(int, struct 
sta2x11_vip *)') to 'irq_handler_t' (aka 'enum irqreturn (*)(int, void *)') 
converts to incompatible function type [-Wcast-function-type-strict]

Reviewed-by: Nathan Chancellor 
Signed-off-by: Ricardo Ribalda 
---
 drivers/media/pci/sta2x11/sta2x11_vip.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/media/pci/sta2x11/sta2x11_vip.c 
b/drivers/media/pci/sta2x11/sta2x11_vip.c
index e4cf9d63e926d..a6456673be3f6 100644
--- a/drivers/media/pci/sta2x11/sta2x11_vip.c
+++ b/drivers/media/pci/sta2x11/sta2x11_vip.c
@@ -757,7 +757,7 @@ static const struct video_device video_dev_template = {
 /**
  * vip_irq - interrupt routine
  * @irq: Number of interrupt ( not used, correct number is assumed )
- * @vip: local data structure containing all information
+ * @data: local data structure containing all information
  *
  * check for both frame interrupts set ( top and bottom ).
  * check FIFO overflow, but limit number of log messages after open.
@@ -767,8 +767,9 @@ static const struct video_device video_dev_template = {
  *
  * IRQ_HANDLED, interrupt done.
  */
-static irqreturn_t vip_irq(int irq, struct sta2x11_vip *vip)
+static irqreturn_t vip_irq(int irq, void *data)
 {
+   struct sta2x11_vip *vip = data;
unsigned int status;
 
status = reg_read(vip, DVP_ITS);

-- 
2.44.0.rc0.258.g7320e95886-goog




Re: [PATCH v2 3/3] media: mediatek: vcodec: Fix Wcast-function-type-strict warnings

2024-02-26 Thread Ricardo Ribalda
Hi

On Mon, 26 Feb 2024 at 18:32, Ricardo Ribalda  wrote:
>
> Building with LLVM=1 throws the following warning:
> drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c:38:32: 
> warning: cast from 'mtk_vcodec_ipi_handler' (aka 'void (*)(void *, unsigned 
> int, void *)') to 'ipi_handler_t' (aka 'void (*)(const void *, unsigned int, 
> void *)') converts to incompatible function type [-Wcast-function-type-strict]
>
> Constify the types to avoid the warning.

Please note that this has only been compile tested.


>
> Signed-off-by: Ricardo Ribalda 
> ---
>  drivers/media/platform/mediatek/mdp3/mtk-mdp3-vpu.c  | 12 
> ++--
>  .../media/platform/mediatek/vcodec/common/mtk_vcodec_fw.h|  2 +-
>  .../platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c  | 10 +-
>  drivers/media/platform/mediatek/vcodec/decoder/vdec_vpu_if.c |  2 +-
>  drivers/media/platform/mediatek/vcodec/encoder/venc_vpu_if.c |  2 +-
>  drivers/remoteproc/mtk_scp.c |  4 ++--
>  include/linux/remoteproc/mtk_scp.h   |  2 +-
>  include/linux/rpmsg/mtk_rpmsg.h  |  2 +-
>  8 files changed, 14 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/media/platform/mediatek/mdp3/mtk-mdp3-vpu.c 
> b/drivers/media/platform/mediatek/mdp3/mtk-mdp3-vpu.c
> index 49fc2e9d45dd5..c4f1c49b9d52a 100644
> --- a/drivers/media/platform/mediatek/mdp3/mtk-mdp3-vpu.c
> +++ b/drivers/media/platform/mediatek/mdp3/mtk-mdp3-vpu.c
> @@ -77,10 +77,10 @@ void mdp_vpu_shared_mem_free(struct mdp_vpu_dev *vpu)
> dma_free_wc(dev, vpu->config_size, vpu->config, 
> vpu->config_addr);
>  }
>
> -static void mdp_vpu_ipi_handle_init_ack(void *data, unsigned int len,
> +static void mdp_vpu_ipi_handle_init_ack(const void *data, unsigned int len,
> void *priv)
>  {
> -   struct mdp_ipi_init_msg *msg = (struct mdp_ipi_init_msg *)data;
> +   const struct mdp_ipi_init_msg *msg = data;
> struct mdp_vpu_dev *vpu =
> (struct mdp_vpu_dev *)(unsigned long)msg->drv_data;
>
> @@ -91,10 +91,10 @@ static void mdp_vpu_ipi_handle_init_ack(void *data, 
> unsigned int len,
> complete(&vpu->ipi_acked);
>  }
>
> -static void mdp_vpu_ipi_handle_deinit_ack(void *data, unsigned int len,
> +static void mdp_vpu_ipi_handle_deinit_ack(const void *data, unsigned int len,
>   void *priv)
>  {
> -   struct mdp_ipi_deinit_msg *msg = (struct mdp_ipi_deinit_msg *)data;
> +   const struct mdp_ipi_deinit_msg *msg = data;
> struct mdp_vpu_dev *vpu =
> (struct mdp_vpu_dev *)(unsigned long)msg->drv_data;
>
> @@ -102,10 +102,10 @@ static void mdp_vpu_ipi_handle_deinit_ack(void *data, 
> unsigned int len,
> complete(&vpu->ipi_acked);
>  }
>
> -static void mdp_vpu_ipi_handle_frame_ack(void *data, unsigned int len,
> +static void mdp_vpu_ipi_handle_frame_ack(const void *data, unsigned int len,
>  void *priv)
>  {
> -   struct img_sw_addr *addr = (struct img_sw_addr *)data;
> +   const struct img_sw_addr *addr = data;
> struct img_ipi_frameparam *param =
> (struct img_ipi_frameparam *)(unsigned long)addr->va;
> struct mdp_vpu_dev *vpu =
> diff --git a/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw.h 
> b/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw.h
> index 300363a40158c..2561b99c95871 100644
> --- a/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw.h
> +++ b/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw.h
> @@ -23,7 +23,7 @@ enum mtk_vcodec_fw_use {
>
>  struct mtk_vcodec_fw;
>
> -typedef void (*mtk_vcodec_ipi_handler) (void *data,
> +typedef void (*mtk_vcodec_ipi_handler) (const void *data,
> unsigned int len, void *priv);
>
>  struct mtk_vcodec_fw *mtk_vcodec_fw_select(void *priv, enum 
> mtk_vcodec_fw_type type,
> diff --git 
> a/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c 
> b/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c
> index 9f6e4b59455da..4c34344dc7dcb 100644
> --- a/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c
> +++ b/drivers/media/platform/mediatek/vcodec/common/mtk_vcodec_fw_vpu.c
> @@ -29,15 +29,7 @@ static int mtk_vcodec_vpu_set_ipi_register(struct 
> mtk_vcodec_fw *fw, int id,
>mtk_vcodec_ipi_handler handler,
>const char *name, void *priv)
>  {
> -   /*
> -* The handler we receive takes a void * as its first argument. We
> -* cannot change this because it needs to be passed down to the rproc
> -* subsystem when SCP is used. VPU takes a const argument, which is
> -* more constrained, so the conversion below is safe.
> -*/
> -   ipi_handler_t handler_const = (ipi_handler_t)handler;

[PATCH v2 2/3] media: usb: pvrusb2: Fix Wcast-function-type-strict warnings

2024-02-26 Thread Ricardo Ribalda
Building with LLVM=1 throws the following warning:
drivers/media/usb/pvrusb2/pvrusb2-context.c:110:6: warning: cast from 'void 
(*)(struct pvr2_context *)' to 'void (*)(void *)' converts to incompatible 
function type [-Wcast-function-type-strict]
drivers/media/usb/pvrusb2/pvrusb2-v4l2.c:1070:30: warning: cast from 'void 
(*)(struct pvr2_v4l2_fh *)' to 'pvr2_stream_callback' (aka 'void (*)(void *)') 
converts to incompatible function type [-Wcast-function-type-strict] 
drivers/media/usb/pvrusb2/pvrusb2-dvb.c:152:6: warning: cast from 'void 
(*)(struct pvr2_dvb_adapter *)' to 'pvr2_stream_callback' (aka 'void (*)(void 
*)') converts to incompatible function type [-Wcast-function-type-strict]

Reviewed-by: Nathan Chancellor 
Signed-off-by: Ricardo Ribalda 
---
 drivers/media/usb/pvrusb2/pvrusb2-context.c | 5 +++--
 drivers/media/usb/pvrusb2/pvrusb2-dvb.c | 7 ---
 drivers/media/usb/pvrusb2/pvrusb2-v4l2.c| 7 ---
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/media/usb/pvrusb2/pvrusb2-context.c 
b/drivers/media/usb/pvrusb2/pvrusb2-context.c
index 1764674de98bc..16edabda191c4 100644
--- a/drivers/media/usb/pvrusb2/pvrusb2-context.c
+++ b/drivers/media/usb/pvrusb2/pvrusb2-context.c
@@ -90,8 +90,9 @@ static void pvr2_context_destroy(struct pvr2_context *mp)
 }
 
 
-static void pvr2_context_notify(struct pvr2_context *mp)
+static void pvr2_context_notify(void *data)
 {
+   struct pvr2_context *mp = data;
pvr2_context_set_notify(mp,!0);
 }
 
@@ -107,7 +108,7 @@ static void pvr2_context_check(struct pvr2_context *mp)
   "pvr2_context %p (initialize)", mp);
/* Finish hardware initialization */
if (pvr2_hdw_initialize(mp->hdw,
-   (void (*)(void *))pvr2_context_notify,
+   pvr2_context_notify,
mp)) {
mp->video_stream.stream =
pvr2_hdw_get_video_stream(mp->hdw);
diff --git a/drivers/media/usb/pvrusb2/pvrusb2-dvb.c 
b/drivers/media/usb/pvrusb2/pvrusb2-dvb.c
index 26811efe0fb58..8b9f1a09bd53d 100644
--- a/drivers/media/usb/pvrusb2/pvrusb2-dvb.c
+++ b/drivers/media/usb/pvrusb2/pvrusb2-dvb.c
@@ -88,8 +88,9 @@ static int pvr2_dvb_feed_thread(void *data)
return stat;
 }
 
-static void pvr2_dvb_notify(struct pvr2_dvb_adapter *adap)
+static void pvr2_dvb_notify(void *data)
 {
+   struct pvr2_dvb_adapter *adap = data;
wake_up(&adap->buffer_wait_data);
 }
 
@@ -148,8 +149,8 @@ static int pvr2_dvb_stream_do_start(struct pvr2_dvb_adapter 
*adap)
if (!(adap->buffer_storage[idx])) return -ENOMEM;
}
 
-   pvr2_stream_set_callback(pvr->video_stream.stream,
-(pvr2_stream_callback) pvr2_dvb_notify, adap);
+   pvr2_stream_set_callback(pvr->video_stream.stream, pvr2_dvb_notify,
+adap);
 
ret = pvr2_stream_set_buffer_count(stream, PVR2_DVB_BUFFER_COUNT);
if (ret < 0) return ret;
diff --git a/drivers/media/usb/pvrusb2/pvrusb2-v4l2.c 
b/drivers/media/usb/pvrusb2/pvrusb2-v4l2.c
index c04ab7258d645..590f80949bbfc 100644
--- a/drivers/media/usb/pvrusb2/pvrusb2-v4l2.c
+++ b/drivers/media/usb/pvrusb2/pvrusb2-v4l2.c
@@ -1032,9 +1032,10 @@ static int pvr2_v4l2_open(struct file *file)
return 0;
 }
 
-
-static void pvr2_v4l2_notify(struct pvr2_v4l2_fh *fhp)
+static void pvr2_v4l2_notify(void *data)
 {
+   struct pvr2_v4l2_fh *fhp = data;
+
wake_up(&fhp->wait_data);
 }
 
@@ -1067,7 +1068,7 @@ static int pvr2_v4l2_iosetup(struct pvr2_v4l2_fh *fh)
 
hdw = fh->channel.mc_head->hdw;
sp = fh->pdi->stream->stream;
-   pvr2_stream_set_callback(sp,(pvr2_stream_callback)pvr2_v4l2_notify,fh);
+   pvr2_stream_set_callback(sp, pvr2_v4l2_notify, fh);
pvr2_hdw_set_stream_type(hdw,fh->pdi->config);
if ((ret = pvr2_hdw_set_streaming(hdw,!0)) < 0) return ret;
return pvr2_ioread_set_enabled(fh->rhp,!0);

-- 
2.44.0.rc0.258.g7320e95886-goog




Re: [PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality

2024-02-26 Thread Michal Koutný
On Mon, Feb 05, 2024 at 01:06:27PM -0800, Haitao Huang 
 wrote:
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
> +
> +const struct misc_res_ops sgx_epc_cgroup_ops = {
> + .alloc = sgx_epc_cgroup_alloc,
> + .free = sgx_epc_cgroup_free,
> +};
> +
> +static void sgx_epc_misc_init(struct misc_cg *cg, struct sgx_epc_cgroup 
> *epc_cg)
> +{
> + cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> + epc_cg->cg = cg;
> +}

This is possibly a nitpick, but I share it here for consideration.

Would it be more prudent to have the signature like
  alloc(struct misc_res *res, struct misc_cg *cg)
so that implementations are free of the assumption of how cg and res are
stored?
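
i.e., something like the following sketch (reusing names from the patch and
assuming struct misc_res is the per-resource slot embedded in misc_cg; the
free callback would change analogously):

static int sgx_epc_cgroup_alloc(struct misc_res *res, struct misc_cg *cg)
{
	struct sgx_epc_cgroup *epc_cg;

	epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
	if (!epc_cg)
		return -ENOMEM;

	/* No assumption about cg->res[MISC_CG_RES_SGX_EPC] indexing. */
	res->priv = epc_cg;
	epc_cg->cg = cg;
	return 0;
}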


Thanks,
Michal




Re: [PATCH] tracefs: remove SLAB_MEM_SPREAD flag usage

2024-02-26 Thread Steven Rostedt
On Sat, 24 Feb 2024 13:52:06 +
chengming.z...@linux.dev wrote:

> From: Chengming Zhou 
> 
> The SLAB_MEM_SPREAD flag is already a no-op as of 6.8-rc1, remove
> its usage so we can delete it from slab. No functional change.
> 
> Signed-off-by: Chengming Zhou 

Queued.

Thanks!

-- Steve

> ---
>  fs/tracefs/inode.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> index d65ffad4c327..5545e6bf7d26 100644
> --- a/fs/tracefs/inode.c
> +++ b/fs/tracefs/inode.c
> @@ -731,7 +731,6 @@ static int __init tracefs_init(void)
>   tracefs_inode_cachep = kmem_cache_create("tracefs_inode_cache",
>sizeof(struct tracefs_inode),
>0, (SLAB_RECLAIM_ACCOUNT|
> -  SLAB_MEM_SPREAD|
>SLAB_ACCOUNT),
>init_once);
>   if (!tracefs_inode_cachep)




Re: [PATCH] ring-buffer: use READ_ONCE() to read cpu_buffer->commit_page in concurrent environment

2024-02-26 Thread Steven Rostedt
On Sun, 25 Feb 2024 15:03:02 -0500
Steven Rostedt  wrote:

> *But* looking at this deeper, the commit_page may need a READ_ONCE()
> but not for the reason you suggested.
> 
> commit_page = cpu_buffer->commit_page;
> commit_ts = commit_page->page->time_stamp;
> 
> The commit_page above *is* used again, and we want commit_ts to be part
> of the commit_page that was originally read and not a second reading.
> 
> So, I think for the commit_page we do need a READ_ONCE() but that's
> because it is referenced again just below it and we don't want the
> compiler to read the memory location again for the second reference.

Care to send a v2 patch that just adds READ_ONCE() for the commit_page, and
update the changelog to state that it is to fix the possibility that the
time_stamp may come from a different page?
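
For reference, a stand-alone sketch of the fix (the types are illustrative
stand-ins for the real ring-buffer structures):

	struct buffer_page *commit_page;
	u64 commit_ts;

	/*
	 * Without READ_ONCE() the compiler may reload
	 * cpu_buffer->commit_page for the second use, so commit_ts
	 * could be taken from a different page than commit_page.
	 */
	commit_page = READ_ONCE(cpu_buffer->commit_page);
	commit_ts = commit_page->page->time_stamp;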

-- Steve



Re: [PATCH 1/3] dt-bindings: clock: keystone: remove unstable remark

2024-02-26 Thread Rob Herring


On Sat, 24 Feb 2024 10:12:34 +0100, Krzysztof Kozlowski wrote:
> Keystone clock controller bindings were marked as work-in-progress /
> unstable in 2013 in commit b9e0d40c0d83 ("clk: keystone: add Keystone
> PLL clock driver") and commit 7affe5685c96 ("clk: keystone: Add gate
> control clock driver") Almost eleven years is enough, so drop the
> "unstable" remark and expect usual ABI rules.
> 
> Signed-off-by: Krzysztof Kozlowski 
> ---
>  Documentation/devicetree/bindings/clock/keystone-gate.txt | 2 --
>  Documentation/devicetree/bindings/clock/keystone-pll.txt  | 2 --
>  2 files changed, 4 deletions(-)
> 

Acked-by: Rob Herring 




Re: [PATCH 2/3] dt-bindings: clock: ti: remove unstable remark

2024-02-26 Thread Rob Herring


On Sat, 24 Feb 2024 10:12:35 +0100, Krzysztof Kozlowski wrote:
> Several TI SoC clock bindings were marked as work-in-progress / unstable
> between 2013-2016, for example in commit f60b1ea5ea7a ("CLK: TI: add
> support for gate clock").  It was enough of time to consider them stable
> and expect usual ABI rules.
> 
> Signed-off-by: Krzysztof Kozlowski 
> ---
>  Documentation/devicetree/bindings/clock/ti/adpll.txt| 2 --
>  Documentation/devicetree/bindings/clock/ti/apll.txt | 2 --
>  Documentation/devicetree/bindings/clock/ti/autoidle.txt | 2 --
>  Documentation/devicetree/bindings/clock/ti/clockdomain.txt  | 2 --
>  Documentation/devicetree/bindings/clock/ti/composite.txt| 2 --
>  Documentation/devicetree/bindings/clock/ti/divider.txt  | 2 --
>  Documentation/devicetree/bindings/clock/ti/dpll.txt | 2 --
>  Documentation/devicetree/bindings/clock/ti/fapll.txt| 2 --
>  .../devicetree/bindings/clock/ti/fixed-factor-clock.txt | 2 --
>  Documentation/devicetree/bindings/clock/ti/gate.txt | 2 --
>  Documentation/devicetree/bindings/clock/ti/interface.txt| 2 --
>  Documentation/devicetree/bindings/clock/ti/mux.txt  | 2 --
>  12 files changed, 24 deletions(-)
> 

Acked-by: Rob Herring 




Re: [PATCH 3/3] dt-bindings: remoteproc: ti,davinci: remove unstable remark

2024-02-26 Thread Rob Herring


On Sat, 24 Feb 2024 10:12:36 +0100, Krzysztof Kozlowski wrote:
> TI Davinci remoteproc bindings were marked as work-in-progress /
> unstable in 2017 in commit ae67b8007816 ("dt-bindings: remoteproc: Add
> bindings for Davinci DSP processors"). Almost seven years is enough, so
> drop the "unstable" remark and expect usual ABI rules.
> 
> Signed-off-by: Krzysztof Kozlowski 
> ---
>  .../devicetree/bindings/remoteproc/ti,davinci-rproc.txt| 3 ---
>  1 file changed, 3 deletions(-)
> 

Acked-by: Rob Herring 




Re: [PATCH v5 3/6] LoongArch: KVM: Add cpucfg area for kvm hypervisor

2024-02-26 Thread Jiaxun Yang



On 26 February 2024, at 8:04 AM, maobibo wrote:
> On 2024/2/26 下午2:12, Huacai Chen wrote:
>> On Mon, Feb 26, 2024 at 10:04 AM maobibo  wrote:
>>>
>>>
>>>
>>> On 2024/2/24 下午5:13, Huacai Chen wrote:
 Hi, Bibo,

 On Thu, Feb 22, 2024 at 11:28 AM Bibo Mao  wrote:
>
> Instruction cpucfg can be used to get processor features. And there
> is trap exception when it is executed in VM mode, and also it is
> to provide cpu features to VM. On real hardware cpucfg area 0 - 20
> is used.  Here one specified area 0x4000 -- 0x40ff is used
> for KVM hypervisor to provide PV features, and the area can be extended
> for other hypervisors in future. This area will never be used for
> real HW, it is only used by software.
 After reading and thinking, I find that the hypercall method which is
 used in our production kernel is better than this cpucfg method,
 because hypercall is simpler and more straightforward, plus we don't
 worry about conflicting with the real hardware.
>>> No, I do not think so. cpucfg is simpler than hypercall; hypercall can
>>> only be in effect when the system runs in guest mode. In some scenarios,
>>> like TCG mode, hypercall is an illegal instruction, whereas cpucfg can work.
>> Nearly all architectures use hypercall except x86 for its historical
> Only x86 supports multiple hypervisors, and multiple hypervisors exist on
> x86 only. It is an advantage, not a historical reason.

I do believe that all this stuff should not be exposed to guest user space
for security reasons.

Also, different hypervisor implementations may have different PV feature
behavior; using hypcall to perform feature detection can pass more
information to help us cope with hypervisor diversity.
>
>> reasons. If we use CPUCFG, then the hypervisor information is
>> unnecessarily leaked to userspace, and this may be a security issue.
>> Meanwhile, I don't think TCG mode needs PV features.
> Besides PV features, there are other features that differ from real hw,
> such as virtio devices and the virtual interrupt controller.

Those are *device*-level information; they must be passed via firmware
interfaces to keep processor emulation sane.

Thanks

>
> Regards
> Bibo Mao
>
>> 
>> I consulted with Jiaxun before, and maybe he can give some more comments.
>> 
>>>
>>> Extioi virtualization extension will be added later, cpucfg can be used
>>> to get extioi features. It is unlikely that extioi driver depends on
>>> PARA_VIRT macro if hypercall is used to get features.
>> CPUCFG is per-core information; if we really need something about
>> extioi, it should be in iocsr (LOONGARCH_IOCSR_FEATURES).
>> 
>> 
>> Huacai
>> 
>>>
>>> Regards
>>> Bibo Mao
>>>

 Huacai

>
> Signed-off-by: Bibo Mao 
> ---
>arch/loongarch/include/asm/inst.h  |  1 +
>arch/loongarch/include/asm/loongarch.h | 10 ++
>arch/loongarch/kvm/exit.c  | 46 +-
>3 files changed, 41 insertions(+), 16 deletions(-)
>
> diff --git a/arch/loongarch/include/asm/inst.h 
> b/arch/loongarch/include/asm/inst.h
> index d8f637f9e400..ad120f924905 100644
> --- a/arch/loongarch/include/asm/inst.h
> +++ b/arch/loongarch/include/asm/inst.h
> @@ -67,6 +67,7 @@ enum reg2_op {
>   revhd_op= 0x11,
>   extwh_op= 0x16,
>   extwb_op= 0x17,
> +   cpucfg_op   = 0x1b,
>   iocsrrdb_op = 0x19200,
>   iocsrrdh_op = 0x19201,
>   iocsrrdw_op = 0x19202,
> diff --git a/arch/loongarch/include/asm/loongarch.h 
> b/arch/loongarch/include/asm/loongarch.h
> index 46366e783c84..a1d22e8b6f94 100644
> --- a/arch/loongarch/include/asm/loongarch.h
> +++ b/arch/loongarch/include/asm/loongarch.h
> @@ -158,6 +158,16 @@
>#define  CPUCFG48_VFPU_CG  BIT(2)
>#define  CPUCFG48_RAM_CG   BIT(3)
>
> +/*
> + * cpucfg index area: 0x4000 -- 0x40ff
> + * SW emulation for KVM hypervisor
> + */
> +#define CPUCFG_KVM_BASE		0x4000UL
> +#define CPUCFG_KVM_SIZE		0x100
> +#define CPUCFG_KVM_SIG CPUCFG_KVM_BASE
> +#define  KVM_SIGNATURE "KVM\0"
> +#define CPUCFG_KVM_FEATURE (CPUCFG_KVM_BASE + 4)
> +
>#ifndef __ASSEMBLY__
>
>/* CSR */
> diff --git a/arch/loongarch/kvm/exit.c b/arch/loongarch/kvm/exit.c
> index 923bbca9bd22..6a38fd59d86d 100644
> --- a/arch/loongarch/kvm/exit.c
> +++ b/arch/loongarch/kvm/exit.c
> @@ -206,10 +206,37 @@ int kvm_emu_idle(struct kvm_vcpu *vcpu)
>   return EMULATE_DONE;
>}
>
> -static int kvm_trap_handle_gspr(struct kvm_vcpu *vcpu)
> +static int kvm_emu_cpucfg(struct kvm_vcpu *vcpu, larch_inst inst)
>{
>   int rd, rj;
>   unsigned 

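
To make the detection flow concrete, a guest-side sketch against the proposed
window (read_cpucfg() is the existing LoongArch accessor; per the commit
message, real hardware never implements this area, so the signature check
simply fails there):

static bool kvm_para_available(void)
{
	unsigned int sig = read_cpucfg(CPUCFG_KVM_SIG);

	return !memcmp(&sig, KVM_SIGNATURE, 4);
}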
Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-26 Thread John Fastabend
Jason Wang wrote:
> On Fri, Feb 23, 2024 at 9:42 AM Xuan Zhuo  wrote:
> >
> > On Fri, 09 Feb 2024 13:57:25 +0100, Paolo Abeni  wrote:
> > > On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote:
> > > > On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni  wrote:
> > > > >
> > > > > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> > > > > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > > > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer 
> > > > > > > >  wrote:
> > > > > > > > > On 02/02/2024 13.11, Liang Chen wrote:
> > > > > > > [...]
> > > > > > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct 
> > > > > > > > > > xdp_buff *xdp)
> > > > > > > > > >   }
> > > > > > > > > >   }
> > > > > > > > > >
> > > > > > > > > > +static void virtnet_xdp_save_rx_hash(struct 
> > > > > > > > > > virtnet_xdp_buff *virtnet_xdp,
> > > > > > > > > > +  struct net_device *dev,
> > > > > > > > > > +  struct 
> > > > > > > > > > virtio_net_hdr_v1_hash *hdr_hash)
> > > > > > > > > > +{
> > > > > > > > > > + if (dev->features & NETIF_F_RXHASH) {
> > > > > > > > > > + virtnet_xdp->hash_value = 
> > > > > > > > > > hdr_hash->hash_value;
> > > > > > > > > > + virtnet_xdp->hash_report = 
> > > > > > > > > > hdr_hash->hash_report;
> > > > > > > > > > + }
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > >
> > > > > > > > > Would it be possible to store a pointer to hdr_hash in 
> > > > > > > > > virtnet_xdp_buff,
> > > > > > > > > with the purpose of delaying extracting this, until and only 
> > > > > > > > > if XDP
> > > > > > > > > bpf_prog calls the kfunc?
> > > > > > > > >
> > > > > > > >
> > > > > > > > That seems to be the way v1 works,
> > > > > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > > > > > > . But it was pointed out that the inline header may be 
> > > > > > > > overwritten by
> > > > > > > > the xdp prog, so the hash is copied out to maintain its 
> > > > > > > > integrity.
> > > > > > >
> > > > > > > Why? isn't XDP supposed to get write access only to the pkt
> > > > > > > contents/buffer?
> > > > > > >
> > > > > >
> > > > > > Normally, an XDP program accesses only the packet data. However,
> > > > > > there's also an XDP RX Metadata area, referenced by the data_meta
> > > > > > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> > > > > > point somewhere ahead of the data buffer, thereby granting the XDP
> > > > > > program access to the virtio header located immediately before the
> > > > >
> > > > > AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data 
> > > > > before
> > > > > xdp->data_hard_start:
> > > > >
> > > > > https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210
> > > > >
> > > > > and virtio net set such field after the virtio_net_hdr:
> > > > >
> > > > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
> > > > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420
> > > > >
> > > > > I don't see how the virtio hdr could be touched? Possibly even more
> > > > > important: if such thing is possible, I think is should be somewhat
> > > > > denied (for the same reason an H/W nic should prevent XDP from
> > > > > modifying its own buffer descriptor).
> > > >
> > > > Thank you for highlighting this concern. The header layout differs
> > > > slightly between small and mergeable mode. Taking 'mergeable mode' as
> > > > an example, after calling xdp_prepare_buff the layout of xdp_buff
> > > > would be as depicted in the diagram below,
> > > >
> > > >     buf
> > > >      |
> > > >      v
> > > > +--------------+---------------+---------+
> > > > | xdp headroom | virtio header | packet  |
> > > > | (256 bytes)  | (20 bytes)    | content |
> > > > +--------------+---------------+---------+
> > > > ^                              ^
> > > > |                              |
> > > > data_hard_start                data
> > > >                                data_meta
> > > >
> > > > If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little
> > > > towards 'data_hard_start', it would point to the inline header, thus
> > > > potentially allowing the XDP program to access the inline header.

Fairly late to the thread, sorry. Given the above layout, does it make sense
to just delay extraction to the kfunc as suggested above? Sure, the XDP
program could smash the entry in the virtio header, but this is already the
case for anything else there. A program writing over the virtio header is
likely buggy anyway. The worst that might happen is bad RSS values and
mappings?
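
A sketch of the delayed-extraction variant being discussed (field and helper
names are illustrative; the point is that the wrapper saves a pointer at
receive time and the kfunc dereferences it only if a program asks):

struct virtnet_xdp_buff {
	struct xdp_buff xdp;
	/* saved instead of copying hash_value/hash_report out */
	const struct virtio_net_hdr_v1_hash *hdr_hash;
};

static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
			       enum xdp_rss_hash_type *rss_type)
{
	const struct virtnet_xdp_buff *ctx = (void *)_ctx;

	if (!ctx->hdr_hash)
		return -ENODATA;

	*hash = le32_to_cpu(ctx->hdr_hash->hash_value);
	/* virtnet_xdp_rss_type() stands in for report-to-type mapping */
	*rss_type = virtnet_xdp_rss_type(le16_to_cpu(ctx->hdr_hash->hash_report));
	return 0;
}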

I like seeing more use cases for the hints though.

Thanks!
John

Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Haitao Huang

On Mon, 26 Feb 2024 05:36:02 -0600, Huang, Kai  wrote:


On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote:
On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai   
wrote:


>
>
> On 24/02/2024 6:00 am, Haitao Huang wrote:
> > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai 
> > wrote:
> >
> > > > >
> > > > Right. When code reaches to here, we already passed reclaim per
> > > > cgroup.
> > >
> > > Yes if try_charge() failed we must do pre-cgroup reclaim.
> > >
> > > > The cgroup may not at or reach limit but system has run out of
> > > > physical
> > > > EPC.
> > > >
> > >
> > > But after try_charge() we can still choose to reclaim from the current
> > > group, but not necessarily have to be global, right?  I am not sure
> > > whether I am missing something, but could you elaborate why we should
> > > choose to reclaim from the global?
> > >
> >  Once try_charge is done and returns zero that means the cgroup usage
> > is charged and it's not over usage limit. So you really can't reclaim
> > from that cgroup if allocation failed. The only thing you can do is to
> > reclaim globally.
>
> Sorry I still cannot establish the logic here.
>
> Let's say the sum of all cgroups is greater than the physical EPC, and
> enclave(s) in each cgroup could potentially fault w/o reaching the
> cgroup's limit.
>
> In this case, when enclave(s) in one cgroup faults, why we cannot
> reclaim from the current cgroup, but have to reclaim from global?
>
> Is there any real downside of the former, or do you just want to follow
> the reclaim logic w/o cgroup at all?
>
> IIUC, there's at least one advantage of reclaim from the current group,
> that faults of enclave(s) in one group won't impact other enclaves in
> other cgroups.  E.g., in this way other enclaves in other groups may
> never need to trigger faults.
>
> Or perhaps I am missing anything?
>
The use case here is that the user knows it's OK for group A to borrow some
pages from group B for some time without impacting performance much, and
vice versa. That's why the user is overcommitting, so the system can run
more enclaves/groups. Otherwise, if she is concerned about the impact of A
on B, she could lower the limit for A so it never interferes, or interferes
less, with B (assuming the lower limit is still high enough to run all
enclaves in A), and sacrifice some of A's performance. Or if she does not
want any interference between groups, just don't over-commit. So we don't
really lose anything here.


But if we reclaim from the same group, it seems we could enable a use case
that allows the admin to ensure a certain group won't be impacted at all,
while allowing other groups to over-commit?

E.g., let's say we have 100M physical EPC.  And let's say the admin wants to
run some performance-critical enclave(s) which cost 50M EPC w/o being
impacted.  The admin also wants to run other enclaves which could cost 100M
EPC in total, but EPC swapping among them is acceptable.

If we choose to reclaim from the current EPC cgroup, then it seems the admin
can achieve the above by setting up 2 groups with group1 having a 50M limit
and group2 having a 100M limit, and then running the performance-critical
enclave(s) in group1 and the others in group2?  Or am I missing anything?



The more important groups should have limits higher than or equal to their
peak usage to ensure no impact.
The less important groups should have limits lower than their peak usage to
avoid impacting higher-priority groups.

The limit is the maximum usage allowed.

By setting group2's limit to 100M, you are allowing it to use 100M. So as
soon as it gets up and consumes 100M, group1 cannot even load any enclave
if we only reclaim per-cgroup and do not do global reclaim.



If we choose to do global reclaim, then we cannot achieve that.



You can achieve this by setting group 2's limit to 50M. No need to
overcommit the system.
Group 2 will swap as soon as it hits 50M, which is the maximum it can
consume, so there is no impact to group 1.






In case of overcommitting, even if we always reclaim from the same cgroup
for each fault, one group may still interfere with the other: e.g., consider
an extreme case in which group A has used up almost all EPC at the time
group B has a fault; B has to fail allocation and kill enclaves.


If the admin allows group A to use almost all EPC, to me it's fair to say
he/she doesn't want to run anything inside B at all, and it is acceptable
for enclaves in B to be killed.


I don't think so. The user just knows that group A + B peak usages are
higher than system capacity, and she is OK with them sharing some of the
pages dynamically. So the kernel should allow one to borrow from the other
at a particular instant when one group has higher demand, and later do the
opposite. IOW, the favor goes both ways.


Haitao



Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Haitao Huang

Hi Dave,

On Mon, 26 Feb 2024 08:04:54 -0600, Dave Hansen   
wrote:



On 2/26/24 03:36, Huang, Kai wrote:
In case of overcommitting, even if we always reclaim from the same cgroup
for each fault, one group may still interfere with the other: e.g., consider
an extreme case in which group A has used up almost all EPC at the time
group B has a fault; B has to fail allocation and kill enclaves.
If the admin allows group A to use almost all EPC, to me it's fair to say
he/she doesn't want to run anything inside B at all, and it is acceptable
for enclaves in B to be killed.


Folks, I'm having a really hard time following this thread.  It sounds
like there's disagreement about when to do system-wide reclaim.  Could
someone remind me of the choices that we have?  (A proposed patch would
go a _long_ way to helping me understand)



In case of overcommitting, i.e., the sum of limits is greater than the EPC
capacity: if one group has a fault, and its usage is not above its own
limit (try_charge() passes), yet total usage of the system has exceeded
the capacity, the question is whether we do global reclaim or just reclaim
pages in the current faulting group.



Also, what does the core mm memcg code do?

I'm not sure. I'll try to find out, but it'd be appreciated if someone more
knowledgeable could comment on this. memcg also has a protection mechanism
(i.e., min/low settings) to guarantee some allocation per group, so its
approach might not be applicable to the misc controller here.



Last, what is the simplest (least amount of code) thing that the SGX
cgroup controller could implement here?




I still think the current approach of doing global reclaim is reasonable
and simple: try_charge() checks the cgroup limit and reclaims within the
group if needed, then does the EPC page allocation, reclaiming globally if
the allocation fails because global usage has reached the capacity.
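
For reference, a rough sketch of the allocation path described above, with
per-cgroup reclaim driven only by the cgroup limit and global reclaim only
by physical exhaustion. This is a simplified illustration of the flow, not
the series' actual code: the try_charge() call stands in for this
patchset's charging helper, the loop mirrors the existing
sgx_alloc_epc_page(), and error paths such as uncharging on failure are
omitted.

	static struct sgx_epc_page *sgx_alloc_epc_page_sketch(void *owner, bool reclaim)
	{
		struct sgx_epc_page *page;

		/* Charge the current cgroup; reclaims within it if its limit is hit. */
		if (sgx_epc_cgroup_try_charge())
			return ERR_PTR(-ENOMEM);

		for (;;) {
			page = __sgx_alloc_epc_page();	/* take from the global free list */
			if (!IS_ERR(page)) {
				page->owner = owner;
				break;
			}

			if (!reclaim) {
				page = ERR_PTR(-EBUSY);
				break;
			}

			/* Physical EPC exhausted: reclaim globally, then retry. */
			sgx_reclaim_pages();
			cond_resched();
		}

		return page;
	}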


I'm not sure how not doing global reclaim in this case would bring any
benefit. Please see my response to Kai's example cases.


Thanks
Haitao



Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Dave Hansen
On 2/26/24 13:48, Haitao Huang wrote:
> In case of overcommitting, i.e., the sum of limits being greater than the EPC
> capacity: if one group has a fault and its usage is not above its own
> limit (try_charge() passes), yet total usage of the system has exceeded
> the capacity, the question is whether we do global reclaim or just reclaim
> pages in the current faulting group.

I don't see _any_ reason to limit reclaim to the current faulting cgroup.

>> Last, what is the simplest (least amount of code) thing that the SGX
>> cgroup controller could implement here?
> 
> I still think the current approach of doing global reclaim is reasonable
> and simple: try_charge() checks the cgroup limit and reclaims within the
> group if needed, then does the EPC page allocation, reclaiming globally
> if the allocation fails because global usage has reached the capacity.
> 
> I'm not sure how not doing global reclaim in this case would bring
> any benefit.
I tend to agree.

Kai, I think your examples sound a little bit contrived.  Have actual
users expressed a strong intent for doing anything with this series
other than limiting bad actors from eating all the EPC?




Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Huang, Kai




On 27/02/2024 10:18 am, Haitao Huang wrote:
> On Mon, 26 Feb 2024 05:36:02 -0600, Huang, Kai  wrote:
> > On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote:
> > > On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai  wrote:


>
>
> On 24/02/2024 6:00 am, Haitao Huang wrote:
> > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai  wrote:
> >
> > > > >
> > > > Right. When code reaches here, we have already passed per-cgroup
> > > > reclaim.
> > >
> > > Yes, if try_charge() failed we must do per-cgroup reclaim.
> > >
> > > > The cgroup may not be at or have reached its limit, but the system
> > > > has run out of physical EPC.
> > > >
> > >
> > > But after try_charge() we can still choose to reclaim from the current
> > > group, and it does not necessarily have to be global, right?  I am not
> > > sure whether I am missing something, but could you elaborate why we
> > > should choose to reclaim from the global?
> > >
> > Once try_charge() is done and returns zero, that means the cgroup usage
> > is charged and it's not over the usage limit. So you really can't reclaim
> > from that cgroup if allocation failed. The only thing you can do is to
> > reclaim globally.
>
> Sorry, I still cannot establish the logic here.
>
> Let's say the sum of all cgroups is greater than the physical EPC, and
> enclave(s) in each cgroup could potentially fault w/o reaching the
> cgroup's limit.
>
> In this case, when enclave(s) in one cgroup fault, why can we not
> reclaim from the current cgroup, but have to reclaim from global?
>
> Is there any real downside to the former, or do you just want to follow
> the reclaim logic w/o cgroup at all?
>
> IIUC, there's at least one advantage of reclaiming from the current group:
> faults of enclave(s) in one group won't impact other enclaves in
> other cgroups.  E.g., in this way other enclaves in other groups may
> never need to trigger faults.
>
> Or perhaps I am missing anything?
>
The use case here is that the user knows it's OK for group A to borrow some
pages from group B for some time without impacting performance much, and
vice versa. That's why the user is overcommitting, so the system can run
more enclaves/groups. Otherwise, if she is concerned about the impact of A
on B, she could lower the limit for A so it never interferes, or interferes
less, with B (assuming the lower limit is still high enough to run all
enclaves in A), and sacrifice some of A's performance. Or if she does not
want any interference between groups, just don't over-commit. So we don't
really lose anything here.


But if we reclaim from the same group, it seems we could enable a use case
that allows the admin to ensure a certain group won't be impacted at all,
while allowing other groups to over-commit?

E.g., let's say we have 100M physical EPC.  And let's say the admin wants
to run some performance-critical enclave(s) which cost 50M EPC w/o being
impacted.  The admin also wants to run other enclaves which could cost 100M
EPC in total, but EPC swapping among them is acceptable.

If we choose to reclaim from the current EPC cgroup, then it seems the
admin can achieve the above by setting up 2 groups, with group1 having a
50M limit and group2 having a 100M limit, and then running
performance-critical enclave(s) in group1 and others in group2?  Or am I
missing anything?



The more important groups should have limits higher than or equal to their
peak usage to ensure no impact.


Yes.  But if you do global reclaim there's no guarantee of this regardless
of the limit setting.  It depends on the settings of the limits of other
groups.


The less important groups should have limits lower than their peak usage
to avoid impacting higher-priority groups.


Yeah, but depending on how low the limit is, try_charge() can still succeed
while physical EPC is already running out.


Are you saying we should always expect the admin to set the limits of the
groups so that they do not exceed the physical EPC?



The limit is the maximum usage allowed.

By setting group2's limit to 100M, you are allowing it to use 100M. So as
soon as it gets up and consumes 100M, group1 cannot even load any enclave
if we only reclaim per-cgroup and do not do global reclaim.


I kinda forgot, but I think SGX supports swapping out the EPC of an enclave
before EINIT?  Also, with SGX2 the initial enclave can take less EPC to be
loaded.





If we choose to do global reclaim, then we cannot achieve that.



You can achieve this by setting group2's limit to 50M. There is no need to
overcommit the system.
Group2 will swap as soon as it hits 50M, which is the maximum it can
consume, so there is no impact to group1.


Right.  We can achieve this by doing so.  But as said above, you are
depending on setting up the limits to get per-cgroup reclaim.


So, back to the question:

What is the downside of doing per-group reclaim when try_charge() succeeds
for the enclave but the EPC page allocation fails?


Could you give a complete answer as to why you chose to use global reclaim
for the above case?




Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Dave Hansen
On 2/26/24 14:24, Huang, Kai wrote:
> What is the downside of doing per-group reclaim when try_charge()
> succeeds for the enclave but the EPC page allocation fails?
> 
> Could you give a complete answer as to why you chose to use global reclaim
> for the above case?

There are literally two different limits at play.  There's the limit
that the cgroup imposes and then the actual physical limit.

Hitting the cgroup limit induces cgroup reclaim.

Hitting the physical limit induces global reclaim.

Maybe I'm just being dense, but I fail to understand why you would want
to entangle those two different concepts more than absolutely necessary.
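
As an illustration of keeping the two limits disentangled, the two triggers
can live in entirely separate places, roughly like the following sketch.
The helper names here (cg_epc_usage, cg_epc_limit, sgx_epc_cgroup_reclaim,
sgx_has_free_epc_pages) are hypothetical; only sgx_reclaim_pages() is the
existing global reclaimer.

	static void sgx_reclaim_if_needed_sketch(struct sgx_epc_cgroup *cg)
	{
		/* Trigger 1: the cgroup's own limit -- reclaim only within the cgroup. */
		if (cg_epc_usage(cg) >= cg_epc_limit(cg))
			sgx_epc_cgroup_reclaim(cg);

		/* Trigger 2: the physical EPC limit -- reclaim from the global LRU. */
		if (!sgx_has_free_epc_pages())
			sgx_reclaim_pages();
	}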



Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Huang, Kai





Kai, I think your examples sound a little bit contrived.  Have actual
users expressed a strong intent for doing anything with this series
other than limiting bad actors from eating all the EPC?

I am not sure about this.  I am also trying to get a full picture.

I asked because I didn't quite like the duplicated code change in
try_charge() in this patch and in sgx_alloc_epc_page().  I think with
per-group reclaim we can unify the code (I have even started to write it),
and I don't see the downside of doing so.


So I am trying to get the actual downside of doing per-cgroup reclaim, or
the full reason that we choose global reclaim.





Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Dave Hansen
On 2/26/24 14:34, Huang, Kai wrote:
> So I am trying to get the actual downside of doing per-cgroup reclaim or
> the full reason that we choose global reclaim.

Take the most extreme example:

	while (hit_global_sgx_limit())
		reclaim_from_this(cgroup);

You eventually end up with all of 'cgroup's pages gone and handed out to
other users on the system who stole them all.  Other users might cause
you to go over the global limit.  *They* should be paying part of the
cost, not just you and your cgroup.



Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Huang, Kai




On 27/02/2024 11:31 am, Dave Hansen wrote:

On 2/26/24 14:24, Huang, Kai wrote:

What is the downside of doing per-group reclaim when try_charge() succeeds
for the enclave but the EPC page allocation fails?

Could you give a complete answer as to why you chose to use global reclaim
for the above case?


There are literally two different limits at play.  There's the limit
that the cgroup imposes and then the actual physical limit.

Hitting the cgroup limit induces cgroup reclaim.

Hitting the physical limit induces global reclaim.

Maybe I'm just being dense, but I fail to understand why you would want
to entangle those two different concepts more than absolutely necessary.


OK.  Yes, I agree that doing per-cgroup reclaim when hitting the physical
limit would bring another layer of consideration about when to do global
reclaim, which is not necessary now.




Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Huang, Kai




On 27/02/2024 11:38 am, Dave Hansen wrote:

On 2/26/24 14:34, Huang, Kai wrote:

So I am trying to get the actual downside of doing per-cgroup reclaim or
the full reason that we choose global reclaim.


Take the most extreme example:

	while (hit_global_sgx_limit())
		reclaim_from_this(cgroup);

You eventually end up with all of 'cgroup's pages gone and handed out to
other users on the system who stole them all.  Other users might cause
you to go over the global limit.  *They* should be paying part of the
cost, not just you and your cgroup.


Yeah, likely we will need another layer of logic to decide when to do
global reclaim.  I agree that is a downside and is unnecessary for this
patchset.


Thanks for the comments.



[PATCH v4] modules: wait do_free_init correctly

2024-02-26 Thread Changbin Du
The synchronization here is to ensure the ordering of the freeing of a
module's init sections so that it happens before W+X checking. It is worth
noting that the freeing was happening; it is just that our sanity checkers
raced against the permission checkers, which assume init memory is already
gone.

Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved
calling do_free_init() into a global workqueue instead of relying on it
being called through call_rcu(..., do_free_init), which used to allow us
to call do_free_init() asynchronously after the end of a subsequent grace
period. The move to a global workqueue broke the guarantees for code which
needed to be sure the do_free_init() would complete with rcu_barrier().
To fix this, callers which used to rely on rcu_barrier() must now instead
use flush_work(&init_free_wq).

Without this fix, we could still encounter false-positive reports in W+X
checking, since rcu_barrier() cannot ensure the ordering anymore.

Even worse, the rcu_barrier() can introduce significant delay. Eric Chanudet
reported that the rcu_barrier introduces ~0.1s delay on a PREEMPT_RT kernel.

  [0.291444] Freeing unused kernel memory: 5568K
  [0.402442] Run /sbin/init as init process

With this fix, the above delay can be eliminated.

Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
Signed-off-by: Changbin Du 
Cc: Xiaoyi Su 
Cc: Eric Chanudet 
Cc: Luis Chamberlain 
Tested-by: Eric Chanudet 

---
v4:
  - polish commit msg. (Luis Chamberlain)
v3:
  - amend comment in do_init_module() and update commit msg.
v2:
  - fix compilation issue for no CONFIG_MODULES found by 0-DAY.
---
 include/linux/moduleloader.h | 8 ++++++++
 init/main.c  | 5 +++--
 kernel/module/main.c | 9 +++++++--
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 001b2ce83832..89b1e0ed9811 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -115,6 +115,14 @@ int module_finalize(const Elf_Ehdr *hdr,
const Elf_Shdr *sechdrs,
struct module *mod);
 
+#ifdef CONFIG_MODULES
+void flush_module_init_free_work(void);
+#else
+static inline void flush_module_init_free_work(void)
+{
+}
+#endif
+
 /* Any cleanup needed when module leaves. */
 void module_arch_cleanup(struct module *mod);
 
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..f0b7e21ac67f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@
 #include 
 #include 
 #include 
+#include <linux/moduleloader.h>
 #include 
 
 #include 
@@ -1402,11 +1403,11 @@ static void mark_readonly(void)
if (rodata_enabled) {
/*
 * load_module() results in W+X mappings, which are cleaned
-* up with call_rcu().  Let's make sure that queued work is
+* up with init_free_wq. Let's make sure that queued work is
 * flushed so that we don't hit false positives looking for
 * insecure pages which are W+X.
 */
-   rcu_barrier();
+   flush_module_init_free_work();
mark_rodata_ro();
rodata_test();
} else
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 36681911c05a..b0b99348e1a8 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2489,6 +2489,11 @@ static void do_free_init(struct work_struct *w)
}
 }
 
+void flush_module_init_free_work(void)
+{
+   flush_work(&init_free_wq);
+}
+
 #undef MODULE_PARAM_PREFIX
 #define MODULE_PARAM_PREFIX "module."
 /* Default value for module->async_probe_requested */
@@ -2593,8 +2598,8 @@ static noinline int do_init_module(struct module *mod)
 * Note that module_alloc() on most architectures creates W+X page
 * mappings which won't be cleaned up until do_free_init() runs.  Any
 * code such as mark_rodata_ro() which depends on those mappings to
-* be cleaned up needs to sync with the queued work - ie
-* rcu_barrier()
+* be cleaned up needs to sync with the queued work by invoking
+* flush_module_init_free_work().
 */
if (llist_add(&freeinit->node, &init_free_list))
schedule_work(&init_free_wq);
-- 
2.25.1




[PATCH] net/ipv4: add tracepoint for icmp_send

2024-02-26 Thread xu.xin16
From: xu xin 

Introduce a tracepoint for icmp_send, which can help users get more
detailed information conveniently when abnormal icmp events happen.

1. A usecase example:
=
When an application experiences packet loss due to an unreachable UDP
destination port, the kernel will send an exception message through the
icmp_send function. By adding a trace point for icmp_send, developers or
system administrators can easily obtain detailed information about the
UDP packet loss, including the type, code, source address, destination
address, source port, and destination port. This facilitates the
troubleshooting of packet loss issues, especially for complicated
network-service applications.

2. Operation Instructions:
==
Switch to the tracing directory.
cd /sys/kernel/debug/tracing
Filter for destination port unreachable.
echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable trace event.
echo 1 > events/icmp/icmp_send/enable

3. Result View:

 udp_client_erro-11370   [002] ...s.12   124.728002:  icmp_send:
icmp_send: type=3, code=3.From 127.0.0.1:41895 to 127.0.0.1: ulen=23
skbaddr=589b167a

Signed-off-by: He Peilin 
Reviewed-by: xu xin 
Reviewed-by: Yunkai Zhang 
Cc: Yang Yang 
Cc: Liu Chun 
Cc: Xuexin Jiang 
---
 include/trace/events/icmp.h | 57 +
 net/ipv4/icmp.c |  4 
 2 files changed, 61 insertions(+)
 create mode 100644 include/trace/events/icmp.h

diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
new file mode 100644
index ..3d9af5769bc3
--- /dev/null
+++ b/include/trace/events/icmp.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM icmp
+
+#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_ICMP_H
+
+#include 
+#include 
+
+TRACE_EVENT(icmp_send,
+
+   TP_PROTO(const struct sk_buff *skb, int type, int code),
+
+   TP_ARGS(skb, type, code),
+
+   TP_STRUCT__entry(
+   __field(__u16, sport)
+   __field(__u16, dport)
+   __field(unsigned short, ulen)
+   __field(const void *, skbaddr)
+   __field(int, type)
+   __field(int, code)
+   __array(__u8, saddr, 4)
+   __array(__u8, daddr, 4)
+   ),
+
+   TP_fast_assign(
+   // Get UDP header
+   struct udphdr *uh = udp_hdr(skb);
+   struct iphdr *iph = ip_hdr(skb);
+   __be32 *p32;
+
+   __entry->sport = ntohs(uh->source);
+   __entry->dport = ntohs(uh->dest);
+   __entry->ulen = ntohs(uh->len);
+   __entry->skbaddr = skb;
+   __entry->type = type;
+   __entry->code = code;
+
+   p32 = (__be32 *) __entry->saddr;
+   *p32 = iph->saddr;
+
+   p32 = (__be32 *) __entry->daddr;
+   *p32 =  iph->daddr;
+   ),
+
+   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u 
ulen=%d skbaddr=%p",
+   __entry->type, __entry->code,
+   __entry->saddr, __entry->sport, __entry->daddr,
+   __entry->dport, __entry->ulen, __entry->skbaddr)
+);
+
+#endif /* _TRACE_ICMP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e63a3bf99617..437bdb7e2650 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -92,6 +92,8 @@
 #include 
 #include 
 #include 
+#define CREATE_TRACE_POINTS
+#include <trace/events/icmp.h>

 /*
  * Build xmit assembly blocks
@@ -599,6 +601,8 @@ void __icmp_send(struct sk_buff *skb_in, int type, int 
code, __be32 info,
struct net *net;
struct sock *sk;

+   trace_icmp_send(skb_in, type, code);
+
if (!rt)
goto out;

-- 
2.15.2



Re: [PATCH v5 3/6] LoongArch: KVM: Add cpucfg area for kvm hypervisor

2024-02-26 Thread maobibo




On 2024/2/27 上午4:02, Jiaxun Yang wrote:



在2024年2月26日二月 上午8:04,maobibo写道:

On 2024/2/26 下午2:12, Huacai Chen wrote:

On Mon, Feb 26, 2024 at 10:04 AM maobibo  wrote:




On 2024/2/24 下午5:13, Huacai Chen wrote:

Hi, Bibo,

On Thu, Feb 22, 2024 at 11:28 AM Bibo Mao  wrote:


Instruction cpucfg can be used to get processor features, and there is a
trap exception when it is executed in VM mode, so it can also be used to
provide cpu features to the VM. On real hardware the cpucfg areas 0 - 20
are used.  Here one specified area 0x40000000 -- 0x400000ff is used for
the KVM hypervisor to provide PV features, and the area can be extended
for other hypervisors in the future. This area will never be used by
real HW; it is only used by software.

After reading and thinking, I find that the hypercall method which is
used in our production kernel is better than this cpucfg method,
because hypercall is simpler and more straightforward, plus we don't
need to worry about conflicting with the real hardware.

No, I do not think so. cpucfg is simpler than hypercall; hypercall can
only take effect when the system runs in guest mode. In some scenarios,
like TCG mode, hypercall is an illegal instruction, whereas cpucfg can work.

Nearly all architectures use hypercall except x86 for its historical

Only x86 supports multiple hypervisors, and multiple hypervisors exist on
x86 only. That is an advantage, not a historical reason.


I do believe that all that stuff should not be exposed to guest user space
for security reasons.
Can you add PLV checking when the cpucfg 0x40000000-0x400000FF area is
emulated?  If the guest is in user mode the return value is zero, and if
it is in kernel mode the emulated value will be returned. That can avoid
information leaking.
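
Something like the following minimal sketch of that idea. This is
illustrative only: kvm_guest_plv() and the pv_features field are
hypothetical names, not part of this series.

	static unsigned long kvm_emu_cpucfg_kvm_area(struct kvm_vcpu *vcpu,
						     unsigned long index)
	{
		/*
		 * kvm_guest_plv() is a hypothetical helper returning the
		 * guest's CSR.CRMD.PLV at the time of the trap: 0 is kernel
		 * mode, 3 is user mode on LoongArch.
		 */
		if (kvm_guest_plv(vcpu) != 0)
			return 0;	/* guest user mode: expose nothing */

		switch (index) {
		case CPUCFG_KVM_SIG:
			/* "KVM\0" signature word for hypervisor detection */
			return *(unsigned int *)KVM_SIGNATURE;
		case CPUCFG_KVM_FEATURE:
			return vcpu->kvm->arch.pv_features;	/* illustrative field */
		default:
			return 0;
		}
	}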




Also, different hypervisor implementations may have different PV feature
behavior; using hypcall to perform feature detection can pass more
information to help us cope with hypervisor diversity.
How can different hypervisors be detected in the first place?  On x86,
MSRs are used for detecting all hypervisors; on ARM64, Hyper-V uses
acpi_gbl_FADT and KVM forcibly uses smc, since host mode can execute the
smc instruction without an exception on ARM64.


I do not know why hypercall is better than cpucfg on LoongArch. cpucfg is
a basic instruction, however hypercall is not; it is part of the LVZ feature.





reasons. If we use CPUCFG, then the hypervisor information is
unnecessarily leaked to userspace, and this may be a security issue.
Meanwhile, I don't think TCG mode needs PV features.

Besides PV features, there are other features different from real hw,
such as virtio devices and the virtual interrupt controller.


Those are *device*-level information; they must be passed through firmware
interfaces to keep processor emulation sane.
File arch/x86/hyperv/hv_apic.c can be referenced: apic features come from
ms_hyperv.hints and the HYPERV_CPUID_ENLIGHTMENT_INFO cpuid info, and do
not have to be passed through a firmware interface.


Regards
Bibo Mao


Thanks



Regards
Bibo Mao



I consulted with Jiaxun before, and maybe he can give some more comments.



Extioi virtualization extension will be added later; cpucfg can be used
to get extioi features. It is unlikely that the extioi driver would depend
on the PARA_VIRT macro if hypercall were used to get features.

CPUCFG is per-core information; if we really need something about
extioi, it should be in iocsr (LOONGARCH_IOCSR_FEATURES).


Huacai



Regards
Bibo Mao



Huacai



Signed-off-by: Bibo Mao 
---
arch/loongarch/include/asm/inst.h  |  1 +
arch/loongarch/include/asm/loongarch.h | 10 ++
arch/loongarch/kvm/exit.c  | 46 +-
3 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/arch/loongarch/include/asm/inst.h 
b/arch/loongarch/include/asm/inst.h
index d8f637f9e400..ad120f924905 100644
--- a/arch/loongarch/include/asm/inst.h
+++ b/arch/loongarch/include/asm/inst.h
@@ -67,6 +67,7 @@ enum reg2_op {
	revhd_op	= 0x11,
	extwh_op	= 0x16,
	extwb_op	= 0x17,
+	cpucfg_op	= 0x1b,
	iocsrrdb_op	= 0x19200,
	iocsrrdh_op	= 0x19201,
	iocsrrdw_op	= 0x19202,
diff --git a/arch/loongarch/include/asm/loongarch.h 
b/arch/loongarch/include/asm/loongarch.h
index 46366e783c84..a1d22e8b6f94 100644
--- a/arch/loongarch/include/asm/loongarch.h
+++ b/arch/loongarch/include/asm/loongarch.h
@@ -158,6 +158,16 @@
#define  CPUCFG48_VFPU_CG  BIT(2)
#define  CPUCFG48_RAM_CG   BIT(3)

+/*
+ * cpucfg index area: 0x40000000 -- 0x400000ff
+ * SW emulation for KVM hypervisor
+ */
+#define CPUCFG_KVM_BASE		0x40000000UL
+#define CPUCFG_KVM_SIZE		0x100
+#define CPUCFG_KVM_SIG		CPUCFG_KVM_BASE
+#define  KVM_SIGNATURE		"KVM\0"
+#define CPUCFG_KVM_FEATURE	(CPUCFG_KVM_BASE + 4)
+
#ifndef __ASSEMBLY__

/* CSR */
diff --git a/arch/loongarch/kvm/exit.c b/arch/loongarch/kvm/exit.c
index 923bbca9bd22..6a38fd59d86d 100644
-

Re: [PATCH net-next 1/2] net/vsockmon: Leverage core stats allocator

2024-02-26 Thread patchwork-bot+netdevbpf
Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski :

On Fri, 23 Feb 2024 03:58:37 -0800 you wrote:
> With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
> convert veth & vrf"), stats allocation could be done on net core
> instead of this driver.
> 
> With this new approach, the driver doesn't have to bother with error
> handling (allocation failure checking, making sure free happens in the
> right spot, etc). This is core responsibility now.
> 
> [...]

Here is the summary with links:
  - [net-next,1/2] net/vsockmon: Leverage core stats allocator
https://git.kernel.org/netdev/net-next/c/bcd53aff4d0c
  - [net-next,2/2] net/vsockmon: Do not set zeroed statistics
https://git.kernel.org/netdev/net-next/c/3a25e212306c

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html





Re: [PATCH] dt-bindings: remoteproc: qcom,glink-rpm-edge: drop redundant type from label

2024-02-26 Thread Rob Herring


On Mon, 26 Feb 2024 13:28:54 +0100, Krzysztof Kozlowski wrote:
> dtschema defines label as string, so $ref in other bindings is
> redundant.
> 
> Signed-off-by: Krzysztof Kozlowski 
> ---
>  .../devicetree/bindings/remoteproc/qcom,glink-rpm-edge.yaml  | 1 -
>  1 file changed, 1 deletion(-)
> 

Acked-by: Rob Herring 




Re: [PATCH v5 3/6] LoongArch: KVM: Add cpucfg area for kvm hypervisor

2024-02-26 Thread Jiaxun Yang



在2024年2月27日二月 上午3:14,maobibo写道:
> On 2024/2/27 上午4:02, Jiaxun Yang wrote:
>> 
>> 
>> 在2024年2月26日二月 上午8:04,maobibo写道:
>>> On 2024/2/26 下午2:12, Huacai Chen wrote:
 On Mon, Feb 26, 2024 at 10:04 AM maobibo  wrote:
>
>
>
> On 2024/2/24 下午5:13, Huacai Chen wrote:
>> Hi, Bibo,
>>
>> On Thu, Feb 22, 2024 at 11:28 AM Bibo Mao  wrote:
>>>
>>> Instruction cpucfg can be used to get processor features, and there
>>> is a trap exception when it is executed in VM mode, so it can also be
>>> used to provide cpu features to the VM. On real hardware the cpucfg
>>> areas 0 - 20 are used.  Here one specified area 0x40000000 -- 0x400000ff
>>> is used for the KVM hypervisor to provide PV features, and the area can
>>> be extended for other hypervisors in the future. This area will never
>>> be used by real HW; it is only used by software.
>> After reading and thinking, I find that the hypercall method which is
>> used in our productive kernel is better than this cpucfg method.
>> Because hypercall is more simple and straightforward, plus we don't
>> worry about conflicting with the real hardware.
> No, I do not think so. cpucfg is simpler than hypercall; hypercall can
> only take effect when the system runs in guest mode. In some scenarios,
> like TCG mode, hypercall is an illegal instruction, whereas cpucfg can work.
 Nearly all architectures use hypercall except x86 for its historical
>>> Only x86 supports multiple hypervisors, and multiple hypervisors exist
>>> on x86 only. That is an advantage, not a historical reason.
>> 
>> I do believe that all those stuff should not be exposed to guest user space
>> for security reasons.
> Can you add PLV checking when the cpucfg 0x40000000-0x400000FF area is
> emulated?  If the guest is in user mode the return value is zero, and if
> it is in kernel mode the emulated value will be returned. That can avoid
> information leaking.

Please don’t do insane stuff here; applications are not expecting an
exception from cpucfg.

>
>> 
>> Also, different hypervisor implementations may have different PV feature
>> behavior; using hypcall to perform feature detection can pass more
>> information to help us cope with hypervisor diversity.
> How can different hypervisors be detected in the first place?  On x86,
> MSRs are used for detecting all hypervisors; on ARM64, Hyper-V uses
> acpi_gbl_FADT and KVM forcibly uses smc, since host mode can execute the
> smc instruction without an exception on ARM64.

That’s a hypcall ABI design choice; that information can come from firmware
or privileged CSRs on LoongArch.

>
> I do not know why hypercall is better than cpucfg on LoongArch. cpucfg is
> a basic instruction, however hypercall is not; it is part of the LVZ feature.

KVM can only work with LVZ right?

>
>>>
 reasons. If we use CPUCFG, then the hypervisor information is
 unnecessarily leaked to userspace, and this may be a security issue.
 Meanwhile, I don't think TCG mode needs PV features.
>>> Besides PV features, there are other features different from real hw,
>>> such as virtio devices and the virtual interrupt controller.
>> 
>> Those are *device* level information, they must be passed in firmware
>> interfaces to keep processor emulation sane.
> File arch/x86/hyperv/hv_apic.c can be referenced: apic features come from
> ms_hyperv.hints and the HYPERV_CPUID_ENLIGHTMENT_INFO cpuid info, and do
> not have to be passed through a firmware interface.

That’s not KVM, that’s Hyper-V. In the Linux kernel we enjoy the benefits
of better modularity in device abstractions; please don’t break it.

Thanks

>
> Regards
> Bibo Mao
>> 
>> Thanks
>> 
>>>
>>> Regards
>>> Bibo Mao
>>>

 I consulted with Jiaxun before, and maybe he can give some more comments.

>
> Extioi virtualization extension will be added later, cpucfg can be used
> to get extioi features. It is unlikely that extioi driver depends on
> PARA_VIRT macro if hypercall is used to get features.
 CPUCFG is per-core information, if we really need something about
 extioi, it should be in iocsr (LOONGARCH_IOCSR_FEATURES).


 Huacai

>
> Regards
> Bibo Mao
>
>>
>> Huacai
>>
>>>
>>> Signed-off-by: Bibo Mao 
>>> ---
>>> arch/loongarch/include/asm/inst.h  |  1 +
>>> arch/loongarch/include/asm/loongarch.h | 10 ++
>>> arch/loongarch/kvm/exit.c  | 46 +-
>>> 3 files changed, 41 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/arch/loongarch/include/asm/inst.h 
>>> b/arch/loongarch/include/asm/inst.h
>>> index d8f637f9e400..ad120f924905 100644
>>> --- a/arch/loongarch/include/asm/inst.h
>>> +++ b/arch/loongarch/include/asm/inst.h
>>> @@ -67,6 +67,7 @@ enum reg2_op {
>>>revhd_op= 0x11,
>>>extwh_op= 0x16,
>>>extwb_op= 0x17,
>>> +   cpucfg_op   = 0x1b,
>>>iocsrrdb_op = 0x19200,
>

Re: [PATCH] net/ipv4: add tracepoint for icmp_send

2024-02-26 Thread Eric Dumazet
On Tue, Feb 27, 2024 at 3:50 AM  wrote:
>
> From: xu xin 
>
> Introduce a tracepoint for icmp_send, which can help users to get more
> detail information conveniently when icmp abnormal events happen.
>
> 1. Giving an usecase example:
> =
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain the detailed information easily about the
> UDP packet loss, including the type, code, source address, destination
> address, source port, and destination port. This facilitates the
> trouble-shooting of packet loss issues especially for those complicated
> network-service applications.
>
> 2. Operation Instructions:
> ==
> Switch to the tracing directory.
> cd /sys/kernel/debug/tracing
> Filter for destination port unreachable.
> echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
> echo 1 > events/icmp/icmp_send/enable
>
> 3. Result View:
> 
>  udp_client_erro-11370   [002] ...s.12   124.728002:  icmp_send:
> icmp_send: type=3, code=3.From 127.0.0.1:41895 to 127.0.0.1: ulen=23
> skbaddr=589b167a
>
> Signed-off-by: He Peilin 
> Reviewed-by: xu xin 
> Reviewed-by: Yunkai Zhang 
> Cc: Yang Yang 
> Cc: Liu Chun 
> Cc: Xuexin Jiang 
> ---
>  include/trace/events/icmp.h | 57 
> +
>  net/ipv4/icmp.c |  4 
>  2 files changed, 61 insertions(+)
>  create mode 100644 include/trace/events/icmp.h
>
> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> new file mode 100644
> index ..3d9af5769bc3
> --- /dev/null
> +++ b/include/trace/events/icmp.h
> @@ -0,0 +1,57 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM icmp
> +
> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_ICMP_H
> +
> +#include 
> +#include 
> +
> +TRACE_EVENT(icmp_send,
> +
> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> +
> +   TP_ARGS(skb, type, code),
> +
> +   TP_STRUCT__entry(
> +   __field(__u16, sport)
> +   __field(__u16, dport)
> +   __field(unsigned short, ulen)
> +   __field(const void *, skbaddr)
> +   __field(int, type)
> +   __field(int, code)
> +   __array(__u8, saddr, 4)
> +   __array(__u8, daddr, 4)
> +   ),
> +
> +   TP_fast_assign(
> +   // Get UDP header
> +   struct udphdr *uh = udp_hdr(skb);
> +   struct iphdr *iph = ip_hdr(skb);
> +   __be32 *p32;
> +
> +   __entry->sport = ntohs(uh->source);
> +   __entry->dport = ntohs(uh->dest);
> +   __entry->ulen = ntohs(uh->len);
> +   __entry->skbaddr = skb;
> +   __entry->type = type;
> +   __entry->code = code;
> +
> +   p32 = (__be32 *) __entry->saddr;
> +   *p32 = iph->saddr;
> +
> +   p32 = (__be32 *) __entry->daddr;
> +   *p32 =  iph->daddr;
> +   ),
> +

FYI, ICMP can be generated for many other protocols than UDP.

> +   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to 
> %pI4:%u ulen=%d skbaddr=%p",
> +   __entry->type, __entry->code,
> +   __entry->saddr, __entry->sport, __entry->daddr,
> +   __entry->dport, __entry->ulen, __entry->skbaddr)
> +);
> +
> +#endif /* _TRACE_ICMP_H */
> +
> +/* This part must be outside protection */
> +#include 
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index e63a3bf99617..437bdb7e2650 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -92,6 +92,8 @@
>  #include 
>  #include 
>  #include 
> +#define CREATE_TRACE_POINTS
> +#include 
>
>  /*
>   * Build xmit assembly blocks
> @@ -599,6 +601,8 @@ void __icmp_send(struct sk_buff *skb_in, int type, int 
> code, __be32 info,
> struct net *net;
> struct sock *sk;
>
> +   trace_icmp_send(skb_in, type, code);

I think you missed many sanity checks between lines 622 and 676.

Honestly, a kprobe BPF based solution would be less risky, and less
maintenance for us.
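
For illustration, one way to address the protocol point above would be to
guard the UDP header access inside TP_fast_assign. A hedged sketch only,
not a tested patch, and it still would not cover the header-validity
checks done later in __icmp_send() that Eric mentions:

	TP_fast_assign(
		struct iphdr *iph = ip_hdr(skb);
		struct udphdr *uh = udp_hdr(skb);

		__entry->skbaddr = skb;
		__entry->type = type;
		__entry->code = code;

		/* Only dereference the UDP header when the payload is UDP. */
		if (iph->protocol == IPPROTO_UDP) {
			__entry->sport = ntohs(uh->source);
			__entry->dport = ntohs(uh->dest);
			__entry->ulen = ntohs(uh->len);
		} else {
			__entry->sport = 0;
			__entry->dport = 0;
			__entry->ulen = 0;
		}

		memcpy(__entry->saddr, &iph->saddr, 4);
		memcpy(__entry->daddr, &iph->daddr, 4);
	),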



Re: [PATCH] net/ipv4: add tracepoint for icmp_send

2024-02-26 Thread Jason Xing
On Tue, Feb 27, 2024 at 1:49 PM Eric Dumazet  wrote:
>
> On Tue, Feb 27, 2024 at 3:50 AM  wrote:
> >
> > From: xu xin 
> >
> > Introduce a tracepoint for icmp_send, which can help users to get more
> > detail information conveniently when icmp abnormal events happen.
> >
> > 1. Giving an usecase example:
> > =
> > When an application experiences packet loss due to an unreachable UDP
> > destination port, the kernel will send an exception message through the
> > icmp_send function. By adding a trace point for icmp_send, developers or
> > system administrators can obtain the detailed information easily about the
> > UDP packet loss, including the type, code, source address, destination
> > address, source port, and destination port. This facilitates the
> > trouble-shooting of packet loss issues especially for those complicated
> > network-service applications.
> >
> > 2. Operation Instructions:
> > ==
> > Switch to the tracing directory.
> > cd /sys/kernel/debug/tracing
> > Filter for destination port unreachable.
> > echo "type==3 && code==3" > events/icmp/icmp_send/filter
> > Enable trace event.
> > echo 1 > events/icmp/icmp_send/enable
> >
> > 3. Result View:
> > 
> >  udp_client_erro-11370   [002] ...s.12   124.728002:  icmp_send:
> > icmp_send: type=3, code=3.From 127.0.0.1:41895 to 127.0.0.1: ulen=23
> > skbaddr=589b167a
> >
> > Signed-off-by: He Peilin 
> > Reviewed-by: xu xin 
> > Reviewed-by: Yunkai Zhang 
> > Cc: Yang Yang 
> > Cc: Liu Chun 
> > Cc: Xuexin Jiang 
> > ---
> >  include/trace/events/icmp.h | 57 
> > +
> >  net/ipv4/icmp.c |  4 
> >  2 files changed, 61 insertions(+)
> >  create mode 100644 include/trace/events/icmp.h
> >
> > diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> > new file mode 100644
> > index ..3d9af5769bc3
> > --- /dev/null
> > +++ b/include/trace/events/icmp.h
> > @@ -0,0 +1,57 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM icmp
> > +
> > +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_ICMP_H
> > +
> > +#include 
> > +#include 
> > +
> > +TRACE_EVENT(icmp_send,
> > +
> > +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> > +
> > +   TP_ARGS(skb, type, code),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(__u16, sport)
> > +   __field(__u16, dport)
> > +   __field(unsigned short, ulen)
> > +   __field(const void *, skbaddr)
> > +   __field(int, type)
> > +   __field(int, code)
> > +   __array(__u8, saddr, 4)
> > +   __array(__u8, daddr, 4)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   // Get UDP header
> > +   struct udphdr *uh = udp_hdr(skb);
> > +   struct iphdr *iph = ip_hdr(skb);
> > +   __be32 *p32;
> > +
> > +   __entry->sport = ntohs(uh->source);
> > +   __entry->dport = ntohs(uh->dest);
> > +   __entry->ulen = ntohs(uh->len);
> > +   __entry->skbaddr = skb;
> > +   __entry->type = type;
> > +   __entry->code = code;
> > +
> > +   p32 = (__be32 *) __entry->saddr;
> > +   *p32 = iph->saddr;
> > +
> > +   p32 = (__be32 *) __entry->daddr;
> > +   *p32 =  iph->daddr;
> > +   ),
> > +
>
> FYI, ICMP can be generated for many other protocols than UDP.
>
> > +   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to 
> > %pI4:%u ulen=%d skbaddr=%p",
> > +   __entry->type, __entry->code,
> > +   __entry->saddr, __entry->sport, __entry->daddr,
> > +   __entry->dport, __entry->ulen, __entry->skbaddr)
> > +);
> > +
> > +#endif /* _TRACE_ICMP_H */
> > +
> > +/* This part must be outside protection */
> > +#include 
> > diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> > index e63a3bf99617..437bdb7e2650 100644
> > --- a/net/ipv4/icmp.c
> > +++ b/net/ipv4/icmp.c
> > @@ -92,6 +92,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#define CREATE_TRACE_POINTS
> > +#include 
> >
> >  /*
> >   * Build xmit assembly blocks
> > @@ -599,6 +601,8 @@ void __icmp_send(struct sk_buff *skb_in, int type, int 
> > code, __be32 info,
> > struct net *net;
> > struct sock *sk;
> >
> > +   trace_icmp_send(skb_in, type, code);

Re: [PATCH v5 3/6] LoongArch: KVM: Add cpucfg area for kvm hypervisor

2024-02-26 Thread maobibo




On 2024/2/27 下午1:23, Jiaxun Yang wrote:



在2024年2月27日二月 上午3:14,maobibo写道:

On 2024/2/27 上午4:02, Jiaxun Yang wrote:



在2024年2月26日二月 上午8:04,maobibo写道:

On 2024/2/26 下午2:12, Huacai Chen wrote:

On Mon, Feb 26, 2024 at 10:04 AM maobibo  wrote:




On 2024/2/24 下午5:13, Huacai Chen wrote:

Hi, Bibo,

On Thu, Feb 22, 2024 at 11:28 AM Bibo Mao  wrote:


Instruction cpucfg can be used to get processor features, and there is a
trap exception when it is executed in VM mode, so it can also be used to
provide cpu features to the VM. On real hardware the cpucfg areas 0 - 20
are used.  Here one specified area 0x40000000 -- 0x400000ff is used for
the KVM hypervisor to provide PV features, and the area can be extended
for other hypervisors in the future. This area will never be used by
real HW; it is only used by software.

After reading and thinking, I find that the hypercall method which is
used in our productive kernel is better than this cpucfg method.
Because hypercall is more simple and straightforward, plus we don't
worry about conflicting with the real hardware.

No, I do not think so. cpucfg is simpler than hypercall; hypercall can
only take effect when the system runs in guest mode. In some scenarios,
like TCG mode, hypercall is an illegal instruction, whereas cpucfg can work.

Nearly all architectures use hypercall except x86 for its historical

Only x86 supports multiple hypervisors, and multiple hypervisors exist on
x86 only. That is an advantage, not a historical reason.


I do believe that all those stuff should not be exposed to guest user space
for security reasons.

Can you add PLV checking when the cpucfg 0x40000000-0x400000FF area is
emulated?  If the guest is in user mode the return value is zero, and if
it is in kernel mode the emulated value will be returned. That can avoid
information leaking.


Please don’t do insane stuff here; applications are not expecting an
exception from cpucfg.
Sorry, I do not understand. Can you describe the expected behavior of the
cpucfg instruction for applications? Why would there be no exception for
cpucfg?






Also, different hypervisor implementations may have different PV feature
behavior; using hypcall to perform feature detection can pass more
information to help us cope with hypervisor diversity.

How can different hypervisors be detected in the first place?  On x86,
MSRs are used for detecting all hypervisors; on ARM64, Hyper-V uses
acpi_gbl_FADT and KVM forcibly uses smc, since host mode can execute the
smc instruction without an exception on ARM64.


That’s a hypcall ABI design choice; that information can come from firmware
or privileged CSRs on LoongArch.
Firstly, the firmware or privileged CSRs are not related to hypcall ABI
design choices.  With the CSR instruction, the CSR ID is encoded in the
instruction itself, and the CSR ID range is about 16K; for the cpucfg
instruction, the cpucfg area is passed in a register, so the range is at
least UINT_MAX.


It is difficult to find an area unused by HW for the CSR method because of
the small CSR ID range. However, it is easy for cpucfg. Here I doubt
whether you really know about LoongArch LVZ.






I do not know why hypercall is better than cpucfg on LoongArch. cpucfg is
a basic instruction, however hypercall is not; it is part of the LVZ feature.


KVM can only work with LVZ right?
The Linux kernel needs to boot well in both TCG and KVM mode; hypercall is
an illegal instruction in TCG mode.
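
As an illustration of that point, guest-side detection with the proposed
area can be a plain cpucfg read that degrades gracefully: under TCG or on
bare metal the unimplemented area reads as zero (per this series' stated
assumption), while a hypercall-based probe would trap. A sketch only,
using the read_cpucfg() accessor from asm/loongarch.h:

	static bool kvm_para_available(void)
	{
		/* "KVM\0" signature word emulated by KVM at CPUCFG_KVM_SIG */
		u32 sig = read_cpucfg(CPUCFG_KVM_SIG);

		return !memcmp(&sig, KVM_SIGNATURE, 4);
	}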


Regards
Bibo Mao







reasons. If we use CPUCFG, then the hypervisor information is
unnecessarily leaked to userspace, and this may be a security issue.
Meanwhile, I don't think TCG mode needs PV features.

Besides PV features, there are other features different from real hw,
such as virtio devices and the virtual interrupt controller.


Those are *device*-level information; they must be passed through firmware
interfaces to keep processor emulation sane.

File arch/x86/hyperv/hv_apic.c can be referenced: apic features come from
ms_hyperv.hints and the HYPERV_CPUID_ENLIGHTMENT_INFO cpuid info, and do
not have to be passed through a firmware interface.


That’s not KVM, that’s Hyper-V. In the Linux kernel we enjoy the benefits
of better modularity in device abstractions; please don’t break it.

Thanks



Regards
Bibo Mao


Thanks



Regards
Bibo Mao



I consulted with Jiaxun before, and maybe he can give some more comments.



Extioi virtualization extension will be added later; cpucfg can be used
to get extioi features. It is unlikely that the extioi driver would depend
on the PARA_VIRT macro if hypercall were used to get features.

CPUCFG is per-core information; if we really need something about
extioi, it should be in iocsr (LOONGARCH_IOCSR_FEATURES).


Huacai



Regards
Bibo Mao



Huacai



Signed-off-by: Bibo Mao 
---
 arch/loongarch/include/asm/inst.h  |  1 +
 arch/loongarch/include/asm/loongarch.h | 10 ++
 arch/loongarch/kvm/exit.c  | 46 +-
 3 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/arch/loongarch/include/asm/inst.h 
b/arch/loongarch/include/asm/inst.h
index d8f637f9e400..ad120f924905 100644
--- a/arch/loongarch/include/asm/inst.h
+++ b/arch/lo