Re: Re: [RFC PATCH 0/5] vsock/virtio: Add support for multi-devices

2024-05-29 Thread Xuewei Niu
Hi, MST!

> >  include/linux/virtio_vsock.h            |   2 +-
> >  include/net/af_vsock.h                  |  25 ++-
> >  include/uapi/linux/virtio_vsock.h       |   1 +
> >  include/uapi/linux/vm_sockets.h         |  14 ++
> >  net/vmw_vsock/af_vsock.c                | 116 +--
> >  net/vmw_vsock/virtio_transport.c        | 255 ++--
> >  net/vmw_vsock/virtio_transport_common.c |  16 +-
> >  net/vmw_vsock/vsock_loopback.c          |   4 +-
> >  8 files changed, 352 insertions(+), 81 deletions(-)
> 
> As any change to virtio device/driver interface, this has to
> go through the virtio TC. Please subscribe at
> virtio-comment+subscr...@lists.linux.dev and then
> contact the TC at virtio-comm...@lists.linux.dev
> 
> You will likely eventually need to write a spec draft document, too.

Thanks for your reply. I've sent a series of RFC documents for the spec
changes to the virtio TC [1].

[1]: https://lore.kernel.org/virtio-comment/20240528054725.268173-1-niuxuewei@antgroup.com/

Thanks,
Xuewei




Re: Re: [PATCH v3 2/4] virtio_balloon: introduce oom-kill invocations

2024-04-23 Thread zhenwei pi




On 4/23/24 17:13, Michael S. Tsirkin wrote:

On Tue, Apr 23, 2024 at 11:41:07AM +0800, zhenwei pi wrote:


[snip]

  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \


Looks like a useful extension. But
any UAPI extension has to go to virtio spec first.



Sure, I'll send the related virtio spec changes once the virtio-comment
mailing list is ready.



@@ -83,7 +84,8 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \
VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
-   VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \
+   VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
+   VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \
  }
  
  #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")

--
2.34.1




--
zhenwei pi



Re: Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function

2024-04-22 Thread zhenwei pi




On 4/22/24 15:47, David Hildenbrand wrote:

On 22.04.24 09:42, zhenwei pi wrote:

All the VM-event-related statistics depend on CONFIG_VM_EVENT_COUNTERS;
once any stack variable is required by any VM event in the future, we
would have code like:
  #ifdef CONFIG_VM_EVENT_COUNTERS
   unsigned long foo;
  #endif
   ...
  #ifdef CONFIG_VM_EVENT_COUNTERS
   foo = events[XXX] + events[YYY];
   update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
  #endif

Separate vm events into a single function, also remove


Why not simply use __maybe_unused for that variable?



1>
static unsigned int update_balloon_stats()
{
        unsigned long __maybe_unused foo;

        ...
#ifdef CONFIG_VM_EVENT_COUNTERS
        foo = events[XXX] + events[YYY];
        update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
#endif
}

2>
static inline unsigned int update_balloon_vm_stats()
{
#ifdef CONFIG_VM_EVENT_COUNTERS
        unsigned long foo;

        foo = events[XXX] + events[YYY];
        update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
#endif
}

From my point of view, I don't need to compile the code in my brain
when reading case 2. :)
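
For illustration, here is a minimal sketch of what the separated helper
could look like with real event names (hedged: update_stat() and
pages_to_bytes() are the driver's existing local helpers, and this exact
stat set is only an assumption, not necessarily the final patch):

/* Sketch only -- case 2 above, filled in with real VM event names. */
static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb)
{
        unsigned int idx = 0;
#ifdef CONFIG_VM_EVENT_COUNTERS
        unsigned long events[NR_VM_EVENT_ITEMS];

        all_vm_events(events);
        update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
                    pages_to_bytes(events[PSWPIN]));
        update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
                    pages_to_bytes(events[PSWPOUT]));
        update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
        update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
#endif
        return idx;
}

The caller then just continues from the returned index, with no conditional
compilation at the call site.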


--
zhenwei pi



Re: Re: [RFC 0/3] Improve memory statistics for virtio balloon

2024-04-15 Thread zhenwei pi




On 4/15/24 23:01, David Hildenbrand wrote:

On 15.04.24 10:41, zhenwei pi wrote:

Hi,

When the guest runs under critical memory pressure, it becomes
too slow; even sshd enters the D (uninterruptible) state on memory
allocation. We can't log into such a VM to do any troubleshooting.

The guest kernel log via the virtual TTY (on the host side) only provides
a few necessary lines after an OOM. More detailed memory statistics are
required, so we can see explicit memory events and estimate the pressure.

I'm going to introduce several VM counters for virtio balloon:
- oom-kill
- alloc-stall
- scan-async
- scan-direct
- reclaim-async
- reclaim-direct


IIUC, we're only exposing events that are already getting provided via 
all_vm_events(), correct?




Yes, all of these counters come from all_vm_events(). The 'alloc-stall'
counter is the sum of several classes of allocation stalls; please see
'[RFC 2/3] virtio_balloon: introduce memory allocation stall counter'.
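
For instance, that summation might look roughly like the sketch below
(hedged: the per-zone ALLOCSTALL_* names come from
include/linux/vm_event_item.h and only exist with CONFIG_VM_EVENT_COUNTERS,
so this would live inside such a guarded helper; the zone list depends on
Kconfig):

/* Sketch: fold the per-zone allocation stall events into one counter. */
static unsigned long sum_alloc_stall(const unsigned long *events)
{
        unsigned long stall = 0;

#ifdef CONFIG_ZONE_DMA
        stall += events[ALLOCSTALL_DMA];
#endif
#ifdef CONFIG_ZONE_DMA32
        stall += events[ALLOCSTALL_DMA32];
#endif
#ifdef CONFIG_HIGHMEM
        stall += events[ALLOCSTALL_HIGH];
#endif
        stall += events[ALLOCSTALL_NORMAL];
        stall += events[ALLOCSTALL_MOVABLE];

        return stall;
}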



In that case, I don't really see a major issue. Some considerations:

(1) These new events are fairly Linux specific.

PSWPIN and friends are fairly generic, but HGTLB is already fairly
Linux-specific. OOM kills don't really exist on Windows, for example.
We'll have to be careful to describe the semantics properly.




I also notice that FreeBSD has supported the virtio balloon for a long
time, and 'OOM kill' is used on FreeBSD too (link:
https://klarasystems.com/articles/exploring-swap-on-freebsd/).


(2) How should we handle it if Linux ever stops supporting a certain
event (e.g., after a major reclaim rework)? I assume we'd simply return
nothing, like we currently would for VIRTIO_BALLOON_S_HTLB_PGALLOC
without CONFIG_HUGETLB_PAGE.




Luckily, the virtio balloon stats schema is tag-value style, so this
approach would be safe enough.
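
For reference, each stat travels as an independent tag/value pair on the
stats virtqueue; this is the UAPI struct from
include/uapi/linux/virtio_balloon.h:

struct virtio_balloon_stat {
        __virtio16 tag;
        __virtio64 val;
} __attribute__((packed));

So a side that doesn't recognize a tag can skip it, and the driver can
simply stop supplying a tag later without breaking the layout.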



The suggestions on patches [1-3] are good; I'll address them in the next
version if this series is acceptable.


--
zhenwei pi



Re: Re: [PATCH 2/3] kernel/pid: Remove default pid_max value

2024-04-12 Thread Michal Koutný
On Thu, Apr 11, 2024 at 03:03:31PM -0700, Andrew Morton wrote:
> A large increase in the maximum number of processes.

The change from (some) default to effective infinity is the crux of the
change, because the default is only a number.
(Thus I don't find the number's 12700% increase, by itself, a big change.)

The actual maximum number of processes is workload-dependent and hence
should be determined based on the particular workload.

> Or did I misinterpret?

I thought you saw an issue with the projection of that number into
sizings based on the default. Which of them constitutes the large change
in your eyes?

Thanks,
Michal



Re: Re: [PATCH 2/3] kernel/pid: Remove default pid_max value

2024-04-11 Thread Michal Koutný
Hello.

On Mon, Apr 08, 2024 at 01:29:55PM -0700, Andrew Morton wrote:
> That seems like a large change.

In what sense is it large?

I tried to look up the code parts that depend on this default and either
add the other patches or mention the impact (that part could be more
thorough) in the commit message.

> It isn't clear why we'd want to merge this patchset.  Does it improve
> anyone's life and if so, how?

- kernel devs who don't care about policy
  - policy should be decided by distros/users, not in kernel

- users who need many threads
  - current default is too low
  - this is one more place to look at when configuring

- users who want to prevent fork-bombs
  - current default is ineffective (too high), a false feeling of safety
  - i.e. they should configure an appropriate mechanism (such as the pids
    cgroup controller, sketched below) appropriately
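
For illustration, a hedged sketch of capping tasks with the pids
controller (the cgroup path and limit are made up, and cgroup v2 is
assumed to be mounted at /sys/fs/cgroup):

#include <stdio.h>

/* Write an illustrative task limit into an (assumed pre-created) cgroup. */
int main(void)
{
        FILE *f = fopen("/sys/fs/cgroup/app/pids.max", "w");

        if (!f) {
                perror("pids.max");
                return 1;
        }
        fprintf(f, "1024\n");
        return fclose(f) ? 1 : 0;
}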


I thought that the first point alone would be convincing and that only
the scaling impact might need clarification.

Regards,
Michal



Re: Re: Re: Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send

2024-04-11 Thread Peilin He
>> >> >[...]
>> >> >> >I think my understanding based on what Eric depicted differs from you:
>> >> >> >we're supposed to filter out those many invalid cases and only trace
>> >> >> >the valid action of sending an icmp, so where to add a new tracepoint
>> >> >> >is important instead of adding more checks in the tracepoint itself.
>> >> >> >Please refer to what trace_tcp_retransmit_skb() does :)
>> >> >> >
>> >> >> >Thanks,
>> >> >> >Jason
>> >> >> Okay, thank you for your suggestion. In order to avoid filtering out
>> >> >> those many invalid cases and only tracing the valid action of sending
>> >> >> an icmp, the next patch will add a udd_fail_no_port tracepoint to the
>> >> >> include/trace/events/udp.h. This will solve the problem you mentioned
>> >> >> very well. At this point, only UDP protocol exceptions will be tracked,
>> >> >> without the need to track them in icmp_send.
>> >> >
>> >> >I'm not against what you did (tracing all the icmp_send() for UDP) in
>> >> >your original patch. I was suggesting that you could put
>> >> >trace_icmp_send() in the right place, then you don't have to check the
>> >> >possible error condition (like if the skb->head is valid or not, ...)
>> >> >in your trace function.
>> >> >
>> >> >One example that can avoid various checks existing in the
>> >> >__icmp_send() function:
>> >> >diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
>> >> >index e63a3bf99617..2c9f7364de45 100644
>> >> >--- a/net/ipv4/icmp.c
>> >> >+++ b/net/ipv4/icmp.c
>> >> >@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
>> >> >int code, __be32 info,
>> >> >if (!fl4.saddr)
>> >> >fl4.saddr = htonl(INADDR_DUMMY);
>> >> >
>> >> >+   trace_icmp_send(skb_in, type, code);
>> >> >icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
>> >> > ende:
>> >> >ip_rt_put(rt);
>> >> >
>> >> >If we go here, it means we are ready to send the ICMP skb because
>> >> >we're done extracting the right information in the 'struct sk_buff
>> >> >skb_in'. Simpler and easier, right?
>> >> >
>> >> >Thanks,
>> >> >Jason
>> >>
>> >> I may not fully agree with this viewpoint. When trace_icmp_send is placed
>> >> in this position, it cannot guarantee that all skbs in icmp are UDP
>> >> protocols (UDP needs to be distinguished based on the proto_4 !=
>> >> IPPROTO_UDP condition), nor can it guarantee the legitimacy of the
>> >> udphdr (a *uh legitimacy check is required).
>> >
>> >Of course, the UDP test statement is absolutely needed! Eric
>> >previously pointed this out in the V1 patch thread. I'm not referring
>> >to this one but to something like the skb->head check which exists
>> >in the __icmp_send() function. You can see there are so many checks in it
>> >before sending.
>> >
>> >So only keeping the UDP check is enough, I think.
>>
>> The __icmp_send function only checks the IP header, but does not check
>> the UDP header, as shown in the following code snippet:
>>
>> if ((u8 *)iph < skb_in->head ||
>> (skb_network_header(skb_in) + sizeof(*iph)) >
>> skb_tail_pointer(skb_in))
>> goto out;
>>
>> There is no problem with the IP header check, but that does not mean that
>> the UDP header is correct. Therefore, I believe that it is essential to
>> include a legitimacy judgment for the UDP header.
>>
>> Here is an explanation of this code:
>> Firstly, the UDP header (*uh) is extracted from the skb.
>> Then, if the current protocol of the skb is not UDP, or if the address of
>> uh is outside the range of the skb, the source port and destination port
>> will not be resolved, and 0 will be filled in directly. Otherwise,
>> the source port and destination port of the UDP header will be resolved.
>>
>> +   struct udphdr *uh = udp_hdr(skb);
>> +   if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
>> +   (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) {
>
>From the beginning, I have always agreed with the UDP check. I was saying
>if you can put the trace_icmp_send() just before icmp_push_reply()[1],
>you could avoid those kinds of checks.
>As I said in the previous email, "only keeping the UDP check is
>enough". So you are right.
>
>[1]
>diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
>index e63a3bf99617..2c9f7364de45 100644
>--- a/net/ipv4/icmp.c
>+++ b/net/ipv4/icmp.c
>@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
>int code, __be32 info,
>if (!fl4.saddr)
>fl4.saddr = htonl(INADDR_DUMMY);
>
>+   trace_icmp_send(skb_in, type, code);
>icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
> ende:
>ip_rt_put(rt);
>
>If we're doing this, trace_icmp_send() can reflect the real action of
>sending an ICMP like trace_tcp_retransmit_skb(). Or else, the trace
>could print some messages but no real ICMP is sent (see those error
>checks). WDYT?
>
>Thanks,
>Jason

Yeah, placing trace_icmp_send() before icmp_push_reply() will ensure
that tracking starts when ICMP 

Re: Re: Re: Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send

2024-04-11 Thread Jason Xing
On Thu, Apr 11, 2024 at 12:57 PM Peilin He  wrote:
>
> >> >[...]
> >> >> >I think my understanding based on what Eric depicted differs from you:
> >> >> >we're supposed to filter out those many invalid cases and only trace
> >> >> >the valid action of sending an icmp, so where to add a new tracepoint
> >> >> >is important instead of adding more checks in the tracepoint itself.
> >> >> >Please refer to what trace_tcp_retransmit_skb() does :)
> >> >> >
> >> >> >Thanks,
> >> >> >Jason
> >> >> Okay, thank you for your suggestion. In order to avoid filtering out
> >> >> those many invalid cases and only tracing the valid action of sending
> >> >> an icmp, the next patch will add a udd_fail_no_port tracepoint to the
> >> >> include/trace/events/udp.h. This will solve the problem you mentioned
> >> >> very well. At this point, only UDP protocol exceptions will be tracked,
> >> >> without the need to track them in icmp_send.
> >> >
> >> >I'm not against what you did (tracing all the icmp_send() for UDP) in
> >> >your original patch. I was suggesting that you could put
> >> >trace_icmp_send() in the right place, then you don't have to check the
> >> >possible error condition (like if the skb->head is valid or not, ...)
> >> >in your trace function.
> >> >
> >> >One example that can avoid various checks existing in the
> >> >__icmp_send() function:
> >> >diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> >> >index e63a3bf99617..2c9f7364de45 100644
> >> >--- a/net/ipv4/icmp.c
> >> >+++ b/net/ipv4/icmp.c
> >> >@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
> >> >int code, __be32 info,
> >> >if (!fl4.saddr)
> >> >fl4.saddr = htonl(INADDR_DUMMY);
> >> >
> >> >+   trace_icmp_send(skb_in, type, code);
> >> >icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
> >> > ende:
> >> >ip_rt_put(rt);
> >> >
> >> >If we go here, it means we are ready to send the ICMP skb because
> >> >we're done extracting the right information in the 'struct sk_buff
> >> >skb_in'. Simpler and easier, right?
> >> >
> >> >Thanks,
> >> >Jason
> >>
> >> I may not fully agree with this viewpoint. When trace_icmp_send is placed
> >> in this position, it cannot guarantee that all skbs in icmp are UDP
> >> protocols (UDP needs to be distinguished based on the proto_4 !=
> >> IPPROTO_UDP condition), nor can it guarantee the legitimacy of the
> >> udphdr (a *uh legitimacy check is required).
> >
> >Of course, the UDP test statement is absolutely needed! Eric
> >previously pointed this out in the V1 patch thread. I'm not referring
> >to this one but to something like the skb->head check which exists
> >in the __icmp_send() function. You can see there are so many checks in it
> >before sending.
> >
> >So only keeping the UDP check is enough, I think.
>
> The __icmp_send function only checks the IP header, but does not check
> the UDP header, as shown in the following code snippet:
>
> if ((u8 *)iph < skb_in->head ||
> (skb_network_header(skb_in) + sizeof(*iph)) >
> skb_tail_pointer(skb_in))
> goto out;
>
> There is no problem with the IP header check, but that does not mean that
> the UDP header is correct. Therefore, I believe that it is essential to
> include a legitimacy judgment for the UDP header.
>
> Here is an explanation of this code:
> Firstly, the UDP header (*uh) is extracted from the skb.
> Then, if the current protocol of the skb is not UDP, or if the address of
> uh is outside the range of the skb, the source port and destination port
> will not be resolved, and 0 will be filled in directly. Otherwise,
> the source port and destination port of the UDP header will be resolved.
>
> +   struct udphdr *uh = udp_hdr(skb);
> +   if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
> +   (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) {

From the beginning, I have always agreed with the UDP check. I was saying
if you can put the trace_icmp_send() just before icmp_push_reply()[1],
you could avoid those kinds of checks.
As I said in the previous email, "only keeping the UDP check is
enough". So you are right.

[1]
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e63a3bf99617..2c9f7364de45 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
int code, __be32 info,
if (!fl4.saddr)
fl4.saddr = htonl(INADDR_DUMMY);

+   trace_icmp_send(skb_in, type, code);
icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
 ende:
ip_rt_put(rt);

If we're doing this, trace_icmp_send() can reflect the real action of
sending an ICMP like trace_tcp_retransmit_skb(). Or else, the trace
could print some messages but no real ICMP is sent (see those error
checks). WDYT?

Thanks,
Jason

>
> With best wishes
> Peilin He
>
> >Thanks,
> >Jason
> >
> >>
> >> With best wishes
> >> Peilin He
> >>
> >> >>
> >> >> >> 2.Target this patch for net-next.
> >> >> >>
> >> >> >> v2->v3:
> >> 

Re: Re: Re: Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send

2024-04-10 Thread Peilin He
>> >[...]
>> >> >I think my understanding based on what Eric depicted differs from you:
>> >> >we're supposed to filter out those many invalid cases and only trace
>> >> >the valid action of sending an icmp, so where to add a new tracepoint
>> >> >is important instead of adding more checks in the tracepoint itself.
>> >> >Please refer to what trace_tcp_retransmit_skb() does :)
>> >> >
>> >> >Thanks,
>> >> >Jason
>> >> Okay, thank you for your suggestion. In order to avoid filtering out
>> >> those many invalid cases and only tracing the valid action of sending
>> >> an icmp, the next patch will add a udd_fail_no_port tracepoint to the
>> >> include/trace/events/udp.h. This will solve the problem you mentioned
>> >> very well. At this point, only UDP protocol exceptions will be tracked,
>> >> without the need to track them in icmp_send.
>> >
>> >I'm not against what you did (tracing all the icmp_send() for UDP) in
>> >your original patch. I was suggesting that you could put
>> >trace_icmp_send() in the right place, then you don't have to check the
>> >possible error condition (like if the skb->head is valid or not, ...)
>> >in your trace function.
>> >
>> >One example that can avoid various checks existing in the
>> >__icmp_send() function:
>> >diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
>> >index e63a3bf99617..2c9f7364de45 100644
>> >--- a/net/ipv4/icmp.c
>> >+++ b/net/ipv4/icmp.c
>> >@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
>> >int code, __be32 info,
>> >if (!fl4.saddr)
>> >fl4.saddr = htonl(INADDR_DUMMY);
>> >
>> >+   trace_icmp_send(skb_in, type, code);
>> >icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
>> > ende:
>> >ip_rt_put(rt);
>> >
>> >If we go here, it means we are ready to send the ICMP skb because
>> >we're done extracting the right information in the 'struct sk_buff
>> >skb_in'. Simpler and easier, right?
>> >
>> >Thanks,
>> >Jason
>>
>> I may not fully agree with this viewpoint. When trace_icmp_send is placed
>> in this position, it cannot guarantee that all skbs in icmp are UDP
>> protocols (UDP needs to be distinguished based on the proto_4 !=
>> IPPROTO_UDP condition), nor can it guarantee the legitimacy of the
>> udphdr (a *uh legitimacy check is required).
>
>Of course, the UDP test statement is absolutely needed! Eric
>previously pointed this out in the V1 patch thread. I'm not referring
>to this one but to something like the skb->head check which exists
>in the __icmp_send() function. You can see there are so many checks in it
>before sending.
>
>So only keeping the UDP check is enough, I think.

The __icmp_send function only checks the IP header, but does not check
the UDP header, as shown in the following code snippet:

if ((u8 *)iph < skb_in->head ||
(skb_network_header(skb_in) + sizeof(*iph)) >
skb_tail_pointer(skb_in))
goto out;

There is no problem with the IP header check, but that does not mean that
the UDP header is correct. Therefore, I believe that it is essential to
include a legitimacy judgment for the UDP header.
 
Here is an explanation of this code:
Firstly, the UDP header (*uh) is extracted from the skb.
Then, if the current protocol of the skb is not UDP, or if the address of
uh is outside the range of the skb, the source port and destination port
will not be resolved, and 0 will be filled in directly. Otherwise,
the source port and destination port of the UDP header will be resolved.

+   struct udphdr *uh = udp_hdr(skb);
+   if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
+   (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) {

With best wishes
Peilin He

>Thanks,
>Jason
>
>>
>> With best wishes
>> Peilin He
>>
>> >>
>> >> >> 2.Target this patch for net-next.
>> >> >>
>> >> >> v2->v3:
>> >> >> Some fixes according to
>> >> >> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
>> >> >> 1. Change the tracing directory to /sys/kernel/tracing.
>> >> >> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
>> >> >>
>> >> >> v1->v2:
>> >> >> Some fixes according to
>> >> >> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
>> >> >> 1. adjust the trace_icmp_send() to more protocols than UDP.
>> >> >> 2. move the calling of trace_icmp_send after sanity checks
>> >> >> in __icmp_send().
>> >> >>
>> >> >> Signed-off-by: Peilin He
>> >> >> Reviewed-by: xu xin 
>> >> >> Reviewed-by: Yunkai Zhang 
>> >> >> Cc: Yang Yang 
>> >> >> Cc: Liu Chun 
>> >> >> Cc: Xuexin Jiang 
>> >> >> ---
>> >> >>  include/trace/events/icmp.h | 65 +
>> >> >>  net/ipv4/icmp.c |  4 +++
>> >> >>  2 files changed, 69 insertions(+)
>> >> >>  create mode 100644 include/trace/events/icmp.h
>> >> >>
>> >> >> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
>> >> >> new file mode 100644
>> >> >> index ..7d5190f48a28
>> >> 

Re: Re: Re: Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send

2024-04-10 Thread Jason Xing
On Thu, Apr 11, 2024 at 10:34 AM Peilin He  wrote:
>
> >[...]
> >> >I think my understanding based on what Eric depicted differs from you:
> >> >we're supposed to filter out those many invalid cases and only trace
> >> >the valid action of sending an icmp, so where to add a new tracepoint
> >> >is important instead of adding more checks in the tracepoint itself.
> >> >Please refer to what trace_tcp_retransmit_skb() does :)
> >> >
> >> >Thanks,
> >> >Jason
> >> Okay, thank you for your suggestion. In order to avoid filtering out
> >> those many invalid cases and only tracing the valid action of sending
> >> an icmp, the next patch will add a udd_fail_no_port tracepoint to the
> >> include/trace/events/udp.h. This will solve the problem you mentioned
> >> very well. At this point, only UDP protocol exceptions will be tracked,
> >> without the need to track them in icmp_send.
> >
> >I'm not against what you did (tracing all the icmp_send() for UDP) in
> >your original patch. I was suggesting that you could put
> >trace_icmp_send() in the right place, then you don't have to check the
> >possible error condition (like if the skb->head is valid or not, ...)
> >in your trace function.
> >
> >One example that can avoid various checks existing in the
> >__icmp_send() function:
> >diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> >index e63a3bf99617..2c9f7364de45 100644
> >--- a/net/ipv4/icmp.c
> >+++ b/net/ipv4/icmp.c
> >@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
> >int code, __be32 info,
> >if (!fl4.saddr)
> >fl4.saddr = htonl(INADDR_DUMMY);
> >
> >+   trace_icmp_send(skb_in, type, code);
> >icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
> > ende:
> >ip_rt_put(rt);
> >
> >If we go here, it means we are ready to send the ICMP skb because
> >we're done extracting the right information in the 'struct sk_buff
> >skb_in'. Simpler and easier, right?
> >
> >Thanks,
> >Jason
>
> I may not fully agree with this viewpoint. When trace_icmp_send is placed
> in this position, it cannot guarantee that all skbs in icmp are UDP
> protocols (UDP needs to be distinguished based on the proto_4 !=
> IPPROTO_UDP condition), nor can it guarantee the legitimacy of the
> udphdr (a *uh legitimacy check is required).

Of course, the UDP test statement is absolutely needed! Eric
previously pointed this out in the V1 patch thread. I'm not referring
to this one but like skb->head check something like this which exists
in __icmp_send() function. You can see there are so many checks in it
before sending.

So only keeping the UDP check is enough, I think.

Thanks,
Jason

>
> With best wishes
> Peilin He
>
> >>
> >> >> 2.Target this patch for net-next.
> >> >>
> >> >> v2->v3:
> >> >> Some fixes according to
> >> >> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
> >> >> 1. Change the tracing directory to /sys/kernel/tracing.
> >> >> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
> >> >>
> >> >> v1->v2:
> >> >> Some fixes according to
> >> >> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
> >> >> 1. adjust the trace_icmp_send() to more protocols than UDP.
> >> >> 2. move the calling of trace_icmp_send after sanity checks
> >> >> in __icmp_send().
> >> >>
> >> >> Signed-off-by: Peilin He
> >> >> Reviewed-by: xu xin 
> >> >> Reviewed-by: Yunkai Zhang 
> >> >> Cc: Yang Yang 
> >> >> Cc: Liu Chun 
> >> >> Cc: Xuexin Jiang 
> >> >> ---
> >> >>  include/trace/events/icmp.h | 65 +
> >> >>  net/ipv4/icmp.c |  4 +++
> >> >>  2 files changed, 69 insertions(+)
> >> >>  create mode 100644 include/trace/events/icmp.h
> >> >>
> >> >> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> >> >> new file mode 100644
> >> >> index ..7d5190f48a28
> >> >> --- /dev/null
> >> >> +++ b/include/trace/events/icmp.h
> >> >> @@ -0,0 +1,65 @@
> >> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> >> +#undef TRACE_SYSTEM
> >> >> +#define TRACE_SYSTEM icmp
> >> >> +
> >> >> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> >> >> +#define _TRACE_ICMP_H
> >> >> +
> >> >> +#include 
> >> >> +#include 
> >> >> +
> >> >> +TRACE_EVENT(icmp_send,
> >> >> +
> >> >> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> >> >> +
> >> >> +   TP_ARGS(skb, type, code),
> >> >> +
> >> >> +   TP_STRUCT__entry(
> >> >> +   __field(const void *, skbaddr)
> >> >> +   __field(int, type)
> >> >> +   __field(int, code)
> >> >> +   __array(__u8, saddr, 4)
> >> >> +   __array(__u8, daddr, 4)
> >> >> +   __field(__u16, sport)
> >> >> +   __field(__u16, dport)
> >> >> +   __field(unsigned short, ulen)
> >> >> +   ),
> >> >> +
> >> >> +   

Re: Re: Re: Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send

2024-04-10 Thread Peilin He
>[...]
>> >I think my understanding based on what Eric depicted differs from you:
>> >we're supposed to filter out those many invalid cases and only trace
>> >the valid action of sending an icmp, so where to add a new tracepoint
>> >is important instead of adding more checks in the tracepoint itself.
>> >Please refer to what trace_tcp_retransmit_skb() does :)
>> >
>> >Thanks,
>> >Jason
>> Okay, thank you for your suggestion. In order to avoid filtering out
>> those many invalid cases and only tracing the valid action of sending
>> an icmp, the next patch will add a udd_fail_no_port tracepoint to the
>> include/trace/events/udp.h. This will solve the problem you mentioned
>> very well. At this point, only UDP protocol exceptions will be tracked,
>> without the need to track them in icmp_send.
>
>I'm not against what you did (tracing all the icmp_send() for UDP) in
>your original patch. I was suggesting that you could put
>trace_icmp_send() in the right place, then you don't have to check the
>possible error condition (like if the skb->head is valid or not, ...)
>in your trace function.
>
>One example that can avoid various checks existing in the
>__icmp_send() function:
>diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
>index e63a3bf99617..2c9f7364de45 100644
>--- a/net/ipv4/icmp.c
>+++ b/net/ipv4/icmp.c
>@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
>int code, __be32 info,
>if (!fl4.saddr)
>fl4.saddr = htonl(INADDR_DUMMY);
>
>+   trace_icmp_send(skb_in, type, code);
>icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
> ende:
>ip_rt_put(rt);
>
>If we go here, it means we are ready to send the ICMP skb because
>we're done extracting the right information in the 'struct sk_buff
>skb_in'. Simpler and easier, right?
>
>Thanks,
>Jason

I may not fully agree with this viewpoint. When trace_icmp_send is placed
in this position, it cannot guarantee that all skbs in icmp are UDP
protocols (UDP needs to be distinguished based on the proto_4 !=
IPPROTO_UDP condition), nor can it guarantee the legitimacy of the
udphdr (a *uh legitimacy check is required).

With best wishes
Peilin He

>>
>> >> 2.Target this patch for net-next.
>> >>
>> >> v2->v3:
>> >> Some fixes according to
>> >> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
>> >> 1. Change the tracing directory to /sys/kernel/tracing.
>> >> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
>> >>
>> >> v1->v2:
>> >> Some fixes according to
>> >> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
>> >> 1. adjust the trace_icmp_send() to more protocols than UDP.
>> >> 2. move the calling of trace_icmp_send after sanity checks
>> >> in __icmp_send().
>> >>
>> >> Signed-off-by: Peilin He
>> >> Reviewed-by: xu xin 
>> >> Reviewed-by: Yunkai Zhang 
>> >> Cc: Yang Yang 
>> >> Cc: Liu Chun 
>> >> Cc: Xuexin Jiang 
>> >> ---
>> >>  include/trace/events/icmp.h | 65 +
>> >>  net/ipv4/icmp.c |  4 +++
>> >>  2 files changed, 69 insertions(+)
>> >>  create mode 100644 include/trace/events/icmp.h
>> >>
>> >> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
>> >> new file mode 100644
>> >> index ..7d5190f48a28
>> >> --- /dev/null
>> >> +++ b/include/trace/events/icmp.h
>> >> @@ -0,0 +1,65 @@
>> >> +/* SPDX-License-Identifier: GPL-2.0 */
>> >> +#undef TRACE_SYSTEM
>> >> +#define TRACE_SYSTEM icmp
>> >> +
>> >> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
>> >> +#define _TRACE_ICMP_H
>> >> +
>> >> +#include 
>> >> +#include 
>> >> +
>> >> +TRACE_EVENT(icmp_send,
>> >> +
>> >> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
>> >> +
>> >> +   TP_ARGS(skb, type, code),
>> >> +
>> >> +   TP_STRUCT__entry(
>> >> +   __field(const void *, skbaddr)
>> >> +   __field(int, type)
>> >> +   __field(int, code)
>> >> +   __array(__u8, saddr, 4)
>> >> +   __array(__u8, daddr, 4)
>> >> +   __field(__u16, sport)
>> >> +   __field(__u16, dport)
>> >> +   __field(unsigned short, ulen)
>> >> +   ),
>> >> +
>> >> +   TP_fast_assign(
>> >> +   struct iphdr *iph = ip_hdr(skb);
>> >> +   int proto_4 = iph->protocol;
>> >> +   __be32 *p32;
>> >> +
>> >> +   __entry->skbaddr = skb;
>> >> +   __entry->type = type;
>> >> +   __entry->code = code;
>> >> +
>> >> +   struct udphdr *uh = udp_hdr(skb);
>> >> +   if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
>> >> +   (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) {
>> >> +

Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-10 Thread Haitao Huang
On Tue, 09 Apr 2024 10:34:06 -0500, Haitao Huang wrote:


On Tue, 09 Apr 2024 04:03:22 -0500, Michal Koutný wrote:


On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang wrote:

It's always non-NULL based on testing.

It's hard for me to say definitively by reading the code. But IIUC the
cgroup_disable command line only blocks operations in /sys/fs/cgroup, so
user space can't set up controllers and configure limits, etc., for the
disabled ones. Each task->cgroups would still have a non-NULL pointer to
the static root object for each cgroup that is enabled by Kconfig, so
get_current_misc_cg() and thus sgx_get_current_cg() should not return
NULL regardless of 'cgroup_disable=misc'.

Maybe @Michal or @tj can confirm?


The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist.

(It is up to the controller implementation to do further optimization
based on the boot-time disablement (e.g. see uses of
mem_cgroup_disabled()). Not sure if this is useful for misc controller.)

As for the WARN_ON(1), taking the example from memcg -- NULL is best
treated as synonymous with root. It's a judgement call which of the
values to store and when to interpret it.

HTH,
Michal

Thanks for the info.

The way I see it, misc does not have special handling like memcg, so
every task belongs at least to the root (default) group even if the
controller is disabled by a command-line parameter. So we would not get
NULL from get_current_misc_cg(). I think I'll keep the WARN_ON_ONCE for
now as a reminder in case misc does gain custom support for disabling in
the future.


Actually, I think it makes more sense to just add some comments instead
of the WARN.

That's what I did in v11 now.

Thanks
Haitao



Re: Re: Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send

2024-04-10 Thread Jason Xing
[...]
> >I think my understanding based on what Eric depicted differs from you:
> >we're supposed to filter out those many invalid cases and only trace
> >the valid action of sending an icmp, so where to add a new tracepoint
> >is important instead of adding more checks in the tracepoint itself.
> >Please refer to what trace_tcp_retransmit_skb() does :)
> >
> >Thanks,
> >Jason
> Okay, thank you for your suggestion. In order to avoid filtering out
> those many invalid cases and only tracing the valid action of sending
> an icmp, the next patch will add a udd_fail_no_port tracepoint to the
> include/trace/events/udp.h. This will solve the problem you mentioned
> very well. At this point, only UDP protocol exceptions will be tracked,
> without the need to track them in icmp_send.

I'm not against what you did (tracing all the icmp_send() for UDP) in
your original patch. I was suggesting that you could put
trace_icmp_send() in the right place, then you don't have to check the
possible error condition (like if the skb->head is valid or not, ...)
in your trace function.

One example that can avoid various checks existing in the
__icmp_send() function:
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e63a3bf99617..2c9f7364de45 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
int code, __be32 info,
if (!fl4.saddr)
fl4.saddr = htonl(INADDR_DUMMY);

+   trace_icmp_send(skb_in, type, code);
icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
 ende:
ip_rt_put(rt);

If we go here, it means we are ready to send the ICMP skb because
we're done extracting the right information in the 'struct sk_buff
skb_in'. Simpler and easier, right?

Thanks,
Jason

>
> >> 2.Target this patch for net-next.
> >>
> >> v2->v3:
> >> Some fixes according to
> >> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
> >> 1. Change the tracing directory to /sys/kernel/tracing.
> >> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
> >>
> >> v1->v2:
> >> Some fixes according to
> >> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
> >> 1. adjust the trace_icmp_send() to more protocols than UDP.
> >> 2. move the calling of trace_icmp_send after sanity checks
> >> in __icmp_send().
> >>
> >> Signed-off-by: Peilin He
> >> Reviewed-by: xu xin 
> >> Reviewed-by: Yunkai Zhang 
> >> Cc: Yang Yang 
> >> Cc: Liu Chun 
> >> Cc: Xuexin Jiang 
> >> ---
> >>  include/trace/events/icmp.h | 65 +
> >>  net/ipv4/icmp.c |  4 +++
> >>  2 files changed, 69 insertions(+)
> >>  create mode 100644 include/trace/events/icmp.h
> >>
> >> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> >> new file mode 100644
> >> index ..7d5190f48a28
> >> --- /dev/null
> >> +++ b/include/trace/events/icmp.h
> >> @@ -0,0 +1,65 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +#undef TRACE_SYSTEM
> >> +#define TRACE_SYSTEM icmp
> >> +
> >> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> >> +#define _TRACE_ICMP_H
> >> +
> >> +#include 
> >> +#include 
> >> +
> >> +TRACE_EVENT(icmp_send,
> >> +
> >> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> >> +
> >> +   TP_ARGS(skb, type, code),
> >> +
> >> +   TP_STRUCT__entry(
> >> +   __field(const void *, skbaddr)
> >> +   __field(int, type)
> >> +   __field(int, code)
> >> +   __array(__u8, saddr, 4)
> >> +   __array(__u8, daddr, 4)
> >> +   __field(__u16, sport)
> >> +   __field(__u16, dport)
> >> +   __field(unsigned short, ulen)
> >> +   ),
> >> +
> >> +   TP_fast_assign(
> >> >> +   struct iphdr *iph = ip_hdr(skb);
> >> >> +   int proto_4 = iph->protocol;
> >> >> +   __be32 *p32;
> >> >> +
> >> >> +   __entry->skbaddr = skb;
> >> >> +   __entry->type = type;
> >> >> +   __entry->code = code;
> >> >> +
> >> >> +   struct udphdr *uh = udp_hdr(skb);
> >> >> +   if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
> >> >> +   (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) {
> >> >> +   __entry->sport = 0;
> >> >> +   __entry->dport = 0;
> >> >> +   __entry->ulen = 0;
> >> >> +   } else {
> >> >> +   __entry->sport = ntohs(uh->source);
> >> >> +   __entry->dport = ntohs(uh->dest);
> >> >> +   __entry->ulen = ntohs(uh->len);
> >> >> +   }
> >> +

Re: Re: Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send

2024-04-10 Thread Peilin He
>> From: hepeilin 
>>
>> Introduce a tracepoint for icmp_send, which can help users to get more
>> detailed information conveniently when icmp abnormal events happen.
>>
>> 1. Giving a usecase example:
>> =============================
>> When an application experiences packet loss due to an unreachable UDP
>> destination port, the kernel will send an exception message through the
>> icmp_send function. By adding a trace point for icmp_send, developers or
>> system administrators can obtain detailed information about the UDP
>> packet loss, including the type, code, source address, destination address,
>> source port, and destination port. This facilitates the trouble-shooting
>> of UDP packet loss issues especially for those network-service
>> applications.
>>
>> 2. Operation Instructions:
>> ==========================
>> Switch to the tracing directory.
>> cd /sys/kernel/tracing
>> Filter for destination port unreachable.
>> echo "type==3 && code==3" > events/icmp/icmp_send/filter
>> Enable trace event.
>> echo 1 > events/icmp/icmp_send/enable
>>
>> 3. Result View:
>> ================
>>  udp_client_erro-11370   [002] ...s.12   124.728002:
>>  icmp_send: icmp_send: type=3, code=3.
>>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>>  skbaddr=589b167a
>>
>> v3->v4:
>> Some fixes according to
>> https://lore.kernel.org/all/CANn89i+EFEr7VHXNdOi59Ba_R1nFKSBJzBzkJFVgCTdXBx=d...@mail.gmail.com/
>> 1. Add legality check for the UDP header in the SKB.
>
>I think my understanding based on what Eric depicted differs from you:
>we're supposed to filter out those many invalid cases and only trace
>the valid action of sending an icmp, so where to add a new tracepoint
>is important instead of adding more checks in the tracepoint itself.
>Please refer to what trace_tcp_retransmit_skb() does :)
>
>Thanks,
>Jason
Okay, thank you for your suggestion. In order to avoid filtering out
those many invalid cases and only tracing the valid action of sending
an icmp, the next patch will add a udd_fail_no_port tracepoint to
include/trace/events/udp.h. This will solve the problem you mentioned
very well. At this point, only UDP protocol exceptions will be tracked,
without the need to track them in icmp_send.
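
A rough sketch of such a tracepoint is shown below; the name
udd_fail_no_port is taken from this thread, and the field set, boilerplate
and placement are only illustrative assumptions, not the posted patch:

/* Fragment for include/trace/events/udp.h; the usual TRACE_SYSTEM and
 * header-guard boilerplate is omitted for brevity. */
TRACE_EVENT(udd_fail_no_port,

        TP_PROTO(const struct sk_buff *skb, __u16 sport, __u16 dport),

        TP_ARGS(skb, sport, dport),

        TP_STRUCT__entry(
                __field(const void *, skbaddr)
                __field(__u16, sport)
                __field(__u16, dport)
        ),

        TP_fast_assign(
                __entry->skbaddr = skb;
                __entry->sport = sport;
                __entry->dport = dport;
        ),

        TP_printk("skbaddr=%p sport=%u dport=%u",
                  __entry->skbaddr, __entry->sport, __entry->dport)
);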

>> 2. Target this patch for net-next.
>>
>> v2->v3:
>> Some fixes according to
>> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
>> 1. Change the tracing directory to /sys/kernel/tracing.
>> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
>>
>> v1->v2:
>> Some fixes according to
>> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
>> 1. adjust the trace_icmp_send() to more protocols than UDP.
>> 2. move the calling of trace_icmp_send after sanity checks
>> in __icmp_send().
>>
>> Signed-off-by: Peilin He
>> Reviewed-by: xu xin 
>> Reviewed-by: Yunkai Zhang 
>> Cc: Yang Yang 
>> Cc: Liu Chun 
>> Cc: Xuexin Jiang 
>> ---
>>  include/trace/events/icmp.h | 65 +
>>  net/ipv4/icmp.c |  4 +++
>>  2 files changed, 69 insertions(+)
>>  create mode 100644 include/trace/events/icmp.h
>>
>> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
>> new file mode 100644
>> index ..7d5190f48a28
>> --- /dev/null
>> +++ b/include/trace/events/icmp.h
>> @@ -0,0 +1,65 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM icmp
>> +
>> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_ICMP_H
>> +
>> +#include 
>> +#include 
>> +
>> +TRACE_EVENT(icmp_send,
>> +
>> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
>> +
>> +   TP_ARGS(skb, type, code),
>> +
>> +   TP_STRUCT__entry(
>> +   __field(const void *, skbaddr)
>> +   __field(int, type)
>> +   __field(int, code)
>> +   __array(__u8, saddr, 4)
>> +   __array(__u8, daddr, 4)
>> +   __field(__u16, sport)
>> +   __field(__u16, dport)
>> +   __field(unsigned short, ulen)
>> +   ),
>> +
>> +   TP_fast_assign(
>> +   struct iphdr *iph = ip_hdr(skb);
>> +   int proto_4 = iph->protocol;
>> +   __be32 *p32;
>> +
>> +   __entry->skbaddr = skb;
>> +   __entry->type = type;
>> +   __entry->code = code;
>> +
>> +   struct udphdr *uh = udp_hdr(skb);
>> +   if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
>> +   (u8 *)uh + sizeof(struct udphdr) 

Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-09 Thread Haitao Huang

On Tue, 09 Apr 2024 04:03:22 -0500, Michal Koutný  wrote:

On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang wrote:

It's always non-NULL based on testing.

It's hard for me to say definitively by reading the code. But IIUC the
cgroup_disable command line only blocks operations in /sys/fs/cgroup, so
user space can't set up controllers and configure limits, etc., for the
disabled ones. Each task->cgroups would still have a non-NULL pointer to
the static root object for each cgroup that is enabled by Kconfig, so
get_current_misc_cg() and thus sgx_get_current_cg() should not return
NULL regardless of 'cgroup_disable=misc'.

Maybe @Michal or @tj can confirm?


The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist.

(It is up to the controller implementation to do further optimization
based on the boot-time disablement (e.g. see uses of
mem_cgroup_disabled()). Not sure if this is useful for misc controller.)

As for the WARN_ON(1), taking the example from memcg -- NULL is best
treated as synonymous with root. It's a judgement call which of the
values to store and when to interpret it.

HTH,
Michal

Thanks for the info.

The way I see it, misc does not have special handling like memcg, so
every task belongs at least to the root (default) group even if the
controller is disabled by a command-line parameter. So we would not get
NULL from get_current_misc_cg(). I think I'll keep the WARN_ON_ONCE for
now as a reminder in case misc does gain custom support for disabling in
the future.


Thanks
Haitao



Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-09 Thread Michal Koutný
On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang wrote:
> It's always non-NULL based on testing.
> 
> It's hard for me to say definitively by reading the code. But IIUC the
> cgroup_disable command line only blocks operations in /sys/fs/cgroup so user
> space can't set up controllers and configure limits, etc., for the disabled
> ones. Each task->cgroups would still have a non-NULL pointer to the static
> root object for each cgroup that is enabled by Kconfig, so
> get_current_misc_cg() and thus sgx_get_current_cg() should not return NULL
> regardless of 'cgroup_disable=misc'.
> 
> Maybe @Michal or @tj can confirm?

The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist.

(It is up to the controller implementation to do further optimization
based on the boot-time disablement (e.g. see uses of
mem_cgroup_disabled()). Not sure if this is useful for misc controller.)

As for the WARN_ON(1), taking the example from memcg -- NULL is best
treated as synonymous with root. It's a judgement call which of the
values to store and when to interpret it.
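
For comparison, memcg's boot-time disablement check is just a
subsys-enabled test (this is the actual helper from
include/linux/memcontrol.h); a misc-controller analogue would look much
the same if such an optimization ever became useful:

static inline bool mem_cgroup_disabled(void)
{
        return !cgroup_subsys_enabled(memory_cgrp_subsys);
}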

HTH,
Michal




Re: Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-04-08 Thread Jason Wang
On Wed, Apr 3, 2024 at 10:47 AM tab  wrote:
>
> > >
> > > On Fri, Mar 29, 2024 at 11:55:50AM +0800, Jason Wang wrote:
> > > > On Wed, Mar 27, 2024 at 5:08 PM Jason Wang  wrote:
> > > > >
> > > > > On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > > > > > From: Rong Wang 
> > > > > > >
> > > > > > > Once an iommu domain is enabled for a device, the MSI
> > > > > > > translation tables have to be there for software-managed MSI.
> > > > > > > Otherwise, a platform with software-managed MSI but without an
> > > > > > > irq bypass function cannot get a correct memory write event
> > > > > > > from PCIe and will not get irqs.
> > > > > > > The solution is to obtain the MSI phy base address from the
> > > > > > > iommu reserved region, and set it as the iommu MSI cookie;
> > > > > > > then the translation tables will be created when the irq is
> > > > > > > requested.
> > > > > > >
> > > > > > > Change log
> > > > > > > --
> > > > > > >
> > > > > > > v1->v2:
> > > > > > > - add resv iotlb to avoid overlap mapping.
> > > > > > > v2->v3:
> > > > > > > - there is no need to export the iommu symbol anymore.
> > > > > > >
> > > > > > > Signed-off-by: Rong Wang 
> > > > > >
> > > > > > There's no interest in extending vhost iotlb further -
> > > > > > we should just switch over to iommufd which supports
> > > > > > this already.
> > > > >
> > > > > IOMMUFD is good, but VFIO supported this before IOMMUFD. This patch
> > > > > makes vDPA run without backporting full IOMMUFD in the production
> > > > > environment. I think it's worth it.
> > > > >
> > > > > If you worry about the extension, we can just use the vhost iotlb
> > > > > existing facility to do this.
> > > > >
> > > > > Thanks
> > > >
> > > > Btw, Wang Rong,
> > > >
> > > > It looks like Cindy does have the bandwidth to work on IOMMUFD
> > > > support.
> > >
> > > I think you mean she does not.
> >
> > Yes, you are right.
> >
> > Thanks
>
> I need to discuss this internally, and someone else may do that.
>
> Thanks.

Ok, please let us know if you have a conclusion.

Thanks

>
> >
> > >
> > > > Do you have the will to do that?
> > > >
> > > > Thanks
> > >
>
>
>
>
> --
> Sent from my NetEase Mail (tablet edition)
> 
>
>
> - Original Message -
> From: "Jason Wang" 
> To: "Michael S. Tsirkin" 
> Cc: "Wang Rong" , k...@vger.kernel.org, 
> virtualizat...@lists.linux.dev, net...@vger.kernel.org, 
> linux-kernel@vger.kernel.org, "Cindy Lu" 
> Sent: Fri, 29 Mar 2024 18:39:54 +0800
> Subject: Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for 
> software-managed MSI
>
> On Fri, Mar 29, 2024 at 5:13 PM Michael S. Tsirkin  wrote:
> >
> > On Fri, Mar 29, 2024 at 11:55:50AM +0800, Jason Wang wrote:
> > > On Wed, Mar 27, 2024 at 5:08 PM Jason Wang  wrote:
> > > >
> > > > On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > > > > From: Rong Wang 
> > > > > >
> > > > > > Once an iommu domain is enabled for a device, the MSI
> > > > > > translation tables have to be there for software-managed MSI.
> > > > > > Otherwise, a platform with software-managed MSI but without an
> > > > > > irq bypass function cannot get a correct memory write event
> > > > > > from PCIe and will not get irqs.
> > > > > > The solution is to obtain the MSI phy base address from the
> > > > > > iommu reserved region, and set it as the iommu MSI cookie;
> > > > > > then the translation tables will be created when the irq is
> > > > > > requested.
> > > > > >
> > > > > > Change log
> > > > > > --
> > > > > >
> > > > > > v1->v2:
> > > > > > - add resv iotlb to avoid overlap mapping.
> > > > > > v2->v3:
> > > > > > - there is no need to export the iommu symbol anymore.
> > > > > >
> > > > > > Signed-off-by: Rong Wang 
> > > > >
> > > > > There's no interest in extending vhost iotlb further -
> > > > > we should just switch over to iommufd which supports
> > > > > this already.
> > > >
> > > > IOMMUFD is good, but VFIO supported this before IOMMUFD. This patch
> > > > makes vDPA run without backporting full IOMMUFD in the production
> > > > environment. I think it's worth it.
> > > >
> > > > If you worry about the extension, we can just use the vhost iotlb
> > > > existing facility to do this.
> > > >
> > > > Thanks
> > >
> > > Btw, Wang Rong,
> > >
> > > It looks like Cindy does have the bandwidth to work on IOMMUFD
> > > support.
> >
> > I think you mean she does not.
>
> Yes, you are right.
>
> Thanks
>
> >
> > > Do you have the will to do that?
> > >
> > > Thanks
> >




Re: Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-02 Thread Haitao Huang

On Tue, 02 Apr 2024 12:40:03 -0500, Michal Koutný  wrote:

On Tue, Apr 02, 2024 at 11:20:21AM -0500, Haitao Huang wrote:

Do we really want to have it implemented in C?


I only pointed to the available C boilerplate.


There are far fewer lines of code in the shell scripts. Note we are not
really testing basic cgroup functionality. All we needed was creating and
deleting cgroups and setting limits, which I think has been demonstrated
to be feasible in the ash scripts now.


I assume you refer to
Message-Id: <20240331174442.51019-1-haitao.hu...@linux.intel.com>
right?

Could it be even simpler if you didn't stick to cgtools APIs and v1
compatibility?

Reducing ash_cgexec.sh to something like
echo 0 >$R/$1/cgroup.procs
shift
exec "$@"
(with some small boilerplate for $R and previous mkdirs)


Yes, thanks for the suggestion.
Haitao



Re: Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-02 Thread Michal Koutný
On Tue, Apr 02, 2024 at 11:20:21AM -0500, Haitao Huang wrote:
> Do we really want to have it implemented in C?

I only pointed to the available C boilerplate.

> There are far fewer lines of code in the shell scripts. Note we are not
> really testing basic cgroup functionality. All we needed was creating and
> deleting cgroups and setting limits, which I think has been demonstrated
> to be feasible in the ash scripts now.

I assume you refer to
Message-Id: <20240331174442.51019-1-haitao.hu...@linux.intel.com>
right?

Could it be even simpler if you didn't stick to cgtools APIs and v1
compatibility?

Reducing ash_cgexec.sh to something like
echo 0 >$R/$1/cgroup.procs
shift
exec "$@"
(with some small boilerplate for $R and previous mkdirs)


Thanks,
Michal




Re: Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-02 Thread Michal Koutný
Hello.

On Sat, Mar 30, 2024 at 01:26:08PM +0200, Jarkko Sakkinen wrote:
> > > It'd be more complicated and less readable to do all the stuff
> > > without the cgroup-tools, esp. cgexec. I checked the dependencies;
> > > cgroup-tools only depends on libc, so I hope this would not cause too
> > > much inconvenience.
> >
> > As per cgroup-tools, please prove this. It makes the job more
> > complicated *for you*, and you are making the job more complicated
> > for every possible person on the planet running any kernel QA.
> >
> > I weight the latter more than the former. And it is exactly the
> > reason why we did a custom user space kselftest in the first place.
> > Let's keep the tradition. All I can say is that kselftest is
> > unfinished in its current form.
> >
> > What is "esp cgexec"?
> 
> Also in kselftest we don't drive for ultimate simplicity, we drive for
> efficient CI/QA. By open-coding the subset of cgroup-tools needed to run
> the test you also help us later on to backtrack the kernel changes. With
> cgroup-tools you would have to use strace to get the same info.

FWIW, see also functions in
tools/testing/selftests/cgroup/cgroup_util.{h,c}.
They likely cover what you need already -- if the tests are in C.
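
For instance, a test built on those helpers might look roughly like this
sketch (the signatures are assumed from cgroup_util.h, and the cgroup
name, limit file and value are made up for illustration):

#include <stdlib.h>
#include "../cgroup/cgroup_util.h"

static int epc_test_fn(const char *cgroup, void *arg)
{
        return 0;       /* placeholder for the actual EPC test body */
}

static int run_epc_test_in_cgroup(const char *root)
{
        char *cg = cg_name(root, "sgx_epc_test");
        int ret = -1;

        if (!cg || cg_create(cg))
                goto out;
        if (cg_write(cg, "misc.max", "sgx_epc 4194304"))
                goto out;
        ret = cg_run(cg, epc_test_fn, NULL);
out:
        if (cg)
                cg_destroy(cg);
        free(cg);
        return ret;
}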

(I admit that stuff in tools/testing/selftests/cgroup/ is best
understood with strace.)

HTH,
Michal




Re: Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-25 Thread Peilin He
>>
>> Introduce a tracepoint for icmp_send, which can help users to get more
>> detailed information conveniently when icmp abnormal events happen.
>>
>> 1. Giving a usecase example:
>> =============================
>> When an application experiences packet loss due to an unreachable UDP
>> destination port, the kernel will send an exception message through the
>> icmp_send function. By adding a trace point for icmp_send, developers or
>> system administrators can obtain detailed information about the UDP
>> packet loss, including the type, code, source address, destination address,
>> source port, and destination port. This facilitates the trouble-shooting
>> of UDP packet loss issues especially for those network-service
>> applications.
>>
>> 2. Operation Instructions:
>> ==========================
>> Switch to the tracing directory.
>> cd /sys/kernel/tracing
>> Filter for destination port unreachable.
>> echo "type==3 && code==3" > events/icmp/icmp_send/filter
>> Enable trace event.
>> echo 1 > events/icmp/icmp_send/enable
>>
>> 3. Result View:
>> ================
>>  udp_client_erro-11370   [002] ...s.12   124.728002:
>>  icmp_send: icmp_send: type=3, code=3.
>>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>>  skbaddr=589b167a
>>
>> Changelog
>> ---------
>> v2->v3:
>> Some fixes according to
>> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
>> 1. Change the tracing directory to /sys/kernel/tracing.
>> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
>>
>> v1->v2:
>> Some fixes according to
>> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
>> 1. adjust the trace_icmp_send() to more protocols than UDP.
>> 2. move the calling of trace_icmp_send after sanity checks
>> in __icmp_send().
>>
>> Signed-off-by: Peilin He
>> Reviewed-by: xu xin 
>> Reviewed-by: Yunkai Zhang 
>> Cc: Yang Yang 
>> Cc: Liu Chun 
>> Cc: Xuexin Jiang 
>> ---
>>  include/trace/events/icmp.h | 64 +
>>  net/ipv4/icmp.c |  4 +++
>>  2 files changed, 68 insertions(+)
>>  create mode 100644 include/trace/events/icmp.h
>>
>> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
>> new file mode 100644
>> index ..2098d4b1b12e
>> --- /dev/null
>> +++ b/include/trace/events/icmp.h
>> @@ -0,0 +1,64 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM icmp
>> +
>> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_ICMP_H
>> +
>> +#include 
>> +#include 
>> +
>> +TRACE_EVENT(icmp_send,
>> +
>> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
>> +
>> +   TP_ARGS(skb, type, code),
>> +
>> +   TP_STRUCT__entry(
>> +   __field(const void *, skbaddr)
>> +   __field(int, type)
>> +   __field(int, code)
>> +   __array(__u8, saddr, 4)
>> +   __array(__u8, daddr, 4)
>> +   __field(__u16, sport)
>> +   __field(__u16, dport)
>> +   __field(unsigned short, ulen)
>> +   ),
>> +
>> +   TP_fast_assign(
>> +   struct iphdr *iph = ip_hdr(skb);
>> +   int proto_4 = iph->protocol;
>> +   __be32 *p32;
>> +
>> +   __entry->skbaddr = skb;
>> +   __entry->type = type;
>> +   __entry->code = code;
>> +
>> +   if (proto_4 == IPPROTO_UDP) {
>> +   struct udphdr *uh = udp_hdr(skb);
>> +   __entry->sport = ntohs(uh->source);
>> +   __entry->dport = ntohs(uh->dest);
>> +   __entry->ulen = ntohs(uh->len);
>
>This is completely bogus.
>
>Adding tracepoints is ok if there are no side effects like bugs :/
>
>At this point there is no guarantee the UDP header is complete/present
>in skb->head
>
>Look at the existing checks between lines 619 and 623
>
>Then audit all icmp_send() callers, and ask yourself if UDP packets
>can not be malicious (like with a truncated UDP header)
Yeah, you are correct. Directly parsing the udphdr from the skb may
conceal bugs, such as an illegal skb. To handle such exceptional scenarios,
we can determine the legitimacy of the skb by checking whether the
uh pointer is out of bounds.

Perhaps it could be modified like this:
struct udphdr *uh = udp_hdr(skb);

if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
    (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb))
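
For illustration, a minimal sketch of how the guarded parse could look
inside TP_fast_assign, combining the bounds check above with the
assignments already present in the v3 patch (a sketch of the idea, not
the submitted code):

	TP_fast_assign(
		struct iphdr *iph = ip_hdr(skb);
		struct udphdr *uh = udp_hdr(skb);
		int proto_4 = iph->protocol;
		__be32 *p32;

		__entry->skbaddr = skb;
		__entry->type = type;
		__entry->code = code;

		/* Only trust the UDP header if it lies fully inside the skb. */
		if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
		    (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) {
			__entry->sport = 0;
			__entry->dport = 0;
			__entry->ulen = 0;
		} else {
			__entry->sport = ntohs(uh->source);
			__entry->dport = ntohs(uh->dest);
			__entry->ulen = ntohs(uh->len);
		}

		p32 = (__be32 *) __entry->saddr;
		*p32 = iph->saddr;

		p32 = (__be32 *) __entry->daddr;
		*p32 = iph->daddr;
	),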

Re: Re: Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-25 Thread Peilin He
>> >> ---------
>> >> v2->v3:
>> >> Some fixes according to
>> >> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
>> >> 1. Change the tracing directory to /sys/kernel/tracing.
>> >> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
>> >>
>> >> v1->v2:
>> >> Some fixes according to
>> >> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
>> >> 1. adjust the trace_icmp_send() to more protocols than UDP.
>> >> 2. move the calling of trace_icmp_send after sanity checks
>> >> in __icmp_send().
>> >>
>> >> Signed-off-by: Peilin He
>> >> Reviewed-by: xu xin 
>> >> Reviewed-by: Yunkai Zhang 
>> >> Cc: Yang Yang 
>> >> Cc: Liu Chun 
>> >> Cc: Xuexin Jiang 
>> >
>> >I think it would be better to target net-next tree since it's not a
>> >fix or something else important.
>> >
>> OK. I would target it for net-next.
>> >> ---
>> >>  include/trace/events/icmp.h | 64 +
>> >>  net/ipv4/icmp.c |  4 +++
>> >>  2 files changed, 68 insertions(+)
>> >>  create mode 100644 include/trace/events/icmp.h
>> >>
>> >> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
>> >> new file mode 100644
>> >> index ..2098d4b1b12e
>> >> --- /dev/null
>> >> +++ b/include/trace/events/icmp.h
>> >> @@ -0,0 +1,64 @@
>> >> +/* SPDX-License-Identifier: GPL-2.0 */
>> >> +#undef TRACE_SYSTEM
>> >> +#define TRACE_SYSTEM icmp
>> >> +
>> >> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
>> >> +#define _TRACE_ICMP_H
>> >> +
>> >> +#include 
>> >> +#include 
>> >> +
>> >> +TRACE_EVENT(icmp_send,
>> >> +
>> >> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
>> >> +
>> >> +   TP_ARGS(skb, type, code),
>> >> +
>> >> +   TP_STRUCT__entry(
>> >> +   __field(const void *, skbaddr)
>> >> +   __field(int, type)
>> >> +   __field(int, code)
>> >> +   __array(__u8, saddr, 4)
>> >> +   __array(__u8, daddr, 4)
>> >> +   __field(__u16, sport)
>> >> +   __field(__u16, dport)
>> >> +   __field(unsigned short, ulen)
>> >> +   ),
>> >> +
>> >> +   TP_fast_assign(
>> >> +   struct iphdr *iph = ip_hdr(skb);
>> >> +   int proto_4 = iph->protocol;
>> >> +   __be32 *p32;
>> >> +
>> >> +   __entry->skbaddr = skb;
>> >> +   __entry->type = type;
>> >> +   __entry->code = code;
>> >> +
>> >> +   if (proto_4 == IPPROTO_UDP) {
>> >> +   struct udphdr *uh = udp_hdr(skb);
>> >> +   __entry->sport = ntohs(uh->source);
>> >> +   __entry->dport = ntohs(uh->dest);
>> >> +   __entry->ulen = ntohs(uh->len);
>> >> +   } else {
>> >> +   __entry->sport = 0;
>> >> +   __entry->dport = 0;
>> >> +   __entry->ulen = 0;
>> >> +   }
>> >
>> >What about using the TP_STORE_ADDR_PORTS_SKB macro to record the sport
>> >and dport like the patch[1] did through extending the use of header
>> >for TCP and UDP?
>> >
>> I believe patch[1] is a good idea as it moves the TCP protocol parsing
>> previously done inside the TP_STORE_ADDR_PORTS_SKB macro to TP_fast_assign,
>> and extracts the TP_STORE_ADDR_PORTS_SKB macro into a common file,
>> enabling support for both UDP and TCP protocol parsing simultaneously.
>>
>> However, patch[1] only extracts the source and destination addresses of
>> the packet, but does not extract the source port and destination port,
>> which limits the significance of my submitted patch.
>
>No, please take a look at TP_STORE_ADDR_PORTS_SKB() macro again. It
>records 4-tuples of the flow.
>
>Thanks,
>Jason
>
Okay, after patch [1] is merged, we will propose an optimization patch based on 
it.
>>
>> Perhaps the patch[1] could be referenced for integration after it is merged.
>> >And, I wonder what the use of tracing ulen of that skb?
>> >
>> The tracking of ulen is primarily aimed at ensuring the legality of received
>> UDP packets and providing developers with more detailed information
>> on exceptions. See net/ipv4/udp.c:2494-2501.
>> >[1]: https://lore.kernel.org/all/1c7156a3f164eb33ef3a25b8432e359f0bb60a8e.1710866188.git.balazs.scheid...@axoflow.com/
>> >
>> >Thanks,
>> >Jason
>> >
>> >> +
>> >> +   p32 = (__be32 *) __entry->saddr;
>> >> +   *p32 = iph->saddr;
>> >> +
>> >> +   p32 = (__be32 *) __entry->daddr;
>> >> +   *p32 = iph->daddr;

Re: Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-25 Thread Jason Xing
On Mon, Mar 25, 2024 at 12:05 PM Peilin He  wrote:
>
> >> ---------
> >> v2->v3:
> >> Some fixes according to
> >> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
> >> 1. Change the tracing directory to /sys/kernel/tracing.
> >> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
> >>
> >> v1->v2:
> >> Some fixes according to
> >> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
> >> 1. adjust the trace_icmp_send() to more protocols than UDP.
> >> 2. move the calling of trace_icmp_send after sanity checks
> >> in __icmp_send().
> >>
> >> Signed-off-by: Peilin He
> >> Reviewed-by: xu xin 
> >> Reviewed-by: Yunkai Zhang 
> >> Cc: Yang Yang 
> >> Cc: Liu Chun 
> >> Cc: Xuexin Jiang 
> >
> >I think it would be better to target net-next tree since it's not a
> >fix or something else important.
> >
> OK. I would target it for net-next.
> >> ---
> >>  include/trace/events/icmp.h | 64 +
> >>  net/ipv4/icmp.c |  4 +++
> >>  2 files changed, 68 insertions(+)
> >>  create mode 100644 include/trace/events/icmp.h
> >>
> >> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> >> new file mode 100644
> >> index ..2098d4b1b12e
> >> --- /dev/null
> >> +++ b/include/trace/events/icmp.h
> >> @@ -0,0 +1,64 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +#undef TRACE_SYSTEM
> >> +#define TRACE_SYSTEM icmp
> >> +
> >> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> >> +#define _TRACE_ICMP_H
> >> +
> >> +#include 
> >> +#include 
> >> +
> >> +TRACE_EVENT(icmp_send,
> >> +
> >> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> >> +
> >> +   TP_ARGS(skb, type, code),
> >> +
> >> +   TP_STRUCT__entry(
> >> +   __field(const void *, skbaddr)
> >> +   __field(int, type)
> >> +   __field(int, code)
> >> +   __array(__u8, saddr, 4)
> >> +   __array(__u8, daddr, 4)
> >> +   __field(__u16, sport)
> >> +   __field(__u16, dport)
> >> +   __field(unsigned short, ulen)
> >> +   ),
> >> +
> >> +   TP_fast_assign(
> >> +   struct iphdr *iph = ip_hdr(skb);
> >> +   int proto_4 = iph->protocol;
> >> +   __be32 *p32;
> >> +
> >> +   __entry->skbaddr = skb;
> >> +   __entry->type = type;
> >> +   __entry->code = code;
> >> +
> >> +   if (proto_4 == IPPROTO_UDP) {
> >> +   struct udphdr *uh = udp_hdr(skb);
> >> +   __entry->sport = ntohs(uh->source);
> >> +   __entry->dport = ntohs(uh->dest);
> >> +   __entry->ulen = ntohs(uh->len);
> >> +   } else {
> >> +   __entry->sport = 0;
> >> +   __entry->dport = 0;
> >> +   __entry->ulen = 0;
> >> +   }
> >
> >What about using the TP_STORE_ADDR_PORTS_SKB macro to record the sport
> >and dport like the patch[1] did through extending the use of header
> >for TCP and UDP?
> >
> I believe patch[1] is a good idea as it moves the TCP protocol parsing
> previously done inside the TP_STORE_ADDR_PORTS_SKB macro to TP_fast_assign,
> and extracts the TP_STORE_ADDR_PORTS_SKB macro into a common file,
> enabling support for both UDP and TCP protocol parsing simultaneously.
>
> However, patch[1] only extracts the source and destination addresses of
> the packet, but does not extract the source port and destination port,
> which limits the significance of my submitted patch.

No, please take a look at TP_STORE_ADDR_PORTS_SKB() macro again. It
records 4-tuples of the flow.

Thanks,
Jason

>
> Perhaps the patch[1] could be referenced for integration after it is merged.
> >And, I wonder what the use of tracing ulen of that skb?
> >
> The tracking of ulen is primarily aimed at ensuring the legality of received
> UDP packets and providing developers with more detailed information
> on exceptions. See net/ipv4/udp.c:2494-2501.
> >[1]: https://lore.kernel.org/all/1c7156a3f164eb33ef3a25b8432e359f0bb60a8e.1710866188.git.balazs.scheid...@axoflow.com/
> >
> >Thanks,
> >Jason
> >
> >> +
> >> +   p32 = (__be32 *) __entry->saddr;
> >> +   *p32 = iph->saddr;
> >> +
> >> +   p32 = (__be32 *) __entry->daddr;
> >> +   *p32 = iph->daddr;
> >> +   ),
> >> +
> >> +   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u ulen=%d skbaddr=%p",
> >> +   __entry->type, 

Re: Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-25 Thread Peilin He
>> ---------
>> v2->v3:
>> Some fixes according to
>> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
>> 1. Change the tracing directory to /sys/kernel/tracing.
>> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
>>
>> v1->v2:
>> Some fixes according to
>> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
>> 1. adjust the trace_icmp_send() to more protocols than UDP.
>> 2. move the calling of trace_icmp_send after sanity checks
>> in __icmp_send().
>>
>> Signed-off-by: Peilin He
>> Reviewed-by: xu xin 
>> Reviewed-by: Yunkai Zhang 
>> Cc: Yang Yang 
>> Cc: Liu Chun 
>> Cc: Xuexin Jiang 
>
>I think it would be better to target net-next tree since it's not a
>fix or something else important.
>
OK. I would target it for net-next.
>> ---
>>  include/trace/events/icmp.h | 64 +
>>  net/ipv4/icmp.c |  4 +++
>>  2 files changed, 68 insertions(+)
>>  create mode 100644 include/trace/events/icmp.h
>>
>> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
>> new file mode 100644
>> index ..2098d4b1b12e
>> --- /dev/null
>> +++ b/include/trace/events/icmp.h
>> @@ -0,0 +1,64 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM icmp
>> +
>> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_ICMP_H
>> +
>> +#include 
>> +#include 
>> +
>> +TRACE_EVENT(icmp_send,
>> +
>> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
>> +
>> +   TP_ARGS(skb, type, code),
>> +
>> +   TP_STRUCT__entry(
>> +   __field(const void *, skbaddr)
>> +   __field(int, type)
>> +   __field(int, code)
>> +   __array(__u8, saddr, 4)
>> +   __array(__u8, daddr, 4)
>> +   __field(__u16, sport)
>> +   __field(__u16, dport)
>> +   __field(unsigned short, ulen)
>> +   ),
>> +
>> +   TP_fast_assign(
>> +   struct iphdr *iph = ip_hdr(skb);
>> +   int proto_4 = iph->protocol;
>> +   __be32 *p32;
>> +
>> +   __entry->skbaddr = skb;
>> +   __entry->type = type;
>> +   __entry->code = code;
>> +
>> +   if (proto_4 == IPPROTO_UDP) {
>> +   struct udphdr *uh = udp_hdr(skb);
>> +   __entry->sport = ntohs(uh->source);
>> +   __entry->dport = ntohs(uh->dest);
>> +   __entry->ulen = ntohs(uh->len);
>> +   } else {
>> +   __entry->sport = 0;
>> +   __entry->dport = 0;
>> +   __entry->ulen = 0;
>> +   }
>
>What about using the TP_STORE_ADDR_PORTS_SKB macro to record the sport
>and dport like the patch[1] did through extending the use of header
>for TCP and UDP?
>
I believe patch[1] is a good idea as it moves the TCP protocol parsing
previously done inside the TP_STORE_ADDR_PORTS_SKB macro to TP_fast_assign,
and extracts the TP_STORE_ADDR_PORTS_SKB macro into a common file,
enabling support for both UDP and TCP protocol parsing simultaneously.

However, patch[1] only extracts the source and destination addresses of
the packet, but does not extract the source port and destination port,
which limits the significance of my submitted patch.

Perhaps the patch[1] could be referenced for integration after it is merged.
>And, I wonder what the use of tracing ulen of that skb?
>
The tracking of ulen is primarily aimed at ensuring the legality of received
UDP packets and providing developers with more detailed information
on exceptions. See net/ipv4/udp.c:2494-2501.
>[1]: https://lore.kernel.org/all/1c7156a3f164eb33ef3a25b8432e359f0bb60a8e.1710866188.git.balazs.scheid...@axoflow.com/
>
>Thanks,
>Jason
>
>> +
>> +   p32 = (__be32 *) __entry->saddr;
>> +   *p32 = iph->saddr;
>> +
>> +   p32 = (__be32 *) __entry->daddr;
>> +   *p32 = iph->daddr;
>> +   ),
>> +
>> +   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u ulen=%d skbaddr=%p",
>> +   __entry->type, __entry->code,
>> +   __entry->saddr, __entry->sport, __entry->daddr,
>> +   __entry->dport, __entry->ulen, __entry->skbaddr)
>> +);
>> +
>> +#endif /* _TRACE_ICMP_H */
>> +
>> +/* This part must be outside protection */
>> +#include 
>> \ No newline at end of file
>> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
>> index e63a3bf99617..21fb41257fe9 100644
>> --- 

Re: Re: [PATCH v2] net/ipv4: add tracepoint for icmp_send

2024-03-20 Thread He Peilin
> > From: Peilin He
> > 
> > Introduce a tracepoint for icmp_send, which can help users to get more
> > detail information conveniently when icmp abnormal events happen.
> > 
> > 1. Giving a usecase example:
> > =============================
> > When an application experiences packet loss due to an unreachable UDP
> > destination port, the kernel will send an exception message through the
> > icmp_send function. By adding a trace point for icmp_send, developers or
> > system administrators can obtain detailed information about the UDP
> > packet loss, including the type, code, source address, destination address,
> > source port, and destination port. This facilitates the trouble-shooting
> > of UDP packet loss issues especially for those network-service
> > applications.
> > 
> > 2. Operation Instructions:
> > ==========================
> > Switch to the tracing directory.
> > cd /sys/kernel/debug/tracing
> 
> FYI, that directory is obsolete. Please always reference /sys/kernel/tracing.
OK. 
> > Filter for destination port unreachable.
> > echo "type==3 && code==3" > events/icmp/icmp_send/filter
> > Enable trace event.
> > echo 1 > events/icmp/icmp_send/enable
> > 
> > 3. Result View:
> > ================
> >  udp_client_erro-11370   [002] ...s.12   124.728002:
> >  icmp_send: icmp_send: type=3, code=3.
> >  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
> >  skbaddr=589b167a
> > 
> > v1->v2:
> > Some fixes according to
> > https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHuFQTJvJsv-nz1x...@mail.gmail.com/
> > 1. adjust the trace_icmp_send() to more protocols than UDP.
> > 2. move the calling of trace_icmp_send after sanity checks
> >in __icmp_send().
> > 
> > Signed-off-by: Peilin He
> > Reviewed-by: xu xin 
> > Reviewed-by: Yunkai Zhang 
> > Cc: Yang Yang 
> > Cc: Liu Chun 
> > Cc: Xuexin Jiang 
> > ---
> >  include/trace/events/icmp.h | 64 +
> >  net/ipv4/icmp.c |  4 +++
> >  2 files changed, 68 insertions(+)
> >  create mode 100644 include/trace/events/icmp.h
> > 
> > diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> > new file mode 100644
> > index ..c3dc337be7bc
> > --- /dev/null
> > +++ b/include/trace/events/icmp.h
> > @@ -0,0 +1,64 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM icmp
> > +
> > +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_ICMP_H
> > +
> > +#include 
> > +#include 
> > +
> > +TRACE_EVENT(icmp_send,
> > +
> > +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> > +
> > +   TP_ARGS(skb, type, code),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(__u16, sport)
> > +   __field(__u16, dport)
> > +   __field(int, type)
> > +   __field(int, code)
> > +   __array(__u8, saddr, 4)
> > +   __array(__u8, daddr, 4)
> > +   __field(const void *, skbaddr)
> > +   __field(unsigned short, ulen)
> 
> Note, to prevent holes, I usually suggest pointers and longs go first,
> followed by ints, and then end with char.
> 
>   __field(const void *, skbaddr)
>   __field(int, type)
>   __field(int, code)
>   __array(__u8, saddr, 4)
>   __array(__u8, daddr, 4)
>   __field(__u16, sport)
>   __field(__u16, dport)
>   __field(unsigned short, ulen)
> 
> -- Steve
Thank you very much for your suggestion. We will rearrange the parameters in 
TP_STRUCT__entry, prioritizing pointers and longs, followed by ints, and ending
with char. This will be reflected in Patch v3.
> 
> 
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   struct iphdr *iph = ip_hdr(skb);
> > +   int proto_4 = iph->protocol;
> > +   __be32 *p32;
> > +
> > +   __entry->skbaddr = skb;
> > +   __entry->type = type;
> > +   __entry->code = code;
> > +
> > +   if (proto_4 == IPPROTO_UDP) {
> > +   struct udphdr *uh = udp_hdr(skb);
> > +   __entry->sport = ntohs(uh->source);
> > +   __entry->dport = ntohs(uh->dest);
> > +   __entry->ulen = ntohs(uh->len);
> > +   } else {
> > +   __entry->sport = 0;
> > +   __entry->dport = 0;
> > +   __entry->ulen = 0;
> > +   }
> > +
> > +   p32 = (__be32 *) __entry->saddr;
> > +   *p32 = iph->saddr;
> > +
> > +   p32 = (__be32 *) __entry->daddr;
> > +   *p32 = iph->daddr;
> > + 

Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-11 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > >  Summary 
> > > 
> > > In my (non-vhost experience) opinion the way to go would be either
> > > replacing the cond_resched with a hard schedule or setting the
> > > need_resched flag within vhost if a data transfer was successfully
> > > initiated. It will be necessary to check if this causes problems with
> > > other workloads/benchmarks.
> > 
> > Yes but conceptually I am still in the dark on whether the fact that
> > periodically invoking cond_resched is no longer sufficient to be nice to
> > others is a bug, or intentional.  So you feel it is intentional?
> 
> I would assume that cond_resched is still a valid concept.
> But, in this particular scenario we have the following problem:
> 
> So far (with CFS) we had:
> 1. vhost initiates data transfer
> 2. kworker is woken up
> 3. CFS gives priority to woken up task and schedules it
> 4. kworker runs
> 
> Now (with EEVDF) we have:
> 0. In some cases, kworker has accumulated negative lag 
> 1. vhost initiates data transfer
> 2. kworker is woken up
> -3a. EEVDF does not schedule kworker if it has negative lag
> -4a. vhost continues running, kworker on same CPU starves
> --
> -3b. EEVDF schedules kworker if it has positive or no lag
> -4b. kworker runs
> 
> In the 3a/4a case, the kworker is given no chance to set the
> necessary flag. The flag can only be set by another CPU now.
> The schedule of the kworker was not caused by cond_resched, but
> rather by the wakeup path of the scheduler.
> 
> cond_resched works successfully once the load balancer (I suppose) 
> decides to migrate the vhost off to another CPU. In that case, the
> load balancer on another CPU sets that flag and we are good.
> That then eventually allows the scheduler to pick kworker, but very
> late.

Are we going anywhere with this btw?


> > I propose a two patch series then:
> > 
> > patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> > 
> > If you're doing longer computations: first think userspace. If you
> > **really** want to do it in kernel you should regularly check if you need
> > to give up the CPU (remember there is cooperative multitasking per CPU).
> > Idiom::
> > 
> > cond_resched(); /* Will sleep */
> > 
> > 
> > replace cond_resched -> schedule
> > 
> > 
> > Since apparently cond_resched is no longer sufficient to
> > make the scheduler check whether you need to give up the CPU.
> > 
> > patch 2: make this change for vhost.
> > 
> > WDYT?
> 
> For patch 1, I would like to see some feedback from Peter (or someone else
> from the scheduler maintainers).
> For patch 2, I would prefer to do some more testing first if this might have
> an negative effect on other benchmarks.
> 
> I also stumbled upon something in the scheduler code that I want to verify.
> Maybe a cgroup thing, will check that out again.
> 
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.
> 
> > 
> > -- 
> > MST
> > 
> > 




Re: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-08 Thread Honggyu Kim
Hi SeongJae,

I couldn't send email to LKML properly due to internal system issues,
but it's better now so I will be able to communicate better.

On Wed,  6 Mar 2024 19:05:50 -0800 SeongJae Park  wrote:
> 
> Hello,
> 
> On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park  wrote:
> 
> > On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim  wrote:
> > 
> > > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > > posted at [1].
> > > 
> > > It says there is no implementation of the demote/promote DAMOS action
> > > are made.  This RFC is about its implementation for physical address
> > > space.
> [...]
> > Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we discussed 
> > about
> > this patchset in high level.  Sharing the summary here for open discussion. 
> >  As
> > also discussed on the first version of this patchset[2], we want to make 
> > single
> > action for general page migration with minimum changes, but would like to 
> > keep
> > page level access re-check.  We also agreed the previously proposed DAMOS
> > filter-based approach could make sense for the purpose.
> > 
> > Because I was anyway planning making such DAMOS filter for not only
> > promotion/demotion but other types of DAMOS action, I will start developing 
> > the
> > page level access re-check results based DAMOS filter.  Once the 
> > implementation
> > of the prototype is done, I will share the early implementation.  Then, 
> > Honggyu
> > will adjust their implementation based on the filter, and run their tests 
> > again
> > and share the results.
> 
> I just posted an RFC patchset for the page level access re-check DAMOS filter:
> https://lore.kernel.org/r/20240307030013.47041-1...@kernel.org
> 
> I hope it to help you better understanding and testing the idea.

Thanks very much for your work! I will test it based on your changes.

Thanks,
Honggyu

> 
> > 
> > [1] https://lore.kernel.org/damon/20220810225102.124459-1...@kernel.org/
> > [2] https://lore.kernel.org/damon/20240118171756.80356-1...@kernel.org
> 
> 
> Thanks,
> SJ
> 
> [...]
> 
> 
> 



Re: Re: [PATCH] net/ipv4: add tracepoint for icmp_send

2024-03-02 Thread He Peilin
> >  include/trace/events/icmp.h | 57 +
> >  net/ipv4/icmp.c |  4 
> >  2 files changed, 61 insertions(+)
> >  create mode 100644 include/trace/events/icmp.h
> > 
> > diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> > new file mode 100644
> > index ..3d9af5769bc3
> > --- /dev/null
> > +++ b/include/trace/events/icmp.h
> > @@ -0,0 +1,57 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM icmp
> > +
> > +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_ICMP_H
> > +
> > +#include 
> > +#include 
> > +
> > +TRACE_EVENT(icmp_send,
> > +
> > +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> > +
> > +   TP_ARGS(skb, type, code),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(__u16, sport)
> 
> 2 bytes
> 
> > +   __field(__u16, dport)
> 
> 2 bytes
> 
> > +   __field(unsigned short, ulen)
> 
> 2 bytes
> 
> [ 2 byte hole for alignment ]
> 
> > +   __field(const void *, skbaddr)
> 
> 4/8 bytes
> 
> It's best to keep the holes at the end of the TP_STRUCT__entry().
> 
> That is, I would move ulen to the end of the structure. It doesn't affect
> anything else.
> 
> -- Steve
Thank you for pointing that out. The next step is to move __field(unsigned 
short, ulen) to the end of TP_STRUCT__entry().
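
The underlying rule, sketched with hypothetical struct names (on 64-bit,
mixing small fields and pointer-sized fields in the middle creates a hole):

	struct entry_bad {
		__u16		sport;		/* 2 bytes */
		__u16		dport;		/* 2 bytes */
		unsigned short	ulen;		/* 2 bytes */
						/* 2-byte hole: skbaddr needs 8-byte alignment */
		const void	*skbaddr;	/* 8 bytes */
	};

	struct entry_good {
		const void	*skbaddr;	/* pointers and longs first */
		__u16		sport;		/* then the smaller fields, */
		__u16		dport;		/* so any padding lands at */
		unsigned short	ulen;		/* the end of the struct */
	};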

> 
> > +   __field(int, type)
> > +   __field(int, code)
> > +   __array(__u8, saddr, 4)
> > +   __array(__u8, daddr, 4)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   // Get UDP header
> > +   struct udphdr *uh = udp_hdr(skb);
> > +   struct iphdr *iph = ip_hdr(skb);
> > +   __be32 *p32;
> > +
> > +   __entry->sport = ntohs(uh->source);
> > +   __entry->dport = ntohs(uh->dest);
> > +   __entry->ulen = ntohs(uh->len);
> > +   __entry->skbaddr = skb;
> > +   __entry->type = type;
> > +   __entry->code = code;
> > +
> > +   p32 = (__be32 *) __entry->saddr;
> > +   *p32 = iph->saddr;
> > +
> > +   p32 = (__be32 *) __entry->daddr;
> > +   *p32 =  iph->daddr;
> > +   ),
> > +
> > +   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u 
> > ulen=%d skbaddr=%p",
> > +   __entry->type, __entry->code,
> > +   __entry->saddr, __entry->sport, __entry->daddr,
> > +   __entry->dport, __entry->ulen, __entry->skbaddr)
> > +);
> > +
> > +#endif /* _TRACE_ICMP_H */
> > +
> > +/* This part must be outside protection */
> > +#include 
> > diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> > index e63a3bf99617..437bdb7e2650 100644
> > --- a/net/ipv4/icmp.c
> > +++ b/net/ipv4/icmp.c
> > @@ -92,6 +92,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#define CREATE_TRACE_POINTS
> > +#include 
> > 
> >  /*
> >   * Build xmit assembly blocks
> > @@ -599,6 +601,8 @@ void __icmp_send(struct sk_buff *skb_in, int type, int 
> > code, __be32 info,
> > struct net *net;
> > struct sock *sk;
> > 
> > +   trace_icmp_send(skb_in, type, code);
> > +
> > if (!rt)
> > goto out;
> >  




Re: Re: [PATCH] net/ipv4: add tracepoint for icmp_send

2024-03-02 Thread He Peilin
> >  include/trace/events/icmp.h | 57 +
> >  net/ipv4/icmp.c |  4 
> >  2 files changed, 61 insertions(+)
> >  create mode 100644 include/trace/events/icmp.h
> >
> > diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> > new file mode 100644
> > index ..3d9af5769bc3
> > --- /dev/null
> > +++ b/include/trace/events/icmp.h
> > @@ -0,0 +1,57 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM icmp
> > +
> > +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_ICMP_H
> > +
> > +#include 
> > +#include 
> > +
> > +TRACE_EVENT(icmp_send,
> > +
> > +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> > +
> > +   TP_ARGS(skb, type, code),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(__u16, sport)
> > +   __field(__u16, dport)
> > +   __field(unsigned short, ulen)
> > +   __field(const void *, skbaddr)
> > +   __field(int, type)
> > +   __field(int, code)
> > +   __array(__u8, saddr, 4)
> > +   __array(__u8, daddr, 4)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   // Get UDP header
> > +   struct udphdr *uh = udp_hdr(skb);
> > +   struct iphdr *iph = ip_hdr(skb);
> > +   __be32 *p32;
> > +
> > +   __entry->sport = ntohs(uh->source);
> > +   __entry->dport = ntohs(uh->dest);
> > +   __entry->ulen = ntohs(uh->len);
> > +   __entry->skbaddr = skb;
> > +   __entry->type = type;
> > +   __entry->code = code;
> > +
> > +   p32 = (__be32 *) __entry->saddr;
> > +   *p32 = iph->saddr;
> > +
> > +   p32 = (__be32 *) __entry->daddr;
> > +   *p32 =  iph->daddr;
> > +   ),
> > +
> 
> FYI, ICMP can be generated for many other protocols than UDP.

We have noted this issue. Therefore, a UDP judgment condition has been added in
TP_fast_assign. This condition will only track abnormal messages when icmp_send
is called with the UDP protocol. Otherwise, it will simply print the abnormal
type and code.

As follows:
if (proto_4 == IPPROTO_UDP) {
	struct udphdr *uh = udp_hdr(skb);
	__entry->sport = ntohs(uh->source);
	__entry->dport = ntohs(uh->dest);
	__entry->ulen = ntohs(uh->len);
} else {
	__entry->sport = 0;
	__entry->dport = 0;
	__entry->ulen = 0;
}

> 
> > +   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to 
> > %pI4:%u ulen=%d skbaddr=%p",
> > +   __entry->type, __entry->code,
> > +   __entry->saddr, __entry->sport, __entry->daddr,
> > +   __entry->dport, __entry->ulen, __entry->skbaddr)
> > +);
> > +
> > +#endif /* _TRACE_ICMP_H */
> > +
> > +/* This part must be outside protection */
> > +#include 
> > diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> > index e63a3bf99617..437bdb7e2650 100644
> > --- a/net/ipv4/icmp.c
> > +++ b/net/ipv4/icmp.c
> > @@ -92,6 +92,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#define CREATE_TRACE_POINTS
> > +#include 
> >
> >  /*
> >   * Build xmit assembly blocks
> > @@ -599,6 +601,8 @@ void __icmp_send(struct sk_buff *skb_in, int type, int 
> > code, __be32 info,
> > struct net *net;
> > struct sock *sk;
> >
> > +   trace_icmp_send(skb_in, type, code);
> 
> I think you missed many sanity checks between lines 622 and 676
Thank you for the reminder. The next step is to move the trace point to line 
676.

> Honestly, a kprobe BPF based solution would be less risky, and less
> maintenance for us.
Emm, yeah, but tracepoints have the advantage of convenience, especially for
those embedded Linux systems which don't support eBPF.




Re: Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-27 Thread Michal Koutný
Hello.

On Mon, Feb 26, 2024 at 03:48:18PM -0600, Haitao Huang 
 wrote:
> In case of overcommitting, i.e., sum of limits greater than the EPC capacity,
> if one group has a fault, and its usage is not above its own limit
> (try_charge() passes), yet total usage of the system has exceeded the
> capacity, whether we do global reclaim or just reclaim pages in the current
> faulting group.
> 
> > Also, what does the core mm memcg code do?
> > 
> I'm not sure. I'll try to find out but it'd be appreciated if someone more
> knowledgeable can comment on this. memcg also has the protection mechanism
> (i.e., min, low settings) to guarantee some allocation per group so its
> approach might not be applicable to misc controller here.

I only follow the discussion superficially but it'd be nice to have
analogous mechanisms in memcg and sgx controller.

The memory limits are rather simple -- when allocating new memory, the
tightest limit among ancestors applies, and reclaim is applied to the whole
subtree of such an ancestor (not necessarily the originating cgroup of
the allocation). Overcommit is admitted; competition among siblings is
resolved on a first-come, first-served basis.

The memory protections are an additional (and in a sense orthogonal)
mechanism. They can be interpreted as limits that are enforced not at
the time of allocation but at the time of reclaim (and reclaim is
triggered independently, either globally or by the limits above). The
consequence is that the protection code must do some normalization to
evaluate overcommitted values sensibly.
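
To make the distinction concrete, a small cgroup v2 sketch (the paths and
values here are illustrative assumptions, not taken from this thread):

  # Limits: checked at allocation time; the tightest ancestor limit wins.
  mkdir -p /sys/fs/cgroup/parent/child
  echo 1G > /sys/fs/cgroup/parent/memory.max
  echo 2G > /sys/fs/cgroup/parent/child/memory.max   # parent's 1G still caps the child

  # Protections: evaluated at reclaim time instead of allocation time.
  echo 512M > /sys/fs/cgroup/parent/child/memory.low # reclaim tries to spare this much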

(Historically, there was also memory.soft_limit_in_bytes, which combined
"protection" and global reclaim but it turned out broken. I suggest
reading Documentation/admin-guide/cgroup-v2.rst section Controller
Issues and Remedies/Memory for more details and as cautionary example.)

HTH,
Michal




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-22 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.

As I predicted, if you want attention from sched guys you need to
send a patch in their area.

-- 
MST




Re: RE: [PATCH net-next 1/2] xsk: Remove non-zero 'dma_page' check in xp_assign_dev

2024-02-21 Thread Xuan Zhuo
On Wed, 21 Feb 2024 11:37:22 +, wangyunjian  wrote:
> > -Original Message-
> > From: Xuan Zhuo [mailto:xuanz...@linux.alibaba.com]
> > Sent: Wednesday, February 21, 2024 5:53 PM
> > To: wangyunjian 
> > Cc: net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> > ; wangyunjian ;
> > m...@redhat.com; willemdebruijn.ker...@gmail.com; jasow...@redhat.com;
> > k...@kernel.org; da...@davemloft.net; magnus.karls...@intel.com
> > Subject: Re: [PATCH net-next 1/2] xsk: Remove non-zero 'dma_page' check in
> > xp_assign_dev
> >
> > On Wed, 24 Jan 2024 17:37:38 +0800, Yunjian Wang
> >  wrote:
> > > Now dma mappings are used by the physical NICs. However the vNIC maybe
> > > do not need them. So remove non-zero 'dma_page' check in
> > > xp_assign_dev.
> >
> > Could you tell me which one nic can work with AF_XDP without DMA?
>
> TUN will support AF_XDP Tx zero-copy, which does not require DMA mappings.


Great. Though I do not know how it works, I think a new option or feature
is better.

Thanks.


>
> Thanks
>
> >
> > Thanks.
> >
> >
> > >
> > > Signed-off-by: Yunjian Wang 
> > > ---
> > >  net/xdp/xsk_buff_pool.c | 7 ---
> > >  1 file changed, 7 deletions(-)
> > >
> > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c index
> > > 28711cc44ced..939b6e7b59ff 100644
> > > --- a/net/xdp/xsk_buff_pool.c
> > > +++ b/net/xdp/xsk_buff_pool.c
> > > @@ -219,16 +219,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> > >   if (err)
> > >   goto err_unreg_pool;
> > >
> > > - if (!pool->dma_pages) {
> > > - WARN(1, "Driver did not DMA map zero-copy buffers");
> > > - err = -EINVAL;
> > > - goto err_unreg_xsk;
> > > - }
> > >   pool->umem->zc = true;
> > >   return 0;
> > >
> > > -err_unreg_xsk:
> > > - xp_disable_drv_zc(pool);
> > >  err_unreg_pool:
> > >   if (!force_zc)
> > >   err = 0; /* fallback to copy mode */
> > > --
> > > 2.33.0
> > >
> > >



Re: Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-20 Thread Huang, Kai
On Tue, 2024-02-20 at 14:18 +0100, Michal Koutný wrote:
> On Tue, Feb 20, 2024 at 09:52:39AM +, "Huang, Kai"  
> wrote:
> > I am not sure, but is it possible or legal for an ancestor to have less 
> > limit
> > than children?
> 
> Why not?
> It is desired for proper hierarchical delegation and the tightest limit
> of ancestors applies (cf. memory.max).
> 

OK.  Thanks for the info.


Re: Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-20 Thread Michal Koutný
On Tue, Feb 20, 2024 at 09:52:39AM +, "Huang, Kai"  
wrote:
> I am not sure, but is it possible or legal for an ancestor to have less limit
> than children?

Why not?
It is desired for proper hierarchical delegation and the tightest limit
of ancestors applies (cf. memory.max).

Michal




Re: Re: [PATCH] vhost-vdpa: fail enabling virtqueue in certain conditions

2024-02-07 Thread Stefano Garzarella

On Wed, Feb 07, 2024 at 11:27:14AM +0800, Jason Wang wrote:

On Tue, Feb 6, 2024 at 10:52 PM Stefano Garzarella  wrote:


If VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK is not negotiated, we expect
the driver to enable virtqueue before setting DRIVER_OK. If the driver
tries anyway, better to fail right away as soon as we get the ioctl.
Let's also update the documentation to make it clearer.

We had a problem in QEMU for not meeting this requirement, see
https://lore.kernel.org/qemu-devel/20240202132521.32714-1-kw...@redhat.com/


Maybe it's better to only enable cvq when the backend supports
VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK. Eugenio, any comment on this?



Fixes: 9f09fd6171fe ("vdpa: accept VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend 
feature")
Cc: epere...@redhat.com
Signed-off-by: Stefano Garzarella 
---
 include/uapi/linux/vhost_types.h | 3 ++-
 drivers/vhost/vdpa.c | 4 
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index d7656908f730..5df49b6021a7 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -182,7 +182,8 @@ struct vhost_vdpa_iova_range {
 /* Device can be resumed */
 #define VHOST_BACKEND_F_RESUME  0x5
 /* Device supports the driver enabling virtqueues both before and after
- * DRIVER_OK
+ * DRIVER_OK. If this feature is not negotiated, the virtqueues must be
+ * enabled before setting DRIVER_OK.
  */
 #define VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK  0x6
 /* Device may expose the virtqueue's descriptor area, driver area and
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index bc4a51e4638b..1fba305ba8c1 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -651,6 +651,10 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, 
unsigned int cmd,
case VHOST_VDPA_SET_VRING_ENABLE:
if (copy_from_user(&s, argp, sizeof(s)))
return -EFAULT;
+   if (!vhost_backend_has_feature(vq,
+   VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK) &&
+   (ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
+   return -EINVAL;


As discussed, without VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK, we don't
know if parents can do vq_ready after driver_ok.

So maybe we need to keep this behaviour to unbreak some "legacy" userspace?


I'm not sure it's a good idea, since "legacy" userspace is currently 
broken if used with a VDUSE device. So we need to fix userspace in any 
case, and IMHO it is better if we start to return an error, so the user 
understands what went wrong, because the problem in QEMU took us quite 
some time to figure out that we couldn't enable vq after DRIVER_OK.


Since userspace is unable to understand if a vhost-vdpa device is VDUSE 
or not, I think we have only 2 options either merge this patch or fix 
VDUSE somehow. But the last one I think is more complicated/intrusive.


Thanks,
Stefano



For example ifcvf did:

static void ifcvf_vdpa_set_vq_ready(struct vdpa_device *vdpa_dev,
				    u16 qid, bool ready)
{
	struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev);

	ifcvf_set_vq_ready(vf, qid, ready);
}

And it did:

void ifcvf_set_vq_ready(struct ifcvf_hw *hw, u16 qid, bool ready)
{
	struct virtio_pci_common_cfg __iomem *cfg = hw->common_cfg;

	vp_iowrite16(qid, &cfg->queue_select);
	vp_iowrite16(ready, &cfg->queue_enable);
}

Though it didn't advertise VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK?

Adding LingShan for more thought.

Thanks


ops->set_vq_ready(vdpa, idx, s.num);
return 0;
case VHOST_VDPA_GET_VRING_GROUP:
--
2.43.0
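
For context, the ordering a userspace driver would follow under this rule
might look like the sketch below (ioctl sequence only; device setup and
error handling are omitted, and the exact flow is an assumption made for
illustration, not code from this thread):

	__u8 status = 0;
	struct vhost_vring_state s = { .index = qid, .num = 1 };

	if (backend_features & (1ULL << VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK)) {
		/* Feature negotiated: enabling after DRIVER_OK is also allowed. */
		status |= VIRTIO_CONFIG_S_DRIVER_OK;
		ioctl(fd, VHOST_VDPA_SET_STATUS, &status);
		ioctl(fd, VHOST_VDPA_SET_VRING_ENABLE, &s);
	} else {
		/* No feature: every vq must be enabled before DRIVER_OK ... */
		ioctl(fd, VHOST_VDPA_SET_VRING_ENABLE, &s);
		/* ... since a late enable would now be rejected with -EINVAL. */
		status |= VIRTIO_CONFIG_S_DRIVER_OK;
		ioctl(fd, VHOST_VDPA_SET_STATUS, &status);
	}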








Re: Re: [PATCH] vhost-vdpa: fail enabling virtqueue in certain conditions

2024-02-06 Thread Stefano Garzarella

On Tue, Feb 06, 2024 at 10:56:50AM -0500, Michael S. Tsirkin wrote:

better @subj: try late vq enable only if negotiated


I rewrote it 3/4 times, and before sending it I was not happy with the 
result.


Thank you, much better! I'll change it in v2.

Stefano




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-01 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > >  Summary 
> > > 
> > > In my (non-vhost experience) opinion the way to go would be either
> > > replacing the cond_resched with a hard schedule or setting the
> > need_resched flag within vhost if a data transfer was successfully
> > > initiated. It will be necessary to check if this causes problems with
> > > other workloads/benchmarks.
> > 
> > Yes but conceptually I am still in the dark on whether the fact that
> > periodically invoking cond_resched is no longer sufficient to be nice to
> > others is a bug, or intentional.  So you feel it is intentional?
> 
> I would assume that cond_resched is still a valid concept.
> But, in this particular scenario we have the following problem:
> 
> So far (with CFS) we had:
> 1. vhost initiates data transfer
> 2. kworker is woken up
> 3. CFS gives priority to woken up task and schedules it
> 4. kworker runs
> 
> Now (with EEVDF) we have:
> 0. In some cases, kworker has accumulated negative lag 
> 1. vhost initiates data transfer
> 2. kworker is woken up
> -3a. EEVDF does not schedule kworker if it has negative lag
> -4a. vhost continues running, kworker on same CPU starves
> --
> -3b. EEVDF schedules kworker if it has positive or no lag
> -4b. kworker runs
> 
> In the 3a/4a case, the kworker is given no chance to set the
> necessary flag. The flag can only be set by another CPU now.
> The schedule of the kworker was not caused by cond_resched, but
> rather by the wakeup path of the scheduler.
> 
> cond_resched works successfully once the load balancer (I suppose) 
> decides to migrate the vhost off to another CPU. In that case, the
> load balancer on another CPU sets that flag and we are good.
> That then eventually allows the scheduler to pick kworker, but very
> late.

I don't really understand what is special about vhost though.
Wouldn't it apply to any kernel code?

> > I propose a two patch series then:
> > 
> > patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> > 
> > If you're doing longer computations: first think userspace. If you
> > **really** want to do it in kernel you should regularly check if you need
> > to give up the CPU (remember there is cooperative multitasking per CPU).
> > Idiom::
> > 
> > cond_resched(); /* Will sleep */
> > 
> > 
> > replace cond_resched -> schedule
> > 
> > 
> > Since apparently cond_resched is no longer sufficient to
> > make the scheduler check whether you need to give up the CPU.
> > 
> > patch 2: make this change for vhost.
> > 
> > WDYT?
> 
> For patch 1, I would like to see some feedback from Peter (or someone else
> from the scheduler maintainers).

I am guessing once you post it you will see feedback.

> For patch 2, I would prefer to do some more testing first if this might have
> an negative effect on other benchmarks.
> 
> I also stumbled upon something in the scheduler code that I want to verify.
> Maybe a cgroup thing, will check that out again.
> 
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.
> 
> > 
> > -- 
> > MST
> > 
> > 




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-01 Thread Tobias Huschle
On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > 
> >  Summary 
> > 
> > In my (non-vhost experience) opinion the way to go would be either
> > replacing the cond_resched with a hard schedule or setting the
> > need_resched flag within vhost if the a data transfer was successfully
> > initiated. It will be necessary to check if this causes problems with
> > other workloads/benchmarks.
> 
> Yes but conceptually I am still in the dark on whether the fact that
> periodically invoking cond_resched is no longer sufficient to be nice to
> others is a bug, or intentional.  So you feel it is intentional?

I would assume that cond_resched is still a valid concept.
But, in this particular scenario we have the following problem:

So far (with CFS) we had:
1. vhost initiates data transfer
2. kworker is woken up
3. CFS gives priority to woken up task and schedules it
4. kworker runs

Now (with EEVDF) we have:
0. In some cases, kworker has accumulated negative lag 
1. vhost initiates data transfer
2. kworker is woken up
-3a. EEVDF does not schedule kworker if it has negative lag
-4a. vhost continues running, kworker on same CPU starves
--
-3b. EEVDF schedules kworker if it has positive or no lag
-4b. kworker runs

In the 3a/4a case, the kworker is given no chance to set the
necessary flag. The flag can only be set by another CPU now.
The schedule of the kworker was not caused by cond_resched, but
rather by the wakeup path of the scheduler.

cond_resched works successfully once the load balancer (I suppose) 
decides to migrate the vhost off to another CPU. In that case, the
load balancer on another CPU sets that flag and we are good.
That then eventually allows the scheduler to pick kworker, but very
late.

> I propose a two patch series then:
> 
> patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> 
> If you're doing longer computations: first think userspace. If you
> **really** want to do it in kernel you should regularly check if you need
> to give up the CPU (remember there is cooperative multitasking per CPU).
> Idiom::
> 
> cond_resched(); /* Will sleep */
> 
> 
> replace cond_resched -> schedule
> 
> 
> Since apparently cond_resched is no longer sufficient to
> make the scheduler check whether you need to give up the CPU.
> 
> patch 2: make this change for vhost.
> 
> WDYT?

For patch 1, I would like to see some feedback from Peter (or someone else
from the scheduler maintainers).
For patch 2, I would prefer to do some more testing first if this might have
an negative effect on other benchmarks.

I also stumbled upon something in the scheduler code that I want to verify.
Maybe a cgroup thing, will check that out again.

I'll do some more testing with the cond_resched->schedule fix, check the
cgroup thing and wait for Peter then.
Will get back if any of the above yields some results.

> 
> -- 
> MST
> 
> 



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-01 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > - Along with the wakeup of the kworker, need_resched needs to
> > >   be set, such that cond_resched() triggers a reschedule.
> > 
> > Let's try this? Does not look like discussing vhost itself will
> > draw attention from scheduler guys but posting a scheduling
> > patch probably will? Can you post a patch?
> 
> As a baseline, I verified that the following two options fix
> the regression:
> 
> - replacing the cond_resched in the vhost_worker function with a hard
>   schedule 
> - setting the need_resched flag using set_tsk_need_resched(current)
>   right before calling cond_resched
> 
> I then tried to find a better spot to put the set_tsk_need_resched
> call. 
> 
> One approach I found to be working is setting the need_resched flag 
> at the end of handle_tx and handle_rx.
> This would be after data has been actually passed to the socket, so 
> the originally blocked kworker has something to do and will profit
> from the reschedule. 
> It might be possible to go deeper and place the set_tsk_need_resched
> call to the location right after actually passing the data, but this
> might leave us with sprinkling that call in multiple places and
> might be too intrusive.
> Furthermore, it might be possible to check if an error occurred when
> preparing the transmission and then skip the setting of the flag.
> 
> This would require a conceptual decision on the vhost side.
> This solution would not touch the scheduler, only incentivise it to
> do the right thing for this particular regression.
> 
> Another idea could be to find the counterpart that initiates the
> actual data transfer, which I assume wakes up the kworker. From
> what I gather it seems to be an eventfd notification that ends up
> somewhere in the qemu code. Not sure if that context would allow
> to set the need_resched flag, nor whether this would be a good idea.
> 
> > 
> > > - On cond_resched(), verify if the consumed runtime of the caller
> > >   is outweighing the negative lag of another process (e.g. the 
> > >   kworker) and schedule the other process. Introduces overhead
> > >   to cond_resched.
> > 
> > Or this last one.
> 
> On cond_resched itself, this will probably only be possible in a very 
> very hacky way. That is because currently, there is no immediate access
> to the necessary data available, which would make it necessary to 
> bloat up the cond_resched function quite a bit, with a probably 
> non-negligible amount of overhead.
> 
> Changing other aspects in the scheduler might get us in trouble as
> they all would probably resolve back to the question "What is the magic
> value that determines whether a small task not being scheduled justifies
> setting the need_resched flag for a currently running task or adjusting 
> its lag?". As this would then also have to work for all non-vhost related
> cases, this looks like a dangerous path to me on second thought.
> 
> 
>  Summary 
> 
> In my (non-vhost experience) opinion the way to go would be either
> replacing the cond_resched with a hard schedule or setting the
> need_resched flag within vhost if the a data transfer was successfully
> initiated. It will be necessary to check if this causes problems with
> other workloads/benchmarks.

Yes but conceptually I am still in the dark on whether the fact that
periodically invoking cond_resched is no longer sufficient to be nice to
others is a bug, or intentional.  So you feel it is intentional?
I propose a two patch series then:

patch 1: in this text in Documentation/kernel-hacking/hacking.rst

If you're doing longer computations: first think userspace. If you
**really** want to do it in kernel you should regularly check if you need
to give up the CPU (remember there is cooperative multitasking per CPU).
Idiom::

cond_resched(); /* Will sleep */


replace cond_resched -> schedule


Since apparently cond_resched is no longer sufficient to
make the scheduler check whether you need to give up the CPU.

patch 2: make this change for vhost.

WDYT?

-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-01-31 Thread Tobias Huschle
On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > - Along with the wakeup of the kworker, need_resched needs to
> >   be set, such that cond_resched() triggers a reschedule.
> 
> Let's try this? Does not look like discussing vhost itself will
> draw attention from scheduler guys but posting a scheduling
> patch probably will? Can you post a patch?

As a baseline, I verified that the following two options fix
the regression:

- replacing the cond_resched in the vhost_worker function with a hard
  schedule 
- setting the need_resched flag using set_tsk_need_resched(current)
  right before calling cond_resched
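
Roughly, the two variants look like this against a simplified vhost_worker
work loop (the exact loop body differs across kernel versions, so this is
a sketch of the idea rather than a patch):

	/* Variant 1: replace the conditional reschedule with a hard one. */
	llist_for_each_entry_safe(work, work_next, node, node) {
		clear_bit(VHOST_WORK_QUEUED, &work->flags);
		__set_current_state(TASK_RUNNING);
		work->fn(work);
		schedule();			/* was: cond_resched(); */
	}

	/* Variant 2: force the resched flag so cond_resched() takes effect. */
	llist_for_each_entry_safe(work, work_next, node, node) {
		clear_bit(VHOST_WORK_QUEUED, &work->flags);
		__set_current_state(TASK_RUNNING);
		work->fn(work);
		set_tsk_need_resched(current);
		cond_resched();
	}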

I then tried to find a better spot to put the set_tsk_need_resched
call. 

One approach I found to work is setting the need_resched flag
at the end of handle_tx and handle_rx.
This would be after data has been actually passed to the socket, so 
the originally blocked kworker has something to do and will profit
from the reschedule. 
It might be possible to go deeper and place the set_tsk_need_resched
call at the location right after actually passing the data, but this
might leave us with sprinkling that call in multiple places and
might be too intrusive.
Furthermore, it might be possible to check if an error occurred when
preparing the transmission and then skip the setting of the flag.

This would require a conceptual decision on the vhost side.
This solution would not touch the scheduler, only incentivise it to
do the right thing for this particular regression.
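[Editor's sketch of this idea: set_tsk_need_resched() is from <linux/sched.h>;
the context lines in handle_rx() of drivers/vhost/net.c are approximate and
the change is untested:]

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ static void handle_rx(struct vhost_net *net)
 out:
 	mutex_unlock(&vq->mutex);
+	/* make the next cond_resched() in the vhost worker actually yield */
+	set_tsk_need_resched(current);
 }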

Another idea could be to find the counterpart that initiates the
actual data transfer, which I assume wakes up the kworker. From
what I gather it seems to be an eventfd notification that ends up
somewhere in the qemu code. Not sure if that context would allow
to set the need_resched flag, nor whether this would be a good idea.

> 
> > - On cond_resched(), verify if the consumed runtime of the caller
> >   is outweighing the negative lag of another process (e.g. the 
> >   kworker) and schedule the other process. Introduces overhead
> >   to cond_resched.
> 
> Or this last one.

On cond_resched itself, this will probably only be possible in a very 
very hacky way. That is because currently, there is no immediate access
to the necessary data available, which would make it necessary to 
bloat up the cond_resched function quite a bit, with a probably 
non-negligible amount of overhead.

Changing other aspects in the scheduler might get us in trouble as
they all would probably resolve back to the question "What is the magic
value that determines whether a small task not being scheduled justifies
setting the need_resched flag for a currently running task or adjusting 
its lag?". As this would then also have to work for all non-vhost related
cases, this looks like a dangerous path to me on second thought.


 Summary 

In my (non-vhost experience) opinion the way to go would be either
replacing the cond_resched with a hard schedule or setting the
need_resched flag within vhost if a data transfer was successfully
initiated. It will be necessary to check if this causes problems with
other workloads/benchmarks.



Re: Re: [RFC PATCH] rpmsg: glink: Add bounds check on tx path

2024-01-29 Thread Michal Koutný
On Mon, Jan 29, 2024 at 04:18:36PM +0530, Deepak Kumar Singh 
 wrote:
> There is already a patch posted for similar problem -
> https://lore.kernel.org/all/20231201110631.669085-1-quic_dee...@quicinc.com/

I was not aware, thanks for the pointer.

Do you plan to update your patch to "just" bail-out/zero instead of
using slightly random values (as pointed out by Bjorn)?

Michal




Re: Re: [PATCH v2 2/3] tracing: initialize trace_seq buffers

2024-01-25 Thread Ricardo B. Marliere
Hi Steve,

On 25 Jan 15:44, Steven Rostedt wrote:
> On Thu, 25 Jan 2024 17:16:21 -0300
> "Ricardo B. Marliere"  wrote:
> 
> > Now that trace_seq_reset have been created, correct the places where the
> > buffers need to be initialized.
> 
> This patch would need to come first. You don't ever want to intentionally
> create a broken kernel.

Indeed, sorry for the lack of attention.

> 
> Also, the change log should be:
> 
>   In order to extend trace_seq into being dynamic, the struct trace_seq
>   will no longer be valid if simply set to zero. Call trace_seq_init() for
>   all trace_seq when they are first created.

Ack.

> 
> > 
> > Suggested-by: Steven Rostedt 
> > Signed-off-by: Ricardo B. Marliere 
> > ---
> >  kernel/trace/trace.c | 14 ++
> >  1 file changed, 14 insertions(+)
> > 
> 
> You also need to initialize iter.seq in ftrace_dump()

Thanks a lot for reviewing, I will send a v3.
-   Ricardo


> 
> -- Steve
> 
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index d4c55d3e21c2..9827700d0164 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -4889,6 +4889,9 @@ __tracing_open(struct inode *inode, struct file 
> > *file, bool snapshot)
> >  
> > 	mutex_unlock(&trace_types_lock);
> >  
> > +   trace_seq_init(&iter->seq);
> > +   trace_seq_init(&iter->tmp_seq);
> > +
> > return iter;
> >  
> >   fail:
> > @@ -6770,6 +6773,7 @@ static int tracing_open_pipe(struct inode *inode, 
> > struct file *filp)
> > }
> >  
> > 	trace_seq_init(&iter->seq);
> > +   trace_seq_init(&iter->tmp_seq);
> > iter->trace = tr->current_trace;
> >  
> > 	if (!alloc_cpumask_var(&iter->started, GFP_KERNEL)) {
> > @@ -6947,6 +6951,7 @@ tracing_read_pipe(struct file *filp, char __user 
> > *ubuf,
> > trace_iterator_reset(iter);
> > cpumask_clear(iter->started);
> > 	trace_seq_init(&iter->seq);
> > +   trace_seq_init(&iter->tmp_seq);
> >  
> > trace_event_read_lock();
> > trace_access_lock(iter->cpu_file);
> > @@ -8277,6 +8282,9 @@ static int tracing_buffers_open(struct inode *inode, 
> > struct file *filp)
> > if (ret < 0)
> > trace_array_put(tr);
> >  
> > +   trace_seq_init(&info->iter.seq);
> > +   trace_seq_init(&info->iter.tmp_seq);
> > +
> > return ret;
> >  }
> >  
> > @@ -10300,6 +10308,9 @@ void trace_init_global_iter(struct trace_iterator 
> > *iter)
> > iter->temp_size = STATIC_TEMP_BUF_SIZE;
> > iter->fmt = static_fmt_buf;
> > iter->fmt_size = STATIC_FMT_BUF_SIZE;
> > +
> > +   trace_seq_init(&iter->seq);
> > +   trace_seq_init(&iter->tmp_seq);
> >  }
> >  
> >  void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
> > @@ -10712,6 +10723,9 @@ void __init early_trace_init(void)
> > tracepoint_printk = 0;
> > else
> > 	static_key_enable(&tracepoint_printk_key.key);
> > +
> > +   trace_seq_init(&tracepoint_print_iter->seq);
> > +   trace_seq_init(&tracepoint_print_iter->tmp_seq);
> > }
> > tracer_alloc_buffers();
> >  
> 
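[Editor's note: the ftrace_dump() initialization Steve asks for might look
roughly like this in v3 (sketch against kernel/trace/trace.c, which uses a
static iterator there; untested):]

void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
{
	static struct trace_iterator iter;
	<...>
	trace_init_global_iter(&iter);
	/* the seq buffers must be valid before first use */
	trace_seq_init(&iter.seq);
	trace_seq_init(&iter.tmp_seq);
	<...>
}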



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-01-22 Thread Tobias Huschle
On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > > Peter, would appreciate feedback on this. When is cond_resched()
> > > insufficient to give up the CPU? Should 
> > > Documentation/kernel-hacking/hacking.rst
> > > be updated to require schedule() instead?
> > > 
> > 
> > Happy new year everybody!
> > 
> > I'd like to bring this thread back to life. To reiterate:
> > 
> > - The introduction of the EEVDF scheduler revealed a performance
> >   regression in a uperf testcase of ~50%.
> > - Tracing the scheduler showed that it takes decisions which are
> >   in line with its design.
> > - The traces showed as well, that a vhost instance might run
> > excessively long on its CPU in some circumstances. Those cause
> >   the performance regression as they cause delay times of 100+ms
> >   for a kworker which drives the actual network processing.
> > - Before EEVDF, the vhost would always be scheduled off its CPU
> >   in favor of the kworker, as the kworker was being woken up and
> >   the former scheduler was giving more priority to the woken up
> >   task. With EEVDF, the kworker, as a long running process, is
> >   able to accumulate negative lag, which causes EEVDF to not
> >   prefer it on its wake up, leaving the vhost running.
> > - If the kworker is not scheduled when being woken up, the vhost
> >   continues looping until it is migrated off the CPU.
> > - The vhost offers to be scheduled off the CPU by calling 
> > cond_resched(), but the need_resched flag is not set,
> >   therefore cond_resched() does nothing.
> > 
> > To solve this, I see the following options 
> >   (might not be a complete nor a correct list)
> > - Along with the wakeup of the kworker, need_resched needs to
> >   be set, such that cond_resched() triggers a reschedule.
> 
> Let's try this? Does not look like discussing vhost itself will
> draw attention from scheduler guys but posting a scheduling
> patch probably will? Can you post a patch?
> 

I'll give it a go.

> > - The vhost calls schedule() instead of cond_resched() to give up
> >   the CPU. This would of course be a significantly stricter
> >   approach and might limit the performance of vhost in other cases.
> > - Preventing the kworker from accumulating negative lag as it is
> >   mostly not runnable and if it runs, it only runs for a very short
> >   time frame. This might clash with the overall concept of EEVDF.
> > - On cond_resched(), verify if the consumed runtime of the caller
> >   is outweighing the negative lag of another process (e.g. the 
> >   kworker) and schedule the other process. Introduces overhead
> >   to cond_resched.
> 
> Or this last one.
> 

This one will probably be more complicated as the necessary information
is not really available at the places where I'd like to see it.
Will have to ponder on that a bit to figure out if there might be an
elegant way to approach this.

> 
> > 
> > I would be curious on feedback on those ideas and interested in
> > alternative approaches.
> 
> 



Re: Re: [PATCH V1] vdpa_sim: reset must not run

2024-01-22 Thread Stefano Garzarella

On Mon, Jan 22, 2024 at 11:47:22AM +0100, Eugenio Perez Martin wrote:

On Mon, Jan 22, 2024 at 11:22 AM Stefano Garzarella  wrote:


On Wed, Jan 17, 2024 at 11:23:23AM -0800, Steve Sistare wrote:
>vdpasim_do_reset sets running to true, which is wrong, as it allows
>vdpasim_kick_vq to post work requests before the device has been
>configured.  To fix, do not set running until VIRTIO_CONFIG_S_FEATURES_OK
>is set.
>
>Fixes: 0c89e2a3a9d0 ("vdpa_sim: Implement suspend vdpa op")
>Signed-off-by: Steve Sistare 
>Reviewed-by: Eugenio Pérez 
>---
> drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c 
b/drivers/vdpa/vdpa_sim/vdpa_sim.c
>index be2925d0d283..6304cb0b4770 100644
>--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
>+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
>@@ -160,7 +160,7 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim, u32 
flags)
>   }
>   }
>
>-  vdpasim->running = true;
>+  vdpasim->running = false;
>   spin_unlock(&vdpasim->iommu_lock);
>
>   vdpasim->features = 0;
>@@ -483,6 +483,7 @@ static void vdpasim_set_status(struct vdpa_device *vdpa, 
u8 status)
>
>   mutex_lock(&vdpasim->mutex);
>   vdpasim->status = status;
>+  vdpasim->running = (status & VIRTIO_CONFIG_S_FEATURES_OK) != 0;
>   mutex_unlock(&vdpasim->mutex);

Should we do something similar also in vdpasim_resume() ?

I mean something like this:

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index be2925d0d283..55e4633d5442 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -520,7 +520,7 @@ static int vdpasim_resume(struct vdpa_device *vdpa)
 int i;

 	mutex_lock(&vdpasim->mutex);
-   vdpasim->running = true;
+   vdpasim->running = (vdpasim->status & VIRTIO_CONFIG_S_FEATURES_OK) != 0;

 if (vdpasim->pending_kick) {
 /* Process pending descriptors */

Thanks,
Stefano



The suspend and resume operations should not be called before
DRIVER_OK, so maybe we should add that protection in
drivers/vhost/vdpa.c actually?


Yeah, I think so!

Anyway, IMHO we should at least return an error in vdpa_sim if 
vdpasim_suspend/resume are called before DRIVER_OK (in another patch of 
course).


Stefano
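[Editor's sketch of the error return Stefano suggests, building on the diff
above; it assumes the existing vdpa_to_sim() helper and status field of
vdpa_sim.c, and the error code is an arbitrary choice; untested:]

static int vdpasim_resume(struct vdpa_device *vdpa)
{
	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);

	mutex_lock(&vdpasim->mutex);
	if (!(vdpasim->status & VIRTIO_CONFIG_S_DRIVER_OK)) {
		/* resuming a device that was never driven is invalid */
		mutex_unlock(&vdpasim->mutex);
		return -EINVAL;
	}
	vdpasim->running = true;
	<...>
}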




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-01-21 Thread Michael S. Tsirkin
On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > 
> > Peter, would appreciate feedback on this. When is cond_resched()
> > insufficient to give up the CPU? Should 
> > Documentation/kernel-hacking/hacking.rst
> > be updated to require schedule() instead?
> > 
> 
> Happy new year everybody!
> 
> I'd like to bring this thread back to life. To reiterate:
> 
> - The introduction of the EEVDF scheduler revealed a performance
>   regression in a uperf testcase of ~50%.
> - Tracing the scheduler showed that it takes decisions which are
>   in line with its design.
> - The traces showed as well, that a vhost instance might run
> excessively long on its CPU in some circumstances. Those cause
>   the performance regression as they cause delay times of 100+ms
>   for a kworker which drives the actual network processing.
> - Before EEVDF, the vhost would always be scheduled off its CPU
>   in favor of the kworker, as the kworker was being woken up and
>   the former scheduler was giving more priority to the woken up
>   task. With EEVDF, the kworker, as a long running process, is
>   able to accumulate negative lag, which causes EEVDF to not
>   prefer it on its wake up, leaving the vhost running.
> - If the kworker is not scheduled when being woken up, the vhost
>   continues looping until it is migrated off the CPU.
> - The vhost offers to be scheduled off the CPU by calling 
> cond_resched(), but the need_resched flag is not set,
>   therefore cond_resched() does nothing.
> 
> To solve this, I see the following options 
>   (might not be a complete nor a correct list)
> - Along with the wakeup of the kworker, need_resched needs to
>   be set, such that cond_resched() triggers a reschedule.

Let's try this? Does not look like discussing vhost itself will
draw attention from scheduler guys but posting a scheduling
patch probably will? Can you post a patch?

> - The vhost calls schedule() instead of cond_resched() to give up
>   the CPU. This would of course be a significantly stricter
>   approach and might limit the performance of vhost in other cases.
> - Preventing the kworker from accumulating negative lag as it is
>   mostly not runnable and if it runs, it only runs for a very short
>   time frame. This might clash with the overall concept of EEVDF.
> - On cond_resched(), verify if the consumed runtime of the caller
>   is outweighing the negative lag of another process (e.g. the 
>   kworker) and schedule the other process. Introduces overhead
>   to cond_resched.

Or this last one.


> 
> I would be curious on feedback on those ideas and interested in
> alternative approaches.




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-01-09 Thread Michael S. Tsirkin
On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > 
> > Peter, would appreciate feedback on this. When is cond_resched()
> > insufficient to give up the CPU? Should 
> > Documentation/kernel-hacking/hacking.rst
> > be updated to require schedule() instead?
> > 
> 
> Happy new year everybody!
> 
> I'd like to bring this thread back to life. To reiterate:
> 
> - The introduction of the EEVDF scheduler revealed a performance
>   regression in a uperf testcase of ~50%.
> - Tracing the scheduler showed that it takes decisions which are
>   in line with its design.
> - The traces showed as well, that a vhost instance might run
> excessively long on its CPU in some circumstances. Those cause
>   the performance regression as they cause delay times of 100+ms
>   for a kworker which drives the actual network processing.
> - Before EEVDF, the vhost would always be scheduled off its CPU
>   in favor of the kworker, as the kworker was being woken up and
>   the former scheduler was giving more priority to the woken up
>   task. With EEVDF, the kworker, as a long running process, is
>   able to accumulate negative lag, which causes EEVDF to not
>   prefer it on its wake up, leaving the vhost running.
> - If the kworker is not scheduled when being woken up, the vhost
>   continues looping until it is migrated off the CPU.
> - The vhost offers to be scheduled off the CPU by calling 
> cond_resched(), but the need_resched flag is not set,
>   therefore cond_resched() does nothing.
> 
> To solve this, I see the following options 
>   (might not be a complete nor a correct list)
> - Along with the wakeup of the kworker, need_resched needs to
>   be set, such that cond_resched() triggers a reschedule.
> - The vhost calls schedule() instead of cond_resched() to give up
>   the CPU. This would of course be a significantly stricter
>   approach and might limit the performance of vhost in other cases.

And on these two, I asked:
Would appreciate feedback on this. When is cond_resched()
insufficient to give up the CPU? Should 
Documentation/kernel-hacking/hacking.rst
be updated to require schedule() instead?


> - Preventing the kworker from accumulating negative lag as it is
>   mostly not runnable and if it runs, it only runs for a very short
>   time frame. This might clash with the overall concept of EEVDF.
> - On cond_resched(), verify if the consumed runtime of the caller
>   is outweighing the negative lag of another process (e.g. the 
>   kworker) and schedule the other process. Introduces overhead
>   to cond_resched.
> 
> I would be curious on feedback on those ideas and interested in
> alternative approaches.





Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-01-08 Thread Tobias Huschle
On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> 
> Peter, would appreciate feedback on this. When is cond_resched()
> insufficient to give up the CPU? Should 
> Documentation/kernel-hacking/hacking.rst
> be updated to require schedule() instead?
> 

Happy new year everybody!

I'd like to bring this thread back to life. To reiterate:

- The introduction of the EEVDF scheduler revealed a performance
  regression in a uperf testcase of ~50%.
- Tracing the scheduler showed that it takes decisions which are
  in line with its design.
- The traces showed as well, that a vhost instance might run
excessively long on its CPU in some circumstances. Those cause
  the performance regression as they cause delay times of 100+ms
  for a kworker which drives the actual network processing.
- Before EEVDF, the vhost would always be scheduled off its CPU
  in favor of the kworker, as the kworker was being woken up and
  the former scheduler was giving more priority to the woken up
  task. With EEVDF, the kworker, as a long running process, is
  able to accumulate negative lag, which causes EEVDF to not
  prefer it on its wake up, leaving the vhost running.
- If the kworker is not scheduled when being woken up, the vhost
  continues looping until it is migrated off the CPU.
- The vhost offers to be scheduled off the CPU by calling 
cond_resched(), but the need_resched flag is not set,
  therefore cond_resched() does nothing.

To solve this, I see the following options 
  (might not be a complete nor a correct list)
- Along with the wakeup of the kworker, need_resched needs to
  be set, such that cond_resched() triggers a reschedule.
- The vhost calls schedule() instead of cond_resched() to give up
  the CPU. This would of course be a significantly stricter
  approach and might limit the performance of vhost in other cases.
- Preventing the kworker from accumulating negative lag as it is
  mostly not runnable and if it runs, it only runs for a very short
  time frame. This might clash with the overall concept of EEVDF.
- On cond_resched(), verify if the consumed runtime of the caller
  is outweighing the negative lag of another process (e.g. the 
  kworker) and schedule the other process. Introduces overhead
  to cond_resched.

I would be curious on feedback on those ideas and interested in
alternative approaches.
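[Editor's note on the lag terminology above: in EEVDF, a task's lag is the
difference between the service it would have received from an ideal,
perfectly fair scheduler and the service it actually received,
lag_i = S_i - s_i, and a task is eligible to be picked only while
lag_i >= 0. A task that recently ran longer than its fair share therefore
carries negative lag and is not eligible immediately on wakeup, which is
exactly the kworker behaviour described above.]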



Re: Re: [PATCH v3 1/3] kasan: switch kunit tests to console tracepoints

2024-01-07 Thread Paul Heidekrüger
On 12.12.2023 10:32, Marco Elver wrote:
> On Tue, 12 Dec 2023 at 10:19, Paul Heidekrüger  
> wrote:
> >
> > On 12.12.2023 00:37, Andrey Konovalov wrote:
> > > On Tue, Dec 12, 2023 at 12:35 AM Paul Heidekrüger
> > >  wrote:
> > > >
> > > > Using CONFIG_FTRACE=y instead of CONFIG_TRACEPOINTS=y produces the same 
> > > > error
> > > > for me.
> > > >
> > > > So
> > > >
> > > > CONFIG_KUNIT=y
> > > > CONFIG_KUNIT_ALL_TESTS=n
> > > > CONFIG_FTRACE=y
> > > > CONFIG_KASAN=y
> > > > CONFIG_KASAN_GENERIC=y
> > > > CONFIG_KASAN_KUNIT_TEST=y
> > > >
> > > > produces
> > > >
> > > > ➜   ./tools/testing/kunit/kunit.py run 
> > > > --kunitconfig=mm/kasan/.kunitconfig --arch=arm64
> > > > Configuring KUnit Kernel ...
> > > > Regenerating .config ...
> > > > Populating config with:
> > > > $ make ARCH=arm64 O=.kunit olddefconfig CC=clang
> > > > ERROR:root:Not all Kconfig options selected in kunitconfig were 
> > > > in the generated .config.
> > > > This is probably due to unsatisfied dependencies.
> > > > Missing: CONFIG_KASAN_KUNIT_TEST=y
> > > >
> > > > By that error message, CONFIG_FTRACE appears to be present in the 
> > > > generated
> > > > config, but CONFIG_KASAN_KUNIT_TEST still isn't. Presumably,
> > > > CONFIG_KASAN_KUNIT_TEST is missing because of an unsatisfied 
> > > > dependency, which
> > > > must be CONFIG_TRACEPOINTS, unless I'm missing something ...
> > > >
> > > > If I just generate an arm64 defconfig and select CONFIG_FTRACE=y,
> > > > CONFIG_TRACEPOINTS=y shows up in my .config. So, maybe this is 
> > > > kunit.py-related
> > > > then?
> > > >
> > > > Andrey, you said that the tests have been working for you; are you 
> > > > running them
> > > > with kunit.py?
> > >
> > > No, I just run the kernel built with a config file that I put together
> > > based on defconfig.
> >
> > Ah. I believe I've figured it out.
> >
> > When I add CONFIG_STACK_TRACER=y in addition to CONFIG_FTRACE=y, it works.
> 
> CONFIG_FTRACE should be enough - maybe also check x86 vs. arm64 to debug more.

See below.

> > CONFIG_STACK_TRACER selects CONFIG_FUNCTION_TRACER, CONFIG_FUNCTION_TRACER
> > selects CONFIG_GENERIC_TRACER, CONFIG_GENERIC_TRACER selects 
> > CONFIG_TRACING, and
> > CONFIG_TRACING selects CONFIG_TRACEPOINTS.
> >
> > CONFIG_BLK_DEV_IO_TRACE=y also works instead of CONFIG_STACK_TRACER=y, as it
> > directly selects CONFIG_TRACEPOINTS.
> >
> > CONFIG_FTRACE=y on its own does not appear to suffice for kunit.py on arm64.
> 
> When you build manually with just CONFIG_FTRACE, is CONFIG_TRACEPOINTS 
> enabled?

When I add CONFIG_FTRACE and enter-key my way through the FTRACE prompts - I 
believe because CONFIG_FTRACE is a menuconfig? - at the beginning of a build, 
CONFIG_TRACEPOINTS does get set on arm64, yes.

On x86, the defconfig already includes CONFIG_TRACEPOINTS.
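[Editor's note: assembling the findings of this thread, a .kunitconfig along
these lines should work on arm64, with CONFIG_STACK_TRACER pulling in
CONFIG_TRACEPOINTS via the select chain quoted above (untested sketch):]

CONFIG_KUNIT=y
CONFIG_KUNIT_ALL_TESTS=n
CONFIG_FTRACE=y
CONFIG_STACK_TRACER=y
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_KUNIT_TEST=y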

I also had a closer look at how kunit.py builds its configs.
I believe it does something along the following lines:

cp <kunitconfig> .kunit/.config
make ARCH=arm64 O=.kunit olddefconfig

On arm64, that isn't enough to set CONFIG_TRACEPOINTS; same behaviour when run 
outside of kunit.py.

For CONFIG_TRACEPOINTS, `make ARCH=arm64 menuconfig` shows:

Symbol: TRACEPOINTS [=n]
Type  : bool
Defined at init/Kconfig:1920
Selected by [n]:
- TRACING [=n]
- BLK_DEV_IO_TRACE [=n] && FTRACE [=y] && SYSFS [=y] && BLOCK 
[=y]

So, CONFIG_TRACING or CONFIG_BLK_DEV_IO_TRACE are the two options that prevent 
CONFIG_TRACEPOINTS from being set on arm64.

For CONFIG_TRACING we have:

Symbol: TRACING [=n]
Type  : bool
Defined at kernel/trace/Kconfig:157
Selects: RING_BUFFER [=n] && STACKTRACE [=y] && TRACEPOINTS [=n] && 
NOP_TRACER [=n] && BINARY_PRINTF [=n] && EVENT_TRACING [=n] && TRACE_CLOCK [=y] 
&& TASKS_RCU [=n]
Selected by [n]:
- DRM_I915_TRACE_GEM [=n] && HAS_IOMEM [=y] && DRM_I915 [=n] && 
EXPERT [=n] && DRM_I915_DEBUG_GEM [=n]
- DRM_I915_TRACE_GTT [=n] && HAS_IOMEM [=y] && DRM_I915 [=n] && 
EXPERT [=n] && DRM_I915_DEBUG_GEM [=n]
- PREEMPTIRQ_TRACEPOINTS [=n] && (TRACE_PREEMPT_TOGGLE [=n] || 
TRACE_IRQFLAGS [=n])
- GENERIC_TRACER [=n]
- ENABLE_DEFAULT_TRACERS [=n] && FTRACE [=y] && !GENERIC_TRACER 
[=n]
- FPROBE_EVENTS [=n] && FTRACE [=y] && FPROBE [=n] && 
HAVE_REGS_AND_STACK_ACCESS_API [=y]
- KPROBE_EVENTS [=n] && FTRACE [=y] && KPROBES [=n] && 
HAVE_REGS_AND_STACK_ACCESS_API [=y]
- UPROBE_EVENTS [=n] && FTRACE [=y] && ARCH_SUPPORTS_UPROBES 
[=y] && MMU [=y] && PERF_EVENTS [=n]
- SYNTH_EVENTS [=n] && FTRACE [=y]
- USER_EVENTS [=n] && FTRACE [=y]
- HIST_TRIGGERS [=n] && FTRACE [=y] && 
ARCH_HAVE_NMI_SAFE_CMPXCHG [=y]

> > I believe the reason my .kunitconfig as well as the 

Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 09:55:23AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 13, 2023 at 01:45:35PM +0100, Tobias Huschle wrote:
> > On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > > > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin 
> > > > > >  wrote:
> > 
> > [...]
> > > 
> > > Apparently schedule is already called?
> > > 
> > 
> > What about this: 
> > 
> > static int vhost_task_fn(void *data)
> > {
> > <...>
> > did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm 
> > not mistaken
> > if (!did_work)
> > schedule();
> > <...>
> > }
> > 
> > static bool vhost_worker(void *data)
> > {
> > struct vhost_worker *worker = data;
> > struct vhost_work *work, *work_next;
> > struct llist_node *node;
> > 
> > node = llist_del_all(&worker->work_list);
> > if (node) {
> > <...>
> > llist_for_each_entry_safe(work, work_next, node, node) {
> > <...>
> > }
> > }
> > 
> > return !!node;
> > }
> > 
> > The llist_for_each_entry_safe does not actually change the node value, 
> > does it?
> > 
> > If it does not change it, !!node would return 1.
> > Thereby skipping the schedule.
> > 
> > This was changed recently with:
> > f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
> > 
> > It returned a hardcoded 0 before. The commit message explicitly mentions 
> > this
> > change to make vhost_worker return 1 if it did something.
> > 
> > Seems indeed like a nasty little side effect caused by EEVDF not scheduling
> > the woken up kworker right away.
> 
> 
> So we are actually making an effort to be nice.
> Documentation/kernel-hacking/hacking.rst says:
> 
> If you're doing longer computations: first think userspace. If you
> **really** want to do it in kernel you should regularly check if you need
> to give up the CPU (remember there is cooperative multitasking per CPU).
> Idiom::
> 
> cond_resched(); /* Will sleep */
> 
> 
> and this is what vhost.c does.
> 
> At this point I'm not sure why it's appropriate to call schedule() as opposed 
> to
> cond_resched(). Ideas?
> 

Peter, would appreciate feedback on this. When is cond_resched()
insufficient to give up the CPU? Should Documentation/kernel-hacking/hacking.rst
be updated to require schedule() instead?


> -- 
> MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 01:45:35PM +0100, Tobias Huschle wrote:
> On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  
> > > > > wrote:
> 
> [...]
> > 
> > Apparently schedule is already called?
> > 
> 
> What about this: 
> 
> static int vhost_task_fn(void *data)
> {
>   <...>
>   did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm 
> not mistaken
>   if (!did_work)
>   schedule();
>   <...>
> }
> 
> static bool vhost_worker(void *data)
> {
>   struct vhost_worker *worker = data;
>   struct vhost_work *work, *work_next;
>   struct llist_node *node;
> 
>   node = llist_del_all(&worker->work_list);
>   if (node) {
>   <...>
>   llist_for_each_entry_safe(work, work_next, node, node) {
>   <...>
>   }
>   }
> 
>   return !!node;
> }
> 
> The llist_for_each_entry_safe does not actually change the node value, 
> does it?
> 
> If it does not change it, !!node would return 1.
> Thereby skipping the schedule.
> 
> This was changed recently with:
> f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
> 
> It returned a hardcoded 0 before. The commit message explicitly mentions this
> change to make vhost_worker return 1 if it did something.
> 
> Seems indeed like a nasty little side effect caused by EEVDF not scheduling
> the woken up kworker right away.


So we are actually making an effort to be nice.
Documentation/kernel-hacking/hacking.rst says:

If you're doing longer computations: first think userspace. If you
**really** want to do it in kernel you should regularly check if you need
to give up the CPU (remember there is cooperative multitasking per CPU).
Idiom::

cond_resched(); /* Will sleep */


and this is what vhost.c does.

At this point I'm not sure why it's appropriate to call schedule() as opposed to
cond_resched(). Ideas?


-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 01:45:35PM +0100, Tobias Huschle wrote:
> On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  
> > > > > wrote:
> 
> [...]
> > 
> > Apparently schedule is already called?
> > 
> 
> What about this: 
> 
> static int vhost_task_fn(void *data)
> {
>   <...>
>   did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm 
> not mistaken
>   if (!did_work)
>   schedule();
>   <...>
> }
> 
> static bool vhost_worker(void *data)
> {
>   struct vhost_worker *worker = data;
>   struct vhost_work *work, *work_next;
>   struct llist_node *node;
> 
>   node = llist_del_all(&worker->work_list);
>   if (node) {
>   <...>
>   llist_for_each_entry_safe(work, work_next, node, node) {
>   <...>
>   }
>   }
> 
>   return !!node;
> }
> 
> The llist_for_each_entry_safe does not actually change the node value, 
> does it?
> 
> If it does not change it, !!node would return 1.
> Thereby skipping the schedule.
> 
> This was changed recently with:
> f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
> 
> It returned a hardcoded 0 before. The commit message explicitly mentions this
> change to make vhost_worker return 1 if it did something.
> 
> Seems indeed like a nasty little side effect caused by EEVDF not scheduling
> the woken up kworker right away.

Indeed, but previously vhost_worker was looping itself.
And it did:
-   node = llist_del_all(&worker->work_list);
-   if (!node)
-   schedule();

so I don't think this was changed at all.






-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Tobias Huschle
On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  
> > > > wrote:

[...]
> 
> Apparently schedule is already called?
> 

What about this: 

static int vhost_task_fn(void *data)
{
<...>
did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm 
not mistaken
if (!did_work)
schedule();
<...>
}

static bool vhost_worker(void *data)
{
struct vhost_worker *worker = data;
struct vhost_work *work, *work_next;
struct llist_node *node;

node = llist_del_all(&worker->work_list);
if (node) {
<...>
llist_for_each_entry_safe(work, work_next, node, node) {
<...>
}
}

return !!node;
}

The llist_for_each_entry_safe does not actually change the node value, does
it?

If it does not change it, !!node would return 1.
Thereby skipping the schedule.

This was changed recently with:
f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

It returned a hardcoded 0 before. The commit message explicitly mentions this
change to make vhost_worker return 1 if it did something.

Seems indeed like a nasty little side effect caused by EEVDF not scheduling
the woken up kworker right away.
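[Editor's note: condensing the two snippets above into the effective control
flow (paraphrased pseudo-code, not literal kernel source):]

for (;;) {
	bool did_work = vhost_worker(data); /* true iff work_list was non-empty */

	if (!did_work)
		schedule();		/* only reached when no work was found */
}

/* Since f9010dbdce91, vhost_worker() returns true whenever it processed
 * work, so a continuously refilled queue keeps this loop spinning; the
 * only scheduling point left is the cond_resched() inside the work loop,
 * which is a no-op while need_resched is clear. */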



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  
> > > wrote:
> 
> We played around with the suggestions and some other ideas.
> I would like to share some initial results.
> 
> We tried the following:
> 
> 1. Call unconditional schedule in the vhost_worker function
> 2. Change the HZ value from 100 to 1000
> 3. Reverting 05bfb338fa8d vhost: Fix livepatch timeouts in vhost_worker()
> 4. Adding a cond_resched to translate_desc
> 5. Reducing VHOST_NET_WEIGHT to 25% of its original value
> 
> Please find the diffs below.
> 
> Summary:
> 
> Option 1 is very very hacky but resolved the regression.
> Option 2 reduces the regression by ~20%.
> Options 3-5 do not help unfortunately.
> 
> Potential explanation:
> 
> While the vhost is executing, the need_resched flag is not set (observable
> in the traces). Therefore cond_resched and the like will do nothing. vhost
> will continue executing until the need_resched flag is set by an external
> party, e.g. by a request to migrate the vhost.
> 
> Calling schedule unconditionally forces the scheduler to re-evaluate all 
> tasks and their vruntime/deadline/vlag values. The scheduler comes to the
> correct conclusion that the kworker should be executed, and from there it
> is smooth sailing. I will have to verify that sequence by collecting more
> traces, but this seems rather plausible.
> This hack might of course introduce all kinds of side effects but might
> provide an indicator that this is the actual problem.
> The big question would be how to solve this conceptually, and, first
> things first, whether you think this is a viable hypothesis.
> 
> Increasing the HZ value helps most likely because the other CPUs take 
> scheduling/load balancing decisions more often as well and therefore
> trigger the migration faster.
> 
> Bringing down VHOST_NET_WEIGHT even more might also help to shorten the
> vhost loop. But I have no intuition how low we can/should go here.
> 
> 
> We also changed vq_err to print error messages, but did not encounter any.
> 
> Diffs:
> --
> 
> 1. Call unconditional schedule in the vhost_worker function
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index e0c181ad17e3..16d73fd28831 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -414,6 +414,7 @@ static bool vhost_worker(void *data)
> }
> }
>  
> +   schedule();
> return !!node;
>  }


So, this helps.
But this is very surprising!


static int vhost_task_fn(void *data)
{
struct vhost_task *vtsk = data;
bool dead = false;

for (;;) {
bool did_work;

if (!dead && signal_pending(current)) {
struct ksignal ksig;
/*
 * Calling get_signal will block in SIGSTOP,
 * or clear fatal_signal_pending, but remember
 * what was set.
 *
 * This thread won't actually exit until all
 * of the file descriptors are closed, and
 * the release function is called.
 */
dead = get_signal();
if (dead)
clear_thread_flag(TIF_SIGPENDING);
}

/* mb paired w/ vhost_task_stop */
set_current_state(TASK_INTERRUPTIBLE);

		if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
__set_current_state(TASK_RUNNING);
break;
}

did_work = vtsk->fn(vtsk->data);
if (!did_work)
schedule();
}

	complete(&vtsk->exited);
do_exit(0);

}

Apparently schedule is already called?


-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Tobias Huschle
On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  wrote:

We played around with the suggestions and some other ideas.
I would like to share some initial results.

We tried the following:

1. Call unconditional schedule in the vhost_worker function
2. Change the HZ value from 100 to 1000
3. Reverting 05bfb338fa8d vhost: Fix livepatch timeouts in vhost_worker()
4. Adding a cond_resched to translate_desc
5. Reducing VHOST_NET_WEIGHT to 25% of its original value

Please find the diffs below.

Summary:

Option 1 is very very hacky but resolved the regression.
Option 2 reduces the regression by ~20%.
Options 3-5 do not help unfortunately.

Potential explanation:

While the vhost is executing, the need_resched flag is not set (observable
in the traces). Therefore cond_resched and the like will do nothing. vhost
will continue executing until the need_resched flag is set by an external
party, e.g. by a request to migrate the vhost.

Calling schedule unconditionally forces the scheduler to re-evaluate all 
tasks and their vruntime/deadline/vlag values. The scheduler comes to the
correct conclusion that the kworker should be executed, and from there it
is smooth sailing. I will have to verify that sequence by collecting more
traces, but this seems rather plausible.
This hack might of course introduce all kinds of side effects but might
provide an indicator that this is the actual problem.
The big question would be how to solve this conceptually, and, first
things first, whether you think this is a viable hypothesis.

Increasing the HZ value helps most likely because the other CPUs take 
scheduling/load balancing decisions more often as well and therefore
trigger the migration faster.

Bringing down VHOST_NET_WEIGHT even more might also help to shorten the
vhost loop. But I have no intuition how low we can/should go here.


We also changed vq_err to print error messages, but did not encounter any.

Diffs:
--

1. Call unconditional schedule in the vhost_worker function

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..16d73fd28831 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -414,6 +414,7 @@ static bool vhost_worker(void *data)
}
}
 
+   schedule();
return !!node;
 }

--

2. Change the HZ value from 100 to 1000

--> config change 
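[Editor's note: i.e. a kernel config change along these lines; symbol names
are from kernel/Kconfig.hz:]

# CONFIG_HZ_100 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000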

--

3. Reverting 05bfb338fa8d vhost: Fix livepatch timeouts in vhost_worker()

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..d519d598ebb9 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -410,7 +410,8 @@ static bool vhost_worker(void *data)
kcov_remote_start_common(worker->kcov_handle);
work->fn(work);
kcov_remote_stop();
-   cond_resched();
+   if (need_resched())
+   schedule();
}
}

--

4. Adding a cond_resched to translate_desc

I just picked some location.

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..f885dd29cbd1 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2367,6 +2367,7 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 
addr, u32 len,
s += size;
addr += size;
++ret;
+   cond_resched();
}
 
if (ret == -EAGAIN)

--

5. Reducing VHOST_NET_WEIGHT to 25% of its original value

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..2c6966ea6229 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -42,7 +42,7 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
 
 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
-#define VHOST_NET_WEIGHT 0x80000
+#define VHOST_NET_WEIGHT 0x20000
 
 /* Max number of packets transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others with small



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-12 Thread Michael S. Tsirkin
On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  wrote:
> >
> > On Mon, Dec 11, 2023 at 03:26:46PM +0800, Jason Wang wrote:
> > > > Try reducing the VHOST_NET_WEIGHT limit and see if that improves things 
> > > > any?
> > >
> > > Or a dirty hack like cond_resched() in translate_desc().
> >
> > what do you mean, exactly?
> 
> Ideally it should not matter, but Tobias said there's an unexpectedly
> long time spent on translate_desc() which may indicate that the
> might_sleep() or other doesn't work for some reason.
> 
> Thanks

You mean for debugging, add it with a patch to see what this does?

Sure - can you post the debugging patch pls?

> >
> > --
> > MST
> >




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-11 Thread Jason Wang
On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  wrote:
>
> On Mon, Dec 11, 2023 at 03:26:46PM +0800, Jason Wang wrote:
> > > Try reducing the VHOST_NET_WEIGHT limit and see if that improves things 
> > > any?
> >
> > Or a dirty hack like cond_resched() in translate_desc().
>
> what do you mean, exactly?

Ideally it should not matter, but Tobias said there's an unexpectedly
long time spent on translate_desc() which may indicate that the
might_sleep() or other doesn't work for some reason.

Thanks

>
> --
> MST
>




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-11 Thread Michael S. Tsirkin
On Mon, Dec 11, 2023 at 03:26:46PM +0800, Jason Wang wrote:
> > Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?
> 
> Or a dirty hack like cond_resched() in translate_desc().

what do you mean, exactly?

-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-10 Thread Jason Wang
On Sat, Dec 9, 2023 at 6:42 PM Michael S. Tsirkin  wrote:
>
> On Fri, Dec 08, 2023 at 12:41:38PM +0100, Tobias Huschle wrote:
> > On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> > > On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > > > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > > >
> > > > > > I dug a little deeper on what the vhost is doing. I'm not an expert 
> > > > > > on
> > > > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > > > someone can verify/correct. Please bear with me probably messing up
> > > > > > the terminology.
> > > > > >
> > > > > > - vhost is looping through available queues.
> > > > > > - vhost wants to wake up a kworker to process a found queue.
> > > > > > - kworker does something with that queue and terminates quickly.
> > > > > >
> > > > > > What I found by throwing in some very noisy trace statements was 
> > > > > > that,
> > > > > > if the kworker is not woken up, the vhost just keeps looping across
> > > > > > all available queues (and seems to repeat itself). So it essentially
> > > > > > relies on the scheduler to schedule the kworker fast enough. 
> > > > > > Otherwise
> > > > > > it will just keep on looping until it is migrated off the CPU.
> > > > >
> > > > >
> > > > > Normally it takes the buffers off the queue and is done with it.
> > > > > I am guessing that at the same time guest is running on some other
> > > > > CPU and keeps adding available buffers?
> > > > >
> > > >
> > > > It seems to do just that, there are multiple other vhost instances
> > > > involved which might keep filling up those queues.
> > > >
> > >
> > > No vhost is ever only draining queues. Guest is filling them.
> > >
> > > > Unfortunately, this makes the problematic vhost instance stay on
> > > > the CPU and prevents said kworker from getting scheduled. The kworker is
> > > > explicitly woken up by vhost, so it wants it to do something.

It looks to me that vhost doesn't use a workqueue but runs the worker by itself.

> > > >
> > > > At this point it seems that there is an assumption about the scheduler
> > > > in place which is no longer fulfilled by EEVDF. From the discussion so
> > > > far, it seems like EEVDF does what is intended to do.
> > > >
> > > > Shouldn't there be a more explicit mechanism in use that allows the
> > > > kworker to be scheduled in favor of the vhost?

Vhost does a bunch of copy_from_user() calls which should trigger
__might_fault() and so a __might_sleep() in most cases.
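[Editor's note: might_fault()/__might_sleep() are debug annotations; an
approximate sketch of the wiring in include/linux/kernel.h, from memory:]

#if defined(CONFIG_MMU) && \
	(defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP))
void __might_fault(const char *file, int line);
# define might_fault() __might_fault(__FILE__, __LINE__)
#else
# define might_fault() do { } while (0)
#endif

[On non-debug configurations might_fault() compiles away entirely, and even
when enabled __might_sleep() only warns about bad context rather than
rescheduling, which may be why the copy_from_user() path does not help the
starved kworker here.]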

> > > >
> > > > It is also concerning that the vhost seemingly cannot be preempted by the
> > > > scheduler while executing that loop.
> > >
> > >
> > > Which loop is that, exactly?
> >
> > The loop continuously passes translate_desc in drivers/vhost/vhost.c
> > That's where I put the trace statements.
> >
> > The overall sequence seems to be (top to bottom):
> >
> > handle_rx
> > get_rx_bufs
> > vhost_get_vq_desc
> > vhost_get_avail_head
> > vhost_get_avail
> > __vhost_get_user_slow
> > translate_desc   << trace statement in here
> > vhost_iotlb_itree_first
>
> I wonder why you keep missing the cache and re-translating.
> Is pr_debug enabled for you? If not could you check if it
> outputs anything?
> Or you can tweak:
>
> #define vq_err(vq, fmt, ...) do {  \
> pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \
> if ((vq)->error_ctx)   \
> eventfd_signal((vq)->error_ctx, 1);\
> } while (0)
>
> to do pr_err if you prefer.
>
> > These functions show up as having increased overhead in perf.
> >
> > There are multiple loops going on in there.
> > Again the disclaimer though, I'm not familiar with that code at all.
>
>
> So there's a limit there: vhost_exceeds_weight should requeue work:
>
> } while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));
>
> then we invoke the scheduler each time before re-executing it:
>
>
> {
> struct vhost_worker *worker = data;
> struct vhost_work *work, *work_next;
> struct llist_node *node;
>
> node = llist_del_all(&worker->work_list);
> if (node) {
> __set_current_state(TASK_RUNNING);
>
> node = llist_reverse_order(node);
> /* make sure flag is seen after deletion */
> smp_wmb();
> llist_for_each_entry_safe(work, work_next, node, node) {
> clear_bit(VHOST_WORK_QUEUED, &work->flags);
> kcov_remote_start_common(worker->kcov_handle);
> work->fn(work);
> kcov_remote_stop();
> cond_resched();
> }
> }
>
> return !!node;
> }
>
> These are the byte and packet limits:
> 
> /* Max number of bytes transferred before requeueing the job.
>  * Using this limit prevents one virtqueue from starving others. */
> #define VHOST_NET_WEIGHT 0x80000
> 
> /* Max number of packets transferred before requeueing the job.
>  * Using this limit prevents one virtqueue from starving others with small
>  * pkts.
>  */
> #define VHOST_NET_PKT_WEIGHT 256
> 
> Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?

Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-09 Thread Michael S. Tsirkin
On Fri, Dec 08, 2023 at 12:41:38PM +0100, Tobias Huschle wrote:
> On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> > On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > > 
> > > > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > > someone can verify/correct. Please bear with me probably messing up 
> > > > > the terminology.
> > > > > 
> > > > > - vhost is looping through available queues.
> > > > > - vhost wants to wake up a kworker to process a found queue.
> > > > > - kworker does something with that queue and terminates quickly.
> > > > > 
> > > > > What I found by throwing in some very noisy trace statements was that,
> > > > > if the kworker is not woken up, the vhost just keeps looping across
> > > > > all available queues (and seems to repeat itself). So it essentially
> > > > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > > > it will just keep on looping until it is migrated off the CPU.
> > > > 
> > > > 
> > > > Normally it takes the buffers off the queue and is done with it.
> > > > I am guessing that at the same time guest is running on some other
> > > > CPU and keeps adding available buffers?
> > > > 
> > > 
> > > It seems to do just that, there are multiple other vhost instances
> > > involved which might keep filling up those queues. 
> > > 
> > 
> > No vhost is ever only draining queues. Guest is filling them.
> > 
> > > Unfortunately, this makes the problematic vhost instance stay on
> > > the CPU and prevents said kworker from getting scheduled. The kworker is
> > > explicitly woken up by vhost, so it wants it to do something.
> > > 
> > > At this point it seems that there is an assumption about the scheduler
> > > in place which is no longer fulfilled by EEVDF. From the discussion so
> > > far, it seems like EEVDF does what is intended to do.
> > > 
> > > Shouldn't there be a more explicit mechanism in use that allows the
> > > kworker to be scheduled in favor of the vhost?
> > > 
> > > It is also concerning that the vhost seemingly cannot be preempted by the
> > > scheduler while executing that loop.
> > 
> > 
> > Which loop is that, exactly?
> 
> The loop continuously passes translate_desc in drivers/vhost/vhost.c
> That's where I put the trace statements.
> 
> The overall sequence seems to be (top to bottom):
> 
> handle_rx
> get_rx_bufs
> vhost_get_vq_desc
> vhost_get_avail_head
> vhost_get_avail
> __vhost_get_user_slow
> translate_desc   << trace statement in here
> vhost_iotlb_itree_first

I wonder why you keep missing the cache and re-translating.
Is pr_debug enabled for you? If not could you check if it
outputs anything?
Or you can tweak:

#define vq_err(vq, fmt, ...) do {  \
pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \
if ((vq)->error_ctx)   \
eventfd_signal((vq)->error_ctx, 1);\
} while (0)

to do pr_err if you prefer.
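[Editor's note: the pr_err variant suggested here would simply be (sketch):]

#define vq_err(vq, fmt, ...) do {  \
		pr_err(pr_fmt(fmt), ##__VA_ARGS__);   \
		if ((vq)->error_ctx)   \
			eventfd_signal((vq)->error_ctx, 1);\
	} while (0)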

> These functions show up as having increased overhead in perf.
> 
> There are multiple loops going on in there.
> Again the disclaimer though, I'm not familiar with that code at all.


So there's a limit there: vhost_exceeds_weight should requeue work:

} while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));

then we invoke the scheduler each time before re-executing it:


{   
struct vhost_worker *worker = data;
struct vhost_work *work, *work_next;
struct llist_node *node;

node = llist_del_all(&worker->work_list);
if (node) {
__set_current_state(TASK_RUNNING);

node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
smp_wmb();
llist_for_each_entry_safe(work, work_next, node, node) {
clear_bit(VHOST_WORK_QUEUED, &work->flags);
kcov_remote_start_common(worker->kcov_handle);
work->fn(work);
kcov_remote_stop();
cond_resched();
}
}

return !!node;
}   

These are the byte and packet limits:

/* Max number of bytes transferred before requeueing the job.
 * Using this limit prevents one virtqueue from starving others. */
#define VHOST_NET_WEIGHT 0x80000

/* Max number of packets transferred before requeueing the job.
 * Using this limit prevents one virtqueue from starving others with small
 * pkts.
 */
#define VHOST_NET_PKT_WEIGHT 256


Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?

-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-08 Thread Tobias Huschle
On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > 
> > > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > someone can verify/correct. Please bear with me probably messing up 
> > > > the terminology.
> > > > 
> > > > - vhost is looping through available queues.
> > > > - vhost wants to wake up a kworker to process a found queue.
> > > > - kworker does something with that queue and terminates quickly.
> > > > 
> > > > What I found by throwing in some very noisy trace statements was that,
> > > > if the kworker is not woken up, the vhost just keeps looping across
> > > > all available queues (and seems to repeat itself). So it essentially
> > > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > > it will just keep on looping until it is migrated off the CPU.
> > > 
> > > 
> > > Normally it takes the buffers off the queue and is done with it.
> > > I am guessing that at the same time guest is running on some other
> > > CPU and keeps adding available buffers?
> > > 
> > 
> > It seems to do just that, there are multiple other vhost instances
> > involved which might keep filling up those queues. 
> > 
> 
> No vhost is ever only draining queues. Guest is filling them.
> 
> > Unfortunately, this makes the problematic vhost instance stay on
> > the CPU and prevents said kworker from getting scheduled. The kworker is
> > explicitly woken up by vhost, so it wants it to do something.
> > 
> > At this point it seems that there is an assumption about the scheduler
> > in place which is no longer fulfilled by EEVDF. From the discussion so
> > far, it seems like EEVDF does what is intended to do.
> > 
> > Shouldn't there be a more explicit mechanism in use that allows the
> > kworker to be scheduled in favor of the vhost?
> > 
> > It is also concerning that the vhost seemingly cannot be preempted by the
> > scheduler while executing that loop.
> 
> 
> Which loop is that, exactly?

The loop continuously passes translate_desc in drivers/vhost/vhost.c
That's where I put the trace statements.

The overall sequence seems to be (top to bottom):

handle_rx
get_rx_bufs
vhost_get_vq_desc
vhost_get_avail_head
vhost_get_avail
__vhost_get_user_slow
translate_desc   << trace statement in here
vhost_iotlb_itree_first

These functions show up as having increased overhead in perf.

There are multiple loops going on in there.
Again the disclaimer though, I'm not familiar with that code at all.



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-08 Thread Michael S. Tsirkin
On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > 
> > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > virtio whatsoever, so these are just educated guesses that maybe
> > > someone can verify/correct. Please bear with me probably messing up 
> > > the terminology.
> > > 
> > > - vhost is looping through available queues.
> > > - vhost wants to wake up a kworker to process a found queue.
> > > - kworker does something with that queue and terminates quickly.
> > > 
> > > What I found by throwing in some very noisy trace statements was that,
> > > if the kworker is not woken up, the vhost just keeps looping across
> > > all available queues (and seems to repeat itself). So it essentially
> > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > it will just keep on looping until it is migrated off the CPU.
> > 
> > 
> > Normally it takes the buffers off the queue and is done with it.
> > I am guessing that at the same time guest is running on some other
> > CPU and keeps adding available buffers?
> > 
> 
> It seems to do just that, there are multiple other vhost instances
> involved which might keep filling up thoses queues. 
> 

No vhost is ever only draining queues. Guest is filling them.

> Unfortunately, this makes the problematic vhost instance stay on
> the CPU and prevents said kworker from getting scheduled. The kworker is
> explicitly woken up by vhost, so it wants it to do something.
> 
> At this point it seems that there is an assumption about the scheduler
> in place which is no longer fulfilled by EEVDF. From the discussion so
> far, it seems like EEVDF does what is intended to do.
> 
> Shouldn't there be a more explicit mechanism in use that allows the
> kworker to be scheduled in favor of the vhost?
> 
> It is also concerning that the vhost seemingly cannot be preempted by the
> scheduler while executing that loop.


Which loop is that, exactly?

-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-08 Thread Tobias Huschle
On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > 
> > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > virtio whatsoever, so these are just educated guesses that maybe
> > someone can verify/correct. Please bear with me probably messing up 
> > the terminology.
> > 
> > - vhost is looping through available queues.
> > - vhost wants to wake up a kworker to process a found queue.
> > - kworker does something with that queue and terminates quickly.
> > 
> > What I found by throwing in some very noisy trace statements was that,
> > if the kworker is not woken up, the vhost just keeps looping across
> > all available queues (and seems to repeat itself). So it essentially
> > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > it will just keep on looping until it is migrated off the CPU.
> 
> 
> Normally it takes the buffers off the queue and is done with it.
> I am guessing that at the same time guest is running on some other
> CPU and keeps adding available buffers?
> 

It seems to do just that, there are multiple other vhost instances
involved which might keep filling up those queues.

Unfortunately, this makes the problematic vhost instance stay on
the CPU and prevents said kworker from getting scheduled. The kworker is
explicitly woken up by vhost, so it wants it to do something.

At this point it seems that there is an assumption about the scheduler
in place which is no longer fulfilled by EEVDF. From the discussion so
far, it seems like EEVDF does what is intended to do.

Shouldn't there be a more explicit mechanism in use that allows the
kworker to be scheduled in favor of the vhost?

It is also concerning that the vhost seemingly cannot be preempted by the
scheduler while executing that loop.



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-06 Thread Michael S. Tsirkin
On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> 3. vhost looping endlessly, waiting for kworker to be scheduled
> 
> I dug a little deeper on what the vhost is doing. I'm not an expert on
> virtio whatsoever, so these are just educated guesses that maybe
> someone can verify/correct. Please bear with me probably messing up 
> the terminology.
> 
> - vhost is looping through available queues.
> - vhost wants to wake up a kworker to process a found queue.
> - kworker does something with that queue and terminates quickly.
> 
> What I found by throwing in some very noisy trace statements was that,
> if the kworker is not woken up, the vhost just keeps looping across
> all available queues (and seems to repeat itself). So it essentially
> relies on the scheduler to schedule the kworker fast enough. Otherwise
> it will just keep on looping until it is migrated off the CPU.


Normally it takes the buffers off the queue and is done with it.
I am guessing that at the same time guest is running on some other
CPU and keeps adding available buffers?


-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-06 Thread Tobias Huschle
On Tue, Nov 28, 2023 at 04:55:11PM +0800, Abel Wu wrote:
> On 11/27/23 9:56 PM, Tobias Huschle Wrote:
> > On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:
[...]
> 
> What are the weights of the two entities?
> 

Both entities have the same weights (I saw 1048576 for both of them).
The story looks different when we look at the cgroup hierarchy though:

sew := weight of the sched entity (se->load.weight)

 CPU 6/KVM-2360[011] d  1158.884473: sched_place: comm=vhost-2961 
pid=2984 sev=3595548386 sed=3598548386 sel=0 sew=1048576 avg=3595548386 
min=3595548386 cpu=11 nr=0 vru=3595548386 lag=0
 CPU 6/KVM-2360[011] d  1158.884473: sched_place: comm= pid=0 
sev=19998138425 sed=20007532920 sel=0 sew=335754 avg=19998138425 
min=19998138425 cpu=11 nr=0 vru=19998138425 lag=0
 CPU 6/KVM-2360[011] d  1158.884474: sched_place: comm= pid=0 
sev=37794158943 sed=37807515464 sel=0 sew=236146 avg=37794158943 
min=37794158943 cpu=11 nr=0 vru=37794158943 lag=0
 CPU 6/KVM-2360[011] d  1158.884474: sched_place: comm= pid=0 
sev=50387168150 sed=50394482435 sel=0 sew=430665 avg=50387168150 
min=50387168150 cpu=11 nr=0 vru=50387168150 lag=0
 CPU 6/KVM-2360[011] d  1158.884474: sched_place: comm= pid=0 
sev=76600751247 sed=77624751246 sel=0 sew=3876 avg=76600751247 min=76600751247 
cpu=11 nr=0 vru=76600751247 lag=0
<...>
vhost-2961-2984[011] d  1158.884487: sched_place: comm=kworker/11:2 
pid=202 sev=76603905961 sed=76606905961 sel=0 sew=1048576 avg=76603905961 
min=76603905961 cpu=11 nr=1 vru=76603905961 lag=0

Here we can see the following weights:
kworker -> 1048576
vhost   -> 1048576
cgroup root ->3876

kworker and vhost weights remain the same. The weights of the nodes in the 
cgroup vary.


I also spent some more thought on this and have some more observations:

1. kworker lag after short runtime

vhost-2961-2984[011] d  1158.884486: sched_waking: 
comm=kworker/11:2 pid=202 prio=120 target_cpu=011
vhost-2961-2984[011] d  1158.884487: sched_place: comm=kworker/11:2 
pid=202 sev=76603905961 sed=76606905961 sel=0 sew=1048576 avg=76603905961 
min=76603905961 cpu=11 nr=1 vru=76603905961 lag=0
<...>   
^
vhost-2961-2984[011] d  1158.884490: sched_switch: 
prev_comm=vhost-2961 prev_pid=2984 prev_prio=120 prev_state=R+ ==> 
next_comm=kworker/11:2 next_pid=202 next_prio=120
   kworker/11:2-202[011] d  1158.884491: sched_waking: comm=CPU 0/KVM 
pid=2988 prio=120 target_cpu=009
   kworker/11:2-202[011] d  1158.884492: sched_stat_runtime: 
comm=kworker/11:2 pid=202 runtime=5150 [ns] vruntime=7660391 [ns] 
deadline=76606905961 [ns] lag=76606905961

   
   kworker/11:2-202[011] d  1158.884492: sched_update: 
comm=kworker/11:2 pid=202 sev=7660391 sed=76606905961 sel=-1128 sew=1048576 
avg=76603909983 min=76603905961 cpu=11 nr=2 lag=-1128 lim=1000

 ^
   kworker/11:2-202[011] d  1158.884494: sched_stat_wait: 
comm=vhost-2961 pid=2984 delay=5150 [ns]
   kworker/11:2-202[011] d  1158.884494: sched_switch: 
prev_comm=kworker/11:2 prev_pid=202 prev_prio=120 prev_state=I ==> 
next_comm=vhost-2961 next_pid=2984 next_prio=120

In the sequence above, the kworker gets woken up by the vhost and placed on the 
timeline with 0 lag.
The kworker then executes for 5150ns and returns control to the vhost.
Unfortunately, this short runtime earns the kworker a negative lag of -1128.
This in turn, causes the kworker to not be selected by 
check_preempt_wakeup_fair.

My naive understanding of lag is that only entities which consume more time
than they should get negative lag. Why is the kworker being punished for
running only a tiny portion of time?

In the majority of cases, the kworker finishes after a 4-digit number of ns.
There are occasional outliers with 5-digit numbers. I would therefore not
expect negative lag for the kworker.

It is fair to say that the kworker was executing while the vhost was not.
kworker gets put on the queue with no lag, so it essentially has its vruntime
set to avg_vruntime.
After giving up its timeslice, the kworker now has a vruntime which is larger
than the avg_vruntime. Hence the negative lag might make sense here from an
algorithmic standpoint.
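
For context, the lag in question is computed when the entity is taken off the
runqueue; roughly, from kernel/sched/fair.c of that era (quoted from memory,
so please check against your tree):

	static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		s64 lag, limit;

		SCHED_WARN_ON(!se->on_rq);
		lag = avg_vruntime(cfs_rq) - se->vruntime;

		limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
		se->vlag = clamp(lag, -limit, limit);
	}

So any entity whose vruntime has moved past avg_vruntime by the time it goes
to sleep picks up negative lag, no matter how short its run was; the 5150ns
above were enough to push the kworker past the average.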


2a/b. vhost getting increased deadlines over time, no call of pick_eevdf

vhost-2961-2984[011] d.h..  1158.892878: sched_stat_runtime: 
comm=vhost-2961 pid=2984 runtime=8385872 [ns] vruntime=3603948448 [ns] 
deadline=3606948448 [ns] lag=3598548386
vhost-2961-2984  

Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-28 Thread Tobias Huschle
On Tue, Nov 28, 2023 at 04:55:11PM +0800, Abel Wu wrote:
> On 11/27/23 9:56 PM, Tobias Huschle Wrote:
> > On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:

[...]

> > - At depth 4, the cgroup shows the observed vruntime value which is smaller
> >by a factor of 20, but depth 0 seems to be running with values of the
> >correct magnitude.
> 
> A child running means its parent is also the cfs->curr, but
> not vice versa if there is more than one child.
> 
> > - cgroup at depth 0 has zero lag, with higher depth, there are large lag
> >values (as observed 606.338267 onwards)
> 
> These values of se->vlag mean 'run this entity to parity' to avoid
> excess context switches, which is what RUN_TO_PARITY does, or nothing
> when !RUN_TO_PARITY. In short, se->vlag is not vlag when se->on_rq.
> 

Thanks for clarifying that. This makes things clearer to me.

> > 
> > Now the following occurs, triggered by the vhost:
> > - The kworker gets placed again with:
> >  vruntime  deadline
> > cgroup56117619190   57650477291 -> depth 0, last known value
> > kworker   56117885776   56120885776 -> lag of -725
> > - vhost continues executing and updates its vruntime accordingly, here
> >I would need to enhance the trace to also print the vruntimes of the
> >parent sched_entities to see the progress of their vruntime/deadline/lag
> >values as well
> > - It is a bit irritating that the EEVDF algorithm would not pick the kworker
> >over the cgroup as its deadline is smaller.
> >But, the kworker has negative lag, which might cause EEVDF to not pick
> >the kworker.
> >The cgroup at depth 0 has no lag, all deeper layers have a significantly
> >positve lag (last known values, might have changed in the meantime).
> >At this point I would see the option that the vhost task is stuck
> >somewhere or EEVDF just does not see the kworker as an eligible option.
> 
> IMHO such lag should not introduce that long delay. Can you run the
> test again with NEXT_BUDDY disabled?

I added a trace event to the next buddy path, it does not get triggered, so I'd 
assume that no buddies are selected.

> 
> > 
> > - Once the vhost is migrated off the cpu, the update_entity_lag function
> >works with the following values at 606.467022: sched_update
> >For the cgroup at depth 0
> >- vruntime = 5710415 --> this is in line with the amount of new 
> > timeslices
> > vhost got assigned while the kworker was 
> > waiting
> >- vlag =   -62439022 --> the scheduler knows that the cgroup was
> > overconsuming, but no runtime for the 
> > kworker
> >For the cfs_rq we have
> >- min_vruntime =  56117885776 --> this matches the vruntime of the 
> > kworker
> >- avg_vruntime = 161750065796 --> this is rather large in comparison, 
> > but I
> >  might access this value at a bad time
> 
> Use avg_vruntime() instead.

Fair.
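
(For anyone else reading along: the raw cfs_rq->avg_vruntime field is not an
absolute vruntime but a weight-scaled accumulator relative to min_vruntime,
which is why it looks so large. The helper does the normalization; roughly,
from kernel/sched/fair.c:

	u64 avg_vruntime(struct cfs_rq *cfs_rq)
	{
		struct sched_entity *curr = cfs_rq->curr;
		s64 avg = cfs_rq->avg_vruntime;
		long load = cfs_rq->avg_load;

		if (curr && curr->on_rq) {
			unsigned long weight = scale_load_down(curr->load.weight);

			avg += entity_key(cfs_rq, curr) * weight;
			load += weight;
		}

		if (load) {
			/* sign flips effectively floor() */
			if (avg < 0)
				avg -= (load - 1);
			avg = div_s64(avg, load);
		}

		return cfs_rq->min_vruntime + avg;
	}

entity_key() is se->vruntime - cfs_rq->min_vruntime.)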

[...]

> > 
> > # full trace #
> > 
> > sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
> > --> during __pick_eevdf, prints values for best and the first node loop 
> > variable, second loop is never executed
> > 
> > sched_place/sched_update: 
> > sev=se->vruntime,sed=se->deadline,sev=se->vlag,avg=cfs_rq->avg_vruntime,min=cfs_rq->min_vruntime
> 
> It would be better to replace cfs_rq->avg_vruntime with avg_vruntime().
> Although we can get real @avg by (vruntime + vlag), I am not sure
> vlag (@lag in trace) is se->vlag or the local variable in the place
> function, which is scaled and is no longer the true vlag.
> 

Oh my bad, sev is the vlag value of the sched_entity, lag is the local variable.

[...]

> >  vhost-2931-2953[013] d   606.338313: sched_wakeup: 
> > comm=kworker/13:1 pid=168 prio=120 target_cpu=013
> > --> kworker set to runnable, but vhost keeps on executing
> 
> What are the weights of the two entities?

I'll do another run and look at those values.

[...]



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-28 Thread Abel Wu

On 11/27/23 9:56 PM, Tobias Huschle Wrote:

On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:

On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:

The below should also work for internal entities, but last time I poked
around with it I had some regressions elsewhere -- you know, fix one,
wreck another type of situations on hand.

But still, could you please give it a go -- it applies cleanly to linus'
master and -rc2.

---
Subject: sched/eevdf: Revenge of the Sith^WSleepers



Tried the patch, it does not help unfortunately.

It might also be possible that the long-running vhost is stuck on something
during those phases where it just runs for a while. This might have been a
risk for a while; EEVDF might have just uncovered an unfortunate sequence of
events.
I'll look into this option.

I also added some more trace outputs in order to find the actual vruntimes
of the cgroup parents. The numbers look kind of reasonable, but I struggle
to judge this with certainty.

In one of the scenarios where the kworker sees an absurd wait time, the
following occurs (full trace below):

- The kworker ends its timeslice after 4941 ns
- __pick_eevdf finds the cgroup holding vhost as the next option to execute
- Last known values are:
 vruntime  deadline
cgroup56117619190   57650477291 -> depth 0
kworker   56117624131   56120619190
   This is fair, since the kworker is not runnable here.
- At depth 4, the cgroup shows the observed vruntime value which is smaller
   by a factor of 20, but depth 0 seems to be running with values of the
   correct magnitude.


A child running means its parent is also the cfs->curr, but
not vice versa if there is more than one child.


- cgroup at depth 0 has zero lag, with higher depth, there are large lag
   values (as observed 606.338267 onwards)


These values of se->vlag mean 'run this entity to parity' to avoid
excess context switches, which is what RUN_TO_PARITY does, or nothing
when !RUN_TO_PARITY. In short, se->vlag is not vlag when se->on_rq.



Now the following occurs, triggered by the vhost:
- The kworker gets placed again with:
 vruntime  deadline
cgroup56117619190   57650477291 -> depth 0, last known value
kworker   56117885776   56120885776 -> lag of -725
- vhost continues executing and updates its vruntime accordingly, here
   I would need to enhance the trace to also print the vruntimes of the
   parent sched_entities to see the progress of their vruntime/deadline/lag
   values as well
- It is a bit irritating that the EEVDF algorithm would not pick the kworker
   over the cgroup as its deadline is smaller.
   But, the kworker has negative lag, which might cause EEVDF to not pick
   the kworker.
   The cgroup at depth 0 has no lag, all deeper layers have a significantly
   positive lag (last known values, might have changed in the meantime).
   At this point I would see the option that the vhost task is stuck
   somewhere or EEVDF just does not see the kworker as an eligible option.


IMHO such lag should not introduce that long delay. Can you run the
test again with NEXT_BUDDY disabled?



- Once the vhost is migrated off the cpu, the update_entity_lag function
   works with the following values at 606.467022: sched_update
   For the cgroup at depth 0
   - vruntime = 5710415 --> this is in line with the amount of new 
timeslices
vhost got assigned while the kworker was waiting
   - vlag =   -62439022 --> the scheduler knows that the cgroup was
overconsuming, but no runtime for the kworker
   For the cfs_rq we have
   - min_vruntime =  56117885776 --> this matches the vruntime of the kworker
   - avg_vruntime = 161750065796 --> this is rather large in comparison, but I
 might access this value at a bad time


Use avg_vruntime() instead.


   - nr_running   =2 --> at this point, both, cgroup and kworker are
 still on the queue, with the cgroup being
 in the migration process
--> It seems like the overconsumption accumulates at cgroup depth 0 and is not
 propagated downwards. This might be intended though.

- At 606.479979: sched_place, cgroup hosting the vhost is migrated back
   onto cpu 13 with a lag of -166821875 it gets scheduled right away as
   there is no other task (nr_running = 0)

- At 606.479996: sched_place, the kworker gets placed again, this time
   with no lag and get scheduled almost immediately, with a wait
   time of 1255 ns.

It shall be noted that these scenarios also occur when the first placement
of the kworker in this sequence has no lag, i.e. a lag <= 0 is the pattern
when observing this issue.

# full trace #

sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
--> 

Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-27 Thread Tobias Huschle
On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:
> 
> The below should also work for internal entities, but last time I poked
> around with it I had some regressions elsewhere -- you know, fix one,
> wreck another type of situations on hand.
> 
> But still, could you please give it a go -- it applies cleanly to linus'
> master and -rc2.
> 
> ---
> Subject: sched/eevdf: Revenge of the Sith^WSleepers
> 

Tried the patch, it does not help unfortunately.

It might also be possible that the long-running vhost is stuck on something
during those phases where it just runs for a while. This might have been a
risk for a while; EEVDF might have just uncovered an unfortunate sequence of
events.
I'll look into this option.

I also added some more trace outputs in order to find the actual vruntimes
of the cgroup parents. The numbers look kind of reasonable, but I struggle
to judge this with certainty.

In one of the scenarios where the kworker sees an absurd wait time, the 
following occurs (full trace below):

- The kworker ends its timeslice after 4941 ns
- __pick_eevdf finds the cgroup holding vhost as the next option to execute
- Last known values are:   
vruntime  deadline
   cgroup56117619190   57650477291 -> depth 0
   kworker   56117624131   56120619190
  This is fair, since the kworker is not runnable here.
- At depth 4, the cgroup shows the observed vruntime value which is smaller 
  by a factor of 20, but depth 0 seems to be running with values of the 
  correct magnitude.
- cgroup at depth 0 has zero lag, with higher depth, there are large lag 
  values (as observed 606.338267 onwards)

Now the following occurs, triggered by the vhost:
- The kworker gets placed again with:   
vruntime  deadline
   cgroup56117619190   57650477291 -> depth 0, last known value
   kworker   56117885776   56120885776 -> lag of -725
- vhost continues executing and updates its vruntime accordingly, here 
  I would need to enhance the trace to also print the vruntimes of the 
  parent sched_entities to see the progress of their vruntime/deadline/lag 
  values as well
- It is a bit irritating that the EEVDF algorithm would not pick the kworker 
  over the cgroup as its deadline is smaller.
  But, the kworker has negative lag, which might cause EEVDF to not pick 
  the kworker.
  The cgroup at depth 0 has no lag, all deeper layers have a significantly 
  positive lag (last known values, might have changed in the meantime).
  At this point I would see the option that the vhost task is stuck
  somewhere or EEVDF just does not see the kworker as an eligible option.

- Once the vhost is migrated off the cpu, the update_entity_lag function
  works with the following values at 606.467022: sched_update
  For the cgroup at depth 0
  - vruntime = 5710415 --> this is in line with the amount of new timeslices
   vhost got assigned while the kworker was waiting
  - vlag =   -62439022 --> the scheduler knows that the cgroup was 
   overconsuming, but no runtime for the kworker
  For the cfs_rq we have
  - min_vruntime =  56117885776 --> this matches the vruntime of the kworker
  - avg_vruntime = 161750065796 --> this is rather large in comparison, but I 
might access this value at a bad time
  - nr_running   =2 --> at this point, both, cgroup and kworker are 
still on the queue, with the cgroup being 
in the migration process
--> It seems like the overconsumption accumulates at cgroup depth 0 and is not 
propagated downwards. This might be intended though.

- At 606.479979: sched_place, cgroup hosting the vhost is migrated back
  onto cpu 13 with a lag of -166821875 it gets scheduled right away as 
  there is no other task (nr_running = 0)

- At 606.479996: sched_place, the kworker gets placed again, this time
  with no lag and get scheduled almost immediately, with a wait 
  time of 1255 ns.

It shall be noted that these scenarios also occur when the first placement
of the kworker in this sequence has no lag, i.e. a lag <= 0 is the pattern
when observing this issue.

# full trace #

sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
--> during __pick_eevdf, prints values for best and the first node loop 
variable, second loop is never executed

sched_place/sched_update: 
sev=se->vruntime,sed=se->deadline,sev=se->vlag,avg=cfs_rq->avg_vruntime,min=cfs_rq->min_vruntime
--> at the end of place_entity and update_entity_lag

--> the chunks of 5 entries for these new events represent the 5 levels of the 
cgroup which hosts the vhost

vhost-2931-2953[013] d   606.338262: sched_stat_blocked: 
comm=kworker/13:1 pid=168 

Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-22 Thread Peter Zijlstra
On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:

> We applied both suggested patch options and ran the test again, so 
> 
> sched/eevdf: Fix vruntime adjustment on reweight
> sched/fair: Update min_vruntime for reweight_entity() correctly
> 
> and
> 
> sched/eevdf: Delay dequeue
> 
> Unfortunately, both variants do NOT fix the problem.
> The regression remains unchanged.

Thanks for testing.

> I will continue getting myself familiar with how cgroups are scheduled to dig 
> deeper here. If there are any other ideas, I'd be happy to use them as a 
> starting point for further analysis.
> 
> Would additional traces still be of interest? If so, I would be glad to
> provide them.

So, since it got bisected to the placement logic, but is a cgroup
related issue, I was thinking that 'Delay dequeue' might not cut it,
that only works for tasks, not the internal entities.

The below should also work for internal entities, but last time I poked
around with it I had some regressions elsewhere -- you know, fix one,
wreck another type of situations on hand.

But still, could you please give it a go -- it applies cleanly to linus'
master and -rc2.

---
Subject: sched/eevdf: Revenge of the Sith^WSleepers

For tasks that have received excess service (negative lag) allow them to
gain parity (zero lag) by sleeping.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/fair.c | 36 
 kernel/sched/features.h |  6 ++
 2 files changed, 42 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..b975e4b07a68 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5110,6 +5110,33 @@ static inline void update_misfit_status(struct 
task_struct *p, struct rq *rq) {}
 
 #endif /* CONFIG_SMP */
 
+static inline u64
+entity_vlag_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+   u64 now, vdelta;
+   s64 delta;
+
+   if (!(flags & ENQUEUE_WAKEUP))
+   return se->vlag;
+
+   if (flags & ENQUEUE_MIGRATED)
+   return 0;
+
+   now = rq_clock_task(rq_of(cfs_rq));
+   delta = now - se->exec_start;
+   if (delta < 0)
+   return se->vlag;
+
+   if (sched_feat(GENTLE_SLEEPER))
+   delta /= 2;
+
+   vdelta = __calc_delta(delta, NICE_0_LOAD, &cfs_rq->load);
+   if (vdelta < -se->vlag)
+   return se->vlag + vdelta;
+
+   return 0;
+}
+
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -5133,6 +5160,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
 
lag = se->vlag;
 
+   /*
+* Allow tasks that have received too much service (negative
+* lag) to (re)gain parity (zero lag) by sleeping for the
+* equivalent duration. This ensures they will be readily
+* eligible.
+*/
+   if (sched_feat(PLACE_SLEEPER) && lag < 0)
+   lag = entity_vlag_sleeper(cfs_rq, se, flags);
+
/*
 * If we want to place a task and preserve lag, we have to
 * consider the effect of the new entity on the weighted
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a3ddf84de430..722282d3ed07 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,12 @@
 SCHED_FEAT(PLACE_LAG, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 SCHED_FEAT(RUN_TO_PARITY, true)
+/*
+ * Let sleepers earn back lag, but not more than 0-lag. GENTLE_SLEEPERS earn at
+ * half the speed.
+ */
+SCHED_FEAT(PLACE_SLEEPER, true)
+SCHED_FEAT(GENTLE_SLEEPER, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed
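
(Side note for testing: with CONFIG_SCHED_DEBUG enabled, PLACE_SLEEPER and
GENTLE_SLEEPER can be flipped at runtime through
/sys/kernel/debug/sched/features, so comparing runs does not require a
rebuild.)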



Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-21 Thread Tobias Huschle
On Fri, Nov 17, 2023 at 09:07:55PM +0800, Abel Wu wrote:
> On 11/17/23 8:37 PM, Peter Zijlstra Wrote:

[...]

> > Ah, so if this is a cgroup issue, it might be worth trying this patch
> > that we have in tip/sched/urgent.
> 
> And please also apply this fix:
> https://lore.kernel.org/all/20231117080106.12890-1-s921975...@gmail.com/
> 

We applied both suggested patch options and ran the test again, so 

sched/eevdf: Fix vruntime adjustment on reweight
sched/fair: Update min_vruntime for reweight_entity() correctly

and

sched/eevdf: Delay dequeue

Unfortunately, both variants do NOT fix the problem.
The regression remains unchanged.


I will continue getting myself familiar with how cgroups are scheduled to dig 
deeper here. If there are any other ideas, I'd be happy to use them as a 
starting point for further analysis.

Would additional traces still be of interest? If so, I would be glad to
provide them.

[...]



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-20 Thread Abel Wu

On 11/20/23 6:56 PM, Peter Zijlstra Wrote:

On Sat, Nov 18, 2023 at 01:14:32PM +0800, Abel Wu wrote:


Hi Peter, I'm a little confused here. As we adopt placement strategy #1
when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
Given that the weight doesn't change, we have:

vl' = vl

But in fact it is scaled on placement:

vl' = vl * W/(W + w)


(W+w)/W


Ah, right. I misunderstood (again) the comment which says:

vl_i = (W + w_i)*vl'_i / W

So the current implementation is:

v' = V - vl'

and what I was proposing is:

v' = V' - vl

and they are equal in fact.





Is this intended?


The scaling, yes that's intended and the comment explains why. So now
you have me confused too :-)

Specifically, I want the lag after placement to be equal to the lag we
come in with. Since placement will affect avg_vruntime (adding one
element to the average changes the average etc..) the placement also
affects the lag as measured after placement.


Yes. You did the math in an iterative fashion and mine is facing the
final state:

v' = V' - vlag
V' = (WV + wv') / (W + w)

which gives:

V' = V - w * vlag / W



Or rather, if you enqueue and dequeue, I want the lag to be preserved.
If you do not take placement into consideration, lag will dissipate real
quick.


And to illustrate my understanding of strategy #1:



@@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
 * vl_i is given by:
 *
 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
-*  = (W*V + w_i*(V - vl_i)) / (W + w_i)
-*  = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
-*  = (V*(W + w_i) - w_i*l) / (W + w_i)
-*  = V - w_i*vl_i / (W + w_i)
-*
-* And the actual lag after adding an entity with vl_i is:
-*
-*   vl'_i = V' - v_i
-* = V - w_i*vl_i / (W + w_i) - (V - vl_i)
-* = vl_i - w_i*vl_i / (W + w_i)
-*
-* Which is strictly less than vl_i. So in order to preserve lag
-* we should inflate the lag before placement such that the
-* effective lag after placement comes out right.
-*
-* As such, invert the above relation for vl'_i to get the vl_i
-* we need to use such that the lag after placement is the lag
-* we computed before dequeue.
+*  = (W*V + w_i*(V' - vl_i)) / (W + w_i)
+*  = V - w_i*vl_i / W
 *
-*   vl'_i = vl_i - w_i*vl_i / (W + w_i)
-* = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
-*
-*   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
-*   = W*vl_i
-*
-*   vl_i = (W + w_i)*vl'_i / W
 */
load = cfs_rq->avg_load;
if (curr && curr->on_rq)
load += scale_load_down(curr->load.weight);
-
-   lag *= load + scale_load_down(se->load.weight);
if (WARN_ON_ONCE(!load))
load = 1;
-   lag = div_s64(lag, load);
+
+   vruntime -= div_s64(lag * scale_load_down(se->load.weight), 
load);
}
se->vruntime = vruntime - lag;



So you're proposing we do:

v = V - (lag * w) / (W + w) - lag


What I 'm proposing is:

V' = V - w * vlag / W

so we have:

v' = V' - vlag
   = V - vlag * w/W - vlag
   = V - vlag * (W + w)/W

which is exactly the same as current implementation.



?

That can be written like:

v = V - (lag * w) / (W+w) - (lag * (W+w)) / (W+w)
  = V - (lag * (W+w) + lag * w) / (W+w)
  = V - (lag * (W+2w)) / (W+w)

And that turns into a mess AFAICT.


Let me repeat my earlier argument. Suppose v,w,l are the new element.
V,W are the old avg_vruntime and sum-weight.

Then: V = V*W / W, and by extension: V' = (V*W + v*w) / (W + w).

The new lag, after placement:

l' = V' - v = (V*W + v*w) / (W+w) - v
 = (V*W + v*w) / (W+w) - v * (W+w) / (W+w)
= (V*W + v*w -v*W - v*w) / (W+w)
= (V*W - v*W) / (W+w)
= W*(V-v) / (W+w)
= W/(W+w) * (V-v)

Substitute: v = V - (W+w)/W * l, my scaling thing, to obtain:

l' = W/(W+w) * (V - (V - (W+w)/W * l))
= W/(W+w) * (V - V + (W+w)/W * l)
= W/(W+w) * (W+w)/W * l
= l

So by scaling, we've preserved lag across placement.

That make sense?


Yes, I think I won't misunderstand again for the 3rd time :)

Thanks!
Abel



Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-20 Thread Peter Zijlstra
On Sat, Nov 18, 2023 at 01:14:32PM +0800, Abel Wu wrote:

> Hi Peter, I'm a little confused here. As we adopt placement strategy #1
> when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
> Given that the weight doesn't change, we have:
> 
>   vl' = vl
> 
> But in fact it is scaled on placement:
> 
>   vl' = vl * W/(W + w)

(W+w)/W

> 
> Is this intended?

The scaling, yes that's intended and the comment explains why. So now
you have me confused too :-)

Specifically, I want the lag after placement to be equal to the lag we
come in with. Since placement will affect avg_vruntime (adding one
element to the average changes the average etc..) the placement also
affects the lag as measured after placement.

Or rather, if you enqueue and dequeue, I want the lag to be preserved.
If you do not take placement into consideration, lag will dissipate real
quick.

> And to illustrate my understanding of strategy #1:

> @@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct 
> sched_entity *se, int flags)
>* vl_i is given by:
>*
>*   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> -  *  = (W*V + w_i*(V - vl_i)) / (W + w_i)
> -  *  = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
> -  *  = (V*(W + w_i) - w_i*l) / (W + w_i)
> -  *  = V - w_i*vl_i / (W + w_i)
> -  *
> -  * And the actual lag after adding an entity with vl_i is:
> -  *
> -  *   vl'_i = V' - v_i
> -  * = V - w_i*vl_i / (W + w_i) - (V - vl_i)
> -  * = vl_i - w_i*vl_i / (W + w_i)
> -  *
> -  * Which is strictly less than vl_i. So in order to preserve lag
> -  * we should inflate the lag before placement such that the
> -  * effective lag after placement comes out right.
> -  *
> -  * As such, invert the above relation for vl'_i to get the vl_i
> -  * we need to use such that the lag after placement is the lag
> -  * we computed before dequeue.
> +  *  = (W*V + w_i*(V' - vl_i)) / (W + w_i)
> +  *  = V - w_i*vl_i / W
>*
> -  *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
> -  * = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
> -  *
> -  *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
> -  *   = W*vl_i
> -  *
> -  *   vl_i = (W + w_i)*vl'_i / W
>*/
>   load = cfs_rq->avg_load;
>   if (curr && curr->on_rq)
>   load += scale_load_down(curr->load.weight);
> -
> - lag *= load + scale_load_down(se->load.weight);
>   if (WARN_ON_ONCE(!load))
>   load = 1;
> - lag = div_s64(lag, load);
> +
> + vruntime -= div_s64(lag * scale_load_down(se->load.weight), 
> load);
>   }
>   se->vruntime = vruntime - lag;


So you're proposing we do:

v = V - (lag * w) / (W + w) - lag

?

That can be written like:

v = V - (lag * w) / (W+w) - (lag * (W+w)) / (W+w)
  = V - (lag * (W+w) + lag * w) / (W+w)
  = V - (lag * (W+2w)) / (W+w)

And that turns into a mess AFAICT.


Let me repeat my earlier argument. Suppose v,w,l are the new element.
V,W are the old avg_vruntime and sum-weight.

Then: V = V*W / W, and by extension: V' = (V*W + v*w) / (W + w).

The new lag, after placement: 

l' = V' - v = (V*W + v*w) / (W+w) - v
= (V*W + v*w) / (W+w) - v * (W+w) / (W+w)
= (V*W + v*w -v*W - v*w) / (W+w)
= (V*W - v*W) / (W+w)
= W*(V-v) / (W+w)
= W/(W+w) * (V-v)

Substitute: v = V - (W+w)/W * l, my scaling thing, to obtain:

l' = W/(W+w) * (V - (V - (W+w)/W * l))
   = W/(W+w) * (V - V + (W+w)/W * l)
   = W/(W+w) * (W+w)/W * l
   = l

So by scaling, we've preserved lag across placement.

That make sense?
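
A quick standalone check of the algebra, for anyone who wants to convince
themselves numerically (plain userspace C; the values are arbitrary):

	#include <stdio.h>

	int main(void)
	{
		double W = 3876, w = 1048576;	/* old total weight, new entity weight */
		double V = 1e6,  l = -1128;	/* old avg_vruntime, incoming lag */

		double v  = V - (W + w) / W * l;	/* scaled placement */
		double V2 = (V * W + v * w) / (W + w);	/* new weighted average */
		double l2 = V2 - v;			/* lag after placement */

		printf("l = %f, l' = %f\n", l, l2);	/* prints the same value twice */
		return 0;
	}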



Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-17 Thread Abel Wu

On 11/17/23 5:23 PM, Peter Zijlstra Wrote:


Your email is pretty badly mangled by wrapping, please try and
reconfigure your MUA, esp. the trace and debug output is unreadable.

On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:


The base scenario are two KVM guests running on an s390 LPAR. One guest
hosts the uperf server, one the uperf client.
With EEVDF we observe a regression of ~50% for a strburst test.
For a more detailed description of the setup see the section TEST SUMMARY at
the bottom.


Well, that's not good :/


Short summary:
The mentioned kworker has been scheduled to CPU 14 before the tracing was
enabled.
A vhost process is migrated onto CPU 14.
The vruntimes of kworker and vhost differ significantly (86642125805 vs
4242563284 -> factor 20)


So bear with me, I know absolutely nothing about virt stuff. I suspect
there's cgroups involved because shiny or something.

kworkers are typically not in cgroups and are part of the root cgroup,
but what's a vhost and where does it live?

Also, what are their weights / nice values?


The vhost process wants to wake up the kworker, therefore the kworker is
placed onto the runqueue again and set to runnable.
The vhost process continues to execute, waking up other vhost processes on
other CPUs.

So far this behavior is not different to what we see on pre-EEVDF kernels.

On timestamp 576.162767, the vhost process triggers the last wake up of
another vhost on another CPU.
Until timestamp 576.171155, we see no other activity. Now, the vhost process
ends its time slice.
Then, vhost gets re-assigned new time slices 4 times and gets then migrated
off to CPU 15.


So why does this vhost stay on the CPU if it doesn't have anything to
do? (I've not tried to make sense of the trace, that's just too
painful).


This does not occur with older kernels.
The kworker has to wait for the migration to happen in order to be able to
execute again.
This is due to the fact, that the vruntime of the kworker is significantly
larger than the one of vhost.


That's weird. Can you add a trace_printk() to update_entity_lag() and
have it print out the lag, limit and vlag (post clamping) values? And
also in place_entity() for the reverse process, lag pre and post scaling
or something.

After confirming both tasks are indeed in the same cgroup ofcourse,
because if they're not, vruntime will be meaningless to compare and we
should look elsewhere.

Also, what HZ and what preemption mode are you running? If kworker is
somehow vastly over-shooting its slice -- keeps running way past the
avg_vruntime, then it will build up a giant lag and you get what you
describe, next time it wakes up it gets placed far to the right (exactly
where it was when it 'finally' went to sleep, relatively speaking).


We found some options which sound plausible but we are not sure if they are
valid or not:

1. The wake up path has a dependency on the vruntime metrics that now delays
the execution of the kworker.
2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime)
which updates the way cfs_rq->min_vruntime and
 cfs_rq->avg_vruntime are set might have introduced an issue which is
uncovered by the commit mentioned above.


Suppose you have a few tasks (of equal weight) on you virtual timeline
like so:

-+---+---+---+---+--
 ^   ^
|   `avg_vruntime
`-min_vruntime

Then the above would be more or less the relative placements of these
values. avg_vruntime is the weighted average of the various vruntimes
and is therefore always in the 'middle' of the tasks, and not somewhere
out-there.

min_vruntime is a monotonically increasing 'minimum' that's left-ish on
the tree (there's a few cases where a new task can be placed left of
min_vruntime and it's no longer actually the minimum, but whatever).

These values should be relatively close to one another, depending
ofcourse on the spread of the tasks. So I don't think this is causing
trouble.

Anyway, the big difference with lag based placement is that where
previously tasks (that do not migrate) retain their old vruntime and on
placing they get pulled forward to at least min_vruntime, so a task that
wildly overshoots, but then doesn't run for significant time can still
be overtaken and then when placed again be 'okay'.

Now OTOH, with lag-based placement, we strictly preserve their relative
offset vs avg_vruntime. So if they were *far* to the right when they go
to sleep, they will again be there on placement.


Hi Peter, I'm a little confused here. As we adopt placement strategy #1
when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
Given that the weight doesn't change, we have:

vl' = vl

But in fact it is scaled on placement:

vl' = vl * W/(W + w)

Is this intended? And to illustrate my understanding of strategy #1:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07f555857698..a24ef8b297ed 100644
--- 

Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-17 Thread Abel Wu

On 11/17/23 8:37 PM, Peter Zijlstra Wrote:

On Fri, Nov 17, 2023 at 01:24:21PM +0100, Tobias Huschle wrote:

On Fri, Nov 17, 2023 at 10:23:18AM +0100, Peter Zijlstra wrote:



kworkers are typically not in cgroups and are part of the root cgroup,
but what's a vhost and where does it live?


The qemu instances of the two KVM guests are placed into cgroups.
The vhosts run within the context of these qemu instances (4 threads per guest).
So they are also put into those cgroups.

I'll answer the other questions you brought up as well, but I guess that one
is most critical:



After confirming both tasks are indeed in the same cgroup ofcourse,
because if they're not, vruntime will be meaningless to compare and we
should look elsewhere.


In that case we probably have to go with elsewhere ... which is good to know.


Ah, so if this is a cgroup issue, it might be worth trying this patch
that we have in tip/sched/urgent.


And please also apply this fix:
https://lore.kernel.org/all/20231117080106.12890-1-s921975...@gmail.com/



I'll try and read the rest of the email a little later, gotta run
errands first.

---

commit eab03c23c2a162085b13200d7942fc5a00b5ccc8
Author: Abel Wu 
Date:   Tue Nov 7 17:05:07 2023 +0800

 sched/eevdf: Fix vruntime adjustment on reweight
 
 vruntime of the (on_rq && !0-lag) entity needs to be adjusted when

 it gets re-weighted, and the calculations can be simplified based
 on the fact that re-weight won't change the w-average of all the
 entities. Please check the proofs in comments.
 
 But adjusting vruntime can also cause position change in RB-tree

 hence require re-queue to fix up which might be costly. This might
 be avoided by deferring adjustment to the time the entity actually
 leaves tree (dequeue/pick), but that will negatively affect task
 selection and probably not good enough either.
 
 Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")

 Signed-off-by: Abel Wu 
 Signed-off-by: Peter Zijlstra (Intel) 
 Link: 
https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.a...@bytedance.com

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2048138ce54b..025d90925bf6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3666,41 +3666,140 @@ static inline void
  dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
  #endif
  
+static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,

+  unsigned long weight)
+{
+   unsigned long old_weight = se->load.weight;
+   u64 avruntime = avg_vruntime(cfs_rq);
+   s64 vlag, vslice;
+
+   /*
+* VRUNTIME
+* 
+*
+* COROLLARY #1: The virtual runtime of the entity needs to be
+* adjusted if re-weight at !0-lag point.
+*
+* Proof: For contradiction assume this is not true, so we can
+* re-weight without changing vruntime at !0-lag point.
+*
+* Weight   VRuntime   Avg-VRuntime
+* beforew  vV
+*  afterw' v'   V'
+*
+* Since lag needs to be preserved through re-weight:
+*
+*  lag = (V - v)*w = (V'- v')*w', where v = v'
+*  ==>  V' = (V - v)*w/w' + v   (1)
+*
+* Let W be the total weight of the entities before reweight,
+* since V' is the new weighted average of entities:
+*
+*  V' = (WV + w'v - wv) / (W + w' - w) (2)
+*
+* by using (1) & (2) we obtain:
+*
+*  (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
+*  ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
+*  ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
+*  ==>  (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
+*
+* Since we are doing at !0-lag point which means V != v, we
+* can simplify (3):
+*
+*  ==>  W / (W + w' - w) = w / w'
+*  ==>  Ww' = Ww + ww' - ww
+*  ==>  W * (w' - w) = w * (w' - w)
+*  ==>  W = w   (re-weight indicates w' != w)
+*
+* So the cfs_rq contains only one entity, hence vruntime of
+* the entity @v should always equal to the cfs_rq's weighted
+* average vruntime @V, which means we will always re-weight
+* at 0-lag point, thus breach assumption. Proof completed.
+*
+*
+* COROLLARY #2: Re-weight does NOT affect weighted average
+* vruntime of all the entities.
+*
+* Proof: According to corollary #1, Eq. (1) should be:
+*
+*  (V - v)*w = (V' - v')*w'
+*  ==>v' = V' - (V - v)*w/w'(4)
+*
+* According to the weighted average formula, we have:
+*
+*  V' = (WV - wv + w'v') / (W - w + w')
+* = 

Re: RE: [PATCH v2] nvdimm: Use kstrtobool() instead of strtobool()

2023-05-04 Thread Christophe JAILLET

Le 25/01/2023 à 20:11, Dan Williams a écrit :

Christophe JAILLET wrote:

strtobool() is the same as kstrtobool().
However, the latter is more used within the kernel.

In order to remove strtobool() and slightly simplify kstrtox.h, switch to
the other function name.

While at it, include the corresponding header file ()

Signed-off-by: Christophe JAILLET 
---
This patch was already sent as a part of a series ([1]) that axed all usages
of strtobool().
Most of the patches have been merged in -next.

I synch'ed with latest -next and re-send the remaining ones as individual
patches.

Changes in v2:
   - synch with latest -next.


Looks good, applied for v6.3.



Hi,

polite reminder.

If I'm right, only 2 places still use strtobool().
This one and net/bluetooth/hci_debugfs.c.

I'll also try to push the other one and get rid of strtobool().

CJ
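
For reference, the call contract is unchanged by the rename; a minimal usage
sketch (the sysfs store-callback shape is assumed):

	#include <linux/kstrtox.h>

	bool enable;
	int ret = kstrtobool(buf, &enable);	/* accepts "y"/"n", "1"/"0", "on"/"off" */
	if (ret)
		return ret;			/* -EINVAL on anything else */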



Re: Re: [syzbot] INFO: rcu detected stall in tx

2021-04-20 Thread Greg Kroah-Hartman
On Mon, Apr 19, 2021 at 08:56:19PM +, Guido Kiener wrote:
> Hi all,
> 
> The error has been in usbtmc_interrupt(struct urb *urb) for five years. The
> status code EPROTO is not handled correctly.
> It's not a showstopper, but we should fix it and check the status code 
> according to usbtmc_read_bulk_cb() or
> usb_skeleton.c.
> @Dave: Do you have time? Otherwise I can do it.
> @Greg: Is it urgent?

No idea, but patches for known problems are always good to get completed
as soon as possible :)

thanks,

greg k-h
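
For the archives, the usual completion-handler convention (following
drivers/usb/usb-skeleton.c; this is a hedged sketch, not the actual usbtmc
fix) treats -EPROTO as unrecoverable instead of resubmitting forever:

	static void usbtmc_interrupt(struct urb *urb)
	{
		int status = urb->status;

		switch (status) {
		case 0:
			break;		/* success: process the data below */
		case -ECONNRESET:
		case -ENOENT:
		case -ESHUTDOWN:
		case -EPROTO:		/* -71: protocol error, device likely gone */
			return;		/* terminated: do not resubmit */
		default:
			break;		/* transient error: resubmit below */
		}

		/* ... handle the notification, then resubmit ... */
		usb_submit_urb(urb, GFP_ATOMIC);
	}

That would silence the "unknown status received: -71" flood and break the
loop syzbot is hitting.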


RE: Re: [syzbot] INFO: rcu detected stall in tx

2021-04-19 Thread Guido Kiener
Hi all,

The error has been in usbtmc_interrupt(struct urb *urb) for five years. The status
code EPROTO is not handled correctly.
It's not a showstopper, but we should fix it and check the status code 
according to usbtmc_read_bulk_cb() or
usb_skeleton.c.
@Dave: Do you have time? Otherwise I can do it.
@Greg: Is it urgent?

- Guido

-Original Message-
From: Dmitry 
Sent: Monday, April 19, 2021 9:27 AM
Subject: Re: [syzbot] INFO: rcu detected stall in tx

On Mon, Apr 19, 2021 at 9:19 AM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:50987bec Merge tag 'trace-v5.12-rc7' of git://git.kernel.o..
> git tree:   upstream
> console output: 
> https://syzkaller.appspot.com/x/log.txt?x=1065c5fcd0
> kernel config:  
> https://syzkaller.appspot.com/x/.config?x=398c4d0fe6f66e68
> dashboard link: 
> https://syzkaller.appspot.com/bug?extid=e2eae5639e7203360018
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+e2eae5639e7203360...@syzkaller.appspotmail.com
>
> usbtmc 5-1:0.0: unknown status received: -71 usbtmc 3-1:0.0: unknown 
> status received: -71 usbtmc 5-1:0.0: unknown status received: -71

The log shows an infinite stream of these before the stall, so I assume it's an 
infinite loop in usbtmc.
+usbtmc maintainers

[  370.171634][C0] usbtmc 6-1:0.0: unknown status received: -71
[  370.177799][C1] usbtmc 3-1:0.0: unknown status received: -71
[  370.183912][C0] usbtmc 4-1:0.0: unknown status received: -71
[  370.190076][C1] usbtmc 5-1:0.0: unknown status received: -71
[  370.196194][C0] usbtmc 2-1:0.0: unknown status received: -71
[  370.202387][C1] usbtmc 3-1:0.0: unknown status received: -71
[  370.208460][C0] usbtmc 6-1:0.0: unknown status received: -71
[  370.214615][C1] usbtmc 5-1:0.0: unknown status received: -71
[  370.220736][C0] usbtmc 4-1:0.0: unknown status received: -71
[  370.226902][C1] usbtmc 3-1:0.0: unknown status received: -71
[  370.233005][C0] usbtmc 2-1:0.0: unknown status received: -71
[  370.239168][C1] usbtmc 5-1:0.0: unknown status received: -71
[  370.245271][C0] usbtmc 6-1:0.0: unknown status received: -71
[  370.251426][C1] usbtmc 3-1:0.0: unknown status received: -71
[  370.257552][C0] usbtmc 4-1:0.0: unknown status received: -71
[  370.263715][C1] usbtmc 5-1:0.0: unknown status received: -71
[  370.269819][C0] usbtmc 2-1:0.0: unknown status received: -71
[  370.275974][C1] usbtmc 3-1:0.0: unknown status received: -71
[  370.282100][C0] usbtmc 6-1:0.0: unknown status received: -71
[  370.288262][C1] usbtmc 5-1:0.0: unknown status received: -71
[  370.294399][C0] usbtmc 4-1:0.0: unknown status received: -71





Re: Re: [PATCH] kernel/hung_task: Add a whitelist and blacklist mechanism.

2021-04-19 Thread Peter Zijlstra
On Mon, Apr 19, 2021 at 09:46:26AM +0800, 周传高 wrote:
> 
> >On Sat, Apr 17, 2021 at 07:13:01AM -0700, zhouchuangao wrote:
> >> eg:
> >> In Android system, we usually and some processes to the whitelist.
> >> static task_t task_whitelist[] = {
> >>{"mdrt_thread", HUNG_TASK_WHITELIST},
> >>{"chre_kthread", HUNG_TASK_WHITELIST},
> >>{"scp_power_reset", HUNG_TASK_WHITELIST},
> >>{"ccci_fsm1", HUNG_TASK_WHITELIST},
> >>{"qos_ipi_recv", HUNG_TASK_WHITELIST},
> >>{NULL, 0},
> >> };
> >
> >What are these tasks doing that the hung-task detector fires on them?
> >Should you fix that instead?
> 
> These tasks are implemented by the SoC vendor, and normally they do
> not configure HUNG TASK, so we need to ignore these tasks if we use
> HUNG TASK. 

Then raise a bug against their crap software, don't try and work around
it in the kernel.

We're not going to upstream workarounds for crap vendor software.


Re: Re: [PATCH] [v2] spi: spi-zynqmp-gqspi: Fix runtime PM imbalance in zynqmp_qspi_probe

2021-04-14 Thread dinghao . liu
> Hi Dinghao,
> On Mon, Apr 12, 2021 at 03:31:54PM +0800, Dinghao Liu wrote:
> > There is a PM usage counter decrement after zynqmp_qspi_init_hw()
> > without any refcount increment, which leads to a refcount leak. Add
> > a refcount increment to balance the refcount. Also set
> > auto_runtime_pm to resume the suspended spi controller.
> > 
> > Signed-off-by: Dinghao Liu 
> > ---
> > changelog:
> > 
> > v2: - Add a refcount increment to fix refcout leak instead of the
> >   refcount decrement on error.
> >   Set ctlr->auto_runtime_pm = true.
> > ---
> >  drivers/spi/spi-zynqmp-gqspi.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/spi/spi-zynqmp-gqspi.c b/drivers/spi/spi-zynqmp-gqspi.c
> > index c8fa6ee18ae7..8b21c7b0e7eb 100644
> > --- a/drivers/spi/spi-zynqmp-gqspi.c
> > +++ b/drivers/spi/spi-zynqmp-gqspi.c
> > @@ -1160,6 +1160,7 @@ static int zynqmp_qspi_probe(struct platform_device 
> > *pdev)
> > pm_runtime_set_autosuspend_delay(&pdev->dev, SPI_AUTOSUSPEND_TIMEOUT);
> > pm_runtime_set_active(&pdev->dev);
> > pm_runtime_enable(&pdev->dev);
> > +   pm_runtime_get_sync(&pdev->dev);
> Please check the return value here; if ret is "< 0", go to the error label,
> and a pm_runtime_put_sync() is needed in the error label.
> > /* QSPI controller initializations */
> > zynqmp_qspi_init_hw(xqspi);
> >  
> > @@ -1187,6 +1188,7 @@ static int zynqmp_qspi_probe(struct platform_device 
> > *pdev)
> > ctlr->mode_bits = SPI_CPOL | SPI_CPHA | SPI_RX_DUAL | SPI_RX_QUAD |
> > SPI_TX_DUAL | SPI_TX_QUAD;
> > ctlr->dev.of_node = np;
> > +   ctlr->auto_runtime_pm = true;
> >  
> > ret = devm_spi_register_controller(>dev, ctlr);
> > if (ret) {
> These two functions
>  pm_runtime_mark_last_busy(&pdev->dev);
>   pm_runtime_put_autosuspend(&pdev->dev);
> should be the last operations in the probe function since, once they run,
> the spi_controller will enter the suspend state and disable clks after 3s.
> So please move them just before "return 0".
> 
> And would you please cc me when you send V3? I am preparing to send a patch 
> series
> to fix clk and suspend/resume issues which is based on the pm_runtime issue.
> 

Thanks for your advice and I will send a new patch soon.

Regards,
Dinghao
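
For reference, a hedged sketch of the shape being asked for in v3 (the
clk_dis_all label is assumed from the existing probe error path; this is not
the merged patch):

	ret = pm_runtime_get_sync(&pdev->dev);
	if (ret < 0) {
		dev_err(&pdev->dev, "failed to runtime resume: %d\n", ret);
		goto clk_dis_all;
	}

	/* ... controller init and devm_spi_register_controller() ... */

	pm_runtime_mark_last_busy(&pdev->dev);
	pm_runtime_put_autosuspend(&pdev->dev);

	return 0;

clk_dis_all:
	pm_runtime_put_sync(&pdev->dev);
	pm_runtime_disable(&pdev->dev);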

Re: Re: [RFC] vsock: add multiple transports support for dgram

2021-04-13 Thread Jiang Wang .
Hi Jorgen,

Thanks for the detailed explanation and I agree with you. For the bind list,
my  prototype is doing
something similar to that. I will double check it.

Hi Stefano,

I don't have other questions for now. Thanks.

Regards,

Jiang

On Tue, Apr 13, 2021 at 5:52 AM Stefano Garzarella  wrote:
>
> On Tue, Apr 13, 2021 at 12:12:50PM +, Jorgen Hansen wrote:
> >
> >
> >On 12 Apr 2021, at 20:53, Jiang Wang . 
> >mailto:jiang.w...@bytedance.com>> wrote:
> >
> >On Mon, Apr 12, 2021 at 7:04 AM Stefano Garzarella 
> >mailto:sgarz...@redhat.com>> wrote:
> >
> >Hi Jiang,
> >thanks for re-starting the multi-transport support for dgram!
> >
> >No problem.
> >
> >On Wed, Apr 07, 2021 at 11:25:36AM -0700, Jiang Wang . wrote:
> >On Wed, Apr 7, 2021 at 2:51 AM Jorgen Hansen 
> >mailto:jhan...@vmware.com>> wrote:
> >
> >
> >On 6 Apr 2021, at 20:31, Jiang Wang 
> >mailto:jiang.w...@bytedance.com>> wrote:
> >
> >From: "jiang.wang" 
> >mailto:jiang.w...@bytedance.com>>
> >
> >Currently, only VMCI supports dgram sockets. To support the
> >nested VM use case, this patch removes transport_dgram and
> >uses transport_g2h and transport_h2g for dgram too.
> >
> >I agree on this part, I think that's the direction to go.
> >transport_dgram was added as a shortcut.
> >
> >Got it.
> >
> >
> >Could you provide some background for introducing this change - are you
> >looking at introducing datagrams for a different transport? VMCI datagrams
> >already support the nested use case,
> >
> >Yes, I am trying to introduce datagram for virtio transport. I wrote a
> >spec patch for
> >virtio dgram support and also a code patch, but the code patch is still WIP.
> >When I wrote this commit message, I was thinking nested VM is the same as
> >multiple transport support. But now, I realize they are different.
> >Nested VMs may use
> >the same virtualization layer(KVM on KVM), or different virtualization layers
> >(KVM on ESXi). Thanks for letting me know that VMCI already supported nested
> >use cases. I think you mean VMCI on VMCI, right?
> >
> >but if we need to support multiple datagram
> >transports we need to rework how we administer port assignment for datagrams.
> >One specific issue is that the vmci transport won’t receive any datagrams 
> >for a
> >port unless the datagram socket has already been assigned the vmci transport
> >and the port bound to the underlying VMCI device (see below for more 
> >details).
> >
> >I see.
> >
> >The transport is assigned when sending every packet and
> >receiving every packet on dgram sockets.
> >
> >Is the intent that the same datagram socket can be used for sending packets 
> >both
> >In the host to guest, and the guest to directions?
> >
> >Nope. One datagram socket will only send packets to one direction, either to 
> >the
> >host or to the guest. My above description is wrong. When sending packets, 
> >the
> >transport is assigned with the first packet (with auto_bind).
> >
> >I'm not sure this is right.
> >The auto_bind on the first packet should only assign a local port to the
> >socket, but does not affect the transport to be used.
> >
> >A user could send one packet to the nested guest and another to the host
> >using the same socket, or am I wrong?
> >
> >OK. I think you are right.
> >
> >
> >The problem is when receiving packets. The listener can bind to the
> >VMADDR_CID_ANY
> >address. Then it is unclear which transport we should use. For stream
> >sockets, there will be a new socket for each connection, and transport
> >can be decided
> >at that time. For datagram sockets, I am not sure how to handle that.
> >
> >yes, this I think is the main problem, but maybe the sender one is even
> >more complicated.
> >
> >Maybe we should remove the 1:1 association we have now between vsk and
> >transport.
> >
> >Yes, I thought about that too. One idea is to define two transports in vsk.
> >For sending pkt, we can pick the right transport when we get the packet, like
> >in virtio_transport_send_pkt_info(). For receiving pkts, we have to check
> >and call both
> >transports' dequeue callbacks if the local cid is CID_ANY.
> >
> >At least for DGRAM, for connected sockets I think the association makes
> >sense.
> >
> >Yeah. For a connected socket, we will only use one transport.
> >
> >For VMCI, does the same transport can be used for both receiving from
> >host and from
> >the guest?
> >
> >Yes, they're registered at different times, but it's the same transport.
> >
> >
> >For virtio, the h2g and g2h transports are different,, so we have to
> >choose
> >one of them. My original thought is to wait until the first packet
> >arrives.
> >
> >Another idea is that we always bind to host addr and use h2g
> >transport because I think that might
> >be more common. If a listener wants to recv packets from the host, then
> >it
> >should bind to the guest addr instead of CID_ANY.
> >
> >Yes, I remember we discussed this idea, this would simplify the
> >receiving, but there is still the issue of 
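
For what it's worth, a minimal sketch of the per-destination choice being
discussed (transport_g2h and transport_h2g are the existing globals in
net/vmw_vsock/af_vsock.c; the helper name is made up, and the test mirrors
what the stream path already does when assigning a transport):

	static const struct vsock_transport *
	vsock_dgram_transport_for(u32 remote_cid)
	{
		if (remote_cid <= VMADDR_CID_HOST)	/* guest sending to its host */
			return transport_g2h;
		return transport_h2g;			/* host sending to a guest */
	}

The hard case remains receive on VMADDR_CID_ANY, where both transports can
deliver and the socket cannot be tied to either one up front.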

Re: Re: [PATCH] x86: Accelerate copy_page with non-temporal in X86

2021-04-13 Thread Borislav Petkov
On Tue, Apr 13, 2021 at 08:54:55PM +0800, Kemeng Shi wrote:
> Yes. And NT stores should be better for copy_page, especially when copying a
> lot of pages, as only part of the copied page will be accessed recently.

I thought "should be better" too last time when I measured rep; movs vs
NT stores, but actual measurements showed no real difference.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette
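
For anyone who wants to reproduce the comparison in userspace, a hedged
sketch of the NT-store side (SSE2 intrinsics; a kernel implementation would
use movnti in assembly instead):

	#include <emmintrin.h>
	#include <stddef.h>

	/* Streaming stores bypass the cache; they only win when the
	 * destination page will not be read back soon. */
	static void copy_page_nt(void *dst, const void *src)
	{
		const __m128i *s = src;
		__m128i *d = dst;
		size_t i;

		for (i = 0; i < 4096 / sizeof(__m128i); i++)
			_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
		_mm_sfence();	/* order the NT stores before later accesses */
	}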


Re: Re: [PATCH] phy: nxp-c45: add driver for tja1103

2021-04-13 Thread Andrew Lunn
On Tue, Apr 13, 2021 at 08:56:30AM +0200, Christian Herber wrote:
> Hi Andrew,
> 
> On 4/12/2021 6:52 PM, Andrew Lunn wrote:
> > 
> > So what you are saying is, you don't care if the IP is completely
> > different, it all goes in one driver. So let's put this driver into
> > nxp-tja11xx.c. And then we avoid all the naming issues.
> > 
> >   Andrew
> > 
> 
> As this seems to be a key question, let me try and shed some more light on
> this.
> The original series of BASE-T1 PHYs includes TJA1100, TJA1101, and TJA1102.
> They are covered by the existing driver, which has the unfortunate naming
> TJA11xx. Unfortunate, because the use of wildcards is a bit too generous.

Yes, that does happen.

Naming is difficult. But I really think nxp-c45.c is much worse. It
gives no idea at all what it supports. Or in the future, what it does
not support, and you actually need nxp-c45-ng.c.

Developers are going to look at a board, see a tja1XYZ chip, see the
nxp-tja11xx.c and enable it. Does the chip have a big C45 symbol on
it? Anything to give the idea it should use the nxp-c45 driver?

Maybe we should actually swing the complete opposite direction. Name
it nxp-tja1103.c. There are lots of drivers which have a specific
name, but actually support a lot more devices. The developer sees they
have a tja1XYZ, there are two drivers which look about right, and
enable them both?

   Andrew


Re: Re: [PATCH] kernel/module: Use BUG_ON instead of if condition followed by BUG.

2021-04-13 Thread Jessica Yu

+++ 周传高 [13/04/21 15:21 +0800]:



+++ zhouchuangao [30/03/21 05:07 -0700]:

It can be optimized at compile time.

Signed-off-by: zhouchuangao 


Hi,

Could you please provide a more descriptive changelog? I.e., is this
a fix for a coccinelle warning? What are the optimization(s)?

Thanks,


First,
This patch comes from a coccinelle warning.

Second,
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)

BUG_ON uses unlikely in if(). Through disassembly, we can see that
brk #0x800 is compiled to the end of the function.
As you can see below:
   ..
   ff8008660bec:   d65f03c0        ret
   ff8008660bf0:   d4210000        brk     #0x800

Usually, the condition in if () is not satisfied. For the multi-stage pipeline,
we do not need to perform fetch, decode, and execute operations on the brk
instruction.

In my opinion, this can improve the efficiency of the multi-stage pipeline.
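
To make the pipeline argument concrete, a minimal self-contained sketch
of the transformation, with userspace stand-ins for BUG() and
unlikely() (the real kernel macros are quoted above):

#include <stdlib.h>

#define unlikely(x)	__builtin_expect(!!(x), 0)
#define BUG()		abort()
#define BUG_ON(cond)	do { if (unlikely(cond)) BUG(); } while (0)

int get_value(const int *p)
{
	/* Open-coded form: no branch hint for the compiler. */
	if (p == NULL)
		BUG();

	/* BUG_ON() form: unlikely() lets the compiler emit the trap out
	 * of line, so the hot path falls straight through the pipeline. */
	BUG_ON(p == NULL);
	return *p;
}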


Thanks. Could you please modify your commit/changelog message to
include this information (including the coccinelle warning) and resend
the patch?


Re: Re: [PATCH] phy: nxp-c45: add driver for tja1103

2021-04-13 Thread Christian Herber

Hi Andrew,

On 4/12/2021 6:52 PM, Andrew Lunn wrote:


So what you are saying is, you don't care if the IP is completely
different, it all goes in one driver. So let's put this driver into
nxp-tja11xx.c. And then we avoid all the naming issues.

  Andrew



As this seems to be a key question, let me try and shed some more light 
on this.
The original series of BASE-T1 PHYs includes TJA1100, TJA1101, and 
TJA1102. They are covered by the existing driver, which has the 
unfortunate naming TJA11xx. Unfortunate, because the use of wildcards is 
a bit too generous. E.g. the naming would also include a TJA1145, which 
is a high-speed CAN transceiver. The truth is, extrapolating wildcards 
in product names doesn't work, as there is no guarantee about future 
product names.
The mentioned TJA1100/1/2 are *fairly* software-compatible, which is why 
it makes sense to have a shared driver. When it gets to TJA1103, there 
is no SW compatibility, which is why we decided to create a new driver.
We want to support all future Ethernet PHY devices with this codebase, 
and that is why the naming is that generic. The common denominator of 
the devices is that they are NXP products and use clause 45 addressing. 
When you say we don't care that the IP is different, that doesn't quite 
fit. Just because the MDI is different, the register map does not need 
to change much, so it will be easy to support future PHYs also when 
using different PHY technology.
Moving the code into TJA11xx is creating more issues, as it assumes that 
the devices which are managed by the driver are always TJA... devices 
which may not be true.


Christian


Re: Re: [RFC] vsock: add multiple transports support for dgram

2021-04-12 Thread Jiang Wang .
On Mon, Apr 12, 2021 at 7:04 AM Stefano Garzarella  wrote:
>
> Hi Jiang,
> thanks for re-starting the multi-transport support for dgram!

No problem.

> On Wed, Apr 07, 2021 at 11:25:36AM -0700, Jiang Wang . wrote:
> >On Wed, Apr 7, 2021 at 2:51 AM Jorgen Hansen  wrote:
> >>
> >>
> >> > On 6 Apr 2021, at 20:31, Jiang Wang  wrote:
> >> >
> >> > From: "jiang.wang" 
> >> >
> >> > Currently, only VMCI supports dgram sockets. To support the
> >> > nested VM use case, this patch removes transport_dgram and
> >> > uses transport_g2h and transport_h2g for dgram too.
>
> I agree on this part, I think that's the direction to go.
> transport_dgram was added as a shortcut.

Got it.

> >>
> >> Could you provide some background for introducing this change - are you
> >> looking at introducing datagrams for a different transport? VMCI datagrams
> >> already support the nested use case,
> >
> >Yes, I am trying to introduce datagram for virtio transport. I wrote a
> >spec patch for
> >virtio dgram support and also a code patch, but the code patch is still WIP.
> >When I wrote this commit message, I was thinking nested VM is the same as
> >multiple transport support. But now, I realize they are different.
> >Nested VMs may use
> >the same virtualization layer(KVM on KVM), or different virtualization layers
> >(KVM on ESXi). Thanks for letting me know that VMCI already supported nested
> >use cases. I think you mean VMCI on VMCI, right?
> >
> >> but if we need to support multiple datagram
> >> transports we need to rework how we administer port assignment for 
> >> datagrams.
> >> One specific issue is that the vmci transport won’t receive any datagrams 
> >> for a
> >> port unless the datagram socket has already been assigned the vmci 
> >> transport
> >> and the port bound to the underlying VMCI device (see below for more 
> >> details).
> >>
> >I see.
> >
> >> > The transport is assigned when sending every packet and
> >> > receiving every packet on dgram sockets.
> >>
> >> Is the intent that the same datagram socket can be used for sending 
> >> packets both
> >> in the host to guest, and the guest to host directions?
> >
> >Nope. One datagram socket will only send packets to one direction, either to 
> >the
> >host or to the guest. My above description is wrong. When sending packets, 
> >the
> >transport is assigned with the first packet (with auto_bind).
>
> I'm not sure this is right.
> The auto_bind on the first packet should only assign a local port to the
> socket, but does not affect the transport to be used.
>
> A user could send one packet to the nested guest and another to the host
> using the same socket, or am I wrong?

OK. I think you are right.

> >
> >The problem is when receiving packets. The listener can bind to the
> >VMADDR_CID_ANY
> >address. Then it is unclear which transport we should use. For stream
> >sockets, there will be a new socket for each connection, and transport
> >can be decided
> >at that time. For datagram sockets, I am not sure how to handle that.
>
> yes, this I think is the main problem, but maybe the sender one is even
> more complicated.
>
> Maybe we should remove the 1:1 association we have now between vsk and
> transport.

Yes, I thought about that too. One idea is to define two transports in vsk.
For sending pkt, we can pick the right transport when we get the packet, like
in virtio_transport_send_pkt_info(). For receiving pkts, we have to check
and call both
transports dequeue callbacks if the local cid is CID_ANY.

> At least for DGRAM, for connected sockets I think the association makes
> sense.

Yeah. For a connected socket, we will only use one transport.

> >For VMCI, does the same transport can be used for both receiving from
> >host and from
> >the guest?
>
> Yes, they're registered at different times, but it's the same transport.
>
> >
> >For virtio, the h2g and g2h transports are different, so we have to
> >choose
> >one of them. My original thought is to wait until the first packet
> >arrives.
> >
> >Another idea is that we always bind to host addr and use h2g
> >transport because I think that might
> >be more common. If a listener wants to recv packets from the host, then
> >it
> >should bind to the guest addr instead of CID_ANY.
>
> Yes, I remember we discussed this idea, this would simplify the
> receiving, but there is still the issue of a user wanting to receive
> packets from both the nested guest and the host.

OK. Agree.

> >Any other suggestions?
> >
>
> I think one solution could be to remove the 1:1 association between
> DGRAM socket and transport.
>
> IIUC VMCI creates a skb for each received packet and queues it through
> sk_receive_skb() directly in the struct sock.
>
> Then the .dgram_dequeue() callback dequeues them using
> skb_recv_datagram().
>
> We can move these parts in the vsock core, and create some helpers to
> allow the transports to enqueue received DGRAM packets in the same way
> (and with the same format) directly in the struct sock.
>

I agree 
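
A minimal sketch of the kind of core helper being discussed, assuming a
hypothetical name and placement in af_vsock.c; it illustrates the
direction, not an agreed implementation:

#include <linux/skbuff.h>
#include <net/sock.h>

/* Hypothetical core helper: every transport (VMCI, virtio, ...) would
 * hand received DGRAM skbs to the socket through this one path, so the
 * skb format and queueing behaviour stay identical across transports. */
static int vsock_dgram_recv_skb(struct sock *sk, struct sk_buff *skb)
{
	/* sk_receive_skb() charges the receive buffer and wakes readers;
	 * the generic skb_recv_datagram() can then dequeue in recvmsg(). */
	return sk_receive_skb(sk, skb, 0);
}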

Re: Re: [PATCH] usb: cdns3: Fix runtime PM imbalance on error

2021-04-11 Thread dinghao . liu
> On 21-04-07 13:22:26, Dinghao Liu wrote:
> > When cdns3_gadget_start() fails, a pairing PM usage counter
> > decrement is needed to keep the counter balanced.
> > 
> > Signed-off-by: Dinghao Liu 
> > ---
> >  drivers/usb/cdns3/cdns3-gadget.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/usb/cdns3/cdns3-gadget.c 
> > b/drivers/usb/cdns3/cdns3-gadget.c
> > index 582bfeceedb4..ad891a108aed 100644
> > --- a/drivers/usb/cdns3/cdns3-gadget.c
> > +++ b/drivers/usb/cdns3/cdns3-gadget.c
> > @@ -3255,8 +3255,11 @@ static int __cdns3_gadget_init(struct cdns *cdns)
> > pm_runtime_get_sync(cdns->dev);
> >  
> > ret = cdns3_gadget_start(cdns);
> > -   if (ret)
> > +   if (ret) {
> > +   pm_runtime_mark_last_busy(cdns->dev);
> > +   pm_runtime_put_autosuspend(cdns->dev);
> > return ret;
> 
> It doesn't need to delay entering runtime suspend, I prefer using 
> pm_runtime_put_sync directly.
> 

Sounds reasonable, thanks! I will send a new patch soon.

Regards,
Dinghao
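
For reference, a hedged sketch of the error path with the suggested
pm_runtime_put_sync() (illustrative only; the actual v2 patch may
differ):

	pm_runtime_get_sync(cdns->dev);

	ret = cdns3_gadget_start(cdns);
	if (ret) {
		/* Rebalance the usage counter and allow an immediate
		 * suspend; there is no point delaying it on failure. */
		pm_runtime_put_sync(cdns->dev);
		return ret;
	}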

Re: Re: BUG: Bad rss-counter state (4)

2021-04-11 Thread Vegard Nossum

(trimmed off the batman/bpf Ccs)

On 2020-05-18 14:28, syzbot wrote:

syzbot has bisected this bug to:

commit 0d8dd67be013727ae57645ecd3ea2c36365d7da8
Author: Song Liu 
Date:   Wed Dec 6 22:45:14 2017 +

 perf/headers: Sync new perf_event.h with the tools/include/uapi version

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13240a0210
start commit:   ac935d22 Add linux-next specific files for 20200415
git tree:   linux-next
final crash:https://syzkaller.appspot.com/x/report.txt?x=10a40a0210
console output: https://syzkaller.appspot.com/x/log.txt?x=17240a0210
kernel config:  https://syzkaller.appspot.com/x/.config?x=bc498783097e9019
dashboard link: https://syzkaller.appspot.com/bug?extid=347e2331d03d06ab0224
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=12d18e6e10
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=104170d610

Reported-by: syzbot+347e2331d03d06ab0...@syzkaller.appspotmail.com
Fixes: 0d8dd67be013 ("perf/headers: Sync new perf_event.h with the 
tools/include/uapi version")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection



FWIW here's a nicer reproducer that more clearly shows what's really
going on:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

// for compat with older perf headers
#define uprobe_path config1

int main(int argc, char *argv[])
{
// Find out what type id we need for uprobes
int perf_type_pmu_uprobe;
{
		FILE *fp = fopen("/sys/bus/event_source/devices/uprobe/type", "r");

		fscanf(fp, "%d", &perf_type_pmu_uprobe);
fclose(fp);
}

const char *filename = "./bus";

int fd = open(filename, O_RDWR|O_CREAT, 0600);
write(fd, "x", 1);

void *addr = mmap(NULL, 4096,
PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

// Register a perf uprobe on "./bus"
struct perf_event_attr attr = {};
attr.type = perf_type_pmu_uprobe;
attr.uprobe_path = (unsigned long) filename;
	syscall(__NR_perf_event_open, &attr, 0, 0, -1, 0);

	void *addr2 = mmap(NULL, 2 * 4096, PROT_NONE, MAP_PRIVATE, fd, 0);
	void *addr3 = mremap((void *) addr2, 4096, 2 * 4096, MREMAP_MAYMOVE);
	mremap(addr3, 4096, 4096, MREMAP_MAYMOVE | MREMAP_FIXED, (void *) addr2);


return 0;
}

this instantly reproduces this output on current mainline for me:

BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1

AFAICT the worst thing about this bug is that it shows up on anything
that parses logs for "BUG"; it doesn't seem to have any ill effects
other than messing up the rss counters. Although maybe it points to some
underlying problem in uprobes/mm interaction.

If I enable the "rss_stat" tracepoint and set ftrace_dump_on_oops=1, I
see a trace roughly like this:

perf_event_open()

mmap(2 * 4096):
 - uprobe_mmap()
- install_breakpoint()
   - __replace_page()
  - rss_stat: mm_id=0 curr=1 member=1 size=53248B

mremap(4096 => 2 * 4096):
 - install_breakpoint()
- __replace_page()
   - rss_stat: mm_id=0 curr=1 member=1 size=57344B
 - unmap_page_range()
- rss_stat: mm_id=0 curr=1 member=1 size=53248B

mremap(4096 => 4096):
 - move_vma()
- copy_vma()
   - vma_merge()
  - install_breakpoint()
 - __replace_page()
- rss_stat: mm_id=0 curr=1 member=1 size=57344B
 - do_munmap()
- install_breakpoint():
   - __replace_page()
  - rss_stat: mm_id=0 curr=1 member=1 size=61440B
- unmap_page_range():
   - rss_stat: mm_id=0 curr=1 member=1 size=57344B

exit()
 - exit_mmap()
- unmap_page_range():
   - rss_stat: mm_id=0 curr=0 member=1 size=45056B
- unmap_page_range():
   - rss_stat: mm_id=0 curr=0 member=1 size=32768B
- unmap_page_range():
   - rss_stat: mm_id=0 curr=0 member=1 size=20480B
- unmap_page_range():
   - rss_stat: mm_id=0 curr=0 member=1 size=16384B
- unmap_page_range():
   - rss_stat: mm_id=0 curr=0 member=1 size=4096B

What strikes me here is that at the end of the first mremap(), we have
size 53248B (13 pages), but at the end of the second mremap(), we have
size 57344B (14 pages), even though the second mremap() is only moving 1
page. So the second mremap() is bumping it up twice, but then only
bumping down once.


Vegard


Re: Re: [PATCH] dmaengine: tegra20: Fix runtime PM imbalance in tegra_dma_issue_pending

2021-04-09 Thread dinghao . liu
> On 08/04/2021 08:11, Dinghao Liu wrote:
> > pm_runtime_get_sync() will increase the runtime PM counter
> > even if it returns an error. Thus a pairing decrement is needed
> > to prevent refcount leak. Fix this by replacing this API with
> > pm_runtime_resume_and_get(), which will not change the runtime
> > PM counter on error.
> > 
> > Signed-off-by: Dinghao Liu 
> > ---
> >  drivers/dma/tegra20-apb-dma.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/dma/tegra20-apb-dma.c b/drivers/dma/tegra20-apb-dma.c
> > index 71827d9b0aa1..73178afaf4c2 100644
> > --- a/drivers/dma/tegra20-apb-dma.c
> > +++ b/drivers/dma/tegra20-apb-dma.c
> > @@ -723,7 +723,7 @@ static void tegra_dma_issue_pending(struct dma_chan *dc)
> > goto end;
> > }
> > if (!tdc->busy) {
> > -   err = pm_runtime_get_sync(tdc->tdma->dev);
> > +   err = pm_runtime_resume_and_get(tdc->tdma->dev);
> > if (err < 0) {
> > dev_err(tdc2dev(tdc), "Failed to enable DMA\n");
> > goto end;
> > 
> 
> 
> Thanks! Looks like there are two instances of this that need fixing.
> 

Thanks for pointing out this! I will fix this and send a new patch soon.

Regards,
Dinghao

Re: Re: [PATCH] media: imx: imx7-mipi-csis: Fix runtime PM imbalance in mipi_csis_s_stream

2021-04-09 Thread dinghao . liu
> Hi Liu,
> Thanks for your patch.
> 
> On Thu, Apr 08, 2021 at 05:08:27PM +0800, Dinghao Liu wrote:
> > When v4l2_subdev_call() fails, a pairing PM usage counter
> > decrement is needed to keep the counter balanced. It's the
> > same for the following error paths in case 'enable' is on.
> > 
> > Signed-off-by: Dinghao Liu 
> > ---
> >  drivers/staging/media/imx/imx7-mipi-csis.c | 9 +++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/staging/media/imx/imx7-mipi-csis.c 
> > b/drivers/staging/media/imx/imx7-mipi-csis.c
> > index a01a7364b4b9..2a3fff231a40 100644
> > --- a/drivers/staging/media/imx/imx7-mipi-csis.c
> > +++ b/drivers/staging/media/imx/imx7-mipi-csis.c
> > @@ -627,21 +627,26 @@ static int mipi_csis_s_stream(struct v4l2_subdev 
> > *mipi_sd, int enable)
> > return ret;
> > }
> > ret = v4l2_subdev_call(state->src_sd, core, s_power, 1);
> > -   if (ret < 0)
> > +   if (ret < 0) {
> > +   pm_runtime_put_noidle(&state->pdev->dev);
> 
> I think here we should do a full pm_runtime_put to call
> mipi_csis_pm_suspend down the line, right?
> 
> > return ret;
> > +   }
> > }
> >  
> > mutex_lock(&state->lock);
> > if (enable) {
> > if (state->flags & ST_SUSPENDED) {
> > ret = -EBUSY;
> > +   pm_runtime_put_noidle(&state->pdev->dev);
> 
> since we are in ST_SUSPENDED state, for sure the pm counter was
> already 0.
> 
> > goto unlock;
> > }
> >  
> > mipi_csis_start_stream(state);
> > ret = v4l2_subdev_call(state->src_sd, video, s_stream, 1);
> > -   if (ret < 0)
> > +   if (ret < 0) {
> > +   pm_runtime_put_noidle(&state->pdev->dev);
> 
> here we also need the pm_runtime_put; maybe just changing the unlock
> tag below from:
> if (!enable)
> pm_runtime_put(&state->pdev->dev);
> 
> to 
> if (!enable || (ret < 0))
> pm_runtime_put(&state->pdev->dev);
> 
> will not hurt the first case and will complete the suspend routine
> afterward in the second case.
> 

This is much clearer, thanks! I will fix this and send a new patch soon.

Regards,
Dinghao
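
A hedged sketch of the resulting unlock tail with the suggested
condition (illustrative; the actual v2 patch may differ):

unlock:
	mutex_unlock(&state->lock);

	/* Drop the PM reference when stopping, and also when starting
	 * failed, so the suspend routine still runs on the error path. */
	if (!enable || ret < 0)
		pm_runtime_put(&state->pdev->dev);

	return ret;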

Re: Re: [PATCH] spi: spi-zynqmp-gqspi: Fix runtime PM imbalance in zynqmp_qspi_probe

2021-04-09 Thread dinghao . liu
> Hi Dinghao,
> 
> On 4/8/21 6:33 PM, Michal Simek wrote:
> > ++
> >
> > On 4/8/21 11:25 AM, Dinghao Liu wrote:
> >> When platform_get_irq() fails, a pairing PM usage counter
> >> increment is needed to keep the counter balanced. It's the
> >> same for the following error paths.
> >>
> >> Signed-off-by: Dinghao Liu 
> >> ---
> >>   drivers/spi/spi-zynqmp-gqspi.c | 1 +
> >>   1 file changed, 1 insertion(+)
> >>
> >> diff --git a/drivers/spi/spi-zynqmp-gqspi.c 
> >> b/drivers/spi/spi-zynqmp-gqspi.c
> >> index c8fa6ee18ae7..95963a2de64a 100644
> >> --- a/drivers/spi/spi-zynqmp-gqspi.c
> >> +++ b/drivers/spi/spi-zynqmp-gqspi.c
> >> @@ -1197,6 +1197,7 @@ static int zynqmp_qspi_probe(struct platform_device 
> >> *pdev)
> >>return 0;
> >>   
> >>   clk_dis_all:
> >> +  pm_runtime_get_noresume(&pdev->dev);
> >>    pm_runtime_set_suspended(&pdev->dev);
> >>    pm_runtime_disable(&pdev->dev);
> >>clk_disable_unprepare(xqspi->refclk);
> >>
> The imbalance is because pm_runtime_put_autosuspend is called, making 
> the counter -1.
> 
> It looks strange that there is no counter increment op before 
> pm_runtime_put_autosuspend.
> 
> In my limited understanding, it should look like:
> 
> ..
> 
> pm_runtime_enable
> 
> pm_runtime_get_sync   //increase counter to one to resume device
> 
> DO OPERATIONS HERE
> 
> pm_runtime_mark_last_busy
> pm_runtime_put_autosuspend   //decrease counter to zero and trigger suspend
> 
> return 0;
> 
> error_path:
> 
> pm_runtime_put_sync
> 
> pm_runtime_disable
> 
> return err;
> 
> 
> Am I missing something?
> 

Thanks for pointing this out! Usually there is a refcount increment in a 
_probe function and a decrement in a _remove function. Sometimes the 
decrement is in the _probe and the increment is in the _remove. But the 
refcount is balanced in both cases. So I think zynqmp_qspi_remove() 
needs a refcount increment to fix this bug.
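
A minimal sketch of that balanced pairing, assuming a hypothetical
driver with hypothetical do_hw_setup()/do_hw_teardown() helpers; this is
not the actual spi-zynqmp-gqspi.c code:

#include <linux/platform_device.h>
#include <linux/pm_runtime.h>

static int example_probe(struct platform_device *pdev)
{
	int ret;

	pm_runtime_set_active(&pdev->dev);
	pm_runtime_enable(&pdev->dev);
	pm_runtime_get_sync(&pdev->dev);	/* usage count: 1, resumed */

	ret = do_hw_setup(pdev);		/* hypothetical helper */
	if (ret)
		goto err_pm;

	pm_runtime_mark_last_busy(&pdev->dev);
	pm_runtime_put_autosuspend(&pdev->dev);	/* usage count: 0 */
	return 0;

err_pm:
	pm_runtime_put_sync(&pdev->dev);	/* back to 0, may suspend */
	pm_runtime_disable(&pdev->dev);
	return ret;
}

static int example_remove(struct platform_device *pdev)
{
	/* Probe left the count at 0; take a reference before touching the
	 * hardware, then drop it and disable runtime PM, keeping the
	 * counter balanced across probe and remove. */
	pm_runtime_get_sync(&pdev->dev);
	do_hw_teardown(pdev);			/* hypothetical helper */
	pm_runtime_put_noidle(&pdev->dev);
	pm_runtime_disable(&pdev->dev);
	return 0;
}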

Regards,
Dinghao

Re: Re: [PATCH] Input: cyapa - Fix runtime PM imbalance on error

2021-04-08 Thread dinghao . liu
> Hi Dinghao,
> 
> On Wed, Apr 07, 2021 at 12:07:38PM +0800, Dinghao Liu wrote:
> > When mutex_lock_interruptible() fails, a pairing PM usage
> > counter decrement is needed to keep the counter balanced.
> 
> Thank you for the patch.
> 
> > 
> > Signed-off-by: Dinghao Liu 
> > ---
> >  drivers/input/mouse/cyapa.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/input/mouse/cyapa.c b/drivers/input/mouse/cyapa.c
> > index 77cc653edca2..e411ab45a218 100644
> > --- a/drivers/input/mouse/cyapa.c
> > +++ b/drivers/input/mouse/cyapa.c
> > @@ -904,8 +904,10 @@ static ssize_t cyapa_update_rt_suspend_scanrate(struct 
> > device *dev,
> > pm_runtime_get_sync(dev);
> >  
> > error = mutex_lock_interruptible(&cyapa->state_sync_lock);
> > -   if (error)
> > +   if (error) {
> > +   pm_runtime_put_noidle(dev);
> 
> Why "noidle" and not pm_runtime_put_sync_autosuspend() like we do in
> case of normal flow?
> 

pm_runtime_put_noidle() only decreases the refcount, while 
pm_runtime_put_sync_autosuspend() will execute an extra
pm_runtime_autosuspend(). I'm not sure if the autosuspend is necessary
in this error handling path, so I only balance the counter.
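
A hedged illustration of the two decrements under discussion
(hypothetical flow, not the driver code):

	pm_runtime_get_sync(dev);	/* usage count +1, resume device */

	error = mutex_lock_interruptible(&lock);
	if (error) {
		/* Rebalance the count only; no suspend attempt here. */
		pm_runtime_put_noidle(dev);
		return error;
	}

	/* ... do the work, unlock ... */

	/* Rebalance and also try to autosuspend the device right away. */
	pm_runtime_put_sync_autosuspend(dev);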

Regards,
Dinghao


Re: Re: [PATCH] ASoC: codecs: Fix runtime PM imbalance in tas2552_probe

2021-04-07 Thread dinghao . liu
> On Wed, Apr 07, 2021 at 02:54:00PM +0800, Dinghao Liu wrote:
> 
> > -   pm_runtime_set_active(&client->dev);
> > -   pm_runtime_set_autosuspend_delay(&client->dev, 1000);
> > -   pm_runtime_use_autosuspend(&client->dev);
> > -   pm_runtime_enable(&client->dev);
> > -   pm_runtime_mark_last_busy(&client->dev);
> > -   pm_runtime_put_sync_autosuspend(&client->dev);
> > -
> > dev_set_drvdata(&client->dev, data);
> >  
> > ret = devm_snd_soc_register_component(&client->dev,
> > @@ -733,6 +726,13 @@ static int tas2552_probe(struct i2c_client *client,
> > if (ret < 0)
> > dev_err(&client->dev, "Failed to register component: %d\n", 
> > ret);
> >  
> > +   pm_runtime_set_active(&client->dev);
> > +   pm_runtime_set_autosuspend_delay(&client->dev, 1000);
> > +   pm_runtime_use_autosuspend(&client->dev);
> 
> It's not clear to me that just moving the operations after the
> registration is a good fix: once the component is registered, we could
> start trying to do runtime PM operations with it, which AFAIR won't
> count references and so on properly if runtime PM isn't enabled, so if
> we later enable runtime PM we might have the rest of the code in a
> confused state about what's going on.

Thanks for your advice. I checked the uses of devm_snd_soc_register_component() 
in the kernel and found that sometimes runtime PM is enabled before registration 
and sometimes after it. To be on the safe side, I will send a new patch to fix 
this in the error handling path.

Regards,
Dinghao


RE: Re: [PATCH 1/2] scsi: ufs: Introduce hba performance monitor sysfs nodes

2021-04-06 Thread Daejun Park
Hi Can Guo,
> 
>Hi Daejun,
> 
>On 2021-04-06 12:11, Daejun Park wrote:
>> Hi Can Guo,
>> 
>>> +static ssize_t monitor_enable_store(struct device *dev,
>>> +struct device_attribute *attr,
>>> +const char *buf, size_t count)
>>> +{
>>> +struct ufs_hba *hba = dev_get_drvdata(dev);
>>> +unsigned long value, flags;
>>> +
>>> +if (kstrtoul(buf, 0, &value))
>>> +return -EINVAL;
>>> +
>>> +value = !!value;
>>> +spin_lock_irqsave(hba->host->host_lock, flags);
>>> +if (value == hba->monitor.enabled)
>>> +goto out_unlock;
>>> +
>>> +if (!value) {
>>> +memset(&hba->monitor, 0, sizeof(hba->monitor));
>>> +} else {
>>> +hba->monitor.enabled = true;
>>> +hba->monitor.enabled_ts = ktime_get();
>> 
>> How about setting lat_max to KTIME_MAX and lat_min to 0?
> 
>lat_min is already 0. What is the benefit of setting lat_max to 
>KTIME_MAX?
> 
>> I think lat_sum should be 0 at this point.
> 
>lat_sum is already 0 at this point, what is the problem?

Sorry. I misunderstood about resetting monitor values.

> 
>> 
>>> +}
>>> +
>>> +out_unlock:
>>> +spin_unlock_irqrestore(hba->host->host_lock, flags);
>>> +return count;
>>> +}
>> 
>> 
>>> +static void ufshcd_update_monitor(struct ufs_hba *hba, struct 
>>> ufshcd_lrb *lrbp)
>>> +{
>>> +int dir = ufshcd_monitor_opcode2dir(*lrbp->cmd->cmnd);
>>> +
>>> +if (dir >= 0 && hba->monitor.nr_queued[dir] > 0) {
>>> +struct request *req = lrbp->cmd->request;
>>> +struct ufs_hba_monitor *m = &hba->monitor;
>>> +ktime_t now, inc, lat;
>>> +
>>> +now = ktime_get();
>> 
>> How about using lrbp->compl_time_stamp instead of getting a new value?
> 
>I am expecting "now" to keep increasing and use it to update 
>m->busy_start_ts, but if I use lrbp->compl_time_stamp to do that, the 
>ktime_sub() below may give me an unexpected value, as 
>lrbp->compl_time_stamp may be smaller than m->busy_start_ts: the actual 
>requests are not completed by the device in the exact same ordering as 
>the bits set in hba->outstanding_tasks, but the driver is completing 
>them from bit 0 to bit 31 in ascending order.

lrbp->compl_time_stamp is set just before calling ufshcd_update_monitor().
And I don't think it can be a negative value, because ufshcd_send_command()
and __ufshcd_transfer_req_compl() are protected by the host lock.

> 
>> 
>>> +inc = ktime_sub(now, m->busy_start_ts[dir]);
>>> +m->total_busy[dir] = ktime_add(m->total_busy[dir], 
>>> inc);
>>> +m->nr_sec_rw[dir] += blk_rq_sectors(req);
>>> +
>>> +/* Update latencies */
>>> +m->nr_req[dir]++;
>>> +lat = ktime_sub(now, lrbp->issue_time_stamp);
>>> +m->lat_sum[dir] += lat;
>>> +if (m->lat_max[dir] < lat || !m->lat_max[dir])
>>> +m->lat_max[dir] = lat;
>>> +if (m->lat_min[dir] > lat || !m->lat_min[dir])
>>> +m->lat_min[dir] = lat;
>> 
>> This if statement can be shortened, by setting lat_max / lat_min to 
>> default values.
> 
>I don't quite get it, can you show me the code sample?

I think " || !m->lat_max[dir]" can be removed.

if (m->lat_max[dir] < lat)
	m->lat_max[dir] = lat;
if (m->lat_min[dir] > lat)
	m->lat_min[dir] = lat;
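
A hedged sketch of the enable-path initialization that would let the
shortened comparisons above work (the sentinel values and their
placement in monitor_enable_store() are assumptions, not the actual
patch):

	} else {
		hba->monitor.enabled = true;
		hba->monitor.enabled_ts = ktime_get();
		/* Sentinels, so the per-request comparisons need no extra
		 * zero checks: any real latency is < KTIME_MAX and > 0. */
		hba->monitor.lat_min[0] = KTIME_MAX;
		hba->monitor.lat_min[1] = KTIME_MAX;
		/* lat_max[] stays 0 from the earlier memset. */
	}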

Thanks,
Daejun

> 
>Thanks,
>Can Guo
> 
>> 
>>> +
>>> +m->nr_queued[dir]--;
>>> +/* Push forward the busy start of monitor */
>>> +m->busy_start_ts[dir] = now;
>>> +}
>>> +}
>> 
>> Thanks,
>> Daejun


Re: Re: [Drbd-dev] [PATCH] drbd: Fix a use after free in get_initial_state

2021-04-01 Thread lyl2019



> -----Original Message-----
> From: "Christoph Böhmwalder" 
> Sent: 2021-04-01 21:01:20 (Thursday)
> To: "Lv Yunlong" 
> Cc: philipp.reis...@linbit.com, lars.ellenb...@linbit.com, ax...@kernel.dk, 
> linux-bl...@vger.kernel.org, linux-kernel@vger.kernel.org, 
> drbd-...@lists.linbit.com
> Subject: Re: [Drbd-dev] [PATCH] drbd: Fix a use after free in get_initial_state
> 
> On 4/1/21 1:57 PM, Lv Yunlong wrote:
> > In get_initial_state, it calls notify_initial_state_done(skb,..) if
> > cb->args[5]==1. I see that if genlmsg_put() fails in
> > notify_initial_state_done(), the skb will be freed by nlmsg_free(skb).
> > Then get_initial_state will goto out and the freed skb will be used by
> > the return value skb->len.
> > 
> > My patch saves skb_len = skb->len and returns skb_len to avoid the UAF.
> > 
> > Fixes: a29728463b254 ("drbd: Backport the "events2" command")
> > Signed-off-by: Lv Yunlong 
> > ---
> >   drivers/block/drbd/drbd_nl.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
> > index bf7de4c7b96c..474f84675d0a 100644
> > --- a/drivers/block/drbd/drbd_nl.c
> > +++ b/drivers/block/drbd/drbd_nl.c
> > @@ -4905,6 +4905,7 @@ static int get_initial_state(struct sk_buff *skb, 
> > struct netlink_callback *cb)
> > struct drbd_state_change *state_change = (struct drbd_state_change 
> > *)cb->args[0];
> > unsigned int seq = cb->args[2];
> > unsigned int n;
> > +   unsigned int skb_len = skb->len;
> > enum drbd_notification_type flags = 0;
> >   
> > /* There is no need for taking notification_mutex here: it doesn't
> > @@ -4915,7 +4916,7 @@ static int get_initial_state(struct sk_buff *skb, 
> > struct netlink_callback *cb)
> > cb->args[5]--;
> > if (cb->args[5] == 1) {
> > notify_initial_state_done(skb, seq);
> > -   goto out;
> > +   return skb_len;
> > }
> > n = cb->args[4]++;
> > if (cb->args[4] < cb->args[3])
> > 
> 
> Thanks for the patch!
> 
> I think the problem goes even further: skb can also be freed in the 
> notify_*_state_change -> notify_*_state calls below.
> 
> Also, at the point where we save skb->len into skb_len, skb is not 
> initialized yet. Maybe it makes more sense to not return a length in the 
> first place here, but an error code instead.
> 
> -- 
> Christoph Böhmwalder
> LINBIT | Keeping the Digital World Running
> DRBD HA —  Disaster Recovery — Software defined Storage

Ok, I see.
I found that drbd_adm_get_initial_state() calls get_initial_state(),
and returns -ENOMEM if its call to remember_old_state() fails.

So I think that means if get_initial_state() fails in
notify_initial_state_done(), it should return -ENOMEM too.

I will submit the PATCH v2 to fix the first place. Fixing the
further problem is hard for me.

Thanks.


Re: Re: Re: [Drbd-dev] [PATCH] drbd: Fix a use after free in get_initial_state

2021-04-01 Thread lyl2019



> -----Original Message-----
> From: lyl2...@mail.ustc.edu.cn
> Sent: 2021-04-01 23:13:58 (Thursday)
> To: "Christoph Böhmwalder" 
> Cc: philipp.reis...@linbit.com, lars.ellenb...@linbit.com, ax...@kernel.dk, 
> linux-bl...@vger.kernel.org, linux-kernel@vger.kernel.org, 
> drbd-...@lists.linbit.com
> Subject: Re: Re: [Drbd-dev] [PATCH] drbd: Fix a use after free in get_initial_state
> 
> 
> 
> 
> > -----Original Message-----
> > From: "Christoph Böhmwalder" 
> > Sent: 2021-04-01 21:01:20 (Thursday)
> > To: "Lv Yunlong" 
> > Cc: philipp.reis...@linbit.com, lars.ellenb...@linbit.com, ax...@kernel.dk, 
> > linux-bl...@vger.kernel.org, linux-kernel@vger.kernel.org, 
> > drbd-...@lists.linbit.com
> > Subject: Re: [Drbd-dev] [PATCH] drbd: Fix a use after free in get_initial_state
> > 
> > On 4/1/21 1:57 PM, Lv Yunlong wrote:
> > > In get_initial_state, it calls notify_initial_state_done(skb,..) if
> > > cb->args[5]==1. I see that if genlmsg_put() fails in
> > > notify_initial_state_done(), the skb will be freed by nlmsg_free(skb).
> > > Then get_initial_state will goto out and the freed skb will be used by
> > > the return value skb->len.
> > > 
> > > My patch saves skb_len = skb->len and returns skb_len to avoid the UAF.
> > > 
> > > Fixes: a29728463b254 ("drbd: Backport the "events2" command")
> > > Signed-off-by: Lv Yunlong 
> > > ---
> > >   drivers/block/drbd/drbd_nl.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
> > > index bf7de4c7b96c..474f84675d0a 100644
> > > --- a/drivers/block/drbd/drbd_nl.c
> > > +++ b/drivers/block/drbd/drbd_nl.c
> > > @@ -4905,6 +4905,7 @@ static int get_initial_state(struct sk_buff *skb, 
> > > struct netlink_callback *cb)
> > >   struct drbd_state_change *state_change = (struct 
> > > drbd_state_change *)cb->args[0];
> > >   unsigned int seq = cb->args[2];
> > >   unsigned int n;
> > > + unsigned int skb_len = skb->len;
> > >   enum drbd_notification_type flags = 0;
> > >   
> > >   /* There is no need for taking notification_mutex here: it 
> > > doesn't
> > > @@ -4915,7 +4916,7 @@ static int get_initial_state(struct sk_buff *skb, 
> > > struct netlink_callback *cb)
> > >   cb->args[5]--;
> > >   if (cb->args[5] == 1) {
> > >   notify_initial_state_done(skb, seq);
> > > - goto out;
> > > + return skb_len;
> > >   }
> > >   n = cb->args[4]++;
> > >   if (cb->args[4] < cb->args[3])
> > > 
> > 
> > Thanks for the patch!
> > 
> > I think the problem goes even further: skb can also be freed in the 
> > notify_*_state_change -> notify_*_state calls below.
> > 
> > Also, at the point where we save skb->len into skb_len, skb is not 
> > initialized yet. Maybe it makes more sense to not return a length in the 
> > first place here, but an error code instead.
> > 
> > -- 
> > Christoph Böhmwalder
> > LINBIT | Keeping the Digital World Running
> > DRBD HA —  Disaster Recovery — Software defined Storage
> 
> Ok, I see.
> I found that drbd_adm_get_initial_state() calls get_initial_state(),
> and returns -ENOMEM if its call to remember_old_state() fails.
> 
> So I think that means if get_initial_state() fails in
> notify_initial_state_done(), it should return -ENOMEM too.
> 
> I will submit the PATCH v2 to fix the first place. Fixing the
> further problem is hard for me.
> 
> Thanks.

I found that notify_initial_state_done() uses err = -EMSGSIZE, so the
first place should return -EMSGSIZE, not -ENOMEM. Sorry.
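
A hedged sketch of where this lands in get_initial_state(); making
notify_initial_state_done() return its error is one hypothetical shape
of the fix, not necessarily the final patch:

	int err;

	cb->args[5]--;
	if (cb->args[5] == 1) {
		/* Propagate the error instead of reading skb->len, since
		 * the skb may already have been freed on failure. */
		err = notify_initial_state_done(skb, seq);
		return err;
	}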


RE: Re: linux-next: Tree for Mar 31 (drivers/phy/marvell/phy-mvebu-cp110-utmi.o)

2021-04-01 Thread Kostya Porotchkin
Hi, Randy,

> -Original Message-
> From: Randy Dunlap 
> Sent: Wednesday, March 31, 2021 18:28
> To: Stephen Rothwell ; Linux Next Mailing List <linux-n...@vger.kernel.org>
> Cc: Linux Kernel Mailing List ; Kostya
> Porotchkin ; net...@vger.kernel.org
> Subject: [EXT] Re: linux-next: Tree for Mar 31 (drivers/phy/marvell/phy-mvebu-
> cp110-utmi.o)
> 


> 
> on i386:
> 
> ld: drivers/phy/marvell/phy-mvebu-cp110-utmi.o: in function
> `mvebu_cp110_utmi_phy_probe':
> phy-mvebu-cp110-utmi.c:(.text+0x152): undefined reference to
> `of_usb_get_dr_mode_by_phy'
> 
[KP] This driver depends on ARCH_MVEBU (arm64).
How does it happen that it is included in i386 builds?

Regards
Kosta
> 
> Full randconfig file is attached.
> 
> --
> ~Randy
> Reported-by: Randy Dunlap 
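
The usual resolution for a randconfig failure like this is a Kconfig
dependency; a hedged sketch follows (the COMPILE_TEST clause is an
assumption, and is also how an ARCH_MVEBU driver ends up in i386
randconfig builds in the first place; the final fix may instead select
USB_COMMON):

config PHY_MVEBU_CP110_UTMI
	tristate "Marvell CP110 UTMI PHY driver"
	depends on ARCH_MVEBU || COMPILE_TEST
	# of_usb_get_dr_mode_by_phy() is provided by USB common code
	depends on USB_COMMON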


RE: Re: [PATCH] clk: imx8mp: Remove the non-existent PCIe clocks

2021-03-31 Thread Richard Zhu

> -Original Message-
> From: Stephen Boyd 
> Sent: Wednesday, March 31, 2021 10:17 AM
> To: Richard Zhu ; Abel Vesa ;
> Jacky Bai ; shawn...@kernel.org
> Cc: dl-linux-imx ; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-...@vger.kernel.org; Richard Zhu
> 
> Subject: Re: [PATCH] clk: imx8mp: Remove the non-existent PCIe clocks
> Quoting Richard Zhu (2021-03-15 01:17:47)
> > In the i.MX8MP PCIe design, the PCIe PHY REF clock comes from an external
> > OSC or the internal system PLL. It is configured in the IOMUX_GPR14
> > register directly, and can't be controlled by the CCM at all.
> > Remove the PCIe PHY clock from the clock driver to clean up the code.
> > There is only one PCIe in i.MX8MP; remove the non-existent second PCIe
> > related clocks.
> > Remove the non-existent clock IDs as well.
> >
> > Signed-off-by: Richard Zhu 
> > Reviewed-by: Jason Liu 
> > ---
> 
> Any Fixes tag?
[Richard Zhu] Hi Stephen: 
The code is changed to follow the RM document updates.
It doesn't fix a bug introduced by a previous commit.

Best Regards
Richard Zhu


Re: Re: [PATCH] net/rds: Fix a use after free in rds_message_map_pages

2021-03-30 Thread lyl2019



> -原始邮件-
> 发件人: "David Miller" 
> 发送时间: 2021-03-31 08:02:28 (星期三)
> 收件人: lyl2...@mail.ustc.edu.cn
> 抄送: santosh.shilim...@oracle.com, k...@kernel.org, net...@vger.kernel.org, 
> linux-r...@vger.kernel.org, rds-de...@oss.oracle.com, 
> linux-kernel@vger.kernel.org
> 主题: Re: [PATCH] net/rds: Fix a use after free in rds_message_map_pages
> 
> From: Lv Yunlong 
> Date: Tue, 30 Mar 2021 03:16:02 -0700
> 
> > @@ -348,7 +348,7 @@ struct rds_message *rds_message_map_pages(unsigned long 
> > *page_addrs, unsigned in
> > rm->data.op_sg = rds_message_alloc_sgs(rm, num_sgs);
> > if (IS_ERR(rm->data.op_sg)) {
> > rds_message_put(rm);
> > -   return ERR_CAST(rm->data.op_sg);
> > +   return ERR_PTR(-ENOMEM);
> > }
> >  
> > for (i = 0; i < rm->data.op_nents; ++i) {
> 
> Maybe instead do:
> 
>   int err = ERR_CAST(rm->data.op_sg);
>   rds_message_put(rm);
>   return err;
> 
> Then if rds_message_alloc_sgs() starts to return other errors, they will 
> propagate.
> 
> Thank you.

The return type of ERR_CAST() is void *, not int. 
I think the correct patch is:

void *err = ERR_CAST(rm->data.op_sg);
rds_message_put(rm);
return err;

I have submitted the PATCH v2 for you to review.

Thanks.


Re: Re: [PATCH] amd: display: dc: struct dc_state is declared twice

2021-03-30 Thread Alex Deucher
On Mon, Mar 29, 2021 at 9:36 PM 万家兵  wrote:
>
>
> >On Sat, Mar 27, 2021 at 3:28 AM Wan Jiabing  wrote:
> >>
> >> struct dc_state has already been declared at line 273.
> >> Remove the duplicate.
> >> Delete duplicate blank lines.
> >
> >Can you split these into separate patches?
> >
> >Alex
>
> OK. But in fact, what I did is simple.
> The most important thing is removing the duplicate
> struct dc_state declaration at line 585.
> The others are all deletions of duplicate blank lines.
>
> So maybe I should send two patches, one removing the
> duplicate declaration, the other deleting blank lines?
>

Yes.  There are so many whitespace changes in the commit that it's
hard to see the functional change.

Alex


> >>
> >> Signed-off-by: Wan Jiabing 
> >> ---
> >>  drivers/gpu/drm/amd/display/dc/dc.h | 10 --
> >>  1 file changed, 10 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/display/dc/dc.h 
> >> b/drivers/gpu/drm/amd/display/dc/dc.h
> >> index 18ed0d3f247e..dc667298ab5b 100644
> >> --- a/drivers/gpu/drm/amd/display/dc/dc.h
> >> +++ b/drivers/gpu/drm/amd/display/dc/dc.h
> >> @@ -234,7 +234,6 @@ struct dc_static_screen_params {
> >> unsigned int num_frames;
> >>  };
> >>
> >> -
> >>  /* Surface update type is used by dc_update_surfaces_and_stream
> >>   * The update type is determined at the very beginning of the function 
> >> based
> >>   * on parameters passed in and decides how much programming (or updating) 
> >> is
> >> @@ -272,7 +271,6 @@ struct dc;
> >>  struct dc_plane_state;
> >>  struct dc_state;
> >>
> >> -
> >>  struct dc_cap_funcs {
> >> bool (*get_dcc_compression_cap)(const struct dc *dc,
> >> const struct dc_dcc_surface_param *input,
> >> @@ -281,7 +279,6 @@ struct dc_cap_funcs {
> >>
> >>  struct link_training_settings;
> >>
> >> -
> >>  /* Structure to hold configuration flags set by dm at dc creation. */
> >>  struct dc_config {
> >> bool gpu_vm_support;
> >> @@ -581,7 +578,6 @@ struct dc_bounding_box_overrides {
> >> int min_dcfclk_mhz;
> >>  };
> >>
> >> -struct dc_state;
>
> The duplicate being removed is here.
> The others are all deletions of duplicate blank lines.
>
> I think they are in the same file. I want to remove the declaration first.
> By the way, I deleted the blank lines.
>
> Yours,
> Wan Jiabing
>
> >>  struct resource_pool;
> >>  struct dce_hwseq;
> >>  struct gpu_info_soc_bounding_box_v1_0;
> >> @@ -757,7 +753,6 @@ enum dc_transfer_func_predefined {
> >> TRANSFER_FUNCTION_GAMMA26
> >>  };
> >>
> >> -
> >>  struct dc_transfer_func {
> >> struct kref refcount;
> >> enum dc_transfer_func_type type;
> >> @@ -770,7 +765,6 @@ struct dc_transfer_func {
> >> };
> >>  };
> >>
> >> -
> >>  union dc_3dlut_state {
> >> struct {
> >> uint32_t initialized:1; /*if 3dlut is went through 
> >> color module for initialization */
> >> @@ -784,7 +778,6 @@ union dc_3dlut_state {
> >> uint32_t raw;
> >>  };
> >>
> >> -
> >>  struct dc_3dlut {
> >> struct kref refcount;
> >> struct tetrahedral_params lut_3d;
> >> @@ -1014,7 +1007,6 @@ enum dc_status dc_validate_global_state(
> >> struct dc_state *new_ctx,
> >> bool fast_validate);
> >>
> >> -
> >>  void dc_resource_state_construct(
> >> const struct dc *dc,
> >> struct dc_state *dst_ctx);
> >> @@ -1167,7 +1159,6 @@ struct dc_container_id {
> >> unsigned short productCode;
> >>  };
> >>
> >> -
> >>  struct dc_sink_dsc_caps {
> >> // 'true' if these are virtual DPCD's DSC caps (immediately 
> >> upstream of sink in MST topology),
> >> // 'false' if they are sink's DSC caps
> >> @@ -1229,7 +1220,6 @@ struct dc_cursor {
> >> struct dc_cursor_attributes attributes;
> >>  };
> >>
> >> -
> >>  
> >> /***
> >>   * Interrupt interfaces
> >>   **/
> >> --
> >> 2.25.1
> >>
> >> ___
> >> dri-devel mailing list
> >> dri-de...@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
>


Re: Re: [PATCH] dma: Fix a double free in dma_async_device_register

2021-03-30 Thread lyl2019



> -----Original Message-----
> From: "Dave Jiang" 
> Sent: 2021-03-31 00:05:15 (Wednesday)
> To: "Lv Yunlong" , vk...@kernel.org
> Cc: dmaeng...@vger.kernel.org, linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] dma: Fix a double free in dma_async_device_register
> 
> 
> On 3/30/2021 2:01 AM, Lv Yunlong wrote:
> > In the first list_for_each_entry() macro of dma_async_device_register,
> > it gets the chan from the list and calls __dma_async_device_channel_register
> > (..,chan). We can see that chan->local is allocated by alloc_percpu() and
> > is freed by free_percpu(chan->local) when
> > __dma_async_device_channel_register() fails.
> >
> > But after __dma_async_device_channel_register() fails, the caller will
> > goto err_out and free chan->local a second time via free_percpu().
> >
> > The cause of this problem is forgetting to set chan->local to NULL when
> > chan->local was freed in __dma_async_device_channel_register(). My
> > patch sets chan->local to NULL when the callee fails, to avoid the double free.
> 
> Thanks for the fix. I think it would make sense to set it to NULL in 
> the __dma_async_device_channel_register() cleanup path after it calls 
> free_percpu(chan->local), right? That would address any other instances 
> of this issue happening elsewhere.
> 
> 
> >
> > Fixes: d2fb0a0438384 ("dmaengine: break out channel registration")
> > Signed-off-by: Lv Yunlong 
> > ---
> >   drivers/dma/dmaengine.c | 4 +++-
> >   1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
> > index fe6a460c4373..fef64b198c95 100644
> > --- a/drivers/dma/dmaengine.c
> > +++ b/drivers/dma/dmaengine.c
> > @@ -1249,8 +1249,10 @@ int dma_async_device_register(struct dma_device 
> > *device)
> > /* represent channels in sysfs. Probably want devs too */
> > list_for_each_entry(chan, &device->channels, device_node) {
> > rc = __dma_async_device_channel_register(device, chan);
> > -   if (rc < 0)
> > +   if (rc < 0) {
> > +   chan->local = NULL;
> > goto err_out;
> > +   }
> > }
> >   
> > mutex_lock(&dma_list_mutex);

Ok, that is a good idea. I have submitted the PATCH v2.
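
A hedged sketch of the cleanup-path variant Dave suggests (illustrative,
not the merged v2):

	/* In __dma_async_device_channel_register()'s error path: */
	free_percpu(chan->local);
	chan->local = NULL;	/* so dma_async_device_register()'s err_out
				 * path cannot free the percpu data twice */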

Thanks.

