Re: How to properly fix reading user pointers in bpf in android kernel 4.9?

2024-05-24 Thread Bagas Sanjaya
[also Cc: bpf maintainers and get_maintainer output]

On Thu, May 23, 2024 at 07:52:22PM +0300, Marcel wrote:
> This seems that it was a long standing problem with the Linux kernel in 
> general. bpf_probe_read should have worked for both kernel and user pointers 
> but it fails with access error when reading an user one instead. 
> 
> I know there's a patch upstream that fixes this by introducing new helpers 
> for reading kernel and userspace pointers and I tried to back port them back 
> to my kernel but with no success. Tools like bcc fail to use them and instead 
> they report that the arguments sent to the helpers are invalid. I assume this 
> is due to the arguments ARG_CONST_STACK_SIZE and ARG_PTR_TO_RAW_STACK handle 
> data different in the 4.9 android version and the upstream version but I'm 
> not sure that this is the cause. I left the patch I did below and with a link 
> to the kernel I'm working on and maybe someone can take a look and give me an 
> hand (the patch isn't applied yet)

What upstream patch? Has it already been in mainline?

> 
> 
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 744b4763b80e..de94c13b7193 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -559,6 +559,43 @@ enum bpf_func_id {
> */
> BPF_FUNC_probe_read_user,
>  
> +   /**
> +   * int bpf_probe_read_kernel(void *dst, int size, void *src)
> +   * Read a kernel pointer safely.
> +   * Return: 0 on success or negative error
> +   */
> +   BPF_FUNC_probe_read_kernel,
> +
> + /**
> +  * int bpf_probe_read_str(void *dst, int size, const void *unsafe_ptr)
> +  * Copy a NUL terminated string from user unsafe address. In case 
> the string
> +  * length is smaller than size, the target is not padded with 
> further NUL
> +  * bytes. In case the string length is larger than size, just 
> count-1
> +  * bytes are copied and the last byte is set to NUL.
> +  * @dst: destination address
> +  * @size: maximum number of bytes to copy, including the trailing 
> NUL
> +  * @unsafe_ptr: unsafe address
> +  * Return:
> +  *   > 0 length of the string including the trailing NUL on success
> +  *   < 0 error
> +  */
> + BPF_FUNC_probe_read_user_str,
> +
> + /**
> +  * int bpf_probe_read_str(void *dst, int size, const void *unsafe_ptr)
> +  * Copy a NUL terminated string from unsafe address. In case the 
> string
> +  * length is smaller than size, the target is not padded with 
> further NUL
> +  * bytes. In case the string length is larger than size, just 
> count-1
> +  * bytes are copied and the last byte is set to NUL.
> +  * @dst: destination address
> +  * @size: maximum number of bytes to copy, including the trailing 
> NUL
> +  * @unsafe_ptr: unsafe address
> +  * Return:
> +  *   > 0 length of the string including the trailing NUL on success
> +  *   < 0 error
> +  */
> + BPF_FUNC_probe_read_kernel_str,
> +
>   __BPF_FUNC_MAX_ID,
>  };
>  
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index a1e37a5d8c88..3478ca744a45 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -94,7 +94,7 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
>   .arg3_type  = ARG_ANYTHING,
>  };
>  
> -BPF_CALL_3(bpf_probe_read_user, void *, dst, u32, size, const void *, 
> unsafe_ptr)
> +BPF_CALL_3(bpf_probe_read_user, void *, dst, u32, size, const void  __user 
> *, unsafe_ptr)
>  {
>   int ret;
>  
> @@ -115,6 +115,27 @@ static const struct bpf_func_proto 
> bpf_probe_read_user_proto = {
>  };
>  
>  
> +BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size, const void *, 
> unsafe_ptr)
> +{
> + int ret;
> +
> + ret = probe_kernel_read(dst, unsafe_ptr, size);
> + if (unlikely(ret < 0))
> + memset(dst, 0, size);
> +
> + return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_probe_read_kernel_proto = {
> + .func   = bpf_probe_read_kernel,
> + .gpl_only   = true,
> + .ret_type   = RET_INTEGER,
> + .arg1_type  = ARG_PTR_TO_RAW_STACK,
> + .arg2_type  = ARG_CONST_STACK_SIZE,
> + .arg3_type  = ARG_ANYTHING,
> +};
> +
> +
>  BPF_CALL_3(bpf_probe_write_user, void *, unsafe_ptr, const void *, src,
>  u32, size)
>  {
> @@ -487,6 +508,69 @@ static const struct bpf_func_proto 
> bpf_probe_read_str_proto = {
>   .arg3_type  = ARG_ANYTHING,
>  };
>  
> +
> +
> +BPF_CALL_3(bpf_probe_read_user_str, void *, dst, u32, size,
> +const void __user *, unsafe_ptr)
> +{
> + int ret;
> +
> + /*
> +  * The strncpy_from_unsafe() call will likely not fill the entire
> +  * buffer, but that's okay in this circumstance as we're probing
> +  * arbitra

Re: [PATCH v12 13/14] Docs/x86/sgx: Add description for cgroup support

2024-04-21 Thread Bagas Sanjaya
On Mon, Apr 15, 2024 at 08:20:10PM -0700, Haitao Huang wrote:
> From: Sean Christopherson 
> 
> Add initial documentation of how to regulate the distribution of
> SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
> controller.
> 

The doc LGTM, thanks!

Reviewed-by: Bagas Sanjaya 

-- 
An old man doll... just what I always wanted! - Clara


signature.asc
Description: PGP signature


[PATCH v2] MAINTAINERS: Remove Ohad Ben-Cohen from hwspinlock subsystem

2023-12-18 Thread Bagas Sanjaya
Commit 62c46d55688894 ("MAINTAINERS: Removing Ohad from remoteproc/rpmsg
maintenance") removes his MAINTAINERS entry in regards to remoteproc
subsystem due to his inactivity (the last commit with his Signed-off-by
is 99c429cb4e628e ("remoteproc/wkup_m3: Use MODULE_DEVICE_TABLE to
export alias") which is authored in 2015 and his last LKML message prior
to 62c46d55688894 was [1]).

Remove also his MAINTAINERS entry for hwspinlock subsystem as there is
no point of Cc'ing maintainers who never respond in a long time.

[1]: 
https://lore.kernel.org/r/CAK=Wgbbcyi36ef1-PV8VS=m6nfoqnfgudwy6v7ocnkt0ddr...@mail.gmail.com/

Signed-off-by: Bagas Sanjaya 
---
Changes since v1 [2]:
  * Add also OMAP hwspinlock & driver entries in CREDITS (Ohad)

I was prompted to do the removal when I was reviewing kernel-doc fix
[3]. When I was digging MAINTAINERS history (`git log --no-merges --
MAINTAINERS`), I noticed that Ohad is inactive.

This patch is based on mm-nonmm-unstable as I intend to route it
through mm tree.

[2]: https://lore.kernel.org/r/zx04ymz_vdfee...@archie.me/

[3]: 
https://lore.kernel.org/linux-mm/20231216111017.17624-2-bagasdo...@gmail.com/

 CREDITS | 3 +++
 MAINTAINERS | 4 +---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/CREDITS b/CREDITS
index 81845c39e3cf37..3c7c953c7cf69b 100644
--- a/CREDITS
+++ b/CREDITS
@@ -323,6 +323,9 @@ N: Ohad Ben Cohen
 E: o...@wizery.com
 D: Remote Processor (remoteproc) subsystem
 D: Remote Processor Messaging (rpmsg) subsystem
+D: Hardware spinlock (hwspinlock) subsystem
+D: OMAP hwspinlock driver
+D: OMAP remoteproc driver
 
 N: Krzysztof Benedyczak
 E: go...@mat.uni.torun.pl
diff --git a/MAINTAINERS b/MAINTAINERS
index 5c9d3d8546714a..4acc4a3d4fcd96 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9257,7 +9257,6 @@ F:drivers/char/hw_random/
 F: include/linux/hw_random.h
 
 HARDWARE SPINLOCK CORE
-M: Ohad Ben-Cohen 
 M: Bjorn Andersson 
 R: Baolin Wang 
 L: linux-remotep...@vger.kernel.org
@@ -15692,9 +15691,8 @@ F:  
Documentation/devicetree/bindings/gpio/ti,omap-gpio.yaml
 F: drivers/gpio/gpio-omap.c
 
 OMAP HARDWARE SPINLOCK SUPPORT
-M: Ohad Ben-Cohen 
 L: linux-o...@vger.kernel.org
-S: Maintained
+S: Orphan
 F: drivers/hwspinlock/omap_hwspinlock.c
 
 OMAP HS MMC SUPPORT

base-commit: cbeb59a84b8f29151b882d03a4d23d19d92f4337
-- 
An old man doll... just what I always wanted! - Clara




Re: [PATCH] MAINTAINERS: Remove Ohad Ben-Cohen from hwspinlock subsystem

2023-12-16 Thread Bagas Sanjaya
On Sat, Dec 16, 2023 at 08:59:50PM +0200, Ohad Ben Cohen wrote:
> Hi Bagas,
> 
> On Sat, Dec 16, 2023 at 1:10 PM Bagas Sanjaya  wrote:
> > --- a/CREDITS
> > +++ b/CREDITS
> > @@ -323,6 +323,7 @@ N: Ohad Ben Cohen
> >  E: o...@wizery.com
> >  D: Remote Processor (remoteproc) subsystem
> >  D: Remote Processor Messaging (rpmsg) subsystem
> > +D: Hardware spinlock (hwspinlock) subsystem
> 
> Please also add:
> 
> D: OMAP hwspinlock driver
> D: OMAP remoteproc driver
> 

OK, will do in v2.

Thanks.

-- 
An old man doll... just what I always wanted! - Clara


signature.asc
Description: PGP signature


[PATCH] MAINTAINERS: Remove Ohad Ben-Cohen from hwspinlock subsystem

2023-12-16 Thread Bagas Sanjaya
Commit 62c46d55688894 ("MAINTAINERS: Removing Ohad from remoteproc/rpmsg
maintenance") removes his MAINTAINERS entry in regards to remoteproc
subsystem due to his inactivity (the last commit with his Signed-off-by
is 99c429cb4e628e ("remoteproc/wkup_m3: Use MODULE_DEVICE_TABLE to
export alias") which is authored in 2015 and his last LKML message prior
to 62c46d55688894 was [1]).

Remove also his MAINTAINERS entry for hwspinlock subsystem as there is
no point of Cc'ing maintainers who never respond in a long time.

[1]: 
https://lore.kernel.org/r/CAK=Wgbbcyi36ef1-PV8VS=m6nfoqnfgudwy6v7ocnkt0ddr...@mail.gmail.com/

Signed-off-by: Bagas Sanjaya 
---
I was prompted to do the removal when I was reviewing kernel-doc fix
[2]. When I was digging MAINTAINERS history (`git log --no-merges --
MAINTAINERS`), I noticed that Ohad is inactive.

This patch is based on mm-nonmm-unstable as I intend to route it
through mm tree.

[2]: https://lore.kernel.org/r/zx04ymz_vdfee...@archie.me/

 CREDITS | 1 +
 MAINTAINERS | 4 +---
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/CREDITS b/CREDITS
index 81845c39e3cf37..cff24c62b0e8f9 100644
--- a/CREDITS
+++ b/CREDITS
@@ -323,6 +323,7 @@ N: Ohad Ben Cohen
 E: o...@wizery.com
 D: Remote Processor (remoteproc) subsystem
 D: Remote Processor Messaging (rpmsg) subsystem
+D: Hardware spinlock (hwspinlock) subsystem
 
 N: Krzysztof Benedyczak
 E: go...@mat.uni.torun.pl
diff --git a/MAINTAINERS b/MAINTAINERS
index 5c9d3d8546714a..4acc4a3d4fcd96 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9257,7 +9257,6 @@ F:drivers/char/hw_random/
 F: include/linux/hw_random.h
 
 HARDWARE SPINLOCK CORE
-M: Ohad Ben-Cohen 
 M: Bjorn Andersson 
 R: Baolin Wang 
 L: linux-remotep...@vger.kernel.org
@@ -15692,9 +15691,8 @@ F:  
Documentation/devicetree/bindings/gpio/ti,omap-gpio.yaml
 F: drivers/gpio/gpio-omap.c
 
 OMAP HARDWARE SPINLOCK SUPPORT
-M: Ohad Ben-Cohen 
 L: linux-o...@vger.kernel.org
-S: Maintained
+S: Orphan
 F: drivers/hwspinlock/omap_hwspinlock.c
 
 OMAP HS MMC SUPPORT

base-commit: cbeb59a84b8f29151b882d03a4d23d19d92f4337
-- 
An old man doll... just what I always wanted! - Clara




Re: [PATCH] hwspinlock/core: fix kernel-doc warnings

2023-12-15 Thread Bagas Sanjaya
re
> + * Returns: %0 on success, or an appropriate error code on failure
>   */
>  int hwspin_lock_unregister(struct hwspinlock_device *bank)
>  {
> @@ -578,7 +582,7 @@ static int devm_hwspin_lock_device_match
>   *
>   * Should be called from a process context (might sleep)
>   *
> - * Returns 0 on success, or an appropriate error code on failure
> + * Returns: %0 on success, or an appropriate error code on failure
>   */
>  int devm_hwspin_lock_unregister(struct device *dev,
>   struct hwspinlock_device *bank)
> @@ -607,7 +611,7 @@ EXPORT_SYMBOL_GPL(devm_hwspin_lock_unreg
>   *
>   * Should be called from a process context (might sleep)
>   *
> - * Returns 0 on success, or an appropriate error code on failure
> + * Returns: %0 on success, or an appropriate error code on failure
>   */
>  int devm_hwspin_lock_register(struct device *dev,
> struct hwspinlock_device *bank,
> @@ -635,12 +639,13 @@ EXPORT_SYMBOL_GPL(devm_hwspin_lock_regis
>  
>  /**
>   * __hwspin_lock_request() - tag an hwspinlock as used and power it up
> + * @hwlock: the target hwspinlock
>   *
>   * This is an internal function that prepares an hwspinlock instance
>   * before it is given to the user. The function assumes that
>   * hwspinlock_tree_lock is taken.
>   *
> - * Returns 0 or positive to indicate success, and a negative value to
> + * Returns: %0 or positive to indicate success, and a negative value to
>   * indicate an error (with the appropriate error code)
>   */
>  static int __hwspin_lock_request(struct hwspinlock *hwlock)
> @@ -680,7 +685,7 @@ static int __hwspin_lock_request(struct
>   * hwspin_lock_get_id() - retrieve id number of a given hwspinlock
>   * @hwlock: a valid hwspinlock instance
>   *
> - * Returns the id number of a given @hwlock, or -EINVAL if @hwlock is 
> invalid.
> + * Returns: the id number of a given @hwlock, or -EINVAL if @hwlock is 
> invalid.
>   */
>  int hwspin_lock_get_id(struct hwspinlock *hwlock)
>  {
> @@ -704,7 +709,7 @@ EXPORT_SYMBOL_GPL(hwspin_lock_get_id);
>   *
>   * Should be called from a process context (might sleep)
>   *
> - * Returns the address of the assigned hwspinlock, or NULL on error
> + * Returns: the address of the assigned hwspinlock, or %NULL on error
>   */
>  struct hwspinlock *hwspin_lock_request(void)
>  {
> @@ -747,7 +752,7 @@ EXPORT_SYMBOL_GPL(hwspin_lock_request);
>   *
>   * Should be called from a process context (might sleep)
>   *
> - * Returns the address of the assigned hwspinlock, or NULL on error
> + * Returns: the address of the assigned hwspinlock, or %NULL on error
>   */
>  struct hwspinlock *hwspin_lock_request_specific(unsigned int id)
>  {
> @@ -795,7 +800,7 @@ EXPORT_SYMBOL_GPL(hwspin_lock_request_sp
>   *
>   * Should be called from a process context (might sleep)
>   *
> - * Returns 0 on success, or an appropriate error code on failure
> + * Returns: %0 on success, or an appropriate error code on failure
>   */
>  int hwspin_lock_free(struct hwspinlock *hwlock)
>  {
> @@ -865,7 +870,7 @@ static void devm_hwspin_lock_release(str
>   *
>   * Should be called from a process context (might sleep)
>   *
> - * Returns 0 on success, or an appropriate error code on failure
> + * Returns: %0 on success, or an appropriate error code on failure
>   */
>  int devm_hwspin_lock_free(struct device *dev, struct hwspinlock *hwlock)
>  {
> @@ -891,7 +896,7 @@ EXPORT_SYMBOL_GPL(devm_hwspin_lock_free)
>   *
>   * Should be called from a process context (might sleep)
>   *
> - * Returns the address of the assigned hwspinlock, or NULL on error
> + * Returns: the address of the assigned hwspinlock, or %NULL on error
>   */
>  struct hwspinlock *devm_hwspin_lock_request(struct device *dev)
>  {
> @@ -926,7 +931,7 @@ EXPORT_SYMBOL_GPL(devm_hwspin_lock_reque
>   *
>   * Should be called from a process context (might sleep)
>   *
> - * Returns the address of the assigned hwspinlock, or NULL on error
> + * Returns: the address of the assigned hwspinlock, or %NULL on error
>   */
>  struct hwspinlock *devm_hwspin_lock_request_specific(struct device *dev,
>unsigned int id)

LGTM, thanks!

Reviewed-by: Bagas Sanjaya 

-- 
An old man doll... just what I always wanted! - Clara


signature.asc
Description: PGP signature


Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-19 Thread Bagas Sanjaya
On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:
> Hi,
> 
> when testing the EEVDF scheduler we stumbled upon a performance regression
> in a uperf scenario and would like to
> kindly ask for feedback on whether we are going into the right direction
> with our analysis so far.
> 
> The base scenario are two KVM guests running on an s390 LPAR. One guest
> hosts the uperf server, one the uperf client.
> With EEVDF we observe a regression of ~50% for a strburst test.
> For a more detailed description of the setup see the section TEST SUMMARY at
> the bottom.
> 
> Bisecting led us to the following commit which appears to introduce the
> regression:
> 86bfbb7ce4f6 sched/fair: Add lag based placement
> 
> We then compared the last good commit we identified with a recent level of
> the devel branch.
> The issue still persists on 6.7 rc1 although there is some improvement (down
> from 62% regression to 49%)
> 
> All analysis described further are based on a 6.6 rc7 kernel.
> 
> We sampled perf data to get an idea on what is going wrong and ended up
> seeing an dramatic increase in the maximum
> wait times from 3ms up to 366ms. See section WAIT DELAYS below for more
> details.
> 
> We then collected tracing data to get a better insight into what is going
> on.
> The trace excerpt in section TRACE EXCERPT shows one example (of multiple
> per test run) of the problematic scenario where
> a kworker(pid=6525) has to wait for 39,718 ms.
> 
> Short summary:
> The mentioned kworker has been scheduled to CPU 14 before the tracing was
> enabled.
> A vhost process is migrated onto CPU 14.
> The vruntimes of kworker and vhost differ significantly (86642125805 vs
> 4242563284 -> factor 20)
> The vhost process wants to wake up the kworker, therefore the kworker is
> placed onto the runqueue again and set to runnable.
> The vhost process continues to execute, waking up other vhost processes on
> other CPUs.
> 
> So far this behavior is not different to what we see on pre-EEVDF kernels.
> 
> On timestamp 576.162767, the vhost process triggers the last wake up of
> another vhost on another CPU.
> Until timestamp 576.171155, we see no other activity. Now, the vhost process
> ends its time slice.
> Then, vhost gets re-assigned new time slices 4 times and gets then migrated
> off to CPU 15.
> This does not occur with older kernels.
> The kworker has to wait for the migration to happen in order to be able to
> execute again.
> This is due to the fact, that the vruntime of the kworker is significantly
> larger than the one of vhost.
> 
> 
> We observed the large difference in vruntime between kworker and vhost in
> the same magnitude on
> a kernel built based on the parent of the commit mentioned above.
> With EEVDF, the kworker is doomed to wait until the vhost either catches up
> on vruntime (which would take 86 seconds)
> or the vhost is migrated off of the CPU.
> 
> We found some options which sound plausible but we are not sure if they are
> valid or not:
> 
> 1. The wake up path has a dependency on the vruntime metrics that now delays
> the execution of the kworker.
> 2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime)
> which updates the way cfs_rq->min_vruntime and
> cfs_rq->avg_runtime are set might have introduced an issue which is
> uncovered with the commit mentioned above.
> 3. An assumption in the vhost code which causes vhost to rely on being
> scheduled off in time to allow the kworker to proceed.
> 
> We also stumbled upon the following mailing thread:
> https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
> That conversation, and the patches derived from it lead to the assumption
> that the wake up path might be adjustable in a way
> that this case in particular can be addressed.
> At the same time, the vast difference in vruntimes is concerning since, at
> least for some time frame, both processes are on the runqueue.
> 
> We would be glad to hear some feedback on which paths to pursue and which
> might just be a dead end in the first place.
> 
> 
>  TRACE EXCERPT 
> The sched_place trace event was added to the end of the place_entity
> function and outputs:
> sev -> sched_entity vruntime
> sed -> sched_entity deadline
> sel -> sched_entity vlag
> avg -> cfs_rq avg_vruntime
> min -> cfs_rq min_vruntime
> cpu -> cpu of cfs_rq
> nr  -> cfs_rq nr_running
> ---
> CPU 3/KVM-2950[014] d   576.161432: sched_migrate_task:
> comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
> --> migrates task from cpu 15 to 14
> CPU 3/KVM-2950[014] d   576.161433: sched_place: comm=vhost-2920
> pid=2941 sev=4242563284 sed=4245563284 sel=0 avg=4242563284 min=4242563284
> cpu=14 nr=0
> --> places vhost 2920 on CPU 14 with vruntime 4242563284
> CPU 3/KVM-2950[014] d   576.161433: sched_place: comm= pid=0
> sev=16329848593 sed=16334604010 sel=0 avg=16329848593 min=16329848593 cpu=14
> nr=0
> CPU 3/KVM-2950

Re: Slow boot and shutdown/reboot problems with 6.5.0+

2023-09-11 Thread Bagas Sanjaya
On 07/09/2023 20:56, Marcus Seyfarth wrote:
> As to bisecting: Unfortunately I cannot afford the time right now to bisect 
> this further as the system is used in production and already did invest a lot 
> of time without success into it. Hopefully someone else can find the root 
> cause of the problem. My systemd version is: 254.1, and I also use dbus 
> 1.14.10 and dbus-broker 33.r35.g2220a84 which was configured with -D 
> linux-4-17=true.
> 

[also Cc: Thorsten]

To Thorsten: It looks like the reporter can neither bisect nor
test the mainline since he only has production machine instead.
What can I (and the reporter do) in this case?

Thanks.

-- 
An old man doll... just what I always wanted! - Clara