Re: [PATCH 00/11] Introduce Simple atomic and non-atomic counters

2020-09-28 Thread Joel Fernandes
On Mon, Sep 28, 2020 at 01:34:31PM -0700, Kees Cook wrote:
> On Sun, Sep 27, 2020 at 07:35:26PM -0400, Joel Fernandes wrote:
> > On Fri, Sep 25, 2020 at 05:47:14PM -0600, Shuah Khan wrote:
> > > This patch series is a result of the discussion at the refcount_t BOF
> > > at the Linux Plumbers Conference. In that discussion, we identified
> > > a need to look closely at and investigate atomic_t usages in the
> > > kernel where atomic_t is used strictly as a counter without
> > > controlling object lifetimes and state changes.
> > > 
> > > There are a number of atomic_t usages in the kernel where atomic_t api
> > > is used strictly for counting and not for managing object lifetime. In
> > > some cases, atomic_t might not even be needed.
> > > 
> > > The purpose of these counters is twofold: 1. clearly differentiate
> > > atomic_t counters from atomic_t usages that guard object lifetimes and
> > > are hence prone to overflow and underflow errors; this allows tools that
> > > scan atomic_t usages for overflow and underflow to scan just the cases
> > > that are prone to such errors. 2. provide non-atomic counters for cases
> > > where atomicity isn't necessary.
> > 
> > Nice series :)
> > 
> > It appears there is no user of counter_simple in this series other than the
> > selftest. Would you be planning to add any conversions in the series itself,
> > for illustration of use? Sorry if I missed a usage.
> > 
> > Also, how do we guarantee atomicity of counter_simple RMW operations? Is
> > the implication that they should be guarded with other synchronization to
> > prevent the lost-update problem?
> > 
> > Some more comments:
> > 
> > 1.  atomic RMW operations that have a return value are fully ordered. Would
> > you be adding support to counter_simple for such ordering as well, for
> > consistency?
> 
> No -- there is no atomicity guarantee for counter_simple. I would prefer
> counter_simple not exist at all, specifically for this reason.

Yeah, I am ok with it not existing, especially as there are no examples
of its conversion/usage in the series.
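For context, the lost-update concern raised above is the usual hazard with a
plain, non-atomic counter: the increment is a read-modify-write that callers
would have to serialize themselves. A minimal sketch of what that would look
like, assuming the counter_simple32 naming proposed in the series (not the
actual patch):

#include <linux/spinlock.h>

/*
 * Sketch only: counter_simple32 and its ops are assumed from the series
 * naming; the caller must provide its own serialization.
 */
struct counter_simple32 {
	int cnt;
};

static inline void counter_simple32_inc(struct counter_simple32 *c)
{
	c->cnt++;			/* plain load/add/store, not atomic */
}

static DEFINE_SPINLOCK(stats_lock);
static struct counter_simple32 rx_frames;

static void rx_frame_seen(void)
{
	spin_lock(&stats_lock);
	counter_simple32_inc(&rx_frames);	/* safe only under stats_lock */
	spin_unlock(&stats_lock);
}

Without that external lock, two CPUs incrementing concurrently can both read
the same old value and one update is lost; counter_atomic32 avoids this by
using atomic_t underneath.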

> > 2. I felt counter_atomic and counter_atomic64 would be nice equivalents to
> >the atomic and atomic64 naming currently used (i.e. dropping the '32').
> >However that is just my opinion and I am ok with either naming.
> 
> I had asked that they be size-named to avoid any confusion (i.e. we're
> making a new API).

Works for me.

Cheers,

 - Joel



Re: [PATCH 07/11] drivers/android/binder: convert stats, transaction_log to counter_atomic32

2020-09-27 Thread Joel Fernandes
On Fri, Sep 25, 2020 at 05:47:21PM -0600, Shuah Khan wrote:
> counter_atomic* is introduced to be used when a variable is used as
> a simple counter and doesn't guard object lifetimes. This clearly
> differentiates it from atomic_t usages that guard object lifetimes.
> 
> counter_atomic* variables will wrap around to 0 when they overflow and
> should not be used to guard resource lifetimes, device usage and
> open counts that control state changes, and pm states.
> 
> stats tracks per-process binder statistics. Unsure if there is a chance
> of this overflowing, other than stats getting reset to 0. Convert it to
> use counter_atomic32.
> 
> binder_transaction_log:cur is used to keep track of the current log entry
> location. Overflow is handled in the code. Since it is used as a
> counter, convert it to use counter_atomic32.
> 
> This conversion doesn't change the overflow wrap around behavior.
> 
> Signed-off-by: Shuah Khan 

Reviewed-by: Joel Fernandes (Google) 

thanks,

 - Joel

> ---
>  drivers/android/binder.c  | 41 ---
>  drivers/android/binder_internal.h |  3 ++-
>  2 files changed, 23 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> index f936530a19b0..52175cd6a62b 100644
> --- a/drivers/android/binder.c
> +++ b/drivers/android/binder.c
> @@ -66,6 +66,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -172,22 +173,22 @@ enum binder_stat_types {
>  };
>  
>  struct binder_stats {
> - atomic_t br[_IOC_NR(BR_FAILED_REPLY) + 1];
> - atomic_t bc[_IOC_NR(BC_REPLY_SG) + 1];
> - atomic_t obj_created[BINDER_STAT_COUNT];
> - atomic_t obj_deleted[BINDER_STAT_COUNT];
> + struct counter_atomic32 br[_IOC_NR(BR_FAILED_REPLY) + 1];
> + struct counter_atomic32 bc[_IOC_NR(BC_REPLY_SG) + 1];
> + struct counter_atomic32 obj_created[BINDER_STAT_COUNT];
> + struct counter_atomic32 obj_deleted[BINDER_STAT_COUNT];
>  };
>  
>  static struct binder_stats binder_stats;
>  
>  static inline void binder_stats_deleted(enum binder_stat_types type)
>  {
> - atomic_inc(&binder_stats.obj_deleted[type]);
> + counter_atomic32_inc(&binder_stats.obj_deleted[type]);
>  }
>  
>  static inline void binder_stats_created(enum binder_stat_types type)
>  {
> - atomic_inc(&binder_stats.obj_created[type]);
> + counter_atomic32_inc(&binder_stats.obj_created[type]);
>  }
>  
>  struct binder_transaction_log binder_transaction_log;
> @@ -197,7 +198,7 @@ static struct binder_transaction_log_entry 
> *binder_transaction_log_add(
>   struct binder_transaction_log *log)
>  {
>   struct binder_transaction_log_entry *e;
> - unsigned int cur = atomic_inc_return(&log->cur);
> + unsigned int cur = counter_atomic32_inc_return(&log->cur);
>  
>   if (cur >= ARRAY_SIZE(log->entry))
>   log->full = true;
> @@ -3615,9 +3616,9 @@ static int binder_thread_write(struct binder_proc *proc,
>   ptr += sizeof(uint32_t);
>   trace_binder_command(cmd);
>   if (_IOC_NR(cmd) < ARRAY_SIZE(binder_stats.bc)) {
> - atomic_inc(&binder_stats.bc[_IOC_NR(cmd)]);
> - atomic_inc(&proc->stats.bc[_IOC_NR(cmd)]);
> - atomic_inc(&thread->stats.bc[_IOC_NR(cmd)]);
> + counter_atomic32_inc(&binder_stats.bc[_IOC_NR(cmd)]);
> + counter_atomic32_inc(&proc->stats.bc[_IOC_NR(cmd)]);
> + counter_atomic32_inc(&thread->stats.bc[_IOC_NR(cmd)]);
>   }
>   switch (cmd) {
>   case BC_INCREFS:
> @@ -4047,9 +4048,9 @@ static void binder_stat_br(struct binder_proc *proc,
>  {
>   trace_binder_return(cmd);
>   if (_IOC_NR(cmd) < ARRAY_SIZE(binder_stats.br)) {
> - atomic_inc(&binder_stats.br[_IOC_NR(cmd)]);
> - atomic_inc(&proc->stats.br[_IOC_NR(cmd)]);
> - atomic_inc(&thread->stats.br[_IOC_NR(cmd)]);
> + counter_atomic32_inc(&binder_stats.br[_IOC_NR(cmd)]);
> + counter_atomic32_inc(&proc->stats.br[_IOC_NR(cmd)]);
> + counter_atomic32_inc(&thread->stats.br[_IOC_NR(cmd)]);
>   }
>  }
>  
> @@ -5841,7 +5842,7 @@ static void print_binder_stats(struct seq_file *m, 
> const char *prefix,
>   BUILD_BUG_ON(ARRAY_SIZE(stats->bc) !=
>ARRAY_SIZE(binder_command_strings));
>   for (i = 0; i < ARRAY_SIZE(stats->bc); i++) {
> - int temp = atomic_read(&st

Re: [PATCH 00/11] Introduce Simple atomic and non-atomic counters

2020-09-27 Thread Joel Fernandes
On Fri, Sep 25, 2020 at 05:47:14PM -0600, Shuah Khan wrote:
> This patch series is a result of the discussion at the refcount_t BOF
> at the Linux Plumbers Conference. In that discussion, we identified
> a need to look closely at and investigate atomic_t usages in the
> kernel where atomic_t is used strictly as a counter without
> controlling object lifetimes and state changes.
> 
> There are a number of atomic_t usages in the kernel where atomic_t api
> is used strictly for counting and not for managing object lifetime. In
> some cases, atomic_t might not even be needed.
> 
> The purpose of these counters is twofold: 1. clearly differentiate
> atomic_t counters from atomic_t usages that guard object lifetimes and
> are hence prone to overflow and underflow errors; this allows tools that
> scan atomic_t usages for overflow and underflow to scan just the cases
> that are prone to such errors. 2. provide non-atomic counters for cases
> where atomicity isn't necessary.

Nice series :)

It appears there is no user of counter_simple in this series other than the
selftest. Would you be planning to add any conversions in the series itself,
for illustration of use? Sorry if I missed a usage.

Also, how do we guarantee atomicity of counter_simple RMW operations? Is
the implication that they should be guarded with other synchronization to
prevent the lost-update problem?

Some more comments:

1.  atomic RMW operations that have a return value are fully ordered. Would
you be adding support to counter_simple for such ordering as well, for
consistency?

2. I felt counter_atomic and counter_atomic64 would be nice equivalents to
   the atomic and atomic64 naming currently used (i.e. dropping the '32').
   However that is just my opinion and I am ok with either naming.
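To make the ordering question in (1) concrete: below is a rough sketch of how
the counter_atomic32 ops presumably wrap atomic_t, inferred from the
conversions elsewhere in this series. The type and helper names come from the
series; the bodies are assumptions, not the actual patch.

#include <linux/atomic.h>

struct counter_atomic32 {
	atomic_t cnt;
};

static inline int counter_atomic32_read(const struct counter_atomic32 *c)
{
	return atomic_read(&c->cnt);
}

static inline void counter_atomic32_inc(struct counter_atomic32 *c)
{
	atomic_inc(&c->cnt);	/* atomic, but unordered, like atomic_inc() */
}

/* value-returning RMW ops would inherit atomic_inc_return()'s full ordering */
static inline int counter_atomic32_inc_return(struct counter_atomic32 *c)
{
	return atomic_inc_return(&c->cnt);
}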

thanks!

 - Joel

> 
> The simple atomic and non-atomic counters API provides interfaces for
> simple atomic and non-atomic counters that just count and don't guard
> resource lifetimes. Counters will wrap around to 0 when they overflow and
> should not be used to guard resource lifetimes, device usage and open
> counts that control state changes, and pm states.
> 
> Using counter_atomic to guard lifetimes could lead to use-after-free
> when it overflows and undefined behavior when used to manage state
> changes and device usage/open states.
> 
> This patch series introduces Simple atomic and non-atomic counters.
> Counter atomic ops leverage atomic_t and provide a sub-set of atomic_t
> ops.
> 
> In addition, this patch series converts a few drivers to use the new API.
> The following criteria are used to select variables for conversion:
> 
> 1. Variable doesn't guard object lifetimes or manage state changes, e.g.:
>device usage counts, device open counts, and pm states.
> 2. Variable is used for stats and counters.
> 3. The conversion doesn't change the overflow behavior.
> 
> Changes since RFC:
> -- Thanks for reviews and reviewed-by, and Acked-by tags. Updated
>the patches with the tags.
> -- Addressed Kees's comments:
>1. Non-atomic counters renamed to counter_simple32 and counter_simple64
>   to clearly indicate size.
>2. Added warning for counter_simple* usage and it should be used only
>   when there is no need for atomicity.
>3. Renamed counter_atomic to counter_atomic32 to clearly indicate size.
>4. Renamed counter_atomic_long to counter_atomic64 and it now uses
>   atomic64_t ops and indicates size.
>5. Test updated for the API renames.
>6. Added helper functions for test results printing
>7. Verified that the test module compiles in kunit env. and test
>   module can be loaded to run the test.
>8. Updated Documentation to reflect the intent to make the API
>   restricted so it can never be used to guard object lifetimes
>   and state management. I left _return ops for now, inc_return
>   is necessary for now as per the discussion we had on this topic. 
> -- Updated driver patches with API name changes.
> -- We discussed if binder counters can be non-atomic. For now I left
>them the same as the RFC patch - using counter_atomic32
> -- Unrelated to this patch series:
>The patch series review uncovered improvements could be made to
>test_async_driver_probe and vmw_vmci/vmci_guest. I will track
>these for fixing later.
> 
> Shuah Khan (11):
>   counters: Introduce counter_simple* and counter_atomic* counters
>   selftests:lib:test_counters: add new test for counters
>   drivers/base: convert deferred_trigger_count and probe_count to
> counter_atomic32
>   drivers/base/devcoredump: convert devcd_count to counter_atomic32
>   drivers/acpi: convert seqno counter_atomic32
>   drivers/acpi/apei: convert seqno counter_atomic32
>   drivers/android/binder: convert stats, transaction_log to
> counter_atomic32
>   drivers/base/test/test_async_driver_probe: convert to use
> counter_atomic32
>   drivers/char/ipmi: convert stats to use counter_ato

Re: [PATCH] staging: ion: remove from the tree

2020-08-27 Thread Joel Fernandes
On Thu, Aug 27, 2020 at 10:31:41PM +0530, Amit Pundir wrote:
> On Thu, 27 Aug 2020 at 21:34, Greg Kroah-Hartman
>  wrote:
> >
> > On Thu, Aug 27, 2020 at 09:31:27AM -0400, Laura Abbott wrote:
> > > On 8/27/20 8:36 AM, Greg Kroah-Hartman wrote:
> > > > The ION android code has long been marked to be removed, now that we
> > > > have dma-buf support merged into the real part of the kernel.
> > > >
> > > > It was thought that we could wait to remove the ion kernel at a later
> > > > time, but as the out-of-tree Android fork of the ion code has diverged
> > > > quite a bit, and any Android device using the ion interface uses that
> > > > forked version and not this in-tree version, the in-tree copy of the
> > > > code is abandoned and not used by anyone.
> > > >
> > > > Combine this abandoned codebase with the need to make changes to it in
> > > > order to keep the kernel building properly, which then causes merge
> > > > issues when merging those changes into the out-of-tree Android code, and
> > > > you end up with two different groups of people (the in-kernel-tree
> > > > developers, and the Android kernel developers) who are both annoyed at
> > > > the current situation.  Because of this problem, just drop the in-kernel
> > > > copy of the ion code now, as it's not used, and is only causing problems
> > > > for everyone involved.
> > > >
> > > > Cc: "Arve Hjønnevåg" 
> > > > Cc: "Christian König" 
> > > > Cc: Christian Brauner 
> > > > Cc: Christoph Hellwig 
> > > > Cc: Hridya Valsaraju 
> > > > Cc: Joel Fernandes 
> > > > Cc: John Stultz 
> > > > Cc: Laura Abbott 
> > > > Cc: Martijn Coenen 
> > > > Cc: Shuah Khan 
> > > > Cc: Sumit Semwal 
> > > > Cc: Suren Baghdasaryan 
> > > > Cc: Todd Kjos 
> > > > Cc: de...@driverdev.osuosl.org
> > > > Cc: dri-de...@lists.freedesktop.org
> > > > Cc: linaro-mm-...@lists.linaro.org
> > > > Signed-off-by: Greg Kroah-Hartman 
> > >
> > > We discussed this at the Android MC on Monday and the plan was to
> > > remove it after the next LTS release.
> >
> > I know it was discussed, my point is that it is actually causing
> > problems now (with developers who want to change the internal kernel api
> > hitting issues, and newbies trying to clean up code in ways that isn't
> > exactly optimal wasting maintainer cycles), and that anyone who uses
> > this code, is not actually using this version of the code.  Everyone who
> > relies on ion right now, is using the version that is in the Android
> > common kernel tree, which has diverged from this in-kernel way quite a
> > bit now for the reason that we didn't want to take any of those new
> > features in the in-kernel version.
> >
> > So this is a problem that we have caused by just wanting to wait, no one
> > is using this code, combined with it causing problems for the upstream
> > developers.
> >
> > There is nothing "magic" about the last kernel of the year that requires
> > this code to sit here until then.  At that point in time, all users
> > will, again, be using the forked Android kernel version, and if we
> > delete this now here, that fork can remain just fine, with the added
> > benefit of it reducing developer workloads here in-kernel.
> >
> > So why wait?
> 
> Hi,
> 
> I don't know what is the right thing to do here. I just want to
> highlight that AOSP's audio (codec2) HAL depends on the ION system
> heap and it will break AOSP for people who boot mainline on their
> devices, even for just testing purpose like we do in Linaro. Right now
> we need only 1 (Android specific out-of-tree) patch to boot AOSP with
> mainline and Sumit is already trying to upstream that vma naming
> patch. Removal of in-kernel ION, will just add more to that delta.

So that means you now have to carry 2 patches instead of 1, right? That's not
that bad :-D.

BTW, why doesn't your mainline testing use dmabuf already?

AFAIK, upstream has inertia catching up to products etc., so the sooner it is
removed the better if it is mostly dead (before it turns into ashmem, which
nobody can remove). My 2c.

thanks,

 - Joel



Re: [PATCH 1/1] staging: android: ashmem: Fix lockdep warning for write operation

2020-07-29 Thread Joel Fernandes
On Wed, Jul 15, 2020 at 10:45 PM Suren Baghdasaryan  wrote:
>
> syzbot report [1] describes a deadlock when write operation against an
> ashmem fd executed at the time when ashmem is shrinking its cache results
> in the following lock sequence:
>
> Possible unsafe locking scenario:
>
> CPU0CPU1
> 
>lock(fs_reclaim);
> lock(&sb->s_type->i_mutex_key#13);
> lock(fs_reclaim);
>lock(&sb->s_type->i_mutex_key#13);
>
> kswapd takes fs_reclaim and then inode_lock while generic_perform_write
> takes inode_lock and then fs_reclaim. However ashmem does not support
> writing into backing shmem with a write syscall. The only way to change
> its content is to mmap it and operate on mapped memory. Therefore the race
> that lockdep is warning about is not valid. Resolve this by introducing a
> separate lockdep class for the backing shmem inodes.
>
> [1]: https://lkml.kernel.org/lkml/0b5f9d059aa20...@google.com/
>
> Signed-off-by: Suren Baghdasaryan 
> ---

Once Eric's nits are resolved:

Reviewed-by: Joel Fernandes (Google) 

Thanks.
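For readers unfamiliar with the approach, "a separate lockdep class for the
backing shmem inodes" would look roughly like the sketch below (assumed shape
and names, not the exact patch): the backing inode's i_rwsem gets its own
class, so lockdep stops chaining it with ordinary filesystem inode locks and
the false-positive report above goes away.

#include <linux/fs.h>
#include <linux/lockdep.h>

/* Sketch only: give the ashmem-backed shmem inode its own lockdep class. */
static struct lock_class_key backing_shmem_inode_class;

static void ashmem_set_backing_inode_class(struct file *vmfile)
{
	struct inode *inode = file_inode(vmfile);

	lockdep_set_class(&inode->i_rwsem, &backing_shmem_inode_class);
}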


[RFC] ashmem: Fix ashmem shrinker nr_to_scan

2020-03-05 Thread Joel Fernandes (Google)
nr_to_scan is the number of pages to be freed; however, ashmem doesn't
discount nr_to_scan correctly as it frees ranges. It should be discounting
it by pages rather than by ranges. Correct the issue.

Cc: sur...@google.com
Signed-off-by: Joel Fernandes (Google) 
---
 drivers/staging/android/ashmem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 5891d0744a760..cb525ea6db98a 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -458,6 +458,8 @@ ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
lru_del(range);
 
freed += range_size(range);
+   sc->nr_to_scan -=  range_size(range);
+
mutex_unlock(&ashmem_mutex);
f->f_op->fallocate(f,
   FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
@@ -467,7 +469,7 @@ ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
wake_up_all(&ashmem_shrink_wait);
if (!mutex_trylock(&ashmem_mutex))
goto out;
-   if (--sc->nr_to_scan <= 0)
+   if (sc->nr_to_scan <= 0)
break;
}
mutex_unlock(&ashmem_mutex);
-- 
2.25.0.265.gbab2e86ba0-goog



Re: [PATCH] staging: android: Delete the 'vsoc' driver

2020-02-03 Thread Joel Fernandes
On Sun, Feb 02, 2020 at 08:22:54PM -0800, Alistair Delva wrote:
> The 'vsoc' driver was required for an early iteration of the Android
> 'cuttlefish' virtual platform, but this platform has been wholly
> converted to use virtio drivers instead. Delete this old driver.
> 
> Cc: Greg Kroah-Hartman 
> Cc: Joel Fernandes 

Reviewed-by: Joel Fernandes (Google) 

thanks,

 - Joel

> Cc: Greg Hartman 
> Cc: kernel-t...@android.com
> Cc: de...@driverdev.osuosl.org
> Signed-off-by: Alistair Delva 
> ---
>  drivers/staging/android/Kconfig |8 -
>  drivers/staging/android/Makefile|1 -
>  drivers/staging/android/TODO|9 -
>  drivers/staging/android/uapi/vsoc_shm.h |  295 --
>  drivers/staging/android/vsoc.c  | 1149 ---
>  5 files changed, 1462 deletions(-)
>  delete mode 100644 drivers/staging/android/uapi/vsoc_shm.h
>  delete mode 100644 drivers/staging/android/vsoc.c
> 
> diff --git a/drivers/staging/android/Kconfig b/drivers/staging/android/Kconfig
> index d6d605d5cbde..8d8fd5c29349 100644
> --- a/drivers/staging/android/Kconfig
> +++ b/drivers/staging/android/Kconfig
> @@ -14,14 +14,6 @@ config ASHMEM
> It is, in theory, a good memory allocator for low-memory devices,
> because it can discard shared memory units when under memory pressure.
>  
> -config ANDROID_VSOC
> - tristate "Android Virtual SoC support"
> - depends on PCI_MSI
> - help
> -   This option adds support for the Virtual SoC driver needed to boot
> -   a 'cuttlefish' Android image inside QEmu. The driver interacts with
> -   a QEmu ivshmem device. If built as a module, it will be called vsoc.
> -
>  source "drivers/staging/android/ion/Kconfig"
>  
>  endif # if ANDROID
> diff --git a/drivers/staging/android/Makefile 
> b/drivers/staging/android/Makefile
> index 14bd9c6ce10d..3b66cd0b0ec5 100644
> --- a/drivers/staging/android/Makefile
> +++ b/drivers/staging/android/Makefile
> @@ -4,4 +4,3 @@ ccflags-y += -I$(src) # needed for trace 
> events
>  obj-y+= ion/
>  
>  obj-$(CONFIG_ASHMEM) += ashmem.o
> -obj-$(CONFIG_ANDROID_VSOC)   += vsoc.o
> diff --git a/drivers/staging/android/TODO b/drivers/staging/android/TODO
> index 767dd98fd92d..80eccfaf6db5 100644
> --- a/drivers/staging/android/TODO
> +++ b/drivers/staging/android/TODO
> @@ -9,14 +9,5 @@ ion/
>   - Split /dev/ion up into multiple nodes (e.g. /dev/ion/heap0)
>   - Better test framework (integration with VGEM was suggested)
>  
> -vsoc.c, uapi/vsoc_shm.h
> - - The current driver uses the same wait queue for all of the futexes in a
> -   region. This will cause false wakeups in regions with a large number of
> -   waiting threads. We should eventually use multiple queues and select the
> -   queue based on the region.
> - - Add debugfs support for examining the permissions of regions.
> - - Remove VSOC_WAIT_FOR_INCOMING_INTERRUPT ioctl. This functionality has been
> -   superseded by the futex and is there for legacy reasons.
> -
>  Please send patches to Greg Kroah-Hartman  and Cc:
>  Arve Hjønnevåg  and Riley Andrews 
> diff --git a/drivers/staging/android/uapi/vsoc_shm.h 
> b/drivers/staging/android/uapi/vsoc_shm.h
> deleted file mode 100644
> index 6291fb24efb2..
> --- a/drivers/staging/android/uapi/vsoc_shm.h
> +++ /dev/null
> @@ -1,295 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -/*
> - * Copyright (C) 2017 Google, Inc.
> - *
> - */
> -
> -#ifndef _UAPI_LINUX_VSOC_SHM_H
> -#define _UAPI_LINUX_VSOC_SHM_H
> -
> -#include 
> -
> -/**
> - * A permission is a token that permits a receiver to read and/or write an 
> area
> - * of memory within a Vsoc region.
> - *
> - * An fd_scoped permission grants both read and write access, and can be
> - * attached to a file description (see open(2)).
> - * Ownership of the area can then be shared by passing a file descriptor
> - * among processes.
> - *
> - * begin_offset and end_offset define the area of memory that is controlled 
> by
> - * the permission. owner_offset points to a word, also in shared memory, that
> - * controls ownership of the area.
> - *
> - * ownership of the region expires when the associated file description is
> - * released.
> - *
> - * At most one permission can be attached to each file description.
> - *
> - * This is useful when implementing HALs like gralloc that scope and pass
> - * ownership of shared resources via file descriptors.
> - *
> - * The caller is responsibe for doing any fencing.
> - *
> - * The calling process will no

Re: [PATCH v2] staging: android: ashmem: Disallow ashmem memory from being remapped

2020-01-27 Thread Joel Fernandes
On Mon, Jan 27, 2020 at 03:56:16PM -0800, Todd Kjos wrote:
> From: Suren Baghdasaryan 
> 
> When ashmem file is mmapped, the resulting vma->vm_file points to the
> backing shmem file with the generic fops that do not check ashmem
> permissions like fops of ashmem do. If an mremap is done on the ashmem
> region, then the permission checks will be skipped. Fix that by disallowing
> mapping operation on the backing shmem file.

Reviewed-by: Joel Fernandes (Google) 

thanks!

 - Joel

> 
> Reported-by: Jann Horn 
> Signed-off-by: Suren Baghdasaryan 
> Cc: stable  # 4.4,4.9,4.14,4.18,5.4
> Signed-off-by: Todd Kjos 
> ---
>  drivers/staging/android/ashmem.c | 28 
>  1 file changed, 28 insertions(+)
> 
> v2: update commit message as suggested by joe...@google.com.
> 
> diff --git a/drivers/staging/android/ashmem.c 
> b/drivers/staging/android/ashmem.c
> index 74d497d39c5a..c6695354b123 100644
> --- a/drivers/staging/android/ashmem.c
> +++ b/drivers/staging/android/ashmem.c
> @@ -351,8 +351,23 @@ static inline vm_flags_t calc_vm_may_flags(unsigned long 
> prot)
>  _calc_vm_trans(prot, PROT_EXEC,  VM_MAYEXEC);
>  }
>  
> +static int ashmem_vmfile_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + /* do not allow to mmap ashmem backing shmem file directly */
> + return -EPERM;
> +}
> +
> +static unsigned long
> +ashmem_vmfile_get_unmapped_area(struct file *file, unsigned long addr,
> + unsigned long len, unsigned long pgoff,
> + unsigned long flags)
> +{
> + return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
> +}
> +
>  static int ashmem_mmap(struct file *file, struct vm_area_struct *vma)
>  {
> + static struct file_operations vmfile_fops;
>   struct ashmem_area *asma = file->private_data;
>   int ret = 0;
>  
> @@ -393,6 +408,19 @@ static int ashmem_mmap(struct file *file, struct 
> vm_area_struct *vma)
>   }
>   vmfile->f_mode |= FMODE_LSEEK;
>   asma->file = vmfile;
> + /*
> +  * override mmap operation of the vmfile so that it can't be
> +  * remapped which would lead to creation of a new vma with no
> +  * asma permission checks. Have to override get_unmapped_area
> +  * as well to prevent VM_BUG_ON check for f_ops modification.
> +  */
> + if (!vmfile_fops.mmap) {
> + vmfile_fops = *vmfile->f_op;
> + vmfile_fops.mmap = ashmem_vmfile_mmap;
> + vmfile_fops.get_unmapped_area =
> + ashmem_vmfile_get_unmapped_area;
> + }
> + vmfile->f_op = &vmfile_fops;
>   }
>   get_file(asma->file);
>  
> -- 
> 2.25.0.341.g760bfbb309-goog
> 


Re: [PATCH] binder: prevent UAF read in print_binder_transaction_log_entry()

2019-10-09 Thread Joel Fernandes
On Wed, Oct 09, 2019 at 05:10:45PM +0200, Christian Brauner wrote:
> On Wed, Oct 09, 2019 at 10:55:58AM -0400, Joel Fernandes wrote:
> > On Wed, Oct 09, 2019 at 04:29:11PM +0200, Christian Brauner wrote:
> > > On Wed, Oct 09, 2019 at 10:21:29AM -0400, Joel Fernandes wrote:
> > > > On Wed, Oct 09, 2019 at 12:40:12PM +0200, Christian Brauner wrote:
> > > > > On Tue, Oct 08, 2019 at 02:05:16PM -0400, Joel Fernandes wrote:
> > > > > > On Tue, Oct 08, 2019 at 03:01:59PM +0200, Christian Brauner wrote:
> > > > > > > When a binder transaction is initiated on a binder device coming 
> > > > > > > from a
> > > > > > > binderfs instance, a pointer to the name of the binder device is 
> > > > > > > stashed
> > > > > > > in the binder_transaction_log_entry's context_name member. Later 
> > > > > > > on it
> > > > > > > is used to print the name in 
> > > > > > > print_binder_transaction_log_entry(). By
> > > > > > > the time print_binder_transaction_log_entry() accesses 
> > > > > > > context_name
> > > > > > > binderfs_evict_inode() might have already freed the associated 
> > > > > > > memory
> > > > > > > thereby causing a UAF. Do the simple thing and prevent this by 
> > > > > > > copying
> > > > > > > the name of the binder device instead of stashing a pointer to it.
> > > > > > > 
> > > > > > > Reported-by: Jann Horn 
> > > > > > > Fixes: 03e2e07e3814 ("binder: Make transaction_log available in 
> > > > > > > binderfs")
> > > > > > > Link: 
> > > > > > > https://lore.kernel.org/r/cag48ez14q0-f8lqsvcnbyr2o6gpw8shxsm4u5jmd9mpstem...@mail.gmail.com
> > > > > > > Cc: Joel Fernandes 
> > > > > > > Cc: Todd Kjos 
> > > > > > > Cc: Hridya Valsaraju 
> > > > > > > Signed-off-by: Christian Brauner 
> > > > > > > ---
> > > > > > >  drivers/android/binder.c  | 4 +++-
> > > > > > >  drivers/android/binder_internal.h | 2 +-
> > > > > > >  2 files changed, 4 insertions(+), 2 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> > > > > > > index c0a491277aca..5b9ac2122e89 100644
> > > > > > > --- a/drivers/android/binder.c
> > > > > > > +++ b/drivers/android/binder.c
> > > > > > > @@ -57,6 +57,7 @@
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > > +#include 
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > > @@ -66,6 +67,7 @@
> > > > > > >  #include 
> > > > > > >  
> > > > > > >  #include 
> > > > > > > +#include 
> > > > > > >  
> > > > > > >  #include 
> > > > > > >  
> > > > > > > @@ -2876,7 +2878,7 @@ static void binder_transaction(struct 
> > > > > > > binder_proc *proc,
> > > > > > >   e->target_handle = tr->target.handle;
> > > > > > >   e->data_size = tr->data_size;
> > > > > > >   e->offsets_size = tr->offsets_size;
> > > > > > > - e->context_name = proc->context->name;
> > > > > > > + strscpy(e->context_name, proc->context->name, 
> > > > > > > BINDERFS_MAX_NAME);
> > > > > > 
> > > > > > Strictly speaking, proc->context->name can also be initialized for !BINDERFS,
> > > > > > so the BINDERFS in the MAX_NAME macro is misleading. So probably there should
> > > > > > be a BINDER_MAX_NAME (and associated checks for whether non-BINDERFS names
> > > > > > fit within the MAX).
> > > > > 
> > > > > I know but I don't think it's worth special-casing non-binderfs 
> > > > > devices.
> > > > > First, non-binderfs devices can only be

Re: [PATCH] binder: prevent UAF read in print_binder_transaction_log_entry()

2019-10-09 Thread Joel Fernandes
On Wed, Oct 09, 2019 at 04:29:11PM +0200, Christian Brauner wrote:
> On Wed, Oct 09, 2019 at 10:21:29AM -0400, Joel Fernandes wrote:
> > On Wed, Oct 09, 2019 at 12:40:12PM +0200, Christian Brauner wrote:
> > > On Tue, Oct 08, 2019 at 02:05:16PM -0400, Joel Fernandes wrote:
> > > > On Tue, Oct 08, 2019 at 03:01:59PM +0200, Christian Brauner wrote:
> > > > > When a binder transaction is initiated on a binder device coming from 
> > > > > a
> > > > > binderfs instance, a pointer to the name of the binder device is 
> > > > > stashed
> > > > > in the binder_transaction_log_entry's context_name member. Later on it
> > > > > is used to print the name in print_binder_transaction_log_entry(). By
> > > > > the time print_binder_transaction_log_entry() accesses context_name
> > > > > binderfs_evict_inode() might have already freed the associated memory
> > > > > thereby causing a UAF. Do the simple thing and prevent this by copying
> > > > > the name of the binder device instead of stashing a pointer to it.
> > > > > 
> > > > > Reported-by: Jann Horn 
> > > > > Fixes: 03e2e07e3814 ("binder: Make transaction_log available in 
> > > > > binderfs")
> > > > > Link: 
> > > > > https://lore.kernel.org/r/cag48ez14q0-f8lqsvcnbyr2o6gpw8shxsm4u5jmd9mpstem...@mail.gmail.com
> > > > > Cc: Joel Fernandes 
> > > > > Cc: Todd Kjos 
> > > > > Cc: Hridya Valsaraju 
> > > > > Signed-off-by: Christian Brauner 
> > > > > ---
> > > > >  drivers/android/binder.c  | 4 +++-
> > > > >  drivers/android/binder_internal.h | 2 +-
> > > > >  2 files changed, 4 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> > > > > index c0a491277aca..5b9ac2122e89 100644
> > > > > --- a/drivers/android/binder.c
> > > > > +++ b/drivers/android/binder.c
> > > > > @@ -57,6 +57,7 @@
> > > > >  #include 
> > > > >  #include 
> > > > >  #include 
> > > > > +#include 
> > > > >  #include 
> > > > >  #include 
> > > > >  #include 
> > > > > @@ -66,6 +67,7 @@
> > > > >  #include 
> > > > >  
> > > > >  #include 
> > > > > +#include 
> > > > >  
> > > > >  #include 
> > > > >  
> > > > > @@ -2876,7 +2878,7 @@ static void binder_transaction(struct 
> > > > > binder_proc *proc,
> > > > >   e->target_handle = tr->target.handle;
> > > > >   e->data_size = tr->data_size;
> > > > >   e->offsets_size = tr->offsets_size;
> > > > > - e->context_name = proc->context->name;
> > > > > + strscpy(e->context_name, proc->context->name, 
> > > > > BINDERFS_MAX_NAME);
> > > > 
> > > > Strictly speaking, proc->context->name can also be initialized for !BINDERFS,
> > > > so the BINDERFS in the MAX_NAME macro is misleading. So probably there should
> > > > be a BINDER_MAX_NAME (and associated checks for whether non-BINDERFS names
> > > > fit within the MAX).
> > > 
> > > I know but I don't think it's worth special-casing non-binderfs devices.
> > > First, non-binderfs devices can only be created through a KCONFIG option
> > > determined at compile time. For stock Android the names are the same for
> > > all vendors afaik.
> > 
> > I am just talking about the name of the weirdly named macro here.
> 
> You might miss context here: It's named that way because currently only
> binderfs binder devices are bound to that limit. That's a point I made
> further below in my previous mail. Non-binderfs devices are not subject
> to that restriction, and when we tried to make them subject to the same,
> it was rejected.

I know that. I am saying the memcpy is happening for regular binder devices
as well, but the macro has BINDERFS in the name. That's all. It is not a
significant eyesore, but it is a bit odd.

> 
> 
> > 
> > > Fifth, I already tried to push for validation of non-binderfs binder
> > > devices a while back when I wrote binderfs and was told that it's not
> > > needed. Hrydia t

Re: [PATCH] binder: prevent UAF read in print_binder_transaction_log_entry()

2019-10-09 Thread Joel Fernandes
On Wed, Oct 09, 2019 at 12:40:12PM +0200, Christian Brauner wrote:
> On Tue, Oct 08, 2019 at 02:05:16PM -0400, Joel Fernandes wrote:
> > On Tue, Oct 08, 2019 at 03:01:59PM +0200, Christian Brauner wrote:
> > > When a binder transaction is initiated on a binder device coming from a
> > > binderfs instance, a pointer to the name of the binder device is stashed
> > > in the binder_transaction_log_entry's context_name member. Later on it
> > > is used to print the name in print_binder_transaction_log_entry(). By
> > > the time print_binder_transaction_log_entry() accesses context_name
> > > binderfs_evict_inode() might have already freed the associated memory
> > > thereby causing a UAF. Do the simple thing and prevent this by copying
> > > the name of the binder device instead of stashing a pointer to it.
> > > 
> > > Reported-by: Jann Horn 
> > > Fixes: 03e2e07e3814 ("binder: Make transaction_log available in binderfs")
> > > Link: 
> > > https://lore.kernel.org/r/cag48ez14q0-f8lqsvcnbyr2o6gpw8shxsm4u5jmd9mpstem...@mail.gmail.com
> > > Cc: Joel Fernandes 
> > > Cc: Todd Kjos 
> > > Cc: Hridya Valsaraju 
> > > Signed-off-by: Christian Brauner 
> > > ---
> > >  drivers/android/binder.c  | 4 +++-
> > >  drivers/android/binder_internal.h | 2 +-
> > >  2 files changed, 4 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> > > index c0a491277aca..5b9ac2122e89 100644
> > > --- a/drivers/android/binder.c
> > > +++ b/drivers/android/binder.c
> > > @@ -57,6 +57,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -66,6 +67,7 @@
> > >  #include 
> > >  
> > >  #include 
> > > +#include 
> > >  
> > >  #include 
> > >  
> > > @@ -2876,7 +2878,7 @@ static void binder_transaction(struct binder_proc 
> > > *proc,
> > >   e->target_handle = tr->target.handle;
> > >   e->data_size = tr->data_size;
> > >   e->offsets_size = tr->offsets_size;
> > > - e->context_name = proc->context->name;
> > > + strscpy(e->context_name, proc->context->name, BINDERFS_MAX_NAME);
> > 
> > Strictly speaking, proc->context->name can also be initialized for !BINDERFS,
> > so the BINDERFS in the MAX_NAME macro is misleading. So probably there should
> > be a BINDER_MAX_NAME (and associated checks for whether non-BINDERFS names
> > fit within the MAX).
> 
> I know but I don't think it's worth special-casing non-binderfs devices.
> First, non-binderfs devices can only be created through a KCONFIG option
> determined at compile time. For stock Android the names are the same for
> all vendors afaik.

I am just talking about the name of the weirdly named macro here.

> Second, BINDERFS_MAX_NAME is set to the maximum path name component
> length that nearly all filesystems support (256 chars). If you exceed
> that then you run afoul of a bunch of other assumptions already and will
> cause trouble.

Again, just talking about the name.

> Third, even if there is someone crazy and uses more than 256 chars for a
> non-binderfs device at KCONFIG time strscpy will do the right thing and
> truncate and you'd see a truncated binder device name. This doesn't seem
> to be a big deal for a debugfs interface.

Sure I never said the patch has a bug.

> Fourth, the check for non-binderfs devices technically has nothing to do
> with this patch. This patch should really just do the minimal thing and
> fix the UAF. Which it does.

Again, never said the patch is buggy.

> Fifth, I already tried to push for validation of non-binderfs binder
> devices a while back when I wrote binderfs and was told that it's not
> needed. Hridya tried the same and we decided the same thing. So you get
> to be the next person to send a patch. :)

I don't follow why we are talking about non-binderfs validation. I am just
saying a memcpy of the name could have been avoided for regular binder
devices. But since Todd Acked it, I won't stand in the way.

> > One more thought, this can be made dependent on CONFIG_BINDERFS since 
> > regular
> > binder devices cannot be unregistered AFAICS and as Jann said, the problem 
> > is
> > BINDERFS specific. That way we avoid the memcpy for _every_ transaction.
> > These can be thundering when Android starts up.
> 
> Unless Todd sees this as a real perform

Re: [PATCH] binder: prevent UAF read in print_binder_transaction_log_entry()

2019-10-08 Thread Joel Fernandes
On Tue, Oct 08, 2019 at 03:01:59PM +0200, Christian Brauner wrote:
> When a binder transaction is initiated on a binder device coming from a
> binderfs instance, a pointer to the name of the binder device is stashed
> in the binder_transaction_log_entry's context_name member. Later on it
> is used to print the name in print_binder_transaction_log_entry(). By
> the time print_binder_transaction_log_entry() accesses context_name
> binderfs_evict_inode() might have already freed the associated memory
> thereby causing a UAF. Do the simple thing and prevent this by copying
> the name of the binder device instead of stashing a pointer to it.
> 
> Reported-by: Jann Horn 
> Fixes: 03e2e07e3814 ("binder: Make transaction_log available in binderfs")
> Link: 
> https://lore.kernel.org/r/cag48ez14q0-f8lqsvcnbyr2o6gpw8shxsm4u5jmd9mpstem...@mail.gmail.com
> Cc: Joel Fernandes 
> Cc: Todd Kjos 
> Cc: Hridya Valsaraju 
> Signed-off-by: Christian Brauner 
> ---
>  drivers/android/binder.c  | 4 +++-
>  drivers/android/binder_internal.h | 2 +-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> index c0a491277aca..5b9ac2122e89 100644
> --- a/drivers/android/binder.c
> +++ b/drivers/android/binder.c
> @@ -57,6 +57,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -66,6 +67,7 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -2876,7 +2878,7 @@ static void binder_transaction(struct binder_proc *proc,
>   e->target_handle = tr->target.handle;
>   e->data_size = tr->data_size;
>   e->offsets_size = tr->offsets_size;
> - e->context_name = proc->context->name;
> + strscpy(e->context_name, proc->context->name, BINDERFS_MAX_NAME);

Strictly speaking, proc->context->name can also be initialized for !BINDERFS,
so the BINDERFS in the MAX_NAME macro is misleading. So probably there should
be a BINDER_MAX_NAME (and associated checks for whether non-BINDERFS names
fit within the MAX).

>   if (reply) {

>   binder_inner_proc_lock(proc);
> diff --git a/drivers/android/binder_internal.h 
> b/drivers/android/binder_internal.h
> index bd47f7f72075..ae991097d14d 100644
> --- a/drivers/android/binder_internal.h
> +++ b/drivers/android/binder_internal.h
> @@ -130,7 +130,7 @@ struct binder_transaction_log_entry {
>   int return_error_line;
>   uint32_t return_error;
>   uint32_t return_error_param;
> - const char *context_name;
> + char context_name[BINDERFS_MAX_NAME + 1];

Same comment here, context_name can be used for non-BINDERFS transactions as
well such as default binder devices.

One more thought, this can be made dependent on CONFIG_BINDERFS since regular
binder devices cannot be unregistered AFAICS and as Jann said, the problem is
BINDERFS specific. That way we avoid the memcpy for _every_ transaction.
These can be thundering when Android starts up.
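A sketch of the CONFIG-dependent variant suggested above, purely illustrative
(both the struct member and the assignment would have to be conditional; the
fix as posted copies unconditionally):

/* in binder_internal.h (sketch): */
struct binder_transaction_log_entry {
	/* ... other members ... */
#ifdef CONFIG_ANDROID_BINDERFS
	char context_name[BINDERFS_MAX_NAME + 1];	/* binderfs names can be freed */
#else
	const char *context_name;			/* built-in devices never go away */
#endif
};

/* in binder_transaction() (sketch): */
#ifdef CONFIG_ANDROID_BINDERFS
	strscpy(e->context_name, proc->context->name, sizeof(e->context_name));
#else
	e->context_name = proc->context->name;
#endif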

(I secretly wish C strings could be refcounted to avoid exactly this issue;
that should not be hard to develop, but I am not sure if it is worth it for
this problem :) - for one, it would avoid having to do the strcpy for _every_
transaction).
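For what it's worth, a refcounted name along those lines could be sketched
with kref; everything below is illustrative only and none of these names
exist in the kernel:

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/string.h>

struct ref_name {
	struct kref ref;
	char name[];			/* flexible array holding the string */
};

static struct ref_name *ref_name_create(const char *s)
{
	struct ref_name *rn = kmalloc(sizeof(*rn) + strlen(s) + 1, GFP_KERNEL);

	if (!rn)
		return NULL;
	kref_init(&rn->ref);
	strcpy(rn->name, s);
	return rn;
}

static void ref_name_release(struct kref *kref)
{
	kfree(container_of(kref, struct ref_name, ref));
}

/*
 * A log entry would then take a reference instead of copying the bytes:
 *	kref_get(&dev_name->ref);
 *	e->context_name = dev_name;
 * and drop it with kref_put(&e->context_name->ref, ref_name_release) when
 * the entry is overwritten or the device goes away.
 */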

Other than these nits, please add my tag on whichever is the final solution:

Reviewed-by: Joel Fernandes (Google) 

thanks,

 - Joel


>  };
>  
>  struct binder_transaction_log {
> -- 
> 2.23.0
> 


[PATCH] binder: Fix comment headers on binder_alloc_prepare_to_free()

2019-09-30 Thread Joel Fernandes (Google)
binder_alloc_buffer_lookup() doesn't exist and is named
"binder_alloc_prepare_to_free()". Correct the code comments to reflect
this.

Signed-off-by: Joel Fernandes (Google) 

---
 drivers/android/binder_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index 6d79a1b0d446..d42a8b2f636a 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -156,7 +156,7 @@ static struct binder_buffer 
*binder_alloc_prepare_to_free_locked(
 }
 
 /**
- * binder_alloc_buffer_lookup() - get buffer given user ptr
+ * binder_alloc_prepare_to_free() - get buffer given user ptr
  * @alloc: binder_alloc for this proc
  * @user_ptr:  User pointer to buffer data
  *
-- 
2.23.0.444.g18eeb5a265-goog



Re: [PATCH v3 0/4] Add binder state and statistics to binderfs

2019-09-04 Thread Joel Fernandes
On September 4, 2019 7:19:35 AM EDT, Christian Brauner  wrote:
>On Tue, Sep 03, 2019 at 09:16:51AM -0700, Hridya Valsaraju wrote:
>> Currently, the only way to access binder state and
>> statistics is through debugfs. We need a way to
>> access the same even when debugfs is not mounted.
>> These patches add a mount option to make this
>> information available in binderfs without affecting
>> its presence in debugfs. The following debugfs nodes
>> will be made available in a binderfs instance when
>> mounted with the mount option 'stats=global' or 'stats=local'.
>>
>>  /sys/kernel/debug/binder/failed_transaction_log
>>  /sys/kernel/debug/binder/proc
>>  /sys/kernel/debug/binder/state
>>  /sys/kernel/debug/binder/stats
>>  /sys/kernel/debug/binder/transaction_log
>>  /sys/kernel/debug/binder/transactions
>
>Acked-by: Christian Brauner 
>
>Btw, I think your counting is off-by-one. :) We usually count the
>initial send of a series as 0 and the first rework of that series as
>v1.
>I think you counted your initial send as v1 and the first rework as v2.

Which is fine. I have done it both ways. Is this a rule written somewhere?

>:)
>

If I am not mistaken, this is Hridya's first set of kernel patches.
Congrats on landing it upstream and to everyone for reviews! (assuming
nothing falls apart on the way to Linus' tree).

thanks,

- Joel

[TLDR]
My first kernel patch was 10 years ago to a WiFi driver when I was an
intern at University. I was thrilled to have fixed a bug in network
bridging code in the 802.11s stack. This is always a special moment so
congrats again! ;-)





>Christian
>
>>
>> Hridya Valsaraju (4):
>>   binder: add a mount option to show global stats
>>   binder: Add stats, state and transactions files
>>   binder: Make transaction_log available in binderfs
>>   binder: Add binder_proc logging to binderfs
>>
>>  drivers/android/binder.c  |  95 ++-
>>  drivers/android/binder_internal.h |  84 ++
>>  drivers/android/binderfs.c| 255
>++
>>  3 files changed, 362 insertions(+), 72 deletions(-)
>>
>> --
>> 2.23.0.187.g17f5b7556c-goog
>>


Re: [PATCH] binder: Use kmem_cache for binder_thread

2019-08-29 Thread Joel Fernandes
On Thu, Aug 29, 2019 at 08:42:29AM +0200, Greg KH wrote:
> On Thu, Aug 29, 2019 at 01:49:53PM +0800, Peikan Tsai wrote:
[snip] 
> > The allocated size for each binder_thread is 512 bytes by kzalloc.
> > Because the size of binder_thread is fixed and it's only 304 bytes.
> > It will save 208 bytes per binder_thread when we create a kmem_cache
> > for the binder_thread.
> 
> Are you _sure_ it really will save that much memory?  You want to do
> allocations based on a nice alignment for lots of good reasons,
> especially for something that needs quick accesses.

Alignment can be done for slab allocations; kmem_cache_create() takes an
align argument. I am not sure what the default alignment of objects is,
though (probably no default alignment). What is an optimal alignment in your
view?
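(For reference, the align argument is the third parameter of
kmem_cache_create(); below is a hedged sketch of how the cache in this patch
could request natural/cache-line alignment, not something the patch actually
does:)

	binder_thread_cachep = kmem_cache_create("binder_thread",
						 sizeof(struct binder_thread),
						 __alignof__(struct binder_thread),
						 SLAB_HWCACHE_ALIGN, NULL);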

> Did you test your change on a system that relies on binder and find any
> speed improvement or decrease, and any actual memory savings?
> 
> If so, can you post your results?

That's certainly worth it and I thought of asking for the same, but spoke too
soon!

Independent note: In general, I find the internal fragmentation with large
kmalloc()s troubling in the kernel :-(. Say you have 5000 objects in 512-byte
allocations, each 300 bytes. 212 * 5000 is around 1MB, which is arguably not
negligible on a small-memory system, right?

thanks,

 - Joel



Re: [PATCH] binder: Use kmem_cache for binder_thread

2019-08-29 Thread Joel Fernandes
On Thu, Aug 29, 2019 at 01:49:53PM +0800, Peikan Tsai wrote:
> Hi,
> 
> The allocated size for each binder_thread is 512 bytes by kzalloc.
> Because the size of binder_thread is fixed and it's only 304 bytes.
> It will save 208 bytes per binder_thread when we create a kmem_cache
> for the binder_thread.

Awesome change and observation!!!

Reviewed-by: Joel Fernandes (Google) 

(Another thought: how did you discover this? Are you using some tools to look
into slab fragmentation?).

thanks,

 - Joel

> Signed-off-by: Peikan Tsai 
> ---
>  drivers/android/binder.c | 16 +---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> index dc1c83eafc22..043e0ebd0fe7 100644
> --- a/drivers/android/binder.c
> +++ b/drivers/android/binder.c
> @@ -87,6 +87,8 @@ static struct dentry *binder_debugfs_dir_entry_root;
>  static struct dentry *binder_debugfs_dir_entry_proc;
>  static atomic_t binder_last_id;
> 
> +static struct kmem_cache *binder_thread_cachep;
> +
>  static int proc_show(struct seq_file *m, void *unused);
>  DEFINE_SHOW_ATTRIBUTE(proc);
> 
> @@ -4696,14 +4698,15 @@ static struct binder_thread *binder_get_thread(struct 
> binder_proc *proc)
>   thread = binder_get_thread_ilocked(proc, NULL);
>   binder_inner_proc_unlock(proc);
>   if (!thread) {
> - new_thread = kzalloc(sizeof(*thread), GFP_KERNEL);
> + new_thread = kmem_cache_zalloc(binder_thread_cachep,
> +GFP_KERNEL);
>   if (new_thread == NULL)
>   return NULL;
>   binder_inner_proc_lock(proc);
>   thread = binder_get_thread_ilocked(proc, new_thread);
>   binder_inner_proc_unlock(proc);
>   if (thread != new_thread)
> - kfree(new_thread);
> + kmem_cache_free(binder_thread_cachep, new_thread);
>   }
>   return thread;
>  }
> @@ -4723,7 +4726,7 @@ static void binder_free_thread(struct binder_thread 
> *thread)
>   BUG_ON(!list_empty(&thread->todo));
>   binder_stats_deleted(BINDER_STAT_THREAD);
>   binder_proc_dec_tmpref(thread->proc);
> - kfree(thread);
> + kmem_cache_free(binder_thread_cachep, thread);
>  }
> 
>  static int binder_thread_release(struct binder_proc *proc,
> @@ -6095,6 +6098,12 @@ static int __init binder_init(void)
>   if (ret)
>   return ret;
> 
> + binder_thread_cachep = kmem_cache_create("binder_thread",
> +  sizeof(struct binder_thread),
> +  0, 0, NULL);
> + if (!binder_thread_cachep)
> + return -ENOMEM;
> +
>   atomic_set(&binder_transaction_log.cur, ~0U);
>   atomic_set(&binder_transaction_log_failed.cur, ~0U);
> 
> @@ -6167,6 +6176,7 @@ static int __init binder_init(void)
> 
>  err_alloc_device_names_failed:
>   debugfs_remove_recursive(binder_debugfs_dir_entry_root);
> + kmem_cache_destroy(binder_thread_cachep);
> 
>   return ret;
>  }
> --
> 2.17.1
> 


Re: [PATCH v3 1/2] binder: Add default binder devices through binderfs when configured

2019-08-15 Thread Joel Fernandes
On Thu, Aug 08, 2019 at 03:27:25PM -0700, Hridya Valsaraju wrote:
> Currently, since each binderfs instance needs its own
> private binder devices, every time a binderfs instance is
> mounted, all the default binder devices need to be created
> via the BINDER_CTL_ADD IOCTL. This patch aims to
> add a solution to automatically create the default binder
> devices for each binderfs instance that gets mounted.
> To achieve this goal, when CONFIG_ANDROID_BINDERFS is set,
> the default binder devices specified by CONFIG_ANDROID_BINDER_DEVICES
> are created in each binderfs instance instead of global devices
> being created by the binder driver.
> 
> Co-developed-by: Christian Brauner 
> Signed-off-by: Christian Brauner 
> Signed-off-by: Hridya Valsaraju 

Reviewed-by: Joel Fernandes (Google) 

thanks,

 - Joel

> ---
> 
> Changes in v2:
> - Updated commit message as per Greg Kroah-Hartman.
> - Removed new module parameter creation as per Greg
>   Kroah-Hartman/Christian Brauner.
> - Refactored device name length check into a new patch as per Greg 
> Kroah-Hartman.
> 
> Changes in v3:
> -Removed unnecessary empty lines as per Dan Carpenter.
> 
>  drivers/android/binder.c  |  5 +++--
>  drivers/android/binder_internal.h |  2 ++
>  drivers/android/binderfs.c| 23 ---
>  3 files changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> index 466b6a7f8ab7..ca6b21a53321 100644
> --- a/drivers/android/binder.c
> +++ b/drivers/android/binder.c
> @@ -123,7 +123,7 @@ static uint32_t binder_debug_mask = 
> BINDER_DEBUG_USER_ERROR |
>   BINDER_DEBUG_FAILED_TRANSACTION | BINDER_DEBUG_DEAD_TRANSACTION;
>  module_param_named(debug_mask, binder_debug_mask, uint, 0644);
>  
> -static char *binder_devices_param = CONFIG_ANDROID_BINDER_DEVICES;
> +char *binder_devices_param = CONFIG_ANDROID_BINDER_DEVICES;
>  module_param_named(devices, binder_devices_param, charp, 0444);
>  
>  static DECLARE_WAIT_QUEUE_HEAD(binder_user_error_wait);
> @@ -6279,7 +6279,8 @@ static int __init binder_init(void)
>   &transaction_log_fops);
>   }
>  
> - if (strcmp(binder_devices_param, "") != 0) {
> + if (!IS_ENABLED(CONFIG_ANDROID_BINDERFS) &&
> + strcmp(binder_devices_param, "") != 0) {
>   /*
>   * Copy the module_parameter string, because we don't want to
>   * tokenize it in-place.
> diff --git a/drivers/android/binder_internal.h 
> b/drivers/android/binder_internal.h
> index 045b3e42d98b..fe8c745dc8e0 100644
> --- a/drivers/android/binder_internal.h
> +++ b/drivers/android/binder_internal.h
> @@ -37,6 +37,8 @@ struct binder_device {
>  
>  extern const struct file_operations binder_fops;
>  
> +extern char *binder_devices_param;
> +
>  #ifdef CONFIG_ANDROID_BINDERFS
>  extern bool is_binderfs_device(const struct inode *inode);
>  #else
> diff --git a/drivers/android/binderfs.c b/drivers/android/binderfs.c
> index e773f45d19d9..aee46dd1be91 100644
> --- a/drivers/android/binderfs.c
> +++ b/drivers/android/binderfs.c
> @@ -186,8 +186,7 @@ static int binderfs_binder_device_create(struct inode 
> *ref_inode,
>   req->major = MAJOR(binderfs_dev);
>   req->minor = minor;
>  
> - ret = copy_to_user(userp, req, sizeof(*req));
> - if (ret) {
> + if (userp && copy_to_user(userp, req, sizeof(*req))) {
>   ret = -EFAULT;
>   goto err;
>   }
> @@ -467,6 +466,9 @@ static int binderfs_fill_super(struct super_block *sb, 
> void *data, int silent)
>   int ret;
>   struct binderfs_info *info;
>   struct inode *inode = NULL;
> + struct binderfs_device device_info = { 0 };
> + const char *name;
> + size_t len;
>  
>   sb->s_blocksize = PAGE_SIZE;
>   sb->s_blocksize_bits = PAGE_SHIFT;
> @@ -521,7 +523,22 @@ static int binderfs_fill_super(struct super_block *sb, 
> void *data, int silent)
>   if (!sb->s_root)
>   return -ENOMEM;
>  
> - return binderfs_binder_ctl_create(sb);
> + ret = binderfs_binder_ctl_create(sb);
> + if (ret)
> + return ret;
> +
> + name = binder_devices_param;
> + for (len = strcspn(name, ","); len > 0; len = strcspn(name, ",")) {
> + strscpy(device_info.name, name, len + 1);
> + ret = binderfs_binder_device_create(inode, NULL, &device_info);
> + if (ret)
> + return ret;
> + name += len;
> + if (*name == ',')
> + name++;
> + }
> +
> + return 0;
>  }
>  
>  static struct dentry *binderfs_mount(struct file_system_type *fs_type,
> -- 
> 2.22.0.770.g0f2c4a37fd-goog
> 


Re: [PATCH v3 1/2] binder: Add default binder devices through binderfs when configured

2019-08-15 Thread Joel Fernandes
On Thu, Aug 08, 2019 at 03:27:25PM -0700, Hridya Valsaraju wrote:
> Currently, since each binderfs instance needs its own
> private binder devices, every time a binderfs instance is
> mounted, all the default binder devices need to be created
> via the BINDER_CTL_ADD IOCTL. This patch aims to
> add a solution to automatically create the default binder
> devices for each binderfs instance that gets mounted.
> To achieve this goal, when CONFIG_ANDROID_BINDERFS is set,
> the default binder devices specified by CONFIG_ANDROID_BINDER_DEVICES
> are created in each binderfs instance instead of global devices
> being created by the binder driver.
> 
> Co-developed-by: Christian Brauner 
> Signed-off-by: Christian Brauner 
> Signed-off-by: Hridya Valsaraju 
> ---

Reviewed-by: Joel Fernandes (Google) 

thanks,

 - Joel

> 
> Changes in v2:
> - Updated commit message as per Greg Kroah-Hartman.
> - Removed new module parameter creation as per Greg
>   Kroah-Hartman/Christian Brauner.
> - Refactored device name length check into a new patch as per Greg 
> Kroah-Hartman.
> 
> Changes in v3:
> -Removed unnecessary empty lines as per Dan Carpenter.
> 
>  drivers/android/binder.c  |  5 +++--
>  drivers/android/binder_internal.h |  2 ++
>  drivers/android/binderfs.c| 23 ---
>  3 files changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> index 466b6a7f8ab7..ca6b21a53321 100644
> --- a/drivers/android/binder.c
> +++ b/drivers/android/binder.c
> @@ -123,7 +123,7 @@ static uint32_t binder_debug_mask = 
> BINDER_DEBUG_USER_ERROR |
>   BINDER_DEBUG_FAILED_TRANSACTION | BINDER_DEBUG_DEAD_TRANSACTION;
>  module_param_named(debug_mask, binder_debug_mask, uint, 0644);
>  
> -static char *binder_devices_param = CONFIG_ANDROID_BINDER_DEVICES;
> +char *binder_devices_param = CONFIG_ANDROID_BINDER_DEVICES;
>  module_param_named(devices, binder_devices_param, charp, 0444);
>  
>  static DECLARE_WAIT_QUEUE_HEAD(binder_user_error_wait);
> @@ -6279,7 +6279,8 @@ static int __init binder_init(void)
>   &transaction_log_fops);
>   }
>  
> - if (strcmp(binder_devices_param, "") != 0) {
> + if (!IS_ENABLED(CONFIG_ANDROID_BINDERFS) &&
> + strcmp(binder_devices_param, "") != 0) {
>   /*
>   * Copy the module_parameter string, because we don't want to
>   * tokenize it in-place.
> diff --git a/drivers/android/binder_internal.h 
> b/drivers/android/binder_internal.h
> index 045b3e42d98b..fe8c745dc8e0 100644
> --- a/drivers/android/binder_internal.h
> +++ b/drivers/android/binder_internal.h
> @@ -37,6 +37,8 @@ struct binder_device {
>  
>  extern const struct file_operations binder_fops;
>  
> +extern char *binder_devices_param;
> +
>  #ifdef CONFIG_ANDROID_BINDERFS
>  extern bool is_binderfs_device(const struct inode *inode);
>  #else
> diff --git a/drivers/android/binderfs.c b/drivers/android/binderfs.c
> index e773f45d19d9..aee46dd1be91 100644
> --- a/drivers/android/binderfs.c
> +++ b/drivers/android/binderfs.c
> @@ -186,8 +186,7 @@ static int binderfs_binder_device_create(struct inode 
> *ref_inode,
>   req->major = MAJOR(binderfs_dev);
>   req->minor = minor;
>  
> - ret = copy_to_user(userp, req, sizeof(*req));
> - if (ret) {
> + if (userp && copy_to_user(userp, req, sizeof(*req))) {
>   ret = -EFAULT;
>   goto err;
>   }
> @@ -467,6 +466,9 @@ static int binderfs_fill_super(struct super_block *sb, 
> void *data, int silent)
>   int ret;
>   struct binderfs_info *info;
>   struct inode *inode = NULL;
> + struct binderfs_device device_info = { 0 };
> + const char *name;
> + size_t len;
>  
>   sb->s_blocksize = PAGE_SIZE;
>   sb->s_blocksize_bits = PAGE_SHIFT;
> @@ -521,7 +523,22 @@ static int binderfs_fill_super(struct super_block *sb, 
> void *data, int silent)
>   if (!sb->s_root)
>   return -ENOMEM;
>  
> - return binderfs_binder_ctl_create(sb);
> + ret = binderfs_binder_ctl_create(sb);
> + if (ret)
> + return ret;
> +
> + name = binder_devices_param;
> + for (len = strcspn(name, ","); len > 0; len = strcspn(name, ",")) {
> + strscpy(device_info.name, name, len + 1);
> + ret = binderfs_binder_device_create(inode, NULL, &device_info);
> + if (ret)
> + return ret;
> + name += len;
> + if (*name == ',')
> + name++;
> + }
> +
> + return 0;
>  }
>  
>  static struct dentry *binderfs_mount(struct file_system_type *fs_type,
> -- 
> 2.22.0.770.g0f2c4a37fd-goog
> 
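
For readers less familiar with the strcspn() idiom in the hunk above, here is a
standalone userspace sketch of the same comma-separated walk. The device list
string and the printf() are made up for illustration; the real loop of course
runs in the kernel and calls binderfs_binder_device_create() instead.

#include <stdio.h>
#include <string.h>

#define BINDERFS_MAX_NAME 255	/* illustrative; the real limit lives in the binderfs UAPI header */

int main(void)
{
	/* stand-in for CONFIG_ANDROID_BINDER_DEVICES, e.g. "binder,hwbinder,vndbinder" */
	const char *name = "binder,hwbinder,vndbinder";
	char device[BINDERFS_MAX_NAME + 1];
	size_t len;

	/* walk the comma-separated list the same way binderfs_fill_super() does */
	for (len = strcspn(name, ","); len > 0; len = strcspn(name, ",")) {
		if (len > BINDERFS_MAX_NAME)
			return 1;	/* mirrors the init_binderfs() length check */
		memcpy(device, name, len);
		device[len] = '\0';
		printf("would create device: %s\n", device);
		name += len;
		if (*name == ',')
			name++;
	}
	return 0;
}
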
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v3 2/2] binder: Validate the default binderfs device names.

2019-08-15 Thread Joel Fernandes
On Thu, Aug 08, 2019 at 03:27:26PM -0700, Hridya Valsaraju wrote:
> Length of a binderfs device name cannot exceed BINDERFS_MAX_NAME.
> This patch adds a check in binderfs_init() to ensure the same
> for the default binder devices that will be created in every
> binderfs instance.
> 
> Co-developed-by: Christian Brauner 
> Signed-off-by: Christian Brauner 
> Signed-off-by: Hridya Valsaraju 
> ---

Reviewed-by: Joel Fernandes (Google) 

thanks,

 - Joel

>  drivers/android/binderfs.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/android/binderfs.c b/drivers/android/binderfs.c
> index aee46dd1be91..55c5adb87585 100644
> --- a/drivers/android/binderfs.c
> +++ b/drivers/android/binderfs.c
> @@ -570,6 +570,18 @@ static struct file_system_type binder_fs_type = {
>  int __init init_binderfs(void)
>  {
>   int ret;
> + const char *name;
> + size_t len;
> +
> + /* Verify that the default binderfs device names are valid. */
> + name = binder_devices_param;
> + for (len = strcspn(name, ","); len > 0; len = strcspn(name, ",")) {
> + if (len > BINDERFS_MAX_NAME)
> + return -E2BIG;
> + name += len;
> + if (*name == ',')
> + name++;
> + }
>  
>   /* Allocate new major number for binderfs. */
>   ret = alloc_chrdev_region(&binderfs_dev, 0, BINDERFS_MAX_MINOR,
> -- 
> 2.22.0.770.g0f2c4a37fd-goog
> 
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: Limits for ION Memory Allocator

2019-07-26 Thread Joel Fernandes
On Wed, Jul 24, 2019 at 4:24 PM John Stultz  wrote:
>
> On Wed, Jul 24, 2019 at 1:18 PM John Stultz  wrote:
> >
> > On Wed, Jul 24, 2019 at 12:36 PM Laura Abbott  wrote:
> > >
> > > On 7/17/19 12:31 PM, Alexander Popov wrote:
> > > > Hello!
> > > >
> > > > The syzkaller [1] has a trouble with fuzzing the Linux kernel with ION 
> > > > Memory
> > > > Allocator.
> > > >
> > > > Syzkaller uses several methods [2] to limit memory consumption of the 
> > > > userspace
> > > > processes calling the syscalls for testing the kernel:
> > > >   - setrlimit(),
> > > >   - cgroups,
> > > >   - various sysctl.
> > > > But these methods don't work for ION Memory Allocator, so any userspace 
> > > > process
> > > > that has access to /dev/ion can bring the system to the out-of-memory 
> > > > state.
> > > >
> > > > An example of a program doing that:
> > > >
> > > >
> > > > #include <stdio.h>
> > > > #include <fcntl.h>
> > > > #include <unistd.h>
> > > > #include <sys/types.h>
> > > > #include <sys/stat.h>
> > > > #include <sys/ioctl.h>
> > > > #include <linux/types.h>
> > > > #define ION_IOC_MAGIC 'I'
> > > > #define ION_IOC_ALLOC _IOWR(ION_IOC_MAGIC, 0, \
> > > > struct ion_allocation_data)
> > > >
> > > > struct ion_allocation_data {
> > > >   __u64 len;
> > > >   __u32 heap_id_mask;
> > > >   __u32 flags;
> > > >   __u32 fd;
> > > >   __u32 unused;
> > > > };
> > > >
> > > > int main(void)
> > > > {
> > > >   unsigned long i = 0;
> > > >   int fd = -1;
> > > >   struct ion_allocation_data data = {
> > > >   .len = 0x13f65d8c,
> > > >   .heap_id_mask = 1,
> > > >   .flags = 0,
> > > >   .fd = -1,
> > > >   .unused = 0
> > > >   };
> > > >
> > > >   fd = open("/dev/ion", 0);
> > > >   if (fd == -1) {
> > > >   perror("[-] open /dev/ion");
> > > >   return 1;
> > > >   }
> > > >
> > > >   while (1) {
> > > >   printf("iter %lu\n", i);
> > > >   ioctl(fd, ION_IOC_ALLOC, &data);
> > > >   i++;
> > > >   }
> > > >
> > > >   return 0;
> > > > }
> > > >
> > > >
> > > > I looked through the code of ion_alloc() and didn't find any limit 
> > > > checks.
> > > > Is it currently possible to limit ION kernel allocations for some 
> > > > process?
> > > >
> > > > If not, is it a right idea to do that?
> > > > Thanks!
> > > >
> > >
> > > Yes, I do think that's the right approach. We're working on moving Ion
> > > out of staging and this is something I mentioned to John Stultz. I don't
> > > think we've thought too hard about how to do the actual limiting so
> > > suggestions are welcome.
> >
> > In part the dmabuf heaps allow for separate heap devices, so we can
> > have finer grained permissions to the specific heaps.  But that
> > doesn't provide any controls on how much memory one process could
> > allocate using the device if it has permission.
> >
> > I suspect the same issue is present with any of the dmabuf exporters
> > (gpu/display drivers, etc), so this is less of an ION/dmabuf heap
> > issue and more of a dmabuf core accounting issue.
> >
>
> Also, do unmapped memfd buffers have similar accounting issues?
>

The syzkaller bot didn't complain about this for memfd yet, so I suspect not ;-)

With memfd since it uses shmem underneath, __vm_enough_memory() is
called during shmem_acct_block() which should take per-process memory
into account already and fail if there is not enough memory.

Should ION be doing something similar to fail if there's not enough memory?
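
As a rough illustration only (the ion_*() helpers below are hypothetical and
this is not a patch), charging the buffer against the same overcommit
accounting could look something like:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/sched.h>
#include <linux/security.h>

/*
 * Hypothetical: charge an ION allocation against the overcommit accounting,
 * the way shmem_acct_block() does, so a runaway allocator gets -ENOMEM
 * instead of driving the whole system out of memory.
 */
static int ion_charge_pages(size_t len)
{
	unsigned long nr_pages = DIV_ROUND_UP(len, PAGE_SIZE);

	/* ends up in __vm_enough_memory() */
	if (security_vm_enough_memory_mm(current->mm, nr_pages))
		return -ENOMEM;
	return 0;
}

/* undo the charge when the buffer is freed */
static void ion_uncharge_pages(size_t len)
{
	vm_unacct_memory(DIV_ROUND_UP(len, PAGE_SIZE));
}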

thanks,

- Joel
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Joel Fernandes
On Tue, May 07, 2019 at 09:28:47AM -0700, Suren Baghdasaryan wrote:
> From: Christian Brauner 
> Date: Tue, May 7, 2019 at 3:58 AM
> To: Sultan Alsawaf
> Cc: Greg Kroah-Hartman, open list:ANDROID DRIVERS, Daniel Colascione,
> Todd Kjos, Kees Cook, Peter Zijlstra, Martijn Coenen, LKML, Tim
> Murray, Michal Hocko, Suren Baghdasaryan, linux-mm, Arve Hjønnevåg,
> Ingo Molnar, Steven Rostedt, Oleg Nesterov, Joel Fernandes, Andy
> Lutomirski, kernel-team
> 
> > On Tue, May 07, 2019 at 01:12:36AM -0700, Sultan Alsawaf wrote:
> > > On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote:
> > > > Given that any "new" android device that gets shipped "soon" should be
> > > > using 4.9.y or newer, is this a real issue?
> > >
> > > It's certainly a real issue for those who can't buy brand new Android 
> > > devices
> > > without software bugs every six months :)
> > >
> 
> Hi Sultan,
> Looks like you are posting this patch for devices that do not use
> userspace LMKD solution due to them using older kernels or due to
> their vendors sticking to in-kernel solution. If so, I see couple
> logistical issues with this patch. I don't see it being adopted in
> upstream kernel 5.x since it re-implements a deprecated mechanism even
> though vendors still use it. Vendors on the other hand, will not adopt
> it until you show evidence that it works way better than what
> lowmemorykilled driver does now. You would have to provide measurable
> data and explain your tests before they would consider spending time
> on this.
> On the implementation side I'm not convinced at all that this would
> work better on all devices and in all circumstances. We had cases when
> a new mechanism would show very good results until one usecase
> completely broke it. Bulk killing of processes that you are doing in
> your patch was a very good example of such a decision which later on
> we had to rethink. That's why baking these policies into kernel is
> very problematic. Another problem I see with the implementation that
> it ties process killing with the reclaim scan depth. It's very similar
> to how vmpressure works and vmpressure in my experience is very
> unpredictable.

Yeah it does seem conceptually similar, good point.
 
> > > Regardless, even if PSI were backported, a full-fledged LMKD using it has 
> > > yet to
> > > be made, so it wouldn't be of much use now.
> >
> > This is work that is ongoing and requires kernel changes to make it
> > feasible. One of the things that I have been working on for quite a
> > while is the whole file descriptor for processes thing that is important
> > for LMKD (Even though I never thought about this use-case when I started
> > pitching this.). Joel and Daniel have joined in and are working on
> > making LMKD possible.
> > What I find odd is that every couple of weeks different solutions to the
> > low memory problem are pitched. There is simple_lmk, there is LMKD, and
> > there was a patchset that wanted to speed up memory reclaim at process
> > kill-time by adding a new flag to the new pidfd_send_signal() syscall.
> > That all seems - though related - rather uncoordinated.
> 
> I'm not sure why pidfd_wait and expedited reclaim is seen as
> uncoordinated effort. All of them are done to improve userspace LMKD.

Christian, pidfd_wait and expedited reclaim are both coordinated efforts and
solve different problems related to LMK. simple_lmk is entirely different
effort that we already hesitated about when it was first posted, now we
hesitate again due to the issues Suren and others mentioned.

I think it is a better idea for Sultan to spend his time on using/improving
PSI/LMKd than spending it on the simple_lmk. It could also be a good topic to
discuss in the Android track of the Linux plumbers conference.

thanks,

 - Joel

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: pidfd design

2019-03-20 Thread Joel Fernandes
On Wed, Mar 20, 2019 at 07:51:57PM +0100, Christian Brauner wrote:
[snip]
> > > translate_pid() should just return you a pidfd. Having it return a pidfd
> > > and a status fd feels like stuffing too much functionality in there. If
> > > you're fine with it I'll finish prototyping what I had in mind. As I
> > > said in previous mails I'm already working on this.
> > 
> > translate_pid also needs to *accept* pidfds, at least optionally.
> > Unless you have a function from pidfd to pidfd, you race.
> 
> You're misunderstanding. Again, I said in my previous mails it should
> accept pidfds optionally as arguments, yes. But I don't want it to
> return the status fds that you previously wanted pidfd_wait() to return.
> I really want to see Joel's pidfd_wait() patchset and have more people
> review the actual code.

No problem, pidfd_wait is also fine with me and we can change it later to
translate_pid or something else if needed.

Agreed, let's get to some code writing now that (I hope) we are all on the
same page, and discuss actual code.

 - Joel

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: pidfd design

2019-03-20 Thread Joel Fernandes
On Wed, Mar 20, 2019 at 07:26:50PM +0100, Christian Brauner wrote:
> On Wed, Mar 20, 2019 at 07:33:51AM -0400, Joel Fernandes wrote:
> > 
> > 
> > On March 20, 2019 3:02:32 AM EDT, Daniel Colascione  
> > wrote:
> > >On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner
> > > wrote:
> > >>
> > >> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote:
> > >> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes
> > > wrote:
> > >> > >
> > >> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner
> > >wrote:
> > >> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione
> > >wrote:
> > >> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner
> > > wrote:
> > >> > > > > > So I dislike the idea of allocating new inodes from the
> > >procfs super
> > >> > > > > > block. I would like to avoid pinning the whole pidfd
> > >concept exclusively
> > >> > > > > > to proc. The idea is that the pidfd API will be useable
> > >through procfs
> > >> > > > > > via open("/proc/<pid>") because that is what users expect
> > >and really
> > >> > > > > > wanted to have for a long time. So it makes sense to have
> > >this working.
> > >> > > > > > But it should really be useable without it. That's why
> > >translate_pid()
> > >> > > > > > and pidfd_clone() are on the table.  What I'm saying is,
> > >once the pidfd
> > >> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N
> > >- even
> > >> > > > > > though that's crazy - and still be able to use pidfds. This
> > >is also a
> > >> > > > > > point akpm asked about when I did the pidfd_send_signal
> > >work.
> > >> > > > >
> > >> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use
> > >pidfds. One
> > >> > > > > crazy idea that I was discussing with Joel the other day is
> > >to just
> > >> > > > > make CONFIG_PROCFS=Y mandatory and provide a new
> > >get_procfs_root()
> > >> > > > > system call that returned, out of thin air and independent of
> > >the
> > >> > > > > mount table, a procfs root directory file descriptor for the
> > >caller's
> > >> > > > > PID namspace and suitable for use with openat(2).
> > >> > > >
> > >> > > > Even if this works I'm pretty sure that Al and a lot of others
> > >will not
> > >> > > > be happy about this. A syscall to get an fd to /proc?
> > >> >
> > >> > Why not? procfs provides access to a lot of core kernel
> > >functionality.
> > >> > Why should you need a mountpoint to get to it?
> > >> >
> > >> > > That's not going
> > >> > > > to happen and I don't see the need for a separate syscall just
> > >for that.
> > >> >
> > >> > We need a system call for the same reason we need a getrandom(2):
> > >you
> > >> > have to bootstrap somehow when you're in a minimal environment.
> > >> >
> > >> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.)
> > >> >
> > >> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm
> > >> > proposing that we *hardwire* it as the default and just declare
> > >that
> > >> > it's not possible to build a Linux kernel that doesn't include
> > >procfs.
> > >> > Why do we even have that button?
> > >> >
> > >> > > I think his point here was that he wanted a handle to procfs no
> > >matter where
> > >> > > it was mounted and then can later use openat on that. Agreed that
> > >it may be
> > >> > > unnecessary unless there is a usecase for it, and especially if
> > >the /proc
> > >> > > directory being the defacto mountpoint for procfs is a universal
> > >convention.
> > >> >
> > >> > If it's a universal convention and, in practice, everyone needs
> > >proc
> > >&

Re: pidfd design

2019-03-20 Thread Joel Fernandes



On March 20, 2019 3:02:32 AM EDT, Daniel Colascione  wrote:
>On Tue, Mar 19, 2019 at 8:59 PM Christian Brauner
> wrote:
>>
>> On Tue, Mar 19, 2019 at 07:42:52PM -0700, Daniel Colascione wrote:
>> > On Tue, Mar 19, 2019 at 6:52 PM Joel Fernandes
> wrote:
>> > >
>> > > On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner
>wrote:
>> > > > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione
>wrote:
>> > > > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner
> wrote:
>> > > > > > So I dislike the idea of allocating new inodes from the
>procfs super
>> > > > > > block. I would like to avoid pinning the whole pidfd
>concept exclusively
>> > > > > > to proc. The idea is that the pidfd API will be useable
>through procfs
>> > > > > > via open("/proc/<pid>") because that is what users expect
>and really
>> > > > > > wanted to have for a long time. So it makes sense to have
>this working.
>> > > > > > But it should really be useable without it. That's why
>translate_pid()
>> > > > > > and pidfd_clone() are on the table.  What I'm saying is,
>once the pidfd
>> > > > > > api is "complete" you should be able to set CONFIG_PROCFS=N
>- even
>> > > > > > though that's crazy - and still be able to use pidfds. This
>is also a
>> > > > > > point akpm asked about when I did the pidfd_send_signal
>work.
>> > > > >
>> > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use
>pidfds. One
>> > > > > crazy idea that I was discussing with Joel the other day is
>to just
>> > > > > make CONFIG_PROCFS=Y mandatory and provide a new
>get_procfs_root()
>> > > > > system call that returned, out of thin air and independent of
>the
>> > > > > mount table, a procfs root directory file descriptor for the
>caller's
>> > > > > PID namspace and suitable for use with openat(2).
>> > > >
>> > > > Even if this works I'm pretty sure that Al and a lot of others
>will not
>> > > > be happy about this. A syscall to get an fd to /proc?
>> >
>> > Why not? procfs provides access to a lot of core kernel
>functionality.
>> > Why should you need a mountpoint to get to it?
>> >
>> > > That's not going
>> > > > to happen and I don't see the need for a separate syscall just
>for that.
>> >
>> > We need a system call for the same reason we need a getrandom(2):
>you
>> > have to bootstrap somehow when you're in a minimal environment.
>> >
>> > > > (I do see the point of making CONFIG_PROCFS=y the default btw.)
>> >
>> > I'm not proposing that we make CONFIG_PROCFS=y the default. I'm
>> > proposing that we *hardwire* it as the default and just declare
>that
>> > it's not possible to build a Linux kernel that doesn't include
>procfs.
>> > Why do we even have that button?
>> >
>> > > I think his point here was that he wanted a handle to procfs no
>matter where
>> > > it was mounted and then can later use openat on that. Agreed that
>it may be
>> > > unnecessary unless there is a usecase for it, and especially if
>the /proc
>> > > directory being the defacto mountpoint for procfs is a universal
>convention.
>> >
>> > If it's a universal convention and, in practice, everyone needs
>proc
>> > mounted anyway, so what's the harm in hardwiring CONFIG_PROCFS=y?
>If
>> > we advertise /proc as not merely some kind of optional debug
>interface
>> > but *the* way certain kernel features are exposed --- and there's
>> > nothing wrong with that --- then we should give programs access to
>> > these core kernel features in a way that doesn't depend on
>userspace
>> > kernel configuration, and you do that by either providing a
>> > procfs-root-getting system call or just hardwiring the "/proc/"
>prefix
>> > into VFS.
>> >
>> > > > Inode allocation from the procfs mount for the file descriptors
>Joel
>> > > > wants is not correct. Their not really procfs file descriptors
>so this
>> > > > is a nack. We can't just hook into proc that way.
>> > >
>> > > I was not particular about using procfs mount fo

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-19 Thread Joel Fernandes
On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner wrote:
> On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote:
> > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner  
> > wrote:
> > > So I dislike the idea of allocating new inodes from the procfs super
> > > block. I would like to avoid pinning the whole pidfd concept exclusively
> > > to proc. The idea is that the pidfd API will be useable through procfs
> > > via open("/proc/<pid>") because that is what users expect and really
> > > wanted to have for a long time. So it makes sense to have this working.
> > > But it should really be useable without it. That's why translate_pid()
> > > and pidfd_clone() are on the table.  What I'm saying is, once the pidfd
> > > api is "complete" you should be able to set CONFIG_PROCFS=N - even
> > > though that's crazy - and still be able to use pidfds. This is also a
> > > point akpm asked about when I did the pidfd_send_signal work.
> > 
> > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One
> > crazy idea that I was discussing with Joel the other day is to just
> > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root()
> > system call that returned, out of thin air and independent of the
> > mount table, a procfs root directory file descriptor for the caller's
> > PID namspace and suitable for use with openat(2).
> 
> Even if this works I'm pretty sure that Al and a lot of others will not
> be happy about this. A syscall to get an fd to /proc? That's not going
> to happen and I don't see the need for a separate syscall just for that.
> (I do see the point of making CONFIG_PROCFS=y the default btw.)

I think his point here was that he wanted a handle to procfs, no matter where
it was mounted, that he could later use openat on. Agreed that it may be
unnecessary unless there is a usecase for it, especially if the /proc
directory being the de facto mountpoint for procfs is a universal convention.
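
For clarity, the hypothetical interface being discussed would be used roughly
as in the sketch below; get_procfs_root() does not exist, this is just the
shape of the idea:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* hypothetical syscall wrapper; no such syscall exists */
extern int get_procfs_root(void);

/* open the /proc/<pid> directory without knowing where (or whether) procfs is mounted */
static int open_pid_dir(const char *pid_str)
{
	int proc_root = get_procfs_root();
	int pid_dir = openat(proc_root, pid_str, O_RDONLY | O_DIRECTORY);

	close(proc_root);
	return pid_dir;
}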

> Inode allocation from the procfs mount for the file descriptors Joel
> wants is not correct. They're not really procfs file descriptors so this
> is a nack. We can't just hook into proc that way.

I was not particular about using procfs mount for the FDs but that's the only
way I knew how to do it until you pointed out anon_inode (my grep skills
missed that), so thank you!

> > C'mon: /proc is used by everyone today and almost every program breaks
> > if it's not around. The string "/proc" is already de facto kernel ABI.
> > Let's just drop the pretense of /proc being optional and bake it into
> > the kernel proper, then give programs a way to get to /proc that isn't
> > tied to any particular mount configuration. This way, we don't need a
> > translate_pid(), since callers can just use procfs to do the same
> > thing. (That is, if I understand correctly what translate_pid does.)
> 
> I'm not sure what you think translate_pid() is doing since you're not
> saying what you think it does.
> Examples from the old patchset:
> translate_pid(pid, ns, -1)  - get pid in our pid namespace
> translate_pid(pid, -1, ns)  - get pid in other pid namespace
> translate_pid(1, ns, -1)- get pid of init task for namespace
> translate_pid(pid, -1, ns) > 0  - is pid is reachable from ns?
> translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
> translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
> translate_pid(1, ns1, ns2) == 1 - is ns1 equal ns2?
> 
> Allowing this syscall to yield pidfds as proper regular file fds and
> also taking pidfds as argument is an excellent way to kill a few
> problems at once:
> - cheap pid namespace introspection
> - creates a bridge between the "old" pid-based api and the "new" pidfd api

This second point would solve the problem of getting a new pidfd given a pid
indeed, without depending on /proc/<pid> at all. So kudos for that and I am
glad you are making it return pidfds (but correct me if I misunderstood what
you're planning to do with translate_pid). It also obviates any need for
dealing with procfs mount points.

> - allows us to get proper non-directory file descriptors for any pids we
>   like

Here I got a bit lost. AIUI pidfd is a directory fd. Why would we want it to
not be a directory fd? That would be ambiguous with what pidfd_send_signal
expects.

Also would it be a bad idea to extend translate_pid to also do what we want
for the pidfd_wait syscall? So translate_pid in this case would return an fd
that is just used for the pid's death status.

All of these extensions seem to mean translate_pid should probably take a
fourth parameter that tells it the target translation type?

The way I am hypothesizing it, translate_pid should probably support:
- translation to a pid in some ns
- translation of a pid to a pidfd
- translation of a pid to a "wait" fd which returns the death/reap process 
status.

If that makes sense, that would also avoid the need for a new syscall we are 
adding.
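
Sketching that hypothesis as an interface, purely for illustration (none of
these names or constants exist):

#include <sys/types.h>

/* hypothetical translation types for an extended translate_pid() */
enum translate_pid_type {
	TRANSLATE_PID_TO_PIDNS,		/* the pid as seen from a target pid namespace */
	TRANSLATE_PID_TO_PIDFD,		/* a pidfd for the process */
	TRANSLATE_PID_TO_WAITFD,	/* an fd that reports the death/reap status */
};

/* hypothetical prototype with the fourth "type" parameter mentioned above */
long translate_pid(pid_t pid, int source_nsfd, int target_nsfd,
		   enum translate_pid_type type);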

> The additional advantage is that people are already happy to ad

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-19 Thread Joel Fernandes
On Tue, Mar 19, 2019 at 11:14:17PM +0100, Christian Brauner wrote:
[snip] 
> > 
> > ---8<---
> > 
> > From: Joel Fernandes 
> > Subject: [PATCH] Partial skeleton prototype of pidfd_wait frontend
> > 
> > Signed-off-by: Joel Fernandes 
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> >  include/linux/syscalls.h   |  1 +
> >  include/uapi/asm-generic/unistd.h  |  4 +-
> >  kernel/signal.c| 62 ++
> >  kernel/sys_ni.c|  3 ++
> >  6 files changed, 71 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> > b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 1f9607ed087c..2a63f1896b63 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -433,3 +433,4 @@
> >  425i386io_uring_setup  sys_io_uring_setup  
> > __ia32_sys_io_uring_setup
> >  426i386io_uring_enter  sys_io_uring_enter  
> > __ia32_sys_io_uring_enter
> >  427i386io_uring_register   sys_io_uring_register   
> > __ia32_sys_io_uring_register
> > +428i386pidfd_wait  sys_pidfd_wait  
> > __ia32_sys_pidfd_wait
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 92ee0b4378d4..cf2e08a8053b 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -349,6 +349,7 @@
> >  425common  io_uring_setup  __x64_sys_io_uring_setup
> >  426common  io_uring_enter  __x64_sys_io_uring_enter
> >  427common  io_uring_register   __x64_sys_io_uring_register
> > +428common  pidfd_wait  __x64_sys_pidfd_wait
> >  
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e446806a561f..62160970ed3f 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -988,6 +988,7 @@ asmlinkage long sys_rseq(struct rseq __user *rseq, 
> > uint32_t rseq_len,
> >  asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
> >siginfo_t __user *info,
> >unsigned int flags);
> > +asmlinkage long sys_pidfd_wait(int pidfd);
> >  
> >  /*
> >   * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h 
> > b/include/uapi/asm-generic/unistd.h
> > index dee7292e1df6..137aa8662230 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_pidfd_wait 428
> > +__SYSCALL(__NR_pidfd_wait, sys_pidfd_wait)
> >  
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 428
> > +#define __NR_syscalls 429
> >  
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index b7953934aa99..ebb550b87044 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -3550,6 +3550,68 @@ static int 
> > copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> > return copy_siginfo_from_user(kinfo, info);
> >  }
> >  
> > +static ssize_t pidfd_wait_read_iter(struct kiocb *iocb, struct iov_iter 
> > *to)
> > +{
> > +   /*
> > +* This is just a test string, it will contain the actual
> > +* status of the pidfd in the future.
> > +*/
> > +   char buf[] = "status";
> > +
> > +   return copy_to_iter(buf, strlen(buf)+1, to);
> > +}
> > +
> > +static const struct file_operations pidfd_wait_file_ops = {
> > +   .read_iter  = pidfd_wait_read_iter,
> > +};
> > +
> > +static struct inode *pidfd_wait_get_inode(struct super_block *sb)
> > +{
> > +   struct inode *inode = new_inode(sb);
> > +
> > +   inode->i_ino = get_next_ino();
> > +   inode_init_owner(inode, NULL, S_IFREG);
> > +
> > +   inode->i_op = &simpl

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-18 Thread Joel Fernandes
On Mon, Mar 18, 2019 at 01:29:51AM +0100, Christian Brauner wrote:
> On Sun, Mar 17, 2019 at 08:40:19AM -0700, Daniel Colascione wrote:
> > On Sun, Mar 17, 2019 at 4:42 AM Christian Brauner  
> > wrote:
> > >
> > > On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote:
> > > > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote:
> > > > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner 
> > > > >  wrote:
> > > > > >
> > > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote:
> > > > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner 
> > > > > > > > > wrote:
> > > > > > > > > [..]
> > > > > > > > > > > why do we want to add a new syscall (pidfd_wait) though? 
> > > > > > > > > > > Why not just use
> > > > > > > > > > > standard poll/epoll interface on the proc fd like Daniel 
> > > > > > > > > > > was suggesting.
> > > > > > > > > > > AFAIK, once the proc file is opened, the struct pid is 
> > > > > > > > > > > essentially pinned
> > > > > > > > > > > even though the proc number may be reused. Then the 
> > > > > > > > > > > caller can just poll.
> > > > > > > > > > > We can add a waitqueue to struct pid, and wake up any 
> > > > > > > > > > > waiters on process
> > > > > > > > > > > death (A quick look shows task_struct can be mapped to 
> > > > > > > > > > > its struct pid) and
> > > > > > > > > > > also possibly optimize it using Steve's TIF flag idea. No 
> > > > > > > > > > > new syscall is
> > > > > > > > > > > needed then, let me know if I missed something?
> > > > > > > > > >
> > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll 
> > > > > > > > > > solution?
> > > > > > > > >
> > > > > > > > > Hmm, going through earlier threads, I believe so now. Here 
> > > > > > > > > was Daniel's
> > > > > > > > > reasoning about avoiding a notification about process death 
> > > > > > > > > through proc
> > > > > > > > > directory fd: 
> > > > > > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> > > > > > > > >
> > > > > > > > > May be a dedicated syscall for this would be cleaner after 
> > > > > > > > > all.
> > > > > > > >
> > > > > > > > Ah, I wish I've seen that discussion before...
> > > > > > > > syscall makes sense and it can be non-blocking and we can use
> > > > > > > > select/poll/epoll if we use eventfd.
> > > > > > >
> > > > > > > Thanks for taking a look.
> > > > > > >
> > > > > > > > I would strongly advocate for
> > > > > > > > non-blocking version or at least to have a non-blocking option.
> > > > > > >
> > > > > > > Waiting for FD readiness is *already* blocking or non-blocking
> > > > > > > according to the caller's desire --- users can pass options they 
> > > > > > > want
> > > > > > > to poll(2) or whatever. There's no need for any kind of special
> > > > > > > configuration knob or non-blocking option. We already *have* a
> > > > > > > non-blocking option that works universally for everything.
> > > > > > >
> > > > > > > As I mentioned in the linked thread, waiting for process exit 
> > > > > > > should
> > > > > > > work just like waiting for bytes to appea

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-16 Thread Joel Fernandes
On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote:
> On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner  
> wrote:
> >
> > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote:
> > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan  
> > > wrote:
> > > >
> > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes 
> > > >  wrote:
> > > > >
> > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote:
> > > > > [..]
> > > > > > > why do we want to add a new syscall (pidfd_wait) though? Why not 
> > > > > > > just use
> > > > > > > standard poll/epoll interface on the proc fd like Daniel was 
> > > > > > > suggesting.
> > > > > > > AFAIK, once the proc file is opened, the struct pid is 
> > > > > > > essentially pinned
> > > > > > > even though the proc number may be reused. Then the caller can 
> > > > > > > just poll.
> > > > > > > We can add a waitqueue to struct pid, and wake up any waiters on 
> > > > > > > process
> > > > > > > death (A quick look shows task_struct can be mapped to its struct 
> > > > > > > pid) and
> > > > > > > also possibly optimize it using Steve's TIF flag idea. No new 
> > > > > > > syscall is
> > > > > > > needed then, let me know if I missed something?
> > > > > >
> > > > > > Huh, I thought that Daniel was against the poll/epoll solution?
> > > > >
> > > > > Hmm, going through earlier threads, I believe so now. Here was 
> > > > > Daniel's
> > > > > reasoning about avoiding a notification about process death through 
> > > > > proc
> > > > > directory fd: 
> > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> > > > >
> > > > > May be a dedicated syscall for this would be cleaner after all.
> > > >
> > > > Ah, I wish I've seen that discussion before...
> > > > syscall makes sense and it can be non-blocking and we can use
> > > > select/poll/epoll if we use eventfd.
> > >
> > > Thanks for taking a look.
> > >
> > > > I would strongly advocate for
> > > > non-blocking version or at least to have a non-blocking option.
> > >
> > > Waiting for FD readiness is *already* blocking or non-blocking
> > > according to the caller's desire --- users can pass options they want
> > > to poll(2) or whatever. There's no need for any kind of special
> > > configuration knob or non-blocking option. We already *have* a
> > > non-blocking option that works universally for everything.
> > >
> > > As I mentioned in the linked thread, waiting for process exit should
> > > work just like waiting for bytes to appear on a pipe. Process exit
> > > status is just another blob of bytes that a process might receive. A
> > > process exit handle ought to be just another information source. The
> > > reason the unix process API is so awful is that for whatever reason
> > > the original designers treated processes as some kind of special kind
> > > of resource instead of fitting them into the otherwise general-purpose
> > > unix data-handling API. Let's not repeat that mistake.
> > >
> > > > Something like this:
> > > >
> > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > > > // register eventfd to receive death notification
> > > > pidfd_wait(pid_to_kill, evfd);
> > > > // kill the process
> > > > pidfd_send_signal(pid_to_kill, ...)
> > > > // tend to other things
> > >
> > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up
> > > an eventfd.
> > >
> 
> Ok, I probably misunderstood your post linked by Joel. I thought your
> original proposal was based on being able to poll a file under
> /proc/pid and then you changed your mind to have a separate syscall
> which I assumed would be a blocking one to wait for process exit.
> Maybe you can describe the new interface you are thinking about in
> terms of userspace usage like I did above? Several lines of code would
> explain more than paragraphs of text.

Hey, Thanks Suren for the eventfd idea. I agree with Daniel on this. The idea
from Daniel here is to wait for 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-15 Thread Joel Fernandes
On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote:
[..]
> > why do we want to add a new syscall (pidfd_wait) though? Why not just use
> > standard poll/epoll interface on the proc fd like Daniel was suggesting.
> > AFAIK, once the proc file is opened, the struct pid is essentially pinned
> > even though the proc number may be reused. Then the caller can just poll.
> > We can add a waitqueue to struct pid, and wake up any waiters on process
> > death (A quick look shows task_struct can be mapped to its struct pid) and
> > also possibly optimize it using Steve's TIF flag idea. No new syscall is
> > needed then, let me know if I missed something?
> 
> Huh, I thought that Daniel was against the poll/epoll solution?

Hmm, going through earlier threads, I believe so now. Here was Daniel's
reasoning about avoiding a notification about process death through proc
directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html

May be a dedicated syscall for this would be cleaner after all.
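
To make the quoted waitqueue idea concrete, here is a minimal kernel-side
sketch, for illustration only. It assumes a wait_queue_head_t named wait_pidfd
has been added to struct pid and initialised in alloc_pid(); the helper names
are made up and this is not the code that was eventually merged.

#include <linux/fs.h>
#include <linux/pid.h>
#include <linux/poll.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

/* exit path: wake anyone waiting on this pid once the task has exited */
static void wake_pidfd_waiters(struct task_struct *tsk)
{
	wake_up_all(&task_pid(tsk)->wait_pidfd);
}

/* poll() handler for a pidfd-style file whose private_data is a struct pid */
static __poll_t pidfd_poll(struct file *file, poll_table *pts)
{
	struct pid *pid = file->private_data;
	struct task_struct *task;
	__poll_t mask = 0;

	poll_wait(file, &pid->wait_pidfd, pts);

	rcu_read_lock();
	task = pid_task(pid, PIDTYPE_PID);
	if (!task || task->exit_state)	/* exited, or already reaped */
		mask |= EPOLLIN;
	rcu_read_unlock();

	return mask;
}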

> I have no clear opinion on what is better at the moment since I have
> been mostly concerned with getting pidfd_send_signal() into shape and
> was reluctant to put more ideas/work into this if it gets shutdown.
> Once we have pidfd_send_signal() the wait discussion makes sense.

Ok. It looks like that is almost in though (fingers crossed :)).

thanks,

 - Joel

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-15 Thread Joel Fernandes
On Fri, Mar 15, 2019 at 07:03:07PM +0100, Christian Brauner wrote:
> On Thu, Mar 14, 2019 at 09:36:43PM -0700, Daniel Colascione wrote:
> > On Thu, Mar 14, 2019 at 8:16 PM Steven Rostedt  wrote:
> > >
> > > On Thu, 14 Mar 2019 13:49:11 -0700
> > > Sultan Alsawaf  wrote:
> > >
> > > > Perhaps I'm missing something, but if you want to know when a process 
> > > > has died
> > > > after sending a SIGKILL to it, then why not just make the SIGKILL 
> > > > optionally
> > > > block until the process has died completely? It'd be rather trivial to 
> > > > just
> > > > store a pointer to an onstack completion inside the victim process' 
> > > > task_struct,
> > > > and then complete it in free_task().
> > >
> > > How would you implement such a method in userspace? kill() doesn't take
> > > any parameters but the pid of the process you want to send a signal to,
> > > and the signal to send. This would require a new system call, and be
> > > quite a bit of work.
> > 
> > That's what the pidfd work is for. Please read the original threads
> > about the motivation and design of that facility.
> > 
> > > If you can solve this with an ebpf program, I
> > > strongly suggest you do that instead.
> > 
> > Regarding process death notification: I will absolutely not support
> > putting eBPF and perf trace events on the critical path of core system
> > memory management functionality. Tracing and monitoring facilities are
> > great for learning about the system, but they were never intended to
> > be load-bearing. The proposed eBPF process-monitoring approach is just
> > a variant of the netlink proposal we discussed previously on the pidfd
> > threads; it has all of its drawbacks. We really need a core system
> > call  --- really, we've needed robust process management since the
> > creation of unix --- and I'm glad that we're finally getting it.
> > Adding new system calls is not expensive; going to great lengths to
> > avoid adding one is like calling a helicopter to avoid crossing the
> > street. I don't think we should present an abuse of the debugging and
> > performance monitoring infrastructure as an alternative to a robust
> > and desperately-needed bit of core functionality that's neither hard
> > to add nor complex to implement nor expensive to use.
> > 
> > Regarding the proposal for a new kernel-side lmkd: when possible, the
> > kernel should provide mechanism, not policy. Putting the low memory
> > killer back into the kernel after we've spent significant effort
> > making it possible for userspace to do that job. Compared to kernel
> > code, more easily understood, more easily debuggable, more easily
> > updated, and much safer. If we *can* move something out of the kernel,
> > we should. This patch moves us in exactly the wrong direction. Yes, we
> > need *something* that sits synchronously astride the page allocation
> > path and does *something* to stop a busy beaver allocator that eats
> > all the available memory before lmkd, even mlocked and realtime, can
> > respond. The OOM killer is adequate for this very rare case.
> > 
> > With respect to kill timing: Tim is right about the need for two
> > levels of policy: first, a high-level process prioritization and
> > memory-demand balancing scheme (which is what OOM score adjustment
> > code in ActivityManager amounts to); and second, a low-level
> > process-killing methodology that maximizes sustainable memory reclaim
> > and minimizes unwanted side effects while killing those processes that
> > should be dead. Both of these policies belong in userspace --- because
> > they *can* be in userspace --- and userspace needs only a few tools,
> > most of which already exist, to do a perfectly adequate job.
> > 
> > We do want killed processes to die promptly. That's why I support
> > boosting a process's priority somehow when lmkd is about to kill it.
> > The precise way in which we do that --- involving not only actual
> > priority, but scheduler knobs, cgroup assignment, core affinity, and
> > so on --- is a complex topic best left to userspace. lmkd already has
> > all the knobs it needs to implement whatever priority boosting policy
> > it wants.
> > 
> > Hell, once we add a pidfd_wait --- which I plan to work on, assuming
> > nobody beats me to it, after pidfd_send_signal lands --- you can
> 
> Daniel,
> 
> I've just been talking to Joel.
> I actually "expected" you to work pidfd_wait() after prior
> conversations we had on the pidfd_send_signal() patchsets. :) That's why
> I got a separate git tree on kernel.org since I expect a lot more work
> to come. I hope that Linus still decides to pull pidfd_send_signal()
> before Sunday (For the ones who have missed the link in a prior response
> of mine:
> https://lkml.org/lkml/2019/3/12/439
> 
> This is the first merge window I sent this PR.
> 
> The pidfd tree has a branch for-next that is already tracked by Stephen
> in linux-next since the 5.0 merge window. The patches for
> pidfd_send_signal() sit in the pidfd 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-15 Thread Joel Fernandes
On Thu, Mar 14, 2019 at 09:36:43PM -0700, Daniel Colascione wrote:
[snip] 
> > If you can solve this with an ebpf program, I
> > strongly suggest you do that instead.
> 
> Regarding process death notification: I will absolutely not support
> putting eBPF and perf trace events on the critical path of core system
> memory management functionality. Tracing and monitoring facilities are
> great for learning about the system, but they were never intended to
> be load-bearing. The proposed eBPF process-monitoring approach is just
> a variant of the netlink proposal we discussed previously on the pidfd
> threads; it has all of its drawbacks. We really need a core system
> call  --- really, we've needed robust process management since the
> creation of unix --- and I'm glad that we're finally getting it.
> Adding new system calls is not expensive; going to great lengths to
> avoid adding one is like calling a helicopter to avoid crossing the
> street. I don't think we should present an abuse of the debugging and
> performance monitoring infrastructure as an alternative to a robust
> and desperately-needed bit of core functionality that's neither hard
> to add nor complex to implement nor expensive to use.

The eBPF-based solution to this would be just as simple while avoiding any
kernel changes. I don't know why you think it is not load-bearing. However, I
agree the proc/pidfd approach is better and can be simpler for some users who
don't want to deal with eBPF - especially since something like this has many
usecases. I was just suggesting the eBPF solution as a better alternative to
the task_struct surgery idea from Sultan since that sounded to me quite
hackish (that could just be my opinion).

> Regarding the proposal for a new kernel-side lmkd: when possible, the
> kernel should provide mechanism, not policy. Putting the low memory
> killer back into the kernel after we've spent significant effort
> making it possible for userspace to do that job. Compared to kernel
> code, more easily understood, more easily debuggable, more easily
> updated, and much safer. If we *can* move something out of the kernel,
> we should. This patch moves us in exactly the wrong direction. Yes, we
> need *something* that sits synchronously astride the page allocation
> path and does *something* to stop a busy beaver allocator that eats
> all the available memory before lmkd, even mlocked and realtime, can
> respond. The OOM killer is adequate for this very rare case.
> 
> With respect to kill timing: Tim is right about the need for two
> levels of policy: first, a high-level process prioritization and
> memory-demand balancing scheme (which is what OOM score adjustment
> code in ActivityManager amounts to); and second, a low-level
> process-killing methodology that maximizes sustainable memory reclaim
> and minimizes unwanted side effects while killing those processes that
> should be dead. Both of these policies belong in userspace --- because
> they *can* be in userspace --- and userspace needs only a few tools,
> most of which already exist, to do a perfectly adequate job.
> 
> We do want killed processes to die promptly. That's why I support
> boosting a process's priority somehow when lmkd is about to kill it.
> The precise way in which we do that --- involving not only actual
> priority, but scheduler knobs, cgroup assignment, core affinity, and
> so on --- is a complex topic best left to userspace. lmkd already has
> all the knobs it needs to implement whatever priority boosting policy
> it wants.
> 
> Hell, once we add a pidfd_wait --- which I plan to work on, assuming
> nobody beats me to it, after pidfd_send_signal lands --- you can
> imagine a general-purpose priority inheritance mechanism expediting
> process death when a high-priority process waits on a pidfd_wait
> handle for a condemned process. You know you're on the right track
> design-wise when you start seeing this kind of elegant constructive
> interference between seemingly-unrelated features. What we don't need
> is some kind of blocking SIGKILL alternative or backdoor event
> delivery system.
> 
> We definitely don't want to have to wait for a process's parent to
> reap it. Instead, we want to wait for it to become a zombie. That's
> why I designed my original exithand patch to fire death notification
> upon transition to the zombie state, not upon process table removal,
> and I expect pidfd_wait (or whatever we call it) to act the same way.

Agreed. Looking forward to the patches. :)
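
For illustration only, the kind of usage anticipated here, with pidfd_wait()
returning a new fd that becomes readable once the victim is a zombie; the
wrapper and the status format are hypothetical, not a real API:

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* hypothetical wrapper around the proposed syscall */
extern int pidfd_wait(int pidfd);

static void wait_for_death(int pidfd)
{
	char status[64] = "";
	struct pollfd pfd = { .events = POLLIN };

	/* pidfd_wait() hands back a *new* fd dedicated to exit notification */
	pfd.fd = pidfd_wait(pidfd);

	poll(&pfd, 1, -1);				/* readable once the task is a zombie */
	read(pfd.fd, status, sizeof(status) - 1);	/* exit status as plain bytes */
	printf("victim exited: %s\n", status);
	close(pfd.fd);
}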

thanks,

 - Joel

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-14 Thread Joel Fernandes
On Thu, Mar 14, 2019 at 01:49:11PM -0700, Sultan Alsawaf wrote:
> On Thu, Mar 14, 2019 at 10:47:17AM -0700, Joel Fernandes wrote:
> > About the 100ms latency, I wonder whether it is that high because of
> > the way Android's lmkd is observing that a process has died. There is
> > a gap between when a process memory is freed and when it disappears
> > from the process-table.  Once a process is SIGKILLed, it becomes a
> > zombie. Its memory is freed instantly during the SIGKILL delivery (I
> > traced this so that's how I know), but until it is reaped by its
> > parent thread, it will still exist in /proc/<pid>. So if testing the
> > existence of /proc/<pid> is how Android is observing that the process
> > died, then there can be a large latency where it takes a very long
> > time for the parent to actually reap the child way after its memory
> > was long freed. A quicker way to know if a process's memory is freed
> > before it is reaped could be to read back /proc/<pid>/maps in
> > userspace of the victim, and that file will be empty for zombie
> > processes. So then one does not need wait for the parent to reap it. I
> > wonder how much of that 100ms you mentioned is actually the "Waiting
> > while Parent is reaping the child", than "memory freeing time". So
> > yeah for this second problem, the procfds work will help.
> >
> > By the way another approach that can provide a quick and asynchronous
> > notification of when the process memory is freed, is to monitor
> > sched_process_exit trace event using eBPF. You can tell eBPF the PID
> > that you want to monitor before the SIGKILL. As soon as the process
> > dies and its memory is freed, the eBPF program can send a notification
> > to user space (using the perf_events polling infra). The
> > sched_process_exit fires just after the mmput() happens so it is quite
> > close to when the memory is reclaimed. This also doesn't need any
> > kernel changes. I could come up with a prototype for this and
> > benchmark it on Android, if you want. Just let me know.
> 
> Perhaps I'm missing something, but if you want to know when a process has died
> after sending a SIGKILL to it, then why not just make the SIGKILL optionally
> block until the process has died completely? It'd be rather trivial to just
> store a pointer to an onstack completion inside the victim process' 
> task_struct,
> and then complete it in free_task().

I'm not sure if that makes much semantic sense for how the signal handling is
supposed to work. Imagine a parent sends SIGKILL to its child, and then does
a wait(2). Because the SIGKILL blocks in your idea, the wait cannot
execute, and because the wait cannot execute, the zombie task will not get
reaped, so the SIGKILL sender never gets unblocked and the whole thing
just locks up. No? I don't know, it just feels incorrect.
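
A minimal userspace example of the pattern that would deadlock if kill(2)
blocked until the victim's task_struct were freed (this only illustrates the
argument above, it is not from any patch):

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t child = fork();

	if (child == 0) {
		pause();		/* child just waits to be killed */
		_exit(0);
	}

	/*
	 * If kill() blocked until the child were fully freed, we would never
	 * reach wait(); but the child is only freed after it is reaped by
	 * wait(), so neither side could ever make progress.
	 */
	kill(child, SIGKILL);
	wait(NULL);			/* reaps the zombie under normal semantics */
	return 0;
}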

Further, in your idea, adding stuff to task_struct will simply bloat it, when
this can easily be handled using eBPF without making any kernel changes,
either by probing the sched_process_free or sched_process_exit tracepoints.
Scheduler maintainers generally frown on adding stuff to task_struct
pointlessly, and for good reason: bloating it affects performance, and
something like this would probably never be ifdef'd out behind a CONFIG.

thanks,

 - Joel
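
As a side note on the /proc/<pid>/maps check mentioned in the quoted text
above, a minimal userspace sketch of detecting that a killed task's memory has
been released before its parent reaps it:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Returns 1 if the victim's address space is gone (maps reads back empty),
 * 0 if it still has mappings, -1 if the pid has already disappeared.
 */
static int victim_memory_freed(pid_t pid)
{
	char path[64];
	char line[256];
	FILE *f;
	int empty;

	snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;	/* already reaped, or no permission */

	empty = (fgets(line, sizeof(line), f) == NULL);
	fclose(f);
	return empty;
}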

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-14 Thread Joel Fernandes
Hi Tim,
Thanks for the detailed and excellent write-up. It will serve as a
good future reference for low memory killer requirements. I made some
comments below on the "how to kill" part.

On Tue, Mar 12, 2019 at 10:17 AM Tim Murray  wrote:
>
> On Tue, Mar 12, 2019 at 9:37 AM Sultan Alsawaf  wrote:
> >
> > On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > > The only way to control the OOM behavior pro-actively is to throttle
> > > allocation speed. We have memcg high limit for that purpose. Along with
> > > PSI, I can imagine a reasonably working user space early oom
> > > notifications and reasonable acting upon that.
> >
> > The issue with pro-active memory management that prompted me to create this 
> > was
> > poor memory utilization. All of the alternative means of reclaiming pages 
> > in the
> > page allocator's slow path turn out to be very useful for maximizing memory
> > utilization, which is something that we would have to forgo by relying on a
> > purely pro-active solution. I have not had a chance to look at PSI yet, but
> > unless a PSI-enabled solution allows allocations to reach the same point as 
> > when
> > the OOM killer is invoked (which is contradictory to what it sets out to 
> > do),
> > then it cannot take advantage of all of the alternative memory-reclaim means
> > employed in the slowpath, and will result in killing a process before it is
> > _really_ necessary.
>
> There are two essential parts of a lowmemorykiller implementation:
> when to kill and how to kill.
>
> There are a million possible approaches to decide when to kill an
> unimportant process. They usually trade off between the same two
> failure modes depending on the workload.
>
> If you kill too aggressively, a transient spike that could be
> imperceptibly absorbed by evicting some file pages or moving some
> pages to ZRAM will result in killing processes, which then get started
> up later and have a performance/battery cost.
>
> If you don't kill aggressively enough, you will encounter a workload
> that thrashes the page cache, constantly evicting and reloading file
> pages and moving things in and out of ZRAM, which makes the system
> unusable when a process should have been killed instead.
>
> As far as I've seen, any methodology that uses single points in time
> to decide when to kill without completely biasing toward one or the
> other is susceptible to both. The minfree approach used by
> lowmemorykiller/lmkd certainly is; it is both too aggressive for some
> workloads and not aggressive enough for other workloads. My guess is
> that simple LMK won't kill on transient spikes but will be extremely
> susceptible to page cache thrashing. This is not an improvement; page
> cache thrashing manifests as the entire system running very slowly.
>
> What you actually want from lowmemorykiller/lmkd on Android is to only
> kill once it becomes clear that the system will continue to try to
> reclaim memory to the extent that it could impact what the user
> actually cares about. That means tracking how much time is spent in
> reclaim/paging operations and the like, and that's exactly what PSI
> does. lmkd has had support for using PSI as a replacement for
> vmpressure for use as a wakeup trigger (to check current memory levels
> against the minfree thresholds) since early February. It works fine;
> unsurprisingly it's better than vmpressure at avoiding false wakeups.
>
> Longer term, there's a lot of work to be done in lmkd to turn PSI into
> a kill trigger and remove minfree entirely. It's tricky (mainly
> because of the "when to kill another process" problem discussed
> later), but I believe it's feasible.
>
> How to kill is similarly messy. The latency of reclaiming memory post
> SIGKILL can be severe (usually tens of milliseconds, occasionally
> >100ms). The latency we see on Android usually isn't because those
> threads are blocked in uninterruptible sleep, it's because times of
> memory pressure are also usually times of significant CPU contention
> and these are overwhelmingly CFS threads, some of which may be
> assigned a very low priority. lmkd now sets priorities and resets
> cpusets upon killing a process, and we have seen improved reclaim
> latency because of this. oom reaper might be a good approach to avoid
> this latency (I think some in-kernel lowmemorykiller implementations
> rely on it), but we can't use it from userspace. Something for future
> consideration.
>

This makes sense. If the process receiving the SIGKILL does not get CPU
time, then the kernel may not be able to execute the unconditional
signal handling paths in the context of the victim process to free the memory.

I don't see how the proc-fds approach will solve this though. Say you have
process L (which is LMKd) and a victim process V. Now L sends SIGKILL to V.
Unless V executes the
signal-handling code in kernel context and is scheduled at high enough
priority to get CPU time, I don't think the SIGK

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Joel Fernandes
On Mon, Mar 11, 2019 at 01:46:26PM -0700, Sultan Alsawaf wrote:
> On Mon, Mar 11, 2019 at 01:10:36PM -0700, Suren Baghdasaryan wrote:
> > The idea seems interesting although I need to think about this a bit
> > more. Killing processes based on failed page allocation might backfire
> > during transient spikes in memory usage.
> 
> This issue could be alleviated if tasks could be killed and have their pages
> reaped faster.

But the point is that a transient memory spike should not be a
signal to kill _any_ process.  The reaction to kill shouldn't be so
hasty that unwanted tasks are killed because the system went into
panic mode. The pressure should be averaged out over time, which I believe is
what PSI does.
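
As an illustration of what "averaged out" means in PSI terms, a monitor can
key off the running averages exported in /proc/pressure/memory instead of
any single allocation failure (sketch only; the threshold is an assumed
value, not lmkd's real policy):

#include <stdio.h>
#include <string.h>

/* Parse the "some avg10=..." line from /proc/pressure/memory. */
static double memory_pressure_avg10(void)
{
	char line[256];
	double avg10 = 0.0;
	FILE *f = fopen("/proc/pressure/memory", "r");

	if (!f)
		return 0.0;
	while (fgets(line, sizeof(line), f)) {
		/* "some avg10=1.23 avg60=... avg300=... total=..." */
		if (!strncmp(line, "some", 4))
			sscanf(line, "some avg10=%lf", &avg10);
	}
	fclose(f);
	return avg10;
}

/* Kill only when pressure stays high on average, not on a transient spike. */
static int should_kill(void)
{
	return memory_pressure_avg10() > 15.0;	/* assumed threshold */
}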

thanks,

- Joel



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Joel Fernandes
On Mon, Mar 11, 2019 at 12:32:33PM -0400, Joel Fernandes wrote:
> On Sun, Mar 10, 2019 at 01:34:03PM -0700, Sultan Alsawaf wrote:
> [...]
> >  
> > /* Perform scheduler related setup. Assign this task to a CPU. */
> > retval = sched_fork(clone_flags, p);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3eb01dedf..fd0d697c6 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -67,6 +67,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -967,6 +968,11 @@ static inline void __free_one_page(struct page *page,
> > }
> > }
> >  
> > +#ifdef CONFIG_ANDROID_SIMPLE_LMK
> > +   if (simple_lmk_page_in(page, order, migratetype))
> > +   return;
> > +#endif
> > +
> > list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
> >  out:
> > zone->free_area[order].nr_free++;
> > @@ -4427,6 +4433,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> > order,
> > if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
> > goto nopage;
> >  
> > +#ifdef CONFIG_ANDROID_SIMPLE_LMK
> > +   page = simple_lmk_oom_alloc(order, ac->migratetype);
> > +   if (page)
> > +   prep_new_page(page, order, gfp_mask, alloc_flags);
> > +   goto got_pg;
> > +#endif
> > +
> 
> Hacking generic MM code with Android-specific callback is probably a major
> issue with your patch.
>
> Also I CC'd -mm maintainers and lists since your patch
> touches page_alloc.c. Always run get_maintainer.pl before sending a patch. I
> added them this time.

I see you CC'd linux-mm on your initial patch, so I apologize. Ignore this
part of my reply. Thanks.



> Have you looked at the recent PSI work that Suren and Johannes have been
> doing [1]?  As I understand, userspace lmkd may be migrated to use that at 
> some
> point.  Suren can provide more details. I am sure AOSP contributions to make
> LMKd better by using the PSI backend would be appreciated. Please consider
> collaborating on that and help out, thanks. Check the cover-letter of that
> patch [1] where LMKd is mentioned.
 


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Joel Fernandes
On Sun, Mar 10, 2019 at 01:34:03PM -0700, Sultan Alsawaf wrote:
[...]
>  
>   /* Perform scheduler related setup. Assign this task to a CPU. */
>   retval = sched_fork(clone_flags, p);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3eb01dedf..fd0d697c6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -67,6 +67,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -967,6 +968,11 @@ static inline void __free_one_page(struct page *page,
>   }
>   }
>  
> +#ifdef CONFIG_ANDROID_SIMPLE_LMK
> + if (simple_lmk_page_in(page, order, migratetype))
> + return;
> +#endif
> +
>   list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
>  out:
>   zone->free_area[order].nr_free++;
> @@ -4427,6 +4433,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> order,
>   if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
>   goto nopage;
>  
> +#ifdef CONFIG_ANDROID_SIMPLE_LMK
> + page = simple_lmk_oom_alloc(order, ac->migratetype);
> + if (page)
> + prep_new_page(page, order, gfp_mask, alloc_flags);
> + goto got_pg;
> +#endif
> +

Hacking generic MM code with Android-specific callback is probably a major
issue with your patch. Also I CC'd -mm maintainers and lists since your patch
touches page_alloc.c. Always run get_maintainer.pl before sending a patch. I
added them this time.

Have you looked at the recent PSI work that Suren and Johannes have been
doing [1]?  As I understand, userspace lmkd may be migrated to use that at some
point.  Suren can provide more details. I am sure AOSP contributions to make
LMKd better by using the PSI backend would be appreciated. Please consider
collaborating on that and help out, thanks. Check the cover-letter of that
patch [1] where LMKd is mentioned.

thanks,

 - Joel

[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1951257.html



Re: [PATCH] binder: fix race between munmap() and direct reclaim

2019-03-05 Thread Joel Fernandes
On Sat, Mar 02, 2019 at 08:27:44AM -0800, Todd Kjos wrote:
> On Fri, Mar 1, 2019 at 11:57 PM Greg KH  wrote:
> >
> > On Fri, Mar 01, 2019 at 03:06:06PM -0800, Todd Kjos wrote:
> > > An munmap() on a binder device causes binder_vma_close() to be called
> > > which clears the alloc->vma pointer.
> > >
> > > If direct reclaim causes binder_alloc_free_page() to be called, there
> > > is a race where alloc->vma is read into a local vma pointer and then
> > > used later after the mm->mmap_sem is acquired. This can result in
> > > calling zap_page_range() with an invalid vma which manifests as a
> > > use-after-free in zap_page_range().
> > >
> > > The fix is to check alloc->vma after acquiring the mmap_sem (which we
> > > were acquiring anyway) and skip zap_page_range() if it has changed
> > > to NULL.
> > >
> > > Signed-off-by: Todd Kjos 

Awesome patch, 
Reviewed-by: Joel Fernandes (Google) 

thanks!
 
 - Joel

> > > ---
> >
> > Any specific commit that this fixes?
> 
> No, it's been there a long time.
> 
> > And should it be marked for stable releases?
> 
> It is needed in stable (back to 4.4), but will need to be backported.
> Should I post backported versions targeting the specific releases now?
> I was thinking we'd wait for this one to land. I think we'll need 1
> patch for 4.4/4.9 and a second one for 4.14/4.19 (and some of those
> backported patches will have conflicts when merged down to android-4.X
> -- I think the 4.14/4.19 version will apply to all the android
> branches). Let me know how you want to handle this.
> 
> >
> > thanks,
> >
> > greg k-h
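
For reference, the shape of the fix described in the commit message above is
roughly the following (simplified sketch, not the exact binder_alloc code):

#include <linux/mm.h>

/* Re-check alloc->vma only after mmap_sem is held; if munmap() already ran,
 * binder_vma_close() has set it to NULL and zap_page_range() is skipped. */
static void reclaim_one_binder_page(struct binder_alloc *alloc,
				    unsigned long page_addr)
{
	struct mm_struct *mm = alloc->vma_vm_mm;
	struct vm_area_struct *vma;

	if (!down_read_trylock(&mm->mmap_sem))
		return;

	vma = alloc->vma;	/* read *after* acquiring mmap_sem */
	if (vma)
		zap_page_range(vma, page_addr, PAGE_SIZE);

	up_read(&mm->mmap_sem);
}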


Re: [PATCH] staging: android: ashmem: Avoid range_alloc() allocation with ashmem_mutex held.

2019-02-25 Thread Joel Fernandes
On Fri, Feb 22, 2019 at 3:04 AM Tetsuo Handa
 wrote:
>
> ashmem_pin() is calling range_shrink() without checking whether
> range_alloc() succeeded. Also, doing memory allocation with ashmem_mutex
> held should be avoided because ashmem_shrink_scan() tries to hold it.
>
> Therefore, move memory allocation for range_alloc() to ashmem_pin_unpin()
> and make range_alloc() not to fail.
>
> This patch is mostly meant for backporting purpose for fuzz testing on
> stable/distributor kernels, for there is a plan to remove this code in
> near future.
>
> Signed-off-by: Tetsuo Handa 
> Cc: sta...@vger.kernel.org

Properly this time,
Reviewed-by: Joel Fernandes 

thanks,

 - Joel



> ---
>  drivers/staging/android/ashmem.c | 42 
> +++-
>  1 file changed, 24 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/staging/android/ashmem.c 
> b/drivers/staging/android/ashmem.c
> index 5d5b091..74d497d 100644
> --- a/drivers/staging/android/ashmem.c
> +++ b/drivers/staging/android/ashmem.c
> @@ -172,19 +172,15 @@ static inline void lru_del(struct ashmem_range *range)
>   * @end:  The ending page (inclusive)
>   *
>   * This function is protected by ashmem_mutex.
> - *
> - * Return: 0 if successful, or -ENOMEM if there is an error
>   */
> -static int range_alloc(struct ashmem_area *asma,
> -  struct ashmem_range *prev_range, unsigned int purged,
> -  size_t start, size_t end)
> +static void range_alloc(struct ashmem_area *asma,
> +   struct ashmem_range *prev_range, unsigned int purged,
> +   size_t start, size_t end,
> +   struct ashmem_range **new_range)
>  {
> -   struct ashmem_range *range;
> -
> -   range = kmem_cache_zalloc(ashmem_range_cachep, GFP_KERNEL);
> -   if (!range)
> -   return -ENOMEM;
> +   struct ashmem_range *range = *new_range;
>
> +   *new_range = NULL;
> range->asma = asma;
> range->pgstart = start;
> range->pgend = end;
> @@ -194,8 +190,6 @@ static int range_alloc(struct ashmem_area *asma,
>
> if (range_on_lru(range))
> lru_add(range);
> -
> -   return 0;
>  }
>
>  /**
> @@ -597,7 +591,8 @@ static int get_name(struct ashmem_area *asma, void __user 
> *name)
>   *
>   * Caller must hold ashmem_mutex.
>   */
> -static int ashmem_pin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
> +static int ashmem_pin(struct ashmem_area *asma, size_t pgstart, size_t pgend,
> + struct ashmem_range **new_range)
>  {
> struct ashmem_range *range, *next;
> int ret = ASHMEM_NOT_PURGED;
> @@ -650,7 +645,7 @@ static int ashmem_pin(struct ashmem_area *asma, size_t 
> pgstart, size_t pgend)
>  * second half and adjust the first chunk's endpoint.
>  */
> range_alloc(asma, range, range->purged,
> -   pgend + 1, range->pgend);
> +   pgend + 1, range->pgend, new_range);
> range_shrink(range, range->pgstart, pgstart - 1);
> break;
> }
> @@ -664,7 +659,8 @@ static int ashmem_pin(struct ashmem_area *asma, size_t 
> pgstart, size_t pgend)
>   *
>   * Caller must hold ashmem_mutex.
>   */
> -static int ashmem_unpin(struct ashmem_area *asma, size_t pgstart, size_t 
> pgend)
> +static int ashmem_unpin(struct ashmem_area *asma, size_t pgstart, size_t 
> pgend,
> +   struct ashmem_range **new_range)
>  {
> struct ashmem_range *range, *next;
> unsigned int purged = ASHMEM_NOT_PURGED;
> @@ -690,7 +686,8 @@ static int ashmem_unpin(struct ashmem_area *asma, size_t 
> pgstart, size_t pgend)
> }
> }
>
> -   return range_alloc(asma, range, purged, pgstart, pgend);
> +   range_alloc(asma, range, purged, pgstart, pgend, new_range);
> +   return 0;
>  }
>
>  /*
> @@ -723,10 +720,17 @@ static int ashmem_pin_unpin(struct ashmem_area *asma, 
> unsigned long cmd,
> struct ashmem_pin pin;
> size_t pgstart, pgend;
> int ret = -EINVAL;
> +   struct ashmem_range *range = NULL;
>
> if (copy_from_user(&pin, p, sizeof(pin)))
> return -EFAULT;
>
> +   if (cmd == ASHMEM_PIN || cmd == ASHMEM_UNPIN) {
> +   range = kmem_cache_zalloc(ashmem_range_cachep, GFP_KERNEL);
> +   if (!range)
> +   return -ENOMEM;
> +   }
> +
> 

Re: [PATCH] staging: android: ashmem: Avoid range_alloc() allocation with ashmem_mutex held.

2019-02-25 Thread Joel Fernandes
On Mon, Feb 25, 2019 at 2:11 PM Tetsuo Handa
 wrote:
>
> On 2019/02/26 6:55, Joel Fernandes wrote:
> >> @@ -763,6 +767,8 @@ static int ashmem_pin_unpin(struct ashmem_area *asma, 
> >> unsigned long cmd,
> >>
> >>  out_unlock:
> >>  mutex_unlock(&ashmem_mutex);
> >> +if (range)
> >> +kmem_cache_free(ashmem_range_cachep, range);
> >
> > This seems a bit broken to me. Once a range has been added to the LRU list,
> > it is then being freed here. So then the ashmem_lru_list will contain a
> > dangling range, right?
>
> If this range was used in range_alloc(), range == NULL here due to
>
> +   struct ashmem_range *range = *new_range;
>
> +   *new_range = NULL;
>
> . Thus, this range won't be freed here if range_alloc() was called. What am I 
> missing?

Sorry, this message is stale. I take it back, I was supposed to delete
it before I made the Reviewed-by tag.. You didn't miss anything..
Please only infer the "Reviewed-by" tag from my reply and ignore this
message.
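
The ownership hand-off Tetsuo describes can be sketched as follows
(illustration only, mirroring the *new_range handling in the patch): the
caller preallocates outside ashmem_mutex, range_alloc() clears *new_range
when it takes ownership, and the caller frees only what was left unconsumed,
so nothing that reached the LRU list is ever freed by the caller.

static int pin_unpin_sketch(unsigned long cmd)
{
	struct ashmem_range *range = NULL;
	int ret = 0;

	if (cmd == ASHMEM_PIN || cmd == ASHMEM_UNPIN) {
		range = kmem_cache_zalloc(ashmem_range_cachep, GFP_KERNEL);
		if (!range)
			return -ENOMEM;
	}

	mutex_lock(&ashmem_mutex);
	/* ... may call range_alloc(..., &range), which sets range = NULL ... */
	mutex_unlock(&ashmem_mutex);

	if (range)	/* not consumed: safe to free, it was never published */
		kmem_cache_free(ashmem_range_cachep, range);
	return ret;
}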


Re: [PATCH] staging: android: ashmem: Avoid range_alloc() allocation with ashmem_mutex held.

2019-02-25 Thread Joel Fernandes
On Fri, Feb 22, 2019 at 08:03:55PM +0900, Tetsuo Handa wrote:
> ashmem_pin() is calling range_shrink() without checking whether
> range_alloc() succeeded. Also, doing memory allocation with ashmem_mutex
> held should be avoided because ashmem_shrink_scan() tries to hold it.
> 
> Therefore, move memory allocation for range_alloc() to ashmem_pin_unpin()
> and make range_alloc() not to fail.
> 
> This patch is mostly meant for backporting purpose for fuzz testing on
> stable/distributor kernels, for there is a plan to remove this code in
> near future.
> 
> Signed-off-by: Tetsuo Handa 
> Cc: sta...@vger.kernel.org

Reviewed-by: Joel Fernandes (Google) 

thanks,

 - Joel


> ---
>  drivers/staging/android/ashmem.c | 42 
> +++-
>  1 file changed, 24 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/staging/android/ashmem.c 
> b/drivers/staging/android/ashmem.c
> index 5d5b091..74d497d 100644
> --- a/drivers/staging/android/ashmem.c
> +++ b/drivers/staging/android/ashmem.c
> @@ -172,19 +172,15 @@ static inline void lru_del(struct ashmem_range *range)
>   * @end:The ending page (inclusive)
>   *
>   * This function is protected by ashmem_mutex.
> - *
> - * Return: 0 if successful, or -ENOMEM if there is an error
>   */
> -static int range_alloc(struct ashmem_area *asma,
> -struct ashmem_range *prev_range, unsigned int purged,
> -size_t start, size_t end)
> +static void range_alloc(struct ashmem_area *asma,
> + struct ashmem_range *prev_range, unsigned int purged,
> + size_t start, size_t end,
> + struct ashmem_range **new_range)
>  {
> - struct ashmem_range *range;
> -
> - range = kmem_cache_zalloc(ashmem_range_cachep, GFP_KERNEL);
> - if (!range)
> - return -ENOMEM;
> + struct ashmem_range *range = *new_range;
>  
> + *new_range = NULL;
>   range->asma = asma;
>   range->pgstart = start;
>   range->pgend = end;
> @@ -194,8 +190,6 @@ static int range_alloc(struct ashmem_area *asma,
>  
>   if (range_on_lru(range))
>   lru_add(range);
> -
> - return 0;
>  }
>  
>  /**
> @@ -597,7 +591,8 @@ static int get_name(struct ashmem_area *asma, void __user 
> *name)
>   *
>   * Caller must hold ashmem_mutex.
>   */
> -static int ashmem_pin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
> +static int ashmem_pin(struct ashmem_area *asma, size_t pgstart, size_t pgend,
> +   struct ashmem_range **new_range)
>  {
>   struct ashmem_range *range, *next;
>   int ret = ASHMEM_NOT_PURGED;
> @@ -650,7 +645,7 @@ static int ashmem_pin(struct ashmem_area *asma, size_t 
> pgstart, size_t pgend)
>* second half and adjust the first chunk's endpoint.
>*/
>   range_alloc(asma, range, range->purged,
> - pgend + 1, range->pgend);
> + pgend + 1, range->pgend, new_range);
>   range_shrink(range, range->pgstart, pgstart - 1);
>   break;
>   }
> @@ -664,7 +659,8 @@ static int ashmem_pin(struct ashmem_area *asma, size_t 
> pgstart, size_t pgend)
>   *
>   * Caller must hold ashmem_mutex.
>   */
> -static int ashmem_unpin(struct ashmem_area *asma, size_t pgstart, size_t 
> pgend)
> +static int ashmem_unpin(struct ashmem_area *asma, size_t pgstart, size_t 
> pgend,
> + struct ashmem_range **new_range)
>  {
>   struct ashmem_range *range, *next;
>   unsigned int purged = ASHMEM_NOT_PURGED;
> @@ -690,7 +686,8 @@ static int ashmem_unpin(struct ashmem_area *asma, size_t 
> pgstart, size_t pgend)
>   }
>   }
>  
> - return range_alloc(asma, range, purged, pgstart, pgend);
> + range_alloc(asma, range, purged, pgstart, pgend, new_range);
> + return 0;
>  }
>  
>  /*
> @@ -723,10 +720,17 @@ static int ashmem_pin_unpin(struct ashmem_area *asma, 
> unsigned long cmd,
>   struct ashmem_pin pin;
>   size_t pgstart, pgend;
>   int ret = -EINVAL;
> + struct ashmem_range *range = NULL;
>  
>   if (copy_from_user(&pin, p, sizeof(pin)))
>   return -EFAULT;
>  
> + if (cmd == ASHMEM_PIN || cmd == ASHMEM_UNPIN) {
> + range = kmem_cache_zalloc(ashmem_range_cachep, GFP_KERNEL);
> + if (!range)
> + return -ENOMEM;
> + }
> +
>   mutex_lock(&ashmem_mutex);
>   wait_event(ashmem_shrink_wait, !atomic_read(&ashmem

Re: [PATCH v3 1/7] binder: create userspace-to-binder-buffer copy function

2019-02-14 Thread Joel Fernandes
On Thu, Feb 14, 2019 at 01:55:19PM -0800, 'Todd Kjos' via kernel-team wrote:
> On Thu, Feb 14, 2019 at 1:25 PM Joel Fernandes  wrote:
> >
> > On Thu, Feb 14, 2019 at 03:53:54PM -0500, Joel Fernandes wrote:
> > > On Thu, Feb 14, 2019 at 3:42 PM Todd Kjos  wrote:
> > > >
> > > > On Thu, Feb 14, 2019 at 11:45 AM Joel Fernandes  
> > > > wrote:
> > > [snip]
> > > > > > + * check_buffer() - verify that buffer/offset is safe to access
> > > > > > + * @alloc: binder_alloc for this proc
> > > > > > + * @buffer: binder buffer to be accessed
> > > > > > + * @offset: offset into @buffer data
> > > > > > + * @bytes: bytes to access from offset
> > > > > > + *
> > > > > > + * Check that the @offset/@bytes are within the size of the given
> > > > > > + * @buffer and that the buffer is currently active and not 
> > > > > > freeable.
> > > > > > + * Offsets must also be multiples of sizeof(u32). The kernel is
> > > > >
> > > > > In all callers of binder_alloc_copy_user_to_buffer, the alignment of 
> > > > > offsets
> > > > > is set to sizeof(void *). Then shouldn't this function check for 
> > > > > sizeof(void *)
> > > > > alignment instead of u32?
> > > >
> > > > But there are other callers of check_buffer() later in the series that
> > > > don't require pointer-size alignment. u32 alignment is consistent with
> > > > the alignment requirements of the binder driver before this change.
> > > > The copy functions don't actually need to insist on alignment, but
> > > > these binder buffer objects have always used u32 alignment which has
> > > > been checked in the driver. If user code misaligned it, then errors
> > > > are returned. The alignment checks are really to be consistent with
> > > > previous binder driver behavior.
> > >
> > > Got it, thanks.
> >
> > One more thing I wanted to ask is, kmap() will now cause global lock
> > contention because of using spin_lock due to kmap_high().
> >
> > Previously the binder driver was made to not use global lock (as you had
> > done). Now these paths will start global locking on 32-bit architectures.
> > Would that degrade performance?
> 
> There was a lot of concern about 32-bit performance both for
> lock-contention and the cost of map/unmap operations. Of course,
> 32-bit systems are also where the primary win is -- namely avoiding
> vmalloc space depletion. So there was a several months-long evaluation
> period on 32-bit devices by a silicon vendor who did a lot of testing
> across a broad set of benchmarks / workloads to verify the performance
> costs are acceptable. We also ran tests to try to exhaust the kmap
> space with multiple large buffers.
> 
> The testing did find that there is some performance degradation for
> large buffer transfers, but there are no cases where this
> significantly impacted a meaningful user workload.
> 
> >
> > Are we not using kmap_atomic() in this patch because of any concern that the
> > kmap fixmap space is limited and may run out?
> 
> We're not using the atomic version here since we can't guarantee that
> the loop will be atomic since we are calling copy_from_user(). Later
> in the series, other cases do use kmap_atomic().

Got it, thanks for all the clarifications,

- Joel
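
For context, the page-by-page copy loop under discussion has roughly this
shape (hedged sketch; the page-lookup helper is an assumption and the real
binder_alloc_copy_user_to_buffer() differs in detail). kmap() rather than
kmap_atomic() is used because copy_from_user() can fault and sleep inside
the loop:

static int copy_user_to_buffer_sketch(struct binder_buffer *buffer,
				      binder_size_t offset,
				      const void __user *from, size_t bytes)
{
	while (bytes) {
		unsigned long size, ret;
		struct page *page;
		pgoff_t pgoff;
		void *kptr;

		/* assumed helper: page backing `offset`, plus offset in page */
		page = get_buffer_page(buffer, offset, &pgoff);
		size = min_t(size_t, bytes, PAGE_SIZE - pgoff);

		kptr = kmap(page) + pgoff;
		ret = copy_from_user(kptr, from, size);	/* may sleep/fault */
		kunmap(page);
		if (ret)
			return bytes - size + ret;	/* bytes not copied */

		bytes -= size;
		from += size;
		offset += size;
	}
	return 0;
}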



Re: [PATCH v3 1/7] binder: create userspace-to-binder-buffer copy function

2019-02-14 Thread Joel Fernandes
On Thu, Feb 14, 2019 at 03:53:54PM -0500, Joel Fernandes wrote:
> On Thu, Feb 14, 2019 at 3:42 PM Todd Kjos  wrote:
> >
> > On Thu, Feb 14, 2019 at 11:45 AM Joel Fernandes  wrote:
> [snip]
> > > > + * check_buffer() - verify that buffer/offset is safe to access
> > > > + * @alloc: binder_alloc for this proc
> > > > + * @buffer: binder buffer to be accessed
> > > > + * @offset: offset into @buffer data
> > > > + * @bytes: bytes to access from offset
> > > > + *
> > > > + * Check that the @offset/@bytes are within the size of the given
> > > > + * @buffer and that the buffer is currently active and not freeable.
> > > > + * Offsets must also be multiples of sizeof(u32). The kernel is
> > >
> > > In all callers of binder_alloc_copy_user_to_buffer, the alignment of 
> > > offsets
> > > is set to sizeof(void *). Then shouldn't this function check for 
> > > sizeof(void *)
> > > alignment instead of u32?
> >
> > But there are other callers of check_buffer() later in the series that
> > don't require pointer-size alignment. u32 alignment is consistent with
> > the alignment requirements of the binder driver before this change.
> > The copy functions don't actually need to insist on alignment, but
> > these binder buffer objects have always used u32 alignment which has
> > been checked in the driver. If user code misaligned it, then errors
> > are returned. The alignment checks are really to be consistent with
> > previous binder driver behavior.
> 
> Got it, thanks.

One more thing I wanted to ask is, kmap() will now cause global lock
contention because of using spin_lock due to kmap_high().

Previously the binder driver was made to not use global lock (as you had
done). Now these paths will start global locking on 32-bit architectures.
Would that degrade performance?

Are we not using kmap_atomic() in this patch because of any concern that the
kmap fixmap space is limited and may run out?

thanks,

 - Joel




Re: [PATCH v3 1/7] binder: create userspace-to-binder-buffer copy function

2019-02-14 Thread Joel Fernandes
On Thu, Feb 14, 2019 at 3:42 PM Todd Kjos  wrote:
>
> On Thu, Feb 14, 2019 at 11:45 AM Joel Fernandes  wrote:
[snip]
> > > + * check_buffer() - verify that buffer/offset is safe to access
> > > + * @alloc: binder_alloc for this proc
> > > + * @buffer: binder buffer to be accessed
> > > + * @offset: offset into @buffer data
> > > + * @bytes: bytes to access from offset
> > > + *
> > > + * Check that the @offset/@bytes are within the size of the given
> > > + * @buffer and that the buffer is currently active and not freeable.
> > > + * Offsets must also be multiples of sizeof(u32). The kernel is
> >
> > In all callers of binder_alloc_copy_user_to_buffer, the alignment of offsets
> > is set to sizeof(void *). Then shouldn't this function check for 
> > sizeof(void *)
> > alignment instead of u32?
>
> But there are other callers of check_buffer() later in the series that
> don't require pointer-size alignment. u32 alignment is consistent with
> the alignment requirements of the binder driver before this change.
> The copy functions don't actually need to insist on alignment, but
> these binder buffer objects have always used u32 alignment which has
> been checked in the driver. If user code misaligned it, then errors
> are returned. The alignment checks are really to be consistent with
> previous binder driver behavior.

Got it, thanks.

 - Joel
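
A sketch of the kind of validation check_buffer() performs, per the
explanation above (illustration only, not the exact driver code): the
offset/size pair must stay within the buffer without overflowing, offsets
keep the driver's historical u32 alignment, and the buffer must still be
live.

static bool check_buffer_sketch(struct binder_buffer *buffer,
				binder_size_t offset, size_t bytes)
{
	size_t buffer_size = buffer->data_size;

	return buffer_size >= bytes &&
	       offset <= buffer_size - bytes &&	/* no overflow past the end */
	       IS_ALIGNED(offset, sizeof(u32)) &&
	       !buffer->free;			/* buffer still in use */
}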


Re: [PATCH 2/2] staging: android: ashmem: Don't allow range_alloc() to fail.

2019-02-14 Thread Joel Fernandes
On Sat, Feb 09, 2019 at 11:24:03AM +0900, Tetsuo Handa wrote:
> On 2019/02/08 23:45, Joel Fernandes wrote:
> > On Wed, Feb 06, 2019 at 08:45:32AM +0900, Tetsuo Handa wrote:
> >> Joel Fernandes wrote:
> >>> On Tue, Feb 05, 2019 at 07:28:41PM +0900, Tetsuo Handa wrote:
> >>>> ashmem_pin() is calling range_shrink() without checking whether
> >>>> range_alloc() succeeded. Since memory allocation fault injection might
> >>>> force range_alloc() to fail while range_alloc() is called for only once
> >>>> for one ioctl() request, make range_alloc() not to fail.
> >>>
> >>> Why does this not need to fail? I am worried your change will introduce
> >>> unwanted endless looping in the kernel instead of gracefully failing a
> >>> pin/unpin request.
> >>
> >> This patch won't introduce endless looping in the kernel, due to a rule 
> >> called
> >>
> >>   The "too small to fail" memory-allocation rule
> >>   https://lwn.net/Articles/627419/
> >>
> >> . In short, memory allocation by range_alloc() might fail only if current 
> >> thread
> >> was killed by the OOM killer or memory allocation fault injection mechanism
> >> forced it to fail. And this patch helps doing fuzzing test with minimal 
> >> changes.
> >>
> >>>
> >>> Unless there is a good reason, I suggest to drop this patch from the 
> >>> series;
> >>> but let us discuss more if you want.
> >>
> >> We can allocate memory for range_alloc() before holding ashmem_mutex at
> >> ashmem_pin_unpin() if you don't want to use __GFP_NOFAIL. It is better to
> >> avoid memory allocation with ashmem_mutex because ashmem_mutex is held by
> >> shrinker function. But given that this module is going to be replaced 
> >> shortly,
> >> does it worth moving the memory allocation to the caller?
> > 
> > Even if memory allocation does not fail right now, I still want to return 
> > the
> > error code correctly via ashmem_pin_unpin() so that if in the future there
> > are changes to the allocator algorithm, then a silent success isn't reported
> > when a failure should be reported..
> > 
> > It doesn't make sense to return success when an allocation failed.. even if
> > you are asking this code to rely on the allocator's "too small to fail"
> > behavior.. can we guarantee this allocator behavior will always exist?
> > Probably not.
> > 
> 
> Does always exist. Small GFP_KERNEL allocation without __GFP_NORETRY or
> __GFP_RETRY_MAYFAIL won't fail unless current thread was killed by the OOM
> killer or memory allocation fault injection mechanism forced it to fail.
> Failing such allocation instead of invoking the OOM killer is effectively
> equivalent to sending SIGKILL to current thread. If current thread got
> SIGKILL, current thread can't receive the error code of ioctl() from
> ashmem_pin_unpin() because userspace code is no longer executed.
> 
> MM people want to make !GFP_KERNEL memory allocations fail rather than
> retry forever. But failing small GFP_KERNEL memory allocations is such
> a horrible idea. Anyway, here is an updated patch...

I am sorry, what has changed in the updated patch? Please send clear diff
notes in your patches or at least mention exactly what changed since previous
patch revision. It is very difficult to review if you don't even mention what
changed since previous revision. Please resend the patches again.

Also I am OK with assuming small alloc success, so we are on the same page
about that.

One more comment below..

Thanks!

> From f2c8a098ebf69122fc440d9e9ca3c9cd786cce8a Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa 
> Date: Sat, 9 Feb 2019 11:07:07 +0900
> Subject: [PATCH] staging: android: ashmem: Avoid range_alloc() allocation with
>  ashmem_mutex held.
> 
> ashmem_pin() is calling range_shrink() without checking whether
> range_alloc() succeeded. Also, doing memory allocation with ashmem_mutex
> held should be avoided because ashmem_shrink_scan() tries to hold it.
> 
> Therefore, move memory allocation for range_alloc() to ashmem_pin_unpin()
> and make range_alloc() not to fail.
> 
> Signed-off-by: Tetsuo Handa 
> ---
>  drivers/staging/android/ashmem.c | 42 
> +++-
>  1 file changed, 24 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/staging/android/ashmem.c 
> b/drivers/staging/android/ashmem.c
> index ade8438..910826d 100644
> --- a/drivers/staging/android/ashmem.c
> +++ b/drivers/staging/android/ashmem

Re: [PATCH 2/2] staging: android: ashmem: Don't allow range_alloc() to fail.

2019-02-08 Thread Joel Fernandes
On Wed, Feb 06, 2019 at 08:45:32AM +0900, Tetsuo Handa wrote:
> Joel Fernandes wrote:
> > On Tue, Feb 05, 2019 at 07:28:41PM +0900, Tetsuo Handa wrote:
> > > ashmem_pin() is calling range_shrink() without checking whether
> > > range_alloc() succeeded. Since memory allocation fault injection might
> > > force range_alloc() to fail while range_alloc() is called for only once
> > > for one ioctl() request, make range_alloc() not to fail.
> > 
> > Why does this not need to fail? I am worried your change will introduce
> > unwanted endless looping in the kernel instead of gracefully failing a
> > pin/unpin request.
> 
> This patch won't introduce endless looping in the kernel, due to a rule called
> 
>   The "too small to fail" memory-allocation rule
>   https://lwn.net/Articles/627419/
> 
> . In short, memory allocation by range_alloc() might fail only if current 
> thread
> was killed by the OOM killer or memory allocation fault injection mechanism
> forced it to fail. And this patch helps doing fuzzing test with minimal 
> changes.
> 
> > 
> > Unless there is a good reason, I suggest to drop this patch from the series;
> > but let us discuss more if you want.
> 
> We can allocate memory for range_alloc() before holding ashmem_mutex at
> ashmem_pin_unpin() if you don't want to use __GFP_NOFAIL. It is better to
> avoid memory allocation with ashmem_mutex because ashmem_mutex is held by
> shrinker function. But given that this module is going to be replaced shortly,
> does it worth moving the memory allocation to the caller?

Even if memory allocation does not fail right now, I still want to return the
error code correctly via ashmem_pin_unpin() so that if in the future there
are changes to the allocator algorithm, then a silent success isn't reported
when a failure should be reported..

It doesn't make sense to return success when an allocation failed.. even if
you are asking this code to rely on the allocator's "too small to fail"
behavior.. can we guarantee this allocator behavior will always exist?
Probably not.

thanks,

 - Joel



Re: [PATCH 1/3] sched/wait: use freezable_schedule when possible

2019-02-07 Thread Joel Fernandes
On Thu, Feb 07, 2019 at 12:30:50AM +0100, Hugo Lefeuvre wrote:
> Hi Joel,
> 
> > I'm curious did you try the freezing process and see if pointless wakeups 
> > are
> > reduced?  That would be an added bonus if you did.
> 
> I'm currently testing these changes. I hope to be able to come back with
> more concrete results soon.
> 
> Also, I just noticed that the third patch removes a necessary #include
> . I will submit an updated version tomorrow.
> 
> Thanks for the review!

Sure, add these test results to the patch as well showing reduced wakeups.

I would say submit the freezable_schedule as a single separate patch
independent of the vsoc series since it can go in separately, and also
benefits other things than vsoc.

Also CC Rafael (power maintainer) on it.

Thank you!

 - Joel

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 2/3] sched/wait: introduce wait_event_freezable_hrtimeout

2019-02-05 Thread Joel Fernandes
On Fri, Feb 01, 2019 at 06:38:35AM +0100, Hugo Lefeuvre wrote:
> introduce wait_event_freezable_hrtimeout, an interruptible and freezable
> version of wait_event_hrtimeout.
> 
> Among others this helper will allow for simplifications in
> staging/android/vsoc.c.
> 
> Signed-off-by: Hugo Lefeuvre 
> ---

Reviewed-by: Joel Fernandes (Google) 

thanks,

 - Joel

>  include/linux/wait.h | 25 +
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 5f3efabc36f4..c4cf5113f58a 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -483,7 +483,7 @@ do {  
> \
>   __ret;  
> \
>  })
>  
> -#define __wait_event_hrtimeout(wq_head, condition, timeout, state)   
> \
> +#define __wait_event_hrtimeout(wq_head, condition, timeout, state, cmd)  
> \
>  ({   
> \
>   int __ret = 0;  
> \
>   struct hrtimer_sleeper __t; 
> \
> @@ -500,7 +500,7 @@ do {  
> \
>   __ret = -ETIME; 
> \
>   break;  
> \
>   }   
> \
> - schedule());
> \
> + cmd);   
> \
>   
> \
>   hrtimer_cancel(&__t.timer); 
> \
>   destroy_hrtimer_on_stack(&__t.timer);   
> \
> @@ -529,7 +529,23 @@ do { 
> \
>   might_sleep();  
> \
>   if (!(condition))   
> \
>   __ret = __wait_event_hrtimeout(wq_head, condition, timeout, 
> \
> -TASK_UNINTERRUPTIBLE);   
> \
> +TASK_UNINTERRUPTIBLE,
> \
> +schedule()); 
> \
> + __ret;  
> \
> +})
> +
> +/*
> + * like wait_event_hrtimeout() -- except it uses TASK_INTERRUPTIBLE to avoid
> + * increasing load and is freezable.
> + */
> +#define wait_event_freezable_hrtimeout(wq_head, condition, timeout)  
> \
> +({   
> \
> + int __ret = 0;  
> \
> + might_sleep();  
> \
> + if (!(condition))   
> \
> + __ret = __wait_event_hrtimeout(wq_head, condition, timeout, 
> \
> +TASK_INTERRUPTIBLE,  
> \
> +freezable_schedule());   
> \
>   __ret;  
> \
>  })
>  
> @@ -555,7 +571,8 @@ do {  
> \
>   might_sleep();  
> \
>   if (!(condition))   
> \
>   __ret = __wait_event_hrtimeout(wq, condition, timeout,  
> \
> -TASK_INTERRUPTIBLE); 
> \
> +TASK_INTERRUPTIBLE,  
> \
> +schedule()); 
> \
>   __ret;  
> \
>  })
>  
> -- 
> 2.20.1
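
A possible usage sketch of the helper added by this patch, with hypothetical
names (the timeout is a relative ktime_t, as with wait_event_hrtimeout()):

#include <linux/atomic.h>
#include <linux/freezer.h>
#include <linux/ktime.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wq);
static atomic_t demo_ready = ATOMIC_INIT(0);

/* Sleep until demo_ready is set or the hrtimer timeout expires; the sleep is
 * interruptible and cooperates with the freezer. Returns 0 when the condition
 * became true, -ETIME on timeout, -ERESTARTSYS if interrupted. */
static int demo_wait(u64 timeout_ns)
{
	return wait_event_freezable_hrtimeout(demo_wq,
					      atomic_read(&demo_ready),
					      ns_to_ktime(timeout_ns));
}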


Re: [PATCH 1/3] sched/wait: use freezable_schedule when possible

2019-02-05 Thread Joel Fernandes
On Fri, Feb 01, 2019 at 06:38:05AM +0100, Hugo Lefeuvre wrote:
> Replace schedule(); try_to_freeze() by freezable_schedule().
> 
> Tasks calling freezable_schedule() set the PF_FREEZER_SKIP flag
> before calling schedule(). Unlike tasks calling schedule();
> try_to_freeze() tasks calling freezable_schedule() are not awaken by
> try_to_freeze_tasks(). Instead they call try_to_freeze() when they
> wake up if the freeze is still underway.
> 
> It is not a problem since sleeping tasks can't do anything which isn't
> allowed for a frozen task while sleeping.
> 
> The result is a potential performance gain during freeze, since less
> tasks have to be awaken.

Reviewed-by: Joel Fernandes (Google) 

thanks,

 - Joel

> 
> Signed-off-by: Hugo Lefeuvre 
> ---
>  include/linux/wait.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index ed7c122cb31f..5f3efabc36f4 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -308,7 +308,7 @@ do {  
> \
>  
>  #define __wait_event_freezable(wq_head, condition)   
> \
>   ___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0, 
> \
> - schedule(); try_to_freeze())
> + freezable_schedule())
>  
>  /**
>   * wait_event_freezable - sleep (or freeze) until a condition gets true
> @@ -367,7 +367,7 @@ do {  
> \
>  #define __wait_event_freezable_timeout(wq_head, condition, timeout)  
> \
>   ___wait_event(wq_head, ___wait_cond_timeout(condition), 
> \
> TASK_INTERRUPTIBLE, 0, timeout,   
> \
> -   __ret = schedule_timeout(__ret); try_to_freeze())
> +   __ret = freezable_schedule_timeout(__ret))
>  
>  /*
>   * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE to avoid
> @@ -588,7 +588,7 @@ do {  
> \
>  
>  #define __wait_event_freezable_exclusive(wq, condition)  
> \
>   ___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0,  
> \
> - schedule(); try_to_freeze())
> + freezable_schedule())
>  
>  #define wait_event_freezable_exclusive(wq, condition)
> \
>  ({   
> \
> -- 
> 2.20.1


Re: [PATCH 1/2] staging: android: ashmem: Don't call fallocate() with ashmem_mutex held.

2019-02-05 Thread Joel Fernandes
On Tue, Feb 05, 2019 at 07:28:40PM +0900, Tetsuo Handa wrote:
> syzbot is hitting lockdep warnings [1][2][3]. This patch tries to fix
> the warning by eliminating ashmem_shrink_scan() => {shmem|vfs}_fallocate()
> sequence.
> 
> [1] 
> https://syzkaller.appspot.com/bug?id=87c399f6fa6955006080b24142e2ce7680295ad4
> [2] 
> https://syzkaller.appspot.com/bug?id=7ebea492de7521048355fc84210220e1038a7908
> [3] 
> https://syzkaller.appspot.com/bug?id=e02419c12131c24e2a957ea050c2ab6dcbbc3270
> 
> Reported-by: syzbot 
> Reported-by: syzbot 
> Reported-by: syzbot 
> Signed-off-by: Tetsuo Handa 
> Cc: sta...@vger.kernel.org
> ---

In the future, please include some information about what changed from
previous patches, or at least add a cover letter. And use patch revision
numbers. It is hard to know what changed since the previous review.

This looks good to me,

Acked-by: Joel Fernandes (Google) 

thanks,

 - Joel

>  drivers/staging/android/ashmem.c | 25 -
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/staging/android/ashmem.c 
> b/drivers/staging/android/ashmem.c
> index 90a8a9f1ac7d..ade8438a827a 100644
> --- a/drivers/staging/android/ashmem.c
> +++ b/drivers/staging/android/ashmem.c
> @@ -75,6 +75,9 @@ struct ashmem_range {
>  /* LRU list of unpinned pages, protected by ashmem_mutex */
>  static LIST_HEAD(ashmem_lru_list);
>  
> +static atomic_t ashmem_shrink_inflight = ATOMIC_INIT(0);
> +static DECLARE_WAIT_QUEUE_HEAD(ashmem_shrink_wait);
> +
>  /*
>   * long lru_count - The count of pages on our LRU list.
>   *
> @@ -438,7 +441,6 @@ static int ashmem_mmap(struct file *file, struct 
> vm_area_struct *vma)
>  static unsigned long
>  ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>  {
> - struct ashmem_range *range, *next;
>   unsigned long freed = 0;
>  
>   /* We might recurse into filesystem code, so bail out if necessary */
> @@ -448,21 +450,33 @@ ashmem_shrink_scan(struct shrinker *shrink, struct 
> shrink_control *sc)
>   if (!mutex_trylock(&ashmem_mutex))
>   return -1;
>  
> - list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
> + while (!list_empty(&ashmem_lru_list)) {
> + struct ashmem_range *range =
> + list_first_entry(&ashmem_lru_list, typeof(*range), lru);
>   loff_t start = range->pgstart * PAGE_SIZE;
>   loff_t end = (range->pgend + 1) * PAGE_SIZE;
> + struct file *f = range->asma->file;
>  
> - range->asma->file->f_op->fallocate(range->asma->file,
> - FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> - start, end - start);
> + get_file(f);
> + atomic_inc(&ashmem_shrink_inflight);
>   range->purged = ASHMEM_WAS_PURGED;
>   lru_del(range);
>  
>   freed += range_size(range);
> + mutex_unlock(&ashmem_mutex);
> + f->f_op->fallocate(f,
> +FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +start, end - start);
> + fput(f);
> + if (atomic_dec_and_test(&ashmem_shrink_inflight))
> + wake_up_all(&ashmem_shrink_wait);
> + if (!mutex_trylock(&ashmem_mutex))
> + goto out;
>   if (--sc->nr_to_scan <= 0)
>   break;
>   }
>   mutex_unlock(&ashmem_mutex);
> +out:
>   return freed;
>  }
>  
> @@ -713,6 +727,7 @@ static int ashmem_pin_unpin(struct ashmem_area *asma, 
> unsigned long cmd,
>   return -EFAULT;
>  
>   mutex_lock(&ashmem_mutex);
> + wait_event(ashmem_shrink_wait, !atomic_read(&ashmem_shrink_inflight));
>  
>   if (!asma->file)
>   goto out_unlock;
> -- 
> 2.17.1
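
Condensed into one place, the synchronization this patch introduces looks
roughly like the following (sketch only, not the exact driver code): the
shrinker drops ashmem_mutex around the blocking fallocate() and advertises
itself through an in-flight counter; pin/unpin waits for that counter to
drain before touching the range lists.

/* Shrinker side: called with ashmem_mutex held, returns with it dropped;
 * the caller re-acquires the mutex before the next iteration. */
static void punch_one_range(struct ashmem_range *range)
{
	struct file *f = range->asma->file;

	get_file(f);
	atomic_inc(&ashmem_shrink_inflight);
	lru_del(range);

	mutex_unlock(&ashmem_mutex);		/* drop lock for the blocking call */
	f->f_op->fallocate(f, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			   range->pgstart * PAGE_SIZE,
			   range_size(range) * PAGE_SIZE);
	fput(f);
	if (atomic_dec_and_test(&ashmem_shrink_inflight))
		wake_up_all(&ashmem_shrink_wait);
}

/* pin/unpin side: wait until no punch-hole is in flight. */
static void wait_for_shrinker(void)
{
	mutex_lock(&ashmem_mutex);
	wait_event(ashmem_shrink_wait, !atomic_read(&ashmem_shrink_inflight));
	/* safe to modify ranges now */
}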


Re: [PATCH 2/2] staging: android: ashmem: Don't allow range_alloc() to fail.

2019-02-05 Thread Joel Fernandes
On Tue, Feb 05, 2019 at 07:28:41PM +0900, Tetsuo Handa wrote:
> ashmem_pin() is calling range_shrink() without checking whether
> range_alloc() succeeded. Since memory allocation fault injection might
> force range_alloc() to fail while range_alloc() is called for only once
> for one ioctl() request, make range_alloc() not to fail.

Why does this not need to fail? I am worried your change will introduce
unwanted endless looping in the kernel instead of gracefully failing a
pin/unpin request.

Unless there is a good reason, I suggest to drop this patch from the series;
but let us discuss more if you want.

thanks,

 - Joel

From the docs:
__GFP_NOFAIL - Indicate that this allocation is in no way allowed to fail
(think twice before using).


Re: [PATCH 3/3] staging/android: simplify handle_vsoc_cond_wait

2019-02-04 Thread Joel Fernandes
I think you sent these (and other patches) to the wrong address for Alistair.
Corrected address here.

On Fri, Feb 01, 2019 at 06:39:03AM +0100, Hugo Lefeuvre wrote:
> simplify handle_vsoc_cond_wait (drivers/staging/android/vsoc.c) using newly
> added wait_event_freezable_hrtimeout helper and remove useless includes.
> 
> Signed-off-by: Hugo Lefeuvre 
> ---
>  drivers/staging/android/vsoc.c | 69 +-
>  1 file changed, 10 insertions(+), 59 deletions(-)
> 
> diff --git a/drivers/staging/android/vsoc.c b/drivers/staging/android/vsoc.c
> index 22571abcaa4e..7e620e69f39d 100644
> --- a/drivers/staging/android/vsoc.c
> +++ b/drivers/staging/android/vsoc.c
> @@ -17,7 +17,6 @@
>   */
>  
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -29,7 +28,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include "uapi/vsoc_shm.h"
> @@ -401,7 +399,6 @@ static int handle_vsoc_cond_wait(struct file *filp, 
> struct vsoc_cond_wait *arg)
>   DEFINE_WAIT(wait);
>   u32 region_number = iminor(file_inode(filp));
>   struct vsoc_region_data *data = vsoc_dev.regions_data + region_number;
> - struct hrtimer_sleeper timeout, *to = NULL;
>   int ret = 0;
>   struct vsoc_device_region *region_p = vsoc_region_from_filep(filp);
>   atomic_t *address = NULL;
> @@ -420,69 +417,23 @@ static int handle_vsoc_cond_wait(struct file *filp, 
> struct vsoc_cond_wait *arg)
>   /* Ensure that the type of wait is valid */
>   switch (arg->wait_type) {
>   case VSOC_WAIT_IF_EQUAL:
> + ret = wait_event_freezable(data->futex_wait_queue,
> +arg->wakes++ &&
> +atomic_read(address) != arg->value);
>   break;
>   case VSOC_WAIT_IF_EQUAL_TIMEOUT:
> - to = &timeout;
> - break;
> - default:
> - return -EINVAL;
> - }
> -
> - if (to) {
> - /* Copy the user-supplied timesec into the kernel structure.
> -  * We do things this way to flatten differences between 32 bit
> -  * and 64 bit timespecs.
> -  */
>   if (arg->wake_time_nsec >= NSEC_PER_SEC)
>   return -EINVAL;
>   wake_time = ktime_set(arg->wake_time_sec, arg->wake_time_nsec);
> -
> - hrtimer_init_on_stack(&to->timer, CLOCK_MONOTONIC,
> -   HRTIMER_MODE_ABS);
> - hrtimer_set_expires_range_ns(&to->timer, wake_time,
> -  current->timer_slack_ns);
> -
> - hrtimer_init_sleeper(to, current);
> + ret = wait_event_freezable_hrtimeout(data->futex_wait_queue,
> +  arg->wakes++ &&
> +  atomic_read(address) != 
> arg->value,
> +  wake_time);
> + break;
> + default:
> + return -EINVAL;
>   }
>  
> - while (1) {
> - prepare_to_wait(&data->futex_wait_queue, &wait,
> - TASK_INTERRUPTIBLE);
> - /*
> -  * Check the sentinel value after prepare_to_wait. If the value
> -  * changes after this check the writer will call signal,
> -  * changing the task state from INTERRUPTIBLE to RUNNING. That
> -  * will ensure that schedule() will eventually schedule this
> -  * task.
> -  */
> - if (atomic_read(address) != arg->value) {
> - ret = 0;
> - break;
> - }
> - if (to) {
> - hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
> - if (likely(to->task))
> - freezable_schedule();
> - hrtimer_cancel(&to->timer);
> - if (!to->task) {
> - ret = -ETIMEDOUT;
> - break;
> - }
> - } else {
> - freezable_schedule();
> - }
> - /* Count the number of times that we woke up. This is useful
> -  * for unit testing.
> -  */
> - ++arg->wakes;
> - if (signal_pending(current)) {
> - ret = -EINTR;
> - break;
> - }
> - }
> - finish_wait(&data->futex_wait_queue, &wait);
> - if (to)
> - destroy_hrtimer_on_stack(&to->timer);
>   return ret;
>  }
>  
> -- 
> 2.20.1


Re: [PATCH 1/3] sched/wait: use freezable_schedule when possible

2019-02-02 Thread Joel Fernandes
On Fri, Feb 01, 2019 at 06:38:05AM +0100, Hugo Lefeuvre wrote:
> Replace schedule(); try_to_freeze() by freezable_schedule().
> 
> Tasks calling freezable_schedule() set the PF_FREEZER_SKIP flag
> before calling schedule(). Unlike tasks calling schedule();
> try_to_freeze() tasks calling freezable_schedule() are not awaken by
> try_to_freeze_tasks(). Instead they call try_to_freeze() when they
> wake up if the freeze is still underway.
> 
> It is not a problem since sleeping tasks can't do anything which isn't
> allowed for a frozen task while sleeping.
> 
> The result is a potential performance gain during freeze, since less
> tasks have to be awaken.

I'm curious did you try the freezing process and see if pointless wakeups are
reduced?  That would be an added bonus if you did.

Otherwise seems like a nice change. Peter and Rafael, what do you think?

thanks,

 - Joel



Re: [PATCH] sched/wait: introduce wait_event_freezable_hrtimeout

2019-01-22 Thread Joel Fernandes
On Sat, Jan 19, 2019 at 11:29:12AM +0100, Hugo Lefeuvre wrote:
> > > as far as I understand this code, freezable_schedule() avoids blocking the
> > > freezer during the schedule() call, but in the end try_to_freeze() is 
> > > still
> > > called so the result is the same, right?
> > > I wonder why wait_event_freezable is not calling freezable_schedule().
> > 
> > It could be something subtle in my view. freezable_schedule() actually makes
> > the freezer code not send a wake up to the sleeping task if a freeze 
> > happens,
> > because the PF_FREEZER_SKIP flag is set, as you pointed.
> > 
> > Whereas wait_event_freezable() which uses try_to_freeze() does not seem to 
> > have
> > this behavior and the task will enter the freezer. So I'm a bit skeptical if
> > your API will behave as expected (or at least consistently with other wait
> > APIs).
> 
> oh right, now it is clear to me:
> 
> - schedule(); try_to_freeze()
> 
> schedule() is called and the task enters sleep. Since PF_FREEZER_SKIP is
> not set, the task wakes up as soon as try_to_freeze_tasks() is called.
> Right after waking up the task calls try_to_freeze() which freezes it.
> 
> - freezable_schedule() 
> 
> schedule() is called and the task enters sleep. Since PF_FREEZER_SKIP is
> set, the task does not wake up when try_to_freeze_tasks() is called. This
> is not a problem, the task can't "do anything which isn't allowed for a
> frozen task" while sleeping[0]. 
> 
> When the task wakes up (timeout, or whatever other reason) it calls
> try_to_freeze() which freezes it if the freeze is still underway.
> 
> So if a freeze is triggered while the task is sleeping, a task executing
> freezable_schedule() might or might not notice the freeze depending on how
> long it sleeps. A task executing schedule(); try_to_freeze() will always
> notice it.
> 
> I might be wrong on that, but freezable_schedule() just seems like a
> performance improvement to me.
> 
> Now I fully agree with you that there should be a uniform definition of
> "freezable" between wait_event_freezable and wait_event_freezable_hrtimeout.
> This leads me to the question: should I modify my definition of
> wait_event_freezable_hrtimeout, or prepare a patch for wait_event_freezable ?
> 
> If I am right with the performance thing, the latter might be worth
> considering?
> 
> Either way, this will be fixed in the v2.

I agree, it is probably better to use freezable_schedule() for all freeze
related wait APIs, and keep it consistent. Your analysis is convincing.

Peter, what do you think?

> > > That being said, I am not sure that the try_to_freeze() call does anything
> > > in the vsoc case because there is no call to set_freezable() so the thread
> > > still has PF_NOFREEZE...
> > 
> > I traced this, and PF_NOFREEZE is not set by default for tasks.
> 
> Well, I did not check this in practice and might be confused somewhere but
> the documentation[1] says "kernel threads are not freezable by default.
> However, a kernel thread may clear PF_NOFREEZE for itself by calling
> set_freezable()".
> 
> Looking at the kthreadd() definition it seems like new tasks have
> PF_NOFREEZE set by default[2].
> 
> I'll take some time to check this in practice in the next days.
> 
> Anyways, thanks for your time !

Yeah, no problem.

thanks,

 - Joel
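
To illustrate the PF_NOFREEZE point: a kernel thread that wants to
participate in freezing has to opt in with set_freezable() before the
freezable wait helpers do anything useful for it. A minimal hedged sketch
(names are illustrative):

#include <linux/freezer.h>
#include <linux/kthread.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wq);
static bool demo_cond;

static int demo_thread(void *unused)
{
	set_freezable();	/* clears PF_NOFREEZE for this kthread */

	while (!kthread_should_stop()) {
		/* Sleeps interruptibly and cooperates with the freezer. */
		if (wait_event_freezable(demo_wq,
					 demo_cond || kthread_should_stop()))
			continue;	/* interrupted; rare for kthreads */
		demo_cond = false;
		/* ... do work ... */
	}
	return 0;
}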


Re: [PATCH] sched/wait: introduce wait_event_freezable_hrtimeout

2019-01-18 Thread Joel Fernandes
On Fri, Jan 18, 2019 at 06:08:01PM +0100, Hugo Lefeuvre wrote:
[...] 
> > > +/*
> > > + * like wait_event_hrtimeout() -- except it uses TASK_INTERRUPTIBLE to 
> > > avoid
> > > + * increasing load and is freezable.
> > > + */
> > > +#define wait_event_freezable_hrtimeout(wq_head, condition, timeout)  
> > > \
> > 
> > You should document the variable names in the header comments.
> 
> Agree. This comment was copy/pasted from wait_event_freezable_timeout,
> should I fix it there as well?
> 
> > Also, this new API appears to conflict with definition of 'freezable' in
> > wait_event_freezable I think,
> > 
> > wait_event_freezable - sleep or freeze until condition is true.
> > 
> > wait_event_freezable_hrtimeout - sleep but make sure freezer is not blocked.
> > (your API)
> > 
> > It seems to me these are 2 different definitions of 'freezing' (or I could 
> > be
> > completely confused). But in the first case it calls try_to_freeze after
> > schedule(). In the second case (your new API), it calls 
> > freezable_schedule().
> > 
> > So I am wondering why is there this difference in the 'meaning of 
> > freezable'.
> > In the very least, any such subtle differences should be documented in the
> > header comments IMO.
> 
> It appears that freezable_schedule() and schedule(); try_to_freeze() are
> almost identical:
> 
> static inline void freezable_schedule(void)
> {
> freezer_do_not_count();
> schedule();
> freezer_count();
> }
> 
> and
> 
> static inline void freezer_do_not_count(void)
> {
> current->flags |= PF_FREEZER_SKIP;
> }
> 
> static inline void freezer_count(void)
> {
> current->flags &= ~PF_FREEZER_SKIP;
> /*
>  * If freezing is in progress, the following paired with smp_mb()
>  * in freezer_should_skip() ensures that either we see %true
>  * freezing() or freezer_should_skip() sees !PF_FREEZER_SKIP.
>  */
> smp_mb();
> try_to_freeze();
> }
> 
> as far as I understand this code, freezable_schedule() avoids blocking the
> freezer during the schedule() call, but in the end try_to_freeze() is still
> called so the result is the same, right?
> I wonder why wait_event_freezable is not calling freezable_schedule().

It could be something subtle in my view. freezable_schedule() actually makes
the freezer code not send a wake up to the sleeping task if a freeze happens,
because the PF_FREEZER_SKIP flag is set, as you pointed.

Whereas wait_event_freezable() which uses try_to_freeze() does not seem to have
this behavior and the task will enter the freezer. So I'm a bit skeptical if
your API will behave as expected (or at least consistently with other wait
APIs).

> That being said, I am not sure that the try_to_freeze() call does anything
> in the vsoc case because there is no call to set_freezable() so the thread
> still has PF_NOFREEZE...

I traced this, and PF_NOFREEZE is not set by default for tasks.

thanks,

 - Joel
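
For readers comparing the two idioms discussed in this thread, a minimal
side-by-side sketch (wq and cond are placeholders):

#include <linux/freezer.h>
#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(wq);
static bool cond;

/* Idiom A (what __wait_event_freezable() expands to before this series):
 * plain schedule(), then try_to_freeze() on every wakeup -- the freezer has
 * to wake the task so it can freeze itself. */
static void sleep_then_freeze(void)
{
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE);
		if (cond)
			break;
		schedule();
		try_to_freeze();
	}
	finish_wait(&wq, &wait);
}

/* Idiom B (freezable_schedule()): PF_FREEZER_SKIP is set around schedule(),
 * so the freezer skips the task; try_to_freeze() runs in freezer_count()
 * once the task wakes for its own reasons. */
static void sleep_skipping_freezer(void)
{
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE);
		if (cond)
			break;
		freezable_schedule();
	}
	finish_wait(&wq, &wait);
}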



Re: [PATCH] sched/wait: introduce wait_event_freezable_hrtimeout

2019-01-18 Thread Joel Fernandes
Hi Hugo,

On Thu, Jan 17, 2019 at 11:41:35PM +0100, Hugo Lefeuvre wrote:
> introduce wait_event_freezable_hrtimeout, an interruptible and freezable
> version of wait_event_hrtimeout.
> 
> simplify handle_vsoc_cond_wait (drivers/staging/android/vsoc.c) using this
> newly added helper and remove useless includes.

I believe these should be 2 patches. In the first patch you introduce the
new API, in the second one you would simplify the VSOC driver.

In fact, in one part of the patch you are using wait_event_freezable which
already exists so that's unrelated to the new API you are adding.

More comments below:

> Signed-off-by: Hugo Lefeuvre 
> ---
>  drivers/staging/android/vsoc.c | 69 +-
>  include/linux/wait.h   | 25 ++--
>  2 files changed, 31 insertions(+), 63 deletions(-)
> 
> diff --git a/drivers/staging/android/vsoc.c b/drivers/staging/android/vsoc.c
> index 22571abcaa4e..7e620e69f39d 100644
> --- a/drivers/staging/android/vsoc.c
> +++ b/drivers/staging/android/vsoc.c
> @@ -17,7 +17,6 @@
>   */
>  
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -29,7 +28,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include "uapi/vsoc_shm.h"
> @@ -401,7 +399,6 @@ static int handle_vsoc_cond_wait(struct file *filp, 
> struct vsoc_cond_wait *arg)
>   DEFINE_WAIT(wait);
>   u32 region_number = iminor(file_inode(filp));
>   struct vsoc_region_data *data = vsoc_dev.regions_data + region_number;
> - struct hrtimer_sleeper timeout, *to = NULL;
>   int ret = 0;
>   struct vsoc_device_region *region_p = vsoc_region_from_filep(filp);
>   atomic_t *address = NULL;
> @@ -420,69 +417,23 @@ static int handle_vsoc_cond_wait(struct file *filp, 
> struct vsoc_cond_wait *arg)
>   /* Ensure that the type of wait is valid */
>   switch (arg->wait_type) {
>   case VSOC_WAIT_IF_EQUAL:
> + ret = wait_event_freezable(data->futex_wait_queue,
> +arg->wakes++ &&
> +atomic_read(address) != arg->value);
>   break;
>   case VSOC_WAIT_IF_EQUAL_TIMEOUT:
> - to = &timeout;
> - break;
> - default:
> - return -EINVAL;
> - }
> -
> - if (to) {
> - /* Copy the user-supplied timesec into the kernel structure.
> -  * We do things this way to flatten differences between 32 bit
> -  * and 64 bit timespecs.
> -  */
>   if (arg->wake_time_nsec >= NSEC_PER_SEC)
>   return -EINVAL;
>   wake_time = ktime_set(arg->wake_time_sec, arg->wake_time_nsec);
> -
> - hrtimer_init_on_stack(&to->timer, CLOCK_MONOTONIC,
> -   HRTIMER_MODE_ABS);
> - hrtimer_set_expires_range_ns(&to->timer, wake_time,
> -  current->timer_slack_ns);
> -
> - hrtimer_init_sleeper(to, current);
> + ret = wait_event_freezable_hrtimeout(data->futex_wait_queue,
> +  arg->wakes++ &&
> +  atomic_read(address) != 
> arg->value,
> +  wake_time);
> + break;
> + default:
> + return -EINVAL;
>   }
>  
> - while (1) {
> - prepare_to_wait(&data->futex_wait_queue, &wait,
> - TASK_INTERRUPTIBLE);
> - /*
> -  * Check the sentinel value after prepare_to_wait. If the value
> -  * changes after this check the writer will call signal,
> -  * changing the task state from INTERRUPTIBLE to RUNNING. That
> -  * will ensure that schedule() will eventually schedule this
> -  * task.
> -  */
> - if (atomic_read(address) != arg->value) {
> - ret = 0;
> - break;
> - }
> - if (to) {
> - hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
> - if (likely(to->task))
> - freezable_schedule();
> - hrtimer_cancel(&to->timer);
> - if (!to->task) {
> - ret = -ETIMEDOUT;
> - break;
> - }
> - } else {
> - freezable_schedule();
> - }
> - /* Count the number of times that we woke up. This is useful
> -  * for unit testing.
> -  */
> - ++arg->wakes;
> - if (signal_pending(current)) {
> - ret = -EINTR;
> - break;
> - }
> - }
> - finish_wait(&data->futex_wait_queue, &wait);
> - if (to)
> - destroy_hrti

Re: [PATCH v3] binder: create node flag to request sender's security context

2019-01-14 Thread Joel Fernandes
On Mon, Jan 14, 2019 at 10:50:24AM -0800, Todd Kjos wrote:
> On Mon, Jan 14, 2019 at 10:33 AM Joel Fernandes  wrote:
> >
> > On Mon, Jan 14, 2019 at 09:10:21AM -0800, Todd Kjos wrote:
> > > To allow servers to verify client identity, allow a node
> > > flag to be set that causes the sender's security context
> > > to be delivered with the transaction. The BR_TRANSACTION
> > > command is extended in BR_TRANSACTION_SEC_CTX to
> > > contain a pointer to the security context string.
> > >
> > > Signed-off-by: Todd Kjos 
> > > ---
> > > v2: fix 32-bit build warning
> > > v3: fix smatch warning on unitialized struct element
> > >
> > >  drivers/android/binder.c| 106 ++--
> > >  include/uapi/linux/android/binder.h |  19 +
> > >  2 files changed, 102 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> > > index cdfc87629efb8..5f6ef5e63b91e 100644
> > > --- a/drivers/android/binder.c
> > > +++ b/drivers/android/binder.c
> > > @@ -329,6 +329,8 @@ struct binder_error {
> > >   *(invariant after initialized)
> > >   * @min_priority: minimum scheduling priority
> > >   *(invariant after initialized)
> > > + * @txn_security_ctx: require sender's security context
> > > + *(invariant after initialized)
> > >   * @async_todo:   list of async work items
> > >   *(protected by @proc->inner_lock)
> > >   *
> > > @@ -365,6 +367,7 @@ struct binder_node {
> > >* invariant after initialization
> > >*/
> > >   u8 accept_fds:1;
> > > + u8 txn_security_ctx:1;
> > >   u8 min_priority;
> > >   };
> > >   bool has_async_transaction;
> > > @@ -615,6 +618,7 @@ struct binder_transaction {
> > >   longsaved_priority;
> > >   kuid_t  sender_euid;
> > >   struct list_head fd_fixups;
> > > + binder_uintptr_t security_ctx;
> > >   /**
> > >* @lock:  protects @from, @to_proc, and @to_thread
> > >*
> > > @@ -1152,6 +1156,7 @@ static struct binder_node *binder_init_node_ilocked(
> > >   node->work.type = BINDER_WORK_NODE;
> > >   node->min_priority = flags & FLAT_BINDER_FLAG_PRIORITY_MASK;
> > >   node->accept_fds = !!(flags & FLAT_BINDER_FLAG_ACCEPTS_FDS);
> > > + node->txn_security_ctx = !!(flags & 
> > > FLAT_BINDER_FLAG_TXN_SECURITY_CTX);
> > >   spin_lock_init(&node->lock);
> > >   INIT_LIST_HEAD(&node->work.entry);
> > >   INIT_LIST_HEAD(&node->async_todo);
> > > @@ -2778,6 +2783,8 @@ static void binder_transaction(struct binder_proc 
> > > *proc,
> > >   binder_size_t last_fixup_min_off = 0;
> > >   struct binder_context *context = proc->context;
> > >   int t_debug_id = atomic_inc_return(&binder_last_id);
> > > + char *secctx = NULL;
> > > + u32 secctx_sz = 0;
> > >
> > >   e = binder_transaction_log_add(&binder_transaction_log);
> > >   e->debug_id = t_debug_id;
> > > @@ -3020,6 +3027,20 @@ static void binder_transaction(struct binder_proc 
> > > *proc,
> > >   t->flags = tr->flags;
> > >   t->priority = task_nice(current);
> > >
> > > + if (target_node && target_node->txn_security_ctx) {
> > > + u32 secid;
> > > +
> > > + security_task_getsecid(proc->tsk, &secid);
> > > + ret = security_secid_to_secctx(secid, &secctx, &secctx_sz);
> > > + if (ret) {
> > > + return_error = BR_FAILED_REPLY;
> > > + return_error_param = ret;
> > > + return_error_line = __LINE__;
> > > + goto err_get_secctx_failed;
> > > + }
> > > + extra_buffers_size += ALIGN(secctx_sz, sizeof(u64));
> > > + }
> > > +
> > >   trace_binder_transaction(reply, t, target_node);
> > >
> > >   t->buffer = binder_alloc_new_buf(&target_proc->alloc, tr->data_size,
> > > @@ -3036,6 +3057,19 @@ static void binder_transaction(struct binder_p

Re: [PATCH v3] binder: create node flag to request sender's security context

2019-01-14 Thread Joel Fernandes
On Mon, Jan 14, 2019 at 09:10:21AM -0800, Todd Kjos wrote:
> To allow servers to verify client identity, allow a node
> flag to be set that causes the sender's security context
> to be delivered with the transaction. The BR_TRANSACTION
> command is extended in BR_TRANSACTION_SEC_CTX to
> contain a pointer to the security context string.
> 
> Signed-off-by: Todd Kjos 
> ---
> v2: fix 32-bit build warning
> v3: fix smatch warning on uninitialized struct element
> 
>  drivers/android/binder.c| 106 ++--
>  include/uapi/linux/android/binder.h |  19 +
>  2 files changed, 102 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> index cdfc87629efb8..5f6ef5e63b91e 100644
> --- a/drivers/android/binder.c
> +++ b/drivers/android/binder.c
> @@ -329,6 +329,8 @@ struct binder_error {
>   *(invariant after initialized)
>   * @min_priority: minimum scheduling priority
>   *(invariant after initialized)
> + * @txn_security_ctx: require sender's security context
> + *(invariant after initialized)
>   * @async_todo:   list of async work items
>   *(protected by @proc->inner_lock)
>   *
> @@ -365,6 +367,7 @@ struct binder_node {
>* invariant after initialization
>*/
>   u8 accept_fds:1;
> + u8 txn_security_ctx:1;
>   u8 min_priority;
>   };
>   bool has_async_transaction;
> @@ -615,6 +618,7 @@ struct binder_transaction {
>   longsaved_priority;
>   kuid_t  sender_euid;
>   struct list_head fd_fixups;
> + binder_uintptr_t security_ctx;
>   /**
>* @lock:  protects @from, @to_proc, and @to_thread
>*
> @@ -1152,6 +1156,7 @@ static struct binder_node *binder_init_node_ilocked(
>   node->work.type = BINDER_WORK_NODE;
>   node->min_priority = flags & FLAT_BINDER_FLAG_PRIORITY_MASK;
>   node->accept_fds = !!(flags & FLAT_BINDER_FLAG_ACCEPTS_FDS);
> + node->txn_security_ctx = !!(flags & FLAT_BINDER_FLAG_TXN_SECURITY_CTX);
>   spin_lock_init(&node->lock);
>   INIT_LIST_HEAD(&node->work.entry);
>   INIT_LIST_HEAD(&node->async_todo);
> @@ -2778,6 +2783,8 @@ static void binder_transaction(struct binder_proc *proc,
>   binder_size_t last_fixup_min_off = 0;
>   struct binder_context *context = proc->context;
>   int t_debug_id = atomic_inc_return(&binder_last_id);
> + char *secctx = NULL;
> + u32 secctx_sz = 0;
>  
>   e = binder_transaction_log_add(&binder_transaction_log);
>   e->debug_id = t_debug_id;
> @@ -3020,6 +3027,20 @@ static void binder_transaction(struct binder_proc 
> *proc,
>   t->flags = tr->flags;
>   t->priority = task_nice(current);
>  
> + if (target_node && target_node->txn_security_ctx) {
> + u32 secid;
> +
> + security_task_getsecid(proc->tsk, &secid);
> + ret = security_secid_to_secctx(secid, &secctx, &secctx_sz);
> + if (ret) {
> + return_error = BR_FAILED_REPLY;
> + return_error_param = ret;
> + return_error_line = __LINE__;
> + goto err_get_secctx_failed;
> + }
> + extra_buffers_size += ALIGN(secctx_sz, sizeof(u64));
> + }
> +
>   trace_binder_transaction(reply, t, target_node);
>  
>   t->buffer = binder_alloc_new_buf(&target_proc->alloc, tr->data_size,
> @@ -3036,6 +3057,19 @@ static void binder_transaction(struct binder_proc 
> *proc,
>   t->buffer = NULL;
>   goto err_binder_alloc_buf_failed;
>   }
> + if (secctx) {
> + size_t buf_offset = ALIGN(tr->data_size, sizeof(void *)) +
> + ALIGN(tr->offsets_size, sizeof(void *)) +
> + ALIGN(extra_buffers_size, sizeof(void *)) -
> + ALIGN(secctx_sz, sizeof(u64));
> + char *kptr = t->buffer->data + buf_offset;
> +
> + t->security_ctx = (uintptr_t)kptr +
> + binder_alloc_get_user_buffer_offset(&target_proc->alloc);
> + memcpy(kptr, secctx, secctx_sz);

Just for my clarification, instead of storing the string in the transaction
buffer, would it not be better to store the security context id in
t->security_ctx, and then do the conversion to string later, during
binder_thread_read? Then some space will also be saved in the transaction
buffer?
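
Roughly the alternative being suggested, as an untested sketch (the helpers and field handling here are hypothetical illustrations, not the actual patch): keep only the u32 secid with the transaction, and resolve it to a string when the reader picks the transaction up.

#include <linux/security.h>
#include <linux/sched.h>
#include <linux/uaccess.h>

/* At transaction time: record only the 32-bit secid of the sender. */
static void record_sender_secid(struct task_struct *sender, u32 *out_secid)
{
        security_task_getsecid(sender, out_secid);
}

/* At read time: resolve the secid to a string and copy it straight to the
 * reader, so the string never has to live in the transaction buffer. */
static int copy_secctx_to_reader(u32 secid, char __user *ubuf, u32 ubuf_len)
{
        char *secctx;
        u32 secctx_sz;
        int ret;

        ret = security_secid_to_secctx(secid, &secctx, &secctx_sz);
        if (ret)
                return ret;

        ret = 0;
        if (secctx_sz > ubuf_len || copy_to_user(ubuf, secctx, secctx_sz))
                ret = -EFAULT;

        security_release_secctx(secctx, secctx_sz);
        return ret;
}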

thanks,

 - Joel




Re: [PATCH] ANDROID: doc: Fix spelling

2018-12-06 Thread Joel Fernandes
On Thu, Dec 06, 2018 at 03:52:43PM +, Daniel Bovensiepen wrote:
> Fixed spelling in comment section.
> 
> Signed-off-by: Daniel Bovensiepen 

Acked-by: Joel Fernandes (Google) 


> ---
>  drivers/staging/android/ashmem.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/staging/android/ashmem.c 
> b/drivers/staging/android/ashmem.c
> index a880b5c6c6c3..90a8a9f1ac7d 100644
> --- a/drivers/staging/android/ashmem.c
> +++ b/drivers/staging/android/ashmem.c
> @@ -195,7 +195,7 @@ static int range_alloc(struct ashmem_area *asma,
>  }
>  
>  /**
> - * range_del() - Deletes and dealloctes an ashmem_range structure
> + * range_del() - Deletes and deallocates an ashmem_range structure
>   * @range:The associated ashmem_range that has previously been allocated
>   */
>  static void range_del(struct ashmem_range *range)
> @@ -521,7 +521,7 @@ static int set_name(struct ashmem_area *asma, void __user 
> *name)
>* an data abort which would try to access mmap_sem. If another
>* thread has invoked ashmem_mmap then it will be holding the
>* semaphore and will be waiting for ashmem_mutex, there by leading to
> -  * deadlock. We'll release the mutex  and take the name to a local
> +  * deadlock. We'll release the mutex and take the name to a local
>* variable that does not need protection and later copy the local
>* variable to the structure member with lock held.
>*/
> -- 
> 2.17.1
> 
> 


Re: [PATCH 03/21] media: davinci_vpfe: fix vpfe_ipipe_init() error handling

2018-10-11 Thread Joel Fernandes
On Mon, Oct 08, 2018 at 09:46:01PM -0700, Joel Fernandes wrote:
> On Fri, Apr 06, 2018 at 10:23:04AM -0400, Mauro Carvalho Chehab wrote:
> > As warned:
> > drivers/staging/media/davinci_vpfe/dm365_ipipe.c:1834 vpfe_ipipe_init() 
> > error: we previously assumed 'res' could be null (see line 1797)
> > 
> > There's something wrong at vpfe_ipipe_init():
> > 
> > 1) it caches the resource_size() from the first region
> >    and reuses it for the second region;
> > 
> > 2) the "res" var is overridden 3 times;
> > 
> > 3) at free logic, it assumes that "res->start" is not
> >    overridden by platform_get_resource(pdev, IORESOURCE_MEM, 6),
> >but that's not true, as it can even be NULL there.
> > 
> > This patch fixes the above issues by:
> > 
> > a) store the resources used by release_mem_region() in
> >a separate var;
> > 
> > b) stop caching resource_size(), using the function where
> >needed.
> > 
> > Signed-off-by: Mauro Carvalho Chehab 
> 
> I ran coccicheck on a 4.14.74 stable kernel and noticed that 'res' can be
> NULL in vpfe_ipipe_init. It looks like this patch is not included in the 4.14
> stable series. Can this patch be applied? I applied it myself and it applies
> cleanly, but I have no way to test it.
> 
> That 'res->start' error_release could end up a NULL pointer deref.

Should this patch go to 4.14 stable? It seems straightforward and worth it to
prevent the possible NULL pointer deref issue.

 - Joel



Re: [PATCH 03/21] media: davinci_vpfe: fix vpfe_ipipe_init() error handling

2018-10-08 Thread Joel Fernandes
On Fri, Apr 06, 2018 at 10:23:04AM -0400, Mauro Carvalho Chehab wrote:
> As warned:
>   drivers/staging/media/davinci_vpfe/dm365_ipipe.c:1834 vpfe_ipipe_init() 
> error: we previously assumed 'res' could be null (see line 1797)
> 
> There's something wrong at vpfe_ipipe_init():
> 
> 1) it caches the resource_size() from the first region
>    and reuses it for the second region;
> 
> 2) the "res" var is overridden 3 times;
> 
> 3) at free logic, it assumes that "res->start" is not
>    overridden by platform_get_resource(pdev, IORESOURCE_MEM, 6),
>but that's not true, as it can even be NULL there.
> 
> This patch fixes the above issues by:
> 
> a) store the resources used by release_mem_region() in
>a separate var;
> 
> b) stop caching resource_size(), using the function where
>needed.
> 
> Signed-off-by: Mauro Carvalho Chehab 

I ran coccicheck on a 4.14.74 stable kernel and noticed that 'res' can be
NULL in vpfe_ipipe_init. It looks like this patch is not included in the 4.14
stable series. Can this patch be applied? I applied it myself and it applies
cleanly, but I have no way to test it.

That 'res->start' error_release could end up a NULL pointer deref.
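
For reference, the unwind pattern that points a) and b) in the changelog describe (keep the requested region in its own variable, and call resource_size() where needed instead of caching it) looks roughly like the following; this is an illustrative sketch, not the actual dm365_ipipe code:

#include <linux/platform_device.h>
#include <linux/ioport.h>
#include <linux/io.h>

static int example_probe(struct platform_device *pdev)
{
        struct resource *res, *region;  /* lookup result vs. requested region */
        void __iomem *base;

        res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
        if (!res)
                return -ENOENT;

        region = request_mem_region(res->start, resource_size(res), pdev->name);
        if (!region)
                return -EBUSY;

        base = ioremap(res->start, resource_size(res));
        if (!base)
                goto err_release;

        /* 'res' can now be reused for further platform_get_resource() lookups
         * without breaking the unwind path below. */
        return 0;

err_release:
        /* Release exactly what was requested, recomputing the size. */
        release_mem_region(region->start, resource_size(region));
        return -ENOMEM;
}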

 - Joel

 


Re: [PATCH] staging: android: ashmem: Shrink directly through shmem_fallocate

2018-07-16 Thread Joel Fernandes
On Mon, Jul 16, 2018 at 11:48:51AM +0200, Greg Kroah-Hartman wrote:
> On Fri, Jul 06, 2018 at 02:44:16PM -0700, Joel Fernandes wrote:
> > From: Tobias Lindskog 
> > 
> > When ashmem_shrink is called from direct reclaim on a user thread, a
> > call to do_fallocate will check for permissions against the security
> > policy of that user thread.  It can thus fail by chance if called on a
> > thread that isn't permitted to modify the relevant ashmem areas.
> > 
> > Because we know that we have a shmem file underneath, call the shmem
> > implementation of fallocate directly instead of going through the
> > user-space interface for fallocate.
> > 
> > Bug: 21951515
> 
> What does this "Bug:" line mean to any of us?  :)

Yeah I should have been more careful when sending Tobias's patch ;-)

> I'll go delete it, and I fixed up the subject to have 'staging: android'
> in it as well.  Please do that next time.

Sure will do, thanks for fixing it up this time. Sorry about that.

Regards,

-Joel



[PATCH] ashmem: Shrink directly through shmem_fallocate

2018-07-06 Thread Joel Fernandes
From: Tobias Lindskog 

When ashmem_shrink is called from direct reclaim on a user thread, a
call to do_fallocate will check for permissions against the security
policy of that user thread.  It can thus fail by chance if called on a
thread that isn't permitted to modify the relevant ashmem areas.

Because we know that we have a shmem file underneath, call the shmem
implementation of fallocate directly instead of going through the
user-space interface for fallocate.

Bug: 21951515
Signed-off-by: Tobias Lindskog 
Signed-off-by: Jeff Vander Stoep 
Signed-off-by: Joel Fernandes (Google) 
---
 drivers/staging/android/ashmem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index a1a0025b59e0..23ff9ee80386 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -444,9 +444,9 @@ ashmem_shrink_scan(struct shrinker *shrink, struct 
shrink_control *sc)
loff_t start = range->pgstart * PAGE_SIZE;
loff_t end = (range->pgend + 1) * PAGE_SIZE;
 
-   vfs_fallocate(range->asma->file,
- FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
- start, end - start);
+   range->asma->file->f_op->fallocate(range->asma->file,
+   FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+   start, end - start);
range->purged = ASHMEM_WAS_PURGED;
lru_del(range);
 
-- 
2.18.0.203.gfac676dfb9-goog



Re: [PATCH 2/2 v2] staging: android: ashmem: Fix mmap size validation

2018-06-20 Thread Joel Fernandes
On Wed, Jun 20, 2018 at 02:21:57PM -0700, Daniel Colascione wrote:
> On Tue, Jun 19, 2018 at 9:32 PM, Joel Fernandes  
> wrote:
> > On Tue, Jun 19, 2018 at 05:57:35PM -0700, Alistair Strachan wrote:
> > > The ashmem driver did not check that the size/offset of the vma passed
> > > to its .mmap() function was not larger than the ashmem object being
> > > mapped. This could cause mmap() to succeed, even though accessing parts
> > > of the mapping would later fail with a segmentation fault.
> > >
> > > Ensure an error is returned by the ashmem_mmap() function if the vma
> > > size is larger than the ashmem object size. This enables safer handling
> > > of the problem in userspace.
> 
> Are we sure that this approach is a good idea? You can over-mmap
> regular files. I don't like the idea of creating special mmap

ashmem isn't a regular file.

> semantics for files that happen to be ashmem files. Ashmem users can
> detect size-changing shenanigans with ASHMEM_GET_SIZE after mmap,
> since an ashmem file's size can't change after an mmap call succeeds.

But it is misleading to allow it. If the mmap succeeds, any writes to the
extra area are in fact not part of ashmem and will not be shared. If instead
the mmap fails up front, then we're telling the user up front that they
screwed up and should do something about it. I would much rather have the
mmap fail than allow other issues to occur later.

Also, if you look at the kernel sources, there are dozens of drivers that
check for the correct VMA size in their mmap handler and fail if it isn't
sized correctly.
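
A typical form of that check in a .mmap handler, sketched against a hypothetical driver object (the example_obj type and its size field are stand-ins, not any particular driver's code):

#include <linux/fs.h>
#include <linux/mm.h>

struct example_obj {
        size_t size;            /* size of the backing object */
};

static int example_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct example_obj *obj = file->private_data;

        /* Refuse mappings that extend past the backing object. */
        if (vma->vm_end - vma->vm_start > PAGE_ALIGN(obj->size))
                return -EINVAL;

        /* ... set up the mapping as usual ... */
        return 0;
}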

thanks!

 - Joel



Re: [PATCH 1/2] staging: android: ashmem: Remove use of unlikely()

2018-06-19 Thread Joel Fernandes
On Tue, Jun 19, 2018 at 05:57:34PM -0700, Alistair Strachan wrote:
> There is no speed difference, and it makes the code harder to read.
> 
> Cc: Greg Kroah-Hartman 
> Cc: Arve Hjønnevåg 
> Cc: Todd Kjos 
> Cc: Martijn Coenen 
> Cc: de...@driverdev.osuosl.org
> Cc: linux-ker...@vger.kernel.org
> Cc: kernel-t...@android.com
> Cc: Joel Fernandes 
> Suggested-by: Greg Kroah-Hartman 
> Signed-off-by: Alistair Strachan 

Acked-by: Joel Fernandes (Google) 

thanks,

- Joel



Re: [PATCH 2/2 v2] staging: android: ashmem: Fix mmap size validation

2018-06-19 Thread Joel Fernandes
On Tue, Jun 19, 2018 at 05:57:35PM -0700, Alistair Strachan wrote:
> The ashmem driver did not check that the size/offset of the vma passed
> to its .mmap() function was not larger than the ashmem object being
> mapped. This could cause mmap() to succeed, even though accessing parts
> of the mapping would later fail with a segmentation fault.
> 
> Ensure an error is returned by the ashmem_mmap() function if the vma
> size is larger than the ashmem object size. This enables safer handling
> of the problem in userspace.
> 
> Cc: Greg Kroah-Hartman 
> Cc: Arve Hjønnevåg 
> Cc: Todd Kjos 
> Cc: Martijn Coenen 
> Cc: de...@driverdev.osuosl.org
> Cc: linux-ker...@vger.kernel.org
> Cc: kernel-t...@android.com
> Cc: Joel Fernandes 
> Signed-off-by: Alistair Strachan 
> ---
> v2: Removed unnecessary use of unlikely() macro
> 
>  drivers/staging/android/ashmem.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/staging/android/ashmem.c 
> b/drivers/staging/android/ashmem.c
> index c6386e4f5c9b..e392358ec244 100644
> --- a/drivers/staging/android/ashmem.c
> +++ b/drivers/staging/android/ashmem.c
> @@ -366,6 +366,12 @@ static int ashmem_mmap(struct file *file, struct 
> vm_area_struct *vma)
>   goto out;
>   }
>  
> + /* requested mapping size larger than object size */
> + if (vma->vm_end - vma->vm_start > PAGE_ALIGN(asma->size)) {
> + ret = -EINVAL;
> + goto out;
> + }
> +

Acked-by: Joel Fernandes (Google) 

thanks,

 - Joel



Re: [PATCH] staging: android: ashmem: Fix possible deadlock in ashmem_ioctl

2018-03-19 Thread Joel Fernandes (Google)
On Tue, Feb 27, 2018 at 10:59 PM, Yisheng Xie  wrote:
> ashmem_mutex may create a chain of dependencies like:
>
> CPU0                                     CPU1
>  mmap syscall                             ioctl syscall
>  -> mmap_sem (acquired)                   -> ashmem_ioctl
>  -> ashmem_mmap                           -> ashmem_mutex (acquired)
>     -> ashmem_mutex (try to acquire)      -> copy_from_user
>                                              -> mmap_sem (try to acquire)
>
> There is a lock ordering problem between mmap_sem and ashmem_mutex causing
> a lockdep splat[1] during a syzkaller test. This patch fixes the problem
> by moving copy_from_user out of ashmem_mutex.
>
> [1] https://www.spinics.net/lists/kernel/msg2733200.html
>
> Fixes: ce8a3a9e76d0 (staging: android: ashmem: Fix a race condition in pin 
> ioctls)
> Reported-by: syzbot+d7a918a7a8e1c952b...@syzkaller.appspotmail.com
> Signed-off-by: Yisheng Xie 

Greg,
Could you take this patch for the stable trees? I do see it in staging
already. I couldn't find it in stable so wanted to bring it to your
attention. If you are already aware of it, please ignore my note.

Thanks,
- Joel
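
For context, the reordering described in the changelog boils down to doing the user copy before taking ashmem_mutex, so that a page fault during the copy can never wait on mmap_sem while the mutex is held. A condensed sketch of that shape (not the exact upstream diff; the pin helper below is hypothetical, and ashmem_area/ashmem_mutex are the driver's own symbols):

#include <linux/mutex.h>
#include <linux/uaccess.h>

static long example_pin_ioctl(struct ashmem_area *asma, void __user *p)
{
        struct ashmem_pin pin;
        long ret;

        /* Copy from userspace *before* taking ashmem_mutex: the fault path
         * may take mmap_sem, which must never nest inside the mutex. */
        if (copy_from_user(&pin, p, sizeof(pin)))
                return -EFAULT;

        mutex_lock(&ashmem_mutex);
        ret = do_pin_work(asma, pin.offset, pin.len);   /* hypothetical helper */
        mutex_unlock(&ashmem_mutex);

        return ret;
}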